Reading Parquet Files from a Java Application

Recently I came across the requirement to read a Parquet file into a Java application, and I found that doing so is neither well documented nor easy. As a consequence I wrote this short tutorial. The first task is to add your Maven dependencies.

<dependencies>
    <dependency>
        <groupId>org.apache.parquet</groupId>
        <artifactId>parquet-hadoop</artifactId>
        <version>1.9.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.0</version>
    </dependency>
</dependencies>

Writing the Java application is easy once you know how to do it. Instead of the AvroParquetReader or ParquetReader classes that frequently turn up when you search for a way to read Parquet files, use the class ParquetFileReader. The basic setup is to read all row groups and then read all groups recursively.
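The setup described above can be sketched roughly as follows. This is a minimal sketch against the dependency versions listed earlier, not a definitive implementation: the class name ParquetDump, the helper printGroup and the default path example.parquet are hypothetical names chosen for illustration. It uses the example Group model that ships with Parquet (GroupRecordConverter) to materialise records, and it simply prints every value. Newer Parquet releases may prefer a different way of opening the file, so check the API of the version you actually use.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.MessageType;

// Hypothetical class name, chosen for this sketch only.
public class ParquetDump {

    public static void main(String[] args) throws IOException {
        // Assumption: the file path is the first argument; "example.parquet" is a placeholder.
        Path file = new Path(args.length > 0 ? args[0] : "example.parquet");

        try (ParquetFileReader reader = ParquetFileReader.open(new Configuration(), file)) {
            // The schema is stored in the file footer.
            MessageType schema = reader.getFooter().getFileMetaData().getSchema();
            MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);

            // Step 1: iterate over all row groups; null signals the end of the file.
            PageReadStore rowGroup;
            while ((rowGroup = reader.readNextRowGroup()) != null) {
                RecordReader<Group> records =
                        columnIO.getRecordReader(rowGroup, new GroupRecordConverter(schema));
                long rowCount = rowGroup.getRowCount();
                for (long row = 0; row < rowCount; row++) {
                    // Step 2: walk each record (itself a group) recursively.
                    printGroup(records.read(), "");
                }
            }
        }
    }

    // Hypothetical helper: prints primitive values and descends into nested groups.
    static void printGroup(Group group, String indent) {
        GroupType type = group.getType();
        for (int field = 0; field < type.getFieldCount(); field++) {
            String name = type.getFieldName(field);
            // Optional fields may occur zero times, repeated fields many times.
            int count = group.getFieldRepetitionCount(field);
            for (int i = 0; i < count; i++) {
                if (type.getType(field).isPrimitive()) {
                    System.out.println(indent + name + " = " + group.getValueToString(field, i));
                } else {
                    System.out.println(indent + name + ":");
                    printGroup(group.getGroup(field, i), indent + "  ");
                }
            }
        }
    }
}
```

Printing is only a stand-in here. In a real application the body of printGroup is where you would map each field onto your own domain objects, using the typed getters of Group rather than the string form.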
