Reading Parquet Files from a Java Application

Recently I came across the requirement to read a Parquet file into a Java application, and I found that doing so is neither well documented nor easy. So I wrote this short tutorial. The first task is to add the Maven dependencies.

<dependencies>
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.9.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.0</version>
  </dependency>
</dependencies>

Writing the Java application is easy once you know how to do it. Instead of the AvroParquetReader or ParquetReader classes that you find frequently when searching for a way to read Parquet files, use the ParquetFileReader class. The basic setup is to iterate over all row groups of the file and then read every record (group) within each row group.

package de.jofre.test;
import java.io.IOException;
 
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;
 
public class Main {
 
  private static Path path = new Path("file:///C:/myfile.snappy.parquet");
 
  private static void printGroup(Group g) {
    int fieldCount = g.getType().getFieldCount();
    for (int field = 0; field < fieldCount; field++) {
      int valueCount = g.getFieldRepetitionCount(field);

      Type fieldType = g.getType().getType(field);
      String fieldName = fieldType.getName();

      // Print every value of this field (fields may be repeated)
      for (int index = 0; index < valueCount; index++) {
        if (fieldType.isPrimitive()) {
          System.out.println(fieldName + " " + g.getValueToString(field, index));
        }
      }
    }
    System.out.println("");
  }
 
  public static void main(String[] args) throws IllegalArgumentException {
 
    Configuration conf = new Configuration();
 
    try {
      // Read the footer (file metadata) to obtain the schema
      ParquetMetadata readFooter = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
      MessageType schema = readFooter.getFileMetaData().getSchema();
      ParquetFileReader r = new ParquetFileReader(conf, path, readFooter);

      PageReadStore pages = null;
      try {
        // Iterate over all row groups in the file
        while (null != (pages = r.readNextRowGroup())) {
          final long rows = pages.getRowCount();
          System.out.println("Number of rows: " + rows);

          // Assemble and read the records (groups) of the current row group
          final MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
          final RecordReader<Group> recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
          for (int i = 0; i < rows; i++) {
            final Group g = recordReader.read();
            printGroup(g);
 
            // TODO Compare to System.out.println(g);
          }
        }
      } finally {
        r.close();
      }
    } catch (IOException e) {
      System.out.println("Error reading parquet file.");
      e.printStackTrace();
    }
  }
}
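
The printGroup method above only prints primitive fields; non-primitive fields (nested groups such as lists or structs) are silently skipped. If your files contain nested data, a recursive variant along the following lines should work. This is only a sketch and not part of the program above: the method name printGroupRecursive and the indent parameter are my own additions, and the only extra API it relies on is Group.getGroup(field, index).

  // Sketch: like printGroup, but descends into nested groups instead of skipping them.
  // The indent parameter only serves to make the nesting visible in the output.
  private static void printGroupRecursive(Group g, String indent) {
    int fieldCount = g.getType().getFieldCount();
    for (int field = 0; field < fieldCount; field++) {
      int valueCount = g.getFieldRepetitionCount(field);
      Type fieldType = g.getType().getType(field);
      String fieldName = fieldType.getName();

      for (int index = 0; index < valueCount; index++) {
        if (fieldType.isPrimitive()) {
          // Primitive value: print it directly, as before
          System.out.println(indent + fieldName + " " + g.getValueToString(field, index));
        } else {
          // Nested group: recurse one level deeper
          System.out.println(indent + fieldName + ":");
          printGroupRecursive(g.getGroup(field, index), indent + "  ");
        }
      }
    }
  }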
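
If you need to process a whole folder of Parquet files instead of a single file, one option is to list the directory with Hadoop's FileSystem API and apply the same row-group loop to every file. Again, this is only a rough sketch: readSingleFile is a placeholder for the code shown in main above, not a method defined in this post, and it additionally needs imports for org.apache.hadoop.fs.FileSystem and org.apache.hadoop.fs.FileStatus.

  // Sketch: read every Parquet file in a directory.
  // readSingleFile is a placeholder for the row-group loop from main().
  private static void readAllParquetFilesIn(String dir, Configuration conf) throws IOException {
    Path dirPath = new Path(dir);
    FileSystem fs = dirPath.getFileSystem(conf);
    for (FileStatus status : fs.listStatus(dirPath)) {
      Path file = status.getPath();
      // Skip sub-directories and non-Parquet files such as _SUCCESS markers
      if (status.isFile() && file.getName().endsWith(".parquet")) {
        System.out.println("Reading " + file);
        readSingleFile(conf, file); // placeholder, see main() above
      }
    }
  }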

11 thoughts on “Reading Parquet Files from a Java Application”

    1. Nice one – I’ve been tied up in knots trying to extract data from a parquet for ages. I got this working in a few minutes 🙂

      Just a couple of small points.

      1. As ‘MATTIA CASOTTO’ says, the ‘less than’ symbol is displayed as an HTML escape sequence.
      2. recordReader.read() returns an object – you’ll need to cast it to a Group

      It’s not a big deal, but the code won’t work as-is

  1. This was helpful. Thanks very much. The output here from recordReader.read() has an indented structure if there are arrays or objects. Could it use a JSON formatter instead of indentation? Thanks

  2. I was redirected here from your post at stackoverflow. Your code works for me. Thanks.
    If I want to read multiple parquet files from a folder, do you have some ideas about this problem?
    Appreciate your help.

  3. NICE, had to set javax.net.ssl.keyStore to a keystore I created using KMS cert since we are using CDH, but it worked! Now to convert this to CSV…

  4. Thank you for the information. One thing I noticed is that you are checking whether a fieldType is primitive; if not, you are discarding it. Is there a reason behind this? How can we get values that are not primitive (possibly in JSON format)?

  5. I am getting the error below when trying to use the above code:
    java.lang.NoSuchMethodError: org.jets3t.service.impl.rest.httpclient.RestS3Service.<init>(Lorg/jets3t/service/security/AWSCredentials;)

    CODE:
    String PATH_SCHEMA = "s3://" + object.getBucketName() + "/" + object.getKey();
    Path path = new Path(PATH_SCHEMA);
    Configuration conf = new Configuration();
    conf.set("fs.s3.awsAccessKeyId", credentials.accessKeyId);
    conf.set("fs.s3.awsSecretAccessKey", credentials.secretKey);
    try {
      ParquetMetadata readFooter = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
      MessageType schema = readFooter.getFileMetaData().getSchema();
      ParquetFileReader r = new ParquetFileReader(conf, path, readFooter);
      PageReadStore pages = null;
      try {
        while (null != (pages = r.readNextRowGroup())) {
          final long rows = pages.getRowCount();
          System.out.println("Number of rows: " + rows);
          final MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
          final RecordReader recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
        }
      } finally {
        r.close();
      }
    } catch (IOException e) {
      System.out.println("Error reading parquet file.");
      e.printStackTrace();
    }

    I am using the exact versions of the dependencies mentioned above.

    Please help here.

    1. Hi, you are using some AWS S3 dependency and calling a method that does not exist. This has nothing to do with my code.

  6. Just for your own benefit…

    The following is your code, updated so that it makes no calls to deprecated APIs with the latest version of the libraries:

    private void readParquet(final File f) throws IOException
    {
      final Path path = new Path(f.toURI());
      final HadoopInputFile hif = HadoopInputFile.fromPath(path, new Configuration());
      final ParquetReadOptions opts = HadoopReadOptions.builder(hif.getConfiguration())
          .withMetadataFilter(ParquetMetadataConverter.NO_FILTER).build();

      try (final ParquetFileReader r = ParquetFileReader.open(hif, opts))
      {
        final ParquetMetadata readFooter = r.getFooter();
        final MessageType schema = readFooter.getFileMetaData().getSchema();

        while (true)
        {
          final PageReadStore pages = r.readNextRowGroup();

          if (pages == null)
            return;

          final long rows = pages.getRowCount();

          System.out.println("Number of rows: " + rows);

          final MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
          final RecordReader<Group> recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
          for (int i = 0; i < rows; i++)
          {
            final Group g = recordReader.read();

            printGroup(g);
          }
        }
      }
    }

    private static void printGroup(Group g)
    {
      final int fieldCount = g.getType().getFieldCount();

      for (int field = 0; field < fieldCount; field++)
      {
        final int valueCount = g.getFieldRepetitionCount(field);
        final Type fieldType = g.getType().getType(field);
        final String fieldName = fieldType.getName();

        for (int index = 0; index < valueCount; index++)
        {
          if (fieldType.isPrimitive())
          {
            System.out.println(fieldName + " " + g.getValueToString(field, index));
          }
        }
      }

      System.out.println("");
    }

  7. Hello, thank you for the statement that this is not a trivial task 🙂 Unfortunately, your code does not seem to work on my system (Windows 10)... my pom.xml only contains the two dependencies you suggested and the code is identical (except for the < and (Group) cast corrections), yet I get the error

    "Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.fs.FSDataInputStream.getWrappedStream()Ljava/io/InputStream;"

    How can this be if the pom.xml is the same?
