Read Parquet file from S3 without Hadoop

Question

I need to read a Parquet file from S3 using Java in a Maven project. I am using the code below to read the file, but the serverless app I am deploying exceeds the 50 MB limit once I include the Parquet and Hadoop dependencies. The code needs Hadoop's `Path` and `Configuration` classes to read the file on S3. Is there any way I can avoid Hadoop altogether?

`
        List<SimpleGroup> simpleGroups = new ArrayList<>();
        // try-with-resources closes the reader even if reading a row group fails
        try (ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(filePath), conf))) {
            MessageType schema = reader.getFooter().getFileMetaData().getSchema();
            List<Type> fields = schema.getFields();
            // The column IO depends only on the schema, so build it once
            MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
            PageReadStore pages;
            // Read the file one row group at a time
            while ((pages = reader.readNextRowGroup()) != null) {
                long rows = pages.getRowCount();
                RecordReader<Group> recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
                for (long i = 0; i < rows; i++) {
                    simpleGroups.add((SimpleGroup) recordReader.read());
                }
            }
        }
`

Accepted Answer

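Have you tried [S3 Select][1]? It runs a SQL query against the object on the S3 side (Parquet is a supported input format) and streams back only the selected records as CSV or JSON, so the Parquet and Hadoop dependencies can be dropped altogether. Also take a look at this [example][2], which searches data using S3 Select with [simple SQL queries][3].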

[1]: https://docs.aws.amazon.com/AmazonS3/latest/dev/SelectObjectContentUsingJava.html
[2]: https://github.com/aws-samples/s3-select-phonebook-search
[3]: https://github.com/aws-samples/s3-select-phonebook-search/blob/master/src/main/java/com/amazonaws/samples/s3select/s3_select_demo/S3SelectDemoLambdaHandler.java
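
To give a rough idea of what the S3 Select route looks like on the consuming side, here is a minimal sketch that uses only the JDK. It assumes the query is run with JSON output, so each record arrives on its own line. The class and method names (`S3SelectRecords`, `buildQuery`, `readRecords`) and the column names are made up for illustration and are not taken from the linked sample. The request itself would be sent with the SDK's `SelectObjectContentRequest` as described in the linked docs; that part is deliberately left out here.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of the AWS SDK or the linked sample.
class S3SelectRecords {

    /** Builds an S3 Select SQL expression that projects only the given columns. */
    static String buildQuery(List<String> columns, int limit) {
        List<String> projected = new ArrayList<>();
        for (String column : columns) {
            projected.add("s." + column);
        }
        return "SELECT " + String.join(", ", projected) + " FROM S3Object s LIMIT " + limit;
    }

    /**
     * Collects one raw JSON document per line from the record stream.
     * With AWS SDK for Java v1 the stream would come from
     * result.getPayload().getRecordsInputStream() (assumption: JSON output).
     */
    static List<String> readRecords(InputStream records) {
        List<String> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(records, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    rows.add(line); // still a JSON string; parsing is left to the caller
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(buildQuery(Arrays.asList("name", "phone"), 10));

        // Stand-in for the stream S3 Select would return.
        byte[] sample = "{\"name\":\"a\"}\n{\"name\":\"b\"}\n".getBytes(StandardCharsets.UTF_8);
        for (String row : readRecords(new ByteArrayInputStream(sample))) {
            System.out.println(row);
        }
    }
}
```

Because only the requested columns come back, the function never has to parse Parquet itself; turning each line into an object would still need a JSON library of your choice.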
