Read Parquet file from S3 without hadoop

0

I need to read a parquet file from S3 using Java in a maven project. I am using the below code to read the Parquet file, but the serverless app I am deploying exceeds the limit of 50Mb when I include the parquet and Hadoop dependencies. I need Hadoop Path and Configuration classes to read the file on S3. Is there any way I can avoid Hadoop altogether?

List<SimpleGroup> simpleGroups = new ArrayList<>(); ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(filePath), conf)); MessageType schema = reader.getFooter().getFileMetaData().getSchema(); List<Type> fields = schema.getFields(); PageReadStore pages; while ((pages = reader.readNextRowGroup()) != null) { long rows = pages.getRowCount(); MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema); RecordReader recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema)); for (int i = 0; i < rows; i++) { SimpleGroup simpleGroup = (SimpleGroup) recordReader.read(); simpleGroups.add(simpleGroup); } } reader.close();

AWS
已提问 4 年前2672 查看次数
1 回答
1
已接受的回答

Have you tried S3 select? This will avoid Hadoop altogether. Also take a look at example to search data using S3 select with simple sql queries.

AWS
Vivek_S
已回答 4 年前

您未登录。 登录 发布回答。

一个好的回答可以清楚地解答问题和提供建设性反馈,并能促进提问者的职业发展。

回答问题的准则