Read Parquet file from S3 without Hadoop


I need to read a Parquet file from S3 using Java in a Maven project. I am using the code below to read the file, but the serverless app I am deploying exceeds the 50 MB limit once I include the Parquet and Hadoop dependencies. The code needs the Hadoop Path and Configuration classes to open the file on S3. Is there any way to avoid Hadoop altogether?

    List<SimpleGroup> simpleGroups = new ArrayList<>();
    ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(filePath), conf));
    MessageType schema = reader.getFooter().getFileMetaData().getSchema();
    List<Type> fields = schema.getFields();
    PageReadStore pages;
    while ((pages = reader.readNextRowGroup()) != null) {
        long rows = pages.getRowCount();
        MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
        RecordReader recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
        for (int i = 0; i < rows; i++) {
            SimpleGroup simpleGroup = (SimpleGroup) recordReader.read();
            simpleGroups.add(simpleGroup);
        }
    }
    reader.close();

AWS
asked 4 years ago, 2672 views
1 Answer
Accepted Answer

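Have you tried S3 Select? It runs a SQL query against the object on the S3 side and accepts Parquet as an input format, so your application needs neither the Parquet nor the Hadoop libraries. Also take a look at the example of searching data using S3 Select with simple SQL queries.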
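To make the S3 Select suggestion more concrete, here is a minimal, dependency-free sketch of the two pieces that would sit around the SDK call: building the SQL expression and turning the returned records into rows. It assumes the request asks for CSV output with newline-delimited records. The class and method names (`S3SelectSketch`, `buildQuery`, `readRows`) and the column names are hypothetical, and the SDK call that actually submits the query (`selectObjectContent` in the AWS SDK for Java) is deliberately left out, so the stream passed to `readRows` only stands in for the records stream that call would return.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class S3SelectSketch {

    // Builds an S3 Select expression that projects the given columns.
    // Column names are double-quoted so they are matched case-sensitively.
    // Assumption: the names contain no double quotes themselves.
    static String buildQuery(List<String> columns, int limit) {
        String projection = columns.stream()
                .map(c -> "s.\"" + c + "\"")
                .collect(Collectors.joining(", "));
        return "SELECT " + projection + " FROM S3Object s LIMIT " + limit;
    }

    // Reads newline-delimited CSV records into rows of string fields.
    // Limitation: a plain split on ',' does not handle quoted fields that
    // contain commas; use a real CSV parser if the data may contain them.
    static List<List<String>> readRows(InputStream records) {
        List<List<String>> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(records, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines between record batches
                }
                rows.add(Arrays.asList(line.split(",", -1)));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical column names, for illustration only.
        System.out.println(buildQuery(Arrays.asList("id", "name"), 100));

        // Stand-in for the records stream an S3 Select call would return.
        InputStream fake = new ByteArrayInputStream(
                "1,alice\n2,bob\n".getBytes(StandardCharsets.UTF_8));
        for (List<String> row : readRows(fake)) {
            System.out.println(row);
        }
    }
}
```

In a real deployment the query string would go into the select request together with a Parquet input serialization and a CSV output serialization, and the rows would replace the `List<SimpleGroup>` built in the question. Only the S3 client from the AWS SDK is then needed on the classpath.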

AWS
Vivek_S
answered 4 years ago
