Big data extraction with AWS Glue, then storing the results in a database?


Hi!

I need to extract a lot of data from big Excel files. Right now I do this in my Java Spring Boot backend service: I extract the column names and their lists of related values and save them in the database, so the user can later normalize the column names and my AI agent gets a cleaner context about what each part of the data means. For big Excel files, though, I'm running into problems: memory issues, timeouts, etc.

Does it make sense to upload the Excel file to S3, ingest it with Glue, run a job, and when the job is done fetch the result and store the column names and values in my RDS database? Or is there a better approach for handling this kind of situation?

Thank you!

1 Answer

AWS Glue is well-suited for your use case of processing large Excel files. As a serverless ETL service that uses Apache Spark, AWS Glue can efficiently handle big data workloads by distributing processing across worker nodes, which would help address your memory issues and timeouts.

Here's an approach you could take (a sketch of the extraction step follows the list):

  1. Upload your Excel files to Amazon S3
  2. Create an AWS Glue job to extract and process the data from these files
  3. Configure the job to identify column names and values
  4. Store the processed results in your RDS database
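
For step 2, here is a minimal sketch of the extraction script, assuming a Glue Python shell job and openpyxl; the workbook path, batch size, and save_batch helper are hypothetical placeholders:

```python
import openpyxl  # not preinstalled in every Glue environment; add via --additional-python-modules if needed

# Hypothetical path: in a real job you would first download the workbook
# from S3 (e.g. with boto3) into the job's local /tmp storage.
WORKBOOK_PATH = "/tmp/upload.xlsx"


def save_batch(columns, rows):
    # Placeholder: in practice, run a batched INSERT against RDS here.
    print(f"persisting {len(rows)} rows for {len(columns)} columns")


# read_only mode streams rows instead of loading the whole sheet into
# memory, which is what causes OOM errors on big workbooks.
wb = openpyxl.load_workbook(WORKBOOK_PATH, read_only=True)
ws = wb.active

rows = ws.iter_rows(values_only=True)
column_names = [str(c) if c is not None else "" for c in next(rows)]

batch = []
for row in rows:
    batch.append(row)
    if len(batch) >= 5000:  # flush in batches instead of one giant insert
        save_batch(column_names, batch)
        batch.clear()
if batch:
    save_batch(column_names, batch)

wb.close()
```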

AWS Glue offers several advantages for this workflow (a boto3 sketch for creating and running a job follows the list):

  • It can handle large datasets that cause problems in your current Java application
  • It uses distributed processing with Apache Spark for faster transformations
  • You can control the compute resources (DPUs) based on your workload needs
  • It supports auto-scaling in AWS Glue 3.0 to optimize resource usage
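
To control the compute resources and to "get the result when the job is ready", you can create, start, and poll the job with boto3. This is a minimal sketch; the job name, IAM role, and script location are hypothetical placeholders:

```python
import time

import boto3

glue = boto3.client("glue")
job_name = "excel-extraction-job"  # hypothetical

# One-time setup; create_job fails if a job with this name already exists.
glue.create_job(
    Name=job_name,
    Role="arn:aws:iam::123456789012:role/GlueExcelJobRole",  # hypothetical role
    Command={
        "Name": "glueetl",  # Spark ETL; use "pythonshell" for a Python shell job like the sketch above
        "ScriptLocation": "s3://my-bucket/scripts/extract_excel.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",   # per-worker capacity
    NumberOfWorkers=10,  # scale with file size
)

# Start a run and poll until it reaches a terminal state, so the backend
# can read the extracted columns from RDS once the job has SUCCEEDED.
run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
while True:
    state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)
print(f"Job finished with state: {state}")
```

For production, an EventBridge rule on Glue job state changes is usually nicer than polling from your Spring Boot service.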

For your specific need to normalize column names later, you could do the following (a metadata-table sketch follows the list):

  • Extract the schema information from the Excel files
  • Store both the raw column names and the data in your RDS database
  • Allow users to normalize these column names as you currently do
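
One way to store the raw names so users can normalize them later is a small metadata table. A minimal sketch, assuming a PostgreSQL RDS instance; the connection details, table layout, and example headers are all hypothetical:

```python
import os

import psycopg2  # use the driver that matches your RDS engine

# Hypothetical connection details; keep credentials out of the script.
conn = psycopg2.connect(
    host="mydb.xxxxxxxx.eu-west-1.rds.amazonaws.com",
    dbname="appdb",
    user="app",
    password=os.environ["DB_PASSWORD"],
)

with conn, conn.cursor() as cur:
    # One row per extracted column: the raw name is stored now, and the
    # normalized_name is filled in later by the user, as in your workflow.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS excel_columns (
            id              SERIAL PRIMARY KEY,
            upload_id       TEXT NOT NULL,
            raw_name        TEXT NOT NULL,
            normalized_name TEXT
        )
    """)
    for raw_name in ["Cust Nm", "ORDER_DT", "amt"]:  # example extracted headers
        cur.execute(
            "INSERT INTO excel_columns (upload_id, raw_name) VALUES (%s, %s)",
            ("upload-123", raw_name),
        )
```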

This approach would offload the heavy data processing from your application server to AWS Glue, while still maintaining your workflow for column normalization and AI context preparation.

If your Excel files are particularly large, consider saving the intermediate data in a format like Parquet, which is more efficient for processing and can be easily read by AWS Glue and other AWS analytics services.
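
A minimal sketch of that conversion with pandas, assuming the workbook has already been downloaded locally; the paths are hypothetical, and writing directly to s3:// requires the s3fs package alongside pyarrow (or fastparquet) and openpyxl:

```python
import pandas as pd

# Parse the Excel once, then persist it as Parquet so downstream Glue jobs
# read a compact, columnar format instead of re-parsing the workbook.
df = pd.read_excel("/tmp/upload.xlsx", engine="openpyxl")
df.to_parquet("s3://my-bucket/staging/upload-123.parquet", index=False)
```

For workbooks too large for pandas, the same write can be done from a Spark DataFrame inside the Glue job itself (df.write.parquet(...)).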

Sources

  • Export data by using AWS Glue - AWS Prescriptive Guidance
  • AWS Glue ETL - AWS Prescriptive Guidance

answered 5 days ago
