AWS Glue is well-suited for your use case of processing large Excel files. As a serverless ETL service that uses Apache Spark, AWS Glue can efficiently handle big data workloads by distributing processing across worker nodes, which would help address your memory issues and timeouts.
Here's an approach you could take (a minimal job sketch follows the list):
- Upload your Excel files to Amazon S3
- Create an AWS Glue job to extract and process the data from these files
- Configure the job to identify column names and values
- Store the processed results in your RDS database
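To make the flow concrete, here is a minimal sketch of such a Glue PySpark job. It assumes the open-source spark-excel connector (com.crealytics:spark-excel) has been attached to the job as a dependent JAR; the S3 path, JDBC URL, credentials, and table name are placeholders passed in as job parameters:

```python
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Job parameters (placeholders): where the Excel file lives and how to reach RDS.
args = getResolvedOptions(
    sys.argv, ["JOB_NAME", "source_path", "jdbc_url", "db_user", "db_password"]
)

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the Excel file from S3; the spark-excel connector keeps the first row as headers.
df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(args["source_path"])  # e.g. s3://your-bucket/uploads/report.xlsx
)

# Append the rows to a staging table in RDS over JDBC ("raw_excel_rows" is a placeholder).
(
    df.write.format("jdbc")
    .option("url", args["jdbc_url"])  # e.g. jdbc:postgresql://host:5432/yourdb
    .option("dbtable", "raw_excel_rows")
    .option("user", args["db_user"])
    .option("password", args["db_password"])
    .mode("append")
    .save()
)
```

Depending on your setup, the job may also need the JDBC driver for your RDS engine and a Glue connection so it can reach the database inside your VPC.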
AWS Glue offers several advantages for this workflow:
- It can handle large datasets that cause problems in your current Java application
- It uses distributed processing with Apache Spark for faster transformations
- You can control the compute resources (DPUs) based on your workload needs
- It supports auto-scaling (in AWS Glue 3.0 and later) to optimize resource usage; a configuration sketch follows this list
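As an illustration of controlling that capacity, here is a minimal boto3 sketch of creating the job with an explicit worker type, worker count, and auto-scaling enabled. The job name, IAM role, and script location are placeholders:

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="excel-to-rds",  # placeholder job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder IAM role
    GlueVersion="3.0",
    WorkerType="G.1X",    # 1 DPU per worker
    NumberOfWorkers=10,   # acts as the upper bound when auto-scaling is enabled
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://your-bucket/scripts/excel_to_rds.py",  # placeholder
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--enable-auto-scaling": "true",  # let Glue scale workers up and down per stage
    },
)
```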
For your specific need to normalize column names later, you could:
- Extract the schema information from the Excel files
- Store both the raw column names and the data in your RDS database (see the sketch after this list)
- Allow users to normalize these column names as you currently do
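Continuing the job sketch above, one way to do this is to write the header names into a small mapping table next to the data, so your application can let users rename them later. The mapping table name is a placeholder:

```python
from pyspark.sql import Row

# Record each raw header name and its position so the UI can map it to a normalized name later.
columns_df = spark.createDataFrame(
    [
        Row(source_file=args["source_path"], position=i, raw_name=name)
        for i, name in enumerate(df.columns)
    ]
)

(
    columns_df.write.format("jdbc")
    .option("url", args["jdbc_url"])
    .option("dbtable", "raw_column_names")  # placeholder mapping table
    .option("user", args["db_user"])
    .option("password", args["db_password"])
    .mode("append")
    .save()
)
```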
This approach would offload the heavy data processing from your application server to AWS Glue, while still maintaining your workflow for column normalization and AI context preparation.
If your Excel files are particularly large, consider saving the intermediate data in a format like Parquet, which is more efficient for processing and can be easily read by AWS Glue and other AWS analytics services.
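For example, the job could stage the parsed rows as Parquet in S3 before (or instead of) loading them into RDS; the output path is a placeholder:

```python
# Stage the parsed Excel data as columnar Parquet; later Glue jobs, Athena, or other
# analytics services can read this far more efficiently than re-parsing the Excel file.
df.write.mode("overwrite").parquet("s3://your-bucket/staging/excel-extract/")
```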
Sources
- Export data by using AWS Glue - AWS Prescriptive Guidance
- AWS Glue ETL - AWS Prescriptive Guidance