Based on your requirements and constraints, here are my thoughts on your data lake optimization challenge:
Apache Iceberg Evaluation
Apache Iceberg would be a good fit for your use case. It provides hidden partitioning and a manifest-based metadata layer that could significantly improve your lookup performance without disrupting your existing hourly partitioning scheme for Spark jobs. Iceberg's manifest files record per-file column statistics (min/max values), so a query engine can prune down to the few data files that could contain a specific record, which is exactly what your single-record lookups need.
Key benefits for your scenario:
- Maintains compatibility with your existing Glue/Spark jobs
- Provides efficient lookups via metadata files without scanning all Parquet files
- Supports record-level updates (addressing your current limitation)
- Handles your scale (and future growth) well
- Works with your existing S3 storage
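To make the migration concrete, here is a minimal sketch of registering an hourly-partitioned Iceberg table through Spark SQL. All catalog, database, table, and column names are placeholders, not your actual schema:

```python
# Sketch: define an Iceberg table with hidden hourly partitioning via Spark SQL.
# Catalog, database, table, and column names are illustrative placeholders.

def iceberg_create_ddl(catalog: str, db: str, table: str) -> str:
    """Build a CREATE TABLE statement for an hourly-partitioned Iceberg table."""
    return f"""
        CREATE TABLE {catalog}.{db}.{table} (
            record_key STRING,
            payload    STRING,
            event_time TIMESTAMP
        )
        USING iceberg
        PARTITIONED BY (hours(event_time))      -- hidden partitioning: no hour column to manage
        TBLPROPERTIES ('format-version' = '2')  -- v2 enables row-level updates/deletes
    """

# In a Glue/Spark job the statement would be executed as:
#   spark.sql(iceberg_create_ddl("glue_catalog", "lake", "events"))
# A single-record lookup then prunes files via manifest min/max statistics:
#   spark.sql("SELECT * FROM glue_catalog.lake.events WHERE record_key = 'abc'")
```

The `format-version = '2'` property is what unlocks the record-level updates mentioned above.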
Simpler Alternatives
For your relatively low-volume lookups (100-300/day), a few simpler approaches could work:
- Secondary lightweight index in DynamoDB with TTL:
  - Store only key-to-file mappings (not full records)
  - Set a 28-day TTL matching your retention policy
  - This approach would be much less expensive than storing full records
  - Write costs would be manageable since you're only storing mapping data
- Optimized Athena queries with partitioning improvements:
  - Add more granular partitioning beyond just date/hour (perhaps by key ranges)
  - Use Athena workgroups with query result reuse enabled for common lookups
  - Enable partition projection so lookups skip partition metadata calls to the Glue catalog
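As a sketch of the result-reuse option: `start_query_execution` accepts a `ResultReuseConfiguration`, so a repeated lookup within the max-age window returns the cached result instead of rescanning S3. Table, database, and workgroup names below are placeholders:

```python
def lookup_query_params(database: str, workgroup: str, record_key: str) -> dict:
    """Build StartQueryExecution arguments with query result reuse enabled.

    All names are placeholders. The query string is built naively here for
    illustration; parameterize or sanitize inputs in real code.
    """
    return {
        "QueryString": f"SELECT * FROM events WHERE record_key = '{record_key}'",
        "QueryExecutionContext": {"Database": database},
        "WorkGroup": workgroup,
        "ResultReuseConfiguration": {
            "ResultReuseByAgeConfiguration": {"Enabled": True, "MaxAgeInMinutes": 60}
        },
    }

# Submitting the lookup (names are placeholders):
#   import boto3
#   boto3.client("athena").start_query_execution(
#       **lookup_query_params("lake", "lookups", "abc")
#   )
```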
- Amazon OpenSearch Serverless for lookups:
  - Index only the key fields and file locations
  - Provides fast lookups with minimal management overhead
  - Can be updated in near real-time from your Flink application
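Keeping the documents to just key and file location makes the ingest cheap. A sketch of building an OpenSearch `_bulk` request body for a batch of mappings; the index name is a placeholder:

```python
import json

def bulk_index_body(index: str, mappings: list[tuple[str, str]]) -> str:
    """Build an OpenSearch _bulk body indexing only key -> file location pairs.

    `mappings` is a list of (record_key, s3_uri) tuples. Using the record key
    as the document _id makes re-indexing the same key idempotent.
    """
    lines = []
    for record_key, s3_uri in mappings:
        lines.append(json.dumps({"index": {"_index": index, "_id": record_key}}))
        lines.append(json.dumps({"record_key": record_key, "s3_uri": s3_uri}))
    return "\n".join(lines) + "\n"  # _bulk bodies are newline-delimited JSON
```

This body would be POSTed to the collection's `_bulk` endpoint (for example via the opensearch-py client with SigV4 signing) from your Flink job's sink.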
Maintaining Indexes in a Streaming Context
For your custom manifest approach, consider:
- Append-only manifest design:
  - Create new manifest files for each batch of records rather than updating existing ones
  - Periodically compact/merge manifest files in the background
  - This eliminates read-modify-write cycles
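The append-only pattern above can be sketched in a few lines. Each batch gets a uniquely named manifest object (so writers never contend), and a background job merges the small files. Key layout and file format here are assumptions for illustration:

```python
import json
import uuid

def manifest_key(prefix: str) -> str:
    """Unique S3 key per batch, so writers never read-modify-write an existing file."""
    return f"{prefix}/manifest-{uuid.uuid4().hex}.json"

def manifest_body(batch: dict[str, str]) -> str:
    """Serialize one batch of record_key -> data-file mappings."""
    return json.dumps(batch, sort_keys=True)

def compact(manifests: list[str]) -> str:
    """Merge many small manifest bodies into one; later entries win on duplicate keys."""
    merged: dict[str, str] = {}
    for body in manifests:
        merged.update(json.loads(body))
    return manifest_body(merged)
```

A scheduled Glue job (or a Lambda) can list the small manifests under the prefix, write one compacted file, and then delete the inputs; lookups only ever read manifests, never rewrite them.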
- Leverage Flink's state management:
  - Have your Flink application maintain and emit manifest files alongside data files
  - This keeps index creation in the same processing pipeline as your data
Lightweight Indexing Solutions
- Amazon Athena Federated Query:
  - Create a lightweight index in DynamoDB or OpenSearch
  - Use federated queries to join your index with the actual S3 data
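Once a DynamoDB connector is registered as an Athena data source, the join can be expressed in one query. The data source name (`ddb`), schema, and table names below are all placeholders for whatever you register:

```python
def federated_lookup_sql(record_key: str) -> str:
    """Build a query joining a DynamoDB-backed index with the S3-backed table.

    "ddb" is an assumed name for the DynamoDB connector data source; the
    table and column names are placeholders. Parameterize inputs in real code.
    """
    return f"""
        SELECT e.*
        FROM "ddb"."default"."record_index" AS idx
        JOIN lake.events AS e
          ON e.record_key = idx.record_key
        WHERE idx.record_key = '{record_key}'
    """
```

Note that the federated connector runs as a Lambda function, which adds a managed component but no servers to operate.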
- S3 Select with targeted queries:
  - Use your manifest files to identify specific Parquet files
  - Then use S3 Select for efficient in-file filtering
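A sketch of the second step: once the manifest has resolved a key to one Parquet object, `select_object_content` filters rows inside that object server-side instead of downloading it. Column and bucket names are placeholders:

```python
def select_params(bucket: str, key: str, record_key: str) -> dict:
    """Build select_object_content arguments to filter inside one Parquet file.

    `bucket`/`key` come from the manifest lookup; only matching rows come back
    over the wire. Names are illustrative; sanitize inputs in real code.
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": f"SELECT * FROM S3Object s WHERE s.record_key = '{record_key}'",
        "InputSerialization": {"Parquet": {}},
        "OutputSerialization": {"JSON": {}},
    }

# Running the scan (call only sketched):
#   import boto3
#   resp = boto3.client("s3").select_object_content(**select_params(b, k, "abc"))
#   # resp["Payload"] is an event stream of Records/Stats events
```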
Given your constraints around minimizing complexity and operational overhead while improving lookup performance, I'd recommend Apache Iceberg as the most balanced solution. It provides the indexing benefits you need without requiring a separate database system to maintain, and it's designed specifically for data lake optimization. The transition would be relatively smooth for your existing Glue/Spark jobs, and it would solve both your lookup performance issues and your record modification limitations.