
Optimizing S3 Data Lake for Low-Frequency Individual Record Lookups Prioritizing Simplicity


We have a large-scale S3 data lake with the following characteristics:

  • Source: AWS Flink application writing Parquet files directly to S3
  • Volume: ~4,000 Parquet files per hour, totaling 200GB to 1TB of Parquet data per hour
  • Daily Volume: 12TB+ across ~96k files
  • Record Count: ~80-130M records/hour or ~2.3-2.5B records/day
  • S3 Partitioning by date and hour
  • No lakehouse format (no Iceberg, Hudi, etc.), just uncompressed Parquet files in S3 being read by downstream AWS Glue (Apache Spark) jobs and AWS Athena
  • Each record has 4 different keys associated with it that it can be looked up by
  • Data likely to double in the next ~8 months
  • 28 day rolling retention policy

Usage Patterns:

  • Primary use case: Glue/Spark jobs - Read entire hour partitions for aggregations and analytics - performing fine
  • Secondary use case: UI lookups - single-key lookups for displaying a single request's payload in a UI. ~100-300 requests/day (low usage, but important users/use case)
    • Users must specify the date the record took place in (hour is optional)
    • If no hour filter is provided, there's an 80% chance the query times out against our 40-second API limit
    • If hour is provided, 90% success rate
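For reference, the lookup pattern above could be sketched as a query builder; the table and partition column names (`dt`, `hr`) are hypothetical, and a real implementation would need parameterization rather than string interpolation:

```python
from datetime import date
from typing import Optional

def build_lookup_sql(table: str, key_col: str, key_val: str,
                     day: date, hour: Optional[int] = None) -> str:
    """Athena point-lookup SQL; partition columns `dt`/`hr` are assumptions."""
    sql = (f"SELECT * FROM {table} "
           f"WHERE dt = '{day.isoformat()}' "
           f"AND {key_col} = '{key_val}'")
    if hour is not None:
        # The hour predicate prunes the scan from ~96k files to ~4k files.
        sql += f" AND hr = '{hour:02d}'"
    return sql + " LIMIT 1"
```

Without the `hr` predicate, Athena must scan every hourly partition for the day, which is what drives the timeouts described above.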

Constraints:

  • Write-heavy workload due to the amount of unique keys we see
  • Low number of UI users, but fixing timeouts is critical as they are important users
  • Data volume expected to double within a year
  • Would like to avoid full re-architecture, but have already considered migrating to Apache Iceberg for other reasons (ex: we currently have no way to modify records)
  • Would like minimal downstream impact on the Glue jobs, given the number of jobs that depend on this data
  • Would like to keep cost and complexity relatively low due to the user base being small and our team already supporting a lot of other use cases

Solutions Considered:

  • Apache Iceberg: Use hidden partitioning/indexing for secondary lookups while maintaining hourly physical partitions for Spark jobs
  • Custom manifest/index files: Create per-hour index files (one per key type) mapping keys to specific S3 file paths. See below:

The idea was to create manifest files for each key type in each hourly partition, e.g.: /manifest/datetime=2026-01-01--00/pk_type_1/partition<0_n>.parquet

We would read a batch of S3 events and create a map of key -> s3_path that we would store in a manifest file in S3/EFS. To keep the manifest files smaller, we hash the key value and mod by 1000 (an arbitrary number for now, pending more testing). A lookup then only has to scan 1/1000 of the keys.
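A minimal sketch of that bucketing scheme, assuming a stable hash (Python's built-in `hash()` is salted per process, so something like MD5 is needed) and the path layout above; the helper names are ours:

```python
import hashlib

NUM_BUCKETS = 1000  # arbitrary for now, per the description above

def manifest_bucket(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Stable hash of the key value, modded into a fixed number of buckets."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

def manifest_path(dt: str, hour: int, pk_type: str, key: str) -> str:
    """S3 key of the manifest partition that would contain this lookup key."""
    bucket = manifest_bucket(key)
    return (f"manifest/datetime={dt}--{hour:02d}/"
            f"{pk_type}/partition{bucket}.parquet")
```

Because the bucket is derived from the key alone, a UI lookup can compute the one manifest file to read without listing anything.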

The challenge is that maintaining these in a streaming architecture is complex (S3 event → SNS → SQS → Lambda would require expensive read-modify-write cycles or some form of batching of manifest files). Annoyingly, most read-optimized file formats, like Parquet, don't support appends and require a full read + modify + write for any change.

I thought about running the batch job 5 minutes past the hour each hour and falling back to simple Athena queries when the cache doesn't yet exist, but that seems like a pretty large pitfall and delay in a realtime system.
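The hourly-batch idea with an Athena fallback could be sketched as below; the two callables stand in for the manifest read and the Athena query, which are assumptions here:

```python
from typing import Callable, Optional

def lookup_record(key: str,
                  read_manifest: Callable[[str], Optional[dict]],
                  query_athena: Callable[[str], Optional[dict]]) -> dict:
    """Fast path via the hourly manifest; fall back to a plain Athena
    scan when the manifest for the current hour hasn't been built yet
    (the batch job runs ~5 minutes past the hour)."""
    record = read_manifest(key)
    if record is not None:
        return {"source": "manifest", "record": record}
    # Slow path: the 40s-timeout-prone query, only for the newest data.
    return {"source": "athena", "record": query_athena(key)}
```

This confines the timeout risk to records from the last few minutes, which may or may not be acceptable depending on how often users look up the current hour.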

I have of course also considered DynamoDB (write costs are prohibitive relative to how few reads we do) and am also considering spinning up a PostgreSQL database to test with. I wanted to see if I could architect a solution that doesn't require a traditional database, since that is only more management for our team (which is currently trying to move away from DB management).

Questions:

  1. Would Apache Iceberg be a good fit for this use case, with its hidden partitioning and manifest file system?
  2. Is there a simpler solution I'm missing? I'm aiming for ~5-8 second queries with minimal operational overhead, not sub-second performance.
  3. For the custom manifest approach, what's the best way to maintain indexes in a streaming context without expensive read-modify-write operations?
  4. Are there lightweight indexing solutions better suited for streaming writes and infrequent point lookups?
asked 6 days ago · 60 views
1 Answer

Based on your requirements and constraints, here are my thoughts on your data lake optimization challenge:

Apache Iceberg Evaluation

Apache Iceberg would be a good fit for your use case. It provides hidden partitioning and manifest-based metadata that could significantly improve your lookup performance without disrupting your existing hourly partitioning scheme for Spark jobs. Iceberg's manifests store per-file column statistics (min/max values), so the query planner can prune files that cannot contain a given key; this helps single-record lookups most when you sort or cluster the data by the lookup keys during writes or compaction.

Key benefits for your scenario:

  • Maintains compatibility with your existing Glue/Spark jobs
  • Provides efficient lookups via metadata files without scanning all Parquet files
  • Supports record-level updates (addressing your current limitation)
  • Handles your scale (and future growth) well
  • Works with your existing S3 storage

Simpler Alternatives

For your relatively low-volume lookups (100-300/day), a few simpler approaches could work:

  1. Secondary lightweight index in DynamoDB with TTL:
  • Store only key-to-file mappings (not full records)
  • Set 28-day TTL matching your retention policy
  • This approach would be much less expensive than storing full records
  • Write costs would be manageable since you're only storing mapping data
  2. Optimized Athena queries with partitioning improvements:
  • Add more granular partitioning beyond just date/hour (perhaps by key ranges)
  • Use Athena workgroups with query result reuse enabled for repeated lookups
  • Enable partition projection so Athena can prune partitions without Glue catalog round-trips
  3. Amazon OpenSearch Serverless for lookups:
  • Index only the key fields and file locations
  • Provides fast lookups with minimal management overhead
  • Can be updated in near real-time from your Flink application
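For option 1 above (a DynamoDB key-to-file index), a minimal sketch of the item shape, assuming illustrative attribute names and a Unix-epoch TTL attribute (`expires_at`) configured as the table's TTL field:

```python
import time
from typing import Optional

RETENTION_DAYS = 28  # matches the 28-day rolling retention policy

def index_item(key_type: str, key_value: str, s3_path: str,
               now: Optional[float] = None) -> dict:
    """DynamoDB item mapping a lookup key to the Parquet file that holds it.
    Stores only the pointer, not the record, to keep write costs low."""
    now = time.time() if now is None else now
    return {
        "pk": f"{key_type}#{key_value}",  # composite partition key
        "s3_path": s3_path,               # file to read, not the payload
        "expires_at": int(now) + RETENTION_DAYS * 86400,  # drives table TTL
    }
```

A UI lookup then becomes a single `GetItem` on `pk` followed by a targeted read of one Parquet file, and TTL deletes keep the table bounded at 28 days of keys.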

Maintaining Indexes in a Streaming Context

For your custom manifest approach, consider:

  1. Append-only manifest design:
  • Create new manifest files for each batch of records rather than updating existing ones
  • Periodically compact/merge manifest files in the background
  • This eliminates read-modify-write cycles
  2. Leverage Flink's state management:
  • Have your Flink application maintain and emit manifest files alongside data files
  • This keeps index creation in the same processing pipeline as your data
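The append-only design in option 1 can be sketched as below; a plain dict stands in for S3, and the object layout is illustrative. Each batch writes a fresh manifest object (no read-modify-write), and a background job merges them:

```python
def write_batch_manifest(store: dict, hour: str, seq: int,
                         entries: dict) -> str:
    """Append-only write: every batch gets a new object, never an update.
    `store` stands in for S3; keys are hypothetical object paths."""
    path = f"manifest/{hour}/batch-{seq:05d}.parquet"
    store[path] = dict(entries)  # key -> s3_path mappings for this batch
    return path

def compact_hour(store: dict, hour: str) -> dict:
    """Background compaction: merge all batch manifests for one hour.
    Later batches win on duplicate keys."""
    merged: dict = {}
    for path in sorted(store):
        if path.startswith(f"manifest/{hour}/"):
            merged.update(store[path])
    return merged
```

Until compaction runs, a lookup reads all batch manifests for the hour; afterwards it reads one merged file. This is essentially what Iceberg's manifest lists and compaction do for you out of the box.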

Lightweight Indexing Solutions

  1. Amazon Athena Federated Query:
  • Create a lightweight index in DynamoDB or ElasticSearch
  • Use federated queries to join your index with the actual S3 data
  2. Targeted reads driven by your manifest files:
  • Use your manifest files to identify the specific Parquet files containing the key
  • Then read only those files with a pushed-down predicate in Athena or your reader (note that AWS no longer offers S3 Select to new customers, so in-file filtering should happen in the query engine)

Given your constraints around minimizing complexity and operational overhead while improving lookup performance, I'd recommend Apache Iceberg as the most balanced solution. It provides the indexing benefits you need without requiring a separate database system to maintain, and it's designed specifically for data lake optimization. The transition would be relatively smooth for your existing Glue/Spark jobs, and it would solve both your lookup performance issues and your record modification limitations.

answered 6 days ago
