You cannot write a proper Iceberg table directly to S3 without using the Iceberg format and thereby updating the metadata snapshots.
It doesn't matter how you make changes to the table: the changelog view works from the snapshot history up to the point you create the changelog. Make sure the job creates new snapshots and that the range is included in the view.
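One way to check that snapshot history is to query the table's `snapshots` metadata table from a Spark session. A minimal sketch; the catalog and table names below are assumptions, not from the original post:

```python
# Sketch: inspect the Iceberg snapshot history to confirm the streaming job
# is committing new snapshots. Catalog and table names here are hypothetical.
table = "glue_catalog.db.events"  # hypothetical fully-qualified table name

snapshot_history_sql = f"""
SELECT snapshot_id, committed_at, operation
FROM {table}.snapshots
ORDER BY committed_at
"""

# In a Glue/Spark session you would run:
# spark.sql(snapshot_history_sql).show(truncate=False)
```

If the streaming job is committing correctly, you should see new rows appear here between the start and end of the range you pass to the changelog procedure.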
It seems like you're facing an issue where the activity from the Kinesis stream is not reflected in the Iceberg changelog when using a Glue streaming job. Let's break down your setup and possible reasons for the behavior you're experiencing.
1. Kinesis Stream to Iceberg via Glue Streaming Job: You're using a Glue streaming job to consume data from a Kinesis stream and write it to an Iceberg table.
2. Changelog Generation with Extra Jars: You're using another Glue job to generate the changelog using Iceberg version 1.4.3 with the --extra-jars parameter.
3. Observation: Activity from the Kinesis stream is not reflected in the changelog, whereas manual inserts or merges via Athena or Glue jobs are reflected.
Possible Reasons:
- Direct S3 Writes: If the Kinesis stream data is being written directly to S3, bypassing the Glue catalog or Iceberg's transaction mechanism, it won't be captured in the changelog. This could be the case if the Glue streaming job is not properly configured to interact with Iceberg, or if the writing mechanism bypasses Iceberg altogether.
- Transaction Commit: Iceberg captures changes through transaction commits. If the Kinesis stream data isn't being committed within a transaction context, it won't be reflected in the changelog. Ensure that the Glue streaming job is properly committing transactions after writing data to the Iceberg table.
- Configuration Issue: There might be a configuration issue in how the Glue streaming job interacts with Iceberg or how Iceberg is configured to capture changes. Check your Glue job settings and Iceberg configuration.
- Compatibility Issue: Ensure compatibility between the Iceberg version used for writing data (via the Glue streaming job) and the version used for generating the changelog (via the Glue job with --extra-jars). A version mismatch could prevent changes from being captured properly.
- Glue Job Execution: Verify that the Glue streaming job and the changelog-generation job run without errors. Check logs and monitoring metrics to ensure there are no issues during execution.
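One common pattern for getting per-batch commits from a Glue streaming job is `foreachBatch`, where each micro-batch is appended as its own Iceberg transaction (and therefore its own snapshot). A sketch under stated assumptions; the table name and checkpoint path are hypothetical:

```python
# Sketch of a streaming sink that commits each micro-batch as an Iceberg
# snapshot via foreachBatch. Table name and checkpoint path are hypothetical.
TARGET_TABLE = "glue_catalog.db.events"  # hypothetical table name

def write_batch_to_iceberg(batch_df, batch_id):
    # Each append is an Iceberg transaction; every micro-batch that reaches
    # this point should produce a new snapshot visible to the changelog view.
    batch_df.writeTo(TARGET_TABLE).append()

# Wired into the streaming query (requires a live Spark session):
# query = (kinesis_df.writeStream
#          .foreachBatch(write_batch_to_iceberg)
#          .option("checkpointLocation", "s3://bucket/checkpoints/")  # hypothetical
#          .start())
```

If writes go through a path that never commits (or commits only at job shutdown), the snapshot history will not line up with the stream activity.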
To troubleshoot, you may need to dive deeper into the Glue streaming job's configuration, Iceberg setup, and how data is being written from the Kinesis stream to the Iceberg table.
To add some more detail: I am filtering the view creation using "start-timestamp" instead of "start-snapshot-id", according to these docs: https://iceberg.apache.org/docs/latest/spark-procedures/#usage_17.
I wonder if using a timestamp is the problem?
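For reference, a sketch of the two variants of the procedure call; the catalog, table, view name, and id/timestamp values below are hypothetical placeholders, and per the linked docs' examples the timestamp option takes epoch milliseconds:

```python
# Sketch of create_changelog_view, snapshot-id variant. All names and values
# are hypothetical placeholders.
changelog_call = """
CALL glue_catalog.system.create_changelog_view(
  table => 'db.events',
  options => map('start-snapshot-id', '1234567890'),
  changelog_view => 'events_changelog'
)
"""

# Timestamp-based variant; the docs' examples use epoch milliseconds:
changelog_call_ts = changelog_call.replace(
    "'start-snapshot-id', '1234567890'",
    "'start-timestamp', '1714021200000'",
)

# In a Glue/Spark session: spark.sql(changelog_call).show()
```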
I don't see a problem there; it's going to get the history of snapshots and filter by timestamp. The question is whether the history actually has new snapshots.
Yes, agreed. But I can now confirm that when I use the snapshot-id as part of the procedure, it works as intended. When I pass in a timestamp instead, I get the following error: "Cannot find snapshot older than 1970-01-20T20:07:01.200+00:00"
So I believe this must be a bug in the Iceberg code for the changelog procedure.
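For what it's worth, a 1970-01-20 date is exactly what an epoch-seconds value looks like when it is interpreted as epoch milliseconds, so a unit mismatch may be worth ruling out before filing a bug. A quick check with the standard library; the 1714021200 value is a hypothetical epoch-seconds timestamp chosen to reproduce the date from the error message:

```python
from datetime import datetime, timezone

ts_seconds = 1714021200  # hypothetical epoch-seconds value (roughly April 2024)

# Read as milliseconds instead of seconds, it lands in January 1970:
as_millis = datetime.fromtimestamp(ts_seconds / 1000, tz=timezone.utc)
print(as_millis.isoformat())  # → 1970-01-20T20:07:01.200000+00:00
```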