Hi,
Looking at the "GENERIC_INTERNAL_ERROR: org/objenesis/strategy/InstantiatorStrategy" error, it seems to be a general issue when querying an Apache Hudi dataset of the Merge on Read (MoR) table type.
This error can occur after a delete action is performed, because Hudi then automatically adds an Avro (delta log) file to the dataset, and that file can trigger the above error when you query in Athena. If an Avro file was added to your Hudi dataset in S3, remove the file and try querying again from Athena.
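Before deleting anything, it helps to confirm which objects under the table's S3 prefix are Avro delta logs rather than Parquet base files. The sketch below is a minimal illustration: the bucket layout, file names, and the name patterns it matches are assumptions, not guaranteed Hudi behavior, so inspect your own dataset first.

```python
def find_delta_log_keys(keys):
    """Return S3 object keys that look like Hudi Avro delta-log files.

    Assumption: delta logs contain ".log." in the file name or end in
    ".avro"; Parquet base files and .hoodie metadata are left alone.
    """
    return [
        k for k in keys
        if ".log." in k.rsplit("/", 1)[-1] or k.endswith(".avro")
    ]

# Hypothetical keys as they might appear in a MoR table's partition path:
keys = [
    "hudi/orders/part=2023/abc123_1-0-1_20230101.parquet",
    "hudi/orders/part=2023/.abc123_20230102.log.1_0-1-0",
    "hudi/orders/.hoodie/20230102.deltacommit",
]
print(find_delta_log_keys(keys))
# (In practice you would collect `keys` with boto3's
#  s3.list_objects_v2(Bucket=..., Prefix=...) and page through the results.)
```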
The query should now succeed; however, you might see the deleted records reappear in the Athena query results. This happens because Athena cannot read the delta commits: the delta commits are stored in the Avro file, which keeps track of the deleted records and prevents those deleted values from being returned by a query.
DELTA_COMMIT - A delta commit refers to an atomic write of a batch of records into a MergeOnRead type table, where some or all of the data could be written just to delta logs.
But with an Avro file in the Hudi dataset, it is not possible to query it in Athena, because Athena expects compacted data from the source and only supports a table in a single format (i.e. all files in the dataset should be Parquet). With Merge on Read tables, data is stored using a combination of columnar (Parquet) and row-based (Avro) formats. All updates are logged to row-based delta files and are compacted as needed to create new versions of the columnar files [1]. This is why Avro files get added to your bucket after a delete.
With Merge on Read, you write only the updated rows, not whole files as with Copy on Write (CoW). This is why Merge on Read is helpful for use cases that require more writes, or update/delete-heavy workloads, with fewer reads. Delta commits are written to disk as Avro records (row-based storage), and compacted data is written as Parquet files (columnar storage). To avoid creating too many delta files, Hudi automatically compacts your dataset so that your reads are as performant as possible [2].
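You can also make that compaction more eager, so delta logs are merged into Parquet base files sooner and Athena sees compacted data. A minimal sketch of the relevant Hudi writer options follows; the option names match Apache Hudi's documented inline-compaction configs, but the table name and the commit threshold are assumptions for illustration:

```python
# Hudi writer options enabling inline compaction on a MoR table, so Avro
# delta logs are merged into Parquet base files after a small number of
# delta commits. Values are illustrative; tune for your workload.
hudi_options = {
    "hoodie.table.name": "orders",                   # hypothetical table name
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.compact.inline": "true",                 # compact as part of writes
    "hoodie.compact.inline.max.delta.commits": "2",  # compact every 2 delta commits
}

# With a SparkSession this would typically be applied as:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
print(hudi_options["hoodie.compact.inline"])
```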
A MoR table type is typically suited for write-heavy or change-heavy workloads with fewer reads.
Apache Hudi provides three logical views for accessing data:
- Read-optimized – Provides the latest committed dataset from CoW tables and the latest compacted dataset from MoR tables.
- Incremental – Provides a change stream between two actions out of a CoW dataset to feed downstream jobs and extract, transform, load (ETL) workflows.
- Real-time – Provides the latest committed data from a MoR table by merging the columnar and row-based files inline.
As of this writing, Athena supports the read-optimized and real-time views but not the incremental view. On MoR tables, all data exposed to read-optimized queries is compacted. This provides good performance but does not include the latest delta commits. The snapshot (real-time) queries, however, contain the freshest data [2].
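In practice, a MoR table synced to the Glue catalog is typically exposed as two tables: "<name>_ro" (read-optimized, compacted data only) and "<name>_rt" (real-time/snapshot, including delta commits). The helper below just builds the corresponding Athena query strings; the database and table names are hypothetical, and the `_ro`/`_rt` suffix convention is assumed from the standard Hudi/Glue sync behavior:

```python
def view_query(database, table, view="ro"):
    """Build an Athena query string against the chosen Hudi view.

    view="ro" -> read-optimized table (compacted Parquet only)
    view="rt" -> real-time table (merges Avro delta logs inline)
    """
    if view not in ("ro", "rt"):
        raise ValueError("view must be 'ro' or 'rt'")
    return f'SELECT * FROM "{database}"."{table}_{view}" LIMIT 10'

print(view_query("sales", "orders", "ro"))
print(view_query("sales", "orders", "rt"))  # freshest data, deletes applied
# (Submit with boto3's athena.start_query_execution(QueryString=..., ...).)
```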
So, it is recommended to compact the data at the source before querying it with Athena.
This should help explain why we might get the deleted records when running the query again: Athena is unable to read the delta commits. Please refer to these articles for more insights: [1], [2].
