I have a Kinesis Firehose that sends data through a Lambda transform, converts it to parquet format using the schema in a Glue table, and delivers it to S3, where it is read by Athena/QuickSight. The data is JSON documents representing mobile app events sent to Kinesis via Pinpoint. This part of the system works fine.
I also have about 1 GB of JSON documents on disk, representing event data from before this Firehose was set up. I need to get this historical data into the same format and location as the current data, so it can all be queried together in QuickSight.
What I would like to do is just feed this data to the existing Glue table, because it has the correct schema and output location set up. That doesn't seem to be possible, though, so I'm trying to use a Glue ETL job. The job consists of three nodes:
- S3 data source. The schema is inferred from the input. The input has fields we don't want, and is also missing some fields that we do want and that are specified in our Glue schema. The input fields are all strings; should we be doing type conversion at this stage?
- Transform Apply Mapping. This stage appears to do the type conversion, and drops the fields we don't want. It can't add new fields, however.
- S3 data target. This appears to allow only the parquet conversion, with no type conversion or schema manipulation. (I've sketched a script version of these three nodes below.)
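To make the question concrete, here is my rough understanding of what those three nodes amount to as a PySpark Glue script. This is only a sketch of my mental model, not working code; the bucket paths and the field names `event_type` and `session_duration` are placeholders, not our real schema.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Node 1: S3 data source -- schema inferred from the JSON, every field comes in as a string.
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/historical-events/"]},  # placeholder path
    format="json",
)

# Node 2: Apply Mapping -- casts types and implicitly drops any field not listed here.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("event_type", "string", "event_type", "string"),
        ("session_duration", "string", "session_duration", "int"),  # the string -> int cast
    ],
)

# Node 3: S3 data target -- parquet output only, no further schema manipulation.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/converted-events/"},  # placeholder path
    format="glueparquet",
)
job.commit()
```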
This is failing on two points:
- A string-to-integer conversion. I don't know which field, but I get "Unsupported case of DataType: com.amazonaws.services.glue.schema.types.IntegerType@1c4e8366 and DynamicNode: stringnode". As I said, the JSON data is all strings, and I don't know which stage should be responsible for type conversion. I have run the raw data through a Python script that converts all the integer-type fields to integers with no problem, though I suppose Java might have different requirements. I believe the Apply Mapping transform should be doing this, but I don't know why it's failing. (I've put a sketch of the explicit cast I have in mind after this list.)
- The bigger problem is that the resulting parquet files (the ones that survive conversion) cause errors in QuickSight. It just says it failed to read the parquet files. I assume this is because the QuickSight dataset expects the full set of fields specified in the Glue table, but I don't have any way of adding those fields during the ETL job. Of course, it could be any other error entirely; there's zero information to go on.
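For the cast error, this is where I assume the conversion is supposed to happen; the only alternative I can think of is an explicit ResolveChoice before the mapping. This continues the sketch above, with the same made-up `session_duration` field name, and I don't know whether it would actually avoid the error.

```python
from awsglue.transforms import ResolveChoice

# Force the string -> int conversion up front, instead of relying on the
# ApplyMapping node itself to cast. Continues from the `source` frame above;
# "session_duration" is a made-up field name.
resolved = ResolveChoice.apply(
    frame=source,
    specs=[("session_duration", "cast:int")],
)
```

For the missing fields, the only workaround I can picture is dropping down to a Spark DataFrame, adding null columns, and converting back, again continuing the sketch above (`push_token` is a made-up field name). I have no idea whether this is the intended way to do it.

```python
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

# Add a column the Glue table expects but the historical JSON never had,
# filled with nulls, then convert back to a DynamicFrame for the target node.
df = mapped.toDF().withColumn("push_token", lit(None).cast(StringType()))
with_missing = DynamicFrame.fromDF(df, glue_context, "with_missing")
```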
What I'd really like to do is just use our existing Glue table as the schema for ingesting and converting this JSON. Is that possible at all? Maybe using an ETL script instead of the visual editor?
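What I'm imagining is something along these lines, writing straight into the catalog table so its schema and output location are reused. The database and table names are placeholders, and I don't know whether `from_catalog` actually enforces the table's schema or just picks up its location and format.

```python
# Write directly against the existing Glue catalog table instead of a raw S3 path.
# "analytics" / "app_events" stand in for our real database and table names.
glue_context.write_dynamic_frame.from_catalog(
    frame=with_missing,
    database="analytics",
    table_name="app_events",
)
job.commit()
```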
Any other possible solutions would be very welcome, too!
The ETL job writes compressed parquet files by default, but the current Firehose writes uncompressed files. Could that mismatch cause Athena/QuickSight to fail to read the converted parquet files?
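If the compression mismatch is the culprit, my understanding is that it can be set explicitly on the target, something like the following (placeholder path again), though I haven't confirmed which values are accepted or whether Athena cares either way.

```python
# Write uncompressed parquet to match what the Firehose currently produces.
glue_context.write_dynamic_frame.from_options(
    frame=with_missing,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/converted-events/"},
    format="glueparquet",
    format_options={"compression": "uncompressed"},
)
```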