I'm getting errors from a variety of attempts to read my DynamoDB backup stored in S3.
The data looks like this:
{"SensorId":{"s":"sensor-10"},"Temperature":{"n":"63.85"},"CurrentTime":{"n":"1661540734"}}
{"SensorId":{"s":"sensor-10"},"Temperature":{"n":"63.84"},"CurrentTime":{"n":"1661540736"}}
I tried the following CREATE TABLE, which succeeds, but the subsequent query fails:
CREATE EXTERNAL TABLE IF NOT EXISTS sensor_data_presto (
  sensorID struct<s:string>,
  temperature struct<n:float>,
  currentTime struct<n:bigint>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket-name/';
The following query gets this error:
SELECT DISTINCT sensorID.s
FROM sensor_data_presto;
Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 39 [character 40 line 1]
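What confuses me is that each backup line parses as valid JSON on its own when I check it locally (a quick sanity check in Python, assuming the file contains exactly the lines shown above):

```python
import json

# One sample line from the S3 backup, pasted verbatim
line = '{"SensorId":{"s":"sensor-10"},"Temperature":{"n":"63.85"},"CurrentTime":{"n":"1661540734"}}'

obj = json.loads(line)  # would raise a ValueError if the line were malformed
print(obj["SensorId"]["s"])     # sensor-10
print(obj["Temperature"]["n"])  # 63.85
```

So the SerDe error at character 40 seems to come from how Athena reads the file, not from the JSON text itself.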
I also tried the following, which likewise succeeds, but its query fails too:
CREATE EXTERNAL TABLE IF NOT EXISTS `mydatabase`.`sensor_data_regex` (
  sensorid string,
  temperature float,
  currenttime bigint
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex'='/^\{"SensorId":\{"s":("[a-z]*-[0-9]*"){1}\},"Temperature":\{"n":"([0-9]*[\.]?[0-9]*)"\}{1},"CurrentTime":\{"n":"([0-9]*){1}"\}\}$/g'
) LOCATION 's3://bucket-name/'
TBLPROPERTIES ('has_encrypted_data'='false');
I tried the above with and without the leading / and trailing /g, and with and without the final $.
The following query gets this error:
SELECT * FROM sensor_data_regex;
GENERIC_INTERNAL_ERROR: PatternSyntaxException thrown initializing deserializer org.apache.hadoop.hive.serde2.RegexSerDe. Cause: Illegal repetition near index 1 /^{"SensorId":{"s":("[a-z]*-[0-9]*"){1}},"Temperature":{"n":"([0-9]*[.]?[0-9]*)"}{1},"CurrentTime":{"n":"([0-9]*){1}"}}$/g ^
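To narrow this down, I rebuilt what I believe is the equivalent pattern without the leading / and trailing /g (those are JavaScript regex-literal delimiters, not part of the pattern itself) and tested it against a sample line in Python. Java and Python regex syntax differ in places, but for the character classes and escapes used here the behavior should match; this version does match the data:

```python
import re

# Same pattern minus the /.../g delimiters; capture groups pull out the
# three field values (quotes excluded from the first group for clarity)
pattern = re.compile(
    r'^\{"SensorId":\{"s":"([a-z]*-[0-9]*)"\},'
    r'"Temperature":\{"n":"([0-9]*\.?[0-9]*)"\},'
    r'"CurrentTime":\{"n":"([0-9]*)"\}\}$'
)

line = '{"SensorId":{"s":"sensor-10"},"Temperature":{"n":"63.85"},"CurrentTime":{"n":"1661540734"}}'
m = pattern.match(line)
print(m.groups())  # ('sensor-10', '63.85', '1661540734')
```

Given that, I suspect the PatternSyntaxException comes from the delimiters and/or from backslash handling in the DDL string literal (the error message shows the pattern with the backslashes already stripped), but I haven't found the combination that Athena accepts.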
I was trying to replicate the example in Chapter 7: DynamoDB from the book Seven Databases in Seven Weeks, 2nd Edition for use in my college class. I was able to update the entire pipeline to work on AWS until this final step, which uses Athena. I would appreciate advice on correcting these errors, or on completing the pipeline some other (cost-effective) way, so that I can query the DynamoDB data.