Query DynamoDB Backup to S3 with Athena

0

I'm getting errors for a variety of attempts to read my S3 backup for DynamoDB.

The data looks like this:

{"SensorId":{"s":"sensor-10"},"Temperature":{"n":"63.85"},"CurrentTime":{"n":"1661540734"}}
{"SensorId":{"s":"sensor-10"},"Temperature":{"n":"63.84"},"CurrentTime":{"n":"1661540736"}}

I tried the following which succeeds, but then the query fails:

CREATE EXTERNAL TABLE IF NOT EXISTS sensor_data_presto (
    sensorID    struct<s:string>,
    temperature struct<n:float>,
    currentTime struct<n:bigint>
  )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket-name/';

The following query, gets the following error:

SELECT DISTINCT sensorID.s
FROM sensor_data_presto;

Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 39 [character 40 line 1]

I also tried the following which succeeds, but then the query fails:

CREATE EXTERNAL TABLE IF NOT EXISTS `mydatabase`.`sensor_data_regex` (
  sensorid string,
  temperature float,
  currenttime bigint
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES (
 'input.regex'='/^\{"SensorId":\{"s":("[a-z]*-[0-9]*"){1}\},"Temperature":\{"n":"([0-9]*[\.]?[0-9]*)"\}{1},"CurrentTime":\{"n":"([0-9]*){1}"\}\}$/g'
) LOCATION 's3://bucket-name/'
TBLPROPERTIES ('has_encrypted_data'='false');

I tried the above with and without the beginning / and ending /g. I tried it with and without the ending $.

The following query gets the following errors:

SELECT * FROM sensor_data_regex;

GENERIC_INTERNAL_ERROR: PatternSyntaxException thrown initializing deserializer org.apache.hadoop.hive.serde2.RegexSerDe. Cause: Illegal repetition near index 1 /^{"SensorId":{"s":("[a-z]*-[0-9]*"){1}},"Temperature":{"n":"([0-9]*[.]?[0-9]*)"}{1},"CurrentTime":{"n":"([0-9]*){1}"}}$/g ^

I was trying to replicate the example in Chapter 7: DynamoDB from the book Seven Databases in Seven Weeks, 2nd Edition for use in my college class. I was able to update the entire pipeline to work on AWS until the final step that uses Athena. I would appreciate advice on correcting these errors or completing the pipeline (cost effectively) to query the DynamoDB data.

asked 2 years ago929 views
1 Answer
0
Accepted Answer

The major problem in getting a working solution turned out to be that I forgot to delete the _SUCCESS and manifest files in the S3 bucket. These files were from the EMR job that transferred the DynamoDB data to S3. After this change, the first query above on the sensor_data_presto table works. Even after deleting the two files, the regular expression based approach in the sensor_data_regex table did not work. The working CREATE and SELECT commands are repeated below for clarity:

CREATE EXTERNAL TABLE IF NOT EXISTS sensor_data_presto (
    sensorID    struct<s:string>,
    temperature struct<n:float>,
    currentTime struct<n:bigint>
  )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket-name/';

SELECT DISTINCT sensorID.s
FROM sensor_data_presto;
answered 2 years ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions