Query DynamoDB Backup to S3 with Athena

I'm getting errors for a variety of attempts to read my S3 backup for DynamoDB.

The data looks like this:

{"SensorId":{"s":"sensor-10"},"Temperature":{"n":"63.85"},"CurrentTime":{"n":"1661540734"}}
{"SensorId":{"s":"sensor-10"},"Temperature":{"n":"63.84"},"CurrentTime":{"n":"1661540736"}}
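Each line is a standard DynamoDB attribute-value document: the inner key (`s` or `n`) names the DynamoDB type, and numeric values arrive as strings. As a minimal sketch (outside Athena), reading one line in Python looks like this:

```python
import json

# One line of the S3 export, in DynamoDB attribute-value form.
line = '{"SensorId":{"s":"sensor-10"},"Temperature":{"n":"63.85"},"CurrentTime":{"n":"1661540734"}}'
item = json.loads(line)

sensor_id = item["SensorId"]["s"]              # analogous to sensorID.s in Athena
temperature = float(item["Temperature"]["n"])  # DynamoDB numbers are stored as strings
current_time = int(item["CurrentTime"]["n"])
print(sensor_id, temperature, current_time)    # sensor-10 63.85 1661540734
```

This nested shape is why the Athena DDL below models each column as a `struct` (`struct<s:string>`, `struct<n:float>`) rather than a flat scalar.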

I tried the following, which succeeds, but the query then fails:

CREATE EXTERNAL TABLE IF NOT EXISTS sensor_data_presto (
    sensorID    struct<s:string>,
    temperature struct<n:float>,
    currentTime struct<n:bigint>
  )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket-name/';

The following query fails with this error:

SELECT DISTINCT sensorID.s
FROM sensor_data_presto;

Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 39 [character 40 line 1]

I also tried the following, which succeeds, but the query again fails:

CREATE EXTERNAL TABLE IF NOT EXISTS `mydatabase`.`sensor_data_regex` (
  sensorid string,
  temperature float,
  currenttime bigint
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES (
 'input.regex'='/^\{"SensorId":\{"s":("[a-z]*-[0-9]*"){1}\},"Temperature":\{"n":"([0-9]*[\.]?[0-9]*)"\}{1},"CurrentTime":\{"n":"([0-9]*){1}"\}\}$/g'
) LOCATION 's3://bucket-name/'
TBLPROPERTIES ('has_encrypted_data'='false');

I tried the above with and without the leading / and trailing /g, and with and without the ending $.

The following query fails with this error:

SELECT * FROM sensor_data_regex;

GENERIC_INTERNAL_ERROR: PatternSyntaxException thrown initializing deserializer org.apache.hadoop.hive.serde2.RegexSerDe. Cause: Illegal repetition near index 1 /^{"SensorId":{"s":("[a-z]*-[0-9]*"){1}},"Temperature":{"n":"([0-9]*[.]?[0-9]*)"}{1},"CurrentTime":{"n":"([0-9]*){1}"}}$/g ^
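Two things appear to be wrong with that pattern. The `/.../g` wrapper is JavaScript regex-literal syntax; RegexSerDe compiles the string with Java's `java.util.regex`, which treats the delimiters as part of the pattern. And the quoted error shows the backslashes stripped (`^{"SensorId":{...`), so by the time the pattern reaches the regex engine the braces are unescaped and `{` starts an illegal repetition quantifier; in the Athena DDL the backslashes may need to be doubled (`\\{`) to survive the string literal. As a sketch, the corrected pattern (no delimiters, every literal brace escaped, and the quotes moved outside the first capture group so `sensorid` doesn't include them) matches the sample line, here checked with Python's `re` as a stand-in for Java's engine:

```python
import re

# Java-style pattern for RegexSerDe: no /.../g delimiters (that is
# JavaScript syntax), and every literal { } escaped so it is not read
# as a repetition quantifier.
pattern = (r'^\{"SensorId":\{"s":"([a-z]*-[0-9]*)"\},'
           r'"Temperature":\{"n":"([0-9]*\.?[0-9]*)"\},'
           r'"CurrentTime":\{"n":"([0-9]*)"\}\}$')

line = '{"SensorId":{"s":"sensor-10"},"Temperature":{"n":"63.85"},"CurrentTime":{"n":"1661540734"}}'
m = re.match(pattern, line)
print(m.groups())  # ('sensor-10', '63.85', '1661540734')
```

One capture group per table column is what RegexSerDe expects, so the three groups would map to `sensorid`, `temperature`, and `currenttime` in order.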

I was trying to replicate the example in Chapter 7: DynamoDB from the book Seven Databases in Seven Weeks, 2nd Edition for use in my college class. I was able to update the entire pipeline to work on AWS until the final step that uses Athena. I would appreciate advice on correcting these errors or completing the pipeline (cost effectively) to query the DynamoDB data.

Asked 2 years ago · 787 views
1 answer

Accepted Answer

The major problem in getting a working solution turned out to be that I forgot to delete the _SUCCESS and manifest files in the S3 bucket. These files were from the EMR job that transferred the DynamoDB data to S3. After this change, the first query above on the sensor_data_presto table works. Even after deleting the two files, the regular expression based approach in the sensor_data_regex table did not work. The working CREATE and SELECT commands are repeated below for clarity:
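A sketch of that cleanup, assuming the export sits directly under `s3://bucket-name/` and the EMR job left its marker files at the top of that prefix (Athena scans every object under LOCATION, so any non-data file will be fed to the SerDe):

```shell
# Remove the EMR job's marker files so only data files remain
# under the table's LOCATION. Bucket name and file names here are
# placeholders matching the DDL above; adjust to your actual prefix.
aws s3 rm s3://bucket-name/_SUCCESS
aws s3 rm s3://bucket-name/manifest
```

An alternative that avoids deleting anything is to point LOCATION at a prefix that contains only the data files.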

CREATE EXTERNAL TABLE IF NOT EXISTS sensor_data_presto (
    sensorID    struct<s:string>,
    temperature struct<n:float>,
    currentTime struct<n:bigint>
  )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket-name/';

SELECT DISTINCT sensorID.s
FROM sensor_data_presto;
answered 2 years ago
