DMS MSSQL Server to S3 in Parquet format datetime to timestamp issue

0

Hello all,
I recently extracted a database schema from MSSQL Server to S3 in Parquet format via DMS. All of the data looks great, with the exception of any column that was a datetime data type in MSSQL Server.
I ran a Glue Crawler on the output and it correctly identified the column names and data types, specifically identifying the datetime columns as timestamp. The issue I have is that when I run a query in Athena on the DMS extracted data and look at a timestamp column I am getting +49311-02-25 00:00:00.000, when I should see 2017-05-05 00:00:00.
For the same table I did a manual extract via Spark and output it to Parquet and stored it in S3. I then created the table with an identical definition, save for the file location. When I queried the manually extracted table in Athena it displayed the timestamp column correctly.

Any ideas what might have happened? Any suggestions for how to configure my DMS S3 Endpoint to generate the correct timestamp?

Thanks,
Kyle

asked 5 years ago1477 views
4 Answers
0

I have some updated information. I was able to get the schema of the DMS generated parquet and it has this for the column that will be a timestamp

optional int64 PROCESS_DATE (TIMESTAMP_MICROS);

Here is the schema for the pyspark 2.3.2 generated parquet file using defaults for this version

optional int96 PROCESS_DATE;

Also, it was brought to my attention that if you take the int64 value from the DMS parquet, eg PROCESS_DATE = 1493942400000000, and translate as a timestamp in nanoseconds it comes out to 2017-05-05. Plus it appears that is not associated to MSSQL Server, as I test extracting a MySQL table with a datetime column and it did the exact same thing.

answered 5 years ago
0

After further research, I discovered that I was incorrect in assuming that the actual value was in nanoseconds, it is actually in microseconds just like the Parquet Schema states. I no longer think this is a DMS issue and believe it is an issue with Athena not being able to read the value correctly. Using https://www.epochconverter.com/ with the 1493942400000000 value it comes back with 2017-05-05 and identifies it as a microsecond time value. If you remove the last three zeros from it and run it again, you get the same result and it identifies it as milliseconds.

However if you run this in Athena SELECT from_unixtime(1493942400000) you get the following response

+49311-02-25 00:00:00.000

Oh, it looks like from_unixtime is expecting the value to be in seconds.

SELECT from_unixtime(1493942400)
2017-05-05 00:00:00.000	

So I guess the hacky solution would be to set the column to BIGINT, then select timestamp column in Athena like so from_unixtime(col1/1000/1000).

answered 5 years ago
0

For me adding parquetTimestampInMillisecond=true to the extra connection attributes of the target endpoint fixed this

answered 5 years ago
0

equationcrunchor wrote:
For me adding parquetTimestampInMillisecond=true to the extra connection attributes of the target endpoint fixed this
Thank you @equationcrunchor - this worked for me extracting from SQL Server 2016

answered 4 years ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions