
In a Glue ETL job I need to convert a column from string to a type that encodes date and time. I have the option to convert to timestamp or date, but I'm questioning the absence of datetime, which I would prefer.


Coming from the world of Python/Pandas, it seems odd to me that datetime is not the default. I associate the timestamp data type with hard-science research where precision matters. I am dealing with data where seconds do not matter but minutes within the hour do. I am hesitant to switch to timestamp columns because of the fixed end date in 2035, among other things. I believe I can create custom transformations in PySpark, but from my cursory research the preference there is also for timestamp.

asked 10 months ago · 548 views
2 Answers

In AWS Glue ETL, you're correct that the primary options for converting a string column to a date and time format are typically timestamp or date. While it may seem unusual coming from a Python/Pandas background where datetime is common, AWS Glue follows conventions more aligned with Apache Spark.

For your specific use case where you need to represent date and time but don't require second-level precision, you have a few options:

  1. Use the timestamp type: Despite your concerns, the timestamp type in Glue (which is based on Spark) is quite versatile. Spark stores timestamps as 64-bit microsecond counts, so it is not subject to the 32-bit Unix "year 2038" problem and can represent dates well beyond 2035.

  2. Use a string format: You can convert your string to a formatted timestamp string that includes only the precision you need. This can be done using the "Format timestamp" transformation in Glue.

  3. Use separate date and time columns: If you prefer, you could split your data into separate date and time columns.

If you decide to use the timestamp type, you can control the precision when you format it for output. For example, you can format a timestamp to show only up to minutes using a pattern like "yyyy-MM-dd HH:mm".

Remember, even if you use a timestamp internally, you have control over how it's formatted when you write the data out. You can use the "Format timestamp" transformation to convert a timestamp column to a formatted string based on your specific needs.

If you still prefer to work with a datetime-like object, you might consider creating a custom UDF (User Defined Function) in PySpark. However, for most use cases, the built-in timestamp type should be sufficient and is well-integrated with Glue's transformations and optimizations.

Ultimately, while it might feel different from the Python/Pandas approach, using timestamp in Glue/Spark is a standard practice and should meet your needs for representing date and time data, including cases where you need minute-level precision without seconds.

Sources
Convert a column to timestamp type - AWS Glue
Convert a timestamp column to a formatted string - AWS Glue

answered 10 months ago
reviewed 10 months ago

That was a detailed answer that put my concerns at ease. Thank you.

answered 10 months ago
