How to use Glue bookmark to track last processed row using timestamp per grouping column

1

I have a source dataset that I need to load into a Glue Job incrementally. This dataset contains three columns: 'device_id', 'timestamp' and 'reading'.

Because different devices may send readings at different times (and this may include data with timestamps that are EARLIER than the latest timestamps from other devices), I can't simply use the ‘timestamp’ column as the bookmark key. There’s a strong chance that other devices (identified by their ‘device_id’) may have records with older timestamps that need processing.

What I want, therefore, is to configure a Glue bookmark to track the last processed timestamp PER device_id, rather than the last processed timestamp ACROSS ALL device_ids. Is this possible with Glue bookmarks, or do I need to consider an alternative?

cgddrd
Asked 2 years ago, 2539 views
2 Answers
2

AWS Glue uses one or more columns as bookmark keys to determine which data is new and which has already been processed. However, bookmark keys do not work the way you intend to use them.

  • For a single column used as a bookmark key, Glue treats the values as unique IDs and reads all IDs greater than the last value.
  • For multiple columns listed as bookmark keys, Glue identifies the last value from both columns. The docs do not explain in detail how this works, but in my test it did not pick up all cases where there were new IDs and timestamps.

You can specify jobBookmarkKeys and jobBookmarkKeysSortOrder in the following ways:

create_dynamic_frame.from_catalog — Use additional_options.

create_dynamic_frame.from_options — Use connection_options.

Use the example below when using from_catalog:

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "devices", table_name = "device_reading",
    transformation_ctx = "datasource0",
    additional_options = {
        "jobBookmarkKeys": ["device_id","timestamp"],
        "jobBookmarkKeysSortOrder": "asc"
    }
)
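
For from_options, the same two keys go inside connection_options instead. The following is only a sketch for a JDBC source: the URL, table name, and credentials are placeholders, the helper names bookmark_connection_options and read_with_bookmark are made up for illustration, and glue_context is assumed to be the GlueContext your job script already creates.

```python
def bookmark_connection_options(url, dbtable, user, password, keys, sort_order="asc"):
    """Build a connection_options dict that carries the bookmark settings."""
    return {
        "url": url,
        "dbtable": dbtable,
        "user": user,
        "password": password,
        "jobBookmarkKeys": list(keys),
        "jobBookmarkKeysSortOrder": sort_order,
    }


def read_with_bookmark(glue_context, options):
    """Read a DynamicFrame from a JDBC source; only runs inside a Glue job."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="postgresql",          # assumption: a PostgreSQL source
        connection_options=options,
        transformation_ctx="datasource0",      # bookmarks need a transformation_ctx
    )


# Placeholder values, not taken from the question.
options = bookmark_connection_options(
    url="jdbc:postgresql://example-host:5432/devices",
    dbtable="device_reading",
    user="example_user",
    password="example_password",
    keys=["device_id", "timestamp"],
)
```

Inside the job you would then call read_with_bookmark(glueContext, options). Bookmarks also have to be enabled on the job itself for these keys to have any effect.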

For more information, see https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
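
If bookmarks do not give you per-device tracking, one alternative is to keep your own high-water mark per device_id and filter against it in the job. The sketch below shows only the filtering logic in plain Python and is not a Glue feature: select_new_rows is a hypothetical helper, and where you persist the watermarks between runs (a small table, a file in S3, etc.) is left as an assumption. In a Spark job the same idea would typically be a join between the source and the watermark table.

```python
from datetime import datetime


def select_new_rows(rows, watermarks):
    """Keep rows newer than the stored watermark of their own device.

    rows: list of dicts with 'device_id', 'timestamp', 'reading'
    watermarks: dict mapping device_id -> last processed timestamp
    Returns the new rows and the updated watermarks for the next run.
    """
    new_rows = []
    updated = dict(watermarks)
    for row in rows:
        device = row["device_id"]
        last_seen = watermarks.get(device)  # watermark from the previous run
        if last_seen is None or row["timestamp"] > last_seen:
            new_rows.append(row)
            # Advance this device's watermark to the newest timestamp seen.
            if device not in updated or row["timestamp"] > updated[device]:
                updated[device] = row["timestamp"]
    return new_rows, updated


# Device "a" was processed up to 11:30; device "b" has never been seen.
watermarks = {"a": datetime(2023, 1, 1, 11, 30)}
rows = [
    {"device_id": "a", "timestamp": datetime(2023, 1, 1, 12, 0), "reading": 1.0},
    {"device_id": "b", "timestamp": datetime(2023, 1, 1, 9, 0), "reading": 2.0},
    {"device_id": "a", "timestamp": datetime(2023, 1, 1, 11, 0), "reading": 0.5},
]

new_rows, updated = select_new_rows(rows, watermarks)
```

Here the 09:00 row from device "b" is kept even though it is older than the latest timestamp from device "a", which is the case a single timestamp bookmark would miss. The 11:00 row from "a" is skipped because it is not newer than that device's own watermark.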

AWS
Answered 2 years ago
Reviewed by an AWS Expert 2 years ago
0

Glue job bookmarks work as follows:

  • For Amazon S3 data sources, AWS Glue job bookmarks check the last modified time of the objects to determine which objects need to be reprocessed.

  • For JDBC data stores, you can specify the column names to use as bookmark keys. By default the primary key is used, but each bookmark key must be either increasing or decreasing, with no gaps.

So there is no issue with S3 data sources. If it is a JDBC data source, however, you have to use compound keys, because your timestamp column is not contiguous, or use a single column that has contiguous data.

additional_options = {
    "jobBookmarkKeys": ["device_id","timestamp"],
    "jobBookmarkKeysSortOrder": "asc"
}
Shubh
AWS Support Engineer
Answered 2 years ago
Reviewed by an AWS Expert 2 years ago
