1 Answer
You are correct!
If you delete and recreate a table in AWS Glue using a crawler, it can affect the job bookmark for ETL jobs that read from that table. Note that the bookmark itself, which tracks previously processed data, belongs to the job: it is deleted when you delete the job.
Moreover, if you reset a job bookmark, the next run of the job will process the entire dataset. This is because resetting a bookmark clears the state information that AWS Glue uses to track processed data.
Thanks. Can you help me with a follow-up question? Before resetting the bookmark, job A took about 10 s. I then reset the bookmark for job A (its source bucket has 20 objects in total, and the total bucket size is 9 GB), and the runtime is now about 30 s. I want to use this test to reason about another job. Job B's source bucket has 40 objects in total and a total size of 90 GB. I need to estimate the time and cost if the bookmark of job B is reset. How can I do that, and how does it compare with job A? Should I scale by the number of objects or by the bucket size? If the number of objects is what matters: job A has 20 objects, so job B, with twice as many, would take about 2 × 30 s = 60 s. If the bucket size is what matters: job A holds about 9 GB, so job B, with ten times the data, would take about 10 × 30 s = 300 s.
Well, it really depends on how the job is built. AWS Glue pricing is based on Data Processing Unit (DPU) hours, where a DPU is a measure of the compute resources consumed. The cost therefore depends on how long the job runs and how many DPUs it uses. If job B takes longer to run or uses more DPUs, it will cost more.
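To make the pricing relationship concrete, here is a minimal sketch of the arithmetic. The function name `glue_job_cost` is hypothetical, and the default rate of 0.44 USD per DPU-hour is an assumption for illustration only; check the current price for your region. It also ignores any minimum billing duration or rounding that may apply to a run.

```python
def glue_job_cost(runtime_seconds: float, dpus: float,
                  price_per_dpu_hour: float = 0.44) -> float:
    """Rough cost estimate in USD: DPUs x hours x price per DPU-hour.

    price_per_dpu_hour is an ASSUMED rate, not an authoritative one.
    Minimum billing durations and rounding are deliberately ignored.
    """
    dpu_hours = dpus * runtime_seconds / 3600.0
    return dpu_hours * price_per_dpu_hour


# Example: a 300 s run on 2 DPUs at the assumed rate.
estimated = glue_job_cost(runtime_seconds=300, dpus=2)
print(f"Estimated cost: ${estimated:.4f}")
```

With a fixed DPU count, the estimate grows linearly with runtime, so any runtime guess translates directly into a cost guess.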
I know that, but I ran one experiment, as I described in my comment, and now I want to generalize the result to other jobs and estimate how long they will take after resetting the bookmark. All of them use 2 DPUs, and I know how to estimate the cost once I know the runtime. What I don't know is how to estimate the runtime: I want to guess how long job B will take if I reset its bookmark, without actually resetting it.
You observed that resetting the bookmark for job A increased its runtime from 10 s to 30 s. If we assume that the runtime of a full reprocess is proportional to the number of objects, then job B (with twice the number of objects) would indeed take about 2 × 30 s = 60 s. On the other hand, if the runtime is proportional to the size of the bucket, then job B (with ten times the size) would take about 10 × 30 s = 300 s.
However, these are rough estimates and the actual runtime could be different. It’s also worth noting that the cost of running the job will depend on the runtime and the number of DPUs used. So, if the runtime increases, the cost will also increase proportionally.
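The two proportional estimates discussed above can be written as a short sketch. The helper `scale_runtime` and the dictionary layout are hypothetical names for illustration; the numbers are the ones from this thread. Pure linear scaling is an assumption here: it ignores fixed startup overhead and anything else that does not grow with the input.

```python
def scale_runtime(ref_runtime_s: float, ref_value: float,
                  target_value: float) -> float:
    """Scale a measured runtime linearly by the ratio target/reference.

    ASSUMPTION: runtime is directly proportional to the chosen metric
    (object count or total size), with no fixed overhead.
    """
    return ref_runtime_s * target_value / ref_value


# Measurements from the thread: job A after a bookmark reset.
job_a = {"runtime_s": 30.0, "objects": 20, "size_gb": 9.0}
# Known inputs for job B (its runtime is what we want to guess).
job_b = {"objects": 40, "size_gb": 90.0}

by_objects = scale_runtime(job_a["runtime_s"], job_a["objects"], job_b["objects"])
by_size = scale_runtime(job_a["runtime_s"], job_a["size_gb"], job_b["size_gb"])

print(f"Scaled by object count: {by_objects:.0f} s")
print(f"Scaled by total size:   {by_size:.0f} s")
```

The two results, 60 s and 300 s, can be read as a rough range for job B rather than as competing exact answers, since a single measurement on job A cannot tell which metric dominates.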
This is like finding the exact formula for gambling. There isn't one.
Thanks, but my question was: which one is the answer, 60 or 300? I know this is only an estimate, but I need to know whether I should scale by the number of objects or by the size.