Glue job bookmark


Hi everyone. I ran an experiment and found that in Glue, if we delete a table and re-create it with a crawler, it affects the Glue job bookmark (for ETL jobs); it is like resetting the bookmark. Am I correct? Question 2: if we reset the bookmark for a job, does that mean the next run will scan all the data? Thanks

gh02
asked 3 months ago · 225 views
1 Answer
Accepted Answer

You are correct!

If you delete and recreate a table in AWS Glue using a crawler, it can affect the job bookmark for ETL jobs. The job bookmark tracks previously processed data against the source table, so deleting and recreating that table invalidates the stored state and effectively resets the bookmark.

Moreover, if you reset a job bookmark, the next run of the job will process the entire dataset. This is because resetting a bookmark clears the state information that AWS Glue uses to track processed data.
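For reference, here is a minimal sketch of resetting a bookmark programmatically with boto3; the job name is hypothetical:

```python
import boto3

glue = boto3.client("glue")

# ResetJobBookmark clears the stored bookmark state, so the next run
# of the job (with bookmarks enabled) reprocesses the entire dataset.
glue.reset_job_bookmark(JobName="my-etl-job")  # hypothetical job name
```

The same reset is available in the console and via the aws glue reset-job-bookmark CLI command.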

EXPERT
answered 3 months ago
  • Thanks. Can you guide me with the following question? Before resetting the bookmark, job A took about 10 s. Then I reset the bookmark for job A (its source bucket has 20 objects in total and a total size of 9 GB), and now the runtime is about 30 s. I want to use this test result for the following: I have another job, say job B (its source bucket has 40 objects in total and a total size of 90 GB). I need to estimate the time and cost if job B's bookmark is reset. How can I do that, and how can I compare it with the previous result? Should I use the number of objects or the bucket size (using proportions)? If object count is the right basis: job A has 20 objects, so job B should take about 2 * 30 s = 60 s, since it has twice as many objects. But if bucket size is the right basis: job A is about 9 GB, so job B should take about 10 * 30 s = 300 s, since it is ten times larger.

  • Well, it really depends on how the job is structured. AWS Glue pricing is based on Data Processing Units (DPUs), which measure the compute resources consumed. The cost depends on how long the job runs and how many DPUs it uses; if job B runs longer or uses more DPUs, it will cost more.
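    As a rough sketch of the arithmetic, assuming the commonly cited rate of 0.44 USD per DPU-hour (check the Glue pricing page for your region, and note that recent Glue versions bill a 1-minute minimum):

    ```python
    # Naive Glue cost estimate: DPUs x runtime (hours) x price per DPU-hour.
    # 0.44 USD/DPU-hour is an assumption; verify the rate for your region.
    def estimate_cost(dpus, runtime_seconds, price_per_dpu_hour=0.44):
        return dpus * (runtime_seconds / 3600) * price_per_dpu_hour

    # Job A after the reset: 2 DPUs for ~30 s (before any billing minimum).
    print(round(estimate_cost(2, 30), 4))  # ~0.0073 USD
    ```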

  • I know that, but as I said in my comment, I ran one experiment, and now I want to generalize that information to other jobs and estimate how long they will take after resetting the bookmark. All of them use 2 DPUs, and I know how to estimate the cost once I know the runtime. What I don't know is how to estimate the runtime: without actually resetting job B's bookmark, I want to predict how long job B will take to run if I do reset it.

  • You observed that resetting the bookmark for job A increased its runtime from 10 s to 30 s. If we assume the increase in runtime is proportional to the number of objects, then job B (with twice the number of objects) would indeed take about 60 s. On the other hand, if the increase in runtime is proportional to the size of the bucket, then job B (with ten times the size) would take about 300 s (see the sketch at the end of this reply).

    However, these are rough estimates and the actual runtime could be different. It’s also worth noting that the cost of running the job will depend on the runtime and the number of DPUs used. So, if the runtime increases, the cost will also increase proportionally.

    This is like finding the exact formula for gambling. There isn't one.
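    Written out as a small sketch, using the numbers from your comments:

    ```python
    # Two naive scaling models for job B's post-reset runtime,
    # extrapolated from job A's observed 30 s run after its reset.
    job_a = {"objects": 20, "size_gb": 9, "runtime_s": 30}
    job_b = {"objects": 40, "size_gb": 90}

    # Model 1: runtime scales with the number of objects.
    by_objects = job_a["runtime_s"] * job_b["objects"] / job_a["objects"]  # 60.0 s

    # Model 2: runtime scales with the total data size.
    by_size = job_a["runtime_s"] * job_b["size_gb"] / job_a["size_gb"]     # 300.0 s

    print(by_objects, by_size)
    ```

    In practice both factors matter: total bytes usually dominate for large files, while many small objects add per-file overhead. Neither model is guaranteed, so a short test run of job B is the only reliable check.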

  • Thanks, but my question was: which one is the answer, 60 or 300? I know this is just an estimate, but I need to know whether I should consider the number of objects or the size.
