Glue bookmark and cost estimation


Hi everyone, I got stuck with something. If a job's bookmark is reset, all the data will be scanned again. How can I estimate the cost of that? Let me explain with the following question: before resetting the bookmark, job A took about 10 s. Then I reset the bookmark for job A (the source bucket of this job has 20 objects in total, and the total bucket size is 9 GB), and now the runtime is about 30 s. I want to use this information as follows: I have another job, say job B (the source bucket of this job has 40 objects and a total bucket size of 90 GB). I need to estimate the time and cost for this job if its bookmark is reset. How can I do that? How can I compare it with the previous one? Should I consider the number of objects or the bucket size?

If the object count is the right answer: job A has 20 objects, so I can estimate that job B would take 2 × 30 s = 60 s, since it has twice as many objects.

But if the bucket size is the right answer: job A's bucket is about 9 GB, so I can estimate that job B would take 10 × 30 s = 300 s, since it is ten times larger. Thanks!
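Here is a quick sketch of both scaling assumptions (only the hypothetical numbers above; real Glue runtimes also depend on DPU count and transform complexity, so this is a rough extrapolation, not a guarantee):

```python
# Back-of-the-envelope runtime scaling after a bookmark reset.
# All figures are the hypothetical numbers from the question above.

job_a_runtime_s = 30            # job A after its bookmark was reset
job_a_objects, job_a_gb = 20, 9
job_b_objects, job_b_gb = 40, 90

# Assumption 1: runtime scales with object count
est_by_objects = job_a_runtime_s * (job_b_objects / job_a_objects)

# Assumption 2: runtime scales with total data size
est_by_size = job_a_runtime_s * (job_b_gb / job_a_gb)

print(f"Estimate by object count: {est_by_objects:.0f} s")  # 60 s
print(f"Estimate by bucket size:  {est_by_size:.0f} s")     # 300 s
```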

gh02
asked 3 months ago · 206 views
1 Answer

Hi,

Since you already have a starting point, i.e. a job that runs with a bookmark and is configured with X DPUs to process Y records in S3 in Z minutes, I would use that baseline to extrapolate the cost of processing the full/un-bookmarked data.

E.g. the pricing as of today in us-east-1 is $0.44 per DPU-hour for each Apache Spark or Spark Streaming job, billed per second with a 1-minute minimum (Glue version 2.0 and later). So, if you have a job that needs 2 DPUs to process 10k records in 2 minutes, you will be charged as below:

Unit conversions:
Duration for which the Apache Spark ETL job runs: 2 minutes ≈ 0.0333 hours

Pricing calculation:
max(0.0333 hours, 0.0167 hours minimum billable duration) = 0.0333 hours (billable duration)
2 DPUs × 0.0333 hours × 0.44 USD per DPU-hour ≈ 0.03 USD (Apache Spark ETL job cost per run)
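As a rough sketch, that calculation in Python (the rate and 1-minute minimum are the us-east-1 figures quoted above; treat them as assumptions and verify against the current pricing page for your region):

```python
# Estimate AWS Glue ETL job cost from DPU count and runtime.
# Rate and 1-minute minimum follow the us-east-1 figures quoted above
# (Glue 2.0+, billed per second) -- an assumption, not current pricing.

DPU_HOUR_USD = 0.44
MIN_BILLABLE_HOURS = 1 / 60  # 1-minute minimum billable duration

def glue_job_cost(dpus: int, runtime_minutes: float) -> float:
    billable_hours = max(runtime_minutes / 60, MIN_BILLABLE_HOURS)
    return dpus * billable_hours * DPU_HOUR_USD

print(f"{glue_job_cost(2, 2):.4f} USD")  # 2 DPUs for 2 minutes -> ~0.0293 USD
```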

For pricing, please refer to the AWS Glue pricing page, and also check the AWS Pricing Calculator.

Note that this covers only the Glue ETL job itself. Independently of this, the objects you store in an S3 bucket incur charges for storage and access; refer to the S3 pricing page.

Thanks, Rama

AWS EXPERT
answered 3 months ago
EXPERT
reviewed 3 months ago
  • The complexity of the ETL matters here, since it influences the runtime and DPU utilisation. You can only compare Job A with itself (bookmark vs. no bookmark), and likewise Job B (bookmark vs. no bookmark). I suggest you do a PoC with a small volume and check; that would be the best way to come up with a larger estimate.

  • I meant: how can I estimate the cost of the next jobs? I updated my question, please read it again. Thanks.

  • Is there any way to set the bookmark manually, to skip scanning all the data?

  • I guess you could move the already-processed data to another bucket(?); see the sketch below.
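A hypothetical sketch of that workaround with boto3 (the bucket names and prefix are placeholders, not from this thread; copy_object, delete_object, and the list_objects_v2 paginator are standard S3 API calls). Note that moving objects changes the source data set, so make sure no other consumer still needs them in place:

```python
# Move already-processed objects to an archive bucket so that a re-run
# without a bookmark only scans what remains in the source bucket.
# Bucket names and the prefix are placeholders for illustration.
import boto3

s3 = boto3.client("s3")
SOURCE_BUCKET = "my-glue-source-bucket"      # hypothetical
ARCHIVE_BUCKET = "my-glue-processed-bucket"  # hypothetical

def archive_processed(prefix: str) -> None:
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Copy to the archive bucket, then remove from the source.
            s3.copy_object(
                Bucket=ARCHIVE_BUCKET,
                Key=key,
                CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
            )
            s3.delete_object(Bucket=SOURCE_BUCKET, Key=key)

archive_processed("input/2024/")  # hypothetical prefix
```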
