Hi,
Since you already have a starting point (a job that runs using a bookmark and is configured with X DPUs to process Y records in S3 in Z minutes), you can use that as a baseline and extrapolate the cost of processing the full, un-bookmarked dataset.
For example, the pricing as of today in us-east-1 is $0.44 per DPU-hour for each Apache Spark or Spark Streaming job, billed per second with a 1-minute minimum (Glue version 2.0 and later). So, if you have a job that needs 2 DPUs to process 10k records in 2 minutes, you will be charged as below:
Unit conversions:
- Duration of the Apache Spark ETL job: 2 minutes = 0.0333 hours

Pricing calculations:
- Billable duration: max(0.0333 hours, 0.0166 hours minimum) = 0.0333 hours
- 2 DPUs x 0.0333 hours x 0.44 USD per DPU-hour = 0.03 USD (Apache Spark ETL job cost)
- ETL jobs cost (monthly): 0.03 USD
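To extrapolate from that baseline, here is a minimal Python sketch. It assumes runtime (and therefore cost) scales roughly linearly with record count for the same job and DPU configuration, which is only a first approximation; the rate, record counts, and DPU figures are placeholders you should replace with your own job's metrics and your Region's current pricing.

```python
# Rough Glue ETL cost estimate, assuming cost scales roughly linearly
# with record count for the same job and DPU configuration.
DPU_HOUR_RATE_USD = 0.44      # us-east-1, Glue 2.0+ Spark jobs (check current pricing)
MIN_BILLABLE_HOURS = 1 / 60   # 1-minute billing minimum

def glue_job_cost(dpus: int, runtime_minutes: float) -> float:
    """Cost of a single job run, billed per second with a 1-minute minimum."""
    billable_hours = max(runtime_minutes / 60, MIN_BILLABLE_HOURS)
    return dpus * billable_hours * DPU_HOUR_RATE_USD

# Baseline from the bookmarked run: 2 DPUs, 10k records in 2 minutes.
baseline_cost = glue_job_cost(dpus=2, runtime_minutes=2)

# Extrapolate to the full, un-bookmarked dataset (e.g. 500k records, a
# placeholder figure), assuming runtime grows with the record count.
full_records, baseline_records = 500_000, 10_000
estimated_full_cost = glue_job_cost(dpus=2, runtime_minutes=2 * full_records / baseline_records)

print(f"Baseline run:  ~${baseline_cost:.2f}")
print(f"Full dataset:  ~${estimated_full_cost:.2f}")
```

In practice the scaling may not be perfectly linear (Spark overhead, partitioning, shuffle behaviour), which is another reason the small-volume PoC mentioned further down is worth doing.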
For pricing, please refer to this, and also check the pricing calculator.
Note, this is only for Glue ETL. Independently of this, the objects you store in an S3 bucket will incur charges for storage and access; refer here.
Thanks, Rama
The complexity of the ETL is important here, since it influences the runtime and DPU utilisation. You can only compare Job A with bookmark vs. without bookmark, and likewise Job B with bookmark vs. without bookmark. I suggest you do a PoC with a small volume and check; that would be the best way to come up with a larger estimate.
I meant: how can I estimate the cost of the next jobs? I have updated my question, please read it again. Thanks.
Is there any way to set the bookmark manually, so that it skips scanning all the data?
I guess you could move the already processed data to another bucket (?)
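If you go that route, here is a minimal boto3 sketch, assuming hypothetical bucket names and prefix. S3 has no native move operation, so it copies each already-processed object to an archive bucket and then deletes the original, so the next Glue run only scans what remains.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical names: adjust to your own buckets and prefix.
SOURCE_BUCKET = "my-glue-input-bucket"
ARCHIVE_BUCKET = "my-glue-processed-archive"
PREFIX = "raw/2023/"

# List the already-processed objects under the prefix (paginated).
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        # Copy to the archive bucket, then delete the original so the
        # Glue job no longer sees (or scans) this object.
        s3.copy_object(
            Bucket=ARCHIVE_BUCKET,
            Key=key,
            CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
        )
        s3.delete_object(Bucket=SOURCE_BUCKET, Key=key)
```

Note that copy_object is limited to objects up to 5 GB; for larger objects, the managed S3.Client.copy() transfer method handles the multipart copy for you.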