1 Answer
Hi there,
This approach may work for low volumes, with a few caveats.
- Macie's create-classification-job API has a very low TPS limit (one call per 10 seconds, with some bursting). If several uploads arrive in a short time, your calls will be throttled, and handling that throttling adds complexity to your code.
- S3 has no way of saying "Get me all objects tagged X". For Macie to determine which objects to classify based on a tag, it has to iterate over every object in the bucket and call S3's get-object-tagging API for each one. If your landing zone bucket is large and you rely on tags, this can cause a lot of extra S3 API calls and slow the job down.
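To give a feel for the extra code the throttling caveat implies, here is a minimal sketch of a retry wrapper with exponential backoff. `ThrottledError` and `call_with_backoff` are hypothetical names for illustration, not part of any AWS SDK; with boto3 you would instead catch the SDK's client error and check whether its error code indicates throttling. The 10-second base delay is an assumption taken from the limit mentioned above.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error returned by the service."""


def call_with_backoff(fn, max_attempts=5, base_delay=10.0, sleep=time.sleep):
    """Call fn(), retrying on ThrottledError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            # Give up once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            # Wait 10s, 20s, 40s, ... plus up to 1s of random jitter.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
```

You would wrap the job-creation call in a function and pass it as `fn`. Even with this in place, a burst of uploads still queues up behind the limit, which is part of why batching tends to be simpler.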
In general, Macie's jobs are relatively heavyweight, meaning they are optimized for running over large volumes of data and not real time / just-in-time data flows. Some suggestions:
- Run jobs less frequently based on SLA. For instance, if you have a processing SLA of 4 hours, you can call Macie every 4 hours and process all objects that landed in that time. This leverages Macie's batch efficiencies and will avoid throttling.
- Use prefixes or other filters besides tagging to identify which objects to classify. For instance, instead of tagging an object with "date:5/19/2023", which can incur the overhead described above, you can put objects under a 2023/05/19 prefix and scan only that prefix, as prefix filtering within S3 is extremely efficient.
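As a rough sketch of the prefix suggestion, the snippet below builds a request for a one-time job scoped to a single date prefix. `date_prefix` and `build_job_request` are hypothetical helpers, and the account ID and bucket name in the usage note are placeholders. The request shape follows my understanding of boto3's `macie2` `create_classification_job` parameters, so please check it against the current API reference before relying on it.

```python
import uuid
from datetime import date


def date_prefix(day: date) -> str:
    """Return a year/month/day key prefix such as '2023/05/19/'."""
    return day.strftime("%Y/%m/%d/")


def build_job_request(account_id: str, bucket: str, day: date) -> dict:
    """Build keyword arguments for a one-time job limited to one date prefix."""
    prefix = date_prefix(day)
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "jobType": "ONE_TIME",
        "name": f"scan-{bucket}-{day.isoformat()}",
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": [bucket]}
            ],
            # Only include objects whose key starts with the date prefix.
            "scoping": {
                "includes": {
                    "and": [
                        {
                            "simpleScopeTerm": {
                                "comparator": "STARTS_WITH",
                                "key": "OBJECT_KEY",
                                "values": [prefix],
                            }
                        }
                    ]
                }
            },
        },
    }
```

With boto3 installed and credentials configured, something like `boto3.client("macie2").create_classification_job(**build_job_request("111122223333", "my-landing-zone", date(2023, 5, 19)))` would submit it. For a 4-hour window, an hour-level prefix could be handled the same way.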
Hope that helps!
answered a year ago
Hi @alatech,
Thank you so much for the quick response. I understand that Macie has a very low TPS limit, but we expect at most 40-50 file transfer requests per day. We have a lifecycle rule on the S3 bucket that removes objects after 30 days, so at any given point the bucket will hold at most 1,500 files.
Macie does not have an hourly schedule option. The most frequent schedule is once a day, which does not meet our requirement. These files support critical use cases, so we cannot have delays in processing. Given all of Macie's limitations, I think we will have to consider alternative options.
Do you think it would be better to explore an alternative to the S3 bucket?
Thanks, Kiran