
How I Automatically Patched AWS Glue Jobs Using LLM-Based RCA Engine (Project Feedback Request)


Hi AWS Community 👋,

I recently built a fully automated root cause analysis (RCA) and patching engine for AWS Glue jobs using Lambda, DynamoDB, and Gemini (an LLM).

Here's what it does:

  • Detects Glue failures
  • Sends logs to Gemini for RCA
  • Extracts patch recommendations
  • Automatically updates Glue job config
  • Sends alerts via SNS
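
For readers who want a concrete picture, the steps above could be sketched roughly as follows. This is not the project's actual code. It assumes the Lambda is invoked by an EventBridge "Glue Job State Change" event, that `glue` and `sns` are boto3 clients passed in by the caller, and that `recommend` is a hypothetical stand-in for the Gemini call that returns a dict of suggested job settings. The allow-list and the set of stripped keys are also assumptions and may need adjusting for a given job type.

```python
import json

# Assumed safety allow-list: only these settings may be patched automatically.
PATCHABLE_FIELDS = {"Timeout", "MaxRetries", "NumberOfWorkers", "WorkerType"}
# Keys returned by get_job that should not be sent back in JobUpdate (assumed list).
READ_ONLY_FIELDS = {"Name", "CreatedOn", "LastModifiedOn", "AllocatedCapacity"}


def is_failed_run(event):
    """True for an EventBridge Glue job state change event that reports a failure."""
    return (
        event.get("source") == "aws.glue"
        and event.get("detail-type") == "Glue Job State Change"
        and event.get("detail", {}).get("state") in {"FAILED", "TIMEOUT"}
    )


def filter_patch(recommendation):
    """Drop anything the LLM suggested that is not on the allow-list."""
    return {k: v for k, v in recommendation.items() if k in PATCHABLE_FIELDS}


def build_job_update(current_job, patch):
    """Merge the patch into the full job definition.

    update_job replaces the whole definition, so the existing settings
    are carried over rather than sending the patch alone.
    """
    update = {k: v for k, v in current_job.items() if k not in READ_ONLY_FIELDS}
    update.update(patch)
    if "WorkerType" in update:
        # MaxCapacity cannot be combined with WorkerType/NumberOfWorkers.
        update.pop("MaxCapacity", None)
    return update


def handler(event, glue, sns, topic_arn, recommend):
    """Detect a failure, ask for a patch, apply it, and send an alert."""
    if not is_failed_run(event):
        return None
    detail = event["detail"]
    job_name = detail["jobName"]
    error = detail.get("message", "")
    patch = filter_patch(recommend(error))  # hypothetical LLM call
    if patch:
        job = glue.get_job(JobName=job_name)["Job"]
        glue.update_job(JobName=job_name, JobUpdate=build_job_update(job, patch))
    sns.publish(
        TopicArn=topic_arn,
        Subject=f"Glue job {job_name} failed"[:100],
        Message=json.dumps(
            {"job": job_name, "run": detail.get("jobRunId"), "error": error, "patch": patch}
        ),
    )
    return patch
```

Passing the clients and the `recommend` function in as arguments keeps the logic testable without AWS access. The allow-list is the main design choice here: it limits what an LLM suggestion can change on a production job, so a recommendation to alter the IAM role or script location is silently ignored.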

⚡ Before: RCA took ~45 mins manually
✅ Now: RCA + patch + alert = 10 seconds

Here's the full project with code + Medium article:
🔗 GitHub: https://github.com/ashishkesari18/Data-Engineering-Projects/tree/main/AWS%2BData%20Engineering/SiliconPrimeX
📝 Medium: https://medium.com/@ashishkesari018/siliconprimex-building-an-autonomous-self-healing-data-platform-on-aws-c7a73703795c

Would love feedback from the AWS Glue, Serverless, and Analytics communities!
How can I take this to the next level?

Thanks in advance 🙌
- Ashish Kesari

1 Answer
Accepted Answer

While your project sounds interesting, I should clarify some key aspects about AWS Glue to ensure your solution aligns with how the service actually works.

AWS Glue is a fully managed, serverless data integration service that provides capabilities for ETL (Extract, Transform, Load) operations. It automatically scales resources based on workload demands and includes built-in monitoring capabilities.

For your automated patching system, it's important to understand that AWS Glue already provides several monitoring tools:

  • AWS Glue job run insights
  • Integration with AWS CloudTrail
  • Apache Spark UI for Spark-backed jobs
  • CloudWatch metrics for Data Catalog usage

AWS Glue jobs can be triggered by schedules, events, or upon completion of other jobs. The service supports both Spark and Ray runtime environments, as well as Python shell jobs.
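
As a rough illustration of the trigger types mentioned above, the sketch below builds request dictionaries for the Glue `create_trigger` API. The helper function names, trigger names, and job names are hypothetical; the dictionary keys follow the boto3 `create_trigger` parameters.

```python
def conditional_trigger_request(name, upstream_job, downstream_job, state="SUCCEEDED"):
    """Request that starts downstream_job when upstream_job reaches `state`."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": upstream_job, "State": state}
            ]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


def scheduled_trigger_request(name, job, cron_expression):
    """Request that starts `job` on a schedule, e.g. 'cron(15 12 * * ? *)'."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron_expression,
        "Actions": [{"JobName": job}],
        "StartOnCreation": True,
    }


# Usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# glue = boto3.client("glue")
# glue.create_trigger(**conditional_trigger_request(
#     "run-remediation-on-failure", "nightly-etl", "remediation-job", state="FAILED"))
```

A conditional trigger on the `FAILED` state is one native way to start a follow-up job after a failure, which could complement or replace part of a custom Lambda-based detection step.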

For your project enhancement, consider leveraging AWS Glue's existing features:

  1. Use AWS Glue Studio's visual interface for easier job management and troubleshooting
  2. Implement event-based triggers to create more sophisticated workflows
  3. Utilize AWS Glue's built-in workflow capabilities for complex ETL activities
  4. Take advantage of AWS Glue's AI-powered capabilities for code generation and Spark troubleshooting

Your approach of using Lambda, DynamoDB, and an LLM for automated remediation is innovative, but make sure you're not duplicating capabilities that AWS Glue already provides natively.
Sources
Serverless Data Integration - AWS Glue - Amazon Web Services
What is AWS Glue? - AWS Glue
Building visual ETL jobs - AWS Glue

answered a year ago
