Greengrass: How to detect & auto-heal deployment drift after SD-card fail-over?

Summary

We run a production fleet of Greengrass v2 core devices that boot from a primary SD card but automatically fail over to a minimal secondary SD card when the primary is damaged, causing deployment drift that must be detected and healed.

Scenario

Each device has two SD cards:

  • Primary SD – boot + full thing-group deployment (application components).
  • Secondary SD – minimal factory image used only if the primary card fails.

The factory image includes only:

  • aws.greengrass.Nucleus
  • aws.greengrass.TokenExchangeService
  • aws.greengrass.LogManager
  • aws.greengrass.Cli
  • aws.greengrass.crypto.Pkcs11Provider

What happens

  1. Device boots from the primary SD.
  2. Thing-group deployment succeeds (SUCCEEDED IoT Job).
  3. Primary SD later becomes unreadable → bootloader falls back to the secondary SD.
  4. Device reconnects to AWS IoT Core / Greengrass missing all application components.
  5. Because the previous deployment job is already completed, Greengrass does not redeploy anything, and the core remains UNHEALTHY until we create a new deployment manually.

Questions

  1. Drift detection: Does Greengrass v2 have a built-in way to notice that the local component state no longer matches the thing-group deployment and automatically trigger redeployment?

  2. Recommended pattern / best practice: If no built-in feature exists, what approach would you recommend for:

    • Detecting “deployment drift” (component versions on device vs. desired state).
    • Automatically starting a fresh deployment—without manual console or CLI steps—when drift is detected.

Any pointers to docs, sample code, or your own patterns would be hugely appreciated.
Thanks!

asked 16 days ago · 90 views
4 Answers

You may want to consider the following approach:

  1. Custom drift detection component. Deploy a lightweight Greengrass component (e.g., com.example.DriftDetector) that:
    • Periodically checks the local component list (/greengrass/v2/work/com.aws.greengrass.componentName/).
    • Compares it to the expected components from the thing group deployment (fetched via ListEffectiveDeployments or ListInstalledComponents).
    • Detects missing or outdated components.
  2. Trigger a redeployment. If drift is detected:
    • Use the AWS SDK (e.g., Python boto3) to call CreateDeployment for the affected thing or thing group.
    • Optionally, revise the existing deployment with no changes to force a redeploy.
  3. Automate with CloudWatch or Lambda.
    • Use CloudWatch metrics or IoT Device Defender to monitor component health.
    • Trigger a Lambda function to initiate redeployment when unhealthy states or drift are detected.
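The comparison step above could be sketched from the cloud side roughly as follows. This is only a minimal illustration, not a tested implementation: it assumes `gg_client` is a boto3 `greengrassv2` client, that the thing group's latest deployment pins an exact version for every component, and that the device has already reported its state. The function names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def installed_components(gg_client, core_device: str) -> Dict[str, str]:
    """Return {componentName: componentVersion} as reported for one core device."""
    found: Dict[str, str] = {}
    token: Optional[str] = None
    while True:
        kwargs = {"coreDeviceThingName": core_device}
        if token:
            kwargs["nextToken"] = token
        page = gg_client.list_installed_components(**kwargs)
        for comp in page.get("installedComponents", []):
            found[comp["componentName"]] = comp["componentVersion"]
        token = page.get("nextToken")
        if not token:
            return found


def expected_components(gg_client, thing_group_arn: str) -> Dict[str, str]:
    """Return {componentName: componentVersion} from the group's latest deployment."""
    latest = gg_client.list_deployments(
        targetArn=thing_group_arn, historyFilter="LATEST_ONLY"
    )["deployments"]
    if not latest:
        return {}  # Assumption: no deployment means nothing is expected.
    detail = gg_client.get_deployment(deploymentId=latest[0]["deploymentId"])
    return {
        name: spec["componentVersion"]
        for name, spec in detail.get("components", {}).items()
    }


def find_drift(
    expected: Dict[str, str], installed: Dict[str, str]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each missing or mismatched component to (expected, installed) versions."""
    return {
        name: (version, installed.get(name))
        for name, version in expected.items()
        if installed.get(name) != version
    }
```

With `gg = boto3.client("greengrassv2")`, a scheduled Lambda function could call `find_drift(expected_components(gg, group_arn), installed_components(gg, thing_name))` and treat a non-empty result as drift. Components that are installed but not expected (such as the factory image's extras) are deliberately ignored here.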
EXPERT
answered 16 days ago

For this use case, you may also want to use different thing names for the primary and secondary SD cards, with both things in the same thing group. If a disk failure does occur, you can see it in the console and re-trigger the deployment.

AWS
answered 13 days ago

UPDATE: July 11: My original answer was highlighting potential difficulties with fleet status service sequence numbers. However, it seems the Greengrass cloud service changed its handling some months ago, and therefore that should not be an issue anymore. APIs such as ListInstalledComponents should reliably report current state after the cutover to the secondary SD card.

I still recommend that you do NOT revise/create a deployment. Instead, remove the thing from the thing group, and then add it back in. This will create a new job execution for just that affected device. If you revised the deployment, you would send a new job execution to every device in the fleet.
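As a rough sketch of that remove-and-re-add step, assuming `iot_client` is a boto3 `iot` client and that the thing group is a static group (the function name is hypothetical, and whether a pause is needed between the two calls is not covered here):

```python
def rejoin_thing_group(iot_client, thing_name: str, thing_group_name: str) -> None:
    """Remove a thing from a static thing group, then add it back.

    Per the answer above, this should create a new job execution for
    this one device without touching the rest of the fleet.
    """
    iot_client.remove_thing_from_thing_group(
        thingGroupName=thing_group_name, thingName=thing_name
    )
    iot_client.add_thing_to_thing_group(
        thingGroupName=thing_group_name, thingName=thing_name
    )
```

For example, `rejoin_thing_group(boto3.client("iot"), "my-core-device", "my-fleet-group")`, where both names are placeholders. The caller needs IAM permissions for both IoT actions.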

AWS
EXPERT
answered 13 days ago

To reduce (but not eliminate) the chances of going into a bad state, you could periodically sync the application state to the secondary SD card with a tool like rsync.

Alternatively, run the SD cards as mirrors of each other, so that the secondary holds a full backup rather than a minimal image.

AWS
answered 6 days ago
