
Greengrass: How to detect & auto-heal deployment drift after SD-card fail-over?


Summary

We run a production fleet of Greengrass v2 core devices that boot from a primary SD card and automatically fail over to a minimal secondary SD card when the primary is damaged. The fail-over causes deployment drift that we need to detect and heal automatically.

Scenario

Each device has two SD cards:

  • Primary SD – boot + full thing-group deployment (application components).
  • Secondary SD – minimal factory image used only if the primary card fails.

The factory image includes only:

  • aws.greengrass.Nucleus
  • aws.greengrass.TokenExchangeService
  • aws.greengrass.LogManager
  • aws.greengrass.Cli
  • aws.greengrass.crypto.Pkcs11Provider

What happens

  1. Device boots from the primary SD.
  2. Thing-group deployment succeeds (SUCCEEDED IoT Job).
  3. Primary SD later becomes unreadable → bootloader falls back to the secondary SD.
  4. Device reconnects to AWS IoT Core / Greengrass missing all application components.
  5. Because the previous deployment job is already completed, Greengrass does not redeploy anything, and the core remains UNHEALTHY until we create a new deployment manually.

Questions

  1. Drift detection: Does Greengrass v2 have a built-in way to notice that the local component state no longer matches the thing-group deployment and automatically trigger redeployment?

  2. Recommended pattern / best practice: If no built-in feature exists, what approach would you recommend for:

    • Detecting “deployment drift” (component versions on device vs. desired state).
    • Automatically starting a fresh deployment—without manual console or CLI steps—when drift is detected.

Any pointers to docs, sample code, or your own patterns would be hugely appreciated.
Thanks!

Asked 4 months ago · 98 views
4 Answers

You might consider the following approach (a rough sketch of steps 1 and 2 follows after this list):

  1. Custom drift-detection component: Deploy a lightweight Greengrass component (e.g., com.example.DriftDetector) that:
    • Periodically checks the locally installed components (for example under /greengrass/v2/work/com.aws.greengrass.componentName/).
    • Compares them to the expected components of the thing-group deployment (fetched via ListEffectiveDeployments or ListInstalledComponents).
    • Detects missing or outdated components.
  2. Trigger a redeployment: If drift is detected:
    • Use the AWS SDK (e.g., Python boto3) to call CreateDeployment for the affected thing or thing group.
    • Optionally, revise the existing deployment with no changes to force a redeploy.
  3. Automate with CloudWatch or Lambda:
    • Use CloudWatch metrics or IoT Device Defender to monitor component health.
    • Trigger a Lambda function that initiates redeployment when unhealthy states or drift are detected.
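
Here is a rough Python sketch of steps 1 and 2, assuming a companion script or Lambda with permissions for the greengrassv2 ListEffectiveDeployments, GetDeployment, ListInstalledComponents, and CreateDeployment APIs. The core-device name, thing-group ARN, and deployment name are placeholders, and pagination and error handling are omitted:

```python
import boto3

# Placeholders: adjust for your fleet.
CORE_DEVICE_NAME = "my-core-device"
THING_GROUP_ARN = "arn:aws:iot:eu-west-1:123456789012:thinggroup/MyFleet"

gg = boto3.client("greengrassv2")

def desired_components():
    """Component -> version map from the effective thing-group deployment."""
    resp = gg.list_effective_deployments(coreDeviceThingName=CORE_DEVICE_NAME)
    for dep in resp["effectiveDeployments"]:
        if dep["targetArn"] == THING_GROUP_ARN:
            spec = gg.get_deployment(deploymentId=dep["deploymentId"])["components"]
            return {name: c["componentVersion"] for name, c in spec.items()}
    return {}

def installed_components():
    """Component -> version map that the core device actually reports."""
    resp = gg.list_installed_components(coreDeviceThingName=CORE_DEVICE_NAME)
    return {c["componentName"]: c["componentVersion"] for c in resp["installedComponents"]}

desired = desired_components()
installed = installed_components()
drift = {name: ver for name, ver in desired.items() if installed.get(name) != ver}

if drift:
    # Heal by issuing a revised deployment that re-states the full desired
    # component list (a deployment replaces the previous one for its target,
    # so listing only the drifted components would remove the rest).
    gg.create_deployment(
        targetArn=THING_GROUP_ARN,
        deploymentName="drift-heal",
        components={name: {"componentVersion": ver} for name, ver in desired.items()},
    )
```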
EXPERT
answered 4 months ago

For this use case, you may also want to use different thing names for the primary and secondary images, with both things in the same thing group. If a disk failure does occur, you can see the fail-over in the console and re-trigger the deployment.
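
For illustration only, a minimal boto3 sketch of that registration pattern; the thing and group names are hypothetical, and certificate/provisioning setup for each image is omitted:

```python
import boto3

iot = boto3.client("iot")

THING_GROUP = "MyGreengrassFleet"  # hypothetical thing group

# One thing per SD-card image, both members of the same group so the
# thing-group deployment targets whichever image the device boots from.
for thing_name in ("device-001-primary", "device-001-secondary"):
    iot.create_thing(thingName=thing_name)
    iot.add_thing_to_thing_group(thingGroupName=THING_GROUP, thingName=thing_name)
```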

AWS
answered 4 months ago

UPDATE (July 11): My original answer highlighted potential difficulties with fleet status service sequence numbers. However, the Greengrass cloud service changed its handling some months ago, so that should no longer be an issue. APIs such as ListInstalledComponents should reliably report the current state after the cutover to the secondary SD card.

I still recommend that you do NOT revise/create a deployment. Instead, remove the thing from the thing group, and then add it back in. This will create a new job execution for just that affected device. If you revised the deployment, you would send a new job execution to every device in the fleet.
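
A minimal boto3 sketch of that remove/re-add pattern (the thing and group names are placeholders):

```python
import boto3

iot = boto3.client("iot")

THING_NAME = "my-core-device"      # the affected core device
THING_GROUP = "MyGreengrassFleet"  # its deployment thing group

# Removing and re-adding the thing creates a new job execution of the
# existing thing-group deployment for this device only, leaving the rest
# of the fleet untouched.
iot.remove_thing_from_thing_group(thingGroupName=THING_GROUP, thingName=THING_NAME)
iot.add_thing_to_thing_group(thingGroupName=THING_GROUP, thingName=THING_NAME)
```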

AWS
EXPERT
answered 4 months ago

To reduce (but not eliminate) the chances of ending up in a bad state, you could periodically sync the application state to the secondary card with a tool like rsync.

Or run the SD cards as mirrors of each other (making the secondary a full backup rather than a minimal image).
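
As a rough illustration of the periodic-sync idea, assuming Greengrass is installed under /greengrass/v2 on the primary card and the secondary card is mounted at /mnt/secondary (both paths are assumptions), something like this could run from cron or a systemd timer:

```python
import subprocess

# Mirror the Greengrass root onto the secondary card. This is simplistic:
# syncing a live Greengrass installation can capture an inconsistent state,
# so ideally pause workloads or sync during a maintenance window.
subprocess.run(
    [
        "rsync", "-a", "--delete",
        "/greengrass/v2/",               # source: primary card (assumed path)
        "/mnt/secondary/greengrass/v2/", # destination: secondary card (assumed mount)
    ],
    check=True,
)
```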

AWS
answered 4 months ago
