Salesforce CDC to Data Lake


Hello, I'm looking to push Salesforce data to a data lake. The data lake table needs to hold different versions of each record. I experimented with AppFlow, but I couldn't really get the control I wanted over the process (mainly being notified when an event comes in).

To cover the requirement of storing the changes, I'm thinking of implementing Iceberg or Hudi. In addition to storing the data for analytics, there are some additional requirements to push data back to Salesforce in closer to real time. Because of this, I've created an EventBridge rule to capture Salesforce CDC events in real time.

The plan is to take those events and process them back into the lake. I was thinking of just sending each event to Lambda and then using Athena to update the Iceberg/Hudi table. One of the difficulties is that, since the stream contains only what changed, I have to query Athena to get the last full record and then overlay the changes.

Good solution? Bad solution? Other things to try? We are a small company, so data volume isn't huge, and I don't want to overengineer this in either effort or $$$. My team is primarily database developers, so anything we can do to stay more SQL-like is a plus.


asked a year ago, 506 views
1 Answer

Your plan to capture Salesforce CDC events with EventBridge and process them in Lambda to update an Iceberg or Hudi table in your data lake sounds reasonable. Using Athena to update the table also fits a team of database developers, since Athena supports SQL MERGE INTO on Iceberg tables. Note that Athena can write to Iceberg tables but only reads Hudi tables, so the Athena route points toward Iceberg.
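To make the Lambda-plus-Athena idea concrete, here is a minimal sketch of turning one CDC event into a MERGE statement for an Iceberg table. It assumes the change event payload carries a `ChangeEventHeader` with `recordIds` and `changedFields`, and that changed values sit at the top level of the payload; the table name, the `id` key column, and the helper names are hypothetical. Values are inlined as literals only to keep the sketch readable.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal; strings are quoted and escaped."""
    if value is None:
        return "NULL"  # an untyped NULL may need a CAST to the column type
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def changes_from_payload(payload):
    """Pull the record id and changed values out of a CDC payload.

    Assumed shape: {"ChangeEventHeader": {"recordIds": [...],
    "changedFields": [...]}, "<Field>": <new value>, ...}
    """
    header = payload["ChangeEventHeader"]
    record_id = header["recordIds"][0]
    changes = {f: payload.get(f) for f in header.get("changedFields", [])}
    return record_id, changes


def build_merge(table, key_column, record_id, changes):
    """Build a MERGE that touches only the changed columns of one record."""
    if not changes:
        raise ValueError("no changed fields to apply")
    columns = sorted(changes)
    for name in [key_column] + columns:
        if not name.isidentifier():  # refuse anything that is not a plain name
            raise ValueError(f"unexpected column name: {name!r}")
    source = ", ".join(f"{sql_literal(changes[c])} AS {c}" for c in columns)
    set_clause = ", ".join(f"{c} = s.{c}" for c in columns)
    insert_cols = ", ".join([key_column] + columns)
    insert_vals = ", ".join(f"s.{c}" for c in [key_column] + columns)
    return (
        f"MERGE INTO {table} t "
        f"USING (SELECT {sql_literal(record_id)} AS {key_column}, {source}) s "
        f"ON t.{key_column} = s.{key_column} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )
```

In a Lambda handler, the resulting string could be passed to the Athena client's `start_query_execution` in boto3. Because the UPDATE clause sets only the changed columns, the other columns of the row keep their existing values, so this route may avoid the extra query for the last full record. Earlier versions of the row would then come from Iceberg snapshots rather than from separate version rows.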

You mentioned that one difficulty is overlaying the changes onto the last full record. You could also consider Delta Lake as an alternative to Iceberg or Hudi. Delta Lake supports MERGE-based upserts and records every write in a transaction log, so the table holds the current full version of each record while earlier versions stay queryable through its version history. Note that Delta Lake writes usually run on Spark (for example, on AWS Glue or EMR) rather than through Athena, which would add a component to your stack.
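Whichever table format you pick, the overlay step itself is small. Here is a minimal sketch, assuming the last full record has already been fetched as a dict and that the event payload lists its modified fields in `ChangeEventHeader.changedFields`, with a field set to null showing up as a null value. The `overlay` function and the `_commit_ts` / `_change_type` columns are hypothetical names for illustration.

```python
import copy


def overlay(last_full, payload):
    """Return a new full record: the last full record with the changes applied.

    The input record is not modified, so the previous version stays available
    to write as its own history row.
    """
    header = payload["ChangeEventHeader"]
    merged = copy.deepcopy(last_full)
    for field in header.get("changedFields", []):
        # A changed field that is absent or null in the payload becomes None.
        merged[field] = payload.get(field)
    # Hypothetical bookkeeping columns for ordering versions in the lake.
    merged["_commit_ts"] = header.get("commitTimestamp")
    merged["_change_type"] = header.get("changeType")
    return merged
```

Each call yields a complete row, so appending the result to a history table gives you one row per version without depending on the table format's own versioning.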

Overall, it sounds like you have a solid plan in place. You could start with Iceberg or Hudi and switch to Delta Lake if you find that the process of overlaying changes is too cumbersome. Keep in mind that this solution will require some ongoing maintenance and monitoring to ensure that the data is being captured correctly and that the process is running smoothly.

answered a year ago
