Utilize Glue ETL for mapping misspelled information to existing data sets

I want to ingest upstream files via Glue ETL and match misspelled values to existing, already standardized data before adding them to my database. The matching should be based on rules that I can continually add to, or on a model that I can train. It is basically one or more continuously growing reference tables. All of this is currently done by hand in Excel for multiple columns/fields, and I need to automate it.

General Example: I already have a list of known matches (e.g., "tigre" = "tiger"), so any field that has "tigre" should map to the proper spelling without any additional steps.
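
To make the known-match case concrete, here is roughly the behavior I am after, sketched in plain Python rather than Glue code. The `KNOWN_MATCHES` table, the `standardize` function, and the "elefant" entry are hypothetical names and data for illustration; I am also assuming matching should ignore case and surrounding whitespace.

```python
# Hypothetical reference table: known misspelling -> standardized value.
# In practice this would be loaded from the growing reference table(s).
KNOWN_MATCHES = {
    "tigre": "tiger",
    "elefant": "elephant",
}


def standardize(value, known_matches=KNOWN_MATCHES):
    """Return the standardized spelling if the value is a known misspelling.

    Assumption: lookups ignore case and leading/trailing whitespace.
    Values with no known match are returned unchanged.
    """
    key = value.strip().lower()
    return known_matches.get(key, value)


print(standardize("tigre"))  # known misspelling, mapped to "tiger"
print(standardize("lion"))   # no entry in the table, returned as-is
```

At scale I would expect this to be a join against the reference table rather than a per-value dictionary lookup, but the mapping logic is the same.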

I believe that I need a training step for matches that don't already exist. So when the spelling "tigerrr" comes along, I can map it to "tiger" and, the next time the process runs, the mapping is applied automatically.
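
As a rough sketch of that loop, the plain-Python example below checks the known matches first, then falls back to a similarity search and records any new match so it is already known on the next run. It uses `difflib.get_close_matches` from the standard library as a stand-in for whatever matching or model step would actually be used. The `resolve` function, the sample data, and the 0.8 cutoff are all assumptions for illustration, as is accepting a fuzzy match automatically instead of sending it for manual review.

```python
import difflib


def resolve(value, known_matches, canonical_values, cutoff=0.8):
    """Map a raw value to a standardized one and report how it was resolved.

    known_matches: dict of misspelling -> standardized value (grows over time).
    canonical_values: list of standardized values.
    cutoff: similarity threshold between 0 and 1 (0.8 is an arbitrary choice).
    """
    key = value.strip().lower()
    if key in known_matches:
        return known_matches[key], "known"
    if key in canonical_values:
        return key, "canonical"
    # Fall back to the closest standardized value above the cutoff, if any.
    candidates = difflib.get_close_matches(key, canonical_values, n=1, cutoff=cutoff)
    if candidates:
        # Assumption: the match is accepted automatically and remembered.
        known_matches[key] = candidates[0]
        return candidates[0], "learned"
    # Nothing close enough: leave unchanged so it can be reviewed by hand.
    return value, "unmatched"


known = {"tigre": "tiger"}
canonical = ["tiger", "lion", "elephant"]

print(resolve("tigerrr", known, canonical))  # first run: fuzzy match is learned
print(resolve("tigerrr", known, canonical))  # second run: found in known matches
print(resolve("zebra", known, canonical))    # nothing similar enough
```

Values that come back as "unmatched" are the ones I would still expect to review by hand and then add to the reference table.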

I looked into DataBrew, but it does not appear to handle large reference tables for mapping in recipes. FindMatches in Glue Studio did not look like quite the right tool either, as it appears to focus on full-record matching rather than individual field matches.

Any recommendations?