Utilize Glue ETL for mapping misspelled information to existing data sets


I am looking to ingest upstream files via Glue ETL, and I need to match misspellings to existing, already standardized data before adding the records to my database. The matching should be based on rules that I can continually add to, or on a model that I can train to learn from them. It is basically a continuously growing reference table (or set of tables). All of this is currently done by hand in Excel for multiple columns/fields, and I need to automate it.

General Example: I already have a list of known matches (e.g., "tigre" = "tiger"), so any field that has "tigre" should map to the proper spelling without any additional steps.
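To make the known-match case concrete, here is a minimal sketch in plain Python of the lookup I have in mind. The names `KNOWN_MATCHES` and `standardize` are made up for illustration, and I assume that in a real Glue job this would be a join against the reference table rather than an in-memory dict.

```python
# Hypothetical reference table: known misspelling -> standardized value.
KNOWN_MATCHES = {
    "tigre": "tiger",
    "lino": "lion",
}


def standardize(value, known_matches=KNOWN_MATCHES):
    """Return the standardized spelling if the value is a known misspelling.

    Values that are not in the reference table are returned trimmed and
    lowercased, but otherwise unchanged.
    """
    key = value.strip().lower()
    return known_matches.get(key, key)


print(standardize("tigre"))    # tiger
print(standardize(" Tigre "))  # tiger
print(standardize("zebra"))    # zebra
```

The same function would be applied per column/field, each with its own reference table.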

I believe that I need to have a training step for matches that don't already exist. So when the spelling "tigerrr" comes along, I can map to "tiger" and, next time the process runs, the mapping occurs properly.
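As a rough sketch of that "learning" step, the snippet below falls back to fuzzy matching with Python's standard `difflib` module when a spelling is not yet in the reference table, then records the new mapping so the next run resolves it directly. The names `CANONICAL`, `resolve`, and the `0.8` cutoff are my own assumptions for illustration, not anything Glue provides, and in practice I would probably want a human review step before a learned match is added permanently.

```python
import difflib

# Hypothetical list of standardized values and the growing reference table.
CANONICAL = ["tiger", "lion", "zebra"]
known_matches = {"tigre": "tiger"}


def resolve(value, known_matches, canonical=CANONICAL, cutoff=0.8):
    """Map a raw value to a standardized one and report how it was resolved.

    Returns (standard_value, status), where status is "exact", "known",
    "learned" (fuzzy match, now stored for next time), or "needs_review".
    """
    key = value.strip().lower()
    if key in canonical:
        return key, "exact"
    if key in known_matches:
        return known_matches[key], "known"
    # Fall back to fuzzy matching against the standardized values.
    candidates = difflib.get_close_matches(key, canonical, n=1, cutoff=cutoff)
    if candidates:
        known_matches[key] = candidates[0]  # remember for the next run
        return candidates[0], "learned"
    return None, "needs_review"


print(resolve("tigerrr", known_matches))  # ('tiger', 'learned')
print(resolve("tigerrr", known_matches))  # ('tiger', 'known')
```

Anything that comes back as `needs_review` would still go to a person, and their decision would be added to the reference table like any other rule.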

I dove into DataBrew, but it does not appear to be able to handle large reference tables for mapping in its recipes. FindMatches in Glue Studio did not look like quite the right tool either, as it appears to focus on full-record matching rather than individual field matches.

Any recommendations?

Asked 2 years ago, viewed 61 times
No answers