How to properly create a custom entity recognizer in Amazon Comprehend

Hi all! I'm currently learning how to create custom entity recognizers in Amazon Comprehend, and some questions came up.

In the documentation I saw this example:

[Image: example from the documentation, not reproduced here]

It mentions giving "context" to the labels. How do I do this? For example, let's assume I'm working on PDF data that contains information about Engineers and Analysts involved in some project, something like this:

[Image: sample document text, not reproduced here]

What exactly do I have to pass to SageMaker Ground Truth so that it generates the objects to be labeled? Just the lines that contain the Engineer and Analyst names, or all of the lines in the file? In the example I gave, the file has only 8 lines of text, but if it had something like 50 lines, only 2 of them would need labeling and the other 48 would have to be skipped entirely.

asked 2 years ago, 571 views
1 Answer
The "context" mentioned here is the surrounding text of the document in which the mention occurs, not something you annotate separately (and not necessarily an explicit key-value label).

To illustrate:

  • A simple RegEx e.g. \d{4}-\d{2}-\d{2} could extract all date mentions e.g. "2022-08-25" from text
  • A text entity recognition model could be trained to distinguish different entity types by the context in which they're mentioned. For example, 'Agreement_Start_Date' and 'Agreement_End_Date', if you were training a model on contract documents that might contain a sentence like "This agreement shall be effective from 2022-08-25, and will end on 2022-08-26".
  • A layout-aware entity recognition model could be trained to distinguish different entity types by the combination of text content and overall page layout. For example, trying to distinguish the sender address vs recipient address in traditional letters.
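To make the first two bullets concrete, here is a minimal Python sketch. It applies the regex from the first bullet to the example sentence from the second; the variable names are only illustrative.

```python
import re

# Regex from the first bullet: matches ISO-style dates such as 2022-08-25.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

sentence = (
    "This agreement shall be effective from 2022-08-25, "
    "and will end on 2022-08-26"
)

# The regex finds every date, but cannot tell a start date from an end date.
dates = DATE_PATTERN.findall(sentence)
print(dates)

# The words around each match are the "context" a trained model learns from.
for match in DATE_PATTERN.finditer(sentence):
    preceding = sentence[max(0, match.start() - 20):match.start()]
    print(repr(preceding), "->", match.group())
```

Both dates come back as plain matches. Only the preceding words ("effective from" vs "will end on") separate the two entity types, which is why the surrounding text needs to stay in the training documents.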

As mentioned on the other question, if your documents really do contain explicit Label: Value pairs then you probably don't need to go to the trouble of training a model in Comprehend: Textract Form Extraction should pick them out for you (you can try it out in the AWS Console for Amazon Textract).
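If form extraction turns out to be enough, reading its output could look roughly like the sketch below. The helper names (`extract_key_values`, `analyze_forms`) and the sample blocks are my own illustration, including the assumption that the document has an "Engineer:" label. The real call needs AWS credentials, so it is wrapped in a function that is not executed here.

```python
def extract_key_values(blocks):
    """Pair KEY and VALUE blocks from a Textract FORMS result."""
    by_id = {block["Id"]: block for block in blocks}

    def text_of(block):
        # Join the WORD children of a KEY or VALUE block into one string.
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for child_id in rel["Ids"]:
                    child = by_id[child_id]
                    if child["BlockType"] == "WORD":
                        words.append(child["Text"])
        return " ".join(words)

    pairs = {}
    for block in blocks:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(text_of(by_id[i]) for i in rel["Ids"])
        pairs[text_of(block)] = value_text
    return pairs


def analyze_forms(document_bytes):
    """Call Textract form extraction (requires AWS credentials)."""
    import boto3  # imported lazily so the parser works without boto3

    client = boto3.client("textract")
    response = client.analyze_document(
        Document={"Bytes": document_bytes},
        FeatureTypes=["FORMS"],
    )
    return extract_key_values(response["Blocks"])


# Hand-written stand-in for a response, only to show the block structure.
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Engineer:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "John"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
]
print(extract_key_values(sample_blocks))
```

For multi-page PDFs, the asynchronous `start_document_analysis` API is likely the better fit; the parsing of the returned blocks would stay the same.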

If you have more complex documents where the pre-trained key: value detection in Amazon Textract doesn't meet the need, then yes, you could train an Amazon Comprehend model by example to extract different entity types ('Engineer_Name' and 'Analyst_Name', for example). I'd suggest referring to this two-part blog post:

  1. Annotating the documents via Amazon SageMaker Ground Truth
  2. Training and using the model in Comprehend

If you were training the model to extract 'Engineer_Name' and 'Analyst_Name' entity types, you would highlight "John Doe" for one and "Jane Doe" for the other. The model would (ideally!) learn to pick out other analyst names vs engineer names when seeing similar-looking documents in future. Hope this helps to clarify!
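For intuition only, the sketch below shows what "highlighting" amounts to: the whole text is kept, and each label is just a character span inside it. The document lines, the dictionary keys, and the assignment of "John Doe" to 'Engineer_Name' are assumptions of mine. Ground Truth records spans for you through its labeling UI, and this is not its actual output format.

```python
# Hypothetical document text; the sample from the question is not available.
lines = [
    "Project: Alpha",
    "Engineer responsible: John Doe",
    "Analyst responsible: Jane Doe",
    "Status: In progress",
]

# Hypothetical annotator choices: (entity type, highlighted text).
highlights = [("Engineer_Name", "John Doe"), ("Analyst_Name", "Jane Doe")]


def find_spans(lines, highlights):
    """Locate the first occurrence of each highlight as a character span."""
    spans = []
    for entity_type, text in highlights:
        for line_no, line in enumerate(lines):
            begin = line.find(text)
            if begin != -1:
                spans.append({
                    "line": line_no,
                    "begin": begin,
                    "end": begin + len(text),  # exclusive end in this sketch
                    "type": entity_type,
                })
                break
    return spans


spans = find_spans(lines, highlights)
for span in spans:
    print(span)
```

Only two of the four lines carry a label, yet all four stay in the document: the unlabeled lines and the words before each name are the context the model learns from.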

AWS
EXPERT
Alex_T
answered 2 years ago
