Comprehend, how to retain grouped entities (engineer + address + phone)


I'm new to Comprehend and we're doing custom entity recognition. We have entities like ENGINEER, ENGINEER_ADDRESS, ENGINEER_PHONE and ENGINEER_EMAIL, and a document can contain multiple engineers along with their information. How do we make sure each engineer's information stays grouped with the right engineer? What's the best approach for this?

I think the question is pretty clear for the experts here, but to give an example:

Jane Johnson is an engineer living at 4024 Hillside Street Scottsdale Arizona and can be reached at phone number 480-424-9944 or email jane@gmail.com. The other engineer working on the project is Ben Franklin who lives at 262 Geraldine Lane, New York and can only be reached by email b.franklin@hotmail.com.

We want to make sure that Ben's address is not coupled to Jane when we post-process.

Zjeraar
asked a year ago, 244 views
1 Answer
Accepted Answer

Hello,

I understand that you would like to know whether related entities can be grouped together, so that each phone number and address is attached to the correct engineer entity.

Please note that, as of now, Amazon Comprehend unfortunately does not support grouping entities or tagging an entity with its related entities. There is no built-in feature to achieve this.

To see whether this feature is included in a future update to the Comprehend service, you can watch the announcement pages below.

  1. https://aws.amazon.com/new/
  2. https://aws.amazon.com/about-aws/whats-new/machine-learning/

However, a workaround can be used for the scenario.

As you might already know, a custom entity recognizer can be used for real-time entity recognition or for async analysis jobs. Both detect entities and return each one in the format below:

{
    "BeginOffset": 0,
    "EndOffset": 22,
    "Score": 0.9763959646224976,
    "Text": "John Johnson",
    "Type": "JUDGE"
}

For more details on the output of custom entity recognizers, you can refer to the links below:

  1. for real time: https://docs.aws.amazon.com/comprehend/latest/dg/outputs-cer-sync.html
  2. for async operations: https://docs.aws.amazon.com/comprehend/latest/dg/outputs-cer-async.html

As you can see in the output, each entity has a "BeginOffset", which marks where the entity begins in the source document, and an "EndOffset", which marks where it ends. (Please refer to the links above for more details on the entity fields in the output.)
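As a rough illustration of how these offsets can be used, the sketch below builds a small hand-made response in the same shape as the output above (no AWS call is made), sorts the entities by position, and slices the source text with the offsets. The helper `make_entity`, the sample sentence, and the scores are hypothetical. It also assumes that "EndOffset" is exclusive and that the offsets line up with Python string indices, which is worth verifying against your own output.

```python
# Hand-built stand-in for a Comprehend response; no AWS call is made here.
text = "Jane Johnson can be reached at phone number 480-424-9944."


def make_entity(source, phrase, entity_type, score=0.99):
    """Build an entity dict shaped like the output above (score is made up)."""
    begin = source.index(phrase)
    return {
        "BeginOffset": begin,
        "EndOffset": begin + len(phrase),  # assumption: end offset is exclusive
        "Score": score,
        "Text": phrase,
        "Type": entity_type,
    }


# Entities are deliberately listed out of order to show the sort below.
response = {
    "Entities": [
        make_entity(text, "480-424-9944", "ENGINEER_PHONE"),
        make_entity(text, "Jane Johnson", "ENGINEER"),
    ]
}

# Sort by position so later grouping can walk the document left to right.
ordered = sorted(response["Entities"], key=lambda e: e["BeginOffset"])

for ent in ordered:
    # Slicing the source text with the offsets should give back the entity text.
    span = text[ent["BeginOffset"]:ent["EndOffset"]]
    print(ent["Type"], ent["BeginOffset"], ent["EndOffset"], repr(span))
```

If the sliced span does not match the "Text" field for your documents, the offset assumption above does not hold and the grouping logic should be adjusted before relying on it.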

We can use these parameters to check which phone number and address entities appear near an engineer entity in the input document, and then group the entities according to these offsets.

For example, suppose the output contains the following details:

`Engineer: "Jane", beginOffset: 0, endOffset: 4
PhoneNumber: 12345, beginOffset: 15, endOffset: 20
Engineer: "Helen", beginOffset: 90, endOffset: 95
PhoneNumber: 67890, beginOffset: 100, endOffset: 105`

Here, since the phone number 12345 appears near engineer Jane according to the offsets, it most probably belongs to that engineer. Likewise, 67890 most probably belongs to Helen.
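One possible way to turn this idea into code is sketched below. It is a heuristic, not a Comprehend feature: it assumes each engineer's name appears before that engineer's details, so every non-ENGINEER entity is attached to the closest preceding ENGINEER entity. The function `group_by_preceding_engineer` and the helper `make_entity` are hypothetical names, and the entity list is hand-built from the example sentence in the question rather than returned by the service.

```python
from collections import defaultdict

text = (
    "Jane Johnson is an engineer living at 4024 Hillside Street Scottsdale "
    "Arizona and can be reached at phone number 480-424-9944 or email "
    "jane@gmail.com. The other engineer working on the project is Ben "
    "Franklin who lives at 262 Geraldine Lane, New York and can only be "
    "reached by email b.franklin@hotmail.com."
)


def make_entity(source, phrase, entity_type):
    """Hand-built entity in the Comprehend output shape (Score omitted)."""
    begin = source.index(phrase)
    return {
        "BeginOffset": begin,
        "EndOffset": begin + len(phrase),  # assumption: end offset is exclusive
        "Text": phrase,
        "Type": entity_type,
    }


entities = [
    make_entity(text, "Jane Johnson", "ENGINEER"),
    make_entity(text, "4024 Hillside Street Scottsdale Arizona", "ENGINEER_ADDRESS"),
    make_entity(text, "480-424-9944", "ENGINEER_PHONE"),
    make_entity(text, "jane@gmail.com", "ENGINEER_EMAIL"),
    make_entity(text, "Ben Franklin", "ENGINEER"),
    make_entity(text, "262 Geraldine Lane, New York", "ENGINEER_ADDRESS"),
    make_entity(text, "b.franklin@hotmail.com", "ENGINEER_EMAIL"),
]


def group_by_preceding_engineer(entities, anchor_type="ENGINEER"):
    """Attach each entity to the most recent anchor entity that precedes it."""
    groups = []
    current = None
    for ent in sorted(entities, key=lambda e: e["BeginOffset"]):
        if ent["Type"] == anchor_type:
            # A new engineer starts a new group.
            current = {"engineer": ent["Text"], "attributes": defaultdict(list)}
            groups.append(current)
        elif current is not None:
            current["attributes"][ent["Type"]].append(ent["Text"])
        # Entities that appear before the first engineer are skipped.
    return [
        {"engineer": g["engineer"], "attributes": dict(g["attributes"])}
        for g in groups
    ]


groups = group_by_preceding_engineer(entities)
for group in groups:
    print(group["engineer"], group["attributes"])
```

The "closest preceding engineer" rule is used here instead of plain nearest distance because, in the example sentence, jane@gmail.com sits closer to "Ben Franklin" than to "Jane Johnson" and would be attached to the wrong engineer. Documents that list details before the name, or that mix details of several engineers in one sentence, would need a different rule.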

However, please note that Comprehend does not do this for you: the output has to be post-processed. You can use a local Python script, or a script in any other language, to read the output file and write the grouped entities to a file in the format you need.

AWS
SUPPORT ENGINEER
answered a year ago
  • Okay, thanks. I thought that would be the case, but I just wanted to be sure I did not miss anything.
