Ground Truth Text Format

Hello
We are trying to set up text labeling with Ground Truth. Is it possible to format the source text to be labeled in some way, e.g. with HTML or Markdown? Even a line break would already help a lot. I could not find any documentation on this at https://docs.aws.amazon.com/sagemaker/latest/dg/sms-data-input.html

Thanks
Nicolas

asked 4 years ago, 127 views
3 Answers

When you post tasks to Ground Truth, they are displayed to workers in a web interface, so HTML is the best bet for formatting the text you want annotated. For example, you could use the following to create a multi-line input:

{"source": "Lorem ipsum <br/>dolor sit amet"}

Unfortunately, by default, inputs are HTML-escaped to prevent confusion between your variable text and HTML. As a result, if you use the Text Classification widget, the text above will be displayed to workers literally as Lorem ipsum <br/>dolor sit amet. To pass those values without escaping them, you'll need to create a custom template that applies a filter to the variable to prevent it from being escaped.
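
To see what escaping does to the example input, here is a small Python illustration. It uses the standard library's html.escape as a stand-in; the escaping in Ground Truth happens inside the template engine and may differ in detail.

```python
import html

source = "Lorem ipsum <br/>dolor sit amet"

# Escaping turns markup characters into entities, so the browser
# shows the tag as literal text instead of rendering a line break.
escaped = html.escape(source)
print(escaped)
```

The angle brackets come out as &lt; and &gt;, which is why workers see the tag itself rather than a new line.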

To set up a custom task, start by creating Lambdas to handle the pre- and post-processing required. Information on setting these up can be found at https://docs.aws.amazon.com/sagemaker/latest/dg/sms-custom-templates-step3.html

I've included simple pre and post Lambda examples below that I find are a good place to start for text annotation. Note that the post Lambda doesn't do any answer consolidation; it simply passes back all of the answers provided by workers.

Create a custom template in Ground Truth and use the Sentiment Analysis template to create a starter task for text annotation. To prevent it from escaping HTML values, update the Liquid variables to include the skip_autoescape filter.

{{ task.input.text | skip_autoescape }}

You can find more info on using Liquid template values here:
https://docs.aws.amazon.com/sagemaker/latest/dg/sms-custom-templates-step2.html#sms-custom-templates-step2-automate
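
Because skip_autoescape turns escaping off, any < or & in the raw text would then be read as HTML. One way to handle that, sketched here, is to escape the text yourself when building the input manifest and only then turn newlines into <br/>. The helper name to_manifest_line is hypothetical; only the "source" key comes from the example earlier in this answer.

```python
import html
import json


def to_manifest_line(text):
    """Build one JSON Lines manifest entry from plain text (hypothetical helper)."""
    # Escape characters that would otherwise be read as HTML,
    # since the template no longer does this for us.
    safe = html.escape(text, quote=False)
    # Turn newlines into explicit HTML line breaks.
    safe = safe.replace("\r\n", "\n").replace("\n", "<br/>")
    return json.dumps({"source": safe})


line = to_manifest_line("Lorem ipsum\ndolor sit amet & more")
print(line)
```

Writing one such line per item gives a manifest where the only markup is the line breaks you added on purpose.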

Pre

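def lambda_handler(event, context):
    # Log the incoming event so it shows up in CloudWatch Logs.
    print(event)
    # Each manifest line arrives as event['dataObject'].
    source = event['dataObject'].get('source')

    if source:
        print("text is {}".format(source))
    else:
        print("Missing text in dataObject")
        return {}

    # Expose the text to the template as task.input.text.
    response = {
        "taskInput": {
            "text": source
        }
    }
    print(response)
    return response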
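
A quick way to check the pre Lambda before deploying is to call the handler locally with a hand-written event. The sketch below repeats a trimmed copy of the handler so that it runs on its own. The sample event contains only the dataObject field that the handler reads; a real Ground Truth event carries additional fields.

```python
def lambda_handler(event, context):
    # Trimmed copy of the pre Lambda, without the logging.
    source = event['dataObject'].get('source')
    if not source:
        return {}
    return {"taskInput": {"text": source}}


# Hand-written sample event; the values are made up for illustration.
sample_event = {"dataObject": {"source": "Lorem ipsum <br/>dolor sit amet"}}

result = lambda_handler(sample_event, None)
print(result)

# An entry without "source" yields an empty response.
empty = lambda_handler({"dataObject": {}}, None)
print(empty)
```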

Post

import json
import boto3
from urllib.parse import urlparse


def lambda_handler(event, context):
    print(json.dumps(event))

    payload = get_payload(event)
    print(json.dumps(payload))

    consolidated_response = []
    for dataset in payload:
        annotations = dataset['annotations']
        responses = []
        for annotation in annotations:
            response = json.loads(annotation['annotationData']['content'])
            if 'annotatedResult' in response:
                response = response['annotatedResult']

            responses.append({
                'workerId': annotation['workerId'],
                'annotation': response
            })

        consolidated_response.append({
            'datasetObjectId': dataset['datasetObjectId'],
            'consolidatedAnnotation' : {
                'content': {
                    event['labelAttributeName']: {
                        'responses': responses
                    }
                }
            }
        })

    print(json.dumps(consolidated_response))
    return consolidated_response


def get_payload(event):
    if 'payload' in event:
        # Ground Truth passes the worker annotations as an S3 object.
        parsed_url = urlparse(event['payload']['s3Uri'])
        s3 = boto3.client('s3')
        text_file = s3.get_object(Bucket=parsed_url.netloc, Key=parsed_url.path[1:])
        return json.loads(text_file['Body'].read())
    else:
        # Fall back to an inline 'test_payload' key, e.g. for local testing.
        return event.get('test_payload', [])
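
To see the shape of what the post Lambda returns without touching S3, the consolidation loop can be run on its own against a small hand-written payload. This is a sketch only: the function name collect_responses, the worker IDs, the answers, and the label attribute name are all made up for illustration.

```python
import json


def collect_responses(payload, label_attribute_name):
    """Same loop as the post Lambda, without the S3 access."""
    consolidated = []
    for dataset in payload:
        responses = []
        for annotation in dataset['annotations']:
            response = json.loads(annotation['annotationData']['content'])
            if 'annotatedResult' in response:
                response = response['annotatedResult']
            responses.append({
                'workerId': annotation['workerId'],
                'annotation': response
            })
        consolidated.append({
            'datasetObjectId': dataset['datasetObjectId'],
            'consolidatedAnnotation': {
                'content': {label_attribute_name: {'responses': responses}}
            }
        })
    return consolidated


# Made-up payload: two workers answering the same data object.
test_payload = [{
    'datasetObjectId': '0',
    'annotations': [
        {'workerId': 'worker-a',
         'annotationData': {'content': json.dumps({'sentiment': {'label': 'Positive'}})}},
        {'workerId': 'worker-b',
         'annotationData': {'content': json.dumps({'annotatedResult': {'label': 'Negative'}})}},
    ],
}]

result = collect_responses(test_payload, 'my-label')
print(json.dumps(result, indent=2))
```

Both workers' answers are kept side by side under the label attribute name, so any majority vote or other consolidation would have to be added separately.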
answered 4 years ago

Ah, great. That was exactly what I was missing. I simply changed the template to {{ task.input.taskObject | skip_autoescape }} and left out the pre-processing lambda.

Edited by: nicolasdoodle on Apr 28, 2019 11:56 PM

answered 4 years ago

I'm trying to do the same thing but couldn't make it work. Did you still use the post-processing Lambda, or did you just change the template.liquid file?

Edited by: apoorvsrivastava on Jun 18, 2021 9:28 AM

answered a year ago
