[Textract] [ExpenseAnalysis] [Asynchronous] - How to extract just summary fields?


Hello, I'm trying to extract just the summary fields from a startExpenseAnalysis call with Textract (ignoring the line item fields). Is that possible?

I'm trying to reduce the response time to under 30 seconds, since I'd like to expose it through an API Gateway endpoint.

Thanks.

Daniel
asked 3 months ago, 118 views
2 Answers
Accepted Answer

I believe the short answer to your question is no: it's not possible today to tell Textract AnalyzeExpense to consider only the summary fields in order to save time. Neither the synchronous AnalyzeExpense nor the asynchronous StartExpenseAnalysis has a parameter for this.

If you're using the async APIs, do make sure you're using the event-driven SNS callback rather than polling the job status (polling adds wait time between checks).
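A minimal sketch of that callback setup in Python, in case it helps. It assumes `client` is a boto3 Textract client, that the SNS topic and IAM role ARNs already exist (the function names and arguments here are hypothetical), and that a Lambda function is subscribed to the topic. The completion message is assumed to be a JSON string carrying `JobId` and `Status`, which is how Textract documents its async notifications.

```python
import json


def start_job_with_callback(client, bucket, key, topic_arn, role_arn):
    """Start an async expense analysis and ask Textract to publish
    the completion status to an SNS topic instead of being polled."""
    response = client.start_expense_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        NotificationChannel={"SNSTopicArn": topic_arn, "RoleArn": role_arn},
    )
    return response["JobId"]


def handler(event, context=None):
    """Lambda entry point subscribed to the SNS topic (assumed setup).

    Returns (job_id, status) pairs; a real handler would call
    GetExpenseAnalysis for each job whose status is SUCCEEDED.
    """
    jobs = []
    for record in event.get("Records", []):
        # SNS delivers the Textract notification as a JSON string.
        message = json.loads(record["Sns"]["Message"])
        jobs.append((message["JobId"], message["Status"]))
    return jobs
```

This only removes the polling delay; the analysis itself takes as long as it takes.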

If your workload is small/low-concurrency (no or minimal parallel requests), you could explore splitting input documents yourself and using the synchronous AnalyzeExpense API. This approach wouldn't scale well because of the Textract quota limits, though, and I wouldn't guarantee without testing that it'd be faster than async anyway.
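For what it's worth, the splitting idea could be sketched roughly like this. It assumes you have already split the document into single-page images yourself (`page_images` is a hypothetical list of bytes objects, one per page) and that `client` is a boto3 Textract client. The worker count is arbitrary and would need to stay within your account's quota.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_pages(client, page_images, max_workers=4):
    """Call the synchronous AnalyzeExpense API once per page, in parallel.

    Results come back in the same order as page_images.
    """
    def analyze(page_bytes):
        return client.analyze_expense(Document={"Bytes": page_bytes})

    # Keep max_workers low: each call counts against the Textract
    # transactions-per-second quota for AnalyzeExpense.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze, page_images))
```

Whether this beats the async API is something you'd have to measure on your own documents.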

As the other answer mentioned, you could also try optimizing the documents themselves for faster processing.

AWS
EXPERT
Alex_T
answered 3 months ago

To extract just the summary fields from a startExpenseAnalysis call with Textract, you can follow these steps:

  1. Use the StartExpenseAnalysis API to initiate the asynchronous analysis of the invoices or receipts stored in an Amazon S3 bucket. This API will return a JobId that you can use to retrieve the results later.

  2. Use the GetExpenseAnalysis API to retrieve the results of the expense analysis operation. The response will contain a SummaryFields section that includes the extracted summary information, such as the total amount, currency, and other high-level details.

  3. To reduce the response time to less than 30 seconds, you can consider the following approaches:

    • Optimize the input documents: Ensure that the invoices or receipts are in a format (JPEG, PNG, or PDF) that Textract can process efficiently. Also, make sure the documents are of good quality and not too large.

    • Use asynchronous processing: As mentioned, the StartExpenseAnalysis API initiates an asynchronous operation. This lets your endpoint return immediately and retrieve the results later using the GetExpenseAnalysis API instead of blocking on a synchronous call. It does not by itself make the analysis faster.

    • Implement caching: If you are processing the same documents repeatedly, you can cache the summary results to avoid re-processing the documents every time.

    • Optimize your API Gateway endpoint: Ensure that your API Gateway endpoint is configured correctly, with appropriate caching, throttling, and other performance-related settings.

  4. Note that the SummaryFields section of the GetExpenseAnalysis response will only contain the high-level summary information, and not the line item details. If you need to extract the line item information as well, you can access the LineItemGroups section of the response.
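Since the request itself can't be restricted to summary fields, the filtering has to happen on the client side after the response arrives. A minimal sketch, assuming `client` is a boto3 Textract client and the job has already succeeded (the helper names are hypothetical). It pages through GetExpenseAnalysis with `NextToken` and keeps only `SummaryFields`, ignoring `LineItemGroups`.

```python
def extract_summary_fields(response):
    """Return one list of summary fields per expense document,
    dropping LineItemGroups and everything else in the response."""
    documents = []
    for doc in response.get("ExpenseDocuments", []):
        fields = []
        for field in doc.get("SummaryFields", []):
            fields.append({
                "type": field.get("Type", {}).get("Text"),
                "label": field.get("LabelDetection", {}).get("Text"),
                "value": field.get("ValueDetection", {}).get("Text"),
            })
        documents.append(fields)
    return documents


def collect_summary_fields(client, job_id):
    """Fetch every page of a finished job and keep only summary fields."""
    kwargs = {"JobId": job_id}
    documents = []
    while True:
        response = client.get_expense_analysis(**kwargs)
        documents.extend(extract_summary_fields(response))
        token = response.get("NextToken")
        if not token:
            return documents
        kwargs["NextToken"] = token
```

This trims what your endpoint sends back, but Textract still does the full analysis, so it won't shorten the job itself.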

StartExpenseAnalysis - Amazon Textract

AnalyzeExpense - Amazon Textract

GetExpenseAnalysis - Amazon Textract

AWS
AWS TAM
answered 3 months ago
  • Thanks for the answer.

    That's basically what I'm doing so far. I probably didn't ask clearly: I'd like to configure the StartExpenseAnalysisRequest or GetExpenseAnalysisRequest so that the response only includes the SummaryFields.

  • I haven't tried this, but I think you can use FeatureTypes:

    A list of the types of analysis to perform. Add TABLES to the list to return information about the tables that are detected in the input document. Add FORMS to return detected form data. Add SIGNATURES to return the locations of detected signatures. Add LAYOUT to the list to return information about the layout of the document. All lines and words detected in the document are included in the response (including text that isn't related to the value of FeatureTypes).
    
    Type: Array of strings
    
    Valid Values: TABLES | FORMS | QUERIES | SIGNATURES | LAYOUT
    
    Required: Yes
    
  • Yeah, but those FeatureTypes only work with DocumentAnalysis, and I'm working with ExpenseAnalysis. Thanks for the response.
