How to make my architecture asynchronous?


Hello, I have a service that uses Amazon Textract. I have a Lambda function with a trigger: when I upload an image or file of a receipt or invoice, the Lambda function runs startExpenseAnalysis. For that I have the following architecture:

My design

But then I ran into the issue that people will need the option to upload more than one file and have all of them processed, so I need an asynchronous way to handle this.

  • What changes would you recommend to make in my architecture?
  • Any other tips would be appreciated.

Thanks.

This is another question that came up while writing this (sorry, it's a secondary question): when uploading a receipt or invoice as a PDF file, it can have more than one page, and I am getting an issue processing the file with Amazon Textract. With startExpenseAnalysis it works normally, but it returns the result like this:

Response

The problem is that the result is split into two elements in ExpenseDocuments. The solution I have is to merge the PDF into a single page for those files, but I'm worried about the file size.

GerLC
asked 3 months ago · 171 views
4 Answers

Please find my comments for Answer 1: Modify the Lambda function that triggers on file upload to handle multiple files. This could involve iterating over a batch of files uploaded to S3 and implementing an SQS queue to hold messages that reference the files that need processing, so they are handled as an asynchronous process. Finally, you need to change AWS Lambda to work in an asynchronous manner.

Answer 2: Ensure that your post-processing logic can identify and handle duplicate entries that might appear across pages. Merging multiple pages into a single-page PDF, as you've mentioned, can be a solution; however, as you're concerned, this can lead to large file sizes.

Jagan
answered 3 months ago

Hello, please find my comments for Answer 1:

1. S3 Bucket for File Storage: Instead of triggering the Lambda function directly from the upload, store the uploaded files in an S3 bucket. This allows you to handle multiple files and provides a durable storage solution.

2. S3 Event Trigger: Configure an S3 event trigger on the bucket so that when new files are uploaded, it triggers a Lambda function.

3. Lambda Function to Queue Jobs: Modify your Lambda function to be a queue handler instead of starting the analysis directly. When triggered by the S3 event, the Lambda function should pick up the uploaded file(s) and enqueue a job for each file in a scalable queue service like Amazon Simple Queue Service (SQS).

4. SQS for Job Queue: Create an SQS queue where each message represents a job to be processed.

5. Asynchronous Processing Lambda: Implement a separate Lambda function that polls the SQS queue for new jobs. When a job is received, this Lambda function can then call startExpenseAnalysis for each file.

6. Processing Results: Depending on your requirements, you can handle the results differently. You might store the processed data in a database, notify users via email, or store the results back in S3.

7. Monitoring and Error Handling: Implement monitoring and error handling in your architecture. For example, CloudWatch can monitor the performance of your Lambda functions, and you can set up dead-letter queues in SQS to handle failed processing attempts.
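For the dead-letter queue in item 7, the source SQS queue's RedrivePolicy attribute points at the DLQ. A minimal sketch (the queue ARN and maxReceiveCount value are illustrative):

```json
{
  "RedrivePolicy": {
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:expense-jobs-dlq",
    "maxReceiveCount": "5"
  }
}
```

Messages that fail processing maxReceiveCount times are moved to the DLQ instead of being retried forever.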

Here's a simplified flow:

  1. User uploads files to an S3 bucket.
  2. S3 bucket triggers an event.
  3. The Lambda function (Queue Handler) is triggered by the S3 event.
  4. The Lambda function enqueues a message in an SQS queue for each file.
  5. Another Lambda function (Processing Lambda) polls the SQS queue for new jobs.
  6. The Processing Lambda function calls startExpenseAnalysis for each file.
  7. Processed results are stored or handled accordingly.

This approach allows you to handle multiple files asynchronously, making your system more scalable and flexible. Additionally, it provides better error handling and monitoring capabilities.
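Steps 3 and 4 of the flow above can be kept as a pure function. This is a minimal sketch assuming the standard S3 event-notification shape; buildJobMessages is a hypothetical helper name, and in the real queue-handler Lambda you would pass the resulting entries to SQS (for example via SendMessageBatch in the AWS SDK).

```javascript
// Hypothetical sketch: turn an S3 event into one SQS job message per file.
// The record shape (Records[].s3.bucket.name / .object.key) is the
// documented S3 notification format; everything else is illustrative.
function buildJobMessages(s3Event) {
  return s3Event.Records.map((record) => ({
    // MessageBody is what the processing Lambda will read later.
    MessageBody: JSON.stringify({
      bucket: record.s3.bucket.name,
      // S3 notifications URL-encode keys and use "+" for spaces.
      key: decodeURIComponent(record.s3.object.key.replace(/\+/g, " ")),
    }),
  }));
}
```

Keeping the event-to-message mapping pure like this also makes the handler easy to unit-test without AWS credentials.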

answered 3 months ago

Hello, please find my comments for Answer 2:

1. Review API Responses: Check the responses you receive from the startExpenseAnalysis API call. See if there are any identifiers or metadata that can help you link or group the results for multi-page documents.

2. Combine Results Client-Side: If you're receiving separate results for each page, you may consider combining the results on your end after receiving them. This would involve processing the individual results and merging them into a cohesive structure.

3. Custom Post-Processing: Develop a custom post-processing step where you analyze the individual results and consolidate them based on your business logic. You can use information like document identifiers or page numbers to associate and merge the results.

4. Limit the Number of Pages: If file size is a concern and you're still considering merging PDFs, you could explore limiting the number of pages in each file before merging. This might help control the resulting file size while still ensuring that the documents are processed correctly.

5. Optimize PDFs: Before merging, you can consider optimizing the individual PDFs to reduce file size. There are tools and libraries available that can help compress and optimize PDF files without losing important information.
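Steps 2 and 4 above could be sketched like this. Both helper names (mergeExpenseDocuments, chunkPages) are hypothetical; the only Textract-specific assumption is the documented response shape, where each entry in ExpenseDocuments carries SummaryFields and LineItemGroups.

```javascript
// Hypothetical sketch: flatten the ExpenseDocuments array returned for a
// multi-page PDF into one combined document (SummaryFields/LineItemGroups
// are the documented fields on each expense document in the response).
function mergeExpenseDocuments(expenseDocuments) {
  const merged = { SummaryFields: [], LineItemGroups: [] };
  for (const doc of expenseDocuments) {
    merged.SummaryFields.push(...(doc.SummaryFields || []));
    merged.LineItemGroups.push(...(doc.LineItemGroups || []));
  }
  return merged;
}

// Hypothetical sketch: group page numbers 1..pageCount into batches of at
// most maxPagesPerFile, so each merged PDF stays bounded in size. The
// actual page extraction/merging would be done with a PDF library.
function chunkPages(pageCount, maxPagesPerFile) {
  const groups = [];
  for (let start = 1; start <= pageCount; start += maxPagesPerFile) {
    const group = [];
    for (let p = start; p <= Math.min(start + maxPagesPerFile - 1, pageCount); p++) {
      group.push(p);
    }
    groups.push(group);
  }
  return groups;
}
```

After merging, you could still deduplicate SummaryFields by their Type, as the earlier answer about duplicate entries across pages suggests.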

answered 3 months ago

Thank you both. So it would be something like this?

  1. Upload a File to S3: When a file is uploaded to your S3 bucket, an event notification is automatically sent to an SQS queue.
  2. SQS Processes the Event: The SQS queue receives the event notification and stores it as a message.
  3. Trigger a Lambda Function: A separate Lambda function is triggered by the SQS queue. This function retrieves the event details from the SQS message and processes the file accordingly.
  4. Process the File: The Lambda function uses the Amazon Textract service to analyze the file and performs any necessary actions.
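With S3 → SQS → Lambda as in steps 1–3, each SQS record's body is the JSON-encoded S3 event notification, so the processing Lambda unwraps two layers before calling Textract. A minimal sketch, assuming the documented SQS and S3 event shapes (extractS3Objects is a hypothetical name):

```javascript
// Hypothetical sketch: unwrap an SQS-triggered Lambda event whose message
// bodies carry S3 event notifications, yielding {bucket, key} pairs.
function extractS3Objects(sqsEvent) {
  const objects = [];
  for (const sqsRecord of sqsEvent.Records) {
    const s3Event = JSON.parse(sqsRecord.body); // S3 notification JSON
    for (const s3Record of s3Event.Records || []) {
      objects.push({
        bucket: s3Record.s3.bucket.name,
        key: decodeURIComponent(s3Record.s3.object.key.replace(/\+/g, " ")),
      });
    }
  }
  return objects;
}
```

Each resulting {bucket, key} pair is what the Lambda would then reference in the DocumentLocation when calling startExpenseAnalysis.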

New architecture

While investigating, I came across EventBridge as another option, but it's for more complex scenarios.

@Jagan, what do you mean by changing AWS Lambda to work in an asynchronous manner? I'm already using an async/await handler function and calling startExpenseAnalysis and getExpenseAnalysis, which are the asynchronous APIs, instead of analyzeExpense, which is synchronous.

GerLC
answered 3 months ago
