Training Textract to Separate Text Blocks Into Separate Components with a Delimiter

I'm trying to use Textract to extract the product descriptions from our PDF catalogs in page order. The Textract analysis picks up the descriptions as text blocks, but how do I train Textract to split each product description block into its key components, such as title, author, and description?

Asked 2 years ago · Viewed 303 times
1 Answer

Hello,

Textract does not currently have a built-in capability for that level of customization, so it cannot by itself split the product descriptions in your PDF catalogs into key components such as title, author, and description.

Machine learning models trained on sample catalog pages could help automatically classify the text into different fields. Services such as Amazon SageMaker and AWS Glue can help build such models.

Answered 2 years ago
Expert
Reviewed 2 years ago
  • You can develop a post-processing system that applies rules to classify text blocks based on layout patterns, or for a more sophisticated solution, train a custom machine learning model with Amazon SageMaker to recognize and categorize the text appropriately.
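The rule-based post-processing idea could be sketched roughly as follows. This is only an illustration under assumptions that are not in the thread: each product description has already been isolated as its own group of Textract `LINE` blocks, the first line is the title, and the author line starts with "by" or "Author:". The function names, the regular expression, and the ` | ` delimiter are hypothetical; only the `BlockType`, `Text`, `Page`, and `Geometry.BoundingBox` keys come from the Textract response format.

```python
import re
from typing import Dict, List

# Assumed author-line pattern ("by Jane Doe" or "Author: Jane Doe"); adjust to your catalog.
AUTHOR_RE = re.compile(r"^(?:by|author:)\s+(.+)$", re.IGNORECASE)


def lines_in_reading_order(blocks: List[dict]) -> List[str]:
    """Return the text of LINE blocks sorted by page, then top, then left."""
    lines = [b for b in blocks if b.get("BlockType") == "LINE"]
    lines.sort(
        key=lambda b: (
            b.get("Page", 1),  # Page may be absent in single-page responses
            b["Geometry"]["BoundingBox"]["Top"],
            b["Geometry"]["BoundingBox"]["Left"],
        )
    )
    return [b["Text"] for b in lines]


def split_product_block(lines: List[str]) -> Dict[str, str]:
    """Split one product description into title, author, and description."""
    fields = {"title": "", "author": "", "description": ""}
    description_parts = []
    for raw in lines:
        text = raw.strip()
        if not text:
            continue
        match = AUTHOR_RE.match(text)
        if not fields["title"]:
            fields["title"] = text  # assumption: first non-empty line is the title
        elif match and not fields["author"]:
            fields["author"] = match.group(1).strip()
        else:
            description_parts.append(text)
    fields["description"] = " ".join(description_parts)
    return fields


def to_delimited(fields: Dict[str, str], delimiter: str = " | ") -> str:
    """Join the three fields into one delimiter-separated string."""
    return delimiter.join(
        [fields["title"], fields["author"], fields["description"]]
    )
```

Sorting purely by `Top` and `Left` is a simplification; multi-column pages or lines with slightly different baselines would likely need grouping by column or by row tolerance first. If the layout is too irregular for rules like these, the custom model route mentioned above is the fallback.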