
Issue with Knowledge Base and Tabular data


I am using Knowledge Bases for Amazon Bedrock to load and store my data and perform retrieval operations on it. But I am noticing that it is not good at understanding tabular data embedded in the text of a PDF or DOCX file.

Is there any workaround that would be beneficial for my scenario?

It would be great if you could help me with this. Thanks

2 Answers

Hi,

Yes, Bedrock KB can in some cases have trouble ingesting tabular data.

First: did you check the ingestion logs to see whether you have document-processing issues at ingestion time? This logging feature was just released: https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-bases-logging.html

Second: did you consider running such documents through Amazon Textract and feeding the KB with Textract's (processed) output instead of the document itself? Textract will probably understand the tabular data better, and you can then feed the result to the KB.
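For illustration, here's a minimal Python sketch of the post-processing step on a Textract response. The real `analyze_document` call (shown only in comments) needs boto3 and AWS credentials; the sample response below is a hand-made stand-in, not real Textract output.

```python
# Minimal sketch: pull plain-text LINE blocks out of a Textract response.
# The real call would look roughly like:
#   client = boto3.client("textract")
#   response = client.analyze_document(
#       Document={"Bytes": doc_bytes}, FeatureTypes=["TABLES", "LAYOUT"]
#   )
# Here we only show the post-processing on the JSON shape Textract returns.

def textract_lines(response: dict) -> str:
    """Join the text of all LINE blocks, in order, into one string."""
    lines = [
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]
    return "\n".join(lines)

# Tiny hand-made response for illustration only:
sample = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Quarterly revenue"},
        {"BlockType": "LINE", "Text": "Q1 2024: $1.2M"},
    ]
}
print(textract_lines(sample))
```

The resulting text (or a richer table-aware rendering, as the other answer describes) is what you would ingest into the KB instead of the raw document.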

Best,

Didier

EXPERT
answered 2 years ago
EXPERT
reviewed 2 years ago

RAG systems like Bedrock KB work by searching for relevant content chunks in your knowledge base, then feeding these into an LLM along with the question to generate an answer.

It's important to remember that modern LLMs are still pretty bad at arithmetic: so if your questions require summarising the data from multiple cells by, e.g., calculating a subtotal that isn't already explicitly present in the file, you'll likely need extra steps (like an agent with a dedicated calculator tool) for reliable results.
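To make the "calculator tool" idea concrete, here's a hypothetical sketch of such a tool: a small, safe arithmetic evaluator an agent could call instead of asking the LLM to do the sum itself. The function name and scope are illustrative assumptions, not any particular framework's API.

```python
import ast
import operator

# Hypothetical "calculator tool" for an agent: evaluates basic arithmetic
# expressions safely by walking the parsed AST (no eval of arbitrary code).
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def calculate(expression: str) -> float:
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression: {expression!r}")
    return _eval(ast.parse(expression, mode="eval").body)

# e.g. a subtotal the model would otherwise have to compute itself:
print(calculate("1250.50 + 980.25 + 310.00"))  # 2540.75
```

The agent's LLM extracts the cell values from the retrieved chunks; the tool does the arithmetic reliably.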

For tables that are very long, and especially if they span multiple pages of the document without repeating headers, the search aspect may become challenging: it's hard for even a semantic search to understand that a long sequence of numbers is relevant or irrelevant to a particular question in context.

For questions that don't need mathematical aggregation, on relatively short tables, modern Foundation Models should be pretty good. The most common stumbling block is that you need to make sure the content is indexed for search and presented to the generating LLM in an appropriate format to maximise the LLM's chance of success. Some tools just dump PDF/DOCX content to flat text, which doesn't actually preserve the tables' structure, and this would make your generation model's life very hard even if the correct chunks are retrieved by search.
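A toy example of the difference, with hypothetical data: the same two-row table as a naive flat-text dump versus a Markdown rendering that preserves which value belongs to which column.

```python
# Illustrative only: flat-text dump vs. Markdown table of the same data.
header = ["Region", "Q1", "Q2"]
rows = [["EMEA", "1.2", "1.4"], ["APAC", "0.9", "1.1"]]

# Naive dump: column boundaries are lost.
flat_dump = " ".join(header) + " " + " ".join(c for r in rows for c in r)

def to_markdown(header, rows):
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in rows]
    return "\n".join(lines)

print(flat_dump)
print(to_markdown(header, rows))
```

From the flat dump, an LLM has to guess whether "1.4" is EMEA's Q2 or APAC's Q1; the Markdown form removes the ambiguity.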

One option to better represent your documents for text-only models is to run them through Amazon Textract with LAYOUT and TABLES analyses enabled, and convert Textract's (information-rich but complex) JSON output to something like HTML or Markdown, which text LLMs handle well. Done well, this should render proper semantic representations of your table cells (e.g. <table><tr><td>...) so an LLM can "see" the structure. By converting your docs to semantic HTML before ingesting to Bedrock KB, you may be able to boost both retrieval and generation performance vs a naive flat-text ingestion. See for example the linearization configurations in Amazon Textractor, which is also used by LangChain's AmazonTextractPDFLoader. Or if you're not into Python, the Textract Response Parser for JS/TypeScript can do similar HTML rendering.
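In practice Textractor's linearization can emit this markup for you; the hand-rolled helper below just shows the target representation: semantic <table>/<tr>/<td> HTML built from a cell grid like the one Textract's TABLES analysis recovers. The function name and grid shape are assumptions for illustration.

```python
# Sketch: render a recovered cell grid as a semantic HTML table so a text
# LLM can "see" row/column structure. First row is treated as the header.
from html import escape

def grid_to_html(grid):
    out = ["<table>"]
    for i, row in enumerate(grid):
        tag = "th" if i == 0 else "td"  # header cells vs. data cells
        cells = "".join(f"<{tag}>{escape(c)}</{tag}>" for c in row)
        out.append(f"<tr>{cells}</tr>")
    out.append("</table>")
    return "".join(out)

print(grid_to_html([["Item", "Price"], ["Widget", "$4.99"]]))
```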

Another option (assuming you're able to search well) is to use a multi-modal generation model that actually analyses the document in raw/visual format. You would probably still need to ingest & search your docs via text/HTML/Markdown (e.g. OpenSearch + an embedding model), but could feed the corresponding page images into Claude 3+ to generate the answer. You could still use Bedrock KBs to power your Retrieve step for this, but would need custom orchestration (and PDF-to-image conversion) for the Generate step.
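As a sketch of what that custom Generate step might look like, here's the message payload shape for Bedrock's Converse API when combining a page image with retrieved text chunks. Only the payload construction is shown (the actual `client.converse(...)` call needs boto3 and credentials); the helper name, prompt wording, and model ID in the comment are illustrative assumptions.

```python
# Build a Converse-API message combining a page image (PNG bytes) with
# retrieved text chunks, for a multimodal model such as Claude 3.
def build_multimodal_message(question: str, chunks: list, page_png: bytes) -> list:
    prompt = "Answer using the document page image and the retrieved context.\n\n"
    prompt += "\n\n".join(chunks) + f"\n\nQuestion: {question}"
    return [{
        "role": "user",
        "content": [
            {"image": {"format": "png", "source": {"bytes": page_png}}},
            {"text": prompt},
        ],
    }]

messages = build_multimodal_message(
    "What was Q2 revenue?", ["Region EMEA Q2 1.4"], b"\x89PNG..."
)
# messages would then be passed to something like:
#   bedrock_runtime.converse(modelId="anthropic.claude-3-sonnet-...",
#                            messages=messages)
```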

tl;dr: It's all about optimizing the representation of tables in your documents as you put them through the ingestion process or present the retrieved chunks to the final generation model. Textract + Textractor/TRP can generate nice HTML or Markdown representations that preserve semantic structure (like section headers, tables, paragraphs, etc.) in your documents... But if your tables are so long and data-dense that it's difficult for search to identify the right chunk to retrieve for a question, you may need to do some extra representation engineering (like duplicating table headers between pages) to help more.

AWS
EXPERT
answered 2 years ago
