Bedrock Knowledge Base - PDF Page Numbers

Does anyone know if there is a way to enable the storage of page numbers from a processed PDF in the metadata of the embeddings?

2 Answers

Currently, Amazon Bedrock Knowledge Bases does not have a built-in feature to automatically store page numbers from processed PDFs in the metadata of embeddings. However, you can implement a workaround to achieve this functionality.

When creating a knowledge base, you have the option to provide custom metadata for each document. This metadata can include various attributes, including page numbers if you preprocess your PDFs to extract this information.

To implement this, you would need to:

  1. Preprocess your PDF documents to extract page numbers.
  2. Create metadata files for each document, including the page number information.
  3. When ingesting data into the knowledge base, ensure you upload both the document files and their corresponding metadata files.

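Each metadata file must have the same name as its source file, including the extension, with '.metadata.json' appended (for example, 'report.pdf.metadata.json'), and it should sit alongside the source file in the S3 data source. Custom attributes such as a page number go under the 'metadataAttributes' key. Note that a metadata file applies to the whole source file, so a page-level attribute is only meaningful if each page (or page range) is uploaded as its own file.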
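As a rough illustration of steps 2 and 3, here is a minimal Python sketch that writes the sidecar metadata file for one page. It assumes the original PDF has already been split into one file per page by some other tool; the attribute names `page_number` and `source_document` and the helper `write_page_metadata` are hypothetical, chosen only for this example.

```python
import json
from pathlib import Path


def write_page_metadata(page_file: Path, page_number: int, source_document: str) -> Path:
    """Write the '.metadata.json' sidecar for one single-page PDF.

    Assumption: the original PDF was already split into one file per page.
    """
    metadata = {
        "metadataAttributes": {
            # Hypothetical attribute names; pick whatever suits your data.
            "page_number": page_number,  # kept numeric so range filters can work
            "source_document": source_document,
        }
    }
    # Sidecar name = full source file name (including '.pdf') + '.metadata.json'
    sidecar = page_file.with_name(page_file.name + ".metadata.json")
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return sidecar
```

Calling `write_page_metadata(Path("out/report_page_0003.pdf"), 3, "report.pdf")` would create `out/report_page_0003.pdf.metadata.json` (provided the `out` directory exists). Both files would then be uploaded to the S3 data source before syncing.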

When querying the knowledge base, you can then use metadata filtering to narrow down results based on specific page numbers or ranges. This feature allows you to pre-filter retrievals from vector stores based on document metadata, which can improve retrieval accuracy and reduce the number of chunks that need to be searched.
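A page-range filter could look roughly like the sketch below. It assumes a custom numeric attribute named `page_number` (a hypothetical name) was ingested with each file, and it uses the boto3 `bedrock-agent-runtime` `retrieve` call. The helper names are made up for this example. The last helper assumes custom attributes are returned in each result's `metadata` map; that is worth verifying against your own responses.

```python
from typing import Any


def page_range_filter(first_page: int, last_page: int) -> dict[str, Any]:
    """Build a retrieval filter matching chunks whose page_number is in range."""
    return {
        "andAll": [
            {"greaterThanOrEquals": {"key": "page_number", "value": first_page}},
            {"lessThanOrEquals": {"key": "page_number", "value": last_page}},
        ]
    }


def retrieve_from_pages(kb_id: str, query: str, first_page: int, last_page: int,
                        max_results: int = 5) -> dict[str, Any]:
    """Query the knowledge base, pre-filtered to a page range."""
    import boto3  # imported lazily so the pure helpers work without boto3

    client = boto3.client("bedrock-agent-runtime")
    return client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {
                "numberOfResults": max_results,
                "filter": page_range_filter(first_page, last_page),
            }
        },
    )


def cited_pages(response: dict[str, Any]) -> list[int]:
    """Collect the distinct page numbers found in the results' metadata.

    Assumption: the custom 'page_number' attribute comes back under 'metadata'.
    """
    pages = set()
    for result in response.get("retrievalResults", []):
        page = result.get("metadata", {}).get("page_number")
        if page is not None:
            pages.add(int(page))
    return sorted(pages)
```

If the attribute is returned as assumed, `cited_pages` gives a way to show users which pages a response drew on, in addition to filtering on them.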

Keep in mind that metadata filtering in Knowledge Bases for Amazon Bedrock currently supports boolean, string, double, and integer data types. So you could store page numbers as integers in your metadata.

While this approach requires some additional preprocessing work, it provides a way to incorporate page number information into your knowledge base queries and improve the specificity of your retrievals.
Sources
Knowledge Bases for Amazon Bedrock now supports metadata filtering to improve retrieval accuracy | AWS Machine Learning Blog
Knowledge Bases for Amazon Bedrock now supports metadata filtering

answered a month ago
EXPERT
reviewed a month ago
  • Our use case is to report the page numbers of the source documents from which a knowledge base response was built. Users could then go to the pertinent pages of the cited documents if the response alone is not enough. My understanding is that the workaround above is for filtering the query rather than for returning additional information in the response. Is that correct?

After following the guidelines at https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.VectorDB.html and trying to sync with S3 as the data source and Aurora PostgreSQL as the vector store, I'm getting this error:

    The vector database encountered an error while processing the request: Named parameter syntax is invalid, input: x-amz-bedrock-kb-document-page-number (Service: RdsData, Status Code: 400, Request ID: e50356c2-f2ba-44e7-95ee-8c0b8f123415)

Any idea? Thanks

answered 15 days ago
