How to include page-level metadata (like page numbers) while chunking documents for ingestion into AWS Knowledge Bases?

I'm currently working on ingesting documents into AWS Knowledge Bases, and I'm already able to get document-level metadata using the default setup. However, I want to go a step further and:

  • Perform page-level chunking, OR
  • Include page number metadata for each chunk

The goal is to map each chunk back to its original page number within the document. This will help with more granular tracing and relevance during retrieval.

What I'm looking for:

  • Best practices for extracting and retaining page-level context during chunking
  • Strategies/tools to parse documents (like PDFs) into pages and preserve that info as metadata
  • Whether, and how, this custom metadata can be injected into chunks before sending them to AWS Knowledge Bases

Has anyone implemented a similar setup, or does anyone have suggestions for achieving page-level mapping alongside document-level metadata in AWS KB ingestion pipelines?

1 Answer

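Amazon Bedrock Knowledge Bases does not, in general, let you attach arbitrary page-level metadata to chunks automatically. The one built-in exception, described further down, is the page-number field it generates for PDFs when Amazon OpenSearch Serverless is the vector store. For other setups, or for richer page-level context, there are workarounds you can implement.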

To include page-level metadata like page numbers while chunking documents for ingestion into AWS Knowledge Bases, you can follow these steps:

  1. Preprocess your PDF documents to extract page numbers and other relevant page-level information.

  2. Create metadata files for each document, including the page number information. Each metadata file takes the full name of its source file, extension included, followed by '.metadata.json' (for example, 'report.pdf.metadata.json'), and is stored alongside the source file.

  3. When ingesting data into the knowledge base, upload both the document files and their corresponding metadata files.

In your metadata files, you can include page numbers as a custom metadata attribute. Amazon Bedrock Knowledge Bases supports metadata filtering on boolean, string, double, and integer values, so a page number can be stored as an integer.
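As a minimal sketch of what such a metadata file could look like, the helper below writes a sidecar file next to a source file. The `metadataAttributes` wrapper is the documented top-level key; the helper name `write_metadata_sidecar` and the attribute names `page_number` and `source_document` are illustrative assumptions, not names the service requires.

```python
import json
from pathlib import Path


def write_metadata_sidecar(source_file: Path, attributes: dict) -> Path:
    """Write '<source file name>.metadata.json' next to the source file."""
    # The suffix is appended to the full file name, extension included.
    sidecar = source_file.with_name(source_file.name + ".metadata.json")
    payload = {"metadataAttributes": attributes}
    sidecar.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return sidecar


# Hypothetical usage, for a file holding the text of page 3 of report.pdf:
# write_metadata_sidecar(
#     Path("report_page_0003.txt"),
#     {"source_document": "report.pdf", "page_number": 3},
# )
```

Both the source file and the sidecar would then be uploaded to the same location in the data source bucket.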

To implement this approach:

  1. Use a PDF parsing library (like pypdf, the maintained successor to PyPDF2, or pdfplumber in Python) to extract text and page numbers from your documents.

  2. Create a custom chunking function that not only splits the text into appropriate chunks but also keeps track of which page each chunk comes from.

  3. For each chunk, create a metadata entry that includes the page number and any other relevant information.

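  4. When ingesting the data, ensure that each chunk is associated with its corresponding metadata. Because a '.metadata.json' file describes its whole source file, one way to do this is to upload each page or chunk as its own file with its own metadata file, and to select the no-chunking strategy so each file stays a single chunk.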
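The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: `PageChunk`, `extract_pages`, and `chunk_pages` are hypothetical names, the character-window splitting is just one simple strategy, and the extraction step assumes the pypdf package (the successor to PyPDF2) is installed.

```python
from dataclasses import dataclass


@dataclass
class PageChunk:
    text: str
    page_number: int  # 1-based page the chunk was cut from
    chunk_index: int  # position of the chunk within its page


def extract_pages(pdf_path: str) -> list[str]:
    """Return one text string per page, using pypdf (pip install pypdf)."""
    from pypdf import PdfReader  # imported lazily; only needed for real PDFs

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def chunk_pages(pages: list[str], max_chars: int = 1000, overlap: int = 100) -> list[PageChunk]:
    """Split each page into overlapping character windows.

    Chunks never span two pages, so every chunk maps to exactly one page.
    """
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    step = max_chars - overlap
    chunks = []
    for page_number, text in enumerate(pages, start=1):
        text = text.strip()
        start = 0
        index = 0
        while start < len(text):
            chunks.append(PageChunk(text[start:start + max_chars], page_number, index))
            index += 1
            if start + max_chars >= len(text):
                break  # this window already reached the end of the page
            start += step
    return chunks
```

Keeping chunks inside page boundaries trades some cross-page context for an unambiguous page number per chunk. If sentences that straddle a page break matter, a chunk could instead record a page range.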

While this method requires additional preprocessing work, it provides a way to incorporate page number information into your knowledge base queries and improve the specificity of your retrievals.
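To illustrate how a stored page number could be used at query time, here is a hedged sketch that restricts retrieval to a single page through a metadata filter. It assumes the custom attribute was stored under the hypothetical key `page_number`. The `retrieve` call belongs to the boto3 `bedrock-agent-runtime` client, while the helper names are illustrative.

```python
def page_filter_config(page_number: int, top_k: int = 5, key: str = "page_number") -> dict:
    """Build a retrievalConfiguration that only matches chunks from one page."""
    return {
        "vectorSearchConfiguration": {
            "numberOfResults": top_k,
            # 'key' must match the attribute name used in the metadata files.
            "filter": {"equals": {"key": key, "value": page_number}},
        }
    }


def retrieve_from_page(knowledge_base_id: str, query: str, page_number: int) -> list:
    """Query the knowledge base, keeping only results from the given page."""
    import boto3  # imported lazily; needs AWS credentials and a region

    client = boto3.client("bedrock-agent-runtime")
    response = client.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={"text": query},
        retrievalConfiguration=page_filter_config(page_number),
    )
    return response["retrievalResults"]
```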

It's worth noting that if you're using Amazon OpenSearch Serverless for your vector store and have PDF documents in your data source, Amazon Bedrock knowledge bases will automatically generate document page numbers and store them in a metadata field called 'x-amz-bedrock-kb-document-page-number'. However, this feature is not supported if you choose no chunking for your documents.
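When that built-in field is present, it can be read from the `metadata` of each entry in the `retrievalResults` list returned by the `bedrock-agent-runtime` `retrieve` call. The sketch below is an illustration only: `pages_from_results` is a hypothetical helper, it assumes an S3 data source, and it coerces the value to `int` on the assumption that the number may come back as a float.

```python
PAGE_KEY = "x-amz-bedrock-kb-document-page-number"


def pages_from_results(retrieval_results: list) -> list:
    """Return (source URI, page number or None) for each retrieval result."""
    pairs = []
    for result in retrieval_results:
        raw = result.get("metadata", {}).get(PAGE_KEY)
        # The field is absent for non-PDF sources or unsupported setups.
        page = int(raw) if raw is not None else None
        uri = result.get("location", {}).get("s3Location", {}).get("uri", "")
        pairs.append((uri, page))
    return pairs
```

Results without the field come back with `None`, which makes it easy to spot documents for which no page number was generated.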

If you've already created a knowledge base without this information, you can re-sync its data source so that the page numbers are picked up.
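A re-sync can be triggered from the console or programmatically. As a rough sketch, the function below starts an ingestion job and polls until it finishes. It expects a boto3 `bedrock-agent` client to be passed in; the function name, the polling interval, and the IDs in the usage comment are placeholders.

```python
import time


def resync_data_source(client, knowledge_base_id: str, data_source_id: str,
                       poll_seconds: float = 10.0) -> str:
    """Start an ingestion job (a sync) and wait for a terminal status."""
    job = client.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
    )["ingestionJob"]
    while job["status"] not in ("COMPLETE", "FAILED", "STOPPED"):
        time.sleep(poll_seconds)
        job = client.get_ingestion_job(
            knowledgeBaseId=knowledge_base_id,
            dataSourceId=data_source_id,
            ingestionJobId=job["ingestionJobId"],
        )["ingestionJob"]
    return job["status"]


# Hypothetical usage (IDs are placeholders):
# import boto3
# status = resync_data_source(boto3.client("bedrock-agent"), "KB_ID", "DS_ID")
```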

Remember that while this approach allows for more granular tracing and relevance during retrieval, it does require careful preprocessing and metadata management. You may need to experiment with different chunking strategies to find the right balance between chunk size, overlap, and metadata granularity for your specific use case.
Sources
Bedrock Knowledge Base - PDF Page Numbers | AWS re:Post
KnowledgeBase chunking customization | AWS re:Post
Include metadata in a data source to improve knowledge base query - Amazon Bedrock

answered 20 days ago
  • But what about x-amz-bedrock-kb-document-page-number, which is a metadata field I get from my PDF?
