Currently, Amazon Bedrock Knowledge Bases has no general, configurable option for storing page numbers from processed PDFs in the metadata of embeddings (the one exception, for Amazon OpenSearch Serverless vector stores, is described further down). However, there are workarounds you can implement to achieve this functionality.
To include page-level metadata such as page numbers while chunking documents for ingestion into Amazon Bedrock Knowledge Bases, follow these steps:
- Preprocess your PDF documents to extract page numbers and other relevant page-level information.
- Create metadata files for each document, including the page number information. These metadata files should have the same name as the source data files with a '.metadata.json' suffix.
- When ingesting data into the knowledge base, upload both the document files and their corresponding metadata files.
In your metadata files, you can include page numbers as a custom metadata attribute. Amazon Bedrock Knowledge Bases supports metadata filtering on boolean, string, double, and integer values, so you could store page numbers as integers in your metadata.
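As a rough illustration of the metadata file described above, here is a minimal sketch that writes the sidecar file for one source file. It assumes the PDF has already been split into one file per page, since a metadata file describes its whole source file. The function name `write_page_metadata`, the file name `report_page_3.pdf`, and the attribute name `page_number` are hypothetical. The top-level `metadataAttributes` key follows the format given in the Amazon Bedrock documentation as I understand it, so check it against the current docs before relying on it.

```python
import json
from pathlib import Path


def write_page_metadata(source_file: str, page_number: int, out_dir: str = ".") -> Path:
    """Write '<source file name>.metadata.json' next to a single-page source file.

    Assumption: 'page_number' is an illustrative custom attribute name,
    not a name reserved by the service.
    """
    source = Path(source_file)
    # The metadata file keeps the full source name and appends the suffix,
    # e.g. report_page_3.pdf -> report_page_3.pdf.metadata.json
    metadata_path = Path(out_dir) / f"{source.name}.metadata.json"
    payload = {"metadataAttributes": {"page_number": int(page_number)}}
    metadata_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return metadata_path
```

Calling `write_page_metadata("report_page_3.pdf", 3)` produces a file whose content is `{"metadataAttributes": {"page_number": 3}}`, which you would upload alongside the page file.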
To implement this approach:
- Use a PDF parsing library (like PyPDF2 or pdfplumber in Python) to extract text and page numbers from your documents.
- Create a custom chunking function that not only splits the text into appropriate chunks but also keeps track of which page each chunk comes from.
- For each chunk, create a metadata entry that includes the page number and any other relevant information.
- When ingesting the data, ensure that each chunk is associated with its corresponding metadata.
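The chunking step above could be sketched as follows. This is only one possible approach: it chunks each page separately so that every chunk maps to exactly one page, and it uses a simple fixed-size character window with overlap. The function name `chunk_pages` and the parameters `max_chars` and `overlap` are hypothetical. The input is a list of per-page text strings, which you would obtain from your PDF library (for example, with pypdf, something like `[page.extract_text() or "" for page in PdfReader(path).pages]`).

```python
def chunk_pages(pages, max_chars=500, overlap=50):
    """Split per-page text into overlapping chunks tagged with a 1-based page number.

    pages: list of strings, one per PDF page, in document order.
    Returns a list of {"text": ..., "page_number": ...} dicts.
    """
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("require max_chars > 0 and 0 <= overlap < max_chars")

    step = max_chars - overlap  # how far the window advances each time
    chunks = []
    for page_number, text in enumerate(pages, start=1):
        text = text.strip()
        for start in range(0, len(text), step):
            chunks.append({"text": text[start:start + max_chars],
                           "page_number": page_number})
            # Stop once the window has reached the end of the page,
            # so no trailing chunk consists only of overlap.
            if start + max_chars >= len(text):
                break
    return chunks
```

Chunking per page keeps the page attribution unambiguous, at the cost of never producing a chunk that spans a page break. If cross-page chunks matter for your documents, you would need to store a page range instead of a single number.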
While this method requires additional preprocessing work, it provides a way to incorporate page number information into your knowledge base queries and improve the specificity of your retrievals.
It's worth noting that if you're using Amazon OpenSearch Serverless for your vector store and have PDF documents in your data source, Amazon Bedrock Knowledge Bases will automatically generate document page numbers and store them in a metadata field called 'x-amz-bedrock-kb-document-page-number'. However, this feature is not supported if you choose the no-chunking option for your documents.
If you've already created a knowledge base without this information, you can re-sync it to reflect the page numbers.
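If that field is present, reading it back from retrieval results might look like the sketch below. It assumes the response has the shape returned by the `Retrieve` operation of the `bedrock-agent-runtime` client (a `retrievalResults` list whose items carry `content` and `metadata`). The helper name `pages_from_results` is hypothetical, and because I am not certain whether the page number comes back as an integer, a float, or a string, the code normalizes it.

```python
PAGE_KEY = "x-amz-bedrock-kb-document-page-number"


def pages_from_results(response):
    """Return (page_number, chunk_text) pairs from a Retrieve-style response dict.

    page_number is None when the field is absent, e.g. for non-PDF sources
    or a knowledge base that has not been re-synced.
    """
    rows = []
    for result in response.get("retrievalResults", []):
        raw = result.get("metadata", {}).get(PAGE_KEY)
        # Normalize 3, 3.0, or "3" to the integer 3 (assumption: the exact
        # type returned by the service is not confirmed here).
        page = int(float(raw)) if raw is not None else None
        rows.append((page, result.get("content", {}).get("text", "")))
    return rows
```

In practice `response` would come from a call such as `client.retrieve(knowledgeBaseId=..., retrievalQuery={"text": ...})`; the function itself only inspects the returned dictionary, so it can be tried on a hand-written sample first.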
Remember that while this approach allows for more granular tracing and relevance during retrieval, it does require careful preprocessing and metadata management. You may need to experiment with different chunking strategies to find the right balance between chunk size, overlap, and metadata granularity for your specific use case.
But what about x-amz-bedrock-kb-document-page-number, which is a metadata field I already get from my PDF?