Your feature requests for Amazon Bedrock Data Automation (BDA) and Knowledge Bases highlight important considerations for enterprise document processing workflows.
Regarding BDA model selection, you're right that currently there are limited options. Amazon Bedrock offers three parsing strategies when creating a knowledge base: the Bedrock default parser (for text-only content), Bedrock Data Automation (BDA) for multimodal content, and the Foundation model parser for complex document structures. The Foundation model parser might address some of your concerns as it provides advanced parsing capabilities specifically designed for complex document structures, including PDFs with tables and visually rich content.
For promptable parsing and output shaping, the Foundation model parser does offer some customization options. When using this parser, you can customize the default prompt used for data extraction, which could potentially address some of your requirements like handling headers/footers or enforcing terminology consistency.
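As a rough illustration of what a customized parsing prompt might look like, the sketch below builds the parsing portion of a data source configuration as a plain Python dictionary. The field names follow the `vectorIngestionConfiguration` shape of the Bedrock Agent `CreateDataSource` API as I understand it, so please verify them against the current API reference. The model ARN and the prompt text are hypothetical placeholders, and no AWS call is made here.

```python
def build_fm_parsing_config(model_arn: str, prompt_text: str) -> dict:
    """Return a parsingConfiguration dict that selects the foundation model parser.

    Field names are assumed from the CreateDataSource API shape; check the
    current API reference before relying on them.
    """
    if not prompt_text.strip():
        raise ValueError("prompt_text must not be empty")
    return {
        "parsingStrategy": "BEDROCK_FOUNDATION_MODEL",
        "bedrockFoundationModelConfiguration": {
            "modelArn": model_arn,
            "parsingPrompt": {"parsingPromptText": prompt_text},
        },
    }


# Hypothetical values, for illustration only.
EXAMPLE_MODEL_ARN = "arn:aws:bedrock:us-east-1::foundation-model/example-model-id"
EXAMPLE_PROMPT = (
    "Transcribe the page content as Markdown. "
    "Omit repeated page headers and footers. "
    "Keep product terminology exactly as written."
)

parsing_config = build_fm_parsing_config(EXAMPLE_MODEL_ARN, EXAMPLE_PROMPT)
```

A dictionary like this would typically be placed under the `parsingConfiguration` key of the ingestion configuration when the data source is created. How closely the parser follows instructions such as dropping headers and footers depends on the model, so it is worth testing on a few representative documents.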
Regarding preserving structure across page boundaries, this is a known limitation in traditional RAG systems. When documents are divided into chunks, context can be lost between sections. Amazon Bedrock offers several chunking strategies that might help with this issue:
- Semantic chunking - This divides text into meaningful segments based on semantic similarity rather than arbitrary character counts, which helps preserve context and meaning.
- Hierarchical chunking - Particularly useful for complex documents with nested structures like technical documents or academic papers with complex formatting and nested tables.
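The two strategies above could be expressed roughly as follows. This is a minimal sketch that only builds configuration dictionaries; the field names are assumed from the `chunkingConfiguration` shape of the `CreateDataSource` API, and the token counts are arbitrary example values rather than recommendations.

```python
def semantic_chunking(max_tokens: int = 300,
                      buffer_size: int = 1,
                      breakpoint_percentile: int = 95) -> dict:
    """Chunking config that splits where semantic similarity drops."""
    return {
        "chunkingStrategy": "SEMANTIC",
        "semanticChunkingConfiguration": {
            "maxTokens": max_tokens,
            "bufferSize": buffer_size,
            "breakpointPercentileThreshold": breakpoint_percentile,
        },
    }


def hierarchical_chunking(parent_tokens: int = 1500,
                          child_tokens: int = 300,
                          overlap_tokens: int = 60) -> dict:
    """Chunking config with large parent chunks and smaller child chunks."""
    if child_tokens >= parent_tokens:
        raise ValueError("child chunks should be smaller than parent chunks")
    return {
        "chunkingStrategy": "HIERARCHICAL",
        "hierarchicalChunkingConfiguration": {
            # First level is the parent layer, second level is the child layer.
            "levelConfigurations": [
                {"maxTokens": parent_tokens},
                {"maxTokens": child_tokens},
            ],
            "overlapTokens": overlap_tokens,
        },
    }


semantic_config = semantic_chunking()
hierarchical_config = hierarchical_chunking()
```

With hierarchical chunking, retrieval matches on the small child chunks while the larger parent chunk supplies the surrounding context, which is why it can help when a table or section spans a page boundary.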
For configurable Knowledge Bases, Amazon Bedrock does allow some customization through custom AWS Lambda functions. You can use these to customize the chunking process to align with your specific RAG application requirements.
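A Lambda-based customization might be wired in along these lines. This sketch assumes the `customTransformationConfiguration` shape of the `CreateDataSource` API and that built-in chunking is turned off with the `NONE` strategy so the Lambda function controls chunking; both are worth confirming in the current documentation. The Lambda ARN and S3 URI are hypothetical, and the Lambda function's own input/output contract is not shown.

```python
def build_custom_chunking_config(lambda_arn: str, intermediate_s3_uri: str) -> dict:
    """Return ingestion settings that hand chunking to a custom Lambda function.

    Field names are assumed from the CreateDataSource API shape.
    """
    if not intermediate_s3_uri.startswith("s3://"):
        raise ValueError("intermediate_s3_uri must be an s3:// URI")
    return {
        # Disable built-in chunking so the Lambda function decides chunk boundaries.
        "chunkingConfiguration": {"chunkingStrategy": "NONE"},
        "customTransformationConfiguration": {
            # Bucket location used to pass documents to and from the function.
            "intermediateStorage": {"s3Location": {"uri": intermediate_s3_uri}},
            "transformations": [
                {
                    "stepToApply": "POST_CHUNKING",
                    "transformationFunction": {
                        "transformationLambdaConfiguration": {"lambdaArn": lambda_arn}
                    },
                }
            ],
        },
    }


# Hypothetical identifiers, for illustration only.
custom_config = build_custom_chunking_config(
    "arn:aws:lambda:us-east-1:123456789012:function:example-chunker",
    "s3://example-intermediate-bucket/kb-staging/",
)
```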
Another approach worth considering is contextual retrieval, which adds explanatory context to each chunk before generating embeddings. This enriches the vector representation with relevant contextual information, enabling more accurate retrieval of semantically related content.
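To make the idea concrete, here is a small sketch of the enrichment step in plain Python. In a real pipeline the context for each chunk would usually be generated by a foundation model that sees the whole document; here a simple stand-in (`heading_describer`, a hypothetical helper) looks up the nearest preceding Markdown-style heading so the example runs without any service calls.

```python
from typing import Callable, List


def contextualize_chunks(document: str,
                         chunks: List[str],
                         describe: Callable[[str, str], str]) -> List[str]:
    """Prepend a short context string to each chunk before it is embedded."""
    enriched = []
    for chunk in chunks:
        context = describe(document, chunk).strip()
        enriched.append(f"{context}\n\n{chunk}" if context else chunk)
    return enriched


def heading_describer(document: str, chunk: str) -> str:
    """Stand-in for a model call: name the nearest heading above the chunk."""
    position = document.find(chunk)
    if position < 0:
        return ""
    heading = ""
    for line in document[:position].splitlines():
        if line.startswith("#"):
            heading = line.lstrip("#").strip()
    return f"From section: {heading}" if heading else ""


doc = "# Dosage\nTake 10 mg daily.\n# Storage\nKeep below 25 C."
enriched = contextualize_chunks(
    doc, ["Take 10 mg daily.", "Keep below 25 C."], heading_describer
)
# enriched[0] == "From section: Dosage\n\nTake 10 mg daily."
```

The enriched text is what gets embedded, so a chunk that has lost its heading during splitting still carries enough context to be retrieved for a query about that section.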
These features and approaches can help address some of your concerns, though not all of your specific feature requests are currently available in the service.
Sources
Revolutionizing drug data analysis using Amazon Bedrock multimodal RAG capabilities | Artificial Intelligence
Improve search results and accurate in Amazon Bedrock | AWS re:Post
Contextual retrieval in Anthropic using Amazon Bedrock Knowledge Bases | AWS Machine Learning Blog
Create a knowledge base for multimodal content - Amazon Bedrock