Bedrock KB data source sync error with no chunking and file of 600 words

Hi, I tried to create a KB in Bedrock from .txt files in S3 with "Chunking strategy: No", and when syncing the data source I get "Error warning: The server encountered an internal error while processing the request." whenever the input file has more than ~550 words. Why is that? Titan Embeddings should accept 8192 tokens, so roughly that many words, right?

Model: Titan Text Embeddings v2, Vector dimensions: 1024

1 Answer

1. Enable Chunking

  • Why?: Even though the Titan Text Embeddings v2 model can handle 8192 tokens, ingestion adds its own processing overhead. Enabling chunking splits the text into manageable sections that fit within system limits.

  • Solution: Try enabling chunking and setting a chunk size that is well within the system’s tolerance (e.g., 500 tokens per chunk). This should help you avoid triggering the error while still processing the full content of your file.

How to enable chunking:

  • In your knowledge base data source settings, switch the chunking strategy from "No chunking" to an appropriate method (such as default, fixed-size, or semantic chunking); an API sketch follows the link below.

https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html
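If you manage the data source through the API rather than the console, the chunking strategy is set when the data source is created. Below is a minimal sketch using boto3's bedrock-agent client; the knowledge base ID, bucket ARN, data source name, and chunk sizes are placeholder assumptions you would replace with your own values:

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Assumed placeholders: knowledge base ID and bucket ARN are examples only.
response = bedrock_agent.create_data_source(
    knowledgeBaseId="KB_ID_PLACEHOLDER",
    name="my-txt-files",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::my-kb-bucket"},
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 500,         # keep each chunk well under the model limit
                "overlapPercentage": 10,  # small overlap to preserve context between chunks
            },
        }
    },
)
print(response["dataSource"]["dataSourceId"])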

2. Reduce File Size Per Sync

  • Why?: The error might be due to an internal server constraint when processing larger files. Even though 600 words should fit within 8192 tokens, it's safer to break files into smaller pieces.

  • Solution: Break your text files into smaller sections manually, ensuring each section has around 500 words. Sync smaller sections one by one to avoid the error.
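A rough sketch of splitting a file manually before upload; the local file name, bucket, key prefix, and 500-word target are assumptions to adjust for your setup:

import boto3

WORDS_PER_PART = 500  # assumed target size per section

s3 = boto3.client("s3")

with open("document.txt", encoding="utf-8") as f:
    words = f.read().split()

# Write each ~500-word slice to its own S3 object so every file stays small.
for i in range(0, len(words), WORDS_PER_PART):
    part = " ".join(words[i:i + WORDS_PER_PART])
    key = f"kb-input/document_part_{i // WORDS_PER_PART:03d}.txt"
    s3.put_object(Bucket="my-kb-bucket", Key=key, Body=part.encode("utf-8"))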

3. Preprocess the Input Text

  • Why?: The embedding model may be consuming more tokens than a plain word count suggests, because special characters, punctuation, and unusual formatting all add tokens.

  • Solution: Preprocess your text before uploading it. Remove unnecessary special characters, extra spaces, and ensure that the input is simplified to reduce token overhead.

    • For instance, remove excess punctuation or break complex sentences into simpler ones (see the cleanup sketch after the link below).

https://docs.aws.amazon.com/comprehend/latest/dg/what-is.html
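As an illustration (not tied to any Bedrock API), here is a minimal cleanup pass using Python's standard library; the regex patterns are assumptions about what counts as noise in your files:

import re

def preprocess(text: str) -> str:
    # Collapse runs of whitespace (newlines, tabs, repeated spaces) to a single space.
    text = re.sub(r"\s+", " ", text)
    # Reduce repeated punctuation such as "!!!" or "...." to a single mark.
    text = re.sub(r"([!?.,])\1+", r"\1", text)
    # Drop control characters and other non-printable bytes.
    text = "".join(ch for ch in text if ch.isprintable())
    return text.strip()

print(preprocess("Hello   world!!!   This\tis  messy....  text."))
# -> "Hello world! This is messy. text."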

4. Monitor Token Usage

  • Why?: Although you might be counting words, tokens are a more accurate measure of how much input is being processed. Tools like tokenizers can help estimate the token count for your input.

  • Solution: Use a tokenizer (for example, one from Hugging Face’s transformers library) to estimate how many tokens your input uses, and keep it well within the limit (e.g., under 8,000 tokens).

# Rough estimate only: the GPT-2 tokenizer is not the Titan tokenizer,
# but it gives a ballpark token count for plain English text.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
text = "Your input text here"
token_count = len(tokenizer.encode(text))
print(f"Number of tokens: {token_count}")

https://huggingface.co/docs/tokenizers/python/latest/index.html

5. Contact AWS Support

  • Why?: If none of the above solutions work, there may be an internal limitation or bug within Bedrock that requires technical support.

  • Solution: Reach out to AWS Support, providing details about your error, file size, and settings. They might be able to increase server-side limits or offer another workaround.

6. Check Model & Service Limits

  • Why?: Bedrock enforces service quotas, and the embedding model has its own input limits; either can affect ingestion, especially with custom configurations.

  • Solution: Review any Bedrock or embedding model documentation for limits or configuration settings that might help resolve the issue.

https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html
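If you want to inspect the published quotas programmatically, the Service Quotas API can list them. A short sketch, assuming your credentials and region are already configured:

import boto3

quotas = boto3.client("service-quotas")

# List the quotas AWS publishes for Bedrock in the current region.
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="bedrock"):
    for q in page["Quotas"]:
        print(f'{q["QuotaName"]}: {q["Value"]}')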

EXPERT
answered a month ago
