After creating a Knowledge Base, you can use the APIs to ingest data from your S3 bucket into your vector database. According to the documentation: 'After you create your knowledge base, you ingest or sync your data so that the data can be queried.'
We have been testing whether removing a document from S3 also removes its vectors from the database. We have not yet tested whether these vectors are updated when a document is replaced with one of exactly the same name.
The following sequence of events was conducted repeatedly to ensure consistent results.
Using a folder created in the S3 bucket (testing-folder):
Single PDF file uploaded to S3 bucket
• Lambda Function was observed as being triggered
• Knowledge Base observed as syncing
• Lambda Function completes without error
• Knowledge Base observed as Available
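For context, a sync step of this kind can be sketched as below. This is a minimal sketch, not the actual Lambda used in the tests: it assumes the function starts an ingestion job through boto3's `bedrock-agent` client and polls it until it finishes. The helper name `run_sync` and the `KB_ID_PLACEHOLDER` / `DS_ID_PLACEHOLDER` values are hypothetical. The `statistics` block returned with the job is worth logging, since it reports how many documents the sync believes it indexed and deleted.

```python
import time

# Ingestion job states after which polling can stop.
TERMINAL_STATES = {"COMPLETE", "FAILED", "STOPPED"}


def run_sync(client, kb_id, ds_id, poll_seconds=5.0, max_polls=60):
    """Start an ingestion job and return (final status, statistics dict).

    `client` is expected to behave like boto3.client("bedrock-agent").
    """
    job = client.start_ingestion_job(
        knowledgeBaseId=kb_id, dataSourceId=ds_id
    )["ingestionJob"]
    job_id = job["ingestionJobId"]

    for _ in range(max_polls):
        if job["status"] in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
        job = client.get_ingestion_job(
            knowledgeBaseId=kb_id, dataSourceId=ds_id, ingestionJobId=job_id
        )["ingestionJob"]

    return job["status"], job.get("statistics", {})


def lambda_handler(event, context):
    # boto3 is imported here so run_sync can be exercised without AWS access.
    import boto3

    client = boto3.client("bedrock-agent")
    # Placeholder IDs: replace with the real knowledge base and data source IDs.
    status, stats = run_sync(client, "KB_ID_PLACEHOLDER", "DS_ID_PLACEHOLDER")
    print(status, stats)
    return {"status": status, "statistics": stats}
```

If the statistics for the sync that follows a delete show zero deleted documents, that would suggest the sync never registered the removal, rather than the vector store failing to act on it.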
Test query sent into LLM based on known text within the PDF file: the expected result from the LLM was received.
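A test query of this kind can also be checked at the retrieval level, which shows which source file each returned chunk came from. The following is a minimal sketch, assuming boto3's `bedrock-agent-runtime` client and its `retrieve` call; the helper names `sources_for_query` and `mentions_file` are hypothetical, and the knowledge base ID is supplied by the caller.

```python
def sources_for_query(runtime_client, kb_id, question, top_k=5):
    """Return (score, S3 URI) pairs for the chunks retrieved for a question.

    `runtime_client` is expected to behave like
    boto3.client("bedrock-agent-runtime").
    """
    response = runtime_client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    hits = []
    for result in response.get("retrievalResults", []):
        # Each result carries the S3 location of the document it came from.
        uri = result.get("location", {}).get("s3Location", {}).get("uri")
        hits.append((result.get("score"), uri))
    return hits


def mentions_file(hits, s3_uri):
    """True if any retrieved chunk is attributed to the given S3 object."""
    return any(uri == s3_uri for _, uri in hits)
```

Running `mentions_file` with the URI of a deleted document after a sync gives a direct yes/no answer on whether its chunks are still being retrieved, independent of how the LLM phrases its reply.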
Another single PDF file uploaded to S3 bucket
• Lambda Function was observed as being triggered
• Knowledge Base observed as syncing
• Lambda Function completes without error
• Knowledge Base observed as Available
Test query sent into LLM based on known text within the new PDF file: the expected result from the LLM was received.
Test query sent into LLM based on known text within the original PDF file: the expected result from the LLM was received.
• Above two tests gave correct results
Single PDF file deleted from S3 bucket
• Lambda Function was observed as being triggered
• Knowledge Base observed as syncing
• Lambda Function completes without error
• Knowledge Base observed as Available
Test query sent into LLM based on known text within the deleted PDF file: we received results back based on the contents of that file – this is not correct.
Pinecone database index queried, and it was observed that data from the deleted file was still present.
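A check of this kind can be sketched as below: query the index and count the returned chunks per source document. This is a sketch under assumptions, not the exact query used in the tests. It assumes the Pinecone Python client's `index.query(...)` call with `include_metadata=True`, and it assumes Bedrock records the originating S3 URI in a metadata field named `x-amz-bedrock-kb-source-uri`; confirm that name against a record in your own index. The caller supplies a query vector of the index's dimension, and the helper name `chunks_per_source` is hypothetical.

```python
from collections import Counter

# Assumed name of the metadata field holding the source S3 URI.
SOURCE_KEY = "x-amz-bedrock-kb-source-uri"


def _field(obj, name, default=None):
    """Read a field from either a dict or an attribute-style response object."""
    if isinstance(obj, dict):
        return obj.get(name, default)
    return getattr(obj, name, default)


def chunks_per_source(index, query_vector, top_k=100, source_key=SOURCE_KEY):
    """Count the matches returned by one query, grouped by source URI."""
    response = index.query(
        vector=query_vector, top_k=top_k, include_metadata=True
    )
    counts = Counter()
    for match in _field(response, "matches", []) or []:
        metadata = _field(match, "metadata", {}) or {}
        counts[_field(metadata, source_key, "<unknown>")] += 1
    return counts
```

A query only returns the nearest `top_k` matches, so this is a sample rather than a full listing. A count for a deleted file's URI confirms leftover vectors, and a count that doubles after re-upload is consistent with duplicate entries.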
Original deleted PDF file uploaded to S3 bucket again
• Lambda Function was observed as being triggered
• Knowledge Base observed as syncing
• Lambda Function completes without error
• Knowledge Base observed as Available
Test query sent into LLM based on known text within the PDF file: the expected result from the LLM was received.
Pinecone database index queried, and it was observed that data from the deleted file was still present together with a duplicate set of entries from the second upload.
• Above two tests gave incorrect results
Both PDF files deleted from S3 bucket
• Lambda Function was observed as being triggered
• Knowledge Base observed as syncing
• Lambda Function completes without error
• Knowledge Base observed as Available
Test query sent into LLM based on known text within the deleted PDF files: we received results back indicating it could not answer the query.
Pinecone database index queried, and it was observed that the index was now empty.
• Above test gave correct results
Pinecone database index was cleared down and a single file was uploaded to S3 bucket/folder.
• Lambda Function was observed as being triggered
• Knowledge Base observed as syncing
• Lambda Function completes without error
• Knowledge Base observed as Available
Files from existing folders were not added back into the vector database.
• This was not expected behavior
So is the Bedrock API just not working correctly, or is the documentation wrong in implying that it should handle deletes and updates? This is concerning, as it means we would have to write our own code to read through the S3 bucket and ensure the vector database is accurate.
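To illustrate the kind of reconciliation code that would be needed, here is a rough sketch that compares the objects in S3 with the source URIs found in the vector database. It assumes boto3's S3 `list_objects_v2` paginator and that the indexed URIs have already been collected by some other means; the helper names `list_s3_uris` and `find_drift` are hypothetical, and the sketch only reports drift, it does not delete anything.

```python
def list_s3_uris(s3_client, bucket, prefix=""):
    """Return the set of s3:// URIs for every object under a prefix.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    uris = set()
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix carry no "Contents" key.
        for obj in page.get("Contents", []):
            uris.add(f"s3://{bucket}/{obj['Key']}")
    return uris


def find_drift(s3_uris, indexed_uris):
    """Compare what is in S3 with what the vector database claims to hold."""
    s3_uris, indexed_uris = set(s3_uris), set(indexed_uris)
    return {
        # Vectors whose source document no longer exists in S3.
        "stale": sorted(indexed_uris - s3_uris),
        # Documents in S3 that have no vectors at all.
        "missing": sorted(s3_uris - indexed_uris),
    }
```

In the delete test above, the deleted file would be expected to show up under `stale`; in the final test, the files from existing folders would show up under `missing`.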