It seems you have already flooded the internet with your case :)
Reach out to AWS Support if none of those threads got you a solution.
For whoever stumbles into this situation in the future: I fixed the issue, so I am posting how I did it in case someone needs it later.
In Bedrock, choosing Web Crawler as the data source for a Knowledge Base does not behave correctly: once you press the sync button, the job cannot be stopped...
What I did was delete the vector index (in my case OpenSearch Serverless, or OSS for short) to try to trigger a failure. After a couple of hours the sync did fail, but then I hit a different problem: the data source could not be deleted.
It had the following error: "Unable to delete data from vector store for data source with ID XXXXXXXXXX. Check your vector store configurations and permissions and retry your request. If the issue persists, consider updating the dataDeletionPolicy of the data source to RETAIN and retry your request."
Changing the deletion policy via the UI will not work either (it reports that the change succeeded, but the policy is not actually changed).
The solution is to do it via the CLI, following these docs:
- To get the current data source information https://docs.aws.amazon.com/cli/latest/reference/bedrock-agent/get-data-source.html
- To update the deletion policy https://docs.aws.amazon.com/cli/latest/reference/bedrock-agent/update-data-source.html
- To delete the data source https://docs.aws.amazon.com/cli/latest/reference/bedrock-agent/delete-data-source.html
You will need the data source info first, since it contains the data source configuration that must be passed to the update command. The vector ingestion configuration cannot be changed, so you must pass the existing one as well, otherwise the call fails with an error that you are trying to change it.
The get command to run is:
aws bedrock-agent get-data-source --data-source-id <DATASOURCE_ID> --knowledge-base-id <KB_ID>
Replace DATASOURCE_ID and KB_ID with your own values.
In the output you will see two important objects that you will need: "dataSourceConfiguration" and "vectorIngestionConfiguration".
Copy the JSON contained under each of these objects into its own local file (e.g., dataSourceConfiguration into tmp.json and vectorIngestionConfiguration into tmp2.json), and make sure they are formatted correctly with no JSON syntax errors.
Then upload these two files into CloudShell using Actions -> Upload file at the top right of the CloudShell window.
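Instead of hand-copying the two objects, you can pull them out of a saved get-data-source response with a short script. This is only a sketch: the sample response below is hypothetical and inlined for illustration; on a real account you would redirect the actual `aws bedrock-agent get-data-source` output into response.json instead.

```shell
# Hypothetical sample response; with real credentials you would run:
# aws bedrock-agent get-data-source --data-source-id <DATASOURCE_ID> \
#   --knowledge-base-id <KB_ID> > response.json
cat > response.json <<'EOF'
{"dataSource": {"name": "my-web-crawler",
  "dataSourceConfiguration": {"type": "WEB"},
  "vectorIngestionConfiguration": {"chunkingConfiguration": {"chunkingStrategy": "FIXED_SIZE"}},
  "dataDeletionPolicy": "DELETE"}}
EOF

# Extract each object into its own file, validating the JSON along the way
python3 - <<'PY'
import json
ds = json.load(open("response.json"))["dataSource"]
json.dump(ds["dataSourceConfiguration"], open("tmp.json", "w"), indent=2)
json.dump(ds["vectorIngestionConfiguration"], open("tmp2.json", "w"), indent=2)
print("wrote tmp.json and tmp2.json")
PY
```

This also catches the JSON syntax errors mentioned above, since json.load will refuse to parse a malformed response.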
After that, run the update-data-source command:
aws bedrock-agent update-data-source --data-source-id <DATASOURCE_ID> --knowledge-base-id <KB_ID> --data-source-configuration file://tmp.json --vector-ingestion-configuration file://tmp2.json --name <NAME_OF_DATA_SOURCE> --data-deletion-policy RETAIN
The response will be the data source with its new configuration. To make sure the change took effect, run the get-data-source command again and look for
"dataDeletionPolicy": "RETAIN"
instead of "DELETE".
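You can also check the field programmatically instead of eyeballing the whole response. A minimal sketch, again with a hypothetical saved response (on a live account, redirect the real get-data-source output into updated.json first):

```shell
# Hypothetical response after the update; on a real account:
# aws bedrock-agent get-data-source --data-source-id <DATASOURCE_ID> \
#   --knowledge-base-id <KB_ID> > updated.json
cat > updated.json <<'EOF'
{"dataSource": {"dataDeletionPolicy": "RETAIN"}}
EOF

python3 - <<'PY'
import json, sys
policy = json.load(open("updated.json"))["dataSource"]["dataDeletionPolicy"]
print("dataDeletionPolicy:", policy)
sys.exit(0 if policy == "RETAIN" else 1)  # non-zero exit if still DELETE
PY
```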
Then you can run the delete-data-source command as follows:
aws bedrock-agent delete-data-source --data-source-id <DATASOURCE_ID> --knowledge-base-id <KB_ID>
And you are good to go; you can delete the knowledge base as well if you need to.
Hope that helps, AI-ing boys
In my case, the crawler had already started so the status was set to 'Running'. I could neither delete the knowledge base nor the data source. I managed to delete all underlying resources on OpenSearch, but that did not force the crawler to exit.
Solution: My final solution was to destroy the auto-generated IAM role. With the IAM permissions removed, the ingestion job (presumably) could no longer access the selected foundation model for embeddings, so it quickly discarded each crawled webpage. The crawler had discovered 14,900 items in total; with the IAM role deleted, it processed the remaining ~9,000 items in roughly 15 minutes.
If you are using a custom IAM role, double-check that no other service depends on it before deleting the role temporarily. Hopefully this solution helps someone else, but it is strange that AWS shipped a feature that lets you start a job you cannot terminate, even when it means incurring massive $$$.