Unable to Stop Running Sync Job in AWS Bedrock Knowledge Base


Hey All,

I have an issue with an AWS Bedrock Knowledge Base that uses the Web Crawler as a data source. I accidentally entered 2 Wikipedia URLs (e.g., "https://en.wikipedia.org/wiki/article1" and "https://en.wikipedia.org/wiki/article2") with the scope set to HOSTS_ONLY, so I am assuming the crawler is trying to crawl the entire Wikipedia. Since this is a data source configured in Bedrock, not Kendra or Lambda, I cannot stop the ingestion job. The status is stuck at 'STARTING', and I deleted the Vector Index (OpenSearch) successfully to try to trigger a failure.
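
For reference, the job status can also be checked from the CLI; this is just a sketch, with the IDs as placeholders:

aws bedrock-agent list-ingestion-jobs --knowledge-base-id <KB_ID> --data-source-id <DATASOURCE_ID> # lists the ingestion jobs and their statuses (e.g., STARTING)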

What else can I do in this situation? The job has been running for about an hour and a half.

Any help would be appreciated, thank you.

3 Answers

It seems you have already flooded the internet with your case :)

Reach out to AWS Support if none of those got you a solution.

answered 10 months ago

For whoever stumbles upon this situation: I have fixed the issue, so I am posting how I did it in case someone needs it in the future.

In Bedrock, choosing Web Crawler as a data source for a Knowledge Base does not work well, since once you press the sync button the ingestion job cannot be stopped...

What I did was delete the Vector index (in my case OpenSearch Serverless, or OSS for short) to try to trigger a failure. After a couple of hours the job did fail, but then I hit a different issue: the data source could not be deleted.
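
For reference, deleting the OSS collection can also be done from the CLI; a sketch, with the collection ID as a placeholder:

aws opensearchserverless list-collections # find the ID of the collection backing the vector index
aws opensearchserverless delete-collection --id <COLLECTION_ID>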

It had the following error: "Unable to delete data from vector store for data source with ID XXXXXXXXXX. Check your vector store configurations and permissions and retry your request. If the issue persists, consider updating the dataDeletionPolicy of the data source to RETAIN and retry your request."

Note that changing the deletion policy via the UI won't work (it will report that the change was saved successfully, but the policy won't actually change).

The solution for that is to make the change via the CLI, following the AWS docs:

You will need to get the data source info first, since it contains the data source configuration that has to be passed to the update-data-source command. The vector ingestion configuration cannot be changed, so we need to pass that back unchanged as well, or we will get an error that we are trying to change it.

The get command to run is:

aws bedrock-agent get-data-source --data-source-id <DATASOURCE_ID> --knowledge-base-id <KB_ID>

Replace DATASOURCE_ID and KB_ID with your own values.

In the output you will see two important objects that you will need to reuse: "dataSourceConfiguration" and "vectorIngestionConfiguration".
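
For orientation, the relevant part of the get-data-source response looks roughly like the abridged sketch below; the contents of the two configuration objects depend on your setup:

{
  "dataSource": {
    "knowledgeBaseId": "<KB_ID>",
    "dataSourceId": "<DATASOURCE_ID>",
    "name": "<NAME_OF_DATA_SOURCE>",
    "dataDeletionPolicy": "DELETE",
    "dataSourceConfiguration": { ... },
    "vectorIngestionConfiguration": { ... }
  }
}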

Copy the JSON under each of those objects into its own local file (e.g., dataSourceConfiguration into tmp.json and vectorIngestionConfiguration into tmp2.json); make sure they are formatted correctly and contain no JSON syntax errors. An example of what these files might look like is below.
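
For a web crawler data source like the one above, the two files might look roughly like this. This is only a sketch: the field names follow the Bedrock API shapes, but the seed URLs and chunking values are placeholders, and your actual get-data-source output is the source of truth.

tmp.json (dataSourceConfiguration):

{
  "type": "WEB",
  "webConfiguration": {
    "sourceConfiguration": {
      "urlConfiguration": {
        "seedUrls": [
          { "url": "https://en.wikipedia.org/wiki/article1" },
          { "url": "https://en.wikipedia.org/wiki/article2" }
        ]
      }
    },
    "crawlerConfiguration": {
      "scope": "HOST_ONLY"
    }
  }
}

tmp2.json (vectorIngestionConfiguration):

{
  "chunkingConfiguration": {
    "chunkingStrategy": "FIXED_SIZE",
    "fixedSizeChunkingConfiguration": {
      "maxTokens": 300,
      "overlapPercentage": 20
    }
  }
}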

Then upload these two files via Actions -> Upload file at the top right of the CloudShell window (that is where I was running the CLI).

After that, run the update-data-source command:

aws bedrock-agent update-data-source --data-source-id <DATASOURCE_ID> --knowledge-base-id <KB_ID> --data-source-configuration file://tmp.json --vector-ingestion-configuration file://tmp2.json --name <NAME_OF_DATA_SOURCE> --data-deletion-policy RETAIN

The response will be the data source with the new configuration. To make sure it actually changed, run the get-data-source command again and look for "dataDeletionPolicy": "RETAIN" instead of "DELETE".
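
A quick way to verify just that field is the CLI's --query option (same placeholder IDs):

aws bedrock-agent get-data-source --data-source-id <DATASOURCE_ID> --knowledge-base-id <KB_ID> --query 'dataSource.dataDeletionPolicy' # should print "RETAIN"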

Then you can run the delete-data-source command as follows:

aws bedrock-agent delete-data-source --data-source-id <DATASOURCE_ID> --knowledge-base-id <KB_ID>

And you are good to go; you can delete the knowledge base as well if you need to.
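
If you do want to remove the knowledge base too, the command is (same placeholder ID):

aws bedrock-agent delete-knowledge-base --knowledge-base-id <KB_ID>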

Hope that helped, AI-ing boys

answered 10 months ago

In my case, the crawler had already started, so the status was set to 'Running'. I could neither delete the knowledge base nor the data source. I managed to delete all the underlying resources on OpenSearch, but that did not force the crawler to exit.

Solution: my final solution was to destroy the auto-generated IAM role. With the IAM permissions removed, the ingestion job was (presumably) unable to access the selected foundation model for embeddings, and therefore quickly discarded each crawled web page. The total number of items discovered was 14,900; with the IAM role deleted, the crawler processed the remaining ~9k items in roughly 15 minutes.
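
For reference, removing the role from the CLI means detaching or deleting its policies first. This is a sketch, with the role name as a placeholder (the auto-generated role is typically named something like AmazonBedrockExecutionRoleForKnowledgeBase_xxxx):

aws iam list-attached-role-policies --role-name <KB_ROLE_NAME> # note each attached policy ARN
aws iam detach-role-policy --role-name <KB_ROLE_NAME> --policy-arn <POLICY_ARN> # repeat per attached policy
aws iam list-role-policies --role-name <KB_ROLE_NAME> # inline policies, if any
aws iam delete-role-policy --role-name <KB_ROLE_NAME> --policy-name <POLICY_NAME> # repeat per inline policy
aws iam delete-role --role-name <KB_ROLE_NAME>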

If you are using a custom IAM role, you might want to double-check whether another service depends on it before deleting the role temporarily. Hopefully this solution helps someone else, but it is strange that AWS built a feature that lets you start a job you cannot terminate, even when that means incurring massive $$$.

answered 9 months ago
