Excluding a URL with the AWS Q web crawler using regex doesn't work

I am using the AWS Q web crawler to crawl information from my company website, but one URL that ends with .mp3 cannot be indexed. The sync status shows as completed, but it is yellow, which I assume means there were some issues. In the sync scope settings, I see there is a "crawl URL pattern" to exclude and a "URL index pattern" to exclude. I tried the regex ^(https?|ftp|file)://(www.)?(.*?)\.(mp3)$ to match the URL, and when I tested it, it did match. But after I synced again, it still didn't work: the URL is still being crawled. I even tried using the URL itself as the pattern, and that didn't work either. Am I doing anything wrong? Has anyone run into a situation like this?

PS: I think I solved it. I now use only the URL Crawl Exclusion pattern with the regex, and not the Index URL Exclusion pattern. I think there are many issues with Q Business, and the best way to solve them is to delete the Q app and rebuild it.
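
For anyone who wants to repeat the "I tested it and it did match" step locally, here is a minimal sketch in Python. The sample URLs and the shorter alternative pattern are assumptions for illustration and do not come from the original post. The check uses Python's `re` module, so it only confirms that the expression itself matches. It says nothing about which regex dialect Q Business applies or how the crawler evaluates the two exclusion fields.

```python
import re

# Pattern quoted from the question, unchanged. Note that the dot in
# "(www.)?" is unescaped, so it matches any character, not just ".".
QUESTION_PATTERN = r"^(https?|ftp|file)://(www.)?(.*?)\.(mp3)$"

# A shorter alternative (an assumption, not from the original post):
# anything that ends with ".mp3".
SIMPLE_PATTERN = r".*\.mp3$"

# Hypothetical URLs used only for this local check.
SAMPLE_URLS = [
    "https://www.example.com/audio/intro.mp3",
    "http://example.com/podcast/episode-1.mp3",
    "https://www.example.com/about.html",
    "https://www.example.com/audio/intro.mp3?download=1",
]


def is_excluded(pattern: str, url: str) -> bool:
    """Return True if the URL matches the exclusion pattern."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    for url in SAMPLE_URLS:
        print(
            url,
            is_excluded(QUESTION_PATTERN, url),
            is_excluded(SIMPLE_PATTERN, url),
        )
```

Both patterns match the two URLs that end in .mp3 and reject the other two. The last sample shows that a query string after .mp3 defeats the `$` anchor, which may be worth checking if the problem URL carries parameters.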

asked a year ago, 587 views
1 Answer
AWS
answered a year ago
  • I mean that I want to exclude any URL that ends with .mp3, but the URL is still not excluded after I add the regex.
