Yes, there are a few things you can do to prevent this from happening:
- Lock the origin down so that it cannot be reached from the public internet, except from CloudFront IPs. It appears that Google is able to reach the origin directly.
- Check whether any *.cloudfront.net URLs appear anywhere in the publishing code. If so, replace them with the domain names that sit in front of CloudFront.
- Put a robots.txt on Amazon S3 so that it is served at something like www.domain.com/robots.txt.
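For the first step, one way to know which addresses to allow is to filter the AWS published IP ranges (available at https://ip-ranges.amazonaws.com/ip-ranges.json) for the entries tagged `CLOUDFRONT`. The following is only a minimal sketch, assuming the document has already been downloaded and parsed. The function names and the sample prefixes are hypothetical, made up for illustration.

```python
import ipaddress


def cloudfront_networks(ip_ranges):
    """Return the IPv4 networks tagged CLOUDFRONT in a parsed ip-ranges.json document."""
    return [
        ipaddress.ip_network(entry["ip_prefix"])
        for entry in ip_ranges.get("prefixes", [])
        if entry.get("service") == "CLOUDFRONT"
    ]


def is_cloudfront_ip(address, networks):
    """Check whether a client address falls inside any of the given networks."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in networks)


# Made-up sample with the same shape as ip-ranges.json (documentation-only prefixes).
sample = {
    "prefixes": [
        {"ip_prefix": "203.0.113.0/24", "region": "GLOBAL", "service": "CLOUDFRONT"},
        {"ip_prefix": "198.51.100.0/24", "region": "us-east-1", "service": "EC2"},
    ]
}

networks = cloudfront_networks(sample)
print(is_cloudfront_ip("203.0.113.10", networks))
print(is_cloudfront_ip("198.51.100.5", networks))
```

The resulting list could feed a security group or firewall rule on the origin. The published ranges change over time, so the list would need to be refreshed periodically.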
I would not recommend setting up redirection on the origin, because there is no reason for anyone to hit the origin directly. It would also leave an open backdoor for malicious actors to get through.
Once you have done steps 1, 2, and 3, the indexed origin URLs will gradually go away. I also believe you can then reach out to Google to have those URLs removed from its search results.
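The second step (finding hardcoded *.cloudfront.net URLs and swapping in the public domain) could be sketched roughly as below. This is an illustration only: the regular expression, the function names, and the example distribution hostname are assumptions, and `www.domain.com` stands in for whatever domain fronts the distribution.

```python
import re

# Matches hostnames such as d111111abcdef8.cloudfront.net (assumed pattern).
CLOUDFRONT_HOST = re.compile(r"\b[a-z0-9]+\.cloudfront\.net\b", re.IGNORECASE)


def find_cloudfront_hosts(text):
    """Return the distinct *.cloudfront.net hostnames found in the text."""
    return sorted({m.group(0).lower() for m in CLOUDFRONT_HOST.finditer(text)})


def replace_cloudfront_hosts(text, public_host):
    """Replace every *.cloudfront.net hostname with the public domain name."""
    return CLOUDFRONT_HOST.sub(public_host, text)


# Hypothetical snippet of published HTML.
page = '<a href="https://d111111abcdef8.cloudfront.net/img/logo.png">'

print(find_cloudfront_hosts(page))
print(replace_cloudfront_hosts(page, "www.domain.com"))
```

Running the finder over templates and generated pages first, before replacing anything, makes it easier to review what would change.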