Efficient AWS-Powered Data Retrieval from Public Websites


Hi Everyone,

Could anyone offer guidance on devising an AWS-based strategy for acquiring data from a publicly accessible source, such as the USDA public website?

Example: I need assistance in accessing the latest PDF from the website 'https://usda.library.cornell.edu/concern/publications/rn3011428?locale=en' and subsequently transforming the data contained within the PDF into a tabular format using AWS services. Any insights or guidance on how to effectively achieve this task would be greatly appreciated.

Thank you

Anand
asked 6 months ago · 199 views
1 Answer

Hello. There are several ways to accomplish what you're looking to do. Since you mentioned efficiency, I am going to recommend a serverless approach, because information retrieval of this kind doesn't warrant running infrastructure all the time just to scrape a website every once in a while. I should note that in many cases scraping a website is considered to be against terms of service, so it's best to check the site's terms of service before doing this.

In your case this data appears to be intentionally available for public access, but again, please verify that scraping and/or interacting with the website via automation is not against terms of service.

If automating the download of this data is, in fact, within the terms of service, there are a few ways you can go about it. I will present two. The first is the simplest method that still involves doing it yourself, and it requires some basic knowledge of a programming language like Python, JavaScript, Go, Java, etc. The second relies more on managed services, and I will share a sample repository to get you started:

The Simplest DIY Method

Use an AWS Lambda function and a web scraping library such as Beautiful Soup (bs4) for Python or Puppeteer for Node. If you prefer other languages, pick your favorite scraper.

You will need to do a couple of things with the scraper, and since this is just a basic PDF you can probably fit it all into a simple Lambda function. With the scraper you will need to:

  1. Load the base page (the URL you provided). You will essentially be downloading all of the front-end code for this page so that you can traverse the Document Object Model (DOM).
  2. Since it looks like these reports are updated on a daily basis, you may want to download the latest each day, so you will need to locate the URL behind the "download latest" element. Since this URL appears to be an object address rather than an API call (not conveniently something like ?published=latest), you will need to virtually "click" the button that takes you to the latest release.
  3. There are several ways to select this URL. The simplest is to identify the parent element of the button and then find the download button and the URL inside of it. In this case the parent HTML element is <div class="m-headline">LATEST RELEASE</div>; from there, use your web scraping library of choice to traverse the child elements and find the "download latest" button and its corresponding URL.
  4. Now you can download this PDF either to the Lambda function's local storage (less ideal) or to an S3 bucket. Keeping with the theme of simplest, I would recommend using Lambda local storage first and switching to S3 if you are not able to perform the required operations locally.
  5. From here you can use another library for PDF parsing such as tabula-py (for Python) or pdf.js (for Node), or you could choose to use Amazon Textract.
  6. Finally, depending on where you want this data to go, you will need to format it accordingly and then write it to your desired data store. Alternatively, if you want to kick off another event in response to the data, you can trigger that event. Rough sketches of this end-to-end flow follow this list.
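To make these steps concrete, here is a minimal sketch of a Lambda handler, assuming Python with requests, Beautiful Soup, and tabula-py. The bucket name, the selector heuristics (the "m-headline" headline with a PDF link nearby), and the object keys are all assumptions, so treat this as a starting point rather than a drop-in implementation:

```python
import io
from urllib.parse import urljoin

import boto3
import requests
import tabula  # tabula-py; note it requires a Java runtime in the Lambda environment
from bs4 import BeautifulSoup

PAGE_URL = "https://usda.library.cornell.edu/concern/publications/rn3011428?locale=en"
BUCKET = "my-usda-reports"  # hypothetical bucket name; replace with your own


def handler(event, context):
    # Step 1: load the base page and parse the DOM.
    html = requests.get(PAGE_URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Steps 2-3: find the "LATEST RELEASE" headline, then the first PDF link
    # in the surrounding section. Verify these selectors against the live page;
    # the exact nesting of the download link is an assumption here.
    headline = soup.find("div", class_="m-headline",
                         string=lambda s: s and "LATEST RELEASE" in s)
    section = headline.find_parent()
    link = section.find("a", href=lambda h: h and h.lower().endswith(".pdf"))
    pdf_url = urljoin(PAGE_URL, link["href"])

    # Step 4: download the PDF to the function's local /tmp storage.
    pdf_path = "/tmp/latest.pdf"
    with open(pdf_path, "wb") as f:
        f.write(requests.get(pdf_url, timeout=60).content)

    # Step 5: extract tables; tabula.read_pdf returns a list of pandas DataFrames.
    tables = tabula.read_pdf(pdf_path, pages="all")

    # Step 6: write each table out as CSV to S3 (or any other data store you prefer).
    s3 = boto3.client("s3")
    for i, df in enumerate(tables):
        buf = io.StringIO()
        df.to_csv(buf, index=False)
        s3.put_object(Bucket=BUCKET, Key=f"usda/latest/table_{i}.csv", Body=buf.getvalue())

    return {"tables_extracted": len(tables), "source": pdf_url}
```

One design note: tabula-py depends on a Java runtime, which is awkward to package for Lambda unless you build a container image, so Textract can be the easier route in practice.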
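If you go the Amazon Textract route for step 5 instead, the asynchronous table-analysis flow looks roughly like the sketch below. It assumes you have already copied the downloaded PDF to S3; the bucket and key are placeholders:

```python
import time

import boto3

textract = boto3.client("textract")


def extract_table_blocks(bucket: str, key: str) -> list:
    """Run asynchronous table analysis on a PDF already stored in S3."""
    job = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES"],
    )
    job_id = job["JobId"]

    # Poll until the job finishes. In production, prefer the SNS notification
    # channel that Textract supports instead of polling inside the Lambda.
    while True:
        result = textract.get_document_analysis(JobId=job_id)
        if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(5)

    if result["JobStatus"] == "FAILED":
        raise RuntimeError("Textract document analysis failed")

    # TABLE and CELL blocks describe the tabular structure; map them into rows
    # and columns for your target store. (For large documents, follow NextToken
    # to page through all of the returned blocks.)
    return [b for b in result["Blocks"] if b["BlockType"] in ("TABLE", "CELL")]
```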

Note that if you are going to be passing the data through multiple steps in a pipeline, a Step Functions workflow might be a better fit. My general approach with Lambda functions is to do as much as I can in a single function, and if I'm starting to chain multiple events together, refactor it into a Step Functions workflow.

Use an Amazon EventBridge rule to trigger the Lambda function at the interval you require. EventBridge essentially allows you to trigger event-driven workflows the way you might with a cron job on a Linux machine.
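As an illustration, a daily schedule could be wired up with boto3 roughly as follows; the rule name, function name, and ARN are placeholders:

```python
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Placeholder ARN; substitute the ARN of your own scraper function.
FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:usda-report-scraper"

# Run once a day; rate() or cron() schedule expressions both work here.
rule = events.put_rule(
    Name="usda-report-daily",
    ScheduleExpression="cron(0 12 * * ? *)",  # every day at 12:00 UTC
    State="ENABLED",
)

# Allow EventBridge to invoke the function, then attach it as the rule's target.
lambda_client.add_permission(
    FunctionName="usda-report-scraper",
    StatementId="allow-eventbridge-usda-report-daily",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
events.put_targets(
    Rule="usda-report-daily",
    Targets=[{"Id": "usda-report-scraper", "Arn": FUNCTION_ARN}],
)
```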

The Simplest Managed Services Method

Another approach is to use Amazon Kendra (enterprise search aided by ML) along with AWS Step Functions. In this case I will refer you to a sample repository because it is comprehensive and will show you how to get started: https://github.com/aws-samples/aws-step-functions-kendra-web-crawler-search-engine

With that high-level answer in place, tell us a bit more about what you need this data for, the required interval, and your existing knowledge and skills with Lambda functions. Happy to provide more prescriptive guidance. This is a very common use case and it would be great to document the simplest path in a bit more detail. Thanks for the great question!

AWS
answered 6 months ago
