
Run Selenium, Pytesseract on MWAA


I want to run scraping jobs via MWAA. How should I go about this?

How do I install such dependencies? A requirements file alone won't suffice. Should I use startup scripts?

1 Answer

To run scraping jobs with Selenium and Pytesseract on Amazon Managed Workflows for Apache Airflow (MWAA), you'll need to handle these dependencies properly. Here are the recommended approaches:

Using Startup Scripts

Startup scripts are an excellent option for installing complex dependencies like Selenium and Pytesseract on MWAA. This approach is particularly useful because:

  1. You can selectively install dependencies based on the MWAA component type (worker, scheduler, or web server)
  2. It allows you to install dependencies that might require additional system-level packages
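
As a rough illustration of the first point, a startup script can branch on the MWAA_AIRFLOW_COMPONENT environment variable, which MWAA sets to scheduler, worker, or webserver. The sketch below is only an assumption about how you might organize this: the function name packages_for_component and the choice of packages per component are hypothetical.

```shell
#!/bin/bash
# Sketch: choose pip packages per MWAA component.
# packages_for_component is a hypothetical helper, not part of MWAA.
packages_for_component() {
    case "$1" in
        worker) echo "selenium pytesseract" ;;  # tasks execute on workers
        *)      echo "" ;;                      # scheduler, webserver, or unset
    esac
}

component="${MWAA_AIRFLOW_COMPONENT:-}"
packages="$(packages_for_component "$component")"

if [ -n "$packages" ]; then
    # Word splitting is intentional here: one argument per package name.
    pip3 install $packages
else
    echo "Nothing to install for component '${component}'"
fi
```

If your DAG files import these packages at module level, the scheduler may need them as well, since it parses the DAG files; in that case you would likely add a scheduler branch to the case statement.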

Here's how you could implement a startup script for your scraping needs:

#!/bin/bash
# Tasks execute on workers, so skip the install on other components.
if [[ "${MWAA_AIRFLOW_COMPONENT}" != "worker" ]]; then
    exit 0
fi

echo "Installing virtual Python environment for scraping tools"
pip3 install --upgrade pip
sudo pip3 install --user virtualenv
sudo mkdir python3-virtualenv
cd python3-virtualenv
sudo python3 -m venv scraping-env
sudo chmod -R 777 *

echo "Activating virtual environment"
source scraping-env/bin/activate

echo "Installing scraping dependencies"
pip3 install selenium pytesseract
# Add any other dependencies you need

echo "Listing installed packages"
pip3 list

echo "Deactivating virtual environment"
deactivate

Packaging Dependencies

Another approach is to package your dependencies into a .zip file:

  1. Use the MWAA local runner to create a .zip file of your dependencies:
./mwaa-local-env package-requirements
  2. Upload this .zip file to your S3 bucket and specify the location in your MWAA environment configuration.

Best Practices

  • Test your dependencies in a non-production environment first
  • For Selenium, you'll need to ensure the appropriate webdriver is available
  • For Pytesseract, you might need additional system dependencies which can be installed via the startup script
  • Consider creating a stable requirements file with specific versions to avoid conflicts
  • If your scraping jobs are resource-intensive, adjust your MWAA environment size accordingly
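
For the webdriver and system-dependency points above, one possible shape for the startup script is sketched below. Treat it as a starting point under stated assumptions, not a verified recipe: the package names (tesseract, chromium, chromedriver) are hypothetical and may not exist in the repositories available to your MWAA image, the package manager may be yum or dnf depending on the underlying OS, and the environment needs network access to a package repository. The helper names missing_commands and install_system_packages are made up for this example.

```shell
#!/bin/bash
# Sketch: install system-level tools on workers, then check they are on PATH.

# Print each command from the argument list that is not found on PATH.
missing_commands() {
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
        fi
    done
}

# Install system packages; swap yum for dnf if your image uses it.
install_system_packages() {
    sudo yum install -y "$@"
}

if [ "${MWAA_AIRFLOW_COMPONENT:-}" = "worker" ]; then
    # ASSUMPTION: these package names are placeholders, not verified.
    install_system_packages tesseract chromium chromedriver

    missing="$(missing_commands tesseract chromedriver)"
    if [ -n "$missing" ]; then
        echo "Still missing after install: $missing" >&2
    fi
fi
```

The PATH check at the end is there so that a missing binary shows up in the startup script logs rather than as a failure inside a scraping task later on.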

By using either startup scripts or packaging dependencies, you can successfully run Selenium and Pytesseract-based scraping jobs on MWAA. The startup script approach offers more flexibility for complex dependencies that might require system-level installations.
Sources
Amazon MWAA best practices for managing Python dependencies | AWS Big Data Blog
Using dbt with Amazon MWAA - Amazon Managed Workflows for Apache Airflow
Access private code repositories for installing Python dependencies on Amazon MWAA | AWS Big Data Blog

answered 8 months ago
  • I guess the dependencies are getting installed, but I need a way to install or download the supporting components.

    For example, Selenium needs a browser binary. How should I do that?
