To run scraping jobs with Selenium and Pytesseract on Amazon Managed Workflows for Apache Airflow (MWAA), you'll need to handle these dependencies properly. Here are the recommended approaches:
Using Startup Scripts
Startup scripts are an excellent option for installing complex dependencies like Selenium and Pytesseract on MWAA. This approach is particularly useful because:
- You can selectively install dependencies based on the MWAA component type (worker, scheduler, or web server)
- It allows you to install dependencies that might require additional system-level packages
Here's how you could implement a startup script for your scraping needs:
```bash
#!/bin/bash

# Only run on worker nodes; schedulers and web servers don't need these tools
if [[ "${MWAA_AIRFLOW_COMPONENT}" != "worker" ]]
then
    exit 0
fi

echo "Installing virtual Python environment for scraping tools"
pip3 install --upgrade pip
sudo pip3 install --user virtualenv
sudo mkdir python3-virtualenv
cd python3-virtualenv
sudo python3 -m venv scraping-env
sudo chmod -R 777 *

echo "Activating virtual environment"
source scraping-env/bin/activate

echo "Installing scraping dependencies"
pip3 install selenium pytesseract  # Add any other dependencies you need

echo "Listing installed packages"
pip3 list

echo "Deactivating virtual environment"
deactivate
```
Packaging Dependencies
Another approach is to package your dependencies into a .zip file:
- Use the MWAA local runner to create a .zip file of your dependencies:
```bash
./mwaa-local-env package-requirements
```
- Upload this .zip file to your S3 bucket and specify the location in your MWAA environment configuration.
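Once the .zip file is built, the upload and environment update can be done with the AWS CLI. The sketch below assumes versioning is enabled on the bucket (MWAA requires it) and uses placeholder names (`my-mwaa-bucket`, `my-mwaa-env`) that you would replace with your own:

```shell
# Upload the packaged dependencies to the S3 bucket backing your MWAA environment.
# "my-mwaa-bucket" and "my-mwaa-env" are placeholders for your own resource names.
aws s3 cp plugins.zip s3://my-mwaa-bucket/plugins.zip

# MWAA pins the archive by S3 object version, so fetch the version ID
# of the upload and pass it to update-environment.
VERSION_ID=$(aws s3api head-object \
    --bucket my-mwaa-bucket --key plugins.zip \
    --query VersionId --output text)

aws mwaa update-environment \
    --name my-mwaa-env \
    --plugins-s3-path plugins.zip \
    --plugins-s3-object-version "$VERSION_ID"
```

Note that `update-environment` triggers a rolling update of the environment, which can take 10-30 minutes to complete.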
Best Practices
- Test your dependencies in a non-production environment first
- For Selenium, you'll need to ensure a matching browser binary and WebDriver (e.g., Chromium and chromedriver of the same version) are available on the workers
- For Pytesseract, the pip package is only a wrapper: the tesseract OCR binary itself is a system-level dependency, which can be installed via the startup script
- Consider creating a stable requirements file with specific versions to avoid conflicts
- If your scraping jobs are resource-intensive, adjust your MWAA environment size accordingly
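To make the system-level points above concrete, here is a sketch of how the startup script could be extended to install the tesseract binary and a browser for Selenium. The package name and download URLs are assumptions, not tested values: verify them against your MWAA environment's base image (Amazon Linux), and use a browser/driver pair whose versions match.

```shell
#!/bin/bash
# Sketch of a startup-script extension for system-level dependencies.
# Assumptions: runs on MWAA's Amazon Linux base image; package names and
# URLs below are placeholders you must verify for your environment.
if [[ "${MWAA_AIRFLOW_COMPONENT}" != "worker" ]]
then
    exit 0
fi

# Tesseract OCR engine: pytesseract only wraps this binary.
# The package may not be in the default repos; check availability first.
sudo yum install -y tesseract || echo "tesseract not found in repos; install from another source"

# Headless Chromium plus a matching chromedriver for Selenium.
# CHROMIUM_URL / DRIVER_URL are placeholders; for reproducible builds,
# stage version-matched artifacts in your own S3 bucket and download from there.
CHROMIUM_URL="https://example.com/path/to/headless-chromium.zip"
DRIVER_URL="https://example.com/path/to/chromedriver.zip"
sudo mkdir -p /usr/local/scraping-bin
curl -sSL "$CHROMIUM_URL" -o /tmp/chromium.zip
curl -sSL "$DRIVER_URL" -o /tmp/chromedriver.zip
sudo unzip -o /tmp/chromium.zip -d /usr/local/scraping-bin
sudo unzip -o /tmp/chromedriver.zip -d /usr/local/scraping-bin
sudo chmod +x /usr/local/scraping-bin/*
```

Your DAG code can then point Selenium at these locations, e.g. by setting the browser path via `ChromeOptions.binary_location` and passing the chromedriver path to the driver service.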
By using either startup scripts or packaging dependencies, you can successfully run Selenium and Pytesseract-based scraping jobs on MWAA. The startup script approach offers more flexibility for complex dependencies that might require system-level installations.
Sources
Amazon MWAA best practices for managing Python dependencies | AWS Big Data Blog
Using dbt with Amazon MWAA - Amazon Managed Workflows for Apache Airflow
Access private code repositories for installing Python dependencies on Amazon MWAA | AWS Big Data Blog

I guess the dependencies are getting installed, but I need a way to install/download the supporting pieces. For example, Selenium would need a browser binary. How should I do that?