Process Objects on Amazon S3 with Precision File Type Detection
This article is a guide to using AWS Lambda and the python-magic library to accurately detect and categorize file types stored in Amazon S3. This approach enables businesses to build more robust and efficient content processing workflows, ensuring mission-critical downstream functions can operate seamlessly. Readers will learn practical techniques for processing objects on S3, unlocking new productivity gains and better managing their cloud-based content.
Introduction
Amazon S3 offers unparalleled scale, durability, and flexibility, and is a go-to cloud storage service for a wide variety of customer use cases. But what if you need to selectively process uploaded content based on file type? This challenge is more common than you might think, with far-reaching implications for optimizing workflows and ensuring downstream processing success.
Imagine an S3 bucket serving as the central content repository for your organization. While the ability to accept a wide range of file types is a key benefit, it also introduces complexity around managing and processing that data. For example, mission-critical downstream functions may only be able to process or consume specific file formats. Relying solely on file extensions to enforce this is risky, since users can easily circumvent your logic by renaming a file's extension. In these cases, implementing a robust file type validation mechanism becomes essential to avoid processing errors and costly rework.
In this post, we'll explore a powerful technique to determine S3 object file types to ensure your content processing workflows run with precision and efficiency. Whether you're a seasoned AWS veteran or just getting started, you'll walk away with a deep understanding of how to optimize your S3 usage and unlock new levels of productivity.
Solution Overview
This solution leverages the power of Amazon S3 and AWS Lambda to automate the categorization and processing of uploaded content based on file type. By implementing a robust file type verification mechanism, businesses can ensure their content workflows run with precision and efficiency, avoiding costly processing failures downstream.
The solution workflow is as follows:
- A user or application uploads an object to the designated /landing prefix within an S3 bucket.
- An Amazon S3 event notification invokes an AWS Lambda function, informing it of the newly uploaded content.
- The Lambda function utilizes the python-magic library to accurately determine the file type of the uploaded object.
- If the object is identified as a supported file type, the Lambda function moves it to the /supported prefix within the S3 bucket and deletes the original object.
- For objects determined to be of an unsupported file type, the Lambda function moves them to the /unsupported prefix and deletes the original object.
Note: While this solution simply moves objects to different prefixes within the same S3 bucket based on their file type, you could easily alter this solution to achieve a different desired outcome.
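For reference, the Lambda function in this post only reads two fields from the S3 event notification payload: the bucket name and the object key. A trimmed illustration of that shape (real notifications carry many more fields, and the bucket and key below are placeholders) looks like this, and is also handy for testing the handler locally later on:

sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "your-bucket-name"},   # placeholder bucket
                "object": {"key": "landing/Test1.txt"}    # placeholder key
            }
        }
    ]
}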
Introduction to Python-Magic
Python-Magic is a Python library that leverages the libmagic file identification system to accurately determine file types. It works by analyzing file signatures and file content, not just extensions, making it more reliable than simpler methods. The library examines the initial bytes of a file for distinctive patterns and uses a comprehensive, regularly updated database to identify file types and encodings. This approach allows python-magic to correctly identify files even with incorrect extensions, detect less common formats, and provide both MIME type and character encoding information. Its accuracy, versatility, and performance make it superior to many other file type detection mechanisms, especially those that rely solely on file extensions or simpler analysis techniques.
In the solution presented above, AWS Lambda uses the python-magic library to determine an object's file type by reading the first 2,048 bytes of the object and comparing that sample against the database of file signatures included with libmagic. This approach is more reliable than classifying by file extension, and because only a small byte range is read, entire objects are never downloaded and scanned, which could otherwise lead to a costly, long-running processing system.
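To see how the library behaves in isolation, here is a minimal sketch you can run locally before wiring anything into Lambda. The file name sample.pdf is a hypothetical local file used purely for illustration:

import magic

# Read only the first 2,048 bytes -- enough for libmagic to match most signatures
with open("sample.pdf", "rb") as f:  # hypothetical local file
    header = f.read(2048)

mime = magic.Magic(mime=True)
print(mime.from_buffer(header))  # prints a MIME type such as "application/pdf"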
Building the Solution
Create Object Prefixes in S3
This solution assumes you have a new or existing S3 bucket to test with. The bucket name can be anything (as long as it's globally unique), but the following three prefixes must be created at the root of the bucket.
- Create a root-level prefix of landing for incoming objects landing in S3.
- Create a root-level prefix of supported to serve as the destination for approved file types.
- Create a root-level prefix of unsupported to serve as the destination for unapproved file types.
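If you prefer to script this step rather than use the console, the following boto3 sketch creates the same three prefixes by writing zero-byte placeholder objects, which mirrors what the console's Create folder button does (the bucket name is a placeholder):

import boto3

s3 = boto3.client("s3")
bucket = "your-bucket-name"  # replace with your test bucket

# S3 prefixes are purely logical; a zero-byte object with a trailing
# slash makes them visible as folders in the console.
for prefix in ("landing/", "supported/", "unsupported/"):
    s3.put_object(Bucket=bucket, Key=prefix)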
Create a Lambda Deployment Package
The python-magic library isn't included by default in the AWS Lambda Python 3.12 runtime, so we need to package it ourselves. Follow the steps below to create a simple Lambda deployment package that includes this library; we'll update the function code later.
- Launch an Amazon Linux 2023 EC2 instance and connect to it via SSH as ec2-user.
ssh ec2-user@ip_address_of_ec2_instance
- Ensure you are in the home directory.
cd ~
- Update the system and install necessary tools.
sudo dnf update -y
sudo dnf groupinstall "Development Tools" -y
sudo dnf install -y zip gcc openssl-devel bzip2-devel libffi-devel zlib-devel
- Download and compile Python 3.12.
cd /opt
sudo wget https://www.python.org/ftp/python/3.12.0/Python-3.12.0.tgz
sudo tar xzf Python-3.12.0.tgz
cd Python-3.12.0
sudo ./configure --enable-optimizations
sudo make altinstall
- Navigate back to the home directory.
cd ~
- Verify the installation.
python3.12 --version
- Create a directory for the project.
mkdir lambda_package
cd lambda_package
- Create a Python 3.12 virtual environment.
python3.12 -m venv env
source env/bin/activate
- Install the necessary packages.
pip install --upgrade pip
pip install python-magic
- Install libmagic.
sudo dnf install -y file-devel
- Create a directory for your Lambda function package.
mkdir package
- Copy files to the package directory.
cp -r env/lib/python3.12/site-packages/magic package/
cd package
mkdir lib
cp /usr/lib64/libmagic.so.1 lib/
- Create a minimal Lambda function named lambda_function.py in the package directory.
cat << EOF > lambda_function.py
import json
import magic

def lambda_handler(event, context):
    # TODO implement
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
EOF
- Create the deployment package.
zip -r9 ../lambda_package.zip .
cd ..
- Push the deployment package to an S3 bucket.
aws s3 cp /home/ec2-user/lambda_package/lambda_package.zip s3://<your_bucket_name>
Create a Lambda Function
We're now ready to create a new Lambda function that contains the python-magic library and our minimal Python code.
- Navigate to the AWS Lambda console and click on Create Function.
- At the top of the screen choose the option to Author from Scratch.
- Give your function a name such as ClassifyObjectFunction.
- Select Python 3.12 for the function runtime.
- Select x86_64 for the architecture.
- Select Create a new role with basic Lambda permissions; we'll customize these permissions later.
- Verify your settings and click on Create function when complete.
Update Lambda Function Code
Upload Packaged Library
Let's now update our function to include the library we packaged earlier.
- On the Code tab within the newly created function select the Upload from dropdown followed by Amazon S3.
- Provide the Amazon S3 URL for the lambda_package.zip you uploaded earlier and click on Save.
- Verify that the function code was updated along with the python-magic library we packaged earlier.
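If you'd rather script this step, the same update can be made with a boto3 call similar to the sketch below (the bucket name is a placeholder, and the function name assumes you used ClassifyObjectFunction):

import boto3

lambda_client = boto3.client("lambda")

# Equivalent to the console's "Upload from > Amazon S3" action
lambda_client.update_function_code(
    FunctionName="ClassifyObjectFunction",
    S3Bucket="your-bucket-name",   # bucket holding lambda_package.zip
    S3Key="lambda_package.zip",
)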
Upload Function Code
We're now ready to update our Lambda function code to include the classification and processing logic.
- Review the Python code below and use this as your Lambda function code, replacing the existing (minimal) code added when we built the Lambda package.
- Click on Deploy when complete.
Note: There are four discrete functions within this code base, specifically:
- lambda_handler - the main function entrypoint that receives the S3 event notification and places calls to other functions as needed.
- get_file_mime_type - determines the object file type using the python-magic library and returns the file type as file_type.
- move_to_unsupported - moves the object to the /unsupported prefix in S3 based on the detected file type and those specifically allowed using the ALLOWED_FILE_TYPES environment variable.
- move_to_supported - moves the object to the /supported prefix in S3 based on the detected file type and those specifically allowed using the ALLOWED_FILE_TYPES environment variable.
# Import necessary libraries
import os
import magic
import boto3
import logging

# Configure logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    # Set bucket_name and key using S3 event notification
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    # Log bucket_name and key
    logger.info(f"Bucket Name: {bucket_name}")
    logger.info(f"Object Key: {key}")

    # Call get_file_mime_type function
    file_type = get_file_mime_type(bucket_name, key)

    # Get the allowed file types from the environment variable
    allowed_types_str = os.environ.get('ALLOWED_FILE_TYPES', '')
    allowed_types = [t.strip() for t in allowed_types_str.split(',') if t.strip()]

    # Check if the file type is allowed
    if file_type in allowed_types:
        logger.info(f"File type {file_type} is allowed.")
        response = move_to_supported(bucket_name, key)
    else:
        logger.info(f"File type {file_type} is not allowed.")
        response = move_to_unsupported(bucket_name, key)
        # Handle the case of disallowed file type (e.g., delete the file, move it to a different bucket, etc.)

    return {
        'statusCode': 200,
        'body': f'Processed file of type: {file_type}'
    }

def get_file_mime_type(bucket_name, key):
    """
    Returns the MIME type of an S3 object.

    :param bucket_name: The name of the S3 bucket.
    :param key: The key (path) of the S3 object.
    :return: The MIME type of the file, or None if an error occurs.
    """
    # Create an S3 client
    s3 = boto3.client('s3')

    try:
        # Read the first 2048 bytes of the file
        response = s3.get_object(Bucket=bucket_name, Key=key, Range='bytes=0-2047')
        file_sample = response['Body'].read()

        # Determine MIME type of the file
        mime = magic.Magic(mime=True)
        file_type = mime.from_buffer(file_sample)

        # Return the file type
        return file_type
    except Exception as e:
        # Log exception
        logger.error(f"Error getting file MIME type: {e}")
        return None

def move_to_supported(bucket_name, key):
    # Create an S3 client
    s3 = boto3.client('s3')

    try:
        # Extract object name from key
        object_name = key.split('/')[-1]

        # Set new_key to /supported prefix
        new_key = f"supported/{object_name}"

        # Copy the object to the new key
        s3.copy_object(Bucket=bucket_name, CopySource={'Bucket': bucket_name, 'Key': key}, Key=new_key)

        # Delete the original object
        s3.delete_object(Bucket=bucket_name, Key=key)

        logger.info(f"File moved to /supported prefix: {new_key}")
        return new_key
    except Exception as e:
        # Log exception
        logger.error(f"Error moving file to /supported prefix: {e}")
        return None

def move_to_unsupported(bucket_name, key):
    # Create an S3 client
    s3 = boto3.client('s3')

    try:
        # Extract object name from key
        object_name = key.split('/')[-1]

        # Set new_key to /unsupported prefix
        new_key = f"unsupported/{object_name}"

        # Copy the object to the new key
        s3.copy_object(Bucket=bucket_name, CopySource={'Bucket': bucket_name, 'Key': key}, Key=new_key)

        # Delete the original object
        s3.delete_object(Bucket=bucket_name, Key=key)

        logger.info(f"File moved to /unsupported prefix: {new_key}")
        return new_key
    except Exception as e:
        # Log exception
        logger.error(f"Error moving file to /unsupported prefix: {e}")
        return None
Your function code should now match the example above.
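If you want to exercise the handler before wiring up S3 event notifications, a minimal local smoke test might look like the sketch below. It assumes your local AWS credentials can read and write the test bucket and that an object already exists at landing/Test1.txt (both the bucket name and key are placeholders):

from lambda_function import lambda_handler

# Trimmed S3 event, as described earlier; bucket and key are placeholders
event = {"Records": [{"s3": {"bucket": {"name": "your-bucket-name"},
                             "object": {"key": "landing/Test1.txt"}}}]}

print(lambda_handler(event, None))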
Customize Lambda Role Permissions
Let's now customize the permissions on the IAM Role being used by the Lambda function to give it access to the S3 test bucket.
- In the AWS Lambda console for the ClassifyObjectFunction navigate to the Configuration tab.
- Click on the IAM role under the Execution Role section to open the IAM console in a new tab.
- In the IAM console for the ClassifyObjectFunction role click on the Add permissions dropdown and select Create inline policy.
- On the Policy editor screen click on the JSON button and add the following IAM policy.
{ "Version": "2012-10-17", "Statement": [ { "Sid": "Statement1", "Effect": "Allow", "Action": [ "s3:GetObject", "s3:PutObject", "s3:DeleteObject" ], "Resource": [ "arn:aws:s3:::<your_bucket_name>/*"] } ] }
- Click on Next once the policy has been updated.
- Name the policy AllowAccessToS3 and click on Create policy.
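The same inline policy can also be attached with boto3 if you prefer scripting it; the role name below is a placeholder for the auto-generated role attached to your function, and the bucket ARN should be replaced with your own:

import json

import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "Statement1",
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
        "Resource": ["arn:aws:s3:::your-bucket-name/*"],  # replace with your bucket
    }],
}

iam.put_role_policy(
    RoleName="ClassifyObjectFunction-role-abc123",  # placeholder role name
    PolicyName="AllowAccessToS3",
    PolicyDocument=json.dumps(policy),
)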
Enable S3 Event Notifications
We're now ready to enable S3 event notifications to invoke the Lambda function on newly uploaded objects.
- In the S3 console navigate to the Properties tab within your test bucket.
- Scroll down and click on Create event notification.
- Enter InvokeClassifyObjectFunction for the event name.
- Specify a prefix of landing/.
- Select the tickbox next to All object create events.
- Under Destination ensure that Lambda function is selected.
- Under Specify Lambda function select the ClassifyObjectFunction created earlier.
- Click on Save changes to create the event notification.
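For completeness, an equivalent configuration via boto3 is sketched below. Note that the console grants S3 permission to invoke the function automatically; when scripting, you would also need to add that resource-based permission (for example with the Lambda AddPermission API). The bucket name and ARNs shown are placeholders:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="your-bucket-name",  # placeholder bucket
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "Id": "InvokeClassifyObjectFunction",
            "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:ClassifyObjectFunction",
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "landing/"}]}},
        }]
    },
)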
Update Function Timeout
While this solution only reads a portion of each object to determine its type, the Lambda function will likely need more than the default 3-second timeout to complete.
- In the Lambda console for the ClassifyObjectFunction select the Configuration tab followed by General configuration.
- Click on the Edit button and update the function timeout to 30 seconds.
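The same change can be scripted with boto3, assuming the function name used earlier:

import boto3

lambda_client = boto3.client("lambda")

# Raise the timeout from the default 3 seconds to 30 seconds
lambda_client.update_function_configuration(
    FunctionName="ClassifyObjectFunction",
    Timeout=30,
)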
Specify Allowed File Types
To facilitate easy updates we are using an environment variable within the Lambda function to determine which file types are allowed.
- In the Lambda console for the ClassifyObjectFunction select the Configuration tab followed by Environment variables.
- Click on Edit to add a new environment variable.
- Enter ALLOWED_FILE_TYPES for the key and enter text/plain for the value.
Note: You can specify as many allowed file types as you wish, up to the maximum environment variable size supported by Lambda. The list should be comma-separated, with no spaces between entries.
- Click on Save when done.
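As with the timeout, the environment variable can also be set through boto3. Keep in mind that this call replaces the function's entire set of environment variables, so include any others you rely on:

import boto3

lambda_client = boto3.client("lambda")

lambda_client.update_function_configuration(
    FunctionName="ClassifyObjectFunction",
    Environment={"Variables": {"ALLOWED_FILE_TYPES": "text/plain"}},
)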
Test the Solution
Create and Upload Test Files
We're now ready to test the solution and will do this by uploading two files to the /landing prefix in our S3 test bucket.
- Create two files on your local machine using Microsoft Word and Notepad++ and save them to your desktop as Test1.docx and Test1.txt respectively.
- Upload the two files to the /landing prefix in your S3 bucket.
Very shortly after you've uploaded these files you should see them disappear from the /landing prefix. The Test1.txt file will be moved to the /supported prefix, while the Test1.docx file will be moved to the /unsupported prefix.
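If you'd like to script the test, the sketch below uploads the two files and then lists each prefix to confirm where they ended up. The bucket name is a placeholder and the short sleep is only a rough allowance for processing time; adjust as needed:

import time

import boto3

s3 = boto3.client("s3")
bucket = "your-bucket-name"  # placeholder bucket

# Upload the two local test files into the landing/ prefix
s3.upload_file("Test1.txt", bucket, "landing/Test1.txt")
s3.upload_file("Test1.docx", bucket, "landing/Test1.docx")

time.sleep(15)  # give the Lambda function a moment to process both objects

for prefix in ("landing/", "supported/", "unsupported/"):
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    keys = [obj["Key"] for obj in resp.get("Contents", [])]
    print(prefix, keys)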
Monitor Function Invocation in CloudWatch
The function has been configured to emit logs to CloudWatch to help validate the processing that's taking place.
- From the Monitor tab within the ClassifyObjectFunction, click on the View CloudWatch Logs button to open CloudWatch in a new tab.
- You should notice two new log streams that have a time stamp corresponding with the time you uploaded the test files to S3. Open each of them in a new tab.
- Verify the log events for the Test1.txt and Test1.docx invocations; you should see the detected MIME type and the prefix each object was moved to.
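You can also pull the same log lines programmatically; a sketch using the CloudWatch Logs API (assuming the default /aws/lambda/<function-name> log group naming) is shown below:

import boto3

logs = boto3.client("logs")

# Lambda log groups follow the /aws/lambda/<function-name> convention
resp = logs.filter_log_events(
    logGroupName="/aws/lambda/ClassifyObjectFunction",
    filterPattern="allowed",  # matches the "is allowed" / "is not allowed" log lines
)
for event in resp["events"]:
    print(event["message"].strip())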
Summary and Call to Action
This post has demonstrated how you can use S3 event notifications, AWS Lambda functions, and the Python-Magic library to accurately detect and categorize file types. We encourage you to test this solution within your environment and modify it to suit your needs.