
Crawl from excel file with multiple sheets?


I have Excel files with multiple sheets stored in S3. Currently, I keep a separate CSV file for each sheet and crawl each CSV file individually. Instead of doing this, I would like to crawl from the one Excel file, by specifying the name of the xlsx file and the sheet. Is there a way to do this with Boto3? Also, is there a way to use a crawler to generate multiple tables? For example, if there are 4 sheets in the xlsx file (A, B, C and D), I would like to generate tables for all 4 of them so that I can use them in Athena.

asked 2 years ago · 1.2K views
1 Answer

Hello, check whether one of these approaches resolves your issue:

1. Use an AWS Lambda function to convert Excel sheets to CSV:

Lambda Function Setup:

  • Use the pandas library to read Excel sheets and convert them to CSV.
  • Store each CSV in an S3 bucket.
import boto3
import pandas as pd
from io import BytesIO

s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = 'your-s3-bucket'
    key = 'path/to/your/file.xlsx'

    # Download the Excel file from S3
    obj = s3.get_object(Bucket=bucket, Key=key)
    excel_data = obj['Body'].read()

    # Load the workbook (reading .xlsx requires the openpyxl package
    # to be bundled with the Lambda, e.g. as a layer)
    xls = pd.ExcelFile(BytesIO(excel_data))

    # Convert each sheet to CSV and upload back to S3
    for sheet_name in xls.sheet_names:
        df = pd.read_excel(xls, sheet_name=sheet_name)
        # to_csv returns the CSV as a string when no buffer is given
        csv_body = df.to_csv(index=False)
        s3.put_object(Bucket=bucket, Key=f'path/to/csv/{sheet_name}.csv', Body=csv_body)
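
You can verify the per-sheet conversion logic locally before wiring it to S3. A minimal sketch (assumes pandas and openpyxl are installed; the sheet names and data are just sample values):

```python
from io import BytesIO
import pandas as pd

# Build a small two-sheet workbook in memory (openpyxl handles .xlsx).
buf = BytesIO()
with pd.ExcelWriter(buf, engine='openpyxl') as writer:
    pd.DataFrame({'a': [1, 2]}).to_excel(writer, sheet_name='A', index=False)
    pd.DataFrame({'b': [3]}).to_excel(writer, sheet_name='B', index=False)

# Same loop as the Lambda, but collecting CSV strings instead of uploading.
xls = pd.ExcelFile(BytesIO(buf.getvalue()))
csvs = {name: pd.read_excel(xls, sheet_name=name).to_csv(index=False)
        for name in xls.sheet_names}
```

Each value in `csvs` is exactly the body that `put_object` would upload for that sheet.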

2. Use an AWS Glue Crawler to crawl the CSV files:

1. Create an AWS Glue Crawler:

  • Set the S3 path where the CSV files are stored.
  • Configure the crawler to create separate tables for each CSV file.
  • Run the crawler to update the Glue Data Catalog.

2. Query the tables in Athena:

  • After the crawler runs, the tables for each sheet will be available in the Glue Data Catalog.
  • Use Athena to query the tables.
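
Since you asked about Boto3: the crawler itself can also be created and started programmatically. A minimal sketch of the two Glue API calls involved (the crawler name, role ARN, database, and bucket path are hypothetical placeholders, not values from your setup):

```python
def create_and_run_crawler(glue, name, role_arn, database, s3_path):
    """Create a Glue crawler over the CSV prefix, then start it.

    `glue` is a boto3 Glue client; the other arguments are
    placeholders you would replace with your own values.
    """
    glue.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName=database,
        Targets={'S3Targets': [{'Path': s3_path}]},
    )
    glue.start_crawler(Name=name)

# Example usage (hypothetical names):
# import boto3
# glue = boto3.client('glue')
# create_and_run_crawler(glue, 'excel-sheets-crawler',
#                        'arn:aws:iam::123456789012:role/GlueCrawlerRole',
#                        'excel_sheets_db', 's3://your-s3-bucket/path/to/csv/')
```

The IAM role must grant the crawler read access to the S3 path and write access to the Glue Data Catalog.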
answered 2 years ago
  • How do I configure Crawler to create separate tables for each CSV file?

  • When setting up the AWS Glue Crawler, point it at the S3 path where your CSV files are stored. Note that Glue may merge files that share a prefix and a compatible schema into a single table, so the most reliable way to get one table per sheet is to write each sheet's CSV into its own folder (prefix). After running the crawler, you'll have individual tables in the Glue Data Catalog, ready to query in Athena.
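
To make the one-folder-per-sheet layout concrete, here is a small helper for building per-sheet S3 keys (the prefix and naming scheme are assumptions; adjust them to your bucket layout):

```python
def sheet_key(prefix, sheet_name):
    """Build an S3 key that puts each sheet's CSV in its own folder,
    so the crawler creates one table per sheet instead of merging
    same-schema files that share a prefix."""
    safe = sheet_name.strip().lower().replace(' ', '_')
    return f"{prefix}/{safe}/{safe}.csv"
```

In the Lambda function you would then upload with `Key=sheet_key('path/to/csv', sheet_name)`.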
