Glue Visual ETL: Can't copy raw data from RDS MySQL to S3 bucket due to unclassified error: Schema specified that header line is to be written; but contains no column names


I'm trying to build an ETL pipeline with AWS Glue, and the first step is to copy raw data from the original source to a staging bucket. The job is rather simple: the source is a Data Catalog table (crawled from MySQL on RDS) and the target is an S3 bucket in compressed CSV format.

I added a filter to copy just 1000 records instead of the whole table, as a proof of concept. But run after run I get the same unclassified error: "Schema specified that header line is to be written; but contains no column names", which makes no sense because the source schema does have column names. Amazon Q suggested a schema mismatch between source and target (which is not the case), but I ran the crawler again and tried to use everything "out of the box", to keep it as simple as possible, with no luck.

Screenshot

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import re

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node Solicitud
Solicitud_node1705721974155 = glueContext.create_dynamic_frame.from_catalog(
    database="onix",
    table_name="onix_solicitud",
    transformation_ctx="Solicitud_node1705721974155",
)

# Script generated for node Filter
Filter_node1705723991171 = Filter.apply(
    frame=Solicitud_node1705721974155,
    f=lambda row: (row["id"] >= 262667709 and row["id"] <= 262668712),
    transformation_ctx="Filter_node1705723991171",
)

# Script generated for node Raw
Raw_node1705722955468 = glueContext.write_dynamic_frame.from_options(
    frame=Filter_node1705723991171,
    connection_type="s3",
    format="csv",
    connection_options={
        "path": "s3://xxxxxxxxxxx/raw/",
        "compression": "gzip",
        "partitionKeys": [],
    },
    transformation_ctx="Raw_node1705722955468",
)

job.commit()
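
For reference, the header behavior of Glue's CSV writer can also be set explicitly through format_options (writeHeader is the documented CSV write option, and it defaults to true). A minimal variation of the write call above, assuming the same filtered frame and bucket path, would look like this:

# Same sink as above, but with the header request made explicit via format_options
Raw_node1705722955468 = glueContext.write_dynamic_frame.from_options(
    frame=Filter_node1705723991171,
    connection_type="s3",
    format="csv",
    # Explicitly ask for a header row in the output files
    format_options={"writeHeader": True},
    connection_options={
        "path": "s3://xxxxxxxxxxx/raw/",
        "compression": "gzip",
        "partitionKeys": [],
    },
    transformation_ctx="Raw_node1705722955468",
)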

It's VERY frustrating not being able to finish simple tasks like this; I'm wasting a lot of time. Could this be related to something else, like invalid data in the source? It would not be the first time the error message has had nothing to do with the root cause.
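
One quick way to rule out an empty or mis-read schema before suspecting the data itself is to print what the filtered frame actually contains, using the variables from the script above:

# Inspect what the writer will actually receive: column names and record count
Filter_node1705723991171.printSchema()
print("Filtered record count:", Filter_node1705723991171.count())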

asked 3 months ago · 221 views
1 Answer
Accepted Answer

The problem was that there were some invalid dates in the source table; after converting that column to text, the job succeeded.
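
For anyone hitting the same error, one way to do that conversion inside the job itself (rather than changing the source table) is to cast the offending column to string before writing. A minimal sketch, assuming a hypothetical column named fecha_solicitud holding the invalid dates:

# Cast the problematic date column to string so the CSV writer never has to
# interpret the invalid values (the column name here is hypothetical)
Casted_node = Filter_node1705723991171.resolveChoice(
    specs=[("fecha_solicitud", "cast:string")]
)
# Then pass Casted_node to write_dynamic_frame.from_options instead of the filtered frame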

answered 3 months ago
