How do I resolve the "Unable to infer schema" exception in AWS Glue?


My AWS Glue job fails with one of the following exceptions:

"AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'" "AnalysisException: u'Unable to infer schema for ORC. It must be specified manually.;'"

Short description

This error usually happens when AWS Glue tries to read a Parquet or ORC file that's not stored in an Apache Hive-style partitioned path with the key=value structure. AWS Glue expects the partition folders in the Amazon Simple Storage Service (Amazon S3) source path to be named as key-value pairs. For example, if the AWS Glue job is processing files at s3://s3-bucket/parquet-data/, the files must have the following partition structure:

s3://s3-bucket/parquet-data/year=2018/month=10/day=10/file1.parquet

If your Parquet or ORC files are stored in a hierarchical structure without key=value folder names, the AWS Glue job fails with the "Unable to infer schema" exception. Example:

s3://s3-bucket/parquet-data/year/month/day/file1.parquet
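
In contrast, when the files use the key=value layout shown earlier, the underlying Spark reader infers the partition columns (year, month, and day) directly from the path. The following is a minimal sketch of that behavior; the bucket and prefix are the placeholder names from the preceding examples:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# With Hive-style paths such as .../year=2018/month=10/day=10/, Spark infers
# year, month, and day as partition columns when reading the root path.
df = spark.read.parquet("s3://s3-bucket/parquet-data/")
df.printSchema()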

Resolution

Use one of the following methods to resolve the error.

Restructure your data

Copy the files into a new S3 bucket and use Hive-style partitioned paths. Run the job again.
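
For example, the following is a minimal boto3 sketch that copies objects from a year/month/day hierarchy into Hive-style key=value paths. The destination bucket name (s3-bucket-restructured) and the key layout are assumptions; adjust them to match your data.

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Assumed layout: s3://s3-bucket/parquet-data/2018/10/10/file1.parquet is copied to
# s3://s3-bucket-restructured/parquet-data/year=2018/month=10/day=10/file1.parquet
for page in paginator.paginate(Bucket="s3-bucket", Prefix="parquet-data/"):
    for obj in page.get("Contents", []):
        prefix, year, month, day, file_name = obj["Key"].split("/")
        new_key = "{}/year={}/month={}/day={}/{}".format(prefix, year, month, day, file_name)
        s3.copy_object(
            Bucket="s3-bucket-restructured",
            Key=new_key,
            CopySource={"Bucket": "s3-bucket", "Key": obj["Key"]},
        )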

Replace partition column names with asterisks

If restructuring your data isn't feasible, create the DynamicFrame directly from Amazon S3. In the Amazon S3 path, replace all partition column names with asterisks (*). When you use this solution, AWS Glue doesn't include the partition columns in the DynamicFrame. It includes only the data columns.

For example, assume that your files are stored in an S3 bucket with the following partition structure:

s3://s3-bucket/parquet-data/year/month/day/files.parquet

To process all files in the s3://s3-bucket/parquet-data/ path, create the DynamicFrame:

dynamic_frame0 = glueContext.create_dynamic_frame_from_options(
    "s3",
    connection_options={"paths": ["s3://s3-bucket/parquet-data/*/*/*"]},
    format="parquet",
    transformation_ctx="dynamic_frame0",
)

Use a map transformation to add partition columns

To include the partition columns in the DynamicFrame, first create a DataFrame and add a column that contains the Amazon S3 file path. Then, create the DynamicFrame and apply a map transformation to add the partition columns, as shown in the following example. Before you use the sample code, replace the Amazon S3 paths and enter your partition column names with the correct index values.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import *

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read all files under the non-Hive-style path and record each row's source file path.
df = spark.read.parquet("s3://s3-bucket/parquet-data/*/*/*")
modified_df = df.withColumn('partitions_column', input_file_name())
dyf_0 = DynamicFrame.fromDF(modified_df, glueContext, "dyf_0")

# Split the file path and map its segments to partition columns. Adjust the index
# values (4, 5, 6) to match the position of the partition folders in your S3 path.
def modify_col(x):
    if x['partitions_column']:
        new_columns = x['partitions_column'].split('/')
        x['year'], x['month'], x['day'] = new_columns[4], new_columns[5], new_columns[6]
        del x['partitions_column']
    return x

modified_dyf = Map.apply(dyf_0, f=modify_col)

# Write the output partitioned by the new columns.
datasink2 = glueContext.write_dynamic_frame.from_options(
    frame=modified_dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/output/", "partitionKeys": ["year", "month", "day"]},
    format="parquet",
    transformation_ctx="datasink2",
)
job.commit()
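
To confirm that the mapped partition columns are present, you can add a check such as the following before the write step:

# Optional check: the mapped DynamicFrame now contains year, month, and day columns.
modified_dyf.printSchema()
modified_dyf.toDF().select("year", "month", "day").show(5)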

For more information about applying map transformations, see Map Class.


Related information

Managing partitions for ETL output in AWS Glue

AWS OFFICIAL
Updated 3 years ago
2 Comments

There could be two reasons:

  1. AWS Glue might not be able to verify the file. Make sure that the file is of the correct type and that you're selecting the correct options (such as delimiters).

  2. [MOST PROBABLE REASON] The IAM permissions for the S3 objects might be missing. Check not only that the correct permissions are granted, but also that they are granted for the correct resources.

For example:

"Effect" : "Allow",
"Action": [
     "s3:GetObject"
],
"Resource" :[
     "arn:aws:s3:::{your_bucket_name}",
     "arn:aws:s3:::{your_bucket_name}/*"
]

In your case, replace your_bucket_name with s3-bucket.

AB
answered 6 months ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS
MODERATOR
answered 6 months ago