When I parse a fixed-width .dat file with a built-in classifier, my AWS Glue crawler classifies the file as UNKNOWN.
Short description
Built-in classifiers can't parse fixed-width data files. Use a grok custom classifier instead.
Resolution
Create the grok custom classifier
Complete the following steps:
- Open the AWS Glue console.
- In the navigation pane, choose Classifiers.
- Choose Add classifier, and then enter the following:
For Classifier name, enter a unique name.
For Classifier type, choose Grok.
For Classification, enter a description of the format or type of data that you're classifying.
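For Grok pattern, enter the pattern that you want AWS Glue to use to find matches in your data. A fixed-width .dat file has no delimiter between fields. Because each field has a known length, use a regex pattern with one named capture group per field, where each group matches exactly that field's number of characters.
Example for four fields that are 7, 8, 14, and 52 characters wide: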
(?<col0>.{7})(?<col1>.{8})(?<col2>.{14})(?<col3>.{52})
(Optional) For Custom patterns, enter any custom patterns that you want to use. These patterns are referenced by the grok pattern that classifies your data. Each custom pattern must be on a separate line. For more information, see Writing grok custom classifiers.
- Choose Create.
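Before you save the classifier, it can help to check the pattern against a sample line locally. The following Python sketch is one possible way to do that, not an AWS-provided tool. The function names, the "col" prefix, and the sample values are hypothetical. Python's re module writes named groups as (?P<name>...), so the sketch converts the grok-style (?<name>...) syntax before matching. It also uses a full-line match, which is a stricter check chosen for this sketch and might not mirror exactly how AWS Glue applies the pattern.

```python
import re


def build_fixed_width_pattern(widths, prefix="col"):
    """Build a grok-style regex with one named group per fixed-width field."""
    return "".join(
        f"(?<{prefix}{i}>.{{{width}}})" for i, width in enumerate(widths)
    )


def split_fixed_width(line, widths, prefix="col"):
    """Split one line into named fields, or return None if it doesn't fit."""
    grok_pattern = build_fixed_width_pattern(widths, prefix)
    # Python spells named groups (?P<name>...), grok uses (?<name>...).
    python_pattern = grok_pattern.replace("(?<", "(?P<")
    # fullmatch rejects lines whose length differs from the sum of the widths.
    match = re.fullmatch(python_pattern, line)
    return match.groupdict() if match else None


def grok_classifier_request(name, classification, widths):
    """Assemble keyword arguments in the shape of a grok classifier request.

    Assumption: the result is meant to be passed to the boto3 Glue client,
    for example boto3.client("glue").create_classifier(**request).
    """
    return {
        "GrokClassifier": {
            "Name": name,
            "Classification": classification,
            "GrokPattern": build_fixed_width_pattern(widths),
        }
    }
```

With the widths 7, 8, 14, and 52, build_fixed_width_pattern returns the same pattern as the example in the steps. Strip the trailing newline from each sample line before you call split_fixed_width, because "." doesn't match a newline character.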
Create and run the crawler
Complete the following steps:
- In the navigation pane, choose Crawlers.
- Choose Add crawler.
- For Crawler name, enter a unique name.
- Choose the arrow next to the Tags, description, security configuration, and classifiers (optional) section, and then go to the Custom classifiers section.
- Choose Add next to the custom classifier that you previously created, and then choose Next.
- On the Specify crawler source type page, choose Data stores, and then choose Next.
- On the Add a data store page, enter the following:
For Choose data store, choose your preferred data store.
For Include path, enter the path to your .dat file.
- Choose Next, and then confirm whether you want to add another data store.
- On the Choose an IAM role page, select an existing AWS Identity and Access Management (IAM) role or create a new one. Then, choose Next.
- For Frequency, choose Run on demand, and then choose Next.
- On the Configure the crawler's output page, for Database, choose the database that you want to create the table in. Then, choose Next.
- Choose Finish.
- When the crawler status changes to Ready, select the crawler name, and then choose Run crawler.
- Wait for the crawler to finish, and then choose Tables in the navigation pane. Confirm that the table's Classification matches the classification that you entered for the grok custom classifier. If the Classification is UNKNOWN, then the grok pattern didn't match your data.
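The console steps above can also be approximated in a script. The following Python sketch assumes that you pass in a boto3 Glue client (boto3.client("glue")) and that the role, database, and Amazon S3 path already exist. All function names and argument values are hypothetical. It's a minimal outline under those assumptions, not a complete deployment script: it has no timeout or error handling, and it reads only the first page of get_tables results.

```python
import time


def crawler_request(name, role_arn, database, s3_path, classifier_name):
    """Assemble keyword arguments for glue.create_crawler(**request)."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Attach the grok custom classifier so that it runs before the
        # built-in classifiers.
        "Classifiers": [classifier_name],
    }


def run_crawler_and_wait(glue, name, poll_seconds=15):
    """Start the crawler and poll until its state returns to READY."""
    glue.start_crawler(Name=name)
    while glue.get_crawler(Name=name)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)


def tables_with_classification(glue, database, expected):
    """Return the names of tables whose classification equals expected."""
    tables = glue.get_tables(DatabaseName=database)["TableList"]
    return [
        table["Name"]
        for table in tables
        if table.get("Parameters", {}).get("classification") == expected
    ]
```

A typical sequence is to call glue.create_crawler(**crawler_request(...)), then run_crawler_and_wait, and then tables_with_classification with the classification that you entered for the grok custom classifier. An empty result suggests that the pattern didn't match the data.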
Related information
Creating classifiers using the AWS Glue console
Defining and managing classifiers