By using AWS re:Post, you agree to the AWS re:Post Terms of Use

How do I resolve the RegexSerDe error "Number of matching groups doesn't match the number of columns" in Athena?

2 minute read
0

I used RegexSerDe to create an Amazon Athena table. When I query the table, I get this error: "Number of matching groups doesn't match the number of columns".

Resolution

To resolve this error, match the number of fields in the Athena table with the number of capturing groups in the regex pattern. For example, the following row input data has seven fields:

64.###.##.## - - [07/Jul/2020:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523

The following is the correct regex pattern for the example input data:

^([0-9.]+) ([\\w.-]) ([\\w.-]) \\[([A-Za-z0-9:/]+ [+-][0-9]{4})\\] \"(.+?)\" ([0-9]{3}) ([0-9]+)$

Note: RegexSerDe follows the Java standard. Because the backslash is an escape character in the Java String class, you must use a double backslash to define a single backslash. For example, you must use \\w to define \w in your regex.

The example regex pattern has seven capturing groups and the example input data has seven fields. You don't get the RegexSerDe error because the number of capturing groups matches the number of fields.

To run a DDL statement, specify the regex capturing groups for SERDEPROPERTIES as shown in the following example:

CREATE EXTERNAL TABLE logs  
(  
  col1 string,  
  col2 string,  
  col3 string,  
  col4 string,  
  col5 string,  
  col6 string,  
  col7 string   
)   
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'  
WITH SERDEPROPERTIES (  
   "input.regex" = "^([0-9.]+) ([\\w.-]) ([\\w.-]) \\[([A-Za-z0-9:/]+ [+-][0-9]{4})\\] \"(.+?)\" ([0-9]{3}) ([0-9]+)$")   
LOCATION 's3://doc-example-bucket/path/'

Related information

Use SerDes

How do I use Amazon Athena to analyze my Amazon S3 server access logs?

AWS OFFICIAL
AWS OFFICIALUpdated 3 months ago