How do I resolve the RegexSerDe error "Number of matching groups doesn't match the number of columns" in Amazon Athena?

2 minute read
0

I created an Amazon Athena table using RegexSerDe. When I query the table, I get this error: "Number of matching groups doesn't match the number of columns."

Resolution

To resolve this error, be sure that the number of capturing groups in the regex pattern matches the number of fields that were defined when you created the table in Athena. For example, here is a row of input data:

64.xxx.xx.xx - - [07/Jul/2020:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523

The row has seven fields. This is the correct regex pattern:

^([0-9.]+) ([\\w.-]) ([\\w.-]) \\[([A-Za-z0-9:/]+ [+-][0-9]{4})\\] \"(.+?)\" ([0-9]{3}) ([0-9]+)$

Note: RegexSerDe follows the Java standard. Because the backslash is an escape character in the Java String class, you must use a double backslash to define a single backslash. For example, to define \w, you must use \w in your regex.

There are seven capturing groups in this regex pattern, and there are seven fields in the input data. When you query the table, RegexSerDe doesn't throw the "Number of matching groups doesn't match the number of columns" exception.

To run a DDL statement, specify the regex capturing groups for SERDEPROPERTIES, as shown in the following example:

CREATE EXTERNAL TABLE logs
(
  col1 string,
  col2 string,
  col3 string,
  col4 string,
  col5 string,
  col6 string,
  col7 string 
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
   "input.regex" = "^([0-9.]+) ([\\w.-]) ([\\w.-]) \\[([A-Za-z0-9:/]+ [+-][0-9]{4})\\] \"(.+?)\" ([0-9]{3}) ([0-9]+)$") 
LOCATION 's3://doc-example-bucket/path/'

Related information

Using a SerDe

AWS OFFICIAL
AWS OFFICIALUpdated 2 years ago