I used RegexSerDe to create an Amazon Athena table. When I query the table, I get this error: "Number of matching groups doesn't match the number of columns".
Resolution
To resolve this error, match the number of fields in the Athena table with the number of capturing groups in the regex pattern. For example, the following row input data has seven fields:
64.###.##.## - - [07/Jul/2020:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523
The following is the correct regex pattern for the example input data:
^([0-9.]+) ([\\w.-]) ([\\w.-]) \\[([A-Za-z0-9:/]+ [+-][0-9]{4})\\] \"(.+?)\" ([0-9]{3}) ([0-9]+)$
Note: RegexSerDe follows the Java standard. Because the backslash is an escape character in the Java String class, you must use a double backslash to define a single backslash. For example, you must use \\w to define \w in your regex.
The example regex pattern has seven capturing groups and the example input data has seven fields. You don't get the RegexSerDe error because the number of capturing groups matches the number of fields.
To run a DDL statement, specify the regex capturing groups for SERDEPROPERTIES as shown in the following example:
CREATE EXTERNAL TABLE logs
(
col1 string,
col2 string,
col3 string,
col4 string,
col5 string,
col6 string,
col7 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^([0-9.]+) ([\\w.-]) ([\\w.-]) \\[([A-Za-z0-9:/]+ [+-][0-9]{4})\\] \"(.+?)\" ([0-9]{3}) ([0-9]+)$")
LOCATION 's3://doc-example-bucket/path/'
Related information
Use SerDes
How do I use Amazon Athena to analyze my Amazon S3 server access logs?