AWS Athena - Hive Regex
I have service logs in the format below:
{"date":"2024-03-13T16:19:22.456430Z","stream":"stdout","log":"{\"level\":\"info\",\"message\":\"request arrived - method: GET url: /.well-known/apollo/server-health hostname: 10.213.4.160\",\"appId\":\"TEST-b33a5c95-dba6-485f-9eba-53a67e131a47\",\"appName\":null,\"service\":\"my-service\",\"version\":\"1.0.0-20231204-1357\",\"context\":\"CorrelationMiddleware\",\"hostname\":\"my-service-69fd96486f-h525p\",\"time\":1710346762456}","pod_name":"my-service-69fd96486f-h525p","namespace_name":"test-namespace","container_name":"my-service","cluster_name":"aws-account-dev-eks-cluster-multiaz-01"}
{"date":"2024-03-13T16:19:22.456910Z","stream":"stdout","log":"{\"level\":\"info\",\"message\":\"response dispatched. timetaken: 0 seconds\",\"appId\":\"TEST-b33a5c95-dba6-485f-9eba-53a67e131a47\",\"appName\":null,\"service\":\"my-service\",\"version\":\"1.0.0-20231204-1357\",\"context\":\"CorrelationMiddleware\",\"hostname\":\"my-service-69fd96486f-h525p\",\"time\":1710346762456}","pod_name":"my-service-69fd96486f-h525p","namespace_name":"test-namespace","container_name":"my-service","cluster_name":"aws-account-dev-eks-cluster-multiaz-01"}
I'm trying to use the following regex
"date":"([^"]+)".*"stream":"([^"]+)".*"log":"\{\\"level\\":\\"([^\\"]+)\\",\\"message\\":\\"([^\\"]+)\\",\\"appId\\":\\"([^\\"]+)\\",\\"appName\\":([^,\\"]*),\\"service\\":\\"([^\\"]+)\\",\\"version\\":\\"([^\\"]+)\\",\\"context\\":\\"([^\\"]+)\\",\\"hostname\\":\\"([^\\"]+)\\",\\"time\\":(\d+)}".*"pod_name":"([^"]+)".*"namespace_name":"([^"]+)".*"container_name":"([^"]+)".*"cluster_name":"([^"]+)"
in order to extract the values of date, stream, level, message, correlationId, appName, service, version, context, hostname, time, pod_name, namespace_name, container_name, and cluster_name.
The regex extracts these values perfectly when I try it in a regex tester, but it seems Hive does not accept it as written. I'm getting the error below:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.util.regex.PatternSyntaxException: Illegal repetition near index 46 "date":"([^"]+)".*"stream":"([^"]+)".*"log":"{\"level\":\"([^\"]+)\",\"message\":\"([^\"]+)\",\"correlationId\":\"([^\"]+)\",\"appName\":([^,\"]*),\"service\":\"([^\"]+)\",\"version\":\"([^\"]+)\",\"context\":\"([^\"]+)\",\"hostname\":\"([^\"]+)\",\"time\":(d+)}".*"pod_name":"([^"]+)".*"namespace_name":"([^"]+)".*"container_name":"([^"]+)".*"cluster_name":"([^"]+)" ^
This query ran against the ‘myfirstathenadb’ database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query ID: deaf3582-aaca-4ee6-800e-b8fb47149813
Can someone help me with this?
This is my Athena script:
CREATE EXTERNAL TABLE IF NOT EXISTS table1(
date STRING,
stream STRING,
level STRING,
message STRING,
appId STRING,
appName STRING,
service STRING,
version STRING,
context STRING,
hostname STRING,
time BIGINT,
pod_name STRING,
namespace_name STRING,
container_name STRING,
cluster_name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex'='"date":"([^"]+)".*"stream":"([^"]+)".*"log":"\{\\"level\\":\\"([^\\"]+)\\",\\"message\\":\\"([^\\"]+)\\",\\"correlationId\\":\\"([^\\"]+)\\",\\"appName\\":([^,\\"]*),\\"service\\":\\"([^\\"]+)\\",\\"version\\":\\"([^\\"]+)\\",\\"context\\":\\"([^\\"]+)\\",\\"hostname\\":\\"([^\\"]+)\\",\\"time\\":(\d+)}".*"pod_name":"([^"]+)".*"namespace_name":"([^"]+)".*"container_name":"([^"]+)".*"cluster_name":"([^"]+)"')
LOCATION 's3://location-of-s3-buskcet/'
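One possible reading of the error, sketched in Java because Hive's RegexSerDe compiles patterns with java.util.regex: the pattern echoed in the error message has lost one level of backslashes compared with the DDL (`\{` arrived as `{`, `\\"` as `\"`, `\d` as `d`), and a bare `{` is what Java reports as "Illegal repetition". The class and helper names below are hypothetical, and `unescapeOnce` is only a simplified stand-in for whatever unescaping the DDL parser performs, not Hive's actual code. Only a fragment of the regex and of a sample line is used, to keep the escaping readable.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class HiveRegexEscapeDemo {

    // Fragment of the regex as typed into a regex tester:
    //   "log":"\{\\"level\\":\\"([^\\"]+)\\"
    static final String TESTED =
            "\"log\":\"\\{\\\\\"level\\\\\":\\\\\"([^\\\\\"]+)\\\\\"";

    // Fragment of a sample log line:
    //   "log":"{\"level\":\"info\",
    static final String SAMPLE = "\"log\":\"{\\\"level\\\":\\\"info\\\",";

    // Assumption: the DDL string literal loses one level of backslash
    // escaping, so every "\x" becomes "x". This is a simplification.
    static String unescapeOnce(String literal) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < literal.length(); i++) {
            char c = literal.charAt(i);
            if (c == '\\' && i + 1 < literal.length()) {
                out.append(literal.charAt(++i)); // drop the backslash, keep the next char
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Doubles every backslash so that one round of unescaping restores the regex.
    static String doubleBackslashes(String regex) {
        return regex.replace("\\", "\\\\");
    }

    // True if java.util.regex accepts the pattern.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Returns capture group 1 (the log level) or null if the line does not match.
    static String extractLevel(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // What the regex engine receives if the tested regex is pasted as-is.
        String seenByEngine = unescapeOnce(TESTED);
        System.out.println("after one unescape: " + seenByEngine);
        System.out.println("compiles: " + compiles(seenByEngine));

        // What the regex engine receives if the backslashes are doubled first.
        String ddlLiteral = doubleBackslashes(TESTED);
        System.out.println("doubled literal: " + ddlLiteral);
        System.out.println("level: " + extractLevel(unescapeOnce(ddlLiteral), SAMPLE));
    }
}
```

If this reading is right, the `input.regex` value in the DDL would need every backslash doubled, for example `\\{` instead of `\{`, `\\\\"` instead of `\\"`, and `\\d+` instead of `\d+`. Separately, the regex in the DDL expects `correlationId` while the sample log lines contain `appId`, so the rows would probably still not match until the two agree.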
Asked 2 months ago, viewed 531 times