AWS Athena - Hive Regex

I have service logs in the following format:

{"date":"2024-03-13T16:19:22.456430Z","stream":"stdout","log":"{\"level\":\"info\",\"message\":\"request arrived - method: GET url: /.well-known/apollo/server-health hostname: 10.213.4.160\",\"appId\":\"TEST-b33a5c95-dba6-485f-9eba-53a67e131a47\",\"appName\":null,\"service\":\"my-service\",\"version\":\"1.0.0-20231204-1357\",\"context\":\"CorrelationMiddleware\",\"hostname\":\"my-service-69fd96486f-h525p\",\"time\":1710346762456}","pod_name":"my-service-69fd96486f-h525p","namespace_name":"test-namespace","container_name":"my-service","cluster_name":"aws-account-dev-eks-cluster-multiaz-01"}
{"date":"2024-03-13T16:19:22.456910Z","stream":"stdout","log":"{\"level\":\"info\",\"message\":\"response dispatched. timetaken: 0 seconds\",\"appId\":\"TEST-b33a5c95-dba6-485f-9eba-53a67e131a47\",\"appName\":null,\"service\":\"my-service\",\"version\":\"1.0.0-20231204-1357\",\"context\":\"CorrelationMiddleware\",\"hostname\":\"my-service-69fd96486f-h525p\",\"time\":1710346762456}","pod_name":"my-service-69fd96486f-h525p","namespace_name":"test-namespace","container_name":"my-service","cluster_name":"aws-account-dev-eks-cluster-multiaz-01"}

I'm trying to parse them with this regex:

"date":"([^"]+)".*"stream":"([^"]+)".*"log":"\{\\"level\\":\\"([^\\"]+)\\",\\"message\\":\\"([^\\"]+)\\",\\"appId\\":\\"([^\\"]+)\\",\\"appName\\":([^,\\"]*),\\"service\\":\\"([^\\"]+)\\",\\"version\\":\\"([^\\"]+)\\",\\"context\\":\\"([^\\"]+)\\",\\"hostname\\":\\"([^\\"]+)\\",\\"time\\":(\d+)}".*"pod_name":"([^"]+)".*"namespace_name":"([^"]+)".*"container_name":"([^"]+)".*"cluster_name":"([^"]+)"

The goal is to extract the values of date, stream, level, message, correlationId, appName, service, version, context, hostname, time, pod_name, namespace_name, container_name, and cluster_name.

This regex extracts the values perfectly when I try it in a regex tester, but it seems Hive does not accept it. I get the error below:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.util.regex.PatternSyntaxException: Illegal repetition near index 46 "date":"([^"]+)".*"stream":"([^"]+)".*"log":"{\"level\":\"([^\"]+)\",\"message\":\"([^\"]+)\",\"correlationId\":\"([^\"]+)\",\"appName\":([^,\"]*),\"service\":\"([^\"]+)\",\"version\":\"([^\"]+)\",\"context\":\"([^\"]+)\",\"hostname\":\"([^\"]+)\",\"time\":(d+)}".*"pod_name":"([^"]+)".*"namespace_name":"([^"]+)".*"container_name":"([^"]+)".*"cluster_name":"([^"]+)" ^
This query ran against the ‘myfirstathenadb’ database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query ID: deaf3582-aaca-4ee6-800e-b8fb47149813
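
The pattern echoed in the error message has one level of backslashes fewer than the pattern in the DDL (`\{` appears as `{`, `\\"` as `\"`, and `\d` as `d`). That suggests the string literal is unescaped once before it reaches java.util.regex, which would leave a bare `{` and explain "Illegal repetition". The Python sketch below models that idea on an abbreviated fragment of the regex. It assumes a simplified unescape rule (a backslash followed by any character collapses to that character); `hive_unescape` and `to_ddl_literal` are hypothetical helper names, not part of Hive or Athena.

```python
import re


def hive_unescape(literal: str) -> str:
    """Simplified model (an assumption): backslash + any char -> that char."""
    return re.sub(r"\\(.)", r"\1", literal, flags=re.DOTALL)


def to_ddl_literal(pattern: str) -> str:
    """Double every backslash so that one level survives the unescape step."""
    return pattern.replace("\\", "\\\\")


# Abbreviated fragment of the regex exactly as it was tested in a regex tester.
tested = r'"log":"\{\\"level\\":\\"([^\\"]+)\\".*\\"time\\":(\d+)}"'

as_written = hive_unescape(tested)    # what the engine sees if pasted unchanged
doubled = to_ddl_literal(tested)      # candidate text for 'input.regex'
round_trip = hive_unescape(doubled)   # what the engine sees after doubling

print(as_written)            # bare "{" and "(d+)", as in the error message
print(doubled)
print(round_trip == tested)  # doubling restores the tested pattern
```

If this model is right, every backslash in the tested regex would need to be doubled inside the DDL string: `\{` written as `\\{`, `\\"` as `\\\\"`, and `\d` as `\\d`.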

Can someone help me with this?

This is my Athena script:

CREATE EXTERNAL TABLE IF NOT EXISTS table1(
  date STRING,
  stream STRING,
  level STRING,
  message STRING,
  appId STRING,
  appName STRING,
  service STRING,
  version STRING,
  context STRING,
  hostname STRING,
  time BIGINT,
  pod_name STRING,
  namespace_name STRING,
  container_name STRING,
  cluster_name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex'='"date":"([^"]+)".*"stream":"([^"]+)".*"log":"\{\\"level\\":\\"([^\\"]+)\\",\\"message\\":\\"([^\\"]+)\\",\\"correlationId\\":\\"([^\\"]+)\\",\\"appName\\":([^,\\"]*),\\"service\\":\\"([^\\"]+)\\",\\"version\\":\\"([^\\"]+)\\",\\"context\\":\\"([^\\"]+)\\",\\"hostname\\":\\"([^\\"]+)\\",\\"time\\":(\d+)}".*"pod_name":"([^"]+)".*"namespace_name":"([^"]+)".*"container_name":"([^"]+)".*"cluster_name":"([^"]+)"')
LOCATION 's3://location-of-s3-buskcet/'
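
Two further things may be worth checking once the pattern compiles. First, the pattern in the DDL looks for `correlationId`, while the sample logs and the regex shown earlier use `appId`. Second, as far as I understand, Hive's RegexSerDe applies the pattern to the whole line (Java's `Matcher.matches()`), and rows that do not match come back as NULLs, whereas a regex tester usually searches for a match anywhere in the line. The sketch below checks both points against the first sample line. It uses Python's `re` rather than java.util.regex, so it only illustrates the matching logic; the variable names are mine.

```python
import json
import re

# Rebuild the first sample line in the same compact JSON format.
inner = {
    "level": "info",
    "message": "request arrived - method: GET url: "
               "/.well-known/apollo/server-health hostname: 10.213.4.160",
    "appId": "TEST-b33a5c95-dba6-485f-9eba-53a67e131a47",
    "appName": None,
    "service": "my-service",
    "version": "1.0.0-20231204-1357",
    "context": "CorrelationMiddleware",
    "hostname": "my-service-69fd96486f-h525p",
    "time": 1710346762456,
}
outer = {
    "date": "2024-03-13T16:19:22.456430Z",
    "stream": "stdout",
    "log": json.dumps(inner, separators=(",", ":")),
    "pod_name": "my-service-69fd96486f-h525p",
    "namespace_name": "test-namespace",
    "container_name": "my-service",
    "cluster_name": "aws-account-dev-eks-cluster-multiaz-01",
}
line = json.dumps(outer, separators=(",", ":"))

# The regex as tested in the regex tester (the appId variant).
TESTED = (
    r'"date":"([^"]+)".*"stream":"([^"]+)".*"log":"\{'
    r'\\"level\\":\\"([^\\"]+)\\",\\"message\\":\\"([^\\"]+)\\",'
    r'\\"appId\\":\\"([^\\"]+)\\",\\"appName\\":([^,\\"]*),'
    r'\\"service\\":\\"([^\\"]+)\\",\\"version\\":\\"([^\\"]+)\\",'
    r'\\"context\\":\\"([^\\"]+)\\",\\"hostname\\":\\"([^\\"]+)\\",'
    r'\\"time\\":(\d+)}"'
    r'.*"pod_name":"([^"]+)".*"namespace_name":"([^"]+)"'
    r'.*"container_name":"([^"]+)".*"cluster_name":"([^"]+)"'
)

pattern = re.compile(TESTED)
m_search = pattern.search(line)      # regex-tester style: match anywhere
m_full = pattern.fullmatch(line)     # whole-line style: must cover the line

# Variant that also consumes the outer braces of the JSON object.
anchored = re.compile(r"\{" + TESTED + r"\}")

# Variant used in the DDL, which expects correlationId instead of appId.
ddl_variant = re.compile(TESTED.replace("appId", "correlationId"))

print(pattern.groups)                        # number of capture groups
print(m_search is not None, m_full is None)
print(anchored.fullmatch(line) is not None)
print(ddl_variant.search(line) is None)
```

The number of capture groups should equal the number of table columns (15 here). If the whole-line assumption holds, the pattern in `input.regex` would also need to cover the leading `{` and trailing `}` of each line, for example by adding `.*` at both ends.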
Madara
Asked 2 months ago, 530 views
No answers