Why are aggregate results in a Log Insights query nonsensical (count < count_distinct for the same variable)?

The following CloudWatch Logs Insights query on a single log group returns negative numbers for the variable @distinct_unique_keys_delta:

parse @message /(?<@unique_key>Processing key: \w+\/[\w=_-]+\/\w+\.\d{4}-\d{2}-\d{2}-\d{2}\.[\w-]+\.\w+\.\w+)/
| filter @message like /Processing key: \w+\/[\w=_-]+\/\w+\.\d{4}-\d{2}-\d{2}-\d{2}\.[\w-]+\.\w+\.\w+/
| stats count(@unique_key) - count_distinct(@unique_key) as @distinct_unique_keys_delta
        by datefloor(@timestamp, 1d) as @_datefloor 
| sort @_datefloor asc

My understanding is that the number of unique values of a variable can never be more than the total number of values of a variable. When I ran this query I was concerned that I might be misunderstanding the correct usage of datefloor, so I tried this query:

parse @message /(?<@unique_key>Processing key: \w+\/[\w=_-]+\/\w+\.\d{4}-\d{2}-\d{2}-\d{2}\.[\w-]+\.\w+\.\w+)/
| filter @message like /Processing key: \w+\/[\w=_-]+\/\w+\.\d{4}-\d{2}-\d{2}-\d{2}\.[\w-]+\.\w+\.\w+/
| stats count(@unique_key) - count_distinct(@unique_key) as @distinct_unique_keys_delta

For the time range I chose (a whole day), this query returned -20,347 for the @distinct_unique_keys_delta variable.

To me this result seems completely nonsensical. Am I doing something wrong or misinterpreting the results, or is there a bug in the code running this Logs Insights query?

1 Answer

I have discovered that the count_distinct function in AWS Logs Insights queries doesn't always return an exact distinct count! As the documentation says:

Returns the number of unique values for the field. If the field has very high cardinality (contains many unique values), the value returned by count_distinct is just an approximation.

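So for a high-cardinality field like @unique_key, the approximate count_distinct can come out larger than the exact count, which is how the delta ends up negative. Apparently I can't just assume that a function returns an accurate result.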

The documentation page.
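
To illustrate how an approximate distinct count can land above or below the true value, here is a minimal Python sketch of a generic k-minimum-values (KMV) estimator. This is an assumption for illustration only: the documentation does not say which algorithm count_distinct uses, and the names approx_count_distinct, _hash01 and the key-N sample values are made up for this example.

```python
import hashlib
import heapq


def _hash01(value: str) -> float:
    """Map a string to a pseudo-random float in [0, 1] using SHA-256."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def approx_count_distinct(values, k=1024):
    """Estimate the number of distinct values with a KMV sketch.

    Illustrative only; NOT the algorithm CloudWatch is known to use.
    Keeps the k smallest distinct hashes. With fewer than k distinct
    hashes the answer is exact; otherwise it is (k - 1) / kth_smallest.
    """
    heap = []        # max-heap (negated hashes) holding the k smallest hashes
    members = set()  # hashes currently in the heap, for duplicate checks
    for value in values:
        h = _hash01(value)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # low cardinality: exact
    return round((k - 1) / -heap[0])


# Hypothetical data: 50,000 unique keys plus 1,000 repeats.
keys = [f"key-{i}" for i in range(50_000)]
events = keys + keys[:1_000]

total = len(events)                      # like count(@unique_key)
exact = len(set(events))                 # true distinct count
approx = approx_count_distinct(events)   # like an approximate count_distinct

print("count:", total)
print("exact distinct:", exact)
print("approx distinct:", approx)
print("count - approx:", total - approx)
```

With an estimator like this, the error is a few percent of the true cardinality, so whenever most values are unique (count only slightly above the true distinct count), an overestimate is enough to make count minus the approximate distinct count negative.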

answered 2 years ago
