Skip to content

How do I analyze complex JSON records in Athena?

3 minute read
Content level: Intermediate
0

Amazon Athena can efficiently analyze complex JSON data with nested structures, which is common across many AWS services. This guide demonstrates how to parse and analyze complex JSON records using Amazon Inspector findings as a practical example.

Resolution

We'll implement a three-step approach to analyze Amazon Inspector findings:

  1. Export Inspector findings to Amazon S3
  2. Create a base table for raw findings
  3. Create views for parsed data analysis

Step 1: Export JSON Records to S3

  1. Export your JSON records from your source (e.g., Amazon Inspector console: "Findings" -> "All findings")
  2. Choose "Export findings" and select JSON format
  3. Specify your S3 bucket destination and encryption settings

Step 2: Create Raw JSON Table

Execute the following query in Athena to create a base table for the raw JSON records. Replace <path-to-records> with your S3 path:

CREATE EXTERNAL TABLE raw_json_records (
 finding string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
 'input.regex' = '(.*)',
 'input.regex.case.insensitive' = 'true'
)
STORED AS TEXTFILE
LOCATION 's3://<path-to-records>/';

Step 3: Create Parsed JSON View

Create a view that parses the JSON structure:

CREATE OR REPLACE VIEW parsed_json_records AS
WITH raw_json_data AS (
SELECT 
    -- Remove leading comma if present
    CASE 
      WHEN finding LIKE ',{%' THEN SUBSTR(finding, 2)
      ELSE finding
    END AS finding
  FROM raw_json_records
  )
SELECT 
    JSON_EXTRACT_SCALAR(finding, '$.awsAccountId') as aws_account_id,
    JSON_EXTRACT_SCALAR(finding, '$.description') as description,
    JSON_EXTRACT_SCALAR(finding, '$.findingArn') as finding_arn,
    JSON_EXTRACT_SCALAR(finding, '$.firstObservedAt') as first_observed_at,
    JSON_EXTRACT_SCALAR(finding, '$.lastObservedAt') as last_observed_at,
    JSON_FORMAT(JSON_EXTRACT(finding, '$.remediation')) as remediation,
    JSON_FORMAT(JSON_EXTRACT(finding, '$.resources')) as resources,
    JSON_EXTRACT_SCALAR(finding, '$.severity') as severity,
    JSON_EXTRACT_SCALAR(finding, '$.status') as status,
    JSON_EXTRACT_SCALAR(finding, '$.type') as type,
    JSON_FORMAT(JSON_EXTRACT(finding, '$.codeVulnerabilityDetails')) as code_vulnerability_details,
    CAST(JSON_EXTRACT_SCALAR(finding, '$.epss.score') as DOUBLE) as epss,
    JSON_FORMAT(JSON_EXTRACT(finding, '$.exploitabilityDetails')) as exploitability_details,
    JSON_EXTRACT_SCALAR(finding, '$.exploitAvailable') as exploit_available,
    JSON_EXTRACT_SCALAR(finding, '$.fixAvailable') as fix_available,
    CAST(JSON_EXTRACT_SCALAR(finding, '$.inspectorScore') as DOUBLE) as inspector_score,
    JSON_FORMAT(JSON_EXTRACT(finding, '$.inspectorScoreDetails')) as inspector_score_details,
    JSON_FORMAT(JSON_EXTRACT(finding, '$.networkReachabilityDetails')) as network_reachability_details,
    JSON_FORMAT(JSON_EXTRACT(finding, '$.packageVulnerabilityDetails')) as package_vulnerability_details,
    JSON_EXTRACT_SCALAR(finding, '$.title') as title,
    JSON_EXTRACT_SCALAR(finding, '$.updatedAt') as updated_at
FROM 
 raw_json_data
where JSON_EXTRACT_SCALAR(finding, '$.awsAccountId') is not null

Optional: Create Domain-Specific Views

The nested JSON fields such as resources have different structures based on the service. You may create additional views for service specific analysis. For EC2-specific findings:

CREATE OR REPLACE VIEW ec2_findings AS
WITH dataset AS (
    SELECT
        CAST(
            json_parse(resources) AS 
              ARRAY(
                ROW(
                    id varchar, 
                    type varchar,
                    details json,
                    partition varchar,
                    region varchar,
                    tags json
                )
              )
        ) AS items,
        *
    FROM parsed_json_records 
)
SELECT
    -- Add fields from nested json resources field as needed
    JSON_EXTRACT_SCALAR(JSON_OBJ.details, '$.awsEc2Instance.imageId') AS ec2_image_id,
    JSON_OBJ.id AS resource_id, 
    JSON_OBJ.type AS resource_type, 
    JSON_OBJ.region AS resource_region, 
    -- Other fields from parsed_findings view
    aws_account_id,
    title,
    description,
    severity,
    status,
    updated_at
FROM dataset 
     CROSS JOIN UNNEST(items) t(JSON_OBJ)
WHERE JSON_OBJ.type = 'AWS_EC2_INSTANCE'

Common Queries

Here are some useful queries for analysis:

-- Get high-severity findings
SELECT * FROM parsed_json_records 
WHERE severity = 'HIGH' 
ORDER BY updated_at DESC;

-- Count findings by severity
SELECT severity, COUNT(*) as count 
FROM parsed_json_records 
GROUP BY severity;

-- Get findings with available exploits
SELECT title, severity, updated_at 
FROM parsed_json_records 
WHERE exploit_available = 'true';

Additional Information

  • This approach works with any complex JSON structure, not just Inspector findings
  • Filter for specific resource types or nested field values as needed
  • Extend views by modifying SELECT statements to extract additional nested fields
  • Use parsed views directly in Amazon QuickSight for visualization
  • Handles common JSON parsing issues like leading commas and nested arrays

Related Information