How do I use Amazon Athena to analyze VPC flow logs?


I want to use Amazon Athena to analyze my Amazon Virtual Private Cloud (Amazon VPC) flow logs.

Short description

Use Amazon VPC Flow Logs to analyze network traffic patterns and identify threats and risks across your Amazon VPC network.

Resolution

Use Athena to analyze Amazon VPC flow logs

To use Athena to analyze the Amazon VPC flow logs, complete the following steps:

  1. In the Amazon Athena console query editor, run the following command to create a database. Replace test_db_vpclogs with a name for your database:

    CREATE DATABASE test_db_vpclogs;
    

    Important: It's a best practice to create the database in the same AWS Region as the Amazon Simple Storage Service (Amazon S3) bucket that the flow logs are stored in.

  2. In the database, run the following command to create a table for the VPC flow logs. Replace test_table_vpclogs with the name of your table. Also, modify the LOCATION parameter to point to the Amazon S3 bucket that contains your log data:

    CREATE EXTERNAL TABLE `test_table_vpclogs`(
      `version` int, 
      `account_id` string, 
      `interface_id` string, 
      `srcaddr` string, 
      `dstaddr` string, 
      `srcport` int, 
      `dstport` int, 
      `protocol` bigint, 
      `packets` bigint, 
      `bytes` bigint, 
      `start` bigint, 
      `end` bigint, 
      `action` string, 
      `log_status` string, 
      `vpc_id` string, 
      `subnet_id` string, 
      `instance_id` string, 
      `tcp_flags` int, 
      `type` string, 
      `pkt_srcaddr` string, 
      `pkt_dstaddr` string, 
      `az_id` string, 
      `sublocation_type` string, 
      `sublocation_id` string, 
      `pkt_src_aws_service` string, 
      `pkt_dst_aws_service` string, 
      `flow_direction` string, 
      `traffic_path` int)
    PARTITIONED BY ( 
      `region` string, 
      `day` string)
    ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY ' ' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      's3://awsexamplebucket/awsexampleprefix/awsexamplelogs/1111222233334444/vpcflowlogs/'
    TBLPROPERTIES (
      'projection.day.format'='yyyy/MM/dd', 
      'projection.day.range'='2021/01/01,NOW', 
      'projection.day.type'='date', 
      'projection.enabled'='true', 
      'projection.region.type'='enum', 
      'projection.region.values'='us-east-1,us-west-2,ap-south-1,eu-west-1', 
      'skip.header.line.count'='1', 
      'storage.location.template'='s3://awsexamplebucket/awsexampleprefix/awsexamplelogs/1111222233334444/vpcflowlogs/${region}/${day}'
      )

    Note: The preceding command uses partition projection to create the table, partition it, and automatically populate the partitions. If a projected partition doesn't exist in Amazon S3, then Athena still projects the partition, and the query returns no results for it. It's a best practice to filter on the partitioned attributes in your queries, as shown in the example after these steps.

  3. Use the Amazon Athena console query editor to run SQL statements on the table. You can save the queries, view previous queries, or download query results in CSV format.
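
Because the table uses partition projection, queries that filter on the region and day partition columns scan only the matching Amazon S3 prefixes. The following is a minimal sketch that assumes the table definition from step 2:

SELECT srcaddr,
       dstaddr,
       bytes
FROM test_table_vpclogs
WHERE region = 'us-east-1'
  AND day = '2021/02/15'
LIMIT 10;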

Example queries

Note: Replace test_table_vpclogs with the name of your table. Modify the column values and other variables to fit your query.

To view the first 100 flow log entries in chronological order for a specified period of time, run a query similar to the following:

SELECT * 
FROM test_table_vpclogs 
WHERE day >= '2021/02/01' AND day < '2021/02/28' 
ORDER BY day ASC 
LIMIT 100;

To view the ten servers that receive the highest number of HTTPS packets during a specified period of time, run a query similar to the following:

SELECT SUM(packets) AS packetcount, 
       dstaddr
FROM test_table_vpclogs
WHERE dstport = 443
  AND day >= '2021/03/01'
  AND day < '2021/03/31'
GROUP BY dstaddr
ORDER BY packetcount DESC
LIMIT 10;

The preceding query counts the number of packets that are received on HTTPS port 443, groups them by destination IP address, and returns the top 10 entries for the specified time range.
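
To rank destinations by data volume instead of packet count, you can sum the bytes column instead. The following sketch makes the same assumptions about the table name and time range:

SELECT SUM(bytes) AS bytecount, 
       dstaddr
FROM test_table_vpclogs
WHERE dstport = 443
  AND day >= '2021/03/01'
  AND day < '2021/03/31'
GROUP BY dstaddr
ORDER BY bytecount DESC
LIMIT 10;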

To view the logs that were created within a specified time range, run a query similar to the following:

SELECT interface_id,
       srcaddr,
       action,
       protocol,
       to_iso8601(from_unixtime(start)) AS start_time,
       to_iso8601(from_unixtime("end")) AS end_time
FROM test_table_vpclogs
WHERE day >= '2021/04/01'
  AND day < '2021/04/30';
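
Because start and end are stored as Unix seconds, you can also narrow a query to part of a day. The following is a sketch that assumes a one-hour window on 2021-04-15 and uses the built-in Athena (Presto) functions parse_datetime and to_unixtime:

SELECT *
FROM test_table_vpclogs
WHERE day = '2021/04/15'
  AND start >= to_unixtime(parse_datetime('2021-04-15 09:00', 'yyyy-MM-dd HH:mm'))
  AND start < to_unixtime(parse_datetime('2021-04-15 10:00', 'yyyy-MM-dd HH:mm'));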

To view flow logs for a specific source IP address within a specified time range, run a query similar to the following:

SELECT *
FROM test_table_vpclogs
WHERE srcaddr = '10.117.1.22'
  AND day >= '2021/02/01'
  AND day < '2021/02/28';

To list the rejected TCP connections within a specified time range, run a query similar to the following:

SELECT day,
       interface_id,
       srcaddr,
       action,
       protocol
FROM test_table_vpclogs
WHERE action = 'REJECT'
  AND protocol = 6
  AND day >= '2021/02/01' AND day < '2021/02/28'
LIMIT 10;
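
To identify the source addresses with the most rejected TCP connections, aggregate the same filter by srcaddr. The following is a sketch under the same table and date assumptions:

SELECT srcaddr,
       COUNT(*) AS reject_count
FROM test_table_vpclogs
WHERE action = 'REJECT'
  AND protocol = 6
  AND day >= '2021/02/01' AND day < '2021/02/28'
GROUP BY srcaddr
ORDER BY reject_count DESC
LIMIT 10;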

To view the flow logs for source IP addresses that start with '10.117', run a query similar to the following:

SELECT *
FROM test_table_vpclogs
WHERE split_part(srcaddr, '.', 1) = '10'
  AND split_part(srcaddr, '.', 2) = '117';
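
As an alternative to split_part, a LIKE predicate matches the same prefix in a single expression. This sketch assumes that srcaddr is stored in dotted-decimal form, as in the flow log format:

SELECT *
FROM test_table_vpclogs
WHERE srcaddr LIKE '10.117.%';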

To view the flow logs for a specific destination IP address within a specified time range, run a query similar to the following:

SELECT *
FROM test_table_vpclogs
WHERE dstaddr = '10.0.1.14'
  AND day >= '2021/01/01'
  AND day < '2021/01/31';

Related information

How do I monitor traffic in my VPC with flow logs?

Analyzing VPC Flow Logs using Amazon Athena, and Amazon QuickSight
