Skip to content

How do I use Athena to analyze Amazon VPC flow logs?

5 minute read
0

I want to use Amazon Athena to analyze my Amazon Virtual Private Cloud (Amazon VPC) flow logs.

Resolution

You can use the Athena console query editor to create a database, create a table for the VPC flow logs, and run example queries. Then, use the flow logs to analyze network traffic patterns and to identify threats and risks across your Amazon VPC network.

Create a database with Athena console query editor

Open the Athena console query editor, and then run the following command:

CREATE DATABASE test_db_vpclogs;

Note: Replace test_db_vpclogs with the name that you want to use for your database.

Important: It's a best practice to create the database in the same AWS Region as the Amazon Simple Storage Service (Amazon S3) bucket that the flow logs are stored in.

Create a table for the flow logs in the database

Open the Athena console. Then, in the navigation pane, choose Query editor. In the Athena console query editor, run a command similar to the following example:

CREATE EXTERNAL TABLE IF NOT EXISTS test_table_vpclogs (
  version int,
  account_id string,
  interface_id string,
  srcaddr string,
  dstaddr string,
  srcport int,
  dstport int,
  protocol bigint,
  packets bigint,
  bytes bigint,
  start bigint,
  `end` bigint,
  action string,
  log_status string,
  vpc_id string,
  subnet_id string,
  instance_id string,
  tcp_flags int,
  type string,
  pkt_srcaddr string,
  pkt_dstaddr string,
  az_id string,
  sublocation_type string,
  sublocation_id string,
  pkt_src_aws_service string,
  pkt_dst_aws_service string,
  flow_direction string,
  traffic_path int
)
PARTITIONED BY (region string, date string, hour string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://amzn-s3-demo-bucket/prefix/AWSLogs/{account_id}/vpcflowlogs/'
TBLPROPERTIES (
"EXTERNAL"="true",
"skip.header.line.count" = "1",
"projection.enabled" = "true",
"projection.region.type" = "enum",
"projection.region.values" = "us-east-1,us-west-2,ap-south-1,eu-west-1",
"projection.date.type" = "date",
"projection.date.range" = "2021/01/01,NOW",
"projection.date.format" = "yyyy/MM/dd",
"projection.hour.type" = "integer",
"projection.hour.range" = "00,23",
"projection.hour.digits" = "2",
"storage.location.template" = "s3://amzn-s3-demo-bucket/prefix/AWSLogs/${account_id}/vpcflowlogs/${region}/${date}/${hour}"
)

Note: Replace test_table_vpclogs with the name that you want for your table.

Make sure that you modify the LOCATION parameter to point to the Amazon S3 bucket that contains your log data. Update projection.region.values to the AWS Regions that you have VPC flow logs. Set the value of projection.date.range to the earliest available logs in your data. For example, if your logs start on January 1, 2021, then set projection.date.range to '2021/01/01,NOW'. Modify the value according to the actual start date of your logs for accurate data projection. The preceding command uses partition projection to create a table, partition the table, and automatically populate the partitions. If a projected partition doesn't exist in Amazon S3, then Athena still projects the partition. It's a best practice to use partitioned attributes in your queries.

Run SQL statements on the table for the flow logs

Open the Athena console. Then, in the navigation pane, choose Query editor. Use the Athena console query editor to run SQL statements on the table. You can save the queries, view previous queries, or download query results in .csv format.

Example queries

Note: In the following example queries, replace test_table_vpclogs with the name of your table. Modify the column values and other variables for your use case.

To view the first 100 flow log entries in chronological order for a specified period of time, run the following query:

SELECT *  FROM test_table_vpclogs    
 WHERE day >= '2021/02/01' AND day < '2021/02/28'   
 ORDER BY day ASC   
 LIMIT 100;

To view the 10 destination servers that receive the most HTTPS packets for a specified period of time, run the following query:

SELECT SUM(packets) AS packetcount, 
        dstaddr  
FROM test_table_vpclogs  
WHERE dstport = 443  
  AND day >= '2021/03/01'  
  AND day < '2021/03/31'  
GROUP BY dstaddr  
ORDER BY packetcount DESC  
LIMIT 10;

To check the logs that are created for a specified time range, run the following query:

SELECT interface_id,
       srcaddr,  
       action,  
       protocol,  
       to_iso8601(from_unixtime(start)) AS start_time,  
       to_iso8601(from_unixtime("end")) AS end_time  
FROM test_table_vpclogs  
WHERE DAY >= '2021/04/01'  
  AND DAY < '2021/04/30';

To view flow logs for a specific source IP address between a specified time range, run the following query:

SELECT * FROM test_table_vpclogs  
WHERE srcaddr = '10.117.1.22'  
  AND day >= '2021/02/01'  
  AND day < '2021/02/28';

To list the rejected TCP connections between a specified time range, run the following query:

SELECT day,
interface_id,  
       srcaddr,  
       action,  
       protocol  
FROM test_table_vpclogs  
WHERE action = 'REJECT'   
    AND protocol = 6   
    AND day >= '2021/02/01' AND day < '2021/02/28'  
LIMIT 10;

To view the flow logs for the IP address range that starts with 10.117, run the following query:

SELECT * FROM test_table_vpclogs  
WHERE split_part(srcaddr,'.', 1)='10'  
  AND split_part(srcaddr,'.', 2) ='117'

To view the flow logs for a specific destination IP address between a time range, run the following query:

SELECT * FROM test_table_vpclogs  
WHERE dstaddr = '10.0.1.14'  
AND day >= '2021/01/01'  
AND day < '2021/01/31'

Related information

How do I monitor traffic in my VPC with flow logs?

Analyzing VPC Flow Logs using Amazon Athena and Amazon QuickSight

AWS OFFICIALUpdated 4 months ago