Parse and query CloudTrail logs with AWS Glue, Amazon Redshift Spectrum and Athena

Jo Evens — Mon, 05 Nov 2018 10:22:41 +0000

Building on the Analyze Security, Compliance, and Operational Activity Using AWS CloudTrail and Amazon Athena blog post on the AWS Big Data blog, this post will demonstrate how to convert CloudTrail log files into parquet format and query those optimized log files with Amazon Redshift Spectrum and Athena.

The people over at awslabs did a great job in providing scripts that allow the conversion through AWS Glue ETL jobs. I’ll be using their scripts throughout this post.

Depending on your use case, either Redshift Spectrum or Athena will come up as the best fit:
If you want ad-hoq, multi-partitioning and complex data types go with Athena.
If on the other hand you want to integrate wit existing redshift tables, do lots of joins or aggregates go with Redshift Spectrum.

Setting things up

Users, roles and policies

For the purpose of this demo, I’ve created a demo-user with following policies attached:

AmazonAthenaFullAccess
AmazonRedshiftFullAccess
An inline policy allowing read-only access to the CloudTrail logs on S3 and the scripts bucket.
An inline policy allowing read-write access to the S3 bucket containing the Glue ETL scripts

The Glue service role contains:

The managed AWSGlueServiceRole
An inline policy giving read-write access to the CloudTrail logs on S3

The Redshift service role contains:

The managed AWSGlueConsoleFullAccess role
An inline policy giving read access to the cloudtrail logs on S3

In order to use Athena and Redshift from SQL editors, please add port 443 and 5439 to your VPC’s default security group.

Awslabs scripts

The readme.md on the github project page explains how to build and deploy the scripts. In this case, I’ve uploaded the scripts to another bucket, not the bucket containing the CloudTrail logs.

Glue

Once the scripts are in place, create the Glue ETL job using the AWS CLI:

aws glue create-job --name CloudTrailLogConvertor \
--description Convert and partition CloudTrail logs \
--role AWSGlueServiceRole-CrawlerS3 \	
--command Name=glueetl,ScriptLocation=s3:///sample_cloudtrail_job.py \	 	 
--default-arguments '	 	 
"--extra-py-files":"s3:///athena_glue_converter_.zip",
"--job-bookmark-option":"job-bookmark-enable" 
"--raw_database_name":"cloudtrail_logs", 
"--raw_table_name":"cloudtrail_raw", 
"--converted_database_name":"cloudtrail_logs", 	 
"--converted_table_name":"cloudtrail_optimized",	 	 
"--TempDir":"s3:///tmp", 	 
"--s3_converted_target":"s3:///converted/cloudtrail",	 	 
"--s3_source_location":"s3:////cloudtrail/"	 	 
'

Now, to actually start the job, you can select it in the AWS Glue console, under ETL – Jobs, and click Action – Run Job, or through the CLI:

aws glue start-job-run --job-name CloudtrailLogConvertor

You can follow up on progress by using:

aws glue get-job-runs --job-name CloudtrailLogConvertor

Until the JobRunState is Succeeded:

 "JobRuns": [	 	 
 {	 	 
 "Id": "jr_1cc3f9b8cf88a5abddd4f6957ec53ddfb70839773cc39292d5f8707ca19c7b6c",	
 "Attempt": 0,	 	 
 "JobName": "CloudtrailLogConvertor",	
 "StartedOn": 1541160714.424,	 	 
 "LastModifiedOn": 1541161359.587,	 	 
 "CompletedOn": 1541161359.587,	 	 
 "JobRunState": "SUCCEEDED",	 	 
 "PredecessorRuns": [],	 	 
 "AllocatedCapacity": 10	 	 
 }	 	 
 ]	 	 
}

Athena

Launch your favorite SQL editor (Additional drivers) , or open Athena in the AWS console.
Let’s see what our table looks like:

You’ll notice 4 columns starting with json_. These contain some more nested JSON data.
For example:

You can use the JSON extract functionality in Athena to dive in deeper:

Something more useful to interpret:

Redshift Spectrum

Now that we have our tables and database in the Glue catalog, querying with Redshift Spectrum is easy.
First make sure you have a Redshift cluster running, then create the external schema:

create external schema cloudtrail_logs
from data catalog
database 'cloudtrail_logs'
iam_role 'arn:aws:iam:::role/demo-redshift';

Our tables are detected automatically (Thank you Glue).

Creating a session with psql:

:~$ psql -h demo-cluster.cpbkebqgnfoo.eu-west-1.redshift.amazonaws.com -p 5439 -U demouser -d dev
Password for user demouser: 
psql (10.5 (Ubuntu 10.5-0ubuntu0.18.04), server 8.0.2)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

dev=# SELECT COUNT(*) FROM cloudtrail_logs.cloudtrail_optimized;
 count  
--------
 119182
(1 row)

dev=#

And a query making a bit more sense than COUNT(*) :

dev=# SELECT COUNT (*) AS TotalEvents, json_extract_path_text(json_useridentity,'type') AS usertype, eventname
dev-# FROM cloudtrail_logs.cloudtrail_optimized
dev-# WHERE eventtime >= '2017-01-01T00:00:00Z' 
dev-# AND usertype = 'Root'
dev-# GROUP BY eventname,json_useridentity
dev-# ORDER BY TotalEvents DESC
dev-# LIMIT 10;
 totalevents | usertype |         eventname         
-------------+----------+---------------------------
         588 | Root     | DescribeLoadBalancers
         392 | Root     | ListBuckets
         392 | Root     | GetBucketLocation
         336 | Root     | DescribeDBInstances
         168 | Root     | DescribeAutoScalingGroups
          90 | Root     | DescribeAccountLimits
          84 | Root     | GetTrailStatus
          84 | Root     | DescribeTrails
          84 | Root     | DescribeAccountAttributes
          84 | Root     | DescribeDBSecurityGroups
(10 rows)

The post Parse and query CloudTrail logs with AWS Glue, Amazon Redshift Spectrum and Athena appeared first on Cloudar.