Building on the Analyze Security, Compliance, and Operational Activity Using AWS CloudTrail and Amazon Athena post on the AWS Big Data blog, this post demonstrates how to convert CloudTrail log files into Parquet format and query those optimized log files with Amazon Redshift Spectrum and Athena.
The people over at awslabs did a great job in providing scripts that allow the conversion through AWS Glue ETL jobs. I’ll be using their scripts throughout this post.
Depending on your use case, either Redshift Spectrum or Athena will be the better fit:
- If you want ad hoc querying, multiple partition keys, and complex data types, go with Athena.
- If, on the other hand, you want to integrate with existing Redshift tables or do lots of joins and aggregates, go with Redshift Spectrum.
Setting things up
Users, roles and policies
For the purpose of this demo, I’ve created a demo-user with the following policies attached:
- AmazonAthenaFullAccess
- AmazonRedshiftFullAccess
- An inline policy allowing read-only access to the CloudTrail logs on S3 and the scripts bucket.
- An inline policy allowing read-write access to the S3 bucket containing the Glue ETL scripts
The Glue service role contains:
- The managed AWSGlueServiceRole policy
- An inline policy giving read-write access to the CloudTrail logs on S3
The Redshift service role contains:
- The managed AWSGlueConsoleFullAccess policy
- An inline policy giving read access to the CloudTrail logs on S3
In order to use Athena and Redshift from SQL editors, allow inbound traffic on ports 443 and 5439 in your VPC’s default security group.
Awslabs scripts
The README.md on the GitHub project page explains how to build and deploy the scripts. In this case, I’ve uploaded the scripts to a separate bucket, not the bucket containing the CloudTrail logs.
Glue
Once the scripts are in place, create the Glue ETL job using the AWS CLI:
aws glue create-job --name CloudtrailLogConvertor \
  --description "Convert and partition CloudTrail logs" \
  --role AWSGlueServiceRole-CrawlerS3 \
  --command Name=glueetl,ScriptLocation=s3://<scriptbucket>/sample_cloudtrail_job.py \
  --default-arguments '{
    "--extra-py-files":"s3://<scriptbucket>/athena_glue_converter_<latest>.zip",
    "--job-bookmark-option":"job-bookmark-enable",
    "--raw_database_name":"cloudtrail_logs",
    "--raw_table_name":"cloudtrail_raw",
    "--converted_database_name":"cloudtrail_logs",
    "--converted_table_name":"cloudtrail_optimized",
    "--TempDir":"s3://<scriptbucket>/tmp",
    "--s3_converted_target":"s3://<logbucket>/converted/cloudtrail",
    "--s3_source_location":"s3://<logbucket>/<account>/cloudtrail/"
  }'
Now, to actually start the job, you can select it in the AWS Glue console under ETL – Jobs and click Action – Run Job, or start it through the CLI:
aws glue start-job-run --job-name CloudtrailLogConvertor
You can follow up on progress by using:
aws glue get-job-runs --job-name CloudtrailLogConvertor
Until the JobRunState is SUCCEEDED:
"JobRuns": [ { "Id": "jr_1cc3f9b8cf88a5abddd4f6957ec53ddfb70839773cc39292d5f8707ca19c7b6c", "Attempt": 0, "JobName": "CloudtrailLogConvertor", "StartedOn": 1541160714.424, "LastModifiedOn": 1541161359.587, "CompletedOn": 1541161359.587, "JobRunState": "SUCCEEDED", "PredecessorRuns": [], "AllocatedCapacity": 10 } ] }
Athena
Launch your favorite SQL editor (you may need to install additional JDBC/ODBC drivers), or open Athena in the AWS console.
Let’s see what our table looks like:
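If you prefer SQL over the console’s table preview, something like the following should work (the database and table names are the ones passed to the Glue job earlier):

-- inspect the schema of the converted table
DESCRIBE cloudtrail_logs.cloudtrail_optimized;

-- and peek at a few converted records
SELECT *
FROM cloudtrail_logs.cloudtrail_optimized
LIMIT 5;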
You’ll notice 4 columns starting with json_. These contain some more nested JSON data.
For example:
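A minimal look at one of those columns, say json_useridentity (the same column the Redshift Spectrum query further down uses), just returns the raw JSON string:

-- the json_ columns hold the nested CloudTrail elements as JSON strings
SELECT json_useridentity
FROM cloudtrail_logs.cloudtrail_optimized
LIMIT 1;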
You can use the JSON extract functionality in Athena to dive in deeper:
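A minimal sketch using Presto’s json_extract_scalar with a JSONPath expression (the '$.type' and '$.arn' paths assume the standard CloudTrail userIdentity fields):

-- pull individual fields out of the nested userIdentity JSON
SELECT eventname,
       json_extract_scalar(json_useridentity, '$.type') AS usertype,
       json_extract_scalar(json_useridentity, '$.arn')  AS userarn
FROM cloudtrail_logs.cloudtrail_optimized
LIMIT 10;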
Something more useful to interpret:
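For instance, here is a sketch of a query that counts which API calls were made with root credentials (the Athena counterpart of the Redshift Spectrum query further down):

-- top 10 API calls made with root credentials
SELECT COUNT(*) AS totalevents,
       json_extract_scalar(json_useridentity, '$.type') AS usertype,
       eventname
FROM cloudtrail_logs.cloudtrail_optimized
WHERE json_extract_scalar(json_useridentity, '$.type') = 'Root'
GROUP BY eventname, json_extract_scalar(json_useridentity, '$.type')
ORDER BY totalevents DESC
LIMIT 10;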
Redshift Spectrum
Now that we have our tables and database in the Glue catalog, querying with Redshift Spectrum is easy.
First make sure you have a Redshift cluster running, then create the external schema:
create external schema cloudtrail_logs
from data catalog
database 'cloudtrail_logs'
iam_role 'arn:aws:iam::<accountnumber>:role/demo-redshift';
Our tables are detected automatically (Thank you Glue).
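If you want to verify what Spectrum picked up from the Glue catalog, the SVV_EXTERNAL_TABLES system view lists the external tables:

-- list the external tables Redshift Spectrum sees in the new schema
SELECT schemaname, tablename, location
FROM svv_external_tables
WHERE schemaname = 'cloudtrail_logs';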
Creating a session with psql:
:~$ psql -h demo-cluster.cpbkebqgnfoo.eu-west-1.redshift.amazonaws.com -p 5439 -U demouser -d dev
Password for user demouser:
psql (10.5 (Ubuntu 10.5-0ubuntu0.18.04), server 8.0.2)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

dev=# SELECT COUNT(*) FROM cloudtrail_logs.cloudtrail_optimized;
 count
--------
 119182
(1 row)

dev=#
And a query making a bit more sense than COUNT(*):
dev=# SELECT COUNT (*) AS TotalEvents, json_extract_path_text(json_useridentity,'type') AS usertype, eventname
dev-# FROM cloudtrail_logs.cloudtrail_optimized
dev-# WHERE eventtime >= '2017-01-01T00:00:00Z'
dev-# AND usertype = 'Root'
dev-# GROUP BY eventname,json_useridentity
dev-# ORDER BY TotalEvents DESC
dev-# LIMIT 10;
 totalevents | usertype | eventname
-------------+----------+---------------------------
         588 | Root     | DescribeLoadBalancers
         392 | Root     | ListBuckets
         392 | Root     | GetBucketLocation
         336 | Root     | DescribeDBInstances
         168 | Root     | DescribeAutoScalingGroups
          90 | Root     | DescribeAccountLimits
          84 | Root     | GetTrailStatus
          84 | Root     | DescribeTrails
          84 | Root     | DescribeAccountAttributes
          84 | Root     | DescribeDBSecurityGroups
(10 rows)