<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Athena Archives | Cloudar</title>
	<atom:link href="https://cloudar.be/tag/athena/feed/" rel="self" type="application/rss+xml" />
	<link>https://cloudar.be/tag/athena/</link>
	<description>100% Focus On AWS // 100% Customer Obsession</description>
	<lastBuildDate>Mon, 05 Nov 2018 10:22:41 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>Parse and query CloudTrail logs with AWS Glue,  Amazon Redshift Spectrum and Athena</title>
		<link>https://cloudar.be/awsblog/parse-and-query-cloudtrail-logs-with-aws-glue-amazon-redshift-spectrum-and-athena/</link>
					<comments>https://cloudar.be/awsblog/parse-and-query-cloudtrail-logs-with-aws-glue-amazon-redshift-spectrum-and-athena/#respond</comments>
		
		<dc:creator><![CDATA[Jo Evens]]></dc:creator>
		<pubDate>Mon, 05 Nov 2018 10:22:41 +0000</pubDate>
				<category><![CDATA[AWS Blog]]></category>
		<category><![CDATA[Athena]]></category>
		<category><![CDATA[AWS]]></category>
		<category><![CDATA[Redshift]]></category>
		<guid isPermaLink="false">https://cloudar.be/?p=8837</guid>

					<description><![CDATA[<p>The post <a href="https://cloudar.be/awsblog/parse-and-query-cloudtrail-logs-with-aws-glue-amazon-redshift-spectrum-and-athena/">Parse and query CloudTrail logs with AWS Glue,  Amazon Redshift Spectrum and Athena</a> appeared first on <a href="https://cloudar.be">Cloudar</a>.</p>
]]></description>
										<content:encoded><![CDATA[<div class="wpb-content-wrapper"><div id="ut-row-69bb7db80cbaf" data-vc-full-width="true" data-vc-full-width-init="false" class="vc_row wpb_row vc_row-fluid vc_column-gap-0 ut-row-69bb7db80cbc3" ><div class="wpb_column vc_column_container vc_col-sm-12" ><div id="ut_inner_column_69bb7db823ff1" class="vc_column-inner " ><div class="wpb_wrapper">
	<div class="wpb_text_column wpb_content_element" >
		<div class="wpb_wrapper">
			<p>Building on the <a href="https://aws.amazon.com/blogs/big-data/aws-cloudtrail-and-amazon-athena-dive-deep-to-analyze-security-compliance-and-operational-activity/">Analyze Security, Compliance, and Operational Activity Using AWS CloudTrail and Amazon Athena</a> blog post on the AWS Big Data blog, this post demonstrates how to convert CloudTrail log files into Parquet format and query those optimized log files with Amazon Redshift Spectrum and Athena.</p>
<p>The people over at <a href="https://github.com/awslabs/athena-glue-service-logs">awslabs</a> did a great job in providing scripts that allow the conversion through AWS Glue ETL jobs. I&#8217;ll be using their scripts throughout this post.</p>
<p>Depending on your use case, either Redshift Spectrum or Athena will be the better fit:<br />
If you want ad hoc queries, multi-partitioning and complex data types, go with Athena.<br />
If, on the other hand, you want to integrate with existing Redshift tables or do lots of joins or aggregates, go with Redshift Spectrum.</p>
<h1>Setting things up</h1>
<h2>Users, roles and policies</h2>
<p>For the purpose of this demo, I&#8217;ve created a demo-user with the following policies attached:</p>
<ol>
<li>AmazonAthenaFullAccess</li>
<li>AmazonRedshiftFullAccess</li>
<li>An inline policy allowing read-only access to the CloudTrail logs on S3 and the scripts bucket.</li>
<li>An inline policy allowing read-write access to the S3 bucket containing the Glue ETL scripts</li>
</ol>
<p>The Glue service role contains:</p>
<ol>
<li>The managed AWSGlueServiceRole policy</li>
<li>An inline policy giving read-write access to the CloudTrail logs on S3</li>
</ol>
<p>The Redshift service role contains:</p>
<ol>
<li>The managed AWSGlueConsoleFullAccess policy</li>
<li>An inline policy giving read access to the CloudTrail logs on S3</li>
</ol>
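<p>The inline policies themselves are not listed in this post. As a minimal sketch only, the Python snippet below prints what a read-only policy for the CloudTrail log bucket could look like. The bucket name <code>example-logbucket</code>, the helper name and the exact set of actions are assumptions for illustration, not the policy used in the demo.</p>

```python
import json


def read_only_s3_policy(bucket, prefix=""):
    """Build an inline IAM policy document granting read-only access to one bucket.

    Hypothetical helper: the bucket name and the action list are assumptions.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing keys is a bucket-level permission.
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading objects is an object-level permission.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


policy = read_only_s3_policy("example-logbucket")  # placeholder bucket name
print(json.dumps(policy, indent=2))
```

<p>The printed JSON can be pasted into the IAM console as an inline policy. A read-write variant would add write actions such as <code>s3:PutObject</code> to the second statement.</p>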
<p>In order to use Athena and Redshift from SQL editors, please add ports 443 and 5439 to your VPC&#8217;s default security group.</p>
<h2>Awslabs scripts</h2>
<p>The README.md on the GitHub project page explains how to build and deploy the scripts. In this case, I&#8217;ve uploaded the scripts to another bucket, not the bucket containing the CloudTrail logs.</p>
<h1>Glue</h1>
<p>Once the scripts are in place, create the Glue ETL job using the AWS CLI:</p>
<pre class="lang:default decode:true ">aws glue create-job --name CloudtrailLogConvertor \
--description "Convert and partition CloudTrail logs" \
--role AWSGlueServiceRole-CrawlerS3 \
--command Name=glueetl,ScriptLocation=s3://&lt;scriptbucket&gt;/sample_cloudtrail_job.py \
--default-arguments '{
"--extra-py-files":"s3://&lt;scriptbucket&gt;/athena_glue_converter_&lt;latest&gt;.zip",
"--job-bookmark-option":"job-bookmark-enable",
"--raw_database_name":"cloudtrail_logs",
"--raw_table_name":"cloudtrail_raw",
"--converted_database_name":"cloudtrail_logs",
"--converted_table_name":"cloudtrail_optimized",
"--TempDir":"s3://&lt;scriptbucket&gt;/tmp",
"--s3_converted_target":"s3://&lt;logbucket&gt;/converted/cloudtrail",
"--s3_source_location":"s3://&lt;logbucket&gt;/&lt;account&gt;/cloudtrail/"
}'</pre>
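<p>The <code>--default-arguments</code> value is a JSON map, and a missing comma or brace is easy to overlook when typing it by hand. One way to avoid that is to generate the string. The sketch below does so with the Python standard library; the function name and all bucket, account and file names passed in are placeholders, not values from the demo environment.</p>

```python
import json
import shlex


def glue_default_arguments(script_bucket, log_bucket, account, converter_zip):
    """Return the default-arguments map for the CloudTrail conversion job."""
    return {
        "--extra-py-files": f"s3://{script_bucket}/{converter_zip}",
        "--job-bookmark-option": "job-bookmark-enable",
        "--raw_database_name": "cloudtrail_logs",
        "--raw_table_name": "cloudtrail_raw",
        "--converted_database_name": "cloudtrail_logs",
        "--converted_table_name": "cloudtrail_optimized",
        "--TempDir": f"s3://{script_bucket}/tmp",
        "--s3_converted_target": f"s3://{log_bucket}/converted/cloudtrail",
        "--s3_source_location": f"s3://{log_bucket}/{account}/cloudtrail/",
    }


args = glue_default_arguments(
    script_bucket="example-scriptbucket",              # placeholder
    log_bucket="example-logbucket",                    # placeholder
    account="123456789012",                            # placeholder
    converter_zip="athena_glue_converter_latest.zip",  # placeholder file name
)

# json.dumps produces well-formed JSON; shlex.quote makes it safe to paste into a shell.
print("--default-arguments " + shlex.quote(json.dumps(args)))
```

<p>The printed line can be appended to the <code>aws glue create-job</code> call in place of a hand-written argument map.</p>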
<p>Now, to actually start the job, you can select it in the AWS Glue console, under ETL &#8211; Jobs, and click Action &#8211; Run Job, or through the CLI:</p>
<pre class="lang:default decode:true ">aws glue start-job-run --job-name CloudtrailLogConvertor</pre>
<p>You can follow the progress with:</p>
<pre class="lang:default decode:true ">aws glue get-job-runs --job-name CloudtrailLogConvertor</pre>
<p>Repeat until the JobRunState is SUCCEEDED:</p>
<pre class="lang:default decode:true ">{
 "JobRuns": [
  {
   "Id": "jr_1cc3f9b8cf88a5abddd4f6957ec53ddfb70839773cc39292d5f8707ca19c7b6c",
   "Attempt": 0,
   "JobName": "CloudtrailLogConvertor",
   "StartedOn": 1541160714.424,
   "LastModifiedOn": 1541161359.587,
   "CompletedOn": 1541161359.587,
   "JobRunState": "SUCCEEDED",
   "PredecessorRuns": [],
   "AllocatedCapacity": 10
  }
 ]
}</pre>
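<p>Rather than re-running <code>get-job-runs</code> by hand, the check can be scripted. The following is a minimal polling sketch, not part of the awslabs scripts. It takes any function that returns the current JobRunState, so it runs here against a canned sequence of states. The helper name, the poll interval and the list of terminal states are assumptions.</p>

```python
import time

# States after which the run will not change any more (assumed list of terminal states).
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def wait_for_job_run(get_state, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call get_state() repeatedly until it returns a terminal state, then return it."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError("job run did not reach a terminal state in time")


# Stand-in for a real lookup: a canned sequence of states, with sleeping disabled.
fake_states = iter(["STARTING", "RUNNING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_job_run(lambda: next(fake_states), sleep=lambda seconds: None)
print(final_state)  # prints SUCCEEDED
```

<p>In a real script, <code>get_state</code> could wrap a Glue API call, for example boto3&#8217;s <code>get_job_run</code> with the job name and run ID, assuming boto3 is installed and credentials are configured.</p>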
<p>&nbsp;</p>
<h1>Athena</h1>
<p>Launch your favorite SQL editor (you may need <a href="https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html">additional drivers</a>), or open Athena in the AWS console.<br />
Let&#8217;s see what our table looks like:</p>
<p><img fetchpriority="high" decoding="async" class="alignnone size-full wp-image-8961" src="https://cloudar.be/wp-content/uploads/2018/11/table.png" alt="" width="904" height="621" srcset="https://cloudar.be/wp-content/uploads/2018/11/table.png 904w, https://cloudar.be/wp-content/uploads/2018/11/table-768x528.png 768w" sizes="(max-width: 904px) 100vw, 904px" /></p>
<p>You&#8217;ll notice four columns starting with json_. These contain nested JSON data.<br />
For example:<br />
<img decoding="async" class="alignnone size-full wp-image-8962" src="https://cloudar.be/wp-content/uploads/2018/11/nestedjson.png" alt="" width="1054" height="154" srcset="https://cloudar.be/wp-content/uploads/2018/11/nestedjson.png 1054w, https://cloudar.be/wp-content/uploads/2018/11/nestedjson-768x112.png 768w" sizes="(max-width: 1054px) 100vw, 1054px" /></p>
<p>You can use the <a href="https://docs.aws.amazon.com/athena/latest/ug/extracting-data-from-JSON.html">JSON extract</a> functionality in Athena to dive in deeper:<img decoding="async" class="alignnone size-full wp-image-8963" src="https://cloudar.be/wp-content/uploads/2018/11/json_extract.png" alt="" width="1068" height="153" srcset="https://cloudar.be/wp-content/uploads/2018/11/json_extract.png 1068w, https://cloudar.be/wp-content/uploads/2018/11/json_extract-768x110.png 768w" sizes="(max-width: 1068px) 100vw, 1068px" /></p>
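<p>Since the queries above are only available as screenshots, here is a rough text version of what such a JSON extract query might look like. It is a sketch, not the exact query from the screenshot: it builds the SQL as a Python string using Athena&#8217;s <code>json_extract_scalar</code> function, and the helper name, the selected path <code>$.type</code> and the alias are assumptions.</p>

```python
def athena_json_field_query(table, json_column, json_path, alias, limit=10):
    """Build an Athena query that pulls one scalar value out of a JSON string column."""
    return (
        f"SELECT eventname, json_extract_scalar({json_column}, '{json_path}') AS {alias}\n"
        f"FROM {table}\n"
        f"LIMIT {limit}"
    )


query = athena_json_field_query(
    table="cloudtrail_logs.cloudtrail_optimized",
    json_column="json_useridentity",
    json_path="$.type",   # assumed path: the identity type, e.g. Root
    alias="usertype",
)
print(query)
```

<p>The resulting statement can be pasted into the Athena console or sent through a JDBC connection.</p>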
<p>A query with results that are more useful to interpret:<br />
<img loading="lazy" decoding="async" class="alignnone size-full wp-image-9009" src="https://cloudar.be/wp-content/uploads/2018/11/athena_query.png" alt="" width="852" height="391" srcset="https://cloudar.be/wp-content/uploads/2018/11/athena_query.png 852w, https://cloudar.be/wp-content/uploads/2018/11/athena_query-768x352.png 768w" sizes="auto, (max-width: 852px) 100vw, 852px" /></p>
<h1>Redshift Spectrum</h1>
<p>Now that we have our tables and database in the Glue catalog, querying with Redshift Spectrum is easy.<br />
First make sure you have a Redshift cluster running, then create the external schema:</p>
<pre class="lang:default decode:true ">create external schema cloudtrail_logs
from data catalog
database 'cloudtrail_logs'
iam_role 'arn:aws:iam::&lt;accountnumber&gt;:role/demo-redshift';</pre>
<p>Our tables are detected automatically (Thank you Glue).</p>
<p>Creating a session with psql:</p>
<pre class="lang:default decode:true">:~$ psql -h demo-cluster.cpbkebqgnfoo.eu-west-1.redshift.amazonaws.com -p 5439 -U demouser -d dev
Password for user demouser: 
psql (10.5 (Ubuntu 10.5-0ubuntu0.18.04), server 8.0.2)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

dev=# SELECT COUNT(*) FROM cloudtrail_logs.cloudtrail_optimized;
 count  
--------
 119182
(1 row)

dev=# 
</pre>
<p>And a query that makes a bit more sense than COUNT(*):</p>
<pre class="lang:default decode:true ">dev=# SELECT COUNT (*) AS TotalEvents, json_extract_path_text(json_useridentity,'type') AS usertype, eventname
dev-# FROM cloudtrail_logs.cloudtrail_optimized
dev-# WHERE eventtime &gt;= '2017-01-01T00:00:00Z' 
dev-# AND usertype = 'Root'
dev-# GROUP BY eventname,json_useridentity
dev-# ORDER BY TotalEvents DESC
dev-# LIMIT 10;
 totalevents | usertype |         eventname         
-------------+----------+---------------------------
         588 | Root     | DescribeLoadBalancers
         392 | Root     | ListBuckets
         392 | Root     | GetBucketLocation
         336 | Root     | DescribeDBInstances
         168 | Root     | DescribeAutoScalingGroups
          90 | Root     | DescribeAccountLimits
          84 | Root     | GetTrailStatus
          84 | Root     | DescribeTrails
          84 | Root     | DescribeAccountAttributes
          84 | Root     | DescribeDBSecurityGroups
(10 rows)
</pre>
<p>&nbsp;</p>

		</div>
	</div>
</div></div></div></div><div class="vc_row-full-width vc_clearfix"></div><div id="ut-row-69bb7db824b95" data-vc-full-width="true" data-vc-full-width-init="false" class="vc_row wpb_row vc_row-fluid vc_column-gap-0 ut-row-69bb7db824ba3" ><div class="wpb_column vc_column_container vc_col-sm-12" ><div id="ut_inner_column_69bb7db82517e" class="vc_column-inner " ><div class="wpb_wrapper"></div></div></div></div><div class="vc_row-full-width vc_clearfix"></div>
</div>
]]></content:encoded>
					
					<wfw:commentRss>https://cloudar.be/awsblog/parse-and-query-cloudtrail-logs-with-aws-glue-amazon-redshift-spectrum-and-athena/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
