Getting Usage Info From CloudFront Logs

Say you have a bunch of sites hosted on EC2 instances, and for various reasons you’ve set up CloudFront to help handle the traffic. Your normal webserver logs give you part of the picture, but you’ll probably want to dig into the CloudFront logs as well in order to get a better view of your actual throughput.

Thankfully, this is pretty easy. And somewhat annoyingly complicated.

I’ll be going over some general steps to get this done. I’ll only be focusing on the ‘download’ CloudFront distribution type, though the ‘streaming’ type isn’t too different. In fact, it really just comes down to the log file format.

Before I start, here’s where I got most of my info on the logs themselves, in case you want to go straight to the source:
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html

Some things to keep in mind about the logs are:

  • You’ll end up with a TON of log files. Each client hitting the CDN will basically generate its own log file, so each log file is pretty much a single transaction.
  • The logs aren’t delivered immediately. Or at least, you shouldn’t assume they will be. This means you’ll need to keep an eye out for logs that might show up several hours after the events that triggered them.
  • The logs might never show up. If you are using this for billing, just keep in mind that it is possible that this won’t be a complete accounting of all the requests.

So here’s a quick rundown on how to grab, parse, and utilize the logs:

The first thing you need is an S3 bucket. This will be where you store your CloudFront logs. I use one giant bucket for all my logging. You could just as easily set up multiple buckets so you have a 1:1 ratio of buckets to distributions, but that will probably drive you insane after a while. Instead, just use a prefix when setting up the distribution logging, and the logs will be split up and organized for you. I use the origin name as the prefix for my logs, but there isn’t any reason you can’t organize by client, host, or even domain name.
You should probably also consider adding some lifecycle rules so old logs get cleaned up automatically. If you absolutely need to hold onto the logs forever, you can archive them off somewhere instead.
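For example, here’s a minimal sketch of an expiration rule using boto3. The bucket name, prefix, and 10-day window are just placeholders; adjust them to your own retention needs:

    import boto3

    s3 = boto3.client("s3")

    # Expire CloudFront logs under this prefix after 10 days.
    # Bucket name and prefix are hypothetical -- use your own.
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-logging-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-cloudfront-logs",
                    "Filter": {"Prefix": "my-origin/"},
                    "Status": "Enabled",
                    "Expiration": {"Days": 10},
                }
            ]
        },
    )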

Next you’ll need to modify your CloudFront distribution to log to the bucket. This is in the ‘General’ section of the distribution config, and is also present during initial setup. Just switch ‘Logging’ to ‘On’, choose your bucket from the select box, and then give it your prefix.
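If you’d rather script this than click through the console, here’s a rough sketch with boto3. Note that update_distribution wants the complete config back along with the ETag from the read; the distribution ID, bucket, and prefix below are placeholders:

    import boto3

    cf = boto3.client("cloudfront")

    # Hypothetical distribution ID -- use your own.
    dist_id = "E1234EXAMPLE"

    # update_distribution requires the full config plus the current ETag.
    resp = cf.get_distribution_config(Id=dist_id)
    config = resp["DistributionConfig"]
    config["Logging"] = {
        "Enabled": True,
        "IncludeCookies": False,
        "Bucket": "my-logging-bucket.s3.amazonaws.com",
        "Prefix": "my-origin/",
    }

    cf.update_distribution(
        Id=dist_id,
        DistributionConfig=config,
        IfMatch=resp["ETag"],
    )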

So now you’re collecting logs. If you watch the bucket, you’ll see log files begin to pour in even with just moderate usage.

You have a few options on how to deal with the logs at this point. I’m going to keep it pretty simple.

To get the logs out, you can use a tool like s3cmd to sync them from the S3 bucket into a local directory, and then parse them from there. To keep bandwidth and compute costs down, you could run this every few hours rather than constantly; it just depends on your needs.
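If you’d prefer to stay in Python, a minimal stand-in for s3cmd sync might look like this. The bucket, prefix, and local directory are placeholders, and it just skips files it already has rather than doing a true checksum-based sync:

    import os
    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket, prefix, and local directory.
    bucket = "my-logging-bucket"
    prefix = "my-origin/"
    local_dir = "logs"

    os.makedirs(local_dir, exist_ok=True)

    # Page through the bucket and download anything we don't have yet.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            dest = os.path.join(local_dir, os.path.basename(obj["Key"]))
            if not os.path.exists(dest):
                s3.download_file(bucket, obj["Key"], dest)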

Once you have your local copies, you can loop through the files and parse them out. Handily, each file has a header line that describes the fields. I’m most interested in the fourth field, which is the bytes sent to the client, i.e. the bandwidth used. Other fields might be useful to you, especially if you use the same distribution for multiple sites or want to track the source IP addresses.
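Here’s a rough sketch of that loop. It assumes the standard ‘download’ log format: gzipped, tab-separated, ‘#’-prefixed header lines, with sc-bytes as the fourth field (the logs/ glob is a placeholder for wherever you synced the files):

    import glob
    import gzip

    total_bytes = 0

    # CloudFront 'download' logs are gzipped and tab-separated, with
    # '#'-prefixed header lines. The fourth field is sc-bytes.
    for path in glob.glob("logs/*.gz"):
        with gzip.open(path, "rt") as f:
            for line in f:
                if line.startswith("#"):
                    continue
                fields = line.rstrip("\n").split("\t")
                total_bytes += int(fields[3])

    print(f"Total bytes sent to clients: {total_bytes}")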

That’s pretty much what I do with the logs. I usually drop them from S3 after 10 days or so, and I keep my local copies for up to 3 days for reference. I have scripts that grab, parse, and push the data into a monitoring system for graphing and collating with the normal server logs, so I can get a bigger picture of the traffic each site is using. I find this especially helpful as I can see which sites are using the CDN efficiently, and which ones might need tweaking.
