IRC Stats Generation with AWS and Docker

Posted by ProgrammingAce on Thu 21 April 2016

This post assumes you have some experience working in AWS. We’ll go over all of the code needed to set this project up, but it’s expected you’re familiar with AWS terms and concepts.

I’ve been a fan of using internet chatrooms for most of my life, and my primary tool for that is Internet Relay Chat (IRC). I’ve been part of one community for more than a decade, and we have chat logs going back almost to the beginning. When you have that much history to parse, it’s pretty cool for the community to do some analytics on the data.

We’ve looked at a couple of stats generation tools over the years, but eventually settled on a PHP project called ‘Super Serious Stats’. The basics of this tool are simple:

  • Ingest all of the log files into a SQLite database.
  • Process the database, calculating totals and looking for frequently occurring data.

For a while, this app ran on a cron schedule on a VPS I was renting, but the ingestion and processing tasks only took about 15 seconds a day, so dedicating a server to it was a waste of resources. Thinking about how I would architect this process in the modern ‘DevOps’ style, I came up with this plan:

  • Have an IRC chatbot upload the logs to an S3 bucket.
  • Run a docker task to ingest the logs and run the processing component.
  • Export the resulting static html files to another S3 bucket with webhosting enabled.

The resources you’ll need to build this system:

  • Two S3 buckets, one for storing logs and one for hosting the resulting HTML.
  • A docker host (this could be done with ECS, but I’m running docker locally to keep costs down).
  • An IAM user with a policy granting write access to the logs S3 bucket.
  • An IAM user with policies granting read-only access to the logs bucket and write access to the HTML bucket.

Setting up the S3 buckets:

You’ll need to create two S3 buckets, one to hold the logs and one to host the resulting HTML files. We’ll name them as such in the code below:

  • example-logs
  • example-web
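
You can create both buckets from the console, or with the AWS CLI; a quick sketch (the region here is just an example, use whichever you prefer):

aws s3 mb s3://example-logs --region us-east-1
aws s3 mb s3://example-web --region us-east-1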

For these buckets, you’ll need to create three IAM policies to set up the permissions. These will later be attached to the users we’ll create:

We’ll name the first policy ‘logs-admin’, and it will grant full read/write access to the logs S3 bucket:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::example-logs"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::example-logs/*"
            ]
        }
    ]
}
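
If you prefer the CLI over the console, a policy like this can be created with something along these lines (assuming the JSON above is saved as logs-admin.json); the same pattern applies to the two policies below:

aws iam create-policy --policy-name logs-admin --policy-document file://logs-admin.json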

We’ll create a second IAM policy named ‘logs-readonly’ that will grant read-only access to the logs bucket:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::example-logs",
                "arn:aws:s3:::example-logs/*"
            ]
        }
    ]
}

We’ll create a third IAM policy named ‘web-admin’ that will give read/write permissions over the HTML bucket:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::example-web"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::example-web/*"
            ]
        }
    ]
}

Finally, we’ll need to go to the properties of the ‘example-web’ bucket, enable static website hosting, and add a bucket policy that makes the files publicly readable:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicReadGetObject",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-web/*"
        }
    ]
}
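
Both of those steps can also be done from the CLI; a rough sketch, assuming the bucket policy above is saved as web-public.json:

aws s3 website s3://example-web/ --index-document index.html
aws s3api put-bucket-policy --bucket example-web --policy file://web-public.json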

Now we just need to create IAM users who can use these policies to manipulate the contents of the S3 buckets. Create a user named ‘stats-admin’ and attach the ‘logs-admin’ policy. Create a second user named ‘stats-processing’ and attach the ‘web-admin’ and ‘logs-readonly’ policies.
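
From the CLI, that looks roughly like this (replace ACCOUNT_ID with your own AWS account number; the access keys printed by create-access-key are what you’ll use on the chatbot host and in the docker container later):

aws iam create-user --user-name stats-admin
aws iam attach-user-policy --user-name stats-admin --policy-arn arn:aws:iam::ACCOUNT_ID:policy/logs-admin
aws iam create-access-key --user-name stats-admin

aws iam create-user --user-name stats-processing
aws iam attach-user-policy --user-name stats-processing --policy-arn arn:aws:iam::ACCOUNT_ID:policy/logs-readonly
aws iam attach-user-policy --user-name stats-processing --policy-arn arn:aws:iam::ACCOUNT_ID:policy/web-admin
aws iam create-access-key --user-name stats-processing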

Now your permissions are set up, and your users are restricted to only the actions required to do their jobs.

Setting up the docker task:

Now we’re going to set up docker to process the stats. First, you’ll need to create some docker volumes to persist the logs and database between runs (it takes 15-20 minutes to process 10 years’ worth of stats, so we want to cache the logs and database locally).

Create a pair of volumes named ‘stats_db’ and ‘stats_logs’:

docker volume create --name stats_db
docker volume create --name stats_logs

Next we’ll write a Dockerfile that builds a container image with ‘Super Serious Stats’ installed:

# Start with the CentOS 7 image.
FROM centos:7

# Install the required yum packages for Super Serious Stats
RUN yum install -y epel-release git php-zip php-zlib php php-mbstring unzip php-pdo; yum -y clean all
RUN yum install -y python-pip; yum -y clean all

# Install the AWS cli tools
RUN pip install awscli

# Create the directories to store our data
RUN mkdir -p /var/lib/sss/ /tmp/logs /var/www/html/sss

# Use git to clone Super Serious Stats
RUN git clone https://github.com/tommyrot/superseriousstats.git /tmp/superseriousstats

# Copy the config file for Super Serious Stats into the container at build time
COPY sss.conf /tmp/superseriousstats/sss.conf

# Copy the boilerplate web files into an HTML directory
RUN cp /tmp/superseriousstats/www/* /var/www/html/sss

# Runtime command to sync the logs, ingest the logs, run the stats generation, and sync the results to the web bucket
CMD aws s3 sync s3://example-logs/ /tmp/logs && \
    php /tmp/superseriousstats/sss.php -i /tmp/logs/ && \
    php /tmp/superseriousstats/sss.php -o /var/www/html/sss/index.html && \
    aws s3 sync /var/www/html/sss/ s3://example-web/

This container builds an environment and imports a configuration file for SSS. At run time, it syncs the logs from an S3 bucket, processes the logs, then syncs the output with another bucket.

We can build this container image with the following command, run from the directory containing both the Dockerfile and your sss.conf (the COPY step needs the config file in the build context):
docker build -t stats .

Running the project:

Finally, we can run the project in two steps: upload the logs to the proper S3 bucket, then run the docker container to process them.

You can upload the logs from wherever they’re stored using the aws cli tool and the credentials for the ‘stats-admin’ user. Just run ‘aws s3 sync’ from the logs folder and target the ‘example-logs’ bucket. To run the sync once a day at 1 AM, you can set it up in cron with:
0 1 * * * /usr/bin/aws s3 sync /path/to/logs s3://example-logs/
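
If the chatbot host already has other AWS credentials on it, you can keep these separate with a named profile (the profile name here is arbitrary) and add --profile stats-admin to the sync command above:

aws configure --profile stats-admin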

Now you can run the docker task to process the logs and publish the results to your web S3 bucket. Note that you’ll need to inject the credentials for the ‘stats-processing’ user as environment variables into the container:
docker run --rm \
    -v stats_logs:/tmp/logs \
    -v stats_db:/var/lib/sss/ \
    -e AWS_ACCESS_KEY_ID=insert_your_id_here \
    -e AWS_SECRET_ACCESS_KEY=insert_access_key_here \
    -e AWS_DEFAULT_REGION=us-east-1 \
    stats
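
If you want the stats rebuilt automatically, you can schedule that same docker run in cron as well; the timing below is just an example, offset so it fires after the nightly log upload:

30 1 * * * /usr/bin/docker run --rm -v stats_logs:/tmp/logs -v stats_db:/var/lib/sss/ -e AWS_ACCESS_KEY_ID=insert_your_id_here -e AWS_SECRET_ACCESS_KEY=insert_access_key_here -e AWS_DEFAULT_REGION=us-east-1 stats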

When that completes, you should be able to see your stats output by browsing to the ‘example-web’ bucket’s static website endpoint in your browser.