Reporting with Pandas and Seals and Pythons, Oh My

I spend perhaps too much time generating and reviewing numbers and charts and reports, but the right combination of tools can make this enjoyable (or at least less tedious). For me, that generally means using the pandas data analysis library and the Python programming language to analyze data stored in MariaDB or MySQL.

There are a number of BI and analytics tools available, all of which have tradeoffs depending on your needs and your circumstances. You can get by with spreadsheets, write so much SQL you begin to mumble it in your sleep, spend a hefty chunk of money investing in a commercial solution, or burn too much time configuring solutions whose purpose seems to be generating sales for the vendor’s consulting services rather than helping you get things done.

I’ve done all of these, some rather more than others (I may be bitter), but my “go to” analytics toolset today revolves around the pandas library. It has the perfect balance of capability, flexibility, and usability that I need in a tool that sees daily use.

Let’s walk through a few examples; I use Python 3.4 under OS X for these. While this will not be a tutorial (for pandas, Python, or MariaDB), I hope that seeing a few samples of what pandas can do will encourage you to try it. First, let’s set up a working environment:

pyvenv-3.4 ~/py34
cd ~/py34
source bin/activate
pip install matplotlib pandas ipython sqlalchemy mysql-connector-python --allow-external mysql-connector-python

This creates a clean virtual Python environment in the py34 directory and installs a few dependencies; the whole process takes less than a minute for me. Pandas uses matplotlib for charting, and in recent versions prefers SQLAlchemy for querying SQL databases; IPython is my preferred interactive shell when working with Python; and I will use MySQL Connector/Python to query an existing MariaDB database.

The database behind my examples contains data about a hypothetical, growing technical support organization, purely because I happen to have that kind of thing handy. I won’t paste the schema or raw data into this post.

Let’s put a few more details in place and analyze some data. We will start IPython, import the libraries we need, set an option to make charts look a bit prettier, and configure the database connection:

ipython --pylab

import sqlalchemy
import pandas
pandas.set_option('display.mpl_style', 'default')
e = sqlalchemy.create_engine('mysql+mysqlconnector://dellis@localhost/support')

Now let’s just load everything from a single table, in this case the “issues” table:

df = pandas.io.sql.read_sql_table('issues', e)

This creates a pandas DataFrame (somewhat akin to a spreadsheet) containing the table’s data. You can use the read_sql() function to populate a DataFrame from more realistic/complex queries, as sketched below. The columns correspond to basic information about support issues:

In [6]: df.columns
Out[6]: Index(['issue', 'customer', 'product', 'severity', 'status', 'created', 'emails'], dtype='object')
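
If you need more than a straight table dump, read_sql() accepts an arbitrary query against the same connection. Here is a minimal sketch; the WHERE clause and the 'Closed' status are hypothetical, but any valid SELECT works the same way:

# Populate a DataFrame from an explicit query rather than a whole table.
open_issues = pandas.read_sql(
    "SELECT issue, customer, severity, created "
    "FROM issues WHERE status <> 'Closed'", e)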

Let’s answer a few very high level questions:

  • How many total support issues have been opened?
  • How many unique customers have opened issues?

In [7]: df.issue.count()
Out[7]: 3511

In [8]: df.customer.nunique()
Out[8]: 503

We can get a little more complex: which customers have opened the most issues? This is similar to SQL, in that we will group by the customer identity, count the size of each group, sort the result in descending order, and list the top 5:

x = df.groupby('customer').issue.count()
x.sort(ascending=False)  # Series.sort() orders the Series in place
x[:5]

# or: df.groupby('customer').issue.count().sort(inplace=False, ascending=False)[:5]

Out[9]:
customer
Customer 111    20
Customer 37     19
Customer 45     19
Customer 3      18
Customer 19     18
Name: issue, dtype: int64
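
For comparison, you could push the same aggregation down to the database with read_sql(). This is a rough SQL equivalent of the groupby above, assuming the same issues table:

# Let MariaDB do the grouping and sorting, returning only the top 5.
top5 = pandas.read_sql(
    "SELECT customer, COUNT(*) AS issues "
    "FROM issues GROUP BY customer "
    "ORDER BY issues DESC LIMIT 5", e)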

We could target more proactive use cases if we queried a live production database. For example: how many issues are open and waiting on a support engineer to take action? How many at each severity level?

In [10]: df[df.status == 'Support'].issue.count()
Out[10]: 6

In [11]: df[df.status == 'Support'].severity.value_counts()
Out[11]:
S4    3
S3    2
S1    1
dtype: int64

I could retrieve the issue numbers themselves by simply skipping the aggregate function. For example: df[df.status == 'Support'].issue

Answering questions in this fashion can be useful for a wide variety of tasks, but now let’s explore trends over time. In this case, I want to review my data based on when the issues were created.

There are several ways of doing this in pandas, and I use different methods depending on my needs. In this case, I will reindex the DataFrame I’ve been using, changing its index to the “created” column (it has been indexed by a simple “row identifier” so far). Then, I will build a PeriodIndex from the year and month of the issue creation dates, and I will use these to examine new issue inflow:

df.set_index('created', drop=False, inplace=True)
pi = df.index.to_period('M')

In [14]: df['2013'].issue.count()
Out[14]: 1256

In [15]: df['2014Q1'].issue.count()
Out[15]: 476

In [16]: df.groupby(pi).issue.count()[-5:]
Out[16]:
created
2014-03    165
2014-04    156
2014-05    188
2014-06    167
2014-07    190
Freq: M, Name: issue, dtype: int64

Let’s review our inflow by quarters, instead:

In [17]: df.groupby(pi.asfreq('Q')).issue.count()[-5:]
Out[17]:
created
2013Q3     317
2013Q4     408
2014Q1     476
2014Q2     511
2014Q3     190
Freq: Q-DEC, Name: issue, dtype: int64

Perhaps we want to review our Q2 issues that have a larger than “normal” number of emails associated with them. Let’s arbitrarily define “larger than normal” as more than the mean plus the mean absolute deviation:

In [18]: df[df.emails > (df.emails.mean() + df.emails.mad())]['2014Q2'].issue.count()
Out[18]: 118

Where would we be without charts? Let’s see our monthly inflow in a graph, along with the rolling mean (or moving average) with a six month period:

In [19]: df.groupby(pi).issue.count().plot(legend=True, label="Inflow")
In [20]: pandas.rolling_mean(df.groupby(pi).issue.count(), 6).plot(legend=True, label="Average")

We can review the evolution of our weekly inflow:

In [21]: pandas.rolling_mean(df.groupby(df.index.to_period('W')).issue.count(), 6).plot()

Or, the number of unique customers reporting issues by quarter:

In [22]: df.groupby(pi.asfreq('Q')).customer.nunique().plot()

We can see at a glance our issue growth rate for different products:

In [23]: df.groupby(pi).product.value_counts().unstack().plot()

There are several other types of charts available. Here we have a bar chart:

In [24]: df.groupby(pi.asfreq('Q')).issue.count()[:-1].plot(kind='bar')

Pandas can load data from a variety of sources: a MariaDB database, CSV, Excel, HDF, JSON, and many others. It can also export data to those same formats, including writing back to a database.
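
A minimal sketch of that round trip, with hypothetical file and table names:

# Snapshot the DataFrame to CSV, read it back, then write a copy to MariaDB.
df.to_csv('issues_snapshot.csv', index=False)
df2 = pandas.read_csv('issues_snapshot.csv', parse_dates=['created'])
df.to_sql('issues_copy', e, if_exists='replace', index=False)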

If, for example, you are constructing an analytics web service with pandas on the back end, you can convert your results into native Python data types (lists, dicts, etc.), or directly to JSON or other formats, to be consumed however your application needs.
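
A small sketch of that hand-off, reusing the severity breakdown from earlier:

# A JSON string, ready for an HTTP response...
payload = df.severity.value_counts().to_json()
# ...or plain Python structures for whatever the application expects.
breakdown = df.severity.value_counts().to_dict()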

There are countless ways to use it, whether you are connecting directly to a production OLTP server for ad hoc “queries” or automated alert reporting, or generating monthly reports against a denormalized data mart.
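
An automated alert, for instance, can be as small as this sketch; the severity filter is an arbitrary choice, and you would route the output to your notification system of choice:

# Flag severity-1 issues that are still waiting on a support engineer.
stuck = df[(df.status == 'Support') & (df.severity == 'S1')]
if len(stuck):
    print("S1 issues awaiting action:", list(stuck.issue))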

I could barely even scratch the surface in this post. If these couple dozen little examples piqued your interest, however, you can read much more about it at the project’s website: http://pandas.pydata.org/