Managing and visualizing tabular data with pandas and matplotlib

Seminar on Information Visualization
Tampere University of Technology, 24.5.2013


Jaakko Salonen
Researcher
TUT / IISLAB

iki.fi/jaakko.salonen
@jsalonen

In this presentation

  • Introduction to tools
  • Setup and examples
  • Data analysis case example
  • Summary

pandas

Python Data Analysis Library

Open source, BSD-licensed library providing high-performance easy-to-use data structures and tools for the Python programming language

Not a framework

  • Pandas assists you in data analysis and modeling
  • No a comprehensive solution, but a library
  • Use other tools to "mungle" and prepare your data
  • Likewise does not provide comprehensive statistics tools


(What we believe is) a common pandas stack:
numpy, pandas, matplotlib, statsmodels, IPython

Notable Features

  • An efficient data table implementation ("DataFrame")
  • Comprehensive import and export tools (CSV, Excel, SQL, ...)
  • Data alignment and missing data handling
  • Can conveniently handle large data sets
  • Data table merges and joins
  • Comprehensive time series support
  • And a lot more

matplotlib

matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

matplotlib + pandas

  • Pandas provides matplotlib integration
  • Convenient way to visualize data with pandas
  • Use with IPython to make notebooks that mix code with visualizations

IPython

http://ipython.org/

  • "Provides a rich architecture for interactive computing"
  • Interactive shells for terminal and Qt
  • A browser-based notebook
  • And much more


We used it here to simply draw graphics

Setup and examples

Prerequisites

  • Install pandas
  • Install matplotlib
  • Install numpy
  • Install IPython

Conventional start up


ipython --pylab

import pandas as pd
import numpy as np

DataFrame

A 2-dimensional labeled data structure with columns of potentially different types 1

	
# Create 3 column dataframe with 3 rows
df = pd.DataFrame(np.random.randn(3, 3), columns=['x', 'y', 'z'])

# Show frame representation
df

# Plot
df.plot()

DataFrame from CSV

Sample data: Nokia stock price from Google Finance


# Read CSV, use first column as index
df = pd.DataFrame.from_csv('nok.csv', index_col=0)

# Show columns
df.columns

# Show data types
df.dtypes

# Show index
df.index

# Rename columns
df.columns = ['open', 'high', 'low', 'close', 'vol']

Slicing


# Show head (first rows)
df.head()

# Show tail (last rows)
df.tail()

# Pick first 20 rows
df[:20]

# Pick head from last 100 rows
df[-100:].head()

Descriptive statistics

http://pandas.pydata.org/pandas-docs/dev/basics.html

# Mean
df.mean()

# Mean (rows; doesn't make any sense here though)
df.mean(1)

# Sum (skip rows without applicable values)
df.sum(skipna=True)

More: count, median, min, max, mad, std, var, ...

Series

A one-dimensional labeled array capable of holding any data type 1


# Choosing individual Series from DataFrame
df['high']

# Or simply:
df.high

# Combine with plot
df.high.plot()

# Create and plot a new DataFrame from selected Series
df2 = df[['low', 'high']]
df2.plot()

Data arithmetics and
matrix operations

http://pandas.pydata.org/pandas-docs/dev/dsintro.html

# Difference
df.close - df.open

# The result is still a Series and can be e.g. sliced and plotted
(df.close - df.open).tail()
(df.close - df.open).plot()

# Transpose
df.T

# Matrix multiplication
df.T.dot(df)

Data analysic case example

Case: Finnish government subsidiaries

http://yle.fi/aihe/artikkeli/2012/09/07/yritystukien-kanta-asiakkaat

  • Data set of Finnish government subsidaries
  • Scope: 10 years, 5 billion euros worth of subsidiaries
  • Most importantly: the raw data available for analysis

Excel workbook → DataFrame

Assuming you downloaded and unzipped the data set.


# Openpyxl is required for .xlsx parsing ("pip install openpyxl")
xlsx = pd.ExcelFile('mot_yritystuet_myonnot_final.xlsx')

# Locate desired sheet
xlsx.sheet_names

# Parse the sheet into a DataFrame
# First row contains license info, skip it
df = xlsx.parse(xlsx.sheet_names[0], skiprows=1)

# Show some metadata to figure out if it was parsed correctly
df.index
df.columns
df.head()

So tell me now

Who granted subsidiaries and how much?


# List columns
df.columns

# Group by grantor
df.groupby(u'myöntäjä')

# Group by grantor and calculate sum
df.groupby(u'myöntäjä').sum()

# All of the above and sort
grantors = df.groupby(u'myöntäjä').sum().sort(u'myönnetty summa')

# Plot!
grantors[u'myönnetty summa'].plot(kind='bar')

Who got the most?


# Just like previously, but group by yritys (company)
by_company = df.groupby('yritys').sum()

# Sort companies by total grant sum
by_company_sorted = by_company.sort(u'myönnetty summa', ascending=False)

# Plot
by_company_sorted[u'myönnetty summa'][0:20].plot(kind='barh')

Subsidiaries by region?


# Sums by region
by_region = df.groupby('maakunta').sum()

# Drop missing data
by_region = by_region.drop('ei_saatavissa')

# Sort
by_region_sorted = by_region.sort(u'myönnetty summa', ascending=False)

# And plot
by_region_sorted[u'myönnetty summa'].plot(kind='barh')

Summary

  • Pandas is a powerful and feature-rich tabular data analysis library
  • You need to understand the basic concepts to properly use it
  • Pandas solves tabular data analysis and modelling problems. That's all it does, but it does them well.

Thank you!

Questions?


Further discussion

References and further reading

Pandas documentation
(http://pandas.pydata.org/pandas-docs/dev/)


McKinney, W. (2012). Python for Data Analysis. O'Reilly.