Wednesday, January 23, 2013

Blogging with the ipython netbook and the github/nbviewer combo

As you can see I've just changed the whole aspect of the blog.
This is a response to a couple of design need. The lesser one is that all the plots that I will create will be with a white background, and against a dark background it result in a pain in the eye. The greater one is that Blogger suck for posting code and images. I spend half of my time to check the font and color of the code, and every image require saving it to disk. I can't even show any more complicated code as the results should be reformatted by hand. And, to be honest, I always dreamed of simply blogging my ipython notebook scripts.

So, thank to Brian E. Granger, who wrote a simple method to post an ipython notebook as a frame in the post, I can have the cake and eat it too!
The method is simple as adding an iframe tag (with the correct dimension put by hand, but that's a minor flaw) in the HTML code of the post, and voilĂ !

<iframe src="http://nbviewer.ipython.org/3835181/" width="800" height="1500"></iframe>

Leveraging the magic of the nbviewer server (that is an amazing service, by the way) I can now write a wonderfully formatted notebook, with code and formulas and plots and everything, and just hand it to you. What I'm going to do is set a git repository, create my notebook in there and link them with nbviewer in here. By the way, the notebook that it's linked it's the wonderful XKCD plot style created by Jake Vanderplas, a great blogger and python developer

I will try to explain how I'm going to set up and manage the repository.
First of all I create a new repository called blogger_notebook
Having set up a github account (that is really easy), the next step is to create the directory that will host my material.

mkdir blogger_code
cd blogger_code/

GitHub give some useful information on how to create a new repository. First of all, we create it by writing

git init

this tell to git that in this directory we have a git repository and that it should keep the version control backup of the data. Now I tell him the online repository location:

git remote add origin https://github.com/EnricoGiampieri/blogger_notebook.git 

Ok, we are close to the goal. I copy the notebook that will be my next post, basemap.ipynb. Now I need to tell git to follow it

git add basemap.ipynb

now everytime I make a modification to this file that I want to remember I can save it with the commit command. I also need to add a description of the modification done

git commit basemap.ipynb -m "creation of an example of basemap usage"

lastly, to keep up the repository online, I should put it into the GitHub repository. This will ask for my username and password and will upload all the modification to the online repository.

git push -u origin master

you can see the results here:

https://github.com/EnricoGiampieri/blogger_notebook

Now, the last step is to create a nbviewer link to the notebook. You should take the link to the raw file (you can obtain it going into the file and search for the RAW button) and give it to the nbviewer main page. it will give you a nice link to the notebook with all it's content:

http://nbviewer.ipython.org/urls/raw.github.com/EnricoGiampieri/blogger_notebook/master/basemap.ipynb

Obviously this is barely scratching the surface of the (super)power of git, but there are tons of manuals online that explain it better than what I could ever do. This was just a step by step guide to how to setup this "delayed blogging" method.

Sunday, January 13, 2013

Sympy for statistics

One of the python module to which I have the most controversial feelings is without any doubt sympy.

Sympy is a great piece of software that can deal with a huge amount of problem in a quite elegant way, and I would really like to use it more in my work. The main drawback was a very poor support for statistic, and making all those integral by hand felt a little odd.

It was with a lot of happiness that I read about the development of a new module for statistics in sympy, called sympy.stats, that promised to address all (or at least most) of the needs that someone can have working out statistical problems.

The foundation for this module has been put into place by Matthew Rocklin in the summer 2012. He made a good job, and the module has been indeed extended to support a great amount of probability distribution, both continuous and finite. There is yet no support for infinite discreet space like the natural numbers, and this means that few very important distribution like the Poisson or the Negative Binomial are still left out, but the overall feeling is very good.

The library is based on the idea of Random variable, defining a probability measure over a certain domain. For example, a normal variable is defined over the whole real axis and implements the gaussian probability density.

A selected amount of operation can be done over these random variables, notably obtaining the density estimation, the probability of an event or the expectation value.

But let the code speak.
Let's import sympy and sympy.stats, and create a Normal variable with a fixed variance and mean represented by a sympy real variable. Remember that any time we specify a new sympy symbol we have to declare a name for that symbol. in this case our normal distribution will be called simply X.

import sympy
import sympy.stats as stats

mu = sympy.Symbol('mu')
X = stats.Normal('X',mu,1)

We can now ask the expected mean and standard deviation of out random variable:

print sympy.simplify(stats.E(X))
print sympy.simplify(stats.variance(X))

that return, as expected, mu and 1.
We can also create new random expression based on the original one.
We know for example that a chi squared variable is the sum of N normal, so we can obtain the mean and variance of a 2-degree of freedom Chi distribution simply by summing up the squares of two normal distribution:

X = stats.Normal('X',0,1)
Y = stats.Normal('Y',0,1)
Chi = X**2 + Y**2


stats.E(Chi)
# 2
stats.variance(Chi)
# 4

that are exactly the values we were expecting (see Chi Squared distribution)
We can sample our expression with sample or sample_iter, and we can look at the resulting distribution:

samples = list(stats.sample_iter(Chi, numsamples=1e4))

we can plot the histogram with pylab as simple as:

pylab.hist(samples, bins=100)
pylab.show()

We can also evaluate the conditioned probability of events, but on continuous function this lead to some heavy integrals, so I will demonstrate it using the more simpler Die class, that represents the launch of a fair n-sided die.

X = stats.Die('X',6)
Y = stats.Die('Y',6)
W = stats.Die('W',6)
Z = X + Y + W


We can ask what is the probability that a realization of X is grater than 4:

stats.P(X>4)
# 1/3

or that it equals a certain value, say 3 (the ugly syntax cannot be avoided due to how the equality test is evaluated):

stats.P(sympy.Eq(X,3))
# 1/6 

We can also ask what is the probability that the three dice Z will roll more than 10 given that the first die rolled a 4:

stats.P(Z>10, sympy.Eq(X,4))

So, summing up, the stats module of sympy is really promising and I hope that a lot of work will be done on it to make it even better. If I will understand the sympy development process and the module class hierarchy, I will surely try to make a contribution.
Given these praises, for my needs it still lacks several fundamental features:
  • support for non-limited discreet spaces
  • better support for mixtures of distribution (right now I still get only error complaining about the invertibility of the CDF)
  • better fall-back to numerical evaluation, as a lot of distribution are described by integrals and special functions and, even if the integration routine of sympy is pretty solid, not everything can be solved analytically
My best whishes to the sympy team, thank you for your great job!