As you may know (you probably don't :), I've built a greeklish to greek converter for my Diploma thesis, using Python.

It worked well enough (for a diploma thesis), but it was slow. I mean, really, really slow. It was coded in Pure Python (tm) and had no optimizations whatsoever. The main part of the code was an implementation of a Hidden Markov Model:

In a regular Markov model, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a hidden Markov model, the state is not directly visible, but variables influenced by the state are visible. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states.

I was using dicts mainly, to map between a state and a probability, but what I really wanted was a matrix. I went and replaced my dicts with lists of lists, which yielded a performance improvement (a dict lookup may be constant-time, but an index in a list will be faster, because its constant is smaller). I also did some cacheing and other minor stuff, like instead of waiting for KeyErrors or IndexErrors I went and filled up the "matrix", which also helped. Unfortunately, when all that is done, you realize that Python isn't a language designed for number crunching. You have to use something else to do the heavy lifting.

Why numpy?

A usual argument of dynamic language fans when confronted with the "dynamic languages are slow" issue, is, well "you can always rewrite the critical parts in C". Python is supposed to make that easy, using ctypes or SWIG or other libraries. I don't know, I'm not really comfortable doing C, and doing numeric stuff in C seems scary to me, as you'll have to deal with precision and all that weird stuff.

So I decided to use numpy, as the de facto standard when doing numerical applications in Python. It was very straightforward: Instead of using:

a = [[1,2],[3,4]]

you can use:

a = array([[1,2],[3,4]])

and lo and behold, all your code still runs!

Not so fast, mister

Well, actually it runs slower! Huh? How come?

Apparently, just switching to numpy arrays isn't good at all, if you still are looping over them in Python. Nooo, you have to rewrite all your functions to take advantage of the numpy speedup. This can be quite tricky and will lead to hair pulling.

The walkthrough

I'll repeat here the steps I've took to implement the forward algorithm as described in L. Rabiner's HMM tutorial. It's simply 3 equations (19, 20, 21), mostly doing sums and multiplications.

I'll spare the irrelevant details. Assuming A, B, pi are the parameters of the HMM, N is the number of the hidden states and given obs, a list of observation indices, here is the naive code, mostly copying from the book:

#19
T = len(observation_list)
a = zeros((T,N))
for i in range(N):
    a[0][i] = pi[i]*B[obs[0]][i]

#20
for t in range(1,T):
    for j in range(N):
        for i in range(N):
            a[t][j] += a[t-1][i]*A[i][j]
        a[t][j] *= B[obs[t]][j]

#21
prob = 0
for i in range(N):
    prob += a[T-1][i]

The pattern is roughly, the sum (or product) of something over i, from 1 to N, turns into a for i in range(N). Let's see how this can be rewritten to use numpy.

First of all, don't delete the old function! We'll interleave the new code and sprinkle assertions in between to make sure we don't make mistakes. If you have unit tests (you should!), you can make sure later that everything still ticks, but while rewriting it helps to take baby steps.

For function #19, the numpy code is this:

a_ = zeros((T, self.N))
a_[0] = pi*B[obs[0],:]

assert((a==a_).all())

It's apparent that the conversion "algorithm" is:

  1. Use the numpy element accessor [x,y] instead of [x][y]
  2. Replace "i" (the counter) with ":"
  3. If you have just a vector, remove the square brackets entirely.
  4. Remove the for loop
  5. Assert that each element of the numpy array is the same with naive array.

Things to remember:

  1. The "*" operator for numpy arrays is element-wise multiplication, instead of dot product.
  2. The ":" accessor works a bit like a slice-operator. I like to thik of it as wildcard

What the above code is doing boils down to:

"Multiply each element of pi with the corresponding element of the slice obs[0] of B, and assign the resultant array (a vector, really) to a_[0]"

With that in mind, we can easily deduce that for function #21 the code will be:

prob_ = sum(a_[T-1,:])

assert(prob_==prob)

Which means, "sum the elements of the slice T-1 of array a_"

Function #20 is a bit more complex, and I'll go step-by-step:

First, there is a sum:

for i in range(self.N):
    a[t][j] += a[t-1][i]*A[i][j]

Which can be easily replaced with

a_[t][j] = sum(a_[t-1,:]*A[:,j])

using the above algorithm. We now have:

for t in range(1,T):
    for j in range(self.N):
        a_[t][j] = sum(a_[t-1,:]*A[:,j])
        a_[t][j] *= B[obs[t]][j]

There is also a product (a_[t][j] *= B[obs[t]][j]), that can be swapped with a_[t,:] *= B[obs[t],:]

The code now becomes:

for t in range(1,T):
    for j in range(self.N):
        a_[t][j] = sum(a_[t-1,:]*A[:,j])
    a_[t,:] *= B[obs[t],:]

Only one loop to go! However, that sum is trouble. When I tried to apply the usual "replace-i-with-:" routine, I got an AssertionError (that's why you have assertions!). After some minutes with pen and paper, it occured to me that the sum operator reduced the array to a single number, when clearly I expected a vector. So I needed an operator multiplied two arrays, did per row sums and return an array.

OK, duh, that's the dot product. My linear algebra is rusty, I know. So now the code becomes:

for t in range(1,T):
    a_[t,:] = dot(a_[t-1,:],A)
    a_[t,:] *= B[obs[t],:]

Or, in one line:

for t in range(1,T):
    a_[t,:] = dot(a_[t-1,:],A)*B[obs[t],:]

UPDATE: Of course, when implementing the viterbi algorithm, I came across this piece of code:

    for t in range(1,T):
        for j in range(self.N):
            max_daij = 0.0

            for i in range(self.N):
                tmp_daij=d[t-1][i]*A[i][j]
                if tmp_daij>=max_daij:
                    max_daij=tmp_daij
                    psi[t][j] = int(i)
            d[t][j] = max_daij*B[obs[t]][j]

It seems that max and argmax should be used. This translates to:

    for t in range(1,T):
        max_daij=amax(d_[t-1,:]*A.T,1)
        psi_[t,:] = argmax(d_[t-1,:]*A.T,1)
        d_[t,:] = max_daij*B[obs[t],:]

Things to notice: I used from numpy import max as amax so as not to complicate things. Also note the .T syntax. This is the transposed array. It's quite cheap as an operation, as it returns just a transposed view to the data. I should've done this testing months ago. Sigh.


Conclusion

That's not bad. The final result has less cruft, at the expense of more concentration. I actually kept the interleaved version so I can understand what's going on if I revisit the function.

I hope I haven't done anything stupid, since the data I use aren't that complicated, and in my experience, you really need big datasets to test against when developing numerical applications.

February 9, 2008, 5:52 p.m. More (1169 words) 3 comments Feed
Previous entry: Η Ελλάδα ταξιδεύει
Next entry: Switching to Emacs!

Comments

1

Comment by Orestis Markou , 2 years, 7 months ago :

Well, after trying to port the Viterbi algorithm too, I've finally found the bug that was tormenting me for about 3 months: The "max" and "argmax" functions of numpy don't behave the same as doing it in a loop.

I'll try to reach someone in the numpy project to see what's going on.

2

Comment by Orestis Markou , 2 years, 7 months ago :

It seems that the issue arises when there are similar values. It seems that numpy's max and argmax have some kind of bias... Strange.

3

Comment by Orestis Markou , 2 years, 7 months ago :

Hah! Disregard the above, I'm stupid. More testing revealed that of course my logic was flawed. I'll update with the suspicious code snippet.


This post is older than 30 days and comments have been turned off.