The need for speed – Python performs well with NumPy
We were asked to speed up a 150K-line financial risk management application. Financial calculations make up less than 25% of the code, but we went further and used NumPy to parallelize the object-oriented code as well, building incremental, real-time engines for financial valuation and pivot reporting.
If from speed you get your thrill, take precaution, make your will.
Is Python fast enough for large applications? It certainly is for Bank of America’s risk management framework Quartz, which apparently consists of several million lines of Python code, written by thousands of programmers and running on tens of thousands of cores. While I don’t know their performance tricks, I can tell you what we did for our own solution, Quantax.
Quantax is an application to calculate and present market risk for trading in banks and asset management. It is written entirely in Python, and has evolved over 10 years. It competes on speed with applications typically written in C or C++.
For a major overhaul, performance was of utmost concern. As Quantax presents results as events happen (soft real time), some standard approaches and packages (both in Python and outside) didn’t fit. We need to process new transactions and prices incrementally and insert them into the result, while allowing user-supplied calculations as plugins.
We built the new infrastructure around NumPy:
- We extract data from objects once and put it into NumPy arrays (and keep it updated when something changes).
- Finance loves class hierarchies (e.g., an interest option is an option with some extras), but object access – with its chain of method calls – is a hindrance to parallel computation.
- So we try to extract the static attributes once, and reuse them with different rates.
- We have different tables for integer and float arrays, bound together by common structures.
- We use dictionaries that map non-numeric values to integer codes, combined with lists for the reverse lookup, to store non-numeric information in NumPy tables.
- We have sub-tables, which we join, pivot, and fill incrementally using callbacks (which users may supply as plugins).
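The points above can be sketched in a few lines. This is only an illustration under assumptions, not the Quantax code: the `Option` class, its attributes, the `CodeBook` helper, and the payoff formula are all made up for the example. It extracts static attributes from objects once, stores a non-numeric attribute as integer codes, and then reuses the arrays for different rates.

```python
import numpy as np


class Option:
    """Hypothetical domain object; the attribute names are illustrative."""

    def __init__(self, currency, quantity, strike):
        self.currency = currency
        self.quantity = quantity
        self.strike = strike


class CodeBook:
    """Maps non-numeric values to integer codes (dict) and back (list)."""

    def __init__(self):
        self._codes = {}    # value -> integer code
        self._values = []   # integer code -> value

    def encode(self, value):
        code = self._codes.get(value)
        if code is None:
            code = len(self._values)
            self._codes[value] = code
            self._values.append(value)
        return code

    def decode(self, code):
        return self._values[code]

    def __len__(self):
        return len(self._values)


trades = [
    Option("USD", 1000.0, 100.0),
    Option("EUR", 2000.0, 95.0),
    Option("USD", 500.0, 110.0),
]

# Extract the static attributes once; afterwards no object access is needed.
currencies = CodeBook()
currency_code = np.array(
    [currencies.encode(t.currency) for t in trades], dtype=np.int64
)
quantity = np.array([t.quantity for t in trades], dtype=np.float64)
strike = np.array([t.strike for t in trades], dtype=np.float64)


def intrinsic_value(spot):
    """Intrinsic value of all (call) options at once for a given spot rate."""
    return quantity * np.maximum(spot - strike, 0.0)


def total_by_currency(values):
    """Pivot-style aggregation: sum the values per currency code."""
    totals = np.bincount(
        currency_code, weights=values, minlength=len(currencies)
    )
    return {
        currencies.decode(code): float(total)
        for code, total in enumerate(totals)
    }


# Reuse the same extracted arrays with different rates.
low = total_by_currency(intrinsic_value(105.0))
high = total_by_currency(intrinsic_value(120.0))
```

The integer codes keep the table purely numeric, so grouping by currency becomes a single `np.bincount` call instead of a loop over objects.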
Cache is king – still
Quantax uses many levels of cache (on reports, valuations, models, rates, portfolios, dependencies, stored objects, etc.), because no computation is always faster than an optimized computation. But it is sometimes difficult to determine when “the world changes” and what exactly is affected. Getting this right is very important, because it lets us avoid needless calculations. We use event queuing and merging for this, as well as address mapping as calculations move through structures.
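To make the idea concrete, here is a minimal sketch of a cache that records which inputs each result depends on, queues change events, merges duplicates, and invalidates only the affected results. The class name `ValuationCache`, its methods, and the sample portfolios are assumptions made for this example; the real Quantax caches are certainly more elaborate.

```python
class ValuationCache:
    """Caches results and invalidates only those affected by changed inputs."""

    def __init__(self, compute):
        self._compute = compute   # function: key -> value
        self._results = {}        # key -> cached value
        self._dependents = {}     # input name -> set of keys that used it
        self._pending = set()     # queued change events, merged by name
        self.computations = 0     # counts the real computations performed

    def get(self, key, inputs):
        """Return the cached value, computing it only if it is missing."""
        if key not in self._results:
            self._results[key] = self._compute(key)
            self.computations += 1
            for name in inputs:
                self._dependents.setdefault(name, set()).add(key)
        return self._results[key]

    def notify(self, input_name):
        """Queue a change event; repeated events for one input merge."""
        self._pending.add(input_name)

    def flush(self):
        """Drop every cached result that depends on a changed input."""
        for name in self._pending:
            for key in self._dependents.pop(name, ()):
                self._results.pop(key, None)
        self._pending.clear()


# Hypothetical data: each portfolio holds one symbol.
prices = {"AAA": 10.0, "BBB": 20.0}
holdings = {"fund1": ("AAA", 100), "fund2": ("BBB", 50)}


def value(portfolio):
    symbol, qty = holdings[portfolio]
    return prices[symbol] * qty


cache = ValuationCache(value)
first = cache.get("fund1", inputs=["AAA"])
cache.get("fund2", inputs=["BBB"])
cache.get("fund1", inputs=["AAA"])      # served from the cache
count_before = cache.computations

prices["AAA"] = 11.0
cache.notify("AAA")
cache.notify("AAA")                     # merged with the previous event
cache.flush()

updated = cache.get("fund1", inputs=["AAA"])   # recomputed
cache.get("fund2", inputs=["BBB"])             # still cached
count_after = cache.computations
```

Only `fund1` is recomputed after the price change, so the work done is proportional to what actually changed.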
Lessons learned with NumPy
- Test performance if you do anything other than straight numeric code. There might be surprises, because NumPy may not be optimized for what you’re doing.
- In particular, avoid many concatenates, as this fragments the heap and requires copying. We pre-allocate arrays instead, and fill them using indexing.
- NumPy is almost as robust as Python – but not 100%:
- Expect the unexpected – build edge-case tests that cater for empty or ragged arrays.
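As an illustration of the pre-allocation advice, the sketch below fills a pre-allocated buffer by indexing and grows it by doubling only when it is full, instead of calling `np.concatenate` for every new batch. `GrowableArray` and `safe_max` are hypothetical helpers written for this example. The second helper shows the kind of empty-array edge case worth testing: calling `max()` on an empty NumPy array raises a `ValueError`.

```python
import numpy as np


class GrowableArray:
    """Append-only float array backed by a pre-allocated buffer."""

    def __init__(self, capacity=1024):
        self._data = np.empty(max(capacity, 1), dtype=np.float64)
        self._size = 0

    def extend(self, values):
        values = np.asarray(values, dtype=np.float64).ravel()
        needed = self._size + values.size
        if needed > self._data.size:
            # Grow geometrically so that copies stay rare.
            new_capacity = max(needed, 2 * self._data.size)
            grown = np.empty(new_capacity, dtype=np.float64)
            grown[:self._size] = self._data[:self._size]
            self._data = grown
        # Fill by indexing instead of concatenating.
        self._data[self._size:needed] = values
        self._size = needed

    def view(self):
        """Return the filled part of the buffer (a view, not a copy)."""
        return self._data[:self._size]


def safe_max(values, default=0.0):
    """Maximum that tolerates empty input instead of raising ValueError."""
    values = np.asarray(values, dtype=np.float64)
    return float(values.max()) if values.size else default


prices = GrowableArray(capacity=2)
prices.extend([1.0, 2.0, 3.0])   # exceeds the capacity, triggers one growth
prices.extend([])                # the empty batch is a no-op
prices.extend([4.0])
```

Whether doubling is the right growth policy depends on the workload; it is just a common default.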
Lessons about abstraction and developers
- NumPy seems to be hard to understand for non-mathematically-inclined people (e.g., banking experts).
- For developers, NumPy is easy to understand, but array-oriented approaches can be hard, especially if they involve moving positions within arrays (rolling, inserting, etc.).
- Do provide a nice API on top, which requires only local (one at a time) thinking.
- Don’t create complex solutions without simple abstractions if you have (non-mathematically-inclined) domain experts coding.
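One possible shape for such an API, sketched under assumptions: the domain expert writes a formula that reads as if it handled a single position, while the framework passes whole columns behind the scenes. The names `run_plugin`, `market_value`, and the column names are invented for this example and are not the Quantax API.

```python
from types import SimpleNamespace

import numpy as np


def run_plugin(formula, columns):
    """Call a user-written formula once, with whole columns as attributes."""
    position = SimpleNamespace(**columns)   # allows position.quantity etc.
    return np.asarray(formula(position), dtype=np.float64)


# What a domain expert writes: it reads like a formula for one position.
def market_value(position):
    return position.quantity * position.price * position.fx_rate


# What the framework supplies: one NumPy array per attribute.
columns = {
    "quantity": np.array([100.0, -50.0, 10.0]),
    "price": np.array([10.0, 20.0, 5.0]),
    "fx_rate": np.array([1.0, 1.1, 0.9]),
}

values = run_plugin(market_value, columns)
```

This only works for formulas built from element-wise arithmetic. Anything with branching per position would need `np.where` or a similar helper, which is exactly where the abstraction has to do more work.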
PyPy, JIT, and the GIL
We also looked at PyPy, a just-in-time-compiled (JIT-ed) implementation of Python. PyPy can considerably speed up object-heavy code, but at this time still lacks some NumPy features, so we can’t use it right now.
PyPy is also experimenting with a Software Transactional Memory (STM) approach to avoid the Global Interpreter Lock (GIL), which prevents Python from running CPU-bound code in multiple threads. STM in PyPy is being actively developed, and is looking for donations.
With the GIL, Quantax keeps the cores busy by using task-level parallelism across processes, with some duplication of caches. With STM, we could go for much finer-grained parallelism, using threads, with minimal inter-process communication and memory overhead.
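Task-level parallelism across processes can be sketched with the standard library alone. The example below is an assumption about the general pattern, not a description of how Quantax distributes its work: `value_portfolio`, `value_all`, and the sample data are hypothetical. Each worker process has its own memory, which is why caches end up duplicated.

```python
from concurrent.futures import ProcessPoolExecutor


def value_portfolio(portfolio):
    """One coarse-grained task: value a whole portfolio in one process."""
    name, positions = portfolio
    return name, sum(quantity * price for quantity, price in positions)


def value_all(portfolios, max_workers=2):
    """Distribute whole portfolios over worker processes."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(value_portfolio, portfolios))


if __name__ == "__main__":
    # The guard is needed on platforms that start workers by re-importing
    # the main module.
    portfolios = [
        ("fund1", [(100, 10.0), (50, 20.0)]),
        ("fund2", [(10, 5.0)]),
    ]
    print(value_all(portfolios))
```

Arguments and results are pickled and sent between processes, so tasks should be large enough to outweigh that communication cost.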
PS: In upcoming posts, I will write about the Quantax user interface overhaul, and about the Composability principles we used. So, why don’t you follow our blog using a feed reader? I use feedly, but any RSS reader will do.