Scientific applications in Python

http://durden.github.io/python_science_apps

  • This is the middle of the conference, so how about I introduce the future?
  • Let's talk about building a desktop app, those still exist right?
  • Not all of us are in the cloud yet!

Luke Lee

  • Writing oil/gas software with geophysicists, mathematicians, geologists
  • Embedded C Developer with SSDs
  • Explain what tools I've learned to start becoming a productive Python developer in the energy business.

Overview

  • Problems
  • Why Python
  • Tools
  • Sample app
  • This is a TOUR, not comprehensive
  • Lots of good stuff will be missed
  • This isn't a 'how' talk, it's more of a 'where' talk
  • Lots of links at the end, worth exploring if you're interested

Problems to solve

  • Slice/crunch numbers
  • Interactive plotting
  • Deployment
  • Windows/Linux
  • Too many ASCII file formats!

Why Python

  • Scientific community
  • Works with 'fast' languages
  • Works with other virtual machines/platforms
  • Good packaging tools for easy deployment
    • PyInstaller, py2exe
  • Focus on reasons not usually talked about
  • Obvious:
    • Open source/tools/(Windows/OS X/nix)
  • Works with C/C++/Fortran
  • Other VMs/platforms (IronPython/Jython)
  • Clients ask for it!
  • Space, Weather, Model molecules
  • Lots of stuff is IO bound, not CPU

Python tools

  • Crunch numbers
    • IPython, NumPy, pandas, scipy, pytables
  • Visualize 2d/3d
    • PyQt/PyQwt, matplotlib, VTK, Mayavi
  • Location
    • Esri (ArcPy/geoprocessing)
  • Not enough time for everything
  • Be aware Enthought makes lots of plotting tools
  • Big overview of some common tools
  • Ma-ya-vee
  • Esri embraced Python as scripting lang for ArcGIS

Let's build an app!

  • Model
    • Sqlite/Django ORM
  • View
    • HTML/CSS/Javascript
  • Controller
    • Python glue/Django/Flask
  • Ok so we know Python can handle the problem we've outlined.
  • I'm a Django enthusiast
  • Lots of Django devs here so let's make an analogy

Sample app overview

  • Model
    • HDF5/PyTables
  • View
    • PyQt/PyQwt
  • Controller
    • NumPy/Pandas/Scipy
  • Lots of choices, so how do we use all that with Python tools?
  • Translate that MVC into desktop world
  • Think back to requirements
    • Crunch big numbers, visualization
    • Forget location data for now
  • Wrote sample app to try out, see links at end

Model

HDF5

  • Built for scientific data
  • Designed for big data
  • Hierarchical format
  • Fast parallel/random access
  • Portable binary format
  • Easy to discover/crawl structure
  • Started by a bunch of smart supercomputing guys
    • Version 5 in 1998
  • Built to address many limitations of HDF4
  • Lots of compression/chunking available:
    • 30 columns and 1 million entries using ~ 13 MB
  • Usable from lots of languages, C/Java/Fortran
  • Mostly C, not RDBMS replacement
  • Very active despite being

Model

PyTables

  • Read/write HDF5 files
  • No concurrency
  • NumPy to boost performance
  • Think ORM for HDF5
  • More comprehensive than h5py
  • Built for big data
  • Not a replacement for relational DB, more like companion
  • Tools being aware of NumPy avoids copying to Python datatypes first
  • Great community/developer support

Model

PyTables

# 'ro' is an open PyTables Table with 'pressure' and 'energy' columns
for row in ro.where('pressure > 10'):
    print(row['energy'])
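For context, here is a minimal sketch of how a table like `ro` might be created and queried. The `Reading` schema, the file name, and the sample values are assumptions made for illustration; only the `where('pressure > 10')` query comes from the slide.

```python
import os
import tempfile

import tables


class Reading(tables.IsDescription):
    # Hypothetical schema: two float columns named after the query above.
    pressure = tables.Float64Col()
    energy = tables.Float64Col()


path = os.path.join(tempfile.mkdtemp(), "readings.h5")

# Write a few rows with zlib compression enabled on the table.
with tables.open_file(path, mode="w") as h5:
    filters = tables.Filters(complevel=5, complib="zlib")
    table = h5.create_table("/", "readings", Reading, filters=filters)
    row = table.row
    for i in range(20):
        row["pressure"] = float(i)
        row["energy"] = float(i) * 2.0
        row.append()
    table.flush()

# Reopen read-only and run the query from the slide.
with tables.open_file(path, mode="r") as h5:
    ro = h5.root.readings
    energies = [row["energy"] for row in ro.where("pressure > 10")]

print(energies)
```

The condition string is evaluated by PyTables itself rather than in a Python loop, which is why the query stays fast on large tables.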

View

PyQt

  • Python bindings to Qt toolkit
  • Cross-platform
  • Includes GUI, network, XML, SQL, etc.
  • PySide for LGPL
  • General UI widgets; menubars, toolbars, etc.
  • Qt is old and hardened, first release 1992
  • PyQt first released around 1998
  • Beware of licenses; PySide is LGPL, PyQt is GPL

View

PyQwt

  • For science/engineering apps
  • Much smaller/faster than matplotlib
  • Bad Python docs, use C++ docs
  • Stable, but not a lot of dev. activity
  • Not compatible with PySide?
  • PyQt/PyQwt can feel a bit awkward at times b/c they wrap C++ code with automated tools

View

Pyqtgraph

  • Possible PyQwt replacement
  • Doesn't rely on Qwt
    • Pure Python (PyQt/PySide/NumPy)
  • 3D
  • Fast performance?
  • Recommending use of pyqtgraph for future
  • Docs claim very fast performance, on par with PyQwt
    • Haven't tested myself
  • Uses numpy and Qt GraphicsView under hood for performance

Controller

NumPy

  • Arrays with brains
  • Fast element-wise operations
  • Smart memory management/copy semantics
  • Controller part is where things get exciting, unique to Python
  • Base of any scientific app in Python
  • Python has had several array libraries (Numeric, numarray); NumPy learned from them
  • NumPy is everywhere, lots of tools use it directly to avoid intermediate data types (pandas/pytables)
  • In-memory
  • Written in C/Python

Controller

NumPy

- Pure python

>>> x = range(10000)
>>> %timeit [item + 1 for item in x]
1000 loops, best of 3: 437 us per loop

- NumPy

>>> import numpy
>>> x = numpy.arange(10000)
>>> %timeit x + 1
100000 loops, best of 3: 13.9 us per loop
  • Outsource loops to NumPy/C
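The `%timeit` lines above only work inside IPython. A rough equivalent for a plain script, using the standard `timeit` module, might look like the sketch below. The repeat count is an arbitrary choice, and absolute timings will differ by machine.

```python
import timeit

import numpy

values = list(range(10000))
array = numpy.arange(10000)

# Time 200 runs of each approach; only the ratio is meaningful.
pure_seconds = timeit.timeit(lambda: [item + 1 for item in values], number=200)
numpy_seconds = timeit.timeit(lambda: array + 1, number=200)

print("pure Python: %.4f s" % pure_seconds)
print("NumPy:       %.4f s" % numpy_seconds)
```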

Controller

NumPy

>>> x = numpy.arange(3)
>>> x
array([0, 1, 2])
>>> x[x > 1]
array([2])
>>> x > 1
array([False, False,  True], dtype=bool)
  • Boolean indexing, creates new array
  • Operations can be chained to build complex 'queries'
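As a small illustration of chaining, boolean masks can be combined with `&` and `|` before indexing. The `pressure` and `depth` arrays below are made-up sample data, not taken from the talk.

```python
import numpy

# Hypothetical measurements, one entry per sample.
pressure = numpy.array([5.0, 12.0, 30.0, 8.0, 15.0])
depth = numpy.array([100, 250, 400, 150, 300])

# Each comparison yields a boolean array; parentheses are required
# because & binds more tightly than the comparison operators.
mask = (pressure > 10) & (depth < 350)

selected = pressure[mask]
print(selected)
```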

Controller

NumPy

>>> x
array([0, 1, 2])
>>> x[:2][0] = 1      # slice is a view, so x changes
>>> x
array([1, 1, 2])
>>> x[x > 0][0] = 10  # boolean index is a copy, so x is untouched
>>> x
array([1, 1, 2])
  • view vs. copy
  • Be mindful of how you index
  • NumPy is designed for big data
  • Tries to avoid copying
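One way to check whether an indexing expression returned a view or a copy is `numpy.shares_memory`. The sketch below also shows that assigning through the mask in a single indexing step does modify the original array.

```python
import numpy

x = numpy.arange(3)

sliced = x[:2]     # basic slicing returns a view onto x's memory
masked = x[x > 0]  # boolean indexing returns a new array (a copy)

slice_is_view = numpy.shares_memory(x, sliced)
mask_is_view = numpy.shares_memory(x, masked)

# Indexing on the left-hand side in one step writes into x itself.
x[x > 0] = 10

print(slice_is_view, mask_is_view, x)
```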

Controller

NumPy

>>> import numpy
>>> rand_arr = numpy.random.rand(2, 2)
>>> numpy.savetxt('test.out',
...               rand_arr,
...               delimiter=' ',
...               fmt='%1.5f',
...               header='a b',
...               comments='')
  • Create 2D array of random float data
  • NumPy is full of useful tools; rand/savetxt
  • Can save binary data; only viewable with numpy though
  • Saving to txt makes data readable for everyone
  • Remember, too many ASCII files? Here it is again!
  • Keep this in mind, we'll look at it in a bit
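To make the text/binary trade-off concrete, the sketch below writes the same array both ways and reads it back. The file names and temporary directory are arbitrary choices. Note that the text file only keeps the five decimal places requested by `fmt`.

```python
import os
import tempfile

import numpy

rand_arr = numpy.random.rand(2, 2)
out_dir = tempfile.mkdtemp()
txt_path = os.path.join(out_dir, "test.out")
npy_path = os.path.join(out_dir, "test.npy")

# Text: readable by anyone, but rounded to 5 decimal places.
numpy.savetxt(txt_path, rand_arr, delimiter=" ", fmt="%1.5f",
              header="a b", comments="")

# Binary: exact values, but stored in NumPy's own .npy format.
numpy.save(npy_path, rand_arr)

from_txt = numpy.loadtxt(txt_path, skiprows=1)  # skip the 'a b' header
from_npy = numpy.load(npy_path)

print(numpy.array_equal(from_npy, rand_arr))
print(numpy.allclose(from_txt, rand_arr, atol=1e-5))
```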

Controller

Pandas

  • Fast read/write for SQL dbs, CSV, HDF5
  • 'Group by' and merge large data sets
  • Toolkit to unify NumPy/matplotlib
  • 'Replacement' for R
  • Popular in financial industry
  • R is open source statistical language
  • Built on numpy
  • 2 main data structures
    • Series -> 1d array with labels
    • DataFrame -> 2d array like SQL table/spreadsheet
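A short sketch of the two structures together with 'group by' and merge. The well names, pressures, and field names are invented sample data.

```python
import pandas

# Hypothetical measurements: several pressure readings per well.
wells = pandas.DataFrame({
    "well": ["A", "A", "B", "B"],
    "pressure": [10.0, 14.0, 20.0, 22.0],
})

# Hypothetical lookup table: one row per well.
locations = pandas.DataFrame({
    "well": ["A", "B"],
    "field": ["North", "South"],
})

# Grouping one column and aggregating returns a Series indexed by well.
mean_pressure = wells.groupby("well")["pressure"].mean()

# Merging works like a SQL join on the shared 'well' column.
merged = wells.merge(locations, on="well")

print(mean_pressure)
print(merged)
```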

Controller

Pandas

>>> import pandas
>>> pandas.read_csv('test.out',
...                 sep=r'\s+')
         a        b
0  0.93954  0.74496
1  0.12518  0.17269
  • read_csv reads in a DataFrame
  • Notice it handles our header line, pretty prints, labels
  • Pandas excels and got this right; easy to get existing data in

Controller

Pandas

  • File size: ~ 203KB (208052 bytes)
  • 26 columns
  • 1000 rows
  • pandas.read_csv: 0.56s
  • numpy.loadtxt: 2.35s
  • custom OrderedDict (10 lines): 1.4s
  • numpy.loadtxt into OrderedDict: 2.65s
  • Pandas doesn't just get data in easily; it's fast!
  • Lots of recent work to optimize this even more
  • Pandas heavily optimizes this with Cython
  • Cython is out of scope here, but it speeds up Python by adding static type declarations and compiling to C
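The original benchmark file is not available, so the following is only a sketch of how a comparison like this could be reproduced. It assumes a 1000 x 26 file of random floats written with the same `savetxt` settings as earlier. With "\n" line endings that layout produces a 208052-byte file, the same size quoted above. Timings depend on the machine and library versions.

```python
import os
import string
import tempfile
import time

import numpy
import pandas

columns = list(string.ascii_lowercase)  # 26 single-letter column labels
data = numpy.random.rand(1000, 26)
path = os.path.join(tempfile.mkdtemp(), "bench.out")
numpy.savetxt(path, data, fmt="%1.5f", header=" ".join(columns), comments="")

start = time.perf_counter()
frame = pandas.read_csv(path, sep=r"\s+")
pandas_seconds = time.perf_counter() - start

start = time.perf_counter()
array = numpy.loadtxt(path, skiprows=1)  # skip the header line
numpy_seconds = time.perf_counter() - start

print("file size: %d bytes" % os.path.getsize(path))
print("pandas.read_csv: %.4f s" % pandas_seconds)
print("numpy.loadtxt:   %.4f s" % numpy_seconds)
```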

Controller

Scipy

  • Stats
  • Integration
  • Matrices
  • Linear algebra
  • Scipy is huge collection of tools
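As one small example from the linear algebra part, `scipy.linalg.solve` solves a system A x = b. The 2x2 system below is an arbitrary illustration.

```python
import numpy
from scipy import linalg

# Solve:  3x +  y = 9
#          x + 2y = 8
A = numpy.array([[3.0, 1.0],
                 [1.0, 2.0]])
b = numpy.array([9.0, 8.0])

solution = linalg.solve(A, b)
print(solution)
```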

Controller

Scipy

>>> from scipy import integrate
>>> x2 = lambda x: x**2
>>> integrate.quad(x2, 0., 4.)
(21.333333333333332, 2.3684757858670003e-13)
  • Wish I knew about this in my calculus classes!
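`quad` returns the integral and an estimate of the absolute error. The sketch below checks the result against the hand calculation: the antiderivative of x**2 is x**3 / 3, which gives 64/3 between 0 and 4.

```python
from scipy import integrate


def square(x):
    return x ** 2


value, error_estimate = integrate.quad(square, 0.0, 4.0)

# Exact answer from the antiderivative x**3 / 3 evaluated from 0 to 4.
exact = 4.0 ** 3 / 3.0

print(value, exact, abs(value - exact))
```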

Sample app review

  • Model
    • HDF5/PyTables
  • View
    • PyQt/PyQwt
  • Controller
    • NumPy/Pandas/Scipy
  • Lots of tools introduced in short time, let's review

Deployment

  • pip/requirements.txt
  • PyInstaller
  • py2exe
  • Too many choices
  • pip is easy but could be tricky b/c end-users in my area don't like to get into the command line and install tools
  • Tough b/c users can always upgrade their own stuff and break things

PyInstaller

  • Package app into nice executable
  • Finds your dependencies automatically
  • Explicit support for PyQt/Django/matplotlib
  • Major improvements in PyInstaller 2.x
  • Customizable with hooks
  • Hook architecture to package in support for a custom app if needed

PyInstaller pitfalls

  • Can use hooks to tell about dynamic imports/sys.path issues
  • pyqtgraph is big offender of dynamic imports
  • Tries to dynamically import everything into a single namespace for convenience, but that turns out to be a huge pain
  • PyInstaller doesn't see dynamic imports
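To illustrate the pitfall, the sketch below imports modules by name at run time, which static analysis of `import` statements cannot detect. The module list and the hook file name are hypothetical; standard-library modules stand in so the sketch runs as written. A PyInstaller hook file can list such modules in its `hiddenimports` variable.

```python
import importlib

# Hypothetical plugin list; in a real app these names might come
# from a config file or a directory scan.
PLUGIN_MODULES = ["json", "csv"]


def load_plugins(names):
    """Import modules by name at run time."""
    return {name: importlib.import_module(name) for name in names}


plugins = load_plugins(PLUGIN_MODULES)

# What a hook file (e.g. hook-myapp.py, name hypothetical) would declare
# so the packager bundles the modules it cannot discover on its own:
hiddenimports = list(PLUGIN_MODULES)

print(sorted(plugins))
```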

Demo!

Practice!

  • Created a sample app with problems to fix
  • Sample oil data from data.gov
  • Best way to learn is to have a problem and try to fix it
  • Go fork project and practice
  • Go to 'Practice' section

Links

  • Lots of links, worth exploring if you're interested
  • Several good companies in the space, Enthought and Continuum Analytics
  • PyCon