
Sumatra introduction

July 08, 2011

Sumatra is a very nice tool for doing reproducible research, or simply for managing numerical experiments in an efficient way.

There is already very well written documentation with a getting started section. It assumes that the reader is familiar with revision control software, which is also very well covered on the web (see for instance the Mercurial quick start page).

But I would like to show people unfamiliar with this way of doing numerical experiments how simple Sumatra really is to use. I will illustrate this with a script taken from tamasis-map, which performs map-making for the Herschel Space Observatory, and in particular its PACS instrument.

A simple script to perform map-making looks like this:

#!/usr/bin/env python

import os
import tamasis as tm

# PACS data filename
filename = 'frames_blue.fits'
# load metadata
obs = tm.PacsObservation(filename=filename)
# load data
tod = obs.get_tod(flatfielding=False)
# define instrument model
projection   = tm.Projection(obs, resolution=3.2, oversampling=False,
                            npixels_per_sample=6)
masking_tod  = tm.Masking(tod.mask)
model = masking_tod * projection
# perform map-making
map_rls = tm.mapper_rls(tod, model, hyper=1., tol=1.e-4)
# save result
map_rls.save("map.fits")

This script consists of a few simple functions parametrized by a small number of keywords. It takes a FITS file as input and outputs another FITS file. The script can be executed from the command line, but no arguments can be passed to it.

This is not what Sumatra expects. Sumatra expects a script which accepts a configuration file defining the parameters of the numerical experiment. The configuration file is passed to the script on the command line and parsed by it.

So we need to rewrite our map-making script accordingly. We can separate it into two parts: one part parses the configuration file, the other performs the actual map-making. The parsing can be done, for instance, with the ConfigParser module.
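
If you have never used ConfigParser before, here is a minimal sketch of how it reads an INI-style file (the file name and the options shown are only illustrative):

import ConfigParser

config = ConfigParser.RawConfigParser()
config.read("example.cfg")
# options are read as raw strings...
filenames = config.get("main", "filenames")
# ...or recast on the fly with the typed accessors
oversampling = config.getboolean("Projection", "oversampling")

The parsing part of our map-making script then looks like this: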

#!/usr/bin/env python

options = "hvf:o:"
long_options = ["help", "verbose", "filenames=", "output="]

def main():
    import os, sys, getopt, ConfigParser, time
    # parse command line arguments
    try:
        opts, args = getopt.getopt(sys.argv[1:], options, long_options)
    except getopt.GetoptError as err:
        print(str(err))
        usage()
        sys.exit(2)

    # default
    verbose = False

    # parse options
    for o, a in opts:
        if o in ("-h", "--help"):
            usage()
            sys.exit()
        if o in ("-v", "--verbose"):
            verbose = True

    # read config file
    if len(args) == 0:
        print("Error: config filename argument is mandatory.\n")
        usage()
        sys.exit(2)
    config_file = args[0]
    config = ConfigParser.RawConfigParser()
    config.read(config_file)
    keywords = dict()
    sections = config.sections()
    sections.remove("main")
    # parse config
    for section in sections:
        keywords[section] = dict()
        for option in config.options(section):
            get = config.get
            # recast to bool, int or float if needed
            if section == "PacsObservation":
                if option == "reject_bad_line":
                    get = config.getboolean
                if option == "fine_sampling_factor":
                    get = config.getint
                if option in ("active_fraction",
                              "delay",
                              "calblock_extension_time"):
                    get = config.getfloat
            if section == "get_tod":
                if option in ("flatfielding",
                              "substraction_mean",
                              "raw"):
                    get = config.getboolean
            if section == "Projection":
                if option in ("oversampling",
                              "packed"):
                    get = config.getboolean
                if option == "npixels_per_sample":
                    get = config.getint
                if option == "resolution":
                    get = config.getfloat
            if section == "mapper_rls":
                if option in ("verbose",
                              "criterion"):
                    get = config.getboolean
                if option == "maxiter":
                    get = config.getint
                if option in ("hyper",
                              "tol"):
                    get = config.getfloat
            # store option using the appropriate get to recast option.
            keywords[section][option] = get(section, option)
    # special case for the main section
    data_file_list = config.get("main", "filenames").split(",")
    # if filenames argument is passed, override config file value.
    for o, a in opts:
        if o in ("-f", "--filenames"):
            data_file_list = a.split(",")
    data_file_list = [w.strip() for w in data_file_list]
    # append date string to the output file to distinguish results.
    date = time.strftime("%y%m%d_%H%M%S", time.gmtime())
    filename = data_file_list[0].split(os.sep)[-1]
    # remove extension
    fname = ".".join(filename.split(".")[:-1])
    # store results into the Data subdirectory as expected by sumatra
    output_file = "Data/map" + fname + '_' + date + '.fits'
    # if output argument is passed, override config file value.
    for o, a in opts:
        if o in ("-o", "--output"):
            output_file = a
    # run tamasis mapper
    pipeline_rls(data_file_list, output_file, keywords, verbose=verbose)

# to call from command line
if __name__ == "__main__":
    main()

This is a lot of code, but the longest part is only there to define the data types of the arguments. The idea here is that each section of the configuration file corresponds to a step of the map-making script. A dictionary storing the parameters of each step is filled from the values of the corresponding section.
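
To make this more concrete, here is roughly the keywords dictionary that would result from parsing the configuration file shown further below (only part of the options is reproduced):

keywords = {
    "PacsObservation": {"fine_sampling_factor": 1,
                        "reject_bad_line": False,
                        "policy_bad_detector": "mask"},
    "get_tod": {"flatfielding": True,
                "subtraction_mean": True,
                "raw": False},
    "Projection": {"oversampling": True,
                   "packed": False},
    "mapper_rls": {"hyper": 1.0,
                   "tol": 1.e-5,
                   "maxiter": 300},
}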

Now the map-making script can be rewritten as a function accepting a dictionary of dictionaries, as follows:

def pipeline_rls(filenames, output_file, keywords, verbose=False):
    """
    Perform regularized least-square inversion of a simple model (tod
    mask and projection). Processing steps are as follows:
    
        - define PacsObservation instance
        - get Time Ordered Data (get_tod)
        - define projection
        - perform inversion on model
        - save file

    Arguments
    ---------
    filenames: list of strings
        List of data filenames.
    output_file : string
        Name of the output fits file.
    keywords: dict
        Dictionary containing the options of each step as sub-dictionaries.
    verbose: boolean (default False)
        Set verbosity.

    Returns
    -------
    Returns nothing. Saves the result as a FITS file.
    """
    from scipy.sparse.linalg import cgs
    import tamasis as tm
    # verbosity
    keywords["mapper_rls"]["verbose"] = verbose
    # define observation
    obs_keys = keywords.get("PacsObservation", {})
    obs = tm.PacsObservation(filenames, **obs_keys)
    # get data
    tod_keys = keywords.get("get_tod", {})
    tod = obs.get_tod(**tod_keys)
    # define projector
    proj_keys = keywords.get("Projection", {})
    projection = tm.Projection(obs, **proj_keys)
    # define mask
    masking_tod = tm.Masking(tod.mask)
    # define full model
    model = masking_tod * projection
    # perform map-making inversion
    mapper_keys = keywords.get("mapper_rls", {})
    map_rls = tm.mapper_rls(tod, model, solver=cgs, **mapper_keys)
    # save
    map_rls.save(output_file)

It is also easy to add some documentation to the script:

def usage():
    print(__usage__)

__usage__ = """Usage: pacs_rls [options] [config_file]

Use the tamasis regularized least-square map-making routine
with the configuration file [config_file].

[config_file] is the name of the configuration file. This file
contains arguments to the various processing steps of the
map-maker. For an example, see the file tamasis_rls.cfg in the
tamasis/pacs/src/ directory.

Options:
  -h, --help        Show this help message and exit
  -v, --verbose     Print status messages to std output.
  -f, --filenames   Overrides the filenames value of the config file.
  -o, --output      Overrides the default output filename.
"""

The configuration file would look like this:

[main]
filenames=frames_blue.fits

[PacsObservation]
fine_sampling_factor=1
policy_bad_detector=mask
reject_bad_line=False
policy_inscan=keep
policy_turnaround=keep
policy_other=remove
policy_invalid=mask
active_fraction=0
delay=0.0

[get_tod]
unit=Jy/detector
flatfielding=True
subtraction_mean=True
raw=False
masks=activated

[Projection]
#resolution=3.2
#npixels_per_sample=6
oversampling=True
packed=False

[mapper_rls]
hyper=1.0
tol=1.e-5
maxiter=300

With this code, you can run the script from the command line:

pacs_rls.py pacs_rls.cfg

Or, using Sumatra:

smt init pacs_rls
smt run --executable=python --main=pacs_rls.py pacs_rls.cfg

Note that the files you use with Sumatra should be under revision control, with Mercurial for instance. This can be done like this:

hg init
hg add pacs_rls.py pacs_rls.cfg
hg commit

Also, you should have a Data subdirectory to store results.
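
This is the directory our script writes its output file to; it can be created once with:

mkdir Data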

Finally, you can take a look at the results of your experiments using the built-in web interface:

smtweb

You can also look at the results from the command line:

smt list

One very nice feature of Sumatra is the possibility to override configuration file parameters from the command line:

smt run --executable=python --main=pacs_rls.py mapper_rls.hyper=0.1 pacs_rls.cfg

The power of the numpy dtype keyword

July 08, 2011

One of Numpy's most useful features is its ability to very easily define arrays with a wide choice of data types. For instance, this makes it easy to test an algorithm's robustness against error propagation due to floating-point precision. You could write something like this:

import numpy as np

# generate arrays with different data types
ashape = (16,)
dtypes = (np.float16, np.float32, np.float64)
arrays = [np.ones(ashape, dtype=dtype) for dtype in dtypes]
# test the algorithm (any function of an ndarray) against the data types
out_arrays = [my_algorithm(arr) for arr in arrays]
# measure the difference
diff16 = np.sum(np.abs(out_arrays[0] - out_arrays[-1]))
diff32 = np.sum(np.abs(out_arrays[1] - out_arrays[-1]))
print("cumulated difference 16bits vs 64 bits %f" % diff16)
print("cumulated difference 32bits vs 64 bits %f" % diff32)

Changing the data dtype from 64 bits to 32 bits can save your code a factor of 2 in memory usage, which can be very important for data-intensive applications. Numpy even provides a 16-bit float, which can be useful if precision requirements are not very high.
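
As a quick check of these memory savings, one can compare the nbytes attribute of arrays with different dtypes:

import numpy as np

a64 = np.ones((1024, 1024), dtype=np.float64)
a32 = np.ones((1024, 1024), dtype=np.float32)
a16 = np.ones((1024, 1024), dtype=np.float16)
# nbytes gives the size of the data buffer in bytes
print(a64.nbytes)  # 8388608 (8 MB)
print(a32.nbytes)  # 4194304, half of float64
print(a16.nbytes)  # 2097152, a quarter of float64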

Matplotlib slider to display 3d arrays

July 08, 2011

Here is a small function which shows how one can display slices of a 3d array using matplotlib's imshow and Slider widgets. I give it here as I did not see this example anywhere else.

def cube_show_slider(cube, axis=2, **kwargs):
    """
    Display a 3d ndarray with a slider to move along the third dimension.

    Extra keyword arguments are passed to imshow
    """
    import matplotlib.pyplot as plt
    from matplotlib.widgets import Slider

    # check dim
    if not cube.ndim == 3:
        raise ValueError("cube should be an ndarray with ndim == 3")

    # generate figure
    fig = plt.figure()
    ax = plt.subplot(111)
    fig.subplots_adjust(left=0.25, bottom=0.25)

    # select first image
    s = [slice(0, 1) if i == axis else slice(None) for i in xrange(3)]
    im = cube[tuple(s)].squeeze()

    # display image
    l = ax.imshow(im, **kwargs)

    # define slider
    axcolor = 'lightgoldenrodyellow'
    ax_slider = fig.add_axes([0.25, 0.1, 0.65, 0.03], axisbg=axcolor)

    slider = Slider(ax_slider, 'Axis %i index' % axis, 0, cube.shape[axis] - 1,
                    valinit=0, valfmt='%i')

    def update(val):
        ind = int(slider.val)
        s = [slice(ind, ind + 1) if i == axis else slice(None)
             for i in xrange(3)]
        im = cube[tuple(s)].squeeze()
        l.set_data(im)  # set_data accepts no extra keyword arguments
        fig.canvas.draw()

    slider.on_changed(update)

    plt.show()
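
For instance, to try it on random data (the cmap keyword is simply passed through to imshow):

import numpy as np

cube = np.random.rand(32, 32, 100)
cube_show_slider(cube, axis=2, cmap="gray")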

