Phylogenetics and Rmpi – An Example

Better parallel than serial

Together with my colleague Greg W., I implemented the usage of multiple CPUs in one of my R projects. I hereby used the R library “Rmpi” for the coordination of parallel processes. Rmpi is an R wrapper for the MPI communications protocol. The speed gains that my code execution is experiencing through this process parallelization is considerable, and it may, hence, be of interest to a wider audience. I therefore present here a small example of using Rmpi in phylogenetic investigations.

First, let us set up a set of calculations that warrant process parallelization. In a maximum parsimony tree search, for example, multiple starting trees may wished to be evaluated. A parsimony ratchet can be executed in parallel on these starting trees, as the runs are independent of each another.

library(ape)
library(phangorn)
data = read.phyDat("alignment.phy")     # Alignment of 47 taxa with 1,000,000 bp each
trees = read.tree("trees.tre")          # Set of 500 phylogenetic trees (the starting trees)
myFunc = function(tree,data) {
  return(phangorn::pratchet(data,start=tree))
}

Second, let us execute the parsimony ratchet serially (where an MP analysis is executed with the first starting tree, then the second, then the third, etc.). As we can see, MP inference performed serially on my data set takes roughly 10 min.

tme=proc.time()
sout=lapply(trees,myFunc,data)
proc.time()-tme
   user  system elapsed 
584.513   1.070 584.977

Third, let us execute the parsimony ratchet in parallel, using the maximum number of CPUs available on my system (nproc = 8). As we can see, the MP inference performed in parallel on my data set takes roughly 4 min.

# Infer number of processors (i.e. CPUs) on system
nproc = as.numeric(system("nproc",intern=T))
# Specify one master and nproc-1 slaves
mpi.spawn.Rslaves(nslaves=nproc-1)

# Pass various functions to the Rmpi slaves
mpi.bcast.Robj2slave(is.rooted)
mpi.bcast.Robj2slave(is.binary.tree)
mpi.bcast.Robj2slave(unroot)
mpi.bcast.Robj2slave(stree)

tme=proc.time()
# Parallel apply the parsimony ratchet function on all starting trees;
# use load balancing for CPU assignment
pout=mpi.parLapply(trees,myFunc,data)
proc.time()-tme

# Close slaves
mpi.close.Rslaves()
user.self   sys.self    elapsed user.child  sys.child 
  129.244     93.736    224.178      0.000      0.000

Utilizing all 8 CPUs on my computer (one for the master, the rest for the slave processes) is not necessarily the most efficient parallelization. Spawning each slave process requires a certain amount of CPU resources, and these resources may be larger than the benefit of the added calculation speed (i.e. law of diminishing returns). The following line graph visualizes this issue.

Diminishing returns when using multiple CPUs in R script

Diminishing returns when using multiple CPUs in R script

Reducing file sizes of pdf-documents

A handy command for high-resolution documents

Many of you may have experienced it too: You review a manuscript and make numerous notes in the page margins. You then scan-in your notes, but receive a pdf-file too large for submission to the Editorial Manager website. Here’s what I do to solve this issue: I use the PostScript language interpreter “ghostscript” to reduce the resolution and, by extension, the file size of said document.

gs -dBATCH -sDEVICE=pdfwrite -dNOPAUSE -dPDFSETTINGS=/ebook -sOutputFile=outfile.pdf infile.pdf

The output resolution is controlled by option “dPDFSETTINGS”. For example, value “/ebook” results in 150 dpi, value “/printer” in 300 dpi.

Running bioinformatic errands

Grocery shopping, laundry and #!/usr/bin/python

For a simulation study, I need to convert a concatenated list of NEXUS files (whereby length(list) = x)  into a single NEXUS file with x number of partitions. A single NEXUS file with x number of partitions is the input required by the Bayesian evolutionary analysis software BEAST.

How do I solve this problem? Answer: By using the brilliant BioPython package and some custom Python code.

STEP 1: Loading of concatenated NEXUS blocks; see https://github.com/michaelgruenstaeudl for custom libraries

#!/usr/bin/python
import sys
sys.path.insert(0,"/home/michael/git/ScienceScripts/")
import GeneralFileOperations as GFO
import GeneralStringOperations as GSO
ind = GFO.loadR("sim.10.nex")

STEP 2: Parsing individual NEXUS blocks

from Bio.Nexus import Nexus
ind = GSO.csplit(ind,"#NEXUS")
outd = [(GSO.exstr(ds,"[","]").replace(" ",""), Nexus.Nexus(ds)) for ds in ind]

STEP 3: Combining individual NEXUS blocks to partitions; saving to file

combined = Nexus.combine(outd)
combined.write_nexus_data(filename="sim.10.genes.nex")

STEP 4: Replacing “begin sets;” with “begin mybayes;

#!/bin/bash
sed -i 's/begin sets;/begin mrbayes;/' sim.10.genes.nex

Simple example on debugging in R

95% of a programmer’s day is spent debugging

R has nice functions to flash-freeze your code execution and do some debugging by accessing local variables. I wrote the following example to illustrate R’s debugging capabilities for a colleague of mine.

Let us first set up a function which will break during execution. Said function attempts to rpint column 4 of a matrix that only has three columns.

willBreak = function (FillInteger)
{
# VALID; Perform a print operation
cat("Hello", "World!")
# VALID; Set up a matrix of predefined dimensions (3 by 3)
aMatrix = matrix(FillInteger,nrow=3,ncol=3)
# VALID; Print first row of matrix
print(aMatrix[1,])
# INVALID; Print fourth column of matrix
# Will break, because subscript out of bound
print(aMatrix[,4])
}

Let us now load that broken function into “mtrace” and also call it. A debugging terminal will appear, where each line is initialize by “D(1)” .

library(debug)
mtrace(willBreak)
willBreak(1)

D(1)>

We can cycle through each line of code by hitting enter until we reach the the code exception, i.e. where the fourth column of a matrix with ncols=3 is attempted to be displayed.

D(1)> 
Hello World!
NULL

D(1)> 
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1
[3,]    1    1    1

D(1)> 
[1] 1 1 1

D(1)> 
Error in aMatrix[, 4] : subscript out of bounds

Having the benefit of a debugging terminal, we can now access the variables in question directly, add the missing fourth column to variable “aMatrix” and correctly finish code execution.

D(1)> # Correcting the code interactively

D(1)> # Setting up a matrix column

D(1)> aColumn = matrix(3,nrow=3,ncol=1)
     [,1]
[1,]    3
[2,]    3
[3,]    3

D(1)> # Appending the column to original matrix

D(1)> aMatrix = cbind(aMatrix,aColumn)
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    3
[2,]    1    1    1    3
[3,]    1    1    1    3

D(1)> aMatrix[ ,4]
[1] 3 3 3

Evolution Conference in Raleigh, NC

This is why I do science.

Wow … is all I can say. This year’s annual meeting of evolutionary biologists from across the globe in the city of Raleigh, NC, was amazing. Everyone who attended this conference knows what I mean.

Evolution 2014 in Raleigh, NC

Evolution 2014 in Raleigh, NC

Organized by North Carolina State (Go Wolfpacks!), the past seven days were simply perfect. The conference was held in the city’s state-of-the-art conference center. The quality of scientific talks and subsequent discussions was superb. (I do not remember a single uninspiring talk!) The pre- and post-conference workshops were excellent. Our apartments at the NC State campus were brand-new. Attendees were oozing smiles and hugs. The atmosphere during the social mixers was like reuniting a big family.

It is those experiences that remind me what an amazing community we evolutionary biologists have.

Michael Gruenstaeudl at Evolution 2014

Michael Gruenstaeudl at Evolution 2014

The flowchart – a scientist’s best friend

Don’t  leave Earth  do complex data analysis without it, Douglas Adams would have said.

Some of my friends who are not in academia often ask me how scientists keep track of all the steps involved in a complex analysis. The answer: We use helper tools like anyone else. Below is a flowchart that I created today to visualize input and output of a multi-step analysis.

Flowchart of a posterior predictive simulation

Flowchart of a posterior predictive simulation

Tree format conversion in R

From the life of a biologist/bioinformatician

In my Inbox on Monday morning:

Hi Michael, do you know how can I convert a tree file in nexus format to a phylip format file, using R? Best, A.

Sure. Here’s how you do it:

> library(phybase)
>
> tree.string.example = "#NEXUS\nBegin trees;\n\tTranslate\n\t\t1 a,\n\t\t2 b,\n\t\t3 c,\n\t\t4 d,\n\t\t5 e,\n\t\t6 f,\n\t\t7 g\n;\ntree TEST =((4:0.733,((7:0.004,1:0.004):0.287,2:0.292):0.440):0.0596,(3:0.509,(6:0.112,5:0.112):0.396):0.283);\nEnd;"
>
> cat(tree.string.example, file="test.nex")
>
> tree.nexus = phybase::read.tree.string("test.nex", format="nexus")
>
> tree.nexus
$tree
[1] "((4:0.733,((7:0.004,1:0.004):0.287,2:0.292):0.440):0.0596,(3:0.509,(6:0.112,5:0.112):0.396):0.283);"

$names
[1] "a" "b" "c" "d" "e" "f" "g"

$root
[1] TRUE

> phybase::write.tree.string(tree.nexus$tree, file="test.phy", format="phylip")
>

Cropping photographs with ImageMagick

How to handle your digital scissors.

Generating photographic evidence is an important part of being a biologists. However, not all photographs taken can be published as-is. Instead, it may be necessary to crop an image in order to highlight a certain section. Most people would use a default image editor for cropping images, which are by and large commanded via a graphical user interface (GUI). Adobe Photoshop, GIMP or pixlr.com come to mind. For the task of re-sizing or cropping an image  it may more precise to use a command-line image editor, however.

My personal favorite among command-line image editors is ImageMagick. As with any good Linux software, there is an initial learning curve. However, once you are familiar with its basic usage, you never want to switch back. Hence, here is a simple example to help you make the transition.

I have taken the following photograph on a field trip to the Canary Islands, Spain in 2009. It displays the capitula of Tolpis laciniata (Sch.Bip. ex Webb & Berthel.) Webb, an island plant in the sunflower family (Asteraceae). This image was used on the cover page of the scientific journal Cladistics (Volume 29, Issue 4), which featured my scientific article on the genus Tolpis Adans.

Capitula of Tolpis laciniata

Capitula of Tolpis laciniata

For an upcoming lecture, I would like to crop this image so that only the top left capitulum is visible. To that end, I first measure the dimensions of the image via the command “identify”:

michael ~ $ identify Tolpis_laciniata.JPG 
Tolpis_laciniata.JPG JPEG 1779x1860 1779x1860+0+0 8-bit sRGB 2.072MB 0.120u 0:00.130

As we can see, the image has a width of 1779 px and a height of 1860 px. Now, I will crop the image to its upper-left quadrant via command “convert”. The basic usage of convert is:

convert  <inputfile>  -crop  <width>x<height>+<offset x-axis>+<offset y-axis>  <outputfile>

The offset values are counted from the top left corner, and I estimate them to be 100 px each.

michael ~ $ convert Tolpis_laciniata.JPG -crop 890x930+100+100 Tolpis_laciniata.cropped.JPG

Voilà.

One capitulum of Tolpis laciniata

One capitulum of Tolpis laciniata

Debugging with PDB – A hands-on example

A must-have in the Python tool box of every “digital entomologist”.

You have some complex code. It breaks during execution. Via the traceback you know the approximate location of the problem in your code. You modify the code. It breaks again. Modify, break, modify, break, … until you eventually find the issue and move on. What did Albert Einstein say again about repetitive behavior?

There is a more focused approach to debugging in Python: the pdb module. This little Python module allows you to insert a special command in your code, which will function as a breakpoint and provide you with pseudo-terminal access to variables and functions at the time of breakage. It acts like flash-freezing the code execution, just that you can access variables and functions in your code.

Let us look at an example. The following definition will break because the code attempts to concatenate strings and integers.

def will_break():
    # Variable 1 is a string
    variable1 = "Hello"
    # Variable 2 is a list
    variable2 = ["1", "World", "3", "!"]
    # Variable 3 is an integer
    variable3 = 2
    # Generate string from variables
    output = variable1 + variable2[1] + variable3 + variable2[3]
    # Print variable output
    print output
>>> will_break()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 9, in will_break
TypeError: cannot concatenate 'str' and 'int' objects

Let us now include the pdb break statement “pdb.set_trace()” immediately prior to line 9 as suggested by traceback. Line 9 is a comment line, so I will actually place the pdb break statement prior to the last genuine code line located before line 9.

import pdb
def for_debugging():
    # Variable 1 is a string
    variable1 = "Hello"
    # Variable 2 is a list
    variable2 = ["1", "World", "3", "!"]
    # Variable 3 is an integer
    variable3 = 2
    # Insertion of the pdb breakpoint
    pdb.set_trace()
    # Generate string from variables
    output = variable1 + variable2[1] + variable3 + variable2[3]
    # Print variable output
    print output

When executing the above code, we are supplied with a primitive terminal, the pdb shell. This terminal starts with “(Pdb)” on every line, has it’s own syntax, and every command must be prepended with an exclamation mark (e.g. “!print”). Commenting does not work in this terminal.

>>> for_debugging
> <ipython-input-3-ea7022cd72b7>(12)for_debugging()
-> output = variable1 + variable2[1] + variable3 + variable2[3]
(Pdb) 
(Pdb) p variable3
2
(Pdb)
(Pdb) ![type(x) for x in [variable1, variable2, variable3]]
[<type 'str'>, <type 'list'>, <type 'int'>]
(Pdb) !# Don't do anything.
*** SyntaxError: unexpected EOF while parsing (<stdin>, line 1)

To circumvent the restrictions of the pdb shell, we can launch a temporary interactive Python session from within pdb, so that we have the full functionality of Python back and all the local variables available.

(Pdb) !import code; code.interact(local=vars())

Python 2.7.6 (default, Feb 26 2014, 12:07:17) 
[GCC 4.8.2 20140206 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)

In : type(variable2)
<type 'list'>

In : # Don't do anything.

With the information gained from our pdb-enclosed interactive Python session, we can modify the code in a much more cognizant and directed way than mere meddling with the original code. Hence, let us correct the original code; see the change in line 10.

def will_work():
    # Variable 1 is a string
    variable1 = "Hello"
    # Variable 2 is a list
    variable2 = ["1", "World", "3", "!"]
    # Variable 3 is an integer
    variable3 = 2
    # Insertion of pdb statement
    #pdb.set_trace()
    # Generate string from variables
    output = variable1 + variable2[1] + str(variable3) + variable2[3]
    # Print variable output
    print output

Simple example on using rPython – Part 2

The second part of a simple rPython tutorial.

In the previous blog post we left off with a Python dictionary created from within R via rPython. Let us now add another key-value pair to the dictionary. I use the command “python.assign()” to push strings to, but not to pull strings from, Python. To retrieve the updated Python dictionary, I instead use “python.get()”.

# Assigning a variable in R
> homecountry = "Austria"
# Adding a dictionary entry
> python.assign("capcities['at']", homecountry)
# Retrieving the updated dictionary from Python
> abbrev_dict_updated = python.get("capcities")
> abbrev_dict_updated
        at         sk         hu         de 
 "Austria" "Slovakia"  "Hungary"  "Germany"

Note that using “python.assign()” to pull strings from Python into R does not work out of the box. I suspect that the reason for this deficiency of rPython is that the package has not been developed enough to handle more complex strings.

# Converting the Python dictionary into a Python string
> python.exec("capcities_string = ', '.join('%s: %s' % (key,
value) for (key,value) in capcities.items())")
# Initializing a string in R
> abbrev_dict_updated = ""
# Assigning a variable in R with a string from Python
> python.assign(abbrev_dict_updated, "capcities_string")
  File "<string>", line 2
    =' [ "capcities_string" ] '
    ^
SyntaxError: invalid syntax

Let us now utilize the power of Python and perform a more complex operation on the dictionary in Python (e.g. sorting the dictionary by value, utilizing some Python libraries). I will then import the ordered dictionary into R.

> python.exec("from collections import OrderedDict")
> python.exec("from operator import itemgetter")
> python.exec("abbrev_dict_sorted = OrderedDict(sorted(
capcities.items(), key=itemgetter(1)))")
> abbrev_dict_sorted = python.get("abbrev_dict_sorted")
> abbrev_dict_sorted
        at         de         hu         sk 
 "Austria"  "Germany"  "Hungary" "Slovakia"