Reducing file sizes of pdf-documents

A handy command for high-resolution documents

Many of you may have experienced it too: You review a manuscript and make numerous notes in the page margins. You then scan-in your notes, but receive a pdf-file too large for submission to the Editorial Manager website. Here’s what I do to solve this issue: I use the PostScript language interpreter “ghostscript” to reduce the resolution and, by extension, the file size of said document.

gs -dBATCH -sDEVICE=pdfwrite -dNOPAUSE -dPDFSETTINGS=/ebook -sOutputFile=outfile.pdf infile.pdf

The output resolution is controlled by option “dPDFSETTINGS”. For example, value “/ebook” results in 150 dpi, value “/printer” in 300 dpi.

Running bioinformatic errands

Grocery shopping, laundry and #!/usr/bin/python

For a simulation study, I need to convert a concatenated list of NEXUS files (whereby length(list) = x)  into a single NEXUS file with x number of partitions. A single NEXUS file with x number of partitions is the input required by the Bayesian evolutionary analysis software BEAST.

How do I solve this problem? Answer: By using the brilliant BioPython package and some custom Python code.

STEP 1: Loading of concatenated NEXUS blocks; see https://github.com/michaelgruenstaeudl for custom libraries

#!/usr/bin/python
import sys
sys.path.insert(0,"/home/michael/git/ScienceScripts/")
import GeneralFileOperations as GFO
import GeneralStringOperations as GSO
ind = GFO.loadR("sim.10.nex")

STEP 2: Parsing individual NEXUS blocks

from Bio.Nexus import Nexus
ind = GSO.csplit(ind,"#NEXUS")
outd = [(GSO.exstr(ds,"[","]").replace(" ",""), Nexus.Nexus(ds)) for ds in ind]

STEP 3: Combining individual NEXUS blocks to partitions; saving to file

combined = Nexus.combine(outd)
combined.write_nexus_data(filename="sim.10.genes.nex")

STEP 4: Replacing “begin sets;” with “begin mybayes;

#!/bin/bash
sed -i 's/begin sets;/begin mrbayes;/' sim.10.genes.nex

Simple example on debugging in R

95% of a programmer’s day is spent debugging

R has nice functions to flash-freeze your code execution and do some debugging by accessing local variables. I wrote the following example to illustrate R’s debugging capabilities for a colleague of mine.

Let us first set up a function which will break during execution. Said function attempts to rpint column 4 of a matrix that only has three columns.

willBreak = function (FillInteger)
{
# VALID; Perform a print operation
cat("Hello", "World!")
# VALID; Set up a matrix of predefined dimensions (3 by 3)
aMatrix = matrix(FillInteger,nrow=3,ncol=3)
# VALID; Print first row of matrix
print(aMatrix[1,])
# INVALID; Print fourth column of matrix
# Will break, because subscript out of bound
print(aMatrix[,4])
}

Let us now load that broken function into “mtrace” and also call it. A debugging terminal will appear, where each line is initialize by “D(1)” .

library(debug)
mtrace(willBreak)
willBreak(1)

D(1)>

We can cycle through each line of code by hitting enter until we reach the the code exception, i.e. where the fourth column of a matrix with ncols=3 is attempted to be displayed.

D(1)> 
Hello World!
NULL

D(1)> 
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1
[3,]    1    1    1

D(1)> 
[1] 1 1 1

D(1)> 
Error in aMatrix[, 4] : subscript out of bounds

Having the benefit of a debugging terminal, we can now access the variables in question directly, add the missing fourth column to variable “aMatrix” and correctly finish code execution.

D(1)> # Correcting the code interactively

D(1)> # Setting up a matrix column

D(1)> aColumn = matrix(3,nrow=3,ncol=1)
     [,1]
[1,]    3
[2,]    3
[3,]    3

D(1)> # Appending the column to original matrix

D(1)> aMatrix = cbind(aMatrix,aColumn)
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    3
[2,]    1    1    1    3
[3,]    1    1    1    3

D(1)> aMatrix[ ,4]
[1] 3 3 3

Evolution Conference in Raleigh, NC

This is why I do science.

Wow … is all I can say. This year’s annual meeting of evolutionary biologists from across the globe in the city of Raleigh, NC, was amazing. Everyone who attended this conference knows what I mean.

Evolution 2014 in Raleigh, NC

Evolution 2014 in Raleigh, NC

Organized by North Carolina State (Go Wolfpacks!), the past seven days were simply perfect. The conference was held in the city’s state-of-the-art conference center. The quality of scientific talks and subsequent discussions was superb. (I do not remember a single uninspiring talk!) The pre- and post-conference workshops were excellent. Our apartments at the NC State campus were brand-new. Attendees were oozing smiles and hugs. The atmosphere during the social mixers was like reuniting a big family.

It is those experiences that remind me what an amazing community we evolutionary biologists have.

Michael Gruenstaeudl at Evolution 2014

Michael Gruenstaeudl at Evolution 2014

The flowchart – a scientist’s best friend

Don’t  leave Earth  do complex data analysis without it, Douglas Adams would have said.

Some of my friends who are not in academia often ask me how scientists keep track of all the steps involved in a complex analysis. The answer: We use helper tools like anyone else. Below is a flowchart that I created today to visualize input and output of a multi-step analysis.

Flowchart of a posterior predictive simulation

Flowchart of a posterior predictive simulation

Tree format conversion in R

From the life of a biologist/bioinformatician

In my Inbox on Monday morning:

Hi Michael, do you know how can I convert a tree file in nexus format to a phylip format file, using R? Best, A.

Sure. Here’s how you do it:

> library(phybase)
>
> tree.string.example = "#NEXUS\nBegin trees;\n\tTranslate\n\t\t1 a,\n\t\t2 b,\n\t\t3 c,\n\t\t4 d,\n\t\t5 e,\n\t\t6 f,\n\t\t7 g\n;\ntree TEST =((4:0.733,((7:0.004,1:0.004):0.287,2:0.292):0.440):0.0596,(3:0.509,(6:0.112,5:0.112):0.396):0.283);\nEnd;"
>
> cat(tree.string.example, file="test.nex")
>
> tree.nexus = phybase::read.tree.string("test.nex", format="nexus")
>
> tree.nexus
$tree
[1] "((4:0.733,((7:0.004,1:0.004):0.287,2:0.292):0.440):0.0596,(3:0.509,(6:0.112,5:0.112):0.396):0.283);"

$names
[1] "a" "b" "c" "d" "e" "f" "g"

$root
[1] TRUE

> phybase::write.tree.string(tree.nexus$tree, file="test.phy", format="phylip")
>

Cropping photographs with ImageMagick

How to handle your digital scissors.

Generating photographic evidence is an important part of being a biologists. However, not all photographs taken can be published as-is. Instead, it may be necessary to crop an image in order to highlight a certain section. Most people would use a default image editor for cropping images, which are by and large commanded via a graphical user interface (GUI). Adobe Photoshop, GIMP or pixlr.com come to mind. For the task of re-sizing or cropping an image  it may more precise to use a command-line image editor, however.

My personal favorite among command-line image editors is ImageMagick. As with any good Linux software, there is an initial learning curve. However, once you are familiar with its basic usage, you never want to switch back. Hence, here is a simple example to help you make the transition.

I have taken the following photograph on a field trip to the Canary Islands, Spain in 2009. It displays the capitula of Tolpis laciniata (Sch.Bip. ex Webb & Berthel.) Webb, an island plant in the sunflower family (Asteraceae). This image was used on the cover page of the scientific journal Cladistics (Volume 29, Issue 4), which featured my scientific article on the genus Tolpis Adans.

Capitula of Tolpis laciniata

Capitula of Tolpis laciniata

For an upcoming lecture, I would like to crop this image so that only the top left capitulum is visible. To that end, I first measure the dimensions of the image via the command “identify”:

michael ~ $ identify Tolpis_laciniata.JPG 
Tolpis_laciniata.JPG JPEG 1779x1860 1779x1860+0+0 8-bit sRGB 2.072MB 0.120u 0:00.130

As we can see, the image has a width of 1779 px and a height of 1860 px. Now, I will crop the image to its upper-left quadrant via command “convert”. The basic usage of convert is:

convert  <inputfile>  -crop  <width>x<height>+<offset x-axis>+<offset y-axis>  <outputfile>

The offset values are counted from the top left corner, and I estimate them to be 100 px each.

michael ~ $ convert Tolpis_laciniata.JPG -crop 890x930+100+100 Tolpis_laciniata.cropped.JPG

Voilà.

One capitulum of Tolpis laciniata

One capitulum of Tolpis laciniata

Debugging with PDB – A hands-on example

A must-have in the Python tool box of every “digital entomologist”.

You have some complex code. It breaks during execution. Via the traceback you know the approximate location of the problem in your code. You modify the code. It breaks again. Modify, break, modify, break, … until you eventually find the issue and move on. What did Albert Einstein say again about repetitive behavior?

There is a more focused approach to debugging in Python: the pdb module. This little Python module allows you to insert a special command in your code, which will function as a breakpoint and provide you with pseudo-terminal access to variables and functions at the time of breakage. It acts like flash-freezing the code execution, just that you can access variables and functions in your code.

Let us look at an example. The following definition will break because the code attempts to concatenate strings and integers.

def will_break():
    # Variable 1 is a string
    variable1 = "Hello"
    # Variable 2 is a list
    variable2 = ["1", "World", "3", "!"]
    # Variable 3 is an integer
    variable3 = 2
    # Generate string from variables
    output = variable1 + variable2[1] + variable3 + variable2[3]
    # Print variable output
    print output
>>> will_break()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 9, in will_break
TypeError: cannot concatenate 'str' and 'int' objects

Let us now include the pdb break statement “pdb.set_trace()” immediately prior to line 9 as suggested by traceback. Line 9 is a comment line, so I will actually place the pdb break statement prior to the last genuine code line located before line 9.

import pdb
def for_debugging():
    # Variable 1 is a string
    variable1 = "Hello"
    # Variable 2 is a list
    variable2 = ["1", "World", "3", "!"]
    # Variable 3 is an integer
    variable3 = 2
    # Insertion of the pdb breakpoint
    pdb.set_trace()
    # Generate string from variables
    output = variable1 + variable2[1] + variable3 + variable2[3]
    # Print variable output
    print output

When executing the above code, we are supplied with a primitive terminal, the pdb shell. This terminal starts with “(Pdb)” on every line, has it’s own syntax, and every command must be prepended with an exclamation mark (e.g. “!print”). Commenting does not work in this terminal.

>>> for_debugging
> <ipython-input-3-ea7022cd72b7>(12)for_debugging()
-> output = variable1 + variable2[1] + variable3 + variable2[3]
(Pdb) 
(Pdb) p variable3
2
(Pdb)
(Pdb) ![type(x) for x in [variable1, variable2, variable3]]
[<type 'str'>, <type 'list'>, <type 'int'>]
(Pdb) !# Don't do anything.
*** SyntaxError: unexpected EOF while parsing (<stdin>, line 1)

To circumvent the restrictions of the pdb shell, we can launch a temporary interactive Python session from within pdb, so that we have the full functionality of Python back and all the local variables available.

(Pdb) !import code; code.interact(local=vars())

Python 2.7.6 (default, Feb 26 2014, 12:07:17) 
[GCC 4.8.2 20140206 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)

In : type(variable2)
<type 'list'>

In : # Don't do anything.

With the information gained from our pdb-enclosed interactive Python session, we can modify the code in a much more cognizant and directed way than mere meddling with the original code. Hence, let us correct the original code; see the change in line 10.

def will_work():
    # Variable 1 is a string
    variable1 = "Hello"
    # Variable 2 is a list
    variable2 = ["1", "World", "3", "!"]
    # Variable 3 is an integer
    variable3 = 2
    # Insertion of pdb statement
    #pdb.set_trace()
    # Generate string from variables
    output = variable1 + variable2[1] + str(variable3) + variable2[3]
    # Print variable output
    print output

Simple example on using rPython – Part 2

The second part of a simple rPython tutorial.

In the previous blog post we left off with a Python dictionary created from within R via rPython. Let us now add another key-value pair to the dictionary. I use the command “python.assign()” to push strings to, but not to pull strings from, Python. To retrieve the updated Python dictionary, I instead use “python.get()”.

# Assigning a variable in R
> homecountry = "Austria"
# Adding a dictionary entry
> python.assign("capcities['at']", homecountry)
# Retrieving the updated dictionary from Python
> abbrev_dict_updated = python.get("capcities")
> abbrev_dict_updated
        at         sk         hu         de 
 "Austria" "Slovakia"  "Hungary"  "Germany"

Note that using “python.assign()” to pull strings from Python into R does not work out of the box. I suspect that the reason for this deficiency of rPython is that the package has not been developed enough to handle more complex strings.

# Converting the Python dictionary into a Python string
> python.exec("capcities_string = ', '.join('%s: %s' % (key,
value) for (key,value) in capcities.items())")
# Initializing a string in R
> abbrev_dict_updated = ""
# Assigning a variable in R with a string from Python
> python.assign(abbrev_dict_updated, "capcities_string")
  File "<string>", line 2
    =' [ "capcities_string" ] '
    ^
SyntaxError: invalid syntax

Let us now utilize the power of Python and perform a more complex operation on the dictionary in Python (e.g. sorting the dictionary by value, utilizing some Python libraries). I will then import the ordered dictionary into R.

> python.exec("from collections import OrderedDict")
> python.exec("from operator import itemgetter")
> python.exec("abbrev_dict_sorted = OrderedDict(sorted(
capcities.items(), key=itemgetter(1)))")
> abbrev_dict_sorted = python.get("abbrev_dict_sorted")
> abbrev_dict_sorted
        at         de         hu         sk 
 "Austria"  "Germany"  "Hungary" "Slovakia"

Simple example on using rPython – Part 1

Spoon-feeding for the rest of us.

Calling Python from within R via the R-package “rPython” is fairly easy. However, very little documentation exists on this package, and some of the commands may appear “quirky” at first. The paucity of written documentation on this package seems to scare away many biologists, who – if they are like me – prefer to work with well-documented code. But fear not! Usage of rPython is straightforward with a bit of exercise. The following tutorial is intended to give a quick and simple example on how to work with rPython.

> library(rPython)
Loading required package: RJSONIO

I use the command “python.exec()” to execute basically any Python operation within R.

# Generating a dictionary in Python 
> python.exec("capcities = {'sk':'Slovakia','de':'Germany', 
'hu':'Hungary'}")

I use the command “python.get()” to assign a new variable in R with a value from Python.

# Pulling the dictionary from Python into R
> abbrev_dict = python.get("capcities")
> abbrev_dict
 de sk hu 
 "Germany" "Slovakia" "Hungary"

Note, however, that “python.get()” is very inflexible in what it can retrieve from Python.

> abbrevs = python.get("capcities.keys()")
Traceback (most recent call last):
...
TypeError: dict_keys(['de', 'sk', 'hu']) is not JSON serializable

Instead, I use:

> abbrevs = python.get("list(capcities.keys())")
> abbrevs
[1] "de" "sk" "hu"