
ASMS 2018: Exhilarating and exhausting

The American Society for Mass Spectrometry annual conference represents my one sure visit to the United States each year.  What is it about this meeting that keeps bringing me back across the Atlantic Ocean?  What makes this gathering feel like an academic home?

Early Days

My first encounter with ASMS took place in 1998, when I attended the annual conference in Orlando.  During this and other early visits, I made it my goal to eat only free food during the four days of the conference.  I remember ice cream breakfasts from a vendor at this first meeting!  Being notoriously frugal did my waistline no favors, but then I was skinny as a rail during graduate school.


Prof. Pevzner, from his early days as a Wild West sheriff

My Ph.D. project involved the creation of an automated sequence tag inference engine from low-resolution tandem mass spectra of peptides.  That meant I had one particular talk on my agenda for ASMS 1998.  I listened with rapt attention to talk WOF 3:10 given by Pavel Pevzner (then a scientist at Millennium Pharmaceuticals) describing his SHERENGA software: “Automated De Novo Peptide Sequencing.”  I introduced myself to him after his talk, and when I mentioned my project, I blurted out that we were competitors!  I was a frightfully competitive guy back then.  I am grateful that Pavel let the comment pass; in the two decades since that meeting he and I have become friends.

I feel I must mention ASMS 2004, the year that John Yates, III won the Biemann Medal, which I consider to be mass spectrometry’s highest award.  The conference was held in Nashville, TN, which was lovely given that I was a post-doc at Oak Ridge National Laboratory, just four hours down the interstate.  I arrived at the conference to learn that Steve Gygi, a friend of mine from graduate school, had played an epic prank on John and me at one of the preliminary meetings.  A student of his had captured a video of John and me encouraging people to get out onto the dance floor at a Keystone Symposium.  Steve had used the video in a research talk to show that while John was an expert in biochemistry, analytical chemistry, and bioinformatics, he couldn’t dance!


…of which the less said, the better

Professional Integration

To attend a yearly conference is one thing, but becoming part of its organization is quite another.  After I joined the faculty of Vanderbilt University in 2005 as an assistant professor, I decided that ASMS was the organization that felt most like “home” to me, and I began paying my dues yearly rather than haphazardly on the years I planned to attend the conference.  I became familiar with a growing number of its luminaries, both through the senior scientists with whom I collaborated and through smaller meetings, such as the Association of Biomolecular Resource Facilities and the United States Human Proteomics Organization.  Happily, I gained a reputation as an energetic speaker who could make mass spectrometry informatics seem more approachable.

My three biggest public roles within ASMS have all been drawn from the field of mass spectrometry informatics.  I feel deeply honored to have twice selected the speakers to appear in panels on the informatics of identification.  My second big involvement was with the Bioinformatics Interest Group.  After the main panels on each full day of the conference, ASMS features workshops for interest groups, running from 5:45 to 7:00 PM.  Since conference attendees tend toward exhaustion after such busy days, the workshops function best when they feature passionate speakers who interact quite a lot with the audience.  I am certainly not ashamed to stand outside the meeting room, inviting absolute strangers to join our group!  I enjoyed my Phil Donahue moments, running between different members of the audience with the microphone.

My biggest engagement, however, has been a long-running ASMS short course.  In 2011, Alexey Nesvizhskii, Nuno Bandeira, and I offered “Bioinformatics for Protein Identification” for the first time.  In this two-day short course (on the Saturday and Sunday preceding the conference), we introduced newcomers to proteomics to the algorithms that enable protein identification.  Happily, the course drew a good response, and we have now run the short course for eight consecutive years!  It’s a lot of work, and it makes each ASMS visit six days rather than four, but I really draw a lot of satisfaction from working with the participants.


The 2018 class

ASMS 2018: San Diego

What made this year such a busy program?  I would start with the fact that I completed my Ph.D. in San Diego, and I had many friends to visit while there!  I was very grateful to visit with friends from the “Darkstar” science fiction, gaming, yoga, and movie-making club; I hadn’t seen many of them for fifteen years!  I was also happy to see Ben Winnick, a friend of mine since my undergraduate years at the University of Arkansas.  It’s humbling to think I have known him since 1997.  These social calls complemented the professional friendships I was able to renew at the conference.

Since John Yates has made his home in San Diego since 2000, I was also glad to attend the reunion dinner he organized on the Saturday before the conference.  I was sitting down to dinner with my extended family of 200 friends, a bit worn out from running the first day of our short course, when I learned that the first speaker for the event had dropped out due to illness.  I was soon penciled in to replace him!  I frantically scribbled some notes while eating so that I could share some of my favorite stories from the early Yates Lab.  I was glad I could make people laugh!

Although I was not part of this year’s bioinformatics interest group, I was included as a speaker for the Analytical Lab Managers Interest Group under Emily Chen and David Quilici at their Monday evening workshop.  I emphasized the methods core lab managers need to incorporate “Big Data” into their work, focusing on data repository use and careful statistics.  I slumped into my hotel bed directly after this talk; I had been yammering about something or other almost continuously for three straight days.


Dave goes contrarian.

Wednesday put me right back on stage.  I was slated for a mock debate over at the Informatics Hub.  I was paired with my friend Juan Antonio Vizcaino (responsible for the PRIDE repository); he would argue that Big Data was transforming proteomics, and I would argue that Big Data was creating more problems than it was worth!  It’s true that I have some doubts about the value of Big Data practices to date.  I hope my talk caused participants to think about good strategies for its incorporation.

Of course, the “work” that most conference attendees take on still awaited me.  I had submitted a poster reporting work I have conducted in agricultural proteomics with the University of the Western Cape.  We created an ortholog mapping table via BLAST that allowed us to determine which protein in sorghum mapped to which protein in maize.  We then used the mapping table to re-align our spectral count table so that the counts for each ortholog pair appeared on the same row.  This means our statistical model can look for differences between our “wet” and “dry” cohorts in both species, simultaneously!  I look forward to writing that paper.  My poster had been slated for Thursday, so I dutifully stood beside my A0 format poster throughout the morning and into the early afternoon.  I was glad to see that the poster hall was not completely deserted, even on the last day.
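For the curious, the re-alignment step works something like this minimal Python sketch; the accessions and counts here are made up for illustration, and the real mapping table came from our BLAST comparisons:

```python
# A minimal sketch of the ortholog re-alignment step described above.
# The accessions and spectral counts are hypothetical examples.

def align_counts(ortholog_map, sorghum_counts, maize_counts):
    """Put the spectral counts for each ortholog pair on the same row."""
    rows = []
    for sb_acc, zm_acc in sorted(ortholog_map.items()):
        rows.append((sb_acc, zm_acc,
                     sorghum_counts.get(sb_acc, 0),
                     maize_counts.get(zm_acc, 0)))
    return rows

# Hypothetical example: two ortholog pairs
ortholog_map = {"Sb01g001": "Zm001", "Sb01g002": "Zm002"}
sorghum_counts = {"Sb01g001": 12, "Sb01g002": 0}
maize_counts = {"Zm001": 9, "Zm002": 4}

for row in align_counts(ortholog_map, sorghum_counts, maize_counts):
    print(row)
```

With the counts paired row by row, a single statistical model can test both species at once.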

I am grateful to the people that launch the annual conference for ASMS each year.  It’s wonderful to gather with friends and see what each of us has created in the course of our work!


HUPO-PSI 2018: Dave’s takeaways

I have now attended three HUPO-PSI (Human Proteome Organization Proteomics Standards Initiative) meetings: Ghent, Beijing, and Heidelberg.  As an early skeptic about the standard data formats for mass spectrometry, I confess I have substantially revised my opinion on the usefulness of HUPO-PSI.  I have now served as the Quality Control Working Group chair for two full years, and I feel I understand quite a lot more of what these meetings accomplish.

Who is HUPO-PSI?

HUPO-PSI may receive less name recognition than the formats that it has made possible.  Thousands of biological mass spectrometrists have used the mzML format, frequently employing ProteoWizard software to produce it.  The mzML format, then, is probably HUPO-PSI’s most conspicuous success.  This format, though, was not the first XML-based attempt to capture proteomics data.  We would probably point to mzXML for that.  Actually, mzML wasn’t even the first file format canonized by HUPO-PSI for this purpose!  We would point to mzData for that.  Because HUPO-PSI was humble enough to seek the input of the mzXML team, the two efforts could be merged into a far more fully-fledged format, mzML, combining the best aspects of mzData and mzXML.  I also see a huge splash from the Molecular Interactions side of HUPO-PSI, though I have always associated myself with the mass spectrometry side instead.

HUPO-PSI has its share of detractors, as well.  The group has placed quite a lot of emphasis on XML as its preferred “persistence” layer (its mode of long-term information storage), though its formats have occasionally been translated to other strategies for long-term storage, such as the HDF5 scalable technology suite.  A consequence of embracing XML is that researchers see their data storage needs increase significantly; we frequently see that the mzML produced from a Thermo RAW file grows in size even as “peaklisting” reduces the amount of information it contains.  As a consequence, some of the groups within HUPO-PSI have invested effort in more streamlined options for data storage, such as the tab-delimited “mzTab” format, now undergoing an expansion to a wide variety of analyte types.  The team of which I am part, the Quality Control Working Group, is looking at another structural strategy called JavaScript Object Notation, or “JSON,” for compactly relating information.

Two topics have mystified me more than any others about the operation of HUPO-PSI.  The first is the reliance on a “CV” and “ontology.”  Controlled vocabularies are essentially sets of terms that have been rigorously defined for use in reporting a kind of information.  An ontology relates these terms to each other, for example through the “IS_A,” “PART_OF,” “HAS_PART,” and “REGULATES” relationships employed in the Gene Ontology.  HUPO-PSI makes use of ontologies that have been defined for other efforts, such as the Units of Measurement Ontology created by the European Bioinformatics Institute and the Phenotype and Trait Ontology.  It also maintains its own set of controlled vocabularies, such as the PSI-MS CV.  For my quality control group, creating a HUPO-PSI-compatible format means connecting into these information resources rather than “reinventing the wheel” with an altogether new ontology.
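To make the ontology idea concrete, here is a toy Python sketch of how “IS_A” links chain terms together; the term names are illustrative stand-ins, not actual PSI-MS accessions:

```python
# A toy sketch of ontology traversal: a term inherits meaning from every
# ancestor reachable through IS_A links. Term names here are hypothetical.

IS_A = {
    "ion mobility drift time": ["ion selection attribute"],
    "ion selection attribute": ["spectrum attribute"],
    "spectrum attribute": [],
}

def ancestors(term, relations):
    """Collect every term reachable through IS_A links."""
    found = set()
    stack = [term]
    while stack:
        for parent in relations.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

print(ancestors("ion mobility drift time", IS_A))
```

A validator can use exactly this kind of traversal to check that a reported term really is a kind of “spectrum attribute.”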

The second topic that mystifies me is the “Document Process.”  Once a working group has beaten all the problems they can find out of a proposed file type (a schema, or format to store the information, along with a CV that defines key terms for that file type), they submit the package to the document process, wherein more experienced standards creators draw attention to potential problems and external reviewers evaluate the extent to which that format meets the needs of the community for which it is intended.  I will learn a lot more about the Document Process when our proposal is ready for this group!


Despite his expression, Bunsen would certainly approve!

Meeting Outcomes

The mzML format seems very stable and very capable in its version 1.1.0.  Mass spectrometry technologies, however, are always improving!  For more than a decade, ion mobility technology has been maturing in technology development laboratories, and a few mass spectrometry vendors now offer instruments that incorporate this separation technology.  The mzML CV and schema, however, have had somewhat patchy support for the information from this separation.  At this year’s meeting, Eric Deutsch convened a small group of people to discuss the best way to support this technology within mzML, ideally without forcing a major update in the format.  Hans Visser of Waters Corporation has made a lot of contributions on this score, and Matt Chambers (a wunderkind whose company I enjoyed during my decade at Vanderbilt) offered some feedback on how to incorporate this information.  Our meeting at HUPO-PSI helped set us on a course for formal support for ion mobility!


I am grateful for the Ph.D. students and recent grads supporting our effort!

The HUPO-PSI Quality Control Working Group

I was really proud of the Quality Control Working Group.  I assigned us all a bit of homework for this meeting.  Three committee members create tools that generate quality metrics; each was tasked with creating a mock-up of the qcML his or her software should produce.  One of us produced a database for storage of quality metrics; he was asked to demonstrate what a qcML holding an analysis of these metrics should look like.  As a result, this meeting was far more concrete about what we need to do to finalize this format.  In particular, we grappled with the challenges of embedding information in JSON format within an XML wrapper.  Our consideration of complex data structures for particular metrics, such as three-dimensional matrices, is now much more applied in nature.
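As a rough illustration of the structural question we debated, here is a Python sketch of a JSON-encoded metric riding inside an XML wrapper; the element name and metric shown are hypothetical, since the real qcML schema was still under discussion at this meeting:

```python
# A sketch of embedding a JSON payload inside an XML element.
# The "qualityMetric" element and the metric itself are hypothetical.
import json
import xml.etree.ElementTree as ET

metric = {"accession": "QC:0000001",
          "name": "MS2 spectrum count",
          "value": 35422}

wrapper = ET.Element("qualityMetric")
wrapper.text = json.dumps(metric)      # the JSON payload becomes element text
xml_bytes = ET.tostring(wrapper)

# Round-trip: recover the metric from the XML wrapper
recovered = json.loads(ET.fromstring(xml_bytes).text)
print(recovered["name"], recovered["value"])
```

The appeal is that the JSON side stays compact and flexible while the XML side keeps compatibility with existing HUPO-PSI tooling.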

The field of proteomics needs to improve its ability to communicate issues of quality control.  There’s a perception of irreproducibility that hangs over the field.  While there is some basis in reality for these reproducibility claims, a fair bit of the problem is that researchers shy away from discussing quality issues in their papers.  A Ph.D. student in Shanghai has been leading the Quality Control Working Group’s recommendations for “MIAPE-QC”: the Minimal Information About a Proteomics Experiment for Quality Control.  I was sad that she could not attend the meeting due to grad school requirements, but a colleague of hers from Beijing presented the current state of the MIAPE-QC document.  We had a really good conversation about it, but I think our recommendations were a bit garbled on their way back to her; she was discouraged by our feedback and felt we were arguing for her to start over.  We’re working to clear up the confusion, and we will support her valuable efforts in educating our community.

Next stop: Cape Town?

As the meeting drew to a close, I put on my presenter hat one last time.  It was time to state my case for hosting the next HUPO-PSI at Cape Town!  Several different sites are bidding to host: Adelaide, Tokyo, San Diego, and Cape Town are all in the mix.  I started by taking the bull by the horns.  Cape Town may seem very far away, but it is actually in the same time zone as Heidelberg!  Flying from New York City is pretty rough, though, with a flight time of almost 15 hours.  My friend Eric Deutsch would have one of the worst routes since he is coming from Seattle.  Still, South Africa is seeing good growth in mass spectrometry, and we would love to see more of its laboratories corresponding with HUPO-PSI.  I highlighted some of the lovely attractions and hosting sites that we might visit as a group.  Hopefully, the steering committee will see its way to Cape Town in the near future!


The HUPO-PSI Steering Committee enjoys dinner.

Clinical proteomics in Russia and my last pair of pants

An index to this series is found on its first post.

October 30, 2017

At last the first day of ClinProt 2017 had arrived! I set aside my now-muddy pairs of jeans in favor of my fresh and clean blue dress pants, laced up my shiny black shoes, and put on my enthusiastic green shirt. With a spot of breakfast downstairs (on my third morning eating there, I found that the milk jug was full for the first time!), I was ready to meet with the others for a shuttle van ride over to the conference.

Moscow traffic at 8:20 AM is a bit intense. The drivers here are a bit more careful of road laws than those I have seen in other countries, but they still produce some pretty creative merges in their traffic jams. What would have been a few minutes on the subway was more like a half hour on the road, but my dress pants were still pristine when we arrived at the Congress Center at the I.M. Sechenov First Moscow State Medical University. The facility had a lovely central hall, with a graceful split staircase to the two main venues for our meeting. I hadn’t seen lecture halls in which an array of nine HDTVs replaced the more typical projector. It certainly produced a bright image, though the borders between screens were distracting.


Why project when you can emit?

The Clinical Proteomics 2017 meeting was organized because a confluence of groups wanted to consolidate researchers in this country. EuPA, the European Proteomics Association, helps to integrate activities that span national proteomics societies. The Russian Human Proteomics Organization (RHUPO) sought to foster a sense of community among Russian research groups in this area. The Sechenov First Moscow State Medical University was happy to contribute a venue for the event, and many instrument, reagent, and other vendors agreed to take part, as well. I haven’t learned the total count of attendees yet, but I know that there are 87 research posters. For a first effort, I think it is clear that a great many things have gone well.

From the very first talk, it was apparent that Russian clinical proteomics researchers are grappling with challenges that became familiar to me as part of the National Cancer Institute (NCI) CPTAC program. Anna Kudryavtseva discussed her efforts to reconcile proteomics data with those produced by The Cancer Genome Atlas (TCGA), another NCI program, working in a particular sub-type of head and neck cancer. Prioritizing genes that were more frequent targets of mutation in tumors has value for understanding which proteins are most useful to monitor closely, for example. It was a great “plenary” (all attendees) talk to kick off these discussions.

As soon as we split to multiple sessions, I was on duty. I co-chaired the “Genomics and Beyond” panel with Sergey Moshkovskii. It was a bit odd to be fielding this panel while the Protein Informatics workshop was taking place in another room (that topic has been my bread and butter for two decades)! In this case, however, Sergey and I were not only chairing the session but also leading it with our two lectures, both in the field of proteogenomics.


Photo credit: Olga Kiseleva

I defined the term by saying that we want to improve our interpretation of genomic data by integrating proteomics data, and we want to improve our interpretation of proteomics data by integrating genomic data (I was trying to be ecumenical). From there, I led the group through the new paper that I’ve published with Anzaan Dippenaar and Tiaan Heunis, in which we demonstrated our ability to recognize sequence variations and novel genes in Mycobacterium tuberculosis “bugs” that had been isolated from patient sputum in South Africa. Sergey followed up by finding evidence of RNA editing in fruit flies.


Photo credit: Olga Kiseleva

The other speakers in the panel were also quite interesting. Matthias Schwab was visiting from Germany, and he educated the group on the current status of the field of pharmacogenomics. Vladimir Strelnikov, a geneticist, described the value of bisulfite sequencing for measuring DNA methylation in breast cancer. Sergey Radko outlined a SISCAPA-like strategy for using “aptamers” to enrich proteins prior to Selected Reaction Monitoring. Artem Muravev closed out the session by discussing the challenges of biobanking. This last talk was delivered in Russian, so I benefited quite a lot from real-time translation to English by Anastasia, one of two translators fielding our session (during my talk, she had been translating my words to Russian as I worked through my slides). Finally all the speakers came together for fifteen minutes of question and answer. I tweaked our pharmacogenomics speaker a little bit by saying that even if we had the complete sequences for every human on earth in our hands today, personalized medicine would not have arrived!

With the morning complete, everyone adjourned to a nearby restaurant. I was a little leery when I learned our destination was the Black Market, but I needn’t have worried; we wandered down the street to a lovely restaurant named “Black Market.” I had the Black Market Burger and felt thoroughly happy. I felt very grateful that the European Proteomics Association picked up the bill for that morning’s speakers!

Back in the conference, I enjoyed hearing my long-time friend David Goodlett discuss his long-term monitoring study of diabetes. He’s a careful guy, and it is good to see that he can make label-free proteomics sing in biofluids (a tough space to work), recognizing protein pairs for which expression can flag the onset of disease. It’s very reminiscent of the kind of study Stellenbosch University has produced in the space of tuberculosis. Our next speaker returned to the subject of biobanking, and he delivered his talk via Skype, not my favorite format. I am a big believer in contact with my audience.


Did I mention it was my enthusiastic green shirt?

I threw all my remaining energy into the poster session. Interacting with researchers at the start of their careers is very rewarding, and people who stand beside their work without knowing whether or not anyone will take interest have a hard job. These students were even braver, since they were prepared to defend their work in English!

I started with a poster very near and dear to my heart. A.V. Mikurova was evaluating the different levels of sequence coverage achieved by database search (Mascot, X!Tandem) and de novo algorithms (PepNovo+, Novor, and PEAKS) when working with 27 LC-MS/MS experiments for a defined mixture of human proteins. We discussed the relative unresponsiveness of sequence coverage as a metric for performance evaluation and the challenge of ensuring the algorithms had comparable configuration. I asked S.E. Novikova about her choices of statistical model for a time-series measurement of proteomes in response to all-trans retinoic acid. I hope my statistics lectures online will be useful to her, though it sounds like she’s already on the right track. N.V. Kuznetsova taught me a few things I didn’t know about celiac disease! She had been evaluating the ability of triticain-α to degrade the most immunogenic peptide of gluten-family proteins. Finally, J. Bespyatykh was presenting a poster on the proteomics of Mycobacterium tuberculosis from a strain called Beijing B0/W148. Her work obviously had a strong relationship to what Tiaan and Anzaan had published with me, so we had a great conversation about the work. I hope we can help her find a sequence database that is a more ideal fit for her proteomes than the generic “H37Rv” protein database. I was really pleased to speak with so many students about their work at this meeting.

With that, I slumped onto a wall and didn’t move very much. The other conference attendees had flowed back into the conference room for an afternoon round of talks. I let my mind wander for a bit, though I did have some nice conversations with the vendors. Soon, though, I heard some odd noises echoing through the entry hallway. Was there a music practice room somewhere in the building? Was that a tuba?


Dixieland music in Moscow! Photo credit: Olga Kiseleva

My questions were answered when I eventually joined everyone downstairs for a catered closing reception. The organizers had invited a Dixieland band to perform for our reception! The group was really solid. I particularly liked one of their trumpeters, since he had a smooth Chuck Mangione vibe going on. I kept recognizing songs only part of the way, since they were singing many of the lyrics in Russian! I finally got a solid hit on “Mack the Knife!” I sat up close to enjoy the show.

With the evening at an end, I declined invitations to go hit a bar and walked to the nearby Frunzenskaya subway station. Two stops later, I was in my neighborhood. I trudged up the paved driveway to the street with my hotel. As I awaited the green light at my last crosswalk before the hotel entrance, a car drove too close to the curb where I was waiting, and dirty rainwater soaked my last clean pair of pants.


What protein database is best for tuberculosis?

As many of you know, I have specialized in the field of proteomics, the study of complex mixtures of proteins that may be characteristic of a disease state, development stage, tissue type, etc.  Here in South Africa, my application focus has shifted from colon cancer to tuberculosis.  As a newcomer to this field, I’ve been curious to know whether the field of tuberculosis has good information resources to leverage in its fight against the disease.

The key resource any proteomics group can leverage is the sequence database, specifically the list of all protein sequences encoded by the genome in question.  The human genome incorporates around 20,310 protein-coding genes (reduced from estimates of 26,588 from the 2001 publication), but those genes code for upwards of 70,000 distinct proteins through alternative splicing. Bacteria are able to get by with far smaller numbers of genes.  E. coli, for example, functions with only 4309 proteins.  The organism that infects humans and other animals to produce tuberculosis is named Mycobacterium tuberculosis.  If we were to rely upon the excellent UniProt database, from which I quoted E. coli protein-coding gene counts, we would probably conclude that M. tuberculosis relies upon even fewer genes: only 3993 (3997 proteins)!


UniProt is an excellent all-around resource for proteomics, but researchers in a particular field usually gravitate to a data resource that is particular to their organism.  People who work with C. elegans for developmental studies, for example, use WormBase.  People who study genetics with D. melanogaster would use FlyBase.  People in tuberculosis research have frequently turned to TubercuList for its annotation of the M.tb genome (comprising 4031 proteins).  This database, however, has not been updated since March of 2013 (according to its “What’s New” page).  Can it still be considered current, four years later?



As a recent import from clinical proteogenomics, my first impulse is still to run to the genome-derived sequence databases of NCBI, particularly its RefSeq collection.  I found an NCBI genome for M. tuberculosis there, with a last modification date of May 21, 2016, indicating that its annotation was based upon “ASM19595v2,” a particular assembly of the sequencing data.  This was echoed when I ran to Ensembl, another site most commonly used for eukaryotic species (such as humans) rather than prokaryotic organisms (such as bacteria).  Their Ensembl tuberculosis proteome was built upon the same assembly as the one from NCBI.


As a former post-doc from Oak Ridge National Laboratory, I am always likely to think of the Department of Energy’s Joint Genome Institute.  The DOE sequences “bugs” (slang for bacteria) like nobody’s business.  Invariably, I find that I can retrieve a complete proteome at JGI for a rare bacterium that is represented by only a handful of proteins in UniProt!  This makes JGI a great resource for people who work in “microbiome” projects, where samples contain proteins from an unknown number of micro-organisms.  In any case, JGI had many genomes that had been sequenced for tuberculosis (using the Genome Portal, I enumerated projects for Taxonomy ID 1773).  I settled on two that were in finished state, one by Manoj Pillay that appeared to serve as the reference genome and another by Cole that appeared to be an orthogonal attempt to re-annotate the genome from fresh sequencing experiments.

The easiest way to compare the six databases I had accumulated for M. tuberculosis is to enumerate the sequences in each database.  The FASTA file format is very simple; if you can count the number of lines in the file that start with ‘>’, you know how many different sequences there are!  I used the GNU tool “grep” to count them:

grep -c "^>" *.fasta
  • TubercuList: 4031 proteins
  • NCBI GCF: 3906 proteins
  • DOE JGI Cole: 4076 proteins
  • DOE JGI Pillay: 4048 proteins
  • Ensembl: 4018 proteins
  • UniProt: 3997 proteins

So far, one could certainly be excused for thinking that these databases are very nearly identical.  Of course, databases may contain very similar numbers of sequences without containing the same sequences.  One might count how many sequences are duplicated among these databases, but identity is too tough a criterion (sequences can be similar without being identical).  For example, database A may contain a long protein for gene 1 while database B contains just part of that long protein sequence for gene 1.  Database A may be constructed from one gene assembly while Database B is constructed from an altogether different gene assembly, meaning that small genetic variations may lead to small proteomic variations.

I opted to use OrthoVenn, a rather powerful tool for analyzing these sequence database similarities.  The tool was published in 2015.  Almost immediately, I ran into a vexing problem.  The Venn diagram created by the software left out TubercuList!  I was delighted to get a rapid response from Yi Wang, the author of the tool (developed with funding from the United States Department of Agriculture’s Agricultural Research Service).  The tool could not process TubercuList because it contained disallowed characters in its sequences!  I followed his tip and sniffed the file very closely.  I found that both sequence entries and accession numbers contained characters they should not.  Specifically, I found these interloping characters:

+ * ' #

OrthoVenn Venn chart

Scrubbing those bonus characters from the database allowed the OrthoVenn software to run perfectly.  Before we leave the subject, I would comment that these characters would cause problems for almost any program designed to read FASTA databases; a protein containing one of those characters in its sequence, for example, might never be identified because of the inclusion!  My read is that they were introduced by manual typing errors; they are not frequent, and they appeared at a variety of locations.  Let’s remember that they have been in place for four years, with no subsequent database release!
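The scrubbing itself is simple; here is a Python sketch that removes the four interlopers listed above (the dirty entries shown are fabricated for illustration, and any real cleanup should be logged so that others can reproduce it):

```python
# A sketch of the scrubbing step, removing the four disallowed characters.
# The dirty FASTA lines below are hypothetical examples.

BAD_CHARS = set("+*'#")

def scrub_fasta(lines):
    """Drop disallowed characters from header and sequence lines alike."""
    return ["".join(ch for ch in line if ch not in BAD_CHARS)
            for line in lines]

# Hypothetical dirty entries mimicking the problems described above
dirty = [">Rv0001' dnaA#", "MTD*DGRQ+AL"]
print(scrub_fasta(dirty))
```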

Most people are accustomed to seeing Venn diagrams that incorporate two or three circles.  In this case I compelled the software to compare six different sets.  The bars shown at the bottom of the image show the numbers of clusters in each database; note that these differ from the number of sequences reported in my bullet list above because OrthoVenn recognizes that sequences within a single database may be highly redundant of each other!  (If sequences were completely identical, they could be screened out by the Proteomic Analysis Workbench from OHSU.)  Looking back at the six-pointed star drawn by the software, we might conclude that the overlap is nearly perfect among these databases.  We see four clusters specific to the JGI Pillay database, and 131 clusters specific to some sub-population of the databases, but the great bulk of clusters (3667) are apparently shared among all six databases!


The Edwards visualization from OrthoVenn

Oh, how much difference a visualization makes!  Shifting the visualization to “Edwards’ Venn” alters the picture considerably.  Now we see that the star version hides the labels for some combinations of databases.  We see that 3667 clusters are indeed shared among all six databases.  After that, we can descend in counts to 131 clusters found in the Pillay and Cole databases from JGI; does this reflect a difference in how JGI runs its assemblies?  Next we step to 106 clusters found in UniProt, Ensembl, TubercuList, and NCBI GCF, but neither of the JGI databases.  The next sets down represent 70 clusters found in all but NCBI GCF or 25 clusters found in all but the two JGI databases and NCBI GCF.

I interpret this set of intersections to say that tuberculosis researchers face a bit of a dilemma.  If they use a JGI database, they’ll miss the 106 clusters found in all the other databases.  If they use Ensembl or TubercuList, they will include those 106 but lose the 131 clusters specific to the JGI databases.  Helpfully, OrthoVenn shows explicitly which sequences map to which clusters.  Remember that when I downloaded the Ensembl and NCBI databases, I saw that they were both based upon a single genome assembly called ASM19595v2.  Did they contain exactly the same genes?  No!  Ensembl contained two fairly big sets of genes that NCBI omitted, comprising 70 and 25 protein clusters, respectively.  NCBI contained another 11 protein clusters that were omitted from Ensembl.  Just because two databases stem from the same assembly does not imply that they have identical content.
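At heart, the Edwards-style breakdown is just a tally of clusters by their exact database-membership "signature." Here is a toy sketch of that bookkeeping; the cluster IDs and memberships are invented for illustration, not taken from the real OrthoVenn output.

```python
from collections import Counter

# Toy data: which databases contain each cluster (memberships are invented).
membership = {
    "cluster1": {"Ensembl", "NCBI", "UniProt", "TubercuList", "JGI_Pillay", "JGI_Cole"},
    "cluster2": {"JGI_Pillay", "JGI_Cole"},
    "cluster3": {"Ensembl", "NCBI", "UniProt", "TubercuList"},
    "cluster4": {"JGI_Pillay", "JGI_Cole"},
}

# Tally clusters by their exact combination of source databases.
signature_counts = Counter(frozenset(dbs) for dbs in membership.values())

# Report combinations from most to least populous, as Edwards' Venn does.
for sig, n in sorted(signature_counts.items(), key=lambda kv: -kv[1]):
    print(n, "cluster(s) in:", ", ".join(sorted(sig)))
```

With the real cluster table exported from OrthoVenn, the same `Counter` over frozensets would reproduce the 3667 / 131 / 106 / 70 / 25 breakdown directly.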

For my part, I rely on some qualitative criteria to decide upon a database.  I do not like making manual edits to a database, since others would then need to know exactly which edits I’ve made in order to reproduce my work.  That rules out TubercuList.  Next, I feel strongly that the FASTA database should contain useful text descriptions for each accession.  Take a look at the lack of information TubercuList provides for its first protein:


That’s right.  Nothing!  The Joint Genome Institute databases are quite similar in omitting the description lines. Compare that to what we see in the NCBI and UniProt databases:

NP_214515.1 chromosomal replication initiator protein DnaA [Mycobacterium tuberculosis H37Rv]
sp|P9WNW3|DNAA_MYCTU Chromosomal replication initiator protein DnaA OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=dnaA PE=1 SV=1

That’s much more informative. We’ve got missing data here, too, though. Tuberculosis researchers have grown accustomed to their “Rv numbers” to describe their most familiar genes/proteins, but NCBI and UniProt leave those numbers out of well-characterized genes; the Rv numbers still appear for less well-characterized proteins, such as hypothetical proteins. By comparison, Ensembl includes textual descriptions as well as Rv numbers in a machine-parseable format for every entry:

CCP42723 pep chromosome:ASM19595v2:Chromosome:1:1524:1 gene:Rv0001 transcript:CCP42723 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:dnaA description:Chromosomal replication initiator protein DnaA

On this basis, I believe Ensembl may be the best option for tuberculosis researchers. It is kept up-to-date while TubercuList is not, and it allows researchers to refer back to the old Rv number system in each description.
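Because Ensembl writes its description lines as space-separated key:value pairs, they are straightforward to parse. Here is a sketch using the dnaA line shown above; the helper function name is mine, not part of any Ensembl tooling.

```python
import re

header = ("CCP42723 pep chromosome:ASM19595v2:Chromosome:1:1524:1 gene:Rv0001 "
          "transcript:CCP42723 gene_biotype:protein_coding "
          "transcript_biotype:protein_coding gene_symbol:dnaA "
          "description:Chromosomal replication initiator protein DnaA")

def parse_ensembl_header(line):
    """Split an Ensembl FASTA description line into a dict of fields."""
    accession, seqtype, rest = line.split(" ", 2)
    fields = {"accession": accession, "seqtype": seqtype}
    # The description field runs to the end of the line and may contain spaces,
    # so peel it off before splitting the remaining key:value tokens.
    m = re.search(r"description:(.*)$", rest)
    if m:
        fields["description"] = m.group(1)
        rest = rest[:m.start()]
    for token in rest.split():
        key, _, value = token.partition(":")
        fields[key] = value
    return fields

info = parse_ensembl_header(header)
print(info["gene"], "->", info["description"])
```

This is exactly the property I value: the Rv number (`gene:Rv0001`) and the human-readable description come out of the same line with no guesswork.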

I hope that this view “under the hood” has helped you understand a bit more of the kind of question that occasionally bedevils a bioinformaticist!

With the new year, a new office!

The Division of Molecular Biology and Human Genetics occupies the fourth floor of the FISAN building at the SUN Faculty of Medicine and Health Sciences.  As its research programs have become better funded, substantial numbers of clinical and research staff have been added to its roster.  One practical result of this addition was that I shared an office with three other researchers when I arrived in South Africa in late 2015.  With the start of 2017, however, our division has gained access to office space on the third floor.  I am happy to report that as of last week, I have a new solo office!

This move does not come without regrets, though.  I have become friends with the inhabitants of F416, and my new hallway currently seems quite lonely by comparison.  Sam Sampson is a group leader who came to SUN via the National Research Foundation “South African Research Chairs Initiative,” and she has impressed me with her concentration skills in our busy office.  I also appreciate her thoughtful gift of teaspoons when mine went missing!  I really value Kim Stanley’s friendship; she has been very tolerant of my practical jokes, and occasionally I catch a glimpse of her mischievous sense of humor.  She invests countless hours in the REDCap study databases that undergird much of the research for our division.  Nasiema Allie was the last of the four people in our office to arrive.  Her job is quite critical since she ensures that the BSL3 lab facilities for our division are as safe as they can be.  Why is that a big deal?  Our division emphasizes research in tuberculosis, and we culture Mycobacterium tuberculosis from patient samples.  Some of the strains we recover from patients are resistant to every drug available to treat this disease.  Let’s just say that I don’t store my lunch in the freezers lining the division’s hallways!


Our freezers are equipped with wireless boxes that “phone home” if the temperature inside rises.

My new digs are on the wing extending east from the FISAN entrance.  To be on the third floor means I am several feet closer to the flock of chickens at ground level.  When I open the window (!) of my new office to feel a sweet afternoon breeze, I also get to hear the crowing of the roosters.  Last week we also had the questionable benefit of being closer to the smell of decomposition as cadavers were moved downstairs; FISAN is an Afrikaans abbreviation for “physiology and anatomy!”  That said, the third floor has great accommodations for the bioinformatics and biostatistics students we will be training in SATBBI.  The student chamber we have selected has abundant space, featuring bookshelves, a chalkboard, a bulletin board, and even a sink!  Right outside we have a smaller area we hope to position as a meeting room.


We haven’t reached our final configuration for the desks in the bioinformatics student workspace.

This brings us to my office.  I was one of the first professors to pick out my new home, and I decided on one featuring a blue wall (rather than the beige featured throughout the complex), an intact chalkboard (rather than the scars left where one had been removed), and a ledge underneath its narrow window.  I discovered that the ledge was the perfect height for tucking a cabinet or drawer set from our old furniture upstairs.  They will match the desk that my graduate student and I hauled downstairs from my old office.  The ledge is sturdy enough that I can stand on it to raise my window, so I feel confident that it will house some plants for me soon!


All I need now is a coffee table.

Moving my computers down was a bit more worrisome.  Happily, the LAN port (or “network point,” as they would say here) was already live, though it runs at a slower 100 Mbps rather than gigabit.  In any case, my Ubuntu Linux file server “Deep Thought” made the transition downstairs without a hiccup.  I recently brought my Intel Core i7 workstation “Alabaster” from home; it connects to the campus network wirelessly, which frees its wired port for a direct connection between the two computers in my office.  Using a gigabit link exclusively for communication between the pair means I can use the RAID from Deep Thought almost as though it were a local hard drive in Alabaster.  This may be as good a place as any to mention an act of generosity from Vanderbilt University.  When I decided to move my lab to South Africa, the Department of Biomedical Informatics allowed me to move almost all the computers associated with my laboratory to Stellenbosch University!  It made a real difference to my new division.


What office is complete without a memento or ten?

I have assembled a collection of treasures on my desk that link me to my past.  Probably my oldest memento is a koosh ball that I acquired in high school.  I am very fond of my jar of marbles for my discussions in frequentist biostatistics; I bought these marbles when I was starting as a professor at Vanderbilt from the Moon Marble Company, near Kansas City.  My first Ph.D. graduate student bought me a jade pen holder that I use every day.  My singing bowl from China gets a special place of prominence.  A small, red Buddha was a parting present from the Harkeys, close friends from Nashville.  An analog clock from Vanderbilt reminds me of my friend Bing Zhang, who headed to Houston around the time I moved to Cape Town.  I don’t remember where my Ganesh came from, but he has a reputation for finding the solutions to problems, so he definitely belongs on my desk!

My name placard has moved, I have given up my key to F416, and all my things have migrated downstairs.  Over the weekend, my lovely girlfriend bought me white and colored chalk for my new chalkboard!  Now it’s time for the science to flow from my desk once again.  Wish me luck!


This hallway awaits the rest of its occupants!


Five research institutions in one week

My home institution is the Faculty of Medicine and Health Sciences of Stellenbosch University, in the outskirts of Cape Town. Because 60% of my effort is paid by the Medical Research Council, however, I serve as a bioinformatics resource to tuberculosis researchers, regardless of their institution. During one recent week, my duties took me to five different institutions. This post invites you to come along for the ride!

Monday: Tygerberg Campus

The first day of each week generally finds me perched at my computer in an office shared with three colleagues, each of them a quite impressive scientist. One is a professor, another is a database administrator, and the newest member is a lab safety manager. All three of them work long hours. This week we added a whiteboard to the door to keep track of where each of us may be found.


What can I say? I like maps.

My chief project at the Tygerberg campus for this week has been establishing a small cluster of computers for the Mycobacteriology Molecular Evolution and Physiology group. The recent loss of a server destroyed some assemblies of genomic sequencing data for them. I brought seven i7 processing nodes from my old lab. We spent much of Monday afternoon consolidating memory to produce four computers with 8GB or more of RAM (even microbial genomes can take a fair amount of RAM for assembly). We installed Ubuntu Linux on the computers since the assembly software requires that operating system. The group promises to keep the processors busy in the coming weeks!

Tuesday: Groote Schuur

The Faculty of Health Sciences for the University of Cape Town is based at Groote Schuur Hospital (this is not pronounceable by an American, but try HREE-ta SKEW). I work on Stellenbosch hours, which means I arrive at UCT long before most visitors. I had two institutions to visit on this Tuesday: the Centre for Proteomics and Genomics Research and the UCT health sciences campus itself. CPGR and I had talked about a large-scale proteomic survey spanning more than one hundred patients, so I brought along a portable hard drive to collect the mass spectrometry data. I have enjoyed talking with their proteomics team, so engaging in a legitimate collaboration with them is a great opportunity.


Just after sunrise, Anzio Road’s intersection with the M4 is quite pretty.

I walked up Anzio Road to the UCT Health Sciences Campus. I used my recently acquired badge (I’ve become an honorary professor at UCT) to card into the campus and then used it again to enter the Institute of Infectious Disease and Molecular Medicine (abbreviated the “IDM“). I had a great meeting with a post-doctoral fellow in SATVI. She has been waiting very patiently for me to analyze cell populations in a large flow cytometry data set associated with IRIS (Immune Reconstitution Inflammatory Syndrome, a severe condition that sometimes results when a person with HIV infection begins anti-retroviral therapy). At long last, we could begin visualizing the results from this huge study!


The IDM feels both modern and classic to me.

I returned to my normal haunt at UCT, the Blackburn Lab. Other meetings had already filled my morning, but I could still get some project updates from the lab’s graduate students and post-docs. At two in the afternoon, we had our weekly community proteomics meeting. Once a month, we put on a two-hour program devoted to a special topic (for example, the next program discussed strategies for publishing effectively in proteomics). After that meeting ended, I was on the road back to my usual neighborhood.

Wednesday: Tygerberg Campus

Upon my return to my home campus, I had the usual crush of email to field, due to my time at UCT the prior day. Working with so many institutions has led to a proliferation of email accounts; at the moment I monitor two accounts from back in the States plus my Google account, and I’ve picked up accounts for both Stellenbosch University and University of Cape Town. I expect that most messages will come to my Google account or my Stellenbosch account, but manuscript review requests frequently show up elsewhere.


Poking the beast in the bioinformatics cave

I decided to give another afternoon to the mini-cluster.  Having four computers running Linux is nice, but wouldn’t it be better if the operators didn’t have to physically connect a monitor to each one to use it?  Shouldn’t some files be shared among the computers?  I hauled out the network switch and wired them up.  The information technology folks registered the MAC addresses of the network adapters for use in the FISAN building, where the computers are located.  I installed a big hard drive in the head node and set up an NFS server and OpenSSH on it.  On the other three computers, I installed the nfs-common and OpenSSH packages, each time configuring them to mount the /scratch file system from “Beast1.”  Now users logged in to Beast1 can launch jobs over SSH on the other three computers.  I felt that these machines had finally crossed the threshold of being worth using, and I breathed a sigh of relief.

Thursday: University of the Western Cape

I have only recently begun working regularly with the Proteomics Unit of the UWC Biotechnology program, but I have really enjoyed the interaction. I have found myself thinking about proteogenomics in a rather different light to support their work with non-model organisms. To identify proteins in Capsicum frutescens, we needed to know which proteins are encoded by the genome of this organism. Happily, we found an RNA-Seq paper that included assembled transcripts that we could employ in database search algorithms for proteomics. More recently the UWC team has been challenging me with a different plant. We found RNA-Seq data, but not the assembly produced by the authors. It has been fun building a team to infer a proper protein catalog for the species.

One of my favorite games to play with my old lab group was “Where’s the Paper?” We would work our way around the table, with each person discussing what he or she was working on. Then we’d try to figure out the best paper that could be produced from that work. That game has come in handy at UWC, as well. We’re trying to find the best manuscripts among the data that this team has been generating. I have faith in these folks!


The Life Sciences Building is easily visible from the adjoining M10: Robert Sobukwe Rd.

UWC is also notable for having some of the best buildings of any campus in the area.  For years it languished behind the better-funded UCT and Stellenbosch University, but the democratically-elected government has spent substantial funds on the campus since 1994.  The Life Sciences and Chemical Sciences buildings are gorgeous, and they would not look out of place on any American campus.  Another nice feature: the campus sits adjacent to the Cape Flats Nature Reserve.  I love the plants in view as I make my way along West Drive.


The cape flats range from marsh to sandy fynbos.

I returned to the Tygerberg campus of Stellenbosch in the afternoon for the first rehearsal of the “Singing Sensations,” a group of academic staff who are preparing three songs for the annual gala. This year’s program will celebrate sixty years of the Stellenbosch Medical School!

Friday: Stellenbosch University, main campus

My Friday morning began in a large conference room on the floor below mine in the FISAN building.  The tuberculosis groups have grown so much that we cannot easily fit into our own seminar room.  Three Honours students presented their plans for their research projects to a packed house, and professors peppered them with questions.  Happily, the presenters did not wilt.  For my American readers, I should explain that Honours students have completed their three-year Bachelor’s degrees and are adding another year to prepare for their chosen graduate programs, in this case in Molecular Biology or Human Genetics at the Stellenbosch University medical school.

I had become excited about seeing the computer cluster in operation again, and I tried to identify which parts I would need to get the other three computers back in action.  A few months ago, I had a nasty surprise when I plugged one of the units into the wall.  The very loud POP and sizzle had me checking my eyebrows; I had neglected to flip the switch on the power supply that configures it for the doubled voltage of South African power outlets.  Replacing the ITX power supply in that case seems a bit challenging here in South Africa; the university’s two main parts suppliers do not carry them.  I put together a quote that includes a new case and power supply.


The A. I. Perold building on the Stellenbosch main campus

With the morning gone, it was time for me to hit the road for Stellenbosch University’s main campus, which is in (wait for it) Stellenbosch. From the outskirts of Cape Town, the drive lasts less than an hour. I parked near the Perold Building (number 65 on the campus map) and walked up the stairs to the offices of the Dean of Science, Prof Louise Warnich. I visit the main campus every two weeks, usually with my friend and colleague Gerard Tromp, but this was a solo week since he was traveling. I was really grateful that my friend Professor Heinrich Volschenk was available to talk. We have been planning a tutorial session with Laurent Gatto for the end of October. We will teach students how to make use of the R statistical environment and Bioconductor in making sense of their proteomic data. Heinrich and I framed the key topics we want to address with the students. I think it will be quite a good program!

It was a busy week, but quite a rewarding one. I love the diversity of projects that I can influence here.

Making digital images from 35mm slides

As a young man, my father was quite the technologist.  He bought a reel-to-reel audio tape recorder during his service in the U.S. Army, and so I can hear the voices of my grandparents during a long road trip in the 1960s or my parents during their vacation to the South in 1971.  Dad acquired a Super 8 film movie camera during that time as well, and he shot movies with it up to the time I was a child.  He even became a camera buff, shooting more than one thousand photographs that were developed as 35mm slides.  As a result, I can draw on an unparalleled archive of images from my family’s history.

As a technologist of a different sort, I have tried to bring this archive into the digital age.  During my college years, I transferred much of Dad’s movie footage to 8mm video tape, and in later years I digitized that footage through a miniDV video camera and a FireWire cable.  During my graduate school years, I digitized the audio from a few hours of reel-to-reel tape to produce audio CDs.  Oddly, producing high-resolution scans of the 35mm slides has posed the biggest challenge.  Today I can report that we have finally found a way to digitize those amazing boxes of slides!

When I was in graduate school in Seattle (1996-2000), I performed my first experiments with slide scanners.  My friend Elizabeth allowed me to use her HP Photosmart slide scanner, and the resulting images were okay, for the time.  I tried buying an inexpensive slide scanner from another company, and yet the results from the hours I invested in it were fairly disappointing.  In recent years, I purchased a Canon CanoScan 8600F, a flatbed scanner with a lid that can backlight transparent sources.  The images from this flatbed have been pretty nice, since it can operate at 4800 dpi, but scanning even a single slide at this resolution takes a fair amount of time.  I’ve never managed to scan the whole collection this way.  I have also found that scanners do not cope very well with the range of brightness we encountered in the slides; many dark slides simply produced poor-quality images in any scanner.
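To put 4800 dpi in perspective, a 35mm slide frame is nominally 36 x 24 mm, so a full-resolution scan is enormous. The quick arithmetic below assumes that nominal frame size:

```python
# Rough pixel dimensions of a 4800 dpi scan of a nominal 36 x 24 mm slide frame.
MM_PER_INCH = 25.4
dpi = 4800

width_px = round(36 / MM_PER_INCH * dpi)   # width in pixels
height_px = round(24 / MM_PER_INCH * dpi)  # height in pixels
megapixels = round(width_px * height_px / 1e6, 1)

print(f"{width_px} x {height_px} pixels, about {megapixels} megapixels")
```

Roughly 30 megapixels per slide goes a long way toward explaining why each scan takes so long, and why scanning a thousand slides this way never quite happened.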

In 1999, I discovered that Canon had produced the FP-100 slide adapter for my Hi-8 video camera, and I acquired one for the princely sum of $120.  Essentially, the FP-100 was a low-temperature lamp, a bracket through which one could move a slide holder (with some wiggle room for positioning), and a ring to attach to the front of the camera.  I was glad that the cool lamp was bright enough to illuminate the darkest images from the collection, since the video camera could adjust its iris to the content of the slide.  Because the video camera could only resolve 480 lines in each image, though, the video images we produced through it were not quite what I wanted.


The entire sandwich of equipment is unwieldy, but it works well enough!

In 2015, I found the missing ingredient.  My Canon EOS-M camera employs a prime (non-zoom) lens with a 43 mm filter thread.  I asked my photographically talented friend Brad Melton how I could connect the 46 mm FP-100 to the prime lens, and he located a “step-up ring” for me.  Would a slide adapter intended for a video camera be usable with a modern mirrorless high-resolution still camera?  We quickly discovered that the camera was unable to focus on the slide images because the lens could not focus on a subject so near.  I compensated by adding a macro tube between the camera body and the lens.  The entire sandwich of equipment included these elements:

  1. Canon EOS-M2 mirrorless camera
  2. Meike MK-C-AF3B 10mm macro tube
  3. Canon EF-M 22mm f/2 STM prime lens
  4. HeavyStar Dedicated Metal Stepup Ring 43mm-46mm
  5. Canon FP-100 slide adapter

I was reasonably pleased with the performance, on the whole, but there are certainly some drawbacks. The first is the problem of achieving good focus. Many of these slides now carry a fair amount of dust, and the camera was frequently inclined to auto-focus on the dust rather than the image. The second issue results from the image being so close to the lens; the edges of the slide are considerably farther from the lens than the center. Focusing on one part of the image generally meant that parts of the slide farthest from the focus point would be more than a little fuzzy. This photo gives an example of this behavior:


In the left snapshot, I have selected my mother’s face as the focus point. In the right, I have selected my grandmother’s instead.

I also encountered some degenerate behavior in the focusing.  I would occasionally flip the selector over to use manual focus rather than automatic.  After cranking the focus ring all the way to one end of its range, I would sometimes discover that the camera thought it was being operated by a madman and would force the focus back in the other direction.  When I tried switching to a different macro tube thickness, I was entirely unable to focus, so I simply felt grateful that I could get these images to focus at all!

In some cases, the slide scanner had dealt very poorly with slide images that were quite dark overall or that featured a significant contrast between light and dark portions of a frame.  Happily, the Canon EOS-M2 seemed to handle these contrasts better because it was storing brightness levels in the 14-bit depth afforded by the CR2 raw file format.


My older brother always wanted to play.

One of the most common claims about 35mm slides is that they are far more resilient to aging than are prints from the same negatives.  Did that hold up?  The image of my father on the mule at the top of this post dates from 1964, around fifty years before this blog post was written.  The hues may be somewhat less vibrant than they were originally, but I doubt very much that a print from 1964 would hold up as well.  In this case, I have cropped to approximately 40% of the original field of view.  The slide below this is from 1965, showing my mother during her graduation ceremony from Northeast Missouri State Teachers College.  Again, I have cropped to about 40% of the original slide (in part to strip away unfocused areas).


These slides enable me to see my parents in the primes of their lives!

Of course, capturing the original CR2 files for each of a thousand slides is just the beginning.  From here, I will need to export each to a TIFF file, crop the image to new dimensions, possibly apply a noise-reduction filter, and export to JPG (the images I included here have not gone through noise reduction, though I did scale down the resolution considerably from the 18-megapixel originals).  That step will take considerable time, but I believe the result will be a far more useful archive for our family history.
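The cropping, at least, is easy to script once the arithmetic is pinned down. Here is a sketch of a centered-crop calculation; note that "40% of the field of view" is ambiguous between a linear fraction and an area fraction, so this helper (my own, not from any library) takes the linear interpretation as a parameter.

```python
def centered_crop_box(width, height, keep_fraction):
    """Return a (left, top, right, bottom) box for a centered crop that
    keeps `keep_fraction` of each linear dimension of the frame."""
    new_w = int(width * keep_fraction)
    new_h = int(height * keep_fraction)
    left = (width - new_w) // 2
    top = (height - new_h) // 2
    return (left, top, left + new_w, top + new_h)

# Example with an 18-megapixel frame (5184 x 3456), keeping 40% per side:
print(centered_crop_box(5184, 3456, 0.4))
```

A box in this (left, top, right, bottom) form can be handed straight to an image library's crop call inside a loop over the exported TIFFs, so the whole thousand-slide batch becomes one script rather than a thousand manual edits.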

I hope that this post will contribute some ideas for how to get your family archives in a more manageable condition!


Our family