Tag Archives: bioinformatics

What protein database is best for tuberculosis?

As many of you know, I have specialized in the field of proteomics, the study of complex mixtures of proteins that may be characteristic of a disease state, development stage, tissue type, etc.  Here in South Africa, my application focus has shifted from colon cancer to tuberculosis.  As a newcomer to this field, I’ve been curious to know whether the field of tuberculosis has good information resources to leverage in its fight against the disease.

The key resource any proteomics group can leverage is the sequence database, specifically the list of all protein sequences encoded by the genome in question.  The human genome incorporates around 20,310 protein-coding genes (reduced from estimates of 26,588 from the 2001 publication), but those genes code for upwards of 70,000 distinct proteins through alternative splicing. Bacteria are able to get by with far smaller numbers of genes.  E. coli, for example, functions with only 4309 proteins.  The organism that infects humans and other animals to produce tuberculosis is named Mycobacterium tuberculosis.  If we were to rely upon the excellent UniProt database, from which I quoted E. coli protein-coding gene counts, we would probably conclude that M. tuberculosis relies upon even fewer genes: only 3993 (3997 proteins)!


UniProt is an excellent all-around resource for proteomics, but researchers in a particular field usually gravitate to a data resource that is particular to their organism.  People who work with C. elegans for developmental studies, for example, use WormBase.  People who study genetics with D. melanogaster would use FlyBase.  People in tuberculosis have frequently turned to TubercuList for its annotation of the M.tb genome (comprising 4031 proteins).  This database, however, has not been updated since March of 2013 (available from the “What’s New” page).  Can it still be considered current, four years later?



As a recent import from clinical proteogenomics, my first impulse is still to run to the genome-derived sequence databases of NCBI, particularly its RefSeq collection.  I found an NCBI genome for M. tuberculosis there, with a last modification date of May 21, 2016 and an indication that its annotation was based upon “ASM19595v2,” a particular assembly of the sequencing data.  This was echoed when I ran to Ensembl, another site most commonly used for eukaryotic species (such as humans) rather than prokaryotic organisms (such as bacteria).  Their Ensembl tuberculosis proteome was built upon the same assembly as was the one from NCBI.


As a former post-doc from Oak Ridge National Laboratory, I am always likely to think of the Department of Energy’s Joint Genome Institute.  The DOE sequences “bugs” (slang for bacteria) like nobody’s business.  Invariably, I find that I can retrieve a complete proteome for a rare bacterium at JGI which is represented by only a handful of proteins in UniProt!  This makes JGI a great resource for people who work in “microbiome” projects, where samples contain proteins from an unknown number of micro-organisms.  In any case, they had many genomes that had been sequenced for tuberculosis (using the Genome Portal, I enumerated projects for Taxonomy ID 1773).  I settled for two that were in finished state, one by Manoj Pillay that appeared to serve as the reference genome and another by Cole that appeared to be an orthogonal attempt to re-annotate the genome from fresh sequencing experiments.

The easiest way to compare the six databases I had accumulated for M. tuberculosis is to enumerate the sequences in each database.  The FASTA file format is very simple; if you can count the number of lines in the file that start with ‘>’, you know how many different sequences there are!  I used the GNU tool “grep” to count them:

grep -c "^>" *.fasta
  • TubercuList: 4031 proteins
  • NCBI GCF: 3906 proteins
  • DOE JGI Cole: 4076 proteins
  • DOE JGI Pillay: 4048 proteins
  • Ensembl: 4018 proteins
  • UniProt: 3997 proteins
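
For readers without grep at hand, the same count can be sketched in Python. This is only an illustration of the idea; the function name and the tiny example records are made up for the purpose.

```python
def count_fasta_entries(lines):
    """Count FASTA records by counting the header lines that begin with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


# A tiny made-up FASTA for illustration. With real data, pass an open file
# handle instead: `with open(path) as handle: count_fasta_entries(handle)`.
example = [
    ">protein_1 hypothetical entry",
    "MTDDPGSGFTTVWNAVVSELNG",
    ">protein_2 hypothetical entry",
    "MAEQKSSNGGGGGDVVINV",
    "LLPQDRGGE",
]
print(count_fasta_entries(example))
```

Like the grep command, this counts headers only, so a sequence wrapped across several lines still counts once.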

So far, one could certainly be excused for thinking that these databases are very nearly identical.  Of course, databases may contain very similar numbers of sequences without containing the same sequences.  One might count how many sequences are duplicated among these databases, but identity is too tough a criterion (sequences can be similar without being identical).  For example, database A may contain a long protein for gene 1 while database B contains just part of that long protein sequence for gene 1.  Database A may be constructed from one gene assembly while Database B is constructed from an altogether different gene assembly, meaning that small genetic variations may lead to small proteomic variations.
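
To make the distinction concrete, here is a minimal Python sketch that separates exact identity from the "part of a longer protein" case. The function names and the toy sequences are hypothetical, and the containment check is a simple substring test rather than a true similarity search.

```python
def read_fasta(lines):
    """Return {accession: sequence} from FASTA-formatted lines."""
    records, accession = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            accession = parts[0] if parts else ""
            records[accession] = ""
        elif accession is not None:
            records[accession] += line
    return records


def compare(db_a, db_b):
    """Count db_a sequences identical to, or strictly contained in, a db_b sequence."""
    seqs_b = set(db_b.values())
    identical = sum(1 for s in db_a.values() if s in seqs_b)
    contained = sum(
        1 for s in db_a.values()
        if s not in seqs_b and any(s in t for t in seqs_b)
    )
    return identical, contained


# Toy databases: B holds only a fragment of A's gene1 protein.
db_a = read_fasta([">gene1", "MKTAYIAKQRQISFVK", "SHFSRQ", ">gene2", "MSDNELLK"])
db_b = read_fasta([">gene1", "AKQRQISFVK", ">gene2", "MSDNELLK"])
print(compare(db_b, db_a))
```

The all-against-all substring test is quadratic, which is tolerable for a few thousand bacterial proteins but is no substitute for a clustering tool when sequences differ by small variations.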

I opted to use OrthoVenn, a rather powerful tool for analyzing these sequence database similarities.  The tool was published in 2015.  Almost immediately, I ran into a vexing problem.  The Venn diagram created by the software left out TubercuList!  I was delighted to get a rapid response from Yi Wang, the author of the tool (through funding of the United States Department of Agriculture’s Agricultural Research Service).  The tool could not process TubercuList because it contained disallowed characters in its sequences!  I followed his tip to sniff the file very closely.  I found that both sequence entries and accession numbers contained characters they should not.  Specifically, I found these interloping characters:

+ * ' #

OrthoVenn Venn chart

Scrubbing those bonus characters from the database allowed the OrthoVenn software to run perfectly.  Before we leave the subject, I would comment that these characters would cause problems for almost any program designed to read FASTA databases; in some cases, for example, the protein containing one of those characters might be prevented from being identified because of these inclusions!  My read is that they were introduced by manual typing errors; they are not frequent, and they appeared at a variety of locations.  Let’s remember that they have been in place for four years, with no subsequent database release!
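
A check of this kind can be sketched in a few lines of Python. The allowed alphabet (plain uppercase letters) is an assumption of this sketch, as are the function names and the made-up example records; only the four suspect characters come from the inspection described above.

```python
import string

# Assumption: residues are written as plain uppercase letters A-Z.
SEQUENCE_ALPHABET = set(string.ascii_uppercase)
# The four interloping characters listed above.
SUSPECT = set("+*'#")


def find_suspect_characters(lines):
    """Return (line_number, offending_characters) for each line needing attention."""
    findings = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if text.startswith(">"):
            # Only the accession (first token) is checked in header lines.
            accession = text[1:].partition(" ")[0]
            bad = set(accession) & SUSPECT
        else:
            bad = set(text) - SEQUENCE_ALPHABET
        if bad:
            findings.append((number, "".join(sorted(bad))))
    return findings


def scrub_line(line):
    """Remove suspect characters from an accession or a sequence line."""
    text = line.rstrip("\r\n")
    if text.startswith(">"):
        accession, separator, rest = text[1:].partition(" ")
        accession = "".join(c for c in accession if c not in SUSPECT)
        return ">" + accession + separator + rest
    return "".join(c for c in text if c in SEQUENCE_ALPHABET)


# Made-up records that imitate the problem.
example = [">Rv0001#", "MTDD+PGSG", "FTTVW*NAV", ">Rv0002 dnaN", "MDAATTRV"]
print(find_suspect_characters(example))
print([scrub_line(line) for line in example])
```

Reporting the findings before scrubbing leaves a record of exactly which entries were altered, which matters if others need to reproduce the edited database.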

Most people are accustomed to seeing Venn diagrams that incorporate two or three circles.  In this case I compelled the software to compare six different sets.  The bars shown at the bottom of the image show the numbers of clusters in each database; note that these differ from the number of sequences reported in my bullet list above because OrthoVenn recognizes that sequences within a single database may be highly redundant of each other!  (If sequences were completely identical, they could be screened out by the Proteomic Analysis Workbench from OHSU.)  Looking back at the six-pointed star drawn by the software, we might conclude that the overlap is nearly perfect among these databases.  We see four clusters specific to the JGI Pillay database, and 131 clusters specific to some sub-population of the databases, but the great bulk of clusters (3667) are apparently shared among all six databases!


The Edwards visualization from OrthoVenn

Oh, how much difference a visualization makes!  Shifting the visualization to “Edwards’ Venn” alters the picture considerably.  Now we see that the star version hides the labels for some combinations of databases.  We see that 3667 clusters are indeed shared among all six databases.  After that, we can descend in counts to 131 clusters found in the Pillay and Cole databases from JGI; does this reflect a difference in how JGI runs its assemblies?  Next we step to 106 clusters found in UniProt, Ensembl, TubercuList, and NCBI GCF, but in neither of the JGI databases.  The next sets down represent 70 clusters found in all but NCBI GCF and 25 clusters found in all but the two JGI databases and NCBI GCF.

I interpret this set of intersections to say that tuberculosis researchers are faced with a bit of a dilemma.  If they use a JGI database, they’ll miss the 106 clusters in all the other databases.  If they use Ensembl or TubercuList, they will include those 106 but lose the 131 clusters specific to the JGI databases.  Helpfully, OrthoVenn shows explicitly which sequences map to which clusters.  Remember that when I downloaded the Ensembl and NCBI databases, I saw that they were both based upon a single genome assembly called ASM19595v2.  Did they contain exactly the same genes?  No!  Ensembl contained two fairly big sets of genes that NCBI omitted, including 70 and 25 protein clusters, respectively.  NCBI contains another 11 protein clusters that were omitted from Ensembl.  Just because two databases stem from the same assembly does not imply that they have identical content.
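
The bookkeeping behind these statements is ordinary set arithmetic. The Python sketch below uses a made-up table of cluster memberships (it does not read OrthoVenn output, whose file format is not assumed here) to count clusters per combination of databases and to list what a given database would miss.

```python
from collections import Counter


def combination_counts(cluster_members):
    """Count clusters for each exact combination of databases."""
    return Counter(frozenset(dbs) for dbs in cluster_members.values())


def missing_from(cluster_members, database):
    """List the clusters that a given database does not contain."""
    return sorted(c for c, dbs in cluster_members.items() if database not in dbs)


# Hypothetical memberships: cluster id -> databases that contain it.
clusters = {
    "c1": {"Ensembl", "NCBI", "JGI_Cole"},
    "c2": {"Ensembl", "JGI_Cole"},
    "c3": {"NCBI"},
    "c4": {"Ensembl", "NCBI", "JGI_Cole"},
}
print(combination_counts(clusters)[frozenset({"Ensembl", "NCBI", "JGI_Cole"})])
print(missing_from(clusters, "NCBI"))
```

Each region of the Edwards' Venn diagram corresponds to one key of `combination_counts`, which is why a region can hold clusters even when two databases derive from the same assembly.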

For my part, I may use some non-quantitative means to decide upon a database.  I do not like making manual edits to a database since then others need to know exactly which edits I’ve made to reproduce my work.  That takes away TubercuList.  Next, I feel strongly that the FASTA database should contain useful text descriptions for each accession.  Take a look at the lack of information TubercuList provides for its first protein:


That’s right.  Nothing!  The Joint Genome Institute databases are quite similar in omitting the description lines. Compare that to what we see in the NCBI and UniProt databases:

NP_214515.1 chromosomal replication initiator protein DnaA [Mycobacterium tuberculosis H37Rv]
sp|P9WNW3|DNAA_MYCTU Chromosomal replication initiator protein DnaA OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=dnaA PE=1 SV=1

That’s much more informative. We’ve got missing data here, too, though. Tuberculosis researchers have grown accustomed to their “Rv numbers” to describe their most familiar genes/proteins, but NCBI and UniProt leave those numbers out of well-characterized genes; the Rv numbers still appear for less well-characterized proteins, such as hypothetical proteins. By comparison, Ensembl includes textual descriptions as well as Rv numbers in a machine-parseable format for every entry:

CCP42723 pep chromosome:ASM19595v2:Chromosome:1:1524:1 gene:Rv0001 transcript:CCP42723 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:dnaA description:Chromosomal replication initiator protein DnaA
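
As an illustration of what "machine-parseable" buys us, here is a short Python sketch that pulls the Rv number, gene symbol, and description out of a header shaped like the one above. The function name is hypothetical, and the parsing relies only on the `key:value` layout visible in that single example rather than on any formal Ensembl specification.

```python
import re


def parse_ensembl_header(header):
    """Extract accession, gene (Rv number), gene_symbol, and description."""
    header = header.lstrip(">").strip()
    fields = {"accession": header.split()[0]}
    for key in ("gene", "gene_symbol"):
        # Require whitespace before the key so 'gene:' does not match 'gene_biotype:'.
        match = re.search(r"(?:^|\s)" + key + r":(\S+)", header)
        fields[key] = match.group(1) if match else None
    # The description is free text, so take everything to the end of the line.
    match = re.search(r"\sdescription:(.*)$", header)
    fields["description"] = match.group(1).strip() if match else None
    return fields


header = (
    "CCP42723 pep chromosome:ASM19595v2:Chromosome:1:1524:1 gene:Rv0001 "
    "transcript:CCP42723 gene_biotype:protein_coding "
    "transcript_biotype:protein_coding gene_symbol:dnaA "
    "description:Chromosomal replication initiator protein DnaA"
)
print(parse_ensembl_header(header))
```

With the fields separated this way, a table that maps accessions to Rv numbers is a one-line loop over the headers of the FASTA file.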

On this basis, I believe Ensembl may be the best option for tuberculosis researchers. It is kept up-to-date while TubercuList is not, and it allows researchers to refer back to the old Rv number system in each description.

I hope that this view “under the hood” has helped you understand a bit more of the kind of question that occasionally bedevils a bioinformaticist!


An extraordinary journey in three universities

Last November, I received some very welcome news.  The Deputy Vice-Chancellor for Academics at the University of the Western Cape informed me that I had been named an Extraordinary Professor in the Department of Biotechnology!  My work within that department had been going well, when persistent student protests closed the university through the end of 2016.  This letter reflected the ongoing hope of Biotechnology that our collaboration would continue when the students returned to their studies.  Today I received my official badge, so I would like to write about the work that is developing at each of the three local universities at which I have an appointment.

I have written about my travels among the campuses in and around Cape Town.  I would stress that I spend most of my time at my home institution, the Tygerberg campus of Stellenbosch University.  Bioinformatics has seen considerable investment by the university.  The South African Tuberculosis Bioinformatics Initiative represents the concentration of bioinformatics investigators for our campus: Gerard C. Tromp, Gian van der Spuy, and me.  There are other data scientists, though!  The Centre for Evidence-based Healthcare, led by Taryn Young, offers statistical expertise.  Tonya Esterhuizen specializes in biostatistics.  As I will explain in a moment, I hope to work with them more in the days to come.  This year, my formal teaching duties at my home campus will double.  Don’t worry for me, though, since I will host the Honours students for the Division of Molecular Biology and Human Genetics for only eight days!  I am glad that bioinformatics will have the “standard” module length for our Honours program, equal to Immunology and several other subjects.  I have been supplementing my teaching through an informal “course,” called the “Useful Hour.”  I have begun teaching all comers about a range of subjects, from computers to programming and statistics.  I hope to pull in some philosophy of science soon, as well.  I have been filming these subjects as a bit of an experiment, and it has been handy for those who cannot attend.

Hugh Patterton, Gerard Tromp, and I coordinate our efforts near Simonsberg.

The Stellenbosch campus of Stellenbosch University has made strides in bioinformatics, as well.  Hugh Patterton, a professor in the Department of Biochemistry, has been named to lead bioinformatics efforts at this campus.  Naturally, our group (SATBI) has been talking with Hugh about ways we can reinforce each other’s efforts.  Some of our consultations on the Stellenbosch campus have pointed in the direction of microbiome research, an area that is replete with bioinformatics challenges.  I look forward to seeing what emerges!

I am highlighting the University of the Western Cape in this post, of course!  In describing bioinformatics at the campus, I should start by mentioning the South African National Bioinformatics Institute (SANBI).  Alan Christoffels leads this group of investigators.  They’re an interesting group, with considerable success in capacity development within South Africa and across the continent.  My home on the campus, however, has been with the Department of Biotechnology.  In many respects, this reflects how I have spent my career.  I set the mold in graduate school, when I was a bioinformaticist surrounded by analytical chemists.  I like being close to the people who generate the data I work with!  In the Department of Biotechnology, I work most closely with the group of Ashwil Klein, the lecturer who heads the Proteomics Research and Service Unit.  They have primarily emphasized a gel-based workflow, meaning that they partially isolate proteins on a 2D gel before identifying the spot based on the peptide masses they observe on the Bruker Ultraflex TOF/TOF.  The group is actively moving toward additional instruments, though, and the acquisitions should greatly broaden their capabilities.  I enjoy the intellectual challenges their group produces, since the rules of the road are somewhat less established for agricultural proteomics.


The new UWC Chemical Sciences and Biological Sciences Buildings rise above the Cape Flats Nature Reserve.

In attending the department’s recent strategic retreat, I was introduced among the researchers of UWC Biotechnology more broadly.  I was particularly glad to meet with Dr. Bronwyn Kirby, who heads the Next Generation Sequencing Facility.  We discussed the Honours course offered for the department (I taught bioinformatics for the proteomics module last year), and I believe I’ll get to add some bioinformatics for the sequencing module in 2017!  I was also delighted to meet the SARChI chair who heads the Institute for Microbial Biotechnology and Metagenomics (IMBM), Marla Trindade.  We spoke about what the students of the institute most needed, and establishing a structured curriculum for biostatistics seemed very high on the list.  I mentioned the biostatistics researchers at Stellenbosch above.  My hope is to be able to use much of the structure Stellenbosch has already built in its Biostatistics I and II classes as a model for teaching biostatistics at UWC Biotechnology.  It would be my first effort at teaching biostatistics formally; I hope that I have absorbed enough to be a good teacher for this subject!

I continue to spend my Tuesdays with the University of Cape Town medical school and to visit the Centre for Proteomics and Genomics, as well.  UCT named me an Honorary Professor in the Department of Integrative Biomedical Sciences halfway through 2016.  My interactions there have principally taken place within the Institute of Infectious Disease and Molecular Medicine (IDM), borrowing from the network of relationships that Jonathan Blackburn has established there.  I have worked with Nelson Soares, his Junior Research Fellow, to create monthly programs for the Cape Town community invested in proteomics.  This Tuesday, we started this series for 2017 with an introduction to the methods we use for identifying and quantifying proteins.  I was really pleased that Brandon Murugan, a senior graduate student in the Blackburn Lab, felt comfortable enough to present this material!


I enjoyed my sundown cruise with the SATVI team in May of last year!

From the very beginning of my time in South Africa, I have been working with the South African Tuberculosis Vaccine Initiative (SATVI).  Recently they began having their research in progress meetings on Tuesday morning, allowing me to take part.  I really like the interaction.  They take my questions seriously, and I think we all learn from working together.  Certainly I would find great meaning in being part of a successful vaccine trial for this disease!

I have another group I must mention in describing bioinformatics across these three universities.  Nicola Mulder’s “CBIO” team has been an opening wedge in bioinformatics education for South Africa.  Their H3Africa BioNet courses have been used to supplement the content of B.Sc. education in places like the University of Limpopo.  It should be no surprise that many of the people I have mentioned in today’s post have collaborated in a manuscript describing the growth of bioinformatics in South Africa.  Our field is key to the future of public health and to the advances in biotechnology yet to come!

Semmering, Austria: Proteome Informatics on the upslope

At the start of 2015, I was incredibly fortunate to attend the Midwinter Proteome Informatics Seminar at Semmering, Austria.  Although I did not initially know many of the participants, I have subsequently become friends with many of them.  In some cases, we have even written papers and grants together!  I was thrilled to return to Semmering on January 8, 2017 to attend a sequel to this meeting, this time sponsored by the European Proteomics Association.  Our group had nearly doubled from fifty-six to one hundred and five!


January 14, 2015 (Schneeberg appears in the distance behind us.)


Jan. 11, 2017 (photo courtesy of Marc Vaudel)

Despite its small population (below six hundred permanent residents), Semmering is actually an interesting place.  The town is named for the eponymous pass through the Northern Limestone Alps.  The area gained special prominence in 1728 when Emperor Charles VI of Austria completed a road over the pass, a feat commemorated by a hefty monument near the ski resort.


The 18th century monument bathes in the Zauberberg night lights.

One hundred twenty years later, the pass served as a key railway connection, tying together “Lower Austria” and Styria, one of the nine federated states of Austria.  The stylish and well-engineered construction of this railway has been listed by UNESCO as a World Heritage Site.  The railway reaches almost 900 m above sea level.  The tracks employ tunnels and graceful bridges through a ruggedly beautiful terrain.  These rail links accelerated development in the area, making Semmering a major resort destination.

Our conference had grown so much in size that we occupied almost the entirety of the Semmering Sporthotel.  A feature that I particularly enjoy about this conference is the chance to create new tutorials for a crowd of advanced researchers.  In 2015, I premiered a half-day workshop on the subject of algorithms to identify post-translational modifications.  I asked this year’s organizers what kind of tutorial they would most like.  They responded by asking what I was working on right now.  I described my work in preparing sequence databases for identifying proteins of non-model organisms, starting from RNA-Seq experiments.  They replied that this would be just great.  I found it was a very useful exercise to learn the individual methods well enough to teach them to others.  In the end, approximately 35 students worked through the resulting half-day tutorial.  We were pretty challenged by the weak Internet service at the hotel, split across so many users, but most of the crucial steps were possible with data I had provided via USB drives.


My diagram of extant search engines from two years ago

Two years ago, I had chosen a somewhat controversial topic for my plenary lecture (one given to all the attendees at once rather than a subgroup).  In “The Hard Stuff: MS Bioinformatics Moves Beyond Protein Identification,” I argued that the era of publishing new database search engines for proteomics was drawing to a close, since more than thirty such tools have now been published!  I urged them to look beyond these basics to find challenges in non-conventional identification: MS/MS scans containing evidence for multiple peptides, proteins that vary in sequence from a database reference, and peptides bearing complex modifications like glycans or non-ribosomal peptides.


A banner image from my 2013 review of quality control

This year, I decided to focus on a question that matters to me as chair of a quality control working group for the HUPO-PSI.  What types of biological mass spectrometry are not well-served by existing quality control approaches?  I discussed some of the existing efforts in quantitative mass spectrometry within Spectrum Mill, SProCoP, and MSstats.  I contrasted this situation with the emerging field of data-independent acquisition, in which superior reproducibility is regularly claimed without metrics that could substantiate those claims.


Jan 13, 2015: Johannes Griss and I discover our shared sense of humor. (Photo courtesy Lennart Martens)

With two meetings at Semmering under my belt, I must say I am hooked.  These meetings remind me of the lovely RECOMB Computational Proteomics meetings at UCSD from 2010 to 2012.  The quality of attendees is really substantial, and the free-wheeling conversations are highly entertaining and educational.  I must also say that there is nothing quite as thrilling as sledding down the designated path of the ski slopes head-first (NOTE: this posture is discouraged), the way I lost my lens cap in 2015!  If you are in our field, I hope I’ll get to see you at a 2018 meeting!


Johannes Griss introduced me to Almdudler, a lovely tonic that is the taste of Austria for me.

China: From the foundations of computing to its future

An index to the China series appears at the first post.

September 22, 2016

A kindred spirit

I had a delightful time this morning speaking with Dr. Dongbo Bu in his office. My attention was immediately drawn to a small museum of early computers he has collected near the entrance to his office. His Mac 128k from 1984 and other early PCs were blasts from the past, but three members of his collection were quite remarkable. The first was a Turing machine that he had built from LEGO Technic. The computer has a memory of approximately sixteen bits, and it can read from the memory, write to its memory, and perform logical operations based on its contents. An even more impressive accomplishment was a larger LEGO construction of considerable complexity. He was delighted that I was able to recognize it as a working model of the Charles Babbage difference engine. Before we left the table behind, he was proud to show me a set of Napier’s Bones that he had commissioned from a local woodworker when he could not find a set to purchase.

Dr. Bu and I share a lot of research interests. I was very excited by his work on predicting peptide fragmentation. He has returned to a theme I raised in my 2003 paper on the subject, namely that each amino acid has a different propensity to steer fragmentation toward its N-terminal or C-terminal side. His contribution is to model the fragmentation effect when five successive amino acids are considered as a whole. In brief, an amino acid’s steering in one direction or another is biased by the amino acids immediately adjacent and, to a lesser degree, by the amino acids adjacent to its neighbors. He has used the biases seen in five-mers of amino acids to make an optimized model that can be used to very quickly simulate the tandem mass spectrum for a particular doubly-charged peptide.

He showed me a second project that had drawn his interest. When we fragment a glycopeptide, we typically see a somewhat uninformative tandem mass spectrum. The standard practice that follows is to produce an MS^3 spectrum that shows the fragments produced from the most intense fragment we see in the tandem mass spectrum. Dr. Bu’s efforts have shown that less intense fragments can frequently be more informative for glycoproteomics, and he has developed a Bayesian strategy to predict which fragment will be most useful for multi-stage mass spectrometry to recognize the structure of the oligosaccharide.

Presentations from the future: graduate students step forward

The pFind team started today with overviews of two different projects. Chao Liu began the session with a discussion of pLink. The software has undergone considerable revision since the initial release, which achieved a publication in Nature Methods. The group has added new elements to pParse that enable them to recognize monoisotopes with better accuracy and to fine-tune the mass accuracy of the precursor ions. pLink has also been improved by a machine learning boost to its PSM discrimination as well as an ion index to greatly accelerate the checking of a spectrum against a particular peptide sequence. They report that the modifications to their software make it perform like an altogether different beast. Their comparisons with Kojak, a recently published tool from the MacCoss Laboratory, suggest that their updated software offers quite exceptional performance.


The projects represented here are pGlyco, pNovo, and pAnno (left to right)!

The pGlyco effort has also gained traction in its abilities through a similar set of changes. Senior graduate student Wen-Feng Zeng demonstrated the rapidity and sensitivity his software has managed to achieve. Like other tools in the pFind suite, his code benefits from the improvements to pParse as well as the PSM recovery boost possible through machine learning. His software is able to make use of multiple levels of tandem mass spectrometry to recognize the components of the complex oligosaccharide structure. The fragments associated with oligosaccharide components can then be sought in a database of previously observed sugar structures to recognize the one represented by a set of multistage spectra. I am glad to see this project moving along so effectively. I pointed the student toward two glycosylation data sets that I hoped he would find interesting. The first was one created by CPTAC 1 by three laboratories, using instruments from two different vendors. The second was one published by a student who worked in my lab for a semester: (Dr.) Margaret Baker from the University of Hawaii. I admit that I’m simply curious to see what his code could do with the quality data she produced at Vanderbilt!

Because we had consumed so much time with these meetings, we headed off to lunch without the planned presentation of pAnno or of the updated pFind. pAnno was delayed to the following morning, just before our road trip to the Great Wall. I interrogated pFind researcher Hao Chi over our tasty lunch in the cafeteria. He reported that a feature I have wanted to see for years has been implemented! The pParse software is able to recognize multiple isotopic envelopes that occur within the isolation window in the production of a tandem mass spectrum. pFind has been altered to duplicate the spectrum so that it gets searched for each of the precursor masses and charge states that fall within the isolation window! When multiple identifications have been produced for a given tandem mass spectrum, the lower-scoring PSMs may be eliminated from further consideration if they share more than three fragment ions in their assigned lists. The software has also benefited from the machine learning capabilities and ion index acceleration mentioned for pLink. We spent a fair amount of time at lunch discussing the pros and cons of pFind’s consecutivity criterion (called a “kernel trick”): each peptide bond is evaluated in the context of the next two peptide bonds in both directions. In effect, a bond gets a score of “11111” if all of these bonds are matched to a fragment of this series (and thus an exponential boost to its contribution to score), but it gets a “00100” if none of the flanking bonds are matched and no boost. I need to re-read their 2004 paper to remind myself of how this boost is calculated.
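
To picture the five-bond context described above, here is a small Python sketch that writes out the pattern for every peptide bond of one fragment series. It illustrates the windowing only: the function name is hypothetical, treating positions beyond the peptide ends as unmatched is an assumption of the sketch, and the actual score boost used by pFind is not reproduced.

```python
def consecutivity_patterns(matched, flank=2):
    """For each bond, return a string of 0/1 flags for the bond and its flanks.

    matched: one boolean per peptide bond, True if a fragment of this series
    was observed for that bond. Positions past either end count as unmatched
    (an assumption of this sketch).
    """
    n = len(matched)
    patterns = []
    for i in range(n):
        window = []
        for j in range(i - flank, i + flank + 1):
            inside = 0 <= j < n
            window.append("1" if inside and matched[j] else "0")
        patterns.append("".join(window))
    return patterns


print(consecutivity_patterns([True, True, True, True, True]))
print(consecutivity_patterns([False, False, True, False, False]))
```

The middle bond of the first example yields "11111" and the middle bond of the second yields "00100", matching the two extremes in the description.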

September 23, 2016

My final day with the Institute of Computing Technology began with a deferred presentation by graduate student Wen-Jing Zhou, who had entered the exciting world of proteogenomics. As perceived by the ICT team, proteogenomics revolves around the question of genome annotation. How mature are the gene-calls that have been produced for the species we investigate in the laboratory? When we find peptides that map to regions outside of canonical genes, can we trust them? They have attempted to create systems to estimate FDR for novel peptides separately from those that conform to known genes. The group is making use of M. tuberculosis data sets from Ruedi Aebersold, Akhilesh Pandey, and their collaborator, a Professor Xu. While the sets from the first two investigators include tens of LC-MS/MS experiments, their collaborator has produced a couple hundred, summing to five million tandem mass spectra. I am hoping to acquire these for local use at Stellenbosch University. [These are links to her two external data sets: Aebersold and Pandey]
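
The idea of estimating FDR separately for each class can be sketched with the generic target-decoy estimate (decoy hits divided by target hits above a score threshold). This is not the ICT team's system, whose details were not presented here; the function name, the tuple layout, and the scores are all made up for illustration.

```python
def fdr_by_class(psms, threshold):
    """Estimate FDR per peptide class as decoys / targets at or above threshold.

    psms: iterable of (score, is_decoy, peptide_class) tuples.
    Returns {peptide_class: fdr}, with None when a class has no target hits.
    """
    counts = {}
    for score, is_decoy, peptide_class in psms:
        if score < threshold:
            continue
        targets, decoys = counts.get(peptide_class, (0, 0))
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        counts[peptide_class] = (targets, decoys)
    return {
        cls: (decoys / targets if targets else None)
        for cls, (targets, decoys) in counts.items()
    }


# Made-up PSMs: many confident "known" hits, few "novel" ones.
psms = (
    [(50.0, False, "known")] * 10 + [(45.0, True, "known")]
    + [(40.0, False, "novel")] * 4 + [(38.0, True, "novel")] * 2
    + [(5.0, True, "novel")]
)
print(fdr_by_class(psms, threshold=30.0))
```

In this toy set the pooled estimate would be 3 decoys over 14 targets, which hides the fact that the novel class is far less reliable than the known class.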

Taken together, these presentations represented a pretty incredible breadth of scholarship.  I am delighted that I got a behind-the-scenes look at their work, in some cases even before that work had been published!

China: Seek and ye shall pFind

An index to the China series appears at the first post.

September 21st, 2016

Today represented the core of my work with the Institute of Computing Technology. I was here to visit the “pFinders,” a group of five faculty members who produce algorithms for computational proteomics. I started with an interview with two of the senior faculty, Si-Min He and Rui-Xiang Sun. We have had extensive interactions over the years from a distance, but I had not previously met Professor He in person. We got along well from the word “go.” I have tremendous respect for the work that the pFinders have produced over the last ten years. I first began to understand the significance of this effort when I read their work adapting their database search engine pFind for use in electron transfer dissociation (ETD) data sets. Rather than make perfunctory alterations to which fragment ions were modeled to appear for a given sequence, the team gave serious thought to the problems associated with neutral loss ions from the intact peptide ion. Their paper and one from Robert Chalkley at UCSF had been the standouts in my mind for explaining how to accommodate this data type. I was very lucky to see Dr. Sun again (we had met years before at the UCSD RECOMB Proteomics meetings). He was on his way in just one week to spend some sabbatical time at the University of Wisconsin. It was a very collegial start to my day.


Si-Min He appears to my left, and Rui-Xiang Sun is to the right of me.

I chose to present an open problem for the proteomics community. What is the best way in which to identify proteins via tandem mass spectrometry from non-model organisms (i.e. those for which we have not already produced an annotated genome)? I presented a case study of a plant that my colleagues at the University of the Western Cape had been investigating. What species would be the best reference for understanding its transcriptome? How can we leverage RNA-Seq for proteomics?  From there, I moved to the problems posed by meta-proteomics, cases in which a sample features the proteins of many species that coexist in an ecological niche. I spoke mostly about ecological problems that we can address with this technology, but the same problems arise in understanding the microbial communities that inhabit our bodies. We talked about three workflows that could potentially enable these investigations. I think the pFind team is quite well-suited to implement software for one of those avenues. The group was highly interactive in the course of the talk, and I hope that some of them will follow up with possible solutions.


Look at the size of this proteome informatics group!

We enjoyed lunch most thoroughly at a lovely restaurant just down the street from the ICT building. I would highlight two aspects of the meal. First, one of our dishes contained some fascinating knots; yes, they looked exactly like strips of fabric tied in knots, covered in a brown sauce! The lunch group reaffirmed many times that the fabric was actually tofu. I was dubious, but I took a bite anyway. Fabulous! Second, I had made a passing remark about a dessert in the menu, and the group took it as a prompt to add it to our meal. I am so glad they did! It appeared to be a gelatin of sorts, again made from soy. The flavors, though, were truly outstanding. The center line of the snack represented tiny pieces of dates, and the white bands were coconut. I ate three of them!


My sweet tooth works on all continents.

After lunch, two of the faculty presented an overview of the program. I began to understand just how big a scope this group has attempted. Their papers really have grown substantially in number. Just look at the major toolsets this team has generated:

The database search engine is the fundamental tool for routine peptide identification. pFind is one of the most fully-featured and discriminating tools of the type.
Handling peptides that have been chemically cross-linked to each other is a daunting informatic task. They produced one of the earliest genuine successes in this space.
Inferring a sequence directly from a tandem mass spectrum is not easy, but this group has developed a powerful system for that purpose.
If one does not digest the proteins before introducing them to the tandem mass spectrometer, the resulting spectra will be far more complex than those for peptides. They have published a tool that competes well against others.
This software is the subject of my latest effort to publish with this group. It quantifies proteins and peptides by comparing the intensity seen for peptides between pairs of LC-MS/MS experiments or in isotopically labeled experiments.
Proteomics has shown itself to be a powerful complement to genomics; we call this field proteogenomics. Of course this talented team has taken an interest in the possibilities from multi-Omics!
Recognizing the structure of sugars that have been connected to a peptide is a daunting computational problem, but the group has tackled it with their usual flair.

Near the end of the day, graduate student Hao Yang presented the work he had been conducting in order to recognize the uncharacterized chemical modifications that appear on individual peptides. He has been adapting the pNovo project to make these open searches possible. I asked him any number of questions, and he did quite a nice job in bouncing back from each challenge. I am always happy to see young scientists responding well to the scrutiny each advance receives. The future arrives bit by bit, with each brick atop one emplaced by earlier research.

Dave goes back to work: the Colorado State connection

I have been very fortunate to spend this week at Colorado State University in Ft. Collins, CO.  The university agreed to temporarily appoint me as a Visiting Scientist in the Department of Microbiology, Immunology, and Pathology, starting on October 1st and continuing through the end of 2015.  As a result, my short-term “retirement” is at an end.  One may reasonably ask just what I am doing that justifies my appointment as temporary faculty.  In this post, I will try to explain our goals for this appointment.

The research at Colorado State is a natural bridge between the work carried out in my prior laboratory and the work in which I will be engaged in South Africa.  In the last couple of years, my team has begun work on algorithms for lipid identification.  My last graduate student at my former institution produced results that enabled him to give an oral presentation at the 2015 American Society of Mass Spectrometry meeting.  We are now completing the publication of his innovations.

Lipid mass spectrometry is a very big deal at Colorado State.  Research teams here have interrogated the lipids associated with infectious disease, with a particular emphasis on the Mycobacteria.  M. tuberculosis has drawn their attention because of its impact on human health.  Bacteria are far more diverse in the biochemistry that they deploy than are animals.  Many of the lipids produced by M. tuberculosis are specific to this organism, and understanding the functional role played by these lipids can help us to understand better how the bug interacts with human tissue.

Colorado State has been making considerable headway in deploying tandem mass spectrometry to yield inventories of bacterial lipids (their lipidomes), and my lab has developed software to identify major classes of lipids from tandem mass spectra.  If we can extend our software to handle the glycosylated lipids of M. tuberculosis, we will greatly improve their ability to translate data into knowledge.

Perhaps the best irony of my appointment is that the graduate student who wrote the lipid identification software spent much of his life living right here in Colorado!

John Belisle, Dave Tabb, and Nurul Islam take a break in the Research Innovation Center.
