Tag Archives: bioinformatics

An extraordinary journey in three universities

Last November, I received some very welcome news.  The Deputy Vice-Chancellor for Academics at the University of the Western Cape informed me that I had been named an Extraordinary Professor in the Department of Biotechnology!  My work within that department had been going well, when persistent student protests closed the university through the end of 2016.  This letter reflected the ongoing hope of Biotechnology that our collaboration would continue when the students returned to their studies.  Today I received my official badge, so I would like to write about the work that is developing at each of the three local universities at which I have an appointment.

I have written about my travels among the campuses in and around Cape Town.  I would stress that I spend most of my time at my home institution, the Tygerberg campus of Stellenbosch University.  Bioinformatics has seen considerable investment by the university.  The South African Tuberculosis Bioinformatics Initiative represents the concentration of bioinformatics investigators for our campus: Gerard C. Tromp, Gian van der Spuy, and me.  There are other data scientists, though!  The Centre for Evidence-based Healthcare, led by Taryn Young, offers statistical expertise.  Tonya Esterhuizen specializes in biostatistics.  As I will explain in a moment, I hope to work with them more in the days to come.  This year, my formal teaching duties at my home campus will double.  Don’t worry for me, though, since I will host the Honours students for the Division of Molecular Biology and Human Genetics for only eight days!  I am glad that bioinformatics will have the “standard” module length for our Honours program, equal to Immunology and several other subjects.  I have been supplementing my teaching through an informal “course,” called the “Useful Hour.”  I have begun teaching all comers about a range of subjects, from computers to programming and statistics.  I hope to pull in some philosophy of science soon, as well.  I have been filming these sessions as a bit of an experiment, and it has been handy for those who cannot attend.

Hugh Patterton, Gerard Tromp, and I coordinate our efforts near Simonsberg.

The Stellenbosch campus of Stellenbosch University has made strides in bioinformatics, as well.  Hugh Patterton, a professor in the Department of Biochemistry, has been named to lead bioinformatics efforts at this campus.  Naturally, our group (SATBBI) has been talking with Hugh about ways we can reinforce each other’s efforts.  Some of our consultations on the Stellenbosch campus have pointed in the direction of microbiome research, an area that is replete with bioinformatics challenges.  I look forward to seeing what emerges!

I am highlighting the University of the Western Cape in this post, of course!  In describing bioinformatics at the campus, I should start by mentioning the South African National Bioinformatics Institute (SANBI).  Alan Christoffels leads this group of investigators.  They’re an interesting group, with considerable success in capacity development within South Africa and across the continent.  My home on the campus, however, has been with the Department of Biotechnology.  In many respects, this reflects how I have spent my career.  I set the mold in graduate school, when I was a bioinformaticist surrounded by analytical chemists.  I like being close to the people who generate the data I work with!  In the Department of Biotechnology, I work most closely with the group of Ashwil Klein, the lecturer who heads the Proteomics Research and Service Unit.  They have primarily emphasized a gel-based workflow, meaning that they partially isolate proteins on a 2D gel before identifying the spot based on the peptide masses they observe on the Bruker Ultraflex TOF/TOF.  The group is actively moving toward additional instruments, though, and the acquisitions should greatly broaden their capabilities.  I enjoy the intellectual challenges their group produces, since the rules of the road are somewhat less established for agricultural proteomics.


The new UWC Chemical Sciences and Biological Sciences Buildings rise above the Cape Flats Nature Reserve.

In attending the department’s recent strategic retreat, I was introduced to the researchers of UWC Biotechnology more broadly.  I was particularly glad to meet with Dr. Bronwyn Kirby, who heads the Next Generation Sequencing Facility.  We discussed the Honours course offered for the department (I taught bioinformatics for the proteomics module last year), and I believe I’ll get to add some bioinformatics for the sequencing module in 2017!  I was also delighted to meet the SARChI chair who heads the Institute for Microbial Biotechnology and Metagenomics (IMBM), Marla Trindade.  We spoke about what the students of the institute most needed, and establishing a structured curriculum for biostatistics seemed very high on the list.  I mentioned the biostatistics researchers at Stellenbosch above.  My hope is to be able to use much of the structure Stellenbosch has already built in its Biostatistics I and II classes as a model for teaching biostatistics at UWC Biotechnology.  It would be my first effort at teaching biostatistics formally; I hope that I have absorbed enough to be a good teacher for this subject!

I continue to spend my Tuesdays with the University of Cape Town medical school and to visit the Centre for Proteomics and Genomics, as well.  UCT named me an Honorary Professor in the Department of Integrative Biomedical Sciences halfway through 2016.  My interactions there have principally taken place within the Institute of Infectious Disease and Molecular Medicine (IDM), borrowing from the network of relationships that Jonathan Blackburn has established there.  I have worked with Nelson Soares, his Junior Research Fellow, to create monthly programs for the Cape Town community invested in proteomics.  This Tuesday, we started this series for 2017 with an introduction to the methods we use for identifying and quantifying proteins.  I was really pleased that Brandon Murugan, a senior graduate student in the Blackburn Lab, felt comfortable enough to present this material!


I enjoyed my sundown cruise with the SATVI team in May of last year!

From the very beginning of my time in South Africa, I have been working with the South African Tuberculosis Vaccine Initiative (SATVI).  Recently they began having their research in progress meetings on Tuesday morning, allowing me to take part.  I really like the interaction.  They take my questions seriously, and I think we all learn from working together.  Certainly I would find great meaning in being part of a successful vaccine trial for this disease!

I have another group I must mention in describing bioinformatics across these three universities.  Nicola Mulder’s “CBIO” team has been an opening wedge in bioinformatics education for South Africa.  Their H3Africa BioNet courses have been used to supplement the content of B.Sc. education in places like the University of Limpopo.  It should be no surprise that many of the people I have mentioned in today’s post have collaborated in a manuscript describing the growth of bioinformatics in South Africa.  Our field is key to the future of public health and to the advances in biotechnology yet to come!

Semmering, Austria: Proteome Informatics on the upslope

At the start of 2015, I was incredibly fortunate to attend the Midwinter Proteome Informatics Seminar at Semmering, Austria.  Although I did not initially know many of the participants, I have subsequently become friends with many of them.  In some cases, we have even written papers and grants together!  I was thrilled to return to Semmering on January 8, 2017 to attend a sequel to this meeting, this time sponsored by the European Proteomics Association.  Our group had nearly doubled from fifty-six to one hundred and five!


January 14, 2015 (Schneeberg appears in the distance behind us.)


Jan. 11, 2017 (photo courtesy of Marc Vaudel)

Despite its small population (below six hundred permanent residents), Semmering is actually an interesting place.  The town is named for the eponymous pass through the Northern Limestone Alps.  The area gained special prominence in 1728 when Emperor Charles VI of Austria completed a road over the pass, a feat commemorated by a hefty monument near the ski resort.


The 18th century monument bathes in the Zauberberg night lights.

One hundred twenty years later, the pass served as a key railway connection, tying together “Lower Austria” and Styria, one of the nine federated states of Austria.  The stylish and well-engineered construction of this railway has been listed by UNESCO as a World Heritage Site.  The railway reaches almost 900 m above sea level.  The tracks employ tunnels and graceful bridges through a ruggedly beautiful terrain.  These rail links accelerated development in the area, making Semmering a major resort destination.

Our conference had grown so much in size that we occupied almost the entirety of the Semmering Sporthotel.  A feature that I particularly enjoy about this conference is the chance to create new tutorials for a crowd of advanced researchers.  In 2015, I premiered a half-day workshop on the subject of algorithms to identify post-translational modifications.  I asked this year’s organizers what kind of tutorial they would most like.  They responded by asking what I was working on right now.  I described my work in preparing sequence databases for identifying proteins of non-model organisms, starting from RNA-Seq experiments.  They replied that this would be just great.  I found it was a very useful exercise to learn the individual methods well enough to teach them to others.  In the end, approximately 35 students worked through the resulting half-day tutorial.  We were pretty challenged by the weak Internet service at the hotel, split across so many users, but most of the crucial steps were possible with data I had provided via USB drives.
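As a rough illustration of one step such a database-building pipeline might include, the sketch below translates an assembled transcript in all six reading frames and keeps the longer stop-free stretches as candidate protein sequences.  The function names, the 30-residue cutoff, and the approach itself are assumptions for illustration; they are not the tutorial's actual materials.

```python
# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate one reading frame; codons with unknown bases become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna, min_length=30):
    """Collect stop-free stretches of at least min_length residues
    from all six reading frames (hypothetical cutoff)."""
    dna = dna.upper()
    orfs = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            for segment in translate(strand[offset:]).split("*"):
                if len(segment) >= min_length:
                    orfs.append(segment)
    return orfs
```

Calling `six_frame_orfs(transcript)` on each assembled transcript and writing the results out in FASTA format would give a search engine something to match spectra against, at the cost of a database much larger than a curated proteome.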


My diagram of extant search engines from two years ago

Two years ago, I had chosen a somewhat controversial topic for my plenary lecture (one given to all the attendees at once rather than a subgroup).  In “The Hard Stuff: MS Bioinformatics Moves Beyond Protein Identification,” I argued that the era of publishing new database search engines for proteomics was drawing to a close, since more than thirty such tools have now been published!  I urged them to look beyond these basics to find challenges in non-conventional identification: MS/MS scans containing evidence for multiple peptides, proteins that vary in sequence from a database reference, and peptides bearing complex modifications like glycans or non-ribosomal peptides.


A banner image from my 2013 review of quality control

This year, I decided to spend some attention on a question of importance since I am chairing a quality control working group for the HUPO-PSI.  What types of biological mass spectrometry are not well-served by existing quality control approaches?  I discussed some of the existing efforts in quantitative mass spectrometry within Spectrum Mill, SProCoP, and MSstats.  I contrasted this situation with the emerging field of data-independent acquisition, in which superior reproducibility is regularly claimed without metrics that could substantiate those claims.


Jan 13, 2015: Johannes Griss and I discover our shared sense of humor. (Photo courtesy Lennart Martens)

With two meetings at Semmering under my belt, I must say I am hooked.  These meetings remind me of the lovely RECOMB Computational Proteomics meetings at UCSD from 2010 to 2012.  The quality of attendees is really substantial, and the free-wheeling conversations are highly entertaining and educational.  I must also say that there is nothing quite as thrilling as sledding down the designated path of the ski slopes head-first (NOTE: this posture is discouraged), the way I lost my lens cap in 2015!  If you are in our field, I hope I’ll get to see you at a 2018 meeting!


Johannes Griss introduced me to Almdudler, a lovely tonic that is the taste of Austria for me.

China: From the foundations of computing to its future

An index to the China series appears at the first post.

September 22, 2016

A kindred spirit

I had a delightful time this morning speaking with Dr. Dongbo Bu in his office. My attention was immediately drawn to a small museum of early computers he has collected near the entrance to his office. His Mac 128k from 1984 and other early PCs were blasts from the past, but three members of his collection were quite remarkable. The first was a Turing machine that he had built from LEGO Technic. The computer has a memory of approximately sixteen bits, and it can read from the memory, write to its memory, and perform logical operations based on its contents. An even more impressive accomplishment was a larger LEGO construction of considerable complexity. He was delighted that I was able to recognize it as a working model of the Charles Babbage difference engine. Before we left the table behind, he was proud to show me a set of Napier’s Bones that he had commissioned from a local woodworker when he could not find a set to purchase.

Dr. Bu and I share a lot of research interests. I was very excited by his work on predicting peptide fragmentation. He has returned to a theme I raised in my 2003 paper on the subject, namely that each amino acid has a different propensity to steer fragmentation toward its N-terminal or C-terminal side. His contribution is to model the fragmentation effect when five successive amino acids are considered as a whole. In brief, an amino acid’s steering in one direction or another is biased by the amino acids immediately adjacent and, to a lesser degree, by the amino acids adjacent to its neighbors. He has used the biases seen in five-mers of amino acids to make an optimized model that can be used to very quickly simulate the tandem mass spectrum for a particular doubly-charged peptide.

He showed me a second project that had drawn his interest. When we fragment a glycopeptide, we typically see a somewhat uninformative tandem mass spectrum. The standard practice that follows is to produce an MS^3 spectrum that shows the fragments produced from the most intense fragment we see in the tandem mass spectrum. Dr. Bu’s efforts have shown that less intense fragments can frequently be more informative for glycoproteomics, and he has developed a Bayesian strategy to predict which fragment will be most useful for multi-stage mass spectrometry to recognize the structure of the oligosaccharide.

Presentations from the future: graduate students step forward

The pFind team started today with overviews of two different projects. Chao Liu began the session with a discussion of pLink. The software has undergone considerable revision since the initial release, which achieved a publication in Nature Methods. The group has added new elements to pParse that enable them to recognize monoisotopes with better accuracy and to fine-tune the mass accuracy of the precursor ions. pLink has also been improved by a machine learning boost to its PSM discrimination as well as an ion index to greatly accelerate the checking of a spectrum against a particular peptide sequence. They report that the modifications to their software make it perform like an altogether different beast. Their comparisons with Kojak, a recently published tool from the MacCoss Laboratory, suggest that their updated software offers quite exceptional performance.


The projects represented here are pGlyco, pNovo, and pAnno (left to right)!

The pGlyco effort has also gained traction in its abilities through a similar set of changes. Senior graduate student Wen-Feng Zeng demonstrated the rapidity and sensitivity his software has managed to achieve. Like other tools in the pFind suite, his code benefits from the improvements to pParse as well as the PSM recovery boost possible through machine learning. His software is able to make use of multiple levels of tandem mass spectrometry to recognize the components of the complex oligosaccharide structure. The fragments associated with oligosaccharide components can then be sought in a database of previously observed sugar structures to recognize the one represented by a set of multistage spectra. I am glad to see this project moving along so effectively. I pointed the student toward two glycosylation data sets that I hoped he would find interesting. The first was one created in CPTAC 1 by three laboratories, using instruments from two different vendors. The second was one published by a student who worked in my lab for a semester: (Dr.) Margaret Baker from the University of Hawaii. I admit that I’m simply curious to see what his code could do with the quality data she produced at Vanderbilt!

Because we had consumed so much time with these meetings, we headed off to lunch without the planned presentation of pAnno or of the updated pFind. pAnno was delayed to the following morning, just before our road trip to the Great Wall. I interrogated pFind researcher Hao Chi over our tasty lunch in the cafeteria. He reported that a feature I have wanted to see for years has been implemented! The pParse software is able to recognize multiple isotopic envelopes that occur within the isolation window in the production of a tandem mass spectrum. pFind has been altered to duplicate the spectrum so that it gets searched for each of the precursor masses and charge states that fall within the isolation window! When multiple identifications have been produced for a given tandem mass spectrum, the lower-scoring PSMs may be eliminated from further consideration if they share more than three fragment ions in their assigned lists. The software has also benefited from the machine learning capabilities and ion index acceleration mentioned for pLink. We spent a fair amount of time at lunch discussing the pros and cons of pFind’s consecutivity criterion (called a “kernel trick”): each peptide bond is evaluated in the context of the next two peptide bonds in both directions. In effect, a bond gets a score of “11111” if all of these bonds are matched to a fragment of this series (and thus an exponential boost to its contribution to score), but it gets a “00100” if none of the flanking bonds are matched and no boost. I need to re-read their 2004 paper to remind myself of how this boost is calculated.
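To make the consecutivity idea concrete, here is a minimal sketch that builds the five-character pattern for each peptide bond from a list saying which bonds were matched by a fragment of one ion series. It only produces the patterns ("11111", "00100", and so on); how pFind converts a pattern into a score boost is not reproduced here. Treating positions beyond the ends of the peptide as unmatched is an assumption, as is the function name.

```python
def consecutivity_patterns(matched, flank=2):
    """Return one pattern string per peptide bond.

    matched: list of booleans, one per peptide bond, True when a fragment
             of the ion series was observed for that bond.
    flank:   number of neighboring bonds considered on each side
             (2 gives the five-character patterns described above).
    Positions past either end of the peptide count as unmatched (assumption).
    """
    padded = [False] * flank + list(matched) + [False] * flank
    patterns = []
    for i in range(len(matched)):
        window = padded[i:i + 2 * flank + 1]
        patterns.append("".join("1" if hit else "0" for hit in window))
    return patterns
```

For a peptide with five bonds that were all matched, the middle bond receives "11111", while an isolated match in the middle of an otherwise unmatched stretch receives "00100".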

September 23, 2016

My final day with the Institute of Computing Technology began with a deferred presentation by graduate student Wen-Jing Zhou, who had entered the exciting world of proteogenomics. As perceived by the ICT team, proteogenomics revolves around the question of genome annotation. How mature are the gene-calls that have been produced for the species we investigate in the laboratory? When we find peptides that map to regions outside of canonical genes, can we trust them? They have attempted to create systems to estimate FDR for novel peptides separately from those that conform to known genes. The group is making use of M. tuberculosis data sets from Ruedi Aebersold, Akhilesh Pandey, and their collaborator, a Professor Xu. While the sets from the first two investigators include tens of LC-MS/MS experiments, their collaborator has produced a couple hundred, summing to five million tandem mass spectra. I am hoping to acquire these for local use at Stellenbosch University. [These are links to her two external data sets: Aebersold and Pandey]
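The idea of estimating FDR separately for novel and known peptides can be sketched with the familiar target-decoy approach, counting decoys and targets within each class rather than over the pooled results. The presentation did not spell out the ICT team's method, so the target-decoy estimate, the tuple layout, and the class labels below are all assumptions for illustration.

```python
from collections import defaultdict


def class_specific_fdr(psms, threshold):
    """Estimate FDR per peptide class as decoys / targets above a threshold.

    psms: iterable of (score, is_decoy, peptide_class) tuples, where
          peptide_class is a label such as "known" or "novel" (hypothetical).
    Returns a dict mapping each class to its estimate, or None when no
    target PSMs of that class pass the threshold.
    """
    targets = defaultdict(int)
    decoys = defaultdict(int)
    for score, is_decoy, peptide_class in psms:
        if score >= threshold:
            if is_decoy:
                decoys[peptide_class] += 1
            else:
                targets[peptide_class] += 1
    classes = set(targets) | set(decoys)
    return {
        c: (decoys[c] / targets[c] if targets[c] else None) for c in classes
    }
```

Because novel peptides are usually a small minority of the identifications, a pooled estimate tends to be dominated by the known class; keeping the counts separate shows how much less certain the novel identifications are at the same score threshold.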

Taken together, these presentations represented a pretty incredible breadth of scholarship.  I am delighted that I got a behind-the-scenes look at their work, in some cases even before that work had been published!

China: Seek and ye shall pFind

An index to the China series appears at the first post.

September 21st, 2016

Today represented the core of my work with the Institute of Computing Technology. I was here to visit the “pFinders,” a group of five faculty members who produce algorithms for computational proteomics. I started with an interview with two of the senior faculty, Si-Min He and Rui-Xiang Sun. We have had extensive interactions over the years from a distance, but I had not previously met Professor He in person. We got along well from the word “go.” I have tremendous respect for the work that the pFinders have produced over the last ten years. I first began to understand the significance of this effort when I read their work adapting their database search engine pFind for use in electron transfer dissociation (ETD) data sets. Rather than make perfunctory alterations to which fragment ions were modeled to appear for a given sequence, the team gave serious thought to the problems associated with neutral loss ions from the intact peptide ion. Their paper and one from Robert Chalkley at UCSF had been the standouts in my mind for explaining how to accommodate this data type. I was very lucky to see Dr. Sun again (we had met years before at the UCSD RECOMB Proteomics meetings). He was on his way in just one week to spend some sabbatical time at the University of Wisconsin. It was a very collegial start to my day.


Si-Min He appears to my left, and Rui-Xiang Sun is to the right of me.

I chose to present an open problem for the proteomics community. What is the best way in which to identify proteins via tandem mass spectrometry from non-model organisms (i.e. those for which we have not already produced an annotated genome)? I presented a case study of a plant that my colleagues at the University of the Western Cape had been investigating. What species would be the best reference for understanding its transcriptome? How can we leverage RNA-Seq for proteomics?  From there, I moved to the problems posed by meta-proteomics, cases in which a sample features the proteins of many species that coexist in an ecological niche. I spoke mostly about ecological problems that we can address with this technology, but the same problems arise in understanding the microbial communities that inhabit our bodies. We talked about three workflows that could potentially enable these investigations. I think the pFind team is quite well-suited to implement software for one of those avenues. The group was highly interactive in the course of the talk, and I hope that some of them will follow up with possible solutions.


Look at the size of this proteome informatics group!

We enjoyed lunch most thoroughly at a lovely restaurant just down the street from the ICT building. I would highlight two aspects of the meal. First, one of our dishes contained some fascinating knots; yes, they looked exactly like strips of fabric tied in knots, covered in a brown sauce! The lunch group reaffirmed many times that the fabric was actually tofu. I was dubious, but I took a bite anyway. Fabulous! Second, I had made a passing remark about a dessert in the menu, and the group took it as a prompt to add it to our meal. I am so glad they did! It appeared to be a gelatin of sorts, again made from soy. The flavors, though, were truly outstanding. The center line of the snack represented tiny pieces of dates, and the white bands were coconut. I ate three of them!


My sweet tooth works on all continents.

After lunch, two of the faculty presented an overview of the program. I began to understand just how big a scope this group has attempted. Their papers really have grown substantially in number. Just look at the major toolsets this team has generated:

The database search engine is the fundamental tool for routine peptide identification. pFind is one of the most fully-featured and discriminating tools of the type.
Handling peptides that have been chemically cross-linked to each other is a daunting informatic task. They produced one of the earliest genuine successes in this space.
Inferring a sequence directly from a tandem mass spectrum is not easy, but this group has developed a powerful system for that purpose.
If one does not digest the proteins before introducing them to the tandem mass spectrometer, the produced spectra will be far more complex than for peptides. They have published a tool that competes well against others.
This software is the subject of my latest effort to publish with this group. It quantifies proteins and peptides by comparing the intensity seen for peptides between pairs of LC-MS/MS experiments or in isotopically labeled experiments.
Proteomics has shown itself to be a powerful complement to genomics; we call this field proteogenomics. Of course this talented team has taken an interest in the possibilities from multi-Omics!
Recognizing the structure of sugars that have been connected to a peptide is a daunting computational problem, but the group has tackled it with their usual flair.

Near the end of the day, graduate student Hao Yang presented the work he had been conducting in order to recognize the uncharacterized chemical modifications that appear on individual peptides. He has been adapting the pNovo project to make these open searches possible. I asked him any number of questions, and he did quite a nice job in bouncing back from each challenge. I am always happy to see young scientists responding well to the scrutiny each advance receives. The future arrives bit by bit, with each brick atop one emplaced by earlier research.

Dave goes back to work: the Colorado State connection

I have been very fortunate to spend this week at Colorado State University in Ft. Collins, CO.  The university agreed to temporarily appoint me as a Visiting Scientist in the Department of Microbiology, Immunology, and Pathology, starting on October 1st and continuing through the end of 2015.  As a result, my short-term “retirement” is at an end.  One may reasonably ask just what I am doing that justifies my appointment as temporary faculty.  In this post, I will try to explain our goals for this appointment.

The research at Colorado State is a natural bridge between the work carried out in my prior laboratory and the work in which I will be engaged in South Africa.  In the last couple of years, my team has begun work on algorithms for lipid identification.  My last graduate student at my former institution produced results that enabled him to give an oral presentation at the 2015 American Society for Mass Spectrometry meeting.  We are now completing the publication of his innovations.

Lipid mass spectrometry is a very big deal at Colorado State.  Research teams here have interrogated the lipids associated with infectious disease, with a particular emphasis on the Mycobacteria.  M. tuberculosis has drawn their attention because of its impact on human health.  Bacteria are far more diverse in the biochemistry that they deploy than are animals.  Many of the lipids produced by M. tuberculosis are specific to this organism, and understanding the functional role played by these lipids can help us to understand better how the bug interacts with human tissue.

Colorado State has been making considerable headway in deploying tandem mass spectrometry to yield inventories of bacterial lipids (their lipidomes), and my lab has developed software to identify major classes of lipids from tandem mass spectra.  If we can extend our software to handle the glycosylated lipids of M. tuberculosis, we will greatly improve their ability to translate data into knowledge.

Perhaps the best irony of my appointment is that the graduate student who wrote the lipid identification software spent much of his life living right here in Colorado!

John Belisle, Dave Tabb, and Nurul Islam take a break in the Research Innovation Center.
