An index to the China series appears at the first post.
September 22, 2016
A kindred spirit
I had a delightful time this morning speaking with Dr. Dongbo Bu in his office. My attention was immediately drawn to a small museum of early computers he has collected near the entrance to his office. His Mac 128k from 1984 and other early PCs were blasts from the past, but three members of his collection were quite remarkable. The first was a Turing machine that he had built from LEGO Technics. The computer has a memory of approximately sixteen bits, and it can read from the memory, write to its memory, and perform logical operations based on its contents. An even more impressive accomplishment was a larger LEGO construction of considerable complexity. He was delighted that I was able to recognize it as a working model of the Charles Babbage difference engine. Before we left the table behind, he was proud to show me a set of Napier’s Bones that he had commissioned from a local woodworker when he could not find a set to purchase.
Dr. Bu and I share a lot of research interests. I was very excited by his work on predicting peptide fragmentation. He has returned to a theme I raised in my 2003 paper on the subject, namely that each amino acid has a different propensity to steer fragmentation toward its N-terminal or C-terminal side. His contribution is to model the fragmentation effect when five successive amino acids are considered as a whole. In brief, an amino acid’s steering in one direction or another is biased by the amino acids immediately adjacent and, to a lesser degree, by the amino acids adjacent to its neighbors. He has used the biases seen in five-mers of amino acids to make an optimized model that can be used to very quickly simulate the tandem mass spectrum for a particular doubly-charged peptide.
He showed me a second project that had drawn his interest. When we fragment a glycopeptide, we typically see a somewhat uninformative tandem mass spectrum. The standard practice that follows is to produce an MS^3 spectrum that shows the fragments produced from the most intense fragment we see in the tandem mass spectrum. Dr. Bu’s efforts have shown that less intense fragments can frequently be more informative for glycoproteomics, and he has developed a Bayesian strategy to predict which fragment will be most useful for multi-stage mass spectrometry to recognize the structure of the oligosaccharide.
Presentations from the future: graduate students step forward
The pFind team started today with overviews of two different projects. Chao Liu began the session with a discussion of pLink. The software has undergone considerable revision since the initial release, which was published in Nature Methods. The group has added new elements to pParse that enable them to recognize monoisotopes with better accuracy and to fine-tune the mass accuracy of the precursor ions. pLink has also been improved by a machine learning boost to its PSM discrimination as well as an ion index to greatly accelerate the checking of a spectrum against a particular peptide sequence. They report that the modifications to their software make it perform like an altogether different beast. Their comparisons with Kojak, a recently published tool from the MacCoss Laboratory, suggest that their updated software offers quite exceptional performance.
The pGlyco effort has also gained traction in its abilities through a similar set of changes. Senior graduate student Wen-Feng Zeng demonstrated the rapidity and sensitivity his software has managed to achieve. Like other tools in the pFind suite, his code benefits from the improvements to pParse as well as the PSM recovery boost possible through machine learning. His software is able to make use of multiple levels of tandem mass spectrometry to recognize the components of the complex oligosaccharide structure. The fragments associated with oligosaccharide components can then be sought in a database of previously observed sugar structures to recognize the one represented by a set of multistage spectra. I am glad to see this project moving along so effectively. I pointed the student toward two glycosylation data sets that I hoped he would find interesting. The first was one created by CPTAC 1 by three laboratories, using instruments from two different vendors. The second was one published by a student who worked in my lab for a semester: (Dr.) Margaret Baker from the University of Hawaii. I admit that I’m simply curious to see what his code could do with the quality data she produced at Vanderbilt!
Because we had consumed so much time with these meetings, we headed off to lunch without the planned presentation of pAnno or of the updated pFind. pAnno was delayed to the following morning, just before our road trip to the Great Wall. I interrogated pFind researcher Hao Chi over our tasty lunch in the cafeteria. He reported that a feature I have wanted to see for years has been implemented! The pParse software is able to recognize multiple isotopic envelopes that occur within the isolation window in the production of a tandem mass spectrum. pFind has been altered to duplicate the spectrum so that it gets searched for each of the precursor masses and charge states that fall within the isolation window! When multiple identifications have been produced for a given tandem mass spectrum, the lower-scoring PSMs may be eliminated from further consideration if they share more than three fragment ions in their assigned lists. The software has also benefited from the machine learning capabilities and ion index acceleration mentioned for pLink. We spent a fair amount of time at lunch discussing the pros and cons of pFind’s consecutivity criterion (called a “kernel trick”): each peptide bond is evaluated in the context of the next two peptide bonds in both directions. In effect, a bond gets a score of “11111” if all of these bonds are matched to a fragment of this series (and thus an exponential boost to its contribution to the score), but it gets a “00100”, and no boost, if none of the flanking bonds are matched. I need to re-read their 2004 paper to remind myself of how this boost is calculated.
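To make the five-bond window concrete, here is a minimal Python sketch of how the context string for each peptide bond could be assembled from a list of matched/unmatched flags for one ion series. The function name `context_patterns` and the choice to pad positions beyond the ends of the peptide with "0" are my own assumptions for illustration, not pFind's code. The sketch deliberately stops at the pattern itself, since the actual boost formula is the part I still need to look up.

```python
def context_patterns(matched):
    """Return a five-character context string for every peptide bond.

    matched[i] is True when bond i was matched to a fragment of one ion
    series. Positions beyond either end of the peptide are padded with
    "0" (an assumption made for this illustration).
    """
    n = len(matched)
    patterns = []
    for i in range(n):
        window = []
        # Look two bonds to the left and two bonds to the right of bond i.
        for j in range(i - 2, i + 3):
            inside = 0 <= j < n
            window.append("1" if inside and matched[j] else "0")
        patterns.append("".join(window))
    return patterns


# A fully matched stretch versus an isolated match, viewed from the middle bond.
full = context_patterns([True, True, True, True, True])
lone = context_patterns([False, False, True, False, False])
print(full[2], lone[2])
```

The middle bond of the fully matched series yields "11111", while the isolated match yields "00100", the two extremes described above.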
September 23, 2016
My final day with the Institute of Computing Technology began with a deferred presentation by graduate student Wen-Jing Zhou, who had entered the exciting world of proteogenomics. As perceived by the ICT team, proteogenomics revolves around the question of genome annotation. How mature are the gene-calls that have been produced for the species we investigate in the laboratory? When we find peptides that map to regions outside of canonical genes, can we trust them? They have attempted to create systems to estimate FDR for novel peptides separately from those that conform to known genes. The group is making use of M. tuberculosis data sets from Ruedi Aebersold, Akhilesh Pandey, and their collaborator, a professor Xu. While the sets from the first two investigators include tens of LC-MS/MS experiments, their collaborator has produced a couple hundred, summing to five million tandem mass spectra. I am hoping to acquire these for local use at Stellenbosch University. [These are links to her two external data sets: Aebersold and Pandey]
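For readers unfamiliar with the idea of class-specific FDR, the following Python sketch shows one way it could work. It assumes a target-decoy search (the presentation did not say which estimator the ICT team uses), and the `PSM` record, its field names, and the `class_fdr` function are all hypothetical. The point is only that decoys and targets are counted within each class, novel or known, rather than across the pooled results.

```python
from collections import namedtuple

# Hypothetical record: a search score, whether the match hit a decoy
# sequence, and whether the peptide falls outside known gene models.
PSM = namedtuple("PSM", ["score", "is_decoy", "is_novel"])


def class_fdr(psms, threshold, novel):
    """Estimate FDR as decoys / targets within one class of PSMs.

    Only PSMs of the requested class (novel or known) scoring at or
    above the threshold are counted. Returns None when no target PSMs
    pass, because the ratio is undefined in that case.
    """
    selected = [p for p in psms if p.is_novel == novel and p.score >= threshold]
    decoys = sum(1 for p in selected if p.is_decoy)
    targets = len(selected) - decoys
    if targets == 0:
        return None
    return decoys / targets


# Toy data, invented purely to exercise the function.
toy = [
    PSM(9.0, False, False), PSM(8.5, False, False),
    PSM(8.0, False, False), PSM(7.5, False, False),
    PSM(7.0, True, False),
    PSM(8.2, False, True), PSM(7.8, False, True),
    PSM(7.1, True, True),
    PSM(3.0, True, True),  # falls below the threshold used here
]
print(class_fdr(toy, 5.0, novel=False), class_fdr(toy, 5.0, novel=True))
```

In this invented example the same score threshold gives a higher estimated error rate for the novel class than for the known class, which is the kind of difference a pooled estimate would hide.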
Taken together, these presentations represented a pretty incredible breadth of scholarship. I am delighted that I got a behind-the-scenes look at their work, in some cases even before that work had been published!