As many of you know, I have specialized in the field of proteomics, the study of complex mixtures of proteins that may be characteristic of a disease state, development stage, tissue type, etc. Here in South Africa, my application focus has shifted from colon cancer to tuberculosis. As a newcomer to this field, I’ve been curious to know whether the field of tuberculosis has good information resources to leverage in its fight against the disease.
The key resource any proteomics group can leverage is the sequence database, specifically the list of all protein sequences encoded by the genome in question. The human genome incorporates around 20,310 protein-coding genes (reduced from estimates of 26,588 from the 2001 publication), but those genes code for upwards of 70,000 distinct proteins through alternative splicing. Bacteria are able to get by with far smaller numbers of genes. E. coli, for example, functions with only 4309 proteins. The organism that infects humans and other animals to produce tuberculosis is named Mycobacterium tuberculosis. If we were to rely upon the excellent UniProt database, from which I quoted E. coli protein-coding gene counts, we would probably conclude that M. tuberculosis relies upon even fewer genes: only 3993 (3997 proteins)!
UniProt is an excellent all-around resource for proteomics, but researchers in a particular field usually gravitate to a data resource that is particular to their organism. People who work with C. elegans for developmental studies, for example, use WormBase. People who study genetics with D. melanogaster would use FlyBase. People in tuberculosis have frequently turned to TubercuList for its annotation of the M.tb genome (comprising 4031 proteins). This database, however, has not been updated since March of 2013 (available from the “What’s New” page). Can it still be considered current, four years later?
As a recent import from clinical proteogenomics, my first impulse is still to run to the genome-derived sequence databases of NCBI, particularly its RefSeq collection. I found a NCBI genome for M. tuberculosis there, with a last modification date from May 21, 2016 and indicating its annotation was based upon “ASM19595v2,” a particular assembly of the sequencing data. This was echoed when I ran to Ensembl, another site most commonly used for eukaryotic species (such as humans) rather than prokaryotic organisms (such as bacteria). Their Ensembl tuberculosis proteome was built upon the same assembly as was the one from NCBI.
As a former post-doc from Oak Ridge National Laboratory, I am always likely to think of the Department of Energy’s Joint Genome Institute. The DOE sequences “bugs” (slang for bacteria) like nobody’s business. Invariably, I find that I can retrieve a complete proteome for a rare bacterium at JGI which is represented by only a handful of proteins in UniProt! This makes JGI a great resource for people who work in “microbiome” projects, where samples contain proteins from an unknown number of micro-organisms. In any case, they had many genomes that had been sequenced for tuberculosis (using the Genome Portal, I enumerated projects for Taxonomy ID 1773). I settled for two that were in finished state, one by Manoj Pillay that appeared to serve as the reference genome and another by Cole that appeared to be an orthogonal attempt to re-annotate the genome from fresh sequencing experiments.
The easiest way to compare the six databases I had accumulated for M. tuberculosis is to enumerate the sequences in each database. The FASTA file format is very simple; if you can count the number of lines in the file that start with ‘>’, you know how many different sequences there are! I used the GNU tool “grep” to count them:
grep -c "^>" *.fasta
- TubercuList: 4031 proteins
- NCBI GCF: 3906 proteins
- DOE JGI Cole: 4076 proteins
- DOE JGI Pillay: 4048 proteins
- Ensembl: 4018 proteins
- UniProt: 3997 proteins
So far, one could certainly be excused for thinking that these databases are very nearly identical. Of course, databases may contain very similar numbers of sequences without containing the same sequences. One might count how many sequences are duplicated among these databases, but identity is too tough a criterion (sequences can be similar without being identical). For example, database A may contain a long protein for gene 1 while database B contains just part of that long protein sequence for gene 1. Database A may be constructed from one gene assembly while Database B is constructed from an altogether different gene assembly, meaning that small genetic variations may lead to small proteomic variations.
I opted to use OrthoVenn, a rather powerful tool for analyzing these sequence database similarities. The tool was published in 2015. Almost immediately, I ran into a vexing problem. The Venn diagram created by the software left out TubercuList! I was delighted to get a rapid response from Yi Wang, the author of the tool (through funding of the United States Department of Agriculture’s Agricultural Research Service). The tool could not process TubercuList because it contained disallowed characters in its sequence! I followed his tip to sniff the file very closely. I found that both sequence entries and accession numbers contained characters they should not. Specifically, I found these interloping characters:
+ * ' #
Scrubbing those bonus characters from the database allowed the OrthoVenn software to run perfectly. Before we leave the subject, I would comment that these characters would cause problems for almost any program designed to read FASTA databases; in some cases, for example, the protein containing one of those characters might be prevented from being identified because of these inclusions! My read is that they were introduced by manual typing errors; they are not frequent, and they appeared at a variety of locations. Let’s remember that they have been in place for four years, with no subsequent database release!
Most people are accustomed to seeing Venn diagrams that incorporate two or three circles. In this case I compelled the software to compare six different sets. The bars shown at the bottom of the image show the numbers of clusters in each database; note that these differ from the number of sequences reported in my bullet list above because OrthoVenn recognizes that sequences within a single database may be highly redundant of each other! (If sequences were completely identical, they could be screened out by the Proteomic Analysis Workbench from OHSU.) Looking back at the six-pointed star drawn by the software, we might conclude that the overlap is nearly perfect among these databases. We see four clusters specific to the JGI Pillay database, and 131 clusters specific to some sub-population of the databases, but the great bulk of clusters (3667) are apparently shared among all six databases!
Oh, how much difference a visualization makes! Shifting the visualization to “Edwards‘ Venn” alters the picture considerably. Now we see that the star version hides the labels for some combinations of database. We see that 3667 clusters are indeed shared among all six databases. After that, we can descend in counts to 131 clusters found in the Pillay and Cole databases from JGI; does this reflect a difference in how JGI runs its assemblies? Next we step to 106 clusters found in UniProt, Ensembl, Tuberculist, and NCBI GCF, but neither of the JGI databases. The next sets down represent 70 clusters found in all but NCBI GCF or 25 clusters found in all but the two JGI databases and NCBI GCF.
I interpret this set of intersections to say that tuberculosis researchers are faced with a bit of a dilemma. If they use a JGI database, they’ll miss the 106 clusters in all the other databases. If they use Ensembl or TubercuList, they will include those 106 but lose the 131 clusters specific to the JGI databases. Helpfully, OrthoVenn shows explicitly which sequences map to which clusters. Remember that when I downloaded the Ensembl and NCBI databases, I saw that they were both based upon a single genome assembly called ASM19595v2. Did they contain exactly the same genes? No! Ensembl contained two fairly big sets of genes that NCBI omitted, including 70 and 25 protein clusters, respectively. NCBI contains another 11 protein clusters that were omitted from Ensembl. Just because two databases stem from the same assembly does not imply that they have identical content.
For my part, I may use some non-quantitative means to decide upon a database. I do not like making manual edits to a database since then others need to know exactly which edits I’ve made to reproduce my work. That takes away TubercuList. Next, I feel strongly that the FASTA database should contain useful text descriptions for each accession. Take a look at the lack of information TubercuList provides for its first protein:
That’s right. Nothing! The Joint Genome Institute databases are quite similar in omitting the description lines. Compare that to what we see in the NCBI and UniProt databases:
NP_214515.1 chromosomal replication initiator protein DnaA [Mycobacterium tuberculosis H37Rv]
sp|P9WNW3|DNAA_MYCTU Chromosomal replication initiator protein DnaA OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=dnaA PE=1 SV=1
That’s much more informative. We’ve got missing data here, too, though. Tuberculosis researchers have grown accustomed to their “Rv numbers” to describe their most familiar genes/proteins, but NCBI and UniProt leave those numbers out of well-characterized genes; the Rv numbers still appear for less well-characterized proteins, such as hypothetical proteins. By comparison, Ensembl includes textual descriptions as well as Rv numbers in a machine-parseable format for every entry:
CCP42723 pep chromosome:ASM19595v2:Chromosome:1:1524:1 gene:Rv0001 transcript:CCP42723 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:dnaA description:Chromosomal replication initiator protein DnaA
On this basis, I believe Ensembl may be the best option for tuberculosis researchers. It is kept up-to-date while TubercuList is not, and it allows researchers to refer back to the old Rv number system in each description.
I hope that this view “under the hood” has helped you understand a bit more of the kind of question that occasionally bedevils a bioinformaticist!