Tag Archives: science

The birth of a new conference

November 1, 2017

How does a new conference enter the academic calendar? I was encouraged by the example set by the Clinical Proteomics / Post-Genome Medicine meeting (ClinProt 2017), and I thought it might be useful to talk about some of the things that the group did really well, while relating a bit of what unfolded for me in my last day at the conference.

First off, this is far from the first meeting to take place in Russia on the subject of human proteomics. The Russian Human Proteome Organization has been operating since 2002, and it sponsors two distinct meetings yearly. The main meeting takes place in the city of Kazan each October. Members who are particularly interested in bioinformatics may participate in an annual meeting in the city of Novosibirsk (Bioinformatics of Genome Regulation and Structure / Systems Biology). The RHUPO has also successfully organized a big event for the world HUPO; in 2009, Dr. Alexander Archakov hosted the third Human Proteome Project Workshop in Moscow!

Human Proteome Project, Moscow (2009)

The ClinProt 2017 meeting seemed special in that it sought to foster connections among many different institutions within Russia; the program was salted with several investigators across Europe and the broader world, but the emphasis seemed to be on developing networks within the country, including multidisciplinary links. As I look across the eleven-member organizational team in the conference program, I see five different research institutions, all in Russia, represented by post-doctoral scientists. This team of junior researchers will all have valuable experience for the future, and senior scientists who attended the meetings will remember who they could rely upon when trying to solve a last-minute problem before a talk!

I would catalog several things, then, that the organizers did right:

Skin in the game
Because several institutions contributed organizers, more schools sent speakers, poster presenters, and trainees. In total, 350 people registered, and 274 attended. That’s pretty great for a first conference!
Personal touch
Several speakers mentioned that they had been recruited by an organizer who knew them from prior contact. Since professors frequently get spammed by for-profit conferences, these personal contacts made a difference in getting the names they wanted for the meeting agenda.
Detail focus
I heard several of the organizers quietly worrying about whether something was going to go just right. Throughout, it was clear that each person knew what his or her responsibility included. The team was definitely committed.
Industry works
I occasionally hear academics sneer at the inclusion of instrument and reagent vendors in speaker rosters, but their participation in a meeting adds more than just money. I was glad to see a representative from Helicon lecture on the value of CyTOF for cell counting applications, since I am mentoring a student working with such data.

I became aware that we had some special guests today as I lingered in the speaker ready room. Several people in suits made an appearance. I had a rapid conversation with Sergey Suchkov, an M.D. and Ph.D. who has a relentless energy about him. He has a strong interest in developing relationships among BRICS nations in the field of “precision medicine” (sometimes called “personalized medicine”), and he wanted to talk about some possibilities between South Africa and Russia in that space. We agreed to touch base this afternoon when he could introduce me to another M.D. Ph.D. friend of his who has become involved in genome bioinformatics. That meeting put forward some interesting possibilities in tuberculosis, which has become problematic in the Russian prison system. I hope we will be able to define some projects we can pursue together in this space.

Right away, though, I had to leave our discussion to teach my afternoon workshop on performing post-hoc quality control assessment in large-scale proteome projects. I was very grateful that the conference organizers could add a link to their website so that participants could download the R statistical script and input files for the workshop directly from the link above. That way, the conference attendees who needed to leave Moscow early could still get access to the tutorial.

Image from my paper with Xia Wang introducing “IDFree” metrics

This was my first time teaching a workshop on quality control. My normal curriculum has emphasized protein identification or the recognition of post-translational modifications. Since I am now chairing the HUPO-PSI working group on quality control, though, it was a good time for me to put together some training materials in this space. I chose a highly visible data set, the 1425 LC-MS/MS experiments that the Vanderbilt team produced from colorectal cancer samples for the National Cancer Institute CPTAC program. The workshop would focus on recreating figures that Xia Wang at U-Cincinnati had scripted in the R statistical environment from tables of QC metrics that my team had generated.

I was really pleased with the dozen or so students who attended the workshop. Their questions were very good, and their understanding of the statistical concepts was at a very high level. To give one example, a student asked how differently the files would have spread in my plot of the first two principal components if we had used ordinary PCA rather than robust PCA. Another asked how hierarchical clustering would visualize these data in principal components space. These are not the questions one encounters with people who have never seen PCA before!

So color me impressed. This meeting ran like clockwork, and the students came ready to learn. The speaker list did not have some of the biggest names in world proteomics, but in fact I trusted what I was hearing more because it came from investigators who had worked at the bench more recently. I am of course grateful for the time I’ve been given to see Russia first-hand, but in the end I was brought here to teach and to learn. I enjoyed both missions!

 

Clinical proteomics in Russia and my last pair of pants

October 30, 2017

At last the first day of ClinProt 2017 had arrived! I set aside my now-muddy pairs of jeans in favor of my fresh and clean blue dress pants, laced up my shiny black shoes, and put on my enthusiastic green shirt. With a spot of breakfast downstairs (on my third morning eating there, I found that the milk jug was full for the first time!), I was ready to meet with the others for a shuttle van ride over to the conference.

Moscow traffic at 8:20 AM is a bit intense. The drivers here are a bit more careful of road laws than I have seen in other countries, but they still produce some pretty creative merges in their traffic jams. What would have been a few minutes on the subway was more like a half hour on the road, but my dress pants were still pristine when we arrived at the Congress Center at the I.M. Sechenov First Moscow State Medical University. The facility had a lovely central hall, with a graceful split staircase to the two main venues for our meeting. I hadn’t seen lecture halls in which an array of nine HDTVs replaced the more typical projector. It certainly produced a bright image, though the borders between screens were distracting.

Why project when you can emit?

The Clinical Proteomics 2017 meeting was organized because a confluence of groups wanted to consolidate researchers in this country. EuPA, the European Proteomics Association, helps to integrate activities that span national proteomics societies. The Russian Human Proteomics Organization (RHUPO) sought to foster a sense of community among Russian research groups in this area. The Sechenov First Moscow State Medical University was happy to contribute a venue for the event, and many instrument, reagent, and other vendors agreed to take part, as well. I haven’t learned the total count of attendees yet, but I know that there are 87 research posters. For a first effort, I think it is clear that a great many things have gone well.

From the very first talk, it was apparent that Russian clinical proteomics researchers are grappling with challenges that became familiar to me as part of the National Cancer Institute (NCI) CPTAC program. Anna Kudryavtseva discussed her efforts to reconcile proteomics data with those that had been produced by the NCI's Cancer Genome Atlas (TCGA) project, working in a particular sub-type of head and neck cancer. Prioritizing genes that were more frequent targets of mutation in tumors has value for understanding which proteins are most useful to monitor closely, for example. It was a great “plenary” (all attendees) talk to kick off these discussions.

As soon as we split to multiple sessions, I was on duty. I co-chaired the “Genomics and Beyond” panel with Sergey Moshkovskii. It was a bit odd to be fielding this panel while the Protein Informatics workshop was taking place in another room (that topic has been my bread and butter for two decades)! In this case, however, Sergey and I were not only chairing the session but also leading it with our two lectures, both in the field of proteogenomics.

Photo credit: Olga Kiseleva

I defined the term by saying that we want to improve our interpretation of genomic data by integrating proteomics data, and we want to improve our interpretation of proteomics data by integrating genomic data (I was trying to be ecumenical). From there, I led the group through the new paper that I’ve published with Anzaan Dippenaar and Tiaan Heunis, in which we demonstrated our ability to recognize sequence variations and novel genes in Mycobacterium tuberculosis “bugs” that had been isolated from patient sputum in South Africa. Sergey followed up by finding evidence of RNA editing in fruit flies.

Photo credit: Olga Kiseleva

The other speakers in the panel were also quite interesting. Matthias Schwab was visiting from Germany, and he educated the group on the current status of the field of pharmacogenomics. Vladimir Strelnikov, a geneticist, described the value of bisulfite sequencing for measuring DNA methylation in breast cancer. Sergey Radko outlined a SISCAPA-like strategy for using “aptamers” to enrich proteins prior to Selected Reaction Monitoring. Artem Muravev closed out the session to discuss the challenges of biobanking. This last talk was delivered in Russian, so I benefited quite a lot from real-time translation to English by Anastasia, one of two translators fielding our session (during my talk, she had been translating my words to Russian as I worked through my slides). Finally all the speakers came together for fifteen minutes of question and answer. I tweaked our pharmacogenomics speaker a little bit by saying that even if we had the complete sequences for every human on earth in our hands today, personalized medicine would not have arrived!

With the morning complete, everyone adjourned to a nearby restaurant. I was a little leery when I learned our destination was the Black Market, but I needn’t have worried; we wandered down the street to a lovely restaurant named “Black Market.” I had the Black Market Burger and felt thoroughly happy. I felt very grateful that the European Proteomics Association picked up the bill for that morning’s speakers!

Back in the conference, I enjoyed hearing my long-time friend David Goodlett discuss his long-term monitoring study of diabetes. He’s a careful guy, and it is good to see that he can make label-free proteomics sing in biofluids (a tough space to work), recognizing protein pairs for which expression can flag the onset of disease. It’s very reminiscent of the kind of study Stellenbosch University has produced in the space of tuberculosis. Our next speaker returned to the subject of biobanking, and he delivered his talk via Skype, not my favorite format. I am a big believer in contact with my audience.

Did I mention it was my enthusiastic green shirt?

I threw all my remaining energy into the poster session. Interacting with researchers at the start of their careers is very rewarding, and people who stand beside their work without knowing whether or not anyone will take interest have a hard job. These students were even braver, since they were prepared to defend their work in English!

I started with a poster very near and dear to my heart. A.V. Mikurova was evaluating the different levels of sequence coverage achieved by database search (Mascot, X!Tandem) and de novo algorithms (PepNovo+, Novor, and PEAKS) when working with 27 LC-MS/MS experiments for a defined mixture of human proteins. We discussed the relative unresponsiveness of sequence coverage as a metric for performance evaluation and the challenge of ensuring the algorithms had comparable configuration. I asked S.E. Novikova about her choices of statistical model for a time-series measurement of proteomes in response to all-trans retinoic acid. I hope my statistics lectures online will be useful to her, though it sounds like she’s already on the right track. N.V. Kuznetsova taught me a few things I didn’t know about celiac disease! She had been evaluating the ability of Triticain-α to degrade the most immunogenic peptide of gluten-family proteins. Finally, J. Bespyatykh was presenting a poster on the proteomics of Mycobacterium tuberculosis from a strain called Beijing B0/W148. Her work obviously had a strong relationship to what Tiaan and Anzaan had published with me, so we had a great conversation about the work. I hope we can help her find a sequence database that is a more ideal fit for her proteomes than the generic “H37Rv” protein database. I was really pleased to speak with so many students about their work at this meeting.

With that, I slumped onto a wall and didn’t move very much. The other conference attendees had flowed back into the conference room for an afternoon round of talks. I let my mind wander for a bit, though I did have some nice conversations with the vendors. Soon, though, I heard some odd noises echoing through the entry hallway. Was there a music practice room somewhere in the building? Was that a tuba?

Dixieland music in Moscow! Photo credit: Olga Kiseleva

My questions were answered when I eventually joined everyone downstairs for a catered closing reception. The organizers had invited a Dixieland band to perform for our reception! The group was really solid. I particularly liked one of their trumpeters, since he had a smooth Chuck Mangione vibe going on. I kept recognizing songs only part of the way, since they were singing many of the lyrics in Russian! I finally got a solid hit on “Mack the Knife!” I sat up close to enjoy the show.

With the evening at an end, I declined invitations to go hit a bar and walked to the nearby Frunzenskaya subway station. Two stops later, I was in my neighborhood. I trudged up the paved driveway to the street with my hotel. As I awaited the green light at my last crosswalk before the hotel entrance, a car drove too close to the curb where I was waiting, and dirty rainwater soaked my last clean pair of pants.

 

You can be an academic YouTube STAR!

Many universities have begun exploring the use of the Internet for sharing academic coursework, either via “flipped classrooms” or Massive Open On-Line Courses (MOOCs).  Over the last year, I have uploaded approximately 50 videos to my YouTube Channel, most of them academic lectures.  I hope that I have learned something in this process that will help you to publish your work more broadly, as well!

I would start by explaining that my lectures come from multiple purposes and even multiple university campuses.  My longest-running series of lectures came from a weekly seminar on topics of my own choosing called “the Useful Hour.”  I produced fourteen of these sessions (with help from Brigitte Glanzmann when I had to be away for a week), though I only started recording them on video for the last twelve.  I recorded the eight-session bioinformatics module from our division’s B.Sc. Honours program as a trial run for creating a “flipped classroom” in future years (a model where students watch lectures outside of class and spend in-class time working exercises).  More recently, I collaborated with the H3Africa BioNet to produce a four-lecture module on Gene Expression.  From time to time, I help the Tygerberg Postgraduate Student Council by recording a lecturer.  Each of these experiences has had its own lessons to convey.

The technical aspects of recording a video are generally easy enough that even a Ph.D. can do it!  Today’s budget camcorders capture more detail with better sound under lousier conditions than did cameras that cost five times as much even five years ago.  Best of all, one no longer needs to wrestle with tapes and analog-to-digital transfer loss.  Today we simply pull the Secure Digital card out of the camcorder and plug it into the socket on a laptop, where the video files are instantly accessible.  Of course, many people record video using digital cameras or cell phones.  Preparing videos for upload to a public server, however, is frequently more difficult than the initial capture.  I’ll talk about these aspects below.

Focus on the speaker

Speak softly, and carry a big stick!

We must start with video that is worth watching.  Far too frequently, I see that people recording lectures focus on the slides rather than the person who is delivering the lecture.  Reading text from video is generally unpleasant, and the reality is that looking at people fires circuits in our brains that academic content does not.  Video is a format designed to capture motion; it is a notoriously inefficient method for capturing still images, though!  Keeping the camera on the speaker, then, makes more sense.  This comes with some caveats:

  1. Viewers still need to be able to see the slides.  My answer has been to produce a PDF from the PowerPoint or other presentation software, since almost everyone has the ability to view PDFs on any platform.  I post the PDF to a shared directory on Google Drive, and I include the URL leading to the PDF in the YouTube description.
  2. From time to time a researcher will point to a particular part of a slide.  This is probably problematic on video; if he or she has used a laser pointer, the spot of light will either be too bright (green) or too dim (red) to appear well on video.  A moving mouse pointer might be better.  If the speaker is old-school (like me), he or she may use a stick to point at the slide instead.  This can create a problem of the lecturer “blooming” as he or she moves away from the bright field created by the projector into the relative dark outside the projector’s light.
  3. How will a person watching the video know to advance to the next slide?  Hopefully the speaker says “next slide” out loud.  When my parents recorded my brother’s and my first efforts to read aloud, they told us to bang a spoon against a mug to produce an audible chime with each page turn.  That was even more fun than reading!
  4. Software is publicly available to integrate the slowly-changing slide video with the quickly-moving speaker video.  Screencast-O-Matic will produce videos of up to fifteen minutes in its free version.  This approach will guarantee that your viewers are seeing the same slide the lecturer is seeing as the talk progresses.

Screencast-O-Matic insets your image atop the slides you are presenting.

Light and detail go hand in hand

As I alluded to above, lighting is frequently a problem in academic lecture videos.  We frequently keep our lecture halls very dim in order to make the slides stand out as much as possible.  In a large venue, you may have a spotlight on the speaker, which will help.  In a medium venue, you may have a light in the ceiling directly above the speaker, which can make him or her appear somewhat ghoulish.  The more you rely upon zoom, the less light will reach your camera!  Keep that camera close.  If you can open the blinds on a window so that your speaker is lit, you will have a more interesting video.  Try to find ways to position your camera between the light and the subject (without casting a shadow, of course).  Never forget that the projected slides are much brighter than the subject you are trying to record.  If even the corner of the projected image appears in-shot, expect the speaker to become a flat silhouette.

Today’s cameras can record in very high resolutions, such as 1080p (the same as your HD television).  If lighting is truly problematic, you may want to consider forcing your high-resolution camera to a lower resolution, such as 720p; this may allow it to combine intensities across multiple transistors for each pixel.  Similarly, you should expect that a camera with a larger “retina” will outperform one with a tiny CCD in low light.  To put this in plain terms, do not expect a cell phone to produce quality video in semi-darkness, no matter the name on the label.  That said, I have observed that my “mirrorless” Canon EOS-M2 is inferior to my much cheaper Canon VIXIA HF R62 for video.  The lenses and electronics of the EOS-M2 are optimized for photos, not video.

Privacy issues are a big deal

Ensure that your audience knows that the lecture is being recorded.  Bad things can happen when a person does not want his or her image to be on-line and somebody else decides that they shall be.  Imagine how much worse this becomes when that member of the audience is a minor!  Nobody should be forced into public view because he or she attends a talk.

We frequently expect a period of questions and answers at the end of a lecture (and sometimes in the middle).  A novice camera operator may automatically swing to capture the questioner in action.  Depending on the situation, this part of the video may need to be truncated outright due to privacy issues.

Video is big and hard to handle well

I use my hands a lot.

When I upgraded to my Canon VIXIA HF R62 from a JVC Everio (GZ-HM30AU), I had a rude shock.  My old camera had captured 720p video in very manageable MTS files, but the new camera captured 1080p video in massive MP4 volumes.  I used a 16 GB SDHC card for videos.  The cameras assumed that no file should be allowed to be larger than 4 GB (a limit of the FAT32 file system used on SDHC cards, which records each file's size as a 32-bit number).  With the new camera, I consume 4 GB every 33 minutes!  At a couple of long events I recorded, I found that I needed more storage than the 16 GB card could provide.  I solved that problem by upgrading to a 64 GB card.

Naturally, keeping the raw footage of every event I video is not practical.  If each of the 50 videos I posted to YouTube over the last year produced 66 minutes of raw footage, I would need to archive 400 GB for just this period!  Similarly, posting these videos to YouTube would be a problem.  Each hour would span two files, which would require my viewers to watch part ‘A’ and then queue up part ‘B’ immediately afterwards; many would just skip watching the end, humans being humans.  To compound the problem, I live in South Africa, which means my upload speeds to network servers are dreadfully slow.  My home DSL line, for example, achieves 0.3 Mbps.  I have uploaded one GB before, but it takes hours.  In any case, I will probably need to truncate a bit of time off the front and the back of the video.  In short, I need to do video editing.
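For readers who like to check the arithmetic, here is a rough sketch of the storage and upload figures quoted above.  It assumes decimal units (1 GB = 8000 megabits), a sustained upload at the quoted DSL speed, and no protocol overhead; the variable and function names are mine, chosen only for illustration.

```python
# Back-of-the-envelope figures taken from the paragraphs above.
GB_PER_FILE = 4             # the camera closes each file at 4 GB
MINUTES_PER_FILE = 33       # recording time that fills one 4 GB file
VIDEOS = 50                 # videos posted over the last year
RAW_MINUTES_PER_VIDEO = 66  # hypothetical raw footage per video
UPLOAD_MBPS = 0.3           # home DSL upload speed, megabits per second

gb_per_minute = GB_PER_FILE / MINUTES_PER_FILE
raw_gb_per_video = gb_per_minute * RAW_MINUTES_PER_VIDEO
archive_gb = raw_gb_per_video * VIDEOS


def upload_hours(size_gb, mbps):
    """Hours needed to upload size_gb gigabytes at mbps megabits per second."""
    megabits = size_gb * 8 * 1000  # assumption: decimal gigabytes
    return megabits / mbps / 3600


print(f"raw footage per video: {raw_gb_per_video:.1f} GB")
print(f"archive for the year:  {archive_gb:.0f} GB")
print(f"upload time for 1 GB:  {upload_hours(1, UPLOAD_MBPS):.1f} hours")
```

Under these assumptions a single GB needs a bit more than seven hours on the DSL line, which is why shrinking the video before upload matters so much.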

While semi-professionals might opt for Adobe Premiere and those who “think different” will break out iMovie, I am a bioinformaticist, and I like software that lets me master high-quality videos with a minimum of fuss and bother.  I use ffmpeg, a very powerful suite of tools that one can use directly on the command line.  Most of the time, I am (a) concatenating my source video files into one movie, (b) including only a middle section, and (c) writing a more compact movie from the source materials.  To use a recent example, I have two input files; I write their names into a file called list.txt:

file mvi_0031.mp4
file mvi_0032.mp4

Next, I run a command line that looks like this:

ffmpeg.exe -ss 00:00:15 -f concat -safe 0 -i list.txt -t 00:50:00 -c:v libx264 -preset slow -c:a copy output.mp4

In order, the options do the following:

  • -ss specifies where in the combined files ffmpeg will start the output video (in this example, after the first fifteen seconds).
  • -f concat -safe 0 -i list.txt specifies that the files listed in list.txt should be combined into one video; this approach assumes the files are formatted the same way, and -safe 0 tells ffmpeg to accept any file path written in the list.
  • -t specifies the total duration of the video to be encoded (in this example, exactly fifty minutes).
  • -c:v libx264 -preset slow specifies that my output video will be encoded as H.264 (MPEG-4 Part 10), a very common format for storing video (and one that YouTube knows how to read); the slow preset spends more encoding time in exchange for better compression.
  • -c:a copy directs ffmpeg not to re-compress the audio, making it sound just as nice in the output as it did in the original.
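If you repeat this recipe for many lectures, it can help to generate list.txt and the command line from a short script.  The sketch below is one possible way to do that in Python; the helper names write_concat_list and build_ffmpeg_command are hypothetical, and it only assembles the same options described above without running ffmpeg.

```python
from pathlib import Path


def write_concat_list(sources, list_path="list.txt"):
    """Write one 'file NAME' line per source, matching the list.txt layout above."""
    lines = [f"file {name}" for name in sources]
    Path(list_path).write_text("\n".join(lines) + "\n")
    return list_path


def build_ffmpeg_command(list_path, start, duration, output, exe="ffmpeg"):
    """Assemble the argument list for the concatenate, trim, and re-encode step."""
    return [
        exe,
        "-ss", start,                                   # where the output begins
        "-f", "concat", "-safe", "0", "-i", list_path,  # join the listed files
        "-t", duration,                                 # length of the output
        "-c:v", "libx264", "-preset", "slow",           # re-encode video as H.264
        "-c:a", "copy",                                 # pass the audio through
        output,
    ]


command = build_ffmpeg_command("list.txt", "00:00:15", "00:50:00", "output.mp4")
print(" ".join(command))
```

To actually run the encode, one could call write_concat_list(["mvi_0031.mp4", "mvi_0032.mp4"]) and then pass the list to subprocess.run(command, check=True).  File names that contain spaces or quotes need quoting inside list.txt, which this sketch does not attempt.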

The ffmpeg software is very good at reducing the size of videos without compromising their quality.  I find that I can represent an hour-long lecture in a two GB 1080p video, rather than the nearly 8 GB of source footage.  If I am filled with caffeine for my lecture, the video size increases a bit (more motion requires more bits for accurate representation).
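To put those sizes in terms of bit rate, here is a small sketch that converts "GB per so many minutes" into average megabits per second.  It assumes decimal units and uses only the figures already quoted (4 GB per 33 minutes from the camera, about 2 GB per hour after encoding); average_mbps is a name I made up for the example.

```python
def average_mbps(size_gb, minutes):
    """Average bit rate in megabits per second (assumes 1 GB = 8000 megabits)."""
    return size_gb * 8000 / (minutes * 60)


source = average_mbps(4, 33)   # camera footage: 4 GB every 33 minutes
encoded = average_mbps(2, 60)  # re-encoded lecture: about 2 GB per hour

print(f"source footage: about {source:.1f} Mbps")
print(f"encoded video:  about {encoded:.1f} Mbps")
print(f"reduction:      about {source / encoded:.1f}x")
```

By this estimate the camera writes roughly 16 Mbps while the re-encoded lecture averages a little over 4 Mbps.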

These smaller videos can then be uploaded to my YouTube account.  Happily, if you have a Gmail account (or if you use a different email address to log into Google Services), you can simply use that login for YouTube.  One clicks the arrow pointing up, and a screen will appear to which you drag your video file.  All done, right?

No job is finished until the paperwork is through!

Meta-data is key to your video reaching an audience, and too few people spend adequate time on this step.  I would call your attention to both the “Basic Info” and “Advanced Settings” pages that video authors can complete.  Of course, you should enter a paragraph of information in the basic description blank.  Ask yourself what web searches should find your video, and be sure you include those key terms in the text.  For good measure, add them again in the keywords section!  I like to include the university name where the recording took place.  Hopefully the social media minders for these schools will highlight your video to their large audiences.  YouTube will sniff the video for still frames that might be representative for the video.  I always try to pick the one in which I do not look like I’m suffering a fit of some sort.

Advanced Settings has more options to help users find your video.  Pick a category; generally my lectures fall in the “Science and Technology” category.  Be sure to enter a video location.  Google will translate your information to GPS coordinates so people can find videos shot near particular locations.  Enter a recording date, and select the language of your video (especially if you are not using English).

In many cases, you will have several videos that belong together as a set.  When I produced a short biography and four videos on Gene Expression for H3A BioNet, I also created a “playlist” that contained all five videos in the correct order.  Remember, if you can hook a viewer into watching one of your videos, you might be able to retain their interest for a few more!  Ideally, people will like your stuff enough that they subscribe to your YouTube channel, receiving a notification every time you post a new video.  You will be launched on your next career as a YouTube star!

What protein database is best for tuberculosis?

As many of you know, I have specialized in the field of proteomics, the study of complex mixtures of proteins that may be characteristic of a disease state, development stage, tissue type, etc.  Here in South Africa, my application focus has shifted from colon cancer to tuberculosis.  As a newcomer to this field, I’ve been curious to know whether the field of tuberculosis has good information resources to leverage in its fight against the disease.

The key resource any proteomics group can leverage is the sequence database, specifically the list of all protein sequences encoded by the genome in question.  The human genome incorporates around 20,310 protein-coding genes (reduced from estimates of 26,588 from the 2001 publication), but those genes code for upwards of 70,000 distinct proteins through alternative splicing. Bacteria are able to get by with far smaller numbers of genes.  E. coli, for example, functions with only 4309 proteins.  The organism that infects humans and other animals to produce tuberculosis is named Mycobacterium tuberculosis.  If we were to rely upon the excellent UniProt database, from which I quoted E. coli protein-coding gene counts, we would probably conclude that M. tuberculosis relies upon even fewer genes: only 3993 (3997 proteins)!

UniProt is an excellent all-around resource for proteomics, but researchers in a particular field usually gravitate to a data resource that is particular to their organism.  People who work with C. elegans for developmental studies, for example, use WormBase.  People who study genetics with D. melanogaster would use FlyBase.  People in tuberculosis have frequently turned to TubercuList for its annotation of the M.tb genome (comprising 4031 proteins).  This database, however, has not been updated since March of 2013 (available from the “What’s New” page).  Can it still be considered current, four years later?

As a recent import from clinical proteogenomics, my first impulse is still to run to the genome-derived sequence databases of NCBI, particularly its RefSeq collection.  I found an NCBI genome for M. tuberculosis there, with a last modification date of May 21, 2016 and an indication that its annotation was based upon “ASM19595v2,” a particular assembly of the sequencing data.  This was echoed when I ran to Ensembl, another site most commonly used for eukaryotic species (such as humans) rather than prokaryotic organisms (such as bacteria).  Their Ensembl tuberculosis proteome was built upon the same assembly as was the one from NCBI.

As a former post-doc from Oak Ridge National Laboratory, I am always likely to think of the Department of Energy’s Joint Genome Institute.  The DOE sequences “bugs” (slang for bacteria) like nobody’s business.  Invariably, I find that I can retrieve a complete proteome for a rare bacterium at JGI which is represented by only a handful of proteins in UniProt!  This makes JGI a great resource for people who work in “microbiome” projects, where samples contain proteins from an unknown number of micro-organisms.  In any case, they had many genomes that had been sequenced for tuberculosis (using the Genome Portal, I enumerated projects for Taxonomy ID 1773).  I settled for two that were in finished state, one by Manoj Pillay that appeared to serve as the reference genome and another by Cole that appeared to be an orthogonal attempt to re-annotate the genome from fresh sequencing experiments.

The easiest way to compare the six databases I had accumulated for M. tuberculosis is to enumerate the sequences in each database.  The FASTA file format is very simple; if you can count the number of lines in the file that start with ‘>’, you know how many different sequences there are!  I used the GNU tool “grep” to count them:

grep -c "^>" *.fasta
  • TubercuList: 4031 proteins
  • NCBI GCF: 3906 proteins
  • DOE JGI Cole: 4076 proteins
  • DOE JGI Pillay: 4048 proteins
  • Ensembl: 4018 proteins
  • UniProt: 3997 proteins
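If grep is not at hand (on a stock Windows machine, for example), the same count can be sketched in a few lines of Python.  This is only an equivalent of the grep approach, not the method I used for the numbers above, and the function name count_fasta_entries is invented for the example.

```python
import glob


def count_fasta_entries(path):
    """Count the lines that begin with '>', one per sequence entry."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.startswith(">"))


if __name__ == "__main__":
    # Report every FASTA file in the current directory, like grep -c does.
    for path in sorted(glob.glob("*.fasta")):
        print(f"{path}: {count_fasta_entries(path)} proteins")
```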

So far, one could certainly be excused for thinking that these databases are very nearly identical.  Of course, databases may contain very similar numbers of sequences without containing the same sequences.  One might count how many sequences are duplicated among these databases, but identity is too tough a criterion (sequences can be similar without being identical).  For example, database A may contain a long protein for gene 1 while database B contains just part of that long protein sequence for gene 1.  Database A may be constructed from one gene assembly while Database B is constructed from an altogether different gene assembly, meaning that small genetic variations may lead to small proteomic variations.
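To see why exact identity is too strict, consider this minimal sketch, which compares two FASTA files by treating every sequence as a plain string.  The function names are hypothetical, and this is deliberately the naive approach rather than what a clustering tool does.

```python
def read_sequences(path):
    """Return the set of sequences in a FASTA file, ignoring headers and line wraps."""
    sequences, chunks = set(), []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if chunks:
                    sequences.add("".join(chunks))
                chunks = []
            elif line:
                chunks.append(line.upper())
    if chunks:
        sequences.add("".join(chunks))
    return sequences


def exact_overlap(path_a, path_b):
    """Return (shared, only_in_a, only_in_b) counts under exact string identity."""
    a, b = read_sequences(path_a), read_sequences(path_b)
    return len(a & b), len(a - b), len(b - a)
```

Calling exact_overlap on two database files (any two FASTA paths will do) counts a protein and its truncated form as two unrelated entries, and a single amino acid substitution has the same effect.  That is the weakness a similarity-based comparison is meant to address.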

I opted to use OrthoVenn, a rather powerful tool for analyzing these sequence database similarities.  The tool was published in 2015.  Almost immediately, I ran into a vexing problem.  The Venn diagram created by the software left out TubercuList!  I was delighted to get a rapid response from Yi Wang, the author of the tool (through funding of the United States Department of Agriculture’s Agricultural Research Service).  The tool could not process TubercuList because it contained disallowed characters in its sequences!  I followed his tip to sniff the file very closely.  I found that both sequence entries and accession numbers contained characters they should not.  Specifically, I found these interloping characters:

+ * ' #

OrthoVenn Venn chart

Scrubbing those bonus characters from the database allowed the OrthoVenn software to run perfectly.  Before we leave the subject, I would comment that these characters would cause problems for almost any program designed to read FASTA databases; in some cases, for example, the protein containing one of those characters might be prevented from being identified because of these inclusions!  My read is that they were introduced by manual typing errors; they are not frequent, and they appeared at a variety of locations.  Let’s remember that they have been in place for four years, with no subsequent database release!
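A check of this kind can be sketched in Python. I am assuming here that sequence lines should contain only uppercase letters, which may be narrower or wider than what a given tool accepts, and the function names are my own. The sketch inspects sequence lines only; accession lines would need their own rule.

```python
import io
import string

# Assumption: sequence lines should hold only uppercase letters.
ALLOWED = set(string.ascii_uppercase)


def find_unexpected_characters(handle):
    """Return {character: [line numbers]} for characters outside ALLOWED."""
    hits = {}
    for number, line in enumerate(handle, start=1):
        if line.startswith(">"):
            continue  # header lines are not checked in this sketch
        for char in line.strip():
            if char not in ALLOWED:
                hits.setdefault(char, []).append(number)
    return hits


def scrub(sequence, unwanted="+*'#"):
    """Remove the unwanted characters from one sequence string."""
    return sequence.translate(str.maketrans("", "", unwanted))


# Made-up entries for illustration only.
example = io.StringIO(">entry1\nMKT+AYIA\nKQR*\n>entry2\nMSDNE#LL\n")
print(find_unexpected_characters(example))  # {'+': [2], '*': [3], '#': [5]}
print(scrub("MKT+AYIA"))  # MKTAYIA
```

Reporting line numbers before scrubbing makes it easier to tell others exactly which entries were altered.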

Most people are accustomed to seeing Venn diagrams that incorporate two or three circles.  In this case I compelled the software to compare six different sets.  The bars shown at the bottom of the image show the numbers of clusters in each database; note that these differ from the number of sequences reported in my bullet list above because OrthoVenn recognizes that sequences within a single database may be highly redundant with each other!  (If sequences were completely identical, they could be screened out by the Proteomic Analysis Workbench from OHSU.)  Looking back at the six-pointed star drawn by the software, we might conclude that the overlap is nearly perfect among these databases.  We see four clusters specific to the JGI Pillay database, and 131 clusters specific to some sub-population of the databases, but the great bulk of clusters (3667) are apparently shared among all six databases!


The Edwards visualization from OrthoVenn

Oh, how much difference a visualization makes!  Shifting the visualization to “Edwards’ Venn” alters the picture considerably.  Now we see that the star version hides the labels for some combinations of database.  We see that 3667 clusters are indeed shared among all six databases.  After that, we can descend in counts to 131 clusters found in the Pillay and Cole databases from JGI; does this reflect a difference in how JGI runs its assemblies?  Next we step to 106 clusters found in UniProt, Ensembl, TubercuList, and NCBI GCF, but neither of the JGI databases.  The next sets down represent 70 clusters found in all but NCBI GCF or 25 clusters found in all but the two JGI databases and NCBI GCF.

I interpret this set of intersections to say that tuberculosis researchers are faced with a bit of a dilemma.  If they use a JGI database, they’ll miss the 106 clusters in all the other databases.  If they use Ensembl or TubercuList, they will include those 106 but lose the 131 clusters specific to the JGI databases.  Helpfully, OrthoVenn shows explicitly which sequences map to which clusters.  Remember that when I downloaded the Ensembl and NCBI databases, I saw that they were both based upon a single genome assembly called ASM19595v2.  Did they contain exactly the same genes?  No!  Ensembl contained two fairly big sets of genes that NCBI omitted, including 70 and 25 protein clusters, respectively.  NCBI contains another 11 protein clusters that were omitted from Ensembl.  Just because two databases stem from the same assembly does not imply that they have identical content.

For my part, I may use some non-quantitative means to decide upon a database.  I do not like making manual edits to a database since then others need to know exactly which edits I’ve made to reproduce my work.  That takes away TubercuList.  Next, I feel strongly that the FASTA database should contain useful text descriptions for each accession.  Take a look at the lack of information TubercuList provides for its first protein:

Rv0001_dnaA

That’s right.  Nothing!  The Joint Genome Institute databases are quite similar in omitting the description lines. Compare that to what we see in the NCBI and UniProt databases:

NP_214515.1 chromosomal replication initiator protein DnaA [Mycobacterium tuberculosis H37Rv]
sp|P9WNW3|DNAA_MYCTU Chromosomal replication initiator protein DnaA OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=dnaA PE=1 SV=1

That’s much more informative. We’ve got missing data here, too, though. Tuberculosis researchers have grown accustomed to their “Rv numbers” to describe their most familiar genes/proteins, but NCBI and UniProt leave those numbers out of well-characterized genes; the Rv numbers still appear for less well-characterized proteins, such as hypothetical proteins. By comparison, Ensembl includes textual descriptions as well as Rv numbers in a machine-parseable format for every entry:

CCP42723 pep chromosome:ASM19595v2:Chromosome:1:1524:1 gene:Rv0001 transcript:CCP42723 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:dnaA description:Chromosomal replication initiator protein DnaA

On this basis, I believe Ensembl may be the best option for tuberculosis researchers. It is kept up-to-date while TubercuList is not, and it allows researchers to refer back to the old Rv number system in each description.
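To show what “machine-parseable” buys us, here is a minimal sketch that pulls the Rv number and description out of the Ensembl header quoted above. The function name is my own, and the layout it assumes (accession, sequence type, then key:value pairs with the description last) is inferred from that one example rather than from a formal specification.

```python
def parse_ensembl_header(header):
    """Split an Ensembl-style FASTA header into a dict of its fields.

    Assumes: accession, sequence type, then space-separated key:value
    pairs, with 'description:' last (its value may contain spaces).
    """
    header = header.lstrip(">")
    main, _, description = header.partition(" description:")
    tokens = main.split()
    fields = {"accession": tokens[0], "type": tokens[1]}
    for token in tokens[2:]:
        key, _, value = token.partition(":")  # split on the first colon only
        fields[key] = value
    if description:
        fields["description"] = description.strip()
    return fields


header = (
    "CCP42723 pep chromosome:ASM19595v2:Chromosome:1:1524:1 gene:Rv0001 "
    "transcript:CCP42723 gene_biotype:protein_coding "
    "transcript_biotype:protein_coding gene_symbol:dnaA "
    "description:Chromosomal replication initiator protein DnaA"
)
fields = parse_ensembl_header(header)
print(fields["gene"], fields["gene_symbol"])  # Rv0001 dnaA
print(fields["description"])  # Chromosomal replication initiator protein DnaA
```

Splitting on the first colon only keeps the chromosome field intact, since its value contains colons of its own.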

I hope that this view “under the hood” has helped you understand a bit more of the kind of question that occasionally bedevils a bioinformaticist!

Are you ready to start a molecular biology M.Sc.?

Professors receive a lot of requests from international students for admission to post-graduate training.  In South Africa, that training could be for “Honours” (a one-year course), an “M.Sc.” (a two-year Master’s program), or a “Ph.D.” (typically three years, post Master’s).  For students changing from one country to another, however, the question of “equivalencies” is key.  Could a four-year B.Sc. (Bachelor’s of Science) from Egypt, for example, be treated as the same thing as a three-year B.Sc. followed by one year of Honours in South Africa?  This post gives an example of the questions I asked as I recently tried to determine the right level of admissions for an international student.

The international office for my university had declared that a student’s four-year degree was certainly equivalent to a three-year B.Sc. in South Africa, but it left to the department’s discretion whether or not Honours training was required before an M.Sc.  To support the department’s decision, I decided to build an interview from questions that would delineate the limits of the candidate’s knowledge.  I used the roster of topics for the Division of Molecular Biology and Human Genetics 2017 Honours as a guide.  I used the number of didactic training days for each topic as a weight:

Field                     Duration
Molecular Biology         8 days
Mycobacteriology          7 days
Biostatistics             12 days
Bioinformatics            8 days
Immunology                8 days
Cell Biology              8 days
Scientific Communication  2 days

I also gave some consideration to the M.Sc. project the student would pursue in my laboratory.  In this case, the work related to the reproducibility of mass spectrometry experiments.  After pondering before my word processor, I selected these questions for the candidate’s interview:

#   Field                 Question
1   Cell Biology          What biological processes are described by the Central Dogma of molecular biology? Walk us through each.
2   Biochemistry          What do we describe with Michaelis-Menten kinetics?
3   Computer Science      How does iteration differ from recursion?
4   Analytical Chemistry  By what property does a mass spectrometer separate ions?
5   Medicine              In HIV treatment, what is the purpose of a “protease inhibitor?”
6   Biostatistics         What role does the “null hypothesis” play in Student’s t-test?
7   Medicine              What type of pathogen causes tuberculosis?
8   Genetics              What is the purpose of a plasmid vector in cloning? What features do such vectors commonly contain?
9   Cell Biology          What cellular process includes prophase, metaphase, anaphase, and telophase?
10  Mathematics           The log ratio (base 2) between two numbers is 3. What is the linear ratio?
11  Immunology            What is an antibody, and what is its relationship to an antigen? What are the major families of antibodies?
12  Computer Science      What is the purpose of an Application Programming Interface (API) or “library?”
13  Biochemistry          What do we describe as the secondary structure of a protein?
14  Genetics              Of what components are nucleic acids constructed?
15  Biostatistics         What is a Coefficient of Variation?
16  Mathematics           If I divide the circumference of a circle by its diameter, what value do I get?
17  Immunology            What type of immune cell is the primary factory for antibodies?

The interview, conducted via Skype, lasted approximately an hour. As I asked each question, I gave the question orally and pasted the text of that question into the chat session. Remember that as an American, I have a “foreign” accent for the English-speaking population of Africa! I did not want that to be a factor in the candidate’s performance. I was grateful that our division’s Honours program coordinator, Dr. Jennifer Jackson, accompanied me during the interview, both to monitor that the candidate was treated fairly and to ask follow-up questions of her own.

Why did it take an hour to answer these questions? As is customary in post-graduate education, each answer opened the door to a series of other questions. A student may give an answer that covers only part of the question, and the follow-up will poke into the omitted area to see if it is an area of weakness, much as a dentist uses an explorer to probe a darkened area of a tooth for decay!

Another factor that I want to measure for students is the degree of integration that they have achieved in their educations. To recognize that a word has been mentioned in class is not sufficient; I need to see that students understand how key concepts relate to each other. This synthesis is sometimes hard to evaluate, but it’s important. A student who doesn’t understand how a concept integrates with others will not be able to apply the principle or recognize when it should come into play.

Before the readers of this blog begin showering me with applications, I need to emphasize that the questions I framed for this particular interview are not the questions I would ask of another candidate. The ones above were chosen to reflect the background of the candidate, the diploma program to which he or she had applied, and the nature of the project I had in mind.

I hope that this post will help you decide whether or not you are ready to plunge into post-graduate education!

Young David steps out of his comfort zone

Sometimes, a look through the scrapbook can be a very humbling experience.  I resolved this month to finish a project I launched in 1994.  At last I am publishing the journal I recorded during my first trip to Europe!  For the first time, I am bringing together the forty-two journal entries, my photographs, and the video camera footage that I recorded during my clockwise circuit around the continent.  Before you jump right into the journal, though, could I ask you to read a few thoughts?

More time has passed since I wrote that journal (23 years) than I had lived at that point (I was 20 years old).  The experiences of the last two decades have certainly left their mark.  Since that time, I’ve graduated from two degree programs; I’ve filled my passport with stamps; I’ve built my career in academia; I’ve achieved some level of comfort in finance; I’ve married and divorced.  All of these changes make it hard to recognize the person who wrote those entries as the same person writing this blog!

Setting the scene


I’m sitting by “Le Crayon,” the tower of Credit Lyonnais.

The David who wrote this journal was experiencing profound discomfort.  As a fellow in the University of Arkansas Sturgis Fellows program, I was strongly pushed to spend at least a semester of my junior year abroad.  My undergraduate advisor, Doug Rhoads, arranged for me to visit the laboratories of Jean-Jacques Madjar at the University of Lyons, where Thierry Masse mentored my project.  The fact is that I did not enjoy “wet bench” research, and I was becoming concerned that my Biology degree could equip me for a career I did not want!  To complicate the matter further, we never formalized my visa to work in the laboratory for a year-long stretch, and so I needed to leave France well before even a semester had passed.  Scheduling this journey through many countries was my fall-back plan, and my mother was working with the University of Arkansas to get a formal plan in place for the spring of 1995.  In short, I felt that I was failing in this first real test of applying my academic skills.

If you mainly know me as a globe-trotter who uprooted his career and moved to South Africa, you might be surprised to know that as a young man I disliked travel, and I feared change.  Ask the members of Yates Lab how huge a step it seemed to me to move from Seattle, Washington to San Diego, California in the year 2000.  I spent six months poring over maps and dawdling over last details in Seattle.  To go back further in time, I was always the first member of the family to feel it was time for us to return to Kansas City when our family took long road trips in the summer time.  If you read the journal, you will see a David feeling perpetually out of place and coping badly with exhaustion and self-induced malnutrition because I wasn’t willing to spend enough money on food.

The most redundant feature of the journal is that the 20-year-old me was completely agog at the young women I encountered on my travels.  Although a disproportionate number of my friends since elementary school have been female, I must say that I was essentially undateable until my mid-twenties.  I would summarize by saying that I routinely put women on a pedestal and couldn’t see myself as desirable.  This aspect of the journal is high on my list of cringe-inducers.


I had already given up cursive in college.

What should we call the nexus of judgmental, puritanical, dismissive, and obsessed with money?  I am reminded in this journal that the person I am today was distilled from common mud.  Today I am not immune from these traits, but I do try to improve myself with time.  I have been tagged with the label “stubborn” more times than I would like to admit, but I hope that I can manage open-mindedness and respect for others at least from time to time.  In particular, I struggled to read the passages I wrote about the Turks in Budapest or the drive-by racism I dumped on Latin culture.  At least I realized that smug American chest-thumping was not preferable.  My memories of myself from that time have been substantially white-washed, but my text makes it clear I had a long way to go.  In my memories of that time, I mostly remember that the international relations scholar from Turkey taught me that a bishop or a castle is generally more reliable than a knight in the chess end-game.

From 1994 to now

Travel in Europe today is considerably simpler than it was in 1994.  Moving from country to country is easier because the Schengen agreement removed routine passport checks at borders between member countries, and the Economic and Monetary Union makes the Euro the only currency you need for much of the continent.  The traveler’s checks that fueled my travel are not needed in Europe; instead, you feed your bank card into an ATM, and out pops money.  My single telephone call home from Vienna would likely be replaced today by Skype; I could use my phone or computer in the WiFi of any hostel to chat right away with folks at home.


My account book, in many currencies

I wrote my journal narrative in a spiral-bound notebook, and I kept strict accounts of every franc, Deutschmark, schilling, crown, etc. in a separate small notebook, both of which I acquired while living in Lyon.  I was very fond of Pilot rolling ball pens at the time, and so each page is filled with cramped blue writing.

While my parents used 35mm slide cameras to capture my early years, I carried a 126 film cartridge camera made by Vivitar with me to Europe.  As you will see, many of the images I mention never made it to print when I developed those films, and the term “focus” does not really apply.  In three cases, I used Microsoft’s Image Composite Editor to stitch together multiple photos into a single panorama.


The two most visible cathedrals of Lyon, France

Computer video has come quite some distance since 1994.  I originally recorded the video on an analog Sharp “Video8” camera.  When I subsequently upgraded to a miniDV camera, I was able to transfer the video from the old camera to the new one via an S-video cable; this process recorded the video in a digital format on the new tape.  I was able to transfer that digital video without loss to a desktop computer with a FireWire card.  To deinterlace and compress the section of video I’ve posted to YouTube, I used the “yadif” filter of FFmpeg:

ffmpeg.exe -ss 00:00:09 -i input.avi -vf yadif -t 00:45:05 -c:v libx264 -preset slow output.mov
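For anyone who wants to repeat this on several tapes, the command above can be assembled in a short Python helper. This is a sketch under my own naming (build_deinterlace_command is hypothetical); the options are the same ones used in the command above, with a comment on what each does.

```python
def build_deinterlace_command(source, target, start="00:00:09", duration="00:45:05"):
    """Assemble the argument list for the ffmpeg call shown above."""
    return [
        "ffmpeg",
        "-ss", start,        # seek to this position in the input
        "-i", source,        # input file
        "-vf", "yadif",      # deinterlace with the yadif video filter
        "-t", duration,      # stop writing output after this duration
        "-c:v", "libx264",   # encode video as H.264 via libx264
        "-preset", "slow",   # spend more encoding time for better compression
        target,
    ]


command = build_deinterlace_command("input.avi", "output.mov")
print(" ".join(command))

# To actually run it (requires ffmpeg on the PATH):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting trouble when file names contain spaces.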

With those comments in place, I hope you enjoy reading the journal, a project 23 years in the making!

Teaching on the sly

When I first arrived at Stellenbosch University, I was a bit concerned.  I had thoroughly enjoyed organizing my own semester-long class in bioinformatics for M.Sc. and Ph.D. students at Vanderbilt University.  Under the “British System,” though, students encounter their final classes in the “Honours” year, crammed between the three-year Bachelor’s program and the two-year Master’s program.  Interestingly, a student may attend Honours at a different college than where he or she completed a bachelor’s degree, and the student may go to yet another university for a Master of Science after the Honours, so long as the training is judged to be relevant.

Overview of South African education program

This sequence describes the common route through South African education, from kindergarten to a terminal degree.

I would take a moment to explain a couple of important features here.  In South Africa, students are required to complete only the first nine grades, called “General Education and Training.”  In the United States, graduation from high school means that you have met your high school’s requirements for that goal (which in turn must meet state requirements).  In South Africa, however, high schools essentially serve to prepare students to take the “matric” exams, which are set (created) and marked (graded) nationally.  Matric successes or failures are what decide a student’s opportunities going forward.  I should also say that the chart above describes the academic track.  Many students take advantage of TVET (Technical and Vocational Education and Training) schools that lead to a certificate or diploma rather than a degree (these campuses have also experienced significant protests).  Each of these training types is considered in determining the SAQA level for a job candidate.


The 2016 Honours class for the Division of Molecular Biology and Human Genetics

Students who come to Honours in the Division of Molecular Biology and Human Genetics (MBHG) may come from quite a variety of schools and backgrounds.  Like other divisions throughout Stellenbosch University and the University of Cape Town, we are trying to “transform,” or more faithfully represent the broader population of South Africa, and so we seek out candidates who may not have been able to afford the best schools for bachelor’s training.  Transformation is a hard task, and many universities are struggling [Note to self: read that overview chapter!].

My first exposure to teaching at Stellenbosch, then, was to create a bioinformatics “module” for our Honours students.  The group above got to serve as test subjects for my new curriculum, which spanned just four days in 2016.  Instead of 43 one-hour classes from my old Vanderbilt BMIF 310, I adjusted to four morning laboratories (each three hours) and four afternoon lectures (each two hours).  With so little time, I was obviously quite superficial in my coverage.  For 2017, though, I will conduct a bioinformatics module that extends for eight days (during the first eight business days of May).  I am keeping the hands-on and lecture split the same as last year.  I think the doubling to eight days will be good for both the students and the professor!

Useful Hour 3

In this still from Useful Hour 3, Haiko, Michael, and I impersonate parts of a linked list.

Lecturing just eight days a year isn’t really satisfying my itch to teach, though.  This year I initiated a wildcat “course” of sorts.  The “Useful Hour” takes place each Wednesday at 1:30 PM.  Anyone on campus can attend, and we record videos each week for those who cannot.  The topics have generally been focused on computers, bioinformatics, or biostatistics, though in the coming week we will branch out into biochemistry, as well.  Since the Useful Hour covers so much terrain, I have tried to treat each segment as an independent story, with the topic for each Wednesday announced by my listserv on Monday.  It could be that the loose structure of the Useful Hour will cause its undoing, but for now I am really enjoying its playful vibe.

My work with the Blackburn Lab at the University of Cape Town on Tuesdays has led to another opportunity.  I have teamed up with Nelson Soares, a staff scientist, to create a monthly “Big Show” tutorial for the community of proteomics researchers throughout Cape Town.  Our recent program gave graduate students and post-docs the opportunity to present the essentials of protein identification and quantitation.  In April, we will look at the opportunities their acquisition of a SCIEX TripleTOF will confer on the group.  I appreciate that the students are also willing to listen to a lecture from me, from time to time!

The very latest teaching gig is one I hesitate to mention, since we are still formulating it.  In talking with more members of the Biotechnology Department at the University of the Western Cape, I’ve realized that they have a critical need for more biostatistics training.  I have never taught this subject formally, though I was part of the weekly “Omics” clinic for Biostatistics at Vanderbilt University for a few years.  Certainly one cannot function for long in genomics, transcriptomics, or proteomics without knowing something about biostatistics.  Teaching biostatistics formally is likely to teach me as much about the subject as it teaches the students who attend!  I hoped to use slides from Stellenbosch University for teaching weekly courses at UWC, but I could not get that use approved.  Instead, I have once again borrowed the expertise of my friend Xia Wang at the University of Cincinnati.  I am hopeful that I will be able to understand and use her didactic materials.  They’re written in the LaTeX typesetting language, so I will need to remind myself how to edit and export to a format I can display, like PDF. My last real experience with LaTeX was when I wrote my Ph.D. dissertation in 2003.

With students on three university campuses, I think I will finally feel like I have some real momentum in my teaching!