I have a love/hate relationship with the Mythbusters. It’s mostly love because they ask interesting questions and have a philosophy that “failure is always an option”. The hate part is that their experimental design is driven more by “will this be entertaining?” than by “does this answer the question?”. I actually taught a course at UC Davis where we examined Mythbusters episodes and discussed what went right and wrong. We even did our own experiment: “does hot sauce increase your body temperature?”.
One of the common car culture myths is that when tuning your car, you should change only one thing at a time. I’m going to give a very simple example of why that isn’t necessarily a good idea, and I’m going to do that using my own research as an example. Get ready for some bioinformatics-flavored genome science. If you’re too lazy or time-constrained to read the whole blog, here’s the short version: you can’t predict how different parts interact, so optimizing one thing at a time isn’t optimal. Now that you know the end of the story, let me tell the beginning and middle.
A little molecular genetics for car enthusiasts
Proteins are the building blocks of your body. They are both the structure (chassis) and the moving parts (pistons, connecting rods). Proteins are not the fluids (oil, gasoline, air) though. Those parts we call metabolites. For our purposes, the blueprints for building each protein are contained in genes, and all of the genes together make up a book we call the genome. Over the last 20 years or so, we have gotten very good at sequencing genomes. That is to say, we can print out the instruction set for lots of different organisms. We aren’t so good at determining what those instructions mean because it’s hard to tell where the genes are in a long string of letters.
The sequence above is 800 letters randomly selected from a favorite genome. What do they mean? Does this paragraph correspond to a gene or something else? Deciphering such passages in the various books of life is literally my day job. It’s a pretty sweet gig if you’re into solving puzzles. So let’s imagine briefly that each protein is like a group of sentences. In order to find the genes, we just need to break the genome into sentences. So let’s do that in the following passage.
Nid oes gennyf fawr o bleser gydag ysgrifennu un math o farddoniaeth heblaw caneuon bychain o’r fath hyn. Fy mhlant fy hun ydyw’r Caniadau. Dymuniad fy nghalon a balchder fy mynwes ydyw eu dwyn i fyny yn blant da. Wrth adael i rai ddawnsio mewn plentynrwydd, ac i’r lleill chwerthin ac ysmalio, caiff nifer o honynt gadw carwriaeth ac eraill ganu hen alawon eu brodir. Caiff bechgyn weithio yn y graig, a bugeilio ar y mynydd, a phan fydd dolefiad corn y gâd yn galw, fe’u cyfeiriaf i faes y frwydr i amddiffyn eu cartref, ac i farw’n ddewr tros Ryddid eu mamwlad. Yn nesaf at ofni Duw ac anrhydeddu y brenin, cant garu eu gwlad a meddwl yn dda am eu hiaith a’u cenedl.
I have absolutely no idea what this says because it’s written in Welsh. However, I can tell where the sentences are because I can recognize some punctuation (capital letters, periods, commas). But genomes don’t have punctuation. So now try breaking this up into sentences when you don’t know the language or punctuation.
Let’s further complicate matters by imagining that the instructions for each protein are often interrupted by several advertisements, much as you would see in a magazine (weirdly, much of our genome sequence appears to be junk, like advertisements, and not useful content). In order to read the article, you have to remove the advertisements. We call the advertisements introns and the act of removing them splicing.
In order to splice out the advertisements (introns) to read the article (protein) we must first recognize what the signals look like. In real life, advertisements are usually less than a page, and there is usually a bounding rectangle as well as changes in background or font. That is, we can discern an advertisement by its surrounding context. Genomes don’t have such features, so we use the patterns of letters themselves to provide context. Splice sites have two sides called donor and acceptor. For the experiment today, we are going to look at just one side: the splice acceptor.
The consensus splice acceptor sequence is TTTTCAG. That is to say that this is the most likely sequence of letters. But not all splice acceptors have the same sequence. Sometimes they are ATTACAG, for example. The last two letters are almost always AG, but the other letters can change quite a bit. A very simplistic model of splicing would be that every AG is a splice acceptor site. But with 4 letters in the alphabet, AG shows up by chance about once every 16 letters (1/4 × 1/4), while advertisements occur only every 100-200 letters, so most AGs are not splice acceptors. So how do we recognize splice acceptor sites? By looking at the sequences to the left and right. These provide context, much in the same way that a change in fonts or dialect signals a change of content.
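To see why the "every AG is an acceptor" model is too simplistic, here is a minimal Python sketch. The function names `candidate_acceptors` and `expected_spacing` are made up for illustration, and the spacing estimate assumes all four letters are equally likely, which real genomes only approximate.

```python
def candidate_acceptors(seq: str) -> list[int]:
    """Return the 0-based position just after every AG in seq.

    Under the simplistic model, each of these positions would be
    treated as the start of an exon.
    """
    seq = seq.upper()
    return [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]


def expected_spacing(p_a: float = 0.25, p_g: float = 0.25) -> float:
    """Average distance between chance AGs, assuming independent letters."""
    return 1.0 / (p_a * p_g)


if __name__ == "__main__":
    print(candidate_acceptors("TTTTCAGGT"))   # the consensus acceptor plus 2 letters
    print(candidate_acceptors("ATTACAGAG"))   # two overlapping candidates
    print(expected_spacing())
```

Every AG gets flagged, which is exactly the problem: the model has no way to tell a real acceptor from a chance AG without looking at the neighbors.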
Here’s a diagram depicting a simple model of a splice acceptor site that we will use for our experiments. On the left is an advertisement (intron). This contains a dictionary of all the common words found in advertisements. On the right is part of a protein-coding sentence (exon). It contains a dictionary of all the common words found in articles (proteins). Importantly, the word frequencies differ between the two dictionaries. For example, the word sale occurs many times in advertisements, but rarely in the article. In the middle is a sequence that ends in AG, which represents the various splice acceptors we have seen before (changes in font/dialect/background/whatever).
In order to think of this experiment in car terms, we are going to imagine the model above as a car. The intron represents the front suspension, the exon is the rear suspension, and the AG box is a combination of tire and pressure. We can make changes to any of these components and see how performance changes.
The first thing we are going to do is make a sweep over tire pressures using our favorite two tires. Here, let’s imagine the Y-axis is grip or some similar approximation of optimality. Both tires appear to like 27 lbs.
Now let’s make some changes to the suspension. We can go with soft or hard springs up front.
- Front Soft (rear soft) 0.905
- Front Hard (rear soft) 0.720
Soft is clearly better than hard. If we are allowed to change only one thing at a time, we can change the rear springs to hard to see if that’s better.
- Front Soft, Rear Soft 0.905 (as seen before)
- Front Soft, Rear Hard 0.891
If we change only 1 thing at a time, we max out performance at 0.905 using tire 1 at 27 lbs and soft-soft suspension. However, it’s only one more test to examine hard on both sides.
- Front Hard, Rear Hard 0.938
So hard-hard is better than soft-soft, and we can’t know that unless we try all combinations. Since there are only 2 sets of springs, this isn’t a big deal. But what if we use all combinations of tire compounds, tire pressures, and springs? Is there a combination better than tire 1, 27 lbs, and hard-hard?
Yes. It turns out that tire 2, 29 lbs, and hard-hard results in 0.942 performance. If we are allowed to optimize only one thing at a time, we end up with 0.905 performance, but if we try all combinations, we end up with 0.942. We did have to test 64 combinations (8 pressures × 2 tires × 2 front springs × 2 rear springs) to arrive at that result, however. Most car testing scenarios cannot schedule enough time to do 64 different tests. And with the conditions changing throughout the day, even if one could do 64 tests, the results may be polluted by external conditions.
Given how difficult it is to find optimality, how does one approach the problem? Personally, my cars don’t have much that can be tuned. I’m pretty much limited to changing tire compounds and pressures. For me, tuning means driving around a car’s problems more than fixing them. But that’s because I suck at racing. Hopefully your tuning program is better than mine. And hopefully it means testing combinations of parameters rather than greedy/serial optimization.
Tuning is difficult because there are many parameters that may interact in ways that are difficult to predict. While testing all combinations of parameters is the only way to be sure, it may be impractical or impossible to do this in the real world. Tune as well as your time and wallet permit and drive around what you can’t.
The experiment above was conducted using hidden Markov models, a popular architecture to model biological sequence signals. Tire pressure is actually the length of the splice acceptor signal from 2-9 nt (rather than 22-29 psi). Tire 1 vs 2 is a change in emission context from 0 to 1. Soft vs. hard suspension is a change from 0 to 1 in emission context for the intron and exon states. The data source was the first 1% of each chromosome in Caenorhabditis elegans.
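For readers unfamiliar with the term, here is a rough Python sketch of what an "emission context" could look like. It assumes that context means the number of preceding letters the emission probability is conditioned on (0 = each letter on its own, 1 = each letter given the one before it). The function name `emission_table` and the toy sequence are hypothetical, and this is not the code used for the experiment.

```python
from collections import Counter, defaultdict


def emission_table(sequences, context):
    """Estimate P(letter | previous `context` letters) from training sequences.

    context=0 gives plain letter frequencies (keyed by the empty string);
    context=1 gives frequencies conditioned on the preceding letter.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(context, len(seq)):
            counts[seq[i - context:i]][seq[i]] += 1
    return {
        ctx: {nt: n / sum(c.values()) for nt, n in c.items()}
        for ctx, c in counts.items()
    }


if __name__ == "__main__":
    toy = ["AAGT"]                 # toy training data, not real genome sequence
    print(emission_table(toy, 0))
    print(emission_table(toy, 1))
```

In a setup like this, an intron state and an exon state would each get their own table trained on their own sequences, which is the "two dictionaries with different word frequencies" idea from earlier.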