The Talk.Origins Archive Post of the Month: May 2009

Subject:    | exploring _all_ of sequence space -- done deal?
Date:       | 02 May 2009
Message-ID: | 5841cf37-bf2f-494f-bdee-2095798523e3@s16g2000vbp.googlegroups.com

Editors' notes:
This post is by the author of a peer-reviewed paper found here.
Two paragraphs which digress from the central point of the post can be read in the original here.
Specific points made by the primary creationist critic, Sean Pitman, are explicitly addressed at the end.

Hello, I'm David Dryden, the main author of the paper which has generated all of this discussion. Steven Litvintchouk has asked me to comment on the analysis of my paper performed by Sean Pitman, and my comments are appended at the end of this text. I must say that I am a bit surprised at the amount of interest and discussion that has arisen from the little paper by myself and two colleagues. However, I am pleased that it has done so and that my decision to pay for it to be an "open access" paper has been justified. It is my hope that within 10 years all professional peer-reviewed science will be freely available.
[paragraph deleted]

I am probably going to write far too much but if you want the conclusion, it is that Sean Pitman is completely and utterly wrong in everything he says in his comments and displays a great ignorance of proteins and their structure and function. However, I do hope he will read on.

He has failed to recognise that we wished to establish limits on the amount of sequence space explored. Defining limits is a standard scientific procedure, though often forgotten, and prevents a lot of time-wasting as long as one does not initially restrict the calculation unnecessarily. Hence, these upper and lower limits are as generous as possible based upon current knowledge. They may be out by a few powers of ten, but this is negligible, and as research proceeds the limits are likely to come closer together, narrowing the range of possibilities. In our graph, we overlaid the limits on top of the number of different sequences possible for a given pool of letters. The letters may total up to 20 in number but it's just a number. If these letters are considered to be amino acids, then we can group the 20 amino acids into a smaller number of categories related to their physicochemical properties. As soon as you state that the symbols you use in the calculation of these big numbers represent physical entities, you cannot ignore their physical properties. Such groupings are well known to be valid in experiment and are the basis for computer-driven sequence analysis. It is unfortunate that such computer analysis has been carried too far by some people, especially those who wish to push the ID complexity agenda, and that sight has been lost of the fact that proteins do not exist in silico.
[paragraph deleted]

In the real world of protein structure and function, it is well established that many amino acids can be changed with little or no effect on function (there are only two "functions" for a protein: to bind and control another object, or to bind another object and catalyse a reaction on that object; the latter includes proton pumping to rotate a flagellum). The catalytic chemical mechanisms performed by enzymes and ribozymes fall into only 6 categories of well-understood chemical reactions (a rather small reaction space). All of biochemistry is built upon them. Those mutations that do change function usually define an active site (for function) or a crucial folding region. It is known that proteins are built up from the repetitive use of smaller peptide sequences joined together by gene duplication and recombination in the DNA coding.

A general consensus seems to be that amino acid sequences of length about 20 or 30 really start to get interesting in terms of function, but even smaller folding units are being proposed as progenitors (see, if possible, the work of Edward N Trifonov, and also note that even single amino acids are capable of catalysing reactions). The joining of these 20-30mers (via the gene), which fold into "super-secondary" structure units, rapidly leads to a folded "domain" of 50-150 amino acids. These domains are small folded units. This domain size is essentially universal. You do not find a single domain of 1000 amino acids; a protein of that length will instead comprise around 10 smaller domains. It is very important to note that in many cases such folding does not require complete specification of the sequence but only maintenance of the pattern of amino acid categories (polar, non-polar etc). Many domains do interesting things, but a recurring theme is to then bring collections of the domains together, either as separate subunits to form quaternary structure or strung together into one single chain by making them via a single gene.

This latter process is how one makes a 1000 amino acid protein. Nature takes the domain modules and puts them together to make novel structures and functions. (Note that very few long protein chains are found, as their synthesis is tricky due to increasing chances of errors in gene replication, transcription or translation. These problems were recognised by Francis Crick, if I recall correctly, in the 1960s. The average protein size is around 500 amino acids.) The number of domains with unique folds appears to be limited to around 1000 to 10,000 (certainly less than a million) and it is becoming rare for science to find new ones. This means that whatever sequence you have, you get one of these folds (but note that there is a relatively new discovery of "natively-unfolded" proteins, which are unfolded until they bind to their target molecule, at which point they fold up. How much of sequence space is occupied by these fully functional proteins is unknown).

The number of protein types in any organism also appears to be very limited, from several hundred in the bacteria with very small genomes to perhaps of the order of 10,000 in multicellular organisms. These proteins still contain the same domains. These limited numbers of proteins perform a limited number of reactions (again totalling hundreds to thousands of such reactions, as detailed in standard biochemistry textbooks). What matters in producing different species is how all these reactions are controlled, (mostly) by proteins each built using the same structural principles as already detailed. This area of control is often referred to as part of EvoDevo, or evolutionary developmental biology, and it is where the real discussions of complexity are going on. That's not my speciality so I will return now to proteins.

Hopefully it is clear that we have considerable understanding of protein structure and how they function and that they are not so complex that we cannot possibly explain them (and even design them ourselves. David S Goodsell has published an excellent book on "Bionanotechnology" showing what Nature has achieved and how our understanding of it lets us design new experiments). We could if we wished define a structural space and a functional space in the same way as a sequence space. From the foregoing, it should be clear that neither of these spaces is particularly large when compared to the possible size of sequence space.

Given the small sizes of the spaces of domain structure, chemical reactivity, and biochemical function, or perhaps I could refer to them as toolboxes, it does not seem to be too much of a guess that the last common ancestor of all life on earth would be equipped with all of these toolboxes. This is why we decided to determine the limits of use of these toolboxes and calculate the upper and lower limits starting from way back in time. The tools may not have been fully utilised or explored by that stage (or even by bacteria today), but later when multicellular organisms appeared, these tools were ready to be used and even reconfigured.

So we return to the problem of how big is sequence space? It is not 20 raised to the power of the number of amino acids in the sequence but much less as we discussed in our paper. There is no need to have 20 different types of amino acid each with unique properties or sequence lengths greater than around 100. This is due to the physical similarities between amino acids and the limited range of folded structures. Only a few amino acids in any protein are crucial for its function. Changing these ones will sometimes do almost nothing to the function. However, some will change the function to something new while maintaining structure, and a very few will change the protein structure (and hence its original function) altogether but in all probability still confer a new function. Even if the original structure is destroyed by a mutation and the protein does not fold, the possibility of a new function is not lost as evidenced by the discovery of natively unfolded proteins which acquire fold and function upon binding to a target.
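
As a rough illustration of the arithmetic (a minimal sketch, not taken from the paper), the following compares the naive count of 20^L with the count for a reduced alphabet. The alphabet sizes and sequence lengths are illustrative assumptions, not the limits calculated in the paper, and the function name is hypothetical.

```python
import math

def log10_sequence_space(alphabet_size: int, length: int) -> float:
    """Return log10 of alphabet_size ** length, the number of possible sequences."""
    return length * math.log10(alphabet_size)

# Illustrative alphabet sizes (assumed): all 20 amino acids, four
# physicochemical categories, and a two-letter polar/non-polar alphabet.
for alphabet_size in (20, 4, 2):
    for length in (30, 100):
        exponent = log10_sequence_space(alphabet_size, length)
        print(f"{alphabet_size:>2} letters, length {length:>3}: "
              f"about 10^{exponent:.0f} sequences")
```

Working with the logarithm keeps the numbers readable. Going from 20 letters to 2 at length 100 shrinks the space from roughly 10^130 to roughly 10^30 sequences.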

Sequence space when proposed was quickly recognised as a silly "paradox" by protein scientists (though not unfortunately by some other scientists) rather like the silly Levinthal "paradox" of protein folding. Sequence space may be large but that does not mean it is complex. I hope the above short essay on protein structure and function is useful even to Sean Pitman who needs to stop being obsessed with computer-based numerology and do some reading and talk to some practical protein scientists. I also hope that he will realise that proteins are not relevant to religion and that they as well as other macromolecules provide absolutely no foundation for ID.

Some short specific replies to critiques from Sean Pitman here
> The authors argue that because some types of functional proteins are smaller
> than 100aa [amino acids], and because some types of proteins have very low
> sequence specificity requirements, that pretty much all of sequence space
> could have been searched in just a few billion years.

> What these authors fail to realize is that not all protein-based systems are
> created equal. Some systems do indeed require very few amino acid building
> blocks and little sequence specificity. These, of course, are on very low
> levels of functional complexity. Other systems, however, require far more
> than the 100aa systems discussed by the authors - at minimum. And, many of
> these systems require a significant degree of specificity. These systems
> are on a much higher level of functional complexity and therefore occupy
> much much larger sequence spaces and also have exponentially lower ratios of
> potentially beneficial vs. non-beneficial at these higher levels.

With nearly 50 years of biological and chemical research shared between the authors, we really do understand that not all molecules are the same. We are not saying that only a few specified amino acids are available, but that only a few types of amino acids are: polar, non-polar, negative, positive and so on. For example, any one of the polar ones might do at a particular site. All you need to specify is polarity. Since this is the entire basis of using sequence comparisons to group proteins into functional families, and these computer methods are used by supporters of ID, they should know this. Small proteins are not necessarily functionally simple, nor are big proteins necessarily complex in function (in fact some are remarkably boring).
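
The point about specifying only the type of amino acid at each site can be sketched in a few lines. The grouping below is one common textbook scheme, used here purely as an assumption for illustration; other groupings exist and the paper does not prescribe this one. The two peptides are made-up examples.

```python
# One common textbook grouping of the 20 amino acids (an assumption for
# illustration; alternative groupings are equally defensible).
CATEGORIES = {
    "n": "AVLIMFWPG",  # non-polar
    "p": "STCYNQ",     # polar, uncharged
    "+": "KRH",        # positively charged
    "-": "DE",         # negatively charged
}
CATEGORY_OF = {aa: label for label, members in CATEGORIES.items() for aa in members}

def category_pattern(sequence: str) -> str:
    """Map a one-letter amino acid sequence to its pattern of categories."""
    return "".join(CATEGORY_OF[aa] for aa in sequence.upper())

# Two hypothetical peptides that differ at every position but share a pattern.
peptide_a = "ASKDL"
peptide_b = "VTREI"
print(category_pattern(peptide_a))
print(category_pattern(peptide_b))
print(category_pattern(peptide_a) == category_pattern(peptide_b))
```

Under this grouping the two peptides share no residue at any position, yet they reduce to the same pattern of categories.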

> The authors of this paper do not even address the concept of different
> levels of functional complexity. They only point out the obvious that some
> types of functional systems are very very low level systems. Well duh! What
> about those systems that are on much much higher levels of minimum size
> and/or specificity requirements?...

These increasing degrees of functional complexity are a mirage. Just because a flagellum spins and looks fancy does not mean it is more complex than something smaller. The much smaller wonderful machines involved in manipulating DNA, making cell walls or cytoskeletons during the cell's lifecycle do far more complex and varied things including switching between functions. Even a small serine protease has a much harder job than the flagellum. The flagellum just spins and spins and yawn...

Another Sean Pitman critique originally here.
> The evidence shows that the distances [in sequence space] between higher and
> higher level beneficial sequences with novel functions increases in a linear
> manner.

What evidence? And if importance of function scales with sequence length and the scaling is linear then I am afraid that 20^100 is essentially identical to 2 x 20^100. Also a novel function is not a new function but just one we stumble upon in doing the hard work in the lab. It's been there a long time...
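
To see why a factor of 2 hardly matters at this scale, here is a minimal sketch comparing the two numbers by their base-10 logarithms. The variable names are illustrative only.

```python
import math

log_n = 100 * math.log10(20)     # log10 of 20^100
log_2n = math.log10(2) + log_n   # log10 of 2 x 20^100

print(f"20^100     is about 10^{log_n:.1f}")
print(f"2 x 20^100 is about 10^{log_2n:.1f}")
print(f"difference in exponent: {log_2n - log_n:.2f}")
```

Doubling the count adds only about 0.3 to an exponent of roughly 130, which is far smaller than the "few powers of ten" of uncertainty already allowed for in the limits.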

Does evolution follow a path both straight and wide?
