

Patterns of alternation of hydrophobic and polar residues are a profound aspect of amino acid sequences, but a feature not easily interpreted for soluble proteins. Here we report statistics of hydrophobicity patterns in proteins of known structure in a current protein database as compared with results from earlier, more limited structure sets. Previous studies indicated that long hydrophobic runs, common in membrane proteins, are underrepresented in soluble proteins. Long runs of hydrophobic residues remain significantly underrepresented in soluble proteins, with none longer than 16 residues observed. These long runs most commonly occur as buried α helices, with extended hydrophobic strands less common. Avoiding aggregation of partially folded intermediates during intracellular folding remains a viable explanation for the rarity of long hydrophobic runs in soluble proteins. Comparison between database editions reveals robustness of statistics on aqueous proteins despite an approximately twofold increase in nonredundant sequences. The expanded database does now allow us to explain several deviations of hydrophobicity statistics from models of random sequence in terms of requirements of specific secondary structure elements. Comparison to prior membrane-bound protein sequences, however, shows significant qualitative changes, with the average hydrophobicity and frequency of long runs of hydrophobic residues noticeably increasing between the database editions. These results suggest that the aqueous proteins of solved structure may represent an essentially complete sample of the universe of aqueous sequences, while the membrane proteins of known structure are not yet representative of the universe of membrane-associated proteins, even by relatively simple measures of hydrophobic patterns.

The presence of both hydrophobic and polar residues interspersed through polypeptide chains is one of the most general features of amino acid sequences encoded by genes (Chothia 1984; White 1994). Protein scientists understand this in some general sense: for proteins soluble in aqueous solution, the removal of hydrophobic residues from water and their interaction in a buried core is a driving force for chain folding, while the presence of polar amino acids is necessary for the formation of the surface interface between protein and solvent. For integral membrane proteins, the organization of hydrophobic and polar residues differs, with long hydrophobic stretches folding within the apolar lipid environment but polar residues required for stretches of the sequence that are solvent exposed in the cytosolic or extracellular environments. Except for a few specialized cases, such as the heptad repeats directing chains into the coiled-coil fold (O’Shea et al. 1991; Cohen and Parry 1994), few simple patterns unambiguously controlling chain folds are evident in sequences of globular aqueous proteins. Statistical analyses of patterns of hydrophobicity in protein sequences and structures nonetheless have a long and continuing history (White 1994; Broome and Hecht 2000). Pioneering work in the area came from studies of residue conservation in specific protein families, examining hydrophobicity as one of many factors that appeared to constrain amino acid choice in different structural environments (Lesk and Chothia 1980, 1982; Chothia and Lesk 1982). More recent studies have focused specifically on simplified models of hydrophobicity. For example, White and Jacobs (1990) began studies of hydrophobic/polar (HP) run-length distributions in order to test for randomness of protein sequence hydrophobicities, concluding that most protein sequences are individually indistinguishable from random sequences. A 1993 study examined short HP patterns, finding several significantly favored or suppressed in real protein sequences. Strait and Dewey (1996) examined HP sequences from an information-theoretic perspective, showing that they contain much less information than random sequences and therefore exhibit biases for particular arrangements of hydrophobic and hydrophilic residues. Hecht and collaborators (West and Hecht 1995; Xiong et al. 1995; Broome and Hecht 2000) have extensively examined the influence of short or periodic sequence patterns, including HP alternations, in protein structures, finding that such patterns are major determinants of secondary structure and that alternating patterns are specifically disfavored in protein sequences. In recent years it has become clear that some of the information in amino acid sequences is used to determine the conformations of folding intermediates, which may differ from native states, or to avoid off-pathway aggregation (Goldenberg et al. ...).
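
The run-length statistics of hydrophobic residues discussed here can be illustrated with a minimal sketch. It reduces a sequence to an HP string, tabulates maximal hydrophobic runs, and computes the expected number of runs of a given length under a simple independent-residue random model. The hydrophobic residue set, the function names, and the example sequence are all assumptions made for illustration; the text above does not specify the H/P classification or the exact random model used in the cited studies.

```python
from collections import Counter
from itertools import groupby

# Assumed hydrophobic residue set (one-letter codes). This classification is
# illustrative only; it is not taken from the source text.
HYDROPHOBIC = set("AVLIMFWC")


def hp_encode(seq, hydrophobic=HYDROPHOBIC):
    """Map an amino acid sequence to a string of 'H' (hydrophobic) and 'P' (polar)."""
    return "".join("H" if aa in hydrophobic else "P" for aa in seq.upper())


def run_length_counts(hp, symbol="H"):
    """Count maximal runs of `symbol` in an HP string, keyed by run length."""
    return Counter(len(list(group)) for key, group in groupby(hp) if key == symbol)


def expected_run_count(n, p, k):
    """Expected number of maximal H runs of length exactly k in a random
    sequence of length n whose residues are independently H with probability p."""
    if k < 1 or k > n:
        return 0.0
    if k == n:
        return p ** n
    q = 1.0 - p
    # A run touching either end of the chain needs one flanking P;
    # each of the n - k - 1 interior start positions needs two.
    return 2 * p ** k * q + (n - k - 1) * p ** k * q ** 2


if __name__ == "__main__":
    seq = "MKVLAAGIDESRLLVVK"  # made-up example sequence, not a real protein
    hp = hp_encode(seq)
    counts = run_length_counts(hp)
    p_hat = hp.count("H") / len(hp)  # observed hydrophobic fraction
    print(hp)                        # HPHHHHPHPPPPHHHHP
    for k in sorted(counts):
        print(k, counts[k], round(expected_run_count(len(hp), p_hat, k), 3))
```

Comparing observed counts with `expected_run_count` across a whole database, rather than for a single short sequence, is what would reveal whether long hydrophobic runs are under- or overrepresented relative to this simple random model.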
