Multiple sequence alignment using secondary structure prediction
Multiple alignment programs are usually based solely on primary sequence information. However, since proteins are more conserved at the structural level than at the primary-sequence level, attempts have been made to use predicted secondary structure information for improved alignment, e.g. by Heringa (1999) or Kim and Xie (2006).
Here, we offer a WWW-based multiple-alignment tool that uses protein secondary structure predictions produced by PSIPRED v2 (Jones, 2004). We use DIALIGN 2, which calculates multiple alignments based on local pairwise homologies, so-called fragments (Morgenstern et al., 1996; Morgenstern, 2004). Unlike the standard version of DIALIGN, our server scores these fragments using similarities at the secondary-structure level as well as at the primary-sequence level.
Input sequence file:
The input for our program is a single ASCII file containing the sequences to be aligned in multiple FASTA format. As a running example, we use the dataset BB11001 from the BAliBASE 3 benchmark:
>1aab_
GKGDPKKPRGKMSSYAFFVQTSREEHKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMAKADKARYEREMKTYIPPKGE
>1j46_A
MQDRVKRPMNAFIVWSRDQRRKMALENPRMRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQAMHREKYPNYKYRPRRKAKMLPK
>1k99_A
MKKLKKHPDFPKKPLTPYFRFFMEKRAKYAKLHPEMSNLDLTKILSKKYKELPEKKKMKYIQDFQREKQEFERNLARFREDHPDLIQNAKK
>lef_A
MHIKKPLNAFMLYMKEMRANVVAESTLKESAAINQILGRRWHALSREEQAKYYELARKERQLHMQLYPGWSARDNYGKKKKRKREK
For each sequence, the first line starts with ">" and contains the name of the sequence.
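The input format described above can be read with a few lines of Python. This is only a minimal sketch for checking a file before submission, not the parser used by the server; the function name parse_fasta is our own choice for illustration.

```python
def parse_fasta(text):
    """Parse multiple-FASTA text into an ordered list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = (
        ">1aab_\n"
        "GKGDPKKPRGKMSSYAFFVQTSREEHKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMAKADKARYEREMKTYIPPKGE\n"
        ">1j46_A\n"
        "MQDRVKRPMNAFIVWSRDQRRKMALENPRMRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQAMHREKYPNYKYRPRRKAKMLPK\n"
    )
    for seq_name, seq in parse_fasta(example):
        print(seq_name, len(seq))
```

Keeping the records in a list rather than a dictionary preserves the input order, which matters because the secondary-structure output is ordered like the input.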
Approach:
Given a set of input sequences, our method first calculates all pairwise alignments using the standard version of DIALIGN. Our web server allows the user to apply a threshold T to remove low-scoring local similarities from these pairwise alignments.
The standard version of DIALIGN greedily includes the fragments contained in the respective pairwise alignments into a growing multiple alignment, provided they are consistent with each other, i.e. as long as they fit together in a single output MSA. The priority of the fragments in the greedy algorithm depends on the degree of similarity at the primary-sequence level.
In our structure-based version, the priority of the fragments in the greedy procedure depends not only on primary-sequence similarity but also on the degree of similarity between the predicted secondary structures. Details are explained in a forthcoming paper (Subramanian et al., submitted).
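The greedy idea can be illustrated with a deliberately simplified sketch. It is restricted to fragments between two sequences, where consistency reduces to "no overlap and no crossing"; the real consistency check across many sequences is more involved. The Fragment class, the weighted sum in combined_score, the parameter alpha and the way the threshold is applied to the primary-sequence weight are all assumptions made for illustration; the actual scoring scheme is described in the paper cited above, not here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Ungapped local alignment between sequences A and B (hypothetical model)."""
    start_a: int          # 0-based start in sequence A
    start_b: int          # 0-based start in sequence B
    length: int           # number of aligned residue pairs
    seq_weight: float     # similarity at the primary-sequence level
    struct_weight: float  # similarity of predicted secondary structure


def combined_score(frag, alpha=1.0):
    # ASSUMPTION: a plain weighted sum; the server's real formula may differ.
    return frag.seq_weight + alpha * frag.struct_weight


def consistent(f, g):
    """True if f and g can coexist in one pairwise alignment (no overlap, no crossing)."""
    first, second = (f, g) if f.start_a <= g.start_a else (g, f)
    return (first.start_a + first.length <= second.start_a
            and first.start_b + first.length <= second.start_b)


def greedy_select(fragments, threshold=0.0, alpha=1.0):
    """Accept fragments in order of decreasing score if they fit the ones already kept."""
    # ASSUMPTION: the threshold T filters on the primary-sequence weight.
    candidates = [f for f in fragments if f.seq_weight >= threshold]
    candidates.sort(key=lambda f: combined_score(f, alpha), reverse=True)
    kept = []
    for frag in candidates:
        if all(consistent(frag, other) for other in kept):
            kept.append(frag)
    return sorted(kept, key=lambda f: f.start_a)


if __name__ == "__main__":
    frags = [
        Fragment(0, 0, 5, seq_weight=3.0, struct_weight=3.0),
        Fragment(3, 10, 4, seq_weight=5.0, struct_weight=0.0),
        Fragment(10, 20, 3, seq_weight=1.0, struct_weight=0.0),
    ]
    # alpha=0.0 mimics sequence-only priority; alpha=1.0 also uses structure.
    print(greedy_select(frags, alpha=0.0))
    print(greedy_select(frags, alpha=1.0))
```

In this toy input the first two fragments overlap in sequence A, so only one of them can be kept; which one wins depends on whether the structural term is included in the priority.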
Program Output:
Our web server creates several output files containing:
- A full alignment of the input sequences in DIALIGN format.
- The predicted secondary structures for all of the input sequences.
- A tree in PHYLIP format, based on sequence and structural similarities.
This is DIALIGN alignment format:
The output of DIALIGN with secondary structure in our running example is as follows:
program call: /c1/scratch/disec/libexec/dialign2 -thr 0 -sec input.fa

1aab_        1  gkgd------ PKKPrgkmss yafFVQTSRE EHKKK---HP DASV-NFSEF
1j46_A       1  mq------DR VKRPMNA--- ---FIVWSRD QRRKMALENP RM---RNSEI
1k99_A       1  mkklkkhpDF PKKPLTP--- ---YFRFFME KRAKYAKLHP EM---SNLDL
lef_A        1  m--------H IKKPLNA--- ---FMLYMKE MRANV---VA ESTLkESAAI
                0000000000 0000000000 0000000000 0000000000 0000013399

1aab_       41  SKKCSERWKT MSAKEKGKFE DMAKADKARY EREMktyipp kge-------
1j46_A      36  SKQLGYQWKM LTEAEKWPFF QEAQKLQAMH REk------- YP------NY
1k99_A      42  TKILSKKYKE LPEKKKMKYI QDFQREKQEF ERNLarfred HP------DL
lef_A       34  NQILGRRWHA LSREEQAKYY ELARKERQLH MQl------- YPgwsardNY
                9999999999 9999999999 9999966666 6655000000 0000000000

1aab_       84  ---------- ---
1j46_A      73  KYRPRRKakm lpk
1k99_A      86  IQNAKK---- ---
lef_A       77  GKKKKRKrek ---
                0000000000 000

- Names of the aligned sequences are shown on the left hand side of the alignment.
- Numbers on the left hand side of the alignment denote the position of the first residue in a line within the respective sequence.
- As explained in our papers, DIALIGN composes alignments from so-called fragments, i.e. from ungapped local pairwise alignments. Capital letters denote aligned residues, i.e. residues involved in at least one of the fragments the alignment consists of. Lower-case letters denote residues that do not belong to any of these selected fragments; DIALIGN does not consider them aligned. Thus, if a lower-case letter appears in the same column as other letters, this is pure chance; these residues are not considered to be homologous.
- Numbers below the alignment roughly reflect the degree of local similarity among the sequences. More precisely, they represent the sum of `weights' of fragments connecting residues at the respective position. The numbers are normalized such that every position gets a value between 0 and 9. Thus, these numbers reflect the relative degree of similarity within an alignment, since in every alignment, the region of maximum similarity gets a score of 9.
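The two conventions above (upper case for aligned residues, 0 to 9 for relative similarity) can be made concrete with a small sketch. The function names are ours, and the rounding rule in normalize_scores (truncation) is an assumption; the text only states that values are scaled so that the maximum becomes 9.

```python
def aligned_mask(row):
    """One flag per residue in an alignment row: True if upper case (aligned).

    Gap characters and the blanks between blocks of ten columns are skipped.
    """
    return [ch.isupper() for ch in row if ch.isalpha()]


def normalize_scores(weight_sums):
    """Scale per-column sums of fragment weights to integers from 0 to 9.

    ASSUMPTION: values are truncated after scaling; DIALIGN may round differently.
    """
    top = max(weight_sums, default=0)
    if top <= 0:
        return [0] * len(weight_sums)
    return [int(w / top * 9) for w in weight_sums]


if __name__ == "__main__":
    # First two blocks of the 1aab_ row from the running example.
    mask = aligned_mask("gkgd------ PKKPrgkmss")
    print(sum(mask), "of", len(mask), "residues are aligned")

    # Hypothetical weight sums for five alignment columns.
    print(normalize_scores([0.0, 1.5, 3.0, 6.0, 12.0]))
```

Because the scaling is relative to the maximum within one alignment, the digits are not comparable between different alignments.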
Secondary structures predicted by PSIPRED are given in the following format:
83 CCCCCCCCCCCCCHHHHHHHHHHHHHHHHCCCCCCCHHHHHHHHHHHHHHCCHHHHHHHHHHHHHHHHHHHHHHHCCCCCCCC 85 CCCCCCCCCCHHHHHHHHHHHHHHHHCCCCCHHHHHHHHHHHHHCCCHHHHHHHHHHHHHHHHHHHHHCHCCCCCCCCCCCCCCC 91 CCCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHCCCCCHHHHHHHHHHHHHHCCHHHHHHHHHHHHHHHHHHHHHHHHHHCCCCCHCCCCCC 86 CCCCCCCCHHHHHHHHHHHHHHHHCCCCCHHHHHHHHHHHHHCCCHHHHHHHHHHHHHHHHHHHHHCCCCCCCCCCCCCCCCCCCC
- Each sequence is written on one line, and sequences are ordered as in the input.
- The length of the sequence is shown on the left hand side of the line.
- Letters C (for Coil), H (for Helix) and E (for bEta strand) indicate the predicted state for the corresponding position in the sequence.
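A file in this format can be read and sanity-checked as follows. This is a sketch under the assumption that each line holds exactly two whitespace-separated fields (length, then state string); the function names and the short example strings are made up for illustration.

```python
def parse_structure_lines(text):
    """Return the predicted state strings, one per sequence, in input order."""
    predictions = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        if len(parts) != 2:
            raise ValueError(f"expected '<length> <states>', got: {line!r}")
        length, states = int(parts[0]), parts[1]
        if len(states) != length:
            raise ValueError(f"declared length {length}, found {len(states)}")
        if set(states) - set("CHE"):
            raise ValueError(f"unexpected state letter in: {states!r}")
        predictions.append(states)
    return predictions


def state_fractions(states):
    """Fraction of positions predicted as coil, helix and strand."""
    return {letter: states.count(letter) / len(states) for letter in "CHE"}


if __name__ == "__main__":
    # Short made-up predictions, not taken from the running example.
    names = ["seq1", "seq2"]
    preds = parse_structure_lines("8 CCHHHHCC\n6 CEEEEC\n")
    # Predictions are listed in input order, so they can be paired by position.
    for name, states in zip(names, preds):
        print(name, state_fractions(states))
```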
This is PHYLIP tree format:
(((1aab_ :0.000019, 1k99_A :0.000019):0.000055, 1j46_A :0.000075):0.000213, lef_A :0.000287);
Trees can be visualized using the drawtree program contained in Joe Felsenstein's PHYLIP software package.
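For a quick look at the tree without PHYLIP, the leaf names and their terminal branch lengths can be pulled out of the parenthesised string with a regular expression. This sketch is not a full parser for the tree format: it ignores the topology and assumes unquoted names followed by a numeric branch length, as in the example above. The function name leaf_branch_lengths is ours.

```python
import re

# A leaf is a name that directly follows "(" or "," and is followed by ":<number>".
# Internal branch lengths follow ")" and are therefore not matched.
LEAF_PATTERN = re.compile(r"[(,]\s*([^(),:;]+?)\s*:\s*([0-9.eE+-]+)")


def leaf_branch_lengths(tree_text):
    """Return (leaf name, terminal branch length) pairs in order of appearance."""
    return [(name, float(length)) for name, length in LEAF_PATTERN.findall(tree_text)]


if __name__ == "__main__":
    tree = ("(((1aab_ :0.000019, 1k99_A :0.000019):0.000055, "
            "1j46_A :0.000075):0.000213, lef_A :0.000287);")
    for name, length in leaf_branch_lengths(tree):
        print(f"{name:8s} {length:.6f}")
```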