Methodology


Clustering method

The clustering was originally (v1.0) performed on the protein-coding genes of the two model plants, O. sativa and A. thaliana, using TribeMCL [1]. This software uses the Markov cluster (MCL) algorithm to group proteins into families based on a pre-computed matrix of pairwise sequence similarities. For family clustering we used several TribeMCL inflation values (1.2, 2, 3 and 5, corresponding to levels 1, 2, 3 and 4 on the website) and a BLAST [2] E-value cut-off of 1e-10.
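To illustrate the role of the inflation parameter, here is a minimal pure-Python sketch of the basic MCL iteration (expansion followed by inflation). It is not the TribeMCL implementation: the function names, the added self-loops, the convergence tolerance and the toy matrix are all assumptions made for the example.

```python
def normalise_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(similarity, inflation=2.0, max_iterations=100, tolerance=1e-9):
    """Cluster a symmetric, non-negative similarity matrix with basic MCL."""
    n = len(similarity)
    # Assumption: self-loops are added so that no column sums to zero.
    flow = normalise_columns(
        [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iterations):
        # Expansion: square the matrix, letting flow spread along longer paths.
        expanded = [[sum(flow[i][k] * flow[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)]
        # Inflation: raise entries to a power, then renormalise the columns.
        inflated = normalise_columns(
            [[value ** inflation for value in row] for row in expanded]
        )
        change = max(abs(inflated[i][j] - flow[i][j])
                     for i in range(n) for j in range(n))
        flow = inflated
        if change < tolerance:
            break
    # Each non-empty row lists the members drawn into the same cluster.
    clusters = {frozenset(j for j in range(n) if flow[i][j] > 1e-6)
                for i in range(n)}
    clusters.discard(frozenset())
    return sorted(sorted(cluster) for cluster in clusters)


# Toy matrix: proteins 0-2 are similar to each other, as are proteins 3-4.
toy_similarity = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
families = mcl(toy_similarity, inflation=2.0)
```

Higher inflation values generally split the graph into smaller, tighter clusters, which is why the four inflation values give four levels of family granularity.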

Sequences from newly sequenced genomes were inserted using a similarity-based method (BLAST E-value: 1e-10) with a sequence length check relative to the average length of the family members. Finally, unclassified genes were searched with BLAST, using the same parameters, to identify species-specific clusters.
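The insertion rule can be sketched as follows. Only the E-value cut-off (1e-10) comes from the text. The length tolerance, the data layout and the function name are hypothetical, since the exact length criterion is not specified here.

```python
from statistics import mean

E_VALUE_CUTOFF = 1e-10    # cut-off stated in the text
LENGTH_TOLERANCE = 0.5    # hypothetical: allowed deviation from the mean length


def assign_to_family(seq_length, hits, family_lengths,
                     max_evalue=E_VALUE_CUTOFF, tolerance=LENGTH_TOLERANCE):
    """Return the family of the best acceptable hit, or None.

    hits: list of (family_id, evalue) pairs for the new sequence.
    family_lengths: dict mapping family_id to its member lengths.
    """
    # Try significant hits from the lowest (best) E-value upwards.
    for evalue, family in sorted((e, f) for f, e in hits if e <= max_evalue):
        average = mean(family_lengths[family])
        # Length control: compare with the average length of the members.
        if abs(seq_length - average) <= tolerance * average:
            return family
    return None  # the sequence stays unclassified


family_lengths = {"F1": [300, 320, 310], "F2": [100, 110]}
hits = [("F1", 1e-50), ("F2", 1e-12)]
```

Under these assumptions, a 305-residue sequence goes to F1, while a 120-residue sequence fails the length check for F1 and falls back to F2.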

Cluster curation

Clusters are checked manually using a dedicated interface. We add value with basic annotations, including family names defined by consensus from existing gene and protein pattern annotations (e.g. UniProt, InterPro, PIRSF, KEGG, GO) for the sequences composing the clusters. The tool summarizes the high-quality annotations available in external databases for the protein sequences of a cluster. Some statistics were computed to spot clusters with specific InterPro family motifs. However, we do not check each member of the clusters at this stage. Annotation and analysis are an ongoing process. To check the curation status, please look at the signs.

Gene family lists

The clustering step and additional annotation steps allowed us to define various lists of gene families.

  • Annotated gene family list:
    Gene families or subfamilies manually annotated by an annotator and validated by the administrator. More information about our annotation strategy is presented here. Each "validated" family is classified using the confidence levels presented above (high, normal, unknown, suspicious, clustering error).

  • Species specific list
    Gene families containing sequences from only one species after the clustering step.

  • Phylum specific list
    Gene families containing sequences that belong only to the same phylum or clade (excluding species specific lists) after the clustering step.

  • Transcription Factor list
    Gene families identified as transcription factors based on RATF and DRTF.

  • Enzymatic list
    Gene families identified as being involved in enzymatic processes based on KEGG.

  • Plant specific family list
    Gene families showing no similarity with the other major kingdoms: Archaea, Bacteria and Eukaryota (excl. plants).
    To define this list, a sample of gene sequences (the percentage depends on the size and species composition of the gene family) was submitted to BLAST (E-value 1e-10 and length coverage >70%) against reference sets from NCBI and UniProt containing 10 Archaea (unicellular prokaryotes), 60 Bacteria (unicellular prokaryotes) and 22 eukaryotes.

  • PlantGOslim list
    This browser allows searching gene families using a subset of the GO term classification (Plant GOslim v1.2). GO terms are attributed to a gene family if one of its sequences matches a UniProt (SwissProt) entry or an InterPro domain linked to a GO term. The rules are defined as follows:
    If a gene family matches at least one UniProt (SwissProt) entry, or if 60% of the gene family members contain the same InterPro domain (an arbitrary threshold fixed to define IPR-specific families), then the gene family is flagged with the corresponding GO annotation.
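The PlantGOslim rule above can be expressed as a short sketch. The 60% threshold comes from the text. The data layout, the function name and the placeholder identifiers (IPR_A, GO:x and so on) are assumptions made for illustration and do not correspond to real entries.

```python
from collections import Counter


def family_go_terms(members, interpro_to_go, min_fraction=0.60):
    """Collect GO terms for a gene family.

    members: list of dicts with keys
        "swissprot_go": set of GO terms from a SwissProt match (may be empty)
        "interpro": set of InterPro domain identifiers found in the sequence
    interpro_to_go: dict mapping an InterPro identifier to a set of GO terms.
    """
    go_terms = set()
    domain_counts = Counter()
    for member in members:
        # Rule 1: a single SwissProt match is enough to transfer its GO terms.
        go_terms |= member["swissprot_go"]
        domain_counts.update(set(member["interpro"]))
    for domain, count in domain_counts.items():
        # Rule 2: the domain must be present in at least 60% of the members.
        if count / len(members) >= min_fraction:
            go_terms |= interpro_to_go.get(domain, set())
    return go_terms


members = [
    {"swissprot_go": {"GO:x"}, "interpro": {"IPR_A"}},
    {"swissprot_go": set(), "interpro": {"IPR_A", "IPR_B"}},
    {"swissprot_go": set(), "interpro": {"IPR_A"}},
    {"swissprot_go": set(), "interpro": set()},
]
interpro_to_go = {"IPR_A": {"GO:y"}, "IPR_B": {"GO:z"}}
terms = family_go_terms(members, interpro_to_go)
```

Here IPR_A is found in 3 of 4 members (75%) and contributes its GO term, whereas IPR_B (25%) does not.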

Phylogenetic analyses method

We developed an optimized phylogenomics pipeline for ortholog inference. We validated the full procedure using test sets of orthologs and paralogs for Oryza sativa and Arabidopsis thaliana to demonstrate that this method outperforms pairwise methods for ortholog predictions.

Filtering

Before processing annotated clusters, we filter sequences based on their MEME/MAST [4] e-value and on their length relative to the composition of the cluster. Sequences having a low e-value are filtered, and sequences with a very low e-value are distinguished, as they may be removed from the gene family in due course.

Multiple Alignment

Multiple alignment is one of the major steps in phylogenomic reconstruction. The objective is to identify and align the characteristic domains of the gene families. Alignments were generated using the MAFFT software [8]. Different parameters are applied according to the size of the cluster, because MAFFT offers a range of multiple alignment methods, including alignment of a very large number of sequences, a feature needed either for very large multigene families or when a large number of species is included.
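As an illustration of size-dependent settings, the sketch below builds a MAFFT command line from the number of sequences in a cluster. The size thresholds and the function name are hypothetical and are not the values used by the pipeline. The options shown (L-INS-i via --localpair, --auto, FFT-NS-2 via --retree 2) are standard options of current MAFFT releases.

```python
def mafft_command(fasta_path, n_sequences, small=200, large=2000):
    """Return a MAFFT command (as an argument list) suited to the cluster size.

    The thresholds `small` and `large` are illustrative assumptions.
    """
    if n_sequences <= small:
        # Accurate iterative refinement (L-INS-i) for small clusters.
        options = ["--localpair", "--maxiterate", "1000"]
    elif n_sequences <= large:
        # Let MAFFT choose a strategy for medium-sized clusters.
        options = ["--auto"]
    else:
        # Fast progressive method (FFT-NS-2) for very large clusters.
        options = ["--retree", "2", "--maxiterate", "0"]
    return ["mafft", *options, fasta_path]


command = mafft_command("cluster.fasta", 150)
```

MAFFT writes the alignment to standard output, so the returned list could be passed to subprocess.run with stdout redirected to a file.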

Then, we applied a masking procedure [9] to the optimized alignment to detect and remove amino acid columns (positions) containing little or no phylogenetic signal. Cluster alignments may be visualized directly from the website with Jalview [14].
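The idea of masking can be shown with a deliberately simplified sketch that drops gap-rich columns. This is not the conservation-based procedure cited above [9]: the gap-fraction criterion, its threshold and the function name are assumptions chosen only to show what removing columns from an alignment looks like.

```python
def mask_alignment(rows, max_gap_fraction=0.5):
    """Remove alignment columns whose fraction of gaps exceeds the threshold.

    rows: list of aligned sequences of equal length, with '-' for gaps.
    """
    n = len(rows)
    keep = [col for col in range(len(rows[0]))
            if sum(row[col] == "-" for row in rows) / n <= max_gap_fraction]
    # Rebuild every sequence from the retained columns only.
    return ["".join(row[col] for col in keep) for row in rows]


masked = mask_alignment(["MK-LV", "MR-LV", "MK-L-"])
```

In this toy alignment the third column contains only gaps and is removed, while the last column (one gap out of three) is kept.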

Tree construction

We use the PhyML software [10] with 100 bootstrap replicates. PhyML is one of the fastest maximum-likelihood tree reconstruction methods and generates large trees within an acceptable CPU time. PhyML first constructs a BioNJ tree (a variant of the Neighbor-Joining algorithm) and then optimizes this tree to improve the likelihood at each iteration. Trees can be visualized from the website using the ATV applet [13].

Tree rooting

Trees are rooted with the SDI algorithm [11]. We developed a new plant species tree, based on a published RIO tree and including the top 100 plant species according to NCBI rankings by number of stored sequences.
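The core of SDI-style inference is to map each gene tree node onto the species tree and to call a node a duplication when it maps to the same species node as one of its children. The sketch below counts duplications for two candidate rootings and keeps the one with fewer duplications. The nested-tuple tree encoding, the function names and the three-species tree are assumptions for illustration. This is not the SDI program itself.

```python
def index_species_tree(tree):
    """Return parent and depth dictionaries for a nested-tuple species tree."""
    parent, depth = {}, {}

    def visit(node, up, level):
        parent[node] = up
        depth[node] = level
        if isinstance(node, tuple):
            for child in node:
                visit(child, node, level + 1)

    visit(tree, None, 0)
    return parent, depth


def lca(a, b, parent, depth):
    """Lowest common ancestor of two species-tree nodes."""
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return a


def count_duplications(gene_tree, parent, depth):
    """Return (mapped species node, number of inferred duplications).

    gene_tree: binary nested tuples whose leaves are species names.
    """
    if not isinstance(gene_tree, tuple):
        return gene_tree, 0  # a leaf maps to its own species
    left, right = gene_tree
    m_left, d_left = count_duplications(left, parent, depth)
    m_right, d_right = count_duplications(right, parent, depth)
    mapped = lca(m_left, m_right, parent, depth)
    # Duplication: the node maps to the same species node as a child.
    is_duplication = mapped == m_left or mapped == m_right
    return mapped, d_left + d_right + int(is_duplication)


species_tree = (("rice", "sorghum"), "arabidopsis")
parent, depth = index_species_tree(species_tree)
rootings = [
    (("rice", "arabidopsis"), "sorghum"),
    (("rice", "sorghum"), "arabidopsis"),
]
best = min(rootings, key=lambda t: count_duplications(t, parent, depth)[1])
```

The second rooting agrees with the species tree and needs no duplication, so it is the one retained.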

Ortholog inference

We used the Resampled Inference of Orthologs (RIO) procedure [12] to detect orthologs. RIO relies on bootstrap resampling to check the robustness of phylogenomic predictions derived from the phylogenetic tree. This allows us to define a minimum threshold to highlight interesting relationships between genes. RIO also introduces orthology-related concepts such as "super-orthologs" (A) and "subtree-neighbors" (B).


Score Threshold

Scores for orthology and other phylogenomic predictions are based on the bootstrap procedure. The method counts the number of gene trees in which a specific phylogenomic relationship is found between two sequences (see schema below for ortholog inference).
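The counting step can be sketched as follows, assuming that the pairs found in each bootstrap tree have already been extracted. The function name, the input layout and the sequence labels are hypothetical.

```python
from collections import Counter


def relationship_scores(pairs_per_tree):
    """Percentage of bootstrap trees in which each pair shows the relationship.

    pairs_per_tree: one collection of (seq_a, seq_b) pairs per bootstrap tree.
    """
    counts = Counter()
    for pairs in pairs_per_tree:
        # Use unordered pairs and count each pair at most once per tree.
        counts.update({frozenset(pair) for pair in pairs})
    total = len(pairs_per_tree)
    return {pair: 100.0 * count / total for pair, count in counts.items()}


# Four bootstrap trees: "a" and "b" are inferred as orthologs in three of them.
scores = relationship_scores([
    {("a", "b")},
    {("b", "a")},
    {("a", "b"), ("a", "c")},
    set(),
])
```

With 100 bootstrap replicates, the count of supporting trees can be read directly as a percentage.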



Relationship type

- Orthologs: sequences that diverged by a speciation event.
- Super-orthologs: Given a rooted gene tree with duplication or speciation assigned to each of its internal nodes, two sequences are super-orthologous if and only if each internal node on their connecting path represents a speciation event (see figure below).

- Subtree-neighbors: Given a completely binary and rooted gene tree, the k-subtree-neighbors of a sequence q are defined as all sequences derived from the k-level parent node of q, except q itself (the level of q itself is 0, q's parent is 1, and so forth) (see figure below).
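These three definitions translate directly into code on a small rooted tree whose internal nodes are labelled with their event. The Node class, the function names and the example gene names are assumptions made for this sketch.

```python
class Node:
    def __init__(self, name=None, event=None, children=()):
        self.name = name      # leaf label (None for internal nodes)
        self.event = event    # "speciation" or "duplication" (internal nodes)
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def ancestors(node):
    """Internal nodes from the parent of `node` up to the root."""
    found = []
    while node.parent is not None:
        node = node.parent
        found.append(node)
    return found


def connecting_path(a, b):
    """Return (last common ancestor, internal nodes on the path from a to b)."""
    up_a, up_b = ancestors(a), ancestors(b)
    ids_b = {id(node) for node in up_b}
    common = next(node for node in up_a if id(node) in ids_b)
    path = up_a[:up_a.index(common) + 1] + up_b[:up_b.index(common)]
    return common, path


def are_orthologs(a, b):
    # The two sequences diverged at a speciation node.
    return connecting_path(a, b)[0].event == "speciation"


def are_super_orthologs(a, b):
    # Every internal node on the connecting path is a speciation.
    return all(n.event == "speciation" for n in connecting_path(a, b)[1])


def leaves(node):
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def subtree_neighbors(q, k):
    """Names of all leaves under the k-level parent of q, except q itself."""
    node = q
    for _ in range(k):
        if node.parent is None:
            break  # already at the root
        node = node.parent
    return sorted(leaf.name for leaf in leaves(node) if leaf is not q)


arath1, rice1 = Node("arath1"), Node("rice1")
rice2, sorbi2 = Node("rice2"), Node("sorbi2")
inner = Node(event="speciation", children=[rice2, sorbi2])
dup = Node(event="duplication", children=[rice1, inner])
root = Node(event="speciation", children=[arath1, dup])
```

In this toy tree, rice2 and sorbi2 are super-orthologs, arath1 and rice2 are orthologs but not super-orthologs (a duplication lies on their path), and rice1 and rice2 are paralogs.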

Functional annotation transfer

Identifying orthologs between species is one of the most accurate ways to identify sequences sharing a similar function. We recommend taking several parameters into account for functional gene annotation:
1. Identify super-orthologs or orthologs with a confidence score above 90%.
2. The selected orthologs should be in the same subtree, i.e. supported by a subtree-neighbor score above 50%.
3. Finally, the selected orthologs should be closely linked within the same subtree, with a minimum distance of 0.8.
These are the default parameters of the ortholog search bar.
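A filter applying these defaults might look like the sketch below. The field names and the function name are hypothetical. The 90% and 50% thresholds come from the list above. The way the distance value of 0.8 is applied (here as an upper bound on the tree distance between the two sequences) is an assumption, because the text does not state the direction of the comparison.

```python
def passes_default_filter(candidate, min_orthology=90.0,
                          min_subtree_neighbor=50.0, distance_cutoff=0.8):
    """Check one predicted ortholog against the default thresholds.

    candidate: dict with "orthology" and "subtree_neighbor" scores (in %)
    and a "distance" value between the query and the candidate.
    """
    return (
        candidate["orthology"] > min_orthology
        and candidate["subtree_neighbor"] > min_subtree_neighbor
        # Assumption: 0.8 is treated as the largest accepted distance.
        and candidate["distance"] <= distance_cutoff
    )


candidates = [
    {"id": "seq1", "orthology": 97.0, "subtree_neighbor": 80.0, "distance": 0.3},
    {"id": "seq2", "orthology": 85.0, "subtree_neighbor": 90.0, "distance": 0.2},
    {"id": "seq3", "orthology": 95.0, "subtree_neighbor": 40.0, "distance": 0.4},
]
selected = [c["id"] for c in candidates if passes_default_filter(c)]
```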

Please note that the relationships presented in this database are only predictions and the proposed thresholds are indicative and based on our own experience.

References

  1. Enright A.J., Van Dongen S., Ouzounis C.A. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research 30(7):1575-1584 (2002).
  2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403-410.
  3. Zdobnov E.M. and Apweiler R. "InterProScan - an integration platform for the signature-recognition methods in InterPro" Bioinformatics, 2001, 17(9): p. 847-8.
  4. Timothy L. Bailey and Charles Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California, 1994.
  5. Timothy L. Bailey and Michael Gribskov, "Combining evidence using p-values: application to sequence homology searches", Bioinformatics, Vol. 14, pp. 48-54, 1998.
  6. Schneider M, Bairoch A, Wu CH, Apweiler R. Plant protein annotation in the UniProt Knowledgebase. Plant Physiol. (2005) 138:59-66.
  7. Kanehisa, M., Goto, S., Hattori, M., Aoki-Kinoshita, K.F., Itoh, M., Kawashima, S., Katayama, T., Araki, M., and Hirakawa, M.; From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 34, D354-357 (2006).
  8. Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic acids research 2005, 33(2):511-518
  9. Pei J, Grishin NV: AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics (Oxford, England) 2001, 17(8):700-712
  10. Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 2003, 52(5):696-704.
  11. Zmasek CM, Eddy SR: A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics (Oxford, England) 2001, 17(9):821-828.
  12. Zmasek CM, Eddy SR: RIO: analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics 2002, 3(1):14.
  13. Zmasek CM, Eddy SR: ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics (Oxford, England) 2001, 17(4):383-384.
  14. Waterhouse, A.M., Procter, J.B., Martin, D.M.A, Clamp, M. and Barton, G. J. (2009) "Jalview Version 2 - a multiple sequence alignment editor and analysis workbench" Bioinformatics
