The clustering was originally (v1.0) performed on the protein-coding genes of the two model plants, O. sativa and A. thaliana, using TribeMCL [1]. This software uses a Markov cluster (MCL) algorithm to group proteins into families based on a pre-computed pairwise sequence similarity matrix. For family clustering we used several TribeMCL inflation values (1.2, 2, 3 and 5), corresponding to levels 1, 2, 3 and 4 on the website, and a BLAST [2] E-value threshold of 1e-10.
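
To illustrate the role of the inflation parameter, here is a minimal sketch of the MCL iteration (expansion followed by inflation) on a small similarity matrix. It is not the TribeMCL implementation: the function name `mcl_clusters`, its defaults and the toy matrix are assumptions made for illustration, and the conversion of BLAST results into a similarity matrix is left out.

```python
import numpy as np

def mcl_clusters(similarity, inflation=2.0, max_iter=100, tol=1e-6):
    """Toy Markov clustering on a symmetric, non-negative similarity matrix."""
    m = np.asarray(similarity, dtype=float).copy()
    np.fill_diagonal(m, 1.0)                  # add self-loops
    m /= m.sum(axis=0, keepdims=True)         # make columns sum to 1
    for _ in range(max_iter):
        previous = m
        m = m @ m                             # expansion: two-step random walks
        m = m ** inflation                    # inflation: strengthen strong flows
        m /= m.sum(axis=0, keepdims=True)     # renormalize columns
        if np.abs(m - previous).max() < tol:  # stop once the matrix is stable
            break
    # Each non-empty row of the final matrix lists the members of one cluster.
    clusters = set()
    for row in m:
        members = tuple(int(i) for i in np.flatnonzero(row > 1e-6))
        if members:
            clusters.add(members)
    return sorted(clusters)

# Hypothetical similarities: proteins 0-2 resemble each other, as do 3-4.
toy_similarity = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
families = mcl_clusters(toy_similarity, inflation=2.0)
```

In MCL, a higher inflation value generally produces smaller, tighter clusters, which is why several values are offered as separate levels.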

Sequences from newly sequenced genomes were inserted using a similarity-based method (BLAST E-value: 1e-10), with a sequence length control relative to the average length of the family members. Finally, unclassified genes were blasted against each other, with the same parameters, to identify species-specific clusters.
Clusters are checked manually using a dedicated interface. We add value with basic annotations, including family names defined via a consensus of existing gene and protein pattern annotations (e.g. UniProt, InterPro, PIRSF, KEGG, GO) for the sequences composing the clusters. The tool summarizes the high-quality annotations available in external databases for the protein sequences of a cluster. Statistics were computed to spot clusters with specific InterPro family motifs. However, we do not check each member of the clusters at this stage. Annotation and analysis are an ongoing process. To check the curation status, please refer to the signs.
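
The insertion rule can be sketched as follows. This is only an illustration under stated assumptions: the function names, the input layout and the `tolerance` value are hypothetical, since the text gives the E-value threshold (1e-10) but not the exact length criterion.

```python
from statistics import mean

def passes_length_control(seq_length, family_lengths, tolerance=0.5):
    """True if the length is within `tolerance` (a fraction) of the family average.

    The tolerance of 0.5 is an assumed value, not one taken from the pipeline.
    """
    average = mean(family_lengths)
    return abs(seq_length - average) <= tolerance * average

def assign_to_family(seq_length, hits, family_lengths, max_evalue=1e-10, tolerance=0.5):
    """Pick the family of the best BLAST hit that passes both controls.

    hits: list of (family_id, evalue) pairs for the new sequence.
    family_lengths: dict mapping family_id to the lengths of its members.
    Returns the chosen family_id, or None if the sequence stays unclassified.
    """
    candidates = [
        (evalue, family_id)
        for family_id, evalue in hits
        if evalue <= max_evalue
        and passes_length_control(seq_length, family_lengths[family_id], tolerance)
    ]
    return min(candidates)[1] if candidates else None

# Hypothetical example data.
lengths = {"F1": [300, 320, 310], "F2": [100, 110]}
blast_hits = [("F1", 1e-50), ("F2", 1e-20)]
chosen = assign_to_family(305, blast_hits, lengths)
```

Sequences for which `assign_to_family` returns `None` correspond to the unclassified genes that are compared with each other in the last step.
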

The clustering step and the additional annotation steps allowed us to define various lists of gene families.
We developed an optimized phylogenomics pipeline for ortholog inference. We validated the full procedure using test sets of orthologs and paralogs for Oryza sativa and Arabidopsis thaliana to demonstrate that this method outperforms pairwise methods for ortholog predictions.
Before processing annotated clusters, we filter sequences based on their MEME/MAST [4] e-value and on their length compared with the composition of the cluster. Sequences having a low e-value are filtered, and those with a very low one are flagged, as they may be removed from the gene family in due course.

Multiple alignment is one of the major steps in phylogenomic construction. The objective is to identify and align the characteristic domains of the gene families. Alignments were generated using the MAFFT software [8]. Different parameters are applied according to the size of the cluster, because MAFFT offers a range of multiple alignment methods, including the alignment of a very large number of sequences, a feature needed either for very large multigene families or when a large number of species is used.
We then applied a masking procedure [9] to the optimized alignment to detect and remove amino acid columns (positions) containing little or no phylogenetic signal. Cluster alignments can be visualized directly from the website with Jalview [14].
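
As a rough illustration of column masking, the sketch below drops alignment columns that are mostly gaps. The gap-fraction criterion, the function name `mask_alignment` and the 0.5 threshold are assumptions chosen for simplicity; the masking procedure cited above [9] may use a different measure of phylogenetic signal.

```python
def mask_alignment(sequences, max_gap_fraction=0.5):
    """Remove columns whose fraction of gap characters exceeds the threshold.

    sequences: list of aligned sequences (equal-length strings, '-' for gaps).
    Returns the masked sequences in the same order.
    """
    if not sequences:
        return []
    n_sequences = len(sequences)
    n_columns = len(sequences[0])
    kept_columns = [
        col
        for col in range(n_columns)
        if sum(seq[col] == "-" for seq in sequences) / n_sequences <= max_gap_fraction
    ]
    return ["".join(seq[col] for col in kept_columns) for seq in sequences]

# Hypothetical toy alignment: the third column is a gap in two of three sequences.
masked = mask_alignment(["MK-A", "MR-A", "MKTA"])
```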

We use the PhyML software [10] (100 bootstrap replicates). PhyML is one of the fastest maximum-likelihood tree reconstruction methods and generates large trees within an acceptable CPU time. PhyML first constructs a BioNJ tree using the Neighbor-Joining algorithm and then optimizes this tree to improve the likelihood at each iteration. Trees can be visualized from the website using the ATV applet [13].

Trees are rooted with the SDI algorithm [11]. We developed a new plant species tree, based on a published RIO tree and including the top 100 plant species ranked by the number of sequences stored in NCBI.
We used the Resampled Inference of Orthologs (RIO) procedure to detect orthologs. RIO [12] relies on bootstrap resampling to check the robustness of phylogenomic predictions derived from the phylogenetic tree. This allows us to define minimum thresholds to highlight interesting relationships between genes. RIO also introduces concepts related to orthology, such as "super-orthologs" (A) and "subtree-neighbors" (B).
Score Threshold
Scores for orthology, and for the other phylogenomic predictions, are based on the bootstrap procedure.
The method counts the number of gene trees where a specific phylogenomics relationship is found between two sequences.
(see schema below for ortholog inference)
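
The counting step can also be sketched in code. This is a simplified illustration, not the RIO implementation: it assumes the relationships found in each bootstrap gene tree have already been extracted as sets of gene pairs, and the names `pair` and `relationship_support` are hypothetical.

```python
from collections import Counter

def pair(gene_a, gene_b):
    """Unordered gene pair, so that (a, b) and (b, a) are counted together."""
    return frozenset((gene_a, gene_b))

def relationship_support(relations_per_tree):
    """Percentage of bootstrap trees in which each relationship is found.

    relations_per_tree: one set of gene pairs per bootstrap gene tree.
    Returns a dict mapping each pair to its support in percent.
    """
    total = len(relations_per_tree)
    if total == 0:
        return {}
    counts = Counter()
    for relations in relations_per_tree:
        counts.update(set(relations))  # count each pair at most once per tree
    return {p: 100.0 * n / total for p, n in counts.items()}

# Hypothetical results for four bootstrap trees.
bootstrap_relations = [
    {pair("os1", "at1")},
    {pair("os1", "at1"), pair("os1", "at2")},
    {pair("at1", "os1")},
    set(),
]
support = relationship_support(bootstrap_relations)
```

With 100 bootstrap replicates, as used for the gene trees here, the score of a pair is simply the number of trees in which the relationship appears.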



Functional annotation transfer
Identifying orthologs between species is one of the most accurate ways to identify sequences sharing a similar function.
We recommend taking several parameters into account for functional gene annotation:
1. identify super-orthologs or orthologs with a confidence score above 90%
2. the selected orthologs should be in the same subtree, i.e. supported by a subtree-neighbor score above 50%
3. finally, the selected orthologs should be closely linked to the same subtree, with a minimum distance of 0.8
These are the default parameters of the ortholog search bar.
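
A minimal sketch of applying the two score thresholds to a list of predictions is given below. The `Prediction` record and its field names are hypothetical and do not reflect the database schema; the distance criterion is left out because the text does not state how that distance is computed.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    gene: str                      # candidate ortholog identifier
    ortholog_score: float          # bootstrap support for orthology, in percent
    subtree_neighbor_score: float  # bootstrap support for subtree-neighbor, in percent

def select_for_annotation_transfer(predictions, min_ortholog=90.0, min_subtree_neighbor=50.0):
    """Keep predictions whose scores are strictly above both thresholds."""
    return [
        p
        for p in predictions
        if p.ortholog_score > min_ortholog
        and p.subtree_neighbor_score > min_subtree_neighbor
    ]

# Hypothetical candidates for one query gene.
candidates = [
    Prediction("at1", 97.0, 80.0),
    Prediction("at2", 92.0, 40.0),
    Prediction("at3", 85.0, 90.0),
]
selected = select_for_annotation_transfer(candidates)
```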
Please note that the relationships presented in this database are only predictions and the proposed thresholds are indicative and based on our own experience.