Documentation

Methodology

Pangene construction

Pangenes were computed with the GET_HOMOLOGUES-EST software v20092018 [1], which relies on NCBI BLAST v2.2, using the following program options: -M -F -t 0 -m cluster. Each resulting cluster was processed as follows:

  1. For single-copy gene clusters, sequences were aligned using MAFFT v7.313 and a consensus sequence was generated automatically. For each position of the alignment, the most frequent amino acid (AA) was kept. In case of a tie, the AA from the genome with the highest BUSCO score was selected. Finally, if a position contained more gaps than AAs, it was removed from the sequence.
  2. For multi-copy clusters (more than one sequence per genome), the same procedure was applied as for single-copy gene clusters, after a preliminary step that selects one representative sequence per genome. Multiple sequence alignments were used to generate a distance matrix with distmat (EMBOSS v6.6.0). The matrix was used to compute the distance from each sequence to the sequences of all other genomes, and the sequence with the smallest sum of distances was selected as the representative of its genome. The consensus step was then applied.
  3. For genotype-specific clusters, a distance matrix was also generated between all sequences, and the sequence with the lowest average distance (min(d/sum(d))) to all other sequences was considered the closest to the ancestral sequence and added to the pangenes.
  4. Singletons (i.e. clusters of one sequence) were searched for similarity using DIAMOND with the default e-value (i.e. 0.001) against all the 134 proteins to assess their putative accuracy. If at least one hit was found in at least two species, the singleton was added as a pangene; otherwise the sequence was excluded.
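
The consensus and representative-selection steps described in the list above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the dictionary-based inputs and the "-" gap character are assumptions, and the tie-break falls back to the next best BUSCO genome when the top genome carries a gap or a non-tied residue, which the text does not specify.

```python
from collections import Counter

GAP = "-"  # assumed gap character in the alignment


def consensus(aligned, busco):
    """Build a consensus from {genome: aligned_sequence} and {genome: BUSCO score}."""
    genomes = list(aligned)
    length = len(aligned[genomes[0]])
    if any(len(seq) != length for seq in aligned.values()):
        raise ValueError("all aligned sequences must have the same length")
    # Genomes ordered from highest to lowest BUSCO score, used for tie-breaking.
    best_first = sorted(genomes, key=lambda g: busco[g], reverse=True)
    out = []
    for pos in range(length):
        column = {g: aligned[g][pos] for g in genomes}
        residues = [aa for aa in column.values() if aa != GAP]
        gaps = len(column) - len(residues)
        if gaps > len(residues):
            continue  # more gaps than residues: the position is removed
        counts = Counter(residues)
        top = max(counts.values())
        tied = {aa for aa, n in counts.items() if n == top}
        # Keep the most frequent residue; on a tie, take the one carried
        # by the genome with the highest BUSCO score.
        for g in best_first:
            if column[g] in tied:
                out.append(column[g])
                break
    return "".join(out)


def pick_representatives(dist, genome_of):
    """Pick, per genome, the sequence with the smallest summed distance
    to the sequences of all other genomes.

    dist: {(seq_a, seq_b): distance}, one entry per unordered pair.
    genome_of: {sequence_id: genome}.
    """
    def d(a, b):
        return dist.get((a, b), dist.get((b, a)))

    best = {}
    for seq, genome in genome_of.items():
        others = [s for s, g in genome_of.items() if g != genome]
        total = sum(d(seq, o) for o in others)
        if genome not in best or total < best[genome][0]:
            best[genome] = (total, seq)
    return {genome: seq for genome, (total, seq) in best.items()}
```

For a multi-copy cluster, `pick_representatives` would be run first on the distmat output, and `consensus` would then be applied to the retained sequences only.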

Clustering method

The clustering was performed on protein-coding genes using TribeMCL [2]. This software uses a Markov cluster (MCL) algorithm to group proteins into families based on a pre-computed pairwise sequence similarity matrix. We ran TribeMCL with different inflation values, from less to more stringent thresholds (i.e. 1.2, 2, 3 and 5), which correspond to levels 1, 2, 3 and 4 on the website. The pairwise similarity matrix was obtained by running DIAMOND [3] with the default e-value (i.e. 0.001).
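
To illustrate how the inflation value drives the granularity of the families, here is a small pure-Python sketch of the MCL idea (alternating expansion and inflation on a column-stochastic matrix). It is a didactic toy and not TribeMCL itself: the function name, the fixed iteration count, the self-loop weight and the way clusters are read from the final matrix are all assumptions made for this example.

```python
# Inflation value used for each clustering level (values taken from the text above).
INFLATION_BY_LEVEL = {1: 1.2, 2: 2.0, 3: 3.0, 4: 5.0}


def mcl(similarity, inflation=2.0, iterations=50, eps=1e-6):
    """Cluster nodes from a symmetric matrix of non-negative similarity scores."""
    n = len(similarity)

    def normalise(mat):
        # Rescale every column so that it sums to 1 (column-stochastic matrix).
        for j in range(n):
            total = sum(mat[i][j] for i in range(n))
            for i in range(n):
                mat[i][j] /= total
        return mat

    # Add a self-loop to every node, then normalise.
    m = normalise([[similarity[i][j] + (1.0 if i == j else 0.0)
                    for j in range(n)] for i in range(n)])
    for _ in range(iterations):
        # Expansion: matrix squaring spreads flow along longer paths.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: element-wise power strengthens strong flows, weakens weak ones.
        m = normalise([[v ** inflation for v in row] for row in m])

    # Read clusters as connected components of the remaining non-zero entries.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if m[i][j] > eps:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy input: two tightly linked triplets joined by one weak similarity (2-3).
toy = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 0.1, 0, 0],
       [0, 0, 0.1, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
families = mcl(toy, inflation=INFLATION_BY_LEVEL[2])
```

Higher inflation values penalise weak links more strongly, which is why the higher levels tend to yield smaller, more stringent families.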

Cluster Annotation

Clusters are checked by human curators using a dedicated interface. We add annotations, including family names defined via a consensus from existing gene and protein pattern annotations (e.g. UniProt-SwissProt, InterPro, PIRSF, KEGG, GO) for the sequences composing the clusters. The tool summarizes the high-quality annotations available in external databases for the protein sequences of a cluster. Statistics were also computed to spot clusters with specific InterPro family motifs. Annotation and analyses are an ongoing process. To check the curation status, please look at the signs.

Graphical signs in GreenPhylDB

Clustering confidence levels

  • High confidence level
  • Normal confidence level
  • Unknown confidence level
  • Suspicious clustering
  • Clustering error
  • Non-curated

Phylogenetic analyses

  • Phylogenetic analyses performed
  • Phylogenetic analyses partially available
  • Phylogenetic analyses unavailable

Plant specificity

  • Plant-specific family
  • Not a plant-specific family
  • Gene family plant-specificity not available

Gene pangenome status

  • Core gene
  • Soft-Core gene
  • Dispensable gene
  • Genome-specific gene
  • Unspecified gene pangenome status

Gene family lists

The clustering step and additional annotation steps allowed us to define various lists of gene families.

  • Annotated gene family list:
    Gene families or subfamilies manually annotated by an annotator and validated by the administrator. More information about our annotation strategy is presented here. Each "validated" family is classified using the confidence levels presented above (high, normal, unknown, suspicious, clustering error).

  • Species specific list
    Gene families containing sequences from only one species after the clustering step.

  • Phylum specific list including species-specific families
    Gene families containing sequences that belong to the same phylum, including the species-specific families within that phylum.

  • Phylum specific list excluding species-specific families
    Gene families containing sequences that belong to the same phylum, excluding species-specific gene families.

  • Plant specific family list
    Gene families showing no similarity with the other major kingdoms: Archaea, Bacteria and Eukaryota (excluding plants).
    To define this list, 10 representative gene sequences from each family (selected using the CD-HIT software with parameters adjusted for each family) were submitted to BLAST (e-value: 1e-5) against the reference sequences from NCBI RefSeq (release XX: fungi, invertebrate, microbial, protozoa, vertebrate_mammalian, vertebrate_other, viral). Only families with no match were tagged as "plant specific".
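
The tagging rule for plant-specific families can be expressed in a few lines. The sketch below assumes that the BLAST results have already been parsed into (query, subject, e-value) tuples and that the representatives of each family are known. The function and parameter names are hypothetical.

```python
def plant_specific_families(representatives, hits, max_evalue=1e-5):
    """Return the families whose representatives have no RefSeq match.

    representatives: {family_id: [sequence_id, ...]} (10 per family in the text).
    hits: iterable of (query_sequence_id, subject_id, evalue) tuples.
    """
    # Query sequences with at least one hit passing the e-value cut-off.
    matched = {query for query, _subject, evalue in hits if evalue <= max_evalue}
    # A family is plant specific only if none of its representatives matched.
    return sorted(family for family, seqs in representatives.items()
                  if not matched.intersection(seqs))
```

A single significant hit for any representative is enough to exclude the whole family, which matches the "no match" criterion stated above.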

Phylogenetic analyses method

We developed a phylogenomics pipeline for ortholog inference.

Multiple Alignment: This is one of the major steps in phylogenomic analysis. Alignments were generated using the MAFFT software [4]. Different parameters are applied according to the size of the cluster, because MAFFT offers a range of multiple alignment methods, including methods able to align a very large number of sequences, a feature needed either for very large multigene families or when a large number of species is included.
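
As an illustration of size-dependent settings, the sketch below builds a MAFFT command line from the number of sequences in a cluster. The size thresholds (200 and 10,000) and the function name are assumptions chosen for the example, not the values used by the pipeline. The options themselves are standard MAFFT strategies: L-INS-i (--localpair --maxiterate 1000), FFT-NS-2 (--retree 2 --maxiterate 0) and PartTree (--parttree).

```python
def mafft_command(fasta_path, n_sequences, small=200, large=10000):
    """Return a MAFFT argument list suited to the cluster size.

    The `small` and `large` thresholds are hypothetical defaults.
    """
    if n_sequences <= small:
        # Accurate iterative refinement, practical for small clusters only.
        strategy = ["--localpair", "--maxiterate", "1000"]
    elif n_sequences <= large:
        # Fast progressive method for larger families.
        strategy = ["--retree", "2", "--maxiterate", "0"]
    else:
        # Approximate guide tree for very large families.
        strategy = ["--parttree", "--retree", "1"]
    return ["mafft", *strategy, fasta_path]
```

The returned list can be passed to `subprocess.run` with standard output redirected to a file, since MAFFT writes the alignment to standard output.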

Cluster alignments may be visualized directly from the website with MSAViewer [6].

Gene Tree construction: We used the FastTree software [7]. Trees can be visualized from the website using PhyD3 [8] and InTreeGreat.

Gene tree rooting and ortholog/paralog inference: Phylogenetic trees are rooted using RAPGreen [9] and a species tree generated from the NCBI Taxonomy. RAPGreen provides a list of orthologous and paralogous genes.

References

  1. Contreras-Moreira,B., Cantalapiedra,C.P., García-Pereira,M.J., Gordon,S.P., Vogel,J.P., Igartua,E., Casas,A.M. and Vinuesa,P. (2017) Analysis of Plant Pan-Genomes and Transcriptomes with GET_HOMOLOGUES-EST, a Clustering Solution for Sequences of the Same Species. Front Plant Sci, 8, 184.
  2. Enright,A.J., Van Dongen,S. and Ouzounis,C.A. (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res., 30, 1575–1584.
  3. Buchfink,B., Xie,C. and Huson,D.H. (2015) Fast and sensitive protein alignment using DIAMOND. Nat. Methods, 12, 59–60.
  4. Katoh,K. and Standley,D.M. (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol., 30, 772–780.
  5. Capella-Gutierrez,S., Silla-Martinez,J.M. and Gabaldon,T. (2009) trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics, 25, 1972–1973.
  6. Yachdav,G., Wilzbach,S., Rauscher,B., Sheridan,R., Sillitoe,I., Procter,J., Lewis,S.E., Rost,B. and Goldberg,T. (2016) MSAViewer: interactive JavaScript visualization of multiple sequence alignments. Bioinformatics, 32, 3501–3503.
  7. Price,M.N., Dehal,P.S. and Arkin,A.P. (2010) FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLOS ONE, 5, e9490.
  8. Kreft,L., Botzki,A., Coppens,F., Vandepoele,K. and Van Bel,M. (2017) PhyD3: a phylogenetic tree viewer with extended phyloXML support for functional genomics data visualization. Bioinformatics.
  9. Dufayard,J.-F., Duret,L., Penel,S., Gouy,M., Rechenmann,F. and Perriere,G. (2005) Tree pattern matching in phylogenetic trees: automatic search for orthologs or paralogs in homologous gene sequence databases. Bioinformatics, 21, 2596–2603.