Documentation

Data sources

Genome sequences


Organism                    Sources and releases          Number of sequences
Oryza sativa                v7                            67393
Arabidopsis thaliana        v10                           35386
Medicago truncatula         vMt3.5v5                      64123
Picea abies                 v1.0                          71158
Sorghum bicolor             v1.4 (79)                     29448
Populus trichocarpa         v2 (210)                      73013
Vitis vinifera              v1 (12X)                      26346
Physcomitrella patens       v1.6                          38357
Selaginella moellendorffii  v1 (91) (Filtered models 3)   22285
Glycine max                 v1 (189)                      73320
Ostreococcus tauri          v4                            36912
Chlamydomonas reinhardtii   v4.3 (236)                    19524
Cyanidioschyzon merolae     v1                            5014
Brachypodium distachyon     v1 (192)                      31029
Carica papaya               v1 (113)                      27775
Ricinus communis            v0.1 (119)                    31221
Zea mays                    v5b60 (Filtered set)          63540
Musa balbisiana             v1.0 (TE filtered)            39914
Musa acuminata              v1                            36549
Theobroma cacao             v1 (TE filtered)              28798
Manihot esculenta           v4.1 (147)                    34151
Malus domestica             v1 (196)                      63517
Cucumis sativus             vGy14 (122)                   30364
Phoenix dactylifera         v3.0                          28889
Cajanus cajan               v1                            48680
Hordeum vulgare             v1                            26159
Phaseolus vulgaris          v1 (218)                      31638
Solanum lycopersicum        v2.40 ITAG                    34727
Citrus sinensis             v1 (154)                      46147
Gossypium raimondii         v1 (221) (880Mb to 2.500Mb)   77267
Lotus japonicus             v2.5                          37971
Solanum tuberosum           v3 (27/06)                    39031
Amborella trichopoda        v1.0                          26846
Elaeis guineensis           v1                            34801
Setaria italica             v1.0                          38801
Cicer arietinum             v1.0                          28269
Coffea canephora            v1                            25574

Remarks about sequence identifiers
Sequences are identified by the locus tags defined by the consortia responsible for the annotation (e.g. At5g20240.1).

Here are some examples of valid gene sequence identifiers for each species in the database:
Organism                    Examples of sequence identifiers
Oryza sativa                Os01g01010.1, Os01g01010.2
Arabidopsis thaliana        AT1G51370.2, AT1G50920.1
Medicago truncatula         contig_65682_1.1, contig_52881_1.1
Picea abies                 MA_9553451g0010, MA_934111g0010
Sorghum bicolor             Sb0010s002010.1, Sb0010s003120.1
Populus trichocarpa         Potri.T155100.1, Potri.T155100.2
Vitis vinifera              GSVIVT01000001001, GSVIVT01000002001
Physcomitrella patens       Pp1s1_2V6.1, Pp1s1_4V6.1
Selaginella moellendorffii  selmo_402070, selmo_139182
Glycine max                 Glyma0120s50.1, Glyma0120s50.2
Ostreococcus tauri          Ostta4_8043, Ostta4_8044
Chlamydomonas reinhardtii   g18373.t1, Cre08.g363350.t1.3
Cyanidioschyzon merolae     CMA001C, CMA004C
Brachypodium distachyon     Bradi0040s00200.1, Bradi0038s00200.2
Carica papaya               supercontig_0.1, supercontig_0.10
Ricinus communis            55548.m000014, 31922.m000031
Zea mays                    GRMZM2G055768_P01, GRMZM2G055768_P02
Musa balbisiana             ITC1587_Bchr10_P28349, ITC1587_Bchr10_P28350
Musa acuminata              GSMUA_Achr10P23190_001, GSMUA_Achr10P00010_001
Theobroma cacao             Tc01_g000010, Tc01_g000030
Manihot esculenta           cassava4.1_033075m, cassava4.1_023161m
Malus domestica             MDP0000133028, MDP0000360951
Cucumis sativus             Cucsa.000200.1, Cucsa.000210.1
Phoenix dactylifera         PDK_30s6550926g001, PDK_30s6550926g002
Cajanus cajan               C.cajan_46707, C.cajan_46708
Hordeum vulgare             MLOC_19.2, MLOC_51.2
Phaseolus vulgaris          Phvul.010G025500.1, Phvul.010G076000.1
Solanum lycopersicum        Solyc00g005000.2.1, Solyc00g005020.1.1
Citrus sinensis             orange1.1g034924m, orange1.1g027486m
Gossypium raimondii         Gorai.010G215500.1, Gorai.010G215500.2
Lotus japonicus             chr1.CM0001.20.r2.a, chr1.CM0001.30.r2.a
Solanum tuberosum           PGSC0003DMP400067339, PGSC0003DMP400027454
Amborella trichopoda        evm_27.model.AmTr_v1.0_scaffold00001.498, evm_27.model.AmTr_v1.0_scaffold00001.491
Elaeis guineensis           EG4P2, EG4P6
Setaria italica             Millet_GLEAN_10000168, Millet_GLEAN_10000169
Cicer arietinum             Ca_00001, Ca_00002
Coffea canephora            Cc00_g27210, Cc00_g29300
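
A query identifier can be checked against the expected shape for its species before it is submitted. The sketch below is only an illustration: the regular expressions are assumptions inferred from the example identifiers listed above for four species, not official naming rules from the annotation consortia, and the names ID_PATTERNS and looks_valid are hypothetical.

```python
import re

# Hypothetical patterns inferred only from the example identifiers in the
# table above; they are NOT official rules and may reject valid identifiers.
ID_PATTERNS = {
    "Oryza sativa": re.compile(r"Os\d{2}g\d{5}\.\d+"),
    # Case-insensitive because both AT1G51370.2 and At5g20240.1 appear in the text.
    "Arabidopsis thaliana": re.compile(r"AT\dG\d{5}\.\d+", re.IGNORECASE),
    "Vitis vinifera": re.compile(r"GSVIVT\d{11}"),
    "Solanum lycopersicum": re.compile(r"Solyc\d{2}g\d{6}\.\d+\.\d+"),
}


def looks_valid(organism: str, identifier: str) -> bool:
    """Return True if the whole identifier matches the assumed pattern."""
    pattern = ID_PATTERNS.get(organism)
    if pattern is None:
        raise KeyError(f"no pattern defined for {organism!r}")
    return pattern.fullmatch(identifier) is not None


if __name__ == "__main__":
    print(looks_valid("Arabidopsis thaliana", "AT1G51370.2"))
    print(looks_valid("Oryza sativa", "AT1G51370.2"))
```

Species whose identifiers follow less regular schemes (for example Chlamydomonas reinhardtii, where g18373.t1 and Cre08.g363350.t1.3 coexist) would need more than one pattern each.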


Data associated with genomes

Protein domain and domain architecture

Each sequence was analysed using InterProScan [1] to identify InterPro domains (InterPro v41).

UniProt (Universal Protein Resource)

Correspondence with UniProtKB/Swiss-Prot [4] (last updated: May 2011) was established from the ordered locus name when available (in the 'Gene names' section of the entry).
Otherwise, the mapping was done by running BLAST with UniProtKB/Swiss-Prot entries against a specific whole genome. We transferred the UniProtKB/Swiss-Prot entry for the first hit having an identity > 90%.
Uniprot Taxonomy
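
The two-step mapping rule described above can be sketched as follows. This is a minimal illustration and not the pipeline actually used: it assumes BLAST results in the standard 12-column tabular format (-outfmt 6) with Swiss-Prot entries as queries, genome sequences as subjects, and hits listed best-first for each query. It also reads "first hit having an identity > 90%" as the first hit in file order whose identity exceeds the threshold. The function names and the shape of the ordered-locus dictionary are hypothetical.

```python
from typing import Dict, Iterable, Optional

IDENTITY_THRESHOLD = 90.0  # percent identity cutoff stated in the text


def blast_assignments(blast_lines: Iterable[str],
                      threshold: float = IDENTITY_THRESHOLD) -> Dict[str, str]:
    """Map genome sequence ids to Swiss-Prot ids from tabular BLAST output.

    Assumed columns (outfmt 6): query id, subject id, percent identity, ...
    """
    assignments: Dict[str, str] = {}
    assigned_queries = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, identity = fields[0], fields[1], float(fields[2])
        if query in assigned_queries:
            continue  # only the first qualifying hit per query is used
        if identity > threshold:
            assigned_queries.add(query)
            # Keep the earliest assignment if a subject is hit by several queries.
            assignments.setdefault(subject, query)
    return assignments


def resolve(sequence_id: str,
            ordered_locus_map: Dict[str, str],
            blast_map: Dict[str, str]) -> Optional[str]:
    """Prefer the ordered-locus correspondence, then fall back to BLAST."""
    entry = ordered_locus_map.get(sequence_id)
    if entry is not None:
        return entry
    return blast_map.get(sequence_id)
```

The threshold is strict (> 90), so a hit at exactly 90% identity is not transferred in this sketch.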

Gene Ontology (Controlled vocabulary of terms for describing gene products)

GO terms were obtained from InterPro and UniProt.

InterPro to GO mapping version: 2013/11/16 12:17:42

UniProt to GO mapping version: n/a
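
As an illustration of how GO terms from the two sources might be combined for one sequence, the sketch below parses an InterPro to GO mapping and merges the result with GO terms taken from UniProt. It is a sketch under assumptions, not the procedure actually used: it assumes the mapping lines follow the usual interpro2go layout ("InterPro:IPRnnnnnn description > GO:term name ; GO:nnnnnnn", with comment lines starting with "!"), and the function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

# Assumed line layout of the interpro2go mapping file, for example:
#   InterPro:IPR000003 Retinoid X receptor/HNF4 > GO:DNA binding ; GO:0003677
LINE_RE = re.compile(r"^InterPro:(IPR\d{6})\s.*;\s*(GO:\d{7})\s*$")


def parse_interpro2go(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Build a dictionary {InterPro accession: set of GO ids}."""
    mapping: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        if line.startswith("!"):
            continue  # header and comment lines
        match = LINE_RE.match(line.strip())
        if match:
            mapping[match.group(1)].add(match.group(2))
    return dict(mapping)


def go_terms_for_sequence(interpro_domains: Iterable[str],
                          interpro2go: Dict[str, Set[str]],
                          uniprot_go: Iterable[str] = ()) -> Set[str]:
    """Union of GO ids from the sequence's InterPro domains and from UniProt."""
    terms = set(uniprot_go)
    for domain in interpro_domains:
        terms |= interpro2go.get(domain, set())
    return terms
```

Taking the union means a term reported by both sources appears once for the sequence.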

PubMed

Clusters were mapped to publications using selected PubMed IDs referenced in UniProt entries. Curators can also add further publications.

References

  1. Zdobnov E.M. and Apweiler R. "InterProScan - an integration platform for the signature-recognition methods in InterPro" Bioinformatics, 2001, 17(9): p. 847-8.
  2. Timothy L. Bailey and Charles Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California, 1994.
  3. Timothy L. Bailey and Michael Gribskov, "Combining evidence using p-values: application to sequence homology searches", Bioinformatics, Vol. 14, pp. 48-54, 1998.
  4. Schneider M, Bairoch A, Wu CH, Apweiler R. Plant protein annotation in the UniProt Knowledgebase. Plant Physiol. (2005) 138:59-66
  5. Kanehisa, M., Goto, S., Hattori, M., Aoki-Kinoshita, K.F., Itoh, M., Kawashima, S., Katayama, T., Araki, M., and Hirakawa, M.; From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 34, D354-357 (2006).