Gene Product Information (GPI) file description
The Gene Product Information (GPI) file represents every annotatable biological entity for an organism: protein-coding genes, non-coding RNA genes, protein isoforms (i.e., splice variants) and modified forms, such as cleaved forms or proteins modified by post-translational modifications. The entities should be non-redundant.
This file is used to normalize annotations to single genes, and to map different identifiers for the same entity across different resources.
GPI 2.0 file format
This page is a summary of the GPI 2.0 file format; for full technical details and changes from the previous format, GPI 1.2, see the Full GPI 2.0 Specification page.
File Header
Mandatory elements of the GPI 2.0 file header
!gpi-version: 2.0
!generated-by: database must be listed in dbxrefs.yaml
!date-generated: YYYY-MM-DD or YYYY-MM-DDTHH:MM
Other header elements, such as links to the submitter's project page, funding sources, ontology versions, etc., may be included as shown below.
! URL: e.g. http://www.yeastgenome.org/
! Project-release: e.g. WS275
! Funding: e.g. NHGRI
! Columns: file format written out
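The header rules above can be checked mechanically. The following is a minimal sketch, assuming header lines have the form `!key: value`; the function names `parse_header` and `header_problems` and the sample date are illustrative and not part of the GPI specification. It does not check that the `generated-by` value is listed in dbxrefs.yaml.

```python
import re

# The three header keys this page lists as mandatory.
REQUIRED_HEADER_KEYS = ("gpi-version", "generated-by", "date-generated")
# Accepts YYYY-MM-DD or YYYY-MM-DDTHH:MM, the two forms shown above.
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2})?$")


def parse_header(lines):
    """Collect '!key: value' header lines into a dict (keys lower-cased)."""
    header = {}
    for line in lines:
        if not line.startswith("!"):
            break  # the header ends at the first line that is not a '!' line
        key, sep, value = line[1:].partition(":")
        if sep:
            header[key.strip().lower()] = value.strip()
    return header


def header_problems(header):
    """Return a list of human-readable problems; empty if none were found."""
    problems = [f"missing !{key}" for key in REQUIRED_HEADER_KEYS if key not in header]
    if header.get("gpi-version", "2.0") != "2.0":
        problems.append("gpi-version is not 2.0")
    date = header.get("date-generated")
    if date is not None and not DATE_PATTERN.match(date):
        problems.append("date-generated is not YYYY-MM-DD or YYYY-MM-DDTHH:MM")
    return problems


# Hypothetical header; the date is a placeholder, not taken from a real file.
example = [
    "!gpi-version: 2.0",
    "!generated-by: SGD",
    "!date-generated: 2024-01-15",
    "! URL: http://www.yeastgenome.org/",
]
print(header_problems(parse_header(example)))
```

Because only the first colon splits key from value, URLs and `HH:MM` times in the value are preserved intact.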
GPI File Contents
The GPI 2.0 file comprises 11 tab-delimited fields. For fields that allow multiple values, the values should be separated by pipes (|).
| Column | Content | Cardinality | Example 1: UniProt protein | Example 2: WormBase protein-coding gene | Example 3: UniProt isoform | Example 4: SGD protein complex | Example 5: PomBase protein complex | Example 6: UniProt human ncRNA | Example 7: FlyBase ncRNA |
|---|---|---|---|---|---|---|---|---|---|
| 1 | DB:Object ID | 1 | UniProtKB:Q4VCS5 | WB:WBGene00000007 | UniProtKB:Q4VCS5-1 | SGD:S000217643 | ComplexPortal:CPX-25785 | RNAcentral:URS0000527F89_9606 | FB:FBgn0262451 |
| 2 | Object Symbol | 1 | AMOT | aat-6 | AMOT | CBF1:MET4:MET28 | SWI5-SWI2 | URS0000527F89_9606 | mir-ban |
| 3 | Object Name | 0 or 1 | Angiomotin | Amino acid transporter protein 6 | Angiomotin | sulfur metabolism transcription factor complex | mating-type switching complex | Homo sapiens (human) hsa-miR-145-5p | bantam |
| 4 | Object Synonym(s) | 0 or more | KIAA1071 | aat-6\|T11F9.4 | | | | | |
| 5 | Object Type | 1 | PR:000000001 | SO:0000704 | PR:000000001 | GO:0032991 | GO:0032991 | SO:0000276 | SO:0000276 |
| 6 | Object Taxon | 1 | NCBITaxon:9606 | NCBITaxon:6239 | NCBITaxon:9606 | NCBITaxon:559292 | NCBITaxon:4896 | NCBITaxon:9606 | NCBITaxon:7227 |
| 7 | Encoded by | 0 or more | HGNC:17810 | | HGNC:17810 | | | HGNC:31532 | |
| 8 | Canonical object ID | 1 | UniProtKB:Q4VCS5 | WB:WBGene00000007 | UniProtKB:Q4VCS5 | SGD:S000217643 | ComplexPortal:CPX-25785 | RNAcentral:URS0000527F89_9606 | FB:FBgn0262451 |
| 9 | Protein Complex Members | 0 or more | | | | SGD:S000003821\|SGD:S000001456\|SGD:S000005047 | PomBase:SPAC1142.03c\|PomBase:SPBC409.03 | | |
| 10 | Cross-reference(s) | 0 or more | NCBIGene:154796\|ENSEMBL:ENSG00000126016 | UniProtKB:Q22397 | NCBIGene:154796\|ENSEMBL:ENSG00000126016 | ComplexPortal:CPX-1016 | | ENSG00000276365 | RNAcentral:URS00002F21DA_7227 |
| 11 | Gene Product Properties | 0 or more | db-subset=Swiss-Prot | | | | | | |
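To illustrate the layout, here is a minimal sketch of reading one GPI 2.0 data line into its 11 fields, splitting the multi-valued fields on pipes. The `GpiRecord` type and the function names are hypothetical, and the sample row is assembled from the SGD protein complex example values above.

```python
from typing import NamedTuple

MULTI_VALUE_SEPARATOR = "|"


class GpiRecord(NamedTuple):
    object_id: str          # column 1
    symbol: str             # column 2
    name: str               # column 3
    synonyms: list          # column 4 (multi-valued)
    object_type: str        # column 5
    taxon: str              # column 6
    encoded_by: list        # column 7 (multi-valued)
    canonical_id: str       # column 8
    complex_members: list   # column 9 (multi-valued)
    xrefs: list             # column 10 (multi-valued)
    properties: list        # column 11 (multi-valued)


def split_multi(value):
    """Split a pipe-separated field; an empty field yields an empty list."""
    return [item for item in value.split(MULTI_VALUE_SEPARATOR) if item]


def parse_gpi_line(line):
    """Parse one tab-delimited data line into a GpiRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 tab-delimited fields, got {len(fields)}")
    return GpiRecord(
        fields[0], fields[1], fields[2], split_multi(fields[3]),
        fields[4], fields[5], split_multi(fields[6]), fields[7],
        split_multi(fields[8]), split_multi(fields[9]), split_multi(fields[10]),
    )


def iter_records(lines):
    """Yield records from an iterable of lines, skipping '!' header lines and blanks."""
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue
        yield parse_gpi_line(line)


example_line = "\t".join([
    "SGD:S000217643", "CBF1:MET4:MET28",
    "sulfur metabolism transcription factor complex", "",
    "GO:0032991", "NCBITaxon:559292", "", "SGD:S000217643",
    "SGD:S000003821|SGD:S000001456|SGD:S000005047",
    "ComplexPortal:CPX-1016", "",
])
record = parse_gpi_line(example_line)
print(record.complex_members)
```

Empty optional fields are kept as empty strings or empty lists rather than dropped, so the column positions never shift.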
Definitions and requirements for GPI 2.0 field contents
1. DB:Object ID
- A unique identifier for the entity being annotated, composed of two elements: a DB prefix identifying the database, which must be described in the GO dbxrefs.yaml file, and a DB Object ID, which is the alphanumeric identifier corresponding to the entity. The DB:DB Object ID is the combined identifier for the database object. Examples:
- UniProtKB:P99999
- SGD:S000002164
- MGI:MGI:1919306
- The identifier may reference the canonical form of a gene or gene product, including functional RNAs, as well as gene variants and distinct proteins produced by differential splicing, alternative translational starts, post-translational cleavage or post-translational modification. If the identifier is not a canonical gene or gene product identifier, the corresponding canonical form must be referenced in Column 8 (Canonical object ID) of the GPI file. Note that not all DB:Object_IDs are necessarily in the same ID space: some groups use RNAcentral IDs for RNAs, some use ComplexPortal IDs for protein complexes, and some use Protein Ontology IDs for modified proteoforms. Together, the unique values in Column 8 correspond to the full set of gene products encoded by the organism, as well as the protein complexes represented in the source database.
- Cardinality = 1
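As the MGI example shows, the local part of an identifier can itself contain a colon, so the identifier should be split on the first colon only. A minimal sketch follows; the `KNOWN_PREFIXES` set is a hypothetical stand-in for the prefixes that would in practice be loaded from dbxrefs.yaml.

```python
# Hypothetical subset of prefixes; a real check would load them from dbxrefs.yaml.
KNOWN_PREFIXES = {"UniProtKB", "SGD", "MGI"}


def split_object_id(identifier):
    """Split 'DB:Object ID' into (prefix, local_id) at the first colon only."""
    prefix, sep, local_id = identifier.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"{identifier!r} is not of the form DB:Object ID")
    if prefix not in KNOWN_PREFIXES:
        raise ValueError(f"unknown DB prefix {prefix!r}")
    return prefix, local_id


for identifier in ("UniProtKB:P99999", "SGD:S000002164", "MGI:MGI:1919306"):
    print(split_object_id(identifier))
```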
2. Object Symbol
- A name for the entity represented by the DB:Object_ID. The Object Symbol field should be text that means something to a biologist wherever possible (a gene symbol, for example). If the entity has no name, the DB:Object_ID can be used as the Object Symbol.
- Cardinality = 1
3. Object Name
- The name of the gene or gene product corresponding to the DB:Object_ID in Column 1. White spaces are allowed in this field.
- Cardinality = 0 or 1
4. Object Synonym
- Alternative names for the entity in DB:Object_ID in Column 1. These entries may be a gene symbol, clone ID, or any other label or identifier. Object synonyms are useful for searching.
- Cardinality = 0, 1, > 1; for cardinality > 1, values must be pipe-separated.
5. Object Type
- An ontology identifier describing the class of biological entity of the DB:Object_ID in Column 1. The ontology identifier must be a value from Protein Ontology (PR) for proteins, Gene Ontology (GO) for protein complexes, or Sequence Ontology (SO) for all other entities. The full list of entity types and their corresponding identifiers can be found in biological_entity_mapping.yaml. Allowed entity types:
- protein (PR:000000001)
- protein-containing complex (GO:0032991)
- protein-coding gene (SO:0001217)
- ncRNA (SO:0000655) or any SO child term
- ncRNA-coding gene (SO:0001263) or any SO child term
- Note on object types: This field should describe the type of biological object as defined by the contributing database. For example, WormBase identifiers represent genes, PomBase identifiers represent protein-coding genes, and SGD identifiers represent proteins.
- GO does not allow ‘gene’ and ‘gene product’ as biological entity types, because these types do not distinguish between protein and ncRNA products.
- Cardinality = 1
6. Object Taxon
- The NCBI taxon ID of the organism (species or strain) encoding the DB:Object_ID from Column 1, in the format NCBITaxon:numerical_identifier.
- Cardinality = 1
7. Encoded by
- For proteins and transcripts, Encoded by refers to the gene ID that encodes those entities, e.g. ENSG00000197153.
- Cardinality = 0, 1, > 1
- For cardinality > 1, values must be pipe-separated.
8. Canonical object ID
- If the DB:Object_ID in Column 1 describes a gene, a gene-centric reference protein accession or a protein complex, then this column should repeat the DB:Object_ID from Column 1.
- If the DB:Object_ID in Column 1 describes an entity derived from a gene product, such as a protein isoform, a modified protein or a processed transcript (e.g. miRNA), then this column refers to the gene-centric ID of the Column 1 entry.
- If the DB:Object_ID in Column 1 describes a type of entity for which the organism’s gene set represented by the GPI has no IDs (for example, uncloned genes), then this column should repeat the DB:Object_ID from Column 1.
- Cardinality = 1
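Because every row carries a canonical ID, Column 8 can be used to roll isoforms and modified forms up to their gene-centric entry. The sketch below assumes rows have already been reduced to (Column 1, Column 8) pairs; the function name `group_by_canonical` is illustrative, and the identifiers are taken from the examples in the table above.

```python
from collections import defaultdict


def group_by_canonical(rows):
    """Map each canonical ID (column 8) to the object IDs (column 1) that point to it."""
    groups = defaultdict(list)
    for object_id, canonical_id in rows:
        groups[canonical_id].append(object_id)
    return dict(groups)


rows = [
    ("UniProtKB:Q4VCS5", "UniProtKB:Q4VCS5"),      # canonical protein repeats its own ID
    ("UniProtKB:Q4VCS5-1", "UniProtKB:Q4VCS5"),    # isoform points to the canonical protein
    ("WB:WBGene00000007", "WB:WBGene00000007"),    # gene repeats its own ID
]
for canonical_id, members in group_by_canonical(rows).items():
    print(canonical_id, members)
```

The keys of the resulting mapping are the unique Column 8 values, which is the set an annotation pipeline would normalize to.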
9. Protein-Containing Complex Members
- If the DB:Object_ID in Column 1 describes a protein-containing complex, this column can include the gene-centric canonical protein identifiers.
- Cardinality = 0, 1, > 1
- For cardinality > 1, values must be pipe-separated.
10. Database cross-references (DB_Xrefs)
- Identifiers for the object in DB:Object_ID found in other databases. Identifiers used must be standard 2-part global identifiers: a namespace and a value, separated by a colon, e.g. UniProtKB:Q60FP0.
- This field is mandatory if the prefix in Column 1 is not one of UniProtKB, RNAcentral, or ComplexPortal. In these cases, DB_Xrefs must include the corresponding UniProtKB, RNAcentral, or ComplexPortal ID, as appropriate for the Object Type (see Column 5).
- Note that species-specific RNAcentral IDs have the taxon ID of the entity appended after an underscore character (“_”): see [RNAcentral:URS00002F21DA](https://rnacentral.org/rna/URS00002F21DA) (generic across several organisms) versus RNAcentral:URS00002F21DA_7227 (specific to Drosophila melanogaster).
- Additional cross-references, such as NCBI gene or protein IDs, HGNC, etc., may also be included.
- Cardinality = 0, 1, > 1
- For cardinality > 1, values must be pipe-separated.
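The cross-reference rules above can be sketched as a small check. This is an illustration under assumptions, not a reference validator: prefixes are compared case-insensitively (an assumption, since this page spells the prefix both RNACentral and RNAcentral), the check does not verify that the cross-reference matches the Object Type, and the function names are hypothetical.

```python
# Prefixes for which column 10 is optional, compared in lower case (assumption).
CORE_PREFIXES = {"uniprotkb", "rnacentral", "complexportal"}


def prefix_of(identifier):
    """Return the lower-cased namespace of a prefix:value identifier."""
    return identifier.partition(":")[0].lower()


def xref_problems(object_id, xrefs):
    """Return a list of problems found in the column 10 values for one row."""
    problems = []
    for xref in xrefs:
        prefix, sep, value = xref.partition(":")
        if not sep or not prefix or not value:
            problems.append(f"{xref!r} is not a prefix:value identifier")
    needs_core_xref = prefix_of(object_id) not in CORE_PREFIXES
    if needs_core_xref and not any(prefix_of(x) in CORE_PREFIXES for x in xrefs):
        problems.append("a UniProtKB, RNAcentral or ComplexPortal cross-reference is required")
    return problems


def rnacentral_taxon(identifier):
    """Return the taxon suffix of an RNAcentral ID, or None for a generic ID."""
    _, _, local_id = identifier.partition(":")
    _, sep, taxon = local_id.partition("_")
    return taxon if sep and taxon.isdigit() else None


print(xref_problems("FB:FBgn0262451", ["RNAcentral:URS00002F21DA_7227"]))
print(rnacentral_taxon("RNAcentral:URS00002F21DA_7227"))
```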
11. Gene Product Properties
- The Properties column can be filled with a pipe-separated list of values in the format “property_name=property_value”. There is a fixed vocabulary for the property names, as indicated in the table below:

| Property | Allowed usages per annotation | Value Grammar | Example | Comment |
|---|---|---|---|---|
| db-subset | 0 or 1 | Must be either TrEMBL or Swiss-Prot | db-subset=TrEMBL | The status of a UniProtKB accession with respect to curator review. |
| uniprot-proteome | 0 or 1 | ID | uniprot-proteome=UP000001940 | A unique UniProtKB identifier for the set of proteins that constitute an organism’s proteome. |
| go-annotation-complete | 0 or 1 | Date_Or_Date_Time in ISO 8601 format, i.e. YYYY-MM-DD | go-annotation-complete=2019-02-05 | Indicates the date on which a curator determined that the set of GO annotations for a given entity is complete. Complete means that all information about a gene has been captured as a GO term, but not necessarily that all possible supporting evidence is annotated. |
| go-annotation-summary | 0 or 1 | Text | go-annotation-summary=Sterol binding protein with a role in intracellular sterol transport; localizes to mitochondria and the cortical ER | A textual gene or gene product description. |

- Cardinality = 0, 1, > 1
- For cardinality > 1, values must be pipe-separated.
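A minimal sketch of parsing the Properties column against the fixed vocabulary is shown below. The function name `parse_properties` is illustrative, and the use of `datetime.fromisoformat` to check the date is an assumption about what counts as an acceptable ISO 8601 value; it may accept more forms than the specification intends.

```python
from datetime import datetime

ALLOWED_PROPERTIES = {
    "db-subset",
    "uniprot-proteome",
    "go-annotation-complete",
    "go-annotation-summary",
}
DB_SUBSET_VALUES = {"TrEMBL", "Swiss-Prot"}


def parse_properties(field):
    """Parse 'name=value|name=value' into a dict, enforcing the rules in the table."""
    properties = {}
    if not field:
        return properties
    for item in field.split("|"):
        # Split on the first '=' only, so free-text values may contain '='.
        name, sep, value = item.partition("=")
        name, value = name.strip(), value.strip()
        if not sep or name not in ALLOWED_PROPERTIES:
            raise ValueError(f"unknown or malformed property: {item!r}")
        if name in properties:
            raise ValueError(f"property {name!r} may be used at most once")
        if name == "db-subset" and value not in DB_SUBSET_VALUES:
            raise ValueError("db-subset must be TrEMBL or Swiss-Prot")
        if name == "go-annotation-complete":
            datetime.fromisoformat(value)  # raises ValueError if not a valid date
        properties[name] = value
    return properties


print(parse_properties("db-subset=Swiss-Prot|uniprot-proteome=UP000001940"))
```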