Gene Product Information (GPI) file description

The Gene Product Information (GPI) file represents every annotatable biological entity for an organism: protein-coding genes, non-coding RNA genes, protein isoforms (i.e., splice variants), and modified forms, such as cleaved forms or proteins modified by post-translational modifications. The entities should be non-redundant.

This file is used to normalize annotations to single genes, and to map different identifiers for the same entity across different resources.

GPI 2.0 file format

This page is a summary of the GPI 2.0 file format; for full technical details and changes from the previous format, GPI 1.2, see the Full GPI 2.0 Specification page.

File Header

Mandatory elements of the GPI 2.0 file header

!gpi-version: 2.0
!generated-by: database must be listed in dbxrefs.yaml
!date-generated: YYYY-MM-DD or YYYY-MM-DDTHH:MM

Other header elements, such as links to the submitter's project page, funding sources, ontology versions, etc., may also be included in the file header, as shown below.

! URL: e.g. http://www.yeastgenome.org/
! Project-release: e.g. WS275
! Funding: e.g. NHGRI
! Columns: file format written out
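The header rules above can be checked mechanically. The following is a minimal sketch, not an official validator: the function name `check_header` and the error messages are illustrative assumptions, and the date pattern only checks the shape of the two formats listed above, not whether the date is a real calendar date.

```python
import re

# Mandatory header keys listed in this section.
MANDATORY = ("gpi-version", "generated-by", "date-generated")
# Shape check for YYYY-MM-DD or YYYY-MM-DDTHH:MM (no calendar validation).
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2})?$")


def check_header(lines):
    """Return a list of problems found in the '!'-prefixed header lines."""
    found = {}
    for line in lines:
        if not line.startswith("!") or ":" not in line:
            continue  # not a "!key: value" header line
        key, value = line[1:].split(":", 1)
        found[key.strip()] = value.strip()

    problems = [f"missing !{key}" for key in MANDATORY if key not in found]
    if found.get("gpi-version", "2.0") != "2.0":
        problems.append("gpi-version is not 2.0")
    date_value = found.get("date-generated")
    if date_value is not None and not DATE_RE.match(date_value):
        problems.append("date-generated is not YYYY-MM-DD or YYYY-MM-DDTHH:MM")
    return problems
```

Calling `check_header` on a header with all three mandatory lines returns an empty list; optional lines such as `! URL:` are read but not checked.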

GPI File Contents

The GPI 2.0 file comprises 11 tab-delimited fields. For fields that allow multiple values, the values should be separated by pipes (|).

| Column | Content | Cardinality | Example 1: UniProt protein | Example 2: WormBase protein-coding gene | Example 3: UniProt isoform | Example 4: SGD protein complex | Example 5: PomBase protein complex | Example 6: UniProt human ncRNA | Example 7: FlyBase ncRNA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | DB:Object ID | 1 | UniProtKB:Q4VCS5 | WB:WBGene00000007 | UniProtKB:Q4VCS5-1 | SGD:S000217643 | ComplexPortal:CPX-25785 | RNAcentral:URS0000527F89_9606 | FB:FBgn0262451 |
| 2 | Object Symbol | 1 | AMOT | aat-6 | AMOT | CBF1:MET4:MET28 | SWI5-SWI2 | URS0000527F89_9606 | mir-ban |
| 3 | Object Name | 0 or 1 | Angiomotin | Amino acid transporter protein 6 | Angiomotin | sulfur metabolism transcription factor complex | mating-type switching complex | Homo sapiens (human) hsa-miR-145-5p | bantam |
| 4 | Object_Synonym(s) | 0 or more | KIAA1071 | aat-6 \| T11F9.4 | | | | | |
| 5 | Object Type | 1 | PR:000000001 | SO:0000704 | PR:000000001 | GO:0032991 | GO:0032991 | SO:0000276 | SO:0000276 |
| 6 | Object Taxon | 1 | NCBITaxon:9606 | NCBITaxon:6239 | NCBITaxon:9606 | NCBITaxon:559292 | NCBITaxon:4896 | NCBITaxon:9606 | NCBITaxon:7227 |
| 7 | Encoded by | 0 or more | HGNC:17810 | | HGNC:17810 | | | HGNC:31532 | |
| 8 | Canonical object ID | 1 | UniProtKB:Q4VCS5 | WB:WBGene00000007 | UniProtKB:Q4VCS5 | SGD:S000217643 | ComplexPortal:CPX-25785 | RNAcentral:URS0000527F89_9606 | FB:FBgn0262451 |
| 9 | Protein Complex Members | 0 or more | | | | SGD:S000003821 \| SGD:S000001456 \| SGD:S000005047 | PomBase:SPAC1142.03c \| PomBase:SPBC409.03 | | |
| 10 | Cross-reference(s) | 0 or more | NCBIGene:154796 \| ENSEMBL:ENSG00000126016 | UniProtKB:Q22397 | NCBIGene:154796 \| ENSEMBL:ENSG00000126016 | ComplexPortal:CPX-1016 | | ENSG00000276365 | RNAcentral:URS00002F21DA_7227 |
| 11 | Gene Product Properties | 0 or more | db-subset=Swiss-Prot | | | | | | |
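To illustrate how one row maps onto the 11 fields, here is a minimal parsing sketch. The key names in `FIELDS` and the function name `parse_gpi_line` are illustrative assumptions, not part of the specification; the choice of which fields are split on pipes follows the cardinalities listed above.

```python
# Illustrative key names for the 11 columns, in file order (not from the spec).
FIELDS = (
    "id", "symbol", "name", "synonyms", "type", "taxon",
    "encoded_by", "canonical_id", "complex_members", "xrefs", "properties",
)
# Columns 4, 7, 9, 10 and 11 may hold several pipe-separated values.
MULTI_VALUED = {"synonyms", "encoded_by", "complex_members", "xrefs", "properties"}


def parse_gpi_line(line):
    """Split one tab-delimited GPI 2.0 row into a dict keyed by FIELDS."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}")
    record = {}
    for field, raw in zip(FIELDS, columns):
        if field in MULTI_VALUED:
            # An empty column becomes an empty list.
            record[field] = [v.strip() for v in raw.split("|") if v.strip()]
        else:
            record[field] = raw.strip()
    return record
```

Applied to the SGD protein complex example, `parse_gpi_line` would return three entries under `complex_members` and empty lists for the unused multi-valued columns.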

Definitions and requirements for GPI 2.0 field contents

1. DB:Object ID

  • A unique identifier for the entity being annotated, composed of two elements: a DB prefix identifying the database, which must be described in the GO dbxrefs.yaml file, and a DB Object ID, which is the alphanumeric identifier of the entity in that database. DB:Object ID is the combined identifier for the database object. Examples:
    • UniProtKB:P99999
    • SGD:S000002164
    • MGI:MGI:1919306
  • The identifier may reference the canonical form of a gene or gene product, including functional RNAs, as well as gene variants and distinct proteins produced by differential splicing, alternative translational starts, post-translational cleavage, or post-translational modification. If the identifier does not refer to a canonical gene or gene product, the corresponding canonical form must be referenced in Column 8 (Canonical object ID) of the GPI file. Note that not all DB:Object_IDs are necessarily in the same ID space: some groups use RNAcentral IDs for RNAs, some use ComplexPortal IDs for protein complexes, and some use Protein Ontology IDs for modified proteoforms. Together, the unique values in Column 8 correspond to the full set of gene products encoded by the organism, as well as the protein complexes represented in the source database.
  • Cardinality = 1
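Because the local part of an identifier can itself contain a colon (as in MGI:MGI:1919306), the prefix should be taken from the first colon only. A minimal sketch, with the hypothetical helper name `split_curie`:

```python
def split_curie(identifier):
    """Split 'DB:Object ID' at the first colon into (prefix, local_id)."""
    prefix, sep, local_id = identifier.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a two-part identifier: {identifier!r}")
    return prefix, local_id
```

For example, `split_curie("MGI:MGI:1919306")` yields the prefix `MGI` and the local ID `MGI:1919306`. Checking the prefix against dbxrefs.yaml is left out of this sketch.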

2. Object Symbol

  • A name for the entity represented by the DB:Object_ID in Column 1. The Object Symbol should, wherever possible, be text that means something to a biologist (a gene symbol, for example). If the entity has no name, the DB Object ID can be used as the Object Symbol.

  • Cardinality = 1

3. Object Name

  • The name of the gene or gene product corresponding to the DB:Object_ID in Column 1. White spaces are allowed in this field.
  • Cardinality = 0 or 1

4. Object Synonym

  • Alternative names for the entity in DB:Object_ID in Column 1. These entries may be a gene symbol, clone ID, or any other label or identifier. Object synonyms are useful for searching.
  • Cardinality = 0, 1, > 1; for cardinality > 1, values must be pipe-separated.

5. Object Type

  • An ontology identifier describing the class of biological entity of the DB:Object_ID in Column 1. The identifier must be a value from the Protein Ontology (PR) for proteins, the Gene Ontology (GO) for protein complexes, or the Sequence Ontology (SO) for all other entities. The full list of entity types and their corresponding identifiers can be found in biological_entity_mapping.yaml. Allowed entity types:

  • protein (PR:000000001)
  • protein-containing complex (GO:0032991)
  • protein-coding gene (SO:0001217)
  • ncRNA (SO:0000655) or any SO child term
  • ncRNA-coding gene (SO:0001263) or any SO child term

  • Note on object types: This field should describe the type of biological object as defined by the contributing database. For example, WormBase identifiers represent genes, PomBase identifiers represent protein-coding genes, and SGD identifiers represent proteins.
  • GO does not allow ‘gene’ and ‘gene product’ as biological entity types, because these types do not differentiate between protein and ncRNA products.
  • Cardinality = 1
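A lookup of the entity types listed above might be sketched as follows. The function name `describe_object_type` and the `so_children` argument are assumptions made for illustration: deciding whether an SO term is a child of ncRNA or ncRNA-coding gene requires the Sequence Ontology itself, so this sketch expects the caller to supply that mapping rather than computing it.

```python
# Entity types named explicitly in this section.
ENTITY_TYPES = {
    "PR:000000001": "protein",
    "GO:0032991": "protein-containing complex",
    "SO:0001217": "protein-coding gene",
    "SO:0000655": "ncRNA",
    "SO:0001263": "ncRNA-coding gene",
}


def describe_object_type(type_id, so_children=None):
    """Return a label for an Object Type ID, or None if it is not recognised.

    so_children is a caller-supplied (hypothetical) mapping from SO child
    term IDs to the label of the allowed parent, e.g. {"SO:0000276": "ncRNA"}.
    """
    if type_id in ENTITY_TYPES:
        return ENTITY_TYPES[type_id]
    return (so_children or {}).get(type_id)
```

With `so_children={"SO:0000276": "ncRNA"}`, the miRNA term used in the ncRNA examples would be reported as an ncRNA, while an ID absent from both mappings returns `None`.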

6. Object Taxon

  • The NCBI taxon ID of the organism (species or strain) encoding the DB:Object_ID from Column 1, in the format NCBITaxon:numerical_identifier.
  • Cardinality = 1

7. Encoded by

  • For proteins and transcripts, Encoded by refers to the gene ID that encodes those entities, e.g. ENSG00000197153.
  • Cardinality = 0, 1, > 1
  • For cardinality > 1, values must be pipe-separated.

8. Canonical object ID

  • If the DB:Object_ID in Column 1 describes a gene, a gene-centric reference protein accession or a protein complex, then this column should repeat the DB:Object_ID from Column 1.
  • If the DB:Object_ID in Column 1 describes an entity derived from a gene product, such as a protein isoform, a modified protein, or a processed transcript (e.g. miRNA), then this column refers to the gene-centric ID of the Column 1 entry.
  • If the DB:Object_ID in Column 1 is of an entity type for which the gene set represented by the GPI has no IDs (for example, uncloned genes), then this column should repeat the DB:Object_ID from Column 1.
  • Cardinality = 1
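The canonical object ID is what allows isoforms and modified forms to be rolled up to a single gene-centric entry. A minimal sketch of that grouping, assuming the rows have already been reduced to (Column 1, Column 8) pairs; the function name `group_by_canonical` is illustrative:

```python
from collections import defaultdict


def group_by_canonical(pairs):
    """Group DB:Object_IDs (Column 1) under their canonical ID (Column 8)."""
    groups = defaultdict(list)
    for object_id, canonical_id in pairs:
        groups[canonical_id].append(object_id)
    return dict(groups)
```

Given the UniProt protein and isoform examples, both UniProtKB:Q4VCS5 and UniProtKB:Q4VCS5-1 would be collected under UniProtKB:Q4VCS5.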

9. Protein-Containing Complex Members

  • If the DB:Object_ID in Column 1 describes a protein-containing complex, this column can list the gene-centric canonical protein identifiers of the complex members.
  • Cardinality = 0, 1, > 1
  • For cardinality > 1, values must be pipe-separated.

10. Database cross-references (DB_Xrefs)

  • Identifiers for the object in DB:Object_ID found in other databases. Identifiers used must be standard 2-part global identifiers: a namespace and a value, separated by a colon, e.g. UniProtKB:Q60FP0.
  • This field is mandatory if the prefix in Column 1 is not one of: UniProtKB, RNAcentral, ComplexPortal. In that case, DB_Xrefs must include the corresponding UniProtKB, RNAcentral, or ComplexPortal ID, as appropriate for the Object Type (see Column 5).
  • Note that RNAcentral IDs contain the taxon ID of the entity appended after an underscore character (“_”): compare RNAcentral:URS00002F21DA (https://rnacentral.org/rna/URS00002F21DA; generic for several organisms) with RNAcentral:URS00002F21DA_7227 (specific to Drosophila melanogaster).
  • Additional cross-references, such as NCBI gene or protein IDs, HGNC IDs, etc., may also be included.
  • Cardinality = 0, 1, > 1
  • For cardinality > 1, values must be pipe-separated.
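The mandatory cross-reference rule can be approximated with a simple prefix check. This is a deliberately weaker sketch than the rule stated above: it only verifies that some UniProtKB, RNAcentral, or ComplexPortal cross-reference is present, and does not check that the prefix matches the Object Type. The names `CORE_PREFIXES`, `curie_prefix`, and `xrefs_ok` are illustrative, and prefixes are compared case-insensitively as an assumption.

```python
# Prefixes that need no further cross-reference (compared in lower case).
CORE_PREFIXES = {"uniprotkb", "rnacentral", "complexportal"}


def curie_prefix(identifier):
    """Return the lower-cased text before the first colon."""
    return identifier.split(":", 1)[0].lower()


def xrefs_ok(object_id, xrefs):
    """True if Column 1 uses a core prefix or any xref supplies one."""
    if curie_prefix(object_id) in CORE_PREFIXES:
        return True
    return any(curie_prefix(x) in CORE_PREFIXES for x in xrefs)
```

For the WormBase example, `xrefs_ok("WB:WBGene00000007", ["UniProtKB:Q22397"])` passes, whereas the same identifier with an empty cross-reference list would fail.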

11. Gene Product Properties

  • The Properties column can be filled with a pipe-separated list of values in the format “property_name=property_value”. There is a fixed vocabulary for the property names, as indicated in the table below:

| Property | Allowed usages per annotation | Value grammar | Example | Comment |
| --- | --- | --- | --- | --- |
| db-subset | 0 or 1 | Must be either TrEMBL or Swiss-Prot | db-subset=TrEMBL | The status of a UniProtKB accession with respect to curator review. |
| uniprot-proteome | 0 or 1 | ID | uniprot-proteome=UP000001940 | A unique UniProtKB identifier for the set of proteins that constitute an organism’s proteome. |
| go-annotation-complete | 0 or 1 | Date_Or_Date_Time in ISO 8601 format, i.e. YYYY-MM-DD | go-annotation-complete=2019-02-05 | Indicates the date on which a curator determined that the set of GO annotations for a given entity is complete with respect to GO annotation. Complete means that all information about a gene has been captured as a GO term, but not necessarily that all possible supporting evidence is annotated. |
| go-annotation-summary | 0 or 1 | Text | go-annotation-summary=Sterol binding protein with a role in intracellular sterol transport; localizes to mitochondria and the cortical ER | A textual gene or gene product description. |

  • Cardinality = 0, 1, > 1
  • For cardinality > 1, values must be pipe-separated.
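A minimal sketch of parsing the Properties column against the fixed vocabulary follows. The function name `parse_properties` and the error messages are illustrative assumptions; the sketch enforces the property names, the "0 or 1" usage limit, and the two allowed db-subset values, but does not validate dates or proteome IDs.

```python
# Property names from the fixed vocabulary in this section.
ALLOWED_PROPERTIES = {
    "db-subset", "uniprot-proteome",
    "go-annotation-complete", "go-annotation-summary",
}
DB_SUBSETS = {"TrEMBL", "Swiss-Prot"}


def parse_properties(column):
    """Parse a pipe-separated 'name=value' list into a dict."""
    properties = {}
    for item in column.split("|"):
        item = item.strip()
        if not item:
            continue  # empty column or stray separator
        name, sep, value = item.partition("=")
        name, value = name.strip(), value.strip()
        if not sep or name not in ALLOWED_PROPERTIES:
            raise ValueError(f"unknown or malformed property: {item!r}")
        if name in properties:
            raise ValueError(f"property used more than once: {name}")
        if name == "db-subset" and value not in DB_SUBSETS:
            raise ValueError(f"db-subset must be TrEMBL or Swiss-Prot: {value!r}")
        properties[name] = value
    return properties
```

An empty column yields an empty dict, and a property name outside the vocabulary raises `ValueError`.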