Gene Product Information (GPI) file description

The Gene Product Information (GPI) file represents every annotatable biological entity for an organism: protein-coding genes, non-coding RNA genes, protein isoforms (i.e., splice variants) and modified forms, such as cleaved forms or proteins modified by post-translational modifications. The entities should be non-redundant.

This file is used to normalize annotations to single genes, and to map different identifiers for the same entity across different resources.

GPI 2.0 file format

This page is a summary of the GPI 2.0 file format; for full technical details and changes from the previous format, GPI 1.2, see the Full GPI 2.0 Specification page.

File Header

Mandatory elements of the GPI 2.0 file header

!gpi-version: 2.0
!generated-by: database must be listed in dbxrefs.yaml
!date-generated: YYYY-MM-DD or YYYY-MM-DDTHH:MM

Other header elements, such as links to the submitter's project page, funding sources, ontology versions, etc., may be included as shown below.

! URL: e.g. http://www.yeastgenome.org/
! Project-release: e.g. WS275
! Funding: e.g. NHGRI
! Columns: file format written out
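
As a rough illustration of the header rules above, the following Python sketch collects `!key: value` lines and reports which mandatory elements are missing. The helper names (`parse_gpi_header`, `missing_mandatory`) and the example date are assumptions made for this illustration; they are not part of the GPI specification or of any GO tooling.

```python
def parse_gpi_header(lines):
    """Collect '!key: value' header lines into a dict (hypothetical helper)."""
    header = {}
    for line in lines:
        if not line.startswith("!"):
            break  # the header ends at the first data row
        key, sep, value = line[1:].partition(":")
        if sep:  # skip comment lines that have no "key: value" shape
            header[key.strip().lower()] = value.strip()
    return header


def missing_mandatory(header):
    """Return the mandatory GPI 2.0 header keys absent from a parsed header."""
    mandatory = ("gpi-version", "generated-by", "date-generated")
    return [key for key in mandatory if key not in header]


# Illustrative header; the date and database are example values only.
example = [
    "!gpi-version: 2.0",
    "!generated-by: SGD",
    "!date-generated: 2024-01-15",
    "! URL: http://www.yeastgenome.org/",
]
header = parse_gpi_header(example)
print(header["gpi-version"], missing_mandatory(header))
```

Splitting on the first colon only keeps URL values and `YYYY-MM-DDTHH:MM` timestamps intact. The sketch does not check that the `generated-by` value is listed in dbxrefs.yaml.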

GPI File Contents

The GPI 2.0 file comprises 11 tab-delimited fields. For fields that allow multiple values, the values should be separated by pipes (|).

Column	Content	Cardinality	Example 1: UniProt protein	Example 2: WormBase protein-coding gene	Example 3: UniProt isoform	Example 4: SGD protein complex	Example 5: PomBase protein complex	Example 6: UniProt human ncRNA	Example 7: FlyBase ncRNA
1	DB:Object ID	1	UniProtKB:Q4VCS5	WB:WBGene00000007	UniProtKB:Q4VCS5-1	SGD:S000217643	ComplexPortal:CPX-25785	RNAcentral:URS0000527F89_9606	FB:FBgn0262451
2	Object Symbol	1	AMOT	aat-6	AMOT	CBF1:MET4:MET28	SWI5-SWI2	URS0000527F89_9606	mir-ban
3	Object Name	0 or 1	Angiomotin	Amino acid transporter protein 6	Angiomotin	sulfur metabolism transcription factor complex	mating-type switching complex	Homo sapiens (human) hsa-miR-145-5p	bantam
4	Object_Synonym(s)	0 or >	KIAA1071	aat-6	T11F9.4
5	Object Type	1	PR:000000001	SO:0000704	PR:000000001	GO:0032991	GO:0032991	SO:0000276	SO:0000276
6	Object Taxon	1	NCBITaxon:9606	NCBITaxon:6239	NCBITaxon:9606	NCBITaxon:559292	NCBITaxon:4896	NCBITaxon:9606	NCBITaxon:7227
7	Encoded by	0 or >	HGNC:17810			HGNC:17810		HGNC:31532
8	Canonical object ID	1	UniProtKB:Q4VCS5	WB:WBGene00000007	UniProtKB:Q4VCS5	SGD:S000217643	ComplexPortal:CPX-25785	RNAcentral:URS0000527F89_9606	FB:FBgn0262451
9	Protein Complex Members	0 or >				SGD:S000003821 | SGD:S000001456 | SGD:S000005047	PomBase:SPAC1142.03c | PomBase:SPBC409.03
10	Cross-reference(s)	0 or >	NCBIGene:154796 | ENSEMBL:ENSG00000126016	UniProtKB:Q22397	NCBIGene:154796 | ENSEMBL:ENSG00000126016	ComplexPortal:CPX-1016		ENSG00000276365	RNAcentral:URS00002F21DA_7227
11	Gene Product Properties	0 or >	db-subset=Swiss-Prot
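
To make the column layout concrete, here is a minimal Python sketch that splits one data row into the 11 fields and breaks the multi-valued fields on pipes. The short column keys (`id`, `symbol`, ...) and the function name `parse_gpi_row` are assumptions chosen for this example, and trimming whitespace around pipes is a convenience of the sketch rather than a stated requirement.

```python
GPI_COLUMNS = [
    "id", "symbol", "name", "synonyms", "type", "taxon",
    "encoded_by", "canonical_id", "complex_members", "xrefs", "properties",
]
# Columns whose cardinality allows more than one pipe-separated value.
MULTI_VALUED = {"synonyms", "encoded_by", "complex_members", "xrefs", "properties"}


def parse_gpi_row(line):
    """Split one tab-delimited GPI 2.0 row into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GPI_COLUMNS):
        raise ValueError(f"expected {len(GPI_COLUMNS)} fields, got {len(fields)}")
    record = {}
    for column, value in zip(GPI_COLUMNS, fields):
        if column in MULTI_VALUED:
            # Empty fields become empty lists; whitespace around pipes is trimmed.
            record[column] = [v.strip() for v in value.split("|") if v.strip()]
        else:
            record[column] = value
    return record


# Row built from Example 1 (UniProt protein) in the table above.
row = "\t".join([
    "UniProtKB:Q4VCS5", "AMOT", "Angiomotin", "KIAA1071", "PR:000000001",
    "NCBITaxon:9606", "HGNC:17810", "UniProtKB:Q4VCS5", "",
    "NCBIGene:154796|ENSEMBL:ENSG00000126016", "db-subset=Swiss-Prot",
])
record = parse_gpi_row(row)
print(record["xrefs"])
```

A real parser would also skip header lines starting with `!` before passing rows to a function like this.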

Definitions and requirements for GPI 2.0 field contents

1. DB:Object ID

A unique identifier for the entity being annotated, composed of two elements: a DB prefix identifying the database, which must be described in the GO dbxrefs.yaml file, and a DB Object ID, which is the alphanumeric identifier of the entity in that database. The combined identifier for the database object is written DB:Object_ID. Examples:
- UniProtKB:P99999
- SGD:S000002164
- MGI:MGI:1919306
The identifier may reference the canonical form of a gene or gene product, including functional RNAs, as well as gene variants and distinct proteins produced by differential splicing, alternative translational starts, post-translational cleavage or post-translational modification. If the identifier is not a canonical gene or gene product identifier, the corresponding canonical form must be referenced in Column 8 (Canonical object ID) of the GPI file. Note that not all DB:Object_IDs are necessarily in the same ID space: some groups use RNAcentral IDs for RNAs, some use ComplexPortal IDs for protein complexes, and some use Protein Ontology IDs for modified proteoforms. Together, the unique values in Column 8 correspond to the full set of gene products encoded by the organism, as well as the protein complexes represented in the source database.
Cardinality = 1
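
Because some local identifiers themselves contain a colon (as in MGI:MGI:1919306), a parser should split on the first colon only. The sketch below shows one way to do that in Python; `split_object_id` is a hypothetical helper, and it does not check the prefix against dbxrefs.yaml.

```python
def split_object_id(curie):
    """Split a DB:Object_ID into (prefix, local_id) at the first colon."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a two-part identifier: {curie!r}")
    return prefix, local_id


for example in ("UniProtKB:P99999", "SGD:S000002164", "MGI:MGI:1919306"):
    print(split_object_id(example))
```

For the MGI example, this yields the prefix `MGI` and the local identifier `MGI:1919306`.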

2. Object Symbol

A name for the entity represented by the DB object ID. The DB Object Symbol field should be text that means something to a biologist wherever possible (a gene symbol, for example). If the entity has no name, the DB object ID can be used as a DB Object Symbol.

Cardinality = 1

3. Object Name

The name of the gene or gene product corresponding to the DB:Object_ID in Column 1. White spaces are allowed in this field.
Cardinality = 0 or 1

4. Object Synonym

Alternative names for the entity in DB:Object_ID in Column 1. These entries may be a gene symbol, clone ID, or any other label or identifier. Object synonyms are useful for searching.
Cardinality = 0, 1, > 1; for cardinality > 1, values must be pipe-separated.

5. Object Type

An ontology identifier describing the class of biological entity of the DB:Object_ID in Column 1. The ontology identifier must be a value from Protein Ontology (PR) for proteins, Gene Ontology (GO) for protein complexes, or Sequence Ontology (SO) for all other entities. The full list of entity types and their corresponding identifiers can be found in biological_entity_mapping.yaml. Allowed entity types:

- protein (PR:000000001)
- protein-containing complex (GO:0032991)
- protein-coding gene (SO:0001217)
- ncRNA (SO:0000655) or any SO child term
- ncRNA-coding gene (SO:0001263) or any SO child term
Note on object types: This field should describe the type of biological object as defined by the contributing database. For example, WormBase identifiers represent genes, PomBase identifiers represent protein-coding genes, and SGD identifiers represent proteins.
GO does not allow ‘gene’ and ‘gene product’ as biological entity types, because these terms do not distinguish between protein and ncRNA products.
Cardinality = 1

6. Object Taxon

The NCBI taxon ID of the organism (species or strain) encoding the DB:Object_ID from Column 1, in the format NCBITaxon:numerical_identifier.
Cardinality = 1

7. Encoded by

For proteins and transcripts, Encoded by refers to the gene ID that encodes those entities, e.g. ENSG00000197153.
Cardinality = 0, 1, > 1
For cardinality > 1, values must be pipe-separated.

8. Canonical object ID

If the DB:Object_ID in Column 1 describes a gene, a gene-centric reference protein accession or a protein complex, then this column should repeat the DB:Object_ID from Column 1.
If the DB:Object_ID in Column 1 describes an entity derived from a gene product such as a protein isoform, a modified protein or a processed transcript (e.g. miRNA), then this column refers to the gene-centric ID of the Column 1 entry.
If the DB:Object_ID in Column 1 describes a type of entity for which the organism’s gene set represented by the GPI has no IDs (for example, uncloned genes), then this column should repeat the DB:Object_ID from Column 1.
Cardinality = 1
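
One use of the canonical object ID is to normalize isoform-level or proteoform-level entities back to a single gene-centric entry. The Python sketch below groups Column 1 identifiers by their Column 8 value; `group_by_canonical` is a hypothetical helper, and the input pairs are taken from the example table on this page.

```python
from collections import defaultdict


def group_by_canonical(pairs):
    """Group (object_id, canonical_id) pairs by canonical ID, keeping input order."""
    groups = defaultdict(list)
    for object_id, canonical_id in pairs:
        groups[canonical_id].append(object_id)
    return dict(groups)


# (Column 1, Column 8) pairs from Examples 1, 3 and 2 above.
pairs = [
    ("UniProtKB:Q4VCS5", "UniProtKB:Q4VCS5"),    # canonical protein repeats its own ID
    ("UniProtKB:Q4VCS5-1", "UniProtKB:Q4VCS5"),  # isoform points to the canonical entry
    ("WB:WBGene00000007", "WB:WBGene00000007"),  # gene repeats its own ID
]
groups = group_by_canonical(pairs)
print(groups["UniProtKB:Q4VCS5"])
```

Annotations made to any identifier in a group can then be rolled up to the group's canonical ID.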

9. Protein-Containing Complex Members

If the DB:Object_ID in Column 1 describes a protein-containing complex, this column can list the gene-centric canonical protein identifiers of the complex members.
Cardinality = 0, 1, > 1
For cardinality > 1, values must be pipe-separated.

10. Database cross-references (DB_Xrefs)

Identifiers for the object in DB:Object_ID found in other databases. Identifiers used must be standard 2-part global identifiers: a namespace and a value, separated by a colon, e.g. UniProtKB:Q60FP0.
This field is mandatory if the prefix in Column 1 is not one of: UniProtKB, RNAcentral, ComplexPortal. In these cases, DB_Xrefs must include the corresponding UniProtKB, RNAcentral, or ComplexPortal ID, as appropriate according to the Object Type (see Column 5).
Note that RNAcentral IDs can carry the taxon ID of the entity appended after an underscore character (“_”): compare [RNAcentral:URS00002F21DA](https://rnacentral.org/rna/URS00002F21DA) (generic, shared by several organisms) with RNAcentral:URS00002F21DA_7227 (specific to Drosophila melanogaster).
Additional cross-references such as NCBI gene or protein IDs, HGNC, etc., may also be included.
Cardinality = 0, 1, > 1
For cardinality > 1, values must be pipe-separated.
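
The mandatory cross-reference rule can be approximated with a simple prefix check, sketched below in Python. This is a deliberately weaker test than the rule as written: it only confirms that some UniProtKB, RNAcentral, or ComplexPortal identifier is present, and does not verify that the prefix matches the Object Type in Column 5. The function names are assumptions for this illustration, and prefixes are compared case-insensitively as a tolerance choice of the sketch.

```python
CORE_PREFIXES = {"uniprotkb", "rnacentral", "complexportal"}


def prefix_of(curie):
    """Return the lower-cased namespace of a two-part identifier."""
    return curie.partition(":")[0].strip().lower()


def xrefs_satisfy_rule(object_id, xrefs):
    """Approximate check of the mandatory cross-reference rule for one row."""
    if prefix_of(object_id) in CORE_PREFIXES:
        return True  # cross-references are optional for these ID spaces
    return any(prefix_of(xref) in CORE_PREFIXES for xref in xrefs)


print(xrefs_satisfy_rule("WB:WBGene00000007", ["UniProtKB:Q22397"]))
print(xrefs_satisfy_rule("FB:FBgn0262451", []))
```

A stricter validator would map each Object Type to its expected namespace, which requires access to the ontology hierarchy and is outside the scope of this sketch.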

11. Gene Product Properties

The Properties column can be filled with a pipe-separated list of values in the format “property_name=property_value”. There is a fixed vocabulary for the property names, as indicated in the table below:

Property	Allowed usages per annotation	Value Grammar	Example	Comment
`db-subset`	0 or 1	Must be either `TrEMBL` or `Swiss-Prot`	db-subset=TrEMBL	The status of a UniProtKB accession with respect to curator review.
`uniprot-proteome`	0 or 1	ID	uniprot-proteome=UP000001940	A unique UniProtKB identifier for the set of proteins that constitute an organism’s proteome.
`go-annotation-complete`	0 or 1	Date_Or_Date_Time in ISO 8601 format, i.e. `YYYY-MM-DD`	go-annotation-complete=2019-02-05	Indicates the date on which a curator determined that the set of GO annotations for a given entity is complete. Complete means that all information about a gene has been captured as a GO term, but not necessarily that all possible supporting evidence is annotated.
`go-annotation-summary`	0 or 1	Text	go-annotation-summary=Sterol binding protein with a role in intracellular sterol transport; localizes to mitochondria and the cortical ER	A textual gene or gene product description.

Cardinality = 0, 1, > 1
For cardinality > 1, values must be pipe-separated.
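
The property rules in the table above can be sketched as a small Python parser that enforces the fixed vocabulary, the "0 or 1" usage limit, and the allowed `db-subset` values. `parse_properties` is a hypothetical helper; date and proteome ID formats are not checked here, and a summary text containing a pipe character would not be handled by this simple split.

```python
ALLOWED_PROPERTIES = {
    "db-subset", "uniprot-proteome",
    "go-annotation-complete", "go-annotation-summary",
}


def parse_properties(field):
    """Parse a Column 11 value into a dict, enforcing the fixed vocabulary."""
    properties = {}
    for item in field.split("|"):
        item = item.strip()
        if not item:
            continue  # an empty field yields no properties
        name, sep, value = item.partition("=")
        name, value = name.strip(), value.strip()
        if not sep:
            raise ValueError(f"missing '=' in property: {item!r}")
        if name not in ALLOWED_PROPERTIES:
            raise ValueError(f"unknown property name: {name!r}")
        if name in properties:
            raise ValueError(f"property used more than once: {name!r}")
        if name == "db-subset" and value not in {"TrEMBL", "Swiss-Prot"}:
            raise ValueError(f"invalid db-subset value: {value!r}")
        properties[name] = value
    return properties


parsed = parse_properties("db-subset=Swiss-Prot|uniprot-proteome=UP000001940")
print(parsed)
```

Splitting each item on the first `=` only means a summary text that itself contains an equals sign is kept whole.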