Metrabase
The Metabolism and Transport Database
User manual v2014.01

Contents
1 Introduction
2 Metrabase content
2.1 Activities (interactions)
2.2 Proteins
2.3 Compounds
2.4 Data sources
3 The transporter substrate dataset
4 Usage

1 Introduction

Metrabase is a cheminformatics and bioinformatics resource that contains manually curated structural, physicochemical and biological data related to small molecule transport and metabolism. Metrabase offers structured and easily accessible data on interactions between proteins and chemical compounds, providing not only actions and measured activities, but also chemical structural information, tissue expression data and the negative action types that are essential in modelling activity.

‘Easily accessible’ refers first and foremost to computational processing. Many freely available resources publish their data without an easy way to process it computationally (e.g. they offer an online search and browse facility, but no download).

We aim to construct a comprehensive, thoroughly annotated and easy-to-use resource of high-quality information on small molecule metabolism and transport. By covering biochemistry, pharmacology and toxicology in particular, we hope that diverse research communities will find Metrabase useful and valuable.

2 Metrabase content

Metrabase version 1.0 contains curated data related to human transport and metabolism of chemical compounds. Its primary content includes over 3000 small molecule substrates and modulators of transport proteins and, to a smaller extent, cytochrome P450 enzymes (CYPs).

Proteins       20 transporters and 13 CYPs   20 transporters   13 CYPs
Compounds      3438                          3307              212
Interactions   11649                         11143             506
References     1211                          1177              36

The major focus of Metrabase v1.0 is on transport proteins: specifically, on their interactions with small molecules that were experimentally found to be (or not to be) substrates.
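The transporter and CYP columns of the summary counts above do not simply add up to the combined column, because a compound or a reference can be recorded for both protein classes. The following is a minimal sketch of that arithmetic, assuming the combined column is the union of the other two. The function name and the dictionary layout are illustrative only and are not part of Metrabase.

```python
# Counts copied from the Metrabase v1.0 summary:
# (20 transporters and 13 CYPs, 20 transporters, 13 CYPs).
COUNTS = {
    "compounds": (3438, 3307, 212),
    "interactions": (11649, 11143, 506),
    "references": (1211, 1177, 36),
}


def overlap(total, transporters, cyps):
    """Number of records shared by both protein classes.

    Uses inclusion-exclusion and assumes (hypothetically) that
    `total` is the size of the union of the two subsets.
    """
    return transporters + cyps - total


if __name__ == "__main__":
    for name, (total, tp, cyp) in COUNTS.items():
        print(f"{name}: {overlap(total, tp, cyp)} shared")
```

Under that assumption, 81 compounds and 2 references are recorded for both transporters and CYPs, while the interaction counts add up exactly, as expected when each interaction record belongs to a single protein.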
[Figure: Metrabase 1.0 schema]

2.1 Activities (interactions)

The key information held in the ‘activities’ table of the database covers the interactions between proteins and chemical compounds, indicating the compound action type as either substrate, non-substrate, inducer, non-inducer, repressor, inhibitor, non-inhibitor, stimulator or binder (the ‘action_type’ field).

Action types
Protein activity (transport or catalysis): substrate; non-substrate
Compound activity (affecting protein activity/expression): inhibitor/repressor (negative modulators); stimulator/inducer (positive modulators); non-inhibitor/non-inducer (inactive compounds)

The action type was set to binder where a compound did not fall into any of these categories but was found to bind to the protein.

Key fields: cmpd_id – protein_id – ref_id – action_type – species
(In version 1.0, species = ‘human’ for all records, so this field can be omitted.)

• Compounds were categorised as substrates or non-substrates according to the results presented in the publication providing the data point; we carried out no further evaluation.
• Care must be taken with the inhibition records in their current state: depending on the measurement threshold (e.g. percentage inhibition), some of the inhibitors can be regarded as non-inhibitors and vice versa. A proper classification of compounds as either inhibitors or non-inhibitors is planned for subsequent releases of the database.
• Other ‘activities’ fields holding additional extracted data and annotations, such as assay descriptions, relevant experimental measurements, cell systems, compound concentrations and the substrates used in inhibition assays, may be only partially completed in this release. This is partly because most of the reviews do not include assay information.
• The ‘published_label’ field contains the chemical names, abbreviations or designations that publications use to label compounds.
This field has been completed for all records except those linked to the external datasets, and can be used to identify compounds easily in their respective publications.
• Activity types were mostly accepted as found in the publications and may therefore overlap. Consequently, we recommend selecting all the activity types relevant to a search.

2.2 Proteins

• The proteins contained in Metrabase are categorised as either transporters or enzymes (the ‘protein_type’ field) and are provided with the HUGO Gene Nomenclature Committee (HGNC) approved symbols and names (www.genenames.org) as well as UniProt IDs. Protein sequences for the indicated isoforms were included from UniProt (www.uniprot.org). Other fields include additional information, such as Gene, RefSeq and Ensembl IDs and TC (Transporter Classification) or EC (Enzyme Commission) numbers.
• Metrabase also contains information about protein expression levels across healthy human tissues. Part of this data is based on immunohistochemistry using tissue microarrays (gene, tissue, cell type, level, expression type and reliability) and comes from the normal_tissue.csv file of the Human Protein Atlas (HPA) v9.0 (www.proteinatlas.org). All other expression records contain data that was extracted from the literature.
• The levels of expression (mRNA and/or protein levels) for non-HPA records (i.e. where ‘ref_id’ is not null) are: expressed (if the level was not specified), none, none-low, low, low-medium, medium, medium-high and high.

2.3 Compounds

• The ‘compounds’ table holds 3562 records in total, of which 3438 are compounds with recorded interaction data (for transporters and enzymes combined). The remaining compounds are used in other tables, such as ‘cmpd_variants’, which holds stereoisomers, multi-component structures and different forms of a compound.
• Molecular structures are available in MDL molfile format and as absolute (unique and isomeric) SMILES strings (in Kekulé form).
They were mostly verified using the ChemSpider (www.chemspider.com) and/or SciFinder (www.cas.org/products/scifinder) databases. The standard InChI and InChIKey strings were generated using v1.04 of the InChI software (http://www.inchi-trust.org).
• The great majority of the compounds are small organic molecules; all other types (coordination complexes, inorganic compounds, metalloid-containing compounds, selenium-containing compounds and polymers) are listed in the ‘compound_types’ table. This table also contains the DrugBank types of drugs (approved, experimental, illicit, investigational, nutraceutical and withdrawn) taken from DrugBank v3.0 (www.drugbank.ca). It can easily be extended by annotating compounds further, for example as natural products including their subtypes (e.g. natural product: terpene: sesquiterpene).
• The ‘properties’ table contains selected molecular properties that were calculated or predicted using ChemAxon’s Calculator (cxcalc) v6.1.3 (www.chemaxon.com). Molecular mass is given for all structures; the remaining properties (constitutional descriptors: atom and bond counts, hydrogen bond donor and acceptor counts, ring count and rotatable bond count; log P and log D) are given only for the small organic single-component structures. Experimental properties are not currently provided, i.e. ‘properties.type’ = ‘c’ for all records (where ‘c’ stands for ‘calculated’). The multi-component structures can easily be identified using the ‘compounds.fragment_count’ field, and their single-component counterparts using the ‘cmpd_variants’ table.
• The ‘synonyms’ table contains chemical names of Metrabase compounds (systematic, semi-systematic, common, trade names, abbreviations, codes). One of the synonyms was selected as the main name (the ‘compounds.cmpd_name’ field) for each compound. Chemical names were obtained mostly from DrugBank (these might refer to compound variants as well) and SciFinder.
The systematic (IUPAC) names were computer generated using ChemAxon’s IUPAC Naming Plugin v6.1.3 (the ‘compounds.iupac_name’ field).
• The ‘cmpd_ids’ table contains external compound IDs. Most of the compounds have ChemSpider IDs (CSIDs); a CAS Registry Number (CASRN; CAS Registry Number is a Registered Trademark of the American Chemical Society) is provided only where no CSID was found. DrugBank IDs are also included where identified (especially for the approved drugs).
• The MBCD number is the compound identifier in Metrabase, e.g. mbcd0027084 (MBID for compounds).

cmpd_id: mbcd0027084 (CSID:14034)
cmpd_name: Ethidium bromide
smiles: [Br-].CC[N+]1=C(C2=CC=CC=C2)C2=CC(N)=CC=C2C2=CC=C(N)C=C12
std_inchi: 1S/C21H19N3.BrH/c1-2-24-20-13-16(23)9-11-18(20)17-10-8-15(22)12-19(17)21(24)14-6-4-3-5-7-14;/h3-13,23H,2,22H2,1H3;1H
std_inchikey: ZMMJGEGLRURXTF-UHFFFAOYSA-N
iupac_name: 3,8-diamino-5-ethyl-6-phenylphenanthridin-5-ium bromide
formula_dot: C21H20N3.Br
fragment_count: 2

2.4 Data sources

• The ‘datasources’ table contains the sources of data in the database, including information about the software that was used to calculate molecular properties. The ‘datasource_id’ and ‘datasource_version’ fields indicate the source of all Metrabase records where applicable.
• The ‘refs’ table contains the publications’ citation information (bibliographic fields) and links. Most of the references (91%) are original peer-reviewed research articles and 7% are reviews; the aim remains to link all Metrabase records to primary literature sources.
• PubMed IDs are provided where available, as well as DOIs (where no DOI was available, a URL is given instead in the ‘doi_url’ field). To resolve a DOI, prepend http://dx.doi.org/ to it, e.g. http://dx.doi.org/10.1021/ac0354342.

3 The transporter substrate dataset

We aim to provide a version of the transporter substrate dataset (MBTPsubDS) as a supplement to each Metrabase release.
Each MBTPsubDS version contains interactions between small molecules and transporters. It includes all the unique substrate and non-substrate records obtained from Metrabase, processed to facilitate human transporter data analysis and predictive modelling (by ‘unique’ we mean the unique (cmpd_id, protein_id, action_type) tuples).

MBTPsubDS1_0: based on Metrabase v1.0; all the interactions involving conflicting action types (where a compound was found to be both a substrate and a non-substrate of a single transporter) were excluded.
MBTPsubDS1_0a: some of the conflicting action types were resolved upon our evaluation of such records, and the corresponding compound-transporter pairs were added to MBTPsubDS1 where we judged that the compound could be considered either a substrate or a non-substrate.

4 Usage

Web interface
• Search by protein
• Search by compound
• Expression data
• Protein list
• Download

Local MySQL database
To load Metrabase from a dump file (metrabase1_0.sql), first create a database on your system and then load the dump file into it, for example:

    # tar -xzvf metrabase1_0.tar.gz
    # mysql -u username -p
    mysql> CREATE DATABASE metrabase;
    mysql> exit
    # mysql -u username -p metrabase < metrabase1_0.sql

MySQL Workbench can be used as a graphical interface for MySQL.

Credits

• Metrabase was developed by Lora Mak in collaboration with David Marcus, Andreas Bender and Robert C. Glen at the Unilever Centre for Molecular Sciences Informatics and Galina Yarova, Guus Duchateau and Werner Klaffke at Unilever, with much appreciated help from the following (at the time) 2nd and 3rd year undergraduate students of the University of Cambridge: Claire Dickson, Joseph Dixon, Ivan Lam, Richard Lewis, Callum Picken, Claudia Pop, Heyao Shi, Emma Stirk, Yasmin Surani, Paddy Szeto, Nathaniel Wand, Julian Willis and Jing Xiangyi.
• Metrabase's web interface was developed by Andrew Howlett at the Unilever Centre for Molecular Sciences Informatics.
Andrew also designed the Metrabase logo.
• Metrabase was realised and is being maintained in the Glen group.

Licensing

Metrabase is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/). However, with respect to the integrated data, such as the TP-Search, ChEMBL and Human Protein Atlas records that are distributed as part of Metrabase, the user is referred to each external data source regarding their respective licensing. This means that the integrated data retains the licensing of the original data sources. The TP-Search and ChEMBL records may have been modified and augmented, while the Human Protein Atlas records were included unmodified.

Attribution

We hope you find our database and the associated datasets useful. If you use them, please acknowledge:
Metrabase v1.0, University of Cambridge, http://www-metrabase.ch.cam.ac.uk

Metrabase - http://www-metrabase.ch.cam.ac.uk
Contact: [email protected]
Unilever Centre for Molecular Sciences Informatics
Department of Chemistry, University of Cambridge
Lensfield Road, Cambridge, CB2 1EW, UK

This document was prepared by Dr Lora Mak and reviewed by Prof Robert C. Glen.
© 2014 Metrabase Development Team, University of Cambridge. All rights reserved.