PLA Datums
Datums are a structured, formal representation of experimental findings suitable for computational reasoning and semantic query. The PL Datum Knowledge Base (DKB) contains over 35k datums related to the state of mammalian cells and cellular response to a variety of stimuli. This document provides a brief introduction to datums and describes how to search the DKB using the web-based interface http://light.csl.sri.com:3000. See the section Getting started with Datum Queries below for instructions.
Worked examples can be found here, with suggestions for further querying here.
Please note that this interface is a prototype designed primarily for computational biologists, with an emphasis on computational. An interface for experimental biologists is under development.
Table of Contents
- The origin of Datums
- What are Datums?
- Getting started with Datum Queries
- Datum Query Construction and Results retrieval
- Appendix
The origin of Datums
To develop a model concerning a particular aspect of cellular signaling from available literature, the first step is to gather all the available experimental evidence. To help organize gathered information, we have developed a system for recording experimental findings in a knowledge base. Individual entries are called datums. It is important that each datum capture objective information, rather than conclusions of the experimenter or curator. We want to be able to compute with datums in the knowledge base, to retrieve sets of datums satisfying possibly complex combinations of properties, and to make logical inferences based on datums and general biological information. We also want the knowledge base and its infrastructure to be generally useful for experimental biologists; thus datums should be expressed using readily understood concepts with generally agreed-upon meaning, e.g., assays, detection methods and cells. Furthermore, each datum should contain a manageable chunk of information, sufficient to unambiguously describe an experimental finding. It should also contain the source of the information so that the datum can be reviewed and additional details or context can be accessed.
What are Datums?
There are two main types of datum, state and change datums, corresponding to two basic types of biological experiments.
Overview
State datums describe the state of something in a defined system, often compared to the state of something else in the same system. An example of a state experiment is a comparison of the number of Egf receptors per cell on different cell lines. Protein interaction data is often produced by state experiments. If one protein can be co-precipitated by another protein from the same cell, they are considered to interact. State data is used in Pathway Logic to create initial system states that require a list of components in the system, their modifications, and their location. It is also used to deduce the location and modifications of a protein demonstrated to be active (i.e., capable of performing its molecular function).
Change datums describe the change in the state of something in a system in response to a stimulus. An experiment in which the Egf receptor is demonstrated to be unphosphorylated in a serum-starved cell and phosphorylated after that cell has been treated with Egf for 5 minutes is an example of a change experiment.
Element Details
The elements of a datum include its subject, the assay performed, the observation made, the experimental environment, the source of information, and zero or more extras (variants on experimental conditions). In addition, change datum elements include a treatment and observation times if available.
- Subject: The subject (mostly proteins, some genes) is what the experiment is about. Protein subjects carry attributes that provide additional information including the origin (for example expressed or endogenous), how it is identified (by antibody to the protein, by antibody to the tag on an expressed protein, or by some other sort of label), and whether or not it is an immunoprecipitate.
- Assay: The assay is the experimental method. For example, Prot-exp[WB] denotes an assay in which total protein expression in a cell is detected by Western Blot, and Surface-exp[FACS] denotes an assay in which protein expression on the surface of a cell is detected by Flow Cytometry. Some assays have assay-specific attributes, such as hook for binding assays and substrate for activity assays. A catalogue of assays and their attributes can be found in the curation notebook.
- Treatment: The treatment can be addition of a ligand (e.g., a peptide or chemical), expression of a protein, or a stress (such as UV light).
- Result: In a state type datum the result is either detectable or undetectable. In a change type datum the result has two parts: the change (increased, decreased, unchanged, or detectable but unchanged) and the time(s) after addition of a stimulus at which the result is observed. Times are treated as a separate datum element, and each time has an associated qualitative level of change indicated by zero or more "+"s following the time.
- Environment: The environment element of a datum describes the cell and medium in which the experiment is performed. Cell types can be cell lines or primary cells. Additional information can include mutations or deletions. The medium is described with minimal detail using defined abbreviations such as BMS for basal medium with serum. A catalogue of cell types and media that appear in datums can be found in the curation notebook.
- Source: The source is usually a PubMed ID together with a Figure or Table number. It could also be an unpublished laboratory result.
- Extras: Extras are used to record the effect of alterations in some component of an experiment. A typical example is the use of cells in which a protein has been knocked out. If the result in the knock-out cells is the same as that in the wild-type cells, then the protein is not required for the result to occur. Another example of an extra is an experiment where an expressed protein is replaced by a mutated form of the protein or just omitted. If the mutation or omission causes a change in the result achieved using the wild-type protein, then clues are provided about the function of the protein in the experiment.
The curation notebook contains descriptions of the assays and other datum components. Its glossary is a good place to look for definitions of abbreviations and unfamiliar terms.
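To make the element list above concrete, here is a minimal Python sketch of how a datum could be held in memory. It is not the DKB's actual storage format: the class names, field names, and all example values are illustrative assumptions, and the rule that a datum with a treatment counts as a change datum is a simplification of the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Subject:
    entity: str                     # protein or gene the experiment is about
    origin: str = "endogenous"      # e.g. "endogenous" or "expressed"
    handle: Optional[str] = None    # how it is identified, e.g. "Ab" or "tAb"


@dataclass
class Datum:
    subject: Subject
    assay: str                      # e.g. "Prot-exp[WB]"
    result: str                     # e.g. "detectable" or "increased"
    environment: str                # cell type plus medium
    source: str                     # usually a PubMed ID plus figure or table
    treatment: Optional[str] = None # only change datums have a treatment
    # Each entry pairs a time with its qualitative level (number of "+").
    times: List[Tuple[str, int]] = field(default_factory=list)
    extras: List[str] = field(default_factory=list)

    def kind(self) -> str:
        """Assumed rule: a datum with a treatment is a change datum."""
        return "change" if self.treatment is not None else "state"


# Placeholder values only; they do not correspond to real DKB entries.
state_example = Datum(
    subject=Subject("Egfr", handle="Ab"),
    assay="Surface-exp[FACS]",
    result="detectable",
    environment="<cell line> in BMS",
    source="<PubMed ID>, <figure>",
)
change_example = Datum(
    subject=Subject("Egfr", handle="phosAb"),
    assay="<phosphorylation assay>",
    result="increased",
    environment="<cell line> in BMS",
    source="<PubMed ID>, <figure>",
    treatment="Egf",
    times=[("5 min", 1)],
)
```

Keeping times and extras as lists mirrors the description above: a change datum may report several observation times, and a datum may carry zero or more extras.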
Getting started with Datum Queries
To access the datum query page, point your browser at http://light.csl.sri.com:3000. On your first visit you will need to establish a user name and password by clicking on the "Register" link. On future visits, click on the login link and enter your user name and password. This will take you to the saved queries page. (The system allows you to name and save queries, either to rerun them or to use them as a starting point for making different queries.)
The first time you access the Datum Query site there will be no queries saved, so you must create one. Click on "Make a new Query", type a name in the "Make a new query" text box, click on "create", and you are ready to go. If you have one or more saved queries, you can select one of those (or make a new one). From a specific query page you can always return to the query selection page by clicking on the "My queries" link in the upper left.
A newly created query is empty (equivalent to the true predicate), and thus the query page reports the full set of datums as results. The page for a saved query resumes in the state it was in when last accessed. Queries are specified by selecting predicates (corresponding to datum parts) and predicate attributes (fields of the part), then constraining the result set by choosing a verb and possibly typing additional text for matching. The results display can be customized by choosing the parts of each datum to be displayed in the Fields area.
Datum Query Construction and Results retrieval
A datum query defines a predicate on datums. The query result is the set of datums for which the predicate is true. It is presented as a list which shows, for each datum, the subject line and any element attributes selected for display (see below). Clicking on the "Expand" link for a datum shows the full datum; you can also expand or collapse all datums at once. You can export the full query results as plain text (using the datum natural language syntax), or the selected attributes in csv (comma separated values) form. The export appears in your browser, and you can save it to a file if you like. (CSV export can be imported into a spreadsheet program such as Excel.)
Constructing a Query
A query is a conjunction of basic predicates, each constraining the value of some datum element. The empty query selects all datums in the current DKB. To add a new predicate element to a query, click on the link "Add a predicate". You choose the datum element to constrain by clicking on the leftmost selection button and selecting the desired element, one of {subject, change, assay, treatment, environments, times, source, extra}. This will cause the next selection button to the right to list the attributes of the element that can be selected. The third selection button allows you to specify the verb/relation. On the far right there are "Add child" and "Remove" links. The "Add child" link adds a new attribute line for a datum element, thus keeping all the attributes for a given element together. The "Remove" link removes the attribute constraint on that line. If there is only one attribute, the element predicate is removed. You can start fresh by clicking on "Remove all predicates".
The possible verbs are matches/does not match, isa/is not a, and exists/does not exist. The first four have associated text boxes in which to enter a string.
- "Matches" does a case insensitive substring search for the string in the appropriate datum attribute. The empty string matches anything. "Does not match" is the negation of "Matches". In some cases there is a blank attribute line for a predicate. This allows you to search the entire element string.
- "Is a" checks whether the appropriate datum attribute belongs to the sort/class named in the text box. The sort name must be exact. "Is not a" is the negation of "Is a". (Some options for assays are listed below. Protein families are represented using sorts; eventually you will be given a list to choose from. Files documenting the sort and protein names are available on request.)
- "Exists" checks if the attribute exists. "Does not exist" is the negation of "exists".
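The verb semantics above can be sketched in a few lines of Python. This is an illustration of the described behaviour, not the query engine behind the web interface: the function names, the flat dictionary representation of a datum, and the contents of `SORTS` are assumptions made for the example. The sketch also assumes that a missing attribute behaves like the empty string under "matches", which the text does not specify.

```python
from typing import Callable, Dict, List, Set

Datum = Dict[str, str]               # attribute name -> value; absent if missing
Predicate = Callable[[Datum], bool]

# Hypothetical sort table for the example; real sort names come from the DKB.
SORTS: Dict[str, Set[str]] = {
    "BindingAssay": {"boundby", "colocwith", "copptby", "snaggedby"},
}


def matches(attr: str, text: str) -> Predicate:
    """Case-insensitive substring search; the empty string matches anything."""
    needle = text.lower()
    return lambda d: needle in d.get(attr, "").lower()


def isa(attr: str, sort: str) -> Predicate:
    """Membership in a named sort; the sort name must be exact."""
    members = SORTS.get(sort, set())
    return lambda d: d.get(attr) in members


def exists(attr: str) -> Predicate:
    """True when the datum has the attribute at all."""
    return lambda d: attr in d


def negate(pred: Predicate) -> Predicate:
    """Builds 'does not match', 'is not a', and 'does not exist'."""
    return lambda d: not pred(d)


def run_query(datums: List[Datum], predicates: List[Predicate]) -> List[Datum]:
    """A query is a conjunction: keep datums for which every predicate holds."""
    return [d for d in datums if all(p(d) for p in predicates)]
```

Because `all()` of an empty list is true, an empty predicate list returns every datum, which corresponds to the empty query selecting the whole DKB.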
Formatting Results
You can change how the results are reported using the "Fields" box. (Click on show to see the choices.) Each selected field will be printed on a separate line following a result datum. Selected fields are also the fields used to determine the csv export (see Exporting Results). Selecting an element name displays the full element string from the datum natural language form (except for treatment and extras, since the element is defined from multiple substrings). Selecting an element attribute displays that attribute. In the case of extras, the selected attributes of each extra of a datum are displayed (named extra1.attr, extra2.attr etc.).
Exporting Results
Search results can be exported as plain text using the "txt" link of "Export all results". Txt export prefixes each datum with the selected fields, one per line. Search results can be exported in csv format (for import into a spreadsheet application) using the "csv" link of "Export all results". Csv export has one column for each field, and one row for each datum in the result set. Either choice produces a text page which you can save from the browser window or copy and paste into a file. NB: If you want the exported page to show in a new tab, hold down the command key (control or shift key on Windows/Linux) while clicking the link.
Appendix
Datum Elements
The datum elements and searchable/displayable attributes are discussed below.
subject
subject.entity -- a protein, gene, possibly a lipid or other chemical
subject.origin -- one of
['endogenous', 'expressed', 'recombinant', 'purified', 'knockin']
subject.mods -- modifications -- matches searches in mods substring
subject.muts -- mutations -- matches searches in muts substring
subject.handle -- how the subject is identified
possibilities include Ab (antibody), phosAb (phospho Ab), tAb (tag Ab)
14C, 32P, 35S, 125I (radioactive handles)
assay
assay.type -- See appendix below for Isa and matches possibilities
assay.detection_method
assay.hooks -- only relevant for binding assays
assay.hook_handles -- like subject handles
assay.substrate -- only relevant for activation assays
(substrates and hooks are molecules, usually proteins)
change -- Controlled vocabulary:
[increased,decreased,unchanged,detectable,undetectable] ??un or not?
treatment
treatment.type -- one of "irt", "by", "itpo" via string match
treatment.entity -- looks for match in any treatment entity, should print all
treatment.origin
treatment.mods
treatment.muts
Same as Subject counterparts
-- applies to first or matched entity if more than one????
environment
environment.cells -- controlled vocabulary, to appear
environment.comment -- the comment string if any
environment.medium -- Controlled vocabulary, includes BMS BMLS BSS BMHIS
environment.cellmuts -- searches within mutation string
environment.cmut_entity -- a protein
environment.cmut_mods
environment.cmut_muts
Same as Subject counterparts
times -- a string
source
source.pmid -- the pubmed identifier
source.figs -- the figures/tables used
extra
extra.type -- Controlled vocabulary:
[repressed by, inhibited by, enhanced by, does not req, reqs, reversed by,
unaffected by, bkg inhibited by, partially ...]
extra.entity -- Same as Subject counterpart
extra.mode -- Controlled vocabulary: [addition, substitution, KO, stim, RNAI, ...]
The default verb for any attribute is `may exist', meaning initially it won't
constrain the search. If you change the verb for some attribute and get an empty
result this is likely because there is a conflict, for example asking for the
substrate of a binding assay or for the hooks of an activation assay.
To appear: links to CV lists
BProtein --- proteinops
Genes --- geneops
Chemical --- chemicalops
Stress --- stressops
Cells --- cellops
Assay Sorts and Types
AssayType -- isa
SimpleAssay
GXPAssay
REReporterAssay
SimpleModAssay
ModificationAssay
SimpleModAssay
BindingAssay
ActivationAssay
LocationAssay
AssayType -- matches
SimpleModAssay
upshift
dimerization
oligomerization
polymerization
GXPAssays
GDP-dissociation
GTP-association
GTP-hydrolysis
GTP-bdpd
GTP-percent
SimpleAssays
cbs-binding
Gal4-reporter
LexA-reporter
mRNA
promo-reporter
internalization
nuc-export-reporter
nuc-import
nuc-export
surface-exp
prot-exp
prot-stability
secretion
ModificationAssay
acetylation
phos
--- Sphos >> phos(SSite)
--- Tphos >> phos(TSite)
--- STphos >> phos(STSite)
--- Yphos >> phos(YSite)
ubiq
sumo
cleavage
ActivationAssay -- has substrate
IVKA
IVLKA
IVGefA
oligo-binding
BindingAssay -- has hooks
boundby
colocwith
copptby
snaggedby
LocationAssay
locatedin
infraction
boundto
REReporterAssays --- also a kind of activity (of TFs) detection
ARE-reporter
BRE-reporter
CAGA-reporter
DE-reporter
E2fRE-reporter
EgrRE-reporter
GAS-reporter
ISRE-reporter
Lef1RE-reporter
Nfkb-reporter
SBE-reporter
SrfRE-reporter
Stat3RE-reporter
TCF-reporter
TRE-reporter
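The relationship between assay sorts and assay type names listed above can be sketched as a lookup table. The table below copies only a subset of the listing, and the function names are hypothetical. The mapping of sorts to their special attributes follows the "has hooks" and "has substrate" notes, and exact, case-sensitive name comparison is an assumption.

```python
from typing import Dict, List, Optional

# Subset of the listing above: sort name -> assay type names.
ASSAY_SORTS: Dict[str, List[str]] = {
    "SimpleModAssay": ["upshift", "dimerization", "oligomerization", "polymerization"],
    "ActivationAssay": ["IVKA", "IVLKA", "IVGefA", "oligo-binding"],
    "BindingAssay": ["boundby", "colocwith", "copptby", "snaggedby"],
    "LocationAssay": ["locatedin", "infraction", "boundto"],
}

# Attributes that only make sense for one sort, per the notes in the listing.
SPECIAL_ATTRIBUTE: Dict[str, str] = {
    "BindingAssay": "assay.hooks",
    "ActivationAssay": "assay.substrate",
}


def sort_of(assay_type: str) -> Optional[str]:
    """Return the sort that lists this assay type, or None if it is not listed."""
    for sort, names in ASSAY_SORTS.items():
        if assay_type in names:
            return sort
    return None


def special_attribute(assay_type: str) -> Optional[str]:
    """Return the sort-specific attribute for this assay type, if it has one."""
    sort = sort_of(assay_type)
    if sort is None:
        return None
    return SPECIAL_ATTRIBUTE.get(sort)
```

A lookup like this also illustrates the conflict described earlier in the appendix: constraining assay.substrate on a binding assay such as copptby gives an empty result, because only activation assays carry a substrate.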