PLA Datums

Introduction

The Pathway Logic (PL) STM model (available here) is a network of protein interactions and modifications which are used by the cell to transmit signals from its environment to the nucleus. The purpose of the model is to conjoin and display the various ways the reactions can be connected in order to study the effects of removal/inhibition/mutation of proteins on downstream events. The reactions are derived from published experimental evidence and the problem of how to store and search the evidence became a project in itself. We eventually developed a "shorthand" language made up of modules that can be read by biologists, traced back to their source, and have enough structure to be interrogated computationally. We call those modules "datums".

The PL Datum Knowledge Base (DKB) has grown to over 39,000 datums related to the state of mammalian cells and their response to a variety of stimuli. In order to share the DKB with the scientific community, we have made a web-based interface that can be used to search for specific experimental results and answer questions related to protein interactions and modifications.

This document provides an introduction to datums and describes how to search the DKB using the web-based interface (http://stella.csl.sri.com/datum).

How to Read a Datum

A datum is a summary of an experimental finding curated from a published data paper (normally a paper indexed by Pubmed). Each datum represents one biological assay.

The Two Types of Datum

There are two main types of datum, state and change datums, corresponding to two basic types of biological experiments.

State datums describe the state of something in a defined system. An example of a state experiment is the demonstration of the existence of a protein in a particular cell line using a Western Blot. Protein interaction data is another example of the use of a state datum. If one protein can be co-precipitated by another protein from the same cell, they are considered to interact. State data is used by Pathway Logic to create initial system states -- a list of components, modifications, and locations that are then subjected to rules that define changes in the state in response to various stimuli.

The rules are derived from change datums which summarize the change in the state of something resulting from the addition of a stimulus to cells. An experiment in which the phosphorylation of the Egf receptor is increased after addition of Egf to a cell for 5 minutes is an example of a change datum.

Datum Structure

A datum can be parsed by a computer because it uses a structured syntax and a controlled vocabulary. It consists of at least three lines. The first line contains the protein(s) that are observed (Subject), the assay (Assay), the stimulus (Treatment), and a result (Change). The second line contains the cells and culture conditions that were used in the assay (Environment). The last line (Source) gives the PubMed ID (PMID) of the source article and the number of the figure containing the experimental result. There may be additional lines, called "extras", describing variants on experimental conditions.
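The line-by-line layout described above can be pictured as a small record type. The following is only a minimal sketch: the class name, the field names, and the sample values (including the PMID and figure, which are placeholders and not a real citation) are assumptions made for illustration and are not the DKB's actual storage format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Results that a state datum can carry, as listed in this document.
STATE_RESULTS = {"detectable", "undetectable"}


@dataclass
class Datum:
    """One curated experimental finding (all field names are illustrative)."""
    # First line: what was observed, how, under what stimulus, and the result.
    subject: str
    assay: str
    treatment: Optional[str]
    change: str
    # Second line: the cells and the medium used in the assay.
    cells: str
    medium: str
    # Last line: where the result was reported.
    pmid: str
    figure: str
    # Optional "extras" lines describing variants on the experimental conditions.
    extras: List[str] = field(default_factory=list)

    def is_change_datum(self) -> bool:
        """Assumption: any result outside the state vocabulary marks a change datum."""
        return self.change not in STATE_RESULTS


# Placeholder values only; this is not a datum taken from the DKB.
example = Datum(
    subject="Akt1",
    assay="phos[WB]",
    treatment="Egf",
    change="increased",
    cells="HELA",
    medium="BMS",
    pmid="00000000",
    figure="Fig-1",
)
print(example.is_change_datum())  # True
```

Keeping the extras in a list mirrors the statement that a datum has at least three lines and may have additional ones.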

The Major Elements

Subject: The subject (mostly proteins, some genes) is what the experiment is about. Protein subjects carry attributes that provide additional information including the origin (for example expressed or endogenous), how it is identified (by antibody to the protein, by antibody to the tag on an expressed protein, or by some other sort of label), and whether or not it is an immunoprecipitate.

Assay: The assay is the experimental method. For example Prot-exp[WB] denotes an assay in which total protein expression in a cell is detected by Western Blot, and Surface-exp[FACS] denotes an assay in which protein expression on the surface of a cell is detected by Flow Cytometry. Some assays have assay specific attributes such as hook for binding assay, and substrate for activity assay. A catalog of assays and their attributes can be found in the Assays section of the Datum Dictionary.

Treatment: The treatment can be addition of a ligand (e.g., a peptide or chemical), expression of a protein, or a stress (such as UV light).

Result: In a state type datum the result is either detectable or undetectable. In a change type datum the result has two parts -- the change (increased, decreased, unchanged, or detectable but unchanged) and the time(s) after addition of a stimulus at which the result is observed. If multiple time points are used, they are recorded as a separate datum element, and each time has an associated qualitative level of change, indicated by zero or more "+" symbols following the time.

Environment: The environment of a datum describes the cell and medium in which the experiment is performed. Cell types can be cell lines or primary cells. Additional information can include exogenous proteins that have been over-expressed in the cells or endogenous proteins that have been removed by knockout or RNA interference technologies. The medium is described with minimal detail using defined abbreviations such as "BMS" for basal medium with serum.

Extras: Extras are used to record the effect of an alteration to some component of an experiment. The most common use of an extra is the observation of the effect of a small molecule on a result. A typical example is the inhibition of the phosphorylation of Akt1 in response to Egf by Wortmannin. Other uses include the effect of depletion of endogenous protein by inhibitory RNAs or the replacement of an expressed protein by a mutated form.

Source: The source is usually a PubMed ID together with a Figure or Table number. It could also be an unpublished laboratory result.

The Minor Elements

In addition to the major elements, datums contain clues about experimental details that are important to the interpretation of the results. The example displayed below says that the coprecipitation of expressed Mek1 by endogenous Erk2 is increased when constitutively active ("CA") Rac1 is also expressed.

The information that Mek1 is expressed is supplied by the prefix "x" before the name of the protein. The handle "tAb" says that the antibody used to detect Mek1 is against a tag on the expressed protein rather than the protein itself. Endogenous proteins do not have prefixes. The assay is named "copptby" and "WB" (western blot) is the detection method. The detail that the expressed form of Rac1 is constitutively active is in quotes to show that it is a comment made by the authors but not demonstrated in the paper. The mutation "Q61L" is information that allows us to compare the use of the construct with other experiments that use the same construct.

The second line supplies the name of the cells used in the experiment, any attributes, and the medium used during the experiment. In the example above the cells are mEFs (mouse embryo fibroblasts) from Lkb1 knockout mice (Lkb1~null) which have been reconstituted with exogenous Lkb1 (xLkb1). The medium used in the experiment is "BMS" which stands for Basal Medium containing Serum.

Definitions of abbreviations used in datums can be found in the Glossary of the Datum Dictionary.

The Vocabulary

The key to making a datum readable by both humans and machines is the use of a controlled vocabulary. Each of the elements described above contains categories, which in turn contain lists of acceptable terms. The present set of datums was used to provide evidence for the PL STM model, so the vocabulary uses the same terms as the reactions in the model. Each component has only one PL name. Names are chosen to be familiar to cellular biologists but are stripped of punctuation and alphabets, such as Greek, that confound computers. We have provided a Datum Dictionary to aid in the interpretation of a datum. A rationale for the choice of names is below.

Proteins: A protein is defined as a gene product and is commonly given a name based on the HGNC gene symbol. Peptides, splice variants, families, composites are also included under the extended Protein classification. The Protein section of the Datum Dictionary links families to their members and composites to their subunits.

Genes: In Pathway Logic, gene names are the protein name followed by "-gene".

Chemicals: The only way to uniquely identify a small molecule is to draw a structure. Datums do not use structures because they are usually not provided by the source document. We have used whatever hints the curated papers supply to assign names to chemicals. Any information collected about the named chemical is provided in the Chemical section of the Datum Dictionary.

Assays: Assays have at least two parts - the assay name and a detection method. The assay name is an abbreviated version of the attribute measured, such as "phos" for phosphorylation or "IVKA" for in vitro kinase activity. The detection method is contained in square brackets following the assay name. The definition of these terms can be found in the Glossary of the Datum Dictionary.

Cells: Cell Line names have been simplified to make them more computer compatible. Punctuation and white spaces have been removed and only upper case letters and numbers are used. Primary Cells start with a lower case letter representing the source species (h, human; m, mouse; r, rat; b, bovine; c, chicken; s, sheep; rb, rabbit; x, xenopus).

Everything else: definitions of anything that is not included above can be found in the Glossary of the Datum Dictionary.
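The naming rules given above for cell lines and genes are simple enough to sketch in a few lines. This is a hedged illustration only: the function names are hypothetical, the species prefix for primary cells is not handled, and the Datum Dictionary remains the authority on the names actually used in the DKB.

```python
import re


def normalize_cell_line(name: str) -> str:
    """Remove punctuation and white space, then upper-case what is left."""
    return re.sub(r"[^A-Za-z0-9]", "", name).upper()


def gene_name(protein: str) -> str:
    """Build a gene name from a protein name by appending "-gene"."""
    return f"{protein}-gene"


print(normalize_cell_line("MCF-7"))     # MCF7
print(normalize_cell_line("HeLa"))      # HELA
print(normalize_cell_line("HEK 293T"))  # HEK293T
print(gene_name("Jun"))                 # Jun-gene
```

A rule like this only approximates curated names, so a name produced this way should still be checked against the Datum Dictionary before it is used in a search.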

A datum query is a pattern defining constraints on the different elements of a datum. Given a query, a search returns all datums that match the query (that is, satisfy all the constraints). The datums web search page helps the user formulate queries by providing a simple form for entering matching conditions for the elements of interest. Once the form is filled in, click on Search All to see all the matching datums or Search 100 to show only the first 100. Press reset to clear the form before starting a new search; otherwise leftover constraints may cause datums of interest to be missed.

In general, a selection list with square check boxes means you can check any number of the boxes, forming a disjunction of conditions. For example, any of the subject origin boxes may be checked. For convenience, checking none is the same as checking all. Round buttons indicate that only one option may be selected, and there is a default selection.

The results can be saved as a pdf file by using the print to pdf feature of your browser, or can be copied and pasted into a text editor or Word document. A future version will provide additional export options, including sorting of the results.
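The matching rule described above (every constrained element must match, checked boxes within one element are alternatives, and an element with nothing checked is unconstrained) can be sketched as follows. The dictionary layout, the function names, and the toy data are assumptions made for illustration; this is not the query engine behind the web page.

```python
from itertools import islice
from typing import Dict, Iterable, List, Optional, Set

DatumRecord = Dict[str, str]   # element name -> value (toy representation)
Query = Dict[str, Set[str]]    # element name -> set of accepted values


def matches(datum: DatumRecord, query: Query) -> bool:
    """A datum matches when it satisfies the constraint on every element."""
    for element, accepted in query.items():
        if not accepted:
            # Nothing checked for this element: same as checking everything.
            continue
        if datum.get(element) not in accepted:
            return False
    return True


def search(datums: Iterable[DatumRecord], query: Query,
           limit: Optional[int] = None) -> List[DatumRecord]:
    """Return all matching datums, or only the first `limit` of them."""
    hits = (d for d in datums if matches(d, query))
    return list(islice(hits, limit))


# Toy records, not taken from the DKB.
toy_dkb = [
    {"subject": "Jnk1", "origin": "endogenous", "assay": "IVKA"},
    {"subject": "Jnk1", "origin": "expressed", "assay": "phos"},
    {"subject": "Ikk1", "origin": "expressed", "assay": "IVKA"},
]
query = {"subject": {"Jnk1"}, "origin": set(), "assay": {"IVKA", "phos"}}

print(len(search(toy_dkb, query)))           # 2
print(len(search(toy_dkb, query, limit=1)))  # 1
```

In this sketch, calling `search` with no limit plays the role of Search All, and `limit=100` would play the role of Search 100.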

Search by Subject

In the Subject section, a Protein Lookup table is used to find the dictionary name for a protein. Type in the full or partial name or Uniprot ID of a protein and click on (lookup). A new window will open with a list of proteins that have that name or ID as a synonym. Click on the protein you are interested in and it will be entered into the subject window.

The list under the Subject window can be used to limit the search to endogenous, purified, expressed, recombinant, or knockin proteins. If none of the buttons are set then all types of proteins as well as their genes will be found.

Some assays such as coprecipitation involve two subjects. A second subject field is provided for cases when you wish to specify both proteins. The order does not matter - the search engine will find both Protein-1 coprecipitated by Protein-2 and Protein-2 coprecipitated by Protein-1.

The DKB uses the names of proteins followed by "-gene" for gene names. To use a gene as the subject, find the protein that the gene codes for in the lookup window and add "-gene".

Caveat. Currently you must have the PL name of the protein in the subject window before initiating the datum search, either by clicking or typing. The Uniprot ID will not work, nor will a synonym. If you get 0 datums found, this is one thing to check. Future versions of the query engine may be smarter.
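The order-independent matching of two subjects described above amounts to comparing unordered pairs. A minimal sketch, with a hypothetical function name and no claim about how the search engine implements it:

```python
from typing import Tuple


def pair_matches(datum_pair: Tuple[str, str], query_pair: Tuple[str, str]) -> bool:
    """Compare two subject pairs while ignoring the order within each pair."""
    return sorted(datum_pair) == sorted(query_pair)


# A datum recording Mek1 coprecipitated by Erk2 is found either way round.
print(pair_matches(("Mek1", "Erk2"), ("Erk2", "Mek1")))  # True
print(pair_matches(("Mek1", "Erk2"), ("Mek1", "Rac1")))  # False
```

Sorting rather than building a set keeps a pair such as ("Mek1", "Mek1") distinct from a single subject.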

Search by Source

To find all the datums from a given article enter the PMID into the window. Note that a PMID for any article in PubMed can be found here.

Search by Assay

The Assay section contains all the assays used by the datums in the DKB. There is a detailed account of each assay in the Assays section of the Datum Dictionary. Multiple assays can be chosen as possible matches. Some assays can be limited by using the buttons under the assay type.

Search by Change

Check any change(s) that apply. If no changes are chosen then any change will match, i.e. the change element is not constrained.

Search by Treatment

This section is used to limit the results to a particular treatment.

The items entered into the associated text boxes must be in the Datum Dictionary. Future versions of the search page will provide a selection list.

Search by Environment

This section is used to limit the results to either a cell free reaction or specified cells. The names of cell lines and primary cells can be found in the Datum Dictionary.

Search by Extra

The Extras section is used to limit the results to experiments that look at things that affect the outcome of reactions.

A protein or small molecule can be entered into the associated text boxes to limit the search. The items entered must be in the Datum Dictionary. Future versions of the search page will provide a selection list.

Sample Queries

Recall that one can search the DKB using the web based interface (http://stella.csl.sri.com/datum).

Q1: What substrates have been used in an in vitro kinase assay (IVKA) to show whether Jnk1 is activated?

In this query, the subject is Jnk1 and the assay is In Vitro Kinase Activity.

What to do:
Subject section: Type "Jnk1" into Subject text box
Assay section: Check "In Vitro Kinase Activity"
What you get:
The first lines of all the datums in which Jnk1 was used in an IVKA assay. The substrate can be found in brackets after "IVKA". The substrates that have been used are: Jun, Atf2, Elk1, and MBP (myelin basic protein).

Comment: Use of "Jnk1" as the subject limits the datums to those in which Jnk1 is the only member of the Jnk family measured. If you do not care which family member was measured, use "Jnk" instead; that will also return Jnk2, Jnk3, and Jnks (the symbol for any member of the Jnk family).

Note: You will need to press your browser's back button to return to the query page after being shown a query result. It is also wise to press the reset button before starting the next query.

Q2: What proteins have been shown to be directly phosphorylated by Mekk1?

In PL, "directly phosphorylated" is defined by an in vitro kinase assay that contains only the kinase, the substrate, and required cofactors in a cell free reaction.

What to do:
Check "Addition of Something to Cell Free Reaction" in the Treatment section and type "Mekk1" into the "Something specific" box.
What you get:
All the datums in which something (the subject) is phosphorylated by Mekk1 in vitro. Currently the list includes Ikba, Mek1, Erk2, Mkk4, and Stat3.

Q3: Is expressed Ikk1 constitutively active?

Some kinases are active when over-expressed in mammalian cells.

What to do:
Subject section: Type "Ikk1" into Subject text box and check "Expressed"
Assay section: Check "In Vitro Kinase Activity"
Treatment section: Check "No Treatment"
What you get:
The subject of the retrieved datums will be xIkk1 (expressed Ikk1) and the assay will be IVKA (in vitro kinase activity). The results show that exogenous Ikk1 has detectable kinase activity when expressed in HEK293, HEK293T, and HELA cells.

Q4: What proteins have been shown to be phosphorylated on tyrosine in response to Ngf? In what cells?

What to do:
Assay section: Check "Y-only" under "Phosphorylation"
Change section: Check "Increased"
Treatment section: Type "Ngf" into Something Specific text box under "Addition of Something to Cell Supernatants"
What you get:
The subject(s) of the retrieved datums will be proteins with increased phosphorylation on tyrosine in response to Ngf (Stat5s, Trka, Trks, Aps, Arms, Crk, Erk1, Frs2, Gab1, Pi3k, Plcg1, Sh2b1, Shc1). The cells used are in the second line of each datum.

Q5: What happens when I over-express Mekk1 in MCF7?

What to do:
Treatment/Protein Expression/Expressed Protein 1: Type "Mekk1"
Environment/Cells: Type "MCF7"
What you get:
The results include responses from reporter and phosphorylation assays caused by expression of wild-type or mutant Mekk1.

Q6: What proteins have been shown to be required for the activation of Jnk (Jnk1, Jnk2, Jnk3, or Jnks) kinase activity in response to IL1 (IL1a,IL1b, or IL1)?

What to do:
Subject/Subject: Type "Jnk"
Assay: Check "In Vitro Kinase Activity" (Protein or Lipid Substrate is preset as a default)
Treatment/Addition of Something to Cell Supernatants/Something Specific: Type "IL1"
Extras/Effect: Check "Requires"
What you get:
All datums using Jnks, Jnk1, Jnk2, or Jnk3 as the subject of an in vitro kinase assay from experiments in which the response of cells treated with IL1 is compared with the response from cells with various proteins removed. Removal of proteins is performed using knockouts, null mutations, omission of overexpressed proteins, or over-expression of dominant-negative proteins.

Q7: How long should I treat cells with IL1 to maximize Jnk phosphorylation?

What to do:
Subject/Subject: Type "Jnk"
Assay: Check "Phosphorylation"
Treatment/Addition of Something to Cell Supernatants/Something Specific: Type "IL1"
What you get:
All datums in which Jnk phosphorylation on any site was measured in response to IL1. Note that many authors do not say whether they use a site specific antibody or not.
Comments:
- The search could have been limited to phosphorylation at the TPY site by typing TPY into the Site text box under Phosphorylation.
- Treatment times are represented in two ways in datums. If only one time point was used, the treatment time can be found at the end of the first line. If a time course was performed, the time points are listed on the third line. The relative response is indicated by the number of + symbols after each time.
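Reading a time course to answer a question like Q7 means finding the time point with the most + symbols. The sketch below assumes a simplified, whitespace-separated notation in which each token is a time immediately followed by zero or more + symbols. That notation, the function names, and the sample values are assumptions made for illustration; the exact layout of a time-course line in a real datum may differ.

```python
import re
from typing import List, Tuple


def parse_time_course(text: str) -> List[Tuple[str, int]]:
    """Split tokens such as '15+++' into (time, number of + symbols)."""
    points = []
    for token in text.split():
        m = re.fullmatch(r"([^+]+)(\+*)", token)
        if m is None:
            raise ValueError(f"unrecognized time point: {token!r}")
        points.append((m.group(1), len(m.group(2))))
    return points


def peak_times(points: List[Tuple[str, int]]) -> List[str]:
    """Return the time(s) with the highest qualitative response."""
    best = max(level for _, level in points)
    return [time for time, level in points if level == best]


# Made-up time course; times are kept as text because units are not assumed.
course = parse_time_course("2 5+ 15+++ 30++ 60")
print(course)              # [('2', 0), ('5', 1), ('15', 3), ('30', 2), ('60', 0)]
print(peak_times(course))  # ['15']
```

Because the + levels are qualitative, a comparison like this is only meaningful within a single datum, not across datums from different papers.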

Q8: What stimuli turn on Jun-gene transcription?

What to do:
Subject/Subject: Type "Jun-gene"
Change: Check "Increased"
Treatment: Check "Addition of Something to Cell Supernatants"
What you get:
All the datums in which Jun-gene mRNA expression is increased. These datums will provide the stimuli, treatment time, and cell line used.

Q9: Do untreated HELA cells express endogenous Myc?

What to do:
Subject/Subject: Type "Myc" and check "Endogenous"
Environment/Cells: Type "HELA"
What you get:
All the datums that look at Myc protein expression in HELA cells.

Q11: Where is endogenous Atf2 located in untreated cells?

What to do:
Subject/Subject: Type "Atf2" and check "Endogenous"
Assay: Check "Location"
What you get:
All the datums in which the location of Atf2 was determined by immunohistochemistry (IHC).
Comment: Investigators also use cell fractionation followed by western blots to determine the location of proteins. These experiments can be included by also checking Assay/Fractionation.

Q12: What ligands cause Traf2 to be ubiquitinated?

What to do:
Subject/Subject: Type "Traf2"
Assay: Check "Ubiquitination"
Change: Check "Increased"
Treatment: Check "Addition of Something to Cell Supernatants"
What you get:
Datums that show that Traf2 is ubiquitinated in response to Tnf or anti-Cd40.
Comment: If you were interested in Traf2 ubiquitination in response to over-expression of certain proteins, you would check Protein Expression instead of Addition of Something to Cell Supernatants.

Q13: What proteins have been shown to coprecipitate with Trka?

What to do:
Subject/Subject: Type "Trka"
Assay: Check "Coprecipitation"
What you get:
Datums in which Trka was seen to coprecipitate with another protein.
Comment: There are other assays that look at the interaction between two proteins. Assay/Direct-Protein-Binding yields datums in which two recombinant or purified proteins are added together IVT. Assay/Snagged-By produces datums in which recombinant or purified proteins are added to a cell lysate then removed to see what they "snagged".

Future Work

The PL DKB is limited to published information about a small number of pathways related to intracellular signal transduction. It is our intention to expand the number of pathways in the future, which will require new kinds of datum (notably for T-cell and B-cell receptor signaling, GPCRs, and metabolic pathways). It has been suggested that a GUI for writing datums would be helpful to those interested in collecting datums.

Possible future improvements/extensions of the query page include: