Phenotyping Data Management System for IBDB V2
The Data Management System (DMS) is the IBDB component that manages germplasm characterization and evaluation data for genetic resources and crop improvement projects. DMS links these data to germplasm and pedigree information in the Genealogy Management System (GMS), to location information in the Location Management Module (LMM), and to information on genes, markers and alleles in the Genotyping Data Management System (GDMS), and it provides links to other specialized data sources. DMS also allows integration of data from different characterization and evaluation studies, thus permitting a broad range of queries across trials or types of variables.
The functions of the DMS are to:
- Store and manage documented and structured phenotyping data from germplasm characterization and evaluation studies as commonly conducted in crop improvement programs,
- Link data to specialized data sources such as GMS and soil and climate databases, and
- Facilitate queries, searches and data extraction across studies according to structured criteria for data selection as required by plant breeders.
All types of phenotyping data will be accommodated in DMS including raw observed data, derived data, means data and summary statistics. Data may have numeric values or character values or categorical class values. For example, observations on disease resistance or nutrient efficiency of a genotype can be numerical measurements, scored or calculated indices or character data.
As a general principle, any data that are routinely represented in field books, laboratory books or spreadsheets will be accommodated in DMS. It will handle data and documentation on the basis of individual phenotyping studies, which may contain data from many environments (trial instances) and from different sampling levels: study level, environment level, plot level and sample level. Integration across studies will be facilitated by the use of controlled vocabularies and ontologies managed by the Ontology Management System (OMS).
Structure of Phenotyping Data
In order to clarify the definition of entities in the DMS data model, we will consider data from a fictional split-plot field experiment. Although this type of experiment is not commonly used in crop improvement, it contains all the elements of the variety of experimental designs which are used, and so serves as a good motivational model. Data sets are typically arranged in columns as in Table 1. We call the columns VARIABLES since each row may record a different value for each variable. The first seven variables define the source and context of the data. We refer to such variables as LABELS. The last three variables contain measured data. We call these data variables VARIATES; there are usually several variates in a data set. In general we can think of LABELS as the variables whose values we know before we do the experiment, and of VARIATES as the ones we measure during the experiment.
Table 1. Serial Spreadsheet Representation for a Split-plot Experiment from Study S9801
A STUDY is the basic, reportable unit of research; it is synonymous with the notions of experiment, nursery or trial. Since DMS must deal with any of these, we will use the term study. A study may be characterized by a set of scientific objectives and testable hypotheses, resulting in the collection of one or more datasets similar to that in Table 1, or it may simply be a convenient package of research activities, such as all the replicated field evaluations for a breeding program in a particular year. A study always has some metadata associated with it, such as its name, the PI, the institute, IP status and so on. These are variables, or more precisely labels, each of which takes a single value that applies to the whole study. We call them STUDY LABELS, and they often appear as headers to tables such as Table 1 above.
The division of data into sets is usually motivated by convenience. For example, data collected at different sampling scales are most conveniently treated in different datasets. Similarly, data collected at different times or from different locations are also often treated as different datasets, although it is feasible and usually preferable to handle these divisions within a single dataset.
Each row in a dataset corresponds to an OBSERVATION UNIT of the study. Values of STUDY labels apply to all the observation units in a study (from any dataset in the study).
ANNOTATION OF VARIABLES
Variables are named and described freely by users, but consistently within each study. However, they are annotated by terms from three controlled vocabularies:
- The PROPERTY which describes the context of the sampling unit and experimental material, if the variable is a label, or the trait being measured if it is a variate,
- the METHOD which describes how the PROPERTY is applied or the protocol by which a variate is measured and
- the SCALE which describes the units in which the label levels or variate values are recorded.
These three controlled vocabularies are terms in the Crop Ontology, and together they define the variables in the database. Every variable in the database is annotated by a combination of PROPERTY, METHOD and SCALE terms, and every unique combination of these terms occurring in the database defines a STANDARD VARIABLE. STANDARD VARIABLES are given standard names and descriptions in the ontology, but are referred to locally (within a study) by local names and descriptions assigned by the researcher. STANDARD VARIABLES link data across studies, and sets of STANDARD VARIABLES, for example those with the same PROPERTY, or those with the same PROPERTY and METHOD or PROPERTY and SCALE, link data about the same property across studies. All values of a particular STANDARD VARIABLE should be of the same data type. At present three types are being considered: numeric, character and database IDs (links to records in other database modules, such as GIDs). Variables can also be categorical, in which case they can only take on values from a defined set of VALID VALUES. This range can be extended to cover other types such as binary, picture, link or other object.
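The rule that each unique PROPERTY, METHOD and SCALE combination defines one STANDARD VARIABLE can be sketched as a small in-memory registry. This is an illustration only, not the DMS schema: the class names, the second study S9902, the local names and the ontology terms used below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandardVariable:
    """One unique PROPERTY / METHOD / SCALE combination."""
    prop: str     # PROPERTY term, e.g. the trait being measured
    method: str   # METHOD term, e.g. the measurement protocol
    scale: str    # SCALE term, e.g. the unit of measurement


class VariableRegistry:
    """Maps local (study, name) pairs to shared standard variables."""

    def __init__(self):
        self._standard = {}  # (prop, method, scale) -> StandardVariable
        self._local = {}     # (study, local_name) -> StandardVariable

    def annotate(self, study, local_name, prop, method, scale):
        key = (prop, method, scale)
        # Reuse the existing standard variable if the combination is known.
        std = self._standard.setdefault(key, StandardVariable(*key))
        self._local[(study, local_name)] = std
        return std

    def same_property(self, prop):
        """Local variables, across studies, that share a PROPERTY term."""
        return sorted(k for k, v in self._local.items() if v.prop == prop)


# Hypothetical annotations: terms and the study S9902 are made up.
registry = VariableRegistry()
a = registry.annotate("S9801", "YIELD", "grain yield", "dry weight per plot", "kg/ha")
b = registry.annotate("S9902", "GY", "grain yield", "dry weight per plot", "kg/ha")
c = registry.annotate("S9902", "GY_T", "grain yield", "dry weight per plot", "t/ha")
```

Here `a` and `b` resolve to the same standard variable despite their different local names, while `c` is a different standard variable because its SCALE differs. All three remain comparable at the PROPERTY level through `same_property("grain yield")`.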
LABELS and LEVELS
LABELS are classifying variables in a study which take values from finite sets of discrete LEVELS. These levels document the source and context of the data by expressing the conditions under which the data were collected or derived. Examples are the names of treatments or design structures applying to the unit or units from which the data are recorded, or conditions such as the time and location of measurement. These LABELS are usually listed in columns in the data set as in Table 1. The Study Name will be treated in the data model as a LABEL with exactly one level. Hence, every study has at least one LABEL.
In phenotyping experiments we can identify four groups of labels which describe different parts of the study - STUDY labels, LOCATION (or environment) labels, ENTRY (or germplasm) labels, and FIELD TRIAL (or design) labels. In the example in Table 2, rows 1 to 8 and 11 to 16 describe STUDY labels, rows 17-19 describe LOCATION labels, rows 26-28 describe ENTRY labels, and rows 22 to 25 and 29 describe FIELD TRIAL labels. Labels listed in the CONDITION section have only one level or value for the particular data table annotated by the description sheet. Labels listed in the LABEL section have multiple levels and correspond to columns in the observation table (Table 1). Combinations of one level from each label define the observation units - rows of a spreadsheet as shown in Table 1.
Table 2. Description Sheet showing the annotation of variables for the data shown in Table 1
Whether a variable plays the role of a CONDITION or a LABEL depends on the scope of the data table. For example, if data were collected from several sites, then the complete data set including data from all sites would need a label column indicating the location from which the data for each row were collected, so SITE would be a LABEL. If you only show the part of the dataset coming from one location, then SITE is a CONDITION for that data table.
VARIATES AND VALUES
VARIATES are the variables which contain the data observed in the experiment - the phenotypic data. They usually appear as columns in the data table as YIELD, PHT and BLB in Table 1. Variates which have only one value pertaining to all observation units in the data table are called CONSTANTS.
Note, however, that the status of CONDITION and CONSTANT depends on the data shown. If the data in Table 1 were for two locations, then SITE would have to be represented as a LABEL in the data table (i.e. as a column), and similarly for PH.
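The scope dependence of these roles can be sketched with a small function that assigns roles from the rows actually shown. This is a minimal illustration under stated assumptions: the function name, the site names and all data values are hypothetical, and PH is assumed here to be a measured variable.

```python
def classify_columns(rows, labels, variates):
    """Assign each column a role based on the rows actually shown.

    rows     -- list of dicts, one per observation unit
    labels   -- names of columns known before the experiment
    variates -- names of columns measured during the experiment
    """
    roles = {}
    for name in list(labels) + list(variates):
        distinct = {row[name] for row in rows}
        single = len(distinct) == 1
        if name in labels:
            # A label with a single level in this table is a CONDITION.
            roles[name] = "CONDITION" if single else "LABEL"
        else:
            # A variate with a single value in this table is a CONSTANT.
            roles[name] = "CONSTANT" if single else "VARIATE"
    return roles


# Hypothetical rows from two sites (all values are made up).
rows = [
    {"SITE": "Site A", "FERT": 0, "PH": 6.5, "YIELD": 4.1},
    {"SITE": "Site A", "FERT": 60, "PH": 6.5, "YIELD": 5.3},
    {"SITE": "Site B", "FERT": 0, "PH": 5.8, "YIELD": 3.7},
    {"SITE": "Site B", "FERT": 60, "PH": 5.8, "YIELD": 4.9},
]
one_site = [row for row in rows if row["SITE"] == "Site A"]

roles_one_site = classify_columns(one_site, ["SITE", "FERT"], ["PH", "YIELD"])
roles_all_sites = classify_columns(rows, ["SITE", "FERT"], ["PH", "YIELD"])
```

With the single-site table, SITE comes out as a CONDITION and PH as a CONSTANT. With both sites included, the same two variables need their own columns.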
Data sources such as field objects or sampling units are identified by combinations of levels of design or sampling LABELS. In Table 1, the LABELS REP, MAINPLOT and SUBPLOT are all design LABELS, and combinations of one level from each identify physical sub-plots in the study. Other LABELS define the context of the data; in experiments these are called treatment LABELS. Combinations of one level from each treatment LABEL define the treatments which are applied to field objects. In Table 1, VARIETY and FERT are treatment LABELS.
Data values such as treatment means, as in Table 3, are associated with level combinations of treatment LABELS which do not correspond to field objects but which can be thought of as data sources. Both types of data sources, field objects and treatment combinations, are referred to as OBSERVATION UNITS.
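The idea that treatment means belong to level combinations of treatment LABELS, and not to physical plots, can be sketched as follows. Note the assumption: this sketch computes simple arithmetic means per treatment combination, which is a stand-in for the least squares means a statistical analysis would produce. The function name and all data values are hypothetical.

```python
from collections import defaultdict


def treatment_means(plots, treatment_labels, variate):
    """Average a variate over plots sharing the same treatment levels."""
    groups = defaultdict(list)
    for plot in plots:
        # The observation unit of a mean is a treatment level combination.
        key = tuple(plot[label] for label in treatment_labels)
        groups[key].append(plot[variate])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# Hypothetical plot data: two replicates of a 2 x 2 treatment structure.
plots = [
    {"REP": 1, "VARIETY": "V1", "FERT": 0, "YIELD": 4.0},
    {"REP": 2, "VARIETY": "V1", "FERT": 0, "YIELD": 5.0},
    {"REP": 1, "VARIETY": "V1", "FERT": 60, "YIELD": 6.0},
    {"REP": 2, "VARIETY": "V1", "FERT": 60, "YIELD": 7.0},
    {"REP": 1, "VARIETY": "V2", "FERT": 0, "YIELD": 3.0},
    {"REP": 2, "VARIETY": "V2", "FERT": 0, "YIELD": 4.0},
    {"REP": 1, "VARIETY": "V2", "FERT": 60, "YIELD": 5.0},
    {"REP": 2, "VARIETY": "V2", "FERT": 60, "YIELD": 6.0},
]
means = treatment_means(plots, ["VARIETY", "FERT"], "YIELD")
```

Eight physical plots yield four conceptual observation units, one per (VARIETY, FERT) combination. The design label REP no longer takes part in indexing them.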
Table 3. Least Squares treatment means for data in Table 1 - Study S9801.
Hence OBSERVATION UNITS are conceptually equivalent to rows in a serially structured spreadsheet. They are the real or conceptual data sources in a study, and they are annotated by distinct level combinations of one or more LABELS. Not all LABELS in a study need to be involved in this indexing for every OBSERVATION UNIT. However, STUDY LABELS, with their single levels, are involved in indexing every OBSERVATION UNIT in the study. Hence each OBSERVATION UNIT belongs to exactly one study. Every study has a STUDY UNIT, which is the single observation unit indexed by the level of the STUDY LABEL alone.
All observation units in a study which have the same labels form a DATASET. We can say that a DATASET is defined by the different level combinations of a subset of LABELS from the STUDY. The OBSERVATION UNITS of the PLOT DATA in Table 1 are indexed by PLOT, REP, MAINPLOT, SUBPLOT, ENTRY, VARIETY, GID and FERT. For the TREATMENT MEANS in Table 3, ENTRY and FERT define all the OBSERVATION UNITS, but we would like to carry over the other ENTRY labels, VARIETY and GID, as well as all the STUDY LABELS.
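The definition of a DATASET as the observation units that share the same set of labels can be sketched by grouping units on their label names. This is an illustration only: the dictionary layout, the function name and the data values are hypothetical, and the label sets are shortened for brevity.

```python
from collections import defaultdict


def split_into_datasets(observation_units):
    """Group observation units by the set of labels that index them."""
    datasets = defaultdict(list)
    for unit in observation_units:
        # frozenset over a dict collects its keys, i.e. the label names.
        datasets[frozenset(unit["labels"])].append(unit)
    return dict(datasets)


# Hypothetical observation units for study S9801 (values are made up).
units = [
    # The study unit, indexed by the study label alone.
    {"labels": {"STUDY": "S9801"}, "values": {}},
    # Plot data, indexed by design and treatment labels.
    {"labels": {"STUDY": "S9801", "REP": 1, "MAINPLOT": 1, "SUBPLOT": 1,
                "VARIETY": "V1", "FERT": 0}, "values": {"YIELD": 4.0}},
    {"labels": {"STUDY": "S9801", "REP": 1, "MAINPLOT": 1, "SUBPLOT": 2,
                "VARIETY": "V1", "FERT": 60}, "values": {"YIELD": 6.0}},
    # Treatment means, indexed by treatment labels only.
    {"labels": {"STUDY": "S9801", "VARIETY": "V1", "FERT": 0},
     "values": {"YIELD": 4.5}},
    {"labels": {"STUDY": "S9801", "VARIETY": "V1", "FERT": 60},
     "values": {"YIELD": 6.5}},
]
datasets = split_into_datasets(units)
```

The five units fall into three datasets: the study unit on its own, the plot data and the treatment means. Every unit carries the STUDY label, so each belongs to exactly one study.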
LOGICAL DATA MODEL FOR PHENOTYPING DATA FROM FIELD TRIALS
There are five key elements of phenotyping data from field trials which need to be captured in a logical data model. These are:
- The STUDY INFORMATION component records global contextual information about the experiment, such as who conducted it, when and why, and who owns the data. It also models the high-level structure of the experiment by describing the datasets that are part of the study, for example data collected about the trial environment(s), data collected on sub-samples, plot data, means and summary data. Each of these datasets is described in terms of the variables that occur in it: which ones are labels giving design and context, and which are variates containing observations made during the experiment. The actual values of these dataset variables are managed by other components of the model depending on their type.
- The TRIAL ENVIRONMENT component manages all data values describing the environments observed in the study including georeference information, place names, growing environments, and overall management practices (non-treatment factors). This component also links to the Location Management Module of the IBDB.
- The GERMPLASM ENTRY component manages all label values describing the germplasm entries in the experiment, including local and global identifiers, names, sources and roles (check or test lines) of the entries. It links to the Genealogy Management System (GMS) of the IBDB, where unique identification, global nomenclature, ownership and pedigree information are stored.
- The TRIAL DESIGN component manages the treatment and sampling design and structure of the datasets in the study. It enumerates all the observation units in the study and describes their treatment and sampling context in terms of levels of labels describing such features as replication, block and fertilizer treatment. The observation units inherit global information about the study, information about the location and information about the germplasm entries through links to the relevant components of the model.
- The OBSERVATION component manages the values of the variates for each dataset.
Figure 1. Key elements of the Logical Data Model for Phenotyping Data from Field Trials
These elements are sufficient for managing phenotyping data from any field experiment. However, a sixth component is required to facilitate integration of phenotyping data across studies. This is the Ontology Management System (OMS), which identifies comparable elements (labels, variates and values) across studies.
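The way the five study-level components fit together can be sketched with a few plain data classes. This is a rough illustration under stated assumptions, not the DMS schema: all class and field names, the site name, the GID value and the yield value are hypothetical, and the OMS is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:           # TRIAL ENVIRONMENT component
    env_id: int
    labels: dict             # e.g. site name, georeference


@dataclass
class Entry:                 # GERMPLASM ENTRY component
    entry_no: int
    gid: int                 # link to the germplasm record in GMS
    labels: dict


@dataclass
class ObservationUnit:       # TRIAL DESIGN component
    unit_id: int
    env_id: int              # link to the trial environment
    entry_no: int            # link to the germplasm entry
    design_labels: dict


@dataclass
class Study:                 # STUDY INFORMATION component
    name: str
    study_labels: dict
    environments: dict = field(default_factory=dict)
    entries: dict = field(default_factory=dict)
    units: dict = field(default_factory=dict)
    # OBSERVATION component: (unit_id, variate name) -> value
    observations: dict = field(default_factory=dict)

    def context(self, unit_id):
        """Collect every label an observation unit inherits via its links."""
        unit = self.units[unit_id]
        merged = dict(self.study_labels)
        merged.update(self.environments[unit.env_id].labels)
        merged.update(self.entries[unit.entry_no].labels)
        merged.update(unit.design_labels)
        return merged


# Hypothetical content for one plot of study S9801.
study = Study(name="S9801", study_labels={"STUDY": "S9801"})
study.environments[1] = Environment(1, {"SITE": "Site A"})
study.entries[1] = Entry(1, gid=1001,
                         labels={"ENTRY": 1, "VARIETY": "V1", "GID": 1001})
study.units[101] = ObservationUnit(
    101, env_id=1, entry_no=1,
    design_labels={"REP": 1, "MAINPLOT": 1, "SUBPLOT": 2, "FERT": 60})
study.observations[(101, "YIELD")] = 5.3
ctx = study.context(101)
```

The observation unit stores only its own design labels plus two links. Study, environment and entry labels are resolved through those links when the full context is needed, so they are not duplicated on every row.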