Application 2.2.1 Tool 2.12 - Data Exchange Protocol

IBWS Data Exchange Protocols
Any information management system requires data exchange protocols which ensure that data is exported from the system with all its associated metadata, and which facilitate import of data to the system when the metadata can be provided in specific formats.
The ICIS workbook served well in this capacity, but some elements of the format have changed with the new data model for IBDB, and its reliance on Excel limits its applicability to Windows environments. However, Excel is still the most widely used data capture and storage medium, so we should continue to have an Excel standard, but we should also work on an application-free standard such as XML.
IBWorkbook
The key difference between the ICIS data model and the IBDB data model is that the dynamic partitioning of the model into sets of labels called factors is now replaced with a partitioning into a fixed number of domain-specific 'roles': Study information, Dataset information, Trial environment information, Trial design information, Germplasm entry information, Observation variates and Categorical variates. Within some of these are a few special cases, such as Entry GID within Germplasm entry information. The full list of these roles is in Table 6 of the document Phenotyping Data Management System for IBDB V2.
A second difference which is relevant to the workbook is that data are no longer stored in numeric and character tables, but are now all stored as character strings which must be converted to the appropriate data type in the middleware. Hence the list of data types is more comprehensive than in the ICIS system. The valid data types for IBDB are listed in Table 5 of the document Phenotyping Data Management System for IBDB V2.
Finally, a problem with the ICIS Workbook was that the main headers on the description sheet applied to studies, whereas the workbook was used to record and transfer datasets.
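As an illustration of the conversion step described above, the following is a minimal sketch of how middleware might turn stored character strings into typed values. The function name convert_value is hypothetical, and the data type labels are the ones used in the XML example on this page; the authoritative list is Table 5 of the referenced document and may differ.

```python
from typing import Union


def convert_value(raw: str, datatype: str) -> Union[str, int, float]:
    """Convert a stored character string according to its data type label.

    The labels handled here are assumptions taken from the XML example on
    this page, not the full list in Table 5.
    """
    kind = datatype.strip().lower()
    if kind == "numeric dbid variable":
        # Database identifiers are whole numbers.
        return int(raw)
    if kind == "numeric variable":
        return float(raw)
    # Character and categorical values are kept as (trimmed) strings.
    return raw.strip()
```

Matching on the lower-cased label is a design choice made here so that "Character Variable" and "Character variable" are treated alike.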
With these issues in mind I propose a change in the Workbook format to the following:

The Headers in rows 1 to 4 still record the study to which the dataset belongs, but now also record the dataset name and description. We can consider whether we should add some more compulsory rows here, for example who owns the data and what is its IB status? Of course all these things can also be set up as Study or Trial CONDITIONS, so it may be best to keep things simple.
The columns headed PROPERTY, METHOD, SCALE, DATA TYPE and ROLE all contain cvterms from different controlled vocabularies which specify the corresponding metadata values for the variables in the dataset. DATA TYPE and ROLE have terms from the IBDB TERMS CV, which is an integral part of the data model. The first two columns still contain local names and descriptions, i.e. chosen by the user, and the names of the FACTORS and VARIATES appear as column headings on the observation sheet where the data are recorded.

CONDITIONS and CONSTANTS have single data values which apply to all sampling units in the dataset and these values appear under the heading VALUE.
The experimental layout, treatment structure and sampling design are described by the factors and in some cases the relationship between factors is nested. When this is the case the nesting factor is recorded under the heading NESTED IN. Similarly, the sampling level at which variates are observed is recorded under the heading SAMPLING LEVEL.
The ROWTAG column in the observation sheet is simply a column used to identify rows (of comments) in the observation sheet which should be ignored by any software process. (Any value in the ROWTAG column identifies the corresponding row as a comment).
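The ROWTAG rule can be sketched as follows. This is only an illustration: it assumes the observation sheet has been exported to CSV text with ROWTAG as one of the column headings, and the function name data_rows and the sample content are hypothetical.

```python
import csv
import io

# Hypothetical CSV export of an observation sheet; the second data line is a comment row.
SAMPLE = """ROWTAG,PLOT,REP,YIELD
,1,1,10.3
#,plot 2 was flooded,,
,2,1,12.7
"""


def data_rows(sheet_text: str) -> list:
    """Return the observation rows, skipping any row with a value in ROWTAG."""
    rows = []
    for row in csv.DictReader(io.StringIO(sheet_text)):
        tag = (row.get("ROWTAG") or "").strip()
        if tag:
            # Any value in ROWTAG marks the row as a comment to be ignored.
            continue
        row.pop("ROWTAG", None)
        rows.append(row)
    return rows
```

Calling data_rows(SAMPLE) keeps the two plot rows and drops the comment row.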



IBXMLWorkbook
The goal is to have an equivalent xml based data exchange protocol and a first attempt to describe one is as follows:
<?xml version="1.0"?>
<dataset>
<name>S9801-PLOT DATA</name>
<description>PLOT DATA FOR STUDY 1 OF 1998</description>
<condition role="Study Information" datatype="Character variable">
<name>PI</name>
<description>PRINCIPAL INVESTIGATOR</description>
<property>PERSON</property>
<method>ASSIGNED</method>
<scale>DBCV</scale>
<value>Arllet</value>
</condition>
<condition role="Study Information" datatype="Numeric DBID variable">
<name>PI ID</name>
<description>ID OF PRINCIPAL INVESTIGATOR</description>
<property>PERSON</property>
<method>ASSIGNED</method>
<scale>DBID</scale>
<value>1</value>
</condition>
<condition role="Trial environment information" datatype="Categorical variable">
<name>DESIGN</name>
<description>EXPERIMENTAL DESIGN</description>
<property>EXPERIMENTAL DESIGN</property>
<method>APPLIED</method>
<scale>DESIGN CODE</scale>
<value>SP</value>
</condition>
<condition role="Trial environment information" datatype="Numeric variable">
<name>PLOTSIZE</name>
<description>PLOT SIZE</description>
<property>HARVESTED PLOT</property>
<method>OBSERVED</method>
<scale>SQUARE METERS</scale>
<value>6.0</value>
</condition>
<factor role="Trial design information" datatype="Numeric variable">
<name>PLOT</name>
<description>PLOT NUMBER</description>
<property>PLOT NUMBER</property>
<method>ENUMERATED</method>
<scale>NUMBER</scale>
<nestedin></nestedin>
<value>1,2,3,4,5,6,7,8,9,10,11,12</value>
</factor>
<factor role="Trial design information" datatype="Numeric variable">
<name>REP</name>
<description>REPLICATION</description>
<property>REPLICATION</property>
<method>ENUMERATED</method>
<scale>NUMBER</scale>
<nestedin></nestedin>
<value>1,1,1,1,1,1,2,2,2,2,2,2</value>
</factor>
<factor role="Trial design information" datatype="Numeric variable">
<name>MAINPLOT</name>
<description>MAIN PLOT NUMBER</description>
<property>PLOT NUMBER</property>
<method>ENUMERATED</method>
<scale>NUMBER</scale>
<nestedin></nestedin>
<value>1,1,2,2,3,3,1,1,2,2,3,3</value>
</factor>
<factor role="Trial design information" datatype="Numeric variable">
<name>SUBPLOT</name>
<description>SUB-PLOT NUMBER</description>
<property>PLOT NUMBER</property>
<method>ENUMERATED</method>
<scale>NUMBER</scale>
<nestedin></nestedin>
<value>1,2,1,2,1,2,1,2,1,2,1,2</value>
</factor>
<factor role="Entry Designation" datatype="Character variable">
<name>VARIETY</name>
<description>VARIETY NAME</description>
<property>GERMPLASM ID</property>
<method>ASSIGNED</method>
<scale>DBCV</scale>
<nestedin></nestedin>
<value>B,B,C,C,A,A,B,B,A,A,C,C</value>
</factor>
<factor role="Entry GID" datatype="Numeric DBID variable">
<name>GID</name>
<description>GERMPLASM ID</description>
<property>GERMPLASM ID</property>
<method>ASSIGNED</method>
<scale>DBID</scale>
<nestedin></nestedin>
<value>100,100,102,102,105,105,100,100,105,105,102,102</value>
</factor>
<factor role="Trial design information" datatype="Numeric variable">
<name>FERT</name>
<description>FERTILIZER LEVEL</description>
<property>FERTILIZER</property>
<method>APPLIED</method>
<scale>kg/ha</scale>
<nestedin></nestedin>
<value>100,200,200,100,100,200,200,100,100,200,200,100</value>
</factor>
<constant role="Observational variate" datatype="Numeric variable">
<name>PH</name>
<description>SITE PH</description>
<property>SOIL PH</property>
<method>PH METER</method>
<scale>PH</scale>
<value>6.3</value>
</constant>
<variate role="Observational variate" datatype="Numeric variable">
<name>YIELD</name>
<description>GRAIN YIELD</description>
<property>GRAIN YIELD</property>
<method>PADDY RICE</method>
<scale>kg/ha</scale>
<samplelevel>PLOT</samplelevel>
<value>10.3,12.7,18.7,13.7,12.6,16.7,19.2,12.3,17.1,14.1,16.3,12.2</value>
</variate>
<variate role="Observational variate" datatype="Numeric variable">
<name>PHT</name>
<description>PLANT HEIGHT</description>
<property>PLANT HEIGHT</property>
<method>At Maturity (Stages 7-9)</method>
<scale>cm</scale>
<samplelevel>PLOT</samplelevel>
<value>80,85,103,88,79,102,87,90,92,102,100,98</value>
</variate>
<variate role="Categorical variate" datatype="Categorical variable">
<name>BLB</name>
<description>BLB RESISTANCE</description>
<property>BLB RESISTANCE</property>
<method>Visual assessment of percent affected leaf area at growth stage 3</method>
<scale>SES score (1-9)</scale>
<samplelevel>PLOT</samplelevel>
<value>3,3,5,4,2,1,4,3,1,2,6,5</value>
</variate>
</dataset>
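To show how an application might consume a document of this shape, here is a minimal sketch using Python's standard xml.etree.ElementTree module. It reads a shortened sample with the same element names as above and turns the comma-separated factor and variate values into one record per sampling unit. The function name observation_rows and the shortened sample are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample with the same element names as the proposal above.
SAMPLE = """<?xml version="1.0"?>
<dataset>
<name>S9801-PLOT DATA</name>
<condition role="Trial environment information" datatype="Numeric variable">
<name>PLOTSIZE</name>
<value>6.0</value>
</condition>
<factor role="Trial design information" datatype="Numeric variable">
<name>PLOT</name>
<nestedin></nestedin>
<value>1,2,3</value>
</factor>
<variate role="Observational variate" datatype="Numeric variable">
<name>YIELD</name>
<samplelevel>PLOT</samplelevel>
<value>10.3,12.7,18.7</value>
</variate>
</dataset>
"""


def observation_rows(xml_text: str) -> list:
    """Build one dict per sampling unit from the factor and variate elements."""
    root = ET.fromstring(xml_text)
    columns = {}
    for element in root:
        if element.tag in ("factor", "variate"):
            values = (element.findtext("value") or "").split(",")
            columns[element.findtext("name")] = [v.strip() for v in values]
    # Every factor and variate should carry one value per sampling unit.
    if len({len(values) for values in columns.values()}) > 1:
        raise ValueError("factor and variate value lists differ in length")
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]
```

The values come back as strings, in line with the data model; conditions and constants are skipped here because they hold a single value for the whole dataset.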
Some questions regarding this schema are:

  1. I have left out STUDY and TITLE, although it would be useful to know the study to which a dataset should belong. Should the whole dataset element be rooted in a Study element? If this is done, can we still use the dataset element independently of the study element?
  2. I have defined attributes 'role' and 'datatype' which contain the structural terms from the similarly named columns on the excel description sheet. Is this correct?
  3. I have added a nestedin element to the factors even though it contains no values. I suppose elements can be omitted if they are missing or unknown, and indeed applications can ignore any elements they don't require. Hence some applications would only use the name and value elements for each variable – almost like a csv file.
  4. I have included the values as comma-separated lists in the value element within each variable element. I don't know if this is valid xml, and I don't know if it is desirable. Another approach to passing the actual data would be to have an observation element which has a full path to a csv file containing the data, using the names of the variables as in the name elements. It would be possible to use both methods and have applications dynamically discover which method of passing the data has been used. Any thoughts?
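On questions 3 and 4, comma-separated text inside an element is well-formed XML; a parser simply returns it as a string. The sketch below shows one way an application could discover which method of passing the data has been used. It is only an illustration: the observation element holding a csv path is the hypothetical form suggested in question 4, and the function name load_columns is made up for this example.

```python
import csv
import xml.etree.ElementTree as ET


def load_columns(xml_text: str) -> dict:
    """Return {variable name: list of string values} for factors and variates.

    If a (hypothetical) <observation> element holds a path, the values are
    read from that csv file; otherwise the inline <value> lists are used.
    """
    root = ET.fromstring(xml_text)
    variables = [el for el in root if el.tag in ("factor", "variate")]
    csv_path = (root.findtext("observation") or "").strip()
    if csv_path:
        # The csv column headings are expected to match the name elements.
        with open(csv_path, newline="") as handle:
            rows = list(csv.DictReader(handle))
        return {el.findtext("name"): [row[el.findtext("name")] for row in rows]
                for el in variables}
    columns = {}
    for el in variables:
        # findtext returns None for an omitted element, so missing values
        # are treated the same as empty ones.
        text = (el.findtext("value") or "").strip()
        columns[el.findtext("name")] = (
            [v.strip() for v in text.split(",")] if text else []
        )
    return columns
```

Because missing elements are read as empty, an application that only needs the name and value elements can ignore nestedin, samplelevel and the rest.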