The data dictionary


The data dictionary is an indispensable tool for navigating the data set, and it is the first thing that data users should download. The file can be found under the Study Info -> Data & Database section of the Study Files interface.

The sections below explain each field in the data dictionary table and then show how to use the dictionary in practice.

The Fields

PHASE

Within the data dictionary, this field gives the phase to which a particular definition applies.

This field is generally used in one of three different ways, depending on the table in question:

  • The field may be left blank. This is often the case with data that originates from external analyses using ADNI samples. In these cases, we can assume that definitions are consistent across all phases for which that table applies.

  • The field might contain multiple phases, in brackets, separated by commas. For instance: [ADNI1,GO,2,3] indicates that the definition applies to all phases except ADNI4. This convention is commonly seen in harmonized tables.

  • For tables that contain information from clinical CRFs, we generally expect to see one entry for each phase of the study in which that variable was recorded. In many cases, these dictionary entries will be identical, but there can also be differences in the exact definition of a variable between phases.
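As a rough sketch of how these three conventions might be handled in R, the helper below tests whether a dictionary entry applies to a given phase. The function name applies_to_phase is hypothetical. The code assumes that short labels inside brackets (such as GO or 2) abbreviate ADNIGO and ADNI2, and that a blank PHASE means the entry applies to every phase of the table.

```r
# Hypothetical helper: does a PHASE value cover the requested phase?
# Assumes bracketed lists abbreviate phase names, e.g. "2" for "ADNI2".
applies_to_phase <- function(phase_value, phase) {
  # Blank or missing PHASE: treat the definition as applying to all phases
  if (is.na(phase_value) || trimws(phase_value) == "") return(TRUE)
  # Drop the surrounding brackets (if any) and split on commas
  parts <- trimws(strsplit(gsub("\\[|\\]", "", phase_value), ",")[[1]])
  # Expand abbreviated labels: "GO" becomes "ADNIGO", "2" becomes "ADNI2"
  parts <- ifelse(startsWith(parts, "ADNI"), parts, paste0("ADNI", parts))
  phase %in% parts
}

applies_to_phase("[ADNI1,GO,2,3]", "ADNI2")  # TRUE
applies_to_phase("[ADNI1,GO,2,3]", "ADNI4")  # FALSE
applies_to_phase("", "ADNI4")                # TRUE
applies_to_phase("ADNI3", "ADNI3")           # TRUE
```

If the abbreviations in a particular table follow a different pattern, the expansion step is the part to adjust.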

CRFNAME

This is a short plain language description of the contents of a table.

TBLNAME

This field gives the file name for any given table. Tables downloaded from the IDA generally have a date appended to the file name; this field contains the file name without that date.

For example, the entries that describe the dictionary itself have TBLNAME “DATADIC”, which is also the name of the dictionary file.
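To match downloaded files against TBLNAME, the date suffix has to be removed first. The sketch below shows one way this might be done in R. The helper name tblname_from_file, the example file names, and the assumed suffix format (an underscore followed by day, three-letter month, and four-digit year) are all assumptions for illustration; check them against the actual downloads.

```r
# Hypothetical helper: recover TBLNAME from a downloaded file name.
# Assumes the date is a trailing "_<day><month abbreviation><year>" suffix;
# adjust the pattern if the downloads use a different format.
tblname_from_file <- function(path) {
  # Drop the directory and the file extension
  stem <- tools::file_path_sans_ext(basename(path))
  # Remove the assumed date suffix, if present
  sub("_[0-9]{1,2}[A-Za-z]{3}[0-9]{4}$", "", stem)
}

tblname_from_file("downloads/PTDEMOG_01Jan2024.csv")  # "PTDEMOG"
tblname_from_file("DATADIC.csv")                      # "DATADIC"
```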

FLDNAME

This is synonymous with ‘VARIABLE’, and lists the field that the dictionary entry applies to, exactly as it appears in its data set.

In combination with PHASE and TBLNAME above, this allows us to pinpoint the exact entry for any given variable in any given ADNI table.

For example, the data dictionary entries corresponding to participant race are found in the rows with PTDEMOG and PTRACCAT in the TBLNAME and FLDNAME fields, respectively.
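A minimal sketch of such a lookup in R follows. The data frame is a toy stand-in for the real dictionary: the field name EXAMPLEFLD and the TEXT values are placeholders, not actual dictionary content.

```r
# Toy stand-in for the data dictionary; TEXT values are placeholders
dictionary <- data.frame(
  PHASE   = c("ADNI1", "ADNI2", "ADNI2"),
  TBLNAME = c("PTDEMOG", "PTDEMOG", "PTDEMOG"),
  FLDNAME = c("PTRACCAT", "PTRACCAT", "EXAMPLEFLD"),
  TEXT    = c("placeholder A", "placeholder B", "placeholder C"),
  stringsAsFactors = FALSE
)

# Combine the three key fields to isolate a single dictionary entry
entry <- subset(
  dictionary,
  PHASE == "ADNI2" & TBLNAME == "PTDEMOG" & FLDNAME == "PTRACCAT"
)
entry$TEXT  # "placeholder B"
```

An exact match on PHASE only works for tables with one entry per phase. Blank or bracketed PHASE values need to be handled separately.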

TEXT

This field provides a brief text description of what the corresponding field contains.

TYPE

This field indicates the data type associated with the field, for example string, numeric, or date.

LENGTH

This gives the maximum length of any string inputs that appear in the field.

DD_CRF_VERSION

For tables that may go through multiple versions within the same phase, this notes the version of the table that the dictionary entries apply to.

CODE

For numerically coded ordinal and categorical data, this field provides the coding scheme.

For example, the participant race field (PTRACCAT) in the demographics table (PTDEMOG) is one such numerically coded variable, and for the ADNI2 phase the CODE field reads:

1=American Indian or Alaskan Native; 2=Asian; 3=Native Hawaiian or Other Pacific Islander; 4=Black or African American; 5=White; 6=More than one race; 7=Unknown

Note that this is not the same coding scheme used in other phases.
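Because the coding scheme is stored as a single string, it can be turned into a lookup table programmatically. The R sketch below parses the ADNI2 string quoted above. It assumes that pairs are separated by semicolons and that each pair has the form value=label; other CODE entries may be formatted differently. The raw vector is made-up example data.

```r
# CODE value for PTRACCAT in ADNI2, as quoted above
code <- paste0(
  "1=American Indian or Alaskan Native; 2=Asian; ",
  "3=Native Hawaiian or Other Pacific Islander; ",
  "4=Black or African American; 5=White; ",
  "6=More than one race; 7=Unknown"
)

# Split into "value=label" pairs, then separate each at the first "=" only
pairs  <- trimws(strsplit(code, ";", fixed = TRUE)[[1]])
values <- sub("=.*$", "", pairs)
labels <- sub("^[^=]*=", "", pairs)
lookup <- setNames(labels, values)

# Decode a hypothetical vector of raw PTRACCAT values
raw <- c(5, 2, 7)
unname(lookup[as.character(raw)])
# "White" "Asian" "Unknown"
```

Since the scheme can differ by phase, the lookup should be built from the dictionary row for the same phase as the data being decoded.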

UNITS

This field provides the units of measurement associated with the field when they are relevant; e.g. biofluid results where the concentration of a molecule may be expressed in pg/ml or some other unit of concentration.

STATUS

This is an important variable to keep in mind, particularly when fields appear to be missing from a table despite having entries in the data dictionary.

Some of the CRFs used in the study provide site staff with the option of entering free text comments to provide a detailed explanation of some clinical event - for example, the details of a hospitalization or clinical evaluation.

Because of the free text nature of these fields, there is some risk of personally identifiable information (PII) being entered by site staff. To prevent such a release of information, the majority of such fields are currently redacted pending further review. If that is the case, it will be noted here in the STATUS field of the dictionary.
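One way to apply this in practice is to list the fields that the dictionary documents but that are absent from a downloaded table, together with their STATUS. The R sketch below uses toy data throughout: the table name, the field names, and the STATUS text are hypothetical and do not reflect the wording used in the real dictionary.

```r
# Toy dictionary rows and a toy downloaded table; all values are illustrative
dictionary <- data.frame(
  TBLNAME = c("EXAMPLETBL", "EXAMPLETBL", "EXAMPLETBL"),
  FLDNAME = c("FIELD_A", "FIELD_B", "COMMENTS"),
  STATUS  = c("", "", "hypothetical redaction note"),
  stringsAsFactors = FALSE
)
example_table <- data.frame(FIELD_A = 1:2, FIELD_B = c("x", "y"))

# Fields documented in the dictionary but absent from the downloaded table
entries <- dictionary[dictionary$TBLNAME == "EXAMPLETBL", ]
missing <- entries[!entries$FLDNAME %in% names(example_table), ]
missing[, c("FLDNAME", "STATUS")]
```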

CODE_CHANGES

As mentioned in a previous section, there are ongoing efforts to harmonize data sets that are split across phases but contain consistent information. In some cases, aligning these tables requires changing the coding scheme of categorical variables in order to match.

Whenever such a change occurs, it is noted in this field.

Note that prior coding schemes are also preserved in the data dictionary, both here and in the MAPPING_NOTES field.

MAPPING_NOTES

This field is relevant for variables that have been mapped from their original values or coding schemes to an updated version at some point in the history of the study. This is a process that is sometimes carried out in the interest of improved interoperability between newer and older versions of tables.

Using the Data Dictionary

Suppose we have a set of tables selected for analysis and want to pull the relevant entries out of the data dictionary.

This is a relatively simple task, as we have the tables that we want to use and can simply filter the dictionary by matching the file name to the TBLNAME column. For example, in R, one could extract the data dictionary entries for the demographics table as follows:

# Keep only the dictionary rows that describe the demographics table
dictionary[dictionary$TBLNAME == "PTDEMOG", ]

To find the entry for a specific field or phase, the procedure is very similar:

  1. Identify the fields/tables of interest

  2. Subset out the relevant section of the data dictionary using TBLNAME and/or FLDNAME

  3. Ensure consistency of coding across phases
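The three steps above might look as follows in R. This is a sketch on toy data: the ADNI2 CODE string is shortened from the PTRACCAT example, and the ADNI1 CODE string is a made-up placeholder used only to show a mismatch.

```r
# Toy dictionary; the ADNI1 CODE is a placeholder, the ADNI2 CODE is shortened
dictionary <- data.frame(
  PHASE   = c("ADNI1", "ADNI2"),
  TBLNAME = c("PTDEMOG", "PTDEMOG"),
  FLDNAME = c("PTRACCAT", "PTRACCAT"),
  CODE    = c("1=Placeholder A; 2=Placeholder B",
              "1=American Indian or Alaskan Native; 2=Asian"),
  stringsAsFactors = FALSE
)

# Steps 1 and 2: subset to the table and field of interest
entries <- dictionary[dictionary$TBLNAME == "PTDEMOG" &
                      dictionary$FLDNAME == "PTRACCAT", ]

# Step 3: more than one distinct CODE string means the phases may need
# recoding before they are pooled
codes_differ <- length(unique(entries$CODE)) > 1
codes_differ  # TRUE for this toy data

# Group the phases by the coding scheme they use
split(entries$PHASE, entries$CODE)
```

Comparing CODE strings exactly is a conservative check: two entries that differ only in spacing or punctuation will be flagged as different and should be inspected by eye.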

Now suppose we do not have a set of tables selected for analysis and simply want to check whether a measure is present in any of the available tables.

In this case, it is possible to search the descriptions in the TEXT and CRFNAME fields with a pattern-matching tool, for instance R's grep function or its equivalent in other languages.
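A keyword search of this kind could be sketched in R as follows. The dictionary here is a toy data frame with placeholder table names and descriptions, and the search term is just an example.

```r
# Toy dictionary; table names and descriptions are illustrative placeholders
dictionary <- data.frame(
  TBLNAME = c("TBL_A", "TBL_B", "TBL_C"),
  CRFNAME = c("Demographics", "Vital Signs", "Medical History"),
  TEXT    = c("Participant education",
              "Blood pressure (systolic)",
              "History of high blood pressure"),
  stringsAsFactors = FALSE
)

# Case-insensitive search across both description fields
pattern <- "blood pressure"
hits <- grepl(pattern, dictionary$TEXT, ignore.case = TRUE) |
        grepl(pattern, dictionary$CRFNAME, ignore.case = TRUE)

# Tables that mention the measure
unique(dictionary$TBLNAME[hits])
# "TBL_B" "TBL_C"
```

grepl is used here rather than grep because it returns one logical value per row, which makes it easy to combine the matches from the two columns.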

From there, it is important to consult any relevant documentation for a full description of the measures available in those tables.