Skip to content

The Dictionary: A Self-Documenting Database Schema

Purpose

The dictionary (stored as src/open_dateaubase/dictionary.json) serves as a comprehensive metadata repository that defines every component of the datEAUbase data model. It acts as a single source of truth from which you can generate SQL schemas, documentation, and entity-relationship diagrams.

Key Principle: Each unique field concept gets exactly ONE entry in the dictionary, even if that field appears in multiple tables. The table_presence object indicates where each field appears and in what role.

The dictionary is self-referential: it contains the definitions needed to describe itself, making it bootstrapped and internally consistent. This is achieved through the use of Part_type and Member_of_set_part_ID fields, which define the structure and relationships within the dictionary itself.

Understanding the Structure

JSON Format

The dictionary uses a hierarchical JSON structure. Each part has only the metadata it needs (ie, blank properties are omitted):

{
  "parts": [
    {
      "Part_ID": "Contact_ID",
      "Label": "Contact ID",
      "Description": "Identifier for contacts",
      "Part_type": "key",
      "SQL_data_type": "int",
      "table_presence": {
        "contact": {
          "role": "key",
          "required": true,
          "order": 1
        },
        "project_has_contact": {
          "role": "compositeKeySecond",
          "required": false,
          "order": 2
        }
      }
    }
  ]
}

Core Fields

Every part has these core metadata fields:

  • Part_ID: Unique identifier for this field/table/value
  • Label: Human-readable name
  • Description: Detailed explanation of what this part represents
  • Part_type: Classification (table, key, property, compositeKeyFirst, compositeKeySecond, parentKey, valueSet, valueSetMember)
  • Value_set_part_ID: If this property is constrained by a value set, which set (optional)
  • Member_of_set_part_ID: If this is a value set member, which set it belongs to (required for valueSetMember)
  • Ancestor_part_ID: For parentKey type, the Part_ID of the ancestor being referenced (enables hierarchical relationships within the same table)
  • SQL_data_type: SQL data type (e.g., int, nvarchar(100), datetime) (optional)
  • Is_required: Whether this field is mandatory (NOT NULL) (optional)
  • Default_value: Default value for the field (optional)
  • Sort_order: Display order for documentation/UI (optional)

Table Presence Object

For fields (keys and properties), the table_presence object maps table names to metadata about how the field appears:

"table_presence": {
  "table_name": {
    "role": "key|property|compositeKeyFirst|compositeKeySecond",
    "required": true|false,
    "order": 1
  }
}

Roles:

  • key: This field is the primary key in this table
  • compositeKeyFirst: First part of a composite primary key
  • compositeKeySecond: Second part of a composite primary key
  • property: This field is a regular column in this table

Example: Equipment_ID has:

{
  "Part_ID": "Equipment_ID",
  "table_presence": {
    "equipment": {"role": "key", "required": true, "order": 1},
    "metadata": {"role": "property", "required": false, "order": 5},
    "project_has_equipment": {"role": "compositeKeySecond", "required": false, "order": 2}
  }
}

Reading the Dictionary

Find acceptable values for a field

To determine what values a field can accept, check if it references a valueSet:

from open_dateaubase.data_model.helpers import DictionaryManager
mgr = DictionaryManager.load("src/open_dateaubase/dictionary.json")

# Get the value set for a field
field = mgr._find_part("Site_type")
if field and hasattr(field, 'value_set_part_id'):
    value_set_id = field.value_set_part_id
    print(f"Field 'Site_type' uses value set: {value_set_id}")
else:
    print("Field 'Site_type' has no value set constraint")
Field 'Site_type' uses value set: None
from open_dateaubase.data_model.helpers import DictionaryManager
mgr = DictionaryManager.load("src/open_dateaubase/dictionary.json")

# Get all valid values for that set
members = mgr.get_value_set_members("Site_type")
for member in members:
    print(f"{member['Part_ID']}: {member['Label']} - {member['Description']}")

Get all columns in a table

To retrieve all columns that appear in a specific table:

from open_dateaubase.data_model.helpers import DictionaryManager
mgr = DictionaryManager.load("src/open_dateaubase/dictionary.json")

# Get all columns in the site table
columns = mgr.get_table_columns("site")
for col in columns:
    print(f"{col['Part_ID']}: {col['Label']} ({col['SQL_data_type']}) - {col['Role']} - Required: {col['Is_required']}")
Picture: Picture (image(2147483647)) - property - Required: False
Province: Province (nvarchar(255)) - property - Required: False
Site_ID: Site ID (int) - key - Required: False
Site_name: Site Name (nvarchar(100)) - property - Required: False
Site_type: Site Type (nvarchar(255)) - property - Required: False
Watershed_ID: Watershed ID (int) - property - Required: False
site_City: Site City (nvarchar(255)) - property - Required: False
site_Country: Site Country (nvarchar(255)) - property - Required: False
site_Description: Site Description (ntext(1073741823)) - property - Required: False
site_Street_name: Site Street Name (nvarchar(100)) - property - Required: False
site_Street_number: Site Street Number (nvarchar(100)) - property - Required: False
site_Zip_code: Site Zip Code (nvarchar(100)) - property - Required: False

Or to see what role each field plays:

from open_dateaubase.data_model.helpers import DictionaryManager
mgr = DictionaryManager.load("src/open_dateaubase/dictionary.json")

# Show role information for each field in the site table
columns = mgr.get_table_columns("site")
for col in columns:
    print(f"{col['Part_ID']}: {col['Label']} - Role: {col['Role']}")
Picture: Picture - Role: property
Province: Province - Role: property
Site_ID: Site ID - Role: key
Site_name: Site Name - Role: property
Site_type: Site Type - Role: property
Watershed_ID: Watershed ID - Role: property
site_City: Site City - Role: property
site_Country: Site Country - Role: property
site_Description: Site Description - Role: property
site_Street_name: Site Street Name - Role: property
site_Street_number: Site Street Number - Role: property
site_Zip_code: Site Zip Code - Role: property

Find which tables contain a specific field

To see all tables where Equipment_ID appears:

from open_dateaubase.data_model.helpers import DictionaryManager
mgr = DictionaryManager.load("src/open_dateaubase/dictionary.json")

# Find all tables where Equipment_ID appears
tables = mgr.get_field_tables("Equipment_ID")
for table_info in tables:
    print(f"Table: {table_info['Table_ID']}, Role: {table_info['Role']}, Required: {table_info['Required']}, Order: {table_info['Order']}")
Table: equipment, Role: key, Required: False, Order: 999
Table: metadata, Role: property, Required: False, Order: 999
Table: project_has_equipment, Role: compositeKeySecond, Required: False, Order: 999

Find all primary keys in the database

from open_dateaubase.data_model.helpers import DictionaryManager
mgr = DictionaryManager.load("src/open_dateaubase/dictionary.json")

# Get all primary keys
primary_keys = mgr.get_primary_keys()
for pk in primary_keys:
    print(f"{pk['Part_ID']}: {pk['Label']} ({pk['SQL_data_type']}) - Primary in: {pk['Primary_in_tables']}")
Comment_ID: Comment ID (int) - Primary in: ['comments']
Condition_ID: Condition ID (int) - Primary in: ['weather_condition']
Contact_ID: Contact ID (int) - Primary in: ['contact']
Equipment_ID: Equipment ID (int) - Primary in: ['equipment']
Equipment_model_ID: Equipment Model ID (int) - Primary in: ['equipment_model']
Metadata_ID: Metadata ID (int) - Primary in: ['metadata']
Procedure_ID: Procedure ID (int) - Primary in: ['procedures']
Project_ID: Project ID (int) - Primary in: ['project']
Purpose_ID: Purpose ID (int) - Primary in: ['purpose']
Sampling_point_ID: Sampling Point ID (int) - Primary in: ['sampling_points']
Site_ID: Site ID (int) - Primary in: ['site']
Unit_ID: Unit ID (int) - Primary in: ['unit']
Value_ID: Value ID (int) - Primary in: ['value']
Watershed_ID: Watershed ID (int) - Primary in: ['hydrological_characteristics', 'urban_characteristics', 'watershed']

Or to see just the key names:

from open_dateaubase.data_model.helpers import DictionaryManager
mgr = DictionaryManager.load("src/open_dateaubase/dictionary.json")

# Just the key names
primary_keys = mgr.get_primary_keys()
for pk in primary_keys:
    print(f"{pk['Part_ID']}: {pk['Label']}")
Comment_ID: Comment ID
Condition_ID: Condition ID
Contact_ID: Contact ID
Equipment_ID: Equipment ID
Equipment_model_ID: Equipment Model ID
Metadata_ID: Metadata ID
Procedure_ID: Procedure ID
Project_ID: Project ID
Purpose_ID: Purpose ID
Sampling_point_ID: Sampling Point ID
Site_ID: Site ID
Unit_ID: Unit ID
Value_ID: Value ID
Watershed_ID: Watershed ID

List all tables in the model

from open_dateaubase.data_model.helpers import DictionaryManager
mgr = DictionaryManager.load("src/open_dateaubase/dictionary.json")

# List all tables
tables = mgr.list_tables()
for table_id in tables:
    table = mgr._find_part(table_id)
    print(f"{table_id}: {table.label} - {table.description}")
comments: Comments - Stores any additional textual comments, notes, or observations related to a specific measured value
contact: Contact - Stores detailed personal and professional information for people involved in projects (e.g., name, affiliation, function, e-mail, phone)
equipment: Equipment - Stores information about a specific, physical piece of equipment (e.g., serial number, owner, purchase date, storage location)
equipment_model: Equipment Model - Stores detailed, non-redundant specifications for a specific sensor or instrument model (e.g., manufacturer, functions, method)
equipment_model_has_Parameter: Equipment Model Has Parameter - Links equipment models to the parameters they can measure
equipment_model_has_procedures: Equipment Model Has Procedures - Links equipment models to the relevant maintenance procedures
hydrological_characteristics: Hydrological Characteristics - Stores the hydrological land use percentages (e.g., forest, wetlands, cropland, grassland) within the watershed
metadata: Metadata - Contains a list of all existing unique metadata combinations (represented by a series of foreign keys/IDs) that describe a single measurement
parameter: Parameter - Stores the different water quality or quantity parameters that are measured (e.g., pH, TSS, N-components)
parameter_has_procedures: Parameter Has Procedures - Links parameters to the relevant measurement procedures
procedures: Procedures - Stores details for different measurement procedures (e.g., calibration, validation, standard operating procedures, ISO methods)
project: Project - Stores descriptive information about the research or monitoring project for which the data was collected
project_has_contact: Project Has Contact - Links projects to the personnel involved in them
project_has_equipment: Project Has Equipment - Links projects to the specific equipment used within them
project_has_sampling_points: Project Has Sampling Points - Links projects to the sampling points used within them
purpose: Purpose - Stores information about the aim of the measurement (e.g., on-line measurement, laboratory analysis, calibration, validation, cleaning)
sampling_points: Sampling Points - Stores the identification, specific geographical coordinates (Latitude/Longitude/GPS), and description of a particular spot where a sample or measurement is taken
site: Site - Stores general site information, including address, site type, and a link to the associated watershed
unit: Unit - Stores the SI units of measurement (or other relevant units) corresponding to the parameters (e.g., mg/L, g/L, s)
urban_characteristics: Urban Characteristics - Stores the urban land use percentages (e.g., commercial, residential, green spaces) within the watershed
value: Value - Stores each measured water quality or quantity value, its time stamp, replicate identification, and the link to its specific metadata set
watershed: Watershed - Stores general information about the watershed area, including surface area, concentration time, and impervious surface percentage
weather_condition: Weather Condition - Stores descriptive information about the prevailing weather conditions when the measurement was taken (e.g., dry weather, wet weather, snow melt)

Find fields that appear in multiple tables

from open_dateaubase.data_model.helpers import DictionaryManager
mgr = DictionaryManager.load("src/open_dateaubase/dictionary.json")

# Find fields that appear in multiple tables
shared_fields = mgr.get_shared_fields()
for field in shared_fields[:10]:  # Show first 10
    print(f"{field['Part_ID']}: {field['Label']} ({field['Part_type']}) - Used in {field['Table_count']} tables")
    for table in field['Tables']:
        print(f"  - {table['Table_ID']}: {table['Role']}")
    print()
Project_ID: Project ID (key) - Used in 5 tables
  - metadata: property
  - project: key
  - project_has_contact: compositeKeyFirst
  - project_has_equipment: compositeKeyFirst
  - project_has_sampling_points: compositeKeyFirst

Equipment_model_ID: Equipment Model ID (key) - Used in 4 tables
  - equipment: property
  - equipment_model: key
  - equipment_model_has_Parameter: compositeKeyFirst
  - equipment_model_has_procedures: compositeKeyFirst

Parameter_ID: Parameter ID (compositeKeySecond) - Used in 4 tables
  - equipment_model_has_Parameter: compositeKeySecond
  - metadata: property
  - parameter: key
  - parameter_has_procedures: compositeKeyFirst

Procedure_ID: Procedure ID (key) - Used in 4 tables
  - equipment_model_has_procedures: compositeKeySecond
  - metadata: property
  - parameter_has_procedures: compositeKeySecond
  - procedures: key

Watershed_ID: Watershed ID (key) - Used in 4 tables
  - hydrological_characteristics: key
  - site: property
  - urban_characteristics: key
  - watershed: key

Contact_ID: Contact ID (key) - Used in 3 tables
  - contact: key
  - metadata: property
  - project_has_contact: compositeKeySecond

Equipment_ID: Equipment ID (key) - Used in 3 tables
  - equipment: key
  - metadata: property
  - project_has_equipment: compositeKeySecond

Sampling_point_ID: Sampling Point ID (key) - Used in 3 tables
  - metadata: property
  - project_has_sampling_points: compositeKeySecond
  - sampling_points: key

Unit_ID: Unit ID (key) - Used in 3 tables
  - metadata: property
  - parameter: property
  - unit: key

Comment_ID: Comment ID (key) - Used in 2 tables
  - comments: key
  - value: property

Editing the Dictionary

The dictionary should be edited using the DictionaryManager helper class, which ensures validation and consistency.

Adding a New Value Set

from open_dateaubase.data_model.helpers import DictionaryManager

mgr = DictionaryManager.load("src/open_dateaubase/dictionary.json")

# Create a new value set
mgr.create_value_set("Status_set", "Status Values", "Valid status values for records")

# Add members to the set
mgr.add_value_set_member("Status_set", "active", "Active", "Record is currently active", order=1)
mgr.add_value_set_member("Status_set", "inactive", "Inactive", "Record is currently inactive", order=2)
mgr.add_value_set_member("Status_set", "pending", "Pending", "Record is pending review", order=3)

# Save the changes
mgr.save()

Adding a New Table

from open_dateaubase.data_model.helpers import DictionaryManager

mgr = DictionaryManager.load("src/open_dateaubase/dictionary.json")

# Create the table
mgr.create_table("observation", "Observation", "Environmental observation records")

# Add primary key
mgr.add_field_to_table(
    table_id="observation",
    field_id="Observation_ID",
    label="Observation ID",
    description="Primary key for observations",
    role="key",
    sql_data_type="int",
    required=True,
    order=1
)

# Add regular fields
mgr.add_field_to_table(
    table_id="observation",
    field_id="Observation_date",
    label="Observation Date",
    description="Date when observation was made",
    role="property",
    sql_data_type="datetime",
    required=True,
    order=2
)

mgr.add_field_to_table(
    table_id="observation",
    field_id="Value",
    label="Value",
    description="Observed value",
    role="property",
    sql_data_type="float",
    required=False,
    order=3
)

# Save the changes
mgr.save()

Adding Fields to an Existing Table

The add_field_to_table() method handles both new and existing fields automatically:

  • If the field already exists (like Site_ID used in multiple tables), it updates the field's table_presence to include this table
  • If the field doesn't exist, it creates a new field part
from open_dateaubase.data_model.helpers import DictionaryManager

mgr = DictionaryManager.load("src/open_dateaubase/dictionary.json")

# Add a foreign key (Site_ID likely already exists in the dictionary)
mgr.add_field_to_table(
    table_id="observation",
    field_id="Site_ID",
    label="Site ID",
    description="Foreign key to site",
    role="property",
    sql_data_type="int",
    required=True,
    order=4
)

# Add a new field with value set constraint
mgr.add_field_to_table(
    table_id="observation",
    field_id="Status",
    label="Status",
    description="Current status of observation",
    role="property",
    sql_data_type="nvarchar(50)",
    required=False,
    value_set_id="Status_set",
    order=5
)

# Add another new field
mgr.add_field_to_table(
    table_id="observation",
    field_id="Notes",
    label="Notes",
    description="Additional notes about observation",
    role="property",
    sql_data_type="nvarchar(500)",
    required=False,
    order=6
)

mgr.save()

Adding a Hierarchical Relationship (Parent Key)

For tables with parent-child relationships within the same table:

from open_dateaubase.data_model.helpers import DictionaryManager

mgr = DictionaryManager.load("src/open_dateaubase/dictionary.json")

# Add a parent key for hierarchical structure
mgr.add_parent_key(
    table_id="site",
    parent_key_id="Parent_Site_ID",
    ancestor_key_id="Site_ID",
    label="Parent Site ID",
    description="Reference to parent site in hierarchy",
    sql_data_type="int",
    required=False,
    order=10
)

mgr.save()

This creates:

  • A new Parent_Site_ID field of type parentKey
  • With Ancestor_part_ID pointing to Site_ID
  • Appearing in the site table as a property

Handling Name Collisions

If a non-ID field name appears in multiple tables with different meanings (e.g., Description, City):

  • Create separate Part_ID entries with table prefixes
  • Examples: site_City, contact_City, purpose_Description, project_Description
  • Each gets its own part with table_presence set for only that table
  • Labels can be the same or differentiated: "City", "City", etc.
from open_dateaubase.data_model.helpers import DictionaryManager

mgr = DictionaryManager.load("src/open_dateaubase/dictionary.json")

# Add site-specific description
mgr.add_field_to_table(
    table_id="site",
    field_id="site_Description",
    label="Description",
    description="Description of the site",
    role="property",
    sql_data_type="nvarchar(500)",
    required=False,
    order=6
)

# Add project-specific description (different content)
mgr.add_field_to_table(
    table_id="project",
    field_id="project_Description",
    label="Description",
    description="Description of the project",
    role="property",
    sql_data_type="nvarchar(500)",
    required=False,
    order=7
)

mgr.save()

Exception: ID fields (*_ID) always use the same Part_ID across tables and are tracked via table_presence.

Regenerating Documentation and SQL

After editing the dictionary, regenerate all outputs:

# Regenerate documentation
uv run python scripts/orchestrate_docs.py

# Or regenerate specific components
uv run python scripts/generate_dictionary_reference.py dictionary.json docs/reference
uv run python scripts/generate_erd.py dictionary.json docs/reference
uv run python scripts/generate_sql.py dictionary.json sql_generation_scripts mssql

Naming Conventions

The following naming rules apply:

Part_IDs

  • Tables: Singular nouns, lowercase, words separated by underscores
  • Examples: watershed, sampling_point, equipment_model, weather_condition

  • Primary Keys: [Table_name]_ID with capitalized first letters

  • Examples: Watershed_ID, Sampling_point_ID, Equipment_model_ID
  • Rule: These appear in multiple tables with the same Part_ID

  • Foreign Keys: Use the exact same Part_ID as the referenced primary key

  • Example: site table references watershed via Watershed_ID
  • The dictionary shows this with different roles in table_presence

  • Regular Fields: Descriptive names, mixed case with underscores

  • Examples: Site_name, Street_number, Latitude_GPS, Purchase_date

  • Table-Prefixed Fields: When non-ID names collide across tables

  • Format: tablename_FieldName
  • Examples: site_City, contact_City, purpose_Description

Junction Tables (Many-to-Many)

  • Format: [table1]_has_[table2] where both tables are singular
  • Examples: project_has_equipment, project_has_contact, equipment_model_has_procedure
  • Primary keys are composite (two compositeKey* fields)

Value Sets and Members

  • Value Sets: descriptive name + Set or _set suffix
  • Part_type: valueSet
  • Examples: Part_type_set, Site_type_set, StatusSet

  • Members: short, descriptive identifiers

  • Part_type: valueSetMember
  • Member_of_set_part_ID: points to the set
  • Examples: table, key, property, valueSet, valueSetMember, active, inactive