Last edited by Kai van Lopik Feb 09, 2021

Data structure

PEP allows for the storage and retrieval of tabular data. Conceptually, PEP provides a single table consisting of rows and columns, e.g.:

Name    | HatStyle   | BankAccountNr      | LastDoctorVisit | BloodPressure | ...
Scrooge | top hat    | NL50ABNA3690200148 | 1843-12-19      | 131/77        | ...
Donald  | sailor cap | <none>             | 2021-01-06      | 141/76        | ...
Ariel   | <none>     | DK3650519625773963 | 1989-11-17      | 119/64        | ...
Eric    | crown      | DK3650519625773963 | 2013-11-19      | 122/62        | ...
...     | ...        | ...                | ...             | ...           | ...

Each row represents a single entity or data subject. Data associated with the same subject are stored in the same row, and data for different subjects should be stored in separate rows. Rows are identified by one of PEP's identifiers. A new row is created by storing data under a previously unused participant identifier.

Columns are used to split data into conceptual units. Different types of data or different measurements should be stored in different columns. Columns are referred to by name, and the name is determined when the column is created. Only members of the Data Administrator role can create columns and perform other administrative tasks on them.

A major difference from traditional database storage systems is that no separate column should be used to store a fixed row identifier. Rows are instead referred to by means of PEP's polymorphic identifiers.

The intersection of a row and column is called a cell. Cells do not have an identifier of their own. Instead, when storing or retrieving data, users can identify cells by specifying the associated column name and row identifier.
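
The addressing scheme described above can be pictured with a small Python sketch. It is a conceptual model only and not PEP's API: the class ConceptualTable, its methods, and the use of plain names such as "Scrooge" as row identifiers are all assumptions made for illustration (PEP itself refers to rows by polymorphic identifiers).

```python
# Illustrative model only: this is NOT PEP's API. All names are hypothetical.
from typing import Dict, Optional, Set, Tuple


class ConceptualTable:
    """A sparse table whose cells are addressed by (row identifier, column name)."""

    def __init__(self) -> None:
        self._columns: Set[str] = set()
        self._cells: Dict[Tuple[str, str], str] = {}

    def create_column(self, name: str) -> None:
        # In PEP this step is reserved for the Data Administrator role.
        self._columns.add(name)

    def store(self, row_id: str, column: str, value: str) -> None:
        if column not in self._columns:
            raise KeyError(f"unknown column: {column}")
        # Storing under a previously unused identifier implicitly creates the row.
        self._cells[(row_id, column)] = value

    def retrieve(self, row_id: str, column: str) -> Optional[str]:
        # An empty cell (shown as <none> in the example table) yields None.
        return self._cells.get((row_id, column))

    def rows(self) -> Set[str]:
        # A row exists as soon as at least one of its cells holds data.
        return {row_id for row_id, _ in self._cells}


table = ConceptualTable()
table.create_column("HatStyle")
table.create_column("BankAccountNr")
table.store("Scrooge", "HatStyle", "top hat")
table.store("Donald", "HatStyle", "sailor cap")
```

Note that the sketch has no identifier for a cell itself: every store and retrieve call names a column and a row, which mirrors how cells are identified in PEP.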

Access cannot be managed at the cell level: rows and columns are the smallest units for which access can be granted or revoked. Data Administrators should take this into account when managing columns. If data may at some point need to be disseminated separately, they must be stored in separate cells, which means that separate columns must be made available.

Retention

After data have been stored in a PEP cell, downloaders can retrieve them from there. When new data are stored in the same cell, subsequent downloads receive the updated version instead. The old data are never discarded, however: PEP retains a complete record of all data that have ever been stored in the system. This allows PEP to reconstruct its data set as it was at any time in the past. Such "snapshots" are intended to be made accessible to users for download, although that functionality has not yet been created. Snapshots will allow the exact same data to be retrieved multiple times, which is useful, for example, for scientific replication studies.
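
The retention behaviour can be sketched as an append-only history per cell. The following Python model is an assumption-laden illustration, not PEP's implementation: the class VersionedCells, the integer timestamps, the as_of parameter, and the second blood pressure value are all hypothetical.

```python
# Illustrative model only: this is NOT PEP's implementation. All names are hypothetical.
from typing import Dict, List, Optional, Tuple


class VersionedCells:
    """Append-only cell history: a write adds a version and never overwrites one."""

    def __init__(self) -> None:
        # (row_id, column) -> list of (timestamp, value) in order of storage.
        # Assumption: timestamps are stored in non-decreasing order.
        self._history: Dict[Tuple[str, str], List[Tuple[int, str]]] = {}

    def store(self, row_id: str, column: str, value: str, timestamp: int) -> None:
        self._history.setdefault((row_id, column), []).append((timestamp, value))

    def retrieve(
        self, row_id: str, column: str, as_of: Optional[int] = None
    ) -> Optional[str]:
        """Return the latest value, or the value as it was at time `as_of`."""
        result = None
        for timestamp, value in self._history.get((row_id, column), []):
            if as_of is None or timestamp <= as_of:
                result = value
        return result


cells = VersionedCells()
cells.store("Donald", "BloodPressure", "141/76", timestamp=100)
cells.store("Donald", "BloodPressure", "135/80", timestamp=200)  # made-up newer value

latest = cells.retrieve("Donald", "BloodPressure")
snapshot = cells.retrieve("Donald", "BloodPressure", as_of=150)
```

Here latest is the most recently stored value, while snapshot is the value that was current at time 150. Because no version is ever deleted, the same snapshot query returns the same answer no matter how many later writes occur.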

A similar policy applies to column management. When a Data Administrator removes a column, the data stored in that column are retained for future use. Therefore, once the snapshot feature is available, users who retrieve an older snapshot will also receive the data from the "removed" column. Data Administrators should be aware that, if they remove a column and then re-add a column with the same name, the newly created column will immediately contain the previously stored data.
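
The consequence of removing and re-adding a column can be made concrete with a short Python sketch. It is a simplified model under stated assumptions, not PEP's implementation: the class ColumnCatalog and its methods are hypothetical, and data versioning is left out for brevity.

```python
# Illustrative model only: this is NOT PEP's implementation. All names are hypothetical.
from typing import Dict, Optional, Set


class ColumnCatalog:
    """Removing a column only hides it; the stored data are kept under its name."""

    def __init__(self) -> None:
        self._active: Set[str] = set()
        # column name -> {row_id: value}; entries are never deleted
        self._data: Dict[str, Dict[str, str]] = {}

    def create_column(self, name: str) -> None:
        self._active.add(name)
        # setdefault keeps any data left behind by an earlier column of this name.
        self._data.setdefault(name, {})

    def remove_column(self, name: str) -> None:
        self._active.discard(name)  # the data in self._data are retained

    def store(self, row_id: str, column: str, value: str) -> None:
        if column not in self._active:
            raise KeyError(f"no such column: {column}")
        self._data[column][row_id] = value

    def retrieve(self, row_id: str, column: str) -> Optional[str]:
        if column not in self._active:
            raise KeyError(f"no such column: {column}")
        return self._data[column].get(row_id)


catalog = ColumnCatalog()
catalog.create_column("HatStyle")
catalog.store("Eric", "HatStyle", "crown")
catalog.remove_column("HatStyle")
catalog.create_column("HatStyle")  # re-added under the same name
reappeared = catalog.retrieve("Eric", "HatStyle")
```

In this model reappeared holds the value stored before the removal. A Data Administrator who wants a genuinely empty column would therefore have to pick a name that has not been used before.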

Grouping

Members of the Data Administrator role can group

  • columns into column groups, and
  • rows into participant groups.

Such groups serve as a basis for data access management. For example, a MedicalData column group might contain the LastDoctorVisit and BloodPressure columns, and the rows for Scrooge and Donald might be included in a participant group called Ducks. An Access Administrator can then grant certain users or user groups access to MedicalData and to the participants classified as Ducks. Such users are then authorized to access all MedicalData for all Ducks stored in PEP.
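
The access rule from this example can be sketched in Python as follows. This is a hedged illustration rather than PEP's access control implementation: the dictionaries, the may_access function, and the user group name Researchers are assumptions introduced for the example.

```python
# Illustrative model only: this is NOT PEP's access control. All names are hypothetical.
from typing import Dict, Set

# Groups as a Data Administrator might define them.
column_groups: Dict[str, Set[str]] = {
    "MedicalData": {"LastDoctorVisit", "BloodPressure"},
}
participant_groups: Dict[str, Set[str]] = {
    "Ducks": {"Scrooge", "Donald"},
}

# Grants as an Access Administrator might issue them to a user group.
column_group_grants: Dict[str, Set[str]] = {"Researchers": {"MedicalData"}}
participant_group_grants: Dict[str, Set[str]] = {"Researchers": {"Ducks"}}


def may_access(user_group: str, row_id: str, column: str) -> bool:
    """A cell is accessible only if both its column and its row are covered by a grant."""
    column_ok = any(
        column in column_groups.get(group, set())
        for group in column_group_grants.get(user_group, set())
    )
    row_ok = any(
        row_id in participant_groups.get(group, set())
        for group in participant_group_grants.get(user_group, set())
    )
    return column_ok and row_ok


donald_pressure = may_access("Researchers", "Donald", "BloodPressure")
ariel_pressure = may_access("Researchers", "Ariel", "BloodPressure")
donald_hat = may_access("Researchers", "Donald", "HatStyle")
```

Only the first check succeeds: Ariel is not among the Ducks, and HatStyle is not part of MedicalData. The sketch also shows why access cannot be narrowed to a single cell, since a grant always covers the full combination of the granted columns and rows.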

PEP provides a number of predefined column groups and participant groups. Their names should be considered reserved and should not be used or reused for other purposes. The only predefined participant group is named * and contains all rows; newly added rows are automatically added to it. The predefined column groups and their purposes are listed in the following table:

Name | Contains | Updates
* | All columns @@@ why does this group exist? No one should have access (except perhaps Data Administrator) @@@ | Contents are kept in sync with columns created and removed by Data Administrator.
Castor | Columns storing data imported from the Castor electronic data capture (EDC) system. | Managed by Data Administrator.
CastorShortPseudonyms | Columns storing short pseudonyms that refer to Castor EDC records. | Automatically kept synchronized with environment configuration.
Device | Columns storing (wearable) device registration histories. | Automatically kept synchronized with environment configuration.
ShortPseudonyms | Columns storing short pseudonyms. | Automatically kept synchronized with environment configuration.
VisitAssessors | Columns storing identifiers for the assessors that administered a(n academic study's) participant measurement session (i.e. visit to the research center). | Automatically kept synchronized with environment configuration.
WatchData | Columns containing data collected by wearable devices. | Managed by Data Administrator.