A Working Reference for CDISC Standards: ADaM, Define-XML, Dataset-JSON, and ARS
CDISC standards are non-negotiable for FDA submissions. Your team knows this. You also know that understanding CDISC and implementing it correctly are very different.
This guide is a working reference. It is the page you can bookmark when you need a quick, accurate refresher on what ADaM expects, how Define-XML is put together, how to build a readable define.html, where Dataset-JSON fits in, and what ARS will change. The goal is to be practical. If you need the full specification, go to CDISC. If you need to remember how these pieces work together on a real submission, stay here.
ADaM and FDA Submission Requirements
ADaM stands for Analysis Data Model. It is the CDISC standard for analysis-ready datasets, and it is what the FDA expects to see in the analysis section of your submission package. If SDTM is the tabulated source data, ADaM is the prepared form of that data, structured so a reviewer can reproduce your results without guessing how you got there.
The relationship between SDTM and ADaM is straightforward at a high level. SDTM captures collected study data in a standardized tabulation. ADaM takes that tabulation and transforms it into datasets that support the specific analyses reported in the clinical study report. Traceability back to SDTM is not optional. Every ADaM variable should either come directly from SDTM or have a derivation that a reviewer can follow.
On the FDA side, the agency expects certain ADaM datasets in almost every submission. ADSL, the Subject-Level Analysis Dataset, is required. It carries one record per subject and is the anchor for every other ADaM dataset. Beyond ADSL, the datasets you include depend on the therapeutic area and the analyses performed. ADAE covers adverse events. ADLB handles laboratory data. ADTTE supports time-to-event analyses such as overall survival or progression-free survival. ADVS covers vital signs. ADEX handles exposure. There are others, and CDISC publishes therapeutic-area user guides for many of them.
"Analysis-ready" has a specific meaning in practice. It means that a programmer can run the intended analysis on the dataset without further transformation. Population flags, treatment variables, analysis timepoints, and derived endpoints are all present and named in a way that matches the intended statistical methods. The ADaM Implementation Guide, currently version 1.3, is the reference for how these variables should be structured and named.
The submission package ties these datasets to their metadata. Alongside the ADaM datasets, you submit a define.xml file that describes every variable, a reviewer's guide (ADRG) that explains analysis decisions, and often a set of annotated programs. These pieces need to agree. When a reviewer opens your define and sees a variable that is not in the dataset, or a dataset column that is not documented in the define, your submission loses credibility fast. Keeping the define and the datasets in sync is one of the most common pain points in ADaM delivery, and most of the tooling in this space is built to solve exactly that problem. See a live walkthrough of how to automate this in our webinar on building an AI-ready ADaM pipeline.
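A simple way to catch that drift early is to compare the variables documented in the define against the columns actually present in the dataset. A minimal sketch, with hard-coded illustrative names standing in for parsed metadata:

```python
# Sketch: a define-versus-dataset consistency check. In a real pipeline
# these sets would come from parsing define.xml and reading the dataset;
# here they are hard-coded to show the comparison itself.
define_vars = {"STUDYID", "USUBJID", "TRT01P", "SAFFL", "AGE"}
dataset_cols = {"STUDYID", "USUBJID", "TRT01P", "SAFFL", "AGEGR1"}

undocumented = dataset_cols - define_vars   # in the data, missing from the define
missing = define_vars - dataset_cols        # in the define, missing from the data

print(sorted(undocumented), sorted(missing))  # ['AGEGR1'] ['AGE']
```

Running a check like this on every build, rather than once before delivery, is what keeps the reconciliation from ever becoming painful.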
Define-XML: Structure and Purpose
Define-XML is the metadata file that describes your submission datasets. It tells the reviewer what each dataset contains, what each variable means, how derived values were computed, and which controlled terminology applies. Without it, your datasets are opaque. With it, a reviewer can navigate your submission, trace derivations, and verify that your analysis is reproducible.
At a high level, Define-XML is a structured document with a consistent element hierarchy. The outer container is the Study element, which identifies the study and holds one or more MetaDataVersion blocks. The MetaDataVersion is where the real content lives. Inside it, you will find several key elements that describe your datasets and their contents.
The core elements you work with are these. ItemGroupDef describes a dataset; each ADaM dataset (ADSL, ADAE, ADTTE, and so on) has its own ItemGroupDef. ItemDef describes a variable; every column in every dataset has an ItemDef entry, with attributes for label, data type, length, and origin. CodeList captures controlled terminology, both CDISC-controlled terms and sponsor-defined codelists. MethodDef holds derivation methods: when a variable is computed rather than collected, the logic lives here. CommentDef provides reviewer comments and clarifications. WhereClauseDef handles conditional logic, such as value-level metadata that applies only when a certain parameter is present.
A simplified view of the hierarchy looks like this:
Study
  MetaDataVersion
    ItemGroupDef (one per dataset)
      ItemRef (links to an ItemDef; its MethodOID points to a MethodDef)
    ItemDef (one per variable)
      CodeListRef (links to a CodeList)
    CodeList (one per controlled terminology list)
    MethodDef (one per derivation)
    CommentDef
    WhereClauseDef
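To make the hierarchy concrete, here is a sketch that assembles a skeletal fragment with Python's standard library. Real define files carry the ODM and define-extension namespaces and many more attributes, all elided here for readability, so treat this as illustrative rather than submission-ready.

```python
# Sketch: a skeletal Define-XML-shaped tree built with the standard
# library. Namespaces and most required attributes are omitted on
# purpose; this only illustrates how the elements nest.
import xml.etree.ElementTree as ET

study = ET.Element("Study", OID="CDISC01")
mdv = ET.SubElement(study, "MetaDataVersion", OID="MDV.1", Name="ADaM metadata")

# One ItemGroupDef per dataset; ItemRef ties the dataset to a variable
adsl = ET.SubElement(mdv, "ItemGroupDef", OID="IG.ADSL", Name="ADSL")
ET.SubElement(adsl, "ItemRef", ItemOID="IT.ADSL.USUBJID", Mandatory="Yes")

# One ItemDef per variable, keyed by the OID the ItemRef points to
ET.SubElement(mdv, "ItemDef", OID="IT.ADSL.USUBJID", Name="USUBJID", DataType="text")

print(ET.tostring(study, encoding="unicode"))
```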
Define-XML for ADaM is not the same as Define-XML for SDTM. The structural grammar is identical, but the content differs. ADaM defines carry analysis variable metadata, computation methods for derived variables, population flag definitions, and value-level metadata for parameter-specific analyses. Variables such as PARAM, PARAMCD, AVAL, and AVALC appear consistently, and their value-level metadata is often extensive.
The relationship between your define and your datasets is simple to state and harder to maintain. The define is the single source of truth for dataset metadata. It describes to the reviewer exactly what they should find when they open the data. If the datasets and the define drift apart, the submission fails review, and the fix is always a painful round of reconciliation.
On versions, most current submissions use Define-XML 2.0 or 2.1. Version 2.1 added a dedicated Standards section, a Source attribute on variable origins, and refinements to value-level and conditional metadata. If you are starting a new submission, 2.1 is the safer choice. If you are maintaining an older submission, 2.0 is still accepted. Confirm current FDA guidance before choosing, since the technical rejection criteria evolve.
Building define.html from define.xml
Reviewers do not read raw XML. They open define.html, a human-readable, navigable rendering of the same information. If you have ever seen a reviewer click through a submission, you have seen define.html in action: a table of contents on the left, dataset details on the right, clickable links between variables, codelists, and methods.
The standard approach to producing define.html is an XSLT transformation. Your define.xml file stays as the authoritative metadata source, and an XSLT stylesheet transforms it into the HTML rendering. When a reviewer opens define.xml in a browser that knows how to apply the associated stylesheet, they see the HTML version. This is why most submissions include the stylesheet file alongside the define itself.
CDISC publishes an official stylesheet for Define-XML 2.0 and 2.1. Most sponsors start from this stylesheet and either use it as-is or modify it to match internal presentation preferences. Alternatives exist, and vendors sometimes ship their own. If you are starting fresh, the CDISC stylesheet is the safest default and the one reviewers are most familiar with.
In practice, the process looks like this. You start with a valid define.xml. You either link the CDISC stylesheet in the XML processing instruction at the top of the file, or you run an XSLT processor such as Saxon or xsltproc to transform the XML into a standalone HTML file. The standalone HTML option is often preferred because it removes the dependency on browser XSLT support, which browsers have been deprecating for years.
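As a sketch of the mechanism, here is an XSLT transformation in Python using lxml (a third-party package; Saxon and xsltproc do the same job from the command line). The one-rule stylesheet below is a toy stand-in for the CDISC stylesheet, so the output is purely illustrative:

```python
# Sketch: applying an XSLT stylesheet to an XML document with lxml.
# The inline stylesheet is a toy, not the CDISC stylesheet; in a real
# pipeline you would parse define.xml and the official .xsl from disk.
from lxml import etree

xml = etree.fromstring("<ItemGroupDef Name='ADSL'/>")
xslt = etree.fromstring("""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="ItemGroupDef">
    <html><body><h1><xsl:value-of select="@Name"/></h1></body></html>
  </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(xslt)   # compile the stylesheet once
html = transform(xml)          # apply it to the metadata document
print(str(html))
```

The same two-step shape, compile the stylesheet, apply it to define.xml, is what a build pipeline runs to emit a standalone define.html.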
Common gotchas when building define.html. Broken hyperlinks are the most frequent issue, usually because a method or codelist is referenced but not defined. Encoding issues show up when special characters in variable labels are not properly handled; UTF-8 throughout is the cleanest path. Stylesheet version mismatches happen when a Define-XML 2.1 file is rendered with a 2.0 stylesheet, and the output silently drops new elements.
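The broken-hyperlink case is easy to catch before rendering. A minimal sketch with the standard library, using an inline, namespace-free fragment for brevity; a real check would parse the full define.xml with its namespaces:

```python
# Sketch: finding methods that are referenced but never defined, the
# most common source of broken links in define.html. The fragment is
# inline and namespace-free to keep the example short.
import xml.etree.ElementTree as ET

define = ET.fromstring("""
<MetaDataVersion>
  <ItemGroupDef OID="IG.ADSL">
    <ItemRef ItemOID="IT.AGE" MethodOID="MT.AGE"/>
    <ItemRef ItemOID="IT.CHG" MethodOID="MT.CHG"/>
  </ItemGroupDef>
  <MethodDef OID="MT.AGE"/>
</MetaDataVersion>
""")

referenced = {r.get("MethodOID") for r in define.iter("ItemRef") if r.get("MethodOID")}
defined = {m.get("OID") for m in define.iter("MethodDef")}

dangling = referenced - defined
print(sorted(dangling))  # ['MT.CHG']
```

The same set-difference pattern extends to codelists and comments, which covers most of the broken links reviewers actually hit.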
Open-source tooling has matured in this space. In R, the pharmaverse ecosystem offers packages for both generating and rendering Define-XML. In Python, lxml applies XSLT 1.0 stylesheets directly, and external processors such as Saxon or xsltproc cover newer XSLT versions. Either path lets you automate define.html generation as part of your build pipeline, which is the usual goal once a team moves past one-off submissions.
CDISC Dataset-JSON
Dataset-JSON is the newer CDISC standard for representing clinical datasets in JSON format. It is designed as a modern alternative to SAS Transport Files, the XPT format that has been the default submission file type for decades.
Why it matters comes down to the limits of XPT. The format dates to an older era of SAS and carries constraints that no longer make sense. Variable names are capped at eight characters. Labels are capped at forty. Data types are limited, and very long text fields have to be split across columns. File sizes balloon quickly on large studies. Most importantly, XPT ties your submission pipeline to SAS, even if nothing else in your stack uses it. Dataset-JSON removes that dependency. The format is designed for modern data types, longer names and labels, and efficient representation.
Current status is worth watching closely. The FDA has been running pilots to evaluate Dataset-JSON as a submission format, and industry adoption is accelerating, but the transition is not complete. As of this writing, XPT remains the required format for most submissions, and Dataset-JSON is accepted only in specific contexts. Check current FDA guidance before your next submission.
The practical implications for your workflow are significant once the switch happens. R and Python handle JSON natively. You can read, write, and manipulate Dataset-JSON files without SAS in the loop. For teams moving to open-source tooling, this is the missing piece that makes the migration fully viable. You can generate submission-ready datasets from R or Python, validate them, and deliver without ever touching a SAS license. The standards stay the same. The transport changes.
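A sketch of what that looks like in practice, using nothing but Python's standard library. The keys below are loosely modeled on Dataset-JSON's column-plus-row layout; check the published schema for the exact structure before building anything real:

```python
# Sketch: a dataset as JSON, read and written with the standard library
# alone. Key names are illustrative, loosely modeled on Dataset-JSON,
# not the exact published schema.
import json

dataset = {
    "name": "ADSL",
    "label": "Subject-Level Analysis Dataset",
    "columns": [
        {"name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string"},
        {"name": "AGE", "label": "Age", "dataType": "integer"},
    ],
    "rows": [["01-001", 64], ["01-002", 58]],
}

text = json.dumps(dataset)       # serialize: no SAS in the loop
roundtrip = json.loads(text)     # and read it straight back
print(roundtrip["rows"][0])      # ['01-001', 64]
```

Contrast this with XPT, where the same round trip requires a format-specific reader and writer on both ends.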
CDISC ARS: What It Is and Why It Matters
ARS stands for Analysis Results Standard. It addresses a problem that has been obvious to clinical programming teams for years: the same analysis, even within the same company, gets described differently across studies, sponsors, and teams. A Kaplan-Meier survival analysis in one study might be documented one way in the SAP, another way in the TLFs, and another way in the reviewer's guide. Each description is correct, but nothing machine-readable connects them.
ARS creates a common structure for defining, structuring, and reporting analysis results. The core idea is that an analysis has a definition, a method, a set of parameters, and a result, and all of those should be captured in a standardized machine-readable format. When this works, you can trace a number in a table back to the analysis that produced it, the data that fed it, and the method that defined it, without writing custom mapping logic for every study.
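As a sketch of that idea, here is an analysis captured as a plain Python structure. The field names are illustrative, not the official ARS schema; the point is that definition, method, parameters, and result live together in one machine-readable, traceable object:

```python
# Sketch: an analysis as machine-readable metadata. Field names are
# illustrative stand-ins, not the official ARS model.
analysis = {
    "id": "AN01",
    "name": "Overall survival, Kaplan-Meier",
    "dataset": "ADTTE",
    "population": "ITT",
    "method": {"id": "MT01", "name": "Kaplan-Meier estimate"},
    "parameters": {"param": "OS", "conf_level": 0.95},
    "result": {"statistic": "median_months", "value": None},  # populated at run time
}

# A reported number can now be traced back to its inputs
print(analysis["id"], "->", analysis["dataset"], analysis["method"]["id"])
```

Tooling that consumes a structure like this can generate both the TLF output and its documentation from the same source, which is the shift ARS is aiming at.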
For clinical programming teams, ARS changes how analysis specifications are authored and how TLFs are generated. Instead of producing tables from ad-hoc specifications, you author analyses against the ARS model, and your tooling can generate both the analysis outputs and the documentation from the same source. This is a significant shift, and it is the kind of change that takes years to work through an industry.
Current status. ARS is an emerging standard within the CDISC 360 initiative. The specification has been published, early adopters are running pilots, and tooling is beginning to appear. ARS remains optional today, and the implementation landscape is still forming. Watch this one closely if you work on TLF automation or analysis metadata.
How These Standards Connect
The standards covered here are not independent. They form a chain that moves clinical data from collection through analysis to submission.
The data flow goes SDTM to ADaM to TLFs. SDTM captures the tabulated study data. ADaM transforms it into analysis-ready form. TLFs, the tables, listings, and figures, are produced from ADaM and reported in the clinical study report.
Define-XML describes the datasets at each stage. One define covers SDTM. A separate define covers ADaM. Each one is the metadata anchor for its corresponding datasets.
Dataset-JSON modernizes the transport layer. Whether you are shipping SDTM or ADaM, Dataset-JSON is the replacement for XPT, decoupling the data format from the SAS ecosystem.
ARS standardizes the analysis outputs. Where Define-XML describes the inputs to analysis, ARS describes the analyses themselves and their results.
When teams move from SAS to open-source tooling, the standards do not change. What changes is the implementation. R, Python, and their ecosystems can produce SDTM, build ADaM, generate Define-XML, render define.html, and increasingly output Dataset-JSON and ARS-compliant analysis metadata. The standards are tool-agnostic. Your choice of tools affects how you implement them, not whether you need them. Learn how R and Shiny work in regulated environments.
Q&As
Q1: What is define.xml?
Define.xml is the metadata file that describes your submission datasets. Think of it as the instruction manual for your data: it tells the FDA reviewer what each dataset contains, what each variable means, how derived values were computed, and which controlled terminology applies. Without define.xml, your datasets are opaque. With it, a reviewer can navigate your submission, trace derivations, and verify reproducibility. For ADaM submissions, define.xml is required alongside the datasets.
Q2: What is the difference between SDTM and ADaM?
SDTM (Study Data Tabulation Model) is the standardized tabulation of raw study data as collected. ADaM (Analysis Data Model) is the analysis-ready transformation of that data. SDTM captures collected data in standard form. ADaM takes SDTM and transforms it into datasets that support the specific analyses reported in the clinical study report. Every ADaM variable should either come directly from SDTM or have a documented derivation. Traceability back to SDTM is not optional; it is how the FDA verifies reproducibility.
Q3: What is CDISC Dataset-JSON?
Dataset-JSON is the newer CDISC standard for representing clinical datasets in JSON format. It is designed as a modern alternative to SAS Transport Files (XPT), which has been the default submission file type for decades. XPT has constraints that no longer make sense: variable names capped at 8 characters, labels capped at 40, limited data types. Dataset-JSON removes those constraints and, critically, decouples your submission pipeline from SAS. You can generate submission-ready datasets in R or Python without needing a SAS license.
Q4: How do I generate define.html from define.xml?
Define.html is the human-readable, navigable version of define.xml that FDA reviewers open in their browser. The standard approach is XSLT transformation: your define.xml file is the authoritative source, and an XSLT stylesheet transforms it into HTML. CDISC publishes an official stylesheet for Define-XML 2.0 and 2.1. You can either link the stylesheet in the XML itself or run an XSLT processor such as Saxon or xsltproc to generate a standalone HTML file. Open-source tools in R (pharmaverse packages) and Python (lxml + Saxon) can automate this as part of your build pipeline.
Summing up
CDISC standards are not going away. They're evolving.
The trajectory is clear: more structure, more machine-readable metadata, more openness in tooling. Define-XML will stay the metadata anchor for submission datasets. Dataset-JSON will eventually replace XPT, decoupling your pipeline from SAS. ARS will standardize how analysis results are defined and traced. And through it all, the standards remain tool-agnostic. What changes is the implementation. If your team is navigating these standards, whether as part of a migration to open-source tools or building the infrastructure to support them, the answer is the same: reproducibility, traceability, and audit-ready automation matter more than the tooling choice. We work on exactly this. Talk with our Experts about how we build the infrastructure that makes this work.

