Clinical data lake architecture: designing storage for submission and exploration

June 12, 2026

Clinical data has two jobs, and they want completely opposite things from your storage layer.

Submission needs immutability. An auditor asks you to reproduce the exact dataset that supported a particular analysis from two years ago. Not just something similar. The actual dataset, with proof that it was never touched or changed. That is a high bar, but drug development is tied directly to patient safety.

Exploration needs flexibility. Every day, statisticians, data managers, and programmers are joining tables, creating derived variables, testing different analytical approaches. This work is iterative and fast.

What makes data useful for day-to-day work (it can be modified, reshaped, improved) is exactly what makes it unreliable for regulatory purposes (it has changed, and you cannot prove it has not).

Most organizations do not solve this so much as stumble into one of two traps.

The first: build a rigid validated folder structure with formal procedures for every change. The quality team is happy. The analysts are not. So they copy data to laptops, personal SharePoint folders, wherever lets them work at normal speed. Now the same data lives in six places, nobody is sure which version is current, and the audit trail is fiction.

The second: adopt a modern lakehouse, let data flow freely, keep analysts productive. Then a submission deadline approaches and the quality team cannot sign off. Either you build a parallel system for regulatory purposes (double the cost, double the inconsistency risk), or you try to retroactively validate something that was never designed for it.

A three-layer architecture for clinical data storage

The fix is structural rather than procedural, and it lives at the storage layer of a modern statistical computing environment. Three layers, with a formal gate between each one, and data that moves in one direction only.

Landing zone

This is where data arrives from a vendor. You record the timestamp, compute a checksum, and store both alongside the file. Nothing gets transformed. Nothing gets interpreted. The landing zone is a receipt.

If a vendor delivers a corrected file later, the original stays put. The correction lands next to it, labeled. Both exist permanently.
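The receipt step can be sketched in a few lines. This is a minimal illustration using only the Python standard library. The function names, the receipt fields, and the `.receipt.json` suffix are assumptions made for the example, not a prescribed format.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large deliveries do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_receipt(path: Path, vendor: str) -> Path:
    """Write a receipt next to the delivered file without touching the file."""
    receipt = {
        "file": path.name,
        "vendor": vendor,  # hypothetical field, for illustration
        "size_bytes": path.stat().st_size,
        "sha256": sha256_of(path),
        "received_at": datetime.now(timezone.utc).isoformat(),
    }
    receipt_path = path.with_name(path.name + ".receipt.json")
    # Mode "x" fails if the receipt already exists, so it is never overwritten.
    with receipt_path.open("x") as fh:
        json.dump(receipt, fh, indent=2)
    return receipt_path
```

Opening the receipt in exclusive-create mode is a small design choice that matches the "both exist permanently" rule: a second delivery has to land under a new name rather than replace the first receipt.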

Validated clinical layer

After a data manager reviews and approves the file, it gets promoted here. Files in this layer cannot be modified or deleted before the retention period expires. The storage system enforces this directly, not by access-control convention.

Promotion is a recorded event. Who approved it, when, and on what basis all go into the audit trail.

Analytical workspace

Analysts copy data here to work with it. They can join tables, build derived columns, run exploratory queries, experiment freely. Nothing here is the source of truth. It is all downstream of the validated layer, and everyone understands that.

Work in the workspace does not touch the validated layer. Analysts are on copies. If an auditor asks to see the original data, the validated layer is unchanged.

The gates matter as much as the layers. Moving from landing to validated is a deliberate, recorded decision. Moving from validated to workspace is a copy, never a move.

This sounds obvious. In practice, most organizations skip the explicit design and end up with a folder structure that roughly resembles this, without the guarantees. The landing zone gets edited. Validated files get overwritten when corrections arrive. The quality team cannot point to a hard boundary because there is not one.

The separation only holds if the system enforces it.

Enforcing layer separation with S3 buckets, IAM, and Object Lock

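Three layers means three buckets, not just prefixes in one bucket. The reason is practical. Object Lock is enabled per bucket, and a default retention rule covers the whole bucket. Lifecycle rules and IAM policies can be scoped to a prefix, but every prefix-level rule is an exception someone has to write, review, and validate. Putting landing and validated data in the same bucket means either applying the same policies to both or managing those exceptions by hand.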

A typical setup:

clinical-landing-prod
clinical-validated-prod
clinical-workspace-prod

IAM: enforce boundaries in code, not in policy documents

The IAM layer is what stops the separation from being a convention.

A vendor ingestion role has write access to the landing bucket and nothing else. If a pipeline using that role is compromised, the damage stops at landing. A data manager role can read from landing and write to validated, but cannot modify landing, so the original receipt stays intact. An analyst role can read from validated and has full access to workspace, but cannot write to validated. A submission role is read-only on validated.

These roles attach to systems and pipelines, not just to people. The tool an analyst uses has a role. The ingestion pipeline has a role. Even if someone has broad personal AWS permissions, the system they are working through is constrained.
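As a rough sketch, the four roles could map to policy documents like the ones below. The bucket names come from the setup above. The role names, the exact action lists, and the `allows` helper are assumptions for illustration. The helper is a toy lookup over these dictionaries, not the real IAM evaluation logic.

```python
import json

LANDING = "arn:aws:s3:::clinical-landing-prod"
VALIDATED = "arn:aws:s3:::clinical-validated-prod"
WORKSPACE = "arn:aws:s3:::clinical-workspace-prod"

READ = ["s3:GetObject", "s3:GetObjectVersion"]
LIST = ["s3:ListBucket"]


def statement(actions, bucket_arn, objects=True):
    """One Allow statement; object actions target bucket/*, list actions the bucket."""
    return {
        "Effect": "Allow",
        "Action": list(actions),
        "Resource": f"{bucket_arn}/*" if objects else bucket_arn,
    }


def policy(*statements):
    return {"Version": "2012-10-17", "Statement": list(statements)}


# Role names are hypothetical; each role gets only what the text describes.
POLICIES = {
    "vendor-ingestion": policy(statement(["s3:PutObject"], LANDING)),
    "data-manager": policy(
        statement(READ, LANDING),
        statement(LIST, LANDING, objects=False),
        statement(["s3:PutObject", "s3:PutObjectRetention"], VALIDATED),
    ),
    "analyst": policy(
        statement(READ, VALIDATED),
        statement(LIST, VALIDATED, objects=False),
        statement(READ + ["s3:PutObject", "s3:DeleteObject"], WORKSPACE),
        statement(LIST, WORKSPACE, objects=False),
    ),
    "submission": policy(
        statement(READ, VALIDATED),
        statement(LIST, VALIDATED, objects=False),
    ),
}


def allows(role, action, bucket_arn):
    """Naive check for this sketch only; real IAM also evaluates Deny, conditions, etc."""
    return any(
        action in s["Action"] and s["Resource"] in (bucket_arn, f"{bucket_arn}/*")
        for s in POLICIES[role]["Statement"]
    )


if __name__ == "__main__":
    print(json.dumps(POLICIES["analyst"], indent=2))
```

Keeping the policies as data in a repository means a reviewer can diff them, and the same definitions can feed whatever infrastructure-as-code tool provisions the roles.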

Object Lock on the validated layer

When a file is promoted to the validated bucket, Object Lock is applied in COMPLIANCE mode with a retention date.

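COMPLIANCE mode means no one can delete or overwrite the locked object version before that date. Not the data manager who promoted it. Not an AWS administrator, not even the account's root user. Not AWS Support. Object Lock works on versioned buckets, so a write to the same key creates a new version and leaves the locked one intact. The retention date can be extended but never shortened. Short of closing the AWS account itself, the only way out is to wait.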

That is the guarantee you need. 21 CFR Part 11 requires that records be protected so they can be retrieved accurately throughout the retention period. ICH E6 requires that source data can be reproduced exactly. Object Lock in COMPLIANCE mode is a direct technical answer to the immutability part of both.

Retention periods depend on the study. For a phase III trial, fifteen to twenty-five years is typical. You set the date at promotion time, calculated from the expected study end plus the required post-study period.
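The date arithmetic and the upload parameters might look like the sketch below. `ObjectLockMode` and `ObjectLockRetainUntilDate` are parameter names accepted by boto3's S3 `put_object`. The helper functions, the example key, and the 25-year figure are assumptions for illustration, and nothing here calls AWS.

```python
from datetime import date, datetime, time, timezone


def retain_until(study_end: date, post_study_years: int) -> datetime:
    """Expected study end plus the required post-study period, as a UTC timestamp."""
    target_year = study_end.year + post_study_years
    try:
        target = study_end.replace(year=target_year)
    except ValueError:
        # study_end is 29 February and the target year is not a leap year.
        target = study_end.replace(year=target_year, month=3, day=1)
    return datetime.combine(target, time.min, tzinfo=timezone.utc)


def locked_put_args(bucket: str, key: str, body: bytes, until: datetime) -> dict:
    """Keyword arguments in the shape boto3's S3 put_object accepts."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": until,
    }


if __name__ == "__main__":
    until = retain_until(date(2030, 6, 30), post_study_years=25)
    args = locked_put_args(
        "clinical-validated-prod", "study-123/dm.parquet", b"...", until
    )
    # With boto3 installed and credentials configured, this would be:
    #   boto3.client("s3").put_object(**args)
    print(args["ObjectLockRetainUntilDate"].isoformat())
```

Because a COMPLIANCE retention date cannot be shortened, it is worth computing it in one reviewed function rather than typing it by hand at promotion time.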

The promotion record

A file does not move to validated alone. It moves with a manifest: a small JSON file recording the source checksum, the promotion timestamp, who approved it, which QC process was run, and the schema version. The manifest goes into the validated bucket under the same Object Lock policy.

At any point you can recompute a file's checksum and compare it against the manifest. If they match, the file is exactly what was promoted. That check is independent of Object Lock, a second auditable proof that nothing changed.
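A minimal sketch of the manifest and the recompute-and-compare check follows. The field names mirror the items listed above, but the exact schema and the function names are assumptions for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_manifest(data: bytes, key: str, approved_by: str,
                   qc_process: str, schema_version: str) -> dict:
    """Record what was promoted, by whom, and on what basis."""
    return {
        "key": key,
        "sha256": hashlib.sha256(data).hexdigest(),
        "promoted_at": datetime.now(timezone.utc).isoformat(),
        "approved_by": approved_by,
        "qc_process": qc_process,
        "schema_version": schema_version,
    }


def verify(data: bytes, manifest: dict) -> bool:
    """True only if the bytes hash to the checksum recorded at promotion."""
    return hashlib.sha256(data).hexdigest() == manifest["sha256"]


if __name__ == "__main__":
    payload = b"USUBJID,VISIT\n001,SCREENING\n"
    manifest = build_manifest(
        payload, "study-123/dm.parquet", "data.manager@example.org",
        "QC-checklist-v2", "1.0",
    )
    print(json.dumps(manifest, indent=2))
    print(verify(payload, manifest))         # unchanged bytes
    print(verify(payload + b"x", manifest))  # any change breaks the match
```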

Handling vendor data corrections without breaking the audit trail

A vendor delivers a dataset. Three weeks later they send an email: there was an error in the visit mapping, here is a corrected file.

This is one of the most common events in clinical data management. It is also the moment that breaks naive storage designs.

The bad instinct is to replace the original. It keeps things tidy. One file, correct version, done. But you have just destroyed the evidence that the error existed. If a regulator asks whether any vendor corrections occurred during the study, you cannot answer honestly. If an auditor wants to understand the history of a dataset, the trail starts at the correction, not at the original delivery.

The original file is part of the record, not a mistake to be cleaned up.

How it flows through the layers

The corrected file lands in the landing zone alongside the original. Both stay permanently. The folder structure makes the history readable:

landing/study-123/vendor-xyz/2024-03-15/
landing/study-123/vendor-xyz/2024-05-02-corrected/

Each folder has its own timestamp, checksum, and receipt record. The original is untouched.

After QC, the data manager promotes the corrected version to the validated layer. The promotion manifest references the original: "this file supersedes the version received on 2024-03-15, which remains in landing." Both manifests sit under Object Lock.
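One way to make the supersedes link machine-readable is sketched below. The prefix layout follows the folder names shown above. The `supersedes` field and both function names are hypothetical.

```python
from datetime import date
from typing import Optional


def landing_prefix(study: str, vendor: str, received: date,
                   corrected: bool = False) -> str:
    """Build the landing prefix for a delivery, following the layout above."""
    suffix = "-corrected" if corrected else ""
    return f"landing/{study}/{vendor}/{received.isoformat()}{suffix}/"


def with_supersedes(manifest: dict, original_prefix: Optional[str]) -> dict:
    """Return a copy of the manifest that points at the delivery it replaces."""
    linked = dict(manifest)  # copy, so the input manifest is never mutated
    linked["supersedes"] = original_prefix  # None for a first delivery
    return linked


if __name__ == "__main__":
    original = landing_prefix("study-123", "vendor-xyz", date(2024, 3, 15))
    corrected = landing_prefix("study-123", "vendor-xyz", date(2024, 5, 2),
                               corrected=True)
    manifest = with_supersedes({"source_prefix": corrected}, original)
    print(manifest)
```

Storing the link as a field rather than as free text means the history of a dataset can be walked by a script instead of being read out of prose.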

Analysts in the workspace get a notification that a new version is available. They decide when to update their analyses. Work done before the correction is still traceable to the version it used.

What you can show an auditor

When the original file was received
What it contained
When the correction arrived
What changed
Who reviewed it
When it was promoted
Which analyses used which version

None of that exists if the original file was overwritten. It only exists because both files are there, permanently, with their full context.

The analytical workspace: speed without losing traceability

Everything up to this point has been about protection. The workspace layer is about the opposite.

Analysts need to move fast. They join tables from different vendors, build derived variables, test population filters, generate the tables and figures that go into clinical study reports. A dataset gets reshaped a dozen times before anyone is satisfied. Requiring a formal approval at each step would make the work impossible.

So the workspace is permissive. Analysts can write, overwrite, delete, experiment. Scratch datasets are fine. Half-finished joins are fine. Nobody signs off on intermediate steps.

What keeps this from being chaos is one thing: nothing in the workspace is the source of truth. It is all derived from the validated layer. The validated layer does not know the workspace exists.

Flat Parquet or a table format

For the workspace, you have a choice about how to store data.

Flat Parquet files on S3 are simple. A dataset is a file. You know exactly what is in it. There is no additional software to understand or validate. For teams that want predictability and minimal tooling overhead, it works fine.

Delta Lake and Apache Iceberg add a layer on top of Parquet. Both give you a transaction log, so every write to a table is recorded. Both give you time travel, so you can query what a table looked like last Tuesday. Both handle schema changes without requiring a full rewrite of the dataset.

For analysts, time travel is genuinely useful. If an analysis produced unexpected results, you can go back to the state of the data when it ran. Schema evolution matters when vendors change their delivery format mid-study, which happens more than it should.

The tradeoff is validation overhead. Any software in a GxP environment needs to be formally validated: documented, tested, signed off. Delta and Iceberg are additional systems to put through that process. Most teams accept that cost for the workspace because the productivity gain is real. For the validated layer, it is rarely worth it. Parquet plus Object Lock is simpler and easier to defend to an auditor.

Between Delta and Iceberg: Delta is the better choice if you are already on Databricks. Iceberg works natively across more query engines (Athena, Trino, Snowflake) and does not tie you to a single vendor. For new implementations, Iceberg is usually the safer bet.

What controls you still need

Permissive does not mean unmonitored.

IAM still constrains who can write where. An analyst on study A should not be able to touch study B's folders. Access logging is on. You know who touched what and when.

Derived datasets should carry metadata pointing back to the validated source. When a programmer runs an analysis, the output should record which version of which validated dataset was the input. That link is what lets you answer the reproducibility question: run the same code against the same validated snapshot and you get the same result.
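A lineage sidecar for a derived dataset could be as small as the sketch below. The field names, the `code_ref` idea, and the choice of a JSON file stored next to the output are all assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class ValidatedInput:
    bucket: str
    key: str
    version_id: str  # S3 object version of the validated file
    sha256: str      # checksum taken from the promotion manifest


def lineage_record(output_key: str, inputs: List[ValidatedInput],
                   code_ref: str) -> str:
    """JSON sidecar to store next to a derived dataset in the workspace."""
    return json.dumps(
        {
            "output": output_key,
            "code_ref": code_ref,  # e.g. a commit hash of the analysis program
            "inputs": [asdict(i) for i in inputs],
        },
        indent=2,
        sort_keys=True,
    )


if __name__ == "__main__":
    dm = ValidatedInput("clinical-validated-prod", "study-123/dm.parquet",
                        "example-version-id", "0" * 64)
    print(lineage_record("study-123/derived/adsl.parquet", [dm], "abc1234"))
```

Recording the object version and the checksum together means the input can be found by version and then checked independently by hash.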

The workspace is not regulatory evidence. But it needs to be traceable back to what is.

Why early storage architecture decisions are expensive to fix later

Clinical data sticks around. A phase III trial ends, but the data stays live for fifteen to twenty-five years. The architecture you build at the start of a study is the architecture you will be defending to auditors long after the drug is approved or shelved.

That is why the decisions here are front-loaded. Getting them right at the start is manageable. Fixing them afterward is not.

What a migration actually costs

Suppose you built the landing and validated layers without Object Lock, relying on access controls and process instead. Two years in, a compliance review flags it. Now you need to migrate existing data into a properly configured bucket.

The migration itself is not the hard part. The hard part is proving to a regulator that nothing changed during the move. You need to show that every file in the new bucket is byte-for-byte identical to what was in the old one, that the chain of custody is unbroken, and that the migration event is documented and auditable. That is weeks of work.

Do it for one study. Now imagine doing it for twelve concurrent studies, some mid-trial, some already in submission.
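The byte-for-byte claim comes down to comparing two checksum inventories. The sketch below assumes each inventory is a mapping from object key to SHA-256 digest, produced separately for the old and the new bucket. The function names and the report layout are illustrative.

```python
from typing import Dict, List


def compare_inventories(old: Dict[str, str],
                        new: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare key -> sha256 inventories taken before and after a migration."""
    shared = old.keys() & new.keys()
    return {
        "missing_in_new": sorted(old.keys() - new.keys()),
        "unexpected_in_new": sorted(new.keys() - old.keys()),
        "checksum_mismatch": sorted(k for k in shared if old[k] != new[k]),
    }


def migration_is_clean(report: Dict[str, List[str]]) -> bool:
    """Clean means every list in the report is empty."""
    return not any(report.values())


if __name__ == "__main__":
    old = {"study-123/dm.parquet": "aa", "study-123/ae.parquet": "bb"}
    new = {"study-123/dm.parquet": "aa", "study-123/ae.parquet": "bc"}
    report = compare_inventories(old, new)
    print(report)
    print(migration_is_clean(report))
```

The comparison itself is trivial. The weeks of work go into producing trustworthy inventories, documenting the migration event, and having the result reviewed and signed off.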

The flat Parquet vs. lakehouse decision

Flat Parquet is boring. That is its main virtue in the validated layer. Auditors understand files. There is no additional software to explain or validate. If someone asks what is in a dataset, you open the file.

A managed table format like Iceberg adds real capabilities: time travel, transaction history, cleaner schema management. Those capabilities have a cost. Every piece of software in a GxP environment needs a validation package: installation qualification, operational qualification, documented test cases. Iceberg is not a small package. If a version upgrade changes behavior, you have a problem that touches every dataset the validated layer holds.

Most teams land here: flat Parquet in the validated layer, Iceberg or Delta in the workspace. The validated layer trades capability for simplicity. The workspace trades simplicity for productivity. Both are deliberate choices.

The IAM decision

Separate buckets per layer feel like over-engineering at the start. Adding a bucket later means migrating data, re-pointing pipelines, and re-validating access controls. Doing it at setup takes an afternoon.

The same goes for IAM roles. Narrow roles for ingestion, promotion, analysis, and submission are straightforward to define upfront when the buckets and roles are provisioned as infrastructure as code. Unpicking overly broad permissions after data is in production, with live studies depending on the pipelines, is a different kind of problem.

The thing most teams skip

The link between a validated dataset and the analyses that consumed it.

When a dataset is promoted, the manifest records what it is. But there is often nothing recording the version of the dataset used by a given analysis. That link lives in a programmer's head, or in a comment in a SAS/R script, or nowhere.

When a correction arrives two years later and someone needs to know which analyses need to be rerun, the answer should come from a query. Not from emailing five people and hoping someone remembers.

Building that link at the start is a small amount of work. Reconstructing it later, for a study that has been running for two years, is not.
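To make "the answer should come from a query" concrete, here is a minimal sketch using SQLite from the Python standard library. The table name, the columns, and the function names are assumptions. A real registry would more likely live in whatever database the environment already validates.

```python
import sqlite3
from typing import List


def open_registry(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a registry of which analysis read which dataset version."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS analysis_inputs (
               analysis_id TEXT NOT NULL,
               dataset_key TEXT NOT NULL,
               version_id  TEXT NOT NULL,
               PRIMARY KEY (analysis_id, dataset_key, version_id)
           )"""
    )
    return conn


def record_use(conn: sqlite3.Connection, analysis_id: str,
               dataset_key: str, version_id: str) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO analysis_inputs VALUES (?, ?, ?)",
            (analysis_id, dataset_key, version_id),
        )


def analyses_using(conn: sqlite3.Connection, dataset_key: str,
                   version_id: str) -> List[str]:
    """List analyses that consumed a given version, e.g. one just superseded."""
    rows = conn.execute(
        "SELECT analysis_id FROM analysis_inputs "
        "WHERE dataset_key = ? AND version_id = ? ORDER BY analysis_id",
        (dataset_key, version_id),
    )
    return [row[0] for row in rows]
```

When a correction supersedes a version, `analyses_using` with the old version identifier returns the list of analyses to review for a rerun.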

Building compliance into clinical data storage from the start

None of these decisions are technically difficult. Separate buckets, narrow IAM roles, Object Lock on promotion, a manifest with every file. Each one is straightforward. What makes them hard is applying them consistently, before the pressure to move fast arrives.

If you get the layers right, the compliance evidence mostly takes care of itself. The audit trail exists because the system built it, not because someone remembered to write it down.

The result is a system that is both compliant and usable. Statisticians, programmers, and other end users can focus on the development work, while the inputs and outputs stay audit-ready.

This is part of our Data in the SCE series on building compliant, usable clinical data infrastructure. See also session isolation in a GxP environment, on running Posit Workbench on Kubernetes. For the wider picture, see our guide to the modern statistical computing environment or the Modern SCE for Pharma ebook. To see this kind of architecture in production, read how we built GxP data workflows for a pharma client. If you are working through similar storage and compliance decisions and want a second pair of eyes, talk to our pharma team.