Least privilege for clinical data: access control that holds up to an audit

Reading time:
time
min
By:
Michał Janusz
June 23, 2026

An auditor's question is rarely about your architecture. It's about a person and a date. Who could read patient-level data for this study between these two dates, and how do you know? If the honest answer involves opening a spreadsheet someone maintains by hand, cross-referencing it against a list of group memberships, and then admitting that the spreadsheet might be out of date, the problem doesn't lie in access control. It lies in the evidence.

Access control in a pharma compute environment tends to drift toward one of two failure modes. The first is too permissive: anyone with a workspace account can read any dataset on the platform, because scoping access per study seemed like more work than it was worth on day one, and nobody went back to fix it. The second is too restrictive in a way that doesn't actually help: elaborate access-control lists that nominally enforce least privilege but are too painful to navigate. In the end, analysts request blanket access just to get their work done and whoever approves those requests stops tracking them. Both approaches fail the auditor's question: the permissive model because the answer is "everyone," the restrictive model because the real access pattern doesn't match the documented one.

This post is about the third option: access control for clinical data that is genuinely least privilege, genuinely auditable, and genuinely usable at the same time. Those three properties are harder to hold together than they sound, and most of the engineering is in the seams between them.

The short version

  • Scope access to studies, not job titles
  • Treat access control and access audit as separate problems
  • Use time-boxed roles for temporary access, not standing IAM edits
  • Generate the access review from live state, not a spreadsheet

What the regulations actually require

It helps to be precise about what the rules ask for, because the requirement is narrower and more concrete than "be secure."

21 CFR Part 11 governs electronic records and electronic signatures. For access, two clauses do most of the work. The system must limit access to authorized individuals, and it must keep a secure, computer-generated, time-stamped audit trail of actions taken on records, including who did what, and when. Good Clinical Practice layers on the expectation that you can demonstrate data integrity throughout a study's life: that the people who touched patient data were supposed to do that, and that you can prove it after the fact.

Read those together and you get two distinct obligations that are easy to conflate. One is access control: restricting who can reach a record. The other is access audit: an immutable trail of who actually did reach it. A system can satisfy the first and completely fail the second. You can have airtight IAM policies and still be unable to answer the auditor, because nothing was logging read actions at the object level. The reverse is also true: you can log everything and still have granted half the company standing access to data they never needed.

For a cloud storage layer, this turns into a small set of design questions. How is identity established? How is authorization scoped, and to what granularity? Where do read events get recorded, for how long, and can anyone tamper with that record? Everything below is an answer to one of those.

Study-scoped roles, not platform-wide ones

The single most consequential decision is the unit that access attaches to. The instinct is to define roles by job function ("biostatistician," "data manager," "reviewer") and grant each role broad access to the platform's data. It reads cleanly in a policy document. It also means a biostatistician assigned to Study A can read Study B's datasets, because the role isn't scoped to a study; it's scoped to a title.

The model that survives an audit attaches access to the study, not the person's job. Data lives under a per-study prefix in the storage layer, and the roles that can read it are defined per study. A typical clinical bucket is organized in a way that each study (or asset, for a program spanning several protocols) owns a prefix, and standard folders sit beneath it: raw data, programs, documents, working areas. Authorization is then expressed against that prefix and nothing else.

In practice that means the policy granting access to a study reaches exactly one path and stops:

{
  "Effect": "Allow",
  "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
  "Resource": "arn:aws:s3:::clinical-data/studyA/*"
}

A person working on Study A and Study B holds two such grants. A person on Study A holds one. Nobody holds a grant that says "all studies," because no such grant is ever written. The scoping is structural, not a matter of remembering to be careful. When Study A closes, the grant is removed and the access is gone, atomically, rather than living on inside a broad role that nobody wants to touch for fear of breaking something else.

Identity itself should come from your organization's directory, federated in through single sign-on, rather than from long-lived credentials minted per user. Users authenticate against the corporate identity provider, assume a role mapped to their entitlements, and operate with temporary credentials that expire. This matters for the audit story as much as the security one: when access is a role assumption rather than a static key, every session has a start, an end, and an identity attached to it, and there are no orphaned credentials sitting in a config file to explain away.

This also fixes the joiners-movers-leavers problem that breaks most manual access models. When platform access flows from the corporate identity provider, deprovisioning someone from the IdP atomically removes their reach into the platform: no orphaned tokens to chase, no "we'll get to it next quarter" cleanup, no stale entries in the access review six months after someone left. The same federation that gives you a clean session-by-session audit trail also gives you a clean off-boarding story.

A useful piece of defense in depth sits underneath all of this: a bucket policy that denies data access unless the request originates from inside the platform's own network. Even a principal holding a valid grant cannot read an object from the open internet or a VPN connection; the read has to come from the compute environment running inside the trust boundary. It costs almost nothing to add and it closes the gap where a leaked credential is used from somewhere it shouldn't be.

Access control and access audit are two different problems

Having decided who can read, you still have to record who did. These are separate systems with different properties, and clinical compliance needs both.

On AWS there are two ways to capture object reads, and the difference between them is not cosmetic. S3 server access logging writes a best-effort log of requests to a bucket. It's cheap, but it's delivered on a delay, it can drop records, and the log itself lands in another bucket you then have to protect separately. CloudTrail data events record object-level operations as structured, tamper-evident events with a guaranteed identity on each one: the principal, the time, the object, the source. They cost more, because you're paying per data event and patient-level datasets get read a lot, but they are the records that answer the auditor cleanly. For the data that matters, the cost difference is the price of being able to say "here is the exact immutable list generated by the system."

The other half of the audit obligation is retention. Part 11 records have to outlive the study by a wide margin: seven years is the working figure most teams design to. That means the audit trail can't live only in a 90-day operational log; it has to flow into long-term, immutable storage with retention measured in thousands of days, encrypted, and write-protected against the very administrators who run the platform.

Done this way, you filter the object-level audit store for reads against the study's prefix, inside the date range, and read off the principals:

SELECT eventTime, userIdentity.arn, requestParameters.key
FROM cloudtrail_data_events
WHERE eventName = 'GetObject'
  AND requestParameters.bucketName = 'clinical-data'
  AND requestParameters.key LIKE 'studyA/%'
  AND eventTime BETWEEN '2026-01-01' AND '2026-03-31'
ORDER BY eventTime;
The auditor's question stops being an investigation and becomes a query.

The answer is a result set, not a meeting where you can guess and clarify the right record. That is the entire point.

Time-limited access without bespoke IAM changes

The hardest access requests are the temporary ones, and they're the ones most likely to leave a mess. A consultant needs to look at one study's outputs for two weeks. An external auditor needs read access to a specific dataset for the duration of an inspection. A CRO needs to pull a deliverable. The wrong way to handle these is to edit a standing IAM role each time, because every edit is a change you have to remember to reverse, and the reversal is exactly the step that gets skipped under deadline pressure. Six months later the consultant who left in March still appears in the access review, and nobody can say why.

Two mechanisms handle this without permanent change. The first is STS assume-role with a short session: the temporary user is allowed to assume a study-scoped role, the session is capped at a few hours, and when it expires the access is simply gone. There is nothing to clean up because nothing standing was granted. Entitlement to assume the role can itself be time-boxed and tied to the federated identity, so the grant and its expiry are both recorded. The second is a pre-signed URL: a single object, a single recipient, an expiry measured in minutes or hours, no role assumption at all. It's the right tool when someone needs one file and nothing more, and it's the wrong tool the moment "one file" turns into "browse the folder," because a pre-signed URL has no concept of a session or a directory.

The trade-off between them is about scope and traceability. Assume-role gives you a real session with a real identity in the audit trail, suited to interactive work; pre-signed URLs are lighter but attribute to whoever generated the link unless you're careful, so they suit one-shot handoffs rather than ongoing access. Neither one requires touching the standing entitlements, which is what keeps the access review honest.

One related case is worth naming: break-glass access. Every regulated platform needs a documented escape hatch for incident response and genuine emergencies, but it isn't "admin access by another name." It's a separately documented identity, used by a specifically named small group, with explicit alerting on every assumption and a written post-incident review attached to each use. The audit signal is meant to be loud: that's the whole point. Treat it as an instrument of last resort, not a workaround for everyday access friction.

Not all clinical data needs the same tier

Treating every dataset on the platform identically is a quiet mistake, because it's wrong in two directions at once. The most sensitive data is under-protected relative to its risk, and the least sensitive is wrapped in friction that drives people to work around it.

Patient-level datasets (SDTM-domain data with individual subject records) are the high-watermark. They carry the most re-identification risk and they deserve the tightest scoping, the per-object audit trail, and the narrowest list of principals who can ever assume a role that reaches them. Analysis datasets, the ADaM tables derived from them, still describe individuals but are a step removed; many analysts need them routinely, and the access tier should reflect that they're a working surface, not a vault. The final outputs (tables, listings, and figures) are aggregated by construction; the people who need to review a TLF are not the same, smaller set who need raw subject data, and forcing them through the same gate produces requests for broad access that erode the whole model.

The cleaner design gives each tier its own access path and, where it's warranted, its own encryption boundary: separate keys for the most sensitive data so that access to the key is a second, independent control on top of the bucket policy. Tiering isn't about adding more locks. It's about putting the heavy locks where the risk actually is, and not everywhere else, so that the controls stay credible and people stop trying to defeat them.

Reviewing access without a spreadsheet

Periodic access review is a standing requirement, quarterly for most programs, and per-study at close-out. The common failure is that the review runs against a manually maintained record of who should have access, rather than the system's record of who does. Those two drift apart the moment anyone makes a change outside the documented process, and the review then certifies a fiction.

The fix is to generate the review from live state. The authorization model already lives as policy: role definitions, the principals entitled to assume each role, the prefixes each role can reach. That is a queryable source of truth. A report built from it answers "who can currently read Study A's patient-level data" by reading the actual policy graph, not a description of it. Pair that with the object-level audit trail and the review gains a second dimension: not just who can read, but who has been reading it, over the review period. Standing access that was never exercised is a candidate for removal; access that was exercised by someone who shouldn't have it is a finding you want to surface yourself, before an auditor does.

When the review is generated rather than transcribed, it also becomes cheap enough to run often, and a review you run monthly catches drift that a yearly one institutionalizes.

Common objections

Three questions come up almost every time this design is discussed.

Doesn't all of this slow down analysts?

No. Study-scoping is invisible to the analyst: they open a session and the studies they're assigned to are simply available, with no extra clicks. The friction sits where it should, on access provisioning, where someone is asking for access to data they don't yet have. That friction is the control; the goal is to make it deliberate and short-lived, not to remove it.

What about cross-study analysis like pooled safety or integrated summaries?

A separate, named role grants time-boxed access to a defined set of studies, with its own audit trail. The pattern is the same as everywhere else: scope is explicit, expiry is built in, and the audit can answer "who could see the integrated data, and when." Cross-study access isn't an exception to least privilege; it's a different scope, written down as a different role.

How is this different from Active Directory groups?

AD groups (or their equivalent in your IdP) handle identity: who someone is, what team they belong to, what they're authorized to claim. This post is about resource-scoped authorization: what a session can reach inside the platform. Identity is upstream; the bucket policy and role mapping are downstream. The two layers complement each other. The IdP says "this person is on the Study A team," and the platform's authorization model translates that into "this session can read clinical-data/studyA/*." Without both, you have either an identity store with no reach, or a permission system you can't tie back to people.

How this lands in the platform

None of this is abstract by the time a user touches it. It resolves into a few concrete points in a working environment.

When an analyst opens a session in Posit Workbench, that session runs in an isolated compute context that assumes an execution role, and that role, not the user's broad workspace login, is what determines which study prefixes the session can read from storage. The workspace decides who you are; the assumed role decides what this session can reach. Session isolation and study-scoped IAM are the same control viewed from two sides: the isolation keeps one user's session from reaching into another's, and the execution role keeps any session from reaching data outside its study's scope.

The result is that least privilege isn't a policy bolted onto the platform after the fact. It's the path of least resistance through it. An analyst gets exactly the studies they're assigned to, with no ceremony, because the role they assume is already scoped. They never see Study B, not because a rule warns them off it, but because the credential their session holds was never able to reach it. And every read along the way wrote a record that no one can quietly edit.

That's what holds up to an audit. Not a binder of policies describing intentions, but a system where the access a person had was the access they needed, the access they used is recorded in a trail they couldn't tamper with, and the answer to who could read this study's data, and when is a query that returns a list. The clinical data was protected the whole time. Being able to prove it, on demand, without an investigation, is the part that turns a good architecture into one that an auditor signs off.

Designing an SCE that handles this end-to-end is what Appsilon does. If you're building one, or auditing one, talk to us, or read how we approach SCE for pharma and biotech.

Michał Janusz

Platform Engineer

The AI-Ready SCE: A Pharma Guide to Building Modern Environments

Explore Possibilities

Share Your Data Goals with Us

From advanced analytics to platform development and pharma consulting, we craft solutions tailored to your needs.