In pharma and biotech, the Statistical Computing Environment (SCE) is the engineered platform where regulated analytical work runs. It's where biostatisticians and statistical programmers turn clinical trial data into the analyses, tables, listings, and figures that support submissions to the FDA, EMA, PMDA, and other regulators, all under GxP requirements. The question is how to make this data available for analytical work.
Data is where most SCE programs quietly struggle. Clinical data lands from a long list of sources: EDC (electronic data capture) systems, central labs, imaging vendors, ePRO (electronic patient-reported outcome) platforms, and multiple CROs, each interpreting CDISC (Clinical Data Interchange Standards Consortium) standards slightly differently and breaking automation in slightly different ways. Most organizations still manage all of this as flat files in folder structures, even though the volume and the audit expectations have moved on. Data engineering has become the foundation of the modern SCE: standardizing incoming data, orchestrating ETL/ELT between storage and compute, and maintaining lineage from source to final output.
That's why this series starts with data delivery, and specifically with SFTP. It's the protocol most CROs and external vendors use to hand off clinical data. Get the setup right and downstream work gets clean data. Get it wrong and your statisticians inherit a fragile shared folder that breaks every time a new vendor onboards. Later pieces will cover what happens after the data lands.
SFTP has outlived every protocol that was supposed to replace it. APIs were going to handle the structured integration. Managed file transfer platforms were going to handle the workflow and audit. Cloud-native services were going to handle the scale. None of them displaced SFTP for clinical data transfers, and the conversation with a new CRO or lab vendor still ends the same way: "just give us an SFTP endpoint."
What follows is for the people who design and run that infrastructure. If you're on the biometrics side, jump to "What biometrics teams actually need from SFTP" further down.
Vendors already have SFTP clients. Their SOPs reference SFTP. Their IT teams know how to allowlist an IP and configure a connection in WinSCP. Asking them to adopt a new transfer protocol means revalidation, updated SOPs, maybe a new tool purchase, and months of back and forth with their own quality teams. For a file that needs to arrive yesterday, none of that is realistic.
The question isn't whether to use SFTP. That call has already been made for you. What you decide is whether you run it the way most organizations do, or whether you build something the auditors and the data management team can both live with.
The usual setup and where it gets complicated
The traditional approach is a VM running OpenSSH somewhere, SSH keys managed out-of-band, files landing on a mounted volume that someone monitors. SFTP itself handles transport integrity through the SSH protocol. The bytes that arrive are the bytes that were sent, and OpenSSH logs every session. None of that is broken. The protocol works, the data is intact, and with enough discipline you can satisfy 21 CFR Part 11.
The complicated part shows up in operations. Authentication means local OS users for each vendor, keys rotated by hand, and a process for revoking access that depends on someone raising a ticket and someone else handling it. The audit trail exists in syslog but isn't structured for the questions auditors ask. Proving a specific file arrived intact at a specific time means correlating sshd logs, filesystem timestamps, and whatever checksumming you bolted on afterwards. Onboarding a new vendor takes days. High availability (HA) means running a second instance behind a load balancer with a shared filesystem underneath, and now you own the failover behavior, the filesystem consistency, and whatever happens when the two instances disagree about who holds a session.
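To make that concrete, here's a minimal sketch of the kind of bolt-on checksumming this approach implies. The inbound directory and manifest path are hypothetical, and the sshd log correlation auditors actually ask about would still live in a separate script:

```python
#!/usr/bin/env python3
"""Sketch of the bolt-on checksum manifest a self-managed SFTP box ends up needing.

The paths are hypothetical; correlating these entries with sshd session logs
is still a manual step on top of this.
"""
import hashlib
import json
import time
from pathlib import Path

INBOUND = Path("/data/inbound")          # hypothetical landing directory
MANIFEST = Path("/data/manifest.jsonl")  # hypothetical append-only manifest


def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()


def record_arrivals() -> None:
    seen = set()
    if MANIFEST.exists():
        seen = {json.loads(line)["path"] for line in MANIFEST.read_text().splitlines()}
    with MANIFEST.open("a") as out:
        for f in sorted(INBOUND.rglob("*")):
            if f.is_file() and str(f) not in seen:
                entry = {
                    "path": str(f),
                    "sha256": sha256(f),
                    "size": f.stat().st_size,
                    "mtime": f.stat().st_mtime,
                    "recorded_at": time.time(),
                }
                out.write(json.dumps(entry) + "\n")


if __name__ == "__main__":
    record_arrivals()  # typically run from cron; someone still owns keeping it running
```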
It works. It just demands that someone, consistently, does all the supporting work: rotating keys on schedule, running the offboarding checklist, keeping the audit correlation scripts in sync with what auditors are asking for this year. Most teams don't sustain that. The audit story takes real effort to produce on demand rather than being a side effect of the system running.
Four ways to run SFTP in pharma
Once you decide the operational tax of self-managed SFTP isn't worth paying, there's a real spread of options:
- Self-managed SFTP on Kubernetes or Amazon ECS (Elastic Container Service). Containerize OpenSSH, back it with EFS (Elastic File System) or an S3-mounted filesystem, run it behind an NLB (Network Load Balancer). Full control, full ops burden, and you still own the audit story.
- Managed file transfer products. MOVEit, GoAnywhere, Globalscape, Cerberus FTP. The incumbents in regulated industries. Workflow engines, vendor portals, and audit trails come built in. Expensive, often per-connection licensed, and they add another product to your validation scope. If your compliance team already knows one of them, the path of least resistance is real.
- SFTP-as-a-service. Couchdrop, SFTP To Go, Files.com. Fast to stand up, nothing to operate, but you're handing clinical data to a third party. Usually a non-starter for GxP data, and a vendor risk assessment in any case.
- AWS Transfer Family. Managed SFTP endpoint backed by S3, with pluggable auth via AWS Lambda. You give up some flexibility in exchange for not running the server. The S3 backend is what makes it worth picking, because it lets you build the compliance pipeline around it using primitives you already use.
On a recent pharma platform build, we went with AWS Transfer Family, the managed SFTP service backed by S3. The S3 integration meant Object Lock, lifecycle policies, cross-region replication, and event-driven processing came for free, and the team was already deep in AWS for everything else. The lessons below are specific to that choice. Most of the design patterns (three-bucket architecture, immutability via Object Lock, structured audit logs, identity-provider-backed auth) translate to the other options with some adaptation.
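As a rough illustration of the event-driven piece, here's what the glue between the landing and audit sides can look like. The bucket names and audit key convention below are placeholders for illustration, not the ones from that build:

```python
"""Minimal sketch of the event-driven piece of a three-bucket layout.

Bucket names and the audit record layout are assumptions; a real pipeline
would add error handling, larger-file multipart copies, and notifications.
"""
import hashlib
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

VALIDATED_BUCKET = "sce-validated"  # hypothetical
AUDIT_BUCKET = "sce-audit"          # hypothetical


def handler(event, context):
    """Triggered by s3:ObjectCreated:* on the landing bucket."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Re-hash the object so the audit record is independent of the uploader.
        obj = s3.get_object(Bucket=bucket, Key=key)
        digest = hashlib.sha256(obj["Body"].read()).hexdigest()

        # Promote the file to the validated bucket, where Object Lock applies.
        # copy_object is fine for typical transport-file sizes; very large files
        # would need a multipart copy instead.
        s3.copy_object(
            Bucket=VALIDATED_BUCKET,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
        )

        # Structured audit record: what arrived, when, and its checksum.
        audit = {
            "source_bucket": bucket,
            "key": key,
            "sha256": digest,
            "size": record["s3"]["object"]["size"],
            "received_at": record["eventTime"],
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        }
        s3.put_object(
            Bucket=AUDIT_BUCKET,
            Key=f"transfers/{key}.json",
            Body=json.dumps(audit).encode(),
            ContentType="application/json",
        )
```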
How the AWS pipeline maps to 21 CFR Part 11
If you work in GxP pharma, every one of these design choices traces back to a regulatory requirement.
Object Lock in COMPLIANCE mode satisfies 21 CFR Part 11's requirement that electronic records cannot be altered or deleted. The seven-year retention period matches FDA record retention guidelines. The audit bucket with its own Object Lock and cross-region replication to a second AWS region gives you the "readily retrievable copies" the regulation asks for.
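A minimal sketch of that configuration, with placeholder bucket and region names. Object Lock has to be enabled when the bucket is created, and COMPLIANCE-mode retention cannot be shortened or removed afterwards:

```python
"""Sketch of enabling Object Lock with a seven-year COMPLIANCE default retention."""
import boto3

s3 = boto3.client("s3")

BUCKET = "sce-validated"  # hypothetical

# Object Lock can only be turned on at bucket creation.
# The region is an example; match it to your client's region (omit the
# CreateBucketConfiguration block entirely for us-east-1).
s3.create_bucket(
    Bucket=BUCKET,
    ObjectLockEnabledForBucket=True,
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)

# Default retention: every new object version is locked for seven years.
s3.put_object_lock_configuration(
    Bucket=BUCKET,
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Years": 7}},
    },
)
```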
The structured audit logs (who uploaded what, when, checksums, transfer status) feed directly into your change control documentation. When an auditor asks "prove this file was not modified after receipt," you point to the S3 object metadata, the Object Lock retention timestamp, and the audit log entry with matching checksums. Nobody greps through syslog.
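Here's roughly what answering that question looks like in code, assuming the audit record layout from the sketch above; the key convention is illustrative:

```python
"""Sketch of answering "prove this file was not modified after receipt"."""
import hashlib
import json

import boto3

s3 = boto3.client("s3")


def prove_integrity(validated_bucket: str, audit_bucket: str, key: str) -> dict:
    # Object Lock metadata on the stored object: mode and retain-until date.
    head = s3.head_object(Bucket=validated_bucket, Key=key)

    # Independent re-hash of the bytes as stored today.
    body = s3.get_object(Bucket=validated_bucket, Key=key)["Body"].read()
    current_sha256 = hashlib.sha256(body).hexdigest()

    # The audit record written when the file arrived.
    audit = json.loads(
        s3.get_object(Bucket=audit_bucket, Key=f"transfers/{key}.json")["Body"].read()
    )

    return {
        "key": key,
        "object_lock_mode": head.get("ObjectLockMode"),
        "retain_until": str(head.get("ObjectLockRetainUntilDate")),
        "sha256_at_receipt": audit["sha256"],
        "sha256_now": current_sha256,
        "unchanged": audit["sha256"] == current_sha256,
    }
```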
CloudWatch alarms on auth failures, AWS Lambda errors, and unusual upload volumes give you the monitoring layer GxP audits require. SNS (Simple Notification Service) notifications mean the data management team knows within minutes when a transfer succeeds or fails, rather than finding out at the next status meeting.
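One of those alarms, sketched with placeholder names for the function and topic:

```python
"""Sketch of a single alarm: Lambda errors in the transfer pipeline -> SNS."""
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="sce-transfer-lambda-errors",  # hypothetical
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "sce-transfer-processor"}],  # hypothetical
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    # Placeholder topic ARN; the data management team subscribes to this topic.
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:sce-transfer-alerts"],
)
```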
What biometrics teams actually need from SFTP
The biometrics team doesn't care about any of this infrastructure, and they shouldn't have to. What they care about: did the vendor's data arrive, is it the right file, and can they trust that nobody tampered with it between the vendor's system and theirs.
A properly built pipeline answers those questions without them having to ask. Files arrive in a known location with a predictable path structure. Checksums are verified at the protocol level. Immutability is enforced by S3, not by someone's discipline. And there's a paper trail for every transfer that will still be there in seven years when someone asks about it.
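The practical version of "did the vendor's data arrive" becomes a short query against that predictable path structure. The vendor/study/date key convention here is an assumption, not a prescription:

```python
"""Sketch of the question the pipeline should answer directly: what arrived today?"""
from datetime import date

import boto3

s3 = boto3.client("s3")


def arrivals_today(bucket: str, vendor: str, study: str) -> list[str]:
    # Assumed key convention: <vendor>/<study>/<YYYY-MM-DD>/<filename>
    prefix = f"{vendor}/{study}/{date.today():%Y-%m-%d}/"
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```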
Compare that to the spreadsheet someone maintains manually to track which files arrived when, cross-referenced with emails from the vendor confirming they sent the right version. That spreadsheet is what breaks at 2 AM before a submission deadline.
Tradeoffs worth knowing about
ROPC (Resource Owner Password Credentials) authentication is incompatible with MFA (multi-factor authentication). Vendor user accounts have to be excluded from MFA enforcement for the flow to work at all. That was already a security tradeoff worth documenting: you're betting that network controls (security groups, CIDR allowlisting) plus session policy isolation are sufficient compensating controls. It's also now a deprecation timeline. Microsoft is rolling out mandatory MFA enforcement across Entra tenants and explicitly steering customers away from ROPC, with the Phase 2 postponement window closing July 1, 2026. If you build on this pattern today, plan the migration path: SSH key auth, a different identity layer in front of Transfer Family, or a managed file transfer product that brokers auth differently. Worth having that conversation with your security team now rather than when the enforcement deadline hits.
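For reference, the ROPC flow in question sits inside the Transfer Family custom identity provider Lambda. A stripped-down sketch follows, with placeholder tenant, client, role, and bucket values, and without the CIDR checks and structured logging a real one needs:

```python
"""Sketch of a Transfer Family custom identity provider Lambda using Entra ID ROPC.

All identifiers below are placeholders. ROPC only works for accounts excluded
from MFA, which is exactly the tradeoff discussed above.
"""
import json
import urllib.error
import urllib.parse
import urllib.request

TENANT_ID = "00000000-0000-0000-0000-000000000000"   # placeholder
CLIENT_ID = "11111111-1111-1111-1111-111111111111"   # placeholder
TRANSFER_ROLE_ARN = "arn:aws:iam::123456789012:role/sce-transfer-access"  # placeholder
LANDING_BUCKET = "sce-landing"                        # placeholder


def handler(event, context):
    username = event["username"]
    password = event.get("password", "")
    if not password:
        return {}  # empty response = deny (this sketch ignores SSH key auth)

    # ROPC: exchange the vendor's username/password for a token with Entra ID.
    body = urllib.parse.urlencode({
        "client_id": CLIENT_ID,
        "grant_type": "password",
        "scope": "openid",
        "username": username,
        "password": password,
    }).encode()
    req = urllib.request.Request(
        f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            json.loads(resp.read())  # token is only used as proof of authentication
    except urllib.error.HTTPError:
        return {}  # deny on any auth failure

    # Chroot the vendor into their own prefix in the landing bucket.
    vendor = username.split("@")[0]
    return {
        "Role": TRANSFER_ROLE_ARN,
        "HomeDirectoryType": "LOGICAL",
        "HomeDirectoryDetails": json.dumps(
            [{"Entry": "/", "Target": f"/{LANDING_BUCKET}/{vendor}"}]
        ),
    }
```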
Object Lock in COMPLIANCE mode is irreversible. If someone uploads a 50-GB test file with a seven-year retention, you're paying to store it for seven years. There is no admin override, not even for root. Keep your dev environment retention short or disabled entirely.
Transfer Family isn't cheap either. The per-protocol, per-hour endpoint cost adds up, especially across multiple environments. It's worth weighing against the operational burden of maintaining the EC2 alternative, but don't leave it out of the budget.
And SFTP itself has limits. No transfer resumption, no parallelism, no metadata beyond the filename. For large genomics datasets, you'll eventually outgrow it. But for the CSV and SAS transport files that make up most clinical data transfers, it works, and the vendor ecosystem knows how to use it.
Where this is going
SFTP isn't going anywhere. The protocol is good enough, the vendor ecosystem is built around it, and the switching costs dwarf the benefits. The real work is making the infrastructure around it compliant, automated, and boring enough that nobody has to think about it.
This is the first piece in our Data in the SCE series. Future posts cover ingestion, validation, lineage, and storage patterns. For a broader architectural overview, see our Modern SCE for Pharma ebook. If you're working through similar SCE infrastructure decisions and want a second pair of eyes, talk to our pharma team.

