Not just SFTP: why your data delivery should speak every vendor's language
SFTP is the default for a reason. It’s what most vendors are already set up for, compliance teams trust it, and the automation around it is well understood. But locking your data delivery layer to a single protocol is a design decision you’ll eventually regret. (We’ve unpacked why SFTP still dominates pharma data transfers earlier in this series.)
Clinical data doesn’t come from one kind of source:
- CROs upload SDTM datasets over SFTP because that’s what their SOPs say.
- Central labs have APIs that can push results the moment they’re finalized.
- Imaging vendors deal with files too large for SFTP to handle efficiently.
- EDC platforms export directly to cloud storage.
Each of these is a legitimate delivery path, and the platform that forces every one of them through the same narrow pipe is the platform that creates workarounds.
What follows is how to build a data delivery layer that accepts data in whatever way vendors can send it, without sacrificing the compliance properties (audit trails, immutability, identity attribution) that make the whole system trustworthy. The protocol changes, but the verification doesn’t.
The protocol is a detail, not the architecture
Earlier in this series, we described a specific pattern: files land in a landing bucket, an S3 event fires, a Lambda function processes the file (with checksums, metadata extraction, audit logging), and the data gets promoted to the clinical data bucket with Object Lock. Notifications fire, failures quarantine, and everything is traceable.
Nothing in that pattern depends on SFTP. The pipeline triggers on an S3 event. It doesn’t care how the object got there — whether a vendor uploaded it via Transfer Family, an API integration wrote it, or a partner pushed it directly to S3 with scoped credentials. The moment the object lands, the same processing chain fires — same checksum, same audit log entry, same Object Lock on the clinical data bucket, same notification to the data management team.
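As a rough illustration of why the transport doesn’t matter, here is a minimal sketch of the processing step written against the S3 event alone. The `read_object` callable and the audit entry field names are assumptions made for this example, not the schema used earlier in the series; in a real Lambda function `read_object` would wrap a streaming read from S3.

```python
import hashlib
from datetime import datetime, timezone
from urllib.parse import unquote_plus


def sha256_of(chunks):
    """Hash an iterable of byte chunks without loading the whole file."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def audit_entries(event, read_object):
    """Build one audit entry per object in an S3 event notification.

    `read_object(bucket, key)` is a hypothetical callable that yields the
    object's bytes in chunks.
    """
    entries = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        entries.append({
            "bucket": bucket,
            "key": key,
            "event": record.get("eventName"),
            "sha256": sha256_of(read_object(bucket, key)),
            "processed_at": datetime.now(timezone.utc).isoformat(),
        })
    return entries
```

Nothing in the function inspects how the object was written, which is the point: an SFTP upload, an API write, and a direct upload all produce the same kind of event record.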
That’s the architectural decision that matters more than any protocol choice. When the event-driven pipeline is the backbone, adding a new delivery channel means adding a new way to write objects to the landing bucket. It doesn’t mean building a new pipeline. The ingestion logic, the audit log schema, the quarantine flow, the retention enforcement: all of it stays untouched. You’re extending the front door, not renovating the house.
The delivery layer is protocol-agnostic by construction. SFTP is the default because it covers most vendor handoffs, but the architecture is designed to support additional entry points as delivery needs grow. The compliance layer sits below the protocol layer, not inside it.
When vendors bring something other than SFTP
Most of the time, SFTP is the right answer. It’s what most vendors are already set up for, their IT teams can configure it in an afternoon, and the compliance story is well understood. But there are real scenarios where other protocols genuinely fit better, and your platform should be able to handle them without architectural changes.
API-based delivery makes sense when a vendor’s system can push data programmatically. Central labs, EDC platforms, and some larger CROs offer APIs that can deliver results as soon as they’re ready, rather than waiting for a batch upload window. A lightweight integration layer (a service or Lambda function) receives the API payload, validates the request, and writes the data to the same landing bucket. From that point forward, the processing is identical to an SFTP upload. The API caller authenticates with scoped credentials tied to a specific vendor identity, so source attribution follows the same path as any other authenticated write to the landing bucket. The pipeline doesn’t branch, the audit trail doesn’t branch — only the transport changes.
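The integration layer described above can be quite small. The sketch below assumes the vendor sends a SHA-256 checksum alongside the payload and that objects land under a `<vendor_id>/<filename>` key; both are illustrative choices, and `write_object` is a hypothetical stand-in for the actual write to the landing bucket.

```python
import hashlib
import re

# Conservative allow-list for identifiers and file names (an assumption).
SAFE_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


class RejectedDelivery(ValueError):
    """Raised when an API delivery fails validation before any write."""


def accept_delivery(vendor_id, filename, body, claimed_sha256, write_object):
    """Validate an API payload and write it under the vendor's prefix.

    `vendor_id` should come from the authenticated credentials, never from
    the request body. `write_object(key, body)` is a hypothetical callable
    standing in for the write to the landing bucket.
    """
    if not SAFE_NAME.fullmatch(vendor_id) or not SAFE_NAME.fullmatch(filename):
        raise RejectedDelivery("vendor id or filename has unsafe characters")
    actual = hashlib.sha256(body).hexdigest()
    if actual != claimed_sha256.lower():
        raise RejectedDelivery("payload does not match the claimed checksum")
    key = f"{vendor_id}/{filename}"
    write_object(key, body)
    return key
```

Rejecting the request before anything is written keeps malformed deliveries out of the landing bucket, while everything that passes is handled by the same pipeline as an SFTP upload.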
Direct S3 upload works for trusted partners who can handle it properly. Some vendors, particularly large CROs or technology-forward labs, are comfortable with cloud-native tooling. Instead of routing through an SFTP endpoint, they get temporary, scoped credentials (STS-issued, time-limited, prefix-restricted) and upload directly to their namespace in the landing bucket. This skips the Transfer Family layer entirely, which matters when files are large. Genomics data, high-resolution imaging, or bulk historical data loads, any of which could overwhelm a single SFTP session, transfer much faster over direct S3 with multipart upload. The same IAM policies, namespace conventions, and path structures from the vendor onboarding pattern apply. The vendor writes to the same prefix they would via SFTP; the pipeline doesn’t know the difference.
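One way to express “time-limited, prefix-restricted” is an STS session policy attached when the vendor’s role is assumed. The sketch below only assembles the request; the role ARN, session naming, and prefix layout are assumptions for illustration.

```python
import json


def session_policy(bucket, vendor_prefix):
    """Return a JSON session policy limited to uploads under one prefix."""
    prefix = vendor_prefix.strip("/")
    if not prefix:
        raise ValueError("vendor prefix must not be empty")
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            # PutObject covers single-part and multipart uploads;
            # AbortMultipartUpload lets the vendor clean up failed ones.
            "Action": ["s3:PutObject", "s3:AbortMultipartUpload"],
            "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
        }],
    })


def assume_role_request(role_arn, vendor_id, bucket, vendor_prefix, seconds=3600):
    """Assemble keyword arguments for an STS AssumeRole call."""
    return {
        "RoleArn": role_arn,
        "RoleSessionName": f"upload-{vendor_id}",
        "DurationSeconds": seconds,
        "Policy": session_policy(bucket, vendor_prefix),
    }
```

With boto3, the resulting dictionary could be passed as `sts_client.assume_role(**request)`. A session policy can only narrow what the role already allows, so the role itself still needs write access to the landing bucket.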
Managed file transfer platforms fill a different gap. Some organizations already run products like Axway, GoAnywhere, or IBM Sterling that handle file routing, scheduling, and audit logging across the enterprise. Rather than replacing these with a custom pipeline, the integration point is simple: the platform deposits files into the landing bucket, and the event-driven pipeline takes over. The compliance posture actually gets stronger here, not weaker, because you end up with two independent audit trails (the managed transfer platform’s and the pipeline’s) covering the same delivery event.
None of these requires a new pipeline. Each option needs only a new front door to the same landing bucket. The validation layer, the audit trail, the Object Lock, the notifications: all of it stays the same regardless of how the file arrived. That’s what protocol-agnostic means in practice. And each new delivery channel goes through the same validation and onboarding rigor as the first one, including documented testing before production use.
What holds this together
The compliance properties are enforced at the storage and processing layer, not at the transport layer. That’s the structural reason this works without becoming a maintenance problem.
Identity attribution comes from whatever authentication mechanism the protocol uses:
- For SFTP, it’s Transfer Family’s auth logs.
- For API delivery, it’s the API credentials and request signing.
- For direct S3, it’s the IAM role or STS session that performed the upload.
Each of these produces a different kind of authentication event, but they all feed into the same compliance posture: CloudTrail, with S3 data events enabled for the landing bucket (object-level writes are not logged by default), captures the authenticated identity behind every write, and the processing pipeline captures file integrity from that point forward. Correlating the two gives you the complete chain from “who sent this” to “what happened to it,” in a format that’s queryable and reproducible for any delivery method.
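The correlation step can be sketched as a join between CloudTrail write events and the pipeline’s own audit entries. The CloudTrail field names (`userIdentity`, `requestParameters.bucketName`, `requestParameters.key`) follow the S3 data event format; the shape of the pipeline entries (`bucket`, `key`) is an assumption for this example.

```python
def write_identity(trail_event):
    """Pull the authenticated principal out of a CloudTrail record."""
    identity = trail_event.get("userIdentity", {})
    return {
        "type": identity.get("type"),
        "arn": identity.get("arn"),
        "principal_id": identity.get("principalId"),
    }


def correlate(trail_events, pipeline_entries):
    """Attach the writer's identity to each pipeline audit entry.

    Joins on (bucket, key). If the same key was written more than once,
    the last CloudTrail event in the list wins; a production version
    would also match on version ID or timestamp.
    """
    writers = {}
    for event in trail_events:
        if event.get("eventName") not in ("PutObject", "CompleteMultipartUpload"):
            continue
        params = event.get("requestParameters") or {}
        writers[(params.get("bucketName"), params.get("key"))] = write_identity(event)
    return [
        {**entry, "written_by": writers.get((entry["bucket"], entry["key"]))}
        for entry in pipeline_entries
    ]
```

An entry whose `written_by` comes back as `None` is worth investigating in its own right, since it means a file was processed without a matching authenticated write on record.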
Immutability is enforced at the bucket level, not the protocol level. Once a file is promoted to the clinical data bucket with Object Lock in COMPLIANCE mode, it doesn’t matter whether it came in over SFTP, an API, or a direct upload. The retention guarantee is the same: the locked object version cannot be overwritten or deleted by any user, including the account root user, until the retention period you set expires.
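To make the promotion step concrete, here is a minimal sketch that assembles the arguments for an S3 CopyObject call with a COMPLIANCE-mode retention date. The bucket names and the retention period are placeholders, and the sketch assumes the clinical data bucket was created with Object Lock enabled.

```python
from datetime import datetime, timedelta, timezone


def promotion_request(landing_bucket, clinical_bucket, key, retention_days, now=None):
    """Assemble keyword arguments for an S3 CopyObject call that applies
    a COMPLIANCE-mode retention period to the promoted copy."""
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    now = now or datetime.now(timezone.utc)
    return {
        "CopySource": {"Bucket": landing_bucket, "Key": key},
        "Bucket": clinical_bucket,
        "Key": key,
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": now + timedelta(days=retention_days),
    }
```

With boto3 this could be passed as `s3_client.copy_object(**request)`. A single CopyObject call handles objects up to 5 GB, so the terabyte-scale deliveries mentioned earlier would need a multipart copy instead. Note that nothing in the request refers to the delivery protocol.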
Namespace isolation works the same way for every protocol. The vendor onboarding pattern from the previous post (scoped IAM policies, prefix-based separation, blinded/unblinded segregation) applies regardless of delivery method. A vendor who uploads via API is restricted to the same namespace as they would be via SFTP. The conventions don’t change, the enforcement mechanism doesn’t change — only the transport layer does.
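IAM policies do the real enforcement, but the pipeline can apply the same convention as a second check before promoting a file. The sketch below assumes a hypothetical key layout of `<vendor_id>/<area>/<path>`, where the area is `blinded` or `unblinded`; the actual layout from the onboarding pattern may differ.

```python
from pathlib import PurePosixPath


def namespace_violation(key, vendor_id, allowed_areas=("blinded",)):
    """Return a reason string if `key` falls outside the vendor's
    namespace, or None if the key is acceptable.

    Assumes a hypothetical layout of <vendor_id>/<area>/<path...>.
    """
    parts = PurePosixPath(key).parts
    if ".." in parts or key.startswith("/"):
        return "key is not a plain relative path"
    if len(parts) < 3:
        return "key is missing vendor, area, or file name"
    if parts[0] != vendor_id:
        return f"key belongs to namespace {parts[0]!r}, not {vendor_id!r}"
    if parts[1] not in allowed_areas:
        return f"vendor is not permitted to write to {parts[1]!r}"
    return None
```

Because the check takes only the key and the authenticated vendor identity, it reads the same whether the file arrived over SFTP, an API, or a direct upload; a non-`None` result would route the file to quarantine.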
When a regulator asks “how do you ensure data integrity for vendor deliveries?” the answer doesn’t branch into six different explanations depending on the protocol. The storage, processing, and retention layers are the same regardless of how data enters the system. How a specific file got to the landing bucket is a detail in the vendor’s onboarding record, not a variable in the compliance architecture.
Where this takes you
Once the delivery layer is designed for multiple protocols, the conversation with vendors shifts. Onboarding becomes “what’s the best way for you to send us data?” instead of “here’s the only way we accept it.” Most vendors will still choose SFTP because it’s what they’re set up for. But the ones who can do better (real-time API feeds from central labs, large-file direct upload for genomics partners) aren’t forced into a lowest-common-denominator workflow just because the platform can’t handle anything else.
This also matters for longer-running clinical programs where the vendor landscape evolves mid-trial. A study that starts with one CRO uploading SDTM datasets over SFTP might later add a central lab with an API, a genomics partner sending terabyte-scale files, and an EDC integration pushing data on a schedule. If each of those requires a separate pipeline with its own audit trail and its own monitoring, you’re maintaining four systems and reconciling four sets of compliance evidence. If they all write to the same landing bucket and trigger the same processing chain, you’re maintaining one. One set of monitoring. One set of alerts. One audit trail format your quality team can hand to an inspector without assembling evidence from four different systems.
The next question, and the next post in this series, is what happens after data arrives and enters the system. Every file that lands gets checksummed, logged, and locked. But when a statistician uses a dataset six months later for a regulatory submission, the question isn’t just “did this file arrive intact?” It’s “where did this data come from, what touched it, and can you prove it?” That’s data lineage, and it’s where the delivery layer connects to the compliance story your quality team actually needs to tell.

