From file drop to audit trail: event-driven ingestion for clinical data delivery

By:
Appsilon Team
May 19, 2026

In the previous post in this series, we covered why SFTP persists as the default for clinical data transfers and how a managed SFTP endpoint backed by cloud storage gives you a foundation that auditors and data management teams can both live with. The protocol is settled. The endpoint is running. Vendors are uploading SDTM datasets, lab results, and imaging metadata on their own schedules.

But in most organizations we work with, there’s a gap between “file arrived” and “file is processed, verified, and logged.” That gap is usually filled with a cron job someone wrote two years ago, a shared folder someone checks every morning, and a spreadsheet someone updates when they remember. Someone has to own that process, and it works – until it doesn’t. The moment it stops working is usually the moment an auditor asks a question nobody can answer quickly, because a process like this produces no evidence on its own.

What follows is the pattern we use to close that gap: an event-driven pipeline where the audit trail is a structural byproduct of the system running, not a separate deliverable someone has to produce after the fact.

The gap between arrival and action

Most teams do some version of this today. A scheduled script runs every fifteen minutes, or every hour, or once a day. It scans a directory for new files. When it finds something, it copies it to a processing location, maybe computes a checksum, maybe logs a line to a file. Between runs, files sit unprocessed. If the script fails silently – and scripts do fail silently more often than anyone wants to admit – files can sit for hours before anyone notices.

The processing logic usually lives in a shell script or a Python script that one person wrote and one person maintains. It is rarely version-controlled and almost never tested. When that person leaves the organization or moves to a different project, the script becomes institutional folklore: everyone knows it runs, nobody wants to touch it, and the day it breaks is the day someone discovers it was doing three things nobody documented.

The audit trail in this setup is either a manual spreadsheet cross-referenced with vendor email confirmations, or something reconstructed after the fact by correlating sshd logs with filesystem timestamps and whatever checksumming someone bolted on. When an auditor asks “prove this file was not modified between receipt and processing,” the cron approach requires assembling evidence from multiple, disconnected sources. It can be done, but it takes real effort to produce on demand rather than being there by default. The audit story takes work to tell rather than telling itself.

The tension isn’t about whether the data arrives safely – SFTP handles transport integrity at the protocol level. It’s about whether the system around it creates the compliance properties your quality team needs without depending on human discipline to maintain them. Human discipline doesn’t scale, and it doesn’t survive personnel changes.

What event-driven ingestion actually looks like

Instead of polling on a schedule, the system reacts the moment a file arrives. The walkthrough below uses AWS primitives. Azure Event Grid and GCP Pub/Sub offer equivalent building blocks, and the structural properties are the same regardless of cloud. In AWS, S3 event notifications fire on object creation and trigger a processing function immediately. No polling interval. No delay. No window where a file sits unnoticed.
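As a minimal sketch of the receiving end, the function below pulls the bucket, key, size, and event time out of an S3 event notification payload. The field names follow the S3 notification message structure; the `handler` wrapper and the returned dictionary shape are our own illustrative choices, not a prescribed layout.

```python
from urllib.parse import unquote_plus


def extract_uploads(event: dict) -> list[dict]:
    """Return one entry per object-creation record in an S3 event payload."""
    uploads = []
    for record in event.get("Records", []):
        # Ignore anything that is not an object-creation event.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        s3 = record["s3"]
        uploads.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in the notification payload.
            "key": unquote_plus(s3["object"]["key"]),
            "size": s3["object"].get("size", 0),
            "event_time": record["eventTime"],
        })
    return uploads


def handler(event, context):
    # Hypothetical Lambda entry point: real processing would start here.
    return extract_uploads(event)
```

Decoding the key matters in practice: a filename with a space arrives as `dm+v1.xpt`, and a later lookup on the undecoded key would miss the object.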

When a vendor uploads a dataset through the SFTP endpoint, the file lands in a landing bucket – the only bucket vendors have write access to, and the only place external traffic touches. The moment the object is created, an S3 event fires and a Lambda function picks it up. It extracts metadata from the S3 key path: vendor identity, study identifier, and data type, based on naming conventions established during vendor onboarding. It computes a checksum and writes a structured audit log entry to a dedicated audit bucket. That entry captures everything an auditor would ask about: source identity, arrival timestamp, original filename, SHA-256 checksum, file size, and processing status, all in a single JSON record, written atomically, before the file moves anywhere else.
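The audit entry described above could be assembled roughly as follows. This is a sketch under stated assumptions: the JSON field names (`vendor`, `study`, `sha256`, and so on) are hypothetical, and the stream stands in for the object body that a Lambda function would read from S3.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import BinaryIO, Optional


def sha256_and_size(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> tuple[str, int]:
    """Hash a stream in chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    size = 0
    while chunk := stream.read(chunk_size):
        digest.update(chunk)
        size += len(chunk)
    return digest.hexdigest(), size


def build_audit_record(*, vendor: str, study: str, data_type: str, filename: str,
                       stream: BinaryIO, arrived_at: Optional[datetime] = None,
                       status: str = "received") -> str:
    """Return one JSON document describing a single transfer."""
    checksum, size = sha256_and_size(stream)
    arrived_at = arrived_at or datetime.now(timezone.utc)
    record = {
        "vendor": vendor,            # source identity
        "study": study,
        "data_type": data_type,
        "filename": filename,        # original filename as uploaded
        "sha256": checksum,
        "size_bytes": size,
        "arrived_at": arrived_at.isoformat(),
        "status": status,
    }
    # Sorted keys keep the serialized record stable across runs.
    return json.dumps(record, sort_keys=True)
```

Writing the whole record as a single object is what makes the entry all-or-nothing: there is no state in which the checksum exists but the timestamp does not.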

From there, the file gets promoted to the clinical data bucket, where validated datasets live and where scientists actually work with them. Biostatisticians and statistical programmers access this data for their analyses, tables, listings, and figures. The bucket has S3 Object Lock enabled in COMPLIANCE mode, matching regulatory retention requirements. Once locked, the object can’t be modified or deleted, not by administrators, not by root, not by anyone for the duration of the retention period. The original file in the landing bucket gets cleaned up on a lifecycle policy after the promotion succeeds.
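To make the promotion step concrete, here is a sketch that builds the parameters for an S3 copy with a COMPLIANCE-mode retention date. `ObjectLockMode` and `ObjectLockRetainUntilDate` are real S3 request parameters; the bucket names, the function name, and the retention length are placeholders, and the destination bucket is assumed to have Object Lock enabled already.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def promotion_params(landing_bucket: str, clinical_bucket: str, key: str,
                     retention_days: int, now: Optional[datetime] = None) -> dict:
    """Build copy parameters that lock the promoted object on arrival."""
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": clinical_bucket,
        "Key": key,
        "CopySource": {"Bucket": landing_bucket, "Key": key},
        # COMPLIANCE mode: the retention date cannot be shortened or removed.
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": now + timedelta(days=retention_days),
    }


# With boto3 installed, the parameters would be passed along the lines of:
#   boto3.client("s3").copy_object(**promotion_params(...))
```

The retention period is a decision for your compliance team, not something this sketch can supply. A single copy request also has an object size ceiling, so very large files would need a multipart copy instead.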

An SNS notification fires to the data management team: what arrived, from which vendor, whether it passed integrity checks. If something fails – e.g. checksum mismatch, unexpected file format, missing manifest – the file goes to a quarantine path with its own audit entry explaining what went wrong and an alert to the operations team. Nothing is silently dropped. Every outcome, success or failure, gets documented.
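The promote-or-quarantine decision can be sketched as a pure function, which keeps it easy to test. Everything specific here is an assumption for illustration: the allowed extensions, the manifest shape (filename mapped to expected SHA-256), and the `quarantine/` prefix.

```python
import os
from dataclasses import dataclass, field
from typing import Optional

# Assumed whitelist; the real list comes out of vendor onboarding.
ALLOWED_EXTENSIONS = {".xpt", ".csv", ".json", ".xml"}


@dataclass
class Outcome:
    status: str                      # "promoted" or "quarantined"
    destination: str                 # key the file should end up under
    reasons: list = field(default_factory=list)


def decide_outcome(key: str, actual_sha256: str, manifest: Optional[dict]) -> Outcome:
    """Collect every failed check so the audit entry explains all of them."""
    filename = os.path.basename(key)
    extension = os.path.splitext(filename)[1].lower()
    reasons = []
    if extension not in ALLOWED_EXTENSIONS:
        reasons.append("unexpected file format")
    if manifest is None:
        reasons.append("missing manifest")
    elif manifest.get(filename) != actual_sha256:
        reasons.append("checksum mismatch")
    if reasons:
        return Outcome("quarantined", f"quarantine/{key}", reasons)
    return Outcome("promoted", key)


def notification(outcome: Outcome, vendor: str) -> dict:
    """Build the subject and message for the team notification."""
    body = outcome.destination
    if outcome.reasons:
        body += "\nReasons: " + "; ".join(outcome.reasons)
    return {"Subject": f"[{outcome.status.upper()}] transfer from {vendor}", "Message": body}
```

Because every path returns an `Outcome`, there is no branch in which a file disappears without a record, which is the property the paragraph above describes.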

In our implementations, we separate these concerns across distinct storage layers – landing, clinical data, and audit – so that vendor-facing infrastructure stays isolated from the compliance-critical layer where scientists access validated datasets. Your organization may structure this differently depending on how research and delivery workflows are organized. The key point is isolation between ingestion and consumption, however you choose to get there.

The same pattern works beyond initial ingestion, too. Downstream processing stages – schema validation against expected CDISC structures, metadata extraction and standardization, data transformation into submission-ready formats, row count checks, controlled vocabulary verification – plug into the same architecture as additional functions, triggered by the same events or chained in sequence. What starts as automated file intake becomes the backbone for a data management pipeline where metadata enrichment and format standardization happen as data moves through the system. The pipeline grows without changing the core pattern.
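One way to picture chained stages is as a list of small checks that each return findings. The sketch below shows a row count check and a controlled vocabulary check run in sequence; the function names and the row representation (a list of dictionaries) are our assumptions, and the vocabulary values in the usage note are illustrative rather than an authoritative codelist.

```python
from typing import Callable

Rows = list[dict]
Check = Callable[[Rows], list[str]]


def row_count_check(expected: int) -> Check:
    """Stage that flags a dataset whose row count differs from the manifest."""
    def check(rows: Rows) -> list[str]:
        if len(rows) == expected:
            return []
        return [f"expected {expected} rows, found {len(rows)}"]
    return check


def vocabulary_check(column: str, allowed: set) -> Check:
    """Stage that flags values outside a controlled vocabulary."""
    def check(rows: Rows) -> list[str]:
        bad = sorted({str(r.get(column)) for r in rows if r.get(column) not in allowed})
        return [f"{column}: value {v!r} not in controlled vocabulary" for v in bad]
    return check


def run_stages(rows: Rows, stages: list[Check]) -> list[str]:
    """Run every stage in order and collect all findings."""
    findings = []
    for stage in stages:
        findings.extend(stage(rows))
    return findings
```

Adding a stage means appending another function to the list, for example `run_stages(rows, [row_count_check(2), vocabulary_check("SEX", {"M", "F", "U"})])`. The core loop does not change.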

The audit trail as a system property

The difference between the before and after isn’t incremental. It’s structural.

Before this pipeline, proving file integrity meant correlating session logs from the SFTP server with filesystem timestamps and whatever checksumming was bolted on afterward. Proving no modification after receipt meant trusting that nobody touched the filesystem between arrival and processing. Trust us, that’s hard to demonstrate to a regulator. Pulling together a transfer history for an inspection meant exporting logs, cross-referencing with vendor emails, and assembling a report by hand. And retention? That depended on someone maintaining backups and not rotating out old log files.

After the pipeline is running, the same questions have different answers. File integrity? Query the audit bucket for the structured transfer record: timestamp, checksum, and source identity in a single document. No modification? Point to the Object Lock retention timestamp on the clinical data bucket; the object can’t be altered for the retention period you set. Transfer history? Filter structured JSON logs by study, vendor, or date range. Retention? Enforced by Object Lock for however long your compliance team needs, not by someone remembering to keep the backups.
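Filtering the transfer history can be as small as the sketch below, which assumes one JSON record per line with hypothetical `study`, `vendor`, and `arrived_at` fields, the last stored as an ISO 8601 timestamp with an explicit UTC offset.

```python
import json
from datetime import datetime
from typing import Iterable, Optional


def filter_audit_records(lines: Iterable[str], study: Optional[str] = None,
                         vendor: Optional[str] = None,
                         start: Optional[datetime] = None,
                         end: Optional[datetime] = None) -> list[dict]:
    """Return records matching every filter that was supplied."""
    matches = []
    for line in lines:
        record = json.loads(line)
        arrived = datetime.fromisoformat(record["arrived_at"])
        if study is not None and record["study"] != study:
            continue
        if vendor is not None and record["vendor"] != vendor:
            continue
        # start is inclusive, end is exclusive; both must be timezone-aware
        # because the stored timestamps carry an offset.
        if start is not None and arrived < start:
            continue
        if end is not None and arrived >= end:
            continue
        matches.append(record)
    return matches
```

At larger volumes the same question would more likely go to a query engine over the audit bucket, but the filter logic stays the same shape.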

Each of those answers maps directly to a 21 CFR Part 11 requirement – and the pipeline satisfies them not by configuration, but by construction:

  • Immutability – with Object Lock in COMPLIANCE mode, records can’t be altered or deleted during the retention period.
  • Computer-generated audit trails – the Lambda function writes structured entries on every event, with no human step required to create or maintain them.
  • Readily retrievable records – audit bucket with cross-region replication, queryable at any time.
  • Identity attribution – Transfer Family authentication logs tie each upload session to a specific vendor identity.

The audit trail isn’t something the team produces. It’s something the system produces as a consequence of running. That matters when your quality team is preparing for an inspection and the question is whether producing the required evidence is a time-consuming project or an efficient query.

Where this takes you

Once the ingestion pipeline is running and the audit foundation is in place, the same architecture opens up capabilities that would be impractical on top of a manual process. Schema validation – checking that incoming datasets match expected structures, row counts, and controlled vocabularies before they reach the analytical workspace – is a natural extension. But the pipeline also becomes the right place for metadata management and data transformation: extracting and standardizing metadata fields, converting datasets to submission-ready formats, generating submission metadata such as define.xml, enriching records with study context. Each of these is another function in the same event chain, automated, audited, and traceable by default.

The structured metadata captured at ingestion feeds directly into data lineage. When a statistician pulls a dataset six months later for a regulatory submission, the lineage chain traces back to the original file drop: who uploaded it, when, the checksum at arrival, every transformation applied. That’s what regulators expect, and it’s what manual processes struggle to maintain across the lifecycle of a clinical program.
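A lineage trace of this kind can be sketched by linking events through checksums: each event records the checksum it consumed and the checksum it produced, and the ingestion event has no input. The event fields (`step`, `input_sha256`, `output_sha256`) are hypothetical names chosen for illustration.

```python
def trace_lineage(events: list[dict], sha256: str) -> list[dict]:
    """Walk back from a dataset checksum to the original file drop."""
    by_output = {event["output_sha256"]: event for event in events}
    chain = []
    seen = set()
    current = sha256
    # Follow input checksums until no earlier event is found; the seen set
    # guards against a malformed log that loops back on itself.
    while current in by_output and current not in seen:
        seen.add(current)
        event = by_output[current]
        chain.append(event)
        current = event.get("input_sha256")
    return chain
```

The result lists events from most recent to oldest, so the last element is the file drop itself: who uploaded it, when, and with which checksum.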

One dependency makes all of this work: vendors uploading files to consistent, predictable paths. The key structure – vendor namespace, study identifier, data type – is what the processing function uses to route, classify, and log each transfer. When those conventions are loose or ad hoc, the automation falls apart. When they’re standardized and enforced from the moment a vendor is onboarded, the entire pipeline runs without manual intervention.
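Enforcing the convention can start with a strict parser that rejects any key it does not recognize. The pattern below is a sketch: the segment order matches the key structure described above, but the allowed characters and the example values are assumptions that your onboarding rules would replace.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: <vendor>/<study>/<data_type>/<filename>
KEY_PATTERN = re.compile(
    r"^(?P<vendor>[a-z0-9-]+)/"
    r"(?P<study>[A-Z0-9-]+)/"
    r"(?P<data_type>[a-z0-9_]+)/"
    r"(?P<filename>[^/]+)$"
)


class TransferKey(NamedTuple):
    vendor: str
    study: str
    data_type: str
    filename: str


def parse_key(key: str) -> Optional[TransferKey]:
    """Return the parsed key, or None when it breaks the convention."""
    match = KEY_PATTERN.match(key)
    return TransferKey(**match.groupdict()) if match else None
```

A `None` result is a natural trigger for the quarantine path: the file is kept and logged, but nothing downstream has to guess which study it belongs to.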

The shift isn’t complicated to describe, but it’s significant in practice. When every file drop writes its own record, when the storage layer enforces immutability by default, and when failures surface immediately rather than accumulating silently, the compliance burden moves from people to infrastructure. That’s where it should have been all along.

Vendor onboarding is the next conversation in this series – how to turn it from a bespoke, multi-week process into a repeatable pattern that plugs each new CRO or lab into the same automated pipeline from day one.
