Lineage from the first byte: tracing data from delivery through transformation

May 28, 2026

When a regulator asks where a dataset came from, the answer should be a query, not an investigation.

Over the course of this series, we've built up a complete data delivery layer. SFTP endpoints backed by cloud storage. An event-driven pipeline that checksums, logs, and locks every file the moment it arrives. A vendor onboarding pattern that keeps path conventions consistent. A protocol-agnostic architecture that handles SFTP, APIs, and direct uploads through the same processing chain. Every file that enters the system is verified, attributed, and immutable from the moment it's promoted to the clinical data bucket.

However, audit logging and data lineage are not the same thing. Audit logs tell you what happened: this file arrived at this time, from this vendor, with this checksum. Lineage tells you the full story: where the data came from, every transformation applied, which version of which code touched it, and how the dataset a statistician is using today connects back to the raw file a CRO uploaded six months earlier. Audit logs are snapshots. Lineage is the chain that connects them.

In regulated environments, both matter. 21 CFR Part 11 requires audit trails, but when your quality team is preparing a submission package and needs to demonstrate that every dataset is traceable from source through analysis, audit logs alone force you to stitch together evidence from multiple systems by hand. Lineage built into the platform turns that reconstruction project into a query.

What you already have (and probably don't realize)

Most organizations running clinical data pipelines on AWS already have the raw materials for data lineage. They just haven't connected them into a coherent chain.

CloudTrail captures every S3 API call - every PutObject, CopyObject, GetObject, DeleteObject. Each event records the IAM principal that made the call, the timestamp, the bucket and key, the source IP, and the request parameters. For the data delivery pipeline described in this series, that means every file arrival (PutObject from Transfer Family), every promotion to the clinical data bucket (CopyObject from the processing Lambda), and every access by a scientist (GetObject from their Workbench session) is recorded with full identity attribution. CloudTrail doesn't know it's building lineage, but the raw material is there.

Transfer Family authentication logs in CloudWatch tie each SFTP session to a specific vendor identity. Combined with CloudTrail, you can trace a file from "vendor X authenticated at 14:31" through "file landed in the landing bucket at 14:31:58" to "Lambda promoted it to the clinical data bucket at 14:32:07." The vendor identity, the file path, and the processing chain are all captured across these two sources.

The processing Lambda's structured audit records add application-level context that CloudTrail doesn't capture: the SHA-256 checksum computed at processing time, the Object Lock retention date applied to the promoted file, the processing status (success or quarantine), and the vendor and study metadata extracted from the S3 key path. These records live in a dedicated audit bucket as structured JSON, queryable and immutable.
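As a rough illustration, an audit record with those four pieces of context could be assembled as below. This is a minimal sketch, not the pipeline's actual code: the field names, the `build_audit_record` helper, and the `<vendor>/<study>/<filename>` key layout are assumptions made for the example.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_record(key: str, body: bytes, retain_until: datetime,
                       status: str = "success") -> dict:
    """Assemble one structured audit record for a processed file."""
    # Assumed key layout (hypothetical): <vendor>/<study>/<filename>
    vendor, study, filename = key.split("/", 2)
    return {
        "key": key,
        "vendor": vendor,
        "study": study,
        "filename": filename,
        # Checksum computed at processing time over the file contents.
        "sha256": hashlib.sha256(body).hexdigest(),
        # Retention date applied to the promoted object.
        "object_lock_retain_until": retain_until.isoformat(),
        "status": status,  # "success" or "quarantine"
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }


record = build_audit_record(
    "acme-cro/study-001/labs.csv",
    b"subject,visit,value\n",
    retain_until=datetime(2036, 1, 1, tzinfo=timezone.utc),
)
print(json.dumps(record, indent=2))
```

Serializing the record as JSON keeps it queryable later, and writing it to a bucket with Object Lock enabled is what would make it immutable.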

CloudTrail Lake ties a bow on the API-level events by making them queryable with SQL. Instead of exporting raw JSON logs and parsing them with scripts, you can run structured queries: show me every file delivered by vendor X for study Y in the last 90 days, with checksums and promotion timestamps. For organizations that haven't enabled CloudTrail Lake for S3 data events, this is one of the highest-value, lowest-effort improvements available.
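A query of that kind might be built as in the sketch below. It covers only the API-level side (who wrote which key, and when). The event data store ID, the key prefix convention, and the `build_delivery_query` helper are hypothetical, and the column and map-key names should be checked against the CloudTrail Lake documentation for your event data store before use.

```python
import re
from datetime import datetime, timedelta, timezone


def build_delivery_query(event_data_store_id: str, key_prefix: str,
                         since: datetime) -> str:
    """Return a CloudTrail Lake SQL string listing S3 writes under a key prefix."""
    # The values are interpolated into SQL text, so quotes, spaces, and
    # LIKE wildcards are rejected up front.
    if not re.fullmatch(r"[0-9a-fA-F-]+", event_data_store_id):
        raise ValueError("unexpected event data store ID")
    if not re.fullmatch(r"[A-Za-z0-9./-]+", key_prefix):
        raise ValueError("unexpected key prefix")
    return (
        "SELECT eventTime, eventName, userIdentity.arn AS principal, "
        "element_at(requestParameters, 'bucketName') AS bucket, "
        "element_at(requestParameters, 'key') AS object_key "
        f"FROM {event_data_store_id} "
        "WHERE eventSource = 's3.amazonaws.com' "
        "AND eventName IN ('PutObject', 'CopyObject') "
        f"AND element_at(requestParameters, 'key') LIKE '{key_prefix}%' "
        f"AND eventTime > '{since:%Y-%m-%d %H:%M:%S}' "
        "ORDER BY eventTime"
    )


# Hypothetical values: vendor "acme-cro", study "study-001", last 90 days.
since = datetime(2026, 5, 28, tzinfo=timezone.utc) - timedelta(days=90)
query = build_delivery_query(
    "a1b2c3d4-0000-1111-2222-333344445555", "acme-cro/study-001/", since
)
print(query)
```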

The gap is not that the data doesn't exist; it's that these sources live independently. CloudTrail knows the API calls. The Lambda audit log knows the application-level processing. Transfer Family logs know the vendor identity. But no single system connects them into a continuous chain that says "this dataset in the statistician's workspace traces back through these specific steps to this specific file uploaded by this specific vendor on this specific date." Correlating them by hand is possible. It just scales poorly, and it turns every compliance question into a research project.

From audit logs to lineage

The shift from "we have audit logs" to "we have data lineage" requires two things: structured metadata at every transition point, and a way to link those records into a chain.

Transition-point metadata is the foundation. Every time data moves or transforms, the system should capture:

what went in (S3 URI, version ID, checksum),
what came out (S3 URI, version ID, checksum),
what performed the transformation (code version, container image digest, Lambda function ARN),
who triggered it (IAM identity, orchestration run ID),
when it was triggered.

Each record points backward to its input and forward to its output. This is what turns a collection of independent audit log entries into a directed graph you can traverse.
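To make the graph idea concrete, here is a minimal sketch of transition records and a backward walk over them. The class names, bucket names, and identifiers are invented for illustration, and the sketch assumes each transformation has a single input; a real pipeline with merges would need a list of inputs per record.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ObjectRef:
    uri: str
    version_id: str
    sha256: str


@dataclass(frozen=True)
class Transition:
    input_ref: ObjectRef
    output_ref: ObjectRef
    performed_by: str   # code version, image digest, or Lambda ARN
    triggered_by: str   # IAM identity or orchestration run ID
    triggered_at: str   # ISO 8601 timestamp


def trace_back(target: ObjectRef, transitions: List[Transition]) -> List[Transition]:
    """Follow input links from a dataset back to its original delivery."""
    by_output: Dict[Tuple[str, str], Transition] = {
        (t.output_ref.uri, t.output_ref.version_id): t for t in transitions
    }
    chain: List[Transition] = []
    seen = set()
    current = (target.uri, target.version_id)
    # Stop when no record produced the current object (the raw delivery)
    # or when a cycle would repeat a step.
    while current in by_output and current not in seen:
        seen.add(current)
        step = by_output[current]
        chain.append(step)
        current = (step.input_ref.uri, step.input_ref.version_id)
    return chain


landing = ObjectRef("s3://landing/acme-cro/study-001/labs.csv", "v1", "aaa")
clinical = ObjectRef("s3://clinical/acme-cro/study-001/labs.csv", "v1", "aaa")
standardized = ObjectRef("s3://clinical/study-001/standardized/labs.parquet", "v1", "bbb")
transitions = [
    Transition(landing, clinical, "lambda:promote", "role/ingest", "2026-01-05T14:32:07Z"),
    Transition(clinical, standardized, "standardize-job@abc123", "run-42", "2026-01-05T15:00:00Z"),
]
for step in trace_back(standardized, transitions):
    print(step.output_ref.uri, "<-", step.input_ref.uri)
```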

Applied to the delivery pipeline presented in this series, the chain looks like this:

A vendor uploads a file through the SFTP endpoint.
CloudTrail records the PutObject.
Transfer Family logs record the vendor identity.
The processing Lambda fires, computes a checksum, writes its structured audit record (input key, output key, input checksum, output checksum, processing metadata), and promotes the file to the clinical data bucket with Object Lock.
CloudTrail records the CopyObject.
If downstream processing stages follow (schema standardization, metadata enrichment, format conversion into submission-ready structures), each one produces its own transition record with the same input/output/transformation/identity pattern.

By the time the dataset reaches the analytical workspace where scientists use it for tables, listings, and figures, every step in the chain has a record.

A transformation ledger makes this queryable. Rather than relying on CloudTrail queries joined with audit bucket reads, joined with CloudWatch log filters, a dedicated metadata store (DynamoDB is a natural fit in AWS) captures every transition in a single, purpose-built table. Each entry contains:

the input object reference (S3 URI, version ID, ETag, SHA-256),
the output object reference (same fields),
the transformation bundle identity (git commit SHA of the processing code, container image digest, recipe or configuration version),
the execution context (IAM role, orchestration run ID, CloudTrail event ID for cross-referencing),
timestamps.

The ledger becomes the single source of truth for "what happened to this data and why."
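One possible shape for a ledger entry is sketched below as a plain dictionary, ready to be written to a table. The attribute names and the choice of output URI plus version ID as the key are assumptions for illustration, not a prescribed schema; all values are placeholders.

```python
import json
from datetime import datetime, timezone


def object_ref(uri: str, version_id: str, etag: str, sha256: str) -> dict:
    """Describe one S3 object version by location and content."""
    return {"uri": uri, "version_id": version_id, "etag": etag, "sha256": sha256}


def ledger_item(input_ref: dict, output_ref: dict, bundle: dict,
                context: dict, recorded_at: datetime) -> dict:
    """Build one ledger entry, keyed (by assumption) on the output object version."""
    return {
        "pk": f"{output_ref['uri']}#{output_ref['version_id']}",
        "input": input_ref,
        "output": output_ref,
        "transformation": bundle,   # what ran
        "execution": context,       # who or what ran it
        "recorded_at": recorded_at.isoformat(),
    }


item = ledger_item(
    object_ref("s3://landing/acme-cro/study-001/labs.csv", "v1", "etag-in", "aaa"),
    object_ref("s3://clinical/acme-cro/study-001/labs.csv", "v7", "etag-out", "aaa"),
    bundle={"git_commit": "abc123", "image_digest": "sha256:def456", "config_version": "3"},
    context={"iam_role": "role/ingest", "run_id": "run-42", "cloudtrail_event_id": "evt-1"},
    recorded_at=datetime(2026, 1, 5, 14, 32, 7, tzinfo=timezone.utc),
)
print(json.dumps(item, indent=2))
```

Keying on the output object version means "where did this dataset come from?" is a single lookup. With DynamoDB, a conditional write such as `attribute_not_exists(pk)` is one way to keep the table append-only, so an existing entry is never silently overwritten.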

Why this matters for submissions. When a biostatistician produces a table for a regulatory filing, the lineage chain should trace backward through every transformation to the original vendor delivery. Not by reconstructing it from scattered logs after the fact, but by following the links in the ledger. The regulator's question ("prove this dataset was not modified outside of your validated pipeline") gets answered by the chain itself: here is every step, every input, every output, every code version, every checksum, all linked. The evidence isn't assembled for the inspection; it's a structural property of how data moves through the system.

The transformation ledger also enables something auditors increasingly ask about: reproducibility. If the ledger records not just what ran, but which exact version of the code and configuration ran, the organization can replay any transformation and verify that the same inputs produce the same outputs. That capability, sometimes called a replay test, is a powerful compliance property that's almost impossible to support without structured lineage metadata.
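A replay test can be sketched in a few lines. The example assumes the transformation is deterministic and that the recorded code version can be resolved to something runnable; the `replay_matches` helper, the in-memory store, and the toy transformation are all hypothetical stand-ins.

```python
import hashlib
from typing import Callable, Dict


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def replay_matches(entry: dict,
                   load_input: Callable[[str], bytes],
                   transforms: Dict[str, Callable[[bytes], bytes]]) -> bool:
    """Re-run a recorded transformation and compare checksums with the ledger."""
    raw = load_input(entry["input_uri"])
    if sha256_hex(raw) != entry["input_sha256"]:
        return False  # the stored input no longer matches what was recorded
    transform = transforms[entry["code_version"]]
    return sha256_hex(transform(raw)) == entry["output_sha256"]


# Toy stand-ins: an in-memory "bucket" and one versioned transformation.
store = {"s3://clinical/acme-cro/study-001/labs.csv": b"subject,visit,value\n"}
transforms = {"abc123": lambda data: data.upper()}

entry = {
    "input_uri": "s3://clinical/acme-cro/study-001/labs.csv",
    "input_sha256": sha256_hex(b"subject,visit,value\n"),
    "code_version": "abc123",
    "output_sha256": sha256_hex(b"SUBJECT,VISIT,VALUE\n"),
}
print(replay_matches(entry, store.__getitem__, transforms))
```

Transformations that embed timestamps or depend on unordered iteration would fail this check even when nothing is wrong, so determinism has to be designed in for replay to be meaningful.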

Where OpenTelemetry fits

The infrastructure described above (CloudTrail, structured audit records, a transformation ledger) gives you lineage. However, this lineage is constructed from discrete segments inserted at defined locations. OpenTelemetry offers something different: lineage as a first-class, real-time property of the system, with causal connections built in.

OTel's distributed tracing model was designed for microservices, but it maps naturally onto data pipelines. A trace starts when a file arrives in the landing bucket and spans the entire processing chain: the S3 event notification, the Lambda invocation, the checksum computation, the audit log write, the promotion to the clinical data bucket, and any downstream transformation stages. Each processing step is a span within the trace. The trace ID connects them all into a single, navigable timeline.

In a platform like Bioverse, the implementation would look like the following:

The S3 event notification triggers a Lambda function instrumented with the OTel SDK (available as a managed AWS Lambda layer).
That function creates a root trace, attaches the vendor identity, study ID, and file metadata as span attributes, and propagates the trace context forward.
If the processing chain continues into another Lambda, an Airflow task, or a containerized transformation job on EKS, each downstream step inherits the same trace context and adds its own span with transformation-specific metadata.

The result is a single, queryable trace that shows the complete journey of a file from the landing bucket to the analytical workspace, with timing, attribution, and metadata at every step.
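The mechanics of context propagation can be shown without the SDK. The sketch below is a hand-rolled illustration, not OTel code: it mimics the W3C Trace Context `traceparent` header that OTel propagates by default, and the `Span` class, `start_span` helper, and attribute names are invented for the example.

```python
import secrets
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Span:
    name: str
    trace_id: str              # 32 hex chars, shared by every step in the chain
    span_id: str               # 16 hex chars, unique to this step
    parent_id: Optional[str]   # span_id of the step that triggered this one
    attributes: Dict[str, str] = field(default_factory=dict)

    def traceparent(self) -> str:
        """Serialize the context to hand to the next step (flags 01 = sampled)."""
        return f"00-{self.trace_id}-{self.span_id}-01"


def start_span(name: str, traceparent: Optional[str] = None, **attributes: str) -> Span:
    """Start a root span, or a child span when an upstream context is given."""
    if traceparent is None:
        trace_id, parent_id = secrets.token_hex(16), None
    else:
        _version, trace_id, parent_id, _flags = traceparent.split("-")
    return Span(name, trace_id, secrets.token_hex(8), parent_id, dict(attributes))


# Root span in the ingest Lambda, then two downstream steps inheriting the context.
root = start_span("process-delivery", vendor="acme-cro", study="study-001")
promote = start_span("promote-file", root.traceparent())
standardize = start_span("standardize-schema", promote.traceparent(), code_version="abc123")

for span in (root, promote, standardize):
    print(span.trace_id, span.parent_id, "->", span.span_id, span.name)
```

In a real deployment the SDK creates and propagates this context for you; the point here is only that a downstream Lambda, Airflow task, or EKS job needs nothing more than the incoming context string to attach its span to the same trace.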

What does this get you that CloudTrail and a transformation ledger can't provide? Most importantly, causal connections and timing. CloudTrail records that a CopyObject happened at 14:32:07. The ledger records that the copy was a pipeline step with specific input/output checksums. OTel records that the CopyObject was the final step in a processing chain that started with a vendor upload at 14:31:58, and that the chain included checksum verification (200ms), metadata extraction (50ms), audit log writing (30ms), and file promotion (120ms). When something fails or takes unexpectedly long, the trace shows you exactly where and why. CloudTrail and the ledger tell you what happened. OTel tells you how it happened, with causality intact.

The implementation effort is real but bounded. Instrumenting Lambda functions with the OTel SDK is measured in days per function, not weeks. The AWS-managed OTel Lambda layer handles most of the plumbing. The bigger investment is the backend: a trace collector (OTel Collector running as a Lambda extension or as a service in EKS), a trace storage backend (AWS X-Ray for a managed option, Grafana Tempo for organizations already running the Grafana stack, or Jaeger for a self-hosted approach), and a query interface. For a platform already running on EKS with Grafana for monitoring, Tempo is a natural fit. The traces complement existing metrics and logs in a single observability surface.

OTel doesn't replace CloudTrail or the transformation ledger. It layers on top. CloudTrail remains the authoritative record of API-level events for compliance. The ledger remains the structured lineage store for regulatory queries. OTel adds the real-time, causally connected tracing that makes debugging, performance analysis, and operational monitoring possible across the entire data pipeline.

Where this takes you

With lineage built into the platform, the compliance posture shifts from "we can reconstruct the history if you give us time" to "the history is already there, ask any question you want." Every dataset carries its provenance. Every transformation is linked to code versions and execution context. Every access is attributed. The regulator's question gets answered before it's asked.

This also changes how the data management team works day to day. When a statistician finds an unexpected value in a dataset, they don't open a ticket and wait for someone to trace it manually through three systems. They query the lineage chain: where did this value come from, which transformation touched it last, what was the input, what version of the processing code was running. Debugging becomes navigation, not archaeology.

This post closes the Data Delivery arc of this series. The progression was deliberate. We started with the protocol (SFTP and why it persists), built the pipeline that processes what arrives (event-driven ingestion), standardized how vendors connect to that pipeline (onboarding), made the delivery layer flexible enough to handle whatever protocol vendors bring (not just SFTP), and now traced every byte from arrival through transformation (lineage). Together, these form the data delivery foundation of a modern SCE platform: the infrastructure that gets clinical data from external sources into the analytical workspace where scientists do their regulated work, with every step auditable, traceable, and reproducible.
