Data Architectures - How to Think Like a System Designer
From Trade-offs to Blueprints: Navigating System Design in the Age of Data-Driven Decisions
Modern data engineering isn’t just about moving data from one place to another; it’s about designing systems that remain scalable, reliable, and adaptable as the volume, variety, and velocity of data grow. As organizations increasingly depend on data to drive decision-making in real time, the ability to architect robust pipelines and infrastructures has become a defining skill of a data engineer. But what exactly does it mean to “think like a system designer”?
At its core, system design in the data domain involves making informed trade-offs—between latency and throughput, consistency and availability, batch and stream, complexity and maintainability, and cost versus performance. These trade-offs are not merely academic; they directly impact whether a data system delivers insights in milliseconds or hours, whether it fails gracefully or catastrophically, and whether it remains agile in the face of business growth.
In our previous newsletters, we established foundational knowledge:
In Part 1, we explored what data engineering really means—the domain’s evolving definition, its intersection with analytics and DevOps, and its growing influence across industries.
In Part 2, we examined the lifecycle of data engineering, from ingestion and transformation to storage and serving.
Now in Part 3, we begin our transition from foundational understanding to architectural fluency. This issue is about cultivating the mindset of a system designer: someone who sees the bigger picture, understands how components interact over time, and knows how to build for both present needs and future scale.
As we move forward, we’ll explore some of the most widely used architectural paradigms in data engineering today—like batch processing, Lambda and Kappa architectures—and how cloud-native designs have reshaped these mental models. Along the way, we’ll emphasize real-world examples and decision-making frameworks that equip you to choose wisely based on your business and technical requirements.
Imagine you’re building an analytics platform for a fast-scaling e-commerce company. Should you focus on real-time order tracking or accurate historical reporting? Do you need to support ad-hoc SQL queries, or do you care more about streaming events with sub-second latency? These are not just engineering decisions—they’re architectural decisions, and this newsletter is here to help you make them well.
In the pages ahead, you’ll gain the vocabulary and mental models to confidently approach system design as a data engineer—not just writing pipelines, but designing sustainable systems.
Why Architecture Matters in Data Engineering
Architecture in data engineering isn’t just about selecting the right tools—it’s about designing a cohesive system that balances trade-offs while meeting business requirements at scale. As data volumes grow and demands for low-latency insights rise, architecture becomes the framework that holds the entire ecosystem together.
To understand why architecture is central to modern data engineering, consider the contrast between building a prototype and building a platform. A prototype might move data from source to dashboard, but a platform ensures that data is consistently reliable, easy to monitor, recoverable from failure, scalable under load, and governed for security and compliance. This platform-centric thinking is what architectural decisions enable.
At the heart of data architecture are five critical trade-offs:
Latency vs Throughput: Latency is the time delay between data generation and data consumption, while throughput refers to the volume of data processed over time. For example, financial tick data analysis demands low latency, whereas aggregating user clickstream data hourly is a high-throughput use case with relaxed latency needs.
Batch vs Streaming: Batch processing handles data in fixed-size chunks, suitable for use cases like nightly sales reports. Streaming processes data as it arrives, necessary for real-time fraud detection or live personalization. The choice between them shapes the entire processing pipeline.
Consistency vs Availability: Stemming from the CAP theorem, systems must often choose between strong consistency and high availability in distributed setups. For example, an inventory management system may prioritize availability to avoid downtime, even if some data becomes temporarily inconsistent.
Data Freshness vs System Complexity: The desire for real-time insights often introduces architectural complexity. Fresh data is valuable, but maintaining a streaming architecture may require more infrastructure, monitoring, and failover logic than batch alternatives.
Operational Complexity vs Maintainability: More components often mean more power—but also more potential for failure. A simple Airflow + BigQuery setup might be easier to manage than a microservices-based event-driven architecture with Kafka, Spark, and Flink.
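To make the first of these trade-offs concrete, here is a minimal, dependency-free Python sketch of a micro-batching buffer. The sink function, batch size, and flush interval are hypothetical knobs: turning them up improves throughput per write, while turning them down shrinks the latency before an event becomes visible downstream.

```python
import time


class MicroBatcher:
    """Buffers events and flushes them in batches.

    Larger max_batch / max_wait_s -> higher throughput per write,
    but higher end-to-end latency for any single event.
    """

    def __init__(self, sink, max_batch=500, max_wait_s=5.0):
        self.sink = sink              # e.g., a bulk insert into a warehouse (hypothetical)
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self._buffer = []
        self._last_flush = time.monotonic()

    def add(self, event):
        self._buffer.append(event)
        if (len(self._buffer) >= self.max_batch
                or time.monotonic() - self._last_flush >= self.max_wait_s):
            self.flush()

    def flush(self):
        if self._buffer:
            self.sink(self._buffer)   # one bulk write instead of many tiny ones
            self._buffer = []
        self._last_flush = time.monotonic()


if __name__ == "__main__":
    batcher = MicroBatcher(sink=lambda batch: print(f"wrote {len(batch)} events"))
    for i in range(1200):
        batcher.add({"event_id": i})
    batcher.flush()  # drain whatever is left
```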
In designing data systems, architects must recognize that no solution is perfect. Each trade-off is a deliberate decision that reflects business needs, team capabilities, and the evolving data maturity of the organization.
Take, for example, a media streaming platform: while recommendation engines demand low-latency pipelines to serve real-time suggestions, batch jobs might be used for offline analytics, content popularity analysis, or financial reporting. In such systems, architectural design is not just about supporting different data workloads—it’s about orchestrating them to work in harmony.
In the rest of this newsletter, we’ll explore how common architectural patterns—batch, Lambda, and Kappa—approach these trade-offs differently, and how cloud-native principles are reshaping these paradigms for modern workloads.
Architectural Patterns in Modern Data Systems
Designing a data system requires more than just selecting the right tools—it involves architecting how data flows through an ecosystem. This flow determines how efficiently, reliably, and quickly data is collected, processed, stored, and served. Over the past decade, three dominant architectural patterns have emerged as mental models for data engineers: Batch Processing, Lambda Architecture, and Kappa Architecture. These designs shape how modern systems manage latency, throughput, complexity, and scalability.
1. Batch Processing — Traditional Data Warehouse Flow
The batch architecture is the most mature and widely used model in data engineering. It revolves around periodic ETL processes, where data is ingested from operational sources (e.g., logs, databases, APIs), transformed in scheduled jobs, and then loaded into a centralized data warehouse.
This architecture is highly linear—data moves step-by-step, and downstream users only see updated information after the batch process completes. The latency is high, often measured in hours, but the operational simplicity and stability make this architecture ideal for analytics use cases that don’t demand real-time freshness.
Use Case Example:
A retail chain runs a nightly job to process all sales data, applying business rules before loading it into Snowflake. The leadership team accesses refreshed dashboards every morning.
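As a rough sketch of how such a nightly job might be orchestrated, assuming an Airflow 2.x deployment; the task bodies, table names, and schedule below are placeholders rather than a reference implementation:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_sales():
    ...  # pull yesterday's orders from the operational database (hypothetical helper)


def transform_sales():
    ...  # apply business rules: returns, margins, currency conversion


def load_to_warehouse():
    ...  # bulk-load the transformed output into the warehouse (e.g., Snowflake COPY INTO)


with DAG(
    dag_id="nightly_sales_summary",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",  # run at 02:00 so dashboards are fresh well before 9 AM
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_sales)
    transform = PythonOperator(task_id="transform", python_callable=transform_sales)
    load = PythonOperator(task_id="load", python_callable=load_to_warehouse)

    extract >> transform >> load
```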
Pros:
Simpler to build and debug
Mature tooling ecosystem
Cost-efficient at lower scale
Cons:
High latency (data isn’t fresh between batch runs)
Poor fit for real-time insights or alerting
Less responsive to late-arriving data
Where it shines:
Financial reporting, historical analytics, compliance dashboards.
2. Lambda Architecture — Dual Pipeline for Speed + Accuracy
The Lambda Architecture, coined by Nathan Marz, introduces a dual-path design in which data is processed in two parallel flows: a batch pipeline for comprehensive, accurate historical analysis, and a speed layer for real-time data ingestion and low-latency outputs.
The batch layer stores raw, immutable data and periodically recomputes views using tools like Spark or Hadoop. The speed layer processes only new, real-time data using stream engines like Kafka Streams, Flink, or Spark Streaming. The serving layer merges outputs from both to present unified results to users or applications.
Use Case Example:
A ride-sharing app tracks real-time driver locations via the speed layer and recalculates weekly earnings using batch processes. Both feed into the same dashboard for seamless UX.
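A simplified sketch of the serving-layer idea behind that dashboard: the batch view answers for everything up to the last recompute, and the speed layer covers only what has arrived since. The lookup functions and values below are hypothetical stand-ins for whatever batch and real-time stores you actually use.

```python
from datetime import datetime, timedelta


def batch_view_earnings(driver_id, until):
    # Hypothetical lookup into the batch serving store (recomputed weekly by Spark jobs).
    return 412.50  # placeholder value


def speed_view_earnings(driver_id, since):
    # Hypothetical lookup into a real-time store fed by the stream processor.
    return 18.75  # placeholder value


def serve_earnings(driver_id, last_batch_run):
    """Lambda's serving layer: merge batch accuracy with speed-layer freshness."""
    historical = batch_view_earnings(driver_id, until=last_batch_run)
    recent = speed_view_earnings(driver_id, since=last_batch_run)
    return historical + recent


last_batch = datetime.now() - timedelta(hours=6)
print(serve_earnings("driver-42", last_batch))  # 431.25
```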
Pros:
Combines low-latency outputs with batch-level accuracy
Resilient to late or out-of-order data
Scalable for diverse workloads
Cons:
Two separate code paths (batch + stream) to maintain
Higher testing and operational complexity
More fragile data reconciliation logic
Where it shines:
Fintech dashboards, IoT analytics, hybrid latency-tolerant use cases.
3. Kappa Architecture — Stream-First Unified Model
The Kappa Architecture, introduced by Jay Kreps (creator of Kafka), eliminates the batch layer entirely. Instead of duplicating logic across batch and stream paths, Kappa advocates a stream-only design where all data is treated as an immutable log of events. Stream processors consume this log and compute real-time analytics, materialized views, or alerting systems.
Kappa supports reprocessing by replaying historical events from Kafka, enabling late correction or recomputation without needing a separate batch system.
Use Case Example:
A cybersecurity company streams logs from thousands of firewalls and intrusion detection systems. Kafka captures the logs; Flink processes them in real-time to detect threats and trigger alerts.
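A minimal sketch of that kind of stream-first consumer, assuming the kafka-python client, a hypothetical firewall_logs topic, and a placeholder detection rule. Because the topic is a replayable log, pointing a fresh consumer group at the earliest offset is all it takes to reprocess history with improved logic:

```python
import json

from kafka import KafkaConsumer  # assumes the kafka-python package

consumer = KafkaConsumer(
    "firewall_logs",                      # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="threat-detector-v2",        # a new group id replays from the beginning
    auto_offset_reset="earliest",         # enables reprocessing of historical events
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

failed_logins = {}  # naive in-memory state keyed by source IP

for message in consumer:
    event = message.value
    if event.get("type") == "login_failure":
        ip = event.get("source_ip", "unknown")
        failed_logins[ip] = failed_logins.get(ip, 0) + 1
        if failed_logins[ip] >= 10:       # placeholder threshold
            print(f"ALERT: possible brute force from {ip}")
            failed_logins[ip] = 0
```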
Pros:
Simplifies architecture and code maintenance
Excellent for real-time, event-driven systems
Seamless scalability with log-based design
Cons:
Requires strong observability and stream infrastructure
Costly reprocessing at scale
Less efficient for batch-heavy workloads
Where it shines:
Fraud detection, ad tech pipelines, recommendation systems, telemetry.
These architectures are foundational mental models in a data engineer’s toolkit. In the next section, we’ll explore how cloud-native innovations—like serverless pipelines, data lakehouses, and event mesh systems—have evolved and hybridized these traditional paradigms for the demands of the modern data stack.
Alright, take a moment to absorb that information. We've covered a lot of ground on architectural patterns, and before we dive into the cloud-native evolution, it's a good time for a quick mental reset.
While you're processing everything, I'd love to hear your thoughts.
Got a question about Lambda vs. Kappa?
Have you faced a tough architectural decision at work?
Or perhaps you just want to share a brief thought.
Let’s Connect!
Evolution to Modern Cloud-Native Architectures
The rise of cloud computing has significantly altered the landscape of data architecture. While Batch, Lambda, and Kappa still provide foundational mental models, modern data systems are no longer restricted by on-premise hardware or rigid infrastructure. Instead, they embrace cloud-native patterns—which emphasize modularity, scalability, serverless design, and decoupled components—to handle ever-increasing volumes and velocities of data with greater agility.
Let’s examine how cloud-native thinking reimagines traditional architectures and pushes the boundaries of what’s possible in modern data engineering.
1. From Monoliths to Modular, Managed Services
In traditional systems, the data pipeline often ran end-to-end on a single cluster: ingest, transform, store, and serve. In cloud-native systems, these stages are decoupled into independently managed services, allowing for flexibility, cost efficiency, and parallel evolution.
Example Architecture Flow: managed ingestion (e.g., Pub/Sub or Kinesis) → managed processing (e.g., Dataflow or Databricks) → decoupled cloud storage → warehouse and serving layer → BI and ML consumers.
Each layer scales independently and can be replaced without rearchitecting the entire pipeline. For instance, you can switch from Dataflow to Databricks without changing your storage backend.
2. ELT over ETL — Let the Warehouse Do the Heavy Lifting
In legacy data architectures, ETL was the norm: data was transformed before loading into the warehouse. But cloud-native warehouses like BigQuery and Snowflake are compute-optimized, enabling a shift to ELT (Extract → Load → Transform). In ELT, raw data is ingested first, and then transformation is handled within the warehouse using SQL or tools like dbt.
Why this matters:
Encourages schema-on-read and late binding, which increases flexibility.
Supports versioned transformations, improving reproducibility.
Simplifies orchestration and allows analysts to work directly on raw data.
Flow: Extract from sources → Load raw data into the warehouse → Transform in place with SQL or dbt → serve curated models to analysts.
This transformation inversion has been critical for agile teams working with fast-changing schemas or semi-structured data such as JSON, alongside columnar formats like Parquet.
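A rough ELT sketch, assuming the google-cloud-bigquery client and hypothetical dataset and table names: raw rows are loaded untouched, and the transformation happens afterwards as SQL inside the warehouse (the same step a dbt model would manage for you):

```python
from google.cloud import bigquery  # assumes google-cloud-bigquery is installed and authenticated

client = bigquery.Client()

# Extract + Load: land the raw events untouched (schema-on-read friendly).
raw_rows = [
    {"order_id": "o-1001", "amount": 59.90, "currency": "EUR", "ts": "2024-05-01T10:03:00Z"},
    {"order_id": "o-1002", "amount": 12.00, "currency": "USD", "ts": "2024-05-01T10:04:30Z"},
]
client.load_table_from_json(raw_rows, "analytics.raw_orders").result()  # hypothetical table

# Transform: let the warehouse do the heavy lifting with plain SQL.
client.query(
    """
    CREATE OR REPLACE TABLE analytics.daily_revenue AS
    SELECT DATE(CAST(ts AS TIMESTAMP)) AS day, currency, SUM(amount) AS revenue
    FROM analytics.raw_orders
    GROUP BY day, currency
    """
).result()
```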
3. Streaming-First and Event-Driven Architecture
As the cost of managing real-time data infrastructure has dropped, many teams are adopting streaming-first architectures by default. Tools like Kafka, Pub/Sub, Kinesis, and Pulsar act as event buses that decouple producers and consumers in a pipeline.
With stream-native storage engines (e.g., Apache Hudi, Delta Lake, Iceberg), you can achieve near real-time data lakes that support both streaming and batch access patterns. This creates what’s often called a Lakehouse Architecture—merging the flexibility of data lakes with the performance and structure of warehouses.
Flow: event producers → event bus (Kafka, Pub/Sub, Kinesis, or Pulsar) → stream processing → lakehouse tables (Hudi, Delta Lake, or Iceberg) → streaming and batch consumers alike.
This architecture is resilient, scalable, and real-time ready by design.
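A condensed PySpark sketch of that flow, assuming a Spark session with the Delta Lake package available; the Kafka topic, schema, and paths are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("orders-lakehouse").getOrCreate()

schema = StructType([
    StructField("order_id", StringType()),
    StructField("status", StringType()),
    StructField("amount", DoubleType()),
])

# Read the event stream from Kafka and parse the JSON payload.
orders = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "orders")                       # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("o"))
    .select("o.*")
)

# Continuously append into a Delta table that batch jobs can also query.
query = (
    orders.writeStream.format("delta")
    .option("checkpointLocation", "/lakehouse/_checkpoints/orders")  # placeholder path
    .outputMode("append")
    .start("/lakehouse/orders")                                      # placeholder path
)
query.awaitTermination()
```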
4. Decoupled Storage and Compute
Cloud platforms enable separation of storage and compute, a major architectural milestone. In monolithic systems, scaling compute often meant duplicating storage and vice versa. Today, services like BigQuery and Snowflake allow you to scale compute elastically without moving data around.
Benefits:
Run multiple workloads (e.g., ML training + BI queries) in parallel on the same data
Pay only for the compute you use
Enable multi-tenant and multi-team workflows with centralized storage
5. Orchestration, Observability, and DataOps
Modern cloud-native architectures aren’t complete without observability and orchestration layers. Tools like Airflow, Dagster, and Prefect allow you to orchestrate complex pipelines as DAGs, while Datadog, OpenLineage, and Monte Carlo ensure pipeline reliability through monitoring, lineage, and anomaly detection.
This shift reflects the rise of DataOps: applying DevOps principles to data workflows—versioning, CI/CD, rollback, and automated testing.
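In that spirit, here is a tiny illustration of automated testing for data: pytest-style checks that could run in CI before a transformed table is published. The loading helper and rules are hypothetical; tools like dbt tests or Great Expectations formalize the same idea at scale.

```python
import datetime as dt


def load_daily_revenue():
    # Hypothetical: in CI this would query the freshly built table (or a staging copy of it).
    return [
        {"day": dt.date(2024, 5, 1), "currency": "EUR", "revenue": 10423.50},
        {"day": dt.date(2024, 5, 1), "currency": "USD", "revenue": 8311.20},
    ]


def test_revenue_is_never_negative():
    assert all(row["revenue"] >= 0 for row in load_daily_revenue())


def test_one_row_per_day_and_currency():
    rows = load_daily_revenue()
    keys = [(row["day"], row["currency"]) for row in rows]
    assert len(keys) == len(set(keys)), "duplicate day+currency rows in daily_revenue"
```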
Modern architectures are:
Composable: each layer is pluggable and replaceable
Serverless-first: rely on managed infrastructure with autoscaling
Streaming-native: treat all data as real-time by default
Observable: built for debugging, lineage, and SLA enforcement
These innovations don’t discard traditional patterns like Batch, Lambda, or Kappa—they evolve them into more modular, resilient, and cloud-ready forms. As a system designer, your role is to mix and match these components based on your latency tolerance, data volume, team size, and compliance needs.
Curious how your current setup aligns with these modern principles? We'll break down a decision-making framework next to help you choose the right architecture for your specific use case.
How to Choose an Architecture: A Decision-Making Framework
Selecting the right data architecture is not about following trends—it’s about making principled decisions that reflect business priorities, data characteristics, and operational constraints. In this section, we’ll outline a practical framework to evaluate architectures like Batch, Lambda, Kappa, or Modern Cloud-Native Hybrids based on measurable criteria and contextual trade-offs.
1. Start with Latency Requirements
Latency is the most immediate filter. Ask yourself:
Do users need this data within seconds or minutes?
Is it acceptable if the data is refreshed hourly or daily?
Decision Guide:
Use Batch for high-latency-tolerant workflows (e.g., monthly revenue reports, regulatory filings).
Use Lambda if you need both low-latency alerts and full historical accuracy (e.g., ad attribution + clickstream analysis).
Use Kappa or Streaming-first if real-time decision-making is essential (e.g., fraud detection, recommendation engines).
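The decision guide above can be compressed into a first-pass helper; the thresholds are illustrative assumptions, not hard rules.

```python
def first_pass_architecture(freshness_seconds, needs_full_history):
    """Map latency tolerance (how stale data may be) to a starting-point pattern."""
    if freshness_seconds >= 3600:      # hourly or daily refresh is acceptable
        return "Batch"
    if needs_full_history:             # low latency AND accurate historical views
        return "Lambda"
    return "Kappa / streaming-first"   # real-time decision-making dominates


print(first_pass_architecture(86400, needs_full_history=False))  # Batch
print(first_pass_architecture(30, needs_full_history=True))      # Lambda
print(first_pass_architecture(5, needs_full_history=False))      # Kappa / streaming-first
```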
2. Assess Data Volume and Frequency
The shape and size of your data significantly affect architecture choice.
High volume + high frequency: Kappa or streaming-first setups offer scalability and low-latency ingestion.
Moderate volume, daily batch loads: Batch pipelines via Airflow + BigQuery or dbt are usually sufficient.
Massive historical data recomputations: Lambda or batch reprocessing via Spark/Hadoop may be necessary.
Heuristic Rule:
The more data you generate in a short period, the stronger your incentive to adopt streaming or append-only architectures.
3. Understand Operational Complexity and Team Maturity
Ask: Can your team manage a dual-code pipeline? Are you equipped to debug streaming failures?
Batch offers lower operational overhead.
Lambda demands strong DevOps/DataOps maturity.
Kappa is logically simpler but demanding on the infrastructure side (it needs reliable stream processing, backpressure handling, and observability).
If you're a small team or a startup, lean towards simplified batch or streaming-first cloud-native platforms (e.g., Cloud Functions + BigQuery streaming), avoiding unnecessary architectural burden.
4. Consider Cost and Resource Efficiency
Cost is often a deciding factor in architecture—not just in terms of cloud bills but also engineering time.
Batch is cost-efficient for low-frequency jobs.
Streaming platforms can be expensive if not volume-justified.
Serverless cloud-native solutions (e.g., BigQuery + Pub/Sub) offer elastic pricing.
Framework Tip:
Calculate the Cost-to-Latency Benefit Ratio—if your dashboard needs 1-min updates but costs 10x more to run as a stream pipeline, consider if that freshness is truly worth it.
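A back-of-the-envelope version of that tip, with made-up numbers purely for illustration:

```python
# Assumed monthly figures for a single dashboard; purely illustrative.
batch_cost, batch_freshness_s = 300.0, 3600      # hourly refresh via batch jobs
stream_cost, stream_freshness_s = 3000.0, 60     # one-minute refresh via streaming

extra_cost = stream_cost - batch_cost                    # 2700.0 extra per month
freshness_gain = batch_freshness_s / stream_freshness_s  # 60x fresher data

# Rough decision input: what the business believes that extra freshness is worth per month.
estimated_value_of_freshness = 800.0                     # assumed estimate

print(f"Extra spend: ${extra_cost:,.0f}/month for {freshness_gain:.0f}x fresher data")
print("Worth it" if estimated_value_of_freshness >= extra_cost else "Probably not worth it yet")
```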
5. Evaluate Failure Tolerance and SLAs
If your business requires high availability, replayability, and correctness guarantees, architectures that support immutable logs, event replay, and exactly-once semantics are crucial.
Lambda offers fault-tolerant accuracy across two systems.
Kappa relies on replayable logs with stateful streaming engines.
Batch pipelines often fail silently unless observability is added.
For SLA-bound applications (e.g., alert systems, real-time risk analysis), Kappa with full observability and failover strategies is preferred.
6. Match to Use Case Examples
As a quick recap, the patterns map naturally onto the examples we have already seen:
Monthly revenue reports, regulatory filings, compliance dashboards → Batch
Ad attribution combined with clickstream analysis, hybrid latency needs → Lambda
Fraud detection, recommendation engines, live telemetry → Kappa / streaming-first
Choosing a data architecture is fundamentally about trade-offs. The key is to align your system design with what your business truly needs now, while allowing room to evolve. Over-architecting early leads to waste; under-architecting leads to bottlenecks.
Rather than fixating on one model, adopt a principled mindset: prioritize simplicity, observability, and scalability—only optimizing when clear pressure demands it.
In the next section, we’ll put this into practice with a real-world case study comparison, contrasting architectural choices in an e-commerce analytics platform.
Case Study: Designing Data Architecture for an E-Commerce Analytics Platform
Now that we’ve explored architectural models and how to choose among them, let’s anchor these concepts with a practical, real-world case study. In this section, we’ll walk through the design decisions behind building a data architecture for an e-commerce analytics platform. You’ll see how different architectural paradigms impact latency, complexity, scalability, and cost—through the lens of multiple data use cases.
Business Context
Imagine you're designing the backend for an e-commerce platform similar to Flipkart, Amazon, or Meesho. This system handles:
Order management
Real-time inventory updates
Customer behavior analytics
Sales reporting
Promotional performance tracking
Your stakeholders include product managers, marketing teams, supply chain analysts, and the executive team—each with varying latency needs, query patterns, and data freshness expectations.
Use Case 1: Daily Sales Summary for Leadership Dashboards
Requirement:
Leadership wants a sales dashboard that provides metrics like daily revenue, orders placed, gross margin, and return rates—refreshed every morning by 9 AM.
Recommended Architecture: Batch Processing
Pipeline Flow: operational order databases → nightly batch ETL jobs → warehouse tables (e.g., Snowflake or BigQuery) → refreshed leadership dashboards.
Rationale:
Data freshness tolerance is high (daily).
This workflow benefits from low-cost batch compute and a structured warehouse schema.
Business users get consistent, tested reports that don’t rely on streaming infrastructure.
Why not streaming?
Unnecessary overhead for daily cadence. Batch is operationally simpler and more cost-efficient here.
Use Case 2: Real-Time Inventory Tracking
Requirement:
The operations team needs a real-time view of product inventory across warehouses to avoid stockouts, update listings, and trigger restocks dynamically.
Recommended Architecture: Kappa (Streaming-First)
Pipeline Flow: warehouse and order events → Kafka → stream processor (e.g., Flink) → low-latency inventory store and alerting → operations views.
Rationale:
Data needs to be processed within seconds to reflect current stock levels.
Stream-first architecture allows real-time ingestion, aggregation, and lookup.
Kafka provides an ordered, replayable event log per partition, and with its transactional features the pipeline can achieve exactly-once processing; both properties help keep inventory counts correct.
Operational Benefit:
You can also plug alerts into this stream to automatically trigger notifications when stock drops below thresholds.
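A small sketch of that alerting step, with the event shape, threshold, and notification hook all as placeholders; in production this logic would live inside the stream processor rather than a plain loop:

```python
from collections import defaultdict

REORDER_THRESHOLD = 20            # placeholder threshold per SKU
stock = defaultdict(int)          # running stock level keyed by (warehouse, sku)


def notify_ops(warehouse, sku, level):
    # Placeholder: in practice this might post to Slack, PagerDuty, or a restock queue.
    print(f"LOW STOCK: {sku} at {warehouse} is down to {level} units")


def handle_inventory_event(event):
    """Apply one inventory change event and alert if stock falls below the threshold."""
    key = (event["warehouse"], event["sku"])
    stock[key] += event["delta"]              # +N for a restock, -N for an order
    if stock[key] < REORDER_THRESHOLD:
        notify_ops(event["warehouse"], event["sku"], stock[key])


# In the real pipeline these events would arrive from the stream processor's output.
for e in [
    {"warehouse": "BLR-1", "sku": "SKU-123", "delta": +50},
    {"warehouse": "BLR-1", "sku": "SKU-123", "delta": -35},
]:
    handle_inventory_event(e)
```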
Use Case 3: Customer Behavior Analytics
Requirement:
The product team wants to analyze clickstream events like page views, cart additions, and product searches to generate funnel drop-off reports, session paths, and recommendations.
Recommended Architecture: Lambda
Pipeline Flow: clickstream events → event bus → speed layer for real-time behavioral metrics, in parallel with a batch layer for historical recomputation → merged serving layer → product analytics and recommendations.
Rationale:
Real-time streaming layer provides instant behavioral insights, personalization, and triggers.
Batch layer is essential for long-term analytics, ML feature generation, and deep cohort analysis.
Lambda allows you to merge the speed of real-time with the depth and accuracy of batch.
Operational Tradeoff:
You must manage dual pipelines and ensure eventual consistency in merged views.
Use Case 4: Marketing Attribution and Campaign Performance
Requirement:
The marketing team wants to track which ads, campaigns, or email journeys led to purchases—across channels like Google Ads, Meta, email, SMS, etc.
Recommended Architecture: Lambda or Batch with Change Data Capture (CDC)
Pipeline Flow: ad, email, and SMS engagement events plus CDC from the orders database → scheduled micro-batches or streaming joins → attribution models → marketing dashboards.
Rationale:
Some signals (e.g., impressions, clicks) come in real-time, but conversion signals (e.g., purchases) arrive later.
Attribution windows span hours to days, so sub-second latency is not essential.
Lambda adds streaming sophistication if needed, but many platforms still use scheduled micro-batches or CDC jobs to capture state changes.
Alternative Strategy:
Use streaming ingestion with buffered joins and windowed aggregations in tools like Flink when real-time campaign feedback is needed.
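To make the windowing idea tangible without pulling in Flink itself, here is a plain-Python sketch of a tumbling window keyed by campaign; the window size and event shape are assumptions:

```python
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 300  # five-minute tumbling windows (illustrative)


def window_start(ts):
    """Round an event timestamp down to the start of its tumbling window."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % WINDOW_SECONDS, tz=timezone.utc)


clicks_per_campaign = defaultdict(int)  # keyed by (window start, campaign_id)


def on_click(event):
    key = (window_start(event["ts"]), event["campaign_id"])
    clicks_per_campaign[key] += 1


on_click({"ts": datetime(2024, 5, 1, 10, 2, 13, tzinfo=timezone.utc), "campaign_id": "spring-sale"})
on_click({"ts": datetime(2024, 5, 1, 10, 3, 47, tzinfo=timezone.utc), "campaign_id": "spring-sale"})
print(dict(clicks_per_campaign))  # both clicks land in the 10:00-10:05 UTC window
```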
We just walked through several common e-commerce scenarios and their architectural solutions. Which of these examples resonated most with your own challenges or current projects?
Share your thoughts, or tell us about an architectural dilemma you're currently tackling.
Conclusion
No single architecture is universally superior—each one thrives when it aligns with the business domain, latency constraints, and team maturity. As a system designer, your role is to synthesize these requirements and deliver a resilient, maintainable pipeline.
Designing modern data systems isn’t just about plugging technologies together—it’s about adopting a systems mindset that balances reliability, latency, scalability, and cost. In this newsletter, we transitioned from foundational concepts to architectural decision-making, unpacking how Batch, Lambda, Kappa, and Cloud-Native architectures serve different business needs.
Each architecture reflects a philosophy:
Batch values simplicity and predictability.
Lambda aims to blend real-time responsiveness with deep accuracy.
Kappa embraces the streaming-first paradigm—where everything is an event.
Modern cloud-native systems decouple complexity into composable services, prioritizing observability, flexibility, and horizontal scalability.
Understanding these models equips you to build solutions that scale with your organization, instead of against it.
Final Reflection Prompt
“If your platform suddenly had to support 10x more users and real-time decision-making, would your current architecture hold up—or collapse?”
System design isn’t static—it’s a living process of trade-offs. Choosing when to introduce complexity, how to manage latency boundaries, and when to decouple components are all markers of a thoughtful data engineer.
If this deep dive helped you frame your thinking around data architectures, consider sharing your thoughts:
Comment your biggest "aha!" moment
Share the architecture you’re working with—and where it struggles
Forward this newsletter to a team member currently "just adding another Airflow DAG" without a system-wide view
In the next newsletter, we’ll move beyond architecture and dive into one of the most under-discussed but deeply impactful aspects of data engineering:
Data Modeling for Scale - Why SQL Still Reigns Supreme
We’ll cover dimensional modeling, normalization, denormalization, and modern paradigms like wide tables, columnar formats, and modeling for lakehouses and streaming.
Until then—keep architecting boldly, but wisely.