Systems Design – Learning Series

As a data engineer, it’s essential to understand how systems function in order to design data solutions that make the most efficient use of available resources. Data must align with core data quality principles, which means choosing a system architecture that best meets our specific needs.

I invite you to join me on a journey to explore system design and how it can be tailored to support modern data requirements. I am reading the book System Design Interview by Alex Xu, and I will share my notes here. By the end of this journey, we should be equipped to select the most effective architecture to handle Big Data workloads and ensure high-quality, scalable solutions.

Key Concepts:

Domain Name System (DNS): a fundamental part of the internet that translates human-readable domain names (like google.com) into machine-readable IP addresses. DNS is usually a paid service provided by third parties rather than hosted on our own servers.
Internet Protocol (IP) Address: a numerical label, such as 192.0.2.1, assigned to a device connected to a computer network.
Private IP: an IP address that is reachable only by servers in the same network and is unreachable over the internet.
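As a quick illustration of the private/public distinction, here is a minimal sketch using Python's standard ipaddress module. The helper name is_private and the sample addresses are my own choices for the example. Note that the module's is_private flag covers the RFC 1918 private ranges plus a few other special-purpose ranges that are not globally reachable.

```python
import ipaddress

def is_private(addr: str) -> bool:
    """Return True if the standard library classifies addr as private."""
    return ipaddress.ip_address(addr).is_private

# Sample addresses chosen for illustration only.
for addr in ["10.0.0.5", "172.16.4.1", "192.168.1.20", "8.8.8.8"]:
    print(addr, "private" if is_private(addr) else "public")
```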
HyperText Transfer Protocol (HTTP): the foundation of the World Wide Web. It is a request-response protocol that clients such as browsers use to load web pages and other resources, which are connected through hypertext links.
JavaScript Object Notation (JSON): a text format for storing and transporting data. JSON is “self-describing” and easy to understand.
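To make "storing and transporting data" concrete, the sketch below round-trips a small record through JSON with Python's built-in json module. The order record and its field names are invented for the example.

```python
import json

# A hypothetical record; the field names are made up for illustration.
order = {"order_id": 42, "customer": "Ada", "items": ["book", "pen"], "paid": True}

text = json.dumps(order)     # Python dict -> JSON text that can be sent or stored
restored = json.loads(text)  # JSON text -> Python dict

print(text)
print(restored == order)
```

The keys travel with the values, which is what "self-describing" refers to: a reader can tell what each value means without a separate schema.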
Transactional Outbox Pattern: an approach that addresses data inconsistency between a database and a message broker. We keep two tables: an object table (e.g. orders) and an events table. We write to both tables in the same database transaction, so either both writes commit or neither does. A separate process then reads from the events table and publishes the events to a message broker.
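The pattern can be sketched with an in-memory SQLite database. This is a simplified illustration, not a production design: the table layout, the published flag, the function names place_order and relay_once, and the publish callback (a stand-in for a real message broker client) are all assumptions made for the example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "payload TEXT NOT NULL, published INTEGER NOT NULL DEFAULT 0)"
)

def place_order(order_id: int, item: str) -> None:
    # One transaction: the order row and the event row commit together or not at all.
    with conn:
        conn.execute("INSERT INTO orders (id, item) VALUES (?, ?)", (order_id, item))
        payload = json.dumps({"type": "order_placed", "order_id": order_id})
        conn.execute("INSERT INTO events (payload) VALUES (?)", (payload,))

def relay_once(publish) -> int:
    """Publish pending events, mark them as sent, and return how many were sent."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE published = 0 ORDER BY id"
    ).fetchall()
    for event_id, payload in rows:
        publish(payload)  # hypothetical broker call; here just a callback
        with conn:
            conn.execute("UPDATE events SET published = 1 WHERE id = ?", (event_id,))
    return len(rows)

sent = []  # stands in for the message broker
place_order(1, "book")
relay_once(sent.append)
print(sent)
```

If the relay crashes after publishing but before marking the row, the event is published again on the next run, so consumers should tolerate duplicates.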
ACID properties (Atomicity, Consistency, Isolation, Durability) ensure database transactions are processed reliably, maintaining data integrity even during failures or concurrent access. They guarantee that transactions either succeed completely or have no effect, move data from one valid state to another, run independently of one another, and persist after they are committed.
- Atomicity (All or Nothing): A transaction is treated as a single unit; if any part fails, the entire transaction fails and is rolled back.
- Consistency (Valid State): Ensures the database moves from one valid state to another, maintaining rules and constraints.
- Isolation (Independent Execution): Concurrent transactions do not interfere with each other, preventing intermediate, uncommitted data from being visible.
- Durability (Permanent Changes): Once a transaction is committed, its changes are permanent and survive system failures.
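Atomicity and consistency are the easiest of the four to see in a few lines. The sketch below uses an in-memory SQLite database with a CHECK constraint as the consistency rule. The accounts table, the names, and the transfer helper are hypothetical. The transfer deliberately credits before it debits, so a failed debit shows the earlier credit being rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"  # consistency rule
)
with conn:
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )

def transfer(src: str, dst: str, amount: int) -> bool:
    """Move amount from src to dst; return False if the transaction was rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances() -> dict:
    return dict(conn.execute("SELECT name, balance FROM accounts"))

# Alice cannot cover 500, so the debit violates the CHECK and the credit is undone.
print(transfer("alice", "bob", 500), balances())
```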
Data Tiering: a storage management strategy that categorizes data based on access frequency, performance requirements, and value, and moves it across different storage types (hot, warm, cold) to optimize cost and performance.
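A tiering rule based only on access recency might look like the sketch below. The 7-day and 90-day thresholds and the pick_tier function are assumptions for illustration. Real policies usually also weigh performance requirements and the value of the data, as the definition above notes.

```python
from datetime import datetime, timedelta

# Assumed thresholds; real values depend on the workload and storage costs.
HOT_WINDOW = timedelta(days=7)
WARM_WINDOW = timedelta(days=90)

def pick_tier(last_accessed: datetime, now: datetime) -> str:
    """Classify data as hot, warm, or cold by how recently it was accessed."""
    age = now - last_accessed
    if age <= HOT_WINDOW:
        return "hot"
    if age <= WARM_WINDOW:
        return "warm"
    return "cold"

now = datetime(2025, 1, 31)
for last in [datetime(2025, 1, 30), datetime(2024, 12, 1), datetime(2023, 6, 1)]:
    print(last.date(), pick_tier(last, now))
```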
Cell-based architecture is a software design pattern that improves system resilience and scalability by partitioning applications into small, isolated, and autonomous units called “cells.”
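One piece of a cell-based design is the routing layer that decides which cell serves a given customer. The sketch below is one simple way to do that, assuming three hypothetical cells and a stable hash of the customer ID. It is not the only approach: a plain modulo reassigns many customers when the number of cells changes, so an explicit mapping table is a common alternative.

```python
import hashlib

# Hypothetical cell names; each would be an isolated copy of the full stack.
CELLS = ["cell-a", "cell-b", "cell-c"]

def cell_for(customer_id: str) -> str:
    """Map a customer to one cell using a hash that is stable across runs."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(CELLS)
    return CELLS[index]

for customer in ["customer-1", "customer-2", "customer-3"]:
    print(customer, "->", cell_for(customer))
```

Because every request from one customer lands in the same cell, a failure inside one cell affects only the customers assigned to it.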
The General Data Protection Regulation (GDPR) is the European Union’s comprehensive data privacy law, in effect since May 25, 2018, that sets strict standards for processing the personal data of individuals in the EU.