Back to the Basics: Refreshing Some System Design Concepts

Databases

SQL vs NoSQL databases:

When to use Relational Database:

  • Data is well structured with clear relationships
  • Strong consistency and transactional integrity

When to use Non-Relational Database:

  • Super low-latency for rapid responses
  • Data is unstructured or semi-structured
  • Scalable storage for massive data volumes

Scaling

Vertical Scaling (Scale Up)

Adding more resources to our existing servers.

  • RAM
  • CPU

This is recommended for applications with low or moderate traffic.

However, you will eventually hit a hardware limit. Also, if the server fails, the whole system fails, since everything depends on a single server.

Horizontal Scaling (Scale Out)

Adding more servers to share the load

More suitable for large applications since it provides high fault tolerance: if a server goes down, the others keep serving traffic.

It also allows near-limitless scalability.

How do you implement it? How do you distribute users' requests? You can use a load balancer to distribute traffic among multiple servers. It also improves fault tolerance: when a server fails, the load balancer stops sending requests to it and routes them to the remaining servers.

Load Balancer

Distributes incoming network traffic across multiple servers to ensure no single server bears too much load.

7 strategies used in Load Balancing:

  • Round Robin: the simplest way to assign traffic. Say we have three servers: the load balancer sends the first request to server 1, the next to server 2, then to server 3, and repeats. This works well when all the servers have similar specifications, meaning the same capabilities.
  • Least Connections: redirects traffic to the server with the fewest active connections. For example, if server 1 has 10 active connections, server 2 has 9, and server 3 has 30, the load balancer sends the next request to server 2, which then has 10 connections. Useful for applications with sessions of variable length.
  • Least Response Time: focuses on server responsiveness. Suppose you have three servers: one highly responsive, one slow, and one of medium responsiveness. The load balancer picks the server with the lowest response time and the fewest active connections, so it prefers the most responsive server, but if that server has many active connections it falls back to the medium one. This is effective when you want to provide the fastest possible response time and your servers have different capabilities.
  • IP Hash: determines which server receives the request based on a hash of the client's IP address. Useful when you want a client to connect consistently to the same server.
  • Weighted Algorithms: servers are assigned weights based on their capacity and performance metrics, and the load balancer takes these weights into account when distributing traffic. If server 1 has a weight of 16, server 2 a weight of 32, and server 3 a weight of 64, the load balancer will favor server 3.
  • Geographical Algorithms: direct requests to the server that is geographically closest to the user. Useful for global services where reducing latency is important.
  • Consistent Hashing: one of the most popular strategies. A hash function inside the load balancer hashes the client (for example, by IP address) onto an imaginary ring on which the servers are also placed; the request goes to the nearest server clockwise on the ring. This ensures the same client connects to the same server, even as servers are added or removed.
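
The ring described above can be sketched in a few lines of Python (a real balancer would also place several "virtual nodes" per server so load spreads more evenly; the server names here are placeholders):

```python
import hashlib

def ring_hash(key: str) -> int:
    # Map any string onto a point on the ring (0 .. 2**32 - 1).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    def __init__(self, servers):
        # Each server is placed at a fixed point on an imaginary ring.
        self.ring = sorted((ring_hash(s), s) for s in servers)

    def get_server(self, client_ip: str) -> str:
        # Walk clockwise from the client's hash to the first server point.
        h = ring_hash(client_ip)
        for point, server in self.ring:
            if h <= point:
                return server
        return self.ring[0][1]  # wrap around the ring

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
# The same client IP always maps to the same server.
assert ring.get_server("203.0.113.7") == ring.get_server("203.0.113.7")
```

Removing one server only remaps the clients that hashed to it; everyone else stays put, which is the main advantage over a plain `hash(ip) % n` scheme.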

Load balancers come with health-check features to make sure all the servers are up.

Examples of load balancers:

NGINX – software

HAProxy – software

Citrix ADC – hardware

The easiest option is a managed cloud load balancer:

AWS – Elastic Load Balancing

GCP – Cloud Load Balancing

Single Point of Failure (SPOF)

Any component that could cause the whole system to fail if it stops working

Having a SPOF is problematic because it creates reliability, scalability, and security issues.

The strategies to prevent these issues are:

  1. Redundancy
    1. Add more than one instance of each critical component to the system.
  2. Health Checks and Monitoring
    1. Continuously check the health of each component; if one fails, stop sending traffic there and redirect it to another instance. This works together with the redundancy strategy.
  3. Self-healing systems
    1. Continuously check the health of each component; if one fails, take it out of rotation and replace it with a new healthy instance.
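
Strategies 1 and 2 combine naturally: with redundant servers in place, a health check filters traffic to healthy instances only. A minimal sketch (the server names and the health-check predicate are hypothetical):

```python
import random

def route(servers, is_healthy):
    # Redundancy: several identical servers; the health check filters
    # out failed ones before any traffic is routed.
    healthy = [s for s in servers if is_healthy(s)]
    if not healthy:
        raise RuntimeError("no healthy servers available")
    return random.choice(healthy)

servers = ["app-1", "app-2", "app-3"]
down = {"app-2"}  # pretend app-2 failed its last health check
target = route(servers, lambda s: s not in down)
assert target in ("app-1", "app-3")  # traffic never reaches app-2
```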

API Design

API – Application Programming Interface. It defines how software components should interact with each other

An API is a contract that defines:

  1. What requests can be made
  2. How to make them
  3. What responses to expect

An API is an abstraction mechanism: it hides implementation details while exposing functionality, and it defines clear interfaces between system components.
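
As a sketch of the contract idea, the interface below (a hypothetical `PaymentAPI`, with made-up method and field names) fixes what can be called and what comes back, while hiding how it is implemented:

```python
from abc import ABC, abstractmethod

class PaymentAPI(ABC):
    """The contract: what requests exist, their inputs, their outputs."""

    @abstractmethod
    def charge(self, account_id: str, amount_cents: int) -> dict:
        """Returns a dict with at least 'status' and 'transaction_id'."""

class InMemoryPaymentAPI(PaymentAPI):
    # One possible implementation; callers depend only on the contract.
    def charge(self, account_id, amount_cents):
        return {"status": "ok", "transaction_id": f"{account_id}-1"}

api: PaymentAPI = InMemoryPaymentAPI()
assert api.charge("acct-42", 500)["status"] == "ok"
```

Swapping `InMemoryPaymentAPI` for a real backend changes nothing for the caller, which is exactly the abstraction the text describes.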

API Styles:

REST – Representational State Transfer. Multiple endpoints. The most common choice for web and mobile apps.

GraphQL – A single endpoint for all operations (queries and mutations). Recommended for complex UIs.

gRPC – Uses the RPC (Remote Procedure Call) framework. Commonly used for microservices.

Design Principles to build APIs

  • Consistency
    • Consistent naming and patterns
  • Simplicity
    • Focus on core use cases, and intuitive design
  • Security
    • Authentication, authorization, input validation, rate limiting
  • Performance
    • Caching strategies, pagination, minimize payloads, reduce round trips
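
As one concrete instance of the performance principle, a minimal offset-based pagination helper (field names are illustrative) keeps payloads small by returning one page at a time:

```python
def paginate(items, page: int, page_size: int = 2):
    # Offset pagination: return one page of data plus metadata so
    # clients can fetch large collections in small payloads.
    start = (page - 1) * page_size
    return {
        "data": items[start:start + page_size],
        "page": page,
        "total_pages": -(-len(items) // page_size),  # ceiling division
    }

users = ["ana", "bo", "cy", "dee", "ed"]
assert paginate(users, 1) == {"data": ["ana", "bo"], "page": 1, "total_pages": 3}
assert paginate(users, 3)["data"] == ["ed"]
```

Real APIs often prefer cursor-based pagination for large or fast-changing data, but the payload-size benefit is the same.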

The API Design Process

  • Identify core use cases and user stories
  • Define scope and boundaries
  • Determine performance requirements
  • Consider security constraints

API Protocols

  • Application Layer
    • HTTP/HTTPS
    • WebSockets
    • MQTT
    • AMQP – Advanced Message Queuing Protocol
    • gRPC
  • Transport Layer
    • TCP
    • UDP
  • Network Layer
    • IP
  • Data Link Layer
    • Ethernet
    • WiFi
    • Bluetooth
  • Physical Layer

Transport Layer

TCP

  • Transmission Control Protocol
  • Reliable but Slower
  • Guaranteed Delivery
  • Connection based
  • Ordered packets
  • Error checking
  • Best for banking, emails, payments

UDP

  • User Datagram Protocol
  • Fast but unreliable
  • No delivery guarantee
  • Connectionless
  • Faster transmission
  • Less overhead
  • Best for video, gaming, streaming
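
UDP's connectionless nature shows up directly in the socket API: no handshake, no `connect()`/`accept()`, just datagrams. A loopback sketch using Python's standard `socket` module:

```python
import socket

# Connectionless datagram exchange over the loopback interface:
# no handshake and no delivery guarantee (harmless here, it's local).
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))          # OS picks a free port
addr = receiver.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"frame-1", addr)          # fire and forget

receiver.settimeout(2.0)                 # don't block forever if lost
data, _ = receiver.recvfrom(1024)
assert data == b"frame-1"
sender.close()
receiver.close()
```

A TCP version of the same exchange would need `listen()`, `accept()`, and a three-way handshake before any byte moves, which is exactly the overhead UDP trades away.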

7 Techniques to protect your APIs

  1. Rate Limiting
  2. CORS (Cross-Origin Resource Sharing)
  3. SQL & NoSQL injection prevention
  4. Firewalls
  5. VPNs (Virtual Private Networks)
  6. CSRF (Cross-Site Request Forgery) protection
  7. XSS (Cross-Site Scripting) prevention
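
Rate limiting (technique 1) is often implemented as a token bucket: each request spends a token, and tokens refill at a fixed rate, capping both burst size and sustained throughput. A minimal sketch:

```python
import time

class TokenBucket:
    # Requests spend tokens; tokens refill at a fixed rate, so a short
    # burst is allowed but sustained traffic is capped.
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=3, refill_per_sec=1.0)
results = [bucket.allow() for _ in range(5)]
assert results == [True, True, True, False, False]  # burst of 3, then limited
```

In production this state usually lives in a shared store (e.g. Redis) keyed by client, so all API servers enforce the same limit.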

Defining scope and constraints for Data Engineering pipelines

When we are designing data pipelines, it is important to start by asking the following questions:

1. Requirements and Goals

  • What is the primary goal of this pipeline?
  • What types of data are we processing (structured, semi-structured, unstructured)?
    • Structured data: Data that is organized in a fixed schema
    • Semi-structured: Data that doesn’t follow a rigid schema but still has some organizational markers
    • Unstructured: Data that has no predefined structure, and cannot easily fit into rows and columns.
  • How often does the data need to be available for consumers? (real-time, near-real-time, daily batch)
  • Are there any SLAs for latency or throughput?
    • Latency: The time it takes for a single piece of data to travel through the system from input to output.
    • Throughput: The amount of data a system can process in a given time period.
  • How large is the expected data volume now, and how fast do we expect it to grow?

2. Sources and Destinations

  • What are the data sources? (databases, APIs, logs, IoT devices, external systems)
  • Are there multiple sources with different reliability characteristics?
  • Where should the processed data be stored or consumed from? (data warehouse, lake, dashboards, ML models)

3. Data Processing Expectations

  • Do we need complex transformations, aggregations, or enrichment?
  • Should the pipeline handle streaming, batch, or a mix of both?
  • Are there specific data quality requirements? (e.g., deduplication, schema validation)
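
Deduplication, one of the data-quality requirements mentioned above, can be as simple as keeping the first record seen per key (the `id` field here is an assumed key; real pipelines often keep the latest version instead):

```python
def deduplicate(records, key: str):
    # Data-quality step: keep only the first record seen for each key.
    seen, out = set(), []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            out.append(rec)
    return out

raw = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "a"}]
assert deduplicate(raw, "id") == [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
```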

4. Reliability and Fault Tolerance

  • What level of reliability is expected? (exactly-once, at-least-once, at-most-once)
    • At-most-once:
      • Each message is delivered zero or one time.
      • Messages might be lost, but never duplicated.
    • At-Least-Once:
      • Each message is delivered one or more times
      • Messages are never lost, but duplicates can occur.
    • Exactly-Once
      • Each message is delivered exactly once.
      • No duplicates, no losses.
      • Achieved with careful transactional processing or idempotent operations.
  • Are there any retention requirements?
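
At-least-once delivery pairs naturally with idempotent consumers: if a message can arrive twice, remembering processed message IDs makes redelivery harmless, which in effect gives exactly-once processing. A sketch with hypothetical message shapes:

```python
# An idempotent consumer under at-least-once delivery: duplicates are
# detected by message ID and skipped, so each effect is applied once.
processed_ids = set()
balance = 0

def handle(msg: dict):
    global balance
    if msg["id"] in processed_ids:   # duplicate delivery: skip it
        return
    processed_ids.add(msg["id"])
    balance += msg["amount"]

for m in [{"id": "m1", "amount": 10},
          {"id": "m2", "amount": 5},
          {"id": "m1", "amount": 10}]:   # m1 redelivered
    handle(m)

assert balance == 15  # effect applied exactly once per message
```

In a real system the `processed_ids` set would live in durable storage and be updated in the same transaction as the effect.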

5. Operational Considerations

  • Are there cost or resource constraints?
  • What kind of monitoring or alerting is expected?
  • Do we need to support schema evolution or backward compatibility?
    • Schema Evolution:
      • The ability of a system to adapt to changes in the data schema over time without breaking existing pipelines or applications.
    • Backward Compatibility:
      • A schema change is backward compatible if old consumers can still read new data without errors.
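
A backward-compatible evolution in practice: the record below gains a new `region` field, and an old consumer (field names are illustrative) still reads it by ignoring unknown fields and defaulting any it has never seen:

```python
# Old consumer built against the original schema: ("user_id", "event").
OLD_FIELDS = ("user_id", "event")

def old_consumer(record: dict) -> dict:
    # Unknown new fields (e.g. "region") are simply ignored; a field
    # added as optional would get a default via .get() when absent.
    return {f: record.get(f, None) for f in OLD_FIELDS}

new_record = {"user_id": 7, "event": "click", "region": "eu-west"}  # evolved schema
assert old_consumer(new_record) == {"user_id": 7, "event": "click"}
```

This is why "add optional fields, never remove or repurpose existing ones" is the usual rule of thumb for backward-compatible schema changes.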

6. Security and Governance

  • Are there access controls or privacy regulations we need to consider?
  • Is data lineage or auditing required?

These questions will help us clearly understand the requirements, enabling us to design a pipeline that is robust, scalable, maintainable, and reliable.