Here’s to Second Chances: How I Became a College Athlete at 33

Progress is not linear, but every road leads somewhere.

Three years ago, in August 2023, I joined my brother for his weekly workouts at Town Athletics, a local running studio in Oakland. I really enjoyed those sessions, not only because they were different from my regular routine (3 easy miles around Lake Merritt), but because I got to discover a new facet of running: speed. Our little cohort (my brother, one of our friends, and I) trained together for three months to run a mile time trial. The day came in October 2023, we ran, and I surprised myself with a 5:53. My initial goal had been 6:10, so when my coach called out that time, I could not believe it. The feeling of going so fast that you don’t have time to think? I wanted to feel that again.

I talked to my coach and told him I wanted to keep training, to keep getting faster, even though I wasn’t sure it was possible at 31. Our coach, Tinu, is amazing. If you don’t believe in yourself, he will absolutely help you change that. He said: let’s find some races and keep training. Our next goal: my first track meet at Laney College.

I had never run competitively in high school or college, unlike most of my friends from run clubs. I had no idea what a track meet looked like or what to expect. We followed the same approach, three months of training, two weekly speed workouts, one long distance day. I followed the plan, and in April 2024, the day arrived. With zero expectations, I found myself crossing the finish line at 5:23.14, a 5:47 mile pace. I had an incredible finish, passing people on the last lap, and that feeling of going so fast you don’t have time to think? It was there again.

Afterward, I went straight to my coach: I don’t want this to be the end. I want to keep training. I want a new goal.

There was another meet on the horizon, a big one, one I had only ever dreamed about because I never believed I could run that fast: the Mike Fanelli Classic at San Francisco State University. I checked the qualifying standards and got in with just two seconds to spare, the cutoff for the 1500m was 5:25. I had a year to prepare, and I was determined to make it count.

Two months after Laney, I got COVID. At first it felt like any seasonal bug: runny nose, sore throat, nothing out of the ordinary. Then one day, after taking BART and walking home, my left ear popped. Like when water gets trapped after swimming, and then suddenly releases. I felt a strange emptiness in that ear, but brushed it off, convinced it was nothing.

Two days later the sensation was still there. I went to the doctor, got some tests, and he confirmed I had lost 30% of my hearing capacity. I went through the standard treatment, but my hearing didn’t recover right away. I felt devastated, just months after feeling the strongest I’d ever been, my body felt broken. I couldn’t believe COVID had actually done this to me.

After three weeks of treatment, my hearing came back. But as our grandparents say: nothing is free. The treatment came with a side effect: 10 extra pounds.

I kept training, kept racing here and there, but the version of me from Laney felt very far away. I kept showing up, but the weight was hard on my body. I got injured and had to scale back on volume and intensity. My physical therapist was direct: it will be hard for you to race in the future if you don’t stop running. My answer: I need to race. My parents already booked their flights.

April 2025. Race day for the Mike Fanelli Classic arrived. My parents were in the bleachers, taking pictures the way they used to a decade ago when I was in college playing handball. The gun went off. I started running, and my body just didn’t respond. I finished last. But I finished happy, because I had made it to the place I used to only dream about.

At the end of the race, my coach said something that has stuck with me ever since: Think like an athlete. See you next week.

I wasn’t entirely sure what that meant, but I understood the spirit of it: let the past be the past. Focus on the present.

New season, new goals. I started going to the gym more consistently, partly because I realized rest alone wasn’t healing my injury, and partly because I wanted to spend more time with friends who had been going there for over a year. Then summer 2025 arrived with a great opportunity: I was invited to join the Renegade-HOKA Kodiak team, a group of 40 women from LA and Oakland training together to race the Kodiak Big Sur race.

My endurance started coming back. My injury began to ease. And I felt something I hadn’t felt since college: the energy of a team of women all chasing the same goal, an ultra trail race.

When I was at university in Mexico (Universidad Autónoma de Aguascalientes), I was part of the handball varsity team. It was one of the best parts of my college experience, the conversations during practice, the road trips to compete against other schools, the celebrations. Sometimes you’re not getting stronger on your own, but the people around you make you feel strong, and you end up believing them, and you end up actually becoming stronger. That’s what happened with the Kodiak team. Those hills we climbed together made me feel fast again. My confidence climbed alongside my fitness.

I never stopped going to track workouts either. Trail running is fun, but track is a different game, and I didn’t want to neglect that part of my journey.

October 2025. The Kodiak race came and went. Every teammate had an amazing race. We celebrated not just the finish lines, but the people we had become through the process. I am so grateful to HOKA, Renegade, and everyone involved. You gave me back a spark I thought I had lost.

Back on the track, my coach scheduled a 1500m time trial. I didn’t have high expectations, I’d been splitting my focus between trail and track and wasn’t fully committed to either. But I started running, felt good, and looked down at my watch: 5:13. A PR. I was back.

December 2025. My coach mentioned that Laney College was recruiting for their track team and asked, half joking, if I’d join if invited. It was a no-brainer. The chance to run track every week, to belong to a team again, to be part of a group of women who support each other? Best decision I’ve ever made.

Injuries, of course, never fully leave. That’s the price athletes pay for chasing goals, and I believe it’s worth it every time. I started the season with an Achilles injury. Workouts got harder to complete, my splits didn’t look great, and my coach moved me from the 1500m to the 800m. I was fine with that. I’d never done collegiate track, so I was open to trying everything.

February 2026. Our first meet: Merritt College. I had taken three weeks off because the injury had gotten bad. I showed up with no expectations, just hoping not to get lapped. Then my body reminded me that willpower is a real force and we should never underestimate it. I started last and finished third, running 2:37 when I’d been targeting 2:45. I was back, and ready for the rest of the season.

I’ve been competing for Laney for nearly three months now, across six track meets. After the third meet, with my injury finally improving, my coach moved me back to the 1500m. I had unfinished business there.

The meet was at San Mateo. I started in the second heat with no particular expectations, just hoping to stay competitive. The race started, I settled near the front of the pack, finished strong on the final lap, and ended up first in my heat, third overall. I really needed that. Earlier that day, I had gotten a call from a company I’d been hoping to join, and received a rejection. But running gave me a small spark of joy and a reminder of what life is: ups and downs. The important thing is to keep showing up.

Next came Modesto. This time I had a clear objective: run below 5:20 to qualify for the Mike Fanelli Classic again, my redemption race. I went out too fast, hit a PR at the 800m mark, but then my legs gave out. My heart carried me the rest of the way. I crossed the line at 5:14. I made it.

Last week, April 2026. The Mike Fanelli Classic arrived. I had gotten sick two days before, so expectations were low. But I got ready, admired my WRC white jersey, laced up my spikes, and kept repeating my splits in my head: 1:10, 2:24, 3:47. I walked to the start line still whispering them. The gun went off.

300 meters: 1:10. Yes. Let’s go. 800 meters: 2:24. Perfect. Finish: 3:47. 5:10.66.

I went from 5:35 to 5:10. I was over the moon.


This is a lesson I never want to forget: progress is not linear, and it matters deeply how you think about yourself along the way.

Life has been hard in 2026. I lost my job, got injured, faced rejections that genuinely hurt. The world feels heavy, AI reshaping everything, global tensions, rising costs. Oakland, the city I love, is going through its own struggles: businesses closing, hardship everywhere you look. But even in the darkest places, there is light: God, my family, my friends, my community (WRC, OTC, Renegade, YMCA), my coach, my teammates (Go Laney! Go Eagles!), my former colleagues, the volunteers at Community Kitchen. Myself.

I want to take a moment to thank my community, because it has never once left me alone.

I want to start with God. For always being there, even when I can’t see it in the moment.

A few weeks before I lost my job, I had been thinking about buying new running shoes, mine were over two years old and well past their prime. Then the job was gone, and so was that idea. I had to be smart with money. That was that.

A few weeks later, Renegade hosted an urban race. I showed up purely for the free pizza and to cheer on friends, no expectations, no agenda. Somehow I ended up joining a relay team. We finished second. And with that second place came a brand new pair of shoes.

God is there. He is always there. Not always in the ways you plan for, but always right on time.

My family, first and always. My parents, who send me a good morning text every single day without fail. My brother, whose advice I carry with me even when he can’t be at the finish line; I can always hear his voice: Con determinación. Así es, con determinación. (With determination. That’s right, with determination.) And my little sister, who keeps me grounded by sharing the latest tea about Hollywood actors, pulling my brain away from adult worries exactly when I need it most.

My WRC family, who have never stopped showing up for me. I will never forget the moment I broke my arm and received a DoorDash gift card from them. That gesture said everything. I am grateful every single day for having WRC in Oakland.

My friends, especially Marin, Ale, Caro, Cristina, Imanol, Priyanka, and Katherine. Being far from family is one of the hardest things about building a life in a new place, until you find people who fill that space so completely that you stop noticing the distance. Thank you for always being there. Thank you for seeing me cry and still finding a way to make me laugh. You are gems in my life, every single one of you.

To my former colleagues, for their kind messages and warm wishes during a difficult season.

To Jueves de Tacos, for keeping my spark alive, reminding me that dancing, singing, and tacos can always heal a heavy heart.

To Otto and Genaro, for being father figures to this running community. For always checking in on my brother and me, for the words of encouragement that landed exactly when we needed them, and for making Oakland feel like home. You remind us that community is not just about the miles we share, but about the people who show up for you off the track too.

To my teammate Stef, for always smiling after the hardest workouts, it matters more than you know.

To my coach Tinu, for never letting me forget that I should think like an athlete, even when I wasn’t sure I was one.

And to Oakland, for being exactly what it is, always.

Running has taught me that life is full of ups and downs, and what matters is not just showing up, but showing up with determination, with confidence, with joy, and above all, knowing that you have your own back. If you have yourself, everything else becomes possible.

Three weeks from now, we’ll have our final track meet of the season. It might go well. It might not. But no matter the outcome, I will keep showing up. I will keep trying. I will keep working hard.

And most importantly, I will always have my back.

Back to the Basics: Refreshing Some System Design Concepts

Databases

SQL vs NoSQL databases:

When to use a Relational Database:

  • Data is well structured with clear relationships
  • Strong consistency and transactional integrity

When to use a Non-Relational Database:

  • Super low-latency for rapid responses
  • Data is unstructured or semi-structured
  • Scalable storage for massive data volumes

Scaling

Vertical Scaling (Scale Up)

Adding more resources to our existing servers.

  • RAM
  • CPU

This is recommended for applications that have low or moderate traffic.

You will eventually reach a hardware limit. Also, if the server fails, the whole system fails, since everything depends on one server.

Horizontal Scaling (Scale Out)

Adding more servers to share the load

More suitable for large applications since it comes with high fault tolerance. If a server goes down, you can use another server.

It also lets the system keep scaling: when traffic grows, you just add more servers.

How do we implement it? How do we distribute users’ requests? You can use a load balancer to distribute traffic among multiple servers. It also improves fault tolerance: if a server fails, the load balancer stops sending traffic to it and routes requests to the remaining healthy servers.

Load Balancer

Distributes incoming network traffic across multiple servers to ensure no single server bears too much load.

7 strategies used in Load Balancing:

  • Round Robin: the simplest way to assign traffic. Say we have three servers: the load balancer sends traffic to the first server, then to the second, then to the third, and repeats. This works well when all the servers have similar specifications and capabilities.
  • Least Connections: redirects traffic to the server with the fewest active connections. For example, if server 1 has 10 active connections, server 2 has 9, and server 3 has 30, the load balancer sends the next request to server 2, which then has 10 connections. Useful for applications with sessions of variable length.
  • Least Response Time: focuses on server responsiveness. Say you have three servers: one highly responsive, one slow, and one in between. The load balancer picks the server with the lowest response time and the fewest active connections, so it prefers the most responsive server, but if that server already has many connections it falls back to the next most responsive one. This is effective when you want to provide the fastest possible response time and your servers have different capabilities.
  • IP Hash: determines which server receives the request based on a hash of the client’s IP address. This is useful when you want a client to connect consistently to the same server.
  • Weighted Algorithms: servers are assigned weights based on their capacity and performance metrics, and the load balancer distributes traffic in proportion to those weights. Say we have three servers with weights of 16, 32, and 64: the load balancer will send roughly four times as much traffic to server 3 as to server 1.
  • Geographical Algorithms: direct requests to the server that is geographically closest to the user. This is useful for global services where keeping latency low is important.
  • Consistent Hashing: this is the most popular one. A hash function distributes clients across nodes: the load balancer hashes the client’s IP address, places that hash on an imaginary ring that also holds the servers, and routes the request to the nearest server on the ring. This ensures the same client keeps connecting to the same server, even as servers are added or removed (a minimal sketch follows below).

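To make the ring idea concrete, here is a minimal Python sketch of consistent hashing. The server names and client IP are made up for illustration; real load balancers use many more virtual nodes and handle servers joining and leaving.

```python
import hashlib
from bisect import bisect_right

def stable_hash(key: str) -> int:
    """Map a string to a point on the ring (MD5 keeps it stable across runs)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, vnodes=100):
        # Each server gets several virtual nodes so load spreads evenly around the ring.
        self.ring = sorted(
            (stable_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self.points = [point for point, _ in self.ring]

    def server_for(self, client_ip: str) -> str:
        # Walk clockwise from the client's hash to the next server point on the ring.
        idx = bisect_right(self.points, stable_hash(client_ip)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["server-1", "server-2", "server-3"])
print(ring.server_for("203.0.113.7"))  # the same IP always maps to the same server
```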
Load balancers come with health-check features to make sure all the servers are up.

Examples of load balancers:

NGINX – software

HAProxy – software

Citrix – hardware

The easiest option is a managed cloud load balancer:

AWS – Elastic Load Balancing

GCP – Cloud Load Balancing

Single Point of Failure (SPOF)

Any component that could cause the whole system to fail if it stops working

Having a SPOF is problematic because it hurts reliability, scalability, and security.

The strategies to prevent these issues are:

  1. Redundancy: add more than one instance of the same component, so that no single failure takes the system down.
  2. Health Checks and Monitoring: continuously check the health of each component; if one fails, stop sending traffic to it and redirect to another instance. This works hand in hand with redundancy.
  3. Self-healing systems: continuously check the health of each component; if one fails, replace it with a new, healthy instance.
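A tiny sketch of the health-check idea, assuming a fleet of redundant servers that each expose a /health endpoint (the URLs are made up, and the requests library is assumed to be installed):

```python
import requests  # third-party HTTP client, assumed installed

# Hypothetical redundant fleet; in practice this list comes from service discovery.
SERVERS = ["http://10.0.0.1:8080", "http://10.0.0.2:8080", "http://10.0.0.3:8080"]

def healthy_servers() -> list[str]:
    """Return only the servers that answer their /health endpoint."""
    alive = []
    for url in SERVERS:
        try:
            if requests.get(f"{url}/health", timeout=2).status_code == 200:
                alive.append(url)
        except requests.RequestException:
            pass  # unreachable server: skip it so traffic gets redirected elsewhere
    return alive

print(healthy_servers())  # route new requests only to this list
```

A self-healing setup would go one step further and replace the failed instance instead of just skipping it.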

API Design

API – Application Programming Interface. It defines how software components should interact with each other

An API is a contract that defines:

  1. What requests can be made
  2. How to make them
  3. What responses to expect

An API is an abstraction mechanism since it hides implementation details while exposing functionality. It also defines clear interfaces between system components

API Styles:

REST – Representational State Transfer. Multiple endpoints. Most used for web and mobile apps.

GraphQL – Single endpoint for all operations (queries). Recommended for complex UIs.

gRPC – Uses the RPC framework. Commonly used for microservices.

Design Principles to build APIs

  • Consistency
    • Consistent naming and patterns
  • Simplicity
    • Focus on core use cases, and intuitive design
  • Security
    • Authentication, authorization, input validation, rate limiting
  • Performance
    • Caching strategies, pagination, minimize payloads, reduce round trips
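As a small illustration of the Performance principle above, here is a hypothetical REST endpoint with pagination, sketched with Flask (the route, data, and parameter names are made up):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical in-memory data; a real API would query a database instead.
USERS = [{"id": i, "name": f"user-{i}"} for i in range(1, 101)]

@app.route("/v1/users", methods=["GET"])
def list_users():
    # Pagination keeps payloads small and responses fast.
    page = int(request.args.get("page", 1))
    size = min(int(request.args.get("size", 20)), 50)  # cap the page size
    start = (page - 1) * size
    return jsonify({
        "data": USERS[start:start + size],
        "page": page,
        "size": size,
        "total": len(USERS),
    })

if __name__ == "__main__":
    app.run()
```

Consistent naming (/v1/users), a capped page size, and input validation would all live in the same place in a real service.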

The API Design Process

  • Identify core use cases and user stories
  • Define scope and boundaries
  • Determine performance requirements
  • Consider security constraints

API Protocols

Application Layer
  • HTTP/HTTPS
  • WebSockets
  • MQTT
  • AMQP (Advanced Message Queuing Protocol)
  • gRPC

Transport Layer
  • TCP
  • UDP

Network Layer
  • IP

Data Link Layer
  • Ethernet
  • WiFi
  • Bluetooth

Physical Layer

Transport Layer

TCP

  • Transmission Control Protocol
  • Reliable but Slower
  • Guaranteed Delivery
  • Connection based
  • Ordered packets
  • Error checking
  • Best for banking, emails, payments

UDP

  • User Datagram Protocol
  • Fast but unreliable
  • No delivery guarantee
  • Connectionless
  • Faster transmission
  • Less overhead
  • Best for video, gaming, streaming
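A minimal Python sketch of the difference, using the standard socket module (the addresses and ports are made up, and the TCP example assumes something is listening on that port):

```python
import socket

# TCP: connection-based, ordered, reliable (think banking, email, payments).
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp.connect(("127.0.0.1", 9000))   # a handshake establishes the connection first
tcp.sendall(b"payment:42.00")      # delivery and ordering are guaranteed
tcp.close()

# UDP: connectionless, no delivery guarantee, less overhead (think video, gaming).
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"frame-001", ("127.0.0.1", 9001))  # fire and forget
udp.close()
```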

7 Techniques to protect your APIs

  1. Rate Limiting (a small token-bucket sketch follows this list)
  2. CORS (Cross-Origin Resource Sharing)
  3. SQL & NoSQL injection prevention
  4. Firewalls
  5. VPNs (Virtual Private Networks)
  6. CSRF (Cross-Site Request Forgery) protection
  7. XSS (Cross-Site Scripting) prevention
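Rate limiting is often implemented with a token bucket. Here is a minimal sketch (the rate and capacity are arbitrary example values):

```python
import time

class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, without exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: reject (HTTP 429) or delay the request

bucket = TokenBucket(rate=5, capacity=10)
print(all(bucket.allow() for _ in range(10)))  # a burst of 10 is allowed
print(bucket.allow())                          # the 11th immediate request is rejected
```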

Defining scope and constraints for Data Engineering pipelines

When we are designing data pipelines, it is important to start by asking the following questions:

1. Requirements and Goals

  • What is the primary goal of this pipeline?
  • What types of data are we processing (structured, semi-structured, unstructured)?
    • Structured data: Data that is organized in a fixed schema
    • Semi-structured: Data that doesn’t follow a rigid schema but still has some organizational markers
    • Unstructured: Data that has no predefined structure, and cannot easily fit into rows and columns.
  • How often does the data need to be available for consumers? (real-time, near-real-time, daily batch)
  • Are there any SLAs for latency or throughput?
    • Latency: The time it takes for a single piece of data to travel through the system from input to output.
    • Throughput: The amount of data a system can process in a given time period.
  • How large is the expected data volume now, and how fast do we expect it to grow?

2. Sources and Destinations

  • What are the data sources? (databases, APIs, logs, IoT devices, external systems)
  • Are there multiple sources with different reliability characteristics?
  • Where should the processed data be stored or consumed from? (data warehouse, lake, dashboards, ML models)

3. Data Processing Expectations

  • Do we need complex transformations, aggregations, or enrichment?
  • Should the pipeline handle streaming, batch, or a mix of both?
  • Are there specific data quality requirements? (e.g., deduplication, schema validation)

4. Reliability and Fault Tolerance

  • What level of reliability is expected? (exactly-once, at-least-once, at-most-once; see the small sketch after this list)
    • At-most-once:
      • Each message is delivered zero or one time.
      • Messages might be lost, but never duplicated.
    • At-least-once:
      • Each message is delivered one or more times.
      • Messages are never lost, but duplicates can occur.
    • Exactly-once:
      • Each message is delivered exactly once.
      • No duplicates, no losses.
      • Achieved with careful transactional processing or idempotent operations.
  • Are there any retention requirements?
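With at-least-once delivery, a common trick is to make the consumer idempotent by deduplicating on a message id. A hypothetical sketch (the message shape is made up, and a real pipeline would keep the seen ids in a durable store):

```python
processed_ids = set()  # in production: a database table, key-value store, etc.

def handle(message: dict) -> None:
    msg_id = message["id"]
    if msg_id in processed_ids:
        return  # duplicate delivery: safe to ignore, so reprocessing has no effect
    # ... apply the side effect once: write a row, update an aggregate, etc. ...
    processed_ids.add(msg_id)

events = [
    {"id": "a1", "value": 10},
    {"id": "a1", "value": 10},  # the broker delivered this one twice
    {"id": "b2", "value": 7},
]
for event in events:
    handle(event)

print(len(processed_ids))  # 2: at-least-once delivery, effectively-once processing
```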

5. Operational Considerations

  • Are there cost or resource constraints?
  • What kind of monitoring or alerting is expected?
  • Do we need to support schema evolution or backward compatibility?
    • Schema Evolution:
      • The ability of a system to adapt to changes in the data schema over time without breaking existing pipelines or applications.
    • Backward Compatibility:
      • A schema change is backward compatible if old consumers can still read new data without errors.
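A tiny sketch of what a safe schema change can look like in practice: the producer adds an optional field, and consumers fall back to a default, so both the old and the new record shapes keep flowing through the pipeline (the field names are made up):

```python
OLD_RECORD = {"order_id": 1, "amount": 25.0}                    # written before the change
NEW_RECORD = {"order_id": 2, "amount": 9.5, "currency": "USD"}  # written after the change

def read_order(record: dict) -> dict:
    # Tolerate both shapes: missing fields get a default instead of raising KeyError.
    return {
        "order_id": record["order_id"],
        "amount": record["amount"],
        "currency": record.get("currency", "USD"),
    }

print(read_order(OLD_RECORD))
print(read_order(NEW_RECORD))
```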

6. Security and Governance

  • Are there access controls or privacy regulations we need to consider?
  • Is data lineage or auditing required?

These questions will help us clearly understand the requirements, enabling us to design a pipeline that is robust, scalable, maintainable, and reliable.

Oakland’s Port as a Metaphor for Modern Data Applications

I’ve lived in Oakland for the past five years, and over that time, I’ve come to appreciate three places that, to me, define the city:

  • The Port of Oakland
  • Lake Merritt
  • Reinhardt Redwood Regional Park

Admittedly, I may be biased, since these are the places I frequent for my regular runs, but I also believe they represent the heart and soul of Oakland. In this post, I want to focus on the Port of Oakland and explore how it serves as a perfect analogy for a modern data application built using Docker, Kubernetes, and Argo Workflows.

The Port of Oakland: A Brief Overview

The Port of Oakland was established in 1927. It employs nearly 500 people directly and supports close to 100,000 jobs throughout Northern California. It generates $174 billion in annual economic activity and spans 1,300 acres, with approximately 780 acres dedicated to marine terminals.

It’s one of the four largest ports on the Pacific Coast, alongside Los Angeles, Long Beach, and the Northwest Seaport Alliance (Seattle and Tacoma). It’s a critical hub for the flow of containerized goods, handling over 99% of the container traffic in Northern California.[1]

Why Compare a Port to a Data Application?

At first glance, ports and data systems might seem unrelated. But if you’ve ever worked with containerized data applications, especially those using Docker, Kubernetes, and Argo, the parallels are striking. Let’s look at the core components of our stack before diving into the analogy.

Kubernetes: The Port Authority

Kubernetes, also known as K8s.[2]

Formal definition: Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.

Let’s dive into this definition.

Modern architectures are built from independent building blocks called microservices. These microservices can be maintained and updated independently, and they are ideally suited for cloud computing.

We deploy, maintain, and update these microservices via containers. A container is a standard unit of software that packages up code and all its dependencies, so the application runs quickly and reliably from one computing environment to another.

To keep track of all these containers, we use Kubernetes.

There are many tools for container orchestration, but Kubernetes has 95% of the market.

Users love Kubernetes because it solves the typical challenges of container orchestration:

  • Scheduling and Networking
  • Attaching storage to a container

To solve these issues, Kubernetes interacts with a Container Engine. Kubernetes tells the Container Engine to start or stop containers in the correct order and in the right place.
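As a small illustration of "keeping track of all these containers": the official Kubernetes Python client can ask the cluster which Pods it is currently running (this assumes a working kubeconfig and the kubernetes package installed):

```python
from kubernetes import client, config

config.load_kube_config()   # read cluster credentials from ~/.kube/config
v1 = client.CoreV1Api()

# Ask the "Port Authority" which containers (Pods) it is tracking, and where.
for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)
```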

Docker: The Shipping Container

Docker is the Container Engine.[3]

Formal definition: Docker is a platform designed to help developers build, share, and run containerized applications.

To create containers, we use a Docker image. A Docker image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries, and settings.

A Docker image becomes a container at runtime.

To create a Docker image, you define a Dockerfile.

From an image, the Docker Engine constructs containers with all the necessary software components.
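For a quick taste of "an image becomes a container at runtime", here is a sketch using the Docker SDK for Python (it assumes Docker is running locally and the docker package is installed; the image tag is just an example):

```python
import docker

engine = docker.from_env()  # connect to the local Docker Engine

# Pull the image if needed, start a container from it, run one command, return its logs.
output = engine.containers.run(
    "python:3.12-slim",
    ["python", "-c", "print('hello from inside a container')"],
)
print(output.decode())
```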

Argo Workflows: The Shipping Manifest

Argo lets us define workflows where each step runs as a containerized task in Kubernetes. These steps can run sequentially or in parallel, and they are defined in workflow templates that are stored and managed within the Kubernetes cluster.[4] An Argo workflow lists the different steps to be executed. Each step is a container that runs as a Pod on Kubernetes.

Formal definition: Argo Workflows is a container-native workflow engine for orchestrating parallel jobs on Kubernetes, implemented as a Custom Resource Definition (CRD).

To define the list of steps, we use workflow templates which are definitions of workflows that are persistent on the cluster and can be called by running workflows or submitted on their own.

Argo is built for Kubernetes. It doesn’t just “run on” Kubernetes. It uses Kubernetes’ architecture to define, schedule, run, and manage complex workflows using containers.
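Since an Argo Workflow is a Kubernetes custom resource, it can be written as plain data. Here is a hypothetical two-step manifest, expressed as a Python dict (the names, image, and commands are invented for illustration):

```python
workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "unload-ship-"},
    "spec": {
        "entrypoint": "main",
        "templates": [
            {
                "name": "main",
                "steps": [  # each step below runs as its own container/Pod
                    [{"name": "extract", "template": "task"}],
                    [{"name": "transform", "template": "task"}],
                ],
            },
            {
                "name": "task",
                "container": {
                    "image": "python:3.12-slim",
                    "command": ["python", "-c", "print('step done')"],
                },
            },
        ],
    },
}
# Submitting this manifest to the cluster (e.g. via `argo submit` or the Kubernetes API)
# schedules each step as a Pod, in order, on Kubernetes.
```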

The Analogy: Running a Data Application is Like Running a Port

  • Docker containers are like shipping containers: standardized, portable, and self-contained.
  • Kubernetes is like the Port Authority: orchestrating the arrival, placement, and routing of containers.
  • Argo Workflows are like the shipping manifest: specifying the sequence of operations to be performed, and ensuring that the containers reach their destination in the correct order.

Just like a modern port requires tight coordination between ships, cranes, trucks, and storage units, a modern data application requires precise orchestration of containers, storage, and compute resources. Each tool plays a crucial role in ensuring everything runs efficiently, securely, and reliably.


Final Thoughts

Living in Oakland, the Port has become more than just a backdrop, it’s a symbol of organization, logistics, and the power of containers to move the world. Similarly, in my work as a data engineer, I see Docker, Kubernetes, and Argo not just as tools, but as the infrastructure that powers the efficient movement of data.

In both worlds, whether it’s goods moving through a harbor or data flowing through pipelines, the principle is the same: with the right tools and coordination, complex systems can run smoothly.

  1. https://www.oaklandseaport.com/resources/about-the-oakland-seaport#:~:text=The%20Oakland%20Seaport%20(Seaport)%20is,780%20acres%20of%20marine%20terminals.
  2. https://kubernetes.io
  3. https://docs.docker.com/get-started/docker-overview/
  4. https://argoproj.github.io/workflows/

Reflections on the Agentic AI Summit – August 2, 2025

Yesterday (August 2, 2025), I had the opportunity to attend the Agentic AI Summit at UC Berkeley. The event started at 9 a.m., but I had a morning run planned with friends, so I wasn’t able to arrive until around 1 p.m.

Despite missing the morning sessions, I attended two main talks that really stood out:

  • Foundations of Agents
  • Next Generation of Enterprise Agents

Foundations of Agents

The first talk I attended was part of the Foundations of Agents session, given by Dr. Dawn Song, Professor at UC Berkeley. Her talk, titled “Towards Building Safe and Secure Agentic AI,” explored the dual-edged nature of AI technology.

Who will benefit most from AI: the good guys or the bad guys?

That question stuck with me. Dr. Song highlighted how AI can both protect and attack systems. The talk emphasized the critical need to secure agents, especially in terms of privacy and vulnerability. She introduced GuardAgent: Safeguard LLM Agents via Knowledge-Enabled Reasoning, a research project aimed at evaluating and strengthening AI agent security. The explanation of how GuardAgent works gave me insight into how researchers are building AI to be more robust and safe.

The next talk was by Ed Chi from Google, titled “Google Gemini Era: Bringing AI to Universal Assistant and the Real World.” He introduced Project Astra, a Google initiative focused on developing truly universal AI assistants.

During his talk, he asked the audience, “How many of you have a personal assistant?” Only a few hands went up. He then said, “Imagine if we all had one. With Project Astra, that’s possible.”

He shared impressive demo videos showcasing the power and responsiveness of AI agents. Though his talk was short, it was one of the most compelling presentations of the day. You can check out one of the videos here:

The third talk, “Automating Discovery” by Jakub Pachocki, Chief Scientist at OpenAI, was delivered remotely via Zoom. Unfortunately, due to poor audio quality and connection issues, I wasn’t able to follow much of it and decided to skip it.

The last talk of the session was by Sergey Levine, who discussed Multi-Turn Reinforcement Learning for LLM Agents.

Here’s a simplified breakdown:

  • Multi-Turn: In real-world scenarios, decisions are not made in isolation, they unfold over multiple steps. Agents need to plan, iterate, and adjust based on feedback.
  • Reinforcement Learning (RL): Teaching AI through rewards and penalties, enabling it to learn optimal actions based on past experiences.

Sergey explained how agents can become more efficient by “remembering” past decisions and using them to improve future outcomes, reducing resource use while improving decision-making. It was a technical but deeply insightful talk.


Next Generation of Enterprise Agents

After a short break, we moved into the Next Generation of Enterprise Agents session, which featured speakers:

  • Burak Gokturk – VP, ML, Systems & Cloud AI Research, Google
  • Arvind Jain – Founder/CEO, Glean
  • May Habib – Co-Founder/CEO, Writer
  • Richard Socher – Founder/CEO, You.com

To be honest, this portion felt more like a sales pitch than a research discussion. Each speaker highlighted how their AI solutions are boosting productivity and driving profits for enterprise customers. While interesting, it lacked the critical lens and depth of the earlier technical talks.


Final Thoughts: With Great Power…

One moment from the panel discussion really stuck with me. When asked about the future of AI, Ed Chi responded:

“Uncertainty causes anxiety. As humans, we want to know what’s going to happen. When we don’t know, we feel anxious. That’s exactly what’s happening with AI today.”

That quote lingered with me as I walked to my car. I told my brother I felt both excited and uneasy about what lies ahead. Dr. Song’s question echoed in my mind: What if AI empowers the wrong people? How do we ensure it’s used for good?

While there are no simple answers, I believe one critical factor is data quality. AI systems are only as good as the data we feed them. As a data engineer, I see our role as essential in guiding AI systems toward ethical and effective decision-making.

We are the ones who curate, clean, and manage the data AI learns from. It’s our responsibility to ensure that this data is accurate, fair, and useful, not just for internal company use but for the broader systems that may rely on it.

One of my favorite quotes comes from Spider-Man: With great power comes great responsibility.

While AI agents might appear to hold the power, I believe data professionals are the real power-holders. If we provide quality, thoughtful, and well-structured data, then AI agents stand a better chance of acting responsibly and effectively.

To my fellow data engineers: With great power comes great responsibility.
Let’s use that power to help build an AI-powered future that we can all be proud of.

System Design

As a data engineer, it’s essential to understand how systems function in order to design data solutions that make the most efficient use of available resources. Data must align with core data quality principles, which means choosing a system architecture that best meets our specific needs.

I invite you to join me on a journey to explore system design and how it can be tailored to support modern data requirements. By the end of this journey, we should be equipped to select the most effective architecture to handle Big Data workloads and ensure high-quality, scalable solutions.


Single Server Setup

In this setup, everything runs on a single server: the web application, database, cache, and other components are all hosted together.

Here’s how the request flow works:

  1. A user accesses the website through a domain name, such as mysite.com.
  2. The Domain Name System (DNS) resolves the domain name and returns the corresponding Internet Protocol (IP) address to the user’s browser or mobile app.
  3. With the IP address in hand, the client sends HyperText Transfer Protocol (HTTP) requests directly to the web server.
  4. The web server processes the request and responds with either HTML pages or a JSON response for rendering on the client side.

The traffic comes from two sources:

  • Web Application
  • Mobile Application

Key Concepts:

  • Domain Name System (DNS): a fundamental part of the internet that translates human-readable domain names (like google.com) into machine-readable IP addresses. It is a paid service provided by 3rd parties and not hosted by our servers.
  • Internet Protocol (IP) Address: is a numerical label such as 192.0.2.1 that is assigned to a device connected to a computer network.
  • HyperText Transfer Protocol (HTTP): is the foundation of the World Wide Web, and is used to load web pages using hypertext links
  • JavaScript Object Notation (JSON): is a text format for storing and transporting data. JSON is “self-describing” and easy to understand.
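To see steps 1-4 in miniature, here is a sketch using Python's standard library plus the requests package (example.com is a generic example domain):

```python
import socket
import requests  # third-party HTTP client, assumed installed

ip = socket.gethostbyname("example.com")        # steps 1-2: DNS resolves the name to an IP
print("Resolved IP:", ip)

response = requests.get("http://example.com/")  # step 3: the client sends an HTTP request
print(response.status_code)                     # step 4: the server responds (HTML here)
print(response.headers.get("Content-Type"))
```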

Data Engineering – Basic Skills – Git

Hi fellow data heroes!

Today, I want to dive into Git—a tool that’s as essential for data engineers as bash scripting. While bash scripting might make you feel like a real developer, Git takes it to the next level by facilitating collaboration and version control. After all, knowledge is only valuable if shared, and Git is the perfect tool to enable sharing and enhance our collective expertise. Remember, teamwork makes the dream work!

But, what exactly is Git?

According to Wikipedia:

Git is a distributed version control system that tracks versions of files. It is often used to control source code by programmers collaboratively developing software.

I like to think of Git as a community mural where everyone has the opportunity to contribute. Each time someone adds a new section, the mural improves. If mistakes are made, more experienced artists can correct them. The more organized the community, the more beautiful the mural becomes. Git embodies the balance between freedom and control. Just as we start with drafts, seek feedback, and refine our work before committing to the final mural, Git allows us to manage code with similar care.

Git provides a space where code evolves through contributions from multiple developers. It keeps a history of all changes, allowing you to revert to any previous version of your code. This balance between creative freedom and controlled processes is crucial for data engineers. While we should have the liberty to experiment with new techniques or enhance data pipelines, Git branches offer the structure we need. Think of branches as drafts that can be reviewed and refined before merging into the main branch. This control ensures that no change is made to the main branch without peer approval.

Git is an incredible tool for collaboration, but it’s important to establish guidelines within your team to use it effectively.

Here are some of my favorite Git commands that I use daily. While these commands are just the tip of the iceberg, they’re a great starting point for mastering Git.

I hope you find this helpful!

Data Engineering – Basic Skills – Bash Scripting

When I was a teenager, I watched the movie Swordfish. I barely remember the plot, but what stuck with me was Hugh Jackman as the fastest typer and most incredible software engineer I had ever seen. Before that, my idea of a software engineer was Al McWhiggin, the “chicken man” from Toy Story II. But after Swordfish, my perception of what a software engineer should look like completely changed.

In my mind, a top-tier software engineer had to be:

  • Height: Over 6 feet
  • Body: Six-pack abs, strong arms, and killer legs—because how else could you type as fast as Wolverine?
  • Looks: Amazing hair, a perfect nose, and intense black eyes.
  • Hygiene: A bit of dirt on your face, a rugged, unshowered look.
  • Scars: A few battle wounds as proof that your coding skills have saved the world.

And, most importantly, you had to be able to solve a complex coding challenge in 60 seconds flat—a feat that would take most experienced engineers 16 minutes.

But thanks to my Computer Science degree and exposure to more tech-related movies, I eventually realized that none of these characteristics are necessary to be a successful software engineer—or to save the world.

In this new series, I want to share the essential skills every software engineer needs to kickstart a successful career. My focus will be on the core competencies a Data Engineer requires to thrive.

The first post in this series is about bash scripting. I love bash scripting because it makes me feel like a real software engineer: black screen, basic commands in white text—what more do you need to feel like Hugh Jackman saving the world from dark hackers?

Bash scripting is incredibly easy to learn. Memorize these 10 basic commands, and you’re good to go:

  1. grep – Filters input based on regex pattern matching.
  2. ls – Lists the contents of your directory.
  3. cd – Moves within directories.
  4. rm – Removes files.
  5. mkdir – Creates directories.
  6. nano – Opens a file editor.
  7. echo – Prints a message.
  8. clear – Clears the console.
  9. | – Pipe (not a command, but essential for streaming data from one command to another).
  10. cat – Concatenates files.

I recently took a Bash Scripting course on DataCamp and have attached my notes for anyone looking to get started.

Yes, there’s more to bash than just 10 commands, but these are enough for the daily tasks of a Data Engineer. Thanks to developers like “Wolverine,” we now have user-friendly tools with powerful UIs that allow us to put our bash skills on the back burner. But don’t forget to dust off those skills from time to time—you never know when John Travolta might ask you to solve a coding challenge using only bash scripting in 60 seconds! The world needs those bash skills.

Data-Intensive Applications Series :: Chapter II – Data Models and Query Languages

Hi!

This weekend, I read Chapter II of Designing Data-Intensive Applications. It was a long chapter, but very interesting.

A Data Model is an abstraction of how you represent your data. It shows how your data elements are organized and how they relate to one another. Data Models are one of the most important parts of developing software because they represent how you approach the problem you are solving. You will have to write your code based on the data model that you select for your application. It is important to think about your functional and non-functional requirements when you select the best-fit data model.

In this chapter, the author covered the history of Data Models, an overview of each model, and how each model is good in its respective domain.

Data started out being represented as one big tree (the hierarchical model), but this was not a good model for many-to-many relationships. The relational data model came into the picture to solve this problem. The relational data model has been dominant for a while because of its simplicity and the broad variety of use cases it can cover.

As technology continues to evolve, applications require different functionalities that the relational data model cannot cover. NoSQL data stores aim to fill this gap. New nonrelational datastores have diverged in two main directions: Document databases and Graph Databases.

My main takeaway from this chapter is: Different data models (document, relational, and graph) are designed to satisfy different use cases. It is important to consider the functional and non-functional requirements to pick a data model that is suitable for your application.
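A tiny illustration of that takeaway, with made-up résumé data: the same information modeled as normalized, relational-style rows versus a single self-contained document:

```python
# Relational style: normalized tables linked by keys; joins handle many-to-many well.
users = [{"user_id": 1, "name": "Ada"}]
positions = [
    {"position_id": 10, "user_id": 1, "title": "Data Engineer"},
    {"position_id": 11, "user_id": 1, "title": "Analyst"},
]

# Document style: one nested document, great for tree-shaped data read all at once.
user_document = {
    "user_id": 1,
    "name": "Ada",
    "positions": [{"title": "Data Engineer"}, {"title": "Analyst"}],
}

# Join-style access vs. direct access to the embedded list:
relational_titles = [p["title"] for p in positions if p["user_id"] == 1]
document_titles = [p["title"] for p in user_document["positions"]]
print(relational_titles == document_titles)  # True: same data, different models
```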

Check my notes to learn more about Data Models.

Data Engineering Learning Series :: Introduction to Airflow with Python

Hi,

This weekend, I could not read the second chapter of Designing Data Intensive Applications. I promise to read two chapters and share my notes with you next weekend!

Yesterday, after a year and a half without participating in in-person races, I attended the Alameda 10-miler. It was great! It felt so good to return to “normal” in the Covid-19 era! Something that I love about running is the amount of data that we get from it. I was planning to keep my pace at 8 minutes per mile, and I ended up running 7:45 per mile! I have been following the Garmin Coach 5K training, and based on my long runs, I was not expecting this result. I have to thank my Arete teammates and my brother (IG = @thetechyrunner), who cheered me on from the beginning of the race until the finish line! Their support was my incentive to try to run at my best! Take a look at my Certificate to find out more about my results!

This weekend, I decided to re-take the DataCamp course – Introduction to Airflow with Python. The course has 4 chapters and is a quick introduction to Airflow’s main components. The first chapter covers Airflow’s basic components, what a Directed Acyclic Graph (DAG) is, and the main views in the Airflow UI.

My main takeaway from the 1st chapter is: Data Engineering is complex because it involves designing, managing, and optimizing the flow of data to ensure that the organization can access and trust it. Airflow is a great tool for managing that flow. It helps us create, schedule, and monitor workflows (sets of steps that accomplish a given data engineering task). Like any tool, it has its pros and cons, but in general, Airflow makes things easier for Data Engineering teams. Take a look at my notes to learn more about Airflow, and see the small example below.
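For anyone who has not seen one yet, this is roughly what a minimal DAG looks like (the dag_id, tasks, and schedule are made-up examples in the style of Airflow 2.x):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling race results...")

def load():
    print("loading results into the warehouse...")

with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # extract must finish before load runs
```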

https://www.instagram.com/thetechyrunner/?hl=en