
Oakland’s Port as a Metaphor for Modern Data Applications

I’ve lived in Oakland for the past five years, and over that time, I’ve come to appreciate three places that, to me, define the city:

  • The Port of Oakland
  • Lake Merritt
  • Reinhardt Redwood Regional Park

Admittedly, I may be biased, since these are the places I frequent for my regular runs, but I also believe they represent the heart and soul of Oakland. In this post, I want to focus on the Port of Oakland and explore how it serves as a perfect analogy for a modern data application built using Docker, Kubernetes, and Argo Workflows.

The Port of Oakland: A Brief Overview

The Port of Oakland was established in 1927. It employs nearly 500 people directly and supports close to 100,000 jobs throughout Northern California. It generates $174 billion in annual economic activity and spans 1,300 acres, with approximately 780 acres dedicated to marine terminals.

It’s one of the four largest ports on the Pacific Coast, alongside Los Angeles, Long Beach, and the Northwest Seaport Alliance (Seattle and Tacoma). It’s a critical hub for the flow of containerized goods, handling over 99% of the container traffic in Northern California.1

Why Compare a Port to a Data Application?

At first glance, ports and data systems might seem unrelated. But if you’ve ever worked with containerized data applications, especially those using Docker, Kubernetes, and Argo, the parallels are striking. Let’s look at the core components of our stack before diving into the analogy.

Kubernetes: The Port Authority

Kubernetes, also known as K8s,2 plays the role of the Port Authority in our stack.

Formal definition: Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.

Let’s dive into this definition.

Modern architectures are built from independent building blocks called microservices. Microservices can be maintained and updated independently, and they are ideally suited for cloud computing.

To deploy, maintain, and update these microservices, we use containers. A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.

To keep track of all these containers, we use Kubernetes.

There are many tools for container orchestration, but Kubernetes dominates the market with roughly 95% share.

Users love Kubernetes because it solves the typical challenges of container orchestration:

  • Scheduling and Networking
  • Attaching storage to a container

To solve these issues, Kubernetes interacts with a Container Engine. Kubernetes tells the Container Engine to start or stop containers in the correct order and in the right place.
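
To make the Port Authority role concrete, here’s a minimal sketch, assuming a reachable cluster, a local kubeconfig, and the official Kubernetes Python client (pip install kubernetes), that asks Kubernetes which Pods it is currently managing and where it has placed them:

```python
# A minimal sketch: list the Pods Kubernetes is currently orchestrating.
# Assumes a reachable cluster, a kubeconfig at ~/.kube/config, and `pip install kubernetes`.
from kubernetes import client, config

config.load_kube_config()  # read cluster credentials from the local kubeconfig
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    print(f"{pod.metadata.namespace}/{pod.metadata.name} -> node {pod.spec.node_name}")
```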

Docker: The Shipping Container

Docker is the Container Engine.3

Formal definition: Docker is a platform designed to help developers build, share, and run containerized applications.

To create containers, we use a Docker image. A Docker image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries, and settings.

A Docker image becomes a container at runtime.

To create a Docker image, you define a Dockerfile, a text file that lists the instructions for building the image.

From that image, the Docker Engine constructs containers with all the necessary software components.
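
To see these pieces together, here’s a minimal sketch using the Docker SDK for Python; this is my own illustration, not something prescribed in the post, and it assumes a running Docker daemon, the docker package installed (pip install docker), and a Dockerfile in the current directory (the image tag and command are made up):

```python
# A minimal sketch: build an image from a Dockerfile, then run a container from it.
# Assumes a running Docker daemon, `pip install docker`, and a Dockerfile in ".".
import docker

client = docker.from_env()

# Build an image from the Dockerfile (the tag is a made-up example).
image, _build_logs = client.images.build(path=".", tag="port-of-oakland-demo:latest")

# A Docker image becomes a container at runtime.
container = client.containers.run(image, command="echo 'cargo unloaded'", detach=True)
container.wait()
print(container.logs().decode())
```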

Argo Workflows: The Shipping Manifest

Argo lets us define workflows where each step runs as a containerized task in Kubernetes. These steps can run sequentially or in parallel, and they’re defined in workflow templates that are stored and managed within the Kubernetes cluster.4 An Argo workflow lists the steps to be executed; each step is a container that runs as a Pod on Kubernetes.

Formal definition: Argo Workflows is a container-native workflow engine for orchestrating parallel jobs on Kubernetes, implemented as a Custom Resource Definition (CRD).

To define the list of steps, we use workflow templates, which are workflow definitions that persist on the cluster and can be called by running workflows or submitted on their own.

Argo is built for Kubernetes. It doesn’t just “run on” Kubernetes. It uses Kubernetes’ architecture to define, schedule, run, and manage complex workflows using containers.
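
To make the shipping-manifest idea concrete, here’s a minimal sketch, my own illustration rather than anything from the Argo documentation, that defines a two-step Workflow as a plain Python dictionary and submits it through the Kubernetes API. It assumes Argo Workflows is installed in an argo namespace and that the Kubernetes Python client can reach the cluster:

```python
# A minimal sketch: a two-step Argo Workflow submitted through the Kubernetes API.
# Assumes Argo Workflows in the "argo" namespace and `pip install kubernetes`.
from kubernetes import client, config

workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "unload-ship-"},
    "spec": {
        "entrypoint": "manifest",
        "templates": [
            {
                "name": "manifest",
                "steps": [
                    [{"name": "unload-containers", "template": "say"}],  # step 1
                    [{"name": "load-trucks", "template": "say"}],        # step 2, runs after step 1
                ],
            },
            {
                "name": "say",
                "container": {
                    "image": "alpine:3.20",
                    "command": ["echo"],
                    "args": ["moving cargo"],
                },
            },
        ],
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io",
    version="v1alpha1",
    namespace="argo",
    plural="workflows",
    body=workflow,
)
```

Each step here is a container that Kubernetes runs as a Pod, exactly as described above; in practice you would usually write this manifest in YAML and submit it with the Argo CLI or UI.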

The Analogy: Running a Data Application is Like Running a Port

  • Docker containers are like shipping containers: standardized, portable, and self-contained.
  • Kubernetes is like the Port Authority: orchestrating the arrival, placement, and routing of containers.
  • Argo Workflows are like the shipping manifest: specifying the sequence of operations to be performed, and ensuring that the containers reach their destination in the correct order.

Just like a modern port requires tight coordination between ships, cranes, trucks, and storage units, a modern data application requires precise orchestration of containers, storage, and compute resources. Each tool plays a crucial role in ensuring everything runs efficiently, securely, and reliably.


Final Thoughts

Living in Oakland, I’ve come to see the Port as more than just a backdrop; it’s a symbol of organization, logistics, and the power of containers to move the world. Similarly, in my work as a data engineer, I see Docker, Kubernetes, and Argo not just as tools, but as the infrastructure that powers the efficient movement of data.

In both worlds, whether it’s goods moving through a harbor or data flowing through pipelines, the principle is the same: with the right tools and coordination, complex systems can run smoothly.

  1. https://www.oaklandseaport.com/resources/about-the-oakland-seaport ↩︎
  2. https://kubernetes.io ↩︎
  3. https://docs.docker.com/get-started/docker-overview/ ↩︎
  4. https://argoproj.github.io/workflows/ ↩︎

Reflections on the Agentic AI Summit – August 2, 2025

Yesterday (August 2, 2025), I had the opportunity to attend the Agentic AI Summit at UC Berkeley. The event started at 9 a.m., but I had a morning run planned with friends, so I wasn’t able to arrive until around 1 p.m.

Despite missing the morning sessions, I attended two main talks that really stood out:

  • Foundations of Agents
  • Next Generation of Enterprise Agents

Foundations of Agents

The first talk I attended was part of the Foundations of Agents session, given by Dr. Dawn Song, Professor at UC Berkeley. Her talk, titled “Towards Building Safe and Secure Agentic AI,” explored the dual-edged nature of AI technology.

Who will benefit most from AI: the good guys or the bad guys?

That question stuck with me. Dr. Song highlighted how AI can both protect and attack systems. The talk emphasized the critical need to secure agents, especially in terms of privacy and vulnerability. She introduced GuardAgent: Safeguard LLM Agents via Knowledge-Enabled Reasoning, a research project aimed at evaluating and strengthening AI agent security. The explanation of how GuardAgent works gave me insight into how researchers are building AI to be more robust and safe.

The next talk was by Ed Chi from Google, titled “Google Gemini Era: Bringing AI to Universal Assistant and the Real World.” He introduced Project Astra, a Google initiative focused on developing truly universal AI assistants.

During his talk, he asked the audience, “How many of you have a personal assistant?” Only a few hands went up. He then said, “Imagine if we all had one. With Project Astra, that’s possible.”

He shared impressive demo videos showcasing the power and responsiveness of AI agents. Though his talk was short, it was one of the most compelling presentations of the day. You can check out one of the videos here:

The third talk, “Automating Discovery” by Jakub Pachocki, Chief Scientist at OpenAI, was delivered remotely via Zoom. Unfortunately, due to poor audio quality and connection issues, I wasn’t able to follow much of it and decided to skip it.

The last talk of the session was by Sergey Levine, who discussed Multi-Turn Reinforcement Learning for LLM Agents.

Here’s a simplified breakdown:

  • Multi-Turn: In real-world scenarios, decisions are not made in isolation; they unfold over multiple steps. Agents need to plan, iterate, and adjust based on feedback.
  • Reinforcement Learning (RL): Teaching AI through rewards and penalties, enabling it to learn optimal actions based on past experiences.

Sergey explained how agents can become more efficient by “remembering” past decisions and using them to improve future outcomes, reducing resource use while improving decision-making. It was a technical but deeply insightful talk.


Next Generation of Enterprise Agents

After a short break, we moved into the Next Generation of Enterprise Agents session, which featured speakers:

  • Burak Gokturk – VP, ML, Systems & Cloud AI Research, Google
  • Arvind Jain – Founder/CEO, Glean
  • May Habib – Co-Founder/CEO, Writer
  • Richard Socher – Founder/CEO, You.com

To be honest, this portion felt more like a sales pitch than a research discussion. Each speaker highlighted how their AI solutions are boosting productivity and driving profits for enterprise customers. While interesting, it lacked the critical lens and depth of the earlier technical talks.


Final Thoughts: With Great Power…

One moment from the panel discussion really stuck with me. When asked about the future of AI, Ed Chi responded:

“Uncertainty causes anxiety. As humans, we want to know what’s going to happen. When we don’t know, we feel anxious. That’s exactly what’s happening with AI today.”

That quote lingered with me as I walked to my car. I told my brother I felt both excited and uneasy about what lies ahead. Dr. Song’s question echoed in my mind: What if AI empowers the wrong people? How do we ensure it’s used for good?

While there are no simple answers, I believe one critical factor is data quality. AI systems are only as good as the data we feed them. As a data engineer, I see our role as essential in guiding AI systems toward ethical and effective decision-making.

We are the ones who curate, clean, and manage the data AI learns from. It’s our responsibility to ensure that this data is accurate, fair, and useful, not just for internal company use but for the broader systems that may rely on it.

One of my favorite quotes comes from Spider-Man: With great power comes great responsibility.

While AI agents might appear to hold the power, I believe data professionals are the real power-holders. If we provide quality, thoughtful, and well-structured data, then AI agents stand a better chance of acting responsibly and effectively.

To my fellow data engineers: With great power comes great responsibility.
Let’s use that power to help build an AI-powered future that we can all be proud of.

System Design

As a data engineer, it’s essential to understand how systems function in order to design data solutions that make the most efficient use of available resources. Our data must also align with core data quality principles, which means choosing the system architecture that best meets our specific needs.

I invite you to join me on a journey to explore system design and how it can be tailored to support modern data requirements. By the end of this journey, we should be equipped to select the most effective architecture to handle Big Data workloads and ensure high-quality, scalable solutions.


Single Server Setup

In this setup, everything runs on a single server: the web application, database, cache, and other components are all hosted together.

Here’s how the request flow works:

  1. A user accesses the website through a domain name, such as mysite.com.
  2. The Domain Name System (DNS) resolves the domain name and returns the corresponding Internet Protocol (IP) address to the user’s browser or mobile app.
  3. With the IP address in hand, the client sends HyperText Transfer Protocol (HTTP) requests directly to the web server.
  4. The web server processes the request and responds with either HTML pages or a JSON response for rendering on the client side.

Traffic comes from two sources:

  • Web Application
  • Mobile Application

Key Concepts:

  • Domain Name System (DNS): a fundamental part of the internet that translates human-readable domain names (like google.com) into machine-readable IP addresses. It is typically a paid service provided by third parties rather than hosted on our own servers.
  • Internet Protocol (IP) Address: a numerical label, such as 192.0.2.1, assigned to a device connected to a computer network.
  • HyperText Transfer Protocol (HTTP): the foundation of the World Wide Web, used to load web pages via hypertext links.
  • JavaScript Object Notation (JSON): a text format for storing and transporting data; it is “self-describing” and easy to understand.
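
To trace this flow end to end, here’s a minimal sketch using only the Python standard library. The domain is the hypothetical mysite.com from the steps above (substitute a real site), and the /api/users/42 endpoint is made up for illustration:

```python
# A minimal sketch of the single-server request flow, standard library only.
# "mysite.com" and "/api/users/42" are hypothetical; substitute real values.
import json
import socket
import urllib.request

domain = "mysite.com"

# Step 2: DNS resolves the domain name to an IP address.
ip_address = socket.gethostbyname(domain)
print(f"{domain} resolved to {ip_address}")

# Step 3: the client sends an HTTP request to the web server.
with urllib.request.urlopen(f"http://{domain}/api/users/42") as response:
    body = response.read().decode()

# Step 4: the server answers with HTML or JSON for the client to render.
print(json.loads(body))
```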

Data Engineering – Basic Skills – Git

Hi fellow data heroes!

Today, I want to dive into Git—a tool that’s as essential for data engineers as bash scripting. While bash scripting might make you feel like a real developer, Git takes it to the next level by facilitating collaboration and version control. After all, knowledge is only valuable if shared, and Git is the perfect tool to enable sharing and enhance our collective expertise. Remember, teamwork makes the dream work!

But, what exactly is Git?

According to Wikipedia:

Git is a distributed version control system that tracks versions of files. It is often used to control source code by programmers collaboratively developing software.

I like to think of Git as a community mural where everyone has the opportunity to contribute. Each time someone adds a new section, the mural improves. If mistakes are made, more experienced artists can correct them. The more organized the community, the more beautiful the mural becomes. Git embodies the balance between freedom and control. Just as we start with drafts, seek feedback, and refine our work before committing to the final mural, Git allows us to manage code with similar care.

Git provides a space where code evolves through contributions from multiple developers. It keeps a history of all changes, allowing you to revert to any previous version of your code. This balance between creative freedom and controlled processes is crucial for data engineers. While we should have the liberty to experiment with new techniques or enhance data pipelines, Git branches offer the structure we need. Think of branches as drafts that can be reviewed and refined before merging into the main branch. This control ensures that no change is made to the main branch without peer approval.

Git is an incredible tool for collaboration, but it’s important to establish guidelines within your team to use it effectively.

Here are some of my favorite Git commands that I use daily. While these commands are just the tip of the iceberg, they’re a great starting point for mastering Git.

I hope you find this helpful!

Data Engineering – Basic Skills – Bash Scripting

When I was a teenager, I watched the movie Swordfish. I barely remember the plot, but what stuck with me was Hugh Jackman as the fastest typer and most incredible software engineer I had ever seen. Before that, my idea of a software engineer was Al McWhiggin, the “chicken man” from Toy Story II. But after Swordfish, my perception of what a software engineer should look like completely changed.

In my mind, a top-tier software engineer had to be:

  • Height: Over 6 feet
  • Body: Six-pack abs, strong arms, and killer legs—because how else could you type as fast as Wolverine?
  • Looks: Amazing hair, a perfect nose, and intense black eyes.
  • Hygiene: A bit of dirt on your face, a rugged, unshowered look.
  • Scars: A few battle wounds as proof that your coding skills have saved the world.

And, most importantly, you had to be able to solve a complex coding challenge in 60 seconds flat—a feat that would take most experienced engineers 16 minutes.

But thanks to my Computer Science degree and exposure to more tech-related movies, I eventually realized that none of these characteristics are necessary to be a successful software engineer—or to save the world.

In this new series, I want to share the essential skills every software engineer needs to kickstart a successful career. My focus will be on the core competencies a Data Engineer requires to thrive.

The first post in this series is about bash scripting. I love bash scripting because it makes me feel like a real software engineer: black screen, basic commands in white text—what more do you need to feel like Hugh Jackman saving the world from dark hackers?

Bash scripting is incredibly easy to learn. Memorize these 10 basic commands, and you’re good to go:

  1. grep – Filters input based on regex pattern matching.
  2. ls – Lists the contents of your directory.
  3. cd – Moves within directories.
  4. rm – Removes files.
  5. mkdir – Creates directories.
  6. nano – Opens a file editor.
  7. echo – Prints a message.
  8. clear – Clears the console.
  9. | – Pipe (not a command, but essential for streaming the output of one command into another, e.g., cat server.log | grep ERROR).
  10. cat – Concatenates files.

I recently took a Bash Scripting course on DataCamp and have attached my notes for anyone looking to get started.

Yes, there’s more to bash than just 10 commands, but these are enough for the daily tasks of a Data Engineer. Thanks to developers like “Wolverine,” we now have user-friendly tools with powerful UIs that allow us to put our bash skills on the back burner. But don’t forget to dust off those skills from time to time—you never know when John Travolta might ask you to solve a coding challenge using only bash scripting in 60 seconds! The world needs those bash skills.

Data-Intensive Applications Series :: Chapter II – Data Models and Query Languages

Hi!

This weekend, I read Chapter II of Designing Data-Intensive Applications. It was a long chapter, but very interesting.

A Data Model is an abstraction of how you represent your data. It shows how your data elements are organized and how they relate to one another. Data models are one of the most important parts of developing software because they represent how you approach the problem you are solving. You will have to write your code based on the data model you select for your application, so it is important to think about your functional and non-functional requirements when choosing the best-fit data model.

In this chapter, the author covered the history of Data Models, an overview of each model, and how each model is good in its respective domain.

Data started out being represented as one big tree (the hierarchical model), but that was not a good fit for many-to-many relationships. The relational data model came into the picture to solve this problem, and it has been dominant for a while because of its simplicity and the broad variety of use cases it can cover.

As technology continues to evolve, applications require functionality that the relational data model does not cover well. NoSQL data stores aim to fill this gap, and the new nonrelational datastores have diverged in two main directions: document databases and graph databases.

My main takeaway from this chapter is: Different data models (document, relational, and graph) are designed to satisfy different use cases. It is important to consider the functional and non-functional requirements to pick a data model that is suitable for your application.
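
As a toy illustration (my own example, not one from the book), here is the same user profile modeled first as a self-contained document and then as normalized relational rows:

```python
# A toy illustration: the same profile as a document versus relational rows.
# All names and values are made up for the example.

# Document model: one nested object with good locality, a one-to-many tree.
profile_document = {
    "user_id": 42,
    "name": "Ada",
    "positions": [
        {"title": "Data Engineer", "company": "Port of Oakland"},
        {"title": "Analyst", "company": "Lake Merritt Labs"},
    ],
}

# Relational model: flat tables joined by keys, a better fit for many-to-many queries.
users = [(42, "Ada")]  # (user_id, name)
positions = [
    (1, 42, "Data Engineer", "Port of Oakland"),  # (position_id, user_id, title, company)
    (2, 42, "Analyst", "Lake Merritt Labs"),
]
```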

Check my notes to learn more about Data Models.

Data Engineering Learning Series :: Introduction to Airflow with Python

Hi,

This weekend, I could not read the second chapter of Designing Data Intensive Applications. I promise to read two chapters and share my notes with you next weekend!

Yesterday, after a year and a half without in-person races, I ran the Alameda 10-miler. It was great! It felt so good to return to “normal” in the Covid-19 era! Something I love about running is the amount of data we get from it. I was planning to keep my pace at 8 minutes per mile, and I ended up running 7:45 per mile! I have been following the Garmin Coach 5K training plan, and based on my long runs, I was not expecting this result. I have to thank my Arete teammates and my brother (IG: @thetechyrunner), who cheered me on from the beginning of the race to the finish line! Their support was my incentive to run at my best! Take a look at my Certificate to find out more about my results!

This weekend, I decided to retake the DataCamp course Introduction to Airflow with Python. The course has four chapters and is a quick introduction to Airflow’s main components. The first chapter covers Airflow’s basic components, what a Directed Acyclic Graph (DAG) is, and the main views in the Airflow UI.

My main takeaway from the 1st Chapter is: Data Engineering is complex because it involves designing, managing, and optimizing data flows to ensure that the organization can access and trust its data. Airflow is a great tool for managing those flows. It helps us create, schedule, and monitor workflows (sets of steps that accomplish a given data engineering task). Like any tool, it has its pros and cons, but in general, Airflow makes things easier for data engineering teams. Take a look at my notes to learn more about Airflow.
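
To ground those components, here’s a minimal sketch of a DAG. It’s my own toy example (the task names and commands are made up) and assumes Airflow 2.x is installed:

```python
# A minimal toy DAG: extract race results, then load them, once a day.
# Assumes Airflow 2.x; task names and commands are made up for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="race_results_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract_results", bash_command="echo 'extracting results'")
    load = BashOperator(task_id="load_results", bash_command="echo 'loading results'")

    extract >> load  # load runs only after extract succeeds
```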


Data-Intensive Applications Series :: Chapter I – Reliable, scalable, and maintainable applications

Hi there,

Thank you for visiting my blog!

The purpose of this blog is to record my journey as I learn about Designing Data-Intensive Applications by Martin Kleppmann and to share the lessons I learned with you!

My plan is to read a chapter every weekend, write detailed notes (which will always be included with my blog), and share a summary of the most important takeaways with you!

The first chapter, Reliable, scalable, and maintainable applications, is very informative! I love the way the author explains concepts for data systems and shares examples of how developers can design apps while considering nonfunctional requirements in the early stages of the Systems Development Life Cycle. My main takeaway from the 1st Chapter is: Even if your application is small, always consider reliability, scalability, and maintainability in your design. Your app will not survive without these principles.

Take a look at my notes for more details.

Happy to start this “data trip” with you!

Twitter Users’ Privacy Concerns

Our paper (Daniela Fernandez Espinosa & Lu Xiao), titled “Twitter Users’ Privacy Concerns: What do Their Accounts’ First Names Tell Us?”, was published in the Journal of Data and Information Science.

In this paper, we describe how gender recognition on Twitter can be used as an intelligent business tool to determine the privacy concerns among users, and ultimately offer a more personalized service for customers who are more likely to respond positively to targeted advertisements.

Check it out here https://sciendo.com/article/10.2478/jdis-2018-0003