bullish on data engineering

24 Apr, 2024

I like to think I'm a "recovering data scientist". in my first and last data science role, I was given a project to compare revenue forecasts vs. actuals and write a backtesting algorithm. I noticed immediately that the data was a mess and that the source system didn't agree with the destination systems. the tables within our source system didn't even agree with each other.

my entire job from there on out was to quantify how bad our data actually was. I'll skip the details, but our data was growing more incorrect by the hour.
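for a sense of what "quantifying" can look like, here's a minimal sketch in python. it is not what we actually ran: the function name `reconcile`, the `{record_id: amount}` shape, the tolerance, and the sample rows are all assumptions made up for illustration. the idea is just to compare a source and a destination and count the records that disagree.

```python
def reconcile(source: dict, destination: dict, tolerance: float = 0.01) -> dict:
    """Compare two {record_id: amount} mappings and summarize disagreements.

    Hypothetical helper: the data shape and tolerance are illustrative assumptions.
    """
    # Records that never made it to the destination.
    missing = sorted(source.keys() - destination.keys())
    # Records in the destination that the source knows nothing about.
    unexpected = sorted(destination.keys() - source.keys())
    # Records present on both sides whose amounts differ by more than the tolerance.
    mismatched = sorted(
        key
        for key in source.keys() & destination.keys()
        if abs(source[key] - destination[key]) > tolerance
    )
    total = len(source.keys() | destination.keys())
    bad = len(missing) + len(unexpected) + len(mismatched)
    return {
        "missing_in_destination": missing,
        "unexpected_in_destination": unexpected,
        "value_mismatches": mismatched,
        "bad_record_rate": bad / total if total else 0.0,
    }


# Made-up sample rows, purely for illustration.
source_rows = {"inv-1": 100.0, "inv-2": 250.0, "inv-3": 75.5}
destination_rows = {"inv-1": 100.0, "inv-2": 205.0, "inv-4": 10.0}

report = reconcile(source_rows, destination_rows)
print(report)
```

run a check like this on a schedule and track `bad_record_rate` over time, and you can see whether the data is drifting further from the source.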

I learned just how easy it is to mess up a data pipeline and ruin a well-functioning business. we had a team of 8 data scientists running experiments, forecasts, and research on bad data. I was appalled that we were doing analytics & science on top of datasets that made no sense. it didn't matter how amazing our data scientists were; their work would never be correct.

an OpenAI employee tweeted "it's becoming awfully clear to me that these models are truly approximating their datasets to an incredible degree" ... "model behavior is not determined by architecture, hyperparameters, or optimizer choices. It’s determined by your dataset, nothing else" [1].

the quality of the model is determined by its data. now let's apply the same logic to a few other industries:

self-driving cars
- require real-time data streams, and real-time actions & decisions based on those streams. if the data is inaccurate or delayed, the outcome can be catastrophic.
rocketships
- rocket telemetry allows for control of rockets. if the instruments that capture & transmit that data are compromised, there will be critical failures.
banks
- every time you tap your card, tons of data is sent & received over the card networks to ensure that your card has the available balance, the merchant is legitimate, and that the transaction is secure and compliant with regulatory standards.
robots
- for a humanoid to make sense of the world, it requires data. it has to understand the human environment and make decisions based on the information it receives in real time.

I could go on forever, but I think that architecting data systems is a job that isn't going to be automated anytime soon. individual components of a system may be automated by LLMs, but the architect role is here to stay, and will only grow.

bullish on data engineering.

[1] tweet here