prakhesar's blog

bullish on data engineering

I like to think I'm a "recovering data scientist". in my first and last data science role, I was given a project to compare revenue forecasts vs. actuals and write a backtesting algorithm. I noticed immediately that the data was a mess and that the source system didn't agree with destination systems. the tables within our source system didn't even agree with each other.

my entire job from there on out was to quantify how bad our data actually was. I'll spare you the details, but our data was growing more incorrect by the hour.

I learned just how easy it is to mess up a data pipeline and ruin a well-functioning business. we had a team of 8 data scientists running experiments, forecasts, and research on bad data. I was appalled that we were doing analytics & science on top of datasets that made no sense. it didn't matter how amazing our data scientists were; their work would never be correct.

an OpenAI employee tweeted "it's becoming awfully clear to me that these models are truly approximating their datasets to an incredible degree" ... "model behavior is not determined by architecture, hyperparameters, or optimizer choices. It’s determined by your dataset, nothing else" [1].

the quality of the model is determined by its data. now let's apply this same analogy to a few other industries:

I could go on forever, but I think that architecting data systems is a job that isn't going to be automated anytime soon. individual components of a system may be automated by LLMs, but the architect role is here to stay, and will only grow.

bullish on data engineering.


[1] tweet here