I have experienced few things in my professional life as emotionally turbulent as debugging software. The contrast between the sheer joy of root-causing an issue and the sheer despair of losing the trail is, to say the least, stark. It can be all-consuming and, in an environment with a client breathing down your neck, extraordinarily stressful. Watching a software engineer debug a production issue can feel a bit like watching someone defuse a bomb.
Luckily, the modern engineer does not have to go into this battle alone. Over the years, the software engineering community has developed a myriad of tools to aid in this endeavor. Companies like GitHub, Datadog, PagerDuty, and New Relic have built large businesses providing logging, source code management, APM, and alerting. APIs provide clear agreements across organizational and system boundaries. Stack traces, thread dumps, and bin logs provide low-level insight into what is going on at the deepest levels of the stack. All of these combined form the toolbox a software practitioner brings with her to the production outage gauntlet.
But what about data practitioners? With so many tools for teams building software products, surely the same must exist for teams building data products, right?
Oh Data Where Art Thou
"We had so many tools to help us guarantee the quality of our software, and our applications rarely crashed. But our data was “crashing” all of the time. Why was there no solution to this on the market?"
— Bryan Offutt, Index Ventures
Historically, the answer has been “sort of”. Though individual tools provided pieces of the data quality puzzle (constraints on database tables, Protocols in Segment, tests in dbt), there was not yet something that solved it holistically. A database constraint might have been good enough in the traditional ETL-to-data-warehouse world, but it is just not well suited to the modern data landscape, where data often flows through any number of systems before arriving at its final resting place. It wasn’t enough to inspect things at the end of the line; we needed something that could monitor data constantly through its arduous journey across organizational and software system boundaries. A yearly checkup was better than nothing, but it wasn’t enough. We needed an Apple Watch.
I can testify to the pain and frustration involved in resolving these issues. Having worked on data pipelines at Palantir and SingleStore, I remember countless instances of “heisenbugs”: extremely difficult-to-reproduce, “now you see it, now you don’t” customer issues that could take several days and several engineers to resolve. Inevitably, these issues would come down to a disconnect somewhere in the data integration process. We knew the user was likely seeing the result of faulty data; that part was usually pretty clear. But determining where and how the data had been corrupted... that was a different story. For better or worse, when software breaks, it’s usually easy to tell: things crash or look funny, often in a way that is apparent to both the user and the person responsible for fixing the issue (race conditions are a notable exception). When data breaks, it’s much more subtle. End users can go days or weeks without noticing, and it’s not always clear where to even begin hunting for the root cause.
We had so many tools to help us guarantee the quality of our software, and our applications rarely crashed. But our data was “crashing” all of the time. Why was there no solution to this on the market?
When I first met with Abe Gong to talk about his work on Great Expectations, I knew I had found the solution I had been looking for. In Charles Dickens’s Great Expectations, there is a line that goes “Suffering has been stronger than all other teaching.” It was clear that this was the case with Abe. As someone who had spent the majority of his career helping healthcare organizations with their data problems, Abe had seen some gnarly stuff. You’d be hard-pressed to find a tougher data environment than the one that exists in healthcare, and after countless hours of just the type of debugging I mentioned above, Abe knew there had to be a better way.
"By giving data teams an end-to-end way to monitor quality from pipeline to production, Abe wanted to bring the same ability to pinpoint and resolve issues that exists in traditional software to the world of data."
— Bryan Offutt, Index Ventures
The result was Great Expectations. Though the project was still in its infancy, it seemed like Abe already had a plan for every step of the journey. The vision was simple, yet ambitious: to create a single place to observe, monitor, and collaborate on the quality of your data, at any level of granularity, on any system. By giving data teams an end-to-end way to monitor quality from pipeline to production, Abe wanted to bring the same ability to pinpoint and resolve issues that exists in traditional software to the world of data. Finally, data teams could catch issues before they made their way to end users. It was as if Abe had read the book on every single problem I had experienced as an engineer working on data pipelines. It felt like the data world had its own Datadog, and I couldn’t wait to get involved.
And I was not the only one. Since the day Abe and I first met, the Great Expectations community has grown from a small number of folks in a Slack room to a community of believers more than 3,200 strong (and growing). It is one of the fastest-growing software communities that we have ever seen. These community members span the spectrum: practitioners and hobbyists, data engineers and data analysts, citizens of dozens of different countries, all in pursuit of quality data. I am honored to count myself among them.
“Ask no questions, and you’ll be told no lies.” This used to be the unwritten contract when using data within any sizable organization. Thanks to Superconductive, the company behind Great Expectations, we can now expect better.
Published — May 20, 2021