Database-agnostic querying is unavoidable at scale

Predrag Gruevski
Published in Kensho Blog · Jul 22, 2020

As the amount of data managed by an organization grows, the difficulty of managing and querying that data grows as well: more databases, more query languages, more gotchas to keep in mind and avoid. Beyond “fits on my laptop” data scale, database-agnostic querying is a necessary technology that allows engineers to focus on the data they need, instead of the data access mechanisms at work behind the scenes. In our previous post, we covered how Kensho uses GraphQL as a database query language via our open-source GraphQL compiler project. We’ll now dig into why this kind of database-agnostic querying is unavoidable at scale, and how it made GraphQL compiler an essential productivity booster for Kensho engineers.

Database-agnostic querying is the Holy Grail of data infrastructure: it’s an abstraction that allows querying users to focus on the “what” of their query while completely ignoring its “how” and “where.” Users see a single unified representation of all the data available to them and get answers to their questions (the “what”) by writing declarative queries in a single easy-to-use query language.

Holy Grail as depicted in the movie “Indiana Jones and the Last Crusade”
Depiction of the Holy Grail from the Indiana Jones Wiki, CC BY-SA

Beneath this database-agnostic abstraction, a sophisticated system keeps track of the details around every piece of data:

  • The kind of database that houses the data: relational, graph, time-series, key-value, etc.;
  • The particular database instance, shard, or replica that has the data;
  • The particular flavor and version of the database that has the data: PostgreSQL 12.3, Microsoft SQL Server 2017, OrientDB 2.2.37, etc.

Given a database-agnostic query, the system must first locate the data being queried (the “where”) and then decide how to most efficiently retrieve the data in question (the “how”). Doing this well is obviously a tremendous challenge!
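To make the “what” versus “how” and “where” distinction concrete, here is a minimal sketch of what a database-agnostic query can look like in GraphQL compiler’s flavor of GraphQL. The Company type, its fields, and the edge name are invented for illustration; @filter and @output are the compiler’s directives for constraining and returning data.

```python
# A hypothetical database-agnostic query in GraphQL compiler's syntax.
# The querying user states only the "what": which vertices, filters, and outputs.
query = '''
{
    Company {
        name @filter(op_name: "=", value: ["$company_name"])
        out_Company_HasSubsidiary {
            name @output(out_name: "subsidiary_name")
            founded_year @output(out_name: "subsidiary_founded")
        }
    }
}
'''
parameters = {"company_name": "Kensho"}

# The "where" (which database, shard, or replica holds the Company data) and
# the "how" (the most efficient retrieval plan for that database) are decided
# by the system underneath, never by the query author.
```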

So what makes database-agnostic querying a challenge worth solving? And in particular, why would a 150-person company like Kensho decide to invest heavily into solving it?

Cross-database querying is unavoidable at scale, and unaided humans are generally awful at it.

Two facts combine to give us the answer: cross-database querying is unavoidable at scale, and unaided humans are generally awful at it. Therefore, a good solution would have immense leverage: it would help many people, and help a lot.

Before we dig into the specific benefits Kensho derives from GraphQL compiler, let’s examine our two premises in more detail.

Premise #1: Cross-database querying is unavoidable at scale.

There comes a point in many companies’ lifecycle where it is no longer reasonable to place all of the company’s data into a single database.

If your company is lucky, this is simply a problem of having a large volume of relatively homogeneous data: it’s too big for a single database, so splitting up schemas or sharding large tables among multiple databases may be required. Suddenly, querying isn’t so easy anymore: what used to be a simple single-database SQL SELECT query may now need to be a cross-database operation, adding operational complexity and new failure modes, and threatening productivity and correctness. There are plenty of examples of companies building data infrastructure that addresses this problem by abstracting away the physical data storage layer: Google’s Bigtable, Dropbox’s Edgestore, or Facebook’s TAO.

A database table stored in one database, or “sharded” across 3 replicas by assigning each row to one of the three replicas.
When a table grows too large to reside in a single database, it can be “sharded” across multiple databases: different rows are assigned to different databases. Each shard is responsible for storing and querying across the rows assigned to it. This distributed nature of the table’s data makes queries significantly more complex.
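To make the idea of sharding concrete, here is a minimal sketch of one common approach, hash-based shard assignment; the key format and shard count are made up for illustration.

```python
# A minimal sketch of hash-based sharding: each row is assigned to one of
# NUM_SHARDS databases based on a stable hash of its primary key.
import hashlib

NUM_SHARDS = 3

def shard_for_key(primary_key: str) -> int:
    """Deterministically map a row's primary key to a shard index."""
    digest = hashlib.sha256(primary_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# A point lookup can be routed directly to the single shard that owns the row:
print(shard_for_key("company:12345"))

# But a query that scans, aggregates, or joins across many rows must be fanned
# out to every shard and the partial results merged back together -- exactly
# the added complexity described in the caption above.
```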

The more difficult case is when a company has heterogeneous data: not just relational data, but also time-series, graphs, geospatial data, images, audio files, XML files, PDFs, machine learning models that are gigabytes in size, and metadata linking all of these together. In this case, splitting up the data is a necessity since it’s essentially impossible to make a database system sufficiently good at all of these use cases simultaneously. Faced with this problem, companies generally adopt a “best tool for the job” approach, placing relational data in SQL databases, time-series in specialized time-series databases, large files in Amazon S3, etc.

Different types of data (graph, SQL, time-series, various file formats) interconnected across different database types.
Different databases are optimized for managing different kinds of data. Despite the different storage locations, data is usually interconnected and must be used in a cross-database context. For example, a time series is much more useful when linked to the entity to which it belongs (e.g., in a SQL or graph database) and to the primary source documents that corroborate its numbers (e.g., in Amazon S3).

Kensho faces both problems at once. Not only do we have every kind of heterogeneous data mentioned above, but we have A LOT of it — especially after being acquired by S&P Global and thereby getting access to data curated over the span of 100+ years! Just to give you a sense of scale:

  • Our largest SQL database system is easily dozens of terabytes in size and contains hundreds of thousands of tables and views.
  • Validating the schema generated from the most popular such logical database and converting it into a GraphQL schema object takes almost a full minute of Python CPU time. The resulting schema object consumes over 500MB of memory.
  • For that schema, executing the standard GraphQL introspection query (e.g., as used by the popular GraphiQL editor tool) takes over a minute of Python CPU time using the Python GraphQL library and in response produces a 55MB JSON payload. When delivered to GraphiQL in a web browser, the entire UI freezes up for over a second while GraphiQL parses and loads that absolutely gigantic result object.
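For the curious, here is a rough sketch of how one might reproduce that introspection measurement with the graphql-core library; constructing the schema object itself is elided, and the exact numbers will of course vary with the schema and hardware.

```python
# A rough sketch of timing the standard introspection query against an
# already-constructed graphql-core GraphQLSchema object.
import json
import time

from graphql import get_introspection_query, graphql_sync

def measure_introspection(schema):
    query = get_introspection_query(descriptions=True)
    start = time.process_time()
    result = graphql_sync(schema, query)
    cpu_seconds = time.process_time() - start
    payload = json.dumps(result.data)
    print(f"CPU time: {cpu_seconds:.1f}s, payload size: {len(payload) / 1e6:.1f}MB")
```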

Across Kensho, S&P Global, and all its other subsidiaries, we store data in over 30 kinds of data systems from various vendors — usually more than one major version of each. Just in the SQL space, we run at least three major versions of PostgreSQL, two major versions of Microsoft SQL Server, some Oracle instances, dozens of terabytes’ worth of Amazon Aurora nodes, and countless SQLite databases scattered across servers and individual engineers’ computers. Data systems are constantly being consolidated, migrated, and upgraded, and as we ingest ever more data sets and acquire new companies with new technology stacks, the number of systems tends to go up.

Various data system logos: OrientDB, Elasticsearch, Amazon Aurora, SQLite, Microsoft SQL Server, PostgreSQL, Amazon S3, …
Just a few of the data storage systems we use across Kensho, S&P Global, and its other subsidiaries. Frequently, we run more than one major version of each system concurrently. At our scale, data systems are constantly being consolidated, migrated, and upgraded.

All this goes to show how unavoidable cross-database querying really is at this scale. In the short term, workarounds such as hand-written, special-cased logic for particular cross-database use cases can act as a bandaid for the problem. However, such workarounds are by definition technical debt and thus cause ever-increasing drag on organization-wide productivity, so investing in a comprehensive cross-database solution is best in the long run.

Premise #2: Unaided humans are awful at cross-database querying.

If you are a data engineer, right now you may be thinking: “Databases are my job. I am not awful, I am great at my job!” And you are right! In terms of writing the best queries, between an automated system and an expert human, always place your bet on the human — at least when the query targets the particular database in which that human is an expert.

However, most humans aren’t experts in even one database, and restricting access to your company’s data to only experts is wildly impractical. Any solution that relies on humans always writing only good queries ultimately fails at scale — the data eventually outgrows humans’ ability to keep up.

Kensho is far from the first company to make this observation. Dropbox points out the same thing in their Edgestore blog post:

“As we rapidly added both users and features, we soon ended up with multiple, independent databases; […] Having to write SQL and interact with MySQL directly impacted developer productivity.” — Dropbox

Dropbox found that writing a single SQL dialect was a productivity bottleneck. Just imagine what happens to engineer productivity when queries have to span the 30 different kinds of data systems we have at Kensho and S&P Global!

Since Kensho engineers can write queries in a database-agnostic fashion, they can remain productive without needing to know which kind of database holds their data, which instance, shard, or replica it lives on, or which flavor and version of the database will answer their query.

Even the most straightforward recursive SQL query is massively complex and full of opportunities to accidentally hurt query performance. With GraphQL compiler, recursive queries remain simple without sacrificing performance. Everyone is a database expert when GraphQL compiler writes their queries!
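To illustrate the contrast, here is a hedged sketch using an invented schema (a Company vertex with a subsidiary edge); @recurse is GraphQL compiler’s directive for bounded recursive traversal, while the hand-written SQL equivalent is a recursive common table expression.

```python
# The database-agnostic version: follow the subsidiary edge up to 4 levels deep.
# Type and edge names are illustrative.
recursive_query = '''
{
    Company {
        name @filter(op_name: "=", value: ["$parent_name"])
        out_Company_HasSubsidiary @recurse(depth: 4) {
            name @output(out_name: "subsidiary_name")
        }
    }
}
'''

# Roughly the same traversal written by hand as a recursive SQL CTE. The query
# author must get the join order, depth bound, and index usage right -- and
# keep them right -- on every database that runs it.
equivalent_sql = '''
WITH RECURSIVE subsidiaries AS (
    SELECT c.id, c.name, 0 AS depth
    FROM company AS c
    WHERE c.name = %(parent_name)s
  UNION ALL
    SELECT child.id, child.name, s.depth + 1
    FROM subsidiaries AS s
    JOIN company AS child ON child.parent_id = s.id
    WHERE s.depth < 4
)
SELECT name AS subsidiary_name FROM subsidiaries;
'''
```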

This minefield of query problems only grows as datasets get larger and cross-database queries become necessary. In the absence of a powerful database-agnostic querying system, systems start to come apart at the seams, and more and more engineering time is spent simply gluing broken things back together. A few examples of what this might look like, loosely inspired by adventures in Kensho’s early days:

  • The website search team is desperately trying to speed up search index build times, and is attempting to parallelize their indexing queries. Lacking the ability to adjust query execution based on real-time database CPU and network load statistics, they make an “educated guess” for the parallelism factor to use — fingers crossed that it doesn’t crash the database! Let’s hope they remembered to make sure their new queries still hit indexes correctly!
  • One of the product teams is dismayed to discover that their backend got a lot slower over the course of a week, even though the endpoint results are unchanged and no updates to that code path were deployed. The backend queries database A for some data, then cross-references the result against database B. After much debugging, they discover that another team loaded some historical data into database A — this new data has increased the query result from database A that the backend cross-references against database B, even though nothing in database B will match this new data. The backend code is rewritten to query database B first and then cross-reference against database A.
  • And so on, across the entire organization.

This is not anyone’s happy place, database expert or not.

These kinds of scaling-related problems aren’t unique to Kensho, either. For example, Dropbox cites a related scaling problem as motivation for Edgestore, having realized that every database sharding decision they made was inevitably outgrown:

“Reactively (re)sharding individual databases as they hit capacity limits was cumbersome; conversely, setting up new database clusters for every use case added a lot of overhead.” — Dropbox

In fact, these are all real-life Kobayashi Maru scenarios, situations where defeat is certain no matter our course of action. In our examples above, the best parallelism factor, optimal query order, or ideal sharding setup for any point in time is a function of things that are constantly changing: the volume of data in each database, the current CPU utilization and network load, the particular version of the database executing the query, which indexes are available to help, etc. Any choice here could be right or wrong, depending on the circumstances — so any fixed choice made here by a human at one point in time is a liability, a future “epic debugging story”, or perhaps even an outage waiting to happen.

Since relying on humans for these decisions is guaranteed to eventually be a losing strategy, we thought it best to teach the machines instead.

How database-agnostic querying helps Kensho

Rather than teaching every engineer at Kensho how to navigate these minefields on their own, we decided to “teach” our code instead. We figured out the most common set of query functionality our engineers relied on and ensured the GraphQL compiler’s query semantics were powerful enough to fulfill their needs. We now teach our users a single query language that works in the same predictable way regardless of the underlying database type and version.

Multiple products using GraphQL compiler to query data in database-agnostic fashion from various databases.
Everyone wins when products are built atop database-agnostic queries! Products can be rapidly built atop a common query platform with access to all data. Meanwhile, database engineers can freely decide to move data around or switch to a completely new database type without breaking products. Every query that reaches those systems is guaranteed to apply all best practices. As best practices evolve, users’ queries are automatically recompiled — everything “just works.”
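Here is a hedged sketch of what this looks like in code: the same query text is compiled for whichever backend currently holds the data. The graphql_to_match and graphql_to_sql entry points (targeting OrientDB’s MATCH dialect and SQL backends, respectively) come from graphql-compiler’s public API, but the schema objects and type names below are placeholders.

```python
# A sketch of compiling one database-agnostic query for two different backends.
from graphql_compiler import graphql_to_match, graphql_to_sql

graphql_schema = ...   # placeholder: schema generated from e.g. an OrientDB instance
sql_schema_info = ...  # placeholder: schema info generated from e.g. a PostgreSQL instance

query = '''
{
    Company {
        name @output(out_name: "company_name")
    }
}
'''
parameters = {}

orientdb_compiled = graphql_to_match(graphql_schema, query, parameters)
sql_compiled = graphql_to_sql(sql_schema_info, query, parameters)

print(orientdb_compiled.query)  # an OrientDB MATCH query
print(sql_compiled.query)       # a query object for the SQL backend
```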

Along the way, every time we discovered a query performance problem, an unexpected database limitation, or an outright database bug, we added a compilation rule in the GraphQL compiler that ensured we never hit that issue again. Rather than just fixing each immediate problem one at a time (“this query is making the product slow”), the compiler let us banish entire classes of problems at a time: “this query structure makes queries slow, always avoid it.” This gave us immense leverage!

We started with five rules, grew to ten, then to twenty and beyond, and without realizing it, crossed the point where the full set of rules was more than any single human could reliably keep in mind. Within a few months, the compiler was simply better at writing queries — better than even the group of humans that designed and implemented the rules the compiler was merely following. It was more consistent, it never forgot or misinterpreted a rule, it never had an “off day” and slipped up — and if it ever did make a mistake, that mistake became a valuable test case, was quickly corrected, and never ever came back.

As we scale our use of the compiler to more and larger databases, its advantages over even expert humans are only growing. Here are a few examples we hope to cover in more detail in future posts:

  • GraphQL compiler is now able to estimate query costs, and use those estimates to automatically parallelize, paginate, and (soon) order cross-database queries while maintaining a predictable impact on index use. As data shapes and sizes change, the compiler’s query execution decisions change with them to keep query performance in the sweet spot.
  • Large normalized schemas are unwieldy and difficult to navigate, so the compiler allows querying users to use macros to reshape their perception of the schemas to fit their needs, without requiring any changes to the underlying data or databases. Do you wish the Company table directly pointed to the “current CEO” row of the Person table, instead of having to constantly query for “the company’s officer with title "CEO" whose role start date is in the past, and whose end date either doesn’t exist or is still in the future?” Then just define that macro, and “current CEO” appears as a relation between Company and Person in your view of the schema, even if Company and Person live in different databases! A sketch of such a macro follows this list.
  • We are even working on a way to integrate non-database sources of data, such as file systems, APIs, and machine learning models, into our database-agnostic querying universe. To integrate a new source of data using this approach, one would simply have to describe its schema and implement four simple functions that become the backbone of a provably-efficient query interpreter over that data set, allowing all the same query capabilities as when querying any other database.
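As a small teaser for the macro system mentioned above, here is a sketch of how that “current CEO” macro edge might be written. The type, field, and edge names are invented, the end-date condition is simplified away, and the macro arguments shown are illustrative; @macro_edge_definition and @macro_edge_target are the compiler’s macro directives.

```python
# A sketch of defining a "current CEO" macro edge. Once registered with the
# compiler's macro registry, out_Company_CurrentCEO appears as an ordinary
# edge between Company and Person in the querying user's view of the schema.
current_ceo_macro = '''
{
    Company @macro_edge_definition(name: "out_Company_CurrentCEO") {
        out_Company_HasOfficer {
            title @filter(op_name: "=", value: ["$ceo_title"])
            role_start_date @filter(op_name: "<=", value: ["$today"])
            out_Officer_IsPerson @macro_edge_target {
                uuid
            }
        }
    }
}
'''
macro_args = {"ceo_title": "CEO", "today": "2020-07-22"}

# Querying users then simply traverse out_Company_CurrentCEO like any other
# edge; the compiler expands the macro before compiling to the target database.
```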

If you’ve reached this point, thank you! We appreciate you reading this far, and we are sure you must have many questions, like “Why does GraphQL compiler not use standard GraphQL semantics?” and “How can you represent all those kinds of databases in a single database-agnostic way?” These are questions that we’ll explore in subsequent blog posts, so please stay tuned! In the meantime, here’s a demo repository where you can experiment with GraphQL compiler’s flavor of database-agnostic querying.

Our team is incredibly proud of the work we’ve done so far on GraphQL compiler, but our journey has just begun. Join us on this adventure by reaching out on Twitter, using GraphQL compiler for your projects, contributing to it on GitHub, or getting paid to work on it as a member of the Kensho team!

Thanks to Bojan Serafimov, Caroline Gerenyi, Joshua Pan, Julian Goetz, Leon Wu, Melissa Whitehead, Michaela Shtilman-Minkin, Pedro Mantica, Selene Chew, and to all GraphQL compiler contributors on GitHub!
