John Y

SMCQL - Secure Querying for Federated Databases

WEDNESDAY, FEBRUARY 03, 2021 •

DBMS

Privacy

Security

Today's paper visits one of the more recently tackled problems in the realm of privacy preserving systems - facilitating data processing for multiple distrustful parties. There have been numerous papers with similar motivations such as Ryoan and Conclave. While the case studies and impetus for such systems is the same, the papers differ in the threat model, security guarantees, and technical solutions they present.

Secure multi-party computation (MPC) is a popular cryptographic protocol that is useful for facilitating joint computation of a function over inputs while keeping the input private. The paper presents an intuitive scenario that I'll summarize here - imagine Alice, Bob, and Carl are 3 healthcare providers, each with their own collection of confidential patient records. Alice wants to know how many patients a certain ailment across all three datasets has (i.e. SELECT COUNT(*) FROM data WHERE disease = "flu";. However, because of patient confidentiality, Alice should not be able to reverse engineer the results or execution of the query such that she might discern Bob and Carl's private data.

When nobody trusts each other, the simple but straightforward solution is then, to designate a new, third party entity that everyone trusts! Introducing, the honest broker. Alice, Bob, and Carl will all submit their data and queries to the honest broker. The honest broker carries out aggregating the data and responding to the requested queries from each party.

The problem with this schema is the honest broker. First, there's the problem of determining an honest broker; in some cases, there might not be one. For example, sensitive intelligence from government agencies might have stringent permissions that prevent it from being shared with a third-party source. Second, for adversaries, the broker represents a single point of failure - if the broker is compromised, everyone's data and queries are no longer confidential. Trusting the broker unconditionally is a major issue. In addition, side channels in the form of network traffic between brokers and data providers are a source of information leakage that curious observers can inspect even without compromising any of the participants or the broker itself. And there-in lies the motivation for this paper.

We introduce a framework for executing PDN [Private Data Networks] queries named SMCQL. This system translates SQL statements into SMC [Secure Multiparty Computation] primitives to compute query results over the union of its source databases without revealing sensitive information about individual tuples to peer data providers or the honest broker.

To put it simply, the goal is to simulate a completely trustworthy third party to query private datastores.

SMCQL's authors frame the threat model as several passive, "honest-but-curious" data providers, who may attempt to listen in on side channels (i.e., memory access patterns, program counters) generated during a secure query execution.

At a high level, sensitive query evaluation is carried out in-situ among the data providers using secure multi-party computation. In other words, the query execution is guaranteed to be oblivious, which means it reveals nothing about the data to any parties other than the results of the query itself. Figure 2 of the paper presents a nice graphic summarizing the system workflow. The main contribution that SMCQL makes is the query planner. After the SQL statement is converted into a directed acyclic graph, via modifications to traditional query optimization techniques, SMCQL generates a secure execution plan. In addition to the customary heuristics for minimizing execution cycles, SMCQL introduces an additional heuristic for minimizing the amount of query processing that needs to be run in an MPC setting as much as possible. To do this, SMCQL generates hybrid query execution plans, where operators are carried out either by SMC or in plaintext.

There's a couple cryptographic building blocks that SMCQL is founded on, namely garbled circuits, oblivious RAM, and ObliVM. Garbled circuits are a cryptographic protocol that protects a query's program traces from snooping. Oblivious RAM, as discussed in a previous write-up, shuffles data on all read and write operations to add noise to memory traces of secure computations. The authors apply these techniques to solve the leakage problem that arises with broker to participant communication. ObliVM is the bridge that converts imperative code into garbled circuits and ORAM. The authors convert database operators into SMC commands via ObliVM.

Another innovation introduced by the paper is attribute-level security. The authors recognize that access privileges need to be more granular than at the table level. Different columns within the table may have different degrees of desired visibility, warranting a new access control policy. Three tiers of visibility (public, protected, private) influence the query planning step with taint analysis, by tracing the flow of sensitive attributes through the operator tree. Lastly, the paper also discusses some new query optimization techniques (sliced evaluation, semi-joins) that complement the new query planning and execution mechanisms.

I think this paper was overall, an interesting read that is a book opener for future work in this domain. It was also a wonderfully practical demonstration of MPC's capabilities. With that said, I think there's also plenty of room for improvement.

The most evident improvement area is performance. MPC is an expensive algorithm, and it really shows in this paper. Relative to a purely MPC system, SMCQL performs well. However, compared to non-secure systems that operate in plaintext, SMCQL runs a couple orders slower than the baseline. Later papers, particularly Conclave, build on SMCQL's work while improving performance either via redefining some of the security guarantees or using different techniques to minimize what is run with MPC and not plaintext.

A second point (full disclosure - more borne out of my opinion), is that for tracing sensitive data via taint analysis (public, protected, private), a better alternative could be second path analysis. The paper was published long ago (Hinke 1988), but it discusses this idea of utilizing sensitivity inference rules in relational tables. The paper mentions:

If an attribute of a table is private, the entire table is private and all tables reachable via primary-foreign key relationships.

Attribute level security is a bit weak because in practice, information is not completely independent of one another. A practical example - let's say in a table of patient records, there's two columns - Diagnosis (private) and Treatment Plan (protected). Acting as one of the data providers, if we know the treatments for certain diagnoses from our own data, we may be able to infer diagnoses for another data provider's dataset, thus affecting the confidentiality of the Diagnosis column. Hinke's paper describes a way to secure columns based on sensitivity inference rules, which may have been a better strategy for tagging columns with the appropriate access control permissions.