Privacy Integrated Queries

MONDAY, JANUARY 25, 2021

PL

This paper addresses a growing problem concerning the security and privacy of large data pools. Technology companies now collect, analyze, and share large amounts of individual data spanning a wide spectrum of information. While the data is collected under the promise that it will remain private, data analysis and storage systems are built in a way that does not reliably guarantee the privacy of the underlying records. There have been advances in designing privacy-preserving algorithms, but this approach has been outpaced by the explosion in the diversity and amount of data being gathered. The problem is further exacerbated by the difficulty of collaboration between data providers and non-expert data analysts, two groups that come from completely different domains. PINQ approaches this problem with differential privacy, a guarantee that lets analysts learn aggregate, group-level patterns while ensuring that the presence or absence of any single record has only a bounded effect on the distribution of any output. This fits PINQ's goal because the definition of differential privacy does not depend on the data type, so it applies to anything from numeric and text values to medical images.

The PINQ language is built on the differential privacy guarantee. “Privacy Integrated Queries” is an API for computing on privacy-sensitive data sets while guaranteeing that the underlying records are protected from exposure. PINQ itself is a DSL that aims to be simple yet expressive enough to provide security, high performance, and a wide range of possible analyses. PINQ is very similar to LINQ, but the main differences are: 1. there is no direct access to the data (LINQ's exact aggregations are unavailable); 2. users instead get aggregations specially designed for privacy preservation; and 3. all queries provide differential privacy. The main innovation is that every query is first checked for compliance with the data provider's privacy policy, and aggregations are then executed with random noise added to their results. The author also describes a notion called “stability”, which bounds how much a transformation can amplify differences between data sets: a transformation is c-stable if two inputs that differ in k records produce outputs that differ in at most c·k records.
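To make the ideas of noisy aggregation and stability more concrete, here is a minimal Python sketch. It is not PINQ's actual C# API: the class `ProtectedData` and its methods `where`, `group_by`, and `noisy_count` are hypothetical names chosen for illustration. The sketch assumes a Laplace mechanism for counts and handles stability by widening the noise, which is only one possible design; a system could instead charge a larger share of the privacy budget.

```python
import random


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


class ProtectedData:
    """Hypothetical stand-in for a PINQ-style protected collection."""

    def __init__(self, records, stability=1.0, rng=None):
        self._records = list(records)  # never handed back to the analyst
        self.stability = stability     # product of stability constants so far
        self._rng = rng if rng is not None else random.Random()

    def where(self, predicate):
        # Filtering is 1-stable: changing one input record changes
        # at most one output record.
        kept = [r for r in self._records if predicate(r)]
        return ProtectedData(kept, self.stability, self._rng)

    def group_by(self, key):
        # Grouping is 2-stable: changing one input record can remove
        # one group's old value and add its new value.
        groups = {}
        for r in self._records:
            groups.setdefault(key(r), []).append(r)
        return ProtectedData(list(groups.values()), 2 * self.stability, self._rng)

    def noisy_count(self, epsilon):
        # Assumption of this sketch: scale the noise by the accumulated
        # stability so the result stays epsilon-DP for the source data.
        return len(self._records) + laplace(self.stability / epsilon, self._rng)
```

An analyst would write something like `data.where(lambda r: r > 40).noisy_count(0.1)` and only ever see the noisy number, never the records themselves.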

The author gives a brief overview of how the PINQ language works. I think one of the challenges is whether this language is easily usable. While the method headers for PINQ don’t look too different from LINQ, it seems like writing in PINQ does require a basic understanding of security primitives. For instance, the author makes use of a PINQAgent that acts as a privacy proxy between the analyst and the data. While the system can seemingly be attached to an existing query-driven language with little effort, it does not entirely abstract away privacy primitives. As a result, it may be easy to use for computer scientists who have some domain knowledge in security. However, for data providers or analysts who don’t have such a background, it may not be as usable.
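To illustrate the kind of privacy primitive that still leaks through to the user, here is a small Python sketch of a budget-tracking agent. The names `BudgetAgent`, `apply`, and `guarded_count` are hypothetical and are not taken from the PINQ source; the sketch only assumes that an agent holds a fixed privacy budget and approves or rejects each request for some amount of epsilon.

```python
class BudgetAgent:
    """Hypothetical privacy proxy that tracks a total epsilon budget."""

    def __init__(self, budget):
        self.remaining = budget

    def apply(self, epsilon):
        # Reject non-positive requests and requests that would
        # overspend the budget; otherwise deduct and approve.
        if epsilon <= 0 or epsilon > self.remaining:
            return False
        self.remaining -= epsilon
        return True


def guarded_count(records, epsilon, agent, noise):
    # `noise` is any caller-supplied function mapping epsilon to a
    # random perturbation; the query only runs if the agent approves.
    if not agent.apply(epsilon):
        raise PermissionError("privacy budget exhausted")
    return len(records) + noise(epsilon)
```

Even in this toy form, the analyst has to choose an epsilon for every query and reason about how much budget is left, which is the part that may be hard for someone without a privacy background.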

One of the more significant research questions concerns the accuracy of the query results returned by PINQ. While the author’s evaluations seemingly demonstrate that inaccuracy is bounded, it would be more reassuring to see either formal methods and proofs or practical validation on a larger array of datasets. Another direction would be to expand the base of queries that PINQ is able to capture. PINQ has an extensible design that should allow for support of transformations and aggregations unique to certain types of data or purposes. Last but not least, I think the performance costs and overhead associated with PINQ deserve more elaboration, especially as advances are made in either the accuracy or the expression coverage of PINQ.
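As a rough illustration of what a formal accuracy statement could look like, the sketch below assumes a count perturbed with Laplace noise of scale 1/epsilon, which is the standard mechanism for a sensitivity-1 query. The function names are hypothetical. It uses two textbook facts about the Laplace distribution: the expected absolute error equals the scale, and the probability that the error exceeds t is exp(-epsilon * t).

```python
import math


def expected_abs_error(epsilon):
    # For Laplace noise with scale b = 1/epsilon, E|noise| = b.
    return 1.0 / epsilon


def error_bound(epsilon, failure_prob):
    # Smallest t such that P(|noise| > t) <= failure_prob,
    # obtained by solving exp(-epsilon * t) = failure_prob for t.
    return math.log(1.0 / failure_prob) / epsilon
```

For example, under these assumptions a count released with epsilon = 0.1 is off by 10 on average and by less than about 30 with 95% probability, regardless of the size of the dataset. Bounds of this kind cover a single aggregation; composed queries and more complex transformations would need their own analysis.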