The Case for Learned Index Structures

MONDAY, JANUARY 11, 2021

ML

Presentation slides I created for this paper while taking UC Berkeley's CS 294-152 Machine Learning Systems class (Spring 2019), taught by Professor Joseph Gonzalez.

Traditionally, for efficient data access, storage, and deletion (among a multitude of DB operations), database and operating systems have been built on a well-established set of index structures such as Bloom filters and B+ Trees. While each of these indices has its own advantages, their drawbacks stem mainly from the fact that the structures are not adaptive: they are designed around generic assumptions about common, real-world access patterns (such as temporal or spatial locality) that approximate how a human might use a database, rather than around the actual distribution of the data at hand. This is the root of the authors' argument: machine learning's inherent ability to recognize and exploit unique patterns (not just temporal or spatial locality) creates many opportunities to build models that reflect the patterns in the data more naturally, rather than requiring us to build high-cost, specialized, poorly scaling index structures.

The evaluation methods referenced throughout the paper are the traditional metrics of speed, I/O count, and capacity: the factors that determine how quickly an index structure can serve data-oriented operations. Later in the paper, the metrics compared include lookup time, dataset size, and the size of the index structure or model itself. In short, the paper focuses on a system's time and space behavior, with shorter times and smaller footprints counting as wins.

The main advancement put forth by the authors rests on the idea that machine learning models can learn the pattern of how data is distributed and accessed. More specifically, this pattern can be thought of as a continuous function describing the data distribution: for a sorted index, the position of a key is essentially the cumulative distribution function (CDF) of the keys scaled by the number of records. By training a model on this function, more space-efficient, faster data structures can be constructed. Juxtaposed against B-Trees, the authors created a two-layer, fully connected neural network for lookup tasks, and also experimented with hybrid indexes that combine a hierarchy of models with traditional B-Trees. Another experiment aimed to lower hash-function conflicts by using a model of the key distribution to decide which bin each value lands in.
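To make the CDF idea concrete, here is a minimal sketch (my own illustration, not the authors' code) of a learned range index: a single linear model approximates the key-to-position mapping of a sorted array, and lookups correct the prediction with a bounded binary search. The paper's recursive model index uses a hierarchy of models rather than one regression line, but the error-bound bookkeeping is the same trick.

```python
import bisect

class LearnedIndex:
    """Minimal learned range index sketch: a linear model approximates the
    CDF of the sorted keys, and a lookup binary-searches only the window
    around the predicted position."""

    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        n = len(sorted_keys)
        # Least-squares fit of position ~= slope * key + intercept.
        mean_k = sum(sorted_keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in sorted_keys) or 1.0
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(sorted_keys))
        self.slope = cov / var
        self.intercept = mean_p - self.slope * mean_k
        # Track the worst-case prediction error so lookups stay correct.
        self.max_err = max(abs(self._predict(k) - i)
                           for i, k in enumerate(sorted_keys))

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def lookup(self, key):
        """Return the position of `key` in the array, or -1 if absent."""
        pos = self._predict(key)
        lo = max(0, pos - self.max_err)
        hi = min(len(self.keys), pos + self.max_err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else -1
```

On smoothly distributed keys the error window stays small, so each lookup touches only a handful of positions while the "index" itself is just two floats; that is the space/time trade the authors are after.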

By swapping in a shallow, easily trainable model as the hash function, the authors reduced hash conflicts on map data by nearly 80%, and achieved roughly 30% reductions on web and log data. However, the model did not win in every case. Against B-Trees, the authors found that their TensorFlow-based neural network carried significant invocation overhead and computation latency that was overkill for the task. That overhead, paired with Python as the front end, showed that TensorFlow models are advantageous for large tasks but a poor fit for small, lightweight operations. B-Trees also proved more cache-efficient, though the paper observes that they effectively overfit the data: their nested if-statements amount to memorizing which bucket each key range falls into. With Bloom filters, increasingly accurate models built on larger RNNs consistently outperformed the traditional structure.
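For intuition on the Bloom filter result, here is a minimal sketch (a simplification, with a stand-in `model` callable where the paper trains an RNN over key strings): a classifier screens each query, and every real key the classifier would reject is stored in a small backup Bloom filter, so the structure can have false positives but never false negatives.

```python
import hashlib

class BloomFilter:
    """Tiny conventional Bloom filter, used here as the backup filter."""
    def __init__(self, num_bits=8192, num_hashes=4):
        self.bits = bytearray(num_bits // 8)
        self.num_bits, self.num_hashes = num_bits, num_hashes

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all((self.bits[p // 8] >> (p % 8)) & 1
                   for p in self._positions(item))


class LearnedBloomFilter:
    """Learned Bloom filter sketch: `model` is any trained classifier
    mapping a key to a membership score in [0, 1]."""
    def __init__(self, keys, model, threshold=0.5):
        self.model, self.threshold = model, threshold
        self.backup = BloomFilter()
        for k in keys:
            if model(k) < threshold:  # the model alone would miss this key,
                self.backup.add(k)    # so the backup filter must catch it

    def __contains__(self, key):
        return self.model(key) >= self.threshold or key in self.backup
```

The better the classifier, the fewer keys spill into the backup filter and the less memory the whole structure needs, which is why the larger, more accurate RNNs kept winning in the comparison above.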

The index structures discussed in this paper can all be modeled with continuous, largely linear functions. One future research avenue would be studying and designing ML models for data structures that behave in a nonlinear fashion. For multidimensional index structures, such as virtual-to-physical page address translation in operating systems, neural networks could be better at capturing higher-dimensional, complex interactions. Beyond indexes, machine learning could also be used to perform joins and sorts more efficiently. Finally, while the memory requirements and sizes of machine learning models may seem intimidating today, the authors predict that future GPUs and TPUs may be powerful enough to support model-based indexing and searching at small scale, too.
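As a sketch of the sorting direction (not from this paper; it assumes a `cdf` model of the value distribution, e.g., one fit on a sample of the data), a model can scatter elements near their final positions, after which a single cheap pass repairs the residual disorder:

```python
def learned_sort(values, cdf):
    """Model-assisted sort sketch: `cdf` maps a value to an estimate in
    [0, 1) of its rank fraction. Each element is dropped into the bucket
    the model predicts, then insertion sort (fast on nearly sorted input)
    fixes whatever the model got wrong."""
    n = len(values)
    buckets = [[] for _ in range(n)]
    for v in values:
        slot = min(n - 1, max(0, int(cdf(v) * n)))  # predicted position
        buckets[slot].append(v)
    out = [v for bucket in buckets for v in bucket]
    # Correction pass: close to O(n) when the predictions are accurate.
    for i in range(1, n):
        v, j = out[i], i - 1
        while j >= 0 and out[j] > v:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = v
    return out

# Usage with a hypothetical linear CDF as the stand-in model:
# learned_sort(data, cdf=lambda v: (v - min(data)) / (max(data) - min(data) + 1))
```

The same substitution could plausibly help joins: a model that predicts where a key lives can also predict which partitions of two tables actually need to be compared.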

The authors' experiments demonstrate a very interesting and hopeful future for optimizing and improving index structures with machine learning. The paper opens up all sorts of research possibilities, and in the long run it may be the jumping-off point for a whole new field of research focused on building smarter, more robust systems.