"Git is for Data" Published in CIDR 2023

Yonatan Alexander

Rajat Arya

We are excited to share that our paper, "Git is for Data," was accepted to CIDR 2023! CIDR is the premier conference for practical data systems research, and Yucheng presented this work in Amsterdam this past week.

Yucheng Low, Co-founder & CEO presenting 'Git is for Data' at CIDR 2023 in Amsterdam.

In this post I want to share the abstract section, conclusions section, and some of the key figures and results from the paper. I encourage you to read the full paper for context, background, and details.


Abstract

Dataset management is one of the greatest challenges to the application of machine learning (ML) in industry. Although scaling and performance have often been highlighted as the significant ML challenges, development teams are bogged down by the contradictory requirements of supporting fast and flexible data iteration while maintaining stability, provenance, and reproducibility. For example, blobstores are used to store datasets for maximum flexibility, but their unmanaged access patterns limit reproducibility. Many ML pipeline solutions to ensure reproducibility have been devised, but all introduce a degree of friction and reduce flexibility.

In this paper, we propose that the solution to the dataset management challenges is simple and apparent: Git. As a source control system, as well as an ecosystem of collaboration and developer tooling, Git has enabled the field of DevOps to provide both speed of iteration and reproducibility to source code. Git is not only already familiar to developers, but is also integrated in existing pipelines, facilitating adoption. However, as we (and others) demonstrate, Git, as designed today, does not scale to the needs of ML dataset management. In this paper, we propose XetHub, a system that retains the Git user experience and ecosystem, but can scale to support large datasets. In particular, we demonstrate that XetHub can support Git repositories at the TB scale and beyond. By extending Git to support large-scale data, and building upon a DevOps ecosystem that already exists for source code, we create a new user experience that is both familiar to existing practitioners and truly addresses their needs.

Selected Figures & Results

Figure 1: (a) Content Defined Merkle Tree. Each leaf chunk is derived from a Content Defined Chunking procedure. Nodes are merged using a simple hashing rule: a partition is inserted whenever the hash modulo a target child count equals 0 with constraints on a minimum / maximum child count. (b) Insertion of a new chunk (Data 0) can maintain tree stability. New nodes are in red.
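The merging rule in the caption can be sketched in a few lines of Python. This is a minimal illustration, not XetHub's implementation: the use of SHA-256, the choice of the trailing 8 bytes of a node hash as the value tested, and the parameters `target`, `min_children`, and `max_children` are all assumptions made for the example, and the leaves are assumed to already come from a content-defined chunker.

```python
import hashlib


def chunk_hash(data: bytes) -> bytes:
    """Hash one leaf chunk (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(data).digest()


def merge_level(hashes, target=4, min_children=2, max_children=8):
    """Group consecutive node hashes into parent nodes.

    A partition is placed after a node whose hash modulo `target` is 0,
    provided the group has at least `min_children` members. A group is
    also closed once it reaches `max_children` members.
    """
    assert 2 <= min_children <= max_children
    parents, group = [], []
    for h in hashes:
        group.append(h)
        at_boundary = int.from_bytes(h[-8:], "big") % target == 0
        if (at_boundary and len(group) >= min_children) or len(group) >= max_children:
            parents.append(hashlib.sha256(b"".join(group)).digest())
            group = []
    if group:  # leftover nodes at the end form a final parent
        parents.append(hashlib.sha256(b"".join(group)).digest())
    return parents


def build_tree(leaf_hashes, **kwargs):
    """Return every level of the tree, leaves first, root last."""
    levels = [list(leaf_hashes)]
    while len(levels[-1]) > 1:
        levels.append(merge_level(levels[-1], **kwargs))
    return levels


if __name__ == "__main__":
    leaves = [chunk_hash(f"chunk-{i}".encode()) for i in range(50)]
    before = build_tree(leaves)
    # Insert a new chunk at the front, as in panel (b).
    after = build_tree([chunk_hash(b"data-0")] + leaves)
    shared = set(before[1]) & set(after[1])
    print(f"level-1 parents before: {len(before[1])}, shared after insert: {len(shared)}")
```

Because a partition depends only on the child hashes, inserting a chunk tends to change only the nearby parents, while groups further along can keep their hashes. That is the stability property panel (b) illustrates.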

Figure 4: Cumulative storage cost for committing 50 CORD-19 dataset versions from 2021-02-01 to 2022-06-02. The Y axis is truncated for readability. LFS required 545GB, Xet required 287GB including a 2.4GB MerkleTree. Xet+CSV required 87GB including a 1.7GB MerkleTree. See Sec. 6.4 for details.
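To put the Figure 4 numbers in relative terms, the short calculation below derives the savings from the totals quoted in the caption. Only the four storage figures come from the paper; the variable and function names are made up for this sketch.

```python
# Cumulative storage totals quoted in the Figure 4 caption, in GB.
lfs_gb = 545.0
xet_gb, xet_tree_gb = 287.0, 2.4          # Xet total, of which MerkleTree
xet_csv_gb, xet_csv_tree_gb = 87.0, 1.7   # Xet+CSV total, of which MerkleTree


def reduction(baseline: float, value: float) -> float:
    """Fraction of storage saved relative to the baseline."""
    return 1.0 - value / baseline


xet_saving = reduction(lfs_gb, xet_gb)
xet_csv_saving = reduction(lfs_gb, xet_csv_gb)
xet_tree_share = xet_tree_gb / xet_gb
xet_csv_tree_share = xet_csv_tree_gb / xet_csv_gb

print(f"Xet vs LFS:     {xet_saving:.1%} less storage")
print(f"Xet+CSV vs LFS: {xet_csv_saving:.1%} less storage")
print(f"MerkleTree share of Xet total:     {xet_tree_share:.1%}")
print(f"MerkleTree share of Xet+CSV total: {xet_csv_tree_share:.1%}")
```

In other words, Xet stores roughly 47% less than LFS and Xet+CSV roughly 84% less, while the MerkleTree accounts for about 1 to 2% of each total.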

Figure 6: Total MB downloaded per query for each dataset. xet mount was configured with prefetch disabled and 1MB cache block size to optimize for random access. (a) SQL queries using DuckDB [29, 30] performed on a 54GB Xet dataset of Parquet files obtained from [31]. The LAION-400M dataset comprises image URLs, text descriptions and other image metadata including a license. As Parquet is columnar, columnar queries are efficient and only a small fraction (2.3%) of the dataset needs to be downloaded to obtain a license distribution. (b) SQL queries on a 9GB SQLite [32] database built from a 12M image subset of the Laion-Aesthetic dataset [33]. Appropriate column indexes are created to avoid a complete table scan for the queries tested. (c) Queries and indices used for the Parquet and SQLite queries in (a) and (b).
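The role of the column indexes in panel (b) can be seen with Python's built-in `sqlite3` module. The sketch below is only an illustration: the `images` table, its columns, the index name, and the query are hypothetical stand-ins, not the schema or queries from the paper (those are listed in panel (c)).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema standing in for an image-metadata table.
conn.execute("CREATE TABLE images (url TEXT, caption TEXT, license TEXT)")
conn.executemany(
    "INSERT INTO images VALUES (?, ?, ?)",
    [
        (f"https://example.com/{i}.jpg", f"caption {i}", "cc-by" if i % 3 else "unknown")
        for i in range(1000)
    ],
)

QUERY = "SELECT url FROM images WHERE license = ? LIMIT 10"


def query_plan(connection, sql, params):
    """Return SQLite's plan description for a query as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the detail text


plan_before = query_plan(conn, QUERY, ("unknown",))
conn.execute("CREATE INDEX idx_images_license ON images (license)")
plan_after = query_plan(conn, QUERY, ("unknown",))

print("without index:", plan_before)
print("with index:   ", plan_after)
```

Without the index, the plan is a scan of the whole table; with it, SQLite searches the index and touches only the matching rows. On a remotely mounted database that difference translates into far fewer pages fetched over the network.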

Figure 7: All benchmarks performed on a t2.xlarge AWS instance with 4 vCPUs and 16GB RAM. (a) Performance of the Parquet license count query (Fig. 6) comparing query runtime: (i) immediately after mount (uncached), (ii) subsequent runs (cached), (iii) from local disk, (iv) directly from S3 bucket using DuckDB’s native connector. Linux page caches were flushed prior to every query. Since the Parquet page size is large and DuckDB parallelizes data access, we were able to obtain very good performance for Parquet queries, even outperforming DuckDB’s native connector by 21%. Once accessed, our cached performance is comparable to direct local disk performance. (b) Performance of the SQLite 10 Cat Images query (Fig. 6) comparing query runtime: (i) immediately after mount (uncached), (ii) subsequent runs (cached), (iii) from local disk, (iv) directly from S3 bucket using a SQLite VFS HTTP Connector [34]. As the SQLite default page size is small (4K), our 1MB cache block size is far too large, resulting in a nearly 4x slowdown for the uncached query compared with the SQLite VFS connector. However, once cached, performance is on par with local disk (<0.1s).
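Why a 1MB cache block hurts small random reads can be illustrated with a simple model. The function below is a hypothetical sketch, not how xet mount is implemented: it assumes every read is served by fetching each whole cache block it touches, and it ignores latency, parallelism, and prefetching, so it explains the direction of the effect rather than the measured 4x figure.

```python
KIB = 1024
MIB = 1024 * KIB


def bytes_fetched(offsets, read_size, block_size):
    """Bytes transferred if every touched cache block is fetched in full."""
    blocks = set()
    for offset in offsets:
        first = offset // block_size
        last = (offset + read_size - 1) // block_size
        blocks.update(range(first, last + 1))
    return len(blocks) * block_size


# Three scattered 4 KiB page reads, as an index lookup might issue.
page_offsets = [0, 5 * MIB, 9 * MIB + 4 * KIB]
requested = len(page_offsets) * 4 * KIB

large_blocks = bytes_fetched(page_offsets, 4 * KIB, 1 * MIB)
small_blocks = bytes_fetched(page_offsets, 4 * KIB, 4 * KIB)

print(f"requested:           {requested} bytes")
print(f"1 MiB cache blocks:  {large_blocks} bytes ({large_blocks / requested:.0f}x)")
print(f"4 KiB cache blocks:  {small_blocks} bytes ({small_blocks / requested:.0f}x)")
```

Under this model, each isolated 4 KiB page costs a full 1 MiB transfer, up to 256 times the bytes requested, whereas a block size matched to the page size transfers only what is read. Once the blocks are cached, later reads of the same regions cost nothing extra, which matches the cached results in the caption.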

Conclusions

At first glance, we integrate with Git in a manner comparable to Git LFS. However, the core differentiation is the holistic set of tooling XetHub provides to support the needs of ML datasets by fully embracing software engineering practices for data.

We believe that with the right architecture design, pre-existing systems for source control can be extended to fully support the dataset use case, addressing a significant fraction of dataset management needs while minimizing cognitive friction.

The significance is the observation that the needs around dataset management are not unique, and have been addressed by source code management tools. What is unique is only the scale at which it happens. By extending Git to support large-scale data, and building upon a DevOps ecosystem that already exists for source control, we create a new user experience that is both familiar to existing practitioners and truly addresses their needs.

Again, read the full paper here and get started with XetHub today!
