Data Versioning Challenges
Data versioning tools let users track changes to datasets, machine learning models, and experiments. They are similar to the version control systems used in traditional software development, but are optimized for handling data. By recording the exact state of a dataset at a particular moment in time, they make it easier to reproduce and understand experimental outcomes.
In machine learning workflows, managing the many versions and copies of data produced while cleaning data, creating new features, and splitting data into different datasets carries inherent costs.
Although data versioning tools help by turning your datasets into Git-managed assets, you still need to account for the storage costs and the upload and download times incurred by every data change.
When using Git, cloning a repository also downloads the full history of every file. This makes Git inefficient for large files, especially ones that change often; GitHub, for example, enforces a maximum file size of 100 MB.
Here are some popular data versioning tools:
- DVC (Data Version Control) is a free, open-source version control system for data, machine learning models, and experiments. It tracks files with small .dvc pointer files that Git manages, while the file contents themselves are stored elsewhere.
- Git LFS (Large File Storage) is an extension to Git that versions large files by replacing them in the repository with lightweight pointers.
- LakeFS is data-lake middleware that sits between your code and your blob store and implements Git-like operations on the data.
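Of these, Git LFS's pointer mechanism is the simplest to see: the file it checks into the repository in place of the real file is just a few lines of text. The `oid` and `size` below are illustrative (the size corresponds to an 8 GB file), not taken from a real repository:

```
version https://git-lfs.github.com/spec/v1
oid sha256:4f2e9a8c1b6d3e7f0a5c2d9b4e1f8a3c6d0b7e2f5a8c1d4e7b0a3f6c9d2e5b8a
size 8589934592
```

Git commands operate on this tiny pointer, while the actual file bytes live in a separate LFS store.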
All of them implement file-level deduplication, which ensures that only one copy of each file is stored at any given point. However, they still require you to store, upload, and download the entire file whenever there is a minor change, such as adding a new row or column.
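A minimal sketch of why file-level deduplication behaves this way (the in-memory `store` dict is a stand-in for a real remote, not any tool's actual implementation): each file version is keyed by a hash of its entire contents, so an identical file costs nothing, but any change at all produces a new key and a full re-upload.

```python
import hashlib

# content-addressed store: hash of full file -> file bytes
store = {}

def push(data: bytes) -> bool:
    """Simulated push; returns True if the full file had to be uploaded."""
    key = hashlib.sha256(data).hexdigest()
    if key in store:
        return False  # identical file already stored: deduplicated
    store[key] = data  # any change, however small, re-uploads everything
    return True

dataset = b"row1\nrow2\n" * 1000
push(dataset)              # first upload: True
push(dataset)              # unchanged file is deduplicated: False
push(dataset + b"row3\n")  # one new row forces a full re-upload: True
```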
This is where XetHub's block-level deduplication comes into play. It offers an alternative approach that minimizes storage and network costs by deduplicating blocks.
To compare the performance of these four tools, we'll measure the average time to upload changes to the central version control system. This stress-tests each tool's deduplication technology: the higher the throughput, the better the deduplication, because it means less of the raw data needed to be uploaded.
We’ll focus the benchmarks on two important parts of the machine learning workflow: feature engineering and generating test-train splits.
Feature engineering
A core part of the ML engineer's iteration loop is feature engineering, where new features (or columns) are created to improve model performance during training.
To see how these four tools stack up against each other, I run through the following:
I start with a single 8 GB file (dataset) and then repeat the following 10 times:
- I carry out a simulated iteration of feature engineering by adding random new columns to the dataset
- I commit and push the changes to the central repository
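As a rough sketch of one iteration (not the actual benchmark harness, and scaled far down from 8 GB), each round adds a random column to the dataset; the commit-and-push step, which is what the benchmark times, is marked by a comment:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
# small stand-in for the single 8 GB starting dataset
df = pd.DataFrame(rng.random((1_000, 8)), columns=[f"col_{i}" for i in range(8)])

for i in range(10):
    # one simulated feature-engineering iteration: add a random new column
    df[f"feature_{i}"] = rng.random(len(df))
    # here the benchmark would commit and push the dataset with each tool,
    # recording how long the upload takes

print(df.shape)  # (1000, 18): 8 original columns + 10 new features
```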
I then average the upload times for each tool and plot them:
Generating train-test validation split
In another use case, data is split into different folders for the train, validation, and test sets to ensure consistent and robust machine learning experiments within teams of all sizes.
One way to handle this is by assigning the new batch of data as the new test subset, while the previous test subset becomes the validation subset. The former validation subset is then appended to the train set.
Alternatively, instead of appending, you can manage the batch sizes and file shuffling to avoid modifying existing files, but that works against precisely what data versioning tools aim to assist with.
To simulate this I run the following steps:
I start with a single 8 GB file (dataset) and repeat the following 10 times:
- I append the validation set to the train set
- I move the test set as-is to the validation folder
- I generate a new 800 MB test set
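The rotation above can be sketched with plain file operations. The `splits/{train,validation,test}` layout and single `data.csv` per folder are assumptions for illustration, and the files here are tiny stand-ins for the real 800 MB sets:

```python
import tempfile
from pathlib import Path

# hypothetical layout: splits/{train,validation,test}/data.csv
root = Path(tempfile.mkdtemp()) / "splits"
for name in ("train", "validation", "test"):
    (root / name).mkdir(parents=True)

(root / "train" / "data.csv").write_text("a\nb\n")
(root / "validation" / "data.csv").write_text("c\n")
(root / "test" / "data.csv").write_text("d\n")

def rotate(new_test_rows: str) -> None:
    # 1. append the validation set to the train set
    with (root / "train" / "data.csv").open("a") as train:
        train.write((root / "validation" / "data.csv").read_text())
    # 2. move the test set as-is into the validation folder
    (root / "test" / "data.csv").replace(root / "validation" / "data.csv")
    # 3. generate a fresh test set (800 MB in the real benchmark)
    (root / "test" / "data.csv").write_text(new_test_rows)

rotate("e\n")
print((root / "train" / "data.csv").read_text())  # "a\nb\nc\n"
```

Note that the train file is appended to and the test file is rewritten wholesale each round, which is exactly what punishes file-level deduplication.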
I then average the upload times for each tool and plot them:
How XetHub's Deduplication Technology Works
In XetHub, we utilize Merkle Trees. Our Content Addressed Store (CAS) balances the need to store, communicate, and manage large objects efficiently with the ability to deduplicate data at small block sizes.
Each file is divided into small variable-sized blocks, with an average size of 16 KB per block. A Merkle Tree is constructed by grouping the hashes of these blocks together using Content Defined Chunking. This approach ensures that operations such as inserting, deleting, or modifying blocks do not require significant rewriting of the entire tree, thus preserving most of the tree across different versions of the file.
For example, when new bytes are inserted at the beginning of the file, only a few nodes may need to be added to the tree.
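XetHub's production chunker is far more sophisticated, but the core idea of content-defined chunking can be sketched in a few lines: a boundary is cut wherever a hash of the trailing bytes hits a target pattern, so boundaries depend on content rather than byte offsets, and an insertion only disturbs the chunks it touches. The window size, boundary condition, and resulting chunk sizes below are toy choices, not XetHub's actual parameters:

```python
import hashlib
import random

def chunks(data: bytes, window: int = 16) -> list[bytes]:
    """Toy content-defined chunking: cut wherever a hash of the trailing
    `window` bytes ends in a zero byte (~1 in 256 positions, so ~256-byte
    average chunks here versus XetHub's 16 KB average)."""
    out, start = [], 0
    for i in range(window, len(data)):
        if hashlib.sha1(data[i - window:i]).digest()[-1] == 0:
            out.append(data[start:i])
            start = i
    out.append(data[start:])
    return out

rng = random.Random(0)
original = rng.randbytes(100_000)
edited = b"new bytes inserted at the front" + original

a = set(chunks(original))
b = set(chunks(edited))
# boundaries depend only on content, so they re-align just past the
# insertion: almost every chunk of the edited file dedupes against the
# original, and only the chunk(s) containing the insertion are new
print(len(b - a), "of", len(b), "chunks are new")
```

With fixed-size chunking, by contrast, the 31 inserted bytes would shift every subsequent chunk boundary and nearly nothing would deduplicate.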
When files are added, any new small blocks are concatenated together into 16 MB blocks. A separate Merkle Tree represents each 16 MB block and the small blocks it contains. To reconstruct a file, a graph traversal intersects the file's tree with the block tree, resolving the set of block ranges needed to rebuild the file.
These large block sizes offer a significant amount of data locality, meaning there is a high likelihood of needing the whole block when accessing part of it. This makes them much more efficient to send and receive, resulting in improved performance.
In conclusion, data versioning is essential for machine learning engineers who need to manage many versions and copies of data. While traditional tools like Git LFS and DVC help, XetHub's block-level deduplication takes data versioning a step further: by combining Merkle Trees with Content Defined Chunking, it minimizes storage and network costs while keeping large objects manageable. The performance comparisons demonstrate XetHub's advantages in both feature engineering and data splitting workflows.