We have launched pyxet, a Python library for building apps with XetHub! pyxet will help developers iterate faster by storing their data, code, and ML models all in one place. Get started here (repo + tutorials). pyxet implements most of pathlib, fsspec, pandas, and common command line functions such as ls and cp. Join us and get involved to help shape the future of pyxet.
pyxet is alpha quality and is available on PyPI today, with support for Python 3.7+ on macOS and Linux. To get involved, join our Discord server and the pyxet community.
Everyone is racing to develop generative AI apps, but the tools most developers use to build these applications are not optimized for large data sets — in fact, they don’t work very well at all. Developers have to flip back and forth between systems that hold code and data, introducing significant drag on efficiency, productivity, and accuracy.
Today, the best practice is to split a solution's code, models, logs, and data across separate systems, and to encode assets, environments, and versions into storage paths. Often, this means a naming convention in S3 like the following: 's3://models-logs-data/<environment>/<version>/<date>/<file>'.
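To make that convention concrete, here is a minimal sketch of the bookkeeping it tends to push into application code. The bucket name comes from the example path above; the `AssetKey` class and the field values are hypothetical and only illustrate how environment, version, and date end up living in a string.

```python
from dataclasses import dataclass

BUCKET = "models-logs-data"  # bucket name taken from the example path above


@dataclass(frozen=True)
class AssetKey:
    """Hypothetical helper: one asset addressed by the path convention."""

    environment: str
    version: str
    date: str
    file: str

    def to_uri(self) -> str:
        # Assemble s3://<bucket>/<environment>/<version>/<date>/<file>
        return (
            f"s3://{BUCKET}/{self.environment}/"
            f"{self.version}/{self.date}/{self.file}"
        )

    @classmethod
    def from_uri(cls, uri: str) -> "AssetKey":
        prefix = f"s3://{BUCKET}/"
        if not uri.startswith(prefix):
            raise ValueError(f"unexpected bucket in {uri!r}")
        # Split into at most four parts so the file part may contain slashes.
        parts = uri[len(prefix):].split("/", 3)
        if len(parts) != 4:
            raise ValueError(f"expected 4 path components in {uri!r}")
        return cls(*parts)
```

Every consumer of the bucket has to agree on this layout, and nothing enforces it: the "version" is just a path segment, not something you can diff, branch, or roll back.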
That makes sense under the constraints of what Git can handle and how blob stores are designed, but would you choose it if those constraints went away? What would a more natural way look like?
That seems simple enough.
Restore and audit databases as we do with code, experiment with models as we do with branches, and run tests and CI/CD at the project level as if the project were a local app. Much easier.
That’s why today, at PyData Conference 2023 in Seattle, we announced the launch of pyxet, an open-source Python library for building apps with XetHub!
pyxet will help developers iterate faster by enabling storage of their data, code, and ML models all in one place. Get started here.
pyxet was built with developer productivity in mind. XetHub scales Git repositories to 1 TB, but we know that using the Git command line breaks your flow. With pyxet you can now work with your data the way you do today, all while staying in Python.
As an example, here is how you can read a file directly from a XetHub repo into a Pandas DataFrame:
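A minimal sketch, assuming that importing pyxet registers an `xet://` scheme with fsspec so that pandas can read from it directly. The URL layout (`xet://<user>/<repo>/<branch>/<path>`), the repository path in the usage comment, and the helper names `xet_url` and `read_repo_csv` are illustrative assumptions, not part of pyxet's API; check the documentation for the exact form supported by your version.

```python
def xet_url(user: str, repo: str, branch: str, path: str) -> str:
    """Build an xet:// URL (assumed layout: xet://<user>/<repo>/<branch>/<path>)."""
    return f"xet://{user}/{repo}/{branch}/{path.lstrip('/')}"


def read_repo_csv(user: str, repo: str, branch: str, path: str):
    """Read a CSV file from a XetHub repo into a pandas DataFrame."""
    # Imported lazily so the URL helper works without either package installed.
    import pandas as pd
    import pyxet  # noqa: F401  (assumption: the import registers xet:// with fsspec)

    return pd.read_csv(xet_url(user, repo, branch, path))


# Usage (requires `pip install pyxet pandas` and network access):
# df = read_repo_csv("XetHub", "titanic", "main", "titanic.csv")
# print(df.head())
```

The point is that the read is an ordinary `pd.read_csv` call: there is no separate download step, and the branch is part of the address.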
pyxet implements most of pathlib and fsspec today. It's easy to navigate, use, understand, and remember, enabling ML teams to ship better projects faster. As human time and machine time get more expensive, pyxet makes it simple to get and work with your data easily, quickly, and consistently, without wasting hours downloading and uploading data.
For more details on how to get started with pyxet, check out the documentation here.
We are planning to open source pyxet and extend its functionality to include writing back to repositories, mounting repositories (to allow streaming data), adding Windows support, and more. Follow the repo to stay updated & get involved!