January 8, 2024
Run Simple ETL Pipelines Inside GitHub Using Git-Xet & GitHub Actions
At XetHub, we scale Git to 100 terabytes per repo so you can version your code and large files together. After we released our GitHub integration to bring this functionality as an upgrade to your GitHub repos, we had an interesting thought:
What if GitHub Actions was enough for running lightweight, recurring data ETL (extract-transform-load) jobs?
With this approach, you don’t have to carefully set up credentials, introduce an orchestration tool, find a logging solution, or provision a new place to store data (like S3). You can instead lean on all the batteries GitHub has included in its platform and use our XetData add-on to manage the large files that Git traditionally struggles with.
How it Works
The example we’ll be showcasing fetches air quality data from OpenAQ on an hourly basis and the work is entirely contained in this GitHub repo.
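To make the extract-transform part concrete, here’s a minimal Python sketch of an hourly OpenAQ fetch. The endpoint, query parameters, output path, and the `flatten` helper are illustrative assumptions for this post, not the exact code in the repo:

```python
import csv
import json
import urllib.request

# Assumed OpenAQ endpoint and parameters -- adjust to the fields you need.
API_URL = "https://api.openaq.org/v2/measurements?limit=100"

FIELDS = ["location", "parameter", "value", "unit", "date_utc"]

def flatten(payload):
    """Flatten an OpenAQ-style JSON payload into flat dicts for CSV output."""
    rows = []
    for m in payload.get("results", []):
        rows.append({
            "location": m.get("location"),
            "parameter": m.get("parameter"),
            "value": m.get("value"),
            "unit": m.get("unit"),
            "date_utc": (m.get("date") or {}).get("utc"),
        })
    return rows

def fetch_and_append(path="air_quality.csv"):
    """Fetch the latest measurements and append them to a CSV tracked in the repo."""
    with urllib.request.urlopen(API_URL) as resp:
        payload = json.load(resp)
    rows = flatten(payload)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if f.tell() == 0:  # new file: write the header once
            writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Because the output CSV is committed back to the repo, each hourly run produces a versioned, append-only dataset with no extra storage service involved.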
At the heart of our approach is the workflow file that defines how our GitHub Action runs. If you’re new to GitHub Actions, check out GitHub’s excellent documentation. Here’s a conceptual diagram from GitHub’s documentation:
We created a single workflow file that lives in .github/workflows/etl-action.yml in our repo. Here’s our visualization of the workflow:
Let’s walk through the key components.
Event: A cron schedule that triggers the workflow hourly (on the 21st minute) in GitHub Actions.
Steps: The precise sequence of steps to run on each hourly trigger.
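As a rough sketch, a workflow like ours looks something like the following. The step names, script path, and data directory here are illustrative assumptions, not the exact contents of etl-action.yml:

```yaml
name: etl-action
on:
  schedule:
    - cron: "21 * * * *"   # every hour, on the 21st minute
jobs:
  etl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Fetch air quality data   # hypothetical script name
        run: python fetch.py
      - name: Commit results back to the repo
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add data/
          git commit -m "Hourly ETL run" || echo "No new data"
          git push
```

Note that scheduled workflows run on GitHub's clock at best effort, so an hourly cron may occasionally be delayed or skipped under load.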
Analyzing & Visualizing Your Data
Browsing Data in GitHub
By enabling our GitHub app for your repo, our XetData bot comments with links to helpful views into your datasets, models, images, etc. Here’s an example you can view yourself. You can also install our browser extension to get links to Xet hosted file views while opening specific files in the GitHub UI.
If your GitHub repo is private, you can share access to the datasets by adding someone to your repo as a collaborator. We use your repo’s permissions to restrict who can & can’t access the datasets and views into your datasets.
If your GitHub repo is already public, anyone can access it by installing our Git-Xet extension and running:
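For example (the URL is a placeholder for the repo you want; substitute your own):

```shell
# Clone a Xet-enabled GitHub repo; Git-Xet fetches the large files.
git xet clone https://github.com/<org>/<repo>
```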
If the repo contains lots of large files (gigabytes to terabytes), you can also read-only mount the repo locally instead and the blocks of data needed are fetched behind the scenes just-in-time:
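A sketch of what that looks like (the URL and mount path are placeholders, and the exact flags may differ in your Git-Xet version):

```shell
# Mount the repo read-only at a local path; data blocks stream in on demand
# instead of being downloaded up front.
git xet mount https://github.com/<org>/<repo> ./mounted-repo
```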
Learn more about read-only mounting here.
Create your Own ETL Repo
To help you create your own GitHub-hosted ETL repos, we’ve created a template repo to start with. All of the instructions you need can be found in the README file.
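If you use the GitHub CLI, one way to bootstrap a repo from a template is shown below; the repo names are placeholders, not our actual template:

```shell
# Create a new repo from a template repo and clone it locally.
gh repo create my-etl-repo --template <org>/<template-repo> --public --clone
```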
This workflow enables lots of interesting use cases:
Run GitHub Action-based QA when new data arrives in the repo
Find anomalies & outliers in datasets automatically when new data arrives in the repo
Run ETL pipelines at the git branch level (e.g. to go from a raw branch to a transformed branch)
If you’re proud of any public repos you’ve created that are ETL-ing data, or if you’re running into issues, join our Slack and reach out to us!