January 8, 2024

Run Simple ETL Pipelines Inside GitHub Using Git-Xet & GitHub Actions

Srini Kadamati

At XetHub, we scale Git to 100 terabytes per repo so you can version your code and large files together. After we released our GitHub integration to bring this functionality as an upgrade to your GitHub repos, we had an interesting thought:

  • What if GitHub Actions were enough for running lightweight, recurring data ETL (extract-transform-load) jobs?

With this approach, you don’t have to carefully set up credentials, introduce an orchestration tool, find a logging solution, or find a new place to store data (like S3). You can instead lean on all the batteries GitHub has included in its platform and use our XetData add-on to manage the large files that Git traditionally struggles with.

How it Works

Our example fetches air quality data from OpenAQ every hour, and all of the work is contained in this GitHub repo.

At the heart of our approach is the workflow file that defines how our GitHub Action runs. If you’re new to GitHub Actions, check out GitHub’s excellent documentation. Here’s a conceptual diagram from GitHub’s documentation:

We created a single workflow file that lives in .github/workflows/etl-action.yml in our repo. Here’s our visualization of the workflow:

Let’s walk through the key components.

  • Event: A cron schedule in GitHub Actions that triggers the workflow every hour, at minute 21.

     - cron: "21 * * * *"
  • Steps: The precise sequence of steps to run on each hourly trigger.

    1. Checkout repo using Git-Xet

      • Git-Xet augments Git to push & pull just the large files to XetHub

    2. Setup Python

    3. Install Python dependencies

    4. Run src/pipeline.py to fetch data, transform it, and save as a CSV in the data/ folder

    5. Display first 10 rows of CSV

    6. Commit back to repo using Git-Xet
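
To make steps 4 and 5 more concrete, here is a minimal sketch of what a script like src/pipeline.py could look like. It is not the code from our repo: the field names, the filtering rule, and the helper names (transform, save_csv, head) are all assumptions made for illustration, and the fetch from OpenAQ is left out so the example runs without network access.

```python
import csv
from pathlib import Path

# Hypothetical column names; the real OpenAQ response shape may differ.
FIELDS = ["location", "parameter", "value", "unit", "timestamp"]


def transform(records):
    """Keep readings with a numeric, non-negative value and a fixed set of fields."""
    rows = []
    for rec in records:
        value = rec.get("value")
        if not isinstance(value, (int, float)) or value < 0:
            continue  # assumed rule: drop missing or negative sentinel readings
        rows.append({field: rec.get(field, "") for field in FIELDS})
    return rows


def save_csv(rows, path):
    """Write rows to a CSV file, creating the parent folder (e.g. data/) if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return path


def head(path, n=10):
    """Return up to the first n data rows of a CSV file as dictionaries."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        return [row for _, row in zip(range(n), csv.DictReader(f))]


if __name__ == "__main__":
    # Stand-in records; a real run would fetch these from the OpenAQ API.
    sample = [
        {"location": "A", "parameter": "pm25", "value": 12.5,
         "unit": "ug/m3", "timestamp": "2024-01-08T00:00:00Z"},
        {"location": "B", "parameter": "pm25", "value": -999,
         "unit": "ug/m3", "timestamp": "2024-01-08T00:00:00Z"},
    ]
    out = save_csv(transform(sample), "data/sample.csv")
    for row in head(out):
        print(row)
```

Keeping the transform as a plain function over a list of records makes it easy to test locally before the workflow ever runs it on a schedule.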

Analyzing & Visualizing your Data

Browsing Data in GitHub

Once you enable our GitHub app for your repo, our XetData bot comments with links to helpful views of your datasets, models, images, etc. Here’s an example you can view yourself. You can also install our browser extension to get links to Xet-hosted file views while opening specific files in the GitHub UI.

Sharing Access

If your GitHub repo is private, you can share access to the datasets by adding someone to your repo as a collaborator. We use your repo’s permissions to restrict who can & can’t access the datasets and views into your datasets.

If your GitHub repo is already public, anyone can access it by installing our Git-Xet extension and running:

git clone git@github.com:xetdata/easy-etl.git

If the repo contains lots of large files (gigabytes to terabytes), you can instead mount the repo locally as read-only. The blocks of data you need are then fetched just in time, behind the scenes:

git xet mount git@github.com:xetdata/easy-etl.git

Learn more about read-only mounting here.

Create your Own ETL Repo

To help you create your own GitHub-hosted ETL repos, we’ve created a template repo to start with. All of the instructions you need can be found in the README file.

Next Steps

This workflow enables lots of interesting use cases:

  • Run GitHub Action-based QA when new data arrives in the repo

  • Find anomalies & outliers in datasets automatically when new data arrives in the repo

  • Run ETL pipelines at the git branch level (e.g. to go from raw branch to a clean branch)
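
As a rough illustration of the anomaly-detection idea, here is a minimal sketch of a check that a workflow step could run after new data lands. The function name flag_outliers and the z-score threshold of 3 are assumptions chosen for the example, not something our repo ships. A step that exits with a non-zero status fails the job, so a script like this could call sys.exit(1) when it finds outliers.

```python
from statistics import mean, pstdev


def flag_outliers(values, threshold=3.0):
    """Return the indices of values more than `threshold` standard deviations from the mean."""
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing to flag
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


if __name__ == "__main__":
    # Stand-in readings; a real check would load the newest CSV from data/.
    readings = [10.0] * 20 + [100.0]
    print(flag_outliers(readings))
```

A z-score is only one simple choice. It assumes the readings are roughly bell-shaped, so a different rule may suit skewed data better.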

If you’re proud of any public repos you’ve created that are ETL-ing data, or you’re running into issues, join our Slack and reach out to us!


Simplify your ML development today

Stop juggling multiple tools and streamline your workflow with XetHub.
