Monorepo with Turborepo and Docker Matrix on GitHub Actions: Selectively Building Modified Packages

Optimizing Selective Docker Builds in a Turborepo Monorepo on GitHub Actions

Our GitHub Actions matrix-based container build workflow was building every container image in our Turborepo monorepo on each run, even when only a few packages had actually changed. This inflated both build times and operational costs. In this post, I’ll outline how we tackled this issue and implemented a robust solution that leverages Turborepo’s dry-run capabilities and GitHub Actions caching to build Docker images only for the changed packages.


The Technology Stack

Before diving in, here’s a brief overview of the tools and infrastructure we employed:

  • Turborepo: A powerful build system for monorepos that intelligently tracks and caches changes.
  • Docker: Used for containerizing our applications, ensuring consistent and reproducible builds.
  • GitHub Actions: Our CI/CD platform for automating the entire build and deployment pipeline.
  • pnpm: Our package manager, enabling efficient dependency management across the monorepo.

The Challenge: Unnecessarily Building All Docker Images

Originally, we used a static matrix strategy to build Docker images. That approach triggered builds for every package on every run, even when only one or two had actually changed. This blanket strategy led to long build times and increased costs because we were not leveraging Turborepo’s incremental builds.

The key insight was that we needed a way to dynamically generate the matrix input so that only the changed packages (i.e., those with a cache MISS) are rebuilt. In our context, the “changed” packages are identified by running a dry-run build with Turborepo.


Our Refined Approach: Building Only the Changed Packages

To address the issue, we restructured our workflow into three main jobs (a skeleton of the full wiring follows the list):

  1. monorepo_filter:

    • Discovery without Premature Caching: We run pnpm turbo run build --dry-run=json to generate an output.json that lists each package’s build status.
    • Dynamic Matrix Generation: By parsing this JSON, we extract a list of packages with a cache MISS and feed this list into the matrix configuration for the next stage.
    • Artifact Upload and Cache Invalidation: We also upload the dry-run output as an artifact and perform cache invalidation based on previous outputs to ensure no stale cache persists. See the “Saving and Parsing Output JSON for Matrix Generation” and “Invalidating Previous Cache for Packages with Cache MISS” sections below for more details.
  2. build:

    • Selective Matrix Build: This job uses the dynamically generated matrix from the monorepo_filter job to build Docker images only for the packages that have changed.
    • Efficient Resource Usage: Each matrix job checks out the code, installs dependencies, and runs a filtered Turbo build for its respective package, followed by building and pushing its Docker image.
  3. finalize_cache:

    • Post-Build Cache Update: Only if all Docker builds pass does this job run a full Turbo build to update the global cache. This ensures that the new cache reflects only fully successful builds, avoiding future false positives.
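
Before walking through each step, here is a minimal sketch of how the three jobs are wired together. The generate-matrix.sh helper is a hypothetical stand-in for the dry-run and jq logic shown in the next sections, and setup steps (pnpm install, registry login, cache restore) are omitted for brevity:

    jobs:
      monorepo_filter:
        runs-on: ubuntu-latest
        outputs:
          matrix: ${{ steps.filter.outputs.matrix }}
          should_run_build: ${{ steps.filter.outputs.should_run_build }}
        steps:
          - uses: actions/checkout@v4
          # Hypothetical helper wrapping the dry run + jq parsing
          # covered in the sections below.
          - id: filter
            run: ./scripts/generate-matrix.sh

      build:
        needs: monorepo_filter
        # Skip the whole job when no package had a cache MISS.
        if: needs.monorepo_filter.outputs.should_run_build == 'true'
        runs-on: ubuntu-latest
        strategy:
          matrix:
            package: ${{ fromJSON(needs.monorepo_filter.outputs.matrix) }}
        steps:
          - uses: actions/checkout@v4
          # Filtered build plus docker build/push, expanded further below.
          - run: pnpm turbo run build --filter=${{ matrix.package }}

      finalize_cache:
        # `needs` means this job runs only if every matrix build succeeded.
        needs: [monorepo_filter, build]
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: pnpm turbo run build --cache-dir .turbo/cache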

Saving and Parsing Output JSON for Matrix Generation

One of the critical steps in this pipeline is the extraction of a build plan from Turborepo’s output. After executing a dry run with pnpm turbo run build --dry-run=json, we capture the resulting output.json file. This file contains detailed information about each task in our monorepo, including the package name, build status, and cache indicators.
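
For orientation, here is an abridged example of what output.json contains, trimmed to the fields our jq queries use later; the package names and hashes are invented for illustration:

    {
      "tasks": [
        {
          "taskId": "@app/web#build",
          "package": "@app/web",
          "hash": "79a1b3c5d7e9f012",
          "cache": { "status": "MISS" }
        },
        {
          "taskId": "@lib/ui#build",
          "package": "@lib/ui",
          "hash": "0f1e2d3c4b5a6978",
          "cache": { "status": "HIT" }
        }
      ]
    }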

To convert this information into a matrix that GitHub Actions can use, we use jq—a lightweight and powerful command-line JSON processor. Here’s how the process works:

  1. Saving the Output:
    The dry-run output is saved to a file called output.json. This file acts as our single source of truth for determining which packages need to be rebuilt.

    pnpm turbo run build --cache-dir .turbo/cache --dry-run=json > output.json
    
  2. Parsing with jq:
    We parse output.json to extract only those packages where the cache status is MISS (indicating that they need to be rebuilt). We also filter out libraries (packages starting with @lib) and test-related packages. This produces a list of package names that require building.

    # Select build tasks that missed the cache, excluding internal
    # libraries (@lib*) and test packages, then emit a deduplicated
    # compact JSON array of package names for the matrix.
    packages=$(jq -r '.tasks[] | select(
        .cache.status == "MISS"
        and (.package | startswith("@lib") | not)
        and (.package | startswith("test") | not)
      ) | .package' output.json | sort -u | jq -Rsc 'split("\n") | map(select(length > 0))')
    
  3. Feeding the Matrix:
    The resulting JSON array of package names is then assigned to a GitHub Actions output variable. This output is used in the matrix strategy of the subsequent build job, ensuring that only the packages identified as needing a rebuild are processed.

    echo "matrix=$packages" >> $GITHUB_OUTPUT
    matrix_length=$(echo "$packages" | jq 'length')
    if [ "$matrix_length" -gt 0 ]; then
      echo "should_run_build=true" >> $GITHUB_OUTPUT
    else
      echo "should_run_build=false" >> $GITHUB_OUTPUT
    fi
    

This method ensures that our build matrix is dynamically generated, focusing resources solely on the packages that have actually changed.
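
Inside each matrix job of the build stage, the steps then look roughly like the sketch below. The gate and matrix wiring were shown in the skeleton above; the registry name, tagging scheme, and build context here are illustrative assumptions:

    steps:
      - uses: actions/checkout@v4
      - run: pnpm install --frozen-lockfile
      # Build only this matrix entry's package and its dependencies.
      - run: pnpm turbo run build --filter=${{ matrix.package }}
      # Registry, tag scheme, and build context are illustrative; scoped
      # package names may need sanitizing before use in an image tag.
      - run: |
          docker build -t registry.example.com/${{ matrix.package }}:${{ github.sha }} .
          docker push registry.example.com/${{ matrix.package }}:${{ github.sha }}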

Finalizing Cache Update After Successful Builds

After all Docker images have been built successfully, we run a final Turbo build without the dry-run flag. This step does not build container images; it exists solely to refresh the global cache, which ensures the cache reflects only fully successful builds.

  - name: Run Full Turbo Build to Finalize Cache
    run: |
      pnpm turbo run build --cache-dir .turbo/cache
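
The .turbo/cache directory this build writes to is carried between workflow runs with actions/cache. The key scheme below is one reasonable choice, not our exact configuration:

    - uses: actions/cache@v4
      with:
        path: .turbo/cache
        # Unique primary key so the refreshed cache is saved each run;
        # restore-keys fall back to the most recent previous cache.
        key: turbo-cache-${{ github.ref_name }}-${{ github.sha }}
        restore-keys: |
          turbo-cache-${{ github.ref_name }}-
          turbo-cache-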

Invalidating Previous Cache for Packages with Cache MISS

Even with dynamic matrix generation, there’s a risk that stale caches from previous builds could cause Turborepo to incorrectly skip necessary builds. For instance, if a package’s code is reverted to an earlier state, an old cache entry might still be present, leading Turborepo to falsely assume the package is up to date.

To address this, we implemented a cache invalidation step. This process involves comparing the current build’s package hashes (from output.json) with those saved from the previous run in output.prev.json. If a previous hash exists for a package, we remove all cache files in the .turbo/cache directory that begin with that hash.

Here’s how we achieve this:

  1. Extract Current Package Hashes:
    Use jq to extract packages with a cache MISS from the current output.json, along with their corresponding hash values.

    packages_with_miss=$(jq -c '[.tasks[]
      | select(.cache.status == "MISS")
      | {package: .package, hash: .hash}]' output.json)
    
  2. Load Previous Hashes:
    Check for the existence of .turbo/output.prev.json and extract the previous package hashes if available.

    if [ -f ".turbo/output.prev.json" ]; then
      prev_hashes=$(jq -c '[.tasks[] | {package: .package, hash: .hash}]' .turbo/output.prev.json)
    else
      prev_hashes='[]'
    fi
    
  3. Delete Stale Cache Files:
    Iterate over each package with a cache MISS. For each package, if a previous hash is found, delete all cache files in the .turbo/cache directory that start with that hash.

    echo "$packages_with_miss" | jq -c '.[]' | while read -r package_info; do
      package=$(echo "$package_info" | jq -r '.package')
      prev_hash=$(echo "$prev_hashes" | jq -r --arg pkg "$package" '.[] | select(.package == $pkg) | .hash')
    
      if [ -n "$prev_hash" ]; then
        echo "Invalidating cache for package: $package (previous hash: $prev_hash)"
        find .turbo/cache -type f -name "${prev_hash}*" -delete
      else
        echo "No previous hash found for package: $package"
      fi
    done
    

By purging these stale caches, we ensure that Turborepo is forced to rebuild packages where the source has changed—even if an outdated cache is still present. This step is crucial for maintaining build accuracy and reliability.
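
One supporting detail: this comparison only works if each run leaves its own dry-run output behind as the next run’s output.prev.json. A minimal sketch of that rotation, assuming it runs after the builds succeed so the snapshot lands in the cached .turbo directory:

    - name: Rotate Dry-Run Output for the Next Run
      run: |
        # Snapshot this run's build plan; the next run's invalidation
        # step compares its hashes against this file.
        mkdir -p .turbo
        cp output.json .turbo/output.prev.json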


Wrapping Up: Key Outcomes and Lessons Learned

Our refined approach has significantly reduced build times and cost by ensuring that only the changed packages trigger Docker image builds. Here’s a quick recap of what we achieved:

  • Selective Matrix Builds: By parsing the output JSON from a Turbo dry-run, we dynamically generated a build matrix that only includes packages with a cache MISS. This eliminates unnecessary Docker builds, saving time and cost.
  • Robust Cache Invalidation: Implementing a cache invalidation mechanism ensures that stale caches don’t cause erroneous build skips. If a package’s code changes (or reverts), the stale cache is purged, triggering a rebuild.
  • Efficient Post-Build Cache Update: The final cache update step only runs if all Docker image builds succeed, ensuring that our global cache reflects only fully verified builds.

These changes have not only improved the performance and cost-efficiency of our CI/CD pipeline but also provided valuable insights into the nuanced interplay between build systems, caching, and dynamic workflows.

By adopting these strategies, you too can optimize your monorepo pipelines, reduce build times, and lower operational costs.
