Large scale project management¶

How to manage a Git LFS Repositories with Thousands of Files.

1. Context and Problem Statement¶

In large projects, it’s common for a Git repository to track thousands to hundreds of thousands of files via Git LFS. Typical use cases:

A research study with many samples (VCFs, BAMs, images, etc.)
A data lake-ish repo where each commit adds more LFS pointers
Monorepos that aggregate multiple datasets or experiments

In these cases, standard Git LFS introspection commands become painfully slow. A concrete example:

git lfs ls-files --json

On a repo with thousands of LFS pointers, this can take several minutes. That’s a non-starter for:

Interactive CLI tools
Editor/IDE integrations
CI/CD steps that run frequently

This note describes architectural patterns to avoid global enumeration and keep operations fast and predictable as your LFS population grows.

2. Why `git lfs ls-files` is Slow in Large Repos¶

Conceptually, git lfs ls-files must:

Walk the Git index / working tree to identify LFS-tracked files.
For each file, resolve and hydrate metadata (pointer, OID, size, etc.).
Optionally serialize to JSON.

Even if the LFS objects are local, this is O(N) over every matching file visible to the command. When N = 10,000+, you’re essentially asking Git + Git LFS to do a full scan and re-derive information that:

Doesn’t change very often, and
Could be cached or maintained elsewhere.

From an architecture perspective, the problem is:

We’re using git lfs ls-files as a query engine and index, when it’s really just a dumb enumerator over the current state.

3. Design Goals¶

For a repository with many LFS objects, we want:

Predictable latency Operations that touch “all LFS files” should be rare and explicit; routine commands should be sub-second, even as the repo grows.
Incremental updates Avoid full scans of N files when only a handful are new or changed.
Subset operations by default Most tasks only need a subset (by path, tag, type, or commit range), not the full universe.
Separation of metadata from Git internals Use Git (and Git LFS) as the transport and integrity layer, not as a full-featured metadata store.

4. Core Architectural Pattern: External LFS Metadata Index¶

Instead of deriving everything on demand from git lfs ls-files, maintain a separate index of LFS metadata that is:

Versioned alongside the repo (e.g., tracked TSV/JSON),
Derived incrementally from Git/LFS events, and
Fast to query (path lookup, OID lookup, tags, etc.).

4.1. Example: `META/lfs_index.tsv`¶

A simple pattern:

Maintain a tracked file such as META/lfs_index.tsv with columns like:

path    oid_sha256                             size    tags    logical_id
data/a.bam  1a2b3c...                          12345   tumor   sample:XYZ
data/b.bam  4d5e6f...                          67890   normal  sample:ABC

This TSV becomes your primary, fast, queryable index, not git lfs ls-files.

Pros:

Constant-time query by path via grep / awk / Python / SQL.
Easy to join with other metadata tables (specimens, assays, etc.).
Can be regenerated in a controlled, explicit operation (like make rebuild-index).

4.2. How to Keep It Up-to-Date¶

You don’t want manual edits. Use automation on “add” paths:

use a pre-commit hook:
For newly staged LFS pointer files, update the index before commit.

This shifts expensive work into the write path where it is amortized and expected, and keeps the read path (queries) fast.

5. Avoiding `git lfs ls-files` in Common Operations¶

5.1. Don’t use `ls-files` as your data plane¶

Refactor any tools that currently:

git lfs ls-files --json | jq ...

to instead read from your external index (TSV/JSON/SQLite). For example:

# Old, slow:
git lfs ls-files --json | jq '.[] | select(.name|test("VCF$"))'

# New, fast:
awk -F'\t' '$1 ~ /\.vcf$/ {print $0}' META/lfs_index.tsv

or in Python:

import csv

with open("META/lfs_index.tsv") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        if row["path"].endswith(".vcf.gz"):
            ...

5.2. Use `ls-files` only for rare “rebuild index” operations¶

When you first introduce the index, you may need a one-time or occasional rebuild:

git lfs ls-files --all --json > /tmp/lfs_files.json
# transform into META/lfs_index.tsv

This can take minutes in huge repos—and that’s fine, as long as it is rare and documented as a heavy operation (like npm install, docker build, etc.).

6. Subset-First Design: Operate on Paths, Tags, or Commits¶

If you must derive state from Git directly, design your commands to start with a subset, not the full repo.

6.1. Path-based subsets¶

For example, instead of:

# Scans entire repo
git lfs ls-files --json

use:

# Only data under a project or cohort
git lfs ls-files --include "data/StudyX/**" --json

and structure your tooling around the concept of project subtrees (data/studyA/, data/studyB/, etc.) so most operations are scoped.

6.2. Commit-range subsets¶

For incremental workflows (ETL, indexing, sync), use git to find changed files:

git diff --name-only <old-commit> <new-commit> \
  | git check-attr --stdin filter \
  | awk '$2 == "lfs"' # or similar

Then only examine LFS metadata for changed files, merging that into your external index.

7. Caching and Incremental Computation¶

If you really want a “git lfs ls-files --json-like view,” you can implement your own cached snapshot:

Keep a file like .cache/lfs_snapshot.json keyed by commit hash (HEAD).
On invocation:
If HEAD has not changed, just read the cache.
If HEAD changed, compute the diff from the last snapshot and patch the cached JSON.

This means you only pay full-scan costs when the diff is large, and usually pay a small, incremental cost.

8. CI/CD Considerations¶

In CI, naive patterns like:

- run: git lfs ls-files --json | jq ...

will slow your builds significantly once the LFS population grows.

Better patterns:

For linting or validation:
Operate on META/*.tsv and cross-check with a small sample of pointers.
For publishing or sync steps:
Use git diff between the last deployed commit and current one to identify only the LFS files that changed.
For health checks:
Schedule a periodic “heavy” job (nightly or weekly) that runs git lfs ls-files to verify repo consistency, rather than doing it on every push.

9. Git + LFS as Transport, Not Primary Index¶

The underlying architectural theme:

Git is an excellent tool for content addressing, branching, merging, and history.
Git LFS is an excellent tool for large object transport and storage.

Neither is optimized as a high-level metadata query system for tens of thousands of objects.

So:

Let Git/LFS handle integrity and distribution.
Let a simple, explicit index (TSV/JSON/SQLite, or an external service like Indexd) handle queries, tags, and relationships.

You can always rebuild your index from Git LFS if needed, but you shouldn’t be doing that implicitly on every command.

10. Practical Recommendations / Checklist¶

When you notice git lfs ls-files --json taking minutes:

Audit your tools
- Search for any use of git lfs ls-files in scripts, CI configs, and CLIs.
- Replace them with operations over an external index.
Introduce a canonical LFS index
- Add META/lfs_index.tsv (or similar) to the repo.
- Define columns: path, oid_sha256, size, tags, logical_id, etc.
- Commit it and treat it as the primary query surface.
Automate index maintenance
- Add a wrapper command or pre-commit hook that updates the index on git add.
- Provide a “heavy” rebuild-lfs-index command that users run explicitly when necessary.
Scope operations by default
- Design new commands to accept --path, --tag, --study, or --since <commit> flags.
- Document that global “scan everything” commands are expensive and should be infrequent.
Use CI wisely
- Only operate on changed LFS files between commits.
- Reserve full LFS integrity checks for scheduled jobs, not every PR.