Large-scale project management
How to manage Git LFS repositories with thousands of files.
1. Context and Problem Statement
In large projects, it’s common for a Git repository to track thousands to hundreds of thousands of files via Git LFS. Typical use cases:
- A research study with many samples (VCFs, BAMs, images, etc.)
- A data lake-ish repo where each commit adds more LFS pointers
- Monorepos that aggregate multiple datasets or experiments
In these cases, standard Git LFS introspection commands become painfully slow. A concrete example:
git lfs ls-files --json
On a repo with thousands of LFS pointers, this can take several minutes. That’s a non-starter for:
- Interactive CLI tools
- Editor/IDE integrations
- CI/CD steps that run frequently
This note describes architectural patterns to avoid global enumeration and keep operations fast and predictable as your LFS population grows.
2. Why git lfs ls-files is Slow in Large Repos
Conceptually, git lfs ls-files must:
- Walk the Git index / working tree to identify LFS-tracked files.
- For each file, resolve and hydrate metadata (pointer, OID, size, etc.).
- Optionally serialize to JSON.
Even if the LFS objects are local, this is O(N) over every matching file visible to the command. When N = 10,000+, you’re essentially asking Git + Git LFS to do a full scan and re-derive information that:
- Doesn’t change very often, and
- Could be cached or maintained elsewhere.
From an architecture perspective, the problem is:
We’re using git lfs ls-files as a query engine and index, when it’s really just a dumb enumerator over the current state.
3. Design Goals
For a repository with many LFS objects, we want:
- Predictable latency: operations that touch “all LFS files” should be rare and explicit; routine commands should be sub-second, even as the repo grows.
- Incremental updates: avoid full scans of N files when only a handful are new or changed.
- Subset operations by default: most tasks only need a subset (by path, tag, type, or commit range), not the full universe.
- Separation of metadata from Git internals: use Git (and Git LFS) as the transport and integrity layer, not as a full-featured metadata store.
4. Core Architectural Pattern: External LFS Metadata Index
Instead of deriving everything on demand from git lfs ls-files, maintain a separate index of LFS metadata that is:
- Versioned alongside the repo (e.g., tracked TSV/JSON),
- Derived incrementally from Git/LFS events, and
- Fast to query (path lookup, OID lookup, tags, etc.).
4.1. Example: META/lfs_index.tsv
A simple pattern:
- Maintain a tracked file such as META/lfs_index.tsv with columns like:
path oid_sha256 size tags logical_id
data/a.bam 1a2b3c... 12345 tumor sample:XYZ
data/b.bam 4d5e6f... 67890 normal sample:ABC
- This TSV becomes your primary, fast, queryable index, not git lfs ls-files.
Pros:
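- Fast lookup by path: a single grep / awk pass over one small text file, or a constant-time lookup once the rows are loaded into a Python dict or an indexed SQL table.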
- Easy to join with other metadata tables (specimens, assays, etc.).
- Can be regenerated in a controlled, explicit operation (like make rebuild-index).
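As a minimal sketch of what "fast to query" can look like, the snippet below loads the index once and builds two in-memory lookups, one by path and one by OID. The function name load_index and the sample rows are illustrative assumptions; the column names are the ones shown in the example table above.

```python
import csv
import io


def load_index(tsv_file):
    """Read the index TSV and return (by_path, by_oid) dictionaries."""
    by_path = {}
    by_oid = {}
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        by_path[row["path"]] = row
        # Identical content can live at several paths, so keep a list per OID.
        by_oid.setdefault(row["oid_sha256"], []).append(row["path"])
    return by_path, by_oid


# Illustrative rows only; the OIDs are shortened placeholders.
SAMPLE = (
    "path\toid_sha256\tsize\ttags\tlogical_id\n"
    "data/a.bam\t1a2b3c\t12345\ttumor\tsample:XYZ\n"
    "data/b.bam\t4d5e6f\t67890\tnormal\tsample:ABC\n"
)

by_path, by_oid = load_index(io.StringIO(SAMPLE))
print(by_path["data/a.bam"]["size"])  # 12345
print(by_oid["4d5e6f"])               # ['data/b.bam']
```

In a real tool you would pass open("META/lfs_index.tsv", newline="") instead of the in-memory sample. Loading is still a linear pass over the file, but it touches one small text file rather than the whole working tree.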
4.2. How to Keep It Up-to-Date
You don’t want manual edits. Use automation on “add” paths:
- Use a pre-commit hook:
  - For newly staged LFS pointer files, update the index before commit.
This shifts expensive work into the write path where it is amortized and expected, and keeps the read path (queries) fast.
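The core of such a hook could look roughly like the sketch below: parse the staged pointer text and turn it into one index row. The pointer layout (a version line, an oid sha256:... line, and a size line) follows the Git LFS pointer format; the helper names parse_lfs_pointer and index_row are hypothetical. How the hook obtains the text is also an assumption, for example by listing staged files with git diff --cached --name-only and reading each staged blob with git show :<path>.

```python
def parse_lfs_pointer(text):
    """Return (oid_sha256, size) for LFS pointer text, or None if it is not a pointer."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" ")
        if sep:
            fields[key] = value
    if not fields.get("version", "").startswith("https://git-lfs.github.com/spec/"):
        return None
    oid = fields.get("oid", "")
    size = fields.get("size", "")
    if not oid.startswith("sha256:") or not size.isdigit():
        return None
    return oid[len("sha256:"):], int(size)


def index_row(path, pointer_text, tags="", logical_id=""):
    """Build one TSV row for the index, or None for non-LFS content."""
    parsed = parse_lfs_pointer(pointer_text)
    if parsed is None:
        return None
    oid, size = parsed
    return "\t".join([path, oid, str(size), tags, logical_id])


# Placeholder OID (64 hex characters) purely for illustration.
pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:" + "ab" * 32 + "\n"
    "size 12345\n"
)
print(index_row("data/a.bam", pointer, tags="tumor", logical_id="sample:XYZ"))
```

Because the hook only looks at files staged for the current commit, its cost scales with the size of the change, not with the size of the repository.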
5. Avoiding git lfs ls-files in Common Operations
5.1. Don’t use ls-files as your data plane
Refactor any tools that currently:
git lfs ls-files --json | jq ...
to instead read from your external index (TSV/JSON/SQLite). For example:
# Old, slow:
git lfs ls-files --json | jq '.files[] | select(.name | test("\\.vcf$"))'
# New, fast:
awk -F'\t' '$1 ~ /\.vcf$/ {print $0}' META/lfs_index.tsv
or in Python:
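import csv

# Stream the index row by row and act only on compressed VCF entries.
with open("META/lfs_index.tsv", newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        if row["path"].endswith(".vcf.gz"):
            ...  # handle the matching row here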
5.2. Use ls-files only for rare “rebuild index” operations
When you first introduce the index, you may need a one-time or occasional rebuild:
git lfs ls-files --all --json > /tmp/lfs_files.json
# transform into META/lfs_index.tsv
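This can take minutes in huge repos, and that’s fine, as long as it is rare and documented as a heavy operation (like npm install, docker build, etc.). Note that --all inspects the full history rather than the current checkout, so the output also includes objects that are no longer present in the current tree; drop the flag if the index should reflect only HEAD.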
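The "transform" step can be a small script along these lines. It assumes the JSON is an object with a top-level "files" array whose entries carry "name", "oid", and "size" keys; check this against the output of your Git LFS version before relying on it. The function name ls_files_json_to_tsv is hypothetical, and the tags and logical_id columns are left empty because ls-files knows nothing about them.

```python
import csv
import io
import json

COLUMNS = ["path", "oid_sha256", "size", "tags", "logical_id"]


def ls_files_json_to_tsv(json_text, out):
    """Write index rows (sorted by path) for every entry in the ls-files JSON."""
    data = json.loads(json_text)
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for entry in sorted(data["files"], key=lambda e: e["name"]):
        # tags and logical_id are filled in later from other metadata sources.
        writer.writerow([entry["name"], entry["oid"], entry["size"], "", ""])


# Illustrative input with shortened placeholder OIDs.
sample = json.dumps({
    "files": [
        {"name": "data/b.bam", "oid": "4d5e6f", "size": 67890},
        {"name": "data/a.bam", "oid": "1a2b3c", "size": 12345},
    ]
})
buf = io.StringIO()
ls_files_json_to_tsv(sample, buf)
print(buf.getvalue(), end="")
```

Sorting by path keeps the committed TSV stable across rebuilds, which makes diffs of the index itself easy to review.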
6. Subset-First Design: Operate on Paths, Tags, or Commits
If you must derive state from Git directly, design your commands to start with a subset, not the full repo.
6.1. Path-based subsets
For example, instead of:
# Scans entire repo
git lfs ls-files --json
use:
# Only data under a project or cohort
git lfs ls-files --include "data/StudyX/**" --json
and structure your tooling around the concept of project subtrees (data/studyA/, data/studyB/, etc.) so most operations are scoped.
6.2. Commit-range subsets
For incremental workflows (ETL, indexing, sync), use git to find changed files:
git diff --name-only <old-commit> <new-commit> \
  | git check-attr --stdin filter \
  | awk -F': ' '$3 == "lfs" {print $1}'  # check-attr prints "<path>: filter: <value>"
Then only examine LFS metadata for changed files, merging that into your external index.
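If the incremental step lives in a Python tool rather than a shell pipeline, the same filtering can be done on the text that git check-attr --stdin filter prints, one line per path in the form "<path>: filter: <value>". The sketch below is a pure function over that text; the name lfs_paths_from_check_attr is hypothetical, and it assumes paths are printed unquoted.

```python
def lfs_paths_from_check_attr(output):
    """Return the paths whose filter attribute is 'lfs'."""
    paths = []
    for line in output.splitlines():
        # Split from the right so a path that itself contains ': ' stays intact.
        path, sep, value = line.rpartition(": filter: ")
        if sep and value == "lfs":
            paths.append(path)
    return paths


# Illustrative check-attr output for three changed files.
sample_output = (
    "data/a.bam: filter: lfs\n"
    "README.md: filter: unspecified\n"
    "data/studyA/b.vcf.gz: filter: lfs\n"
)
print(lfs_paths_from_check_attr(sample_output))
# ['data/a.bam', 'data/studyA/b.vcf.gz']
```

Files deleted between the two commits also appear in the diff, so the merge step should remove their rows from the index rather than try to read a pointer for them.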
7. Caching and Incremental Computation
If you really want a “git lfs ls-files --json-like view,” you can implement your own cached snapshot:
- Keep a file like .cache/lfs_snapshot.json keyed by commit hash (HEAD).
- On invocation:
  - If HEAD has not changed, just read the cache.
  - If HEAD changed, compute the diff from the last snapshot and patch the cached JSON.
This means you only pay full-scan costs when the diff is large, and usually pay a small, incremental cost.
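The cache logic above can be sketched as follows. The two callables, full_scan and apply_diff, are hypothetical stand-ins for "enumerate everything" and "patch the snapshot from a commit-range diff"; the demo uses fakes so that it runs without a repository, and the on-disk layout ({"head": ..., "files": ...}) is an assumption.

```python
import json
import os
import tempfile


def load_snapshot(cache_path, head, full_scan, apply_diff):
    """Return {path: metadata} for `head`, reusing the cached snapshot when possible."""
    cached = None
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cached = json.load(f)
    if cached and cached["head"] == head:
        return cached["files"]  # HEAD unchanged: read the cache, no Git calls
    if cached:
        files = apply_diff(cached["files"], cached["head"], head)  # incremental patch
    else:
        files = full_scan(head)  # no cache yet: pay the full cost once
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "w") as f:
        json.dump({"head": head, "files": files}, f)
    return files


calls = []


def fake_full_scan(head):
    calls.append("full")
    return {"data/a.bam": {"oid": "1a2b3c", "size": 12345}}


def fake_apply_diff(files, old_head, new_head):
    calls.append("diff")
    updated = dict(files)
    updated["data/b.bam"] = {"oid": "4d5e6f", "size": 67890}
    return updated


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, ".cache", "lfs_snapshot.json")
    load_snapshot(cache, "c1", fake_full_scan, fake_apply_diff)  # full scan
    load_snapshot(cache, "c1", fake_full_scan, fake_apply_diff)  # cache hit
    snapshot = load_snapshot(cache, "c2", fake_full_scan, fake_apply_diff)  # patched

print(calls)             # ['full', 'diff']
print(sorted(snapshot))  # ['data/a.bam', 'data/b.bam']
```

A real apply_diff would combine the commit-range technique from section 6.2 with pointer parsing, and should fall back to a full scan if the cached commit is no longer reachable.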
8. CI/CD Considerations
In CI, naive patterns like:
- run: git lfs ls-files --json | jq ...
will slow your builds significantly once the LFS population grows.
Better patterns:
- For linting or validation:
  - Operate on META/*.tsv and cross-check with a small sample of pointers.
- For publishing or sync steps:
  - Use git diff between the last deployed commit and the current one to identify only the LFS files that changed.
- For health checks:
  - Schedule a periodic “heavy” job (nightly or weekly) that runs git lfs ls-files to verify repo consistency, rather than doing it on every push.
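For the linting case, a check that reads only the index might look like the sketch below. It flags duplicate paths, OIDs that are not 64 lowercase hex characters, and non-numeric sizes. The function name lint_index and the exact rules are assumptions; cross-checking a sample of pointers against the working tree is left out.

```python
import csv
import io
import re

OID_RE = re.compile(r"[0-9a-f]{64}")


def lint_index(tsv_file):
    """Return a list of human-readable problems found in the index TSV."""
    problems = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for lineno, row in enumerate(csv.DictReader(tsv_file, delimiter="\t"), start=2):
        path = row["path"]
        if path in seen:
            problems.append(f"line {lineno}: duplicate path {path}")
        seen.add(path)
        if not OID_RE.fullmatch(row["oid_sha256"] or ""):
            problems.append(f"line {lineno}: malformed oid for {path}")
        if not (row["size"] or "").isdigit():
            problems.append(f"line {lineno}: non-numeric size for {path}")
    return problems


# Illustrative index: one valid row followed by one row with three problems.
header = "path\toid_sha256\tsize\ttags\tlogical_id\n"
good_row = "data/a.bam\t" + "a" * 64 + "\t12345\ttumor\tsample:XYZ\n"
bad_row = "data/a.bam\tnothex\tabc\t\t\n"

for problem in lint_index(io.StringIO(header + good_row + bad_row)):
    print(problem)
```

A CI step can fail the build when the returned list is non-empty, which keeps per-push validation proportional to the size of the index file rather than the number of LFS objects on disk.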
9. Git + LFS as Transport, Not Primary Index
The underlying architectural theme:
- Git is an excellent tool for content addressing, branching, merging, and history.
- Git LFS is an excellent tool for large object transport and storage.
Neither is optimized as a high-level metadata query system for tens of thousands of objects.
So:
- Let Git/LFS handle integrity and distribution.
- Let a simple, explicit index (TSV/JSON/SQLite, or an external service like Indexd) handle queries, tags, and relationships.
You can always rebuild your index from Git LFS if needed, but you shouldn’t be doing that implicitly on every command.
10. Practical Recommendations / Checklist
When you notice git lfs ls-files --json taking minutes:
- Audit your tools
  - Search for any use of git lfs ls-files in scripts, CI configs, and CLIs.
  - Replace them with operations over an external index.
- Introduce a canonical LFS index
  - Add META/lfs_index.tsv (or similar) to the repo.
  - Define columns: path, oid_sha256, size, tags, logical_id, etc.
  - Commit it and treat it as the primary query surface.
- Automate index maintenance
  - Add a wrapper command or pre-commit hook that updates the index on git add.
  - Provide a “heavy” rebuild-lfs-index command that users run explicitly when necessary.
- Scope operations by default
  - Design new commands to accept --path, --tag, --study, or --since <commit> flags.
  - Document that global “scan everything” commands are expensive and should be infrequent.
- Use CI wisely
  - Only operate on changed LFS files between commits.
  - Reserve full LFS integrity checks for scheduled jobs, not every PR.
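One way to make "scope operations by default" concrete is to have query commands refuse to run unscoped unless the user opts in. The sketch below does this with argparse over rows already loaded from the index. The program name lfs-index-query, the --all opt-in flag, and the comma-separated tags convention are all assumptions made for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="lfs-index-query")
    parser.add_argument("--path", default="", help="only rows whose path starts with this prefix")
    parser.add_argument("--tag", default=None, help="only rows carrying this tag")
    parser.add_argument("--all", action="store_true", help="explicitly allow an unscoped query")
    return parser


def select_rows(rows, args):
    """Return the rows matching the scope flags; refuse an accidental global scan."""
    if not (args.path or args.tag or args.all):
        raise SystemExit("refusing to scan everything; pass --path, --tag, or --all")
    selected = []
    for row in rows:
        if not row["path"].startswith(args.path):
            continue
        # Assumes the tags column holds a comma-separated list.
        if args.tag is not None and args.tag not in row["tags"].split(","):
            continue
        selected.append(row)
    return selected


# Illustrative rows, shaped like csv.DictReader output for the index.
rows = [
    {"path": "data/studyA/a.bam", "tags": "tumor"},
    {"path": "data/studyB/b.bam", "tags": "normal,pilot"},
]
args = build_parser().parse_args(["--path", "data/studyA/"])
print([row["path"] for row in select_rows(rows, args)])  # ['data/studyA/a.bam']
```

Requiring an explicit opt-in for the global case documents the cost in the interface itself, so an expensive scan cannot happen by omission.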