Create a Database
afquery create-db builds the AFQuery database from a manifest of single-sample VCFs. This is a one-time setup step; incremental updates use afquery update-db.
Basic Usage
For cohorts with WES/panel samples, provide the BED file directory:
afquery create-db \
--manifest manifest.tsv \
--output-dir ./db/ \
--genome-build GRCh38 \
--bed-dir ./beds/
What Happens
- Ingest phase — Each VCF is parsed with cyvcf2. Genotypes and quality fields are written to a temporary per-sample Parquet file, one row per variant per sample.
- Build phase — DuckDB reads the temporary Parquet files, aggregates per 1-Mbp bucket, and writes Roaring Bitmap Parquet files partitioned by chromosome and bucket.
- Finalize — manifest.json and metadata.sqlite are written to the output directory.
Directory Layout After Creation
./db/
├── manifest.json              # Build configuration (genome build, schema version, etc.)
├── metadata.sqlite            # Sample/phenotype/technology/changelog metadata
├── variants/                  # Parquet files partitioned by chromosome and bucket
│   ├── chr1/
│   │   ├── bucket_0.parquet   # Positions 0–999,999
│   │   ├── bucket_1.parquet   # Positions 1,000,000–1,999,999
│   │   └── ...
│   ├── chr2/
│   └── ...
└── capture/                   # Interval trees for WES technologies (pickle files)
    ├── wes_v1.pkl
    └── wes_v2.pkl
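To make the bucket partitioning concrete, here is a minimal sketch that maps a position to the Parquet file expected to hold it. The function name `bucket_path` is hypothetical and not part of AFQuery; the 1-Mbp bucket size and file naming are taken from the layout above, and the position is assumed to use the same coordinate convention as the layout comments.

```python
from pathlib import PurePosixPath

BUCKET_SIZE = 1_000_000  # 1-Mbp buckets, as in the layout above


def bucket_path(db_dir: str, chrom: str, pos: int) -> PurePosixPath:
    """Return the Parquet file expected to hold `pos` on `chrom` (illustrative only)."""
    if pos < 0:
        raise ValueError("position must be non-negative")
    bucket = pos // BUCKET_SIZE  # bucket_0 covers 0-999,999, bucket_1 the next 1 Mbp, ...
    return PurePosixPath(db_dir) / "variants" / chrom / f"bucket_{bucket}.parquet"


print(bucket_path("db", "chr1", 1_234_567))
```

This is only a way to find the file for a region by hand; queries should go through the afquery CLI rather than reading bucket files directly.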
Memory and Thread Tuning
For large cohorts, tune these options:
afquery create-db \
--manifest manifest.tsv \
--output-dir ./db/ \
--genome-build GRCh38 \
--build-threads 32 \
--build-memory 4GB
| Option | Default | Recommendation |
|---|---|---|
| --build-threads | all CPUs | Set to min(cpu_count, available_RAM_GB / 2) |
| --build-memory | 2GB | Increase for dense WGS regions or large cohorts |
| --threads | all CPUs | Controls ingest parallelism (VCF parsing) |
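The build phase uses one DuckDB process per 1-Mbp bucket, with up to --build-threads buckets processed at once and each process limited to --build-memory. With --build-threads 32 and --build-memory 4GB, peak RAM usage is therefore approximately 32 × 4 GB = 128 GB.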
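The sizing rule can be checked with a few lines of arithmetic before launching a build. The sketch below is not part of AFQuery; the function names are hypothetical, and it simply encodes the recommendation from the table (min(cpu_count, available_RAM_GB / 2)) and the threads × memory estimate of peak usage.

```python
def recommended_build_threads(cpu_count: int, available_ram_gb: float) -> int:
    """Table recommendation: min(cpu_count, available_RAM_GB / 2), at least 1."""
    return max(1, min(cpu_count, int(available_ram_gb // 2)))


def peak_build_ram_gb(build_threads: int, build_memory_gb: float) -> float:
    """Approximate peak RAM: one bucket process per thread, each capped at build_memory."""
    return build_threads * build_memory_gb


# Example from the text: 32 threads at 4 GB each.
print(peak_build_ram_gb(32, 4))
# A 64-core machine with 64 GB of RAM is limited by memory, not by cores.
print(recommended_build_threads(cpu_count=64, available_ram_gb=64))
```

If the estimate exceeds the RAM you can spare, lower --build-threads before lowering --build-memory, since dense regions may need the per-process headroom.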
Resume Behavior
If create-db is interrupted, it resumes automatically from where it left off. Individual bucket Parquet files that were already written are skipped.
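The skip-if-present idea behind resuming can be sketched as follows. This is an illustration of the general pattern, not AFQuery's actual implementation: the helper `pending_buckets` is hypothetical, and the sketch assumes that a bucket file's existence means it was written completely.

```python
import tempfile
from pathlib import Path


def pending_buckets(variants_dir: Path, chrom: str, bucket_ids: list[int]) -> list[int]:
    """Return the bucket ids whose Parquet file does not exist yet."""
    chrom_dir = variants_dir / chrom
    return [b for b in bucket_ids if not (chrom_dir / f"bucket_{b}.parquet").exists()]


# Simulate an interrupted build in which only bucket_0 was written.
with tempfile.TemporaryDirectory() as tmp:
    variants = Path(tmp) / "variants"
    (variants / "chr1").mkdir(parents=True)
    (variants / "chr1" / "bucket_0.parquet").touch()
    todo = pending_buckets(variants, "chr1", [0, 1, 2])

print(todo)  # buckets that a resumed run would still need to build
```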
To force a complete restart, rerun create-db with --force:
Warning
--force deletes all existing output in --output-dir. Use with caution.
FILTER=PASS Behavior
Only FILTER=PASS calls (or calls with no FILTER field) contribute to AC, and therefore to AF. AN is not affected — it counts every eligible sample, failed calls included. Calls that fail a filter are tracked in fail_bitmap and surfaced as N_FAIL. PASS-only counting is always enforced — there is currently no CLI option to change this behaviour.
See FILTER=PASS Tracking for details.
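A toy calculation may help show how PASS-only counting affects each number. The sketch below is not AFQuery code: the `Call` class and `summarize` function are hypothetical, all samples are assumed diploid and eligible at the position, and N_FAIL is assumed to count samples whose call failed a filter.

```python
from dataclasses import dataclass


@dataclass
class Call:
    alt_alleles: int  # ALT alleles in the genotype: 0, 1, or 2 (diploid assumption)
    passed: bool      # True for FILTER=PASS or a missing FILTER field


def summarize(calls: list[Call], ploidy: int = 2) -> dict:
    """Compute AC, AN, AF and N_FAIL under PASS-only counting (illustrative)."""
    an = ploidy * len(calls)                            # every eligible sample counts
    ac = sum(c.alt_alleles for c in calls if c.passed)  # only PASS calls add to AC
    n_fail = sum(1 for c in calls if not c.passed)      # failed calls are tracked separately
    af = ac / an if an else 0.0
    return {"AC": ac, "AN": an, "AF": af, "N_FAIL": n_fail}


calls = [Call(1, True), Call(2, True), Call(1, False), Call(0, True)]
result = summarize(calls)
print(result)
```

In this example the failed heterozygous call lowers AC from 4 to 3 while AN stays at 8, so AF drops from 0.5 to 0.375 and N_FAIL reports the one excluded call.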
Coverage-Evidence Filters
Four optional flags enable per-sample, quality-aware tracking of which positions each partially-covered technology (WES, panels) actually covered. They are fully opt-in.
| Flag | Default | Effect |
|---|---|---|
| --min-dp D | 0 | Minimum FORMAT/DP for a carrier to count as quality evidence. |
| --min-gq G | 0 | Minimum FORMAT/GQ for a carrier to count as quality evidence. |
| --min-qual Q | 0.0 | Minimum VCF QUAL field for a carrier to count as quality evidence. |
| --min-covered K | 0 | Per partially-covered tech, the position is "trusted" only if at least K of its carriers pass the quality thresholds. Non-carriers of failing positions are recorded as N_NO_COVERAGE. |
When any of these flags is non-zero, AFQuery reads FORMAT/DP, FORMAT/GQ, and QUAL from each variant call during ingest. Use the bundled resources/normalize_vcf.sh (which preserves these FORMAT fields) or ensure your own preprocessing keeps them.
Example:
afquery create-db \
--manifest samples.tsv \
--output-dir ./db/ \
--genome-build GRCh38 \
--bed-dir ./beds/ \
--min-dp 30 --min-gq 20 --min-covered 1
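The effect of these thresholds on a single position can be sketched in a few lines. This is one reading of the flag table, not AFQuery's implementation: the names `CarrierCall`, `position_trusted`, and `n_no_coverage` are hypothetical, thresholds are treated as inclusive minimums, and the calculation is done for one partially-covered technology at a time.

```python
from dataclasses import dataclass


@dataclass
class CarrierCall:
    dp: int      # FORMAT/DP
    gq: int      # FORMAT/GQ
    qual: float  # VCF QUAL


def position_trusted(carriers, min_dp=0, min_gq=0, min_qual=0.0, min_covered=0) -> bool:
    """Trusted if at least `min_covered` carriers meet every quality threshold."""
    good = sum(
        1 for c in carriers
        if c.dp >= min_dp and c.gq >= min_gq and c.qual >= min_qual
    )
    return good >= min_covered


def n_no_coverage(n_tech_samples: int, carriers, **thresholds) -> int:
    """Non-carriers of an untrusted position are counted as lacking coverage."""
    if position_trusted(carriers, **thresholds):
        return 0
    return n_tech_samples - len(carriers)


# Thresholds from the example command: --min-dp 30 --min-gq 20 --min-covered 1
flags = dict(min_dp=30, min_gq=20, min_covered=1)
strong = [CarrierCall(12, 40, 50.0), CarrierCall(35, 25, 60.0)]  # one carrier passes
weak = [CarrierCall(12, 40, 50.0)]                               # no carrier passes

print(position_trusted(strong, **flags), n_no_coverage(100, strong, **flags))
print(position_trusted(weak, **flags), n_no_coverage(100, weak, **flags))
```

With all flags left at their defaults of zero, every position is trusted, which matches the opt-in behaviour described above.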
Thresholds are fixed at creation time. update-db --add-samples reuses them
and re-applies them to every position whose partially-covered tech receives
new samples (see Update Database).
See Coverage Evidence for when to reach
for each flag, how N_NO_COVERAGE is computed, and the query-time companion
flag --min-quality-evidence.
Validating the Result
After creation, run afquery check to verify the database, then afquery info to inspect its metadata.
Example info output:
Database: ./db/
Schema version: 2.0
Genome build: GRCh38
DB version: 1.0
Samples: 1371
Technologies: wgs, wes_v1, wes_v2
Chromosomes: chr1 ... chrX chrY chrM
These commands are available at any time after database creation — not only immediately after create-db. Use afquery check to verify database integrity (manifest consistency, Parquet file health, capture index presence) and afquery info to inspect sample counts, registered technologies, and phenotype codes.
Full Option Reference
See CLI Reference → create-db.
Next Steps
- Manifest Format — TSV manifest column reference and common mistakes
- Query Allele Frequencies — run your first queries against the new database
- Performance Tuning — build thread and memory configuration for large cohorts