mkref

Table of contents

  1. Introduction
    1. Key Use Cases
  2. Input and Output
    1. Input
    2. Output
  3. Usage Example
    1. Complete workflow: Population-level annotation
  4. Parameters
    1. Required Parameters
    2. Parameter Details
  5. Best Practices
    1. Quality Control
    2. Population Analysis
    3. Iterative Refinement
  6. Common Issues
    1. Empty output file
    2. Duplicate motifs
  7. See Also

Introduction

The mkref command in VAMPIRE generates a motif database in FASTA format from the motif file produced by the anno command. The database can then be reused in downstream analyses that require a curated set of motifs, such as re-annotating samples against a fixed motif set.

Key Use Cases

  • Population-level analysis: Create a consensus motif set from multiple samples
  • Batch annotation: Use the same motif database across multiple datasets
  • Motif refinement: Iteratively improve annotation quality
  • Reference building: Create custom motif databases for specific organisms

Input and Output

Input

Input    Format    Description                     Default
Motif    TSV       Motif file generated by anno    None

Output

Output            Format    Description                        Default
Motif Database    FASTA     Curated motif database in FASTA    None

Input file format ({prefix}.motif.tsv):

id	motif	rep_num	label
0	GGC	150	alpha
1	GGT	45	alpha

Output file format (FASTA):

>0
GGC
>1
GGT

Usage Example

Generate a motif database from an annotation file:

vampire mkref annotation_prefix motif_database.fa

Complete workflow: Population-level annotation

A common workflow for analyzing multiple samples from the same population:

# Step 1: Annotate each sample de novo
vampire anno sample1.fa sample1_anno
vampire anno sample2.fa sample2_anno
vampire anno sample3.fa sample3_anno

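# Step 2: Merge motif files (keep the header row once; mkref reads {prefix}.motif.tsv)
awk 'FNR == 1 && NR > 1 { next } { print }' sample1_anno.motif.tsv sample2_anno.motif.tsv sample3_anno.motif.tsv > all_motifs.motif.tsv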

# Step 3: Create reference database
vampire mkref all_motifs population_motif_database.fa

# Step 4: Re-annotate all samples with unified motif set
vampire anno --no-denovo -m population_motif_database.fa sample1.fa sample1_curated
vampire anno --no-denovo -m population_motif_database.fa sample2.fa sample2_curated
vampire anno --no-denovo -m population_motif_database.fa sample3.fa sample3_curated

Parameters

Required Parameters

Parameter    Type      Description                                            Default
prefix       string    Input prefix from anno command (reads .motif.tsv)      Required
output       string    Output FASTA file path for reference motif database    Required

Parameter Details

The command reads the .motif.tsv file generated by anno and extracts each unique motif sequence. Each motif is written to the output FASTA file with its ID as the header.
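
The conversion described above can be approximated with standard tools, which is handy for sanity-checking an mkref output. The following is a minimal sketch, not VAMPIRE's implementation: it assumes the motif file has a single header row, that the motif sequence is in column 2, and that the ID in column 1 becomes the FASTA header. The file names under the temporary directory are hypothetical.

```shell
# Hypothetical stand-in for a real {prefix}.motif.tsv (same columns as the example in this page)
tmpdir=$(mktemp -d)
printf 'id\tmotif\trep_num\tlabel\n0\tGGC\t150\talpha\n1\tGGT\t45\talpha\n' > "$tmpdir/demo.motif.tsv"

# Skip the header row, then emit one FASTA record per motif: ">id" followed by the sequence.
# A motif sequence that was already seen is skipped (assumed reading of "each unique motif").
awk -F'\t' 'NR > 1 && !seen[$2]++ { print ">" $1; print $2 }' \
    "$tmpdir/demo.motif.tsv" > "$tmpdir/demo_expected.fa"

cat "$tmpdir/demo_expected.fa"
```

Comparing this file with the database written by mkref (for example with diff) gives a quick check that no motif was dropped unexpectedly.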


Best Practices

Quality Control

Before creating the final reference database:

  1. Review motif statistics: Check rep_num in the motif file to ensure sufficient evidence
  2. Filter low-frequency motifs: Remove motifs with very low counts
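  3. Manual curation: Examine motif sequences for artifacts

# Example: Keep only motifs with >= 10 occurrences (the header row is kept)
# The output is named {prefix}.motif.tsv so that mkref can find it by prefix
awk -F'\t' 'NR == 1 || $3 >= 10' annotation.motif.tsv > filtered_motifs.motif.tsv
vampire mkref filtered_motifs curated_database.fa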
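
Before settling on a cutoff, it can help to see how many motifs would survive at several candidate thresholds. The sketch below is one way to do that with awk; the helper name count_kept, the candidate thresholds, and the third row of the sample table are illustrative assumptions, not part of VAMPIRE.

```shell
tmpdir=$(mktemp -d)
# Hypothetical motif table; the GGA row is invented to show a low-count motif
printf 'id\tmotif\trep_num\tlabel\n0\tGGC\t150\talpha\n1\tGGT\t45\talpha\n2\tGGA\t3\talpha\n' \
    > "$tmpdir/annotation.motif.tsv"

# count_kept FILE THRESHOLD: number of motif rows with rep_num >= THRESHOLD (header skipped)
count_kept() {
    awk -F'\t' -v min="$2" 'NR > 1 && $3 + 0 >= min + 0 { n++ } END { print n + 0 }' "$1"
}

# Report how many motifs remain at each candidate cutoff
for threshold in 1 10 50 100; do
    printf 'rep_num >= %s: %s motifs kept\n' "$threshold" \
        "$(count_kept "$tmpdir/annotation.motif.tsv" "$threshold")"
done
```

A sharp drop between two adjacent thresholds suggests where low-evidence motifs start to dominate, though the right cutoff depends on coverage and dataset size.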

Population Analysis

When working with multiple samples:

  1. Annotate all samples in de novo mode first
  2. Merge the motif tables from all samples
  3. Deduplicate motifs across samples
  4. Create the reference database from the merged set
  5. Re-annotate all samples with the unified database

This ensures consistent motif labeling across the population.
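
Steps 2 and 3 of the list above could be sketched with awk as follows. This is an illustration under stated assumptions rather than a documented VAMPIRE procedure: it sums rep_num for motifs shared between samples, keeps the first label seen, and assigns fresh sequential IDs so that IDs from different samples cannot collide. Whether renumbered IDs suit every downstream step is not confirmed here, and the sample tables are hypothetical.

```shell
tmpdir=$(mktemp -d)
# Hypothetical per-sample tables; the same motif (GGC) has a different id in each sample
printf 'id\tmotif\trep_num\tlabel\n0\tGGC\t150\talpha\n1\tGGT\t45\talpha\n' > "$tmpdir/sample1_anno.motif.tsv"
printf 'id\tmotif\trep_num\tlabel\n0\tGGA\t20\talpha\n1\tGGC\t90\talpha\n' > "$tmpdir/sample2_anno.motif.tsv"

# Merge: print the header once, sum rep_num per motif sequence,
# keep the first label seen, and assign new ids in first-seen order.
awk -F'\t' -v OFS='\t' '
    FNR == 1 { if (NR == 1) print; next }                 # header: print once, skip repeats
    !($2 in total) { order[++n] = $2; label[$2] = $4 }    # first time this motif is seen
    { total[$2] += $3 }                                   # accumulate counts across samples
    END {
        for (i = 1; i <= n; i++) {
            m = order[i]
            print i - 1, m, total[m], label[m]
        }
    }
' "$tmpdir/sample1_anno.motif.tsv" "$tmpdir/sample2_anno.motif.tsv" > "$tmpdir/all_motifs.motif.tsv"

cat "$tmpdir/all_motifs.motif.tsv"
```

The merged file follows the {prefix}.motif.tsv naming, so its prefix (here all_motifs, inside the temporary directory) can be passed to mkref.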

Iterative Refinement

For difficult datasets:

# Round 1: De novo annotation
vampire anno data.fa round1
vampire mkref round1 reference1.fa

# Round 2: Annotation with reference
vampire anno --no-denovo -m reference1.fa data.fa round2

# Round 3: Refine and create new reference
vampire refine round2 refine_actions.tsv -o round2_refined
vampire mkref round2_refined reference2.fa

# Round 4: Final annotation
vampire anno --no-denovo -m reference2.fa data.fa final

Common Issues

Empty output file

Cause: No motifs detected in the input annotation file.

Solution: Check the input annotation file:

head annotation.motif.tsv

Ensure it contains valid motif entries with id, motif, and rep_num columns.
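
For a more systematic check than head, a small helper can confirm that the expected columns are present and count the motif rows. The function name check_motif_tsv is hypothetical, and the sketch assumes a tab-separated file with a header row and the motif sequence in column 2.

```shell
# check_motif_tsv FILE: print the number of motif rows; exit status 1 if the
# header lacks id, motif, or rep_num, or if there are no motif rows at all.
check_motif_tsv() {
    awk -F'\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = 1
            if (!("id" in col) || !("motif" in col) || !("rep_num" in col)) {
                print "missing one of: id, motif, rep_num"
                bad = 1
            }
            next
        }
        NF >= 3 && $2 != "" { rows++ }
        END {
            printf "%d motif rows\n", rows
            exit (bad || rows == 0)
        }
    ' "$1"
}

tmpdir=$(mktemp -d)
# Hypothetical inputs: one usable table and one containing only the header
printf 'id\tmotif\trep_num\tlabel\n0\tGGC\t150\talpha\n1\tGGT\t45\talpha\n' > "$tmpdir/good.motif.tsv"
printf 'id\tmotif\trep_num\tlabel\n' > "$tmpdir/header_only.motif.tsv"

check_motif_tsv "$tmpdir/good.motif.tsv"
check_motif_tsv "$tmpdir/header_only.motif.tsv" || echo "no usable motif rows"
```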

Duplicate motifs

Cause: Multiple samples have the same motif with different IDs.

Solution: Deduplicate before creating the database:

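# Remove duplicate motif sequences (column 2), keeping the first occurrence;
# comparing whole lines would miss the same motif listed under different IDs
awk -F'\t' '!seen[$2]++' combined_motifs.tsv > unique_motifs.motif.tsv
vampire mkref unique_motifs final_database.fa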
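
After building the database, it may be worth confirming that no header or sequence appears twice, since IDs taken from different samples can overlap. The check below is a sketch using grep, sort, and uniq; it assumes one sequence line per FASTA record, and the demo file is hypothetical.

```shell
tmpdir=$(mktemp -d)
# Hypothetical FASTA with a repeated header (>0) and a repeated sequence (GGC)
printf '>0\nGGC\n>1\nGGT\n>0\nGGA\n>2\nGGC\n' > "$tmpdir/demo.fa"

# Headers that occur more than once
dup_ids=$(grep '^>' "$tmpdir/demo.fa" | sort | uniq -d)
# Sequences that occur more than once (assumes one sequence line per record)
dup_seqs=$(grep -v '^>' "$tmpdir/demo.fa" | sort | uniq -d)

echo "duplicate headers: ${dup_ids:-none}"
echo "duplicate sequences: ${dup_seqs:-none}"
```

If either list is non-empty, deduplicating or renumbering the motif table before running mkref again is the safer route.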

See Also

  • Parameters - Full parameter reference
  • anno - Generate annotation for mkref
  • refine - Manually curate motifs before creating database