generator

Table of contents

  1. Introduction
  2. Input and Output
    1. Input
    2. Output
  3. Usage Examples
    1. 1. Generate a simple TR sequence
    2. 2. Generate TR with mutations
    3. 3. Generate multi-motif TR sequence
    4. 4. Reproducible generation
  4. Parameters
    1. Required Parameters
    2. Optional Parameters
    3. Parameter Details
  5. Output Format
    1. FASTA File
    2. Annotation File (with mutations)
    3. Annotation File (without mutations)
  6. Use Cases
    1. Parameter Tuning
    2. Benchmarking
    3. Multi-motif Validation
  7. See Also

Introduction

The generator command in VAMPIRE is designed to generate simulated tandem repeat sequences with user-defined motifs, length, and mutation rate. This is useful for:

  • Benchmarking: Test VAMPIRE’s annotation performance on known patterns
  • Parameter tuning: Explore effects of different settings
  • Method validation: Verify algorithms work as expected

Random seed can be set to enable reproducible results.


Input and Output

Input

Input Format Description Default
Motifs string (parameter) One motif or a list of motifs to generate the sequence None
Length integer (parameter) Length of the simulated sequence 1000
Mutation Rate float (parameter) Mutation rate for SNVs and INDELs 0
Seed integer (parameter) Random seed for reproducibility 42

Output

Output Format Description Default
Sequence FASTA Simulated tandem repeat sequence None
Annotation TSV Annotation file with mutations None
Annotation_woMut TSV Annotation file before introducing mutations None

File descriptions:

  • {prefix}.fa: Simulated TR sequences in FASTA format
  • {prefix}.anno.tsv: Annotation results including mutations (SNVs, indels)
  • {prefix}.anno_woMut.tsv: Annotation of perfect tandem repeats before mutations

Usage Examples

1. Generate a simple TR sequence

Generate a perfect tandem repeat with motif “GGC”.

vampire generator -m GGC -l 1000 -r 0 -p simple_tr

2. Generate TR with mutations

Introduce mutations at 1% rate.

vampire generator -m GGC -l 1000 -r 0.01 -p mutated_tr

3. Generate multi-motif TR sequence

Create a sequence alternating between two motifs.

vampire generator -m AATGG CCATT -l 1000 -r 0 -p multi_motif_tr

4. Reproducible generation

Set a specific random seed for reproducible results.

vampire generator -m GGC -l 1000 -r 0.01 -s 12345 -p reproducible_tr

Parameters

Required Parameters

Parameter Type Description Default
-m, --motifs string One or more motif sequences (space-separated) Required
-p, --prefix string Output prefix for generated files Required

Optional Parameters

Parameter Type Description Default
-l, --length integer Length of simulated tandem repeat sequence (bp) 1000
-r, --mutation-rate float Mutation rate (0-1) for SNVs and indels 0
-s, --seed integer Random seed for reproducibility 42

Parameter Details

Mutation types:

  • SNVs: Single nucleotide variants (substitutions)
  • Indels: Insertions and deletions of varying lengths

Motif patterns:

  • Single motif: Repeats continuously (e.g., GGCGGC GGC...)
  • Multiple motifs: Alternates in specified order (e.g., AATGG CCATT AATGG CCATT...)

Output Format

FASTA File

Standard FASTA format with sequence name and repeat sequence.

Annotation File (with mutations)

Columns include motif ID, position, actual sequence, and mutation details.

Annotation File (without mutations)

Shows the perfect tandem repeat structure before mutations were introduced.


Use Cases

Parameter Tuning

Generate sequences with known structure to test different k-mer sizes:

# Test k=5
vampire generator -m GGC -l 1000 -r 0 -p test_k5
vampire anno -k 5 test_k5.fa test_k5

# Test k=9
vampire generator -m GGC -l 1000 -r 0 -p test_k9
vampire anno -k 9 test_k9.fa test_k9

Benchmarking

Compare annotation accuracy across mutation rates:

for rate in 0 0.01 0.05 0.1; do
  vampire generator -m GGC -l 1000 -r $rate -p benchmark_$rate
  vampire anno benchmark_$rate.fa benchmark_$rate
done

Multi-motif Validation

Test annotation with complex motif patterns:

vampire generator -m AATGG CCATT -l 2000 -r 0.01 -p complex_tr
vampire anno complex_tr.fa complex_tr
vampire evaluate complex_tr complex_tr_eval

See Also

  • Parameters - Full parameter reference
  • anno - Annotate the generated sequences
  • evaluate - Evaluate annotation quality on simulated data