A Prototypical Network (meta-learning) built on top of rich audio features. Key numbers, architecture, and how we train for robust, cross-domain detection.
Representative performance from our evaluation setup. Actual numbers depend on dataset and evaluation protocol.
~94%
Accuracy (in-domain)
Typical on held-out same-domain data
< 8%
EER (cross-domain)
Equal error rate on unseen domains
> 0.96
AUC
Area under ROC curve
< 100 ms
Latency (CPU)
Per 3 s clip, single request
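The EER and AUC figures above can be reproduced from per-clip scores and labels. A minimal NumPy sketch (function names are illustrative, not from the codebase):

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney U) formulation."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    # Fraction of (pos, neg) pairs where the positive scores higher (ties count half).
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def eer(scores, labels):
    """Equal error rate: sweep thresholds for the point where FPR ~= FNR."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    best = (1.0, 0.0)
    for t in np.sort(np.unique(scores)):
        pred = scores >= t
        fpr = np.mean(pred[labels == 0])    # real clips flagged as fake
        fnr = np.mean(~pred[labels == 1])   # fake clips missed
        if abs(fpr - fnr) < abs(best[0] - best[1]):
            best = (fpr, fnr)
    return (best[0] + best[1]) / 2
```

On a perfectly separated toy set this gives AUC 1.0 and EER 0.0; on real cross-domain scores the same functions yield the headline numbers.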
Instead of a single classifier head on the raw waveform, the production model uses a Prototypical Network over rich 288‑dimensional audio features. Each 3 s chunk is mapped into an embedding space and compared against learned prototypes for Real and Fake speech.
Input
288-dim feature vector per 3 s chunk (MFCC, chroma, mel, spectral, tonnetz)
Encoder
Multi-layer MLP with BatchNorm, ReLU, dropout → low‑dim embedding
Decision
Distance to Real/Fake prototypes → softmax over distances → per-chunk probabilities
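The Input → Encoder → Decision path above can be sketched in PyTorch. Layer widths, embedding size, and dropout rate here are illustrative assumptions, not the production configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProtoEncoder(nn.Module):
    """MLP encoder: 288-dim features -> low-dim embedding (sizes are assumptions)."""
    def __init__(self, in_dim=288, hidden=256, emb_dim=64, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.BatchNorm1d(hidden),
            nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden),
            nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, emb_dim),
        )

    def forward(self, x):          # x: (batch, 288)
        return self.net(x)         # (batch, emb_dim)

def proto_probs(embeddings, prototypes):
    """Softmax over negative squared distances to the Real/Fake prototypes."""
    # embeddings: (batch, emb_dim); prototypes: (2, emb_dim), row 0 = Real, row 1 = Fake
    d = torch.cdist(embeddings, prototypes) ** 2   # (batch, 2)
    return F.softmax(-d, dim=1)                    # per-chunk [P(real), P(fake)]
```

A chunk close to the Fake prototype in embedding space gets a high P(fake); the distances themselves double as a crude confidence signal.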
Our main research contribution is a meta-learner on top of the encoder that adapts quickly to new domains or attack types with only a few examples. In practice it is implemented as a Prototypical Network with class prototypes for Real and Fake speech.
Why it matters
New generators appear constantly. Meta-learning lets the system adapt without full retraining.
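A sketch of this adaptation step, assuming it works the standard Prototypical-Network way: given a handful of labeled clips from a new domain (a support set), the class prototypes are simply re-estimated as per-class embedding means, with no gradient updates to the encoder:

```python
import torch

def adapt_prototypes(support_emb, support_labels, n_classes=2):
    """Re-estimate class prototypes from a few labeled embeddings.

    support_emb:    (n_support, emb_dim) encoder outputs for the support set
    support_labels: (n_support,) class ids in {0: Real, 1: Fake}
    Returns (n_classes, emb_dim) prototypes; no encoder weights change.
    """
    return torch.stack([
        support_emb[support_labels == c].mean(dim=0) for c in range(n_classes)
    ])
```

Because adaptation is just a mean over embeddings, reacting to a new TTS generator costs a few labeled clips and one forward pass each, not a retraining run.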
Checkpoint: augmented_triple_cross_domain_focal_rawnet_lite.pt
Augmentation
Noise, time stretch, codec simulation for robustness.
Triple / multi-domain
Multiple datasets and synthesis methods.
Cross-domain eval
EER, AUC on unseen domains.
Focal loss
Focus on hard examples, better calibration.
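The focal-loss component named in the checkpoint can be sketched as a standard binary focal loss; the gamma and alpha values below are common defaults, not necessarily the checkpoint's actual settings:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy examples via the (1 - p_t)^gamma factor.

    logits, targets: (batch,) tensors, targets in {0., 1.} (1 = fake).
    gamma/alpha are common defaults, assumed here for illustration.
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                                  # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```

Confidently correct chunks contribute almost nothing to the loss, so gradient signal concentrates on hard, borderline examples — the ones that matter for cross-domain robustness.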
Training loop
Typical sources: ASVspoof (LA, PA, DF), ADD, and in-the-wild data; real speech from VCTK/LibriSpeech; fake speech from TTS/VC systems. "Triple cross-domain" means multiple domains are used for both training and evaluation.
Load the .pt checkpoint with torch.load, apply the same preprocessing as training (resample → mono → normalize → 3 s chunks), and run one forward pass per chunk to get a logit, then P(fake) and a label. No augmentation is applied at inference.
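The preprocessing half of that inference path might look like the following; the 16 kHz target sample rate and the crude linear-interpolation resampling are illustrative assumptions (production code would likely use a proper resampler):

```python
import numpy as np

TARGET_SR = 16_000   # assumed target sample rate
CHUNK_S = 3          # 3 s chunks, as described above

def preprocess(wave, sr):
    """resample -> mono -> normalize -> fixed 3 s chunks (tail is dropped)."""
    wave = np.asarray(wave, dtype=np.float32)
    if wave.ndim == 2:                          # (channels, samples) -> mono
        wave = wave.mean(axis=0)
    if sr != TARGET_SR:                         # crude linear-interp resample
        n_new = int(len(wave) * TARGET_SR / sr)
        t_new = np.linspace(0, len(wave) / sr, n_new, endpoint=False)
        wave = np.interp(t_new, np.arange(len(wave)) / sr, wave)
    peak = np.abs(wave).max()
    if peak > 0:                                # peak-normalize
        wave = wave / peak
    n = TARGET_SR * CHUNK_S
    chunks = [wave[i:i + n] for i in range(0, len(wave) - n + 1, n)]
    return np.stack(chunks) if chunks else np.empty((0, n), np.float32)
```

Each row of the returned array is one 3 s chunk; feeding all rows through the loaded model (e.g. `model = torch.load(...)`) and averaging the per-chunk P(fake) gives a clip-level score.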