RawNetLite + meta-learning layer. Key numbers, architecture, and how we train for robust, cross-domain detection.
Representative performance from our evaluation setup. Actual numbers depend on dataset and evaluation protocol.
~94%
Accuracy (in-domain)
Typical on held-out same-domain data
< 8%
EER (cross-domain)
Equal error rate on unseen domains
> 0.96
AUC
Area under ROC curve
< 100 ms
Latency (CPU)
Per 3 s clip, single request
Lightweight 1D CNN on the raw waveform: no mel spectrograms or hand-crafted features. It learns directly from audio samples; "Lite" means fewer parameters for fast inference.
Input
(batch, 1, 48000) @ 16 kHz, 3 s
Backbone
1D conv blocks, no RNN
Output
Single logit → P(fake)
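The spec above (raw-waveform input, stacked 1D conv blocks, single logit) can be sketched as a minimal PyTorch module. The channel counts, kernel sizes, and pooling are illustrative assumptions, not the published RawNetLite architecture:

```python
import torch
import torch.nn as nn

class RawNetLiteSketch(nn.Module):
    """Illustrative stand-in for RawNetLite: stacked 1D conv blocks on the
    raw waveform, global pooling, one logit. Layer sizes are assumptions."""
    def __init__(self, channels=(32, 64, 128)):
        super().__init__()
        blocks, in_ch = [], 1
        for out_ch in channels:
            blocks += [
                nn.Conv1d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm1d(out_ch),
                nn.ReLU(),
                nn.MaxPool1d(4),
            ]
            in_ch = out_ch
        self.backbone = nn.Sequential(*blocks)
        self.head = nn.Linear(channels[-1], 1)  # single logit -> P(fake) via sigmoid

    def forward(self, x):          # x: (batch, 1, 48000) @ 16 kHz, 3 s
        h = self.backbone(x)       # (batch, C, T')
        h = h.mean(dim=-1)         # global average pool over time
        return self.head(h)        # (batch, 1) logit

model = RawNetLiteSketch()
logit = model(torch.randn(2, 1, 48000))
print(logit.shape)  # torch.Size([2, 1])
```

No recurrent layers anywhere, which keeps per-clip CPU latency low.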
A meta-learner on top of RawNetLite embeddings that adapts quickly to new domains or attack types with few examples; this is our main research contribution.
Why it matters
New generators appear constantly. Meta-learning lets the system adapt without full retraining.
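Few-shot adaptation can take many forms; one common pattern is a MAML-style inner loop (an illustrative assumption here, since the specific meta-learning algorithm is not spelled out above): fine-tune a small head on a handful of labelled clips from the new domain while the backbone stays frozen.

```python
import copy
import torch
import torch.nn.functional as F

def adapt_head(head, embeddings, labels, steps=5, lr=1e-2):
    """Inner-loop adaptation sketch: a few gradient steps on support
    examples from a new domain. `head` maps RawNetLite embeddings to a
    logit; we clone it so the meta-parameters are left untouched."""
    head = copy.deepcopy(head)
    opt = torch.optim.SGD(head.parameters(), lr=lr)
    for _ in range(steps):
        logits = head(embeddings).squeeze(-1)
        loss = F.binary_cross_entropy_with_logits(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return head

# Hypothetical usage: 8 support clips, 128-dim embeddings (both assumed sizes).
head = torch.nn.Linear(128, 1)
support_emb = torch.randn(8, 128)
support_y = torch.randint(0, 2, (8,)).float()
adapted = adapt_head(head, support_emb, support_y)
```

This is what "adapt without full retraining" buys: only the small head is updated, and only on a few examples.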
Checkpoint: augmented_triple_cross_domain_focal_rawnet_lite.pt
Augmentation
Noise, time stretch, codec simulation for robustness.
Triple / multi-domain
Multiple datasets and synthesis methods.
Cross-domain eval
EER, AUC on unseen domains.
Focal loss
Focus on hard examples, better calibration.
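The focal-loss part of the checkpoint name refers to the standard binary focal loss (Lin et al., 2017), which down-weights easy examples by a factor of (1 − p_t)^γ so training concentrates on hard ones. A minimal sketch; the γ/α values here are the common defaults, not necessarily what this checkpoint was trained with:

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss: per-example BCE scaled by alpha_t * (1 - p_t)^gamma,
    where p_t is the model's probability for the true class."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)            # prob of true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```

With γ = 0 and α = 0.5 this reduces to plain (halved) binary cross-entropy; raising γ shifts the gradient budget toward misclassified clips.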
Training loop
Typical sources: ASVspoof (LA, PA, DF), ADD, and in-the-wild sets; real speech from VCTK/LibriSpeech; fake speech from TTS/VC systems. Triple cross-domain = multiple domains for both training and evaluation.
Load .pt with torch.load, same preprocessing (resample → mono → normalize → 3 s), one forward pass → logit → P(fake) and label. No augmentation at inference.
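The inference path above can be sketched as follows. The preprocessing details (peak normalisation, zero-padding of short clips) and the 0.5 decision threshold are assumptions; resampling to 16 kHz (e.g. via torchaudio.functional.resample) is taken as done upstream, and `model` stands in for whatever torch.load returns (if the checkpoint stores a state_dict rather than a full module, instantiate the model first and call load_state_dict):

```python
import torch

TARGET_LEN = 48000  # 3 s at 16 kHz

def preprocess(wave: torch.Tensor) -> torch.Tensor:
    """Mirror the training pipeline: mono mixdown, normalize, fixed 3 s window."""
    if wave.dim() == 2:                                   # (channels, samples) -> mono
        wave = wave.mean(dim=0)
    wave = wave / wave.abs().max().clamp(min=1e-9)        # peak normalize
    if wave.numel() < TARGET_LEN:                         # zero-pad short clips
        wave = torch.nn.functional.pad(wave, (0, TARGET_LEN - wave.numel()))
    return wave[:TARGET_LEN].view(1, 1, TARGET_LEN)       # (batch, 1, 48000)

@torch.no_grad()
def predict(model, wave, threshold=0.5):
    """One forward pass -> logit -> P(fake) and a hard label."""
    model.eval()                                          # no augmentation/dropout at inference
    p_fake = torch.sigmoid(model(preprocess(wave))).item()
    return p_fake, "fake" if p_fake >= threshold else "real"
```

Usage, with the checkpoint named above: `model = torch.load("augmented_triple_cross_domain_focal_rawnet_lite.pt")`, then `p_fake, label = predict(model, wave)`.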