Model & training

A Prototypical Network (meta-learning) built on top of rich audio features. Key numbers, architecture, and how we train for robust, cross-domain detection.

Key metrics

Representative performance from our evaluation setup. Actual numbers depend on dataset and evaluation protocol.

  • ~94% accuracy (in-domain): typical on held-out same-domain data
  • < 8% EER (cross-domain): equal error rate on unseen domains
  • > 0.96 AUC: area under the ROC curve
  • < 100 ms latency (CPU): per 3 s clip, single request

Model card

Input: audio file split into 3 s chunks, resampled to 22.05 kHz, mono
Output: per-file verdict (Real / Fake), confidence, per-chunk probabilities, and vote breakdown
Framework: PyTorch, Prototypical Network + meta-learning
Limitations: 3 s window; performance may vary on very low-quality audio or unseen attack types
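The per-file verdict is produced by combining the per-chunk probabilities. A minimal sketch of one plausible vote-based aggregation (the 0.5 threshold and majority rule are illustrative assumptions, not the shipped logic):

```python
# Sketch of aggregating per-chunk P(fake) values into a file-level verdict.
# The threshold and majority-vote rule are illustrative assumptions.

def aggregate_verdict(chunk_probs, threshold=0.5):
    """chunk_probs: list of P(fake) values, one per 3 s chunk."""
    votes_fake = sum(p >= threshold for p in chunk_probs)
    votes_real = len(chunk_probs) - votes_fake
    confidence = sum(chunk_probs) / len(chunk_probs)  # mean P(fake)
    label = "Fake" if votes_fake > votes_real else "Real"
    return {"label": label, "confidence": confidence,
            "votes": {"fake": votes_fake, "real": votes_real}}
```

This mirrors the model-card output: a label, a confidence score, and the vote breakdown.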

Prototypical encoder

Instead of a single classifier head on raw waveform, the production model uses a Prototypical Network over rich 288‑dimensional audio features. Each 3 s chunk is mapped into an embedding space and compared against learned prototypes for Real and Fake speech.

Input: 288-dim feature vector per 3 s chunk (MFCC, chroma, mel, spectral, tonnetz)

Encoder: multi-layer MLP with BatchNorm, ReLU, and dropout → low-dimensional embedding

Decision: distances to the Real/Fake prototypes → softmax over negative distances → per-chunk probabilities
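The Input → Encoder → Decision path can be sketched in PyTorch. Only the 288-dim input and the MLP/BatchNorm/ReLU/dropout structure come from the text; the layer widths (288 → 128 → 64) and dropout rate are illustrative assumptions:

```python
import torch

# Sketch of the prototypical decision step: embed each chunk's feature vector,
# compare it to the class prototypes by squared Euclidean distance, and take a
# softmax over negative distances (closer prototype => higher probability).
encoder = torch.nn.Sequential(
    torch.nn.Linear(288, 128), torch.nn.BatchNorm1d(128),
    torch.nn.ReLU(), torch.nn.Dropout(0.3),
    torch.nn.Linear(128, 64),
)

def chunk_probs(features, prototypes):
    """features: (N, 288) chunk features; prototypes: (2, 64) for Real/Fake."""
    z = encoder(features)                # (N, 64) embeddings
    d = torch.cdist(z, prototypes) ** 2  # (N, 2) squared distances
    return torch.softmax(-d, dim=1)      # (N, 2) per-chunk probabilities
```

Column 0 vs. 1 of the result corresponds to whichever order the Real/Fake prototypes are stacked in.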

Meta-learning layer (our key contribution)

A meta-learner on top of the encoder adapts quickly to new domains or attack types from only a few examples; this is our main research contribution. In practice it is implemented as a Prototypical Network with class prototypes for Real and Fake speech.

  • Adapts with minimal extra data
  • Improves cross-domain and few-shot performance
  • Stays effective as new deepfakes appear
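With a Prototypical Network, few-shot adaptation reduces to recomputing the class prototypes from a small labeled support set. A sketch under the standard assumption that prototypes are class-mean embeddings:

```python
import torch

# Sketch of few-shot adaptation: given a few labeled support examples from a
# new domain, recompute each class prototype as the mean embedding. The
# encoder passed in stands in for the trained prototypical encoder.
def adapt_prototypes(encoder, support_x, support_y, n_classes=2):
    """support_x: (K, 288) features; support_y: (K,) labels in {0, 1}."""
    with torch.no_grad():
        z = encoder(support_x)
    return torch.stack([z[support_y == c].mean(dim=0) for c in range(n_classes)])
```

No gradient updates are needed: the encoder stays frozen and only the prototypes move, which is why adaptation works with minimal extra data.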

Why it matters

New generators appear constantly. Meta-learning lets the system adapt without full retraining.

Training methodology

Checkpoint: augmented_triple_cross_domain_focal_rawnet_lite.pt (the filename encodes the training choices below)

Augmentation: noise, time stretch, and codec simulation for robustness.
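Two of the listed augmentations can be sketched directly on waveforms; the SNR and stretch-rate values are illustrative, and codec simulation is omitted:

```python
import torch

# Illustrative waveform augmentations; parameter values are assumptions.

def add_noise(wav, snr_db=20.0):
    """Mix in white noise at roughly the given signal-to-noise ratio."""
    noise = torch.randn_like(wav)
    scale = wav.norm() / (noise.norm() * 10 ** (snr_db / 20))
    return wav + scale * noise

def time_stretch(wav, rate=1.1):
    """Naive resample-based stretch: rate > 1 shortens, rate < 1 lengthens."""
    n = int(wav.numel() / rate)
    return torch.nn.functional.interpolate(
        wav.view(1, 1, -1), size=n, mode="linear", align_corners=False
    ).view(-1)
```

Applying a fresh random augmentation each epoch (rather than once up front) is what makes the "Augment per epoch" step in the training loop effective.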

Triple / multi-domain: training spans multiple datasets and synthesis methods.

Cross-domain eval: EER and AUC measured on domains unseen during training.

Focal loss: focuses training on hard examples and improves calibration.
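A minimal binary focal loss in the spirit described; the γ and α defaults are the common choices from the literature, not necessarily this checkpoint's hyperparameters:

```python
import torch

# Binary focal loss sketch: down-weights easy examples so training
# concentrates on hard ones. gamma/alpha defaults are illustrative.
def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """logits, targets: (N,) tensors; targets in {0., 1.} (1 = fake)."""
    bce = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, targets, reduction="none")
    p_t = torch.exp(-bce)                               # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```

The `(1 - p_t) ** gamma` factor is what shrinks the contribution of confidently correct chunks, so misclassified or borderline chunks dominate the gradient.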

Training loop

  • Preprocess (16 kHz, 3 s, mono, normalize)
  • Augment per epoch → forward → focal loss → backprop
  • Validate on held-out and cross-domain sets → save best checkpoint
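The loop above can be condensed into a sketch; the model, batches, loss function, and augmentation hook are placeholders for the project's actual components:

```python
import torch

# Sketch of one epoch of the training loop described above: augment each
# batch, forward pass, loss, backprop. Checkpoint saving and cross-domain
# validation would wrap around this function.
def train_one_epoch(model, batches, loss_fn, opt, augment=lambda x: x):
    """batches: iterable of (features, labels); returns mean loss."""
    model.train()
    total = 0.0
    for x, y in batches:
        opt.zero_grad()
        loss = loss_fn(model(augment(x)).squeeze(-1), y)
        loss.backward()
        opt.step()
        total += loss.item()
    return total / len(batches)
```

In the full pipeline, the epoch with the best held-out and cross-domain metrics is the one whose weights get saved as the checkpoint.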

Datasets

Typical sources: ASVspoof (LA, PA, DF), ADD, and in-the-wild collections; real speech from VCTK/LibriSpeech; fake speech from TTS and voice-conversion (VC) systems. "Triple cross-domain" means both training and evaluation span multiple domains.

Inference

Load the .pt checkpoint with torch.load, apply the same preprocessing as training (resample → mono → normalize → split into 3 s chunks), and run one forward pass per chunk to get P(fake) and a label. No augmentation is applied at inference.
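A condensed sketch of that path, assuming 16 kHz audio as in the training-loop section; the model and checkpoint-loading details are placeholders:

```python
import torch

CHUNK = 3 * 16000  # 3 s at 16 kHz, matching the training preprocessing

def split_chunks(wav):
    """Non-overlapping 3 s chunks from a mono waveform; the tail is dropped."""
    n = wav.numel() // CHUNK
    return wav[: n * CHUNK].view(n, CHUNK)

def predict_file(model, wav, threshold=0.5):
    """Return ('Real'|'Fake', per-chunk P(fake)) for one preprocessed waveform."""
    model.eval()
    with torch.no_grad():
        probs = torch.sigmoid(model(split_chunks(wav)).squeeze(-1))
    label = "Fake" if (probs >= threshold).float().mean() > 0.5 else "Real"
    return label, probs

# Loading would look roughly like:
# model.load_state_dict(torch.load(
#     "augmented_triple_cross_domain_focal_rawnet_lite.pt", map_location="cpu"))
```

Since augmentation is skipped and each chunk is a single forward pass, this is what keeps per-clip CPU latency low.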