How the pipeline works

From file upload to real/fake label: loading, preprocessing (resample, normalize, trim/pad), and inference with RawNetLite. The same ideas apply when using librosa for loading and feature extraction in other setups.


Pipeline at a glance

Upload → Load → Resample → Normalize → 3 s window → Model → Label
1. Upload & receive

You select an audio file (WAV, MP3, FLAC, etc.) in the browser. The frontend sends it via POST to the Next.js API route, which forwards the raw file to the Flask backend. The server writes it to a temporary file so the audio stack can load it.

POST /api/predict  →  Flask /predict  →  request.files['audio']
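A minimal sketch of the receiving side, assuming Flask. The route and the 'audio' field name follow the flow above; error handling and the actual model call are trimmed.

```python
import tempfile
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    uploaded = request.files["audio"]            # field name sent by the frontend
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        uploaded.save(tmp.name)                  # spill to disk so the audio stack can load it
        # ... torchaudio.load(tmp.name), preprocess, run the model ...
        return jsonify({"received": uploaded.filename})
```

The temporary file exists only for the duration of the request, which keeps uploads off the permanent filesystem.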
2. Load waveform (torchaudio / librosa-style)

The backend loads the file with torchaudio.load(), which decodes the format and returns a float tensor (waveform) and sample rate. In many research pipelines, librosa is used instead (e.g. librosa.load) for the same purpose—decode to a time-domain signal and sample rate. Our stack uses torchaudio for consistency with PyTorch and the model.

waveform, sr = torchaudio.load(tmp.name)  # shape: (C, N), sr: int
3. Convert to mono & resample to 16 kHz

Stereo or multi-channel audio is first mixed down to a single channel (e.g. the mean across channels), then resampled to 16 kHz with torchaudio.transforms.Resample. This matches the model’s expected input rate and fixes how many samples a 3-second window contains. Librosa equivalents: librosa.to_mono(y) for the downmix and librosa.resample(y, orig_sr=sr, target_sr=16000) for the rate change.

waveform = waveform.mean(dim=0, keepdim=True);  waveform = Resample(sr, 16000)(waveform)
4. Normalize & fix length (3 seconds)

The waveform is peak-normalized (divide by max absolute value) so amplitudes lie in [-1, 1]. Then it’s trimmed to 3 seconds (48,000 samples at 16 kHz) or zero-padded if shorter. This gives a fixed-size input tensor (1, 48000) for the model. Optional: in feature-based pipelines, you’d extract mel spectrograms (e.g. librosa.feature.melspectrogram) or other hand-crafted features; RawNetLite operates on raw waveform.

waveform = waveform / waveform.abs().max();  trim/pad to 48000 samples
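A concrete version of the normalize-and-fix-length step. The helper name is illustrative; the zero-peak guard is an assumption added so silent input does not divide by zero.

```python
import torch

TARGET_LEN = 48000  # 3 s at 16 kHz

def normalize_and_fix(waveform: torch.Tensor) -> torch.Tensor:
    """Peak-normalize to [-1, 1], then trim or zero-pad to exactly 3 s."""
    peak = waveform.abs().max()
    if peak > 0:                                  # guard against silent input
        waveform = waveform / peak
    n = waveform.shape[-1]
    if n >= TARGET_LEN:
        return waveform[..., :TARGET_LEN]         # trim long clips
    return torch.nn.functional.pad(waveform, (0, TARGET_LEN - n))  # pad short ones

print(normalize_and_fix(torch.randn(1, 10000)).shape)  # short clip is padded
print(normalize_and_fix(torch.randn(1, 90000)).shape)  # long clip is trimmed
```

Either branch yields the fixed-size (1, 48000) tensor the model expects.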
5. Forward pass: RawNetLite + meta-learning layer

The preprocessed tensor is moved to the model’s device (CPU or CUDA) and passed through RawNetLite in eval mode under torch.no_grad(). Our system adds a meta-learning layer on top (our key contribution), which consumes RawNetLite embeddings and produces a single scalar logit for robust, adaptable detection. The combined model outputs one logit (no softmax; a single binary output).

output = model(waveform)  # RawNetLite + meta-learning → (1,) logit
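The inference scaffolding looks like the sketch below. The tiny Sequential here is a hypothetical stand-in for RawNetLite plus the meta-learning head, not the real architecture; what matters is the device handling, eval mode, and no_grad context.

```python
import torch

# Stand-in for RawNetLite + meta-learning head: any nn.Module
# mapping a (1, 48000) waveform to a single scalar logit.
model = torch.nn.Sequential(torch.nn.Flatten(start_dim=0), torch.nn.Linear(48000, 1))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()                  # inference mode

waveform = torch.zeros(1, 48000, device=device)  # preprocessed input
with torch.no_grad():                            # no autograd bookkeeping
    logit = model(waveform)
print(logit.shape)
```

eval() matters for layers like BatchNorm and Dropout, while no_grad() skips gradient tracking entirely; both are standard for inference.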
6. Probability & label

The logit is converted to a probability via sigmoid: P(fake) = σ(logit). We threshold at 0.5: above 0.5 → label 'fake', otherwise → 'real'. This probability and label are returned as JSON to the frontend, which displays the result and stores it in the results history.

prob = sigmoid(logit);  label = 'fake' if prob > 0.5 else 'real'
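Spelled out in plain Python (to_label is an illustrative helper name; the 0.5 threshold matches the description above):

```python
import math

def to_label(logit: float, threshold: float = 0.5):
    """Map the model's single logit to P(fake) and a hard label."""
    prob = 1.0 / (1.0 + math.exp(-logit))        # sigmoid
    return prob, ("fake" if prob > threshold else "real")

print(to_label(2.0))    # P(fake) ≈ 0.88 → 'fake'
print(to_label(-2.0))   # P(fake) ≈ 0.12 → 'real'
```

Because the model emits one logit rather than two softmax scores, a single sigmoid suffices; P(real) is simply 1 − P(fake).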

Librosa vs torchaudio

This backend uses torchaudio for loading and resampling so everything stays in PyTorch. In many papers and codebases, librosa is used to load audio (librosa.load), resample (librosa.resample), and extract features (e.g. mel spectrogram, librosa.feature.melspectrogram). RawNetLite is trained on raw waveform, so we only do load → resample → mono → normalize → fixed length; no mel step. If you switch to a feature-based model, you’d add a librosa/torchaudio feature extraction step here.

Data flow summary

Audio file → temp file → waveform (C×N) + sr → mono → 16 kHz → normalize → (1, 48000) → RawNetLite → logit → P(fake), label → JSON → frontend display & history.