From file upload to real/fake label: loading, preprocessing (resample, chunking, feature extraction), and inference with a prototypical meta-learning model over 3-second segments.
Pipeline at a glance
You select an audio file (WAV, MP3, FLAC, etc.) in the browser. The frontend sends it via POST to the Next.js API route, which forwards the raw file to the FastAPI backend. The server writes it to a temporary file so the audio stack can load it.
POST /api/predict → FastAPI /predict → UploadFile
The backend loads the file with torchaudio.load(), which decodes the format and returns a float tensor (waveform) and sample rate. Many research pipelines use librosa instead (e.g. librosa.load) for the same purpose: decode to a time-domain signal plus sample rate. Our stack uses torchaudio for consistency with PyTorch and the model.
waveform, sr = torchaudio.load(tmp.name) # shape: (C, N), sr: int
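The temp-file step can be sketched as follows. `save_upload_to_temp` is a hypothetical helper (not the backend's actual code), shown with plain bytes rather than FastAPI's UploadFile:

```python
import os
import tempfile

def save_upload_to_temp(data: bytes, suffix: str = ".wav") -> str:
    """Write uploaded bytes to a named temp file so torchaudio.load()
    (or librosa.load()) can open it by path."""
    # delete=False: the file must survive until the audio loader reads it
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
        return tmp.name  # e.g. pass this path to torchaudio.load()
```

The caller is responsible for removing the file once loading is done.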
Stereo or multi-channel audio is mixed down to a single channel (mean across channels). The signal is then resampled to 16 kHz using torchaudio.transforms.Resample, which matches the model's expected input rate and fixes the number of samples per window. Librosa equivalents: librosa.resample(y, orig_sr=sr, target_sr=16000) for resampling and librosa.to_mono(y) for the mixdown.
waveform = waveform.mean(dim=0, keepdim=True); waveform = Resample(sr, 16000)(waveform)
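A minimal NumPy sketch of the mixdown-and-resample step, using linear interpolation as a simple stand-in for torchaudio's higher-quality resampler (the function name `to_mono_16k` is illustrative):

```python
import numpy as np

def to_mono_16k(waveform: np.ndarray, sr: int, target_sr: int = 16000) -> np.ndarray:
    """waveform: (channels, samples) float array -> (new_samples,) mono at target_sr."""
    mono = waveform.mean(axis=0)             # mix all channels to one
    if sr == target_sr:
        return mono
    duration = mono.shape[0] / sr
    n_out = int(round(duration * target_sr))
    t_in = np.arange(mono.shape[0]) / sr     # original sample times (seconds)
    t_out = np.arange(n_out) / target_sr     # target sample times
    return np.interp(t_out, t_in, mono)      # linear-interpolation resample
```

In production, a polyphase/windowed-sinc resampler such as torchaudio.transforms.Resample gives better fidelity than plain linear interpolation.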
The waveform is peak-normalized (divided by its maximum absolute value) so amplitudes lie in [-1, 1]. It is then sliced into non-overlapping 3-second windows (48,000 samples at 16 kHz), with the final window zero-padded if it falls short. Each window is fed through the feature extractor to build a 288‑dimensional vector per chunk.
waveform = waveform / waveform.abs().max(); trim/pad to 48000 samples
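The normalize-and-chunk step might look like this sketch (48,000 samples = 3 s × 16 kHz; `chunk_waveform` is an illustrative name, not the backend's actual function):

```python
import numpy as np

CHUNK = 48000  # 3 seconds at 16 kHz

def chunk_waveform(mono: np.ndarray, chunk: int = CHUNK) -> list:
    """Peak-normalize, then slice into non-overlapping fixed-size windows,
    zero-padding the trailing remainder to full length."""
    peak = np.abs(mono).max()
    if peak > 0:
        mono = mono / peak                   # amplitudes now in [-1, 1]
    chunks = []
    for start in range(0, len(mono), chunk):
        window = mono[start:start + chunk]
        if len(window) < chunk:              # pad the last partial window
            window = np.pad(window, (0, chunk - len(window)))
        chunks.append(window)
    return chunks
```

Each returned window then goes through the feature extractor independently.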
Each 288‑dimensional feature vector is filtered, scaled, embedded by the encoder, and compared against learned Real/Fake prototypes. Distances are converted to probabilities via softmax over negative distance, giving per‑chunk P(real) and P(fake).
chunk_probs = predictor.predict_audio_file(path) # prototypical meta-learning over chunks
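The prototype comparison can be sketched in NumPy. The embedding and prototypes here are stand-ins for the encoder's output and the learned Real/Fake prototypes:

```python
import numpy as np

def proto_probs(embedding: np.ndarray, prototypes: np.ndarray) -> np.ndarray:
    """Softmax over negative squared Euclidean distance to each prototype.
    prototypes: (2, D) for [real, fake]; returns [P(real), P(fake)]."""
    dists = np.sum((prototypes - embedding) ** 2, axis=1)  # (2,) distances
    logits = -dists                                        # closer -> larger logit
    logits -= logits.max()                                 # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()
```

The chunk is assigned higher probability for whichever class prototype its embedding lies nearer to.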
With only two classes, the softmax over negative distances reduces to a sigmoid of a single logit (the difference of the two distances): P(fake) = σ(logit). We threshold at 0.5: above 0.5 → label 'fake', otherwise → 'real'. The probability and label are returned as JSON to the frontend, which displays the result and stores it in the results history.
prob = sigmoid(logit); label = 'fake' if prob > 0.5 else 'real'
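The thresholding step as a small sketch in plain Python (`classify_logit` is an illustrative name):

```python
import math

def classify_logit(logit: float, threshold: float = 0.5) -> tuple:
    """P(fake) = sigmoid(logit); label 'fake' above the threshold, else 'real'."""
    prob = 1.0 / (1.0 + math.exp(-logit))
    label = "fake" if prob > threshold else "real"
    return prob, label
```

Note the strict comparison: a logit of exactly 0 (P(fake) = 0.5) resolves to 'real'.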
This backend uses torchaudio for loading and resampling so everything stays in PyTorch. In many papers and codebases, librosa is used to load audio (librosa.load), resample (librosa.resample), and extract features (e.g. mel spectrogram, librosa.feature.melspectrogram). Our production backend uses a feature-based Prototypical Network for inference; if you swap in a different model, you can adjust the feature step here.
Audio file → temp file → waveform (C×N) + sr → mono → resample → 3-second chunks → 288‑dim features → prototypical meta-learning → per‑chunk votes → majority vote & aggregates → JSON → frontend display & history.
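Putting the per-chunk votes together, a sketch of the final aggregation (the backend's exact aggregation may differ; `aggregate_chunks` is illustrative):

```python
def aggregate_chunks(chunk_probs: list, threshold: float = 0.5) -> dict:
    """chunk_probs: per-chunk P(fake). Majority vote over chunk labels,
    plus the mean probability, packaged for the JSON response."""
    votes = ["fake" if p > threshold else "real" for p in chunk_probs]
    fake_votes = votes.count("fake")
    label = "fake" if fake_votes > len(votes) / 2 else "real"
    return {
        "label": label,
        "prob_fake": sum(chunk_probs) / len(chunk_probs),  # mean over chunks
        "chunks": len(chunk_probs),
        "fake_votes": fake_votes,
    }
```

Reporting both the vote and the mean probability lets the frontend show a confidence score alongside the binary label.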