Tutorial · Python · ~45 minutes

Build an sEMG fatigue classifier from scratch

A step-by-step walkthrough — for students new to applied ML in sport science.

This tutorial rebuilds the surface-EMG fatigue project end to end. You will synthesise a realistic sEMG signal, compute the classical spectral fatigue indicators (median and mean frequency), engineer a small feature set, train and calibrate a logistic-regression classifier, and add honest uncertainty intervals with split conformal prediction. By the end you will have ~150 lines of Python that beat the textbook MDF-threshold rule on held-out athletes — and, more importantly, a working mental model of every line.

I wrote this for the version of me who started in sport science and only later picked up Python. If you can read a scatter plot and write a for loop, you can finish this tutorial.

What you need. Python 3.9 or later, plus four libraries:

pip install numpy scipy scikit-learn matplotlib

What you get. A single Python script that runs in under a minute and reproduces every result on the project page. The project codebase uses zero dependencies (a hand-rolled FFT in pure Python) — that's for portability, not for teaching. Here we use the standard scientific stack because that's what you'll reach for in your own work.

01 Set up imports

Single block. We'll fix the random seed up front so your numbers match mine throughout the tutorial.

import numpy as np
import pandas as pd
import scipy.signal as signal
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, roc_auc_score

np.random.seed(7)

Tip: keep the seed at the top of the file, not inside any function. It's the single line that controls every random choice for the rest of the notebook.

02 Background — what is sEMG fatigue?

Surface electromyography (sEMG) records the electrical activity of a contracting muscle through skin electrodes. The raw signal is a noisy oscillation at hundreds of Hz, sampled typically at 1000–2000 Hz.

When a muscle holds an isometric contraction — gripping a sleeve, holding a static stance, hanging on a pull-up — fatigue accumulates. Three things happen, on the muscle-physiology side:

Muscle-fibre conduction velocity slows. Action potentials propagate along the fibres more slowly because of accumulating metabolites (lactate, H⁺).
The power spectrum of the sEMG shifts toward lower frequencies. Direct consequence of slower conduction velocity.
Amplitude (RMS) usually rises. The central nervous system recruits additional motor units to compensate for the dropping force-generating capacity.

The classical scalars that summarise the spectral shift are median frequency (MDF) and mean frequency (MNF). The standard practical fatigue rule, used in countless sport-science papers, is: fatigued if MDF drops below ~90% of its initial value. That's the rule we're going to beat.

03 Synthesise a realistic sEMG signal

We need data. In a real lab you'd record from a subject; here we generate a physiologically motivated synthetic signal so the whole tutorial runs from your laptop. The trick: generate the signal one second at a time as bandpass-filtered Gaussian noise, with the centre frequency of the bandpass slowly drifting from ~120 Hz (fresh) to ~70 Hz (exhausted).

def synth_emg(seconds=60, fs=1000, mdf_start=120, mdf_end=70,
              t50=33, k=0.15, bw=40):
    """Synthesise one isometric sEMG trial with a time-varying spectrum.

    seconds   : trial duration
    fs        : sampling rate (Hz)
    mdf_start : initial median frequency (Hz) — fresh muscle
    mdf_end   : final median frequency (Hz) — exhausted muscle
    t50       : time (s) where the spectral shift is half-complete
    k         : sigmoid steepness — how abrupt the fatigue onset is
    bw        : spectral half-bandwidth (Hz) around the centre frequency
    """
    chunks = []
    for sec in range(seconds):
        t_mid = sec + 0.5
        # Sigmoid interpolation between fresh and fatigued centre frequency
        s = 1 / (1 + np.exp(-k * (t_mid - t50)))
        f_c = mdf_start * (1 - s) + mdf_end * s

        # Bandpass: [f_c - bw, f_c + bw], clipped to typical sEMG band (15-450 Hz)
        low  = max(15,  f_c - bw) / (fs / 2)
        high = min(450, f_c + bw) / (fs / 2)
        b, a = signal.butter(4, [low, high], btype='band')

        noise = np.random.normal(0, 1, fs)
        chunk = signal.filtfilt(b, a, noise)
        chunks.append(chunk)

    return np.concatenate(chunks)


emg = synth_emg()
print(f"Signal length: {len(emg)} samples  ({len(emg)/1000:.1f}s at 1 kHz)")
print(f"RMS amplitude: {emg.std():.3f}")

Notice we used signal.filtfilt (forward-backward) rather than signal.lfilter. That gives zero phase distortion — important when you'll later care about when features change, not just their magnitude.

Plot the first two seconds to confirm it looks like sEMG:

plt.figure(figsize=(10, 2.6))
plt.plot(np.arange(2000) / 1000, emg[:2000], lw=0.6, color='#7c2d12')
plt.xlabel('Time (s)'); plt.ylabel('Amplitude (a.u.)')
plt.title('Synthetic sEMG — first 2 seconds')
plt.tight_layout(); plt.show()

You should see a dense, zero-mean oscillation. If it looks like a clean sine wave, your bandwidth is too narrow; if it looks like white noise, your bandwidth is too wide.

04 Compute the power spectrum

The power spectrum tells us how energy is distributed across frequencies. For a fixed-length window, we taper with a Hann window (to reduce edge leakage), apply the FFT, square the magnitudes.

def power_spectrum(window, fs=1000):
    """One-sided periodogram of a Hann-tapered real-valued window."""
    tapered  = window * signal.windows.hann(len(window))
    spectrum = np.abs(np.fft.rfft(tapered)) ** 2
    freqs    = np.fft.rfftfreq(len(window), d=1 / fs)
    return freqs, spectrum

Try it on a single window at t=5 s:

seg = emg[5000:5512]                   # 0.512-second window starting at t=5s
freqs, psd = power_spectrum(seg)

plt.figure(figsize=(8, 3.5))
plt.plot(freqs, psd / psd.sum(), color='#7c2d12', lw=2)
plt.xlim(0, 250)
plt.xlabel('Frequency (Hz)'); plt.ylabel('Normalised power')
plt.title('PSD at t=5s (fresh muscle)')
plt.tight_layout(); plt.show()

Expected output: a bell-ish peak centred near 120 Hz (the initial MDF), with most of the mass between 80 and 160 Hz.

05 Define MDF and MNF

Two classical scalars. Median frequency: the frequency that splits the spectrum into two halves with equal power. Mean frequency: the power-weighted average frequency. Both restricted to the typical sEMG band (20–450 Hz).

def median_frequency(freqs, psd, band=(20, 450)):
    """Frequency at which cumulative power crosses half of total in-band."""
    mask = (freqs >= band[0]) & (freqs <= band[1])
    f, p = freqs[mask], psd[mask]
    cumulative = np.cumsum(p)
    half_total = cumulative[-1] / 2
    return f[np.searchsorted(cumulative, half_total)]


def mean_frequency(freqs, psd, band=(20, 450)):
    """Power-weighted mean frequency, in-band."""
    mask = (freqs >= band[0]) & (freqs <= band[1])
    f, p = freqs[mask], psd[mask]
    return np.sum(f * p) / np.sum(p)


mdf = median_frequency(freqs, psd)
mnf = mean_frequency(freqs, psd)
print(f"At t=5s:   MDF = {mdf:.1f} Hz   MNF = {mnf:.1f} Hz")

MNF is more sensitive to outliers in the tails (because it weights every frequency by its power). MDF is more robust because it's quantile-based. In a published paper you'll see both reported — they typically agree on direction but not always on magnitude.

06 Visualise the spectral shift

The single chart everyone in this subfield has seen: power spectrum near the start of an isometric trial vs near exhaustion. The whole curve slides left.

seg_fresh    = emg[5000:5512]     # t ≈ 5s
seg_fatigued = emg[55000:55512]   # t ≈ 55s

f1, psd1 = power_spectrum(seg_fresh)
f2, psd2 = power_spectrum(seg_fatigued)

plt.figure(figsize=(9, 3.8))
plt.plot(f1, psd1 / psd1.sum(), color='#78716c', lw=2, label='Fresh (t=5s)')
plt.plot(f2, psd2 / psd2.sum(), color='#7c2d12', lw=2, label='Fatigued (t=55s)')
plt.xlim(0, 250)
plt.xlabel('Frequency (Hz)'); plt.ylabel('Normalised power')
plt.title('Spectrum shifts left as fatigue accumulates')
plt.legend(frameon=False)
plt.tight_layout(); plt.show()

You should clearly see the fresh curve peak near 120 Hz and the fatigued curve peak near 70 Hz. That's the entire physiological motivation for every fatigue indicator in this tutorial.

07 Build a sliding-window feature pipeline

Now we run the analysis across the whole 60 s trial: one feature row per non-overlapping 0.5 s window. We compute MDF, MNF, RMS amplitude, and a 5-window rolling slope of MDF (because rate of change tells us something the static value alone cannot).

def extract_features(emg_signal, fs=1000, win_n=512):
    """Return a DataFrame with one row per non-overlapping analysis window."""
    rows = []
    n_windows = len(emg_signal) // win_n
    for w in range(n_windows):
        seg  = emg_signal[w * win_n: (w + 1) * win_n]
        rms  = np.sqrt(np.mean(seg ** 2))
        freqs, psd = power_spectrum(seg, fs=fs)
        mdf = median_frequency(freqs, psd)
        mnf = mean_frequency(freqs, psd)
        t   = (w * win_n + win_n / 2) / fs
        rows.append({'t': t, 'rms': rms, 'mdf': mdf, 'mnf': mnf})

    df = pd.DataFrame(rows)

    # 5-window short-term slope of MDF — a fatigue *acceleration* feature
    df['mdf_slope5'] = (
        df['mdf']
        .rolling(window=5, min_periods=2)
        .apply(lambda x: np.polyfit(np.arange(len(x)), x, 1)[0], raw=True)
        .fillna(0)
    )
    return df


features = extract_features(emg)
print(features.head().round(2))

Expected: the first few rows show MDF around 120 Hz, RMS roughly stable. Tail the DataFrame (features.tail()) and you should see MDF down in the 70s.

08 Generate a cohort of 20 athletes

One athlete can't tell us whether a model generalises. Let's build a cohort with realistic inter-individual variability — different baselines, different fatigue half-times, different RMS amplification.

def generate_cohort(n_athletes=20):
    cohort = []
    for i in range(n_athletes):
        # Reseed per athlete so each call to synth_emg is deterministic
        np.random.seed(100 + i)
        params = {
            'mdf_start': np.random.normal(120, 8),
            'mdf_end':   np.random.normal(70, 6),
            't50':       np.random.uniform(28, 38),
            'k':         np.random.uniform(0.10, 0.20),
        }
        emg_i      = synth_emg(**params)
        features_i = extract_features(emg_i)
        features_i['athlete_id'] = i
        features_i['true_t50']   = params['t50']
        cohort.append(features_i)
    return pd.concat(cohort, ignore_index=True)


cohort_df = generate_cohort()
print(f"{cohort_df['athlete_id'].nunique()} athletes, "
      f"{len(cohort_df)} total windows")

09 Label and split

Binary label: every window after the athlete's individual t50 is "fatigued". For the split, we hold out athletes not windows — testing on time within the same athlete would be the wrong evaluation, because the trajectory is so autocorrelated that any model would look good.

cohort_df['fatigued'] = (cohort_df['t'] >= cohort_df['true_t50']).astype(int)
print(f"Class balance: {cohort_df['fatigued'].mean():.1%} fatigued")

athletes = cohort_df['athlete_id'].unique()
np.random.seed(11)
np.random.shuffle(athletes)
train_ids = athletes[:12]
calib_ids = athletes[12:16]
test_ids  = athletes[16:]

def subset(ids):
    return cohort_df[cohort_df['athlete_id'].isin(ids)]

train_df = subset(train_ids)
calib_df = subset(calib_ids)
test_df  = subset(test_ids)

FEATURES = ['mdf', 'mnf', 'rms', 'mdf_slope5']
X_train, y_train = train_df[FEATURES], train_df['fatigued']
X_calib, y_calib = calib_df[FEATURES], calib_df['fatigued']
X_test,  y_test  = test_df[FEATURES],  test_df['fatigued']

Why three splits, not two? Calibration is a separate concern from training. If you reuse the training data to pick a decision threshold or compute conformal quantiles, you over-fit them. The calibration set is held out for that purpose only and is never seen by the model fit.

10 Fit logistic regression

Standardise (so coefficients are interpretable) and fit. We use scikit-learn's logistic regression; it's a closed-form problem on this scale, no hyperparameter to think about.

scaler = StandardScaler()
X_train_z = scaler.fit_transform(X_train)
X_calib_z = scaler.transform(X_calib)
X_test_z  = scaler.transform(X_test)

model = LogisticRegression(C=1.0, max_iter=500)
model.fit(X_train_z, y_train)

p_test = model.predict_proba(X_test_z)[:, 1]
acc_default = accuracy_score(y_test, p_test >= 0.5)
auc_test    = roc_auc_score(y_test, p_test)
print(f"Default-threshold accuracy: {acc_default:.3f}")
print(f"ROC AUC on test:            {auc_test:.3f}")

AUC near 1.0 with synthetic data isn't a surprise — the signal is well-separated. The real test is what happens at the decision boundary, which we tune next.

11 Calibrate the decision threshold

0.5 is the wrong default in most asymmetric settings. We pick the threshold τ* that maximises balanced accuracy on the calibration set — never on training, never on test.

def best_threshold(probs, y):
    """Find τ that maximises balanced accuracy (= mean of sensitivity, specificity)."""
    best_tau, best_bal = 0.5, 0.0
    for tau in np.linspace(0.05, 0.95, 50):
        pred = (probs >= tau).astype(int)
        tp = ((pred == 1) & (y == 1)).sum()
        tn = ((pred == 0) & (y == 0)).sum()
        fp = ((pred == 1) & (y == 0)).sum()
        fn = ((pred == 0) & (y == 1)).sum()
        sens = tp / max(1, tp + fn)
        spec = tn / max(1, tn + fp)
        bal  = 0.5 * (sens + spec)
        if bal > best_bal:
            best_tau, best_bal = tau, bal
    return best_tau

p_calib  = model.predict_proba(X_calib_z)[:, 1]
tau_star = best_threshold(p_calib, y_calib.values)
print(f"τ*  = {tau_star:.3f}")
acc_ml = accuracy_score(y_test, p_test >= tau_star)
print(f"Calibrated test accuracy:   {acc_ml:.3f}")

12 Compare to the classical baseline

The classical rule, implemented properly: per-athlete, take the mean MDF of the first 10 windows and call the athlete fatigued from the moment MDF drops below 90% of that.

def classical_baseline(df):
    """MDF < 90% of athlete-specific initial mean."""
    preds = []
    for aid, group in df.groupby('athlete_id'):
        initial_mdf = group.iloc[:10]['mdf'].mean()
        threshold   = 0.90 * initial_mdf
        preds.append((group['mdf'] < threshold).astype(int))
    return pd.concat(preds).sort_index()

baseline_pred = classical_baseline(test_df).values
acc_baseline  = accuracy_score(y_test, baseline_pred)

gap_pp = (acc_ml - acc_baseline) * 100
print(f"Classical baseline accuracy: {acc_baseline:.3f}")
print(f"ML accuracy:                 {acc_ml:.3f}")
print(f"Improvement:                 +{gap_pp:.1f} percentage points")

Where does the improvement come from? Mostly from the transition window, where the classical threshold flips abruptly. The model uses MDF and RMS and the short-term slope of MDF, so it sees the fatigue trajectory bending before MDF itself crosses the threshold.

13 Add conformal uncertainty

A point prediction isn't enough — the coach needs to know how confident the model is. Split conformal gives this without retraining and without distributional assumptions.

The recipe:

On the calibration set, compute the absolute residuals |p − y|.
Sort them. Take the empirical quantile at level 1 − α, with the standard (n + 1) / n correction.
For any new prediction p, the interval is [p − q, p + q], clipped to [0, 1].

nonconformity = np.abs(p_calib - y_calib.values)
nonconformity.sort()

def conformal_halfwidth(alpha):
    n = len(nonconformity)
    rank = min(n, int(np.ceil((n + 1) * (1 - alpha))))
    return nonconformity[rank - 1]

q_80 = conformal_halfwidth(0.20)
q_90 = conformal_halfwidth(0.10)
print(f"80% PI half-width: ±{q_80:.3f}")
print(f"90% PI half-width: ±{q_90:.3f}")

That's it — three lines for a finite-sample distribution-free coverage guarantee, on top of any model that outputs probabilities. Conformal is not magic and it does break (exercise: see what happens when you train on adult athletes and calibrate on a totally different population). But as a default uncertainty layer, it's hard to beat for the effort.

14 Visualise one held-out athlete

Pick a test athlete, plot the MDF time series on one axis and the model's fatigue probability with conformal band on a second axis. Mark the ground- truth t₅₀.

test_df = test_df.copy()
test_df['p']    = p_test
test_df['lo90'] = np.clip(p_test - q_90, 0, 1)
test_df['hi90'] = np.clip(p_test + q_90, 0, 1)

demo_id = test_ids[0]
demo    = test_df[test_df['athlete_id'] == demo_id].sort_values('t')

fig, ax1 = plt.subplots(figsize=(10, 4))
ax2 = ax1.twinx()

ax1.plot(demo['t'], demo['mdf'], color='#78716c', lw=2, label='MDF (Hz)')
ax2.fill_between(demo['t'], demo['lo90'], demo['hi90'],
                 color='#7c2d12', alpha=0.20, label='90% PI')
ax2.plot(demo['t'], demo['p'], color='#7c2d12', lw=2.2, label='P(fatigued)')
ax2.axhline(tau_star, ls='--', color='#7c2d12', alpha=0.6)
ax1.axvline(demo['true_t50'].iloc[0], ls=':', color='gray', alpha=0.7)

ax1.set_xlabel('Trial time (s)')
ax1.set_ylabel('MDF (Hz)', color='#78716c')
ax2.set_ylabel('P(fatigued)', color='#7c2d12')
ax2.set_ylim(-0.05, 1.05)
plt.title(f'Athlete {demo_id} — model output vs ground-truth t50')
fig.tight_layout(); plt.show()

What to look for: MDF declines monotonically; the fatigue probability crosses τ* at roughly the same time the dotted ground-truth line is crossed; the conformal band is widest near the transition (where the decision is genuinely hardest).

15 What to try next

That's the whole pipeline. Some natural extensions if you want to keep going:

Replace synthetic with real data. The NinaPro database, PhysioNet's EMG collections, and the EMG-UKA dataset are good starting points. Your pipeline should not need any changes other than the loader.
Try gradient boosting. Swap LogisticRegression for GradientBoostingClassifier. Measure whether the calibration plot stays as clean — boosted trees often need an explicit Platt/isotonic calibration step.
Group-conditional conformal. Split the calibration residuals by some grouping (e.g., athlete weight class) and compute per-group quantiles. This is the simplest way to improve coverage when the cohort isn't homogeneous.
Multi-channel EMG. Real grip-endurance studies record from the biceps brachii and the forearm flexors. Concatenate features across channels and refit.
Predict force-failure time. Instead of a binary "fatigued / not" label, regress the time-to-failure from the current feature vector. Same conformal logic applies; you get continuous prediction intervals.

Source files

The full standalone version (no-dependency, hand-rolled FFT) is on the project page. The library-based version from this tutorial is short enough to keep in a single file. If you concatenate the code blocks above into tutorial.py and run it, here is the exact output to expect — the cohort is reseeded per-athlete inside generate_cohort so the numbers are reproducible to the decimal:

Default-threshold accuracy: 0.925
ROC AUC on test:            0.989
τ*  = 0.307
Calibrated test accuracy:   0.917
Classical baseline accuracy: 0.844
Improvement:                +7.3 percentage points
80% PI half-width: ±0.053
90% PI half-width: ±0.217

The improvement over the classical baseline (+7.3 pp here) is slightly smaller than the +10 pp on the project page — because that version uses a different sEMG synthesis (frequency-domain construction with sharper spectra) and a hand-rolled FFT. Both are real; the comparison to beat is the same.

Final word

Two skills compound in this tutorial: signal processing (the FFT, the spectral indicators, the windowing) and calibrated ML (the standardisation, the held-out calibration set, the conformal layer). They feel orthogonal, but in applied sport-science work you almost always need both, and the people who can switch fluently between them produce the work that actually gets deployed. That's why I built this project, and why I'm writing this tutorial.

Questions, corrections, or "I tried this on a real recording" — please get in touch.