Effective Data Pruning through Score Extrapolation

✂️ Preprint, 2025

Overview

Training on massive datasets is prohibitively expensive. Data pruning can help, but existing methods require a full training pass just to decide which samples to remove, which defeats the purpose when only a single training run is planned.

We introduce Score Extrapolation, a framework that predicts sample importance for the entire dataset after training on only a small subset. Using k-nearest neighbors or graph neural networks, we extrapolate importance scores to samples never seen during the initial training — making data pruning efficient from the very first run.
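The kNN variant can be illustrated concretely: score a small subset, then predict scores for unseen samples from their nearest scored neighbors in embedding space. The sketch below is a minimal illustration under that assumption; the function name, distance metric, and mean-aggregation choice are illustrative, not the paper's exact pipeline.

```python
# Minimal sketch of kNN-based score extrapolation, assuming importance
# scores are already computed for a small subset and every sample has a
# feature embedding (e.g. from a pretrained encoder).
import numpy as np

def extrapolate_scores(subset_emb, subset_scores, query_emb, k=5):
    """Predict a score for each query sample as the mean score of its
    k nearest scored neighbors in embedding space."""
    # Pairwise Euclidean distances, shape (n_query, n_subset)
    dists = np.linalg.norm(query_emb[:, None, :] - subset_emb[None, :, :], axis=-1)
    nearest = np.argsort(dists, axis=1)[:, :k]  # k nearest scored samples per query
    return subset_scores[nearest].mean(axis=1)

# Toy usage: 2-D embeddings, scores known for four subset samples
subset_emb = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
subset_scores = np.array([0.1, 0.9, 0.2, 0.8])
query_emb = np.array([[0.05, 0.02], [0.95, 0.98]])
pred = extrapolate_scores(subset_emb, subset_scores, query_emb, k=1)
# With k=1 each query inherits its single nearest neighbor's score:
# pred ≈ [0.1, 0.8]
```

The extrapolated scores can then be thresholded to prune low-importance samples before the full training run, which is the point of doing the scoring on a subset in the first place.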

Key Results

  • ✅ Validated on 2 SOTA pruning methods (Dynamic Uncertainty & TDDS)
  • 📊 Tested across 4 datasets: CIFAR-10, CIFAR-100, Places-365, and ImageNet
  • 🔄 Works for 3 training paradigms: supervised, unsupervised, and adversarial

Key Contributions

  • 🧠 Score extrapolation framework — predicts sample importance without full training
  • 🔮 Two approaches — kNN-based and GNN-based extrapolation methods
  • 📈 Scalable — a promising direction for scaling expensive score calculations (pruning, data attribution, and beyond)

Why It Matters

Data pruning should save compute, not cost more. Score extrapolation breaks the chicken-and-egg problem — enabling efficient pruning from day one, with implications beyond pruning for any task requiring sample-level scoring.

(Schmidt et al., 2025)

References


  1. Effective Data Pruning through Score Extrapolation
    Sebastian Schmidt, Prasanga Dhungel, Christoffer Löffler, et al.
    arXiv preprint, June 2025