A Python package for extracting instrument-specific musical loops from MIDI datasets. Implements a modular pipeline for bass, guitar, piano, drums, and synthesizer parts with configurable filtering and multiple output formats.
A Practical Tool for Symbolic Music Dataset Curation
The midi loop extractor project addresses a persistent challenge in music information retrieval (MIR): preparing instrument-specific datasets from large MIDI collections. This Python package extracts musical loops for bass, guitar, piano, drums, and synthesizer parts. It then filters them by instrument-specific criteria and writes the results in multiple output formats.
The tool emerges from a practical need. Researchers working with symbolic music data often spend considerable time parsing MIDI files and separating instrument tracks. This package automates that process with a structured pipeline approach.
Design Philosophy and Architecture
The architecture follows a clean separation of concerns. An extraction layer identifies candidate tracks. A processing layer applies instrument-specific filters. Finally, representation layers convert the data into usable formats.
Each instrument gets its own extractor class. The configuration system uses YAML files. This makes the tool extensible for new instruments or custom filtering logic.
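To make the layering concrete, here is a minimal sketch of how one extractor class per instrument might look. The names Track, BaseExtractor, BassExtractor, and the min_notes setting are hypothetical illustrations, not the package's actual API; the config dictionary stands in for values that would be loaded from a YAML file.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Track:
    """Hypothetical minimal track record: name, MIDI program, and notes."""
    name: str
    program: int
    notes: list = field(default_factory=list)  # (pitch, start_beat, end_beat)


class BaseExtractor(ABC):
    """Hypothetical base class; each instrument provides one subclass."""

    def __init__(self, config: dict):
        self.config = config  # would come from a YAML file in practice

    @abstractmethod
    def is_candidate(self, track: Track) -> bool:
        """Extraction layer: does this track belong to the instrument?"""

    @abstractmethod
    def passes_filters(self, track: Track) -> bool:
        """Processing layer: does it meet the instrument-specific criteria?"""

    def extract(self, tracks: list) -> list:
        return [t for t in tracks if self.is_candidate(t) and self.passes_filters(t)]


class BassExtractor(BaseExtractor):
    def is_candidate(self, track):
        # General MIDI bass family, using 0-indexed program numbers.
        return 32 <= track.program <= 39

    def passes_filters(self, track):
        # "min_notes" is an assumed setting name, with an assumed default.
        return len(track.notes) >= self.config.get("min_notes", 4)


tracks = [
    Track("Fingered Bass", 33, [(40, 0, 1), (43, 1, 2), (45, 2, 3), (40, 3, 4)]),
    Track("Sparse Bass", 33, [(40, 0, 1)]),
    Track("Grand Piano", 0, [(60, 0, 1)] * 8),
]
kept = BassExtractor({"min_notes": 4}).extract(tracks)
print([t.name for t in kept])
```

Under this layout, supporting a new instrument means writing one more subclass, which is the kind of extensibility the design aims for.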
The choice of output formats shows attention to different use cases. Piano roll matrices suit computer vision approaches. Token sequences work for recurrent neural networks (RNNs), transformers, and other sequence models. MIDI files preserve data for human evaluation. This flexibility matters in a research context where different experiments need different representations.
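The difference between the first two representations is easier to see on a tiny example. The sketch below converts the same two notes into a binary piano roll and into an event-token sequence. The function names, the 16th-note grid, and the token vocabulary (NOTE_ON, NOTE_OFF, TIME_SHIFT) are assumptions chosen for illustration; the package may encode things differently.

```python
def to_piano_roll(notes, steps_per_beat=4, total_beats=4):
    """Binary matrix with 128 pitch rows and one column per time step."""
    n_steps = steps_per_beat * total_beats
    roll = [[0] * n_steps for _ in range(128)]
    for pitch, start, end in notes:
        first = int(round(start * steps_per_beat))
        last = int(round(end * steps_per_beat))
        for step in range(first, min(last, n_steps)):
            roll[pitch][step] = 1
    return roll


def to_tokens(notes, steps_per_beat=4):
    """Flat event sequence: note-on, note-off, and time-shift tokens."""
    events = []
    for pitch, start, end in notes:
        # The middle element orders note-offs before note-ons on the same step.
        events.append((int(round(start * steps_per_beat)), 1, f"NOTE_ON_{pitch}"))
        events.append((int(round(end * steps_per_beat)), 0, f"NOTE_OFF_{pitch}"))
    events.sort()

    tokens, now = [], 0
    for step, _, name in events:
        if step > now:
            tokens.append(f"TIME_SHIFT_{step - now}")
            now = step
        tokens.append(name)
    return tokens


# Two one-beat notes as (pitch, start_beat, end_beat).
notes = [(40, 0.0, 1.0), (43, 1.0, 2.0)]
roll = to_piano_roll(notes)
tokens = to_tokens(notes)
print(roll[40][:8], roll[43][:8])
print(tokens)
```

The piano roll has a fixed shape regardless of how many notes occur, which suits image-style models. The token list grows with the number of events, which suits sequence models.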
Technical Implementation
The instrument identification relies on MIDI program numbers and text matching. This is a pragmatic choice. Program numbers are standardized for many instruments. Text matching catches edge cases where metadata is inconsistent.
However, this approach has limits. MIDI files often contain ambiguous or missing metadata. Some producers use non-standard program assignments. The tool acknowledges this through its keyword matching system. Still, users should verify extraction quality for their specific datasets.
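A two-stage lookup of this kind can be sketched in a few lines. The program ranges below follow the General MIDI instrument families (0-indexed), and channel index 9 is the General MIDI percussion channel. The keyword lists, the function name, and the order of the checks (program number first, track name as fallback) are assumptions for illustration and may differ from what the package does.

```python
# General MIDI instrument families, as 0-indexed program numbers.
GM_PROGRAM_RANGES = {
    "piano": range(0, 8),
    "guitar": range(24, 32),
    "bass": range(32, 40),
    "synthesizer": range(80, 96),  # synth lead and synth pad families
}

# Hypothetical keyword lists for the text-matching fallback.
KEYWORDS = {
    "piano": ("piano", "keys"),
    "guitar": ("guitar", "gtr"),
    "bass": ("bass",),
    "synthesizer": ("synth",),
}

DRUM_CHANNEL = 9  # General MIDI percussion channel (channel 10, 0-indexed)


def identify_instrument(program, track_name="", channel=0):
    """Return an instrument label, or None if nothing matches."""
    if channel == DRUM_CHANNEL:
        return "drums"
    # Stage 1: standardized program numbers.
    for instrument, programs in GM_PROGRAM_RANGES.items():
        if program in programs:
            return instrument
    # Stage 2: fall back to keywords in the track name.
    name = track_name.lower()
    for instrument, words in KEYWORDS.items():
        if any(word in name for word in words):
            return instrument
    return None


print(identify_instrument(33))                 # program in the bass family
print(identify_instrument(56, "Slap Bass 2"))  # program disagrees, name matches
print(identify_instrument(0, channel=9))       # percussion channel
print(identify_instrument(56, "Trumpet"))      # neither stage matches
```

The second call shows the weakness discussed above: when the program number and the track name disagree, whichever check runs first decides the label. This is why spot-checking extraction results on a given dataset remains worthwhile.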
The parallel processing implementation uses multiprocessing. This speeds up batch operations on large collections. The Docker containerization includes GPU support. These features suggest the tool is meant for serious dataset processing, not just small experiments.
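The general batch pattern looks roughly like the sketch below, which uses Python's standard multiprocessing.Pool. The worker here only reads each file's size as a stand-in for the real extraction step. The function names and the midi_collection directory are placeholders, not part of the package.

```python
from multiprocessing import Pool
from pathlib import Path


def summarize_file(path):
    """Per-file worker; a stand-in for the real extraction step."""
    try:
        size = Path(path).stat().st_size
        return {"path": str(path), "ok": True, "bytes": size}
    except OSError as exc:
        # Report the failure instead of raising, so one bad file
        # does not abort the whole batch.
        return {"path": str(path), "ok": False, "error": type(exc).__name__}


def process_collection(paths, workers=4):
    """Fan the per-file worker out over a pool of processes."""
    with Pool(processes=workers) as pool:
        # imap_unordered yields results as they finish, in any order.
        return list(pool.imap_unordered(summarize_file, paths, chunksize=16))


if __name__ == "__main__":
    # "midi_collection" is a placeholder directory name.
    midi_paths = sorted(Path("midi_collection").rglob("*.mid"))
    results = process_collection(midi_paths) if midi_paths else []
    failed = [r for r in results if not r["ok"]]
    print(f"{len(results)} files processed, {len(failed)} failed")
```

Returning an error record from the worker, rather than letting the exception propagate, matters for large collections where a handful of corrupt files is almost guaranteed.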
Position in the Research Landscape
The project sits in the symbolic music processing subfield of MIR. It resembles tools like music21, pretty_midi, and MusPy. Those libraries focus on parsing and analysis. This tool focuses on dataset preparation.
Recent ISMIR papers emphasize the need for high-quality training data. The rise of symbolic music generation models makes curated datasets valuable. This tool could help researchers prepare data for those models.
Practical Trade-offs
The tool makes reasonable trade-offs. It prioritizes usability over novelty. The code is readable and well-documented. The configuration system is clear.
Some decisions warrant scrutiny. The instrument-specific filtering thresholds and loop extraction length ship as built-in defaults. Users can override these through configuration, but wherever they are left untouched, the defaults determine what ends up in the dataset.
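The way defaults persist can be illustrated with a small merge function. This is a sketch only: the setting names and values are invented, and the dictionaries stand in for what a YAML loader would return from the package's configuration files.

```python
import copy

# Hypothetical built-in defaults; the real names and values may differ.
DEFAULTS = {
    "loop": {"length_bars": 4},
    "bass": {"min_notes": 4, "max_pitch": 60},
}


def merge_config(defaults, overrides):
    """Recursively overlay user overrides on a copy of the defaults."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A user file that changes a single threshold.
user_overrides = {"bass": {"min_notes": 8}}
config = merge_config(DEFAULTS, user_overrides)
print(config)
```

Only min_notes changes here. The loop length and the pitch ceiling still come from the defaults, so a user who tunes one threshold inherits every other default unchanged.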
The reliance on the Lakh MIDI Dataset (LMD) reflects convenience. LMD is large and widely used. But it also contains noisy transcriptions of variable quality.
Future Development Paths
I foresee several extensions that would strengthen the project. Adding confidence scores for extraction decisions would help users filter uncertain cases. Implementing active learning could refine instrument classification over time. Integration with music21 or Partitura might expand analytical capabilities.
Final Assessment
The project reflects a pragmatic approach to research software. It balances generality with specificity, and shows the ability to build useful systems that others might adopt. The tool’s value lies in its practicality. Symbolic music research needs better data preparation pipelines. This project attempts to deliver one.