Benchmarks
Maestro has been extensively benchmarked across diverse quantum workloads, covering single-circuit execution, batched multi-tenant scenarios, GPU acceleration, and distributed quantum computing simulation. Full details are available in the Maestro paper (arXiv:2512.04216).
Summary of Results
| Benchmark | Key Result |
|---|---|
| Automode vs QCSim | 9.2× speedup on heterogeneous 90-circuit batch |
| Automode vs Qiskit | 8.4× speedup on the same batch |
| Automode vs Qiskit Auto | 1.8× speedup — deeper optimization and broader circuit detection |
| MPS fidelity | Adaptive bond dimension achieves ≥ 0.95 mirror fidelity |
| GPU acceleration | Significant speedup for wide state vector circuits |
| DQC simulation | Circuits with 1000+ qubits simulated via p-block mode |
| External validation | NPL (UK National Physical Laboratory) confirmed competitive performance |
General Benchmarks
Circuits tested include GHZ states (entanglement generation), QFT (interference), random Clifford+T (universal gate sets), and QAOA layers of increasing depth (variational workloads).
MPS Simulation
MPS simulation uses adaptive bond dimension (χ) to balance speed and accuracy:
- Start with a low bond dimension (χ = 4)
- Simulate the circuit and measure mirror fidelity
- Double χ until fidelity ≥ 0.95
This keeps mirror fidelity at or above the 0.95 threshold while stopping at the first bond dimension that reaches it, which keeps runtime low. Maestro’s MPS engine (QCSim) is benchmarked against Qiskit Aer’s MPS, with results showing competitive or superior performance across circuit widths.
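
The doubling procedure above can be sketched in a few lines of Python. This is a minimal illustration, not Maestro's implementation: `tune_bond_dimension`, the `mirror_fidelity` callback, and the `chi_max` safety cap are hypothetical names introduced here, and `toy_fidelity` is a stand-in for an actual MPS run.

```python
from typing import Callable, Tuple


def tune_bond_dimension(
    mirror_fidelity: Callable[[int], float],
    chi_start: int = 4,
    target: float = 0.95,
    chi_max: int = 1024,  # assumed cap so the loop always terminates
) -> Tuple[int, float]:
    """Double chi until the measured mirror fidelity reaches the target."""
    chi = chi_start
    while True:
        # In a real run this would simulate the circuit at bond dimension chi.
        fidelity = mirror_fidelity(chi)
        if fidelity >= target or chi >= chi_max:
            return chi, fidelity
        chi = min(chi * 2, chi_max)


def toy_fidelity(chi: int) -> float:
    """Toy model only: fidelity improves as chi grows."""
    return 1.0 - 2.0 / chi


chi, fidelity = tune_bond_dimension(toy_fidelity)
print(chi, fidelity)
```

With the toy model, the loop tries χ = 4, 8, 16, 32 and stops at χ = 64, the first value whose fidelity is at least 0.95.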
GPU Acceleration
GPU offloading provides significant speedups for state vector simulations, but the advantage depends on circuit characteristics:
- Wide circuits (many qubits): GPU advantage is clear
- Narrow circuits: CPU may be faster due to GPU transfer overhead
- GPU acceleration applies to both state vector and MPS backends
Currently, GPU offloading is a user-configurable toggle. Future versions will automate this decision based on predicted transfer overhead.
Batched Circuit Execution (Torture Test)
The most representative benchmark is the Torture Test — a batch of 90 heterogeneous circuits designed to span the full complexity spectrum:
- Clifford circuits: Composed entirely of Clifford gates (ideal for stabilizer simulation)
- Low-entanglement circuits: GHZ states, shallow QAOA (p=1), hardware-efficient ansätze (ideal for MPS)
- High-entanglement circuits: Densely entangled circuits requiring state vector or high-bond-dimension MPS
Comparison
Four policies were compared:
| Policy | Strategy |
|---|---|
| QCSim | State vector for ≤30 qubits, MPS otherwise (χ tuned for fidelity ≥ 0.95) |
| Qiskit | Same strategy using Qiskit Aer backends |
| Qiskit Auto | Qiskit Aer’s automatic backend selection |
| Maestro Auto | Feature-based prediction selecting optimal backend per circuit |
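
For context on the 30-qubit cut-off in the two fixed policies, the arithmetic below shows how state vector memory grows with qubit count. It assumes double-precision complex amplitudes (16 bytes each), which the source does not state; `statevector_bytes` is a helper introduced here for illustration.

```python
def statevector_bytes(num_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Memory for a dense state vector: 2**n amplitudes, 16 bytes each (assumed)."""
    return (2 ** num_qubits) * bytes_per_amplitude


for n in (20, 30, 40):
    gib = statevector_bytes(n) / 2**30
    print(f"{n} qubits: {gib:g} GiB")
```

Under this assumption a 30-qubit state vector needs 16 GiB, and every additional qubit doubles that figure, so a fixed policy needs some such threshold beyond which it switches to MPS.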
Results
Maestro Auto achieved 9.2× speedup over QCSim, 8.4× over Qiskit, and 1.8× over Qiskit Auto.
The key insight: in heterogeneous batches, a few highly entangled circuits typically dominate the total runtime. Maestro addresses this by matching every individual circuit to its optimal backend — routing Clifford circuits to the stabilizer simulator (orders-of-magnitude faster), MPS-friendly circuits to MPS, and reserving state vector for circuits that truly need it. The prediction overhead is negligible compared to execution time.
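
The routing idea can be illustrated with a hand-written rule. This is only a sketch: Maestro Auto uses feature-based prediction, whereas the function below hard-codes a simple decision, and the `CircuitFeatures` fields, the `est_entanglement` score, and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CircuitFeatures:
    num_qubits: int
    clifford_only: bool       # True if every gate is a Clifford gate
    est_entanglement: float   # hypothetical score in [0, 1]


def choose_backend(
    f: CircuitFeatures,
    sv_qubit_limit: int = 30,          # assumed, mirrors the fixed policies
    entanglement_cutoff: float = 0.5,  # assumed, illustration only
) -> str:
    """Pick a simulator per circuit instead of one policy for the whole batch."""
    if f.clifford_only:
        return "stabilizer"
    if f.est_entanglement < entanglement_cutoff:
        return "mps"
    if f.num_qubits <= sv_qubit_limit:
        return "statevector"
    # Highly entangled and too wide for a state vector:
    # fall back to MPS with a high bond dimension.
    return "mps"


batch = [
    CircuitFeatures(100, True, 0.9),
    CircuitFeatures(60, False, 0.1),
    CircuitFeatures(24, False, 0.8),
]
print([choose_backend(c) for c in batch])
```

Checking the Clifford case first matters in this sketch, because a Clifford circuit can be wide and highly entangled yet still cheap on a stabilizer simulator.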
Distributed Quantum Computing
Maestro’s p-block simulation mode enables simulating distributed quantum computing (DQC) by partitioning circuits across multiple virtual QPUs (vQPUs) connected through entanglement links:
- Qubits are allocated across vQPUs to minimize inter-node entangling gates
- Each vQPU runs its own local simulator, reducing peak memory requirements
- Results are validated against monolithic state vector simulations using Hellinger fidelity
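
For reference, Hellinger fidelity between two measurement-count dictionaries can be computed as below. The sketch uses a common definition, the squared Bhattacharyya coefficient (Σ √(pᵢqᵢ))², which equals (1 − H²)² for Hellinger distance H. It is written in plain Python, assumes both count dictionaries are non-empty, and is not taken from Maestro's code.

```python
import math
from typing import Dict


def hellinger_fidelity(counts_a: Dict[str, int], counts_b: Dict[str, int]) -> float:
    """Return (sum_i sqrt(p_i * q_i))**2 for two count dictionaries."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    overlap = 0.0
    # Outcomes missing from either distribution contribute zero.
    for outcome in counts_a.keys() & counts_b.keys():
        p = counts_a[outcome] / total_a
        q = counts_b[outcome] / total_b
        overlap += math.sqrt(p * q)
    return overlap ** 2


distributed = {"00": 498, "11": 502}
monolithic = {"00": 500, "11": 500}
print(hellinger_fidelity(distributed, monolithic))
```

A value of 1 means the two output distributions are identical and 0 means they share no outcomes, so values close to 1 indicate that the partitioned simulation reproduces the monolithic result.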
Key Findings
- Single vQPU: Performance matches monolithic simulation exactly
- Multiple vQPUs: Communication overhead introduces runtime cost, but enables simulation of circuits far beyond single-device limits
- Large circuits (1000+ qubits): Achievable with constant runtime when the right simulator is selected per block
- Limitation: High-entanglement circuits (e.g., W states) become bottlenecked by entanglement sharing — efficient protocols are needed for general-purpose DQC
External Validation
Maestro was independently benchmarked by the UK National Physical Laboratory (NPL) as part of the M4Q program. Their assessment:
"[The] Maestro framework [is] well-suited for HPC environments due to [its] ability to exploit parallelism through multithreading and multiprocessing. Features such as Maestro Auto for batched execution and distributed simulation strategies enable efficient scaling across clusters and reduce overhead compared to single-threaded runs."
HPC Integration
Maestro is integrated into two major European HPC centers:
- CESGA (Spain): Integrated via the CUNQA platform
- LRZ (Germany): Integrated as a QDMI (Quantum Device Management Interface) backend
For full benchmark methodology and figures, see the paper: Maestro: Intelligent Execution for Quantum Circuit Simulation (arXiv:2512.04216).