Evaluating the Reasoning Limits of LLMs on Quantum Computing
| # | Model | Provider | Size | Expert Written | LLM Extracted | Complete Dataset |
|---|-------|----------|------|----------------|---------------|------------------|
Models handle foundational concepts well but decline sharply on advanced topics. Security questions see the steepest drop, with performance falling to 76%.
Examples from the Quantum-Audit benchmark illustrating the depth and breadth of questions.
Download the Quantum-Audit benchmark dataset and evaluation code:

- Multiple choice questions developed by quantum computing researchers. (Download JSON)
- Questions extracted from research papers using LLMs and validated by domain experts. (Download JSON)
- The full benchmark combining expert-written and LLM-extracted questions across all topics. (Download JSON)
- Questions with intentionally incorrect assumptions to test error detection. (Download JSON)
- A curated subset of 500 expert-written multiple choice questions. (Download JSON)
- The QA500 subset translated into Spanish for cross-lingual evaluation. (Download JSON)
- The QA500 subset translated into French for cross-lingual evaluation. (Download JSON)
- Scripts and evaluation code used to run the Quantum-Audit benchmark on all models. (Download)
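Once a benchmark file is downloaded, scoring a model's multiple-choice answers reduces to comparing predicted option letters against the gold answers. The sketch below is a minimal, hypothetical example: the record schema (`question`, `choices`, `answer`) and field names are assumptions for illustration, not the actual Quantum-Audit JSON layout, which may differ.

```python
import json

# Hypothetical records mimicking an assumed Quantum-Audit schema.
# In practice you would load a downloaded file, e.g.:
#   records = json.load(open("qa500.json"))
SAMPLE = json.loads("""
[
  {"question": "Which gate creates an equal superposition from |0>?",
   "choices": ["X", "H", "Z", "CNOT"], "answer": "B"},
  {"question": "How many qubits does a CNOT gate act on?",
   "choices": ["1", "2", "3", "4"], "answer": "B"}
]
""")

def score(records, predictions):
    """Fraction of predicted option letters matching the gold answers."""
    correct = sum(
        1 for rec, pred in zip(records, predictions)
        if pred.strip().upper() == rec["answer"].upper()
    )
    return correct / len(records)

if __name__ == "__main__":
    # One of the two predictions is correct.
    print(score(SAMPLE, ["B", "A"]))  # -> 0.5
```

Any real harness would also need to map a model's free-form response to an option letter; the official evaluation scripts linked above handle that step.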