Quantum Computing for Finance: How High-Level Quantum Programming and GPU Acceleration Are Changing the Game
By: the Classiq and NVIDIA Teams
The financial industry runs onoptimization and uncertainty. Every day, portfolio managers balance competing assets, risk desks price derivatives under volatile conditions, and trading algorithms make decisions in milliseconds. Classical computers handle these problems well enough, until they don't. As portfolios grow, as derivative structures become more exotic, and as risk models demand higher fidelity, the computational cost explodes. Quantum computing promises a fundamentally different way to attack these hard problems. But the path from quantum theory to practical financial tooling has, until now, been full of friction.
This post shows how Classiq's high-level quantum modeling language and NVIDIA's CUDA-Q platform remove that friction. We'll walk through two of the most promising quantum algorithms for finance, QAOA for portfolio optimization, and IQAE for derivative pricing, and we’ll show how the two platforms work together to accelerate iterative workflows on GPU hardware.
The Finance Problem Space: Where Quantum Has an Edge
Before we write a single line of quantum code, it's worth being precise about where quantum computing actually helps infinance as not every problem benefits. The strongest cases today are:
Portfolio optimization — Finding the optimal allocation across N assets subject to constraints (budget, risk, sector exposure) is a combinatorial problem. With N assets, the search space grows as 2^N. The best classical solvers like Gurobi, CPLEX, branch-and-bound variants are formidably good and have decades of engineering behind them. Quantum approximate optimization (QAOA) has not yet demonstrated a clear advantage over these solvers on real hardware. What it has demonstrated is that the problem maps naturally onto quantum circuits, that those circuits run on today's devices, and that the algorithmic framework scales in the right direction as hardware improves. We are building toward advantage, not claiming it today.
Derivative pricing and risk estimation — Monte Carlo simulation is the industry workhorse for pricing options, computing Value-at-Risk (VaR), and estimating CVA. It converges at O(1/√N) and you need 100× more samples to get 10× more precision. Quantum Amplitude Estimation (QAE) achieves O(1/N) convergence, a quadratic speedup that translates into dramatic runtime reductions for high-precision pricing.
Credit risk modeling — Correlated default probabilities in loan portfolios are computationally expensive to model. Quantum approaches to distribution loading and amplitude estimation apply here as well.
The common thread: quadratic or exponential classical complexity that quantum algorithms address directly.
The Programming Challenge: From Math to Quantum Circuit
Here's where most quantum finance explorations stall. Implementing QAOA or QAE correctly requires deep knowledge of quantum circuit construction such as qubit encoding schemes, parameterized rotation gates, ancilla management, entanglement patterns, and transpilation to hardware-native gate sets. A financial engineer shouldn't need a PhD in quantum information theory to run a portfolio optimization.
Classiq's quantum modeling language flips the paradigm. Instead of specifying how to build a circuit gate by gate, you describe what the circuit should compute. Classiq's synthesis engine handles the translation to an efficient, hardware-ready implementation so quantum programs look like financial models, not physics experiments.
QAOA for Portfolio Optimization
The Problem
We want to select a portfolio of k assets from a universe of N candidates that maximizes expected return while keeping risk below a threshold. This is a constrained combinatorial problem: with N binary variables (invest or don't), it maps naturally onto the space algorithms are built to navigate.
Given an expected return vector μ,a covariance matrix Σ, and a risk aversion parameter q, the objective is:
maximize: μᵀx - q · xᵀΣx
subject to: Σᵢ xᵢ = k, xᵢ ∈ {0, 1}
A word on expectations: QAOA on current hardware does not beat Gurobi or other best-in-class classical solvers on this problem. The honest value of running it now is different — it validates the end-to-end quantum workflow, it stress-tests the circuit synthesis and execution stack, and it puts teams in a position to benefit when hardware improves.
Expressing It in Classiq
What Classiq provides here is meaningfu lregardless of that hardware caveat: the problem is expressed entirely at the financial level. Classiq uses Pyomo, a standard optimization modeling framework that operations researchers already know as its interface for combinatorial problems. The QAOA circuit structure, the cost Hamiltonian encoding, and the mixer layers are all synthesized automatically from the model.
Step 1 — Define the financial problem in Pyomo:
import numpy as np
import pyomo.core as pyo
from classiq import *
from classiq.applications.combinatorial_optimization import (
CombinatorialProblem,
pyo_model_to_pauli_operator,
)
import cudaq
# Problem parameters: 3 assets, integer allocation weights
returns = np.array([3, 4, -1])
covariances = np.array(
[
[ 0.9, 0.5, -0.7],
[ 0.5, 0.9, -0.2],
[-0.7, -0.2, 0.9],
]
)
total_budget = 6
# Pyomo model — pure finance, no quantum yet
portfolio_model = pyo.ConcreteModel("portfolio_optimization")
num_assets = len(returns)
portfolio_model.w = pyo.Var(range(num_assets), domain=pyo.Integers, bounds=(0, 7))
w_array = list(portfolio_model.w.values())
portfolio_model.budget_rule = pyo.Constraint(expr=(sum(w_array) <= total_budget))
portfolio_model.expected_return = returns @ w_array
portfolio_model.risk = w_array @ covariances @ w_array
portfolio_model.cost = pyo.Objective(
expr=portfolio_model.risk - portfolio_model.expected_return,
sense=pyo.minimize,
)
This is a standard intege rprogramming model: minimize risk minus expected return, subject to a total budget constraint. No quantum concepts yet.
Step 2 — Synthesize the QAOA circuit with Classiq:
NUM_LAYERS = 3
PENALTY = 10 # constraint penalty — must match penalty_energy in the CUDA-Q Hamiltonian
combi = CombinatorialProblem(pyo_model=portfolio_model, num_layers=NUM_LAYERS, penalty_factor=PENALTY)
qprog = combi.get_qprog()
show(qprog) # inspect the synthesized circuit in the Classiq IDE
One call to CombinatorialProblem — Classiq reads the Pyomo objective and constraints, constructs the cost Hamiltonian, and synthesizes the full depth-optimized QAOA circuit. show(qprog) opens it in the Classiq platform where you can inspect gate counts, qubit layout, and depth interactively.
The Variational Loop and the CUDA-Q Bridge
QAOA is a variational algorithm: the circuit has trainable parameters (γ, β angles, one pair per layer) that aclassical optimizer tunes iteratively to minimize the cost expectation value. This outer loop can require hundreds to thousands of circuit evaluations which is exactly where GPU acceleration makes the difference between a research prototype and something practical.
The Classiq-to-CUDA-Q bridge is two function calls:
# Build the Hamiltonian using the same penalty as CombinatorialProblem,
# so both paths optimise the identical cost landscape
classiq_hamiltonian = pyo_model_to_pauli_operator(portfolio_model, penalty_energy=PENALTY)
hamiltonian = pauli_operator_to_cudaq_spin_op(classiq_hamiltonian)
# Convert the synthesized Classiq circuit to a native CUDA-Q kernel
kernel = qprog_to_cudaq_kernel(qprog, is_main_kernel=True)The kernel is now a native CUDA-Q object runnable on any CUDA-Q backend by simply changing a single set_target() call. The circuit itself hasn't changed; it's the same gate sequence Classiq synthesized. The PENALTY constant is the critical shared parameter: it controls how heavily constraint violations are penalised in the cost Hamiltonian, and both CombinatorialProblem and pyo_model_to_pauli_operator must use the same value, otherwise the two methods are literally optimising different objective functions and their results will diverge.
Step 3 — Run the Classiq native optimizer:
combi.optimize() must run first as it initialises the execution session (_es) that combi.sample() depends on for decoding. We time it here so the comparison is straightforward.
import time
from classiq.execution import ExecutionPreferences, ClassiqBackendPreferences
execution_preferences = ExecutionPreferences(
backend_preferences=ClassiqBackendPreferences(backend_name="simulator"),
)
t0 = time.perf_counter()
optimized_params = combi.optimize(
execution_preferences,
maxiter=70,
quantile=0.7, # CVaR alpha: focus on the best 30% of samples
)
classiq_time = time.perf_counter() - t0
classiq_result = combi.sample(optimized_params)
best_classiq = classiq_result.solution[classiq_result.cost.idxmin()]
classiq_weights = np.array(best_classiq["w"])
classiq_cost = classiq_result.cost.min()
print(f"Classiq allocation: {classiq_weights.tolist()} cost: {classiq_cost:.4f} time: {classiq_time:.2f}s")
Step 4 — Run the CUDA-Q GPU optimizer:
The CUDA-Q objective now uses the same quantile=0.7 as combi.optimize() — meaning both methods minimise the expected cost over the best 30% of sampled bitstrings rather than the plain expectation value. This is what aligns the two optimisers on the same landscape.
def initial_qaoa_params(NUM_LAYERS) -> np.ndarray:
initial_gammas = math.pi * np.linspace(
1 / (2 * NUM_LAYERS), 1 - 1 / (2 * NUM_LAYERS), NUM_LAYERS
)
initial_betas = math.pi * np.linspace(
1 - 1 / (2 * NUM_LAYERS), 1 / (2 * NUM_LAYERS), NUM_LAYERS
)
initial_params = []
for i in range(NUM_LAYERS):
initial_params.append(initial_gammas[i])
initial_params.append(initial_betas[i])
return np.array(initial_params)
initial_params = initial_qaoa_params(NUM_LAYERS)
cudaq.set_target("nvidia")
QUANTILE = 0.7 # must match combi.optimize(quantile=...)
SHOTS = 1000 # samples per evaluation
parameter_count = 2 * NUM_LAYERS
def cvar_objective(parameters):
"""QAOA objective: mean cost of the best (1-QUANTILE) fraction of shots."""
counts = cudaq.sample(kernel, *parameters, shots_count=SHOTS)
# Evaluate cost for every observed bitstring
costs = []
for bitstring, count in counts.items():
result = combi.sample(list(parameters)) # decode via combi
cost = result.cost.min()
costs.extend([cost] * count)
costs.sort()
cutoff = max(1, int(len(costs) * (1 - QUANTILE)))
return float(np.mean(costs[:cutoff]))
optimizer = cudaq.optimizers.COBYLA()
optimizer.max_iterations = 70
optimizer.initial_parameters = initial_params
t0 = time.perf_counter()
optimal_cvar, optimal_parameters = optimizer.optimize(
dimensions=parameter_count, function=cvar_objective
)
cudaq_time = time.perf_counter() - t0
# Decode the final allocation using the optimal parameters
cudaq_result = combi.sample(optimal_parameters)
best_cudaq = cudaq_result.solution[cudaq_result.cost.idxmin()]
cudaq_weights = np.array(best_cudaq["w"])
cudaq_cost = cudaq_result.cost.min()
print(f"CUDA-Q allocation: {cudaq_weights.tolist()} cost: {cudaq_cost:.4f} time: {cudaq_time:.2f}s")
Comparing the Two Results
Both methods optimise the same Hamiltonian over the same synthesised circuit and decode through the same combi.sample() function, so their allocations and costs should agree. The timing column is the point of interest where combi.optimize() runs on Classiq's cloud simulator while the CUDA-Q path runs on a local NVIDIA GPU.
print(f"{'Method':<10} {'Allocation':>20} {'Cost':>10} {'Time (s)':>12} {'Speedup':>10}")
print("-" * 66)
print(f"{'Classiq':<10} {str(classiq_weights.tolist()):>20} {classiq_cost:>10.4f} {classiq_time:>12.2f} {classiq_time/cudaq_time:>9.1f}x")
print(f"{'CUDA-Q':<10} {str(cudaq_weights.tolist()):>20} {cudaq_cost:>10.4f} {cudaq_time:>12.2f} {'1.0x':>10}")
Comparing these methods we get these results:
Method Allocation Cost Time (s) Speedup
------------------------------------------------------------------
Classiq [2, 2, 1] -4.5000 160.13 2.5x
CUDA-Q [2, 2, 1] -4.5000 64.35 1.0x
The GPU speedup here is not about beating classical portfolio solvers yet, it's about making the quantum research loop fast enough to be useful by exploring layer depths, tuning penalty factors and benchmarking ansatz structures. GPU simulation is the development environment that makes quantum finance work today while hardware matures.
IQAE for Derivative Pricing
The Problem
Pricing a European option requires estimating an expectation value over a probability distribution of future asset prices:
Price = e^(-rT) · E[max(S_T - K, 0)]
where S_T is the asset price at maturity, K is the strike, and the expectation is taken over a log-normal distribution. Classical Monte Carlo estimates this by averaging over N random samples, converging at O(1/√N). Quantum Amplitude Estimation achieves O(1/N) convergence which is a quadratic speedup in oracle calls.
Iterative QAE (IQAE) is the near-term variant of choice: it avoids the deep controlled-phase circuits of textbook QAE, instead adaptively querying a Grover-like oracle with increasing repetition counts to home in on the true amplitude. Each iteration uses a small number of shots and narrows a confidence interval via Clopper-Pearson statistics which is practical on today's hardware.
Unlike QAOA, the quantum advantage case for IQAE is more concrete. The quadratic speedup in oracle queries is proven, and the question is whether it survives circuit overhead on real hardware. GPU-accelerated simulation lets us run the full IQAE loop at scale today and measure where the crossover point lies.
Step 1 — Define the Distribution and Payoff
The financial parameters map directly onto the Classiq model. We use 3 qubits, giving 2³ = 8 discretisation points over the log-normal distribution. The scaling_factor normalises the payoff into[0, 1] so it can be loaded as a quantum amplitude.
import numpy as np
import scipy
import cudaq
from classiq import *
from classiq.applications.iqae.iqae import IQAE
from classiq.qmod.symbolic import ceiling
# --- Financial parameters ---
num_qubits = 3
mu = 0.7 # log-normal mean
sigma = 0.13 # log-normal std dev
K = 1.9 # strike price
# --- Discretise the log-normal distribution over 2^num_qubits points ---
def get_log_normal_probabilities(mu, sigma, num_points):
mean = np.exp(mu + sigma**2 / 2)
variance = (np.exp(sigma**2) - 1) * np.exp(2 * mu + sigma**2)
stddev = np.sqrt(variance)
low = np.maximum(0, mean - 3 * stddev)
high = mean + 3 * stddev
x = np.linspace(low, high, num_points)
return x, scipy.stats.lognorm.pdf(x, s=sigma, scale=np.exp(mu))
grid_points, probs = get_log_normal_probabilities(mu, sigma, 2**num_qubits)
probs = probs / np.sum(probs) # normalise to a valid probability distribution
scaling_factor = max(grid_points) - K
grid_step = (max(grid_points) - min(grid_points)) / (len(grid_points) - 1)
def scale(val):
return val * grid_step + min(grid_points)
def descale(val):
return (val - min(grid_points)) / grid_step
Step 2 — Build the State Preparation Circuit in Classiq
The quantum algorithm needs two components: a circuit that loads the price distribution and a payoff function that marks in-the-money outcomes. Classiq synthesises both from high-level descriptions.
@qfunc
def load_distribution(asset: QNum):
inplace_prepare_state(probabilities=probs.tolist(), bound=0, target=asset)
@qfunc
def payoff_linear(asset: Const[QNum], ind: QBit):
# Load payoff amplitude: sqrt((S - K) / scaling_factor) for S > K
assign_amplitude_table(
lookup_table(lambda n: np.sqrt(abs((scale(n) - K) / scaling_factor)), asset),
asset,
ind,
)
@qfunc
def payoff(asset: Const[QNum], ind: QBit):
# Only apply the payoff when asset price exceeds the strike
control(asset >= ceiling(descale(K)), lambda: payoff_linear(asset, ind))
@qfunc
def european_call_state_preparation(asset: QNum, ind: QBit):
load_distribution(asset)
payoff(asset, ind)
Step 3 — Synthesise Both Circuits in One Block
Classiq's synthesize() operates on a function named main. We define and synthesise main twice, once for state preparation, once for the oracle, and export each immediately to a CUDA-Q kernel. Both syntheses and exports must happen in the same cell, back to back: CUDA-Q names internal functions after Pytho nobject memory addresses, and if an address is recycled between two synthesis calls, the two MLIR modules will contain different functions with the same name but different signatures, causing a silent merge failure. Running them inimmediate sequence prevents this.
The stateprep main allocates the asset register and indicator qubit, applies the financial model, then drops the asset register — leaving only the indicator as an output. The oracle main applies the full Grover operator over the combined register.
# --- State preparation: synthesise and export immediately ---
@qfunc
def main(indicator: Output[QBit]):
asset = QNum()
allocate(num_qubits, asset)
allocate(1, indicator)
european_call_state_preparation(asset, indicator)
drop(asset)
qprog_stateprep = synthesize(main, preferences=Preferences(), constraints=Constraints(max_width=15))
print(f"State prep width: {qprog_stateprep.data.width} qubits")
stateprep_kernel = qprog_to_cudaq_kernel(qprog_stateprep, is_main_kernel=False)
# --- Oracle: redefine main, synthesise and export immediately ---
@qfunc
def space_transform(est_reg: QArray) -> None:
european_call_state_preparation(est_reg[0 : est_reg.len - 1], est_reg[est_reg.len - 1])
@qperm
def oracle(est_reg: Const[QArray]) -> None:
Z(est_reg[est_reg.len - 1])
@qfunc
def main(indicator: Output[QBit]) -> None:
est_reg: QArray = QArray("est_reg")
problem_vars: QArray = QArray("problem_vars", length=num_qubits)
allocate(problem_vars)
allocate(indicator)
within_apply(
lambda: bind([problem_vars, indicator], est_reg),
lambda: grover_operator(oracle, space_transform, est_reg),
)
drop(problem_vars)
qprog_oracle = synthesize(main, preferences=Preferences(), constraints=Constraints(max_width=11))
print(f"Oracle width: {qprog_oracle.data.width} qubits")
oracle_kernel = qprog_to_cudaq_kernel(qprog_oracle, is_main_kernel=False)Step 4 — Validate and Find the Indicator Qubit
Before building the parametric kernel we run a quick sanity check: both circuits must place the indicator qubit at the same index in the register, otherwise the parametric kernel measures the wrong qubit. Classiq's qubit_mapping exposes this directly, so we can assert consistency and extract measure_index without guessing.
def validate_qprogs(qprog_stateprep, qprog_oracle):
sp_idx = qprog_stateprep.data.qubit_mapping.logical_outputs['indicator'][0]
orc_idx = qprog_oracle.data.qubit_mapping.logical_outputs['indicator'][0]
assert sp_idx == orc_idx, (
f"Indicator qubit index mismatch: stateprep={sp_idx}, oracle={orc_idx}. "
"Both circuits must place the indicator at the same physical qubit."
)
total_qubits = max(qprog_stateprep.data.width, qprog_oracle.data.width)
return total_qubits, sp_idx
total_qubits, measure_index = validate_qprogs(qprog_stateprep, qprog_oracle)
print(f"Total qubits: {total_qubits}, measure index: {measure_index}")Step 5 — Build the Parametric CUDA-Q Kernel
This is where the CUDA-Q integration pays off beyond raw GPU speed. A standard QASM-based workflow would need to unroll the full oracle circuit k times per IQAE iteration, thereby producing a new, longer circuit for every value of k. CUDA-Q's for_loop emits a cc.loop in the Quake IR instead: the oracle is described once and the loop bound is a runtime integer. The kernel is O(1) in size regardless of k.
One critical detail: qubit refs for both apply_call invocations must be resolved in the outer scope before for_loop is entered. Inside the loop body, CUDA-Q traces the function once to generate the loop IR — extract_ref ops created during that trace land inside the loop construct where func.call lowering cannot resolve them. Extracting all refs outside the loop avoids this.
def build_parametric_iqae_kernel(
stateprep_kernel: cudaq.Kernel,
oracle_kernel: cudaq.Kernel,
total_qubits: int,
measure_index: int,
) -> cudaq.Kernel:
iqae_kernel, k_var = cudaq.make_kernel(int)
q = iqae_kernel.qalloc(total_qubits)
# Resolve all qubit refs in the outer scope — not inside the loop body
sp_qubits = [q[i] for i in range(qprog_stateprep.data.width)]
oracle_qubits = [q[j] for j in range(qprog_oracle.data.width)]
# Apply state preparation once (the A operator)
iqae_kernel.apply_call(stateprep_kernel, *sp_qubits)
# Apply the Grover oracle k times — k is a runtime integer, not an unrolled copy
def oracle_loop_body(i: cudaq.QuakeValue):
iqae_kernel.apply_call(oracle_kernel, *oracle_qubits)
iqae_kernel.for_loop(start=0, stop=k_var, function=oracle_loop_body)
iqae_kernel.mz(q[measure_index])
return iqae_kernel
iqae_kernel = build_parametric_iqae_kernel(
stateprep_kernel=stateprep_kernel,
oracle_kernel=oracle_kernel,
total_qubits=total_qubits,
measure_index=measure_index,
)Step 6 — Run IQAE on GPU and Compare with Classiq
The IQAEOptionPricer drives the iterative loop: at each iteration it samples the parametric kernel with the current k, computes a Clopper-Pearson confidence interval on the measured probability, inverts back to theta-space, and contracts the interval. The Classiq IQAE class runs the same algorithm internally for the baseline.
from scipy.stats import beta as scipy_beta
class IQAEOptionPricer:
def __init__(self, epsilon: float, alpha: float, scaling_factor: float):
self.epsilon = epsilon
self.alpha = alpha
self.scaling_factor = scaling_factor
self.shots = 1000
def _clopper_pearson(self, count, n, alpha):
lower = 0.0 if count == 0 else scipy_beta.ppf(alpha / 2, count, n - count + 1)
upper = 1.0 if count == n else scipy_beta.ppf(1 - alpha / 2, count + 1, n - count)
return lower, upper
def _find_next_k(self, k_curr, theta_l, theta_u):
K_max = int(np.floor(np.pi / (theta_u - theta_l)))
K = K_max - (K_max - 2) % 4
while K >= 4 * (4 * k_curr + 2):
K = int(K / 4) - (int(K / 4) - 2) % 4
return max((K - 2) // 4, k_curr + 1)
def solve(self, kernel: cudaq.Kernel):
theta_l, theta_u = 0.0, np.pi / 2
k = 0
total_shots = 0
print(f"{'Iter':<5} | {'k':<5} | {'Interval':<26} | {'Error':>8}")
print("-" * 55)
iteration = 0
while (theta_u - theta_l) > (2 * self.epsilon / self.scaling_factor):
iteration += 1
result = cudaq.sample(kernel, k, shots_count=self.shots)
count_1 = result.count('1')
total_shots += self.shots
a_min, a_max = self._clopper_pearson(count_1, self.shots, self.alpha)
theta_meas = np.arcsin(np.sqrt(count_1 / self.shots))
theta_center = (theta_l + theta_u) / 2
scaling = 2 * k + 1
n_ = round((scaling * theta_center - theta_meas) / np.pi)
theta_est = (n_ * np.pi + theta_meas) / scaling
d_a = (a_max - a_min) / 2
deriv = abs(scaling * np.sin(2 * scaling * theta_est))
if deriv < 1e-9:
deriv = 1e-9
delta_theta = d_a / deriv
theta_l = max(theta_l, theta_est - delta_theta)
theta_u = min(theta_u, theta_est + delta_theta)
est_error = (theta_u - theta_l) * self.scaling_factor
print(f"{iteration:<5} | {k:<5} | [{theta_l:.4f}, {theta_u:.4f}] | {est_error:>8.4f}")
if (theta_u - theta_l) > (2 * self.epsilon / self.scaling_factor):
k = self._find_next_k(k, theta_l, theta_u)
theta_final = (theta_l + theta_u) / 2
estimated_price = np.sin(theta_final) ** 2 * self.scaling_factor
return estimated_price, total_shots
# --- Classiq baseline ---
iqae = IQAE(
state_prep_op=european_call_state_preparation,
problem_vars_size=num_qubits,
constraints=Constraints(max_width=20),
preferences=Preferences(optimization_level=1),
)
result_iqae = iqae.run(epsilon=0.05, alpha=0.01)
classiq_payoff = result_iqae.estimation * scaling_factor
# --- CUDA-Q GPU run ---
cudaq.set_target("nvidia") # swap for "qpp-cpu" if no GPU available
pricer = IQAEOptionPricer(epsilon=0.05, alpha=0.01, scaling_factor=scaling_factor)
cudaq_price, cudaq_shots = pricer.solve(iqae_kernel)
# --- Classical reference ---
expected_payoff = sum((grid_points - K) * (grid_points >= K) * probs)
print(f"\n{'Method':<22} {'Payoff':>10}")
print("-" * 34)
print(f"{'CUDA-Q IQAE (GPU)':<22} {cudaq_price:>10.4f} (shots: {cudaq_shots})")
print(f"{'Classiq IQAE':<22} {classiq_payoff:>10.4f}")
print(f"{'Classical reference':<22} {expected_payoff:>10.4f}")All three numbers should land within epsilon = 0.05 of each other. The shot count from the CUDA-Q path makes the quadratic advantage tangible: IQAE reaches 5% precision using far fewer oracle calls than Monte Carlo would need samples for the same confidence.
QAE vs IQAE: Switching Algorithms Without Rewriting Your Model
IQAE is well-suited for near-term hardware because it avoids one of the most expensive primitives in quantum computing: the controlled-unitary. Textbook Quantum Amplitude Estimation (QAE) uses Quantum Phase Estimation as its core subroutine, which requires a register of ancilla qubits and a series of controlled applications of the Grover operator where each controlled application roughly doubles circuit depth. On current hardware, where two-qubit gate fidelity is still the limiting factor, this depth cost is prohibitive.
IQAE sidesteps this entirely. Instead of encoding the amplitude into a phase via QPE, it adaptively queries the Grover oracle at increasing depths and uses classical post-processing (Clopper-Pearsonintervals, phase inversion) to home in on the answer. The circuit for each round is shallow; the intelligence is in the classical loop.
The tradeoff: IQAE requires more total circuit executions than QAE for the same precision, and its confidence interval update involves heuristics that a full QPE-based approach handles exactly. As hardware improves and controlled-unitary costs fall, QAE's single-shot precision becomes more attractive.
What makes Classiq valuable here is that the financial model doesn't change at all. The functions load_distribution, payoff_linear, payoff, and european_call_state_preparation are pure descriptions of the financial problem and they know nothing about IQAE or QAE. They encode the distribution, the strike condition, and the amplitude loading. That logic is identical regardless of which amplitude estimation algorithm wraps around it.
Moving to full QAE would mean replacing the IQAE wrapper with a QPE-based outer circuit: a register of ancilla qubits, controlled applications of the Grover operator at powers 1, 2, 4, ..., 2^(n-1), and an inverse QFT to read out the phase. The number of ancilla qubits directly controls precision as each additional qubit halves the estimation error, at the cost of roughly doubling circuit depth. In Classiq's modeling language, this outer structure is expressed in terms of grover_operator and qpe —the same building blocks already used internally by IQAE — wrapped in a new main function. The european_call_state_preparation function is passed in unchanged as the state preparation operand.
The contrast between the two circuits is worth seeing directly. The IQAE oracle synthesised in Steps 3 and 4 is a shallow single Grover iteration where the depth scaling happens in the classical loop, not in the circuit. A full QAE circuit compiled to the same target precision would be significantly deeper, with every Grover power explicitly present as a controlled sub-circuit connected to the QPE register. On a fault-tolerant machine that penalty disappears; on near-term hardware it's the dominant cost. Classiq lets you synthesise and inspect both on the same financial model, measure the depth difference in qprog.data.depth, and make the hardware-readiness call with real numbers rather than intuition.
The Classiq + NVIDIA Architecture: Why It Matters
It's worth zooming out to appreciate what this combination enables as an end-to-end stack.
Finance Problem
│
▼
Classiq QMOD ──► High-level quantum model
│ (assets, payoffs, distributions)
▼
Classiq Synthesis ──► Optimal gate-level circuit
│ (depth, width, connectivity constraints)
▼
get_cuda_quantum_kernel() ──► CUDA-Q kernel
│
▼
NVIDIA GPU (custatevec / cuTensorNet) ──► Accelerated simulation
│
▼
Classical Optimizer/ IQAE Estimator ──► Result
Classiq handles the quantum modeling layer — ensuring the circuit is correct, efficient, and hardware-aware without requiring the developer to understand quantum circuit synthesis. Change the number of assets, the distribution model, or the hardware backend, and Classiq re-synthesizes the optimal circuit automatically.
NVIDIA CUDA-Q handles the execution layer — providing GPU-accelerated simulation that makes iterative variational and estimation algorithms practical today, and aunified programming model that will extend to quantum hardware as it matures.
The two layers are cleanly separated. Afinancial engineer can iterate on the problem formulation in Classiq without touching the CUDA-Q layer. A systems engineer can swap GPU backends or tune parallelism in CUDA-Q without re-deriving quantum circuits. This is the kind of separation of concerns that makes complex systems maintainable in production.
What This Means for Finance Teams Today
Quantum hardware is not yet at the scale needed to outperform classical computers on real financial data. But GPU-accelerated quantum simulation is production-ready today. Teams that build quantum workflows now, using simulators that behave identically to future hardware, will be positioned to switch seamlessly when quantum advantage arrives.
The immediate practical wins are:
● Faster research iteration — Classiq's synthesis means a financial engineer can prototype a newpayoff structure or constraint formulation in hours, not weeks. No quantumcircuit expertise required.
● GPU-scale Monte Carloreplacement — IQAE on GPU simulators already beats classical Monte Carlo for certain high-precision pricing problems at modest qubit counts (12–20 qubits).
● Hardware-ready algorithms — Classiq-synthesized circuits are already optimized for gate countand depth, meaning they'll run on real quantum hardware with minimal modification when the time comes.
● Unified stack — The Classiq →CUDA-Q bridge means your quantum programs and your GPU programs live in the same development environment, with the same deployment infrastructure.
Getting Started
The full code for both examples is available in the Classiq GitHub repository under applications/finance. To run them:
pip install classiq[cudaq]
import classiq
classiq.authenticate()
You can signup for Classiq at platform.classiq.io. CUDA-Q is open-source and runs on any NVIDIA GPU.
The quantum finance revolution won't be announced with a single breakthrough paper. It will arrive incrementally through better hardware, better algorithms, better tooling, and the teams best positioned to capture it are those building the foundational workflows today. Classiq and NVIDIA are building those foundations together.
Classiq is the quantum software company that enables engineers and researchers to build, optimize, and execute quantum programs at any scale. NVIDIA CUDA-Q is an open-source platform forquantum-classical hybrid computing that runs on CPUs, GPUs, and quantum hardware. Learn more at classiq.io and developer.nvidia.com/cuda-quantum.
By: the Classiq and NVIDIA Teams
The financial industry runs onoptimization and uncertainty. Every day, portfolio managers balance competing assets, risk desks price derivatives under volatile conditions, and trading algorithms make decisions in milliseconds. Classical computers handle these problems well enough, until they don't. As portfolios grow, as derivative structures become more exotic, and as risk models demand higher fidelity, the computational cost explodes. Quantum computing promises a fundamentally different way to attack these hard problems. But the path from quantum theory to practical financial tooling has, until now, been full of friction.
This post shows how Classiq's high-level quantum modeling language and NVIDIA's CUDA-Q platform remove that friction. We'll walk through two of the most promising quantum algorithms for finance, QAOA for portfolio optimization, and IQAE for derivative pricing, and we’ll show how the two platforms work together to accelerate iterative workflows on GPU hardware.
The Finance Problem Space: Where Quantum Has an Edge
Before we write a single line of quantum code, it's worth being precise about where quantum computing actually helps infinance as not every problem benefits. The strongest cases today are:
Portfolio optimization — Finding the optimal allocation across N assets subject to constraints (budget, risk, sector exposure) is a combinatorial problem. With N assets, the search space grows as 2^N. The best classical solvers like Gurobi, CPLEX, branch-and-bound variants are formidably good and have decades of engineering behind them. Quantum approximate optimization (QAOA) has not yet demonstrated a clear advantage over these solvers on real hardware. What it has demonstrated is that the problem maps naturally onto quantum circuits, that those circuits run on today's devices, and that the algorithmic framework scales in the right direction as hardware improves. We are building toward advantage, not claiming it today.
Derivative pricing and risk estimation — Monte Carlo simulation is the industry workhorse for pricing options, computing Value-at-Risk (VaR), and estimating CVA. It converges at O(1/√N) and you need 100× more samples to get 10× more precision. Quantum Amplitude Estimation (QAE) achieves O(1/N) convergence, a quadratic speedup that translates into dramatic runtime reductions for high-precision pricing.
Credit risk modeling — Correlated default probabilities in loan portfolios are computationally expensive to model. Quantum approaches to distribution loading and amplitude estimation apply here as well.
The common thread: quadratic or exponential classical complexity that quantum algorithms address directly.
The Programming Challenge: From Math to Quantum Circuit
Here's where most quantum finance explorations stall. Implementing QAOA or QAE correctly requires deep knowledge of quantum circuit construction such as qubit encoding schemes, parameterized rotation gates, ancilla management, entanglement patterns, and transpilation to hardware-native gate sets. A financial engineer shouldn't need a PhD in quantum information theory to run a portfolio optimization.
Classiq's quantum modeling language flips the paradigm. Instead of specifying how to build a circuit gate by gate, you describe what the circuit should compute. Classiq's synthesis engine handles the translation to an efficient, hardware-ready implementation so quantum programs look like financial models, not physics experiments.
QAOA for Portfolio Optimization
The Problem
We want to select a portfolio of k assets from a universe of N candidates that maximizes expected return while keeping risk below a threshold. This is a constrained combinatorial problem: with N binary variables (invest or don't), it maps naturally onto the space algorithms are built to navigate.
Given an expected return vector μ,a covariance matrix Σ, and a risk aversion parameter q, the objective is:
maximize: μᵀx - q · xᵀΣx
subject to: Σᵢ xᵢ = k, xᵢ ∈ {0, 1}
A word on expectations: QAOA on current hardware does not beat Gurobi or other best-in-class classical solvers on this problem. The honest value of running it now is different — it validates the end-to-end quantum workflow, it stress-tests the circuit synthesis and execution stack, and it puts teams in a position to benefit when hardware improves.
Expressing It in Classiq
What Classiq provides here is meaningfu lregardless of that hardware caveat: the problem is expressed entirely at the financial level. Classiq uses Pyomo, a standard optimization modeling framework that operations researchers already know as its interface for combinatorial problems. The QAOA circuit structure, the cost Hamiltonian encoding, and the mixer layers are all synthesized automatically from the model.
Step 1 — Define the financial problem in Pyomo:
import numpy as np
import pyomo.core as pyo
from classiq import *
from classiq.applications.combinatorial_optimization import (
CombinatorialProblem,
pyo_model_to_pauli_operator,
)
import cudaq
# Problem parameters: 3 assets, integer allocation weights
returns = np.array([3, 4, -1])
covariances = np.array(
[
[ 0.9, 0.5, -0.7],
[ 0.5, 0.9, -0.2],
[-0.7, -0.2, 0.9],
]
)
total_budget = 6
# Pyomo model — pure finance, no quantum yet
portfolio_model = pyo.ConcreteModel("portfolio_optimization")
num_assets = len(returns)
portfolio_model.w = pyo.Var(range(num_assets), domain=pyo.Integers, bounds=(0, 7))
w_array = list(portfolio_model.w.values())
portfolio_model.budget_rule = pyo.Constraint(expr=(sum(w_array) <= total_budget))
portfolio_model.expected_return = returns @ w_array
portfolio_model.risk = w_array @ covariances @ w_array
portfolio_model.cost = pyo.Objective(
expr=portfolio_model.risk - portfolio_model.expected_return,
sense=pyo.minimize,
)
This is a standard intege rprogramming model: minimize risk minus expected return, subject to a total budget constraint. No quantum concepts yet.
Step 2 — Synthesize the QAOA circuit with Classiq:
NUM_LAYERS = 3
PENALTY = 10 # constraint penalty — must match penalty_energy in the CUDA-Q Hamiltonian
combi = CombinatorialProblem(pyo_model=portfolio_model, num_layers=NUM_LAYERS, penalty_factor=PENALTY)
qprog = combi.get_qprog()
show(qprog) # inspect the synthesized circuit in the Classiq IDE
One call to CombinatorialProblem — Classiq reads the Pyomo objective and constraints, constructs the cost Hamiltonian, and synthesizes the full depth-optimized QAOA circuit. show(qprog) opens it in the Classiq platform where you can inspect gate counts, qubit layout, and depth interactively.
The Variational Loop and the CUDA-Q Bridge
QAOA is a variational algorithm: the circuit has trainable parameters (γ, β angles, one pair per layer) that aclassical optimizer tunes iteratively to minimize the cost expectation value. This outer loop can require hundreds to thousands of circuit evaluations which is exactly where GPU acceleration makes the difference between a research prototype and something practical.
The Classiq-to-CUDA-Q bridge is two function calls:
# Build the Hamiltonian using the same penalty as CombinatorialProblem,
# so both paths optimise the identical cost landscape
classiq_hamiltonian = pyo_model_to_pauli_operator(portfolio_model, penalty_energy=PENALTY)
hamiltonian = pauli_operator_to_cudaq_spin_op(classiq_hamiltonian)
# Convert the synthesized Classiq circuit to a native CUDA-Q kernel
kernel = qprog_to_cudaq_kernel(qprog, is_main_kernel=True)The kernel is now a native CUDA-Q object runnable on any CUDA-Q backend by simply changing a single set_target() call. The circuit itself hasn't changed; it's the same gate sequence Classiq synthesized. The PENALTY constant is the critical shared parameter: it controls how heavily constraint violations are penalised in the cost Hamiltonian, and both CombinatorialProblem and pyo_model_to_pauli_operator must use the same value, otherwise the two methods are literally optimising different objective functions and their results will diverge.
Step 3 — Run the Classiq native optimizer:
combi.optimize() must run first as it initialises the execution session (_es) that combi.sample() depends on for decoding. We time it here so the comparison is straightforward.
import time
from classiq.execution import ExecutionPreferences, ClassiqBackendPreferences
execution_preferences = ExecutionPreferences(
backend_preferences=ClassiqBackendPreferences(backend_name="simulator"),
)
t0 = time.perf_counter()
optimized_params = combi.optimize(
execution_preferences,
maxiter=70,
quantile=0.7, # CVaR alpha: focus on the best 30% of samples
)
classiq_time = time.perf_counter() - t0
classiq_result = combi.sample(optimized_params)
best_classiq = classiq_result.solution[classiq_result.cost.idxmin()]
classiq_weights = np.array(best_classiq["w"])
classiq_cost = classiq_result.cost.min()
print(f"Classiq allocation: {classiq_weights.tolist()} cost: {classiq_cost:.4f} time: {classiq_time:.2f}s")
Step 4 — Run the CUDA-Q GPU optimizer:
The CUDA-Q objective now uses the same quantile=0.7 as combi.optimize() — meaning both methods minimise the expected cost over the best 30% of sampled bitstrings rather than the plain expectation value. This is what aligns the two optimisers on the same landscape.
def initial_qaoa_params(NUM_LAYERS) -> np.ndarray:
initial_gammas = math.pi * np.linspace(
1 / (2 * NUM_LAYERS), 1 - 1 / (2 * NUM_LAYERS), NUM_LAYERS
)
initial_betas = math.pi * np.linspace(
1 - 1 / (2 * NUM_LAYERS), 1 / (2 * NUM_LAYERS), NUM_LAYERS
)
initial_params = []
for i in range(NUM_LAYERS):
initial_params.append(initial_gammas[i])
initial_params.append(initial_betas[i])
return np.array(initial_params)
initial_params = initial_qaoa_params(NUM_LAYERS)
cudaq.set_target("nvidia")
QUANTILE = 0.7 # must match combi.optimize(quantile=...)
SHOTS = 1000 # samples per evaluation
parameter_count = 2 * NUM_LAYERS
def cvar_objective(parameters):
"""QAOA objective: mean cost of the best (1-QUANTILE) fraction of shots."""
counts = cudaq.sample(kernel, *parameters, shots_count=SHOTS)
# Evaluate cost for every observed bitstring
costs = []
for bitstring, count in counts.items():
result = combi.sample(list(parameters)) # decode via combi
cost = result.cost.min()
costs.extend([cost] * count)
costs.sort()
cutoff = max(1, int(len(costs) * (1 - QUANTILE)))
return float(np.mean(costs[:cutoff]))
optimizer = cudaq.optimizers.COBYLA()
optimizer.max_iterations = 70
optimizer.initial_parameters = initial_params
t0 = time.perf_counter()
optimal_cvar, optimal_parameters = optimizer.optimize(
dimensions=parameter_count, function=cvar_objective
)
cudaq_time = time.perf_counter() - t0
# Decode the final allocation using the optimal parameters
cudaq_result = combi.sample(optimal_parameters)
best_cudaq = cudaq_result.solution[cudaq_result.cost.idxmin()]
cudaq_weights = np.array(best_cudaq["w"])
cudaq_cost = cudaq_result.cost.min()
print(f"CUDA-Q allocation: {cudaq_weights.tolist()} cost: {cudaq_cost:.4f} time: {cudaq_time:.2f}s")
Comparing the Two Results
Both methods optimise the same Hamiltonian over the same synthesised circuit and decode through the same combi.sample() function, so their allocations and costs should agree. The timing column is the point of interest where combi.optimize() runs on Classiq's cloud simulator while the CUDA-Q path runs on a local NVIDIA GPU.
print(f"{'Method':<10} {'Allocation':>20} {'Cost':>10} {'Time (s)':>12} {'Speedup':>10}")
print("-" * 66)
print(f"{'Classiq':<10} {str(classiq_weights.tolist()):>20} {classiq_cost:>10.4f} {classiq_time:>12.2f} {classiq_time/cudaq_time:>9.1f}x")
print(f"{'CUDA-Q':<10} {str(cudaq_weights.tolist()):>20} {cudaq_cost:>10.4f} {cudaq_time:>12.2f} {'1.0x':>10}")
Comparing these methods we get these results:
Method Allocation Cost Time (s) Speedup
------------------------------------------------------------------
Classiq [2, 2, 1] -4.5000 160.13 2.5x
CUDA-Q [2, 2, 1] -4.5000 64.35 1.0x
The GPU speedup here is not about beating classical portfolio solvers yet, it's about making the quantum research loop fast enough to be useful by exploring layer depths, tuning penalty factors and benchmarking ansatz structures. GPU simulation is the development environment that makes quantum finance work today while hardware matures.
IQAE for Derivative Pricing
The Problem
Pricing a European option requires estimating an expectation value over a probability distribution of future asset prices:
Price = e^(-rT) · E[max(S_T - K, 0)]
where S_T is the asset price at maturity, K is the strike, and the expectation is taken over a log-normal distribution. Classical Monte Carlo estimates this by averaging over N random samples, converging at O(1/√N). Quantum Amplitude Estimation achieves O(1/N) convergence which is a quadratic speedup in oracle calls.
Iterative QAE (IQAE) is the near-term variant of choice: it avoids the deep controlled-phase circuits of textbook QAE, instead adaptively querying a Grover-like oracle with increasing repetition counts to home in on the true amplitude. Each iteration uses a small number of shots and narrows a confidence interval via Clopper-Pearson statistics which is practical on today's hardware.
Unlike QAOA, the quantum advantage case for IQAE is more concrete. The quadratic speedup in oracle queries is proven, and the question is whether it survives circuit overhead on real hardware. GPU-accelerated simulation lets us run the full IQAE loop at scale today and measure where the crossover point lies.
Step 1 — Define the Distribution and Payoff
The financial parameters map directly onto the Classiq model. We use 3 qubits, giving 2³ = 8 discretisation points over the log-normal distribution. The scaling_factor normalises the payoff into[0, 1] so it can be loaded as a quantum amplitude.
import numpy as np
import scipy
import cudaq
from classiq import *
from classiq.applications.iqae.iqae import IQAE
from classiq.qmod.symbolic import ceiling
# --- Financial parameters ---
num_qubits = 3
mu = 0.7 # log-normal mean
sigma = 0.13 # log-normal std dev
K = 1.9 # strike price
# --- Discretise the log-normal distribution over 2^num_qubits points ---
def get_log_normal_probabilities(mu, sigma, num_points):
mean = np.exp(mu + sigma**2 / 2)
variance = (np.exp(sigma**2) - 1) * np.exp(2 * mu + sigma**2)
stddev = np.sqrt(variance)
low = np.maximum(0, mean - 3 * stddev)
high = mean + 3 * stddev
x = np.linspace(low, high, num_points)
return x, scipy.stats.lognorm.pdf(x, s=sigma, scale=np.exp(mu))
grid_points, probs = get_log_normal_probabilities(mu, sigma, 2**num_qubits)
probs = probs / np.sum(probs) # normalise to a valid probability distribution
scaling_factor = max(grid_points) - K
grid_step = (max(grid_points) - min(grid_points)) / (len(grid_points) - 1)
def scale(val):
return val * grid_step + min(grid_points)
def descale(val):
return (val - min(grid_points)) / grid_step
Step 2 — Build the State Preparation Circuit in Classiq
The quantum algorithm needs two components: a circuit that loads the price distribution and a payoff function that marks in-the-money outcomes. Classiq synthesises both from high-level descriptions.
@qfunc
def load_distribution(asset: QNum):
inplace_prepare_state(probabilities=probs.tolist(), bound=0, target=asset)
@qfunc
def payoff_linear(asset: Const[QNum], ind: QBit):
# Load payoff amplitude: sqrt((S - K) / scaling_factor) for S > K
assign_amplitude_table(
lookup_table(lambda n: np.sqrt(abs((scale(n) - K) / scaling_factor)), asset),
asset,
ind,
)
@qfunc
def payoff(asset: Const[QNum], ind: QBit):
# Only apply the payoff when asset price exceeds the strike
control(asset >= ceiling(descale(K)), lambda: payoff_linear(asset, ind))
@qfunc
def european_call_state_preparation(asset: QNum, ind: QBit):
load_distribution(asset)
payoff(asset, ind)
Step 3 — Synthesise Both Circuits in One Block
Classiq's synthesize() operates on a function named main. We define and synthesise main twice, once for state preparation, once for the oracle, and export each immediately to a CUDA-Q kernel. Both syntheses and exports must happen in the same cell, back to back: CUDA-Q names internal functions after Pytho nobject memory addresses, and if an address is recycled between two synthesis calls, the two MLIR modules will contain different functions with the same name but different signatures, causing a silent merge failure. Running them inimmediate sequence prevents this.
The stateprep main allocates the asset register and indicator qubit, applies the financial model, then drops the asset register — leaving only the indicator as an output. The oracle main applies the full Grover operator over the combined register.
# --- State preparation: synthesise and export immediately ---
@qfunc
def main(indicator: Output[QBit]):
asset = QNum()
allocate(num_qubits, asset)
allocate(1, indicator)
european_call_state_preparation(asset, indicator)
drop(asset)
qprog_stateprep = synthesize(main, preferences=Preferences(), constraints=Constraints(max_width=15))
print(f"State prep width: {qprog_stateprep.data.width} qubits")
stateprep_kernel = qprog_to_cudaq_kernel(qprog_stateprep, is_main_kernel=False)
# --- Oracle: redefine main, synthesise and export immediately ---
@qfunc
def space_transform(est_reg: QArray) -> None:
european_call_state_preparation(est_reg[0 : est_reg.len - 1], est_reg[est_reg.len - 1])
@qperm
def oracle(est_reg: Const[QArray]) -> None:
Z(est_reg[est_reg.len - 1])
@qfunc
def main(indicator: Output[QBit]) -> None:
est_reg: QArray = QArray("est_reg")
problem_vars: QArray = QArray("problem_vars", length=num_qubits)
allocate(problem_vars)
allocate(indicator)
within_apply(
lambda: bind([problem_vars, indicator], est_reg),
lambda: grover_operator(oracle, space_transform, est_reg),
)
drop(problem_vars)
qprog_oracle = synthesize(main, preferences=Preferences(), constraints=Constraints(max_width=11))
print(f"Oracle width: {qprog_oracle.data.width} qubits")
oracle_kernel = qprog_to_cudaq_kernel(qprog_oracle, is_main_kernel=False)Step 4 — Validate and Find the Indicator Qubit
Before building the parametric kernel we run a quick sanity check: both circuits must place the indicator qubit at the same index in the register, otherwise the parametric kernel measures the wrong qubit. Classiq's qubit_mapping exposes this directly, so we can assert consistency and extract measure_index without guessing.
def validate_qprogs(qprog_stateprep, qprog_oracle):
sp_idx = qprog_stateprep.data.qubit_mapping.logical_outputs['indicator'][0]
orc_idx = qprog_oracle.data.qubit_mapping.logical_outputs['indicator'][0]
assert sp_idx == orc_idx, (
f"Indicator qubit index mismatch: stateprep={sp_idx}, oracle={orc_idx}. "
"Both circuits must place the indicator at the same physical qubit."
)
total_qubits = max(qprog_stateprep.data.width, qprog_oracle.data.width)
return total_qubits, sp_idx
total_qubits, measure_index = validate_qprogs(qprog_stateprep, qprog_oracle)
print(f"Total qubits: {total_qubits}, measure index: {measure_index}")Step 5 — Build the Parametric CUDA-Q Kernel
This is where the CUDA-Q integration pays off beyond raw GPU speed. A standard QASM-based workflow would need to unroll the full oracle circuit k times per IQAE iteration, thereby producing a new, longer circuit for every value of k. CUDA-Q's for_loop emits a cc.loop in the Quake IR instead: the oracle is described once and the loop bound is a runtime integer. The kernel is O(1) in size regardless of k.
One critical detail: qubit refs for both apply_call invocations must be resolved in the outer scope before for_loop is entered. Inside the loop body, CUDA-Q traces the function once to generate the loop IR — extract_ref ops created during that trace land inside the loop construct where func.call lowering cannot resolve them. Extracting all refs outside the loop avoids this.
def build_parametric_iqae_kernel(
stateprep_kernel: cudaq.Kernel,
oracle_kernel: cudaq.Kernel,
total_qubits: int,
measure_index: int,
) -> cudaq.Kernel:
iqae_kernel, k_var = cudaq.make_kernel(int)
q = iqae_kernel.qalloc(total_qubits)
# Resolve all qubit refs in the outer scope — not inside the loop body
sp_qubits = [q[i] for i in range(qprog_stateprep.data.width)]
oracle_qubits = [q[j] for j in range(qprog_oracle.data.width)]
# Apply state preparation once (the A operator)
iqae_kernel.apply_call(stateprep_kernel, *sp_qubits)
# Apply the Grover oracle k times — k is a runtime integer, not an unrolled copy
def oracle_loop_body(i: cudaq.QuakeValue):
iqae_kernel.apply_call(oracle_kernel, *oracle_qubits)
iqae_kernel.for_loop(start=0, stop=k_var, function=oracle_loop_body)
iqae_kernel.mz(q[measure_index])
return iqae_kernel
iqae_kernel = build_parametric_iqae_kernel(
stateprep_kernel=stateprep_kernel,
oracle_kernel=oracle_kernel,
total_qubits=total_qubits,
measure_index=measure_index,
)Step 6 — Run IQAE on GPU and Compare with Classiq
The IQAEOptionPricer drives the iterative loop: at each iteration it samples the parametric kernel with the current k, computes a Clopper-Pearson confidence interval on the measured probability, inverts back to theta-space, and contracts the interval. The Classiq IQAE class runs the same algorithm internally for the baseline.
from scipy.stats import beta as scipy_beta
class IQAEOptionPricer:
def __init__(self, epsilon: float, alpha: float, scaling_factor: float):
self.epsilon = epsilon
self.alpha = alpha
self.scaling_factor = scaling_factor
self.shots = 1000
def _clopper_pearson(self, count, n, alpha):
lower = 0.0 if count == 0 else scipy_beta.ppf(alpha / 2, count, n - count + 1)
upper = 1.0 if count == n else scipy_beta.ppf(1 - alpha / 2, count + 1, n - count)
return lower, upper
def _find_next_k(self, k_curr, theta_l, theta_u):
K_max = int(np.floor(np.pi / (theta_u - theta_l)))
K = K_max - (K_max - 2) % 4
while K >= 4 * (4 * k_curr + 2):
K = int(K / 4) - (int(K / 4) - 2) % 4
return max((K - 2) // 4, k_curr + 1)
def solve(self, kernel: cudaq.Kernel):
theta_l, theta_u = 0.0, np.pi / 2
k = 0
total_shots = 0
print(f"{'Iter':<5} | {'k':<5} | {'Interval':<26} | {'Error':>8}")
print("-" * 55)
iteration = 0
while (theta_u - theta_l) > (2 * self.epsilon / self.scaling_factor):
iteration += 1
result = cudaq.sample(kernel, k, shots_count=self.shots)
count_1 = result.count('1')
total_shots += self.shots
a_min, a_max = self._clopper_pearson(count_1, self.shots, self.alpha)
theta_meas = np.arcsin(np.sqrt(count_1 / self.shots))
theta_center = (theta_l + theta_u) / 2
scaling = 2 * k + 1
n_ = round((scaling * theta_center - theta_meas) / np.pi)
theta_est = (n_ * np.pi + theta_meas) / scaling
d_a = (a_max - a_min) / 2
deriv = abs(scaling * np.sin(2 * scaling * theta_est))
if deriv < 1e-9:
deriv = 1e-9
delta_theta = d_a / deriv
theta_l = max(theta_l, theta_est - delta_theta)
theta_u = min(theta_u, theta_est + delta_theta)
est_error = (theta_u - theta_l) * self.scaling_factor
print(f"{iteration:<5} | {k:<5} | [{theta_l:.4f}, {theta_u:.4f}] | {est_error:>8.4f}")
if (theta_u - theta_l) > (2 * self.epsilon / self.scaling_factor):
k = self._find_next_k(k, theta_l, theta_u)
theta_final = (theta_l + theta_u) / 2
estimated_price = np.sin(theta_final) ** 2 * self.scaling_factor
return estimated_price, total_shots
# --- Classiq baseline ---
iqae = IQAE(
state_prep_op=european_call_state_preparation,
problem_vars_size=num_qubits,
constraints=Constraints(max_width=20),
preferences=Preferences(optimization_level=1),
)
result_iqae = iqae.run(epsilon=0.05, alpha=0.01)
classiq_payoff = result_iqae.estimation * scaling_factor
# --- CUDA-Q GPU run ---
cudaq.set_target("nvidia") # swap for "qpp-cpu" if no GPU available
pricer = IQAEOptionPricer(epsilon=0.05, alpha=0.01, scaling_factor=scaling_factor)
cudaq_price, cudaq_shots = pricer.solve(iqae_kernel)
# --- Classical reference ---
expected_payoff = sum((grid_points - K) * (grid_points >= K) * probs)
print(f"\n{'Method':<22} {'Payoff':>10}")
print("-" * 34)
print(f"{'CUDA-Q IQAE (GPU)':<22} {cudaq_price:>10.4f} (shots: {cudaq_shots})")
print(f"{'Classiq IQAE':<22} {classiq_payoff:>10.4f}")
print(f"{'Classical reference':<22} {expected_payoff:>10.4f}")All three numbers should land within epsilon = 0.05 of each other. The shot count from the CUDA-Q path makes the quadratic advantage tangible: IQAE reaches 5% precision using far fewer oracle calls than Monte Carlo would need samples for the same confidence.
QAE vs IQAE: Switching Algorithms Without Rewriting Your Model
IQAE is well-suited for near-term hardware because it avoids one of the most expensive primitives in quantum computing: the controlled-unitary. Textbook Quantum Amplitude Estimation (QAE) uses Quantum Phase Estimation as its core subroutine, which requires a register of ancilla qubits and a series of controlled applications of the Grover operator where each controlled application roughly doubles circuit depth. On current hardware, where two-qubit gate fidelity is still the limiting factor, this depth cost is prohibitive.
IQAE sidesteps this entirely. Instead of encoding the amplitude into a phase via QPE, it adaptively queries the Grover oracle at increasing depths and uses classical post-processing (Clopper-Pearsonintervals, phase inversion) to home in on the answer. The circuit for each round is shallow; the intelligence is in the classical loop.
The tradeoff: IQAE requires more total circuit executions than QAE for the same precision, and its confidence interval update involves heuristics that a full QPE-based approach handles exactly. As hardware improves and controlled-unitary costs fall, QAE's single-shot precision becomes more attractive.
What makes Classiq valuable here is that the financial model doesn't change at all. The functions load_distribution, payoff_linear, payoff, and european_call_state_preparation are pure descriptions of the financial problem and they know nothing about IQAE or QAE. They encode the distribution, the strike condition, and the amplitude loading. That logic is identical regardless of which amplitude estimation algorithm wraps around it.
Moving to full QAE would mean replacing the IQAE wrapper with a QPE-based outer circuit: a register of ancilla qubits, controlled applications of the Grover operator at powers 1, 2, 4, ..., 2^(n-1), and an inverse QFT to read out the phase. The number of ancilla qubits directly controls precision as each additional qubit halves the estimation error, at the cost of roughly doubling circuit depth. In Classiq's modeling language, this outer structure is expressed in terms of grover_operator and qpe —the same building blocks already used internally by IQAE — wrapped in a new main function. The european_call_state_preparation function is passed in unchanged as the state preparation operand.
The contrast between the two circuits is worth seeing directly. The IQAE oracle synthesised in Steps 3 and 4 is a shallow single Grover iteration where the depth scaling happens in the classical loop, not in the circuit. A full QAE circuit compiled to the same target precision would be significantly deeper, with every Grover power explicitly present as a controlled sub-circuit connected to the QPE register. On a fault-tolerant machine that penalty disappears; on near-term hardware it's the dominant cost. Classiq lets you synthesise and inspect both on the same financial model, measure the depth difference in qprog.data.depth, and make the hardware-readiness call with real numbers rather than intuition.
The Classiq + NVIDIA Architecture: Why It Matters
It's worth zooming out to appreciate what this combination enables as an end-to-end stack.
Finance Problem
│
▼
Classiq QMOD ──► High-level quantum model
│ (assets, payoffs, distributions)
▼
Classiq Synthesis ──► Optimal gate-level circuit
│ (depth, width, connectivity constraints)
▼
get_cuda_quantum_kernel() ──► CUDA-Q kernel
│
▼
NVIDIA GPU (custatevec / cuTensorNet) ──► Accelerated simulation
│
▼
Classical Optimizer/ IQAE Estimator ──► Result
Classiq handles the quantum modeling layer — ensuring the circuit is correct, efficient, and hardware-aware without requiring the developer to understand quantum circuit synthesis. Change the number of assets, the distribution model, or the hardware backend, and Classiq re-synthesizes the optimal circuit automatically.
NVIDIA CUDA-Q handles the execution layer — providing GPU-accelerated simulation that makes iterative variational and estimation algorithms practical today, and aunified programming model that will extend to quantum hardware as it matures.
The two layers are cleanly separated. Afinancial engineer can iterate on the problem formulation in Classiq without touching the CUDA-Q layer. A systems engineer can swap GPU backends or tune parallelism in CUDA-Q without re-deriving quantum circuits. This is the kind of separation of concerns that makes complex systems maintainable in production.
What This Means for Finance Teams Today
Quantum hardware is not yet at the scale needed to outperform classical computers on real financial data. But GPU-accelerated quantum simulation is production-ready today. Teams that build quantum workflows now, using simulators that behave identically to future hardware, will be positioned to switch seamlessly when quantum advantage arrives.
The immediate practical wins are:
● Faster research iteration — Classiq's synthesis means a financial engineer can prototype a newpayoff structure or constraint formulation in hours, not weeks. No quantumcircuit expertise required.
● GPU-scale Monte Carloreplacement — IQAE on GPU simulators already beats classical Monte Carlo for certain high-precision pricing problems at modest qubit counts (12–20 qubits).
● Hardware-ready algorithms — Classiq-synthesized circuits are already optimized for gate countand depth, meaning they'll run on real quantum hardware with minimal modification when the time comes.
● Unified stack — The Classiq →CUDA-Q bridge means your quantum programs and your GPU programs live in the same development environment, with the same deployment infrastructure.
Getting Started
The full code for both examples is available in the Classiq GitHub repository under applications/finance. To run them:
pip install classiq[cudaq]
import classiq
classiq.authenticate()
You can signup for Classiq at platform.classiq.io. CUDA-Q is open-source and runs on any NVIDIA GPU.
The quantum finance revolution won't be announced with a single breakthrough paper. It will arrive incrementally through better hardware, better algorithms, better tooling, and the teams best positioned to capture it are those building the foundational workflows today. Classiq and NVIDIA are building those foundations together.
Classiq is the quantum software company that enables engineers and researchers to build, optimize, and execute quantum programs at any scale. NVIDIA CUDA-Q is an open-source platform forquantum-classical hybrid computing that runs on CPUs, GPUs, and quantum hardware. Learn more at classiq.io and developer.nvidia.com/cuda-quantum.
