Classical Implementations
The same mathematical framework producing quantum hardware breakthroughs also delivers measurable performance advantages on classical GPUs. These are working implementations, benchmarked against industry standards.
Long-Context Attention Kernel — Faster Than FlashAttention
Phoenix Attention is a drop-in attention replacement for frontier LLMs. It applies the same math used across our framework to the bottleneck of long-context inference, and is benchmarked against FlashAttention on real Qwen 7B weights. It ships as a 3.2 MB stripped .so, with no source and no Triton runtime needed at deploy time.
Phoenix Attention (AOT) vs FlashAttention — Qwen 7B, NVIDIA L40S 48 GB
Quality Preservation — End-to-End
Measured on NVIDIA L40S 48 GB, compared against FlashAttention (the memory-efficient industry standard). Vanilla PyTorch SDPA runs out of memory at ≥131K context on a single L40S; both FlashAttention and Phoenix run through 1M, with Phoenix 12×–155× faster. A reproducer is included in the demo container.
CUDA SVD — Faster Than torch.linalg.svd
Phoenix SVD is a CUDA singular-value decomposition kernel for LoRA initialization, quantization, and weight-decomposition pipelines. Benchmarked against torch.linalg.svd on real Qwen 7B q_proj weights. Drop-in replacement; no quality regression.
General Matrices (fp64) — NVIDIA L40S 48 GB
Symmetric Matrices (covariance / Gram / Hermitian, fp64)
Numerical Accuracy — Singular Values vs torch.linalg.svd
Measured on NVIDIA L40S 48 GB, fp64. torch GPU SVD becomes impractical at 8K+ (~100 s at 8192²); Phoenix warm-cache stays sub-second across the full size range. Drop-in replacement for torch.linalg.svd — same call shape, matching outputs, no model code change.
Scaling on H100 — Real Model Weights and Large fp64 Matrices
Measured on NVIDIA H100 80 GB, fp64. Top-5 singular values match torch to ~10⁻¹¹ on the real Llama weight, ~10⁻⁶ on the synthetic PSD. At 24K² torch takes over six minutes; Phoenix finishes in 22 seconds.
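As a rough illustration of what "top-5 singular values match" can mean in practice, the sketch below compares two spectra by their largest relative deviation. It assumes the agreement figure is a relative error, which the text does not state; the function name `max_relative_error` and the sample values are hypothetical and are not taken from the benchmark.

```python
def max_relative_error(reference, candidate, top_k=None):
    """Largest relative deviation between two lists of singular values.

    Both lists are sorted in descending order first, and only the
    `top_k` largest values are compared when `top_k` is given.
    """
    if len(reference) != len(candidate):
        raise ValueError("spectra must have the same length")
    ref = sorted(reference, reverse=True)[:top_k]
    cand = sorted(candidate, reverse=True)[:top_k]
    # Skip exact zeros in the reference to avoid dividing by zero.
    return max(
        (abs(r - c) / abs(r) for r, c in zip(ref, cand) if r != 0.0),
        default=0.0,
    )


# Made-up spectra that differ by about one part in 10^11.
reference = [12.5, 3.2, 0.75]
candidate = [12.5 * (1 + 1e-11), 3.2, 0.75 * (1 - 5e-12)]
err = max_relative_error(reference, candidate, top_k=3)
print(f"max relative error: {err:.2e}")
```

A check like this only covers singular values; comparing singular vectors would also need to account for sign and ordering ambiguity.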
Cross-Domain — Real Single-Cell Genomics (PBMC 3K)
Real 10x Genomics PBMC 3K dataset, standard scanpy preprocessing pipeline (Seurat-style), 2,700 cells × 2,000 highly-variable genes, fp64. Singular values match LAPACK to ~10⁻¹², which is near the fp64 round-off floor for a matrix of this size (not literally bit-identical). This demonstrates that the SVD primitive carries over beyond ML weights: single-cell PCA, signal processing, recommendation systems, and any other domain where SVD is on the critical path.
Phoenix on NVIDIA's Flagship Hybrid — Measured, Reproducible
Phoenix SVD and Phoenix Attention running together on NVIDIA's Nemotron-3-Super-120B — the hybrid Mamba-2 + Attention + MoE flagship — measured against the unmodified stock configuration. 8× H100 80 GB. Real weights, real data, every number reproducible from a single script.
Phoenix SVD on a Real Nemotron Weight (q_proj 4096²)
Phoenix SVD factorizes Nemotron's own attention projection weight essentially bit-exact, at 48.6× the speed of the PyTorch / cuSOLVER reference. Drop-in primitive anywhere SVD is on the critical path.
Phoenix Windowed KV Cache — Tunable Memory Curve (32K Context)
Two operating points on the same tunable curve. The max-compression point is for summarization, classification, dialogue, and retrieval-augmented workloads, where the model processes long input but does not need verbatim recall of distant tokens. The retrieval-safe point is for workloads that do. The mode is selectable per request. Measured with real WikiText-2 perplexity and a real long-context needle-in-a-haystack test.
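To make the idea of a tunable memory curve concrete, here is a back-of-the-envelope sketch of KV-cache size as a function of how many token positions are retained. It uses the standard footprint formula for a plain key/value cache and assumes a cache that keeps a fixed window of positions; this is not a description of the Phoenix method itself. The head count, head dimension, and window sizes are hypothetical, and only the 8 attention layers and the 32K context come from the text.

```python
def kv_cache_bytes(tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `tokens` positions (batch 1)."""
    # The leading factor 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * tokens * bytes_per_elem


# Hypothetical configuration, NOT the real Nemotron one.
cfg = dict(n_layers=8, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

full = kv_cache_bytes(32 * 1024, **cfg)  # full 32K-token cache
for window in (1024, 4096, 16384):       # hypothetical retained windows
    kept = kv_cache_bytes(window, **cfg)
    print(f"window={window:>6}  cache={kept / 2**20:8.1f} MiB  "
          f"reduction={full / kept:.0f}x")
```

Under these assumptions the footprint is linear in the retained window, so each operating point on the curve is a trade between memory and how much distant context stays addressable.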
Phoenix Attention Kernel — Scaling on H100
Standalone kernel performance scales cleanly with context length. End-to-end speedup on a hybrid model like Nemotron is Amdahl-bound: only 8 of 88 layers are attention, so the end-to-end attention win on Nemotron is memory, not raw speed. To be clear, the speed numbers above are for the kernel itself; they apply to attention-dominant workloads and pure-transformer integrations.
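The Amdahl bound mentioned above can be worked through in a few lines. The sketch below assumes, purely for illustration, that the share of runtime spent in attention equals the share of layers (8 of 88); real runtime shares depend on context length and layer cost and are not given in the text. The kernel speedups in the loop are hypothetical inputs.

```python
def amdahl_speedup(accelerated_fraction, component_speedup):
    """End-to-end speedup when only part of the runtime is accelerated."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if component_speedup <= 0:
        raise ValueError("component_speedup must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / component_speedup)


# Assumption: layer share used as a stand-in for runtime share.
attention_fraction = 8 / 88

for kernel_speedup in (2, 10, 100):  # hypothetical kernel speedups
    e2e = amdahl_speedup(attention_fraction, kernel_speedup)
    print(f"kernel {kernel_speedup:>3}x -> end-to-end {e2e:.3f}x")

# Upper limit as the kernel speedup grows without bound.
ceiling = 1.0 / (1.0 - attention_fraction)
print(f"ceiling: {ceiling:.2f}x")
```

Under that assumption the end-to-end ceiling is about 1.1×, no matter how fast the attention kernel is.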
Architecture Fit
Phoenix is engineered for NVIDIA's hybrid Mamba-2 + Attention direction. The compression and quality results above are what that fit produces, end-to-end, on the 120B configuration. The same approach generalizes to other hybrids natively, and to pure-transformer architectures via established Mamba-2 conversion methods (distillation, at a published cost of roughly 1–3% of original pretraining compute, per "The Mamba in the Llama" and related work) rather than from-scratch retraining.
Headline table reproducible by a single script. Independent fp64 verification of the kernel, multi-GPU correctness, perplexity, needle, and KV-cache footprint measurements all available under NDA. Architectural methodology and integration specifics are patent-pending and disclosed only under signed evaluation agreement.
Phoenix on a Pure Dense Transformer — 100% Token-Identical Output
Phoenix Attention applied as a drop-in replacement to Meta Llama 3.3 70B on 8× H100 80 GB. A per-layer MSE boundary scan identifies the 9 of 80 attention layers where Phoenix substitutes bit-exactly — layers 0–8. The remaining 71 layers run stock SDPA / FlashAttention. Output is token-by-token identical to the unmodified Llama under greedy decoding at every context tested.
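One way to picture a per-layer boundary scan is as a search for the longest run of leading layers whose substitution error stays under a tolerance. The sketch below is a minimal illustration under that assumption, not the scan actually used. The function name `substitutable_prefix`, the threshold, and the MSE values are all hypothetical, chosen only so that the boundary falls after layer 8 as reported.

```python
def substitutable_prefix(layer_mse, threshold):
    """Count leading layers whose MSE is at or below `threshold`.

    The scan stops at the first layer that exceeds the threshold, so
    later layers are never patched even if they happen to pass.
    """
    count = 0
    for mse in layer_mse:
        if mse > threshold:
            break
        count += 1
    return count


# Synthetic per-layer MSE profile for an 80-layer model (made-up values).
layer_mse = [1e-7] * 9 + [3e-1] + [5e-1] * 70
threshold = 1e-4  # hypothetical tolerance

n_patched = substitutable_prefix(layer_mse, threshold)
patched_layers = list(range(n_patched))
print(f"patch layers {patched_layers[0]}-{patched_layers[-1]} "
      f"({n_patched} of {len(layer_mse)})")
```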
End-to-End Token Throughput — Mode B (9 of 80 layers patched)
At 9 of 80 layers patched, Phoenix delivers a real wall-time win at short context (where the 9 patched layers' attention cost is a meaningful fraction of total time) and reaches parity at long context (where the 71 unpatched attention layers dominate). This is the bit-exact-substitutable ceiling without retraining. Pushing past it to full-network speedup at every context length is the LoRA-distillation path described in the roadmap.
Per-Head Phoenix Profile — Refining the Boundary
Beyond the whole-layer profile, a finer-grained scan examines each of the 64 query heads in each attention layer independently, against the actual Phoenix kernel rather than a software stand-in. Layers 0–8: all 64 heads have MSE in [10⁻⁹, 10⁻⁵], which confirms the whole-layer claim at head granularity. Layer 9 (the boundary): only heads 0–4 break (they perform long-range lookups that a sliding window cannot capture); the other 59 of 64 heads have MSE < 2×10⁻². A head-hybrid dispatcher that routes the 5 outlier heads through SDPA unlocks layer 9 without retraining anything; this is a near-term kernel deliverable.
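The routing decision behind a head-hybrid dispatcher could look roughly like the sketch below, which splits head indices by a per-head MSE threshold. This is an illustrative assumption about the dispatch rule, not the planned kernel. The 2×10⁻² threshold and the 5/59 split come from the text; the function name `route_heads` and the individual MSE values are hypothetical.

```python
def route_heads(head_mse, threshold):
    """Split head indices into (phoenix_heads, sdpa_heads) by per-head MSE."""
    phoenix_heads, sdpa_heads = [], []
    for head, mse in enumerate(head_mse):
        if mse < threshold:
            phoenix_heads.append(head)  # within tolerance: fast path
        else:
            sdpa_heads.append(head)     # outlier: fall back to stock SDPA
    return phoenix_heads, sdpa_heads


# Synthetic layer-9 profile: heads 0-4 break, the other 59 pass.
# The individual MSE values are made up for illustration.
layer9_mse = [0.5] * 5 + [1e-3] * 59

phoenix_heads, sdpa_heads = route_heads(layer9_mse, threshold=2e-2)
print(f"SDPA heads: {sdpa_heads}")
print(f"Phoenix heads: {len(phoenix_heads)} of {len(layer9_mse)}")
```

Computing the split once from an offline profile, rather than per request, would keep the dispatch itself off the critical path.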
The Boundary Is Structural, Not Numerical
The fp16 and wide-sinks bf16 L9 head-MSE distributions agree to within 1% across all 64 heads. This rules out the "your kernel just needs more headroom or a wider window" objection. The bottleneck is the model's own activation distribution past layer 8, not Phoenix's numerics. LoRA distillation against the distribution shift is the literature-standard fix and is scoped in the roadmap.
Numbers measured 2026-05-18 on 8× H100 80 GB (NVIDIA Innovation Lab / Brev). Reproducible by a single script. JSON manifests and full per-head MSE tables available under NDA.
GPU-Native Vector Search — Faster Than FAISS
VOIS is our patent-pending GPU-native similarity search engine, built on the same mathematical framework as our quantum algorithms. It is benchmarked head-to-head against Meta's FAISS on the same machine, same dataset, same ground truth.
Head-to-Head at Matched Recall — SIFT-1M (1 Million Vectors)
Scaling — VOIS QPS at ~97% Recall
VOIS — Recall vs Throughput Sweep (1M Vectors)
FAISS Comparison Data — 1M Vectors
SIFT-1M dataset: 128 dimensions, 1,000 queries, K=10 nearest neighbors. All methods on the same RTX 4060 (8 GB), same data, same ground truth. VOIS uses a single GPU. FAISS HNSW tested at 1 and 16 CPU threads. Best of 5 runs.
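For readers unfamiliar with the metric, recall at K against a shared ground truth is commonly computed as in the sketch below. The exact recall definition used for these benchmarks is not stated, so this is an assumption; the function name `recall_at_k` and the toy neighbor IDs are hypothetical.

```python
def recall_at_k(retrieved, ground_truth, k=10):
    """Mean fraction of the true k nearest neighbors found in the top-k results.

    `retrieved` and `ground_truth` hold one list of neighbor IDs per query.
    """
    if len(retrieved) != len(ground_truth):
        raise ValueError("need one result list per ground-truth list")
    if not ground_truth:
        raise ValueError("no queries given")
    total = 0.0
    for found, truth in zip(retrieved, ground_truth):
        # Order within the top-k does not matter, only membership.
        total += len(set(found[:k]) & set(truth[:k])) / k
    return total / len(ground_truth)


# Toy example with two queries and k=2 (IDs are made up).
retrieved = [[1, 2], [3, 9]]
ground_truth = [[2, 1], [3, 4]]
print(recall_at_k(retrieved, ground_truth, k=2))
```

The first query finds both true neighbors and the second finds one of two, so the mean recall in this toy case is 0.75.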
PX Compute — Systems-Level Memory Pool for GPU and CPU
A slab-based memory pool allocator for NVIDIA GPUs and Linux CPUs. It uses per-device size-class free lists, golden-ratio block growth for the long tail, NVML-aware adaptive thresholds, and a hard pool-size cap to prevent runaway VRAM retention. It is engineered for workloads that recycle buffers in tight loops, which is exactly the alloc/free pattern that LLM KV caches exhibit.
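The general technique of size-class free lists with a retention cap can be sketched in plain Python as below. This is a toy model under stated assumptions, not the PX Compute implementation. A `bytearray` stands in for a device allocation, every size class grows by the golden ratio (the text applies this only to the long tail), and the NVML-aware thresholds are omitted. The names `SlabPool` and `size_classes` are hypothetical.

```python
import bisect
import math

GOLDEN_RATIO = (1 + 5 ** 0.5) / 2


def size_classes(smallest, largest):
    """Block sizes from `smallest` to `largest`, growing by the golden ratio."""
    sizes, size = [], float(smallest)
    while size < largest:
        sizes.append(math.ceil(size))
        size *= GOLDEN_RATIO
    if not sizes or sizes[-1] != largest:
        sizes.append(largest)
    return sizes


class SlabPool:
    """Toy pool: one free list per size class, with a cap on retained bytes."""

    def __init__(self, classes, max_retained_bytes):
        self.classes = sorted(classes)
        self.free = {size: [] for size in self.classes}
        self.max_retained_bytes = max_retained_bytes
        self.retained_bytes = 0
        self.backend_allocs = 0  # calls that would reach the driver or libc

    def _class_for(self, nbytes):
        index = bisect.bisect_left(self.classes, nbytes)
        if index == len(self.classes):
            raise ValueError("request exceeds the largest size class")
        return self.classes[index]

    def alloc(self, nbytes):
        size = self._class_for(nbytes)
        if self.free[size]:
            self.retained_bytes -= size
            return self.free[size].pop()  # reuse, no backend call
        self.backend_allocs += 1
        return bytearray(size)  # stand-in for cudaMalloc or malloc

    def release(self, block):
        size = len(block)
        fits_cap = self.retained_bytes + size <= self.max_retained_bytes
        if size in self.free and fits_cap:
            self.free[size].append(block)
            self.retained_bytes += size
        # Otherwise the block is dropped, i.e. handed back to the backend.


pool = SlabPool(size_classes(256, 1 << 20), max_retained_bytes=1 << 20)
for _ in range(1000):
    buf = pool.alloc(5000)
    pool.release(buf)
print(pool.backend_allocs)  # 1: every later request is served from the free list
```

In a tight alloc/free loop only the first request reaches the backend, which is the reuse pattern the allocator description is aimed at; the cap bounds how much memory the pool holds on to when it is idle.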
PX Compute vs cudaMalloc — NVIDIA H100 80GB (single GPU)
Cold (non-reuse) workloads are consistently 1.7–3.2× faster. The much larger speedups on reuse patterns are the relevant case: LLM workloads recycle KV cache buffers and attention scratch buffers (and, in training, gradient buffers) in tight loops, and that is the regime PX Compute targets.
8× H100 in Parallel — Driver-Lock Contention Amplifies the Win
Under multi-worker inference, 8 processes contending on the NVIDIA driver lock serialize through cudaMalloc. PX Compute serves from in-process free lists with zero driver calls — the advantage grows under parallel load. This is the production-realistic regime for cloud LLM serving.
CPU Memory Pool — Large-Block Reuse
Linux CPU pool. Wins concentrate on large-block reuse where libc has to mmap/munmap. Real use cases: CPU-offloaded LLM weight buffers, large embedding lookup tables, distributed-training intermediate buffers.
Every number above is reproducible from a single bench script on stock NVIDIA Brev or AWS hardware. JSON manifests of per-workload timings available under NDA.
60+ Identified Application Domains
Because the framework operates at a fundamental mathematical level, it applies across computational domains. We have identified over 60 specific application areas spanning search, security, optimization, signal processing, defense, energy, medical imaging, telecommunications, and autonomous systems. These are covered by our patent claims.
Detailed application analysis and domain-specific benchmarks available under NDA.
U.S. Patents Filed
Patent Applications Filed 2026
Core mathematical framework, quantum algorithms, classical implementations, hardware designs, and application methods are protected by patent claims. Licensing inquiries welcome.
Research Partners & Funding
Seeking SBIR/STTR funding, research partnerships, and strategic investment. U.S. patents filed. Select results available under NDA.