[ TECHNOLOGY ]

Classical Implementations

The same mathematical framework producing quantum hardware breakthroughs also delivers measurable performance advantages on classical GPUs. These are working implementations, benchmarked against industry standards.

Long-Context Attention Kernel — Faster Than FlashAttention

Phoenix Attention is a drop-in attention replacement for frontier LLMs. It applies the same mathematics used across our framework to the bottleneck of long-context inference, and is benchmarked against FlashAttention on real Qwen 7B weights. It ships as a 3.2 MB stripped .so, with no source and no Triton runtime required at deploy time.

Phoenix Attention (AOT) vs FlashAttention — Qwen 7B, NVIDIA L40S 48 GB

CONTEXT | FLASHATTENTION | PHOENIX (AOT) | SPEEDUP
16K | 10.03 ms | 7.51 ms | 1.33×
65K | 169.98 ms | 26.91 ms | 6.32×
131K | 675.34 ms | 54.16 ms | 12.47×
524K | 16,063 ms | 229.19 ms | 70.09×
1M | 72,614 ms | 468.99 ms | 154.83×

Quality Preservation — End-to-End

EVALUATION | VANILLA | PHOENIX
GSM8K pass@1 (Qwen 2.5-7B) | 75 / 100 | 75 / 100
Perplexity ratio (Qwen 7B / 1.5B / 0.5B, TinyLlama) | 1.000× | 1.000×

Measured on NVIDIA L40S 48 GB against FlashAttention, the memory-efficient industry standard. Vanilla PyTorch SDPA runs out of memory at 131K context and above on a single L40S; both FlashAttention and Phoenix run through 1M, with Phoenix 12× to 155× faster over the 131K to 1M range. A reproducer is included in the demo container.
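The speedup column and the way each kernel scales with context can be rechecked from the published latencies alone. The sketch below is only arithmetic on the table above; it assumes "16K" means 16 × 1024 tokens (and so on up to "1M" as 1024 × 1024), which the page does not state, and the function names are illustrative.

```python
import math

# Latencies in milliseconds, copied from the L40S table above.
# Assumption: "16K" means 16 * 1024 tokens, "1M" means 1024 * 1024 tokens.
ROWS = [
    (16 * 1024, 10.03, 7.51),
    (64 * 1024, 169.98, 26.91),
    (128 * 1024, 675.34, 54.16),
    (512 * 1024, 16063.0, 229.19),
    (1024 * 1024, 72614.0, 468.99),
]
FLASH, PHOENIX = 1, 2  # column indices into each row


def speedups(rows):
    """FlashAttention latency divided by Phoenix latency, one value per row."""
    return [row[FLASH] / row[PHOENIX] for row in rows]


def scaling_exponent(rows, column):
    """Slope of log(latency) against log(context) between the first and last row.

    A value near 1 means latency grows linearly with context length;
    a value near 2 means it grows quadratically.
    """
    first, last = rows[0], rows[-1]
    return math.log(last[column] / first[column]) / math.log(last[0] / first[0])


if __name__ == "__main__":
    for row, s in zip(ROWS, speedups(ROWS)):
        print(f"{row[0]:>8} tokens: {s:7.2f}x")
    print("FlashAttention exponent:", round(scaling_exponent(ROWS, FLASH), 2))
    print("Phoenix exponent:", round(scaling_exponent(ROWS, PHOENIX), 2))
```

Under these assumptions the FlashAttention exponent comes out at roughly 2.1 and the Phoenix exponent at roughly 1.0, which is why the speedup keeps widening as context grows.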

CUDA SVD — Faster Than torch.linalg.svd

Phoenix SVD is a CUDA singular-value decomposition kernel for LoRA initialization, quantization, and weight-decomposition pipelines. Benchmarked against torch.linalg.svd on real Qwen 7B q_proj weights. Drop-in replacement; no quality regression.

General Matrices (fp64) — NVIDIA L40S 48 GB

SIZE | torch GPU | PHOENIX COLD | PHOENIX WARM | SPEEDUP (COLD VS TORCH) | SPEEDUP (WARM VS TORCH)
1024² | 804 ms | 364 ms | 4.96 ms | 2.2× | 162×
2048² | 2,785 ms | 1,173 ms | 19.1 ms | 2.4× | 146×
4096² | 13,060 ms | 7,048 ms | 90.3 ms | 1.9× | 145×
8192² | 99,223 ms | 47,286 ms | 355 ms | 2.1× | 279×

Symmetric Matrices (covariance / Gram / Hermitian, fp64)

SIZE | torch GPU | PHOENIX COLD | PHOENIX WARM | SPEEDUP (COLD VS TORCH) | SPEEDUP (WARM VS TORCH)
1024² | 807 ms | 42.6 ms | 4.89 ms | 19.0× | 165×
2048² | 2,748 ms | 109 ms | 18.3 ms | 25.2× | 150×
4096² | 12,991 ms | 567 ms | 90.3 ms | 22.9× | 144×
8192² | 99,818 ms | 3,642 ms | 355 ms | 27.4× | 281×

Numerical Accuracy — Singular Values vs torch.linalg.svd

SIZE | MAX ERROR
512² | 5.6 × 10^-12
1024² | 1.8 × 10^-11
2048² | 6.3 × 10^-11
4096² | 1.9 × 10^-10
8192² | 6.6 × 10^-10

Measured on NVIDIA L40S 48 GB, fp64. torch GPU SVD becomes impractical at 8K+ (~100 s at 8192²); Phoenix warm-cache stays sub-second across the full size range. Drop-in replacement for torch.linalg.svd — same call shape, matching outputs, no model code change.
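The page does not define "cold" and "warm". A common reading, assumed here, is that cold is the first call (including any one-time setup) and warm is the best of the calls that follow. The sketch below shows a timing harness built on that assumption; `CachedWorkload` is a hypothetical stand-in, not the Phoenix kernel. For a real GPU kernel the device would also need to be synchronized before each timer read.

```python
import time


def time_cold_and_warm(fn, *args, warm_runs=5):
    """Return (cold_seconds, warm_seconds) for repeated calls of fn(*args).

    cold is the first call; warm is the fastest of the next `warm_runs` calls.
    """
    start = time.perf_counter()
    fn(*args)
    cold = time.perf_counter() - start

    warm = float("inf")
    for _ in range(warm_runs):
        start = time.perf_counter()
        fn(*args)
        warm = min(warm, time.perf_counter() - start)
    return cold, warm


class CachedWorkload:
    """Hypothetical workload with a one-time setup cost paid on the first call."""

    def __init__(self):
        self.setup_calls = 0
        self._plan = None

    def __call__(self, n):
        if self._plan is None:
            # One-time setup: only the cold call pays for this.
            self.setup_calls += 1
            self._plan = [i * i for i in range(n)]
        return sum(self._plan)


if __name__ == "__main__":
    workload = CachedWorkload()
    cold, warm = time_cold_and_warm(workload, 200_000)
    print(f"cold: {cold * 1e3:.3f} ms, warm: {warm * 1e3:.3f} ms")
```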

Scaling on H100 — Real Model Weights and Large fp64 Matrices

MATRIX | SIZE | TORCH | PHOENIX | SPEEDUP
Real Nemotron Mamba in_proj A·Aᵀ | 18,560² | 646.1 s | 8.08 s | 79.9×
Real Llama 3.3 70B up_proj weight | 16,384² | 245.7 s | 5.86 s | 41.9×
Synthetic PSD (A Aᵀ) | 16K² | 110.6 s | 8.02 s | 13.8×
Synthetic PSD (A Aᵀ) | 24K² | 378.6 s | 22.3 s | 17.0×

Measured on NVIDIA H100 80 GB, fp64. Top-5 singular values match torch to ~10^-11 on the real Llama weight and ~10^-6 on the synthetic PSD. At 24K² torch takes over six minutes; Phoenix finishes in 22 seconds.

Cross-Domain — Real Single-Cell Genomics (PBMC 3K)

METHOD | TIME | ACCURACY | SPEEDUP
NumPy SVD (LAPACK gesdd) | 7,741 ms | reference | 1.0×
sklearn TruncatedSVD (top-50) | 1,500 ms | approximate | 5.2×
Phoenix SVD | 46.4 ms | 1.3 × 10^-12 | 166.9×

Real 10x Genomics PBMC 3K dataset, standard scanpy preprocessing pipeline (Seurat-style), 2,700 cells × 2,000 highly variable genes, fp64. Singular values match LAPACK to ~10^-12, agreement near the limit of fp64 precision. This demonstrates that the SVD primitive carries beyond ML weights to single-cell PCA, signal processing, recommendation systems, and any domain where SVD is on the critical path.

Phoenix on NVIDIA's Flagship Hybrid — Measured, Reproducible

Phoenix SVD and Phoenix Attention running together on NVIDIA's Nemotron-3-Super-120B — the hybrid Mamba-2 + Attention + MoE flagship — measured against the unmodified stock configuration. 8× H100 80 GB. Real weights, real data, every number reproducible from a single script.

Phoenix SVD on a Real Nemotron Weight (q_proj 4096²)

METHOD | TIME | MAX ΔS | SPEEDUP
torch.linalg.svd | 3,788 ms | n/a | 1.0×
Phoenix SVD | 78 ms | 5 × 10^-11 | 48.6×

Phoenix SVD factorizes Nemotron's own attention projection weight essentially bit-exact, at 48.6× the speed of the PyTorch / cuSOLVER reference. Drop-in primitive anywhere SVD is on the critical path.

Phoenix Windowed KV Cache — Tunable Memory Curve (32K Context)

MODE | KV CACHE | PERPLEXITY | NEEDLE
Stock | 267.6 MB | baseline | 100%
Max-compression | 63× smaller | not degraded | tradeoff
Retrieval-safe | 3.8× smaller | not degraded | 100%

These are two operating points on the same tunable curve. Max-compression targets summarization, classification, dialogue, and retrieval-augmented workloads, where the model processes long input but does not require verbatim distant recall. Retrieval-safe targets workloads that do. The mode is selectable per request. Measured with real wikitext-2 perplexity and a real long-context needle-in-a-haystack test.
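The Phoenix cache design is not disclosed on this page, so the following is only a generic illustration of how a windowed KV cache can trade memory for distant recall: it keeps a few initial "sink" tokens plus a sliding window of recent tokens. The class, the mode table, and every parameter value are hypothetical and are not Phoenix's configuration.

```python
from collections import deque


class WindowedKVCache:
    """Generic sketch: keep the first `n_sink` tokens plus the latest `window` tokens."""

    def __init__(self, n_sink, window):
        self.n_sink = n_sink
        self.sinks = []                     # earliest tokens, never evicted
        self.recent = deque(maxlen=window)  # a full deque drops its oldest entry

    def append(self, position, kv):
        if len(self.sinks) < self.n_sink:
            self.sinks.append((position, kv))
        else:
            self.recent.append((position, kv))

    def positions(self):
        """Token positions that are still attendable."""
        return [p for p, _ in self.sinks] + [p for p, _ in self.recent]

    def __len__(self):
        return len(self.sinks) + len(self.recent)


# Hypothetical per-request operating points (illustrative values only).
MODES = {
    "max_compression": {"n_sink": 4, "window": 512},
    "retrieval_safe": {"n_sink": 4, "window": 8192},
}


def build_cache(mode):
    return WindowedKVCache(**MODES[mode])


if __name__ == "__main__":
    context = 32 * 1024
    for mode in MODES:
        cache = build_cache(mode)
        for pos in range(context):
            cache.append(pos, None)  # None stands in for the key/value tensors
        print(mode, "keeps", len(cache), "of", context, "tokens")
```

A smaller window retains fewer entries, so memory shrinks, but tokens that fall outside both the sinks and the window can no longer be attended to exactly. That is the kind of tradeoff the needle column in the table reflects.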

Phoenix Attention Kernel — Scaling on H100

CONTEXT | SDPA | PHOENIX | SPEEDUP
16K | 3.49 ms | 2.46 ms | 1.42×
32K | 14.5 ms | 4.83 ms | 3.01×
65K | 59.5 ms | 9.58 ms | 6.22×

Standalone kernel performance scales cleanly with context length. End-to-end speedup on a hybrid model like Nemotron is Amdahl-bound, because only 8 of 88 layers are attention, so the end-to-end attention win on Nemotron is memory, not raw speed. The speed numbers above describe the kernel itself and apply to attention-dominant workloads and pure-transformer integrations.
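The Amdahl bound can be made concrete with a few lines of arithmetic. The sketch below assumes every layer costs the same amount of time, so that 8 of 88 layers is about 9.1% of the runtime; the page does not give the real time split, so the result is illustrative only.

```python
import math


def amdahl_speedup(fraction_accelerated, component_speedup):
    """Overall speedup when only `fraction_accelerated` of the runtime gets faster."""
    if not 0.0 <= fraction_accelerated <= 1.0:
        raise ValueError("fraction_accelerated must be between 0 and 1")
    if component_speedup <= 0:
        raise ValueError("component_speedup must be positive")
    remaining = 1.0 - fraction_accelerated
    return 1.0 / (remaining + fraction_accelerated / component_speedup)


# Assumption: equal cost per layer, so attention is 8/88 of total runtime.
ATTENTION_FRACTION = 8 / 88
KERNEL_SPEEDUP_65K = 6.22  # 65K row of the H100 kernel table above

if __name__ == "__main__":
    overall = amdahl_speedup(ATTENTION_FRACTION, KERNEL_SPEEDUP_65K)
    ceiling = amdahl_speedup(ATTENTION_FRACTION, math.inf)
    print(f"end-to-end speedup: {overall:.3f}x (ceiling {ceiling:.3f}x)")
```

Under the equal-cost assumption this gives about 1.08× end to end, and no more than 1.1× even with an infinitely fast attention kernel.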

Architecture Fit

Phoenix is engineered for NVIDIA's hybrid Mamba-2 + Attention direction. The compression and quality results above are what that fit produces, end-to-end, on the 120B configuration. The same approach generalizes to other hybrids natively, and to pure-transformer architectures via established Mamba-2 conversion methods (distillation, reported at roughly 1–3% of original pretraining compute in "The Mamba in the Llama" and related work) rather than from-scratch retraining.

Headline table reproducible by a single script. Independent fp64 verification of the kernel, multi-GPU correctness, perplexity, needle, and KV-cache footprint measurements all available under NDA. Architectural methodology and integration specifics are patent-pending and disclosed only under signed evaluation agreement.

Phoenix on a Pure Dense Transformer — 100% Token-Identical Output

Phoenix Attention applied as a drop-in replacement to Meta Llama 3.3 70B on 8× H100 80 GB. A per-layer MSE boundary scan identifies the 9 of 80 attention layers where Phoenix substitutes bit-exactly — layers 0–8. The remaining 71 layers run stock SDPA / FlashAttention. Output is token-by-token identical to the unmodified Llama under greedy decoding at every context tested.
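The scan itself is not published here. As a minimal sketch of the idea, assuming the scan yields one MSE value per layer and that only a contiguous run of leading layers is patched, the boundary is the first layer whose MSE is non-finite or exceeds a tolerance. The threshold and the profile values below are made up for illustration.

```python
import math


def substitutable_prefix(layer_mse, threshold):
    """Count the leading layers whose MSE is finite and below `threshold`.

    The scan stops at the first failing layer, so later layers are never
    patched even if one of them happens to pass.
    """
    count = 0
    for mse in layer_mse:
        if not math.isfinite(mse) or mse >= threshold:
            break
        count += 1
    return count


if __name__ == "__main__":
    # Illustrative 80-layer profile, not measured data:
    # 9 clean layers, one boundary layer, then NaN from layer 10 on.
    profile = [1e-7] * 9 + [2.6e-3] + [math.nan] * 70
    n = substitutable_prefix(profile, threshold=1e-4)  # hypothetical tolerance
    print(f"patch layers 0..{n - 1}, keep stock attention for the other {80 - n}")
```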

End-to-End Token Throughput — Mode B (9 of 80 layers patched)

CONTEXT | SDPA-FLASH | PHOENIX MODE B | SPEEDUP | TOKEN MATCH
4K | 3.55 tok/s | 9.13 tok/s | 2.57× | 100%
16K | 2.48 tok/s | 3.58 tok/s | 1.44× | 100%
65K | 0.73 tok/s | 0.70 tok/s | 0.96× | 100%
131K | 0.23 tok/s | 0.25 tok/s | 1.05× | 100%

At 9 of 80 layers patched, Phoenix delivers a real wall-time win at short context (where the 9 patched layers' attention cost is a meaningful fraction of total time) and reaches parity at long context (where the 71 unpatched attention layers dominate). This is the bit-exact-substitutable ceiling without retraining. Pushing past it to full-network speedup at every context length is the LoRA-distillation path described in the roadmap.

Per-Head Phoenix Profile — Refining the Boundary

Beyond the whole-layer profile, a finer-grained scan examines each of the 64 query heads in each attention layer independently, against the actual Phoenix kernel rather than a software stand-in. In layers 0–8, all 64 heads show MSE in [10^-9, 10^-5], which confirms the whole-layer claim at head granularity. In layer 9 (the boundary), only heads 0–4 break, on long-range lookups that a sliding window cannot capture; the other 59 of 64 heads stay within MSE < 2 × 10^-2. A head-hybrid dispatcher that routes the 5 outlier heads through SDPA unlocks layer 9 without retraining anything, and is a near-term kernel deliverable.
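A head-hybrid dispatcher of this kind can be sketched as a per-head routing table. The code below is an assumption about the general shape, not the planned kernel: `route_heads`, `dispatch`, and the MSE values are illustrative, and the two callables stand in for the fast kernel and the SDPA fallback.

```python
import math


def route_heads(head_mse, threshold):
    """Split head indices into (fast_heads, fallback_heads) by per-head MSE."""
    fast, fallback = [], []
    for head, mse in enumerate(head_mse):
        ok = math.isfinite(mse) and mse < threshold
        (fast if ok else fallback).append(head)
    return fast, fallback


def dispatch(per_head_inputs, fallback_heads, fast_fn, reference_fn):
    """Run each head through the fast path unless it is on the fallback list."""
    fallback = set(fallback_heads)
    return [
        reference_fn(x) if head in fallback else fast_fn(x)
        for head, x in enumerate(per_head_inputs)
    ]


if __name__ == "__main__":
    # Illustrative layer-9 profile: heads 0-4 are outliers, the rest pass.
    layer9_mse = [0.5] * 5 + [5e-3] * 59
    fast, fallback = route_heads(layer9_mse, threshold=2e-2)
    print(len(fast), "heads on the fast path,", len(fallback), "on the fallback")
    tags = dispatch(range(64), fallback, lambda x: "fast", lambda x: "sdpa")
    print(tags[:8])
```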

The Boundary Is Structural, Not Numerical

PHOENIX VARIANT | MEASURED | L9 BEST-HEAD MSE | L10+
Stock fp16 variant (production-tuned defaults) | 2026-05-18 | 2.607 × 10^-3 | NaN
bf16 (fp32-equivalent range) | 2026-05-13 | same profile | same NaN
Wide-sinks bf16 variant (extended-context tuning) | 2026-05-18 | 2.609 × 10^-3 | NaN / Inf

The fp16 and wide-sinks bf16 L9 head-MSE distributions agree to within 1% across all 64 heads. This rules out the "your kernel just needs more headroom or a wider window" objection. The bottleneck is the model's own activation distribution past layer 8, not Phoenix's numerics. LoRA distillation against the distribution shift is the literature-standard fix and is scoped in the roadmap.

Numbers measured 2026-05-18 on 8× H100 80 GB (NVIDIA Innovation Lab / Brev). Reproducible by a single script. JSON manifests and full per-head MSE tables available under NDA.

GPU-Native Vector Search — Faster Than FAISS

VOIS is our patent-pending GPU-native similarity search engine, built on the same mathematical framework as our quantum algorithms. It is benchmarked head-to-head against Meta's FAISS on the same machine, same dataset, and same ground truth.

Head-to-Head at Matched Recall — SIFT-1M (1 Million Vectors)

RECALL | VOIS QPS | FAISS HNSW (16T) | FAISS GPU IVF-FLAT | SPEEDUP
~94% | 197,000 | 79,355 | n/a | 2.5×
~97% | 135,000 | 43,991 | 28,082 | 3.1×
~98% | 89,000 | 24,011 | 14,123 | 3.7×
~99% | 59,000 | 13,956 | 6,912 | 4.2×

Scaling — VOIS QPS at ~97% Recall

DATASET SIZE | VOIS QPS | RECALL@10
100,000 vectors | 183,000 | 96.66%
250,000 vectors | 167,000 | 96.90%
500,000 vectors | 153,000 | 96.80%
1,000,000 vectors | 135,000 | 97.04%

VOIS — Recall vs Throughput Sweep (1M Vectors)

MODE | RECALL@10 | QPS
Maximum Speed | 94.38% | 197,000
Fast | 96.01% | 161,000
Balanced | 97.04% | 135,000
High Recall | 98.43% | 89,000
Maximum Recall | 99.13% | 59,000
Near-Perfect | 99.36% | 42,000

FAISS Comparison Data — 1M Vectors

FAISS METHOD | RECALL@10 | QPS
HNSW M=32, ef=32 (1 thread) | 93.70% | 13,471
HNSW M=32, ef=64 (1 thread) | 97.91% | 7,644
HNSW M=32, ef=128 (1 thread) | 99.50% | 4,128
HNSW M=32, ef=64 (16 threads) | 97.83% | 43,991
HNSW M=32, ef=128 (16 threads) | 99.51% | 24,011
GPU IVF-Flat, nprobe=32 | 97.79% | 14,123
GPU IVF-Flat, nprobe=64 | 99.47% | 6,912
GPU Flat (brute force) | 99.94% | 18,886

SIFT-1M dataset: 128 dimensions, 1,000 queries, K=10 nearest neighbors. All methods on the same RTX 4060 (8GB), same data, same ground truth. VOIS uses a single GPU. FAISS HNSW tested at 1 and 16 CPU threads. Best of 5 runs.
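For readers who want to rerun this kind of comparison, the two reported metrics can be computed as follows. This is a sketch of the conventional definitions (recall@k as overlap with the exact top-k, QPS as the best of several timed runs), assumed rather than taken from the benchmark script; `search_fn` is a hypothetical callable that processes a batch of queries.

```python
import time


def recall_at_k(retrieved, ground_truth, k=10):
    """Mean fraction of the true top-k neighbors present in the retrieved top-k."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    hits = 0
    for got, truth in zip(retrieved, ground_truth):
        hits += len(set(got[:k]) & set(truth[:k]))
    return hits / (k * len(ground_truth))


def best_qps(search_fn, queries, runs=5):
    """Queries per second from the fastest of `runs` timed batch searches."""
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        search_fn(queries)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, len(queries) / elapsed)
    return best


if __name__ == "__main__":
    # Tiny made-up example: two queries, k=3.
    retrieved = [[1, 2, 3], [4, 5, 6]]
    truth = [[1, 2, 9], [4, 5, 6]]
    print("recall@3:", recall_at_k(retrieved, truth, k=3))
    print("QPS:", best_qps(lambda qs: [sorted(q) for q in qs], retrieved))
```

Comparing engines at matched recall, as the head-to-head table does, means tuning each engine's own knob until recall lands at about the same value and only then comparing QPS.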

PX Compute — Systems-Level Memory Pool for GPU and CPU

A slab-based memory pool allocator for NVIDIA GPUs and Linux CPUs. It combines per-device size-class free lists, golden-ratio block growth for the long tail, NVML-aware adaptive thresholds, and a hard pool-size cap to prevent runaway VRAM retention. It is engineered for workloads that recycle buffers in tight loops; the alloc/free pattern that LLM KV caches exhibit is exactly what this allocator targets.
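The general technique can be illustrated with a toy, single-device pool. The sketch below is not PX Compute: it uses `bytearray` as a stand-in for a device allocation, leaves out the NVML-aware thresholds, and picks its size-class rules (powers of two up to a limit, then golden-ratio steps) and all parameter values purely for illustration.

```python
import math

GOLDEN_RATIO = (1 + math.sqrt(5)) / 2


class SlabPool:
    """Toy size-class pool; bytearray stands in for a backend allocation."""

    def __init__(self, max_pooled_bytes, min_block=256, pow2_limit=1 << 20):
        self.max_pooled_bytes = max_pooled_bytes  # hard cap on retained bytes
        self.min_block = min_block
        self.pow2_limit = pow2_limit
        self.free_lists = {}     # size class -> stack of free blocks
        self.pooled_bytes = 0    # bytes currently parked in free lists
        self.backend_allocs = 0  # requests that fell through to the backend

    def size_class(self, nbytes):
        """Round up: double until pow2_limit, then grow by the golden ratio."""
        size = self.min_block
        while size < nbytes:
            if size < self.pow2_limit:
                size *= 2
            else:
                size = math.ceil(size * GOLDEN_RATIO)
        return size

    def alloc(self, nbytes):
        cls = self.size_class(nbytes)
        free = self.free_lists.get(cls)
        if free:
            # Reuse path: no backend call at all.
            self.pooled_bytes -= cls
            return free.pop()
        self.backend_allocs += 1
        return bytearray(cls)

    def free(self, block):
        cls = len(block)
        if self.pooled_bytes + cls > self.max_pooled_bytes:
            return  # over the cap: let the block go instead of retaining it
        self.free_lists.setdefault(cls, []).append(block)
        self.pooled_bytes += cls


if __name__ == "__main__":
    pool = SlabPool(max_pooled_bytes=1 << 24)
    for _ in range(100_000):
        pool.free(pool.alloc(4096))
    print("backend allocations:", pool.backend_allocs)
```

In a reuse loop like the one above, only the first request reaches the backend; every later request is served from the free list. That is the access pattern the reuse_* workloads below measure.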

PX Compute vs cudaMalloc — NVIDIA H100 80GB (single GPU)

WORKLOAD | ITERS | cudaMalloc | PX COMPUTE | SPEEDUP
reuse_4KB | 500K | 71.5 s | 0.44 s | 160×
reuse_1MB | 200K | 28.7 s | 0.18 s | 159×
reuse_64MB | 20K | 6.3 s | 0.02 s | 312×
reuse_1GB | 5K | 13.5 s | 0.005 s | 2,706×

Cold (non-reuse) workloads show a consistent 1.7–3.2× speedup. The large speedups on reuse patterns are the relevant case: LLM serving recycles KV cache buffers and attention scratch buffers in tight loops, and training does the same with gradient buffers. That is the pattern PX Compute targets.

8× H100 in Parallel — Driver-Lock Contention Amplifies the Win

WORKLOAD | SINGLE GPU | 8× GPU AVG | AMPLIFICATION
reuse_4KB | 160× | 1,522× | 9.5×
reuse_1MB | 159× | 1,526× | 9.6×
reuse_64MB | 311× | 1,540× | 5.0×

Under multi-worker inference, 8 processes contending on the NVIDIA driver lock serialize through cudaMalloc. PX Compute serves from in-process free lists with zero driver calls — the advantage grows under parallel load. This is the production-realistic regime for cloud LLM serving.

CPU Memory Pool — Large-Block Reuse

WORKLOAD | ITERS | libc malloc | PX COMPUTE CPU | SPEEDUP
large_256MB | 500 | 4.0 ms | 1.5 ms | 2.7×
huge_1GB | 100 | 0.96 ms | 0.33 ms | 2.8×
reuse_1GB | 2K | 20.4 ms | 0.90 ms | 22.6×
22.6×

Linux CPU pool. Wins concentrate on large-block reuse where libc has to mmap/munmap. Real use cases: CPU-offloaded LLM weight buffers, large embedding lookup tables, distributed-training intermediate buffers.

Every number above is reproducible from a single bench script on stock NVIDIA Brev or AWS hardware. JSON manifests of per-workload timings available under NDA.

60+ Identified Application Domains

Because the framework operates at a fundamental mathematical level, it applies across computational domains. We have identified over 60 specific application areas spanning search, security, optimization, signal processing, defense, energy, medical imaging, telecommunications, and autonomous systems. These are covered by our patent claims.

Detailed application analysis and domain-specific benchmarks available under NDA.

INTELLECTUAL PROPERTY

U.S. Patents Filed

Patent Applications Filed 2026

Core mathematical framework, quantum algorithms, classical implementations, hardware designs, and application methods are protected by patent claims. Licensing inquiries welcome.

Research Partners & Funding

Seeking SBIR/STTR funding, research partnerships, and strategic investment. U.S. patents filed. Select results available under NDA.

CONTACT US