Inference Hardware Accelerators: GPUs, TPUs, and Custom Chips
The selection and deployment of hardware accelerators — graphics processing units (GPUs), tensor processing units (TPUs), and application-specific custom silicon — determines the throughput, latency, energy cost, and economic viability of production inference systems. This page covers the structural taxonomy of accelerator architectures, the engineering mechanics that differentiate chip classes, the market and regulatory forces shaping procurement, and the classification boundaries practitioners use to match hardware to workload requirements. The topic is central to inference system benchmarking, inference cost management, and every layer of modern model serving infrastructure.
- Definition and scope
- Core mechanics or structure
- Causal relationships or drivers
- Classification boundaries
- Tradeoffs and tensions
- Common misconceptions
- Checklist or steps (non-advisory)
- Reference table or matrix
- References
Definition and scope
An inference hardware accelerator is a processor or co-processor architecture optimized for executing trained neural network computations — forward passes through model graphs — at production throughput and latency targets that general-purpose CPUs cannot economically sustain. The accelerator category encompasses three structurally distinct classes: (1) GPU-class devices originally designed for parallel graphics workloads and repurposed for dense matrix arithmetic; (2) TPUs and analogous systolic-array ASICs designed from the ground up for tensor operations at datacenter scale; and (3) custom or semi-custom chips spanning field-programmable gate arrays (FPGAs), inference-only ASICs, and neural processing units (NPUs) embedded in edge hardware.
The scope of the accelerator market extends from hyperscale cloud datacenters running LLM inference services to sub-watt edge chips executing computer vision inference on embedded devices with no network dependency. The MLCommons organization, which administers the MLPerf benchmark suite (MLCommons MLPerf), provides the primary public-domain performance measurement framework across these hardware classes, establishing standardized inference benchmarks across datacenter, edge, and mobile tiers.
The DOE's Argonne National Laboratory and Lawrence Berkeley National Laboratory have both published infrastructure analyses documenting that AI workloads now constitute a measurable and growing fraction of total high-performance computing energy consumption — a regulatory and procurement driver discussed further in the causal relationships section below.
Core mechanics or structure
GPU architecture for inference. A GPU achieves inference throughput through massive single-instruction, multiple-thread (SIMT) parallelism. NVIDIA's Hopper architecture (H100), for example, provides up to 3,958 TFLOPS of FP8 tensor performance with structured sparsity, roughly half that dense (NVIDIA H100 Datasheet). The key internal structures are tensor cores — mixed-precision arithmetic units that execute small matrix multiply-accumulate tiles in a single clock cycle. High-bandwidth memory (HBM2e, HBM3) attached to the GPU package addresses the memory-bandwidth bottleneck that limits transformer model inference, where streaming billions of parameters from memory for each generated token dominates compute time.
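The bandwidth-bound regime can be illustrated with a back-of-envelope calculation. The figures below are illustrative; realized throughput also depends on kernel efficiency, KV-cache traffic, and batching:

```python
def max_tokens_per_sec(param_count: float, bytes_per_param: float,
                       mem_bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode throughput when every weight
    must be streamed from HBM once per generated token."""
    weight_bytes = param_count * bytes_per_param
    return mem_bandwidth_gb_s * 1e9 / weight_bytes

# A 70B-parameter model in FP16 (2 bytes/param) on a ~3,350 GB/s HBM3 part:
bound = max_tokens_per_sec(70e9, 2, 3350)
print(f"~{bound:.0f} tokens/sec upper bound")  # ~24 tokens/sec
```

This is why doubling compute throughput without raising memory bandwidth leaves single-stream token rates unchanged.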
TPU architecture. Google's TPU family, described in publicly available technical documents including the 2017 IEEE paper "In-Datacenter Performance Analysis of a Tensor Processing Unit" by Jouppi et al., is built around a 256×256 systolic array that tiles matrix multiplications across a two-dimensional mesh of multiply-accumulate cells. Data flows through the array without returning to an external cache, achieving high arithmetic intensity on regular tensor shapes. The TPU v4, deployed in Google's TPU pods, links 4,096 chips via high-speed interconnect, enabling model-parallel inference across very large language models.
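The systolic dataflow can be modeled in a few lines. This toy simulation uses an output-stationary variant (the TPU's MXU is weight-stationary, per Jouppi et al.) to show the essential idea: skewed operand streams let each cell multiply-accumulate every cycle without returning partial results to external memory:

```python
def systolic_matmul(A, B):
    """Toy cycle-by-cycle model of an output-stationary systolic array.

    Rows of A stream in from the left and columns of B from the top,
    each skewed by one cycle per row/column, so cell (i, j) sees the
    operand pair A[i][kk], B[kk][j] at cycle t = i + j + kk and
    accumulates its output in place.
    """
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for t in range(n + m + k - 2):   # total cycles for the pipeline to drain
        for i in range(n):
            for j in range(m):
                kk = t - i - j       # which operand pair reaches this cell now
                if 0 <= kk < k:
                    C[i][j] += A[i][kk] * B[kk][j]
    return C

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19.0, 22.0], [43.0, 50.0]]
```

Note that the array completes an n×k by k×m product in roughly n+m+k cycles rather than n·m·k sequential steps — the source of the systolic array's arithmetic intensity on regular tensor shapes.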
FPGA-based accelerators. FPGAs offer reconfigurable logic fabric that can be bitstream-programmed to implement custom dataflow graphs for specific model architectures. Intel (Altera) and AMD (Xilinx) are the primary FPGA silicon vendors. Latency characteristics for FPGA inference pipelines are covered in depth at inference latency optimization.
NPUs and edge silicon. Neural processing units embedded in system-on-chip (SoC) designs — ARM's Ethos series, Apple's Neural Engine, Qualcomm's AI Engine — dedicate fixed-function logic to quantized INT8 and INT4 matrix operations at sub-watt thermal budgets. These are the primary accelerators for edge inference deployment.
Causal relationships or drivers
Three converging forces have driven accelerator specialization beyond general-purpose silicon.
Model scale. The parameter counts of leading language models grew from 1.5 billion (GPT-2, 2019) to hundreds of billions in production LLMs. A model with 70 billion FP16 parameters requires approximately 140 GB of memory for the weights alone, far exceeding the HBM capacity of any single accelerator card and the memory bandwidth a CPU-class server can bring to bear. This arithmetic necessitates HBM-equipped accelerators or multi-chip configurations. The interaction between model scale and hardware capability is documented in the MLPerf Inference v3.1 results published by MLCommons in 2023.
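The memory arithmetic is simple to reproduce, and shows why precision reduction is a first-order capacity lever:

```python
def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Raw weight footprint; excludes activations, KV cache, and runtime buffers."""
    return params * bytes_per_param / 1e9

print(weight_memory_gb(70e9, 2))    # FP16: 140.0 GB
print(weight_memory_gb(70e9, 1))    # INT8:  70.0 GB
print(weight_memory_gb(70e9, 0.5))  # INT4:  35.0 GB
```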
Latency sensitivity. Real-time inference vs. batch inference workloads impose fundamentally different hardware requirements. Latency-sensitive applications — voice assistants, fraud detection, autonomous vehicle perception — require sub-10-millisecond response, which eliminates CPU-only paths and motivates dedicated on-device accelerators.
Energy economics. The U.S. Department of Energy's "AI and Energy" analysis documents that training and inference workloads in large datacenters consume significant and growing shares of facility power budgets. Cloud providers and hyperscalers migrated to custom inference silicon specifically because GPU energy efficiency, measured in TOPS/W (tera-operations per second per watt), is 3–5× higher for purpose-built inference ASICs than for general-purpose GPU cores running the same quantized workload.
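The efficiency gap translates directly into facility power budgets. A sketch with illustrative TOPS/W figures (placeholders for the 3–5× gap described above, not vendor-specific measurements):

```python
def sustained_power_kw(target_tops: float, tops_per_watt: float) -> float:
    """Power required to deliver an aggregate inference throughput
    at a given hardware efficiency."""
    return target_tops / tops_per_watt / 1000

# Hypothetical fleet target of 10,000 aggregate TOPS, comparing a
# ~2.5 TOPS/W general-purpose part against a ~10 TOPS/W inference ASIC:
print(sustained_power_kw(10_000, 2.5))   # 4.0 kW
print(sustained_power_kw(10_000, 10.0))  # 1.0 kW
```

At datacenter scale, that 4× difference compounds across racks, cooling, and power provisioning, which is the economic driver behind custom inference silicon.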
Export control and supply chain policy. The U.S. Bureau of Industry and Security (BIS) under the Department of Commerce has issued export control regulations under the Export Administration Regulations (EAR) restricting the sale of advanced AI accelerator chips — specifically those exceeding defined performance thresholds — to certain jurisdictions (BIS EAR, 15 C.F.R. Parts 730–774). These controls directly affect procurement pathways described at inference system procurement.
Classification boundaries
Accelerator classification follows three independent axes:
By form factor and deployment tier:
- Datacenter / server-grade: PCIe cards, SXM modules, or proprietary board designs requiring active cooling, drawing 300–700 watts per card
- Edge / embedded: SoC NPUs or compact accelerator modules operating at 1–15 watts
- On-premise server-class: rack-mounted GPU servers used in on-premise inference systems
By instruction flexibility:
- Programmable: CPUs, GPUs, FPGAs — can execute arbitrary models but with variable efficiency
- Semi-programmable: DSP-augmented NPUs that support a defined operator library
- Fixed-function: inference-only ASICs that execute a specific model architecture at maximum efficiency but cannot be retargeted without new silicon
By precision support:
- Full-precision (FP32, FP64): training-class GPUs
- Mixed-precision inference (FP16, BF16, INT8): most production inference GPUs and TPUs
- Ultra-low-precision (INT4, binary): edge NPUs and specialized ASICs; see model quantization for inference for the implications of precision reduction on model accuracy
The inference engine architecture page maps these hardware classes to the software runtimes that exploit them.
Tradeoffs and tensions
Throughput vs. latency. Batching requests increases GPU utilization and reduces per-query cost but adds queuing latency. A batch size of 64 on an H100 may achieve 10× higher throughput than a batch size of 1 while imposing 40–80 milliseconds of additional latency. This tradeoff is unresolvable by hardware alone; it requires inference pipeline design choices at the software layer.
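A toy queuing sketch makes the tension concrete. The latencies and arrival rates below are hypothetical, not H100 measurements:

```python
def batched_serving(batch_size: int, batch_latency_ms: float,
                    arrival_qps: float) -> tuple:
    """Toy model: throughput from amortized batch execution; latency is
    batch execution time plus the average wait while the batch fills
    at the given arrival rate."""
    throughput_qps = batch_size / (batch_latency_ms / 1000)
    avg_fill_wait_ms = (batch_size - 1) / 2 / arrival_qps * 1000
    return throughput_qps, batch_latency_ms + avg_fill_wait_ms

print(batched_serving(1, 20, 1000))    # (50.0, 20.0)  — low latency, low throughput
print(batched_serving(64, 120, 1000))  # ~533 QPS at ~151 ms total latency
```

The larger batch wins on cost per query and loses on tail latency, which is why the choice belongs to the serving layer, not the chip.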
Vendor lock-in vs. portability. Google TPU pods are accessible exclusively through Google Cloud Platform. NVIDIA GPUs dominate third-party cloud inference platforms but rely on the proprietary CUDA programming model. The ONNX (Open Neural Network Exchange) interoperability standard, maintained by the Linux Foundation's LF AI & Data Foundation (ONNX GitHub), addresses this by defining a hardware-agnostic model representation — see ONNX and inference interoperability for the scope of that solution.
Performance vs. total cost of ownership. An H100 SXM5 card delivers class-leading throughput but carried street prices exceeding $30,000 per unit per 2023 market reports. Custom ASIC development amortizes across very large query volumes but requires $50–200 million in NRE (non-recurring engineering) costs, making it economically viable only at hyperscale volumes. The inference system ROI framework addresses how organizations evaluate this decision boundary.
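The decision boundary can be framed as a break-even volume. All dollar figures below are hypothetical placeholders, not market prices:

```python
def break_even_million_queries(nre_dollars: float,
                               gpu_cost_per_m_queries: float,
                               asic_cost_per_m_queries: float) -> float:
    """Millions of queries needed before the ASIC's lower marginal
    serving cost recovers its non-recurring engineering spend."""
    return nre_dollars / (gpu_cost_per_m_queries - asic_cost_per_m_queries)

# $100M NRE; $2.00 vs $0.50 marginal cost per million queries (hypothetical):
volume = break_even_million_queries(100e6, 2.00, 0.50)
print(f"{volume:.2e} million queries to break even")  # ≈ 6.7e7 million queries
```

Tens of trillions of queries before payback is precisely why only hyperscalers clear this bar.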
Edge vs. cloud inference architecture. Running inference at the edge on a dedicated NPU eliminates network latency and reduces privacy exposure — a consideration that appears in FTC guidance on AI system data handling — but constrains model size to what fits within the device's memory envelope, typically under 4 GB on commercial edge silicon.
Common misconceptions
Misconception: More GPU memory always means better inference performance.
Memory capacity is one of three binding constraints, alongside memory bandwidth and compute throughput. A model that fits in 24 GB of VRAM may be memory-bandwidth-bound rather than capacity-bound; doubling VRAM without increasing bandwidth (HBM tier, bus width) yields no throughput improvement. The MLPerf benchmark results published by MLCommons quantify where each bottleneck binds for each model class.
Misconception: TPUs are universally faster than GPUs.
TPUs are optimized for regular tensor shapes and large batch sizes. Workloads with irregular sparsity, dynamic shapes, or small batch sizes — common in NLP inference systems with variable sequence lengths — often perform worse on TPU than on GPU. The architectural advantage of systolic arrays is workload-conditional, not absolute.
Misconception: FPGAs are obsolete for inference.
FPGAs remain competitive for latency-critical, low-batch financial and industrial applications where deterministic sub-millisecond response is required and power budgets are constrained. Intel and AMD both publish active FPGA inference reference designs as of their most recent product documentation.
Misconception: CPU-only inference is always inadequate.
For small models (sub-100 million parameters) with low query rates, a well-optimized CPU inference runtime such as Intel OpenVINO or Microsoft's ONNX Runtime CPU backend delivers acceptable throughput at zero incremental hardware cost. The threshold at which GPU acceleration becomes cost-justified depends on query volume and latency SLAs, not on model architecture alone.
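That threshold is fundamentally a utilization calculation. With hypothetical instance prices and a fixed realized query rate:

```python
def cost_per_million_queries(cost_per_hour: float, actual_qps: float) -> float:
    """Amortized serving cost; actual_qps is the realized query rate,
    not the platform's peak capacity."""
    return cost_per_hour / (actual_qps * 3600) * 1e6

# At 10 QPS of real traffic, a $1/hr CPU instance beats a $4/hr GPU
# instance even though the GPU could sustain far more load:
print(cost_per_million_queries(1.0, 10))  # ~$27.78 per million queries (CPU)
print(cost_per_million_queries(4.0, 10))  # ~$111.11 per million queries (idle GPU)
```

The GPU only wins once traffic is high enough to keep it busy, which is the query-volume dependence the paragraph describes.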
Checklist or steps (non-advisory)
The following sequence describes the hardware qualification process as practiced by production inference engineering teams:
- Workload profiling. Characterize the model: parameter count, data type (FP16/INT8/INT4), input shape variability, batch size distribution, and target latency SLA (e.g., P99 under 50 ms).
- Memory capacity check. Confirm that model weights plus activation memory fit within the accelerator's HBM or on-chip SRAM budget, accounting for the KV cache overhead in autoregressive transformer inference.
- Benchmark execution. Run MLCommons MLPerf Inference benchmarks in the applicable scenario (Server, Offline, SingleStream, MultiStream) to establish normalized throughput and latency baselines.
- Driver and runtime compatibility verification. Confirm support for the required inference runtime (TensorRT, XLA, ONNX Runtime, OpenVINO) and CUDA/ROCm/oneAPI driver stack version against the target OS kernel and container image.
- Thermal and power envelope validation. Confirm that rack power delivery (typically 30–80A at 208V per GPU tray) and cooling (inlet temperature, CFM airflow) meet the chip's thermal design power (TDP) specifications under sustained inference load.
- Precision regression testing. For quantized deployments, validate that INT8 or INT4 quantized model accuracy meets the minimum acceptable degradation threshold relative to FP32 baseline — using the evaluation dataset defined in the model's original publication.
- Failure mode documentation. Record expected degradation behavior under memory pressure, thermal throttling, and PCIe link errors, cross-referenced against inference system failure modes.
- Monitoring instrumentation. Deploy hardware telemetry hooks (NVIDIA DCGM, AMD ROCm SMI) before production cutover, as specified in inference monitoring and observability.
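The memory capacity check above can be sketched as code. The model shape below assumes a grouped-query-attention layout; layer count, head counts, and dimensions are illustrative, not a specific product's configuration:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_el: int = 2) -> float:
    """KV-cache footprint: a K and a V tensor per layer, per token,
    per sequence in the batch."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_el / 1e9

def fits_on_card(weights_gb: float, kv_gb: float, hbm_gb: float,
                 headroom: float = 0.9) -> bool:
    """Capacity check with headroom for activations and runtime buffers."""
    return weights_gb + kv_gb <= hbm_gb * headroom

kv = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=4096, batch=8)
print(f"KV cache: {kv:.1f} GB")          # KV cache: 10.7 GB
print(fits_on_card(140.0, kv, 80.0))     # False: 70B FP16 needs multiple chips
print(fits_on_card(35.0, kv / 4, 80.0))  # True with INT4 weights and KV cache
```

Note how the capacity and precision steps interact: quantization shrinks both the weights and the KV cache, and can move a model from multi-chip to single-card deployment.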
Reference table or matrix
| Accelerator Class | Representative Devices | Peak INT8 Throughput | TDP (W) | Batch Optimization | Primary Deployment |
|---|---|---|---|---|---|
| Datacenter GPU | NVIDIA H100 SXM5 | ~3,958 TOPS (FP8) | 700 W | Large batches | Cloud, on-premise HPC |
| Datacenter GPU | NVIDIA A100 80GB | ~2,000 TOPS (INT8) | 400 W | Medium–large batches | Cloud, cloud inference platforms |
| TPU (Google) | TPU v4 | Proprietary (Google internal) | ~170 W per chip | Large batches, regular shapes | Google Cloud only |
| Edge NPU | Apple Neural Engine (M3) | ~18 TOPS | ~5–8 W (SoC share) | Small batch / single sample | On-device iOS/macOS |
| Edge NPU | Qualcomm AI Engine (Snapdragon 8 Gen 3) | ~45 TOPS | ~2–4 W (SoC share) | Single sample | Android edge inference |
| FPGA | Intel Agilex 7 | Configurable | 50–125 W | Single sample / deterministic | Industrial, financial latency |
| Inference ASIC | AWS Inferentia2 | ~190 TOPS | ~75 W | Medium batches | AWS cloud only |
| CPU (optimized) | Intel Xeon w/ AMX | ~10–40 TOPS (INT8) | 250–350 W | Small batch | Low-QPS, cost-sensitive |
Throughput figures drawn from public vendor datasheets and MLCommons MLPerf v3.1 published results. TDP figures are manufacturer-published thermal design power.
The full inference systems landscape — including software runtimes, orchestration layers, and compliance considerations — is indexed at /index. Hardware accelerator selection integrates directly with inference system scalability planning and with mlops for inference pipeline governance.
References
- MLCommons MLPerf Inference Benchmark Suite — primary public benchmark framework for accelerator inference performance comparison
- NVIDIA H100 Tensor Core GPU Datasheet — official TDP, throughput, and memory specifications
- ONNX (Open Neural Network Exchange) — LF AI & Data Foundation — hardware-agnostic model interchange standard
- U.S. Bureau of Industry and Security — Export Administration Regulations (EAR), 15 C.F.R. Parts 730–774 — export control framework for advanced AI accelerator chips
- U.S. Department of Energy — Argonne National Laboratory, AI and HPC Research — infrastructure energy analysis for AI workloads
- Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit," IEEE/ACM ISCA 2017