Author: Denis Avetisyan
Researchers have designed a specialized GPU unit to dramatically accelerate fully homomorphic encryption, paving the way for more practical privacy-preserving computation.

FHECore, a dedicated functional unit, accelerates fully homomorphic encryption by optimizing wide-precision modulo arithmetic and leveraging a systolic array architecture.
While fully homomorphic encryption (FHE) promises computation on encrypted data, its substantial performance and resource demands have hindered practical deployment. This paper, ‘FHECore: Rethinking GPU Microarchitecture for Fully Homomorphic Encryption’, addresses this challenge by introducing a specialized functional unit integrated directly into the GPU’s Streaming Multiprocessor. FHECore accelerates critical FHE primitives – particularly Number Theoretic Transforms and base conversions – by natively supporting wide-precision modulo-multiply-accumulate operations, achieving speedups of up to 2.12× with a modest 2.4% area overhead. Could this microarchitectural rethinking pave the way for widespread adoption of privacy-preserving computation on readily available hardware?
The Promise of Encrypted Computation: A Paradigm Shift
Fully Homomorphic Encryption (FHE) represents a paradigm shift in data security, envisioning a future where computations can be performed directly on encrypted data without requiring decryption. This capability addresses a critical vulnerability in modern data processing – the exposure of sensitive information during analysis – by allowing algorithms to operate on ciphertext, yielding an encrypted result that, when decrypted, is identical to the result of operating on plaintext. The implications are profound, extending to secure cloud computing, confidential machine learning, and privacy-preserving data analytics, all while eliminating the need to trust the computing infrastructure with access to raw, unencrypted data. Essentially, FHE decouples data utility from data exposure, promising a world where insights can be extracted without compromising individual privacy or organizational confidentiality.
The theoretical allure of fully homomorphic encryption (FHE) – performing computations on encrypted data without decryption – has long been tempered by substantial computational costs. Historically, these overheads have proven prohibitive for many real-world applications, especially those demanding significant processing power, such as modern machine learning. Training complex models, or even executing inference on large datasets, requires countless arithmetic operations; when performed on encrypted data using traditional FHE schemes, the required resources balloon by several orders of magnitude. This practical limitation stems from the intricate mathematical operations inherent in FHE, which transform data in ways that preserve privacy but introduce significant delays and energy consumption. Consequently, while FHE offers unparalleled data security, its adoption remains constrained until these performance bottlenecks can be effectively addressed through algorithmic improvements, specialized hardware, or optimized implementations.
While several Fully Homomorphic Encryption (FHE) schemes have emerged, each presents trade-offs in functionality. The TFHE scheme, for instance, excels at evaluating Boolean circuits and is particularly efficient for simple operations like comparisons and bitwise logic – making it suitable for privacy-preserving identification or basic data matching. However, its architecture is not easily extended to handle the complex, high-dimensional computations prevalent in modern applications. Unlike schemes designed for arithmetic operations, TFHE struggles with tasks requiring floating-point calculations or matrix manipulations, severely limiting its applicability to areas like machine learning or advanced data analytics where versatile computational capabilities are essential. This inherent lack of flexibility necessitates exploring alternative FHE approaches or developing specialized hybrid solutions to address a broader range of computational needs.
The CKKS scheme represents a pivotal advancement in fully homomorphic encryption, specifically tailored for the demands of machine learning applications. Unlike schemes requiring precise computations, CKKS facilitates approximate computation on encrypted data, a critical trade-off that drastically reduces computational complexity. This allows for practical implementation of algorithms like neural networks directly on ciphertexts, preserving data privacy throughout the entire process. However, realizing the full potential of CKKS necessitates substantial acceleration; the inherent complexities of homomorphic operations, even with approximation, still pose a significant performance bottleneck. Current research focuses on hardware acceleration, optimized libraries, and algorithmic improvements to overcome these challenges and unlock the scheme’s promise for privacy-preserving machine learning at scale.

FHECore: Architecting Accelerated Arithmetic
FHECore is a newly developed functional unit designed to accelerate the computationally intensive modulo arithmetic operations central to Fully Homomorphic Encryption (FHE) schemes. The architecture utilizes a systolic array, a hardware pattern that enables parallel data processing, to improve performance on wide-precision arithmetic. This approach directly addresses a key bottleneck in FHE implementations, where large integer and polynomial manipulations are common. The design aims to increase throughput and reduce latency associated with operations such as multiplication and modular reduction, which are fundamental to both encryption and decryption processes within FHE systems. The systolic array structure facilitates efficient data reuse and minimizes data movement, contributing to lower energy consumption compared to traditional arithmetic units.
FHECore achieves a 2.12x speedup for complete workloads by exploiting the parallel processing capabilities of systolic arrays. Traditional FHE implementations often suffer from sequential bottlenecks during modular arithmetic. Systolic arrays address this limitation by enabling concurrent execution of multiple arithmetic operations, reducing overall latency. This parallel architecture minimizes data movement and maximizes throughput for core FHE operations such as multiplication and modular reduction, resulting in substantial performance gains compared to conventional serial implementations. The speedup was determined through benchmarking FHECore against a standard software implementation performing equivalent computations on a representative full workload.
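To make the wavefront dataflow concrete, here is a toy, cycle-by-cycle software model of an output-stationary systolic array performing modular matrix multiply-accumulate. This is a pedagogical sketch, not FHECore’s actual datapath: the function name, the square-matrix assumption, and the wavefront schedule are illustrative choices.

```python
def systolic_matmul_mod(A, B, q):
    """Simulate an output-stationary systolic array computing C = (A @ B) mod q."""
    n = len(A)  # assume square n x n operands for simplicity
    C = [[0] * n for _ in range(n)]
    # Each processing element (PE) at (i, j) accumulates sum_k A[i][k] * B[k][j] mod q.
    # Data enters in staggered wavefronts: at cycle t, PE (i, j) consumes the
    # A operand arriving from the left and the B operand arriving from above,
    # which correspond to index k = t - i - j.
    for t in range(3 * n - 2):  # total cycles for all wavefronts to drain
        for i in range(n):
            for j in range(n):
                k = t - i - j
                if 0 <= k < n:
                    C[i][j] = (C[i][j] + A[i][k] * B[k][j]) % q
    return C
```

Each PE here performs exactly the modulo-multiply-accumulate operation the article describes; in hardware, the inner loops execute in parallel across the array, and only the outer cycle loop is sequential.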
Barrett Reduction is employed within FHECore as the primary method for performing modular reduction due to its efficiency in hardware implementations. The technique replaces costly division by the modulus q with multiplications and shifts: a constant \mu = \lfloor \frac{2^{2k}}{q} \rfloor is precomputed, where k is chosen so that q < 2^k . To reduce a value a < q^2 , a quotient estimate t = \lfloor \frac{a \cdot \mu}{2^{2k}} \rfloor is computed, followed by a' = a - t \cdot q . Because the estimate undershoots the true quotient by at most one, a' < 2q , so a single conditional subtraction of q yields the final residue. This approach minimizes the number of multiplication and addition operations required for modular reduction, thereby increasing throughput and reducing energy consumption within the systolic array dataflow.
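The reduction steps above can be sketched as a minimal software model of classic Barrett reduction; the function name and the NTT-friendly example modulus q = 7681 are illustrative choices, not taken from the paper.

```python
def barrett_reduce(a: int, q: int, k: int) -> int:
    """Compute a mod q via Barrett reduction, assuming q < 2**k and a < q*q."""
    # Precomputed constant mu = floor(2^(2k) / q); fixed per modulus in hardware.
    mu = (1 << (2 * k)) // q
    # Quotient estimate via a multiply and a shift instead of a division.
    t = (a * mu) >> (2 * k)
    r = a - t * q
    # The estimate undershoots by at most one multiple of q, so one
    # conditional subtraction suffices.
    if r >= q:
        r -= q
    return r
```

Usage: for q = 7681 the constraint q < 2^13 holds, so k = 13 works for any product of two residues, e.g. `barrett_reduce(12345678, 7681, 13)` agrees with `12345678 % 7681`.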
FHECore addresses the performance limitations of Fully Homomorphic Encryption (FHE) by implementing a systolic array architecture, enabling a path towards practical FHE deployments. This design prioritizes efficiency without substantial increases in hardware cost; benchmark results demonstrate a minimal area overhead of only 2.4%. The architecture is particularly well-suited for accelerating computations within the CKKS scheme, a prominent FHE scheme used for approximate number computations, due to the systolic array’s ability to efficiently handle the wide-precision arithmetic central to its operation. This balance between performance gain and resource utilization positions FHECore as a viable accelerator for real-world FHE applications.

GPU Integration: Harnessing Parallelism for Accelerated Computation
FHECore is engineered for compatibility with existing GPU architectures to leverage their inherent parallel processing capabilities. This integration allows for the distribution of computationally intensive fully homomorphic encryption (FHE) operations across numerous GPU cores, significantly reducing overall processing time. By adapting to established GPU frameworks, FHECore avoids the need for specialized hardware and facilitates deployment on widely available systems. The design prioritizes minimizing data transfer between the CPU and GPU, further optimizing performance and enabling efficient scaling with increasing dataset sizes and complexity of homomorphic computations.
FHECore leverages the CUDA and WMMA (Warp Matrix Multiply-Accumulate) APIs to efficiently execute modulo arithmetic operations on the Tensor Cores of compatible NVIDIA GPUs. This implementation maps core cryptographic operations, such as polynomial multiplication and addition, onto the highly parallel Tensor Core architecture. By utilizing WMMA, FHECore optimizes data transfer and computation for matrix-based operations, reducing latency and maximizing throughput. The combination of CUDA for GPU management and WMMA for specialized arithmetic enables significant acceleration of fully homomorphic encryption (FHE) workloads compared to CPU-based implementations.
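The paper’s exact Tensor Core mapping is not reproduced here, but such mappings typically rest on limb decomposition: a wide product becomes a weighted sum of many narrow limb products, which is exactly the shape that matrix units accumulate efficiently. The sketch below demonstrates the decomposition itself; the limb width and count are illustrative assumptions.

```python
def limb_mul(a: int, b: int, limb_bits: int = 16, num_limbs: int = 4) -> int:
    """Multiply two (num_limbs * limb_bits)-bit integers via limb decomposition.

    Illustrative model: each limb-pair product fits in low precision, the
    kind of operation a matrix unit can batch; the shifts restore weights.
    """
    mask = (1 << limb_bits) - 1
    # Split each operand into little-endian limbs.
    al = [(a >> (limb_bits * i)) & mask for i in range(num_limbs)]
    bl = [(b >> (limb_bits * i)) & mask for i in range(num_limbs)]
    acc = 0
    # The full product is the sum of all limb-pair products, each weighted
    # by 2^(limb_bits * (i + j)) — a structure matrix hardware accumulates well.
    for i in range(num_limbs):
        for j in range(num_limbs):
            acc += (al[i] * bl[j]) << (limb_bits * (i + j))
    return acc
```

With the default parameters this reconstructs any 64-bit by 64-bit product exactly; on real hardware the inner double loop is what gets mapped onto a matrix multiply.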
FHECore leverages both software optimizations within the library and hardware acceleration via GPU Tensor Cores to significantly improve the performance of critical homomorphic encryption operations. Specifically, the Number Theoretic Transform (NTT) and Base Conversion, foundational to many Fully Homomorphic Encryption schemes, benefit from this combined approach. Benchmarking with the CKKS scheme demonstrates a 1.57x speedup in processing time when utilizing FHECore and Tensor Core support, indicating a substantial increase in throughput for these primitives compared to CPU-bound implementations. This acceleration is directly attributable to the efficient mapping of these operations onto the massively parallel architecture of modern GPUs.
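To illustrate what the NTT primitive computes, here is a deliberately naive O(n²) transform and its inverse. The small parameters in the usage note (q = 17, n = 4, ω = 4, where ω has multiplicative order 4 mod 17) are illustrative; real FHE implementations use much larger NTT-friendly moduli and O(n log n) butterfly networks.

```python
def ntt(a, omega, q):
    """Naive Number Theoretic Transform: A[j] = sum_i a[i] * omega^(i*j) mod q."""
    n = len(a)
    return [sum(a[i] * pow(omega, i * j, q) for i in range(n)) % q
            for j in range(n)]

def intt(A, omega, q):
    """Inverse NTT: transform with omega^{-1}, then scale by n^{-1} mod q."""
    n = len(A)
    inv_n = pow(n, -1, q)          # modular inverse of n (Python 3.8+)
    inv_omega = pow(omega, -1, q)  # modular inverse of the root of unity
    return [(x * inv_n) % q for x in ntt(A, inv_omega, q)]
```

A round trip `intt(ntt(a, 4, 17), 4, 17)` recovers `a`, and transforming the unit impulse `[1, 0, 0, 0]` yields the all-ones vector, mirroring the familiar DFT behavior in modular arithmetic.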
Integration of FHECore with GPU architectures yields significant performance gains over conventional software-based Fully Homomorphic Encryption (FHE) implementations. Specifically, for CKKS primitives, this synergistic approach demonstrates a 2.41x reduction in dynamic instruction count. This reduction directly translates to fewer CPU operations required to perform the same cryptographic tasks, resulting in lower latency and increased throughput. The decrease in dynamic instruction count indicates a more efficient utilization of hardware resources and a streamlined execution path for CKKS operations within the FHECore framework.

Toward Practical Privacy-Preserving Machine Learning
Recent advancements in Fully Homomorphic Encryption (FHE) have historically been hampered by substantial computational overhead, limiting the feasibility of applying these privacy-enhancing technologies to complex machine learning models. However, the development of FHECore represents a significant leap forward in performance, enabling the deployment of sophisticated algorithms – including deep neural networks – directly on encrypted data. This breakthrough is achieved through a combination of optimized cryptographic primitives and efficient hardware acceleration, dramatically reducing the time and resources required for encrypted computation. Consequently, machine learning models can now operate on sensitive information without requiring decryption, preserving data privacy throughout the entire process and opening doors to previously inaccessible applications in fields where data security is paramount.
The capacity to perform computations on encrypted data unlocks transformative possibilities across several critical sectors. In healthcare, patient data can be analyzed for improved diagnostics and treatment plans without ever being decrypted, safeguarding sensitive medical histories. Financial institutions can detect fraudulent transactions and assess credit risk while maintaining the confidentiality of account details. Moreover, personalized advertising can become more effective and privacy-respecting, allowing users to receive relevant offers without exposing their individual preferences or browsing behavior. This paradigm shift ensures data remains protected throughout the entire machine learning lifecycle, fostering trust and enabling innovation in data-driven applications where privacy is paramount.
The practical implementation of fully homomorphic encryption (FHE) has long been hindered by computational demands, but recent advancements are streamlining integration with existing machine learning infrastructure. FHECore benefits from compatibility with Graphics Processing Units (GPUs), significantly accelerating encrypted computations and enabling the deployment of more complex models. Crucially, the framework is designed to work seamlessly with established software like TensorFlow and PyTorch, minimizing the need for extensive code rewrites or specialized expertise. This ease of integration allows developers to leverage familiar tools and workflows, reducing the barriers to adopting privacy-preserving machine learning techniques within existing pipelines and accelerating the transition towards secure, data-sensitive applications.
The integrity of computations performed on encrypted data by FHECore can be further fortified through the implementation of Zero-Knowledge Proofs. These cryptographic techniques allow verification that a computation was executed correctly, without revealing any information about the data itself or the computation’s internal steps. Essentially, a prover can demonstrate to a verifier that a statement is true – in this case, the accurate execution of a machine learning model on encrypted inputs – without disclosing why it is true. This dual layer of security – encryption protecting data confidentiality and Zero-Knowledge Proofs ensuring computational correctness – is crucial for building trust in privacy-preserving machine learning systems, particularly in high-stakes applications where data breaches or malicious computations could have severe consequences. By independently validating the results of FHECore’s encrypted computations, Zero-Knowledge Proofs mitigate the risk of accepting incorrect or tampered outputs, providing a robust defense against adversarial attacks and enhancing the overall reliability of the system.

Future Directions: Specialization for Enhanced Performance
Current acceleration strategies for Fully Homomorphic Encryption (FHE) often rely on Graphics Processing Units (GPUs), which deliver substantial speed improvements over traditional CPU-based implementations. However, the inherent architecture of GPUs, designed for parallel graphics rendering, isn’t ideally suited for the bit-serial nature of many FHE operations. Consequently, researchers are increasingly investigating Field-Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs) as alternatives. FPGAs offer a reconfigurable hardware platform, allowing for customization and optimization of FHE algorithms, providing a valuable bridge between software flexibility and hardware acceleration. ASICs, designed from the ground up specifically for FHE computations, represent the pinnacle of performance and energy efficiency; while less flexible than FPGAs, their tailored architecture minimizes overhead and maximizes throughput, potentially unlocking orders of magnitude improvement in real-world FHE applications.
Field-programmable gate arrays (FPGAs) present a compelling pathway for accelerating fully homomorphic encryption (FHE) due to their reconfigurable nature. Unlike application-specific integrated circuits (ASICs), FPGAs allow researchers and developers to rapidly prototype and iterate on designs for FHECore, the foundational encryption library. This flexibility is particularly valuable as the field of homomorphic encryption is still evolving; algorithms and optimizations are constantly emerging. An FPGA implementation enables customization of the hardware architecture to precisely match the demands of specific applications, such as privacy-preserving inference or secure data analytics, yielding performance gains over general-purpose processors. By adapting the hardware to the algorithm, rather than vice versa, FPGAs bridge the gap between software innovation and efficient execution, offering a dynamic platform for exploring and deploying advanced FHE techniques.
Application-Specific Integrated Circuits (ASICs) represent a compelling pathway toward maximizing the performance and energy efficiency of Fully Homomorphic Encryption (FHE) systems. Unlike general-purpose processors or even Field-Programmable Gate Arrays (FPGAs), ASICs are designed from the ground up to execute specific computational tasks – in this case, the complex mathematical operations inherent in FHE. This specialization allows for significant reductions in both latency and power consumption, as unnecessary circuitry is eliminated and existing circuits are optimized for FHE primitives. While the initial design and fabrication costs are substantial, the resulting hardware offers a level of performance unattainable with more flexible platforms, potentially unlocking practical applications of privacy-preserving machine learning that are currently computationally prohibitive. By meticulously tailoring the circuit architecture to the unique demands of FHE, ASICs promise to deliver a substantial leap forward in realizing the full potential of this transformative technology.
Realizing the transformative potential of privacy-preserving machine learning hinges not on a singular breakthrough, but on the synergistic optimization of both software and hardware. Current software frameworks, while increasingly sophisticated, are often bottlenecked by the limitations of general-purpose computing architectures. Addressing this requires a co-design approach, where algorithmic advancements in homomorphic encryption and privacy-enhancing technologies are paired with specialized hardware accelerators. This includes exploring field-programmable gate arrays (FPGAs) for flexible prototyping and application-specific integrated circuits (ASICs) for peak performance and energy efficiency. Such a combined strategy promises to overcome computational barriers, reduce latency, and minimize power consumption – ultimately enabling the widespread adoption of machine learning techniques that safeguard data privacy without compromising analytical power.

The design presented within FHECore embodies a philosophy of holistic system consideration. It isn’t merely about accelerating individual operations like wide-precision modulo arithmetic – a core component of the CKKS scheme – but about restructuring the GPU microarchitecture to accommodate the unique demands of fully homomorphic encryption. This mirrors John von Neumann’s insight: “It is possible to carry out any operation which can be defined on paper.” The work meticulously crafts a functional unit, a systolic array, that translates abstract cryptographic definitions into tangible hardware, recognizing that every simplification in design – such as optimizing for specific parameters – has a cost, and every clever trick introduces potential risks. The result is not just faster encryption, but a cohesive system where structure dictates behavior, maximizing efficiency across the entire FHE workflow.
What Lies Ahead?
The acceleration of fully homomorphic encryption – a field built on the once-seemingly impossible feat of computing on encrypted data – has always demanded a peculiar sort of engineering. FHECore’s approach, by focusing on the granular efficiency of wide-precision arithmetic within a systolic array, represents a measured step toward practical realization. Yet, the core challenge remains stubbornly architectural. Speeding individual operations is insufficient; the data dependencies inherent in FHE schemes – particularly the CKKS variant – create bottlenecks that any localized acceleration will eventually encounter. The next iteration must consider holistic dataflow, not merely functional unit optimization.
Furthermore, the minimal area overhead reported by this work is a temporary reprieve. As the demands of real-world encryption – larger datasets, more complex computations – increase, the specialized hardware will inevitably require greater resources. The trade-off between performance and area is not static. It’s a dynamic equation requiring continued innovation in circuit design and algorithmic refinement, pushing the boundaries of both hardware and software co-design.
Ultimately, the true test will not be achieving marginal gains on benchmark datasets, but the seamless integration of FHE into existing computational ecosystems. The promise of privacy-preserving computation hinges on its accessibility, not its esoteric speed. Good architecture is invisible until it breaks, and only then is the true cost of decisions visible.
Original article: https://arxiv.org/pdf/2602.22229.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-02-28 14:08