Keeping Machine Learning Secrets Safe

Author: Denis Avetisyan


A new study benchmarks the leading cryptographic techniques for protecting data during machine learning tasks.

Dense models establish performance benchmarks.

This review provides a pragmatic comparison of Fully Homomorphic Encryption and Secure Multi-Party Computation for privacy-preserving model inference.

Despite growing demand for privacy-preserving machine learning, selecting the optimal cryptographic computation technology remains a significant challenge for practitioners. This paper, ‘A Pragmatic Comparison of Cryptographic Computation Technologies for Machine Learning’, addresses this gap by presenting a detailed comparative analysis of secure multi-party computation (SMPC) and fully homomorphic encryption (FHE) through extensive benchmarking of relevant software frameworks. Our results reveal a nuanced performance landscape, with FHE demonstrating advantages for regressions and simpler networks, while SMPC excels with complex models like convolutional neural networks. Will these findings enable a more informed and technology-agnostic approach to deploying secure machine learning solutions?


The Inherent Vulnerabilities of Centralized Machine Learning

Conventional machine learning methodologies frequently necessitate the aggregation of sensitive data within a central repository, thereby creating inherent privacy vulnerabilities and limiting opportunities for collaborative research. This centralized approach exposes individual records to potential breaches and misuse, raising significant concerns under increasingly stringent data protection regulations. Furthermore, the requirement for data consolidation presents logistical and legal hurdles, particularly when dealing with geographically dispersed or competitively sensitive information. Consequently, innovation is often stifled, as organizations and individuals hesitate to share data necessary for training robust and generalizable models. The limitations of this traditional paradigm are driving a search for alternative techniques that enable machine learning without compromising data privacy or hindering collaborative efforts.

The current landscape of data science is undergoing a significant shift, driven by both evolving legal frameworks and a heightened public consciousness regarding personal information. Regulations like GDPR and CCPA are establishing stricter guidelines for data collection and usage, compelling organizations to rethink traditional machine learning pipelines that rely on centralized datasets. Simultaneously, users are increasingly aware of how their data is utilized, demanding greater control and transparency. This dual pressure necessitates innovative approaches to model training and inference – techniques such as federated learning, differential privacy, and homomorphic encryption – which allow algorithms to learn from distributed data sources without directly accessing or revealing sensitive individual records. These developments aren’t merely about compliance; they represent a fundamental reimagining of how data-driven insights are generated, prioritizing user privacy as a core design principle.

Current privacy-preserving machine learning techniques, while promising, frequently encounter limitations in real-world deployment. Approaches like federated learning or differential privacy, designed to safeguard data confidentiality, often necessitate computational compromises. These can manifest as reduced model accuracy – a critical concern in sensitive applications – or decreased efficiency, increasing training times and resource demands. Furthermore, many solutions struggle with scalability, proving difficult to implement with very large datasets or complex models. This trade-off between privacy, accuracy, and practicality remains a significant hurdle, hindering the widespread adoption of these technologies and necessitating continued research into more robust and versatile methods.

Secure Computation: A Foundation for Provable Privacy

Secure Multi-Party Computation (SMPC) is a cryptographic protocol that enables multiple parties to jointly compute a function over their private inputs while keeping those inputs confidential. This is achieved through a distributed computation where each party’s data remains encrypted throughout the entire process; no individual party learns anything about the other parties’ inputs beyond what can be inferred from the final result. SMPC protocols typically utilize techniques like secret sharing, where each input is divided into multiple shares distributed among the participants, or garbled circuits, which allow computation on encrypted data without decryption. The output of the computation is revealed, but the individual inputs remain concealed, providing a robust solution for collaborative data analysis and privacy-preserving machine learning.
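The secret-sharing idea described above can be illustrated with a minimal sketch. This toy additive scheme over a prime field is not a production protocol (real SMPC adds authentication, secure channels, and multiplication protocols), but it shows why no single share leaks anything while the sum of all shares recovers the joint result:

```python
import random

# Toy additive secret sharing over a prime field: each party's input is
# split into random shares that individually reveal nothing, yet the
# combined shares of all parties reconstruct the joint result.
P = 2**61 - 1  # a prime modulus (Mersenne prime)

def share(secret, n_parties):
    """Split `secret` into n additive shares mod P."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Example: three parties jointly compute the total of their private counts.
inputs = [120, 340, 95]
all_shares = [share(x, 3) for x in inputs]
# Party j locally sums the j-th share of every input...
partial = [sum(s[j] for s in all_shares) % P for j in range(3)]
# ...and only the combined partials reveal the total.
assert reconstruct(partial) == sum(inputs)  # 555
```

Because addition commutes with sharing, each party works only on meaningless-looking residues; revealing `partial` discloses the sum but none of the individual inputs.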

Fully-Homomorphic Encryption (FHE) is a form of encryption that allows for arbitrary computations to be performed on encrypted data – ciphertext – without requiring prior decryption. Traditional encryption methods necessitate decryption before any processing can occur, exposing the underlying plaintext. FHE, however, utilizes specialized cryptographic algorithms that preserve data confidentiality even during computation. This is achieved through mathematical operations performed directly on the ciphertext, resulting in an encrypted result that, when decrypted, matches the result of the same computation performed on the original plaintext. The security of FHE relies on the hardness of specific mathematical problems, such as the Ring Learning with Errors (RLWE) problem, and is crucial for applications demanding data privacy while enabling data analysis and processing. Current FHE schemes often involve significant computational overhead, limiting their practical applicability, but ongoing research focuses on improving efficiency and scalability.
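The core "compute on ciphertext" property can be seen in a much simpler relative of FHE: the Paillier cryptosystem, which is only *additively* homomorphic (full FHE schemes such as TFHE support arbitrary circuits). The sketch below uses deliberately tiny, insecure parameters purely to demonstrate that multiplying two ciphertexts yields an encryption of the sum of their plaintexts:

```python
import random
from math import gcd

# Minimal Paillier demo (toy-sized primes; NOT secure).
# Homomorphic property: Enc(m1) * Enc(m2) mod n^2 decrypts to m1 + m2.
p, q = 293, 433
n = p * q
n2 = n * n
g = n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)  # lcm(p-1, q-1)

def L(u):
    return (u - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)  # precomputed decryption constant

def encrypt(m):
    r = random.randrange(2, n)
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

c1, c2 = encrypt(17), encrypt(25)
# Adding plaintexts = multiplying ciphertexts, without ever decrypting.
assert decrypt((c1 * c2) % n2) == 42
```

Full FHE extends this idea to both addition and multiplication on ciphertexts, which is exactly what makes arbitrary computation possible and also what drives the overhead discussed above.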

Secure computation techniques, including Secure Multi-Party Computation (SMPC) and Fully-Homomorphic Encryption (FHE), mitigate data privacy risks within machine learning by preventing direct access to sensitive input data. These methods allow for computations – such as model training and inference – to be performed on encrypted data, ensuring that data owners retain control over their information. The output of the computation is also encrypted, and can only be decrypted by authorized parties, effectively isolating the raw data from unauthorized exposure throughout the entire machine learning pipeline. This approach satisfies privacy requirements while still enabling the benefits of data analysis and model building.

Simple CNNs perform significantly better under Secure Multi-Party Computation (SMPC) than under FHE on this benchmark.

Frameworks Enabling Privacy-Preserving Machine Learning

Secretflow-SPU is a prominent Secure Multi-Party Computation (SMPC) framework specifically engineered for privacy-preserving machine learning applications. It utilizes JAX, a high-performance numerical computation library, to facilitate efficient model implementation and automatic differentiation. Further optimization is achieved through integration with the XLA (Accelerated Linear Algebra) compiler, which enables just-in-time compilation and optimization of JAX code for various hardware platforms, including CPUs, GPUs, and TPUs. This combination of JAX and XLA allows Secretflow-SPU to support complex machine learning models while maintaining a focus on computational efficiency and scalability in secure computation settings.

Concrete-ML is a fully homomorphic encryption (FHE) framework specifically designed for machine learning applications, with its core functionality built upon the TFHE (Torus Fully Homomorphic Encryption) library. This framework focuses on optimizing performance for ML workloads through techniques such as quantization, which reduces the precision of numerical representations to decrease computational overhead. The use of TFHE enables computations directly on encrypted data without decryption, preserving data privacy. Concrete-ML’s architecture is tailored to address the unique demands of machine learning algorithms, allowing for the secure execution of models without exposing the underlying data.
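The quantization mentioned above can be sketched in a few lines. This is not Concrete-ML's actual API, just an illustrative uniform int8 quantizer of the general kind FHE frameworks apply before encryption, since TFHE-style circuits operate on low-precision integer values:

```python
import numpy as np

# Illustrative symmetric uniform quantization: floats are mapped to
# small integers so that subsequent (encrypted) arithmetic works on
# low-precision values, at the cost of a bounded rounding error.
def quantize(x, bits=8):
    qmax = 2 ** (bits - 1) - 1          # 127 for int8
    scale = np.abs(x).max() / qmax      # one shared step size
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale

x = np.array([0.31, -1.20, 0.05, 0.88])
q, s = quantize(x)
x_hat = dequantize(q, s)
# Rounding error is at most half a quantization step.
assert np.max(np.abs(x - x_hat)) <= 0.5 * s
```

The trade-off is explicit: fewer bits means cheaper homomorphic operations but a coarser step size `s`, and hence lower numerical precision in the model's outputs.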

Performance comparisons of Concrete-ML and Secretflow-SPU during matrix multiplication reveal a performance trade-off dependent on matrix size. Concrete-ML, utilizing Fully Homomorphic Encryption (FHE), demonstrates superior performance with smaller matrices due to lower computational overhead associated with encryption and decryption. However, as matrix dimensions increase, the computational cost of FHE grows disproportionately. Conversely, Secretflow-SPU, employing Secure Multi-Party Computation (SMPC), exhibits better scalability with larger matrices. This is attributed to SMPC’s ability to distribute computation and reduce the burden on individual parties, offsetting the communication overhead inherent in the protocol. This suggests that the optimal framework selection is contingent on the specific characteristics of the machine learning workload and the size of the matrices involved.

Current secure computation frameworks, such as Secretflow-SPU and Concrete-ML, support the implementation of complex machine learning models beyond simple computations. Specifically, both frameworks have been successfully applied to Dense Neural Networks (DNNs) and Convolutional Neural Networks (CNNs), demonstrating the ability to perform privacy-preserving inference and training on these architectures. This functionality is achieved through techniques like Secure Multi-Party Computation (SMPC) and Fully-Homomorphic Encryption (FHE), which enable computation on encrypted data without requiring decryption during the process. Successful implementation on DNNs and CNNs validates the practical applicability of secure computation for a broad range of real-world machine learning tasks.

Neural Network Building Blocks in Secure Computational Settings

Matrix multiplication is a fundamental operation within both Dense and Convolutional Neural Networks (DNNs and CNNs, respectively), serving as the core computational element in layers such as fully connected layers in DNNs and convolutional filters in CNNs. Its adaptation for secure computation involves techniques like secret sharing or homomorphic encryption, allowing computations to be performed on encrypted data without decryption. This is achieved by distributing the matrix and vector elements among multiple parties or by transforming them into encrypted equivalents, enabling the computation of y = Wx + b without revealing the values of W, x, or b. Secure matrix multiplication is computationally intensive, and optimizations, such as efficient secret sharing schemes and optimized communication protocols, are crucial for practical implementation in privacy-preserving machine learning systems.
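A key reason linear layers are comparatively cheap under secret sharing is that matrix multiplication commutes with additive shares. The sketch below (plain NumPy floats for clarity; real protocols use fixed-point arithmetic over rings) shows two parties each computing on one share of a private input `x`, with public parameters `W` and `b`:

```python
import numpy as np

# y = Wx + b evaluated on additive shares of x: neither party ever
# sees x in the clear, yet the reconstructed output is exact.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # public weight matrix
b = rng.standard_normal(3)        # public bias
x = rng.standard_normal(4)        # private input

x_share1 = rng.standard_normal(4) # random mask held by party 1
x_share2 = x - x_share1           # remainder held by party 2

# Each party computes locally on its own share...
y_share1 = W @ x_share1 + b
y_share2 = W @ x_share2
# ...and the shares of the output sum to the plaintext result.
y = y_share1 + y_share2
assert np.allclose(y, W @ x + b)
```

Because the computation is local per share, the communication cost of a linear layer is small; it is the non-linear steps that force the expensive interactive protocols.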

Activation functions are fundamental to neural network operation, introducing non-linearity which enables the modeling of complex relationships within data. Common activation functions include the Rectified Linear Unit (ReLU), defined as f(x) = max(0, x); the Gaussian Error Linear Unit (GELU), which utilizes the cumulative distribution function of the standard normal distribution; and the Sigmoid function, expressed as σ(x) = 1 / (1 + e^(-x)). Maintaining this non-linearity is critical even within privacy-preserving computation paradigms; without non-linear activations, a neural network would simply be a linear model, severely limiting its expressive power. These functions are applied element-wise to the output of preceding layers, and their properties directly impact the network’s ability to learn and generalize, regardless of whether standard or secure computation techniques are employed.
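For reference, the three activations benchmarked here can be written directly from their definitions. These are the plaintext functions; under SMPC or FHE the comparison in ReLU and the exponentials in GELU and Sigmoid are precisely the operations that require costly protocols or polynomial approximations:

```python
import numpy as np
from math import erf, sqrt

def relu(x):
    # f(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF
    phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))
    return x * phi(x)

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
assert np.allclose(relu(x), [0.0, 0.0, 2.0])
assert np.allclose(sigmoid(np.array([0.0])), [0.5])
```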

Performance benchmarks indicate that the Secretflow-SPU framework consistently outperforms Concrete-ML across three common activation functions: ReLU, GELU, and Sigmoid. Specifically, SPU achieves faster computation times for these functions, which are critical for maintaining non-linearity in neural networks used in privacy-preserving machine learning. This improved performance is observed when conducting secure multiparty computation, enabling more efficient execution of complex models without compromising data confidentiality. The speed differential suggests SPU’s architectural design and optimizations are better suited to the computational demands of these activation functions within a secure setting.

Frameworks such as Secretflow-SPU and Concrete-ML facilitate the construction of complex neural network models while addressing data privacy concerns through techniques like Secure Multi-Party Computation (SMPC) and Federated Learning. These frameworks abstract the complexities of cryptographic protocols, enabling developers to utilize familiar machine learning workflows with minimal code modification. Specifically, they allow core neural network operations – including matrix multiplication and activation functions – to be performed on encrypted data, ensuring that individual data points remain confidential throughout the computation process. This is achieved by distributing computations across multiple parties or utilizing privacy-enhancing technologies to mask individual inputs, thereby preserving data privacy while still enabling model training and inference.

Towards Practical Applications of Privacy-Preserving Machine Learning

Predictive modeling often requires access to sensitive data, creating a fundamental conflict between insight generation and individual privacy. However, recent advancements in cryptographic techniques are dissolving this barrier. Specifically, algorithms like Random Forest Regression and Linear Regression can be adapted for privacy-preserving applications through the implementation of Secure Multi-Party Computation (SMPC) and Fully-Homomorphic Encryption (FHE). SMPC allows computations to be performed on encrypted data without revealing the data itself, while FHE enables computations directly on encrypted data, meaning data remains confidential throughout the entire process. This effectively allows models to be trained and inferences to be made without ever exposing the underlying raw data, fostering secure collaboration and unlocking the potential of data-driven discovery in fields where privacy is a critical concern.
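To make the FHE route concrete for the simplest case, here is a self-contained sketch of linear-regression inference over an additively homomorphic (Paillier-style) scheme. The integer weights stand in for the fixed-point encodings real frameworks use, and the toy parameters are insecure; this illustrates the principle rather than any framework's actual pipeline:

```python
import random
from math import gcd

# Toy Paillier setup (insecure parameter sizes, for illustration only).
p_, q_ = 293, 433
n = p_ * q_
n2 = n * n
g = n + 1
lam = (p_ - 1) * (q_ - 1) // gcd(p_ - 1, q_ - 1)
L = lambda u: (u - 1) // n
mu = pow(L(pow(g, lam, n2)), -1, n)

def enc(m):
    r = random.randrange(2, n)
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return (L(pow(c, lam, n2)) * mu) % n

w, b = [3, 5, 2], 7          # server-side model, integer-scaled weights
x = [4, 1, 6]                # private client features
cx = [enc(v) for v in x]     # client sends only ciphertexts

# Server: Enc(x_i)^{w_i} is Enc(w_i * x_i); multiplying ciphertexts
# sums the plaintexts, so this accumulates Enc(w.x + b).
cy = enc(b)
for ci, wi in zip(cx, w):
    cy = (cy * pow(ci, wi, n2)) % n2

assert dec(cy) == 3*4 + 5*1 + 2*6 + 7  # = 36
```

Only the client can decrypt `cy`, so the server learns nothing about the features; this is also why linear models are the friendliest case for FHE, consistent with the speed advantage reported below.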

A significant advantage of utilizing Secure Multi-Party Computation (SMPC) for linear regression lies in its predictable and remarkably consistent inference time. Studies demonstrate that, unlike many machine learning algorithms where processing time increases with data complexity, SMPC-enabled linear regression maintains a constant inference speed of 0.079 seconds – irrespective of the number of features used in the model. This consistent performance is crucial for real-time applications and scenarios demanding predictable response times, as it eliminates the variability often associated with increasing dataset dimensionality and ensures reliable operation even with high-dimensional data. The ability to achieve this predictable speed without sacrificing privacy represents a substantial step forward in practical, privacy-preserving machine learning.

For linear regression tasks, Fully-Homomorphic Encryption (FHE) demonstrates a significant performance advantage over Secure Multi-Party Computation (SMPC). Studies reveal FHE can achieve inference speeds up to 17 times faster than SMPC, a benefit particularly pronounced when dealing with datasets containing a limited number of features. This acceleration stems from FHE’s ability to perform computations directly on encrypted data without decryption, reducing communication overhead inherent in SMPC protocols. While both techniques safeguard data privacy during analysis, the speed differential suggests FHE as a more viable option for real-time predictive modeling in scenarios where computational efficiency is critical and data complexity remains manageable.

While Fully-Homomorphic Encryption (FHE) demonstrates promise for privacy-preserving machine learning, its computational intensity currently presents significant limitations, especially when applied to complex models. Even with relatively small images – an 8×8 pixel resolution – FHE inference can require hours of processing time. This substantial delay stems from the intricate cryptographic operations needed to perform computations on encrypted data, hindering its practicality for tasks involving larger datasets or more intricate architectures like Convolutional Neural Networks (CNNs). The sheer volume of calculations required for these complex models, combined with the overhead of FHE, currently makes real-time or near real-time inference infeasible, underscoring the need for continued research into optimizing FHE schemes and exploring hybrid approaches that balance privacy with performance.

The confluence of secure multi-party computation and fully-homomorphic encryption represents a paradigm shift in data collaboration, enabling multiple parties to jointly analyze sensitive information without ever revealing their individual datasets. This capability unlocks the potential for valuable insights across numerous fields, from healthcare research-where patient data remains confidential while contributing to larger studies-to financial modeling, where competitive advantages are maintained while collaborative risk assessments are performed. By allowing computations to be performed on encrypted data, these techniques address the critical need for data privacy in an increasingly interconnected world, fostering trust and enabling previously impossible collaborations. The result is a secure environment where knowledge can be extracted from collective data without sacrificing the confidentiality of individual contributions, ultimately driving innovation while upholding ethical data handling practices.

The convergence of privacy-enhancing technologies like Secure Multi-Party Computation and Fully-Homomorphic Encryption is poised to revolutionize data-driven innovation across highly sensitive sectors. Healthcare stands to benefit from collaborative diagnostics and personalized treatment plans without exposing patient records, while the financial industry can leverage fraud detection and risk assessment models without compromising account holder privacy. Beyond these core areas, advancements in privacy-preserving machine learning promise to unlock valuable insights from datasets in fields like insurance, legal services, and even urban planning – all while adhering to increasingly stringent data protection regulations and fostering greater public trust in data-driven systems. This capability to extract knowledge from data without direct access to the underlying information represents a paradigm shift, enabling a future where data utility and individual privacy are no longer mutually exclusive.

Random Forest Regression provides a robust benchmark for evaluating model performance.

The pursuit of practical cryptographic solutions, as detailed in the comparison of SMPC and FHE, echoes a fundamental tenet of computational elegance. The article highlights the trade-offs between theoretical ideals and real-world constraints – a balance inherent in all robust systems. This resonates with Ken Thompson’s observation: “There’s no reason to have complicated code when you can have simple code.” The study’s pragmatic approach to benchmarking, focusing on model inference and protocol comparison, demonstrates a preference for demonstrable function over purely theoretical guarantees. A solution, like an algorithm, must work consistently, even if it isn’t the most mathematically ‘beautiful’-its correctness is paramount.

Future Directions

The observed performance disparities between Secure Multi-Party Computation and Fully Homomorphic Encryption are not, fundamentally, surprising. The elegance of a protocol lies not in its speed – a merely engineering concern – but in its provable security and minimal reliance on heuristic assumptions. While FHE currently suffers from computational overhead, its deterministic nature offers a path toward formal verification of privacy guarantees that eludes many SMPC constructions. The pursuit of algorithmic improvements in FHE, therefore, should prioritize reducing the gap between theoretical complexity and practical performance, even at the expense of short-term gains in speed.

A crucial, often overlooked, limitation remains the model itself. The vast majority of machine learning models are designed without privacy in mind. Adapting existing models, or constructing new ones, that are inherently amenable to cryptographic computation is paramount. Simply applying a privacy-preserving layer to a poorly designed model is akin to adding ornamentation to a flawed structure. A proof of model robustness against adversarial attacks, combined with a formal privacy analysis of the computation protocol, is the ultimate goal – a standard currently absent from the field.

Future work must also address the practical challenges of key management and distribution in distributed settings. The current reliance on trusted authorities or complex key exchange protocols introduces vulnerabilities that negate the benefits of cryptographic computation. A truly decentralized and auditable key management system, perhaps leveraging concepts from zero-knowledge proofs, is essential to realize the full potential of privacy-preserving machine learning. The pursuit of efficiency is secondary; correctness remains the singular imperative.


Original article: https://arxiv.org/pdf/2605.04858.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-05-08 00:11