Fuzzy Matching, Securely: A New Approach to Private Set Intersection

Author: Denis Avetisyan


Researchers have developed a highly efficient protocol for fuzzy private set intersection, enabling secure data analysis without revealing exact matches.

This work introduces a linear-complexity protocol leveraging secret-shared oblivious PRFs and prefix optimizations for improved performance in fuzzy PSI.

Existing fuzzy private set intersection (FPSI) protocols often suffer from computational overhead or scalability limitations due to reliance on expensive homomorphic encryption or complexity exponential in the data dimension. This work, ‘Efficient Fuzzy Private Set Intersection from Secret-shared OPRF’, introduces efficient FPSI protocols for L_{p \in [1, \infty]} distance metrics, achieving linear complexity in set sizes, dimension, and distance threshold. Leveraging a novel secret-shared oblivious programmable PRF and a prefix optimization technique, the protocols significantly outperform state-of-the-art constructions, demonstrating speedups of up to 145x and communication cost reductions of up to 8x. Could these advancements unlock broader applications of privacy-preserving data analysis in areas requiring approximate matching?


Data’s Shadow: The Limits of Exact Matching

The increasing demand for data-driven insights often necessitates collaboration while safeguarding sensitive information, making private set intersection (PSI) a foundational technology for privacy-preserving data analysis. Traditional PSI protocols, however, operate on the principle of exact matches: they identify the common elements between two datasets only if those elements are identical. This rigidity poses a significant challenge when dealing with real-world data, where variations due to human error, differing measurement techniques, or simple inconsistencies are commonplace. Consequently, conventional methods frequently fail to identify meaningful overlaps, hindering effective analysis and potentially overlooking valuable insights. The inability to accommodate approximate matches limits the utility of PSI in practical applications, creating a critical need for more flexible and robust solutions capable of handling imperfect data while maintaining stringent privacy guarantees.

Real-world datasets are rarely pristine; they frequently contain errors, inconsistencies, and approximations, a fact that significantly complicates secure data analysis. Traditional PSI protocols rely on exact matches and fail when faced with such ‘fuzzy’ data. Consider, for instance, a medical database where patient names might be misspelled or slightly varied, or a financial dataset where transaction amounts are rounded. A strict, exact-match PSI would miss these near-identical records, yielding incomplete results. Consequently, a ‘fuzzy’ approach to PSI, one that accounts for minor discrepancies and allows for approximate matches, becomes essential for extracting meaningful insights from these imperfect yet abundant data sources. This necessitates developing protocols capable of identifying records that are ‘close enough’ to be considered a match, even if not precisely identical, thus broadening the applicability of privacy-preserving data analysis.

Current fuzzy private set intersection (FPSI) protocols, while enabling privacy-preserving data analysis with imprecise matches, often encounter significant hurdles when applied to large datasets. The computational complexity inherent in handling fuzzy comparisons, which assess the degree of similarity rather than strict equality, introduces substantial overhead. This results in scalability issues, making these protocols impractical for many real-world applications involving millions of records. Researchers are therefore actively investigating techniques to reduce this computational burden, focusing on optimized algorithms, data structures, and cryptographic approaches that minimize communication costs and processing time without compromising the privacy guarantees or accuracy of the fuzzy matching process. Improving efficiency is paramount to unlocking the full potential of fuzzy PSI in domains such as personalized medicine, collaborative filtering, and secure data mining.

Mapping the Imprecise: A Protocol for Fuzzy Sets

FuzzyMapping, the core technique employed within this protocol, addresses the limitations of traditional set comparison by assigning identifiers to set elements that allow for approximate matching. Unlike exact matching which requires complete element-by-element correspondence, FuzzyMapping generates identifiers based on element characteristics, allowing comparisons even when sets do not share identical members. This is achieved by mapping elements to identifiers within a defined range, with proximity in the identifier space reflecting similarity between the corresponding elements. The degree of permissible variation in identifier values dictates the sensitivity of the approximate comparison, providing a tunable parameter for balancing accuracy and recall. This approach is particularly beneficial when dealing with noisy or incomplete data, or when comparing sets with inherent variations, such as those derived from sensor readings or user-generated content.
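To make the idea of proximity-reflecting identifiers concrete, here is a toy sketch (not the paper's actual FuzzyMapping construction) in which each point is mapped to the ID of the axis-aligned grid cell of side 2·delta that contains it. Two points within L∞ distance delta of each other then land in the same cell or one of its neighbours, so cell IDs can act as approximate-match identifiers:

```python
from itertools import product

def cell_id(point, delta):
    """Identifier of the grid cell (side length 2*delta) containing `point`."""
    return tuple(coord // (2 * delta) for coord in point)

def candidate_ids(point, delta):
    """IDs of the cell containing `point` plus all adjacent cells.

    Any point within L-infinity distance delta of `point` has its
    cell ID somewhere in this candidate set.
    """
    base = cell_id(point, delta)
    return {tuple(c + off for c, off in zip(base, offsets))
            for offsets in product((-1, 0, 1), repeat=len(base))}

# Two nearby points share at least one candidate cell identifier.
p, q, delta = (10, 17), (12, 15), 4
assert cell_id(q, delta) in candidate_ids(p, delta)
```

The tunable parameter delta plays the role described above: widening it makes the comparison more tolerant, narrowing it makes matching stricter.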

The FuzzyMapping protocol leverages an Oblivious Key-Value Store (OKVS) and a secret-shared variant of an oblivious programmable pseudorandom function, denoted OPPRF_so, to ensure data privacy during the mapping process. The OKVS allows retrieval of values associated with keys without revealing the keys themselves to the server. OPPRF_so enables computation of a pseudorandom function value on a key without the server learning the key. This combination prevents the mapping service from learning either the input elements being mapped or the resulting identifiers, preserving confidentiality and supporting privacy-preserving set operations.
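To illustrate the OKVS interface, the following is a deliberately minimal polynomial-interpolation sketch: Decode returns the programmed value for every inserted key, while decoding any other key yields an unrelated field element. Practical OKVS constructions (including the one the paper builds on) are far more efficient, and a real encoding is a key-hiding vector rather than the plain pair list used here; this only pins down the encode/decode contract:

```python
P = 2**61 - 1  # Mersenne prime used as the field modulus (illustrative)

def okvs_encode(pairs):
    """Store the (key, value) evaluation points of an interpolating
    polynomial. NOTE: a real OKVS outputs a vector that hides the keys;
    this toy version keeps them visible purely to demonstrate decoding."""
    return list(pairs.items())

def okvs_decode(encoding, key):
    """Evaluate the unique interpolating polynomial at `key`
    (Lagrange form, arithmetic mod P)."""
    total = 0
    for xi, yi in encoding:
        num, den = 1, 1
        for xj, _ in encoding:
            if xj != xi:
                num = num * (key - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

table = okvs_encode({5: 111, 9: 222, 42: 333})
assert okvs_decode(table, 9) == 222              # programmed key
assert okvs_decode(table, 7) not in (111, 222, 333)  # unprogrammed key
```

Decoding an unprogrammed key lands on an arbitrary field element, which is exactly the property an OPPRF exploits to make programmed and unprogrammed outputs indistinguishable to the querier.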

PrefixOptimization within the FuzzyMapping protocol leverages the inherent structure of set element identifiers to minimize data transmission and computational overhead. Specifically, the technique exploits common prefixes among identifiers; rather than transmitting or computing on the entire identifier for each comparison, only the differing suffix is processed. This is achieved by pre-computing and storing common prefixes, thereby reducing the amount of data requiring secure computation within the Oblivious Key-Value Store (OKVS) and the Oblivious Programmable Pseudo-Random Function (OPPRF_so). The efficiency gains are proportional to the length of the common prefixes and the number of comparisons performed, resulting in a substantial reduction in both communication and computation costs, particularly for large datasets with significant identifier overlap.
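The general idea behind prefix techniques can be seen in miniature with interval covering: a range of 2·delta + 1 consecutive integers can be represented by O(log delta) binary prefixes instead of being enumerated point by point, where each prefix (value, bit-length) stands for all integers sharing those leading bits. The paper's PrefixOptimization is a specific protocol-level application; this sketch only demonstrates the counting argument:

```python
def prefix_cover(lo, hi, width=32):
    """Canonical minimal set of binary prefixes covering [lo, hi].

    Each returned (value, bits) pair denotes all `width`-bit integers
    whose top `bits` bits equal `value`.
    """
    prefixes = []
    while lo <= hi:
        # Largest power-of-two-aligned block starting at lo...
        size = lo & -lo if lo > 0 else 1 << width
        # ...shrunk until it fits inside the remaining interval.
        while size > hi - lo + 1:
            size //= 2
        bits = width - size.bit_length() + 1  # prefix length for this block
        prefixes.append((lo >> (width - bits), bits))
        lo += size
    return prefixes

# A radius-1000 interval (2001 points) needs only a handful of prefixes.
cover = prefix_cover(5000 - 1000, 5000 + 1000)
assert len(cover) < 25
# Sanity check: the blocks cover exactly 2001 integers.
assert sum(1 << (32 - bits) for _, bits in cover) == 2001
```

This logarithmic-versus-linear gap is what drives the reduction in threshold-dependent cost mentioned later in the article.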

The Language of Proximity: Defining Similarity with LpDistance

The LpDistance metric calculates the distance between two data elements represented as vectors by summing the p-th power of the absolute differences between their corresponding components, and then taking the p-th root. Formally, for vectors x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n), the LpDistance is defined as ||x - y||_p = (\sum_{i=1}^{n} |x_i - y_i|^p)^{1/p}. The value of ‘p’ determines the type of norm used; p=1 yields Manhattan distance, p=2 yields Euclidean distance, and p=∞ yields Chebyshev distance. In the context of fuzzy mapping, LpDistance provides a quantifiable measure of dissimilarity, enabling the algorithm to identify elements that are ‘close enough’ to be considered similar, even if not identical, based on a defined threshold.
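A plain (non-private) reference implementation of the formula above, covering the three named special cases:

```python
def lp_distance(x, y, p):
    """L_p distance between equal-length vectors x and y."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        return max(diffs)                      # Chebyshev distance
    return sum(d ** p for d in diffs) ** (1 / p)

x, y = (1, 2, 3), (4, 6, 3)
assert lp_distance(x, y, 1) == 7               # Manhattan: 3 + 4 + 0
assert lp_distance(x, y, 2) == 5.0             # Euclidean: sqrt(9 + 16)
assert lp_distance(x, y, float("inf")) == 4    # Chebyshev: max difference
```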

The LpDistance metric facilitates a trade-off between privacy and accuracy by adjusting the permissible deviation when comparing set elements. The ‘p’ value in L_p controls the sensitivity of the distance calculation; higher values of ‘p’ emphasize larger differences, leading to a more conservative comparison that prioritizes accuracy but potentially reveals more information. Conversely, lower values of ‘p’ allow for greater tolerance of difference, enhancing privacy by generalizing the comparison, but potentially reducing the accuracy of the approximate set intersection. This tunability allows system administrators to configure the metric to meet specific requirements regarding data sensitivity and acceptable error rates.

The integration of FuzzyMapping with the LpDistance metric provides a method for performing approximate set intersection that prioritizes both computational efficiency and data security. FuzzyMapping reduces the dimensionality of set elements before distance calculation, minimizing the computational cost associated with comparing large datasets. Simultaneously, the inherent properties of LpDistance, combined with tunable parameters within FuzzyMapping, introduce controlled imprecision. This imprecision obscures the exact set intersection, preventing precise identification of shared elements while still allowing for a statistically relevant approximation, thus enhancing privacy by limiting the information revealed during the intersection process.
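For reference, the (non-private) functionality that a fuzzy PSI protocol realises securely can be stated in a few lines: a receiver element is in the fuzzy intersection if some sender element lies within L_p distance delta of it. A real deployment computes this under secure computation; both sets appear in the clear here purely to pin down the target behaviour:

```python
def lp_distance(x, y, p):
    """L_p distance between equal-length vectors x and y."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    return max(diffs) if p == float("inf") else sum(d ** p for d in diffs) ** (1 / p)

def fuzzy_intersection(receiver_set, sender_set, delta, p=2):
    """Receiver elements within L_p distance delta of some sender element."""
    return [r for r in receiver_set
            if any(lp_distance(r, s, p) <= delta for s in sender_set)]

receiver = [(0, 0), (10, 10), (50, 50)]
sender = [(1, 1), (49, 52)]
assert fuzzy_intersection(receiver, sender, delta=3) == [(0, 0), (50, 50)]
```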

The Architecture of Privacy: Secure Computation and Protocol Efficiency

The foundation of this work rests upon the principles of secure computation, a critical aspect for preserving data privacy during collaborative analysis. This protocol is designed to ensure that no information regarding the individual input sets is revealed to any participating party; only the result of the computation is accessible. This is achieved through a careful construction of cryptographic operations, preventing inference about the underlying data even if an adversary attempts to observe the computational process. By guaranteeing confidentiality of the input sets, the protocol enables secure data sharing and collaborative computation without compromising the privacy of sensitive information, fostering trust and enabling applications in areas where data protection is paramount.

The protocol’s core functionality relies on advanced cryptographic tools designed for privacy-preserving computation. Specifically, it leverages SSPEQT, a secret-shared private equality test that compares values without revealing them, alongside B2A, a Boolean-to-arithmetic share conversion that lets the parties move between Boolean and arithmetic secret sharings so that both comparisons and arithmetic on shared values remain confidential. These tools work in concert, allowing the system to perform the necessary calculations, such as determining distances between data points, while upholding strict data privacy standards. The implementation carefully integrates these cryptographic primitives to minimize computational overhead and maintain efficiency, a crucial aspect for practical application in scenarios demanding both security and performance.
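The algebra behind B2A-style conversion can be checked in isolation. For a single bit b held as XOR shares b0 ⊕ b1, additive shares modulo M follow from the identity b0 ⊕ b1 = b0 + b1 − 2·b0·b1. In a real protocol the cross term b0·b1 is computed obliviously (e.g. via oblivious transfer); this sketch keeps both shares in the clear and only verifies the identity:

```python
M = 2**32  # arithmetic share modulus (illustrative choice)

def b2a_shares(b0, b1):
    """Arithmetic shares (a0, a1) with a0 + a1 = b0 ^ b1 (mod M).

    In MPC the cross term b0*b1 would be obtained via a secure
    multiplication; here it is computed directly for illustration.
    """
    cross = b0 * b1
    a0 = (b0 - 2 * cross) % M  # party 0's share absorbs the cross term
    a1 = b1 % M                # party 1 reuses its Boolean share
    return a0, a1

# The identity holds for all four share combinations.
for b0 in (0, 1):
    for b1 in (0, 1):
        a0, a1 = b2a_shares(b0, b1)
        assert (a0 + a1) % M == (b0 ^ b1)
```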

The protocol’s efficiency stems from a careful adherence to the DisjointProjectionAssumption, coupled with optimizations tailored for cryptographic operations. This approach yields linear complexity with respect to set size, dimensionality, and the specified distance threshold – a marked improvement over existing methods. Benchmarking reveals a substantial performance gain, with the protocol demonstrating up to an 80-fold acceleration in computation and a 19-fold reduction in communication overhead compared to current state-of-the-art techniques. Moreover, prefix optimizations further refine performance by reducing the complexity associated with the distance threshold from linear to logarithmic, allowing for even faster computations, particularly with large datasets and high-dimensional spaces.

The pursuit of efficient fuzzy PSI, as detailed in this work, isn’t merely about optimizing protocols; it’s about probing the boundaries of what’s computationally feasible. One accepts the established norms of secure computation, then deliberately introduces ‘fuzziness’ – a controlled deviation – to unlock performance gains. This echoes Barbara Liskov’s sentiment: “It’s one thing to program something; it’s another thing to build a system that you can rely on.” The presented protocol, with its secret-shared oblivious PRF and prefix optimizations, aims for that reliability – a system robust enough to handle the complexities of privacy-preserving data analysis while maintaining linear complexity, effectively turning potential weaknesses into strengths.

What Lies Ahead?

The presented protocol, while demonstrating a reduction in computational complexity, operates under the assumption that efficiency directly correlates to security. This is, of course, a convenient fiction. A bug is the system confessing its design sins, and the linear scaling achieved here merely shifts the locus of potential vulnerabilities. Future work must rigorously examine the protocol’s resistance to subtle attacks that exploit the interplay between fuzzy matching and the secret-shared oblivious PRF. The choice of Lp distance, while practical, introduces a parameter space ripe for manipulation; a determined adversary will seek the ‘sweet spot’ where discrimination falters.

The current focus on optimizing for computational cost has largely sidelined considerations of communication bandwidth. Achieving truly scalable fuzzy private set intersection demands a re-evaluation of data transmission strategies. Perhaps the most pressing question is whether the pursuit of ever-faster protocols is, itself, a distraction. The fundamental problem isn’t merely how to compute this intersection, but whether it should be computed at all, given the inherent risks of exposing even partially revealed data.

Ultimately, this work illuminates a familiar truth: optimization is an endless cycle. Each gain in efficiency reveals a new constraint, a new weakness. The next iteration won’t be about faster computation, but about a more honest accounting of the trade-offs between privacy, performance, and the inescapable fragility of any complex system.


Original article: https://arxiv.org/pdf/2604.14909.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
