Siloed Searches: Transforming Queries for Cross-Organization Data Access

Author: Denis Avetisyan

A new approach to Retrieval-Augmented Generation shifts the focus from protecting documents to transforming queries, enabling secure and efficient data sharing between organizations.

The system architecture preserves data sovereignty through a five-phase workflow across organization-specific vector spaces. It achieves computational isolation of queries through a multi-stage transformation, $vector2Trans$, which applies permutation, cryptographic blinding, a bounded non-linearity $f_{\beta}$, orthogonal rotation $W$, and L2 normalization while retaining retrieval utility.

Trans-RAG introduces query-centric vector transformation to create isolated vector spaces, preserving data sovereignty and improving the accuracy of cross-organizational retrieval.

Cross-organizational Retrieval-Augmented Generation (RAG) systems face a fundamental trade-off between data security, retrieval accuracy, and computational efficiency. This paper introduces ‘Trans-RAG: Query-Centric Vector Transformation for Secure Cross-Organizational Retrieval’, a framework that shifts the focus from encrypting documents to transforming queries, enabling secure knowledge access across organizational boundaries. By implementing isolated vector spaces and a multi-stage query transformation technique, vector2Trans, Trans-RAG achieves near-orthogonal vector separation with minimal accuracy degradation and significant efficiency gains over existing methods. Could this query-centric approach redefine secure data collaboration and unlock new possibilities for cross-organizational knowledge sharing?

Balancing Innovation and Sovereignty: The Data Access Dilemma

Contemporary data collaboration increasingly demands access to information dispersed across organizational borders, a practice sharply contrasted by the rise of stringent data sovereignty regulations. These regulations, designed to protect data privacy and national security, often dictate where data can be stored, processed, and transferred, creating significant obstacles for businesses and researchers attempting to leverage distributed knowledge. The inherent tension lies in balancing the benefits of seamless data exchange – fostering innovation and efficiency – with the legal and ethical obligations to respect jurisdictional boundaries and maintain data control. This necessitates a move beyond traditional data-sharing methods towards more sophisticated approaches that can dynamically adapt to varying regulatory landscapes and ensure compliant access without stifling crucial collaborative efforts.

Conventional methods of data sharing frequently fall short when navigating the complexities of modern data sovereignty. Centralized databases, while offering a single point of access, create a vulnerable target and struggle to accommodate varying jurisdictional requirements regarding data storage and processing. Similarly, basic encryption, though enhancing security, doesn’t inherently address where data resides or how access is governed across organizational lines. These limitations mean that simply securing data in transit or at rest isn’t enough; a more nuanced approach is needed to demonstrate compliance with evolving regulations and maintain trust among collaborating entities. Consequently, organizations are seeking innovative solutions that enable secure data retrieval without requiring full data replication or compromising individual data governance policies.

The promise of interconnected data – unlocking insights from previously siloed information – is increasingly hampered by the realities of data governance and sovereignty. While organizations amass valuable knowledge, its true potential remains latent without secure mechanisms for cross-organizational retrieval. Simply accessing data is no longer sufficient; compliance with evolving regulations demands granular control over who can access what, and how, across jurisdictional boundaries. This necessitates innovative approaches that move beyond basic security measures, enabling distributed knowledge to be leveraged without compromising privacy or violating legal frameworks. Ultimately, the fundamental challenge isn’t just about sharing data, but about establishing a trusted and auditable system for compliant data access – a prerequisite for realizing the full benefits of collaborative intelligence and data-driven innovation.

Cross-organizational retrieval without trust mechanisms inherently struggles to balance security, accuracy, and efficiency.

Vector Space Language: Architecting Isolation Through Mathematical Separation

The Vector Space Language (VSL) constructs data isolation by defining separate, mathematically distinct semantic spaces for each participating organization. This is achieved by treating each organization’s knowledge base as a unique vector space, effectively a coordinate system where data points – representing information – are positioned relative to one another. Each vector space operates independently; the same data, when represented in different organizations’ vector spaces, will have different vector representations. This mathematical separation prevents direct data access or interpretation across organizational boundaries, as data points lack inherent meaning outside of their native coordinate system. The dimensionality and transformation rules within each vector space are independently controlled, further solidifying the isolation and preventing cross-contamination of knowledge.

Vector embeddings are the core mechanism for achieving data isolation within the Vector Space Language. These embeddings translate data (text, images, or other formats) into high-dimensional vectors of floating-point numbers. This transformation captures the semantic meaning of the data, allowing for similarity comparisons and analytical operations. Critically, the embedding process obscures the original content; the vectors themselves do not directly reveal the underlying data. Instead, they represent a compressed, numerical approximation of meaning. The dimensionality of the vector space and the specific embedding algorithm (e.g., word2vec, transformers) determine the level of semantic capture and obfuscation. The transformation can be written as $\vec{x} = f(\text{data})$, where $\vec{x}$ is the vector embedding and $f$ is the embedding function.

The Vector Space Language (VSL) facilitates secure data collaboration by enabling organizations to interact with the semantic meaning of data without transmitting the underlying sensitive content. This is achieved by representing knowledge as vector embeddings – numerical vectors capturing semantic relationships – within isolated vector spaces. Organizations share these vector representations, allowing for computations like similarity searches and inferences, but the original data remains private due to the transformation process and the lack of a direct mapping between vectors and source content. This approach allows for cross-organizational knowledge discovery and joint analysis while upholding data sovereignty and minimizing the risk of information leakage, as access is limited to the transformed vector representations rather than the raw data itself.

The transformation increased the angular separation between vector spaces from an average of $58.33^\circ$ to $89.90^\circ$, nearly achieving orthogonality, as evidenced by the drop in average cosine similarity from 0.506 to 0.009.
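The link between angular separation and cosine similarity in the caption can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper names `cosine_similarity` and `angle_degrees` are illustrative and do not come from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def angle_degrees(cos_sim):
    """Angle in degrees for a cosine value, clamped against rounding error."""
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_sim))))


# Orthogonal vectors have cosine 0 and an angle of 90 degrees.
print(angle_degrees(cosine_similarity([1.0, 0.0], [0.0, 1.0])))

# Converting the reported average cosines as if each were a single pair.
print(round(angle_degrees(0.506), 2))
print(round(angle_degrees(0.009), 2))
```

Converting the reported average cosines directly gives roughly 59.6° and 89.5°, close to but not identical to the reported average angles. One plausible reason is that the mean of per-pair angles is generally not the angle of the mean cosine, since `acos` is non-linear.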

Trans-RAG: A System for Secure Cross-Organizational Retrieval

Trans-RAG builds upon the Retrieval-Augmented Generation (RAG) process by enabling secure information retrieval across multiple organizations. Traditional RAG systems typically operate within a single data silo; Trans-RAG extends this capability to scenarios where relevant knowledge is distributed across organizational boundaries. This is achieved through the Vector Space Language (VSL) framework, which allows each organization to maintain control over its data while still participating in a unified retrieval process. The core innovation lies in adapting queries to be compatible with each organization’s specific VSL, effectively translating requests without requiring direct access to the underlying sensitive data. This approach facilitates cross-organizational knowledge sharing while preserving data privacy and security.

Trans-RAG fundamentally alters data security approaches by moving away from traditional document-level encryption. Instead of protecting documents at rest, Trans-RAG focuses on transforming the query itself before it accesses any data source. This query-level transformation allows for secure cross-organizational retrieval without requiring organizations to directly share their underlying sensitive documents. By modifying the query, Trans-RAG ensures that only relevant and permissible information is retrieved, effectively controlling access at the point of request and minimizing the risk of unauthorized data exposure. This dynamic transformation is central to maintaining privacy and security within a multi-organizational RAG system.

Vector2Trans facilitates cross-organizational query compatibility by converting incoming queries into the Vector Space Language (VSL) representation used by each participating organization. This multi-stage process protects privacy with cryptographic techniques: cryptographic blinding obscures the original query vector to prevent information leakage, while key-based permutation reorders vector elements using a shared key, further obfuscating the data. The resulting transformed query allows retrieval from each organization’s VSL-indexed knowledge base without exposing the original, sensitive query information or requiring direct access to underlying data.

Quantifying Privacy: Ensuring Computational Isolation and Data Protection

Vector2Trans applies further transformations to input vectors to enhance data privacy: a bounded non-linearity, which introduces a controlled, non-linear distortion to the vector components; orthogonal rotation, which alters the vector’s orientation without changing its magnitude; and L2 normalization, which scales the vector to unit length, preventing magnitude-based inferences. Together, these steps mix and obscure the original vector elements, making reconstruction difficult, while preserving the similarity relationships between vectors that downstream retrieval depends on.

Computational isolation within the Vector2Trans framework is quantitatively assessed using Cross-Space Angular Separation. This metric measures the average angular distance between transformed vectors, indicating the degree of dispersion in the output space and, consequently, the difficulty for an adversary to correlate inputs with outputs. Evaluations have demonstrated an average separation of 89.90°, signifying a high degree of isolation and a substantial reduction in the potential for information leakage via vector similarity analysis. This value is calculated across a representative dataset to ensure statistically relevant performance under anticipated operational conditions.
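One plausible reading of this metric is the mean angle between the same vectors after transformation under two different organization keys. The sketch below follows that reading. The keyed signed permutation is only a simple orthogonal stand-in for the full transformation, and all function names are hypothetical.

```python
import hashlib
import math
import random


def keyed_signed_permutation(x, org_key):
    """Stand-in keyed transform: permute coordinates and flip signs."""
    seed = int.from_bytes(hashlib.sha256(org_key.encode()).digest(), "big")
    rng = random.Random(seed)
    perm = list(range(len(x)))
    rng.shuffle(perm)
    return [x[i] * rng.choice((-1.0, 1.0)) for i in perm]


def angle_degrees(u, v):
    """Angle between two vectors in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def cross_space_angular_separation(vectors, key_a, key_b):
    """Mean angle between each vector's images in two keyed spaces."""
    angles = [
        angle_degrees(
            keyed_signed_permutation(x, key_a),
            keyed_signed_permutation(x, key_b),
        )
        for x in vectors
    ]
    return sum(angles) / len(angles)


rng = random.Random(1)
sample = [[rng.gauss(0.0, 1.0) for _ in range(128)] for _ in range(200)]
print(round(cross_space_angular_separation(sample, "org-a", "org-b"), 2))
```

With random 128-dimensional inputs, the mean separation for two distinct keys should sit close to 90°, while using the same key twice gives a separation of essentially zero.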

Evaluation of the framework’s data protection capabilities utilized Entropy and Mutual Information metrics to quantify information leakage under a Semi-Honest Adversary Model, which assumes the adversary follows the protocol but attempts to infer private data from observed interactions. Results indicate the implemented techniques effectively protect sensitive data while achieving a 96.5% retrieval effectiveness rate. This performance level demonstrates a functional trade-off between privacy preservation and utility, indicating the framework can successfully balance security requirements with the need for accurate data access and analysis.
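Entropy and mutual information can be estimated from samples in several ways; the article does not say which estimator was used. The sketch below shows a simple plug-in estimate over discretized values, with hypothetical helper names, to illustrate what a leakage score near zero means.

```python
import math
from collections import Counter


def entropy_bits(samples):
    """Plug-in Shannon entropy (in bits) of a list of discrete samples."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information_bits(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for paired discrete samples."""
    return entropy_bits(xs) + entropy_bits(ys) - entropy_bits(list(zip(xs, ys)))


def quantize(values, bins, lo=-1.0, hi=1.0):
    """Map real values in [lo, hi] to integer bin indices."""
    width = (hi - lo) / bins
    return [min(bins - 1, max(0, int((v - lo) / width))) for v in values]


# Independent variables share no information.
xs = [0, 0, 1, 1]
ys = [0, 1, 0, 1]
print(mutual_information_bits(xs, ys))

# A variable fully determines itself: I(X; X) equals H(X).
print(mutual_information_bits(xs, xs), entropy_bits(xs))
```

In a leakage analysis along these lines, `xs` would hold quantized components of the original query and `ys` the corresponding quantized observations available to the semi-honest party. Lower mutual information indicates less recoverable information.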

The transformation consistently achieves isolation success rates above 99.5% (measured as cosine similarity < 0.1) across all tested organization pairs, an improvement of 85.59 percentage points over an initial average of 14.22%.
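The success rate in this caption can be read as the fraction of cross-space comparisons whose cosine similarity falls below 0.1. The sketch below takes the threshold literally as stated; the function name and the toy values are illustrative, not data from the paper.

```python
def isolation_success_rate(cosines, threshold=0.1):
    """Fraction of cross-space cosine similarities below the threshold."""
    if not cosines:
        raise ValueError("need at least one cosine similarity")
    return sum(1 for c in cosines if c < threshold) / len(cosines)


# Toy values only: three of the four comparisons fall below 0.1.
toy_cosines = [0.02, -0.05, 0.30, 0.08]
print(isolation_success_rate(toy_cosines))

# The caption's figures add up when the gain is read as percentage points.
baseline_pct, gain_points = 14.22, 85.59
print(round(baseline_pct + gain_points, 2))
```

The second print gives 99.81, which is consistent with the "over 99.5%" figure and suggests the 85.59 gain is an absolute difference rather than a relative one.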

Towards a Future of Secure and Collaborative Intelligence

A significant impediment to broader artificial intelligence implementation lies in the challenge of secure data collaboration across organizational boundaries. Trans-RAG emerges as a solution, offering a robust and scalable framework for cross-organizational retrieval without necessitating direct data sharing. This approach enables organizations to leverage collective knowledge while maintaining complete control and sovereignty over their sensitive information. By decoupling data access from physical location, Trans-RAG facilitates secure AI applications in scenarios where data privacy and regulatory compliance are paramount, paving the way for innovations previously hindered by logistical and security concerns. The system’s architecture is designed to handle large datasets and complex queries, promising a future where collaborative AI is not just possible, but practical and efficient.

Trans-RAG represents a significant advancement in secure data collaboration, allowing organizations to jointly leverage knowledge without relinquishing control over sensitive information. This framework facilitates innovation and accelerates discovery by enabling cross-organizational retrieval of data, a process previously hindered by sovereignty concerns and computational limitations. Notably, Trans-RAG achieves a remarkable 32,216x speedup compared to traditional homomorphic encryption techniques, which previously offered data privacy at the cost of substantial processing overhead. This efficiency unlocks practical applications for collaborative AI, moving beyond theoretical possibilities to enable real-time insights and joint problem-solving across organizational boundaries – a crucial step toward a more interconnected and intelligent future.

Ongoing development of the Trans-RAG framework prioritizes broadening its compatibility beyond text-based data to encompass images, audio, and video formats, thereby unlocking its potential across a wider spectrum of applications. Researchers are actively investigating the system’s performance in complex, real-world scenarios – including federated learning environments and multi-party data analysis – with a focus on maintaining data privacy and security. This iterative process of optimization and expansion aims to establish a robust infrastructure for seamless knowledge sharing and collaborative AI development, ultimately fostering a more interconnected and innovative ecosystem where organizations can jointly leverage data assets without compromising sovereignty or control.

Analysis of cross-organizational probing attacks across 10 organizations and 90 directed pairs reveals potential vulnerabilities in inter-organizational communication.
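The count of 90 directed pairs follows from 10 organizations, since each ordered (source, target) pair with distinct members is counted once: 10 × 9 = 90. A short sketch with placeholder organization names:

```python
from itertools import permutations

# Placeholder names; the article does not list the organizations.
orgs = [f"org-{i}" for i in range(10)]

# Ordered pairs (prober, target) with prober != target.
directed_pairs = list(permutations(orgs, 2))
print(len(directed_pairs))
```

Directed pairs matter here because a probe from one organization into another's space is a different experiment from the reverse direction.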

The pursuit of efficient and secure information retrieval, as explored in this work concerning Trans-RAG, echoes a fundamental principle of system design: structure dictates behavior. Trans-RAG’s approach, moving from document-level encryption to query-level transformation, exemplifies this by fundamentally altering how data interacts within the retrieval process. As a remark attributed to Paul Erdős puts it, “A mathematician knows a lot of things, but a physicist knows the deep underlying principles.” Similarly, Trans-RAG doesn’t simply address the problem of cross-organizational data access; it reframes it by focusing on the underlying structure of the query itself, creating isolated vector spaces that prioritize both data sovereignty and retrieval accuracy. This structural shift is not merely a technical detail, but the core of its effectiveness.

What Lies Ahead?

Trans-RAG rightly identifies a shift in focus – from securing the documents themselves to securing the access patterns revealed by queries. This is a crucial, if subtle, distinction. Yet, the creation of isolated vector spaces, while elegant, introduces a new set of dependencies. Each space becomes a self-contained ecosystem, requiring constant recalibration and potentially exacerbating the ‘cold start’ problem for organizations with limited data. The long-term cost of maintaining these boundaries – the inevitable drift in semantic understanding – remains largely unaddressed.

The framework’s emphasis on query transformation hints at a deeper truth: information retrieval isn’t simply about finding relevant data, it’s about shaping the question itself. Future work should explore how these transformations can be made adaptive, learning not just to preserve privacy, but to actively improve retrieval accuracy by refining the user’s intent. Consider the inherent tension: every new dependency introduced – every layer of transformation – is the hidden cost of freedom from centralized control.

Ultimately, the success of Trans-RAG, and similar approaches, will depend not on technological innovation alone, but on a more holistic understanding of information ecosystems. A truly robust system acknowledges that structure dictates behavior, and that any attempt to isolate parts of that structure will inevitably ripple through the whole. The challenge isn’t just to build secure retrieval systems, but to build systems that are inherently resilient to the complexities of cross-organizational collaboration.

Original article: https://arxiv.org/pdf/2604.09541.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/