Announcing OpaquePrompts: Hide your sensitive data from LLMs

Today, we’re excited to announce the launch of our early access program for OpaquePrompts. OpaquePrompts serves as a privacy layer around large language models (LLMs), enabling you to hide personal and sensitive data from LLM providers.

Motivation

New generative AI tools are popping up left and right, and for good reason: some believe generative AI is the biggest shift since the internet. Yet while new LLM applications, LLMOps tools, and even LLM providers are being built at an incredible rate, far less attention has gone to privacy tooling for LLMs. To us, this is almost paradoxical: for LLMs to provide value, they must be able to operate on all of our data, yet today’s policies and regulations dictate what data can and cannot be shared with, and retained by, third-party providers. While such regulations are critical to protecting our privacy, they create enormous friction for data teams.

For example, the General Data Protection Regulation (GDPR) mandates that personal data must not be retained longer than necessary, while the California Consumer Privacy Act (CCPA) dictates that businesses must direct their service providers to delete “sensitive personal information” upon consumer request. If an LLM application were to pass a user’s prompt containing personal information to a third-party LLM provider, and the third-party LLM provider were to log the request and/or retain the prompt for “service improvement”, the LLM application developer would have to work closely with the LLM provider to ensure that their application adheres to GDPR, CCPA, and other regulations.

Unfortunately, today’s LLM providers’ data usage and retention policies do not offer much privacy. OpenAI’s consumer-facing ChatGPT continually trains its models with user input, and also shares user input with third-party providers. Anthropic’s Claude retains our data for at least 30 days by default. Google’s Bard retains our activity for at least 3 months, and may retain it for up to 3 years. It also uses input data to continually train its model. While providers have recently improved their security posture (for example, OpenAI no longer uses data submitted via its API to train its model), we wouldn’t recommend assuming that any data sent to them will be immediately deleted and will not be used.

With LLM providers insisting on retaining and even using our data, how can we harness the incredible power of their models while ensuring they safeguard any data we send them? How do we know that what they’re doing won’t bring regulatory blowback? How can we trust them with the data of our colleagues and our customers?

We need to ensure that any data we send to providers is not sensitive. In particular, by sanitizing any user prompt of personal or sensitive data before it reaches the LLM, we no longer have to rely on the LLM provider to follow any specific data retention or processing policy. This is what OpaquePrompts provides. And by leveraging confidential computing and trusted execution environments (TEEs), Opaque, which hosts OpaquePrompts, also doesn’t see the prompt, ensuring that only you, the LLM application developer, see any information included in the prompt.
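To make this concrete, here is a rough before-and-after illustration of what prompt sanitization might look like. The placeholder format and the token map below are assumptions made for illustration only, not the actual tokens OpaquePrompts produces.

```python
# Hypothetical illustration of prompt sanitization (placeholder format is
# assumed for illustration, not the actual OpaquePrompts output): personal
# and sensitive tokens are replaced before the prompt reaches the LLM provider.

original_prompt = (
    "Summarize this support ticket: Jane Doe (jane.doe@example.com) "
    "reported that her card ending in 4242 was double-charged."
)

sanitized_prompt = (
    "Summarize this support ticket: [PERSON_1] ([EMAIL_1]) "
    "reported that her card ending in [CARD_1] was double-charged."
)

# The mapping from placeholders back to plaintext never leaves the trusted
# execution environment; it is used later to de-sanitize the LLM's response.
token_map = {
    "[PERSON_1]": "Jane Doe",
    "[EMAIL_1]": "jane.doe@example.com",
    "[CARD_1]": "4242",
}
```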

The OpaquePrompts use case

We’ve built OpaquePrompts specifically for the use case where a user wants to gain insight into some context they provide. The workflow, sketched in code after the list, is as follows:

  1. Given user-provided input, the LLM application constructs a prompt as before, perhaps including retrieved context, memory, and the user query. The LLM application relays this prompt to OpaquePrompts.

  2. In a TEE, OpaquePrompts uses natural language processing (NLP)-based machine learning to identify sensitive tokens in a prompt.

  3. In a TEE, OpaquePrompts sanitizes the prompt, encrypting or redacting all personal and sensitive tokens, before returning the sanitized prompt to the LLM application.

  4. The LLM application submits the sanitized prompt to the LLM of its choice (such as ChatGPT or Claude), which returns to the LLM application what we call a sanitized response: a response in which personal information appears only in sanitized form.

  5. In a TEE, OpaquePrompts receives the sanitized response from the LLM application and de-sanitizes it, replacing the sanitized tokens with their plaintext equivalents.

  6. The LLM application returns the de-sanitized response to the user.
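Below is a minimal sketch, in Python, of what this workflow could look like from the LLM application’s side. The OpaquePromptsClient class, its sanitize and desanitize methods, the SanitizeResult shape, and the llm.generate call are hypothetical names used for illustration; the actual OpaquePrompts API may differ.

```python
# Minimal sketch of the six-step workflow above. All OpaquePrompts-related
# names here are hypothetical stand-ins, not the real SDK surface.

from dataclasses import dataclass


@dataclass
class SanitizeResult:
    sanitized_prompt: str
    secure_context: bytes  # opaque handle needed later for de-sanitization


class OpaquePromptsClient:
    def sanitize(self, prompt: str) -> SanitizeResult:
        """Steps 2-3: identify and encrypt sensitive tokens inside a TEE."""
        ...

    def desanitize(self, response: str, secure_context: bytes) -> str:
        """Step 5: replace sanitized tokens with their plaintext equivalents."""
        ...


def answer_query(client: OpaquePromptsClient, llm, user_query: str, context: str) -> str:
    # Step 1: build the prompt as usual (retrieved context, memory, user query).
    prompt = f"Context:\n{context}\n\nQuestion: {user_query}"

    # Steps 2-3: OpaquePrompts returns a sanitized prompt; the plaintext
    # mapping never leaves the TEE.
    result = client.sanitize(prompt)

    # Step 4: only the sanitized prompt is sent to the third-party LLM
    # (llm.generate is a placeholder for whatever client you already use).
    sanitized_response = llm.generate(result.sanitized_prompt)

    # Step 5: OpaquePrompts restores the plaintext tokens in the response.
    # Step 6: return the de-sanitized response to the user.
    return client.desanitize(sanitized_response, result.secure_context)
```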

Using OpaquePrompts in this workflow minimizes disruption to existing LLM applications, making integration simple and empowering teams to build in privacy by design. With OpaquePrompts, LLM applications no longer have to rely on users manually removing sensitive information in order to comply with applicable laws and policies before a prompt reaches a third-party LLM.

Visually, an LLM application’s architecture augmented with OpaquePrompts could look like the following:


A typical retrieval-augmented generation (RAG) architecture is everything outlined in black. With OpaquePrompts, the architecture is further augmented with the pieces in blue.

Getting access to OpaquePrompts

Today, you can join the waitlist to access the OpaquePrompts API and OpaquePrompts Chat, a demo chat application built with the OpaquePrompts API. Additionally, we’ve built a number of open-source tools to support OpaquePrompts, which you can find on GitHub.

With OpaquePrompts, we’re excited to support teams in their journey toward building responsible AI. OpaquePrompts is completely free for our early users; all we ask is that you share our commitment to building a future where all data remains private, regardless of where it is in the data lifecycle.

Join our waitlist to get early access to the OpaquePrompts API and chat interface, take a look at a product tour, or contact us by joining our Discord server or shooting us an email at opaqueprompts@opaque.co.