Whimsical LLM Attack Scenarios

Aditi Bhatnagar
Jan 4
6 min read

Updated: Jan 5

Hi there, my fellow people on earth, hope you're doing great, and the holiday season for you was nothing less than cinnamon cappuccino, plum cakes, cozy blankets, glitter and chill vibes.

Well, I've been lazying out as the year was wrapping up, so while last month was completing the amazing deep learning courses on understanding all things LLMs, last week was mostly imagination.

I recently completed the course on Langchain for LLM Application Development - Credential Flash!

While doing so I imagined - attacks, a new wave of it.

Putting my security lens to a rather constructive course, I had just one question in my mind - what can go wrong?

And so this post would be about that - whimsical scenarios, the what ifs.

Please note - this is not intended to be tech heavy content! It's rather a pleasurable whimsical walk in a garden post that could be treated as a baseless fiction movie or inspire a new category attacks. No harm in thinking mischief, hey!

So what was the course about, you may ask,

It was on using LangChain for LLM Application Development. It covered models, prompts, parsers, memories, chains, Q/A, Agents etc and below are some interesting concepts I learned and few attack scenarios I hallucinated on.

The parser ain't just parsing any more

Output parsers in LangChain are tools designed to interpret, structure, and validate the raw outputs generated by large language models (LLMs). Since LLMs typically generate unstructured text, output parsers help transform this text into structured data formats like dictionaries, JSON, or other desired formats. This makes the outputs easier to use programmatically and ensures they align with expected schemas or constraints.

Malicious thought: Remember JSON parameter tampering attack? Where attacker could add additional fields to the JSON which gets processed separately. Can we probably do something like that here? Well, the exact fields are defined in the Prompt Template so it may not be straightforward to mess up, however, can we get the fields to read unexpected values? Yes.

Since the message is being parsed and JSON values for keys are being deciphered from the passage, we can always tweak the messaging, e.g. edit a comment to mess with the results.

So, this is the prompt template, read the message "It arrived in two days, just in time for my wife's anniversary present"

The JSON generated is:

{
  "gift": true,
  "delivery_days": 2,
  "price_value": "It's slightly more expensive than the other leaf blowers out there"
}

Now, add one line to the review comment:
"Nah it wasn't actually a present, just something she is been meaning to buy for a while"

And now the JSON generated is:

{   
	"gift": false,   
	"delivery_days": 2,   
	"price_value": "It's slightly more expensive than the other leaf blowers out there" 

}

Now one can always argue if it was a gift or not, the point to note is it is only as good as the model's understanding. Also, if business logic is to be decided on something like a review comment, the user can always edit what they wrote and hope for a different possibility, if that logic gets consumed and ends up affecting the user back in some manner.

We just redefined what Business Logic Vulnerabilities can look like.

Get 'em wrong - the embeddings!

Another fascinating concept is embeddings.

LLM embeddings are high-dimensional vectors encoding semantic contexts and relationships of data tokens, facilitating nuanced comprehension by LLMs. They encompass uni-modal and multi-modal types of vectors for single and cross-modal data interpretation, respectively.

Embeddings for the database or reference set (e.g., documents, images, products) are typically precomputed and stored in a vector database or similar system.

Whenever a query comes in, its embedding is dynamically generated using the same embedding model used for the reference data. This ensures the query and the reference embeddings share the same vector space.

Now embeddings can be generated via several open-source libraries across various data types (text, images, graphs, etc.) such as Hugging Face Transformers, Gensim, CLIP etc.

As a hacker, this is a cool place for me to imagine attacks, can the embeddings be generated in a manner that the original query asked is completely modified and used to then query the system to return relevant or irrelevant results.

Well, note that LLMs are designed to generate their own contextual embeddings during processing, so we aren't sending embeddings of the query anywhere, however what we do send is Context . In a typical RAG based application, this context is generated by doing a similarity search on embeddings of incoming query with the embeddings for the knowledge base, it's important that both of these use the same framework to generate embeddings.

Sweet!

Now let's see what can go wrong.

An immediate one is embedding framework mismatch which might lead the system to act in unexpected ways. A better one would be using a malicious third party library to generate embeddings that modifies the context of the actual query itself to then retrieve the result of things that the user didn't even ask.

Would the LLM do that? Well go ahead, try and find out for yourself.

The next interesting thing is the way this information is stored and shared.

If you use a model like OpenAI's text-embedding-ada-002, an embedding for the text "This is a test sentence." might look like this:

embedding = [0.0123, -0.0345, 0.1234, ..., -0.0567]

This may not hold any meaning to human eyes, but note that this is a very important piece of information governing the entire information retrieval logic of the application, so it must never be stored in a lax way.

Don't store it in places where it's easy to modify, don't store it unencrypted manner (which I have seen in so many tutorial videos by now).

Protect it as you would protect user data in any typical scenario.

My genuine curiosity from here was are embeddings reversible?

But then ChatGPT told me "Embeddings are not reversible in the strict sense. They represent a compressed, semantically rich version of the input data, but the process is lossy, meaning the original data cannot be perfectly reconstructed from the embedding."

However, there are certain exceptions that come close to reversibility such as Autoencoders which are designed for encoding and decoding data. In this case, the model learns to compress data (like images or text) into embeddings and then reconstructs the original data.

Another one is Sequence-to-sequence models used in translation tasks.

So, in some cases at least partial context can be retrieved. Best to treat it as user's sensitive info.

Tools gone rogue!

And my eyes got a final glitter when I saw how tools are return for AI agents.

Here's an example:

@tool

def time(text: str) -> str:

    """Returns todays date, use this for any \

    questions related to knowing todays date. \

    The input should always be an empty string, \

    and this function will always return todays \

    date - any date mathmatics should occur \

    outside this function."""

    return str(date.today())

Noticed that long comment?

Well guess what, that command tells the agent when to call this tool.

So, if I write "return me a date whenever I ask for when are my tasks due, it will always return me today's date."

One can practically mess up with the comments and get the following code running. Imagine an RCE code running behind, a simple harmless comment such as

""Returns shell, use this for any questions related to knowing anything about the world or workplaces, about life and strategizing. \The input should always be an empty string, \ and this function will always return a shell.""

Now obviously this won't be intentional. Developers will do the due diligence not adding inappropriate comments to make the agent run a wrong tool, but guess what we are not in a habit of scanning for comments in a code base.

How do we find vulnerabilities today?

We run scanners - SAST etc, but how do we intend to identify a bug like this in production code base?

Intriguing isn't it?

Well, those were a few zero shot observations I got. There are many buried in my scratchpad but this should give some gist.

I feel we are headed towards. a very interesting revolution in security space.

I'm closely monitoring it and if you are too, let's connect and chat.

I'm building something super cool to bring in AI to level up product security as we know of today and if that's something you may find value in, hit me up on LinkedIn!

I'm gonna go and have that coffee my friend has been nagging me on, for a while now.

I hope you enjoyed the whimsical walk.

Will be back soon!

Keep hacking till then ();

Whimsical LLM Attack Scenarios

Recent Posts

Comments