Malice in the Mesh // 05: Intro to Detection Engineering for AI Agent Systems

If modern AI agents were as simple as one outbound API call and a model output, there would not be a need for a comprehensive guide regarding detection engineering of these systems. Agents and orchestration tools have grown to now encompass tool use, file access, local execution, memory, workflow orchestration, network access. Not only have AI agent systems begun to sprawl across the OS, creators are in a race to now achieve the highest number of tool integrations as possible. This has led to an exponential increase in attack surface. The rapid increase of these tools in the marketplace has exploded not just due to market demand, but lower barriers of entry enabled by AI coding assistants. The effort it takes to “vibe code” an orchestration tool is minimal. And we have not reached the point yet where “vibe coding” automatically equates to the same level of security as purely-human derived code.

AI Agent Evolution

Take OpenClaw, the open source agent framework, which emerged in early 2026. As of April 2026, OpenClaw has published 250+ security advisories. Some researchers have estimated the rate of vulnerable ‘skills’ (similar to a plugin) at 41% of 2,800 skills. These skills became the catalyst for the explosion of early abuse of OpenClaw, such as the ClawHavoc campaign. Opening the door for the next major AI agent security risk, what has gone largely under the radar are the vibe-coded tools following in OpenClaw’s footsteps. Whether attempting to one-up OpenClaw with more tool integrations or proclaiming to be a more secure iteration of OpenClaw (but are they really??), agent frameworks like OpenClaw are quickly proliferating. To wrap up this installment, we will look at telemetry from OpenClaw and a cutting edge, newer, “more secure” alternative. Through our analysis, we will be able to make judgments regarding security posture and attack surface, if the alternative is truly more secure, and develop guidance for detection. All of this done in our own custom-built lab for AI analysis. While we won’t delve into the details of the construction of this lab, there will be a separate supplemental 13+ part series to Malice in the Mesh which will document the construction of the AI analysis lab.

This wave of popularity for open sourced agent frameworks operating within the orchestration layer has attracted the attention of the frontier model providers and caused them to rapidly adapt their strategies. Google, OpenAI, and Anthropic have all introduced improvements to their models’ ability to plan, call tools, follow multi-step workflows, inspect files, use code, recover from errors, and operate inside agent frameworks like LangGraph, OpenHands, OpenClaw-like systems, MCP, and workflow tools. Frontier models are moving from “answer engines” to “reasoning-and-action engines.”

Introduction to the Intelligence Process for AI Agents

Our previous four chapters defined our mental models at a conceptual level for detection engineering regarding AI agents. Now we need to get concrete. If you are a SOC analyst watching network traffic, you need to know what LangChain’s HTTP requests look like versus AutoGen’s. If you are a red teamer probing an agent deployment, you need to know where CrewAI stores credentials versus where n8n stores them. If you are doing incident response on a compromised host, you need to know whether the Chroma database in /home/app/.chroma/ belongs to a LangChain agent or a Flowise deployment, because the analysis changes your forensic approach.

The first step is to obtain a technical inventory for the system or tool under analysis. We have developed a five-phase system for our AI Analysis Lab. Those five phases are:

Static Decomposition
Network Fingerprinting
Runtime Behavioral Fingerprinting
Security Surface Instrumentation
Adversary Simulation

We will use these findings for our own detection engineering, but also to further understand the dynamics of upcoming installments:

Behavioral signatures of AI agents
Detection at the network layer
Host and application layer detection

We will use open source tools such as tcpdump, bpftrace, Falco, strace and more to assist in artifact collection. Be sure to check out the accompanying AI Analysis Lab series for a full deep dive into how we use these tools.

From Framework to Evidence: OpenClaw vs. Mercury

On April 25, 2026 the user “@xxxx0101xxxx010” on X published a post in Japanese describing the new release of what is claimed to be, a more secure alternative to OpenClaw:

Upon further research, Mercury is an AI agent framework developed by an organization called “Cosmic Stack” that touts OpenClaw-like interoperability but with “hardened permissions”.

Navigating the Mercury GitHub repo, the permission-first architecture is further described as “Every action that touches the outside world (files, shell, network, git) must go through the permission system. Never assume approval.” Interesting that if you recall, as spoken about in the 4th installment of Malice in the Mesh, this was the solution Anthropic employed for failure handling within Claude Code. This “hub-and-spoke” type of architecture where model executions are routed through a central interface. For Claude Code, it is called the QueryEngine. Our comparative analysis will gives us another opportunity to judge whether this type of architecture as an effective control flow method.

Check out Mercury’s GitHub repo here to source code and project documentation: https://github.com/cosmicstack-labs/mercury-agent

After following OpenClaw’s wave of vulnerabilities closely, Mercury’s release proved to be a great opportunity to not only discover more about Mercury itself but to see what we can infer about OpenClaw’s own security posture via comparative scrutiny. While the version of the analysis presented here is a high-level synopsis, I suggest you dive into the new AI Analysis series, being released in the next couple of weeks, to see the detailed analysis.

The next installments of Malice in the Mesh are organized by detection surface. The AI Analysis Lab is organized by how evidence is collected. Those are not the same framework, but they are intentionally coupled.

Static Decomposition

Static decomposition is analysis at rest, before execution or runtime observation. Our first observation quickly gives us useful insight for determining comparative security features. OpenClaw’s src/ and extensions/ trees contain 22,146 TypeScript files and 10,174 JavaScript files spanning roughly 130 top-level modules. Mercury’s src/ tree contains 83 TypeScript files in twelve directories. The ratio is 267 to 1.

The difference reflects the distinct architectural decisions with a focus on security trade-offs. OpenClaw is a platform with a WebSocket gateway, a browser automation subsystem, over fifteen messaging channel integrations, a plugin SDK, a multi-agent orchestration layer, a model catalog spanning every major provider, media generation pipelines for image, video, and music, a credential custodian, and mode. Every module expands the attack surface.

Mercury’s twelve directories map directly to its security model: capabilities/ (filesystem, git, shell, skills, web, and four others), channels/ (CLI and Telegram), core/, memory/, providers/, skills/, soul/, types/, and utils/. There is no gateway and no browser automation. There are only two channel integrations. The list is short enough to hold in working memory while reading any single file. If complexity breeds insecurity, OpenClaw is a monster to grapple with. It’s quite unfathomable for any enterprise security team to attempt to fully inventory, monitor, and add guardrails to prevent the sprawl of OpenClaw. Mercury takes the same architecture and makes achievement of these goals at least conceivable. With the security control trade-offs of Mercury comes less capabilities. A trade-off every security and risk team should weigh.

Through static decomposition, we can first see evidence of Cosmic Stack’s statement, “Every action that touches the outside world (files, shell, network, git) must go through the permission system. Never assume approval.” Mercury’s permission model runs through a single file: src/capabilities/permissions.ts. This file is 489 lines and has three tiers. The blocked list permanently denies sudo *, rm -rf /, mkfs *, dd if=*, the fork bomb, shutdown *, reboot *, and > /dev/sda. The auto-approved list covers read-only commands: ls *, cat *, pwd, grep *. The needs-approval list limits commands with significant side-effects: npm publish *, pip install *, chmod *, docker *.

The permission models diverge more sharply than anything else in the decomposition. OpenClaw’s permission-model-static.txt file lists over 41,000 lines of file paths containing permission-related logic scattered across the entire extensions tree. Each extension manages its own access controls, or it does not. There is no single place where you can verify that every action goes through the permission system. This is not a “hub-and-spoke” like control flow mechanism in the slightest. It’s a massive ball of yarn.

While static decomposition may not be as “sexy” as gathering runtime telemetry, the insights we can infer at this point are highly significant. Further phases of analysis will bring our insights to life.

Network Fingerprinting

The network fingerprinting phase measures what is emitted onto the network: listening ports, egress destinations, packet-capture signatures. If complexity breeds insecurity and Mercury’s goal is a simple architecture to lower the attack surface, we can see evidence of that through our network fingerprinting. After running ss -tlnp on the host and looking for what is listening, Mercury’s fingerprint returned nothing. Recalling our static decomposition phase, this is by design. The lab’s artifact records “No listening ports — EXPECTED” because this is a designed property we were merely confirming, not a previous gap in awareness. With no web socket, Mercury communicates through the terminal itself using stdin/stdout, not through a network service. For its optional Telegram channel, it uses a outbound-only polling method rather than a gateway, so a port scan would not distinguish a Mercury host from a host without Mercury installed.

In contrast, OpenClaw establishes a web socket gateway on port 18789. This important for a number of reasons but most significant are the fact are it’s persistence and it’s uniqueness. For bad actors conducting port scans, a result of open port 18789 is a major indicator for a host running OpenClaw.

As we march down this path, we start to see signs of a familiar concept, the PPAO loop. While static decomposition focused on an area more closely associated with the former phase of the loop, Perceive, we can begin to see the agent in action with network fingerprinting.

Runtime Behavioral Fingerprinting

Runtime behavioral fingerprinting follows up on network fingerprinting by collecting what the telemetry in the last phase does on the host. This includes File access patterns, process creation sequences, filesystem artifacts left behind after a task completes, strace-level syscall traces, and more.

To compare process creation for each, we used a tool called bpftrace. Bpftrace attaches an eBPF program (called “probes”) to a specific event source so we can monitor that event. Similar to tagging wildlife. And even more similar to EDR conducting hooking in a Windows environment. For our following usecase, which is just a small slice of our runtime behavioral fingerprinting analysis, we attached a probe to the execve syscall entry tracepoint to observe when a process starts another program.

Mercury’s in-process tool execution is invisible to process-creation-based detection. Tools like Falco rules that fire on execve events. Because Mercury’s read_file capability never calls execve at all, those controls see nothing. To observe it you need to hook the actual syscalls Mercury makes from within its own PID, specifically openat, read, and close, using a bpftrace probe attached to the Node.js process directly.

OpenClaw’s container model makes operations visible to process-creation detection precisely because container initialization, health checks, and downloaded binary execution all produce execve events that even a ps poll or a Falco execve rule will catch.

OpenClaw gives you much more noise, giving you more to detect it’s existence, but harder to define what it is actually doing. Mercury gives you a clean signal but a narrow one.

Security Surface Instrumentation

The security surface instrumentation answers the operational question that follows from both of the previous phases given what we now know about each agent’s architecture, where exactly do we place sensors, what do those sensors look for, and how does the instrumentation strategy differ when the architecture differs?

For OpenClaw, the lab produced three detection rules. The first targets network exposure, detecting whether the gateway is bound to a non-loopback interface. The second targets supply-chain events. A skill installation triggers a network fan-out and downloads content from external domains. The third targets credential access. The agent process reads openclaw.json or auth-profiles.json and then makes a non-Anthropic outbound connection.

For Mercury, the lab produced four detection rules. The first targets shell command execution: any run_command produces a node to sh command process chain observable by Falco’s execve rules. The second targets skill loading, a burst of SKILL.md file reads is the indirect injection signal most specific to Mercury’s threat model. The third targets credential file access: reads of ~/.mercury/.env or mercury.yaml outside startup. The fourth targets direct filesystem reads of second-brain.db by a process other than the mercury node process, which would indicate either a debugging session or memory exfiltration by a attacker.

Adversary Simulation

Adversary simulation is where it all comes together, in the very sense of the PPAO loop. Adversary simulation maps to the full PPAO loop. Malicious tests exercise the entire loop: attacker controlled input enters perception, changes planning, triggers unsafe action, and produces observations that may alter later iterations.

If you study AI security, you probably have heard of prompt injection by now. This is where we had much fun with all different types of prompt injection on both Mercury and OpenClaw. We used all of the following techniques: Naive Override, Reformatting Evasion, Single-Shot Reframe, Multi-Turn Crescendo, Persona/Roleplay Hijacking, Tool Output / Skill Injection, Memory Poisoning, and Goal Hijacking.

Mercury was impressive and demonstrated itself true to the label on the can. Mercury not only demonstrated hardened permissions for basic attacks, but a considerable level of defense in depth for more sophisticated ones. Mercury held on every test category that targeted them. The comparative analysis against OpenClaw here was an extreme example, holding on the most extreme prompts (explicit shadow-file exfiltration) and failed on every test that did not look obviously hostile to the model. My favorite was roleplay hijacking where OpenClaw said due to security constraints it cannot assume a different persona, but operating as Claude (the underlying model), here’s the gateway key. Ummm……

While denying the underlying Claude API key, this is still insecure disclosure and should be treated as such.

Mercury on the otherhand:

Conclusion

In our first four installments, we laid the conceptual groundwork for AI Agent systems. In this installment, we were introduced to gathering concrete evidence for how they operate. Going forward we will mesh both to continue down this path of detecting the ‘Malice in the Mesh’.

Have suggestions or want to collaborate on a future project? Shoot me an email at roccofiorecyber@gmail.com or find me on LinkedIn at the icon below.

The content published on this site reflects personal views and research only. It does not represent the views, positions, or policies of any current or former employer, client, or affiliated organization.

Any references to technologies, vulnerabilities, or security practices are for educational and informational purposes only. Nothing on this site should be interpreted as endorsement, disclosure of confidential information, or professional advice.

All examples are generalized or fictionalized unless explicitly stated otherwise.

Latest Posts

roccofiorecyber@gmail.com

Building an AI Analysis Lab: Part 2May 20, 2026
Welcome to the second installment of our AI Analysis Lab. Part 1 gave a thorough ordering of our host-native stack. Part 2 walks through the build of the Kubernetes observability layer that turns nameless syscalls into named flows. A kind cluster running Cilium as the CNI, Hubble for the flow log, Tetragon for in-container process… Read more: Building an AI Analysis Lab: Part 2
Building an AI Analysis Lab: Part 1May 11, 2026
Welcome to the first installment of our AI Analysis Lab, built for running the practical examples in the Malice in the Mesh series, as well as further continuing analysis. The first thing to know as we jump right in, is the examples below are examples only. While much of the same code and process will… Read more: Building an AI Analysis Lab: Part 1
Building an AI Analysis Lab: IntroductionMay 10, 2026
All rigorous AI Analysis needs a sturdy AI Analysis Lab. There is no better way to learn than to get hands on, so let’s get hand on. If you have not seen some of the practical examples in the Malice in the Mesh series, I highly suggest you go over there and check it out.… Read more: Building an AI Analysis Lab: Introduction
Malice in the Mesh // 06: Behavioral Analysis at the Network and Memory LayersMay 9, 2026
Welcome back. For the foreseeable future, we will be tacking detection engineering for AI Agents one detection surface at a time. In this installment, we begin by diving into behavioral analysis at the network layer. Each analysis type may be comprised of multiple installments, as we move through the discipline in a slow, but smooth… Read more: Malice in the Mesh // 06: Behavioral Analysis at the Network and Memory Layers
Malice in the Mesh // 05: Intro to Detection Engineering for AI Agent SystemsMay 2, 2026
If modern AI agents were as simple as one outbound API call and a model output, there would not be a need for a comprehensive guide regarding detection engineering of these systems. Agents and orchestration tools have grown to now encompass tool use, file access, local execution, memory, workflow orchestration, network access. Not only have… Read more: Malice in the Mesh // 05: Intro to Detection Engineering for AI Agent Systems