I've recently been doing a lot of work around AI and developer tools, and thought it'd be fun to build my own take on background coding agents! Some of the basic requirements I wanted in this project:
I also wanted to prioritize flexibility in how much autonomy you give the agent: from having Shadow surface just the highest-level coding task overview and Git connection actions, to exposing full insight into the agent's context engine, file and terminal workspace, tool call results, etc.
Shadow has 3 main parts: the frontend interface, the backend server, and the isolated task environments.
It's a TypeScript monorepo, using Turborepo as its build system.
The frontend is built with Next.js, and the backend with Express and Socket.io. The isolated task environments are run in Docker containers, using Kubernetes for orchestration. The database is PostgreSQL.
Since Shadow is a background agent for long-running tasks, a serverless backend architecture isn't ideal here. The stateful backend exposes both a REST API and a WebSocket server, built with Express, TypeScript, and Socket.io.
Note that there are 2 modes: local and remote. Local mode is for development, where the agent works on local files on your machine. Remote mode is for production, where the agent works on files in remote isolated sandboxes. Initialization steps, tool execution logic, and other bits differ based on the current mode.
Vercel's AI SDK is a super helpful library to work with LLMs, making it easy to support different models and providers. I used AI SDK Core for Shadow's LLM logic.
The tricky part here is that simpler AI apps built with the AI SDK typically use a stateless architecture with Next.js serverless functions. However, Shadow's architecture should decouple clients from the agent so they can freely connect/disconnect without interrupting any workflows. Because of this, we create a stream processor for each active task, each with chunk handlers to process streamed chunks from the LLM and broadcast them to clients.
More on how the frontend parses these WebSocket events in a later section! Here's an overview of the stream processor class in the backend (stream-processor.ts):
```ts
class StreamProcessor {
  private modelProvider = new ModelProvider();
  private chunkHandlers = new ChunkHandlers();

  async *createMessageStream(
    taskId: string,
    systemPrompt: string,
    model: ModelType,
    userApiKeys: ApiKeys,
    workspacePath: string,
    messages: Message[],
    abortSignal: AbortSignal,
  ): AsyncGenerator<StreamChunk> {
    try {
      const modelInstance = this.modelProvider.getModel(model, userApiKeys);
      const tools = await createTools(taskId, workspacePath);
      const result = streamText({
        model: modelInstance,
        messages,
        abortSignal,
        tools,
        experimental_repairToolCall: async (): Promise<LanguageModelV1FunctionToolCall | null> => {...},
        // other stream config options...
      });

      for await (const chunk of result.fullStream) {
        switch (chunk.type) {
          case "text-delta": {
            const streamChunk = this.chunkHandlers.handleTextDelta(chunk);
            if (streamChunk) yield streamChunk;
            break;
          }
          case "tool-call": {...}
          case "tool-call-streaming-start": {...}
          case "tool-call-delta": {...}
          case "tool-result": {...}
          case "finish": {...}
          case "reasoning": {...}
          // other chunk types...
        }
      }
    } catch (error) {
      yield { type: "error", error: error.message, finishReason: "error" };
    }
  }
}
```
First, we have an async generator function which uses streamText() to make LLM calls, then handles each chunk and emits them to connected clients. The tool call repair parameter is used to retry after validation errors from tool call invocations and outputs. A simplified version of the streamConfig is also included below, although the full file contains more parameters like interleaved thinking, reasoning effort, etc.
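As a rough stand-in for that config (not the actual file; values like maxSteps and the option choices beyond what the document mentions are assumptions), a simplified setup with the AI SDK could look like this:

```ts
// Hypothetical sketch of a simplified stream config; option values are
// assumptions, not Shadow's actual file.
import { streamText, type CoreMessage, type CoreTool, type LanguageModel } from "ai";

function createTaskStream(opts: {
  model: LanguageModel;
  systemPrompt: string;
  messages: CoreMessage[];
  tools: Record<string, CoreTool>;
  abortSignal: AbortSignal;
}) {
  return streamText({
    model: opts.model,             // resolved via ModelProvider + user API keys
    system: opts.systemPrompt,     // identity, tool usage, injected context
    messages: opts.messages,       // prior chat history for the task
    tools: opts.tools,             // per-task tool set
    abortSignal: opts.abortSignal, // lets new messages interrupt the stream
    toolCallStreaming: true,       // emit tool-call deltas for live UI updates
    maxSteps: 64,                  // hypothetical cap on agentic tool-call loops
    experimental_repairToolCall: async () => {
      // retry/repair invalid tool calls instead of failing the whole stream
      return null;
    },
  });
}
```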
The backend also has a ModelContextService class to help manage task-specific context about API keys and models. This is helpful since Shadow doesn't permanently store user API keys, which keeps security simpler.
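Conceptually, it looks something like this (a hypothetical sketch; the real class's fields and methods differ):

```ts
// Hypothetical sketch of per-task model/API-key context. It only illustrates
// the idea of keeping user API keys in memory per task rather than persisting
// them; provider names and method names are assumptions.
type ApiKeys = Partial<Record<"openai" | "anthropic" | "openrouter", string>>;

class ModelContextService {
  private contexts = new Map<string, { model: string; apiKeys: ApiKeys }>();

  // Called when a task starts or the user switches models
  setContext(taskId: string, model: string, apiKeys: ApiKeys) {
    this.contexts.set(taskId, { model, apiKeys });
  }

  getContext(taskId: string) {
    return this.contexts.get(taskId);
  }

  // Dropped on task cleanup so keys never outlive the task
  clearContext(taskId: string) {
    this.contexts.delete(taskId);
  }
}
```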
Shadow has a wide range of tools to work like a human developer, reasoning about and working in complex codebases. Each tool is organized in apps/server/src/agent/tools/ with markdown files for the description and usage instructions. Some tools can be executed in parallel by the LLM, particularly for discovery; parallel execution for file editing isn't encouraged due to potential conflicts.
Because of Shadow's dual modes, we have a "tool executor" abstraction with implementations in the classes LocalToolExecutor and RemoteToolExecutor.
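As an illustrative sketch of that abstraction (the interface, method names, and routes below are my assumptions, not the actual code):

```ts
// Illustrative tool-executor abstraction; method names and sidecar routes are
// assumptions based on the tool list, not the real interface.
interface ToolExecutor {
  readFile(path: string): Promise<string>;
  editFile(path: string, content: string): Promise<void>;
  deleteFile(path: string): Promise<void>;
  listDir(path: string): Promise<string[]>;
  runCommand(command: string): Promise<{ stdout: string; exitCode: number }>;
}

// Remote mode proxies each call to the sidecar HTTP API inside the task VM,
// while local mode hits the local filesystem and terminal directly.
class RemoteToolExecutor implements ToolExecutor {
  constructor(private baseUrl: string) {}

  async readFile(path: string): Promise<string> {
    const res = await fetch(`${this.baseUrl}/api/files/read`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ path }),
    });
    const body = (await res.json()) as { content: string };
    return body.content;
  }

  async editFile(path: string, content: string): Promise<void> {
    await fetch(`${this.baseUrl}/api/files/write`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ path, content }),
    });
  }

  // deleteFile, listDir, and runCommand follow the same request pattern
  async deleteFile(path: string): Promise<void> {}
  async listDir(path: string): Promise<string[]> { return []; }
  async runCommand(command: string): Promise<{ stdout: string; exitCode: number }> {
    return { stdout: "", exitCode: 0 };
  }
}
```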
List of tools:

- read_file
- edit_file (entire file)
- search_replace (text search & replace)
- delete_file
- list_dir
- file_search
- semantic_search (vector-based code search, only available when indexing is complete)
- grep_search
- run_terminal_cmd (with safety validation logic)
- todo_write
- add_memory, list_memories, remove_memory
Shadow also has MCP support! Using experimental_createMCPClient() from the AI SDK, I built in Context7 for up-to-date library documentation search tools. Additional MCP servers can also be added easily.
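For reference, wiring an MCP server's tools into the AI SDK looks roughly like this; the transport type and URL below are placeholders rather than Shadow's actual Context7 setup:

```ts
import { experimental_createMCPClient, type CoreTool } from "ai";

// Placeholder transport and URL; the real Context7 configuration may differ.
async function loadMcpTools(builtInTools: Record<string, CoreTool>) {
  const mcpClient = await experimental_createMCPClient({
    transport: { type: "sse", url: "https://example.com/mcp/sse" },
  });

  // MCP-provided tools get merged with the built-in tool set before streamText()
  const mcpTools = await mcpClient.tools();
  return { ...builtInTools, ...mcpTools };
}
```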
The system prompt contains that tool usage info, as well as sections about identity, capabilities, environment context, operation modes, and dynamic injections like repository memories, Shadow Wiki context (more on this later), and more. If you're interested in the code, check out the apps/server/src/agent/ folder.
Shadow has a queueing system for messages and also stacked tasks. Stacked tasks are new messages created with a given prompt, that are based off of where the current task left off. When these messages are received through the WebSocket connection, they're stored in a queue.
On stream completion for that task, we process the next message in the queue according to its type. Messages can also be sent to immediately interrupt the current stream using an AbortController.
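A minimal sketch of the per-task queue and interrupt flow (names and message shapes are assumptions, not Shadow's actual implementation):

```ts
// Minimal sketch of per-task message queueing and stream interruption.
type QueuedMessage =
  | { kind: "follow-up"; prompt: string }
  | { kind: "stacked-task"; prompt: string };

class TaskQueue {
  private queues = new Map<string, QueuedMessage[]>();
  private controllers = new Map<string, AbortController>();

  // WebSocket handler stores incoming messages while a stream is active
  enqueue(taskId: string, message: QueuedMessage) {
    const queue = this.queues.get(taskId) ?? [];
    queue.push(message);
    this.queues.set(taskId, queue);
  }

  // Called when the user chooses to interrupt instead of queueing
  interrupt(taskId: string) {
    this.controllers.get(taskId)?.abort();
  }

  // Called on stream completion to pick up the next queued message
  dequeue(taskId: string): QueuedMessage | undefined {
    return this.queues.get(taskId)?.shift();
  }

  // Each new stream registers an AbortSignal so it can be cancelled
  registerStream(taskId: string): AbortSignal {
    const controller = new AbortController();
    this.controllers.set(taskId, controller);
    return controller.signal;
  }
}
```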
Codebase indexing in Shadow powers the semantic_search tool: retrieval that understands code from natural language queries, returning metadata like file paths, line numbers, symbol relationships, etc. It's invoked asynchronously on task initialization (if the setting is enabled) and works by building a semantic graph representation of the codebase.
First, we build a graph with GraphNode objects:

- REPO: Repository root
- FILE: Individual source files
- SYMBOL: Functions, classes, methods
- COMMENT: Documentation blocks
- IMPORT: Import statements
- CHUNK: Code segments for embedding

GraphEdge objects define relationships:

- CONTAINS: File contains symbol
- CALLS: Function calls another function
- DOCS_FOR: Comment documents code
- PART_OF: Symbol is part of a larger structure

We do language-aware AST parsing with tree-sitter, supporting languages like JavaScript, TypeScript, Python, and others. This logic handles symbols, imports, calls, cross-file relationships, and more. Large code blocks are also intelligently broken into chunks, preserving boundaries where possible. The resulting chunk embeddings are stored in Pinecone.
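In code terms, the node and edge shapes boil down to something like this (a simplified sketch, not the exact types from the indexing service):

```ts
// Simplified sketch of the graph shapes; field names beyond the node/edge
// kinds listed above are assumptions.
type NodeKind = "REPO" | "FILE" | "SYMBOL" | "COMMENT" | "IMPORT" | "CHUNK";
type EdgeKind = "CONTAINS" | "CALLS" | "DOCS_FOR" | "PART_OF";

interface GraphNode {
  id: string;
  kind: NodeKind;
  path?: string;      // file path for FILE/SYMBOL/CHUNK nodes
  startLine?: number; // source location metadata returned by semantic_search
  endLine?: number;
  content?: string;   // raw text for CHUNK nodes, which is what gets embedded
}

interface GraphEdge {
  kind: EdgeKind;
  from: string; // source node id, e.g. a FILE
  to: string;   // target node id, e.g. a SYMBOL it CONTAINS
}
```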
Shadow Wiki is a codebase documentation system inspired by DeepWiki, triggered on task initialization, which generates hierarchical summaries of codebases for initial agent context.
First, we scan with tree-sitter parsers and build a symbol map for the entire codebase. Next, for efficiency, we identify critical files and sample representative files from directories rather than deeply analyzing every single file.
Then we process files in batches to generate summaries at 3 levels. File-level summaries describe individual files with their purpose, functionality, symbols, dependencies, patterns, etc. Directory-level summaries provide info about folders, aggregating child directory context, previewing representative files, and identifying the directory's purpose. The repo-level summary synthesizes this info, provides an architectural understanding, and links to relevant details from directories and files.
Shadow Wiki generation is an initialization step in the task lifecycle, and summaries are injected after the system prompt. Summaries are also cached by repository ID.
One of the most important parts of Shadow is the auto-provisioned task VMs. Each agent needs its own isolated environment to work in, with filesystem and terminal access to work like a developer would.
Each pod runs a Kata Container using QEMU for hardware virtualization.
Inside each pod, we also run a sidecar container that exposes an HTTP API, acting as a bridge between the backend server and the VM. Similar to the backend, the sidecar API is built with Express and TypeScript.
Core services:
```ts
// File operations
const fileService = new FileService(workspaceService);
// Command execution
const commandService = new CommandService(workspaceService);
// Git operations
const gitService = new GitService(workspaceService);
// Code search
const searchService = new SearchService(workspaceService);
```
API endpoints:
- /api/files/* - File read/write/delete/list operations
- /api/execute/* - Terminal command execution
- /api/search/* - Semantic and grep code search
- /api/git/* - Git operations (commit, push, branching)
- /health - VM health monitoring

The RemoteToolExecutor I mentioned earlier contains the tool implementations for remote mode, which communicate with this sidecar API. The sidecar then executes operations within the VM filesystem and terminal processes.
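To make that concrete, a sidecar route for file reads might look roughly like this; the exact route paths and payload shapes are assumptions on my part:

```ts
import express from "express";

// Illustrative sidecar route; the real API's paths and payloads may differ.
// fileService is the FileService instance shown above, scoped to the VM workspace.
const app = express();
app.use(express.json());

app.post("/api/files/read", async (req, res) => {
  try {
    const { path } = req.body as { path: string };
    const content = await fileService.readFile(path);
    res.json({ success: true, content });
  } catch (error) {
    res.status(500).json({ success: false, error: (error as Error).message });
  }
});

app.listen(4000); // hypothetical sidecar port
```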
Instead of relying solely on file-related tool execution results to keep directory and file contents up to date, we have a filesystem watcher to handle real-time updates. This way, we also capture directory or file changes made through terminal commands or by other means. Like the frontend, the sidecar runs a WebSocket client that connects to the backend's WebSocket server; the sidecar sends filesystem updates to the backend, which then keeps the frontend in sync.
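A minimal sketch of that watcher, assuming chokidar for filesystem events and socket.io-client for forwarding (the event name and payload shape are assumptions):

```ts
import chokidar from "chokidar";
import { io } from "socket.io-client";

// Minimal sketch: forward filesystem events from the VM workspace to the
// backend WebSocket server, which relays them to connected frontends.
const socket = io(process.env.BACKEND_WS_URL ?? "http://localhost:4001");

const watcher = chokidar.watch("/workspace", {
  ignored: /node_modules|\.git/, // skip noisy directories
  ignoreInitial: true,           // only report changes after startup
});

watcher.on("all", (event, path) => {
  // event is "add" | "change" | "unlink" | etc.
  socket.emit("fs-update", { event, path, timestamp: Date.now() });
});
```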
We also have some basic terminal command security logic to prevent dangerous commands or traversals. It's an internal shared package (packages/command-security
) since in local mode we don't run the sidecar, so this validation logic is run directly in the backend server.
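As a rough illustration of the kind of checks involved (the patterns below are examples, not the actual rule set in packages/command-security):

```ts
// Example-only validation rules; the real command-security package is more
// thorough than this sketch.
const BLOCKED_PATTERNS: RegExp[] = [
  /\brm\s+-rf\s+\//,        // destructive recursive deletes from root
  /\bcurl\b.*\|\s*(ba)?sh/, // piping remote scripts into a shell
  /\.\.\//,                 // path traversal outside the workspace
];

export function validateCommand(command: string): { allowed: boolean; reason?: string } {
  for (const pattern of BLOCKED_PATTERNS) {
    if (pattern.test(command)) {
      return { allowed: false, reason: `Blocked by rule: ${pattern}` };
    }
  }
  return { allowed: true };
}
```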
The task lifecycle logic is related to both the agent environment infrastructure and the backend server.
Tasks are initialized on creation or on message reception after a certain period of inactivity, then cleaned up after a certain period of inactivity. The following logic is for remote mode, but local mode is just a simpler version (no VM orchestration logic).
Task initialization follows a state machine pattern. During this, the task status is set to INITIALIZING
, and we also have an InitStatus
enum for granular progress tracking.
This is abstracted into a TaskInitializationEngine
class to easily support both remote and local modes.
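Roughly, the engine walks through a sequence of status updates per mode; the step names, ordering, and method names below are my guesses, not the actual InitStatus values:

```ts
// Hypothetical sketch of the initialization state machine.
enum InitStatus {
  CREATING_VM = "CREATING_VM",
  CLONING_REPO = "CLONING_REPO",
  INSTALLING_DEPS = "INSTALLING_DEPS",
  GENERATING_WIKI = "GENERATING_WIKI",
  STARTING_INDEXING = "STARTING_INDEXING",
  READY = "READY",
}

class TaskInitializationEngine {
  async initialize(taskId: string, mode: "local" | "remote") {
    // Remote mode provisions a VM first; local mode skips orchestration steps
    const steps =
      mode === "remote"
        ? [
            InitStatus.CREATING_VM,
            InitStatus.CLONING_REPO,
            InitStatus.INSTALLING_DEPS,
            InitStatus.GENERATING_WIKI,
            InitStatus.STARTING_INDEXING,
          ]
        : [InitStatus.CLONING_REPO, InitStatus.GENERATING_WIKI];

    for (const step of steps) {
      await this.updateInitStatus(taskId, step); // persisted + broadcast to clients
      await this.runStep(taskId, step);
    }
    await this.updateInitStatus(taskId, InitStatus.READY);
  }

  private async updateInitStatus(taskId: string, status: InitStatus) {
    /* write to the database, emit over WebSocket */
  }
  private async runStep(taskId: string, status: InitStatus) {
    /* per-step implementation */
  }
}
```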
The latest sidecar container image is pulled on VM creation. It's hosted on the GitHub Container Registry and built by a GitHub Actions workflow (build.yml).
Similarly, we have a TaskCleanupService
to handle task cleanup after a certain period of inactivity. There's a task cleanup queue which polls for expired active tasks, then cleans up in-memory data and the task's VM.
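A minimal sketch of that polling loop, with the interval, inactivity threshold, and method names as placeholders:

```ts
// Sketch of inactivity-based cleanup; values and method names are placeholders.
const INACTIVITY_LIMIT_MS = 30 * 60 * 1000; // e.g. 30 minutes
const POLL_INTERVAL_MS = 60 * 1000;

class TaskCleanupService {
  start() {
    setInterval(() => void this.sweep(), POLL_INTERVAL_MS);
  }

  private async sweep() {
    const cutoff = Date.now() - INACTIVITY_LIMIT_MS;
    const expired = await this.findTasksInactiveSince(cutoff);
    for (const task of expired) {
      await this.cleanupInMemoryState(task.id); // stream buffers, queues, sockets
      await this.deleteTaskVM(task.id);         // tears down the task's Kubernetes pod
    }
  }

  private async findTasksInactiveSince(cutoff: number): Promise<{ id: string }[]> {
    return []; // query the database for active tasks past the cutoff
  }
  private async cleanupInMemoryState(taskId: string) {}
  private async deleteTaskVM(taskId: string) {}
}
```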
Shadow uses Git as its source of truth for codebase state and for keeping track of changes. When a user logs in with GitHub and installs the Shadow app onto an organization, Shadow gains access to repositories in that organization (through access and refresh tokens) to easily start building on existing projects and deeply integrate into the Git workflow.
On task creation, branches are auto-generated within the selected repository. We already have title generation logic in place, which is then also used to help with branch naming.
Messages are tied to commits. When an assistant message involves code changes, a commit is co-authored by Shadow and the user. This helps with message editing! When an edited message is submitted, we first check out that message's commit and restore its todo list state to keep everything in sync.
When a stacked task is initialized, the new branch's base commit is the last commit in the base task.
Pull requests are an important piece of the background agent workflow, to be able to easily review and merge changes. The sidebar on the frontend has buttons to create and view pull requests. Auto PR creation can also be enabled in user settings, which just creates or updates the current task's PR if any changes are present on stream completion.
Pull request cards are also visible in the chat UI, which is persisted by storing a PullRequestSnapshot
linked to the corresponding ChatMessage
. I chose to store pull request metadata in snapshots to maintain the correct history of changes, rather than have every "snapshot" simply show the most recent pull request state from GitHub.
You can also trigger tasks directly from a repo's GitHub issues!
When a repo is selected, we make an API request (authenticated by the Shadow GitHub app) to fetch its issues, displaying them by recency in an expandable list with a fun entry animation :)
Shadow uses PostgreSQL for its database (Supabase in production), with Prisma as its ORM. Here's a simplified diagram of the schema:
The diagram above is simplified, missing some other minor task-related tables. Check out the full schema at packages/db/prisma/schema.prisma
. Note that our database is an internal package since the exported prisma client object is used by multiple apps and other internal packages.
As the backend receives stream chunks, we want to ensure messages are stored in real-time. For efficiency, we debounce database updates at an interval to avoid excessive writes.
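A minimal sketch of that debouncing, assuming a hypothetical @repo/db export and a chatMessage Prisma model:

```ts
// Minimal sketch of debounced message persistence; the flush interval, the
// import path, and the Prisma model/field names are assumptions.
import { prisma } from "@repo/db"; // hypothetical internal db package export

const FLUSH_INTERVAL_MS = 500;
const pendingContent = new Map<string, string>(); // messageId -> latest content
let flushTimer: NodeJS.Timeout | null = null;

export function scheduleMessageUpdate(messageId: string, content: string) {
  pendingContent.set(messageId, content);
  if (flushTimer) return; // a flush is already scheduled

  flushTimer = setTimeout(async () => {
    flushTimer = null;
    const updates = [...pendingContent.entries()];
    pendingContent.clear();

    // One write per message with only the latest content, instead of one per chunk
    await Promise.all(
      updates.map(([id, latest]) =>
        prisma.chatMessage.update({ where: { id }, data: { content: latest } })
      )
    );
  }, FLUSH_INTERVAL_MS);
}
```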
Our method for storing chat messages is derived from the patterns of types we see from model providers and the AI SDK. Chat messages have a role (user, assistant, system), content, metadata, and some other fields. A task has an array of messages which form the actual chat history. The AI SDK reference pages are helpful to understand this.
Shadow Wiki and repository index entries are intentionally not related to task entries, since they're tied to Git repositories more than the tasks themselves.
Shadow's frontend is built with Next.js and TypeScript, and styled with Tailwind and Shadcn UI.
The logic around Shadow's chat has 2 "modes". While the LLM isn't streaming, we simply fetch the current task's chat history from the database and display it with user messages and assistant messages.
While the LLM is streaming, we need to see the stream in real-time on the interface. Some naive approaches would be polling the database for updates, or using our WebSocket connection to emit the entire message contents on each stream chunk. Forwarding the message content through WebSockets works, but quickly becomes inefficient for long assistant messages. Shadow is specifically made for tasks where the LLM takes many steps, so this isn't ideal. Instead, the backend keeps an in-memory buffer of the currently streaming message to help with this.
When a task page is opened, the server component first fetches the chat history from the database. On WebSocket connection, it then receives a stream-state
event containing the current stream content if it exists. Then the client receives stream-chunk
events representing tokens and parts from the LLM, which get processed by the accumulation system.
We accumulate stream chunks in a map, where we assign a unique ID to each chunk. This is stored in React state to render them, and also maintained in a ref for immediate access in the processing logic.
```ts
export function useStreamingPartsMap() {
  const mapRef = useRef<Map<string, AssistantMessagePart>>(new Map());
  const [map, setMap] = useState<Map<string, AssistantMessagePart>>(new Map());

  // ...
}
```
Maintaining unique IDs for chunks helps us elegantly make in-place incremental updates to the LLM response.
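To illustrate, an accumulator keyed by part ID could apply chunks like this (shapes simplified, not the actual types):

```ts
// Illustrative accumulator keyed by part ID; chunk and part shapes simplified.
type AssistantMessagePart =
  | { id: string; type: "text"; text: string }
  | { id: string; type: "tool-call"; name: string; argsText: string };

type StreamChunk =
  | { id: string; type: "text-delta"; delta: string }
  | { id: string; type: "tool-call-delta"; name: string; delta: string };

function applyChunk(map: Map<string, AssistantMessagePart>, chunk: StreamChunk) {
  const existing = map.get(chunk.id);
  if (chunk.type === "text-delta") {
    const prev = existing?.type === "text" ? existing.text : "";
    map.set(chunk.id, { id: chunk.id, type: "text", text: prev + chunk.delta });
  } else {
    const prev = existing?.type === "tool-call" ? existing.argsText : "";
    // argsText is partial JSON that the UI parses opportunistically
    map.set(chunk.id, {
      id: chunk.id,
      type: "tool-call",
      name: chunk.name,
      argsText: prev + chunk.delta,
    });
  }
}
```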
Tool calls are even trickier, since we only have access to structured tool call argument data when we receive the tool call result after completion. That isn't enough to surface live tool call statuses like this:
By enabling toolCallStreaming
in the stream config, we get access to tool call deltas. Accumulating these results in a stringified, partial JSON object. By attempting to parse this JSON as it comes in, we can manually extract structured tool call argument data to visualize in the UI.
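A naive version of that opportunistic parsing looks like this; Shadow's actual handling may be more robust:

```ts
// Naive sketch: try to close an incomplete JSON object so we can surface
// whatever arguments have streamed in so far. A dedicated partial-JSON parser
// handles nested structures, arrays, and escapes more carefully.
function tryParsePartialArgs(argsText: string): Record<string, unknown> | null {
  const candidates = [argsText, argsText + '"}', argsText + "}"];
  for (const candidate of candidates) {
    try {
      return JSON.parse(candidate);
    } catch {
      // keep trying progressively "repaired" variants
    }
  }
  return null;
}

// e.g. tryParsePartialArgs('{"target_file":"src/ind') -> { target_file: "src/ind" }
```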
To display the chat UI, we begin by taking in the merged chat history + streaming parts from the task socket hook. The message types:
Before rendering, we group messages into [user, assistant]
pairs, which will be important later.
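The grouping itself is simple; here's a sketch with a simplified Message shape:

```ts
// Sketch of grouping the flat message list into [user, assistant] groups.
interface Message {
  id: string;
  role: "user" | "assistant";
}

function groupIntoPairs(messages: Message[]): Message[][] {
  const groups: Message[][] = [];
  for (const message of messages) {
    if (message.role === "user" || groups.length === 0) {
      groups.push([message]); // each user message starts a new group
    } else {
      groups[groups.length - 1].push(message); // assistant messages join the latest group
    }
  }
  return groups;
}
```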
To help with chat scroll behavior, I used use-stick-to-bottom
.
Assistant messages are complex, handling multiple part types. Text parts are rendered with memoized markdown components:
```tsx
function parseMarkdownIntoBlocks(markdown: string): string[] {
  const tokens = marked.lexer(markdown);
  return tokens.map((token) => token.raw);
}

const MemoizedMarkdown = memo(
  ({ content, id }: { content: string; id: string }) => {
    const blocks = useMemo(() => parseMarkdownIntoBlocks(content), [content]);
    return (
      <div className="space-y-2">
        {blocks.map((block, index) => (
          <MemoizedMarkdownBlock content={block} key={`${id}-block_${index}`} />
        ))}
      </div>
    );
  }
);
```
Tool call parts are displayed in custom components with configuration options such as collapsibility, file icons, loading states, diff numbers, suffixes, etc. (tool.tsx).
Reasoning components are simplified collapsible tool components that auto-open while in progress.
A helpful UX hint in Shadow is how user messages stick to the top of the UI when viewing its following assistant message, for some visual context about the latest prompt.
This is where the [user, assistant]
pairings in individual divs are helpful! We need the message to "unstick" before the next user message, so we need parent divs to mark these boundaries. For example:
```tsx
<div className="sticky">
  <UserMessage />
  <AssistantMessage />
</div>
<div className="sticky">
  <UserMessage />
  {/* ... */}
```
The message editing UX is also designed in an intuitive way:
The point of focus for the user doesn't change from the message's original position, so it's inline, frictionless, and also allows model switching (with a keyboard shortcut) if needed.
Shadow's UX around follow-up messages follows a 2-step pattern to support multiple variants of message logic.
This is a bit unconventional and comes with the trade-off of added friction, for the benefit of surfacing more complex decision making for follow-up messages. I thought this made sense here since Shadow is specifically for longer-horizon tasks, meaning follow-up messages are relatively infrequent.
The expanding/collapsing animation uses the prompt input's border as a "secondary container", similar to how message editing is designed.
While not technically "inside" the chat UI, the sidebar is an important piece of the task page, acting as an overview of the task and its progress.
It surfaces important info in real-time, such as Git info, todo completion, diff stats, and a tree of modified files that opens file contents when clicked!
The frontend app uses RSCs (React Server Components) where possible, to prefetch data and improve performance. I used TanStack Query for data fetching and server state management.
The simplest approach to support SSR with useQuery hooks is the initialData parameter, but this can lead to prop drilling since we'd need to pass data down to the client components that need it.
Instead, we can take a more elegant approach using hydration boundaries and serialization. Here's a simplified code example from the task page layout ([taskId]/layout.tsx
):
```tsx
export default async function TaskLayout({
  children,
  params,
}: {
  children: React.ReactNode;
  params: Promise<{ taskId: string }>;
}) {
  const { taskId } = await params;
  const queryClient = new QueryClient();

  await Promise.allSettled([
    queryClient.prefetchQuery({
      queryKey: ["task", taskId],
      queryFn: () => getTaskWithDetails(taskId),
    }),
    queryClient.prefetchQuery({
      queryKey: ["task-messages", taskId],
      queryFn: () => getTaskMessages(taskId),
    }),
    queryClient.prefetchQuery({
      queryKey: ["api-keys"],
      queryFn: getApiKeys,
    }),
    queryClient.prefetchQuery({
      queryKey: ["models"],
      queryFn: getModels,
    }),
  ]);

  return (
    <HydrationBoundary state={dehydrate(queryClient)}>
      {/* providers, children, etc... */}
    </HydrationBoundary>
  );
}
```
We also have context providers for agent environment layout states and file-related data (AgentEnvironmentProvider
), task socket data for streaming state (TaskSocketProvider
), modal opened states (ModalProvider
), and more.
The "Shadow Realm" is the agent environment's representation on the frontend. This is an important part of Shadow, since I believe great coding assistant experiences give the user full freedom over how much control they want in the agent's workspace. It has a complete file tree, code editor, terminal, and Shadow Wiki view in a collapsible, resizable layout.
The file explorer displays the project's file tree, which is fetched separately from complete file contents. File content is fetched as needed when files are opened. Markdown files are supported using the same rendering components as the chat UI, and the Shadow Wiki view is just a specialized version of that.
The terminal is a basic terminal emulator, built with xterm.js. It communicates with the backend through WebSocket events, so that clients can see terminal command execution results made by Shadow in real-time.
The code editor is built with Monaco, which powers VS Code and many other tools. JSX highlighting support isn't great in Monaco by default, so I used Shiki, a better syntax highlighting engine.
The color theme is a slight variation of Vesper from Rauno Freiberg, which I love (image above from their GitHub repo). This theme also inspired color accents throughout the rest of Shadow, including the following animation!
The new task animation is a subtle multi-layered animation involving various gradient effects. It's inspired by Dia's new tab animation.
The first component is a radial gradient that animates upwards and scales down in size, from near the bottom of the screen to just above the prompt input. Then we have 2 animated conic gradients inside the prompt form, pulsing around either side of the border in sync with the background gradient.
The frontend was the most straightforward part, deployed on Vercel.
The Kubernetes cluster for task VMs was on AWS EKS with bare metal nodes for KVM/nested virtualization. We have a kata-qemu
RuntimeClass for VM-based containers. There's also a shadow-agents
namespace for task execution, storage classes for persistent volumes, and RBAC for service account authentication.
Since each pod needs its own sidecar instance, it pulls from GHCR for the latest image which is built by a GitHub Action.
The backend server is deployed on ECS. We first build the server image with Docker and push it to ECR, then set up the Application Load Balancer, secrets, security groups, and auto-scaling policies. It's deployed in the same VPC as the Kubernetes cluster for easy communication between the two.
Check out the full deployment scripts if you're interested!
Pretty soon into building the project, Shadow began contributing to its own codebase! I find this positive feedback cycle that inherently comes with building developer tools to be really cool.
Here's the final demo video:
Also check out the X thread & system design breakdown video.
Working on this project was a grind, but a lot of fun. I'm super proud of how powerful and complete Shadow became after only a few weeks of building.
Thanks to Rajan and Elijah for helping and doing some super interesting work on Shadow! Rajan built the Shadow Wiki codebase understanding system and other bits across the rest of the codebase. Elijah built the background indexing service.
There are still a lot of areas that I think would be super interesting to explore with Shadow. To list a few:
Subagents as Tools
Allowing the agent to spin up subtasks in parallel, enabling deep discovery in codebases without much context pollution.
Browser Use
The agent already has access to the terminal; giving it access to a browser with vision capabilities would let it iterate on UI changes.
Parallel Iterations
Starting multiple iterations of the same task in parallel to test different models and monitor them together, in real time.
VM Improvements
Using dev containers to support more project types, letting the agent environment closely match real local development setups. Also snapshotting VM state on cleanup to use for resumption, avoiding long re-initialization times.
Timeline
Similar to Devin, keeping track of task progress in a timeline to trace through every step the agent took in its environment.
Context Improvements
Auto-compaction was a major feature we partially built but haven't fully completed yet, which would be a big step towards supporting even longer-horizon tasks. This is an interesting challenge since it requires cutting down most of the context window, yet maintaining sufficient understanding of the current task to continue without losing quality.
Memory Improvements
The current memory system is fairly primitive. Exposing memories and rules to be customizable in the UI would help improve general codebase understanding across tasks within the same codebase.