I've recently been doing a lot of work around AI and developer tools, and thought it'd be fun to build my own take on background coding agents! Some of the basic requirements I wanted in this project:
I also wanted to prioritize flexibility in how much autonomy you give the agent: from having Shadow surface just the highest-level coding task overview and Git connection actions, to exposing full insight into the agent's context engine, file and terminal workspace, tool call results, etc.
Shadow has 3 main parts: the frontend interface, the backend server, and the isolated task environments.
It's a TypeScript monorepo, using Turborepo as its build system.
The frontend is built with Next.js, and the backend with Express and Socket.io. The isolated task environments are run in Docker containers, using Kubernetes for orchestration. The database is PostgreSQL.
Since Shadow is a background agent for long-running tasks, a serverless backend architecture isn't ideal here. The stateful backend exposes both a REST API and a WebSocket server, built with Express, TypeScript, and Socket.io.
Note that there are 2 modes: local and remote. Local mode is for development, where the agent works on local files on your machine. Remote mode is for production, where the agent works on files in remote isolated sandboxes. Initialization steps, tool execution logic, and other bits differ based on the current mode.
Vercel's AI SDK is a super helpful library to work with LLMs, making it easy to support different models and providers. I used AI SDK Core for Shadow's LLM logic.
The tricky part here is that simpler AI apps built with the AI SDK typically use a stateless architecture with Next.js serverless functions. However, Shadow's architecture should decouple clients from the agent so they can freely connect/disconnect without interrupting any workflows. Because of this, we create a stream processor for each active task, each with chunk handlers to process streamed chunks from the LLM and broadcast them to clients.
More on how the frontend parses these WebSocket events in a later section! Here's an overview of the stream processor class in the backend (stream-processor.ts):
```ts
class StreamProcessor {
  private modelProvider = new ModelProvider();
  private chunkHandlers = new ChunkHandlers();

  async *createMessageStream(
    taskId: string,
    systemPrompt: string,
    model: ModelType,
    userApiKeys: ApiKeys,
    workspacePath: string,
    messages: Message[],
    abortSignal: AbortSignal,
  ): AsyncGenerator<StreamChunk> {
    try {
      const modelInstance = this.modelProvider.getModel(model, userApiKeys);
      const tools = await createTools(taskId, workspacePath);
      const result = streamText({
        model: modelInstance,
        messages,
        abortSignal,
        tools,
        experimental_repairToolCall: async (): Promise<LanguageModelV1FunctionToolCall | null> => {...},
        // other stream config options...
      });

      for await (const chunk of result.fullStream) {
        switch (chunk.type) {
          case "text-delta": {
            const streamChunk = this.chunkHandlers.handleTextDelta(chunk);
            if (streamChunk) yield streamChunk;
            break;
          }
          case "tool-call": {...}
          case "tool-call-streaming-start": {...}
          case "tool-call-delta": {...}
          case "tool-result": {...}
          case "finish": {...}
          case "reasoning": {...}
          // other chunk types...
        }
      }
    } catch (error) {
      yield { type: "error", error: error.message, finishReason: "error" };
    }
  }
}
```
First, we have an async generator function which uses streamText() to make LLM calls, then handles each chunk and emits them to connected clients. The tool call repair parameter is used to retry after validation errors from tool call invocations and outputs. A simplified version of the streamConfig is also included below, although the full file contains more parameters like interleaved thinking, reasoning effort, etc.
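As a rough stand-in for that config (not the actual file; values like maxSteps and the option choices beyond what the document mentions are assumptions), a simplified setup with the AI SDK could look like this:

```ts
// Hypothetical sketch of a simplified stream config; option values are
// assumptions, not Shadow's actual file.
import { streamText, type CoreMessage, type CoreTool, type LanguageModel } from "ai";

function createTaskStream(opts: {
  model: LanguageModel;
  systemPrompt: string;
  messages: CoreMessage[];
  tools: Record<string, CoreTool>;
  abortSignal: AbortSignal;
}) {
  return streamText({
    model: opts.model,             // resolved via ModelProvider + user API keys
    system: opts.systemPrompt,     // identity, tool usage, injected context
    messages: opts.messages,       // prior chat history for the task
    tools: opts.tools,             // per-task tool set
    abortSignal: opts.abortSignal, // lets new messages interrupt the stream
    toolCallStreaming: true,       // emit tool-call deltas for live UI updates
    maxSteps: 64,                  // hypothetical cap on agentic tool-call loops
    experimental_repairToolCall: async () => {
      // retry/repair invalid tool calls instead of failing the whole stream
      return null;
    },
  });
}
```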
The backend also has a ModelContextService class to help manage task-specific context about API keys and models. This is helpful since Shadow doesn't permanently store user API keys, which keeps security simpler.
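Conceptually, it looks something like this (a hypothetical sketch; the real class's fields and methods differ):

```ts
// Hypothetical sketch of per-task model/API-key context. It only illustrates
// the idea of keeping user API keys in memory per task rather than persisting
// them; provider names and method names are assumptions.
type ApiKeys = Partial<Record<"openai" | "anthropic" | "openrouter", string>>;

class ModelContextService {
  private contexts = new Map<string, { model: string; apiKeys: ApiKeys }>();

  // Called when a task starts or the user switches models
  setContext(taskId: string, model: string, apiKeys: ApiKeys) {
    this.contexts.set(taskId, { model, apiKeys });
  }

  getContext(taskId: string) {
    return this.contexts.get(taskId);
  }

  // Dropped on task cleanup so keys never outlive the task
  clearContext(taskId: string) {
    this.contexts.delete(taskId);
  }
}
```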
Shadow has a wide range of tools to work like a human developer, reasoning about and working in complex codebases. Each tool is organized in apps/server/src/agent/tools/ with markdown files for the description and usage instructions. Some tools can be executed in parallel by the LLM, particularly for discovery; parallel execution for file editing isn't encouraged due to potential conflicts.
Because of Shadow's dual modes, we have a "tool executor" abstraction with implementations in the classes LocalToolExecutor and RemoteToolExecutor.
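As an illustrative sketch of that abstraction (the interface, method names, and routes below are my assumptions, not the actual code):

```ts
// Illustrative tool-executor abstraction; method names and sidecar routes are
// assumptions based on the tool list, not the real interface.
interface ToolExecutor {
  readFile(path: string): Promise<string>;
  editFile(path: string, content: string): Promise<void>;
  deleteFile(path: string): Promise<void>;
  listDir(path: string): Promise<string[]>;
  runCommand(command: string): Promise<{ stdout: string; exitCode: number }>;
}

// Remote mode proxies each call to the sidecar HTTP API inside the task VM,
// while local mode hits the local filesystem and terminal directly.
class RemoteToolExecutor implements ToolExecutor {
  constructor(private baseUrl: string) {}

  async readFile(path: string): Promise<string> {
    const res = await fetch(`${this.baseUrl}/api/files/read`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ path }),
    });
    const body = (await res.json()) as { content: string };
    return body.content;
  }

  async editFile(path: string, content: string): Promise<void> {
    await fetch(`${this.baseUrl}/api/files/write`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ path, content }),
    });
  }

  // deleteFile, listDir, and runCommand follow the same request pattern
  async deleteFile(path: string): Promise<void> {}
  async listDir(path: string): Promise<string[]> { return []; }
  async runCommand(command: string): Promise<{ stdout: string; exitCode: number }> {
    return { stdout: "", exitCode: 0 };
  }
}
```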
List of tools:

- read_file
- edit_file (entire file)
- search_replace (text search & replace)
- delete_file
- list_dir
- file_search
- semantic_search (vector-based code search, only available when indexing is complete)
- grep_search
- run_terminal_cmd (with safety validation logic)
- todo_write
- add_memory, list_memories, remove_memory
Shadow also has MCP support! Using experimental_createMCPClient() from the AI SDK, I built in Context7 for up-to-date library documentation search tools. Additional MCP servers can also be added easily.
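For reference, wiring an MCP server's tools into the AI SDK looks roughly like this; the transport type and URL below are placeholders rather than Shadow's actual Context7 setup:

```ts
import { experimental_createMCPClient, type CoreTool } from "ai";

// Placeholder transport and URL; the real Context7 configuration may differ.
async function loadMcpTools(builtInTools: Record<string, CoreTool>) {
  const mcpClient = await experimental_createMCPClient({
    transport: { type: "sse", url: "https://example.com/mcp/sse" },
  });

  // MCP-provided tools get merged with the built-in tool set before streamText()
  const mcpTools = await mcpClient.tools();
  return { ...builtInTools, ...mcpTools };
}
```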
The system prompt contains that tool usage info, as well as sections about identity, capabilities, environment context, operation modes, and dynamic injections like repository memories, Shadow Wiki context (more on this later), and more. If you're interested in the code, check out the apps/server/src/agent/ folder.
Shadow has a queueing system for messages and also stacked tasks. Stacked tasks are new messages created with a given prompt, that are based off of where the current task left off. When these messages are received through the WebSocket connection, they're stored in a queue.
On stream completion for that task, we process the next message in the queue according to its type. Messages can also be sent to immediately interrupt the current stream using an AbortController.
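A minimal sketch of the per-task queue and interrupt flow (names and message shapes are assumptions, not Shadow's actual implementation):

```ts
// Minimal sketch of per-task message queueing and stream interruption.
type QueuedMessage =
  | { kind: "follow-up"; prompt: string }
  | { kind: "stacked-task"; prompt: string };

class TaskQueue {
  private queues = new Map<string, QueuedMessage[]>();
  private controllers = new Map<string, AbortController>();

  // WebSocket handler stores incoming messages while a stream is active
  enqueue(taskId: string, message: QueuedMessage) {
    const queue = this.queues.get(taskId) ?? [];
    queue.push(message);
    this.queues.set(taskId, queue);
  }

  // Called when the user chooses to interrupt instead of queueing
  interrupt(taskId: string) {
    this.controllers.get(taskId)?.abort();
  }

  // Called on stream completion to pick up the next queued message
  dequeue(taskId: string): QueuedMessage | undefined {
    return this.queues.get(taskId)?.shift();
  }

  // Each new stream registers an AbortSignal so it can be cancelled
  registerStream(taskId: string): AbortSignal {
    const controller = new AbortController();
    this.controllers.set(taskId, controller);
    return controller.signal;
  }
}
```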
Codebase indexing in Shadow powers the semantic_search tool: retrieval that understands code from natural language queries, returning metadata like file paths, line numbers, symbol relationships, etc. It's invoked asynchronously on task initialization (if the setting is enabled) and works by building a semantic graph representation of the codebase.
First, we build a graph with GraphNode objects:

- REPO: Repository root
- FILE: Individual source files
- SYMBOL: Functions, classes, methods
- COMMENT: Documentation blocks
- IMPORT: Import statements
- CHUNK: Code segments for embedding

GraphEdge objects define relationships:

- CONTAINS: File contains symbol
- CALLS: Function calls another function
- DOCS_FOR: Comment documents code
- PART_OF: Symbol is part of a larger structure

We do language-aware AST parsing with tree-sitter, supporting languages like JavaScript, TypeScript, Python, and others. This logic handles symbols, imports, calls, cross-file relationships, and more. Large code blocks are also intelligently broken into chunks, preserving boundaries where possible. The resulting chunk embeddings are stored in Pinecone.
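In code terms, the node and edge shapes boil down to something like this (a simplified sketch, not the exact types from the indexing service):

```ts
// Simplified sketch of the graph shapes; field names beyond the node/edge
// kinds listed above are assumptions.
type NodeKind = "REPO" | "FILE" | "SYMBOL" | "COMMENT" | "IMPORT" | "CHUNK";
type EdgeKind = "CONTAINS" | "CALLS" | "DOCS_FOR" | "PART_OF";

interface GraphNode {
  id: string;
  kind: NodeKind;
  path?: string;      // file path for FILE/SYMBOL/CHUNK nodes
  startLine?: number; // source location metadata returned by semantic_search
  endLine?: number;
  content?: string;   // raw text for CHUNK nodes, which is what gets embedded
}

interface GraphEdge {
  kind: EdgeKind;
  from: string; // source node id, e.g. a FILE
  to: string;   // target node id, e.g. a SYMBOL it CONTAINS
}
```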
Shadow Wiki is a codebase documentation system inspired by DeepWiki, triggered on task initialization, which generates hierarchical summaries of codebases for initial agent context.
First, we scan with tree-sitter parsers and build a symbol map for the entire codebase. Next, for efficiency, we identify critical files and sample representative files from directories rather than deeply analyzing every single file.
Then we process files in batches to generate summaries at 3 levels. File-level summaries describe individual files with their purpose, functionality, symbols, dependencies, patterns, etc. Directory-level summaries provide info about folders, aggregating child directory context, previewing representative files, and identifying the directory's purpose. The repo-level summary synthesizes this info, provides an architectural understanding, and links to relevant details from directories and files.
Shadow Wiki generation is an initialization step in the task lifecycle, and summaries are injected after the system prompt. Summaries are also cached by repository ID.
One of the most important parts of Shadow is the auto-provisioned task VMs. Each agent needs its own isolated environment to work in, with filesystem and terminal access to work like a developer would.
Each pod runs a Kata Container using QEMU for hardware virtualization.
Inside each pod, we also run a sidecar container that exposes an HTTP API, acting as a bridge between the backend server and the VM. Similar to the backend, the sidecar API is built with Express and TypeScript.
Core services:
```ts
// File operations
const fileService = new FileService(workspaceService);
// Command execution
const commandService = new CommandService(workspaceService);
// Git operations
const gitService = new GitService(workspaceService);
// Code search
const searchService = new SearchService(workspaceService);
```
API endpoints:
- /api/files/* - File read/write/delete/list operations
- /api/execute/* - Terminal command execution
- /api/search/* - Semantic and grep code search
- /api/git/* - Git operations (commit, push, branching)
- /health - VM health monitoring

The RemoteToolExecutor I mentioned earlier contains the tool implementations for remote mode, which communicate with this sidecar API. The sidecar then executes operations within the VM filesystem and terminal processes.
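To make that concrete, a sidecar route for file reads might look roughly like this; the exact route paths and payload shapes are assumptions on my part:

```ts
import express from "express";

// Illustrative sidecar route; the real API's paths and payloads may differ.
// fileService is the FileService instance shown above, scoped to the VM workspace.
const app = express();
app.use(express.json());

app.post("/api/files/read", async (req, res) => {
  try {
    const { path } = req.body as { path: string };
    const content = await fileService.readFile(path);
    res.json({ success: true, content });
  } catch (error) {
    res.status(500).json({ success: false, error: (error as Error).message });
  }
});

app.listen(4000); // hypothetical sidecar port
```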
Instead of relying solely on file-related tool execution results to keep directory and file contents up to date, we have a filesystem watcher to handle real-time updates. This way, we also capture directory or file changes made through terminal commands or by other means. Like the frontend, the sidecar runs a WebSocket client that connects to the backend's WebSocket server; the sidecar sends filesystem updates to the backend, which then keeps the frontend in sync.
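A minimal sketch of that watcher, assuming chokidar for filesystem events and socket.io-client for forwarding (the event name and payload shape are assumptions):

```ts
import chokidar from "chokidar";
import { io } from "socket.io-client";

// Minimal sketch: forward filesystem events from the VM workspace to the
// backend WebSocket server, which relays them to connected frontends.
const socket = io(process.env.BACKEND_WS_URL ?? "http://localhost:4001");

const watcher = chokidar.watch("/workspace", {
  ignored: /node_modules|\.git/, // skip noisy directories
  ignoreInitial: true,           // only report changes after startup
});

watcher.on("all", (event, path) => {
  // event is "add" | "change" | "unlink" | etc.
  socket.emit("fs-update", { event, path, timestamp: Date.now() });
});
```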
We also have some basic terminal command security logic to prevent dangerous commands or traversals. It's an internal shared package (packages/command-security
) since in local mode we don't run the sidecar, so this validation logic is run directly in the backend server.
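As a rough illustration of the kind of checks involved (the patterns below are examples, not the actual rule set in packages/command-security):

```ts
// Example-only validation rules; the real command-security package is more
// thorough than this sketch.
const BLOCKED_PATTERNS: RegExp[] = [
  /\brm\s+-rf\s+\//,        // destructive recursive deletes from root
  /\bcurl\b.*\|\s*(ba)?sh/, // piping remote scripts into a shell
  /\.\.\//,                 // path traversal outside the workspace
];

export function validateCommand(command: string): { allowed: boolean; reason?: string } {
  for (const pattern of BLOCKED_PATTERNS) {
    if (pattern.test(command)) {
      return { allowed: false, reason: `Blocked by rule: ${pattern}` };
    }
  }
  return { allowed: true };
}
```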
The task lifecycle logic is related to both the agent environment infrastructure and the backend server.
Tasks are initialized on creation or on message reception after a certain period of inactivity, then cleaned up after a certain period of inactivity. The following logic is for remote mode, but local mode is just a simpler version (no VM orchestration logic).
Task initialization follows a state machine pattern. During this, the task status is set to INITIALIZING
, and we also have an InitStatus
enum for granular progress tracking.
This is abstracted into a TaskInitializationEngine
class to easily support both remote and local modes.
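Roughly, the engine walks through a sequence of status updates per mode; the step names, ordering, and method names below are my guesses, not the actual InitStatus values:

```ts
// Hypothetical sketch of the initialization state machine.
enum InitStatus {
  CREATING_VM = "CREATING_VM",
  CLONING_REPO = "CLONING_REPO",
  INSTALLING_DEPS = "INSTALLING_DEPS",
  GENERATING_WIKI = "GENERATING_WIKI",
  STARTING_INDEXING = "STARTING_INDEXING",
  READY = "READY",
}

class TaskInitializationEngine {
  async initialize(taskId: string, mode: "local" | "remote") {
    // Remote mode provisions a VM first; local mode skips orchestration steps
    const steps =
      mode === "remote"
        ? [
            InitStatus.CREATING_VM,
            InitStatus.CLONING_REPO,
            InitStatus.INSTALLING_DEPS,
            InitStatus.GENERATING_WIKI,
            InitStatus.STARTING_INDEXING,
          ]
        : [InitStatus.CLONING_REPO, InitStatus.GENERATING_WIKI];

    for (const step of steps) {
      await this.updateInitStatus(taskId, step); // persisted + broadcast to clients
      await this.runStep(taskId, step);
    }
    await this.updateInitStatus(taskId, InitStatus.READY);
  }

  private async updateInitStatus(taskId: string, status: InitStatus) {
    /* write to the database, emit over WebSocket */
  }
  private async runStep(taskId: string, status: InitStatus) {
    /* per-step implementation */
  }
}
```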
The latest sidecar container image is pulled on VM creation. It's hosted on the GitHub Container Registry and built by a GitHub Actions workflow (build.yml).
Similarly, we have a TaskCleanupService
to handle task cleanup after a certain period of inactivity. There's a task cleanup queue which polls for expired active tasks, then cleans up in-memory data and the task's VM.
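A minimal sketch of that polling loop, with the interval, inactivity threshold, and method names as placeholders:

```ts
// Sketch of inactivity-based cleanup; values and method names are placeholders.
const INACTIVITY_LIMIT_MS = 30 * 60 * 1000; // e.g. 30 minutes
const POLL_INTERVAL_MS = 60 * 1000;

class TaskCleanupService {
  start() {
    setInterval(() => void this.sweep(), POLL_INTERVAL_MS);
  }

  private async sweep() {
    const cutoff = Date.now() - INACTIVITY_LIMIT_MS;
    const expired = await this.findTasksInactiveSince(cutoff);
    for (const task of expired) {
      await this.cleanupInMemoryState(task.id); // stream buffers, queues, sockets
      await this.deleteTaskVM(task.id);         // tears down the task's Kubernetes pod
    }
  }

  private async findTasksInactiveSince(cutoff: number): Promise<{ id: string }[]> {
    return []; // query the database for active tasks past the cutoff
  }
  private async cleanupInMemoryState(taskId: string) {}
  private async deleteTaskVM(taskId: string) {}
}
```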
Shadow uses Git as its source of truth for codebase state and for keeping track of changes. When a user logs in with GitHub and installs the Shadow app onto an organization, Shadow gains access to repositories in that organization (through access and refresh tokens) to easily start building on existing projects and deeply integrate into the Git workflow.
On task creation, branches are auto-generated within the selected repository. We already have title generation logic in place, which is then also used to help with branch naming.
Messages are tied to commits. When an assistant message involves code changes, a commit is co-authored by Shadow and the user. This helps with message editing! When an edited message is submitted, we first check out that message's commit and restore its todo list state to keep everything in sync.
When a stacked task is initialized, the new branch's base commit is the last commit in the base task.
Pull requests are an important piece of the background agent workflow, to be able to easily review and merge changes. The sidebar on the frontend has buttons to create and view pull requests. Auto PR creation can also be enabled in user settings, which just creates or updates the current task's PR if any changes are present on stream completion.
Pull request cards are also visible in the chat UI, which is persisted by storing a PullRequestSnapshot
linked to the corresponding ChatMessage
. I chose to store pull request metadata in snapshots to maintain the correct history of changes, rather than have every "snapshot" simply show the most recent pull request state from GitHub.
You can also trigger tasks directly from a repo's GitHub issues!
When a repo is selected, we make an API request (authenticated by the Shadow GitHub app) to fetch its issues, displaying them by recency in an expandable list with a fun entry animation :)
Shadow uses PostgreSQL for its database (Supabase in production), with Prisma as its ORM. Here's a simplified diagram of the schema:
The diagram above is simplified, missing some other minor task-related tables. Check out the full schema at packages/db/prisma/schema.prisma
. Note that our database is an internal package since the exported prisma client object is used by multiple apps and other internal packages.
As the backend receives stream chunks, we want to ensure messages are stored in real-time. For efficiency, we debounce database updates at an interval to avoid excessive writes.
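A minimal sketch of that debouncing, assuming a hypothetical @repo/db export and a chatMessage Prisma model:

```ts
// Minimal sketch of debounced message persistence; the flush interval, the
// import path, and the Prisma model/field names are assumptions.
import { prisma } from "@repo/db"; // hypothetical internal db package export

const FLUSH_INTERVAL_MS = 500;
const pendingContent = new Map<string, string>(); // messageId -> latest content
let flushTimer: NodeJS.Timeout | null = null;

export function scheduleMessageUpdate(messageId: string, content: string) {
  pendingContent.set(messageId, content);
  if (flushTimer) return; // a flush is already scheduled

  flushTimer = setTimeout(async () => {
    flushTimer = null;
    const updates = [...pendingContent.entries()];
    pendingContent.clear();

    // One write per message with only the latest content, instead of one per chunk
    await Promise.all(
      updates.map(([id, latest]) =>
        prisma.chatMessage.update({ where: { id }, data: { content: latest } })
      )
    );
  }, FLUSH_INTERVAL_MS);
}
```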
Our method for storing chat messages is derived from the patterns of types we see from model providers and the AI SDK. Chat messages have a role (user, assistant, system), content, metadata, and some other fields. A task has an array of messages which form the actual chat history. The AI SDK reference pages are helpful to understand this.
Shadow Wiki and repository index entries are intentionally not related to task entries, since they're tied to Git repositories more than the tasks themselves.
Shadow's frontend is built with Next.js and TypeScript, and styled with Tailwind and Shadcn UI.
The logic around Shadow's chat has 2 "modes". While the LLM isn't streaming, we simply fetch the current task's chat history from the database and display it with user messages and assistant messages.
While the LLM is streaming, we need to see the stream in real-time on the interface. Some naive approaches would be polling the database for updates, or using our WebSocket connection to emit the entire message contents on each stream chunk. Forwarding the message content through WebSockets works, but quickly becomes inefficient for long assistant messages. Shadow is specifically made for tasks where the LLM takes many steps, so this isn't ideal. Instead, the backend keeps an in-memory buffer of the currently streaming message to help with this.
When a task page is opened, the server component first fetches the chat history from the database. On WebSocket connection, it then receives a stream-state
event containing the current stream content if it exists. Then the client receives stream-chunk
events representing tokens and parts from the LLM, which get processed by the accumulation system.
We accumulate stream chunks in a map, where we assign a unique ID to each chunk. This is stored in React state to render them, and also maintained in a ref for immediate access in the processing logic.
```ts
export function useStreamingPartsMap() {
  const mapRef = useRef<Map<string, AssistantMessagePart>>(new Map());
  const [map, setMap] = useState<Map<string, AssistantMessagePart>>(new Map());

  // ...
}
```
Maintaining unique IDs for chunks helps us elegantly make in-place incremental updates to the LLM response.
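To illustrate, an accumulator keyed by part ID could apply chunks like this (shapes simplified, not the actual types):

```ts
// Illustrative accumulator keyed by part ID; chunk and part shapes simplified.
type AssistantMessagePart =
  | { id: string; type: "text"; text: string }
  | { id: string; type: "tool-call"; name: string; argsText: string };

type StreamChunk =
  | { id: string; type: "text-delta"; delta: string }
  | { id: string; type: "tool-call-delta"; name: string; delta: string };

function applyChunk(map: Map<string, AssistantMessagePart>, chunk: StreamChunk) {
  const existing = map.get(chunk.id);
  if (chunk.type === "text-delta") {
    const prev = existing?.type === "text" ? existing.text : "";
    map.set(chunk.id, { id: chunk.id, type: "text", text: prev + chunk.delta });
  } else {
    const prev = existing?.type === "tool-call" ? existing.argsText : "";
    // argsText is partial JSON that the UI parses opportunistically
    map.set(chunk.id, {
      id: chunk.id,
      type: "tool-call",
      name: chunk.name,
      argsText: prev + chunk.delta,
    });
  }
}
```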
Tool calls are even trickier, since we only have access to structured tool call argument data when we receive the tool call result after completion. That isn't enough to surface live tool call statuses like this:
By enabling toolCallStreaming
in the stream config, we get access to tool call deltas. Accumulating these results in a stringified, partial JSON object. By attempting to parse this JSON as it comes in, we can manually extract structured tool call argument data to visualize in the UI.
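A naive version of that opportunistic parsing looks like this; Shadow's actual handling may be more robust:

```ts
// Naive sketch: try to close an incomplete JSON object so we can surface
// whatever arguments have streamed in so far. A dedicated partial-JSON parser
// handles nested structures, arrays, and escapes more carefully.
function tryParsePartialArgs(argsText: string): Record<string, unknown> | null {
  const candidates = [argsText, argsText + '"}', argsText + "}"];
  for (const candidate of candidates) {
    try {
      return JSON.parse(candidate);
    } catch {
      // keep trying progressively "repaired" variants
    }
  }
  return null;
}

// e.g. tryParsePartialArgs('{"target_file":"src/ind') -> { target_file: "src/ind" }
```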
To display the chat UI, we begin by taking in the merged chat history + streaming parts from the task socket hook. The message types:
Before rendering, we group messages into [user, assistant]
pairs, which will be important later.
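The grouping itself is simple; here's a sketch with a simplified Message shape:

```ts
// Sketch of grouping the flat message list into [user, assistant] groups.
interface Message {
  id: string;
  role: "user" | "assistant";
}

function groupIntoPairs(messages: Message[]): Message[][] {
  const groups: Message[][] = [];
  for (const message of messages) {
    if (message.role === "user" || groups.length === 0) {
      groups.push([message]); // each user message starts a new group
    } else {
      groups[groups.length - 1].push(message); // assistant messages join the latest group
    }
  }
  return groups;
}
```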
To help with chat scroll behavior, I used use-stick-to-bottom
.
Assistant messages are complex, handling multiple part types. Text parts are rendered with memoized markdown components:
```tsx
function parseMarkdownIntoBlocks(markdown: string): string[] {
  const tokens = marked.lexer(markdown);
  return tokens.map((token) => token.raw);
}

const MemoizedMarkdown = memo(
  ({ content, id }: { content: string; id: string }) => {
    const blocks = useMemo(() => parseMarkdownIntoBlocks(content), [content]);
    return (
      <div className="space-y-2">
        {blocks.map((block, index) => (
          <MemoizedMarkdownBlock content={block} key={`${id}-block_${index}`} />
        ))}
      </div>
    );
  }
);
```
Tool call parts are displayed in custom components with configuration options such as collapsibility, file icons, loading states, diff numbers, suffixes, etc. (tool.tsx).
Reasoning components are simplified collapsible tool components that auto-open while in progress.
A helpful UX hint in Shadow is how user messages stick to the top of the UI when viewing its following assistant message, for some visual context about the latest prompt.
This is where the [user, assistant]
pairings in individual divs are helpful! We need the message to "unstick" before the next user message, so we need parent divs to mark these boundaries. For example:
```tsx
<div className="sticky">
  <UserMessage />
  <AssistantMessage />
</div>
<div className="sticky">
  <UserMessage />
  {/* ... */}
```
The message editing UX is also designed in an intuitive way:
The point of focus for the user doesn't change from the message's original position, so it's inline, frictionless, and also allows model switching (with a keyboard shortcut) if needed.
Shadow's UX around follow-up messages follows a 2-step pattern to support multiple variants of message logic.
This is a bit unconventional and comes with the trade-off of added friction, for the benefit of surfacing more complex decision making for follow-up messages. I thought this made sense here since Shadow is specifically for longer-horizon tasks, meaning follow-up messages are relatively infrequent.
The expanding/collapsing animation uses the prompt input's border as a "secondary container", similar to how message editing is designed.
While not technically "inside" the chat UI, the sidebar is an important piece of the task page, acting as an overview of the task and its progress.
It surfaces important info in real-time, such as Git info, todo completion, diff stats, and a tree of modified files that opens file contents when clicked!
The frontend app uses RSCs (React Server Components) where possible, to prefetch data and improve performance. I used TanStack Query for data fetching and server state management.
The simplest approach to support SSR with useQuery hooks is the initialData parameter, but this can lead to prop drilling since we'd need to pass data down to the client components that need it.
Instead, we can take a more elegant approach using hydration boundaries and serialization. Here's a simplified code example from the task page layout ([taskId]/layout.tsx
):
```tsx
export default async function TaskLayout({
  children,
  params,
}: {
  children: React.ReactNode;
  params: Promise<{ taskId: string }>;
}) {
  const { taskId } = await params;
  const queryClient = new QueryClient();

  await Promise.allSettled([
    queryClient.prefetchQuery({
      queryKey: ["task", taskId],
      queryFn: () => getTaskWithDetails(taskId),
    }),
    queryClient.prefetchQuery({
      queryKey: ["task-messages", taskId],
      queryFn: () => getTaskMessages(taskId),
    }),
    queryClient.prefetchQuery({
      queryKey: ["api-keys"],
      queryFn: getApiKeys,
    }),
    queryClient.prefetchQuery({
      queryKey: ["models"],
      queryFn: getModels,
    }),
  ]);

  return (
    <HydrationBoundary state={dehydrate(queryClient)}>
      {/* providers, children, etc... */}
    </HydrationBoundary>
  );
}
```
We also have context providers for agent environment layout states and file-related data (AgentEnvironmentProvider
), task socket data for streaming state (TaskSocketProvider
), modal opened states (ModalProvider
), and more.
The "Shadow Realm" is the agent environment's representation on the frontend. This is an important part of Shadow, since I believe great coding assistant experiences give the user full freedom over how much control they want in the agent's workspace. It has a complete file tree, code editor, terminal, and Shadow Wiki view in a collapsible, resizable layout.
The file explorer displays the project's file tree, which is fetched separately from complete file contents. File content is fetched as needed when files are opened. Markdown files are supported using the same rendering components as the chat UI, and the Shadow Wiki view is just a specialized version of that.
The terminal is a basic terminal emulator, built with xterm.js. It communicates with the backend through WebSocket events, so that clients can see terminal command execution results made by Shadow in real-time.
The code editor is built with Monaco, which powers VS Code and many other tools. JSX highlighting support isn't great in Monaco by default, so I used Shiki, a better syntax highlighting engine.
The color theme is a slight variation of Vesper from Rauno Freiberg, which I love (image above from their GitHub repo). This theme also inspired color accents throughout the rest of Shadow, including the following animation!
The new task animation is a subtle multi-layered animation involving various gradient effects. It's inspired by Dia's new tab animation.
The first component is a radial gradient that animates upwards and scales down in size, from near the bottom of the screen to just above the prompt input. Then we have 2 animated conic gradients inside the prompt form, pulsing around either side of the border in sync with the background gradient.
The frontend was the most straightforward part, deployed on Vercel.
The Kubernetes cluster for task VMs was on AWS EKS with bare metal nodes for KVM/nested virtualization. We have a kata-qemu
RuntimeClass for VM-based containers. There's also a shadow-agents
namespace for task execution, storage classes for persistent volumes, and RBAC for service account authentication.
Since each pod needs its own sidecar instance, it pulls from GHCR for the latest image which is built by a GitHub Action.
The backend server is deployed on ECS. We first build the server image with Docker and push it to ECR, then set up the Application Load Balancer, secrets, security groups, and auto-scaling policies. It's deployed in the same VPC as the Kubernetes cluster for easy communication between the two.
Check out the full deployment scripts if you're interested!
Pretty soon into building the project, Shadow began contributing to its own codebase! I find this positive feedback cycle that inherently comes with building developer tools to be really cool.
Here's the final demo video:
Also check out the X thread & system design breakdown video.
Working on this project was a grind, but a lot of fun. I'm super proud of how powerful and complete Shadow became after only a few weeks of building.
Thanks to Rajan and Elijah for helping and doing some super interesting work on Shadow! Rajan built the Shadow Wiki codebase understanding system and other bits across the rest of the codebase. Elijah built the background indexing service.
There are still a lot of areas that I think would be super interesting to explore with Shadow. To list a few:
Subagents as Tools
Allowing the agent to spin up subtasks in parallel, enabling deep discovery in codebases without much context pollution.
Browser Use
The agent already has access to the terminal; giving it access to a browser with vision capabilities would let it iterate on UI changes.
Parallel Iterations
Starting multiple iterations of the same task in parallel to test different models and monitor them together, in real time.
VM Improvements
Using dev containers to support more project types, letting the agent environment closely match real local development setups. Also snapshotting VM state on cleanup to use for resumption, avoiding long re-initialization times.
Timeline
Similar to Devin, keeping track of task progress in a timeline to trace through every step the agent took in its environment.
Context Improvements
Auto-compaction was a major feature we partially built but haven't fully completed yet, which would be a big step towards supporting even longer-horizon tasks. This is an interesting challenge since it requires cutting down most of the context window, yet maintaining sufficient understanding of the current task to continue without losing quality.
Memory Improvements
The current memory system is fairly primitive. Exposing memories and rules to be customizable in the UI would help improve general codebase understanding across tasks within the same codebase.