Building a RAG Tool in Ruby: What Actually Happened
I had never personally worked with embeddings, vector databases, or retrieval-augmented generation before this project. I knew the words. I did not know where the sharp edges were. Folks on our team do… but I felt it was time to wrap my own head around it.
What I did have was a real problem, a team that loves Ruby, and enough curiosity to see where things broke.
This is the story of that experiment… what worked, what surprised me, and what I’d tell another Ruby developer who’s considering something similar.
The Problem
At Planet Argon, we manage several client projects. We live in Jira (I know… I know…). We keep decisions in Confluence. We ship code from GitHub. Over the years, a lot of institutional knowledge piles up across those systems… past bugs, old tradeoffs, and the “we tried that once” stories.
The problem is that nobody remembers all of it. A new ticket comes in: “users can’t export reports to PDF”. Somewhere in Jira there’s a ticket from eight months ago where we debugged a Safari-specific PDF export issue. Of course it was Safari. Somewhere in Confluence there’s a permissions matrix that’s suddenly relevant. If you weren’t assigned to the project back then, you would never know to look.
So we start over. We ask clarifying questions from scratch. We search Slack to see if anyone has asked something like this before. Tickets go into development with vague acceptance criteria, and the back-and-forth that should have happened before coding shows up during code review or QA on staging instead.
A vague ticket is a polite way to ask engineers to guess. Guessing can be expensive.
I wanted to build something that could surface that historical context automatically. Point it at a ticket and get suggested clarifying questions grounded in what we actually know (well, “remember”) about this project.
Why Ruby, Why Minimal Dependencies
Ruby is what our team loves working in. If we were going to learn embeddings, vector search, and LLM integration, I wanted everything around those ideas to feel familiar.
I also wanted to keep the dependency footprint deliberately small. This is an internal tool for a small team. Every Ruby gem you add is a gem you maintain. I’ve watched too many internal tools rot after someone pulled in thirty dependencies for a weekend project, then nobody wanted to deal with the upgrade treadmill six months later.
Related: the Internal Tooling Maturity Ladder is an approach I’ve been exploring for our internal tools. The idea is to start with the simplest possible implementation (a script that solves the problem for one person), then evolve it through stages of maturity (CLI tool, shared server, versioned gem) as the need becomes clearer and the team is ready to invest more.
Rather than listing the full Gemfile here, I’ll call out the handful of gems that did the heavy lifting… because that’s the part you can steal directly if you’re building something similar.
The gems that do the real work:
- thor: CLI framework. Subcommands, flags, help text out of the box.
- ruby-openai: the workhorse. Handles both embedding generation (text-embedding-3-small) and LLM completions (gpt-4o-mini). One gem, two critical jobs.
- pinecone: Ruby client for Pinecone, our production vector database.
- chroma-db: Ruby client for Chroma, a local vector database you can run in Docker.
- faraday: HTTP client for talking to Jira, Confluence, and GitHub APIs.
- nokogiri: needed to strip HTML from Confluence page bodies before embedding.
- concurrent-ruby: thread pools and futures for parallel data ingestion.
- mcp: Model Context Protocol server for Claude Code integration (this came later, and it changed everything).
- The tty-* family: progress bars, spinners, colored output, prompts. Not necessary… but nicer when you’re watching a 20-minute ingestion run.
Beyond that, I leaned on Ruby’s standard library wherever possible: JSON, URI, Struct, Set, Time, FileUtils. The instinct to reach for a gem is strong, but for most things the stdlib is genuinely sufficient. The goal is not cleverness. The goal is leverage. I also leaned on minitest for testing, but that’s a story for another post.
Why Not a Server
Early on I made a decision that shaped the whole architecture: no running HTTP server with an endpoint (at least, not yet).
A server is a commitment. Hosting. VPNs. Monitoring. Security reviews. Someone eventually asking, “who owns this?”. For an internal experiment that might not pan out, that felt like a lot of ceremony up front.
So I built it as a CLI tool. Each engineer runs it locally on their own machine. The only shared infrastructure is Pinecone, a cloud-hosted vector database. Everyone gets API keys to the same Pinecone index, but each client’s data lives in its own namespace. Engineers use their own Atlassian and GitHub API tokens when they want to run an ingestion.
Here’s what the environment setup looks like:
# .env: each engineer has their own copy
# OpenAI (for embeddings and analysis)
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini
# Atlassian (shared instance, individual tokens)
ATLASSIAN_BASE_URL=https://planetargon.atlassian.net
ATLASSIAN_EMAIL=you@planetargon.com
ATLASSIAN_API_TOKEN=ATATT3x...
# GitHub (individual tokens)
GITHUB_TOKEN=ghp_...
# Pinecone (shared index, namespaces isolate client data)
PINECONE_API_KEY=...
PINECONE_INDEX_NAME=clarion
This kept the experiment low-stakes. No deployment pipeline, no server to maintain, no VPN to configure. If it didn’t work out, there was nothing to decommission. Engineers pull updates from the main branch, run bundle install, and they’re current. It’ll likely become a proper gem we version at some point, but for now the simplicity of “pull main and go” is working fine.
What the CLI Looks Like
The entrypoint is dead simple:
#!/usr/bin/env ruby
require "bundler/setup"
require_relative "../lib/clarion"
Clarion::CLI.start(ARGV)
Here’s the help output:
$ bin/clarion help
Commands:
  clarion analyze TICKET_ID  # Analyze a Jira ticket and suggest clarifications
  clarion help [COMMAND]     # Describe available commands or one specific command
  clarion ingest SUBCOMMAND  # Ingest data from various sources
  clarion ingest_all CLIENT  # Ingest Jira, Confluence, and GitHub data for a client
  clarion mcp                # Start MCP server (for Claude Code integration)

$ bin/clarion help ingest
Commands:
  clarion ingest confluence  # Ingest Confluence pages for a specific space
  clarion ingest github      # Ingest GitHub repository data
  clarion ingest help        # Describe subcommands or one specific subcommand
  clarion ingest jira        # Ingest Jira tickets for a specific project
The CLI: Thor Subcommands
I chose Thor over raw OptionParser because the tool has several distinct commands with different flag sets. Thor gives you subcommands, required options, type validation, and auto-generated help text with minimal boilerplate.
Here’s the skeleton of the CLI:
module Clarion
  class CLI < Thor
    desc "analyze TICKET_ID", "Analyze a Jira ticket and suggest clarifications"
    option :verbose, type: :boolean, desc: "Enable verbose output"
    def analyze(ticket_id)
      validate_ticket_id!(ticket_id)
      analyzer = Clarion::Analyzer.new(ticket_id, verbose: options[:verbose])
      puts analyzer.analyze
    end

    desc "ingest_all CLIENT", "Ingest Jira, Confluence, and GitHub data for a client"
    option :limit, type: :numeric, default: 100
    option :parallel, type: :boolean, default: true
    def ingest_all(client_name)
      # Looks up client config, dispatches to parallel ingestion
    end

    desc "mcp", "Start MCP server (for Claude Code integration)"
    option :namespace, type: :string, desc: "Client namespace (auto-detected if omitted)"
    def mcp
      Clarion::McpServer.new(namespace: options[:namespace]).run
    end

    # Nested subcommand for individual ingestion
    desc "ingest SUBCOMMAND", "Ingest data from various sources"
    subcommand "ingest", Ingest

    private

    def validate_ticket_id!(ticket_id)
      return if ticket_id =~ /^[A-Z]+-\d+$/
      raise Thor::Error, "Invalid ticket ID format. Expected: PROJECT-123"
    end
  end
end
The Ingest subcommand is its own Thor class, giving us scoped commands for each data source. Day-to-day usage looks like this:
# Analyze a ticket
$ bin/clarion analyze WR-123
# Ingest everything for a client (parallel by default)
$ bin/clarion ingest_all waystar --limit=500
# Or ingest individual sources
$ bin/clarion ingest jira --namespace=waystar --project=WR --limit=500
$ bin/clarion ingest confluence --namespace=waystar --space=WR
$ bin/clarion ingest github --namespace=waystar --repo=planetargon/waystar-web
# Start an MCP server for Claude Code
$ bin/clarion mcp --namespace=waystar
Every ingest command requires explicit --namespace and source-specific scoping flags (--project, --space, --repo). This is deliberate. Operations should never run without explicit client scope.
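For completeness, here’s roughly what that nested Ingest class looks like. This is a sketch… the real one has more flags, and the Ingesters::Jira name is a stand-in for the actual ingester class:

class Ingest < Thor
  desc "jira", "Ingest Jira tickets for a specific project"
  option :namespace, type: :string, required: true, desc: "Client namespace"
  option :project, type: :string, required: true, desc: "Jira project key"
  option :limit, type: :numeric, default: 100
  def jira
    # required: true on the options above is what enforces the
    # explicit client scoping described earlier
    Ingesters::Jira.new(
      namespace: options[:namespace],
      project: options[:project],
      limit: options[:limit]
    ).run
  end

  # confluence and github subcommands follow the same pattern
end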
Client Configuration
Each client maps to a namespace, a Jira project, a Confluence space, and optionally GitHub repos:
# config/clients.yml
clients:
  waystar:
    namespace: waystar
    jira_project: WR
    confluence_space: WR
    vector_store: pinecone
    github_repos:
      - planetargon/waystar-web
      - planetargon/waystar-api
  piedpiper:
    namespace: piedpiper
    jira_project: PP
    confluence_space: PP
    vector_store: pinecone
    github_repos:
      - planetargon/piedpiper-app
  pierpoint:
    namespace: pierpoint
    jira_project: PPC
    confluence_space: PPC
    vector_store: chroma # Local Chroma for testing
Note the per-client vector_store setting. One client can use Pinecone (shared, cloud-hosted) while another uses Chroma (local Docker instance) for development. The tool doesn’t care. The vector store abstraction handles it.
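Loading that file is about as boring as Ruby gets, which is the point. A sketch of the lookup, assuming a simple Struct (the names here are illustrative, not the actual class):

require "yaml"

# Hypothetical Client struct mirroring the YAML shape above
Client = Struct.new(:namespace, :jira_project, :confluence_space,
                    :vector_store, :github_repos, keyword_init: true)

def load_client(name)
  clients = YAML.load_file("config/clients.yml").fetch("clients")
  config = clients.fetch(name) { raise ArgumentError, "Unknown client: #{name}" }
  Client.new(**config.transform_keys(&:to_sym))
end

client = load_client("waystar")
client.vector_store # => "pinecone"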
Embeddings: Simpler Than I Expected, Until They Weren’t
Here’s what took me a while to internalize: you’re just turning text into a point in a very large space. Similar text ends up near similar points. That’s it. That’s the whole idea.
We use OpenAI’s text-embedding-3-small model, which produces 1,536-dimensional vectors. You send it a string, you get back an array of 1,536 floats. Store those floats alongside the original text, and later you can find “nearby” documents by comparing vectors.
The ruby-openai gem makes the embedding call straightforward:
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIMENSION = 1536
def generate_embedding(text)
  return Array.new(EMBEDDING_DIMENSION, 0.0) if text.nil? || text.strip.empty?

  response = @openai.embeddings(
    parameters: {
      model: EMBEDDING_MODEL,
      input: text.strip
    }
  )
  response["data"][0]["embedding"]
end
One thing I didn’t appreciate initially is that every embedding call costs money and adds latency. My early version used search, which takes a text string and internally calls OpenAI to generate an embedding before querying Pinecone:
# Before: each search() call generates its own embedding internally
similar = @vector_store.search(query_text, filter: { source: "jira" })
docs = @vector_store.search(query_text, filter: { source: ["confluence", "github"] })
resolved = @vector_store.search(query_text, filter: resolved_filter)
That’s three sequential calls to OpenAI’s embedding API for the exact same text, followed by three sequential calls to Pinecone. Six network round-trips, all in series.
Looking at the search method, you can see why. It generates a fresh embedding every time:
def search(query, filter: nil, top_k: 10)
  query_embedding = generate_embedding(query) # Hits OpenAI every call
  search_by_vector(query_embedding, filter: filter, top_k: top_k)
end
The fix was two things at once: generate the embedding once, then pass that vector directly to search_by_vector (which skips the embedding step). Then run those three Pinecone queries concurrently:
# After: one embedding, three parallel vector searches
query_vector = @search.embed(query_text)
similar  = Thread.new { @search.search_by_vector(query_vector, { source: "jira" }) }
docs     = Thread.new { @search.search_by_vector(query_vector, { source: ["confluence", "github"] }) }
resolved = Thread.new { @search.search_by_vector(query_vector, resolved_filter) }
# Thread#value joins each thread and returns its search results
similar, docs, resolved = [similar, docs, resolved].map(&:value)
The OpenAI embedding calls went from 3 to 1. The Pinecone queries stayed at 3 but now run concurrently instead of sequentially. Two wins from a small refactor.
I also learned about truncation the hard way. Some Jira tickets are enormous… long comment threads, embedded images described in markup, and extensive acceptance criteria. The embedding model has a token limit. We now truncate text at 30,000 characters before sending it for embedding.
Would’ve been nice to learn that from documentation rather than from a production error. Oh well.
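The guard itself is one line. Something like this… 30,000 characters is the number we landed on; the constant and method names are illustrative:

# text-embedding-3-small has a token limit; character count is a crude
# but cheap proxy for tokens
MAX_EMBEDDING_CHARS = 30_000

def truncate_for_embedding(text)
  text.length > MAX_EMBEDDING_CHARS ? text[0, MAX_EMBEDDING_CHARS] : text
end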
The Vector Store Abstraction
I didn’t want to be locked into a single vector database, especially early on when I wasn’t sure which one would work best for us. So I built a simple abstraction layer. It’s a factory that returns different backends behind the same interface:
class VectorStore
  def self.new(namespace:, backend: nil)
    backend ||= ENV.fetch("VECTOR_STORE_BACKEND", "memory")
    case backend.downcase
    when "pinecone" then VectorStores::Pinecone.new(namespace: namespace)
    when "chroma"   then VectorStores::Chroma.new(namespace: namespace)
    when "memory"   then VectorStores::Memory.new(namespace: namespace)
    else raise ArgumentError, "Unknown vector store backend: #{backend}"
    end
  end
end
All three backends implement the same base contract:
module VectorStores
  class Base
    attr_reader :namespace

    def initialize(namespace: nil)
      @namespace = namespace
    end

    def upsert(documents)
      raise NotImplementedError, "#{self.class}#upsert must be implemented"
    end

    def search(query, filter: nil, top_k: 10)
      raise NotImplementedError, "#{self.class}#search must be implemented"
    end

    def search_by_vector(vector, filter: nil, top_k: 10)
      raise NotImplementedError, "#{self.class}#search_by_vector must be implemented"
    end

    def embed(text)
      raise NotImplementedError, "#{self.class}#embed must be implemented"
    end

    def delete_all(namespace: nil)
      raise NotImplementedError, "#{self.class}#delete_all must be implemented"
    end

    def stats
      raise NotImplementedError, "#{self.class}#stats must be implemented"
    end
  end
end
Callers just use upsert, search, search_by_vector, stats. They never know or care whether they’re talking to Pinecone, Chroma, or an in-memory hash.
The Pinecone backend stores document text inside the metadata (Pinecone doesn’t have a native text field), then strips it back out on retrieval:
# During upsert: embed text into metadata
metadata = (doc[:metadata] || {}).merge(text: doc[:text])
{ id: doc[:id], values: embedding, metadata: metadata }

# During search: extract text back out, unescape newlines
matches.map do |match|
  result = match.dup
  if result["metadata"] && result["metadata"]["text"]
    text = result["metadata"]["text"]
    result["text"] = text.is_a?(String) ? text.gsub('\\n', "\n") : text
    result["metadata"] = result["metadata"].except("text")
  end
  result
end
This paid off quickly. We started with the in-memory backend (pure Ruby cosine similarity, persists to a JSON file) just to prove the concept worked at all. Then Chroma for local development. You can run it in Docker. No cloud account needed. Then Pinecone for the shared production dataset that the whole team can access.
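If you’re wondering what “pure Ruby cosine similarity” amounts to, it’s roughly this… a sketch of the idea, not the backend’s exact code:

# Cosine similarity: 1.0 for vectors pointing the same direction,
# near 0.0 for unrelated ones. Search is just "compute this against
# every stored vector and keep the top_k".
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  dot / (norm.call(a) * norm.call(b))
end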
Ingesting Messy Real-World Data
This is where things got messy.
Jira: Flattening the Ticket
Each Jira ticket gets transformed into a document with an ID, a text blob, and structured metadata:
def transform(ticket)
  key = ticket["key"]
  fields = ticket["fields"] || {}
  {
    id: "jira_#{@namespace}_#{key}", # e.g., "jira_waystar_WR-123"
    text: build_text(key, fields),
    metadata: build_metadata(key, fields)
  }
end
The text blob concatenates everything meaningful about the ticket: the key, summary, description, comments (with author tags), labels, parent/subtask relationships, and any embedded Confluence links.
Jira’s rich text format is its own thing: Atlassian Document Format (ADF), used for ticket descriptions and comments. It’s not HTML. It’s not Markdown. It’s a deeply nested JSON tree with node types like paragraph, bulletList, taskItem, mention, inlineCard, and emoji. I had to write a recursive parser to walk that tree and flatten it into plain text:
class AdfParser
  def extract_text(adf_doc)
    return "" unless adf_doc.is_a?(Hash)
    extract_blocks(adf_doc).join(" ").strip
  end

  private

  def extract_blocks(adf_doc)
    return [] unless adf_doc["content"].is_a?(Array)
    adf_doc["content"].map { |node| format_block(node) }
  end

  def format_block(node)
    return "" unless node.is_a?(Hash)
    case node["type"]
    when "taskList" then format_task_list(node)
    when "bulletList", "orderedList" then format_list(node)
    else extract_from_node(node)
    end
  end

  def extract_from_node(node)
    case node["type"]
    when "text" then node["text"] || ""
    when "hardBreak" then "\n"
    when "mention" then "@#{node.dig('attrs', 'text') || 'user'}"
    when "emoji" then node.dig("attrs", "shortName") || ""
    when "inlineCard", "blockCard" then node.dig("attrs", "url") || ""
    else inline_text(node)
    end
  end

  # format_task_list, format_list, and inline_text recurse into child
  # nodes; elided here for brevity
end
Not complex, but the kind of thing you don’t anticipate until you see your first embedding full of raw JSON nodes. Thankfully, we can task Claude Code with figuring out some of this chaos.
Comment authors matter. We tag each Jira comment as [Team] or [Client] based on the commenter’s email domain:
def determine_author_type(email)
  if email.include?("@planetargon.com")
    "[Team]"
  elsif email.empty?
    ""
  else
    "[Client]"
  end
end
This matters more than I thought it would. The LLM can distinguish between internal engineering discussion and client-facing conversation when generating suggested questions.
Confluence: Chunking HTML
Confluence pages come back as raw HTML. Nokogiri strips the markup, then long pages get chunked into roughly 2,000-character segments with 200 characters of overlap, breaking at sentence boundaries where possible. Each chunk becomes its own document in the vector store. A 10-page Confluence spec might produce five or six chunks, each independently searchable.
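The chunker is roughly this… a simplified sketch, with the real one handling more edge cases around short pages and pathological HTML:

require "nokogiri"

CHUNK_SIZE = 2_000
CHUNK_OVERLAP = 200

def chunk_page(html)
  text = Nokogiri::HTML(html).text # strip Confluence markup
  chunks = []
  start = 0
  while start < text.length
    slice = text[start, CHUNK_SIZE]
    # Prefer to break at a sentence boundary in the back half of the slice
    if start + CHUNK_SIZE < text.length &&
       (boundary = slice.rindex(/[.!?]\s/)) && boundary > CHUNK_SIZE / 2
      slice = slice[0..boundary]
    end
    chunks << slice.strip
    break if start + slice.length >= text.length
    start += slice.length - CHUNK_OVERLAP # step back for overlap
  end
  chunks
end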
GitHub: PRs, Issues, Docs, and Code
The GitHub ingester pulls from multiple sources: READMEs and documentation files, pull request descriptions (with merge dates and authors), issues, and source code files. Each becomes a document with source: "github" metadata, so the context builder can query for documentation specifically.
Batch Uploads and Deterministic IDs
Documents get uploaded to the vector store in batches of 20. Errors in one batch don’t abort subsequent batches:
class BatchUploader
  BATCH_SIZE = 20

  def upload(documents)
    documents.each_slice(BATCH_SIZE) do |batch|
      @vector_store.upsert(batch)
      @processed_count += batch.length
    rescue StandardError => e
      @error_count += batch.length
    end
  end
end
Every document gets a deterministic ID based on its source: jira_waystar_WR-123, confluence_waystar_12345_chunk_2, or github_waystar_waystar-web_pr_47. This means re-running ingestion overwrites old documents instead of creating duplicates. Engineers can re-ingest anytime without polluting the dataset.
When a Jira ticket updates, the next ingestion run replaces the old embedding with the new one. Same with Confluence pages and GitHub content. The vector store stays in sync with reality without complex change detection or deletion logic.
The tradeoff: someone needs to remember to run ingestion periodically. But the simplicity is worth it.
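To make the ID scheme concrete, it’s just string concatenation, but it’s doing real work. A sketch (hypothetical helper; the output formats match the examples above):

# Same source + namespace + key always yields the same ID, so upsert
# overwrites instead of duplicating
def document_id(source:, namespace:, key:, chunk: nil)
  parts = [source, namespace, key]
  parts << "chunk_#{chunk}" if chunk
  parts.join("_")
end

document_id(source: "jira", namespace: "waystar", key: "WR-123")
# => "jira_waystar_WR-123"
document_id(source: "confluence", namespace: "waystar", key: "12345", chunk: 2)
# => "confluence_waystar_12345_chunk_2"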
Parallel Ingestion with concurrent-ruby
When ingesting all sources for a client, the tool uses concurrent-ruby to run Jira, Confluence, and GitHub ingestions in parallel:
pool = Concurrent::FixedThreadPool.new(3)
futures = []
futures << Concurrent::Future.execute(executor: pool) { ingest_jira }
futures << Concurrent::Future.execute(executor: pool) { ingest_confluence }
github_repos.each do |repo|
  futures << Concurrent::Future.execute(executor: pool) { ingest_github(repo) }
end

# Wait for all to complete
futures.each(&:wait)
Thread-safe state tracking uses Concurrent::Hash:
@results = Concurrent::Hash.new
@timings = Concurrent::Hash.new
@status = Concurrent::Hash.new
After completion, the tool calculates time saved versus sequential execution and reports a speedup factor. In practice, parallel ingestion typically cuts wall-clock time by more than half, since the API calls to Jira, Confluence, and GitHub can overlap.
Running an ingestion looks like this:
$ bin/clarion ingest_all waystar --limit=500
════════════════════════════════════════════════════════════
COMBINED DATA INGESTION
════════════════════════════════════════════════════════════
ℹ Client: waystar
ℹ Namespace: waystar
ℹ Vector store: pinecone
ℹ Jira project: WR
ℹ Confluence space: WR
ℹ GitHub repos: planetargon/waystar-web
ℹ Limit: 500 items per source
ℹ Mode: Parallel
✓ Jira (WR) Complete (487/500 processed)
✓ Confluence (WR) Complete (245/500 processed)
✓ GitHub: waystar-web Complete (498/500 processed)
════════════════════════════════════════════════════════════
INGESTION RESULTS
════════════════════════════════════════════════════════════
ℹ ✓ Jira: 487 processed, 0 errors (45.2s)
ℹ ✓ Confluence: 245 processed, 0 errors (38.1s)
ℹ ✓ Github Waystar Web: 498 processed, 0 errors (52.7s)
════════════════════════════════════════════════════════════
PERFORMANCE SUMMARY
════════════════════════════════════════════════════════════
ℹ Total documents processed: 1230
✓ Total time: 58.3s
ℹ Time saved vs sequential: 77.7s (2.3x speedup)
✓ Client 'waystar' is ready for analysis!
Retrieval and Re-Ranking
Raw cosine similarity gets you most of the way there, but not all the way. The vector search returns the 40 most similar Jira tickets, and some of them are similar for the wrong reasons… same boilerplate language, same component name, but not actually useful context.
The context builder generates one embedding, then runs three concurrent searches. Similar tickets. Resolved tickets filtered by component. Documentation from Confluence and GitHub.
def gather_all_context(ticket, ticket_id, current_key, created_time)
  query = @search.build_query(ticket)
  query_vector = @search.embed(query)

  similar_thread = Thread.new do
    results = @search.search_by_vector(query_vector, { source: "jira" }, 40)
    score_and_limit_results(results, ticket, current_key, created_time, 16)
  end

  resolved_thread = Thread.new do
    results = @search.search_by_vector(query_vector, resolved_filter, 12)
    format_resolved_tickets(results)
  end

  docs_thread = Thread.new do
    results = @search.search_by_vector(query_vector, { source: ["confluence", "github"] }, 32)
    process_and_limit_docs(results, ticket, ticket_id, created_time, 16)
  end

  {
    similar_tickets: similar_thread.value,
    related_resolved: resolved_thread.value,
    documentation: docs_thread.value
  }
end
After retrieval, I added two simple re-ranking heuristics that made a noticeable difference:
Relationship boost. If a retrieved ticket is a parent or subtask of the ticket being analyzed, its score gets a 1.5x multiplier:
def apply_relationship_boost(ticket_data, relationship_type)
  ticket_data[:relationship] = relationship_type
  ticket_data[:score] *= 1.5
end
Temporal decay. Items created more than 7 days before the ticket under analysis get a 0.7x multiplier. More than 30 days before, 0.3x:
def age_adjustment_params(days_before)
  return [0.3, "Created #{days_before} days before ticket"] if days_before > 30
  return [0.7, nil] if days_before > 7
  [nil, nil]
end
These aren’t machine learning models. They’re just multipliers applied after retrieval. I was surprised how much difference they made. A few lines of Ruby math moved the output from “interesting but noisy” to something I’d actually act on.
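Stitched together, the re-ranking pass looks something like this… related_to? and days_between are hypothetical helpers standing in for the real lookups:

def rerank(results, ticket, created_time)
  results.each do |result|
    # Parent/subtask relationships outrank plain textual similarity
    if (relationship = related_to?(ticket, result))
      apply_relationship_boost(result, relationship)
    end

    factor, note = age_adjustment_params(days_between(result, created_time))
    result[:score] *= factor if factor
    result[:age_note] = note if note
  end
  results.sort_by { |result| -result[:score] }
end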
It’s still early days, and I expect we’ll need to tweak these heuristics a bunch as we see more real-world queries and get feedback from engineers.
The Prompting Side
Two things surprised me here.
First: structured JSON output. Huge deal. We set response_format: { type: "json_object" } on the LLM call, which means the response is always valid JSON. No regex parsing, no hoping the model follows your format instructions. The response comes back with a defined structure:
{
  "ticket_type": "feature",
  "clarity_assessment": "needs_clarification",
  "clarifying_questions": [
    {
      "question": "The question to ask the client",
      "rationale": "Why this matters for implementation",
      "reference": "WR-892: similar issue last quarter"
    }
  ],
  "suggested_acceptance_criteria": [
    "User can export all report types to PDF",
    "Export completes within 30 seconds",
    "Error message displays if export fails"
  ],
  "potential_edge_cases": [
    "Special characters in report data",
    "Very large reports (>10,000 rows)"
  ],
  "implementation_notes": "Brief notes on approach"
}
Once you have reliable structure, everything downstream gets simpler.
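For reference, the call site is unremarkable, which is kind of the point. A sketch of what it looks like with ruby-openai… system_prompt and ticket_context stand in for the real inputs:

response = @openai.chat(
  parameters: {
    model: ENV.fetch("OPENAI_MODEL", "gpt-4o-mini"),
    response_format: { type: "json_object" }, # guarantees parseable JSON
    messages: [
      { role: "system", content: system_prompt },  # loaded from prompts/*.md
      { role: "user", content: ticket_context }
    ]
  }
)

analysis = JSON.parse(response.dig("choices", 0, "message", "content"))
analysis["clarifying_questions"].each { |q| puts q["question"] }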
Second: the prompt is where your institutional voice lives. This is the part that can’t be replicated by generic tooling. Our system prompt doesn’t just say “generate clarifying questions”. It encodes how Planet Argon communicates with clients:
Instead of asking open-ended technical questions, frame them as confirmations:
“It sounds like this needs to work in Chrome. Should we also make sure it works in Safari and Firefox?”
Rather than:
“What browsers need to be supported?”
The prompt covers dozens of specific communication scenarios. A few examples from the actual prompt file:
When clients apologize for not being technical:
“No need to apologize. You’re describing exactly what we need to know. The ‘what’s broken’ is your expertise; the ‘why it’s broken’ is ours.”
When scope is creeping:
“There’s a lot of good stuff here. To make sure nothing gets lost, would it help to break this into separate tickets? That way we can track the export fix and the new filter feature independently.”
When clients describe workarounds they’re using:
“Good thinking on the CSV workaround. That’ll keep things moving. We’ll fix the PDF export so you don’t have to keep doing that extra step.”
When something is working as designed:
“So it turns out the system is doing what it was originally built to do, but I hear you that it’s not what you need it to do. Want us to write up a feature request to change this behavior?”
This is the part that makes it ours and not just another RAG wrapper. The vector search finds the history. The prompt makes it sound like us.
We also maintain two separate prompt files. prompts/analyzer_default.md is for open tickets (“what’s unclear?”). prompts/analyzer_completed.md is for closed tickets (retrospective analysis). The tool detects the ticket’s status and selects the right prompt automatically. It’s a small touch, but it means the output is always contextually appropriate.
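The selection logic is a few lines. Roughly this… the status names are assumptions, since your Jira workflow states may differ:

COMPLETED_STATUSES = ["Done", "Closed", "Resolved"].freeze

def prompt_file(ticket_status)
  if COMPLETED_STATUSES.include?(ticket_status)
    "prompts/analyzer_completed.md" # retrospective analysis
  else
    "prompts/analyzer_default.md"   # "what's unclear?"
  end
end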
The MCP Surprise
I didn’t expect this part to become the most useful thing in the whole project.
The tool started as a CLI experiment. Run bin/clarion analyze WR-123 in your terminal, get output, copy what’s useful. It worked, but there was friction. You had to switch contexts… look at the Jira ticket, jump out of your editor to a terminal, and remember the command syntax.
Having spent a bunch of time recently in Claude Code, I wondered… could we bring this analysis directly into the editor? I think it took me less than two hours to go from “I wonder if this could be an MCP server” to “oh wow, it’s actually working”.
I quickly found the mcp gem, which implements Anthropic’s Model Context Protocol. MCP lets you expose a tool as a server that Claude Code can call directly. Here’s what the server looks like:
class McpServer
  def initialize(namespace: nil, working_directory: Dir.pwd)
    @namespace = namespace
    @working_directory = working_directory
    @client = resolve_client
  end

  def run
    server = build_server
    transport = MCP::Server::Transports::StdioTransport.new(server)
    transport.open
  end

  private

  def build_server
    MCP::Server.new(
      name: "clarion",
      version: Clarion::VERSION,
      tools: [Mcp::AnalyzeTool.build(@client)]
    )
  end
end
The MCP tool itself is built dynamically. The tool description is baked in with the client’s namespace and ticket prefix at startup time, so Claude Code knows exactly what it can do:
module AnalyzeTool
  def self.build(client)
    tool = Class.new(MCP::Tool) do
      tool_name "analyze_ticket"
      description "Analyze a Jira ticket and suggest clarifying questions " \
                  "and acceptance criteria. Scoped to client '#{client.namespace}' " \
                  "(ticket prefix: #{client.ticket_prefix})."
      input_schema(
        properties: {
          ticket_key: {
            type: "string",
            description: "Jira ticket ID (e.g., #{client.ticket_prefix}-123)"
          }
        },
        required: ["ticket_key"]
      )
    end
    # ... wire up call, validation, and analysis methods
    tool
  end
end
Each MCP server instance is scoped to a single client namespace. When an engineer is working in a client’s repository, they drop a small JSON config file at the repo root:
{
  "mcpServers": {
    "clarion": {
      "command": "/path/to/clarion/bin/clarion-mcp",
      "args": ["--namespace=waystar"]
    }
  }
}
The bin/clarion-mcp wrapper is a one-liner. It sets the working directory, then delegates:
#!/bin/bash
cd "$(dirname "$0")/.."
exec bundle exec ruby -Ilib bin/clarion mcp "$@"
Now they can ask Claude Code to “analyze WR-123” and get the full analysis inline. Clarifying questions. Suggested acceptance criteria. Edge cases. Implementation notes. All without leaving their editor.
Auto-detection from git remote. If the client’s repo is configured in clients.yml with its github_repos, you can even skip the --namespace flag. The server shells out to git remote get-url origin, parses the owner/repo slug, and looks it up automatically.
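The detection itself is a shell-out and a regex. Roughly this… a sketch, with the real parsing handling a few more URL shapes:

def detect_repo_slug
  url = `git remote get-url origin`.strip
  # Matches both git@github.com:owner/repo.git and
  # https://github.com/owner/repo
  url[%r{github\.com[:/](.+?)(?:\.git)?\z}, 1]
end

detect_repo_slug # => "planetargon/waystar-web", then looked up in clients.yml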
One gotcha worth mentioning: TTY output breaks MCP’s stdio transport. All those nice spinners and progress bars and colored output that make the CLI experience polished? They corrupt the MCP response stream. I had to suppress stdout during MCP calls:
def run_analysis(key)
  config = AnalyzerConfig.build(key, result_formatter: PlainTextFormatter.new)
  original_stdout = $stdout
  $stdout = File.open(File::NULL, "w")
  begin
    Analyzer.new(config).analyze
  ensure
    $stdout = original_stdout
  end
end
Small thing, but it would have been confusing to debug without knowing to look for it. We also have a separate PlainTextFormatter that outputs clean text for MCP, versus the ResultFormatter that uses colored boxes and unicode for the CLI.
Where It Gets Really Interesting: MCP in Combination
Clarion as an MCP server is useful on its own. But the thing that got me excited was running it alongside other MCP servers in the same Claude Code session.
Our engineers can have Clarion (our embedded project history), the Atlassian MCP (live read/write access to Jira and Confluence), and the GitHub MCP all connected at once. That combination opens up workflows none of these tools could do alone:
Analysis to action without context-switching. Ask Clarion to analyze a ticket. It surfaces related historical context and suggests clarifying questions. Review the suggestions, adjust the wording, then use the Atlassian MCP to post a comment directly on the Jira ticket. All within Claude Code. The loop from “what should we ask?” to “we asked it” closes in a single session.
Breaking down epics. This is one we’re actively exploring. Point Clarion at an Epic, and it can pull in context from how similar large efforts were structured in the past. What the subtask breakdown looked like. What got missed. Where scope crept. Use that context to draft a breakdown into smaller tickets with clear acceptance criteria on each one. Then use the Atlassian MCP to create those subtasks in Jira, already populated with suggested AC. That’s different from asking a generic LLM to decompose an epic. It’s referencing how this team on this project has handled similar work before.
Cross-source research. An engineer can ask “what do we know about how authentication works in this project?” and get results from Jira tickets where auth bugs were fixed, Confluence pages documenting the auth flow, and GitHub PRs where the auth code was changed. All from one query, all scoped to that single client. With the GitHub MCP also connected, they can then inspect the actual current code to verify whether those docs are still accurate.
Pre-development discovery. Before an engineer, or an AI coding agent, starts building, the ticket should be clear. Clarion sits at that boundary: after the client describes what they want, before anyone writes code. The suggested questions aren’t generic. They’re informed by the specific history of this project. “Last time we did a PDF export on this project, Safari caused problems” is more useful than “have you considered browser compatibility?”.
Multi-Tenant Scoping: The Hard Constraint
One constraint that shaped everything: Planet Argon uses a single Atlassian account across most of our client projects (some clients own their own Atlassian accounts). Same Jira instance, same Confluence instance, one set of API credentials.
That means data isolation has to be enforced in our code, not by infrastructure boundaries. Every operation requires an explicit client namespace. The vector store uses that namespace to partition data. One Pinecone index. Many isolated namespaces. Ticket IDs are validated against the expected prefix before any analysis runs.
Granted, our engineers do have access to reference different clients at the same time in their Atlassian account, but the tool itself is always scoped to one client per run. That’s the important part.
The validation happens at multiple layers. In the CLI:
def validate_ticket_id!(ticket_id)
  return if ticket_id =~ /^[A-Z]+-\d+$/
  raise Thor::Error, "Invalid ticket ID format. Expected: PROJECT-123"
end
And again in the MCP tool, where it also checks the prefix matches the scoped client:
def validate_ticket_prefix!(key)
  unless key.match?(/^[A-Z]+-\d+$/)
    raise ArgumentError, "Invalid ticket ID format: #{key}. Expected: PROJECT-123"
  end

  prefix = key.split("-").first
  return if prefix == scoped_client.ticket_prefix

  raise ClientScopeError,
        "Ticket #{key} does not belong to client " \
        "'#{scoped_client.namespace}' (expected prefix: #{scoped_client.ticket_prefix})"
end
If you’re working in the waystar namespace and try to analyze PP-123, you get a clear error: "Ticket PP-123 does not belong to client 'waystar' (expected prefix: WR)". Not results from the wrong client.
It’s a simple system. Namespaces and prefix checks. Again, engineers technically have access to all clients’ data in Atlassian, but the tool enforces discipline. You have to be intentional about which client’s context you’re working in. We don’t want someone accidentally running an analysis against the wrong client’s project and making assumptions based on irrelevant history.
What’s Next
We’re looking at other LLMs. The ruby-openai gem handles everything we need today, but things are moving fast.
Atlassian is building AI features into Jira and Confluence, and some of that will overlap with what we’ve built. But Atlassian’s tooling only knows about what’s inside Atlassian. It can’t see GitHub repos, PR histories, or how past implementations actually played out in code. Our tool bridges that gap — context across all three systems, shaped by how we work.
Our team is also experimenting more with LLM-assisted code generation. But this tool sits deliberately upstream of that. It’s about the collaboration layer. Making sure what we’re about to build is well-understood before anyone writes code. A perfectly generated pull request against a vague ticket is still a miss.
We’ll probably open source this eventually, but the codebase is full of references to real client projects in tests and config. Scrubbing that is on the list. Not the priority right now.
If you’re thinking about building something like this… just start. Ruby has what you need. The gems are there. It’s more approachable than it looks from the outside.