
Ingestion Guide

DocBrain needs documents to answer questions. This guide walks you through connecting your document sources — Confluence, GitHub, or local files.

How Ingestion Works

When you run ingestion, DocBrain:

  1. Fetches documents from your configured source
  2. Converts them to Markdown (from HTML, Confluence storage format, etc.)
  3. Extracts cross-document references — URLs to GitHub PRs, GitLab MRs, Jira tickets, Confluence pages, and other linked resources are identified, classified, and stored as a reference graph in PostgreSQL
  4. Chunks them using heading-aware splitting (preserves semantic coherence)
  5. Embeds each chunk into vectors using your configured embedding provider
  6. Indexes the vectors in OpenSearch for hybrid search (k-NN + BM25), with referenced document IDs attached to each chunk for enrichment at query time
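
The heading-aware splitting in step 4 can be sketched as follows. This is an illustrative model only — the function name and size threshold here are assumptions, not DocBrain's actual implementation:

```python
def split_by_headings(markdown: str, max_chars: int = 1500) -> list[str]:
    """Split Markdown into chunks, starting a new chunk at each heading.

    Headings never end up mid-chunk, so each chunk stays semantically
    coherent — the property the pipeline relies on.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for line in markdown.splitlines():
        starts_section = line.lstrip().startswith("#")
        # Start a new chunk at a heading, or when the current one is full.
        if current and (starts_section or size + len(line) > max_chars):
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line) + 1
    if current:
        chunks.append("\n".join(current))
    return chunks
```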

After ingestion, you can immediately start asking questions. DocBrain cites sources in every answer, linking back to the original document.

Quick Reference

Configure sources in config/local.yaml (gitignored). Put only infrastructure secrets in .env.

| Source | ingest_sources value | What You Need |
| --- | --- | --- |
| Local files | local | A directory of .md or .txt files |
| Confluence | confluence | Atlassian URL, email, API token, space keys |
| GitHub | github | Repository URL, optional token for private repos |
| GitHub PRs | github_pr | GitHub token, owner/repo |
| GitLab MRs | gitlab_mr | GitLab token, project path |
| Slack threads | slack_thread | Slack bot token, channel IDs |
| Jira | jira | Jira URL, email, API token, project keys |

Option 1: Local Files (Default)

The simplest option. Point DocBrain at a folder of Markdown or text files.

Setup

Add to config/local.yaml:

# config/local.yaml
ingest:
  ingest_sources: local

And set the path in .env (it's a filesystem path, not a secret, but it's deployment-specific):

LOCAL_DOCS_PATH=/data/docs

By default, Docker Compose mounts ./examples/sample-docs to /data/docs — so DocBrain works out of the box with the included sample documents.

Using Your Own Files

Option A: Edit the volume mount in docker-compose.yml:

volumes:
  - /absolute/path/to/your/docs:/data/docs:ro

Option B: Copy files into the sample-docs directory:

cp -r ~/my-docs/* examples/sample-docs/

Run Ingestion

docker compose exec server docbrain-ingest

Supported File Types

  • .md — Markdown (recommended)
  • .txt — Plain text

Verify

# Ask a question about your docs
docker compose exec server docbrain-cli ask "What is in my documentation?"

Option 2: Confluence

Connect DocBrain to your Atlassian Confluence instance. DocBrain fetches pages from the spaces you specify, converts Confluence storage format to Markdown, and indexes everything.

Step 1: Create a Confluence API Token

  1. Go to https://id.atlassian.com/manage-profile/security/api-tokens
  2. Click Create API token
  3. Give it a label (e.g. "DocBrain")
  4. Copy the token — you won't see it again

Step 2: Find Your Space Keys

Space keys are the short identifiers for your Confluence spaces. You can find them in the URL:

https://yourcompany.atlassian.net/wiki/spaces/ENG/pages/...
                                              ^^^
                                              This is the space key

Common examples: ENG, DOCS, OPS, PLATFORM

Step 3: Configure config/local.yaml

# config/local.yaml — never committed (gitignored)
ingest:
  ingest_sources: confluence

confluence:
  base_url: https://yourcompany.atlassian.net/wiki
  user_email: you@yourcompany.com
  api_token: your-api-token-here
  space_keys: ENG,DOCS

Multiple spaces: Separate with commas: ENG,DOCS,OPS

Limiting pages: By default, DocBrain ingests all pages in each space. To cap the number of pages per space (useful for testing), add:

confluence:
  page_limit: 100   # 0 = unlimited (default)

Step 4: Run Ingestion

# Restart the server to pick up the new config
docker compose restart server

# Run ingestion
docker compose exec server docbrain-ingest

You'll see output like:

Fetching pages from space ENG... 47 pages found
Fetching pages from space DOCS... 123 pages found
Converting 170 pages to Markdown...
Chunking... 892 chunks created
Generating embeddings... done
Indexing in OpenSearch... done
Ingestion complete: 170 pages, 892 chunks

Step 5: Verify

docker compose exec server docbrain-cli ask "What are our deployment procedures?"

The answer should cite your Confluence pages with links back to the originals.

Self-Hosted Confluence (Data Center)

DocBrain also supports self-hosted Confluence Data Center 7.x+ instances:

# config/local.yaml
confluence:
  api_version: v1
  base_url: https://confluence.yourcompany.com
  api_token: your-personal-access-token
  space_keys: ENG,DOCS

Creating a Personal Access Token (Data Center):

  1. Log in to your Confluence Data Center instance
  2. Go to your profile (top-right) > Settings > Personal Access Tokens
  3. Click Create token, give it a name (e.g. "DocBrain"), and copy the token

Cloud vs. Data Center configuration at a glance:

| Setting | Cloud | Self-Hosted (Data Center) |
| --- | --- | --- |
| CONFLUENCE_API_VERSION | v2 (default) | v1 |
| CONFLUENCE_BASE_URL | https://yourco.atlassian.net/wiki | https://confluence.yourco.com |
| CONFLUENCE_USER_EMAIL | Atlassian account email | Not required |
| CONFLUENCE_API_TOKEN | API token from Atlassian | Personal Access Token (Bearer auth) |

If your instance uses a self-signed certificate or an internal CA that Docker doesn't trust, disable TLS verification:

# config/local.yaml
confluence:
  tls_verify: false

Everything else works identically — same space keys, same page limit, same webhook sync, same image extraction.

Permissions

The API token inherits the Confluence permissions of the user account. DocBrain can only access pages that user can read. For broad access, use a service account with read permissions across your target spaces.

Troubleshooting

| Problem | Cause | Fix |
| --- | --- | --- |
| 401 Unauthorized | Wrong email or token | Double-check email matches the Atlassian account that created the token |
| 404 Not Found | Wrong base URL | Use https://yourco.atlassian.net/wiki (must include /wiki) |
| 0 pages found | Wrong space key | Check the URL of your Confluence space for the correct key |
| Timeout on large spaces | Too many pages | This is normal for 500+ page spaces — ingestion continues in the background |

Option 3: GitHub Repository

Ingest documentation from a GitHub repository. DocBrain clones the repo, finds Markdown and text files, and indexes them.

Setup

# config/local.yaml
ingest:
  ingest_sources: github

github:
  repo_url: https://github.com/your-org/your-docs-repo
  branch: main

For private repositories, add a personal access token:

github:
  token: ghp_your_token_here

Creating a GitHub Token (for private repos)

  1. Go to https://github.com/settings/tokens
  2. Click Generate new token (classic)
  3. Select scope: repo (for private repos) or public_repo (for public repos only)
  4. Copy the token

Run Ingestion

docker compose restart server
docker compose exec server docbrain-ingest

What Gets Ingested

DocBrain ingests all .md and .txt files in the repository. It respects directory structure and uses file paths as metadata for source citations.
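
The file discovery described above can be modeled like this. A simplified sketch — the extension list matches the doc, but the function name and metadata shape are illustrative assumptions:

```python
from pathlib import Path

SUPPORTED = {".md", ".txt"}

def discover_docs(repo_root: str) -> list[dict]:
    """Walk a cloned repo and collect documentation files, keeping the
    relative path as metadata for source citations."""
    root = Path(repo_root)
    docs = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            docs.append({
                "source_path": str(path.relative_to(root)),  # cited in answers
                "text": path.read_text(encoding="utf-8", errors="replace"),
            })
    return docs
```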

Monorepo?

If your docs are in a subdirectory of a larger repo, DocBrain still ingests the whole repo but filters for documentation files. Future versions will support path filtering.


Image Extraction (Confluence)

When ingesting from Confluence, DocBrain automatically downloads images (diagrams, screenshots, flowcharts) from each page and uses a vision-capable LLM to generate detailed descriptions. These descriptions are injected into the document content and indexed alongside the text — making image content searchable and available for Q&A.

This is enabled by default. No extra configuration needed if your LLM provider supports vision.

How It Works

  1. During page processing, DocBrain extracts image references from the HTML
  2. Downloads each image attachment from the Confluence API
  3. Sends the image to the configured LLM's vision endpoint
  4. Injects the description into the Markdown before chunking

Which Providers Support Vision?

| Provider | Vision Support | Notes |
| --- | --- | --- |
| AWS Bedrock | Yes | Uses Claude's native vision via Messages API |
| Anthropic | Yes | Uses Claude's native vision via Messages API |
| OpenAI | Yes | Uses GPT-4o vision via Chat Completions API |
| Ollama | Depends on model | Vision models (llava, llama3.2-vision, moondream) work. Text-only models (llama3.1) are auto-detected on first call — images are skipped with a warning, no failures. |

Guardrails

| Guardrail | Value | Reason |
| --- | --- | --- |
| Max images per page | 20 | Prevent runaway LLM costs on image-heavy pages |
| Min image size | 5KB | Skip icons, avatars, decorative images |
| Max image size | 10MB | Skip huge files |
| Allowed types | png, jpeg, gif, webp | Skip PDFs, ZIPs, videos |
| Timeout per image | 30s | Don't block the pipeline |
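
Together, these guardrails amount to a simple pre-filter applied before any image is sent to the LLM. A sketch (the function name and attachment shape are illustrative; the thresholds come from the table above):

```python
ALLOWED_TYPES = {"image/png", "image/jpeg", "image/gif", "image/webp"}
MIN_BYTES = 5 * 1024          # skip icons, avatars, decorative images
MAX_BYTES = 10 * 1024 * 1024  # skip huge files
MAX_IMAGES_PER_PAGE = 20      # cap LLM cost on image-heavy pages

def select_images(attachments: list[dict]) -> list[dict]:
    """Apply the size/type guardrails, then cap the count per page."""
    kept = [
        a for a in attachments
        if a["content_type"] in ALLOWED_TYPES
        and MIN_BYTES <= a["size_bytes"] <= MAX_BYTES
    ]
    return kept[:MAX_IMAGES_PER_PAGE]
```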

Disabling Image Extraction

IMAGE_EXTRACTION_ENABLED=false

When disabled, images get a [Image: filename.png] placeholder in the text (the pre-existing behavior). You can re-enable later and re-ingest to pick up image descriptions.

Cost

Image descriptions use the HAIKU_MODEL_ID model if set (recommended for cost efficiency); otherwise they fall back to LLM_MODEL_ID. With Claude Haiku, expect ~$0.001 per image. A full ingestion of 1000 pages with ~3 images each costs roughly $3.


Real-Time Sync: Confluence Webhooks

By default, DocBrain ingests documents when you run docbrain-ingest manually or on a cron schedule. But if you want pages to sync automatically the moment they're created, updated, or deleted in Confluence, enable webhook integration.

What It Does

| Confluence Event | DocBrain Action |
| --- | --- |
| page_created | Fetches the new page, chunks it, embeds it, indexes it |
| page_updated | Deletes old chunks, re-fetches, re-chunks, re-indexes |
| page_restored | Same as created |
| page_removed / page_trashed | Deletes the page's chunks from OpenSearch and marks it deleted in PostgreSQL |

All processing happens asynchronously — DocBrain returns 200 OK to Confluence immediately and syncs in the background.
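
This ack-then-process pattern can be sketched with a queue and a background worker. A generic illustration of the asynchronous handling described above, not DocBrain's code:

```python
import queue
import threading

events: queue.Queue = queue.Queue()
processed: list[str] = []

def handle_webhook(event: dict) -> int:
    """Enqueue the event and return immediately — the HTTP handler
    answers 200 OK without waiting for the sync to finish."""
    events.put(event)
    return 200  # returned to Confluence right away

def worker() -> None:
    """Background loop that does the actual fetch/chunk/index work."""
    while True:
        event = events.get()
        if event is None:  # sentinel to stop the worker
            break
        processed.append(event["page_id"])  # stand-in for re-indexing
        events.task_done()

threading.Thread(target=worker, daemon=True).start()
```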

Step 1: Generate a Webhook Secret

Pick a strong random string. This secret is shared between Confluence and DocBrain for HMAC-SHA256 signature verification.

# Generate a random secret
openssl rand -hex 32

Step 2: Configure DocBrain

Set the webhook secret as an environment variable (it's a runtime secret injected by the environment):

# .env — webhook secret only
CONFLUENCE_WEBHOOK_SECRET=your-generated-secret-here

Confluence credentials must also be set in config/local.yaml (DocBrain needs API access to fetch page content when a webhook fires):

# config/local.yaml
confluence:
  base_url: https://yourcompany.atlassian.net/wiki
  api_token: your-api-token
  user_email: you@yourcompany.com

Restart the server. You should see:

[startup] Confluence webhook integration enabled

If you see CONFLUENCE_WEBHOOK_SECRET set but missing CONFLUENCE_BASE_URL/API_TOKEN — webhook sync disabled, check that both CONFLUENCE_BASE_URL and CONFLUENCE_API_TOKEN are set.

Step 3: Configure the Webhook in Confluence

Confluence Cloud

  1. Go to your Confluence instance → Settings (gear icon) → Webhooks (under "Atlassian Admin" → find your site)
  2. Or use the Atlassian admin: https://admin.atlassian.com → your site → Settings → Webhooks
  3. Click Create webhook
  4. Configure:

| Field | Value |
| --- | --- |
| URL | https://<your-docbrain-domain>/confluence/events |
| Secret | The same secret you set in CONFLUENCE_WEBHOOK_SECRET |
| Events | Select: page_created, page_updated, page_removed, page_trashed, page_restored |

  5. Save and activate the webhook.

Important: The URL must be HTTPS and publicly reachable from Atlassian's servers. If DocBrain runs behind a firewall, you'll need an ingress or tunnel (e.g., ngrok for testing, or a proper reverse proxy in production).

Confluence Data Center (Self-Hosted)

  1. Go to Administration → Further Configuration → Webhooks (or install the Webhook plugin if not available)
  2. Create a webhook with the same URL and secret as above
  3. Select the page events you want to track

Step 4: Verify

Create or edit a page in Confluence. Within a few seconds, check the DocBrain server logs:

[confluence] Processing page_updated for page 'My Test Page' (id=12345)
[confluence] Updated page 'My Test Page' — 8 chunks re-indexed

Then ask a question about the content you just changed:

docbrain-cli ask "What did I just write about?"

The answer should reflect the latest content.

Security

  • Every incoming webhook is verified using HMAC-SHA256 with the shared secret
  • The signature is checked via the X-Hub-Signature: sha256=<hex> header
  • Constant-time comparison prevents timing attacks
  • Request body is limited to 1MB
  • If verification fails, DocBrain returns 401 Unauthorized and ignores the event
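
The verification described above — HMAC-SHA256 over the raw request body, checked with a constant-time comparison — can be sketched as follows. This is a generic implementation of the scheme, not DocBrain's source:

```python
import hashlib
import hmac

def verify_signature(secret: str, body: bytes, header: str) -> bool:
    """Check an 'X-Hub-Signature: sha256=<hex>' header against the body.

    hmac.compare_digest is constant-time, which is what prevents
    timing attacks on the signature check.
    """
    if not header.startswith("sha256="):
        return False
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header[len("sha256="):])
```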

Webhooks vs. Scheduled Ingest

| | Webhooks | Scheduled Ingest (docbrain-ingest) |
| --- | --- | --- |
| Latency | Seconds after page edit | Hours (depends on cron interval) |
| Scope | Single page per event | All pages in configured spaces |
| Use case | Real-time sync for active teams | Bulk initial load, catch-up, re-indexing |
| Requirements | Public HTTPS URL, Confluence webhook config | Just a cron schedule |

Recommendation: Use both. Run scheduled ingest as a daily safety net (catches anything webhooks might miss — network blips, downtime), and use webhooks for real-time updates.


Real-Time Capture: @docbrain capture and /docbrain capture

DocBrain supports on-demand capture from GitHub PRs/issues, GitLab MRs, and Slack threads. Capture only ingests the thread into the knowledge base — it does not generate a Q&A reply. After capture, the content is immediately searchable via /docbrain ask (Slack) or the API.

What Capture Does

| Platform | Trigger | What's indexed | Reply |
| --- | --- | --- | --- |
| GitHub | Comment @docbrain capture on any PR or issue | PR/issue description + all comments | Posts a reply comment confirming capture |
| GitLab | Comment @docbrain capture on any MR | MR title, description, all human discussion notes | Posts a reply note confirming capture |
| Slack | Run /docbrain capture inside a thread | All thread messages, user names resolved | Posts a message in the thread confirming capture |

Capture is separate from /docbrain ask (Slack) or @docbrain ask (GitHub/GitLab) — those are Q&A commands that answer questions from the knowledge base.


GitHub PR/Issue Capture

Comment @docbrain capture on any GitHub pull request or issue to immediately index the discussion.

Requirements: GitHub webhook configured to send issue_comment and pull_request_review_comment events to DocBrain.

Setup

GITHUB_CAPTURE_WEBHOOK_SECRET=your-webhook-secret   # generate with: openssl rand -hex 32
GITHUB_CAPTURE_TOKEN=ghp_...                         # Personal access token with repo:read scope

Optional access control (recommended for shared installations):

GITHUB_CAPTURE_ALLOWED_REPOS=myorg/backend,myorg/frontend  # Only these repos can trigger capture
GITHUB_CAPTURE_ALLOWED_USERS=alice,bob                      # Only these users can trigger capture

Register the Webhook in GitHub

  1. Go to your repository: Settings → Webhooks → Add webhook
  2. Fill in:
     • Payload URL: https://your-docbrain-host/api/v1/github/events
     • Content type: application/json
     • Secret: same value as GITHUB_CAPTURE_WEBHOOK_SECRET
     • Events: select Issue comments and Pull request review comments
  3. Save

What Gets Indexed

  • Issue/PR title, description, and all comments
  • Cross-document references (URLs to other PRs, Jira tickets, Confluence pages, etc.) are extracted and stored in the reference graph
  • Threads over 500KB are skipped (DocBrain posts a reply explaining the limit)
  • Threads under 200 characters are skipped as too short
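
The two size gates can be expressed as a single check. The thresholds come from the list above; the function itself is illustrative:

```python
MAX_THREAD_BYTES = 500 * 1024  # larger threads are skipped (reply explains why)
MIN_THREAD_CHARS = 200         # shorter threads are skipped as too short

def thread_capture_decision(text: str) -> str:
    """Decide whether a thread is worth capturing, per the size gates."""
    if len(text.encode("utf-8")) > MAX_THREAD_BYTES:
        return "skip: too large"
    if len(text) < MIN_THREAD_CHARS:
        return "skip: too short"
    return "capture"
```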

Reply Behavior

On success, DocBrain posts a comment:

✅ Captured by DocBrain — 12 chunks indexed and immediately searchable.
This thread will feed Autopilot's next gap analysis run.

On failure:

⚠️ Capture failed: <error message>

Security and Access Control

  • All incoming webhooks are verified via HMAC-SHA256 (X-Hub-Signature-256 header)
  • GITHUB_CAPTURE_ALLOWED_REPOS — restrict to specific owner/repo pairs
  • GITHUB_CAPTURE_ALLOWED_USERS — restrict to specific GitHub usernames
  • Empty allowlists = all users and repos can trigger capture (acceptable for private org webhooks)
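
The allowlist semantics — empty means allow-all — can be sketched like this (function names are illustrative):

```python
def is_allowed(value: str, allowlist_csv: str) -> bool:
    """True if the value is in the comma-separated allowlist, or if
    the allowlist is empty (empty = everyone/everything allowed)."""
    allowlist = {v.strip() for v in allowlist_csv.split(",") if v.strip()}
    return not allowlist or value in allowlist

def capture_permitted(repo: str, user: str,
                      allowed_repos: str = "", allowed_users: str = "") -> bool:
    """Both checks must pass for the capture to run."""
    return is_allowed(repo, allowed_repos) and is_allowed(user, allowed_users)
```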

GitLab MR Capture

Comment @docbrain capture on any GitLab merge request to immediately index the full MR discussion.

Requirements: GITLAB_CAPTURE_WEBHOOK_SECRET and GITLAB_CAPTURE_TOKEN configured, webhook registered in GitLab.

Step 1: Configure DocBrain

GITLAB_CAPTURE_WEBHOOK_SECRET=your-webhook-secret   # generate with: openssl rand -hex 32
GITLAB_CAPTURE_TOKEN=glpat-...                       # Personal access token with api scope
GITLAB_CAPTURE_BASE_URL=https://gitlab.com           # Default; set for self-hosted GitLab

Optional allowlists (recommended for shared instances):

GITLAB_CAPTURE_ALLOWED_USERS=alice,bob          # Only these users can trigger capture
GITLAB_CAPTURE_ALLOWED_PROJECTS=myorg/myrepo    # Only these projects can trigger capture

Step 2: Register the Webhook in GitLab

  1. Go to your project: Settings → Webhooks
  2. Fill in:
     • URL: https://your-docbrain-host/api/v1/gitlab/events
     • Secret token: same value as GITLAB_CAPTURE_WEBHOOK_SECRET
     • Trigger: enable Comments
  3. Click Add webhook

Step 3: Test It

Open any merge request and add a comment containing @docbrain capture. Within a few seconds, DocBrain replies with a note on the MR:

✅ Captured by DocBrain — 12 chunks indexed and immediately searchable.
This MR will feed Autopilot's next gap analysis run.

What Gets Indexed

  • MR title and description
  • All human discussion notes (system notes — merge events, label changes, approval events — are excluded)
  • Cross-document references — URLs and GitLab shorthand references (!123, #123) are extracted, resolved, and stored in the reference graph
  • Threads over 500KB are skipped silently (too large for the embedding pipeline)

Reply Behavior

  • On success: DocBrain posts a note confirming the chunk count
  • On failure: DocBrain posts ⚠️ Capture failed: <error>
  • Replies require GITLAB_CAPTURE_TOKEN to be set (token is also used to fetch MR notes)
  • The allowlist check is applied to the commenter (the user who wrote @docbrain capture), not the MR author

Security and Access Control

  • All incoming webhooks are verified via the X-Gitlab-Token header (constant-time comparison)
  • GITLAB_CAPTURE_ALLOWED_USERS — restrict to specific GitLab usernames (the commenter, not the MR author)
  • GITLAB_CAPTURE_ALLOWED_PROJECTS — restrict to specific project paths (e.g. myorg/myrepo)
  • If no allowlists are configured, any user in any project can trigger capture — consider setting GITLAB_CAPTURE_ALLOWED_PROJECTS at minimum

Slack Thread Capture

Run /docbrain capture inside any Slack thread to immediately index the conversation.

Note: /docbrain capture only ingests the thread. Use /docbrain ask <question> separately to query the knowledge base.

Setup

Ensure the Slack bot is installed and SLACK_BOT_TOKEN is configured. The bot needs channels:history and users:read OAuth scopes.

Usage

  1. Open a Slack thread with a substantive discussion
  2. Run /docbrain capture inside the thread (not on a top-level message)
  3. DocBrain fetches all messages, resolves user names, and indexes the conversation

Within ~15 seconds, DocBrain posts back in the thread:

✅ Thread from #platform-incidents captured into DocBrain (8 chunks indexed).
It's now searchable and will be used by Autopilot's next gap analysis.

Access Control

By default, any user in any channel can run /docbrain capture. Restrict access with:

SLACK_CAPTURE_ALLOWED_CHANNELS=platform-team,infra-review  # channel names (no #) or IDs
SLACK_CAPTURE_ALLOWED_USERS=alice,U01234567                 # usernames or user IDs

  • Channel check matches against both channel_name and channel_id
  • User check matches against both user_name and user_id
  • If rejected, DocBrain responds with an ephemeral message: ⚠️ You don't have permission to use /docbrain capture in this channel.

What Gets Indexed

  • All thread messages with resolved display names and timestamps
  • Cross-document references (URLs to PRs, tickets, Confluence pages, etc.) are extracted from messages and stored in the reference graph
  • Threads under 200 characters are skipped as too short
  • The thread is immediately searchable after capture

Capture Lifecycle and Freshness

Space Assignment

Captured content is stored under a meaningful space name derived from the source:

| Source | Space assigned |
| --- | --- |
| GitHub PR/issue | owner/repo (e.g., myorg/backend) |
| GitLab MR | group/project (e.g., platform/api) |
| Slack thread | Channel name (e.g., platform-incidents) |

This means allowed_spaces ACL filtering works as expected — setting allowed_spaces: ["platform-incidents"] on an API key will correctly scope answers to Slack captures from that channel, GitHub captures from a matching repo, etc.
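
A sketch of the space derivation (the function name and metadata keys are illustrative assumptions):

```python
def assign_space(source_type: str, meta: dict) -> str:
    """Derive the space name a capture is stored under."""
    if source_type == "github":
        return f"{meta['owner']}/{meta['repo']}"  # e.g., myorg/backend
    if source_type == "gitlab":
        return meta["project_path"]               # e.g., platform/api
    if source_type == "slack":
        return meta["channel_name"]               # e.g., platform-incidents
    raise ValueError(f"unknown source: {source_type}")
```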

Staleness and Time Decay

Unlike incident records (Jira, PagerDuty, Zendesk), which are permanent historical events, captured content decays with age. A GitHub PR discussing an architecture from 5 years ago, or a Slack thread about a since-replaced system, should score low in freshness — not be treated as always-current.

  • The freshness scorer uses the original content creation date (when the PR/MR was opened, when the Slack thread started) as the age baseline — not the time DocBrain captured it.
  • Captures age through the standard time-decay curve: a 2-year-old architectural discussion will score significantly lower freshness than a recent one, which reduces its weight in RAG retrieval and Autopilot gap analysis.
  • Re-capturing the same thread (running /docbrain capture again on the same PR or Slack thread) updates the content but preserves the original creation date as the age baseline.

This ensures that outdated design decisions, replaced architectures, or deprecated processes are progressively de-emphasized in answers as they age — without ever being deleted (the historical record is preserved for explicit search).
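
The freshness behavior can be modeled with exponential decay over content age. The half-life here is an assumption chosen for illustration — DocBrain's actual curve and parameters may differ:

```python
import math

def freshness_score(age_days: float, half_life_days: float = 365.0) -> float:
    """Exponential time decay: the score halves every half_life_days.

    Key point from the doc: age is measured from the original creation
    date (PR opened / thread started), not from when it was captured.
    """
    return math.exp(-math.log(2) * age_days / half_life_days)
```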


Re-Ingestion and Updates

Updating Documents

Run ingestion again to pick up changes:

docker compose exec server docbrain-ingest

DocBrain uses upsert logic — new and changed documents are updated, unchanged documents are skipped. This is safe to run repeatedly.
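
One common way to implement such upsert logic is content hashing — a document is skipped when its hash matches what was indexed last time. A sketch under that assumption (DocBrain's actual change-detection mechanism may differ):

```python
import hashlib

def plan_upserts(docs: dict[str, str], indexed_hashes: dict[str, str]) -> dict:
    """Decide per document whether to index, update, or skip."""
    plan = {"new": [], "changed": [], "unchanged": []}
    for doc_id, text in docs.items():
        h = hashlib.sha256(text.encode()).hexdigest()
        if doc_id not in indexed_hashes:
            plan["new"].append(doc_id)
        elif indexed_hashes[doc_id] != h:
            plan["changed"].append(doc_id)    # old chunks replaced
        else:
            plan["unchanged"].append(doc_id)  # skipped, nothing re-embedded
    return plan
```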

Scheduled Ingestion (Kubernetes)

The Helm chart includes a CronJob that runs ingestion on a schedule:

# In values.yaml
ingest:
  schedule: "0 */6 * * *"  # every 6 hours

Full Re-Index

If you change your embedding provider (e.g., from Ollama to OpenAI), you need a full re-index because embedding dimensions differ between providers. Delete the OpenSearch index and re-ingest:

# Delete the existing index
curl -X DELETE http://localhost:9200/docbrain_chunks

# Re-ingest everything
docker compose exec server docbrain-ingest

Multiple Sources

DocBrain supports ingesting from multiple sources simultaneously. Set ingest_sources in config/local.yaml to a comma-separated list of sources, and configure credentials for each:

# config/local.yaml
ingest:
  ingest_sources: confluence,github_pr,jira

confluence:
  base_url: https://acme.atlassian.net/wiki
  user_email: you@acme.com
  api_token: ATATT3x...
  space_keys: DOCS,ENG

github_pr:
  token: ghp_...
  repo: acme/platform

jira_ingest:
  base_url: https://acme.atlassian.net
  user_email: you@acme.com
  api_token: your-jira-token
  projects: ENG,OPS

Then run a single ingestion pass to pull from all sources:

docker compose exec server docbrain-ingest

Documents from different sources coexist in the same index and are searched together.


Next Steps