PlanOpticon
Document Ingestion
Document ingestion lets you process files -- PDFs, Markdown, and plaintext -- into a knowledge graph. PlanOpticon extracts text from documents, chunks it into manageable pieces, runs LLM-powered entity and relationship extraction, and stores the results in a FalkorDB knowledge graph. This is the same knowledge graph format produced by video analysis, so you can combine video and document insights in a single graph.
Supported formats
| Extension | Processor | Description |
|---|---|---|
| .pdf | PdfProcessor | Extracts text page by page using pymupdf or pdfplumber |
| .md, .markdown | MarkdownProcessor | Splits on headings into sections |
| .txt, .text, .log, .csv | PlaintextProcessor | Splits on paragraph boundaries |
Additional formats can be added by implementing the DocumentProcessor base class and registering it (see Extending with custom processors below).
CLI usage
planopticon ingest
planopticon ingest INPUT_PATH [OPTIONS]
Arguments:
| Argument | Description |
|---|---|
| INPUT_PATH | Path to a file or directory to ingest (must exist) |
Options:
| Option | Short | Default | Description |
|---|---|---|---|
| --output | -o | Current directory | Output directory for the knowledge graph |
| --db-path | None | None | Path to an existing knowledge_graph.db to merge into |
| --recursive / --no-recursive | -r | --recursive | Recurse into subdirectories (directory ingestion only) |
| --provider | -p | auto | LLM provider for entity extraction (openai, anthropic, gemini, ollama, azure, together, fireworks, cerebras, xai) |
| --chat-model | None | None | Override the model used for LLM entity extraction |
Single file ingestion
Process a single document and create a new knowledge graph:
planopticon ingest spec.md
This creates knowledge_graph.db and knowledge_graph.json in the current directory.
Specify an output directory:
planopticon ingest report.pdf -o ./results
This creates ./results/knowledge_graph.db and ./results/knowledge_graph.json.
Directory ingestion
Process all supported files in a directory:
planopticon ingest ./docs/
By default, this recurses into subdirectories. To process only the top-level directory:
planopticon ingest ./docs/ --no-recursive
PlanOpticon automatically filters for supported file extensions. Unsupported files are silently skipped.
Merging into an existing knowledge graph
To add document content to an existing knowledge graph (e.g., one created from video analysis), use --db-path:
# First, analyze a video
planopticon analyze meeting.mp4 -o ./results
# Then, ingest supplementary documents into the same graph
planopticon ingest ./meeting-notes/ --db-path ./results/knowledge_graph.db
The ingested entities and relationships are merged with the existing graph. Duplicate entities are consolidated automatically by the knowledge graph engine.
Choosing an LLM provider
Entity and relationship extraction requires an LLM. By default, PlanOpticon auto-detects available providers based on your environment variables. You can override this:
# Use Anthropic for extraction
planopticon ingest docs/ -p anthropic
# Use a specific model
planopticon ingest docs/ -p openai --chat-model gpt-4o
# Use a local Ollama model
planopticon ingest docs/ -p ollama --chat-model llama3
Output
After ingestion, PlanOpticon prints a summary:
Knowledge graph: ./knowledge_graph.db
spec.md: 12 chunks
architecture.md: 8 chunks
requirements.txt: 3 chunks
Ingestion complete:
Files processed: 3
Total chunks: 23
Entities extracted: 47
Relationships: 31
Knowledge graph: ./knowledge_graph.db
Both .db (SQLite/FalkorDB) and .json formats are saved automatically.
How each processor works
PDF processor
The PdfProcessor extracts text from PDF files on a per-page basis. It tries two extraction libraries in order:
- pymupdf (preferred) -- Fast, reliable text extraction. Install with pip install pymupdf.
- pdfplumber (fallback) -- Alternative extractor. Install with pip install pdfplumber.
If neither library is installed, the processor raises an ImportError with installation instructions.
Each page becomes a separate DocumentChunk with:
- text: The extracted text content of the page
- page: The 1-based page number
- metadata.extraction_method: Which library was used (pymupdf or pdfplumber)
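The try-one-then-the-other library selection can be sketched as follows. This is an illustrative pattern, not the library's actual code; the function name select_pdf_backend is hypothetical.

```python
def select_pdf_backend():
    """Try pymupdf first, then pdfplumber; raise if neither is installed."""
    try:
        import fitz  # noqa: F401 -- pymupdf's import name
        return "pymupdf"
    except ImportError:
        pass
    try:
        import pdfplumber  # noqa: F401
        return "pdfplumber"
    except ImportError:
        raise ImportError(
            "PDF support requires pymupdf or pdfplumber: "
            "pip install pymupdf (or) pip install pdfplumber"
        )
```

Recording which backend succeeded is what lets each chunk carry its metadata.extraction_method value.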
To install PDF support:
pip install 'planopticon[pdf]'
# or
pip install pymupdf
# or
pip install pdfplumber
Markdown processor
The MarkdownProcessor splits Markdown files on heading boundaries (lines starting with # through ######). Each heading and its content until the next heading becomes a separate chunk.
Splitting behavior:
- If the file contains headings, each heading section becomes a chunk. The section field records the heading text.
- Content before the first heading is captured as a (preamble) chunk.
- If the file contains no headings, it falls back to paragraph-based chunking (same as plaintext).
For example, a file with this structure:
Some intro text.
# Architecture
The system uses a microservices architecture...
## Components
There are three main components...
# Deployment
Deployment is handled via...
Produces four chunks: (preamble), Architecture, Components, and Deployment.
Plaintext processor
The PlaintextProcessor handles .txt, .text, .log, and .csv files. It splits text on paragraph boundaries (double newlines) and groups paragraphs into chunks with a configurable maximum size.
Chunking parameters:
| Parameter | Default | Description |
|---|---|---|
| max_chunk_size | 2000 characters | Maximum size of each chunk |
| overlap | 200 characters | Number of characters from the end of one chunk to repeat at the start of the next |
The overlap ensures that entities or context that span a paragraph boundary are not lost. Chunks are created by accumulating paragraphs until the next paragraph would exceed max_chunk_size, at which point the current chunk is flushed and a new one begins.
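The accumulate-flush-overlap behavior can be sketched as follows. This is a simplified illustration under the defaults above; the real PlaintextProcessor also attaches chunk metadata.

```python
def chunk_paragraphs(text, max_chunk_size=2000, overlap=200):
    """Split on blank lines, then pack paragraphs into size-bounded chunks."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = (current + "\n\n" + para) if current else para
        if current and len(candidate) > max_chunk_size:
            chunks.append(current)
            # Carry the tail of the flushed chunk into the next one so that
            # context spanning the boundary is preserved.
            current = current[-overlap:] + "\n\n" + para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Note that a single paragraph longer than max_chunk_size still becomes its own chunk; the limit bounds when accumulation stops, not the size of one paragraph.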
The ingestion pipeline
Document ingestion follows this pipeline:
File on disk
|
v
Processor selection (by file extension)
|
v
Text extraction (PDF pages / Markdown sections / plaintext paragraphs)
|
v
DocumentChunk objects (text + metadata)
|
v
Source registration (provenance tracking in the KG)
|
v
KG content addition (LLM entity/relationship extraction per chunk)
|
v
Knowledge graph storage (.db + .json)
Step 1: Processor selection
PlanOpticon maintains a registry of processors keyed by file extension. When you call ingest_file(), it looks up the appropriate processor using get_processor(path). If no processor is registered for the file extension, a ValueError is raised.
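An extension-keyed registry like this can be sketched in a few lines. The names register and lookup here are hypothetical stand-ins, simplified from the register_processor/get_processor API described in this document.

```python
from pathlib import Path

# Hypothetical registry mapping lowercase extensions to processor classes.
_PROCESSORS = {}

def register(extensions, processor_cls):
    for ext in extensions:
        _PROCESSORS[ext.lower()] = processor_cls

def lookup(path):
    """Return a processor instance for the path, or None if unregistered."""
    cls = _PROCESSORS.get(Path(path).suffix.lower())
    return cls() if cls is not None else None

class TextProcessor:
    """Stand-in processor class for the demo."""

register([".txt", ".log"], TextProcessor)
```

In the real pipeline, ingest_file() turns a failed lookup into a ValueError so unsupported single-file ingestion fails loudly, while directory ingestion simply skips unregistered extensions.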
Step 2: Text extraction
The selected processor reads the file and produces a list of DocumentChunk objects. Each chunk contains:
| Field | Type | Description |
|---|---|---|
| text | str | The extracted text content |
| source_file | str | Path to the source file |
| chunk_index | int | Sequential index of this chunk within the file |
| page | Optional[int] | Page number (PDF only, 1-based) |
| section | Optional[str] | Section heading (Markdown only) |
| metadata | Dict[str, Any] | Additional metadata (e.g., extraction method) |
Step 3: Source registration
Each ingested file is registered as a source in the knowledge graph with provenance metadata:
- source_id: A SHA-256 hash of the absolute file path (first 12 characters), unless you provide a custom ID
- source_type: Always "document"
- title: The file stem (filename without extension)
- path: The file path
- mime_type: Detected MIME type
- ingested_at: ISO-8601 timestamp
- metadata: Chunk count and file extension
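The source-ID scheme can be reproduced in a couple of lines. This is a sketch of the derivation described above, not the library's own function; the name make_source_id is hypothetical.

```python
import hashlib
from pathlib import Path

def make_source_id(path):
    """Derive a stable 12-character source ID from the absolute file path."""
    absolute = str(Path(path).resolve())
    return hashlib.sha256(absolute.encode("utf-8")).hexdigest()[:12]
```

Because the ID is derived from the absolute path, re-ingesting the same file maps to the same source, while the same filename in a different directory gets a distinct ID.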
Step 4: Entity and relationship extraction
Each chunk's text is passed to knowledge_graph.add_content(), which uses the configured LLM provider to extract entities and relationships. The content source is tagged with the document name and either the page number or section name:
- document:report.pdf:page:3
- document:spec.md:section:Architecture
- document:notes.txt (no page or section)
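A small helper illustrating the tag format (hypothetical code; only the document:name:page/section layout comes from the examples above):

```python
def content_source_tag(filename, page=None, section=None):
    """Build a provenance tag for a chunk, preferring page over section."""
    if page is not None:
        return f"document:{filename}:page:{page}"
    if section is not None:
        return f"document:{filename}:section:{section}"
    return f"document:{filename}"
```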
Step 5: Storage
The knowledge graph is saved in both .db (SQLite-backed FalkorDB) and .json formats.
Combining with video analysis
A common workflow is to analyze a video recording and then ingest related documents into the same knowledge graph:
# Step 1: Analyze the meeting recording
planopticon analyze meeting-recording.mp4 -o ./project-kg
# Step 2: Ingest the meeting agenda
planopticon ingest agenda.md --db-path ./project-kg/knowledge_graph.db
# Step 3: Ingest the project spec
planopticon ingest project-spec.pdf --db-path ./project-kg/knowledge_graph.db
# Step 4: Ingest a whole docs folder
planopticon ingest ./reference-docs/ --db-path ./project-kg/knowledge_graph.db
# Step 5: Query the combined graph
planopticon query --db-path ./project-kg/knowledge_graph.db
The resulting knowledge graph contains entities and relationships from all sources -- video transcripts, meeting agendas, specs, and reference documents -- with full provenance tracking so you can trace any entity back to its source.
Python API
Ingesting a single file
from pathlib import Path
from video_processor.integrators.knowledge_graph import KnowledgeGraph
from video_processor.processors.ingest import ingest_file
kg = KnowledgeGraph(db_path=Path("knowledge_graph.db"))
chunk_count = ingest_file(Path("document.pdf"), kg)
print(f"Processed {chunk_count} chunks")
kg.save(Path("knowledge_graph.db"))
Ingesting a directory
from pathlib import Path
from video_processor.integrators.knowledge_graph import KnowledgeGraph
from video_processor.processors.ingest import ingest_directory
kg = KnowledgeGraph(db_path=Path("knowledge_graph.db"))
results = ingest_directory(
Path("./docs"),
kg,
recursive=True,
extensions=[".md", ".pdf"], # Optional: filter by extension
)
for filepath, chunks in results.items():
print(f" {filepath}: {chunks} chunks")
kg.save(Path("knowledge_graph.db"))
Listing supported extensions
from video_processor.processors.base import list_supported_extensions
extensions = list_supported_extensions()
print(extensions)
# ['.csv', '.log', '.markdown', '.md', '.pdf', '.text', '.txt']
Working with processors directly
from pathlib import Path
from video_processor.processors.base import get_processor
processor = get_processor(Path("report.pdf"))
if processor:
chunks = processor.process(Path("report.pdf"))
for chunk in chunks:
print(f"Page {chunk.page}: {chunk.text[:100]}...")
Extending with custom processors
To add support for a new file format, implement the DocumentProcessor abstract class and register it:
from pathlib import Path
from typing import List
from video_processor.processors.base import (
DocumentChunk,
DocumentProcessor,
register_processor,
)
class HtmlProcessor(DocumentProcessor):
supported_extensions = [".html", ".htm"]
def can_process(self, path: Path) -> bool:
return path.suffix.lower() in self.supported_extensions
def process(self, path: Path) -> List[DocumentChunk]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(path.read_text(), "html.parser")
text = soup.get_text(separator="\n")
return [
DocumentChunk(
text=text,
source_file=str(path),
chunk_index=0,
)
]
register_processor(HtmlProcessor.supported_extensions, HtmlProcessor)
After registration, planopticon ingest will automatically handle .html and .htm files.
Companion REPL
Inside the interactive companion REPL, you can ingest files using the /ingest command:
> /ingest ./meeting-notes.md
Ingested meeting-notes.md: 5 chunks
This adds content to the currently loaded knowledge graph.
Common workflows
Build a project knowledge base from scratch
# Ingest all project docs
planopticon ingest ./project-docs/ -o ./knowledge-base
# Query what was captured
planopticon query --db-path ./knowledge-base/knowledge_graph.db
# Export as an Obsidian vault
planopticon export obsidian ./knowledge-base/knowledge_graph.db -o ./vault
Incrementally build a knowledge graph
# Start with initial docs
planopticon ingest ./sprint-1-docs/ -o ./kg
# Add more docs over time
planopticon ingest ./sprint-2-docs/ --db-path ./kg/knowledge_graph.db
planopticon ingest ./sprint-3-docs/ --db-path ./kg/knowledge_graph.db
# The graph grows with each ingestion
planopticon query --db-path ./kg/knowledge_graph.db stats
Ingest from Google Workspace or Microsoft 365
PlanOpticon provides integrated commands that fetch cloud documents and ingest them in one step:
# Google Workspace
planopticon gws ingest --folder-id FOLDER_ID -o ./results
# Microsoft 365 / SharePoint
planopticon m365 ingest --web-url https://contoso.sharepoint.com/sites/proj \
--folder-url /sites/proj/Shared\ Documents
These commands handle authentication, document download, text extraction, and knowledge graph creation automatically.