# Document Ingestion

Document ingestion lets you process files -- PDFs, Markdown, and plaintext -- into a knowledge graph. PlanOpticon extracts text from documents, chunks it into manageable pieces, runs LLM-powered entity and relationship extraction, and stores the results in a FalkorDB knowledge graph. This is the same knowledge graph format produced by video analysis, so you can combine video and document insights in a single graph.

## Supported formats

| Extension | Processor | Description |
|-----------|-----------|-------------|
| `.pdf` | `PdfProcessor` | Extracts text page by page using pymupdf or pdfplumber |
| `.md`, `.markdown` | `MarkdownProcessor` | Splits on headings into sections |
| `.txt`, `.text`, `.log`, `.csv` | `PlaintextProcessor` | Splits on paragraph boundaries |

Additional formats can be added by implementing the `DocumentProcessor` base class and registering it (see [Extending with custom processors](#extending-with-custom-processors) below).

## CLI usage

### `planopticon ingest`

```
planopticon ingest INPUT_PATH [OPTIONS]
```

**Arguments:**

| Argument | Description |
|----------|-------------|
| `INPUT_PATH` | Path to a file or directory to ingest (must exist) |

**Options:**

| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--output` | `-o` | Current directory | Output directory for the knowledge graph |
| `--db-path` | | None | Path to an existing `knowledge_graph.db` to merge into |
| `--recursive / --no-recursive` | `-r` | `--recursive` | Recurse into subdirectories (directory ingestion only) |
| `--provider` | `-p` | `auto` | LLM provider for entity extraction (`openai`, `anthropic`, `gemini`, `ollama`, `azure`, `together`, `fireworks`, `cerebras`, `xai`) |
| `--chat-model` | | None | Override the model used for LLM entity extraction |

### Single file ingestion

Process a single document and create a new knowledge graph:

```bash
planopticon ingest spec.md
```

This creates `knowledge_graph.db` and `knowledge_graph.json` in the current directory.

Specify an output directory:

```bash
planopticon ingest report.pdf -o ./results
```

This creates `./results/knowledge_graph.db` and `./results/knowledge_graph.json`.

### Directory ingestion

Process all supported files in a directory:

```bash
planopticon ingest ./docs/
```

By default, this recurses into subdirectories. To process only the top-level directory:

```bash
planopticon ingest ./docs/ --no-recursive
```

PlanOpticon automatically filters for supported file extensions. Unsupported files are silently skipped.

### Merging into an existing knowledge graph

To add document content to an existing knowledge graph (e.g., one created from video analysis), use `--db-path`:

```bash
# First, analyze a video
planopticon analyze meeting.mp4 -o ./results

# Then, ingest supplementary documents into the same graph
planopticon ingest ./meeting-notes/ --db-path ./results/knowledge_graph.db
```

The ingested entities and relationships are merged with the existing graph. Duplicate entities are consolidated automatically by the knowledge graph engine.

### Choosing an LLM provider

Entity and relationship extraction requires an LLM. By default, PlanOpticon auto-detects available providers based on your environment variables. You can override this:

```bash
# Use Anthropic for extraction
planopticon ingest docs/ -p anthropic

# Use a specific model
planopticon ingest docs/ -p openai --chat-model gpt-4o

# Use a local Ollama model
planopticon ingest docs/ -p ollama --chat-model llama3
```

### Output

After ingestion, PlanOpticon prints a summary:

```
Knowledge graph: ./knowledge_graph.db
spec.md: 12 chunks
architecture.md: 8 chunks
requirements.txt: 3 chunks

Ingestion complete:
  Files processed: 3
  Total chunks: 23
  Entities extracted: 47
  Relationships: 31
  Knowledge graph: ./knowledge_graph.db
```

Both `.db` (SQLite/FalkorDB) and `.json` formats are saved automatically.

## How each processor works

### PDF processor

The `PdfProcessor` extracts text from PDF files on a per-page basis. It tries two extraction libraries in order:

1. **pymupdf** (preferred) -- Fast, reliable text extraction. Install with `pip install pymupdf`.
2. **pdfplumber** (fallback) -- Alternative extractor. Install with `pip install pdfplumber`.

If neither library is installed, the processor raises an `ImportError` with installation instructions.

Each page becomes a separate `DocumentChunk` with:

- `text`: The extracted text content of the page
- `page`: The 1-based page number
- `metadata.extraction_method`: Which library was used (`pymupdf` or `pdfplumber`)

To install PDF support:

```bash
pip install 'planopticon[pdf]'
# or
pip install pymupdf
# or
pip install pdfplumber
```
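
The backend selection above boils down to "first installed library wins". A minimal sketch of that logic (the helper name `pick_pdf_backend` and the `installed` override are illustrative, not part of PlanOpticon's API):

```python
import importlib.util

# Preference order for PDF text extraction: pymupdf first, then pdfplumber.
# pymupdf's import name is "fitz".
PDF_BACKENDS = [("fitz", "pymupdf"), ("pdfplumber", "pdfplumber")]


def pick_pdf_backend(installed=None):
    """Return the name of the extraction backend that would be used.

    `installed` overrides the real import check so the selection logic
    can be demonstrated without either library present.
    """
    for module, label in PDF_BACKENDS:
        if installed is not None:
            present = module in installed
        else:
            present = importlib.util.find_spec(module) is not None
        if present:
            return label
    raise ImportError(
        "PDF support requires pymupdf or pdfplumber: pip install pymupdf"
    )
```

The backend chosen here corresponds to the `metadata.extraction_method` value recorded on each chunk.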

### Markdown processor

The `MarkdownProcessor` splits Markdown files on heading boundaries (lines starting with `#` through `######`). Each heading and its content until the next heading becomes a separate chunk.

**Splitting behavior:**

- If the file contains headings, each heading section becomes a chunk. The `section` field records the heading text.
- Content before the first heading is captured as a `(preamble)` chunk.
- If the file contains no headings, it falls back to paragraph-based chunking (same as plaintext).

For example, a file with this structure:

```markdown
Some intro text.

# Architecture

The system uses a microservices architecture...

## Components

There are three main components...

# Deployment

Deployment is handled via...
```

Produces four chunks: `(preamble)`, `Architecture`, `Components`, and `Deployment`.
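
The splitting rules above can be sketched in a few lines. `split_markdown` is a hypothetical helper, not the library's actual code, and details such as how the real processor treats empty sections may differ:

```python
import re

# ATX headings: one to six '#' characters followed by the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown(text: str) -> list[tuple[str, str]]:
    """Split markdown into (section, body) pairs; content before the
    first heading is collected under the '(preamble)' section."""
    sections: list[tuple[str, str]] = []
    current, lines = "(preamble)", []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            if "".join(lines).strip():  # flush the previous section
                sections.append((current, "\n".join(lines).strip()))
            current, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if "".join(lines).strip():
        sections.append((current, "\n".join(lines).strip()))
    return sections
```

Running this over the example file above yields the same four sections, with each section name becoming the chunk's `section` field.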

### Plaintext processor

The `PlaintextProcessor` handles `.txt`, `.text`, `.log`, and `.csv` files. It splits text on paragraph boundaries (double newlines) and groups paragraphs into chunks with a configurable maximum size.

**Chunking parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `max_chunk_size` | 2000 characters | Maximum size of each chunk |
| `overlap` | 200 characters | Number of characters from the end of one chunk to repeat at the start of the next |

The overlap ensures that entities or context spanning a paragraph boundary are not lost. Chunks are created by accumulating paragraphs until the next paragraph would exceed `max_chunk_size`, at which point the current chunk is flushed and a new one begins.
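
That accumulate-and-flush loop can be sketched as follows, using the defaults from the table above (`chunk_paragraphs` is illustrative, not the library's implementation):

```python
def chunk_paragraphs(text: str, max_chunk_size: int = 2000,
                     overlap: int = 200) -> list[str]:
    """Group paragraphs into chunks, carrying `overlap` trailing
    characters of each flushed chunk into the start of the next."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = (current + "\n\n" + para) if current else para
        if current and len(candidate) > max_chunk_size:
            chunks.append(current)  # flush the full chunk
            # seed the next chunk with the tail of the previous one
            current = current[-overlap:] + "\n\n" + para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With three 900-character paragraphs and the defaults, the first two paragraphs fit in one chunk and the third starts a new chunk seeded with the previous chunk's last 200 characters.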

## The ingestion pipeline

Document ingestion follows this pipeline:

```
File on disk
      |
      v
Processor selection (by file extension)
      |
      v
Text extraction (PDF pages / Markdown sections / plaintext paragraphs)
      |
      v
DocumentChunk objects (text + metadata)
      |
      v
Source registration (provenance tracking in the KG)
      |
      v
KG content addition (LLM entity/relationship extraction per chunk)
      |
      v
Knowledge graph storage (.db + .json)
```

### Step 1: Processor selection

PlanOpticon maintains a registry of processors keyed by file extension. When you call `ingest_file()`, it looks up the appropriate processor using `get_processor(path)`. If no processor is registered for the file extension, a `ValueError` is raised.
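
Conceptually, the registry is a mapping from lowercase extension to processor class. A minimal sketch (the names `_REGISTRY`, `register`, and `lookup` are illustrative; the real registry presumably lives in `video_processor.processors.base` alongside `register_processor` and `get_processor`):

```python
from pathlib import Path

# extension -> processor class, e.g. {".pdf": PdfProcessor, ...}
_REGISTRY: dict[str, type] = {}


def register(extensions: list[str], processor_cls: type) -> None:
    """Register a processor class for each extension (case-insensitive)."""
    for ext in extensions:
        _REGISTRY[ext.lower()] = processor_cls


def lookup(path: Path):
    """Instantiate the processor for a path, or raise ValueError."""
    processor_cls = _REGISTRY.get(path.suffix.lower())
    if processor_cls is None:
        raise ValueError(f"No processor registered for {path.suffix!r}")
    return processor_cls()
```

Keying on the lowercased suffix is what makes `report.PDF` and `report.pdf` resolve to the same processor.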

### Step 2: Text extraction

The selected processor reads the file and produces a list of `DocumentChunk` objects. Each chunk contains:

| Field | Type | Description |
|-------|------|-------------|
| `text` | `str` | The extracted text content |
| `source_file` | `str` | Path to the source file |
| `chunk_index` | `int` | Sequential index of this chunk within the file |
| `page` | `Optional[int]` | Page number (PDF only, 1-based) |
| `section` | `Optional[str]` | Section heading (Markdown only) |
| `metadata` | `Dict[str, Any]` | Additional metadata (e.g., extraction method) |
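
The field table maps naturally onto a dataclass; the following is a sketch of the shape (the real class definition may differ in ordering and defaults):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class DocumentChunk:
    text: str                     # extracted text content
    source_file: str              # path to the source file
    chunk_index: int              # sequential index within the file
    page: Optional[int] = None    # PDFs only, 1-based
    section: Optional[str] = None # Markdown only
    metadata: Dict[str, Any] = field(default_factory=dict)
```

Only `text`, `source_file`, and `chunk_index` are always populated; `page` and `section` are mutually exclusive in practice, since a chunk comes from exactly one processor.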

### Step 3: Source registration

Each ingested file is registered as a source in the knowledge graph with provenance metadata:

- `source_id`: A SHA-256 hash of the absolute file path (first 12 characters), unless you provide a custom ID
- `source_type`: Always `"document"`
- `title`: The file stem (filename without extension)
- `path`: The file path
- `mime_type`: Detected MIME type
- `ingested_at`: ISO-8601 timestamp
- `metadata`: Chunk count and file extension
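
The default `source_id` derivation can be reproduced in a few lines (the UTF-8 encoding of the path is an assumption; the hash-then-truncate scheme is as documented above):

```python
import hashlib
from pathlib import Path


def default_source_id(path: Path) -> str:
    """First 12 hex characters of the SHA-256 of the absolute path."""
    absolute = str(path.resolve())
    return hashlib.sha256(absolute.encode("utf-8")).hexdigest()[:12]
```

Because the ID depends only on the absolute path, re-ingesting the same file maps to the same source, while the same filename in a different directory gets a distinct ID.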

### Step 4: Entity and relationship extraction

Each chunk's text is passed to `knowledge_graph.add_content()`, which uses the configured LLM provider to extract entities and relationships. The content source is tagged with the document name and either the page number or section name:

- `document:report.pdf:page:3`
- `document:spec.md:section:Architecture`
- `document:notes.txt` (no page or section)
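
The tag format is mechanical enough to sketch (`content_source_tag` is a hypothetical helper; in practice a chunk carries either a page or a section, never both):

```python
from typing import Optional


def content_source_tag(filename: str, page: Optional[int] = None,
                       section: Optional[str] = None) -> str:
    """Build the provenance tag attached to a chunk's extracted content."""
    tag = f"document:{filename}"
    if page is not None:
        return f"{tag}:page:{page}"
    if section is not None:
        return f"{tag}:section:{section}"
    return tag
```

These tags are what let you trace an entity in the graph back to the exact page or section it came from.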

### Step 5: Storage

The knowledge graph is saved in both `.db` (SQLite-backed FalkorDB) and `.json` formats.

## Combining with video analysis

A common workflow is to analyze a video recording and then ingest related documents into the same knowledge graph:

```bash
# Step 1: Analyze the meeting recording
planopticon analyze meeting-recording.mp4 -o ./project-kg

# Step 2: Ingest the meeting agenda
planopticon ingest agenda.md --db-path ./project-kg/knowledge_graph.db

# Step 3: Ingest the project spec
planopticon ingest project-spec.pdf --db-path ./project-kg/knowledge_graph.db

# Step 4: Ingest a whole docs folder
planopticon ingest ./reference-docs/ --db-path ./project-kg/knowledge_graph.db

# Step 5: Query the combined graph
planopticon query --db-path ./project-kg/knowledge_graph.db
```

The resulting knowledge graph contains entities and relationships from all sources -- video transcripts, meeting agendas, specs, and reference documents -- with full provenance tracking so you can trace any entity back to its source.

## Python API

### Ingesting a single file

```python
from pathlib import Path
from video_processor.integrators.knowledge_graph import KnowledgeGraph
from video_processor.processors.ingest import ingest_file

kg = KnowledgeGraph(db_path=Path("knowledge_graph.db"))
chunk_count = ingest_file(Path("document.pdf"), kg)
print(f"Processed {chunk_count} chunks")

kg.save(Path("knowledge_graph.db"))
```

### Ingesting a directory

```python
from pathlib import Path
from video_processor.integrators.knowledge_graph import KnowledgeGraph
from video_processor.processors.ingest import ingest_directory

kg = KnowledgeGraph(db_path=Path("knowledge_graph.db"))
results = ingest_directory(
    Path("./docs"),
    kg,
    recursive=True,
    extensions=[".md", ".pdf"],  # Optional: filter by extension
)

for filepath, chunks in results.items():
    print(f" {filepath}: {chunks} chunks")

kg.save(Path("knowledge_graph.db"))
```

### Listing supported extensions

```python
from video_processor.processors.base import list_supported_extensions

extensions = list_supported_extensions()
print(extensions)
# ['.csv', '.log', '.markdown', '.md', '.pdf', '.text', '.txt']
```

### Working with processors directly

```python
from pathlib import Path
from video_processor.processors.base import get_processor

processor = get_processor(Path("report.pdf"))
if processor:
    chunks = processor.process(Path("report.pdf"))
    for chunk in chunks:
        print(f"Page {chunk.page}: {chunk.text[:100]}...")
```

## Extending with custom processors

To add support for a new file format, implement the `DocumentProcessor` abstract class and register it:

```python
from pathlib import Path
from typing import List
from video_processor.processors.base import (
    DocumentChunk,
    DocumentProcessor,
    register_processor,
)


class HtmlProcessor(DocumentProcessor):
    supported_extensions = [".html", ".htm"]

    def can_process(self, path: Path) -> bool:
        return path.suffix.lower() in self.supported_extensions

    def process(self, path: Path) -> List[DocumentChunk]:
        from bs4 import BeautifulSoup

        soup = BeautifulSoup(path.read_text(), "html.parser")
        text = soup.get_text(separator="\n")
        return [
            DocumentChunk(
                text=text,
                source_file=str(path),
                chunk_index=0,
            )
        ]


register_processor(HtmlProcessor.supported_extensions, HtmlProcessor)
```

After registration, `planopticon ingest` will automatically handle `.html` and `.htm` files.

## Companion REPL

Inside the interactive companion REPL, you can ingest files using the `/ingest` command:

```
> /ingest ./meeting-notes.md
Ingested meeting-notes.md: 5 chunks
```

This adds content to the currently loaded knowledge graph.

## Common workflows

### Build a project knowledge base from scratch

```bash
# Ingest all project docs
planopticon ingest ./project-docs/ -o ./knowledge-base

# Query what was captured
planopticon query --db-path ./knowledge-base/knowledge_graph.db

# Export as an Obsidian vault
planopticon export obsidian ./knowledge-base/knowledge_graph.db -o ./vault
```

### Incrementally build a knowledge graph

```bash
# Start with initial docs
planopticon ingest ./sprint-1-docs/ -o ./kg

# Add more docs over time
planopticon ingest ./sprint-2-docs/ --db-path ./kg/knowledge_graph.db
planopticon ingest ./sprint-3-docs/ --db-path ./kg/knowledge_graph.db

# The graph grows with each ingestion
planopticon query --db-path ./kg/knowledge_graph.db stats
```

### Ingest from Google Workspace or Microsoft 365

PlanOpticon provides integrated commands that fetch cloud documents and ingest them in one step:

```bash
# Google Workspace
planopticon gws ingest --folder-id FOLDER_ID -o ./results

# Microsoft 365 / SharePoint
planopticon m365 ingest --web-url https://contoso.sharepoint.com/sites/proj \
  --folder-url /sites/proj/Shared\ Documents
```

These commands handle authentication, document download, text extraction, and knowledge graph creation automatically.
