# Document Ingestion

Document ingestion lets you process files -- PDFs, Markdown, and plaintext -- into a knowledge graph. PlanOpticon extracts text from documents, chunks it into manageable pieces, runs LLM-powered entity and relationship extraction, and stores the results in a FalkorDB knowledge graph. This is the same knowledge graph format produced by video analysis, so you can combine video and document insights in a single graph.

## Supported formats

| Extension | Processor | Description |
|-----------|-----------|-------------|
| `.pdf` | `PdfProcessor` | Extracts text page by page using pymupdf or pdfplumber |
| `.md`, `.markdown` | `MarkdownProcessor` | Splits on headings into sections |
| `.txt`, `.text`, `.log`, `.csv` | `PlaintextProcessor` | Splits on paragraph boundaries |

Additional formats can be added by subclassing `DocumentProcessor` and registering the subclass (see [Extending with custom processors](#extending-with-custom-processors) below).

## CLI usage

### `planopticon ingest`

```
planopticon ingest INPUT_PATH [OPTIONS]
```

**Arguments:**

| Argument | Description |
|----------|-------------|
| `INPUT_PATH` | Path to a file or directory to ingest (must exist) |

**Options:**

| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--output` | `-o` | Current directory | Output directory for the knowledge graph |
| `--db-path` | | None | Path to an existing `knowledge_graph.db` to merge into |
| `--recursive / --no-recursive` | `-r` | `--recursive` | Recurse into subdirectories (directory ingestion only) |
| `--provider` | `-p` | `auto` | LLM provider for entity extraction (`openai`, `anthropic`, `gemini`, `ollama`, `azure`, `together`, `fireworks`, `cerebras`, `xai`) |
| `--chat-model` | | None | Override the model used for entity extraction |

### Single file ingestion

Process a single document and create a new knowledge graph:

```bash
planopticon ingest spec.md
```

This creates `knowledge_graph.db` and `knowledge_graph.json` in the current directory.

Specify an output directory:

```bash
planopticon ingest report.pdf -o ./results
```

This creates `./results/knowledge_graph.db` and `./results/knowledge_graph.json`.

### Directory ingestion

Process all supported files in a directory:

```bash
planopticon ingest ./docs/
```

By default, this recurses into subdirectories. To process only the top-level directory:

```bash
planopticon ingest ./docs/ --no-recursive
```

PlanOpticon automatically filters for supported file extensions. Unsupported files are silently skipped.

### Merging into an existing knowledge graph

To add document content to an existing knowledge graph (e.g., one created from video analysis), use `--db-path`:

```bash
# First, analyze a video
planopticon analyze meeting.mp4 -o ./results

# Then, ingest supplementary documents into the same graph
planopticon ingest ./meeting-notes/ --db-path ./results/knowledge_graph.db
```

The ingested entities and relationships are merged with the existing graph. Duplicate entities are consolidated automatically by the knowledge graph engine.

### Choosing an LLM provider

Entity and relationship extraction requires an LLM. By default, PlanOpticon auto-detects available providers based on your environment variables. You can override this:

```bash
# Use Anthropic for extraction
planopticon ingest docs/ -p anthropic

# Use a specific model
planopticon ingest docs/ -p openai --chat-model gpt-4o

# Use a local Ollama model
planopticon ingest docs/ -p ollama --chat-model llama3
```

### Output

After ingestion, PlanOpticon prints a summary:

```
Knowledge graph: ./knowledge_graph.db
spec.md: 12 chunks
architecture.md: 8 chunks
requirements.txt: 3 chunks

Ingestion complete:
  Files processed: 3
  Total chunks: 23
  Entities extracted: 47
  Relationships: 31
  Knowledge graph: ./knowledge_graph.db
```

Both `.db` (SQLite-backed FalkorDB) and `.json` formats are saved automatically.

## How each processor works

### PDF processor

The `PdfProcessor` extracts text from PDF files on a per-page basis. It tries two extraction libraries in order:

1. **pymupdf** (preferred) -- Fast, reliable text extraction. Install with `pip install pymupdf`.
2. **pdfplumber** (fallback) -- Alternative extractor. Install with `pip install pdfplumber`.

If neither library is installed, the processor raises an `ImportError` with installation instructions.

Each page becomes a separate `DocumentChunk` with:

- `text`: The extracted text content of the page
- `page`: The 1-based page number
- `metadata.extraction_method`: Which library was used (`pymupdf` or `pdfplumber`)

To install PDF support:

```bash
pip install 'planopticon[pdf]'
# or
pip install pymupdf
# or
pip install pdfplumber
```
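The try-pymupdf-then-pdfplumber order described above can be sketched as a small standalone function. This is illustrative only, not the actual `PdfProcessor` source; `extract_pdf_pages` is a hypothetical name:

```python
# Sketch of the pymupdf-first, pdfplumber-fallback extraction order
# described above (illustrative; the real PdfProcessor may differ).
from pathlib import Path
from typing import List, Tuple


def extract_pdf_pages(path: Path) -> List[Tuple[int, str, str]]:
    """Return (page_number, text, extraction_method) tuples, 1-based pages."""
    try:
        import fitz  # pymupdf's import name

        with fitz.open(path) as doc:
            return [(i + 1, page.get_text(), "pymupdf")
                    for i, page in enumerate(doc)]
    except ImportError:
        pass
    try:
        import pdfplumber

        with pdfplumber.open(path) as pdf:
            return [(i + 1, page.extract_text() or "", "pdfplumber")
                    for i, page in enumerate(pdf.pages)]
    except ImportError:
        # Mirrors the ImportError-with-instructions behavior described above.
        raise ImportError(
            "PDF support requires pymupdf or pdfplumber: "
            "pip install pymupdf  # or: pip install pdfplumber"
        )
```

The per-page tuples map directly onto `DocumentChunk` fields: the 1-based page number becomes `page`, and the library name becomes `metadata.extraction_method`.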

### Markdown processor

The `MarkdownProcessor` splits Markdown files on heading boundaries (lines starting with `#` through `######`). Each heading and its content up to the next heading becomes a separate chunk.

**Splitting behavior:**

- If the file contains headings, each heading section becomes a chunk. The `section` field records the heading text.
- Content before the first heading is captured as a `(preamble)` chunk.
- If the file contains no headings, it falls back to paragraph-based chunking (same as plaintext).

For example, a file with this structure:

```markdown
Some intro text.

# Architecture

The system uses a microservices architecture...

## Components

There are three main components...

# Deployment

Deployment is handled via...
```

produces four chunks: `(preamble)`, `Architecture`, `Components`, and `Deployment`.
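The heading-splitting behavior above can be sketched in a few lines. This is a minimal illustration, not the `MarkdownProcessor` source; it skips heading sections with empty bodies, which the real processor may handle differently:

```python
# Illustrative sketch of heading-based Markdown splitting. Content before
# the first heading gets the "(preamble)" label, as described above.
import re
from typing import List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown(text: str) -> List[Tuple[str, str]]:
    """Return (section_title, body) pairs in document order."""
    sections: List[Tuple[str, str]] = []
    title, lines = "(preamble)", []
    for line in text.splitlines():
        m = HEADING.match(line)
        if m:
            if any(l.strip() for l in lines):
                sections.append((title, "\n".join(lines).strip()))
            title, lines = m.group(2).strip(), []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        sections.append((title, "\n".join(lines).strip()))
    return sections
```

Run against the example file above, this yields the four section titles `(preamble)`, `Architecture`, `Components`, and `Deployment`.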

### Plaintext processor

The `PlaintextProcessor` handles `.txt`, `.text`, `.log`, and `.csv` files. It splits text on paragraph boundaries (double newlines) and groups paragraphs into chunks with a configurable maximum size.

**Chunking parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `max_chunk_size` | 2000 characters | Maximum size of each chunk |
| `overlap` | 200 characters | Number of characters from the end of one chunk repeated at the start of the next |

The overlap ensures that entities or context spanning a paragraph boundary are not lost. Chunks are created by accumulating paragraphs until the next paragraph would exceed `max_chunk_size`, at which point the current chunk is flushed and a new one begins.
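That accumulate-and-flush loop can be sketched as follows. This is illustrative only (the actual `PlaintextProcessor` may differ), and for simplicity it does not split a single paragraph that is itself longer than `max_chunk_size`:

```python
# Sketch of paragraph-based chunking with overlap, per the parameters above.
from typing import List


def chunk_paragraphs(text: str, max_chunk_size: int = 2000,
                     overlap: int = 200) -> List[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        # Flush when adding the next paragraph would exceed the limit.
        if current and len(current) + len(para) + 2 > max_chunk_size:
            chunks.append(current)
            # Carry the tail of the flushed chunk forward as overlap so
            # context spanning the boundary is not lost.
            current = current[-overlap:] + "\n\n" + para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```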

## The ingestion pipeline

Document ingestion follows this pipeline:

```
File on disk
     |
     v
Processor selection (by file extension)
     |
     v
Text extraction (PDF pages / Markdown sections / plaintext paragraphs)
     |
     v
DocumentChunk objects (text + metadata)
     |
     v
Source registration (provenance tracking in the KG)
     |
     v
KG content addition (LLM entity/relationship extraction per chunk)
     |
     v
Knowledge graph storage (.db + .json)
```

### Step 1: Processor selection

PlanOpticon maintains a registry of processors keyed by file extension. When you call `ingest_file()`, it looks up the appropriate processor using `get_processor(path)`. If no processor is registered for the file's extension, a `ValueError` is raised.
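The lookup can be sketched as a small extension-keyed dictionary. The names `register` and `lookup` here are illustrative, not the actual PlanOpticon API:

```python
# Sketch of an extension-keyed processor registry mirroring the lookup
# described above (hypothetical names, not library API).
from pathlib import Path
from typing import Dict, Iterable

_REGISTRY: Dict[str, str] = {}


def register(extensions: Iterable[str], processor_name: str) -> None:
    # Store extensions lowercased so lookups are case-insensitive.
    for ext in extensions:
        _REGISTRY[ext.lower()] = processor_name


def lookup(path: Path) -> str:
    processor = _REGISTRY.get(path.suffix.lower())
    if processor is None:
        # Mirrors the ValueError described for unregistered extensions.
        raise ValueError(f"No processor registered for {path.suffix!r}")
    return processor


register([".pdf"], "PdfProcessor")
register([".md", ".markdown"], "MarkdownProcessor")
```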

### Step 2: Text extraction

The selected processor reads the file and produces a list of `DocumentChunk` objects. Each chunk contains:

| Field | Type | Description |
|-------|------|-------------|
| `text` | `str` | The extracted text content |
| `source_file` | `str` | Path to the source file |
| `chunk_index` | `int` | Sequential index of this chunk within the file |
| `page` | `Optional[int]` | Page number (PDF only, 1-based) |
| `section` | `Optional[str]` | Section heading (Markdown only) |
| `metadata` | `Dict[str, Any]` | Additional metadata (e.g., extraction method) |
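Assuming `DocumentChunk` follows the field table above, it can be modeled as a simple dataclass. This is a sketch; the real class lives in `video_processor.processors.base` and may carry additional behavior:

```python
# Sketch of the DocumentChunk shape described by the table above.
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class DocumentChunk:
    text: str
    source_file: str
    chunk_index: int
    page: Optional[int] = None       # PDF only, 1-based
    section: Optional[str] = None    # Markdown only
    metadata: Dict[str, Any] = field(default_factory=dict)
```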

### Step 3: Source registration

Each ingested file is registered as a source in the knowledge graph with provenance metadata:

- `source_id`: The first 12 characters of the SHA-256 hash of the absolute file path, unless you provide a custom ID
- `source_type`: Always `"document"`
- `title`: The file stem (filename without extension)
- `path`: The file path
- `mime_type`: Detected MIME type
- `ingested_at`: ISO-8601 timestamp
- `metadata`: Chunk count and file extension
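The `source_id` rule above can be sketched in a couple of lines. `make_source_id` is a hypothetical helper name, not PlanOpticon's API:

```python
# Sketch of the "first 12 hex chars of the SHA-256 of the absolute path"
# source_id rule described above (illustrative helper).
import hashlib
from pathlib import Path


def make_source_id(path: Path) -> str:
    absolute = str(path.resolve())
    return hashlib.sha256(absolute.encode("utf-8")).hexdigest()[:12]
```

Because the ID is derived from the absolute path, re-ingesting the same file yields the same `source_id`, which is what makes merge-style ingestion idempotent at the source level.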

### Step 4: Entity and relationship extraction

Each chunk's text is passed to `knowledge_graph.add_content()`, which uses the configured LLM provider to extract entities and relationships. The content source is tagged with the document name and either the page number or the section name:

- `document:report.pdf:page:3`
- `document:spec.md:section:Architecture`
- `document:notes.txt` (no page or section)
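The three tag formats above can be produced by a small helper like this (a hypothetical illustration; the tag strings themselves are taken from the formats listed above):

```python
# Sketch of building a content-source tag from chunk metadata.
from typing import Optional


def content_source_tag(filename: str, page: Optional[int] = None,
                       section: Optional[str] = None) -> str:
    if page is not None:
        return f"document:{filename}:page:{page}"
    if section is not None:
        return f"document:{filename}:section:{section}"
    return f"document:{filename}"
```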

### Step 5: Storage

The knowledge graph is saved in both `.db` (SQLite-backed FalkorDB) and `.json` formats.

## Combining with video analysis

A common workflow is to analyze a video recording and then ingest related documents into the same knowledge graph:

```bash
# Step 1: Analyze the meeting recording
planopticon analyze meeting-recording.mp4 -o ./project-kg

# Step 2: Ingest the meeting agenda
planopticon ingest agenda.md --db-path ./project-kg/knowledge_graph.db

# Step 3: Ingest the project spec
planopticon ingest project-spec.pdf --db-path ./project-kg/knowledge_graph.db

# Step 4: Ingest a whole docs folder
planopticon ingest ./reference-docs/ --db-path ./project-kg/knowledge_graph.db

# Step 5: Query the combined graph
planopticon query --db-path ./project-kg/knowledge_graph.db
```

The resulting knowledge graph contains entities and relationships from all sources -- video transcripts, meeting agendas, specs, and reference documents -- with full provenance tracking, so you can trace any entity back to its source.

## Python API

### Ingesting a single file

```python
from pathlib import Path
from video_processor.integrators.knowledge_graph import KnowledgeGraph
from video_processor.processors.ingest import ingest_file

kg = KnowledgeGraph(db_path=Path("knowledge_graph.db"))
chunk_count = ingest_file(Path("document.pdf"), kg)
print(f"Processed {chunk_count} chunks")

kg.save(Path("knowledge_graph.db"))
```

### Ingesting a directory

```python
from pathlib import Path
from video_processor.integrators.knowledge_graph import KnowledgeGraph
from video_processor.processors.ingest import ingest_directory

kg = KnowledgeGraph(db_path=Path("knowledge_graph.db"))
results = ingest_directory(
    Path("./docs"),
    kg,
    recursive=True,
    extensions=[".md", ".pdf"],  # Optional: filter by extension
)

for filepath, chunks in results.items():
    print(f"  {filepath}: {chunks} chunks")

kg.save(Path("knowledge_graph.db"))
```

### Listing supported extensions

```python
from video_processor.processors.base import list_supported_extensions

extensions = list_supported_extensions()
print(extensions)
# ['.csv', '.log', '.markdown', '.md', '.pdf', '.text', '.txt']
```

### Working with processors directly

```python
from pathlib import Path
from video_processor.processors.base import get_processor

processor = get_processor(Path("report.pdf"))
if processor:
    chunks = processor.process(Path("report.pdf"))
    for chunk in chunks:
        print(f"Page {chunk.page}: {chunk.text[:100]}...")
```

## Extending with custom processors

To add support for a new file format, subclass the `DocumentProcessor` abstract class and register it:

```python
from pathlib import Path
from typing import List

from video_processor.processors.base import (
    DocumentChunk,
    DocumentProcessor,
    register_processor,
)


class HtmlProcessor(DocumentProcessor):
    supported_extensions = [".html", ".htm"]

    def can_process(self, path: Path) -> bool:
        return path.suffix.lower() in self.supported_extensions

    def process(self, path: Path) -> List[DocumentChunk]:
        from bs4 import BeautifulSoup

        soup = BeautifulSoup(path.read_text(), "html.parser")
        text = soup.get_text(separator="\n")
        return [
            DocumentChunk(
                text=text,
                source_file=str(path),
                chunk_index=0,
            )
        ]


register_processor(HtmlProcessor.supported_extensions, HtmlProcessor)
```

After registration, `planopticon ingest` will automatically handle `.html` and `.htm` files.

## Companion REPL

Inside the interactive companion REPL, you can ingest files using the `/ingest` command:

```
> /ingest ./meeting-notes.md
Ingested meeting-notes.md: 5 chunks
```

This adds content to the currently loaded knowledge graph.

## Common workflows

### Build a project knowledge base from scratch

```bash
# Ingest all project docs
planopticon ingest ./project-docs/ -o ./knowledge-base

# Query what was captured
planopticon query --db-path ./knowledge-base/knowledge_graph.db

# Export as an Obsidian vault
planopticon export obsidian ./knowledge-base/knowledge_graph.db -o ./vault
```

### Incrementally build a knowledge graph

```bash
# Start with initial docs
planopticon ingest ./sprint-1-docs/ -o ./kg

# Add more docs over time
planopticon ingest ./sprint-2-docs/ --db-path ./kg/knowledge_graph.db
planopticon ingest ./sprint-3-docs/ --db-path ./kg/knowledge_graph.db

# The graph grows with each ingestion
planopticon query --db-path ./kg/knowledge_graph.db stats
```

### Ingest from Google Workspace or Microsoft 365

PlanOpticon provides integrated commands that fetch cloud documents and ingest them in one step:

```bash
# Google Workspace
planopticon gws ingest --folder-id FOLDER_ID -o ./results

# Microsoft 365 / SharePoint
planopticon m365 ingest --web-url https://contoso.sharepoint.com/sites/proj \
  --folder-url /sites/proj/Shared\ Documents
```

These commands handle authentication, document download, text extraction, and knowledge graph creation automatically.
