# Document Ingestion

Document ingestion lets you process files -- PDFs, Markdown, and plaintext -- into a knowledge graph. PlanOpticon extracts text from documents, chunks it into manageable pieces, runs LLM-powered entity and relationship extraction, and stores the results in a FalkorDB knowledge graph. This is the same knowledge graph format produced by video analysis, so you can combine video and document insights in a single graph.

## Supported formats

| Extension | Processor | Description |
|-----------|-----------|-------------|
| `.pdf` | `PdfProcessor` | Extracts text page by page using pymupdf or pdfplumber |
| `.md`, `.markdown` | `MarkdownProcessor` | Splits on headings into sections |
| `.txt`, `.text`, `.log`, `.csv` | `PlaintextProcessor` | Splits on paragraph boundaries |

Additional formats can be added by subclassing `DocumentProcessor` and registering the subclass (see [Extending with custom processors](#extending-with-custom-processors) below).

## CLI usage

### `planopticon ingest`

```
planopticon ingest INPUT_PATH [OPTIONS]
```

**Arguments:**

| Argument | Description |
|----------|-------------|
| `INPUT_PATH` | Path to a file or directory to ingest (must exist) |

**Options:**

| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--output` | `-o` | Current directory | Output directory for the knowledge graph |
| `--db-path` | | None | Path to an existing `knowledge_graph.db` to merge into |
| `--recursive / --no-recursive` | `-r` | `--recursive` | Recurse into subdirectories (directory ingestion only) |
| `--provider` | `-p` | `auto` | LLM provider for entity extraction (`openai`, `anthropic`, `gemini`, `ollama`, `azure`, `together`, `fireworks`, `cerebras`, `xai`) |
| `--chat-model` | | None | Override the model used for entity extraction |

### Single file ingestion

Process a single document and create a new knowledge graph:

```bash
planopticon ingest spec.md
```

This creates `knowledge_graph.db` and `knowledge_graph.json` in the current directory.

Specify an output directory:

```bash
planopticon ingest report.pdf -o ./results
```

This creates `./results/knowledge_graph.db` and `./results/knowledge_graph.json`.

### Directory ingestion

Process all supported files in a directory:

```bash
planopticon ingest ./docs/
```

By default, this recurses into subdirectories. To process only the top-level directory:

```bash
planopticon ingest ./docs/ --no-recursive
```

PlanOpticon automatically filters for supported file extensions. Unsupported files are silently skipped.

### Merging into an existing knowledge graph

To add document content to an existing knowledge graph (e.g., one created from video analysis), use `--db-path`:

```bash
# First, analyze a video
planopticon analyze meeting.mp4 -o ./results

# Then, ingest supplementary documents into the same graph
planopticon ingest ./meeting-notes/ --db-path ./results/knowledge_graph.db
```

The ingested entities and relationships are merged with the existing graph. Duplicate entities are consolidated automatically by the knowledge graph engine.

### Choosing an LLM provider

Entity and relationship extraction requires an LLM. By default, PlanOpticon auto-detects available providers based on your environment variables. You can override this:

```bash
# Use Anthropic for extraction
planopticon ingest docs/ -p anthropic

# Use a specific model
planopticon ingest docs/ -p openai --chat-model gpt-4o

# Use a local Ollama model
planopticon ingest docs/ -p ollama --chat-model llama3
```

### Output

After ingestion, PlanOpticon prints a summary:

```
Knowledge graph: ./knowledge_graph.db
spec.md: 12 chunks
architecture.md: 8 chunks
requirements.txt: 3 chunks

Ingestion complete:
  Files processed: 3
  Total chunks: 23
  Entities extracted: 47
  Relationships: 31
  Knowledge graph: ./knowledge_graph.db
```

Both `.db` (SQLite-backed FalkorDB) and `.json` formats are saved automatically.

## How each processor works

### PDF processor

The `PdfProcessor` extracts text from PDF files on a per-page basis. It tries two extraction libraries in order:

1. **pymupdf** (preferred) -- Fast, reliable text extraction. Install with `pip install pymupdf`.
2. **pdfplumber** (fallback) -- Alternative extractor. Install with `pip install pdfplumber`.

If neither library is installed, the processor raises an `ImportError` with installation instructions.

Each page becomes a separate `DocumentChunk` with:

- `text`: The extracted text content of the page
- `page`: The 1-based page number
- `metadata.extraction_method`: Which library was used (`pymupdf` or `pdfplumber`)

To install PDF support:

```bash
pip install 'planopticon[pdf]'
# or
pip install pymupdf
# or
pip install pdfplumber
```
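The try-pymupdf-then-pdfplumber order described above can be sketched as a small standalone function. This is illustrative only, not the actual `PdfProcessor` source; `extract_pdf_pages` is a hypothetical name:

```python
# Sketch of the pymupdf-first, pdfplumber-fallback extraction order
# described above (illustrative; the real PdfProcessor may differ).
from pathlib import Path
from typing import List, Tuple


def extract_pdf_pages(path: Path) -> List[Tuple[int, str, str]]:
    """Return (page_number, text, extraction_method) tuples, 1-based pages."""
    try:
        import fitz  # pymupdf's import name

        with fitz.open(path) as doc:
            return [(i + 1, page.get_text(), "pymupdf")
                    for i, page in enumerate(doc)]
    except ImportError:
        pass
    try:
        import pdfplumber

        with pdfplumber.open(path) as pdf:
            return [(i + 1, page.extract_text() or "", "pdfplumber")
                    for i, page in enumerate(pdf.pages)]
    except ImportError:
        # Mirrors the ImportError-with-instructions behavior described above.
        raise ImportError(
            "PDF support requires pymupdf or pdfplumber: "
            "pip install pymupdf  # or: pip install pdfplumber"
        )
```

The per-page tuples map directly onto `DocumentChunk` fields: the 1-based page number becomes `page`, and the library name becomes `metadata.extraction_method`.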

### Markdown processor

The `MarkdownProcessor` splits Markdown files on heading boundaries (lines starting with `#` through `######`). Each heading and its content up to the next heading becomes a separate chunk.

**Splitting behavior:**

- If the file contains headings, each heading section becomes a chunk. The `section` field records the heading text.
- Content before the first heading is captured as a `(preamble)` chunk.
- If the file contains no headings, it falls back to paragraph-based chunking (same as plaintext).

For example, a file with this structure:

```markdown
Some intro text.

# Architecture

The system uses a microservices architecture...

## Components

There are three main components...

# Deployment

Deployment is handled via...
```

produces four chunks: `(preamble)`, `Architecture`, `Components`, and `Deployment`.
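The heading-splitting behavior above can be sketched in a few lines. This is a minimal illustration, not the `MarkdownProcessor` source; it skips heading sections with empty bodies, which the real processor may handle differently:

```python
# Illustrative sketch of heading-based Markdown splitting. Content before
# the first heading gets the "(preamble)" label, as described above.
import re
from typing import List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown(text: str) -> List[Tuple[str, str]]:
    """Return (section_title, body) pairs in document order."""
    sections: List[Tuple[str, str]] = []
    title, lines = "(preamble)", []
    for line in text.splitlines():
        m = HEADING.match(line)
        if m:
            if any(l.strip() for l in lines):
                sections.append((title, "\n".join(lines).strip()))
            title, lines = m.group(2).strip(), []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        sections.append((title, "\n".join(lines).strip()))
    return sections
```

Run against the example file above, this yields the four section titles `(preamble)`, `Architecture`, `Components`, and `Deployment`.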

### Plaintext processor

The `PlaintextProcessor` handles `.txt`, `.text`, `.log`, and `.csv` files. It splits text on paragraph boundaries (double newlines) and groups paragraphs into chunks with a configurable maximum size.

**Chunking parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `max_chunk_size` | 2000 characters | Maximum size of each chunk |
| `overlap` | 200 characters | Number of characters from the end of one chunk repeated at the start of the next |

The overlap ensures that entities or context spanning a paragraph boundary are not lost. Chunks are created by accumulating paragraphs until the next paragraph would exceed `max_chunk_size`, at which point the current chunk is flushed and a new one begins.
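That accumulate-and-flush loop can be sketched as follows. This is illustrative only (the actual `PlaintextProcessor` may differ), and for simplicity it does not split a single paragraph that is itself longer than `max_chunk_size`:

```python
# Sketch of paragraph-based chunking with overlap, per the parameters above.
from typing import List


def chunk_paragraphs(text: str, max_chunk_size: int = 2000,
                     overlap: int = 200) -> List[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        # Flush when adding the next paragraph would exceed the limit.
        if current and len(current) + len(para) + 2 > max_chunk_size:
            chunks.append(current)
            # Carry the tail of the flushed chunk forward as overlap so
            # context spanning the boundary is not lost.
            current = current[-overlap:] + "\n\n" + para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```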

## The ingestion pipeline

Document ingestion follows this pipeline:

```
File on disk
     |
     v
Processor selection (by file extension)
     |
     v
Text extraction (PDF pages / Markdown sections / plaintext paragraphs)
     |
     v
DocumentChunk objects (text + metadata)
     |
     v
Source registration (provenance tracking in the KG)
     |
     v
KG content addition (LLM entity/relationship extraction per chunk)
     |
     v
Knowledge graph storage (.db + .json)
```

### Step 1: Processor selection

PlanOpticon maintains a registry of processors keyed by file extension. When you call `ingest_file()`, it looks up the appropriate processor using `get_processor(path)`. If no processor is registered for the file's extension, a `ValueError` is raised.
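The lookup can be sketched as a small extension-keyed dictionary. The names `register` and `lookup` here are illustrative, not the actual PlanOpticon API:

```python
# Sketch of an extension-keyed processor registry mirroring the lookup
# described above (hypothetical names, not library API).
from pathlib import Path
from typing import Dict, Iterable

_REGISTRY: Dict[str, str] = {}


def register(extensions: Iterable[str], processor_name: str) -> None:
    # Store extensions lowercased so lookups are case-insensitive.
    for ext in extensions:
        _REGISTRY[ext.lower()] = processor_name


def lookup(path: Path) -> str:
    processor = _REGISTRY.get(path.suffix.lower())
    if processor is None:
        # Mirrors the ValueError described for unregistered extensions.
        raise ValueError(f"No processor registered for {path.suffix!r}")
    return processor


register([".pdf"], "PdfProcessor")
register([".md", ".markdown"], "MarkdownProcessor")
```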

### Step 2: Text extraction

The selected processor reads the file and produces a list of `DocumentChunk` objects. Each chunk contains:

| Field | Type | Description |
|-------|------|-------------|
| `text` | `str` | The extracted text content |
| `source_file` | `str` | Path to the source file |
| `chunk_index` | `int` | Sequential index of this chunk within the file |
| `page` | `Optional[int]` | Page number (PDF only, 1-based) |
| `section` | `Optional[str]` | Section heading (Markdown only) |
| `metadata` | `Dict[str, Any]` | Additional metadata (e.g., extraction method) |
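Assuming `DocumentChunk` follows the field table above, it can be modeled as a simple dataclass. This is a sketch; the real class lives in `video_processor.processors.base` and may carry additional behavior:

```python
# Sketch of the DocumentChunk shape described by the table above.
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class DocumentChunk:
    text: str
    source_file: str
    chunk_index: int
    page: Optional[int] = None       # PDF only, 1-based
    section: Optional[str] = None    # Markdown only
    metadata: Dict[str, Any] = field(default_factory=dict)
```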

### Step 3: Source registration

Each ingested file is registered as a source in the knowledge graph with provenance metadata:

- `source_id`: The first 12 characters of the SHA-256 hash of the absolute file path, unless you provide a custom ID
- `source_type`: Always `"document"`
- `title`: The file stem (filename without extension)
- `path`: The file path
- `mime_type`: Detected MIME type
- `ingested_at`: ISO-8601 timestamp
- `metadata`: Chunk count and file extension
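The `source_id` rule above can be sketched in a couple of lines. `make_source_id` is a hypothetical helper name, not PlanOpticon's API:

```python
# Sketch of the "first 12 hex chars of the SHA-256 of the absolute path"
# source_id rule described above (illustrative helper).
import hashlib
from pathlib import Path


def make_source_id(path: Path) -> str:
    absolute = str(path.resolve())
    return hashlib.sha256(absolute.encode("utf-8")).hexdigest()[:12]
```

Because the ID is derived from the absolute path, re-ingesting the same file yields the same `source_id`, which is what makes merge-style ingestion idempotent at the source level.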

### Step 4: Entity and relationship extraction

Each chunk's text is passed to `knowledge_graph.add_content()`, which uses the configured LLM provider to extract entities and relationships. The content source is tagged with the document name and either the page number or the section name:

- `document:report.pdf:page:3`
- `document:spec.md:section:Architecture`
- `document:notes.txt` (no page or section)
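The three tag formats above can be produced by a small helper like this (a hypothetical illustration; the tag strings themselves are taken from the formats listed above):

```python
# Sketch of building a content-source tag from chunk metadata.
from typing import Optional


def content_source_tag(filename: str, page: Optional[int] = None,
                       section: Optional[str] = None) -> str:
    if page is not None:
        return f"document:{filename}:page:{page}"
    if section is not None:
        return f"document:{filename}:section:{section}"
    return f"document:{filename}"
```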

### Step 5: Storage

The knowledge graph is saved in both `.db` (SQLite-backed FalkorDB) and `.json` formats.

## Combining with video analysis

A common workflow is to analyze a video recording and then ingest related documents into the same knowledge graph:

```bash
# Step 1: Analyze the meeting recording
planopticon analyze meeting-recording.mp4 -o ./project-kg

# Step 2: Ingest the meeting agenda
planopticon ingest agenda.md --db-path ./project-kg/knowledge_graph.db

# Step 3: Ingest the project spec
planopticon ingest project-spec.pdf --db-path ./project-kg/knowledge_graph.db

# Step 4: Ingest a whole docs folder
planopticon ingest ./reference-docs/ --db-path ./project-kg/knowledge_graph.db

# Step 5: Query the combined graph
planopticon query --db-path ./project-kg/knowledge_graph.db
```

The resulting knowledge graph contains entities and relationships from all sources -- video transcripts, meeting agendas, specs, and reference documents -- with full provenance tracking, so you can trace any entity back to its source.

## Python API

### Ingesting a single file

```python
from pathlib import Path
from video_processor.integrators.knowledge_graph import KnowledgeGraph
from video_processor.processors.ingest import ingest_file

kg = KnowledgeGraph(db_path=Path("knowledge_graph.db"))
chunk_count = ingest_file(Path("document.pdf"), kg)
print(f"Processed {chunk_count} chunks")

kg.save(Path("knowledge_graph.db"))
```

### Ingesting a directory

```python
from pathlib import Path
from video_processor.integrators.knowledge_graph import KnowledgeGraph
from video_processor.processors.ingest import ingest_directory

kg = KnowledgeGraph(db_path=Path("knowledge_graph.db"))
results = ingest_directory(
    Path("./docs"),
    kg,
    recursive=True,
    extensions=[".md", ".pdf"],  # Optional: filter by extension
)

for filepath, chunks in results.items():
    print(f"  {filepath}: {chunks} chunks")

kg.save(Path("knowledge_graph.db"))
```

### Listing supported extensions

```python
from video_processor.processors.base import list_supported_extensions

extensions = list_supported_extensions()
print(extensions)
# ['.csv', '.log', '.markdown', '.md', '.pdf', '.text', '.txt']
```

### Working with processors directly

```python
from pathlib import Path
from video_processor.processors.base import get_processor

processor = get_processor(Path("report.pdf"))
if processor:
    chunks = processor.process(Path("report.pdf"))
    for chunk in chunks:
        print(f"Page {chunk.page}: {chunk.text[:100]}...")
```

## Extending with custom processors

To add support for a new file format, subclass the `DocumentProcessor` abstract class and register it:

```python
from pathlib import Path
from typing import List

from video_processor.processors.base import (
    DocumentChunk,
    DocumentProcessor,
    register_processor,
)


class HtmlProcessor(DocumentProcessor):
    supported_extensions = [".html", ".htm"]

    def can_process(self, path: Path) -> bool:
        return path.suffix.lower() in self.supported_extensions

    def process(self, path: Path) -> List[DocumentChunk]:
        from bs4 import BeautifulSoup

        soup = BeautifulSoup(path.read_text(), "html.parser")
        text = soup.get_text(separator="\n")
        return [
            DocumentChunk(
                text=text,
                source_file=str(path),
                chunk_index=0,
            )
        ]


register_processor(HtmlProcessor.supported_extensions, HtmlProcessor)
```

After registration, `planopticon ingest` will automatically handle `.html` and `.htm` files.

## Companion REPL

Inside the interactive companion REPL, you can ingest files using the `/ingest` command:

```
> /ingest ./meeting-notes.md
Ingested meeting-notes.md: 5 chunks
```

This adds content to the currently loaded knowledge graph.

## Common workflows

### Build a project knowledge base from scratch

```bash
# Ingest all project docs
planopticon ingest ./project-docs/ -o ./knowledge-base

# Query what was captured
planopticon query --db-path ./knowledge-base/knowledge_graph.db

# Export as an Obsidian vault
planopticon export obsidian ./knowledge-base/knowledge_graph.db -o ./vault
```

### Incrementally build a knowledge graph

```bash
# Start with initial docs
planopticon ingest ./sprint-1-docs/ -o ./kg

# Add more docs over time
planopticon ingest ./sprint-2-docs/ --db-path ./kg/knowledge_graph.db
planopticon ingest ./sprint-3-docs/ --db-path ./kg/knowledge_graph.db

# The graph grows with each ingestion
planopticon query --db-path ./kg/knowledge_graph.db stats
```

### Ingest from Google Workspace or Microsoft 365

PlanOpticon provides integrated commands that fetch cloud documents and ingest them in one step:

```bash
# Google Workspace
planopticon gws ingest --folder-id FOLDER_ID -o ./results

# Microsoft 365 / SharePoint
planopticon m365 ingest --web-url https://contoso.sharepoint.com/sites/proj \
  --folder-url /sites/proj/Shared\ Documents
```

These commands handle authentication, document download, text extraction, and knowledge graph creation automatically.
