# Document Ingestion

Document ingestion lets you process files -- PDFs, Markdown, and plaintext -- into a knowledge graph. PlanOpticon extracts text from documents, chunks it into manageable pieces, runs LLM-powered entity and relationship extraction, and stores the results in a FalkorDB knowledge graph. This is the same knowledge graph format produced by video analysis, so you can combine video and document insights in a single graph.

## Supported formats

| Extension | Processor | Description |
|-----------|-----------|-------------|
| `.pdf` | `PdfProcessor` | Extracts text page by page using pymupdf or pdfplumber |
| `.md`, `.markdown` | `MarkdownProcessor` | Splits on headings into sections |
| `.txt`, `.text`, `.log`, `.csv` | `PlaintextProcessor` | Splits on paragraph boundaries |

Additional formats can be added by implementing the `DocumentProcessor` base class and registering it (see [Extending with custom processors](#extending-with-custom-processors) below).

## CLI usage

### `planopticon ingest`

```
planopticon ingest INPUT_PATH [OPTIONS]
```

**Arguments:**

| Argument | Description |
|----------|-------------|
| `INPUT_PATH` | Path to a file or directory to ingest (must exist) |

**Options:**

| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--output` | `-o` | Current directory | Output directory for the knowledge graph |
| `--db-path` | | None | Path to an existing `knowledge_graph.db` to merge into |
| `--recursive / --no-recursive` | `-r` | `--recursive` | Recurse into subdirectories (directory ingestion only) |
| `--provider` | `-p` | `auto` | LLM provider for entity extraction (`openai`, `anthropic`, `gemini`, `ollama`, `azure`, `together`, `fireworks`, `cerebras`, `xai`) |
| `--chat-model` | | None | Override the model used for LLM entity extraction |

### Single file ingestion

Process a single document and create a new knowledge graph:

```bash
planopticon ingest spec.md
```

This creates `knowledge_graph.db` and `knowledge_graph.json` in the current directory.

Specify an output directory:

```bash
planopticon ingest report.pdf -o ./results
```

This creates `./results/knowledge_graph.db` and `./results/knowledge_graph.json`.

### Directory ingestion

Process all supported files in a directory:

```bash
planopticon ingest ./docs/
```

By default, this recurses into subdirectories. To process only the top-level directory:

```bash
planopticon ingest ./docs/ --no-recursive
```

PlanOpticon automatically filters for supported file extensions. Unsupported files are silently skipped.

### Merging into an existing knowledge graph

To add document content to an existing knowledge graph (e.g., one created from video analysis), use `--db-path`:

```bash
# First, analyze a video
planopticon analyze meeting.mp4 -o ./results

# Then, ingest supplementary documents into the same graph
planopticon ingest ./meeting-notes/ --db-path ./results/knowledge_graph.db
```

The ingested entities and relationships are merged with the existing graph. Duplicate entities are consolidated automatically by the knowledge graph engine.

### Choosing an LLM provider

Entity and relationship extraction requires an LLM. By default, PlanOpticon auto-detects available providers based on your environment variables. You can override this:

```bash
# Use Anthropic for extraction
planopticon ingest docs/ -p anthropic

# Use a specific model
planopticon ingest docs/ -p openai --chat-model gpt-4o

# Use a local Ollama model
planopticon ingest docs/ -p ollama --chat-model llama3
```
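
Auto-detection of this kind can be pictured as a first-match scan over well-known API-key variables. The sketch below is an illustration only: the mapping, ordering, and fallback are assumptions, not PlanOpticon's actual detection logic.

```python
import os

# Hypothetical provider -> env var mapping; the real detection order may differ.
PROVIDER_ENV_VARS = [
    ("openai", "OPENAI_API_KEY"),
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("gemini", "GEMINI_API_KEY"),
]


def detect_provider(environ=os.environ) -> str:
    """Return the first provider whose API key is set in the environment."""
    for provider, var in PROVIDER_ENV_VARS:
        if environ.get(var):
            return provider
    return "ollama"  # assumed local fallback when no keys are present
```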

### Output

After ingestion, PlanOpticon prints a summary:

```
Knowledge graph: ./knowledge_graph.db
spec.md: 12 chunks
architecture.md: 8 chunks
requirements.txt: 3 chunks

Ingestion complete:
  Files processed: 3
  Total chunks: 23
  Entities extracted: 47
  Relationships: 31
  Knowledge graph: ./knowledge_graph.db
```

Both `.db` (SQLite/FalkorDB) and `.json` formats are saved automatically.

## How each processor works

### PDF processor

The `PdfProcessor` extracts text from PDF files on a per-page basis. It tries two extraction libraries in order:

1. **pymupdf** (preferred) -- Fast, reliable text extraction. Install with `pip install pymupdf`.
2. **pdfplumber** (fallback) -- Alternative extractor. Install with `pip install pdfplumber`.

If neither library is installed, the processor raises an `ImportError` with installation instructions.

Each page becomes a separate `DocumentChunk` with:

- `text`: The extracted text content of the page
- `page`: The 1-based page number
- `metadata.extraction_method`: Which library was used (`pymupdf` or `pdfplumber`)

To install PDF support:

```bash
pip install 'planopticon[pdf]'
# or
pip install pymupdf
# or
pip install pdfplumber
```
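
The try-first, fall-back behavior can be sketched with a small helper that probes for the first importable backend. This is a generic illustration, not the actual `PdfProcessor` code; `pick_pdf_backend` is a made-up name, and note that pymupdf has historically been imported as `fitz`.

```python
import importlib.util


def pick_pdf_backend(candidates=("fitz", "pdfplumber")) -> str:
    """Return the first importable backend module name, or raise ImportError."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    raise ImportError(
        "No PDF library found. Install one with: pip install pymupdf (or pdfplumber)"
    )
```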

### Markdown processor

The `MarkdownProcessor` splits Markdown files on heading boundaries (lines starting with `#` through `######`). Each heading and its content until the next heading becomes a separate chunk.

**Splitting behavior:**

- If the file contains headings, each heading section becomes a chunk. The `section` field records the heading text.
- Content before the first heading is captured as a `(preamble)` chunk.
- If the file contains no headings, it falls back to paragraph-based chunking (same as plaintext).

For example, a file with this structure:

```markdown
Some intro text.

# Architecture

The system uses a microservices architecture...

## Components

There are three main components...

# Deployment

Deployment is handled via...
```

Produces four chunks: `(preamble)`, `Architecture`, `Components`, and `Deployment`.
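
The heading-splitting behavior described above can be sketched in a few lines. This is a minimal illustration of the documented behavior, not the actual `MarkdownProcessor` implementation (it ignores the no-headings fallback to paragraph chunking).

```python
import re

# ATX headings: '#' through '######' followed by the heading text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown(text: str):
    """Split Markdown into (section_title, body) pairs on heading lines.

    Content before the first heading is collected under '(preamble)';
    an empty preamble is dropped.
    """
    sections = [["(preamble)", []]]
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            sections.append([match.group(2).strip(), []])
        else:
            sections[-1][1].append(line)
    result = [(title, "\n".join(body).strip()) for title, body in sections]
    if result and result[0][0] == "(preamble)" and not result[0][1]:
        result = result[1:]
    return result
```

Run on the example file above, this yields the four documented chunks in order.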

### Plaintext processor

The `PlaintextProcessor` handles `.txt`, `.text`, `.log`, and `.csv` files. It splits text on paragraph boundaries (double newlines) and groups paragraphs into chunks with a configurable maximum size.

**Chunking parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `max_chunk_size` | 2000 characters | Maximum size of each chunk |
| `overlap` | 200 characters | Number of characters from the end of one chunk to repeat at the start of the next |

The overlap ensures that entities or context spanning a paragraph boundary are not lost. Chunks are created by accumulating paragraphs until the next paragraph would exceed `max_chunk_size`, at which point the current chunk is flushed and a new one begins.
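
The accumulate-and-flush loop with overlap carry-over can be sketched as follows. This is an illustrative reimplementation of the documented behavior, not the actual processor code (for simplicity it does not split single paragraphs longer than `max_chunk_size`).

```python
def chunk_paragraphs(text: str, max_chunk_size: int = 2000, overlap: int = 200):
    """Group double-newline-separated paragraphs into chunks of at most
    max_chunk_size characters, carrying `overlap` trailing characters of
    each flushed chunk into the start of the next one."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chunk_size:
            chunks.append(current)
            # Repeat the tail of the flushed chunk so boundary context survives.
            current = current[-overlap:] + "\n\n" + para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```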

## The ingestion pipeline

Document ingestion follows this pipeline:

```
File on disk
     |
     v
Processor selection (by file extension)
     |
     v
Text extraction (PDF pages / Markdown sections / plaintext paragraphs)
     |
     v
DocumentChunk objects (text + metadata)
     |
     v
Source registration (provenance tracking in the KG)
     |
     v
KG content addition (LLM entity/relationship extraction per chunk)
     |
     v
Knowledge graph storage (.db + .json)
```

### Step 1: Processor selection

PlanOpticon maintains a registry of processors keyed by file extension. When you call `ingest_file()`, it looks up the appropriate processor using `get_processor(path)`. If no processor is registered for the file extension, a `ValueError` is raised.
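
The registry can be pictured as a plain dictionary keyed by lowercase extension. The sketch below mirrors the documented `register_processor`/`get_processor` names, but the real internals may differ.

```python
from pathlib import Path

# Extension -> processor class mapping (illustrative).
_PROCESSORS = {}


def register_processor(extensions, processor_cls):
    """Map each extension (e.g. '.pdf') to a processor class."""
    for ext in extensions:
        _PROCESSORS[ext.lower()] = processor_cls


def get_processor(path: Path):
    """Look up the processor class for the file's extension, or None."""
    return _PROCESSORS.get(path.suffix.lower())
```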

### Step 2: Text extraction

The selected processor reads the file and produces a list of `DocumentChunk` objects. Each chunk contains:

| Field | Type | Description |
|-------|------|-------------|
| `text` | `str` | The extracted text content |
| `source_file` | `str` | Path to the source file |
| `chunk_index` | `int` | Sequential index of this chunk within the file |
| `page` | `Optional[int]` | Page number (PDF only, 1-based) |
| `section` | `Optional[str]` | Section heading (Markdown only) |
| `metadata` | `Dict[str, Any]` | Additional metadata (e.g., extraction method) |
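
The field table above maps naturally onto a dataclass. This is a sketch of the implied model, not necessarily the library's exact definition.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class DocumentChunk:
    text: str
    source_file: str
    chunk_index: int
    page: Optional[int] = None       # PDF only, 1-based
    section: Optional[str] = None    # Markdown only
    metadata: Dict[str, Any] = field(default_factory=dict)
```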

### Step 3: Source registration

Each ingested file is registered as a source in the knowledge graph with provenance metadata:

- `source_id`: A SHA-256 hash of the absolute file path (first 12 characters), unless you provide a custom ID
- `source_type`: Always `"document"`
- `title`: The file stem (filename without extension)
- `path`: The file path
- `mime_type`: Detected MIME type
- `ingested_at`: ISO-8601 timestamp
- `metadata`: Chunk count and file extension
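
The default `source_id` derivation described above can be sketched directly; `default_source_id` is a hypothetical name for illustration.

```python
import hashlib
from pathlib import Path


def default_source_id(path: Path) -> str:
    """First 12 hex characters of the SHA-256 of the absolute file path."""
    absolute = str(path.resolve())
    return hashlib.sha256(absolute.encode("utf-8")).hexdigest()[:12]
```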

### Step 4: Entity and relationship extraction

Each chunk's text is passed to `knowledge_graph.add_content()`, which uses the configured LLM provider to extract entities and relationships. The content source is tagged with the document name and either the page number or section name:

- `document:report.pdf:page:3`
- `document:spec.md:section:Architecture`
- `document:notes.txt` (no page or section)

### Step 5: Storage

The knowledge graph is saved in both `.db` (SQLite-backed FalkorDB) and `.json` formats.
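
The tag format in Step 4 can be captured by a small helper; `content_source_tag` is an illustrative name, not a documented API.

```python
from typing import Optional


def content_source_tag(name: str, page: Optional[int] = None,
                       section: Optional[str] = None) -> str:
    """Build the provenance tag in the documented format: page wins over
    section, and a bare document name is used when neither is present."""
    if page is not None:
        return f"document:{name}:page:{page}"
    if section is not None:
        return f"document:{name}:section:{section}"
    return f"document:{name}"
```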

## Combining with video analysis

A common workflow is to analyze a video recording and then ingest related documents into the same knowledge graph:

```bash
# Step 1: Analyze the meeting recording
planopticon analyze meeting-recording.mp4 -o ./project-kg

# Step 2: Ingest the meeting agenda
planopticon ingest agenda.md --db-path ./project-kg/knowledge_graph.db

# Step 3: Ingest the project spec
planopticon ingest project-spec.pdf --db-path ./project-kg/knowledge_graph.db

# Step 4: Ingest a whole docs folder
planopticon ingest ./reference-docs/ --db-path ./project-kg/knowledge_graph.db

# Step 5: Query the combined graph
planopticon query --db-path ./project-kg/knowledge_graph.db
```

The resulting knowledge graph contains entities and relationships from all sources -- video transcripts, meeting agendas, specs, and reference documents -- with full provenance tracking so you can trace any entity back to its source.

## Python API

### Ingesting a single file

```python
from pathlib import Path
from video_processor.integrators.knowledge_graph import KnowledgeGraph
from video_processor.processors.ingest import ingest_file

kg = KnowledgeGraph(db_path=Path("knowledge_graph.db"))
chunk_count = ingest_file(Path("document.pdf"), kg)
print(f"Processed {chunk_count} chunks")

kg.save(Path("knowledge_graph.db"))
```

### Ingesting a directory

```python
from pathlib import Path
from video_processor.integrators.knowledge_graph import KnowledgeGraph
from video_processor.processors.ingest import ingest_directory

kg = KnowledgeGraph(db_path=Path("knowledge_graph.db"))
results = ingest_directory(
    Path("./docs"),
    kg,
    recursive=True,
    extensions=[".md", ".pdf"],  # Optional: filter by extension
)

for filepath, chunks in results.items():
    print(f" {filepath}: {chunks} chunks")

kg.save(Path("knowledge_graph.db"))
```

### Listing supported extensions

```python
from video_processor.processors.base import list_supported_extensions

extensions = list_supported_extensions()
print(extensions)
# ['.csv', '.log', '.markdown', '.md', '.pdf', '.text', '.txt']
```

### Working with processors directly

```python
from pathlib import Path
from video_processor.processors.base import get_processor

processor = get_processor(Path("report.pdf"))
if processor:
    chunks = processor.process(Path("report.pdf"))
    for chunk in chunks:
        print(f"Page {chunk.page}: {chunk.text[:100]}...")
```

## Extending with custom processors

To add support for a new file format, implement the `DocumentProcessor` abstract class and register it:

```python
from pathlib import Path
from typing import List
from video_processor.processors.base import (
    DocumentChunk,
    DocumentProcessor,
    register_processor,
)


class HtmlProcessor(DocumentProcessor):
    supported_extensions = [".html", ".htm"]

    def can_process(self, path: Path) -> bool:
        return path.suffix.lower() in self.supported_extensions

    def process(self, path: Path) -> List[DocumentChunk]:
        from bs4 import BeautifulSoup

        soup = BeautifulSoup(path.read_text(), "html.parser")
        text = soup.get_text(separator="\n")
        return [
            DocumentChunk(
                text=text,
                source_file=str(path),
                chunk_index=0,
            )
        ]


register_processor(HtmlProcessor.supported_extensions, HtmlProcessor)
```

After registration, `planopticon ingest` will automatically handle `.html` and `.htm` files.

## Companion REPL

Inside the interactive companion REPL, you can ingest files using the `/ingest` command:

```
> /ingest ./meeting-notes.md
Ingested meeting-notes.md: 5 chunks
```

This adds content to the currently loaded knowledge graph.

## Common workflows

### Build a project knowledge base from scratch

```bash
# Ingest all project docs
planopticon ingest ./project-docs/ -o ./knowledge-base

# Query what was captured
planopticon query --db-path ./knowledge-base/knowledge_graph.db

# Export as an Obsidian vault
planopticon export obsidian ./knowledge-base/knowledge_graph.db -o ./vault
```

### Incrementally build a knowledge graph

```bash
# Start with initial docs
planopticon ingest ./sprint-1-docs/ -o ./kg

# Add more docs over time
planopticon ingest ./sprint-2-docs/ --db-path ./kg/knowledge_graph.db
planopticon ingest ./sprint-3-docs/ --db-path ./kg/knowledge_graph.db

# The graph grows with each ingestion
planopticon query --db-path ./kg/knowledge_graph.db stats
```

### Ingest from Google Workspace or Microsoft 365

PlanOpticon provides integrated commands that fetch cloud documents and ingest them in one step:

```bash
# Google Workspace
planopticon gws ingest --folder-id FOLDER_ID -o ./results

# Microsoft 365 / SharePoint
planopticon m365 ingest --web-url https://contoso.sharepoint.com/sites/proj \
  --folder-url /sites/proj/Shared\ Documents
```

These commands handle authentication, document download, text extraction, and knowledge graph creation automatically.