fix(security): add VITE_PAYMENT_URL environment variable configuration
132
.qoder/skills/understand-knowledge/SKILL.md
Normal file
@@ -0,0 +1,132 @@
---
name: understand-knowledge
description: Analyze a Karpathy-pattern LLM wiki knowledge base and generate an interactive knowledge graph with entity extraction, implicit relationships, and topic clustering.
argument-hint: [wiki-directory]
---

# /understand-knowledge

Analyzes a Karpathy-pattern LLM wiki — a three-layer knowledge base with raw sources, wiki markdown, and a schema file — and produces an interactive knowledge graph dashboard.

## What It Detects

The **Karpathy LLM wiki pattern** (see https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f):
- **Raw sources** — immutable source documents (articles, papers, data files)
- **Wiki** — LLM-generated markdown files with wikilinks (`[[target]]` syntax)
- **Schema** — CLAUDE.md, AGENTS.md, or similar configuration file
- **index.md** — content catalog organized by categories
- **log.md** — chronological operation log

Detection signals: an `index.md` plus multiple `.md` files containing wikilinks. A `raw/` directory and a schema file may also be present.

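As a rough illustration of the detection signals above, the following is a minimal sketch and not the bundled script. The function name `looks_like_wiki`, the `min_pages` threshold, and the rule that at least one file must contain a wikilink are assumptions made for this example.

```python
import re
from pathlib import Path

# Matches [[target]] and [[target|display]] wikilinks.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def looks_like_wiki(root: Path, min_pages: int = 3) -> bool:
    """Heuristic: index.md exists, enough .md files, and some wikilinks (assumed rule)."""
    if not (root / "index.md").is_file():
        return False
    pages = list(root.rglob("*.md"))
    # Count the pages that contain at least one wikilink.
    linked = [
        p for p in pages
        if WIKILINK.search(p.read_text(encoding="utf-8", errors="replace"))
    ]
    return len(pages) >= min_pages and len(linked) >= 1
```

A directory without `index.md` is rejected immediately; the optional `raw/` directory and schema file are deliberately not checked here.
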
## Instructions

### Phase 1: DETECT

1. Determine the target directory:
   - If the user provided a path argument, use that
   - Otherwise, use the current working directory

2. Run the format detection script bundled with this skill:
   ```
   python3 <SKILL_DIR>/parse-knowledge-base.py <TARGET_DIR>
   ```
   - If the script exits with an error, tell the user this doesn't appear to be a Karpathy-pattern wiki and explain what was expected
   - If successful, proceed. The script writes `scan-manifest.json` to `<TARGET_DIR>/.understand-anything/intermediate/`

3. Read the scan-manifest.json and announce the results:
   - "Detected Karpathy wiki: N articles, N sources, N topics, N wikilinks (N unresolved)"
   - List the categories found from index.md

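The announcement in step 3 could be derived from the manifest roughly as follows. This is a sketch under assumptions: it relies only on the `nodes`, `edges`, and `categories` keys that the merge script reads, the function name `summarize_manifest` is hypothetical, and the unresolved-wikilink count is omitted because the manifest field that would hold it is not shown here.

```python
from collections import Counter


def summarize_manifest(manifest: dict) -> tuple[str, list[str]]:
    """Return a one-line summary and the category names (illustrative helper)."""
    # Count nodes per type; missing types count as 0.
    types = Counter(n.get("type") for n in manifest.get("nodes", []))
    # Wikilinks are stored as `related` edges in the scan manifest.
    wikilinks = sum(1 for e in manifest.get("edges", []) if e.get("type") == "related")
    summary = (
        f"Detected Karpathy wiki: {types['article']} articles, "
        f"{types['source']} sources, {types['topic']} topics, {wikilinks} wikilinks"
    )
    categories = [c["name"] for c in manifest.get("categories", [])]
    return summary, categories
```
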
### Phase 2: SCAN (already done)

The parse script in Phase 1 already performed the deterministic scan. The scan-manifest.json contains:
- Article nodes (one per wiki .md file) with extracted wikilinks, headings, frontmatter
- Source nodes (one per raw/ file)
- Topic nodes (from index.md section headings)
- `related` edges (from wikilinks)
- `categorized_under` edges (from index.md sections)

No additional scanning is needed. Proceed to Phase 3.

### Phase 3: ANALYZE

Dispatch `article-analyzer` subagents to extract implicit knowledge:

1. Read the scan-manifest.json to get the article list

2. Prepare batches of 10-15 articles each, grouped by category when possible (articles in the same category are more likely to have implicit cross-references)

3. For each batch, dispatch an `article-analyzer` subagent with:
   - The batch of articles (id, name, summary, wikilinks, category, content from knowledgeMeta)
   - The full list of existing node IDs (so the agent can reference them)
   - The batch number for output file naming
   - The intermediate directory path: `$INTERMEDIATE_DIR = <TARGET_DIR>/.understand-anything/intermediate`

   The agent will write `analysis-batch-{N}.json` to the intermediate directory.

4. Run up to 3 batches concurrently. Wait for all batches to complete.

5. If any batch fails, log a warning but continue — the scan-manifest provides a solid base graph even without LLM analysis.

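The batching and concurrency rules in steps 2, 4, and 5 can be sketched as below. This is only an illustration: `make_batches`, `run_batches`, and the `dispatch` callback are hypothetical names, the `category` key on each article is an assumption, and a thread pool merely stands in for however subagents are actually dispatched.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def make_batches(articles: list[dict], max_size: int = 15) -> list[list[dict]]:
    """Chunk articles into batches of at most max_size, keeping categories adjacent."""
    by_category: dict[str, list[dict]] = defaultdict(list)
    for article in articles:
        by_category[article.get("category") or ""].append(article)
    ordered = [a for group in by_category.values() for a in group]
    # The final batch may be smaller than the 10-article lower bound.
    return [ordered[i:i + max_size] for i in range(0, len(ordered), max_size)]


def run_batches(batches, dispatch, max_concurrent: int = 3) -> list[int]:
    """Call dispatch(batch_number, batch) with bounded concurrency; return failed numbers."""
    failed = []
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = {pool.submit(dispatch, n, b): n for n, b in enumerate(batches, start=1)}
        for future, number in futures.items():
            try:
                future.result()
            except Exception as exc:  # warn and continue, as step 5 requires
                print(f"Warning: batch {number} failed: {exc}")
                failed.append(number)
    return failed
```

Failures are collected rather than raised, so the merge phase can still run on whichever `analysis-batch-{N}.json` files were written.
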
### Phase 4: MERGE

1. Run the merge script bundled with this skill:
   ```
   python3 <SKILL_DIR>/merge-knowledge-graph.py <TARGET_DIR>
   ```

2. The script:
   - Combines scan-manifest.json + all analysis-batch-*.json files
   - Deduplicates entities (case-insensitive name matching)
   - Normalizes node/edge types via alias maps
   - Builds layers from index.md categories
   - Builds a tour from index.md section ordering
   - Writes `assembled-graph.json` to the intermediate directory

3. Read the merge report from stderr and announce:
   - Total nodes, edges, layers, tour steps
   - How many entities/claims the LLM analysis added

### Phase 5: SAVE

1. Read the assembled-graph.json

2. Run basic validation:
   - Every edge source/target must reference an existing node
   - Every node must have: id, type, name, summary, tags, complexity
   - Remove any edges with dangling references

3. Copy the validated graph to `<TARGET_DIR>/.understand-anything/knowledge-graph.json`

4. Write metadata to `<TARGET_DIR>/.understand-anything/meta.json`:
   ```json
   {
     "lastAnalyzedAt": "<ISO timestamp>",
     "gitCommitHash": "<from git rev-parse HEAD or empty>",
     "version": "1.0.0",
     "analyzedFiles": <number of wiki articles>
   }
   ```

5. Clean up intermediate files:
   ```
   rm -rf <TARGET_DIR>/.understand-anything/intermediate
   ```

6. Report summary to the user:
   - "Knowledge graph saved: N articles, N entities, N topics, N claims, N sources"
   - "N edges (N wikilink, N categorized, N implicit)"
   - "N layers, N tour steps"

7. Auto-trigger the dashboard:
   ```
   /understand-dashboard <TARGET_DIR>
   ```

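The basic validation in step 2 might look like the following sketch. The helper name `validate_graph` and its return shape are assumptions for illustration; only the three rules listed in step 2 are taken from the instructions.

```python
REQUIRED_NODE_FIELDS = ("id", "type", "name", "summary", "tags", "complexity")


def validate_graph(graph: dict) -> tuple[dict, list[str], int]:
    """Return (graph without dangling edges, node problems, number of dropped edges)."""
    nodes = graph.get("nodes", [])
    node_ids = {n["id"] for n in nodes if "id" in n}

    # Report nodes that lack any required field.
    problems = []
    for node in nodes:
        missing = [f for f in REQUIRED_NODE_FIELDS if f not in node]
        if missing:
            problems.append(f"{node.get('id', '<no id>')}: missing {', '.join(missing)}")

    # Keep only edges whose source and target both exist.
    edges = graph.get("edges", [])
    kept = [e for e in edges if e.get("source") in node_ids and e.get("target") in node_ids]

    # Build a new dict so the input graph is left unmodified.
    cleaned = {**graph, "edges": kept}
    return cleaned, problems, len(edges) - len(kept)
```

Returning the problems as a list, rather than raising, leaves it to the caller to decide whether an incomplete node should block the save.
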
## Notes

- The parse script handles ALL deterministic extraction (wikilinks, headings, frontmatter, categories from index.md). The LLM agents only add implicit knowledge that requires inference.
- Categories and taxonomy come from index.md section headings, NOT from filename prefixes. The Karpathy spec is intentionally abstract about naming conventions.
- The graph uses `kind: "knowledge"` to signal the dashboard to use force-directed layout instead of hierarchical dagre.
- Source nodes from raw/ are lightweight (filename + size only) — we don't parse PDFs or binary files.
397
.qoder/skills/understand-knowledge/merge-knowledge-graph.py
Normal file
@@ -0,0 +1,397 @@
#!/usr/bin/env python3
"""
Merge script for Karpathy-pattern knowledge graphs.

Combines the deterministic scan-manifest.json with LLM analysis batches
(analysis-batch-*.json) into a final assembled knowledge graph.

Handles: entity deduplication, edge normalization, layer building from
index.md categories, tour generation from index.md section ordering.

Usage:
    python merge-knowledge-graph.py <wiki-directory>

Output:
    Writes assembled-graph.json to <wiki-directory>/.understand-anything/intermediate/
"""

import json
import os
import re
import sys
from datetime import datetime, timezone
from pathlib import Path

# ---------------------------------------------------------------------------
# Canonical type sets (must match core/src/types.ts)
# ---------------------------------------------------------------------------

VALID_NODE_TYPES = {
    "article", "entity", "topic", "claim", "source",
    # Codebase types (for cross-compatibility)
    "file", "function", "class", "module", "concept",
    "config", "document", "service", "table", "endpoint",
    "pipeline", "schema", "resource", "domain", "flow", "step",
}

VALID_EDGE_TYPES = {
    "cites", "contradicts", "builds_on", "exemplifies",
    "categorized_under", "authored_by", "related", "similar_to",
    # Codebase types
    "imports", "exports", "contains", "inherits", "implements",
    "calls", "subscribes", "publishes", "middleware",
    "reads_from", "writes_to", "transforms", "validates",
    "depends_on", "tested_by", "configures",
    "deploys", "serves", "provisions", "triggers",
    "migrates", "documents", "routes", "defines_schema",
    "contains_flow", "flow_step", "cross_domain",
}

NODE_TYPE_ALIASES = {
    "note": "article", "page": "article", "wiki_page": "article",
    "person": "entity", "actor": "entity", "organization": "entity",
    "tag": "topic", "category": "topic", "theme": "topic",
    "assertion": "claim", "decision": "claim", "thesis": "claim",
    "reference": "source", "raw": "source", "paper": "source",
}

EDGE_TYPE_ALIASES = {
    "references": "cites", "cites_source": "cites",
    "conflicts_with": "contradicts", "disagrees_with": "contradicts",
    "refines": "builds_on", "elaborates": "builds_on",
    "illustrates": "exemplifies", "instance_of": "exemplifies", "example_of": "exemplifies",
    "belongs_to": "categorized_under", "tagged_with": "categorized_under",
    "written_by": "authored_by", "created_by": "authored_by",
    "relates_to": "related", "related_to": "related",
}


# ---------------------------------------------------------------------------
# Normalization
# ---------------------------------------------------------------------------

def normalize_node_type(t: str) -> str:
    t = t.lower().strip()
    return NODE_TYPE_ALIASES.get(t, t)


def normalize_edge_type(t: str) -> str:
    t = t.lower().strip()
    return EDGE_TYPE_ALIASES.get(t, t)


def normalize_entity_name(name: str) -> str:
    """Normalize entity names for deduplication."""
    return re.sub(r'\s+', ' ', name.strip().lower())


# ---------------------------------------------------------------------------
# Merge pipeline
# ---------------------------------------------------------------------------

def merge(root: Path) -> dict:
    intermediate = root / ".understand-anything" / "intermediate"
    manifest_path = intermediate / "scan-manifest.json"

    if not manifest_path.is_file():
        print(f"Error: {manifest_path} not found. Run parse-knowledge-base.py first.",
              file=sys.stderr)
        sys.exit(1)

    # Load scan manifest (deterministic base)
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    nodes = {n["id"]: n for n in manifest["nodes"]}
    edges = list(manifest["edges"])

    report = {"base_nodes": len(nodes), "base_edges": len(edges),
              "batches": 0, "new_entities": 0, "new_claims": 0,
              "new_edges": 0, "deduped_entities": 0, "dropped_edges": 0}

    # Load analysis batches
    batch_files = sorted(intermediate.glob("analysis-batch-*.json"))
    entity_name_map: dict[str, str] = {}  # normalized_name → entity_id
    dedup_remap: dict[str, str] = {}  # duplicate_id → canonical_id

    for bf in batch_files:
        report["batches"] += 1
        try:
            batch = json.loads(bf.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, OSError) as e:
            print(f"[merge] Warning: Failed to load {bf.name}: {e}", file=sys.stderr)
            continue

        # Process new nodes from LLM analysis
        for node in batch.get("nodes", []):
            node_type = normalize_node_type(node.get("type", ""))
            if node_type not in VALID_NODE_TYPES:
                print(f"[merge] Warning: Unknown node type '{node.get('type')}' — skipping",
                      file=sys.stderr)
                continue

            node["type"] = node_type
            node_id = node.get("id", "")

            # Entity deduplication — track remapping for edge fixup
            if node_type == "entity":
                norm_name = normalize_entity_name(node.get("name", ""))
                if norm_name in entity_name_map:
                    # Map duplicate ID → canonical ID for edge remapping
                    dedup_remap[node_id] = entity_name_map[norm_name]
                    report["deduped_entities"] += 1
                    continue
                entity_name_map[norm_name] = node_id
                report["new_entities"] += 1
            elif node_type == "claim":
                report["new_claims"] += 1

            # Ensure required fields
            node.setdefault("summary", node.get("name", ""))
            node.setdefault("tags", [])
            node.setdefault("complexity", "simple")

            nodes[node_id] = node

        # Process new edges from LLM analysis
        for edge in batch.get("edges", []):
            edge_type = normalize_edge_type(edge.get("type", ""))
            if edge_type not in VALID_EDGE_TYPES:
                print(f"[merge] Warning: Unknown edge type '{edge.get('type')}' — "
                      f"mapped to 'related'", file=sys.stderr)
                edge_type = "related"

            edge["type"] = edge_type
            edge.setdefault("direction", "forward")
            edge.setdefault("weight", 0.5)

            # Remap deduped entity IDs, then validate source/target exist
            src = dedup_remap.get(edge.get("source", ""), edge.get("source", ""))
            tgt = dedup_remap.get(edge.get("target", ""), edge.get("target", ""))
            edge["source"] = src
            edge["target"] = tgt
            if src in nodes and tgt in nodes:
                edges.append(edge)
                report["new_edges"] += 1
            else:
                report["dropped_edges"] += 1

    # --- Deduplicate edges ---
    seen: set[tuple[str, str, str]] = set()
    final_edges = []
    for edge in edges:
        key = (edge["source"], edge["target"], edge["type"])
        if key not in seen:
            seen.add(key)
            final_edges.append(edge)

    # --- Build article→layer map from categories ---
    categories = manifest.get("categories", [])
    article_layer_map: dict[str, str] = {}  # article_id → layer_id
    layer_members: dict[str, list[str]] = {}  # layer_id → [node_ids]

    for cat in categories:
        cat_name = cat["name"]
        cat_slug = cat_name.lower().replace(" ", "-")
        layer_id = f"layer:{cat_slug}"
        topic_id = f"topic:{cat_slug}"
        members = [e["source"] for e in final_edges
                   if e["type"] == "categorized_under" and e["target"] == topic_id]
        if topic_id in nodes:
            members.append(topic_id)
        layer_members[layer_id] = members
        for mid in members:
            article_layer_map[mid] = layer_id

    # --- Assign entity/claim nodes to their parent article's layer ---
    # Step 1: Build entity/claim → article mapping from edges
    child_to_article: dict[str, str] = {}
    for edge in final_edges:
        src_type = nodes.get(edge["source"], {}).get("type", "")
        tgt_type = nodes.get(edge["target"], {}).get("type", "")
        # If an article connects to an entity/claim, map the child to the article
        if src_type == "article" and tgt_type in ("entity", "claim"):
            child_to_article.setdefault(edge["target"], edge["source"])
        elif tgt_type == "article" and src_type in ("entity", "claim"):
            child_to_article.setdefault(edge["source"], edge["target"])

    # Step 2: For orphan entities/claims, try to match by ID prefix
    # Build a reverse lookup: bare article name → full article ID
    # e.g., "concept-aaak-compression" → "article:concepts/concept-aaak-compression"
    bare_to_article: dict[str, str] = {}
    for nid in nodes:
        if nid.startswith("article:"):
            # Extract the bare filename from paths like "article:concepts/concept-foo"
            bare = nid.split("/")[-1] if "/" in nid else nid.replace("article:", "")
            bare_to_article[bare] = nid

    for nid, node in nodes.items():
        if node["type"] in ("entity", "claim") and nid not in child_to_article:
            # e.g., "claim:concept-aaak-compression:not-zero-loss" → stem "concept-aaak-compression"
            # e.g., "entity:brain" → stem "brain"
            raw = nid.split(":", 1)[1] if ":" in nid else nid  # "concept-aaak-compression:not-zero-loss"
            stem = raw.split(":")[0]  # "concept-aaak-compression"

            # Try exact bare name match first
            if stem in bare_to_article:
                child_to_article[nid] = bare_to_article[stem]
            else:
                # Try suffix/substring match against bare names
                # e.g., entity:brain → segment-brain, entity:mempalace → tool-mempalace
                matched = False
                for bare, aid in bare_to_article.items():
                    if stem in bare or bare in stem:
                        child_to_article[nid] = aid
                        matched = True
                        break
                    # Also try: bare ends with -stem (e.g., "segment-brain" ends with "-brain")
                    if bare.endswith(f"-{stem}") or bare.endswith(f"/{stem}"):
                        child_to_article[nid] = aid
                        matched = True
                        break
                # Last resort: check if the node's name appears in any article's
                # name OR content (knowledgeMeta.content)
                if not matched and node.get("name"):
                    node_name_lower = node["name"].lower()
                    for aid, anode in nodes.items():
                        if not aid.startswith("article:"):
                            continue
                        # Match against article name
                        if node_name_lower in anode.get("name", "").lower():
                            child_to_article[nid] = aid
                            matched = True
                            break
                        # Match against article content (wikilinks or text)
                        meta = anode.get("knowledgeMeta", {})
                        content = (meta.get("content") or "").lower()
                        if len(node_name_lower) >= 3 and node_name_lower in content:
                            child_to_article[nid] = aid
                            matched = True
                            break

    # Step 3: Place children into their parent article's layer
    for child_id, article_id in child_to_article.items():
        layer_id = article_layer_map.get(article_id)
        if layer_id and layer_id in layer_members:
            layer_members[layer_id].append(child_id)
            article_layer_map[child_id] = layer_id

    # --- Build layers ---
    layers = []
    for cat in categories:
        cat_name = cat["name"]
        cat_slug = cat_name.lower().replace(" ", "-")
        layer_id = f"layer:{cat_slug}"
        members = list(dict.fromkeys(layer_members.get(layer_id, [])))  # Deduplicate preserving order
        layers.append({
            "id": layer_id,
            "name": cat_name,
            "description": f"{cat_name} ({len(members)} nodes)",
            "nodeIds": members,
        })

    # Assign uncategorized nodes to an "Other" layer
    categorized_ids = set()
    for layer in layers:
        categorized_ids.update(layer["nodeIds"])
    uncategorized = [nid for nid in nodes if nid not in categorized_ids]
    if uncategorized:
        layers.append({
            "id": "layer:other",
            "name": "Other",
            "description": f"Uncategorized nodes ({len(uncategorized)})",
            "nodeIds": uncategorized,
        })

    # --- Build tour from index.md category ordering ---
    tour = []
    for i, cat in enumerate(categories):
        cat_slug = cat["name"].lower().replace(" ", "-")
        topic_id = f"topic:{cat_slug}"
        # Pick representative articles (up to 3 per category)
        members = [e["source"] for e in final_edges
                   if e["type"] == "categorized_under" and e["target"] == topic_id][:3]
        if not members and topic_id in nodes:
            members = [topic_id]
        if members:
            tour.append({
                "order": i + 1,
                "title": cat["name"],
                "description": f"Explore the {cat['name']} section ({cat['count']} articles)",
                "nodeIds": members,
            })

    # --- Detect project name ---
    project_name = root.name
    # Try to find a better name from index.md H1
    index_path = root / "wiki" / "index.md"
    if not index_path.is_file():
        index_path = root / "index.md"
    if index_path.is_file():
        text = index_path.read_text(encoding="utf-8", errors="replace")
        h1_match = re.search(r"^#\s+(.+)$", text, re.MULTILINE)
        if h1_match:
            project_name = h1_match.group(1).strip()

    # --- Assemble final graph ---
    graph = {
        "version": "1.0.0",
        "kind": "knowledge",
        "project": {
            "name": project_name,
            "languages": ["markdown"],
            "frameworks": ["karpathy-wiki"],
            "description": f"Knowledge graph for {project_name}",
            "analyzedAt": datetime.now(timezone.utc).isoformat(),
            "gitCommitHash": "",
        },
        "nodes": list(nodes.values()),
        "edges": final_edges,
        "layers": layers,
        "tour": tour,
    }

    # Try to get git commit hash
    try:
        import subprocess
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, cwd=str(root), timeout=5
        )
        if result.returncode == 0:
            graph["project"]["gitCommitHash"] = result.stdout.strip()
    except (OSError, subprocess.TimeoutExpired):
        pass

    # Write output
    out_path = intermediate / "assembled-graph.json"
    out_path.write_text(json.dumps(graph, indent=2), encoding="utf-8")

    # Report
    print(f"[merge] Input: {report['base_nodes']} scan nodes, "
          f"{report['base_edges']} scan edges, {report['batches']} analysis batches",
          file=sys.stderr)
    print(f"[merge] Added: {report['new_entities']} entities, "
          f"{report['new_claims']} claims, {report['new_edges']} edges "
          f"({report['deduped_entities']} deduped entities, "
          f"{report['dropped_edges']} dropped dangling edges)", file=sys.stderr)
    print(f"[merge] Output: {len(graph['nodes'])} nodes, {len(final_edges)} edges, "
          f"{len(layers)} layers, {len(tour)} tour steps", file=sys.stderr)
    print(f"[merge] Written: {out_path}", file=sys.stderr)

    return graph


def main():
    if len(sys.argv) < 2:
        print("Usage: merge-knowledge-graph.py <wiki-directory>", file=sys.stderr)
        sys.exit(1)

    root = Path(sys.argv[1]).resolve()
    if not root.is_dir():
        print(f"Error: {root} is not a directory", file=sys.stderr)
        sys.exit(1)

    merge(root)


if __name__ == "__main__":
    main()
509
.qoder/skills/understand-knowledge/parse-knowledge-base.py
Normal file
@@ -0,0 +1,509 @@
#!/usr/bin/env python3
"""
Deterministic parser for Karpathy-pattern LLM wikis.

Detects the three-layer pattern (raw sources + wiki markdown + schema),
extracts structure from markdown files, resolves wikilinks, and derives
categories from index.md section headings.

Usage:
    python parse-knowledge-base.py <wiki-directory>

Output:
    Writes scan-manifest.json to <wiki-directory>/.understand-anything/intermediate/
"""

import json
import os
import re
import sys
from pathlib import Path

# ---------------------------------------------------------------------------
# Regex patterns
# ---------------------------------------------------------------------------
WIKILINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")
FRONTMATTER_RE = re.compile(r"^---\s*\n(.*?)\n---\s*\n", re.DOTALL)
CODE_BLOCK_RE = re.compile(r"```(\w*)")
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+)$", re.MULTILINE)
INDEX_SECTION_RE = re.compile(r"^##\s+(.+)$", re.MULTILINE)

# Files that are part of wiki infrastructure, not content articles
INFRA_FILES = {"index.md", "log.md", "claude.md", "agents.md", "soul.md"}

# ---------------------------------------------------------------------------
# Detection: is this a Karpathy-pattern wiki?
# ---------------------------------------------------------------------------

def detect_format(root: Path) -> dict:
    """Detect if directory follows the Karpathy LLM wiki three-layer pattern."""
    signals = {
        "has_index": (root / "index.md").is_file() or (root / "wiki" / "index.md").is_file(),
        "has_log": (root / "log.md").is_file() or (root / "wiki" / "log.md").is_file(),
        "has_raw": (root / "raw").is_dir(),
        "has_schema": any(
            (root / f).is_file() or (root / "wiki" / f).is_file()
            for f in ["CLAUDE.md", "AGENTS.md"]
        ),
    }

    # Find the wiki root — could be the directory itself or a wiki/ subdirectory
    if (root / "wiki").is_dir():
        wiki_root = root / "wiki"
    else:
        wiki_root = root

    # Count markdown files in the wiki root
    md_files = list(wiki_root.rglob("*.md"))
    signals["md_count"] = len(md_files)
    signals["wiki_root"] = str(wiki_root)

    # Primary signal: has index.md + meaningful number of markdown files
    if signals["has_index"] and signals["md_count"] >= 3:
        signals["detected"] = True
        signals["format"] = "karpathy"
    else:
        signals["detected"] = False
        signals["format"] = "unknown"

    return signals


# ---------------------------------------------------------------------------
# Markdown extraction helpers
# ---------------------------------------------------------------------------

def extract_frontmatter(text: str) -> dict:
    """Extract YAML frontmatter as a simple key-value dict."""
    m = FRONTMATTER_RE.match(text)
    if not m:
        return {}
    fm = {}
    for line in m.group(1).split("\n"):
        if ":" in line:
            key, _, val = line.partition(":")
            fm[key.strip()] = val.strip().strip('"').strip("'")
    return fm


def extract_wikilinks(text: str) -> list[dict]:
    """Extract all [[target]] and [[target|display]] wikilinks."""
    links = []
    for m in WIKILINK_RE.finditer(text):
        links.append({
            "target": m.group(1).strip(),
            "display": m.group(2).strip() if m.group(2) else None,
        })
    return links


def extract_headings(text: str) -> list[dict]:
    """Extract all markdown headings with level and text."""
    return [
        {"level": len(m.group(1)), "text": m.group(2).strip()}
        for m in HEADING_RE.finditer(text)
    ]


def extract_code_blocks(text: str) -> list[str]:
    """Extract languages from fenced code blocks."""
    return [m.group(1) for m in CODE_BLOCK_RE.finditer(text) if m.group(1)]


def extract_first_paragraph(text: str) -> str:
    """Extract the first non-empty paragraph after frontmatter and H1."""
    # Strip frontmatter
    stripped = FRONTMATTER_RE.sub("", text).strip()
    if not stripped:
        return ""
    lines = stripped.split("\n")

    def _collect_paragraph(start_lines: list[str]) -> str:
        """Collect the first paragraph from the given lines."""
        para: list[str] = []
        for s_raw in start_lines:
            s = s_raw.strip()
            if not s and not para:
                continue  # Skip leading blank lines
            if not s and para:
                break  # End of paragraph
            if s.startswith(">"):
                continue  # Skip blockquotes
            if re.match(r"^[-*_]{3,}\s*$", s):
                continue  # Skip horizontal rules
            if s.startswith("#"):
                if para:
                    break  # End paragraph at next heading
                continue  # Skip headings before paragraph
            para.append(s)
        return " ".join(para)

    # Try: find first paragraph after H1
    for i, line in enumerate(lines):
        if line.strip().startswith("# "):
            result = _collect_paragraph(lines[i + 1:])
            if result:
                if len(result) > 200:
                    return result[:197] + "..."
                return result

    # Fallback: no H1 found, take first paragraph from start
    result = _collect_paragraph(lines)
    if len(result) > 200:
        result = result[:197] + "..."
    return result or ""


def extract_h1(text: str) -> str:
    """Extract the first H1 heading."""
    for m in HEADING_RE.finditer(text):
        if len(m.group(1)) == 1:
            # Note: trailing decorations such as a subtitle are returned as-is, not stripped
            return m.group(2).strip()
    return ""


# ---------------------------------------------------------------------------
# Index.md parsing — categories come from section headings
# ---------------------------------------------------------------------------

def parse_index(index_path: Path) -> list[dict]:
    """Parse index.md to extract categories from ## headings and their wikilinks."""
    if not index_path.is_file():
        return []
    text = index_path.read_text(encoding="utf-8", errors="replace")
    categories = []
    current_category = None

    for line in text.split("\n"):
        # Detect ## section heading
        sec_match = re.match(r"^##\s+(.+)$", line)
        if sec_match:
            current_category = {
                "name": sec_match.group(1).strip(),
                "articles": [],
            }
            categories.append(current_category)
            continue

        # Collect wikilinks under current section
        if current_category:
            for wl in WIKILINK_RE.finditer(line):
                current_category["articles"].append(wl.group(1).strip())

    return categories


# ---------------------------------------------------------------------------
# Log.md parsing — extract operation timeline
# ---------------------------------------------------------------------------

def parse_log(log_path: Path) -> list[dict]:
    """Parse log.md to extract chronological entries."""
    if not log_path.is_file():
        return []
    text = log_path.read_text(encoding="utf-8", errors="replace")
    entries = []
    log_entry_re = re.compile(
        r"^##\s+\[(\d{4}-\d{2}-\d{2})\]\s+(\w+)\s*\|\s*(.+)$", re.MULTILINE
    )
    for m in log_entry_re.finditer(text):
        entries.append({
            "date": m.group(1),
            "operation": m.group(2),
            "title": m.group(3).strip(),
        })
    return entries


# ---------------------------------------------------------------------------
# Main pipeline
# ---------------------------------------------------------------------------

def build_name_to_stem_map(wiki_root: Path) -> dict[str, str]:
    """Build a case-insensitive map from filename stem to relative stem path.

    Full relative paths always map uniquely. Bare basenames map only when
    unambiguous — duplicate basenames are removed so they don't silently
    resolve to the wrong page.
    """
    name_map: dict[str, str] = {}
    # Track which bare basenames appear more than once
    basename_counts: dict[str, int] = {}
    for md_file in wiki_root.rglob("*.md"):
        rel = md_file.relative_to(wiki_root)
        stem = rel.with_suffix("").as_posix()  # e.g., "decisions/decision-foo"
        basename = md_file.stem  # e.g., "decision-foo"
        # Full relative path always maps uniquely
        name_map[stem.lower()] = stem
        # Track basename for ambiguity detection
        key = basename.lower()
        basename_counts[key] = basename_counts.get(key, 0) + 1
        name_map[key] = stem

    # Remove ambiguous basename entries (appear more than once)
    for key, count in basename_counts.items():
        if count > 1 and key in name_map:
            del name_map[key]

    return name_map


def resolve_wikilink(target: str, name_map: dict[str, str], node_ids: set[str] | None = None) -> str | None:
    """Resolve a wikilink target to an article node ID.

    If node_ids is provided, only resolve to IDs that exist in the set.
    """
    key = target.lower().strip()
    # Skip targets that are clearly not page names (shell flags, etc.)
    if key.startswith("-"):
        return None
    stem = name_map.get(key)
    if stem:
        candidate = f"article:{stem}"
        # If we have a node set, verify the target exists
        if node_ids is not None and candidate not in node_ids:
            return None
        return candidate
    # Try without directory prefix
    for stored_key, stored_stem in name_map.items():
        if stored_key.endswith("/" + key) or stored_key == key:
            candidate = f"article:{stored_stem}"
            if node_ids is not None and candidate not in node_ids:
                return None
            return candidate
    return None


|
||||
def parse_wiki(root: Path) -> dict:
    """Parse a Karpathy-pattern wiki and produce the scan manifest."""
    detection = detect_format(root)
    if not detection["detected"]:
        print(json.dumps({"error": "Not a Karpathy-pattern wiki", "detection": detection}),
              file=sys.stderr)
        sys.exit(1)

    wiki_root = Path(detection["wiki_root"])
    raw_root = root / "raw"

    # Build name resolution map
    name_map = build_name_to_stem_map(wiki_root)

    # Find index.md and log.md
    index_path = wiki_root / "index.md"
    if not index_path.is_file():
        index_path = root / "index.md"
    log_path = wiki_root / "log.md"
    if not log_path.is_file():
        log_path = root / "log.md"

    # Parse index for categories
    categories = parse_index(index_path)
    log_entries = parse_log(log_path)

    # Build category lookup: wikilink target → category name
    category_lookup: dict[str, str] = {}
    for cat in categories:
        for article_target in cat["articles"]:
            category_lookup[article_target.lower()] = cat["name"]

    # --- Pre-compute article IDs (for edge resolution validation) ---
    # Only skip infra files at the wiki root level, not in subdirectories
    # (e.g., wiki/index.md is infra, but wiki/concepts/index.md is content)
    article_ids: set[str] = set()
    for md_file in sorted(wiki_root.rglob("*.md")):
        rel = md_file.relative_to(wiki_root)
        stem = rel.with_suffix("").as_posix()
        # Only filter infra files at root level (no parent directory)
        if rel.parent == Path(".") and rel.name.lower() in INFRA_FILES:
            continue
        article_ids.add(f"article:{stem}")

    # --- Build article nodes ---
    nodes = []
    edges = []
    warnings = []
    stats = {"articles": 0, "sources": 0, "topics": 0, "wikilinks": 0, "unresolved": 0}

    for md_file in sorted(wiki_root.rglob("*.md")):
        rel = md_file.relative_to(wiki_root)
        stem = rel.with_suffix("").as_posix()
        basename = md_file.stem

        # Skip infrastructure files only at wiki root level
        if rel.parent == Path(".") and rel.name.lower() in INFRA_FILES:
            continue

        text = md_file.read_text(encoding="utf-8", errors="replace")
        h1 = extract_h1(text)
        frontmatter = extract_frontmatter(text)
        wikilinks = extract_wikilinks(text)
        headings = extract_headings(text)
        code_langs = extract_code_blocks(text)
        summary = extract_first_paragraph(text)
        line_count = text.count("\n") + 1
        word_count = len(text.split())

        # Derive category from index.md lookup
        category = category_lookup.get(basename.lower(), "")
        if not category:
            # Try stem match
            category = category_lookup.get(stem.lower(), "")

        # Derive tags (deduplicated)
        tag_set: set[str] = set()
        if category:
            tag_set.add(category.lower())
        if rel.parent != Path("."):
            tag_set.add(str(rel.parent))
        fm_tags = frontmatter.get("tags", "")
        if fm_tags:
            tag_set.update(t.strip() for t in fm_tags.split(",") if t.strip())
        tags = sorted(tag_set)

        # Complexity from wikilink density
        wl_count = len(wikilinks)
        if wl_count > 15:
            complexity = "complex"
        elif wl_count > 5:
            complexity = "moderate"
        else:
            complexity = "simple"

        node_id = f"article:{stem}"
        nodes.append({
            "id": node_id,
            "type": "article",
            "name": h1 or basename,
            "filePath": str(rel),
            "summary": summary or f"Wiki article: {h1 or basename}",
            "tags": tags,
            "complexity": complexity,
            "knowledgeMeta": {
                "wikilinks": [wl["target"] for wl in wikilinks],
                **({"category": category} if category else {}),
                "content": text[:3000],  # First 3000 chars for LLM analysis
            },
        })
        stats["articles"] += 1
        stats["wikilinks"] += wl_count

        # Build edges from wikilinks (resolve against known article IDs)
        for wl in wikilinks:
            target_id = resolve_wikilink(wl["target"], name_map, article_ids)
            if target_id and target_id != node_id:
                edges.append({
                    "source": node_id,
                    "target": target_id,
                    "type": "related",
                    "direction": "forward",
                    "weight": 0.7,
                })
            elif not target_id:
                warnings.append(f"Unresolved wikilink: [[{wl['target']}]] in {rel}")
                stats["unresolved"] += 1

    # --- Build topic nodes from index.md categories ---
    for cat in categories:
        topic_id = f"topic:{cat['name'].lower().replace(' ', '-')}"
        nodes.append({
            "id": topic_id,
            "type": "topic",
            "name": cat["name"],
            "summary": f"Category from index: {cat['name']} ({len(cat['articles'])} articles)",
            "tags": ["category"],
            "complexity": "simple",
        })
        stats["topics"] += 1

        # categorized_under edges (only resolve to known article nodes)
        for article_target in cat["articles"]:
            article_id = resolve_wikilink(article_target, name_map, article_ids)
            if article_id:
                edges.append({
                    "source": article_id,
                    "target": topic_id,
                    "type": "categorized_under",
                    "direction": "forward",
                    "weight": 0.6,
                })

    # --- Build source nodes from raw/ ---
    if raw_root.is_dir():
        for raw_file in sorted(raw_root.rglob("*")):
            if raw_file.is_file() and not raw_file.name.startswith("."):
                rel_raw = raw_file.relative_to(root)
                ext = raw_file.suffix.lower()
                size_kb = raw_file.stat().st_size / 1024
                # Use POSIX separators so node IDs are identical across platforms
                source_stem = raw_file.relative_to(raw_root).with_suffix("").as_posix()
                source_id = f"source:{source_stem}"
                nodes.append({
                    "id": source_id,
                    "type": "source",
                    "name": raw_file.name,
                    "filePath": str(rel_raw),
                    "summary": f"Raw source ({ext or 'unknown'}, {size_kb:.0f} KB)",
                    "tags": ["raw", ext.lstrip(".") or "unknown"],
                    "complexity": "simple",
                })
                stats["sources"] += 1

    # --- Compute backlinks ---
    # Edges are not deduplicated yet, so skip sources already recorded
    backlink_map: dict[str, list[str]] = {}
    for edge in edges:
        if edge["type"] == "related":
            target = edge["target"]
            source = edge["source"]
            sources = backlink_map.setdefault(target, [])
            if source not in sources:
                sources.append(source)
    for node in nodes:
        if node["type"] == "article" and "knowledgeMeta" in node:
            bl = backlink_map.get(node["id"], [])
            node["knowledgeMeta"]["backlinks"] = bl

    # --- Deduplicate edges ---
    seen_edges: set[tuple[str, str, str]] = set()
    deduped_edges = []
    for edge in edges:
        key = (edge["source"], edge["target"], edge["type"])
        if key not in seen_edges:
            seen_edges.add(key)
            deduped_edges.append(edge)

    return {
        "format": "karpathy",
        "stats": stats,
        "categories": [{"name": c["name"], "count": len(c["articles"])} for c in categories],
        "logEntries": len(log_entries),
        "nodes": nodes,
        "edges": deduped_edges,
        "warnings": warnings[:50],  # Cap warnings
    }


def main():
    if len(sys.argv) < 2:
        print("Usage: parse-knowledge-base.py <wiki-directory>", file=sys.stderr)
        sys.exit(1)

    root = Path(sys.argv[1]).resolve()
    if not root.is_dir():
        print(f"Error: {root} is not a directory", file=sys.stderr)
        sys.exit(1)

    manifest = parse_wiki(root)

    # Write output
    out_dir = root / ".understand-anything" / "intermediate"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "scan-manifest.json"
    out_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")

    # Report to stderr
    s = manifest["stats"]
    print(f"[parse] Karpathy wiki: {s['articles']} articles, {s['sources']} sources, "
          f"{s['topics']} topics, {s['wikilinks']} wikilinks "
          f"({s['unresolved']} unresolved)", file=sys.stderr)
    print(f"[parse] Output: {out_path}", file=sys.stderr)


if __name__ == "__main__":
    main()