fix(security): add VITE_PAYMENT_URL environment variable configuration

2026-06-18 21:29:41 +08:00
parent 3d977d0a2d
commit 8afeb2e4d9
160 changed files with 21893 additions and 0 deletions

@@ -0,0 +1,132 @@
---
name: understand-knowledge
description: Analyze a Karpathy-pattern LLM wiki knowledge base and generate an interactive knowledge graph with entity extraction, implicit relationships, and topic clustering.
argument-hint: [wiki-directory]
---
# /understand-knowledge
Analyzes a Karpathy-pattern LLM wiki — a three-layer knowledge base with raw sources, wiki markdown, and a schema file — and produces an interactive knowledge graph dashboard.
## What It Detects
The **Karpathy LLM wiki pattern** (see https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f):
- **Raw sources** — immutable source documents (articles, papers, data files)
- **Wiki** — LLM-generated markdown files with wikilinks (`[[target]]` syntax)
- **Schema** — CLAUDE.md, AGENTS.md, or similar configuration file
- **index.md** — content catalog organized by categories
- **log.md** — chronological operation log
Detection signals: has `index.md` + multiple `.md` files with wikilinks. May have `raw/` directory and schema file.
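
The detection signals above can be approximated with a few lines of Python. This is a minimal sketch, not the bundled `parse-knowledge-base.py`: the function name `looks_like_karpathy_wiki` and the `min_linked_files` threshold are illustrative assumptions, and the bundled script may apply different criteria.

```python
import re
from pathlib import Path

# Matches [[target]] and [[target|display]] style wikilinks.
WIKILINK = re.compile(r"\[\[[^\]]+\]\]")


def looks_like_karpathy_wiki(root: Path, min_linked_files: int = 2) -> bool:
    """Return True if root has an index.md plus several .md files with wikilinks.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    # The wiki may live in the directory itself or in a wiki/ subdirectory.
    wiki_root = root / "wiki" if (root / "wiki").is_dir() else root
    if not (wiki_root / "index.md").is_file():
        return False
    linked = 0
    for md in wiki_root.rglob("*.md"):
        if md.name.lower() == "index.md":
            continue  # the index itself is not a content article
        text = md.read_text(encoding="utf-8", errors="replace")
        if WIKILINK.search(text):
            linked += 1
    return linked >= min_linked_files
```

The optional `raw/` directory and schema file are deliberately not required here, since the text above treats them as supporting signals only.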
## Instructions
### Phase 1: DETECT
1. Determine the target directory:
   - If the user provided a path argument, use that
   - Otherwise, use the current working directory
2. Run the format detection script bundled with this skill:
   ```
   python3 <SKILL_DIR>/parse-knowledge-base.py <TARGET_DIR>
   ```
   - If the script exits with an error, tell the user this doesn't appear to be a Karpathy-pattern wiki and explain what was expected
   - If successful, proceed. The script writes `scan-manifest.json` to `<TARGET_DIR>/.understand-anything/intermediate/`
3. Read the scan-manifest.json and announce the results:
   - "Detected Karpathy wiki: N articles, N sources, N topics, N wikilinks (N unresolved)"
   - List the categories found from index.md
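
The announcement in step 3 can be derived directly from the manifest. The sketch below assumes the manifest carries a `stats` object with `articles`, `sources`, `topics`, `wikilinks`, and `unresolved` counts and a `categories` list of `{"name": ...}` entries, as the bundled parse script writes them; the helper names `summarize_manifest` and `category_names` are hypothetical.

```python
import json
from pathlib import Path


def summarize_manifest(manifest: dict) -> str:
    """Format the Phase 1 announcement line from the manifest's stats block."""
    s = manifest["stats"]
    return (
        f"Detected Karpathy wiki: {s['articles']} articles, {s['sources']} sources, "
        f"{s['topics']} topics, {s['wikilinks']} wikilinks ({s['unresolved']} unresolved)"
    )


def category_names(manifest: dict) -> list[str]:
    """List category names in the order they appear in index.md."""
    return [c["name"] for c in manifest.get("categories", [])]


def load_manifest(target_dir: Path) -> dict:
    """Read scan-manifest.json from the intermediate directory."""
    path = target_dir / ".understand-anything" / "intermediate" / "scan-manifest.json"
    return json.loads(path.read_text(encoding="utf-8"))
```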
### Phase 2: SCAN (already done)
The parse script in Phase 1 already performed the deterministic scan. The scan-manifest.json contains:
- Article nodes (one per wiki .md file) with extracted wikilinks, headings, frontmatter
- Source nodes (one per raw/ file)
- Topic nodes (from index.md section headings)
- `related` edges (from wikilinks)
- `categorized_under` edges (from index.md sections)
No additional scanning is needed. Proceed to Phase 3.
### Phase 3: ANALYZE
Dispatch `article-analyzer` subagents to extract implicit knowledge:
1. Read the scan-manifest.json to get the article list
2. Prepare batches of 10-15 articles each, grouped by category when possible (articles in the same category are more likely to have implicit cross-references)
3. For each batch, dispatch an `article-analyzer` subagent with:
   - The batch of articles (id, name, summary, wikilinks, category, content from knowledgeMeta)
   - The full list of existing node IDs (so the agent can reference them)
   - The batch number for output file naming
   - The intermediate directory path: `$INTERMEDIATE_DIR = <TARGET_DIR>/.understand-anything/intermediate`

   The agent will write `analysis-batch-{N}.json` to the intermediate directory.
4. Run up to 3 batches concurrently. Wait for all batches to complete.
5. If any batch fails, log a warning but continue — the scan-manifest provides a solid base graph even without LLM analysis.
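
One way to prepare the batches in step 2 is to sort articles by category and slice the ordered list. This is only a sketch under stated assumptions: `make_batches` is a hypothetical helper, the category is read from `knowledgeMeta.category` as written by the parse script, and the final batch may fall below 10 articles when the total does not divide evenly.

```python
from collections import defaultdict


def make_batches(articles: list[dict], max_size: int = 15) -> list[list[dict]]:
    """Group articles by category, then cut them into batches of at most max_size."""
    by_category: dict[str, list[dict]] = defaultdict(list)
    for article in articles:
        category = article.get("knowledgeMeta", {}).get("category", "")
        by_category[category].append(article)
    # Concatenate the category groups so same-category articles stay adjacent,
    # then slice the ordered list into fixed-size chunks.
    ordered = [a for cat in sorted(by_category) for a in by_category[cat]]
    return [ordered[i:i + max_size] for i in range(0, len(ordered), max_size)]
```

Keeping same-category articles adjacent means most batches contain a single category, which is the property step 2 asks for.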
### Phase 4: MERGE
1. Run the merge script bundled with this skill:
   ```
   python3 <SKILL_DIR>/merge-knowledge-graph.py <TARGET_DIR>
   ```
2. The script:
   - Combines scan-manifest.json + all analysis-batch-*.json files
   - Deduplicates entities (case-insensitive name matching)
   - Normalizes node/edge types via alias maps
   - Builds layers from index.md categories
   - Builds a tour from index.md section ordering
   - Writes `assembled-graph.json` to the intermediate directory
3. Read the merge report from stderr and announce:
   - Total nodes, edges, layers, tour steps
   - How many entities/claims the LLM analysis added
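
To illustrate the case-insensitive entity deduplication mentioned in step 2, here is a small standalone sketch. It mirrors the idea of keeping the first entity seen for a normalized name and remembering a duplicate-to-canonical ID remap for later edge fixup; the function names `normalize` and `dedupe_entities` are hypothetical and this is not the merge script itself.

```python
import re


def normalize(name: str) -> str:
    """Lowercase the name and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", name.strip().lower())


def dedupe_entities(entities: list[dict]) -> tuple[list[dict], dict[str, str]]:
    """Keep the first entity per normalized name; map duplicate IDs to the kept ID."""
    canonical: dict[str, str] = {}  # normalized name -> canonical entity ID
    remap: dict[str, str] = {}      # duplicate ID -> canonical ID
    kept: list[dict] = []
    for entity in entities:
        key = normalize(entity["name"])
        if key in canonical:
            remap[entity["id"]] = canonical[key]
        else:
            canonical[key] = entity["id"]
            kept.append(entity)
    return kept, remap
```

Edges that point at a duplicate ID can then be rewritten through `remap` before dangling edges are dropped.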
### Phase 5: SAVE
1. Read the assembled-graph.json
2. Run basic validation:
   - Every edge source/target must reference an existing node
   - Every node must have: id, type, name, summary, tags, complexity
   - Remove any edges with dangling references
3. Copy the validated graph to `<TARGET_DIR>/.understand-anything/knowledge-graph.json`
4. Write metadata to `<TARGET_DIR>/.understand-anything/meta.json`:
   ```json
   {
     "lastAnalyzedAt": "<ISO timestamp>",
     "gitCommitHash": "<from git rev-parse HEAD or empty>",
     "version": "1.0.0",
     "analyzedFiles": <number of wiki articles>
   }
   ```
5. Clean up intermediate files:
   ```
   rm -rf <TARGET_DIR>/.understand-anything/intermediate
   ```
6. Report summary to the user:
   - "Knowledge graph saved: N articles, N entities, N topics, N claims, N sources"
   - "N edges (N wikilink, N categorized, N implicit)"
   - "N layers, N tour steps"
7. Auto-trigger the dashboard:
   ```
   /understand-dashboard <TARGET_DIR>
   ```
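
The basic validation in step 2 of this phase can be sketched as follows. It is an illustrative helper, not part of the bundled scripts: `validate_graph` and `REQUIRED_FIELDS` are assumed names, and the graph is assumed to be a dict with `nodes` and `edges` lists as produced by the merge step.

```python
REQUIRED_FIELDS = ("id", "type", "name", "summary", "tags", "complexity")


def validate_graph(graph: dict) -> tuple[dict, list[str]]:
    """Return a copy of the graph without dangling edges, plus a list of problems."""
    problems: list[str] = []
    node_ids: set[str] = set()
    for node in graph.get("nodes", []):
        missing = [field for field in REQUIRED_FIELDS if field not in node]
        if missing:
            problems.append(f"node {node.get('id', '<no id>')} is missing {missing}")
        if "id" in node:
            node_ids.add(node["id"])
    edges = graph.get("edges", [])
    # Keep only edges whose source and target both reference an existing node.
    kept = [e for e in edges
            if e.get("source") in node_ids and e.get("target") in node_ids]
    if len(kept) != len(edges):
        problems.append(f"removed {len(edges) - len(kept)} dangling edges")
    return {**graph, "edges": kept}, problems
```

Returning the problems as a list rather than raising lets the caller report them and still save the cleaned graph.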
## Notes
- The parse script handles ALL deterministic extraction (wikilinks, headings, frontmatter, categories from index.md). The LLM agents only add implicit knowledge that requires inference.
- Categories and taxonomy come from index.md section headings, NOT from filename prefixes. The Karpathy spec is intentionally abstract about naming conventions.
- The graph uses `kind: "knowledge"` to signal the dashboard to use force-directed layout instead of hierarchical dagre.
- Source nodes from raw/ are lightweight (filename + size only) — we don't parse PDFs or binary files.

@@ -0,0 +1,397 @@
#!/usr/bin/env python3
"""
Merge script for Karpathy-pattern knowledge graphs.
Combines the deterministic scan-manifest.json with LLM analysis batches
(analysis-batch-*.json) into a final assembled knowledge graph.
Handles: entity deduplication, edge normalization, layer building from
index.md categories, tour generation from index.md section ordering.
Usage:
    python merge-knowledge-graph.py <wiki-directory>
Output:
    Writes assembled-graph.json to <wiki-directory>/.understand-anything/intermediate/
"""
import json
import os
import re
import sys
from datetime import datetime, timezone
from pathlib import Path
# ---------------------------------------------------------------------------
# Canonical type sets (must match core/src/types.ts)
# ---------------------------------------------------------------------------
VALID_NODE_TYPES = {
    "article", "entity", "topic", "claim", "source",
    # Codebase types (for cross-compatibility)
    "file", "function", "class", "module", "concept",
    "config", "document", "service", "table", "endpoint",
    "pipeline", "schema", "resource", "domain", "flow", "step",
}
VALID_EDGE_TYPES = {
    "cites", "contradicts", "builds_on", "exemplifies",
    "categorized_under", "authored_by", "related", "similar_to",
    # Codebase types
    "imports", "exports", "contains", "inherits", "implements",
    "calls", "subscribes", "publishes", "middleware",
    "reads_from", "writes_to", "transforms", "validates",
    "depends_on", "tested_by", "configures",
    "deploys", "serves", "provisions", "triggers",
    "migrates", "documents", "routes", "defines_schema",
    "contains_flow", "flow_step", "cross_domain",
}
NODE_TYPE_ALIASES = {
    "note": "article", "page": "article", "wiki_page": "article",
    "person": "entity", "actor": "entity", "organization": "entity",
    "tag": "topic", "category": "topic", "theme": "topic",
    "assertion": "claim", "decision": "claim", "thesis": "claim",
    "reference": "source", "raw": "source", "paper": "source",
}
EDGE_TYPE_ALIASES = {
    "references": "cites", "cites_source": "cites",
    "conflicts_with": "contradicts", "disagrees_with": "contradicts",
    "refines": "builds_on", "elaborates": "builds_on",
    "illustrates": "exemplifies", "instance_of": "exemplifies", "example_of": "exemplifies",
    "belongs_to": "categorized_under", "tagged_with": "categorized_under",
    "written_by": "authored_by", "created_by": "authored_by",
    "relates_to": "related", "related_to": "related",
}
# ---------------------------------------------------------------------------
# Normalization
# ---------------------------------------------------------------------------
def normalize_node_type(t: str) -> str:
    t = t.lower().strip()
    return NODE_TYPE_ALIASES.get(t, t)


def normalize_edge_type(t: str) -> str:
    t = t.lower().strip()
    return EDGE_TYPE_ALIASES.get(t, t)


def normalize_entity_name(name: str) -> str:
    """Normalize entity names for deduplication."""
    return re.sub(r'\s+', ' ', name.strip().lower())
# ---------------------------------------------------------------------------
# Merge pipeline
# ---------------------------------------------------------------------------
def merge(root: Path) -> dict:
    intermediate = root / ".understand-anything" / "intermediate"
    manifest_path = intermediate / "scan-manifest.json"
    if not manifest_path.is_file():
        print(f"Error: {manifest_path} not found. Run parse-knowledge-base.py first.",
              file=sys.stderr)
        sys.exit(1)
    # Load scan manifest (deterministic base)
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    nodes = {n["id"]: n for n in manifest["nodes"]}
    edges = list(manifest["edges"])
    report = {"base_nodes": len(nodes), "base_edges": len(edges),
              "batches": 0, "new_entities": 0, "new_claims": 0,
              "new_edges": 0, "deduped_entities": 0, "dropped_edges": 0}
    # Load analysis batches
    batch_files = sorted(intermediate.glob("analysis-batch-*.json"))
    entity_name_map: dict[str, str] = {}  # normalized_name → entity_id
    dedup_remap: dict[str, str] = {}  # duplicate_id → canonical_id
    for bf in batch_files:
        report["batches"] += 1
        try:
            batch = json.loads(bf.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, OSError) as e:
            print(f"[merge] Warning: Failed to load {bf.name}: {e}", file=sys.stderr)
            continue
        # Process new nodes from LLM analysis
        for node in batch.get("nodes", []):
            node_type = normalize_node_type(node.get("type", ""))
            if node_type not in VALID_NODE_TYPES:
                print(f"[merge] Warning: Unknown node type '{node.get('type')}' — skipping",
                      file=sys.stderr)
                continue
            node["type"] = node_type
            node_id = node.get("id", "")
            # Entity deduplication — track remapping for edge fixup
            if node_type == "entity":
                norm_name = normalize_entity_name(node.get("name", ""))
                if norm_name in entity_name_map:
                    # Map duplicate ID → canonical ID for edge remapping
                    dedup_remap[node_id] = entity_name_map[norm_name]
                    report["deduped_entities"] += 1
                    continue
                entity_name_map[norm_name] = node_id
                report["new_entities"] += 1
            elif node_type == "claim":
                report["new_claims"] += 1
            # Ensure required fields
            node.setdefault("summary", node.get("name", ""))
            node.setdefault("tags", [])
            node.setdefault("complexity", "simple")
            nodes[node_id] = node
        # Process new edges from LLM analysis
        for edge in batch.get("edges", []):
            edge_type = normalize_edge_type(edge.get("type", ""))
            if edge_type not in VALID_EDGE_TYPES:
                # Trailing space keeps the two message fragments from running together
                print(f"[merge] Warning: Unknown edge type '{edge.get('type')}' "
                      f"mapped to 'related'", file=sys.stderr)
                edge_type = "related"
            edge["type"] = edge_type
            edge.setdefault("direction", "forward")
            edge.setdefault("weight", 0.5)
            # Remap deduped entity IDs, then validate source/target exist
            src = dedup_remap.get(edge.get("source", ""), edge.get("source", ""))
            tgt = dedup_remap.get(edge.get("target", ""), edge.get("target", ""))
            edge["source"] = src
            edge["target"] = tgt
            if src in nodes and tgt in nodes:
                edges.append(edge)
                report["new_edges"] += 1
            else:
                report["dropped_edges"] += 1
    # --- Deduplicate edges ---
    seen: set[tuple[str, str, str]] = set()
    final_edges = []
    for edge in edges:
        key = (edge["source"], edge["target"], edge["type"])
        if key not in seen:
            seen.add(key)
            final_edges.append(edge)
    # --- Build article→layer map from categories ---
    categories = manifest.get("categories", [])
    article_layer_map: dict[str, str] = {}  # article_id → layer_id
    layer_members: dict[str, list[str]] = {}  # layer_id → [node_ids]
    for cat in categories:
        cat_name = cat["name"]
        cat_slug = cat_name.lower().replace(" ", "-")
        layer_id = f"layer:{cat_slug}"
        topic_id = f"topic:{cat_slug}"
        members = [e["source"] for e in final_edges
                   if e["type"] == "categorized_under" and e["target"] == topic_id]
        if topic_id in nodes:
            members.append(topic_id)
        layer_members[layer_id] = members
        for mid in members:
            article_layer_map[mid] = layer_id
    # --- Assign entity/claim nodes to their parent article's layer ---
    # Step 1: Build entity/claim → article mapping from edges
    child_to_article: dict[str, str] = {}
    for edge in final_edges:
        src_type = nodes.get(edge["source"], {}).get("type", "")
        tgt_type = nodes.get(edge["target"], {}).get("type", "")
        # If an article connects to an entity/claim, map the child to the article
        if src_type == "article" and tgt_type in ("entity", "claim"):
            child_to_article.setdefault(edge["target"], edge["source"])
        elif tgt_type == "article" and src_type in ("entity", "claim"):
            child_to_article.setdefault(edge["source"], edge["target"])
    # Step 2: For orphan entities/claims, try to match by ID prefix
    # Build a reverse lookup: bare article name → full article ID
    # e.g., "concept-aaak-compression" → "article:concepts/concept-aaak-compression"
    bare_to_article: dict[str, str] = {}
    for nid in nodes:
        if nid.startswith("article:"):
            # Extract the bare filename from paths like "article:concepts/concept-foo"
            bare = nid.split("/")[-1] if "/" in nid else nid.replace("article:", "")
            bare_to_article[bare] = nid
    for nid, node in nodes.items():
        if node["type"] in ("entity", "claim") and nid not in child_to_article:
            # e.g., "claim:concept-aaak-compression:not-zero-loss" → stem "concept-aaak-compression"
            # e.g., "entity:brain" → stem "brain"
            raw = nid.split(":", 1)[1] if ":" in nid else nid  # "concept-aaak-compression:not-zero-loss"
            stem = raw.split(":")[0]  # "concept-aaak-compression"
            # Try exact bare name match first
            if stem in bare_to_article:
                child_to_article[nid] = bare_to_article[stem]
            else:
                # Try suffix/substring match against bare names
                # e.g., entity:brain → segment-brain, entity:mempalace → tool-mempalace
                matched = False
                for bare, aid in bare_to_article.items():
                    if stem in bare or bare in stem:
                        child_to_article[nid] = aid
                        matched = True
                        break
                    # Also try: bare ends with -stem (e.g., "segment-brain" ends with "-brain")
                    if bare.endswith(f"-{stem}") or bare.endswith(f"/{stem}"):
                        child_to_article[nid] = aid
                        matched = True
                        break
                # Last resort: check if the node's name appears in any article's
                # name OR content (knowledgeMeta.content)
                if not matched and node.get("name"):
                    node_name_lower = node["name"].lower()
                    for aid, anode in nodes.items():
                        if not aid.startswith("article:"):
                            continue
                        # Match against article name
                        if node_name_lower in anode.get("name", "").lower():
                            child_to_article[nid] = aid
                            matched = True
                            break
                        # Match against article content (wikilinks or text)
                        meta = anode.get("knowledgeMeta", {})
                        content = (meta.get("content") or "").lower()
                        if len(node_name_lower) >= 3 and node_name_lower in content:
                            child_to_article[nid] = aid
                            matched = True
                            break
    # Step 3: Place children into their parent article's layer
    for child_id, article_id in child_to_article.items():
        layer_id = article_layer_map.get(article_id)
        if layer_id and layer_id in layer_members:
            layer_members[layer_id].append(child_id)
            article_layer_map[child_id] = layer_id
    # --- Build layers ---
    layers = []
    for cat in categories:
        cat_name = cat["name"]
        cat_slug = cat_name.lower().replace(" ", "-")
        layer_id = f"layer:{cat_slug}"
        members = list(dict.fromkeys(layer_members.get(layer_id, [])))  # Deduplicate preserving order
        layers.append({
            "id": layer_id,
            "name": cat_name,
            "description": f"{cat_name} ({len(members)} nodes)",
            "nodeIds": members,
        })
    # Assign uncategorized nodes to an "Other" layer
    categorized_ids = set()
    for layer in layers:
        categorized_ids.update(layer["nodeIds"])
    uncategorized = [nid for nid in nodes if nid not in categorized_ids]
    if uncategorized:
        layers.append({
            "id": "layer:other",
            "name": "Other",
            "description": f"Uncategorized nodes ({len(uncategorized)})",
            "nodeIds": uncategorized,
        })
    # --- Build tour from index.md category ordering ---
    tour = []
    for i, cat in enumerate(categories):
        cat_slug = cat["name"].lower().replace(" ", "-")
        topic_id = f"topic:{cat_slug}"
        # Pick representative articles (up to 3 per category)
        members = [e["source"] for e in final_edges
                   if e["type"] == "categorized_under" and e["target"] == topic_id][:3]
        if not members and topic_id in nodes:
            members = [topic_id]
        if members:
            tour.append({
                "order": i + 1,
                "title": cat["name"],
                "description": f"Explore the {cat['name']} section ({cat['count']} articles)",
                "nodeIds": members,
            })
    # --- Detect project name ---
    project_name = root.name
    # Try to find a better name from index.md H1
    index_path = root / "wiki" / "index.md"
    if not index_path.is_file():
        index_path = root / "index.md"
    if index_path.is_file():
        text = index_path.read_text(encoding="utf-8", errors="replace")
        h1_match = re.search(r"^#\s+(.+)$", text, re.MULTILINE)
        if h1_match:
            project_name = h1_match.group(1).strip()
    # --- Assemble final graph ---
    graph = {
        "version": "1.0.0",
        "kind": "knowledge",
        "project": {
            "name": project_name,
            "languages": ["markdown"],
            "frameworks": ["karpathy-wiki"],
            "description": f"Knowledge graph for {project_name}",
            "analyzedAt": datetime.now(timezone.utc).isoformat(),
            "gitCommitHash": "",
        },
        "nodes": list(nodes.values()),
        "edges": final_edges,
        "layers": layers,
        "tour": tour,
    }
    # Try to get git commit hash
    try:
        import subprocess
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, cwd=str(root), timeout=5
        )
        if result.returncode == 0:
            graph["project"]["gitCommitHash"] = result.stdout.strip()
    except (OSError, subprocess.TimeoutExpired):
        pass
    # Write output
    out_path = intermediate / "assembled-graph.json"
    out_path.write_text(json.dumps(graph, indent=2), encoding="utf-8")
    # Report
    print(f"[merge] Input: {report['base_nodes']} scan nodes, "
          f"{report['base_edges']} scan edges, {report['batches']} analysis batches",
          file=sys.stderr)
    print(f"[merge] Added: {report['new_entities']} entities, "
          f"{report['new_claims']} claims, {report['new_edges']} edges "
          f"({report['deduped_entities']} deduped entities, "
          f"{report['dropped_edges']} dropped dangling edges)", file=sys.stderr)
    print(f"[merge] Output: {len(graph['nodes'])} nodes, {len(final_edges)} edges, "
          f"{len(layers)} layers, {len(tour)} tour steps", file=sys.stderr)
    print(f"[merge] Written: {out_path}", file=sys.stderr)
    return graph
def main():
    if len(sys.argv) < 2:
        print("Usage: merge-knowledge-graph.py <wiki-directory>", file=sys.stderr)
        sys.exit(1)
    root = Path(sys.argv[1]).resolve()
    if not root.is_dir():
        print(f"Error: {root} is not a directory", file=sys.stderr)
        sys.exit(1)
    merge(root)


if __name__ == "__main__":
    main()

@@ -0,0 +1,509 @@
#!/usr/bin/env python3
"""
Deterministic parser for Karpathy-pattern LLM wikis.
Detects the three-layer pattern (raw sources + wiki markdown + schema),
extracts structure from markdown files, resolves wikilinks, and derives
categories from index.md section headings.
Usage:
    python parse-knowledge-base.py <wiki-directory>
Output:
    Writes scan-manifest.json to <wiki-directory>/.understand-anything/intermediate/
"""
import json
import os
import re
import sys
from pathlib import Path
# ---------------------------------------------------------------------------
# Regex patterns
# ---------------------------------------------------------------------------
WIKILINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")
FRONTMATTER_RE = re.compile(r"^---\s*\n(.*?)\n---\s*\n", re.DOTALL)
CODE_BLOCK_RE = re.compile(r"```(\w*)")
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+)$", re.MULTILINE)
INDEX_SECTION_RE = re.compile(r"^##\s+(.+)$", re.MULTILINE)
# Files that are part of wiki infrastructure, not content articles
INFRA_FILES = {"index.md", "log.md", "claude.md", "agents.md", "soul.md"}
# ---------------------------------------------------------------------------
# Detection: is this a Karpathy-pattern wiki?
# ---------------------------------------------------------------------------
def detect_format(root: Path) -> dict:
    """Detect if directory follows the Karpathy LLM wiki three-layer pattern."""
    signals = {
        "has_index": (root / "index.md").is_file() or (root / "wiki" / "index.md").is_file(),
        "has_log": (root / "log.md").is_file() or (root / "wiki" / "log.md").is_file(),
        "has_raw": (root / "raw").is_dir(),
        "has_schema": any(
            (root / f).is_file() or (root / "wiki" / f).is_file()
            for f in ["CLAUDE.md", "AGENTS.md"]
        ),
    }
    # Find the wiki root — could be the directory itself or a wiki/ subdirectory
    if (root / "wiki").is_dir():
        wiki_root = root / "wiki"
    else:
        wiki_root = root
    # Count markdown files in the wiki root
    md_files = list(wiki_root.rglob("*.md"))
    signals["md_count"] = len(md_files)
    signals["wiki_root"] = str(wiki_root)
    # Primary signal: has index.md + meaningful number of markdown files
    if signals["has_index"] and signals["md_count"] >= 3:
        signals["detected"] = True
        signals["format"] = "karpathy"
    else:
        signals["detected"] = False
        signals["format"] = "unknown"
    return signals
# ---------------------------------------------------------------------------
# Markdown extraction helpers
# ---------------------------------------------------------------------------
def extract_frontmatter(text: str) -> dict:
    """Extract YAML frontmatter as a simple key-value dict."""
    m = FRONTMATTER_RE.match(text)
    if not m:
        return {}
    fm = {}
    for line in m.group(1).split("\n"):
        if ":" in line:
            key, _, val = line.partition(":")
            fm[key.strip()] = val.strip().strip('"').strip("'")
    return fm


def extract_wikilinks(text: str) -> list[dict]:
    """Extract all [[target]] and [[target|display]] wikilinks."""
    links = []
    for m in WIKILINK_RE.finditer(text):
        links.append({
            "target": m.group(1).strip(),
            "display": m.group(2).strip() if m.group(2) else None,
        })
    return links


def extract_headings(text: str) -> list[dict]:
    """Extract all markdown headings with level and text."""
    return [
        {"level": len(m.group(1)), "text": m.group(2).strip()}
        for m in HEADING_RE.finditer(text)
    ]


def extract_code_blocks(text: str) -> list[str]:
    """Extract languages from fenced code blocks."""
    return [m.group(1) for m in CODE_BLOCK_RE.finditer(text) if m.group(1)]
def extract_first_paragraph(text: str) -> str:
    """Extract the first non-empty paragraph after frontmatter and H1."""
    # Strip frontmatter
    stripped = FRONTMATTER_RE.sub("", text).strip()
    if not stripped:
        return ""
    lines = stripped.split("\n")

    def _collect_paragraph(start_lines: list[str]) -> str:
        """Collect the first paragraph from the given lines."""
        para: list[str] = []
        for s_raw in start_lines:
            s = s_raw.strip()
            if not s and not para:
                continue  # Skip leading blank lines
            if not s and para:
                break  # End of paragraph
            if s.startswith(">"):
                continue  # Skip blockquotes
            if re.match(r"^[-*_]{3,}\s*$", s):
                continue  # Skip horizontal rules
            if s.startswith("#"):
                if para:
                    break  # End paragraph at next heading
                continue  # Skip headings before paragraph
            para.append(s)
        return " ".join(para)

    # Try: find first paragraph after H1
    for i, line in enumerate(lines):
        if line.strip().startswith("# "):
            result = _collect_paragraph(lines[i + 1:])
            if result:
                if len(result) > 200:
                    return result[:197] + "..."
                return result
    # Fallback: no H1 found, take first paragraph from start
    result = _collect_paragraph(lines)
    if len(result) > 200:
        result = result[:197] + "..."
    return result or ""


def extract_h1(text: str) -> str:
    """Extract the first H1 heading."""
    for m in HEADING_RE.finditer(text):
        if len(m.group(1)) == 1:
            # Return the heading text unchanged; subtitle decorations are not stripped
            return m.group(2).strip()
    return ""
# ---------------------------------------------------------------------------
# Index.md parsing — categories come from section headings
# ---------------------------------------------------------------------------
def parse_index(index_path: Path) -> list[dict]:
    """Parse index.md to extract categories from ## headings and their wikilinks."""
    if not index_path.is_file():
        return []
    text = index_path.read_text(encoding="utf-8", errors="replace")
    categories = []
    current_category = None
    for line in text.split("\n"):
        # Detect ## section heading
        sec_match = re.match(r"^##\s+(.+)$", line)
        if sec_match:
            current_category = {
                "name": sec_match.group(1).strip(),
                "articles": [],
            }
            categories.append(current_category)
            continue
        # Collect wikilinks under current section
        if current_category:
            for wl in WIKILINK_RE.finditer(line):
                current_category["articles"].append(wl.group(1).strip())
    return categories
# ---------------------------------------------------------------------------
# Log.md parsing — extract operation timeline
# ---------------------------------------------------------------------------
def parse_log(log_path: Path) -> list[dict]:
    """Parse log.md to extract chronological entries."""
    if not log_path.is_file():
        return []
    text = log_path.read_text(encoding="utf-8", errors="replace")
    entries = []
    log_entry_re = re.compile(
        r"^##\s+\[(\d{4}-\d{2}-\d{2})\]\s+(\w+)\s*\|\s*(.+)$", re.MULTILINE
    )
    for m in log_entry_re.finditer(text):
        entries.append({
            "date": m.group(1),
            "operation": m.group(2),
            "title": m.group(3).strip(),
        })
    return entries
# ---------------------------------------------------------------------------
# Main pipeline
# ---------------------------------------------------------------------------
def build_name_to_stem_map(wiki_root: Path) -> dict[str, str]:
    """Build a case-insensitive map from filename stem to relative stem path.

    Full relative paths always map uniquely. Bare basenames map only when
    unambiguous: duplicate basenames are removed so they don't silently
    resolve to the wrong page.
    """
    name_map: dict[str, str] = {}
    full_paths: dict[str, str] = {}  # lowercased relative stem -> relative stem
    # Track which bare basenames appear more than once
    basename_counts: dict[str, int] = {}
    for md_file in wiki_root.rglob("*.md"):
        rel = md_file.relative_to(wiki_root)
        stem = rel.with_suffix("").as_posix()  # e.g., "decisions/decision-foo"
        basename = md_file.stem  # e.g., "decision-foo"
        full_paths[stem.lower()] = stem
        # Track basename for ambiguity detection
        key = basename.lower()
        basename_counts[key] = basename_counts.get(key, 0) + 1
        name_map[key] = stem
    # Remove ambiguous basename entries (appear more than once)
    for key, count in basename_counts.items():
        if count > 1:
            name_map.pop(key, None)
    # Apply full relative paths last: a root-level page's stem equals its
    # basename, so it must not be overwritten or deleted as an ambiguous basename
    name_map.update(full_paths)
    return name_map
def resolve_wikilink(target: str, name_map: dict[str, str], node_ids: set[str] | None = None) -> str | None:
    """Resolve a wikilink target to an article node ID.

    If node_ids is provided, only resolve to IDs that exist in the set.
    """
    key = target.lower().strip()
    # Skip targets that are clearly not page names (shell flags, etc.)
    if key.startswith("-"):
        return None
    stem = name_map.get(key)
    if stem:
        candidate = f"article:{stem}"
        # If we have a node set, verify the target exists
        if node_ids is not None and candidate not in node_ids:
            return None
        return candidate
    # Try without directory prefix
    for stored_key, stored_stem in name_map.items():
        if stored_key.endswith("/" + key) or stored_key == key:
            candidate = f"article:{stored_stem}"
            if node_ids is not None and candidate not in node_ids:
                return None
            return candidate
    return None
def parse_wiki(root: Path) -> dict:
    """Parse a Karpathy-pattern wiki and produce the scan manifest."""
    detection = detect_format(root)
    if not detection["detected"]:
        print(json.dumps({"error": "Not a Karpathy-pattern wiki", "detection": detection}),
              file=sys.stderr)
        sys.exit(1)
    wiki_root = Path(detection["wiki_root"])
    raw_root = root / "raw"
    # Build name resolution map
    name_map = build_name_to_stem_map(wiki_root)
    # Find index.md and log.md
    index_path = wiki_root / "index.md"
    if not index_path.is_file():
        index_path = root / "index.md"
    log_path = wiki_root / "log.md"
    if not log_path.is_file():
        log_path = root / "log.md"
    # Parse index for categories
    categories = parse_index(index_path)
    log_entries = parse_log(log_path)
    # Build category lookup: wikilink target → category name
    category_lookup: dict[str, str] = {}
    for cat in categories:
        for article_target in cat["articles"]:
            category_lookup[article_target.lower()] = cat["name"]
    # --- Pre-compute article IDs (for edge resolution validation) ---
    # Only skip infra files at the wiki root level, not in subdirectories
    # (e.g., wiki/index.md is infra, but wiki/concepts/index.md is content)
    article_ids: set[str] = set()
    for md_file in sorted(wiki_root.rglob("*.md")):
        rel = md_file.relative_to(wiki_root)
        stem = rel.with_suffix("").as_posix()
        # Only filter infra files at root level (no parent directory)
        if rel.parent == Path(".") and rel.name.lower() in INFRA_FILES:
            continue
        article_ids.add(f"article:{stem}")
    # --- Build article nodes ---
    nodes = []
    edges = []
    warnings = []
    stats = {"articles": 0, "sources": 0, "topics": 0, "wikilinks": 0, "unresolved": 0}
    for md_file in sorted(wiki_root.rglob("*.md")):
        rel = md_file.relative_to(wiki_root)
        stem = rel.with_suffix("").as_posix()
        basename = md_file.stem
        # Skip infrastructure files only at wiki root level
        if rel.parent == Path(".") and rel.name.lower() in INFRA_FILES:
            continue
        text = md_file.read_text(encoding="utf-8", errors="replace")
        h1 = extract_h1(text)
        frontmatter = extract_frontmatter(text)
        wikilinks = extract_wikilinks(text)
        headings = extract_headings(text)
        code_langs = extract_code_blocks(text)
        summary = extract_first_paragraph(text)
        line_count = text.count("\n") + 1
        word_count = len(text.split())
        # Derive category from index.md lookup
        category = category_lookup.get(basename.lower(), "")
        if not category:
            # Try stem match
            category = category_lookup.get(stem.lower(), "")
        # Derive tags (deduplicated)
        tag_set: set[str] = set()
        if category:
            tag_set.add(category.lower())
        if rel.parent != Path("."):
            tag_set.add(str(rel.parent))
        fm_tags = frontmatter.get("tags", "")
        if fm_tags:
            tag_set.update(t.strip() for t in fm_tags.split(",") if t.strip())
        tags = sorted(tag_set)
        # Complexity from wikilink density
        wl_count = len(wikilinks)
        if wl_count > 15:
            complexity = "complex"
        elif wl_count > 5:
            complexity = "moderate"
        else:
            complexity = "simple"
        node_id = f"article:{stem}"
        nodes.append({
            "id": node_id,
            "type": "article",
            "name": h1 or basename,
            "filePath": str(rel),
            "summary": summary or f"Wiki article: {h1 or basename}",
            "tags": tags,
            "complexity": complexity,
            "knowledgeMeta": {
                "wikilinks": [wl["target"] for wl in wikilinks],
                **({"category": category} if category else {}),
                "content": text[:3000],  # First 3000 chars for LLM analysis
            },
        })
        stats["articles"] += 1
        stats["wikilinks"] += wl_count
        # Build edges from wikilinks (resolve against known article IDs)
        for wl in wikilinks:
            target_id = resolve_wikilink(wl["target"], name_map, article_ids)
            if target_id and target_id != node_id:
                edges.append({
                    "source": node_id,
                    "target": target_id,
                    "type": "related",
                    "direction": "forward",
                    "weight": 0.7,
                })
            elif not target_id:
                warnings.append(f"Unresolved wikilink: [[{wl['target']}]] in {rel}")
                stats["unresolved"] += 1
    # --- Build topic nodes from index.md categories ---
    for cat in categories:
        topic_id = f"topic:{cat['name'].lower().replace(' ', '-')}"
        nodes.append({
            "id": topic_id,
            "type": "topic",
            "name": cat["name"],
            "summary": f"Category from index: {cat['name']} ({len(cat['articles'])} articles)",
            "tags": ["category"],
            "complexity": "simple",
        })
        stats["topics"] += 1
        # categorized_under edges (only resolve to known article nodes)
        for article_target in cat["articles"]:
            article_id = resolve_wikilink(article_target, name_map, article_ids)
            if article_id:
                edges.append({
                    "source": article_id,
                    "target": topic_id,
                    "type": "categorized_under",
                    "direction": "forward",
                    "weight": 0.6,
                })
    # --- Build source nodes from raw/ ---
    if raw_root.is_dir():
        for raw_file in sorted(raw_root.rglob("*")):
            if raw_file.is_file() and not raw_file.name.startswith("."):
                rel_raw = raw_file.relative_to(root)
                ext = raw_file.suffix.lower()
                size_kb = raw_file.stat().st_size / 1024
                source_id = f"source:{raw_file.relative_to(raw_root).with_suffix('')}"
                nodes.append({
                    "id": source_id,
                    "type": "source",
                    "name": raw_file.name,
                    "filePath": str(rel_raw),
                    "summary": f"Raw source ({ext or 'unknown'}, {size_kb:.0f} KB)",
                    "tags": ["raw", ext.lstrip(".") or "unknown"],
                    "complexity": "simple",
                })
                stats["sources"] += 1
    # --- Compute backlinks ---
    backlink_map: dict[str, list[str]] = {}
    for edge in edges:
        if edge["type"] == "related":
            target = edge["target"]
            source = edge["source"]
            backlink_map.setdefault(target, []).append(source)
    for node in nodes:
        if node["type"] == "article" and "knowledgeMeta" in node:
            bl = backlink_map.get(node["id"], [])
            node["knowledgeMeta"]["backlinks"] = bl
    # --- Deduplicate edges ---
    seen_edges: set[tuple[str, str, str]] = set()
    deduped_edges = []
    for edge in edges:
        key = (edge["source"], edge["target"], edge["type"])
        if key not in seen_edges:
            seen_edges.add(key)
            deduped_edges.append(edge)
    return {
        "format": "karpathy",
        "stats": stats,
        "categories": [{"name": c["name"], "count": len(c["articles"])} for c in categories],
        "logEntries": len(log_entries),
        "nodes": nodes,
        "edges": deduped_edges,
        "warnings": warnings[:50],  # Cap warnings
    }
def main():
    if len(sys.argv) < 2:
        print("Usage: parse-knowledge-base.py <wiki-directory>", file=sys.stderr)
        sys.exit(1)
    root = Path(sys.argv[1]).resolve()
    if not root.is_dir():
        print(f"Error: {root} is not a directory", file=sys.stderr)
        sys.exit(1)
    manifest = parse_wiki(root)
    # Write output
    out_dir = root / ".understand-anything" / "intermediate"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "scan-manifest.json"
    out_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    # Report to stderr
    s = manifest["stats"]
    print(f"[parse] Karpathy wiki: {s['articles']} articles, {s['sources']} sources, "
          f"{s['topics']} topics, {s['wikilinks']} wikilinks "
          f"({s['unresolved']} unresolved)", file=sys.stderr)
    print(f"[parse] Output: {out_path}", file=sys.stderr)


if __name__ == "__main__":
    main()