v1.2.0-stable

ENGINEERING
DATA SOVEREIGNTY.

Reverse-engineers the Notion Internal API for fully automated migration. Recursive crawling, checkpoint resume, document merging, and automated Confluence integration.

zsh - node-1
➜ ~ python crawl_notion_api.py
[INFO] Initializing reverse API connection...
[INFO] Authenticated as user_8a2d...
[INFO] Found 517 pages in workspace.
> Processing Page: 'API Specification' (ID: a1b2...)
> Recursively fetching children... [OK]
_

// README.md: THE_WHY

"In the era of AI & Vibe Coding, building tools is faster than ever. But Token costs and Time are real."

PERFORMANCE_METRICS

benchmark_result.json
MANUAL_MIGRATION (est.)
Time: 129.0 hrs
NOTION_CRAWLER (v1.2)
Time: 0.5 hrs

> Speedup Factor: 258x
> Data Integrity: 100%
> Memory Usage: <150MB (Streaming)

Exponential Efficiency

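Based on a real migration test of 500+ pages. Manual copying is extremely time-consuming (estimated at 129 hours in the benchmark above) and easily loses the page hierarchy. Notion Crawler uses concurrent crawling and automated transpilation to cut weeks of work down to under 1 hour.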

517
PAGES PROCESSED
0
DATA LOSS
258x
FASTER

SYSTEM_ARCHITECTURE

[Figure: Notion Crawler system architecture diagram showing the data flow between Recursive Crawler, Markdown Transpiler, and Confluence Integrator]

Recursive Crawler Engine

  • Reverse Internal API: calls `loadPageChunk` directly to fetch structured Block data, roughly 10x faster than Playwright.
  • Recursive traversal: automatically resolves sub-pages and Database Rows, accurately reproducing an arbitrarily deep hierarchy.
  • Anti-detection: implements Exponential Backoff with Jitter (randomized delay) to avoid 429 Rate Limit responses.
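The backoff-with-jitter idea in the list above can be sketched as follows. This is a minimal illustration, not the project's actual code: `backoff_delay`, `fetch_with_retry`, and the injected `send` callable are hypothetical names, and the "full jitter" variant (a uniform delay between zero and a doubling ceiling) is one common choice among several.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a 'full jitter' delay in seconds for a 0-based attempt number."""
    # The ceiling doubles on every attempt but never exceeds `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Drawing uniformly from [0, ceiling] keeps parallel workers from
    # retrying in lockstep.
    return rng.uniform(0, ceiling)


def fetch_with_retry(send, max_retries=5, sleep=time.sleep):
    """Call `send()` until it returns a status other than 429.

    `send` is a hypothetical callable returning a (status, body) tuple.
    """
    for attempt in range(max_retries):
        status, body = send()
        if status != 429:
            return status, body
        sleep(backoff_delay(attempt))
    raise RuntimeError(f"still rate limited after {max_retries} attempts")
```

Passing `sleep` as a parameter keeps the retry loop testable without real waiting.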

Markdown Transpiler

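  • AST parsing: converts the raw JSON into an abstract syntax tree so that complex elements such as Table, Callout, and Code Block render accurately.
  • Knowledge Stitcher: automatically identifies and stitches together scattered API parameter pages (Input/Output/Schema), reassembling them into a single source of truth.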
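To illustrate the rendering step, here is a small sketch that turns already-parsed nodes into Markdown. The node shape (`type`, `text`, `level`, `rows`, `language`) is an assumption made for this example and is not Notion's actual block format or the project's real AST.

```python
def render_block(block):
    """Render one hypothetical AST node (a dict) as Markdown text."""
    kind = block["type"]
    if kind == "heading":
        return "#" * block["level"] + " " + block["text"]
    if kind == "paragraph":
        return block["text"]
    if kind == "callout":
        # Callouts are approximated as Markdown blockquotes.
        return "\n".join("> " + line for line in block["text"].splitlines())
    if kind == "code":
        fence = "`" * 3
        return f"{fence}{block.get('language', '')}\n{block['text']}\n{fence}"
    if kind == "table":
        header, *body = block["rows"]
        lines = [
            "| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |",
        ]
        lines += ["| " + " | ".join(row) + " |" for row in body]
        return "\n".join(lines)
    raise ValueError(f"unsupported block type: {kind}")


def render_document(blocks):
    """Join rendered blocks with blank lines between them."""
    return "\n\n".join(render_block(b) for b in blocks)
```

Dispatching on the node type keeps each renderer a pure function, which is what makes this kind of logic easy to unit test.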

Resilience & Failover

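  • Dual-Domain Failover: connects to `notion.so` first and automatically switches to the `notion.site` backup domain on failure, targeting 99.9% availability.
  • Connection Pooling: uses `requests.Session` to maintain a TCP connection pool, reducing TLS handshake overhead and improving throughput for large crawls.
  • Granular Checkpoints: records the state of each Page ID in SQLite/JSON, enabling 100% resumable transfers.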
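The failover behaviour described above might look roughly like this. It is a sketch under assumptions: `fetch_with_failover` and the injected `fetch(domain, path)` callable are hypothetical, and only the two domain names come from the text.

```python
DOMAINS = ("notion.so", "notion.site")  # primary first, backup second


def fetch_with_failover(path, fetch, domains=DOMAINS):
    """Try each domain in order and return (domain, result) for the first success.

    `fetch(domain, path)` is a hypothetical callable that raises
    ConnectionError when the domain cannot be reached.
    """
    errors = []
    for domain in domains:
        try:
            return domain, fetch(domain, path)
        except ConnectionError as exc:
            # Remember the failure and fall through to the next domain.
            errors.append(f"{domain}: {exc}")
    raise ConnectionError("all domains failed: " + "; ".join(errors))
```

Returning the domain that succeeded lets the caller log which path was taken.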

Confluence Integrator

  • BFS Traversal: uses breadth-first search so that every parent page is created before its children, avoiding Orphan Pages.
  • Smart Transform: automatically converts Mermaid blocks into Confluence Macros and repairs Internal Links between pages.
  • Auto-Root Management: automatically creates `Notion_KB` at the root of the Space and supports `--clean` for recursive deletion before a clean deployment.
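The parent-before-child guarantee of a breadth-first traversal can be sketched as below. The `children` mapping and the `creation_order` helper are hypothetical and only illustrate the ordering, not the actual upload code.

```python
from collections import deque


def creation_order(root, children):
    """Return pages in breadth-first order, so each parent precedes its children.

    `children` maps a page title to the list of its direct child titles.
    """
    order = []
    seen = {root}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        order.append(page)
        for child in children.get(page, []):
            if child not in seen:  # guard against accidental cycles
                seen.add(child)
                queue.append(child)
    return order


tree = {"Notion_KB": ["Auth", "Billing"], "Auth": ["Login"], "Billing": ["Invoices"]}
print(creation_order("Notion_KB", tree))
```

Because a page is only enqueued while its parent is being processed, the parent always appears earlier in the returned list.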

Legacy Mode (Fallback)

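  • Playwright Renderer: simulates a real browser and parses the Breadcrumb from the DOM to determine the output path, covering special Edge Cases the API cannot handle.
  • Interactive Crawling: supports Auto-Scroll to trigger Lazy Loading, and automatically expands Toggles and Databases.
  • Stealth Mode: uses Headed mode with a random delay (3-10s) to get past Cloudflare verification.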

Test Suite (Quality Gate)

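  • 130 Unit Tests: pytest covers the parsing, rendering logic, and merge strategies of the three core modules, so that changes do not break existing behavior.
  • Zero-IO Pure Testing: core logic such as RichText conversion, Block rendering, and title disambiguation is tested as pure functions, with no need to mock external services.
  • Filesystem Isolation: merge and tree-building tests use the pytest `tmp_path` fixture, fully isolated from the real filesystem.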
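A filesystem-isolated test in this style could look like the sketch below. `merge_files` is a hypothetical stand-in for the project's merge functions; `tmp_path` is the built-in pytest fixture that supplies a fresh temporary directory as a `pathlib.Path`.

```python
from pathlib import Path


def merge_files(parts, target):
    """Hypothetical helper: concatenate Markdown files into `target`."""
    texts = [p.read_text(encoding="utf-8").strip() for p in parts]
    target.write_text("\n\n".join(texts) + "\n", encoding="utf-8")
    return target


def test_merge_files(tmp_path):
    # pytest injects `tmp_path`; nothing is written outside of it.
    first = tmp_path / "input.md"
    second = tmp_path / "output.md"
    first.write_text("# Input\n", encoding="utf-8")
    second.write_text("# Output\n", encoding="utf-8")

    merged = merge_files([first, second], tmp_path / "api.md")

    assert merged.read_text(encoding="utf-8") == "# Input\n\n# Output\n"
```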

DEV_EXPERIENCE (DX)

Developer-friendly mechanisms for rapid iteration.
// DEBUG MODE: OFFLINE
enable --offline-replay
> Loading snapshot `dump_20250131.json`
> Mocking API responses...
> Ready. (0ms latency)
Offline Replay

While developing the transpiler logic, read directly from a local Snapshot with no network access at all, making iteration about 100x faster.
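One way to picture offline replay is a client that serves responses from a snapshot file instead of the network. This is only a sketch: the `SnapshotClient` class and the assumed snapshot layout (a JSON object mapping page IDs to recorded responses) are hypothetical.

```python
import json


class SnapshotClient:
    """Serve recorded page responses from a local JSON snapshot."""

    def __init__(self, snapshot_path):
        # Assumed layout: {"<page_id>": {...recorded response...}, ...}
        with open(snapshot_path, encoding="utf-8") as fh:
            self._pages = json.load(fh)

    def load_page(self, page_id):
        """Return the recorded response, or raise KeyError if it was never captured."""
        if page_id not in self._pages:
            raise KeyError(f"page {page_id!r} is not in the snapshot")
        return self._pages[page_id]
```

Code that accepts such a client in place of the live API client can be exercised without any network round trips.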

// PRE-FLIGHT CHECK
run --dry-run
> Simulating write operations...
> [SKIP] POST /wiki/rest/api/content
> No changes applied.
Dry Run Mode

Simulates the transpile and upload flow, writing only log output and never performing real writes, so existing Confluence data cannot be affected.
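The dry-run pattern can be sketched as a single flag that short-circuits every write. The `upload_pages` function and its `post` callable are hypothetical; they stand in for whatever performs the real Confluence request.

```python
def upload_pages(pages, post, dry_run=False, log=print):
    """Upload (title, body) pairs; in dry-run mode only log what would happen.

    `post(title, body)` is a hypothetical callable that performs the real write.
    Returns the number of pages actually written.
    """
    written = 0
    for title, body in pages:
        if dry_run:
            log(f"[SKIP] would upload '{title}' ({len(body)} chars)")
            continue
        post(title, body)
        written += 1
    return written
```

Keeping the check next to the only call that writes makes it easy to verify that a dry run has no side effects.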

// ERROR RECOVERY
status --checkpoints
> Pending: 42 pages
> Failed: 3 pages (Rate Limited)
> Resuming from last success...
Smart Resume

On restart, the program automatically reads the Checkpoint, skips pages already marked Success, and retries only the failed pages.
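A checkpoint store with this resume behaviour might be sketched with the standard `sqlite3` module as follows. The `Checkpoints` class, the table layout, and the status strings are assumptions made for illustration.

```python
import sqlite3


class Checkpoints:
    """Track per-page status so an interrupted run can resume."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "page_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
        )

    def mark(self, page_id, status):
        # Using the connection as a context manager commits on success.
        with self.db:
            self.db.execute(
                "INSERT OR REPLACE INTO pages (page_id, status) VALUES (?, ?)",
                (page_id, status),
            )

    def pending(self, page_ids):
        """Return the IDs still to process: everything not marked 'success'."""
        rows = self.db.execute("SELECT page_id FROM pages WHERE status = 'success'")
        done = {row[0] for row in rows}
        return [pid for pid in page_ids if pid not in done]
```

Failed pages stay in the pending list, so a rerun retries them while skipping completed work.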

CLI_COMMANDS

Crawls quickly through the internal API with no browser rendering required. Suited to migrating large batches of pages.

INPUT
python crawl_notion_api.py --token $NOTION_TOKEN_V2 --page $ROOT_PAGE_ID

Merges the scattered files into API documents and starts a local MkDocs server for preview.

INPUT
python build_knowledge_base.py && mkdocs serve

Automatically reads the output directory and uploads it to the specified Space. `--source all` uploads everything.

INPUT
python upload_to_confluence.py --source all --space ENGINEERING

FALLBACK_STRATEGY

PLAN B: WHEN API FAILS

Why We Need a Fallback

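The Notion internal API (`loadPageChunk`) is a non-public endpoint that may change or tighten its verification at any time. Once the crawler is identified as an automated tool, API requests will return 403 or 429 outright.
For that reason the project ships a built-in Playwright browser mode as a complete fallback. It renders pages in a real browser and bypasses the API layer entirely, so the migration can still be completed when the API is unavailable.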

Headed Browser Mode
Renders with non-Headless Chromium and executes JavaScript in full to get past Cloudflare Bot Detection.
Randomized Delay (3-10s)
Adds a random 3~10 second delay between requests to mimic human browsing and avoid detection through a fixed request rhythm.
Auto-Scroll & Expand
Automatically triggers Lazy Loading, expands Toggle Blocks, and loads the full Database listing so that no content is missed.
Breadcrumb Path Resolution
Parses the Breadcrumb navigation in the page DOM to accurately reconstruct the page hierarchy and the storage path.
crawl_notion.py (Playwright)
$ python crawl_notion.py
[INFO] Launching Chromium (headed mode)...
[INFO] User-Agent: Chrome/131.0 (Windows)
[INFO] Navigating to notion.site/...
[INFO] Waiting 6.2s (random delay)...
[INFO] Auto-scrolling page content...
[INFO] Expanding 3 toggle blocks...
[INFO] Breadcrumb: Project > Auth > Login
[INFO] Saved: output/Project/Auth/Login.md
[INFO] Waiting 4.8s (random delay)...
[INFO] Processing next page...
_
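The mapping from a breadcrumb to a storage path, as seen in the log above (`Project > Auth > Login` saved as `output/Project/Auth/Login.md`), can be sketched like this. The `breadcrumb_to_path` helper and its character-sanitizing rule are hypothetical; the real script may resolve paths differently.

```python
import re
from pathlib import PurePosixPath


def breadcrumb_to_path(breadcrumb, root="output"):
    """Turn 'A > B > C' into '<root>/A/B/C.md'."""
    parts = [p.strip() for p in breadcrumb.split(">") if p.strip()]
    if not parts:
        raise ValueError("empty breadcrumb")
    # Replace characters that are unsafe in file names (an assumed rule).
    safe = [re.sub(r'[\\/:*?"<>|]', "_", p) for p in parts]
    return str(PurePosixPath(root, *safe[:-1], safe[-1] + ".md"))


print(breadcrumb_to_path("Project > Auth > Login"))
```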
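API vs Playwright Comparison

| | API Mode | Playwright |
| --- | --- | --- |
| Speed | ~10 min | ~3 hrs |
| Anti-detection | Header spoofing | Real browser |
| API dependency | Non-public API | No API dependency |
| Cloudflare | May be blocked | Fully bypassed |
| Checkpoint resume | | |
| Memory | <50MB | ~500MB |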

TEST_SUITE

pytest: 130 tests across 3 core modules
130
TEST CASES
3
MODULES COVERED
0.3s
EXECUTION TIME
100%
PASS RATE
pytest -v --tb=short
$ pytest -v --tb=short
======================== test session starts ========================
collected 130 items
 
test_crawl_notion_api.py::TestSanitizeFilename::test_basic PASSED
test_crawl_notion_api.py::TestRichTextConvert::test_bold PASSED
test_crawl_notion_api.py::TestApplyDecorations::test_equation PASSED
test_crawl_notion_api.py::TestBlockToMarkdown::test_table_2x2 PASSED
test_build_knowledge_base.py::TestStripH1::test_removes_h1 PASSED
test_build_knowledge_base.py::TestMergeStandardSplit::test_basic PASSED
test_upload_to_confluence.py::TestEscapeXml::test_ampersand PASSED
test_upload_to_confluence.py::TestConvertMd::test_mermaid PASSED
... 122 more passed ...
 
======================== 130 passed in 0.32s ========================
MODULE COVERAGE
crawl_notion_api.py 44 tests
RichTextConverter • BlockToMarkdown • sanitize_filename • page ID helpers
upload_to_confluence.py 34 tests
_escape_xml • PageNode tree • title conflicts • MD→Confluence conversion
build_knowledge_base.py 32 tests
strip_h1 • extract_api_meta • Mermaid generators • merge functions (tmp_path)
TEST CATEGORIES
Pure Functions ~50
Rich Text Parsing ~27
Block Rendering ~16
File Merge (I/O) ~9
Tree Building ~12
MD→Confluence ~5
RUN TESTS
pip install pytest && pytest -v --tb=short