Chrome Extension
Page Capture and Parsing
Full-page capture, selection capture, content-root scoring, multi-article/main handling, and image extraction logic
Two Capture Paths
1. Extension Popup full-page capture
Entry: src/services/pageMarkdown.ts
Responsibilities:
- inject into the current tab
- identify candidate content roots
- filter ads, nav, and hidden elements
- convert structured DOM into Markdown
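The filter-then-convert responsibilities above can be sketched roughly as follows. This is a minimal illustration, not the code in pageMarkdown.ts: `SimpleNode` is a hypothetical plain-object stand-in for a DOM element, and the skipped tag list and ad class pattern are assumed values chosen for the example.

```typescript
// Hypothetical, simplified stand-in for a DOM element (not a real browser type).
interface SimpleNode {
  tag: string;
  text?: string;
  className?: string;
  hidden?: boolean;
  children?: SimpleNode[];
}

// Assumed filter rules; the real lists are not documented here.
const SKIPPED_TAGS = ["nav", "aside", "footer", "script", "style"];
const AD_CLASS = /\b(ad|ads|advert|sponsored|promo)\b/i;

function shouldSkip(node: SimpleNode): boolean {
  return (
    node.hidden === true ||
    SKIPPED_TAGS.indexOf(node.tag) !== -1 ||
    AD_CLASS.test(node.className ?? "")
  );
}

// Convert the remaining children of a container, dropping empty results.
function convertChildren(node: SimpleNode, separator: string): string {
  return (node.children ?? [])
    .map((child) => toMarkdown(child))
    .filter((block) => block.length > 0)
    .join(separator);
}

function toMarkdown(node: SimpleNode): string {
  if (shouldSkip(node)) return "";
  const text = (node.text ?? "").trim();
  const heading = /^h([1-6])$/.exec(node.tag);
  if (heading) return "#".repeat(Number(heading[1])) + " " + text;
  if (node.tag === "p") return text;
  if (node.tag === "li") return "- " + text;
  // List items sit on consecutive lines; other blocks are separated by a blank line.
  if (node.tag === "ul") return convertChildren(node, "\n");
  return convertChildren(node, "\n\n");
}
```

Filtering before conversion means a skipped container drops its whole subtree, so text nested inside a nav or ad block never reaches the Markdown output.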
2. Content-script lightweight capture
Entry: src/inline-composer/pageContext.ts
Responsibilities:
- read the current selection
- fall back to a full-page text snapshot when nothing is selected
- extract image references related to the content root
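The selection-first fallback described above might look like the sketch below. The names `PageContext` and `buildPageContext` and the `maxChars` cap are assumptions made for this example, not identifiers confirmed from pageContext.ts. The function takes plain strings so that it can run outside a browser.

```typescript
// Hypothetical result shape for the lightweight capture path.
interface PageContext {
  mode: "selection" | "page";
  text: string;
}

// Prefer the user's selection; otherwise fall back to a text snapshot of the page.
// maxChars is an assumed cap that keeps the snapshot lightweight.
function buildPageContext(
  selectionText: string,
  pageText: string,
  maxChars = 8000
): PageContext {
  const selected = selectionText.trim();
  if (selected.length > 0) {
    return { mode: "selection", text: selected };
  }
  return { mode: "page", text: pageText.trim().slice(0, maxChars) };
}
```

In a content script the two inputs would typically come from `window.getSelection()?.toString() ?? ""` and `document.body.innerText`. A whitespace-only selection is treated as empty, so it also triggers the page fallback.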
How Multiple article/main Roots Are Handled
The implementation does not simply take the first article or main element it finds.
Instead it:
- collects multiple candidate content roots
- scores them by text length, paragraph count, headings, images, and link density
- keeps the strongest content root
- merges additional strong roots when they look like part of the same main body
This reduces the chance of treating feed/list containers as the final article body.
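As a rough illustration of the scoring idea, the sketch below ranks candidates from pre-computed statistics. The weights, the `RootStats` shape, and the `mergeRatio` threshold are all assumptions for the example. In particular, the real merge check ("looks like part of the same main body") is simplified here to "scores close to the best root".

```typescript
// Hypothetical stats gathered for one candidate root (an <article> or <main>).
interface RootStats {
  id: string;
  textLength: number;
  paragraphCount: number;
  headingCount: number;
  imageCount: number;
  linkTextLength: number; // characters of text that sit inside <a> elements
}

// Illustrative weights only; the actual scoring constants are not documented here.
function scoreRoot(s: RootStats): number {
  const linkDensity = s.textLength > 0 ? s.linkTextLength / s.textLength : 1;
  const base =
    s.textLength / 100 +
    s.paragraphCount * 5 +
    s.headingCount * 3 +
    s.imageCount * 2;
  // High link density suggests a feed or navigation list, so it scales the score down.
  return base * (1 - Math.min(linkDensity, 1));
}

// Keep the strongest root, plus any other root scoring within mergeRatio of it.
function pickRoots(candidates: RootStats[], mergeRatio = 0.6): RootStats[] {
  const scored = candidates
    .map((stats) => ({ stats, score: scoreRoot(stats) }))
    .sort((a, b) => b.score - a.score);
  const [best] = scored;
  if (!best) return [];
  return scored
    .filter((c, i) => i === 0 || (best.score > 0 && c.score >= best.score * mergeRatio))
    .map((c) => c.stats);
}
```

With these assumed weights, a link-heavy feed container scores far below a prose-heavy article even when it contains more images, which is the behavior the section describes.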
How Image Resolution Works
The extractor now tries:
- currentSrc
- src
- srcset
- data-src
- data-original
- data-lazy-src
- other data-* image attributes
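A minimal sketch of this lookup follows, assuming the sources are tried in the order listed and the first non-empty value wins. `ImageAttrs`, `resolveImageUrl`, and the pattern used for the "other data-*" fallback are hypothetical names and rules for illustration. The srcset handling is deliberately naive: it takes the first candidate and does not handle URLs that contain commas.

```typescript
// Hypothetical input: attribute and property values read from one <img> element.
type ImageAttrs = Record<string, string | undefined>;

// Assumed priority order, taken from the order in which the sources are listed.
const PRIORITY = ["currentSrc", "src", "srcset", "data-src", "data-original", "data-lazy-src"];

// Naive srcset handling: take the URL of the first candidate ("a.jpg 1x, b.jpg 2x" gives "a.jpg").
function firstSrcsetUrl(srcset: string): string | undefined {
  const firstCandidate = srcset.split(",")[0] ?? "";
  const url = firstCandidate.trim().split(/\s+/)[0] ?? "";
  return url.length > 0 ? url : undefined;
}

function resolveImageUrl(attrs: ImageAttrs): string | undefined {
  for (const name of PRIORITY) {
    const raw = (attrs[name] ?? "").trim();
    if (raw.length === 0) continue;
    const url = name === "srcset" ? firstSrcsetUrl(raw) : raw;
    if (url) return url;
  }
  // Fallback: any other data-* attribute whose name looks image-related (assumed pattern).
  for (const name of Object.keys(attrs)) {
    const value = (attrs[name] ?? "").trim();
    if (/^data-.*(src|image|img)/i.test(name) && value.length > 0) {
      return value;
    }
  }
  return undefined;
}
```

Checking the lazy-loading attributes after `src` matters on pages where `src` is empty until the image scrolls into view. A production version would likely also need to skip placeholder values such as tiny inline data URIs.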
In full-page mode, images are appended as Markdown references rather than uploaded as binary attachments.