Initial commit with translated description

2026-03-29 10:19:02 +08:00
commit c027104b0b
6 changed files with 455 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,83 @@
 # Markdown.new Skill
 Single-skill repository for `markdown-new` - official Cloudflare URL-to-Markdown service ([markdown.new](https://markdown.new/)) converted into a skill.
 Skill entrypoint:
 - `markdown-new/SKILL.md`
 ## What It Does
 `markdown-new` converts public web pages into LLM-ready Markdown using [markdown.new](https://markdown.new), with:
 - URL-to-Markdown conversion for summarization, extraction, RAG, and archiving
 - conversion fallback control (`auto`, `ai`, `browser`)
 - optional image retention
 - optional wrapped delivery mode for downstream parsing
 ## Path Resolution (Important)
 - Relative paths such as `scripts/markdown_new_fetch.py` are relative to the skill directory.
 - Do not run `python3 scripts/markdown_new_fetch.py ...` from workspace root unless `scripts/` exists there.
 - Safe command from any current directory:
 ```bash
 python3 ~/.codex/skills/markdown-new/scripts/markdown_new_fetch.py 'https://example.com'
 ```
 ## Modes
 ### Conversion Modes (`--method`)
 - `auto`: default pipeline, fastest successful path
 - `ai`: force Workers AI conversion path
 - `browser`: force Browser Rendering for JS-heavy pages
 ### Output Modes
 - default: print Markdown to stdout
 - `--output <file>`: write Markdown to file
 - `--deliver-md`: write `.md` output with wrapped content; useful for reasoning LLMs on long reads because it reduces format confusion:
 ```text
 <url>
 ...markdown...
 </url>
 ```
 If `--deliver-md` is used without `--output`, filename is auto-generated from the URL.
 ## How It Works
 1. Validate the input URL (`http/https`).
 2. Call `POST https://markdown.new/` with `url`, `method`, and `retain_images`.
 3. Accept response as either raw markdown or JSON with markdown in `content`.
 4. Normalize metadata and choose output behavior.
 5. Return stdout by default, `--output` for files, and `--deliver-md` for wrapped `.md` packets.
 ## Install Paths
 - Codex (macOS/Linux): `~/.codex/skills/markdown-new`
 - Claude Code (macOS/Linux): `~/.claude/skills/markdown-new`
 ## Install on macOS/Linux (single command)
 ### Codex
 ```bash
 mkdir -p ~/.codex/skills && rm -rf ~/.codex/skills/markdown-new && cp -R /Users/pro16/Dropbox/experiments/skills-i-use/markdown-new ~/.codex/skills/
 ```
 ### Claude Code
 ```bash
 mkdir -p ~/.claude/skills && rm -rf ~/.claude/skills/markdown-new && cp -R /Users/pro16/Dropbox/experiments/skills-i-use/markdown-new ~/.claude/skills/
 ```
 ## Quick Usage
 ```bash
 python3 scripts/markdown_new_fetch.py 'https://example.com'
 python3 scripts/markdown_new_fetch.py 'https://example.com' --method browser --retain-images --output page.md
 python3 scripts/markdown_new_fetch.py 'https://example.com' --deliver-md
 ```
 ## Credits
 - `webservervis` for the markdown conversion service powering this skill.
--- a/SKILL.md
+++ b/SKILL.md
@@ -0,0 +1,90 @@
 ---
 name: markdown-new
 description: "使用markdown.new将公共网页转换为干净的Markdown，用于AI工作流。"
 ---
 # Markdown.new
 Use this skill to convert public URLs into LLM-ready Markdown via [markdown.new](https://markdown.new).
 ## Path Resolution (Critical)
 - Resolve relative paths like `scripts/...` and `references/...` from the skill directory, not workspace root.
 - If current directory is unknown, use an absolute script path.
 ```bash
 python3 ~/.codex/skills/markdown-new/scripts/markdown_new_fetch.py 'https://example.com'
 ```
 ```bash
 cd ~/.codex/skills/markdown-new
 python3 scripts/markdown_new_fetch.py 'https://example.com'
 ```
 Avoid this pattern from an arbitrary workspace root:
 ```bash
 python3 scripts/markdown_new_fetch.py 'https://example.com'
 ```
 ## Workflow
 1. Validate the input URL is public `http` or `https`.
 2. Run `scripts/markdown_new_fetch.py` with `--method auto` first.
 3. Re-run with `--method browser` if output misses JS-rendered content.
 4. Enable `--retain-images` only when image links are required.
 5. Capture response metadata (`x-markdown-tokens`, `x-rate-limit-remaining`, and JSON metadata when present) for downstream planning.
 ## Quick Start
 Commands below assume current directory is the skill root (`~/.codex/skills/markdown-new`).
 ```bash
 python3 scripts/markdown_new_fetch.py 'https://example.com' > page.md
 ```
 ```bash
 python3 scripts/markdown_new_fetch.py 'https://example.com' --method browser --retain-images --output page.md
 ```
 ```bash
 python3 scripts/markdown_new_fetch.py 'https://example.com' --deliver-md
 ```
 ## Method Selection
 - `auto`: default. Let markdown.new use its fastest successful pipeline.
 - `ai`: force Workers AI HTML-to-Markdown conversion.
 - `browser`: force headless browser rendering for JS-heavy pages.
 Use `auto` first, then retry with `browser` only when needed.
 ## Delivery Mode
 - Use `--deliver-md` to force file output in `.md` format.
 - In delivery mode, content is wrapped as:
  - `<url>`
  - `...markdown...`
  - `</url>`
 - If `--output` is omitted, the script auto-generates a filename from the URL.
 ## API Modes
 - Prefix mode:
  - `https://markdown.new/https://example.com?method=browser&retain_images=true`
 - POST mode:
  - `POST https://markdown.new/`
  - JSON body: `{"url":"https://example.com","method":"auto","retain_images":false}`
 Prefer POST mode for automation and explicit parameters.
 ## Limits And Safety
 - Treat `429` as rate limiting (documented limit: 500 requests/day/IP).
 - Convert only publicly accessible pages.
 - Respect `robots.txt`, terms of service, and copyright constraints.
 - Do not treat markdown.new output as guaranteed complete for every page; verify critical extractions.
 ## References
 - `references/markdown-new-api.md`
--- a/_meta.json
+++ b/_meta.json
@@ -0,0 +1,6 @@
 {
  "ownerId": "kn7a3xe57dnqxgg5kfp78twfj98022a5",
  "slug": "markdown-convert",
  "version": "1.0.0",
  "publishedAt": 1771528764013
 }
--- a/agents/openai.yaml
+++ b/agents/openai.yaml
@@ -0,0 +1,4 @@
 interface:
  display_name: "Markdown.new"
  short_description: "Convert web URLs to LLM-ready Markdown"
  default_prompt: "Use $markdown-new to fetch clean Markdown from public URLs for summarization, RAG, or extraction tasks."
--- a/references/markdown-new-api.md
+++ b/references/markdown-new-api.md
@@ -0,0 +1,56 @@
 # markdown.new API Reference Notes
 ## Endpoints
 - Prefix conversion: `GET https://markdown.new/<absolute-url>`
 - API conversion: `POST https://markdown.new/`
 ## POST Request
 - Content type: `application/json`
 - Body fields:
  - `url` (string, required): public URL to convert
  - `method` (string, optional): `auto` (default), `ai`, or `browser`
  - `retain_images` (boolean, optional): `false` (default)
 Example:
 ```json
 {
  "url": "https://example.com",
  "method": "auto",
  "retain_images": false
 }
 ```
 ## Response
 - Status: `200` on success
 - Prefix mode typically returns Markdown (`text/markdown`)
 - POST mode may return JSON (`application/json`) with Markdown in `content`
 - Common headers:
  - `x-markdown-tokens`: estimated token count for the returned Markdown
  - `x-rate-limit-remaining`: remaining requests for current daily quota
 ## Conversion Pipeline (as documented)
 1. Request native Markdown via `Accept: text/markdown`
 2. Fall back to Workers AI `toMarkdown()` when HTML is returned
 3. Fall back to Browser Rendering for JS-heavy pages
 ## Operational Notes
 - Documented rate limit: 500 requests/day per IP
 - `429` indicates rate-limit exhaustion
 - Public URLs only; authenticated/paywalled pages may fail
 - Browser rendering usually adds latency compared with `auto`/`ai`
 ## Skill Script Notes
 - `scripts/markdown_new_fetch.py --deliver-md` writes a `.md` file and wraps the markdown body with pseudo-XML tags:
 ```text
 <url>
 ...markdown...
 </url>
 ```
--- a/scripts/markdown_new_fetch.py
+++ b/scripts/markdown_new_fetch.py
@@ -0,0 +1,216 @@
 #!/usr/bin/env python3
 """Convert public URLs to Markdown through markdown.new."""
 import argparse
 import json
 import re
 import sys
 import urllib.error
 import urllib.parse
 import urllib.request
 from pathlib import Path
 from typing import Dict, Optional, Tuple
 DEFAULT_API_URL = "https://markdown.new/"
 def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Fetch Markdown from a public URL using markdown.new"
    )
    parser.add_argument("url", help="Public URL to convert (http/https)")
    parser.add_argument(
        "--method",
        choices=["auto", "ai", "browser"],
        default="auto",
        help="Conversion method to request (default: auto)",
    )
    parser.add_argument(
        "--retain-images",
        action="store_true",
        help="Request image retention in output markdown",
    )
    parser.add_argument(
        "--output",
        help="Write markdown to this file instead of stdout",
    )
    parser.add_argument(
        "--timeout",
        type=float,
        default=45.0,
        help="Request timeout in seconds (default: 45)",
    )
    parser.add_argument(
        "--api-url",
        default=DEFAULT_API_URL,
        help="markdown.new API endpoint (default: https://markdown.new/)",
    )
    parser.add_argument(
        "--show-headers",
        action="store_true",
        help="Print response headers to stderr",
    )
    parser.add_argument(
        "--deliver-md",
        action="store_true",
        help=(
            "Write output into a .md file and wrap content in pseudo-XML "
            "<url>...</url> tags"
        ),
    )
    return parser
 def validate_url(url: str) -> None:
    parsed = urllib.parse.urlparse(url)
    if parsed.scheme not in {"http", "https"} or not parsed.netloc:
        raise ValueError(f"Invalid URL: {url!r}. Use absolute http/https URL.")
 def build_request(api_url: str, payload: Dict[str, object]) -> urllib.request.Request:
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=data,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "text/markdown",
            "User-Agent": "codex-markdown-new-skill/1.0",
        },
    )
 def normalize_body(body: str, headers: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    content_type = headers.get("content-type", "")
    if "application/json" not in content_type:
        return body, {}
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return body, {}
    if not isinstance(payload, dict):
        return body, {}
    markdown = payload.get("content")
    if not isinstance(markdown, str):
        return body, {}
    metadata: Dict[str, str] = {}
    for key in ("title", "url", "method", "duration_ms", "timestamp"):
        value = payload.get(key)
        if value is not None:
            metadata[f"response_{key}"] = str(value)
    return markdown, metadata
 def print_metadata(
    headers: Dict[str, str], show_headers: bool, response_meta: Dict[str, str]
 ) -> None:
    tokens = headers.get("x-markdown-tokens")
    remaining = headers.get("x-rate-limit-remaining")
    if tokens:
        print(f"x-markdown-tokens: {tokens}", file=sys.stderr)
    if remaining:
        print(f"x-rate-limit-remaining: {remaining}", file=sys.stderr)
    for key, value in sorted(response_meta.items()):
        print(f"{key}: {value}", file=sys.stderr)
    if show_headers:
        for key, value in sorted(headers.items()):
            print(f"{key}: {value}", file=sys.stderr)
 def slugify_url(url: str) -> str:
    parsed = urllib.parse.urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    if not raw:
        raw = parsed.netloc or "page"
    slug = re.sub(r"[^a-zA-Z0-9._-]+", "_", raw).strip("_")
    return slug or "page"
 def resolve_output_path(url: str, output_path: Optional[str], deliver_md: bool) -> Optional[str]:
    if output_path:
        if deliver_md and not output_path.lower().endswith(".md"):
            output_path = f"{output_path}.md"
        return output_path
    if deliver_md:
        return f"{slugify_url(url)}.md"
    return None
 def wrap_in_url_tag(markdown: str) -> str:
    body = markdown.rstrip("\n")
    return f"<url>\n{body}\n</url>\n"
 def write_output(markdown: str, output_path: Optional[str]) -> None:
    if output_path:
        path = Path(output_path)
        if path.parent != Path("."):
            path.parent.mkdir(parents=True, exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            f.write(markdown)
        return
    sys.stdout.write(markdown)
 def main() -> int:
    parser = build_parser()
    args = parser.parse_args()
    try:
        validate_url(args.url)
    except ValueError as exc:
        print(str(exc), file=sys.stderr)
        return 2
    payload = {
        "url": args.url,
        "method": args.method,
        "retain_images": bool(args.retain_images),
    }
    req = build_request(args.api_url, payload)
    try:
        with urllib.request.urlopen(req, timeout=args.timeout) as resp:
            headers = {k.lower(): v for k, v in resp.headers.items()}
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        error_body = err.read().decode("utf-8", errors="replace")
        print(f"HTTP {err.code} from markdown.new", file=sys.stderr)
        if err.code == 429:
            print(
                "Rate limit reached (documented: 500 requests/day/IP). Retry later.",
                file=sys.stderr,
            )
        if error_body:
            print(error_body.strip(), file=sys.stderr)
        return 1
    except urllib.error.URLError as err:
        print(f"Request failed: {err.reason}", file=sys.stderr)
        return 1
    markdown, response_meta = normalize_body(body, headers)
    print_metadata(headers, args.show_headers, response_meta)
    content = wrap_in_url_tag(markdown) if args.deliver_md else markdown
    output_path = resolve_output_path(args.url, args.output, args.deliver_md)
    write_output(content, output_path)
    if args.deliver_md and output_path:
        print(f"output_file: {Path(output_path).resolve()}", file=sys.stderr)
    return 0
 if __name__ == "__main__":
    raise SystemExit(main())