Initial commit with translated description
83
README.md
Normal file
@@ -0,0 +1,83 @@
# Markdown.new Skill

Single-skill repository for `markdown-new`, which packages the official Cloudflare URL-to-Markdown service ([markdown.new](https://markdown.new/)) as a skill.

Skill entrypoint:
- `markdown-new/SKILL.md`

## What It Does

`markdown-new` converts public web pages into LLM-ready Markdown using [markdown.new](https://markdown.new), with:
- URL-to-Markdown conversion for summarization, extraction, RAG, and archiving
- conversion fallback control (`auto`, `ai`, `browser`)
- optional image retention
- optional wrapped delivery mode for downstream parsing

## Path Resolution (Important)

- Relative paths such as `scripts/markdown_new_fetch.py` are relative to the skill directory.
- Do not run `python3 scripts/markdown_new_fetch.py ...` from workspace root unless `scripts/` exists there.
- Safe command from any current directory:

```bash
python3 ~/.codex/skills/markdown-new/scripts/markdown_new_fetch.py 'https://example.com'
```

## Modes

### Conversion Modes (`--method`)
- `auto`: default pipeline, fastest successful path
- `ai`: force Workers AI conversion path
- `browser`: force Browser Rendering for JS-heavy pages

### Output Modes
- default: print Markdown to stdout
- `--output <file>`: write Markdown to file
- `--deliver-md`: write `.md` output with wrapped content; useful for reasoning LLMs on long reads because it reduces format confusion:

```text
<url>
...markdown...
</url>
```

If `--deliver-md` is used without `--output`, filename is auto-generated from the URL.

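To make the auto-generated filename concrete, here is a small sketch that mirrors the `slugify_url` helper in `scripts/markdown_new_fetch.py`. The wrapper name `default_output_name` is hypothetical (the script does this inside `resolve_output_path`), and the sample URLs are only illustrations.

```python
import re
import urllib.parse


def slugify_url(url: str) -> str:
    """Turn a URL into a filesystem-friendly name built from host + path."""
    parsed = urllib.parse.urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    if not raw:
        raw = parsed.netloc or "page"
    # Collapse every run of characters outside [a-zA-Z0-9._-] into "_".
    slug = re.sub(r"[^a-zA-Z0-9._-]+", "_", raw).strip("_")
    return slug or "page"


def default_output_name(url: str) -> str:
    """Hypothetical wrapper: the name used when --output is omitted."""
    return f"{slugify_url(url)}.md"


if __name__ == "__main__":
    for sample in ("https://example.com", "https://example.com/docs/intro?x=1"):
        print(sample, "->", default_output_name(sample))
```

Note that the query string is not part of the name, so two URLs that differ only in their query parameters would map to the same file.
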
## How It Works

1. Validate that the input URL is an absolute `http` or `https` URL.
2. Call `POST https://markdown.new/` with `url`, `method`, and `retain_images`.
3. Accept the response as either raw Markdown or JSON with the Markdown in `content`.
4. Print response metadata (`x-markdown-tokens`, `x-rate-limit-remaining`, and any JSON metadata fields) to stderr.
5. Write to stdout by default, to a file with `--output`, or to a wrapped `.md` file with `--deliver-md`.

## Install Paths

- Codex (macOS/Linux): `~/.codex/skills/markdown-new`
- Claude Code (macOS/Linux): `~/.claude/skills/markdown-new`

## Install on macOS/Linux (single command)

The source path in the commands below is the author's local checkout; replace it with the path to your own copy of `markdown-new`.

### Codex

```bash
mkdir -p ~/.codex/skills && rm -rf ~/.codex/skills/markdown-new && cp -R /Users/pro16/Dropbox/experiments/skills-i-use/markdown-new ~/.codex/skills/
```

### Claude Code

```bash
mkdir -p ~/.claude/skills && rm -rf ~/.claude/skills/markdown-new && cp -R /Users/pro16/Dropbox/experiments/skills-i-use/markdown-new ~/.claude/skills/
```

## Quick Usage

```bash
python3 scripts/markdown_new_fetch.py 'https://example.com'
python3 scripts/markdown_new_fetch.py 'https://example.com' --method browser --retain-images --output page.md
python3 scripts/markdown_new_fetch.py 'https://example.com' --deliver-md
```

## Credits

- `webservervis` for the markdown conversion service powering this skill.
90
SKILL.md
Normal file
@@ -0,0 +1,90 @@
---
name: markdown-new
description: "Convert public web pages into clean Markdown with markdown.new for AI workflows."
---

# Markdown.new

Use this skill to convert public URLs into LLM-ready Markdown via [markdown.new](https://markdown.new).

## Path Resolution (Critical)

- Resolve relative paths like `scripts/...` and `references/...` from the skill directory, not workspace root.
- If current directory is unknown, use an absolute script path.

```bash
python3 ~/.codex/skills/markdown-new/scripts/markdown_new_fetch.py 'https://example.com'
```

```bash
cd ~/.codex/skills/markdown-new
python3 scripts/markdown_new_fetch.py 'https://example.com'
```

Avoid this pattern from an arbitrary workspace root:

```bash
python3 scripts/markdown_new_fetch.py 'https://example.com'
```

## Workflow

1. Validate the input URL is public `http` or `https`.
2. Run `scripts/markdown_new_fetch.py` with `--method auto` first.
3. Re-run with `--method browser` if output misses JS-rendered content.
4. Enable `--retain-images` only when image links are required.
5. Capture response metadata (`x-markdown-tokens`, `x-rate-limit-remaining`, and JSON metadata when present) for downstream planning.

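For step 5, the script prints metadata to stderr as `key: value` lines. The sketch below shows one way a caller might collect those lines after a successful run. The function names `parse_metadata_lines` and `fits_token_budget` are hypothetical and not part of the skill, and the token budget is an arbitrary example value.

```python
from typing import Dict


def parse_metadata_lines(stderr_text: str) -> Dict[str, str]:
    """Collect `key: value` lines from the script's stderr into a dict.

    Intended for successful runs (exit code 0); error messages on stderr
    are not metadata and should be handled separately.
    """
    meta: Dict[str, str] = {}
    for line in stderr_text.splitlines():
        # Split on the first ": " only, so values such as URLs stay intact.
        key, sep, value = line.partition(": ")
        if sep and key:
            meta[key.strip()] = value.strip()
    return meta


def fits_token_budget(meta: Dict[str, str], budget: int) -> bool:
    """Return True only when a numeric x-markdown-tokens value is within budget."""
    tokens = meta.get("x-markdown-tokens")
    return tokens is not None and tokens.isdigit() and int(tokens) <= budget


if __name__ == "__main__":
    sample = "x-markdown-tokens: 1234\nx-rate-limit-remaining: 499\n"
    meta = parse_metadata_lines(sample)
    print(meta, fits_token_budget(meta, budget=2000))
```

A missing or non-numeric token header is treated as "does not fit", which errs on the side of checking the output size manually.
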
## Quick Start

Commands below assume current directory is the skill root (`~/.codex/skills/markdown-new`).

```bash
python3 scripts/markdown_new_fetch.py 'https://example.com' > page.md
```

```bash
python3 scripts/markdown_new_fetch.py 'https://example.com' --method browser --retain-images --output page.md
```

```bash
python3 scripts/markdown_new_fetch.py 'https://example.com' --deliver-md
```

## Method Selection

- `auto`: default. Let markdown.new use its fastest successful pipeline.
- `ai`: force Workers AI HTML-to-Markdown conversion.
- `browser`: force headless browser rendering for JS-heavy pages.

Use `auto` first, then retry with `browser` only when needed.

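The "`auto` first, then `browser`" rule can be sketched as a small wrapper. This is only an illustration: `fetch_with_fallback` and `too_short` are hypothetical names, the `fetch` callable stands in for however you invoke `scripts/markdown_new_fetch.py`, and the 200-character threshold is an arbitrary guess at what "misses JS-rendered content" might look like.

```python
from typing import Callable, Tuple


def too_short(markdown: str, min_chars: int = 200) -> bool:
    """Hypothetical heuristic: very little text suggests missing JS content."""
    return len(markdown.strip()) < min_chars


def fetch_with_fallback(
    url: str,
    fetch: Callable[[str, str], str],
    looks_incomplete: Callable[[str], bool] = too_short,
) -> Tuple[str, str]:
    """Try method 'auto'; retry once with 'browser' if the result looks incomplete.

    `fetch(url, method)` must return the Markdown text for that method.
    Returns (markdown, method_used).
    """
    markdown = fetch(url, "auto")
    if not looks_incomplete(markdown):
        return markdown, "auto"
    return fetch(url, "browser"), "browser"


if __name__ == "__main__":
    # Stand-in fetcher so the sketch runs without network access.
    def fake_fetch(url: str, method: str) -> str:
        return "stub" if method == "auto" else "# Rendered page\n" + "text " * 100

    _, used = fetch_with_fallback("https://example.com", fake_fetch)
    print("method used:", used)
```

Each retry is an extra request against the daily quota, so a single fallback attempt is a deliberate limit here.
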
## Delivery Mode

- Use `--deliver-md` to force file output in `.md` format.
- In delivery mode, content is wrapped as:
  - `<url>`
  - `...markdown...`
  - `</url>`
- If `--output` is omitted, the script auto-generates a filename from the URL.

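A downstream consumer may want the Markdown body back without the wrapper. The following is a minimal sketch, assuming the file has exactly the `<url>` / `</url>` layout shown above; `unwrap_url_tag` is a hypothetical helper and is not shipped with the skill.

```python
def unwrap_url_tag(text: str) -> str:
    """Return the Markdown body of a --deliver-md file.

    Text that is not wrapped in <url>...</url> is returned unchanged.
    """
    prefix, suffix = "<url>\n", "\n</url>"
    stripped = text.rstrip("\n")
    if stripped.startswith(prefix) and stripped.endswith(suffix):
        return stripped[len(prefix):-len(suffix)]
    return text


if __name__ == "__main__":
    wrapped = "<url>\n# Title\n\nBody text\n</url>\n"
    print(unwrap_url_tag(wrapped))
```
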
## API Modes

- Prefix mode:
  - `https://markdown.new/https://example.com?method=browser&retain_images=true`
- POST mode:
  - `POST https://markdown.new/`
  - JSON body: `{"url":"https://example.com","method":"auto","retain_images":false}`

Prefer POST mode for automation and explicit parameters.

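For completeness, a prefix-mode URL like the one above could be assembled as follows. This is a sketch under assumptions: `build_prefix_url` is a hypothetical helper, and because these notes do not say how the service separates its own parameters from a target URL's query string, the helper simply refuses targets that already contain `?`.

```python
import urllib.parse

PREFIX_BASE = "https://markdown.new/"


def build_prefix_url(target_url: str, method: str = "auto", retain_images: bool = False) -> str:
    """Build a prefix-mode URL: base + target URL + method/retain_images query."""
    if method not in {"auto", "ai", "browser"}:
        raise ValueError(f"Unknown method: {method!r}")
    if "?" in target_url:
        # Assumption: ambiguous case, not covered by the notes above.
        raise ValueError("Target URL has its own query string; use POST mode instead.")
    query = urllib.parse.urlencode(
        {"method": method, "retain_images": "true" if retain_images else "false"}
    )
    return f"{PREFIX_BASE}{target_url}?{query}"


if __name__ == "__main__":
    print(build_prefix_url("https://example.com", method="browser", retain_images=True))
```

The ambiguity around query strings is one practical reason to prefer POST mode, where every parameter is an explicit JSON field.
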
## Limits And Safety

- Treat `429` as rate limiting (documented limit: 500 requests/day/IP).
- Convert only publicly accessible pages.
- Respect `robots.txt`, terms of service, and copyright constraints.
- Do not treat markdown.new output as guaranteed complete for every page; verify critical extractions.

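One way to act on the `robots.txt` point before submitting a URL is a local pre-check with Python's standard `urllib.robotparser`. This is a sketch, not part of the skill: `allowed_by_robots` is a hypothetical helper, fetching the site's `robots.txt` is left to the caller, and the user agent that markdown.new itself sends is not stated here, so the wildcard agent `*` is an assumption.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, page_url: str, user_agent: str = "*") -> bool:
    """Check a page URL against robots.txt content that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(allowed_by_robots(rules, "https://example.com/public/page"))
    print(allowed_by_robots(rules, "https://example.com/private/page"))
```

A passing check covers only `robots.txt`; terms of service and copyright still need a human decision.
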
## References

- `references/markdown-new-api.md`
6
_meta.json
Normal file
@@ -0,0 +1,6 @@
{
  "ownerId": "kn7a3xe57dnqxgg5kfp78twfj98022a5",
  "slug": "markdown-convert",
  "version": "1.0.0",
  "publishedAt": 1771528764013
}
4
agents/openai.yaml
Normal file
@@ -0,0 +1,4 @@
interface:
  display_name: "Markdown.new"
  short_description: "Convert web URLs to LLM-ready Markdown"
  default_prompt: "Use $markdown-new to fetch clean Markdown from public URLs for summarization, RAG, or extraction tasks."
56
references/markdown-new-api.md
Normal file
@@ -0,0 +1,56 @@
# markdown.new API Reference Notes

## Endpoints

- Prefix conversion: `GET https://markdown.new/<absolute-url>`
- API conversion: `POST https://markdown.new/`

## POST Request

- Content type: `application/json`
- Body fields:
  - `url` (string, required): public URL to convert
  - `method` (string, optional): `auto` (default), `ai`, or `browser`
  - `retain_images` (boolean, optional): `false` (default)

Example:

```json
{
  "url": "https://example.com",
  "method": "auto",
  "retain_images": false
}
```

## Response

- Status: `200` on success
- Prefix mode typically returns Markdown (`text/markdown`)
- POST mode may return JSON (`application/json`) with Markdown in `content`
- Common headers:
  - `x-markdown-tokens`: estimated token count for the returned Markdown
  - `x-rate-limit-remaining`: remaining requests for current daily quota

## Conversion Pipeline (as documented)

1. Request native Markdown via `Accept: text/markdown`
2. Fall back to Workers AI `toMarkdown()` when HTML is returned
3. Fall back to Browser Rendering for JS-heavy pages

## Operational Notes

- Documented rate limit: 500 requests/day per IP
- `429` indicates rate-limit exhaustion
- Public URLs only; authenticated/paywalled pages may fail
- Browser rendering usually adds latency compared with `auto`/`ai`

## Skill Script Notes

- `scripts/markdown_new_fetch.py --deliver-md` writes a `.md` file and wraps the markdown body with pseudo-XML tags:

```text
<url>
...markdown...
</url>
```
216
scripts/markdown_new_fetch.py
Normal file
@@ -0,0 +1,216 @@
#!/usr/bin/env python3
"""Convert public URLs to Markdown through markdown.new."""

import argparse
import json
import re
import sys
import urllib.error
import urllib.parse
import urllib.request
from pathlib import Path
from typing import Dict, Optional, Tuple


DEFAULT_API_URL = "https://markdown.new/"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Fetch Markdown from a public URL using markdown.new"
    )
    parser.add_argument("url", help="Public URL to convert (http/https)")
    parser.add_argument(
        "--method",
        choices=["auto", "ai", "browser"],
        default="auto",
        help="Conversion method to request (default: auto)",
    )
    parser.add_argument(
        "--retain-images",
        action="store_true",
        help="Request image retention in output markdown",
    )
    parser.add_argument(
        "--output",
        help="Write markdown to this file instead of stdout",
    )
    parser.add_argument(
        "--timeout",
        type=float,
        default=45.0,
        help="Request timeout in seconds (default: 45)",
    )
    parser.add_argument(
        "--api-url",
        default=DEFAULT_API_URL,
        help="markdown.new API endpoint (default: https://markdown.new/)",
    )
    parser.add_argument(
        "--show-headers",
        action="store_true",
        help="Print response headers to stderr",
    )
    parser.add_argument(
        "--deliver-md",
        action="store_true",
        help=(
            "Write output into a .md file and wrap content in pseudo-XML "
            "<url>...</url> tags"
        ),
    )
    return parser


def validate_url(url: str) -> None:
    parsed = urllib.parse.urlparse(url)
    if parsed.scheme not in {"http", "https"} or not parsed.netloc:
        raise ValueError(f"Invalid URL: {url!r}. Use absolute http/https URL.")


def build_request(api_url: str, payload: Dict[str, object]) -> urllib.request.Request:
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=data,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "text/markdown",
            "User-Agent": "codex-markdown-new-skill/1.0",
        },
    )


def normalize_body(body: str, headers: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    content_type = headers.get("content-type", "")
    if "application/json" not in content_type:
        return body, {}

    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return body, {}

    if not isinstance(payload, dict):
        return body, {}

    markdown = payload.get("content")
    if not isinstance(markdown, str):
        return body, {}

    metadata: Dict[str, str] = {}
    for key in ("title", "url", "method", "duration_ms", "timestamp"):
        value = payload.get(key)
        if value is not None:
            metadata[f"response_{key}"] = str(value)

    return markdown, metadata


def print_metadata(
    headers: Dict[str, str], show_headers: bool, response_meta: Dict[str, str]
) -> None:
    tokens = headers.get("x-markdown-tokens")
    remaining = headers.get("x-rate-limit-remaining")

    if tokens:
        print(f"x-markdown-tokens: {tokens}", file=sys.stderr)
    if remaining:
        print(f"x-rate-limit-remaining: {remaining}", file=sys.stderr)
    for key, value in sorted(response_meta.items()):
        print(f"{key}: {value}", file=sys.stderr)

    if show_headers:
        for key, value in sorted(headers.items()):
            print(f"{key}: {value}", file=sys.stderr)


def slugify_url(url: str) -> str:
    parsed = urllib.parse.urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    if not raw:
        raw = parsed.netloc or "page"
    slug = re.sub(r"[^a-zA-Z0-9._-]+", "_", raw).strip("_")
    return slug or "page"


def resolve_output_path(url: str, output_path: Optional[str], deliver_md: bool) -> Optional[str]:
    if output_path:
        if deliver_md and not output_path.lower().endswith(".md"):
            output_path = f"{output_path}.md"
        return output_path

    if deliver_md:
        return f"{slugify_url(url)}.md"

    return None


def wrap_in_url_tag(markdown: str) -> str:
    body = markdown.rstrip("\n")
    return f"<url>\n{body}\n</url>\n"


def write_output(markdown: str, output_path: Optional[str]) -> None:
    if output_path:
        path = Path(output_path)
        if path.parent != Path("."):
            path.parent.mkdir(parents=True, exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            f.write(markdown)
        return
    sys.stdout.write(markdown)


def main() -> int:
    parser = build_parser()
    args = parser.parse_args()

    try:
        validate_url(args.url)
    except ValueError as exc:
        print(str(exc), file=sys.stderr)
        return 2

    payload = {
        "url": args.url,
        "method": args.method,
        "retain_images": bool(args.retain_images),
    }

    req = build_request(args.api_url, payload)

    try:
        with urllib.request.urlopen(req, timeout=args.timeout) as resp:
            headers = {k.lower(): v for k, v in resp.headers.items()}
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        error_body = err.read().decode("utf-8", errors="replace")
        print(f"HTTP {err.code} from markdown.new", file=sys.stderr)
        if err.code == 429:
            print(
                "Rate limit reached (documented: 500 requests/day/IP). Retry later.",
                file=sys.stderr,
            )
        if error_body:
            print(error_body.strip(), file=sys.stderr)
        return 1
    except urllib.error.URLError as err:
        print(f"Request failed: {err.reason}", file=sys.stderr)
        return 1

    markdown, response_meta = normalize_body(body, headers)
    print_metadata(headers, args.show_headers, response_meta)

    content = wrap_in_url_tag(markdown) if args.deliver_md else markdown
    output_path = resolve_output_path(args.url, args.output, args.deliver_md)
    write_output(content, output_path)

    if args.deliver_md and output_path:
        print(f"output_file: {Path(output_path).resolve()}", file=sys.stderr)
    return 0


if __name__ == "__main__":
    raise SystemExit(main())