From f727ce26b6145bbffe4025c18d99976df8f3d0a8 Mon Sep 17 00:00:00 2001
From: zlei9
Date: Sun, 29 Mar 2026 14:34:36 +0800
Subject: [PATCH] Initial commit with translated description

---
 SKILL.md                   | 206 ++++++++++++++++++++++++++++++++
 _meta.json                 |   6 +
 references/failed_tests.md | 236 +++++++++++++++++++++++++++++++++++++
 references/methodology.md  | 227 +++++++++++++++++++++++++++++++++++
 4 files changed, 675 insertions(+)
 create mode 100644 SKILL.md
 create mode 100644 _meta.json
 create mode 100644 references/failed_tests.md
 create mode 100644 references/methodology.md

diff --git a/SKILL.md b/SKILL.md
new file mode 100644
index 0000000..54254bd
--- /dev/null
+++ b/SKILL.md
@@ -0,0 +1,206 @@
+---
+name: backtest-expert
+description: "Expert guidance for systematically backtesting trading strategies."
+---
+
+# Backtest Expert
+
+Systematic approach to backtesting trading strategies based on professional methodology that prioritizes robustness over optimistic results.
+
+## Core Philosophy
+
+**Goal**: Find strategies that "break the least", not strategies that "profit the most" on paper.
+
+**Principle**: Add friction, stress test assumptions, and see what survives. If a strategy holds up under pessimistic conditions, it's more likely to work in live trading.
+
+## When to Use This Skill
+
+Use this skill when:
+- Developing or validating systematic trading strategies
+- Evaluating whether a trading idea is robust enough for live implementation
+- Troubleshooting why a backtest might be misleading
+- Learning proper backtesting methodology
+- Avoiding common pitfalls (curve-fitting, look-ahead bias, survivorship bias)
+- Assessing parameter sensitivity and regime dependence
+- Setting realistic expectations for slippage and execution costs
+
+## Backtesting Workflow
+
+### 1. State the Hypothesis
+
+Define the edge in one sentence.
+
+**Example**: "Stocks that gap up >3% on earnings and pull back to the previous day's close within the first hour provide a mean-reversion opportunity."
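A hypothesis stated this crisply can also be pinned down as data rather than prose, which makes the zero-discretion requirement of the next step concrete. A minimal sketch (the class, field names, and thresholds below are illustrative, not prescribed by this skill):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GapFadeRules:
    """The example hypothesis pinned down as zero-discretion parameters."""
    min_gap_pct: float = 3.0          # entry filter: earnings gap-up size
    entry_window_minutes: int = 60    # only act within the first hour
    entry_trigger: str = "touch_prev_close"  # buy when prior close is touched
    stop_loss_pct: float = 2.0        # illustrative stop, to be stress tested
    profit_target_pct: float = 2.0    # illustrative target
    max_hold_days: int = 1            # time-based exit backstop
    position_pct_of_portfolio: float = 2.0

rules = GapFadeRules()
print(rules)
```

If a field cannot be filled in without saying "it depends", the hypothesis is not yet specific enough to test.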
+If you can't articulate the edge clearly, don't proceed to testing.
+
+### 2. Codify Rules with Zero Discretion
+
+Define with complete specificity:
+- **Entry**: Exact conditions, timing, price type
+- **Exit**: Stop loss, profit target, time-based exit
+- **Position sizing**: Fixed dollar amount, % of portfolio, volatility-adjusted
+- **Filters**: Market cap, volume, sector, volatility conditions
+- **Universe**: What instruments are eligible
+
+**Critical**: No subjective judgment allowed. Every decision must be rule-based and unambiguous.
+
+### 3. Run Initial Backtest
+
+Test over:
+- **Minimum 5 years** (preferably 10+)
+- **Multiple market regimes** (bull, bear, high/low volatility)
+- **Realistic costs**: Commissions + conservative slippage
+
+Examine initial results for basic viability. If the strategy is fundamentally broken, iterate on the hypothesis.
+
+### 4. Stress Test the Strategy
+
+This is where 80% of testing time should be spent.
+
+**Parameter sensitivity**:
+- Test stop loss at 50%, 75%, 100%, 125%, 150% of baseline
+- Test profit target at 80%, 90%, 100%, 110%, 120% of baseline
+- Vary entry/exit timing by ±15-30 minutes
+- Look for "plateaus" of stable performance, not narrow spikes
+
+**Execution friction**:
+- Increase slippage to 1.5-2x typical estimates
+- Model worst-case fills (buy at ask+1 tick, sell at bid-1 tick)
+- Add realistic order rejection scenarios
+- Test with pessimistic commission structures
+
+**Time robustness**:
+- Analyze year-by-year performance
+- Require positive expectancy in majority of years
+- Ensure strategy doesn't rely on 1-2 exceptional periods
+- Test in different market regimes separately
+
+**Sample size**:
+- Absolute minimum: 30 trades
+- Preferred: 100+ trades
+- High confidence: 200+ trades
+
+### 5. Out-of-Sample Validation
+
+**Walk-forward analysis**:
+1. Optimize on training period (e.g., Year 1-3)
+2. Test on validation period (Year 4)
+3. Roll forward and repeat
+4. 
Compare in-sample vs out-of-sample performance + +**Warning signs**: +- Out-of-sample <50% of in-sample performance +- Need frequent parameter re-optimization +- Parameters change dramatically between periods + +### 6. Evaluate Results + +**Questions to answer**: +- Does edge survive pessimistic assumptions? +- Is performance stable across parameter variations? +- Does strategy work in multiple market regimes? +- Is sample size sufficient for statistical confidence? +- Are results realistic, not "too good to be true"? + +**Decision criteria**: +- ✅ **Deploy**: Survives all stress tests with acceptable performance +- 🔄 **Refine**: Core logic sound but needs parameter adjustment +- ❌ **Abandon**: Fails stress tests or relies on fragile assumptions + +## Key Testing Principles + +### Punish the Strategy + +Add friction everywhere: +- Commissions higher than reality +- Slippage 1.5-2x typical +- Worst-case fills +- Order rejections +- Partial fills + +**Rationale**: Strategies that survive pessimistic assumptions often outperform in live trading. + +### Seek Plateaus, Not Peaks + +Look for parameter ranges where performance is stable, not optimal values that create performance spikes. + +**Good**: Strategy profitable with stop loss anywhere from 1.5% to 3.0% +**Bad**: Strategy only works with stop loss at exactly 2.13% + +Stable performance indicates genuine edge; narrow optima suggest curve-fitting. + +### Test All Cases, Not Cherry-Picked Examples + +**Wrong approach**: Study hand-picked "market leaders" that worked +**Right approach**: Test every stock that met criteria, including those that failed + +Selective examples create survivorship bias and overestimate strategy quality. + +### Separate Idea Generation from Validation + +**Intuition**: Useful for generating hypotheses +**Validation**: Must be purely data-driven + +Never let attachment to an idea influence interpretation of test results. 
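The plateau-versus-peak idea can be checked mechanically: sweep one parameter and ask whether every neighboring value stays within some tolerance of the best one. A minimal sketch with made-up sweep results (`is_plateau`, the tolerance, and all numbers are illustrative assumptions):

```python
def is_plateau(sweep, tolerance=0.3):
    """Rough plateau test: every swept value must perform within
    `tolerance` (fractional drop) of the best value in the sweep."""
    best = max(sweep.values())
    if best <= 0:
        return False
    return all(v >= best * (1 - tolerance) for v in sweep.values())

# Expectancy (avg % per trade) at each stop-loss distance, from a sweep.
stable = {1.5: 0.42, 2.0: 0.48, 2.5: 0.51, 3.0: 0.45}     # broad plateau
fragile = {1.5: -0.10, 2.0: 0.05, 2.13: 0.60, 2.5: 0.02}  # narrow spike

print(is_plateau(stable))   # True  -> consistent with a real edge
print(is_plateau(fragile))  # False -> smells like curve-fitting
```

The tolerance is a judgment call; the point is that the question "is there a plateau?" should be answered by a rule applied to the whole sweep, not by eyeballing the single best cell.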
+ +## Common Failure Patterns + +Recognize these patterns early to save time: + +1. **Parameter sensitivity**: Only works with exact parameter values +2. **Regime-specific**: Great in some years, terrible in others +3. **Slippage sensitivity**: Unprofitable when realistic costs added +4. **Small sample**: Too few trades for statistical confidence +5. **Look-ahead bias**: "Too good to be true" results +6. **Over-optimization**: Many parameters, poor out-of-sample results + +See `references/failed_tests.md` for detailed examples and diagnostic framework. + +## Available Reference Documentation + +### Methodology Reference +**File**: `references/methodology.md` + +**When to read**: For detailed guidance on specific testing techniques. + +**Contents**: +- Stress testing methods +- Parameter sensitivity analysis +- Slippage and friction modeling +- Sample size requirements +- Market regime classification +- Common biases and pitfalls (survivorship, look-ahead, curve-fitting, etc.) + +### Failed Tests Reference +**File**: `references/failed_tests.md` + +**When to read**: When strategy fails tests, or learning from past mistakes. + +**Contents**: +- Why failures are valuable +- Common failure patterns with examples +- Case study documentation framework +- Red flags checklist for evaluating backtests + +## Critical Reminders + +**Time allocation**: Spend 20% generating ideas, 80% trying to break them. + +**Context-free requirement**: If strategy requires "perfect context" to work, it's not robust enough for systematic trading. + +**Red flag**: If backtest results look too good (>90% win rate, minimal drawdowns, perfect timing), audit carefully for look-ahead bias or data issues. + +**Tool limitations**: Understand your backtesting platform's quirks (interpolation methods, handling of low liquidity, data alignment issues). + +**Statistical significance**: Small edges require large sample sizes to prove. 5% edge per trade needs 100+ trades to distinguish from luck. 
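The statistical-significance reminder can be made concrete with an exact binomial test against a no-edge null. A sketch using only the standard library (the trade counts and the fair-coin null below are illustrative assumptions, not figures from this skill):

```python
from math import comb

def p_value_at_least(wins, trades, null_win_rate=0.5):
    """One-sided exact binomial test: probability of seeing `wins` or
    more winners in `trades` trades if the strategy had no edge."""
    return sum(comb(trades, k)
               * null_win_rate**k
               * (1 - null_win_rate)**(trades - k)
               for k in range(wins, trades + 1))

# A mid-50s% win rate is indistinguishable from luck at 30 trades,
# but becomes persuasive with a few hundred.
print(round(p_value_at_least(17, 30), 3))    # ~0.29: could easily be luck
print(round(p_value_at_least(115, 200), 3))  # ~0.02: unlikely to be luck
```

This is why the sample-size thresholds above matter: the same observed win rate carries very different evidence depending on the number of trades behind it.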
+ +## Discretionary vs Systematic Differences + +This skill focuses on **systematic/quantitative** backtesting where: +- All rules are codified in advance +- No discretion or "feel" in execution +- Testing happens on all historical examples, not cherry-picked cases +- Context (news, macro) is deliberately stripped out + +Discretionary traders study differently—this skill may not apply to setups requiring subjective judgment. diff --git a/_meta.json b/_meta.json new file mode 100644 index 0000000..a88645d --- /dev/null +++ b/_meta.json @@ -0,0 +1,6 @@ +{ + "ownerId": "kn7agf701n3afzzbq8ge0wa8k1809wm4", + "slug": "backtest-expert", + "version": "0.1.0", + "publishedAt": 1769870095738 +} \ No newline at end of file diff --git a/references/failed_tests.md b/references/failed_tests.md new file mode 100644 index 0000000..f7ea7e5 --- /dev/null +++ b/references/failed_tests.md @@ -0,0 +1,236 @@ +# Learning from Failed Backtests + +## Table of Contents + +1. Why Failed Ideas Are Valuable +2. Common Failure Patterns +3. Case Study Framework +4. Red Flags Checklist + +## 1. Why Failed Ideas Are Valuable + +### The Value of Failures + +**Key insights**: +- Failed tests save capital by preventing live implementation +- Failure patterns reveal which assumptions don't hold +- Understanding what doesn't work narrows the search space +- Failed tests build experience in recognizing fragile strategies + +### Documentation Discipline + +**Record for each failed idea**: +- The hypothesis being tested +- Why you thought it would work +- What the data showed +- Specific breaking points +- Lessons learned + +**Purpose**: Build a library of "anti-patterns" to avoid repeating mistakes. + +## 2. Common Failure Patterns + +### Pattern 1: Parameter Sensitivity + +**Symptom**: Strategy only works with very specific parameter values. 
+ +**Example scenario**: +- Strategy profitable with stop loss at exactly 2.5% +- Increasing to 3% or decreasing to 2% causes significant performance drop +- No "plateau" of stable performance + +**Why it fails**: Real markets have noise; if small changes break the strategy, it likely captured noise, not signal. + +**Lesson**: Seek strategies with stable performance across parameter ranges. + +### Pattern 2: Regime-Specific Performance + +**Symptom**: Strategy works brilliantly in some years, terribly in others. + +**Example scenario**: +- Great performance in 2017-2019 (low volatility bull market) +- Catastrophic losses in 2020 (high volatility) +- Poor performance in 2022 (downtrend) + +**Why it fails**: Strategy dependent on specific market conditions, not robust enough for diverse environments. + +**Lesson**: Require acceptable (not necessarily best) performance across all regimes. + +### Pattern 3: Slippage Sensitivity + +**Symptom**: Strategy becomes unprofitable when realistic trading costs added. + +**Example scenario**: +- Backtest shows 0.5% average gain per trade +- Adding 0.1% slippage per side (0.2% round-trip) eliminates profits +- Strategy requires unrealistic fills to be profitable + +**Why it fails**: Edge too small to survive real-world friction. + +**Lesson**: Edge must be large enough to survive pessimistic assumptions about costs. + +### Pattern 4: Sample Size Issues + +**Symptom**: Strong results based on small number of trades. + +**Example scenario**: +- Backtest shows 80% win rate +- Only 15 total trades in 5 years +- A few different outcomes would dramatically change results + +**Why it fails**: Insufficient data to distinguish edge from luck. + +**Lesson**: Require minimum 100 trades for meaningful conclusions, preferably 200+. + +### Pattern 5: Look-Ahead Bias + +**Symptom**: Perfect or near-perfect backtest results. 
+ +**Example scenario**: +- Strategy shows 95%+ win rate +- Unrealistically good entry/exit timing +- Performance too good to be realistic + +**Why it fails**: Likely using information not available at time of trade. + +**Lesson**: Be suspicious of "too good to be true" results; audit data alignment carefully. + +### Pattern 6: Over-Optimization (Curve Fitting) + +**Symptom**: Complex strategy with many parameters shows excellent in-sample results but poor out-of-sample. + +**Example scenario**: +- Strategy uses 8-10 different indicators with specific thresholds +- In-sample performance: 40% annual return +- Out-of-sample performance: -5% annual return +- Parameters needed constant re-optimization + +**Why it fails**: Fitted to historical noise rather than genuine market structure. + +**Lesson**: Prefer simple strategies with fewer parameters; demand strong out-of-sample results. + +## 3. Case Study Framework + +### Template for Documenting Failed Ideas + +Use this framework when a backtest fails: + +#### 1. Initial Hypothesis +- **What edge were you trying to capture?** +- **Why did you think this would work?** +- **What was the logical basis?** + +#### 2. Implementation Details +- **Entry rules** (specific and complete) +- **Exit rules** (stop loss, profit target, time-based) +- **Position sizing** +- **Filters or conditions** + +#### 3. Test Results +- **Basic metrics**: + - Total trades + - Win rate + - Average win/loss + - Max drawdown + - Annual returns by year + +- **Parameter sensitivity**: + - How results changed with parameter variations + - Whether "plateau" of stable performance existed + +- **Regime analysis**: + - Performance in different market conditions + - Which regimes caused problems + +#### 4. Breaking Points +- **What specifically caused the strategy to fail?** + - Slippage too high? + - Parameter sensitivity? + - Regime-specific? + - Insufficient sample size? + +#### 5. 
Lessons Learned +- **What assumptions were wrong?** +- **What would you test differently next time?** +- **Are there salvageable elements?** + +### Example: Failed Momentum Reversal Strategy + +#### 1. Initial Hypothesis +Tried to capture mean reversion after strong momentum moves. Hypothesis: Stocks that gap up 5%+ on earnings often pull back 2-3% before continuing, providing short-term reversal opportunity. + +#### 2. Implementation +- Entry: Short when stock gaps up 5%+ on earnings at market open +- Exit: Cover at 2% profit or 3% stop loss +- Holding period: Maximum 3 days +- Filters: Market cap >$2B, average volume >500K shares + +#### 3. Test Results +- 67 trades over 5 years +- Win rate: 58% +- Avg win: 2.1%, Avg loss: 3.2% +- Max drawdown: 18% +- 2019-2021: Profitable +- 2022-2023: Significant losses + +#### 4. Breaking Points +- Strategy failed during strong momentum environments (2021 meme stocks) +- Stop losses hit frequently during continued upward momentum +- Gap-ups that continued higher immediately caused outsized losses +- Small sample size (67 trades) provided low statistical confidence +- Slippage on short entries during high volatility eliminated thin edge + +#### 5. Lessons Learned +- Mean reversion strategies vulnerable during momentum regimes +- Need regime filter (e.g., only trade during high VIX or weak market) +- 5-year test insufficient for momentum strategies; need 10+ years +- Edge too small (2% target vs 3% stop) to survive slippage +- Better approach: Wait for actual pullback, then enter, rather than fade immediately + +## 4. Red Flags Checklist + +Use this checklist when evaluating any backtest: + +### Data Quality Issues +- [ ] Has survivorship bias been addressed? +- [ ] Are delisted stocks included in test? +- [ ] Is data alignment correct (no look-ahead bias)? +- [ ] Are corporate actions (splits, dividends) handled correctly? + +### Sample Size Concerns +- [ ] At least 100 trades? (Preferably 200+) +- [ ] At least 5 years of data? 
(Preferably 10+) +- [ ] Includes full market cycle? +- [ ] Tested across multiple market regimes? + +### Parameter Robustness +- [ ] Does strategy work with nearby parameter values? +- [ ] Are there "plateaus" of stable performance? +- [ ] Minimal parameters (ideally <5)? +- [ ] Parameters based on logical reasoning, not pure optimization? + +### Execution Realism +- [ ] Realistic commissions included? +- [ ] Slippage modeled conservatively (1.5-2x typical)? +- [ ] Worst-case fills considered? +- [ ] Order rejection/partial fills addressed? + +### Performance Characteristics +- [ ] Positive expectancy in majority of years? +- [ ] Acceptable performance in all major regimes? +- [ ] No catastrophic drawdowns (>50%)? +- [ ] Edge large enough to survive friction? + +### Bias Prevention +- [ ] Strategy defined before testing? +- [ ] Hypothesis has economic logic? +- [ ] Results aren't "too good to be true"? +- [ ] Out-of-sample testing performed? +- [ ] No cherry-picking of examples? + +### Tool Limitations +- [ ] Aware of testing platform's interpolation methods? +- [ ] Understand how platform handles low-liquidity situations? +- [ ] Know quirks specific to data provider? + +**If more than 2-3 items aren't checked, the backtest requires additional work before considering live implementation.** diff --git a/references/methodology.md b/references/methodology.md new file mode 100644 index 0000000..633654f --- /dev/null +++ b/references/methodology.md @@ -0,0 +1,227 @@ +# Backtesting Methodology Reference + +## Table of Contents + +1. Core Testing Techniques +2. Stress Testing Methods +3. Parameter Sensitivity Analysis +4. Slippage and Friction Modeling +5. Sample Size Guidelines +6. Market Regime Analysis +7. Common Pitfalls and Biases + +## 1. Core Testing Techniques + +### "Beat Ideas to Death" Approach + +**Core principle**: Add friction and punishment to find strategies that break the least, not those that profit the most on paper. 
+ +**Key techniques**: +- Multiple stop loss variations +- Different profit targets +- Realistic + exaggerated commissions +- Worst-case fills +- Extended time periods +- Multiple market regimes + +### The 80/20 Rule for R&D Time + +- 20% generating and codifying ideas +- 80% stress testing and trying to break them + +## 2. Stress Testing Methods + +### Execution Friction Tests + +**Required friction additions**: +- Realistic commissions (actual broker rates) +- Pessimistic slippage (1.5-2x typical) +- Worst-case entry fills (ask + 1-2 ticks) +- Worst-case exit fills (bid - 1-2 ticks) +- Order rejection scenarios +- Partial fills + +### Parameter Robustness Tests + +Test across multiple configurations: +- Entry timing variations (±15-30 minutes) +- Stop loss distances (50%, 75%, 100%, 125%, 150% of baseline) +- Profit targets (80%, 90%, 100%, 110%, 120% of baseline) +- Position sizing rules +- Filter thresholds + +**Goal**: Find "plateau" performance where small parameter changes don't drastically alter results. + +### Time-Based Robustness + +**Minimum requirements**: +- Test across at least 5-10 years +- Include multiple market regimes: + - Bull markets + - Bear markets + - High volatility periods + - Low volatility periods + - Trending markets + - Range-bound markets + +**Year-by-year analysis**: Strategy should show positive expectancy in majority of years, not rely on 1-2 exceptional years. + +## 3. Parameter Sensitivity Analysis + +### Heat Map Analysis + +Create 2D heat maps varying two parameters simultaneously: +- Profit target (rows) × Stop loss (columns) +- Entry time (rows) × Exit time (columns) +- Volatility filter (rows) × Volume filter (columns) + +**Interpretation**: +- Robust strategies show "plateaus" of consistent performance +- Fragile strategies show "spikes" or narrow optimal ranges +- Avoid strategies with performance cliffs at parameter boundaries + +### Walk-Forward Analysis + +1. Optimize parameters on training period (e.g., Year 1-2) +2. 
Test with those parameters on validation period (Year 3) +3. Roll forward and repeat +4. Compare in-sample vs out-of-sample performance + +**Warning signs**: +- Out-of-sample performance <50% of in-sample +- Frequent need to re-optimize parameters +- Parameters that change dramatically between periods + +## 4. Slippage and Friction Modeling + +### Realistic Slippage Assumptions + +**By market capitalization**: +- Mega cap (>$200B): 0.01-0.02% +- Large cap ($10B-$200B): 0.02-0.05% +- Mid cap ($2B-$10B): 0.05-0.10% +- Small cap ($300M-$2B): 0.10-0.20% +- Micro cap (<$300M): 0.20-0.50%+ + +**By order type**: +- Market orders: Higher slippage +- Limit orders: Lower slippage but potential non-fills +- Stop orders: Significant slippage in volatile conditions + +### Conservative Testing Approach + +Use 1.5-2x typical slippage estimates for stress testing: +- If typical slippage is 0.05%, test with 0.075-0.10% +- If typical is 0.10%, test with 0.15-0.20% + +**Rationale**: Strategies that survive pessimistic assumptions often perform better in practice than in backtests. + +## 5. Sample Size Guidelines + +### Minimum Trade Requirements + +**Statistical significance thresholds**: +- Absolute minimum: 30 trades +- Preferred minimum: 100 trades +- High confidence: 200+ trades + +**Why large samples matter**: +- Reduces impact of outliers +- Provides statistical confidence +- Reveals true edge vs luck + +### Time Period Considerations + +**Minimum testing period**: 5 years +**Preferred testing period**: 10+ years + +**Must include**: +- At least one full market cycle +- Multiple volatility regimes +- Different Federal Reserve policy environments + +## 6. 
Market Regime Analysis + +### Regime Classification + +**Volatility-based regimes**: +- Low volatility: VIX <15 +- Normal volatility: VIX 15-25 +- High volatility: VIX 25-35 +- Extreme volatility: VIX >35 + +**Trend-based regimes**: +- Strong uptrend: Market +10%+ over 6 months +- Moderate uptrend: Market +5% to +10% over 6 months +- Sideways: Market -5% to +5% over 6 months +- Downtrend: Market <-5% over 6 months + +### Performance Requirements by Regime + +**Robust strategy characteristics**: +- Positive expectancy in majority of regimes +- Acceptable (not necessarily best) in all regimes +- No catastrophic failures in any single regime +- Understanding of which regime causes weakness + +## 7. Common Pitfalls and Biases + +### Survivorship Bias + +**Issue**: Testing only on currently-trading stocks ignores delisted/bankrupt companies. + +**Solution**: Use survivorship-bias-free datasets that include historical delistings. + +### Look-Ahead Bias + +**Issue**: Using information not available at the time of trade. + +**Examples**: +- Using EOD data for intraday decisions +- Using next-day's open for today's close decisions +- Calculating indicators with future data points + +**Prevention**: Strict timestamp control and data alignment checks. + +### Curve-Fitting (Over-Optimization) + +**Warning signs**: +- Too many parameters (>5-7) +- Highly specific parameter values (e.g., RSI = 37.3) +- Perfect backtest results +- Large performance drop in validation period + +**Prevention techniques**: +- Limit parameters to essential ones only +- Use round numbers when possible +- Require out-of-sample testing +- Analyze parameter sensitivity + +### Sample Selection Bias + +**Issue**: Testing only on hand-picked examples (e.g., known market leaders). + +**Problem**: Ignoring all stocks that met criteria but failed creates false impression of strategy quality. + +**Solution**: Test on ALL historical examples meeting the criteria, not just successful outcomes. 
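The fix for sample selection bias is mechanical: derive the test set from the entry criteria alone, never from outcomes. A minimal sketch (the data layout and numbers are invented for illustration):

```python
# Each record: (symbol, met_entry_criteria, realized_return_pct).
history = [
    ("AAA", True, 4.2), ("BBB", True, -2.9), ("CCC", False, 1.1),
    ("DDD", True, -3.5), ("EEE", True, 5.0), ("FFF", True, -1.0),
]

def average(xs):
    return sum(xs) / len(xs)

# Wrong: filtering on the outcome column sneaks hindsight into the sample.
cherry_picked = [r for _, met, r in history if met and r > 0]

# Right: the sample is defined by the entry criteria alone; losers stay in.
full_sample = [r for _, met, r in history if met]

print(average(cherry_picked))  # flattering
print(average(full_sample))    # honest expectancy, ~0.36 here
```

Any selection rule that mentions the trade's result belongs in the "wrong" bucket, however it is phrased.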
+ +### Hindsight Bias + +**Issue**: Using outcome knowledge to influence decisions. + +**Prevention for systematic trading**: +- Define all rules in advance +- No manual intervention based on hindsight +- Test rules across all cases, not cherry-picked examples + +### Data Mining Bias + +**Issue**: Testing hundreds of strategies until finding one that "works" by random chance. + +**Risk**: With enough attempts, random data will produce seemingly profitable patterns. + +**Mitigation**: +- Have hypothesis before testing +- Require economic logic for the edge +- Use Bonferroni correction for multiple comparisons +- Demand higher significance thresholds (p < 0.01 instead of p < 0.05)
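The Bonferroni correction mentioned above amounts to dividing the significance threshold by the number of strategies tried. A minimal sketch (the p-values are invented for illustration):

```python
def bonferroni_survivors(p_values, alpha=0.05):
    """Keep only strategies that remain significant after dividing
    alpha by the number of variants tested (Bonferroni correction)."""
    cutoff = alpha / len(p_values)
    return {name: p for name, p in p_values.items() if p < cutoff}

# 20 mined variants plus one genuinely strong result. A lone p = 0.03
# "discovery" is expected by chance at this scale and gets filtered out.
mined = {f"variant_{i:02d}": 0.03 + 0.01 * i for i in range(20)}
mined["strong_edge"] = 0.0004

print(bonferroni_survivors(mined))  # only 'strong_edge' survives
```

Bonferroni is deliberately conservative, which matches this skill's philosophy: an edge that clears the corrected threshold is much less likely to be a data-mining artifact.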