Initial commit with translated description

2026-03-29 14:34:36 +08:00
commit f727ce26b6
4 changed files with 675 additions and 0 deletions
--- a/SKILL.md
+++ b/SKILL.md
@@ -0,0 +1,206 @@
 ---
 name: backtest-expert
 description: "Expert guidance on systematic backtesting of trading strategies."
 ---
 # Backtest Expert
 Systematic approach to backtesting trading strategies based on professional methodology that prioritizes robustness over optimistic results.
 ## Core Philosophy
 **Goal**: Find strategies that "break the least", not strategies that "profit the most" on paper.
 **Principle**: Add friction, stress test assumptions, and see what survives. If a strategy holds up under pessimistic conditions, it's more likely to work in live trading.
 ## When to Use This Skill
 Use this skill when:
 - Developing or validating systematic trading strategies
 - Evaluating whether a trading idea is robust enough for live implementation
 - Troubleshooting why a backtest might be misleading
 - Learning proper backtesting methodology
 - Avoiding common pitfalls (curve-fitting, look-ahead bias, survivorship bias)
 - Assessing parameter sensitivity and regime dependence
 - Setting realistic expectations for slippage and execution costs
 ## Backtesting Workflow
 ### 1. State the Hypothesis
 Define the edge in one sentence.
 **Example**: "Stocks that gap up >3% on earnings and pull back to the previous day's close within the first hour provide a mean-reversion opportunity."
 If you can't articulate the edge clearly, don't proceed to testing.
 ### 2. Codify Rules with Zero Discretion
 Define with complete specificity:
 - **Entry**: Exact conditions, timing, price type
 - **Exit**: Stop loss, profit target, time-based exit
 - **Position sizing**: Fixed dollar amount, % of portfolio, or volatility-adjusted
 - **Filters**: Market cap, volume, sector, volatility conditions
 - **Universe**: What instruments are eligible
 **Critical**: No subjective judgment allowed. Every decision must be rule-based and unambiguous.
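One way to enforce this is to hold every rule in a single immutable configuration object, so nothing can be adjusted by feel at run time. The sketch below is illustrative only: `StrategyRules`, its field names, and the helpers `is_eligible` and `should_enter` are hypothetical, and the values loosely mirror the gap-and-pullback example from step 1 rather than any tested strategy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrategyRules:
    """Hypothetical rule set: every field is fixed before the backtest runs."""
    min_gap_pct: float          # entry: minimum earnings gap, in percent
    entry_window_minutes: int   # entry: pullback must occur this soon after the open
    stop_loss_pct: float        # exit: stop loss, in percent
    profit_target_pct: float    # exit: profit target, in percent
    max_holding_days: int       # exit: time-based exit
    risk_per_trade_pct: float   # position sizing: % of portfolio per trade
    min_market_cap_usd: float   # filter / universe
    min_avg_volume: int         # filter / universe


def is_eligible(rules, market_cap_usd, avg_volume):
    """Universe filter: both thresholds must be met."""
    return (market_cap_usd >= rules.min_market_cap_usd
            and avg_volume >= rules.min_avg_volume)


def should_enter(rules, gap_pct, minutes_since_open, price, prev_close):
    """Entry rule: gap large enough, inside the time window, pulled back to prior close."""
    return (gap_pct > rules.min_gap_pct
            and minutes_since_open <= rules.entry_window_minutes
            and price <= prev_close)


# Illustrative values only, not recommendations.
rules = StrategyRules(3.0, 60, 2.0, 4.0, 3, 1.0, 2e9, 500_000)
```

Because the dataclass is frozen, any attempt to change a rule mid-test raises an error, which keeps parameter changes explicit and auditable.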
 ### 3. Run Initial Backtest
 Test over:
 - **Minimum 5 years** (preferably 10+)
 - **Multiple market regimes** (bull, bear, high/low volatility)
 - **Realistic costs**: Commissions + conservative slippage
 Examine initial results for basic viability. If fundamentally broken, iterate on hypothesis.
 ### 4. Stress Test the Strategy
 This is where 80% of testing time should be spent.
 **Parameter sensitivity**:
 - Test stop loss at 50%, 75%, 100%, 125%, 150% of baseline
 - Test profit target at 80%, 90%, 100%, 110%, 120% of baseline
 - Vary entry/exit timing by ±15-30 minutes
 - Look for "plateaus" of stable performance, not narrow spikes
 **Execution friction**:
 - Increase slippage to 1.5-2x typical estimates
 - Model worst-case fills (buy at ask+1 tick, sell at bid-1 tick)
 - Add realistic order rejection scenarios
 - Test with pessimistic commission structures
 **Time robustness**:
 - Analyze year-by-year performance
 - Require positive expectancy in majority of years
 - Ensure strategy doesn't rely on 1-2 exceptional periods
 - Test in different market regimes separately
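The time-robustness checks can be sketched as a small year-by-year summary. This is a minimal illustration, assuming trades are available as `(year, return_pct)` pairs; the function names and the idea of reporting the best year's share of total profit are assumptions, not part of the original methodology.

```python
from collections import defaultdict


def expectancy_by_year(trades):
    """Return {year: mean return per trade} from (year, return_pct) pairs."""
    by_year = defaultdict(list)
    for year, ret in trades:
        by_year[year].append(ret)
    return {year: sum(r) / len(r) for year, r in sorted(by_year.items())}


def time_robustness(trades):
    """Check for a majority of positive years and for reliance on one year."""
    per_year = expectancy_by_year(trades)
    positive_years = sum(1 for e in per_year.values() if e > 0)

    totals = defaultdict(float)
    for year, ret in trades:
        totals[year] += ret
    total_profit = sum(totals.values())

    return {
        "majority_positive": positive_years > len(per_year) / 2,
        # Share of total profit earned in the single best year; None if
        # the strategy is not profitable overall.
        "top_year_share": (max(totals.values()) / total_profit
                           if total_profit > 0 else None),
    }
```

A `top_year_share` close to or above 1.0 suggests the result depends on one exceptional period.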
 **Sample size**:
 - Absolute minimum: 30 trades
 - Preferred: 100+ trades
 - High confidence: 200+ trades
 ### 5. Out-of-Sample Validation
 **Walk-forward analysis**:
 1. Optimize on training period (e.g., Year 1-3)
 2. Test on validation period (Year 4)
 3. Roll forward and repeat
 4. Compare in-sample vs out-of-sample performance
 **Warning signs**:
 - Out-of-sample <50% of in-sample performance
 - Need frequent parameter re-optimization
 - Parameters change dramatically between periods
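The rolling procedure and the first warning sign can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the windows are whole calendar years, and "performance" is whatever single metric you compare in-sample and out-of-sample (for example, annual return).

```python
def walk_forward_windows(years, train_len=3, test_len=1):
    """Return (train_years, test_years) pairs, rolling forward by test_len."""
    windows = []
    start = 0
    while start + train_len + test_len <= len(years):
        train = years[start:start + train_len]
        test = years[start + train_len:start + train_len + test_len]
        windows.append((train, test))
        start += test_len
    return windows


def oos_degraded(in_sample, out_of_sample, min_ratio=0.5):
    """Flag when out-of-sample performance is below min_ratio of in-sample."""
    if in_sample <= 0:
        return True  # nothing worth validating if in-sample is not positive
    return out_of_sample / in_sample < min_ratio


for train, test in walk_forward_windows(list(range(2015, 2021))):
    print(f"optimize on {train[0]}-{train[-1]}, validate on {test[0]}")
```

Each validation year is only ever scored with parameters chosen from earlier data, which is what separates walk-forward testing from a single in-sample optimization.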
 ### 6. Evaluate Results
 **Questions to answer**:
 - Does edge survive pessimistic assumptions?
 - Is performance stable across parameter variations?
 - Does strategy work in multiple market regimes?
 - Is sample size sufficient for statistical confidence?
 - Are results realistic, not "too good to be true"?
 **Decision criteria**:
 - ✅ **Deploy**: Survives all stress tests with acceptable performance
 - 🔄 **Refine**: Core logic sound but needs parameter adjustment
 - ❌ **Abandon**: Fails stress tests or relies on fragile assumptions
 ## Key Testing Principles
 ### Punish the Strategy
 Add friction everywhere:
 - Commissions higher than reality
 - Slippage 1.5-2x typical
 - Worst-case fills
 - Order rejections
 - Partial fills
 **Rationale**: Strategies that survive pessimistic assumptions often perform better in live trading than their backtests suggest.
 ### Seek Plateaus, Not Peaks
 Look for parameter ranges where performance is stable, not optimal values that create performance spikes.
 **Good**: Strategy profitable with stop loss anywhere from 1.5% to 3.0%
 **Bad**: Strategy only works with stop loss at exactly 2.13%
 Stable performance indicates genuine edge; narrow optima suggest curve-fitting.
 ### Test All Cases, Not Cherry-Picked Examples
 **Wrong approach**: Study hand-picked "market leaders" that worked
 **Right approach**: Test every stock that met criteria, including those that failed
 Selective examples create survivorship bias and overestimate strategy quality.
 ### Separate Idea Generation from Validation
 **Intuition**: Useful for generating hypotheses
 **Validation**: Must be purely data-driven
 Never let attachment to an idea influence interpretation of test results.
 ## Common Failure Patterns
 Recognize these patterns early to save time:
 1. **Parameter sensitivity**: Only works with exact parameter values
 2. **Regime-specific**: Great in some years, terrible in others
 3. **Slippage sensitivity**: Unprofitable when realistic costs added
 4. **Small sample**: Too few trades for statistical confidence
 5. **Look-ahead bias**: "Too good to be true" results
 6. **Over-optimization**: Many parameters, poor out-of-sample results
 See `references/failed_tests.md` for detailed examples and diagnostic framework.
 ## Available Reference Documentation
 ### Methodology Reference
 **File**: `references/methodology.md`
 **When to read**: For detailed guidance on specific testing techniques.
 **Contents**:
 - Stress testing methods
 - Parameter sensitivity analysis
 - Slippage and friction modeling
 - Sample size requirements
 - Market regime classification
 - Common biases and pitfalls (survivorship, look-ahead, curve-fitting, etc.)
 ### Failed Tests Reference
 **File**: `references/failed_tests.md`
 **When to read**: When strategy fails tests, or learning from past mistakes.
 **Contents**:
 - Why failures are valuable
 - Common failure patterns with examples
 - Case study documentation framework
 - Red flags checklist for evaluating backtests
 ## Critical Reminders
 **Time allocation**: Spend 20% generating ideas, 80% trying to break them.
 **Context-free requirement**: If strategy requires "perfect context" to work, it's not robust enough for systematic trading.
 **Red flag**: If backtest results look too good (>90% win rate, minimal drawdowns, perfect timing), audit carefully for look-ahead bias or data issues.
 **Tool limitations**: Understand your backtesting platform's quirks (interpolation methods, handling of low liquidity, data alignment issues).
 **Statistical significance**: Small edges require large sample sizes to prove. As a rough guide, a 5% edge per trade needs 100+ trades to distinguish from luck, and more when per-trade returns are volatile.
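To see how edge size and sample size trade off, a common back-of-the-envelope rule is n ≥ (z·σ/μ)², where μ is the mean return per trade, σ its standard deviation, and z the desired z-score. The sketch below assumes independent, roughly normal per-trade returns, which real trades only approximate; `required_trades` and the example inputs are hypothetical.

```python
import math


def required_trades(mean_return_pct, std_return_pct, z=1.96):
    """Trades needed for the mean return to sit z standard errors above zero.

    Normal approximation: n >= (z * sigma / mu) ** 2.
    """
    if mean_return_pct <= 0:
        raise ValueError("mean return must be positive to test for an edge")
    return math.ceil((z * std_return_pct / mean_return_pct) ** 2)


# Hypothetical inputs: 0.5% mean return per trade, 2.5% standard deviation.
print(required_trades(0.5, 2.5))
```

Halving the edge with the same volatility quadruples the number of trades required, which is why thin edges are so hard to confirm.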
 ## Discretionary vs Systematic Differences
 This skill focuses on **systematic/quantitative** backtesting where:
 - All rules are codified in advance
 - No discretion or "feel" in execution
 - Testing happens on all historical examples, not cherry-picked cases
 - Context (news, macro) is deliberately stripped out
 Discretionary traders study setups differently, so this skill may not apply to setups requiring subjective judgment.
--- a/_meta.json
+++ b/_meta.json
@@ -0,0 +1,6 @@
 {
  "ownerId": "kn7agf701n3afzzbq8ge0wa8k1809wm4",
  "slug": "backtest-expert",
  "version": "0.1.0",
  "publishedAt": 1769870095738
 }
--- a/references/failed_tests.md
+++ b/references/failed_tests.md
@@ -0,0 +1,236 @@
 # Learning from Failed Backtests
 ## Table of Contents
 1. Why Failed Ideas Are Valuable
 2. Common Failure Patterns
 3. Case Study Framework
 4. Red Flags Checklist
 ## 1. Why Failed Ideas Are Valuable
 ### The Value of Failures
 **Key insights**:
 - Failed tests save capital by preventing live implementation
 - Failure patterns reveal which assumptions don't hold
 - Understanding what doesn't work narrows the search space
 - Failed tests build experience in recognizing fragile strategies
 ### Documentation Discipline
 **Record for each failed idea**:
 - The hypothesis being tested
 - Why you thought it would work
 - What the data showed
 - Specific breaking points
 - Lessons learned
 **Purpose**: Build a library of "anti-patterns" to avoid repeating mistakes.
 ## 2. Common Failure Patterns
 ### Pattern 1: Parameter Sensitivity
 **Symptom**: Strategy only works with very specific parameter values.
 **Example scenario**:
 - Strategy profitable with stop loss at exactly 2.5%
 - Increasing to 3% or decreasing to 2% causes significant performance drop
 - No "plateau" of stable performance
 **Why it fails**: Real markets have noise; if small changes break the strategy, it likely captured noise, not signal.
 **Lesson**: Seek strategies with stable performance across parameter ranges.
 ### Pattern 2: Regime-Specific Performance
 **Symptom**: Strategy works brilliantly in some years, terribly in others.
 **Example scenario**:
 - Great performance in 2017-2019 (low volatility bull market)
 - Catastrophic losses in 2020 (high volatility)
 - Poor performance in 2022 (downtrend)
 **Why it fails**: Strategy dependent on specific market conditions, not robust enough for diverse environments.
 **Lesson**: Require acceptable (not necessarily best) performance across all regimes.
 ### Pattern 3: Slippage Sensitivity
 **Symptom**: Strategy becomes unprofitable when realistic trading costs added.
 **Example scenario**:
 - Backtest shows 0.5% average gain per trade
 - Adding 0.1% slippage per side (0.2% round-trip) cuts the gain to 0.3%; stressing slippage to 2x (0.4% round-trip) leaves only 0.1% before commissions
 - Strategy requires unrealistic fills to be profitable
 **Why it fails**: Edge too small to survive real-world friction.
 **Lesson**: Edge must be large enough to survive pessimistic assumptions about costs.
 ### Pattern 4: Sample Size Issues
 **Symptom**: Strong results based on small number of trades.
 **Example scenario**:
 - Backtest shows 80% win rate
 - Only 15 total trades in 5 years
 - A few different outcomes would dramatically change results
 **Why it fails**: Insufficient data to distinguish edge from luck.
 **Lesson**: Require minimum 100 trades for meaningful conclusions, preferably 200+.
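The example scenario can be quantified with a confidence interval on the win rate. The sketch below uses the Wilson score interval, which is one reasonable choice for small samples and is an assumption here rather than something the original prescribes. For 12 wins in 15 trades (80%), the 95% interval runs from roughly 55% to 93%.

```python
import math


def wilson_interval(wins, trades, z=1.96):
    """Wilson score interval for a win rate; returns (lower, upper)."""
    if trades <= 0:
        raise ValueError("trades must be positive")
    p = wins / trades
    denom = 1 + z ** 2 / trades
    center = (p + z ** 2 / (2 * trades)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / trades + z ** 2 / (4 * trades ** 2)
    ) / denom
    return center - half_width, center + half_width


small = wilson_interval(12, 15)     # 80% win rate on 15 trades
large = wilson_interval(160, 200)   # same win rate on 200 trades
print(small, large)
```

The same 80% win rate on 200 trades gives a much narrower interval, which is the practical meaning of "statistical confidence" in this pattern.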
 ### Pattern 5: Look-Ahead Bias
 **Symptom**: Perfect or near-perfect backtest results.
 **Example scenario**:
 - Strategy shows 95%+ win rate
 - Unrealistically good entry/exit timing
 - Performance too good to be realistic
 **Why it fails**: Likely using information not available at time of trade.
 **Lesson**: Be suspicious of "too good to be true" results; audit data alignment carefully.
 ### Pattern 6: Over-Optimization (Curve Fitting)
 **Symptom**: Complex strategy with many parameters shows excellent in-sample results but poor out-of-sample.
 **Example scenario**:
 - Strategy uses 8-10 different indicators with specific thresholds
 - In-sample performance: 40% annual return
 - Out-of-sample performance: -5% annual return
 - Parameters needed constant re-optimization
 **Why it fails**: Fitted to historical noise rather than genuine market structure.
 **Lesson**: Prefer simple strategies with fewer parameters; demand strong out-of-sample results.
 ## 3. Case Study Framework
 ### Template for Documenting Failed Ideas
 Use this framework when a backtest fails:
 #### 1. Initial Hypothesis
 - **What edge were you trying to capture?**
 - **Why did you think this would work?**
 - **What was the logical basis?**
 #### 2. Implementation Details
 - **Entry rules** (specific and complete)
 - **Exit rules** (stop loss, profit target, time-based)
 - **Position sizing**
 - **Filters or conditions**
 #### 3. Test Results
 - **Basic metrics**:
  - Total trades
  - Win rate
  - Average win/loss
  - Max drawdown
  - Annual returns by year
 - **Parameter sensitivity**:
  - How results changed with parameter variations
  - Whether "plateau" of stable performance existed
 - **Regime analysis**:
  - Performance in different market conditions
  - Which regimes caused problems
 #### 4. Breaking Points
 - **What specifically caused the strategy to fail?**
  - Slippage too high?
  - Parameter sensitivity?
  - Regime-specific?
  - Insufficient sample size?
 #### 5. Lessons Learned
 - **What assumptions were wrong?**
 - **What would you test differently next time?**
 - **Are there salvageable elements?**
 ### Example: Failed Momentum Reversal Strategy
 #### 1. Initial Hypothesis
 Tried to capture mean reversion after strong momentum moves. Hypothesis: Stocks that gap up 5%+ on earnings often pull back 2-3% before continuing, providing short-term reversal opportunity.
 #### 2. Implementation
 - Entry: Short when stock gaps up 5%+ on earnings at market open
 - Exit: Cover at 2% profit or 3% stop loss
 - Holding period: Maximum 3 days
 - Filters: Market cap >$2B, average volume >500K shares
 #### 3. Test Results
 - 67 trades over 5 years
 - Win rate: 58%
 - Avg win: 2.1%, Avg loss: 3.2%
 - Max drawdown: 18%
 - 2019-2021: Profitable
 - 2022-2023: Significant losses
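As a quick check on these figures (this calculation is an addition, not part of the original case study), per-trade expectancy can be computed from the win rate and the average win and loss. With a 58% win rate, 2.1% average win, and 3.2% average loss, expectancy is about -0.13% per trade, and the break-even win rate is about 60.4%.

```python
def expectancy_pct(win_rate, avg_win_pct, avg_loss_pct):
    """Expected return per trade, in percent (avg_loss_pct given as a positive number)."""
    return win_rate * avg_win_pct - (1 - win_rate) * avg_loss_pct


def breakeven_win_rate(avg_win_pct, avg_loss_pct):
    """Win rate at which expectancy is exactly zero."""
    return avg_loss_pct / (avg_win_pct + avg_loss_pct)


print(expectancy_pct(0.58, 2.1, 3.2))   # negative: losses outweigh wins
print(breakeven_win_rate(2.1, 3.2))     # win rate needed just to break even
```

A win rate above 50% is not enough when the average loss is larger than the average win.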
 #### 4. Breaking Points
 - Strategy failed during strong momentum environments (2021 meme stocks)
 - Stop losses hit frequently during continued upward momentum
 - Gap-ups that continued higher immediately caused outsized losses
 - Small sample size (67 trades) provided low statistical confidence
 - Slippage on short entries during high volatility eliminated thin edge
 #### 5. Lessons Learned
 - Mean reversion strategies vulnerable during momentum regimes
 - Need regime filter (e.g., only trade during high VIX or weak market)
 - 5-year test insufficient for momentum-sensitive strategies; need 10+ years
 - Edge too small (2% target vs 3% stop) to survive slippage
 - Better approach: Wait for actual pullback, then enter, rather than fade immediately
 ## 4. Red Flags Checklist
 Use this checklist when evaluating any backtest:
 ### Data Quality Issues
 - [ ] Has survivorship bias been addressed?
 - [ ] Are delisted stocks included in test?
 - [ ] Is data alignment correct (no look-ahead bias)?
 - [ ] Are corporate actions (splits, dividends) handled correctly?
 ### Sample Size Concerns
 - [ ] At least 100 trades? (Preferably 200+)
 - [ ] At least 5 years of data? (Preferably 10+)
 - [ ] Includes full market cycle?
 - [ ] Tested across multiple market regimes?
 ### Parameter Robustness
 - [ ] Does strategy work with nearby parameter values?
 - [ ] Are there "plateaus" of stable performance?
 - [ ] Minimal parameters (ideally <5)?
 - [ ] Parameters based on logical reasoning, not pure optimization?
 ### Execution Realism
 - [ ] Realistic commissions included?
 - [ ] Slippage modeled conservatively (1.5-2x typical)?
 - [ ] Worst-case fills considered?
 - [ ] Order rejection/partial fills addressed?
 ### Performance Characteristics
 - [ ] Positive expectancy in majority of years?
 - [ ] Acceptable performance in all major regimes?
 - [ ] No catastrophic drawdowns (>50%)?
 - [ ] Edge large enough to survive friction?
 ### Bias Prevention
 - [ ] Strategy defined before testing?
 - [ ] Hypothesis has economic logic?
 - [ ] Results aren't "too good to be true"?
 - [ ] Out-of-sample testing performed?
 - [ ] No cherry-picking of examples?
 ### Tool Limitations
 - [ ] Aware of testing platform's interpolation methods?
 - [ ] Understand how platform handles low-liquidity situations?
 - [ ] Know quirks specific to data provider?
 **If more than 2-3 items aren't checked, the backtest requires additional work before considering live implementation.**
--- a/references/methodology.md
+++ b/references/methodology.md
@@ -0,0 +1,227 @@
 # Backtesting Methodology Reference
 ## Table of Contents
 1. Core Testing Techniques
 2. Stress Testing Methods
 3. Parameter Sensitivity Analysis
 4. Slippage and Friction Modeling
 5. Sample Size Guidelines
 6. Market Regime Analysis
 7. Common Pitfalls and Biases
 ## 1. Core Testing Techniques
 ### "Beat Ideas to Death" Approach
 **Core principle**: Add friction and punishment to find strategies that break the least, not those that profit the most on paper.
 **Key techniques**:
 - Multiple stop loss variations
 - Different profit targets
 - Realistic + exaggerated commissions
 - Worst-case fills
 - Extended time periods
 - Multiple market regimes
 ### The 80/20 Rule for R&D Time
 - 20% generating and codifying ideas
 - 80% stress testing and trying to break them
 ## 2. Stress Testing Methods
 ### Execution Friction Tests
 **Required friction additions**:
 - Realistic commissions (actual broker rates)
 - Pessimistic slippage (1.5-2x typical)
 - Worst-case entry fills (ask + 1-2 ticks)
 - Worst-case exit fills (bid - 1-2 ticks)
 - Order rejection scenarios
 - Partial fills
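The worst-case fill rules can be sketched as a small pricing helper. This is a minimal illustration, assuming a quoted bid and ask and a fixed tick size; `worst_case_fill` and `round_trip_cost_pct` are hypothetical names, and order rejections and partial fills are not modeled here.

```python
def worst_case_fill(side, bid, ask, tick_size, extra_ticks=1):
    """Pessimistic fill price: buy above the ask, sell below the bid."""
    if side == "buy":
        return ask + extra_ticks * tick_size
    if side == "sell":
        return bid - extra_ticks * tick_size
    raise ValueError("side must be 'buy' or 'sell'")


def round_trip_cost_pct(bid, ask, tick_size, extra_ticks=1):
    """Cost of buying and immediately selling at worst-case fills, as % of mid price."""
    mid = (bid + ask) / 2
    buy = worst_case_fill("buy", bid, ask, tick_size, extra_ticks)
    sell = worst_case_fill("sell", bid, ask, tick_size, extra_ticks)
    return (buy - sell) / mid * 100


# Hypothetical quote: 99.99 bid, 100.01 ask, one-cent tick.
print(round_trip_cost_pct(99.99, 100.01, 0.01))
```

Comparing this round-trip cost with the average gain per trade gives a fast first read on whether the edge can survive friction.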
 ### Parameter Robustness Tests
 Test across multiple configurations:
 - Entry timing variations (±15-30 minutes)
 - Stop loss distances (50%, 75%, 100%, 125%, 150% of baseline)
 - Profit targets (80%, 90%, 100%, 110%, 120% of baseline)
 - Position sizing rules
 - Filter thresholds
 **Goal**: Find "plateau" performance where small parameter changes don't drastically alter results.
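The stop-loss and profit-target variations listed above can be generated mechanically from the baseline values. A minimal sketch, assuming both parameters are expressed in percent; `parameter_grid` and the example baseline are hypothetical.

```python
from itertools import product

# Multipliers of the baseline, as listed above.
STOP_MULTIPLIERS = (0.50, 0.75, 1.00, 1.25, 1.50)
TARGET_MULTIPLIERS = (0.80, 0.90, 1.00, 1.10, 1.20)


def parameter_grid(base_stop_pct, base_target_pct):
    """Return every (stop_pct, target_pct) combination to backtest."""
    return [
        (round(base_stop_pct * s, 4), round(base_target_pct * t, 4))
        for s, t in product(STOP_MULTIPLIERS, TARGET_MULTIPLIERS)
    ]


# Hypothetical baseline: 2% stop loss, 4% profit target.
grid = parameter_grid(2.0, 4.0)
print(len(grid), grid[0], grid[-1])
```

Running the same backtest over all 25 combinations produces the table that the plateau check is applied to.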
 ### Time-Based Robustness
 **Minimum requirements**:
 - Test across at least 5-10 years
 - Include multiple market regimes:
  - Bull markets
  - Bear markets
  - High volatility periods
  - Low volatility periods
  - Trending markets
  - Range-bound markets
 **Year-by-year analysis**: Strategy should show positive expectancy in majority of years, not rely on 1-2 exceptional years.
 ## 3. Parameter Sensitivity Analysis
 ### Heat Map Analysis
 Create 2D heat maps varying two parameters simultaneously:
 - Profit target (rows) × Stop loss (columns)
 - Entry time (rows) × Exit time (columns)
 - Volatility filter (rows) × Volume filter (columns)
 **Interpretation**:
 - Robust strategies show "plateaus" of consistent performance
 - Fragile strategies show "spikes" or narrow optimal ranges
 - Avoid strategies with performance cliffs at parameter boundaries
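One simple way to tell a plateau from a spike in such a heat map is to compare each cell with its neighbors. The sketch below is an assumption-laden illustration: the grid is a list of rows holding one performance number per parameter pair, and the 30% tolerance is an arbitrary placeholder, not a recommended threshold.

```python
def neighbor_values(grid, row, col):
    """Values of the up-to-8 cells surrounding (row, col)."""
    values = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            r, c = row + dr, col + dc
            if 0 <= r < len(grid) and 0 <= c < len(grid[0]):
                values.append(grid[r][c])
    return values


def is_plateau(grid, row, col, tolerance=0.3):
    """True if the cell is profitable and no neighbor falls more than
    `tolerance` (as a fraction) below it."""
    center = grid[row][col]
    if center <= 0:
        return False
    floor = center * (1 - tolerance)
    return all(v >= floor for v in neighbor_values(grid, row, col))


plateau = [[10, 11, 10], [11, 12, 11], [10, 11, 10]]
spike = [[1, 0, 1], [0, 12, -1], [1, 0, 1]]
print(is_plateau(plateau, 1, 1), is_plateau(spike, 1, 1))
```

Both grids have the same best value, but only the first keeps most of its performance when the parameters move one step in any direction.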
 ### Walk-Forward Analysis
 1. Optimize parameters on training period (e.g., Year 1-2)
 2. Test with those parameters on validation period (Year 3)
 3. Roll forward and repeat
 4. Compare in-sample vs out-of-sample performance
 **Warning signs**:
 - Out-of-sample performance <50% of in-sample
 - Frequent need to re-optimize parameters
 - Parameters that change dramatically between periods
 ## 4. Slippage and Friction Modeling
 ### Realistic Slippage Assumptions
 **By market capitalization**:
 - Mega cap (>$200B): 0.01-0.02%
 - Large cap ($10B-$200B): 0.02-0.05%
 - Mid cap ($2B-$10B): 0.05-0.10%
 - Small cap ($300M-$2B): 0.10-0.20%
 - Micro cap (<$300M): 0.20-0.50%+
 **By order type**:
 - Market orders: Higher slippage
 - Limit orders: Lower slippage but potential non-fills
 - Stop orders: Significant slippage in volatile conditions
 ### Conservative Testing Approach
 Use 1.5-2x typical slippage estimates for stress testing:
 - If typical slippage is 0.05%, test with 0.075-0.10%
 - If typical is 0.10%, test with 0.15-0.20%
 **Rationale**: Strategies that survive pessimistic assumptions often perform better in practice than in backtests.
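The tiers and the stress multiplier above can be combined into a per-trade adjustment. This sketch makes several assumptions: it uses the upper end of each typical range, treats a value exactly on a tier boundary as the larger-cap tier, and applies slippage once per side; the names are hypothetical.

```python
# (minimum market cap in USD, typical slippage in % per side), largest first.
SLIPPAGE_TIERS = (
    (200e9, 0.02),  # mega cap
    (10e9, 0.05),   # large cap
    (2e9, 0.10),    # mid cap
    (300e6, 0.20),  # small cap
)
MICRO_CAP_SLIPPAGE = 0.50  # below $300M


def typical_slippage_pct(market_cap_usd):
    """Upper end of the typical per-side slippage range for this market cap."""
    for floor, slippage in SLIPPAGE_TIERS:
        if market_cap_usd >= floor:
            return slippage
    return MICRO_CAP_SLIPPAGE


def net_return_pct(gross_return_pct, market_cap_usd, stress=2.0):
    """Gross return minus stressed slippage on both entry and exit."""
    per_side = typical_slippage_pct(market_cap_usd) * stress
    return gross_return_pct - 2 * per_side


# Hypothetical trade: 0.5% gross gain on a $5B mid-cap stock at 2x slippage.
print(net_return_pct(0.5, 5e9))
```

In this example, 2x slippage on a mid-cap stock removes 0.4 percentage points of a 0.5% gross gain.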
 ## 5. Sample Size Guidelines
 ### Minimum Trade Requirements
 **Statistical significance thresholds**:
 - Absolute minimum: 30 trades
 - Preferred minimum: 100 trades
 - High confidence: 200+ trades
 **Why large samples matter**:
 - Reduces impact of outliers
 - Provides statistical confidence
 - Reveals true edge vs luck
 ### Time Period Considerations
 **Minimum testing period**: 5 years
 **Preferred testing period**: 10+ years
 **Must include**:
 - At least one full market cycle
 - Multiple volatility regimes
 - Different Federal Reserve policy environments
 ## 6. Market Regime Analysis
 ### Regime Classification
 **Volatility-based regimes**:
 - Low volatility: VIX <15
 - Normal volatility: VIX 15-25
 - High volatility: VIX 25-35
 - Extreme volatility: VIX >35
 **Trend-based regimes**:
 - Strong uptrend: Market +10%+ over 6 months
 - Moderate uptrend: Market +5% to +10% over 6 months
 - Sideways: Market -5% to +5% over 6 months
 - Downtrend: Market <-5% over 6 months
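These thresholds translate directly into two labeling functions. The listed ranges share their boundary values (for example, VIX 25 appears in both "normal" and "high"), so the sketch below makes an explicit assumption: boundary values go to the higher-volatility or stronger-trend bucket, except that VIX 35 stays "high" and a return of exactly -5% stays "sideways".

```python
def volatility_regime(vix):
    """Label a VIX reading using the thresholds above."""
    if vix < 15:
        return "low"
    if vix < 25:
        return "normal"
    if vix <= 35:
        return "high"
    return "extreme"


def trend_regime(six_month_return_pct):
    """Label a 6-month market return (in percent) using the thresholds above."""
    if six_month_return_pct >= 10:
        return "strong uptrend"
    if six_month_return_pct >= 5:
        return "moderate uptrend"
    if six_month_return_pct >= -5:
        return "sideways"
    return "downtrend"


print(volatility_regime(30), trend_regime(-8))
```

Tagging each trade with both labels at entry time makes it straightforward to group results by regime.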
 ### Performance Requirements by Regime
 **Robust strategy characteristics**:
 - Positive expectancy in majority of regimes
 - Acceptable (not necessarily best) in all regimes
 - No catastrophic failures in any single regime
 - Understanding of which regime causes weakness
 ## 7. Common Pitfalls and Biases
 ### Survivorship Bias
 **Issue**: Testing only on currently-trading stocks ignores delisted/bankrupt companies.
 **Solution**: Use survivorship-bias-free datasets that include historical delistings.
 ### Look-Ahead Bias
 **Issue**: Using information not available at the time of trade.
 **Examples**:
 - Using end-of-day (EOD) data for intraday decisions made before the close
 - Using the next day's open for decisions made at today's close
 - Calculating indicators with future data points
 **Prevention**: Strict timestamp control and data alignment checks.
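A common alignment safeguard is to compute indicators from trailing data only and to lag signals so that a decision at bar t uses nothing later than bar t - 1. The sketch below illustrates that idea with plain lists; `sma` and `lag_signals` are hypothetical helpers, not functions from any particular backtesting platform.

```python
def sma(prices, window):
    """Trailing simple moving average; entry i uses prices[i-window+1 .. i] only."""
    out = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            out.append(sum(prices[i - window + 1:i + 1]) / window)
    return out


def lag_signals(signals, lag=1):
    """Shift signals forward so a signal from bar t is first tradable at bar t + lag."""
    signals = list(signals)
    return [None] * lag + signals[:len(signals) - lag]


closes = [1, 2, 3, 4]
print(sma(closes, 2))          # [None, 1.5, 2.5, 3.5]
print(lag_signals([1, 0, 1]))  # [None, 1, 0]
```

If results change sharply once signals are lagged by one bar, the unlagged version was probably trading on information it could not have had.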
 ### Curve-Fitting (Over-Optimization)
 **Warning signs**:
 - Too many parameters (>5-7)
 - Highly specific parameter values (e.g., RSI = 37.3)
 - Perfect backtest results
 - Large performance drop in validation period
 **Prevention techniques**:
 - Limit parameters to essential ones only
 - Use round numbers when possible
 - Require out-of-sample testing
 - Analyze parameter sensitivity
 ### Sample Selection Bias
 **Issue**: Testing only on hand-picked examples (e.g., known market leaders).
 **Problem**: Ignoring all stocks that met criteria but failed creates false impression of strategy quality.
 **Solution**: Test on ALL historical examples meeting the criteria, not just successful outcomes.
 ### Hindsight Bias
 **Issue**: Using outcome knowledge to influence decisions.
 **Prevention for systematic trading**:
 - Define all rules in advance
 - No manual intervention based on hindsight
 - Test rules across all cases, not cherry-picked examples
 ### Data Mining Bias
 **Issue**: Testing hundreds of strategies until finding one that "works" by random chance.
 **Risk**: With enough attempts, random data will produce seemingly profitable patterns.
 **Mitigation**:
 - Have hypothesis before testing
 - Require economic logic for the edge
 - Use Bonferroni correction for multiple comparisons
 - Demand higher significance thresholds (p < 0.01 instead of p < 0.05)
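The Bonferroni correction divides the significance level by the number of strategies tested. A minimal sketch, assuming each strategy already has a p-value from its own significance test; the function names and example p-values are hypothetical.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level after Bonferroni correction."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


def significant_after_correction(p_values, alpha=0.05):
    """Return True for each p-value that passes the corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]


# Hypothetical p-values from three tested strategy variants.
print(significant_after_correction([0.0001, 0.03, 0.2]))
```

With 100 variants tested at an overall 5% level, each one must reach p < 0.0005, so a result of p = 0.03 that looks convincing in isolation does not survive the correction.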