Initial commit with translated description

references/failed_tests.md (new file, 236 lines)
# Learning from Failed Backtests

## Table of Contents

1. Why Failed Ideas Are Valuable
2. Common Failure Patterns
3. Case Study Framework
4. Red Flags Checklist
## 1. Why Failed Ideas Are Valuable

### The Value of Failures

**Key insights**:
- Failed tests save capital by preventing live implementation
- Failure patterns reveal which assumptions don't hold
- Understanding what doesn't work narrows the search space
- Failed tests build experience in recognizing fragile strategies

### Documentation Discipline

**Record for each failed idea**:
- The hypothesis being tested
- Why you thought it would work
- What the data showed
- Specific breaking points
- Lessons learned

**Purpose**: Build a library of "anti-patterns" to avoid repeating mistakes.
## 2. Common Failure Patterns

### Pattern 1: Parameter Sensitivity

**Symptom**: Strategy only works with very specific parameter values.

**Example scenario**:
- Strategy is profitable with a stop loss at exactly 2.5%
- Increasing it to 3% or decreasing it to 2% causes a significant performance drop
- No "plateau" of stable performance

**Why it fails**: Real markets are noisy; if small parameter changes break the strategy, it likely captured noise, not signal.

**Lesson**: Seek strategies with stable performance across parameter ranges.
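One way to detect this fragility is a simple sweep around the baseline parameter. The sketch below assumes a hypothetical `run_backtest` function (a placeholder for whatever engine you use) and checks whether nearby stop-loss values remain acceptable:

```python
# Hypothetical sketch: sweep stop-loss values around a baseline and check
# for a performance "plateau" rather than a single profitable spike.

def run_backtest(stop_loss_pct):
    # Placeholder returning annualized return (%) per stop loss.
    # Replace with your actual backtest engine.
    results = {2.0: -1.0, 2.25: 0.2, 2.5: 12.0, 2.75: 0.5, 3.0: -2.0}
    return results[stop_loss_pct]

def has_plateau(param_values, min_return=0.0):
    """True only if ALL nearby parameter values stay acceptable (a plateau);
    False when performance collapses away from one specific value."""
    returns = [run_backtest(p) for p in param_values]
    return all(r >= min_return for r in returns)

sweep = [2.0, 2.25, 2.5, 2.75, 3.0]
print(has_plateau(sweep))  # False: only 2.5% works, a classic fragility spike
```

A robust strategy would keep `has_plateau` true over a wide band of values, not just at the optimized point.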
### Pattern 2: Regime-Specific Performance

**Symptom**: Strategy works brilliantly in some years, terribly in others.

**Example scenario**:
- Great performance in 2017-2019 (low-volatility bull market)
- Catastrophic losses in 2020 (high volatility)
- Poor performance in 2022 (downtrend)

**Why it fails**: The strategy depends on specific market conditions and is not robust enough for diverse environments.

**Lesson**: Require acceptable (not necessarily best) performance across all regimes.
### Pattern 3: Slippage Sensitivity

**Symptom**: Strategy becomes unprofitable when realistic trading costs are added.

**Example scenario**:
- Backtest shows a 0.5% average gain per trade
- Adding 0.1% slippage per side (0.2% round trip) eliminates the profits
- Strategy requires unrealistic fills to be profitable

**Why it fails**: The edge is too small to survive real-world friction.

**Lesson**: The edge must be large enough to survive pessimistic assumptions about costs.
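The scenario's arithmetic can be made explicit. This is a minimal sketch using the illustrative numbers above (0.5% gross edge, 0.1% slippage per side); `net_edge_per_trade` and the 0.1% commission figure are assumptions for illustration, not real strategy data:

```python
# Round-trip friction arithmetic from the scenario above.

def net_edge_per_trade(gross_edge_pct, slippage_per_side_pct,
                       commission_round_trip_pct=0.0):
    """Gross edge minus round-trip slippage (two sides) and commissions."""
    round_trip_slippage = 2 * slippage_per_side_pct
    return gross_edge_pct - round_trip_slippage - commission_round_trip_pct

# 0.5% gross gain, 0.1% slippage per side, 0.1% total commissions
net = net_edge_per_trade(0.5, 0.1, 0.1)
print(f"{net:.2f}% net edge per trade")  # 0.20%: most of the edge is gone

# Stress test at 2x slippage, per the conservative-testing guideline
stressed = net_edge_per_trade(0.5, 0.2, 0.1)
print(f"stressed edge: {stressed:.1e}%")  # effectively zero: edge eliminated
```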
### Pattern 4: Sample Size Issues

**Symptom**: Strong results based on a small number of trades.

**Example scenario**:
- Backtest shows an 80% win rate
- Only 15 total trades in 5 years
- A few different outcomes would dramatically change the results

**Why it fails**: Insufficient data to distinguish edge from luck.

**Lesson**: Require a minimum of 100 trades for meaningful conclusions, preferably 200+.
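A quick way to quantify this is a confidence interval on the win rate. The sketch below uses the Wilson score interval (a choice of ours; any binomial interval works) to show how little 15 trades actually pin down:

```python
import math

def wilson_interval(wins, n, z=1.96):
    """Approximate 95% Wilson score interval for a win rate; stdlib only."""
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# An 80% win rate on 15 trades vs the same rate on 200 trades
print(wilson_interval(12, 15))   # roughly (0.55, 0.93): very wide
print(wilson_interval(160, 200)) # roughly (0.74, 0.85): much tighter
```

With 15 trades the "80% win rate" is statistically compatible with anything from a coin flip's neighborhood to near-perfection, which is exactly why small samples can't separate edge from luck.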
### Pattern 5: Look-Ahead Bias

**Symptom**: Perfect or near-perfect backtest results.

**Example scenario**:
- Strategy shows a 95%+ win rate
- Unrealistically good entry/exit timing
- Performance too good to be realistic

**Why it fails**: The strategy is likely using information that was not available at the time of the trade.

**Lesson**: Be suspicious of "too good to be true" results; audit data alignment carefully.
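A common concrete form of this bug is acting on a signal in the same bar whose close produced it. A minimal sketch of the mistake and its usual fix (lagging the signal by one bar), with made-up prices:

```python
# Classic look-ahead mistake: trading bar t on an indicator computed from
# bar t's close, which is unknown during bar t. Fix: lag the signal.

closes = [100, 102, 101, 105, 107, 104]

# Signal computed from each bar's close (only known AFTER that close)
signal = [c > 101 for c in closes]

# WRONG: pairing signal[t] with bar t uses the same bar's close
biased = list(zip(closes, signal))

# RIGHT: shift the signal so bar t trades on information from bar t-1
lagged = [None] + signal[:-1]
unbiased = list(zip(closes, lagged))

print(unbiased[1])  # (102, False): bar 1 trades on bar 0's signal only
```

In a dataframe workflow the same fix is typically a one-bar `shift` applied to the signal column before computing positions.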
### Pattern 6: Over-Optimization (Curve Fitting)

**Symptom**: Complex strategy with many parameters shows excellent in-sample results but poor out-of-sample.

**Example scenario**:
- Strategy uses 8-10 different indicators with specific thresholds
- In-sample performance: 40% annual return
- Out-of-sample performance: -5% annual return
- Parameters needed constant re-optimization

**Why it fails**: Fitted to historical noise rather than genuine market structure.

**Lesson**: Prefer simple strategies with fewer parameters; demand strong out-of-sample results.
## 3. Case Study Framework

### Template for Documenting Failed Ideas

Use this framework when a backtest fails:

#### 1. Initial Hypothesis
- **What edge were you trying to capture?**
- **Why did you think this would work?**
- **What was the logical basis?**

#### 2. Implementation Details
- **Entry rules** (specific and complete)
- **Exit rules** (stop loss, profit target, time-based)
- **Position sizing**
- **Filters or conditions**

#### 3. Test Results
- **Basic metrics**:
  - Total trades
  - Win rate
  - Average win/loss
  - Max drawdown
  - Annual returns by year
- **Parameter sensitivity**:
  - How results changed with parameter variations
  - Whether a "plateau" of stable performance existed
- **Regime analysis**:
  - Performance in different market conditions
  - Which regimes caused problems

#### 4. Breaking Points
- **What specifically caused the strategy to fail?**
  - Slippage too high?
  - Parameter sensitivity?
  - Regime-specific?
  - Insufficient sample size?

#### 5. Lessons Learned
- **What assumptions were wrong?**
- **What would you test differently next time?**
- **Are there salvageable elements?**
### Example: Failed Momentum Reversal Strategy

#### 1. Initial Hypothesis
Tried to capture mean reversion after strong momentum moves. Hypothesis: stocks that gap up 5%+ on earnings often pull back 2-3% before continuing, providing a short-term reversal opportunity.

#### 2. Implementation
- Entry: Short when a stock gaps up 5%+ on earnings at the market open
- Exit: Cover at 2% profit or 3% stop loss
- Holding period: Maximum 3 days
- Filters: Market cap >$2B, average volume >500K shares

#### 3. Test Results
- 67 trades over 5 years
- Win rate: 58%
- Avg win: 2.1%, avg loss: 3.2%
- Max drawdown: 18%
- 2019-2021: Profitable
- 2022-2023: Significant losses

#### 4. Breaking Points
- Strategy failed during strong momentum environments (2021 meme stocks)
- Stop losses were hit frequently during continued upward momentum
- Gap-ups that continued higher immediately caused outsized losses
- Small sample size (67 trades) provided low statistical confidence
- Slippage on short entries during high volatility eliminated the thin edge

#### 5. Lessons Learned
- Mean reversion strategies are vulnerable during momentum regimes
- Need a regime filter (e.g., only trade during high VIX or a weak market)
- A 5-year test is insufficient for momentum strategies; need 10+ years
- Edge too small (2% target vs. 3% stop) to survive slippage
- Better approach: wait for an actual pullback, then enter, rather than fading immediately
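Plugging the reported statistics into the standard expectancy formula shows the edge was negative even before friction:

```python
# Expectancy check using the case study's reported numbers.
win_rate = 0.58
avg_win = 2.1    # percent
avg_loss = 3.2   # percent

# expectancy = P(win) * avg_win - P(loss) * avg_loss
expectancy = win_rate * avg_win - (1 - win_rate) * avg_loss
print(f"{expectancy:.3f}% per trade")  # -0.126%: negative before any friction
```

A 58% win rate sounds respectable, but losers averaging 1.5x the size of winners flip the sign of the expectancy, which is consistent with the breaking points listed above.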
## 4. Red Flags Checklist

Use this checklist when evaluating any backtest:

### Data Quality Issues
- [ ] Has survivorship bias been addressed?
- [ ] Are delisted stocks included in the test?
- [ ] Is data alignment correct (no look-ahead bias)?
- [ ] Are corporate actions (splits, dividends) handled correctly?

### Sample Size Concerns
- [ ] At least 100 trades? (Preferably 200+)
- [ ] At least 5 years of data? (Preferably 10+)
- [ ] Includes a full market cycle?
- [ ] Tested across multiple market regimes?

### Parameter Robustness
- [ ] Does the strategy work with nearby parameter values?
- [ ] Are there "plateaus" of stable performance?
- [ ] Minimal parameters (ideally <5)?
- [ ] Parameters based on logical reasoning, not pure optimization?

### Execution Realism
- [ ] Realistic commissions included?
- [ ] Slippage modeled conservatively (1.5-2x typical)?
- [ ] Worst-case fills considered?
- [ ] Order rejection/partial fills addressed?

### Performance Characteristics
- [ ] Positive expectancy in the majority of years?
- [ ] Acceptable performance in all major regimes?
- [ ] No catastrophic drawdowns (>50%)?
- [ ] Edge large enough to survive friction?

### Bias Prevention
- [ ] Strategy defined before testing?
- [ ] Hypothesis has economic logic?
- [ ] Results aren't "too good to be true"?
- [ ] Out-of-sample testing performed?
- [ ] No cherry-picking of examples?

### Tool Limitations
- [ ] Aware of the testing platform's interpolation methods?
- [ ] Understand how the platform handles low-liquidity situations?
- [ ] Know the quirks specific to your data provider?

**If more than 2-3 items aren't checked, the backtest requires additional work before considering live implementation.**
references/methodology.md (new file, 227 lines)
# Backtesting Methodology Reference

## Table of Contents

1. Core Testing Techniques
2. Stress Testing Methods
3. Parameter Sensitivity Analysis
4. Slippage and Friction Modeling
5. Sample Size Guidelines
6. Market Regime Analysis
7. Common Pitfalls and Biases
## 1. Core Testing Techniques

### "Beat Ideas to Death" Approach

**Core principle**: Add friction and punishment to find strategies that break the least, not those that profit the most on paper.

**Key techniques**:
- Multiple stop loss variations
- Different profit targets
- Realistic and exaggerated commissions
- Worst-case fills
- Extended time periods
- Multiple market regimes

### The 80/20 Rule for R&D Time

- 20% generating and codifying ideas
- 80% stress testing and trying to break them
## 2. Stress Testing Methods

### Execution Friction Tests

**Required friction additions**:
- Realistic commissions (actual broker rates)
- Pessimistic slippage (1.5-2x typical)
- Worst-case entry fills (ask + 1-2 ticks)
- Worst-case exit fills (bid - 1-2 ticks)
- Order rejection scenarios
- Partial fills

### Parameter Robustness Tests

Test across multiple configurations:
- Entry timing variations (±15-30 minutes)
- Stop loss distances (50%, 75%, 100%, 125%, 150% of baseline)
- Profit targets (80%, 90%, 100%, 110%, 120% of baseline)
- Position sizing rules
- Filter thresholds

**Goal**: Find "plateau" performance where small parameter changes don't drastically alter results.
### Time-Based Robustness

**Minimum requirements**:
- Test across at least 5-10 years
- Include multiple market regimes:
  - Bull markets
  - Bear markets
  - High volatility periods
  - Low volatility periods
  - Trending markets
  - Range-bound markets

**Year-by-year analysis**: The strategy should show positive expectancy in the majority of years, not rely on 1-2 exceptional years.
## 3. Parameter Sensitivity Analysis

### Heat Map Analysis

Create 2D heat maps varying two parameters simultaneously:
- Profit target (rows) × Stop loss (columns)
- Entry time (rows) × Exit time (columns)
- Volatility filter (rows) × Volume filter (columns)

**Interpretation**:
- Robust strategies show "plateaus" of consistent performance
- Fragile strategies show "spikes" or narrow optimal ranges
- Avoid strategies with performance cliffs at parameter boundaries
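A minimal sketch of building such a grid, assuming a hypothetical `run_backtest(profit_target, stop_loss)` placeholder; the real version would call your backtest engine once per cell and hand the grid to a plotting library:

```python
# Hypothetical profit-target x stop-loss grid. Robust strategies show a
# block of similar values (a plateau); fragile ones a single bright cell.

def run_backtest(profit_target, stop_loss):
    # Placeholder returning a smooth, plateau-like surface for illustration.
    return round(10 - abs(profit_target - 2.0) - abs(stop_loss - 1.5), 2)

targets = [1.5, 2.0, 2.5]  # rows
stops = [1.0, 1.5, 2.0]    # columns

grid = {(pt, sl): run_backtest(pt, sl) for pt in targets for sl in stops}

for pt in targets:
    row = "  ".join(f"{grid[(pt, sl)]:5.2f}" for sl in stops)
    print(f"PT {pt}: {row}")
```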
### Walk-Forward Analysis

1. Optimize parameters on a training period (e.g., Years 1-2)
2. Test with those parameters on a validation period (Year 3)
3. Roll forward and repeat
4. Compare in-sample vs. out-of-sample performance

**Warning signs**:
- Out-of-sample performance <50% of in-sample
- Frequent need to re-optimize parameters
- Parameters that change dramatically between periods
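The four steps above can be sketched as a rolling loop. `optimize` and `evaluate` are hypothetical stand-ins for your own fitting and scoring routines:

```python
# Schematic walk-forward loop over yearly slices.

years = list(range(2015, 2024))

def optimize(train_years):
    # Placeholder: fit parameters on the training window.
    return {"stop": 2.5, "target": 5.0}

def evaluate(params, test_year):
    # Placeholder: out-of-sample annual return (%) for one year.
    return 8.0

results = []
train_len = 2  # e.g., optimize on 2 years, validate on the next 1
for i in range(len(years) - train_len):
    train = years[i:i + train_len]
    test = years[i + train_len]
    params = optimize(train)           # in-sample fit
    results.append((test, evaluate(params, test)))  # out-of-sample score

print(results[0])  # (2017, 8.0): first validation year and its OOS return
```

Comparing the out-of-sample scores in `results` against the in-sample performance of each `optimize` call is what surfaces the warning signs listed above.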
## 4. Slippage and Friction Modeling

### Realistic Slippage Assumptions

**By market capitalization**:
- Mega cap (>$200B): 0.01-0.02%
- Large cap ($10B-$200B): 0.02-0.05%
- Mid cap ($2B-$10B): 0.05-0.10%
- Small cap ($300M-$2B): 0.10-0.20%
- Micro cap (<$300M): 0.20-0.50%+

**By order type**:
- Market orders: Higher slippage
- Limit orders: Lower slippage but potential non-fills
- Stop orders: Significant slippage in volatile conditions

### Conservative Testing Approach

Use 1.5-2x typical slippage estimates for stress testing:
- If typical slippage is 0.05%, test with 0.075-0.10%
- If typical is 0.10%, test with 0.15-0.20%

**Rationale**: Strategies that survive pessimistic assumptions often perform better in practice than in backtests.
## 5. Sample Size Guidelines

### Minimum Trade Requirements

**Statistical significance thresholds**:
- Absolute minimum: 30 trades
- Preferred minimum: 100 trades
- High confidence: 200+ trades

**Why large samples matter**:
- Reduces the impact of outliers
- Provides statistical confidence
- Reveals true edge vs. luck

### Time Period Considerations

**Minimum testing period**: 5 years
**Preferred testing period**: 10+ years

**Must include**:
- At least one full market cycle
- Multiple volatility regimes
- Different Federal Reserve policy environments
## 6. Market Regime Analysis

### Regime Classification

**Volatility-based regimes**:
- Low volatility: VIX <15
- Normal volatility: VIX 15-25
- High volatility: VIX 25-35
- Extreme volatility: VIX >35

**Trend-based regimes**:
- Strong uptrend: Market +10% or more over 6 months
- Moderate uptrend: Market +5% to +10% over 6 months
- Sideways: Market -5% to +5% over 6 months
- Downtrend: Market below -5% over 6 months
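These buckets translate directly into a small classifier. One assumption on our part: the text leaves the boundary values (e.g., VIX exactly 15 or 25) unspecified, so this sketch assigns them to the milder bucket:

```python
# Regime buckets from the classification above.

def volatility_regime(vix):
    if vix < 15:
        return "low"
    if vix <= 25:
        return "normal"
    if vix <= 35:
        return "high"
    return "extreme"

def trend_regime(six_month_return_pct):
    if six_month_return_pct >= 10:
        return "strong uptrend"
    if six_month_return_pct >= 5:
        return "moderate uptrend"
    if six_month_return_pct >= -5:
        return "sideways"
    return "downtrend"

print(volatility_regime(28), trend_regime(-12))  # high downtrend
```

Tagging every backtest trade with these two labels makes the per-regime breakdowns in the next subsection a simple group-by.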
### Performance Requirements by Regime

**Robust strategy characteristics**:
- Positive expectancy in the majority of regimes
- Acceptable (not necessarily best) in all regimes
- No catastrophic failures in any single regime
- Understanding of which regime causes weakness
## 7. Common Pitfalls and Biases

### Survivorship Bias

**Issue**: Testing only on currently-trading stocks ignores delisted and bankrupt companies.

**Solution**: Use survivorship-bias-free datasets that include historical delistings.

### Look-Ahead Bias

**Issue**: Using information not available at the time of the trade.

**Examples**:
- Using end-of-day data for intraday decisions
- Using the next day's open for today's close decisions
- Calculating indicators with future data points

**Prevention**: Strict timestamp control and data alignment checks.

### Curve-Fitting (Over-Optimization)

**Warning signs**:
- Too many parameters (>5-7)
- Highly specific parameter values (e.g., RSI = 37.3)
- Perfect backtest results
- Large performance drop in the validation period

**Prevention techniques**:
- Limit parameters to essential ones only
- Use round numbers when possible
- Require out-of-sample testing
- Analyze parameter sensitivity

### Sample Selection Bias

**Issue**: Testing only on hand-picked examples (e.g., known market leaders).

**Problem**: Ignoring all the stocks that met the criteria but failed creates a false impression of strategy quality.

**Solution**: Test on ALL historical examples meeting the criteria, not just successful outcomes.

### Hindsight Bias

**Issue**: Using outcome knowledge to influence decisions.

**Prevention for systematic trading**:
- Define all rules in advance
- No manual intervention based on hindsight
- Test rules across all cases, not cherry-picked examples
### Data Mining Bias

**Issue**: Testing hundreds of strategies until finding one that "works" by random chance.

**Risk**: With enough attempts, random data will produce seemingly profitable patterns.

**Mitigation**:
- Have a hypothesis before testing
- Require economic logic for the edge
- Use a Bonferroni correction for multiple comparisons
- Demand higher significance thresholds (p < 0.01 instead of p < 0.05)
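The Bonferroni correction mentioned above is simple arithmetic: when screening m candidate strategies, require each individual test to clear alpha/m so the family-wise error rate stays at or below alpha. The 100-strategy figure below is purely illustrative:

```python
# Bonferroni correction for multiple comparisons.

def bonferroni_threshold(alpha, num_tests):
    """Per-test p-value threshold keeping family-wise error rate <= alpha."""
    return alpha / num_tests

# Screening 100 candidate strategies at a 5% family-wise error rate:
print(f"{bonferroni_threshold(0.05, 100):.4g}")  # 0.0005 per individual test
```

A strategy that only reaches p = 0.03 looks significant in isolation, but after testing 100 ideas it falls far short of the corrected 0.0005 bar, which is the point of the mitigation list above.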