Initial commit with translated description

2026-03-29 09:49:22 +08:00
commit dc93ba7c9d
7 changed files with 593 additions and 0 deletions
--- a/techniques.md
+++ b/techniques.md
@@ -0,0 +1,169 @@
+# Analysis Techniques — When to Use Each
+
+## Hypothesis Testing
+
+**Use when:** Comparing two or more groups to determine whether a difference is real or due to random chance.
+
+**Technique selection:**
+| Data type | Groups | Test |
+|-----------|--------|------|
+| Continuous | 2 | t-test (if normal) or Mann-Whitney |
+| Continuous | 3+ | ANOVA or Kruskal-Wallis |
+| Proportions | 2 | Chi-square or Fisher's exact |
+| Paired data | 2 | Paired t-test or Wilcoxon signed-rank |
+
+**Key outputs:**
+- p-value (probability of seeing a difference at least this large if there were no true difference)
+- Effect size (how big the difference is: Cohen's d, odds ratio)
+- Confidence interval (range of plausible true values)
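+
+As a rough illustration of the first two outputs, the sketch below computes Cohen's d and a two-sided p-value for a difference in means. It swaps in a permutation test (a non-parametric alternative to the t-test) so that it needs only the standard library. The data and the names `variant_a`, `variant_b`, `cohens_d`, and `permutation_p_value` are hypothetical.
+
+```python
+import random
+import statistics
+
+def cohens_d(a, b):
+    """Standardized mean difference using the pooled sample standard deviation."""
+    na, nb = len(a), len(b)
+    pooled_var = ((na - 1) * statistics.variance(a)
+                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
+    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5
+
+def permutation_p_value(a, b, n_permutations=2000, seed=0):
+    """Share of random label shuffles whose mean gap is at least the observed gap."""
+    rng = random.Random(seed)
+    observed = abs(statistics.mean(a) - statistics.mean(b))
+    pooled = list(a) + list(b)
+    hits = 0
+    for _ in range(n_permutations):
+        rng.shuffle(pooled)
+        gap = abs(statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):]))
+        if gap >= observed:
+            hits += 1
+    # Add-one correction keeps the estimate strictly above zero.
+    return (hits + 1) / (n_permutations + 1)
+
+# Hypothetical order values for two checkout variants.
+variant_a = [23.1, 25.4, 24.8, 26.0, 27.3, 24.1, 25.9, 26.5]
+variant_b = [21.0, 22.4, 23.8, 22.1, 21.7, 23.0, 22.6, 21.9]
+
+d = cohens_d(variant_a, variant_b)
+p = permutation_p_value(variant_a, variant_b)
+print(f"Cohen's d = {d:.2f}, permutation p-value = {p:.4f}")
+```
+
+Reporting `d` next to `p` keeps the size of the difference visible even when the p-value is tiny.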
+
+**Watch out for:**
+- Large samples can make even trivial differences "significant", so focus on effect size
+- Multiple comparisons inflate false positives
+- Normality assumptions (use non-parametric if violated)
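+
+One common way to handle the multiple-comparisons problem is to tighten the per-test threshold. The sketch below shows the Bonferroni and Holm corrections, chosen here as examples since the text above does not prescribe a method. The p-values are made up for illustration.
+
+```python
+def bonferroni(p_values, alpha=0.05):
+    """Reject only where p <= alpha / m, with m the number of tests."""
+    m = len(p_values)
+    return [p <= alpha / m for p in p_values]
+
+def holm(p_values, alpha=0.05):
+    """Holm step-down: compare the k-th smallest p-value (k = 0, 1, ...) with alpha / (m - k)."""
+    m = len(p_values)
+    order = sorted(range(m), key=lambda i: p_values[i])
+    reject = [False] * m
+    for rank, idx in enumerate(order):
+        if p_values[idx] <= alpha / (m - rank):
+            reject[idx] = True
+        else:
+            break  # once one test fails, every larger p-value fails too
+    return reject
+
+# Hypothetical p-values from five segment comparisons.
+p_values = [0.012, 0.045, 0.015, 0.004, 0.030]
+
+print("uncorrected:", [p <= 0.05 for p in p_values])
+print("bonferroni: ", bonferroni(p_values))
+print("holm:       ", holm(p_values))
+```
+
+On this input the uncorrected rule flags all five tests, Bonferroni keeps one, and Holm keeps three, which shows how much the choice of correction can matter.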
+
+---
+
+## Cohort Analysis
+
+**Use when:** Understanding how user behavior changes over time, segmented by when they started.
+
+**Types:**
+- **Retention cohorts:** % of users still active N days after signup
+- **Revenue cohorts:** Revenue per cohort over time
+- **Behavioral cohorts:** Feature adoption by signup cohort
+
+**Setup:**
+1. Define cohort (usually signup week/month)
+2. Define event (login, purchase, specific action)
+3. Define time windows (day 1, 7, 30, 90)
+4. Build matrix: cohort × time period
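+
+The four setup steps can be sketched as follows, assuming monthly signup cohorts, "any recorded event" as the activity definition, and whole-month time windows. The `signups` and `events` structures and the function names are hypothetical.
+
+```python
+from collections import defaultdict
+from datetime import date
+
+def month_index(d):
+    """Count months on a single axis so differences give whole-month offsets."""
+    return d.year * 12 + d.month
+
+def retention_matrix(signups, events):
+    """Return {cohort: {months_since_signup: share of the cohort that was active}}.
+
+    signups: {user_id: signup_date}; events: iterable of (user_id, event_date).
+    """
+    cohort_users = defaultdict(set)
+    for user, signed_up in signups.items():
+        cohort_users[(signed_up.year, signed_up.month)].add(user)
+
+    active = defaultdict(set)  # (cohort, offset) -> users seen in that period
+    for user, when in events:
+        signed_up = signups[user]
+        offset = month_index(when) - month_index(signed_up)
+        if offset >= 0:
+            active[((signed_up.year, signed_up.month), offset)].add(user)
+
+    matrix = defaultdict(dict)
+    for (cohort, offset), users in active.items():
+        matrix[cohort][offset] = len(users) / len(cohort_users[cohort])
+    return dict(matrix)
+
+# Hypothetical data: two January signups, one February signup.
+signups = {"u1": date(2026, 1, 5), "u2": date(2026, 1, 20), "u3": date(2026, 2, 3)}
+events = [
+    ("u1", date(2026, 1, 6)), ("u2", date(2026, 1, 21)),
+    ("u1", date(2026, 2, 10)), ("u1", date(2026, 2, 11)),
+    ("u3", date(2026, 2, 4)), ("u3", date(2026, 3, 1)),
+]
+
+matrix = retention_matrix(signups, events)
+for cohort in sorted(matrix):
+    print(cohort, dict(sorted(matrix[cohort].items())))
+```
+
+Users are stored in sets, so repeated events by the same user in one period count once.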
+
+**Key outputs:**
+- Retention curves (line chart by cohort)
+- Cohort comparison (are newer cohorts performing better?)
+- Time-to-event patterns
+
+**Watch out for:**
+- Cohort size differences (small cohorts = noisy data)
+- Seasonality (December cohort behaves differently)
+- Definition consistency (what counts as "active"?)
+
+---
+
+## Funnel Analysis
+
+**Use when:** Understanding conversion through a multi-step process.
+
+**Setup:**
+1. Define stages (visit -> signup -> activate -> purchase)
+2. Count users at each stage
+3. Calculate drop-off rates between stages
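+
+A minimal sketch of these steps, assuming a strict funnel in which a user only counts at a stage after reaching every earlier stage. The stage names follow the example above; the user IDs and the `funnel_report` helper are hypothetical.
+
+```python
+def funnel_report(stages, users_by_stage):
+    """Return (stage, user_count, conversion_from_previous_stage) for each stage."""
+    report = []
+    remaining = None  # users who have completed every stage so far
+    previous = None
+    for stage in stages:
+        reached = set(users_by_stage.get(stage, ()))
+        remaining = reached if remaining is None else remaining & reached
+        count = len(remaining)
+        if previous is None:
+            rate = None  # the first stage has nothing to convert from
+        else:
+            rate = count / previous if previous else 0.0
+        report.append((stage, count, rate))
+        previous = count
+    return report
+
+stages = ["visit", "signup", "activate", "purchase"]
+users_by_stage = {
+    "visit": set(range(1, 11)),
+    "signup": {1, 2, 3, 4, 5, 11},  # user 11 has no recorded visit
+    "activate": {1, 2, 3},
+    "purchase": {1},
+}
+
+report = funnel_report(stages, users_by_stage)
+for stage, count, rate in report:
+    drop = "n/a" if rate is None else f"{1 - rate:.0%}"
+    print(f"{stage:<10} users={count:<3} drop-off={drop}")
+```
+
+Intersecting with the earlier stages is one design choice. An "ever reached" funnel would count user 11 at signup and give a different conversion rate.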
+
+**Key outputs:**
+- Conversion rates per stage
+- Biggest drop-off points
+- Segment comparison (mobile vs desktop funnels)
+
+**Watch out for:**
+- Time window (did they convert eventually, or just not today?)
+- Stage ordering (users don't always follow linear paths)
+- Conversion scope (within the same session vs. ever)
+
+---
+
+## Regression Analysis
+
+**Use when:** Understanding what predicts an outcome, controlling for other factors.
+
+**Types:**
+- **Linear:** Continuous outcome (revenue, time spent)
+- **Logistic:** Binary outcome (churned/retained, converted/didn't)
+- **Poisson:** Count outcome (purchases, logins)
+
+**Key outputs:**
+- Coefficients (effect of each variable, holding others constant)
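+- R² (share of outcome variance explained; applies to linear models, while logistic and Poisson models use pseudo-R² or deviance instead)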
+- p-values per variable
+- Residual plots (are assumptions met?)
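+
+To make these outputs concrete, here is a sketch of ordinary least squares with a single predictor, computed by hand. With one predictor there is nothing to "hold constant", so this illustrates only the coefficient, R², and residuals. The data and the `simple_ols` helper are hypothetical.
+
+```python
+import statistics
+
+def simple_ols(x, y):
+    """Fit y = intercept + slope * x by least squares; return fit statistics and residuals."""
+    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
+    sxx = sum((xi - mean_x) ** 2 for xi in x)
+    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
+    slope = sxy / sxx
+    intercept = mean_y - slope * mean_x
+    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
+    ss_res = sum(r ** 2 for r in residuals)
+    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
+    r_squared = 1 - ss_res / ss_tot
+    return slope, intercept, r_squared, residuals
+
+# Hypothetical data: weekly sessions per user vs. revenue per user.
+sessions = [1, 2, 3, 4, 5, 6]
+revenue = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]
+
+slope, intercept, r_squared, residuals = simple_ols(sessions, revenue)
+print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r_squared:.4f}")
+print("residuals:", [round(r, 2) for r in residuals])
+```
+
+Plotting `residuals` against `sessions` is a quick check for curvature or changing spread, either of which suggests the linear form is a poor fit.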
+
+**Watch out for:**
+- Multicollinearity (correlated predictors)
+- Omitted variable bias (missing important controls)
+- Extrapolation beyond data range
+- Causation claims from observational data
+
+---
+
+## Segmentation/Clustering
+
+**Use when:** Discovering natural groups in your data.
+
+**Techniques:**
+- **K-means:** Simple, fast, assumes spherical clusters
+- **Hierarchical:** Shows cluster relationships, good for exploration
+- **RFM:** Business-specific (Recency, Frequency, Monetary)
+
+**Process:**
+1. Select features (what defines a segment?)
+2. Normalize features (so scale doesn't dominate)
+3. Choose number of clusters (elbow method, silhouette score)
+4. Profile each cluster (what makes them different?)
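+
+Steps 2 to 4 can be sketched with a small hand-written k-means. This version makes two simplifying assumptions: z-score normalization, and deterministic farthest-point initialization instead of random starts, so it cannot show seed instability. The customer data and all function names are hypothetical.
+
+```python
+import statistics
+
+def zscore_columns(rows):
+    """Rescale each feature to mean 0 and standard deviation 1."""
+    columns = list(zip(*rows))
+    means = [statistics.mean(c) for c in columns]
+    stdevs = [statistics.pstdev(c) or 1.0 for c in columns]  # guard constant columns
+    return [tuple((v - m) / s for v, m, s in zip(row, means, stdevs)) for row in rows]
+
+def sq_dist(p, q):
+    return sum((a - b) ** 2 for a, b in zip(p, q))
+
+def kmeans(points, k, iterations=100):
+    """Lloyd's algorithm; the first centroid is points[0], the rest are farthest points."""
+    centroids = [points[0]]
+    while len(centroids) < k:
+        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
+    labels = None
+    for _ in range(iterations):
+        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
+        if new_labels == labels:
+            break  # assignments stopped changing
+        labels = new_labels
+        for j in range(k):
+            members = [p for p, lab in zip(points, labels) if lab == j]
+            if members:  # keep the old centroid if a cluster is empty
+                centroids[j] = tuple(statistics.mean(dim) for dim in zip(*members))
+    return labels, centroids
+
+# Hypothetical customers: (orders per month, monthly spend).
+customers = [(1, 100), (2, 110), (1.5, 105), (8, 900), (9, 950), (8.5, 920)]
+
+labels, _ = kmeans(zscore_columns(customers), k=2)
+for cluster in sorted(set(labels)):
+    members = [c for c, lab in zip(customers, labels) if lab == cluster]
+    profile = [round(statistics.mean(dim), 1) for dim in zip(*members)]
+    print(f"cluster {cluster}: size={len(members)} avg(orders, spend)={profile}")
+```
+
+Profiles are computed on the original values, not the normalized ones, so the averages stay readable in business units.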
+
+**Key outputs:**
+- Cluster profiles (avg values per segment)
+- Segment sizes
+- Distinguishing characteristics
+
+**Watch out for:**
+- Garbage in, garbage out (feature selection matters)
+- Cluster count is subjective
+- Stability (do clusters hold with different random seeds?)
+
+---
+
+## Anomaly Detection
+
+**Use when:** Finding unusual data points that warrant investigation.
+
+**Approaches:**
+- **Statistical:** Points more than 2-3 standard deviations from the mean
+- **IQR method:** Below Q1-1.5×IQR or above Q3+1.5×IQR
+- **Isolation Forest:** For multivariate anomalies
+- **Domain rules:** Negative revenue, future dates, impossible values
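+
+The first two approaches can be sketched with the standard library. The data is hypothetical, and `statistics.quantiles` uses its default "exclusive" method, so the quartiles may differ slightly from other tools. In a sample of ten points no value can sit more than about 2.85 sample standard deviations from the mean, because the outlier itself inflates both statistics. A 3-sigma rule therefore finds nothing here.
+
+```python
+import statistics
+
+def zscore_outliers(values, threshold=3.0):
+    """Values more than `threshold` sample standard deviations from the mean."""
+    mean = statistics.mean(values)
+    sd = statistics.stdev(values)
+    return [v for v in values if abs(v - mean) / sd > threshold]
+
+def iqr_outliers(values, k=1.5):
+    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
+    q1, _, q3 = statistics.quantiles(values, n=4)
+    iqr = q3 - q1
+    low, high = q1 - k * iqr, q3 + k * iqr
+    return [v for v in values if v < low or v > high]
+
+# Hypothetical daily order counts with one suspicious day.
+daily_orders = [10, 12, 11, 13, 12, 11, 10, 12, 13, 100]
+
+print("z-score, threshold 3.0:", zscore_outliers(daily_orders, threshold=3.0))
+print("z-score, threshold 2.5:", zscore_outliers(daily_orders, threshold=2.5))
+print("IQR rule:              ", iqr_outliers(daily_orders))
+```
+
+The IQR rule is based on quartiles, so a single extreme value barely moves its fences. That makes it the safer default for small or skewed samples.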
+
+**Key outputs:**
+- Flagged records with anomaly scores
+- Context (why is this unusual?)
+- Severity (how far from normal?)
+
+**Watch out for:**
+- Seasonality (Black Friday isn't an anomaly)
+- Trends (growth makes old "normal" look like anomalies)
+- False positives (investigate before acting)
+
+---
+
+## Time Series Analysis
+
+**Use when:** Understanding patterns in data over time.
+
+**Components:**
+- **Trend:** Long-term direction
+- **Seasonality:** Repeating patterns (daily, weekly, yearly)
+- **Noise:** Random variation
+
+**Techniques:**
+- **Moving averages:** Smooth out noise
+- **Decomposition:** Separate trend, seasonal, residual
+- **Year-over-year:** Compare with the same period last year
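+
+Two of these techniques are simple enough to sketch directly: a trailing moving average and a year-over-year comparison. The sketch assumes evenly spaced monthly data; the series and the function names are hypothetical.
+
+```python
+def moving_average(values, window):
+    """Trailing mean over each full window; the result has window - 1 fewer points."""
+    return [sum(values[i - window + 1:i + 1]) / window
+            for i in range(window - 1, len(values))]
+
+def year_over_year(values, period=12):
+    """Relative change versus the value one full period earlier."""
+    return [(values[i] - values[i - period]) / values[i - period]
+            for i in range(period, len(values))]
+
+# Hypothetical monthly revenue for two consecutive years.
+monthly = [
+    100, 90, 110, 120, 130, 150, 160, 155, 140, 125, 115, 170,
+    110, 99, 121, 132, 143, 165, 176, 170, 154, 138, 127, 187,
+]
+
+smoothed = moving_average(monthly, window=12)
+growth = year_over_year(monthly)
+print("12-month moving average:", [round(v, 1) for v in smoothed])
+print("year-over-year change:  ", [f"{g:.1%}" for g in growth])
+```
+
+A 12-month window averages over one full seasonal cycle, so what remains is mostly trend. The year-over-year comparison removes seasonality by construction.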
+
+**Key outputs:**
+- Trend direction and strength
+- Seasonal patterns identified
+- Forecast with uncertainty bands
+
+**Watch out for:**
+- Comparing periods of different lengths (months vary in days)
+- Holidays/events (one-time vs recurring)
+- Structural breaks (COVID, product changes)