# Analysis Techniques — When to Use Each

## Hypothesis Testing

**Use when:** Comparing two or more groups to determine whether an observed difference is real or due to random chance.

**Technique selection:**

| Data type | Groups | Test |
|-----------|--------|------|
| Continuous | 2 | t-test (if approximately normal) or Mann-Whitney U |
| Continuous | 3+ | ANOVA (if approximately normal) or Kruskal-Wallis |
| Proportions | 2 | Chi-square or Fisher's exact (small samples) |
| Paired data | 2 | Paired t-test or Wilcoxon signed-rank |

**Key outputs:**
- p-value (probability of a difference at least this large if there were no real effect)
- Effect size (how big the difference is: Cohen's d, odds ratio)
- Confidence interval (range of plausible true values)
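As a minimal sketch of these three outputs, the following uses only Python's standard library: a permutation test for the p-value, Cohen's d for the effect size, and a percentile bootstrap for the confidence interval. The function names, the sample data, and the choice of a permutation test (rather than one of the tests in the table) are illustrative assumptions; in practice a statistics library would normally do this work.

```python
import random
from statistics import mean, stdev


def permutation_p_value(a, b, n_perm=5000, seed=0):
    """Two-sided p-value: share of label shuffles whose mean gap is at least the observed gap."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        gap = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if gap >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_perm + 1)


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


def bootstrap_ci(a, b, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the difference in means."""
    rng = random.Random(seed)
    diffs = sorted(
        mean(rng.choices(a, k=len(a))) - mean(rng.choices(b, k=len(b)))
        for _ in range(n_boot)
    )
    lo = diffs[int(n_boot * alpha / 2)]
    hi = diffs[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi


# Hypothetical data: minutes per session for two variants.
control = [12.1, 9.8, 11.4, 10.2, 13.0, 9.5, 10.9, 11.7]
variant = [13.4, 12.2, 14.1, 11.9, 13.8, 12.7, 14.5, 12.0]

p = permutation_p_value(variant, control)
d = cohens_d(variant, control)
low, high = bootstrap_ci(variant, control)
print(f"p={p:.4f}  d={d:.2f}  95% CI=({low:.2f}, {high:.2f})")
```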

**Watch out for:**
- Large samples can make even trivial differences "significant"; focus on effect size
- Multiple comparisons inflate false positives (apply a correction such as Bonferroni)
- Normality assumptions (use a non-parametric test if violated)
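To illustrate the multiple-comparisons point, here is a small sketch of the Holm-Bonferroni step-down correction, one common way to control the family-wise error rate. The function name and the example p-values are hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null is rejected after Holm correction."""
    m = len(p_values)
    # Visit p-values from smallest to largest, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared against alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, every larger p-value fails too
    return reject


# Hypothetical p-values from five separate comparisons.
p_values = [0.001, 0.04, 0.03, 0.012, 0.2]
uncorrected = [p <= 0.05 for p in p_values]
corrected = holm_bonferroni(p_values)
print(uncorrected)  # four of five look "significant" without correction
print(corrected)    # only two survive the correction
```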

---

## Cohort Analysis

**Use when:** Understanding how user behavior changes over time, segmented by when users started.

**Types:**
- **Retention cohorts:** % of users still active N days after signup
- **Revenue cohorts:** Revenue per cohort over time
- **Behavioral cohorts:** Feature adoption by signup cohort

**Setup:**
1. Define cohort (usually signup week/month)
2. Define event (login, purchase, specific action)
3. Define time windows (day 1, 7, 30, 90)
4. Build matrix: cohort × time period

**Key outputs:**
- Retention curves (line chart by cohort)
- Cohort comparison (are newer cohorts performing better?)
- Time-to-event patterns

**Watch out for:**
- Cohort size differences (small cohorts = noisy data)
- Seasonality (a December cohort may behave differently)
- Definition consistency (what counts as "active"?)
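The four setup steps in this section can be sketched in plain Python. This example assumes monthly signup cohorts, treats any recorded activity as the event, and uses calendar-month offsets as the time windows instead of day 1/7/30/90; the user IDs, dates, and function names are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical inputs: signup date per user and (user, activity date) events.
signups = {
    "u1": date(2024, 1, 5), "u2": date(2024, 1, 20),
    "u3": date(2024, 2, 3), "u4": date(2024, 2, 14),
}
events = [
    ("u1", date(2024, 1, 6)), ("u1", date(2024, 2, 10)), ("u1", date(2024, 3, 2)),
    ("u2", date(2024, 1, 21)),
    ("u3", date(2024, 2, 4)), ("u3", date(2024, 3, 9)),
    ("u4", date(2024, 2, 15)),
]


def month_index(d):
    """Months since year 0, so subtracting two values gives a month offset."""
    return d.year * 12 + d.month


def retention_matrix(signups, events):
    """Return {cohort: {period: share of the cohort active in that period}}."""
    # Step 1: assign every user to a signup-month cohort.
    cohort_users = defaultdict(set)
    for user, signed_up in signups.items():
        cohort_users[signed_up.strftime("%Y-%m")].add(user)

    # Steps 2 and 3: bucket each event by cohort and months since signup.
    active = defaultdict(set)
    for user, when in events:
        signed_up = signups[user]
        period = month_index(when) - month_index(signed_up)
        active[(signed_up.strftime("%Y-%m"), period)].add(user)

    # Step 4: cohort x period matrix of retention shares.
    matrix = defaultdict(dict)
    for (cohort, period), users in sorted(active.items()):
        matrix[cohort][period] = len(users) / len(cohort_users[cohort])
    return dict(matrix)


matrix = retention_matrix(signups, events)
for cohort, row in matrix.items():
    print(cohort, {period: f"{share:.0%}" for period, share in row.items()})
```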

---

## Funnel Analysis

**Use when:** Understanding conversion through a multi-step process.

**Setup:**
1. Define stages (visit -> signup -> activate -> purchase)
2. Count users at each stage
3. Calculate drop-off rates between stages

**Key outputs:**
- Conversion rates per stage
- Biggest drop-off points
- Segment comparison (mobile vs desktop funnels)

**Watch out for:**
- Time window (did they convert eventually, or just not today?)
- Stage ordering (users don't always follow linear paths)
- Defining "same session" vs "ever"
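A minimal sketch of the setup steps above, assuming an "ever reached" definition and a strict stage order (a user only counts at a stage if they also reached every earlier one). The stage names follow the example in the setup list; the user IDs and the `funnel` helper are hypothetical.

```python
# Hypothetical stage membership: the set of user IDs that ever reached each stage.
stages = ["visit", "signup", "activate", "purchase"]
reached = {
    "visit": {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
    "signup": {1, 2, 3, 4, 5, 11},  # user 11 signed up without a tracked visit
    "activate": {1, 2, 3},
    "purchase": {1, 2},
}


def funnel(stages, reached):
    """Count users who reached each stage and every earlier stage, with step conversion."""
    rows = []
    survivors = None
    for stage in stages:
        # Intersecting with earlier survivors enforces the stage order.
        survivors = set(reached[stage]) if survivors is None else survivors & reached[stage]
        prev = rows[-1]["users"] if rows else None
        rate = len(survivors) / prev if prev else None
        rows.append({"stage": stage, "users": len(survivors), "step_conversion": rate})
    return rows


rows = funnel(stages, reached)
for row in rows:
    rate = "" if row["step_conversion"] is None else f"{row['step_conversion']:.0%}"
    print(f"{row['stage']:<10}{row['users']:>4}  {rate}")

# The biggest drop-off is the step with the lowest conversion from the previous stage.
worst = min(rows[1:], key=lambda r: r["step_conversion"])
print("biggest drop-off entering:", worst["stage"])
```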

---

## Regression Analysis

**Use when:** Understanding what predicts an outcome, controlling for other factors.

**Types:**
- **Linear:** Continuous outcome (revenue, time spent)
- **Logistic:** Binary outcome (churned/retained, converted/didn't)
- **Poisson:** Count outcome (purchases, logins)

**Key outputs:**
- Coefficients (effect of each variable, holding others constant)
- R² (share of outcome variance explained by the model)
- p-values per variable
- Residual plots (are assumptions met?)

**Watch out for:**
- Multicollinearity (correlated predictors make coefficients unstable)
- Omitted variable bias (missing important controls)
- Extrapolation beyond the data range
- Causation claims from observational data
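As a small illustration of the linear case, the sketch below fits a one-predictor least-squares line and reports the coefficient, R², and residuals using only the standard library. It deliberately omits multiple predictors and p-values, which need a statistics library; the data and function names are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    y_bar = mean(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical data: weekly sessions (x) and monthly revenue in dollars (y).
sessions = [1, 2, 3, 4, 5, 6, 7, 8]
revenue = [12, 15, 19, 24, 24, 31, 33, 38]

intercept, slope = fit_line(sessions, revenue)
r2 = r_squared(sessions, revenue, intercept, slope)
# Residuals should scatter around zero with no visible pattern if the line fits.
residuals = [y - (intercept + slope * x) for x, y in zip(sessions, revenue)]
print(f"revenue ~ {intercept:.2f} + {slope:.2f} * sessions, R^2 = {r2:.3f}")
```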

---

## Segmentation/Clustering

**Use when:** Discovering natural groups in your data.

**Techniques:**
- **K-means:** Simple, fast, assumes roughly spherical clusters
- **Hierarchical:** Shows cluster relationships, good for exploration
- **RFM:** Business-specific scoring (Recency, Frequency, Monetary)

**Process:**
1. Select features (what defines a segment?)
2. Normalize features (so no feature dominates because of its scale)
3. Choose number of clusters (elbow method, silhouette score)
4. Profile each cluster (what makes them different?)

**Key outputs:**
- Cluster profiles (avg values per segment)
- Segment sizes
- Distinguishing characteristics

**Watch out for:**
- Garbage in, garbage out (feature selection matters)
- Cluster count is subjective
- Stability (do clusters hold with different random seeds?)
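The process steps in this section can be sketched with a plain-Python k-means. This example assumes z-score normalization and a fixed `k=3`, and it seeds centroids with a deterministic farthest-point rule so the output is repeatable; library implementations typically use random or k-means++ seeding, which is where the stability concern comes from. The customer data and helper names are hypothetical.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Normalize each feature to mean 0 and standard deviation 1 (features must vary)."""
    cols = list(zip(*rows))
    stats = [(mean(c), pstdev(c)) for c in cols]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic seeding: start at the first point, then add the point farthest from all seeds."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # converged: assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(mean(dim) for dim in zip(*members))
    return labels


# Hypothetical customers: (orders per year, average order value in dollars).
customers = [(2, 20), (3, 25), (2, 22), (30, 24), (28, 21), (32, 26),
             (3, 300), (2, 280), (4, 310)]

labels = kmeans(zscore_columns(customers), k=3)

# Profile each cluster: size and average raw feature values.
for c in sorted(set(labels)):
    members = [row for row, lab in zip(customers, labels) if lab == c]
    print(c, len(members), [round(mean(col), 1) for col in zip(*members)])
```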

---

## Anomaly Detection

**Use when:** Finding unusual data points that warrant investigation.

**Approaches:**
- **Statistical:** Points more than 2-3 standard deviations from the mean
- **IQR method:** Below Q1 - 1.5×IQR or above Q3 + 1.5×IQR
- **Isolation Forest:** For multivariate anomalies
- **Domain rules:** Negative revenue, future dates, impossible values

**Key outputs:**
- Flagged records with anomaly scores
- Context (why is this unusual?)
- Severity (how far from normal?)

**Watch out for:**
- Seasonality (Black Friday isn't an anomaly)
- Trends (growth makes old "normal" look like anomalies)
- False positives (investigate before acting)
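A short sketch of the two simplest approaches listed above, the standard-deviation rule and the IQR rule, using the standard library's `statistics` module. The daily counts and function names are hypothetical, and quartiles are computed with the module's default method, so other tools may give slightly different fences.

```python
from statistics import mean, quantiles, stdev


def iqr_outliers(values, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]


# Hypothetical daily order counts with one suspicious spike.
daily_orders = [102, 98, 105, 110, 97, 101, 99, 104, 100, 103, 96, 250]

by_iqr = iqr_outliers(daily_orders)
by_zscore = zscore_outliers(daily_orders, threshold=3.0)
print(by_iqr)
print(by_zscore)
```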

---

## Time Series Analysis

**Use when:** Understanding patterns in data over time.

**Components:**
- **Trend:** Long-term direction
- **Seasonality:** Repeating patterns (daily, weekly, yearly)
- **Noise:** Random variation

**Techniques:**
- **Moving averages:** Smooth out noise
- **Decomposition:** Separate trend, seasonal, residual
- **Year-over-year:** Compare same period last year

**Key outputs:**
- Trend direction and strength
- Seasonal patterns identified
- Forecast with uncertainty bands

**Watch out for:**
- Comparing periods of different lengths (months vary in number of days)
- Holidays/events (one-time vs recurring)
- Structural breaks (COVID, product changes)
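Two of the techniques above, a trailing moving average and a year-over-year comparison, can be sketched in a few lines. The monthly figures and function names are hypothetical, and the example assumes evenly spaced monthly data with no gaps.

```python
def moving_average(series, window):
    """Trailing mean over `window` points; the first window - 1 slots are None."""
    out = [None] * (window - 1)
    for i in range(window - 1, len(series)):
        out.append(sum(series[i - window + 1 : i + 1]) / window)
    return out


def year_over_year(series, period=12):
    """Percent change versus the value one full period earlier."""
    return [
        None if i < period else (series[i] - series[i - period]) / series[i - period] * 100
        for i in range(len(series))
    ]


# Hypothetical monthly revenue for two years, in thousands of dollars.
revenue = [100, 90, 110, 120, 130, 125, 140, 135, 150, 160, 200, 220,
           110, 99, 121, 132, 143, 150, 154, 162, 165, 176, 220, 242]

smoothed = moving_average(revenue, window=3)
yoy = year_over_year(revenue)
print(smoothed[:4])
print([round(v, 1) for v in yoy[12:]])
```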