# Analysis Techniques — When to Use Each
## Hypothesis Testing
**Use when:** Comparing two or more groups to determine whether a difference is real or due to random chance.
**Technique selection:**
| Data type | Groups | Test |
|-----------|--------|------|
| Continuous | 2 | t-test (if normal) or Mann-Whitney |
| Continuous | 3+ | ANOVA or Kruskal-Wallis |
| Proportions | 2 | Chi-square or Fisher's exact (small samples) |
| Paired data | 2 | Paired t-test or Wilcoxon signed-rank |
**Key outputs:**
- p-value (probability of seeing a difference at least this large if there were no true difference)
- Effect size (how big is the difference - Cohen's d, odds ratio)
- Confidence interval (range of plausible true values)
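As an illustration of the effect-size output, here is a minimal sketch of Cohen's d using only the standard library. The function name `cohens_d` and the sample data are assumptions made for this example, and the pooled-standard-deviation form shown is one common definition, not the only one.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Hypothetical session lengths (minutes) for two variants
control = [10, 12, 14, 16, 18]
treatment = [12, 14, 16, 18, 20]
d = cohens_d(treatment, control)
print(round(d, 2))  # 0.63
```

A d around 0.2 is conventionally read as small, 0.5 as medium, and 0.8 as large, though the practical meaning depends on the domain.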
**Watch out for:**
- With large samples, even trivial differences become "significant" - focus on effect size
- Multiple comparisons inflate false positives
- Normality assumptions (use non-parametric if violated)
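One simple guard against the multiple-comparisons problem is a Bonferroni correction, sketched below. This is one conservative option among several (Holm and Benjamini-Hochberg are common alternatives); the function name and p-values are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which tests remain significant after a Bonferroni correction."""
    threshold = alpha / len(p_values)  # each test must clear a stricter bar
    return [p < threshold for p in p_values]

# Hypothetical p-values from four separate comparisons
p_values = [0.003, 0.02, 0.04, 0.30]
flags = bonferroni_significant(p_values)
print(flags)  # [True, False, False, False]
```

Without the correction, three of the four tests would pass at 0.05; with it, only the first survives the 0.0125 threshold.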
---
## Cohort Analysis
**Use when:** Understanding how user behavior changes over time, segmented by when they started.
**Types:**
- **Retention cohorts:** % of users still active N days after signup
- **Revenue cohorts:** Revenue per cohort over time
- **Behavioral cohorts:** Feature adoption by signup cohort
**Setup:**
1. Define cohort (usually signup week/month)
2. Define event (login, purchase, specific action)
3. Define time windows (day 1, 7, 30, 90)
4. Build matrix: cohort × time period
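The setup steps above can be sketched in plain Python. Everything here is an assumption for illustration: the data shapes, the helper names `cohort_key` and `retention_matrix`, weekly cohorts, and the choice to count a user as retained only if they were active exactly N days after signup.

```python
from collections import defaultdict
from datetime import date

# Hypothetical data: signup date per user, plus (user, active date) events
signups = {
    "u1": date(2024, 1, 1),
    "u2": date(2024, 1, 3),
    "u3": date(2024, 1, 9),
}
activity = [
    ("u1", date(2024, 1, 2)),
    ("u1", date(2024, 1, 8)),
    ("u2", date(2024, 1, 4)),
    ("u3", date(2024, 1, 10)),
]

def cohort_key(signup):
    """Label a cohort by the ISO year and week of signup."""
    iso = signup.isocalendar()
    return f"{iso[0]}-W{iso[1]:02d}"

def retention_matrix(signups, activity, windows=(1, 7, 30)):
    """Return {cohort: {window: share of cohort active N days after signup}}."""
    cohort_users = defaultdict(set)
    for user, signup in signups.items():
        cohort_users[cohort_key(signup)].add(user)

    active = defaultdict(lambda: defaultdict(set))
    for user, day in activity:
        offset = (day - signups[user]).days
        if offset in windows:
            active[cohort_key(signups[user])][offset].add(user)

    return {
        cohort: {w: len(active[cohort][w]) / len(users) for w in windows}
        for cohort, users in cohort_users.items()
    }

matrix = retention_matrix(signups, activity)
for cohort, row in sorted(matrix.items()):
    print(cohort, row)
```

Switching to "active on or after day N" or to monthly cohorts changes only the `offset` check and `cohort_key`, which is why the definition of "active" needs to be fixed before cohorts are compared.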
**Key outputs:**
- Retention curves (line chart by cohort)
- Cohort comparison (are newer cohorts performing better?)
- Time-to-event patterns
**Watch out for:**
- Cohort size differences (small cohorts = noisy data)
- Seasonality (December cohort behaves differently)
- Definition consistency (what counts as "active"?)
---
## Funnel Analysis
**Use when:** Understanding conversion through a multi-step process.
**Setup:**
1. Define stages (visit -> signup -> activate -> purchase)
2. Count users at each stage
3. Calculate drop-off rates between stages
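A minimal sketch of these three steps, assuming each user is represented by the set of stages they have ever reached. The data and the `funnel` helper are hypothetical, and the strict linear ordering is a simplification.

```python
stages = ["visit", "signup", "activate", "purchase"]

# Hypothetical: the set of stages each user has ever reached
user_stages = {
    "u1": {"visit", "signup", "activate", "purchase"},
    "u2": {"visit", "signup"},
    "u3": {"visit"},
    "u4": {"visit", "signup", "activate"},
}

def funnel(stages, user_stages):
    """Return (stage, users, drop-off vs previous stage) for each stage."""
    rows = []
    previous = None
    for i, stage in enumerate(stages):
        # A user counts for a stage only if they also reached all earlier ones
        required = set(stages[: i + 1])
        count = sum(1 for reached in user_stages.values() if required <= reached)
        # Drop-off is the share of the previous stage that did not continue
        drop_off = None if not previous else 1 - count / previous
        rows.append((stage, count, drop_off))
        previous = count
    return rows

rows = funnel(stages, user_stages)
for stage, count, drop_off in rows:
    print(stage, count, drop_off)
```

Because this version uses "ever reached" sets, it has no time window; adding timestamps per stage would be needed to separate same-session conversion from eventual conversion.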
**Key outputs:**
- Conversion rates per stage
- Biggest drop-off points
- Segment comparison (mobile vs desktop funnels)
**Watch out for:**
- Time window (did they convert eventually, or just not today?)
- Stage ordering (users don't always follow linear paths)
- Defining "same session" vs "ever"
---
## Regression Analysis
**Use when:** Understanding what predicts an outcome, controlling for other factors.
**Types:**
- **Linear:** Continuous outcome (revenue, time spent)
- **Logistic:** Binary outcome (churned/retained, converted/didn't)
- **Poisson:** Count outcome (purchases, logins)
**Key outputs:**
- Coefficients (effect of each variable, holding others constant)
- R² (share of the outcome's variance the model explains)
- p-values per variable
- Residual plots (are assumptions met?)
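To make the coefficient and R² outputs concrete, here is a sketch of ordinary least squares with a single predictor, written by hand with the standard library. The data and the `fit_line` name are assumptions; with one predictor there is nothing to "hold constant", so this only illustrates the mechanics, and a multi-variable analysis would normally use a statistics library.

```python
import statistics

def fit_line(x, y):
    """Ordinary least squares for one predictor: slope, intercept, R^2."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residuals are what a residual plot would show against x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical data: sessions per week vs revenue
sessions = [1, 2, 3, 4, 5]
revenue = [12, 19, 31, 42, 48]
slope, intercept, r_squared = fit_line(sessions, revenue)
print(round(slope, 2), round(intercept, 2), round(r_squared, 3))
```

Here the slope reads as "each additional session is associated with about 9.5 more revenue units" within the observed range of 1 to 5 sessions, not beyond it.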
**Watch out for:**
- Multicollinearity (correlated predictors)
- Omitted variable bias (missing important controls)
- Extrapolation beyond data range
- Causation claims from observational data
---
## Segmentation/Clustering
**Use when:** Discovering natural groups in your data.
**Techniques:**
- **K-means:** Simple, fast, assumes spherical clusters
- **Hierarchical:** Shows cluster relationships, good for exploration
- **RFM:** Business-specific (Recency, Frequency, Monetary)
**Process:**
1. Select features (what defines a segment?)
2. Normalize features (so scale doesn't dominate)
3. Choose number of clusters (elbow method, silhouette score)
4. Profile each cluster (what makes them different?)
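The normalize-then-cluster part of this process can be sketched with a small pure-Python k-means (Lloyd's algorithm). The helper names, the customer data, and the choice of z-score normalization are assumptions for illustration; a real analysis would typically rely on a tested library implementation.

```python
import math
import random
import statistics

def zscore_columns(rows):
    """Standardize each feature so no single scale dominates distances."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.pstdev(c) or 1.0 for c in columns]  # avoid divide-by-zero
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in rows]

def kmeans(points, k, seed=0, iterations=100):
    """Plain Lloyd's algorithm; returns a cluster label for each point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(statistics.mean(dim) for dim in zip(*members))
    return labels

# Hypothetical customers: (orders per year, average order value)
customers = [(2, 20), (3, 25), (2, 22), (20, 300), (22, 310), (21, 290)]
labels = kmeans(zscore_columns(customers), k=2)
print(labels)
```

Rerunning with different `seed` values is a cheap stability check: label numbers may swap, but the grouping of customers should stay the same if the clusters are real.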
**Key outputs:**
- Cluster profiles (avg values per segment)
- Segment sizes
- Distinguishing characteristics
**Watch out for:**
- Garbage in, garbage out (feature selection matters)
- Cluster count is subjective
- Stability (do clusters hold with different random seeds?)
---
## Anomaly Detection
**Use when:** Finding unusual data points that warrant investigation.
**Approaches:**
- **Statistical:** Points more than 2-3 standard deviations from the mean
- **IQR method:** Below Q1-1.5×IQR or above Q3+1.5×IQR
- **Isolation Forest:** For multivariate anomalies
- **Domain rules:** Negative revenue, future dates, impossible values
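The first two approaches can be sketched with the standard library. The function names and the order counts are hypothetical, and `statistics.quantiles` uses its default quartile method here, which is one of several conventions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]

# Hypothetical daily order counts with one obvious spike
daily_orders = [98, 102, 101, 97, 100, 103, 99, 250]

print(iqr_outliers(daily_orders))          # [250]
print(zscore_outliers(daily_orders, 2.0))  # [250]
print(zscore_outliers(daily_orders, 3.0))  # []
```

The spike escapes the 3-standard-deviation rule because it inflates the standard deviation it is measured against, which is one reason the IQR method tends to be more robust on small samples.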
**Key outputs:**
- Flagged records with anomaly scores
- Context (why is this unusual?)
- Severity (how far from normal?)
**Watch out for:**
- Seasonality (Black Friday isn't an anomaly)
- Trends (growth makes old "normal" look like anomalies)
- False positives (investigate before acting)
---
## Time Series Analysis
**Use when:** Understanding patterns in data over time.
**Components:**
- **Trend:** Long-term direction
- **Seasonality:** Repeating patterns (daily, weekly, yearly)
- **Noise:** Random variation
**Techniques:**
- **Moving averages:** Smooth out noise
- **Decomposition:** Separate trend, seasonal, residual
- **Year-over-year:** Compare same period last year
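Two of these techniques are simple enough to sketch directly. The helper names and data are hypothetical; the moving average shown is a trailing one, and the year-over-year comparison assumes evenly spaced observations with a known period.

```python
def moving_average(values, window):
    """Trailing moving average; the first window - 1 points have no value."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

def year_over_year(values, period=12):
    """Percent change versus the same position one period earlier."""
    return [
        (values[i] - values[i - period]) / values[i - period] * 100
        for i in range(period, len(values))
    ]

# Hypothetical weekly signups
weekly = [10, 12, 14, 16, 18]
smoothed = moving_average(weekly, 3)
print(smoothed)  # [12.0, 14.0, 16.0]

# Hypothetical quarterly revenue over two years (period = 4 quarters)
quarterly = [100, 120, 90, 150, 110, 126, 99, 180]
yoy = year_over_year(quarterly, period=4)
print([round(p, 1) for p in yoy])
```

Comparing each quarter to the same quarter a year earlier sidesteps the seasonal swing that a quarter-over-quarter comparison would pick up.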
**Key outputs:**
- Trend direction and strength
- Seasonal patterns identified
- Forecast with uncertainty bands
**Watch out for:**
- Comparing periods of different lengths (months vary in number of days)
- Holidays/events (one-time vs recurring)
- Structural breaks (COVID, product changes)