Files
ivangdavila_data-analysis/techniques.md

4.9 KiB
Raw Permalink Blame History

Analysis Techniques — When to Use Each

Hypothesis Testing

Use when: Comparing two groups to determine if a difference is real or random chance.

Technique selection:

Data type Groups Test
Continuous 2 t-test (if normal) or Mann-Whitney
Continuous 3+ ANOVA or Kruskal-Wallis
Proportions 2 Chi-square or Fisher's exact
Paired data 2 Paired t-test or Wilcoxon signed-rank

Key outputs:

  • p-value (probability of seeing this difference by chance)
  • Effect size (how big is the difference - Cohen's d, odds ratio)
  • Confidence interval (range of plausible true values)

Watch out for:

  • Large samples make everything "significant" - focus on effect size
  • Multiple comparisons inflate false positives
  • Normality assumptions (use non-parametric if violated)

Cohort Analysis

Use when: Understanding how user behavior changes over time, segmented by when they started.

Types:

  • Retention cohorts: % of users still active N days after signup
  • Revenue cohorts: Revenue per cohort over time
  • Behavioral cohorts: Feature adoption by signup cohort

Setup:

  1. Define cohort (usually signup week/month)
  2. Define event (login, purchase, specific action)
  3. Define time windows (day 1, 7, 30, 90)
  4. Build matrix: cohort × time period

Key outputs:

  • Retention curves (line chart by cohort)
  • Cohort comparison (are newer cohorts performing better?)
  • Time-to-event patterns

Watch out for:

  • Cohort size differences (small cohorts = noisy data)
  • Seasonality (December cohort behaves differently)
  • Definition consistency (what counts as "active"?)

Funnel Analysis

Use when: Understanding conversion through a multi-step process.

Setup:

  1. Define stages (visit -> signup -> activate -> purchase)
  2. Count users at each stage
  3. Calculate drop-off rates between stages

Key outputs:

  • Conversion rates per stage
  • Biggest drop-off points
  • Segment comparison (mobile vs desktop funnels)

Watch out for:

  • Time window (did they convert eventually, or just not today?)
  • Stage ordering (users don't always follow linear paths)
  • Defining "same session" vs "ever"

Regression Analysis

Use when: Understanding what predicts an outcome, controlling for other factors.

Types:

  • Linear: Continuous outcome (revenue, time spent)
  • Logistic: Binary outcome (churned/retained, converted/didn't)
  • Poisson: Count outcome (purchases, logins)

Key outputs:

  • Coefficients (effect of each variable, holding others constant)
  • R² (how much variance is explained)
  • p-values per variable
  • Residual plots (are assumptions met?)

Watch out for:

  • Multicollinearity (correlated predictors)
  • Omitted variable bias (missing important controls)
  • Extrapolation beyond data range
  • Causation claims from observational data

Segmentation/Clustering

Use when: Discovering natural groups in your data.

Techniques:

  • K-means: Simple, fast, assumes spherical clusters
  • Hierarchical: Shows cluster relationships, good for exploration
  • RFM: Business-specific (Recency, Frequency, Monetary)

Process:

  1. Select features (what defines a segment?)
  2. Normalize features (so scale doesn't dominate)
  3. Choose number of clusters (elbow method, silhouette score)
  4. Profile each cluster (what makes them different?)

Key outputs:

  • Cluster profiles (avg values per segment)
  • Segment sizes
  • Distinguishing characteristics

Watch out for:

  • Garbage in, garbage out (feature selection matters)
  • Cluster count is subjective
  • Stability (do clusters hold with different random seeds?)

Anomaly Detection

Use when: Finding unusual data points that warrant investigation.

Approaches:

  • Statistical: Points beyond 2-3 standard deviations
  • IQR method: Below Q1-1.5×IQR or above Q3+1.5×IQR
  • Isolation Forest: For multivariate anomalies
  • Domain rules: Negative revenue, future dates, impossible values

Key outputs:

  • Flagged records with anomaly scores
  • Context (why is this unusual?)
  • Severity (how far from normal?)

Watch out for:

  • Seasonality (Black Friday isn't an anomaly)
  • Trends (growth makes old "normal" look like anomalies)
  • False positives (investigate before acting)

Time Series Analysis

Use when: Understanding patterns in data over time.

Components:

  • Trend: Long-term direction
  • Seasonality: Repeating patterns (daily, weekly, yearly)
  • Noise: Random variation

Techniques:

  • Moving averages: Smooth out noise
  • Decomposition: Separate trend, seasonal, residual
  • Year-over-year: Compare same period last year

Key outputs:

  • Trend direction and strength
  • Seasonal patterns identified
  • Forecast with uncertainty bands

Watch out for:

  • Comparing different lengths (months vary in days)
  • Holidays/events (one-time vs recurring)
  • Structural breaks (COVID, product changes)