Initial commit with translated description

SKILL.md (new file, 165 lines)

---
name: Data Analysis
slug: data-analysis
version: 1.0.2
homepage: https://clawic.com/skills/data-analysis
description: "Data analysis and visualization."
changelog: Added metric contracts, chart guidance, and decision brief templates for more reliable analysis.
metadata: {"clawdbot":{"emoji":"D","requires":{"bins":[]},"os":["linux","darwin","win32"]}}
---

## When to Use

Use this skill when the user needs to analyze, explain, or visualize data from SQL, spreadsheets, notebooks, dashboards, exports, or ad hoc tables.

Use it for KPI debugging, experiment readouts, funnel or cohort analysis, anomaly reviews, executive reporting, and quality checks on metrics or query logic.

Prefer this skill over generic coding or spreadsheet help when the hard part is analytical judgment: metric definition, comparison design, interpretation, or recommendation.

User asks about: analyzing data, finding patterns, understanding metrics, testing hypotheses, cohort analysis, A/B testing, churn analysis, or statistical significance.

## Core Principle

Analysis without a decision is just arithmetic. Always clarify: **What would change if this analysis shows X vs Y?**

## Methodology First

Before touching data:

1. **What decision** is this analysis supporting?
2. **What would change your mind?** (the real question)
3. **What data do you actually have** vs what you wish you had?
4. **What timeframe** is relevant?

## Statistical Rigor Checklist

- [ ] Sample size sufficient? (small N = wide confidence intervals)
- [ ] Comparison groups fair? (same time period, similar conditions)
- [ ] Multiple comparisons? (20 tests = 1 "significant" by chance)
- [ ] Effect size meaningful? (statistically significant != practically important)
- [ ] Uncertainty quantified? ("12-18% lift" not just "15% lift")
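
The last checklist item can be sketched with stdlib-only Python. The numbers and function name are illustrative, and the baseline rate is treated as fixed when scaling the interval, which slightly understates the true uncertainty:

```python
import math

def lift_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """Normal-approximation 95% CI for the relative lift of B over A.

    conv_*: conversion counts, n_*: sample sizes. Returns (point, low, high)
    for the relative lift (p_b - p_a) / p_a.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return (diff / p_a, (diff - z * se) / p_a, (diff + z * se) / p_a)

point, low, high = lift_ci(conv_a=500, n_a=10_000, conv_b=575, n_b=10_000)
print(f"lift: {point:.1%} (95% CI {low:.1%} to {high:.1%})")
```

With these sample sizes a "15% lift" is really a roughly 3% to 28% range, which is exactly why ranges belong in the readout.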

## Architecture

This skill does not require local folders, persistent memory, or setup state.

Use the included reference files as lightweight guides:

- `metric-contracts.md` for KPI definitions and caveats
- `chart-selection.md` for visual choice and chart anti-patterns
- `decision-briefs.md` for stakeholder-facing outputs
- `pitfalls.md` and `techniques.md` for analytical rigor and method choice

## Quick Reference

Load only the smallest relevant file to keep context focused.

| Topic | File |
|-------|------|
| Metric definition contracts | `metric-contracts.md` |
| Visual selection and chart anti-patterns | `chart-selection.md` |
| Decision-ready output formats | `decision-briefs.md` |
| Failure modes to catch early | `pitfalls.md` |
| Method selection by question type | `techniques.md` |

## Core Rules

### 1. Start from the decision, not the dataset

- Identify the decision owner, the question that could change a decision, and the deadline before doing analysis.
- If no decision would change, reframe the request before computing anything.

### 2. Lock the metric contract before calculating

- Define entity, grain, numerator, denominator, time window, timezone, filters, exclusions, and source of truth.
- If any of those are ambiguous, state the ambiguity explicitly before presenting results.

### 3. Separate extraction, transformation, and interpretation

- Keep query logic, cleanup assumptions, and analytical conclusions distinguishable.
- Never hide business assumptions inside SQL, formulas, or notebook code without naming them in the write-up.

### 4. Choose visuals to answer a question

- Select charts based on the analytical question: trend, comparison, distribution, relationship, composition, funnel, or cohort retention.
- Do not add charts that make the deck look fuller but do not change the decision.

### 5. Brief every result in decision format

- Every output should include the answer, evidence, confidence, caveats, and recommended next action.
- If the output is going to a stakeholder, translate the method into business implications instead of leading with technical detail.

### 6. Stress-test claims before recommending action

- Segment by obvious confounders, compare the right baseline, quantify uncertainty, and check sensitivity to exclusions or time windows.
- Strong-looking numbers without robustness checks are not decision-ready.

### 7. Escalate when the data cannot support the claim

- Block or downgrade conclusions when sample size is weak, the source is unreliable, definitions drifted, or confounding is unresolved.
- It is better to say "we don't know yet" than to produce false confidence.

## Common Traps

- Reusing a KPI name after changing numerator, denominator, or exclusions -> trend comparisons become invalid.
- Comparing daily, weekly, and monthly grains in one chart -> movement looks real but is mostly aggregation noise.
- Showing percentages without underlying counts -> leadership overreacts to tiny denominators.
- Using a pretty chart instead of the right chart -> the output looks polished but hides the actual decision signal.
- Hunting for interesting cuts after seeing the result -> narrative follows chance instead of evidence.
- Shipping automated reports without metric owners or caveats -> bad numbers spread faster than they can be corrected.
- Treating observational patterns as causal proof -> action plans get built on correlation alone.

## Approach Selection

| Question type | Approach | Key output |
|---------------|----------|------------|
| "Is X different from Y?" | Hypothesis test | p-value + effect size + CI |
| "What predicts Z?" | Regression/correlation | Coefficients + R² + residual check |
| "How do users behave over time?" | Cohort analysis | Retention curves by cohort |
| "Are these groups different?" | Segmentation | Profiles + statistical comparison |
| "What's unusual?" | Anomaly detection | Flagged points + context |

For technique details and when to use each, see `techniques.md`.

## Output Standards

1. **Lead with the insight**, not the methodology
2. **Quantify uncertainty** - ranges, not point estimates
3. **State limitations** - what this analysis can't tell you
4. **Recommend next steps** - what would strengthen the conclusion

## Red Flags to Escalate

- User wants to "prove" a predetermined conclusion
- Sample size too small for reliable inference
- Data quality issues that invalidate analysis
- Confounders that can't be controlled for

## External Endpoints

This skill makes no external network requests.

| Endpoint | Data Sent | Purpose |
|----------|-----------|---------|
| None | None | N/A |

No data is sent externally.

## Security & Privacy

Data that leaves your machine:

- Nothing by default.

Data that stays local:

- Nothing by default.

This skill does NOT:

- Access undeclared external endpoints.
- Store credentials or raw exports in hidden local memory files.
- Create or depend on local folder systems for persistence.
- Create automations or background jobs without explicit user confirmation.
- Rewrite its own instruction source files.

## Related Skills

Install with `clawhub install <slug>` if user confirms:

- `sql` - query design and review for reliable data extraction.
- `csv` - cleanup and normalization for tabular inputs before analysis.
- `dashboard` - implementation patterns for KPI visualization layers.
- `report` - structured stakeholder-facing deliverables after analysis.
- `business-intelligence` - KPI systems and operating cadence beyond one-off analysis.

## Feedback

- If useful: `clawhub star data-analysis`
- Stay updated: `clawhub sync`

_meta.json (new file, 6 lines)

{
  "ownerId": "kn73vp5rarc3b14rc7wjcw8f8580t5d1",
  "slug": "data-analysis",
  "version": "1.0.2",
  "publishedAt": 1773241910484
}

chart-selection.md (new file, 40 lines)

# Chart Selection

Choose visuals based on the question, not on what is easiest to render.

## Question to Chart Map

| Question | Preferred chart | Notes |
|----------|-----------------|-------|
| How is a metric changing over time? | line chart | annotate structural breaks and missing data |
| Which groups are highest or lowest? | sorted bar chart | keep a shared baseline |
| How is the distribution shaped? | histogram or box plot | avoid average-only summaries |
| Are two variables related? | scatter plot | show trend and outliers separately |
| How do parts contribute to the whole? | stacked bar with totals | keep category count low |
| Where are users dropping? | funnel chart | define the time window explicitly |
| How do cohorts retain over time? | cohort table or heatmap | show cohort size alongside retention |

## Default Rules

- Bars start at zero unless there is a strong reason not to.
- Show underlying counts next to percentages when denominators are small.
- Prefer direct labels over legends when possible.
- Use one chart per decision question, not one chart per available metric.

## Visual Anti-Patterns

- Pie charts with many slices -> comparisons become guesswork.
- Dual-axis charts -> viewers infer relationships that are not there.
- Cumulative-only charts -> hide recent deterioration or recovery.
- Truncated bar axes -> exaggerate small differences.
- Stacked areas with many categories -> impossible to compare layers.

## Before Shipping a Chart

Check:

1. What decision question this chart answers.
2. Whether the baseline is visible.
3. Whether the grain and time window match the narrative.
4. Whether annotations explain outages, launches, or missing data.
5. Whether a table would be clearer than the chart.

decision-briefs.md (new file, 45 lines)

# Decision Briefs

Use these templates to turn analysis into action instead of dumping findings.

## Standard Decision Brief

1. Decision question.
2. Short answer.
3. Evidence: key numbers and comparison baseline.
4. Confidence: high, medium, or low, with one sentence why.
5. Caveats and what could still change the conclusion.
6. Recommended next action, owner, and due date.

## Experiment Readout

- Hypothesis:
- Primary metric and guardrails:
- Estimated effect and uncertainty:
- Segment differences:
- Ship, iterate, or stop:
- Follow-up test:

## Anomaly Note

- What moved:
- Since when:
- Likely drivers:
- Data quality checks passed or failed:
- Immediate action:
- What to watch next:

## Executive Summary

- One-sentence answer.
- Two or three supporting bullets with numbers.
- One caveat.
- One decision or escalation request.

## Writing Rules

- Lead with the answer, not the method.
- Translate statistics into business implications.
- Separate observations from recommendations.
- If confidence is low, say what would raise confidence.
- Avoid dumping every cut you explored; keep only evidence that changes the decision.

metric-contracts.md (new file, 48 lines)

# Metric Contracts

Use this when a KPI, dashboard tile, or report number could be interpreted in more than one way.

## Contract Template

Capture each metric in this order before trusting comparisons:

1. Business question the metric is meant to answer.
2. Entity and grain: user, account, order, session, day, week, month.
3. Numerator and denominator with exact inclusion logic.
4. Filters and exclusions: internal traffic, refunds, test accounts, paused users.
5. Time window, timezone, and refresh cadence.
6. Source of truth and owner.
7. Known caveats, version changes, and safe interpretation range.
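
One lightweight way to keep a contract explicit is a small record type. This is only a sketch: the class and field names are hypothetical, chosen to mirror the template above, and the example values come from the worked example in this file:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class MetricContract:
    # Illustrative structure; fields follow the contract template.
    metric: str
    question: str
    grain: str
    numerator: str
    denominator: str
    filters: list[str] = field(default_factory=list)
    timezone: str = "UTC"
    source: str = ""
    owner: str = ""
    caveats: list[str] = field(default_factory=list)

paid_conversion = MetricContract(
    metric="Paid conversion rate",
    question="Is onboarding quality improving?",
    grain="weekly",
    numerator="first paid subscriptions",
    denominator="qualified onboarding starts",
    filters=["excludes employees and QA accounts"],
    source="warehouse.subscriptions_daily",
    owner="Growth lead",
    caveats=["Launch week excluded because tracking was partial"],
)
```

Freezing the dataclass makes silent in-place edits to a definition impossible, which is the point of a contract.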

## Minimum Contract Output

| Field | Example |
|-------|---------|
| Metric | Paid conversion rate |
| Question | Is onboarding quality improving? |
| Grain | weekly |
| Numerator | first paid subscriptions |
| Denominator | qualified onboarding starts |
| Filters | excludes employees and QA accounts |
| Timezone | UTC |
| Source | warehouse.subscriptions_daily |
| Owner | Growth lead |
| Caveat | Launch week excluded because tracking was partial |

## Stop Conditions

Do not present a metric as stable if:

- Numerator or denominator changed between periods.
- Source ownership is unclear.
- Filters were applied ad hoc and not documented.
- Time windows or timezones differ across comparisons.
- A dashboard label hides a formula change.

## Fast Questions to Ask

- "What exactly counts in the numerator?"
- "Who is excluded and why?"
- "What is the comparison baseline?"
- "Has this definition changed over time?"
- "Who would dispute this number internally?"

pitfalls.md (new file, 120 lines)

# Analytical Pitfalls — Detailed Examples

## Simpson's Paradox

**What it is:** A trend that appears in aggregated data reverses when you segment by a key variable.

**Example:**
- Overall: Treatment A has 80% success, Treatment B has 85% -> "B is better"
- But segmented by severity:
  - Mild cases: A=90%, B=85% -> A is better
  - Severe cases: A=70%, B=65% -> A is better
- Paradox: A is better in BOTH groups, but B looks better overall because B got more mild cases

**How to catch:** Always segment by obvious confounders (user type, time period, source, severity) before concluding.
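
The reversal is easy to reproduce with counts. The counts below are hypothetical, chosen to match the segment rates above (B's overall rate lands near 85% because almost all of B's cases are mild):

```python
def rate(successes, total):
    return successes / total

# Illustrative counts: A treats a balanced mix, B treats mostly mild cases.
a = {"mild": (90, 100), "severe": (70, 100)}   # A: 90% mild, 70% severe
b = {"mild": (833, 980), "severe": (13, 20)}   # B: 85% mild, 65% severe

for seg in ("mild", "severe"):
    assert rate(*a[seg]) > rate(*b[seg])       # A wins in every segment...

overall_a = rate(90 + 70, 200)                 # 80.0%
overall_b = rate(833 + 13, 1000)               # 84.6%
assert overall_b > overall_a                   # ...yet B wins overall
```

The aggregate comparison is really comparing case mix, not treatment quality, which is why segmenting by the confounder comes first.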

---

## Survivorship Bias

**What it is:** Drawing conclusions only from "survivors" while ignoring those who dropped out.

**Example:**
- "Users who completed onboarding have 80% retention!"
- Problem: You're only looking at users who already demonstrated commitment by completing onboarding
- The 60% who abandoned onboarding aren't in your "user" dataset

**How to catch:** Ask "Who is NOT in this dataset that should be?" Include churned users, failed attempts, non-converters.

---

## Comparing Unequal Periods

**What it is:** Comparing metrics across time periods of different lengths or characteristics.

**Examples:**
- February (28 days) vs January (31 days) revenue
- Holiday week vs normal week traffic
- Q4 (holiday season) vs Q1 for e-commerce

**How to catch:**
- Normalize to per-day, per-user, or per-session
- Compare same period last year (YoY) not sequential months
- Flag seasonal factors explicitly

---

## p-Hacking (Multiple Comparisons)

**What it is:** Running many statistical tests until finding a "significant" result, then reporting only that one.

**Example:**
- Test 20 different user segments for conversion difference
- At p=0.05, expect 1 "significant" result by chance alone
- Report: "Segment X shows significant improvement!" (cherry-picked)

**How to catch:**
- Apply Bonferroni correction (divide alpha by number of tests)
- Pre-register hypotheses before looking at data
- Report ALL tests run, not just significant ones
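
The Bonferroni correction from the list above is one line of arithmetic; the sketch below shows why a p=0.03 "win" among 20 segment tests does not survive it (function name and p-values are illustrative):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values that survive a Bonferroni correction.

    The per-test threshold is alpha divided by the number of tests run,
    so every test must be counted, not just the interesting ones.
    """
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# 20 segment tests: p=0.03 looks "significant" in isolation, but the
# corrected threshold is 0.05 / 20 = 0.0025.
p_values = [0.03] + [0.5] * 19
print(bonferroni_significant(p_values).count(True))  # -> 0
```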

---

## Spurious Correlation in Time Series

**What it is:** Two variables both trending over time appear correlated, but the relationship is meaningless.

**Example:**
- "Revenue and employee count are 95% correlated!"
- Both grew over time. Controlling for time, there's no relationship.
- Classic: "Ice cream sales correlate with drowning deaths" (both rise in summer)

**How to catch:**
- Detrend both series before correlating
- Check if relationship holds within time periods
- Ask: "Is there a causal mechanism, or just shared time trend?"

---

## Aggregating Percentages

**What it is:** Averaging percentages instead of recalculating from underlying totals.

**Example:**
- Store A: 10/100 = 10% conversion
- Store B: 5/10 = 50% conversion
- Wrong: "Average conversion is 30%"
- Right: 15/110 = 13.6% conversion

**How to catch:** Never average percentages. Sum numerators, sum denominators, recalculate.
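
The fix is mechanical, so it is worth encoding once. This sketch reuses the store numbers from the example (the function name is illustrative):

```python
def pooled_rate(groups):
    """Recompute a rate from (numerator, denominator) pairs
    instead of averaging per-group percentages."""
    num = sum(n for n, _ in groups)
    den = sum(d for _, d in groups)
    return num / den

stores = [(10, 100), (5, 10)]                       # Store A: 10%, Store B: 50%
naive = sum(n / d for n, d in stores) / len(stores)
print(f"naive average: {naive:.1%}")                # 30.0% -- wrong
print(f"pooled rate:   {pooled_rate(stores):.1%}")  # 13.6% -- right
```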

---

## Selection Bias in A/B Tests

**What it is:** Treatment and control groups differ systematically before treatment is applied.

**Examples:**
- Users who opted into new feature vs those who didn't
- Early adopters (Monday signups) vs late week (Friday signups)
- Users who saw the experiment (loaded fast enough) vs those who didn't

**How to catch:**
- Verify pre-experiment metrics are balanced
- Use intention-to-treat analysis
- Check for differential attrition

---

## Confusing Causation

**What it is:** Assuming X causes Y when the relationship might be: Y causes X, Z causes both, or it's coincidental.

**Example:**
- "Power users have higher retention"
- Did power usage cause retention? Or did retained users become power users over time? Or does a third factor (job role) drive both?

**How to catch:**
- Can you run an experiment? (randomize treatment)
- Is there a natural experiment? (policy change, feature rollout)
- At minimum: control for obvious confounders

techniques.md (new file, 169 lines)

# Analysis Techniques — When to Use Each

## Hypothesis Testing

**Use when:** Comparing two groups to determine if a difference is real or random chance.

**Technique selection:**

| Data type | Groups | Test |
|-----------|--------|------|
| Continuous | 2 | t-test (if normal) or Mann-Whitney |
| Continuous | 3+ | ANOVA or Kruskal-Wallis |
| Proportions | 2 | Chi-square or Fisher's exact |
| Paired data | 2 | Paired t-test or Wilcoxon signed-rank |

**Key outputs:**
- p-value (probability of seeing this difference by chance)
- Effect size (how big is the difference - Cohen's d, odds ratio)
- Confidence interval (range of plausible true values)

**Watch out for:**
- Large samples make everything "significant" - focus on effect size
- Multiple comparisons inflate false positives
- Normality assumptions (use non-parametric if violated)
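
For the two-proportion case, a stdlib-only sketch of the pooled z-test (asymptotically equivalent to the 2x2 chi-square in the table; the counts are illustrative):

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided pooled two-proportion z-test.

    x*: successes, n*: trials. Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value via the standard normal survival function:
    # P(|Z| > z) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p = two_proportion_z_test(x1=120, n1=1000, x2=150, n2=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Pair the p-value with the effect size (here a 3 percentage-point difference) before calling the result meaningful.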

---

## Cohort Analysis

**Use when:** Understanding how user behavior changes over time, segmented by when they started.

**Types:**
- **Retention cohorts:** % of users still active N days after signup
- **Revenue cohorts:** Revenue per cohort over time
- **Behavioral cohorts:** Feature adoption by signup cohort

**Setup:**
1. Define cohort (usually signup week/month)
2. Define event (login, purchase, specific action)
3. Define time windows (day 1, 7, 30, 90)
4. Build matrix: cohort × time period
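
The four setup steps can be sketched in stdlib Python. This is a minimal toy (days as integers, cohort key = signup day; swap in week/month bucketing and your own "active" definition for real data):

```python
from collections import defaultdict

def retention_matrix(signups, events, windows=(1, 7, 30)):
    """Build a {cohort: {window: retention}} matrix.

    signups: {user_id: signup_day}; events: [(user_id, event_day)].
    A user counts as retained in window w if active within w days of signup.
    """
    active = defaultdict(set)                  # window -> users seen active
    for user, day in events:
        offset = day - signups[user]
        for w in windows:
            if 0 < offset <= w:
                active[w].add(user)
    cohorts = defaultdict(list)                # signup day -> users
    for user, day in signups.items():
        cohorts[day].append(user)
    return {
        day: {w: sum(u in active[w] for u in users) / len(users)
              for w in windows}
        for day, users in cohorts.items()
    }

signups = {"a": 0, "b": 0, "c": 7}
events = [("a", 1), ("a", 20), ("c", 8)]
print(retention_matrix(signups, events))
```

Note the matrix keeps cohort "b" in the denominator even though it never returned; dropping inactive users here is exactly the survivorship bias described in `pitfalls.md`.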

**Key outputs:**
- Retention curves (line chart by cohort)
- Cohort comparison (are newer cohorts performing better?)
- Time-to-event patterns

**Watch out for:**
- Cohort size differences (small cohorts = noisy data)
- Seasonality (December cohort behaves differently)
- Definition consistency (what counts as "active"?)

---

## Funnel Analysis

**Use when:** Understanding conversion through a multi-step process.

**Setup:**
1. Define stages (visit -> signup -> activate -> purchase)
2. Count users at each stage
3. Calculate drop-off rates between stages
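
Step 3 is just pairwise division over ordered stage counts; a minimal sketch with made-up counts:

```python
def funnel_dropoff(stages):
    """Per-step conversion from an ordered list of (stage, users)."""
    return [
        (f"{prev_name} -> {name}", n / prev_n)
        for (prev_name, prev_n), (name, n) in zip(stages, stages[1:])
    ]

stages = [("visit", 10_000), ("signup", 2_000),
          ("activate", 1_200), ("purchase", 300)]
for step, conv in funnel_dropoff(stages):
    print(f"{step}: {conv:.0%} convert")
```

Here the biggest loss is visit -> signup (20% convert), which is where the investigation should start.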

**Key outputs:**
- Conversion rates per stage
- Biggest drop-off points
- Segment comparison (mobile vs desktop funnels)

**Watch out for:**
- Time window (did they convert eventually, or just not today?)
- Stage ordering (users don't always follow linear paths)
- Defining "same session" vs "ever"

---

## Regression Analysis

**Use when:** Understanding what predicts an outcome, controlling for other factors.

**Types:**
- **Linear:** Continuous outcome (revenue, time spent)
- **Logistic:** Binary outcome (churned/retained, converted/didn't)
- **Poisson:** Count outcome (purchases, logins)

**Key outputs:**
- Coefficients (effect of each variable, holding others constant)
- R² (how much variance is explained)
- p-values per variable
- Residual plots (are assumptions met?)

**Watch out for:**
- Multicollinearity (correlated predictors)
- Omitted variable bias (missing important controls)
- Extrapolation beyond data range
- Causation claims from observational data

---

## Segmentation/Clustering

**Use when:** Discovering natural groups in your data.

**Techniques:**
- **K-means:** Simple, fast, assumes spherical clusters
- **Hierarchical:** Shows cluster relationships, good for exploration
- **RFM:** Business-specific (Recency, Frequency, Monetary)

**Process:**
1. Select features (what defines a segment?)
2. Normalize features (so scale doesn't dominate)
3. Choose number of clusters (elbow method, silhouette score)
4. Profile each cluster (what makes them different?)

**Key outputs:**
- Cluster profiles (avg values per segment)
- Segment sizes
- Distinguishing characteristics

**Watch out for:**
- Garbage in, garbage out (feature selection matters)
- Cluster count is subjective
- Stability (do clusters hold with different random seeds?)

---

## Anomaly Detection

**Use when:** Finding unusual data points that warrant investigation.

**Approaches:**
- **Statistical:** Points beyond 2-3 standard deviations
- **IQR method:** Below Q1-1.5×IQR or above Q3+1.5×IQR
- **Isolation Forest:** For multivariate anomalies
- **Domain rules:** Negative revenue, future dates, impossible values
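
The IQR method needs nothing beyond the stdlib (`statistics.quantiles` with `n=4` yields the quartiles); the data here is a made-up daily series with one spike:

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

daily_orders = [52, 48, 55, 50, 47, 53, 49, 51, 120, 50]
print(iqr_outliers(daily_orders))  # -> [120]
```

As the watch-outs below note, a flagged point like the 120 is a prompt to investigate (launch? double-counting? holiday?), not an automatic delete.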

**Key outputs:**
- Flagged records with anomaly scores
- Context (why is this unusual?)
- Severity (how far from normal?)

**Watch out for:**
- Seasonality (Black Friday isn't an anomaly)
- Trends (growth makes old "normal" look like anomalies)
- False positives (investigate before acting)

---

## Time Series Analysis

**Use when:** Understanding patterns in data over time.

**Components:**
- **Trend:** Long-term direction
- **Seasonality:** Repeating patterns (daily, weekly, yearly)
- **Noise:** Random variation

**Techniques:**
- **Moving averages:** Smooth out noise
- **Decomposition:** Separate trend, seasonal, residual
- **Year-over-year:** Compare same period last year
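
The first technique is a trailing window mean; a minimal sketch on a made-up daily series (the first window-1 points have no value by construction):

```python
def moving_average(series, window=7):
    """Trailing moving average; None where the window is incomplete."""
    return [
        sum(series[i - window + 1 : i + 1]) / window if i >= window - 1 else None
        for i in range(len(series))
    ]

daily = [10, 12, 11, 40, 12, 13, 11, 12]   # one spiky day
print(moving_average(daily, window=3))
```

A short window follows spikes; a long one lags turning points. Choose the window from the seasonality you want to smooth over (7 for weekly patterns), not from what looks prettiest.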

**Key outputs:**
- Trend direction and strength
- Seasonal patterns identified
- Forecast with uncertainty bands

**Watch out for:**
- Comparing different lengths (months vary in days)
- Holidays/events (one-time vs recurring)
- Structural breaks (COVID, product changes)