
Survey Phase 7: CS IPW/DR covariates, repeated cross-sections, HonestDiD survey variance#240

Open
igerber wants to merge 6 commits into main from survey-phase-seven

Conversation

@igerber
Owner

@igerber igerber commented Mar 28, 2026

Summary

  • Phase 7a: Remove NotImplementedError gate for CallawaySantAnna IPW/DR + covariates + survey. Implement DRDID panel nuisance IF corrections (propensity score + outcome regression) for both survey-weighted and non-survey DR paths (Sant'Anna & Zhao 2020, Theorem 3.1). Extract _safe_inv() helper for matrix inversions.
  • Phase 7d: Thread survey degrees of freedom through HonestDiD for t-distribution critical values. Compute full event-study variance-covariance matrix from influence function vectors in CallawaySantAnna aggregation. Add event_study_vcov field to CallawaySantAnnaResults and survey_metadata/df_survey to HonestDiDResults.
  • Phase 7b: Add panel=False for repeated cross-section support in CallawaySantAnna. New _precompute_structures_rc(), _compute_att_gt_rc(), and three RC estimation methods (_outcome_regression_rc, _ipw_estimation_rc, _doubly_robust_rc) with covariates and survey weights. Canonical index abstraction in aggregation/bootstrap mixins. RCS data generator via generate_staggered_data(panel=False).

Methodology references

  • Method name(s): Callaway-Sant'Anna (2021), Sant'Anna & Zhao (2020) DRDID panel/cross-section, Rambachan & Roth (2023) HonestDiD
  • Paper / source link(s):
    • Sant'Anna, P.H.C. & Zhao, J. (2020). "Doubly Robust Difference-in-Differences Estimators." J. Econometrics 219(1). Theorem 3.1 (panel IF corrections), Section 4 (cross-sectional DRDID).
    • Callaway, B. & Sant'Anna, P.H.C. (2021). "Difference-in-Differences with Multiple Time Periods." J. Econometrics 225(2). Section 4.1 (repeated cross-sections).
    • Rambachan, A. & Roth, J. (2023). "A More Credible Approach to Parallel Trends." Rev. Econ. Studies 90(5).
  • Intentional deviations: DR nuisance IF corrections use the same survey-weighted Hessian/score pattern as the existing IPW path. Non-survey DR path also receives IF corrections (was plug-in only). Per-cell SEs remain IF-based (not full TSL) — documented in REGISTRY.md. Event-study VCV under replicate weights falls back to diagonal (multivariate replicate VCV deferred).

Validation

  • Tests added/updated:
    • tests/test_survey_phase7a.py (22 tests): smoke, scale invariance, uniform-weight equivalence, IF correction, aggregation, bootstrap, edge cases
    • tests/test_staggered_rc.py (23 tests): all methods, covariates, survey, aggregation, bootstrap, control groups, base periods, data generator, edge cases
    • tests/test_honest_did.py (+4 tests): survey df extraction, VCV computation, bounds widening, no-survey baseline
    • tests/test_survey_phase4.py: 2 negative tests converted to positive assertions
  • Full test suite: 365 tests pass across all affected files (0 failures)

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…DiD survey variance

Phase 7a: Remove NotImplementedError gate for IPW/DR + covariates + survey.
Add DRDID panel nuisance IF corrections (PS + OR) for both survey and
non-survey DR paths. Extract _safe_inv helper for matrix inversions.

Phase 7d: Thread survey df through HonestDiD for t-distribution critical
values. Compute full event-study VCV from influence function vectors.
Add event_study_vcov to CallawaySantAnnaResults.
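The survey-df threading amounts to swapping normal critical values for Student-t ones when a design df is available; a sketch of the idea (helper name and signature are hypothetical, not the library's API):

```python
from scipy import stats

def critical_value(alpha=0.05, df_survey=None):
    """Two-sided critical value: standard normal when no survey design
    df is available, Student-t with the survey df otherwise
    (hypothetical helper mirroring the described behavior)."""
    if df_survey is None:
        return stats.norm.ppf(1 - alpha / 2)
    return stats.t.ppf(1 - alpha / 2, df=df_survey)
```

With few design degrees of freedom the t critical value exceeds 1.96, widening intervals as intended.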

Phase 7b: Add panel=False for repeated cross-section support in
CallawaySantAnna. New _precompute_structures_rc, _compute_att_gt_rc,
and three RC estimation methods (reg, ipw, dr) with covariates and
survey weights. Canonical index abstraction in aggregation/bootstrap.
RCS data generator in generate_staggered_data(panel=False).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

PR Review

Overall assessment

⚠️ Needs changes. The highest unmitigated severity is P1: the new repeated-cross-section CallawaySantAnna paths do not fully match the documented/source-method inference contract, and the new HonestDiD survey covariance plumbing diverges from the registry for replicate-weight designs.

Executive summary

  • Affected methods: Callaway-Sant'Anna (panel=False, IPW/DR, aggregation) and HonestDiD (survey-aware event-study covariance).
  • The Phase 7a panel DR survey changes look broadly aligned with the updated registry; I did not find a blocker in the new panel nuisance-correction code itself.
  • The new repeated-cross-section IPW/DR analytic SEs are still plug-in only and do not implement the cross-sectional nuisance-estimation IF corrections the registry says are supported.
  • For panel=False, unweighted simple/event-study aggregation uses time-specific treated-cell counts while the WIF code uses full-sample cohort shares, so the weighting contract is internally inconsistent.
  • The new HonestDiD covariance path for replicate-weight surveys does not implement the documented diagonal fallback; it builds a full Psi'Psi covariance and passes that to HonestDiD.
  • The new public panel parameter is not propagated into CallawaySantAnnaResults, and result summaries still label repeated-cross-section observation counts as “units”.

Methodology

Cross-check basis: docs/methodology/REGISTRY.md:L291-L319, docs/methodology/REGISTRY.md:L419-L424, and docs/methodology/REGISTRY.md:L1633-L1637.

  1. Severity: P1. diff_diff/staggered.py:L2872-L2978, diff_diff/staggered.py:L2980-L3127, docs/methodology/REGISTRY.md:L423-L424.
    Impact: the new repeated-cross-section IPW and DR analytic inference does not include nuisance-estimation IF corrections. _ipw_estimation_rc() computes SE from the plug-in IF only, and _doubly_robust_rc() likewise stops at the plug-in IF. That is a mismatch with the registry’s claim that panel=False uses Section 4 cross-sectional DRDID with per-observation influence functions, and it is notably weaker than the panel IPW/DR code in the same file, which now adds explicit PS/OR correction terms. The result is understated or otherwise incorrect SEs/CIs/p-values for covariate-adjusted RCS IPW/DR.
    Concrete fix: implement the Section 4 cross-sectional nuisance-estimation IF corrections for panel=False IPW/DR, or explicitly document the deviation in REGISTRY.md and disable analytic inference for those branches until the correct IF is in place.

  2. Severity: P1. diff_diff/staggered_aggregation.py:L37-L152, diff_diff/staggered_aggregation.py:L289-L314, diff_diff/staggered_aggregation.py:L574-L645, diff_diff/staggered_bootstrap.py:L223-L267, diff_diff/staggered_bootstrap.py:L560-L657.
    Impact: the new unweighted panel=False aggregation uses data["n_treated"] from each (g,t) cell as the aggregation weight, but the WIF path for the same estimator computes pg from full-sample cohort counts. In panel data those coincide because cohort size is constant across t; in repeated cross-sections they generally do not. That means the point estimate, WIF correction, and bootstrap aggregation are no longer using the same weight definition. This changes the estimand/finite-sample weighting and makes the SE formula internally inconsistent with the aggregated estimator.
    Concrete fix: precompute fixed cohort masses for panel=False once from the full repeated-cross-section sample, then use those same cohort masses everywhere simple/event-study/bootstrap weights are formed.

  3. Severity: P1. diff_diff/staggered_aggregation.py:L710-L739, diff_diff/honest_did.py:L664-L669, docs/methodology/REGISTRY.md:L1637-L1637.
    Impact: the registry explicitly says replicate-weight event-study covariance should fall back to a diagonal matrix until multivariate replicate VCV is implemented, but _aggregate_event_study() currently builds a full Psi.T @ Psi matrix for all non-TSL cases, which includes replicate-weight designs. HonestDiD then consumes that full matrix whenever event_study_vcov is present. That is an undocumented methodology mismatch and can change HonestDiD bounds under replicate designs without warning.
    Concrete fix: when uses_replicate_variance is true, do not populate a full off-diagonal event_study_vcov; set it to None or an explicit diagonal-from-SEs fallback until a proper multivariate replicate covariance estimator is implemented and validated.
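The fix in finding 3 reduces to a branch on the design type when the covariance is assembled; a sketch under the stated assumptions (function and argument names are hypothetical; the Psi'Psi/n² scaling is one common IF-based convention, not confirmed by the diff):

```python
import numpy as np

def event_study_vcov(psi: np.ndarray, ses: np.ndarray,
                     uses_replicate_variance: bool) -> np.ndarray:
    """Event-study covariance from stacked per-observation influence
    function columns psi (n x k). For replicate-weight designs, fall
    back to a diagonal matrix from the per-coefficient SEs, as the
    registry prescribes (hypothetical helper)."""
    if uses_replicate_variance:
        return np.diag(ses ** 2)
    n = psi.shape[0]
    return psi.T @ psi / n ** 2
```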

Code Quality

No additional findings beyond the methodology issues above.

Performance

No findings.

Maintainability

  1. Severity: P1. diff_diff/staggered.py:L258-L336, diff_diff/staggered.py:L1374-L1386, diff_diff/staggered.py:L1774-L1785, diff_diff/staggered_results.py:L63-L123, diff_diff/staggered_results.py:L157-L163.
    Impact: the new public panel parameter is not stored on CallawaySantAnnaResults, even though it changes how counts and aggregation should be interpreted. For panel=False, the fit path stores observation counts in n_treated_units / n_control_units, but the results API and summary still present them as “units”. That makes downstream use ambiguous and violates the expected propagation pattern for new public parameters.
    Concrete fix: add panel: bool (or equivalent index-space metadata) to CallawaySantAnnaResults, populate it in fit(), and switch summary labels/output semantics to observation-based wording when panel=False.

Tech Debt

No separate findings. The new TODO.md changes only resolve prior items; they do not track the P1 issues above, so those remain unmitigated.

Security

No findings.

Documentation/Tests

  1. Severity: P2. diff_diff/staggered.py:L119-L184, docs/api/staggered.rst:L15-L32, tests/test_staggered_rc.py:L48-L351, tests/test_honest_did.py:L1125-L1244.
    Impact: the public CallawaySantAnna class docstring still does not describe panel, so the generated API docs do not expose the new parameter contract. The new RCS tests mostly check “finite result” behavior, not methodology alignment, aggregation weights, or SE correctness. The HonestDiD additions test Taylor-style survey inputs, but there is no test for the new replicate-weight covariance branch, which is exactly where the implementation currently diverges from the registry.
    Concrete fix: document panel in the estimator docstring/API docs, add a reference-validation test against did::att_gt(panel=FALSE) for at least one aggregation case and one IPW/DR covariate case, and add a replicate-weight HonestDiD test that asserts diagonal fallback until multivariate replicate VCV is implemented.

Path to Approval

  1. Implement cross-sectional nuisance-estimation IF corrections for panel=False IPW/DR analytic inference, or explicitly document and temporarily disable those analytic SEs.
  2. Replace per-cell n_treated weighting with fixed cohort-mass weighting throughout panel=False simple/event-study/bootstrap aggregation.
  3. Make replicate-weight HonestDiD use the documented diagonal event-study covariance fallback instead of a full Psi.T @ Psi matrix.
  4. Propagate panel into CallawaySantAnnaResults and correct result-summary labels/semantics for repeated cross-sections.
  5. Add reference tests for panel=False methodology and a replicate-weight HonestDiD regression test.
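Item 2's "precompute fixed cohort masses once" might be sketched as follows (the helper and column names are hypothetical; the point is that one weight definition is shared by every downstream consumer):

```python
import pandas as pd

def cohort_masses(data: pd.DataFrame, cohort_col: str = "cohort") -> dict:
    """Fixed per-cohort masses computed once from the full repeated
    cross-section sample, so simple/event-study/bootstrap aggregation
    all share one weight definition (hypothetical helper)."""
    counts = data[cohort_col].value_counts()
    return (counts / counts.sum()).to_dict()
```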

Static review only; I was not able to do runtime spot-checks in this environment because the Python runtime here is missing required dependencies such as numpy.

… VCV, panel on results

Fix 5 findings from PR #240 CI review:
- Add cross-sectional nuisance IF corrections (PS + OR) to _ipw_estimation_rc
  and _doubly_robust_rc, matching panel path methodology
- Use fixed full-sample cohort masses for unweighted RCS aggregation weights
  (consistency with WIF group-share denominator)
- Guard replicate-weight designs from full event-study VCV (diagonal fallback)
- Add panel field to CallawaySantAnnaResults, fix summary labels for RCS
- Add panel to class docstring, replicate VCV test, RCS IF correction test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 4bf566d1f94ebca6500b5ffe806c87302aa9bb9a


PR Review

Overall assessment

⚠️ Needs changes. The highest unmitigated severity is P1. The prior HonestDiD replicate-weight fallback and panel propagation issues look resolved, but two repeated-cross-section inference problems remain.

Executive Summary

  • The remaining blockers are both in the new Callaway-Sant'Anna panel=False path.
  • Repeated-cross-section DR analytical inference is still internally inconsistent: the nuisance IF corrections reuse the post-period treated denominator for both the post and base-period components.
  • The earlier RCS weighting-contract fix is only partial: balance_e event-study aggregation and the unweighted bootstrap overall/event-study helpers still revert to per-cell n_treated counts instead of fixed cohort masses.
  • HonestDiD’s survey-variance changes now appear aligned with the updated registry, including the replicate-weight diagonal fallback.
  • The new RC tests are mostly smoke/finite-result checks and would not catch either remaining RC inference defect.

Methodology

Affected methods: Callaway-Sant'Anna repeated cross sections (panel=False, Section 4-style DRDID and aggregation/bootstrap). The HonestDiD survey-variance path looks consistent with the updated registry note.

  • Severity: P1. Impact: In panel=False DR, the point estimator correctly uses separate treated normalizers for the post and base-period pieces (sw_gt_sum vs sw_gs_sum, or n_gt vs n_gs), but the nuisance IF corrections collapse both periods onto a single normalizer = sum(sw_gt) or n_gt. That mis-scales both the PS correction and the base-period OR correction whenever the cohort-g sample size or treated weight sum differs across periods, which is the ordinary repeated-cross-section case. The resulting analytical SE/CIs/p-values are inconsistent with the estimator and with the Section 4 repeated-cross-section decomposition promised in the registry. Concrete fix: use separate normalizer_t and normalizer_s throughout M2_dr, M1_t, and M1_s, matching the denominators used in att_t_aug and att_s_aug; add a regression test with n_gt != n_gs and unequal treated weight sums. References: diff_diff/staggered.py:L3118-L3159 diff_diff/staggered.py:L3184-L3228 docs/methodology/REGISTRY.md:L423-L424

  • Severity: P1. Impact: The earlier panel=False weighting-contract bug is only partially fixed. Analytical simple/event-study aggregation now uses fixed cohort masses, but unweighted event-study with balance_e and the unweighted bootstrap helpers still fall back to cell-specific n_treated. In repeated cross-sections those cell counts vary by period, so the bootstrap SEs/CIs and balance_e event-study weights no longer correspond to the estimator/WIF denominator used elsewhere in the same results object. Concrete fix: compute one unweighted cohort-mass map from precomputed["unit_cohorts"] and use it everywhere panel=False aggregation weights are formed, including the balance_e branch in _aggregate_event_study(), overall bootstrap weights, and _prepare_event_study_aggregation(). References: diff_diff/staggered_aggregation.py:L76-L100 diff_diff/staggered_aggregation.py:L590-L610 diff_diff/staggered_aggregation.py:L621-L648 diff_diff/staggered_bootstrap.py:L229-L243 diff_diff/staggered_bootstrap.py:L561-L575 diff_diff/staggered_bootstrap.py:L614-L631
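The separate post/base-period normalizers called for in the first finding can be sketched in isolation (names mirror the review's `normalizer_t`/`normalizer_s`; masks and weight layout are assumptions):

```python
import numpy as np

def period_normalizers(w: np.ndarray, treated: np.ndarray,
                       post: np.ndarray) -> tuple:
    """Separate treated-weight normalizers for the post (t) and base (s)
    periods, matching the denominators the RC DR point estimator uses
    (hypothetical sketch; w = survey weights, treated/post = boolean
    masks over the stacked repeated-cross-section sample)."""
    normalizer_t = float(np.sum(w[treated & post]))
    normalizer_s = float(np.sum(w[treated & ~post]))
    return normalizer_t, normalizer_s
```

Using `normalizer_t` for both periods, as the finding describes, mis-scales the base-period corrections whenever the two treated weight sums differ.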

Code Quality

No findings beyond the methodology issues above.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No separate findings. Neither remaining P1 issue is tracked in TODO.md:L48-L60, so TODO.md does not mitigate them.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: The new RC tests are largely finiteness/smoke checks. They do not exercise the two cases that break the current implementation: unequal treated cohort sizes across the two RC periods, and panel=False aggregation/bootstrap weight formation under balance_e or n_bootstrap>0. As written, the suite would pass with both P1 bugs still present. Concrete fix: add deterministic panel=False fixtures where cohort-g counts differ between t and s, assert the corrected DR SE against a hand-built IF/reference calculation, and add bootstrap/event-study tests that verify fixed cohort-mass weights under balance_e and n_bootstrap>0. References: tests/test_staggered_rc.py:L135-L152 tests/test_staggered_rc.py:L258-L300 tests/test_staggered_rc.py:L362-L381
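A deterministic hand-built reference of the kind this finding asks for can be very small; here is a 2×2 repeated-cross-section fixture with deliberately unequal treated counts across periods (unadjusted DiD only, as an illustration of the fixture style, not the library's estimator):

```python
import numpy as np

# Hand-built RC fixture: treated counts differ across the two periods.
y_treat_post = np.array([5.0, 6.0, 7.0])   # n_gt = 3
y_treat_pre = np.array([2.0, 3.0])         # n_gs = 2 (unequal on purpose)
y_ctrl_post = np.array([1.0, 2.0, 3.0, 2.0])
y_ctrl_pre = np.array([1.0, 1.0, 2.0, 2.0])

# Unadjusted 2x2 RC DiD: change in treated means minus change in control means.
att = (y_treat_post.mean() - y_treat_pre.mean()) - (
    y_ctrl_post.mean() - y_ctrl_pre.mean()
)
# Here att = (6.0 - 2.5) - (2.0 - 1.5) = 3.0
```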

Path to Approval

  1. Fix the repeated-cross-section DR nuisance IF normalization so the post-period and base-period PS/OR corrections use their own treated denominators (sw_gt_sum/sw_gs_sum or n_gt/n_gs).
  2. Finish propagating fixed cohort-mass weighting through all panel=False aggregation paths, especially balance_e event-study aggregation and the unweighted overall/event-study bootstrap helpers.
  3. Add regression tests covering unequal RC cohort counts across periods and panel=False bootstrap/event-study weight consistency.

Static review only: I could not execute the test suite here because this environment lacks pytest and scientific Python dependencies such as numpy.

…equal-count tests

- Use separate normalizer_t/normalizer_s in _doubly_robust_rc() IF corrections
  (post vs base period treated denominators must match the DR estimator)
- Precompute rcs_cohort_masses in _precompute_structures_rc() and return cohort
  mass as n_treated from _compute_att_gt_rc() — fixes all downstream consumers
  (aggregation, bootstrap, balance_e) at the source instead of per-consumer patches
- Remove now-unnecessary rcs_cohort_counts blocks from aggregation
- Add unequal cohort count test fixture and regression tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 6080f927c1c76a760b7825b59d5f46032da64b1c


PR Review

Overall assessment
⚠️ Needs changes. The highest unmitigated severity is P1. The prior RC normalizer and fixed-cohort-weighting blockers appear resolved, but the new panel=False estimator formulas still do not line up with the repeated-cross-section methods this PR claims to implement.

Executive summary

  • The earlier re-review blockers look fixed: RC DR no longer reuses one denominator for both periods, and fixed cohort-mass weighting now propagates through analytical aggregation and bootstrap.
  • The remaining blocker is methodological: the new repeated-cross-section reg and dr paths do not match the DRDID / did::att_gt(panel=FALSE) estimators they are supposed to mirror.
  • _outcome_regression_rc() uses separate pre/post treated residual averages, but the reference reg_did_rc estimator pools treated weights when averaging the predicted change.
  • _doubly_robust_rc() is further off: it uses only control-group ORs and normalizes the control augmentation terms by treated-period masses, which does not match either drdid_rc or the simpler AIPW repeated-cross-section formula.
  • The added RC tests are almost entirely smoke/finite-result checks, so this kind of formula mismatch passes undetected.
  • The HonestDiD survey-df / event-study-vcov changes look consistent with the new registry note.

Methodology

  • Severity: P1. Impact: The new repeated-cross-section reg path in _outcome_regression_rc at diff_diff/staggered.py:L2795 computes ATT = mean_t(Y - m_t(X)) - mean_s(Y - m_s(X)) using separate treated averages for the post and base periods (diff_diff/staggered.py:L2843, diff_diff/staggered.py:L2859, diff_diff/staggered.py:L2869). did::att_gt(panel=FALSE, est_method="reg") dispatches to DRDID::reg_did_rc, and that estimator averages the predicted change over the treated group with pooled treated weights rather than separate pre/post treated residual means. That is a different finite-sample estimator whenever treated-sample composition differs across the two cross-sections. The registry note at docs/methodology/REGISTRY.md:L423 documents panel=False support, but not this estimator change. Concrete fix: Rework _outcome_regression_rc() and its IF to match reg_did_rc exactly, or explicitly document and rename a different RC regression estimator if that deviation is intentional.
  • Severity: P1. Impact: The new repeated-cross-section dr path in _doubly_robust_rc at diff_diff/staggered.py:L3031 does not match the cited DRDID repeated-cross-section estimators. The point estimator uses only control-group ORs (diff_diff/staggered.py:L3059) and divides the control augmentation terms by treated-period masses (diff_diff/staggered.py:L3131, diff_diff/staggered.py:L3153), with the same normalization baked into the IF corrections (diff_diff/staggered.py:L3190, diff_diff/staggered.py:L3211). But did::att_gt(panel=FALSE, est_method="dr") dispatches to DRDID::drdid_rc; that locally efficient estimator includes treated- and control-group outcome-regression pieces in both periods, and even the simpler AIPW repeated-cross-section formula normalizes each treated/control pre/post component by its own weight sum rather than the treated totals. So the current code changes both point estimates and IF-based SEs whenever reweighted control mass differs from treated mass. Concrete fix: Pick one specific Section 4 DR estimator (drdid_rc, aipw_did_rc1, or another named variant), implement its point estimator and IF end-to-end, and update the registry to name that exact estimator.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity: P2. Impact: In the RC path, _compute_att_gt_rc() stores the full cohort mass in n_treated so downstream aggregation can reuse existing weight plumbing (diff_diff/staggered.py:L2710), but the public results contract still documents n_treated as the number of treated observations for that group-time cell (diff_diff/staggered_results.py:L21). That silently turns a reporting field into an aggregation-weight field. Concrete fix: Keep n_treated as the actual cell count and add a separate cohort_mass / agg_weight field for RC aggregation.

Tech Debt

  • No separate findings. The P1 methodology issue above is not tracked in TODO.md:L48, so it remains unmitigated.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: The new RC test file is almost entirely smoke coverage: it checks that fits are finite, methods run, and bootstrap returns something, but never compares panel=False reg/dr against DRDID formulas, did outputs, or even hand-built 2×2 RC reference values (tests/test_staggered_rc.py:L48, tests/test_staggered_rc.py:L160, tests/test_staggered_rc.py:L413). That means the formula mismatch above passes cleanly. Concrete fix: Add deterministic repeated-cross-section fixtures where treated covariate composition differs between pre and post, then assert reg/dr ATT and SE outputs against checked-in DRDID / R reference values or hand-verified formulas.

Path to Approval

  1. Re-implement _outcome_regression_rc() to match the repeated-cross-section reg_did_rc formula and influence function used by did::att_gt(panel=FALSE, est_method="reg").
  2. Re-implement _doubly_robust_rc() to match one named DRDID repeated-cross-section estimator exactly, including the correct normalization and nuisance-IF terms.
  3. Add RC reference-value tests that fail under the current formulas, not just finiteness/smoke checks.

…ulas

_outcome_regression_rc: Pool all treated obs for OR correction term
(was: separate per-period averages). Period-specific treated means for Y.
Matches Sant'Anna & Zhao (2020) Eq 2.2 / R reg_did_rc exactly.

_doubly_robust_rc: Fit 4 OLS models (control+treated, pre+post) for
locally efficient DR estimator (was: 2 control-only). Implements tau_1
(AIPW) + tau_2 (local efficiency adjustment) with full 11-component IF.
Matches Sant'Anna & Zhao (2020) Eq 3.3+3.4 / R drdid_rc exactly.

Add agg_weight field to group_time_effects for RCS aggregation weight
(cohort mass), separate from n_treated (per-cell display count).
Aggregation uses data.get("agg_weight", data["n_treated"]) for
backward compatibility with panel data.
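The backward-compatible weight lookup described above reduces to a one-line dictionary fallback; a sketch of the pattern (the cell-dict layout is taken from the commit message, the helper name is hypothetical):

```python
def aggregation_weight(cell: dict) -> float:
    """Prefer the RCS cohort mass (agg_weight) when present, otherwise
    fall back to the panel per-cell treated count, mirroring the
    described data.get("agg_weight", data["n_treated"]) pattern."""
    return float(cell.get("agg_weight", cell["n_treated"]))
```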

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b623deeed20d79ab05bdf255819745c93140371a


Overall Assessment

⚠️ Needs changes. Highest unmitigated severity: P1.

Executive Summary

  • The prior repeated-cross-section point-estimator blockers appear addressed: the new reg path now pools the treated OR correction, and the new dr path includes treated-side OR terms plus nuisance IF corrections.
  • [Newly identified] The non-survey panel=False bootstrap path still reaggregates with realized cell counts n_{g,t} instead of the fixed cohort mass agg_weight/N_g, so bootstrap overall and event-study inference target a different estimator than the analytical path.
  • That mismatch is not covered by the new methodology note for repeated cross-sections in docs/methodology/REGISTRY.md:L423, and it is not tracked in TODO.md:L48-L60 or TODO.md:L167-L170.
  • The HonestDiD survey-df / event-study-vcov changes are consistent with the new registry note; no separate methodology defect stood out there.
  • The new RC bootstrap tests are smoke-only, so this weighting regression currently slips through.

Methodology

  • Severity: P1 [Newly identified]. Impact: the non-survey panel=False bootstrap path still reaggregates with realized cell counts n_{g,t} instead of the fixed cohort mass agg_weight/N_g, so bootstrap overall and event-study inference target a different estimator than the analytical aggregation path. Concrete fix: see Path to Approval items 1-3.

Code Quality

No findings.

Performance

No findings.

Maintainability

  • Severity: P3 Impact: event_study_vcov is cached on the estimator instance and later copied into results via getattr(self, "_event_study_vcov", None). If the same CallawaySantAnna object is reused, a fit that does not compute event-study aggregation can inherit stale covariance metadata from a previous fit. References: diff_diff/staggered_aggregation.py:L714-L755, diff_diff/staggered.py:L1772-L1796. Concrete fix: reset self._event_study_vcov = None at the start of fit() and only attach it when the current fit actually computed event_study_effects.
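The suggested guard is the usual reset-at-entry pattern for cached estimator state; a minimal class skeleton (everything here is illustrative, not the library's actual class):

```python
class EstimatorSketch:
    """Skeleton showing only the stale-cache guard: clear the cached
    event-study VCV at the start of every fit so a reused estimator
    cannot leak a previous fit's covariance (hypothetical sketch)."""

    def __init__(self):
        self._event_study_vcov = None

    def fit(self, aggregate_event_study: bool, vcov=None):
        self._event_study_vcov = None  # reset before any computation
        if aggregate_event_study:
            self._event_study_vcov = vcov  # attach only when computed
        return self._event_study_vcov
```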

Tech Debt

No separate findings. The P1 bootstrap-weight issue above is not currently tracked in TODO.md:L48-L60 or TODO.md:L167-L170, so it remains unmitigated.

Security

No findings.

Documentation/Tests

  • Severity: P2 Impact: The new RC bootstrap coverage only checks that bootstrap runs, including on the unequal-cohort fixture, but never asserts that bootstrap aggregation uses the fixed cohort masses introduced for panel=False. References: tests/test_staggered_rc.py:L289-L300, tests/test_staggered_rc.py:L413-L482. Concrete fix: add a deterministic unequal-cohort panel=False test that verifies bootstrap overall/event-study prep uses the same fixed cohort weights as the analytical aggregation path.

Path to Approval

  1. Update CallawaySantAnnaBootstrapMixin._run_multiplier_bootstrap() so non-survey panel=False overall ATT bootstrap weights use fixed cohort mass (agg_weight / precomputed cohort mass), not per-cell n_treated.
  2. Update _prepare_event_study_aggregation() so non-survey panel=False event-study bootstrap weights use the same fixed cohort masses.
  3. Add a regression test on the unequal-cohort RC fixture that fails if bootstrap reaggregation falls back to per-cell n_treated instead of fixed cohort mass.

- Bootstrap overall/event-study reaggregation now uses agg_weight (fixed
  cohort mass) for panel=False, matching the analytical aggregation path
- Reset self._event_study_vcov = None at start of fit() to prevent stale
  VCV from prior fit leaking into reused estimator objects

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 3b405b7f9009afb4d8d90397d931b327a5c65825


Overall Assessment

⚠️ Needs changes. Highest unmitigated severity: P1.

Static review only: I could not execute the changed tests in this environment because the default Python interpreter here does not have the project dependencies available.

Executive Summary

  • [Newly identified] Bootstrap-fit CallawaySantAnna event studies keep the analytical IF-based event_study_vcov while SEs/CIs/p-values are overwritten with bootstrap results, and HonestDiD prefers that matrix when present, silently mixing analytical and bootstrap inference.

Methodology

  • Severity: P1 [Newly identified]. Impact: CallawaySantAnna.fit() computes and stores event_study_vcov from analytical IF vectors during event-study aggregation (diff_diff/staggered_aggregation.py:L714-L755), then, when n_bootstrap>0, overwrites event_study_effects[*]["se"], CIs, and p-values with bootstrap results while leaving that covariance matrix unchanged on the results object (diff_diff/staggered.py:L1709-L1733, diff_diff/staggered.py:L1773-L1799). HonestDiD now always prefers event_study_vcov when present (diff_diff/honest_did.py:L664-L670), so bootstrap-fit CS results silently feed analytical covariance into sensitivity analysis. That contradicts the Phase 7d intent that HonestDiD respect the same variance structure as the underlying event study (docs/methodology/REGISTRY.md:L1637). Concrete fix: when bootstrap inference is used for event-study results, either compute and store a bootstrap event-study covariance matrix from the bootstrap draws, or clear/ignore event_study_vcov so HonestDiD falls back to the bootstrap variance path instead of mixing analytical and bootstrap inference.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

No separate findings.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new HonestDiD tests validate analytical event_study_vcov creation and the replicate-weight diagonal fallback, but they do not cover CallawaySantAnna(..., n_bootstrap>0, aggregate="event_study"). That leaves the new covariance-source mismatch untested (tests/test_honest_did.py:L1158-L1281). Concrete fix: add a regression test that fits a bootstrap event study, runs HonestDiD.fit(), and asserts that the covariance source matches the bootstrap path (or that event_study_vcov is absent/recomputed from bootstrap draws on bootstrapped fits).

Path to Approval

  1. Make bootstrap-fit CallawaySantAnna event-study results expose a variance path that matches the reported bootstrap SEs: either store a bootstrap event-study covariance matrix, or suppress/ignore analytical event_study_vcov on bootstrapped fits.
  2. Add a targeted regression test for CallawaySantAnna(n_bootstrap>0, aggregate="event_study") -> HonestDiD.fit(...) that fails if HonestDiD consumes the analytical IF covariance instead of the bootstrap path.
  3. If the analytical-on-bootstrap behavior is intentional, document that explicitly in docs/methodology/REGISTRY.md; otherwise the current behavior remains an undocumented methodology deviation.

…dy SEs

Prevents HonestDiD from mixing analytical IF-based VCV with bootstrap SEs
on bootstrap-fit CallawaySantAnna results. When n_bootstrap>0, the
event_study_vcov is set to None so HonestDiD falls back to diagonal from
the bootstrap SEs (consistent variance path).

Add regression test: bootstrap CS → HonestDiD asserts vcov is None.
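The variance-source selection this commit describes is a simple guard; a sketch (helper name hypothetical, behavior taken from the commit message):

```python
def select_event_study_vcov(analytic_vcov, n_bootstrap: int):
    """Discard the analytical IF-based VCV on bootstrap fits so the
    consumer (HonestDiD) falls back to a diagonal matrix built from the
    reported bootstrap SEs (hypothetical helper mirroring the fix)."""
    return None if n_bootstrap > 0 else analytic_vcov
```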

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 53cfd5d5b06dd44f71f5272fe16a5e537543a327


Overall Assessment

⚠️ Needs changes. Highest unmitigated severity: P1.

Static review only: I could not execute the added tests here because pytest is not installed in this environment.

Executive Summary

Methodology

  • Severity: P3. Impact: The previous HonestDiD/bootstrap covariance-source finding looks resolved. Bootstrapped CS fits now discard analytical event_study_vcov before results are stored, so HonestDiD falls back to the bootstrap-compatible diagonal path, and that exact regression is now tested. Concrete fix: None. diff_diff/staggered.py:1323 diff_diff/staggered.py:1777 tests/test_honest_did.py:1283
  • Severity: P1. Impact: _outcome_regression_rc says it matches DRDID::reg_did_rc, but its control-side OLS estimation-effect term is divided by the treated-mass denominator twice. M1 is already normalized by sum_w_D, then inf_ct / inf_cs divide inf_cont_2_* by sum_w_D again. That shrinks the nuisance-estimation piece of the influence function, so covariate-adjusted repeated-cross-section reg fits understate per-cell analytical SEs and any bootstrap path built from those IFs. Concrete fix: Keep M1 normalized as written and remove the extra / sum_w_D on inf_cont_2_ct and inf_cont_2_cs, or re-port the reg_did_rc IF algebra directly from the reference implementation. diff_diff/staggered.py:2824 diff_diff/staggered.py:2920 diff_diff/staggered.py:2938 diff_diff/staggered_bootstrap.py:357 docs/methodology/REGISTRY.md:423.
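A stylized numeric sketch of why the extra treated-mass division understates the SE. This is not the project's actual IF algebra: the names `M1` and `sum_w_D` echo the review text, and the data is simulated.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
w = rng.uniform(0.5, 2.0, size=n)      # observation weights
D = rng.binomial(1, 0.3, size=n)       # treated indicator
resid = rng.normal(size=n)             # stand-in residual moment

sum_w_D = np.sum(w * D)                # treated mass

# M1 already carries the 1/sum_w_D normalization.
M1 = np.sum(w * D * resid) / sum_w_D

# Correct estimation-effect contribution vs. the flagged variant that
# divides by the treated mass a second time.
inf_correct = w * resid * M1
inf_buggy = inf_correct / sum_w_D

se_correct = np.sqrt(np.mean(inf_correct**2) / n)
se_buggy = np.sqrt(np.mean(inf_buggy**2) / n)
# The double division shrinks this SE contribution by exactly sum_w_D,
# which is why the reported SEs come out too small.
```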
  • Severity: P1. Impact: The new RCS ipw and dr nuisance corrections are also mis-scaled. _ipw_estimation_rc normalizes w_ct / w_cs to w_*_norm and then forms M2_rc with np.mean(...), adding an extra 1/n_ct or 1/n_cs; _doubly_robust_rc likewise divides its PS moment by n_all after already normalizing by sum_w_ipw_*. In the standardized RC IPW and locally efficient RC DR references, those PS moments are ratio-of-means terms, not extra sample-size-scaled means. That under-scales the PS correction and therefore understates inference for covariate-adjusted repeated-cross-section ipw and dr, including survey-weighted fits. Concrete fix: Rewrite the PS moment construction to match std_ipw_did_rc / drdid_rc exactly: summed normalized-weight moments for standardized IPW, and ratio-of-means PS moments without the extra / n_all for DR. diff_diff/staggered.py:3022 diff_diff/staggered.py:3076 diff_diff/staggered.py:3111 diff_diff/staggered.py:3371 diff_diff/staggered_bootstrap.py:357 docs/methodology/REGISTRY.md:423.
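The ratio-of-means point can be seen in isolation with simulated data; `w_norm` here stands in for the normalized IPW weights described above, nothing more.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 600
w = rng.uniform(0.5, 2.0, size=n)     # IPW-style weights for one arm
y = rng.normal(loc=1.0, size=n)       # per-observation moment terms

w_norm = w / np.sum(w)                # weights already normalized to sum to 1

# Ratio-of-means construction: a weighted average, i.e. the sum of
# normalized-weight terms.
m_ref = np.sum(w_norm * y)            # equals np.sum(w * y) / np.sum(w)

# Mis-scaled variant: np.mean over already-normalized terms adds an
# extra 1/n, shrinking the PS moment by the sample size.
m_bug = np.mean(w_norm * y)
```

Once the weights sum to one, the moment must be a sum, not a mean; the same reasoning applies to the / n_all division after normalizing by sum_w_ipw_*.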

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No separate findings. These methodology defects are not mitigated by tracking: the nearby CS item in TODO.md is already marked resolved. TODO.md:48 TODO.md:57

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: The new tests do not verify the magnitude of the RCS nuisance corrections. They check finiteness, positivity, or that different methods return different SEs, which would still pass with the scaling errors above. Concrete fix: Add fixed-data regression tests for panel=False reg / ipw / dr with covariates that compare at least one ATT(g,t) SE or influence-function vector against DRDID reference output, and assert that bootstrap SEs move with the corrected IF magnitude. tests/test_staggered_rc.py:163 tests/test_staggered_rc.py:362 tests/test_survey_phase7a.py:60 tests/test_survey_phase7a.py:199

Path to Approval

  1. Fix _outcome_regression_rc so its OLS estimation-effect correction matches reg_did_rc, specifically by removing the extra treated-mass division after M1 has already been normalized.
  2. Fix _ipw_estimation_rc and _doubly_robust_rc so their PS nuisance moments use the same ratio-of-means scaling as the DRDID repeated-cross-section references.
  3. Add fixed-dataset regression tests for panel=False covariate-adjusted reg, ipw, and dr that check both analytical SEs and bootstrap SEs against reference values.
  4. Keep the HonestDiD bootstrap covariance fix and its regression test; that prior blocker looks resolved.
