perf: Element-wise comparison only for tolerance-requiring data types #26
Marius Merkle (MariusMerkleQC) merged 18 commits into main
Codecov Report
✅ All modified and coverable lines are covered by tests.

| | main | #26 | +/- |
|---|---|---|---|
| Coverage | 100.00% | 100.00% | |
| Files | 10 | 10 | |
| Lines | 758 | 776 | +18 |
| Hits | 758 | 776 | +18 |
Pull request overview
This PR optimizes condition_equal_columns for nested list/array columns by avoiding the expensive element-wise comparison path when tolerances/special handling aren’t needed, and updates the performance benchmark accordingly.
Changes:
- Add `_needs_element_wise_comparison()` (plus helpers) to decide when list/array columns require element-wise comparison.
- Shortcut list/array comparisons to `eq_missing()` when element-wise handling is deemed unnecessary.
- Update the performance test to assert comparable performance for `list<i64>` comparisons.
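To illustrate why only some data types need the element-wise path, here is a minimal standard-library sketch. It does not use diffly or polars; the function names and the tolerance values are hypothetical and chosen only for illustration. It shows that plain equality is already correct for integer lists, while float lists need a per-element tolerance check.

```python
import math


def lists_equal_naive(left, right):
    """Plain equality: cheap, but exact (no tolerance)."""
    return left == right


def lists_equal_with_tolerance(left, right, abs_tol=1e-8, rel_tol=1e-5):
    """Element-wise comparison with tolerances (illustrative values, not diffly's defaults)."""
    if len(left) != len(right):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(left, right)
    )


# Integers: both approaches agree, so the cheap path is sufficient.
ints_agree = lists_equal_naive([1, 2, 3], [1, 2, 3]) == lists_equal_with_tolerance(
    [1, 2, 3], [1, 2, 3]
)

# Floats: rounding error makes exact equality fail where a tolerance check passes.
float_naive = lists_equal_naive([0.1 + 0.2, 1.0], [0.3, 1.0])
float_tolerant = lists_equal_with_tolerance([0.1 + 0.2, 1.0], [0.3, 1.0])
```

In this sketch, the integer case gives the same answer either way, which is the situation in which skipping the element-wise path is safe.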
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `diffly/_conditions.py` | Introduces dtype-based gating to skip element-wise list/array comparisons unless tolerances/special handling are needed. |
| `tests/test_performance.py` | Updates benchmark expectations to ensure the optimized path is not significantly slower than direct `eq_missing()`. |
Oliver Borchert (borchero)
left a comment
Nice, thanks
Motivation
See this comment.
Changes
Introduce a function `_needs_element_wise_comparison` that checks whether element-wise comparison needs to be performed. This is the case for:

1. float vs. numeric columns -> absolute and relative tolerances apply (-> `_is_float_numeric_pair()`)
2. temporal columns -> absolute temporal tolerance applies (-> `_is_temporal_pair()`)
3. different enums
4. enum vs. categorical comparison

In all other cases, naive comparison suffices, and this shortcut is taken if the above helper returns `False`. This avoids the expensive `_compare_sequence_columns()`. The performance improvement can be seen in the benchmark test.