ENH: parallelize DataFrame.corr #40956
Open
Labels
Enhancement, Multithreading (Parallelism in pandas), Needs Discussion (Requires discussion from core team before further action), Performance (Memory or execution speed performance), cov/corr
Is your feature request related to a problem?
DataFrame.corr(method="spearman") is extremely slow.
method="pearson" is quite slow too.
My machine's resource monitor shows that the implementation is single-threaded. Is that a design choice? If so, there should at least be an optional argument to parallelize it (at the C++ level, of course).
I have not checked the actual code that implements this method.
Describe the solution you'd like
scipy.stats.spearmanr performs this computation on a NumPy array in 1/20 of the time on my 6-core machine.
API breaking implications
None.
Describe alternatives you've considered
Add an optional argument (e.g. parallelize=True/False) so that the user has this option.
The method should then either be reimplemented from scratch at the C++ level, or use the existing scipy.stats function on DataFrame.values and wrap the returned array in a new DataFrame.
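The second alternative could look roughly like the sketch below. The helper name `spearman_corr_via_scipy` is hypothetical and not part of pandas or SciPy. The sketch assumes an all-numeric DataFrame with at least two columns and no missing values; DataFrame.corr excludes NaNs pairwise, which this wrapper does not attempt to reproduce.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr


def spearman_corr_via_scipy(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: Spearman correlation matrix computed by SciPy.

    Assumes numeric columns, at least two of them, and no NaNs.
    """
    if df.shape[1] < 2:
        raise ValueError("need at least two columns")

    values = df.to_numpy(dtype=float)

    # Columns are treated as variables (axis=0). The result unpacks as
    # (correlation, p-value).
    rho, _pvalue = spearmanr(values, axis=0)

    # With exactly two columns, spearmanr returns a scalar rather than
    # a 2x2 matrix, so rebuild the matrix by hand in that case.
    if np.ndim(rho) == 0:
        rho = np.array([[1.0, rho], [rho, 1.0]])

    # Wrap the array in a new DataFrame labelled like DataFrame.corr output.
    return pd.DataFrame(rho, index=df.columns, columns=df.columns)
```

Calling `spearman_corr_via_scipy(df)` returns a square DataFrame indexed by the column labels on both axes, the same layout as `df.corr(method="spearman")`.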
Additional context
IMPORTANT: DataFrame.corr and spearmanr give slightly different results (a small rounding error of about 10e-15).
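One way to check that the two results differ only by floating-point rounding is to compare them with an absolute tolerance. This is a minimal sketch on synthetic data: the random frame, the seed, and the 1e-12 tolerance are assumptions chosen for illustration, not values from the measurements above.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Synthetic data without ties or NaNs (assumption for this illustration).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 5)), columns=list("abcde"))

pandas_rho = df.corr(method="spearman").to_numpy()
scipy_rho, _pvalue = spearmanr(df.to_numpy(), axis=0)

# Largest element-wise discrepancy between the two correlation matrices.
max_abs_diff = np.abs(pandas_rho - scipy_rho).max()

# True when every entry agrees to within the chosen absolute tolerance.
agree = np.allclose(pandas_rho, scipy_rho, rtol=0.0, atol=1e-12)

print(f"max abs difference: {max_abs_diff:.3e}, agree: {agree}")
```

Exact equality (`==`) is the wrong test here, since the two implementations accumulate floating-point operations in a different order.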