Code Sample
import pandas as pd
import time
squares = set(a**2 for a in range(100000000))
series = pd.Series(range(100))
start = time.time()
apply_result = series.apply(lambda x: x in squares)
apply_end = time.time()
isin_result = series.isin(squares)
isin_end = time.time()
assert((apply_result==isin_result).all())
print("pandas.Series.apply() took {} seconds and pandas.Series.isin() took {} seconds.".format(apply_end - start, isin_end - apply_end))
Output:
pandas.Series.apply() took 0.0044422149658203125 seconds and pandas.Series.isin() took 72.23143887519836 seconds.
Problem description
When a set is passed to pandas.Series.isin, the set is converted to a list before being converted back to a hash table. Consequently, the run time is linear in the size of the set, even when the Series itself is tiny (100 elements in the sample above). This is not ideal, because one of the main reasons to use a set is that membership can be tested in constant time.
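To illustrate where the cost comes from, here is a minimal plain-Python sketch of the conversion described above. It is not the pandas code path; the function names `lookup_direct` and `lookup_via_list` are hypothetical and only mimic the set -> list -> hash table round trip.

```python
import time

def lookup_direct(needles, haystack_set):
    # One constant-time membership test per needle;
    # the cost does not depend on the size of the set.
    return [x in haystack_set for x in needles]

def lookup_via_list(needles, haystack_set):
    # Mimics the described conversion: set -> list -> new hash table.
    as_list = list(haystack_set)   # linear in the size of the set
    rebuilt = set(as_list)         # linear in the size of the set again
    return [x in rebuilt for x in needles]

needles = list(range(100))
squares = {a ** 2 for a in range(1_000_000)}  # smaller than the set in the report

t0 = time.perf_counter()
direct = lookup_direct(needles, squares)
t1 = time.perf_counter()
via_list = lookup_via_list(needles, squares)
t2 = time.perf_counter()

# Both approaches give the same answer; only the run time differs.
assert direct == via_list
print("direct: {:.6f}s, via list: {:.6f}s".format(t1 - t0, t2 - t1))
```

The second timing should grow with the size of `squares`, while the first should stay roughly flat; the exact numbers depend on the machine.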
Suggested improvements
The quick and dirty workaround is to use pandas.Series.apply (as in the above code sample) instead of pandas.Series.isin. I'm not familiar enough with pandas internals to know whether there are edge cases where this would fail or whether it would be a bad idea to incorporate this workaround into isin directly. I would suggest, however, that at a minimum the documentation for isin be updated to mention that a set will be converted to a list and that this has performance implications, so that users can choose an alternative approach. (I am happy to contribute the documentation if this is the preferred solution.)
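As a sketch of the workaround, the helper below wraps the apply-based lookup. `isin_via_set` is a hypothetical name, not part of pandas, and I have only checked it against integer data like that in the code sample, so treat it as an illustration rather than a drop-in replacement for isin.

```python
import pandas as pd

def isin_via_set(series, values):
    """Hypothetical helper: element-wise membership test against a set."""
    # Reuse an existing set as-is so that no linear-time copy is made.
    lookup = values if isinstance(values, (set, frozenset)) else set(values)
    return series.apply(lambda x: x in lookup)

squares = {a ** 2 for a in range(1000)}
series = pd.Series(range(100))

result = isin_via_set(series, squares)

# On this data the helper agrees with pandas.Series.isin.
assert (result == series.isin(squares)).all()
print(int(result.sum()))  # number of perfect squares in 0..99
```

The helper runs one Python-level lookup per element of the Series, so it is likely to pay off only when the set is much larger than the Series, as in the report.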
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.1
pytest: 3.0.5
pip: 19.0.3
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.16.2
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
feather: None
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml.etree: 3.7.2
bs4: 4.5.3
html5lib: 0.9999999
sqlalchemy: 1.1.5
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None