Some Pandas/Python benchmarks
Problem: apply a lookup-table-based map to a crapton of data
Toy example: 363k lines, 19 MB file
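For reference, the naive apply version looks roughly like this (the column name "key" and the lookup dict are placeholders, not the real ones):

import pandas as pd

# lookup: a plain dict mapping raw values to mapped values (placeholder)
def naive_apply(df, lookup):
    # one Python-level function call per row: this is the slow baseline
    return df.apply(lambda row: lookup.get(row["key"]), axis=1)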
python3.8, df.apply: 34s
pypy3, df.apply: 65s (!!?!)
python3.8, mapply, n_workers=-1: hopeless
python3.8, mapply, max_chunks_per_worker=2: slightly less hopeless, but still hopeless
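The mapply attempt was roughly the following (a hedged sketch: mapply.init patches a DataFrame.mapply method that mirrors apply; lookup and "key" are placeholders as above):

import mapply

mapply.init(n_workers=-1)  # -1 = use all cores; add max_chunks_per_worker=2 for the second config
out = df.mapply(lambda row: lookup.get(row["key"]), axis=1)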
Use a join instead of a per-row lookup:
Constructing the lookup dataframe: 86 seconds
Actual join: 20 seconds. Not drastically better than apply on the toy example?
Pypy: constructing the lookup dataframe takes 143 seconds; the join itself takes about the same 20 seconds.
Join takes 20 seconds regardless of the size of df1!?!?
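Sketch of the join version (column names are made up; the real lookup table is much bigger than a toy dict):

import pandas as pd

# Build a two-column dataframe out of the lookup table (the expensive construction step above)
lookup_df = pd.DataFrame({"key": list(lookup.keys()),
                          "mapped": list(lookup.values())})

# One vectorized merge replaces the per-row apply
out = df1.merge(lookup_df, on="key", how="left")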
Final workflow
- Use joins instead of applys. A join takes 20–30 seconds on data that apply needed an hour for.
- The df that has to be constructed for the join is expensive to build. Share it among all worker processes with a multiprocessing.Manager:
import multiprocessing

mgr = multiprocessing.Manager()
ns = mgr.Namespace()
ns.df = df                                  # put the expensive lookup df into the shared namespace
args = [(ns, rank) for rank in range(512)]  # each worker gets the namespace proxy plus its rank
with multiprocessing.Pool() as pool:
    pool.map(f, args)
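Each worker then pulls the shared df back out of the namespace and does its own join; f, load_chunk, and the column names below are hypothetical:

def f(arg):
    ns, rank = arg
    lookup_df = ns.df                   # fetched from the manager process
    chunk = load_chunk(rank)            # hypothetical loader for this rank's slice of the data
    return chunk.merge(lookup_df, on="key", how="left")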
This workflow achieves in 30 minutes what naive apply took 500+ hours for. 1000X improvement!!!
Moral of the story
- Applys are stupid
- Most distributed runtimes are stupid (sorry)
- Don’t switch to pypy and expect magic. It can be slower, especially with compiled extension modules (pandas and NumPy are mostly compiled code)