Some Pandas/Python benchmarks

Ankush Jain
1 min read · Jul 21, 2022


Problem: apply a lookup-table-based map to a crapton of data.

Toy example: 363k rows, a 19 MB file.
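
For concreteness, a minimal sketch of the apply version; the lookup dict, column name, and sizes here are made up for illustration:

    import pandas as pd

    # Stand-ins for the real data: a lookup table and a dataframe
    # whose "key" column needs to be mapped through it.
    lookup = {i: f"val_{i}" for i in range(100_000)}
    df = pd.DataFrame({"key": range(363_000)})

    # The slow path: one Python-level function call per row.
    df["mapped"] = df.apply(lambda row: lookup.get(row["key"]), axis=1)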

python3.8, df.apply: 34s

pypy3, df.apply: 65s (!!?!)

python3.8, mapply, n_workers=-1: hopeless

max_chunks_per_worker=2, slightly less hopeless, but still hopeless
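
For reference, the mapply attempt looked roughly like this (assuming mapply's usual init-then-patch API, with the same made-up lookup as above):

    import mapply

    # mapply.init() patches a .mapply method onto pandas objects;
    # n_workers=-1 means "use all cores".
    mapply.init(n_workers=-1, max_chunks_per_worker=2)

    df["mapped"] = df.mapply(lambda row: lookup.get(row["key"]), axis=1)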

Use a join instead of a lookup:

Constructing the lookup dataframe — 86 seconds.

The actual join — 20 seconds. Not drastically different?

PyPy — construction takes 143 seconds; the join takes the same ~20 seconds.

The join takes ~20 seconds regardless of the size of df1!?!? Presumably the cost is dominated by the lookup-table side, not by df1.
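
A sketch of the join version, reusing the made-up names from the apply sketch above:

    # Build the lookup dataframe once (the expensive 86-second step).
    lookup_df = pd.DataFrame(
        {"key": list(lookup.keys()), "mapped": list(lookup.values())}
    )

    # One vectorized merge replaces the per-row apply (the ~20-second step).
    df = df.merge(lookup_df, on="key", how="left")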

Final workflow

  1. Use joins instead of applies. A join takes 20–30 seconds on something apply took an hour for.
  2. The lookup df that the join needs is expensive to construct. Build it once and share it among all worker processes:

    import multiprocessing

    # Build the expensive lookup dataframe once, then publish it through
    # a Manager namespace so every worker process can reach it.
    mgr = multiprocessing.Manager()
    ns = mgr.Namespace()
    ns.df = lookup_df

    # One task per shard; each carries the shared namespace and its rank.
    args = [(ns, rank) for rank in range(512)]

    with multiprocessing.Pool() as pool:
        pool.map(f, args)
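
The worker f isn't shown above; a minimal sketch, assuming each rank loads its own shard (load_shard is hypothetical):

    def f(arg):
        ns, rank = arg
        shard = load_shard(rank)  # hypothetical per-rank input loader
        # Reading ns.df pulls the shared dataframe through the manager
        # proxy; the shard is then joined against it.
        return shard.merge(ns.df, on="key", how="left")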

This workflow achieves in 30 minutes what the naive apply would have taken 500+ hours for. A 1000x improvement!!!

Moral of the story

  1. Apply is stupid
  2. Most distributed runtimes are stupid (sorry)
  3. Don't switch to PyPy and expect magic. It can be slower, especially with compiled modules like pandas
