Some Pandas/Python benchmarks

  1. Use joins instead of apply. A join takes 20–30 seconds on a task that apply took 1 hour for.
  2. The DataFrame that has to be built for the join is expensive to construct. Share it among all processes using:
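
One way to do this is to build the frame once in the parent and let forked workers inherit it through a module-level global. This is a sketch of one common approach, not necessarily the setup behind the numbers above. The names `_LOOKUP`, `build_lookup`, `join_chunk` and `parallel_join` are made up for illustration, the tiny table stands in for the expensive one, and the `fork` start method assumes a POSIX system.

```python
import multiprocessing as mp

import pandas as pd

# Module-level slot for the shared table (hypothetical name).
_LOOKUP = None


def build_lookup():
    """Stand-in for the expensive DataFrame construction."""
    return pd.DataFrame({"key": [1, 2, 3], "value": [10, 20, 30]}).set_index("key")


def join_chunk(chunk):
    """Worker: join one chunk against the table inherited from the parent."""
    # DataFrame.join with on= keeps the chunk's own row index.
    return chunk.join(_LOOKUP, on="key")


def parallel_join(df, n_procs=2):
    """Build the lookup once, then join chunks of df in forked workers."""
    global _LOOKUP
    _LOOKUP = build_lookup()  # built before the pool exists, so forks inherit it
    chunks = [df.iloc[i::n_procs] for i in range(n_procs)]
    ctx = mp.get_context("fork")  # assumption: POSIX; "spawn" would not inherit globals
    with ctx.Pool(n_procs) as pool:
        parts = pool.map(join_chunk, chunks)
    return pd.concat(parts).sort_index()
```

Only the chunks and the results are pickled between processes; the lookup table itself is never sent, which is the point of the exercise.
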
  1. Applies are stupid
  2. Most distributed runtimes are stupid (sorry)
  3. Don’t switch to PyPy and expect magic. It can be slower, especially with compiled modules
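
To make the join-versus-apply point concrete, here is a minimal sketch of the same lookup written both ways. The `orders` and `users` frames and the function names are invented for illustration; the real workload behind the timings above is not shown here.

```python
import pandas as pd

# Toy data, assumed for illustration only.
orders = pd.DataFrame({"user_id": [1, 2, 1, 3], "amount": [5.0, 7.5, 1.0, 2.0]})
users = pd.DataFrame({"user_id": [1, 2, 3], "country": ["DE", "US", "IN"]})


def with_apply(orders, users):
    """Row-wise version: one Python-level scan of `users` per order row."""
    def lookup(row):
        match = users.loc[users["user_id"] == row["user_id"], "country"]
        return match.iloc[0]

    out = orders.copy()
    out["country"] = orders.apply(lookup, axis=1)
    return out


def with_join(orders, users):
    """Join version: a single merge handles every row at once."""
    return orders.merge(users, on="user_id", how="left")
```

Both functions produce the same `country` column; the difference is that the apply version runs a Python function per row, while the merge runs inside pandas.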
