Spark Resources

  1. PySpark DataFrame API
  2. Pandas API on Spark
  3. Spark DAGs, Execution Plans, and Caching
  4. Excellent Coursera course by Heather Miller: shows how Spark SQL automatically optimizes the underlying Scala code (e.g., by reordering RDD transformations) to minimize high-latency operations such as inter-node data shuffling
  5. Pipelining and model selection in Spark MLlib
  6. MLflow on Spark
  7. Distributed training with TensorFlow 2 on Spark