Spark Resources
- PySpark DataFrame API
- Pandas API on Spark
- Spark DAGs, Execution Plans, and Caching
- Excellent Coursera course by Heather Miller:
  Shows how Spark SQL automatically optimizes the underlying Scala code (e.g., by reordering RDD transformations)
  to minimize high-latency operations such as inter-node data shuffling
- Pipelining and model selection in Spark MLlib
- MLflow on Spark
- Distributed training with TensorFlow 2 on Spark
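
The reordering idea mentioned under the Coursera course can be illustrated with a small toy model. This is a plain-Python sketch, not Spark code and not how Spark's optimizer is implemented: `shuffle_by_key`, `join`, and the `orders`/`customers` data are hypothetical names made up for illustration, and "records moved" is just a stand-in for the cost of an inter-node shuffle. It compares joining first and filtering afterwards with filtering first and joining afterwards.

```python
from collections import defaultdict


def shuffle_by_key(records):
    """Group (key, value) records by key, standing in for an inter-node shuffle.

    Returns the grouped records and the number of records that had to move.
    """
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups, len(records)


def join(left, right):
    """Inner join of two (key, value) lists; both sides are shuffled by key."""
    left_groups, left_moved = shuffle_by_key(left)
    right_groups, right_moved = shuffle_by_key(right)
    rows = [
        (key, left_value, right_value)
        for key, left_values in left_groups.items()
        for left_value in left_values
        for right_value in right_groups.get(key, [])
    ]
    return rows, left_moved + right_moved


# Hypothetical data: 1000 orders (customer_id, amount), 100 customers (customer_id, country).
orders = [(i % 100, float(i)) for i in range(1000)]
customers = [(i, "CH" if i % 10 == 0 else "US") for i in range(100)]

# Plan A: join everything, then keep only large orders.
joined, moved_a = join(orders, customers)
plan_a = [row for row in joined if row[1] >= 900.0]

# Plan B: filter the orders first, then join the much smaller input.
large_orders = [order for order in orders if order[1] >= 900.0]
plan_b, moved_b = join(large_orders, customers)

print(sorted(plan_a) == sorted(plan_b))  # both plans produce the same rows
print(moved_a, moved_b)                  # records shuffled: 1100 vs 200
```

Both plans return the same rows, but the filter-first plan shuffles far fewer records. In Spark itself, the plan that was actually chosen can be inspected with `DataFrame.explain()`.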