Machine Learning with PySpark - Review

Raswitha Bandi, R Karthik, J Amudhavel


Apache Spark is a distributed, in-memory computing system well suited to machine learning, and it offers better computing performance than Hadoop. Spark is a fast, easy-to-use engine for big data processing with built-in modules for machine learning, streaming, SQL, and graph processing. Machine learning algorithms can be applied to big data easily with Spark and its machine learning library MLlib, and more simply still through PySpark, the Python API. This paper presents a study of how to develop machine learning algorithms in PySpark.


Apache Spark; Machine Learning; PySpark; Scala



This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.