What is the best programming language for learning and becoming a Data Scientist?
R vs Python: The Battle
To choose and pursue a new career that is in high demand, a candidate needs to know which skills are required and how to develop them. Data Scientist, regarded by many people and companies as the Sexiest Job of the 21st Century, has become a dream job for both young and experienced professionals (HBR https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century).
Just as computer engineers and Wall Street “quants” held the most prestigious business occupations of the 1990s, data scientists are the hires that firms compete for today, as companies wrestle with unprecedented volumes and types of information, and the demand for these experts has raced well ahead of supply (E. Brynjolfsson & McAfee, 2012). Describing the abilities that make a data scientist successful, Brynjolfsson & McAfee characterize him or her as a hybrid of data hacker, analyst, communicator, and trusted partner, a combination that is extremely powerful and rare (E. Brynjolfsson & McAfee, 2012). In another article from the same publication, Redman writes: “According to LinkedIn, the top 10 skills for a data scientist include machine learning, R, Python, data mining, data analysis, data science, SQL, MatLab, big data, and statistical modelling” (Redman, 2018).
Since the main purpose of the work is to generate insights about the business, there is pressure to produce them as quickly as possible. Using the most efficient programming language is therefore the better choice, but identifying which language is the most efficient is a dilemma. This leads to the main question of this essay: which programming language should be selected? The two most common programming languages for data science work, according to a survey conducted by O’Reilly, are Python and R (King & Magoulas, 2016). Therefore, starting with learnability, passing through usability and ending with popularity, this essay will determine which of the two, R or Python, is the better choice for the aspiring data scientist.
First of all, before comparing the two languages, some characteristics of each need to be presented. The R language was created by Ross Ihaka and Robert Gentleman in 1995, with a focus on user-friendly data analysis, statistics, and graphical models (R Core Team, 2013). Python, on the other hand, was created by Guido van Rossum and released in 1991, with an emphasis on productivity and code readability (Python Core Team, 2015). Turning to ease of learning, Pandya compared the learning curves of both languages. In most cases, R is the simpler language for a person with a predominantly scientific or statistical background, because it was designed for working with data, and this advantage is apparent as soon as analysts start working with its expressions (Pandya, 2018). With the base distribution of the language alone, a student can use data frames, build linear models and produce some basic visualizations (Pandya, 2018). Python, in contrast, does not offer the same features out of the box. Although Python is considered an excellent language, it is more general-purpose than R (Pandya, 2018). The most common packages and libraries for machine learning are not part of the base language, because the language was built with the core purpose of creating programs (Pandya, 2018). On the other hand, drawing on personal experience, the author offers a counterpoint: when students need to present their analyses in a web service or online dashboard, it is easier to learn how to do so in Python than in R (Pandya, 2018).
Other authors have made the same comparison, and many programmers believe that the syntax of R is easy to use and understand without formal instruction (Ozgur et al., 2017). Moreover, a data scientist and teacher reported that, for students with little or no programming experience, the learning curves of the two languages are comparable (Markham, 2015). Despite this similarity, Python can be easier to learn because its code reads more like regular human language (Markham, 2015). When it comes to reading and manipulating data frames, R provides built-in functions that are easy to use and help a programmer get started, whereas in Python the student needs to choose how the program will read the data and how it will manipulate them; this is because Python is general-purpose while R is dedicated to statistical analysis (Ozgur et al., 2017). Help and support for R also appear easier to find, since a great deal of open-source code is available on the web and the community is very active (Ozgur et al., 2017). In the abstract of that article, the authors state that Python is fairly easy to learn and has add-on programs that help students (Ozgur et al., 2017). To sum up this component, readability and simplicity make the learning curve lower and more gradual, which gives Python a moderate advantage in learnability.
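The point about choosing how data are read can be illustrated with a short sketch that uses only the Python standard library. The CSV content, the column names and the read_people helper are hypothetical and were invented for this example. The csv module returns every field as text, so the programmer decides explicitly which columns become numbers.

```python
import csv
import io

# Hypothetical CSV content, invented for this sketch. A real file would be
# opened with open("people.csv", newline="") instead of io.StringIO.
RAW = """name,age,salary
Ana,34,5200.50
Ben,28,4100.00
Caio,41,6100.75
"""


def read_people(text):
    """Parse CSV text into a list of dicts with explicitly converted types."""
    reader = csv.DictReader(io.StringIO(text))
    people = []
    for row in reader:
        # csv yields every field as a string, so each type is chosen by hand.
        people.append(
            {
                "name": row["name"],
                "age": int(row["age"]),
                "salary": float(row["salary"]),
            }
        )
    return people


people = read_people(RAW)

# A simple manipulation: names of everyone older than 30.
over_30 = [p["name"] for p in people if p["age"] > 30]
print(over_30)
```

In practice most Python users delegate these choices to a library such as pandas, which is one reason add-on packages matter so much for data work in Python.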
Secondly, to evaluate usability, the data pipeline needs to be divided into stages. The manuals of both languages present four: Data Collection, Data Exploration, Data Modeling and Data Visualization (R Core Team, 2013; Python Core Team, 2015). The official manuals describe the functions and packages available at each stage. In terms of Data Collection, R supports importing data from Excel, CSV and text files, and can use packages (groups of additional functions) to access a wide variety of data sources (R Core Team, 2013). In Python the import functions are built into the base program, so the two languages are equivalent at this stage (Python Core Team, 2015). For Data Exploration, Python has a package called pandas, which makes it easy to filter, sort and display data; as a result, scanning the data and cleaning out records of no practical use is also very easy (Python Core Team, 2015). In R, programmers have many options for the statistical analysis of large data sets (R Core Team, 2013). The contrast is that, for the same set of analyses, R required three packages where Python required only one. Advancing to the next stage, Data Modeling, R has an abundance of packages for specific analyses, such as the Poisson distribution, multivariate distributions, and other areas of advanced probability and statistics (R Core Team, 2013). Likewise, Python can perform numerical modelling with the NumPy package and machine learning with the scikit-learn library, while other packages cover almost all of statistical modelling (Python Core Team, 2015). In the last stage, Data Visualization, R has a strong background in scientific visualization, with many packages dedicated to the graphical display of results and the ability to save the output as image or PDF files (R Core Team, 2013).
Python offers comparatively more options, such as the IPython (now Jupyter) Notebook, where code can be executed and its results viewed at the same time; Matplotlib, for creating basic charts; Plotly, for more advanced presentations; nbconvert, for transforming notebooks into HTML pages; and others (Python Core Team, 2015).
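The last three pipeline stages can be sketched in a few lines of standard-library Python. This is an illustration only: the sales records and the function names explore, model and visualize are hypothetical, the records are hard-coded in place of the Data Collection stage, and a text bar chart stands in for a plotting library. A real project would more likely use pandas for exploration and Matplotlib or Plotly for the chart.

```python
from statistics import NormalDist

# Hypothetical records standing in for the output of Data Collection.
sales = [
    {"region": "North", "units": 120},
    {"region": "South", "units": 95},
    {"region": "East", "units": 143},
    {"region": "West", "units": 0},  # treated here as a record to clean out
    {"region": "Centre", "units": 110},
]


def explore(records):
    """Data Exploration: drop empty records and sort by units, descending."""
    kept = [r for r in records if r["units"] > 0]
    return sorted(kept, key=lambda r: r["units"], reverse=True)


def model(records):
    """Data Modeling: fit a normal distribution to the unit counts."""
    return NormalDist.from_samples([r["units"] for r in records])


def visualize(records, scale=10):
    """Data Visualization: build one text bar per region."""
    return [f"{r['region']:<7}{'#' * (r['units'] // scale)}" for r in records]


clean = explore(sales)
fitted = model(clean)
chart = visualize(clean)

print(f"mean={fitted.mean:.1f}, stdev={fitted.stdev:.1f}")
print("\n".join(chart))
```

Keeping each stage in its own function mirrors the four-stage division used in the comparison, so a single stage can later be swapped for a library-based version without touching the others.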
Turning to another author, on KDnuggets, an important portal in the data science field, Martijn Theuwissen wrote an extensive comparison of R and Python and pointed out some characteristics related to usability (Theuwissen, 2015). Regarding Python, the author noted that people with a software engineering background find it more natural, that its simple syntax makes coding and debugging easier, and that there is one consistent way to write each piece of functionality (Theuwissen, 2015). Regarding R, it is easier to learn for people without coding experience, statistical models can be written in only a few lines, and there are several ways to code the same functionality (Theuwissen, 2015). To summarize this component, Python showed an advantage because it makes visualizations easier to create and needs fewer additional packages to import the same data that R can. For usability, Python proved more efficient.
Finally, popularity refers to how often a language appears in real projects; the more widely used a language is, the more practitioners are developing new packages and solutions for it. The growing volume of discussion on the web should not be forgotten either, because it creates a large body of source code examples. In raw numbers, a simple comparison is not possible: Python is intended for general software development whereas R is intended for statistical programming, so Python has more packages and more general uses. Setting general numbers aside, the KDnuggets article cited above provides figures on the use of these languages specifically in data science, such as the percentage of use of each language, search-term trends, and analyses that use Python and R together (Theuwissen, 2015). The author concluded that there are signs of a shift from R to Python, that most analysts use both languages, and that students should therefore learn both (Theuwissen, 2015). In the article cited earlier, Ozgur and colleagues describe the ubiquity of Python and argue that, because of it, knowledge of coding and computer science in Python is essential, especially for someone starting a new career (Ozgur et al., 2017). In terms of popularity, given the factors above, Python again has a slight advantage.
To conclude, as stated in the Harvard Business Review article, the shortage of data scientists is becoming a serious constraint in some sectors, because these are high-ranking professionals with the training and curiosity to make discoveries in the world of big data (E. Brynjolfsson & McAfee, 2012). With this shortage in mind, Vikash Kumar, a manager at a software development company, offers a recommendation: “When it comes to machine learning projects, both R and Python have their own advantages. Still, Python seems to perform better in data manipulation and repetitive tasks. Hence, it is the right choice if you plan to build a digital product based on machine learning. Moreover, if you need to develop a tool for ad-hoc analysis at an early stage of your project then go for R” (Kumar, 2020). To close this comparison, Brittain and colleagues tested the languages in practice and reported their impressions (Brittain et al., 2018). The authors emphasized the capabilities of Python, citing the enhancements made to the code by its community and its good performance across a dozen different open-source packages (Brittain et al., 2018). In the same experiment, R achieved the best time performance on the datasets, and in terms of community its advantage is the migration of academic researchers to R (Brittain et al., 2018). In summary, considering all the comparisons and contrasts above, and given its slight advantage in each, Python is the best choice for aspiring data scientists.
References
Brittain, J., Cendon, M., Nizzi, J., & Pleis, J. (2018). Data Scientist’s Analysis Toolbox: Comparison of Python, R, and SAS Performance. 1(2), 20.
Brynjolfsson, E., & McAfee, A. (2012, October 1). Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
Kappal, S. (2017, November 6). R vs. Python — DZone Big Data. Big Data Zone.Com. https://dzone.com/articles/r-or-python-data-scientists-delight
King, J., & Magoulas, R. (2016). Data Science Salary Survey. Data Science Salary Survey by O’Reilly, 51.
Kumar, V. (2020, February 28). Python Vs R: What’s Best for Machine Learning. Medium. https://towardsdatascience.com/python-vs-r-whats-best-for-machine-learning-93432084b480
Markham, K. (2015, February 2). Should you teach Python or R for data science? Data School. https://www.dataschool.io/python-or-r-for-data-science/
Ozgur, C., Colliau, T., Rogers, G., Hughes, Z., & Myer-Tyson, B. (2017). MatLab vs. Python vs. R. Journal of Data Science, 15, 355–372.
Pandya, D. (2018, January 11). How does R differ from Python in terms of learning curve, analysis of large data files, machine learning and data exploration? Quora. https://www.quora.com/How-does-R-differ-from-Python-in-terms-of-learning-curve-analysis-of-large-data-files-machine-learning-and-data-exploration
Python Core Team. (2015). Python: A dynamic, open source programming language. Python Software Foundation. https://www.python.org/
R Core Team. (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/
Redman, T. C. (2018, January 25). Are You Setting Your Data Scientists Up to Fail? Harvard Business Review. https://hbr.org/2018/01/are-you-setting-your-data-scientists-up-to-fail
Theuwissen, M. (2015, May 1). R vs Python for Data Science: The Winner is …. KDnuggets. https://www.kdnuggets.com/r-vs-python-for-data-science-the-winner-is.html/