It was not so long ago that Steve Lohr quoted economist Hal Varian, Chief Economist at Google at the time, as saying that “the sexy job in the next 10 years will be statisticians.” Lohr spoke of a “new breed of statisticians (…) that use powerful computers and sophisticated mathematical models to hunt for meaningful patterns and insights in vast troves of data. The applications are as diverse as improving Internet search and online advertising, culling gene sequencing information for cancer research and analysing sensor and location data to optimise the handling of food shipments.” But then Lohr went on to say: “though at the fore, statisticians are only a small part of an army of experts using modern statistical techniques for data analysis (…), the new data sleuths come from backgrounds like economics, computer science and mathematics.”
Five years later, in 2014, Lohr clarified what he meant, baptized the sleuths as data scientists, and broadcast that the sexy part of the job, the data analysis and discovery, ensues only after spending 80 percent of the time coding and programming to clean and prepare the data for analysis. This is not unfamiliar to statisticians, who know that data analysis has always involved a large part of data cleaning, often delegated to the lower ranks of the statistics career (the data entry and data cleaning career). But that division of labor has now reached unprecedented dimensions. The large amounts of data brought about by the internet have resulted in startup companies specialising in cloud computing, software engineering and coding to prepare heterogeneous masses of data from the web, sensors, smartphones and corporate databases for machine learning and statistical analysis. Being a “janitor” of data (the old data management job) is as lucrative a career as being a statistician now. But being both, and also being a data miner, is priceless, whether the employer is a biotech company, the ONS or Google.
Data science is an old term that, according to Wikipedia, became popularized when DJ Patil and Jeff Hammerbacher used the term “data scientist” to define their jobs at LinkedIn and Facebook, respectively. DJ Patil was recently named Chief Data Scientist of the White House in the United States, due in large part to his being, according to Forbes magazine, one of the top 7 data scientists in the U.S. The other six are indeed an army of experts from backgrounds like economics, computer science, mathematics and health sciences. Patil, in a memo to the American people, defined data science as “the ability to extract knowledge and insights from large and complex data sets”, with social media, search, and e-commerce being the areas benefitting most from this. He explains that the role of an organization’s CDO (Chief Data Officer) or CDS (Chief Data Scientist) is “to help their organisation acquire, process, and leverage data in a timely fashion to create efficiencies, iterate on and develop new products, and navigate the competitive landscape.” The job title was created before a CIP (Classification of Instructional Programs) code for data science existed. Perhaps that is why we find job advertisements for Statistics positions that require the same skills as a Data Science position.
Job seekers with statistics training have been wondering what role Statistics plays in data science. Fearful of being left behind, professional statistics groups have defined data science for their constituencies. For example, a recent statement made by the American Statistical Association (ASA) on the role of Statistics in Data Science says: “While there is not yet a consensus on what precisely constitutes data science, three professional communities, all within computer science and/or statistics, are emerging as foundational to data science: (i) Database Management enables transformation, conglomeration, and organization of data resources; (ii) Statistics and Machine Learning convert data into knowledge; and (iii) Distributed and Parallel Systems provide the computational infrastructure to carry out data analysis.” (…) At its most fundamental level, the ASA says, “we view data science as a mutually beneficial collaboration among these three professional communities, complemented with significant interactions with numerous related disciplines. For data science to fully realise its potential requires maximum and multifaceted collaboration among these groups.” Thus, “the next generation of statistical professionals needs a broader skill set and must be more able to engage with database and distributed systems experts (…), there will be an ever-increasing demand for such “multi-lingual” experts.”
It appears then that data sleuths, multi-lingual experts and data scientists do what applied statisticians always did, but at a much larger scale, and with the complexity of the digital world added to the data cleaning and data analysis cycle. Big data has made the task bigger and more complex. For those who are not sure where they fit, there is the survey at http://survey.datacommunitydc.org, which was used by Harlan D. Harris, Sean Patrick Murphy, and Marck Vaisman in 2012 to find the attributes that are seen in data science practitioners today, and their experiences in the job market. Those authors think that terms like “data scientist”, “analytics” and “big data” are the result of what one might call a “buzzword meat grinder” that often results in a lack of clarity about what is expected from job candidates. Their survey revealed that there is a variety of data scientists. They all do machine learning or big data, math, programming, statistics and data business. However, they do so to different extents depending on the context of their job. Some are more focused on researching data using a great deal of statistics (now called analytics by many), others do a lot of machine learning and big data handling, others do a lot of programming, and others are more focused on the business part. The difference between being a data scientist and a statistician may lie only in whether the employer is looking for a person who will do it all, or a person to join a team that embraces a variety of data scientists, each doing one of the tasks. The job is more likely to be well defined in the latter case and not very well defined in the former. Job seekers, be aware and ask for details!
From an economic point of view, it appears that keeping the vagueness in job descriptions may be a profit maximising strategy. Data science seems to have arisen from the need to capture the constant inflow of data that arrives at businesses and governments, manage that data, mine it, analyse it using statistics and machine learning, extract knowledge from it using large scale computing, and communicate what is learned from this process. Twitter, for instance, is a prototypical example of what people call “Big Data.” According to Nicole Lazar, Twitter generates masses of information every second, information that is unstructured, with strings of text that have to be mined for meaning (not unlike the surveillance data monitored by governments constantly or Amazon data). There can be several “data scientists” looking at the number of users, tweets over a given time span, distribution of users around the world, subjects of tweets, and trends in subject. A full-fledged data scientist who understands and can do all the tasks would cost less than having specialized data miners, data janitors and data analysts. This economic principle extends from Amazon to the smallest theater company tracking attendees. The bigger employers (e.g., Facebook, LinkedIn, Google, Amazon, Twitter, National Defense) tend to have a diversified and specialised group, while the smaller ones may tend to hire just one person to do it all. But the trend in job advertising is going in the direction of playing it safe and asking for data scientists or business analysts. Job seekers, look carefully to see if your skills are requested.
Vagueness may not last much longer, however. Pioneer schools venturing into graduate and undergraduate data science degrees or modifying existing statistics curricula are setting the trends on what a company may require, and that effort will end up making the job descriptions clearer. And they are doing so at high speed.
Graduate programs started mushrooming after Thomas H. Davenport and D.J. Patil cheerfully reiterated in a 2012 Harvard Business Review article that the sexiest job of the 21st century is data science, but also gloomily acknowledged that there were no university programs offering degrees. As if predicting the rise of Patil to CDS of the White House, many universities started offering a well-defined Master in Data Science and others initiated a specialised data science undergraduate program. Business schools, on the other hand, started offering MBAs in Business Analytics, to emphasise the application of data science to business.
Take, for example, the online Professional Master of Information and Data Science at the University of California at Berkeley, the most comprehensive there is so far, and featured, among others, in a July 2015 article of AMSTAT News, a magazine of the American Statistical Association. The curriculum contains 5 main modules at a basic level, three of which can be identified with the typical offerings of a master in Statistics. But two of those modules are not offered in conventional master programs. One of them is on storing and retrieving data (Python, relational databases, Hadoop, MapReduce, Spark, cloud computing (AWS)) and the other is on applied machine learning (Python libraries for linear algebra, plotting and machine learning: numpy, matplotlib, scikit-learn; GitHub for submitting project code). The program also offers advanced modules on scaling up really big data (OpenStack [Heat], Distributed Filesystems, Apache Hadoop, Apache Spark [DStreams], SaltStack, Ansible, CouchDB, Cloudant, CloudSoft Brooklyn, Swift, Apache Solr, Apache Mesos, Open MPI, Computational Genomics, IBM Watson), parallel computing and advanced statistics. But Berkeley is not alone.
Master’s degrees in data science have gained so much prominence that they are already being ranked, or at least the top 25 are. The same occurs with MBAs in Business Analytics. Ranked high are, in addition to Berkeley, the Master of Science in Data Science at Columbia University, which has a curriculum similar to that of Berkeley and is the product of a collaboration among 6 different departments, the master of data science at New York University and the one at the University of Virginia, to name a few. As indicated, when attached to a school of business, data science is masked under the name Business Analytics, or Big Data. The number of web sites compiling lists of universities where such degrees can be obtained is booming. Regardless of the name used (Big Data, Analytics, Data Science), all these degrees share content knowledge similar to that found in the Master at Berkeley, with variation in the extent to which they cover each of the areas. And we should not neglect to notice the proliferation of open online sites such as Coursera, which offers a nine-course introduction to data science. All this may seem like a very fast response of universities and open education to the shortage of data scientists. However, many universities have not yet initiated a degree program called data science.
Not jumping on the bandwagon of a Master of Data Science or an MBA in Business (or Data) Analytics does not mean that those universities are not making great efforts to look like they are. Consider for example Stanford University, which has created a special track on data science in its regular Master of Statistics without changing the name of the program. This track consists of 18 units of electives covering the areas of data science that full-fledged data science master’s programs offer. Harvard, to give another example, has a very popular online course called Big Data Analysis and has a web page in the statistics department dedicated to data science, where they claim that they pioneered data science. Harvard and MIT offer open courses in many areas relevant to data science as well. But graduate programs are not alone.
Many other statistics departments have changed their undergraduate curriculum to incorporate courses on big data, computing and machine learning. To name a few, in the April 2015 issue of Amstat News, Purdue University, the University of Florida, the University of California Davis, the University of Illinois at Urbana-Champaign, and the University of California at Berkeley acknowledge changing their curriculum to adapt to the data science era. But the real indication that data science and big data are here to stay is the growth in the number of colleges offering undergraduate degrees not in statistics but in data science. This is happening just as the number of statistics departments is increasing and new elementary, middle and high school statistics curricula are being implemented. Data science curricula will be knocking at middle school teachers’ doors before they have time to learn how to cope with the new statistics curriculum.
Some universities, however, have realised that simply tweaking existing or new undergraduate statistics programs is not enough to prepare students for data science. That is why new data science undergraduate degrees have started to populate universities, all of them encompassing the core of computer science, statistics and mathematics in a well-integrated program. In an article featured in the July 2015 issue of Amstat News, Northern Kentucky University, the University of California at Irvine, Winona State University, the University of Nottingham and Warwick University explained their data science majors. David Hodge, Uwe Aickelin, Christian Wagner and Ian Dryden, of Nottingham, see data science as the newborn sister of mathematics, computational sciences and statistics. With the program housed in the Computer Science and Mathematics departments, they put together a curriculum similar to that of Berkeley, but at an undergraduate level. So did Warwick, motivated by the growing demand for those skills in the job market. The creators of the program at Winona State claim that “data science is not statistics.” Their program was motivated by the need to help those undergraduate students seeking employment, and it also intends to have their students think like both computer scientists and mathematicians. Having a major in Statistics with a minor in Computer Science, or vice versa, is not enough training. Students from now on will have to think like a statistician, a computer scientist and a business person, all at once. Training them to do so requires a degree in data science.
The reasoning behind all the changes in graduate and undergraduate programs and curricula is that a data scientist will most likely be working as part of a team on a project, and being comfortable in communicating statistical and computer science methodology will be invaluable in business and consulting settings. To properly address the challenges in Big Data takes knowledge and experience in more than one area, and that is what integrated, comprehensive undergraduate and graduate degrees in data science are trying to prepare students for. New degrees built in touch with employers emphasise many of the “algorithmic components” of a traditional computer science degree (algorithms, data structures, programming, data management, software engineering), combined with a large number of courses in statistics and machine learning (to a much more significant degree than would be the case in a traditional computer science degree). The statistical skill will always be there, as employers need to answer the key questions statisticians always ask: When can you generalise your results to a larger population? What assumptions are we making when using statistical methods? How can we inherently understand and quantify variability and uncertainty? As the field of data science emerges and evolves, the best data scientists will be those with solid foundations in both statistical thinking and computational skills.
We should not forget, however, that although big data is behind much of the data science phenomenon, it is not all that is needed for data science. The Cincinnati Shakespeare Company sells 25,000 tickets every year for 10 different productions. That is small data. Xinping Zhang, Byran J. Smucker and Jay Woffington show how a statistician with basic statistical skills used the company’s data to give more effective advance notice of possible shortfalls or windfalls. It is true that a job like that is now called “predictive analytics,” and true that were the company to go digital, the statistician would have to learn to mine the web to achieve the same goal and keep the job. To give another example, there are many medical companies that need to test a drug on a small number of patients, and they require statisticians who can design a clinical trial, manage the follow-up and analyse the data. Were the company to do surveillance of populations using an app, the statistician would have to create some software to extract the data and transform it for analysis.
At a large or small scale, extracting knowledge from data is behind a “data science”, “analytics”, “big data”, “small data” or simply statistics job (although there are fewer and fewer job ads asking for a “statistician”). They all are likely to require data gathering and cleaning, data exploration, and statistics. Job candidates should ask how much of each, and at what level of each, at least until job descriptions get more specific. And do not forget to ask: “what expert domain knowledge do I need?” And “what type of data do you have?” But also, whether the job is a Statistician or a Data Scientist or Analytics job, be prepared to sound knowledgeable about the following items put together by the National Science Foundation, particularly if the words Big Data, Big Questions, Analytics or Data Science appear in the job ad (and perhaps stay away from these terms if they don’t):
- Reproducibility, replicability, and uncertainty quantification
- Data confidentiality, privacy, and security issues as they relate to Big Data
- Generating hypotheses, explanations, and models from data
- Prioritizing, testing, scoring, and validating hypotheses
- Interactive data visualization techniques
- Scalable machine learning, statistical inference, and data mining
- Eliciting causal relations from observations and experiments
- Addressing foundational mathematical and statistical principles at the core of the new BIGDATA technologies
If that sounds like a lot, compare job ads (like those given in the Appendix). Statistics is center stage in all job ads that involve data, big or small. But it is clear that employers that have web sites, be it in Business, Government, or Science, want employees to transform that data into knowledge that advances their respective goals.
ASA, Data science undergraduate degrees http://magazine.amstat.org/blog/2015/07/01/new-undergraduate-data-science-programs/
Berkeley, Professional Master in Information and Data Science at Berkeley https://datascience.berkeley.edu
Columbia University Master of Data Science http://datascience.columbia.edu/master-of-science-in-data-science
Coursera online offerings on data science https://www.coursera.org/specializations/jhudatascience
Harlan D. Harris, Sean Patrick Murphy, and Marck Vaisman Analyzing the Analyzers, O’Reilly 2013 http://www.oreilly.com/data/free/files/analyzing-the-analyzers.pdf
Harvard online course on Big data analysis http://www.online-learning.harvard.edu/course/big-data-analytics
Harvard Data science web page http://statistics.fas.harvard.edu/datascience
Lazar, Nicole. Now Trending on Twitter. Chance, Vol 28.2, 2015.
Steve Lohr, New York Times, August 6, 2009, For Today’s Graduate, Just One Word: Statistics.
Steve Lohr, New York Times, August 17, 2014, For Big Data Scientists, ‘Janitor Work’ is Key Hurdle to Insights.
Methodist University Master of Data Science http://www.datascience@smu
New York University, http://cds.nyu.edu/academics/ms-in-data-science/
Other site listing schools with degrees in data science, analytics, big data. http://101.datascience.community/2012/04/09/colleges-with-data-science-degrees/
Ranking of Master Degrees in Data Science http://www.mastersindatascience.org/schools/23-great-schools-with-masters-programs-in-data-science/
Ranking of Master in Business Analytics degrees. http://www.mba.com/us/plan-for-business-school/decide-to-go/specialized-masters-programs/big-data-programs.aspx
Stanford University’s data science track in the Master of Science Program. https://statistics.stanford.edu/academics/ms-statistics-data-science
Michael Vogelius, Nandini Kannan, and Xiaoming Huo, NSF Division of Mathematical Sciences
NSF Big Data Funding Opportunity for the Statistics Community
Xinping Zhang, Byran J. Smucker, Jay Woffington. Statistics and Show Business: Shakespeare Meets Predictive Analytics. Chance Vol