The Rise of Citizen Data Science

2021-07-08
Author: Shokoufeh Abrishami

 

Businesses contend with a growing volume of data every day. In line with this trend, IBM predicted in 2017 that there would be a shortage of data scientists by 2020, a profession in which 90% of practitioners are highly educated (holding a master’s degree or a PhD). At the same time, we are witnessing the rapid growth of data science and machine learning platforms (DSMLPs), which support data scientists’ tasks and automate parts of the analytics workflow (augmented analytics). The shortage of data scientists in recent years, together with the growing use of DSMLPs (e.g. Dataiku, SAS, Databricks, TIBCO, Amazon SageMaker, Azure Machine Learning), has led to the concept of citizen data science.

But what is citizen data science and who are citizen data scientists? Why should you consider this role in a data-driven organization? What points should data leaders take into account when incorporating citizen data science? And what capabilities are people required to have to become citizen data scientists? 

In this article and the next, I will address these questions, among others.

Who is a Citizen Data Scientist (CDS)?

According to Gartner research, citizen data scientists are people who create or generate models that use advanced diagnostic analytics or predictive and prescriptive capabilities, but whose primary job function is outside the field of statistics and analytics.

In other words, citizen data scientists can perform both simple and moderately sophisticated analytical tasks that otherwise require more expertise. They can play a complementary role to data scientists and are able to integrate machine learning output into business requirements by using their business experience and their awareness of business priorities. 

Why is the role of CDS important? 

1. Scarcity of data scientists

The amount of data processed in organizations has been growing exponentially, and still is, while the capability of businesses to turn that data into insights grows only linearly. Organizations need to convert this new data into new insights. That cannot happen unless they expand the number of employees who have daily access to data and who wrangle and work with it. In other words, generating insights at the pace of data growth requires scalability.

 

 

Figure 1. The exponential growth of data processing and linear growth of data insights [2]

Shortage of data scientists 

Yet for most companies, hiring exponentially more data scientists (who are not only expensive but also notoriously difficult to find, hire, and retain) is out of the question. According to QuantHub research, in 2020 there were three times as many data scientist job postings as job searches for the role.

Data scientist skills are not replicable 

The specialized capabilities of data scientists are too hard to pass on to other employees without the proper education, so the benefits that businesses can derive from such skills remain limited.

 

2. The advent of DSML and AutoML 

To better understand the relationship between AutoML and the importance of citizen data scientists, it helps to know what AutoML actually is. AutoML and augmented analytics use machine learning to automate steps of the machine learning process itself.

DSMLPs (e.g. Dataiku, SAS, Databricks, TIBCO, Amazon SageMaker, Azure Machine Learning) enable organizations to leverage AutoML technologies for automating data science processes. However, these technologies still focus mainly on algorithm selection and hyperparameter tuning, which do not cover all the day-to-day work of data scientists.
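To make "algorithm selection and hyperparameter tuning" concrete, here is a minimal sketch in plain Python. It is not how any of the platforms named above work internally. The toy data, the two candidate algorithms (a nearest-centroid rule and k-nearest neighbours), and all function names are assumptions chosen for illustration. The idea is simply to score every candidate and every hyperparameter value with cross-validation and keep the best one.

```python
import random
from statistics import mean

def make_toy_data(n=60, seed=0):
    """Synthetic 1-D data: label 0 clusters near 0.0, label 1 near 3.0."""
    rng = random.Random(seed)
    data = [(rng.gauss(0.0, 1.0), 0) for _ in range(n // 2)]
    data += [(rng.gauss(3.0, 1.0), 1) for _ in range(n // 2)]
    rng.shuffle(data)
    return data

def centroid_predict(train, x, _param=None):
    """Candidate algorithm 1: assign x to the class with the closer mean."""
    m0 = mean(v for v, y in train if y == 0)
    m1 = mean(v for v, y in train if y == 1)
    return 1 if abs(x - m1) < abs(x - m0) else 0

def knn_predict(train, x, k):
    """Candidate algorithm 2: majority vote among the k nearest points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def cv_accuracy(predict, param, data, folds=5):
    """Average accuracy over simple interleaved cross-validation folds."""
    scores = []
    for i in range(folds):
        test = data[i::folds]
        train = [p for j, p in enumerate(data) if j % folds != i]
        correct = sum(predict(train, x, param) == y for x, y in test)
        scores.append(correct / len(test))
    return mean(scores)

def auto_select(data):
    """Try every algorithm and hyperparameter value; return the best."""
    candidates = [("centroid", centroid_predict, None)]
    candidates += [("knn", knn_predict, k) for k in (1, 3, 5, 7)]
    scored = [(cv_accuracy(fn, param, data), name, param)
              for name, fn, param in candidates]
    return max(scored, key=lambda t: t[0])

score, name, param = auto_select(make_toy_data())
print(f"best: {name} (param={param}), cross-validated accuracy={score:.2f}")
```

A real AutoML tool searches a far larger space with smarter strategies than this exhaustive loop, but the loop shows which part of a data scientist's work is being automated.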

The broader aim of these technologies is to automate the whole data-to-insights pipeline (including the four stages of Identify Data, Gather Data, Transform Data, and Analyze Data), which means they empower organizations to automate machine learning tasks including:

  • data access and data engineering (data preprocessing);  
  • feature engineering; 
  • model selection and validation; 
  • deployment and operationalization. 
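The tasks listed above can be pictured as functions chained into one pipeline. The sketch below is a rough illustration only: the record fields ("visits", "sales"), the two candidate models, and every function name are hypothetical, and the data is synthetic. Each function stands in for one task, and run_pipeline strings them together.

```python
from statistics import mean

def preprocess(rows):
    """Data access and engineering: keep only complete records."""
    return [r for r in rows
            if r.get("visits") is not None and r.get("sales") is not None]

def engineer_features(rows):
    """Feature engineering: turn raw records into numeric (x, y) pairs."""
    return [(float(r["visits"]), float(r["sales"])) for r in rows]

def fit_baseline(pairs):
    """Candidate model 1: always predict the mean target value."""
    avg = mean(y for _, y in pairs)
    return lambda x: avg

def fit_line(pairs):
    """Candidate model 2: least-squares line y = a * x + b."""
    mx = mean(x for x, _ in pairs)
    my = mean(y for _, y in pairs)
    var = sum((x - mx) ** 2 for x, _ in pairs)
    a = sum((x - mx) * (y - my) for x, y in pairs) / var if var else 0.0
    return lambda x: a * x + (my - a * mx)

def select_and_validate(pairs):
    """Model selection and validation: lowest squared error on a holdout."""
    train, holdout = pairs[::2], pairs[1::2]

    def mse(model):
        return mean((model(x) - y) ** 2 for x, y in holdout)

    fitted = {"baseline": fit_baseline(train), "line": fit_line(train)}
    best = min(fitted, key=lambda name: mse(fitted[name]))
    return best, fitted[best]

def deploy(model):
    """Deployment: expose the chosen model behind a simple callable."""
    return lambda visits: round(model(float(visits)), 2)

def run_pipeline(rows):
    """Chain the stages: raw records in, deployed predictor out."""
    pairs = engineer_features(preprocess(rows))
    name, model = select_and_validate(pairs)
    return name, deploy(model)

# Synthetic records made up for illustration; the last one is incomplete.
raw = [{"visits": v, "sales": 2 * v + 1} for v in range(1, 9)]
raw.append({"visits": 9, "sales": None})
name, predict = run_pipeline(raw)
print(name, predict(10))
```

Because every stage is an ordinary function with a clear input and output, a platform can automate each one, and a citizen data scientist can inspect or override any single stage without rebuilding the others.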

Hence, with these capabilities, AutoML has the potential to change the structure of data teams, because part of the specialists’ work is carried out automatically. Citizen data science has been emerging since the rise of AutoML. The synergy of AutoML and citizen data science empowers enterprises to scale up and accelerate AI projects, because many citizen data scientists can be supported by a small number of data scientists.

 

3. Lack of business translators in data science teams 

With the role of citizen data scientists, organizations are able to bridge the gap between business analytics users and those doing highly advanced analytics, such as data scientists. Their role, as defined in this article, could be filled by engineers who have a background in math, statistics, and modeling but lack the depth of statistical skill that data scientists have. While data scientists may not be specialized in business problem solving, citizen data scientists can bring their expertise in the business, the market, and the industry to the table. They leverage their awareness of business priorities to effectively integrate DSML outputs into business processes.

Although this role has not yet gained much ground among the data specialist positions advertised on job search websites, the explanation above makes it clear that citizen data science is becoming one of the essential roles in data teams.

In my next blog post, I will explain in detail how organizations can embrace citizen data science and what skills citizen data scientists need to fulfill this role successfully. In conclusion, I would like to add that data science and citizen data science are not interchangeable, and distinguishing these two roles may lead to more scalable data teams!