September 9, 2019

Are We Asking Too Much from Citizen Data Scientists?

Alex Woodie

Anybody who has tried hiring a data scientist can attest to the fact that we’re in the midst of a skills crunch of epic proportions. Cutthroat competition and sky-high salaries are just two signs of the considerable lack of available data science talent. Some folks are trying to fill that gap by putting automated machine learning (AutoML) tools in front of citizen data scientists, but others warn that it could backfire.

Gartner is credited with coining the title “citizen data scientist” back in 2016 to refer to data professionals who use advanced software like AutoML packages to develop predictive analytic applications. While citizen data scientists’ primary job function typically lies outside statistics and analytics, their knowledge of the business and access to new tools turn them into “power users” who can tackle simple and moderately sophisticated analytical tasks that would typically have required the services of a full-fledged data scientist.

Thanks to unsatisfied demand and better AutoML software, the conditions are ripe for a “perfect storm” for the creation of citizen data scientists, Gartner analyst Carlie Idoine wrote in a May 2018 blog post.

“Organizations are increasingly prioritizing the move into more advanced predictive and prescriptive analytics,” Idoine wrote. “The expert skills of traditional data scientists to address these challenges are often expensive and difficult to come by. Citizen data scientists can be an effective way to mitigate this current skills gap.”

Google Cloud AutoML can be used to identify images

Nathan Korda, the director of research at data science platform provider Mind Foundry, says that citizen data scientists are enabling organizations to take fuller advantage of their ever-growing collections of data. In the May 2019 Datanami article “The Rise of the Citizen Data Scientist: How Humanized Machine Learning Is Augmenting Human Intelligence,” Korda defines citizen data scientists as:

“Employees [who are] not operating in dedicated data science or analytics roles, who can use a humanized machine learning platform to explore their data and easily deploy models to unlock the value it holds,” Korda wrote. “Thanks to user-centric platforms, current employees can enjoy access to machine learning technology without the need for specialist training.”

Organizations seeking data science capabilities have turned to a new wave of capable AutoML tools to jumpstart their initiatives. Forrester recently ranked DataRobot and H2O.ai as the two leading AutoML providers, with other firms like dotData providing solid functionality in a fast-growing sector.

Thanks to better software and a continued shortage of data scientists, Gartner estimates that by 2020, 40% of data science tasks will be automated through the use of AutoML tools and data science platforms. The Connecticut firm has also stated that the ranks of citizen data scientists are growing five times faster than those of full-fledged data scientists.

Clearly, there is momentum behind the citizen data science trend. But not everybody is hopping on this bandwagon. One person who is expressing caution about it is Nick Elprin, the co-founder and CEO of Domino Data Lab, a San Francisco-based provider of a data science platform.

According to Elprin, there are some tasks that citizen data scientists most certainly will excel at, but there are other tasks that will be too much to handle for those without specialist data science training.

Full-fledged data scientists are necessary to solve the toughest problems (Who Is Danny/Shutterstock)

“There’s going to be a place for them. They’re useful for a set of things,” Elprin says. “But for any problem that’s going to be really competitively differentiating for a business or require deep domain expertise or inventing something new, I think that’s going to be hard for citizens to attack that problem.”

Citizen data scientists will be able to help companies implement predictive analytics across a range of “fairly commoditized” use cases, he says. For example, there are canned techniques for dealing with customer churn use cases, particularly when standard tabular data is readily available. A citizen data scientist should have no problem implementing a machine learning-based approach here.

But citizen data scientists will quickly feel out of their depth with more advanced use cases that deal with more challenging data sets, Elprin argues.

“If you’re a financial services company trying to develop new trading strategies or an insurance company trying to better price risk for customers applying for insurance, or if you’re trying to use image recognition to classify buildings for insurance policies for underwriting – I don’t see how that’s going to be a one-button press,” Elprin tells Datanami. “It’s going to require deeper domain knowledge.”

Nick Elprin, co-founder and CEO of Domino Data Lab

Similarly, companies in the pharmaceutical and life sciences industries that are exploring new approaches for chemical engineering for drug compound design or using bio-statistics to explore new cancer treatments or gene therapies will probably not be successful with citizen data scientists directing the research. “I don’t think companies can rely on citizen data scientists to create competitively differentiating models,” he says.

Elprin says he has talked to Domino Data Lab customers who also use AutoML tools, which can automate many aspects of data science, from feature engineering to generation of the predictive models on behalf of the customer. When full-fledged data scientists try to use AutoML tools, they often feel too constrained, he says.

The Domino Data Lab software is there to assist the data scientist by automating certain aspects of the job, such as accessing data sources, spinning up compute environments, ensuring the correct drivers are in place, and tracking models over time. “We automate the DevOps part, not the statistical reasoning part,” he says.

Elprin says companies may inadvertently increase their financial exposure when they rely on non-experts to develop predictive applications that will be running in regulated industries. “There’s going to be another shoe to drop,” Elprin warns. “There’s a risk of people building models where they don’t have deep understanding of the statistical fundamental for models that have risk associated with them.”

There is a range of data science challenges in the world, and organizations will need to pick the right person to tackle each one. Citizen data scientists can handle the easier use cases, but full-fledged data scientists will be required for the tougher challenges.

The world needs lifeguards, but the world also needs doctors, Elprin says. “It’s a world where both co-exist,” he says. “You don’t go to the lifeguard if you need surgery done.”

This article originally appeared on Datanami.