March 15, 2019

What Kind of Data Scientist Are You?

Alex Woodie

If you’ve worked with the data science community, you’ve probably interacted with data scientists and formed a definition for the increasingly popular position. But it turns out, not all data scientists are alike, and according to a recent analysis by researchers at UCLA and Microsoft, there are actually nine different types of data scientists.

Miryung Kim, an associate professor in UCLA’s Computer Science Department, last week presented a session at the Strata Data Conference that showcased her research into the data science and software development community. The research revolved around a survey of 793 professional data scientists working at Microsoft that investigated how they spend their time, what tools they use, and the challenges they face in their jobs.

Kim and her team ran the results of the survey through a clustering algorithm (naturally) and published their findings last September in a 17-page paper titled “Data Scientists in Software Teams: State of the Art and Challenges,” which can be downloaded from the IEEE Xplore Digital Library.

The first thing Kim and her colleagues discovered was that not all people practicing data science call themselves “data scientists.” Nearly 40% of the survey respondents identified as data scientists, but 24% called themselves software engineers, 18% were program managers, and 20% had some other title. All told, Kim concluded that 532 could be considered data scientists.

Experience and education levels also varied. About one-third had bachelor’s degrees, while 22% had PhDs and 41% had master’s degrees. The average experience level was 13.6 years, with an average of about 10 years spent analyzing data.

(Source: “Data Scientists in Software Teams: State of the Art and Challenges” September 2017, Kim et al.)

The clustering algorithm highlighted patterns in how data science practitioners spend their time. Based on the predominant activity of a group, Kim and her team came up with a name that defined that group.
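
As a rough illustration of that workflow, the sketch below clusters a handful of respondents by how they split their time and then names each cluster after its predominant activity. Everything in it is an assumption made for illustration: the activity names, the six respondents, and their numbers are invented, and because the article does not say which clustering algorithm the researchers used, a plain k-means with a deterministic start stands in for it.

```python
# Illustrative only: the respondents, activities, and numbers below are made up,
# and k-means stands in for whatever algorithm the study actually used.

ACTIVITIES = ["querying", "preparing", "analyzing", "building platforms"]

# Each row: share of working time one hypothetical respondent spends per activity.
RESPONDENTS = [
    [0.50, 0.30, 0.15, 0.05],
    [0.45, 0.35, 0.15, 0.05],
    [0.10, 0.15, 0.70, 0.05],
    [0.05, 0.20, 0.65, 0.10],
    [0.05, 0.10, 0.15, 0.70],
    [0.10, 0.05, 0.20, 0.65],
]


def squared_distance(a, b):
    """Squared Euclidean distance between two time-allocation vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(rows):
    """Column-wise average of a non-empty list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def farthest_point_init(points, k):
    """Deterministic start: the first point, then repeatedly the point
    farthest from every centroid chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iterations=100):
    """Assign each point to its nearest centroid, recompute centroids,
    and stop once the assignments no longer change."""
    centroids = farthest_point_init(points, k)
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = mean(members)
    return assignments, centroids


def dominant_activity(centroid):
    """Name a cluster after the activity with the largest average time share."""
    return ACTIVITIES[max(range(len(centroid)), key=lambda i: centroid[i])]


assignments, centroids = kmeans(RESPONDENTS, k=3)
cluster_names = [dominant_activity(c) for c in centroids]
labels = [cluster_names[a] for a in assignments]

for row, label in zip(RESPONDENTS, labels):
    print(row, "->", label)
```

With real survey data the number of clusters would not be known in advance, so choosing it (nine, in the study) is part of the analysis rather than an input as it is here.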

The results showed nine different kinds of data scientist:

  • Data Preparer: This type of data scientist spends an average of 25% of their time querying data, and about 20% actually preparing data for analysis. Data Preparers are more likely to work with SQL and less likely to work with machine learning algorithms.
  • Data Shaper: The Data Shaper shares many of the skills of the Data Preparer, but brings additional expertise, such as machine learning expertise and experience with tools like MATLAB and Python. They’re also more likely to have a PhD and less likely to work with SQL or structured data.
  • Data Analyzer: Data scientists who spend more than half their time analyzing data could fall into this bucket. Other traits of Data Analyzers include more experience with classical statistics, math, and data manipulations, and a predilection for using R.
  • Platform Builder: You might be a Platform Builder if you spend about half of your time building platforms and instrumenting code for the purpose of collecting data. Platform Builders are more likely to work in distributed systems, like Hadoop, and have “engineer” in their title, but not to have a PhD.
  • Data Evangelist: This type of data scientist spends a good portion of their time engaging with others. They’re more likely to work with line-of-business decision makers and those in product development than the group as a whole, and less likely to work with SQL or structured data.
  • Insight Actor: This type of data scientist spends nearly 60% of their time acting on insight, and nearly 20% disseminating insights from the data. This is a relatively small group, percentage-wise, but it was statistically significant.
  • 50% Moonlighter: Sometimes, you might be a data scientist but not even know it. Software engineers and program managers who spend half their time using data science-related skills and the other half doing something else fall into this category.
  • 20% Moonlighter: Engineers and managers who only dabble in data science (i.e. spend 20% of their time doing it) fall into this category.
  • Polymath: This is the “jack of all trades” type of data scientist who spends their time doing all sorts of data-oriented tasks, from building platforms that gather data to analyzing data and acting on it. Polymaths are more likely to have a PhD, more likely to use Python, and more likely to use Bayesian-style Monte Carlo statistics than the group as a whole.

“What is really interesting to me,” Kim said, “is while we think of data science [as] a buzzword, when we look at the…data we saw very different characteristics of different groups of data scientists who have very different kinds of work activities.”

The biggest challenges reported by data scientists may ring a bell for those who have worked in data science. The challenges were grouped into three main categories: data, analysis, and people.

Miryung Kim is an Associate Professor in UCLA’s Computer Science Department, where she heads up the Software Engineering and Analysis Laboratory.

On the data front, poor data quality was one of the most commonly reported problems. “Some respondents mentioned that there is an expectation that it is a data scientist’s job to correct data quality issues, even though they are the main consumers of data,” the report states.

Data availability, including missing values and the inability to tap legacy systems for data collection, was also cited as a major challenge. Data integration, including the merging of different streams of data into a single data set for analysis, remains a bugaboo for data scientists around the world.

Scale was the biggest problem related to analysis (which is probably why some still refer to it as “big data”). Survey respondents reported that it can sometimes take too long to collect and analyze the data, whether it’s on Hadoop or Cosmos, Microsoft’s version of the big distributed storage and processing framework.

On the personnel side of the data science equation (a factor too often overlooked in many human endeavors), the UCLA researcher identified one major impediment to data science success: communicating what insights the data science team has discovered. Staying up-to-date on changing tools and technologies is another concern.
