For years, we’ve been told that data scientists hold the keys to unlocking the value hidden in our data, which has created a hiring frenzy for these folks. But many companies now are creating a variety of engineering positions – from data and data science engineers to machine learning engineers – in the hopes of boosting the effectiveness of their statistical wizards.
Data science plays a big role at Two Sigma, the 17-year-old New York City hedge fund with $56 billion under management. Even before “big data” became a thing, Two Sigma was embracing statistics and machine learning at scale as ways to help find good investment opportunities faster than its competitors.
Ph.D.-carrying data scientists play a big role in Two Sigma’s operations. But lately the company has been hiring data science engineers to build and maintain the distributed systems that allow data scientists to perform their work.
“What I’ve seen over the last two years is that the type of people and the profiles we’re bringing in and hiring have a strong background in applying data science,” says Two Sigma engineering manager David Palaitis. “That’s the real novel change we’ve seen in industry and we’re really keen to develop this profile more because it’s rare to find people with that talent.”
The data science engineers at Two Sigma have different skills and job requirements than the data scientists it employs. The mix of skills for those two jobs also differs from what’s expected of data engineers. Palaitis explains:
“The data scientist is really the one working with the data set, doing feature engineering, looking to build features. They will build a predictive model and then present those results and apply them,” he says. “The data science engineer at Two Sigma is the one working on building the system in which data scientists can then do their work in a way that’s hyper scalable, easy to use, and delivers a good experience to the data scientists themselves so they can be more productive.”
In the early days of the firm, Two Sigma relied on software engineers who had computer science degrees to build the systems. However, as the analytical workloads evolved over time, it’s become more important for Two Sigma to have people on staff who understand how machine learning works and can minimize model training time by tweaking algorithms or making other changes. Two Sigma refers to those folks as data science engineers.
“We recently had a data scientist who complained that it was taking four days to train his model,” Palaitis tells Datanami. “A data science engineer or a data engineer will look at that job and be able to understand how the model training is performing, and then get in there and really make changes, in this case, to the algorithm itself, to improve the speed and the convergence of that, so we can bring the time from a few days to run this job to a few hours.”
Data engineers, however, are distinct from data science engineers, at least at Two Sigma. According to Palaitis, data engineers are those who excel at manipulating, transforming, and cleaning raw data so that the data scientist can use it for building and training the machine learning models. A data science engineer, by contrast, is more focused on the distributed systems, models, and algorithms that process the data.
Tobi Knaup, the CTO and co-founder of Mesosphere, sees parallels between today’s DevOps engineers who excel at taking applications from development into production, and the data science engineers who take big data workloads like machine learning from development into production.
“DevOps is a role that combines both operations knowledge and software engineering knowledge, and you need both of those things to be effective at their job,” Knaup tells Datanami. “It’s the same thing here. Those [data science] engineers need to know about large-scale infrastructure, like cluster management and scheduling, but they also need to know machine learning and data science to do their job. That’s what’s new about this role, the combination of those two skills.”
According to the August 2018 LinkedIn jobs report, demand for data scientists is “off the charts.” “In 2015, there was a national surplus of people with data science skills,” the company writes. “But today, three years later, the picture has changed markedly: data science skills shortages are present in almost every large U.S. city.” Across the nation, there are 150,000 fewer data scientists than needed to fill open jobs.
By some estimates, the demand for data engineers and data science engineers could be even greater than demand for data scientists. Many companies consider it ideal to have at least two engineers working with every full-fledged data scientist, while some look for even more. That raises the prospect of an even bigger hole to fill in the coming years if there aren’t enough engineers to satisfy demand.
“If you don’t have that [engineering] role and just have data scientists, then you’re going to get a very fragmented landscape of tooling,” says Mesosphere’s Knaup. “You’re going to get a classic problem I see a lot in companies that don’t do this, is a data scientist might build some algorithm in a development environment, but they’re not able to run it on a cluster on a large data set. So someone else needs to take that over because the right tools don’t exist, or they hack something together that’s not reproducible and that introduces data errors and other types of errors over time. That’s why this role is really essential.”
Whereas data scientists live and work in Jupyter notebooks and are fluent in languages like Python and Python frameworks like Pandas, data science engineers are more apt to know lower-level languages like Scala and C++.
“What we look for is someone who has experience in Python but also C++ and even Scala, because a lot of this is done in Apache Spark, and to make Apache Spark perform, you really have to get down to the Scala level,” Two Sigma’s Palaitis says. “To make Pandas perform we want to get down to C++ level… But we don’t want our users to have to see any of that, so we use Python binding for everything.”
As if the emergence of data engineer and data science engineer wasn’t enough, there’s another engineering title that’s emerging. “If you look at Silicon Valley these days, there’s a new title that has appeared called machine learning engineer,” says Ali Ghodsi, the CEO and co-founder of Databricks. “It’s a portmanteau between data engineer and data scientist to say machine learning engineer,” he says. “Machine learning represents the data science part and engineer represents the engineer part.”
Companies that are looking for machine learning engineers typically want somebody who is good at the data science aspects of machine learning, but who is also an engineer skilled at building and running systems. The job title probably most closely resembles data science engineer, although some of the work that data science engineers perform doesn’t necessarily involve machine learning.
But put machine learning engineer on the list of possible titles if you’re serious about working in big data. “When I see someone has that title or when I see an organization hiring those people,” Ghodsi says, “I say, those guys are probably ahead of the game.”
Originally published on Datanami.