Tuesday, March 29, 2016

The Data Structures and Algorithms Problem

Here's a snippet of an email
In big data / data science, the curse of dimensionality keeps showing up over and over. A good place to start is the wiki article “curse of dimensionality.” The issue seems to be that a lot of these big data / data science people have not taken the time to study fundamental data structures.
There was more about Foundations of Multidimensional and Metric Data Structures by Hanan Samet being too detailed. And Stack Overflow being too high-level.  And more hand-wringing after that, too.

The email was pleading for some book or series of blog posts that would somehow educate data science folks on more fundamental issues of data structures and algorithms. Perhaps getting them to drop some dimensions when doing k-NN problems or perhaps exploit some other data structure that didn't involve 100's of columns.

I think.

I'm guessing because -- like a lot of hand-waving emails -- it didn't involve code. And yes, I'm very bigoted about the distinction between code and hand-waving.

If there is a lack of awareness of appropriate data structures, the real place to start is The Algorithm Design Manual by Steven Skiena.

I harbor my doubts that this is the real problem, however. I think that the broad spectrum of computing applications leads to a lot of specialization. I don't think that it's really prudent to try and think of generalists who can handle deep data science issues as well as algorithm design and performance issues. No one expects them to write JavaScript and tinker with CSS so that the web site which presents the results looks good.

I actually think the real problem is that some folks expect too much from their data scientists.

In fantasy land the rock stars are full stack developers who can span the entire spectrum from OS to CSS. In the real world, developers have different strengths and interests. In some cases, "full stack" means mediocre skills in a lot of areas.

Here's a more useful response: Bridging the Gap Between Data Science and DevOps. I don't think the problem is "big data / data science people have not taken the time to study fundamental data structures". I think the problem is that big data is a cooperative venture. It takes a team to solve a problem.