Research-Driven Startups

Measuring Measures - July 2, 2010
Bradford Cross


The web boom has taken the valley from its roots as a research haven to a consumer media app haven.

Companies like Facebook and Twitter build simple apps, get traction, and then bring in the researchers.

Nevertheless, I think we may be at the dawn of a data and research renaissance.

We are starting to see more research-driven data startups.

We are seeing more startups like Facebook and Twitter that don't start out research-driven or data focused, but then need to use data and research to monetize through personalization, targeting offers and ads, product recommendations, premium products, and other forms of intelligence.

Supposing this hypothesis is correct, we need to understand how to do research-driven data startups.

Solving problems with products, built on research, driven by processing data

The value proposition of the kinds of research-driven data startups that interest me is a straightforward chain.

We take data, process it to extract information, do research on it to extract intelligence from the information, and use the intelligence to create a product that solves a problem.

problem <- product <- intelligence <- research <- information <- processing <- data

In some cases we only come up with information and that is good enough. We're still working on the intelligence part.


Portrait of Lee De Forest (1873-1961), Electrical Engineer

We need to draw skills from three key groups.

  1. researchers - machine learners, statisticians, mathematicians, computer scientists
  2. systems hackers - computer scientists and engineers with a focus on data storage, messaging and queuing, processing, and in many cases also distributed systems
  3. frontend builders - designers, interaction designers, javascript hackers, user experience

The researchers and frontend builders need to have a strong product focus, and they all need to have a strong data focus.

Productize to hide complexity

Having a data moat may not be enough.

If you are B2B, you may get away with shoveling data to customers through an API.  If you are B2C, consumers don't want to consume data - they want their problem solved.

Even for most B2B cases, it is not the raw data but some productization around it that customers want.  They want actionable information, which is not typically in the form of probability distributions.

People don't do well with probabilities.  See Kahneman & Tversky, and their work on Prospect Theory.

For B2C, you need to have a lovely product.  You need to take information from the data, and hopefully take intelligence from the information.

The big wins we can offer come from encapsulating monstrous amounts of data and complexity behind a simple interface.

What is your homepage? Here's mine...

Find a "good enough" model

Research-driven data startups tend to be heavily resource constrained.  We need to find problems that we can solve with a "good enough" simple model that allows us to get customers, raise some money from investors, and all those kinds of things.

The thing about research-driven startups is that you need to be able to get going without arriving at the optimal solution yet.  We use the results from the good enough solution as leverage to do more advanced research and work toward better solutions over time.

This is an important aspect to consider early on.  You will be a hero if you beat the S&P 500 by only a small marginal return for the same risk.  For other problems, you can end up in a situation where you need extraordinarily good results for your service to be interesting or useful.

If false positives or negatives are intolerable, you might be looking at a problem that is not ideally suited to a research-driven startup.  If you are working with a problem where marginally better results are valuable, that might be a good problem to build a startup around.

You can always do better over time - starting with perfection is not what you want to bet a company on.

Start with one data source

If you integrate many different data sources into a single view to create your feature vectors, then you may want to consider starting with a model based on a single data source, and folding the other data sources in one at a time.

Many problems share the common pattern of one primary dense data source and several sparser data sources that you fold into the main source.

If you integrate too much data at once, you might find that you are overwhelmed by the complexity of all the data preprocessing and transformation, and it hurts your ability to focus on research.

It may also hinder your ability to extract the maximum information from each data source, because all the data munging makes it more difficult to focus on the features you can extract from each source individually.
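
Here is a minimal sketch of what folding in one source at a time might look like.  The function name `fold_in`, the user keys, and the feature names are all hypothetical, made up for illustration.  It assumes each source is a dict keyed by the same entity id, and that the primary source decides which rows exist.

```python
from typing import Dict

Record = Dict[str, float]


def fold_in(primary: Dict[str, Record],
            extra: Dict[str, Record],
            prefix: str) -> Dict[str, Record]:
    """Fold one sparser source into the primary source, keyed by entity id.

    Every row of `primary` is kept. Features from `extra` are added only
    where `extra` has that key, and are prefixed so that you can still tell
    which source each feature came from.
    """
    merged: Dict[str, Record] = {}
    for key, features in primary.items():
        row = dict(features)  # copy so the primary source is not mutated
        for name, value in extra.get(key, {}).items():
            row[f"{prefix}.{name}"] = value
        merged[key] = row
    return merged


# Hypothetical data: one dense primary source, one sparse secondary source.
clicks = {"u1": {"clicks": 12.0}, "u2": {"clicks": 3.0}, "u3": {"clicks": 7.0}}
purchases = {"u1": {"total": 40.0}}

# Start with a model on `clicks` alone, then fold in `purchases` later.
features = fold_in(clicks, purchases, prefix="purchases")
```

Because each source gets its own call and its own prefix, you can evaluate the model before and after each fold and see what that one source actually bought you.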

Lessons from software development

Release early, release often, and measure.  Word up - you can do all this with research too.

The notion that research should have nebulous goals in the distant future is bullshit.  You can do it incrementally - it works better that way.

The pace may be different, the timelines may be different, and it may be a lot harder, but you can still do it incrementally.

Research is what led me to agile and TDD.  I've been running all my research this way since 2003.  TDD is science - state your hypothesis, figure out how to test it, and then go test it.

Define your metrics and how you will measure.  What is good enough?  When do you hit the point of diminishing returns?
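
One rough way to make those questions concrete is to write the thresholds down as code before you start iterating.  This is only a sketch under my own assumptions: the function names, the scores, and the cutoffs are hypothetical, and it assumes a single metric where higher is better.

```python
from typing import Optional, Sequence


def is_good_enough(score: float, threshold: float) -> bool:
    """True once the metric reaches the bar you committed to up front."""
    return score >= threshold


def first_diminishing_iteration(scores: Sequence[float],
                                min_gain: float) -> Optional[int]:
    """Index of the first iteration whose gain over the previous one
    falls below `min_gain`, or None if every iteration gained enough."""
    for i in range(1, len(scores)):
        if scores[i] - scores[i - 1] < min_gain:
            return i
    return None


# Hypothetical metric values from four research iterations.
scores = [0.60, 0.70, 0.74, 0.745]

stalled_at = first_diminishing_iteration(scores, min_gain=0.01)
shippable = is_good_enough(scores[-1], threshold=0.70)
```

The numbers are not the point.  The point is that "good enough" and "diminishing returns" become hypotheses you stated in advance and can test, rather than a feeling you have after the fact.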

Don't build stuff that isn't done until after you're dead.  Find a way to iterate.

Giza. Pyramid of Khafre and Sphinx

Hypothesis Testing

Remember that, in startups, everything is a hypothesis and your job is to test the hypotheses.

How much information can you extract from your data sources?  Maybe you have a lot of data but it is noisy and not so valuable.

How sparse is the data?  If it is highly informative but hardly ever there, that may not help a lot.
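
A quick first test of the sparsity hypothesis is simply to count how often a field is there at all.  The sketch below is illustrative only: `fill_rate`, the field names, and the rows are hypothetical, and it treats a missing key and a `None` value as the same thing.

```python
from typing import Dict, Optional, Sequence


def fill_rate(rows: Sequence[Dict[str, Optional[float]]], field: str) -> float:
    """Fraction of rows in which `field` is present and not None."""
    if not rows:
        return 0.0
    present = sum(1 for row in rows if row.get(field) is not None)
    return present / len(rows)


# Hypothetical rows: `clicks` is dense, `income` is hardly ever there.
rows = [
    {"clicks": 5.0, "income": None},
    {"clicks": 2.0},
    {"clicks": 9.0, "income": 72000.0},
    {"clicks": 1.0},
]

clicks_rate = fill_rate(rows, "clicks")
income_rate = fill_rate(rows, "income")
```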

Can you find a model that is simple enough to build on a startup time frame and valuable enough to get traction with users?

Can you leverage revenue from an early "good enough" model to grow the business and get into deeper research to build a more ideal model?

Can your model be productized or wrapped in a service that people really care about?

Go forth and create solutions to our problems

Recall that research-driven data startups take data, process it to extract information, do research on it to extract intelligence from the information, and use the intelligence to create a product that solves a problem.

Moreover, recall that in some cases we only come up with information and that is good enough.  We have a lot of problems.  Sometimes the intelligence part isn't quite necessary yet, and we'd rather have something that helps us out now and gives you a chance to work on the intelligence over time.

Solve problems for people.  Invent.  Do some research.