Data Science: Pre-Hypothesis Discovery

Learn Aster
Teradata Employee

In layman's terms, a hypothesis is a somewhat informed idea that has not been proven yet. 'Hypotheses' is the plural of 'hypothesis' and means 'many ideas that are not proven or tested yet.' Some folks like to think of them as 'informed guesses' about what occurred or why it occurred.

Hypothesis in Data Science

A hypothesis is the basis of ANY scientific method and is not a new invention for analytics/data science :). You start with an initial idea, and then you try to prove or disprove it. Once that is settled, you go to the next step in the discovery process and test another idea. Often the hypothesis fails, and that's OK too: at least you know one way or the other whether the idea has merit.

In analytics/data science, hypotheses are educated guesses about possible theories that are tested through statistical methods, resulting in inference. If you have ten different hypotheses, you'd check all ten and validate or invalidate each one systematically to arrive at an answer.

How does an Analyst generate Hypotheses (multiple hypotheses) to test?

Hypotheses are informed guesses, so prior knowledge of the data or the situation, including white papers or scholarly material, helps to define them. For example, if you have some data and think it follows a normal distribution (with a given mean and standard deviation), you can run a quick statistical test on the data and use the p-value to decide whether to reject that idea. You can also run correlation tests between two data sets to see if they are related. As you continue to "size up" the data using different tests, the goal is to fit the data to one of the well-established and well-understood models (mostly parametric), sans outliers, so analysis can happen.
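As a rough illustration of "sizing up" data this way, here is a minimal sketch using only the Python standard library. It is not the method of any particular tool: the normality check is a hand-rolled Jarque-Bera test (one of several possible normality tests, and only approximate for small samples), the correlation is a plain Pearson coefficient, and all of the data is synthetic.

```python
import math
import random


def jarque_bera(sample):
    """Jarque-Bera normality test: returns (statistic, p_value)."""
    n = len(sample)
    mean = sum(sample) / n
    # Central moments, dividing by n.
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    if m2 == 0:
        raise ValueError("sample has zero variance")
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    statistic = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    # Under normality the statistic is asymptotically chi-square with
    # 2 degrees of freedom, whose survival function is exp(-x / 2).
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic data, for illustration only.
rng = random.Random(7)
bell_shaped = [rng.gauss(50.0, 10.0) for _ in range(500)]
skewed = [rng.expovariate(1.0) for _ in range(500)]

for name, data in (("bell_shaped", bell_shaped), ("skewed", skewed)):
    stat, p = jarque_bera(data)
    verdict = "reject normality" if p < 0.05 else "cannot reject normality"
    print(f"{name}: JB={stat:.2f}, p={p:.4f} -> {verdict}")

# A second series built from the first plus noise, so they are related.
related = [2.0 * x + rng.gauss(0.0, 5.0) for x in bell_shaped]
print(f"correlation: {pearson_r(bell_shaped, related):.3f}")
```

A small p-value is evidence against the "this is normal" idea; a large one only means the data did not contradict it.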

What is Pre-Hypothesis Discovery?

Finding out what you don't know ..

Often in meetings when decisions have to be made with incomplete data, you hear a boss, manager, project lead or CEO ask, 'Are we asking all the right questions?' This is the pre-hypothesis state. Knowing what to ask of the data, especially big data, is often the hardest problem. This is where most businesses get stuck, because no sponsor wants the burden of asking the wrong questions and spending resources on them, only to find out that time and money were wasted with no ROI.

Let's pin down the definition of pre-hypothesis. It's like saying we are going to bet on going after ten different theories out of a possible 100 or 1000 theories. Deciding which ten is the result of pre-hypothesis discovery. Sometimes even formulating the hypotheses can be considered a pre-hypothesis step. Once the ten are chosen, we should be able to validate them systematically with statistical, machine learning and data profiling tools.

Where you'd run into Pre-Hypothesis Discovery

Finding the haystack before you figure out ways to search for needles

The examples below are not really business problems, but they make the point.

  • There are 1000 galaxies with a billion planets each. We want to find 100+ planets with water. Which ten galaxies are worth exploring, and which star systems within each should we look at? We have some prior information based on looking at our solar system. It is impossible to look at every galaxy and every planet.
  • A lot of us followed the news of the tragic Malaysian airliner crash into the Indian Ocean a few years ago. Investigators spent time deciding which area of the ocean to search. Many bets were made and resources allocated to find the aircraft and any survivors, as it's impossible to search every square inch of the vast ocean with limited resources and difficult weather.

Example business problems that beg for insight and hypothesis testing (definitely a scale lower than the problems above):

  • There is 100 TB of call center data and 1 TB of transaction data for customers. We want to find some leading causes of customer churn before bringing in quantitative techniques like machine learning.
  • We have 50 TB of mobile logs from iPhone and Android apps. We want to find the leading cause of abandonment before checkout happens. We know iPhone users check out more than Android users. Is it because there are more registered iPhone users, or is it because of an app problem, or is it something else?
  • You have XYZ TB of some form of data. You want to know at least 5 to 10 starting points of discovery before you go and test each one of them.

Business scientists need the right tools to reveal insights that suggest possible hypotheses we can test. These tools should eventually also help test each hypothesis and fail fast to get to a good model.
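To make "test and fail fast" concrete for the mobile checkout example, here is a minimal sketch of a two-proportion z-test in plain Python. It asks whether the checkout rate per registered user really differs between platforms, instead of comparing raw checkout counts. The counts below are hypothetical placeholders, not real data, and the function name is only illustrative.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value), using the pooled-proportion standard error.
    """
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (rate_a - rate_b) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: checkouts and registered users per platform.
iphone_checkouts, iphone_users = 4_200, 60_000
android_checkouts, android_users = 2_500, 40_000

z, p = two_proportion_z_test(
    iphone_checkouts, iphone_users, android_checkouts, android_users
)
print(f"iPhone rate:  {iphone_checkouts / iphone_users:.3%}")
print(f"Android rate: {android_checkouts / android_users:.3%}")
print(f"z = {z:.2f}, p = {p:.4g}")
```

If the rates per registered user turn out to be similar, the "more iPhone users" theory explains the gap and the "app problem" theory can be parked; if they differ clearly, the app itself becomes the next place to dig.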

What about BI Reporting tools?

Reporting tools are great for finding patterns in aggregations over time, but they lack clarity in explaining the causal effects of events that could eventually lead to events of interest. They can certainly be used to get basic insights that weed out a lot of theories.

Bayesian, Pathing and Associative Mining Visualizations:

These are excellent candidates for reducing the 'Hypothesis Space' to get to the root cause, or to find something inexplicable, really quickly.

Bayesian approaches combine the prior probability of events with the likelihood of the observed evidence to estimate how probable each explanation is. They can also update the model as new evidence comes in, which keeps go-forward plans current.
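The mechanics of that update are just Bayes' rule applied across competing hypotheses. Here is a minimal sketch; the hypothesis names, priors and likelihoods are made-up numbers chosen for illustration, not estimates from any real data set.

```python
def bayes_update(priors, likelihoods):
    """Apply Bayes' rule across a set of competing hypotheses.

    priors:      {hypothesis: P(hypothesis)}
    likelihoods: {hypothesis: P(evidence | hypothesis)}
    Returns      {hypothesis: P(hypothesis | evidence)}
    """
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(unnormalized.values())
    if evidence == 0:
        raise ValueError("evidence has zero probability under every hypothesis")
    return {h: value / evidence for h, value in unnormalized.items()}


# Hypothetical starting beliefs about why checkouts are abandoned.
beliefs = {"app_bug": 0.2, "pricing": 0.5, "slow_network": 0.3}

# Hypothetical evidence: abandonment spikes right after an app release.
# Each value is how likely that observation is under each hypothesis.
spike_after_release = {"app_bug": 0.9, "pricing": 0.1, "slow_network": 0.2}

beliefs = bayes_update(beliefs, spike_after_release)
for hypothesis, probability in sorted(beliefs.items(), key=lambda kv: -kv[1]):
    print(f"{hypothesis}: {probability:.2f}")
```

Calling `bayes_update` again with the next piece of evidence, using the returned beliefs as the new priors, is the "update as new evidence comes in" step.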

For time series or event sequence data, pathing approaches "sequence" the events over time and find dominant paths that become candidates for further exploration. Here's an example of paths that lead to a checkout page and what follows it. One can bury half of the theories with this pre-hypothesis discovery :). The visualization doesn't really say why, but it certainly shows what questions to go after next.

Learn about creating Sankey visualizations like the one above from John Thuma's blog post.
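The counting behind a pathing view can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Aster or any specific pathing function works: it assumes each session is already a time-ordered list of event names, and the event names and sessions are invented.

```python
from collections import Counter


def dominant_paths(sessions, target, depth=3, top=5):
    """Count the event sequences that lead into `target`.

    sessions: iterable of time-ordered event lists, one per session.
    depth:    how many events before the target to include in a path.
    Returns the `top` most common (path, count) pairs.
    """
    counts = Counter()
    for events in sessions:
        for i, event in enumerate(events):
            if event == target:
                start = max(0, i - depth)
                counts[tuple(events[start:i + 1])] += 1
    return counts.most_common(top)


# Invented click-stream sessions, for illustration only.
sessions = [
    ["home", "search", "product", "cart", "checkout"],
    ["home", "product", "cart", "checkout"],
    ["home", "search", "product", "cart", "checkout"],
    ["home", "search", "product"],
    ["home", "promo", "cart", "checkout"],
]

for path, count in dominant_paths(sessions, target="checkout", depth=3):
    print(f"{count} x {' -> '.join(path)}")
```

The path counts are what a Sankey diagram draws as band widths; the thickest bands are the first candidates to explore.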

For market-basket-type data, we can use associative mining measures like Support, Lift, Confidence and Conviction as a starting point for causality discovery. Here's a Pre-Hypothesis Discovery visualization from exploring Reuters News data about agricultural topics. The visualization shows the affinity of phrases in documents.

The mini clusters are the haystacks where we want to focus, drill in and find out why they exist. We can then come up with interesting theories on what the documents are talking about, and they are easier to verify because we've subsetted the problem. Compare that to manually reading every single document in a dump of 10K news articles.
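The four measures mentioned above have simple textbook definitions, sketched here in plain Python for a single rule "antecedent -> consequent". The "documents" are tiny invented phrase sets standing in for news articles; they are not taken from the Reuters data.

```python
import math


def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, lift and conviction for antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    count_a = sum(1 for t in transactions if antecedent <= set(t))
    count_c = sum(1 for t in transactions if consequent <= set(t))
    count_both = sum(
        1 for t in transactions if (antecedent | consequent) <= set(t)
    )
    if count_a == 0 or count_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = count_both / n               # P(A and C)
    confidence = count_both / count_a      # P(C | A)
    lift = confidence / (count_c / n)      # P(C | A) / P(C)
    # Conviction is infinite when the rule never fails.
    if confidence < 1:
        conviction = (1 - count_c / n) / (1 - confidence)
    else:
        conviction = math.inf
    return {
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "conviction": conviction,
    }


# Invented phrase sets, one per document, for illustration only.
documents = [
    {"wheat", "export", "tonnes"},
    {"wheat", "export"},
    {"wheat", "harvest"},
    {"export", "sugar"},
    {"coffee", "quota"},
]

for name, value in rule_metrics(documents, {"wheat"}, {"export"}).items():
    print(f"{name}: {value:.3f}")
```

A lift above 1 means the two phrases appear together more often than chance would suggest, which is the kind of affinity the mini clusters are showing. It is a hint about where to look, not proof of a cause.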

Hope you enjoyed the blog post. To learn more about creating pre-hypothesis insights from data, or if your question is "Help me figure out what problems I could be solving, given 100 TBs of some data, and help me test those ideas", check out the Aster Community Website ...