I have an ongoing problem in my household. Specifically, my middle child is holding a massive grudge against our Amazon Echo devices. I have a lot of these devices. They act as intercoms, shopping lists, and music hubs, and they answer random questions. However, the only people in our household of five with Amazon accounts linked to the devices are myself, my wife, and my oldest son. The consequence is that when Amazon started using household members’ names in responses (e.g., “have a good day, Marc”), things fell apart. Since my middle son doesn’t have an account, the Amazon Echo devices perpetually call him by his older brother’s name (they’re 15 months apart in age). I won’t get into the psychology of birth order in siblings, but suffice it to say, he’s not happy about this.
This problem with how Alexa refers to my son is a perfect example of how data issues can cause unintended consequences. It’s a problem that affects many businesses as well. Spend any time in a business seminar and you’ll quickly learn that the path to success is grounded in your understanding of the consumer. This is true of sales, innovation, product development, you name it. It’s a modern-day version of “the customer is always right.”
Therefore, if learning about the consumer is the key to success, then presumably one of the critical courses in every businessperson’s career is one designed to teach people how to learn: how to design experiments and, ultimately, how to draw conclusions from data. The study of statistics helps people understand how to measure and quantify populations. However, a grounding framework for evaluating the validity of an experiment often gets left out of the statistics curriculum.
Luckily for us, back in 1963 a book on experimental design was published that included such a framework: Experimental and Quasi-Experimental Designs for Research, by Donald Campbell and Julian Stanley. The book is certainly quite academic, but I’d highly recommend giving it a read as a mini masterclass in experimental design. In the meantime, I’ll summarize Campbell and Stanley’s method for evaluating the quality of experiments. First, let’s consider two big concepts laid out in the book: internal and external validity. I’ll also provide a couple of anecdotes for each concept to make it easier to interpret.
Let’s start with the easy one: external validity. This is the issue that often lies behind the idea of lying with statistics, because external validity refers to the representativeness of data. Let’s make up a super simple example: 75% of survey respondents believe German-made cars are more reliable. It’s very easy to imagine this stat highlighted in ad campaigns, on social media, or passed along via word of mouth. But there is a fundamental question for understanding the veracity of this stat: who was surveyed? Perhaps, in this fictional example, the survey ran amongst 100 BMW drivers. Someone who’s spent thousands of dollars on a German car is, after all, more likely to believe that German-made cars are more reliable. Understanding external validity means asking: does the data I have measure what I think it’s measuring?
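To make the BMW-driver scenario concrete, here’s a minimal Python sketch that runs the same survey question against two pools: a roughly representative set of drivers and 100 BMW owners. All the numbers (the 20% German-car ownership rate, the agreement rates) are made-up assumptions for illustration; the point is simply that the headline statistic is a property of who you sampled, not just of the question you asked.

```python
import random

random.seed(42)

# Assumed agreement rates, chosen only to illustrate the sampling effect.
def agrees_german_cars_more_reliable(owns_german_car: bool) -> bool:
    p_agree = 0.80 if owns_german_car else 0.40
    return random.random() < p_agree

def run_survey(sample) -> float:
    """Share of a sample agreeing that German-made cars are more reliable."""
    return sum(agrees_german_cars_more_reliable(owns) for owns in sample) / len(sample)

# Representative pool: assume ~20% of drivers own a German-made car.
representative_pool = [random.random() < 0.20 for _ in range(10_000)]

# Biased pool: 100 respondents recruited from BMW drivers only.
bmw_drivers = [True] * 100

print(f"Representative sample: {run_survey(representative_pool):.0%} agree")
print(f"BMW drivers only:      {run_survey(bmw_drivers):.0%} agree")
```

Both numbers are “real” survey results; only one of them says anything about drivers in general.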
Campbell and Stanley illustrate what they believe are the four principal sources of external validity bias, which are all really forms of interaction effects. I’ll paraphrase them below, as they’re all relatively simple:
Reactive Effects of Testing (pretest bias)
Sometimes you need to prime or pretest a population prior to testing, and individuals who have been exposed or primed can answer questions differently. For example, most product studies rely on category questions as a means of “warming up” respondents to a concept, but this process can create a pretest bias. Assume I recruit you into my research project about a new car; as part of preparing you for the study, I ask you a lot of questions about your vehicle ownership history, what kinds of cars you like, and other things automotive related. If I then ask you to evaluate my new car, I will probably end up with answers that reflect not the views of the everyday consumer, but the views of a consumer who has diagnosed and disentangled their relationship with cars. By answering all the category-related questions, they’ve been forced to think about cars more abstractly and to generalize their feelings. Upon seeing the new car, they may fit their answers to the mental model of how they buy cars that they developed in the pretest, rather than providing an accurate representation of what they think.
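One way to see (and roughly size) a pretest effect is to randomize who gets the category questions at all: split the recruits into a pretested arm and a control arm and compare their ratings of the new car. The sketch below is a toy version of that design; the rating function is a stand-in for real data collection, and the 0.8-point lift for pretested respondents is an assumed effect size, not a measured one.

```python
import random

random.seed(0)

def collect_rating(pretested: bool) -> float:
    """Stand-in for real data collection. The +0.8 lift for pretested
    respondents is an assumed effect size, not a measured one."""
    baseline = random.gauss(6.0, 1.5)   # respondent's opinion of the new car
    return baseline + (0.8 if pretested else 0.0)

def mean(values) -> float:
    return sum(values) / len(values)

# Randomly assign 1,000 recruits so the only systematic difference between
# the arms is whether they answered the category questions first.
assignments = [True] * 500 + [False] * 500
random.shuffle(assignments)

pretest_ratings = [collect_rating(pretested=True) for arm in assignments if arm]
control_ratings = [collect_rating(pretested=False) for arm in assignments if not arm]

print(f"Mean rating, pretested arm: {mean(pretest_ratings):.2f}")
print(f"Mean rating, control arm:   {mean(control_ratings):.2f}")
print(f"Estimated pretest effect:   {mean(pretest_ratings) - mean(control_ratings):+.2f}")
```

If the two arms diverge, the pretest itself is changing the answers, which is exactly the bias Campbell and Stanley warn about.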
Interaction Effects of Selection (sampling bias)
This is the most common external validity issue that we see. It’s the example I used above relating to German-made cars: a mismatch between the audience you intended to measure and the audience you ended up measuring. This bias is often (and purposefully) ignored, and despite making fun of obvious selection biases, I think many analytical researchers are guilty of this one. It’s also the easiest place to illustrate that external validity issues exist in all data, not just survey data. We have a cognitive bias that tells us a larger dataset is always more accurate. Imagine a category analyst at a retail chain looking at frequent shopper loyalty card data. It likely represents millions of consumers and hundreds of millions of transactions. Yet, to draw a conclusion about the top 10 selling products, the analyst would need to qualify the list as being amongst loyalty card users, because the list represents loyalty card purchases, not all purchases. If you know the retail category, you’ll know that a lot of convenience purchases (beer, milk, tobacco, candy) never involve a loyalty card, yet are top sellers for stores. Just because a data set is large doesn’t mean it’s free of selection biases.

This is where AI has more recently been getting itself into trouble: AI models are increasingly being criticized for selection biases. Not too long ago, the popular photo-filter application FaceApp got itself into hot water by training its algorithm on a dataset made up of mostly white, Eurocentric faces. The result was that choosing the “Hot” filter commonly lightened the skin tone of users. This is a selection bias in action: the dataset used to train a model on how to make someone more attractive was externally invalid, as it didn’t represent all ethnic populations.
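Here’s a minimal sketch of the loyalty card trap, using a made-up transaction log in which (by assumption) convenience purchases rarely come with a card scan. The same “top sellers” query returns two different answers depending on whether you remember what the data actually covers.

```python
from collections import Counter

# Hypothetical transaction log: (product, loyalty_card_scanned).
# By assumption, convenience purchases rarely come with a card scan.
transactions = (
    [("tobacco", False)] * 500
    + [("beer", False)] * 450 + [("beer", True)] * 50
    + [("milk", False)] * 200 + [("milk", True)] * 250
    + [("cereal", True)] * 400
    + [("yogurt", True)] * 350
)

def top_sellers(rows, loyalty_only: bool, n: int = 3):
    """Rank products by unit sales, optionally restricted to loyalty card transactions."""
    counts = Counter(
        product for product, card_scanned in rows if card_scanned or not loyalty_only
    )
    return counts.most_common(n)

print("Top sellers, all transactions: ", top_sellers(transactions, loyalty_only=False))
print("Top sellers, loyalty card only:", top_sellers(transactions, loyalty_only=True))
```

In this toy log, tobacco and beer dominate the store’s real sales but all but vanish from the loyalty-only view, which is the qualification the analyst has to carry into any conclusion drawn from that dataset.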
Reactive effects of experimental arrangements (lab bias)
This one is easy to interpret: studying something in a lab means it’s not the real world. Some concepts are hard to measure directly, so experimentalists create a lab environment to enable measurement. The easiest analogy is the traditional lab-coat environment. If I wanted to test my new diet, I could ask you to follow the diet at home and check on the results later, or I could have you check into my lab for a month and closely monitor your meals, physical activity, and vitals. The “lab” test would likely outstrip the results of the at-home test. Technically, both tests are valid, but the lab test suffers from external validity issues: no reasonable person would expect an at-home dieter to see the same results. The lab environment itself introduces a bias.
Multiple-treatment interference (repetition bias)
This happens when we measure a subject multiple times and they become conditioned to the routine of the experiment. It’s common in direct measurements where participation is mandatory. For example, let’s say I ask a person to take a survey in which I show them a list of 10 brands of toothpaste and ask them to select the brands they are familiar with. After selecting the brands, the respondent then answers 5 questions about each brand they chose. Someone who picks one brand from the list will have 5 more questions to answer; someone who picks all ten brands will have 50. If we then present the individual who ended up answering 50 questions with a list of brands for another category, they’ll likely choose fewer brands, knowing it will make the survey easier to complete. This repetition bias results in the respondent answering differently based on their prior exposure to the same measurement technique.
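The arithmetic behind that conditioning is easy to sketch. In the toy model below, the follow-up burden grows linearly with every brand a respondent admits to knowing, and a respondent who just sat through a long section under-reports familiarity in the next one. The 30-question threshold and the cap of 3 brands are assumptions chosen purely for illustration.

```python
QUESTIONS_PER_BRAND = 5

def follow_up_questions(brands_selected: int) -> int:
    """Each brand a respondent claims to know triggers a block of follow-up questions."""
    return brands_selected * QUESTIONS_PER_BRAND

def reported_familiarity(true_familiarity: int, prior_burden: int) -> int:
    """Toy conditioning model: after a long earlier section, respondents under-report."""
    if prior_burden >= 30:   # "that last section took forever"
        return min(true_familiarity, 3)
    return true_familiarity

# Toothpaste section: the respondent genuinely knows all 10 brands.
toothpaste_burden = follow_up_questions(10)

# Next category: they actually know 8 brands but have learned how the survey works.
reported = reported_familiarity(true_familiarity=8, prior_burden=toothpaste_burden)
next_burden = follow_up_questions(reported)

print(f"Toothpaste section: 10 brands selected -> {toothpaste_burden} follow-up questions")
print(f"Next category: knows 8 brands, reports {reported} -> {next_burden} follow-up questions")
```

The measurement technique itself, not the respondent’s real familiarity, is now driving the answers.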
So, in summary, external validity is about knowing your experiment measures the right population, free of any effects that could change how that population responds. External validity issues are present in most, if not all, data sets, so the goal is not to discard research that suffers from them, but to use our understanding of them as a moderator in decision-making. With that, I’ve put together a list of questions you can use to assess external validity in data; a small code sketch after the list shows one way to keep them handy as a checklist.
Pretest Bias: Prior to data collection, were the users required to do something? Do I think that might change their behavior or response?
Sampling Bias: Who was left out of the selection of users in the data? If they were included, would it change the results?
Lab Bias: Was this data collected in an artificial environment? How much do I think that environment affects the results?
Repetition Bias: Is this the first time we measured the users being studied? Do I think they’re aware of the measurement and will that change how they respond?
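If you want these prompts at hand whenever you review a dataset, here’s one way to package them: a small Python checklist you can print for any named dataset. The structure and the example dataset name are mine, not Campbell and Stanley’s; the wording mirrors the list above.

```python
# External validity checklist, mirroring the four questions above.
VALIDITY_CHECKLIST = {
    "Pretest bias": [
        "Prior to data collection, were the users required to do something?",
        "Might that change their behavior or response?",
    ],
    "Sampling bias": [
        "Who was left out of the selection of users in the data?",
        "If they were included, would it change the results?",
    ],
    "Lab bias": [
        "Was this data collected in an artificial environment?",
        "How much does that environment affect the results?",
    ],
    "Repetition bias": [
        "Is this the first time we measured the users being studied?",
        "Are they aware of the measurement, and will that change how they respond?",
    ],
}

def review(dataset_name: str) -> None:
    """Print the external validity questions to answer before trusting a dataset."""
    print(f"External validity review: {dataset_name}")
    for bias, questions in VALIDITY_CHECKLIST.items():
        print(f"  {bias}:")
        for question in questions:
            print(f"    - {question}")

review("Loyalty card transactions")   # hypothetical dataset name
```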
Developing a mental muscle around validity issues is an important critical-thinking skill, especially now that we’re solidly in an information-based economy that runs on data. I advocate for more transparency about data validity: the only way data is usable is if we understand its strengths and weaknesses. Explaining to my son that the Amazon Echo suffers from sampling bias will not solve his problem, but it is something the developers at Amazon should consider in how they implement voice recognition in the future. Hopefully, they’ll address this soon, or I might explore whether Google Home devices have the same issue.
See Part 2 of this article:
A Basic Guide to Evaluating Validity of Experiments & Data (Part 2)