No one disagrees that the transition into the information age has profoundly changed the world. The ease with which technology drives transformation, often through complex and opaque systems (e.g. Google’s search algorithm), would seem almost magical to past generations. My work in advertising measurement has contributed, in a very, very small way, to this change. However, in pursuing methods to evaluate the incremental impact of advertising, I recognize that I’ve played a part in normalizing alternative measurement methods, some of which lack transparency.
A quote I frequently reference when discussing measurement with stakeholders is by renowned statistician George Box: “All models are wrong, but some are useful.” This simple yet profound truth encapsulates the reality of our algorithm-driven world. Google’s search algorithm, though imperfect, is remarkably effective. Advanced language models such as ChatGPT push this idea even further.
Yet, while models may be improving some outcomes, the creeping acceptance of alternative measurement methods can normalize subpar solutions. One such area is incrementality measurement, the use of experiments to gauge the incremental benefit of advertising campaigns, which is growing in prevalence across the media industry.
If you’ve followed methodological discussions at conferences and across LinkedIn, you’ve likely encountered the debate around experimental versus quasi-experimental designs in incrementality research. Over my career, I’ve noticed frequent confusion between these methodologies, which opens the door for some to exploit the ambiguity around quasi-experimental designs. Having contributed to normalizing this approach over the last twenty-five years, I believe revisiting the topic can help reduce that ambiguity and provide a framework for assessing reliability.
The Gold Standard: Experimental Design
Experimental designs, also known as Randomized Controlled Trials (RCTs), trace their lineage to controlled experiments such as James Lind’s scurvy trial in the 1700s, and they remain the “gold standard” for research because randomization controls for external influences. Consider the common pharmaceutical trial: 100 headache sufferers are randomly split into two groups, which balances demographics and conditions across the groups. One group receives medication, the other a placebo, isolating the medication as the sole variable. After two weeks, surveying both groups reveals the medication’s effectiveness based on the reduction in headaches.
When we tested online advertising in the late 1990s, the RCT was the primary method. We randomly assigned visitors to a site such as Yahoo.com into two groups: one exposed to the test ad, the other to a public service announcement. Surveying both groups afterward allowed us to compare results and pinpoint the ad’s impact. When introduced, the RCT was a game changer compared with traditional pre-post testing. For the first time, we could isolate the effectiveness of one part of a campaign even while other media were influencing users (e.g., measuring the online effect while a TV campaign was running).
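To show why randomization makes the comparison so clean, here is a minimal sketch in Python of how exposed and PSA groups might be compared with a simple two-proportion z-test; the group sizes, response counts, and the brand metric itself are invented for illustration, not figures from any real study.

```python
import math

# Hypothetical survey results (illustrative numbers only):
# brand-metric "yes" responses among randomly assigned exposed and PSA groups.
exposed_yes, exposed_n = 230, 1000   # saw the test ad
control_yes, control_n = 180, 1000   # saw the public service announcement

p_exposed = exposed_yes / exposed_n
p_control = control_yes / control_n
lift = p_exposed - p_control

# Two-proportion z-test: because assignment was random, a difference beyond
# sampling noise can be attributed to the ad itself.
p_pooled = (exposed_yes + control_yes) / (exposed_n + control_n)
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / exposed_n + 1 / control_n))
z = lift / se
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided

print(f"Absolute lift: {lift:.1%}, z = {z:.2f}, p = {p_value:.4f}")
```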
The Compromise: Quasi-experimental Design
The early reliance on RCTs to measure campaigns didn’t last. Before the new millennium dawned, measurement had already shifted from RCTs to quasi-experimental designs.
Why Use Quasi-experiments?
While the RCT remained the ideal, its complexity and cost often deterred advertisers. Coordinating randomization across diverse ad servers, media properties, and walled-garden platforms such as Google and Facebook was a logistical nightmare. Thus, quasi-experimental designs, used since the 1960s in academic contexts, became widely adopted for their practicality. However, with no standardized terminology to describe each variation, these designs are quite diverse, as is the quality of the resulting data.
Characteristics of Quasi-experiments
The defining characteristic of a quasi-experiment is the absence of the most important feature of an RCT: random group assignment. Instead, researchers strive to assemble comparable control groups alongside test groups. Removing the randomization component simplifies implementation, particularly within a fragmented media ecosystem. The industry that figured out look-alike modeling took those innovations straight into the world of measurement. However, as George Box pointed out, these models are essentially wrong, and unless the methodologist does their best to control for biases, a quasi-experiment can create substantial data issues.
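To make the look-alike idea concrete, here is a minimal sketch of greedy nearest-neighbour matching over invented user records; the field names, features, and distance function are illustrative assumptions, and production systems typically rely on propensity or look-alike models over far more signals.

```python
# For each exposed user, pick the most similar unexposed user as a
# synthetic "control". Records and features are hypothetical.
exposed = [
    {"id": "e1", "age": 34, "sessions_per_week": 12},
    {"id": "e2", "age": 51, "sessions_per_week": 4},
]
unexposed_pool = [
    {"id": "u1", "age": 29, "sessions_per_week": 11},
    {"id": "u2", "age": 49, "sessions_per_week": 5},
    {"id": "u3", "age": 62, "sessions_per_week": 2},
]

def distance(a, b):
    # Crude similarity score over the two illustrative features.
    return abs(a["age"] - b["age"]) / 10 + abs(a["sessions_per_week"] - b["sessions_per_week"])

matched_pairs = []
available = list(unexposed_pool)
for user in exposed:
    best = min(available, key=lambda c: distance(user, c))
    matched_pairs.append((user["id"], best["id"]))
    available.remove(best)  # match without replacement

print(matched_pairs)  # [('e1', 'u1'), ('e2', 'u2')]
```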
Factors to Control For
Effective quasi-experimental studies try to mimic randomized conditions by controlling for key factors:
In Target: Ensuring control participants are in the target audience yet have not been exposed to the ad.
Temporally Relevant: Capturing control participants at the same time as the exposed group to negate external effects.
Audience Profile: Ensuring a close match in demographics between test and control groups.
Confounders: Managing additional variables like product ownership, device types, or geographic location.
Researchers often address these factors through careful data processing and bias-reduction techniques, but such adjustments are fundamentally imperfect.
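One way such checks are often operationalized is the standardized mean difference (SMD) between groups on each covariate. The sketch below uses invented values and a common rule of thumb (an absolute SMD below roughly 0.1 is usually treated as balanced); it is an illustration of the diagnostic, not a prescribed procedure.

```python
import math
import statistics

def standardized_mean_difference(test_values, control_values):
    # SMD = (mean_test - mean_control) / pooled standard deviation.
    pooled_sd = math.sqrt(
        (statistics.variance(test_values) + statistics.variance(control_values)) / 2
    )
    return (statistics.mean(test_values) - statistics.mean(control_values)) / pooled_sd

# Hypothetical covariates for matched test and control groups.
test_ages = [34, 51, 28, 45, 39]
control_ages = [33, 49, 30, 47, 52]
test_purchases = [2, 0, 1, 3, 1]
control_purchases = [1, 0, 1, 2, 4]

for name, t, c in [("age", test_ages, control_ages),
                   ("prior purchases", test_purchases, control_purchases)]:
    smd = standardized_mean_difference(t, c)
    flag = "  <-- imbalance worth investigating" if abs(smd) > 0.1 else ""
    print(f"{name}: SMD = {smd:+.2f}{flag}")
```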
The Importance of Precise Language
It seems the criticism of quasi-experimental measurement often stems not from the methodology itself but from the term’s susceptibility to misuse. The term “experimental design” carries significant weight; it implies rigor and validity. Quasi-experiments borrow credibility from the parent they’re named after, even though they are a flawed attempt to replicate RCT conditions. That mimicry leads researchers to call quasi-experimental studies “experimental design,” which, while loosely accurate, is a misuse of the term and misleads data consumers into a false sense of security.
When Campbell & Stanley introduced guidelines for quasi-experimental designs, the intent was to offer practical alternatives when RCTs were unfeasible. Today, we’ve standardized this method, frequently choosing it over the RCT because it, too, is viewed as an “experimental design.” In the high-volume, complex world of advertising, quasi-experiments are a convenient, low-cost substitute that can obscure methodological weaknesses behind a veneer of credibility. If you consume experimental results without a mechanism to validate them, you have only one option: trust the data. If that describes you, you need a way to evaluate the credibility of the measurement.
Five Critical Questions
To better evaluate experimental studies, consider these essential questions:
Is this a quasi-experimental design? If yes, proceed with caution.
Can the measurement provider show test and control group allocation by media placement? If yes, check for potential biases in media exposure.
Can the measurement provider show test and control group allocation by day? If yes, confirm comparable numbers were included each day.
Can the measurement provider show demographic profiles of test and control groups? Use to verify demographic consistency.
Do test and control groups match on other significant variables (e.g., device type, geolocation)? Confirm no hidden biases exist (e.g. exposed mostly iOS, unexposed mostly Android).
Although not exhaustive, this framework offers a practical starting point for ensuring data integrity. While understanding these answers won’t eliminate potential biases, it will help enhance your confidence in quasi-experimental results.
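If a provider can export per-user records with group, day, and device fields, a few of these questions can be answered with a simple share comparison. The field names, records, and the 5-point flagging threshold below are hypothetical, included only to sketch what such a check might look like.

```python
from collections import Counter

# Hypothetical per-user export from a measurement provider, with the fields
# the checklist asks about: group assignment, day of inclusion, and device.
records = [
    {"group": "test", "day": "2024-05-01", "device": "ios"},
    {"group": "control", "day": "2024-05-01", "device": "android"},
    {"group": "test", "day": "2024-05-02", "device": "android"},
    {"group": "control", "day": "2024-05-02", "device": "ios"},
    # many more rows in practice
]

def share_by(rows, group, field):
    # Distribution of `field` within one group, expressed as shares.
    counts = Counter(r[field] for r in rows if r["group"] == group)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

for field in ("day", "device"):
    test_share = share_by(records, "test", field)
    control_share = share_by(records, "control", field)
    print(f"{field}:")
    for key in sorted(set(test_share) | set(control_share)):
        gap = test_share.get(key, 0) - control_share.get(key, 0)
        flag = "  <-- check" if abs(gap) > 0.05 else ""
        print(f"  {key}: test {test_share.get(key, 0):.0%} "
              f"vs control {control_share.get(key, 0):.0%}{flag}")
```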
Not a Bad Thing
The shift to alternative measurement approaches is unavoidable and not always a bad thing. Driven by practical needs and technological advancements, alternative measurement opens new doors. However, embracing these methods demands a critical eye toward maintaining data integrity and transparency. It is crucial to grasp the distinctions between rigorous experimental designs and practical quasi-experiments. By applying a thoughtful evaluation framework, advertisers and researchers can better navigate methodological complexities and uphold high standards in a more algorithmic and data-driven world.
For more on Experimental Designs, refer to this article.
To understand factors affecting data quality, explore this article.
View my proposal for transparent quality disclosures in FASRIM.
Rick Bruner has a strong argument for geo-based RCT measurement.