[Talk Summary 12] A/B Testing at Scale

Dr. Pavel Dmitriev, a Principal Data Scientist on Microsoft's Analysis and Experimentation team, gave a talk titled "A/B Testing at Scale" on Thursday, 2016/12/08. The talk covered an introduction to controlled experiments, four real experiments that Microsoft had run, and five challenges of testing at scale.

Dr. Dmitriev started the talk with a brief introduction to controlled experiments, also known as A/B tests. A/B testing is a method of comparing two versions of a webpage or app against each other to determine which one performs better. It is also used to evaluate a new feature of an application. If the feature has an effect on users, the result will show a statistically significant difference (p < 0.05); the assumption that there is no difference between the two versions is called the null hypothesis.
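To make the significance check concrete, here is a minimal sketch of one common way to compare two variants: a two-sided two-proportion z-test on conversion counts. This is not necessarily the method shown in the talk; the function name `two_proportion_z_test` and the visitor and conversion counts are hypothetical and chosen only for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.

    conv_a, conv_b: number of converting users in variants A and B.
    n_a, n_b: total number of users exposed to variants A and B.
    Returns (z, p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under the null hypothesis both variants share one conversion rate,
    # so the standard error uses the pooled rate.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical data: 10,000 users per variant, 1,000 vs. 1,100 conversions.
z, p = two_proportion_z_test(1000, 10_000, 1100, 10_000)
print(f"z = {z:.3f}, p = {p:.4f}, significant at 0.05: {p < 0.05}")
```

With large samples the normal approximation used here is reasonable; for small samples or rare events, an exact test would be a safer choice.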

Dr. Dmitriev then motivated A/B testing by describing how the product development process has evolved. In classical software development, a product is designed, developed, tested, and then released. In customer-driven development, by contrast, the process is a continuous cycle of "build", "measure", and "learn" (continuous deployment), because we are poor at assessing the value of ideas in advance. Running experiments and collecting data helps us evaluate the value of those ideas. To illustrate, he showed four real experiments that Microsoft had run and asked the attendees to guess which design, A or B, would win. He then presented statistics on the difference between the two groups to show whether the difference between the designs was significant.

Finally, Dr. Dmitriev argued that while the theory of experimentation is well established, scaling experimentation to millions of users, devices, platforms, websites, apps, social networks, etc. presents new challenges for A/B testing:
  • Challenge 1: trustworthiness
  • Challenge 2: protecting the users
  • Challenge 3: the OEC (Overall Evaluation Criterion)
  • Challenge 4: violations of classical assumptions of a controlled experiment
  • Challenge 5: analysis of results
    • NHST = Null Hypothesis Significance Testing
    • Heterogeneous treatment effect
Information Sciences Building, 3rd Floor
University of Pittsburgh