A/B tests project — Udacity DAND P2

Sisi (Rachel) Chen
May 28, 2020

Introduction

This is a project from the Practical Statistics section of the Udacity Data Analyst Nanodegree. A/B tests are commonly run by data analysts and data scientists, so this project was a practical exercise in how to implement an A/B test for a real business question.

Business Understanding

An e-commerce company has developed a new web page in order to try and increase the number of users who “convert,” meaning the number of users who decide to pay for the company’s product. I will work through this notebook to help the company understand if they should implement this new page, keep the old page, or perhaps run the experiment longer to make their decision.

Data Understanding

# import all packages I need
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
# read the dataset and check the top few rows
df = pd.read_csv("ab_data.csv")
df.head()

The dataset has 5 columns and 294,478 rows, but only 290,584 unique users, which means some users appear in more than one record.

# The proportion of users converted.
user_converted_per = round(df.converted.mean()*100,2)
# The number of times the new_page and treatment don't match
no_match = df[~((df['group'] == 'treatment') == (df['landing_page'] == 'new_page'))].shape[0]

In the 294,478 records, the proportion of users converted is 11.97%. Meanwhile, the number of times the new_page and treatment don’t match is 3893.

#Do any of the rows have missing values?
df.isnull().sum()

No rows have missing values, which is good. However, we still have the 3,893 rows where the landing_page and group columns don't match. We can drop those rows and save the result to a new dataframe.
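A minimal sketch of this cleanup, keeping only the rows where group and landing_page agree:

# keep only rows where group and landing_page are consistent (sketch)
df2 = df[(df['group'] == 'treatment') == (df['landing_page'] == 'new_page')].copy()
df2.shape[0]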

Updated Dataframe

The new dataframe df2 has 290,585 rows but only 290,584 unique user ids, so there is exactly one duplicate user in df2.

For the duplicated user id, the landing_page is new_page, the group is “treatment”, and converted is 0 in both rows. This is consistent, since users in the treatment group are shown the new page. Because both rows carry the same converted value, we can drop either one of them.
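A sketch of locating the duplicated user_id and dropping one of its rows:

# inspect the duplicated user_id, then keep only its first occurrence (sketch)
df2[df2.user_id.duplicated(keep=False)]
df2 = df2.drop_duplicates(subset='user_id', keep='first')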

Probability

The probability of an individual converting, regardless of the page they receive, is 11.97%, as calculated above. We would also like to know the probability of converting given that an individual was in the control group, and the probability of converting given that an individual was in the treatment group. These two conversion rates show how many users in each group decided to pay for the company's product.


control_converted_per = round((len(df2[(df2.converted == 1) & (df2.group == 'control')]) / len(df2[df2.group == 'control'])) * 100, 2)
treat_converted_per = round((len(df2[(df2.converted == 1) & (df2.group == 'treatment')]) / len(df2[df2.group == 'treatment'])) * 100, 2)

The proportion of the control group that converted is 12.04%, while the proportion of the treatment group that converted is 11.88%, so the control group's rate is slightly higher. Can we conclude that the control group performs better than the treatment group? This small difference could easily appear by chance, so we don't yet have sufficient evidence to conclude that the new treatment page leads to more conversions.

Let's check whether users were split evenly between the treatment and control groups.

new_page_per = round(len(df2[df2.landing_page == 'new_page'])/df2.shape[0]*100,2)

The probability of receiving the new page is 50.01%, meaning users received the new or old page in a ratio very close to 50:50, which is what we want.

Hypothesis Testing

Method one: simulate it ourselves

I want to assume that the old page is better unless the new page proves to be definitely better at a Type I error rate of 5%. My null hypothesis and alternative hypothesis are as follows:

H0: p_new − p_old <= 0

H1: p_new − p_old > 0

Assume under the null hypothesis that p_new and p_old both have "true" success rates equal to the overall converted rate, regardless of page. Use a sample size for each page equal to the one in ab_data.csv.

Build the sampling distribution of the difference in conversion rate between the two pages by simulating 10,000 estimates under the null.

p_new = df2.converted.mean()
p_old = df2.converted.mean()

Both p_new and p_old are about 12.0%. The numbers of individuals in the treatment and control groups are as follows:

n_new = df2.query("group == 'treatment'").user_id.nunique()
n_old = df2.query("group == 'control'").user_id.nunique()

There are 145,310 users in the treatment group and 145,273 in the control group.

Simulate n_new transactions with a conversion rate of p_new under the null. Store these n_new 1's and 0's in new_page_converted.

new_page_converted = np.random.choice([1,0], size = n_new, replace = True, p = (p_new, 1-p_new))

The result is an array of 0s and 1s with a length of 145,310. The mean of this array fluctuates around 12% from run to run.

Create 10,000 p_new - p_old values using the same simulation process as above. Store all 10,000 values in a NumPy array called p_diffs.

p_diffs = []
for _ in range(10000):
    new_page_converted = np.random.choice([1, 0], size=n_new, replace=True, p=(p_new, 1 - p_new))
    old_page_converted = np.random.choice([1, 0], size=n_old, replace=True, p=(p_old, 1 - p_old))
    p_diffs.append(new_page_converted.mean() - old_page_converted.mean())
p_diffs = np.array(p_diffs)

The proportion of p_diffs values that are greater than the actual difference observed in ab_data.csv can be computed as follows:

# the observed difference in conversion rate between treatment and control
actual_diff = df2[df2.group == 'treatment'].converted.mean() - df2[df2.group == 'control'].converted.mean()
# draw the null distribution: centered at 0 with the same spread as p_diffs
null_vals = np.random.normal(0, p_diffs.std(), p_diffs.size)
# the proportion of simulated values larger than the observed difference (the p-value)
(null_vals > actual_diff).mean()

This proportion, which is the p-value, is 89.65%.

The overview of the whole process is as follows:

First, I assume the null hypothesis is true, meaning p_old is equal to p_new. Under the null, the old page and the new page have the same conversion rate, equal to the overall conversion rate computed without regard to page, which is 0.119597.

Second, I simulate a sampling distribution for the old and new pages and calculate the difference in conversion rate between them. In each simulation, n is the number of people who received each page and the conversion rate is 0.119597.

Third, the p_diffs array holds 10,000 simulated conversion-rate differences. I calculate the standard deviation of these 10,000 differences and then draw values from a normal distribution centered at 0 with that standard deviation.

Lastly, I calculate the proportion of those values that are larger than the observed difference. The p-value is 0.8965, which means 89.65% of the simulated differences are larger than the actual conversion-rate difference. With a Type I error rate of 0.05, 0.8965 is far above 0.05, so there is not enough evidence to reject the null hypothesis.
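To make this concrete, the null distribution can be plotted against the observed difference (a sketch using the variables defined above):

# histogram of the null distribution with the observed difference marked
plt.hist(null_vals, bins=50)
plt.axvline(actual_diff, color='red', linewidth=2)
plt.xlabel('simulated difference in conversion rate (new - old)')
plt.ylabel('count')
plt.show()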

Method two: built-in z-test

We could also use a built-in statsmodels function to obtain a similar result.

import statsmodels.api as sm
from scipy.stats import norm

# number of conversions for each page
convert_old = df2[df2.group == 'control'].converted.sum()
convert_new = df2[df2.group == 'treatment'].converted.sum()

# z-test for two proportions (alternative: p_old < p_new)
z_score, p_value = sm.stats.proportions_ztest([convert_old, convert_new], [n_old, n_new], alternative="smaller")
# critical z value at a 5% Type I error rate
z_critical = norm.ppf(1 - 0.05)
print("Z-Score: ", z_score, "\nCritical Z-Score: ", z_critical, "\nP-Value: ", p_value)

The p-value is 0.9052, which is much larger than 0.05. The z-score is also smaller than the critical z-score, so we cannot reject the null hypothesis.

There is not enough evidence to reject the null hypothesis. The conclusion is that the conversion rate of the new page is not larger than that of the old page.

A regression approach

Since each row is either a conversion or not, I will use logistic regression. The goal is to use statsmodels to fit a regression model and see whether there is a significant difference in conversion based on which page a user received. I create a column for the intercept and a dummy variable column for which page each user received: the ab_page column is 1 when an individual receives the treatment page and 0 for control.

df_reg = df2.copy()
df_reg.head()
# add intercept
df_reg["intercept"] = 1
# get dummies and rename
df_reg = df_reg.join(pd.get_dummies(df_reg['group']))
df_reg.rename(columns={"treatment": "ab_page"}, inplace=True)

Instantiate the regression model on the two columns created above to predict whether or not an individual converts.

y = df_reg[“converted”]
x = df_reg[[“intercept”, “ab_page”]]
#load model
log_mod = sm.Logit(y,x)
#fit model
result = log_mod.fit()
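We can print the model summary to inspect the coefficient and p-value for ab_page:

# display coefficients, standard errors, and p-values of the fitted model
result.summary()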

The ab_page term is not statistically significant, so the page alone does not explain conversion well. I will therefore consider other factors that might influence whether or not an individual converts. Along with testing whether the conversion rate changes for different pages, I also add an effect based on which country a user lives in.

df_countries = pd.read_csv("countries.csv")
df_countries.head()

Next, merge df_countries into the regression dataframe and create dummy variables for the country column.
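A minimal sketch of the merge and the country dummies, assuming countries.csv contains user_id and country columns with values US, UK, and CA:

# merge country info on user_id and create dummy columns (US is the baseline)
df_reg_country = df_reg.merge(df_countries, on='user_id', how='inner')
df_reg_country = df_reg_country.join(pd.get_dummies(df_reg_country['country'])[['CA', 'UK']].astype(int))
df_reg_country.head()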

Tune the model by creating interactions between page and country.

# Create interaction variables between page and country.
df_reg_country["CA_page"] = df_reg_country["CA"] * df_reg_country["ab_page"]
df_reg_country["UK_page"] = df_reg_country["UK"] * df_reg_country["ab_page"]

The results show that the p-values are all larger than 0.05. The interaction variables between page and country even reduce the significance of the original CA and UK columns, so these interaction terms should not be added to the model.

Limitation

Notice that, because of the timestamp associated with each event, we could technically run a hypothesis test continuously as each new observation arrives. The hard question then is: do you stop as soon as one page is considered significantly better than the other, or does the difference need to hold consistently for a certain amount of time? And how long do you run the test before deciding that neither page is better than the other?

These questions are the difficult parts associated with A/B tests in general.

Conclusion

The A/B test did not produce a winning variant: the conversion rate of the new page is not significantly higher than that of the old page. Still, there are things we could do to run a better test with a greater chance of reaching a clear result.

  1. Collect at least 7 days of data. Waiting at least one week reduces the impact of day-to-day differences in traffic, and we might see a different result.
  2. Check that the change we made is actually noticeable to users.
  3. Segment the A/B test results, for example by looking only at loyal users, which may lead to different conclusions.
