WeRateDogs Project - Part II: Data Cleaning and Data Analysis

Sisi (Rachel) Chen
10 min read · May 28, 2020

Data Cleaning

Define

Here is the instruction list for data cleaning; there are 9 issues I will solve during the cleaning process.

  1. Change the datatypes of the tweet_ids in the three tables and the timestamp column in df_twitter;
  2. Merge three tables;
  3. Drop the retweets and reply records, drop the columns we do not need, and drop the tweets without an image or with images that do not display dogs;
  4. Clean the wrong ratings — find out the appropriate rating from the multiple ratings;
  5. Create new columns: one in df_twitter combines “doggo”, “floofer”, “pupper”, and “puppo” into a single column, and the other converts the display_text_range column into a text_length column;
  6. Extract the source from the HTML code;
  7. Replace the wrong names in the name column;
  8. Reduce the prediction columns into two — breed and confidence;
  9. Clean the new breed column by replacing the “_” and making them all lowercase.

Code

The following code resolves these issues one by one. First, copies of the original dataframes should be made prior to cleaning.

# I add the "_fc" at the end of the dataframe name. The "_fc" means "for cleaning"
df_twitter_fc = df_twitter.copy()
df_predict_fc = df_predict.copy()
df_api_fc = df_api.copy()

1. Change the datatypes of the tweet_ids in the three tables and the timestamp column in df_twitter;

df_twitter_fc.tweet_id = df_twitter_fc.tweet_id.astype(str)
df_predict_fc.tweet_id = df_predict_fc.tweet_id.astype(str)
df_api_fc.tweet_id = df_api_fc.tweet_id.astype(str)
df_twitter_fc.timestamp = pd.to_datetime(df_twitter_fc.timestamp)
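As a sanity check, the conversion can be sketched on a toy frame (the ids and timestamps below are made-up values):

```python
import pandas as pd

# Toy stand-in for df_twitter (hypothetical values)
df = pd.DataFrame({
    "tweet_id": [892420643555336193, 892177421306343426],
    "timestamp": ["2017-08-01 16:23:56", "2017-08-01 00:17:27"],
})

# tweet_id is an identifier, not a quantity, so store it as a string
df.tweet_id = df.tweet_id.astype(str)
# timestamp becomes a proper datetime, enabling the .dt accessors used later
df.timestamp = pd.to_datetime(df.timestamp)
print(df.dtypes)
```

Treating ids as strings also avoids any risk of large integers being silently cast to floats during later merges.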

2. Merge three tables;

# use an outer join so that every tweet_id appearing in any of the tables is kept in the joined table
df_master = pd.merge(df_twitter_fc, df_api_fc, on="tweet_id", how="outer")
df_master = pd.merge(df_master, df_predict_fc, on="tweet_id", how="outer")
# create the copy of the merged dataframe for cleaning
df_master_fc = df_master.copy()
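Why an outer join? A minimal sketch with hypothetical rows shows that ids appearing in only one table still survive the merge:

```python
import pandas as pd

# Minimal stand-ins for the three tables (hypothetical rows)
df_twitter_fc = pd.DataFrame({"tweet_id": ["1", "2"], "text": ["good dog", "13/10"]})
df_api_fc = pd.DataFrame({"tweet_id": ["1", "3"], "retweet_count": [10, 5]})
df_predict_fc = pd.DataFrame({"tweet_id": ["2"], "p1": ["pug"]})

# Outer joins keep every tweet_id that appears in any of the tables
df_master = pd.merge(df_twitter_fc, df_api_fc, on="tweet_id", how="outer")
df_master = pd.merge(df_master, df_predict_fc, on="tweet_id", how="outer")
print(sorted(df_master.tweet_id))  # ['1', '2', '3']
```

An inner join here would have dropped tweet ids 2 and 3; the outer join keeps them with NaN in the missing columns, which the later dropna steps handle explicitly.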

The columns and data types of the df_master_fc are as follows:

3. Drop the retweets and reply records, drop the columns we do not need, and drop the tweets without an image or with images that do not display dogs;

# drop the rows missing the "retweeted" column that came from the API table
df_master_fc.dropna(subset=["retweeted"], inplace=True)
# keep only the rows without an entry in "retweeted_status_id", i.e. not retweets
df_master_fc = df_master_fc[df_master_fc["retweeted_status_id"].isnull()]
# keep only the rows without an entry in "in_reply_to_status_id", i.e. not replies
df_master_fc = df_master_fc[df_master_fc.in_reply_to_status_id.isnull()]
# keep only the rows with an image, i.e. with a non-null "jpg_url"
df_master_fc.dropna(subset=["jpg_url"], inplace=True)
# images that do not display dogs are the ones where none of the predictions in df_predict is a dog
df_master_fc.drop(df_master_fc.query("p1_dog == False and p2_dog == False and p3_dog == False").index, inplace=True)
# drop the unneeded columns
df_master_fc.drop(["in_reply_to_status_id", "in_reply_to_user_id",
                   "retweeted_status_id", "retweeted_status_user_id",
                   "retweeted_status_timestamp", "retweeted"], inplace=True, axis=1)
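The filtering logic can be sketched on a tiny hypothetical frame: one original tweet with an image, one retweet, and one tweet without an image.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: "1" is an original tweet with an image,
# "2" is a retweet, "3" has no image
df = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "retweeted_status_id": [np.nan, 111.0, np.nan],
    "in_reply_to_status_id": [np.nan, np.nan, np.nan],
    "jpg_url": ["a.jpg", "b.jpg", np.nan],
})

df = df[df["retweeted_status_id"].isnull()]    # keep originals only
df = df[df["in_reply_to_status_id"].isnull()]  # drop replies
df = df.dropna(subset=["jpg_url"])             # keep rows with an image
print(df.tweet_id.tolist())  # ['1']
```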

Then the columns and data types of the df_master_fc are as follows:

4. Clean the wrong ratings — find out the appropriate rating from the multiple ratings;

# Some rows contain two or more rating numbers; find these rows to check why they have multiple ratings.
df_duplicate_ratings = df_master_fc.copy()
pattern = r"\d+/\d+"
df_duplicate_ratings['all_patterns'] = df_duplicate_ratings.text.str.findall(pattern)
df_duplicate_ratings['patterns_count'] = df_duplicate_ratings['all_patterns'].apply(len)
# find the indexes of all rows that do not contain exactly one rating
duplicate_rating_index = df_duplicate_ratings.index[df_duplicate_ratings['patterns_count'] != 1].tolist()
# remove the rows with multiple ratings
df_master_fc = df_master_fc.loc[~df_master_fc.index.isin(duplicate_rating_index)]
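The pattern-count idea can be sketched on two hypothetical texts; the second contains a date fragment as well as the rating, which makes it ambiguous:

```python
import pandas as pd

# Hypothetical tweet texts; the second has a date ("4/20") plus the rating
df = pd.DataFrame({"text": [
    "This is Bella. 13/10 would pet",
    "Happy 4/20 from this pupper. 13/10",
]})

pattern = r"\d+/\d+"
counts = df.text.str.findall(pattern).apply(len)
print(counts.tolist())  # [1, 2]

# keep only the rows with exactly one rating-like fragment
df_clean = df[counts == 1]
```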

5. Create new columns: one in df_twitter combines “doggo”, “floofer”, “pupper”, and “puppo” into a single column, and the other converts the display_text_range column into a text_length column;

# replace "None" with "" in each column
for x in ["doggo", "floofer", "pupper", "puppo"]:
    df_master_fc[x].replace("None", "", inplace=True)
# add the four string columns together to create one column called dog_type
df_master_fc['dog_type'] = (df_master_fc['doggo'] + df_master_fc['floofer']
                            + df_master_fc['pupper'] + df_master_fc['puppo'])
df_master_fc.dog_type.value_counts()

The dog_type column still has a lot of blank values. We can drop “doggo”, “floofer”, “pupper”, and “puppo” columns.
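The combination step can be sketched on a few hypothetical rows:

```python
import pandas as pd

# Hypothetical stage flags as they appear in the archive ("None" marks a gap)
df = pd.DataFrame({
    "doggo": ["None", "doggo", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["pupper", "None", "None"],
    "puppo": ["None", "None", "None"],
})

for col in ["doggo", "floofer", "pupper", "puppo"]:
    df[col] = df[col].replace("None", "")

# String concatenation collapses the four columns into one
df["dog_type"] = df["doggo"] + df["floofer"] + df["pupper"] + df["puppo"]
print(df["dog_type"].tolist())  # ['pupper', 'doggo', '']
```

Note that a tweet tagged with two stages ends up with a concatenated value such as "doggopupper", which is exactly what the value_counts output above shows.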

Now let us change the display_text_range column into a text_length column.

def length(value):
    return value[1] - value[0]

df_master_fc['display_text_length'] = df_master_fc['display_text_range'].apply(length)
# Check the result
df_master_fc[['display_text_range', 'display_text_length']].head()
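Since display_text_range holds [start, end] character offsets, the length is just their difference. A tiny sketch with hypothetical ranges:

```python
import pandas as pd

# Hypothetical [start, end] offsets
df = pd.DataFrame({"display_text_range": [[0, 85], [0, 138]]})

df["display_text_length"] = df["display_text_range"].apply(lambda v: v[1] - v[0])
print(df["display_text_length"].tolist())  # [85, 138]
```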

6. Extract the source from the HTML code;

# The raw source field is an HTML anchor tag, so extract the text between ">" and "<"
df_master_fc['extracted_source'] = df_master_fc.source.str.extract(r"\>(.*)\<")
df_master_fc.drop('source', axis=1, inplace=True)
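The extraction can be sketched on one hypothetical raw source value; `expand=False` makes `str.extract` return a Series directly:

```python
import pandas as pd

# A typical raw source value is an HTML anchor tag (hypothetical sample)
df = pd.DataFrame({"source": [
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
]})

# Grab whatever sits between ">" and "<"
df["extracted_source"] = df.source.str.extract(r"\>(.*)\<", expand=False)
print(df["extracted_source"].tolist())  # ['Twitter for iPhone']
```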

7. Replace the wrong names in the name column;

In the data assessment part, we found a list of stopwords in the “name” column:

['None', 'a', 'all', 'an', 'by', 'his', 'just', 'my', 'not', 'such', 'the', 'this', 'very']

# replace the stopwords with NaN
tokens_without_sw = ['None', 'a', 'all', 'an', 'by', 'his', 'just', 'my', 'not', 'such', 'the', 'this', 'very']
for x in tokens_without_sw:
    df_master_fc["name"].replace(x, np.nan, inplace=True)
df_master_fc.name.value_counts()

Now there are no stopwords in the ‘name’ column.

8. Reduce the prediction columns into two — breed and confidence — using the most confident prediction that is identified as a dog.

We know that some of the pictures are identified as other species, but we only want the tweets that are identified as dogs.

# extract the most likely prediction that is a dog
def get_attr(x):
    """
    input:
        a row slice in the order (p1_dog, p1 or p1_conf, p2_dog, p2 or p2_conf, p3 or p3_conf)
    output:
        the result after checking the True/False values of the p1_dog and p2_dog columns
    """
    if x.iloc[0] == True:
        return x.iloc[1]
    elif x.iloc[2] == True:
        return x.iloc[3]
    else:
        return x.iloc[4]

df_master_fc["breed"] = df_master_fc[["p1_dog", "p1", "p2_dog", "p2", "p3"]].apply(get_attr, axis=1)
df_master_fc["confidence"] = df_master_fc[["p1_dog", "p1_conf", "p2_dog", "p2_conf", "p3_conf"]].apply(get_attr, axis=1)
df_master_fc.iloc[:, 13:].head(10)
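Because a row slice keeps the column order it was selected with, get_attr implements a simple priority: p1 if it is a dog, else p2, else p3. A sketch with hypothetical predictions:

```python
import pandas as pd

def get_attr(x):
    # return the first prediction whose *_dog flag is True, else the third one
    if x.iloc[0] == True:
        return x.iloc[1]
    elif x.iloc[2] == True:
        return x.iloc[3]
    else:
        return x.iloc[4]

# Hypothetical rows: dog on p1, dog on p2 only, no dog prediction at all
df = pd.DataFrame({
    "p1_dog": [True, False, False], "p1": ["pug", "seat_belt", "web_site"],
    "p2_dog": [False, True, False], "p2": ["chow", "pug", "banana"],
    "p3": ["chow", "chow", "doormat"],
})

df["breed"] = df[["p1_dog", "p1", "p2_dog", "p2", "p3"]].apply(get_attr, axis=1)
print(df["breed"].tolist())  # ['pug', 'pug', 'doormat']
```

The third row would only survive the earlier filtering if at least one prediction was a dog, so the p3 fallback rarely fires on the cleaned data.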

Then drop the original prediction columns we no longer need.

df_master_fc.drop(['p1', 'p1_conf', 'p1_dog', 'p2', 'p2_conf', 'p2_dog',
                   'p3', 'p3_conf', 'p3_dog'], axis=1, inplace=True)

9. Clean the new breed column by replacing the “_” and making them all lowercase.

df_master_fc.breed = df_master_fc.breed.str.replace("_", " ")
df_master_fc.breed = df_master_fc.breed.str.lower()
df_master_fc.breed.value_counts()

After the cleaning, we can check whether there are null values in each column.

Only the name column has a lot of NA values, so we may not use it for further data analysis.

Test

I will test the “df_master_fc” to make sure my cleaning operations worked.

Test 1: The data type of the tweet_id is String and that of the timestamp column is DateTime.

Test 2: the two new columns — one combining “doggo”, “floofer”, “pupper”, and “puppo”, the other converting the display_text_range column into a text_length column;

df_master_fc['dog_type'].value_counts()
Out:
pupper 164
doggo 54
puppo 21
doggopupper 7
floofer 7
doggopuppo 1
doggofloofer 1
Name: dog_type, dtype: int64

Test 3: extract the source from the HTML code;

df_master_fc['extracted_source'].value_counts()
Out:
Twitter for iPhone 1610
Twitter Web Client 20
TweetDeck 8

Test 4: Replace the wrong names in the name column;

df_master_fc['name'].unique().tolist()
Out:
['Tilly',
'Archie',
'Darla',
'Franklin',
nan,
'Jax',
'Zoey',
'Cassie',
'Koda',
'Bruno',
'Ted',
'Stuart',
'Oliver',
'Jim',
'Zeke',
'Ralphus',
'Gerald',
'Jeffrey',
...]

Test 5: Reduce the prediction columns into two — breed and confidence

Index(['tweet_id', 'timestamp', 'text', 'expanded_urls', 'rating_numerator',
'rating_denominator', 'name', 'retweet_count', 'favorite_count',
'display_text_range', 'jpg_url', 'img_num', 'dog_type',
'display_text_length', 'extracted_source', 'breed', 'confidence'],
dtype='object')

Test 6: clean the new breed column by replacing the “_” and making the values all lowercase.

df_master_fc['breed'].unique()
Out: array(['chihuahua', 'labrador retriever', 'basset',
'chesapeake bay retriever', 'appenzeller', 'pomeranian',
'irish terrier', 'pembroke', 'samoyed', 'french bulldog',
'golden retriever', 'whippet', 'siberian husky', 'mexican hairless',
'kuvasz', 'pug', 'blenheim spaniel', 'malamute',
'italian greyhound', 'chow', 'german shepherd', 'doberman',
'eskimo dog', 'weimaraner', 'saluki', 'miniature pinscher',
'german short-haired pointer', 'english springer', 'vizsla',
'bloodhound', 'bernese mountain dog', 'west highland white, ...]

Data Analysis

Questions:

  1. Which dog breed has the highest retweet count, favorite count, and rating?
  2. Which day of the week has the highest retweet count and favorite count?
  3. Which time period in a day has the highest retweet count and favorite count?

df_master_fc.to_csv('twitter_archive_master.csv', index=False)
df_analysis = pd.read_csv('twitter_archive_master.csv')
df_breed = df_analysis.copy()
# The rating is equal to its numerator divided by its denominator
df_breed['rating'] = df_breed['rating_numerator'] / df_breed['rating_denominator']
df_breed['rating'].value_counts()

1. Which dog breed has the highest retweet count, favorite count, and rating?

First, calculate the total retweet and favorite counts per breed.

df_breed_total = df_breed.groupby('breed').agg({'retweet_count': "sum", 'favorite_count': "sum"}).reset_index()
# Check the top 5 breeds
for i in ['retweet_count', 'favorite_count']:
    print(i)
    print(df_breed_total.sort_values(by=i, ascending=False).head())
    fig, ax = plt.subplots(figsize=(15, 6))
    ax = sns.barplot(x="breed", y=i,
                     data=df_breed_total.sort_values(i, ascending=False).iloc[:10],
                     palette="magma")
    ax.set_title("Total " + i + " per breed")
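The groupby-and-sum step, without the plotting, can be sketched on hypothetical counts for two breeds:

```python
import pandas as pd

# Hypothetical per-tweet counts
df_breed = pd.DataFrame({
    "breed": ["pug", "pug", "chow"],
    "retweet_count": [10, 20, 5],
    "favorite_count": [100, 200, 50],
})

df_breed_total = (df_breed.groupby("breed")
                  .agg({"retweet_count": "sum", "favorite_count": "sum"})
                  .reset_index())
print(df_breed_total)
# chow sums to 5 retweets / 50 favorites, pug to 30 / 300
```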

We can see that the most liked and retweeted tweets are the images of a Golden Retriever. The second and third breeds are Labrador Retriever and Pembroke Welsh Corgi.

If the tweets we extracted are mainly about Golden Retrievers, Golden Retriever will surely be the breed with the highest total retweet and favorite counts. To reduce the bias of our sample selection, we can use the mean retweet count, mean favorite count, and mean rating to check which breed is the most popular.

# Calculate the mean metrics
df_breed_mean = df_breed.groupby('breed').agg({'retweet_count': "mean", 'favorite_count': "mean", 'rating': "mean"}).reset_index()
df_breed_mean.head()
# Change the column names
df_breed_mean.columns = ['breed', 'avg_retweet_count', 'avg_favorite_count', 'avg_rating']
# Check the top 5 of the average metrics
for i in ['avg_retweet_count', 'avg_favorite_count', 'avg_rating']:
    print(i)
    print(df_breed_mean.sort_values(by=i, ascending=False).head())
    fig, ax = plt.subplots(figsize=(15, 6))
    ax = sns.barplot(x="breed", y=i,
                     data=df_breed_mean.sort_values(i, ascending=False).iloc[:10],
                     palette="magma")
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
    ax.set_title("Average " + i + " per breed")

When we look at the average metrics, the most popular breeds are no longer Golden Retriever, Labrador Retriever, and Pembroke Welsh Corgi. Australian Terrier and Clumber are the breeds with the highest average retweet count, average favorite count, and average rating.

2. Which day in a week is the one with the highest retweet count and favorite count?

df_time = df_analysis.copy()
df_time['timestamp'] = df_time['timestamp'].apply(lambda x: pd.to_datetime(str(x)))

def retweet_favorited_period(time_format, aggr):
    """
    Calculate the sum or the mean of the retweet count and favorite count
    by a given time unit, and plot the results.
    input:
        - time_format: a string naming the time unit to group by, one of 'weekday', 'hour', or 'month'.
        - aggr: a string naming the aggregation function to apply to the retweet and favorite counts, e.g. 'sum' or 'mean'.
    output:
        - a dataframe showing the aggregated retweet and favorite counts by the time unit
        - a multiple-line plot of the aggregated results
    """
    if time_format == 'weekday':
        df_time[time_format] = df_time['timestamp'].dt.weekday
        period_length = 7
    elif time_format == 'hour':
        df_time[time_format] = df_time['timestamp'].dt.hour
        period_length = 24
    elif time_format == 'month':
        df_time[time_format] = df_time['timestamp'].dt.month
        period_length = 12
    # calculate the aggregated metrics by the chosen time unit
    df_time_aggr = df_time.groupby(time_format).agg({'retweet_count': aggr, 'favorite_count': aggr}).reset_index()
    print(df_time_aggr)
    # multiple-line plot
    plt.figure(figsize=(15, 6))
    plt.plot(time_format, 'retweet_count', data=df_time_aggr, marker='o', markerfacecolor='blue', markersize=12, color='skyblue', linewidth=4)
    plt.plot(time_format, 'favorite_count', data=df_time_aggr, marker='', color='olive', linewidth=2)
    plt.title(aggr + " retweet and favorite count by " + time_format)
    plt.xticks(np.arange(0, period_length))
    plt.legend()
    plt.show()
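The core of the function, minus the plotting, can be sketched with a few hypothetical timestamps (a Monday and two Wednesdays):

```python
import pandas as pd

# Hypothetical timestamps: 2017-07-31 is a Monday, 2017-08-02 a Wednesday
df_time = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-07-31 16:00",
        "2017-08-02 00:30",
        "2017-08-02 17:00",
    ]),
    "retweet_count": [10, 20, 30],
    "favorite_count": [100, 200, 300],
})

df_time["weekday"] = df_time["timestamp"].dt.weekday  # Monday=0 ... Sunday=6
agg = (df_time.groupby("weekday")
       .agg({"retweet_count": "sum", "favorite_count": "sum"})
       .reset_index())
print(agg)
# weekday 0 sums to 10 / 100, weekday 2 to 50 / 500
```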

I call the function to get the total retweet and favorite counts by weekday.

retweet_favorited_period('weekday', 'sum')

The day of the week runs from Monday (0) to Sunday (6). From the line chart above, we can see that the total retweet and favorite counts have two peaks, on Monday and Wednesday.

retweet_favorited_period('weekday', 'mean')

From the line chart above, we can see that the average retweet and favorite counts have two peaks, on Wednesday and Saturday.

3. Which time period in a day is the one with the highest retweet count and favorite count?

retweet_favorited_period('hour', 'sum')
retweet_favorited_period('hour', 'mean')

Looking at the total retweet and favorite counts, there are no tweets between 7:00 and 13:00. The two peaks of the totals occur between 0:00 and 1:00 and between 16:00 and 17:00.

Looking at the average retweet and favorite counts, the two peaks of the mean favorite count occur between 6:00 and 7:00 and between 16:00 and 17:00.
