WeRateDogs Project, Part I: Data Assessment

DSND Udacity P3

Introduction:

Real-world data rarely comes clean. Data wrangling is the process of gathering data from a variety of sources and formats, assessing its quality and tidiness, and cleaning it.

The datasets that I will wrangle are as follows:

Data Sources

  1. Name: WeRateDogs Twitter Archive (twitter-archive-enhanced.csv)

Data Content: Basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain, though, is each tweet’s text, which I used to extract the rating, dog name, and dog “stage” (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive “enhanced.” Of the 5000+ tweets, I have filtered for tweets with ratings only (there are 2356).

How to gather: Download this file manually from the “Project Details” section.

2. Name: Tweet image predictions (image_predictions.tsv)

Data Content: The breed of dog (or other object, animal, etc.) present in each tweet’s image, according to a neural network.

How to gather: Download the data programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
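A minimal sketch of this download step, assuming the Requests library is installed. The helper name download_file, its optional session argument, and the 30-second timeout are my own illustrative choices and not part of the project instructions.

```python
import requests

# URL given in the project instructions
URL = ("https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
       "599fd2ad_image-predictions/image-predictions.tsv")


def download_file(url, path, session=None):
    """Download url and save the raw bytes to path (hypothetical helper)."""
    # A session can be passed in for connection reuse or for testing
    session = session or requests.Session()
    response = session.get(url, timeout=30)
    # Raise an exception for 4xx/5xx responses instead of saving an error page
    response.raise_for_status()
    with open(path, "wb") as f:
        f.write(response.content)
    return path


# Usage (needs network access):
# download_file(URL, "image_predictions.tsv")
```

Writing response.content in binary mode keeps the file byte-for-byte identical to what the server sent, so it can be read later as a tab-separated file.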

3. Name: WeRateDogs™

Data Content: Each tweet’s retweet count and favorite (“like”) count at minimum, and any additional data you find interesting.

How to gather: Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet’s JSON data using Python’s Tweepy library and store each tweet’s entire set of JSON data in a file called tweet_json.txt.

import tweepy

# Keys and tokens for the Twitter API (placeholders; substitute your own credentials)
api_key = "my_key"
api_secret_key = "my_secret_key"
access_token = "my_access_token"
access_secret = "my_access_secret"

# Authenticate against the API
auth = tweepy.OAuthHandler(api_key, api_secret_key)
auth.set_access_token(access_token, access_secret)

# wait_on_rate_limit makes Tweepy sleep until the rate-limit window resets instead of
# raising an error; wait_on_rate_limit_notify prints a message while it waits.
# (wait_on_rate_limit_notify is a Tweepy 3.x parameter; it was removed in Tweepy 4.)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

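Let us query a single tweet_id to see what kind of data the API returns:

# df_twitter is the Twitter archive loaded as a DataFrame
tweet_id_list = df_twitter.tweet_id.tolist()
tweet_id = tweet_id_list[0]
print(tweet_id)
# Fetch the tweet and print its raw JSON payload
tweet = api.get_status(tweet_id)
print(tweet._json)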
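The gathering step stores each tweet’s JSON in tweet_json.txt. Here is a minimal sketch of that storage and of reading the counts back, assuming one JSON object per line. The two sample dictionaries are made-up stand-ins for tweet._json, and the helper names and the sample file name are hypothetical.

```python
import json


def save_tweets(tweets, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for tweet in tweets:
            f.write(json.dumps(tweet) + "\n")


def load_counts(path):
    """Read the file back, keeping only the id and the two counts."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            data = json.loads(line)
            rows.append({
                "tweet_id": data["id"],
                "retweet_count": data["retweet_count"],
                "favorite_count": data["favorite_count"],
            })
    return rows


# Made-up sample payloads standing in for tweet._json
sample = [
    {"id": 1, "retweet_count": 10, "favorite_count": 25, "lang": "en"},
    {"id": 2, "retweet_count": 3, "favorite_count": 7, "lang": "en"},
]
save_tweets(sample, "tweet_json_sample.txt")
rows = load_counts("tweet_json_sample.txt")
print(rows)
```

Keeping one object per line means the file can be processed tweet by tweet, and a list of dictionaries like rows can be handed straight to a DataFrame constructor.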

Data Assessing

WeRateDogs Twitter Archive (twitter-archive-enhanced.csv)

  1. Missing value check and visual check

There are a lot of missing values in the reply and retweeted-status columns. There are also many “None” values in the “doggo”, “floofer”, “pupper”, and “puppo” columns, so I want a flag that is True when a tweet has no dog stage in any of these columns.
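A minimal sketch of both checks with pandas. The small DataFrame is made up and only stands in for df_twitter; the variable names are my own.

```python
import pandas as pd

# Made-up rows standing in for df_twitter
df = pd.DataFrame({
    "in_reply_to_status_id": [None, 1.0, None],
    "doggo":   ["None", "doggo", "None"],
    "floofer": ["None", "None", "None"],
    "pupper":  ["None", "None", "pupper"],
    "puppo":   ["None", "None", "None"],
})

# Number of missing values per column
missing = df.isna().sum()

# True when all four stage columns hold the string "None"
stage_cols = ["doggo", "floofer", "pupper", "puppo"]
no_stage = (df[stage_cols] == "None").all(axis=1)

print(missing)
print(no_stage.tolist())
```

Note that isna() does not count the string “None” as missing, which is why the stage columns need their own comparison.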

2. Check the values in the “source” column

We only need part of the information in this column. For example, we do not need the “<a href="https:” markup in the source descriptions.
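One way to keep only the readable part is a regular expression that captures the text inside the anchor tag. This is a sketch; the two sample strings are assumptions about what the column values look like.

```python
import pandas as pd

# Sample values assumed to resemble the "source" column
source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
])

# Capture the text between ">" and "</a>"
source_name = source.str.extract(r">([^<]+)</a>", expand=False)
print(source_name.tolist())
```

With expand=False and a single capture group, str.extract returns a Series, so the result can be assigned directly to a column.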

3. Data types

From the dataset information, we find two datatype problems that need to be dealt with:
a. The datatype of the tweet_id column should be string.
b. The datatype of the timestamp column should be datetime.
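Both conversions are short in pandas. The sketch below uses made-up rows, and the timestamp format shown is an assumption about how the archive stores it.

```python
import pandas as pd

# Made-up rows; the timestamp format is an assumption about the archive
df = pd.DataFrame({
    "tweet_id": [892420643555336193, 892177421306343426],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-08-01 00:17:27 +0000"],
})

# IDs are labels, not quantities, so store them as text
df["tweet_id"] = df["tweet_id"].astype(str)
# Parse the text timestamps into timezone-aware datetimes
df["timestamp"] = pd.to_datetime(df["timestamp"])

print(df.dtypes)
```

After the conversion, the .dt accessor becomes available on the timestamp column, for example df["timestamp"].dt.year.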

Let us continue to look at the “tweet_id”, “name”, “rating_numerator” and “rating_denominator” columns.

4. Check if the tweet_id is unique

len(df_twitter['tweet_id'].unique()) == df_twitter.shape[0]
Output: True

5. Check the value counts in “name” column

df_twitter['name'].value_counts()

We can see that “None”, “a”, and “an” cannot be real names.

from nltk.corpus import stopwords  # requires the NLTK "stopwords" corpus to be downloaded

# Check which English stopwords (plus "None") appear in the name column
name_list = sorted(df_twitter['name'].unique().tolist())
all_stopwords = stopwords.words('english')
all_stopwords.append("None")
stopword_names = [word for word in name_list if word in all_stopwords]

As we can see, the name column contains wrong names like ‘None’, ‘a’, ‘all’, ‘an’, ‘by’, ‘his’, ‘just’, ‘my’, ‘not’, ‘such’, ‘the’, ‘this’, and ‘very’. These names should be removed or replaced in further analysis.

6. Now we can check the “rating_numerator” and “rating_denominator” columns.

rating_denominator column

The unique values are:
[0, 2, 7, 10, 11, 15, 16, 20, 40, 50, 70, 80, 90, 110, 120, 130, 150, 170]

Count how many tweets fall under each unique value:

The range of denominators is large, from 0 to 170. Let us look at the rows with large denominator values.

print(df_twitter.query("rating_denominator > 100").text)

The rows with large denominators seem to have no problems, so we check the ones with really small denominators.

print(df_twitter.query("rating_denominator == 0").text)

There are two fractions in this tweet, but I do not think “960/00” is the rating; “13/10” is the correct one.

rating_numerator column

The unique values:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 20, 24, 26, 27, 44, 45, 50, 60, 75, 80, 84, 88, 99, 121, 143, 144, 165, 182, 204, 420, 666, 960, 1776]

Count how many tweets fall under each unique value:

The range of numerators is large, from 0 to 1776. Let us look at the rows with large numerator values.

print(df_twitter.query("rating_numerator > 150 ").text)

The rows with large numerators seem to have no problems, so we check the ones with a numerator of 0.

print(df_twitter.query("rating_numerator == 0").text)

Entry 1016 shows a tweet whose pictures do not contain any dogs.

In conclusion, the rating numerator and denominator columns are invalid when the tweet contains a picture that does not contain any dog, and when there are two ratings in the tweet text.

Let us check the tweets with two ratings in them.

From these tweets, we can tell that multiple ratings in a text have the following causes:

a. There are retweet records, so more than 2 people have rated the dog image.

b. There are multiple dogs in one picture, so the users will rate all the dogs.

c. Some of the extracted ratings are actually counts of dog legs or refer to the convenience store “7/11”.
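Tweets with more than one rating-like fraction can be found with a regular expression. This is a sketch: the pattern is my own guess at what counts as a rating (it also accepts decimals such as 9.5/10), the helper name is hypothetical, and the three texts are made up to mirror the cases above.

```python
import re

# Matches fractions such as 13/10 or 9.5/10 (assumed pattern)
RATING = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def find_ratings(text):
    """Return every (numerator, denominator) pair found in the text."""
    return RATING.findall(text)


# Made-up tweet texts
texts = [
    "This is Bella. 13/10 would pet",
    "Two pups here. 10/10 and 7/10",
    "Open 24/7 at the 7/11 store. Still 12/10",
]

for text in texts:
    found = find_ratings(text)
    print(len(found), found)
```

Any text where more than one pair is found is a candidate for manual review, since the pattern alone cannot tell a real rating from “24/7” or “7/11”.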

Tweet image predictions (image_predictions.tsv)

  1. Missing value

There seem to be no NA values in any column. Let us look at the rows in the dataset. A sample of the dataset is as follows:

From the samples above we can tell that:

a. There are rows without any dog predictions.

b. There are both uppercase and lowercase characters in the p1, p2, and p3 result columns. The predictions in these three columns also show that there are lots of pictures of other species, not just dogs.

2. Data type

The tweet_id column’s datatype should be a string. The other columns’ data types are correct.

3. Unique tweet_id or not

len(df_predict['tweet_id'].unique()) == df_predict.shape[0]
Output: True

4. Duplicate images

An image may be uploaded by different users. The 66 duplicate images show that the df_predict dataset contains retweet records or repeated image uploads. I also wonder how many images are not identified as any dog breed.

# True when none of the three predictions is a dog breed
(~df_predict[["p1_dog", "p2_dog", "p3_dog"]].any(axis=1)).value_counts()
Output:
False    1751
True      324

There are 324 records without a dog-breed prediction. Let us check some samples of the images that cannot be identified.

I find that some pictures do not contain any dogs, and some of them contain only unclear images of dogs.

Name: WeRateDogs™

  1. Missing value

The data sample is as follows:

2. Check the data types

The tweet_id datatype should be changed to string.

The df_api data looks pretty good. There are no NA values in any column, but in the display_text_range column, each value is a list with two numbers.

Data assessment overview

Data quality issues:

df_twitter

  • The datatype of tweet_id should be string, not integer.
  • The datatype of timestamp should be datetime, not object.
  • The columns “in_reply_to_status_id”, “in_reply_to_user_id”, “retweeted_status_id”, “retweeted_status_user_id”, and “retweeted_status_timestamp” have lots of NA values.
  • The “doggo”, “floofer”, “pupper”, and “puppo” columns have lots of “None” values.
  • Some of the dog names are not correct (None, an, by, a, …).
  • In the “source” column, we only need part of the information. For example, we do not need the “<a href="https:” part.
  • The rating columns are incorrect for tweets with multiple ratings in the text. The reasons are as follows:
      1. There are retweet records, so more than 2 people have rated the dog image.
      2. There are multiple dogs in one picture, so the users will rate them all.
      3. Some of the extracted ratings are counts of dog legs or refer to the convenience store “7/11”.

df_predict

  • The predictions mix uppercase and lowercase, and the breed names contain “_”.
  • The datatype of tweet_id should be string, not integer.
  • The dataset contains duplicate images.
  • Some pictures in this table are not of dogs.
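The duplicate-image and mixed-case issues in this list can be checked with a sketch like the following. The rows are made up, and jpg_url is an assumed name for the column that holds the image link.

```python
import pandas as pd

# Made-up rows; jpg_url is an assumed column name for the image link
df = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "jpg_url": ["https://example.com/a.jpg",
                "https://example.com/b.jpg",
                "https://example.com/a.jpg"],
    "p1": ["Welsh_springer_spaniel", "redbone", "Welsh_springer_spaniel"],
})

# duplicated() marks every repeat after the first occurrence
dup_count = int(df["jpg_url"].duplicated().sum())

# Count predictions that contain at least one uppercase letter
upper_count = int(df["p1"].str.contains("[A-Z]").sum())

print(dup_count, upper_count)
```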

df_api

  • The datatype of tweet_id should be String, not Integer

Data tidiness issues:

df_twitter

The “doggo”, “floofer”, “pupper”, and “puppo” columns have lots of “None” values. We can combine them into one column.
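A minimal sketch of combining the four columns into one, using made-up rows. The new column name stage and the choice to join multiple stages with a comma are my own assumptions.

```python
import pandas as pd

# Made-up rows standing in for df_twitter
df = pd.DataFrame({
    "doggo":   ["None", "doggo", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper":  ["pupper", "None", "pupper"],
    "puppo":   ["None", "None", "None"],
})
stage_cols = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Join every stage that is set; return "None" when no stage is set."""
    stages = [value for value in row if value != "None"]
    return ", ".join(stages) if stages else "None"


df["stage"] = df[stage_cols].apply(combine_stages, axis=1)
df = df.drop(columns=stage_cols)
print(df["stage"].tolist())
```

Joining rather than picking the first match keeps tweets that mention two stages visible, so they can be reviewed instead of silently losing one value.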

df_predict

There are three predictions with probabilities. Only the prediction with the highest probability is the one we need.
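A sketch of keeping only the most confident prediction per row. The rows are made up, and the p1_conf, p2_conf, and p3_conf column names are assumptions about how the probabilities are stored; the output column names are hypothetical.

```python
import pandas as pd

# Made-up rows; the *_conf column names are assumed
df = pd.DataFrame({
    "p1": ["golden_retriever", "paper_towel"],
    "p1_conf": [0.80, 0.60],
    "p1_dog": [True, False],
    "p2": ["Labrador_retriever", "Chihuahua"],
    "p2_conf": [0.15, 0.30],
    "p2_dog": [True, True],
    "p3": ["tennis_ball", "pug"],
    "p3_conf": [0.05, 0.10],
    "p3_dog": [False, True],
})


def best_prediction(row):
    """Return the label, confidence, and dog flag of the most confident prediction."""
    best = max(("p1", "p2", "p3"), key=lambda p: row[f"{p}_conf"])
    return pd.Series({
        "prediction": row[best],
        "confidence": row[f"{best}_conf"],
        "is_dog": row[f"{best}_dog"],
    })


top = df.apply(best_prediction, axis=1)
print(top)
```

In the second made-up row the most confident prediction is not a dog, which is the kind of record that the quality issues above already flag.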

df_api

In the display_text_range column, each value is a list with two numbers. The column can be changed into display_text_length to represent the length of the text.
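A minimal sketch of that change, assuming the two numbers are the start and end positions of the displayed text. The rows are made up.

```python
import pandas as pd

# Made-up rows standing in for df_api
df = pd.DataFrame({
    "tweet_id": ["1", "2"],
    "display_text_range": [[0, 85], [0, 138]],
})

# Length is the end position minus the start position
df["display_text_length"] = df["display_text_range"].apply(lambda r: r[1] - r[0])
df = df.drop(columns="display_text_range")
print(df)
```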
