Hands-On ML Chapter 2

Sisi (Rachel) Chen
6 min read · Jun 11, 2020

End-to-End Machine Learning Project

There are eight steps to go through in a house price prediction project.

  1. Look at the big picture.
  2. Get the data.
  3. Discover and visualize the data to gain insights.
  4. Prepare the data for Machine Learning algorithms.
  5. Select a model and train it.
  6. Fine-tune your model.
  7. Present your solution.
  8. Launch, monitor, and maintain your system.

1. Look at the big picture

Frame the Problem
The first question to ask your boss is what exactly the business objective is; building a model is probably not the end goal. How does the company expect to use and benefit from this model? The next question is what the current solution looks like (if any). It will often give you a reference performance, as well as insights into how to solve the problem.

Select a Performance Measure

Regression: Root Mean Square Error (RMSE)

Suppose there are many outlier districts. In that case, you may consider using the Mean Absolute Error (MAE) instead.

The RMSE is more sensitive to outliers than the MAE, but when outliers are exponentially rare (as in a bell-shaped curve), the RMSE performs very well and is generally preferred.
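As a minimal sketch, both metrics can be computed with Scikit-Learn; the arrays below are illustrative placeholders, not real predictions:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Illustrative placeholder values, not real model output
y_true = np.array([250000, 320000, 180000, 410000])
y_pred = np.array([240000, 350000, 190000, 380000])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more heavily
mae = mean_absolute_error(y_true, y_pred)           # more robust to outliers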

2. Get the data

You will need a number of Python modules: Jupyter, NumPy, Pandas, Matplotlib, and Scikit-Learn. Create an isolated environment to make your life easier.

After downloading the data, take a quick look at it using df.head(), df.info(), df.describe(), and value_counts().
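A minimal sketch of that quick look, assuming the data has been saved as a CSV (the file name and the ocean_proximity column are assumptions based on the housing dataset):

import pandas as pd

df = pd.read_csv("housing.csv")       # path assumed for illustration

df.head()                             # first five rows
df.info()                             # column types and missing values
df.describe()                         # summary statistics for numerical columns
df["ocean_proximity"].value_counts()  # counts for a categorical column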

Another quick way to get a feel of the type of data you are dealing with is to plot a histogram for each numerical attribute.

Then split the data into a training set and a test set.
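A quick sketch of the histograms and a simple random split; a stratified split on an income category is often preferable, but train_test_split with a fixed random_state is the simplest option:

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

df.hist(bins=50, figsize=(20, 15))   # one histogram per numerical attribute
plt.show()

# Simple random split; consider a stratified split on an income category instead
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)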

3. Discover and Visualize the Data to Gain Insights

If the dataset is not too large, you can easily compute the standard correlation coefficient (also called Pearson’s r) between every pair of attributes using the corr() method.

corr_matrix = df_train.corr()
corr_matrix['median_house_value'].sort_values(ascending=False)

median_house_value    1.000000
median_income         0.687160
income_cat            0.642274
total_rooms           0.135097
housing_median_age    0.114110
households            0.064506
total_bedrooms        0.047689
population           -0.026920
longitude             -0.047432
latitude              -0.142724

Experimenting with Attribute Combinations

Hopefully, you identified a few data quirks that you may want to clean up before feeding the data to a Machine Learning algorithm, and you found interesting correlations between attributes, in particular with the target attribute. You also noticed that some attributes have a tail-heavy distribution, so you may want to transform them.

df_train["rooms_per_household"] = df_train["total_rooms"] / df_train["households"]
df_train["bedrooms_per_room"] = df_train["total_bedrooms"] / df_train["total_rooms"]
df_train["population_per_household"] = df_train["population"] / df_train["households"]

4. Prepare the Data for Machine Learning Algorithms

Step One, deal with missing features

Step Two, convert these string labels to numbers.

Step Three, convert integer categorical values into one-hot vectors

Step Four, two common ways to get all attributes to have the same scale: min-max scaling and standardization.

You can use transformation pipelines to execute all of these steps in the right order: set up one pipeline for the numerical columns and another for the categorical columns, then combine them so the data is prepared in a single step.
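A minimal sketch of such a combined pipeline, assuming the column names of the housing dataset (the engineered ratio columns are left out here for brevity):

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_attribs = ["longitude", "latitude", "housing_median_age", "total_rooms",
               "total_bedrooms", "population", "households", "median_income"]
cat_attribs = ["ocean_proximity"]

# Numerical columns: fill missing values with the median, then standardize
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Combine with one-hot encoding of the categorical column in a single step
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

X_train_prepared = full_pipeline.fit_transform(df_train.drop("median_house_value", axis=1))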

5. Select a model and train it

The target variable is continuous, so you can use linear regression, decision tree regression, random forest regression, SVM regression, and so on. You can perform K-fold cross-validation when training each model, then compare the resulting RMSE scores to see which model performs best.
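For example, a first model such as linear regression can be trained on the prepared data in a few lines; this sketch reuses X_train_prepared and df_train from the earlier steps:

from sklearn.linear_model import LinearRegression

# Labels (target) taken from the training set
y_train = df_train["median_house_value"]

lin_reg = LinearRegression()
lin_reg.fit(X_train_prepared, y_train)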

Better Evaluation Using Cross-Validation

With K = 10, K-fold cross-validation randomly splits the training set into 10 distinct subsets called folds, then trains and evaluates the Decision Tree model 10 times, picking a different fold for evaluation each time and training on the other 9 folds. The result is an array containing the 10 evaluation scores.
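A minimal sketch with Scikit-Learn's cross_val_score, reusing X_train_prepared and y_train from above; note that cross_val_score expects a utility function (greater is better), so the scoring is the negated MSE and the sign is flipped back before taking the square root:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

tree_reg = DecisionTreeRegressor(random_state=42)

# Negated MSE because cross_val_score expects "greater is better"
scores = cross_val_score(tree_reg, X_train_prepared, y_train,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

print("RMSE per fold:", tree_rmse_scores)
print("Mean:", tree_rmse_scores.mean(), "Std:", tree_rmse_scores.std())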

6. Fine-Tune Your Model

Let’s assume that you now have a shortlist of promising models. You now need to fine-tune them.

Method One: Grid Search

One way to tune the model would be to fiddle with the hyperparameters manually until you find a great combination of hyperparameter values. This would be very tedious work, and you may not have time to explore many combinations. Instead, you should get Scikit-Learn’s GridSearchCV to search for you.
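A minimal sketch with a RandomForestRegressor; the grid values below are illustrative placeholders, not tuned recommendations:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative grid: 3 x 3 = 9 combinations, each evaluated with 5-fold CV
param_grid = {"n_estimators": [10, 30, 100], "max_features": [4, 6, 8]}

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring="neg_mean_squared_error",
                           return_train_score=True)
grid_search.fit(X_train_prepared, y_train)

print(grid_search.best_params_)
best_model = grid_search.best_estimator_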

Method Two: Randomized Search

The grid search approach is fine when you are exploring relatively few combinations, like in the previous example, but when the hyperparameter search space is large, it is often preferable to use RandomizedSearchCV instead. Rather than trying out all possible combinations, this class evaluates a given number of random combinations by selecting a random value for each hyperparameter at every iteration.
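A comparable sketch with RandomizedSearchCV, sampling n_iter random combinations from the given distributions (the ranges here are placeholders):

from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Placeholder ranges for illustration
param_distribs = {
    "n_estimators": randint(low=10, high=200),
    "max_features": randint(low=2, high=8),
}

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5,
                                scoring="neg_mean_squared_error",
                                random_state=42)
rnd_search.fit(X_train_prepared, y_train)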

Method Three: Ensemble Methods

Another way to fine-tune your system is to try to combine the models that perform best. The group (or “ensemble”) will often perform better than the best individual model (just like Random Forests perform better than the individual Decision Trees they rely on).
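One way to sketch this idea is Scikit-Learn's VotingRegressor, which simply averages the predictions of its fitted models; this is an illustration of the ensemble idea rather than the chapter's own code:

from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

# Averages the predictions of the individual regressors
voting_reg = VotingRegressor([
    ("lin", LinearRegression()),
    ("forest", RandomForestRegressor(random_state=42)),
])
voting_reg.fit(X_train_prepared, y_train)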

Method Four: Analyze the Best Models and Their Errors

You will often gain good insights into the problem by inspecting the best models. For example, the RandomForestRegressor can indicate the relative importance of each attribute for making accurate predictions. With this information, you may want to try dropping some of the less useful features (e.g., apparently only one ocean_proximity category is really useful, so you could try dropping the others).
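A minimal sketch of that inspection, assuming the grid_search, full_pipeline, and num_attribs objects from the earlier sketches; the attribute names are rebuilt only for display:

# Relative importance of each attribute, from the best random forest
feature_importances = grid_search.best_estimator_.feature_importances_

# Recover attribute names, including the one-hot encoded categories
cat_encoder = full_pipeline.named_transformers_["cat"]
attribute_names = num_attribs + list(cat_encoder.categories_[0])
print(sorted(zip(feature_importances, attribute_names), reverse=True))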

7. Present your solution

You need to present your solution (highlighting what you have learned, what worked and what did not, what assumptions were made, and what your system’s limitations are), document everything, and create nice presentations with clear visualizations and easy-to-remember statements (e.g., “the median income is the number one predictor of housing prices”).

8. Launch, Monitor, and Maintain Your System

First, you need to plug the production input data sources into your system and write tests.

Second, you need to write monitoring code to check your system’s live performance at regular intervals and trigger alerts when it drops. Performance degradation is quite common because models tend to “rot” as data evolves over time, unless they are regularly trained on fresh data.

Third, evaluating your system’s performance will require sampling the system’s predictions and evaluating them. This will generally require human analysis; the analysts may be field experts or workers on a crowdsourcing platform (such as Amazon Mechanical Turk or CrowdFlower).

In addition, make sure you evaluate the system’s input data quality. Sometimes performance will degrade slightly because of a poor-quality signal; if you monitor your system’s inputs, you may catch this earlier.

Finally, you will generally want to train your models on a regular basis using fresh data. You should automate this process as much as possible. If your system is an online learning system, you should make sure you save snapshots of its state at regular intervals so you can easily roll back to a previously working state.

Please check the GitHub link below for the Jupyter Notebook of the housing price prediction project and the answers to the exercises.

https://github.com/cssamanda0104/Hands-on-ML
