Thursday 5 November 2020

Random Forest Regression using Scikit Learn 




Background

The price of a house can be affected by many factors, and people can hold contradictory opinions about them, depending on their knowledge, experience and understanding.

To see which opinions hold true, we can analyse a housing dataset.  In this post I try to identify the most important factor influencing house prices, using a housing dataset from Taiwan.  I also look at the relationships and trends between the available factors, as they give a sense of what can be expected.


Business challenge:

To understand which factors are the most important in influencing the house price.


Dataset Information:

The historical market data set of real estate valuation was collected from Sindian Dist., New Taipei City, Taiwan and is available here 


Results:

Correlation: The data shows that house prices are higher when the distance to the nearest MRT station is shorter and when there are more convenience stores nearby.  However, because correlation cannot be treated as causation, we have to look at other metrics as well.
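
As a rough illustration of this step, here is a minimal sketch of how such correlations could be computed with pandas.  The data below is synthetic and the column names (distance_to_mrt, n_convenience_stores, house_age, price) are my own placeholders, not the names or values from the actual Taiwan dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (an assumption, not the real dataset).
rng = np.random.default_rng(0)
n = 200
distance_to_mrt = rng.uniform(50, 5000, n)        # metres to nearest station
n_convenience_stores = rng.integers(0, 11, n)     # stores nearby
house_age = rng.uniform(0, 40, n)                 # years
# Price falls with distance and rises with store count, plus noise.
price = (50 - 0.006 * distance_to_mrt
         + 1.5 * n_convenience_stores
         + rng.normal(0, 3, n))

df = pd.DataFrame({
    "distance_to_mrt": distance_to_mrt,
    "n_convenience_stores": n_convenience_stores,
    "house_age": house_age,
    "price": price,
})

# Pearson correlation of each factor with price.
corr_with_price = df.corr()["price"].drop("price")
print(corr_with_price.sort_values())
```

A negative value for the distance column and a positive value for the store count would match the pattern described above.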



Scatterplots: The scatterplot of price against distance to the nearest MRT station makes it evident that houses closer to the station are more expensive, which seems logical.  Houses also appear to be more expensive when they have more convenience stores close by.  There is no clear trend between house age and price.



Prediction: Regression is a prediction technique that can also be used to understand which factors most influence an outcome (the house price in this instance).  Random Forest regression performed very well on the data, resulting in a correlation of 0.97 between the actual and predicted house prices.
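
For readers who want to try this, here is a minimal sketch of fitting a Random Forest regressor with scikit-learn and measuring the correlation between actual and predicted prices.  It uses synthetic data with placeholder feature names, and the held-out test split and the hyperparameters are my assumptions for illustration, so the number it prints will not match the 0.97 reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (an assumption, not the real dataset).
rng = np.random.default_rng(0)
n = 400
distance_to_mrt = rng.uniform(50, 5000, n)
n_convenience_stores = rng.integers(0, 11, n)
house_age = rng.uniform(0, 40, n)
price = (50 - 0.006 * distance_to_mrt
         + 1.5 * n_convenience_stores
         + rng.normal(0, 3, n))

X = np.column_stack([distance_to_mrt, n_convenience_stores, house_age])

# Hold out a quarter of the rows so the score reflects unseen houses.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Pearson correlation between actual and predicted prices.
r = np.corrcoef(y_test, y_pred)[0, 1]
print(f"Correlation between actual and predicted price: {r:.2f}")
```

Scoring on held-out rows is a deliberate choice here: a correlation computed on the training rows alone tends to look better than the model really is.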






The variables tested for their influence on the house price were ranked in order of importance, and closeness to the MRT station came out as the most important factor in determining the house price.
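
One way to obtain such a ranking is the feature_importances_ attribute of a fitted scikit-learn Random Forest, which holds impurity-based importances.  The sketch below assumes that measure and again uses synthetic data with placeholder feature names, so it only illustrates the mechanics, not the actual ranking from the Taiwan dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the housing data (an assumption, not the real dataset).
rng = np.random.default_rng(0)
n = 400
distance_to_mrt = rng.uniform(50, 5000, n)
n_convenience_stores = rng.integers(0, 11, n)
house_age = rng.uniform(0, 40, n)
price = (50 - 0.006 * distance_to_mrt
         + 1.5 * n_convenience_stores
         + rng.normal(0, 3, n))

feature_names = ["distance_to_mrt", "n_convenience_stores", "house_age"]
X = np.column_stack([distance_to_mrt, n_convenience_stores, house_age])

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, price)

# Pair each feature with its importance and sort from most to least important.
ranking = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

The importances sum to 1, so each value can be read as a share of the model's total impurity reduction.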


 

Conclusion

Hopefully this is a helpful introduction to a straightforward implementation of the Random Forest regression algorithm.  The result helps us understand which factors are the most important in explaining the variability of the dependent variable in the dataset.  In addition, the analysis gave an insight into the relationships between the factors, which helps in building an understanding of the domain.

For the code, refer to this link

For the dataset, refer to this link
