A journey into the wondrous land of Machine Learning, or “Did I get ripped off?” (Part 1)

Two years ago, I bought an apartment. It took me a year to get a proper idea of the real estate market: make some visits, get disappointed a couple of times, and finally find the flat of my dreams (and that I perceived to be priced appropriately for the market).

But how exactly did I come up with this perceived “appropriate price”? Well, I saw a lot of ads for a lot of apartments. I found out that some properties were desirable and thus marketable (having a balcony, a good view, being in a safe district, etc), and step by step I built up my perception based on evidence of how these features impacted the price of an apartment.

In other words, I learned from experience, and this knowledge allowed me to build a mental model of how much an apartment should cost depending on its characteristics.

But like every other human being, I am fallible, and thus keep asking myself the same question: “Did I get ripped off?” Looking back at the process that led me there, it is evident that there is a way to follow the same process, but with more data, less bias, and hopefully in less time: Machine Learning.

In this series of blog posts, I am going to explore how Machine Learning could have helped me estimate the price of the apartment I bought; what tools you would need to do that, how you would proceed, and what difficulties you might encounter. Now, before we dive head first into this wondrous land, here is a quick disclaimer: the journey we are about to go through together serves as an illustration of how Machine Learning works and why it is not, in fact, magic.

You will not finish this journey with an algorithm that is able to price any apartment with extreme precision. You will not be the next real estate mogul (if it were that simple, I would be selling courses on how to become a millionaire on Facebook for 20€). Hopefully, however, if I have done my job correctly, you will understand why Machine Learning is simpler than you might think, but not as simple as it may appear.

What do you need to build a model with Machine Learning?

Labelled data

In this example, to predict property sale prices, we are looking for data that contains the characteristics of houses and apartments as well as the price they were sold for (the price being the label). Luckily, a few years ago, the French government decided to build an open data platform providing access to dozens of datasets related to the administration – such as a public drug database, real-time atmospheric pollutant concentration and many others. Better still, they included a dataset containing property transactions!

Well, here is our dataset. If you look through the webpage, you will see that there are multiple community contributions, which all enrich the data, making it suitable for a macro-analysis. For our purposes we can use the dataset provided by Etalab, the public organisation responsible for maintaining the open data platform.
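
To make the idea of labelled data concrete, here is a minimal Python sketch that separates the characteristics (the features) from the price (the label). Only the “valeur_fonciere” column name comes from the dataset used in this post; the other column names and every value in the sample are made up for illustration.

```python
import csv
import io

# Hypothetical sample: only "valeur_fonciere" is a real column name from this post.
# The other columns and all the values are invented for the example.
SAMPLE_CSV = """surface_m2,rooms,has_balcony,valeur_fonciere
45,2,1,210000
70,3,0,295000
30,1,0,150000
"""


def split_features_and_label(csv_text, label_column="valeur_fonciere"):
    """Return (features, labels): one dict of characteristics and one price per row."""
    features, labels = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The label is removed from the row so that only characteristics remain.
        labels.append(float(row.pop(label_column)))
        features.append(row)
    return features, labels


features, labels = split_features_and_label(SAMPLE_CSV)
print(features[0])
print(labels)
```

A model is trained to guess each entry of labels from the matching entry of features, which is exactly what I was doing in my head while reading ads.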

Software

This is where it gets a bit tricky. If I were a gifted data scientist, I could simply rely on an environment – such as TensorFlow, PyTorch, or scikit-learn – to load my data, and find out which algorithm is best to train a model. From here, I could determine the optimal learning parameters and then train it. Unfortunately, I am not a gifted data scientist. But, in my misfortune, I am still lucky: there are tools – generally grouped under the name “AI Studios” – which are developed exactly for this purpose, yet require none of the skills. In this example, we will use Dataiku, one of the most well-known and effective AI Studios.

Infrastructure

Obviously, the software has to run on a computer. I could try to install Dataiku on my machine, but since we are on the OVHcloud blog, it seems fitting to deploy it in the Cloud! As it turns out, you can easily deploy a Virtual Machine with Dataiku pre-installed on our Public Cloud (you could also do that on some of our competitors’ public clouds) with a free community license. With the Dataiku tool, I should be able to upload the dataset, train a model, and deploy it. Easy, right?

Working with your dataset

Once your VM is deployed, Dataiku automatically launches a web server on port 11000. To access it, simply open your browser and go to http://<your-VM-IP-address>:11000. You will then be greeted by the welcome screen prompting you to register for a free community license. Once all this is done, create your first project and follow the steps to import your first dataset (you should just have to drag and drop a few files) and save it.

Once this is done, go to the “Flow” view of your project. You should see your dataset there. Select it and you will see a wide range of possible actions.

Now, let’s get straight to the point and train a model. To do that, go to the “Lab” action and select the “Quick Model” then “Prediction” options, with “valeur_fonciere” as the target variable (as we are trying to predict the price).

Since we are not experts, let’s try the Automated Machine Learning proposed by Dataiku with an interpretable model and train it. Dataiku should automatically train two models, a decision tree and a regression, with one having a better score than the other (the default metric is the R2 score, one of several metrics that can be used to evaluate the performance of a model). Click on the better-scoring model and you will be able to deploy it. You can then use this model on any dataset that follows the same schema.
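
For the curious, the R2 score can be computed by hand. The sketch below uses the standard definition (one minus the ratio of the model's squared errors to the squared errors of always guessing the average price); it is not Dataiku's own code, and the prices are made up.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes the actual values are not all identical (otherwise SS_tot is zero).
    """
    mean_actual = sum(actual) / len(actual)
    # Squared errors of the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared errors of a naive model that always predicts the average.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [100.0, 150.0, 200.0]  # made-up prices
print(r2_score(actual, actual))                 # 1.0: perfect predictions
print(r2_score(actual, [150.0, 150.0, 150.0]))  # 0.0: always predicting the average
```

A score of 1 means the predictions match the real prices exactly, a score of 0 means the model does no better than always answering the average price, and a negative score means it does worse than that.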

Now that the model is trained, let’s try it and see if it predicts the correct price for the apartment I bought! For obvious reasons, I will not share my address, the price of my apartment or other such personal information. Therefore, for the remainder of this post, I will pretend that I bought my apartment for 100€ (lucky me!) and normalize every other price the same way.
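
The normalization itself is a simple rescaling, sketched below. The reference price of 250,000€ is purely hypothetical (it is not what I paid), as is the function name.

```python
def normalize_price(price, reference_price, reference_value=100.0):
    """Rescale a price so that reference_price maps to reference_value."""
    return price * reference_value / reference_price


# Hypothetical figures, chosen only to illustrate the arithmetic.
my_price = 250_000
print(normalize_price(my_price, my_price))  # 100.0
print(normalize_price(262_500, my_price))   # 105.0
```

Every price quoted from here on has been rescaled this way, so a figure above 100€ means “more expensive than what I paid”.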

As I mentioned above, in order to query the model, we need to build a new dataset containing the data we want to test. In my case, I just had to build a new CSV file from the header (you have to keep it in the file for Dataiku to understand it) and the line that relates to the actual transaction I made (easy, as that line already existed).

If I wanted to do that for a property I intended to buy, I would have to gather as much information as possible to fill in the fields of the schema (address, surface, postal code, geographic coordinates, etc) and build that line myself. At this stage, I could also build a dataset with multiple lines to query the model on several cases at once.
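
If you would rather not build that file by hand, a few lines of Python can do it. This is only a sketch: the column list below is a hypothetical, shortened schema (only “valeur_fonciere” is taken from the real dataset), and the rows are invented. In practice the header must match the one of the training dataset.

```python
import csv
import io

# Hypothetical, shortened schema: replace it with the real header of your dataset.
FIELDNAMES = ["adresse", "code_postal", "surface_m2", "valeur_fonciere"]


def build_query_csv(rows, fieldnames=FIELDNAMES):
    """Return CSV text made of the header line followed by one line per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()    # the header must stay in the file
    writer.writerows(rows)  # fields missing from a row are written as empty
    return buffer.getvalue()


# Invented rows: one known transaction and one property with no price yet.
rows = [
    {"adresse": "1 rue Exemple", "code_postal": "59000", "surface_m2": 45, "valeur_fonciere": 100},
    {"adresse": "2 rue Exemple", "code_postal": "59000", "surface_m2": 60},
]
print(build_query_csv(rows))
```

Write the returned text to a .csv file and you have a dataset ready to upload, with as many lines as cases you want to test.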

Once this new dataset is built, just upload it like the first time and go back to the “Flow” view. Your new dataset should appear there. Click on the model on the right of the flow and then on the “Score” action. Select your sample dataset, go with the default options and run the job. You should now have a new dataset appearing in your flow. If you explore this dataset, you will see that there is a new column at the end, containing the predictions!

In my case, for my 100€ apartment, the price predicted is 105€. That can mean two things:

  • The model we trained is quite good, and the property was a good deal!

Or…

  • The model made a lucky guess.

Let’s give it a go on the other transactions that have taken place on the same street and in the same year. Well, this time the result is underwhelming: apparently, every apartment bought on my street in 2018 is worth exactly 105€, regardless of its size or features. If I had known that, I would probably have bought a bigger apartment! It would appear that our model is not as smart as we initially thought and we still have work to do…
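
This kind of problem is easy to miss when you only score a single line, so it is worth checking the predictions as a whole. Here is a minimal sketch of such a sanity check; the function names and the predicted values are made up for the example.

```python
def distinct_predictions(predictions, decimals=2):
    """Return the sorted list of distinct predicted values, rounded to a few decimals."""
    return sorted({round(p, decimals) for p in predictions})


def looks_constant(predictions, decimals=2):
    """True when the model gives the same answer for every line it was asked about."""
    return len(distinct_predictions(predictions, decimals)) == 1


street_predictions = [105.0, 105.0, 105.0, 105.0]  # what I observed on my street
varied_predictions = [105.0, 98.5, 120.0]          # what I would have hoped for

print(looks_constant(street_predictions))  # True
print(looks_constant(varied_predictions))  # False
```

A model that answers the same price for very different apartments is clearly ignoring the characteristics that make them different.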

In this post, we explored where machine learning might come in handy, which data would be helpful, the software we need to make use of it, and the infrastructure required to run that software. We found that everyone can give machine learning a go – it is not magic – but we also found that we would have to work a bit harder to get results. Indeed, rushing into the wondrous journey as we did, we did not take the time to look more closely at the data itself and at how to make it easier for the model to exploit – nor did we look at what the model was actually observing when predicting a price. If we had, we would have realized that its predictions didn’t make sense. In other words, we did not qualify our problem enough. But let that not diminish our enthusiasm for the journey, for this is merely a small setback!

In the next post, we will go further on our journey and understand why the model we trained was ultimately useless: we will look more precisely at the data we have at our disposal and will find multiple ways to make it more suitable for the machine learning algorithms – following simple, common sense guidelines. Hopefully, with better data, we will then be able to greatly improve our model.