Introduction

Data collection is the initial and fundamental step in any Data Science or Analytics project, the one on which all subsequent activities rely, from data analysis to model deployment.

With the pervasive presence of APIs and Cloud Computing, I have become increasingly interested in maximizing the efficiency and level of automation of data collection activities, for both work and personal projects.

In the latter category, I have been interested in collecting data from online home-rental platforms in the UK market (Zoopla, Rightmove, OnTheMarket, and similar), with the aim of extracting image and text data to be processed for use in machine learning models. Example use cases include predicting a property’s price, extracting key features from image data to infer a listing’s true value, and processing customer reviews with NLP techniques.

In the sections that follow, I discuss how to go about:

  1. The identification of the most critical data sources
  2. The estimation of data collection costs should you want to put your solution to commercial use

I have given the article a broader cut: it touches on the market and regulatory considerations involved in collecting data for potentially commercial purposes, as well as the more technical aspects of working with APIs, since there are multiple layers to surface within this very interesting topic.

I hope the key points below will prove useful in setting up the Data Collection block of your current and future Data Science projects, whatever your industry focus.


Do your market research & identify your key data sources

Online home-rental platforms are two-sided markets, dominated by supply and demand agents: on the supply side, homeowners looking to rent out their properties, either directly or through a real-estate agent; on the demand side, individuals looking to rent. In such markets, you will find the most data, in terms of both quantity and quality, on the platforms that drive the majority of traffic from both sides.

In this sense, you need to identify the platforms that hold the majority of market power, as they attract the most eyeballs. Knowing the market’s distribution of overall traffic and data volume is very useful if you are looking to pull large amounts of data over time and do not want to integrate multiple data streams coming from smaller market players.

In the UK’s online home-rental market, the majority of the traffic and listings is concentrated among the top 1–5 players, and those companies (the left of the curve in the illustrative distribution below) are therefore the ones on which you want to focus your data collection efforts.


The Pareto Principle. Source: Mode.com
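The logic above can be sketched in a few lines of Python: given estimated traffic figures per platform (the numbers below are hypothetical, purely for illustration, not real market data), rank the platforms and find the smallest set that covers a target share, say 80%, of total traffic.

```python
# Hypothetical monthly visits (millions) per UK rental platform.
# These figures are illustrative assumptions, not real measurements.
traffic = {
    "Rightmove": 120.0,
    "Zoopla": 55.0,
    "OnTheMarket": 25.0,
    "PlatformD": 8.0,
    "PlatformE": 4.0,
    "PlatformF": 2.0,
}

def top_platforms_by_coverage(traffic, coverage=0.80):
    """Return platforms, in descending traffic order, up to the point
    where their cumulative share of total traffic reaches `coverage`."""
    total = sum(traffic.values())
    ranked = sorted(traffic.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0.0
    for name, visits in ranked:
        selected.append(name)
        cumulative += visits / total
        if cumulative >= coverage:
            break
    return selected, cumulative

platforms, share = top_platforms_by_coverage(traffic)
print(platforms, round(share, 3))
# With the hypothetical figures above, the top two platforms already
# cover ~82% of total traffic, illustrating the Pareto-like skew.
```

In practice you would replace the hypothetical dictionary with estimates from a traffic-analytics source; the point is that a small, data-driven shortlist usually emerges quickly.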

This is of course a double-edged sword: the big players you will be sourcing from have high leverage when it comes to entering data-sharing agreements, which allows them to:

1) act as de-facto gatekeepers to a particular market and set their own data usage policies, especially in a less regulated market scenario

2) charge more for the same unit of data volume when entering data-sharing agreements

3) effectively monitor potential competitive threats to their core business from startups that require access to their data and are thus more dependent on their services

#data-science #data-collection #real-estate #api #data-analysis

How to estimate data collection costs for your Data Science project