Max Bohomolov
This project covered the development and implementation of a system for large-scale collection of data on real estate properties in the United Kingdom from several online platforms, including Zoopla, Airbnb, and Unihomes. Its aim was to create an extensive database for subsequent real estate market analysis.
Key Challenges:
Data Source Heterogeneity: Each target website had a unique structure and data presentation format, requiring an individual approach to information extraction.
Data Format Diversity: Servers returned responses in different formats, including HTML pages and JSON payloads, which complicated parsing and data normalization.
Overcoming Protection Mechanisms: The scrapers needed strategies to get past web scraping protection systems such as Cloudflare and to cope with request rate limits applied to a single IP address.
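The format diversity challenge can be illustrated with a minimal sketch that routes a response body to a JSON or an HTML parser based on its content type. This is not the project's actual code: the names `extract_listing` and `TitleParser`, the `title` field, and the choice of the first `<h1>` as the listing title are all hypothetical, and only the Python standard library is used.

```python
import json
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <h1> element (assumed to hold the listing title)."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and self.title is None:
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        # Keep only the first non-empty text chunk found inside an <h1>.
        if self._in_h1 and self.title is None and data.strip():
            self.title = data.strip()


def extract_listing(body: str, content_type: str) -> dict:
    """Return a small dict in one shape, whatever format the server sent."""
    if "json" in content_type.lower():
        payload = json.loads(body)
        # Hypothetical field name; real APIs will use their own keys.
        return {"title": payload.get("title"), "source_format": "json"}
    parser = TitleParser()
    parser.feed(body)
    return {"title": parser.title, "source_format": "html"}
```

Both branches return the same dictionary shape, so the code downstream of the scraper does not need to know which format a given site served.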
Implementation Stages:
Preliminary Analysis
Database Design
Scraper Development
Implementation of Protection Bypass Measures
Testing and Optimization
Deployment and Automation
Additional Project Aspects:
Data Normalization: Development of algorithms to bring heterogeneous data to a unified format, ensuring consistency and ease of analysis.
Address Analysis and Matching: Creation of a system to identify and link the same property across different data sources, improving the completeness and reliability of the collected information.
Data Analytics: Conducting preliminary analysis of collected data to identify trends and patterns in the UK real estate market.
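Normalization and address matching can be sketched as follows, under stated assumptions. The `Listing` fields, the weekly-to-monthly price conversion, the loose postcode pattern, and the helper names `normalize_price`, `address_key`, and `group_by_address` are hypothetical illustrations, not the algorithms used in the project. A production matcher would likely need fuzzy comparison and a stricter postcode check.

```python
import re
from dataclasses import dataclass

# Loose approximation of a UK postcode such as "LS1 4AB" or "ls14ab".
# It does not implement the full official format rules.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})\b", re.IGNORECASE)


@dataclass
class Listing:
    source: str               # hypothetical source label, e.g. "site_a"
    address: str              # raw address string as scraped
    monthly_price_gbp: float  # price already converted to a monthly figure


def normalize_price(amount: float, period: str) -> float:
    """Convert a weekly or monthly price to a monthly price."""
    if period == "week":
        return round(amount * 52 / 12, 2)
    if period == "month":
        return float(amount)
    raise ValueError(f"unsupported period: {period!r}")


def address_key(address: str) -> str:
    """Build a comparison key: cleaned street text plus a compact postcode."""
    match = POSTCODE_RE.search(address)
    postcode = ""
    rest = address
    if match:
        postcode = (match.group(1) + match.group(2)).upper()
        rest = address[:match.start()] + address[match.end():]
    # Lowercase, replace punctuation with spaces, collapse whitespace.
    rest = re.sub(r"[^a-z0-9 ]", " ", rest.lower())
    return " ".join(rest.split()) + "|" + postcode


def group_by_address(listings: list[Listing]) -> dict[str, list[Listing]]:
    """Group listings from different sources that share the same address key."""
    groups: dict[str, list[Listing]] = {}
    for listing in listings:
        groups.setdefault(address_key(listing.address), []).append(listing)
    return groups
```

A short usage example with made-up records:

```python
a = Listing("site_a", "12 High Street, Leeds, LS1 4AB", normalize_price(300, "week"))
b = Listing("site_b", "12 high street leeds ls14ab", normalize_price(1300, "month"))
groups = group_by_address([a, b])  # both records land under one key
```

Exact-key grouping is the simplest possible linking strategy. It was chosen here only to keep the sketch short, and it will miss addresses that differ by abbreviations such as "St" versus "Street".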
Project Outcomes:
The system collected, processed, and analyzed large volumes of real estate data from multiple sources, overcoming the technical and methodological challenges described above. The result is a unique database that supports analysis of the UK real estate market.