This project scrapes property listings from the Centris real estate platform website. It uses Scrapy for web scraping and Playwright for browser automation to handle dynamic content, user authentication, and page navigation. The scraped data is cleaned, structured, and saved as JSON files, with associated images downloaded to organized directories.

The scraper uses a config.json file to store sensitive information such as the username, password, and root download folder for scraped data. The file also tracks already scraped listing IDs to prevent duplicate scraping.

It logs in with the credentials from config.json, enabling access to user-specific saved searches.

config.json: Stores configuration details such as:
  user_name: Centris login email
  user_password: Centris login password
  already_scrape_id: List of previously scraped listing IDs
  root_download_folder: Directory path for saving scraped data
pipelines.py: Defines the CentrisPipeline class, which processes scraped items.
settings.py: Configures Scrapy settings, including PLAYWRIGHT_LAUNCH_OPTIONS and ROTATING_PROXY_LIST.
items.py: Defines the CentrisItem class, specifying fields for scraped data (e.g., url, id, price).
middlewares.py: Contains custom middleware:
  ScrapeOpsFakeUserAgentMiddleware: Rotates user agents via the ScrapeOps API
  MyProxyMiddleware: Manages rotating proxies (currently disabled)
centris_spider.py: Contains the main spider class (CentrisSpiderSpider) that drives the scrape.
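The duplicate tracking that config.json provides can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names load_config, is_already_scraped, and mark_scraped are hypothetical, and it assumes listing IDs can be compared as strings.

```python
import json


def load_config(path):
    """Read config.json and return its contents as a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def is_already_scraped(config, listing_id):
    """Return True if listing_id is recorded under already_scrape_id."""
    # Assumption: IDs are compared as strings so 12345678 and "12345678" match.
    known = {str(i) for i in config.get("already_scrape_id", [])}
    return str(listing_id) in known


def mark_scraped(config, listing_id, path):
    """Record listing_id (once) and write the updated config back to disk."""
    if not is_already_scraped(config, listing_id):
        config.setdefault("already_scrape_id", []).append(str(listing_id))
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=2)
```

A spider or pipeline could call is_already_scraped before requesting a listing page and mark_scraped once the listing has been saved, so that a later run skips it.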
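Create a config.json file with your Centris credentials and desired output directory, using the keys user_name, user_password, already_scrape_id, and root_download_folder.

To watch the browser while the spider runs, set "headless": False in PLAYWRIGHT_LAUNCH_OPTIONS in settings.py.

Proxy rotation can be turned on by enabling the proxy middleware in settings.py and providing a list of proxies in ROTATING_PROXY_LIST.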
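As a rough sketch, the two settings named above might look like this in settings.py. The proxy URLs are placeholders, and the URL format expected by the project's proxy middleware is an assumption; the rest of the project's settings are not shown.

```python
# settings.py (fragment; values are illustrative assumptions)

# Options that scrapy-playwright passes to Playwright when launching the browser.
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": False,  # show the browser window; use True for unattended runs
}

# Placeholder proxy endpoints, to be replaced with real ones.
ROTATING_PROXY_LIST = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
]
```

The proxy list only takes effect once the proxy middleware itself is enabled, which the source describes as currently disabled.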
Each scraped listing is stored in a folder named after its listing ID (e.g., 12345678) containing:
  12345678.json: A JSON file with structured property details
  Image files (e.g., abc123.jpg): Downloaded images associated with the listing

This project demonstrates an approach to web scraping with Scrapy and Playwright that handles authentication, dynamic content, and data storage for real estate data extraction.

Posted Apr 13, 2025
This project contains two web scraping scripts for Centris.ca.