https://search.ridho.dev/ (if I still host it)

requests installed

docker compose up -d

localhost:5000.

python3 ./scripts/run_list.py

@mozilla/readability, which is used by Firefox's reader mode. This library can extract the main content of a page and leave out things like the navigation, sidebar, and footer.

jsonl files, because I thought they could be consumed by tantivy-cli later. But after seeing that this had some overhead, I went with writing a Rust server that wraps the tantivy engine and can both write and query the docs.

main-index and page-index. The main-index stores the whole page that Scrapy scraped, and is used for the main search query. The page-index stores groups of sentence snippets for each page (so each page generates multiple snippets), which can later be queried to find the snippet most relevant to the current query.

Posted Jan 4, 2025
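To make the page-index idea more concrete, here is a minimal Python sketch of splitting one page into sentence-group snippets and picking the snippet that best matches a query. It is only an illustration, not the code the project uses: the real indexes live in tantivy behind the Rust server, and the function names (`make_snippets`, `best_snippet`), the group size of two sentences, and the term-count scoring are all assumptions made up for this example.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def make_snippets(page_text, group_size=2):
    """Group consecutive sentences into snippets (group_size is an assumed value)."""
    sentences = split_sentences(page_text)
    return [
        " ".join(sentences[i:i + group_size])
        for i in range(0, len(sentences), group_size)
    ]


def tokenize(text):
    """Lowercase the text and keep only alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def best_snippet(snippets, query):
    """Return the snippet with the most query-term occurrences, or None if empty.

    This simple count stands in for the relevance ranking a real
    search engine would apply; it is not how tantivy scores documents.
    """
    if not snippets:
        return None
    terms = set(tokenize(query))

    def score(snippet):
        counts = Counter(tokenize(snippet))
        return sum(counts[t] for t in terms)

    return max(snippets, key=score)


page = (
    "Tantivy is a search library. It is written in Rust. "
    "Scrapy crawls the pages. Each page is stored as a document."
)
snippets = make_snippets(page)
print(best_snippet(snippets, "scrapy pages"))
```

In the actual setup the snippets would be written to the page-index and ranked by the search engine rather than by a hand-rolled term count, but the shape of the data is the same: one page in the main-index, several snippets for that page in the page-index.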
A Docs Scraper and Search Engine built using Rust and Python