https://search.ridho.dev/ (if I still host it)
requests installed
docker compose up -d
localhost:5000.
python3 ./scripts/run_list.py
@mozilla/readability, which is used by Firefox's reader mode. This library can extract the main content of a page and leave out things like navigation, sidebars, footers, etc.
jsonl
files, because I thought they could be consumed by tantivy-cli later. But after seeing that it had some overhead, I went with writing a Rust server that wraps the tantivy engine and can write and query the docs.
main-index and page-index. The main-index stores the whole page that Scrapy scraped and is used for the main search query. The page-index stores groups of sentence snippets for each page (so each page generates multiple snippets), which can later be queried to find the snippet most relevant to the current query.
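
To make the relationship between the two indexes a bit more concrete, here is a rough sketch in plain Python. It is not the actual Rust server or tantivy code: `TwoIndexStore`, `split_snippets`, and `score` are hypothetical names, the group size of two sentences is an assumption, and the term-count scoring is only a stand-in for whatever ranking the real engine applies.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_snippets(text, group_size=2):
    """Group consecutive sentences into snippets (group_size is an assumption)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + group_size])
        for i in range(0, len(sentences), group_size)
    ]


def score(query, text):
    """Toy relevance: how often the query terms occur in the text."""
    counts = Counter(tokenize(text))
    return sum(counts[term] for term in set(tokenize(query)))


class TwoIndexStore:
    """Hypothetical in-memory stand-in for the main-index / page-index pair."""

    def __init__(self):
        self.main_index = {}  # url -> whole page text
        self.page_index = {}  # url -> list of sentence-group snippets

    def add_page(self, url, text):
        # One page produces one main-index entry and several page-index entries.
        self.main_index[url] = text
        self.page_index[url] = split_snippets(text)

    def search(self, query, limit=10):
        # Step 1: rank whole pages using the main index.
        ranked = sorted(
            self.main_index,
            key=lambda url: score(query, self.main_index[url]),
            reverse=True,
        )
        results = []
        for url in ranked[:limit]:
            if score(query, self.main_index[url]) == 0:
                continue  # page does not match the query at all
            # Step 2: pick the most relevant snippet of that page.
            best = max(self.page_index[url], key=lambda s: score(query, s))
            results.append((url, best))
        return results


store = TwoIndexStore()
store.add_page(
    "https://example.com/a",
    "Tantivy is a search engine library. It is written in Rust. "
    "Cats are nice. Dogs bark loudly.",
)
store.add_page("https://example.com/b", "Scrapy crawls pages. It is a Python framework.")
for url, snippet in store.search("rust search"):
    print(url, "->", snippet)
```

The point of the split is that the main query only has to rank whole pages, while the snippet lookup is restricted to the pages that already matched.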