Archive for the ‘search’ Category

Crawling btcbase.org/log

Thursday, June 18th, 2020

The results of lekythion’s first crawl of btcbase.org are in. The resulting index is not of much use because the crawler was blocked by many sites, including archive.is, reddit and tardstalk, for not complying with their robots.txt files.
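For reference, here is a minimal sketch of how a crawler could check robots.txt rules before fetching a page, using Python's standard `urllib.robotparser`. It is not lekythion's actual code: the user agent string and the robots.txt content are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent; the crawler's real user agent is not given in this post.
USER_AGENT = "lekythion-crawler"

# Made-up robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_txt):
    """Build a parser from the text of a robots.txt file (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


checker = make_checker(EXAMPLE_ROBOTS)

# Ask before fetching: allowed paths return True, disallowed ones False.
print(checker.can_fetch(USER_AGENT, "http://example.com/log/"))
print(checker.can_fetch(USER_AGENT, "http://example.com/private/page"))
```

In a real crawl the robots.txt text would be fetched once per host and cached, and any URL for which `can_fetch` returns False would be skipped.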

Nevertheless, I now have a comprehensive list of links from the #trilema logs. I think the index should include the logs themselves, but it might be convenient to split the crawling of external links into a separate task/configuration.
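One way to compartmentalize the two jobs would be to split the harvested links by host, so the log pages and the external links can be fed to separate crawl configurations. The sketch below assumes the links are available as plain URL strings; the function name and the sample list are hypothetical.

```python
from urllib.parse import urlparse


def split_links(links, home_host="btcbase.org"):
    """Partition links into (internal, external) relative to home_host."""
    internal, external = [], []
    for link in links:
        host = urlparse(link).hostname or ""
        # Treat the home host and its subdomains as internal.
        if host == home_host or host.endswith("." + home_host):
            internal.append(link)
        else:
            external.append(link)
    return internal, external


# Illustrative input only; a real list would come from the crawl.
sample = [
    "http://btcbase.org/log/",
    "http://trilema.com/2015/introducing-the-bitcoin-isp/",
    "http://archive.is/",
]
internal, external = split_links(sample)
print(internal)
print(external)
```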

Lekythion search update: additional blogs and other things

Sunday, May 31st, 2020

I’ve added several new blogs to lekythion’s search feature, including:

trinque.org
loper-os.org
thimbron.com
ossasepia.com
fixpoint.welshcomputing.com
billymg.com
ztkfg.com
qntra.net

I intend to add more as I have time. I am also open to suggestions for additional sites and blogs to index.

The crawler still needs some tuning for many of the sites listed above. For example, there are still cases where the bot returns multiple identical results for a search term because different URLs display the same content.
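A common way to deal with this kind of duplication is to key each page on a hash of its normalized text and keep only the first URL seen for each key. This is a sketch of that general technique, not what the bot currently does; the function names and the sample data are made up.

```python
import hashlib


def content_key(text):
    """Hash the text after collapsing whitespace and lowercasing it."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(results):
    """Keep the first (url, text) pair for each distinct body of content."""
    seen = set()
    unique = []
    for url, text in results:
        key = content_key(text)
        if key not in seen:
            seen.add(key)
            unique.append((url, text))
    return unique


# Illustrative data: two URLs that serve the same article.
pages = [
    ("http://example.com/?p=1", "Same   article text"),
    ("http://example.com/2020/same-article/", "same article text"),
    ("http://example.com/other/", "Different text"),
]
print([url for url, _ in dedupe(pages)])
```

Exact hashing only catches pages whose extracted text is identical after normalization; pages that differ in sidebars or comment counts would need the extraction step to isolate the article body first.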

Also, ranking is still very basic and doesn’t incorporate anything like PageRank, although it can search using l-distance.
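Assuming "l-distance" here means Levenshtein (edit) distance, the idea is to count the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. A minimal sketch of the textbook dynamic-programming version follows; it illustrates the metric and is not the search library's implementation.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```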

In addition to adding the encyclopedia (work on which is underway), I’m considering adding Bitcoin transaction search. I wrote my own app for such purposes last year and have found it occasionally useful. I don’t know how much demand there is for publicly tracking transactions, but it wouldn’t be a big deal to set something up and try it out.

In terms of infrastructure, the bot is now running on its own VPS. The next step is automatic index updates.

Search Prototype

Tuesday, May 26th, 2020

So the search project produced a prototype, which is available in #exusiae.

It works like this:

18:00:07 thimbronion !s Bitcoin
18:00:08 lekythion 10 results
18:00:08 lekythion ³http://trilema.com/2013/bitcoin-prices-bitcoin-inflexibility/
18:00:08 lekythion Bitcoin prices, Bitcoin inflexibility on Trilema - A blog by Mircea Popescu.
18:00:08 lekythion …keeping the Bitcoin). Other than this ~4% of the Bitcoin monetary…
18:00:08 lekythion …Bitcoin. Will people stop throwing dollars at Bitcoin because Bitcoin
18:00:09 lekythion …Will people start throwing Bitcoin at dollars because Bitcoin prices…
18:00:10 lekythion ²http://trilema.com/2015/introducing-the-bitcoin-isp/
18:00:11 lekythion Introducing the Bitcoin ISP on Trilema - A blog by Mircea Popescu.
18:00:12 lekythion …Bitcoin, The Most Serene Republic Of ~. In any case, Bitcoin ISP will…
18:00:13 lekythion …Bitcoin ISP, your only avenue is to voice your concerns in #bitcoin
18:00:14 lekythion …Soon to become a Bitcoin registered company, trading as S.BISP…
18:00:15 lekythion All results can be found at ¹http://paste.deedbot.org/?id=ZwnE.

The bot currently only searches an index of trilema.com. The !s command accepts Apache Lucene queries.

I now confront some problems.

  1. Fine tuning was required. I had to tune the indexer to extract certain elements from Trilema to get the quality of the results somewhere near acceptable. This means every site is going to need tuning. For example, the good stuff is all in the div.entry element in mp-wp, while trinque.org has it somewhere else.

  2. I don’t yet know how to let others add sites they want to search. This is partially due to the first issue because if I just take lists of sites from people and don’t customize the indexer, the results won’t be great. It’s also due to not knowing the best way to allow users to configure their lists of sites to index. The first thing that popped into my head was to allow users to sign a text file that includes a list of all the sites they want to index and provide that to me. I would then do the configuration on the server manually and associate that index with their nick such that it would be the default index searched whenever they search. Perhaps at some point users could also specify by WoT identity others’ indexes they’d like to search.
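To make the first problem concrete, here is a rough sketch of per-site extraction using only Python's standard `html.parser`. The `div.entry` selector for mp-wp comes from the description above; the `SITE_SELECTORS` table, the class name and the function name are hypothetical, and the actual indexer may work quite differently.

```python
from html.parser import HTMLParser

# Hypothetical per-site table: which (tag, class) holds the article body.
# Each new site would need its own entry, which is the tuning cost described above.
SITE_SELECTORS = {"trilema.com": ("div", "entry")}


class EntryExtractor(HTMLParser):
    """Collect text found inside the first-level element matching tag.class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0  # nesting depth of self.tag inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self.depth > 0:
            self.depth += 1  # nested tag of the same kind inside the target
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_entry(html, host):
    """Return the whitespace-normalized text of the configured element."""
    tag, css_class = SITE_SELECTORS[host]
    parser = EntryExtractor(tag, css_class)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


page = (
    '<div class="sidebar">menu</div>'
    '<div class="entry"><p>Hello <b>world</b></p><div>inner</div></div>'
    '<div>footer</div>'
)
print(extract_entry(page, "trilema.com"))
```

Keeping the selectors in a table rather than in code would at least make adding a site a configuration change, even if someone still has to inspect each site's markup by hand.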

One positive result is that after futzing around trying to use Google to find particular Trilema articles, I find using my own index to be much more productive.

Sites and documents I personally want to index:

trilema.com
loper-os.org
the blog of everyone from #ossasepia
thebitcoin.foundation
the naggum archive
Encyclopedia Britannica, 11th Edition
bitcointalk.org

Exploring WoT Search Project

Sunday, May 24th, 2020

I am in the process of exploring working on a WoT search bot. At the moment I don’t know how exactly to make a “WoT” search bot. I did a little searching of the logs, and this line from trinque aligns most with my own itch. Some possibilities:

  1. Everyone has their own search bot that searches sites they specify (presumably in their WoT).

  2. Anyone can register their list of sites they want indexed/searched[1]. Registered lists are associated with a nick. The search bot interfaces with deedbot to get WoT for your nick when you register and ranks search results originating from registered sites according to your WoT rankings when available.

  3. ??
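As a rough illustration of the second possibility, results could be re-ordered so that sites whose owner the searcher has rated come first, highest rating first, with the text-match score as a tie-breaker. Everything in this sketch is an assumption: the data shapes, the names, and the idea that ratings are available as a plain nick-to-number mapping. It says nothing about how deedbot actually exposes WoT data.

```python
from urllib.parse import urlparse


def rank_results(results, site_owner, ratings):
    """Order (url, score) pairs by the searcher's rating of each site's owner.

    site_owner maps a host to the nick that registered it (hypothetical).
    ratings maps a nick to the searcher's rating of that nick (hypothetical).
    """
    def sort_key(result):
        url, score = result
        host = urlparse(url).hostname or ""
        rating = ratings.get(site_owner.get(host))
        has_rating = rating is not None
        # Rated sites first, then higher rating, then higher text score.
        return (has_rating, rating if has_rating else 0, score)

    return sorted(results, key=sort_key, reverse=True)


# Illustrative data only.
hits = [
    ("http://a.example/x", 0.9),
    ("http://b.example/y", 0.5),
    ("http://c.example/z", 0.7),
]
owners = {"a.example": "alice", "b.example": "bob"}
my_ratings = {"alice": 2, "bob": 5}
print(rank_results(hits, owners, my_ratings))
```

Note that this ordering puts any rated site ahead of unrated ones, even one with a negative rating; whether negatively rated sites should instead be pushed down or dropped is an open design question.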

I am not sure how to best display search results. I think I will start with maybe displaying the top 5 in chan/dm. I don’t know if I have the stomach for putting up a search webshit.

At the moment I am indexing Trilema. As soon as that finishes (if ever - currently going for about 10 hours) I’ll try to get lekythion (a bot I wrote a while ago for checking prices and logging #exusiae) serving up search results and see how that works out and gauge interest.

  1. It may be that at some point someone develops a WP plugin that the search bot can interface with that would provide “live” results and obviate the need for indexing mp-wp blogs.