Archive for the ‘search’ Category

Thoughts on WoT Search

Sunday, October 25th, 2020

Here is how I am looking at WoT search right now.

The work remaining on Lekythion is

  1. Constructing a decent search query in SQL
  2. Hooking the bot up to postgres
  3. Formatting the response
  4. Publishing the lisp code I use to build the index
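
As a rough illustration of the first two items, here is a minimal sketch of a parameterized full-text query. It assumes a hypothetical table `pages(url, title, body, tsv)` where `tsv` is a precomputed `tsvector` column; the table, the column names and the `build_search` helper are all invented for the example and are not the schema lekythion uses. The `%s` placeholders follow the style used by common Python postgres drivers.

```python
from typing import Tuple

# Hypothetical schema (an assumption, not the real index):
#   pages(url text, title text, body text, tsv tsvector)
SEARCH_SQL = """
SELECT url,
       title,
       ts_headline('english', body, q) AS snippet,
       ts_rank(tsv, q) AS rank
FROM pages, plainto_tsquery('english', %s) AS q
WHERE tsv @@ q
ORDER BY rank DESC
LIMIT %s
"""


def build_search(terms: str, limit: int = 10) -> Tuple[str, tuple]:
    """Return (sql, params) for a parameterized full-text search.

    The user's terms are passed as a bound parameter, never spliced
    into the SQL text, so a search string cannot alter the query.
    """
    terms = terms.strip()
    if not terms:
        raise ValueError("empty search terms")
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return SEARCH_SQL, (terms, limit)
```

With a driver that accepts `%s` placeholders, the pair returned by `build_search` would be passed straight to the cursor's `execute` call. Keeping the query builder separate from the connection also makes it easy to point the same bot at someone else's read-only index.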

After publication of the code, my hope is to see others build their own indexes and share them using the WoT. Index owners could share read-only access to their postgres instance, perhaps via whitelisted IPs, with others in the WoT. This is similar to the approach suggested by asciilifeform, but I can’t find a reference to his proposal at the moment.

Individuals could stand up their own bots and connect queries to whichever indexes they preferred, be they their own indexes or others’.

My indexes will eventually include at least:

  • All known former republican blogs [1]
  • All pages linked to in the logs
  • The logs

Perhaps at some point it will become relatively easy to quickly build an index of favorite sites/data sets and make it available.

  1. As time goes on it may be possible for me to connect directly to blog owner indexes, rather than creating my own.

July Search Update

Saturday, July 18th, 2020

Work is progressing on the search bot.

I now have a configuration that will index the btcbase.org logs. The resulting index is not perfect: anything from reddit is excluded because my IP is blocked, and many archive.is pages are not successfully indexed because archive.is periodically goes offline. There is an error when attempting to index any Bitcointalk link that I haven’t been able to resolve. Also, due to the timespan involved, many links have rotted and are lost forever. Most links provided as “shortened” links also no longer work. The results for this crawl should show up in the bot’s index about one week from now.

Work on the encyclopedia crawl progresses as well. Apify delivered a half-functional crawling script that works with their platform. At the moment I don’t have a configuration which allows the crawl to get all volumes of the encyclopedia. I am currently working with support to get this resolved.

Crawling btcbase.org/log

Thursday, June 18th, 2020

The results for lekythion’s first crawl of btcbase.org are in. The index created is not of much use because the crawler was blocked by many sites, including archive.is, reddit and tardstalk, for not complying with their robots.txt files.

Nevertheless I now have a comprehensive list of links from the #trilema logs. I think the index should include the logs themselves, but it might be convenient to be able to compartmentalize crawling external links into a separate task/configuration.

Lekythion search update: additional blogs and other things

Sunday, May 31st, 2020

I’ve added several new blogs to lekythion’s search feature, including:

trinque.org
loper-os.org
thimbron.com
ossasepia.com
fixpoint.welshcomputing.com
billymg.com
ztkfg.com
qntra.net

I intend to add more as I have time. I am also open to suggestions for additional sites and blogs to index.

The crawler still needs some tuning for many of the sites listed above. For example, there are still instances where the bot will return multiple identical results for a search term because different URLs display the same content.
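
One way to suppress such duplicates is to key each result on a hash of its normalized text rather than on its URL. The sketch below is an assumption about how that could be done, not what the crawler currently does; the `text` and `url` field names and both helper functions are hypothetical.

```python
import hashlib
import re
from typing import Dict, Iterable, List


def content_key(text: str) -> str:
    """Hash of the text with whitespace collapsed and case folded."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(results: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first result for each distinct body of content.

    Each result is assumed to be a dict with at least a "text" field;
    order is preserved, so the best-ranked copy survives.
    """
    seen = set()
    unique = []
    for result in results:
        key = content_key(result["text"])
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique
```

This only catches pages whose extracted text is identical after normalization. Pages that differ by a sidebar or a comment count would need the extraction tuned first, or a fuzzier comparison.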

Also, ranking is still very basic and doesn’t incorporate anything like PageRank, although it can search using l-distance.

In addition to adding the encyclopedia (work on which is underway), I’m considering adding Bitcoin transaction search. I wrote my own app for such purposes last year and have found it occasionally useful. I don’t know how much demand there is for publicly tracking transactions, but it wouldn’t be a big deal to set something up and try it out.

In terms of infrastructure, the bot is now running on its own VPS. The next step is automatic index updates.

Search Prototype

Tuesday, May 26th, 2020

So the search project produced a prototype, which is available in #exusiae.

It works like this:

18:00:07 thimbronion !s Bitcoin
18:00:08 lekythion 10 results
18:00:08 lekythion ³http://trilema.com/2013/bitcoin-prices-bitcoin-inflexibility/
18:00:08 lekythion Bitcoin prices, Bitcoin inflexibility on Trilema - A blog by Mircea Popescu.
18:00:08 lekythion …keeping the Bitcoin). Other than this ~4% of the Bitcoin monetary…
18:00:08 lekythion …Bitcoin. Will people stop throwing dollars at Bitcoin because Bitcoin
18:00:09 lekythion …Will people start throwing Bitcoin at dollars because Bitcoin prices…
18:00:10 lekythion ²http://trilema.com/2015/introducing-the-bitcoin-isp/
18:00:11 lekythion Introducing the Bitcoin ISP on Trilema - A blog by Mircea Popescu.
18:00:12 lekythion …Bitcoin, The Most Serene Republic Of ~. In any case, Bitcoin ISP will…
18:00:13 lekythion …Bitcoin ISP, your only avenue is to voice your concerns in #bitcoin
18:00:14 lekythion …Soon to become a Bitcoin registered company, trading as S.BISP…
18:00:15 lekythion All results can be found at ¹http://paste.deedbot.org/?id=ZwnE.

The bot currently only searches an index of trilema.com. The !s command accepts Apache Lucene queries.
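
The reply layout in the transcript above (a count line, then URL, title and up to three snippets per result, then a link to the full list) can be sketched as a small formatter. This is an illustration under assumptions, not the bot's actual code: the `format_reply` name and the `url`, `title` and `snippets` fields are hypothetical.

```python
from typing import Dict, List


def format_reply(results: List[Dict], paste_url: str,
                 shown: int = 2, max_snippets: int = 3) -> List[str]:
    """Build the lines the bot would send to the channel.

    Each result is assumed to be a dict with "url", "title" and a
    "snippets" list. Only the first `shown` results are printed in
    the channel; the rest are left to the paste.
    """
    lines = [f"{len(results)} results"]
    for result in results[:shown]:
        lines.append(result["url"])
        lines.append(result["title"])
        for snippet in result["snippets"][:max_snippets]:
            lines.append(f"…{snippet}…")
    lines.append(f"All results can be found at {paste_url}.")
    return lines
```

Capping what is shown in the channel keeps the bot from flooding it, while the paste carries the complete result set.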

I now confront some problems.

  1. Fine tuning was required. I had to tune the indexer to extract certain elements from Trilema to get the quality of the results somewhere near acceptable. This means every site is going to need tuning. For example, the good stuff is all in the div.entry class in mp-wp, while trinque.org has it somewhere else.

  2. I don’t yet know how to let others add sites they want to search. This is partially due to the first issue: if I just take lists of sites from people and don’t customize the indexer, the results won’t be great. It’s also due to not knowing the best way to allow users to configure their lists of sites to index. The first thing that popped into my head was to allow users to sign a text file that includes a list of all the sites they want to index and provide that to me. I would then do the configuration on the server manually and associate that index with their nick, such that it would be the default index searched whenever they search. Perhaps at some point users could also specify by WoT identity others’ indexes they’d like to search.
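
To make the first problem concrete, here is a minimal sketch of per-site content extraction driven by a small tuning table. It is not the indexer's real configuration: the `SITE_CONTENT_CLASS` table, the `DivTextExtractor` class and the `extract_content` function are hypothetical, and the only entry is the `div.entry` case mentioned above. A site with no entry is reported as untuned rather than guessed at.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional

# Hypothetical tuning table: host -> class of the <div> holding the article.
SITE_CONTENT_CLASS: Dict[str, str] = {"trilema.com": "entry"}


class DivTextExtractor(HTMLParser):
    """Collect the text found inside <div> elements of a given class."""

    def __init__(self, css_class: str) -> None:
        super().__init__()
        self.css_class = css_class
        self.depth = 0  # greater than 0 while inside a matching div
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # nested div inside the content div
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_content(host: str, html: str) -> Optional[str]:
    """Return the tuned content text, or None if the host is untuned."""
    css_class = SITE_CONTENT_CLASS.get(host)
    if css_class is None:
        return None
    parser = DivTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```

Each new site then costs one table entry, provided its content lives in a single classed div. Sites that need more than that would require a richer rule than a class name.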

One positive result is that after futzing around trying to use Google to find particular Trilema articles, I find using my own index to be much more productive.

Sites and documents I personally want to index:

trilema.com
loper-os.org
the blog of everyone from #ossasepia
thebitcoin.foundation
the naggum archive
Encyclopedia Britannica, 11th Edition
bitcointalk.org

Exploring WoT Search Project

Sunday, May 24th, 2020

I am in the process of exploring working on a WoT search bot. At the moment I don’t know how exactly to make a “WoT” search bot. I did a little searching of the logs, and this line from trinque aligns most with my own itch. Some possibilities:

  1. Everyone has their own search bot that searches sites they specify (presumably in their WoT).

  2. Anyone can register their list of sites they want indexed/searched [1]. Registered lists are associated with a nick. The search bot interfaces with deedbot to get WoT for your nick when you register and ranks search results originating from registered sites according to your WoT rankings when available.

  3. ??
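
The ranking idea in the second possibility can be sketched as a sort that puts the searcher's WoT rating of a site's owner ahead of plain text relevance. Everything here is an assumption made for illustration: the `rank_by_wot` name, the `site` and `score` fields, and the two lookup tables. How ratings would actually be fetched from deedbot is left out entirely; they are passed in as a plain dict.

```python
from typing import Dict, List


def rank_by_wot(results: List[Dict],
                site_owner: Dict[str, str],
                ratings: Dict[str, int]) -> List[Dict]:
    """Order results by the searcher's rating of each site's owner.

    results    -- dicts with "site" (host) and "score" (text relevance)
    site_owner -- host -> nick that registered the site
    ratings    -- nick -> the searcher's WoT rating of that nick

    Sites with no known owner, or owners the searcher has not rated,
    count as rating 0. Ties fall back to text relevance.
    """
    def sort_key(result: Dict):
        owner = site_owner.get(result["site"])
        rating = ratings.get(owner, 0) if owner is not None else 0
        return (-rating, -result["score"])

    return sorted(results, key=sort_key)
```

Letting the rating strictly dominate relevance is only one choice. A weighted blend of the two would be another, and which works better would have to be found out by trying it.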

I am not sure how to best display search results. I think I will start with maybe displaying the top 5 in chan/dm. I don’t know if I have the stomach for putting up a search webshit.

At the moment I am indexing Trilema. As soon as that finishes (if ever - currently going for about 10 hours) I’ll try to get lekythion (a bot I wrote a while ago for checking prices and logging #exusiae) serving up search results and see how that works out and gauge interest.

  1. It may be that at some point someone develops a WP plugin that the search bot can interface with that would provide “live” results and obviate the need for indexing mp-wp blogs.