Thimbron

June 18, 2020

Crawling btcbase.org/log

Filed under: bitcoin, irc, search — thimbronion @ 2:06 a.m.

The results of lekythion's first crawl of btcbase.org are in. The resulting index is not of much use: many sites, including archive.is, reddit and tardstalk, blocked the crawler for not complying with their robots.txt files.
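For reference, a robots.txt check before each fetch can be sketched in a few lines. This is a minimal illustration using Python's standard `urllib.robotparser`, not lekythion's actual code; the `allowed` helper, the `examplebot` user agent, the `example.com` URLs and the sample rules are all made up for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the rules from a string, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in ("http://example.com/log", "http://example.com/private/x"):
        print(url, allowed(EXAMPLE_ROBOTS, "examplebot", url))
```

A real crawler would fetch each site's robots.txt once per host and cache the parsed result, rather than parsing it on every request.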

Nevertheless, I now have a comprehensive list of links from the #trilema logs. I think the index should include the logs themselves, but it might be convenient to split the crawling of external links into a separate task/configuration.
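One way to picture that split: partition the collected links by host, so the log pages and the external links can be handed to separate crawl tasks. The sketch below is only an assumption about how this could look, written in Python; `partition_links`, `LOG_HOST` and the sample links are hypothetical and not part of lekythion.

```python
from urllib.parse import urlsplit

# Host of the log being indexed (taken from the post title).
LOG_HOST = "btcbase.org"


def partition_links(links, log_host=LOG_HOST):
    """Split links into (internal, external) lists by comparing hostnames."""
    internal, external = [], []
    for link in links:
        host = (urlsplit(link).hostname or "").lower()
        # Treat the log host and any of its subdomains as internal.
        if host == log_host or host.endswith("." + log_host):
            internal.append(link)
        else:
            external.append(link)
    return internal, external


if __name__ == "__main__":
    # Sample links, for illustration only.
    sample = ["http://btcbase.org/log", "https://archive.is/", "https://www.reddit.com/"]
    log_links, external_links = partition_links(sample)
    print("log task:", log_links)
    print("external task:", external_links)
```

With the two lists kept apart, the external crawl can be run, throttled or skipped without touching the index of the logs themselves.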

4 Comments

  1. It'd be rather useful to have a combined searchable index of the logs and linked items, yes. I only care about the linked items insofar as they contributed to whichever interesting conversation in the forum.

    Comment by Michael Trinque — June 18, 2020 @ 2:17 a.m.

  2. Alright - noted.

    Comment by thimbronion — June 18, 2020 @ 2:50 a.m.

  3. At some point Lobbes had a similar thing going for links in logs, also with archival support, which I found quite useful. I don't know if his thing is still active (and if so, for what channels), but IMHO this direction would be worth exploring, given how often things tend to disappear from the web.

    Comment by spyked — June 18, 2020 @ 6:44 a.m.

  4. Awesome. Correct me if I'm wrong, but his thing searches urls only, and not the content of urls, correct?

    Comment by thimbronion — June 18, 2020 @ 9:15 p.m.
