fetch-nodeinfo-bot
fetch-nodeinfo-bot crawls data reported via the nodeinfo standard by
a variety of federated services. This data commonly includes the type and version of
software in use, the number of users, the number of posts created locally on that site,
and information about which federation protocols the site supports.
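As a rough illustration of the fields described above, here is a minimal sketch of pulling them out of a nodeinfo document. It assumes the layout of the NodeInfo 2.x schema (software, usage and protocols keys); the function name summarize_nodeinfo and the sample values are made up for this example and are not taken from the bot's source.

```python
import json


def summarize_nodeinfo(doc: dict) -> dict:
    """Extract the commonly reported fields from a NodeInfo 2.x document.

    Missing keys become None (or an empty list) rather than raising,
    since servers do not always report every field.
    """
    software = doc.get("software", {})
    usage = doc.get("usage", {})
    return {
        "software": software.get("name"),
        "version": software.get("version"),
        "users": usage.get("users", {}).get("total"),
        "local_posts": usage.get("localPosts"),
        "protocols": doc.get("protocols", []),
    }


# Hypothetical sample document, invented for illustration only.
sample = json.loads("""
{
  "version": "2.0",
  "software": {"name": "mastodon", "version": "4.2.0"},
  "protocols": ["activitypub"],
  "usage": {"users": {"total": 120}, "localPosts": 4567},
  "openRegistrations": false
}
""")

summary = summarize_nodeinfo(sample)
print(summary)
```

A real server may omit some of these fields or use an older schema version, so any consumer of this data should treat every field as optional.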
It presents User-Agent: fetch-nodeinfo-bot (+https://arewedecentralizedyet.online/) and
can be blocked via robots.txt.
fetch-nodeinfo-bot gets its list of hosts to crawl from nodes.fediverse.party. The
methodology for node discovery and listing is described on that site, and this extensive documentation is
a large part of why I chose it as my node list. fetch-nodeinfo-bot does not discover other
nodes itself.
fetch-nodeinfo-bot respects robots.txt, and will not fetch nodeinfo on sites that
restrict crawlers in general, or its User-Agent specifically, from /.well-known/nodeinfo.
Thus, if you do not want your site to be included, a robots.txt file that disallows that path will block this
bot.
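To check how a given robots.txt would be interpreted, the rule described above can be sketched with Python's standard urllib.robotparser. This is only an approximation of the bot's behaviour, not its actual code: the helper name may_fetch_nodeinfo is hypothetical, and it assumes the bot matches on the product token fetch-nodeinfo-bot from its User-Agent string.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the robots.txt token is the product name from the User-Agent.
USER_AGENT = "fetch-nodeinfo-bot"
NODEINFO_PATH = "/.well-known/nodeinfo"


def may_fetch_nodeinfo(robots_txt: str, host: str) -> bool:
    """Return True if this robots.txt permits fetching the nodeinfo path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, f"https://{host}{NODEINFO_PATH}")


# Blocks this bot specifically from the nodeinfo path.
block_this_bot = """\
User-agent: fetch-nodeinfo-bot
Disallow: /.well-known/nodeinfo
"""

# Blocks all crawlers from the whole site.
block_everyone = """\
User-agent: *
Disallow: /
"""

# Restricts only an unrelated path, so nodeinfo stays fetchable.
block_other_path = """\
User-agent: *
Disallow: /private/
"""

for name, text in [
    ("block_this_bot", block_this_bot),
    ("block_everyone", block_everyone),
    ("block_other_path", block_other_path),
]:
    print(name, may_fetch_nodeinfo(text, "example.social"))
```

Either of the first two files is enough to keep a site out of the crawl; the third leaves nodeinfo reachable because it only restricts a different path.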
fetch-nodeinfo-bot is currently set to fetch each server's nodeinfo approximately once per
day, with separate TTLs for re-fetching robots.txt and for retrying after failures. It applies
a rate limit per IP address block to avoid overloading shared infrastructure (such as multiple sites on the
same physical or virtual server), and lowers this limit when it receives HTTP 429 (Too Many Requests)
responses.
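The per-block rate limiting described above could look roughly like the following sketch. Everything specific here is an assumption made for illustration: the class name BlockRateLimiter, the grouping of addresses into /24 (IPv4) and /48 (IPv6) blocks, the one-second base interval, and the choice to double the interval after a 429. The bot's real block sizes and backoff policy are not stated in this document.

```python
import ipaddress


class BlockRateLimiter:
    """Enforce a minimum interval between requests to the same IP block."""

    def __init__(self, base_interval: float = 1.0, max_interval: float = 60.0):
        self.base_interval = base_interval
        self.max_interval = max_interval
        self._interval = {}  # block -> current minimum seconds between requests
        self._next_ok = {}   # block -> earliest time the next request may start

    @staticmethod
    def block_for(ip: str) -> str:
        """Map an address to its block (assumed /24 for IPv4, /48 for IPv6)."""
        addr = ipaddress.ip_address(ip)
        prefix = 24 if addr.version == 4 else 48
        return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))

    def wait_time(self, ip: str, now: float) -> float:
        """Seconds to wait before a request to this address is allowed."""
        block = self.block_for(ip)
        return max(0.0, self._next_ok.get(block, 0.0) - now)

    def record_request(self, ip: str, now: float, status: int) -> None:
        """Record a completed request; slow the block down after a 429."""
        block = self.block_for(ip)
        interval = self._interval.get(block, self.base_interval)
        if status == 429:
            interval = min(interval * 2, self.max_interval)
        self._interval[block] = interval
        self._next_ok[block] = now + interval


limiter = BlockRateLimiter(base_interval=1.0)
limiter.record_request("192.0.2.17", now=100.0, status=200)
same_block_wait = limiter.wait_time("192.0.2.200", now=100.25)
limiter.record_request("192.0.2.200", now=101.0, status=429)
after_429_wait = limiter.wait_time("192.0.2.17", now=101.0)
other_block_wait = limiter.wait_time("198.51.100.5", now=101.0)
```

Timestamps are passed in explicitly so the logic can be exercised without sleeping; a real crawler would supply time.monotonic() and sleep for the returned wait time. Keying on the block rather than the hostname is what protects several sites that share one server.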
Data gathered by this bot is used to create the website arewedecentralizedyet.online, which compares how centralized or decentralized social networks and other Web services are in practice. Historical data and raw nodeinfo snapshots are available in the Data section of the site.
If you would like data about your site to be removed from this dataset, contact the author.
This bot is written and operated by Robert Ricci, who can be reached at rob [at] ricci [dot] io. The source code for this bot is on Codeberg.