For web serving, this site uses data from a number of sources that have been aggregated by the Internet Yellow Pages project from the IIJ Research Laboratory. It uses the worldwide Tranco Top 1m list of the million most popular websites, compiled by the DistriNet research group. The URLs of these websites' front pages are mapped to IP addresses, which are then mapped to AS numbers. We use the sibling AS dataset from the Internet Intelligence Research Lab to aggregate ASes run by the same organization. The process is similar to that used by Habib et al. in SIGCOMM 2025, though we use a different data source and compute statistics worldwide. Note that because this style of measurement looks at the main URL for the site's front page, it will, in many cases, count the CDN serving a site rather than the site's origin server.
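The mapping chain described above (front page to IP address, IP address to AS number, sibling ASes to one organization) can be sketched roughly as follows. This is only an illustration and not the site's actual pipeline: the three lookup tables, the `sites_per_org` helper, the `.example` hostnames, the documentation-range IP addresses and AS numbers, and the "Example CDN" organization are all hypothetical stand-ins for the aggregated datasets.

```python
from collections import Counter

# Hypothetical stand-ins for the real datasets (all values are made up).
SITE_TO_IP = {
    "site-a.example": "192.0.2.10",
    "site-b.example": "198.51.100.7",
    "site-c.example": "203.0.113.5",
}
IP_TO_ASN = {
    "192.0.2.10": 64496,
    "198.51.100.7": 64497,
    "203.0.113.5": 64498,
}
# Sibling ASes: two AS numbers run by the same (hypothetical) organization.
ASN_TO_ORG = {64496: "Example CDN", 64497: "Example CDN"}


def sites_per_org(site_to_ip, ip_to_asn, asn_to_org):
    """Count sites per organization, merging sibling ASes."""
    counts = Counter()
    for site, ip in site_to_ip.items():
        asn = ip_to_asn[ip]
        # An AS with no known siblings is counted under its own AS number.
        org = asn_to_org.get(asn, f"AS{asn}")
        counts[org] += 1
    return dict(counts)


result = sites_per_org(SITE_TO_IP, IP_TO_ASN, ASN_TO_ORG)
print(result)
```

In this toy input, two of the three sites resolve into sibling ASes and are therefore counted under one organization, which is the same effect that makes a CDN appear in place of a site's origin server.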
For the public git forges, this site uses the number of "origins" of type "git" archived by Software Heritage; this is roughly equivalent to the number of git repositories they crawl. Software Heritage's archive coverage can be found on their coverage page. This likely undercounts the true number of repositories, and each "fork" counts as a separate repository.
For email services, data comes from the paper Who's Got Your Mail?: Characterizing Mail Service Provider Usage by Liu et al., which appeared at the ACM Internet Measurement Conference in 2021. The authors use a variety of methods, including MX records, SMTP banner analysis, and TLS certificate analysis, to uncover the provider serving mail for a particular domain. The statistics here come from examining the top 1 million domains on the Alexa list and finding the email providers for those domains that consistently had MX records over a period of a few years.
The measurements in this section come from the paper "Formalizing Dependence of Web Infrastructure" by Habib et al., which appeared in the ACM SIGCOMM conference in 2025. The authors of this work collected information for the most popular 10,000 websites in 150 countries using Google's CrUX dataset. These measurements were taken once in May 2023.
DNS server data is reported by the network (AS) hosting the authoritative nameservers for the domains in the set. Certificate authority (CA) data was collected by connecting to the front page of the webserver and finding the root CA of the certificate provided by the server.
This page measures the concentration of user data on a variety of systems according to the Herfindahl–Hirschman Index (HHI) and the Shannon Index. User data is only one way to measure centralization: others include network structure, legal exposure, and concentration of social and technical power.
Code and data are available on Codeberg. Comments and pull requests, including other metrics for measuring distribution and resiliency, are welcome!
The Herfindahl–Hirschman Index (HHI) is an indicator from economics used to measure competition between firms in an industry. Mathematically, HHI is the sum of the squares of the market shares of all servers, with each share expressed as a percentage. Values close to zero indicate perfectly competitive markets (e.g., many servers, with users spread evenly), while values close to 10000 indicate highly concentrated monopolies (e.g., most users on a single server). In economics, values below 100 are considered "Highly Competitive", values below 1500 are considered "Unconcentrated", and values above 2500 are considered "Highly Concentrated".
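As a minimal sketch of the calculation described above, assuming the input is simply a list of per-server user counts (the function name `hhi` and the sample counts are illustrative and not taken from the site's code):

```python
def hhi(user_counts):
    """Return the HHI (between 0 and 10000) for per-server user counts."""
    total = sum(user_counts)
    if total <= 0:
        raise ValueError("need at least one user")
    # Each server's market share as a percentage, squared and summed.
    return sum((100.0 * n / total) ** 2 for n in user_counts)


# One server holding every user: a monopoly.
print(hhi([1000]))                # 10000.0
# Four equally sized servers: 4 * 25^2.
print(hhi([250, 250, 250, 250]))  # 2500.0
```

Four equal servers already sit at the 2500 "Highly Concentrated" boundary, which shows how many evenly sized participants it takes to reach the lower bands.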
This method of measuring centralization for Internet systems is described in the paper "Formalizing Dependence of Web Infrastructure" by Habib et al., which appeared in the ACM SIGCOMM conference in 2025. The authors motivate this measure by describing it as a special case of the "Earth Mover Distance": the amount of "work" it would take to change one statistical distribution into another. Here, the empirical distribution measured for the system's current deployment is compared to an idealized fully-decentralized distribution.
The Shannon Index is an entropy-based measure used in ecological studies to measure population diversity. It is computed as Shannon entropy using the natural log: the negative sum over all servers of the "market share" times the log of the market share. Lower values indicate lower entropy (a high concentration of one species), while higher values indicate a more even population. In this context, the maximum value is the natural log of the number of servers.
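The computation described above can be sketched in a few lines. As before, this assumes a plain list of per-server user counts; the function name `shannon_index` is illustrative.

```python
import math


def shannon_index(user_counts):
    """Return the Shannon Index (natural log) for per-server user counts."""
    total = sum(user_counts)
    if total <= 0:
        raise ValueError("need at least one user")
    # Servers with zero users contribute nothing and are skipped,
    # since log(0) is undefined.
    shares = [n / total for n in user_counts if n > 0]
    return sum(-p * math.log(p) for p in shares)


even = shannon_index([250, 250, 250, 250])
skewed = shannon_index([970, 10, 10, 10])
print(even, math.log(4))  # an even split reaches the maximum, ln(4)
print(skewed)             # a dominant server gives a much lower value
```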
Because the Shannon Index is logarithmic, it is more responsive than HHI to changes in the "smaller players": participants that are already large do not change the index much when their size changes. Changes in the smaller participants, on the other hand, are more noticeable in the computed index. Thus, it is a good way of looking at systems that are undergoing change at the smaller end of the hosting scale.
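A small numeric experiment can illustrate this difference in sensitivity. The shares below are invented for the example: one participant keeps 90% of users, while the remaining 10% moves from one small participant to two.

```python
import math


def hhi(shares):
    """HHI from fractional market shares (scaled to percentages)."""
    return sum((100.0 * p) ** 2 for p in shares)


def shannon(shares):
    """Shannon Index (natural log) from fractional market shares."""
    return sum(-p * math.log(p) for p in shares if p > 0)


# Hypothetical distributions: the large player is unchanged,
# only the small end of the distribution splits in two.
before = [0.90, 0.10]
after = [0.90, 0.05, 0.05]


def relative_change(metric):
    """Relative change in a metric between the two distributions."""
    return abs(metric(after) - metric(before)) / metric(before)


hhi_change = relative_change(hhi)
shannon_change = relative_change(shannon)
print(f"HHI: {hhi_change:.1%}, Shannon Index: {shannon_change:.1%}")
```

With these inputs the HHI moves by well under 1%, while the Shannon Index moves by roughly 21%, even though the dominant participant did not change at all.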
What does this measure?
For the social networks, this site measures where user accounts are stored: in the Fediverse, these accounts are on servers (also known as instances); in the Atmosphere, they are on the PDSes that host users' data repos. All PDSes run by the company Bluesky Social PBC are aggregated in this dataset, since they are under the control of a single entity. Similarly, mastodon.social and mastodon.online are combined as they are run by the same company.
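The aggregation step might look something like the sketch below. The operator label, the `aggregate_by_operator` helper, the `small-instance.example` server, and the user counts are all made up for illustration; only the pairing of mastodon.social with mastodon.online comes from the description above.

```python
from collections import Counter

# Hypothetical mapping from server hostname to a label for its operator.
OPERATOR_OF = {
    "mastodon.social": "mastodon.social + mastodon.online",
    "mastodon.online": "mastodon.social + mastodon.online",
}

# Made-up user counts, for illustration only.
user_counts = {
    "mastodon.social": 500,
    "mastodon.online": 100,
    "small-instance.example": 40,
}


def aggregate_by_operator(counts, operator_of):
    """Sum user counts for servers that share an operator."""
    totals = Counter()
    for server, users in counts.items():
        # Servers without a known shared operator stand on their own.
        totals[operator_of.get(server, server)] += users
    return dict(totals)


merged = aggregate_by_operator(user_counts, OPERATOR_OF)
print(merged)
```

Merging before computing a concentration metric matters: treating the two servers separately would make the distribution look more even than the underlying control actually is.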
At present, Fediverse servers count Monthly Active Users, as reported by the server. Atmosphere uses the count for each PDS provided by the Bluesky relays: this count is roughly the number of active users since November 2023, when Bluesky rearranged its PDSes.