SUPPLEMENTARY TEXT 3

Random Synonymization Scenarios

In the main text, we introduced a random synonymization procedure to generate scenarios of taxonomic inflation. The procedure can be thought of as simulating a random graph similar to the stochastic block model (Abbe, 2018) with overlapping communities (Latouche et al., 2011; Liu et al., 2024), but on top of a fixed graph layer that represents temporal and taxonomic constraints.

Results from random graph theory indicate that when a degree distribution satisfies certain conditions, a giant component (i.e., a connected component that contains a significant proportion of the vertices) exists with high probability (Molloy and Reed, 1995; Söderberg, 2002; Bollobás et al., 2007; Abbe, 2018). Although our case is much more constrained and structured than the settings considered in these studies, similar situations may occur. We simulated 10 random synonymization scenarios for each pair of p and q values satisfying 0 ≤ q < p ≤ 0.05 (at a resolution of 0.005) and calculated the mean size of the largest connected component (LCC) and the mean number of species for each pair. The results are presented in Figure ST3-1 and Figure ST3-2, respectively.
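The simulation loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the main text: we assume that each candidate edge of the fixed constraint layer is retained independently, with probability p for a within-block pair and q otherwise, and that the number of species in a scenario equals the number of connected components. The function names and the `(u, v, within)` edge format are hypothetical.

```python
import random
from collections import Counter


def simulate_once(n, candidate_edges, p, q, rng):
    """Draw one random synonymization scenario.

    n               -- number of nominal species (vertices 0..n-1)
    candidate_edges -- (u, v, within) tuples taken from the fixed constraint
                       layer; `within` flags a same-block pair (assumption)
    Returns (size of the largest connected component, number of components).
    """
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v, within in candidate_edges:
        # Assumed rule: keep the edge with probability p (within) or q (between).
        if rng.random() < (p if within else q):
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv

    sizes = Counter(find(x) for x in range(n))
    return max(sizes.values()), len(sizes)


def mean_over_scenarios(n, candidate_edges, p, q, n_rep=10, seed=0):
    """Average LCC size and species count over n_rep random scenarios."""
    rng = random.Random(seed)
    runs = [simulate_once(n, candidate_edges, p, q, rng) for _ in range(n_rep)]
    mean_lcc = sum(r[0] for r in runs) / n_rep
    mean_species = sum(r[1] for r in runs) / n_rep
    return mean_lcc, mean_species


def parameter_grid(step=0.005, upper=0.05):
    """All (p, q) pairs on the grid with 0 <= q < p <= upper."""
    k = round(upper / step)
    return [(i * step, j * step) for i in range(1, k + 1) for j in range(i)]
```

Calling `mean_over_scenarios` for every pair returned by `parameter_grid()` would give one mean LCC size and one mean species count per grid point, which is the kind of summary plotted in Figures ST3-1 and ST3-2. Union-find is used here only because it keeps the sketch dependency-free; any connected-components routine would serve.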

We first note that the size of the LCC of the underlying graph layer (imposed by the temporal and taxonomic constraints) is 213. Figure ST3-1 shows that the LCC size increases rapidly as p and q increase and saturates at large p and q. When p = 0.05 and q = 0.045, the mean LCC size was almost 75% of the maximum possible value (213), indicating that a single large component dominates the random network structure beyond some point in the range 0.03 < p, q < 0.05. The mean number of species (Figure ST3-2) shows a similar trend; in the most extreme case, more than two-thirds of all species were synonymized with other species. Scenarios with such giant components are very unrealistic. A visual example is shown in Figure ST3-3. When p = 0.05 and q = 0.025, an LCC of size 133 dominates the entire network structure, whereas a scenario with p = 0.02 and q = 0.01 shows a much patchier distribution of connected components, which seems more reasonable. Thus, when generating synonymization scenarios, we considered only p and q values smaller than or equal to 0.03.