Methods

Documentation of the computational methods used to generate the data presented on this web server.

Datasets and Annotation

We analyzed 200 genome annotations generated with BRAKER3 as described in Saenko et al. (2025). For all downstream analyses, only the longest isoform per gene was retained. Protein domains were annotated with pfam_scan.pl v1.6 using the Pfam database version 37.3 (Mistry et al., 2021).
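The longest-isoform selection can be illustrated with a minimal Python sketch. It assumes BRAKER-style protein IDs of the form <gene>.t<n> (for example g1.t2) and that ties are broken by keeping the first isoform seen; both are assumptions for illustration, and the function name longest_isoforms is hypothetical rather than part of the actual pipeline.

```python
def longest_isoforms(proteins: dict[str, str]) -> dict[str, str]:
    """Keep one protein per gene: the longest isoform (ties: first seen).

    `proteins` maps a sequence ID to its amino-acid sequence.
    Assumption: IDs follow the pattern <gene>.t<n>, e.g. "g1.t2".
    """
    best: dict[str, tuple[str, str]] = {}
    for seq_id, seq in proteins.items():
        # Everything before the last ".t" is treated as the gene ID.
        gene_id = seq_id.rsplit(".t", 1)[0]
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (seq_id, seq)
    # Return a mapping of the retained sequence IDs to their sequences.
    return dict(best.values())


example = {"g1.t1": "MKT", "g1.t2": "MKTAA", "g2.t1": "MA"}
kept = longest_isoforms(example)
```

In this toy input, gene g1 has two isoforms and only the longer one (g1.t2) is retained.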

From the full set of 200 species, a reduced subset of 163 species was created using the filtering script GEvol_filter_by_DOGMA_quality.py. Filtering was performed according to the following criteria:

  • DOGMA Quality Score ≥ 75%
  • Partial Domain Percentage ≤ 15%
  • For species with multiple proteomes, only the highest-quality proteome was retained (highest completeness, lowest partial domain percentage)
  • At most three species were kept per genus, prioritized by quality as above
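The four criteria above can be sketched in Python as follows. This is not the code of GEvol_filter_by_DOGMA_quality.py; the Proteome record, the function name filter_proteomes, and the order in which the criteria are applied (thresholds first, then best proteome per species, then the per-genus cap) are assumptions made for illustration. The genus is assumed to be the first word of the species name.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Proteome:
    species: str        # binomial name, e.g. "Apis mellifera"
    name: str           # hypothetical proteome identifier
    dogma_score: float  # DOGMA quality score in percent
    partial_pct: float  # partial domain percentage


def filter_proteomes(proteomes, min_score=75.0, max_partial=15.0, max_per_genus=3):
    """Apply the quality thresholds, then keep the best proteome per species
    and at most `max_per_genus` species per genus."""

    def quality(p):
        # Higher completeness first, then lower partial domain percentage.
        return (-p.dogma_score, p.partial_pct)

    passing = [p for p in proteomes
               if p.dogma_score >= min_score and p.partial_pct <= max_partial]

    best_per_species = {}
    for p in sorted(passing, key=quality):
        best_per_species.setdefault(p.species, p)  # first seen is the best

    per_genus = defaultdict(list)
    for p in sorted(best_per_species.values(), key=quality):
        genus = p.species.split()[0]
        if len(per_genus[genus]) < max_per_genus:
            per_genus[genus].append(p)
    return [p for group in per_genus.values() for p in group]


candidates = [
    Proteome("Apis mellifera", "a1", 90, 5),
    Proteome("Apis mellifera", "a2", 95, 4),
    Proteome("Apis cerana", "c", 80, 10),
    Proteome("Apis florea", "f", 78, 12),
    Proteome("Apis dorsata", "d", 76, 14),
    Proteome("Bombus terrestris", "b", 70, 5),
    Proteome("Bombus impatiens", "i", 85, 20),
]
selected = [p.name for p in filter_proteomes(candidates)]
```

In the toy data, both Bombus proteomes fail a threshold, the better of the two Apis mellifera proteomes is kept, and the fourth Apis species is dropped by the per-genus cap.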

Quality Assessment

Proteome quality was assessed with DOGMA v3.8.37.3 (Dohmen et al., 2016), run in proteome mode with the insect core set and Pfam v37.3. All other parameters were default. The insect core set includes 4182 single-domain Conserved Domain Arrangements (CDAs) and 4582 multi-domain CDAs.

Orthogroup Identification and MSA Generation with OrthoFinder3

Orthogroups and multiple sequence alignments (MSAs) were constructed using OrthoFinder v3.0.1b1 (Emms et al., 2025; Emms & Kelly, 2019), executed via the official Docker container davidemms/orthofinder:3.0.1b1 from DockerHub. The following parameter choices were made:

  • -M msa: Generate multiple sequence alignments (MSAs)
  • -t 48: Use 48 threads for parallel processing

Resulting Species-Level MSA Properties:

Dataset            Sites     Patterns   Gaps (%)   Invariant Sites (%)
200 species set    26,323    25,487     19.93      7.63
163 species set    31,125    29,660     11.85      9.60
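For readers unfamiliar with these alignment properties, the following Python sketch shows one way they could be computed from an MSA. The definitions used here are assumptions: a pattern is a distinct alignment column, gaps are "-" characters counted over the whole matrix, and a site is invariant if its non-gap characters are all identical. The tool that produced the table may define these quantities slightly differently (for example in its handling of ambiguous residues or all-gap columns), and msa_summary is a hypothetical helper name.

```python
def msa_summary(seqs: list[str], gap_chars: str = "-") -> dict[str, float]:
    """Summarise an alignment given as equal-length strings (one per taxon)."""
    n_sites = len(seqs[0])
    if any(len(s) != n_sites for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    columns = list(zip(*seqs))  # one tuple of characters per alignment site
    n_gaps = sum(ch in gap_chars for s in seqs for ch in s)
    # A column counts as invariant if it has at most one distinct non-gap state.
    n_invariant = sum(
        len({ch for ch in col if ch not in gap_chars}) <= 1 for col in columns
    )
    return {
        "sites": n_sites,
        "patterns": len(set(columns)),
        "gaps_pct": 100 * n_gaps / (n_sites * len(seqs)),
        "invariant_pct": 100 * n_invariant / n_sites,
    }


summary = msa_summary(["MKA-", "MRA-", "MKAL"])
```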

Phylogenetic Tree Construction

Maximum-Likelihood phylogenetic trees were constructed with RAxML-NG v2.0.0 (Kozlov et al., 2019). The best-fit evolutionary model was selected using:

raxml-ng-2 --msa <msa-file> --model AA

  • Best model for the 200 species set: JTT+FC+IU{0.071649}+G4m{0.91863}
  • Best model for the 163 species set: JTT+FC+IU{0.090326}+G4m{0.937173}

Trees were inferred with:

raxml-ng-2 --all --msa <msa-file> --model <evolutionary-model> --seed 7 --threads <n> --bs-metric fbp,tbe

200 Species Set Parameters:
run mode: Tree with branch support (adaptive) (Felsenstein Bootstrap + Transfer Bootstrap)
start tree(s): adaptive
bootstrap replicates: parsimony (max: 1000) + bootstopping (autoMRE, cutoff: 0.030000)
random seed: 7
tip-inner: OFF
pattern compression: ON
per-rate scalers: OFF
site repeats: ON
logLH epsilon: general: 10.000000, brlen-triplet: 1000.000000
stopping rule: KH
fast spr radius: AUTO
spr subtree cutoff: 1.000000
fast CLV updates: ON
branch lengths: proportional (ML estimate, algorithm: NR-FAST)
SIMD kernels: AVX
parallelization: coarse-grained (auto), PTHREADS (20 threads), thread pinning: OFF

163 Species Set Parameters:
run mode: ML tree search + bootstrapping (adaptive) (Felsenstein Bootstrap + Transfer Bootstrap)
start tree(s): adaptive
bootstrap replicates: parsimony (max: 1000) + bootstopping (autoMRE, cutoff: 0.030000)
random seed: 7
tip-inner: OFF
pattern compression: ON
per-rate scalers: OFF
site repeats: ON
logLH epsilon: general: 10.000000, brlen-triplet: 1000.000000
stopping rule: KH
fast spr radius: AUTO
spr subtree cutoff: 1.000000
fast CLV updates: ON
branch lengths: proportional (ML estimate, algorithm: NR-FAST)
SIMD kernels: AVX2
parallelization: coarse-grained (auto), PTHREADS (48 threads), thread pinning: OFF

Tree Post-processing:

The resulting tree was post-processed in Python 3.12 with the ETE4 library (Huerta-Cepas et al., 2016) in the following steps:

  • Rooted at Folsomia candida using the .set_outgroup() function
  • Made ultrametric using the .to_ultrametric() function
  • Scaled so that the root node (the last common ancestor) has an age of 430 million years

This temporal scaling is consistent with the estimates of Thomas et al. (2020) and Misof et al. (2014).
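The scaling step amounts to multiplying every branch length by a constant factor once the tree is ultrametric. The sketch below illustrates this on a small stand-in tree structure written in plain Python; it deliberately does not use ETE4, and the Node class and the functions root_height and scale_to_age are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str = ""
    dist: float = 0.0  # length of the branch leading to this node
    children: list["Node"] = field(default_factory=list)


def root_height(node: Node) -> float:
    """Root-to-tip distance, following the first child at each step.

    In an ultrametric tree every leaf is equally far from the root,
    so any single path gives the tree height.
    """
    height = 0.0
    while node.children:
        node = node.children[0]
        height += node.dist
    return height


def scale_to_age(root: Node, root_age: float = 430.0) -> None:
    """Rescale all branch lengths in place so the root sits at `root_age`."""
    factor = root_age / root_height(root)
    stack = [root]
    while stack:
        node = stack.pop()
        node.dist *= factor
        stack.extend(node.children)


# Toy ultrametric tree ((A:1,B:1):1,C:2) with height 2.
a, b, c = Node("A", 1.0), Node("B", 1.0), Node("C", 2.0)
tree = Node("root", 0.0, [Node("AB", 1.0, [a, b]), c])
scale_to_age(tree, 430.0)
```

After scaling, the units of the branch lengths can be read as millions of years, with the tips at the present.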

Domain Rearrangement Events Reconstruction with DomRates

Ancestral domain content across the phylogenetic tree and domain rearrangement events were reconstructed with DomRates (Dohmen et al., 2020).

Analysis Parameters:
  • Run on 16 cores (-p parameter)
  • Detailed statistics enabled (-s and -d parameters)
  • Outgroup: Folsomia candida (-g parameter)

Input Data:
  • Pfam annotation files of the longest isoforms (-a parameter), as described in the Datasets and Annotation section
  • Rooted RAxML-NG tree (-t parameter), as described in the Phylogenetic Tree Construction section

GO-term Enrichment Analysis

Gene Ontology (GO) term enrichment analysis is carried out with the topGO package in R (Alexa et al., 2006) using the scripts analyseGo.r and domain2topGo.py. It is based on the DomRates results described in the Domain Rearrangement Events Reconstruction with DomRates section.

GO Universe Composition:

The GO universe comprises the domain arrangements found across all species together with the reconstructed domain arrangement sets of the ancestral nodes.

GO-term Annotation and Comparison:

New domain arrangements that can be explained by an exact or unambiguous solution (see DomRates) are annotated with the pfam2go mapping (v37.3) of Pfam domains to GO terms (Mitchell et al., 2015). The GO terms of these new domain arrangements are then compared to the GO terms of the GO universe described above, either per node or for the whole tree.
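As an illustration of the annotation step, the Python sketch below parses lines in the pfam2go text format and assigns GO terms to a domain arrangement. Two points are assumptions rather than statements about domain2topGo.py: that arrangements are represented as lists of Pfam accessions, and that an arrangement receives the union of the GO terms of its domains. The function names are hypothetical.

```python
def parse_pfam2go(lines):
    """Parse pfam2go lines into {Pfam accession: set of GO IDs}.

    Expected line format (comment lines start with "!"):
    Pfam:PF00001 7tm_1 > GO:G protein-coupled receptor activity ; GO:0004930
    """
    mapping = {}
    for line in lines:
        if line.startswith("!") or ">" not in line:
            continue
        left, right = line.split(">", 1)
        pfam_acc = left.split()[0].removeprefix("Pfam:")
        go_id = right.rsplit(";", 1)[1].strip()  # the ID follows the last ";"
        mapping.setdefault(pfam_acc, set()).add(go_id)
    return mapping


def arrangement_go_terms(arrangement, pfam2go):
    """Union of the GO terms of all domains in an arrangement (assumed rule)."""
    return set().union(*(pfam2go.get(domain, set()) for domain in arrangement))


pfam2go_lines = [
    "!version date: example",
    "Pfam:PF00001 7tm_1 > GO:G protein-coupled receptor activity ; GO:0004930",
    "Pfam:PF00001 7tm_1 > GO:membrane ; GO:0016020",
    "Pfam:PF00002 7tm_2 > GO:membrane ; GO:0016020",
]
pfam2go = parse_pfam2go(pfam2go_lines)
terms = arrangement_go_terms(["PF00001", "PF00002", "PF99999"], pfam2go)
```

Domains without an entry in the mapping (PF99999 in the example) simply contribute no GO terms.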

Enrichment Analysis:
  • Ontologies analyzed: Molecular Function and Biological Process
  • Algorithm: topGO's weight01 method
  • Significance threshold: P-value ≤ 0.05
  • Visualization: Word clouds generated with make_wordcloud.py
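To make the idea of an enrichment test concrete, the sketch below computes a one-sided hypergeometric (Fisher-type) p-value for a single GO term in Python. This is the classic per-term test, not topGO's weight01 algorithm, which additionally takes the GO graph structure into account; it is shown only as a simplified stand-in. The function name and the example counts are hypothetical.

```python
from math import comb


def enrichment_p_value(universe_size: int, universe_hits: int,
                       sample_size: int, sample_hits: int) -> float:
    """P(X >= sample_hits) for X ~ Hypergeometric.

    universe_size: arrangements in the GO universe
    universe_hits: universe arrangements annotated with the GO term
    sample_size:   new arrangements under test (a subset of the universe)
    sample_hits:   new arrangements annotated with the GO term
    """
    total = comb(universe_size, sample_size)
    upper = min(universe_hits, sample_size)
    tail = sum(
        comb(universe_hits, k) * comb(universe_size - universe_hits, sample_size - k)
        for k in range(sample_hits, upper + 1)
    )
    return tail / total


# Toy example: 3 of 10 universe arrangements carry the term,
# and all 3 sampled arrangements carry it.
p = enrichment_p_value(universe_size=10, universe_hits=3, sample_size=3, sample_hits=3)
significant = p <= 0.05
```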

Mapping of Sequence IDs, Orthogroups and Domain Arrangements

NCBI to BRAKER Sequence ID Mapping:

The mapping between NCBI Sequence IDs and BRAKER Sequence IDs is based on a reciprocal BLASTp search in which only 1:1 top hits are reported. The reciprocal BLASTp analysis was performed and provided by Chetan Munegowda. The following NCBI annotations were used:

  • Drosophila melanogaster: NCBI annotation GCF_000001215.4
  • Tribolium castaneum: NCBI annotation GCF_000002335.3

From the BRAKER annotations, only the longest isoforms were used, while all isoforms from the NCBI annotations were included in the analysis.
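The reciprocal-best-hit logic can be sketched in Python as follows. The sketch assumes that BLASTp results have already been reduced to (query, subject, bit score) tuples and that the top hit is the one with the highest bit score; the actual analysis may have ranked hits differently (for example by E-value). The function names and sequence IDs are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, bitscore) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Keep pairs that are each other's top hit in both search directions."""
    fwd, rev = best_hits(forward), best_hits(reverse)
    return {q: s for q, s in fwd.items() if rev.get(s) == q}


# Toy data: NCBI proteins searched against BRAKER proteins and vice versa.
ncbi_vs_braker = [("NP_1", "g1.t1", 200), ("NP_1", "g2.t1", 150), ("NP_2", "g2.t1", 90)]
braker_vs_ncbi = [("g1.t1", "NP_1", 198), ("g2.t1", "NP_1", 140), ("g2.t1", "NP_2", 80)]
pairs = reciprocal_best_hits(ncbi_vs_braker, braker_vs_ncbi)
```

In the toy data, NP_2 finds g2.t1 as its top hit, but g2.t1 prefers NP_1, so that pair is not reported.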

Orthogroup Mapping:

The orthogroups from OrthoFinder3 are mapped to BRAKER Sequence IDs using the OrthoFinder3 results file Orthogroups.tsv.
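A minimal Python sketch of such a mapping is shown below. It assumes the usual Orthogroups.tsv layout: a tab-separated table with an "Orthogroup" column followed by one column per species, each cell holding a comma-separated list of sequence IDs. Because BRAKER-style IDs such as g1.t1 may recur in different species, the sketch keys the result by (species, sequence ID); this choice and the function name are assumptions for illustration.

```python
import csv
import io


def gene_to_orthogroup(tsv_text: str) -> dict[tuple[str, str], str]:
    """Map (species, sequence ID) to its orthogroup ID."""
    mapping = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        orthogroup = row.pop("Orthogroup")
        for species, cell in row.items():
            # Empty cells mean the species has no member in this orthogroup.
            for gene in (cell or "").split(","):
                gene = gene.strip()
                if gene:
                    mapping[(species, gene)] = orthogroup
    return mapping


example_tsv = (
    "Orthogroup\tspA\tspB\n"
    "OG0000000\tg1.t1, g2.t1\tg5.t1\n"
    "OG0000001\t\tg7.t1\n"
)
og_map = gene_to_orthogroup(example_tsv)
```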

Pfam Domain Arrangement Mapping:

For Pfam domain arrangement mappings, the BRAKER Sequence IDs were mapped to the annotated Pfam domain arrangements based on the annotation files mentioned above in the Datasets and Annotation section.