Methods

Documentation of the computational methods used to generate the data presented on this webserver

Usage and Citation

All analyses presented on this webserver were conducted by Elias Dohmen, who also developed the webserver platform. Where data from other sources were used, the source is indicated in the respective section below. The results and data presented here are freely available for use and publication by anyone, provided appropriate attribution is given.

When using data from this webserver, please reference this webserver and its URL in your publications. For specific methodological details and tool citations, refer to the individual method sections below.

For any questions, please contact Elias Dohmen.

Citation Guidelines

Reference this webserver in publications
Cite specific tools from Methods sections
See References section for original papers

Datasets and Annotation

We analyzed 200 genome annotations generated with BRAKER3 as described in Saenko et al. (2025). For all downstream analyses, only the longest isoform per gene was retained. Protein domains were annotated with pfam_scan.pl v1.6 using the Pfam database version 37.3 (Mistry et al., 2021).
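Retaining the longest isoform per gene can be sketched as follows. This is a minimal illustration, not the script used for this webserver. It assumes BRAKER-style transcript IDs of the form g1.t1 (gene ID followed by .t and an isoform number); the function name longest_isoforms is hypothetical.

```python
def longest_isoforms(proteins):
    """Keep the longest protein sequence per gene.

    `proteins` maps transcript IDs to protein sequences.
    Assumption: IDs look like 'g12.t3', so everything before the
    last '.t' is treated as the gene ID.
    """
    best = {}
    for transcript_id, seq in proteins.items():
        gene_id = transcript_id.rsplit(".t", 1)[0]
        current = best.get(gene_id)
        # Replace only if strictly longer, so ties keep the first isoform seen.
        if current is None or len(seq) > len(current[1]):
            best[gene_id] = (transcript_id, seq)
    return {tid: seq for tid, seq in best.values()}


# Hypothetical input: two isoforms of gene g1, one isoform of gene g2.
example = {"g1.t1": "MKT", "g1.t2": "MKTAYL", "g2.t1": "MSS"}
print(longest_isoforms(example))
```

How ties between equally long isoforms were resolved in the actual analysis is not stated; the sketch keeps the first one encountered.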

From the full set of 200 species, a reduced subset of 163 species was created using the filtering script GEvol_filter_by_DOGMA_quality.py. Filtering was performed according to the following criteria:

DOGMA Quality Score ≥ 75%
Partial Domain Percentage ≤ 15%
For species with multiple proteomes, only the highest-quality proteome was retained (highest completeness, lowest partial domain score)
At most three species were kept per genus, prioritized by quality as above
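The filtering criteria above can be sketched in Python as follows. This is not the code of GEvol_filter_by_DOGMA_quality.py; the record layout, the function name, and the order in which the criteria are applied (thresholds first, then one proteome per species, then the genus cap) are assumptions made for illustration.

```python
from collections import defaultdict


def filter_proteomes(records, min_quality=75.0, max_partial=15.0, max_per_genus=3):
    """Apply the four filtering criteria to a list of proteome records.

    Each record is a dict with the hypothetical keys 'proteome', 'species',
    'quality' (DOGMA quality score, %) and 'partial' (partial domains, %).
    Species names are assumed to be binomials ('Genus species').
    """
    def rank(record):
        # Higher quality first, then lower partial-domain percentage.
        return (-record["quality"], record["partial"])

    # Criteria 1 and 2: quality and partial-domain thresholds.
    passing = [r for r in records
               if r["quality"] >= min_quality and r["partial"] <= max_partial]

    # Criterion 3: keep only the best proteome per species.
    best_per_species = {}
    for record in sorted(passing, key=rank):
        best_per_species.setdefault(record["species"], record)

    # Criterion 4: keep at most `max_per_genus` species per genus.
    by_genus = defaultdict(list)
    for record in sorted(best_per_species.values(), key=rank):
        genus = record["species"].split()[0]
        if len(by_genus[genus]) < max_per_genus:
            by_genus[genus].append(record)

    return [record for group in by_genus.values() for record in group]
```

Sorting by the same ranking key in both steps keeps the prioritization consistent: the best proteome represents each species, and the best species represent each genus.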

Quality Assessment

Proteome quality was assessed with DOGMA v3.8.37.3 (Dohmen et al., 2016), run in proteome mode with the insect core set and Pfam v37.3. All other parameters were default. The insect core set includes 4182 single-domain Conserved Domain Arrangements (CDAs) and 4582 multi-domain CDAs.

Orthogroup Identification and MSA Generation with OrthoFinder3

Orthogroups and multiple sequence alignments (MSAs) were constructed using OrthoFinder v3.0.1b1 (Emms et al., 2025; Emms & Kelly, 2019), run via the official Docker container davidemms/orthofinder:3.0.1b1 from DockerHub. The following parameters were set:

-M msa: Generate multiple sequence alignments (MSAs)
-t 48: Use 48 threads for parallel processing

Resulting Species-Level MSA Properties:

Dataset	Sites	Patterns	Gaps (%)	Invariant Sites (%)
200 species set	26,323	25,487	19.93	7.63
163 species set	31,125	29,660	11.85	9.60
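For orientation, the four properties in the table can be computed from an alignment roughly as sketched below. The values above were presumably taken from the tool output; this sketch uses simple definitions that are assumptions (a pattern is a distinct alignment column, and an invariant site is a column with exactly one distinct non-gap residue), so its numbers may differ slightly from those reported by RAxML-NG.

```python
def msa_stats(alignment, gap="-"):
    """Summarize an alignment given as a list of equal-length sequences."""
    n_seqs = len(alignment)
    n_sites = len(alignment[0])
    if any(len(seq) != n_sites for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")

    # Transpose the alignment into columns (one string per site).
    columns = ["".join(seq[i] for seq in alignment) for i in range(n_sites)]

    n_gaps = sum(col.count(gap) for col in columns)
    # Assumed definition: exactly one distinct residue once gaps are ignored.
    n_invariant = sum(1 for col in columns if len(set(col) - {gap}) == 1)

    return {
        "sites": n_sites,
        "patterns": len(set(columns)),  # distinct columns
        "gaps_pct": 100.0 * n_gaps / (n_seqs * n_sites),
        "invariant_pct": 100.0 * n_invariant / n_sites,
    }


# Toy alignment: three sequences, four sites.
print(msa_stats(["MKTM", "MKSM", "MR-M"]))
```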

Phylogenetic Tree Construction

Maximum-likelihood phylogenetic trees were constructed with RAxML-NG v2.0.0 (Kozlov et al., 2019). The best-fit evolutionary model was selected using:

raxml-ng-2 --msa <msa-file> --model AA

Best model for the 200 species set: JTT+FC+IU{0.071649}+G4m{0.91863}
Best model for the 163 species set: JTT+FC+IU{0.090326}+G4m{0.937173}

Trees were inferred with:

raxml-ng-2 --all --msa <msa-file> --model <evolutionary-model> --seed 7 --threads <n> --bs-metric fbp,tbe

200 Species Set Parameters:

run mode: Tree with branch support (adaptive) (Felsenstein Bootstrap + Transfer Bootstrap)
start tree(s): adaptive
bootstrap replicates: parsimony (max: 1000) + bootstopping (autoMRE, cutoff: 0.030000)
random seed: 7
tip-inner: OFF
pattern compression: ON
per-rate scalers: OFF
site repeats: ON
logLH epsilon: general: 10.000000, brlen-triplet: 1000.000000
stopping rule: KH
fast spr radius: AUTO
spr subtree cutoff: 1.000000
fast CLV updates: ON
branch lengths: proportional (ML estimate, algorithm: NR-FAST)
SIMD kernels: AVX
parallelization: coarse-grained (auto), PTHREADS (20 threads), thread pinning: OFF

163 Species Set Parameters:

run mode: ML tree search + bootstrapping (adaptive) (Felsenstein Bootstrap + Transfer Bootstrap)
start tree(s): adaptive
bootstrap replicates: parsimony (max: 1000) + bootstopping (autoMRE, cutoff: 0.030000)
random seed: 7
tip-inner: OFF
pattern compression: ON
per-rate scalers: OFF
site repeats: ON
logLH epsilon: general: 10.000000, brlen-triplet: 1000.000000
stopping rule: KH
fast spr radius: AUTO
spr subtree cutoff: 1.000000
fast CLV updates: ON
branch lengths: proportional (ML estimate, algorithm: NR-FAST)
SIMD kernels: AVX2
parallelization: coarse-grained (auto), PTHREADS (48 threads), thread pinning: OFF

Tree Post-processing:

Each resulting tree was post-processed in Python 3.12 with the ETE4 library (Huerta-Cepas et al., 2016) in the following steps:

Rooted at Folsomia candida using the .set_outgroup() function
Made ultrametric using the .to_ultrametric() function
Scaled so that the root node (last common ancestor) has an age of 430 million years

This temporal scaling is consistent with the estimates of Thomas et al. (2020) and Misof et al. (2014).
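The final scaling step amounts to multiplying every branch length by a constant factor. The sketch below shows that arithmetic on a minimal, hypothetical Node class rather than on ETE4 objects, so that it runs without external libraries. It assumes the tree is already rooted and ultrametric, as produced by the two preceding steps.

```python
class Node:
    """Minimal tree node; `dist` is the length of the branch above the node."""

    def __init__(self, name, dist=0.0, children=()):
        self.name = name
        self.dist = dist
        self.children = list(children)


def root_to_tip(node):
    """Distance from `node` to a leaf (identical for all leaves if ultrametric)."""
    height = 0.0
    while node.children:
        node = node.children[0]
        height += node.dist
    return height


def scale_to_root_age(root, root_age=430.0):
    """Rescale all branch lengths so the root-to-tip distance equals `root_age`."""
    factor = root_age / root_to_tip(root)
    stack = list(root.children)
    while stack:
        node = stack.pop()
        node.dist *= factor
        stack.extend(node.children)
    return factor


# Toy ultrametric tree ((A:1,B:1)AB:1,C:2)root with a root-to-tip distance of 2.
a, b, c = Node("A", 1.0), Node("B", 1.0), Node("C", 2.0)
root = Node("root", 0.0, [Node("AB", 1.0, [a, b]), c])
factor = scale_to_root_age(root, root_age=430.0)
print(factor, root_to_tip(root))
```

Because a single factor is applied to all branches, relative node ages are unchanged; only the time unit is fixed so that the root sits at 430 million years.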

Domain Rearrangement Events Reconstruction with DomRates

Ancestral domain content across the phylogenetic tree and domain rearrangement events were reconstructed with DomRates (Dohmen et al., 2020).

Analysis Parameters:

Run on 16 cores (-p parameter)
Detailed statistics enabled (-s and -d parameters)
Outgroup: Folsomia candida (-g parameter)

Input Data:

Pfam annotation files of the longest isoforms (-a parameter), as described in the Datasets and Annotation section
Rooted RAxML-NG tree (-t parameter), as described in the Phylogenetic Tree Construction section

GO-term Enrichment Analysis

Gene Ontology (GO) term enrichment analysis was carried out with the topGO package in R (Alexa et al., 2006) using the scripts analyseGo.r and domain2topGo.py. It is based on the DomRates results described in the Domain Rearrangement Events Reconstruction with DomRates section.

GO Universe Composition:

The GO universe comprises all domain arrangements present across all species, together with the reconstructed domain arrangement sets of the ancestral nodes.

GO-term Annotation and Comparison:

New domain arrangements that can be explained by an exact or unambiguous solution (see DomRates) were annotated with the pfam2go mapping (v37.3) of Pfam domains to GO terms (Mitchell et al., 2015). The GO terms of these new domain arrangements were then compared to the GO terms of the GO universe described above, either per node or for the whole tree.
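Annotating a domain arrangement via pfam2go can be sketched as below. This is not the code of domain2topGo.py. It assumes that arrangements are given as lists of Pfam accessions without version suffix and that an arrangement receives the union of the GO terms of its domains; both points, and the function names, are assumptions. The two input lines are illustrative examples in the pfam2go file format.

```python
from collections import defaultdict


def parse_pfam2go(lines):
    """Map Pfam accessions to sets of GO IDs.

    Expects lines such as
    'Pfam:PF00001 7tm_1 > GO:G protein-coupled receptor activity ; GO:0004930';
    lines starting with '!' are comments.
    """
    mapping = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue
        left, go_id = line.rsplit(" ; ", 1)
        accession = left.split()[0].removeprefix("Pfam:")
        mapping[accession].add(go_id)
    return mapping


def arrangement_go_terms(arrangement, mapping):
    """Union of the GO terms of all domains in one arrangement."""
    terms = set()
    for domain in arrangement:
        terms |= mapping.get(domain, set())
    return terms


pfam2go_lines = [
    "!illustrative header line",
    "Pfam:PF00001 7tm_1 > GO:G protein-coupled receptor activity ; GO:0004930",
    "Pfam:PF00001 7tm_1 > GO:G protein-coupled receptor signaling pathway ; GO:0007186",
]
mapping = parse_pfam2go(pfam2go_lines)
print(sorted(arrangement_go_terms(["PF00001", "PF99999"], mapping)))
```

Domains without an entry in the mapping (here the made-up accession PF99999) simply contribute no GO terms.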

Enrichment Analysis:

Ontologies analyzed: Molecular Function and Biological Process
Algorithm: topGO's weight01 method
Significance threshold: P-value ≤ 0.05
Visualization: Word clouds generated with make_wordcloud.py
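As background for the significance threshold, the sketch below computes a one-sided Fisher's exact test P-value for a single GO term. It is not an implementation of weight01: topGO's weight01 additionally takes the GO graph structure into account when scoring terms. The use of Fisher's exact test as the underlying statistic is an assumption, since the test statistic is not stated above, and the function and parameter names are hypothetical.

```python
from math import comb


def fisher_enrichment_p(hits_in_set, set_size, hits_in_universe, universe_size):
    """One-sided Fisher's exact test (upper tail of the hypergeometric distribution).

    Probability of observing at least `hits_in_set` annotated items when
    drawing `set_size` items from a universe of `universe_size` items,
    of which `hits_in_universe` carry the GO term.
    """
    upper = min(set_size, hits_in_universe)
    total = comb(universe_size, set_size)
    tail = sum(
        comb(hits_in_universe, i) * comb(universe_size - hits_in_universe, set_size - i)
        for i in range(hits_in_set, upper + 1)
    )
    return tail / total


# Toy numbers: 2 of 4 new arrangements carry a term found in 3 of 10 universe items.
p_value = fisher_enrichment_p(2, 4, 3, 10)
print(p_value, p_value <= 0.05)
```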

Mapping of Sequence IDs, Orthogroups and Domain Arrangements

NCBI to BRAKER Sequence ID Mapping:

The mapping between NCBI Sequence IDs and BRAKER Sequence IDs is based on a reciprocal BLASTp search in which only 1:1 top hits are reported. The reciprocal BLASTp analysis was performed and provided by Chetan Munegowda. The following NCBI annotations were used:

Drosophila melanogaster: NCBI annotation GCF_000001215.4
Tribolium castaneum: NCBI annotation GCF_000002335.3

From the BRAKER annotations, only the longest isoforms were used, while all isoforms from the NCBI annotations were included in the analysis.
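The reciprocal top-hit criterion can be sketched as follows. This is an illustration only, not the analysis provided by Chetan Munegowda. It assumes that the top hit is the subject with the highest bit score and that hits are available as (query, subject, bit score) tuples; the sequence IDs in the example are made up.

```python
def best_hits(blast_rows):
    """Return the highest-scoring subject for each query.

    `blast_rows` is an iterable of (query, subject, bitscore) tuples.
    """
    best = {}
    for query, subject, bitscore in blast_rows:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_rows, reverse_rows):
    """Keep pairs where each sequence is the other's top hit (1:1)."""
    forward = best_hits(forward_rows)
    reverse = best_hits(reverse_rows)
    return {q: s for q, s in forward.items() if reverse.get(s) == q}


# Hypothetical hits: BRAKER proteins against NCBI proteins, and the reverse search.
braker_vs_ncbi = [("g1.t1", "NP_A", 250.0), ("g1.t1", "NP_B", 90.0), ("g2.t1", "NP_A", 120.0)]
ncbi_vs_braker = [("NP_A", "g1.t1", 248.0), ("NP_B", "g2.t1", 80.0)]
print(reciprocal_best_hits(braker_vs_ncbi, ncbi_vs_braker))
```

In the example, g2.t1 is dropped because its top hit NP_A points back to g1.t1 rather than to g2.t1.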

NCBI Protein ID to Gene Information Mapping:

NCBI Protein IDs were subsequently mapped to NCBI Gene IDs, Gene Symbols, and FlyBase IDs using the NCBI datasets CLI tool v18.7.0 (https://www.ncbi.nlm.nih.gov/datasets/) with the following command:

datasets summary gene accession <protein_ID>

This mapping provides the connection between protein sequences and their corresponding gene-level information in the NCBI database, enabling cross-referencing with external databases such as FlyBase for Drosophila melanogaster.

Orthogroup Mapping:

The orthogroups from OrthoFinder3 were mapped to BRAKER Sequence IDs based on the OrthoFinder3 results file Orthogroups.tsv.
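A lookup from sequence ID to orthogroup can be built from Orthogroups.tsv roughly as sketched below. The sketch assumes the usual OrthoFinder layout (a header row, the orthogroup ID in the first column, one column per species, and sequence IDs within a cell separated by a comma and a space) and that sequence IDs are unique across species. The function name and the example content are hypothetical.

```python
import csv
import io


def sequence_to_orthogroup(tsv_handle):
    """Map every sequence ID in an Orthogroups.tsv-style file to its orthogroup."""
    reader = csv.reader(tsv_handle, delimiter="\t")
    next(reader)  # skip the header row (Orthogroup, species names)
    mapping = {}
    for row in reader:
        orthogroup, cells = row[0], row[1:]
        for cell in cells:
            for seq_id in cell.split(","):
                seq_id = seq_id.strip()
                if seq_id:  # empty cells mean the species has no member
                    mapping[seq_id] = orthogroup
    return mapping


# Made-up file content with two species and two orthogroups.
example_tsv = (
    "Orthogroup\tspecies_A\tspecies_B\n"
    "OG0000000\tA_g1.t1, A_g7.t1\tB_g3.t1\n"
    "OG0000001\t\tB_g9.t1\n"
)
print(sequence_to_orthogroup(io.StringIO(example_tsv)))
```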

Pfam Domain Arrangement Mapping:

BRAKER Sequence IDs were mapped to their annotated Pfam domain arrangements using the annotation files described in the Datasets and Annotation section.
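Deriving a domain arrangement per sequence from pfam_scan.pl output can be sketched as follows. The sketch assumes the standard whitespace-separated pfam_scan.pl columns (sequence ID first, envelope start fourth, HMM name seventh, comment lines starting with #) and orders domains by envelope start; whether the webserver orders by envelope or alignment coordinates is not stated. The two example hits contain made-up values.

```python
from collections import defaultdict


def domain_arrangements(pfam_scan_lines):
    """Return {sequence ID: [domain names ordered along the sequence]}."""
    hits = defaultdict(list)
    for line in pfam_scan_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        seq_id = fields[0]
        envelope_start = int(fields[3])
        hmm_name = fields[6]
        hits[seq_id].append((envelope_start, hmm_name))
    # Sort each sequence's domains by their start position.
    return {
        seq_id: [name for _, name in sorted(domains)]
        for seq_id, domains in hits.items()
    }


# Made-up hits for one protein, deliberately listed out of order.
example_lines = [
    "# <seq id> <alignment start> <alignment end> <envelope start> ...",
    "g1.t1 200 260 198 262 PF00017.27 SH2 Domain 1 77 77 60.1 3.0e-16 1 CL0541",
    "g1.t1 5 120 3 125 PF00069.28 Pkinase Domain 1 264 264 150.2 1.2e-44 1 CL0016",
]
print(domain_arrangements(example_lines))
```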

Note: All mappings are accessible through the Mappings page where you can search and browse the relationships between sequence IDs, orthogroups, and domain arrangements.

References and Tools
