How are similar genes calculated?
A 'similar genes' group is a set of eukaryotic genes from the NCBI Gene database, selected by combining calculated orthology with similarity of protein architectures. NCBI's SPARCLE (Subfamily Protein Architecture Labeling Engine) resource defines a protein architecture as the sequential arrangement of CDD (Conserved Domain Database) domains along a protein sequence. Each unique domain arrangement is assigned a SPARCLE architecture ID, which is annotated on a RefSeq protein sequence whenever that sequence matches the arrangement.
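As a rough illustration of the idea that an architecture is an ordered arrangement of domains, the Python sketch below maps each distinct ordered tuple of domain names to a toy integer ID. It is not SPARCLE's actual implementation: the function name, the placeholder accessions, the domain assignments, and the integer IDs are all assumptions made for the example and do not correspond to real SPARCLE records.

```python
from typing import Dict, List, Tuple

# An architecture is modeled as the ordered tuple of domain names on a protein.
Architecture = Tuple[str, ...]


def assign_architecture_ids(
    proteins: Dict[str, List[str]],
) -> Tuple[Dict[Architecture, int], Dict[str, int]]:
    """Give each unique ordered domain arrangement a toy ID and annotate proteins.

    Hypothetical helper: the IDs are simple counters, not real SPARCLE IDs.
    """
    arch_ids: Dict[Architecture, int] = {}
    annotations: Dict[str, int] = {}
    for accession, domains in proteins.items():
        arch = tuple(domains)  # order matters: (A, B) differs from (B, A)
        if arch not in arch_ids:
            arch_ids[arch] = len(arch_ids) + 1
        annotations[accession] = arch_ids[arch]
    return arch_ids, annotations


# Placeholder accessions and illustrative domain arrangements (not real data).
example_proteins = {
    "NP_EXAMPLE1": ["SH3", "SH2", "kinase"],
    "XP_EXAMPLE2": ["SH3", "SH2", "kinase"],
    "NP_EXAMPLE3": ["SH2", "SH3", "kinase"],
}
arch_ids, annotations = assign_architecture_ids(example_proteins)
```

In this toy data the first two proteins share one architecture ID, while the third receives a different ID because its domains occur in a different order.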
'Similar genes' groups are compiled by a stepwise process. First, an orthologous vertebrate gene set is calculated by the NCBI Gene database. Because any one of these genes can encode multiple protein isoforms, all corresponding RefSeq protein sequences (NP_ or XP_ accessions) for each orthologous gene are selected for protein architecture comparisons. Next, the SPARCLE IDs annotated on these proteins are used to retrieve all metazoan genes, and a limited selection of non-metazoan eukaryotic genes, in the NCBI Gene database that encode proteins annotated with identical architectures. Note that some genes in the source ortholog set have no proteins annotated with SPARCLE IDs; these genes are nonetheless retained as part of the 'similar genes' group. Lastly, all NCBI-calculated orthologs of the architecture-related genes are added to the group. Compared to the starting orthologous gene set, this multistep process expands both the number of organisms represented and the number of distinct gene ortholog groups represented.
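The stepwise process above can be sketched in Python as a few set operations. This is a minimal sketch under stated assumptions, not NCBI's pipeline: the function name, the input dictionaries, and the gene names are hypothetical, and the taxonomic restriction (all metazoa plus a limited selection of non-metazoan eukaryotes) is assumed to have been applied already to the architecture lookup table.

```python
from typing import Dict, Set


def build_similar_genes_group(
    seed_orthologs: Set[str],
    gene_to_architectures: Dict[str, Set[str]],
    ortholog_groups: Dict[str, Set[str]],
) -> Set[str]:
    """Assemble a toy 'similar genes' group from a seed ortholog set.

    gene_to_architectures: gene -> architecture IDs found on any of its
        protein isoforms (assumed already limited to eligible taxa).
    ortholog_groups: gene -> its calculated orthologs (hypothetical lookup).
    """
    # Collect architecture IDs across all isoforms of the seed genes.
    seed_archs: Set[str] = set()
    for gene in seed_orthologs:
        seed_archs |= gene_to_architectures.get(gene, set())

    # Retrieve every gene encoding a protein with an identical architecture.
    arch_related = {
        gene
        for gene, archs in gene_to_architectures.items()
        if archs & seed_archs
    }

    # Seed genes without any architecture annotation are still retained.
    group = set(seed_orthologs) | arch_related

    # Add the calculated orthologs of the architecture-related genes.
    for gene in arch_related:
        group |= ortholog_groups.get(gene, set())
    return group


# Hypothetical example data; none of these are real Gene records.
seed = {"humanA", "mouseA", "fishA"}  # fishA has no architecture annotation
architectures = {
    "humanA": {"arch1"},
    "mouseA": {"arch1"},
    "flyB": {"arch1"},   # shares the seed architecture
    "wormC": {"arch2"},  # unrelated architecture
}
orthologs = {"flyB": {"flyB", "mosquitoB"}}
group = build_similar_genes_group(seed, architectures, orthologs)
```

In this example the group grows beyond the seed set in both ways described above: it gains a gene from a second ortholog group (flyB) and an additional organism through that gene's ortholog (mosquitoB), while the unannotated seed gene (fishA) is kept.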