A procedure to recruit members to enlarge protein family databases – the building of UECOG (UniRef-Enriched COG Database) as a model

G.R. Fernandes, D.V.C. Barbosa, F. Prosdocimi, I.A. Pena, L. Santana-Santos, O. Coelho Junior, A. Barbosa-Silva, H.M. Velloso, M.A. Mudado, D.A. Natale, A.C. Faria-Campos, S.V.A. Campos, J.M. Ortega
Published September 30, 2008
Genet. Mol. Res. 7 (3): 910-924 (2008)

About the Authors

Corresponding author 
J.M. Ortega
Email: miguel@icb.ufmg.br

Abstract
A procedure to recruit members to enlarge protein family databases is described here. The procedure makes use of the UniRef50 clusters produced by UniProt. Current family entries are used to recruit additional members from the UniRef50 clusters to which they belong. Only those additional UniRef50 members that are not fragments and whose length is within a restricted range relative to the original entry are recruited. The enriched dataset is then limited to contain only genomes from selected clades. As a model, we used the COG database, which is used for genome annotation and for studies of phylogenetics and gene evolution. To validate the method, we tested a UniRef-Enriched COG0151 (UECOG) with distinct procedures that compare recruited members with their recruiters: PSI-BLAST, secondary structure overlap (SOV), Seed Linkage, COGnitor, shared domain content, and neighbor-joining single-linkage. We observed that the first four agree in their validations. Presently, the UniRef50-based recruitment procedure enriches the COG database for Archaea, Bacteria and its subgroups Actinobacteria, Firmicutes, Proteobacteria, and other bacteria by 2.2-, 8.0-, 7.0-, 8.8-, 8.7-, and 4.2-fold, respectively, in terms of sequences, and also considerably increases the number of species.
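The recruitment steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Protein` record, the `recruit` function, and the cluster identifiers are hypothetical, and the 20% length tolerance is an assumed placeholder, since the abstract only states that length must fall within a restricted range relative to the original entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    """Hypothetical record for one UniRef50 cluster member."""
    accession: str
    length: int
    is_fragment: bool
    clade: str
    cluster: str  # identifier of the UniRef50 cluster the protein belongs to


def recruit(family, proteins, allowed_clades, tolerance=0.2):
    """Return accessions recruited to a family through shared UniRef50 clusters.

    family         -- set of accessions already in the family (the recruiters)
    proteins       -- iterable of Protein records
    allowed_clades -- set of clade names kept in the enriched dataset
    tolerance      -- assumed relative length window around the recruiter
    """
    by_accession = {p.accession: p for p in proteins}
    by_cluster = {}
    for p in by_accession.values():
        by_cluster.setdefault(p.cluster, []).append(p)

    recruited = set()
    for accession in family:
        recruiter = by_accession.get(accession)
        if recruiter is None:
            continue  # recruiter is not present in any known cluster
        shortest = recruiter.length * (1 - tolerance)
        longest = recruiter.length * (1 + tolerance)
        for candidate in by_cluster[recruiter.cluster]:
            if candidate.accession in family:
                continue  # already a family member
            if candidate.is_fragment:
                continue  # fragments are never recruited
            if not shortest <= candidate.length <= longest:
                continue  # length outside the window around the recruiter
            if candidate.clade not in allowed_clades:
                continue  # keep only the selected clades
            recruited.add(candidate.accession)
    return recruited


# Toy data: P1 is the only current family member.
proteins = [
    Protein("P1", 400, False, "Bacteria", "UniRef50_A"),
    Protein("P2", 410, False, "Bacteria", "UniRef50_A"),
    Protein("P3", 150, True, "Bacteria", "UniRef50_A"),
    Protein("P4", 900, False, "Bacteria", "UniRef50_A"),
    Protein("P5", 395, False, "Eukaryota", "UniRef50_A"),
    Protein("P6", 400, False, "Bacteria", "UniRef50_B"),
]
result = recruit({"P1"}, proteins, {"Archaea", "Bacteria"})
print(sorted(result))
```

In this toy example only P2 passes every filter: P3 is a fragment, P4 is too long, P5 belongs to an excluded clade, and P6 sits in a different cluster.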

Key words: COG, Secondary database, UniRef, UniProt, UECOG.
