NCBI C++ ToolKit
CBuildDatabase Class Reference

Build BlastDB format databases from various data sources. More...

#include <objtools/blast/seqdb_writer/build_db.hpp>

Public Member Functions

 CBuildDatabase (const string &dbname, const string &title, bool is_protein, CWriteDB::TIndexType indexing, bool use_gi_mask, ostream *logfile, bool long_seqids=false, EBlastDbVersion dbver=eBDB_Version4, bool limit_defline=false, Uint8 oid_masks=EOidMaskType::fNone, bool scan_bioseq_4_cfastareader_usrobj=true)
 Constructor. More...
 
 CBuildDatabase (const string &dbname, const string &title, bool is_protein, bool sparse, bool parse_seqids, bool use_gi_mask, ostream *logfil, bool long_seqids=false, EBlastDbVersion dbver=eBDB_Version4, bool limit_defline=false, Uint8 oid_masks=EOidMaskType::fNone, bool scan_bioseq_4_cfastareader_usrobj=true)
 Constructor. More...
 
 ~CBuildDatabase ()
 
void SetTaxids (CTaxIdSet &taxids)
 Specify a mapping of sequence ids to taxonomic ids. More...
 
void SetMaskLetters (const string &mask_letters)
 Specify letters to mask out of protein sequence data. More...
 
void SetSourceDb (const string &src_db_name)
 Specify source database(s) via the database name(s). More...
 
void SetSourceDb (CRef< CSeqDBExpert > src_db)
 Specify source database. More...
 
void SetLinkouts (const TLinkoutMap &linkouts, bool keep_links)
 Specify a linkout bit lookup object. More...
 
void SetMembBits (const TLinkoutMap &membbits, bool keep_mbits)
 Specify a membership bit lookup object. More...
 
void SetLeafTaxIds (const TIdToLeafs &taxids, bool keep_taxids)
 Specify a leaf-taxids object. More...
 
bool Build (const vector< string > &ids, CNcbiIstream *fasta_file)
 Build the database. More...
 
void StartBuild ()
 Start building a new database. More...
 
bool AddIds (const vector< string > &ids)
 Add the specified sequences from the source database. More...
 
bool AddFasta (CNcbiIstream &fasta_file)
 Add sequences from a file containing FASTA data. More...
 
bool AddSequences (IBioseqSource &src, bool add_pig=false)
 Add sequences from an IBioseqSource object. More...
 
bool AddSequences (IRawSequenceSource &src)
 Add sequences from an IRawSequenceSource object. More...
 
bool EndBuild (bool erase=false)
 Finish building a new database. More...
 
void SetUseRemote (bool use_remote)
 Specify whether to use remote fetching for locally absent IDs. More...
 
void SetVerbosity (bool v)
 Specify level of output verbosity. More...
 
void SetSkipCopyingGis (bool v)
 
void SetMaxFileSize (Uint8 max_file_size)
 Set the maximum size of database component files. More...
 
int RegisterMaskingAlgorithm (EBlast_filter_program program, const string &options, const string &name="")
 Define a masking algorithm. More...
 
int RegisterMaskingAlgorithm (const string &program, const string &description, const string &options)
 Define a masking algorithm. More...
 
void SetMaskDataSource (IMaskDataSource &ranges)
 Specify an object mapping Seq-id to subject masking data. More...
 
string GetOutputDbName () const
 
- Public Member Functions inherited from CObject
 CObject (void)
 Constructor. More...
 
 CObject (const CObject &src)
 Copy constructor. More...
 
virtual ~CObject (void)
 Destructor. More...
 
CObject & operator= (const CObject &src) THROWS_NONE
 Assignment operator. More...
 
bool CanBeDeleted (void) const THROWS_NONE
 Check if object can be deleted. More...
 
bool IsAllocatedInPool (void) const THROWS_NONE
 Check if object is allocated in memory pool (not system heap) More...
 
bool Referenced (void) const THROWS_NONE
 Check if object is referenced. More...
 
bool ReferencedOnlyOnce (void) const THROWS_NONE
 Check if object is referenced only once. More...
 
void AddReference (void) const
 Add reference to object. More...
 
void RemoveReference (void) const
 Remove reference to object. More...
 
void ReleaseReference (void) const
 Remove reference without deleting object. More...
 
virtual void DoNotDeleteThisObject (void)
 Mark this object as not allocated in heap – do not delete this object. More...
 
virtual void DoDeleteThisObject (void)
 Mark this object as allocated in heap – object can be deleted. More...
 
void * operator new (size_t size)
 Define new operator for memory allocation. More...
 
void * operator new[] (size_t size)
 Define new[] operator for 'array' memory allocation. More...
 
void operator delete (void *ptr)
 Define delete operator for memory deallocation. More...
 
void operator delete[] (void *ptr)
 Define delete[] operator for memory deallocation. More...
 
void * operator new (size_t size, void *place)
 Define new operator. More...
 
void operator delete (void *ptr, void *place)
 Define delete operator. More...
 
void * operator new (size_t size, CObjectMemoryPool *place)
 Define new operator using memory pool. More...
 
void operator delete (void *ptr, CObjectMemoryPool *place)
 Define delete operator. More...
 
virtual void DebugDump (CDebugDumpContext ddc, unsigned int depth) const
 Define method for dumping debug information. More...
 
- Public Member Functions inherited from CDebugDumpable
 CDebugDumpable (void)
 
virtual ~CDebugDumpable (void)
 
void DebugDumpText (ostream &out, const string &bundle, unsigned int depth) const
 
void DebugDumpFormat (CDebugDumpFormatter &ddf, const string &bundle, unsigned int depth) const
 
void DumpToConsole (void) const
 

Static Public Member Functions

static void CreateDirectories (const string &dbname)
 Create Directory for blast db. More...
 
- Static Public Member Functions inherited from CObject
static NCBI_XNCBI_EXPORT void ThrowNullPointerException (void)
 Define method to throw null pointer exception. More...
 
static NCBI_XNCBI_EXPORT void ThrowNullPointerException (const type_info &type)
 
static EAllocFillMode GetAllocFillMode (void)
 
static void SetAllocFillMode (EAllocFillMode mode)
 
static void SetAllocFillMode (const string &value)
 Set mode from configuration parameter value. More...
 
- Static Public Member Functions inherited from CDebugDumpable
static void EnableDebugDump (bool on)
 

Private Member Functions

objects::CScope & x_GetScope ()
 Get a scope for remote loading of objects. More...
 
void x_DupLocal ()
 Duplicate IDs from local databases. More...
 
void x_ResolveRemoteId (CRef< objects::CSeq_id > &seqid, TGi &gi)
 Resolve an ID remotely. More...
 
CRef< CInputGiList > x_ResolveGis (const vector< string > &ids)
 Resolve various input IDs (as strings) to GIs. More...
 
void x_EditHeaders (CRef< objects::CBlast_def_line_set > headers)
 Modify deflines with linkout and membership bits and taxids. More...
 
void x_AddPig (CRef< objects::CBlast_def_line_set > headers)
 Add pig if id can be extracted from the deflines. More...
 
bool x_EditAndAddBioseq (CConstRef< objects::CBioseq > bs, objects::CSeqVector *sv, bool add_pig=false)
 Modify a Bioseq as needed and add it to the database. More...
 
void x_AddMasksForSeqId (const list< CRef< CSeq_id > > &ids)
 Add the masks for the Seq-id(s) (usually just one) to the database being created. More...
 
bool x_AddRemoteSequences (CInputGiList &gi_list)
 Add remotely fetched sequences for IDs not found locally. More...
 
bool x_ReportUnresolvedIds (const CInputGiList &gi_list) const
 Write log messages for any unresolved IDs. More...
 
void x_SetLinkAndMbit (CRef< objects::CBlast_def_line_set > headers)
 Store linkout (now deprecated) and membership bits in provided headers. More...
 
void x_SetLeafTaxids (CRef< objects::CBlast_def_line_set > headers)
 Store leaf taxids in provided headers. More...
 
void x_AddOneRemoteSequence (const objects::CSeq_id &seqid, bool &found, bool &error)
 Fetch a sequence from the remote service and add it to the db. More...
 
bool x_ResolveFromSource (const string &acc, CRef< objects::CSeq_id > &id)
 Determine if this string ID can be found in the source database. More...
 
bool x_EndBuild (bool erase, const CException *close_exception)
 

Private Attributes

bool m_IsProtein
 True for a protein database, false for nucleotide. More...
 
bool m_KeepLinks
 True to keep linkout bits from source dbs, false to discard. More...
 
TIdToBits m_Id2Links
 Table of linkout bits to apply to sequences. More...
 
bool m_KeepMbits
 True to keep membership bits from source dbs, false to discard. More...
 
TIdToBits m_Id2Mbits
 Table of membership bits to apply to sequences. More...
 
bool m_KeepLeafs
 True to keep leaf taxids from source dbs, false to discard. More...
 
TIdToLeafs m_Id2Leafs
 Table of leaf taxids to apply to sequences. More...
 
CRef< objects::CObjectManager > m_ObjMgr
 Object manager, used for remote fetching. More...
 
CRef< objects::CScope > m_Scope
 Sequence scope, used for remote fetching. More...
 
CRef< CTaxIdSet > m_Taxids
 Set of TaxIDs configured to apply to sequences. More...
 
CRef< CWriteDB > m_OutputDb
 Database being produced here. More...
 
CRef< CSeqDBExpert > m_SourceDb
 Database for duplicating sequences locally (-sourcedb option.) More...
 
CRef< IMaskDataSource > m_MaskData
 Subject masking data. More...
 
ostream & m_LogFile
 Logfile. More...
 
bool m_UseRemote
 Whether to use remote resolution and sequence fetching. More...
 
int m_DeflineCount
 Defline count. More...
 
int m_OIDCount
 Number of OIDs stored in this database. More...
 
bool m_Verbose
 If true, more detailed log messages will be produced. More...
 
bool m_ParseIDs
 If true, string IDs found in FASTA input will be parsed as Seq-ids. More...
 
bool m_LongIDs
 If true, use long sequence ids (database|accession) More...
 
bool m_FoundMatchingMasks
 If true, there were sequences whose IDs matched those in the provided masking locations (via SetMaskDataSource). More...
 
bool m_SkipCopyingGis
 If set to true, when copying BLASTDBs, skip the GIs. More...
 
bool m_SkipLargeGis
 If set to true, skip GIs with value > 0x7FFFFFFF. More...
 
string m_OutputDbName
 
bool m_ScanBioseq4CFastaReaderUsrObjct
 

Additional Inherited Members

- Public Types inherited from CObject
enum  EAllocFillMode { eAllocFillNone = 1 , eAllocFillZero , eAllocFillPattern }
 Control filling of newly allocated memory. More...
 
typedef CObjectCounterLocker TLockerType
 Default locker type for CRef. More...
 
typedef atomic< Uint8 > TCounter
 Counter type is CAtomicCounter. More...
 
typedef Uint8 TCount
 Alias for value type of counter. More...
 
- Static Public Attributes inherited from CObject
static const TCount eCounterBitsCanBeDeleted = 1 << 0
 Define possible object states. More...
 
static const TCount eCounterBitsInPlainHeap = 1 << 1
 Heap signature was found. More...
 
static const TCount eCounterBitsPlaceMask
 Mask for 'in heap' state flags. More...
 
static const int eCounterStep = 1 << 2
 Skip over the "in heap" bits. More...
 
static const TCount eCounterValid = TCount(1) << (sizeof(TCount) * 8 - 2)
 Minimal value for valid objects (reference counter is zero) Must be a single bit value. More...
 
static const TCount eCounterStateMask
 Valid object, and object in heap. More...
 
- Protected Member Functions inherited from CObject
virtual void DeleteThis (void)
 Virtual method "deleting" this object. More...
 

Detailed Description

Build BlastDB format databases from various data sources.

This class provides an API for building BlastDB format databases. The WriteDB library is used internally to produce the actual database; the functionality provided by this class helps to bridge the gap between the WriteDB API and the needs of a command line database construction tool.

Definition at line 136 of file build_db.hpp.

Constructor & Destructor Documentation

◆ CBuildDatabase() [1/2]

CBuildDatabase::CBuildDatabase ( const string &  dbname,
const string &  title,
bool  is_protein,
CWriteDB::TIndexType  indexing,
bool  use_gi_mask,
ostream *  logfile,
bool  long_seqids = false,
EBlastDbVersion  dbver = eBDB_Version4,
bool  limit_defline = false,
Uint8  oid_masks = EOidMaskType::fNone,
bool  scan_bioseq_4_cfastareader_usrobj = true 
)

Constructor.

Create a database with the specified name, type, and other characteristics. The database will use the specified dbname as the base name for database volumes. The indexing argument selects the index fields added to the database.

Parameters
dbname  Name of the database to create. [in]
title  Title to use for newly created database. [in]
is_protein  Use true for protein, false for nucleotide. [in]
indexing  Index fields to add to database. [in]
use_gi_mask  If true, generate GI-based mask files. [in]
logfile  Stream to which the log is written. [in]
long_seqids  If true, require long sequence ids (database|accession) when parsing FASTA sequences. [in]
dbver  Version of BLAST database to generate. [in]
scan_bioseq_4_cfastareader_usrobj  If true, scan the Bioseq objects for a CFastaReader-created User-object containing a defline. [in]

Definition at line 1073 of file build_db.cpp.

References CTime::AsString(), CDirEntry::CreateAbsolutePath(), CreateDirectories(), dbname(), DeleteBlastDb(), CTime::eCurrent, CWriteDB::eNucleotide, CWriteDB::eProtein, m_LogFile, m_LongIDs, m_OutputDb, m_OutputDbName, m_ParseIDs, ParseMoleculeTypeString(), CRef< C, Locker >::Reset(), and CWriteDB::SetMaxFileSize().

◆ CBuildDatabase() [2/2]

CBuildDatabase::CBuildDatabase ( const string &  dbname,
const string &  title,
bool  is_protein,
bool  sparse,
bool  parse_seqids,
bool  use_gi_mask,
ostream *  logfil,
bool  long_seqids = false,
EBlastDbVersion  dbver = eBDB_Version4,
bool  limit_defline = false,
Uint8  oid_masks = EOidMaskType::fNone,
bool  scan_bioseq_4_cfastareader_usrobj = true 
)

Constructor.

Create a database with the specified name, type, and other characteristics. The database will use the specified dbname as the base name for database volumes. Note that the index type is based on either eSparseIndex or eDefault, depending on the "sparse" flag.

Parameters
dbname  Name of the database to create. [in]
title  Title to use for newly created database. [in]
is_protein  Use true for protein, false for nucleotide. [in]
sparse  Specify true to use sparse Seq-id indexing. [in]
parse_seqids  Specify true to parse the sequence IDs. [in]
use_gi_mask  If true, generate GI-based mask files. [in]
logfil  Stream to which the log is written. [in]
long_seqids  If true, require long sequence ids (database|accession) when parsing FASTA sequences. [in]
scan_bioseq_4_cfastareader_usrobj  If true, scan the Bioseq objects for a CFastaReader-created User-object containing a defline. [in]

Definition at line 1136 of file build_db.cpp.

References CTime::AsString(), CDirEntry::CreateAbsolutePath(), CreateDirectories(), dbname(), DeleteBlastDb(), CTime::eCurrent, CWriteDB::eDefault, CWriteDB::eNucleotide, CWriteDB::eProtein, CWriteDB::eSparseIndex, m_LogFile, m_OutputDb, m_OutputDbName, m_ParseIDs, ParseMoleculeTypeString(), CRef< C, Locker >::Reset(), and CWriteDB::SetMaxFileSize().

◆ ~CBuildDatabase()

CBuildDatabase::~CBuildDatabase ( )

Member Function Documentation

◆ AddFasta()

bool CBuildDatabase::AddFasta ( CNcbiIstream &  fasta_file)

Add sequences from a file containing FASTA data.

The provided file is expected to contain FASTA data for one or more sequences. The data should be suitable input as required by CFastaReader.

Parameters
fasta_file  A file containing FASTA data.
Returns
True if at least one sequence was added.

Definition at line 1398 of file build_db.cpp.

References AddSequences(), EndBuild(), m_IsProtein, m_LongIDs, m_ParseIDs, and NCBI_THROW.

Referenced by BOOST_AUTO_TEST_CASE(), Build(), and CMakeBlastDBApp::x_AddFasta().

◆ AddIds()

bool CBuildDatabase::AddIds ( const vector< string > &  ids)

Add the specified sequences from the source database.

The strings in the list are interpreted as GIs if they're composed only of numeric digits, or as Seq-ids otherwise. The sequence IDs will be resolved, and a sequence corresponding to each ID will be added to the output database. If remote resolution is enabled, it will be used to find up-to-date versions for any ambiguously versioned IDs (i.e. unversioned IDs of versioned Seq-id types). Then local fetching will be used to process IDs using the source database if one was specified. If any sequences have not been found, and remote services are enabled, remote fetching will be used for IDs not resolved locally. If any IDs are not found at all, they will be reported as part of the logging output.

Parameters
ids  List of sequence IDs as strings.
Returns
true if all sequences were found locally or remotely.

Definition at line 1321 of file build_db.cpp.

References _ASSERT, map_checker< Container >::end(), map_checker< Container >::find(), CSeqDB::GetDBNameList(), CSeqDBGiList::GetGiOid(), CSeqDBGiList::GetNumGis(), CSeqDBGiList::GetNumSis(), CSeqDB::GetSequenceType(), CSeqDBGiList::SGiOid::gi, i, m_LogFile, m_SourceDb, m_UseRemote, m_Verbose, CRef< C, Locker >::NotEmpty(), CSeqDBGiList::SGiOid::oid, x_AddRemoteSequences(), x_DupLocal(), x_ReportUnresolvedIds(), and x_ResolveGis().

Referenced by BOOST_AUTO_TEST_CASE(), and Build().
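
The ID interpretation rule described for AddIds() can be illustrated with a minimal, self-contained sketch. LooksLikeGi, SplitIds and SplitIdList are hypothetical names introduced here for illustration; they are not part of CBuildDatabase, and the treatment of an empty string as a non-GI is an assumption.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not part of CBuildDatabase): true when the string is
// non-empty and consists only of decimal digits, i.e. it would be read as a GI.
// Assumption: an empty string is not treated as a GI.
bool LooksLikeGi(const std::string& id)
{
    return !id.empty() &&
           std::all_of(id.begin(), id.end(), [](unsigned char c) {
               return std::isdigit(c) != 0;
           });
}

// Hypothetical result type: the input list split into the two kinds of ID.
struct SplitIds {
    std::vector<std::string> gis;      // all-digit strings
    std::vector<std::string> seq_ids;  // everything else
};

// Partition a list of ID strings, preserving input order within each group.
SplitIds SplitIdList(const std::vector<std::string>& ids)
{
    SplitIds out;
    for (const std::string& id : ids) {
        (LooksLikeGi(id) ? out.gis : out.seq_ids).push_back(id);
    }
    return out;
}
```

With this rule, "129295" would be classified as a GI, while "NP_000509.1" and "12a" would be treated as Seq-id strings.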

◆ AddSequences() [1/2]

bool CBuildDatabase::AddSequences ( IBioseqSource &  src,
bool  add_pig = false 
)

Add sequences from an IBioseqSource object.

The provided `src' object is queried using GetNext() to get a Bioseq object. The Bioseq is added to the output database (with appropriate modifications of taxid, membership bits, and linkout bits, as configured here). This process repeats until the GetNext() method returns NULL.

Parameters
src  An object providing one or more Bioseq objects.
add_pig  True if PIG should be added if available.
Returns
True if at least one sequence was added.

Definition at line 794 of file build_db.cpp.

References CBioseq_Base::CanGetId(), debug_mode, CSeq_id_Base::e_Local, CStopWatch::Elapsed(), CStopWatch::eStart, CSeq_id::fAcc_nuc, CSeq_id::fAcc_prot, CBioseq_Base::GetId(), CBioseq::GetLength(), IBioseqSource::GetNext(), CConstRef< C, Locker >::GetNonNullPointer(), GI_CONST, info, CBioseq::IsAa(), label, m_IsProtein, m_LogFile, m_LongIDs, m_SkipLargeGis, m_Verbose, NCBI_THROW, CConstRef< C, Locker >::NotEmpty(), NULL, CBioseq_Base::SetId(), sw, t, and x_EditAndAddBioseq().

Referenced by AddFasta(), BOOST_AUTO_TEST_CASE(), s_TestReadPDBAsn1(), CMakeBlastDBApp::x_AddSeqEntries(), CMakeClusterDBApp::x_BuildDatabase(), BlastdbCopyApplication::x_CopyDB(), BlastdbCopyApplication::x_MakeDBwIDList(), and CMakeBlastDBApp::x_ProcessInputData().
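
The pull loop described above (call GetNext() until it returns NULL, and report whether anything was added) can be sketched without the toolkit. SketchBioseq, ISketchSource, VectorSource and AddAll are hypothetical stand-ins, not the real CBioseq, IBioseqSource or AddSequences(); the sketch omits the taxid, membership-bit and linkout edits the real method applies.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Bioseq (the real class is objects::CBioseq).
struct SketchBioseq {
    std::string id;
    std::string data;
};

// Hypothetical stand-in for IBioseqSource: GetNext() returns the next
// sequence, or a null pointer once the source is exhausted.
class ISketchSource {
public:
    virtual ~ISketchSource() = default;
    virtual std::shared_ptr<const SketchBioseq> GetNext() = 0;
};

// A source backed by an in-memory vector.
class VectorSource : public ISketchSource {
public:
    explicit VectorSource(std::vector<SketchBioseq> seqs)
        : m_Seqs(std::move(seqs)) {}

    std::shared_ptr<const SketchBioseq> GetNext() override
    {
        if (m_Pos >= m_Seqs.size()) {
            return nullptr;
        }
        return std::make_shared<SketchBioseq>(m_Seqs[m_Pos++]);
    }

private:
    std::vector<SketchBioseq> m_Seqs;
    std::size_t m_Pos = 0;
};

// Drain the source into `output`; returns true if at least one sequence
// was added, mirroring the documented return value.
bool AddAll(ISketchSource& src, std::vector<SketchBioseq>& output)
{
    bool added = false;
    while (std::shared_ptr<const SketchBioseq> bs = src.GetNext()) {
        output.push_back(*bs);
        added = true;
    }
    return added;
}
```

An exhausted or empty source makes AddAll return false, which corresponds to the "at least one sequence was added" contract.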

◆ AddSequences() [2/2]

bool CBuildDatabase::AddSequences ( IRawSequenceSource &  src)

Add sequences from an IRawSequenceSource object.

The provided `src' object is queried using GetNext() to get various "raw format" sequence data and metadata components. These pieces of data are added to the output database (with appropriate modifications of taxid, membership bits, and linkout bits, as configured here). This process repeats until the GetNext() method returns false.

Parameters
src  An object providing one or more "raw" sequences.
Returns
True if at least one sequence was added.

Definition at line 904 of file build_db.cpp.

References _ASSERT, CWriteDB::AddColumnMetaData(), CWriteDB::AddSequence(), CBlastDbBlob::Clear(), CWriteDB::CreateUserColumn(), CTempString::data(), done, CStopWatch::Elapsed(), CMaskedRangesVector::empty(), CTempString::empty(), CRef< C, Locker >::Empty(), map_checker< Container >::end(), CStopWatch::eStart, map_checker< Container >::find(), CWriteDB::FindColumn(), CBlast_def_line_set_Base::Get(), IRawSequenceSource::GetColumnId(), IRawSequenceSource::GetColumnMetaData(), IRawSequenceSource::GetColumnNames(), IRawSequenceSource::GetNext(), IMaskDataSource::GetRanges(), i, int, ITERATE, m_FoundMatchingMasks, m_IsProtein, m_LogFile, m_MaskData, m_OutputDb, NCBI_THROW, CWriteDB::SetBlobData(), CWriteDB::SetDeflines(), CWriteDB::SetMaskData(), ncbi::grid::netcache::search::fields::size, CTempString::size(), sw, t, CBlastDbBlob::WriteRaw(), x_AddPig(), and x_EditHeaders().

◆ Build()

bool CBuildDatabase::Build ( const vector< string > &  ids,
CNcbiIstream *  fasta_file 
)

Build the database.

This method builds a database from the given list of Sequence IDs and the provided file, which should contain FASTA format data. It is equivalent to calling StartBuild(), AddIds(), AddFasta(), and EndBuild() in that order (except that a little additional logging is done with summary information).

Parameters
ids  List of identifiers to add to the database.
fasta_file  FASTA format data for

Definition at line 1289 of file build_db.cpp.

References AddFasta(), AddIds(), CStopWatch::Elapsed(), EndBuild(), CStopWatch::eStart, m_DeflineCount, m_LogFile, m_OIDCount, StartBuild(), sw, and t.

Referenced by BOOST_AUTO_TEST_CASE().

◆ CreateDirectories()

void CBuildDatabase::CreateDirectories ( const string &  dbname)
static

Create Directory for blast db.

Parameters
dbname  Output BLAST db name (with path).

Definition at line 1051 of file build_db.cpp.

References CDirEntry::CheckAccess(), CDir::CreatePath(), dbname(), CDirEntry::eIfEmptyPath_Empty, CDir::Exists(), CDirEntry::fWrite, CDirEntry::GetDir(), CDirEntry::GetName(), and NCBI_THROW.

Referenced by CBuildDatabase(), CBlastdbConvertApp::Run(), and CMakeProfileDBApp::x_Run().

◆ EndBuild()

bool CBuildDatabase::EndBuild ( bool  erase = false)

Finish building a new database.

This method closes the newly constructed database, flushing any unflushed volumes, creating an alias file to tie the volumes together, and so on.

Parameters
erase  Will erase all files created if true.

Definition at line 1423 of file build_db.cpp.

References CWriteDB::Close(), eUnknown, m_OutputDb, NCBI_EXCEPTION_VAR, NULL, CException::what(), and x_EndBuild().

Referenced by AddFasta(), BOOST_AUTO_TEST_CASE(), Build(), s_TestReadPDBAsn1(), CMakeBlastDBApp::x_BuildDatabase(), CMakeClusterDBApp::x_BuildDatabase(), BlastdbCopyApplication::x_CopyDB(), and BlastdbCopyApplication::x_MakeDBwIDList().

◆ GetOutputDbName()

string CBuildDatabase::GetOutputDbName ( ) const
inline

Definition at line 465 of file build_db.hpp.

References m_OutputDbName.

Referenced by CMakeBlastDBApp::x_BuildDatabase(), and CMakeClusterDBApp::x_BuildDatabase().

◆ RegisterMaskingAlgorithm() [1/2]

int CBuildDatabase::RegisterMaskingAlgorithm ( const string &  program,
const string &  description,
const string &  options 
)

Define a masking algorithm.

The returned integer ID will be defined as corresponding to the provided program enumeration (e.g. DUST, SEG, etc) and options string, for subject masking. Each program enumeration (such as DUST) may be used several times with different options strings; however, the combination of program and options should be unique for each algorithm ID. The options string is a free-form string (at least from this class's point of view).

Parameters
program  A string to identify the filtering algorithm. [in]
description  A free-form string describing the data. [in]
options  A free-form string describing the options used. [in]

Definition at line 1597 of file build_db.cpp.

References m_OutputDb, and CWriteDB::RegisterMaskAlgorithm().

◆ RegisterMaskingAlgorithm() [2/2]

int CBuildDatabase::RegisterMaskingAlgorithm ( EBlast_filter_program  program,
const string &  options,
const string &  name = "" 
)

Define a masking algorithm.

The returned integer ID will be defined as corresponding to the provided program enumeration (e.g. DUST, SEG, etc) and options string, for subject masking. Each program enumeration (such as DUST) may be used several times with different options strings; however, the combination of program and options should be unique for each algorithm ID. The options string is a free-form string (at least from this class's point of view).

Parameters
program  One of the predefined masking types (dust etc). [in]
options  A free-form string describing this type of data. The empty string should be used to indicate default parameters. [in]
name  Name of the GI-based mask file. [in]

Definition at line 1584 of file build_db.cpp.

References m_OutputDb, and CWriteDB::RegisterMaskAlgorithm().

Referenced by CClusterDBSource::CClusterDBSource(), CRawSeqDBSource::CRawSeqDBSource(), and CMakeBlastDBApp::x_ProcessMaskData().
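
The uniqueness requirement on the (program, options) combination can be pictured with a small registry. This is only a sketch: MaskAlgoRegistrySketch is a hypothetical class, not the WriteDB implementation, and both the choice to throw on a repeated pair and the numbering of IDs from zero are assumptions made for illustration.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical registry, not the WriteDB implementation.
class MaskAlgoRegistrySketch {
public:
    // Returns a new ID for a (program, options) pair. Throws if the pair was
    // already registered, since each pair should correspond to one ID.
    // Assumption: IDs are handed out sequentially starting at 0.
    int Register(const std::string& program, const std::string& options)
    {
        const auto key = std::make_pair(program, options);
        if (m_Ids.count(key) != 0) {
            throw std::invalid_argument("duplicate program/options pair");
        }
        const int id = m_NextId++;
        m_Ids.emplace(key, id);
        return id;
    }

private:
    std::map<std::pair<std::string, std::string>, int> m_Ids;
    int m_NextId = 0;
};
```

In this sketch the same program name can be registered repeatedly as long as the options string differs each time.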

◆ SetLeafTaxIds()

void CBuildDatabase::SetLeafTaxIds ( const TIdToLeafs &  taxids,
bool  keep_taxids 
)

◆ SetLinkouts()

void CBuildDatabase::SetLinkouts ( const TLinkoutMap &  linkouts,
bool  keep_links 
)

Specify a linkout bit lookup object.

The provided mapping will be used to look up linkout bits for sequences added to the database.

Parameters
linkouts  Mapping used to look up linkout bits. [in]
keep_links  True to keep linkout bits from source dbs, false to discard. [in]

Definition at line 1262 of file build_db.cpp.

References m_Id2Links, m_KeepLinks, m_LogFile, and MapToLMBits().

◆ SetMaskDataSource()

void CBuildDatabase::SetMaskDataSource ( IMaskDataSource &  ranges)

Specify an object mapping Seq-id to subject masking data.

Masking data is provided to CBuildDatabase by implementing an interface that can produce masking data given the Seq-ids for the sequence that is to be masked. This object could wrap a simple lookup table, an algorithm that produces the data on the fly, or a wrapper around an existing database that fetches the masking data from that database.

Parameters
ranges  An object mapping Seq-ids to their masking data.

Definition at line 1609 of file build_db.cpp.

References m_MaskData, and CRef< C, Locker >::Reset().

Referenced by CMakeBlastDBApp::x_ProcessMaskData().
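
The "simple lookup table" variant mentioned above might look like the following sketch. All types here are hypothetical simplifications: the real interface is IMaskDataSource and works with Seq-id objects, whereas this sketch uses plain strings for IDs and integer pairs for ranges, and returns the ranges of the first ID that is found (an assumption).

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified types; offsets are assumed to be integer pairs.
using TRangeSketch  = std::pair<int, int>;
using TRangesSketch = std::vector<TRangeSketch>;

// Hypothetical stand-in for the masking data interface.
class IMaskSourceSketch {
public:
    virtual ~IMaskSourceSketch() = default;
    // Returns masking ranges for a sequence known by any of the given IDs.
    virtual TRangesSketch GetRanges(const std::vector<std::string>& ids) const = 0;
};

// Lookup-table implementation: ranges are stored per ID string.
class TableMaskSource : public IMaskSourceSketch {
public:
    void Add(const std::string& id, TRangeSketch range)
    {
        m_Table[id].push_back(range);
    }

    TRangesSketch GetRanges(const std::vector<std::string>& ids) const override
    {
        // Return the ranges of the first ID present in the table.
        for (const std::string& id : ids) {
            auto it = m_Table.find(id);
            if (it != m_Table.end()) {
                return it->second;
            }
        }
        return TRangesSketch();  // no masking data for this sequence
    }

private:
    std::map<std::string, TRangesSketch> m_Table;
};
```

The same interface could equally be backed by an algorithm that computes ranges on the fly or by a wrapper around an existing database, as the description notes.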

◆ SetMaskLetters()

void CBuildDatabase::SetMaskLetters ( const string &  mask_letters)

Specify letters to mask out of protein sequence data.

Protein sequences sometimes contain rare (or recently defined) letters that cause trouble for some algorithms. This method specifies a list of protein letters that might be found in the input sequences, but which should be replaced by "X" before adding those sequences to the database.

Parameters
mask_letters  Protein letters to replace with "X" in input sequences. [in]

Definition at line 1221 of file build_db.cpp.

References m_OutputDb, and CWriteDB::SetMaskedLetters().
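
The replacement described for SetMaskLetters() amounts to a per-character substitution. The sketch below uses a hypothetical free function, MaskLetters, which is not part of the toolkit; it assumes matching is case-sensitive and operates on a plain string rather than on the encoded sequence data WriteDB handles.

```cpp
#include <string>

// Hypothetical helper: returns a copy of `seq` in which every character that
// appears in `mask_letters` is replaced by 'X'.
// Assumption: matching is case-sensitive.
std::string MaskLetters(const std::string& seq, const std::string& mask_letters)
{
    std::string out = seq;
    for (char& c : out) {
        if (mask_letters.find(c) != std::string::npos) {
            c = 'X';
        }
    }
    return out;
}
```

For instance, masking the letters "UO" in "MKUOLV" would give "MKXXLV", and an empty mask list leaves the sequence unchanged.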

◆ SetMaxFileSize()

void CBuildDatabase::SetMaxFileSize ( Uint8  max_file_size)

Set the maximum size of database component files.

This specifies the maximum size of any file made as a component of a database volume produced by the WriteDB library. The default value is 10^9 (one billion bytes).

Parameters
max_file_size  Maximum file size in bytes.

Definition at line 1578 of file build_db.cpp.

References m_OutputDb, and CWriteDB::SetMaxFileSize().

Referenced by CMakeBlastDBApp::x_BuildDatabase(), CMakeClusterDBApp::x_BuildDatabase(), BlastdbCopyApplication::x_CopyDB(), and BlastdbCopyApplication::x_MakeDBwIDList().

◆ SetMembBits()

void CBuildDatabase::SetMembBits ( const TLinkoutMap &  membbits,
bool  keep_mbits 
)

Specify a membership bit lookup object.

The provided mapping will be used to look up membership bit data for sequences added to the database.

Parameters
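membbits  Mapping used to look up membership bits. [in]
keep_mbits  True to keep membership bits from source dbs, false to discard. [in]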

Definition at line 1270 of file build_db.cpp.

References m_Id2Mbits, m_KeepMbits, m_LogFile, and MapToLMBits().

Referenced by CMakeBlastDBApp::x_BuildDatabase(), BlastdbCopyApplication::x_CopyDB(), and BlastdbCopyApplication::x_MakeDBwIDList().

◆ SetSkipCopyingGis()

void CBuildDatabase::SetSkipCopyingGis ( bool  v)
inline

◆ SetSourceDb() [1/2]

void CBuildDatabase::SetSourceDb ( const string &  src_db_name)

Specify source database(s) via the database name(s).

The provided name will be used to find a source database (or several) to look up sequence information for the list of sequences specified by AddIds().

Parameters
src_db_name  Database name of the source database. [in]

Definition at line 1250 of file build_db.cpp.

References _ASSERT, CSeqDB::eNucleotide, CSeqDB::eProtein, and m_IsProtein.

Referenced by BOOST_AUTO_TEST_CASE(), BlastdbCopyApplication::x_CopyDB(), and BlastdbCopyApplication::x_MakeDBwIDList().

◆ SetSourceDb() [2/2]

void CBuildDatabase::SetSourceDb ( CRef< CSeqDBExpert >  src_db)

Specify source database.

The provided source database will be used to look up sequence information for the list of sequences specified by AddIds().

Parameters
src_db  The source database. [in]

Definition at line 1242 of file build_db.cpp.

References CSeqDB::GetDate(), CSeqDB::GetDBNameList(), CSeqDB::GetTitle(), m_LogFile, and m_SourceDb.

◆ SetTaxids()

void CBuildDatabase::SetTaxids ( CTaxIdSet &  taxids)

Specify a mapping of sequence ids to taxonomic ids.

When adding sequences CBuildDatabase will use the object provided here to find TaxIDs for sequences it adds to the newly created database.

Parameters
taxids  An object providing defline-to-TaxID lookups. [in]

Definition at line 1216 of file build_db.cpp.

References m_Taxids, and CRef< C, Locker >::Reset().

Referenced by BOOST_AUTO_TEST_CASE(), and CMakeBlastDBApp::x_BuildDatabase().

◆ SetUseRemote()

void CBuildDatabase::SetUseRemote ( bool  use_remote)
inline

Specify whether to use remote fetching for locally absent IDs.

If identifiers in the list provided to Build or to AddIds are not found in the source database (if any), remote sequence fetching APIs can be used to fetch those sequences. Normally this happens in two cases. First, sequences listed in the list of IDs are sometimes too new to be found in the source database. Second, sequences may be found in the source database, but newer versions might be available in the remote database.

If the use_remote flag is set to true, this class finds the latest version number for unversioned IDs (but only of types that can have versions in the first place), and will attempt to remotely fetch any sequences for which the source database does not have the latest version. If the flag is specified as false, no remote lookups will be done, and sequences found in ids but not found in the source database will not be added to the output database.

Note: This does not affect the AddSequences or AddFasta methods; in those cases, all provided sequences are added in the form they are provided in.

The default value for this flag is "true".

Parameters
use_remote  Specify true for remote checking & fetching.

Definition at line 385 of file build_db.hpp.

References m_UseRemote.

Referenced by BOOST_AUTO_TEST_CASE(), BlastdbCopyApplication::x_CopyDB(), and BlastdbCopyApplication::x_MakeDBwIDList().

◆ SetVerbosity()

void CBuildDatabase::SetVerbosity ( bool  v)
inline

Specify level of output verbosity.

Parameters
v  Specify true if output should be more detailed.

Definition at line 392 of file build_db.hpp.

References m_Verbose.

Referenced by CMakeBlastDBApp::x_BuildDatabase(), and CMakeClusterDBApp::x_BuildDatabase().

◆ StartBuild()

void CBuildDatabase::StartBuild ( )

Start building a new database.

This method sets up a new database to begin receiving sequences. It should be called before AddIds, AddFasta, or AddSequences is called.

Definition at line 1317 of file build_db.cpp.

Referenced by BOOST_AUTO_TEST_CASE(), Build(), s_TestReadPDBAsn1(), BlastdbCopyApplication::x_CopyDB(), and BlastdbCopyApplication::x_MakeDBwIDList().

◆ x_AddMasksForSeqId()

void CBuildDatabase::x_AddMasksForSeqId ( const list< CRef< CSeq_id > > &  ids)
private

Add the masks for the Seq-id(s) (usually just one) to the database being created.

Parameters
ids  Seq-id(s) of the sequence to which masks should be added. [in]

Definition at line 447 of file build_db.cpp.

References CMaskedRangesVector::empty(), CRef< C, Locker >::Empty(), IMaskDataSource::GetRanges(), ITERATE, m_FoundMatchingMasks, m_MaskData, m_OutputDb, and CWriteDB::SetMaskData().

Referenced by x_EditAndAddBioseq().

◆ x_AddOneRemoteSequence()

void CBuildDatabase::x_AddOneRemoteSequence ( const objects::CSeq_id &  seqid,
bool &  found,
bool &  error 
)
private

Fetch a sequence from the remote service and add it to the db.

The provided Seq-id will be used to fetch a Bioseq remotely, and this Bioseq will be added to this database.

Parameters
seqid  Identifies the sequence to fetch. [in]
found  Will be set to true if a sequence was found. [out]
error  Will be set to true if an error occurred. [out]

Definition at line 507 of file build_db.cpp.

References debug_mode, CBioseq_Handle::fState_not_found, CBioseq_Handle::GetCompleteBioseq(), CBioseq_Handle::GetState(), m_LogFile, MSerial_AsnText, CException::what(), x_EditAndAddBioseq(), and x_GetScope().

Referenced by x_AddRemoteSequences().

◆ x_AddPig()

void CBuildDatabase::x_AddPig ( CRef< objects::CBlast_def_line_set >  headers)
private

Add the PIG if its ID can be extracted from the deflines.

Parameters
headers  Headers from which to extract the ID, if available.

Definition at line 418 of file build_db.cpp.

References CBlast_def_line_Base::GetOther_info(), CBlast_def_line_Base::IsSetOther_info(), m_OutputDb, and CWriteDB::SetPig().

Referenced by AddSequences(), and x_EditAndAddBioseq().

◆ x_AddRemoteSequences()

bool CBuildDatabase::x_AddRemoteSequences ( CInputGiList gi_list)
private

Fetch and add sequences for IDs not found in the source database.

This method iterates over the list of IDs; any IDs that were not found in the source database are added by fetching the sequence from remote services. (Whether an ID was found locally can be determined by whether the OID found in the GI list is valid.)

Parameters
gi_list  A list of GIs and Seq-ids.
Returns
True if all IDs could be added.

Definition at line 555 of file build_db.cpp.

References CStopWatch::Elapsed(), CStopWatch::eStart, CSeqDBGiList::GetGiOid(), CSeqDBGiList::GetKey(), CSeqDBGiList::GetNumGis(), CSeqDBGiList::GetNumSis(), CSeqDBGiList::GetSiOid(), i, m_LogFile, m_Verbose, CSeqDBGiList::SGiOid::oid, CSeqDBGiList::SSiOid::oid, sw, t, and x_AddOneRemoteSequence().

Referenced by AddIds().
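
The iteration described for x_AddRemoteSequences() can be sketched without the toolkit. This is a simplified model under stated assumptions: SIdOid and AddMissingRemotely are hypothetical names, a negative OID stands in for "not found in the source database", and the fetch step is passed in as a callback rather than using any remote service.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical list entry: an ID plus the OID it resolved to in the source
// database. A negative OID here stands for "not found locally" (assumption).
struct SIdOid {
    std::string id;
    int         oid;
};

// Calls `fetch` for every entry without a valid local OID.
// `fetch` returns true if the sequence was fetched and added.
// Returns true only if every attempted fetch succeeded.
bool AddMissingRemotely(const std::vector<SIdOid>& ids,
                        const std::function<bool(const std::string&)>& fetch)
{
    bool all_added = true;
    for (const SIdOid& entry : ids) {
        if (entry.oid >= 0) {
            continue;  // found locally, nothing to fetch
        }
        if (!fetch(entry.id)) {
            all_added = false;  // keep going so every missing ID is tried
        }
    }
    return all_added;
}
```

The sketch keeps iterating after a failed fetch so that one bad ID does not prevent the remaining ones from being tried; the return value then mirrors the documented "true if all IDs could be added".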

◆ x_DupLocal()

void CBuildDatabase::x_DupLocal ( )
private

Duplicate IDs from local databases.

This method iterates over the list of IDs, copying sequences found in the source databases to the output database.

Definition at line 235 of file build_db.cpp.

References CWriteDB::AddSequence(), ambig(), buffer, CSeqDB::CheckOrFindOID(), CStopWatch::Elapsed(), CStopWatch::eStart, CTaxIdSet::FixTaxId(), CBlast_def_line_set_Base::Get(), CSeqDB::GetHdr(), CSeqDBExpert::GetRawSeqAndAmbig(), m_DeflineCount, m_LogFile, m_OIDCount, m_OutputDb, m_SourceDb, m_Taxids, CWriteDB::SetDeflines(), sw, t, and x_SetLinkAndMbit().

Referenced by AddIds().

◆ x_EditAndAddBioseq()

bool CBuildDatabase::x_EditAndAddBioseq ( CConstRef< objects::CBioseq >  bs,
objects::CSeqVector *  sv,
bool  add_pig = false 
)
private

Modify a Bioseq as needed and add it to the database.

The provided Bioseq is added to the database. Modifications are made to the data as needed (but the input object is not affected). In particular, the taxid is set (0 is used if no taxid is known), and linkout and membership bits are set.

Parameters
bs  Bioseq to add to the database.
sv  Sequence data to add to the database.
add_pig  True if the PIG should be added when available.
Returns
True if the Bioseq has been added, otherwise false.

Definition at line 469 of file build_db.cpp.

References CWriteDB::AddSequence(), CWriteDB::ExtractBioseqDeflines(), CBlast_def_line_set_Base::Get(), m_DeflineCount, m_LongIDs, m_OIDCount, m_OutputDb, m_ParseIDs, m_ScanBioseq4CFastaReaderUsrObjct, s_FixBioseqDeltas(), CWriteDB::SetDeflines(), x_AddMasksForSeqId(), x_AddPig(), and x_EditHeaders().

Referenced by AddSequences(), and x_AddOneRemoteSequence().

◆ x_EditHeaders()

void CBuildDatabase::x_EditHeaders ( CRef< objects::CBlast_def_line_set >  headers)
private

Modify deflines with linkout and membership bits and taxids.

The provided deflines are modified: the taxid is set (0 is used if no taxid is known), and linkout and membership bits are set. The input object is modified.

Parameters
headers  Headers to modify.

Definition at line 428 of file build_db.cpp.

References CTaxIdSet::FixTaxId(), m_SkipCopyingGis, m_Taxids, and x_SetLinkAndMbit().

Referenced by AddSequences(), and x_EditAndAddBioseq().

◆ x_EndBuild()

bool CBuildDatabase::x_EndBuild ( bool  erase,
const CException close_exception 
)
private

◆ x_GetScope()

CScope & CBuildDatabase::x_GetScope ( )
private

Get a scope for remote loading of objects.

Definition at line 1226 of file build_db.cpp.

References CRef< C, Locker >::Empty(), CObjectManager::GetInstance(), m_ObjMgr, m_Scope, and CRef< C, Locker >::Reset().

Referenced by x_AddOneRemoteSequence(), and x_ResolveRemoteId().

◆ x_ReportUnresolvedIds()

bool CBuildDatabase::x_ReportUnresolvedIds ( const CInputGiList gi_list) const
private

Write log messages for any unresolved IDs.

Parameters
gi_list  List of GIs and Seq-ids.
Returns
True if all sequences were resolved.

Definition at line 626 of file build_db.cpp.

References CSeqDBGiList::GetGiOid(), CSeqDBGiList::GetKey(), CSeqDBGiList::GetNumGis(), CSeqDBGiList::GetNumSis(), CSeqDBGiList::GetSiOid(), i, m_LogFile, m_Verbose, CSeqDBGiList::SGiOid::oid, and CSeqDBGiList::SSiOid::oid.

Referenced by AddIds().

◆ x_ResolveFromSource()

bool CBuildDatabase::x_ResolveFromSource ( const string acc,
CRef< objects::CSeq_id > &  id 
)
private

Determine if this string ID can be found in the source database.

The provided string will be looked up as an accession in the source database. If a corresponding sequence is found, its Seq-id will be returned in the `id' parameter. The resolution is only considered a match if the provided string is a substring of the FASTA representation of that Seq-id, and if that substring seems to represent whole components (so that it is surrounded by delimiters such as `|' and `.' rather than by alphanumeric characters, which may be part of another ID).

Parameters
acc  The accession or ID to look up. [in]
id  The returned Seq-id if one is found. [out]
Returns
true if the resolution was successful.

Definition at line 185 of file build_db.cpp.

References CSeqDB::AccessionToOids(), CSeq_id::AsFastaString(), done, CRef< C, Locker >::Empty(), CSeqDB::GetSeqIDs(), ITERATE, m_SourceDb, and S.

Referenced by x_ResolveGis().
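
The whole-component test described for x_ResolveFromSource() can be illustrated with a small standalone function. This is a sketch, not the code in build_db.cpp: IsWholeComponentMatch is a hypothetical helper, and it assumes that any non-alphanumeric character (or an end of the string) counts as a delimiter.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of CBuildDatabase): returns true if `acc`
// occurs in `fasta_id` as a whole component, i.e. the characters immediately
// before and after the match are not alphanumeric, or the match touches an
// end of the string.
bool IsWholeComponentMatch(const std::string& fasta_id, const std::string& acc)
{
    if (acc.empty()) {
        return false;
    }
    std::string::size_type pos = fasta_id.find(acc);
    while (pos != std::string::npos) {
        const std::string::size_type end = pos + acc.size();
        const bool left_ok =
            pos == 0 ||
            !std::isalnum(static_cast<unsigned char>(fasta_id[pos - 1]));
        const bool right_ok =
            end == fasta_id.size() ||
            !std::isalnum(static_cast<unsigned char>(fasta_id[end]));
        if (left_ok && right_ok) {
            return true;
        }
        // Try the next occurrence; an earlier one may have been embedded
        // in a longer identifier.
        pos = fasta_id.find(acc, pos + 1);
    }
    return false;
}
```

With the FASTA string `gb|AAC12345.1|`, the inputs `AAC12345` and `AAC12345.1` are accepted because they are bounded by `|` and `.`, while `AC12345` is rejected because the preceding `A` could belong to a different ID.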

◆ x_ResolveGis()

CRef< CInputGiList > CBuildDatabase::x_ResolveGis ( const vector< string > &  ids)
private

Resolve various input IDs (as strings) to GIs.

The input IDs are examined, the type of each is determined as a GI or some other kind of Seq-id, and each ID is resolved to a GI where possible. The list of GIs and other Seq-ids found is returned in a GI list.

Parameters
ids  List of strings representing IDs to resolve.
Returns
GI list produced from the input ids.

Definition at line 116 of file build_db.cpp.

References CInputGiList::AppendGi(), CInputGiList::AppendSi(), CheckAccession(), debug_mode, ITERATE, m_LogFile, m_SourceDb, m_UseRemote, CRef< C, Locker >::NotEmpty(), x_ResolveFromSource(), x_ResolveRemoteId(), and ZERO_GI.

Referenced by AddIds().
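
The first step of x_ResolveGis(), deciding whether each input string is a GI or some other ID, might look like the sketch below. It covers only the classification and skips the lookups against the source database and remote services. SResolvedIds, LooksLikeGi, and ClassifyIds are hypothetical names, and treating "all decimal digits" as the GI test is an assumption rather than the toolkit's documented rule.

```cpp
#include <cctype>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the GI list the method returns.
struct SResolvedIds {
    std::vector<std::int64_t> gis;      // IDs recognized as GIs
    std::vector<std::string>  seq_ids;  // everything else, kept as strings
};

// Assumption: an ID consisting only of decimal digits is treated as a GI.
bool LooksLikeGi(const std::string& id)
{
    if (id.empty()) {
        return false;
    }
    for (unsigned char c : id) {
        if (!std::isdigit(c)) {
            return false;
        }
    }
    return true;
}

// Splits the input into numeric GIs and other IDs, preserving input order
// within each group. Digit strings too long for 64 bits make std::stoll
// throw std::out_of_range; this sketch does not handle that case.
SResolvedIds ClassifyIds(const std::vector<std::string>& ids)
{
    SResolvedIds out;
    for (const std::string& id : ids) {
        if (LooksLikeGi(id)) {
            out.gis.push_back(std::stoll(id));
        } else {
            out.seq_ids.push_back(id);
        }
    }
    return out;
}
```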

◆ x_ResolveRemoteId()

void CBuildDatabase::x_ResolveRemoteId ( CRef< objects::CSeq_id > &  seqid,
TGi &  gi 
)
private

Resolve an ID remotely.

This method looks up the given ID via remote services in order to find an ID for the most up-to-date version of the sequence. The remote service will return a list of Seq-ids; if at least one of these is a GI, that will be returned in `gi'. If no GI is found, but at least one of the returned IDs is of the same type as the input Seq-id, the version number of the input Seq-id will be updated.

Parameters
seqid  Sequence identifier to look up remotely. [in|out]
gi  GI if one is found, otherwise 0. [out]

Definition at line 65 of file build_db.cpp.

References debug_mode, CSeq_id::GetTextseq_Id(), CSeq_id_Base::IsGi(), CTextseq_id_Base::IsSetVersion(), ITERATE, m_LogFile, NULL, CRef< C, Locker >::Reset(), CSeq_id_Base::Which(), x_GetScope(), and ZERO_GI.

Referenced by x_ResolveGis().
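
The selection rule described for x_ResolveRemoteId() (prefer a GI, otherwise adopt the version of a same-type ID) can be sketched on a simplified ID record. SId and ResolveFromRemoteIds are hypothetical; the toolkit works with CSeq_id objects and a scope lookup, and details such as which same-type ID wins when several are returned are assumptions here.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified ID record; the toolkit uses CSeq_id instead.
struct SId {
    enum EType { eGi, eGenbank, eOther } type;
    std::int64_t gi;         // meaningful only when type == eGi
    std::string  accession;  // meaningful for text IDs
    int          version;    // 0 means "no version set"
};

// Given the IDs returned by a remote lookup, return a GI if one is present.
// Otherwise copy the version from the first returned ID of the same type as
// `seqid` (assumption: first match wins) and return 0.
std::int64_t ResolveFromRemoteIds(const std::vector<SId>& remote_ids,
                                  SId& seqid)
{
    for (const SId& id : remote_ids) {
        if (id.type == SId::eGi) {
            return id.gi;
        }
    }
    for (const SId& id : remote_ids) {
        if (id.type == seqid.type && id.version != 0) {
            seqid.version = id.version;
            break;
        }
    }
    return 0;
}
```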

◆ x_SetLeafTaxids()

void CBuildDatabase::x_SetLeafTaxids ( CRef< objects::CBlast_def_line_set >  headers)
private

Store leaf taxids in provided headers.

Parameters
headers  These deflines will be modified. [in|out]

◆ x_SetLinkAndMbit()

void CBuildDatabase::x_SetLinkAndMbit ( CRef< objects::CBlast_def_line_set >  headers)
private

Store linkout (now deprecated) and membership bits in provided headers.

Each Seq-id found in each defline in the provided headers will be looked up in the set of linkout and membership bits provided for building this database, and the appropriate bits will be set for each defline.

Parameters
headers  These deflines will be modified. [in|out]

Definition at line 1563 of file build_db.cpp.

References GetDeflineKeys(), m_Id2Leafs, m_Id2Mbits, m_KeepLeafs, m_KeepMbits, NON_CONST_ITERATE, s_SetDeflineBits(), and s_SetDeflineLeafs().

Referenced by x_DupLocal(), and x_EditHeaders().
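
The per-defline lookup described for x_SetLinkAndMbit() can be sketched for membership bits alone. SDefline and ApplyMembBits are hypothetical, the bits are modeled as a single integer mask, and the reading of the keep flag (false clears bits inherited from the source database before the configured bits are applied) is an assumption based on the m_KeepMbits description.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified defline: its Seq-ids (as strings) and a
// membership bitmask. The toolkit stores these in CBlast_def_line.
struct SDefline {
    std::vector<std::string> ids;
    int                      memb_bits;
};

// For each defline, optionally discard bits inherited from the source
// database, then OR in the bits configured for any of its IDs.
void ApplyMembBits(std::vector<SDefline>& deflines,
                   const std::map<std::string, int>& id2bits,
                   bool keep_existing)
{
    for (SDefline& dl : deflines) {
        if (!keep_existing) {
            dl.memb_bits = 0;
        }
        for (const std::string& id : dl.ids) {
            std::map<std::string, int>::const_iterator it = id2bits.find(id);
            if (it != id2bits.end()) {
                dl.memb_bits |= it->second;
            }
        }
    }
}
```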

Member Data Documentation

◆ m_DeflineCount

int CBuildDatabase::m_DeflineCount
private

Defline count.

Definition at line 644 of file build_db.hpp.

Referenced by Build(), x_DupLocal(), and x_EditAndAddBioseq().

◆ m_FoundMatchingMasks

bool CBuildDatabase::m_FoundMatchingMasks
private

If true, there were sequences whose IDs matched those in the provided masking locations (via SetMaskDataSource).

Used to display a warning if no such match occurred.

Definition at line 661 of file build_db.hpp.

Referenced by AddSequences(), x_AddMasksForSeqId(), and ~CBuildDatabase().

◆ m_Id2Leafs

TIdToLeafs CBuildDatabase::m_Id2Leafs
private

Table of leaf taxids to apply to sequences.

Definition at line 617 of file build_db.hpp.

Referenced by SetLeafTaxIds(), and x_SetLinkAndMbit().

◆ m_Id2Links

TIdToBits CBuildDatabase::m_Id2Links
private

Table of linkout bits to apply to sequences.

DEPRECATED

Definition at line 605 of file build_db.hpp.

Referenced by SetLinkouts().

◆ m_Id2Mbits

TIdToBits CBuildDatabase::m_Id2Mbits
private

Table of membership bits to apply to sequences.

Definition at line 611 of file build_db.hpp.

Referenced by SetMembBits(), and x_SetLinkAndMbit().

◆ m_IsProtein

bool CBuildDatabase::m_IsProtein
private

True for a protein database, false for nucleotide.

Definition at line 597 of file build_db.hpp.

Referenced by AddFasta(), AddSequences(), and SetSourceDb().

◆ m_KeepLeafs

bool CBuildDatabase::m_KeepLeafs
private

True to keep leaf taxids from source dbs, false to discard.

Definition at line 614 of file build_db.hpp.

Referenced by SetLeafTaxIds(), and x_SetLinkAndMbit().

◆ m_KeepLinks

bool CBuildDatabase::m_KeepLinks
private

True to keep linkout bits from source dbs, false to discard.

DEPRECATED

Definition at line 601 of file build_db.hpp.

Referenced by SetLinkouts().

◆ m_KeepMbits

bool CBuildDatabase::m_KeepMbits
private

True to keep membership bits from source dbs, false to discard.

Definition at line 608 of file build_db.hpp.

Referenced by SetMembBits(), and x_SetLinkAndMbit().

◆ m_LogFile

ostream& CBuildDatabase::m_LogFile
private

◆ m_LongIDs

bool CBuildDatabase::m_LongIDs
private

If true, use long sequence ids (database|accession)

Definition at line 656 of file build_db.hpp.

Referenced by AddFasta(), AddSequences(), CBuildDatabase(), and x_EditAndAddBioseq().

◆ m_MaskData

CRef<IMaskDataSource> CBuildDatabase::m_MaskData
private

Subject masking data.

Definition at line 635 of file build_db.hpp.

Referenced by AddSequences(), SetMaskDataSource(), x_AddMasksForSeqId(), and ~CBuildDatabase().

◆ m_ObjMgr

CRef<objects::CObjectManager> CBuildDatabase::m_ObjMgr
private

Object manager, used for remote fetching.

Definition at line 620 of file build_db.hpp.

Referenced by x_GetScope().

◆ m_OIDCount

int CBuildDatabase::m_OIDCount
private

Number of OIDs stored in this database.

Definition at line 647 of file build_db.hpp.

Referenced by Build(), x_DupLocal(), and x_EditAndAddBioseq().

◆ m_OutputDb

CRef<CWriteDB> CBuildDatabase::m_OutputDb
private

◆ m_OutputDbName

string CBuildDatabase::m_OutputDbName
private

Definition at line 669 of file build_db.hpp.

Referenced by CBuildDatabase(), and GetOutputDbName().

◆ m_ParseIDs

bool CBuildDatabase::m_ParseIDs
private

If true, string IDs found in FASTA input will be parsed as Seq-ids.

Definition at line 653 of file build_db.hpp.

Referenced by AddFasta(), CBuildDatabase(), and x_EditAndAddBioseq().

◆ m_ScanBioseq4CFastaReaderUsrObjct

bool CBuildDatabase::m_ScanBioseq4CFastaReaderUsrObjct
private

Definition at line 671 of file build_db.hpp.

Referenced by x_EditAndAddBioseq().

◆ m_Scope

CRef<objects::CScope> CBuildDatabase::m_Scope
private

Sequence scope, used for remote fetching.

Definition at line 623 of file build_db.hpp.

Referenced by x_GetScope().

◆ m_SkipCopyingGis

bool CBuildDatabase::m_SkipCopyingGis
private

If set to true, when copying BLASTDBs, skip the GIs.

Definition at line 664 of file build_db.hpp.

Referenced by SetSkipCopyingGis(), and x_EditHeaders().

◆ m_SkipLargeGis

bool CBuildDatabase::m_SkipLargeGis
private

If set to true, skip GIs with value > 0x7FFFFFFF.

Definition at line 667 of file build_db.hpp.

Referenced by AddSequences().
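
The threshold in the m_SkipLargeGis description is the largest value of a signed 32-bit integer. A minimal sketch of the comparison, using a hypothetical helper IsLargeGi and a plain 64-bit integer in place of the toolkit's TGi type:

```cpp
#include <cstdint>

// True if the GI exceeds 0x7FFFFFFF, i.e. does not fit in a signed
// 32-bit integer. The 64-bit parameter type is an assumption.
bool IsLargeGi(std::int64_t gi)
{
    return gi > 0x7FFFFFFF;
}
```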

◆ m_SourceDb

CRef<CSeqDBExpert> CBuildDatabase::m_SourceDb
private

Database for duplicating sequences locally (-sourcedb option).

Definition at line 632 of file build_db.hpp.

Referenced by AddIds(), SetSourceDb(), x_DupLocal(), x_ResolveFromSource(), and x_ResolveGis().

◆ m_Taxids

CRef<CTaxIdSet> CBuildDatabase::m_Taxids
private

Set of TaxIDs configured to apply to sequences.

Definition at line 626 of file build_db.hpp.

Referenced by SetTaxids(), x_DupLocal(), x_EditHeaders(), and ~CBuildDatabase().

◆ m_UseRemote

bool CBuildDatabase::m_UseRemote
private

Whether to use remote resolution and sequence fetching.

Definition at line 641 of file build_db.hpp.

Referenced by AddIds(), SetUseRemote(), and x_ResolveGis().

◆ m_Verbose

bool CBuildDatabase::m_Verbose
private

If true, more detailed log messages will be produced.

Definition at line 650 of file build_db.hpp.

Referenced by AddIds(), AddSequences(), SetVerbosity(), x_AddRemoteSequences(), and x_ReportUnresolvedIds().


The documentation for this class was generated from the following files:
Modified on Fri Dec 01 04:48:21 2023 by modify_doxy.py rev. 669887