NCBI C++ ToolKit
Build BlastDB format databases from various data sources. More...
#include <objtools/blast/seqdb_writer/build_db.hpp>
Public Member Functions | |
CBuildDatabase (const string &dbname, const string &title, bool is_protein, CWriteDB::TIndexType indexing, bool use_gi_mask, ostream *logfile, bool long_seqids=false, EBlastDbVersion dbver=eBDB_Version4, bool limit_defline=false, Uint8 oid_masks=EOidMaskType::fNone, bool scan_bioseq_4_cfastareader_usrobj=true) | |
Constructor. More... | |
CBuildDatabase (const string &dbname, const string &title, bool is_protein, bool sparse, bool parse_seqids, bool use_gi_mask, ostream *logfil, bool long_seqids=false, EBlastDbVersion dbver=eBDB_Version4, bool limit_defline=false, Uint8 oid_masks=EOidMaskType::fNone, bool scan_bioseq_4_cfastareader_usrobj=true) | |
Constructor. More... | |
~CBuildDatabase () | |
void | SetTaxids (CTaxIdSet &taxids) |
Specify a mapping of sequence ids to taxonomic ids. More... | |
void | SetMaskLetters (const string &mask_letters) |
Specify letters to mask out of protein sequence data. More... | |
void | SetSourceDb (const string &src_db_name) |
Specify source database(s) via the database name(s). More... | |
void | SetSourceDb (CRef< CSeqDBExpert > src_db) |
Specify source database. More... | |
void | SetLinkouts (const TLinkoutMap &linkouts, bool keep_links) |
Specify a linkout bit lookup object. More... | |
void | SetMembBits (const TLinkoutMap &membbits, bool keep_mbits) |
Specify a membership bit lookup object. More... | |
void | SetLeafTaxIds (const TIdToLeafs &taxids, bool keep_taxids) |
Specify a leaf-taxids object. More... | |
bool | Build (const vector< string > &ids, CNcbiIstream *fasta_file) |
Build the database. More... | |
void | StartBuild () |
Start building a new database. More... | |
bool | AddIds (const vector< string > &ids) |
Add the specified sequences from the source database. More... | |
bool | AddFasta (CNcbiIstream &fasta_file) |
Add sequences from a file containing FASTA data. More... | |
bool | AddSequences (IBioseqSource &src, bool add_pig=false) |
Add sequences from an IBioseqSource object. More... | |
bool | AddSequences (IRawSequenceSource &src) |
Add sequences from an IRawSequenceSource object. More... | |
bool | EndBuild (bool erase=false) |
Finish building a new database. More... | |
void | SetUseRemote (bool use_remote) |
Specify whether to use remote fetching for locally absent IDs. More... | |
void | SetVerbosity (bool v) |
Specify level of output verbosity. More... | |
void | SetSkipCopyingGis (bool v) |
void | SetMaxFileSize (Uint8 max_file_size) |
Set the maximum size of database component files. More... | |
int | RegisterMaskingAlgorithm (EBlast_filter_program program, const string &options, const string &name="") |
Define a masking algorithm. More... | |
int | RegisterMaskingAlgorithm (const string &program, const string &description, const string &options) |
Define a masking algorithm. More... | |
void | SetMaskDataSource (IMaskDataSource &ranges) |
Specify an object mapping Seq-id to subject masking data. More... | |
string | GetOutputDbName () const |
Public Member Functions inherited from CObject | |
CObject (void) | |
Constructor. More... | |
CObject (const CObject &src) | |
Copy constructor. More... | |
virtual | ~CObject (void) |
Destructor. More... | |
CObject & | operator= (const CObject &src) THROWS_NONE |
Assignment operator. More... | |
bool | CanBeDeleted (void) const THROWS_NONE |
Check if object can be deleted. More... | |
bool | IsAllocatedInPool (void) const THROWS_NONE |
Check if object is allocated in memory pool (not system heap) More... | |
bool | Referenced (void) const THROWS_NONE |
Check if object is referenced. More... | |
bool | ReferencedOnlyOnce (void) const THROWS_NONE |
Check if object is referenced only once. More... | |
void | AddReference (void) const |
Add reference to object. More... | |
void | RemoveReference (void) const |
Remove reference to object. More... | |
void | ReleaseReference (void) const |
Remove reference without deleting object. More... | |
virtual void | DoNotDeleteThisObject (void) |
Mark this object as not allocated in heap – do not delete this object. More... | |
virtual void | DoDeleteThisObject (void) |
Mark this object as allocated in heap – object can be deleted. More... | |
void * | operator new (size_t size) |
Define new operator for memory allocation. More... | |
void * | operator new[] (size_t size) |
Define new[] operator for 'array' memory allocation. More... | |
void | operator delete (void *ptr) |
Define delete operator for memory deallocation. More... | |
void | operator delete[] (void *ptr) |
Define delete[] operator for memory deallocation. More... | |
void * | operator new (size_t size, void *place) |
Define new operator. More... | |
void | operator delete (void *ptr, void *place) |
Define delete operator. More... | |
void * | operator new (size_t size, CObjectMemoryPool *place) |
Define new operator using memory pool. More... | |
void | operator delete (void *ptr, CObjectMemoryPool *place) |
Define delete operator. More... | |
virtual void | DebugDump (CDebugDumpContext ddc, unsigned int depth) const |
Define method for dumping debug information. More... | |
Public Member Functions inherited from CDebugDumpable | |
CDebugDumpable (void) | |
virtual | ~CDebugDumpable (void) |
void | DebugDumpText (ostream &out, const string &bundle, unsigned int depth) const |
void | DebugDumpFormat (CDebugDumpFormatter &ddf, const string &bundle, unsigned int depth) const |
void | DumpToConsole (void) const |
Static Public Member Functions | |
static void | CreateDirectories (const string &dbname) |
Create Directory for blast db. More... | |
Static Public Member Functions inherited from CObject | |
static NCBI_XNCBI_EXPORT void | ThrowNullPointerException (void) |
Define method to throw null pointer exception. More... | |
static NCBI_XNCBI_EXPORT void | ThrowNullPointerException (const type_info &type) |
static EAllocFillMode | GetAllocFillMode (void) |
static void | SetAllocFillMode (EAllocFillMode mode) |
static void | SetAllocFillMode (const string &value) |
Set mode from configuration parameter value. More... | |
Static Public Member Functions inherited from CDebugDumpable | |
static void | EnableDebugDump (bool on) |
Private Member Functions | |
objects::CScope & | x_GetScope () |
Get a scope for remote loading of objects. More... | |
void | x_DupLocal () |
Duplicate IDs from local databases. More... | |
void | x_ResolveRemoteId (CRef< objects::CSeq_id > &seqid, TGi &gi) |
Resolve an ID remotely. More... | |
CRef< CInputGiList > | x_ResolveGis (const vector< string > &ids) |
Resolve various input IDs (as strings) to GIs. More... | |
void | x_EditHeaders (CRef< objects::CBlast_def_line_set > headers) |
Modify deflines with linkout and membership bits and taxids. More... | |
void | x_AddPig (CRef< objects::CBlast_def_line_set > headers) |
Add pig if id can be extracted from the deflines. More... | |
bool | x_EditAndAddBioseq (CConstRef< objects::CBioseq > bs, objects::CSeqVector *sv, bool add_pig=false) |
Modify a Bioseq as needed and add it to the database. More... | |
void | x_AddMasksForSeqId (const list< CRef< CSeq_id > > &ids) |
Add the masks for the Seq-id(s) (usually just one) to the database being created. More... | |
bool | x_AddRemoteSequences (CInputGiList &gi_list) |
Add sequences from remote services for IDs not found locally. More... | |
bool | x_ReportUnresolvedIds (const CInputGiList &gi_list) const |
Write log messages for any unresolved IDs. More... | |
void | x_SetLinkAndMbit (CRef< objects::CBlast_def_line_set > headers) |
Store linkout (now deprecated) and membership bits in provided headers. More... | |
void | x_SetLeafTaxids (CRef< objects::CBlast_def_line_set > headers) |
Store leaf taxids in provided headers. More... | |
void | x_AddOneRemoteSequence (const objects::CSeq_id &seqid, bool &found, bool &error) |
Fetch a sequence from the remote service and add it to the db. More... | |
bool | x_ResolveFromSource (const string &acc, CRef< objects::CSeq_id > &id) |
Determine if this string ID can be found in the source database. More... | |
bool | x_EndBuild (bool erase, const CException *close_exception) |
Private Attributes | |
bool | m_IsProtein |
True for a protein database, false for nucleotide. More... | |
bool | m_KeepLinks |
True to keep linkout bits from source dbs, false to discard. More... | |
TIdToBits | m_Id2Links |
Table of linkout bits to apply to sequences. More... | |
bool | m_KeepMbits |
True to keep membership bits from source dbs, false to discard. More... | |
TIdToBits | m_Id2Mbits |
Table of membership bits to apply to sequences. More... | |
bool | m_KeepLeafs |
True to keep leaf taxids from source dbs, false to discard. More... | |
TIdToLeafs | m_Id2Leafs |
Table of leaf taxids to apply to sequences. More... | |
CRef< objects::CObjectManager > | m_ObjMgr |
Object manager, used for remote fetching. More... | |
CRef< objects::CScope > | m_Scope |
Sequence scope, used for remote fetching. More... | |
CRef< CTaxIdSet > | m_Taxids |
Set of TaxIDs configured to apply to sequences. More... | |
CRef< CWriteDB > | m_OutputDb |
Database being produced here. More... | |
CRef< CSeqDBExpert > | m_SourceDb |
Database for duplicating sequences locally (-sourcedb option.) More... | |
CRef< IMaskDataSource > | m_MaskData |
Subject masking data. More... | |
ostream & | m_LogFile |
Logfile. More... | |
bool | m_UseRemote |
Whether to use remote resolution and sequence fetching. More... | |
int | m_DeflineCount |
Defline count. More... | |
int | m_OIDCount |
Number of OIDs stored in this database. More... | |
bool | m_Verbose |
If true, more detailed log messages will be produced. More... | |
bool | m_ParseIDs |
If true, string IDs found in FASTA input will be parsed as Seq-ids. More... | |
bool | m_LongIDs |
If true, use long sequence ids (database|accession) More... | |
bool | m_FoundMatchingMasks |
If true, there were sequences whose IDs matched those in the provided masking locations (via SetMaskDataSource). More... | |
bool | m_SkipCopyingGis |
If set to true, when copying BLASTDBs, skip the GIs. More... | |
bool | m_SkipLargeGis |
If set to true, skip GIs with value > 0x7FFFFFFF. More... | |
string | m_OutputDbName |
bool | m_ScanBioseq4CFastaReaderUsrObjct |
Additional Inherited Members | |
Public Types inherited from CObject | |
enum | EAllocFillMode { eAllocFillNone = 1 , eAllocFillZero , eAllocFillPattern } |
Control filling of newly allocated memory. More... | |
typedef CObjectCounterLocker | TLockerType |
Default locker type for CRef. More... | |
typedef atomic< Uint8 > | TCounter |
Counter type is atomic< Uint8 >. More... | |
typedef Uint8 | TCount |
Alias for value type of counter. More... | |
Static Public Attributes inherited from CObject | |
static const TCount | eCounterBitsCanBeDeleted = 1 << 0 |
Define possible object states. More... | |
static const TCount | eCounterBitsInPlainHeap = 1 << 1 |
Heap signature was found. More... | |
static const TCount | eCounterBitsPlaceMask |
Mask for 'in heap' state flags. More... | |
static const int | eCounterStep = 1 << 2 |
Skip over the "in heap" bits. More... | |
static const TCount | eCounterValid = TCount(1) << (sizeof(TCount) * 8 - 2) |
Minimal value for valid objects (reference counter is zero) Must be a single bit value. More... | |
static const TCount | eCounterStateMask |
Valid object, and object in heap. More... | |
Protected Member Functions inherited from CObject | |
virtual void | DeleteThis (void) |
Virtual method "deleting" this object. More... | |
Build BlastDB format databases from various data sources.
This class provides an API for building BlastDB format databases. The WriteDB library is used internally to produce the actual database; the functionality provided by this class helps to bridge the gap between the WriteDB API and the needs of a command line database construction tool.
Definition at line 136 of file build_db.hpp.
CBuildDatabase::CBuildDatabase | ( | const string & | dbname, |
const string & | title, | ||
bool | is_protein, | ||
CWriteDB::TIndexType | indexing, | ||
bool | use_gi_mask, | ||
ostream * | logfile, | ||
bool | long_seqids = false , |
||
EBlastDbVersion | dbver = eBDB_Version4 , |
||
bool | limit_defline = false , |
||
Uint8 | oid_masks = EOidMaskType::fNone , |
||
bool | scan_bioseq_4_cfastareader_usrobj = true |
||
) |
Constructor.
Create a database with the specified name, type, and other characteristics. The database will use the specified dbname as the base name for database volumes. Note that this overload takes the index type directly through the indexing argument; it has no "sparse" flag.
dbname | Name of the database to create. [in] |
title | Title to use for newly created database. [in] |
is_protein | Use true for protein, false for nucleotide. [in] |
indexing | Index fields to add to database. [in] |
use_gi_mask | if true will generate GI-based mask files [in] |
logfile | file to write the log to [in] |
long_seqids | if true, requires long sequence ids (database|accession) when parsing fasta sequences [in] |
dbver | version of BLAST database to generate [in] |
scan_bioseq_4_cfastareader_usrobj | [in] If true, scan the Bioseq objects for a CFastaReader-created User-object containing a defline |
Definition at line 1073 of file build_db.cpp.
References CTime::AsString(), CDirEntry::CreateAbsolutePath(), CreateDirectories(), dbname(), DeleteBlastDb(), CTime::eCurrent, CWriteDB::eNucleotide, CWriteDB::eProtein, m_LogFile, m_LongIDs, m_OutputDb, m_OutputDbName, m_ParseIDs, ParseMoleculeTypeString(), CRef< C, Locker >::Reset(), and CWriteDB::SetMaxFileSize().
CBuildDatabase::CBuildDatabase | ( | const string & | dbname, |
const string & | title, | ||
bool | is_protein, | ||
bool | sparse, | ||
bool | parse_seqids, | ||
bool | use_gi_mask, | ||
ostream * | logfil, | ||
bool | long_seqids = false , |
||
EBlastDbVersion | dbver = eBDB_Version4 , |
||
bool | limit_defline = false , |
||
Uint8 | oid_masks = EOidMaskType::fNone , |
||
bool | scan_bioseq_4_cfastareader_usrobj = true |
||
) |
Constructor.
Create a database with the specified name, type, and other characteristics. The database will use the specified dbname as the base name for database volumes. Note that this overload has no indexing argument; the index type is derived from eSparseIndex or eDefault, depending on the "sparse" flag.
dbname | Name of the database to create. [in] |
title | Title to use for newly created database. [in] |
is_protein | Use true for protein, false for nucleotide. [in] |
sparse | Specify true to use sparse Seq-id indexing. [in] |
parse_seqids | specify true to parse the sequence IDs [in] |
use_gi_mask | if true will generate GI-based mask files [in] |
logfil | Logging will be done to this stream. [in] |
long_seqids | if true, requires long sequence ids (database|accession) when parsing fasta sequences [in] |
scan_bioseq_4_cfastareader_usrobj | [in] If true, scan the Bioseq objects for a CFastaReader-created User-object containing a defline |
Definition at line 1136 of file build_db.cpp.
References CTime::AsString(), CDirEntry::CreateAbsolutePath(), CreateDirectories(), dbname(), DeleteBlastDb(), CTime::eCurrent, CWriteDB::eDefault, CWriteDB::eNucleotide, CWriteDB::eProtein, CWriteDB::eSparseIndex, m_LogFile, m_OutputDb, m_OutputDbName, m_ParseIDs, ParseMoleculeTypeString(), CRef< C, Locker >::Reset(), and CWriteDB::SetMaxFileSize().
CBuildDatabase::~CBuildDatabase | ( | ) |
Definition at line 1204 of file build_db.cpp.
References ERR_POST, Error(), CTaxIdSet::HasEverFixedId(), m_FoundMatchingMasks, m_MaskData, m_Taxids, and CRef< C, Locker >::NotEmpty().
bool CBuildDatabase::AddFasta | ( | CNcbiIstream & | fasta_file | ) |
Add sequences from a file containing FASTA data.
The provided file is expected to contain FASTA data for one or more sequences. The data should be suitable input as required by CFastaReader.
fasta_file | A file containing FASTA data. |
Definition at line 1398 of file build_db.cpp.
References AddSequences(), EndBuild(), m_IsProtein, m_LongIDs, m_ParseIDs, and NCBI_THROW.
Referenced by BOOST_AUTO_TEST_CASE(), Build(), and CMakeBlastDBApp::x_AddFasta().
bool CBuildDatabase::AddIds | ( | const vector< string > & | ids | ) |
Add the specified sequences from the source database.
The strings in the list are interpreted as GIs if they are composed only of numeric digits, or as Seq-ids otherwise. The sequence IDs will be resolved, and a sequence corresponding to each ID will be added to the output database. If remote resolution is enabled, it will be used to find up-to-date versions for any ambiguously versioned IDs (i.e. unversioned IDs of versioned Seq-id types). Then local fetching will be used to process IDs using the source database if one was specified. If any sequences have not been found, and remote services are enabled, remote fetching will be used for IDs not resolved locally. If any IDs are not found at all, they will be reported as part of the logging output.
ids | List of sequence IDs as strings. |
Definition at line 1321 of file build_db.cpp.
References _ASSERT, map_checker< Container >::end(), map_checker< Container >::find(), CSeqDB::GetDBNameList(), CSeqDBGiList::GetGiOid(), CSeqDBGiList::GetNumGis(), CSeqDBGiList::GetNumSis(), CSeqDB::GetSequenceType(), CSeqDBGiList::SGiOid::gi, i, m_LogFile, m_SourceDb, m_UseRemote, m_Verbose, CRef< C, Locker >::NotEmpty(), CSeqDBGiList::SGiOid::oid, x_AddRemoteSequences(), x_DupLocal(), x_ReportUnresolvedIds(), and x_ResolveGis().
Referenced by BOOST_AUTO_TEST_CASE(), and Build().
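The digit rule described for AddIds can be illustrated with a small standalone sketch. It is not the toolkit's code: LooksLikeGi, SplitIds and SplitByKind are hypothetical names, and treating an empty string as "not a GI" is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not a toolkit function): true when the string is
// non-empty and made up only of decimal digits, the rule the text gives
// for treating an ID as a GI.
bool LooksLikeGi(const std::string& id)
{
    return !id.empty() &&
           std::all_of(id.begin(), id.end(), [](unsigned char c) {
               return std::isdigit(c) != 0;
           });
}

// Result of splitting a list of ID strings by kind.
struct SplitIds {
    std::vector<std::string> gis;     // digit-only strings
    std::vector<std::string> seqids;  // everything else
};

// Splits the input into GI-like strings and other Seq-id strings.
SplitIds SplitByKind(const std::vector<std::string>& ids)
{
    SplitIds out;
    for (const std::string& id : ids) {
        (LooksLikeGi(id) ? out.gis : out.seqids).push_back(id);
    }
    // Every input string lands in exactly one bucket.
    assert(out.gis.size() + out.seqids.size() == ids.size());
    return out;
}
```

For example, SplitByKind applied to "129295", "NP_000001.1" and "555" places the first and last strings in gis and the accession in seqids.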
bool CBuildDatabase::AddSequences | ( | IBioseqSource & | src, |
bool | add_pig = false |
||
) |
Add sequences from an IBioseqSource object.
The provided `src' object is queried using GetNext() to get a Bioseq object. The Bioseq is added to the output database (with appropriate modifications of taxid, membership bits, and linkout bits, as configured here). This process repeats until the GetNext() method returns NULL.
src | An object providing one or more Bioseq objects. |
add_pig | true if PIG should be added if available |
Definition at line 794 of file build_db.cpp.
References CBioseq_Base::CanGetId(), debug_mode, CSeq_id_Base::e_Local, CStopWatch::Elapsed(), CStopWatch::eStart, CSeq_id::fAcc_nuc, CSeq_id::fAcc_prot, CBioseq_Base::GetId(), CBioseq::GetLength(), IBioseqSource::GetNext(), CConstRef< C, Locker >::GetNonNullPointer(), GI_CONST, info, CBioseq::IsAa(), label, m_IsProtein, m_LogFile, m_LongIDs, m_SkipLargeGis, m_Verbose, NCBI_THROW, CConstRef< C, Locker >::NotEmpty(), NULL, CBioseq_Base::SetId(), sw, t, and x_EditAndAddBioseq().
Referenced by AddFasta(), BOOST_AUTO_TEST_CASE(), s_TestReadPDBAsn1(), CMakeBlastDBApp::x_AddSeqEntries(), CMakeClusterDBApp::x_BuildDatabase(), BlastdbCopyApplication::x_CopyDB(), BlastdbCopyApplication::x_MakeDBwIDList(), and CMakeBlastDBApp::x_ProcessInputData().
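The pull loop described for AddSequences (call GetNext() until it returns NULL) can be sketched with stand-in types. FakeBioseq, VectorSource and DrainSource are hypothetical; a real source would implement IBioseqSource and return CConstRef< objects::CBioseq >, which this sketch replaces with std::shared_ptr.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a Bioseq; the real type is objects::CBioseq.
struct FakeBioseq {
    std::string id;
    std::string residues;
};

// Stand-in source following the GetNext() contract described in the text:
// each call returns the next sequence, or a null pointer once exhausted.
class VectorSource {
public:
    explicit VectorSource(std::vector<FakeBioseq> seqs)
        : m_Seqs(std::move(seqs)) {}

    std::shared_ptr<const FakeBioseq> GetNext()
    {
        assert(m_Pos <= m_Seqs.size());
        if (m_Pos == m_Seqs.size()) {
            return nullptr;
        }
        return std::make_shared<const FakeBioseq>(m_Seqs[m_Pos++]);
    }

private:
    std::vector<FakeBioseq> m_Seqs;
    std::size_t m_Pos = 0;
};

// Consumes the source the way the text describes and returns how many
// sequences were pulled before GetNext() returned null.
template <class Source>
std::size_t DrainSource(Source& src)
{
    std::size_t count = 0;
    while (auto bs = src.GetNext()) {
        ++count;  // a real consumer would add *bs to the output database here
    }
    return count;
}
```

Draining a source holding two sequences returns 2; draining it again returns 0, because the source stays exhausted.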
bool CBuildDatabase::AddSequences | ( | IRawSequenceSource & | src | ) |
Add sequences from an IRawSequenceSource object.
The provided `src' object is queried using GetNext() to get various "raw format" sequence data and metadata components. These pieces of data are added to the output database (with appropriate modifications of taxid, membership bits, and linkout bits, as configured here). This process repeats until the GetNext() method returns false.
src | An object providing one or more "raw" sequences. |
Definition at line 904 of file build_db.cpp.
References _ASSERT, CWriteDB::AddColumnMetaData(), CWriteDB::AddSequence(), CBlastDbBlob::Clear(), CWriteDB::CreateUserColumn(), CTempString::data(), done, CStopWatch::Elapsed(), CMaskedRangesVector::empty(), CTempString::empty(), CRef< C, Locker >::Empty(), map_checker< Container >::end(), CStopWatch::eStart, map_checker< Container >::find(), CWriteDB::FindColumn(), CBlast_def_line_set_Base::Get(), IRawSequenceSource::GetColumnId(), IRawSequenceSource::GetColumnMetaData(), IRawSequenceSource::GetColumnNames(), IRawSequenceSource::GetNext(), IMaskDataSource::GetRanges(), i, int, ITERATE, m_FoundMatchingMasks, m_IsProtein, m_LogFile, m_MaskData, m_OutputDb, NCBI_THROW, CWriteDB::SetBlobData(), CWriteDB::SetDeflines(), CWriteDB::SetMaskData(), ncbi::grid::netcache::search::fields::size, CTempString::size(), sw, t, CBlastDbBlob::WriteRaw(), x_AddPig(), and x_EditHeaders().
bool CBuildDatabase::Build | ( | const vector< string > & | ids, |
CNcbiIstream * | fasta_file | ||
) |
Build the database.
This method builds a database from the given list of Sequence IDs and the provided file, which should contain FASTA format data. It is equivalent to calling StartBuild(), AddIds(), AddFasta(), and EndBuild() in that order (except that some additional logging is done with summary information).
ids | List of identifiers to add to the database. |
fasta_file | FASTA format data for |
Definition at line 1289 of file build_db.cpp.
References AddFasta(), AddIds(), CStopWatch::Elapsed(), EndBuild(), CStopWatch::eStart, m_DeflineCount, m_LogFile, m_OIDCount, StartBuild(), sw, and t.
Referenced by BOOST_AUTO_TEST_CASE().
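The call order that Build() is said to be equivalent to can be sketched as a template over any type exposing the four methods with the signatures listed on this page. BuildInSteps and RecordingBuilder are hypothetical names; how the individual return values are combined, and skipping AddFasta for a null stream, are assumptions rather than the behavior of build_db.cpp.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Runs StartBuild, AddIds, AddFasta and EndBuild in the documented order.
// Assumption: the result is true only if every step reported success.
template <class Builder>
bool BuildInSteps(Builder& db,
                  const std::vector<std::string>& ids,
                  std::istream* fasta_file)
{
    db.StartBuild();
    bool ok = db.AddIds(ids);
    if (fasta_file != nullptr) {          // assumption: null means "no FASTA"
        ok = db.AddFasta(*fasta_file) && ok;
    }
    ok = db.EndBuild() && ok;
    return ok;
}

// Stand-in builder that only records which methods were called.
struct RecordingBuilder {
    std::vector<std::string> calls;

    void StartBuild() { calls.push_back("StartBuild"); }
    bool AddIds(const std::vector<std::string>&)
    {
        calls.push_back("AddIds");
        return true;
    }
    bool AddFasta(std::istream&)
    {
        calls.push_back("AddFasta");
        return true;
    }
    bool EndBuild(bool erase = false)
    {
        (void)erase;
        // EndBuild only makes sense after StartBuild has been recorded.
        assert(!calls.empty() && calls.front() == "StartBuild");
        calls.push_back("EndBuild");
        return true;
    }
};
```

Because the helper is a template, the same call sequence would apply to a CBuildDatabase object in code that links against the toolkit.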
static void CBuildDatabase::CreateDirectories | ( | const string & | dbname | ) |
Create Directory for blast db.
dbname | output blast db name (with path) |
Definition at line 1051 of file build_db.cpp.
References CDirEntry::CheckAccess(), CDir::CreatePath(), dbname(), CDirEntry::eIfEmptyPath_Empty, CDir::Exists(), CDirEntry::fWrite, CDirEntry::GetDir(), CDirEntry::GetName(), and NCBI_THROW.
Referenced by CBuildDatabase(), CBlastdbConvertApp::Run(), and CMakeProfileDBApp::x_Run().
bool CBuildDatabase::EndBuild | ( | bool | erase = false | ) |
Finish building a new database.
This method closes the newly constructed database, flushing any unflushed volumes, creating an alias file to tie the volumes together, and so on.
erase | Will erase all files created if true. |
Definition at line 1423 of file build_db.cpp.
References CWriteDB::Close(), eUnknown, m_OutputDb, NCBI_EXCEPTION_VAR, NULL, CException::what(), and x_EndBuild().
Referenced by AddFasta(), BOOST_AUTO_TEST_CASE(), Build(), s_TestReadPDBAsn1(), CMakeBlastDBApp::x_BuildDatabase(), CMakeClusterDBApp::x_BuildDatabase(), BlastdbCopyApplication::x_CopyDB(), and BlastdbCopyApplication::x_MakeDBwIDList().
string CBuildDatabase::GetOutputDbName | ( | ) | const |
inline |
Definition at line 465 of file build_db.hpp.
References m_OutputDbName.
Referenced by CMakeBlastDBApp::x_BuildDatabase(), and CMakeClusterDBApp::x_BuildDatabase().
int CBuildDatabase::RegisterMaskingAlgorithm | ( | const string & | program, |
const string & | description, | ||
const string & | options | ||
) |
Define a masking algorithm.
The returned integer ID will be defined as corresponding to the provided program enumeration (e.g. DUST, SEG, etc) and options string, for subject masking. Each program enumeration (such as DUST) may be used several times with different options strings, however, the combination of program and options should be unique for each algorithm ID. The options string is a free-form string (at least from this class's point of view).
program | A string to identify the filtering algorithm [in] |
description | A free-form string describing the data [in] |
options | A free-form string describing the options used [in] |
Definition at line 1597 of file build_db.cpp.
References m_OutputDb, and CWriteDB::RegisterMaskAlgorithm().
int CBuildDatabase::RegisterMaskingAlgorithm | ( | EBlast_filter_program | program, |
const string & | options, | ||
const string & | name = "" |
||
) |
Define a masking algorithm.
The returned integer ID will be defined as corresponding to the provided program enumeration (e.g. DUST, SEG, etc) and options string, for subject masking. Each program enumeration (such as DUST) may be used several times with different options strings, however, the combination of program and options should be unique for each algorithm ID. The options string is a free-form string (at least from this class's point of view).
program | One of the predefined masking types (dust etc). [in] |
options | A free-form string describing this type of data. The empty string should be used to indicate default parameters. [in] |
name | Name of the GI-base mask file [in] |
Definition at line 1584 of file build_db.cpp.
References m_OutputDb, and CWriteDB::RegisterMaskAlgorithm().
Referenced by CClusterDBSource::CClusterDBSource(), CRawSeqDBSource::CRawSeqDBSource(), and CMakeBlastDBApp::x_ProcessMaskData().
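The uniqueness rule for masking algorithms (one ID per combination of program and options) can be illustrated with a small registry. This is a hypothetical model, not CWriteDB::RegisterMaskAlgorithm: the ID numbering from zero and the exception thrown on a duplicate are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical registry keyed by (program, options). The same program may
// appear several times as long as the options string differs.
class MaskAlgoRegistry {
public:
    // Returns a new ID for an unseen (program, options) pair and throws
    // std::invalid_argument when the pair was already registered.
    int Register(const std::string& program, const std::string& options)
    {
        const std::pair<std::string, std::string> key(program, options);
        if (m_Ids.count(key) != 0) {
            throw std::invalid_argument(
                "duplicate masking algorithm: " + program + " " + options);
        }
        const int id = static_cast<int>(m_Ids.size());
        m_Ids.emplace(key, id);
        assert(m_Ids.size() == static_cast<std::size_t>(id) + 1);
        return id;
    }

private:
    std::map<std::pair<std::string, std::string>, int> m_Ids;
};
```

Registering "dust" twice with different options strings yields two distinct IDs, while repeating an identical pair is rejected.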
void CBuildDatabase::SetLeafTaxIds | ( | const TIdToLeafs & | taxids, |
bool | keep_taxids | ||
) |
Specify a leaf-taxids object.
Definition at line 1278 of file build_db.cpp.
References m_Id2Leafs, m_KeepLeafs, and m_LogFile.
Referenced by CMakeClusterDBApp::x_BuildDatabase(), BlastdbCopyApplication::x_CopyDB(), BlastdbCopyApplication::x_MakeDBwIDList(), and CMakeBlastDBApp::x_ProcessInputData().
void CBuildDatabase::SetLinkouts | ( | const TLinkoutMap & | linkouts, |
bool | keep_links | ||
) |
Specify a linkout bit lookup object.
The provided mapping will be used to look up linkout bits for sequences added to the database.
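linkouts | Mapping used to look up linkout bits for sequences. [in] |
keep_links | True to keep linkout bits from source dbs, false to discard. [in] |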
Definition at line 1262 of file build_db.cpp.
References m_Id2Links, m_KeepLinks, m_LogFile, and MapToLMBits().
void CBuildDatabase::SetMaskDataSource | ( | IMaskDataSource & | ranges | ) |
Specify an object mapping Seq-id to subject masking data.
Masking data is provided to CBuildDatabase by implementing an interface that can produce masking data given the Seq-ids for the sequence that is to be masked. This object could wrap a simple lookup table, an algorithm that produces the data on the fly, or a wrapper around an existing database that fetches the masking data from that database.
ranges | An object mapping Seq-ids to their masking data. |
Definition at line 1609 of file build_db.cpp.
References m_MaskData, and CRef< C, Locker >::Reset().
Referenced by CMakeBlastDBApp::x_ProcessMaskData().
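The "simple lookup table" variant of a mask data source mentioned above can be sketched with stand-in types. TableMaskSource and Range are hypothetical; a real implementation would derive from IMaskDataSource and work with Seq-id objects and the toolkit's own range container rather than strings and pairs, and the half-open offset convention here is an assumption.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-in for one masked interval: [first, second) offsets (an assumption).
using Range = std::pair<unsigned, unsigned>;

// Hypothetical lookup-table source: maps an ID string to its masked ranges.
class TableMaskSource {
public:
    void Add(const std::string& id, Range r)
    {
        assert(r.first <= r.second);
        m_Table[id].push_back(r);
    }

    // Returns the ranges stored for the first ID that has an entry, or an
    // empty vector when none of the IDs is known.
    const std::vector<Range>& GetRanges(const std::vector<std::string>& ids) const
    {
        static const std::vector<Range> kEmpty;
        for (const std::string& id : ids) {
            const auto it = m_Table.find(id);
            if (it != m_Table.end()) {
                return it->second;
            }
        }
        return kEmpty;
    }

private:
    std::map<std::string, std::vector<Range>> m_Table;
};
```

Accepting a list of IDs mirrors the description: one sequence can carry several Seq-ids, and a match on any of them is enough to find its masks.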
void CBuildDatabase::SetMaskLetters | ( | const string & | mask_letters | ) |
Specify letters to mask out of protein sequence data.
Protein sequences sometimes contain rare (or recently defined) letters that cause trouble for some algorithms. This method specifies a list of protein letters that might be found in the input sequences, but which should be replaced by "X" before adding those sequences to the database.
mask_letters | Protein letters to replace with "X". [in] |
Definition at line 1221 of file build_db.cpp.
References m_OutputDb, and CWriteDB::SetMaskedLetters().
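The replacement described for SetMaskLetters can be shown on plain strings. MaskLetters is a hypothetical helper, not the WriteDB implementation, and the case-sensitive comparison is an assumption.

```cpp
#include <cassert>
#include <string>

// Returns a copy of seq in which every letter listed in mask_letters has
// been replaced by 'X'. Matching is case-sensitive in this sketch.
std::string MaskLetters(const std::string& seq, const std::string& mask_letters)
{
    std::string out(seq);
    for (char& c : out) {
        if (mask_letters.find(c) != std::string::npos) {
            c = 'X';
        }
    }
    // Masking substitutes letters in place and never changes the length.
    assert(out.size() == seq.size());
    return out;
}
```

With mask letters "UO", the protein string "MKUOLA" becomes "MKXXLA"; an empty mask list leaves the sequence unchanged.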
void CBuildDatabase::SetMaxFileSize | ( | Uint8 | max_file_size | ) |
Set the maximum size of database component files.
This specifies the maximum size of any file created as a component of a database volume by the WriteDB library. The default value is 10^9 bytes (one billion).
max_file_size | Maximum file size in bytes. |
Definition at line 1578 of file build_db.cpp.
References m_OutputDb, and CWriteDB::SetMaxFileSize().
Referenced by CMakeBlastDBApp::x_BuildDatabase(), CMakeClusterDBApp::x_BuildDatabase(), BlastdbCopyApplication::x_CopyDB(), and BlastdbCopyApplication::x_MakeDBwIDList().
void CBuildDatabase::SetMembBits | ( | const TLinkoutMap & | membbits, |
bool | keep_mbits | ||
) |
Specify a membership bit lookup object.
The provided mapping will be used to look up membership bit data for sequences added to the database.
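membbits | Mapping used to look up membership bits for sequences. [in] |
keep_mbits | True to keep membership bits from source dbs, false to discard. [in] |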
Definition at line 1270 of file build_db.cpp.
References m_Id2Mbits, m_KeepMbits, m_LogFile, and MapToLMBits().
Referenced by CMakeBlastDBApp::x_BuildDatabase(), BlastdbCopyApplication::x_CopyDB(), and BlastdbCopyApplication::x_MakeDBwIDList().
void CBuildDatabase::SetSkipCopyingGis | ( | bool | v | ) |
inline |
Definition at line 397 of file build_db.hpp.
References m_SkipCopyingGis.
Referenced by BlastdbCopyApplication::x_CopyDB(), and BlastdbCopyApplication::x_MakeDBwIDList().
void CBuildDatabase::SetSourceDb | ( | const string & | src_db_name | ) |
Specify source database(s) via the database name(s).
The provided name will be used to find a source database (or several) to look up sequence information for the list of sequences specified by AddIds().
src_db_name | Database name of the source database. [in] |
Definition at line 1250 of file build_db.cpp.
References _ASSERT, CSeqDB::eNucleotide, CSeqDB::eProtein, and m_IsProtein.
Referenced by BOOST_AUTO_TEST_CASE(), BlastdbCopyApplication::x_CopyDB(), and BlastdbCopyApplication::x_MakeDBwIDList().
void CBuildDatabase::SetSourceDb | ( | CRef< CSeqDBExpert > | src_db | ) |
Specify source database.
The provided source database will be used to look up sequence information for the list of sequences specified by AddIds().
src_db | The source database. [in] |
Definition at line 1242 of file build_db.cpp.
References CSeqDB::GetDate(), CSeqDB::GetDBNameList(), CSeqDB::GetTitle(), m_LogFile, and m_SourceDb.
void CBuildDatabase::SetTaxids | ( | CTaxIdSet & | taxids | ) |
Specify a mapping of sequence ids to taxonomic ids.
When adding sequences CBuildDatabase will use the object provided here to find TaxIDs for sequences it adds to the newly created database.
taxids | An object providing defline-to-TaxID lookups. [in] |
Definition at line 1216 of file build_db.cpp.
References m_Taxids, and CRef< C, Locker >::Reset().
Referenced by BOOST_AUTO_TEST_CASE(), and CMakeBlastDBApp::x_BuildDatabase().
void CBuildDatabase::SetUseRemote | ( | bool | use_remote | ) |
inline |
Specify whether to use remote fetching for locally absent IDs.
If identifiers in the list provided to Build or to AddIds are not found in the source database (if any), remote sequence fetching APIs can be used to fetch those sequences. Normally this happens in two cases. First, sequences named in the list of IDs are sometimes too new to be found in the source database. Second, sequences may be found in the source database, but newer versions might be available in the remote database.
If the use_remote flag is set to true, this class finds the latest version number for unversioned IDs (but only of types that can have versions in the first place), and will attempt to remotely fetch any sequences for which the source database does not have the latest version. If the flag is specified as false, no remote lookups will be done, and sequences found in ids but not found in the source database will not be added to the output database.
Note: This does not affect AddFasta or either AddSequences overload; in those cases, all provided sequences are added in the form they are provided in.
The default value for this flag is "true".
use_remote | Specify true for remote checking & fetching. |
Definition at line 385 of file build_db.hpp.
References m_UseRemote.
Referenced by BOOST_AUTO_TEST_CASE(), BlastdbCopyApplication::x_CopyDB(), and BlastdbCopyApplication::x_MakeDBwIDList().
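The effect of the use_remote flag on a single ID can be summarized as a small decision function. This is a reading of the description above, not code from build_db.cpp; EAction, IdStatus and PlanAction are hypothetical names.

```cpp
#include <cassert>

// What to do with one requested ID.
enum class EAction { CopyLocal, FetchRemote, Skip };

// Hypothetical summary of what is known about one ID.
struct IdStatus {
    bool found_locally;    // present in the source database
    bool local_is_latest;  // source database holds the newest version
};

// Chooses an action following the SetUseRemote description: without remote
// access only local hits are used; with remote access, anything missing or
// out of date locally is fetched remotely.
EAction PlanAction(const IdStatus& s, bool use_remote)
{
    // A sequence cannot be the latest local version unless it is local.
    assert(s.found_locally || !s.local_is_latest);

    if (!use_remote) {
        return s.found_locally ? EAction::CopyLocal : EAction::Skip;
    }
    if (s.found_locally && s.local_is_latest) {
        return EAction::CopyLocal;
    }
    return EAction::FetchRemote;
}
```

An ID that is absent locally is skipped when use_remote is false and fetched remotely when it is true, which matches the two cases the description distinguishes.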
void CBuildDatabase::SetVerbosity | ( | bool | v | ) |
inline |
Specify level of output verbosity.
v | Specify true if output should be more detailed. |
Definition at line 392 of file build_db.hpp.
References m_Verbose.
Referenced by CMakeBlastDBApp::x_BuildDatabase(), and CMakeClusterDBApp::x_BuildDatabase().
void CBuildDatabase::StartBuild | ( | ) |
Start building a new database.
This method sets up a new database to begin receiving sequences. It should be called before AddIds, AddFasta, or either AddSequences overload is called.
Definition at line 1317 of file build_db.cpp.
Referenced by BOOST_AUTO_TEST_CASE(), Build(), s_TestReadPDBAsn1(), BlastdbCopyApplication::x_CopyDB(), and BlastdbCopyApplication::x_MakeDBwIDList().
void CBuildDatabase::x_AddMasksForSeqId | ( | const list< CRef< CSeq_id > > & | ids | ) |
Add the masks for the Seq-id(s) (usually just one) to the database being created.
ids | Seq-id(s) of the sequence to which masks should be added [in] |
Definition at line 447 of file build_db.cpp.
References CMaskedRangesVector::empty(), CRef< C, Locker >::Empty(), IMaskDataSource::GetRanges(), ITERATE, m_FoundMatchingMasks, m_MaskData, m_OutputDb, and CWriteDB::SetMaskData().
Referenced by x_EditAndAddBioseq().
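void CBuildDatabase::x_AddOneRemoteSequence | ( | const objects::CSeq_id & | seqid, |
bool & | found, | ||
bool & | error | ||
) |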
private |
Fetch a sequence from the remote service and add it to the db.
The provided Seq-id will be used to fetch a Bioseq remotely, and this Bioseq will be added to this database. If
seqid | Identifies the sequence to fetch. [in] |
found | Will be set to true if a sequence was found. [out] |
error | Will be set to true if an error occurred. [out] |
Definition at line 507 of file build_db.cpp.
References debug_mode, CBioseq_Handle::fState_not_found, CBioseq_Handle::GetCompleteBioseq(), CBioseq_Handle::GetState(), m_LogFile, MSerial_AsnText, CException::what(), x_EditAndAddBioseq(), and x_GetScope().
Referenced by x_AddRemoteSequences().
void CBuildDatabase::x_AddPig | ( | CRef< objects::CBlast_def_line_set > | headers | ) |
private |
Add pig if id can be extracted from the deflines.
headers | Headers to extract the id if available. |
Definition at line 418 of file build_db.cpp.
References CBlast_def_line_Base::GetOther_info(), CBlast_def_line_Base::IsSetOther_info(), m_OutputDb, and CWriteDB::SetPig().
Referenced by AddSequences(), and x_EditAndAddBioseq().
|
private |
Fetch from remote services any IDs not found in local databases.
This method iterates over the list of IDs; any IDs that were not found in the source database are added by fetching the sequence from remote services. (Whether an ID was found locally can be determined by whether the OID found in the GI list is valid.)
gi_list | A list of GIs and Seq-ids. |
Definition at line 555 of file build_db.cpp.
References CStopWatch::Elapsed(), CStopWatch::eStart, CSeqDBGiList::GetGiOid(), CSeqDBGiList::GetKey(), CSeqDBGiList::GetNumGis(), CSeqDBGiList::GetNumSis(), CSeqDBGiList::GetSiOid(), i, m_LogFile, m_Verbose, CSeqDBGiList::SGiOid::oid, CSeqDBGiList::SSiOid::oid, sw, t, and x_AddOneRemoteSequence().
Referenced by AddIds().
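The fallback described above (fetch remotely only what was not found locally) can be sketched in standalone C++. This is an illustration, not the toolkit's code: IdEntry, AddRemoteForUnresolved, and the fetch_remote callback are hypothetical names, and the convention that a negative OID means "not found locally" is an assumption made for the sketch.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry of the GI list.
// Assumption: a negative OID means the ID was not found in the source database.
struct IdEntry {
    std::string id;
    int         oid;
};

// Calls fetch_remote for every entry without a valid local OID and
// returns how many of those fetches reported success.
int AddRemoteForUnresolved(const std::vector<IdEntry>& entries,
                           const std::function<bool(const std::string&)>& fetch_remote)
{
    int added = 0;
    for (const IdEntry& entry : entries) {
        if (entry.oid >= 0) {
            continue;  // already resolved locally, nothing to fetch
        }
        if (fetch_remote(entry.id)) {
            ++added;
        }
    }
    return added;
}
```

Passing the fetch step as a callback keeps the sketch independent of any remote service; in the class itself that role is played by x_AddOneRemoteSequence().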
|
private |
Duplicate IDs from local databases.
This method iterates over the list of IDs, copying sequences found in the source databases to the output database.
Definition at line 235 of file build_db.cpp.
References CWriteDB::AddSequence(), ambig(), buffer, CSeqDB::CheckOrFindOID(), CStopWatch::Elapsed(), CStopWatch::eStart, CTaxIdSet::FixTaxId(), CBlast_def_line_set_Base::Get(), CSeqDB::GetHdr(), CSeqDBExpert::GetRawSeqAndAmbig(), m_DeflineCount, m_LogFile, m_OIDCount, m_OutputDb, m_SourceDb, m_Taxids, CWriteDB::SetDeflines(), sw, t, and x_SetLinkAndMbit().
Referenced by AddIds().
|
private |
Modify a Bioseq as needed and add it to the database.
The provided Bioseq is added to the database. Modifications are made to the data as needed (but the input object is not affected). In particular, the taxid is set (0 is used if no taxid is known), and linkout and membership bits are set.
bs | Bioseq to add to the database. |
sv | Sequence data to add to the database. |
add_pig | true if PIG should be added if available |
Definition at line 469 of file build_db.cpp.
References CWriteDB::AddSequence(), CWriteDB::ExtractBioseqDeflines(), CBlast_def_line_set_Base::Get(), m_DeflineCount, m_LongIDs, m_OIDCount, m_OutputDb, m_ParseIDs, m_ScanBioseq4CFastaReaderUsrObjct, s_FixBioseqDeltas(), CWriteDB::SetDeflines(), x_AddMasksForSeqId(), x_AddPig(), and x_EditHeaders().
Referenced by AddSequences(), and x_AddOneRemoteSequence().
|
private |
Modify deflines with linkout and membership bits and taxids.
The provided deflines are modified: the taxid is set (0 is used if no taxid is known), and linkout and membership bits are set. The input object is modified.
headers | Headers to modify. |
Definition at line 428 of file build_db.cpp.
References CTaxIdSet::FixTaxId(), m_SkipCopyingGis, m_Taxids, and x_SetLinkAndMbit().
Referenced by AddSequences(), and x_EditAndAddBioseq().
|
private |
Definition at line 1439 of file build_db.cpp.
References _ASSERT, _TRACE, CException::GetMsg(), ITERATE, CWriteDB::ListFiles(), CWriteDB::ListVolumes(), m_LogFile, m_OutputDb, NCBI_RETHROW, and CDirEntry::Remove().
Referenced by EndBuild().
|
private |
Get a scope for remote loading of objects.
Definition at line 1226 of file build_db.cpp.
References CRef< C, Locker >::Empty(), CObjectManager::GetInstance(), m_ObjMgr, m_Scope, and CRef< C, Locker >::Reset().
Referenced by x_AddOneRemoteSequence(), and x_ResolveRemoteId().
|
private |
Write log messages for any unresolved IDs.
gi_list | List of GIs and Seq-ids. |
Definition at line 626 of file build_db.cpp.
References CSeqDBGiList::GetGiOid(), CSeqDBGiList::GetKey(), CSeqDBGiList::GetNumGis(), CSeqDBGiList::GetNumSis(), CSeqDBGiList::GetSiOid(), i, m_LogFile, m_Verbose, CSeqDBGiList::SGiOid::oid, and CSeqDBGiList::SSiOid::oid.
Referenced by AddIds().
|
private |
Determine if this string ID can be found in the source database.
The provided string will be looked up as an accession in the source database. If a corresponding sequence is found, it will be returned in the `id' field. The resolution is only considered a match if the provided string is a substring of the FASTA representation of the provided Seq-id, and if that substring seems to represent whole components (so that it's surrounded by delimiters such as `|' and `.' rather than by alphanumeric characters, which may be part of another ID).
acc | The accession or ID to look up. [in] |
id | The returned Seq-id if one is found. [out] |
Definition at line 185 of file build_db.cpp.
References CSeqDB::AccessionToOids(), CSeq_id::AsFastaString(), done, CRef< C, Locker >::Empty(), CSeqDB::GetSeqIDs(), ITERATE, m_SourceDb, and S.
Referenced by x_ResolveGis().
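The whole-component rule described above can be illustrated with a small standalone check. This is a minimal sketch under stated assumptions, not the method's actual implementation: MatchesWholeComponent is a hypothetical helper, and "delimiter" is approximated here as "any non-alphanumeric character or the end of the string".

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: returns true if `acc` occurs in `fasta` and the
// occurrence is not directly preceded or followed by an alphanumeric
// character, i.e. it looks like one or more whole ID components.
bool MatchesWholeComponent(const std::string& fasta, const std::string& acc)
{
    if (acc.empty()) {
        return false;
    }
    std::string::size_type pos = fasta.find(acc);
    while (pos != std::string::npos) {
        const std::string::size_type end = pos + acc.size();
        const bool left_ok = pos == 0 ||
            !std::isalnum(static_cast<unsigned char>(fasta[pos - 1]));
        const bool right_ok = end == fasta.size() ||
            !std::isalnum(static_cast<unsigned char>(fasta[end]));
        if (left_ok && right_ok) {
            return true;
        }
        pos = fasta.find(acc, pos + 1);  // try the next occurrence
    }
    return false;
}
```

With this rule, "NM_000001" matches inside "ref|NM_000001.2|", while the prefix "NM_00000" does not, because the character that follows it is a digit belonging to the same component.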
|
private |
Resolve various input IDs (as strings) to GIs.
The input IDs are examined, the type of each is determined as a GI or some other kind of Seq-id, and each ID is resolved to a GI where possible. The list of GIs and other Seq-ids found is returned in a GI list.
ids | List of strings representing IDs to resolve. |
Definition at line 116 of file build_db.cpp.
References CInputGiList::AppendGi(), CInputGiList::AppendSi(), CheckAccession(), debug_mode, ITERATE, m_LogFile, m_SourceDb, m_UseRemote, CRef< C, Locker >::NotEmpty(), x_ResolveFromSource(), x_ResolveRemoteId(), and ZERO_GI.
Referenced by AddIds().
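As a rough illustration of the first step (sorting string IDs into GIs and other Seq-ids), the sketch below uses a deliberately simple rule. The rule itself is an assumption made for the example: a non-empty string of at most 18 digits is treated as a GI and everything else is kept as a string ID. ResolvedIds, LooksLikeGi, and ClassifyIds are hypothetical names; the real method relies on toolkit facilities such as CheckAccession() and may classify IDs differently.

```cpp
#include <cctype>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the GI list that collects both kinds of ID.
struct ResolvedIds {
    std::vector<std::int64_t> gis;      // IDs recognized as GIs
    std::vector<std::string>  seq_ids;  // all other IDs, kept as strings
};

// Assumption for this sketch: a GI is a non-empty, all-digit string that is
// short enough (18 digits) to fit in a signed 64-bit integer.
bool LooksLikeGi(const std::string& s)
{
    if (s.empty() || s.size() > 18) {
        return false;
    }
    for (unsigned char c : s) {
        if (!std::isdigit(c)) {
            return false;
        }
    }
    return true;
}

// Sorts each input string into the GI list or the string-ID list.
ResolvedIds ClassifyIds(const std::vector<std::string>& ids)
{
    ResolvedIds out;
    for (const std::string& id : ids) {
        if (LooksLikeGi(id)) {
            out.gis.push_back(std::stoll(id));
        } else {
            out.seq_ids.push_back(id);
        }
    }
    return out;
}
```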
Resolve an ID remotely.
This method looks up the given ID via remote services in order to find an ID for the most up-to-date version of the sequence. The remote service will return a list of Seq-ids; if at least one of these is a GI, that will be returned in `gi'. If no GI is found, but at least one of the returned IDs is of the same type as the input Seq-id, the version number of the input Seq-id will be updated.
seqid | Sequence identifier to look up remotely. [in|out] |
gi | GI if one is found, otherwise 0. [out] |
Definition at line 65 of file build_db.cpp.
References debug_mode, CSeq_id::GetTextseq_Id(), CSeq_id_Base::IsGi(), CTextseq_id_Base::IsSetVersion(), ITERATE, m_LogFile, NULL, CRef< C, Locker >::Reset(), CSeq_id_Base::Which(), x_GetScope(), and ZERO_GI.
Referenced by x_ResolveGis().
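The decision made on the list returned by the remote service can be sketched as follows. SimpleId and ResolveFromRemoteIds are hypothetical, heavily simplified stand-ins (a real Seq-id is a CSeq_id object, not a struct of strings), and the choice of the first same-type ID as the source of the version number is an assumption of the sketch.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified Seq-id used only for this illustration.
struct SimpleId {
    std::string  type;       // e.g. "gi", "ref", "gb"
    std::int64_t gi;         // meaningful only when type == "gi"
    std::string  accession;  // meaningful for text IDs
    int          version;    // 0 if no version is set
};

// Returns the first GI found in `remote_ids`. If there is none, copies the
// version of the first remote ID whose type matches `query` into `query`
// and returns 0.
std::int64_t ResolveFromRemoteIds(SimpleId& query,
                                  const std::vector<SimpleId>& remote_ids)
{
    for (const SimpleId& remote : remote_ids) {
        if (remote.type == "gi") {
            return remote.gi;  // a GI takes precedence over everything else
        }
    }
    for (const SimpleId& remote : remote_ids) {
        if (remote.type == query.type) {
            query.version = remote.version;  // adopt the up-to-date version
            break;
        }
    }
    return 0;
}
```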
|
private |
Store leaf taxids in provided headers.
headers | These deflines will be modified. [in|out] |
|
private |
Store linkout (now deprecated) and membership bits in provided headers.
Each Seq-id found in each defline in the provided headers will be looked up in the set of linkout and membership bits provided for building this database, and the appropriate bits will be set for each defline.
headers | These deflines will be modified. [in|out] |
Definition at line 1563 of file build_db.cpp.
References GetDeflineKeys(), m_Id2Leafs, m_Id2Mbits, m_KeepLeafs, m_KeepMbits, NON_CONST_ITERATE, s_SetDeflineBits(), and s_SetDeflineLeafs().
Referenced by x_DupLocal(), and x_EditHeaders().
|
private |
Defline count.
Definition at line 644 of file build_db.hpp.
Referenced by Build(), x_DupLocal(), and x_EditAndAddBioseq().
|
private |
If true, there were sequences whose IDs matched those in the provided masking locations (via SetMaskDataSource).
Used to display a warning if no such match occurred.
Definition at line 661 of file build_db.hpp.
Referenced by AddSequences(), x_AddMasksForSeqId(), and ~CBuildDatabase().
|
private |
Table of leaf taxids to apply to sequences.
Definition at line 617 of file build_db.hpp.
Referenced by SetLeafTaxIds(), and x_SetLinkAndMbit().
|
private |
Table of linkout bits to apply to sequences.
DEPRECATED
Definition at line 605 of file build_db.hpp.
Referenced by SetLinkouts().
|
private |
Table of membership bits to apply to sequences.
Definition at line 611 of file build_db.hpp.
Referenced by SetMembBits(), and x_SetLinkAndMbit().
|
private |
True for a protein database, false for nucleotide.
Definition at line 597 of file build_db.hpp.
Referenced by AddFasta(), AddSequences(), and SetSourceDb().
|
private |
True to keep leaf taxids from source dbs, false to discard.
Definition at line 614 of file build_db.hpp.
Referenced by SetLeafTaxIds(), and x_SetLinkAndMbit().
|
private |
True to keep linkout bits from source dbs, false to discard.
DEPRECATED
Definition at line 601 of file build_db.hpp.
Referenced by SetLinkouts().
|
private |
True to keep membership bits from source dbs, false to discard.
Definition at line 608 of file build_db.hpp.
Referenced by SetMembBits(), and x_SetLinkAndMbit().
|
private |
Logfile.
Definition at line 638 of file build_db.hpp.
Referenced by AddIds(), AddSequences(), Build(), CBuildDatabase(), SetLeafTaxIds(), SetLinkouts(), SetMembBits(), SetSourceDb(), x_AddOneRemoteSequence(), x_AddRemoteSequences(), x_DupLocal(), x_EndBuild(), x_ReportUnresolvedIds(), x_ResolveGis(), and x_ResolveRemoteId().
|
private |
If true, use long sequence IDs (database|accession).
Definition at line 656 of file build_db.hpp.
Referenced by AddFasta(), AddSequences(), CBuildDatabase(), and x_EditAndAddBioseq().
|
private |
Subject masking data.
Definition at line 635 of file build_db.hpp.
Referenced by AddSequences(), SetMaskDataSource(), x_AddMasksForSeqId(), and ~CBuildDatabase().
|
private |
Object manager, used for remote fetching.
Definition at line 620 of file build_db.hpp.
Referenced by x_GetScope().
|
private |
Number of OIDs stored in this database.
Definition at line 647 of file build_db.hpp.
Referenced by Build(), x_DupLocal(), and x_EditAndAddBioseq().
Database being produced here.
Definition at line 629 of file build_db.hpp.
Referenced by AddSequences(), CBuildDatabase(), EndBuild(), RegisterMaskingAlgorithm(), SetMaskLetters(), SetMaxFileSize(), x_AddMasksForSeqId(), x_AddPig(), x_DupLocal(), x_EditAndAddBioseq(), and x_EndBuild().
|
private |
Definition at line 669 of file build_db.hpp.
Referenced by CBuildDatabase(), and GetOutputDbName().
|
private |
If true, string IDs found in FASTA input will be parsed as Seq-ids.
Definition at line 653 of file build_db.hpp.
Referenced by AddFasta(), CBuildDatabase(), and x_EditAndAddBioseq().
|
private |
Definition at line 671 of file build_db.hpp.
Referenced by x_EditAndAddBioseq().
|
private |
Sequence scope, used for remote fetching.
Definition at line 623 of file build_db.hpp.
Referenced by x_GetScope().
|
private |
If set to true, when copying BLASTDBs, skip the GIs.
Definition at line 664 of file build_db.hpp.
Referenced by SetSkipCopyingGis(), and x_EditHeaders().
|
private |
If set to true, skip GIs with value > 0x7FFFFFFF.
Definition at line 667 of file build_db.hpp.
Referenced by AddSequences().
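The threshold 0x7FFFFFFF is the largest value of a signed 32-bit integer, so the flag effectively drops GIs that need more than 32 bits. A minimal sketch of such a filter, with DropLargeGis as a hypothetical helper that is not part of the class:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the GIs to keep. When skip_large is true,
// GIs greater than 0x7FFFFFFF (the signed 32-bit maximum) are dropped.
std::vector<std::int64_t> DropLargeGis(const std::vector<std::int64_t>& gis,
                                       bool skip_large)
{
    const std::int64_t kMaxSigned32 = 0x7FFFFFFF;
    std::vector<std::int64_t> kept;
    for (std::int64_t gi : gis) {
        if (skip_large && gi > kMaxSigned32) {
            continue;  // too large for a signed 32-bit GI field
        }
        kept.push_back(gi);
    }
    return kept;
}
```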
|
private |
Database for duplicating sequences locally (-sourcedb option).
Definition at line 632 of file build_db.hpp.
Referenced by AddIds(), SetSourceDb(), x_DupLocal(), x_ResolveFromSource(), and x_ResolveGis().
Set of TaxIDs configured to apply to sequences.
Definition at line 626 of file build_db.hpp.
Referenced by SetTaxids(), x_DupLocal(), x_EditHeaders(), and ~CBuildDatabase().
|
private |
Whether to use remote resolution and sequence fetching.
Definition at line 641 of file build_db.hpp.
Referenced by AddIds(), SetUseRemote(), and x_ResolveGis().
|
private |
If true, more detailed log messages will be produced.
Definition at line 650 of file build_db.hpp.
Referenced by AddIds(), AddSequences(), SetVerbosity(), x_AddRemoteSequences(), and x_ReportUnresolvedIds().