NCBI C++ ToolKit
CWriteDB Class Reference

CWriteDB.

#include <objtools/blast/seqdb_writer/writedb.hpp>


Public Types

enum  ESeqType { eProtein = 0 , eNucleotide = 1 }
 Sequence types.
 
enum  EIndexType {
  eNoIndex = 0 , eSparseIndex = 0x1 , eFullIndex = 0x2 , eAddTrace = 0x4 ,
  eFullWithTrace = eFullIndex | eAddTrace , eDefault = eFullIndex | eAddTrace , eAddHash = 0x100
}
 Whether and what kind of indices to build.
 
typedef int TIndexType
 Bitwise OR of "EIndexType".
 
- Public Types inherited from CObject
enum  EAllocFillMode { eAllocFillNone = 1 , eAllocFillZero , eAllocFillPattern }
 Control filling of newly allocated memory.
 
typedef CObjectCounterLocker TLockerType
 Default locker type for CRef.
 
typedef atomic< Uint8 > TCounter
 Counter type is CAtomicCounter.
 
typedef Uint8 TCount
 Alias for value type of counter.
 

Public Member Functions

 CWriteDB (const string &dbname, ESeqType seqtype, const string &title, int itype=eDefault, bool parse_ids=true, bool long_ids=false, bool use_gi_mask=false, EBlastDbVersion dbver=eBDB_Version4, bool limit_defline=false, Uint8 oid_masks=EOidMaskType::fNone, bool scan_bioseq_4_cfastareader_usrobj=false)
 Constructor.
 
 ~CWriteDB ()
 Destructor.
 
void AddSequence (const CBioseq &bs)
 Add a sequence as a CBioseq.
 
void AddSequence (const CBioseq &bs, CSeqVector &sv)
 Add a sequence as a CBioseq.
 
void AddSequence (const CBioseq_Handle &bsh)
 Add a sequence as a CBioseq.
 
void AddSequence (const CTempString &sequence, const CTempString &ambiguities="")
 Add a sequence as raw data.
 
void SetPig (int pig)
 Set the PIG to be used for the sequence.
 
void SetDeflines (const CBlast_def_line_set &deflines)
 Set the deflines to be used for the sequence.
 
int RegisterMaskAlgorithm (EBlast_filter_program program, const string &options=string(), const string &name=string())
 Register a type of filtering data found in this database.
 
int RegisterMaskAlgorithm (const string &id, const string &description=string(), const string &options=string())
 Register a type of filtering data found in this database.
 
void SetMaskData (const CMaskedRangesVector &ranges, const vector< TGi > &gis)
 Set filtering data for a sequence.
 
void ListVolumes (vector< string > &vols)
 List Volumes.
 
void ListFiles (vector< string > &files)
 List Filenames.
 
void Close ()
 Close the Database.
 
void SetMaxFileSize (Uint8 sz)
 Set maximum size for output files.
 
void SetMaxVolumeLetters (Uint8 letters)
 Set maximum letters for output volumes.
 
void SetMaskedLetters (const string &masked)
 Set letters that should not be used in sequences.
 
int FindColumn (const string &title) const
 Find an existing column.
 
int CreateUserColumn (const string &title)
 Set up a user-defined CWriteDB column.
 
void AddColumnMetaData (int col_id, const string &key, const string &value)
 Add meta data to a user-defined column.
 
CBlastDbBlob & SetBlobData (int column_id)
 Add blob data to a user-defined column.
 
- Public Member Functions inherited from CObject
 CObject (void)
 Constructor.
 
 CObject (const CObject &src)
 Copy constructor.
 
virtual ~CObject (void)
 Destructor.
 
CObject & operator= (const CObject &src) THROWS_NONE
 Assignment operator.
 
bool CanBeDeleted (void) const THROWS_NONE
 Check if object can be deleted.
 
bool IsAllocatedInPool (void) const THROWS_NONE
 Check if object is allocated in memory pool (not system heap).
 
bool Referenced (void) const THROWS_NONE
 Check if object is referenced.
 
bool ReferencedOnlyOnce (void) const THROWS_NONE
 Check if object is referenced only once.
 
void AddReference (void) const
 Add reference to object.
 
void RemoveReference (void) const
 Remove reference to object.
 
void ReleaseReference (void) const
 Remove reference without deleting object.
 
virtual void DoNotDeleteThisObject (void)
 Mark this object as not allocated in heap – do not delete this object.
 
virtual void DoDeleteThisObject (void)
 Mark this object as allocated in heap – object can be deleted.
 
void * operator new (size_t size)
 Define new operator for memory allocation.
 
void * operator new[] (size_t size)
 Define new[] operator for 'array' memory allocation.
 
void operator delete (void *ptr)
 Define delete operator for memory deallocation.
 
void operator delete[] (void *ptr)
 Define delete[] operator for memory deallocation.
 
void * operator new (size_t size, void *place)
 Define new operator.
 
void operator delete (void *ptr, void *place)
 Define delete operator.
 
void * operator new (size_t size, CObjectMemoryPool *place)
 Define new operator using memory pool.
 
void operator delete (void *ptr, CObjectMemoryPool *place)
 Define delete operator.
 
virtual void DebugDump (CDebugDumpContext ddc, unsigned int depth) const
 Define method for dumping debug information.
 
- Public Member Functions inherited from CDebugDumpable
 CDebugDumpable (void)
 
virtual ~CDebugDumpable (void)
 
void DebugDumpText (ostream &out, const string &bundle, unsigned int depth) const
 
void DebugDumpFormat (CDebugDumpFormatter &ddf, const string &bundle, unsigned int depth) const
 
void DumpToConsole (void) const
 

Static Public Member Functions

static CRef< CBlast_def_line_set > ExtractBioseqDeflines (const CBioseq &bs, bool parse_ids=true, bool long_ids=false, bool scan_bioseq_4_cfastareader_usrobj=false)
 Extract Deflines From Bioseq.
 
- Static Public Member Functions inherited from CObject
static NCBI_XNCBI_EXPORT void ThrowNullPointerException (void)
 Define method to throw null pointer exception.
 
static NCBI_XNCBI_EXPORT void ThrowNullPointerException (const type_info &type)
 
static EAllocFillMode GetAllocFillMode (void)
 
static void SetAllocFillMode (EAllocFillMode mode)
 
static void SetAllocFillMode (const string &value)
 Set mode from configuration parameter value.
 
- Static Public Member Functions inherited from CDebugDumpable
static void EnableDebugDump (bool on)
 

Protected Attributes

CWriteDB_Impl * m_Impl
 Implementation object.
 

Additional Inherited Members

- Static Public Attributes inherited from CObject
static const TCount eCounterBitsCanBeDeleted = 1 << 0
 Define possible object states.
 
static const TCount eCounterBitsInPlainHeap = 1 << 1
 Heap signature was found.
 
static const TCount eCounterBitsPlaceMask
 Mask for 'in heap' state flags.
 
static const int eCounterStep = 1 << 2
 Skip over the "in heap" bits.
 
static const TCount eCounterValid = TCount(1) << (sizeof(TCount) * 8 - 2)
 Minimal value for valid objects (reference counter is zero). Must be a single bit value.
 
static const TCount eCounterStateMask
 Valid object, and object in heap.
 
- Protected Member Functions inherited from CObject
virtual void DeleteThis (void)
 Virtual method "deleting" this object.
 

Detailed Description

CWriteDB.

User interface class for BLAST databases.

This class is the top-level interface for BLAST database users. It provides access to the database by calling methods on objects that represent the various database files, such as the index, header, sequence, and alias files.

Definition at line 91 of file writedb.hpp.

Member Typedef Documentation

◆ TIndexType

Bitwise OR of "EIndexType".

Definition at line 128 of file writedb.hpp.

Member Enumeration Documentation

◆ EIndexType

Whether and what kind of indices to build.

Enumerator
eNoIndex 

Build a database without any indices.

eSparseIndex 

Use only simple accessions in the string index.

eFullIndex 

Use several forms of each Seq-id in the string index.

eAddTrace 

OR this in to add an index for trace IDs.

eFullWithTrace 

Like eFullIndex but also build a numeric Trace ID index.

eDefault 

Like eFullIndex but also build a numeric Trace ID index.

eAddHash 

Add an index from sequence hash to OID.

Definition at line 104 of file writedb.hpp.
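
Because TIndexType is a bitwise OR of EIndexType values, the flags can be combined and tested with ordinary bit operations. The following is a minimal standalone sketch: the enum is a local copy of the values listed above so that the snippet compiles without the toolkit (real code would use CWriteDB::EIndexType from writedb.hpp), and HasIndexFlag and DescribeIndexType are hypothetical helpers, not toolkit functions.

```cpp
#include <string>

// Local mirror of the values documented for CWriteDB::EIndexType.
// Assumption: copied here only so the sketch builds without the toolkit.
enum EIndexType {
    eNoIndex       = 0,
    eSparseIndex   = 0x1,
    eFullIndex     = 0x2,
    eAddTrace      = 0x4,
    eFullWithTrace = eFullIndex | eAddTrace,
    eDefault       = eFullIndex | eAddTrace,
    eAddHash       = 0x100
};
typedef int TIndexType;  // bitwise OR of EIndexType values

// Hypothetical helper: true if every bit of 'flag' is set in 'itype'.
// eNoIndex has no bits, so it is never reported as "set".
inline bool HasIndexFlag(TIndexType itype, EIndexType flag)
{
    return flag != eNoIndex && (itype & flag) == flag;
}

// Hypothetical helper: human-readable summary of the requested indices.
std::string DescribeIndexType(TIndexType itype)
{
    if (itype == eNoIndex) {
        return "none";
    }
    std::string out;
    auto add = [&out](const char* name) {
        if (!out.empty()) out += "+";
        out += name;
    };
    if (HasIndexFlag(itype, eSparseIndex)) add("sparse");
    if (HasIndexFlag(itype, eFullIndex))   add("full");
    if (HasIndexFlag(itype, eAddTrace))    add("trace");
    if (HasIndexFlag(itype, eAddHash))     add("hash");
    return out;
}
```

With these definitions, a value such as eFullIndex | eAddHash would request the full string index plus the hash index, while eDefault is the same bit pattern as eFullWithTrace.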

◆ ESeqType

Sequence types.

Enumerator
eProtein 

Protein database.

eNucleotide 

Nucleotide database.

Definition at line 95 of file writedb.hpp.

Constructor & Destructor Documentation

◆ CWriteDB()

CWriteDB::CWriteDB ( const string & dbname,
ESeqType  seqtype,
const string & title,
int  itype = eDefault,
bool  parse_ids = true,
bool  long_ids = false,
bool  use_gi_mask = false,
EBlastDbVersion  dbver = eBDB_Version4,
bool  limit_defline = false,
Uint8  oid_masks = EOidMaskType::fNone,
bool  scan_bioseq_4_cfastareader_usrobj = false 
)

Constructor.

Starts construction of a BLAST database.

Parameters
dbname  A list of database or alias names, separated by spaces. [in]
seqtype  Specify eProtein or eNucleotide. [in]
title  The database title. [in]
itype  Indicates the type of indices to build if specified. [in]
parse_ids  If true, generate ISAM files. [in]
long_ids  If true, assume long sequence ids (database|accession) when parsing string ids. [in]
use_gi_mask  If true, generate GI-based mask files. [in]
dbver  Version of BLAST database to generate. [in]
scan_bioseq_4_cfastareader_usrobj  If true, scan the Bioseq objects for a CFastaReader-created User-object containing a defline. [in]

Definition at line 49 of file writedb.cpp.

References dbname(), eProtein, and m_Impl.

◆ ~CWriteDB()

CWriteDB::~CWriteDB ( )

Destructor.

This will return resources acquired by this object, and call Close() if it has not already been called.

Definition at line 74 of file writedb.cpp.

References m_Impl.

Member Function Documentation

◆ AddColumnMetaData()

void CWriteDB::AddColumnMetaData ( int  col_id,
const string & key,
const string & value 
)

Add meta data to a user-defined column.

In addition to normal blob data, database columns can store a `dictionary' of user-defined metadata in key/value form. This method adds one such key/value pair to the column. Specifying a key a second time causes replacement of the previous value. Using this mechanism to store large amounts of data may have a negative impact on performance.

Parameters
col_id  Specifies the column to add this metadata to.
key  A unique key string.
value  A value string.

Definition at line 185 of file writedb.cpp.

References CWriteDB_Impl::AddColumnMetaData() and m_Impl.

Referenced by CBuildDatabase::AddSequences().
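
The replace-on-repeat behaviour described for column metadata matches that of an ordinary string-to-string map. The sketch below models only that semantics; ColumnMetaData is a hypothetical stand-in written for illustration and does not reflect how the toolkit stores the dictionary.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for one column's metadata dictionary.
// Not a toolkit type; it only models the documented key/value semantics.
class ColumnMetaData {
public:
    // Adds a key/value pair; a repeated key replaces the earlier value.
    void Add(const std::string& key, const std::string& value)
    {
        m_Data[key] = value;
    }

    // Returns the stored value, or an empty string if the key is absent.
    std::string Get(const std::string& key) const
    {
        auto it = m_Data.find(key);
        return it == m_Data.end() ? std::string() : it->second;
    }

    // Number of distinct keys currently stored.
    std::size_t Size() const { return m_Data.size(); }

private:
    std::map<std::string, std::string> m_Data;
};
```

Adding the same key twice leaves a single entry holding the second value, which is the behaviour the description above attributes to AddColumnMetaData().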

◆ AddSequence() [1/4]

void CWriteDB::AddSequence ( const CBioseq & bs)

Add a sequence as a CBioseq.

This adds the sequence data in the specified CBioseq to the database. If the CBioseq contains deflines, they will also be used unless there is a call to SetDeflines() or AddDefline(). Note that the CBioseq will be held by CWriteDB at least until the next sequence is provided. If this method is used, the CBioseq is expected to contain sequence data accessible via GetInst().GetSeq_data(). If this might not be true, it may be better to use the version of this function that also takes a CSeqVector.

Parameters
bs  The sequence and related data as a CBioseq. [in]

Definition at line 79 of file writedb.cpp.

References CWriteDB_Impl::AddSequence(), and m_Impl.

Referenced by CBuildDatabase::AddSequences(), BOOST_AUTO_TEST_CASE(), s_DupIdsBioseq(), s_DupIdsRaw(), CBuildDatabase::x_DupLocal(), CBuildDatabase::x_EditAndAddBioseq(), and CMakeProfileDBApp::x_MakeVol().

◆ AddSequence() [2/4]

void CWriteDB::AddSequence ( const CBioseq & bs,
CSeqVector & sv 
)

Add a sequence as a CBioseq.

This adds the sequence data in the specified CSeqVector, and the meta data in the specified CBioseq, to the database. If the CBioseq contains deflines, they will also be used unless there is a call to SetDeflines() or AddDefline(). Note that the CBioseq will be held by CWriteDB at least until the next sequence is provided. This version will use the CSeqVector if the sequence data is not found in the CBioseq.

Parameters
bs  A CBioseq containing meta data for the sequence. [in]
sv  The sequence data for the sequence. [in]

Definition at line 89 of file writedb.cpp.

References CWriteDB_Impl::AddSequence(), and m_Impl.

◆ AddSequence() [3/4]

void CWriteDB::AddSequence ( const CBioseq_Handle & bsh)

Add a sequence as a CBioseq.

This adds the sequence found in the given CBioseq_Handle to the database.

Parameters
bsh  The sequence and related data as a CBioseq_Handle. [in]

Definition at line 84 of file writedb.cpp.

References CWriteDB_Impl::AddSequence(), and m_Impl.

◆ AddSequence() [4/4]

void CWriteDB::AddSequence ( const CTempString & sequence,
const CTempString & ambiguities = "" 
)

Add a sequence as raw data.

This adds a sequence provided as raw sequence data. The raw data must be (and is assumed to be) encoded correctly for the format of database being produced. For protein databases, the ambiguities string should be empty (and is thus optional). If this version of AddSequence() is used, the user must also provide one or more deflines with SetDeflines() or AddDefline() calls.

Parameters
sequence  The sequence data as a string of bytes. [in]
ambiguities  The ambiguity data as a string of bytes. [in]

Definition at line 109 of file writedb.cpp.

References a, CWriteDB_Impl::AddSequence(), ambig(), CTempString::data(), CTempString::length(), and m_Impl.

◆ Close()

void CWriteDB::Close ( void  )

Close the Database.

Flush all data to disk and close any open files.

Definition at line 104 of file writedb.cpp.

References CWriteDB_Impl::Close(), and m_Impl.

Referenced by BOOST_AUTO_TEST_CASE(), CBuildDatabase::EndBuild(), s_DupSequencesTest(), and CMakeProfileDBApp::x_MakeVol().

◆ CreateUserColumn()

int CWriteDB::CreateUserColumn ( const string & title)

Set up a user-defined CWriteDB column.

This method creates a user-defined column associated with this database. The column is indexed by OID and contains arbitrary binary data, which is applied using the SetBlobData method below. The `title' parameter identifies the column and must be unique within this database. Because tables are accessed by title, it is not necessary to permanently associate file extensions with specific purposes or data types. The return value of this method is an integer that identifies this column for the purpose of inserting blob data. (The number of columns allowed is currently limited due to the file naming scheme, but some columns are used for built-in purposes.)

Parameters
title  Name identifying this column.
Returns
Column identifier (a positive integer).

Definition at line 180 of file writedb.cpp.

References CWriteDB_Impl::CreateColumn(), and m_Impl.

Referenced by CBuildDatabase::AddSequences().

◆ ExtractBioseqDeflines()

CRef< CBlast_def_line_set > CWriteDB::ExtractBioseqDeflines ( const CBioseq & bs,
bool  parse_ids = true,
bool  long_ids = false,
bool  scan_bioseq_4_cfastareader_usrobj = false 
)
static

Extract Deflines From Bioseq.

Deflines are extracted from the CBioseq and returned to the user. The caller can then modify or inspect the deflines, and apply them to a sequence with SetDeflines().

Parameters
bs  The bioseq from which to extract a defline set. [in]
parse_ids  Whether the Seq-id should be parsed. [in]
long_ids  If true, use long sequence ids (database|accession). [in]
scan_bioseq_4_cfastareader_usrobj  If true, scan the Bioseq objects for a CFastaReader-created User-object containing a defline. [in]
Returns
A set of deflines for this CBioseq.

Definition at line 129 of file writedb.cpp.

References CWriteDB_Impl::ExtractBioseqDeflines().

Referenced by BOOST_AUTO_TEST_CASE(), CBuildDatabase::x_EditAndAddBioseq(), and CMakeProfileDBApp::x_MakeVol().

◆ FindColumn()

int CWriteDB::FindColumn ( const string & title) const

Find an existing column.

This looks for an existing column with the specified title and returns the column ID if found.

Parameters
title  The column title to look for.
Returns
The column ID if this title is defined, otherwise -1.

Definition at line 175 of file writedb.cpp.

References CWriteDB_Impl::FindColumn(), and m_Impl.

Referenced by CBuildDatabase::AddSequences().
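
Since FindColumn() reports a missing title with -1 and column titles must be unique, a caller will often want a "find or create" step. The sketch below illustrates that lookup pattern against a toy registry. ColumnRegistry and FindOrCreate are hypothetical names, the 1-based IDs are a choice made for this sketch, and throwing on a duplicate title is an assumption rather than documented CWriteDB behaviour.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Toy model of a title -> column ID registry (hypothetical, not toolkit code).
class ColumnRegistry {
public:
    // Returns the column ID for 'title', or -1 if no such column exists.
    int Find(const std::string& title) const
    {
        for (std::size_t i = 0; i < m_Titles.size(); ++i) {
            if (m_Titles[i] == title) {
                return static_cast<int>(i) + 1;  // IDs are 1-based in this sketch
            }
        }
        return -1;
    }

    // Creates a new column and returns its positive ID.
    // Assumption: a duplicate title is rejected with an exception.
    int Create(const std::string& title)
    {
        if (Find(title) != -1) {
            throw std::invalid_argument("duplicate column title: " + title);
        }
        m_Titles.push_back(title);
        return static_cast<int>(m_Titles.size());
    }

private:
    std::vector<std::string> m_Titles;
};

// Looks the title up first and only creates the column when it is missing.
int FindOrCreate(ColumnRegistry& reg, const std::string& title)
{
    int id = reg.Find(title);
    return id != -1 ? id : reg.Create(title);
}
```

Against the real class, the same shape would call FindColumn() first and fall back to CreateUserColumn() only when the result is -1.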

◆ ListFiles()

void CWriteDB::ListFiles ( vector< string > &  files)

List Filenames.

Returns a list of the files constructed by this class; the returned list may not be complete until Close() has been called.

Parameters
files  The set of resolved database path names. [out]

Definition at line 146 of file writedb.cpp.

References CWriteDB_Impl::ListFiles(), and m_Impl.

Referenced by BOOST_AUTO_TEST_CASE(), s_DupSequencesTest(), s_WrapUpDb(), and CBuildDatabase::x_EndBuild().

◆ ListVolumes()

void CWriteDB::ListVolumes ( vector< string > &  vols)

List Volumes.

Returns the base names of all volumes constructed by this class; the returned list may not be complete until Close() has been called.

Parameters
vols  The set of volumes produced by this class. [out]

Definition at line 141 of file writedb.cpp.

References CWriteDB_Impl::ListVolumes(), and m_Impl.

Referenced by BOOST_AUTO_TEST_CASE(), and CBuildDatabase::x_EndBuild().

◆ RegisterMaskAlgorithm() [1/2]

int CWriteDB::RegisterMaskAlgorithm ( const string & id,
const string & description = string(),
const string & options = string() 
)

Register a type of filtering data found in this database.

Returns
algorithm ID for the filtering data.
Parameters
id  A string to identify the masking data. [in]
description  Details about the masking data. [in]
options  Algorithm options provided to the program. [in]

◆ RegisterMaskAlgorithm() [2/2]

int CWriteDB::RegisterMaskAlgorithm ( EBlast_filter_program  program,
const string & options = string(),
const string & name = string() 
)

Register a type of filtering data found in this database.

Returns
algorithm ID for the filtering data.
Parameters
program  Program used to produce this masking data. [in]
options  Algorithm options provided to the program. [in]
name  Name of the GI-based mask. [in]

Referenced by CBuildDatabase::RegisterMaskingAlgorithm().

◆ SetBlobData()

CBlastDbBlob & CWriteDB::SetBlobData ( int  column_id)

Add blob data to a user-defined column.

To add data to a user-defined blob column, call this method, providing the column handle. A blob object will be returned; the user data should be stored in this object. The data can be stored any time up to the next call to an `AddSequence' method (just as with any other per-sequence data), but accessing the returned object after that point is incorrect and has undefined consequences.

Parameters
column_id  Identifier for a user-defined column.
Returns
Blob data should be written to this object.

Definition at line 190 of file writedb.cpp.

References m_Impl, and CWriteDB_Impl::SetBlobData().

Referenced by CBuildDatabase::AddSequences().

◆ SetDeflines()

void CWriteDB::SetDeflines ( const CBlast_def_line_set & deflines)

Set the deflines to be used for the sequence.

This method sets all the deflines at once as a complete set, overriding any deflines provided by AddSequence(). If this method is used with the CBioseq version of AddSequence, it replaces the deflines found in the CBioseq.

Parameters
deflines  Deflines to use for this sequence. [in]

Definition at line 94 of file writedb.cpp.

References m_Impl, and CWriteDB_Impl::SetDeflines().

Referenced by CBuildDatabase::AddSequences(), BOOST_AUTO_TEST_CASE(), s_DupIdsBioseq(), s_DupIdsRaw(), CBuildDatabase::x_DupLocal(), CBuildDatabase::x_EditAndAddBioseq(), and CMakeProfileDBApp::x_MakeVol().

◆ SetMaskData()

void CWriteDB::SetMaskData ( const CMaskedRangesVector & ranges,
const vector< TGi > &  gis 
)

Set filtering data for a sequence.

This method specifies filtered regions for this sequence. A sequence may have filtering data from one or more algorithms. For each algorithm_id value specified in ranges, a description should be added to the database using RegisterMaskAlgorithm(). This must be done before the first call to SetMaskData() that uses the algorithm id for a non-empty offset range list.

Parameters
ranges  Filtered ranges for this sequence and algorithm.
gis  GIs associated with this sequence.

Definition at line 169 of file writedb.cpp.

References m_Impl, and CWriteDB_Impl::SetMaskData().

Referenced by CBuildDatabase::AddSequences(), and CBuildDatabase::x_AddMasksForSeqId().

◆ SetMaskedLetters()

void CWriteDB::SetMaskedLetters ( const string & masked)

Set letters that should not be used in sequences.

This method specifies letters that should not be used in the resulting database. The masked letters are expected to be specified in an IUPAC (alphabetic) encoding, and will be replaced by 'X' (for protein) when the sequences are packed. This method should be called before any sequences are added. This method only works with protein (the motivating case cannot happen with nucleotide).

Parameters
masked  Letters to exclude. [in]

Definition at line 136 of file writedb.cpp.

References m_Impl, and CWriteDB_Impl::SetMaskedLetters().

Referenced by CBuildDatabase::SetMaskLetters().
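
The effect of masking on a protein sequence can be pictured as a simple character substitution. The sketch below works on an IUPAC text string and is only an illustration: ReplaceMaskedLetters is a hypothetical helper, the comparison is case-sensitive by assumption, and the real class performs the replacement while packing sequences into the database encoding rather than on text.

```cpp
#include <string>

// Hypothetical illustration: returns a copy of 'sequence' in which every
// letter that appears in 'masked' has been replaced by 'X'.
// Assumption: letters are compared exactly (case-sensitive).
std::string ReplaceMaskedLetters(const std::string& sequence,
                                 const std::string& masked)
{
    std::string out = sequence;
    for (char& c : out) {
        if (masked.find(c) != std::string::npos) {
            c = 'X';
        }
    }
    return out;
}
```

An empty mask string leaves the sequence unchanged, which corresponds to never calling SetMaskedLetters().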

◆ SetMaxFileSize()

void CWriteDB::SetMaxFileSize ( Uint8  sz)

Set maximum size for output files.

The provided size is applied as a limit on the size of output files. If adding a sequence would cause any output file to exceed this size, the volume is closed and a new volume is started (unless the current volume is empty, in which case the size limit is ignored and a one-sequence volume is created). The default value is 2^30-1. There is also a hard limit required by the database format.

Parameters
sz  Maximum size in bytes of any volume component file. [in]

Definition at line 118 of file writedb.cpp.

References m_Impl, and CWriteDB_Impl::SetMaxFileSize().

Referenced by CBuildDatabase::CBuildDatabase(), CBuildDatabase::SetMaxFileSize(), and CMakeProfileDBApp::x_InitOutputDb().

◆ SetMaxVolumeLetters()

void CWriteDB::SetMaxVolumeLetters ( Uint8  letters)

Set maximum letters for output volumes.

The provided size is applied as a limit on the size of output volumes. If adding a sequence would cause a volume to exceed this many protein or nucleotide letters (*not* bytes), the volume is closed and a new volume is started (unless the volume is currently empty). There is no default, but there is a hard limit required by the format definition. Ambiguity encoding is not counted toward this limit.

Parameters
letters  Maximum letters to pack in one volume. [in]

Definition at line 123 of file writedb.cpp.

References m_Impl, and CWriteDB_Impl::SetMaxVolumeLetters().

Referenced by BOOST_AUTO_TEST_CASE().
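
The volume rollover rule shared by SetMaxFileSize() and SetMaxVolumeLetters() (start a new volume when adding a sequence would exceed the limit, unless the current volume is empty) can be sketched as a small standalone function. SplitIntoVolumes is a hypothetical helper that models only the letter-count limit; it ignores file sizes, ambiguity data, and the format's hard limits.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of the rollover rule: given the length in letters of
// each sequence (in insertion order) and a per-volume letter limit, returns
// how many sequences end up in each volume.
std::vector<std::size_t> SplitIntoVolumes(
    const std::vector<std::uint64_t>& seq_lengths,
    std::uint64_t max_letters)
{
    std::vector<std::size_t> counts;   // sequences per closed volume
    std::uint64_t letters = 0;         // letters in the current volume
    std::size_t   n = 0;               // sequences in the current volume

    for (std::uint64_t len : seq_lengths) {
        // Close the volume only if it already holds something; an empty
        // volume accepts the sequence even when it is over the limit.
        if (n > 0 && letters + len > max_letters) {
            counts.push_back(n);
            letters = 0;
            n = 0;
        }
        letters += len;
        ++n;
    }
    if (n > 0) {
        counts.push_back(n);
    }
    return counts;
}
```

For example, with a limit of 8 letters and sequence lengths 4, 4, 4, 10, 1, the first two sequences share a volume, the third starts a new one, the oversized fourth gets a one-sequence volume of its own, and the fifth starts another.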

◆ SetPig()

void CWriteDB::SetPig ( int  pig)

Set the PIG to be used for the sequence.

For proteins, this sets the PIG of the protein sequence.

Parameters
pig  PIG identifier as an integer. [in]

Definition at line 99 of file writedb.cpp.

References m_Impl, and CWriteDB_Impl::SetPig().

Referenced by BOOST_AUTO_TEST_CASE(), and CBuildDatabase::x_AddPig().

Member Data Documentation

◆ m_Impl

CWriteDB_Impl* CWriteDB::m_Impl
protected

The documentation for this class was generated from the following files:
Modified on Tue Jul 16 13:21:32 2024 by modify_doxy.py rev. 669887