NCBI C++ ToolKit
writedb_files.cpp
/*  $Id: writedb_files.cpp 96690 2022-04-28 11:08:55Z fongah2 $
 * ===========================================================================
 *
 *                            PUBLIC DOMAIN NOTICE
 *               National Center for Biotechnology Information
 *
 *  This software/database is a "United States Government Work" under the
 *  terms of the United States Copyright Act.  It was written as part of
 *  the author's official duties as a United States Government employee and
 *  thus cannot be copyrighted.  This software/database is freely available
 *  to the public for use. The National Library of Medicine and the U.S.
 *  Government have not placed any restriction on its use or reproduction.
 *
 *  Although all reasonable efforts have been taken to ensure the accuracy
 *  and reliability of the software and data, the NLM and the U.S.
 *  Government do not and cannot warrant the performance or results that
 *  may be obtained by using this software or data. The NLM and the U.S.
 *  Government disclaim all warranties, express or implied, including
 *  warranties of performance, merchantability or fitness for any particular
 *  purpose.
 *
 *  Please cite the author in any work or product based on this material.
 *
 * ===========================================================================
 *
 * Author:  Kevin Bealer
 *
 */

/// @file writedb_files.cpp
/// Implementation for the CWriteDB_File class and its derived
/// file classes for WriteDB.
#include <ncbi_pch.hpp>
#include <serial/objistr.hpp>
#include <serial/objostr.hpp>
#include <serial/serial.hpp>
#include <iostream>
#include <sstream>
#include <cmath>

BEGIN_NCBI_SCOPE

/// Use standard C++ definitions.
USING_SCOPE(std);

// Blast Database Format Notes (version 4).
// (See below for version 5.)
//
// Integers are 4 bytes stored in big endian format, except for the
// volume length. The volume length is 8 bytes, but is stored in a
// little endian byte order (reason unknown).
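
The two integer encodings described above can be sketched as follows. This is a minimal illustration, not the ToolKit's own `s_WriteInt4` or `s_WriteInt8LE`; the helper names `EncodeInt4BE`, `DecodeInt4BE`, and `EncodeInt8LE` are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode a 32-bit value as 4 bytes, most
// significant byte first (big endian).
std::string EncodeInt4BE(std::uint32_t x)
{
    std::string out(4, '\0');
    for (int i = 0; i < 4; ++i) {
        out[i] = static_cast<char>((x >> (8 * (3 - i))) & 0xFF);
    }
    return out;
}

// Hypothetical helper: decode 4 big endian bytes back into a value.
std::uint32_t DecodeInt4BE(const std::string & bytes)
{
    assert(bytes.size() == 4);
    std::uint32_t x = 0;
    for (unsigned char c : bytes) {
        x = (x << 8) | c;
    }
    return x;
}

// Hypothetical helper: encode a 64-bit value as 8 bytes, least
// significant byte first (little endian), as used for the volume length.
std::string EncodeInt8LE(std::uint64_t x)
{
    std::string out(8, '\0');
    for (int i = 0; i < 8; ++i) {
        out[i] = static_cast<char>((x >> (8 * i)) & 0xFF);
    }
    return out;
}
```

Shifting and masking, rather than copying the in-memory representation, keeps the output independent of the host byte order.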

// The 'standard' packing for strings in Blast DBs is as follows:
// 0..4:        length
// 4..4+length: string data
//
// The title string follows this rule, but the create date has an
// additional detail; if it does not end on an offset that is a
// multiple of 8 bytes, extra 'NUL' characters are added to bring it
// to a multiple of 8 bytes. The NUL characters are added after the
// string bytes, and the stored length of the string is increased to
// include them. After extracting the string, 0-7 NUL bytes will need
// to be stripped from the end of the string (if any are found).
//
// (If this were not done, the offsets in the file would be unaligned;
// on some architectures this could cause a performance penalty or
// other problems. On little endian architectures such as Intel, this
// penalty is always paid.)
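
The string packing and date padding rules above can be sketched as follows. This is an illustration under stated assumptions, not the ToolKit's `s_WriteString`: the helper names `PackString`, `PadDate`, and `StripNuls` are hypothetical, and `start_offset` is assumed to be the file offset at which the date's 4-byte length field begins.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: 4-byte big endian length, then the string bytes.
std::string PackString(const std::string & s)
{
    const std::uint32_t n = static_cast<std::uint32_t>(s.size());
    std::string out;
    for (int shift = 24; shift >= 0; shift -= 8) {
        out.push_back(static_cast<char>((n >> shift) & 0xFF));
    }
    out += s;
    return out;
}

// Hypothetical helper: append NUL bytes to 'date' until the packed
// string, starting at 'start_offset', ends on a multiple of 8.
std::string PadDate(const std::string & date, std::size_t start_offset)
{
    std::string padded = date;
    while ((start_offset + 4 + padded.size()) % 8 != 0) {
        padded.push_back('\0');
    }
    assert(padded.size() - date.size() < 8);
    return padded;
}

// Hypothetical helper: strip the trailing NUL padding after reading.
std::string StripNuls(const std::string & s)
{
    const std::size_t end = s.find_last_not_of('\0');
    return (end == std::string::npos) ? std::string() : s.substr(0, end + 1);
}
```

For example, with a 2-byte title in a version 4 index file the date field starts at offset 14, so a 4-byte date would end at offset 22 and receives 2 NUL bytes to end at 24.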

// INDEX FILE FORMAT, for "Blast DB Version 4"
//
// 0..4:          format version (Blast DB version, current is "4").
// 4..8:          seqtype (1 for protein or 0 for nucleotide).
// 8..N1:         title (string).
// N1..N2:        create date (string).
// N2..N2+4:      number of OIDs (#OIDS).
// N2+4..N2+12:   number of letters in volume. (note: 8 bytes)
// N2+12..N2+16:  maxlength (size of longest sequence in DB)
//
// N2+16..(end):  Array data
//
// Array data is 2 or 3 arrays of (#OIDS + 1) four byte integers.
// For protein, 2 arrays are used; for nucleotide, 3 are used.
//
// The first array is header offsets, the second array is sequence
// offsets, and the third (optional) array is offsets of ambiguity
// data. Each array has a final element which is the length of the
// file; this makes it possible to compute the last sequence's length
// without adding a special case.
//
// As shown, the total size of index header =
//    4*4 bytes         // 4 int fields (4 bytes each)
//  + 8 bytes           // 8 byte field
//  + 2*4 + strings     // 4 bytes length for each plus string data.
//  = (32 + strings), rounded up to nearest multiple of 8
//
// "strings" here refers to the unterminated length of both strings.
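
The reason each offset array carries (#OIDS + 1) elements can be sketched as follows. This is a minimal illustration; `EntryLength` is a hypothetical helper, and the result is the raw byte span between two consecutive offsets, which is an assumption about how a reader would use the array rather than code taken from the ToolKit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: byte span of entry 'oid' in an offset array
// that holds one extra trailing element (the end-of-file offset).
std::uint32_t EntryLength(const std::vector<std::uint32_t> & offsets,
                          std::size_t                        oid)
{
    // The trailing element guarantees offsets[oid + 1] exists for
    // every valid OID, including the last one.
    assert(oid + 1 < offsets.size());
    return offsets[oid + 1] - offsets[oid];
}
```

With header offsets {0, 10, 25, 40} describing three OIDs, the last entry spans 40 - 25 = 15 bytes, computed the same way as every other entry.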

// Blast Database Format Notes (version 5).
// (See above for version 4.)
//
// Integers are 4 bytes stored in big endian format, except for the
// volume length. The volume length is 8 bytes, but is stored in a
// little endian byte order (reason unknown).

// The 'standard' packing for strings in Blast DBs is as follows:
// 0..4:        length
// 4..4+length: string data
//
// The title string and LMDB string follow this rule, but the create
// date has an additional detail; if it does not end on an offset that
// is a multiple of 8 bytes, extra 'NUL' characters are added to bring
// it to a multiple of 8 bytes. The NUL characters are added after the
// string bytes, and the stored length of the string is increased to
// include them. After extracting the string, 0-7 NUL bytes will need
// to be stripped from the end of the string (if any are found).
//
// (If this were not done, the offsets in the file would be unaligned;
// on some architectures this could cause a performance penalty or
// other problems. On little endian architectures such as Intel, this
// penalty is always paid.)

// --------------------------------------------

// INDEX FILE FORMAT, for "Blast DB Version 5"
//
// 0..4:          format version (Blast DB version, current is "5").
// 4..8:          seqtype (1 for protein or 0 for nucleotide).
// 8..12:         this volume number (0 and up).
// 12..N1:        title (string).
// N1..N2:        name of LMDB database file (string)
// N2..N3:        create date (string).
// N3..N3+4:      number of OIDs (#OIDS).
// N3+4..N3+12:   number of letters in volume. (note: 8 bytes)
// N3+12..N3+16:  maxlength (size of longest sequence in DB)
//
// N3+16..(end):  Array data
//
// Array data is 2 or 3 arrays of (#OIDS + 1) four byte integers.
// For protein, 2 arrays are used; for nucleotide, 3 are used.
//
// The first array is header offsets, the second array is sequence
// offsets, and the third (optional) array is offsets of ambiguity
// data. Each array has a final element which is the length of the
// file; this makes it possible to compute the last sequence's length
// without adding a special case.
//
// As shown, the total size of index header =
//    5*4 bytes         // 5 int fields (4 bytes each)
//  + 8 bytes           // 8 byte field
//  + 3*4 + strings     // 4 bytes length for each plus string data.
//  = (40 + strings), rounded up to nearest multiple of 8
//
// "strings" here refers to the unterminated length of all three
// strings (title, LMDB file name, and create date).
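
The header size arithmetic for both format versions can be checked with a short sketch. The function names `RoundUp`, `HeaderSizeV4`, and `HeaderSizeV5` are hypothetical and simply restate the sums given in the comments above; they are not the ToolKit's `x_Overhead` or `s_RoundUp`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: round 'value' up to a multiple of 'blocksize'.
std::size_t RoundUp(std::size_t value, std::size_t blocksize)
{
    assert(blocksize > 0);
    return ((value + blocksize - 1) / blocksize) * blocksize;
}

// Version 4: 4 int fields, one 8 byte field, 2 string length prefixes.
std::size_t HeaderSizeV4(const std::string & title,
                         const std::string & date)
{
    return RoundUp(4 * 4 + 8 + 2 * 4 + title.size() + date.size(), 8);
}

// Version 5: 5 int fields, one 8 byte field, 3 string length prefixes.
std::size_t HeaderSizeV5(const std::string & title,
                         const std::string & lmdb_name,
                         const std::string & date)
{
    return RoundUp(5 * 4 + 8 + 3 * 4
                   + title.size() + lmdb_name.size() + date.size(), 8);
}
```

For a 2-byte title and a 4-byte date, version 4 gives 32 + 6 = 38, rounded up to 40; adding a 6-byte LMDB name in version 5 gives 40 + 12 = 52, rounded up to 56.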

CWriteDB_File::CWriteDB_File(const string & basename,
                             const string & extension,
                             int            index,
                             Uint8          max_file_size,
                             bool           always_create)
    : m_Created    (false),
      m_BaseName   (basename),
      m_Extension  (extension),
      m_Index      (index),
      m_Offset     (0),
      m_MaxFileSize(max_file_size)
{
    // Define number of usable bits in m_Offset. m_Offset is unsigned,
    // so no bit is deducted for a sign.
    // Define maximum allowed max_file_size.
#ifdef _DEBUG
    static const int MAX_OFFSET_BITS = (sizeof m_Offset * 8);
    static const Uint8 MAX_FILE_SIZE = ((Uint8) 1 << MAX_OFFSET_BITS);
#endif

    if (m_MaxFileSize == 0) {
        m_MaxFileSize = x_DefaultByteLimit();
    } else {
#ifdef _DEBUG
        _ASSERT(max_file_size <= MAX_FILE_SIZE);
#endif
    }

    m_Nul.resize(1);
    m_Nul[0] = (char) 0;

    m_UseIndex = (index >= 0);
    x_MakeFileName();

    if (always_create) {
        Create();
    }
}

void CWriteDB_File::Create()
{
    _ASSERT(! m_Created);
    m_Created = true;
    m_RealFile.open(m_Fname.c_str(), ios::out | ios::binary);
}

unsigned int CWriteDB_File::Write(const CTempString & data)
{
    // Define maximum allowed offset.
#ifdef _DEBUG
    // Define number of usable bits in m_Offset. m_Offset is unsigned,
    // so no bit is deducted for a sign.
    static const int MAX_OFFSET_BITS = (sizeof m_Offset * 8);
    static const Uint8 MAX_OFFSET = ((Uint8) 1 << MAX_OFFSET_BITS);
#endif

#ifdef _DEBUG
    _ASSERT(((Uint8) m_Offset + data.length()) <= MAX_OFFSET);
#endif
    m_RealFile.write(data.data(), data.length());

    m_Offset += data.length();
    return m_Offset;
}

unsigned int CWriteDB_File::Write(const char * data, int length)
{
    m_RealFile.write(data, length);

    m_Offset += length;
    return m_Offset;
}


string CWriteDB_File::MakeShortName(const string & base, int index)
{
    ostringstream fns;

    fns << base;
    fns << ".";
    fns << (index / 10);
    fns << (index % 10);

    return fns.str();
}
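
The naming scheme produced by MakeShortName can be tried in isolation with the sketch below. `MakeShortNameSketch` is a hypothetical standalone copy of the same logic using only the standard library, so it can be compiled without the ToolKit.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical standalone copy of the short-name logic: the base name,
// a dot, then the volume index written as (index / 10)(index % 10),
// which gives at least two digits.
std::string MakeShortNameSketch(const std::string & base, int index)
{
    assert(index >= 0);
    std::ostringstream fns;
    fns << base << "." << (index / 10) << (index % 10);
    return fns.str();
}
```

Under this scheme a base name "nr" with index 7 becomes "nr.07", and index 42 becomes "nr.42". An index of 100 or more is not truncated; index 123 yields "nr.123".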

void CWriteDB_File::x_MakeFileName()
{
    if (m_UseIndex) {
        m_Fname = MakeShortName(m_BaseName, m_Index);
    } else {
        m_Fname = m_BaseName;
    }

    m_Fname += ".";
    m_Fname += m_Extension;
}

void CWriteDB_File::Close()
{
    x_Flush();
    if (m_Created) {
        m_RealFile.close();
    }
}

void CWriteDB_File::RenameSingle()
{
    _ASSERT(m_UseIndex == true);

    string nm1 = m_Fname;
    m_UseIndex = false;
    x_MakeFileName();

    CDirEntry fn1(nm1);
}

void CWriteDB_File::RenameFileIndex(unsigned int num_digits)
{
    _ASSERT(num_digits > 2);
    // log10(0) is not finite, so treat index 0 as having one digit.
    unsigned int orig_num_digits =
        (m_Index > 0) ? (unsigned int) log10((double) m_Index) + 1 : 1;
    if (orig_num_digits == num_digits) {
        return;
    }

    string orig_fname = m_Fname;
    ostringstream fns;

    fns << m_BaseName;
    fns << ".";
    for (unsigned int i = 2; i < num_digits; i++) {
        fns << "0";
    }
    fns << (m_Index / 10);
    fns << (m_Index % 10);
    fns << ".";
    fns << m_Extension;

    m_Fname = fns.str();

    CDirEntry fn(orig_fname);
}
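
The widened file name that RenameFileIndex builds can also be expressed with stream manipulators. The sketch below is an alternative formulation, not the ToolKit's code: `MakeIndexedName` is a hypothetical helper, and it uses `std::setw` with `std::setfill` in place of the explicit zero loop. For indices below 100 it should produce the same digits as the loop above.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: "<base>.<index zero-padded to num_digits>.<ext>".
std::string MakeIndexedName(const std::string & base,
                            int                 index,
                            int                 num_digits,
                            const std::string & extension)
{
    assert(index >= 0);
    assert(num_digits > 0);
    std::ostringstream fns;
    fns << base << "."
        << std::setw(num_digits) << std::setfill('0') << index
        << "." << extension;
    return fns.str();
}
```

With three digits requested, volume 7 of a protein index file would be named "nr.007.pin". `std::setw` sets a minimum width only, so an index that already has enough digits is written unchanged.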

CWriteDB_IndexFile::CWriteDB_IndexFile(const string  & dbname,
                                       bool            protein,
                                       const string  & title,
                                       const string  & date,
                                       int             index,
                                       Uint8           max_file_size,
                                       EBlastDbVersion dbver)
    : CWriteDB_File (dbname,
                     protein ? "pin" : "nin",
                     index,
                     max_file_size,
                     true),
      m_Protein   (protein),
      m_Title     (title),
      m_Date      (date),
      m_OIDs      (0),
      m_DataSize  (0),
      m_Letters   (0),
      m_MaxLength (0),
      m_Version   (dbver)
{
    // Compute index overhead, rounding up.

    m_Overhead = x_Overhead(title, date);
    if (dbver == eBDB_Version5) {
        m_Overhead = x_Overhead(title, x_MakeLmdbName(), date);
    } else {
        m_Overhead = x_Overhead(title, date);
    }
    m_Overhead = s_RoundUp(m_Overhead, 8);

    // The '1' added to the sequence offset array refers to the fact
    // that sequence files contain an initial NUL byte. This seems to
    // be for the benefit of the protein database scanning code, but
    // it is also done for nucleotide databases.

    m_Hdr.push_back(0);
    m_Seq.push_back(1);
}
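
int CWriteDB_IndexFile::x_Overhead(const string & T,
                                   const string & lmdbName,
                                   const string & D)
{
    // The volume length field is 8 bytes on disk; sizeof(long) is only
    // 4 on some platforms, so Uint8 is used to size it.
    return 5 * sizeof(int) + sizeof(Uint8)
        + 3 * sizeof(int) + T.size() + lmdbName.size() + D.size();
}

int CWriteDB_IndexFile::x_Overhead(const string & T,
                                   const string & D)
{
    // See the three-string overload for why Uint8 is used here.
    return 4 * sizeof(int) + sizeof(Uint8)
        + 2 * sizeof(int) + T.size() + D.size();
}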

void CWriteDB_IndexFile::x_Flush()
{
    bool use_lmdb = (m_Version == eBDB_Version5);

    int format_version = (int) m_Version;
    int seq_type = (m_Protein ? 1 : 0);

    // Pad the date string (see comments at top.)

    string pad_date = m_Date;
    int count = 0;
    const string lmdb_name = use_lmdb ? x_MakeLmdbName() : "";
    int overhead = use_lmdb
        ? x_Overhead(m_Title, lmdb_name, pad_date)
        : x_Overhead(m_Title, pad_date);
    while (overhead & 0x7) {
        pad_date.append(m_Nul);
        if (count != -1) {
            _ASSERT(count++ < 8);
        }
        overhead = use_lmdb
            ? x_Overhead(m_Title, lmdb_name, pad_date)
            : x_Overhead(m_Title, pad_date);
    }

    // Write header

    ostream & F = m_RealFile;

    s_WriteInt4  (F, format_version);
    s_WriteInt4  (F, seq_type);
    if (!lmdb_name.empty()) {
        s_WriteInt4  (F, m_Index);
        s_WriteString(F, m_Title);
        s_WriteString(F, lmdb_name);
    } else {
        s_WriteString(F, m_Title);
    }
    s_WriteString(F, pad_date);
    s_WriteInt4  (F, m_OIDs);
    s_WriteInt8LE(F, m_Letters);
    s_WriteInt4  (F, m_MaxLength);

    for(unsigned i = 0; i < m_Hdr.size(); i++) {
        s_WriteInt4(F, m_Hdr[i]);
    }

    for(unsigned i = 0; i < m_Seq.size(); i++) {
        s_WriteInt4(F, m_Seq[i]);
    }

    // Should loop m_OID times, or not at all.
    for(unsigned i = 0; i < m_Amb.size(); i++) {
        s_WriteInt4(F, m_Amb[i]);
    }

    // This extra index is added here because formatdb adds it. SeqDB
    // depends on its existence, but I don't think anyone reads (or
    // needs) the data. The last offset in the ambiguity column
    // represents the position of the set of ambiguities corresponding
    // to the last offset in the sequence column. But the last
    // sequence offset is not really a sequence start, it is the
    // 'extra' offset used by sequence length computations.

    if (m_Amb.size()) {
        s_WriteInt4(F, m_Seq.back());
    }

    // Swap with empty vectors to release the memory held by the
    // offset arrays.
    vector<unsigned int> tmp1, tmp2, tmp3;
    m_Hdr.swap(tmp1);
    m_Seq.swap(tmp2);
    m_Amb.swap(tmp3);
}

/// Form name of LMDB database file.
const string CWriteDB_IndexFile::x_MakeLmdbName()
{
    string suffix = (m_Protein ? ".pdb" : ".ndb");
    size_t last_slash = m_BaseName.find_last_of(CFile::GetPathSeparator());
    if (last_slash == m_BaseName.npos) {
        return m_BaseName + suffix;
    } else {
        return m_BaseName.substr(last_slash + 1) + suffix;
    }
}
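
The LMDB name derivation above, which drops any directory part of the base name and appends ".pdb" or ".ndb", can be sketched without the ToolKit. `LmdbFileName` is a hypothetical helper, and the `path_sep` parameter defaulting to '/' is an assumption standing in for `CFile::GetPathSeparator()`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: keep only the part of 'base_name' after the
// last path separator, then append the LMDB suffix for the molecule
// type (".pdb" for protein, ".ndb" for nucleotide).
std::string LmdbFileName(const std::string & base_name,
                         bool                protein,
                         char                path_sep = '/')
{
    assert(!base_name.empty());
    const std::string suffix = protein ? ".pdb" : ".ndb";
    const std::size_t last_sep = base_name.find_last_of(path_sep);
    if (last_sep == std::string::npos) {
        return base_name + suffix;
    }
    return base_name.substr(last_sep + 1) + suffix;
}
```

For a protein database with base name "/data/blast/nr" this returns "nr.pdb", so the name stored in the index file carries no directory component.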

CWriteDB_HeaderFile::CWriteDB_HeaderFile(const string & dbname,
                                         bool           protein,
                                         int            index,
                                         Uint8          max_file_size)
    : CWriteDB_File (dbname,
                     protein ? "phr" : "nhr",
                     index,
                     max_file_size,
                     true),
      m_DataSize(0)
{
}

CWriteDB_SequenceFile::CWriteDB_SequenceFile(const string & dbname,
                                             bool           protein,
                                             int            index,
                                             Uint8          max_file_size,
                                             Uint8          max_letters)
    : CWriteDB_File (dbname,
                     protein ? "psq" : "nsq",
                     index,
                     max_file_size,
                     true),
      m_Letters  (0),
#ifdef _DEBUG
      m_BaseLimit(max_letters),
      m_Protein  (protein)
#else
      m_BaseLimit(max_letters)
#endif
{
    // Only protein sequences need the inter-sequence NUL bytes.
    // The initial NUL written here is also emitted for nucleotide
    // sequences. It doesn't seem necessary there, but formatdb
    // provides it, so I will too.

    WriteWithNull(string());
}

END_NCBI_SCOPE
Modified on Sat Dec 09 04:48:06 2023 by modify_doxy.py rev. 669887