NCBI C++ ToolKit
blastkmerindex.cpp

1 /* $Id: blastkmerindex.cpp 100101 2023-06-15 14:10:29Z merezhuk $
2  * ===========================================================================
3  *
4  * PUBLIC DOMAIN NOTICE
5  * National Center for Biotechnology Information
6  *
7  * This software/database is a "United States Government Work" under the
8  * terms of the United States Copyright Act. It was written as part of
9  * the author's official duties as a United States Government employee and
10  * thus cannot be copyrighted. This software/database is freely available
11  * to the public for use. The National Library of Medicine and the U.S.
12  * Government have not placed any restriction on its use or reproduction.
13  *
14  * Although all reasonable efforts have been taken to ensure the accuracy
15  * and reliability of the software and data, the NLM and the U.S.
16  * Government do not and cannot warrant the performance or results that
17  * may be obtained by using this software or data. The NLM and the U.S.
18  * Government disclaim all warranties, express or implied, including
19  * warranties of performance, merchantability or fitness for any particular
20  * purpose.
21  *
22  * Please cite the author in any work or product based on this material.
23  *
24  * ===========================================================================
25  *
26  * Author: Tom Madden
27  *
28  * File Description:
29  * Build index for BLAST-kmer searches
30  *
31  * These comments describe the indices used for the KMER (QuickBLASTP) program.
32  * These indices are created by the code in this file.
33  * This document applies to the version 3 index. Others are not supported.
34  *
35  * There are two file extensions:
36  * 1.) pki: contains the values for quick matching of the query against database sequences. After the initial
37  * match, the Jaccard distance is calculated using the data in the pkd file.
38  * 2.) pkd: contains the data arrays used for the Jaccard calculation.
39  *
40  * Description of pki files (an illustrative header-reading sketch follows this comment block):
41  * ---------------------------------
42  * 1.) version - 4 bytes
43  * 2.) number of sequences - 4 bytes
44  * 3.) number of hash functions - 4 bytes
45  * 4.) number of samples (not used except in version 2) - 4 bytes
46  * 5.) KMER size - 4 bytes
47  * 6.) Rows per band. Used for LSH and typically 2 - 4 bytes
48  * 7.) compression. Used for the data (pkd file) to say how many bytes per hash value - 4 bytes
49  * 8.) Alphabet (0 for the 15-letter alphabet, 1 for the 10-letter alphabet) - 4 bytes
50  * 9.) Start of the LSH array - 4 bytes
51  * 10.) Size of the LSH array. Typically 16777217 entries (2**24 + 1) - 4 bytes
52  * 11.) Offset for the end of the LSH matches (not the array!). This is also the total size of the
53  * file in bytes. - 8 bytes
54  * 12.) Target chunk size. Typically 150. - 4 bytes
55  * 13.) Reserved for future use - 4 bytes
56  * 14.) Number of bad kmers. These are overrepresented kmers. - 4 bytes
57  * 15.) Hash values for bad kmers. This area is filled with zeros if there
58  * are no bad kmers. - (2*4*number of hash functions - 4) bytes
59  * 16.) LSH array, with each entry (corresponding to the hash value for two kmers)
60  * containing the offset to the list of data arrays in the pkd file. - size of the LSH array (given
61  * in item 10 above) * 8 bytes
62  * 17.) An entry at the end of the LSH array holding the final offset in bytes. Should be the
63  * same as entry 11 and is the size of the pki file. - 8 bytes
64  * 18.) A list of entries for the data arrays in the pkd files - 4 bytes times the number of entries
65  *
66  * Description of pkd files:
67  * ---------------------------------
68  * This file contains the data arrays used in the calculation of the Jaccard distance.
69  * Each array has (number of hash functions) * width bytes. The width can be 1, 2, or 4 bytes.
70  * Typically 2 bytes is used. Note that this compresses the hash values from 4 to 2 bytes,
71  * so collisions may occur. The number of collisions is small. At the end of
72  * each array 4 bytes are reserved for the OID in the BLAST database. Note that multiple
73  * arrays may refer to the same OID. Also, duplicate arrays with the same OID are eliminated.
74  * Note that the hash values (before compression) are ordered from largest to smallest.
75  *
76  * Locality Sensitive Hashing (LSH):
77  * ---------------------------------
78  * The idea here is to have a criterion for quickly identifying data arrays that are similar to
79  * the ones in the query, so that the Jaccard distance is only calculated on those arrays. We do
80  * this by taking two hash values from a data array and producing another hash value from them.
81  * In the version 3 index, we have two different ways of producing such LSH hash values. First, we
82  * take two neighboring hash values (from elements 0 and 1, 1 and 2, 2 and 3, etc.)
83  * and combine them to make one LSH hash value. Typically, we use 32 hash values in an array,
84  * so this gives us 31 LSH hash values. Second, we take every other value and hash them to
85  * one LSH hash value (from elements 0 and 2, 1 and 3, etc.). For 32 hash values in an
86  * array, this gives us 30 LSH hash values.
87  *
88  */
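As an illustration of the pki layout described above, the fixed leading fields of a version-3 index can be read back with plain 32-bit reads. This is a minimal sketch, not toolkit code: the struct and field names are invented for the example, and it assumes a little-endian host and a version-3 file (items 1-10 in the list above).

#include <cstdint>
#include <fstream>
#include <iostream>

// Leading fixed fields of a version-3 .pki file (items 1-10 above). The
// struct and field names are invented for this example; all fields are
// 4-byte integers, so the struct has no padding.
struct SPkiHeader {
    int32_t version;        // 1.) index version
    int32_t num_seqs;       // 2.) number of sequences
    int32_t num_hash_fct;   // 3.) number of hash functions
    int32_t samples;        // 4.) samples (version 2 only)
    int32_t kmer_size;      // 5.) KMER size
    int32_t rows_per_band;  // 6.) rows per band
    int32_t compress;       // 7.) bytes per hash value in the .pkd file
    int32_t alphabet;       // 8.) alphabet
    int32_t lsh_start;      // 9.) byte offset of the LSH array
    int32_t lsh_size;       // 10.) number of LSH entries (2**24 + 1)
};

int main(int argc, char** argv)
{
    if (argc < 2) return 1;
    std::ifstream in(argv[1], std::ios::binary);
    SPkiHeader h;
    in.read(reinterpret_cast<char*>(&h), sizeof(h));   // assumes a little-endian host
    if (!in) return 1;
    std::cout << "version=" << h.version
              << " sequences=" << h.num_seqs
              << " hash functions=" << h.num_hash_fct
              << " LSH starts at byte " << h.lsh_start << std::endl;
    return 0;
}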
89 
90 #include <ncbi_pch.hpp>
94 #include <corelib/ncbifile.hpp>
95 
96 
97 #ifdef _OPENMP
98 #include <omp.h>
99 #endif
100 
101 #include "pearson.hpp"
102 
103 BEGIN_NCBI_SCOPE
104 BEGIN_SCOPE(blast)
105 
106 
107 
108 CBlastKmerBuildIndex::CBlastKmerBuildIndex(CRef<CSeqDB> seqdb,
109  int kmerSize,
110  int numHashFct,
111  int samples,
112  int compress,
113  int alphabet,
114  int version,
115  int chunkSize)
116  : m_NumHashFct(numHashFct),
117  m_KmerSize(kmerSize),
118  m_SeqDB(seqdb),
119  m_DoSeg(0),
120  m_Samples(samples),
121  m_Compress(compress),
122  m_Alphabet(alphabet),
123  m_Version(version),
124  m_ChunkSize(chunkSize)
125 {
126  if (m_Compress == 0)
127  m_Compress=4;
128 
129  // Used for version one LSH
130  m_RowsPerBand = 2;
131  m_NumBands = m_NumHashFct/m_RowsPerBand;
132 
133  if (m_Version < 2)
134  m_Samples=0; // Occupies the seg position in version 1
135 }
136 
137 // universal hashing mod 1048583.
138 uint32_t uhash(uint64_t x, uint64_t a, uint64_t b)
139 {
140  uint64_t p = PKMER_PRIME;
141  return ( (a*x + b) % p );
142 }
143 
144 
145 /// FNV Hash. See http://www.isthe.com/chongo/tech/comp/fnv/index.html
146 static Uint4 FNV_hash(uint32_t num)
147 {
148  const Uint4 fnv_prime = 16777619u;
149  const Uint4 fnv_offset_basis = 2166136261u;
150  Int4 i;
151  Uint4 hash;
152 
153  Uint1 key[4];
154  key[0] = num & 0xff;
155  key[1] = (num >> 8) & 0xff;
156  key[2] = (num >> 16) & 0xff;
157  key[3] = (num >> 24) & 0xff;
158 
159 
160  hash = fnv_offset_basis;
161  for (i = 0;i < 4;i++) {
162  hash *= fnv_prime;
163  hash ^= key[i];
164  }
165 
166  return hash;
167 }
168 
169 // compute kmer set and minhash signature for a query
170 void s_MinhashSequences(uint32_t q_oid,
171  CSeqDB & db,
172  vector < vector < vector < uint32_t > > > & seq_hash,
173  uint32_t *dead,
174  int num_hashes,
175  const uint32_t *a,
176  const uint32_t *b,
177  bool do_seg,
178  int kmerNum,
179  int oidOffset,
180  int alphabetChoice,
181  int version,
182  int chunkSize)
183 {
184  int fullOID=q_oid+oidOffset; // REAL OID in full db, not just volume.
185  int seq_length = db.GetSeqLength(fullOID);
186  vector<TSeqRange> range_v;
187  int chunk_num = BlastKmerBreakUpSequence(seq_length, range_v, chunkSize);
188  seq_hash[q_oid].resize(chunk_num); // chunk_num is max number that will be saved.
189 
190  int chunk_iter=0;
191  int chunk_counter=0;
192  string query;
193  bool first_time=true;
194  db.GetSequenceAsString(fullOID, CSeqUtil::e_Ncbistdaa, query);
195  for(vector<TSeqRange>::iterator iter=range_v.begin(); iter != range_v.end(); ++iter, chunk_iter++)
196  {
197 
198  set<uint32_t> seq_kmer = BlastKmerGetKmerSet(query, do_seg, *iter, kmerNum, alphabetChoice);
199 
200  if (seq_kmer.empty())
201  {
202  continue;
203  }
204 
205 
206  vector<uint32_t> idx_tmp(num_hashes);
207  vector<uint32_t> hash_tmp(num_hashes);
208 
209  for(int h=0;h<num_hashes;h++)
210  {
211  hash_tmp[h]=0xffffffff;
212  idx_tmp[h]=0xffffffff;
213  }
214 
215  // for each kmer in the sequence,
216  for(set<uint32_t>::iterator i=seq_kmer.begin(); i != seq_kmer.end(); ++i)
217  {
218  // compute hashes and track their minima
219  for(int h=0;h<num_hashes;h++)
220  {
221  uint32_t hashval = uhash(*i, a[h], b[h]);
222 
223  if (hashval < hash_tmp[h])
224  {
225  hash_tmp[h] = hashval;
226  idx_tmp[h] = *i;
227  }
228  }
229  } // end each kmer
230 
231  if (first_time == false)
232  {
233  int distance = BlastKmerGetDistance(idx_tmp, seq_hash[q_oid][chunk_counter]);
234  if (distance == 0)
235  continue;
236  chunk_counter++; // Last valid chunk iteration
237  }
238  else
239  first_time=false;
240 
241 
242  int h;
243  // save the kmers with the minimum hash values
244  seq_hash[q_oid][chunk_counter].resize(num_hashes+1);
245  for(h=0;h<num_hashes;h++)
246  seq_hash[q_oid][chunk_counter][h] = idx_tmp[h];
247 
248  // Last spot is Query OID
249  if (version > 1)
250  seq_hash[q_oid][chunk_counter][num_hashes] = q_oid;
251  else
252  seq_hash[q_oid][chunk_counter][num_hashes] = fullOID;
253  }
254 
255  // Erase the end of the vector if not all chunks were filled.
256  if (chunk_num > chunk_counter+1)
257  seq_hash[q_oid].erase(seq_hash[q_oid].begin()+chunk_counter+1, seq_hash[q_oid].end());
258  if (first_time == true) // Nothing was found at all.
259  seq_hash[q_oid].erase(seq_hash[q_oid].begin(), seq_hash[q_oid].end());
260 
261 }
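The per-chunk loop above is a classical minhash construction: for every hash function, keep the k-mer that achieves the smallest hash value. The standalone sketch below is illustrative only — toy_uhash, toy_signature, and the coefficient values are invented for the example, and the prime is the 1048583 named in the comment on uhash() above. It builds two signatures the same way and shows how the fraction of agreeing slots approximates the Jaccard similarity of the underlying k-mer sets.

#include <cstdint>
#include <iostream>
#include <set>
#include <vector>

// Same form of universal hash as uhash() above.
static uint32_t toy_uhash(uint64_t x, uint64_t a, uint64_t b)
{
    const uint64_t p = 1048583;
    return static_cast<uint32_t>((a * x + b) % p);
}

// Build a minhash signature: for each hash function keep the k-mer that
// produced the smallest hash value, mirroring the inner loops of
// s_MinhashSequences().
static std::vector<uint32_t> toy_signature(const std::set<uint32_t>& kmers,
                                           const std::vector<uint64_t>& a,
                                           const std::vector<uint64_t>& b)
{
    std::vector<uint32_t> sig(a.size(), 0xffffffff);
    std::vector<uint32_t> best(a.size(), 0xffffffff);
    for (uint32_t k : kmers) {
        for (size_t h = 0; h < a.size(); h++) {
            uint32_t v = toy_uhash(k, a[h], b[h]);
            if (v < best[h]) {
                best[h] = v;
                sig[h] = k;
            }
        }
    }
    return sig;
}

int main()
{
    // Two k-mer sets sharing 5 of 11 distinct elements (Jaccard = 5/11).
    std::set<uint32_t> s1 = {1, 2, 3, 4, 5, 6, 7, 8};
    std::set<uint32_t> s2 = {1, 2, 3, 4, 5, 10, 11, 12};

    // Invented hash coefficients; the real code draws them with GetRandomNumbers().
    std::vector<uint64_t> a, b;
    for (uint64_t i = 0; i < 64; i++) {
        a.push_back(3 + 7 * i);
        b.push_back(11 + 13 * i);
    }

    std::vector<uint32_t> sig1 = toy_signature(s1, a, b);
    std::vector<uint32_t> sig2 = toy_signature(s2, a, b);

    int same = 0;
    for (size_t h = 0; h < sig1.size(); h++)
        if (sig1[h] == sig2[h]) same++;

    // The fraction of agreeing slots approximates the true Jaccard index (5/11).
    std::cout << "estimated Jaccard = " << double(same) / sig1.size() << std::endl;
    return 0;
}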
262 
263 // compute kmer set and minhash signature for a query
264 void s_MinhashSequences2(uint32_t q_oid,
265  CSeqDB & db,
266  vector < vector < vector < uint32_t > > > & seq_hash,
267  uint32_t *dead,
268  int num_hashes,
269  int kmerNum,
270  int oidOffset,
271  int alphabetChoice,
272  int version,
273  vector<int> badMers,
274  int chunkSize)
275 {
276  int fullOID=q_oid+oidOffset; // REAL OID in full db, not just volume.
277  int seq_length = db.GetSeqLength(fullOID);
278  vector<TSeqRange> range_v;
279  int chunk_num = BlastKmerBreakUpSequence(seq_length, range_v, chunkSize);
280  seq_hash[q_oid].resize(chunk_num); // chunk_num is max number that will be saved.
281 
282  int chunk_iter=0;
283  int chunk_counter=0;
284  string query;
285  bool first_time=true;
286  db.GetSequenceAsString(fullOID, CSeqUtil::e_Ncbistdaa, query);
287  for(vector<TSeqRange>::iterator iter=range_v.begin(); iter != range_v.end(); ++iter, chunk_iter++)
288  {
289 
290  set<uint32_t> seq_kmer = BlastKmerGetKmerSet2(query, *iter, kmerNum, alphabetChoice, badMers);
291 
292  if (seq_kmer.empty())
293  {
294  continue;
295  }
296 
297 
298  vector<uint32_t> hash_values;
299  vector<uint32_t> idx_tmp(num_hashes);
300 
301  // for each kmer in the sequence,
302  for(set<uint32_t>::iterator i=seq_kmer.begin(); i != seq_kmer.end(); ++i)
303  {
304  uint32_t hashval = FNV_hash(*i);
305  hash_values.push_back(hashval);
306 
307  } // end each kmer
308  if (hash_values.size() < static_cast<size_t>(num_hashes))
309  {
310  int rem = 1 + num_hashes - static_cast<int>(hash_values.size());
311  uint32_t hashval = 0xffffffff; // Fill in empties
312  for (int i=0; i<rem; i++)
313  hash_values.push_back(hashval);
314  }
315  std::sort(hash_values.begin(), hash_values.end());
316 
317  for(int i=0; i<num_hashes; i++)
318  idx_tmp[i] = hash_values[i];
319 
320  if (first_time == false)
321  {
322  int distance = BlastKmerGetDistance(idx_tmp, seq_hash[q_oid][chunk_counter]);
323  if (distance == 0)
324  continue;
325  chunk_counter++; // Last valid chunk iteration
326  }
327  else
328  first_time=false;
329 
330 
331  int h;
332  // save the kmers with the minimum hash values
333  seq_hash[q_oid][chunk_counter].resize(num_hashes+1);
334  for(h=0;h<num_hashes;h++)
335  seq_hash[q_oid][chunk_counter][h] = idx_tmp[h];
336 
337  // Last spot is Query OID
338  if (version > 1)
339  seq_hash[q_oid][chunk_counter][num_hashes] = q_oid;
340  else
341  seq_hash[q_oid][chunk_counter][num_hashes] = fullOID;
342  }
343 
344  // Erase the end of the vector if not all chunks were filled.
345  if (chunk_num > chunk_counter+1)
346  seq_hash[q_oid].erase(seq_hash[q_oid].begin()+chunk_counter+1, seq_hash[q_oid].end());
347  if (first_time == true) // Nothing was found at all.
348  seq_hash[q_oid].erase(seq_hash[q_oid].begin(), seq_hash[q_oid].end());
349 
350 }
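Unlike the pre-version-3 routine, this one hashes every k-mer once with FNV and keeps the num_hashes smallest values, i.e. a bottom-k sketch built from a single hash function. A compact illustration, with invented names (toy_fnv, toy_bottom_k):

#include <algorithm>
#include <cstdint>
#include <set>
#include <vector>

// FNV-1 over the four little-endian bytes of the k-mer, the same scheme as
// FNV_hash() above.
static uint32_t toy_fnv(uint32_t num)
{
    uint32_t hash = 2166136261u;
    for (int i = 0; i < 4; i++) {
        hash *= 16777619u;
        hash ^= (num >> (8 * i)) & 0xff;
    }
    return hash;
}

// Bottom-k sketch: hash every k-mer once, pad short sets with 0xffffffff
// (as the routine above does), sort, and keep the k smallest values.
static std::vector<uint32_t> toy_bottom_k(const std::set<uint32_t>& kmers, int k)
{
    std::vector<uint32_t> values;
    for (uint32_t m : kmers)
        values.push_back(toy_fnv(m));
    while (static_cast<int>(values.size()) < k)
        values.push_back(0xffffffffu);
    std::sort(values.begin(), values.end());
    values.resize(k);
    return values;
}

int main()
{
    std::set<uint32_t> kmers = {11, 42, 99, 123, 4096};
    std::vector<uint32_t> sig = toy_bottom_k(kmers, 32);
    return sig.size() == 32 ? 0 : 1;
}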
351 
352 // LSH via Stanford PDF
353 static void s_Get_LSH_index_hashes(vector < vector < vector <uint32_t> > >& seq_hash, vector< vector<uint32_t> >& lsh,
354  int q_oid, int numBands, int numRows, int& total_chunks, uint8_t* uniqueHash)
355 {
356 
357  int num_chunks = static_cast<int>(seq_hash[q_oid].size());
358  for (int n=0; n<num_chunks; n++)
359  {
360  for(int b=0;b<numBands;b++)
361  {
362  unsigned char key[9];
363  for(int r=0;r<numRows;r++)
364  {
365  key[r*4] = (seq_hash[q_oid][n][b*numRows+r]) & 0xff;
366  key[1+r*4] = ((seq_hash[q_oid][n][b*numRows+r]) >> 8) & 0xff;
367  key[2+r*4] = ((seq_hash[q_oid][n][b*numRows+r]) >> 16) & 0xff;
368  key[3+r*4] = ((seq_hash[q_oid][n][b*numRows+r]) >> 24) & 0xff;
369  }
370  key[8] = (unsigned char) b;
371  uint32_t foo = do_pearson_hash(key, 9);
372  uniqueHash[foo] = 1;
373  if (foo != 0)
374  lsh[foo].push_back(total_chunks);
375  }
376  total_chunks++;
377  }
378 }
379 
380 // LSH via Buhler
381 static void s_Get_LSH_index_hashes2(vector < vector < vector <uint32_t> > >& seq_hash, vector< vector<uint32_t> >& lsh,
382  int q_oid, int num_k, int num_l, int array_size, int& total_chunks, uint8_t* uniqueHash,
383  vector < vector <int> >& kvector)
384 {
385 
386  int max=4*num_k+1;
387  vector<unsigned char> key(max);
388  int num_chunks=static_cast<int>(seq_hash[q_oid].size());
389  int temp_index=0;
390  int temp_hash=0;
391  for (int n=0; n<num_chunks; n++)
392  {
393  for (int r=0; r<num_l; r++)
394  {
395  for (int i=0; i<num_k; i++)
396  {
397  temp_index = kvector[r][i];
398  temp_hash = seq_hash[q_oid][n][temp_index];
399  key[i*4] = (temp_hash) & 0xff;
400  key[1+i*4] = ((temp_hash) >> 8) & 0xff;
401  key[2+i*4] = ((temp_hash) >> 16) & 0xff;
402  key[3+i*4] = ((temp_hash) >> 24) & 0xff;
403  }
404  key[max-1] = r;
405  uint32_t foo = do_pearson_hash(key.data(), max);
406  uniqueHash[foo] = 1;
407  if (foo != 0)
408  lsh[foo].push_back(total_chunks);
409  }
410  total_chunks++;
411  }
412 }
413 
414 // One hash function LSH ala Stanford, but add extra rows
415 static void s_Get_LSH_index_hashes5(vector < vector < vector <uint32_t> > >& seq_hash, vector< vector<uint32_t> >& lsh,
416  int q_oid, int numHashes, int numRows, int& total_chunks, uint8_t* uniqueHash)
417 {
418 
419  int num_chunks = static_cast<int>(seq_hash[q_oid].size());
420  int numHashMax = numHashes - numRows + 1;
421  uint32_t temp_hash = 0;
422  for (int n=0; n<num_chunks; n++)
423  {
424  for(int b=0;b<numHashMax;b++)
425  {
426  unsigned char key[12]; // 12 assumes numRows is 3 or less.
427  for(int r=0;r<numRows;r++)
428  {
429  key[r*4] = (seq_hash[q_oid][n][b+r]) & 0xff;
430  key[1+r*4] = ((seq_hash[q_oid][n][b+r]) >> 8) & 0xff;
431  key[2+r*4] = ((seq_hash[q_oid][n][b+r]) >> 16) & 0xff;
432  key[3+r*4] = ((seq_hash[q_oid][n][b+r]) >> 24) & 0xff;
433  }
434  uint32_t foo = do_pearson_hash(key, 4*numRows);
435  uniqueHash[foo] = 1;
436  if (foo != 0)
437  lsh[foo].push_back(total_chunks);
438  }
439  for(int b=0;b<numHashMax-1; b++)
440  {
441  unsigned char key[8]; // 8 assumes numRows is 2 or less.
442  for (int r=0; r<numRows; r++)
443  {
444  temp_hash = seq_hash[q_oid][n][b+2*r];
445  key[r*4] = (temp_hash) & 0xff;
446  key[1+r*4] = (temp_hash >> 8) & 0xff;
447  key[2+r*4] = (temp_hash >> 16) & 0xff;
448  key[3+r*4] = (temp_hash >> 24) & 0xff;
449  }
450  uint32_t foo = do_pearson_hash(key, 4*numRows);
451  uniqueHash[foo] = 1;
452  if (foo != 0)
453  lsh[foo].push_back(total_chunks);
454  }
455 
456  total_chunks++;
457  }
458 }
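For a 32-value signature this routine therefore emits 31 LSH keys from neighboring pairs plus 30 from every-other pairs, exactly as described in the file header comment. The sketch below is illustrative only: toy_hash24 is a stand-in for do_pearson_hash() (any hash reduced to 24 bits serves the illustration, since the LSH array has 2**24 + 1 slots), and the helper names are invented.

#include <cstdint>
#include <iostream>
#include <vector>

// Write a 32-bit value into four bytes, low byte first, mirroring the byte
// packing in the functions above.
static void pack_le(unsigned char* dst, uint32_t v)
{
    dst[0] = v & 0xff;
    dst[1] = (v >> 8) & 0xff;
    dst[2] = (v >> 16) & 0xff;
    dst[3] = (v >> 24) & 0xff;
}

// Stand-in for do_pearson_hash(), reduced to 24 bits.
static uint32_t toy_hash24(const unsigned char* key, int len)
{
    uint32_t h = 2166136261u;
    for (int i = 0; i < len; i++)
        h = (h ^ key[i]) * 16777619u;
    return h & 0xffffff;
}

// Version-3 LSH keys for one signature: keys from neighboring pairs
// (0,1), (1,2), ... and from every-other pairs (0,2), (1,3), ...
static std::vector<uint32_t> toy_lsh_keys(const std::vector<uint32_t>& sig)
{
    std::vector<uint32_t> keys;
    unsigned char key[8];
    for (size_t b = 0; b + 1 < sig.size(); b++) {   // 31 keys for 32 values
        pack_le(key, sig[b]);
        pack_le(key + 4, sig[b + 1]);
        keys.push_back(toy_hash24(key, 8));
    }
    for (size_t b = 0; b + 2 < sig.size(); b++) {   // 30 keys for 32 values
        pack_le(key, sig[b]);
        pack_le(key + 4, sig[b + 2]);
        keys.push_back(toy_hash24(key, 8));
    }
    return keys;
}

int main()
{
    std::vector<uint32_t> sig(32);
    for (uint32_t i = 0; i < 32; i++)
        sig[i] = 1000 + i;
    std::cout << toy_lsh_keys(sig).size() << " LSH keys" << std::endl;  // prints 61
    return 0;
}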
459 
460 void
461 CBlastKmerBuildIndex::Build(int numThreads)
462 {
463  m_SeqDB->SetNumberOfThreads(numThreads);
464 
465  vector<string> paths;
466  int oid_offset=0;
467  CSeqDB::FindVolumePaths(m_SeqDB->GetDBNameList(), CSeqDB::eProtein, paths);
468  vector<TSeqRange> range_vec;
469  vector<string> volname_vec;
470  int last=0;
471  for (vector<string>::iterator iter=paths.begin(); iter != paths.end(); ++iter)
472  {
473  string base;
474  string ext;
475  CDirEntry::SplitPath((*iter), NULL, &base, &ext);
476  string volName = base + ext;
477  volname_vec.push_back(volName);
478  CRef<CSeqDB> seqdb_vol(new CSeqDB(*iter, CSeqDB::eProtein));
479  oid_offset += seqdb_vol->GetNumSeqs();
480  TSeqRange range;
481  range.SetFrom(last);
482  range.SetTo(oid_offset);
483  range_vec.push_back(range);
484  last=oid_offset;
485  }
486 
487  int numVols = static_cast<int>(paths.size());
488 #pragma omp parallel for num_threads(numThreads)
489  for(int index=0; index<numVols; index++)
490  {
491  x_BuildIndex(volname_vec[index], range_vec[index].GetFrom(), range_vec[index].GetTo());
492  }
493 }
494 
495 
496 vector<int>
497 s_BlastKmerLoadBadMers(int alphabet)
498 {
499  char* loadBadMers = getenv("LOADBADMERS");
500  if (loadBadMers)
501  {
502  CNcbiIfstream in("badMers.in");
503  int badKmer;
504  vector<int> badMers;
505  while (in)
506  {
507  in >> badKmer;
508  if (! in)
509  break;
510  badMers.push_back(badKmer);
511 cerr << badKmer << '\n';
512  }
513  return badMers;
514  }
515  char* noBadMers = getenv("NOBADMERS");
516  if (noBadMers)
517  return vector<int>();
518 
519  if (alphabet == 1)
520  {
521  const int kLength=10;
522  int array[] = {139810, 69905, 70161, 70177, 74257,
523  69921, 69906, 74001, 135441, 69922};
524  vector<int> badMers(array, array+kLength);
525  return badMers;
526  }
527 
528  return vector<int>();
529 }
530 
531 void
532 CBlastKmerBuildIndex::x_BuildIndex(string& name, int start, int stop)
533 {
534  int StartLSH=0;
535  if (m_Version == 2)
536  {
537  int vectorRandNums=0;
538  vectorRandNums = m_Samples*m_RowsPerBand;
539  vectorRandNums += 8 - (m_Samples*m_RowsPerBand)%8; // multiple of 8.
540  StartLSH=56+(2*4*m_NumHashFct)+vectorRandNums;
541  }
542  else if (m_Version == 3)
543  StartLSH=56+(2*4*m_NumHashFct);
544  else
545  StartLSH=48+(2*4*m_NumHashFct);
546  const int kSizeLSH=KMER_LSH_ARRAY_SIZE; // 2**24 + 1 (sentinel at end)
547 
548  string indexFile = name + ".pki";
549  string dataFile = name + ".pkd";
550  // Output files
551  CNcbiOfstream index_file(indexFile.c_str(), IOS_BASE::out | IOS_BASE::binary);
552  CNcbiOfstream data_file(dataFile.c_str(), IOS_BASE::out | IOS_BASE::binary);
553 
554  if (!index_file)
555  NCBI_THROW(CFileException, eNotExists, "Cannot open " + indexFile);
556  if (!data_file)
557  NCBI_THROW(CFileException, eNotExists, "Cannot open " + dataFile);
558 
559  int num_seqs = 0;
560  if (stop != 0)
561  num_seqs = stop - start;
562  else
563  num_seqs = m_SeqDB->GetNumSeqs();
564 
565 
566  index_file.write((char *) &(m_Version), 4); // Add BLAST DB name?
567  index_file.write((char *) &(num_seqs), 4);
568  index_file.write((char *) &(m_NumHashFct), 4);
569  index_file.write((char *) &(m_Samples), 4);
570  index_file.write((char *) &(m_KmerSize), 4);
571  index_file.write((char *) &(m_RowsPerBand), 4);
572  index_file.write((char *) &(m_Compress), 4);
573  index_file.write((char *) &(m_Alphabet), 4);
574  index_file.write((char *) &(StartLSH), 4);
575  index_file.write((char *) &(kSizeLSH), 4);
576 
577  // hash coefficients
578  vector<uint32_t> a(m_NumHashFct);
579  vector<uint32_t> b(m_NumHashFct);
580  GetRandomNumbers(a.data(), b.data(), m_NumHashFct);
581 
582  // sequences that contain no valid kmers
583  uint32_t* dead = new uint32_t[3*num_seqs];
584  _ASSERT(dead);
585  for(int q_oid=0;q_oid<3*num_seqs;q_oid++)
586  dead[q_oid]=0;
587 
588  vector<int> badMers = s_BlastKmerLoadBadMers(m_Alphabet);
589 
590  // mapping from a sequence (by oid) to
591  // chunks of a sequence to the minhashes
592  vector < vector < vector < uint32_t > > > seq_hash(num_seqs);
593 
594  // compute minhash signatures for each sequence
595  for(int q_oid=0;q_oid<num_seqs;q_oid++)
596  {
597  if (m_Version < 3)
598  s_MinhashSequences(q_oid,
599  *m_SeqDB,
600  seq_hash,
601  dead,
602  m_NumHashFct,
603  a.data(),
604  b.data(),
605  m_DoSeg,
606  m_KmerSize,
607  start,
608  m_Alphabet,
609  m_Version,
610  m_ChunkSize);
611  else
612  s_MinhashSequences2(q_oid,
613  *m_SeqDB,
614  seq_hash,
615  dead,
616  m_NumHashFct,
617  m_KmerSize,
618  start,
619  m_Alphabet,
620  m_Version,
621  badMers,
622  m_ChunkSize);
623  }
624 
625  const uint32_t kUniqueHash = 0x1000000;
626  uint8_t* uniqueHash = new uint8_t[kUniqueHash];
627  for (uint32_t index=0; index<kUniqueHash; index++)
628  uniqueHash[index] = 0;
629 
630  vector< vector<uint32_t> > lsh(kSizeLSH);
631  int total_chunks=0;
632 // cerr << "TEST " << m_RowsPerBand << " " << m_Samples << '\n';
633  vector < vector <int> > kvector;
634  if (m_Version == 2)
635  {
636  GetKValues(kvector, m_RowsPerBand, m_Samples, m_NumHashFct);
637  }
638  for(int q_oid=0;q_oid<num_seqs;q_oid++)
639  {
640  if (dead[q_oid]) // Nothing to do with this.
641  continue;
642 
643  if (m_Version == 2)
644  s_Get_LSH_index_hashes2(seq_hash, lsh, q_oid, m_RowsPerBand, m_Samples, m_NumHashFct,
645  total_chunks, uniqueHash, kvector);
646  else if (m_Version == 3)
647  s_Get_LSH_index_hashes5(seq_hash, lsh, q_oid, m_NumHashFct, m_RowsPerBand, total_chunks, uniqueHash);
648  else
649  s_Get_LSH_index_hashes(seq_hash, lsh, q_oid, m_NumBands, m_RowsPerBand, total_chunks, uniqueHash);
650  }
651 
652  int hashCount=0;
653  for(uint32_t index=0; index<kUniqueHash; index++)
654  {
655  if (uniqueHash[index] > 0)
656  hashCount++;
657  }
658  delete [] uniqueHash;
659 
660 
661  uint64_t LSHMatchSize=0;
662  for(int index=0; index<kSizeLSH-1; index++)
663  {
664  LSHMatchSize += lsh[index].size();
665  }
666 
667  // End of the LSH matches (NOT the LSH array)
668  const uint64_t kLSHMatchEnd = 4*LSHMatchSize + StartLSH + 8*kSizeLSH;
669  index_file.write((char *) &(kLSHMatchEnd), 8);
670 
671  if (m_Version > 2)
672  index_file.write((char *) &(m_ChunkSize), 4);
673 
674  if (m_Version > 1)
675  {
676  const int kFutureUse=0;
677  index_file.write((char *) &(kFutureUse), 4);
678  if (m_Version < 3)
679  index_file.write((char *) &(kFutureUse), 4);
680  }
681 
682  if (m_Version < 3)
683  {
684  // Random numbers used for the hash functions.
685  for(int i=0;i<m_NumHashFct;i++)
686  index_file.write((char *) &(a[i]), 4);
687  for(int i=0;i<m_NumHashFct;i++)
688  index_file.write((char *) &(b[i]), 4);
689  }
690  else
691  {
692  int i=1; // i must be lower than 2*m_NumHashFct
693  int num=static_cast<int>(badMers.size());
694  if (num > (2*m_NumHashFct-1))
695  num = 2*m_NumHashFct-1;
696  index_file.write((char *) &(num), 4);
697  for(vector<int>::iterator iter=badMers.begin(); iter != badMers.end() && i<2*m_NumHashFct; ++iter, ++i)
698  index_file.write((char *) &(*iter), 4);
699  if (i<2*m_NumHashFct)
700  {
701  const int kZero=0;
702  for (; i<2*m_NumHashFct; ++i)
703  index_file.write((char *) &(kZero), 4);
704  }
705 
706  }
707 
708 
709  if (m_Version == 2)
710  {
711  for (int i=0; i<m_Samples; i++)
712  {
713  vector<int> temp = kvector[i];
714  for (int j=0; j<m_RowsPerBand; j++) // values <= number of hash functions.
715  index_file.write((char *) &(temp[j]), 1);
716  }
717  // Need to have multiple of 8 bytes.
718  int extra = 8 - (m_Samples*m_RowsPerBand)%8;
719  Uint1 temp=0;
720  for (int i=0; i<extra; i++)
721  index_file.write((char *) &(temp), 1);
722  }
723 
724 
725  uint64_t lsh_offset = StartLSH + 8*kSizeLSH;
726  const uint64_t kNoValue = 0; // StartLSH is greater than zero, so this is OK.
727  for(int index=0; index<kSizeLSH-1; index++)
728  {
729  if (lsh[index].size() == 0)
730  index_file.write((char *) &(kNoValue), 8);
731  else
732  index_file.write((char *) &(lsh_offset), 8);
733  lsh_offset += 4*(lsh[index].size());
734  }
735  index_file.write((char *) &(lsh_offset), 8); // Last one, sentinel
736 
737  int lsh_hits=0;
738  for(int index=0; index<kSizeLSH-1; index++)
739  {
740  for(vector<uint32_t>::iterator i=lsh[index].begin(); i != lsh[index].end(); ++i)
741  {
742  index_file.write((char *) &(*i), 4);
743  lsh_hits++;
744  }
745  }
746 
747  index_file.flush();
748  index_file.close();
749 
750  x_WriteDataFile(seq_hash, num_seqs, data_file);
751  data_file.flush();
752  data_file.close();
753 
754 
755  delete [] dead;
756 
757  return;
758 }
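The LSH section written above is an array of 2**24 + 1 eight-byte file offsets followed by the flattened 4-byte match lists; an empty bucket stores 0, and the final entry is a sentinel holding the end offset. The sketch below shows one way a reader could recover the byte range of a bucket's match list from that array. It is illustrative only (toy_bucket_range is an invented name; the index-reading code associated with mhfile.hpp is the authoritative implementation).

#include <cstdint>
#include <utility>

// Given the 8-byte offset array from a pki file (KMER_LSH_ARRAY_SIZE entries,
// the last being a sentinel equal to the end of the match lists), return the
// byte range [begin, end) of the match list for one LSH bucket. A stored
// offset of 0 marks an empty bucket, since 0 is never a valid offset here.
static std::pair<uint64_t, uint64_t>
toy_bucket_range(const uint64_t* lsh, uint32_t array_size, uint32_t idx)
{
    uint64_t begin = lsh[idx];
    if (begin == 0)
        return std::make_pair(0, 0);
    uint64_t end = lsh[array_size - 1];           // sentinel: end of all lists
    for (uint32_t j = idx + 1; j < array_size; j++) {
        if (lsh[j] != 0) {                        // next non-empty bucket
            end = lsh[j];
            break;
        }
    }
    return std::make_pair(begin, end);
}

int main()
{
    // Toy array: four buckets plus the sentinel; bucket 1 is empty.
    uint64_t lsh[5] = {1000, 0, 1012, 1024, 1040};
    std::pair<uint64_t, uint64_t> r = toy_bucket_range(lsh, 5, 0);
    return (r.first == 1000 && r.second == 1012) ? 0 : 1;
}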
759 
760 
761 void
762 CBlastKmerBuildIndex::x_WriteDataFile(vector < vector < vector < uint32_t > > > & seq_hash, int num_seqs, CNcbiOfstream& data_file)
763 {
764  int width = m_Compress;
765  if (width == 0)
766  width = 4;
767  for(int q_oid=0;q_oid<num_seqs;q_oid++)
768  {
769  int num_chunks = static_cast<int>(seq_hash[q_oid].size());
770  for (int n=0; n<num_chunks; n++)
771  {
772  vector<uint32_t> tmp_hash;
773  for(int b=0;b<m_NumHashFct;b++)
774  {
775  if (width == 2)
776  {
777  uint16_t hash_val = pearson_hash_int2short(seq_hash[q_oid][n][b]);
778  tmp_hash.push_back(hash_val);
779  // data_file.write((char *) &(hash_val), width);
780  }
781  else if(width == 1)
782  {
783  unsigned char hash_val = pearson_hash_int2byte(seq_hash[q_oid][n][b]);
784  tmp_hash.push_back(hash_val);
785  // data_file.write((char *) &(hash_val), width);
786  }
787  else
788  {
789  // data_file.write((char *) &(seq_hash[q_oid][n][b]), width);
790  tmp_hash.push_back(seq_hash[q_oid][n][b]);
791  }
792  }
793  if (m_Version == 3)
794  std::sort(tmp_hash.begin(), tmp_hash.end());
795  for(int b=0;b<m_NumHashFct;b++)
796  {
797  data_file.write((char *) &(tmp_hash[b]), width);
798  }
799  data_file.write((char *) &(seq_hash[q_oid][n][m_NumHashFct]), 4); // OID in the last slot
800  }
801  }
802 }
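Each record written here is width * m_NumHashFct bytes of (possibly Pearson-compressed) hash values followed by a 4-byte OID. For the pre-version-3 style signatures (one minhash value per hash function), the fraction of positions on which two records agree is the standard minhash estimate of the Jaccard index. The search-side code performs the actual comparison; the function below is only an illustration with an invented name, assuming the common width of 2 bytes.

#include <cstdint>

// Position-wise agreement between two width-2 signature records read from a
// .pkd file (the 4-byte OID that follows each record is not part of the
// comparison). "num_hash" is the number of hash functions from the .pki header.
double toy_signature_agreement(const uint16_t* rec1, const uint16_t* rec2, int num_hash)
{
    int same = 0;
    for (int i = 0; i < num_hash; i++)
        if (rec1[i] == rec2[i]) same++;
    return static_cast<double>(same) / num_hash;
}

int main()
{
    uint16_t a[4] = {1, 2, 3, 4};
    uint16_t b[4] = {1, 2, 9, 4};
    return toy_signature_agreement(a, b, 4) == 0.75 ? 0 : 1;
}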
803 
804 END_SCOPE(blast)
805 END_NCBI_SCOPE
806 