NCBI C++ ToolKit
bss_info.cpp
/*  $Id: bss_info.cpp 90014 2020-05-04 17:30:22Z ivanov $
 * ===========================================================================
 *
 *                            PUBLIC DOMAIN NOTICE
 *               National Center for Biotechnology Information
 *
 *  This software/database is a "United States Government Work" under the
 *  terms of the United States Copyright Act. It was written as part of
 *  the author's official duties as a United States Government employee and
 *  thus cannot be copyrighted. This software/database is freely available
 *  to the public for use. The National Library of Medicine and the U.S.
 *  Government have not placed any restriction on its use or reproduction.
 *
 *  Although all reasonable efforts have been taken to ensure the accuracy
 *  and reliability of the software and data, the NLM and the U.S.
 *  Government do not and cannot warrant the performance or results that
 *  may be obtained by using this software or data. The NLM and the U.S.
 *  Government disclaim all warranties, express or implied, including
 *  warranties of performance, merchantability or fitness for any particular
 *  purpose.
 *
 *  Please cite the author in any work or product based on this material.
 *
 * ===========================================================================
 *
 * Author:  David McElhany, with thanks to Eugene Vasilchenko and Andrei
 *          Gourianov for patiently answering my many questions.
 *
 * File Description:
 *  A demo program to address the problem statement:
 *  -   Please extract the seq-id, tax-id, and defline for each of some very
 *      large number of sequences packed in a Bioseq-set.
 *
 *  NOTE:   This question was given in CXX-1382 in the context of demonstrating
 *          serial hooks, and the program is therefore designed primarily to
 *          demo serial hooks, not necessarily navigating object hierarchies.
 *
 *  Some Assumptions:
 *
 *  1.  The input will be a Seq-entry file containing the Bioseq-set.
 *  2.  The Seq-id to be reported will simply be the Bioseq.id.
 *  3.  The Object Manager shouldn't be used to retrieve the Tax-id or Defline
 *      (CXX-1382 specifically requested that hooks be used to answer the
 *      question).
 *  4.  The Tax-id to be reported will be that of the Bioseq itself or, if it
 *      doesn't specify one, that of the most nested enclosing Bioseq or
 *      Bioseq-set that defines a Tax-id. Every Bioseq (or one of its enclosing
 *      Bioseq's or Bioseq-set's) must define a Tax-id.
 *  5.  The Defline to be reported will be found in the Bioseq.descr.title or,
 *      if that isn't present, in the descr.title of the most nested enclosing
 *      Bioseq or Bioseq-set that defines a descr.title. Every Bioseq (or one
 *      of its enclosing Bioseq's or Bioseq-set's) must define a Defline.
 *
 *  Implementation Approach:
 *
 *  1.  Set up a class member skip hook on Bioseq.id. This hook will record the
 *      current Bioseq.id, and when the information for the current Bioseq is
 *      reported, this is the Seq-id that will be used.
 *  2.  Set up stack path skip hooks for "*.descr.source.org.db" and
 *      "*.descr.org.db". These hooks will set the current Tax-id. Stack path
 *      hooks are used to ensure that Tax-id's are parsed only within the proper
 *      structural context.
 *  3.  Set up an object skip hook for Bioseq-set. This hook will update a stack
 *      containing the applicable Tax-id.
 *  4.  Set up an object skip hook for Bioseq. This hook will update a stack
 *      containing the applicable Tax-id, and will also determine the Defline
 *      and report the desired information.
 *  5.  Skip through the file, triggering the hooks.
 *
 * ===========================================================================
 */

#include <ncbi_pch.hpp>
#include <stack>

#include <corelib/ncbiapp.hpp>
#include <corelib/ncbiargs.hpp>
#include <corelib/ncbiexpt.hpp>
#include <corelib/ncbifile.hpp>
#include <corelib/ncbistr.hpp>
#include <corelib/ncbistre.hpp>

#include <objects/general/Dbtag.hpp>
#include <objects/seq/Bioseq.hpp>
#include <objects/seqloc/Seq_id.hpp>
#include <objects/seqset/Bioseq_set.hpp>
#include <objects/seqset/Seq_entry.hpp>

#include <serial/objcopy.hpp>
#include <serial/objectio.hpp>
#include <serial/objhook.hpp>
#include <serial/objistr.hpp>
#include <serial/serial.hpp>

USING_SCOPE(ncbi);
USING_SCOPE(objects);


///////////////////////////////////////////////////////////////////////////
//  Local Types

enum EBioseqContext {
    eBioseqContextBioseq,
    eBioseqContextBioseqSet
};

enum ETaxPath {
    eTaxPathOrg,
    eTaxPathSourceOrg
};


// Keep info about a Bioseq. Note that this info could be inherited from a
// Bioseq-set.descr.

struct SBioseqInfo
{
    SBioseqInfo()
        : seqid_str(""), taxid_org(-1), taxid_source_org(-1), defline("")
    {}

    string seqid_str;
    int    taxid_org;
    int    taxid_source_org;
    string defline;
};


///////////////////////////////////////////////////////////////////////////
//  Module Static Functions and Data

static ESerialDataFormat s_GetFormat(const string& name);
static void s_Report(void);

typedef stack<SBioseqInfo> TBioseqInfoStack;

static TBioseqInfoStack& GetBioseqInfoStack()
{
    // safe way to use global static objects
    static CSafeStatic<TBioseqInfoStack> bi_stack;
    return bi_stack.Get();
}


///////////////////////////////////////////////////////////////////////////
//  Hook Classes

// This class finds Bioseq's and Bioseq-set's when skipping, gathers info
// about the context, and reports the info (for Bioseq's).
class CHookBioseqContext : public CSkipObjectHook
{
public:
    CHookBioseqContext(EBioseqContext context)
        : m_Context(context)
    {}

    virtual void SkipObject(CObjectIStream& stream,
                            const CObjectTypeInfo& info)
    {
        // Push info to be used for this Bioseq. Note: this is initially
        // invalid, but the info will be overwritten (in other hooks) before
        // it gets used.
        if (GetBioseqInfoStack().empty()) {
            // Push a new empty object for the first time.
            SBioseqInfo bs_info;
            GetBioseqInfoStack().push(bs_info);
        } else {
            // Push a copy of the last object for this Bioseq.
            // This facilitates inheriting info through nesting.
            GetBioseqInfoStack().push(GetBioseqInfoStack().top());
        }

        // Skip the Bioseq, triggering other hooks which in turn retrieve
        // relevant data.
        DefaultSkip(stream, info);

        // Report the required information (if this is a Bioseq).
        if (m_Context == eBioseqContextBioseq) {
            s_Report();
        }

        // We're done with this Bioseq info.
        GetBioseqInfoStack().pop();
    }

private:
    EBioseqContext m_Context;
};
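
The push-a-copy, skip, report, pop sequence used by the hook above can be tried in isolation. The following is a minimal sketch in standard C++ only, with no toolkit types: `Info`, `ContextStack`, `Enter`, `Top`, `Leave` and `Reported` are hypothetical names introduced for illustration, and the sketch collects reported records in a vector instead of printing them.

```cpp
#include <cassert>
#include <stack>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-Bioseq record; not a toolkit type.
struct Info {
    std::string defline;
    int         taxid = -1;
};

// Hypothetical nesting-aware collector. Enter() is called before a nested
// object is processed and Leave() after it, mirroring the hook's push/pop.
class ContextStack {
public:
    // Push a copy of the enclosing record, or a fresh one at the top level,
    // so that nested objects inherit whatever the enclosing ones defined.
    void Enter() {
        if (m_Stack.empty()) {
            m_Stack.push(Info());
        } else {
            m_Stack.push(m_Stack.top());
        }
    }

    // Innermost record; callers overwrite only the fields they actually find.
    Info& Top() {
        assert(!m_Stack.empty());
        return m_Stack.top();
    }

    // Optionally record the innermost info, then discard it.
    void Leave(bool report) {
        assert(!m_Stack.empty());
        if (report) {
            m_Reported.push_back(m_Stack.top());
        }
        m_Stack.pop();
    }

    const std::vector<Info>& Reported() const { return m_Reported; }

private:
    std::stack<Info>  m_Stack;
    std::vector<Info> m_Reported;
};
```

Calling `Enter()` for an enclosing set, filling in its fields, and then calling `Enter()` / `Leave(true)` for each nested sequence yields records that carry the set's values wherever the sequence did not supply its own.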


// This class finds Bioseq.id's when skipping. The hook reads and records
// the last Bioseq.id encountered.

class CHookSeq_id : public CSkipClassMemberHook
{
public:
    virtual void SkipClassMember(CObjectIStream& stream,
                                 const CObjectTypeInfoMI& info)
    {
        // The relevant ASN.1 is:
        //      Bioseq ::= SEQUENCE {
        //          id SET OF Seq-id
        //
        // This hook is on the 'id' class member of Bioseq, which means it's
        // the whole 'SET OF Seq-id' that's hooked, not individual Seq-id's.
        //
        // Therefore, we will iterate through the set and: (1) read into a
        // local Seq-id object, and (2) append that Seq-id's info to the
        // Seq-id string that represents the Bioseq's id.

        string seqid_str;
        CIStreamContainerIterator isc(stream, info.GetMemberType());
        for ( ; isc; ++isc ) {
            // Read the Seq-id locally.
            CSeq_id seqid;
            isc >> seqid; // Read from the container iterator, not the stream.

            // Append this Seq-id to the Seq-id string.
            if ( !seqid_str.empty() ) {
                seqid_str += "|";
            }
            seqid_str += seqid.AsFastaString();
        }

        // Update the current Bioseq info with the new Seq-id string.
        GetBioseqInfoStack().top().seqid_str = seqid_str;
    }
};
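
The way the hook above builds one string from the whole `SET OF Seq-id` can be sketched without the toolkit. `JoinSeqIds` is a hypothetical helper, and it assumes each element has already been rendered as a string (the role `AsFastaString()` plays in the listing).

```cpp
#include <string>
#include <vector>

// Hypothetical helper: join already-formatted Seq-id strings with '|',
// adding the separator only between elements.
std::string JoinSeqIds(const std::vector<std::string>& ids)
{
    std::string joined;
    for (const std::string& id : ids) {
        if (!joined.empty()) {
            joined += "|";
        }
        joined += id;
    }
    return joined;
}
```

An empty set yields an empty string, and a single id is returned unchanged.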


// This class finds Tax-id's when skipping, and sets the current Tax-id.

class CHookTax_id : public CSkipClassMemberHook
{
public:
    CHookTax_id(ETaxPath tax_path)
        : m_TaxPath(tax_path)
    {}

    virtual void SkipClassMember(CObjectIStream& stream,
                                 const CObjectTypeInfoMI& info)
    {
        // The relevant ASN.1 is:
        //      Seqdesc ::= CHOICE {
        //          org Org-ref ,
        //          source BioSource ,
        //      BioSource ::= SEQUENCE {
        //          org Org-ref ,
        //      Org-ref ::= SEQUENCE {
        //          db SET OF Dbtag OPTIONAL ,
        //      Dbtag ::= SEQUENCE {
        //          db VisibleString ,
        //          tag Object-id }
        //      Object-id ::= CHOICE {
        //          id INTEGER ,
        //          str VisibleString }
        //
        // This hook is on the 'db' class member of Org-ref (as set via stack
        // path hooks), which means it's the whole 'SET OF Dbtag' that's
        // hooked, not individual Dbtag's.
        //
        // Therefore, we will iterate through the set and: (1) read into a
        // local Dbtag object, and (2) parse that Dbtag to see if it's a
        // Tax-id and if so, update the current Bioseq info.

        // Set up the container iterator based on the hooked object info.
        CIStreamContainerIterator isc(stream, info.GetMemberType());
        for ( ; isc; ++isc ) {
            // Read the Dbtag locally.
            CDbtag dbtag;
            isc >> dbtag; // Read from the container iterator, not the stream.

            // Get access to the Dbtag.db class member.
            CObjectInfo obj = ObjectInfo<CDbtag>(dbtag);
            CObjectInfo db_member =
                obj.SetClassMember(obj.FindMemberIndex("db"));

            // Get the value of the Dbtag.db class member.
            string db_str = db_member.GetPrimitiveValueString();

            // Only continue for taxonomy db entries.
            if (db_str == "taxon") {
                // Get access to the Dbtag.tag class member.
                CObjectInfo tag_cont =
                    obj.SetClassMember(obj.FindMemberIndex("tag"));

                // Follow the member's pointer to the Object-id itself.
                CObjectInfo tag_pt = tag_cont.GetPointedObject();

                // Get access to the selected Object-id choice variant.
                CObjectInfoCV tag_var = tag_pt.GetCurrentChoiceVariant();
                CObjectInfo tag_choice = tag_var.GetVariant();

                // Get the value of the selected Object-id choice variant.
                int taxid;
                if (tag_var.GetVariantInfo()->GetId().GetName() == "id") {
                    taxid = tag_choice.GetPrimitiveValueInt();
                } else {
                    taxid = NStr::StringToInt(
                        tag_choice.GetPrimitiveValueString(),
                        NStr::fConvErr_NoThrow);
                }
                // Only keep the last Tax-id found (for each path).
                if (m_TaxPath == eTaxPathOrg) {
                    GetBioseqInfoStack().top().taxid_org = taxid;
                } else {
                    GetBioseqInfoStack().top().taxid_source_org = taxid;
                }
            }
        }
    }

private:
    ETaxPath m_TaxPath;
};
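
The Dbtag handling above reduces to a small decision: ignore entries whose `db` is not `"taxon"`, and otherwise take the tag's integer, whichever Object-id variant carries it. The sketch below models that with plain structs. `ObjectId`, `DbTag`, `ParseIntNoThrow` and `ExtractTaxId` are hypothetical names, and returning 0 for an unparsable string is an assumption of this sketch rather than documented toolkit behavior.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical mirror of: Object-id ::= CHOICE { id INTEGER, str VisibleString }
struct ObjectId {
    bool        is_int;  // true when the 'id' variant is selected
    int         id;
    std::string str;
};

// Hypothetical mirror of: Dbtag ::= SEQUENCE { db VisibleString, tag Object-id }
struct DbTag {
    std::string db;
    ObjectId    tag;
};

// Parse a decimal int without throwing. Returns 0 on any failure
// (an assumption of this sketch).
int ParseIntNoThrow(const std::string& s)
{
    if (s.empty()) {
        return 0;
    }
    errno = 0;
    char* end = nullptr;
    long v = std::strtol(s.c_str(), &end, 10);
    if (errno != 0 || *end != '\0' || v < INT_MIN || v > INT_MAX) {
        return 0;
    }
    return static_cast<int>(v);
}

// Returns true and sets 'taxid' only for taxonomy entries; other databases
// leave 'taxid' untouched.
bool ExtractTaxId(const DbTag& tag, int& taxid)
{
    if (tag.db != "taxon") {
        return false;
    }
    taxid = tag.tag.is_int ? tag.tag.id : ParseIntNoThrow(tag.tag.str);
    return true;
}
```

Because a failed parse produces a value below 1, it is treated the same as "no Tax-id found" by any consumer that checks `taxid < 1`.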


// This class finds Defline's when skipping, and sets the current Defline.

class CHookDefline : public CSkipObjectHook
{
public:
    virtual void SkipObject(CObjectIStream& stream,
                            const CObjectTypeInfo& info)
    {
        // Get a reference to the current Bioseq info.
        SBioseqInfo& bs_info(GetBioseqInfoStack().top());

        // Read the Defline into the current Bioseq info.
        stream.Read(&bs_info.defline,
                    CStdTypeInfo<string>::GetTypeInfo(),
                    CObjectIStream::eNoFileHeader);
    }
};


///////////////////////////////////////////////////////////////////////////
//  Static Function Definitions


// This function translates format names to enum values.
ESerialDataFormat s_GetFormat(const string& name)
{
    if (name == "asn") {
        return eSerial_AsnText;
    } else if (name == "asnb") {
        return eSerial_AsnBinary;
    } else if (name == "xml") {
        return eSerial_Xml;
    } else if (name == "json") {
        return eSerial_Json;
    } else {
        // Should be caught by argument processing, but in case of a
        // programming error...
        NCBI_THROW(CException, eUnknown, "Bad serial format name " + name);
    }
}


// This function will print the required information for the current Bioseq.
static void s_Report(void)
{
    // Get the Seq-id string.
    string seqid_str = GetBioseqInfoStack().top().seqid_str;

    // Get the Tax-id, preferring Tax-id's found with stack path
    // *.descr.source.org.db over those with path *.descr.org.db --
    // see ncbi::objects::CBioseq_Info::GetTaxId() and
    // ncbi::objects::sequence::GetTaxId().
    int taxid = GetBioseqInfoStack().top().taxid_source_org;
    // If not found in source.org, taxid will be < 1 so try just org.
    if (taxid < 1) {
        taxid = GetBioseqInfoStack().top().taxid_org;
    }

    // Get the Defline.
    string defline = GetBioseqInfoStack().top().defline;

    // Report the required information.
    cout << ">"
         << seqid_str << " "
         << defline << " ["
         << taxid << "]" << endl;
}
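
The Tax-id preference and the output line built by `s_Report()` can be checked separately from the stream machinery. In this sketch `ChooseTaxId` and `FormatReport` are hypothetical helper names; the selection rule (use the source.org value unless it is below 1) and the line layout are taken from the function above.

```cpp
#include <sstream>
#include <string>

// Prefer the Tax-id found under descr.source.org; fall back to the one
// found under descr.org when the former is missing (< 1).
int ChooseTaxId(int taxid_source_org, int taxid_org)
{
    return (taxid_source_org < 1) ? taxid_org : taxid_source_org;
}

// Build one report line in the layout: ">seqid defline [taxid]".
std::string FormatReport(const std::string& seqid,
                         const std::string& defline,
                         int taxid)
{
    std::ostringstream out;
    out << ">" << seqid << " " << defline << " [" << taxid << "]";
    return out.str();
}
```

If neither path supplied a Tax-id, both inputs keep their initial value of -1 and that value appears in the brackets, which makes a violated "every Bioseq must define a Tax-id" assumption visible in the output.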


///////////////////////////////////////////////////////////////////////////
//  Main Application Functionality

class CBssInfoApp : public CNcbiApplication
{
    virtual void Init(void);
    virtual int  Run(void);
};


void CBssInfoApp::Init(void)
{
    // Create command-line argument descriptions class
    unique_ptr<CArgDescriptions> arg_desc(new CArgDescriptions);

    // Specify USAGE context
    arg_desc->SetUsageContext
        (GetArguments().GetProgramBasename(),
         "Bioseq-set info extractor");

    // Describe the expected command-line arguments

    arg_desc->AddDefaultKey
        ("i", "InputFile",
         "name of input file",
         CArgDescriptions::eInputFile, "-", CArgDescriptions::fPreOpen);
    arg_desc->AddDefaultKey("ifmt", "InputFormat", "format of input file",
                            CArgDescriptions::eString, "asn");
    arg_desc->SetConstraint
        ("ifmt", &(*new CArgAllow_Strings, "asn", "asnb", "xml", "json"));

    // Setup arg.descriptions for this application
    SetupArgDescriptions(arg_desc.release());
}


int CBssInfoApp::Run(void)
{
    // Get arguments
    const CArgs& args = GetArgs();

    // Set up the input stream.
    unique_ptr<CObjectIStream> in(CObjectIStream::Open
                                  (s_GetFormat(args["ifmt"].AsString()),
                                   args["i"].AsInputFile()));

    // Set up an object skip hook for Bioseq. This hook will update a stack
    // containing the required info and report the required info.
    CObjectTypeInfo(CType<CBioseq>())
        .SetLocalSkipHook(*in, new CHookBioseqContext(eBioseqContextBioseq));

    // Set up an object skip hook for Bioseq-set. This hook will update a stack
    // containing the required info.
    CObjectTypeInfo(CType<CBioseq_set>())
        .SetLocalSkipHook(*in, new CHookBioseqContext(eBioseqContextBioseqSet));

    // Set up a class member skip hook on Bioseq.id. This hook will record the
    // current Bioseq.id, and when the information for the current Bioseq is
    // reported, this is the Seq-id that will be used.
    CObjectTypeInfo(CType<CBioseq>())
        .FindMember("id")
        .SetLocalSkipHook(*in, new CHookSeq_id());

    // Set up stack path skip hooks to capture Tax-id's. Stack path
    // hooks are used to ensure that Tax-id's are parsed only within the
    // proper structural context.
    in->SetPathSkipMemberHook("*.descr.source.org.db",
                              new CHookTax_id(eTaxPathSourceOrg));
    in->SetPathSkipMemberHook("*.descr.org.db",
                              new CHookTax_id(eTaxPathOrg));

    // Set up a stack path skip hook to capture the Defline. A stack path
    // hook is used to ensure that Defline's are parsed only within the
    // proper structural context.
    in->SetPathSkipObjectHook("*.descr.title", new CHookDefline());

    // Skip through the Seq-entry in the input file. This will trigger the
    // hooks, which will extract and report the desired Bioseq information.
    in->Skip(CType<CSeq_entry>());

    return 0;
}
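
To see why two separate Tax-id masks are registered above, it helps to model how a dotted mask could be compared with the current element path. The sketch below is a simplified stand-in, not the toolkit's matcher: it assumes `*` matches zero or more path elements and `?` matches exactly one, and the example paths used with it are illustrative rather than taken from a real stream. `SplitPath`, `MatchFrom` and `MatchesMask` are hypothetical names.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split a dotted path or mask into its elements.
std::vector<std::string> SplitPath(const std::string& path)
{
    std::vector<std::string> parts;
    std::istringstream in(path);
    std::string part;
    while (std::getline(in, part, '.')) {
        parts.push_back(part);
    }
    return parts;
}

// Recursive element-wise match. Assumed semantics (a simplification):
// "*" matches zero or more elements, "?" matches exactly one element.
static bool MatchFrom(const std::vector<std::string>& mask, std::size_t mi,
                      const std::vector<std::string>& path, std::size_t pi)
{
    if (mi == mask.size()) {
        return pi == path.size();
    }
    if (mask[mi] == "*") {
        // Let the wildcard consume 0..N of the remaining path elements.
        for (std::size_t k = pi; k <= path.size(); ++k) {
            if (MatchFrom(mask, mi + 1, path, k)) {
                return true;
            }
        }
        return false;
    }
    if (pi == path.size()) {
        return false;
    }
    if (mask[mi] != "?" && mask[mi] != path[pi]) {
        return false;
    }
    return MatchFrom(mask, mi + 1, path, pi + 1);
}

bool MatchesMask(const std::string& mask, const std::string& path)
{
    return MatchFrom(SplitPath(mask), 0, SplitPath(path), 0);
}
```

Under these assumptions `"*.descr.org.db"` does not match a path ending in `descr.source.org.db`, because `descr` must be immediately followed by `org`. That is consistent with the listing registering one hook per path and tagging each with its own `ETaxPath` value.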


int main(int argc, const char* argv[])
{
    // Execute main application function
    return CBssInfoApp().AppMain(argc, argv);
}
Modified on Sat Dec 02 09:24:02 2023 by modify_doxy.py rev. 669887