NCBI C++ ToolKit
align_filter.hpp
#ifndef GPIPE_COMMON___ALIGN_FILTER__HPP
#define GPIPE_COMMON___ALIGN_FILTER__HPP

/* $Id: align_filter.hpp 101311 2023-11-29 13:58:36Z dicuccio $
 * ===========================================================================
 *
 *                            PUBLIC DOMAIN NOTICE
 *               National Center for Biotechnology Information
 *
 *  This software/database is a "United States Government Work" under the
 *  terms of the United States Copyright Act.  It was written as part of
 *  the author's official duties as a United States Government employee and
 *  thus cannot be copyrighted.  This software/database is freely available
 *  to the public for use. The National Library of Medicine and the U.S.
 *  Government have not placed any restriction on its use or reproduction.
 *
 *  Although all reasonable efforts have been taken to ensure the accuracy
 *  and reliability of the software and data, the NLM and the U.S.
 *  Government do not and cannot warrant the performance or results that
 *  may be obtained by using this software or data. The NLM and the U.S.
 *  Government disclaim all warranties, express or implied, including
 *  warranties of performance, merchantability or fitness for any particular
 *  purpose.
 *
 *  Please cite the author in any work or product based on this material.
 *
 * ===========================================================================
 *
 * Authors:  Mike DiCuccio
 *
 * File Description:
 *
 */

#include <corelib/ncbiobj.hpp>
#include <util/qparse/query_parse.hpp>



#include <set>

BEGIN_NCBI_SCOPE
BEGIN_SCOPE(objects)
    class CSeq_align;
    class CSeq_align_set;
    class CSeq_annot;
    class CScope;
END_SCOPE(objects)


///
/// CAlignFilter exposes a query language for inspecting properties and scores
/// placed on Seq-align objects.  The query language supports a wide variety of
/// parameters and language structures.
///
/// Basic syntax
/// ------------
///
/// - Queries can group conditions with balanced parentheses, nested to any
///   depth
/// - Queries can use standard boolean operators (AND / OR / NOT; operators
///   are not case sensitive)
/// - Queries consist of tokens expressing a conditional phrase, of the forms:
///     -# a = b
///     -# a != b
///     -# a < b
///     -# a > b
/// - A token may be a number or a text string.  A text string is first looked
///   up in a dictionary of known computable values.  If it is not found
///   there, it is looked up as a score in Seq-align.score.
/// - CAlignFilter supports a set of functions as well.  Functions express
///   additional parameters or mathematical operations.  The current list of
///   functions is:
///     -# MUL(a, b) = a * b; a and b are tokens as defined above
///     -# ADD(a, b) = a + b; a and b are tokens as defined above
///     -# IS_SEG_TYPE(a) = 1 if the Seq-align is of segment type a (where a is
///        one of 'disc', 'denseg', 'std', 'spliced', 'packed', 'dendiag')
///     -# COALESCE(a,b,...) = first of (a, b, ...) that evaluates to a
///        supported value.  In order to avoid problems when querying against a
///        missing value, COALESCE() allows the specification of alternate score
///        names or of alternate values.  Thus, COALESCE(score, 0) will return 0
///        if 'score' is not present.
///
/// Current Accepted Tokens
/// -----------------------
///
/// - Any named score.  CSeq_align enforces specific score names through the
///   use of enums; some standard score names are described in CSeq_align and
///   include:
///     -# align_length
///     -# bit_score
///     -# comp_adjustment_method
///     -# e_value
///     -# longest_gap
///     -# num_ident
///     -# num_mismatch
///     -# num_negatives
///     -# num_positives
///     -# pct_coverage
///     -# pct_identity_gap
///     -# pct_identity_gapopen_only
///     -# pct_identity_ungap
///     -# score
///     -# sum_e
///   NOTE: There is no requirement that an alignment contain any of the above
///   scores.
///
/// - Any number.  All numbers are interpreted as doubles.
///
/// - One of a fixed set of computable characteristics locally defined in
///   CAlignFilter.  These include:
///     -# 3prime_unaligned - Length of 3' unaligned sequence
///     -# 5prime_unaligned - Length of 5' unaligned sequence (same as
///        query_start)
///     -# align_length - Length of aligned query span
///     -# align_length_ratio - Length of aligned subject span / length of
///        aligned query span
///     -# align_length_ungap - Sum of lengths of aligned query segments
///     -# cds_internal_stops - For Spliced-segs, returns the count of the
///        number of internal stops present in the mapped CDS (mapped =
///        CGeneModel::CreateGeneModel() mapped)
///     -# internal_unaligned - Length of unaligned sequence between 5'-most and
///        3'-most ends
///     -# min_exon_len - Length of shortest exon
///     -# product_length - Same as query_length
///     -# query_end - End pos (0-based) of query sequence
///     -# query_length - Length of query sequence
///     -# query_start - Start pos (0-based) of query sequence
///     -# subject_end - Ending pos (0-based) of subject span
///     -# subject_length - Length of subject sequence
///     -# subject_start - Starting pos (0-based) of subject span
///
/// - A specific sequence identifier.  The special tokens 'query' and 'subject'
///   can be used to specify individual sequences using any of the sequence's
///   seq-id synonyms
///
///
/// Example queries:
/// ----------------
///
/// - pct_coverage > 99.5
///     - finds alignments with the score pct_coverage > 99.5
///
/// - (pct_identity_gap > 99.9 AND pct_coverage > 98) OR (pct_identity_gap > 99.0 AND pct_coverage > 99.5)
///     - evaluates two compound conditions and returns every alignment that
///       satisfies at least one of them (inclusive OR)
///
/// - query = NM_012345.1
///     - returns all alignments whose query sequence is NM_012345.1
///
/// - MUL(align_length, 0.8) > num_positives
///     - matches alignments for which num_positives is less than 80% of
///       the aligned length
///

class NCBI_XALGOALIGN_EXPORT CAlignFilter : public CObject
{
public:
    CAlignFilter();
    CAlignFilter(const string& filter_string);

    /// Set the query to be used
    void SetFilter(const string& filter_string);

    /// CAlignFilter uses a scope internally.  You can set a scope yourself;
    /// if you do not, a default scope is used
    void SetScope(objects::CScope& scope);
    objects::CScope& SetScope();

    /// Remove duplicate alignments when filtering
    /// NOTE: this may be expensive for a large number of alignments, as it
    /// forces the algorithm to maintain a list of hash keys for each alignment
    CAlignFilter& SetRemoveDuplicates(bool b = true);

    /// Add a sequence to a blacklist.
    /// Blacklisted sequences are always excluded; if an alignment contains a
    /// query or subject that matches a blacklisted sequence, then that
    /// alignment will be excluded.
    ///
    /// NOTE: this is only triggered if the alignments are pairwise!
    ///
    void AddBlacklistQueryId(const objects::CSeq_id_Handle& idh);
    void AddBlacklistSubjectId(const objects::CSeq_id_Handle& idh);

    /// Add a sequence to the whitelist.
    /// If an alignment matches a whitelisted ID as appropriate, it will always
    /// be returned.
    ///
    /// NOTE: this is only triggered if the alignments are pairwise!
    ///
    void AddWhitelistQueryId(const objects::CSeq_id_Handle& idh);
    void AddWhitelistSubjectId(const objects::CSeq_id_Handle& idh);

    /// Add a sequence to the exclude-not-in list.
    /// If an alignment does not match one of the IDs, it is excluded.
    ///
    /// NOTE: this is only triggered if the alignments are pairwise!
    ///
    void AddExcludeNotInQueryId(const objects::CSeq_id_Handle& idh);
    void AddExcludeNotInSubjectId(const objects::CSeq_id_Handle& idh);

    /// Add a specific query/subject range restriction.
    /// The restriction acts on the subject range for a given query
    void AddQSRangeRestriction(const objects::CSeq_id_Handle& qid,
                               const objects::CSeq_id_Handle& sid,
                               TSeqRange subj_range);

    /// Match a single alignment
    bool Match(const objects::CSeq_align& align);

    /// Filter a set of alignments, iteratively applying Match() to each
    /// alignment and emitting all matched alignments in the output set.
    void Filter(const list< CRef<objects::CSeq_align> >& aligns_in,
                list< CRef<objects::CSeq_align> >& aligns_out);

    /// Filter a set of alignments, iteratively applying Match() to each
    /// alignment and emitting all matched alignments in the output set.
    void Filter(const objects::CSeq_align_set& aligns_in,
                objects::CSeq_align_set& aligns_out);

    /// Filter a set of alignments, iteratively applying Match() to each
    /// alignment and emitting all matched alignments in the output seq-annot.
    void Filter(const objects::CSeq_annot& aligns_in,
                objects::CSeq_annot& aligns_out);

    /// Print out the dictionary of score generators
    void PrintDictionary(CNcbiOstream&);

    /// Do a dry run of the filter, printing out the parse tree and
    /// looking up all strings
    void DryRun(CNcbiOstream&);

private:
    bool x_Match(const CQueryParseTree::TNode& node,
                 const objects::CSeq_align& align);

    bool x_IsUnique(const objects::CSeq_align& align);

    double x_GetAlignmentScore(const string& score_name,
                               const objects::CSeq_align& align,
                               bool throw_if_not_found = false);

    bool x_Query_Op(const CQueryParseTree::TNode& key_node,
                    CQueryParseNode::EType type,
                    bool is_not,
                    const CQueryParseTree::TNode& val_node,
                    const objects::CSeq_align& align);

    double x_FuncCall(const CQueryParseTree::TNode& func_node,
                      const objects::CSeq_align& align);
    double x_TermValue(const CQueryParseTree::TNode& term_node,
                       const objects::CSeq_align& align,
                       bool throw_if_not_found = false);

    bool x_Query_Range(const CQueryParseTree::TNode& key_node,
                       bool is_not,
                       const CQueryParseTree::TNode& val1_node,
                       const CQueryParseTree::TNode& val2_node,
                       const objects::CSeq_align& align);

    objects::CScoreLookup::IScore::EComplexity
    x_Complexity(const CQueryParseTree::TNode& node);

    void x_ParseTree_Flatten(CQueryParseTree& tree,
                             CQueryParseTree::TNode& node);


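private:
    bool m_RemoveDuplicates;
    string m_Query;
    unique_ptr<CQueryParseTree> m_ParseTree;

    /// Flag indicating whether this is a dry run of the filter. If so we are not
    /// matching an alignment, but instead walking the parse tree and printing
    /// information about each score name
    bool m_IsDryRun;
    CNcbiOstream* m_DryRunOutput;

    CRef<objects::CScope> m_Scope;

    set<objects::CSeq_id_Handle> m_QueryBlacklist;
    set<objects::CSeq_id_Handle> m_QueryWhitelist;
    set<objects::CSeq_id_Handle> m_SubjectBlacklist;
    set<objects::CSeq_id_Handle> m_SubjectWhitelist;
    set<objects::CSeq_id_Handle> m_QueryExcludeNotIn;
    set<objects::CSeq_id_Handle> m_SubjectExcludeNotIn;

    /// Range restriction infrastructure
    /// For some query/subject pairs, we will restrict the alignments to lie
    /// within a specified range
    typedef map<objects::CSeq_id_Handle, CRangeCollection<TSeqPos> > TSubjCompartments;
    typedef map<objects::CSeq_id_Handle, TSubjCompartments> TQuerySubjCompartments;
    TQuerySubjCompartments m_QSComparts;

    typedef set<string> TUniqueAligns;
    TUniqueAligns m_UniqueAligns;

    typedef map<objects::CSeq_id_Handle, CRangeCollection<TSeqPos> > TRegionMap;
    typedef map<string, TRegionMap> TRegionMapCache;
    TRegionMapCache m_RegionMapCache;

    const TRegionMap &x_GetRegionMap(const string &regions_file);

    objects::CScoreLookup m_ScoreLookup;
};



END_NCBI_SCOPE


#endif  // GPIPE_COMMON___ALIGN_FILTER__HPP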
Modified on Sun Apr 21 03:37:31 2024 by modify_doxy.py rev. 669887