NCBI C++ ToolKit
seq_loc_mapper_base.cpp

1 /* $Id: seq_loc_mapper_base.cpp 99064 2023-02-08 19:14:27Z ucko $
2 * ===========================================================================
3 *
4 * PUBLIC DOMAIN NOTICE
5 * National Center for Biotechnology Information
6 *
7 * This software/database is a "United States Government Work" under the
8 * terms of the United States Copyright Act. It was written as part of
9 * the author's official duties as a United States Government employee and
10 * thus cannot be copyrighted. This software/database is freely available
11 * to the public for use. The National Library of Medicine and the U.S.
12 * Government have not placed any restriction on its use or reproduction.
13 *
14 * Although all reasonable efforts have been taken to ensure the accuracy
15 * and reliability of the software and data, the NLM and the U.S.
16 * Government do not and cannot warrant the performance or results that
17 * may be obtained by using this software or data. The NLM and the U.S.
18 * Government disclaim all warranties, express or implied, including
19 * warranties of performance, merchantability or fitness for any particular
20 * purpose.
21 *
22 * Please cite the author in any work or product based on this material.
23 *
24 * ===========================================================================
25 *
26 * Author: Aleksey Grichenko
27 *
28 * File Description:
29 * Seq-loc mapper base
30 *
31 */
32 
33 #include <ncbi_pch.hpp>
43 #include <algorithm>
44 
45 
46 #define NCBI_USE_ERRCODE_X Objects_SeqLocMap
47 
48 
51 
52 
53 const char* CAnnotMapperException::GetErrCodeString(void) const
54 {
55  switch ( GetErrCode() ) {
56  case eBadLocation: return "eBadLocation";
57  case eUnknownLength: return "eUnknownLength";
58  case eBadAlignment: return "eBadAlignment";
59  case eBadFeature: return "eBadFeature";
60  case eCanNotMap: return "eCanNotMap";
61  case eOtherError: return "eOtherError";
62  default: return CException::GetErrCodeString();
63  }
64 }
65 
66 /*
67 /////////////////////////////////////////////////////////////////////
68 
69 CSeq_loc_Mapper_Base basic approaches.
70 
71 1. Initialization
72 
73 The mapper parses input data (two seq-locs, seq-alignment) and stores
74 mappings in a collection of CMappingRange objects. Each mapping range
75 contains source (id, start, stop, strand) and destination (id, start,
76 strand).
77 
78 All coordinates are converted to genomic with one exception: if
79 source and destination locations have the same length and the mapper
80 can not obtain real sequence types, it assumes that both sequences
81 are nucleotides even if they are proteins. See x_AdjustSeqTypesToProt()
82 for more info on this special case.
83 
84 The mapper uses several methods to check sequence types: by comparing
85 source and destination lengths, by calling GetSeqType() which is
86 overridden in CSeq_loc_Mapper to provide the correct information, using
87 some information from alignments (e.g. spliced-segs contain explicit
88 sequence types). If all these methods fail, the mapper may still
89 successfully do its job. E.g. if mapping is between two whole seq-locs,
90 it may be done with the assumption that both sequences have the same
91 type.
92 
93 The order of mapping ranges is not preserved; they are sorted by
94 source seq-id and start position.
95 
96 When parsing input locations the mapper also tries to create equivalent
97 mappings for all synonyms of the source sequence id. The base class
98 does not provide synonyms, but CSeq_loc_Mapper overrides the
99 CollectSynonyms() method to implement this.
100 
101 In some situations (like mapping between a bioseq and its segments),
102 the mapper also creates dummy mappings from destination to itself,
103 so that during the mapping any ranges already on the destination
104 sequence are not truncated. See x_PreserveDestinationLocs().
105 
106 
107 2. Mapping
108 
109 Mapping of seq-locs is done range-by-range; the original seq-loc
110 is not parsed completely before mapping. Each original interval is
111 mapped through all matching mapping ranges, so some parts may be mapped
112 more than once.
113 
114 The mapped ranges are first stored in a container of SMappedRange
115 structures. This is done to simplify merging ranges. If no merge
116 flag is set or the new range can not be merged with the collected
117 set, all ranges from the container are moved (pushed) to the
118 destination seq-loc and the new range starts the new collection.
119 This is done by x_PushMappedRange method (adding a new range) and
120 x_PushRangesToDstMix (pushing the collected mapped ranges to the
121 destination seq-loc).
122 
123 The pushing also occurs in the following situations:
124 - When a source range is discarded (not just clipped) - see
125  x_SetLastTruncated.
126 - When a non-mapping range is copied to the destination mix (in fact,
127  in this case pushing is usually done by the truncation described
128  above).
129 - When a new complex seq-loc is started (e.g. a new mix or equiv)
130  to preserve the structure of the source location.
131 
132 Since merging is done only among the temporary collection, any
133 of the above conditions breaks merging. Examples:
134 - The original seq-loc is a mix, containing two other mixes A and B,
135  which contain overlapping ranges. These ranges will not be merged,
136  since they originate from different complex locations.
137 - If the original seq-loc contains three ranges A, B and C, which are
138  mapped so that A' and C' overlap or abut, but B is discarded, then
139  A' and C' will not be merged. Depending on the flags, B may
140  also be included in the mapped location between A' and C' (see
141  KeepNonmappingRanges).
142 
143 TODO: Is the above behavior expected or should it be changed so that
144 merging can be done at least in some of the described cases?
145 
146 After mapping the destination seq-loc may be a simple interval or
147 a mix of sub-locations. This mix can be optimized when the mapping
148 is finished: null locations are removed (if no GapPreserve is set),
149 as well as empty mixes etc. Mixes with a single element are replaced
150 with this element. Mixes which contain only intervals are converted
151 to packed-ints.
152 
153 
154 /////////////////////////////////////////////////////////////////////
155 */
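
The merging rule described in the comment above (merge only within the temporary collection, and push the collection whenever merging is interrupted) can be sketched in isolation. This is a minimal illustration, not the toolkit's implementation: `MergingCollector`, `Range`, `Add` and `Flush` are hypothetical names, the pending collection is reduced to a single range, and merge flags are ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Inclusive [from, to] range; hypothetical stand-in for a mapped range.
// Coordinates are assumed to be well below the unsigned maximum.
using Range = std::pair<unsigned, unsigned>;

// Hypothetical collector: keeps one pending range and merges new ranges
// into it only while nothing forces a flush.
class MergingCollector {
public:
    // Add a mapped range. If it overlaps or abuts the pending range,
    // extend the pending range; otherwise push the pending range first.
    void Add(Range r) {
        if (has_pending_ && Touches(pending_, r)) {
            pending_.first  = std::min(pending_.first,  r.first);
            pending_.second = std::max(pending_.second, r.second);
            return;
        }
        Flush();
        pending_ = r;
        has_pending_ = true;
    }

    // Push the pending range to the result. In the terms of the comment
    // above, this models what happens when a source range is discarded
    // or a new complex location starts: later ranges can no longer merge
    // with earlier ones.
    void Flush() {
        if (has_pending_) {
            result_.push_back(pending_);
            has_pending_ = false;
        }
    }

    const std::vector<Range>& Result() const { return result_; }

private:
    // True if the two inclusive ranges overlap or are directly adjacent.
    static bool Touches(const Range& a, const Range& b) {
        return a.first <= b.second + 1 && b.first <= a.second + 1;
    }

    Range pending_{0, 0};
    bool has_pending_ = false;
    std::vector<Range> result_;
};
```

Adding two abutting ranges back to back yields one merged range, while calling `Flush()` between them (the "B is discarded" case) leaves them separate.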
156 
157 
158 /////////////////////////////////////////////////////////////////////
159 //
160 // CDefault_Mapper_Sequence_Info
161 //
162 // Default sequence type/length/synonyms provider - returns unknown type
163 // and length for any sequence, adds no synonyms except the original id.
164 
165 
167 {
168 public:
172  { return kInvalidSeqPos; }
173  virtual void CollectSynonyms(const CSeq_id_Handle& id,
174  TSynonyms& synonyms)
175  { synonyms.insert(id); }
176 };
177 
178 
179 /////////////////////////////////////////////////////////////////////
180 //
181 // CMappingRange
182 //
183 // Helper class for mapping points, ranges, strands and fuzzes
184 //
185 
186 
188  TSeqPos src_from,
189  TSeqPos src_length,
190  ENa_strand src_strand,
191  CSeq_id_Handle dst_id,
192  TSeqPos dst_from,
193  ENa_strand dst_strand,
194  bool ext_to,
195  int frame,
196  TSeqPos src_bioseq_len,
197  TSeqPos dst_len)
198  : m_Src_id_Handle(src_id),
199  m_Src_from(src_from),
200  m_Src_to(src_from + src_length - 1),
201  m_Src_strand(src_strand),
202  m_Dst_id_Handle(dst_id),
203  m_Dst_from(dst_from),
204  m_Dst_strand(dst_strand),
205  m_Reverse(!SameOrientation(src_strand, dst_strand)),
206  m_ExtTo(ext_to),
207  m_Frame(frame),
208  m_Src_bioseq_len(src_bioseq_len),
209  m_Dst_len(dst_len),
210  m_Group(0)
211 {
212  return;
213 }
214 
215 
217  TSeqPos to,
218  bool is_set_strand,
219  ENa_strand strand) const
220 {
221  // The callers set is_set_strand to true only if the mapper's
222  // fCheckStrand is enabled. Only in this case CanMap() checks
223  // if the location's strand is the same as the mapping's one.
224  if ( is_set_strand && (IsReverse(strand) != IsReverse(m_Src_strand)) ) {
225  return false;
226  }
227  return from <= m_Src_to && to >= m_Src_from;
228 }
229 
230 
232 {
233  _ASSERT(pos >= m_Src_from && pos <= m_Src_to);
234  if (!m_Reverse) {
235  return m_Dst_from + pos - m_Src_from;
236  }
237  else {
238  return m_Dst_from + m_Src_to - pos;
239  }
240 }
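
The overlap test at the end of CanMap() and the arithmetic in Map_Pos() can be tried out without the toolkit. The sketch below assumes a hypothetical `SimpleMapping` struct with plain `unsigned` coordinates in place of CMappingRange and TSeqPos; seq-ids, strands and fuzz are left out.

```cpp
#include <cassert>

// Hypothetical stand-in for CMappingRange: coordinates only.
struct SimpleMapping {
    unsigned src_from;  // first source position (inclusive)
    unsigned src_to;    // last source position (inclusive)
    unsigned dst_from;  // lowest destination position
    bool     reverse;   // true if source and destination orientations differ

    // Same interval-overlap test as the last line of CanMap().
    bool Overlaps(unsigned from, unsigned to) const {
        return from <= src_to && to >= src_from;
    }

    // Same arithmetic as Map_Pos(): offset from the source start when the
    // orientation is kept, offset from the source end when it is reversed.
    unsigned MapPos(unsigned pos) const {
        assert(pos >= src_from && pos <= src_to);
        return reverse ? dst_from + src_to - pos
                       : dst_from + pos - src_from;
    }
};
```

With a source range of 10..19 mapped to a destination starting at 100, position 10 maps to 100 in the forward case and to 109 in the reversed case.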
241 
242 
244  TSeqPos to,
245  const TRangeFuzz* fuzz) const
246 {
247  // Special case of mapping from a protein to a nucleotide through
248  // a partial cd-region. Extend the mapped interval to the end of
249  // destination range if all of the following conditions are true:
250  // - source is a protein (m_ExtTo)
251  // - destination is a nucleotide (m_ExtTo)
252  // - destination interval has partial "to" (m_ExtTo)
253  // - interval to be mapped has partial "to"
254  // - destination range is 1 or 2 bases beyond the end of the source range
255  const int frame_shift = ( (m_Frame > 1) ? (m_Frame - 1) : 0 );
256 
257  // If we're partial on the left and we're not at the beginning only because of
258  // frame shift, we shift back to the beginning when mapping.
259  // example accession: AJ237662.1
260  const bool partial_from = fuzz && fuzz->first && fuzz->first->IsLim() &&
261  ( fuzz->first->GetLim() == CInt_fuzz::eLim_lt || fuzz->first->GetLim() == CInt_fuzz::eLim_gt );
262  const bool partial_to = fuzz && fuzz->second && fuzz->second->IsLim() &&
263  ( fuzz->second->GetLim() == CInt_fuzz::eLim_lt || fuzz->second->GetLim() == CInt_fuzz::eLim_gt );
264 
265  from = max(from, m_Src_from);
266  to = min(to, m_Src_to);
267 
268  if (!m_Reverse) {
269  TRange ret(Map_Pos(from), Map_Pos(to));
270  // extend to beginning if necessary
271  // example accession that triggers this "if": AJ237662.1
272  if( (frame_shift > 0) && partial_from && (from == 0) && (m_Src_from == 0) ) {
273  if( m_Dst_from >= static_cast<TSeqPos>(frame_shift) ) {
274  ret.SetFrom( m_Dst_from - frame_shift );
275  } else {
276  ret.SetFrom( m_Dst_from );
277  }
278  }
279  // extend to the end, if necessary
280  if( m_Dst_len != kInvalidSeqPos ) {
281  const TSeqPos src_to_dst_end = m_Dst_from + (m_Src_to - m_Src_from);
282  const TSeqPos new_dst_end = m_Dst_from + m_Dst_len - 1;
283  if ( m_ExtTo && partial_to && to+1 == m_Src_bioseq_len ) {
284  if( ((int)new_dst_end - (int)src_to_dst_end) >= 0 && (new_dst_end - src_to_dst_end) <= 2 ) {
285  ret.SetTo( new_dst_end );
286  }
287  }
288  }
289  return ret;
290  }
291  else {
292  TRange ret(Map_Pos(to), Map_Pos(from));
293 
294  // extend to beginning if necessary (Note: reverse strand implies "beginning" is a higher number )
295  if( m_Dst_len != kInvalidSeqPos ) {
296  const TSeqPos new_dst_end = m_Dst_from + m_Dst_len - 1;
297  if ( (frame_shift > 0) && partial_from && (from == 0) && (m_Src_from == 0) ) {
298  ret.SetTo( new_dst_end + frame_shift );
299  }
300  }
301  // extend to the end, if necessary (Note: reverse strand implies "end" is a lower number )
302  // ( e.g. NZ_AAOJ01000043 )
303  if( m_ExtTo && partial_to && (to+1 == m_Src_bioseq_len) ) {
304  ret.SetFrom( m_Dst_from );
305  }
306 
307  return ret;
308  }
309 }
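
Leaving aside the frame-shift and partial-end extensions, the core of Map_Range() is: clip the input interval to the source range, map both ends, and swap them when the mapping is reversed. A minimal sketch under that simplification follows; `RangeMapping` and `MapRange` are hypothetical names and the coordinates are plain `unsigned` values.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Hypothetical stand-in for the fields of CMappingRange used here.
struct RangeMapping {
    unsigned src_from;  // first source position (inclusive)
    unsigned src_to;    // last source position (inclusive)
    unsigned dst_from;  // lowest destination position
    bool     reverse;   // true if orientations differ
};

// Clip [from, to] to the source range and map it to the destination.
// The caller is expected to have checked that the ranges overlap.
std::pair<unsigned, unsigned> MapRange(const RangeMapping& m,
                                       unsigned from, unsigned to) {
    assert(from <= m.src_to && to >= m.src_from);
    from = std::max(from, m.src_from);
    to   = std::min(to, m.src_to);
    if (!m.reverse) {
        return {m.dst_from + from - m.src_from,
                m.dst_from + to - m.src_from};
    }
    // On the reverse strand the mapped "to" becomes the lower coordinate.
    return {m.dst_from + m.src_to - to,
            m.dst_from + m.src_to - from};
}
```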
310 
311 
312 bool CMappingRange::Map_Strand(bool is_set_strand,
313  ENa_strand src,
314  ENa_strand* dst) const
315 {
316  _ASSERT(dst);
317  if ( m_Reverse ) {
318  // Always convert to reverse strand, even if the source
319  // strand is unknown.
320  *dst = Reverse(src);
321  return true;
322  }
323  if (is_set_strand) {
324  // Use original strand if set
325  *dst = src;
326  return true;
327  }
329  // Destination strand may be set for nucleotides
330  // even if the source one is not set.
331  *dst = m_Dst_strand;
332  return true;
333  }
334  return false; // Leave the mapped strand unset.
335 }
336 
337 
339 
341 {
342  // Recalculate fuzz of type lim to the reverse strand.
343  switch ( lim ) {
344  case CInt_fuzz::eLim_gt:
345  return CInt_fuzz::eLim_lt;
346  case CInt_fuzz::eLim_lt:
347  return CInt_fuzz::eLim_gt;
348  case CInt_fuzz::eLim_tr:
349  return CInt_fuzz::eLim_tl;
350  case CInt_fuzz::eLim_tl:
351  return CInt_fuzz::eLim_tr;
352  default:
353  return lim;
354  }
355 }
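
The limit reversal above is a simple swap of paired values. The sketch below reproduces it with a hypothetical `Lim` enum that mirrors only the values named in the switch; it is not the CInt_fuzz type itself.

```cpp
#include <cassert>

// Hypothetical enum covering the limit kinds handled above, plus two
// values that pass through unchanged.
enum class Lim { unk, gt, lt, tr, tl, other };

// Swap greater-than with less-than and "to the right" with "to the left";
// every other value is returned as is.
Lim ReverseLim(Lim lim) {
    switch (lim) {
    case Lim::gt: return Lim::lt;
    case Lim::lt: return Lim::gt;
    case Lim::tr: return Lim::tl;
    case Lim::tl: return Lim::tr;
    default:      return lim;
    }
}
```

Applying the function twice returns the original value, which is what one would expect from mapping to the reverse strand and back.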
356 
357 
359 {
360  if ( !fuzz ) return;
361  switch ( fuzz->Which() ) {
362  case CInt_fuzz::e_Lim:
363  {
364  // gt/lt are swapped when mapping to reverse strand.
365  if ( m_Reverse ) {
366  CRef<CInt_fuzz> oldFuzz = fuzz;
367  fuzz.Reset( new CInt_fuzz ); // careful: other TRangeFuzz's may map to the same TFuzz
368  fuzz->Assign( *oldFuzz );
369  fuzz->SetLim(x_ReverseFuzzLim(fuzz->GetLim()));
370  }
371  break;
372  }
373  case CInt_fuzz::e_Alt:
374  {
375  // Map each point to the destination sequence.
376  // Discard non-mappable points (???).
377  TFuzz mapped(new CInt_fuzz);
378  CInt_fuzz::TAlt& alt = mapped->SetAlt();
379  ITERATE(CInt_fuzz::TAlt, it, fuzz->GetAlt()) {
380  if ( CanMap(*it, *it, false, eNa_strand_unknown) ) {
381  alt.push_back(Map_Pos(*it));
382  }
383  }
384  if ( !alt.empty() ) {
385  fuzz = mapped;
386  }
387  else {
388  fuzz.Reset();
389  }
390  break;
391  }
392  case CInt_fuzz::e_Range:
393  {
394  // Map each range, truncate the ends if necessary.
395  // Discard unmappable ranges (???).
396  TRange rg(fuzz->GetRange().GetMin(), fuzz->GetRange().GetMax());
397  if ( CanMap(rg.GetFrom(), rg.GetTo(), false, eNa_strand_unknown) ) {
398  rg = Map_Range(rg.GetFrom(), rg.GetTo());
399  if ( !rg.Empty() ) {
400  CRef<CInt_fuzz> oldFuzz = fuzz;
401  fuzz.Reset( new CInt_fuzz ); // careful: other TRangeFuzz's may map to the same TFuzz
402  fuzz->Assign( *oldFuzz );
403  fuzz->SetRange().SetMin(rg.GetFrom());
404  fuzz->SetRange().SetMax(rg.GetTo());
405  }
406  }
407  else {
408  rg = TRange::GetEmpty();
409  }
410  if ( rg.Empty() ) {
411  fuzz.Reset();
412  }
413  break;
414  }
415  default:
416  // Other types are not converted
417  break;
418  }
419 }
420 
421 
423 {
424  // Maps fuzz if possible.
425  TRangeFuzz res = m_Reverse ? TRangeFuzz(fuzz.second, fuzz.first) : fuzz;
426  x_Map_Fuzz(res.first);
427  x_Map_Fuzz(res.second);
428  return res;
429 }
430 
431 
432 /////////////////////////////////////////////////////////////////////
433 //
434 // CMappingRanges
435 //
436 // Collection of mapping ranges
437 
438 
440  : m_ReverseSrc(false),
441  m_ReverseDst(false)
442 {
443 }
444 
445 
447 {
448  m_IdMap[cvt->m_Src_id_Handle].insert(TRangeMap::value_type(
449  TRange(cvt->m_Src_from, cvt->m_Src_to), cvt));
450 }
451 
452 
455  TSeqPos src_from,
456  TSeqPos src_length,
457  ENa_strand src_strand,
458  CSeq_id_Handle dst_id,
459  TSeqPos dst_from,
460  ENa_strand dst_strand,
461  bool ext_to,
462  int frame,
463  TSeqPos /*dst_total_len*/,
464  TSeqPos src_bioseq_len,
465  TSeqPos dst_len)
466 {
468  src_id, src_from, src_length, src_strand,
469  dst_id, dst_from, dst_strand,
470  ext_to, frame, src_bioseq_len, dst_len ));
471  AddConversion(cvt);
472  return cvt;
473 }
474 
475 
478  TSeqPos from,
479  TSeqPos to) const
480 {
481  // Get mappings iterator for the given id and range.
482  TIdMap::const_iterator ranges = m_IdMap.find(id);
483  if (ranges == m_IdMap.end()) {
484  return TRangeIterator();
485  }
486  return ranges->second.begin(TRange(from, to));
487 }
488 
489 
490 /////////////////////////////////////////////////////////////////////
491 //
492 // CSeq_loc_Mapper_Message
493 //
494 
495 
497  EDiagSev sev,
498  int err_code,
499  int sub_code)
500  : CMessage_Basic(msg, sev, err_code, sub_code),
501  m_ObjType(eNot_set),
502  m_Obj(null)
503 {
504 }
505 
506 
508 {
509 }
510 
511 
513 {
514  return new CSeq_loc_Mapper_Message(*this);
515 }
516 
517 
519 {
521  switch ( Which() ) {
523  cout << "NULL";
524  break;
526  cout << MSerial_AsnText << *GetLoc();
527  break;
529  cout << MSerial_AsnText << *GetFeat();
530  break;
532  cout << MSerial_AsnText << *GetAlign();
533  break;
535  cout << MSerial_AsnText << *GetGraph();
536  break;
537  }
538 }
539 
540 
542 {
544  CRef<CSeq_loc> ref(new CSeq_loc());
545  ref->Assign(loc);
546  m_Obj = ref;
547 }
548 
549 
551 {
552  return m_ObjType == eSeq_loc ?
553  dynamic_cast<const CSeq_loc*>(m_Obj.GetPointerOrNull()) : 0;
554 }
555 
556 
558 {
560  CRef<CSeq_feat> ref(new CSeq_feat());
561  ref->Assign(feat);
562  m_Obj = ref;
563 }
564 
565 
567 {
568  return m_ObjType == eSeq_feat ?
569  dynamic_cast<const CSeq_feat*>(m_Obj.GetPointerOrNull()) : 0;
570 }
571 
572 
574 {
576  CRef<CSeq_align> ref(new CSeq_align());
577  ref->Assign(align);
578  m_Obj = ref;
579 }
580 
581 
583 {
584  return m_ObjType == eSeq_align ?
585  dynamic_cast<const CSeq_align*>(m_Obj.GetPointerOrNull()) : 0;
586 }
587 
588 
590 {
592  CRef<CSeq_graph> ref(new CSeq_graph());
593  ref->Assign(graph);
594  m_Obj = ref;
595 }
596 
597 
599 {
600  return m_ObjType == eSeq_graph ?
601  dynamic_cast<const CSeq_graph*>(m_Obj.GetPointerOrNull()) : 0;
602 }
603 
604 
606 {
608  m_Obj.Reset();
609 }
610 
611 
612 /////////////////////////////////////////////////////////////////////////////
613 ///
614 /// CSeq_loc_Mapper_Options --
615 ///
616 
617 
619 {
620  if ( !m_SeqInfo ) {
622  }
623  return *m_SeqInfo;
624 }
625 
626 
627 /////////////////////////////////////////////////////////////////////
628 //
629 // CSeq_loc_Mapper_Base
630 //
631 
632 
633 /////////////////////////////////////////////////////////////////////
634 //
635 // Initialization of the mapper
636 //
637 
638 
639 // Helpers for converting strand to/from index.
640 // The index is used to access elements of a vector, grouping
641 // mapping ranges by strand.
642 inline
644 {
645  _ASSERT(idx != 0);
646  return ENa_strand(idx - 1);
647 }
648 
649 #define STRAND_TO_INDEX(is_set, strand) \
650  ((is_set) ? size_t((strand) + 1) : 0)
651 
652 #define INDEX_TO_STRAND(idx) \
653  s_IndexToStrand(idx)
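
The strand/index helpers reserve index 0 for "strand not set" and shift every set strand by one. A standalone sketch of that round trip, assuming a hypothetical three-value `Strand` enum (the real ENa_strand has more values):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical strand enum used only for this illustration.
enum Strand { kUnknown = 0, kPlus = 1, kMinus = 2 };

// Index 0 means "strand not set"; set strands occupy indexes 1 and up.
inline std::size_t StrandToIndex(bool is_set, Strand strand) {
    return is_set ? std::size_t(strand) + 1 : 0;
}

// Inverse of StrandToIndex for set strands.
inline Strand IndexToStrand(std::size_t idx) {
    assert(idx != 0);  // index 0 carries no strand value
    return Strand(idx - 1);
}
```

Note that an explicitly set unknown strand gets index 1, which keeps it distinct from an unset strand at index 0.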
654 
655 
657  : m_MergeFlag(eMergeNone),
658  m_GapFlag(eGapPreserve),
659  m_MiscFlags(fTrimSplicedSegs),
660  m_Partial(false),
661  m_LastTruncated(false),
662  m_Mappings(new CMappingRanges),
663  m_CurrentGroup(0),
664  m_FuzzOption(0),
665  m_MapOptions(options)
666 {
667 }
668 
669 
671  CSeq_loc_Mapper_Options options)
672  : m_MergeFlag(eMergeNone),
673  m_GapFlag(eGapPreserve),
674  m_MiscFlags(fTrimSplicedSegs),
675  m_Partial(false),
676  m_LastTruncated(false),
677  m_Mappings(mapping_ranges),
678  m_CurrentGroup(0),
679  m_FuzzOption(0),
680  m_MapOptions(options)
681 {
682 }
683 
684 
686  EFeatMapDirection dir,
687  CSeq_loc_Mapper_Options options)
688  : m_MergeFlag(eMergeNone),
689  m_GapFlag(eGapPreserve),
690  m_MiscFlags(fTrimSplicedSegs),
691  m_Partial(false),
692  m_LastTruncated(false),
693  m_Mappings(new CMappingRanges),
694  m_CurrentGroup(0),
695  m_FuzzOption(0),
696  m_MapOptions(options)
697 {
698  x_InitializeFeat(map_feat, dir);
699 }
700 
701 
703  const CSeq_loc& target,
704  CSeq_loc_Mapper_Options options)
705  : m_MergeFlag(eMergeNone),
706  m_GapFlag(eGapPreserve),
707  m_MiscFlags(fTrimSplicedSegs),
708  m_Partial(false),
709  m_LastTruncated(false),
710  m_Mappings(new CMappingRanges),
711  m_CurrentGroup(0),
712  m_FuzzOption(0),
713  m_MapOptions(options)
714 {
715  x_InitializeLocs(source, target);
716 }
717 
718 
720  const CSeq_id& to_id,
721  CSeq_loc_Mapper_Options options)
722  : m_MergeFlag(eMergeNone),
723  m_GapFlag(eGapPreserve),
724  m_MiscFlags(fTrimSplicedSegs),
725  m_Partial(false),
726  m_LastTruncated(false),
727  m_Mappings(new CMappingRanges),
728  m_CurrentGroup(0),
729  m_FuzzOption(0),
730  m_MapOptions(options)
731 {
732  x_InitializeAlign(map_align, to_id);
733 }
734 
735 
737  const CSeq_id& to_id,
738  const CSeq_align& map_align,
739  CSeq_loc_Mapper_Options options)
740  : m_MergeFlag(eMergeNone),
741  m_GapFlag(eGapPreserve),
742  m_MiscFlags(fTrimSplicedSegs),
743  m_Partial(false),
744  m_LastTruncated(false),
745  m_Mappings(new CMappingRanges),
746  m_CurrentGroup(0),
747  m_FuzzOption(0),
748  m_MapOptions(options)
749 {
750  x_InitializeAlign(map_align, to_id, &from_id);
751 }
752 
753 
755  const CSeq_id& to_id,
756  TMapOptions opts,
757  IMapper_Sequence_Info* seq_info)
758  : m_MergeFlag(eMergeNone),
759  m_GapFlag(eGapPreserve),
760  m_MiscFlags(fTrimSplicedSegs),
761  m_Partial(false),
762  m_LastTruncated(false),
763  m_Mappings(new CMappingRanges),
764  m_CurrentGroup(0),
765  m_FuzzOption(0),
766  m_MapOptions(CSeq_loc_Mapper_Options(seq_info, opts))
767 {
768  x_InitializeAlign(map_align, to_id);
769 }
770 
771 
773  size_t to_row,
774  CSeq_loc_Mapper_Options options)
775  : m_MergeFlag(eMergeNone),
776  m_GapFlag(eGapPreserve),
777  m_MiscFlags(fTrimSplicedSegs),
778  m_Partial(false),
779  m_LastTruncated(false),
780  m_Mappings(new CMappingRanges),
781  m_CurrentGroup(0),
782  m_FuzzOption(0),
783  m_MapOptions(options)
784 {
785  x_InitializeAlign(map_align, to_row);
786 }
787 
788 
790  size_t to_row,
791  const CSeq_align& map_align,
792  CSeq_loc_Mapper_Options options)
793  : m_MergeFlag(eMergeNone),
794  m_GapFlag(eGapPreserve),
795  m_MiscFlags(fTrimSplicedSegs),
796  m_Partial(false),
797  m_LastTruncated(false),
798  m_Mappings(new CMappingRanges),
799  m_CurrentGroup(0),
800  m_FuzzOption(0),
801  m_MapOptions(options)
802 {
803  x_InitializeAlign(map_align, to_row, from_row);
804 }
805 
806 
808  size_t to_row,
809  TMapOptions opts,
810  IMapper_Sequence_Info* seq_info)
811  : m_MergeFlag(eMergeNone),
812  m_GapFlag(eGapPreserve),
813  m_MiscFlags(fTrimSplicedSegs),
814  m_Partial(false),
815  m_LastTruncated(false),
816  m_Mappings(new CMappingRanges),
817  m_CurrentGroup(0),
818  m_FuzzOption(0),
819  m_MapOptions(seq_info, opts)
820 {
821  x_InitializeAlign(map_align, to_row);
822 }
823 
824 
826 {
827  return;
828 }
829 
831 {
832  m_FuzzOption = newOption;
833 }
834 
836  EFeatMapDirection dir)
837 {
838  // Make sure product is set
839  _ASSERT(map_feat.IsSetProduct());
840 
841  // Sometimes sequence types can be detected based on the feature type.
842  ESeqType loc_type = eSeq_unknown;
843  ESeqType prod_type = eSeq_unknown;
844  switch ( map_feat.GetData().Which() ) {
846  loc_type = eSeq_nuc; // Can gene features have product?
847  break;
849  loc_type = eSeq_nuc;
850  prod_type = eSeq_prot;
851  break;
853  loc_type = eSeq_prot; // Can protein features have product?
854  break;
855  case CSeqFeatData::e_Rna:
856  loc_type = eSeq_nuc;
857  prod_type = eSeq_nuc;
858  break;
859  /*
860  case e_Org:
861  case e_Pub:
862  case e_Seq:
863  case e_Imp:
864  case e_Region:
865  case e_Comment:
866  case e_Bond:
867  case e_Site:
868  case e_Rsite:
869  case e_User:
870  case e_Txinit:
871  case e_Num:
872  case e_Psec_str:
873  case e_Non_std_residue:
874  case e_Het:
875  case e_Biosrc:
876  case e_Clone:
877  */
878  default:
879  break;
880  }
881 
882  if (loc_type != eSeq_unknown) {
883  for (CSeq_loc_CI it(map_feat.GetLocation()); it; ++it) {
884  CSeq_id_Handle idh = it.GetSeq_id_Handle();
885  if (idh) {
886  SetSeqTypeById(idh, loc_type);
887  }
888  }
889  }
890  if (prod_type != eSeq_unknown) {
891  for (CSeq_loc_CI it(map_feat.GetProduct()); it; ++it) {
892  CSeq_id_Handle idh = it.GetSeq_id_Handle();
893  if (idh) {
894  SetSeqTypeById(idh, prod_type);
895  }
896  }
897  }
898 
899  int frame = 0;
900  if (map_feat.GetData().IsCdregion()) {
901  // For cd-regions use frame information.
902  frame = map_feat.GetData().GetCdregion().GetFrame();
903  }
904  if (dir == eLocationToProduct) {
905  x_InitializeLocs(map_feat.GetLocation(), map_feat.GetProduct(), frame, 0);
906  }
907  else {
908  x_InitializeLocs(map_feat.GetProduct(), map_feat.GetLocation(), 0, frame);
909  }
910 }
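
The type detection at the top of x_InitializeFeat() amounts to a small lookup from feature kind to a pair of sequence types. The sketch below is an assumption-laden illustration: three of the case labels are missing from this listing, so the association of the first three branches with gene, cd-region and protein features is inferred from the comments, and `FeatKind`, `SeqType` and `GuessTypes` are hypothetical names.

```cpp
#include <cassert>
#include <utility>

enum class SeqType { unknown, nuc, prot };

// Hypothetical feature kinds; "other" stands for every type that falls
// through to the default branch.
enum class FeatKind { gene, cdregion, prot, rna, other };

// Returns {location type, product type} for a feature kind.
std::pair<SeqType, SeqType> GuessTypes(FeatKind kind) {
    switch (kind) {
    case FeatKind::gene:     return {SeqType::nuc,  SeqType::unknown};
    case FeatKind::cdregion: return {SeqType::nuc,  SeqType::prot};
    case FeatKind::prot:     return {SeqType::prot, SeqType::unknown};
    case FeatKind::rna:      return {SeqType::nuc,  SeqType::nuc};
    default:                 return {SeqType::unknown, SeqType::unknown};
    }
}
```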
911 
912 
914  const CSeq_loc& target,
915  int src_frame,
916  int dst_frame)
917 {
918  if (source.IsEmpty() || target.IsEmpty()) {
919  // Ignore mapping from or to an empty location.
920  return;
921  }
922 
923  // There are several passes - we need to find out sequence types
924  // and lengths before creating the mappings.
925 
926  // First pass - collect sequence types (if possible) and
927  // calculate total length of each location.
928  TSeqPos src_total_len = 0; // total length of the source location
929  TSeqPos dst_total_len = 0; // total length of the destination
930  ESeqType src_type = eSeq_unknown; // source sequence type
931  ESeqType dst_type = eSeq_unknown; // destination sequence type
932  bool known_src_types = x_CheckSeqTypes(source, src_type, src_total_len);
933  bool known_dst_types = x_CheckSeqTypes(target, dst_type, dst_total_len);
934 
935  // Non-zero frame indicates genomic sequence in a nuc-to-prot alignment.
936  if (src_frame) {
937  if (src_type == eSeq_unknown) {
938  src_type = eSeq_nuc;
939  }
940  else if (src_type != eSeq_nuc) {
941  NCBI_THROW(CAnnotMapperException, eBadLocation,
942  "Frame can not be specified for a protein source location.");
943  }
944  if (!dst_frame) {
945  if (dst_type == eSeq_unknown) {
946  dst_type = eSeq_prot;
947  }
948  // If frame is not set it's probably a prot, but we are not enforcing this.
949  }
950  }
951  if (dst_frame) {
952  if (dst_type == eSeq_unknown) {
953  dst_type = eSeq_nuc;
954  }
955  else if (dst_type != eSeq_nuc) {
956  NCBI_THROW(CAnnotMapperException, eBadLocation,
957  "Frame can not be specified for a protein target location.");
958  }
959  if (!src_frame) {
960  if (src_type == eSeq_unknown) {
961  src_type = eSeq_prot;
962  }
963 // If frame is not set it's probably a prot, but we are not enforcing this.
964  }
965  }
966 
967  // Check if all sequence types are known and there are no conflicts.
968  bool known_types = known_src_types && known_dst_types;
969  if ( !known_types ) {
970  // some types are still unknown, try other methods
971  // First, if at least one sequence type is known, try to use it
972  // for the whole location.
973  // x_ForceSeqTypes will throw if there are different sequence types in
974  // the same location.
975  if (src_type == eSeq_unknown) {
976  src_type = x_ForceSeqTypes(source);
977  }
978  if (dst_type == eSeq_unknown) {
979  dst_type = x_ForceSeqTypes(target);
980  }
981  // If both source and destination types could be forced, don't
982  // check sequence lengths.
983  if (src_type == eSeq_unknown || dst_type == eSeq_unknown) {
984 // There are only unknown types in the source, destination
985 // or both. Try to compare lengths of the locations.
986  if (src_total_len == kInvalidSeqPos ||
987  dst_total_len == kInvalidSeqPos) {
988  // Location lengths are unknown (e.g. whole seq-locs).
989  // No way to create correct mappings.
990  NCBI_THROW(CAnnotMapperException, eBadLocation,
991  "Undefined location length -- "
992  "unable to detect sequence type");
993  }
994  if (src_total_len == dst_total_len) {
995  // If the lengths are the same, source and destination
996  // have the same sequence type. If at least one of them
997  // is known, use it for both.
998  if (src_type != eSeq_unknown) {
999  dst_type = src_type;
1000  }
1001  else if (dst_type != eSeq_unknown) {
1002  src_type = dst_type;
1003  }
1004  // By default we assume that both sequences are nucleotides.
1005  // Even if it's a mapping between two proteins, this assumption
1006  // should work fine in most cases.
1007  // The only exception is when we try to map an alignment and
1008  // while parsing it we detect that the mapping was between
1009  // prots. In this case CSeq_align_Mapper_Base will call
1010  // x_AdjustSeqTypesToProt() to change the types and adjust
1011  // ranges according to the new sequence width.
1012  }
1013  // While checking if it's a mapping between nuc and prot,
1014  // truncate incomplete or stop codons.
1015  // NOTE: It's safe to ignore frames here. If frames are set, sequence types
1016  // should have been already assigned.
1017  else if (src_total_len/3 == dst_total_len || src_total_len == (dst_total_len + 1)*3) {
1018  if (src_type == eSeq_unknown) {
1019  src_type = eSeq_nuc;
1020  }
1021  if (dst_type == eSeq_unknown) {
1022  dst_type = eSeq_prot;
1023  }
1024  // Make sure there's no conflict between the known and
1025  // the calculated sequence types.
1026  if (src_type != eSeq_nuc || dst_type != eSeq_prot) {
1027  NCBI_THROW(CAnnotMapperException, eBadLocation,
1028  "Sequence types (nuc to prot) are inconsistent with "
1029  "location lengths");
1030  }
1031  }
1032  else if (dst_total_len/3 == src_total_len || dst_total_len == (src_total_len + 1)*3) {
1033  if (src_type == eSeq_unknown) {
1034  src_type = eSeq_prot;
1035  }
1036  if (dst_type == eSeq_unknown) {
1037  dst_type = eSeq_nuc;
1038  }
1039  // Make sure there's no conflict between the known and
1040  // the calculated sequence types.
1041  if (src_type != eSeq_prot || dst_type != eSeq_nuc) {
1042  NCBI_THROW(CAnnotMapperException, eBadLocation,
1043  "Sequence types (prot to nuc) are inconsistent with "
1044  "location lengths");
1045  }
1046  }
1047  else {
1048  // If location lengths are not 1:1 or 1:3, there's no way
1049  // to get the right sequence types.
1050  NCBI_THROW(CAnnotMapperException, eBadLocation,
1051  "Wrong location length -- "
1052  "unable to detect sequence type");
1053  }
1054  }
1055  }
1056  // If both source and destination total length are known, check if
1057  // they match each other.
1058  // NOTE: The actual lengths may change later if trimming at sequence
1059  // length is enabled.
1060  bool multiseq_src = !source.GetId();
1061  bool multiseq_dst = !target.GetId();
1062  if (src_total_len != kInvalidSeqPos && dst_total_len != kInvalidSeqPos) {
1063  if ( src_frame ) src_total_len -= src_frame - 1;
1064  if ( dst_frame ) dst_total_len -= dst_frame - 1;
1065  if (src_type == eSeq_nuc && dst_type == eSeq_prot) {
1066  // Report length mismatch except a single stop codon on a single bioseq.
1067  if (src_total_len == (dst_total_len + 1)*3 && !multiseq_dst) {
1068  // Extend destination to include stop codon
1069  dst_total_len++;
1070  }
1071  // Report and drop overhanging bases if any
1072  else if (src_total_len/3 == dst_total_len && src_total_len % 3 != 0) {
1073  ERR_POST_X(28, Info <<
1074  "Source and destination lengths do not match, "
1075  "dropping " << src_total_len % 3 <<
1076  " overhanging bases on source location");
1077  }
1078  // Allow partial codon mismatch, report more than one codon.
1079  if (dst_total_len*3 >= src_total_len + 3 ||
1080  dst_total_len*3 + 3 <= src_total_len) {
1081  ERR_POST_X(31, Warning <<
1082  "Source and destination lengths do not match.");
1083  }
1084  }
1085  else if (dst_type == eSeq_nuc && src_type == eSeq_prot) {
1086  // Report length mismatch except a single stop codon on a single bioseq.
1087  if (dst_total_len == (src_total_len + 1)*3 && !multiseq_src) {
1088  // Extend stop codon.
1089  src_total_len++;
1090  }
1091  else if (dst_total_len/3 == src_total_len && dst_total_len % 3 != 0) {
1092  ERR_POST_X(28, Info <<
1093  "Source and destination lengths do not match, "
1094  "dropping " << dst_total_len % 3 <<
1095  " overhanging bases on destination location");
1096  }
1097  // Allow partial codon mismatch
1098  if (src_total_len*3 >= dst_total_len + 3 ||
1099  src_total_len*3 + 3 <= dst_total_len) {
1100  ERR_POST_X(31, Warning <<
1101  "Source and destination lengths do not match.");
1102  }
1103  }
1104  // Same sequence types
1105  else if (src_total_len != dst_total_len) {
1106  ERR_POST_X(31, Warning <<
1107  "Source and destination lengths do not match.");
1108  }
1109  }
1110 
1111  // At this point all sequence types should be known or forced.
1112  // Set the widths.
1113  int src_width = (src_type == eSeq_prot) ? 3 : 1;
1114  int dst_width = (dst_type == eSeq_prot) ? 3 : 1;
1115 
1118  CSeq_loc_CI dst_it(target, CSeq_loc_CI::eEmpty_Skip,
1120 
1121  // Get starts and lengths with care, check for empty and whole ranges.
1122  TRange rg = src_it.GetRange();
1123  // Start with an empty range
1124  TSeqPos src_start = kInvalidSeqPos;
1125  TSeqPos src_len = 0;
1126 // For whole ranges don't fetch their actual length since it is allowed
1127 // to be different from its genomic counterpart.
1128  if ( rg.IsWhole() ) {
1129  src_start = 0;
1130  // Use actual length if trimming is enabled or if the location
1131  // references multiple sequences.
1132  if ( multiseq_src || m_MapOptions.GetTrimMappedLocation() ) {
1133  src_len = GetSequenceLength(src_it.GetSeq_id());
1134  if (src_type == eSeq_prot) {
1135  src_len *= 3;
1136  }
1137  }
1138  else {
1139  src_len = kInvalidSeqPos;
1140  }
1141  }
1142  else if ( !rg.Empty() ) {
1143  src_start = src_it.GetRange().GetFrom()*src_width;
1144  src_len = x_GetRangeLength(src_it)*src_width;
1145  }
1146 
1147  rg = dst_it.GetRange();
1148  TSeqPos dst_start = kInvalidSeqPos;
1149  TSeqPos dst_len = 0;
1150  if ( rg.IsWhole() ) {
1151  dst_start = 0;
1152  // Use actual length if trimming is enabled or if the location
1153  // references multiple sequences.
1154  if ( multiseq_dst || m_MapOptions.GetTrimMappedLocation() ) {
1155  dst_len = GetSequenceLength(dst_it.GetSeq_id());
1156  if (dst_type == eSeq_prot) {
1157  dst_len *= 3;
1158  }
1159  }
1160  else {
1161  dst_len = kInvalidSeqPos;
1162  }
1163  }
1164  else if ( !rg.Empty() ) {
1165  dst_start = dst_it.GetRange().GetFrom()*dst_width;
1166  dst_len = x_GetRangeLength(dst_it)*dst_width;
1167  }
1168 
1169  if (src_frame && dst_type == eSeq_prot && src_start != kInvalidSeqPos &&
1170  static_cast<TSeqPos>(src_frame) <= src_len ) {
1171  if( !source.IsReverseStrand() ) {
1172  src_start += src_frame - 1;
1173  }
1174  src_len -= src_frame - 1;
1175  }
1176  if (dst_frame && src_type == eSeq_prot && dst_start != kInvalidSeqPos &&
1177  static_cast<TSeqPos>(dst_frame) <= dst_len ) {
1178  if( !target.IsReverseStrand() ) {
1179  dst_start += dst_frame - 1;
1180  }
1181  dst_len -= dst_frame - 1;
1182  }
1183  // Iterate source and destination ranges.
1184  TSeqPos src_bioseq_len = (source.GetId() ? GetSequenceLength( *source.GetId())
1185  : src_total_len);
1186  if (src_bioseq_len != kInvalidSeqPos) {
1187  src_bioseq_len = src_width*src_bioseq_len;
1188  }
1189  TSeqPos last_src_start = 0, last_src_len = 0;
1190  TSeqPos last_dst_start = 0, last_dst_len = 0;
1191  bool last_src_reverse = false, last_dst_reverse = false;
1192  CSeq_id_Handle last_src_id, last_dst_id;
1193  // The group number must be non-zero:
1194  // group zero can be used for alignment rows which failed to map.
1195  m_CurrentGroup++;
1196  while (src_it && dst_it) {
1197  // If sequence types were detected using lengths, set them now.
1198  if (src_type != eSeq_unknown) {
1199  SetSeqTypeById(src_it.GetSeq_id_Handle(), src_type);
1200  }
1201  if (dst_type != eSeq_unknown) {
1202  SetSeqTypeById(dst_it.GetSeq_id_Handle(), dst_type);
1203  }
1204  // Add new mapping range. This will adjust starts and lengths.
1205  if (last_src_id &&
1206  src_it.GetSeq_id_Handle() == last_src_id &&
1207  IsReverse(src_it.GetStrand()) == last_src_reverse) {
1208  if ( !last_src_reverse ) {
1209  if (last_src_start + last_src_len != src_start) {
1210  m_CurrentGroup++;
1211  }
1212  }
1213  else {
1214  if (src_start + src_len != last_src_start) {
1215  m_CurrentGroup++;
1216  }
1217  }
1218  }
1219  if (last_dst_id &&
1220  dst_it.GetSeq_id_Handle() == last_dst_id &&
1221  IsReverse(dst_it.GetStrand()) == last_dst_reverse) {
1222  if ( !last_dst_reverse ) {
1223  if (last_dst_start + last_dst_len != dst_start) {
1224  m_CurrentGroup++;
1225  }
1226  }
1227  else {
1228  if (dst_start + dst_len != last_dst_start) {
1229  m_CurrentGroup++;
1230  }
1231  }
1232  }
1233  last_src_start = src_start;
1234  last_src_len = src_len;
1235  last_dst_start = dst_start;
1236  last_dst_len = dst_len;
1237  x_NextMappingRange(
1238  src_it.GetSeq_id(), src_start, src_len, src_it.GetStrand(),
1239  dst_it.GetSeq_id(), dst_start, dst_len, dst_it.GetStrand(),
1240  dst_it.GetFuzzFrom(), dst_it.GetFuzzTo(),
1241  src_frame ? src_frame : dst_frame,
1242  src_bioseq_len);
1243  // Start new group on a gap in src or dst.
1244  // If the whole source or destination range was used, increment the
1245  // iterator.
1246  // This part may not work correctly if whole locations are
1247  // involved and lengths of the sequences can not be retrieved.
1248  // E.g. if the source contains 2 ranges and destination is a mix of
1249  // two whole locations (one per source range), dst_it will never be
1250  // incremented and both source ranges will be mapped to the same
1251  // sequence.
1252  last_src_id = src_it.GetSeq_id_Handle();
1253  last_src_reverse = IsReverse(src_it.GetStrand());
1254  if (src_len == 0 && ++src_it) {
1255  TRange r = src_it.GetRange();
1256  if ( r.Empty() ) {
1257  src_start = kInvalidSeqPos;
1258  src_len = 0;
1259  }
1260  else if ( r.IsWhole() ) {
1261  src_start = 0;
1262  if ( multiseq_src || m_MapOptions.GetTrimMappedLocation() ) {
1263  src_len = GetSequenceLength(src_it.GetSeq_id());
1264  if (src_type == eSeq_prot) {
1265  src_len *= 3;
1266  }
1267  }
1268  else {
1269  src_len = kInvalidSeqPos;
1270  }
1271  }
1272  else {
1273  src_start = src_it.GetRange().GetFrom()*src_width;
1274  src_len = x_GetRangeLength(src_it)*src_width;
1275  }
1276  if (last_src_id != src_it.GetSeq_id_Handle() ||
1277  last_src_reverse != IsReverse(src_it.GetStrand())) {
1278  m_CurrentGroup++;
1279  }
1280  }
1281  last_dst_id = dst_it.GetSeq_id_Handle();
1282  last_dst_reverse = IsReverse(dst_it.GetStrand());
1283  if (dst_len == 0 && ++dst_it) {
1284  TRange r = dst_it.GetRange();
1285  if ( r.Empty() ) {
1286  dst_start = kInvalidSeqPos;
1287  dst_len = 0;
1288  }
1289  else if ( r.IsWhole() ) {
1290  dst_start = 0;
1291  if ( multiseq_dst || m_MapOptions.GetTrimMappedLocation() ) {
1292  dst_len = GetSequenceLength(dst_it.GetSeq_id());
1293  if (dst_type == eSeq_prot) {
1294  dst_len *= 3;
1295  }
1296  }
1297  else {
1298  dst_len = kInvalidSeqPos;
1299  }
1300  }
1301  else {
1302  dst_start = dst_it.GetRange().GetFrom()*dst_width;
1303  dst_len = x_GetRangeLength(dst_it)*dst_width;
1304  }
1305  if (last_dst_id != dst_it.GetSeq_id_Handle() ||
1306  last_dst_reverse != IsReverse(dst_it.GetStrand())) {
1307  m_CurrentGroup++;
1308  }
1309  }
1310  }
1311  // Remember the direction of source and destination. This information
1312  // will be used when ordering ranges in the mapped location.
1313  m_Mappings->SetReverseSrc(source.IsReverseStrand());
1314  m_Mappings->SetReverseDst(target.IsReverseStrand());
1315 }
1316 
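// Illustrative sketch only, not toolkit code: the protein-vs-nucleotide length
// check above can be read as a four-way classification. The names LengthCheck
// and CheckProtVsNuc are hypothetical; lengths are plain integers (residues for
// the protein, bases for the nucleotide) rather than Seq-loc totals.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result codes mirroring the branches of the length check.
enum class LengthCheck {
    Ok,                 // lengths agree, or differ by less than one codon
    StopCodonExtended,  // nucleotide is exactly one codon longer (stop codon)
    OverhangDropped,    // 1-2 extra bases on the nucleotide side
    Mismatch            // lengths differ by a whole codon or more
};

LengthCheck CheckProtVsNuc(std::uint32_t prot_len,
                           std::uint32_t nuc_len,
                           bool multiseq_prot)
{
    // A single trailing stop codon is tolerated only on a single bioseq.
    if (nuc_len == (prot_len + 1) * 3 && !multiseq_prot) {
        return LengthCheck::StopCodonExtended;
    }
    // Extra bases that do not form a complete codon are dropped.
    if (nuc_len / 3 == prot_len && nuc_len % 3 != 0) {
        return LengthCheck::OverhangDropped;
    }
    // A difference of three bases or more, in either direction, is reported.
    if (prot_len * 3 >= nuc_len + 3 || prot_len * 3 + 3 <= nuc_len) {
        return LengthCheck::Mismatch;
    }
    return LengthCheck::Ok;
}
```

// For example, a 100-residue protein against 303 bases is treated as a stop
// codon extension, while the same pair on a multi-sequence location is a mismatch.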
1317 
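// Illustrative sketch only, not toolkit code: the loop above starts a new
// mapping group whenever two consecutive ranges on the same id and strand are
// not adjacent in biological order. The Range struct and StartsNewGroup are
// hypothetical names used to isolate that adjacency test.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for one range (start and length in the same units).
struct Range {
    std::uint32_t start;
    std::uint32_t len;
};

// Returns true when `next` does not directly continue `prev`.
bool StartsNewGroup(const Range& prev, const Range& next, bool reverse)
{
    if (!reverse) {
        // Plus strand: the next range must begin where the previous one ended.
        return prev.start + prev.len != next.start;
    }
    // Minus strand: the next range must end where the previous one began.
    return next.start + next.len != prev.start;
}
```

// On the minus strand the ranges are visited from high to low coordinates,
// which is why the comparison is mirrored rather than reused.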
1319  const TSynonyms& synonyms) const
1320 {
1322  ITERATE(TSynonyms, it, synonyms) {
1323  if (idh == *it) return true;
1324  }
1325  return false;
1326 }
1327 
1328 
1330  const CSeq_id& to_id,
1331  const CSeq_id* from_id)
1332 {
1334  unique_ptr<IMapper_Sequence_Info::TSynonyms> from_syn;
1335  CSeq_id_Handle to_idh = CSeq_id_Handle::GetHandle(to_id);
1336  CollectSynonyms(to_idh, to_syn);
1337  if ( from_id ) {
1338  CSeq_id_Handle from_idh = CSeq_id_Handle::GetHandle(*from_id);
1339  from_syn.reset(new IMapper_Sequence_Info::TSynonyms);
1340  CollectSynonyms(from_idh, *from_syn);
1341  }
1342  x_InitializeAlign(map_align, to_syn, from_syn.get());
1343 }
1344 
1345 const size_t kInvalidRow = size_t(-1);
1346 
1348  const TSynonyms& to_ids,
1349  const TSynonyms* from_ids)
1350 {
1351  // When finding the destination row, the first row with the required seq-id
1352  // is used. There is no check for multiple rows with the same id.
1353  switch ( map_align.GetSegs().Which() ) {
1354  case CSeq_align::C_Segs::e_Dendiag:
1355  {
1356  const TDendiag& diags = map_align.GetSegs().GetDendiag();
1357  ITERATE(TDendiag, diag_it, diags) {
1358  size_t to_row = kInvalidRow;
1359  size_t from_row = kInvalidRow;
1360  for (size_t i = 0; i < (*diag_it)->GetIds().size(); ++i) {
1361  if ( x_IsSynonym(*(*diag_it)->GetIds()[i], to_ids) ) {
1362  to_row = i;
1363  if (!from_ids || from_row != kInvalidRow) break;
1364  }
1365  if (from_ids && x_IsSynonym(*(*diag_it)->GetIds()[i], *from_ids)) {
1366  from_row = i;
1367  if (to_row != kInvalidRow) break;
1368  }
1369  }
1370  if (to_row == kInvalidRow) {
1371  NCBI_THROW(CAnnotMapperException, eBadAlignment,
1372  "Target ID not found in the alignment");
1373  }
1374  if (from_ids && from_row == kInvalidRow) {
1375  NCBI_THROW(CAnnotMapperException, eBadAlignment,
1376  "Source ID not found in the alignment");
1377  }
1378  // Each diag forms a separate group. See SetMergeBySeg().
1379  m_CurrentGroup++;
1380  x_InitAlign(**diag_it, to_row, from_row);
1381  }
1382  break;
1383  }
1384  case CSeq_align::C_Segs::e_Denseg:
1385  {
1386  const CDense_seg& dseg = map_align.GetSegs().GetDenseg();
1387  size_t to_row = kInvalidRow;
1388  size_t from_row = kInvalidRow;
1389  for (size_t i = 0; i < dseg.GetIds().size(); ++i) {
1390  if ( x_IsSynonym(*dseg.GetIds()[i], to_ids) ) {
1391  to_row = i;
1392  if (!from_ids || from_row != kInvalidRow) break;
1393  }
1394  if (from_ids && x_IsSynonym(*dseg.GetIds()[i], *from_ids)) {
1395  from_row = i;
1396  if (to_row != kInvalidRow) break;
1397  }
1398  }
1399  if (to_row == kInvalidRow) {
1400  NCBI_THROW(CAnnotMapperException, eBadAlignment,
1401  "Target ID not found in the alignment");
1402  }
1403  if (from_ids && from_row == kInvalidRow) {
1404  NCBI_THROW(CAnnotMapperException, eBadAlignment,
1405  "Source ID not found in the alignment");
1406  }
1407  x_InitAlign(dseg, to_row, from_row);
1408  break;
1409  }
1410  case CSeq_align::C_Segs::e_Std:
1411  {
1412  const TStd& std_segs = map_align.GetSegs().GetStd();
1413  ITERATE(TStd, std_seg, std_segs) {
1414  size_t to_row = kInvalidRow;
1415  if ((*std_seg)->IsSetIds() && !(*std_seg)->GetIds().empty()) {
1416  for (size_t i = 0; i < (*std_seg)->GetIds().size(); ++i) {
1417  if ( x_IsSynonym(*(*std_seg)->GetIds()[i], to_ids) ) {
1418  to_row = i;
1419  break;
1420  }
1421  }
1422  }
1423  if (to_row == kInvalidRow) {
1424  // The id is not found or 'ids' is missing in the std-seg.
1425  // Try to parse seq-locs.
1426  for (size_t i = 0; i < (*std_seg)->GetLoc().size(); ++i) {
1427  const CSeq_id* row_id = (*std_seg)->GetLoc()[i]->GetId();
1428  if (row_id && x_IsSynonym(*row_id, to_ids)) {
1429  to_row = i;
1430  break;
1431  }
1432  }
1433  }
1434  if (to_row == kInvalidRow) {
1435  NCBI_THROW(CAnnotMapperException, eBadAlignment,
1436  "Target ID not found in the alignment");
1437  }
1438  // Each std-seg forms a separate group. See SetMergeBySeg().
1439  m_CurrentGroup++;
1440  x_InitAlign(**std_seg, to_row);
1441  }
1442  break;
1443  }
1444  case CSeq_align::C_Segs::e_Packed:
1445  {
1446  const CPacked_seg& pseg = map_align.GetSegs().GetPacked();
1447  size_t to_row = kInvalidRow;
1448  size_t from_row = kInvalidRow;
1449  for (size_t i = 0; i < pseg.GetIds().size(); ++i) {
1450  if ( x_IsSynonym(*pseg.GetIds()[i], to_ids) ) {
1451  to_row = i;
1452  if (!from_ids || from_row != kInvalidRow) break;
1453  }
1454  if (from_ids && x_IsSynonym(*pseg.GetIds()[i], *from_ids)) {
1455  from_row = i;
1456  if (to_row != kInvalidRow) break;
1457  }
1458  }
1459  if (to_row == kInvalidRow) {
1460  NCBI_THROW(CAnnotMapperException, eBadAlignment,
1461  "Target ID not found in the alignment");
1462  }
1463  if (from_ids && from_row == kInvalidRow) {
1464  NCBI_THROW(CAnnotMapperException, eBadAlignment,
1465  "Source ID not found in the alignment");
1466  }
1467  x_InitAlign(pseg, to_row, from_row);
1468  break;
1469  }
1470  case CSeq_align::C_Segs::e_Disc:
1471  {
1472  const CSeq_align_set& aln_set = map_align.GetSegs().GetDisc();
1473  ITERATE(CSeq_align_set::Tdata, aln, aln_set.Get()) {
1474  // Each sub-alignment forms a separate group.
1475  // See SetMergeBySeg().
1476  m_CurrentGroup++;
1477  x_InitializeAlign(**aln, to_ids);
1478  }
1479  break;
1480  }
1481  case CSeq_align::C_Segs::e_Spliced:
1482  {
1483  x_InitSpliced(map_align.GetSegs().GetSpliced(), to_ids);
1484  break;
1485  }
1486  case CSeq_align::C_Segs::e_Sparse:
1487  {
1488  const CSparse_seg& sparse = map_align.GetSegs().GetSparse();
1489  size_t row = 0;
1490  ITERATE(CSparse_seg::TRows, it, sparse.GetRows()) {
1491  // Prefer to map from the second subrow to the first one
1492  // if their ids are the same.
1493  if ( x_IsSynonym((*it)->GetFirst_id(), to_ids) ) {
1495  }
1496  else if ( x_IsSynonym((*it)->GetSecond_id(), to_ids) ) {
1498  }
1499  x_InitSparse(sparse, row);
1500  }
1501  break;
1502  }
1503  default:
1504  NCBI_THROW(CAnnotMapperException, eBadAlignment,
1505  "Unsupported alignment type");
1506  }
1507 }
1508 
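// Illustrative sketch only, not toolkit code: the row search repeated above for
// dense-diag, dense-seg and packed-seg can be isolated as one helper. Seq-ids
// are modelled as plain strings and synonym sets as string vectors; RowPair,
// Contains, FindRows and kNoRow are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

const std::size_t kNoRow = static_cast<std::size_t>(-1);

struct RowPair {
    std::size_t to_row;
    std::size_t from_row;
};

// Simplified synonym test: exact string match against a list of ids.
bool Contains(const std::vector<std::string>& ids, const std::string& id)
{
    return std::find(ids.begin(), ids.end(), id) != ids.end();
}

// Scans the rows once, stopping as soon as every requested row is known.
// `from_ids` may be null, in which case only the target row is searched for.
RowPair FindRows(const std::vector<std::string>& row_ids,
                 const std::vector<std::string>& to_ids,
                 const std::vector<std::string>* from_ids)
{
    RowPair rows = { kNoRow, kNoRow };
    for (std::size_t i = 0; i < row_ids.size(); ++i) {
        if (Contains(to_ids, row_ids[i])) {
            rows.to_row = i;
            if (!from_ids || rows.from_row != kNoRow) break;
        }
        if (from_ids && Contains(*from_ids, row_ids[i])) {
            rows.from_row = i;
            if (rows.to_row != kNoRow) break;
        }
    }
    return rows;
}
```

// A caller would then treat to_row == kNoRow (or a missing from_row when
// from_ids was supplied) as an error, as the code above does with NCBI_THROW.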
1509 
1511  size_t to_row,
1512  size_t from_row)
1513 {
1514  switch ( map_align.GetSegs().Which() ) {
1515  case CSeq_align::C_Segs::e_Dendiag:
1516  {
1517  const TDendiag& diags = map_align.GetSegs().GetDendiag();
1518  ITERATE(TDendiag, diag_it, diags) {
1519  // Each diag forms a separate group. See SetMergeBySeg().
1520  m_CurrentGroup++;
1521  x_InitAlign(**diag_it, to_row, from_row);
1522  }
1523  break;
1524  }
1525  case CSeq_align::C_Segs::e_Denseg:
1526  {
1527  const CDense_seg& dseg = map_align.GetSegs().GetDenseg();
1528  x_InitAlign(dseg, to_row, from_row);
1529  break;
1530  }
1531  case CSeq_align::C_Segs::e_Std:
1532  {
1533  const TStd& std_segs = map_align.GetSegs().GetStd();
1534  ITERATE(TStd, std_seg, std_segs) {
1535  // Each std-seg forms a separate group. See SetMergeBySeg().
1536  m_CurrentGroup++;
1537  x_InitAlign(**std_seg, to_row);
1538  }
1539  break;
1540  }
1541  case CSeq_align::C_Segs::e_Packed:
1542  {
1543  const CPacked_seg& pseg = map_align.GetSegs().GetPacked();
1544  x_InitAlign(pseg, to_row, from_row);
1545  break;
1546  }
1547  case CSeq_align::C_Segs::e_Disc:
1548  {
1549  // Use the same row in each sub-alignment.
1550  const CSeq_align_set& aln_set = map_align.GetSegs().GetDisc();
1551  ITERATE(CSeq_align_set::Tdata, aln, aln_set.Get()) {
1552  // Each sub-alignment forms a separate group. See SetMergeBySeg().
1553  m_CurrentGroup++;
1554  x_InitializeAlign(**aln, to_row, from_row);
1555  }
1556  break;
1557  }
1558  case CSeq_align::C_Segs::e_Spliced:
1559  {
1560  // Spliced alignment row indexing is different, use enum
1561  // to avoid confusion.
1562  if (to_row == 0 || to_row == 1) {
1563  x_InitSpliced(map_align.GetSegs().GetSpliced(),
1564  ESplicedRow(to_row));
1565  }
1566  else {
1567  NCBI_THROW(CAnnotMapperException, eBadAlignment,
1568  "Invalid row number in spliced-seg alignment");
1569  }
1570  break;
1571  }
1572  case CSeq_align::C_Segs::e_Sparse:
1573  {
1574  x_InitSparse(map_align.GetSegs().GetSparse(), to_row);
1575  break;
1576  }
1577  default:
1578  NCBI_THROW(CAnnotMapperException, eBadAlignment,
1579  "Unsupported alignment type");
1580  }
1581 }
1582 
1583 
1585  size_t to_row,
1586  size_t from_row)
1587 {
1588  // Check the alignment for consistency. Adjust invalid values, show
1589  // warnings if this happens.
1590  size_t dim = diag.GetDim();
1591  _ASSERT(to_row < dim);
1592  if (dim != diag.GetIds().size()) {
1593  ERR_POST_X(1, Warning << "Invalid 'ids' size in dendiag");
1594  dim = min(dim, diag.GetIds().size());
1595  }
1596  if (dim != diag.GetStarts().size()) {
1597  ERR_POST_X(2, Warning << "Invalid 'starts' size in dendiag");
1598  dim = min(dim, diag.GetStarts().size());
1599  }
1600  bool have_strands = diag.IsSetStrands();
1601  if (have_strands && dim != diag.GetStrands().size()) {
1602  ERR_POST_X(3, Warning << "Invalid 'strands' size in dendiag");
1603  dim = min(dim, diag.GetStrands().size());
1604  }
1605 
1606  ENa_strand dst_strand = have_strands ?
1607  diag.GetStrands()[to_row] : eNa_strand_unknown;
1608  const CSeq_id& dst_id = *diag.GetIds()[to_row];
1609  ESeqType dst_type = GetSeqTypeById(dst_id);
1610  int dst_width = (dst_type == eSeq_prot) ? 3 : 1;
1611 
1612  // In alignments with multiple sequence types segment length
1613  // should be multiplied by 3, while starts multiplier depends
1614  // on the sequence type.
1615  int len_width = 1;
1616  for (size_t row = 0; row < dim; ++row) {
1617  if (GetSeqTypeById(*diag.GetIds()[row]) == eSeq_prot) {
1618  len_width = 3;
1619  break;
1620  }
1621  }
1622  for (size_t row = 0; row < dim; ++row) {
1623  if (row == to_row) {
1624  continue;
1625  }
1626  if (from_row != kInvalidRow && from_row != row) {
1627  continue;
1628  }
1629  const CSeq_id& src_id = *diag.GetIds()[row];
1630  ESeqType src_type = GetSeqTypeById(src_id);
1631  int src_width = (src_type == eSeq_prot) ? 3 : 1;
1632  TSeqPos src_len = diag.GetLen()*len_width;
1633  TSeqPos dst_len = src_len;
1634  TSeqPos src_start = diag.GetStarts()[row]*src_width;
1635  TSeqPos dst_start = diag.GetStarts()[to_row]*dst_width;
1636  ENa_strand src_strand = have_strands ?
1637  diag.GetStrands()[row] : eNa_strand_unknown;
1638  // Add mapping
1639  x_NextMappingRange(src_id, src_start, src_len, src_strand,
1640  dst_id, dst_start, dst_len, dst_strand, 0, 0);
1641  // Since the lengths are always the same, both source and
1642  // destination ranges must be used in one iteration.
1643  _ASSERT(!src_len && !dst_len);
1644  }
1645 }
1646 
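// Illustrative sketch only, not toolkit code: the two multipliers used above
// are easy to confuse. Starts are scaled by the width of their own row, while
// the shared segment length is scaled by 3 as soon as any row is a protein.
// Scaled and ScaleSegment are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct Scaled {
    std::uint32_t start;
    std::uint32_t len;
};

Scaled ScaleSegment(std::uint32_t start,
                    std::uint32_t len,
                    bool row_is_protein,
                    bool any_row_is_protein)
{
    // The start multiplier depends only on the type of this row.
    const std::uint32_t start_width = row_is_protein ? 3 : 1;
    // The length multiplier depends on the alignment as a whole.
    const std::uint32_t len_width = any_row_is_protein ? 3 : 1;
    Scaled result = { start * start_width, len * len_width };
    return result;
}
```

// In a mixed alignment a nucleotide row therefore keeps its start unchanged
// while its segment length is still multiplied by 3.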
1647 
1649  size_t to_row,
1650  size_t from_row)
1651 {
1652  // Check the alignment for consistency. Adjust invalid values, show
1653  // warnings if this happens.
1654  size_t dim = denseg.GetDim();
1655  _ASSERT(to_row < dim);
1656 
1657  size_t numseg = denseg.GetNumseg();
1658  // claimed dimension may not be accurate :-/
1659  if (numseg != denseg.GetLens().size()) {
1660  ERR_POST_X(4, Warning << "Invalid 'lens' size in denseg");
1661  numseg = min(numseg, denseg.GetLens().size());
1662  }
1663  if (dim != denseg.GetIds().size()) {
1664  ERR_POST_X(5, Warning << "Invalid 'ids' size in denseg");
1665  dim = min(dim, denseg.GetIds().size());
1666  }
1667  if (dim*numseg != denseg.GetStarts().size()) {
1668  ERR_POST_X(6, Warning << "Invalid 'starts' size in denseg");
1669  dim = min(dim*numseg, denseg.GetStarts().size()) / numseg;
1670  }
1671  bool have_strands = denseg.IsSetStrands();
1672  if (have_strands && dim*numseg != denseg.GetStrands().size()) {
1673  ERR_POST_X(7, Warning << "Invalid 'strands' size in denseg");
1674  dim = min(dim*numseg, denseg.GetStrands().size()) / numseg;
1675  }
1676 
1677  // In alignments with multiple sequence types segment length
1678  // should be multiplied by 3, while starts multiplier depends
1679  // on the sequence type.
1680  int len_width = 1;
1681  for (size_t row = 0; row < dim; ++row) {
1682  if (GetSeqTypeById(*denseg.GetIds()[row]) == eSeq_prot) {
1683  len_width = 3;
1684  break;
1685  }
1686  }
1687 
1688  const CSeq_id& dst_id = *denseg.GetIds()[to_row];
1689  ESeqType dst_type = GetSeqTypeById(dst_id);
1690  int dst_width = (dst_type == eSeq_prot) ? 3 : 1;
1691  for (size_t row = 0; row < dim; ++row) {
1692  if (row == to_row) {
1693  continue;
1694  }
1695  if (from_row != kInvalidRow && from_row != row) {
1696  continue;
1697  }
1698  const CSeq_id& src_id = *denseg.GetIds()[row];
1699 
1700  ESeqType src_type = GetSeqTypeById(src_id);
1701  int src_width = (src_type == eSeq_prot) ? 3 : 1;
1702 
1703  // Depending on the flags we may need to use whole range
1704  // for each dense-seg ignoring its segments.
1706  // Get total range for source and destination rows.
1707  // Both ranges must be non-empty.
1708  TSeqRange r_src
1709  = denseg.GetSeqRange(static_cast<CDense_seg::TDim>(row));
1710  TSeqRange r_dst
1711  = denseg.GetSeqRange(static_cast<CDense_seg::TDim>(to_row));
1712 
1713  _ASSERT(r_src.GetLength() != 0 && r_dst.GetLength() != 0);
1714  ENa_strand dst_strand = have_strands ?
1715  denseg.GetStrands()[to_row] : eNa_strand_unknown;
1716  ENa_strand src_strand = have_strands ?
1717  denseg.GetStrands()[row] : eNa_strand_unknown;
1718 
1719  // Dense-seg can not contain whole ranges, no need to check the ranges.
1720  TSeqPos src_len = r_src.GetLength()*len_width;
1721  TSeqPos dst_len = r_dst.GetLength()*len_width;
1722  TSeqPos src_start = r_src.GetFrom()*src_width;
1723  TSeqPos dst_start = r_dst.GetFrom()*dst_width;
1724 
1725  if (src_len != dst_len) {
1726  ERR_POST_X(23, Error <<
1727  "Genomic vs product length mismatch in dense-seg");
1728  }
1729  x_NextMappingRange(
1730  src_id, src_start, src_len, src_strand,
1731  dst_id, dst_start, dst_len, dst_strand,
1732  0, 0);
1733  // Since the lengths are always the same, both source and
1734  // destination ranges must be used in one iteration.
1735  if (src_len != 0 || dst_len != 0) {
1736  NCBI_THROW(CAnnotMapperException, eBadAlignment,
1737  "Different lengths of source and destination rows "
1738  "in dense-seg.");
1739  }
1740  } else {
1741  // Normal mode - use all segments instead of the total range.
1742  for (size_t seg = 0; seg < numseg; ++seg) {
1743  int i_src_start = denseg.GetStarts()[seg*dim + row];
1744  int i_dst_start = denseg.GetStarts()[seg*dim + to_row];
1745  if (i_src_start < 0 || i_dst_start < 0) {
1746  // Ignore gaps
1747  continue;
1748  }
1749 
1750  ENa_strand dst_strand = have_strands ?
1751  denseg.GetStrands()[seg*dim + to_row] : eNa_strand_unknown;
1752  ENa_strand src_strand = have_strands ?
1753  denseg.GetStrands()[seg*dim + row] : eNa_strand_unknown;
1754 
1755  TSeqPos src_len = denseg.GetLens()[seg]*len_width;
1756  TSeqPos dst_len = src_len;
1757  TSeqPos src_start = (TSeqPos)(i_src_start)*src_width;
1758  TSeqPos dst_start = (TSeqPos)(i_dst_start)*dst_width;
1759  x_NextMappingRange(src_id, src_start, src_len, src_strand,
1760  dst_id, dst_start, dst_len, dst_strand, 0, 0);
1761  // Since the lengths are always the same, both source and
1762  // destination ranges must be used in one iteration.
1763  _ASSERT(!src_len && !dst_len);
1764  }
1765  }
1766  }
1767 }
1768 
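// Illustrative sketch only, not toolkit code: in the per-segment mode above the
// dense-seg 'starts' array is read as a flat table with one entry per
// (segment, row) pair at index seg*dim + row, and a negative start marks a gap.
// SegPair and AlignedSegments are hypothetical names, and the sketch assumes
// starts.size() == dim * lens.size().

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct SegPair {
    int src_start;
    int dst_start;
    unsigned len;
};

std::vector<SegPair> AlignedSegments(const std::vector<int>& starts,
                                     const std::vector<unsigned>& lens,
                                     std::size_t dim,
                                     std::size_t src_row,
                                     std::size_t dst_row)
{
    std::vector<SegPair> result;
    for (std::size_t seg = 0; seg < lens.size(); ++seg) {
        const int src_start = starts[seg * dim + src_row];
        const int dst_start = starts[seg * dim + dst_row];
        if (src_start < 0 || dst_start < 0) {
            continue;  // a gap in either row: nothing to map in this segment
        }
        SegPair pair = { src_start, dst_start, lens[seg] };
        result.push_back(pair);
    }
    return result;
}
```

// Only segments present in both rows produce a mapping range; the gapped
// segment in the middle of an alignment is skipped entirely.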
1769 
1770 void CSeq_loc_Mapper_Base::x_InitAlign(const CStd_seg& sseg, size_t to_row)
1771 {
1772  // Check the alignment for consistency. Adjust invalid values, show
1773  // warnings if this happens.
1774  size_t dim = sseg.GetDim();
1775  if (dim != sseg.GetLoc().size()) {
1776  ERR_POST_X(8, Warning << "Invalid 'loc' size in std-seg");
1777  dim = min(dim, sseg.GetLoc().size());
1778  }
1779  if (sseg.IsSetIds()
1780  && dim != sseg.GetIds().size()) {
1781  ERR_POST_X(9, Warning << "Invalid 'ids' size in std-seg");
1782  dim = min(dim, sseg.GetIds().size());
1783  }
1784 
1785  const CSeq_loc& dst_loc = *sseg.GetLoc()[to_row];
1786  for (size_t row = 0; row < dim; ++row ) {
1787  if (row == to_row) {
1788  continue;
1789  }
1790  const CSeq_loc& src_loc = *sseg.GetLoc()[row];
1791  if ( src_loc.IsEmpty() ) {
1792  // skipped row in this segment
1793  continue;
1794  }
1795  // The mapping is just between two locations
1796  x_InitializeLocs(src_loc, dst_loc);
1797  }
1798 }
1799 
1800 
1802  size_t to_row,
1803  size_t from_row)
1804 {
1805  // Check the alignment for consistency. Adjust invalid values, show
1806  // warnings if this happens.
1807  size_t dim = pseg.GetDim();
1808  size_t numseg = pseg.GetNumseg();
1809  // claimed dimension may not be accurate :-/
1810  if (numseg != pseg.GetLens().size()) {
1811  ERR_POST_X(10, Warning << "Invalid 'lens' size in packed-seg");
1812  numseg = min(numseg, pseg.GetLens().size());
1813  }
1814  if (dim != pseg.GetIds().size()) {
1815  ERR_POST_X(11, Warning << "Invalid 'ids' size in packed-seg");
1816  dim = min(dim, pseg.GetIds().size());
1817  }
1818  if (dim*numseg != pseg.GetStarts().size()) {
1819  ERR_POST_X(12, Warning << "Invalid 'starts' size in packed-seg");
1820  dim = min(dim*numseg, pseg.GetStarts().size()) / numseg;
1821  }
1822  bool have_strands = pseg.IsSetStrands();
1823  if (have_strands && dim*numseg != pseg.GetStrands().size()) {
1824  ERR_POST_X(13, Warning << "Invalid 'strands' size in packed-seg");
1825  dim = min(dim*numseg, pseg.GetStrands().size()) / numseg;
1826  }
1827 
1828  // In alignments with multiple sequence types segment length
1829  // should be multiplied by 3, while starts multiplier depends
1830  // on the sequence type.
1831  int len_width = 1;
1832  for (size_t row = 0; row < dim; ++row) {
1833  if (GetSeqTypeById(*pseg.GetIds()[row]) == eSeq_prot) {
1834  len_width = 3;
1835  break;
1836  }
1837  }
1838 
1839  const CSeq_id& dst_id = *pseg.GetIds()[to_row];
1840  ESeqType dst_type = GetSeqTypeById(dst_id);
1841  int dst_width = (dst_type == eSeq_prot) ? 3 : 1;
1842 
1843  for (size_t row = 0; row < dim; ++row) {
1844  if (row == to_row) {
1845  continue;
1846  }
1847  if (from_row != kInvalidRow && from_row != row) {
1848  continue;
1849  }
1850  const CSeq_id& src_id = *pseg.GetIds()[row];
1851  ESeqType src_type = GetSeqTypeById(src_id);
1852  int src_width = (src_type == eSeq_prot) ? 3 : 1;
1853  for (size_t seg = 0; seg < numseg; ++seg) {
1854  if (!pseg.GetPresent()[seg*dim + row] ||
1855  !pseg.GetPresent()[seg*dim + to_row]) {
1856  // Ignore gaps
1857  continue;
1858  }
1859 
1860  ENa_strand dst_strand = have_strands ?
1861  pseg.GetStrands()[seg*dim + to_row] : eNa_strand_unknown;
1862  ENa_strand src_strand = have_strands ?
1863  pseg.GetStrands()[seg*dim + row] : eNa_strand_unknown;
1864 
1865  TSeqPos src_len = pseg.GetLens()[seg]*len_width;
1866  TSeqPos dst_len = src_len;
1867  TSeqPos src_start = pseg.GetStarts()[seg*dim + row]*src_width;
1868  TSeqPos dst_start = pseg.GetStarts()[seg*dim + to_row]*dst_width;
1869  x_NextMappingRange(
1870  src_id, src_start, src_len, src_strand,
1871  dst_id, dst_start, dst_len, dst_strand,
1872  0, 0);
1873  // Since the lengths are always the same, both source and
1874  // destination ranges must be used in one iteration.
1875  _ASSERT(!src_len && !dst_len);
1876  }
1877  }
1878 }
1879 
1880 
1882  const TSynonyms& to_ids)
1883 {
1884  // Assume the same seq-id can not be used in both genomic and product rows,
1885  // and try to find the correct row.
1886  if (spliced.IsSetGenomic_id() && x_IsSynonym(spliced.GetGenomic_id(), to_ids)) {
1887  x_InitSpliced(spliced, eSplicedRow_Gen);
1888  return;
1889  }
1890  if (spliced.IsSetProduct_id() && x_IsSynonym(spliced.GetProduct_id(), to_ids)) {
1891  x_InitSpliced(spliced, eSplicedRow_Prod);
1892  return;
1893  }
1894  // Global ids are not set or not equal to to_id, try to use per-exon ids.
1895  // Not sure if it's possible that per-exon ids are different from the
1896  // global ones, but if this happens let's just ignore the globals.
1897  // Another catch: the mapping destination will be the whole row rather
1898  // than only those exons, which contain the requested id.
1899  ITERATE(CSpliced_seg::TExons, it, spliced.GetExons()) {
1900  const CSpliced_exon& ex = **it;
1901  if (ex.IsSetGenomic_id() && x_IsSynonym(ex.GetGenomic_id(), to_ids)) {
1902  x_InitSpliced(spliced, eSplicedRow_Gen);
1903  return;
1904  }
1905  if (ex.IsSetProduct_id() && x_IsSynonym(ex.GetProduct_id(), to_ids)) {
1906  x_InitSpliced(spliced, eSplicedRow_Prod);
1907  return;
1908  }
1909  }
1910 }
1911 
1912 
1914 {
1915  // Helper function - return exon part length regardless of its type.
1916  switch ( part.Which() ) {
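1917  case CSpliced_exon_chunk::e_Match:
1918  return part.GetMatch();
1919  case CSpliced_exon_chunk::e_Mismatch:
1920  return part.GetMismatch();
1921  case CSpliced_exon_chunk::e_Diag:
1922  return part.GetDiag();
1923  case CSpliced_exon_chunk::e_Product_ins:
1924  return part.GetProduct_ins();
1925  case CSpliced_exon_chunk::e_Genomic_ins:
1926  return part.GetGenomic_ins();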
1927  default:
1928  ERR_POST_X(22, Warning << "Unsupported CSpliced_exon_chunk type: " <<
1929  part.SelectionName(part.Which()) << ", ignoring the chunk.");
1930  }
1931  return 0;
1932 }
1933 
1934 
1936 x_AddExonPartsMapping(TSeqPos& mapping_len,
1937  ESplicedRow to_row,
1938  const CSeq_id& gen_id,
1939  TSeqPos& gen_start,
1940  TSeqPos& gen_len,
1941  ENa_strand gen_strand,
1942  const CSeq_id& prod_id,
1943  TSeqPos& prod_start,
1944  TSeqPos& prod_len,
1945  ENa_strand prod_strand)
1946 {
1947  if (mapping_len == 0) return;
1948  bool rev_gen = IsReverse(gen_strand);
1949  bool rev_prod = IsReverse(prod_strand);
1950  TSeqPos pgen_len = mapping_len;
1951  TSeqPos pprod_len = mapping_len;
1952  // Calculate starts depending on the strand.
1953  TSeqPos pgen_start = rev_gen ?
1954  gen_start + gen_len - mapping_len : gen_start;
1955  TSeqPos pprod_start = rev_prod ?
1956  prod_start + prod_len - mapping_len : prod_start;
1957  // Create the mapping.
1958  if (to_row == eSplicedRow_Prod) {
1959  x_NextMappingRange(
1960  gen_id, pgen_start, pgen_len, gen_strand,
1961  prod_id, pprod_start, pprod_len, prod_strand,
1962  0, 0);
1963  }
1964  else {
1965  x_NextMappingRange(
1966  prod_id, pprod_start, pprod_len, prod_strand,
1967  gen_id, pgen_start, pgen_len, gen_strand,
1968  0, 0);
1969  }
1970  // Since the lengths are always the same, both source and
1971  // destination ranges must be used in one iteration.
1972  _ASSERT(pgen_len == 0 && pprod_len == 0);
1973  if ( !rev_gen ) {
1974  gen_start += mapping_len;
1975  }
1976  gen_len -= mapping_len;
1977  if ( !rev_prod ) {
1978  prod_start += mapping_len;
1979  }
1980  prod_len -= mapping_len;
1981  mapping_len = 0;
1982 }
1983 
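// Illustrative sketch only, not toolkit code: the start/length bookkeeping
// above takes a piece from the biological beginning of what remains of an exon.
// On the plus strand that is the low end and the start advances; on the minus
// strand it is the high end and only the length shrinks. Remaining and
// TakeFromBiologicalStart are hypothetical names; the sketch assumes
// take <= r.len.

```cpp
#include <cassert>
#include <cstdint>

struct Remaining {
    std::uint32_t start;  // lowest coordinate still unused
    std::uint32_t len;    // number of positions still unused
};

// Returns the lowest coordinate of the piece taken and shrinks `r`.
std::uint32_t TakeFromBiologicalStart(Remaining& r,
                                      std::uint32_t take,
                                      bool reverse)
{
    const std::uint32_t piece_start =
        reverse ? r.start + r.len - take : r.start;
    if (!reverse) {
        r.start += take;  // plus strand: consume from the low end
    }
    r.len -= take;        // both strands: less remains
    return piece_start;
}
```

// After the call the remaining interval is still described by (start, len) in
// sequence coordinates, whichever strand was used.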
1984 
1987  ESplicedRow to_row,
1988  const CSeq_id& gen_id,
1989  TSeqPos& gen_start,
1990  TSeqPos& gen_len,
1991  ENa_strand gen_strand,
1992  const CSeq_id& prod_id,
1993  TSeqPos& prod_start,
1994  TSeqPos& prod_len,
1995  ENa_strand prod_strand)
1996 {
1997  // Parse a single exon, create mapping for each part.
1998  bool rev_gen = IsReverse(gen_strand);
1999  bool rev_prod = IsReverse(prod_strand);
2000  // Merge parts participating in the mapping (match, mismatch, diag).
2001  // Calculate total length of the merged parts.
2002  TSeqPos mapping_len = 0;
2003  ITERATE(CSpliced_exon::TParts, it, parts) {
2004  const CSpliced_exon_chunk& part = **it;
2005  TSeqPos plen = sx_GetExonPartLength(part);
2006  // Only match, mismatch and diag are used for mapping.
2007  // Ignore insertions the same way as gaps in other alignment types.
2008  if ( part.IsMatch() || part.IsMismatch() || part.IsDiag() ) {
2009  mapping_len += plen;
2010  continue;
2011  }
2012  // Convert any collected ranges to a new mapping. Adjust starts and
2013  // lengths.
2014  x_AddExonPartsMapping(mapping_len, to_row,
2015  gen_id, gen_start, gen_len, gen_strand,
2016  prod_id, prod_start, prod_len, prod_strand);
2017  // Adjust starts and lengths to skip non-participating parts.
2018  if (!rev_gen && !part.IsProduct_ins()) {
2019  gen_start += plen;
2020  }
2021  if (!rev_prod && !part.IsGenomic_ins()) {
2022  prod_start += plen;
2023  }
2024  if ( !part.IsProduct_ins() ) {
2025  gen_len -= plen;
2026  }
2027  if ( !part.IsGenomic_ins() ) {
2028  prod_len -= plen;
2029  }
2030  }
2031  // Convert any remaining ranges to a new mapping. If mapping_len is zero,
2032  // nothing will be done.
2033  x_AddExonPartsMapping(mapping_len, to_row,
2034  gen_id, gen_start, gen_len, gen_strand,
2035  prod_id, prod_start, prod_len, prod_strand);
2036 }
2037 
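// Illustrative sketch only, not toolkit code: the exon walk above merges runs
// of match, mismatch and diag parts into one mapped block and uses insertions
// only to advance the coordinates. This simplified version handles plus-strand
// rows only; PartType, Part, Block and CollectBlocks are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class PartType { Match, Mismatch, Diag, ProductIns, GenomicIns };

struct Part {
    PartType type;
    std::uint32_t len;
};

struct Block {
    std::uint32_t gen_start;
    std::uint32_t prod_start;
    std::uint32_t len;
};

std::vector<Block> CollectBlocks(const std::vector<Part>& parts,
                                 std::uint32_t gen_start,
                                 std::uint32_t prod_start)
{
    std::vector<Block> blocks;
    std::uint32_t pending = 0;  // merged length not yet emitted
    auto flush = [&]() {
        if (pending == 0) return;
        Block b = { gen_start, prod_start, pending };
        blocks.push_back(b);
        gen_start += pending;
        prod_start += pending;
        pending = 0;
    };
    for (const Part& p : parts) {
        if (p.type == PartType::Match || p.type == PartType::Mismatch ||
            p.type == PartType::Diag) {
            pending += p.len;  // participates in the mapping
            continue;
        }
        flush();
        // An insertion exists on one row only, so only that row is advanced.
        if (p.type == PartType::GenomicIns) gen_start += p.len;
        if (p.type == PartType::ProductIns) prod_start += p.len;
    }
    flush();
    return blocks;
}
```

// Adjacent match and mismatch parts end up in a single block, so the number of
// mapping ranges depends on the number of insertions, not the number of parts.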
2038 
2040  ESplicedRow to_row)
2041 {
2042  // Use global strands and seq-ids for all exons where no explicit
2043  // values are set.
2044  bool have_gen_strand = spliced.IsSetGenomic_strand();
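2045  ENa_strand gen_strand = have_gen_strand ?
2046  spliced.GetGenomic_strand() : eNa_strand_unknown;
2047  bool have_prod_strand = spliced.IsSetProduct_strand();
2048  ENa_strand prod_strand = have_prod_strand ?
2049  spliced.GetProduct_strand() : eNa_strand_unknown;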
2050 
2051  const CSeq_id* gen_id = spliced.IsSetGenomic_id() ?
2052  &spliced.GetGenomic_id() : 0;
2053  const CSeq_id* prod_id = spliced.IsSetProduct_id() ?
2054  &spliced.GetProduct_id() : 0;
2055 
2056  bool prod_is_prot = false;
2057  // Spliced-seg already contains the information about sequence types.
2058  switch ( spliced.GetProduct_type() ) {
2059  case CSpliced_seg::eProduct_type_protein:
2060  prod_is_prot = true;
2061  break;
2062  case CSpliced_seg::eProduct_type_transcript:
2063  // Leave both widths = 1
2064  break;
2065  default:
2066  ERR_POST_X(14, Error << "Unknown product type in spliced-seg");
2067  return;
2068  }
2069 
2070  ITERATE(CSpliced_seg::TExons, it, spliced.GetExons()) {
2071  // Use new group for each exon.
2072  m_CurrentGroup++;
2073  const CSpliced_exon& ex = **it;
2074  const CSeq_id* ex_gen_id = ex.IsSetGenomic_id() ?
2075  &ex.GetGenomic_id() : gen_id;
2076  const CSeq_id* ex_prod_id = ex.IsSetProduct_id() ?
2077  &ex.GetProduct_id() : prod_id;
2078  if (!ex_gen_id || !ex_prod_id) {
2079  // No id is set globally or locally. Ignore the exon.
2080  ERR_POST_X(15, Error << "Missing id in spliced-exon");
2081  continue;
2082  }
2083  ENa_strand ex_gen_strand = ex.IsSetGenomic_strand() ?
2084  ex.GetGenomic_strand() : gen_strand;
2085  ENa_strand ex_prod_strand = ex.IsSetProduct_strand() ?
2086  ex.GetProduct_strand() : prod_strand;
2087  TSeqPos gen_from = ex.GetGenomic_start();
2088  TSeqPos gen_to = ex.GetGenomic_end();
2089  TSeqPos prod_from, prod_to;
2090  // Make sure coordinate types match product type.
2091  if (prod_is_prot != ex.GetProduct_start().IsProtpos()) {
2092  ERR_POST_X(24, Error <<
2093  "Wrong product-start type in spliced-exon, "
2094  "does not match product-type");
2095  }
2096  if (prod_is_prot != ex.GetProduct_end().IsProtpos()) {
2097  ERR_POST_X(25, Error <<
2098  "Wrong product-end type in spliced-exon, "
2099  "does not match product-type");
2100  }
2101  prod_from = ex.GetProduct_start().AsSeqPos();
2102  prod_to = ex.GetProduct_end().AsSeqPos();
2103 
2104  TSeqPos gen_len = gen_to - gen_from + 1;
2105  TSeqPos prod_len = prod_to - prod_from + 1;
2106  // Cache sequence type for the id.
2107  SetSeqTypeById(*ex_prod_id, prod_is_prot ? eSeq_prot : eSeq_nuc);
2108  SetSeqTypeById(*ex_gen_id, eSeq_nuc);
2109  if ( ex.IsSetParts() ) {
2110  // Iterate exon parts.
2111  x_IterateExonParts(ex.GetParts(), to_row,
2112  *ex_gen_id, gen_from, gen_len, ex_gen_strand,
2113  *ex_prod_id, prod_from, prod_len, ex_prod_strand);
2114  }
2115  else {
2116  // Use the whole exon if there are no parts.
2117  if ( to_row == eSplicedRow_Prod ) {
2119  *ex_gen_id, gen_from, gen_len, ex_gen_strand,
2120  *ex_prod_id, prod_from, prod_len, ex_prod_strand,
2121  0, 0);
2122  }
2123  else {
2125  *ex_prod_id, prod_from, prod_len, ex_prod_strand,
2126  *ex_gen_id, gen_from, gen_len, ex_gen_strand,
2127  0, 0);
2128  }
2129  }
2130  // Make sure the whole exon was used.
2131  if (gen_len || prod_len) {
2132  ERR_POST_X(17, Error <<
2133  "Genomic vs product length mismatch in spliced-exon");
2134  }
2135  }
2136 }
2137 
2138 
2140  size_t to_row)
2141 {
2142  // Sparse-seg needs special row indexing.
2143  bool to_second = m_MapOptions.GetAlign_Sparse_ToSecond();
2144 
2145  // Check the alignment for consistency. Adjust invalid values, show
2146  // warnings if this happens.
2147  _ASSERT(to_row < sparse.GetRows().size());
2148  const CSparse_align& row = *sparse.GetRows()[to_row];
2149 
2150  size_t numseg = row.GetNumseg();
2151  // claimed dimension may not be accurate :-/
2152  if (numseg != row.GetFirst_starts().size()) {
2153  ERR_POST_X(18, Warning <<
2154  "Invalid 'first-starts' size in sparse-align");
2155  numseg = min(numseg, row.GetFirst_starts().size());
2156  }
2157  if (numseg != row.GetSecond_starts().size()) {
2158  ERR_POST_X(19, Warning <<
2159  "Invalid 'second-starts' size in sparse-align");
2160  numseg = min(numseg, row.GetSecond_starts().size());
2161  }
2162  if (numseg != row.GetLens().size()) {
2163  ERR_POST_X(20, Warning << "Invalid 'lens' size in sparse-align");
2164  numseg = min(numseg, row.GetLens().size());
2165  }
2166  bool have_strands = row.IsSetSecond_strands();
2167  if (have_strands && numseg != row.GetSecond_strands().size()) {
2168  ERR_POST_X(21, Warning <<
2169  "Invalid 'second-strands' size in sparse-align");
2170  numseg = min(numseg, row.GetSecond_strands().size());
2171  }
2172 
2173  const CSeq_id& first_id = row.GetFirst_id();
2174  const CSeq_id& second_id = row.GetSecond_id();
2175 
2176  ESeqType first_type = GetSeqTypeById(first_id);
2177  ESeqType second_type = GetSeqTypeById(second_id);
2178  int first_width = (first_type == eSeq_prot) ? 3 : 1;
2179  int second_width = (second_type == eSeq_prot) ? 3 : 1;
2180  // In alignments that mix sequence types, segment lengths are
2181  // multiplied by 3, while the multiplier for starts depends on
2182  // each row's own sequence type.
2183  int len_width = (first_type == eSeq_prot || second_type == eSeq_prot) ?
2184  3 : 1;
2185  const CSparse_align::TFirst_starts& first_starts = row.GetFirst_starts();
2186  const CSparse_align::TSecond_starts& second_starts = row.GetSecond_starts();
2187  const CSparse_align::TLens& lens = row.GetLens();
2188  const CSparse_align::TSecond_strands& strands = row.GetSecond_strands();
2189 
2190  // Iterate segments, create mapping for each segment.
2191  for (size_t i = 0; i < numseg; i++) {
2192  TSeqPos first_start = first_starts[i]*first_width;
2193  TSeqPos second_start = second_starts[i]*second_width;
2194  TSeqPos first_len = lens[i]*len_width;
2195  TSeqPos second_len = first_len;
2196  ENa_strand strand = have_strands ? strands[i] : eNa_strand_unknown;
2197  if ( to_second ) {
2199  first_id, first_start, first_len, eNa_strand_unknown,
2200  second_id, second_start, second_len, strand,
2201  0, 0);
2202  }
2203  else {
2205  second_id, second_start, second_len, strand,
2206  first_id, first_start, first_len, eNa_strand_unknown,
2207  0, 0);
2208  }
2209  // Make sure the whole segment was used.
2210  _ASSERT(!first_len && !second_len);
2211  }
2212 }
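
// Illustration only, not part of the toolkit: a minimal sketch of the
// width rules applied to each sparse-align segment above. SeqType,
// ScaledSegment and ScaleSegment are hypothetical names introduced for
// this example; plain 32-bit integers stand in for TSeqPos.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the mapper's ESeqType.
enum class SeqType { unknown, nuc, prot };

// One segment after conversion to nucleotide-based units.
struct ScaledSegment {
    std::uint32_t first_start;
    std::uint32_t second_start;
    std::uint32_t len;
};

// Starts are scaled by the width of their own row (3 for protein,
// 1 otherwise). The length is scaled by 3 as soon as either row is a
// protein, mirroring len_width in the code above.
inline ScaledSegment ScaleSegment(SeqType first_type, SeqType second_type,
                                  std::uint32_t first_start,
                                  std::uint32_t second_start,
                                  std::uint32_t len)
{
    const std::uint32_t first_width  = (first_type  == SeqType::prot) ? 3 : 1;
    const std::uint32_t second_width = (second_type == SeqType::prot) ? 3 : 1;
    const std::uint32_t len_width =
        (first_type == SeqType::prot || second_type == SeqType::prot) ? 3 : 1;
    ScaledSegment seg;
    seg.first_start  = first_start * first_width;
    seg.second_start = second_start * second_width;
    seg.len          = len * len_width;
    return seg;
}
```

// For a protein row aligned to a nucleotide row, only the protein start
// and the shared length are multiplied by 3; the nucleotide start is
// left unchanged.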
2213 
2214 
2215 /////////////////////////////////////////////////////////////////////
2216 //
2217 // Initialization helpers
2218 //
2219 
2220 
2221 const CSeq_id_Handle&
2223 {
2224  TSynonymMap::const_iterator primary_it = m_SynonymMap.find(synonym);
2225  return primary_it != m_SynonymMap.end() ? primary_it->second : synonym;
2226 }
2227 
2228 
2231 {
2232  // NOTE: Mapping synonyms to the main id should be done by the caller.
2233  // Check cached types.
2235  if (found != m_SeqTypes.end()) {
2236  return found->second;
2237  }
2238  // New sequence - check the type and cache it.
2240  if (seqtype != eSeq_unknown) {
2241  // Cache sequence type for all synonyms if any
2242  SetSeqTypeById(idh, seqtype);
2243  }
2244  return seqtype;
2245 }
2246 
2247 
2249  TSynonyms& synonyms) const
2250 {
2251  m_MapOptions.GetSeqInfo().CollectSynonyms(id, synonyms);
2252  if ( synonyms.empty() ) {
2253  synonyms.insert(id);
2254  }
2255 }
2256 
2257 
2258 const CSeq_id_Handle&
2260 {
2262  if (primary_it != m_SynonymMap.end()) {
2263  return primary_it->second;
2264  }
2265  TSynonyms synonyms;
2266  m_MapOptions.GetSeqInfo().CollectSynonyms(id, synonyms);
2267  ITERATE(TSynonyms, syn, synonyms) {
2268  // If an id is already mapped, do not touch it.
2269  if (m_SynonymMap.find(*syn) != m_SynonymMap.end()) continue;
2270  m_SynonymMap[*syn] = id;
2271  // Add matching (e.g. versionless) synonyms.
2272  CConstRef<CSeq_id> syn_id = syn->GetSeqId();
2273  CSeq_id::TSeqIdHandles matches;
2274  syn_id->GetMatchingIds(matches);
2275  ITERATE(CSeq_id::TSeqIdHandles, mit, matches) {
2276  m_SynonymMap[*mit] = id;
2277  }
2278  }
2279  return id;
2280 }
2281 
2282 
2284 {
2287  if (it != m_LengthMap.end()) {
2288  return it->second;
2289  }
2291  m_LengthMap[idh] = len;
2292  return len;
2293 }
2294 
2295 
2297  ESeqType seqtype) const
2298 {
2299  // Do not store unknown types
2300  if (seqtype == eSeq_unknown) return;
2301  CSeq_id_Handle primary_id = CollectSynonyms(idh);
2302  TSeqTypeById::const_iterator it = m_SeqTypes.find(primary_id);
2303  if (it != m_SeqTypes.end()) {
2304  // If the type is already known and differs from the new one,
2305  // throw an exception.
2306  if (it->second != seqtype) {
2307  NCBI_THROW(CAnnotMapperException, eOtherError,
2308  "Attempt to modify a known sequence type.");
2309  }
2310  return;
2311  }
2312  m_SeqTypes[primary_id] = seqtype;
2313 }
2314 
2315 
2317  ESeqType& seqtype,
2318  TSeqPos& len)
2319 {
2320  // Iterate the seq-loc, try to get sequence types used in it.
2321  len = 0;
2322  seqtype = eSeq_unknown;
2323  bool found_type = false;
2324  bool ret = true; // return true if types are known for all parts.
2325  for (CSeq_loc_CI it(loc); it; ++it) {
2326  CSeq_id_Handle idh = it.GetSeq_id_Handle();
2327  if ( !idh ) continue; // NULL?
2328  ESeqType it_type = GetSeqTypeById(idh);
2329  // Reset ret to false if there are unknown types.
2330  ret = ret && it_type != eSeq_unknown;
2331  if (!found_type && it_type != eSeq_unknown) {
2332  seqtype = it_type;
2333  found_type = true;
2334  }
2335  else if (seqtype != it_type) {
2336  seqtype = eSeq_unknown; // Report multiple types as 'unknown'
2337  }
2338  // Adjust total length or reset it.
2339  // kInvalidSeqPos indicates at least some ranges of unknown length
2340  // have already been parsed. Once this happens, do not check other
2341  // lengths.
2342  if (len != kInvalidSeqPos) {
2343  if ( it.GetRange().IsWhole() ) {
2344  TSeqPos seq_len = GetSequenceLength(it.GetSeq_id());
2345  if (seq_len == kInvalidSeqPos) {
2346  // Unknown length - stop checking other lengths.
2347  len = kInvalidSeqPos;
2348  }
2349  else {
2350  len += seq_len;
2351  }
2352  }
2353  else {
2354  len += it.GetRange().GetLength();
2355  }
2356  }
2357  }
2358  return ret;
2359 }
2360 
2361 
2364 {
2365  // Try to find at least one known sequence type and use it for
2366  // all unknown parts.
2367  ESeqType ret = eSeq_unknown;
2368  set<CSeq_id_Handle> handles; // Collect all seq-ids used in the location
2369  for (CSeq_loc_CI it(loc); it; ++it) {
2370  CSeq_id_Handle idh = it.GetSeq_id_Handle();
2371  if ( !idh ) continue; // NULL?
2372  idh = CollectSynonyms(idh);
2374  if (st != m_SeqTypes.end() && st->second != eSeq_unknown) {
2375  // New sequence type could be detected.
2376  if (ret == eSeq_unknown) {
2377  ret = st->second; // Remember the type if not set yet.
2378  }
2379  else if (ret != st->second) {
2380  // There are different types in the location and some are
2381  // unknown - impossible to use this for mapping.
2382  NCBI_THROW(CAnnotMapperException, eBadLocation,
2383  "Unable to detect sequence types in the locations.");
2384  }
2385  }
2386  handles.insert(idh); // Store the new id
2387  }
2388  if (ret != eSeq_unknown) {
2389  // At least some types could be detected and there were no conflicts.
2390  // Use the found type for all other ranges.
2391  ITERATE(set<CSeq_id_Handle>, it, handles) {
2392  m_SeqTypes[*it] = ret;
2393  }
2394  }
2395  return ret; // Return the found type or unknown.
2396 }
2397 
2398 
2400 {
2401  // The function is used when seq-align mapper suddenly detects that
2402  // sequence types were incorrectly set to nuc during the initialization.
2403  // This is possible only when the mapping is from a protein to a protein,
2404  // the scope (or other source of type information) is not available, and
2405  // the seq-loc mapper decides the mapping is from nuc to nuc. The seq-align
2406  // mapper may deduce the real sequence types from the alignment to be
2407  // mapped. In this case it will ask the seq-loc mapper to adjust types.
2408  // We need to check a lot of conditions not to spoil the mapping data.
2409  bool have_id = false; // Is the id known to the mapper?
2410  bool have_known = false; // Are there known sequence types?
2411  CSeq_id_Handle primary_id = x_GetPrimaryId(idh);
2412  // Make sure all ids have unknown types (could not be detected during
2413  // the initialization).
2415  if (id_it->first == primary_id) {
2416  have_id = true;
2417  }
2418  if (GetSeqTypeById(id_it->first) != eSeq_unknown) {
2419  have_known = true;
2420  }
2421  }
2422  // The requested id is not used in the mappings - ignore the request.
2423  if ( !have_id ) return;
2424  if ( have_known ) {
2425  // Some sequence types are already known, we can not adjust anything.
2426  NCBI_THROW(CAnnotMapperException, eOtherError,
2427  "Can not adjust sequence types to protein.");
2428  }
2429  // Now we have to copy all the mappings adjusting their sequence types
2430  // and coordinates.
2431  CRef<CMappingRanges> old_mappings = m_Mappings;
2433  ITERATE(CMappingRanges::TIdMap, id_it, old_mappings->GetIdMap()) {
2434  SetSeqTypeById(id_it->first, eSeq_prot);
2435  // Adjust all starts and lengths
2436  ITERATE(CMappingRanges::TRangeMap, rg_it, id_it->second) {
2437  const CMappingRange& mrg = *rg_it->second;
2438  TSeqPos src_from = mrg.m_Src_from;
2439  if (src_from != kInvalidSeqPos) src_from *= 3;
2440  TSeqPos dst_from = mrg.m_Dst_from;
2441  if (dst_from != kInvalidSeqPos) dst_from *= 3;
2442  TSeqPos len = mrg.m_Src_to - mrg.m_Src_from + 1;
2443  if (len != kInvalidSeqPos) len *= 3;
2445  mrg.m_Src_id_Handle, src_from, len, mrg.m_Src_strand,
2446  mrg.m_Dst_id_Handle, dst_from, mrg.m_Dst_strand,
2447  mrg.m_ExtTo);
2448  new_rg->SetGroup(mrg.GetGroup());
2449  }
2450  }
2451  // Also update m_DstRanges. They must also use genomic coordinates.
2453  NON_CONST_ITERATE(TDstIdMap, id_it, *str_it) {
2454  NON_CONST_ITERATE(TDstRanges, rg_it, id_it->second) {
2455  TSeqPos from = kInvalidSeqPos;
2456  TSeqPos to = 0;
2457  if ( rg_it->IsWhole() ) {
2458  from = 0;
2459  to = kInvalidSeqPos;
2460  }
2461  else if ( !rg_it->Empty() ) {
2462  from = rg_it->GetFrom()*3;
2463  to = rg_it->GetToOpen()*3;
2464  }
2465  rg_it->SetOpen(from, to);
2466  }
2467  }
2468  }
2469 }
2470 
2471 
2473 {
2474  if (it.IsWhole() && IsReverse(it.GetStrand())) {
2475  // This should not happen since whole locations do not have strands.
2476  // For safety, though, a minus-strand location needs the real interval
2477  // length rather than "whole" to calculate mapping coordinates.
2478  // This can also fail. There are some additional checks in the
2479  // calling function (see x_InitializeLocs).
2480  return GetSequenceLength(it.GetSeq_id());
2481  }
2482  else {
2483  return it.GetRange().GetLength();
2484  }
2485 }
2486 
2487 
2489  TSeqPos& src_start,
2490  TSeqPos& src_len,
2491  ENa_strand src_strand,
2492  const CSeq_id& dst_id,
2493  TSeqPos& dst_start,
2494  TSeqPos& dst_len,
2495  ENa_strand dst_strand,
2496  const CInt_fuzz* fuzz_from,
2497  const CInt_fuzz* fuzz_to,
2498  int frame,
2499  TSeqPos src_bioseq_len )
2500 {
2501  TSeqPos cvt_src_start = src_start;
2502  TSeqPos cvt_dst_start = dst_start;
2503  TSeqPos cvt_length;
2504 
2505  const TSeqPos original_dst_len = dst_len;
2506 
2507  if (src_len == dst_len) {
2508  if (src_len == kInvalidSeqPos) {
2509  // Mapping whole to whole - try to get actual lengths.
2510  src_len = GetSequenceLength(src_id);
2511  if (src_len != kInvalidSeqPos) {
2512  src_len -= src_start;
2513  }
2514  dst_len = GetSequenceLength(dst_id);
2515  if (dst_len != kInvalidSeqPos) {
2516  dst_len -= dst_start;
2517  }
2518  // GetSequenceLength() could fail to get the length.
2519  // We can still try to initialize the mapper but with care.
2520  // If a location is whole, its start must be 0 and strand unknown.
2521  _ASSERT(src_len != kInvalidSeqPos ||
2522  (src_start == 0 && src_strand == eNa_strand_unknown));
2523  _ASSERT(dst_len != kInvalidSeqPos ||
2524  (dst_start == 0 && dst_strand == eNa_strand_unknown));
2525  }
2526  cvt_length = src_len;
2527  src_len = 0;
2528  dst_len = 0;
2529  }
2530  else if (src_len > dst_len) {
2531  // It is possible that the source location is whole. In this
2532  // case its strand must not be set.
2533  _ASSERT(src_len != kInvalidSeqPos || src_strand == eNa_strand_unknown);
2534  // Destination range is shorter - use it as a single interval,
2535  // adjust source range according to its strand.
2536  if (IsReverse(src_strand)) {
2537  cvt_src_start += src_len - dst_len;
2538  }
2539  else {
2540  src_start += dst_len;
2541  }
2542  cvt_length = dst_len;
2543  // Do not adjust length of a whole location.
2544  if (src_len != kInvalidSeqPos) {
2545  src_len -= cvt_length;
2546  }
2547  dst_len = 0; // Destination has been used completely.
2548  }
2549  else { // if (src_len < dst_len)
2550  // It is possible that the destination location is whole. In this
2551  // case its strand must not be set.
2552  _ASSERT(dst_len != kInvalidSeqPos || dst_strand == eNa_strand_unknown);
2553  // Source range is shorter - use it as a single interval,
2554  // adjust destination range according to its strand.
2555  if ( IsReverse(dst_strand) ) {
2556  cvt_dst_start += dst_len - src_len;
2557  }
2558  else {
2559  dst_start += src_len;
2560  }
2561  cvt_length = src_len;
2562  // Do not adjust length of a whole location.
2563  if (dst_len != kInvalidSeqPos) {
2564  dst_len -= cvt_length;
2565  }
2566  src_len = 0; // Source has been used completely.
2567  }
2568  // Special case: prepare to extend mapped "to" if:
2569  // - mapping is from prot to nuc
2570  // - destination "to" is partial.
2571  // See also CMappingRange::m_ExtTo
2572  bool ext_to = false;
2573  ESeqType src_type = GetSeqTypeById(src_id);
2574  ESeqType dst_type = GetSeqTypeById(dst_id);
2575  if (src_type == eSeq_prot && dst_type == eSeq_nuc) {
2576  if ( IsReverse(dst_strand) && fuzz_from ) {
2577  ext_to = fuzz_from &&
2578  fuzz_from->IsLim() &&
2579  fuzz_from->GetLim() == CInt_fuzz::eLim_lt;
2580  }
2581  else if ( !IsReverse(dst_strand) && fuzz_to ) {
2582  ext_to = fuzz_to &&
2583  fuzz_to->IsLim() &&
2584  fuzz_to->GetLim() == CInt_fuzz::eLim_gt;
2585  }
2586  }
2587  // Ready to add the conversion.
2588  x_AddConversion(src_id, cvt_src_start, src_strand,
2589  dst_id, cvt_dst_start, dst_strand, cvt_length, ext_to, frame,
2590  src_bioseq_len, original_dst_len);
2591 }
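
// Illustration only, not part of the toolkit: a minimal sketch of how the
// function above consumes the shorter of two ranges and leaves the rest
// for the next call. Chunk and TakeNextChunk are hypothetical names, and
// the sketch assumes both lengths are known (it does not model the
// kInvalidSeqPos "whole" case, fuzz, or frames).

```cpp
#include <cassert>
#include <cstdint>

// The piece that would be handed to the conversion step.
struct Chunk {
    std::uint32_t src_start;
    std::uint32_t dst_start;
    std::uint32_t length;
};

// Take the largest piece covered by both ranges. The longer range is
// advanced (plus strand) or shortened from its far end (minus strand),
// so the remaining part can be paired with the next range.
inline Chunk TakeNextChunk(std::uint32_t& src_start, std::uint32_t& src_len,
                           bool src_reverse,
                           std::uint32_t& dst_start, std::uint32_t& dst_len,
                           bool dst_reverse)
{
    Chunk c = { src_start, dst_start, 0 };
    if (src_len == dst_len) {
        c.length = src_len;
        src_len = 0;
        dst_len = 0;
    }
    else if (src_len > dst_len) {
        // Destination is shorter: use it whole, adjust the source.
        if (src_reverse) {
            c.src_start += src_len - dst_len;
        }
        else {
            src_start += dst_len;
        }
        c.length = dst_len;
        src_len -= c.length;
        dst_len = 0;
    }
    else {
        // Source is shorter: use it whole, adjust the destination.
        if (dst_reverse) {
            c.dst_start += dst_len - src_len;
        }
        else {
            dst_start += src_len;
        }
        c.length = src_len;
        dst_len -= c.length;
        src_len = 0;
    }
    return c;
}
```

// After one call at least one of the two lengths is zero; a caller would
// fetch the next range for that side and call again until both are used.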
2592 
2593 
2595  TSeqPos src_start,
2596  ENa_strand src_strand,
2597  const CSeq_id& dst_id,
2598  TSeqPos dst_start,
2599  ENa_strand dst_strand,
2600  TSeqPos length,
2601  bool ext_right,
2602  int frame,
2603  TSeqPos src_bioseq_len,
2604  TSeqPos dst_len)
2605 {
2606  // Make sure the destination ranges for the strand do exist.
2607  if (m_DstRanges.size() <= size_t(dst_strand)) {
2608  m_DstRanges.resize(size_t(dst_strand) + 1);
2609  }
2610  CSeq_id_Handle src_idh = CSeq_id_Handle::GetHandle(src_id);
2611  CSeq_id_Handle dst_idh = CSeq_id_Handle::GetHandle(dst_id);
2612  CSeq_id_Handle main_id = CollectSynonyms(src_idh);
2613 
2615  TSeqPos src_seq_len = GetSequenceLength(src_id);
2616  if (src_seq_len != kInvalidSeqPos && src_seq_len > 0) {
2617  ESeqType src_type = GetSeqType(src_idh);
2618  if (src_type == eSeq_prot) {
2619  src_seq_len *= 3;
2620  }
2621  if (length > src_seq_len - src_start) {
2622  TSeqPos trim = length - src_seq_len + src_start;
2623  if ( !SameOrientation(src_strand, dst_strand) ) {
2624  dst_start += trim;
2625  }
2626  length -= trim;
2627  }
2628  }
2629  TSeqPos dst_seq_len = GetSequenceLength(dst_id);
2630  if (dst_seq_len != kInvalidSeqPos && dst_seq_len > 0) {
2631  ESeqType dst_type = GetSeqType(dst_idh);
2632  if (dst_type == eSeq_prot) {
2633  dst_seq_len *= 3;
2634  }
2635  if (length > dst_seq_len - dst_start) {
2636  TSeqPos trim = length - dst_seq_len + dst_start;
2637  if ( !SameOrientation(src_strand, dst_strand) ) {
2638  src_start += trim;
2639  }
2640  length -= trim;
2641  if (dst_len != kInvalidSeqPos) {
2642  dst_len = dst_len > trim ? dst_len - trim : 0;
2643  }
2644  }
2645  }
2646  }
2647  if (length == 0) return;
2649  main_id, src_start, length, src_strand,
2650  dst_idh, dst_start, dst_strand,
2651  ext_right, frame, kInvalidSeqPos, src_bioseq_len, dst_len);
2652  if ( m_CurrentGroup ) {
2653  rg->SetGroup(m_CurrentGroup);
2654  }
2655  // Add destination range.
2656  m_DstRanges[size_t(dst_strand)][dst_idh]
2657  .push_back(TRange(dst_start, dst_start + length - 1));
2658 }
2659 
2660 
2662 {
2663  // Iterate destination ranges and create dummy mappings from
2664  // destination to destination so that ranges already on the
2665  // target sequence are not lost. This function is used only
2666  // when mapping between a sequence and its parts (through
2667  // a bioseq handle or a seq-map).
2668  for (size_t str_idx = 0; str_idx < m_DstRanges.size(); str_idx++) {
2669  NON_CONST_ITERATE(TDstIdMap, id_it, m_DstRanges[str_idx]) {
2670  CSeq_id_Handle main_id = CollectSynonyms(id_it->first);
2671  // Sort the ranges so that they can be merged.
2672  id_it->second.sort();
2673  TSeqPos dst_start = kInvalidSeqPos;
2674  TSeqPos dst_stop = kInvalidSeqPos;
2675  ESeqType dst_type = GetSeqTypeById(id_it->first);
2676  int dst_width = (dst_type == eSeq_prot) ? 3 : 1;
2677  ITERATE(TDstRanges, rg_it, id_it->second) {
2678  // Collect and merge ranges
2679  TSeqPos rg_start = kInvalidSeqPos;
2680  TSeqPos rg_stop = 0;
2681  if ( rg_it->IsWhole() ) {
2682  rg_start = 0;
2683  rg_stop = kInvalidSeqPos;
2684  }
2685  else if ( !rg_it->Empty() ) {
2686  rg_start = rg_it->GetFrom()*dst_width;
2687  rg_stop = rg_it->GetTo()*dst_width;
2688  }
2689  // The following will also be true if the first destination
2690  // range is empty. Ignore it anyway.
2691  if (dst_start == kInvalidSeqPos) {
2692  dst_start = rg_start;
2693  dst_stop = rg_stop;
2694  continue;
2695  }
2696  if (dst_stop != kInvalidSeqPos && rg_start <= dst_stop + 1) {
2697  // overlapping or abutting ranges, continue collecting
2698  dst_stop = max(dst_stop, rg_stop);
2699  continue;
2700  }
2701  // Separate ranges, add conversion and restart collecting
2703  main_id, dst_start,
2704  dst_stop == kInvalidSeqPos
2705  ? kInvalidSeqPos : dst_stop - dst_start + 1,
2706  ENa_strand(str_idx),
2707  id_it->first, dst_start, ENa_strand(str_idx));
2708  // Do we have the whole sequence already?
2709  if (dst_stop == kInvalidSeqPos) {
2710  // Prevent the range from being added one more time.
2711  dst_start = dst_stop;
2712  break;
2713  }
2714  // Proceed to the next range.
2715  dst_start = rg_start;
2716  dst_stop = rg_stop;
2717  }
2718  // Add any remaining range.
2719  if (dst_start < dst_stop) {
2721  main_id, dst_start,
2722  dst_stop == kInvalidSeqPos
2723  ? kInvalidSeqPos : dst_stop - dst_start + 1,
2724  ENa_strand(str_idx),
2725  id_it->first, dst_start, ENa_strand(str_idx));
2726  }
2727  }
2728  }
2729  m_DstRanges.clear();
2730 }
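
// Illustration only, not part of the toolkit: a minimal sketch of the
// "sort, then merge overlapping or abutting ranges" step used above.
// Range and MergeRanges are hypothetical names; ranges are closed
// [from, to] pairs, and the sketch assumes no range ends at the maximum
// 32-bit value (the "whole" sentinel handled separately in the real code).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

typedef std::pair<std::uint32_t, std::uint32_t> Range;  // closed [from, to]

// Sort the ranges by start, then fold each one into the previous result
// when it overlaps or directly abuts it (start <= previous stop + 1).
inline std::vector<Range> MergeRanges(std::vector<Range> ranges)
{
    std::sort(ranges.begin(), ranges.end());
    std::vector<Range> merged;
    for (std::size_t i = 0; i < ranges.size(); ++i) {
        const Range& rg = ranges[i];
        if (!merged.empty() && rg.first <= merged.back().second + 1) {
            merged.back().second = std::max(merged.back().second, rg.second);
        }
        else {
            merged.push_back(rg);
        }
    }
    return merged;
}
```

// Each merged range would then become one identity (destination to
// destination) mapping, so that locations already on the target are kept.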
2731 
2732 
2733 /////////////////////////////////////////////////////////////////////
2734 //
2735 // Mapping methods
2736 //
2737 
2739 {
2740  if( loc ) {
2741  CRef<CSeq_loc> new_loc( new CSeq_loc );
2742  bool is_first = true;
2743  const ESeqLocExtremes extreme = eExtreme_Biological;
2744 
2745  CSeq_loc_CI loc_iter( *loc, CSeq_loc_CI::eEmpty_Allow );
2746  for( ; loc_iter; ++loc_iter ) {
2747  CConstRef<CSeq_loc> loc_piece( loc_iter.GetRangeAsSeq_loc() );
2748 
2749  // remove nonsense (to C) fuzz like "range fuzz" from result
2750  loc_piece = x_FixNonsenseFuzz(loc_piece);
2751 
2752  if( loc_piece && ( loc_piece->IsPartialStart(extreme) || loc_piece->IsPartialStop(extreme) ) ) {
2753  const bool is_last = ( ++CSeq_loc_CI(loc_iter) == loc->end() );
2754 
2755  CRef<CSeq_loc> new_loc_piece( new CSeq_loc );
2756  new_loc_piece->Assign( *loc_piece );
2757 
2758  if( ! is_first ) {
2759  new_loc_piece->SetPartialStart( false, extreme ) ;
2760  }
2761  if( ! is_last ) {
2762  new_loc_piece->SetPartialStop( false, extreme );
2763  }
2764 
2765  new_loc->Add( *new_loc_piece );
2766  } else {
2767  new_loc->Add( *loc_piece );
2768  }
2769 
2770  is_first = false;
2771  }
2772 
2773  loc = new_loc;
2774  }
2775 }
2776 
2779  CConstRef<CSeq_loc> loc_piece ) const
2780 {
2781  switch( loc_piece->Which() ) {
2782  case CSeq_loc::e_Int:
2783  {
2784  const CSeq_interval &seq_int = loc_piece->GetInt();
2785 
2786  const bool from_fuzz_is_bad =
2787  ( seq_int.IsSetFuzz_from() &&
2788  ( seq_int.GetFuzz_from().IsRange() ||
2789  (seq_int.GetFuzz_from().IsLim() &&
2790  seq_int.GetFuzz_from().GetLim() == CInt_fuzz::eLim_gt ) ) );
2791  const bool to_fuzz_is_bad =
2792  ( seq_int.IsSetFuzz_to() &&
2793  ( seq_int.GetFuzz_to().IsRange() ||
2794  (seq_int.GetFuzz_to().IsLim() &&
2795  seq_int.GetFuzz_to().GetLim() == CInt_fuzz::eLim_lt ) ) );
2796 
2797  if( from_fuzz_is_bad || to_fuzz_is_bad ) {
2798  CRef<CSeq_loc> new_loc( new CSeq_loc );
2799  new_loc->Assign( *loc_piece );
2800 
2801  if( from_fuzz_is_bad ) {
2802  new_loc->SetInt().ResetFuzz_from();
2803  }
2804 
2805  if( to_fuzz_is_bad ) {
2806  new_loc->SetInt().ResetFuzz_to();
2807  }
2808 
2809  return new_loc;
2810  }
2811  }
2812  break;
2813  case CSeq_loc::e_Pnt:
2814  {
2815  const CSeq_point &pnt = loc_piece->GetPnt();
2816 
2817  const bool is_fuzz_range =
2818  ( pnt.IsSetFuzz() && pnt.GetFuzz().IsRange() );
2819  if( is_fuzz_range ) {
2820  CRef<CSeq_loc> new_loc( new CSeq_loc );
2821  new_loc->Assign( *loc_piece );
2822 
2823  new_loc->SetPnt().ResetFuzz();
2824 
2825  return new_loc;
2826  }
2827  }
2828  break;
2829  default:
2830  break;
2831  }
2832 
2833  // the vast majority of the time we should end up here
2834  return loc_piece;
2835 }
2836 
2837 // Check location type, optimize if possible (empty mix to NULL,
2838 // mix with a single element to this element etc.).
2840 {
2841  if ( !loc ) {
2842  loc.Reset(new CSeq_loc);
2843  loc->SetNull();
2844  return;
2845  }
2846  switch (loc->Which()) {
2847  case CSeq_loc::e_not_set:
2848  case CSeq_loc::e_Feat:
2849  case CSeq_loc::e_Null:
2850  case CSeq_loc::e_Empty:
2851  case CSeq_loc::e_Whole:
2852  case CSeq_loc::e_Int:
2853  case CSeq_loc::e_Pnt:
2854  case CSeq_loc::e_Equiv:
2855  case CSeq_loc::e_Bond:
2858  return;
2859  case CSeq_loc::e_Mix:
2860  {
2861  // remove final NULLs (optionally except one), if any
2862  {{
2863  CSeq_loc_mix::Tdata &mix_locs = loc->SetMix().Set();
2864  bool have_null = false;
2865  while (mix_locs.size() > 1 &&
2866  mix_locs.back()->IsNull())
2867  {
2868  have_null = true;
2869  mix_locs.pop_back();
2870  }
2871  // NULLs may indicate removed ranges, in this case preserve one NULL.
2872  if (GetNonMappingAsNull() && have_null &&
2873  mix_locs.size() > 0 && !mix_locs.back()->IsNull()) {
2874  CRef<CSeq_loc> null_loc(new CSeq_loc);
2875  null_loc->SetNull();
2876  mix_locs.push_back(null_loc);
2877  }
2878  }}
2879 
2880  switch ( loc->GetMix().Get().size() ) {
2881  case 0:
2882  // Empty mix - convert to Null.
2883  loc->SetNull();
2884  break;
2885  case 1:
2886  {
2887  // Mix with a single element - propagate it to the
2888  // top level.
2889  CRef<CSeq_loc> single = *loc->SetMix().Set().begin();
2890  loc = single;
2891  break;
2892  }
2893  default:
2894  {
2895  // Try to convert to packed-int
2896  CRef<CSeq_loc> packed;
2898  loc->SetMix().Set()) {
2899  // If there is something other than int, stop the
2900  // optimization and leave the mix as-is.
2901  if ( !(*it)->IsInt() ) {
2902  packed.Reset();
2903  break;
2904  }
2905  if ( !packed ) {
2906  packed.Reset(new CSeq_loc);
2907  }
2908  packed->SetPacked_int().Set().
2909  push_back(Ref(&(*it)->SetInt()));
2910  }
2911  if ( packed ) {
2912  loc = packed;
2913  }
2914  break;
2915  }
2916  }
2917  break;
2918  }
2919  default:
2920  NCBI_THROW(CAnnotMapperException, eBadLocation,
2921  "Unsupported location type");
2922  }
2923 }
2924 
2925 
2926 // Map a single range. Use mappings[cvt_idx] for mapping.
2927 // last_src_to indicates where the previous mapping ended (this may
2928 // be left or right end depending on the source strand).
2929 // For the first mapping last_src_to must be set to kInvalidSeqPos.
2931  bool is_set_strand,
2932  ENa_strand src_strand,
2933  const TRangeFuzz& src_fuzz,
2934  TSortedMappings& mappings,
2935  size_t cvt_idx,
2936  TSeqPos* last_src_to)
2937 {
2938  const CMappingRange& cvt = *mappings[cvt_idx];
2939  if ( !cvt.CanMap(src_rg.GetFrom(), src_rg.GetTo(),
2940  is_set_strand && x_IsSetMiscFlag(fCheckStrand), src_strand) ) {
2941  // Can not map the range through this mapping.
2942  return false;
2943  }
2944  // The source range should be already using genomic coords.
2945  TSeqPos left = src_rg.GetFrom();
2946  TSeqPos right = src_rg.GetTo();
2947  bool partial_left = false;
2948  bool partial_right = false;
2949  // Used source sub-range is required to adjust graph data.
2950  // The values are relative to the source range.
2951  TRange used_rg = (src_rg.IsWhole() || src_rg.Empty()) ? src_rg :
2952  TRange(0, src_rg.GetLength() - 1);
2953 
2954  bool reverse = IsReverse(src_strand);
2955 
2956  // Have to save trimmed parts for error reporting (see below).
2957  TRange trimmed_left, trimmed_right;
2958  // Check if the source range is truncated by the mapping.
2959  if (left < cvt.m_Src_from) {
2960  trimmed_left.SetOpen(left, cvt.m_Src_from);
2961  used_rg.SetFrom(cvt.m_Src_from - left);
2962  left = cvt.m_Src_from;
2963  if ( !reverse ) {
2964  // Partial if there's a gap between left and last_src_to.
2965  partial_left = (*last_src_to == kInvalidSeqPos) ||
2966  (left != *last_src_to + 1);
2967  }
2968  else {
2969  // Partial if there's a gap between left and the next cvt's right end.
2970  partial_left = (cvt_idx == mappings.size() - 1) ||
2971  (mappings[cvt_idx + 1]->m_Src_to + 1 != left);
2972  }
2973  }
2974  if (right > cvt.m_Src_to) {
2975  trimmed_right.Set(cvt.m_Src_to + 1, right);
2976  used_rg.SetLength(cvt.m_Src_to - left + 1);
2977  right = cvt.m_Src_to;
2978  if ( !reverse ) {
2979  // Partial if there's a gap between right and the next cvt's left end.
2980  partial_right = (cvt_idx == mappings.size() - 1) ||
2981  (mappings[cvt_idx + 1]->m_Src_from != right + 1);
2982  }
2983  else {
2984  // Partial if there's a gap between right and last_src_to.
2985  partial_right = (*last_src_to == kInvalidSeqPos) ||
2986  (right + 1 != *last_src_to);
2987  }
2988  }
2989  if ((partial_left || partial_right) && x_IsSetMiscFlag(fErrorOnPartial)) {
2990  string err_msg = "Unmapped sequence: " + cvt.m_Src_id_Handle.AsString();
2991  if ( partial_left ) {
2992  err_msg += " " + NStr::NumericToString(trimmed_left.GetFrom()) + ".." +
2993  NStr::NumericToString(trimmed_left.GetTo());
2994  }
2995  if ( partial_right ) {
2996  if ( partial_left ) err_msg += ",";
2997  err_msg += " " + NStr::NumericToString(trimmed_right.GetFrom()) + ".." +
2998  NStr::NumericToString(trimmed_right.GetTo());
2999  }
3000  err_msg += " not mapped to " + cvt.m_Dst_id_Handle.AsString();
3001  NCBI_THROW(CAnnotMapperException, eCanNotMap, err_msg);
3002  }
3003  if (right < left) {
3004  // Empty range - ignore it.
3005  return false;
3006  }
3007  // Adjust last mapped range end.
3008  *last_src_to = reverse ? left : right;
3009 
3010  TRangeFuzz fuzz;
3011 
3012  if( (m_FuzzOption & fFuzzOption_CStyle) == 0 ) {
3013  //// Indicate partial ranges using fuzz.
3014  if ( partial_left ) {
3015  // Set fuzz-from if a range was skipped on the left.
3016  fuzz.first.Reset(new CInt_fuzz);
3017  fuzz.first->SetLim(CInt_fuzz::eLim_lt);
3018  }
3019  else {
3020  if ( (!reverse && cvt_idx == 0) ||
3021  (reverse && cvt_idx == mappings.size() - 1) ) {
3022  // Preserve fuzz-from on the left end if any.
3023  fuzz.first = src_fuzz.first;
3024  }
3025  }
3026  if ( partial_right ) {
3027  // Set fuzz-to if a range will be skipped on the right.
3028  fuzz.second.Reset(new CInt_fuzz);
3029  fuzz.second->SetLim(CInt_fuzz::eLim_gt);
3030  }
3031  else {
3032  if ( (reverse && cvt_idx == 0) ||
3033  (!reverse && cvt_idx == mappings.size() - 1) ) {
3034  // Preserve fuzz-to on the right end if any.
3035  fuzz.second = src_fuzz.second;
3036  }
3037  }
3038  } else {
3039  fuzz = src_fuzz;
3040  }
3041  // If the previous range could not be mapped and was removed,
3042  // indicate it using fuzz.
3043  if ( !GetNonMappingAsNull() && m_LastTruncated ) {
3044  // TODO: Reconsider this "if" after we switch permanently to C++
3045  if ( ((m_FuzzOption & fFuzzOption_CStyle) == 0) && !fuzz.first ) {
3046  if( (m_FuzzOption & fFuzzOption_RemoveLimTlOrTr) != 0 ) {
3047  // we set lt or gt, as appropriate for strand
3048  if (reverse && !fuzz.second) {
3049  fuzz.second.Reset(new CInt_fuzz);
3050  fuzz.second->SetLim(CInt_fuzz::eLim_gt);
3051  }
3052  else if (!reverse && !fuzz.first) {
3053  fuzz.first.Reset(new CInt_fuzz);
3054  fuzz.first->SetLim(CInt_fuzz::eLim_lt);
3055  }
3056  } else {
3057  // Set fuzz for the original location.
3058  // This may be reversed later while mapping.
3059  if ( !reverse ) {
3060  fuzz.first.Reset(new CInt_fuzz);
3061  fuzz.first->SetLim(CInt_fuzz::eLim_tl);
3062  }
3063  else {
3064  fuzz.second.Reset(new CInt_fuzz);
3065  fuzz.second->SetLim(CInt_fuzz::eLim_tr);
3066  }
3067  }
3068  }
3069  // Reset the flag - current range is mapped at least partially.
3070  m_LastTruncated = false;
3071  }
3072 
3073  // Map fuzz to the destination. This will also adjust fuzz lim value
3074  // (just set by truncation) when strand is reversed by the mapping.
3075  TRangeFuzz mapped_fuzz = cvt.Map_Fuzz(fuzz);
3076 
3077  // Map the range and the strand. Fuzz is required to extend mapped
3078  // range in case of cd-region - see CMappingRange::m_ExtTo.
3079  TRange rg = cvt.Map_Range(left, right, &src_fuzz);
3080  ENa_strand dst_strand;
3081  bool is_set_dst_strand = cvt.Map_Strand(is_set_strand,
3082  src_strand, &dst_strand);
3083  // Store the new mapped range and its source.
3085  STRAND_TO_INDEX(is_set_dst_strand, dst_strand),
3086  rg, mapped_fuzz, cvt.m_Reverse, cvt.m_Group);
3088  STRAND_TO_INDEX(is_set_strand, src_strand),
3089  STRAND_TO_INDEX(is_set_dst_strand, dst_strand),
3090  TRange(left, right), cvt.m_Reverse);
3091  // If mapping a graph, store the information required to adjust its data.
3092  if ( m_GraphRanges && !used_rg.Empty() ) {
3093  m_GraphRanges->AddRange(used_rg);
3094  if ( !src_rg.IsWhole() ) {
3095  m_GraphRanges->IncOffset(src_rg.GetLength());
3096  }
3097  }
3098  return true;
3099 }
3100 
3101 
3103 {
3104  // The flag indicates if the last range could not be mapped
3105  // or preserved and was dropped.
3107  return;
3108  }
3109  m_LastTruncated = true;
3110  if ( GetNonMappingAsNull() ) {
3111  // Replace the original range with NULL.
3112  x_PushNullLoc();
3113  return;
3114  }
3115  // Update the mapped location before checking its properties.
3117  // If the mapped location does not have any fuzz set, set it to
3118  // indicate the truncated part.
3122  }
3123  else {
3124  // HACK: Using SetPartialStop() instead of SetTruncatedStop() to set fuzz to lim-gt
3125  // rather than lim-tr.
3127  }
3128  }
3129 }
3130 
3131 
3132 // Map a single interval. Return true if the range could be mapped
3133 // at least partially.
3135  TRange src_rg,
3136  bool is_set_strand,
3137  ENa_strand src_strand,
3138  TRangeFuzz orig_fuzz)
3139 {
3140  bool res = false;
3142  ESeqType src_type = GetSeqTypeById(src_idh);
3143  if (src_type == eSeq_prot && !(src_rg.IsWhole() || src_rg.Empty()) ) {
3144  src_rg = TRange(src_rg.GetFrom()*3, src_rg.GetTo()*3 + 2);
3145  }
3146  else if (m_GraphRanges && src_type == eSeq_unknown) {
3147  // Unknown sequence type, don't know how much of the graph
3148  // data to skip.
3149  ERR_POST_X(26, Warning <<
3150  "Unknown sequence type in the source location, "
3151  "mapped graph data may be incorrect.");
3152  }
3153 
3154  // Collect mappings which can be used to map the range.
3155  TSortedMappings mappings;
3157  src_idh, src_rg.GetFrom(), src_rg.GetTo());
3158  for ( ; rg_it; ++rg_it) {
3159  mappings.push_back(rg_it->second);
3160  }
3161  // Sort the mappings depending on the original location strand.
3162  if ( IsReverse(src_strand) ) {
3163  sort(mappings.begin(), mappings.end(), CMappingRangeRef_LessRev());
3164  }
3165  else {
3166  sort(mappings.begin(), mappings.end(), CMappingRangeRef_Less());
3167  }
3168 
3169  // special adjustment (e.g. GU561555)
3170  // This should very *rarely* be needed
3171  if( ! m_Mappings.Empty() ) {
3172  // get first mapping
3173  TRangeIterator r_it = m_Mappings->BeginMappingRanges(src_idh, 0, 1);
3174  if( r_it && r_it->second ) {
3175  const CMappingRange &mapping = *r_it->second;
3176  // try to detect if we hit the case where we couldn't do a frame-shift
3177  if( ! mapping.m_Reverse && mapping.m_Frame > 1 && mapping.m_Dst_from == 0 &&
3178  mapping.m_Dst_len <= static_cast<TSeqPos>(mapping.m_Frame - 1) )
3179  {
3180  const int shift = ( mappings[0]->m_Frame - 1 );
3181  if( src_rg.GetFrom() != 0 ) {
3182  src_rg.SetFrom( src_rg.GetFrom() + shift );
3183  }
3184  src_rg.SetTo( src_rg.GetTo() + shift);
3185  }
3186  }
3187  }
3188 
3189  // The last mapped position (in biological order). Required to check
3190  // if some part of the source location did not match any mapping range
3191  // and was dropped.
3192  TSeqPos last_src_to = kInvalidSeqPos;
3193  // Save offset from the graph start to restore it later.
3194  TSeqPos graph_offset = m_GraphRanges ? m_GraphRanges->GetOffset() : 0;
3195  // Map through each mapping. If some part of the original range matches
3196  // several mappings, it will be mapped several times.
3197  for (size_t idx = 0; idx < mappings.size(); ++idx) {
3198  if ( x_MapNextRange(src_rg,
3199  is_set_strand, src_strand,
3200  orig_fuzz,
3201  mappings, idx,
3202  &last_src_to) ) {
3203  res = true;
3204  }
3205  // Mapping can adjust graph offset, but while mapping the same
3206  // source range we need to preserve it.
3207  if ( m_GraphRanges ) {
3208  m_GraphRanges->SetOffset(graph_offset);
3209  }
3210  }
3211  // If nothing could be mapped, set 'truncated' flag.
3212  if ( !res ) {
3214  }
3215  // Now it's ok to adjust graph offset.
3216  if ( m_GraphRanges ) {
3217  if ( !src_rg.IsWhole() ) {
3218  m_GraphRanges->IncOffset(src_rg.GetLength());
3219  }
3220  else {
3221  ERR_POST_X(27, Warning <<
3222  "Unknown sequence length in the source whole location, "
3223  "mapped graph data may be incorrect.");
3224  }
3225  }
3226  return res;
3227 }
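// Illustration only, not part of the toolkit: a minimal sketch of the
// protein-to-nucleotide coordinate expansion that x_MapInterval applies
// to protein source ranges (from*3, to*3 + 2), and of the integer
// division by 3 that x_RangeToSeq_loc uses to convert back. The names
// Pos, Span, ProtToNuc and NucToProt are hypothetical stand-ins; the
// real code works on TSeqPos and TRange.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the toolkit's TSeqPos and TRange.
using Pos = std::uint32_t;
struct Span { Pos from; Pos to; };  // closed interval [from, to]

// Residue i covers bases 3*i .. 3*i + 2, so a closed protein range
// [from, to] expands to [from*3, to*3 + 2].
Span ProtToNuc(Span prot)
{
    assert(prot.from <= prot.to);
    return { prot.from * 3, prot.to * 3 + 2 };
}

// Integer division discards the frame (position within the codon).
Span NucToProt(Span nuc)
{
    return { nuc.from / 3, nuc.to / 3 };
}
```

// Under these assumptions, residues 2..4 expand to bases 6..14, and
// converting 6..14 back yields 2..4 again.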
3228 
3229 
3231 {
3233  // Copy fuzz from the original interval.
3234  if ( si.IsSetFuzz_from() ) {
3235  fuzz.first.Reset(new CInt_fuzz);
3236  fuzz.first->Assign(si.GetFuzz_from());
3237  }
3238  if ( si.IsSetFuzz_to() ) {
3239  fuzz.second.Reset(new CInt_fuzz);
3240  fuzz.second->Assign(si.GetFuzz_to());
3241  }
3242  // Map the same way as a standalone seq-interval.
3243  bool res = x_MapInterval(si.GetId(),
3244  TRange(si.GetFrom(), si.GetTo()),
3245  si.IsSetStrand(),
3246  si.IsSetStrand() ? si.GetStrand() : eNa_strand_unknown,
3247  fuzz);
3248  if ( !res ) {
3249  // If the interval could not be mapped, we may need to keep
3250  // the original one.
3252  // Propagate collected mapped ranges to the destination seq-loc.
3254  // Add a copy of the original interval.
3255  TRange rg(si.GetFrom(), si.GetTo());
3257  STRAND_TO_INDEX(si.IsSetStrand(), si.GetStrand()),
3258  rg, fuzz, false, 0);
3259  }
3260  else {
3261  // If we don't need to keep the non-mapping ranges, just mark
3262  // the result as partial.
3263  m_Partial = true;
3264  }
3265  }
3266 }
3267 
3268 
3270  TSeqPos p)
3271 {
3273  // Copy fuzz from the original point.
3274  if ( pp.IsSetFuzz() ) {
3275  fuzz.first.Reset(new CInt_fuzz);
3276  fuzz.first->Assign(pp.GetFuzz());
3277  }
3278  // Map the same way as a standalone seq-interval.
3279  bool res = x_MapInterval(
3280  pp.GetId(),
3281  TRange(p, p), pp.IsSetStrand(),
3282  pp.IsSetStrand() ?
3284  fuzz);
3285  if ( !res ) {
3286  // If the point could not be mapped, we may need to keep
3287  // the original one.
3289  // Propagate collected mapped ranges to the destination seq-loc.
3291  // Add a copy of the original point.
3292  TRange rg(p, p);
3296  pp.GetStrand()),
3297  rg, fuzz, false, 0);
3298  }
3299  else {
3300  // If we don't need to keep the non-mapping ranges, just mark
3301  // the result as partial.
3302  m_Partial = true;
3303  }
3304  }
3305 }
3306 
3307 
3309 {
3310  // Parse and map a seq-loc.
3311  switch ( src_loc.Which() ) {
3312  case CSeq_loc::e_Null:
3313  // Check if gaps are allowed in the result.
3314  if (m_GapFlag == eGapRemove) {
3315  return; // No - just ignore it.
3316  }
3317  // Yes - proceed to seq-loc duplication
3318  case CSeq_loc::e_not_set:
3319  case CSeq_loc::e_Feat:
3320  {
3321  // These types can not be mapped, just copy them to the
3322  // resulting seq-loc.
3323  // First, push any ranges already mapped to the result.
3325  // Add a copy of the original location.
3326  CRef<CSeq_loc> loc(new CSeq_loc);
3327  loc->Assign(src_loc);
3328  x_PushLocToDstMix(loc);
3329  break;
3330  }
3331  case CSeq_loc::e_Empty:
3332  {
3333  // With empty seq-locs we can only change its seq-id.
3334  bool res = false;
3335  // Check if the id can be mapped at all.
3339  TRange::GetWhole().GetTo());
3340  for ( ; mit; ++mit) {
3341  const CMappingRange& cvt = *mit->second;
3342  if ( cvt.GoodSrcId(src_loc.GetEmpty()) ) {
3343  // Found matching source id, map it to the destination.
3346  cvt.GetDstIdHandle(),
3348  TRange::GetEmpty(), fuzz, false, 0);
3349  res = true;
3350  break;
3351  }
3352  }
3353  if ( !res ) {
3354  // If we don't have any mappings for this seq-id we may
3355  // still need to keep the original.
3358  CRef<CSeq_loc> loc(new CSeq_loc);
3359  loc->Assign(src_loc);
3360  x_PushLocToDstMix(loc);
3361  }
3362  else if ( GetNonMappingAsNull() ) {
3363  x_PushNullLoc();
3364  }
3365  else {
3366  m_Partial = true;
3367  }
3368  }
3369  break;
3370  }
3371  case CSeq_loc::e_Whole:
3372  {
3373  // Whole locations are mapped the same way as intervals, but we need
3374  // to know the bioseq length.
3375  const CSeq_id& src_id = src_loc.GetWhole();
3376  TSeqPos src_to = GetSequenceLength(src_id);
3377  TRange src_rg = TRange::GetWhole();
3378  // Sequence length returned above may be zero - treat it as unknown.
3379  if (src_to > 0 && src_to != kInvalidSeqPos) {
3380  src_rg.SetOpen(0, src_to);
3381  }
3382  // The length may still be unknown, but we'll try to map it anyway.
3383  // If there are no minus strands involved, it should be possible.
3384  bool res = x_MapInterval(src_id, src_rg,
3385  false, eNa_strand_unknown,
3387  if ( !res ) {
3388  // If nothing could be mapped, we may still need to keep
3389  // the original.
3392  CRef<CSeq_loc> loc(new CSeq_loc);
3393  loc->Assign(src_loc);
3394  x_PushLocToDstMix(loc);
3395  }
3396  else {
3397  m_Partial = true;
3398  }
3399  }
3400  break;
3401  }
3402  case CSeq_loc::e_Int:
3403  {
3404  // Map a single interval.
3405  const CSeq_interval& src_int = src_loc.GetInt();
3406  // Copy fuzz so that it's preserved if there are no truncations.
3408  if ( src_int.IsSetFuzz_from() ) {
3409  fuzz.first.Reset(new CInt_fuzz);
3410  fuzz.first->Assign(src_int.GetFuzz_from());
3411  }
3412  if ( src_int.IsSetFuzz_to() ) {
3413  fuzz.second.Reset(new CInt_fuzz);
3414  fuzz.second->Assign(src_int.GetFuzz_to());
3415  }
3416  // Map the interval.
3417  bool res = x_MapInterval(src_int.GetId(),
3418  TRange(src_int.GetFrom(), src_int.GetTo()),
3419  src_int.IsSetStrand(),
3420  src_int.IsSetStrand() ? src_int.GetStrand() : eNa_strand_unknown,
3421  fuzz);
3422  if ( !res ) {
3423  // If nothing could be mapped, we may still need to keep
3424  // the original.
3427  CRef<CSeq_loc> loc(new CSeq_loc);
3428  loc->Assign(src_loc);
3429  // This is the only difference from mapping a packed-int
3430  // element - we keep the whole original seq-loc rather than
3431  // a single interval.
3432  x_PushLocToDstMix(loc);
3433  }
3434  else {
3435  m_Partial = true;
3436  }
3437  }
3438  break;
3439  }
3440  case CSeq_loc::e_Pnt:
3441  {
3442  // Point is mapped as an interval of length 1.
3443  const CSeq_point& pnt = src_loc.GetPnt();
3445  if ( pnt.IsSetFuzz() ) {
3446  // With C-style, we sometimes set the fuzz to the "to-fuzz" depending
3447  // on what the fuzz actually is.
3448  if( (m_FuzzOption & fFuzzOption_CStyle) != 0 &&
3449  (pnt.GetFuzz().IsLim() &&
3450  pnt.GetFuzz().GetLim() == CInt_fuzz::eLim_gt) )
3451  {
3452  fuzz.second.Reset(new CInt_fuzz);
3453  fuzz.second->Assign(pnt.GetFuzz());
3454  } else {
3455  fuzz.first.Reset(new CInt_fuzz);
3456  fuzz.first->Assign(pnt.GetFuzz());
3457  }
3458  }
3459  bool res = x_MapInterval(pnt.GetId(),
3460  TRange(pnt.GetPoint(), pnt.GetPoint()),
3461  pnt.IsSetStrand(),
3462  pnt.IsSetStrand() ? pnt.GetStrand() : eNa_strand_unknown,
3463  fuzz);
3464  if ( !res ) {
3465  // If nothing could be mapped, we may still need to keep
3466  // the original.
3469  CRef<CSeq_loc> loc(new CSeq_loc);
3470  loc->Assign(src_loc);
3471  x_PushLocToDstMix(loc);
3472  }
3473  else {
3474  m_Partial = true;
3475  }
3476  }
3477  break;
3478  }
3480  {
3481  // Packed intervals are mapped one-by-one.
3482  const CPacked_seqint::Tdata& src_ints = src_loc.GetPacked_int().Get();
3483  ITERATE ( CPacked_seqint::Tdata, i, src_ints ) {
3485  }
3486  break;
3487  }
3489  {
3490  // Mapping of packed points is rather straightforward.
3491  const CPacked_seqpnt& src_pack_pnts = src_loc.GetPacked_pnt();
3492  const CPacked_seqpnt::TPoints& src_pnts = src_pack_pnts.GetPoints();
3493  ITERATE ( CPacked_seqpnt::TPoints, i, src_pnts ) {
3494  x_Map_PackedPnt_Element(src_pack_pnts, *i);
3495  }
3496  break;
3497  }
3498  case CSeq_loc::e_Mix:
3499  {
3500  // First, move any ranges already mapped to the resulting seq-loc.
3502  // Save the resulting seq-loc for later use and reset it.
3504  m_Dst_loc.Reset();
3505  // Map each child seq-loc. The results are collected in m_Dst_loc
3506  // as a new mix.
3507  const CSeq_loc_mix::Tdata& src_mix = src_loc.GetMix().Get();
3508  ITERATE ( CSeq_loc_mix::Tdata, i, src_mix ) {
3509  x_MapSeq_loc(**i);
3510  }
3511  // Update the mapped location if necessary.
3513  // Restore the previous (e.g. parent mix) mapped location if any.
3514  CRef<CSeq_loc> mix = m_Dst_loc;
3515  m_Dst_loc = prev;
3516  // Optimize the mix just mapped and push it to the parent one.
3517  x_OptimizeSeq_loc(mix);
3518  x_PushLocToDstMix(mix);
3519  break;
3520  }
3521  case CSeq_loc::e_Equiv:
3522  {
3523  // Equiv is mapped basically the same way as a mix:
3524  // map each sub-location, optimize the result and push it to the
3525  // destination equiv.
3528  m_Dst_loc.Reset();
3529  const CSeq_loc_equiv::Tdata& src_equiv = src_loc.GetEquiv().Get();
3530  CRef<CSeq_loc> equiv(new CSeq_loc);
3531  equiv->SetEquiv();
3532  ITERATE ( CSeq_loc_equiv::Tdata, i, src_equiv ) {
3533  x_MapSeq_loc(**i);
3536  equiv->SetEquiv().Set().push_back(m_Dst_loc);
3537  m_Dst_loc.Reset();
3538  }
3539  m_Dst_loc = prev;
3540  x_PushLocToDstMix(equiv);
3541  break;
3542  }
3543  case CSeq_loc::e_Bond:
3544  {
3545  // Bond is mapped like a mix having two sub-locations (A and B).
3548  m_Dst_loc.Reset();
3549  const CSeq_bond& src_bond = src_loc.GetBond();
3550  CRef<CSeq_loc> dst_loc(new CSeq_loc);
3551  CRef<CSeq_loc> pntA;
3552  CRef<CSeq_loc> pntB;
3554  if ( src_bond.GetA().IsSetFuzz() ) {
3555  fuzzA.first.Reset(new CInt_fuzz);
3556  fuzzA.first->Assign(src_bond.GetA().GetFuzz());
3557  }
3558  bool resA = x_MapInterval(src_bond.GetA().GetId(),
3559  TRange(src_bond.GetA().GetPoint(), src_bond.GetA().GetPoint()),
3560  src_bond.GetA().IsSetStrand(),
3561  src_bond.GetA().IsSetStrand() ?
3562  src_bond.GetA().GetStrand() : eNa_strand_unknown,
3563  fuzzA);
3564  // If A or B could not be mapped, always preserve the original one
3565  // regardless of the KeepNonmapping flag - we can not just
3566  // drop a part of a bond. See more below.
3567  if ( resA ) {
3568  pntA = x_GetMappedSeq_loc();
3569  _ASSERT(pntA);
3570  }
3571  else {
3572  pntA.Reset(new CSeq_loc);
3573  pntA->SetPnt().Assign(src_bond.GetA());
3574  }
3575  // Reset truncation flag - we are starting new location.
3576  m_LastTruncated = false;
3577  bool resB = false;
3578  if ( src_bond.IsSetB() ) {
3580  if ( src_bond.GetB().IsSetFuzz() ) {
3581  fuzzB.first.Reset(new CInt_fuzz);
3582  fuzzB.first->Assign(src_bond.GetB().GetFuzz());
3583  }
3584  resB = x_MapInterval(src_bond.GetB().GetId(),
3585  TRange(src_bond.GetB().GetPoint(), src_bond.GetB().GetPoint()),
3586  src_bond.GetB().IsSetStrand(),
3587  src_bond.GetB().IsSetStrand() ?
3588  src_bond.GetB().GetStrand() : eNa_strand_unknown,
3589  fuzzB);
3590  }
3591  if ( resB ) {
3592  pntB = x_GetMappedSeq_loc();
3593  _ASSERT(pntB);
3594  }
3595  else {
3596  pntB.Reset(new CSeq_loc);
3597  pntB->SetPnt().Assign(src_bond.GetB());
3598  }
3599  m_Dst_loc = prev;
3600  // Now we check the non-mapping flag. Only if both A and B
3601  // failed to map and the flag is not set, we can discard the bond.
3602  if ( resA || resB || x_IsSetMiscFlag(fKeepNonmapping) ) {
3603  if (pntA->IsPnt() && pntB->IsPnt()) {
3604  // Mapped locations are points - pack into bond
3605  CSeq_bond& dst_bond = dst_loc->SetBond();
3606  dst_bond.SetA(pntA->SetPnt());
3607  if ( src_bond.IsSetB() ) {
3608  dst_bond.SetB(pntB->SetPnt());
3609  }
3610  }
3611  else {
3612  // The original points were mapped to something different
3613  // (e.g. there were multiple mappings for each point).
3614  // Convert the whole bond to mix, add gaps between A and B.
3615  CSeq_loc_mix& dst_mix = dst_loc->SetMix();
3616  if ( pntA ) {
3617  dst_mix.Set().push_back(pntA);
3618  }
3619  if ( pntB ) {
3620  // Add null only if B is set.
3621  CRef<CSeq_loc> null_loc(new CSeq_loc);
3622  null_loc->SetNull();
3623  dst_mix.Set().push_back(null_loc);
3624  dst_mix.Set().push_back(pntB);
3625  }
3626  }
3627  x_PushLocToDstMix(dst_loc);
3628  }
3629  m_Partial = m_Partial || (!resA) || (!resB);
3630  break;
3631  }
3632  default:
3633  NCBI_THROW(CAnnotMapperException, eBadLocation,
3634  "Unsupported location type");
3635  }
3636 }
3637 
3640 {
3641  // Here we create an alignment mapper to map aligns.
3642  // CSeq_loc_Mapper overrides this to return CSeq_align_Mapper.
3643  return new CSeq_align_Mapper_Base(src_align, *this);
3644 }
3645 
3646 
3648 {
3649  // Reset the mapper before mapping each location
3650  m_Dst_loc.Reset();
3651  m_Partial = false;
3652  m_LastTruncated = false;
3653  x_MapSeq_loc(src_loc);
3654  // Push any remaining mapped ranges to the mapped location.
3656  // C-style generates less fuzz, so we would then have to remove some
3657  if( (m_FuzzOption & fFuzzOption_CStyle) != 0 ) {
3659  }
3660  // Optimize mapped location.
3662  // If source locations should be included, optimize them too and
3663  // convert the result to equiv.
3664  if ( m_SrcLocs ) {
3666  CRef<CSeq_loc> ret(new CSeq_loc);
3667  ret->SetEquiv().Set().push_back(m_Dst_loc);
3668  ret->SetEquiv().Set().push_back(m_SrcLocs);
3669  return ret;
3670  }
3671  return m_Dst_loc;
3672 }
3673 
3674 
3676 {
3677 public:
3679  CTotalRangeSynonymMapper(const TSynonymMap& syn_map) : m_Map(syn_map) {}
3680  virtual ~CTotalRangeSynonymMapper(void) {}
3681 
3683  {
3684  auto main_id = m_Map.find(CSeq_id_Handle::GetHandle(id));
3685  return main_id != m_Map.end() ? main_id->second : CSeq_id_Handle();
3686  }
3687 private:
3689 };
3690 
3691 
3693 {
3695  CRef<CSeq_loc> total_range = src_loc.Merge(CSeq_loc::fMerge_SingleRange, &syn_mapper);
3696  return Map(*total_range);
3697 }
3698 
3699 
3702  size_t* row)
3703 {
3704  // Mapping of alignments is performed by the seq-align mapper.
3705  m_Dst_loc.Reset();
3706  m_Partial = false;
3707  m_LastTruncated = false;
3708  CRef<CSeq_align_Mapper_Base> aln_mapper(InitAlignMapper(src_align));
3709  if ( row ) {
3710  aln_mapper->Convert(*row);
3711  }
3712  else {
3713  aln_mapper->Convert();
3714  }
3715  return aln_mapper->GetDstAlign();
3716 }
3717 
3718 
3719 /////////////////////////////////////////////////////////////////////
3720 //
3721 // Produce result of the mapping
3722 //
3723 
3724 
3726 x_RangeToSeq_loc(const CSeq_id_Handle& idh,
3727  TSeqPos from,
3728  TSeqPos to,
3729  size_t strand_idx,
3730  TRangeFuzz rg_fuzz)
3731 {
3732  ESeqType seq_type = GetSeqTypeById(idh);
3733  if (seq_type == eSeq_prot) {
3734  // Convert coordinates. For seq-locs discard frame information.
3735  from = from/3;
3736  to = to/3;
3737  }
3738 
3739  CRef<CSeq_loc> loc(new CSeq_loc);
3740  // If any fuzz is set, create interval, not point.
3741  // Points with fuzz can create problems later since they don't
3742  // specify fuzz direction. See GP-2895.
3743  if (from == to && (!rg_fuzz.first && !rg_fuzz.second) &&
3744  (m_FuzzOption & fFuzzOption_CStyle) == 0 )
3745  {
3746  // point
3747  loc->SetPnt().SetId().Assign(*idh.GetSeqId());
3748  loc->SetPnt().SetPoint(from);
3749  if (strand_idx > 0) {
3750 
3751  loc->SetPnt().SetStrand(INDEX_TO_STRAND(strand_idx));
3752  }
3753  if ( rg_fuzz.first ) {
3754  loc->SetPnt().SetFuzz(*rg_fuzz.first);
3755  }
3756  else if ( rg_fuzz.second ) {
3757  loc->SetPnt().SetFuzz(*rg_fuzz.second);
3758  }
3759  }
3760  // Note: at this moment for whole locations 'to' is equal to GetWholeTo()
3761  // not GetWholeToOpen().
3762  else if (from == 0 && to == TRange::GetWholeTo()) {
3763  loc->SetWhole().Assign(*idh.GetSeqId());
3764  // Ignore strand for whole locations
3765  }
3766  else {
3767  // interval
3768  loc->SetInt().SetId().Assign(*idh.GetSeqId());
3769  loc->SetInt().SetFrom(from);
3770  loc->SetInt().SetTo(to);
3771  if (strand_idx > 0) {
3772  loc->SetInt().SetStrand(INDEX_TO_STRAND(strand_idx));
3773  }
3774  if ( rg_fuzz.first ) {
3775  loc->SetInt().SetFuzz_from(*rg_fuzz.first);
3776  }
3777  if ( rg_fuzz.second ) {
3778  loc->SetInt().SetFuzz_to(*rg_fuzz.second);
3779  }
3780  }
3781  return loc;
3782 }
3783 
3784 
3787  size_t strand_idx) const
3788 {
3789  // Get mapped ranges for the given id and strand.
3790  // Make sure the vector contains entry for the strand index.
3791  TRangesByStrand& str_vec = m_MappedLocs[id];
3792  if (str_vec.size() <= strand_idx) {
3793  str_vec.resize(strand_idx + 1);
3794  }
3795  return str_vec[strand_idx];
3796 }
3797 
3798 
3799 // Add new mapped range.
3800 // The range is added as the first or the last element depending on its strand.
3801 // 'push_reverse' indicates if this rule must be reversed. This flag is set
3802 // when the mapping itself reverses the strand.
3804  size_t strand_idx,
3805  const TRange& range,
3806  const TRangeFuzz& fuzz,
3807  bool push_reverse,
3808  int group)
3809 {
3810  // It is impossible to collect source locations and do merging
3811  // at the same time.
3813  NCBI_THROW(CAnnotMapperException, eOtherError,
3814  "Merging ranges is incompatible with "
3815  "including source locations.");
3816  }
3817  bool reverse = (strand_idx > 0) &&
3818  IsReverse(INDEX_TO_STRAND(strand_idx));
3819  switch ( m_MergeFlag ) {
3820  case eMergeContained:
3821  case eMergeAll:
3822  case eMergeBySeg:
3823  {
3824  // Merging will be done later, while constructing the mapped
3825  // seq-loc. Now just add new range in the right order.
3826  if ( push_reverse ) {
3827  x_GetMappedRanges(id, strand_idx)
3828  .push_front(SMappedRange(range, fuzz, group));
3829  }
3830  else {
3831  x_GetMappedRanges(id, strand_idx)
3832  .push_back(SMappedRange(range, fuzz, group));
3833  }
3834  break;
3835  }
3836  case eMergeNone:
3837  {
3838  // No merging. Propagate any collected ranges to the
3839  // mapped location to keep grouping, add the new one.
3841  if ( push_reverse ) {
3842  x_GetMappedRanges(id, strand_idx)
3843  .push_front(SMappedRange(range, fuzz, group));
3844  }
3845  else {
3846  x_GetMappedRanges(id, strand_idx)
3847  .push_back(SMappedRange(range, fuzz, group));
3848  }
3849  break;
3850  }
3851  case eMergeAbutting:
3852  default:
3853  {
3854  // Some special processing is required.
3856  // Start new sub-location for:
3857  // - New ID (can not merge ranges on different sequences)
3858  bool no_merge = (it == m_MappedLocs.end()) || (it->first != id);
3859  // - New strand (can not merge ranges on different strands)
3860  no_merge = no_merge ||
3861  (it->second.size() <= strand_idx) || it->second.empty();
3862  // - Ranges are not abutting or belong to different groups
3863  if ( !no_merge ) {
3864  // Compare the new range to the previous one, which can be
3865  // the first or the last depending on the strand.
3866  if ( reverse ) {
3867  const SMappedRange& mrg = it->second[strand_idx].front();
3868  // Check coordinates or group number.
3869  no_merge = no_merge ||
3870  (mrg.range.GetFrom() != range.GetToOpen());
3871  if (m_MergeFlag == eMergeBySeg) {
3872  no_merge = no_merge || (mrg.group != group);
3873  }
3874  }
3875  else {
3876  const SMappedRange& mrg = it->second[strand_idx].back();
3877  // Check coordinates or group number.
3878  no_merge = no_merge ||
3879  (mrg.range.GetToOpen() != range.GetFrom());
3880  if (m_MergeFlag == eMergeBySeg) {
3881  no_merge = no_merge || (mrg.group != group);
3882  }
3883  }
3884  }
3885  if ( no_merge ) {
3886  // Can not merge the new range with the previous one.
3888  if ( push_reverse ) {
3889  x_GetMappedRanges(id, strand_idx)
3890  .push_front(SMappedRange(range, fuzz, group));
3891  }
3892  else {
3893  x_GetMappedRanges(id, strand_idx)
3894  .push_back(SMappedRange(range, fuzz, group));
3895  }
3896  }
3897  else {
3898  // The ranges can be merged. Take the strand into account.
3899  if ( reverse ) {
3900  SMappedRange& last_rg = it->second[strand_idx].front();
3901  last_rg.range.SetFrom(range.GetFrom());
3902  last_rg.fuzz.first = fuzz.first;
3903  }
3904  else {
3905  SMappedRange& last_rg = it->second[strand_idx].back();
3906  last_rg.range.SetTo(range.GetTo());
3907  last_rg.fuzz.second = fuzz.second;
3908  }
3909  }
3910  }
3911  }
3912 }
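// Illustration only: a simplified sketch of the abutting-range merge in
// the default branch above. It assumes a single 'reverse' flag (the real
// code distinguishes the strand from 'push_reverse'), ignores ids, fuzz
// and groups, and uses the hypothetical names Rg and PushAbutting. On
// the reverse strand the previous range is at the front of the
// container, otherwise at the back.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

struct Rg { std::uint32_t from; std::uint32_t to; };  // closed interval

// Extend the previous range if the new one abuts it, otherwise store the
// new range as a separate element.
void PushAbutting(std::deque<Rg>& ranges, Rg rg, bool reverse)
{
    if ( !ranges.empty() ) {
        if ( reverse ) {
            Rg& prev = ranges.front();
            if (prev.from == rg.to + 1) {   // new range ends right before prev
                prev.from = rg.from;
                return;
            }
        }
        else {
            Rg& prev = ranges.back();
            if (prev.to + 1 == rg.from) {   // new range starts right after prev
                prev.to = rg.to;
                return;
            }
        }
    }
    if ( reverse ) {
        ranges.push_front(rg);
    }
    else {
        ranges.push_back(rg);
    }
}
```

// For example, pushing [0,9], [10,19], [30,39] in forward order leaves
// two ranges, [0,19] and [30,39].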
3913 
3914 
3915 // Store the range from the original location which could be mapped.
3916 // See also x_PushMappedRange.
3918  size_t src_strand,
3919  size_t dst_strand,
3920  const TRange& range,
3921  bool push_reverse)
3922 {
3923  if ( !x_IsSetMiscFlag(fIncludeSrcLocs) ) return; // No need to store source ranges.
3924  if ( !m_SrcLocs ) {
3925  m_SrcLocs.Reset(new CSeq_loc);
3926  }
3927  CRef<CSeq_loc> loc(new CSeq_loc);
3928  CRef<CSeq_id> id(new CSeq_id);
3929  id->Assign(*idh.GetSeqId());
3930  if ( range.Empty() ) {
3931  loc->SetEmpty(*id);
3932  }
3933  else if ( range.IsWhole() ) {
3934  loc->SetWhole(*id);
3935  }
3936  else {
3937  // The range uses genomic coords, recalculate if necessary.
3938  ESeqType seq_type = GetSeqTypeById(idh);
3939  int seq_width = (seq_type == eSeq_prot) ? 3 : 1;
3940  loc->SetInt().SetId(*id);
3941  loc->SetInt().SetFrom(range.GetFrom()/seq_width);
3942  loc->SetInt().SetTo(range.GetTo()/seq_width);
3943  if (src_strand > 0) {
3944  loc->SetStrand(INDEX_TO_STRAND(src_strand));
3945  }
3946  }
3947  // Store the location.
3948  if ( !SameOrientation(
3949  src_strand ? INDEX_TO_STRAND(src_strand) : eNa_strand_unknown,
3950  dst_strand ? INDEX_TO_STRAND(dst_strand) : eNa_strand_unknown) ) {
3951  push_reverse = !push_reverse;
3952  }
3953  if ( push_reverse ) {
3954  m_SrcLocs->SetMix().Set().push_front(loc);
3955  }
3956  else {
3957  m_SrcLocs->SetMix().Set().push_back(loc);
3958  }
3959 }
3960 
3961 
3963 {
3964  // Are there any locations ready?
3965  if (m_MappedLocs.size() == 0) {
3966  return;
3967  }
3968  // Push everything already mapped to the destination mix.
3969  // m_MappedLocs are reset and ready to accept the next part.
3971  if ( !m_Dst_loc ) {
3972  // If this is the first mapped location, just use it without
3973  // wrapping in a mix.
3974  m_Dst_loc = loc;
3975  return;
3976  }
3977  if ( !loc->IsNull() ) {
3978  // If the location is not null, add it to the existing mix.
3979  x_PushLocToDstMix(loc);
3980  }
3981 }
3982 
3983 
3985 {
3986  _ASSERT(loc);
3987  // If the mix does not exist yet, create it.
3988  if ( !m_Dst_loc || !m_Dst_loc->IsMix() ) {
3990  m_Dst_loc.Reset(new CSeq_loc);
3991  m_Dst_loc->SetMix();
3992  if ( tmp ) {
3993  m_Dst_loc->SetMix().Set().push_back(tmp);
3994  }
3995  }
3996  CSeq_loc_mix::Tdata& mix = m_Dst_loc->SetMix().Set();
3997  if ( loc->IsNull() ) {
3998  if ( m_GapFlag == eGapRemove ) {
3999  return; // No need to store gaps
4000  }
4001  if ( mix.size() > 0 && (*mix.rbegin())->IsNull() ) {
4002  // do not create duplicate NULLs
4003  return;
4004  }
4005  }
4006  mix.push_back(loc);
4007 }
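// Illustration only: a minimal sketch of the NULL handling in
// x_PushLocToDstMix. It assumes locations are modeled as strings and an
// empty string stands for a NULL seq-loc; GapFlag and PushToMix are
// hypothetical names, not toolkit API.

```cpp
#include <cassert>
#include <list>
#include <string>

enum class GapFlag { Preserve, Remove };

// Append 'loc' to 'mix'. A NULL (empty string here) is dropped when gaps
// are removed, or when the mix already ends with a NULL.
void PushToMix(std::list<std::string>& mix,
               const std::string&      loc,
               GapFlag                 flag)
{
    if ( loc.empty() ) {
        if (flag == GapFlag::Remove) {
            return;  // No need to store gaps.
        }
        if ( !mix.empty() && mix.back().empty() ) {
            return;  // Do not create duplicate NULLs.
        }
    }
    mix.push_back(loc);
}
```

// With GapFlag::Preserve, pushing "a", NULL, NULL, "b" yields three
// elements; with GapFlag::Remove no NULL is ever stored.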
4008 
4009 
4011 {
4012  CRef<CSeq_loc> null_loc(new CSeq_loc);
4013  null_loc->SetNull();
4015  x_PushLocToDstMix(null_loc);
4016 }
4017 
4018 
4020 {
4022  // Sorting discards the original order, no need to check
4023  // mappings, just use the mapped strand.
4024  return str != 0 && IsReverse(INDEX_TO_STRAND(str));
4025  }
4026  // For other merging modes the strand is not important (it's checked
4027  // somewhere else), we just need to know if the order of ranges
4028  // is reversed by mapping or not.
4030 }
4031 
4032 
4034 {
4035  // Create a new mix to store all mapped ranges in it.
4036  CRef<CSeq_loc> dst_loc(new CSeq_loc);
4037  CSeq_loc_mix::Tdata& dst_mix = dst_loc->SetMix().Set();
4038  // Iterate all mapped seq-ids.
4040  // Uninitialized id means gap (this should not happen in fact).
4041  if ( !id_it->first ) {
4042  if (m_GapFlag == eGapPreserve) {
4043  CRef<CSeq_loc> null_loc(new CSeq_loc);
4044  null_loc->SetNull();
4045  dst_mix.push_back(null_loc);
4046  }
4047  continue;
4048  }
4049  // Iterate all strands for the current id.
4050  for (int str = 0; str < (int)id_it->second.size(); ++str) {
4051  if (id_it->second[str].size() == 0) {
4052  continue;
4053  }
4054  TSeqPos from = kInvalidSeqPos;
4055  TSeqPos to = kInvalidSeqPos;
4057  int group = -1;
4058  // Some merge flags require the ranges to be sorted.
4059  if (m_MergeFlag == eMergeContained ||
4060  m_MergeFlag == eMergeAll ||
4061  m_MergeFlag == eMergeBySeg) {
4062  id_it->second[str].sort();
4063  }
4064  // Iterate mapped ranges.
4065  NON_CONST_ITERATE(TMappedRanges, rg_it, id_it->second[str]) {
4066  if ( rg_it->range.Empty() ) {
4067  // Empty seq-loc
4068  CRef<CSeq_loc> loc(new CSeq_loc);
4069  loc->SetEmpty().Assign(*id_it->first.GetSeqId());
4070  if ( x_ReverseRangeOrder(0) ) {
4071  dst_mix.push_front(loc);
4072  }
4073  else {
4074  dst_mix.push_back(loc);
4075  }
4076  continue;
4077  }
4078  // Is this the first mapped range?
4079  if (to == kInvalidSeqPos) {
4080  // Initialize from, to and fuzz.
4081  from = rg_it->range.GetFrom();
4082  to = rg_it->range.GetTo();
4083  fuzz = rg_it->fuzz;
4084  group = rg_it->group;
4085  continue;
4086  }
4087  if (m_MergeFlag != eMergeBySeg || rg_it->group == group) {
4088  // Merge abutting ranges. The ranges are sorted by 'from',
4089  // so we need to check only one end.
4090  if (m_MergeFlag == eMergeAbutting) {
4091  if (rg_it->range.GetFrom() == to + 1) {
4092  to = rg_it->range.GetTo();
4093  fuzz.second = rg_it->fuzz.second;
4094  continue;
4095  }
4096  }
4097  // Merge contained ranges
4098  if (m_MergeFlag == eMergeContained) {
4099  // Ignore interval completely covered by another one.
4100  // Check only 'to', since the ranges are sorted by 'from'.
4101  if (rg_it->range.GetTo() <= to) {
4102  continue;
4103  }
4104  // If the old range is contained in the new one, adjust
4105  // its 'to'.
4106  if (rg_it->range.GetFrom() == from) {
4107  to = rg_it->range.GetTo();
4108  fuzz.second = rg_it->fuzz.second;
4109  continue;
4110  }
4111  }
4112  // Merge all overlapping ranges.
4114  if (rg_it->range.GetFrom() <= to + 1) {
4115  if (rg_it->range.GetTo() > to) {
4116  to = rg_it->range.GetTo();
4117  fuzz.second = rg_it->fuzz.second;
4118  }
4119  continue;
4120  }
4121  }
4122  }
4123 
4124  // No merging happened - store the previous interval
4125  // or point.
4126  if ( x_ReverseRangeOrder(str) ) {
4127  dst_mix.push_front(x_RangeToSeq_loc(id_it->first, from, to,
4128  str, fuzz));
4129  }
4130  else {
4131  dst_mix.push_back(x_RangeToSeq_loc(id_it->first, from, to,
4132  str, fuzz));
4133  }
4134  // Initialize the new range, but do not store it yet - it
4135  // may be merged with the next one.
4136  from = rg_it->range.GetFrom();
4137  to = rg_it->range.GetTo();
4138  fuzz = rg_it->fuzz;
4139  group = rg_it->group;
4140  }
4141  // If there were only empty ranges, do not try to add them as points.
4142  if (from == kInvalidSeqPos && to == kInvalidSeqPos) {
4143  continue;
4144  }
4145  // Last interval or point not yet stored.
4146  if ( x_ReverseRangeOrder(str) ) {
4147  dst_mix.push_front(x_RangeToSeq_loc(id_it->first, from, to,
4148  str, fuzz));
4149  }
4150  else {
4151  dst_mix.push_back(x_RangeToSeq_loc(id_it->first, from, to,
4152  str, fuzz));
4153  }
4154  }
4155  }
4156  m_MappedLocs.clear();
4157  x_OptimizeSeq_loc(dst_loc);
4158  return dst_loc;
4159 }
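// Illustration only: a minimal sketch of the "merge all" pass in
// x_GetMappedSeq_loc, reduced to plain closed intervals. It assumes no
// fuzz, groups, strands or empty ranges, and that 'to' is below the
// maximum value so that 'to + 1' does not wrap. Interval and MergeAll
// are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct Interval { std::uint32_t from; std::uint32_t to; };  // closed

// Sort by 'from', then fold every range that overlaps or abuts the
// current one into it; only the right end needs to be checked.
std::vector<Interval> MergeAll(std::vector<Interval> ranges)
{
    std::sort(ranges.begin(), ranges.end(),
              [](const Interval& a, const Interval& b) {
                  return a.from < b.from ||
                         (a.from == b.from && a.to < b.to);
              });
    std::vector<Interval> merged;
    for (const Interval& rg : ranges) {
        if ( !merged.empty() && rg.from <= merged.back().to + 1 ) {
            merged.back().to = std::max(merged.back().to, rg.to);
        }
        else {
            merged.push_back(rg);
        }
    }
    return merged;
}
```

// For example, {10,20}, {0,4}, {5,8}, {15,30}, {40,50} collapse to
// {0,8}, {10,30}, {40,50}.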
4160 
4161 
4162 // Copy a range from the original graph data to the mapped one.
4163 template<class TData> void CopyGraphData(const TData& src,
4164  TData& dst,
4165  TSeqPos from,
4166  TSeqPos to)
4167 {
4168  _ASSERT(from < src.size() && to <= src.size());
4169  dst.insert(dst.end(), src.begin() + from, src.begin() + to);
4170 }
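// Illustration only: a minimal sketch of how a helper shaped like
// CopyGraphData appends a half-open slice [from, to) of the source data
// to the destination. CopySlice is a hypothetical stand-in that uses
// std::size_t and the standard assert instead of TSeqPos and _ASSERT.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Append src[from .. to) to the end of dst.
template<class TData>
void CopySlice(const TData& src, TData& dst,
               std::size_t from, std::size_t to)
{
    assert(from <= to && to <= src.size());
    dst.insert(dst.end(), src.begin() + from, src.begin() + to);
}
```

// Repeated calls accumulate the used parts of the source in order, which
// is how the mapped graph data would be assembled piece by piece.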
4171 
4172 
4174 {
4175  CRef<CSeq_graph> ret;
4176  // Start collecting used ranges to adjust graph data.
4178  CRef<CSeq_loc> mapped_loc = Map(src_graph.GetLoc());
4179  if ( !mapped_loc ) {
4180  // Nothing was mapped, return NULL.
4181  return ret;
4182  }
4183  ret.Reset(new CSeq_graph);
4184  ret->Assign(src_graph);
4185  ret->SetLoc(*mapped_loc);
4186 
4187  // Check mapped sequence type, adjust coordinates.
4188  ESeqType src_type = eSeq_unknown;
4189  bool src_type_set = false;
4190  // Iterate the original location, look for the sequence type.
4191  for (CSeq_loc_CI it = src_graph.GetLoc(); it; ++it) {
4192  ESeqType it_type = GetSeqTypeById(it.GetSeq_id_Handle());
4193  if (it_type == eSeq_unknown) {
4194  continue;
4195  }
4196  if ( !src_type_set ) {
4197  src_type = it_type;
4198  src_type_set = true;
4199  }
4200  else if (src_type != it_type) {
4201  NCBI_THROW(CAnnotMapperException, eBadLocation,
4202  "Source graph location contains different sequence "
4203  "types -- can not map graph data.");
4204  }
4205  }
4206  ESeqType dst_type = eSeq_unknown;
4207  bool dst_type_set = false;
4208  // Iterate the mapped location, look for the sequence type.
4209  for (CSeq_loc_CI it = *mapped_loc; it; ++it) {
4210  ESeqType it_type = GetSeqTypeById(it.GetSeq_id_Handle());
4211  if (it_type == eSeq_unknown) {
4212  continue;
4213  }
4214  if ( !dst_type_set ) {
4215  dst_type = it_type;
4216  dst_type_set = true;
4217  }
4218  else if (dst_type != it_type) {
4219  NCBI_THROW(CAnnotMapperException, eBadLocation,
4220  "Mapped graph location contains different sequence "
4221  "types -- can not map graph data.");
4222  }
4223  }
4224 
4225  CSeq_graph::TGraph& dst_data = ret->SetGraph();
4226  dst_data.Reset();
4227  const CSeq_graph::TGraph& src_data = src_graph.GetGraph();
4228 
4229  // Recalculate compression factor.
4230  TSeqPos comp = (src_graph.IsSetComp() && src_graph.GetComp()) ?
4231  src_graph.GetComp() : 1;
4232  // In some cases the original data indexing must be divided by 3
4233  // to get mapped data indexes.
4234  TSeqPos comp_div = comp;
4235  // By now, only one sequence type can be present.
4236  // If the original and mapped sequence types are different
4237  // and one of them is prot, adjust compression.
4238  if (src_type != dst_type &&
4239  (src_type == eSeq_prot || dst_type == eSeq_prot)) {
4240  // Source is prot, need to multiply comp by 3
4241  if (src_type == eSeq_prot) {
4242  comp *= 3;
4243  comp_div = comp;
4244  }
4245  // Mapped is prot, need to divide comp by 3 if possible
4246  else if (comp % 3 == 0) {
4247  comp /= 3;
4248  }
4249  else {
4250  // Can not divide by 3, impossible to adjust data.
4251  NCBI_THROW(CAnnotMapperException, eOtherError,
4252  "Can not map seq-graph data between "
4253  "different sequence types.");
4254  }
4255  }
4256  ret->SetComp(comp);
4257  TSeqPos numval = 0;
4258 
4259  typedef CGraphRanges::TGraphRanges TGraphRanges;
4260  const TGraphRanges& ranges = m_GraphRanges->GetRanges();
4261  // Copy only the used ranges from the original data to the mapped one.
4262  switch ( src_data.Which() ) {
4263  case CSeq_graph::TGraph::e_Byte:
4264  dst_data.SetByte().SetMin(src_data.GetByte().GetMin());
4265  dst_data.SetByte().SetMax(src_data.GetByte().GetMax());
4266  dst_data.SetByte().SetAxis(src_data.GetByte().GetAxis());
4267  dst_data.SetByte().SetValues();
4268  // Copy each used range.
4269  ITERATE(TGraphRanges, it, ranges) {
4270  TSeqPos from = it->GetFrom()/comp_div;
4271  TSeqPos to = it->GetTo()/comp_div + 1;
4272  CopyGraphData(src_data.GetByte().GetValues(),
4273  dst_data.SetByte().SetValues(),
4274  from, to);
4275  numval += to - from;
4276  }
4277  break;
4278  case CSeq_graph::TGraph::e_Int:
4279  dst_data.SetInt().SetMin(src_data.GetInt().GetMin());
4280  dst_data.SetInt().SetMax(src_data.GetInt().GetMax());
4281  dst_data.SetInt().SetAxis(src_data.GetInt().GetAxis());
4282  dst_data.SetInt().SetValues();
4283  ITERATE(TGraphRanges, it, ranges) {
4284  TSeqPos from = it->GetFrom()/comp_div;
4285  TSeqPos to = it->GetTo()/comp_div + 1;
4286  CopyGraphData(src_data.GetInt().GetValues(),
4287  dst_data.SetInt().SetValues(),
4288  from, to);
4289  numval += to - from;
4290  }
4291  break;
4292  case CSeq_graph::TGraph::e_Real:
4293  dst_data.SetReal().SetMin(src_data.GetReal().GetMin());
4294  dst_data.SetReal().SetMax(src_data.GetReal().GetMax());
4295  dst_data.SetReal().SetAxis(src_data.GetReal().GetAxis());
4296  dst_data.SetReal().SetValues();
4297  ITERATE(TGraphRanges, it, ranges) {
4298  TSeqPos from = it->GetFrom()/comp_div;
4299  TSeqPos to = it->GetTo()/comp_div + 1;
4300  CopyGraphData(src_data.GetReal().GetValues(),
4301  dst_data.SetReal().SetValues(),
4302  from, to);
4303  numval += to - from;
4304  }
4305  break;
4306  default:
4307  break;
4308  }
4309  ret->SetNumval(numval);
4310 
4311  m_GraphRanges.Reset();
4312  return ret;
4313 }
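The compression arithmetic in the graph mapping above can be isolated into a small sketch. It assumes hypothetical stand-ins (`SeqType`, `Pos`, `AdjustComp`, `ValueIndexRange`) for the toolkit types and reports failure with a `bool` where the original throws `eOtherError`; it mirrors the arithmetic only and is not the toolkit implementation.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-ins for the toolkit's ESeqType and TSeqPos.
enum class SeqType { Unknown, Nuc, Prot };
typedef unsigned int Pos;

// Mirrors the comp / comp_div adjustment of the graph mapping code.
// Returns false in the case where the original throws eOtherError.
bool AdjustComp(Pos src_comp, SeqType src, SeqType dst,
                Pos& comp, Pos& comp_div)
{
    comp = src_comp ? src_comp : 1;  // unset or zero comp is treated as 1
    comp_div = comp;
    if (src != dst && (src == SeqType::Prot || dst == SeqType::Prot)) {
        if (src == SeqType::Prot) {
            comp *= 3;               // source is prot: multiply by 3
            comp_div = comp;
        }
        else if (comp % 3 == 0) {
            comp /= 3;               // mapped is prot: comp_div keeps the old value
        }
        else {
            return false;            // not divisible by 3, data can not be adjusted
        }
    }
    return true;
}

// Half-open index range [first, second) into the source values for a
// collected range [from, to], as computed in each switch branch.
std::pair<Pos, Pos> ValueIndexRange(Pos from, Pos to, Pos comp_div)
{
    return std::make_pair(from / comp_div, to / comp_div + 1);
}
```

With a source compression of 6 and a nucleotide-to-protein mapping, the stored factor becomes 2 while the divisor stays 6, so a collected range [6, 11] selects the single source value at index 1.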
4314 
4315 
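4316 CSeq_loc_Mapper_Base::EMapResult
4317 CSeq_loc_Mapper_Base::Map(CSeq_annot& annot, TAnnotMapFlags flags)
4318 {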
4319  EMapResult ret = eMapped_None;
4320  size_t mapped_count = 0;
4321  size_t non_mapped_count = 0;
4322  switch (annot.GetData().Which()) {
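4323  case CSeq_annot::C_Data::e_Ftable:
4324  {
4325  CSeq_annot::C_Data::TFtable& ftable = annot.SetData().SetFtable();
4326  string error;
4327  bool mapped = false;
4328  ERASE_ITERATE(CSeq_annot::C_Data::TFtable, it, ftable) {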
4329  error.clear();
4330  mapped = false;
4331  try {
4332  // For error reporting we may need the original feature.
4333  CSeq_feat& feat = **it;
4334  CRef<CSeq_loc> loc;
4335  if (flags & fAnnotMap_Location) {
4336  loc = Map(feat.GetLocation());
4337  if ( loc && !loc->IsNull() ) {
4338  feat.SetLocation(*loc);
4339  mapped = true;
4340  }
4341  }
4342  if ((flags & fAnnotMap_Product) && feat.IsSetProduct() ) {
4343  loc = Map(feat.GetProduct());
4344  if ( loc && !loc->IsNull() ) {
4345  feat.SetProduct(*loc);
4346  mapped = true;
4347  }
4348  }
4349  }
4350  catch (CAnnotMapperException& e) {
4351  error = e.GetMsg();
4352  mapped = false;
4353  }
4354  if ( mapped ) {
4355  mapped_count++;
4356  }
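4357  else {
4358  if ( IMessageListener::HaveListeners() ) {
4359  CSeq_loc_Mapper_Message msg(
4360  error.empty() ? "Failed to map seq-feat" : error,
4361  eDiag_Error);
4362  msg.SetFeat(**it);
4363  IMessageListener::Post(msg);
4364  }
4365  non_mapped_count++;
4366  if (flags & fAnnotMap_RemoveNonMapping) {
4367  ftable.erase(it);
4368  }
4369  else if (flags & fAnnotMap_ThrowOnFailure) {
4370  NCBI_THROW(CAnnotMapperException, eCanNotMap,
4371  error.empty() ? string("Failed to map seq-feat.") : error);
4372  }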
4373  }
4374  }
4375  break;
4376  }
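4377  case CSeq_annot::C_Data::e_Align:
4378  {
4379  CSeq_annot::C_Data::TAlign& aligns = annot.SetData().SetAlign();
4380  string error;
4381  ERASE_ITERATE(CSeq_annot::C_Data::TAlign, it, aligns) {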
4382  error.clear();
4383  CRef<CSeq_align> align;
4384  try {
4385  align = Map(**it);
4386  }
4387  catch (CAnnotMapperException& e) {
4388  error = e.GetMsg();
4389  }
4390  if ( align ) {
4391  *it = align;
4392  mapped_count++;
4393  }
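4394  else {
4395  if ( IMessageListener::HaveListeners() ) {
4396  CSeq_loc_Mapper_Message msg(
4397  error.empty() ? "Failed to map seq-align" : error,
4398  eDiag_Error);
4399  msg.SetAlign(**it);
4400  IMessageListener::Post(msg);
4401  }
4402  non_mapped_count++;
4403  if (flags & fAnnotMap_RemoveNonMapping) {
4404  aligns.erase(it);
4405  }
4406  else if (flags & fAnnotMap_ThrowOnFailure) {
4407  NCBI_THROW(CAnnotMapperException, eCanNotMap,
4408  error.empty() ? string("Failed to map seq-align") : error);
4409  }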
4410  }
4411  }
4412  break;
4413  }
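4414  case CSeq_annot::C_Data::e_Graph:
4415  {
4416  CSeq_annot::C_Data::TGraph& graphs = annot.SetData().SetGraph();
4417  string error;
4418  ERASE_ITERATE(CSeq_annot::C_Data::TGraph, it, graphs) {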
4419  error.clear();
4420  CRef<CSeq_graph> graph;
4421  try {
4422  graph = Map(**it);
4423  }
4424  catch (CAnnotMapperException& e) {
4425  error = e.GetMsg();
4426  }
4427  if ( graph ) {
4428  *it = graph;
4429  mapped_count++;
4430  }
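4431  else {
4432  if ( IMessageListener::HaveListeners() ) {
4433  CSeq_loc_Mapper_Message msg(
4434  error.empty() ? "Failed to map seq-graph" : error,
4435  eDiag_Error);
4436  msg.SetGraph(**it);
4437  IMessageListener::Post(msg);
4438  }
4439  non_mapped_count++;
4440  if (flags & fAnnotMap_RemoveNonMapping) {
4441  graphs.erase(it);
4442  }
4443  else if (flags & fAnnotMap_ThrowOnFailure) {
4444  NCBI_THROW(CAnnotMapperException, eCanNotMap,
4445  error.empty() ? string("Failed to map seq-graph") : error);
4446  }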
4447  }
4448  }
4449  break;
4450  }
4451  default:
4452  {
4453  if (flags & fAnnotMap_ThrowOnFailure) {
4454  NCBI_THROW(CAnnotMapperException, eCanNotMap,
4455  "Can not map seq-annot - unsupported type.");
4456  }
4457  ERR_POST_X(30, Warning << "Unsupported CSeq_annot type: " <<
4458  annot.GetData().Which());
4459  }
4460  }
4461  if ( mapped_count ) {
4462  ret = non_mapped_count ? eMapped_Some : eMapped_All;
4463  }
4464  return ret;
4465 }
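The return value of the seq-annot mapping depends only on the two counters. A minimal sketch of that final classification follows, using a local copy of the three result values and a hypothetical helper name `ClassifyResult`; the real enum is CSeq_loc_Mapper_Base::EMapResult and its numeric values are not assumed here.

```cpp
#include <cassert>
#include <cstddef>

// Local illustration of the three result values (not the toolkit enum).
enum EMapResult { eMapped_None, eMapped_Some, eMapped_All };

// Hypothetical helper reproducing the final counter check.
EMapResult ClassifyResult(std::size_t mapped_count,
                          std::size_t non_mapped_count)
{
    if (mapped_count == 0) {
        return eMapped_None;  // nothing mapped, including the empty case
    }
    return non_mapped_count ? eMapped_Some : eMapped_All;
}
```

One consequence worth noting: a seq-annot with no entries yields eMapped_None, because mapped_count stays at zero even though nothing failed.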
4466 
4467 
4468 void CSeq_loc_Mapper_Base::x_SetMiscFlag(EMiscFlags flag, bool value)
4469 {
4470  if ( value ) {
4471  m_MiscFlags |= flag;
4472  }
4473  else {
4474  m_MiscFlags &= ~flag;
4475  }
4476 }
4477 
4478 
4479 NCBI_PARAM_DECL(bool, Mapper, NonMapping_As_Null);
4480 NCBI_PARAM_DEF_EX(bool, Mapper, NonMapping_As_Null, false, eParam_NoThread,
4481  MAPPER_NONMAPPING_AS_NULL);
4482 typedef NCBI_PARAM_TYPE(Mapper, NonMapping_As_Null) TNonMappingAsNullParam;
4483 
4484 
4485 bool CSeq_loc_Mapper_Base::GetNonMappingAsNull(void)
4486 {
4487  return TNonMappingAsNullParam::GetDefault();
4488 }
4489 
4490 