NCBI C++ ToolKit
histogram_binning.hpp
Go to the documentation of this file.

Go to the SVN repository for this file.

1 #ifndef UTIL__HISTOGRAM_BINNING__HPP
2 #define UTIL__HISTOGRAM_BINNING__HPP
3 
4 /* $Id: histogram_binning.hpp 59039 2013-07-23 19:09:51Z kornbluh $
5 * ===========================================================================
6 *
7 * PUBLIC DOMAIN NOTICE
8 * National Center for Biotechnology Information
9 *
10 * This software/database is a "United States Government Work" under the
11 * terms of the United States Copyright Act. It was written as part of
12 * the author's official duties as a United States Government employee and
13 * thus cannot be copyrighted. This software/database is freely available
14 * to the public for use. The National Library of Medicine and the U.S.
15 * Government have not placed any restriction on its use or reproduction.
16 *
17 * Although all reasonable efforts have been taken to ensure the accuracy
18 * and reliability of the software and data, the NLM and the U.S.
19 * Government do not and cannot warrant the performance or results that
20 * may be obtained by using this software or data. The NLM and the U.S.
21 * Government disclaim all warranties, express or implied, including
22 * warranties of performance, merchantability or fitness for any particular
23 * purpose.
24 *
25 * Please cite the author in any work or product based on this material.
26 *
27 * ===========================================================================
28 *
29 * Authors: Michael Kornbluh, NCBI
30 *
31 */
32 
33 /// @file histogram_binning.hpp
34 /// Given a set of data points, automatically put them into bins
35 /// for histogram display. If requested, it can even figure out
36 /// a good number of bins to have.
37 
38 #include <corelib/ncbimisc.hpp>
39 
41 
42 /// Given a set of integer data, this will bin the data
43 /// for use in histograms.
44 /// Usage is generally as follows: AddNumber is called
45 /// repeatedly until all data points are loaded, then
46 /// CalcHistogram is called to get the resulting histogram.
48 public:
49 
50  /// The numeric type this bins. It is big so that it can cover
51  /// whatever type of integer the caller wants to bin.
52  typedef Int8 TValue;
53 
54  /// Constructor.
55  ///
56  /// @param num_bins
57  /// 0 means to automatically pick a reasonable number of bins.
58  /// Note that num_bins is a suggestion, and the answer
59  /// you get could havea a higher or lower number of bins
60  /// depending on misc factors.
61  CHistogramBinning(Uint8 num_bins = 0) : m_iNumBins(num_bins) { }
62 
63  /// This should not normally be needed, since number of bins
64  /// is usually picked in the constructor. Callers might
65  /// use it if they want to get histograms with differing
66  /// number of bins for the same data, for example.
67  ///
68  /// @param num_bins
69  /// This sets the target number of bins for the next histogram
70  /// calculated.
71  /// As in the constructor, 0 means to auto-pick number of bins.
72  void SetNumBins(Uint8 num_bins) {
73  m_iNumBins = num_bins;
74  }
75 
76  /// Give this histogram another number to bin. This is called
77  /// repeatedly, and then the caller will probably want to
78  /// call CalcHistogram to get the answer.
79  /// It is perfectly fine to call this function multiple times
80  /// with the same the_number; the total appearances will be summed
81  /// up over all the calls.
82  ///
83  /// @param the_number
84  /// This is the value of the data point. For example, if this
85  /// is a histogram of various persons' ages then the_number
86  /// would be 42 if the next data point is a 42-year-old person.
87  /// @param num_appearances
88  /// This is how many times the data point appears. For example,
89  /// setting this to "3" is the same as calling the function
90  /// 3 times with a num_appearances value of 1.
91  void AddNumber(TValue the_number, Uint8 num_appearances = 1) {
92  m_mapValueToTotalAppearances[the_number] += num_appearances;
93  }
94 
95  /// This clears all data given to the object. It behaves pretty
96  /// much like destroying the object and then recreating it anew.
97  void clear(void) {
98  m_mapValueToTotalAppearances.clear();
99  }
100 
101  /// Holds the information about a bin.
102  struct SBin {
103  SBin(
104  TValue first_number_arg,
105  TValue last_number_arg,
106  Uint8 total_appearances_arg );
107 
108  /// The start range of the bin (inclusive)
110  /// The end range of the bin (inclusive)
112  /// The total number of data points in this bin
113  /// for all values from first_number to last_number
115  };
116  /// A histogram is given as a vector of bins.
117  typedef vector<SBin> TListOfBins;
118 
119  /// Pick which binning algorithm to use when generating
120  /// the histogram.
121  enum EHistAlgo {
122  /// This algorithm tries to make each bin represent
123  /// values that are clustered together.
124  /// Run-time O(n lg n), where n is the number
125  /// of possible values, NOT the number of data points.
127  /// This algorithm tries to make each bin roughly even in size,
128  /// except the last bin which may be much smaller.
129  /// Run-time O(n lg n), where n is the number
130  /// of possible values, NOT the number of data points.
132  // Maybe future algo: eHistAlgo_SameRangeInEachBin
133 
134  /// The default algorithm
135  eHistAlgo_Default = eHistAlgo_IdentifyClusters
136  };
137 
138  /// Call this after data is loaded via AddNumber, etc. and it
139  /// will give you a list of SBin's, each of which identifies
140  /// a bin of the resulting histogram.
141  ///
142  /// NOTE: Caller must deallocate the returned histogram object!
143  ///
144  /// @param eHistAlgo
145  /// This lets the caller pick which histogram binning algorithm
146  /// to use. The default should suffice for most situations.
147  /// @return
148  /// This returns a newly-allocated list of the bins in the resulting
149  /// histogram. The caller is expected to free this.
150  TListOfBins* CalcHistogram(
151  EHistAlgo eHistAlgo = eHistAlgo_Default) const;
152 
153 private:
154  /// The number of bins to aim for the next time CalcHistogram
155  /// is called. It is 0 to automatically pick the number of bins.
157 
158  /// forbid copying
160  /// forbid assignment
161  CHistogramBinning & operator =(const CHistogramBinning &);
162 
163  /// Maps each value (e.g. age of person if you are histogramming
164  /// ages) to the number of times that value appears.
166  /// Maps a value to the number of times it appears, for the data
167  /// given so far.
169 
170  /// Implementation of eHistAlgo_IdentifyClusters
171  /// NOTE: Caller must deallocate the returned object!
172  TListOfBins * x_IdentifyClusters(void) const;
173  /// Implementation of eHistAlgo_TryForSameNumDataInEachBin
174  /// NOTE: Caller must deallocate the returned object!
175  TListOfBins * x_TryForEvenBins(void) const;
176 
177  /// shared by the various histogram algos
178  enum EInitStatus {
179  /// This indicates that x_InitializeHistogramAlgo
180  /// has done all the work required of it and
181  /// the caller can return the output value, which
182  /// usually happens in trivial cases.
184  /// This means the initialization completed, but
185  /// the caller will need to do more work to shrink the
186  /// number of bins.
187  eInitStatus_KeepGoing
188  };
189 
190  /// This holds the shared logic used by the various histogram
191  /// algorithms.
192  ///
193  /// @param out_listOfBins
194  /// This will be filled in with a dummy histogram consisting
195  /// of one bin for each data value.
196  /// @param out_num_bins
197  /// This will be filled in by the target number of bins.
198  /// This is usually m_iNumBins, but if m_iNumBins is zero
199  /// it will be the automatically picked number of bins.
200  /// @return
201  /// The return value indicates how initialization went. In particular,
202  /// there are cases where the work is already done and out_listOfBins
203  /// can be returned as-is to the user.
204  EInitStatus x_InitializeHistogramAlgo(
205  TListOfBins & out_listOfBins,
206  Uint8 & out_num_bins) const;
207 };
208 
210 
211 #endif /* UTIL__HISTOGRAM_BINNING__HPP */
212 
Given a set of integer data, this will bin the data for use in histograms.
void SetNumBins(Uint8 num_bins)
This should not normally be needed, since number of bins is usually picked in the constructor.
Uint8 m_iNumBins
The number of bins to aim for the next time CalcHistogram is called.
vector< SBin > TListOfBins
A histogram is given as a vector of bins.
CHistogramBinning(const CHistogramBinning &)
forbid copying
void AddNumber(TValue the_number, Uint8 num_appearances=1)
Give this histogram another number to bin.
Int8 TValue
The numeric type this bins.
TMapValueToTotalAppearances m_mapValueToTotalAppearances
Maps a value to the number of times it appears, for the data given so far.
EInitStatus
shared by the various histogram algos
@ eInitStatus_AllAlgoWorkDone
This indicates that x_InitializeHistogramAlgo has done all the work required of it and the caller can...
void clear(void)
This clears all data given to the object.
EHistAlgo
Pick which binning algorithm to use when generating the histogram.
@ eHistAlgo_TryForSameNumDataInEachBin
This algorithm tries to make each bin roughly even in size, except the last bin which may be much sma...
@ eHistAlgo_IdentifyClusters
This algorithm tries to make each bin represent values that are clustered together.
map< TValue, Uint8 > TMapValueToTotalAppearances
Maps each value (e.g.
CHistogramBinning(Uint8 num_bins=0)
Constructor.
int64_t Int8
8-byte (64-bit) signed integer
Definition: ncbitype.h:104
uint64_t Uint8
8-byte (64-bit) unsigned integer
Definition: ncbitype.h:105
#define END_NCBI_SCOPE
End previously defined NCBI scope.
Definition: ncbistl.hpp:103
#define BEGIN_NCBI_SCOPE
Define ncbi namespace.
Definition: ncbistl.hpp:100
Miscellaneous common-use basic types and functionality.
NCBI_XUTIL_EXPORT
Parameter to control printing diagnostic message about conversion of static array data from a differe...
Definition: static_set.hpp:72
Holds the information about a bin.
Uint8 total_appearances
The total number of data points in this bin for all values from first_number to last_number.
TValue first_number
The start range of the bin (inclusive)
TValue last_number
The end range of the bin (inclusive)
Modified on Fri Sep 20 14:57:58 2024 by modify_doxy.py rev. 669887