Defect report / some comments on [rand] and N2391

From:

st@quanttec.com (Stephan Tolksdorf)

Newsgroups:

comp.std.c++

Date:

Fri, 21 Sep 2007 17:33:43 GMT

Message-ID:

<fd0tl4$nfm$1@news01.versatel.de>

In the following I comment on several issues in the current [rand]
proposal and N2391.

Furthermore, it might be of interest to you that I implemented (from
scratch) a library that is essentially a superset of [rand], with the
exception of [rand.device] and [rand.dist.samp.genpdf]. In my experience
the [rand] proposal is well thought-out, efficient, flexible, easily
extensible and generally of a very high quality. My thanks go to all
contributors of the proposal.

Seeding
========

1) typo in N2391, section 9, pertaining to [rand.util.seedseq]

The proposed algorithm at the end of item c) (top of page 11) states
"(...) update begin[k + p] by xoring it with r4, update begin[k + q] by
xoring it with r3, (...)", although it should be
"(...) update begin[k + p] by xoring it with r3, update begin[k + q] by
xoring it with r4, (...)".

2) problem in [rand.eng.mers]/6

The mersenne_twister_engine is required to use a seeding method that is
given as an algorithm parameterized over the number of bits W. I doubt
whether the given generalization of an algorithm that was originally
developed only for unsigned 32-bit integers is appropriate for other bit
widths. For instance, W could be theoretically 16 and UIntType a 16-bit
integer, in which case the given multiplier would not fit into the
UIntType. Moreover, T. Nishimura and M. Matsumoto have chosen a
different multiplier for their 64 bit Mersenne Twister [1].
I see two possible resolutions:
a) Restrict the parameter W of the mersenne_twister_template to values
of 32 or 64 and use the multiplier from [1] for the 64-bit case (my
preference).
b) Interpret the state array for any W as a 32-bit array of appropriate
length (and a specified byte order) and always employ the 32-bit
algorithm for seeding.

3) problem in [rand.req.eng]/3

The 3rd table row in [rand.req.eng]/3 requires random number engines to
accept any arithmetic type as a seed, which is then casted to the
engine's result_type and subsequently used for seeding the state of the
engine. The requirement stated as =93Creates an engine with initial state
determined by static_cast<X::result_type>(s)=94 forces random number
engines to either use a seeding method that completely depends on the
result_type (see the discussion of seeding for the
mersenne_twister_engine in point 2 above) or at least to throw away
"bits of randomness" in the seed value if the result_type is smaller
than the seed type. This seems to be inappropriate for many modern
random number generators, in particular F2-linear or cryptographic ones,
which operate on an internal bit array that in principle is independent
of the type of numbers returned.

I propose to change the wording to a version similar to
=93Creates an engine with initial state determined by
static_cast<UintType>(s), where UintType is an implementation specific
unsigned integer type.=94
Additionally, the definition of s in [rand.req.eng]/1 c) could be
restricted to unsigned integer types.
Similarly, the type of the seed in [rand.req.adapt]/3 e) could be left
unspecified.

4) comment on [rand.req.adapt]/3 e)

If an engine adaptor is invoked with an argument of type seed_seq, then
all base engines are specified to be seeded with this seed_seq. As
seed_seq's randomization method is qualified as constant, this procedure
will effectively initialize all base engines with the same seed (though
the resulting state might still differ to a certain degree if the
engines are of different types). It is not clear whether this mode of
operation is in general appropriate, hence - as far as the stated
requirements are of general nature and not just specific to the engine
adaptors provided by the library =96 it might be better to leave the
behaviour unspecified, since the current definition of seed_seq does not
allow for a generally satisfying specification.

5) comment on N2391, section 10, pertaining to [rand.util.seedseq]

I do not agree with the assessment that the member seed_seq::size
becomes useless after the changes proposed in N2391. The number of 32
bit seeds that were supplied to seed_seq is a valuable indication of the
entropy/ bits of randomness stored in seed_seq, which could be exploited
in (non-standard) engines to improve the quality of seeding. Moreover,
it is quite useful to know the size of the array of seeds before one
calls get_seeds with an output iterator.

6) proposal for a customizable seed_seq [rand.util.seedseq]

The proper way to seed random number engines seems to be the most
frequently discussed issue of the [rand] proposal. While the new
seed_seq approach is already rather general and probably sufficient for
most situations, it is unlikely to be optimal in every case (one problem
was pointed out in point 5 above). In some situations it might, for
instance, be better to seed the state with a cryptographic generator.

In my opinion this is a pretty strong argument for extending the
standard with a simple facility to customize the seeding procedure. This
could, for example, be done with the following minimal changes:

a) Turn the interface specification of [rand.util.seedseq]/2 into a
=93SeedSeq=94 requirement, where the exact behaviour of the constructors =
and
the randomize method are left unspecified and where the const
qualification for randomize is removed. Classes implementing this
interface are additionally required to specialize the traits class in c).

b) Provide the class seed_seq as a default implementation of the SeedSeq
interface.

c) Supplement the seed_seq with a traits class

    template <typename T>
    struct is_seed_seq {
        static const bool value = false;
    }

    and the specialization

    template <>
    struct is_seed_seq<seed_seq> {
        static const bool value = true;
    }

    which users can supplement with further specializations.

d) Change [rand.req.eng]/1 d) to =93q is an lvalue of a type that fulfils
the SeedSeq requirements=94, and modify the constructors and seed methods
in [rand.eng] appropriately (the actual implementation could be done
using the SFINAE technique).

Distributions
=============

7) defect in [rand.dist.samp.genpdf]

[rand.dist.samp.genpdf] describes the interface for a distribution
template that is meant to simulate random numbers from any general
distribution given only the density and the support of the distribution.
I'm not aware of any general purpose algorithm that would be capable of
correctly and efficiently implementing the described functionality. From
what I know, this is essentially an unsolved research problem. Existing
algorithms either require more knowledge about the distribution and the
problem domain or work only under very limited circumstances. Even the
state of the art special purpose library UNU.RAN [2] does not solve the
problem in full generality, and in any case, testing and customer
support for such a library feature would be a nightmare.
For these reasons, I propose to delete section [rand.dist.samp.genpdf].

8) comment on [rand.req.dist]/9

The requirement "P shall have a declaration of the form "typedef X
distribution_type" effectively makes the use of inheritance for
implementing distributions very inconvenient, because the child of a
distribution class in general will not satisfy this requirement.
In my opinion the benefits of having a typedef in the parameter class
pointing back to the distribution class are not worth the hassle this
requirement causes. [In my code base I never made use of the nested
typedef but on several occasions could have profited from being able to
use simple inheritance for the implementation of a distribution class.]
I propose to drop this requirement.

9) unnecessary restriction in [rand.dist.norm.{chisq, f, t}]

chi_squared_distribution, fisher_f_distribution and
student_t_distribution have parameters for the "degrees of freedom" n
and m that are specified as integers. For the following two reasons this
is an unnecessary restriction: First, in many applications such as
Bayesian inference or Monte Carlo simulations it is more convenient to
treat the respective parameters as continuous variables. Second, the
standard non-naive algorithms (i.e. O(1) algorithms) for simulating from
these distributions work with floating-point parameters anyway (all
three distributions could be easily implemented using the Gamma
distribution, for instance).
For these reasons, I propose to change the type of the respective
parameters to double.

Similar arguments could in principle be made for the parameters t and k
of the discrete binomial_distribution and
negative_binomial_distribution, though in both cases continuous
parameters are less frequently used in practice and in case of the
binomial_distribution the implementation would be significantly
complicated by a non-discrete parameter (in most implementations one
would need an approximation of the log-gamma function instead of just
the log-factorial function).

10) unfortunate naming in [rand.dist.bern.bin] and [rand.dist.bern.negbin=
]

In my opinion the choice of name for the "t" parameter of the
binomial_distribution is very unfortunate. In virtually every internet
reference, book and software implementation this parameter is called "n"
instead, see for example Wikipedia [3], Mathworld [4], Evans et al. [5],
the R statistical computing language [6], Mathematica [7] and Matlab [8].

Similarly, the choice of "k" for the parameter of the negative binomial
distributions is rather unusual. The most common choice for the negative
binomial distribution seems to be "r" instead.

Choosing unusual names for the parameters causes confusion among users
and makes the interface unnecessarily inconvenient to use. For these
reasons, I propose to change the name of the respective parameters to
"n" and "r".

11) comment on [rand.dist.samp.discrete]

a) The specification for discrete_distribution requires the member
probabilities() to return a vector of _standardized_ probabilities,
which forces the implementation every time to divide each probability by
the sum of all probabilities, as the sum will in practice almost never
be exactly 1.0. This is unnecessarily inefficient as the implementation
would otherwise not need to compute the standardized probabilities at
all and could instead work with the non-standardized probabilities and
the sum. If there was no standardization the user would just get back
the probabilities that were previously supplied to the distribution
object, which to me seems to be the more obvious solution.

b)The behaviour of discrete_distribution is not specified in case the
number of given probabilities is larger than the maximum number
representable by the IntType.

I propose to change the specification such that the non-standardized
probabilities need to be returned and that an additional requirement is
included for the number of probabilities to be smaller than the maximum
of IntType.

12) comment on [rand.dist.samp.pconst]

a) The discussion in point 11 above regarding probabilities() similarly
applies to the method densities() of piecewise_constant_distribution.

b) The design of the constructor
"template <class InputIteratorB, class InputIteratorW>
piecewise_constant_distribution(
InputIteratorB firstB, InputIteratorB lastB,
InputIteratorW firstW);"
is unnecessarily unsafe, as there is no separate end-iterator given for
the weights. I can't see any performance or convenience reasons that
would justify the risks inherent in such a function interface, in
particular the risk that input error might go unnoticed.
I propose to add an "InputIteratorW lastW" argument to the interface.

OTHER
=====

13) editorial issue in [rand.adapt.disc]/3

Since the template parameter p and r are of type size_t, the member n in
the class exposition should have type size_t, too.

14) defect in [rand.util.canonical]/3

The complexity of generate_canonical is specified to be "exactly
k=max(1, ceil(b/log2 R)) invocations of g". This terms involves a
logarithm that is not rounded and hence can not (in general) be computed
at compile time. As this function template is performance critical,
I propose to replace ceil(b/log2 R) with ceil(b/floor(log2 R)).

15) overly restrictive specification in N2391, section 2,
     pertaining to [alg.random.shuffle]

The recommended revised specification for the third form of
random_shuffle reads

"The third form of the function takes an object g meeting the
requirements of uniform random number generator (26.4.1.2). This form of
the function makes max(0, n - 1) calls to d(g), where n is last - first,
and d is an object of type uniform_int_distribution<
iterator_traits<RandomAccessIterator>::difference_type>
([rand.dist.uni.int]) that was constructed with arguments (0,n)."

This can be interpreted as requiring (n - 1) calls to a distribution
object with the fixed parameters although the algorithm typically uses a
uniform_int_distribution with different parameters in each iteration.
In any case, such a specification would be overly specific for the
following two reasons:
First, it could be advantageous to make more calls to
uniform_int_distribution in order to avoid relatively costly
reparametrizations.
Second, an implementation might want to use a special purpose
distribution different from uniform_int_distribution.

Personally, I find the old complexity specification "Exactly (last -
first ) - 1 swaps" sufficient.

References
==========

[1]
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/VERSIONS/C-LANG/mt19937-6=
4.c
[2] http://statmath.wu-wien.ac.at/unuran/
[3] http://en.wikipedia.org/wiki/Binomial_distribution,
[4] http://mathworld.wolfram.com/BinomialDistribution.html,
[5] Evans et al. (1993) Statistical Distributions, 2nd E., Wiley,p. 38
[6] http://cran.r-project.org/doc/manuals/fullrefman.pdf, p. 926
[7]http://reference.wolfram.com/mathematica/ref/BinomialDistribution.html
[8]http://www.mathworks.com/access/helpdesk/help/toolbox/stats/binornd.ht=
ml

Best regards,
   Stephan Tolksdorf

---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html ]