Class KolmogorovSmirnovTest

    • Method Detail

      • with

        public KolmogorovSmirnovTest with​(PValueMethod v)
        Return an instance with the configured p-value method.

        For the one-sample two-sided test, Kolmogorov's asymptotic approximation can be used; otherwise the p-value uses the distribution of the D statistic.

        For the two-sample test, the exact p-value can be computed for small sample sizes; otherwise the computation falls back to the asymptotic approximation. Alternatively a p-value can be estimated from the combined distribution of the samples. This requires a source of randomness.
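
        For example, estimation of the two-sample p-value can be selected together with a source of randomness. This is a minimal sketch; the PValueMethod.ESTIMATE constant is the one referenced elsewhere in this documentation, and RandomSource is the Commons RNG factory used in the statistic example below.

         // Estimate the two-sample p-value; this method requires a source of randomness
         KolmogorovSmirnovTest test = KolmogorovSmirnovTest.withDefaults()
             .with(PValueMethod.ESTIMATE)
             .with(RandomSource.KISS.create());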

        Parameters:
        v - Value.
        Returns:
        an instance
        See Also:
        with(UniformRandomProvider)
      • with

        public KolmogorovSmirnovTest with​(Inequality v)
        Return an instance with the configured inequality.

        Computes the p-value for the two-sample test as \(P(D_{n,m} > d)\) if strict; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the two-sample Kolmogorov-Smirnov statistic, either the two-sided \(D_{n,m}\), or the one-sided \(D_{n,m}^+\) or \(D_{n,m}^-\).
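
        For example, a strict inequality can be requested for the two-sample p-value. This is a minimal sketch; the constant name Inequality.STRICT is an assumption based on the strict/non-strict behaviour described above.

         // Compute the two-sample p-value as P(D > d) rather than P(D >= d)
         KolmogorovSmirnovTest test = KolmogorovSmirnovTest.withDefaults()
             .with(Inequality.STRICT);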

        Parameters:
        v - Value.
        Returns:
        an instance
      • with

        public KolmogorovSmirnovTest with​(org.apache.commons.rng.UniformRandomProvider v)
        Return an instance with the configured source of randomness.

        Applies to the two-sample test when the p-value method is set to ESTIMATE. The randomness is used for sampling of the combined distribution.

        There is no default source of randomness. If the randomness is not set then estimation is not possible and an IllegalStateException will be raised in the two-sample test.
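
        For example, a generator created from Commons RNG can be supplied; a fixed seed gives a reproducible estimate. This is a minimal sketch reusing RandomSource from the statistic example below.

         // Supply the randomness required by the ESTIMATE p-value method
         UniformRandomProvider rng = RandomSource.KISS.create(42);
         KolmogorovSmirnovTest test = KolmogorovSmirnovTest.withDefaults()
             .with(PValueMethod.ESTIMATE)
             .with(rng);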

        Parameters:
        v - Value.
        Returns:
        an instance
        See Also:
        with(PValueMethod)
      • withIterations

        public KolmogorovSmirnovTest withIterations​(int v)
        Return an instance with the configured number of iterations.

        Applies to the two-sample test when the p-value method is set to ESTIMATE. This is the number of sampling iterations used to estimate the p-value. The p-value is computed as a fraction with the number of iterations as the denominator. The number of significant digits in the p-value is upper bounded by log10(iterations); small p-values have fewer significant digits. A large number of iterations is recommended when using a small critical value to reject the null hypothesis.
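
        For example, a larger iteration count narrows the granularity of the estimate. This is a minimal sketch; with 1,000,000 iterations the smallest non-zero p-value that can be reported is 1e-6.

         // Estimate the p-value from 1,000,000 samples of the combined distribution
         KolmogorovSmirnovTest test = KolmogorovSmirnovTest.withDefaults()
             .with(PValueMethod.ESTIMATE)
             .with(RandomSource.KISS.create())
             .withIterations(1_000_000);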

        Parameters:
        v - Value.
        Returns:
        an instance
        Throws:
        IllegalArgumentException - if the number of iterations is not strictly positive
      • statistic

        public double statistic​(double[] x,
                                DoubleUnaryOperator cdf)
        Computes the one-sample Kolmogorov-Smirnov test statistic.
        • two-sided: \(D_n=\sup_x |F_n(x)-F(x)|\)
        • greater: \(D_n^+=\sup_x (F_n(x)-F(x))\)
        • less: \(D_n^-=\sup_x (F(x)-F_n(x))\)

        where \(F\) is the cumulative distribution function (cdf) of the reference distribution, \(n\) is the length of x and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x.

        The cumulative distribution function should map a real value x to a probability in [0, 1]. To use a reference distribution, the CDF can be passed using a method reference:

         // Reference distribution and a sample drawn from it
         UniformContinuousDistribution dist = UniformContinuousDistribution.of(0, 1);
         UniformRandomProvider rng = RandomSource.KISS.create(123);
         double[] x = dist.sampler(rng).samples(100);
         // One-sample statistic against the reference cdf
         double d = KolmogorovSmirnovTest.withDefaults().statistic(x, dist::cumulativeProbability);

        Parameters:
        x - Sample being evaluated.
        cdf - Reference cumulative distribution function.
        Returns:
        Kolmogorov-Smirnov statistic
        Throws:
        IllegalArgumentException - if x does not have length at least 2; or contains NaN values.
        See Also:
        test(double[], DoubleUnaryOperator)
      • statistic

        public double statistic​(double[] x,
                                double[] y)
        Computes the two-sample Kolmogorov-Smirnov test statistic.
        • two-sided: \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\)
        • greater: \(D_{n,m}^+=\sup_x (F_n(x)-F_m(x))\)
        • less: \(D_{n,m}^-=\sup_x (F_m(x)-F_n(x))\)

        where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution that puts mass \(1/m\) at each of the values in y.
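
        For example, the statistic can be computed for two independent samples. This is a minimal sketch reusing the Commons RNG and Commons Statistics classes from the one-sample example above; the distribution parameters are arbitrary.

         UniformRandomProvider rng = RandomSource.KISS.create(123);
         double[] x = UniformContinuousDistribution.of(0, 1).sampler(rng).samples(100);
         double[] y = UniformContinuousDistribution.of(0, 2).sampler(rng).samples(150);
         // Two-sample statistic: the supremum distance between the two empirical distributions
         double d = KolmogorovSmirnovTest.withDefaults().statistic(x, y);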

        Parameters:
        x - First sample.
        y - Second sample.
        Returns:
        Kolmogorov-Smirnov statistic
        Throws:
        IllegalArgumentException - if either x or y does not have length at least 2; or contains NaN values.
        See Also:
        test(double[], double[])
      • test

        public KolmogorovSmirnovTest.OneResult test​(double[] x,
                                                    DoubleUnaryOperator cdf)
        Performs a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x conforms to the reference cumulative distribution function (cdf).

        The test is defined by the AlternativeHypothesis:

        • Two-sided evaluates the null hypothesis that the two distributions are identical, \(F_n(i) = F(i)\) for all \( i \); the alternative is that they are not identical. The statistic is \( \max(D_n^+, D_n^-) \) and the sign of \( D \) is provided.
        • Greater evaluates the null hypothesis that \(F_n(i) \le F(i)\) for all \( i \); the alternative is \(F_n(i) > F(i)\) for at least one \( i \). The statistic is \( D_n^+ \).
        • Less evaluates the null hypothesis that \(F_n(i) \ge F(i)\) for all \( i \); the alternative is \(F_n(i) < F(i)\) for at least one \( i \). The statistic is \( D_n^- \).

        The p-value method defaults to exact. The one-sided p-value uses Smirnov's stable formula:

        \[ P(D_n^+ \ge x) = x \sum_{j=0}^{\lfloor n(1-x) \rfloor} \binom{n}{j} \left(\frac{j}{n} + x\right)^{j-1} \left(1-x-\frac{j}{n} \right)^{n-j} \]

        The two-sided p-value is computed using methods described in Simard & L’Ecuyer (2011). The two-sided test supports an asymptotic p-value using Kolmogorov's formula:

        \[ \lim_{n\to\infty} P(\sqrt{n}D_n > z) = 2 \sum_{i=1}^\infty (-1)^{i-1} e^{-2 i^2 z^2} \]
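
        A result can be obtained and inspected as in the following sketch. The getPValue() accessor on the result is an assumption; it is not defined in this section.

         UniformContinuousDistribution dist = UniformContinuousDistribution.of(0, 1);
         double[] x = dist.sampler(RandomSource.KISS.create(123)).samples(100);
         KolmogorovSmirnovTest.OneResult r =
             KolmogorovSmirnovTest.withDefaults().test(x, dist::cumulativeProbability);
         // Reject the null hypothesis at the 5% significance level (getPValue() assumed)
         boolean reject = r.getPValue() < 0.05;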

        Parameters:
        x - Sample being evaluated.
        cdf - Reference cumulative distribution function.
        Returns:
        test result
        Throws:
        IllegalArgumentException - if x does not have length at least 2; or contains NaN values.
        See Also:
        statistic(double[], DoubleUnaryOperator)
      • test

        public KolmogorovSmirnovTest.TwoResult test​(double[] x,
                                                    double[] y)
        Performs a two-sample Kolmogorov-Smirnov test on samples x and y. The test compares the empirical distributions \(F_n\) and \(F_m\), where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution that puts mass \(1/m\) at each of the values in y.

        The test is defined by the AlternativeHypothesis:

        • Two-sided evaluates the null hypothesis that the two distributions are identical, \(F_n(i) = F_m(i)\) for all \( i \); the alternative is that they are not identical. The statistic is \( \max(D_{n,m}^+, D_{n,m}^-) \) and the sign of \( D \) is provided.
        • Greater evaluates the null hypothesis that \(F_n(i) \le F_m(i)\) for all \( i \); the alternative is \(F_n(i) > F_m(i)\) for at least one \( i \). The statistic is \( D_{n,m}^+ \).
        • Less evaluates the null hypothesis that \(F_n(i) \ge F_m(i)\) for all \( i \); the alternative is \(F_n(i) < F_m(i)\) for at least one \( i \). The statistic is \( D_{n,m}^- \).

        If the p-value method is auto, then an exact p-value computation is attempted if both sample sizes are less than 10000 using the methods presented in Viehmann (2021) and Hodges (1958); otherwise an asymptotic p-value is returned. The two-sided p-value is \(\overline{F}(d, \sqrt{mn / (m + n)})\) where \(\overline{F}\) is the complementary cumulative distribution function of the two-sided one-sample Kolmogorov-Smirnov distribution. The one-sided p-value uses an approximation from Hodges (1958) Eq 5.3.

        \(D_{n,m}\) has a discrete distribution. This makes the p-value associated with the null hypothesis \(H_0 : D_{n,m} > d \) differ from \(H_0 : D_{n,m} \ge d \) by the mass of the observed value \(d\). These can be distinguished using an Inequality parameter. This is ignored for large samples.

        If the data contains ties there is no defined ordering in the tied region to use for the difference between the two empirical distributions. Each ordering of the tied region may create a different D statistic, and all possible orderings generate a distribution for the D value. In this case the tied region is traversed entirely and the effect on the D value is evaluated at the end of the tied region. This is the path with the least change on the D statistic. The path with the greatest change on the D statistic is also computed as the upper bound on D. If these two values are different then the tied region is known to generate a distribution for the D statistic and the p-value is an overestimate for the cases with a larger D statistic. The presence of any significant tied regions is returned in the result.

        If the p-value method is ESTIMATE then the p-value is estimated by repeated sampling of the combined distribution of x and y. The p-value is the frequency with which a sample creates a D statistic greater than or equal to (or greater than, for strict inequality) the observed value. In this case a source of randomness must be configured or an IllegalStateException will be raised. The p-value for the upper bound on D will not be estimated and is set to NaN. This estimation procedure is not affected by ties in the data and is increasingly robust for larger datasets. The method is modeled after ks.boot in the R Matching package (Sekhon (2011)).
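
        A sketch of the estimation path follows; the constant names and the result accessor are assumptions as noted in the examples above.

         UniformRandomProvider rng = RandomSource.KISS.create(123);
         double[] x = UniformContinuousDistribution.of(0, 1).sampler(rng).samples(100);
         double[] y = UniformContinuousDistribution.of(0, 1).sampler(rng).samples(150);
         // Estimate the two-sample p-value by sampling the combined distribution
         KolmogorovSmirnovTest.TwoResult r = KolmogorovSmirnovTest.withDefaults()
             .with(PValueMethod.ESTIMATE)
             .with(rng)
             .withIterations(10000)
             .test(x, y);
         double p = r.getPValue();  // assumed accessor, as in the one-sample example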

        Parameters:
        x - First sample.
        y - Second sample.
        Returns:
        test result
        Throws:
        IllegalArgumentException - if either x or y does not have length at least 2; or contains NaN values.
        IllegalStateException - if the p-value method is ESTIMATE and there is no source of randomness.
        See Also:
        statistic(double[], double[])