java.lang.Object
- org.apache.commons.statistics.inference.KolmogorovSmirnovTest

```
public final class KolmogorovSmirnovTest
extends Object
```
Implements the Kolmogorov-Smirnov (K-S) test for equality of continuous distributions.
The one-sample test uses a D statistic based on the maximum deviation of the empirical distribution of sample data points from the distribution expected under the null hypothesis.
The two-sample test uses a D statistic based on the maximum deviation of the two empirical distributions of sample data points. The two-sample tests evaluate the null hypothesis that the two samples x and y come from the same underlying distribution.
References:
1. Marsaglia, G., Tsang, W. W., & Wang, J. (2003). Evaluating Kolmogorov's Distribution. Journal of Statistical Software, 8(18), 1–4.
2. Simard, R., & L’Ecuyer, P. (2011). Computing the Two-Sided Kolmogorov-Smirnov Distribution. Journal of Statistical Software, 39(11), 1–18.
3. Sekhon, J. S. (2011). Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R. Journal of Statistical Software, 42(7), 1–52.
4. Viehmann, T. (2021). Numerically more stable computation of the p-values for the two-sample Kolmogorov-Smirnov test. arXiv:2102.08037
5. Hodges, J. L. (1958). The significance probability of the Smirnov two-sample test. Arkiv för Matematik, 3(5), 469–486.
Note that [1] contains an error in computing h; refer to MATH-437 for details.
Since:

1.1

See Also:

Kolmogorov-Smirnov (K-S) test (Wikipedia)

Nested Class Summary

Nested Classes
Modifier and Type	Class	Description
`static class`	`KolmogorovSmirnovTest.OneResult`	Result for the one-sample Kolmogorov-Smirnov test.
`static class`	`KolmogorovSmirnovTest.TwoResult`	Result for the two-sample Kolmogorov-Smirnov test.

Method Summary

Modifier and Type	Method	Description
`double`	`statistic(double[] x, double[] y)`	Computes the two-sample Kolmogorov-Smirnov test statistic.
`double`	`statistic(double[] x, DoubleUnaryOperator cdf)`	Computes the one-sample Kolmogorov-Smirnov test statistic.
`KolmogorovSmirnovTest.TwoResult`	`test(double[] x, double[] y)`	Performs a two-sample Kolmogorov-Smirnov test on samples `x` and `y`.
`KolmogorovSmirnovTest.OneResult`	`test(double[] x, DoubleUnaryOperator cdf)`	Performs a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that `x` conforms to the distribution cumulative distribution function (`cdf`).
`KolmogorovSmirnovTest`	`with(org.apache.commons.rng.UniformRandomProvider v)`	Return an instance with the configured source of randomness.
`KolmogorovSmirnovTest`	`with(AlternativeHypothesis v)`	Return an instance with the configured alternative hypothesis.
`KolmogorovSmirnovTest`	`with(Inequality v)`	Return an instance with the configured inequality.
`KolmogorovSmirnovTest`	`with(PValueMethod v)`	Return an instance with the configured p-value method.
`static KolmogorovSmirnovTest`	`withDefaults()`	Return an instance using the default options.
`KolmogorovSmirnovTest`	`withIterations(int v)`	Return an instance with the configured number of iterations.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Method Detail
  - withDefaults
```
public static KolmogorovSmirnovTest withDefaults()
```
    Return an instance using the default options.
    
    AlternativeHypothesis.TWO_SIDED
    PValueMethod.AUTO
    Inequality.NON_STRICT
    RNG = none
    Iterations = 1000
    Returns:
    
    default instance
  - with
```
public KolmogorovSmirnovTest with(AlternativeHypothesis v)
```
    Return an instance with the configured alternative hypothesis.
    
    Parameters:
    
    v - Value.
    
    Returns:
    
    an instance
  - with
```
public KolmogorovSmirnovTest with(PValueMethod v)
```
    Return an instance with the configured p-value method.
    For the one-sample two-sided test Kolmogorov's asymptotic approximation can be used; otherwise the p-value uses the distribution of the D statistic.
    For the two-sample test the exact p-value can be computed for small sample sizes; otherwise the p-value resorts to the asymptotic approximation. Alternatively a p-value can be estimated from the combined distribution of the samples. This requires a source of randomness.
    
    Parameters:
    
    v - Value.
    
    Returns:
    
    an instance
    
    See Also:
    
    with(UniformRandomProvider)
  - with
```
public KolmogorovSmirnovTest with(Inequality v)
```
    Return an instance with the configured inequality.
    Computes the p-value for the two-sample test as \(P(D_{n,m} > d)\) if strict; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic, either the two-sided \(D_{n,m}\) or one-sided \(D_{n,m}^+\) or \(D_{n,m}^-\).
    
    Parameters:
    
    v - Value.
    
    Returns:
    
    an instance
  - with
```
public KolmogorovSmirnovTest with(org.apache.commons.rng.UniformRandomProvider v)
```
    Return an instance with the configured source of randomness.
    Applies to the two-sample test when the p-value method is set to ESTIMATE. The randomness is used for sampling of the combined distribution.
    There is no default source of randomness. If the randomness is not set then estimation is not possible and an IllegalStateException will be raised in the two-sample test.
    
    Parameters:
    
    v - Value.
    
    Returns:
    
    an instance
    
    See Also:
    
    with(PValueMethod)
  - withIterations
```
public KolmogorovSmirnovTest withIterations(int v)
```
    Return an instance with the configured number of iterations.
    Applies to the two-sample test when the p-value method is set to ESTIMATE. This is the number of sampling iterations used to estimate the p-value. The p-value is a fraction using the iterations as the denominator. The number of significant digits in the p-value is upper bounded by log₁₀(iterations); small p-values have fewer significant digits. A large number of iterations is recommended when using a small critical value to reject the null hypothesis.
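    As a worked illustration, write \(N\) for the number of iterations and \(k\) for the number of resamples whose statistic reaches the observed value (both symbols are introduced here only for the example):
    \[ \hat{p} = \frac{k}{N}, \quad k \in \{0, 1, \ldots, N\}, \qquad \min\{\hat{p} : \hat{p} > 0\} = \frac{1}{N} \]
    With the default \(N = 1000\) the estimate moves in steps of \(10^{-3}\), so it carries at most \(\log_{10} 1000 = 3\) significant digits, and an estimate of \(0.002\) rests on only \(k = 2\) resamples. Testing at a critical value of \(0.001\) would therefore call for substantially more iterations, for example \(N = 10^5\), where the same p-value corresponds to \(k = 100\).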
    
    Parameters:
    
    v - Value.
    
    Returns:
    
    an instance
    
    Throws:
    
    IllegalArgumentException - if the number of iterations is not strictly positive
  - statistic
```
public double statistic(double[] x,
                        DoubleUnaryOperator cdf)
```
    Computes the one-sample Kolmogorov-Smirnov test statistic.
    
    two-sided: \(D_n=\sup_x |F_n(x)-F(x)|\)
    greater: \(D_n^+=\sup_x (F_n(x)-F(x))\)
    less: \(D_n^-=\sup_x (F(x)-F_n(x))\)
    
    where \(F\) is the cumulative distribution function (cdf) of the reference distribution, \(n\) is the length of x and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x.
    The cumulative distribution function should map a real value x to a probability in [0, 1]. To use a reference distribution the CDF can be passed using a method reference:
        UniformContinuousDistribution dist = UniformContinuousDistribution.of(0, 1);
        UniformRandomProvider rng = RandomSource.KISS.create(123);
        double[] x = dist.sampler(rng).samples(100);
        double d = KolmogorovSmirnovTest.withDefaults().statistic(x, dist::cumulativeProbability);
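    The definitions of \(D_n\), \(D_n^+\) and \(D_n^-\) can be evaluated directly at the sorted sample points, because \(F_n\) only changes at those points. The following is a minimal JDK-only sketch of that calculation; the class `OneSampleKsSketch` and its method are hypothetical names used for illustration and are not part of this library, whose implementation may differ.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

/** Hypothetical helper (not part of the library): one-sample K-S statistics from the definitions. */
final class OneSampleKsSketch {
    private OneSampleKsSketch() {}

    /** Returns {D_n, D_n^+, D_n^-} for sample x against the reference cdf. */
    static double[] statistics(double[] x, DoubleUnaryOperator cdf) {
        double[] sorted = x.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double dPlus = 0;
        double dMinus = 0;
        for (int i = 0; i < n; i++) {
            double f = cdf.applyAsDouble(sorted[i]);
            // F_n jumps from i/n to (i+1)/n at the i-th order statistic (0-based),
            // so F_n - F is largest just after the jump and F - F_n just before it.
            dPlus = Math.max(dPlus, (i + 1.0) / n - f);
            dMinus = Math.max(dMinus, f - (double) i / n);
        }
        return new double[] {Math.max(dPlus, dMinus), dPlus, dMinus};
    }

    public static void main(String[] args) {
        double[] x = {0.7, 0.1, 0.4};
        // Uniform(0, 1) cdf written by hand to keep the sketch self-contained.
        double[] d = statistics(x, t -> Math.min(1, Math.max(0, t)));
        System.out.printf("D=%.4f D+=%.4f D-=%.4f%n", d[0], d[1], d[2]);
    }
}
```

    The two-sided value is taken as the larger of the two one-sided values, which agrees with the supremum of the absolute difference.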
    Parameters:
    
    cdf - Reference cumulative distribution function.
    
    x - Sample being evaluated.
    
    Returns:
    
    Kolmogorov-Smirnov statistic
    
    Throws:
    
    IllegalArgumentException - if data does not have length at least 2; or contains NaN values.
    
    See Also:
    
    test(double[], DoubleUnaryOperator)
  - statistic
```
public double statistic(double[] x,
                        double[] y)
```
    Computes the two-sample Kolmogorov-Smirnov test statistic.
    
    two-sided: \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\)
    greater: \(D_{n,m}^+=\sup_x (F_n(x)-F_m(x))\)
    less: \(D_{n,m}^-=\sup_x (F_m(x)-F_n(x))\)
    
    where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution that puts mass \(1/m\) at each of the values in y.
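    Both empirical distributions are step functions, so the suprema can be found by walking the two sorted samples together. The sketch below does this using only the JDK; `TwoSampleKsSketch` is a hypothetical name and this is not the library's code. Values tied across the samples are consumed together before the difference is evaluated, which is one possible convention for ties and is assumed here.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of the library): two-sample K-S statistics from the definitions. */
final class TwoSampleKsSketch {
    private TwoSampleKsSketch() {}

    /** Returns {D_nm, D_nm^+, D_nm^-} for samples x (length n) and y (length m). */
    static double[] statistics(double[] x, double[] y) {
        double[] a = x.clone();
        double[] b = y.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        int n = a.length;
        int m = b.length;
        int i = 0;
        int j = 0;
        double dPlus = 0;
        double dMinus = 0;
        // Once one sample is exhausted its empirical cdf is 1 and the
        // difference can only shrink towards 0, so the walk can stop.
        while (i < n && j < m) {
            double v = Math.min(a[i], b[j]);
            // Step past every occurrence of v in both samples (handles ties).
            while (i < n && a[i] == v) {
                i++;
            }
            while (j < m && b[j] == v) {
                j++;
            }
            double diff = (double) i / n - (double) j / m; // F_n(v) - F_m(v)
            dPlus = Math.max(dPlus, diff);
            dMinus = Math.max(dMinus, -diff);
        }
        return new double[] {Math.max(dPlus, dMinus), dPlus, dMinus};
    }

    public static void main(String[] args) {
        double[] d = statistics(new double[] {1, 3, 5}, new double[] {2, 4, 6});
        System.out.printf("D=%.4f D+=%.4f D-=%.4f%n", d[0], d[1], d[2]);
    }
}
```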
    Parameters:
    
    x - First sample.
    
    y - Second sample.
    
    Returns:
    
    Kolmogorov-Smirnov statistic
    
    Throws:
    
    IllegalArgumentException - if either x or y has length less than 2 or contains NaN values.
    
    See Also:
    
    test(double[], double[])
  - test
```
public KolmogorovSmirnovTest.OneResult test(double[] x,
                                            DoubleUnaryOperator cdf)
```
    Performs a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x conforms to the distribution cumulative distribution function (cdf).
    The test is defined by the AlternativeHypothesis:
    
    Two-sided evaluates the null hypothesis that the two distributions are identical, \(F_n(i) = F(i)\) for all \( i \); the alternative is that they are not identical. The statistic is \( \max(D_n^+, D_n^-) \) and the sign of \( D \) is provided.
    Greater evaluates the null hypothesis that \(F_n(i) \le F(i)\) for all \( i \); the alternative is \(F_n(i) > F(i)\) for at least one \( i \). The statistic is \( D_n^+ \).
    Less evaluates the null hypothesis that \(F_n(i) \ge F(i)\) for all \( i \); the alternative is \(F_n(i) < F(i)\) for at least one \( i \). The statistic is \( D_n^- \).
    
    The p-value method defaults to exact. The one-sided p-value uses Smirnov's stable formula:
    \[ P(D_n^+ \ge x) = x \sum_{j=0}^{\lfloor n(1-x) \rfloor} \binom{n}{j} \left(\frac{j}{n} + x\right)^{j-1} \left(1-x-\frac{j}{n} \right)^{n-j} \]
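    A direct transcription of this sum into code can help when checking values by hand. The sketch below is such a transcription using only the JDK; `SmirnovOneSidedSketch` is a hypothetical name, and the naive evaluation is only suitable for small \(n\) (it makes no attempt at the numerical care a production implementation needs).

```java
/** Hypothetical helper (not part of the library): naive evaluation of Smirnov's one-sided formula. */
final class SmirnovOneSidedSketch {
    private SmirnovOneSidedSketch() {}

    /** Returns P(D_n^+ >= x) for a continuous reference distribution. */
    static double sf(double x, int n) {
        if (x <= 0) {
            return 1;
        }
        if (x >= 1) {
            return 0;
        }
        int jMax = (int) Math.floor(n * (1 - x));
        double sum = 0;
        double binom = 1; // C(n, 0)
        for (int j = 0; j <= jMax; j++) {
            double a = (double) j / n + x;
            // Clamp at zero to guard against a tiny negative value from rounding.
            double b = Math.max(0, 1 - x - (double) j / n);
            sum += binom * Math.pow(a, j - 1) * Math.pow(b, n - j);
            binom = binom * (n - j) / (j + 1); // C(n, j + 1)
        }
        return Math.min(1, x * sum);
    }

    public static void main(String[] args) {
        System.out.println(sf(0.25, 2));
        System.out.println(sf(0.5, 2));
    }
}
```

    For \(n = 1\) the sum has a single term and reduces to \(1 - x\), which is a convenient sanity check.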
    The two-sided p-value is computed using methods described in Simard & L’Ecuyer (2011). The two-sided test supports an asymptotic p-value using Kolmogorov's formula:
    \[ \lim_{n\to\infty} P(\sqrt{n}D_n > z) = 2 \sum_{i=1}^\infty (-1)^{i-1} e^{-2 i^2 z^2} \]
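    Kolmogorov's limiting series converges quickly except for small \(z\), so it can be summed term by term. The sketch below is a JDK-only illustration; `KolmogorovAsymptoticSketch`, the cutoff of 0.2 and the termination threshold are assumptions made for this example and do not describe the library's implementation.

```java
/** Hypothetical helper (not part of the library): Kolmogorov's asymptotic survival function. */
final class KolmogorovAsymptoticSketch {
    private KolmogorovAsymptoticSketch() {}

    /** Returns 2 * sum_{i>=1} (-1)^(i-1) exp(-2 i^2 z^2). */
    static double sf(double z) {
        // Assumed cutoff: for small z the alternating series converges slowly
        // and the probability is extremely close to 1, so return 1 directly.
        if (z < 0.2) {
            return 1;
        }
        double sum = 0;
        double sign = 1;
        for (int i = 1; i <= 1000; i++) {
            double term = Math.exp(-2.0 * i * i * z * z);
            sum += sign * term;
            if (term < 1e-17) {
                break; // remaining terms cannot change the double result
            }
            sign = -sign;
        }
        return Math.min(1, Math.max(0, 2 * sum));
    }

    /** Asymptotic two-sided p-value for a one-sample statistic d with sample size n. */
    static double pValue(double d, int n) {
        return sf(Math.sqrt(n) * d);
    }

    public static void main(String[] args) {
        System.out.println(sf(1.0));
        System.out.println(pValue(0.1, 100));
    }
}
```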
    Parameters:
    
    x - Sample being evaluated.
    
    cdf - Reference cumulative distribution function.
    
    Returns:
    
    test result
    
    Throws:
    
    IllegalArgumentException - if data does not have length at least 2; or contains NaN values.
    
    See Also:
    
    statistic(double[], DoubleUnaryOperator)
  - test
```
public KolmogorovSmirnovTest.TwoResult test(double[] x,
                                            double[] y)
```
    Performs a two-sample Kolmogorov-Smirnov test on samples x and y. Tests the empirical distributions \(F_n\) and \(F_m\), where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution that puts mass \(1/m\) at each of the values in y.
    The test is defined by the AlternativeHypothesis:
    
    Two-sided evaluates the null hypothesis that the two distributions are identical, \(F_n(i) = F_m(i)\) for all \( i \); the alternative is that they are not identical. The statistic is \( \max(D_{n,m}^+, D_{n,m}^-) \) and the sign of \( D \) is provided.
    Greater evaluates the null hypothesis that \(F_n(i) \le F_m(i)\) for all \( i \); the alternative is \(F_n(i) > F_m(i)\) for at least one \( i \). The statistic is \( D_{n,m}^+ \).
    Less evaluates the null hypothesis that \(F_n(i) \ge F_m(i)\) for all \( i \); the alternative is \(F_n(i) < F_m(i)\) for at least one \( i \). The statistic is \( D_{n,m}^- \).
    
    If the p-value method is AUTO, then an exact p-value computation is attempted if both sample sizes are less than 10000 using the methods presented in Viehmann (2021) and Hodges (1958); otherwise an asymptotic p-value is returned. The two-sided p-value is \(\overline{F}(d, \sqrt{mn / (m + n)})\) where \(\overline{F}\) is the complementary cumulative distribution function of the two-sided one-sample Kolmogorov-Smirnov distribution. The one-sided p-value uses an approximation from Hodges (1958) Eq 5.3.
    \(D_{n,m}\) has a discrete distribution. This makes the p-value associated with the null hypothesis \(H_0 : D_{n,m} \gt d \) differ from \(H_0 : D_{n,m} \ge d \) by the mass of the observed value \(d\). These can be distinguished using an Inequality parameter. This is ignored for large samples.
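    The effect of the inequality can be seen on a tiny example by brute force. With no ties, every assignment of the \(n + m\) ranks to the two samples is equally likely under the null hypothesis, so the two-sided p-value can be counted exactly. The sketch below enumerates all assignments; `TwoSampleExactSketch` is a hypothetical name, the enumeration is only feasible for very small samples, and it is not the Viehmann (2021) or Hodges (1958) method.

```java
/** Hypothetical helper (not part of the library): exact two-sided p-value by enumeration. */
final class TwoSampleExactSketch {
    private TwoSampleExactSketch() {}

    /** Returns P(D > d) if strict, otherwise P(D >= d), for sample sizes n and m with no ties. */
    static double pValue(int n, int m, double d, boolean strict) {
        int total = n + m;
        if (n < 1 || m < 1 || total > 20) {
            throw new IllegalArgumentException("enumeration is limited to n + m <= 20");
        }
        // D is a multiple of 1/(n*m); compare on the integer lattice to avoid rounding issues.
        long target = Math.round(d * n * m);
        long hits = 0;
        long arrangements = 0;
        for (int mask = 0; mask < (1 << total); mask++) {
            if (Integer.bitCount(mask) != n) {
                continue; // a set bit marks a rank that belongs to x
            }
            arrangements++;
            int i = 0;
            int j = 0;
            long max = 0;
            for (int k = 0; k < total; k++) {
                if ((mask & (1 << k)) != 0) {
                    i++;
                } else {
                    j++;
                }
                // n*m*|F_n - F_m| after the first k+1 ranks
                max = Math.max(max, Math.abs((long) i * m - (long) j * n));
            }
            if (strict ? max > target : max >= target) {
                hits++;
            }
        }
        return (double) hits / arrangements;
    }

    public static void main(String[] args) {
        System.out.println(pValue(3, 3, 1.0, false));
        System.out.println(pValue(3, 3, 1.0, true));
    }
}
```

    For \(n = m = 3\) and \(d = 1\), only the 2 fully separated arrangements out of 20 reach the observed value, so the non-strict p-value is \(2/20\) while the strict p-value is \(0\); the difference is the probability mass at \(d\).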
    If the data contains ties there is no defined ordering in the tied region to use for the difference between the two empirical distributions. Each ordering of the tied region may create a different D statistic. All possible orderings generate a distribution for the D value. In this case the tied region is traversed entirely and the effect on the D value evaluated at the end of the tied region. This is the path with the least change on the D statistic. The path with the greatest change on the D statistic is also computed as the upper bound on D. If these two values are different then the tied region is known to generate a distribution for the D statistic and the p-value is an overestimate for the cases with a larger D statistic. The presence of any significant tied regions is returned in the result.
    If the p-value method is ESTIMATE then the p-value is estimated by repeated sampling of the joint distribution of x and y. The p-value is the frequency with which a sample creates a D statistic greater than or equal to (or greater than for strict inequality) the observed value. In this case a source of randomness must be configured or an IllegalStateException will be raised. The p-value for the upper bound on D will not be estimated and is set to NaN. This estimation procedure is not affected by ties in the data and is increasingly robust for larger datasets. The method is modeled after ks.boot in the R Matching package (Sekhon (2011)).
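    The estimation idea can be sketched with the JDK alone. The example below pools the two samples, draws resamples of sizes \(n\) and \(m\) with replacement (a bootstrap, in the spirit of ks.boot), and reports the fraction whose statistic reaches the observed one. `KsEstimateSketch` is a hypothetical name, `java.util.Random` stands in for a `UniformRandomProvider`, and whether the library samples in exactly this way is an assumption.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical helper (not part of the library): resampling estimate of the two-sample p-value. */
final class KsEstimateSketch {
    private KsEstimateSketch() {}

    /** Two-sided statistic scaled by n*m so that comparisons are exact integers. */
    static long scaledD(double[] x, double[] y) {
        double[] a = x.clone();
        double[] b = y.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        int n = a.length;
        int m = b.length;
        int i = 0;
        int j = 0;
        long max = 0;
        while (i < n && j < m) {
            double v = Math.min(a[i], b[j]);
            while (i < n && a[i] == v) {
                i++;
            }
            while (j < m && b[j] == v) {
                j++;
            }
            max = Math.max(max, Math.abs((long) i * m - (long) j * n));
        }
        return max;
    }

    /** Fraction of resamples with D >= observed (or D > observed if strict). */
    static double estimateP(double[] x, double[] y, int iterations, boolean strict, Random rng) {
        long observed = scaledD(x, y);
        double[] pooled = Arrays.copyOf(x, x.length + y.length);
        System.arraycopy(y, 0, pooled, x.length, y.length);
        double[] bx = new double[x.length];
        double[] by = new double[y.length];
        int hits = 0;
        for (int it = 0; it < iterations; it++) {
            for (int k = 0; k < bx.length; k++) {
                bx[k] = pooled[rng.nextInt(pooled.length)];
            }
            for (int k = 0; k < by.length; k++) {
                by[k] = pooled[rng.nextInt(pooled.length)];
            }
            long d = scaledD(bx, by);
            if (strict ? d > observed : d >= observed) {
                hits++;
            }
        }
        return (double) hits / iterations; // the iterations are the denominator
    }

    public static void main(String[] args) {
        double[] x = {0.1, 0.5, 0.9, 1.3, 1.7};
        double[] y = {0.2, 0.4, 1.1, 2.5, 3.0};
        System.out.println(estimateP(x, y, 1000, false, new Random(123)));
    }
}
```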
    Parameters:
    
    x - First sample.
    
    y - Second sample.
    
    Returns:
    
    test result
    
    Throws:
    
    IllegalArgumentException - if either x or y has length less than 2 or contains NaN values.
    
    IllegalStateException - if the p-value method is ESTIMATE and there is no source of randomness.
    
    See Also:
    
    statistic(double[], double[])
