Class KolmogorovSmirnovTest
- java.lang.Object
-
- org.apache.commons.statistics.inference.KolmogorovSmirnovTest
-
public final class KolmogorovSmirnovTest extends Object
Implements the Kolmogorov-Smirnov (K-S) test for equality of continuous distributions.
The one-sample test uses a D statistic based on the maximum deviation of the empirical distribution of sample data points from the distribution expected under the null hypothesis.
The two-sample test uses a D statistic based on the maximum deviation of the two empirical distributions of sample data points. The two-sample tests evaluate the null hypothesis that the two samples x and y come from the same underlying distribution.
References:
1. Marsaglia, G., Tsang, W. W., & Wang, J. (2003). Evaluating Kolmogorov's Distribution. Journal of Statistical Software, 8(18), 1–4.
2. Simard, R., & L’Ecuyer, P. (2011). Computing the Two-Sided Kolmogorov-Smirnov Distribution. Journal of Statistical Software, 39(11), 1–18.
3. Sekhon, J. S. (2011). Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R. Journal of Statistical Software, 42(7), 1–52.
4. Viehmann, T. (2021). Numerically more stable computation of the p-values for the two-sample Kolmogorov-Smirnov test. arXiv:2102.08037.
5. Hodges, J. L. (1958). The significance probability of the Smirnov two-sample test. Arkiv för Matematik, 3(5), 469–486.
Note that [1] contains an error in computing h; refer to MATH-437 for details.
- Since:
- 1.1
- See Also:
- Kolmogorov-Smirnov (K-S) test (Wikipedia)
-
-
Nested Class Summary
- static class KolmogorovSmirnovTest.OneResult: Result for the one-sample Kolmogorov-Smirnov test.
- static class KolmogorovSmirnovTest.TwoResult: Result for the two-sample Kolmogorov-Smirnov test.
-
Method Summary
- double statistic(double[] x, double[] y): Computes the two-sample Kolmogorov-Smirnov test statistic.
- double statistic(double[] x, DoubleUnaryOperator cdf): Computes the one-sample Kolmogorov-Smirnov test statistic.
- KolmogorovSmirnovTest.TwoResult test(double[] x, double[] y): Performs a two-sample Kolmogorov-Smirnov test on samples x and y.
- KolmogorovSmirnovTest.OneResult test(double[] x, DoubleUnaryOperator cdf): Performs a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x conforms to the cumulative distribution function (cdf).
- KolmogorovSmirnovTest with(org.apache.commons.rng.UniformRandomProvider v): Return an instance with the configured source of randomness.
- KolmogorovSmirnovTest with(AlternativeHypothesis v): Return an instance with the configured alternative hypothesis.
- KolmogorovSmirnovTest with(Inequality v): Return an instance with the configured inequality.
- KolmogorovSmirnovTest with(PValueMethod v): Return an instance with the configured p-value method.
- static KolmogorovSmirnovTest withDefaults(): Return an instance using the default options.
- KolmogorovSmirnovTest withIterations(int v): Return an instance with the configured number of iterations.
-
-
-
Method Detail
-
withDefaults
public static KolmogorovSmirnovTest withDefaults()
Return an instance using the default options.
- Returns:
- default instance
-
with
public KolmogorovSmirnovTest with(AlternativeHypothesis v)
Return an instance with the configured alternative hypothesis.
- Parameters:
- v - Value.
- Returns:
- an instance
-
with
public KolmogorovSmirnovTest with(PValueMethod v)
Return an instance with the configured p-value method.
For the one-sample two-sided test Kolmogorov's asymptotic approximation can be used; otherwise the p-value uses the distribution of the D statistic.
For the two-sample test the exact p-value can be computed for small sample sizes; otherwise the p-value resorts to the asymptotic approximation. Alternatively a p-value can be estimated from the combined distribution of the samples. This requires a source of randomness.
- Parameters:
- v - Value.
- Returns:
- an instance
- See Also:
with(UniformRandomProvider)
-
with
public KolmogorovSmirnovTest with(Inequality v)
Return an instance with the configured inequality.
Computes the p-value for the two-sample test as \(P(D_{n,m} > d)\) if strict; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the two-sample Kolmogorov-Smirnov statistic, either the two-sided \(D_{n,m}\) or one-sided \(D_{n,m}^+\) or \(D_{n,m}^-\).
- Parameters:
- v - Value.
- Returns:
- an instance
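The difference between the strict and non-strict p-value can be seen by brute force for tiny samples. The sketch below is illustrative only and is not the library's algorithm: the class `KsInequalitySketch` and its methods are hypothetical names, the samples are assumed to contain no ties, and every ordering of the pooled data is assumed equally likely under the null hypothesis. The statistic is held on the integer lattice as \(n m D_{n,m}\) so that the comparison with the observed value is exact.

```java
/** Illustrative sketch only; the class and method names are hypothetical. */
final class KsInequalitySketch {

    /**
     * Enumerates every ordering of n x-values and m y-values (assumed free of ties)
     * and returns {P(D > d), P(D >= d)} for the two-sided statistic, where the
     * observed value is given on the integer lattice as k = d * n * m.
     */
    static double[] tailProbabilities(int n, int m, long k) {
        long[] counts = new long[3]; // {D > d, D >= d, all orderings}
        walk(0, 0, n, m, 0, k, counts);
        return new double[] {
            (double) counts[0] / counts[2],
            (double) counts[1] / counts[2]
        };
    }

    private static void walk(int i, int j, int n, int m, long max, long k, long[] counts) {
        // After i values of x and j values of y: n * m * |F_n - F_m| = |i * m - j * n|.
        long current = Math.max(max, Math.abs((long) i * m - (long) j * n));
        if (i == n && j == m) {
            counts[2]++;
            if (current > k) {
                counts[0]++;
            }
            if (current >= k) {
                counts[1]++;
            }
            return;
        }
        if (i < n) {
            walk(i + 1, j, n, m, current, k, counts);
        }
        if (j < m) {
            walk(i, j + 1, n, m, current, k, counts);
        }
    }

    public static void main(String[] args) {
        // n = m = 2 and observed d = 1, so k = d * n * m = 4.
        double[] p = tailProbabilities(2, 2, 4);
        System.out.println("P(D > d) = " + p[0] + ", P(D >= d) = " + p[1]);
    }
}
```

For n = m = 2 and d = 1, two of the six orderings reach d, so the non-strict probability is 1/3 while the strict probability is 0. The enumeration grows combinatorially and is only practical for very small samples.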
-
with
public KolmogorovSmirnovTest with(org.apache.commons.rng.UniformRandomProvider v)
Return an instance with the configured source of randomness.
Applies to the two-sample test when the p-value method is set to ESTIMATE. The randomness is used for sampling of the combined distribution.
There is no default source of randomness. If the randomness is not set then estimation is not possible and an IllegalStateException will be raised in the two-sample test.
- Parameters:
- v - Value.
- Returns:
- an instance
- See Also:
with(PValueMethod)
-
withIterations
public KolmogorovSmirnovTest withIterations(int v)
Return an instance with the configured number of iterations.
Applies to the two-sample test when the p-value method is set to ESTIMATE. This is the number of sampling iterations used to estimate the p-value. The p-value is a fraction using the iterations as the denominator. The number of significant digits in the p-value is upper bounded by log10(iterations); small p-values have fewer significant digits. A large number of iterations is recommended when using a small critical value to reject the null hypothesis.
- Parameters:
- v - Value.
- Returns:
- an instance
- Throws:
- IllegalArgumentException - if the number of iterations is not strictly positive
-
statistic
public double statistic(double[] x, DoubleUnaryOperator cdf)
Computes the one-sample Kolmogorov-Smirnov test statistic.
- two-sided: \(D_n=\sup_x |F_n(x)-F(x)|\)
- greater: \(D_n^+=\sup_x (F_n(x)-F(x))\)
- less: \(D_n^-=\sup_x (F(x)-F_n(x))\)
where \(F\) is the cumulative distribution function (cdf) of the reference distribution, \(n\) is the length of x and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x.
The cumulative distribution function should map a real value x to a probability in [0, 1]. To use a reference distribution the CDF can be passed using a method reference:
    UniformContinuousDistribution dist = UniformContinuousDistribution.of(0, 1);
    UniformRandomProvider rng = RandomSource.KISS.create(123);
    // Draw 100 values from the reference distribution
    double[] x = dist.createSampler(rng).samples(100).toArray();
    double d = KolmogorovSmirnovTest.withDefaults().statistic(x, dist::cumulativeProbability);
- Parameters:
- x - Sample being evaluated.
- cdf - Reference cumulative distribution function.
- Returns:
- Kolmogorov-Smirnov statistic
- Throws:
- IllegalArgumentException - if x does not have length at least 2; or contains NaN values.
- See Also:
test(double[], DoubleUnaryOperator)
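The three one-sample statistics can be computed directly from the sorted sample, because the supremum is attained at a jump of \(F_n\). The following is a minimal sketch in plain Java under that assumption; it is not the library implementation, the class `OneSampleKsSketch` is a hypothetical name, and it omits the length and NaN checks described above.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

/** Illustrative sketch only; the class and method names are hypothetical. */
final class OneSampleKsSketch {

    /** Returns {D+, D-, D} for sample x against the reference CDF. */
    static double[] statistics(double[] x, DoubleUnaryOperator cdf) {
        double[] sorted = x.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double dPlus = 0;
        double dMinus = 0;
        for (int i = 0; i < n; i++) {
            double f = cdf.applyAsDouble(sorted[i]);
            // F_n steps from i/n up to (i+1)/n at the i-th order statistic.
            dPlus = Math.max(dPlus, (i + 1.0) / n - f);
            dMinus = Math.max(dMinus, f - (double) i / n);
        }
        return new double[] {dPlus, dMinus, Math.max(dPlus, dMinus)};
    }

    public static void main(String[] args) {
        // Uniform(0, 1) reference: F(t) = t clipped to [0, 1].
        DoubleUnaryOperator uniformCdf = t -> Math.min(1, Math.max(0, t));
        double[] d = statistics(new double[] {0.1, 0.4, 0.7}, uniformCdf);
        System.out.printf("D+=%.4f D-=%.4f D=%.4f%n", d[0], d[1], d[2]);
    }
}
```

For the sample {0.1, 0.4, 0.7} against the uniform distribution on [0, 1] the largest positive deviation is 1 - 0.7 = 0.3 and the largest negative deviation is 0.1 - 0 = 0.1, so the two-sided statistic is 0.3.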
-
statistic
public double statistic(double[] x, double[] y)
Computes the two-sample Kolmogorov-Smirnov test statistic.
- two-sided: \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\)
- greater: \(D_{n,m}^+=\sup_x (F_n(x)-F_m(x))\)
- less: \(D_{n,m}^-=\sup_x (F_m(x)-F_n(x))\)
where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution that puts mass \(1/m\) at each of the values in y.
- Parameters:
- x - First sample.
- y - Second sample.
- Returns:
- Kolmogorov-Smirnov statistic
- Throws:
- IllegalArgumentException - if either x or y does not have length at least 2; or contains NaN values.
- See Also:
test(double[], double[])
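The two-sample statistics can be obtained with a single merge pass over the two sorted samples. The sketch below is a plain Java illustration and not the library implementation; the class `TwoSampleKsSketch` is a hypothetical name and input validation is omitted. Values shared by both samples are stepped over together, so the difference is evaluated once at the end of a tied region.

```java
import java.util.Arrays;

/** Illustrative sketch only; the class and method names are hypothetical. */
final class TwoSampleKsSketch {

    /** Returns {D+, D-, D} for samples x and y. */
    static double[] statistics(double[] x, double[] y) {
        double[] a = x.clone();
        double[] b = y.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        int n = a.length;
        int m = b.length;
        int i = 0;
        int j = 0;
        double dPlus = 0;
        double dMinus = 0;
        // Once one sample is exhausted the difference can only shrink towards zero.
        while (i < n && j < m) {
            double v = Math.min(a[i], b[j]);
            // Consume every value equal to v in both samples before measuring.
            while (i < n && a[i] == v) {
                i++;
            }
            while (j < m && b[j] == v) {
                j++;
            }
            double diff = (double) i / n - (double) j / m; // F_n(v) - F_m(v)
            dPlus = Math.max(dPlus, diff);
            dMinus = Math.max(dMinus, -diff);
        }
        return new double[] {dPlus, dMinus, Math.max(dPlus, dMinus)};
    }

    public static void main(String[] args) {
        double[] d = statistics(new double[] {1, 2, 3}, new double[] {2.5, 3.5, 4.5});
        System.out.printf("D+=%.4f D-=%.4f D=%.4f%n", d[0], d[1], d[2]);
    }
}
```

In the example x leads y throughout, so \(D_{n,m}^-\) is zero and \(D_{n,m} = D_{n,m}^+ = 2/3\).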
-
test
public KolmogorovSmirnovTest.OneResult test(double[] x, DoubleUnaryOperator cdf)
Performs a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x conforms to the cumulative distribution function (cdf).
The test is defined by the AlternativeHypothesis:
- Two-sided evaluates the null hypothesis that the two distributions are identical, \(F_n(i) = F(i)\) for all \( i \); the alternative is that they are not identical. The statistic is \( \max(D_n^+, D_n^-) \) and the sign of \( D \) is provided.
- Greater evaluates the null hypothesis that \(F_n(i) \le F(i)\) for all \( i \); the alternative is \(F_n(i) > F(i)\) for at least one \( i \). The statistic is \( D_n^+ \).
- Less evaluates the null hypothesis that \(F_n(i) \ge F(i)\) for all \( i \); the alternative is \(F_n(i) < F(i)\) for at least one \( i \). The statistic is \( D_n^- \).
The p-value method defaults to exact. The one-sided p-value uses Smirnov's stable formula:
\[ P(D_n^+ \ge x) = x \sum_{j=0}^{\lfloor n(1-x) \rfloor} \binom{n}{j} \left(\frac{j}{n} + x\right)^{j-1} \left(1-x-\frac{j}{n} \right)^{n-j} \]
The two-sided p-value is computed using methods described in Simard & L’Ecuyer (2011). The two-sided test supports an asymptotic p-value using Kolmogorov's formula:
\[ \lim_{n\to\infty} P(\sqrt{n}D_n > z) = 2 \sum_{i=1}^\infty (-1)^{i-1} e^{-2 i^2 z^2} \]
- Parameters:
- x - Sample being evaluated.
- cdf - Reference cumulative distribution function.
- Returns:
- test result
- Throws:
- IllegalArgumentException - if x does not have length at least 2; or contains NaN values.
- See Also:
statistic(double[], DoubleUnaryOperator)
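Both formulas above can be evaluated by direct summation. The following plain Java sketch does so under stated assumptions: the class `KsPValueSketch` and its method names are hypothetical, the one-sided sum is written term by term and is only suitable for small n (the binomial coefficient is held in a double), and the asymptotic series is truncated once the terms become negligible. It does not reproduce the exact two-sided computation of Simard & L’Ecuyer (2011).

```java
/** Illustrative sketch only; the class and method names are hypothetical. */
final class KsPValueSketch {

    /** Direct evaluation of Smirnov's formula for P(D_n^+ >= x); intended for small n. */
    static double oneSidedP(double x, int n) {
        if (x <= 0) {
            return 1;
        }
        if (x >= 1) {
            return 0;
        }
        int jMax = (int) Math.floor(n * (1 - x));
        double binom = 1; // C(n, 0)
        double sum = 0;
        for (int j = 0; j <= jMax; j++) {
            double a = (double) j / n + x;
            // Clamp at zero to guard against rounding at the upper summation limit.
            double b = Math.max(0, 1 - x - (double) j / n);
            sum += binom * Math.pow(a, j - 1) * Math.pow(b, n - j);
            binom = binom * (n - j) / (j + 1); // C(n, j + 1)
        }
        return Math.min(1, x * sum);
    }

    /** Kolmogorov's limiting tail probability for z = sqrt(n) * D_n. */
    static double asymptoticTwoSidedP(double z) {
        if (z < 0.2) {
            // The series converges slowly here and the limit is 1 to within about 1e-11.
            return 1;
        }
        double sum = 0;
        double sign = 1;
        for (int i = 1; i <= 100; i++) {
            double term = Math.exp(-2.0 * i * i * z * z);
            sum += sign * term;
            if (term < 1e-17) {
                break;
            }
            sign = -sign;
        }
        return Math.min(1, Math.max(0, 2 * sum));
    }

    public static void main(String[] args) {
        System.out.println("P(D_2^+ >= 0.5) = " + oneSidedP(0.5, 2));
        System.out.println("Q(1.36) = " + asymptoticTwoSidedP(1.36));
    }
}
```

As a check on the one-sided formula, for n = 1 it reduces to \(P(D_1^+ \ge x) = 1 - x\), and for n = 2 and x = 0.5 it gives 0.25. The asymptotic value at z = 1.36 is close to 0.05, which matches the commonly tabulated 5% critical value.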
-
test
public KolmogorovSmirnovTest.TwoResult test(double[] x, double[] y)
Performs a two-sample Kolmogorov-Smirnov test on samples x and y. Test the empirical distributions \(F_n\) and \(F_m\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution that puts mass \(1/m\) at each of the values in y.
The test is defined by the AlternativeHypothesis:
- Two-sided evaluates the null hypothesis that the two distributions are identical, \(F_n(i) = F_m(i)\) for all \( i \); the alternative is that they are not identical. The statistic is \( \max(D_{n,m}^+, D_{n,m}^-) \) and the sign of \( D \) is provided.
- Greater evaluates the null hypothesis that \(F_n(i) \le F_m(i)\) for all \( i \); the alternative is \(F_n(i) > F_m(i)\) for at least one \( i \). The statistic is \( D_{n,m}^+ \).
- Less evaluates the null hypothesis that \(F_n(i) \ge F_m(i)\) for all \( i \); the alternative is \(F_n(i) < F_m(i)\) for at least one \( i \). The statistic is \( D_{n,m}^- \).
If the p-value method is auto, then an exact p computation is attempted if both sample sizes are less than 10000 using the methods presented in Viehmann (2021) and Hodges (1958); otherwise an asymptotic p-value is returned. The asymptotic two-sided p-value is \(\overline{F}(d, \sqrt{mn / (m + n)})\) where \(\overline{F}\) is the complementary cumulative distribution function of the two-sided one-sample Kolmogorov-Smirnov distribution. The asymptotic one-sided p-value uses an approximation from Hodges (1958) Eq 5.3.
\(D_{n,m}\) has a discrete distribution. This makes the p-value associated with the null hypothesis \(H_0 : D_{n,m} \gt d \) differ from \(H_0 : D_{n,m} \ge d \) by the mass of the observed value \(d\). These can be distinguished using an Inequality parameter. This is ignored for large samples.
If the data contains ties there is no defined ordering in the tied region to use for the difference between the two empirical distributions. Each ordering of the tied region may create a different D statistic. All possible orderings generate a distribution for the D value. In this case the tied region is traversed entirely and the effect on the D value is evaluated at the end of the tied region. This is the path with the least change on the D statistic. The path with the greatest change on the D statistic is also computed as the upper bound on D. If these two values are different then the tied region is known to generate a distribution for the D statistic and the p-value is an overestimate for the cases with a larger D statistic. The presence of any significant tied regions is returned in the result.
If the p-value method is ESTIMATE then the p-value is estimated by repeated sampling of the joint distribution of x and y. The p-value is the frequency that a sample creates a D statistic greater than or equal to (or greater than for strict inequality) the observed value. In this case a source of randomness must be configured or an IllegalStateException will be raised. The p-value for the upper bound on D will not be estimated and is set to NaN. This estimation procedure is not affected by ties in the data and is increasingly robust for larger datasets. The method is modeled after ks.boot in the R Matching package (Sekhon (2011)).
- Parameters:
- x - First sample.
- y - Second sample.
- Returns:
- test result
- Throws:
- IllegalArgumentException - if either x or y does not have length at least 2; or contains NaN values.
- IllegalStateException - if the p-value method is ESTIMATE and there is no source of randomness.
- See Also:
statistic(double[], double[])
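The estimation procedure can be sketched as a resampling loop over the pooled data. The code below is an illustration under assumptions and not the library implementation: the class `KsEstimateSketch` and its methods are hypothetical names, `java.util.SplittableRandom` stands in for the configured source of randomness, and each resample is assumed to be drawn with replacement from the pooled values, in the manner of a bootstrap. The statistic is compared on the integer lattice \(n m D_{n,m}\) so that the greater-than and greater-than-or-equal tests are exact.

```java
import java.util.Arrays;
import java.util.SplittableRandom;

/** Illustrative sketch only; the class and method names are hypothetical. */
final class KsEstimateSketch {

    /** Returns n * m * D for the two-sided statistic as an exact integer. */
    static long scaledD(double[] x, double[] y) {
        double[] a = x.clone();
        double[] b = y.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        int n = a.length;
        int m = b.length;
        int i = 0;
        int j = 0;
        long max = 0;
        while (i < n && j < m) {
            double v = Math.min(a[i], b[j]);
            // Tied values are consumed together and evaluated once.
            while (i < n && a[i] == v) {
                i++;
            }
            while (j < m && b[j] == v) {
                j++;
            }
            max = Math.max(max, Math.abs((long) i * m - (long) j * n));
        }
        return max;
    }

    /** Estimates the two-sided p-value as the fraction of resamples reaching the observed D. */
    static double estimateP(double[] x, double[] y, int iterations, boolean strict,
                            SplittableRandom rng) {
        long observed = scaledD(x, y);
        double[] pooled = Arrays.copyOf(x, x.length + y.length);
        System.arraycopy(y, 0, pooled, x.length, y.length);
        double[] xs = new double[x.length];
        double[] ys = new double[y.length];
        int count = 0;
        for (int it = 0; it < iterations; it++) {
            // Assumption: draw both resamples with replacement from the pooled data.
            for (int i = 0; i < xs.length; i++) {
                xs[i] = pooled[rng.nextInt(pooled.length)];
            }
            for (int i = 0; i < ys.length; i++) {
                ys[i] = pooled[rng.nextInt(pooled.length)];
            }
            long d = scaledD(xs, ys);
            if (strict ? d > observed : d >= observed) {
                count++;
            }
        }
        return (double) count / iterations;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {11, 12, 13, 14, 15};
        System.out.println(estimateP(x, y, 1000, false, new SplittableRandom(42)));
    }
}
```

The returned value is always a multiple of 1/iterations, which is why the number of iterations limits the resolution of the estimated p-value.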
-
-