Kolmogorov-Smirnov using R

Long story short, I want to manually write the code for the Kolmogorov-Smirnov one-sample statistic instead of using ks.test() in R. From what I understand, the K-S test can be broken down into a ratio between a numerator and a denominator. I am interested in writing out the numerator, and from what I understand it is the maximal absolute difference between a sample of observations and the theoretical assumption. Let's use the below case as an example:
Data Expected
1 0.01052632 0.008864266
2 0.02105263 0.010969529
13 0.05263158 0.018282548
20 0.06315789 0.031689751
22 0.09473684 0.046315789
24 0.26315789 0.210526316
26 0.27368421 0.220387812
27 0.29473684 0.236232687
28 0.30526316 0.252520776
3 0.42105263 0.365650970
4 0.42105263 0.372299169
5 0.45263158 0.398781163
6 0.49473684 0.452853186
7 0.50526316 0.460277008
8 0.73684211 0.656842105
9 0.74736842 0.665484765
10 0.75789474 0.691523546
11 0.77894737 0.718005540
12 0.80000000 0.735955679
14 0.84210526 0.791135734
15 0.86315789 0.809972299
16 0.88421053 0.838559557
17 0.89473684 0.857950139
18 0.96842105 0.958337950
19 0.97894737 0.968642659
21 0.97894737 0.979058172
23 0.98947368 0.989473684
25 1.00000000 1.000000000
Here, I want to obtain the maximal absolute difference (Data - Expected).
Anyone have an idea? I can rephrase this question, if necessary. Thanks!

I was looking for an answer along the lines of this code:
> A <- with(df, max(abs(Data - Expected)))
where df is the data frame.
Here, I take the difference between each Data and Expected value, convert the differences to absolute values, and select the maximum from the resulting vector. Thus, the answer is:
> A
0.082
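For anyone who wants to run this, here is a minimal sketch. The three rows of df are copied from the table in the question. The second part is my own assumption about how the one-sample statistic could be computed from raw observations rather than from precomputed columns: it uses a simulated normal sample x (not the question's data) and checks the result against ks.test().

```r
# Part 1: maximal absolute difference on a data frame shaped like the one
# in the question (only the first three rows of the table are reproduced).
df <- data.frame(
  Data     = c(0.01052632, 0.02105263, 0.05263158),
  Expected = c(0.008864266, 0.010969529, 0.018282548)
)
A <- with(df, max(abs(Data - Expected)))

# Part 2 (illustrative assumption): one-sample K-S statistic from raw data.
# The empirical CDF jumps at every observation, so the theoretical CDF is
# compared with the ECDF just after and just before each jump.
set.seed(1)                       # arbitrary seed, for reproducibility only
x  <- sort(rnorm(30))             # simulated sample, not the question's data
n  <- length(x)
Fx <- pnorm(x)                    # theoretical CDF at the sorted observations
D_plus  <- max(seq_len(n) / n - Fx)
D_minus <- max(Fx - (seq_len(n) - 1) / n)
D <- max(D_plus, D_minus)

# Cross-check against the built-in test
D_ref <- unname(ks.test(x, "pnorm")$statistic)
all.equal(D, D_ref)
```

If the Data column already holds the ECDF values at the sorted observations, max(abs(Data - Expected)) corresponds to the D_plus side only, so it may be worth checking the other side of each jump as well.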

Related

Finding value of a series in R without for-loop

I am a newbie in R and I found this problem:
Calculate the following sum using R:
1+(2/3)+(2/3)(4/5)+...+(2/3)(4/5)...(38/39)
I was enthusiastic to know how to solve this without using a for loop, and using only vector operations.
My thoughts and what I've tried till now:
Suppose I create two vectors such as
x<-2*(1:19)
y<-2*(1:19)+1
Then, x consists of all the numerators in the question and y has all the denominators. Now
z<-x/y
will create a vector of length 19 in which will be stored the values of 2/3, 4/5, ..., 38/39
I was thinking of using the prod function in R to find the required products. So, I created a vector such that
i<-1:19
In hopes of traversing z from the first element to the last, I wrote:
prod(z[1:i])
But it failed miserably, giving me the result:
[1] 0.6666667
Warning message:
In 1:i : numerical expression has 19 elements: only the first used
What I wanted to do:
I expected to store the values of (2/3), (2/3)(4/5), ..., (2/3)(4/5)...(38/39) individually in another vector (say p) which will thus have 19 elements in it. I then intend to use the sum function to finally find out the sum of all those...
Where am I stuck:
As described in the R documentation, the prod function returns the product of all the values present in its arguments. So,
prod(z[1:1])
prod(z[1:2])
prod(z[1:3])
will return the values of (2/3), (2/3)(4/5), (2/3)(4/5)(6/7) respectively which it does:
> prod(z[1:1])
[1] 0.6666667
> prod(z[1:2])
[1] 0.5333333
> prod(z[1:3])
[1] 0.4571429
But it's not possible to go on like this and do it for all the 19 elements of the vector z. I am stuck here thinking as to what could be done. I wanted to iterate all the elements of z one-by-one for which I created another vector i as described above, but it didn't go as I had thought. Any help, suggestions, and hints will be really great as to how this can be done. I seem to have run out of ideas here.
More Information:
Here, I am providing all the outputs in a systematic manner so that others can understand my problem better:
> x
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38
> y
[1] 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39
> z
[1] 0.6666667 0.8000000 0.8571429 0.8888889 0.9090909 0.9230769 0.9333333
[8] 0.9411765 0.9473684 0.9523810 0.9565217 0.9600000 0.9629630 0.9655172
[15] 0.9677419 0.9696970 0.9714286 0.9729730 0.9743590
> i
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Short Note (controversial statement ahead): This post would really have benefited from the use of LaTeX, but unfortunately, due to extremely heavy dependencies, as mentioned in several posts regarding the inclusion of LaTeX on Stack Overflow, that is not available so far.
You can use cumprod to get the cumulative product of a vector, which is what you are after:
p <- cumprod(z)
p
# [1] 0.6666667 0.5333333 0.4571429 0.4063492 0.3694084 0.3409923 0.3182595
# [8] 0.2995384 0.2837732 0.2702602 0.2585097 0.2481694 0.2389779 0.2307373
# [15] 0.2232941 0.2165276 0.2103411 0.2046562 0.1994087
A less efficient but more general alternative to cumprod would be:
p <- sapply(i, function(x) prod(z[1:x]))
Here sapply takes the place of the loop and passes a different ending index for each product.
Then you can do:
1 + sum(p)
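Putting the pieces together, a self-contained sketch could look like this. The names k, p_alt and total are mine; everything else follows the question and the answer above.

```r
# Terms 2/3, 4/5, ..., 38/39
k <- 1:19
z <- (2 * k) / (2 * k + 1)

# Running products: (2/3), (2/3)(4/5), ..., (2/3)(4/5)...(38/39)
p <- cumprod(z)

# The same running products via sapply, as a cross-check
p_alt <- sapply(seq_along(z), function(j) prod(z[seq_len(j)]))
all.equal(p, p_alt)

# The leading 1 of the series is not part of p, so it is added separately
total <- 1 + sum(p)
total
```

With the 19 terms above, total comes out at roughly 6.976.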

Coefplot for a chi square distribution

I was told to do a coefplot in R to visualise my data better.
Therefore I first did a chi-square test. After I put my data into a table, it looked like this:
                    1  2  3  5  6
5_min_blank        11 21 18 19  8
Boldstyle           6  7 14 10  2
Boldstyle_pause     9 22 19  8  0
Breaststroke        7 16 10  5  4
Breaststroke_pause  9 13 10  8  3
Diving             14 20 10 10  4
1-6 are categories, and "Boldstyle" etc. are different sounds.
I then did a test:
fit.swim<-chisq.test(X2,simulate.p.value = TRUE, B = 10000)
and got this result:
Pearson's Chi-squared test with simulated p-value (based on 10000 replicates)
data: X2
X-squared = 87.794, df = NA, p-value = 0.09479
Now I would like to do a coefplot with my data, but I only get this error:
coefplot(fit.swim)
Error: $ operator is invalid for atomic vectors
Any ideas how to draw a nice plot?
Thank you very much for the help!
All the best
Marie
I think the reason you are getting that error is that coefplot requires a fitted model as input, in the form of an lm, glm or rxLinMod object.
In your case you have carried out a chi-squared test on a contingency table, which compares the observed counts with the counts expected if rows and columns were independent. The result is a test object, not a fitted model, so there are no coefficients to plot.
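To see this for yourself, here is a small sketch that rebuilds the table from the question as a matrix and inspects what chisq.test() returns. The seed is arbitrary, and since only the printed table is used, the statistic need not match the one reported in the question.

```r
# Counts copied from the table in the question
# (rows = sounds, columns = categories)
X2 <- matrix(
  c(11, 21, 18, 19, 8,
     6,  7, 14, 10, 2,
     9, 22, 19,  8, 0,
     7, 16, 10,  5, 4,
     9, 13, 10,  8, 3,
    14, 20, 10, 10, 4),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("5_min_blank", "Boldstyle", "Boldstyle_pause",
      "Breaststroke", "Breaststroke_pause", "Diving"),
    c("1", "2", "3", "5", "6")
  )
)

set.seed(42)  # arbitrary; the simulated p-value depends on it
fit.swim <- chisq.test(X2, simulate.p.value = TRUE, B = 10000)

class(fit.swim)                 # "htest": a test result, not a model
is.null(fit.swim$coefficients)  # TRUE: nothing for coefplot to draw
round(fit.swim$residuals, 2)    # Pearson residuals, one per cell
```

The residuals are probably the closest thing to something plottable here: large positive or negative values mark the cells that deviate most from independence. One option in base R would be mosaicplot(X2, shade = TRUE), which shades cells by their residuals.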

How to use BoxCoxTrans function in R?

I want to use the BoxCoxTrans function in R to deal with skewness.
However, I cannot get the result as a data frame. This is my R code:
df<-read.csv("dataSetNA1.csv",header=TRUE)
dd1<-apply(df[2:61],2,BoxCoxTrans) # the first column (the independent variable) is excluded; all other variables are numeric
dd1
$LT1Y_MXOD_AMT
Box-Cox Transformation
96249 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 0 0 19594 0 1600000
Lambda could not be estimated; no transformation is applied
$MOBL_PRIN
Box-Cox Transformation
96249 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 0 100000 191229 320000 1100000
Lambda could not be estimated; no transformation is applied
str(dd1)
I don't know how to get the result as a data frame.
If I use the as.data.frame function, I get this error message:
dd2<-as.data.frame(dd1)
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class ""BoxCoxTrans"" to a data.frame
Please help me.
Here is one way to accomplish what you are after (I assume you are transforming the features):
library(caret)
data(cars)
# create a list with the BoxCoxTrans objects, one per column
g <- apply(cars, 2, BoxCoxTrans)
# use map2 from purrr to apply each transformation to the matching column
z <- purrr::map2(g, cars, function(x, y) predict(x, y))
# here the transformation is applied to the same data
# on which the Box-Cox lambda was estimated
B_trans <- as.data.frame(do.call(cbind, z)) # convert to a data frame
head(data.frame(B_trans, cars), 20)
# output
speed dist speed.1 dist.1
1 4 0.8284271 4 2
2 4 4.3245553 4 10
3 7 2.0000000 7 4
4 7 7.3808315 7 22
5 8 6.0000000 8 16
6 9 4.3245553 9 10
7 10 6.4852814 10 18
8 10 8.1980390 10 26
9 10 9.6619038 10 34
10 11 6.2462113 11 17
11 11 8.5830052 11 28
12 12 5.4833148 12 14
13 12 6.9442719 12 20
14 12 7.7979590 12 24
15 12 8.5830052 12 28
16 13 8.1980390 13 26
17 13 9.6619038 13 34
18 13 9.6619038 13 34
19 13 11.5646600 13 46
20 14 8.1980390 14 26
The first two columns are the transformed data and the last two are the original data.
Another way is to incorporate the transformation of the features during training:
train(....preProcess = "BoxCox"...)
More on the matter: https://www.rdocumentation.org/packages/caret/versions/6.0-77/topics/train
In order to perform a Box-Cox transformation your data has to be strictly positive, i.e. all values must be greater than 0.
The reason is that the transformation involves the logarithm (for lambda = 0), and the logarithm of 0 is -Inf.
If your data contains zeros, you can just add 1 to each observation. This only shifts the distribution; its shape and skewness stay the same.
A Box-Cox transformation is usually applied to your response variable. You could use the boxcox function of the MASS package to find out what transformation is needed. boxcox returns a grid of lambda values together with their log-likelihoods; the lambda with the highest log-likelihood is the one to use. You should raise your response, say y, to the power lambda, and this results in a new response variable, y*.
Then just replace the y-column in your old data frame by y*.
Note that if the resulting lambda is 0, you should apply a logarithmic transformation ln(y).
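As a rough sketch of that workflow, assuming the MASS package is installed and using the built-in cars data with an illustrative model dist ~ speed (both are my choices, not the asker's data). Note that I use the scaled form (y^lambda - 1)/lambda here, which differs from plain y^lambda only by a linear rescaling and has log(y) as its limit at lambda = 0.

```r
library(MASS)  # provides boxcox()

# Illustrative model on the built-in cars data; dist is strictly positive
fit <- lm(dist ~ speed, data = cars)

# Profile log-likelihood over a grid of lambda values (no plot)
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.1), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]   # lambda with the highest log-likelihood

# Apply the transformation; log() is the limiting case for lambda = 0
y <- cars$dist
y_star <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda

# Replace the response column in a copy of the data frame
cars_bc <- cars
cars_bc$dist <- y_star
head(cars_bc)
```

In practice people often round lambda to a convenient value (0, 0.5, 1, ...) within the confidence interval that boxcox() draws when plotit = TRUE.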

How to write the Kolmogorov-Smirnov in R

I used the code below to obtain the answer:
> A <- with(df, max(abs(Data-Expected)))
> A
0.082
Basically, this code computes the differences between the two columns as a new vector, converts them to absolute values, and returns the largest one.
Credit to Josh O'Brien.

Using tapply on two columns instead of one

I would like to calculate the Gini coefficient of several plots with R using the gini() function from the package reldist.
I have a data frame from which I need to use two columns as input to the gini function.
> head(merged[,c(1,17,29)])
idp c13 w
1 19 126 14.14
2 19 146 14.14
3 19 76 39.29
4 19 74 39.29
5 19 86 39.29
6 19 93 39.29
The gini function uses the first elements for calculation (c13 here) and the second elements are the weights (w here) corresponding to each element from c13.
So I need to use the column c13 and w like this:
gini(merged$c13,merged$w)
[1] 0.2959369
The thing is, I want to do this for each plot (idp). I have four thousand different values of idp, with dozens of values of the two other columns for each.
I thought I could do this using the function tapply(), but I can't pass two columns to the function using tapply.
tapply(list(merged$c13,merged$w), merged$idp, gini)
As you know, this does not work.
So what I would love to get as a result is a data frame like this:
idp Gini
1 19 0.12
2 21 0.45
3 35 0.65
4 65 0.23
Do you have any idea how to do this? Maybe with the plyr package?
Thank you for your help!
You can use the function ddply() from the plyr package to calculate the coefficient for each level of idp (in the example data frame I changed some idp values to 21).
library(plyr)
library(reldist)
ddply(merged,.(idp),summarize, Gini=gini(c13,w))
idp Gini
1 19 0.15307402
2 21 0.05006588
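If you would rather stay in base R, the same split-apply-combine pattern can be sketched with split() and vapply(). To keep the example free of extra packages I use weighted.mean() as a stand-in for reldist::gini(); the helper weighted_stat and the small data frame (six rows from the question, with some idp values changed to 21) are assumptions for illustration only.

```r
# Small example data, based on the rows shown in the question
merged <- data.frame(
  idp = c(19, 19, 19, 21, 21, 21),
  c13 = c(126, 146, 76, 74, 86, 93),
  w   = c(14.14, 14.14, 39.29, 39.29, 39.29, 39.29)
)

# Stand-in for reldist::gini(x, w): any function of (values, weights) works
weighted_stat <- function(x, w) weighted.mean(x, w)

# Split the data frame by plot, then apply the two-column function per piece
by_plot <- split(merged, merged$idp)
res <- data.frame(
  idp  = as.numeric(names(by_plot)),
  stat = vapply(by_plot, function(d) weighted_stat(d$c13, d$w), numeric(1)),
  row.names = NULL
)
res
```

Swapping weighted_stat for function(x, w) reldist::gini(x, w) should give one Gini coefficient per idp, provided reldist is installed.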
