I'm trying to check that 2 samples follow the same unknown distribution using the ks.test function.
I have two datasets:
dataset A gives the percentage of time each value has been observed in a given environment.
dataset B is simply a list of values observed in another environment.
My understanding is that I need to pass two sample sets of observed values, so I should (?) build a sample set from dataset A in which each value appears with the percentage defined in dataset A.
Here is a code snippet to illustrate the idea. Note that the actual values in set_A and set_B are irrelevant; I'm just trying to set up a structure that highlights the problem.
library(data.table)
# one sample set showing the percentage of time each value is observed in env A
value <- runif(10, 1, 99)
time_percent <- runif(10)
time_percent <- time_percent / sum(time_percent) * 100
set_A <- data.table(obs = round(value, 0), time_percent = round(time_percent, 0))
# a sample set of all observed values in env B
set_B = data.table(obs = runif(30, 1, 200))
# I want to check the set_B follows the same distribution as the set_A
# I generate a dummy sample where the number of times a value is present is the same percentage as the one defined in set_A
#set_C <- data.table(obs = set_A[, rep(obs, time_percent)])
set_C = data.table(obs = rep(set_A$obs, times = set_A$time_percent))
ks <- ks.test(set_B$obs, set_C$obs)
if (ks$p.value < 0.05) {
  print("the 2 samples don't follow the same distribution whatever it is")
} else {
  print("the 2 samples do follow the same distribution whatever it is")
}
And now my question: does that make sense?
For the Kolmogorov–Smirnov test, if we know the distributions of dataset A and dataset B and we build a dummy sample from dataset A, we get a fixed Kolmogorov–Smirnov statistic. However, if we don't fix the sample size, we can't get a fixed p-value, because the p-value depends on the Kolmogorov–Smirnov statistic, the number of samples, and the significance level.
To verify this, we can run the code below and compare the statistic D and the p-value:
library(data.table)
# one sample set showing the percentage of time each value is observed in env A
value <- runif(10, 1, 99)
time_percent <- runif(10)
time_percent <- time_percent / sum(time_percent) * 100
set_A <- data.table(obs = round(value, 0), time_percent = round(time_percent, 0))
# a sample set of all observed values in env B
set_B = data.table(obs = runif(30, 1, 200))
# I want to check the set_B follows the same distribution as the set_A
# I generate a dummy sample where the number of times a value is present is the same percentage as the one defined in set_A
#set_C <- data.table(obs = set_A[, rep(obs, time_percent)])
set_C = data.table(obs = rep(set_A$obs, times = set_A$time_percent))
(ks_1 <- ks.test(set_B$obs, set_C$obs))
(ks_2 <- ks.test(rep(set_B$obs, 2), set_C$obs)) # same data as ks_1, but set_B duplicated to double its sample size
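To make the comparison explicit, you can extract the statistic and the p-value from the two test objects (a small sketch based on the ks_1 and ks_2 results above). Since repeating every observation leaves the empirical CDF of set_B unchanged, D should be identical for both tests, while the p-value changes for ks_2 because of the larger sample size:
# D statistic: computed from the two empirical CDFs, unchanged when set_B is duplicated
c(D_1 = unname(ks_1$statistic), D_2 = unname(ks_2$statistic))
# p-values: also depend on the sample sizes, so they differ between ks_1 and ks_2
c(p_1 = ks_1$p.value, p_2 = ks_2$p.value)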
I'm trying to run a command in this tutorial on RMANOVA:
https://finnstats.com/index.php/2021/04/06/repeated-measures-of-anova-in-r/
However, when I try to run this command:
data <- read.xlsx("D:/RStudio/data.xlsx",sheetName="Sheet1")
It gives me the following error:
Error in loadWorkbook(file, password = password) : Cannot find D:/RStudio/data.xlsx
It appears that this is not loading because it requires a local data file that I don't have. However, I don't see that there is this data file in the tutorial page. Am I correct in assuming the file is missing, or is this command supposed to build the spreadsheet?
Any help would be great. Thanks!
In short
You are correct in assuming the file is missing. This command will not create any spreadsheets but it will read a spreadsheet stored at "D:/RStudio/data.xlsx".
Possible solution
You could create a dataset on your own, e.g., like this:
# Treatment A: Samples for the different time steps T0, T1, and T2
T0A <- rnorm(12, mean = 7.853, sd = 3.082)
T1A <- rnorm(12, mean = 9.298, sd = 2.090)
T2A <- rnorm(12, mean = 5.586, sd = 0.396)
# Treatment B: Samples for the different time steps T0, T1, and T2
T0B <- rnorm(12, mean = 7.853, sd = 3.082)
T1B <- rnorm(12, mean = 9.298, sd = 2.090)
T2B <- rnorm(12, mean = 5.586, sd = 0.396)
# Combine the values in a data.frame
data <- data.frame(time = c(rep("T0", 12), rep("T1", 12), rep("T2", 12),
rep("T0", 12), rep("T1", 12), rep("T2", 12)),
score = c(T0A, T1A, T2A, T0B, T1B, T2B),
Treatment = c(rep("A", 36), rep("B", 36))
)
# make time and Treatment factors
data$time <- as.factor(data$time)
data$Treatment <- as.factor(data$Treatment)
# Here, we are at the Summary Statistics step of the tutorial already
library(dplyr)
library(rstatix)
summary <- data %>%
  group_by(time) %>%
  get_summary_stats(score, type = "mean_sd")
data.frame(summary)
Note that treatment A and B are exactly the same in this case. Depending on what you want to test, you can alter the mean and standard deviation of the different treatments.
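If you prefer to follow the tutorial verbatim, including the read.xlsx call, one option is to write this self-created data set to an xlsx file first and then read it back. A minimal sketch, assuming you use the xlsx package that produced the loadWorkbook error and that the target folder exists on your machine:
library(xlsx)
# hypothetical path; point this at a folder that actually exists
write.xlsx(data, file = "D:/RStudio/data.xlsx", sheetName = "Sheet1", row.names = FALSE)
# now the tutorial's command can find the file
data <- read.xlsx("D:/RStudio/data.xlsx", sheetName = "Sheet1")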
Additional ideas
You could also introduce outliers to your self-created data set. In the following, I just chose the mean value from T0A and 4 times the standard deviation of T0A. Then we can set a value N for how many potential outliers we want. Subsequently, we create random values that can be up to 4 standard deviations higher or lower than the mean of T0A and use those values to replace random score values within the data.frame. In this case we set a maximum of N = 1 outlier. Of course, this script could be adapted to set certain ranges of potential values dependent on the time and Treatment factors (but that's beyond the scope of this answer for now).
mean_value <- 7.853
extreme <- 4*3.082
N <- 1
outlier_values <- runif(N, min = mean_value - extreme, max = mean_value + extreme)
outliers <- round(runif(N, min = 1, max = nrow(data)), digits = 0)
data$score[outliers] <- outlier_values
In my opinion, this data set is more useful than any canned example data set, because you can now change the mean and standard deviation values, introduce outliers, etc., and experiment with the data to see how your statistical tests respond in various situations.
I'm trying to use a repeat loop to generate 100 data sets from a Poisson distribution, each with sample size n = 100, and I would like to arrange the results by row and column. However, it just shows me the last data set over and over instead of all the data sets. At the same time, I am also trying to figure out how to get the mean, variance and MSE of the 100 data sets.
set.seed(124)
a <- 1
repeat{
  b = rpois(100, lambda = 3)
  Storage100 <- matrix(data = b, nrow = 100, ncol = 1)
  a = a + 1
  print(b)
  if (a > 100){
    break
  }
}
Storage100
I'm expecting the 100 data sets to be shown with the first data set in the first column, the second data set in the second column, and so on.
Use replicate with simplify = TRUE to get a 100 x 100 matrix in which each column represents one simulated data set.
set.seed(124)
m1 <- replicate(100, rpois(100, lambda = 3), simplify = TRUE) # each replicate becomes one column of a 100 x 100 matrix
To get the mean of each column (i.e., of each data set) we can use colMeans (thanks to @jay.sf):
colMeans(m1)
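For the per-data-set variance you can apply var over the columns. For the MSE, the question does not define it, so as an assumption I take it to mean the mean squared error of the 100 sample means around the true lambda = 3:
# variance within each of the 100 simulated data sets (one per column)
col_vars <- apply(m1, 2, var)
# sample mean of each data set
col_means <- colMeans(m1)
# assumed definition of MSE: squared error of the sample means around lambda = 3
mse <- mean((col_means - 3)^2)
head(col_vars)
mse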
I'm trying to discretize a numerical variable using kmeans.
It worked pretty well, but I'm wondering how I can find the intervals within my clusters.
I work with FactoMineR to do my kmeans.
I found 3 clusters according to the resulting graph (not reproduced here).
My point now is to identify the intervals of my numerical variable within the clusters.
Is there any option or method in FactoMineR or another package to do it?
I can do it manually, but as I have to do it for quite a few variables, I'd like to find an easy way to identify them.
Since you have not provided data, I have used the example from the kmeans documentation, which produces two groups for data with two columns, x and y. You can split the original data by the cluster each row belongs to and then extract information from each group. I am not sure whether my example data resembles yours, but in the code below I simply used the difference between the minimum of column x and the maximum of column y as the boundaries of a potential interval (whether that makes sense depends on the use case). Does that help you?
data <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(data) <- c("x", "y")
cl <- kmeans(data, 2)
data <- as.data.frame(cbind(data, cluster = cl$cluster))
lapply(split(data, data$cluster), function(x) {
min_x <- min(x$x)
max_y <- max(x$y)
diff <- max_y-min_x
c(min_x = min_x , max_y = max_y, diff = diff)
})
# $`1`
# min_x max_y diff
# -0.6906124 0.5123950 1.2030074
#
# $`2`
# min_x max_y diff
# 0.2052112 1.6941800 1.4889688
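If the variable you want to discretize is a single numeric vector (as the question suggests), the per-cluster intervals can also be read off directly with tapply. A small sketch with a hypothetical vector v clustered into 3 groups:
# hypothetical one-dimensional example: cluster a numeric vector into 3 groups
set.seed(1)
v <- c(rnorm(50, mean = 0), rnorm(50, mean = 5), rnorm(50, mean = 10))
km <- kmeans(v, centers = 3)
# interval (min and max of v) covered by each cluster
tapply(v, km$cluster, range)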
For a science project, I am looking for a way to generate random data in a certain range (e.g. min=0, max=100000) with a certain correlation with another variable which already exists in R. The goal is to enrich the dataset a little so I can produce some more meaningful graphs (no worries, I am working with fictional data).
For example, I want to generate random values correlating with r=-.78 with the following data:
var1 <- rnorm(100, 50, 10)
I already came across some pretty good solutions (e.g. https://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable), but I only get very small values, which I cannot transform so that they make sense in the context of the other, original values.
Following the example:
var1 <- rnorm(100, 50, 10)
n <- length(var1)
rho <- -0.78
theta <- acos(rho)
x1 <- var1
x2 <- rnorm(n, 50, 50)
X <- cbind(x1, x2)
Xctr <- scale(X, center=TRUE, scale=FALSE)
Id <- diag(n)
Q <- qr.Q(qr(Xctr[ , 1, drop=FALSE]))
P <- tcrossprod(Q) # = Q Q'
x2o <- (Id-P) %*% Xctr[ , 2]
Xc2 <- cbind(Xctr[ , 1], x2o)
Y <- Xc2 %*% diag(1/sqrt(colSums(Xc2^2)))
var2 <- Y[ , 2] + (1 / tan(theta)) * Y[ , 1]
cor(var1, var2)
What I get for var2 are values ranging between -0.5 and 0.5, with a mean of 0. I would like the data to be much more spread out, so I could simply transform it (e.g., by adding 50) and get a range quite similar to my first variable.
Does anyone know a way to generate this kind of, more or less, meaningful data?
Thanks a lot in advance!
Starting with var1, renamed to A, and using 10,000 points:
set.seed(1)
A <- rnorm(10000,50,10) # Mean of 50
First transform the values in A so that they have the new desired mean of 50,000 and an inverse relationship (i.e., subtract):
B <- 1e5 - (A*1e3) # mean(A) * 1e3 is about 50,000, so mean(B) is about 50,000
On its own this gives r = -1 exactly. Add some noise to achieve the desired r:
B <- B + rnorm(10000,0,8.15e3) # Note this noise has mean = 0
# the amount of noise, 8.15e3, was found through parameter-search
This has your desired correlation:
cor(A,B)
[1] -0.7805972
View with:
plot(A,B)
Caution
Your B values might fall outside your range of 0 to 100,000. You might need to filter out values outside that range if you use a different seed or generate more numbers.
That said, the current range is fine:
range(B)
[1] 1668.733 95604.457
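If a different seed does push values outside the range, one option (just a sketch) is to drop or clamp the offending values, at the cost of slightly changing the correlation:
# option 1: drop out-of-range pairs (keep A and B aligned)
keep <- B >= 0 & B <= 1e5
A_kept <- A[keep]
B_kept <- B[keep]
# option 2: clamp B into the range
B_clamped <- pmin(pmax(B, 0), 1e5)
cor(A, B_clamped) # check how much the correlation has shifted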
If you're happy with the correlation and the marginal distribution (i.e., shape) of the generated values, multiply the values (which fall between -0.5 and +0.5) by 100,000 and add 50,000.
> c(-0.5, 0.5) * 100000 + 50000
[1] 0e+00 1e+05
Edit: this approach, or anything else where 100,000 and 50,000 are exchanged for different numbers, is an example of the 'linear transformation' recommended by @gregor-de-cillia.
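To make this concrete, continuing from the var1 and var2 produced by the code in the question (a quick sketch; the exact range of var2 depends on the random draw), the correlation is unaffected by this kind of rescaling:
# rescale var2 from roughly (-0.5, 0.5) to roughly (0, 100000)
var2_scaled <- var2 * 100000 + 50000
cor(var1, var2_scaled) # equal to cor(var1, var2): correlation is invariant to positive linear maps
range(var2_scaled)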
I am working on a project about income distribution. I would like to generate random data for testing the theory. Say I have N = 5 countries, each with a population of n = 1000, and I want to generate a random income (normally distributed) for each person in each population, with the constraints that income is between 0 and 1, the mean is the same for all countries, and the standard deviation differs between countries. I used rnorm(n, meanx, sd) to do this. I know that the uniform distribution (runif(n, min, max)) has arguments for setting min and max, but rnorm does not, so I wrote a piece of code that checks whether the generated data satisfy my [0, 1] constraint.
I successfully generated income data for n = 100. However, if I increase n to a multiple of 100, e.g. n = 200, 300, ..., 1000, my program hangs. I can see why: the data are generated without min/max constraints and only checked afterwards, so with larger n the probability that an entire draw lies inside [0, 1] is much smaller than with n = 100, and the loop just keeps repeating: generate data, fail the check, try again.
Technically speaking, I thought of fixing this by breaking n = 1000 into small batches, say b = 100. Since rnorm reliably produces 100 samples within [0, 1], I could run the loop separately for each batch of 100 samples and then collect the 10 * 100 samples into one data set of 1000 for my later analysis.
However, mathematically speaking, I am not sure whether the data for n = 1000 would still satisfy the normal-distribution assumption if I do it this way. I attach my code below. Hopefully my explanation is clear. All of your opinions will be very useful to my work. Thanks a lot.
# Update:
# plot histogram
# create the random data with same mean, different standard deviation and x in range [0,1]
# Generate the output file
# Generate data for K countries
#---------------------------------------------
# Configurable variables
number_of_populations = 5
n=100 #number of residents (*** input a number which is a multiple of 100)
meanx = 0.7
sd_constant = 0.1 # sd = sd_constant + j/50
min=0 #min income
max=1 #max income
#---------------------------------------------
batch =100 # divide the large number of residents into small batch of 100
x = matrix(
  0,                            # the data elements
  nrow = n,                     # number of rows
  ncol = number_of_populations, # number of columns
  byrow = TRUE)                 # fill matrix by rows
x_temp = rep(0,n)
# generate income data randomly for each country
for (j in 1:number_of_populations){
  # 1. Generate uniform distribution
  #x[,j] <- runif(n, min, max)
  # 2. Generate normal distribution
  sd = sd_constant + j/50
  repeat {
    x_temp <- rnorm(n, meanx, sd)
    is_inside = TRUE
    for (i in 1:n){
      if (x_temp[i] < min || x_temp[i] > max) {
        is_inside = FALSE
        break
      }
    }
    if (is_inside == TRUE) {break}
  } #end repeat
  x[,j] <- x_temp
}
# write in csv
# each column stores different income of its residents
working_dir= "D:\\dataset\\"
setwd(working_dir)
file_output = "random_income.csv"
sink(file_output)
write.table(x,file=file_output,sep=",", col.names = F, row.names = F)
sink()
file.show(file_output) #show the file in directory
#plot histogram of x for each population
#par(mfrow=c(3,3), oma=c(0,0,0,0,0))
attach(mtcars)
par(mfrow=c(1,5))
for (j in 1:number_of_populations)
{
#plot(X[,i],y,'xlab'=i)
hist(x[,j],main="Normal",'xlab'=j)
}
Here's a sensible simple way...
# inverse-CDF sampling from a standard normal truncated to [0, 1]
sampnorm01 <- function(n) qnorm(runif(n, min = pnorm(0), max = pnorm(1)))
Test it out:
mysamp <- sampnorm01(1e5)
hist(mysamp)
Thanks to @PatrickPerry, here is a generalized truncated normal, again using the inverse-CDF method. It allows for different parameters of the normal and different truncation bounds.
rtnorm <- function(n, mean = 0, sd = 1, min = 0, max = 1) {
  bounds <- pnorm(c(min, max), mean, sd) # CDF values at the truncation bounds
  u <- runif(n, bounds[1], bounds[2])    # uniform draws restricted to that CDF range
  qnorm(u, mean, sd)                     # map back through the inverse CDF
}
Test it out:
mysamp <- rtnorm(1e5, .7, .2)
hist(mysamp)
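Applied to the setup in the question (same mean for all countries, a different standard deviation per country; the parameter values below are taken from the question's script), a sketch could look like this:
# 5 countries, 1000 residents each, mean 0.7, sd = 0.1 + j/50, incomes kept in [0, 1]
incomes <- sapply(1:5, function(j) rtnorm(1000, mean = 0.7, sd = 0.1 + j/50, min = 0, max = 1))
dim(incomes) # 1000 x 5: one column per country
apply(incomes, 2, range) # every column stays inside [0, 1]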
You can normalize the data:
x = rnorm(100)
# normalize
min.x = min(x)
max.x = max(x)
x.norm = (x - min.x)/(max.x - min.x)
print(x.norm)
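If you then need the values on another scale than [0, 1], you can map the normalized values onto any target interval; a small follow-up sketch, where lower and upper are hypothetical bounds:
# rescale the [0, 1] values to a target interval [lower, upper]
lower <- 0
upper <- 1 # the income range from the question; any other bounds work the same way
x.scaled <- lower + x.norm * (upper - lower)
range(x.scaled)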
Here is my take on it.
The data is first normalized (at which stage the standard deviation is lost) and then rescaled to the range specified by the lower and upper parameters.
#' Creates a random normal distribution within the specified bounds
#'
#' WARNING: This function does not preserve the standard deviation
#' @param n The number of values to be generated
#' @param mean The mean of the distribution
#' @param sd The standard deviation of the distribution
#' @param lower The lower limit of the distribution
#' @param upper The upper limit of the distribution
rtnorm <- function(n, mean = 0, sd = 1, lower = -1, upper = 1){
  mean = ifelse(test = (is.na(mean) || (mean < lower) || (mean > upper)),
                yes = mean(c(lower, upper)),
                no = mean)
  data <- rnorm(n, mean = mean, sd = sd) # data
  if (!is.na(lower) && !is.na(upper)){   # adjust data to specified range
    drange <- range(data)                # data range
    irange <- range(lower, upper)        # input range
    data <- (data - drange[1]) / (drange[2] - drange[1]) # normalize data (make it 0 to 1)
    data <- (data * (irange[2] - irange[1])) + irange[1] # adjust to specified range
  }
  return(data)
}
Example:
a <- rtnorm(n = 1000, lower = 10, upper = 90)
range(a)
plot(hist(a, 50))