How do I run this command in the xlsx library? (R)

I'm trying to run a command in this tutorial on RMANOVA:
https://finnstats.com/index.php/2021/04/06/repeated-measures-of-anova-in-r/
However, when I try to run this command:
data <- read.xlsx("D:/RStudio/data.xlsx",sheetName="Sheet1")
It gives me the following error:
Error in loadWorkbook(file, password = password) : Cannot find
D:/RStudio/data.xlsx
It appears that this is not loading because it requires a local data file that I don't have. However, I don't see that there is this data file in the tutorial page. Am I correct in assuming the file is missing, or is this command supposed to build the spreadsheet?
Any help would be great. Thanks!

In short
You are correct in assuming the file is missing. This command does not create any spreadsheet; it reads an existing one stored at "D:/RStudio/data.xlsx", and the tutorial page does not provide that file.
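For what it's worth, once you do have a spreadsheet in place, the loading step is just this (a minimal sketch; the path is the one from the tutorial, so adjust it to wherever your file actually lives):
library(xlsx)
path <- "D:/RStudio/data.xlsx"  # adjust to your own file location
if (file.exists(path)) {
  data <- read.xlsx(path, sheetName = "Sheet1")
} else {
  stop("No spreadsheet found at ", path)
}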
Possible solution
You could create a dataset on your own, e.g., like this:
# Treatment A: Samples for the different time steps T0, T1, and T2
T0A <- rnorm(12, mean = 7.853, sd = 3.082)
T1A <- rnorm(12, mean = 9.298, sd = 2.090)
T2A <- rnorm(12, mean = 5.586, sd = 0.396)
# Treatment B: Samples for the different time steps T0, T1, and T2
T0B <- rnorm(12, mean = 7.853, sd = 3.082)
T1B <- rnorm(12, mean = 9.298, sd = 2.090)
T2B <- rnorm(12, mean = 5.586, sd = 0.396)
# Combine the values in a data.frame
data <- data.frame(time = c(rep("T0", 12), rep("T1", 12), rep("T2", 12),
                            rep("T0", 12), rep("T1", 12), rep("T2", 12)),
                   score = c(T0A, T1A, T2A, T0B, T1B, T2B),
                   Treatment = c(rep("A", 36), rep("B", 36)))
# Make time and Treatment factors
data$time <- as.factor(data$time)
data$Treatment <- as.factor(data$Treatment)
# Here, we are at the Summary Statistics step of the tutorial already
library(dplyr)
library(rstatix)
# "summary_stats" avoids masking base R's summary() function
summary_stats <- data %>%
  group_by(time) %>%
  get_summary_stats(score, type = "mean_sd")
data.frame(summary_stats)
Note that treatments A and B are drawn from exactly the same distributions in this case. Depending on what you want to test, you can alter the means and standard deviations of the different treatments.
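If you want to continue to the repeated-measures ANOVA itself, note that the tutorial's anova_test() call also needs a subject identifier, which the data above does not have yet. A sketch, assuming 12 subjects per treatment, each measured at all three time points (the id column is my own hypothetical addition):
# Hypothetical subject ids: subjects 1-12 in treatment A, 13-24 in treatment B
data$id <- factor(c(rep(1:12, 3), rep(13:24, 3)))
res.aov <- anova_test(data = data, dv = score, wid = id,
                      within = time, between = Treatment)
get_anova_table(res.aov)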
Additional ideas
You could also introduce outliers into your self-created data set. In the following, I take the mean of T0A and 4 times the standard deviation of T0A, and set a value N for how many potential outliers we want. We then create random values that can be up to 4 standard deviations higher or lower than the mean of T0A and use them to replace randomly chosen score values within the data.frame. In this case we set a maximum of N = 1 outlier. Of course, this script could be adapted to restrict the range of potential values depending on the time and Treatment factors (but that's beyond the scope of this answer for now).
mean_value <- 7.853
extreme <- 4 * 3.082
N <- 1
outlier_values <- runif(N, min = mean_value - extreme, max = mean_value + extreme)
# sample() draws N distinct row indices; round(runif(...)) could yield duplicates
outliers <- sample(nrow(data), N)
data$score[outliers] <- outlier_values
In my opinion, this data set is more useful than any canned example data set, because you can now change the mean and standard deviation values, introduce outliers, etc., and see how your statistical tests respond in various situations.
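To check whether the injected value actually shows up as an outlier, you can run the outlier step from the tutorial on the modified data (a sketch using rstatix::identify_outliers):
data %>%
  group_by(time) %>%
  identify_outliers(score)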

Related

Distribution of mean * standard deviation of a sample from a Gaussian

I'm trying to assess the feasibility of an instrumental variable in my project with a variable I haven't seen before. The variable is essentially an interaction between the mean and standard deviation of a sample drawn from a Gaussian, and I'm trying to see what this distribution might look like. Below is what I'm trying to do; any help is much appreciated.
1. Generate a population of 1000 individuals with a variable x following the Gaussian distribution.
2. Draw 50 random samples of 5 individuals from this distribution with replacement.
3. Calculate the mean and standard deviation of x for each sample.
4. Create an interaction variable y by multiplying the mean and standard deviation of x for each sample.
5. Plot the distribution of y.
Beginner's version
There might be more efficient ways to code this, but this one is easy to follow, I guess:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N <- 50
# As Ben suggested, we create a data.frame filled with NA values
samples <- data.frame(mean = rep(NA, N), sd = rep(NA, N))
# Now we use a loop to populate the data.frame
for (i in 1:N) {
  # Draw 5 samples from the population (without replacement).
  # I assume you want to replace between each turn of taking 5;
  # if you want to replace between drawing each of the 5,
  # it should be obvious how to adapt the following code.
  smpl <- sample(stat_pop, size = 5, replace = FALSE)
  # The data.frame has two columns; in each row i, we store mean and sd
  samples[i, ] <- c(mean(smpl), sd(smpl))
}
# $ extracts a column of the data.frame by name.
# Here, we create a new column y based on the two existing columns.
samples$y <- samples$mean * samples$sd
# Plot a histogram
hist(samples$y)
Most functions here also accept positional arguments, i.e., you are not required to name every parameter: rnorm(1000, mean = 0, sd = 1) is the same as rnorm(1000, 0, 1), and even the same as rnorm(1000), since 0 and 1 are the default values.
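You can convince yourself of the equivalence with a fixed seed:
set.seed(1); a <- rnorm(5, mean = 0, sd = 1)
set.seed(1); b <- rnorm(5)
identical(a, b)  # TRUE, since mean = 0 and sd = 1 are the defaults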
Somewhat more efficient version
Explicit loops in R are often slower than vectorized alternatives, so they are usually avoided where a vectorized solution exists. For your question it makes no noticeable difference, but for large data sets performance should be kept in mind. The following might be a bit harder to follow:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N <- 50
n <- 5
# Again, I set replace = FALSE here; if you meant to replace each individual
# (so the same individual can be drawn more than once in each "draw 5"),
# set replace = TRUE.
# replicate() repeats the "draw 5" action N times
smpls <- replicate(N, sample(stat_pop, n, replace = FALSE))
# Transpose the output and turn it into a data.frame to make it
# more convenient to work with (one row per sample)
samples <- data.frame(t(smpls))
samples$mean <- rowMeans(samples)
samples$sd <- apply(samples[, 1:n], 1, sd)
samples$y <- samples$mean * samples$sd
hist(samples$y)
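If you only need y, an even more compact variant (under the same assumption about replacement) skips the intermediate data.frame entirely:
y <- replicate(N, {
  smpl <- sample(stat_pop, n, replace = FALSE)
  mean(smpl) * sd(smpl)
})
hist(y)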
General note
Usually, you should do some research on the problem before posting here. Then you either find out how it works by yourself, or you can provide an example of what you tried. To this end, you can simply google each of the steps you outlined (e.g., search for "generate random normal distribution R" to find out about the function rnorm()).
Run ?rnorm in RStudio to get help on the function.

ks.test without reference sample

I'm trying to check whether two samples follow the same unknown distribution, using the ks.test function.
I have two datasets:
- dataset A tells me the percentage of time each value has been observed in a given environment.
- dataset B is basically a list of values observed in another environment.
My understanding is that I need to pass two sample sets of observed values, so I should (?) build a sample set from dataset A in which each value is present with the percentage defined in dataset A.
Here is a code snippet to illustrate the idea. Please note the actual values in set_A and set_B are irrelevant; I'm just trying to have a structure that highlights the problem.
library(data.table)
# One sample set showing the percentage of time each value is observed in env A
value <- runif(10, 1, 99)
time_percent <- runif(10)
time_percent <- time_percent / sum(time_percent) * 100
set_A <- data.table(obs = round(value, 0), time_percent = round(time_percent, 0))
# A sample set of all observed values in env B
set_B <- data.table(obs = runif(30, 1, 200))
# I want to check that set_B follows the same distribution as set_A.
# I generate a dummy sample where each value is present the same
# percentage of times as defined in set_A
# set_C <- data.table(obs = set_A[, rep(obs, time_percent)])
set_C <- data.table(obs = rep(set_A$obs, times = set_A$time_percent))
ks <- ks.test(set_B$obs, set_C$obs)
if (ks$p.value < 0.05) {
  print("the 2 samples don't follow the same distribution, whatever it is")
} else {
  print("the 2 samples do follow the same distribution, whatever it is")
}
And now my question: does that make sense?
For the Kolmogorov–Smirnov test, if we know the distributions of dataset A and dataset B and form dummy data from dataset A, we get a fixed Kolmogorov–Smirnov statistic D. However, if we don't fix the sample size, we can't get a fixed p-value, because the p-value depends on the Kolmogorov–Smirnov statistic, the sample sizes, and the significance level.
To verify this, we can run the following and compare the D statistics and p-values:
library(data.table)
# One sample set showing the percentage of time each value is observed in env A
value <- runif(10, 1, 99)
time_percent <- runif(10)
time_percent <- time_percent / sum(time_percent) * 100
set_A <- data.table(obs = round(value, 0), time_percent = round(time_percent, 0))
# A sample set of all observed values in env B
set_B <- data.table(obs = runif(30, 1, 200))
# Generate a dummy sample where each value is present the same
# percentage of times as defined in set_A
# set_C <- data.table(obs = set_A[, rep(obs, time_percent)])
set_C <- data.table(obs = rep(set_A$obs, times = set_A$time_percent))
(ks_1 <- ks.test(set_B$obs, set_C$obs))
(ks_2 <- ks.test(rep(set_B$obs, 2), set_C$obs))
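Duplicating set_B leaves its empirical distribution function, and therefore D, unchanged, while the larger nominal sample size changes the p-value:
c(D_1 = unname(ks_1$statistic), D_2 = unname(ks_2$statistic))  # identical D
c(p_1 = ks_1$p.value, p_2 = ks_2$p.value)                      # different p-values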

Paired sample t-test in R: a question of direction

I have a question about the sign of t in a paired-sample t-test using different data structures but the same data. I know the sign doesn't make a difference in terms of significance, but it does generally tell the user whether values have decreased or increased over time. So I need to make sure that the code I provide either produces the same results or is explained correctly.
I have to explain the results (and code) as an example we're giving users of our software, which uses R (Rdotnet within a C# program) for statistics. I'm unclear about the proper order of variables in both methods in R.
Method 1 uses two vectors
## Sets seed for reproducible number generation
set.seed(2820)
## Creates the vectors
preTest <- rnorm(100, mean = 145, sd = 9)
postTest <- rnorm(100, mean = 138, sd = 8)
## Runs paired-sample t-test on the two original vectors
t.test(preTest, postTest, paired = TRUE)
The results show significance, and the positive t tells me that the mean has decreased from preTest to postTest.
Paired t-test
data: preTest and postTest
t = 7.1776, df = 99, p-value = 1.322e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
6.340533 11.185513
sample estimates:
mean of the differences
8.763023
However, most people are going to get their data not from two vectors but from a file with values for BEFORE and AFTER. I will have these data in a csv and import them during a demo. So, to mimic this, I need to create a data frame in the structure that users of our software are used to seeing; 'pstt' should look like the data frame I would have after importing a csv.
Method 2: uses a flat-file structure
## Converts the vectors into a data frame that looks like the way these
## data are normally stored in a csv or Excel
ID <- c(1:100)
pstt <- data.frame(ID, preTest, postTest)
## Puts the data in a form that can be used by R (grouping var | data var)
pstt2 <- data.frame(
  group = rep(c("preTest", "postTest"), each = 100),
  weight = c(preTest, postTest)
)
## Runs paired-sample t-test on the newly structured data frame
t.test(weight ~ group, data = pstt2, paired = TRUE)
The results for this t-test have a negative t, which may suggest to the user that the variable under study has increased over time.
Paired t-test
data: weight by group
t = -7.1776, df = 99, p-value = 1.322e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.185513 -6.340533
sample estimates:
mean of the differences
-8.763023
Is there a way to define explicitly which group is the BEFORE and which is the AFTER? Or do you have to list the AFTER group first in Method 2?
Thanks for any help/explanation.
Here is the full R program that I used:
## Sets working dir
# setwd("C:\\Temp\\")
## Runs file from command line
# source("paired_ttest.r", echo = TRUE)
## Sets seed for reproducible number generation
set.seed(2820)
## Creates the vectors
preTest <- rnorm(100, mean = 145, sd = 9)
postTest <- rnorm(100, mean = 138, sd = 8)
ID <- c(1:100)
## Converts the vectors into a data frame that looks like the way these
## data are normally stored
pstt <- data.frame(ID, preTest, postTest)
## Puts the data in a form that can be used by R (grouping var | data var)
pstt2 <- data.frame(
  group = rep(c("preTest", "postTest"), each = 100),
  weight = c(preTest, postTest)
)
print(pstt2)
## Runs paired-sample t-test just on the two original vectors
# t.test(preTest, postTest, paired = TRUE)
## Runs paired-sample t-test on the newly structured data frame
t.test(weight ~ group, data = pstt2, paired = TRUE)
Since group is a factor, t.test() uses the first level of that factor as the reference level. By default, factor levels are sorted alphabetically, so "AFTER" would come before "BEFORE" and "postTest" would come before "preTest". You can explicitly set the reference level of a factor with relevel():
t.test(weight ~ relevel(group, "preTest"), data = pstt2, paired = TRUE)
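Alternatively, you can fix the level order once when building the data frame, so that every subsequent call treats preTest as the BEFORE group (a sketch based on the pstt2 construction above):
pstt2$group <- factor(pstt2$group, levels = c("preTest", "postTest"))
t.test(weight ~ group, data = pstt2, paired = TRUE)  # t is positive again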

How can I find numerical intervals of k-means clusters?

I'm trying to discretize a numerical variable using k-means.
It worked pretty well, but I'm wondering how I can find the intervals of my clusters.
I work with FactoMineR to do my k-means.
I found 3 clusters, according to the following graph:
[graph of the 3 clusters not included]
My point now is to identify the intervals of my numerical variable within the clusters.
Is there any option or method in FactoMineR or another package to do it?
I can do it manually, but as I have to do it for a fair number of variables, I'd like to find an easy way to identify them.
Since you have not provided data, I have used the example from the kmeans documentation, which produces two groups for data with two columns x and y. You may split the original data by the cluster each row belongs to and then extract values from each group. I am not sure whether my example data resembles yours, but in the code below I simply use the difference between the minimum of column x and the maximum of column y as the boundaries of a potential interval (depending on the use case, this may or may not make sense). Does that help you?
data <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
              matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(data) <- c("x", "y")
cl <- kmeans(data, 2)
data <- as.data.frame(cbind(data, cluster = cl$cluster))
lapply(split(data, data$cluster), function(x) {
  min_x <- min(x$x)
  max_y <- max(x$y)
  diff <- max_y - min_x
  c(min_x = min_x, max_y = max_y, diff = diff)
})
# $`1`
# min_x max_y diff
# -0.6906124 0.5123950 1.2030074
#
# $`2`
# min_x max_y diff
# 0.2052112 1.6941800 1.4889688
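Since you are discretizing a single numerical variable, the intervals you are after are arguably just the per-cluster ranges of that variable. A sketch with made-up one-dimensional data, using stats::kmeans (I assume the cluster assignments coming out of FactoMineR can be used the same way):
v <- c(rnorm(50, mean = 0), rnorm(50, mean = 5))  # hypothetical variable to discretize
cl1 <- kmeans(v, centers = 3)
# Interval (range) of the variable within each cluster
tapply(v, cl1$cluster, range)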

I have a table and I want to calculate statistics for many fields in R

I am a beginner with the R software. I have a table with many fields (about 600). I need to compute Mean, Max, Min, and Standard Deviation for all the fields using a single script, saving the results into a separate file. Moreover, I would like to run Student's t-test on all the fields but one against the one left out.
Start off with this, and work your way towards your goal.
Study the apply family of functions to understand what is going on.
# Random data
dt <- data.frame(
  x = rnorm(100, mean = 4, sd = 2),
  y = rnorm(100, mean = 7, sd = 3),
  z = rnorm(100, mean = 5, sd = 2))
Option 1:
# Summarize each column/field
sm <- lapply(names(dt), function(x) c(x, summary(dt[, x])))
do.call(rbind, sm)
Option 2:
# Summarize each column/field
apply(dt, 2, FUN = summary)
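Note that summary() reports quartiles but not the standard deviation. A sketch that computes exactly the four statistics you asked for and writes them to a separate file (the file name is just an example):
stats <- t(sapply(dt, function(col) {
  c(mean = mean(col), max = max(col), min = min(col), sd = sd(col))
}))
stats
write.csv(stats, "field_stats.csv")  # hypothetical output file name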
With t.test():
lapply(names(dt), function(a) {
  lapply(names(dt), function(b) {
    t.test(dt[, a], dt[, b])
  })
})
All pairwise combinations are compared. The results are stored in a nested list, because a t.test object has many components.
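Since you actually want every field tested against a single left-out field, here is a sketch that takes the last column as the (hypothetical) reference:
ref <- names(dt)[length(dt)]  # hypothetical choice of the left-out reference field
others <- setdiff(names(dt), ref)
tests <- lapply(others, function(a) t.test(dt[[a]], dt[[ref]]))
names(tests) <- others
sapply(tests, function(tt) tt$p.value)  # p-values of each field vs the reference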
