Calculate empiric pvalue according to a distribution in R

I have a list such as:
> generate_all[1,]
[1] 16 16 19 17 17 16 17 18 13 18 18 21 19 9 27 16 13 18 20 23 17 18 17 14 13 22 20 17 20 16 15 15 19 14 15 13 19 18 18 15 19
[42] 18 18 20 21 24 12 14 14 15 17 16 15 11 19 22 18 14 22 9 21 14 20 17 23 13 21 15 17 19 20 15 14 19 23 14 17 14 20 16 17 20
[83] 14 17 17 22 14 20 26 24 13 13 15 21 20 19 13 17 20 13 17 17 23 19 23 13 22 18 14 15 22 13 21 15 25 15 22 17 19 10 16 18 18
[124] 12 8 22 15 27 21 13 15 19 21 17 11 18 16 22 19 20 17 20 19 22 24 22 18 11 21 16 10 12 29 13 10 19 18 19 19 26 13 16 20 18
[165] 17 18 15 14 16 19 25 21 11 16 16 17 11 22 16 13 19 24 18 13 19 24 11 17 11 25 11 11 24 20 18 15 8 13 14 11 21 19 17 21 21
[206] 16 20 20 21 14 15 17 15 18 17 20 19 13 14 25 13 17 25 9 13 14 19 18 19 18 23 23 24 19 14 24 17 21 17 16 13 20 18 14 20 17
[247] 22 20 12 19 19 20 21 21 12 19 18 16 16 11 15 11 15 26 15 18 13 18 18 17 20 26 13 15 16 12 15 16 18 20 9 20 18 17 19 23 12
[288] 13 25 11 17 16 17 21 19 17 19 29 14 22 15 19 21 20 18 18 16 12 18 21 13 7 17 20 16 16 26 18 9 12 17 18 20 17 20 28 11 17
[329] 21 28 11 20 17 10 8 24 15 20 17 15 17 8 14 13 18 15 17 14 20 28 16 19 17 18 26 19 17 23 19 24 10 12 14 17 15 16 17 12 17
[370] 13 13 18 18 14 18 22 20 14 14 11 16 17 16 14 16 23 10 15 20 16 16 14 16 20 12 10 18 16 15 16 15 25 13 20 22 20 18 13 15 24
[411] 14 21 18 14 20 11 13 19 13 13 12 19 23 21 17 25 16 12 15 19 14 15 21 19 20 17 18 15 16 23 17 14 23 21 15 14 19 20 13 16 18
[452] 13 18 16 23 16 14 17 17 16 23 8 18 22 18 15 19 15 16 25 19 13 20 17 22 18 10 20 19 13 19 18 14 15 15 14 11 16 17 17 25 14
[493] 11 12 19 14 14 16 15 14 11 9 15 21 19 25 19 13 17 12 15 22 12 22 15 14 20 19 19 12 20 18 11 18 17 22 13 17 15 18 17 22 16
[534] 15 14 14 29 14 21 17 24 21 14 17 20 19 19 16 17 13 21 15 20 14 12 22 17 21 13 15 21 9 24 15 14 12 17 15 17 15 21 17 24 19
[575] 18 14 16 12 17 11 14 17 20 21 22 17 16 18 24 24 21 17 12 18 18 17 19 22 27 14 16 15 18 24 17 24 18 13 13 12 14 19 11 16 15
[616] 16 23 17 20 9 15 14 17 16 23 15 17 18 18 7 11 27 21 14 20 17 18 19 17 22 18 11 16 14 18 25 17 19 15 22 20 14 18 19 25 25
[657] 16 17 17 22 16 11 23 7 20 18 17 24 17 16 19 14 19 12 17 21 20 16 23 16 14 14 20 22 12 22 19 19 18 23 19 13 22 14 16 24 20
[698] 18 24 29 16 13 11 25 11 16 16 17 26 20 15 14 16 18 13 15 14 13 23 15 17 17 14 15 26 18 13 14 23 22 17 16 18 17 19 16 17 17
[739] 11 16 21 12 14 14 15 17 18 15 19 20 25 16 19 10 27 12 25 14 23 14 22 19 19 13 18 12 18 16 19 21 13 12 16 17 19 19 18 20 17
[780] 20 18 15 16 13 14 9 13 13 20 20 16 19 13 12 13 17 20 14 18 17 24 14 10 16 16 18 16 20 14 13 12 19 18 11 21 23 10 13 17 21
[821] 15 16 9 16 19 15 15 15 20 19 11 19 21 16 13 16 19 22 20 14 15 12 20 21 10 16 17 13 13 15 12 18 19 21 25 12 23 16 19 17 16
[862] 20 18 18 13 22 17 18 13 10 20 16 17 15 24 9 16 17 14 20 21 15 19 7 19 18 21 19 23 18 22 15 15 14 18 20 21 19 23 25 15 11
[903] 19 13 14 16 17 14 15 15 14 11 21 20 17 21 12 21 16 20 17 22 15 20 13 17 17 16 24 19 17 28 22 15 19 16 21 15 22 17 16 13 18
[944] 9 13 15 26 19 20 21 16 23 22 9 14 20 21 23 17 22 22 24 20 18 17 14 17 18 15 20 15 22 20 19 17 20 21 18 22 13 17 23 16 17
[985] 22 16 17 16 15 15 22 21 14 18 24 14 18 12 13 21
and I would like to calculate an empirical p-value.
Here is the idea (in pseudocode):
if value_obs > mean(generate_all[1,]):
    pvalue = sum(i > value_obs for i in generate_all[1,]) / length(generate_all[1,])
else:
    pvalue = sum(i < value_obs for i in generate_all[1,]) / length(generate_all[1,])
So, because value_obs = 5 and the mean of the example distribution is
> mean(generate_all[1,])
[1] 17.2347
I want to calculate the number of values that are above 5 in the distribution and turn it into a proportion by dividing by the length of generate_all[1,].
Does anyone have an idea how to do this in R?

I am not sure I understand your issue exactly, but if you want to compute your custom p-value for each column of your data.frame, you can use the apply function provided in R:
pvalue <- apply(df, 2, function(x) {
  # compare against the observed value rather than a hard-coded 5
  if (value_obs > mean(x)) {
    sum(x > value_obs) / length(x)
  } else {
    sum(x < value_obs) / length(x)
  }
})
Here, pvalue will be a vector containing the pvalues computed for each column of the dataframe.
You should check Grouping functions (tapply, by, aggregate) and the *apply family for detailed explanations.
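
For the single vector shown in the question, the same rule can be sketched as a small function. This is only an illustration under assumptions: the name empirical_pvalue is made up here, and the simulated null_values stand in for generate_all[1,], which is not reproduced in full.

```r
# Hypothetical helper: proportion of simulated values that are more extreme
# than the observed one, on the side of the mean where the observation falls.
empirical_pvalue <- function(null_values, value_obs) {
  if (value_obs > mean(null_values)) {
    mean(null_values > value_obs)   # share of values above the observation
  } else {
    mean(null_values < value_obs)   # share of values below the observation
  }
}

# Stand-in for generate_all[1, ]: 1000 simulated counts (an assumption).
set.seed(42)
null_values <- rpois(1000, 17)
empirical_pvalue(null_values, value_obs = 5)
```

Taking mean() of a logical vector is the same as sum(...) / length(...). To run it over every row of a matrix such as generate_all, one option would be apply(generate_all, 1, empirical_pvalue, value_obs = 5).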

Related

Generating a vector with n repetitions of x, then y, then z, with a fixed upper bound

I am trying to create a vector where I have 3 repetitions of the number 1, then 3 repetitions of the number 2, and so on up to, for instance, 3 repetitions of the number 36.
c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5...)
I have tried the following use of rep() but got the following error:
Error in rep(3, seq(1:36)) : invalid 'times' argument
What formulation do I need to use to properly generate the vector I want?

sort(rep(1:36, 3))
Or, even better, as @Wimpel mentioned in the comments, use the each argument of the rep function.
rep(1:36, each = 3)
output
# [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 9 9 9 10 10 10 11 11 11 12 12 12 13 13 13 14 14 14 15 15 15 16 16 16 17 17 17 18 18 18 19 19 19 20 20 20 21 21 21 22
# [65] 22 22 23 23 23 24 24 24 25 25 25 26 26 26 27 27 27 28 28 28 29 29 29 30 30 30 31 31 31 32 32 32 33 33 33 34 34 34 35 35 35 36 36 36
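
A short sketch of how the first two arguments of rep() interact may help explain the error in the question: rep(3, seq(1:36)) asks for the single value 3 to be repeated, with a times vector of length 36, and times must have length 1 or the same length as x. The variable names below are illustrative only.

```r
x <- 1:4

rep(x, times = 3)   # whole vector repeated: 1 2 3 4 1 2 3 4 1 2 3 4
rep(x, each = 3)    # each element repeated: 1 1 1 2 2 2 3 3 3 4 4 4
rep(x, times = x)   # per-element counts:    1 2 2 3 3 3 4 4 4 4

# The vector asked for in the question
v <- rep(1:36, each = 3)
length(v)           # 108
```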

This one should work, although it is probably not the most elegant.
reps = c()
n = 36
for(i in 1:n){
reps = append(reps, rep(i, 3))
}
reps
Alternatively, use the rep function properly (see ?rep for the each argument):
rep(1:36,each = 3)

The rep approach is preferable (see the existing answers).
Here are some other options:
> kronecker(1:36, rep(1, 3))
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 9
[26] 9 9 10 10 10 11 11 11 12 12 12 13 13 13 14 14 14 15 15 15 16 16 16 17 17
[51] 17 18 18 18 19 19 19 20 20 20 21 21 21 22 22 22 23 23 23 24 24 24 25 25 25
[76] 26 26 26 27 27 27 28 28 28 29 29 29 30 30 30 31 31 31 32 32 32 33 33 33 34
[101] 34 34 35 35 35 36 36 36
> c(outer(rep(1, 3), 1:36))
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 9
[26] 9 9 10 10 10 11 11 11 12 12 12 13 13 13 14 14 14 15 15 15 16 16 16 17 17
[51] 17 18 18 18 19 19 19 20 20 20 21 21 21 22 22 22 23 23 23 24 24 24 25 25 25
[76] 26 26 26 27 27 27 28 28 28 29 29 29 30 30 30 31 31 31 32 32 32 33 33 33 34
[101] 34 34 35 35 35 36 36 36

Order Levels (format: numbers) in a factor from low to high in RStudio

I have a problem in R. I have created a factor (called reference), but the levels are not in the right order; I want them ordered from low to high (1, 2, 3, ..., 30). I tried reference <- relevel(reference, 1), but then the order is still not right. Is there a way to change the order as I want?
reference
[1] 5 5 1 5 5 5 1 1 1 1 1 11 1 1 1 5 1 5 1 2 1 1 1 1 2 1 1 1 3 1 2 1 2 15 2 2 2 15
[39] 16 3 2 2 4 2 16 23 2 14 2 4 2 3 2 14 4 24 2 2 2 2 3 4 3 3 3 3 25 3 2 3 3 3 3 3 25 3
[77] 3 3 3 1 3 15 3 3 3 3 3 1 1 3 8 4 4 4 4 8 4 4 4 4 4 4 4 4 4 4 8 4 4 4 4 8 4 4
[115] 4 4 15 8 4 16 8 16 14 14 5 5 5 5 7 5 16 5 14 16 14 14 5 5 5 5 14 5 3 5 7 8 4 7 5 5 6 4
[153] 4 15 15 15 6 4 6 14 4 14 15 6 4 11 6 28 16 6 16 15 9 14 6 14 15 6 16 14 7 14 16 16 16 16 7 7 14 16
[191] 16 7 15 7 4 15 7 15 14 15 15 9 14 7 16 15 15 15 16 14 8 8 9 4 8 8 10 8 4 7 8 4 8 4 8 8 8 8
[229] 8 8 9 8 8 4 8 8 14 8 8 8 29 14 29 9 29 14 9 16 29 10 29 14 16 9 9 29 29 29 9 29 16 4 4 9 15 29
[267] 9 23 29 9 10 4 10 10 10 10 10 10 10 10 10 10 14 5 10 10 15 11 10 11 10 10 11 11 11 15 4 10 15 10 10 11 11 10
[305] 10 10 11 11 10 11 11 10 11 11 11 10 11 10 8 11 10 10 11 11 10 10 11 11 10 11 14 16 7 15 12 14 14 15 15 14 14 14
[343] 12 17 11 2 15 16 7 16 15 15 15 14 17 28 5 7 17 16 11 13 13 11 13 13 13 13 16 13 13 13 11 13 15 13 13 13 11 13
[381] 13 13 10 13 13 13 13 13 13 10 11 14 15 14 4 14 14 15 8 14 4 4 14 14 14 15 15 14 4 4 14 4 4 4 14 7 8 14
[419] 11 11 15 15 16 3 5 11 15 14 15 15 4 3 15 4 15 15 15 14 14 15 16 15 14 15 11 15 15 15 15 16 7 4 16 16 16 14
[457] 15 14 16 16 16 16 15 12 4 4 4 14 16 16 15 15 14 26 16 26 14 16 4 13 17 21 17 21 17 17 17 17 17 20 17 21 17 17
[495] 18 17 18 17 18 21 17 21 20 18 21 18 18 20 17 17 17 20 17 17 18 18 18 20 18 18 21 18 21 17 17 17 18 18 21 18 17 18
[533] 17 21 17 20 18 22 18 20 19 18 18 19 4 19 18 19 19 15 1 19 17 7 3 20 17 19 19 19 20 18 19 19 19 19 20 18 15 14
[571] 21 20 20 20 20 22 20 20 19 20 20 20 20 20 22 20 21 18 20 20 21 21 17 18 21 20 20 18 20 20 21 17 21 21 22 21 20 20
[609] 21 21 21 21 17 21 18 21 17 18 17 20 20 18 20 20 18 21 21 21 20 17 21 22 22 22 22 22 22 22 21 22 22 22 21 22 22 21
[647] 22 22 21 29 21 22 22 22 22 22 22 20 18 22 8 15 4 4 15 4 4 15 15 15 4 4 4 15 4 15 15 23 4 23 4 2 8 23
[685] 4 23 10 2 4 7 4 7 18 24 15 15 26 11 15 15 4 24 7 15 15 5 24 15 4 1 4 8 24 23 23 6 4 3 23 4 5 1
[723] 1 1 1 2 1 1 1 1 1 1 1 1 3 2 1 5 1 2 2 1 1 1 20 2 3 25 2 1 3 15 15 15 14 14 14 5 14 15
[761] 4 18 14 8 26 4 15 20 10 16 8 4 15 15 16 4 23 18 15 4 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27
[799] 27 27 27 27 27 27 27 27 27 27 27 27 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28
[837] 28 28 28 28 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 30 30 30 30
[875] 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
Levels: 26 28 27 29 3 30 4 5 6 7 8 9 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 23 24 25

You can convert the factor to numbers and then back to a factor:
reference <- factor(as.numeric(as.character(reference)))
Or, if you already know the range of the levels:
reference <- factor(reference, levels = 1:30)
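
A small sketch of why the as.character() step matters, using a made-up five-element factor rather than the data from the question: calling as.numeric() directly on a factor returns the internal level codes, not the numbers shown as labels.

```r
# Toy factor (assumption: labels are numbers stored as text, as in the question)
reference <- factor(c("10", "2", "1", "30", "2"))
levels(reference)                 # sorted as text: "1" "10" "2" "30"

as.numeric(reference)             # internal codes, not the labels: 2 3 1 4 3
as.numeric(as.character(reference))  # the actual values: 10 2 1 30 2

fixed <- factor(as.numeric(as.character(reference)))
levels(fixed)                     # numeric order: "1" "2" "10" "30"
```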

Plotting multi boxplot with pvalues in R

I have a hypothetical sample of my data as follows:
df <- read.table(header = TRUE, text =
"time goal book pen temp1 temp2 weight hight
18 13 18 15 13 15 18 13
18 13 20 16 16 15 20 19
20 12 20 14 18 20 17 15
15 19 18 16 18 20 15 15
16 17 14 12 17 20 20 15
17 12 18 16 17 19 14 16
20 18 15 18 19 13 15 18
13 14 19 20 14 18 12 15
20 16 18 12 16 16 14 13
12 19 19 15 20 14 20 16
14 16 13 14 15 15 15 17
18 19 17 13 14 13 15 17
16 15 14 17 12 18 14 14
19 20 15 13 12 16 20 15
17 12 18 16 16 17 12 20
20 16 14 19 17 20 17 13
17 13 15 16 15 15 17 17
12 17 15 16 14 16 18 13
15 18 17 20 15 13 14 19
19 12 13 17 12 20 12 13
19 18 14 19 15 20 12 16
20 17 18 15 13 19 19 17
16 18 17 19 16 16 12 16
12 15 19 18 20 15 17 19"
)
Please note that I have more columns. I want to plot boxplots for each group and see them within a single plot.
The logic is: time with goal; book with pen; temp1 with temp2; and weight with hight.
I would like to see p-values for each group, e.g., time with goal, book with pen...
I struggled to show the desired outcome, but I hope that my description makes sense.

Based on your data, this may be what you are looking for:
library(ggpubr)
# get data into long format
long <- data.table::melt(df)
# get column names
nms <- colnames(df)
# get pairs of variables in a list to pass to comparisons
comps <- split(nms, ceiling(seq_along(nms) / 2))
# add group to data for colouring
long$group <- rep(letters[1:4], each = nrow(long) / 4)
# plot
ggboxplot(long, x = "variable", y = "value", fill = "group") +
  stat_compare_means(comparisons = comps)
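
To check the numbers behind the plot, the same pairing can be sketched in base R. This assumes the plot compares each pair with an unpaired Wilcoxon test (as far as I know the default of stat_compare_means for two groups); the data frame below uses only the first six rows and first four columns from the question, and exact = FALSE is my choice to avoid warnings about ties.

```r
# First six rows of time/goal/book/pen from the question's data
df <- data.frame(
  time = c(18, 18, 20, 15, 16, 17),
  goal = c(13, 13, 12, 19, 17, 12),
  book = c(18, 20, 20, 18, 14, 18),
  pen  = c(15, 16, 14, 16, 12, 16)
)

# Consecutive columns form a pair: (time, goal), (book, pen)
nms <- colnames(df)
comps <- split(nms, ceiling(seq_along(nms) / 2))

# One Wilcoxon rank-sum p-value per pair
pvals <- sapply(comps, function(pair) {
  wilcox.test(df[[pair[1]]], df[[pair[2]]], exact = FALSE)$p.value
})
names(pvals) <- vapply(comps, paste, character(1), collapse = " vs ",
                       USE.NAMES = FALSE)
pvals
```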

Display vector in R with a defined viewport

I want to display a vector consistently across different R environments.
For example, a vector like this
c(1:30)
should display 24 values per row
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
[25] 25 26 27 28 29 30
and not
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

The closest thing to what you are looking for is to use options() to configure the width of the results window:
options(width = 75)
c(1:30)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[24] 24 25 26 27 28 29 30
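
If changing the global option is undesirable, the setting can be scoped to a single call. The helper name print_with_width is made up for this sketch, and the default of 76 is an assumption that works out to 24 values per line for 1:30 specifically; the count per line depends on how wide each element and the index prefix print, so other vectors need a different width.

```r
# Hypothetical helper: print x with a temporary console width,
# restoring the previous setting afterwards.
print_with_width <- function(x, width = 76) {
  old <- options(width = width)
  on.exit(options(old), add = TRUE)
  print(x)
}

print_with_width(1:30)   # second line starts at index [25]
```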

Rename several columns (variable number)

I have a dataset where the last columns indicate the number of stops extracted from that dataset.
ColA ColB ColC 1 2 3 4 5 6 7 8 9 10 (...)
a g c a q e r e r q g h q (...)
What I want is to select every column from column 1 to the last column and add Stop before its name, ending up with Stop1, Stop2, etc.
The problem is that the number of those columns can vary. Sometimes they go up to 10, other times only up to 6.
I've tried with dplyr and data.table but I'm not sure how to automate this.
EDIT: ColA to ColC are fixed and always the same.

If I understood your problem correctly, the following code is flexible enough to solve it. Start by considering the following dataset:
set.seed(1)
df <- data.frame(matrix(rpois(130, 20),ncol=13))
names(df) <- c(paste("Col",LETTERS[1:3],sep=""),as.character(1:10))
df
#######
ColA ColB ColC 1 2 3 4 5 6 7 8 9 10
1 17 21 20 13 13 15 29 25 16 15 12 23 17
2 25 17 11 24 23 14 22 23 25 14 18 19 15
3 25 18 22 18 19 30 16 19 23 27 18 19 11
4 21 18 24 25 23 19 19 18 27 23 18 16 18
5 13 21 16 18 21 23 22 18 22 24 22 26 15
6 22 16 17 27 17 20 24 24 14 21 19 17 15
7 23 23 18 22 16 16 20 18 21 27 17 22 14
8 22 22 17 17 26 13 19 25 24 17 15 13 20
9 18 24 21 22 28 26 15 22 23 20 19 15 27
10 26 23 19 16 18 20 17 25 16 20 19 18 19
Now rename the columns as required:
k <- which(names(df)=="1")
names(df)[k:ncol(df)] <- paste("Stop",1:(ncol(df)-k+1),sep="")
df
#############
ColA ColB ColC Stop1 Stop2 Stop3 Stop4 Stop5 Stop6 Stop7 Stop8 Stop9 Stop10
1 17 21 20 13 13 15 29 25 16 15 12 23 17
2 25 17 11 24 23 14 22 23 25 14 18 19 15
3 25 18 22 18 19 30 16 19 23 27 18 19 11
4 21 18 24 25 23 19 19 18 27 23 18 16 18
5 13 21 16 18 21 23 22 18 22 24 22 26 15
6 22 16 17 27 17 20 24 24 14 21 19 17 15
7 23 23 18 22 16 16 20 18 21 27 17 22 14
8 22 22 17 17 26 13 19 25 24 17 15 13 20
9 18 24 21 22 28 26 15 22 23 20 19 15 27
10 26 23 19 16 18 20 17 25 16 20 19 18 19
I hope it can help you.
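
A variant that does not depend on where column "1" sits is to match every purely numeric column name with a regular expression. This is a sketch on a made-up three-stop data frame; it assumes the stop columns are exactly those whose names consist only of digits.

```r
# Toy data: fixed ColA..ColC plus a variable number of numbered columns
df <- data.frame(ColA = "a", ColB = "g", ColC = "c",
                 `1` = "a", `2` = "q", `3` = "e",
                 check.names = FALSE)

# TRUE for names made only of digits, e.g. "1", "2", "10"
is_stop <- grepl("^[0-9]+$", names(df))
names(df)[is_stop] <- paste0("Stop", names(df)[is_stop])

names(df)   # "ColA" "ColB" "ColC" "Stop1" "Stop2" "Stop3"
```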