For each reported study, I want to draw 1000 simulated values of a parameter X from a normal or log-normal distribution (based on a flag) and then combine all the simulations into one data frame. I am looking for an automated way of doing this.
What I have is a data frame with the following columns:
SOURCE NSUB MEAN SD DIST
Study1 10 1.5 0.3 0
Study2 5 2.5 0.4 1
Study1 4 3.5 0.3 0
When DIST == 0 the distribution is normal; when DIST == 1 it is log-normal.
I am able to do the simulations and combine them by hard-coding each study, for example:
#for Study1:
set.seed(1)
NSUB <- 10
MEAN <- 1.5
SD <- 0.3
DIST <- 0 #Normal distribution
df1 <- data.frame("SOURCE"="Study1","NSUB"=NSUB,"DIST"=DIST, "VALUE" = rnorm(1000, mean=MEAN, sd=SD))
#For study2
set.seed(2)
NSUB <- 5
MEAN <- 2.5
SD <- 0.4
DIST <- 1 #log-normal distribution
df2 <- data.frame("SOURCE"="Study2","NSUB"=NSUB,"DIST"=DIST, "VALUE" = rlnorm(1000, meanlog=log(MEAN), sdlog=SD))
#Combine all
dfall <- rbind(df1,df2)
However, this would be tedious because I have a lot of reported means and SDs for the parameter. I need help making this automated, so that it runs 1000 simulations for each row (using MEAN and SD) and then combines all the simulated data into one data frame.
In the interest of implementing readable and general code, you should do two things here:
Write a function that takes one row of your simulation configuration dataset and returns the simulated values as a data frame (doSim below). This makes it easier to test your simulation code separately from your iteration over simulation configurations.
Use dplyr to pass each row of the configuration to this function and collect the results into a single data frame.
Here is some sample code:
library(dplyr)
# read in the simulation configuration dataset
dfX = read.table(textConnection("
SOURCE NSUB MEAN SD DIST
Study1 10 1.5 0.3 0
Study2 5 2.5 0.4 1
Study1 4 3.5 0.3 0"),
header = TRUE, stringsAsFactors = FALSE)
# write a function that takes each row of the configuration
# data.frame and returns the simulations
doSim = function(simConfig, seed = 12345) {
  set.seed(seed)
  # pick the sampling function and the (log-)mean according to the DIST flag
  dist = if (simConfig[["DIST"]] == 0) rnorm else rlnorm
  mean = if (simConfig[["DIST"]] == 0) simConfig[["MEAN"]] else log(simConfig[["MEAN"]])
  return(
    data_frame(
      source = simConfig[["SOURCE"]],
      nsub = simConfig[["NSUB"]],
      # note: for rlnorm, `mean` and `sd` reach meanlog/sdlog via R's partial argument matching
      value = dist(1000, mean = mean, sd = simConfig[["SD"]])
    )
  )
}
# test the function
doSim(dfX[1, ])
# apply over dfX
dfX %>%
  rowwise() %>%
  do(doSim(.))
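One caveat: with the default seed argument, doSim() resets the seed to the same value on every call, so every study is drawn from the same random stream; passing a different seed per row avoids that. On current dplyr/purrr versions (where do() and data_frame() are superseded), a roughly equivalent sketch is:
library(purrr)
library(dplyr)
# simulate each configuration row with its own seed and stack the results
dfall <- map_dfr(seq_len(nrow(dfX)), function(i) doSim(dfX[i, ], seed = 12345 + i))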
I can do a permutation test for the two-sample t test, but not for the median test, the Wilcoxon test, or the Hodges-Lehmann test.
data_2000 <- c(500,450,600,700,550,551,552)
data_2019 <- c(560,460,620,720,540,600,750)
mean(data_2000)
mean(data_2019)
mean(data_2019) - mean(data_2000)
combined_data <- c(data_2000, data_2019)
set.seed(123)
null_dist <- c()
for (i in 1:100000) {
  shuffled_data <- sample(combined_data)
  shuffled_2000 <- shuffled_data[1:7]
  shuffled_2019 <- shuffled_data[8:14]
  null_dist[i] <- mean(shuffled_2019) - mean(shuffled_2000)
}
(p_value <- (sum(null_dist >= 49.57143) + sum(null_dist <= -49.57143)) / length(null_dist))
I think this is what you're trying to do. I altered your code as little as possible. There are packages like infer that will do this for you, and the for loop is not the most efficient approach, but it's plenty good enough and may help you learn. Since we're looping anyway, I computed the mean and median differences in the same loop, because all other parts of the code are identical. ifelse is an easy way to make 1s and 0s to sum.
data_2000 <- c(500,450,600,700,550,551,552)
data_2019 <- c(560,460,620,720,540,600,750)
delta_mean <- mean(data_2019) - mean(data_2000)
delta_median <- median(data_2019) - median(data_2000)
combined_data <- c(data_2000, data_2019)
trials <- 100000
set.seed(123)
mean_diff <- c()
median_diff <- c()
for (i in 1:trials) {
  shuffled_data <- sample(combined_data)
  shuffled_2000 <- shuffled_data[1:7]
  shuffled_2019 <- shuffled_data[8:14]
  mean_diff[i] <- mean(shuffled_2019) - mean(shuffled_2000)
  median_diff[i] <- median(shuffled_2019) - median(shuffled_2000)
}
p_mean <- sum(ifelse(mean_diff > delta_mean | mean_diff < -1 * delta_mean, 1, 0)) / trials
p_median <- sum(ifelse(median_diff > delta_median | median_diff < -1 * delta_median, 1, 0)) / trials
p_mean
#> [1] 0.31888
p_median
#> [1] 0.24446
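A small aside: logical vectors can be summed or averaged directly, so the ifelse() wrapper is optional. These lines are equivalent to the p-value calculations above (same objects, same strict inequalities):
p_mean_alt   <- mean(mean_diff > delta_mean | mean_diff < -delta_mean)
p_median_alt <- mean(median_diff > delta_median | median_diff < -delta_median)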
Following up on your question about the HL test, quoting Wikipedia:
The Hodges–Lehmann statistic also estimates the difference between two populations. For two sets of data with m and n observations, the set of two-element sets made of them is their Cartesian product, which contains m × n pairs of points (one from each set); each such pair defines one difference of values. The Hodges–Lehmann statistic is the median of the m × n differences.
You could run it on your data with the following code...
Do NOT run it 100,000 times: the answer is the same every time because you are already forming all 49 possible pairings.
hl_df <- expand.grid(data_2019, data_2000)
hl_df$pair_diffs <- hl_df$Var1 - hl_df$Var2
median(hl_df$pair_diffs)
[1] 49
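As a cross-check, wilcox.test() can also return the Hodges-Lehmann estimate directly when you ask it for a confidence interval; the reported "difference in location" is the median of the pairwise differences (assuming the same data_2019/data_2000 vectors as above):
wilcox.test(data_2019, data_2000, conf.int = TRUE, exact = FALSE)$estimate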
You can do the Wilcoxon test with wilcox.test in the stats package (loaded by default as part of R core). You need to set exact = FALSE because an exact p-value is not possible if there are ties.
wilcox.test(data_2019, data_2000, exact = FALSE)
Wilcoxon rank sum test with continuity correction
data: data_2019 and data_2000
W = 33.5, p-value = 0.2769
alternative hypothesis: true location shift is not equal to 0
I'll update this when I figure out how to do the other tests.
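In the meantime, here is a rough sketch of one way to do the median test (Mood's median test) on the same two vectors: classify each observation as above or at/below the pooled median and test the resulting 2x2 table. fisher.test is used here because the counts are small; note this is the classical test, not the permutation approach above.
grand_median <- median(combined_data)
above <- c(sum(data_2000 > grand_median), sum(data_2019 > grand_median))
below <- c(sum(data_2000 <= grand_median), sum(data_2019 <= grand_median))
# columns are the two groups, rows are above / at-or-below the pooled median
fisher.test(rbind(above, below))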
I have code that calculates a Kaplan-Meier (product-limit) mean and SD...
km_mean <- function(x, nd) {
  library(tidyverse)
  # first remove any missing data
  df <- tibble(x, nd) %>% filter(!is.na(x))
  x <- df %>% pull(x); nd <- df %>% pull(nd)
  # handle cases of all detects or all nondetects; in these situations, no Kaplan-Meier
  # estimate is possible or necessary; instead treat all detects as actual concentration estimates
  # and all NDs as imputed at half their reporting limits
  if (all(nd == 0)) return(tibble(mean = mean(x), sd = sd(x)))
  if (all(nd == 1)) return(tibble(mean = mean(x/2), sd = sd(x/2)))
  # for cases with mixed detects and NDs, table by nd status;
  # determine unique x values; first subtract epsilon from each nondetect to give
  # a larger rank to detects tied with NDs at the same reporting limits
  eps <- 1e-6
  x <- x - nd*eps
  nn <- nlevels(factor(x))
  # determine number of at-risk values; build Kaplan-Meier CDF and survival function;
  # note: need to augment and adjust <tab> for the calculation below to work correctly
  km.lev <- as.numeric(levels(factor(x)))
  xa <- c(x, max(x) + 1); nda <- c(nd, 0)
  tab <- table(xa, nda)
  tab[nn + 1, 1] <- 0
  km.rsk <- cumsum(tab[, 1] + tab[, 2])
  km.cdf <- rev(cumprod(1 - rev(tab[, 1])/rev(km.rsk)))[-1]
  names(km.cdf) <- as.character(km.lev)
  km.surv <- 1 - km.cdf
  km.out <- tibble(km.lev, km.rsk = km.rsk[-length(km.rsk)], km.cdf, km.surv)
  row.names(km.out) <- NULL
  # estimate adjusted mean and SD
  xm <- km.lev[1] + sum(diff(km.lev)*km.surv[-length(km.surv)])
  dif <- diff(c(0, km.cdf))
  xsd <- sqrt(sum(dif*(km.lev - xm)^2))
  names(xm) <- NULL; names(xsd) <- NULL
  tibble(mean = xm, sd = xsd)
}
My data has three columns, a sample-ID, value (x), and detect/non-detect flag (nd).
a1 0.23 0
a1 2.3 0
a1 1.6 0
a2 3.0 1
a2 3.1 0
a2 2.76 0
How can I adapt the function to run on all a1 samples as a group, then a2, etc.?
I've tried group_by commands, but can't seem to break through.
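For example, I'd like something along these lines to work (assuming the data frame is called samples_df with columns sample_id, x, and nd; those names are placeholders for my actual columns): group by the ID and let dplyr's group_modify() call km_mean() on each piece, since the function already returns a one-row tibble.
library(dplyr)
result <- samples_df %>%
  group_by(sample_id) %>%
  group_modify(~ km_mean(.x$x, .x$nd)) %>%
  ungroup()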
I have a data frame in which a variable (var1) is measured over time in seconds. I want to calculate the mean of var1 for each sample over consecutive 10-second intervals, up to 500 seconds.
The data frame looks like this:
sample time var1
S1 1 3.5
S1 2 6.3
S1 3 7.8
S1 4 20.5
S1 … ...
S1 530 4.5
S2 1 6.7
S2 2 20.3
S2 3 5.4
S2 … ...
S2 710 70.3
...
The data frame that I want to obtain looks like this
Sample var1_mean10:20sec var1_mean20:30sec .... var1_mean490:500sec
S1
S2
..
So I wrote this code:
setwd("…")
A <- read_excel("dati.xlsx")
for (cat in unique(A$sample)) {
  A.s <- subset(A, A$sample == cat)
  cuts <- cut(A.s$time, breaks = seq.int(from = 0, to = 500, by = 10))
  d <- by(A.s$var1, cuts, mean)
  Y <- data.frame(d)
  j <- t(Y)
  write.csv(Y, file = paste(cat, "var1", sep = "_"))
}
But when I run it I get this error: Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class "by" to a data.frame
The plan is to eventually merge all the different csv.
If I understood your problem correctly, you are trying to average your data over 10-second intervals. I would like to propose an alternative approach using the function aggregate to compute the mean over each 10-second interval. The intervals are defined by an auxiliary 'times' grouping variable, which aggregate then uses to average var1.
# try to create some data similar to yours
A <- data.frame(sample = c(rep('A1', 530), rep('A2', 710)),
time = c(1 : 530, 1:710), var1 = runif(530+710))
A$times <- ceiling(A$time / 10)  # seconds 1-10 -> interval 1, 11-20 -> interval 2, ...
Y <- aggregate(var1 ~ sample + times, data = A, FUN = mean)
Then you could export Y straight away.
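If you also want the wide layout sketched in the question (one column per 10-second interval), one possible follow-up, assuming tidyr is available:
library(tidyr)
# one row per sample, one column per 10-second interval
Y_wide <- pivot_wider(Y, names_from = times, values_from = var1, names_prefix = "mean_interval_")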
HTH
Solved:
library(readxl)
A <- read_excel("data.xlsx")
n <- subset(A, time <= 500)
d <- data.frame(sample = n$sample, time = n$time, ms = n$var1)
# one row per 10-second interval (50 intervals up to 500 s); one column is added per sample below
storage.data <- data.frame(matrix(nrow = 50, ncol = 0))
for (cat in unique(d$sample)) {
  g <- subset(d, d$sample == cat)
  cuts <- cut(g$time, breaks = seq.int(from = 0, to = 500, by = 10))
  p <- by(g$ms, cuts, mean)
  storage.data[[cat]] <- as.numeric(p)
}
View(storage.data)
storage.data_t <- t(storage.data)
View(storage.data_t)
write.csv(storage.data_t, file = "filename.csv")
I've been racking my brain for the past four hours trying to find the solution to an R problem, which is driving me nuts. I've been searching everywhere for a decent answer but so far I've been hitting wall after wall. I am now appealing to the good will of this fine community for help.
Consider the following dataset:
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
I need to perform a t-test for every row in DataSample in order to find out if groups TRIAL and CONTROL differ (equal variance applies).
Then I need to count the number of rows with a p-value equal to, or lower than 0.05.
So here is the code I tried, which I know is wrong:
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
pValResults <- apply(
DataSample[,1:12],1,function(x) t.test(x,DataSample[,13:24], var.equal=T)$p.value
)
sum(pValResults < 0.05) # Returns the wrong answer (so I was told)
I did try looking at many similar questions around Stack Overflow, but I would often end up with syntax errors or a dimension mismatch. The code above is the best I could get without hitting an R error, but since it returns the wrong answer I have nothing to be proud of.
Any advice will be greatly appreciated! Thanks in advance for your time.
One option is to loop over the data set calculating the t test for each row, but it is not as elegant.
set.seed(2112)
DataSample <- matrix(rnorm(24000),nrow=1000)
colnames(DataSample) <- c(paste("Trial",1:12,sep=""),paste("Control",13:24,sep=""))
# initialize vector of stored p-values
pvalue <- rep(0,nrow(DataSample))
for (i in 1:nrow(DataSample)){
  # the question specifies equal variances, so pass var.equal = TRUE
  pvalue[i] <- t.test(DataSample[i, 1:12], DataSample[i, 13:24], var.equal = TRUE)$p.value
}
# finding number that are significant
sum(pvalue < 0.05)
I converted to a data.table, and the answer I got was 45:
library(data.table)
DataSample.dt <- as.data.table(DataSample)
sum(sapply(seq_len(nrow(DataSample.dt)), function(x)
t.test(DataSample.dt[x, paste0('Trial', 1:12), with=F],
DataSample.dt[x, paste0('Control', 13:24), with=F],
var.equal=T)$p.value) < 0.05)
To do a paired t-test, you need to supply the paired = TRUE argument. The t.test function isn't vectorised, but it's quite simple to run t-tests on a whole matrix at a time. Here are three methods (including one using apply):
library("genefilter")
library("matrixStats")
library("microbenchmark")
dd <- DataSample[, 1:12] - DataSample[, 13:24]
microbenchmark::microbenchmark(
  manual = {ps1 <- 2 * pt(-abs(rowMeans(dd) / sqrt(rowVars(dd) / ncol(dd))), ncol(dd) - 1)},
  apply = {ps2 <- apply(DataSample, 1, function(x) t.test(x[1:12], x[13:24], paired=TRUE)$p.value)},
  rowttests = {ps3 <- rowttests(dd)[, "p.value"]})
#Unit: milliseconds
#      expr        min         lq       mean     median         uq        max neval
#    manual   1.611808   1.641783   1.677010   1.663122   1.709401   1.852347   100
#     apply 390.869635 398.720930 404.391487 401.508382 405.715668 634.932675   100
# rowttests   2.368823   2.417837   2.639671   2.574320   2.757870   7.207135   100
You can see the manual method is over 200x faster than apply.
If you actually meant an unpaired test, here's the equivalent comparison:
microbenchmark::microbenchmark(
  manual = {x <- DataSample[, 1:12]; y <- DataSample[, 13:24]; ps1 <- 2 * pt(-abs((rowMeans(x) - rowMeans(y)) / sqrt((rowVars(x) + rowVars(y)) / ncol(x))), ncol(DataSample) - 2)},
  apply = {ps2 <- apply(DataSample, 1, function(x) t.test(x[1:12], x[13:24], var.equal = TRUE)$p.value)},
  rowttests = {ps3 <- rowttests(DataSample, factor(rep(1:2, each = 12)))[, "p.value"]})
Note that the manual method assumes the two groups are the same size.
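If the group sizes differed, the same idea works with the usual pooled variance. A minimal sketch (row_t_pooled is just an illustrative name; it reuses matrixStats::rowVars loaded above):
# pooled-variance (equal-variance) row-wise t-test allowing unequal group sizes
row_t_pooled <- function(x, y) {
  nx <- ncol(x); ny <- ncol(y)
  sp2 <- ((nx - 1) * rowVars(x) + (ny - 1) * rowVars(y)) / (nx + ny - 2)  # pooled variance
  tstat <- (rowMeans(x) - rowMeans(y)) / sqrt(sp2 * (1 / nx + 1 / ny))
  2 * pt(-abs(tstat), df = nx + ny - 2)                                   # two-sided p-values
}
ps4 <- row_t_pooled(DataSample[, 1:12], DataSample[, 13:24])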
Adding an alternative using an external library.
Performing the test:
library(matrixTests)
res <- row_t_equalvar(DataSample[,1:12], DataSample[,13:24])
Format of the result:
res
obs.x obs.y obs.tot mean.x mean.y mean.diff var.x var.y var.pooled stderr df statistic pvalue conf.low conf.high alternative mean.null conf.level
1 12 12 24 0.30569721 0.160622830 0.145074376 0.5034806 1.0769678 0.7902242 0.3629105 22 0.399752487 0.69319351 -0.6075559 0.89770469 two.sided 0 0.95
2 12 12 24 -0.27463354 -0.206396781 -0.068236762 0.8133311 0.2807800 0.5470556 0.3019535 22 -0.225984324 0.82329990 -0.6944500 0.55797651 two.sided 0 0.95
3 12 12 24 -0.19805092 -0.023207888 -0.174843032 0.4278359 0.5604078 0.4941219 0.2869733 22 -0.609265949 0.54858909 -0.7699891 0.42030307 two.sided 0 0.95
Number of rows with p <= 0.05:
> sum(res$pvalue <= 0.05)
[1] 4
I have some data showing a long list of regions, the population of each region and the number of people in each region with a certain disease. I'm trying to show the confidence intervals for each proportion (but I'm not testing whether the proportions are statistically different).
One approach is to manually calculate the standard errors and confidence intervals but I'd like to use a built-in tool like prop.test, because it has some useful options. However, when I use prop.test with vectors, it runs a chi-square test across all the proportions.
I've solved this with a while loop (see dummy data below), but I sense there must be a better and simpler way to approach this problem. Would apply work here, and how? Thanks!
dat <- data.frame(1:5, c(10, 50, 20, 30, 35))
names(dat) <- c("X", "N")
dat$Prop <- dat$X / dat$N
ConfLower <- 0
x <- 1
while (x < 6) {
  a <- prop.test(dat$X[x], dat$N[x])$conf.int[1]
  ConfLower <- c(ConfLower, a)
  x <- x + 1
}
ConfUpper <- 0
x <- 1
while (x < 6) {
  a <- prop.test(dat$X[x], dat$N[x])$conf.int[2]
  ConfUpper <- c(ConfUpper, a)
  x <- x + 1
}
dat$ConfLower <- ConfLower[2:6]
dat$ConfUpper <- ConfUpper[2:6]
Here's an attempt using Map, essentially stolen from a previous answer here:
https://stackoverflow.com/a/15059327/496803
res <- Map(prop.test, dat$X, dat$N)
# each element of res is an htest object; pull out its 2-element conf.int and bind the results as rows
dat[c("lower","upper")] <- t(sapply(res, "[[", "conf.int"))
# X N Prop lower upper
#1 1 10 0.1000000 0.005242302 0.4588460
#2 2 50 0.0400000 0.006958623 0.1485882
#3 3 20 0.1500000 0.039566272 0.3886251
#4 4 30 0.1333333 0.043597084 0.3164238
#5 5 35 0.1428571 0.053814457 0.3104216
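Essentially the same thing can be written with mapply if you prefer that to Map; a small equivalent sketch:
ci <- t(mapply(function(x, n) prop.test(x, n)$conf.int, dat$X, dat$N))
dat$lower <- ci[, 1]
dat$upper <- ci[, 2]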