Using split function in R

I am trying to simulate three small datasets, each of which contains x1, x2, x3, x4, trt and IND.
However, when I try to split the simulated data by IND using "split" in R, I get warning messages and the outputs are incorrect. Could someone please give me a hint as to what I did wrong in my R code?
# Step 2: simulate data
Alpha = 0.05
S = 3 # number of replicates
x = 8 # number of covariates
G = 3 # number of treatment groups
N = 50 # number of subjects per dataset
tot = S*N # total subjects for a simulation run
# True parameters
alpha = c(0.5, 0.8) # intercepts
b1 = c(0.1,0.2,0.3,0.4) # for pi_1 of trt A
b2 = c(0.15,0.25,0.35,0.45) # for pi_2 of trt B
b = c(1.1,1.2,1.3,1.4);
##############################################################################
# Scenario 1: all covariates are independent standard normally distributed #
##############################################################################
set.seed(12)
x1 = rnorm(n=tot, mean=0, sd=1);x2 = rnorm(n=tot, mean=0, sd=1);
x3 = rnorm(n=tot, mean=0, sd=1);x4 = rnorm(n=tot, mean=0, sd=1);
###############################################################################
p1 = exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4)/
(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
p2 = exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4)/
(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
p3 = 1/(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))
# To assign subjects to one of treatment groups based on response probabilities
tmp = function(x){sample(c("A","B","C"), 1, prob=x, replace=TRUE)}
trt = apply(cbind(p1,p2,p3),1,tmp)
IND=rep(1:S,each=N) #create an indicator for split simulated data
sim=data.frame(x1,x2,x3,x4,trt, IND)
Aset = subset(sim, trt=="A")
Bset = subset(sim, trt=="B")
Cset = subset(sim, trt=="C")
Anew = split(Aset, f = IND)
Bnew = split(Bset, f = IND)
Cnew = split(Cset, f = IND)
The warning message:
> Anew = split(Aset, f = IND)
Warning message:
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable
and the output becomes
$`2`
x1 x2 x3 x4 trt IND
141 1.0894068 0.09765185 -0.46702047 0.4049424 A 3
145 -1.2953113 -1.94291045 0.09926239 -0.5338715 A 3
148 0.0274979 0.72971804 0.47194731 -0.1963896 A 3
$`3`
[1] x1 x2 x3 x4 trt IND
<0 rows> (or 0-length row.names)
I have checked my R code several times, but I can't figure out what I did wrong. Many thanks in advance.

IND is the global variable built for the full data, sim, so it has one entry per row of sim rather than one per row of Aset. You want to use the IND column of the subset itself, e.g.
Anew <- split(Aset, f = Aset$IND)
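As a minimal sketch of the fix above, the following uses a cut-down simulated data frame (one covariate and a randomly assigned trt, both assumptions made only to keep the example short) and splits the subset by its own IND column. The last line shows an alternative that splits the full data by treatment and replicate in one call.

```r
set.seed(12)
S <- 3    # number of replicates
N <- 50   # subjects per replicate

# Simplified stand-in for the simulated data in the question
sim <- data.frame(
  x1  = rnorm(S * N),
  trt = sample(c("A", "B", "C"), S * N, replace = TRUE),
  IND = rep(1:S, each = N)
)

Aset <- subset(sim, trt == "A")

# Split by the IND column that belongs to the subset, not the global IND
Anew <- split(Aset, f = Aset$IND)

# Alternative: split the full data by treatment and replicate at once
by_trt_ind <- split(sim, list(sim$trt, sim$IND))
names(by_trt_ind)
```

Each element of Anew then contains only rows whose IND matches the element name.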

It's a warning, not an error, which means split executed successfully, but may not have done what you wanted to do.
From the "details" section of the help file:
f is recycled as necessary and if the length of x is not a multiple of
the length of f a warning is printed. Any missing values in f are
dropped together with the corresponding values of x.
Try checking the length of IND against the number of rows of the data frame you are splitting.
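The length check can be sketched on a toy data frame (df, sub and the column names here are made up for illustration). The grouping vector taken from the full data is longer than the subset, which triggers the recycling warning, while the subset's own column does not.

```r
df  <- data.frame(v = 1:5, g = c(1, 1, 2, 2, 2))
sub <- df[df$v > 2, ]   # 3 rows, all with g == 2

# The subset has 3 rows but the "global" grouping vector has 5 entries
c(rows_in_subset = nrow(sub), length_of_global_g = length(df$g))

# Capture the warning instead of letting it print
w <- tryCatch(split(sub, f = df$g), warning = function(w) w)
inherits(w, "warning")

# Using the subset's own column gives the intended grouping
ok <- split(sub, f = sub$g)
names(ok)
```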

Not sure what your goal is once you have your data split, but this sounds like a good candidate for the plyr package.
> library(plyr)
> ddply(sim, .(trt,IND), summarise, x1mean=mean(x1), x2sum=sum(x2), x3min=min(x3), x4max=max(x4))
trt IND x1mean x2sum x3min x4max
1 A 1 -0.49356448 -1.5650528 -1.016615 2.0027822
2 A 2 0.05908053 5.1680463 -1.514854 0.8184445
3 A 3 0.22898716 1.8584443 -1.934188 1.6326763
4 B 1 0.01531230 1.1005720 -2.002830 2.6674931
5 B 2 0.17875088 0.2526760 -1.546043 1.2021935
6 B 3 0.13398967 -4.8739380 -1.565945 1.7887837
7 C 1 -0.16993037 -0.5445507 -1.954848 0.6222546
8 C 2 -0.04581149 -6.3230167 -1.491114 0.8714535
9 C 3 -0.41610973 0.9085831 -1.797661 2.1174894
>
Here you can replace summarise and its arguments with any function that returns a data.frame or something that can be coerced to one. If you want a list as output, dlply is your friend (ldply goes the other way, from a list to a data frame).
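If plyr is not available, a similar per-group summary can be sketched in base R with aggregate(). This is a different technique from the ddply call above, shown here on a simplified stand-in for sim (two covariates and randomly assigned treatments are assumptions for brevity).

```r
set.seed(12)
sim <- data.frame(
  x1  = rnorm(150),
  x2  = rnorm(150),
  trt = sample(c("A", "B", "C"), 150, replace = TRUE),
  IND = rep(1:3, each = 50)
)

# Mean of x1 and x2 for every treatment / replicate combination
means <- aggregate(cbind(x1, x2) ~ trt + IND, data = sim, FUN = mean)
means
```

Unlike summarise, a single aggregate() call applies the same function to every listed column, so mixing mean, sum, min and max would need separate calls or a custom function.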

Related

Reverting a linear filter for time series in R

I'm using the stats::filter function in R in order to understand ARIMA simulation (as in the function stats::arima.sim) and estimation. I know that stats::filter applies a linear filter to a vector or time series, but I'm not sure how to "unfilter" my series.
Consider the following example: I want to apply a recursive filter with coefficient 0.7 to my series x = 1:5 (which essentially generates an AR(1) with phi=0.7). I can do so by:
x <- 1:5
ar <-0.7
filt <- filter(x, ar, method="recursive")
filt
Time Series:
Start = 1
End = 5
Frequency = 1
[1] 1.0000 2.7000 4.8900 7.4230 10.1961
Which returns me essentially c(y1,y2,y3,y4,y5) where:
y1 <- x[1]
y2 <- x[2] + ar*y1
y3 <- x[3] + ar*y2
y4 <- x[4] + ar*y3
y5 <- x[5] + ar*y4
Now imagine I have the y = c(y1,y2,y3,y4,y5) series. How can I use the filter function to return me the original series x = 1:5?
I can write a code to do it like:
unfilt <- rep(NA, 5)
unfilt[1] <- filt[1]
for(i in 2:5){
unfilt[i] <- filt[i] - ar*filt[i-1]
}
unfilt
[1] 1 2 3 4 5
But I do want to use the filter function to do so, instead of writing my own function. How can I do so? I tried stats::filter(filt, -ar, method="recursive"), which returns me [1] 1.0000 2.0000 3.4900 4.9800 6.7101 not what I desire.
stats::filter used with the recursive option is a particular case of an ARMA filter:
a[1]*y[n] + a[2]*y[n-1] + … + a[p+1]*y[n-p] = b[1]*x[n] + b[2]*x[n-1] + … + b[q+1]*x[n-q]
You could implement this filter with the signal package, which allows more options than stats::filter:
a = c(1,-ar)
b = 1
filt_Arma <- signal::filter(signal::Arma(b = b, a = a),x)
filt_Arma
# Time Series:
# Start = 1
# End = 5
# Frequency = 1
# [1] 1.0000 2.7000 4.8900 7.4230 10.1961
identical(filt,filt_Arma)
# [1] TRUE
Reverting an ARMA filter can be done by switching b and a, provided that the inverse filter stays stable (which is the case here):
signal::filter(signal::Arma(b = a, a = b),filt)
# Time Series:
# Start = 2
# End = 6
# Frequency = 1
# [1] 1 2 3 4 5
This corresponds to switching numerator and denominator in the z-transform:
Y(z) = b(z)/a(z) X(z)
X(z) = a(z)/b(z) Y(z)
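For this particular AR(1) case the inverse filter is a two-tap moving average, so it can also be sketched with stats::filter alone, using the convolution method. The one-sided convolution has no value before the first observation, so the first element comes back as NA and is filled in by hand (that patch-up step is an assumption about how the series starts, matching the loop in the question).

```r
x  <- 1:5
ar <- 0.7
filt <- stats::filter(x, ar, method = "recursive")

# Inverse of the recursion y[t] = x[t] + ar * y[t-1] is x[t] = y[t] - ar * y[t-1]
unfilt <- stats::filter(filt, c(1, -ar), method = "convolution", sides = 1)

# The first value has no predecessor, so it equals the first filtered value
unfilt[1] <- filt[1]
as.numeric(unfilt)
```

Up to floating-point error this recovers the original series 1, 2, 3, 4, 5.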

R: Tetrachoric correlation for multiple variables at one go?

You can see I'm a beginner at this when I'm not even able to reproduce my problem with a dummy dataset... Anyways, here goes: I want to calculate tetrachoric correlations between one grouping variable and multiple other variables. Like this:
library(psych)
set.seed(42)
n <- 16
dat <- data.frame(id=1:n,
group=c(rep("a", times=5), rep("b", times=3)),
x=sample(1:2, n, replace=TRUE),
y=sample(1:2, n, replace=TRUE),
z=sample(1:2, n, replace=TRUE))
dat
id group x y z
1 1 a 1 1 2
2 2 a 1 2 2
3 3 a 1 1 2
4 4 a 1 2 2
5 5 a 2 1 1
6 6 b 2 2 1
7 7 b 2 1 1
8 8 b 2 1 1
tetrachoric(as.matrix(dat[,c("group","y")]))
Now with this example (not with my actual dataset) I get an error which I'm unable to solve:
Error in apply(x, 2, function(x) min(x, na.rm = TRUE)) :
dim(X) must have a positive length
In addition: Warning messages:
1: In var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) :
NAs introduced by coercion
2: In tetrachoric(as.matrix(dat[, c("group", "y")])) :
Item = group had no variance and was deleted
My question is still what would be the best solution to get all the correlations with a single piece of code? Thank you for help!
The help file for tetrachoric says "The tetrachoric correlation is the inferred Pearson Correlation from a two x two table with the assumption of bivariate normality", so presumably you need to pass it a 2x2 table. You could write a little function that would hand the tetrachoric the appropriate table and collect the results:
myfun <- function(x,y, ...){
tabs <- lapply(seq_along(y), function(i)table(x,y[,i]))
l <- lapply(tabs, function(x)tetrachoric(x, ...))
rho <- sapply(l, function(x)x$rho)
tau <- sapply(l, function(x)x$tau)
colnames(tau) <- colnames(y)
names(rho) <- colnames(y)
ret <- list(rho = rho ,
tau = tau)
ret
}
myfun(dat$group, dat[,c("x", "y", "z")])
# $rho
# x y z
# 0.5397901 -0.2605839 0.6200705
#
# $tau
# x y z
# a 0.3186394 0.3186394 0.2690661
# 1 0.1573107 0.1573107 -0.6045853
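The table-building step inside myfun can be inspected on its own, without psych. The sketch below uses a made-up data frame with the same column names, and forces the levels 1 and 2 so that every table is 2x2 even if one value happens not to occur (that factor() call is an addition, not part of the answer above).

```r
set.seed(42)
dat <- data.frame(
  group = rep(c("a", "b"), times = c(10, 6)),
  x = sample(1:2, 16, replace = TRUE),
  y = sample(1:2, 16, replace = TRUE),
  z = sample(1:2, 16, replace = TRUE)
)

# One group-by-variable contingency table per column
tabs <- lapply(dat[c("x", "y", "z")], function(col) {
  table(dat$group, factor(col, levels = 1:2))
})
tabs$x
```

Each element of tabs is the kind of 2x2 table that myfun passes on to tetrachoric.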

Convolution of a piecewise function

Suppose I have two functions, f1 and f2, which are defined piecewise from a list of enumerated steps / jump points.
set.seed(1729)
n = 100
x1 = cumsum(runif(n))
x2 = cumsum(runif(n))
val1 = cumsum(runif(n))
val2 = cumsum(runif(n))
f1_list = data.frame(f = val1, x = x1)
f2_list = data.frame(f = val2, x = x2)
For simplicity, let's assume both are right-continuous. The first few values look like
> head(f1_list)
f x
1 0.1371357 0.5852396
2 0.4752026 1.0226336
3 1.0987574 1.5955279
4 1.9413884 1.9487419
5 2.0264764 2.8100133
6 2.3962088 3.2208168
> head(f2_list)
f x
1 0.3294329 0.5373382
2 0.8749826 1.3104701
3 1.5604155 2.0395473
4 1.9325968 2.9311143
5 2.3134223 3.2732812
6 2.4605212 3.6648067
I want to compute the convolution, g(t) = (f1*f2)(t). Programmatically, this does not seem like an easy task, because not only do we need to keep track of all the jump-points, but also in reverse, by the definition of convolution. What I've tried is
#############
#Say, t = 10#
#############
t = 10
f2_list$x_rev = t - f2_list$x
At which point, I'm stuck, since I don't know how to match up the corresponding intervals of x values for f1_list and f2_list.
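One possible way to match up the intervals is to merge the jump points of f1(s) with the reflected jump points of f2(t - s); the product is constant between consecutive merged points, so the integral is a finite sum. The sketch below is only that, under a loud assumption: each function takes the value f[k] on [x[k], x[k+1]) and is zero outside [x[1], x[n]), so the last row only marks the right end of the support. The helper names step_eval and conv_at are hypothetical.

```r
# Evaluate a right-continuous step function at points s.
# ASSUMPTION: value f[k] on [x[k], x[k+1]) and 0 outside [x[1], x[n]).
step_eval <- function(s, x, f) {
  k <- findInterval(s, x)            # number of jump points <= s
  out <- numeric(length(s))
  inside <- k >= 1 & k < length(x)
  out[inside] <- f[k[inside]]
  out
}

# g(t) = integral of f1(s) * f2(t - s) ds, computed exactly for step functions
conv_at <- function(t, f1_list, f2_list) {
  # Merge the jump points of f1(s) and of f2(t - s), both on the s axis
  brk <- sort(unique(c(f1_list$x, t - f2_list$x)))
  # The integrand is constant on each interval, so sample it at the midpoint
  mid <- (head(brk, -1) + tail(brk, -1)) / 2
  sum(step_eval(mid, f1_list$x, f1_list$f) *
      step_eval(t - mid, f2_list$x, f2_list$f) *
      diff(brk))
}

# Check on two unit boxes on [0, 1): their convolution is a triangle
box <- data.frame(f = c(1, 0), x = c(0, 1))
sapply(c(0.5, 1, 1.5), conv_at, f1_list = box, f2_list = box)
```

With the data frames from the question the call would be conv_at(10, f1_list, f2_list); a different convention for the values outside the listed jump points would need a different step_eval.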

Extract function calls from the right hand side of a formula

Several functions in R treat certain functions of variables on the right hand side of a formula specially, for example s in mgcv or strata in survival. In my case, I want particular functions of variables to be taken out of the model matrix and treated specially. I can't see how to do this other than using grep on the column names (see below), which also doesn't work if f(.) has not been used in the formula. Does anyone have a more elegant solution? I have looked in survival and mgcv, but I find the code very hard to follow, and it is overkill for my needs. Thanks.
f <- function(x) {
# do stuff
return(x)
}
data <- data.frame(y = rnorm(10),
x1 = rnorm(10),
x2 = rnorm(10),
s = rnorm(10))
formula <- y ~ x1 + x2 + f(s)
mf <- model.frame(formula, data)
x <- model.matrix(formula, mf)
desired_x <- x[ , -grep("f\\(", colnames(x))]
desired_f <- x[ , grep("f\\(", colnames(x))]
output:
> head(desired_x)
(Intercept) x1 x2
1 1 0.29864902 0.1474018
2 1 -0.03192798 -0.4424467
3 1 -0.83716557 1.0268295
4 1 -0.74094149 1.1094299
5 1 1.38706580 -0.2339486
6 1 -0.52925896 1.2866540
> desired_f
1 2 3 4 5 6
0.46751965 0.65939178 -1.35835634 -0.05322648 -0.09286254 1.05423067
7 8 9 10
-1.71971996 0.71743985 -0.65993305 -0.79821349
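One route that avoids grep is the specials argument of terms(), which records where calls to a named function appear in the formula. The following is a sketch rather than a definitive solution: it assumes at most one f() term, and it reuses the variable names from the question with a seed added for reproducibility.

```r
f <- function(x) x   # placeholder for the special function

set.seed(1)
data <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), s = rnorm(10))
formula <- y ~ x1 + x2 + f(s)

tt <- terms(formula, specials = "f")
# Position of f(.) among the model-frame variables (response included);
# NULL when the formula contains no f(.) term
sp <- attr(tt, "specials")$f
mf <- model.frame(tt, data)

if (is.null(sp)) {
  desired_f <- NULL
  desired_x <- model.matrix(tt, mf)
} else {
  desired_f <- mf[[sp]]
  # Terms that involve the special variable, then a model matrix without them
  term_idx  <- which(attr(tt, "factors")[sp, ] > 0)
  desired_x <- model.matrix(drop.terms(tt, term_idx, keep.response = TRUE), mf)
}
colnames(desired_x)
```

Because the is.null(sp) branch is handled explicitly, the same code also runs when f(.) is absent from the formula.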

Store or print results for 't.test' in for loop

I am new to R and am having a problem printing the results of a 'for' loop. Here is my code:
afile <- read.table(file = 'data.txt', head =T)##Has three columns Lab, Store and Batch
lab1 <- afile$Lab[afile$Batch == 1]
lab2 <- afile$Lab[afile$Batch == 2]
lab3 <- afile$Lab[afile$Batch == 3]
lab_list <- list(lab1,lab2,lab3)
for (i in 1:2){
x=lab_list[[i]]
y=lab_list[[i+1]]
t.test(x,y,alternative='two.sided',conf.level=0.95)
}
This code runs without any error but produces no output on screen. I tried storing the results in variables using 'assign', but that produces warnings:
for (i in 1:2){x=lab_list[[i]];y=lab_list[[i+1]];assign(paste(res,i,sep=''),t.test(x,y,alternative='two.sided',conf.level=0.95))}
Warning messages:
1: In assign(paste(res, i, sep = ""), t.test(x, y, alternative = "two.sided", :
only the first element is used as variable name
2: In assign(paste(res, i, sep = ""), t.test(x, y, alternative = "two.sided", :
only the first element is used as variable name
Please help me understand how I can perform t.test in a loop and get the results, i.e. print them on screen or save them in a variable.
AK
I would rewrite your code like this :
I assume your data is like this
afile <- data.frame(Batch= sample(1:3,10,rep=TRUE),lab=rnorm(10))
afile
Batch lab
1 2 0.4075675
2 1 0.3006192
3 1 -0.4824655
4 3 1.0656481
5 1 0.1741648
6 2 -1.4911526
7 2 0.2216970
8 1 -0.3862147
9 1 -0.4578520
10 1 -0.6298040
Then using lapply you can store your result in a list :
lapply(1:2,function(i){
x <- subset(afile,Batch==i)
y <- subset(afile,Batch==i+1)
t.test(x,y,alternative='two.sided',conf.level=0.95)
})
[[1]]
Welch Two Sample t-test
data: x and y
t = -0.7829, df = 6.257, p-value = 0.4623
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.964637 1.005008
sample estimates:
mean of x mean of y
0.3765373 0.8563520
[[2]]
Welch Two Sample t-test
data: x and y
t = -1.0439, df = 1.797, p-value = 0.4165
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.588720 4.235776
sample estimates:
mean of x mean of y
0.856352 2.032824
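One caveat about the call above: subset() returns whole data frames, so t.test receives the Batch column together with lab, which is why the reported means sit closer to the batch numbers than to the lab values. A hedged variant that tests only the lab column is sketched below, on made-up balanced data (five rows per batch is an assumption so that every test has enough observations).

```r
set.seed(1)
afile <- data.frame(Batch = rep(1:3, each = 5), lab = rnorm(15))

res <- lapply(1:2, function(i) {
  # Extract the numeric column only, not the whole data frame
  x <- afile$lab[afile$Batch == i]
  y <- afile$lab[afile$Batch == i + 1]
  t.test(x, y, alternative = "two.sided", conf.level = 0.95)
})

# Pull one number out of each stored test result
sapply(res, function(r) r$p.value)
```

Each element of res is a full htest object, so estimates, confidence intervals and p-values stay available after the loop.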
In a loop, you need to explicitly print your results in many cases. Try:
print(t.test(x,y,alternative='two.sided',conf.level=0.95))
or
print(summary(t.test(x,y,alternative='two.sided',conf.level=0.95)))
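The reason is that R auto-prints only the value of a top-level expression, and a for loop returns NULL invisibly. A small sketch with simulated vectors (x and y here are made up) shows the difference by capturing what each loop would write to the console.

```r
set.seed(1)
x <- rnorm(10)
y <- rnorm(10, mean = 1)

# Without print(): the loop body's value is discarded, nothing is shown
silent <- capture.output(for (i in 1:2) t.test(x, y))

# With print(): each test result is written out
shown <- capture.output(for (i in 1:2) print(t.test(x, y)))

length(silent)
length(shown) > 0
```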
In addition to 'Hansons' solution of printing, results can be saved and printed like:
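# One slot per pair of consecutive batches in lab_list
result <- vector("list", length(lab_list) - 1)
for (i in seq_len(length(lab_list) - 1)) {
result[[i]] <- t.test(lab_list[[i]], lab_list[[i+1]], alternative='two.sided', conf.level=0.95)
}
result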
AK
