To test a specific hypothesis, I am trying to set custom contrasts on a factor in R.
set.seed(24)
data <- data.frame(var = sample(1:100, 70, replace = TRUE),
                   version = rep(c("v3", "v4", "v1", "v3", "v4", "v2", "v2"), times = 10))
c1 <- c(1/3, -1, 1/3, 1/3)
c2 <- c(0, -1, 1, 0)
c3 <- c(0, -1, 0, 1)
c4 <- c(1, -1, 0, 0)
mat <- cbind(c1, c2, c3, c4)
contrasts(data$version) <- mat
model <- aov(var ~ version, data = data)
summary.aov(model, split=list(version=list("comparison1"=1,"comparison2"= 2,
"comparison3"=3,"comparison4"= 4)))
Why is there no result for comparison 4? How can I fix that? Thanks.
You need to specify the how.many argument and it should work. According to ?contrasts:
how.many: how many contrasts should be made. Defaults to one less than the number of levels of x. This need not be the same as the number of columns of value.
So dropping the fourth contrast is the default behaviour we observe when the assignment is done without any how.many argument; specifying how.many = 4 keeps all four columns:
contrasts(data$var, how.many = 4) <- mat
contrasts(data$var)
# c1 c2 c3 c4
#var1 0.3333333 0 0 1
#var2 -1.0000000 -1 -1 -1
#var3 0.3333333 1 0 0
#var4 0.3333333 0 1 0
Data
set.seed(24)
data <- data.frame(var = sample(paste0("var", 1:4), 20, replace = TRUE))
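For comparison, here is a minimal sketch (not part of the original answer) of the default behaviour the quoted documentation describes: assigning the same matrix without how.many keeps only nlevels - 1 = 3 columns, which is exactly why the fourth comparison disappears.
contrasts(data$var) <- mat   # no how.many: defaults to nlevels(data$var) - 1 = 3 contrasts
contrasts(data$var)          # only the first three columns (c1, c2, c3) are kept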
Suppose I have the following data frame:
> example
col1 col2 col3
1 -1 1 -1
2 0 -1 3
3 1 10 -1
and I want to plot a barplot; using row 3 as an example, I do barplot(example[3,]). This works perfectly. However, I want to flip some values and add more color -- specifically, I want:
if a value is negative (i.e., the -1s in row 3), to flip it to +1 and color that bar red in the barplot (but note that there are already +1 values in the row, and we don't want to color those red);
if a value is >= +10, to color that bar green in the barplot.
How can I do the above?
> dput(example)
structure(list(col1 = c(-1, 0, 1), col2 = c(1, -1, 10), col3 = c(-1,
3, -1)), row.names = c(NA, 3L), class = "data.frame")
Here is a complete solution using only base R:
df <- structure(list(col1 = c(-1, 0, 1),
col2 = c(1, -1, 10),
col3 = c(-1, 3, -1)),
row.names = c(NA, 3L),
class = "data.frame")
# apply conditions on the matrix to pick the color of each bar:
# each condition gets its own code (1 = default, 2 = negative, 3 = >= 10);
# multiplying by 2 keeps the codes distinct
color <- (df >= 10) * 2 + (df < 0) + 1
# to swap -1 to +1, do this:
df[df < 0] <- df[df < 0] + 2
# map the codes to the desired colors
color <- matrix(c("black", "red", "green")[color],
                nrow = nrow(color), ncol = ncol(color), byrow = FALSE)
# plot the vector we want
barplot(df[,3], col = color[,3])
EDIT 1
To plot row 3, you can use this trick with the transposition function t():
barplot(t(df)[, 3], col = t(color)[, 3])
So, let's say I have a 1000-row, 6-column data frame whose columns are a1, a2, b1, b2, c1, c2. I want to run t-tests on the a's, b's, and c's and get an output data frame with three columns for the t-values of a, b, and c and another three for the corresponding significance information, six columns in total. The problem I have is with the rows: I want to loop over chunks of 20 rows, so the output should be a (1000/20 =) 50-row, 6-column data frame.
I have already tried creating an index column for my initial data frame that repeats a 1 for the first 20 rows, a 2 for the next 20 rows, and so on.
convert_n <- function(df) {
df <- df %T>% {.$n_for_t_tests = rep(c(1:(nrow(df)/20)), each = 20)}
}
df <- convert_n(df)
However, I can't seem to find a way to properly utilize the items in this column as indices for a "for" or any kind of loop.
Below is the relevant code that creates a 1-row, 6-column data frame; I need to modify the [0:20] parts, create a loop that does this for each group of 20 rows, and bind the results.
t_test_a <- t.test(df$a1[0:20], df$a2[0:20], paired = TRUE, conf.level = 0.95)
t_test_b <- t.test(df$b1[0:20], df$b2[0:20], paired = TRUE, conf.level = 0.95)
t_test_c <- t.test(df$c1[0:20], df$c2[0:20], paired = TRUE, conf.level = 0.95)
t_tests_df <- data.frame(t_a = t_test_a$statistic[["t"]],
                         t_b = t_test_b$statistic[["t"]],
                         t_c = t_test_c$statistic[["t"]])
t_tests_df <- t_tests_df %T>%
  {.$dif_significance_a = ifelse(.$t_a > 2, "YES", "NO")} %T>%
  {.$dif_significance_b = ifelse(.$t_b > 2, "YES", "NO")} %T>%
  {.$dif_significance_c = ifelse(.$t_c > 2, "YES", "NO")} %>%
  dplyr::select(t_a, dif_significance_a,
                t_b, dif_significance_b,
                t_c, dif_significance_c)
Thank you in advance for your help.
You can use split() and sapply():
set.seed(42)
df <- data.frame(a1 = sample(1000, 1000), a2 = sample(1000, 1000),
b1 = sample(1000, 1000), b2 = sample(1000, 1000),
c1 = sample(1000, 1000), c2 = sample(1000, 1000))
group <- gl(50, 20)
D <- split(df, group)
myt <- function(Di)
with(Di, c(at=t.test(a1, a2)$statistic, ap=t.test(a1, a2)$p.value,
bt=t.test(b1, b2)$statistic, bp=t.test(b1, b2)$p.value,
ct=t.test(c1, c2)$statistic, cp=t.test(c1, c2)$p.value))
sapply(D, FUN = myt)      # 6 x 50: one column per 20-row chunk, or ...
t(sapply(D, FUN = myt))   # ... transposed: 50 x 6, one row per chunk
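Not in the original answer: if you also want the 50 x 6 layout with YES/NO significance flags described in the question, one sketch (using the p-values returned by myt, with a hypothetical 0.05 cut-off instead of the t > 2 rule) is:
res <- t(sapply(D, FUN = myt))   # 50 x 6 matrix with columns at.t, ap, bt.t, bp, ct.t, cp
out <- data.frame(
  t_a = res[, 1], dif_significance_a = ifelse(res[, 2] < 0.05, "YES", "NO"),
  t_b = res[, 3], dif_significance_b = ifelse(res[, 4] < 0.05, "YES", "NO"),
  t_c = res[, 5], dif_significance_c = ifelse(res[, 6] < 0.05, "YES", "NO")
)
head(out)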
This is not the prettiest, but I did it with a for loop like this:
df <- data.frame(a1 = sample(1000, 1000),
a2 = sample(1000, 1000),
b1 = sample(1000, 1000),
b2 = sample(1000, 1000),
c1 = sample(1000, 1000),
c2 = sample(1000, 1000))
df_ttest <- data.frame(p_a = c(1:50),
t_a = c(1:50),
p_b = c(1:50),
t_b = c(1:50),
p_c = c(1:50),
t_c = c(1:50))
index <- 0:50 * 20                       # chunk boundaries: 0, 20, ..., 1000
for (i in seq_len(length(index) - 1)) {  # 50 chunks; looping over seq_along(index) would overrun
  rows <- (index[i] + 1):index[i + 1]    # rows of the current 20-row chunk
  df_ttest$p_a[i] <- t.test(df$a1[rows])$p.value
  df_ttest$p_b[i] <- t.test(df$b1[rows])$p.value
  df_ttest$p_c[i] <- t.test(df$c1[rows])$p.value
  df_ttest$t_a[i] <- t.test(df$a1[rows])$statistic
  df_ttest$t_b[i] <- t.test(df$b1[rows])$statistic
  df_ttest$t_c[i] <- t.test(df$c1[rows])$statistic
}
This gives a 50 x 6 data frame with separate columns of p and t values for every 20-row chunk of a, b, and c.
You could even go further and use a nested for loop to cycle through each row of df_ttest to make this a bit prettier.
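Note (not part of the original answer): the calls above are one-sample t-tests on the first column of each pair; for the paired a1-vs-a2 comparison described in the question, each call inside the same loop would compare the two columns, for example:
rows <- (index[i] + 1):index[i + 1]   # the current 20-row chunk, inside the same loop
df_ttest$p_a[i] <- t.test(df$a1[rows], df$a2[rows], paired = TRUE)$p.value
df_ttest$t_a[i] <- t.test(df$a1[rows], df$a2[rows], paired = TRUE)$statistic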
I am trying to fit many nonlinear models with wrapnls in parallel using dplyr and broom (and eventually mclapply), but I am getting a parsing/evaluation error from nlxb:
Error in parse(text = joe) (from #11) : <text>:1:6: unexpected input
1: b1.10% <- 20
I get this error using both do and lapply approaches.
library(nlmrt)
library(dplyr)
library(purrr)
library(broom)
data_frame(x = seq(0, 200, 0.1),
           y = 1.2*exp(-(x - 10)^2/(2*4.2^2)) + 2.4*exp(-(x - 50)^2/(2*3.8^2)) +
             5.3*exp(-(x - 80)^2/(2*5.1^2)) + rnorm(length(x), sd = 0.05)) %>%
do({
xl <- quantile(.$x, 0.1, na.rm = TRUE)
xm <- quantile(.$x, 0.5, na.rm = TRUE)
xh <- quantile(.$x, 0.8, na.rm = TRUE)
starts <- c(a1 = 5, a2 = 5, a3 = 5,
b1 = xl, b2 = xm, b3 = xh,
c1 = 5, c2 = 5, c3 = 5)
fmla <- y ~ a1*exp(-(x - b1)^2/(2*c1^2)) + a2*exp(-(x - b2)^2/(2*c2^2)) + a3*exp(-(x - b3)^2/(2*c3^2))
df <- data_frame(x = .$x, y = .$y)
mod <- wrapnls(fmla, lower = 0, upper = 200, start = starts, data = df)
tidy(mod)
})
Is there any way around this?
The problem isn't with the do part; it's the code inside the do, so you can debug that part directly. The starts vector is getting the b# names concatenated with the quantile labels, and since nlxb builds and parses R code from the parameter names, a name like b1.10% is not syntactically valid, which is what the parse() error is complaining about:
names(starts)
## [1] "a1" "a2" "a3" "b1.10%" "b2.50%" "b3.80%" "c1" "c2" "c3"
Adding unname() to the quantile calls fixes the issue:
data_frame(x = seq(0, 200, 0.1),
           y = 1.2*exp(-(x - 10)^2/(2*4.2^2)) + 2.4*exp(-(x - 50)^2/(2*3.8^2)) +
             5.3*exp(-(x - 80)^2/(2*5.1^2)) + rnorm(length(x), sd = 0.05)) %>%
do({
xl <- quantile(.$x, 0.1, na.rm = TRUE) %>% unname()
xm <- quantile(.$x, 0.5, na.rm = TRUE) %>% unname()
xh <- quantile(.$x, 0.8, na.rm = TRUE) %>% unname()
starts <- c(a1 = 5, a2 = 5, a3 = 5,
b1 = xl, b2 = xm, b3 = xh,
c1 = 5, c2 = 5, c3 = 5)
fmla <- y ~ a1*exp(-(x - b1)^2/(2*c1^2)) + a2*exp(-(x - b2)^2/(2*c2^2)) + a3*exp(-(x - b3)^2/(2*c3^2))
df <- data_frame(x = .$x, y = .$y)
mod <- wrapnls(fmla, lower = 0, upper = 200, start = starts, data = df)
tidy(mod)
})
## term estimate std.error statistic p.value
## 1 a1 2.386492 0.007455097 320.1155 0
## 2 a2 5.296250 0.006437509 822.7174 0
## 3 a3 1.199384 0.007132559 168.1562 0
## 4 b1 49.997697 0.013702894 3648.6960 0
## 5 b2 80.004023 0.007150546 11188.5193 0
## 6 b3 10.077847 0.028644821 351.8209 0
## 7 c1 3.798829 0.013702940 277.2273 0
## 8 c2 5.094727 0.007150573 712.4921 0
## 9 c3 4.175235 0.028944448 144.2499 0
The data can be found here
library(nlme)
library(dfoptim)
dat0 <- read.table("aids.dat2",head=T)
dat1 <- dat0[dat0$day<=90, ] # use only first 90-day data
dat2 <- dat1[!apply(is.na(dat1),1,any),] # remove missing data
aids.dat <- groupedData(lgcopy ~ day | patid, data=dat2)
aids.dat$log10copy = log10(aids.dat$lgcopy)
myfun2 <- function(arg){
s.p1 <- arg[1]
s.b1 <- arg[2]
s.p2 <- arg[3]
s.b2 <- arg[4]
model = nlme(log10copy ~ exp(p1-b1*day) + exp(p2-b2*day),
fixed = list(p1 ~ 1, b1 ~ 1, p2 ~ 1, b2 ~ 1),
random = list(patid = pdDiag(list(p1 ~ 1, b1 ~ 1, p2 ~ 1, b2 ~ 1))),
start = list(fixed = c(p1 = s.p1, b1 = s.b1, p2 = s.p2, b2 = s.b2)),
data =aids.dat)
return(model$logLik)
}
nmkb(fn = myfun2, par = c(10,0.5,6,0.005), lower = c(5, 0.1, 5, 0.001), upper = c(15, 1, 10, 0.1))
Running the above code, I run into several errors:
Error in nlme.formula(log10copy ~ exp(p1 - b1 * day) + exp(p2 - b2 * day), :
step halving factor reduced below minimum in PNLS step
In addition: Warning message:
In nlme.formula(log10copy ~ exp(p1 - b1 * day) + exp(p2 - b2 * day), :
Singular precision matrix in level -1, block 1
The model fits fine with the starting values par = c(10, 0.5, 6, 0.005). However, I think that as the algorithm starts trying other starting values within lower = c(5, 0.1, 5, 0.001) and upper = c(15, 1, 10, 0.1), the nlme call runs into the above problems because it is so sensitive to starting values. As a result, the nmkb call never gets anywhere.
Is there a way to circumvent this?
The model log-likelihood needs to be maximized, but most optimization routines in R (including nmkb) perform minimization. So the function to be optimized has to return the negative log-likelihood; it should look like this:
myfun2 <- function(arg){
s.p1 <- arg[1]
s.b1 <- arg[2]
s.p2 <- arg[3]
s.b2 <- arg[4]
model = nlme(log10copy ~ exp(p1-b1*day) + exp(p2-b2*day),
fixed = list(p1 ~ 1, b1 ~ 1, p2 ~ 1, b2 ~ 1),
random = list(patid = pdDiag(list(p1 ~ 1, b1 ~ 1, p2 ~ 1, b2 ~ 1))),
start = list(fixed = c(p1 = s.p1, b1 = s.b1, p2 = s.p2, b2 = s.b2)),
data =aids.dat)
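  # nmkb minimizes, so return the negative log-likelihood in order to maximize logLik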
return(-model$logLik)
}
And although there are still many warnings, there is no more error on my machine, and the algorithm converges successfully.
$par
[1] 13.460199068 0.848526199 7.764024099 0.001513636
$value
[1] -358.6631
$feval
[1] 197
$restarts
[1] 0
$convergence
[1] 0
$message
[1] "Successful convergence"
Warning messages:
1: In nlme.formula(log10copy ~ exp(p1 - b1 * day) + exp(p2 - b2 * day), :
Singular precision matrix in level -1, block 1
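Not part of the original answer: a common additional safeguard, when the optimizer tries starting values for which nlme fails outright, is to catch the error and return a large penalty so the search can continue. A minimal sketch:
myfun2_safe <- function(arg) {
  # if nlme errors for these starting values, return a large penalty instead of stopping
  tryCatch(myfun2(arg), error = function(e) 1e10)
}
nmkb(fn = myfun2_safe, par = c(10, 0.5, 6, 0.005),
     lower = c(5, 0.1, 5, 0.001), upper = c(15, 1, 10, 0.1))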
I have 2 given matrices
a1 <- matrix(c(0.4092951, 0.1611806, 0.4283178, 0.001206529), nrow = 1)
a2 <- matrix(c(0.394223557, 0.140443266, 0.463980790, 0.001352387), nrow = 1)
I have an initial matrix
b <- matrix(c(0.4095868, 0.1612955, 0.4286231, 0.0004946572,
0, 0.2732351, 0.7260891, 0.0006757670,
0, 0, 0.9909494, 0.0090505527,
0, 0, 0, 1), nrow = 4, byrow = T)
I need to update 'b' such that
a1 %*% b = a2
This is an optimization problem: the objective is to minimize sum(abs(a1 %*% b - a2)), i.e. to drive it to zero, subject to the constraints:
lower triangle of b = 0;
rowSums(b) = 1
## creating a data vector with a1 and a2
data = c(as.numeric(a1), as.numeric(a2))
## objective function
min_obj <- function(p){
## Creating a matrix to recreate 'b'
p1 <- matrix(rep(0, 16), nrow = 4)
k = 1
for(i in 1:nrow(p1)){
for (j in 1:ncol(p1)){
if(j >= i){
p1[i,j] <- p[k]
k = k+1
}
}
}
actual <- matrix(data[1:(length(data)/2)], nrow = 1)
pred <- matrix(data[(length(data)/ 2 + 1):length(data)], nrow = 1)
s <- (actual %*% p1) - pred
sum(abs(s))
}
## Initializing the initial values for b taking only non-zero values
init <- b[b>0]
opt <- optim(init, min_obj, control = list(trace = TRUE), method = "L-BFGS-B",
             lower = rep(0, length(init)), upper = rep(1, length(init)))
transformed_b <- matrix(rep(0, 16), nrow = 4)
k = 1
for(i in 1:nrow(transformed_b)){
for (j in 1:ncol(transformed_b)){
if(j >= i){
transformed_b[i,j] <- opt$par[k]
k = k+1
}
}
}
transformed_b
The issue with transformed_b is that the row sums of the matrix are not 1. Any help is highly appreciated.
"optim" is the right choice. Since the row sums have to be 1, there are only 6 parameters, not 10 as in your attempt. The diagonal is uniquely determined by the values strictly above the diagonal.
a1 <- matrix(c(0.4092951, 0.1611806, 0.4283178, 0.001206529), nrow = 1)
a2 <- matrix(c(0.394223557, 0.140443266, 0.463980790, 0.001352387), nrow = 1)
b <- matrix(c(0.4095868, 0.1612955, 0.4286231, 0.0004946572,
              0, 0.2732351, 0.7260891, 0.0006757670,
              0, 0, 0.9909494, 0.0090505527,
              0, 0, 0, 1), nrow = 4, byrow = TRUE)
#======================================================================
# Build an upper triangular matrix with rowsums 1:
B <- function(x)
{
X <- matrix(c(0,x[1:3],0,0,x[4:5],0,0,0,x[6],rep(0,4)),4,4,byrow=TRUE)
diag(X) <- 1-rowSums(X)
return(X)
}
#----------------------------------------------------------------------
# The function we want to minimize:
f <- function(x)
{
return (sum((a1%*%B(x) - a2)^2))
}
#----------------------------------------------------------------------
#Optimization:
opt <- optim( par = c(b[1,2:4],b[2,3:4],b[3,4]),
fn = f,
lower = rep(0,6),
method = "L-BFGS-B" )
optB <- B(opt$par)
Result:
> optB
[,1] [,2] [,3] [,4]
[1,] 0.9631998 0.03680017 0.0000000 0.0000000000
[2,] 0.0000000 0.77820700 0.2217930 0.0000000000
[3,] 0.0000000 0.00000000 0.9998392 0.0001608464
[4,] 0.0000000 0.00000000 0.0000000 1.0000000000
> a1 %*% optB - a2
[,1] [,2] [,3] [,4]
[1,] 9.411998e-06 5.07363e-05 1.684534e-05 -7.696464e-05
> rowSums(optB)
[1] 1 1 1 1
I chose the sum of squares instead of the sum of absolute values since it is differentiable; this makes it easier for "optim" (here using the gradient-based method "L-BFGS-B") to find the minimum.
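A quick sanity check on the parametrization (not in the original answer): because the rows of b already sum to 1, feeding B() the strictly-upper-triangular entries of b reproduces b up to rounding.
x0 <- c(b[1, 2:4], b[2, 3:4], b[3, 4])   # the same starting vector passed to optim() above
round(B(x0) - b, 6)                      # approximately the zero matrix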