I want to create a loop using variable names instead of numbers but I'm struggling with it.
I have over 1000 variables in my data but the structure looks like this:
#Reproducible data
id <- rep(c("1","2","3","4","5","6"),3)
sequence <- rep(c("1","2","1","2","1","1"),3)
treatment <- c(rep(c("A"), 6), rep(c("B"), 6),rep(c("C"), 6))
var1 <- c(rnorm(3, 1, 0.4), rnorm(3, 3, 0.5), rnorm(3, 6, 0.8), rnorm(3, 1.1, 0.4), rnorm(3, 0.8, 0.2), rnorm(3, 1, 0.6))
var1_base <- c(rnorm(3, 1, 0.4), rnorm(3, 3, 0.5), rnorm(3, 6, 0.8), rnorm(3, 1.1, 0.4), rnorm(3, 0.8, 0.2), rnorm(3, 1, 0.6))
var2 <- c(rnorm(3, 1, 0.4), rnorm(3, 3, 0.5), rnorm(3, 6, 0.8), rnorm(3, 1.1, 0.4), rnorm(3, 0.8, 0.2), rnorm(3, 1, 0.6))
var2_base <- c(rnorm(3, 1, 0.4), rnorm(3, 3, 0.5), rnorm(3, 6, 0.8), rnorm(3, 1.1, 0.4), rnorm(3, 0.8, 0.2), rnorm(3, 1, 0.6))
var3 <- c(rnorm(3, 1, 0.4), rnorm(3, 3, 0.5), rnorm(3, 6, 0.8), rnorm(3, 1.1, 0.4), rnorm(3, 0.8, 0.2), rnorm(3, 1, 0.6))
var3_base <- c(rnorm(3, 1, 0.4), rnorm(3, 3, 0.5), rnorm(3, 6, 0.8), rnorm(3, 1.1, 0.4), rnorm(3, 0.8, 0.2), rnorm(3, 1, 0.6))
library(dplyr)  # provides %>% and mutate()
DF <- data.frame(id,sequence,treatment, var1, var2, var3, var1_base, var2_base, var3_base) %>%
mutate(id = factor(id),
sequence = factor(sequence),
treatment = factor(treatment, levels = c("A","B","C")))
> head(DF)
  id sequence treatment      var1      var2      var3 var1_base var2_base var3_base
1  1        1         A 0.5488589 1.3045888 0.2367363 1.2646227 1.2241417 0.1968524
2  2        2         A 1.0201801 1.3480361 0.9944096 0.3625067 0.8987885 1.5868442
3  3        1         A 0.7269204 0.7091029 1.2025266 0.1238612 1.8828400 0.8687552
4  4        2         A 3.3240269 3.3133104 3.2251780 2.4116230 2.6284785 2.6027341
5  5        1         A 3.3051822 2.4542786 2.1687379 3.5250026 3.2231797 2.9990167
6  6        1         A 2.7436715 2.7419527 3.8349072 2.9971485 3.0528477 2.6970430
I want to fit a linear mixed model with var as the outcome; treatment, var_base (baseline), and sequence as fixed effects; and id as a random effect.
To code it one by one, it would look like this:
lm1 <- lmer(var1 ~ var1_base + treatment + sequence + (1|id), data = DF)
But since I have over 1000 vars, it wouldn't make sense to do it individually. I tried writing a for loop, but it did not turn out the way I expected.
#Approaches 1--it worked but I want the result to show "var" instead of "[[1]]"
lm_output <- list()
for(i in 4:6){
lm1 <-lmer(DF[[i+3]] ~ DF[[i]] + Treatment+ sequence + (1|id), data = DF)
summary(lm1)
lm_output[[i]] <- summary(lm1)
}
>print(lm_output[1:6])
[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
[[4]]
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.8995 0.6129 13.0000 1.468 0.16598
DF[[i]] 0.6772 0.1860 13.0000 3.641 0.00299 **
TreatmentB 0.1621 0.6885 13.0000 0.235 0.81751
TreatmentC -0.3112 0.7049 13.0000 -0.441 0.66611
sequence2 -0.1001 0.5715 13.0000 -0.175 0.86367
[[5]]
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.137752 0.365302 11.104560 0.377 0.713
DF[[i]] 0.729762 0.071874 9.810327 10.153 1.61e-06 ***
TreatmentB 0.531048 0.332585 9.144490 1.597 0.144
TreatmentC 0.060414 0.343280 9.185060 0.176 0.864
sequence2 -0.001702 0.440920 4.000881 -0.004 0.997
[[6]]
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.765739 0.446747 13.000000 1.714 0.110
DF[[i]] 0.783985 0.132198 13.000000 5.930 4.98e-05 ***
TreatmentB 0.006516 0.554550 13.000000 0.012 0.991
TreatmentC -0.312968 0.515562 13.000000 -0.607 0.554
sequence2 -0.762799 0.436095 13.000000 -1.749 0.104
Is there a way to transform [[4]] --> var1, [[5]] --> var2..., so it's more intuitive and easier to manage the data?
#Approaches 2--Tried storing vars name as a vector first and ran. Did not work
responseList <- names(DF)[c(4:6)]
lm_output2 <- list()
for(i in n){
lm2<-lmer(get(n+3) ~ get(n) + Treatment+ sequence + (1|id), data = DF)
summary(lm2)
lm_output2[[i]] <- summary(lm2)
}
> Error in n + 3 : non-numeric argument to binary operator
I understand this error: n is not numeric here, so get(n+3) fails. But I don't know how to specify var and var_base in the same loop.
Any suggestion is appreciated, thank you!
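You can build the formula for lmer as a string: loop over the variable numbers (1, 2, 3, etc.), paste the outcome and baseline names into a formula, and store each summary under the outcome's name, so the list shows var1, var2, ... instead of [[1]], [[2]], ...:
library(lme4)  # load lmerTest instead if you want the df and p-value columns in the summary
lm_output <- list()
for (i in 1:3) {
  outcome_var <- paste0("var", i)
  base_var <- paste0(outcome_var, "_base")
  # e.g. var1 ~ var1_base + treatment + sequence + (1 | id)
  form <- as.formula(paste0(outcome_var,
                            " ~ ",
                            base_var,
                            " + treatment + sequence + (1 | id)"))
  lm1 <- lmer(form, data = DF)
  # index the list by name rather than by position
  lm_output[[outcome_var]] <- summary(lm1)
}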
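With over 1000 variables, hard-coding 1:3 may not be practical, so the outcome names can instead be derived from the column names. The following is only a minimal sketch under stated assumptions: every outcome is assumed to have a partner column named <outcome>_base, DF here is a small random stand-in for the real data, and the names base_vars, outcomes, formulas and models are made up for illustration. The fitting step is skipped if lme4 is not installed.

```r
set.seed(1)

# Small random stand-in for the question's DF (same column layout)
DF <- data.frame(
  id        = factor(rep(1:6, 3)),
  sequence  = factor(rep(c(1, 2, 1, 2, 1, 1), 3)),
  treatment = factor(rep(c("A", "B", "C"), each = 6))
)
for (v in c("var1", "var2", "var3")) {
  DF[[v]] <- rnorm(18, mean = 2, sd = 1)
  DF[[paste0(v, "_base")]] <- rnorm(18, mean = 2, sd = 1)
}

# Assumption: an outcome is any column that has a matching "<name>_base" column
base_vars <- grep("_base$", names(DF), value = TRUE)
outcomes  <- sub("_base$", "", base_vars)

# One formula per outcome, stored in a list named by outcome
formulas <- lapply(setNames(outcomes, outcomes), function(v) {
  as.formula(sprintf("%s ~ %s_base + treatment + sequence + (1 | id)", v, v))
})

# Fit the models only if lme4 is available; the list keeps the outcome names
if (requireNamespace("lme4", quietly = TRUE)) {
  models <- lapply(formulas, function(f) lme4::lmer(f, data = DF))
  print(names(models))
}
```

Because the list is named, a single result can be pulled out with models$var2 or models[["var2"]] rather than by position.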
Say I have a simple DAG representing a confounding variable X = Smoking, a treatment T and outcome Y = Death such that:
T ~ X
Y ~ T + X
Is it possible to produce a synthetic dataset of say 1m observations that follows some specified conditional probabilities:
# Pr(smoking):
smoking <- data.frame(
smoking = c(0, 1),
proba = c(0.7, 0.3)
)
# Pr(treatment | smoking):
treatment <- expand.grid(
smoking = c(0, 1),
treatment = c(0, 1)
) %>% arrange(smoking, treatment)
treatment$proba <- c(0.8, 0.2, 0.45, 0.55)
# Pr(death | treatment, smoking):
death <- expand.grid(
treatment = c(0, 1),
smoking = c(0,1),
dead = c(0,1)
) %>%
arrange(treatment, smoking, dead)
death$proba <- c(0.9, 0.1, 0.2, 0.8, 0.89, 0.11, 0.5, 0.5)
I can do this manually here because it's a very basic DAG, but I was wondering whether it can be done in a more scalable way, using something like bnlearn.
Current solution:
db <- data.frame(
smoking = rbinom(n = 1000000, size = 1, prob = 0.3)
)
db$treatment[db$smoking == 0] <- rbinom(n = sum(db$smoking == 0), size = 1, prob = 0.2)
db$treatment[db$smoking == 1] <- rbinom(n = sum(db$smoking == 1), size = 1, prob = 0.55)
db$dead[db$treatment == 0 & db$smoking == 0] <- rbinom(
n = sum(db$treatment == 0 & db$smoking == 0),
size = 1, prob = 0.1
)
db$dead[db$treatment == 0 & db$smoking == 1] <- rbinom(
n = sum(db$treatment == 0 & db$smoking == 1),
size = 1, prob = 0.8
)
db$dead[db$treatment == 1 & db$smoking == 0] <- rbinom(
n = sum(db$treatment == 1 & db$smoking == 0),
size = 1, prob = 0.11
)
db$dead[db$treatment == 1 & db$smoking == 1] <- rbinom(
n = sum(db$treatment == 1 & db$smoking == 1),
size = 1, prob = 0.5
)
It will be easier to let an existing package such as bnlearn do this for you. You can use custom.fit to combine the DAG with the CPTs, and then use rbn to draw samples from the resulting network.
An example:
library(bnlearn)
# Specify DAG
net <- model2network("[treatment|smoking][smoking][death|treatment:smoking]")
graphviz.plot(net)
# Define CPTs
smoking <- matrix(c(0.7, 0.3), ncol = 2, dimnames = list(NULL, c("no", "yes")))
treatment <- matrix(c(0.8, 0.2, 0.45, 0.55), ncol = 2, dimnames = list(c("no", "yes"), c("no", "yes")))
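# The values run over death first, then smoking, then treatment (the ordering used in the question);
# naming the dimensions makes that explicit instead of relying on position
death <- array(c(0.9, 0.1, 0.2, 0.8, 0.89, 0.11, 0.5, 0.5), c(2,2,2), dimnames = list(death = c("no", "yes"), smoking = c("no", "yes"), treatment = c("no", "yes")))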
# Build BN
fit <- custom.fit(net, dist = list(smoking = smoking, treatment = treatment, death = death))
# Draw samples
set.seed(69395642)
samples <- rbn(fit, n=1e6)
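If adding a package is not an option, the manual approach from the question can also be shortened in base R, because rbinom is vectorised over prob. This is only a sketch: p_treat and p_death are hypothetical names, the probabilities are the ones given in the question, and n is reduced to 1e5 to keep it quick.

```r
set.seed(42)
n <- 1e5

# Pr(smoking = 1)
smoking <- rbinom(n, size = 1, prob = 0.3)

# Pr(treatment = 1 | smoking), looked up per row via smoking + 1
p_treat <- c(0.2, 0.55)
treatment <- rbinom(n, size = 1, prob = p_treat[smoking + 1])

# Pr(death = 1 | treatment, smoking): rows = treatment, columns = smoking
p_death <- matrix(c(0.10, 0.11, 0.80, 0.50), nrow = 2,
                  dimnames = list(treatment = c("0", "1"), smoking = c("0", "1")))
# Matrix indexing picks one probability per observation
death <- rbinom(n, size = 1, prob = p_death[cbind(treatment + 1, smoking + 1)])

db <- data.frame(smoking, treatment, death)

# Empirical Pr(death = 1 | treatment, smoking); should be close to p_death
emp_death <- tapply(db$death, db[c("treatment", "smoking")], mean)
print(round(emp_death, 2))
```

Each new node needs only one lookup table and one rbinom call, but the tables still have to be written by hand, which is where a package such as bnlearn scales better.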
I have a character vector with the name of my variables:
variables <- c("w", "x", "y", "z")
I need to create a function that calculates the mean of every variable for a specified parameter (as below for alpha). However, it doesn't rename the columns with the iterating variable names and does not reduce the alpha columns down to one on the left.
calc <- function(df,
parameter,
iteration,
variables){
variable <- sym(variables[iteration])
mean <- df %>% group_by(.dots = parameter) %>%
summarise(variable = mean(!!variable),sd_variable = sd(!!variable))
return(mean)
}
means <- map_dfc(1:length(variables), ~calc(df = input,
parameter = "alpha",
iteration = .,
variables = variables))
Ideally the output df (means) would look like this:
alpha | w | sd_w | x | sd_x | y | sd_y | z | sd_z |
Here is what the input df looks like:
structure(list(time = c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 999.5, 999.6,
999.7, 999.8, 999.9, 1000), w = c(10, 10.0057192322758, 10.0198266325956,
10.040096099625, 10.0637654242843, 10.087779652849, 0.889708853982268,
0.890916575744663, 0.892121389863897, 0.89332329218135, 0.894522278550115,
0.895718344834999), x = c(10, 11.0467963604334, 12.1782709261765,
13.3728962503142, 14.6035317074526, 15.8398164069251, 62.6631746231113,
62.6583134156356, 62.6534565303638, 62.648604016965, 62.6437559251575,
62.6389123047088), y = c(10, 9.89605687874935, 9.59253574727296,
9.11222320249057, 8.48917353431654, 7.76447036695841, 0.00833796964522317,
0.00835876233547079, 0.00837957883570158, 0.00840041916631544,
0.00842128334742553, 0.00844217139885453), z = c(10, 9.05439359565339,
8.21533762023494, 7.48379901688836, 6.85562632179817, 6.3231517466183,
-7.50539460838544, -7.48234149534558, -7.45927733670319, -7.43620225192078,
-7.41311636057114, -7.39001978233681), alpha = c(0.1, 0.1, 0.1,
0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5), beta = c(0.1, 0.1,
0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1), eta = c(0.1,
0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1), zeta = c(0.1,
0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1), lambda = c(0.95,
0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95
), phi = c(5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), kappa = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), ode_outputs..iteration.. = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c("1",
"1.1", "1.2", "1.3", "1.4", "1.5", "3.9995", "3.9996", "3.9997",
"3.9998", "3.9999", "3.10000"), class = "data.frame")
Ideally the function would use dplyr and/or baseR.
If I understand you correctly, there's no need to iterate over columns. It can all be done directly in dplyr (df below is the input data frame from the question):
library(tidyverse)
df %>%
group_by(alpha) %>%
summarise(
across(
c(w, x, y, z),
list(mean=mean, sd=sd)
),
.groups="drop"
) %>%
rename_with(function(x) str_sub(x,1,1), ends_with("mean"))
# A tibble: 2 x 9
alpha w w_sd x x_sd y y_sd z z_sd
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.1 10.0 0.0345 12.8 2.20 9.14 0.875 7.99 1.38
2 0.5 0.893 0.00225 62.7 0.00908 0.00839 0.0000390 -7.45 0.0432
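If the columns should be named exactly as in the question (w, sd_w, x, sd_x, ...) and the grouping column and variable names arrive as character vectors, the same idea can be wrapped in a function. This is a sketch under assumptions: the small input data frame is made up for illustration, and calc only mirrors the name of the question's function, not its original signature.

```r
library(dplyr)

# Made-up stand-in for the question's input data
input <- data.frame(
  alpha = c(0.1, 0.1, 0.1, 0.5, 0.5, 0.5),
  w     = c(10, 10.1, 10.2, 0.8, 0.9, 1.0),
  x     = c(10, 11, 12, 62, 63, 64)
)
variables <- c("w", "x")

calc <- function(df, parameter, variables) {
  df %>%
    # group by a column given as a string
    group_by(across(all_of(parameter))) %>%
    summarise(
      # .names puts the function name first: mean_w, sd_w, mean_x, sd_x
      across(all_of(variables), list(mean = mean, sd = sd),
             .names = "{.fn}_{.col}"),
      .groups = "drop"
    ) %>%
    # strip the "mean_" prefix so the means are simply w, x, ...
    rename_with(~ sub("^mean_", "", .x), starts_with("mean_"))
}

means <- calc(input, parameter = "alpha", variables = variables)
print(names(means))
```

Passing the names through all_of() keeps the function usable with any character vector of columns, so no map_dfc call over an iteration index is needed.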
My question builds on another one previously posted by someone: mapply for all arguments' combinations [R]
I want to apply a function to multiple arguments using mapply, and this works with my code below. But I want to add a condition such that NOT ALL tmin- and tmax- values will be combined, instead only the first tmin with the first tmax, the second tmin with the second tmax (if tmin == 0.01 & tmax == 0.99 or if tmin == 0.05 & tmax == 0.95, but e.g. tmin == 0.01 should not be combined with tmax == 0.95).
But the first elements of tmin & tmax should be combined with ALL variables, all second elements of tmin & tmax should be combined with ALL variables, etc (as below in the expand.grid() function).
In the end I should have a data frame as the one called "alltogether", but I should have 15 rows with the described condition and not 75 as it is the case now.
I could just filter rows with dplyr::filter afterwards, but is there a nice way to include this condition in the function?
Here is an example data frame:
dataframe <- data.frame(personID = 1:10,
Var1 = c(4, 6, 3, 3, 7, 1, 20, NA, 12, 2),
Var2 = c(5, 4, 5, 6, 9, 14, 14, 1, 0, NA),
Var3 = c(NA, 15, 12, 0, NA, NA, 2, 7, 6, 7),
Var4 = c(0, 0, 0, 0, 1, 0, 1, 4, 2, 1),
Var5 = c(12, 15, 11, 10, 10, 15, NA, 10, 13, 11))
And here is the code I have so far:
des <- function(var, tmin, tmax){
v <- var[var >= quantile(var, probs = tmin, na.rm = TRUE) &
var <= quantile(var, probs = tmax, na.rm = TRUE)]
d <- psych::describe(v)
df <- cbind(variable = deparse(substitute(var)), tmin = tmin, tmax = tmax, d)
print(df)
}
args = expand.grid(var = dataframe[, c("Var2", "Var4", "Var5")], tmin = c(0.01, 0.05, 0.1, 0.2, 0.25), tmax = c(0.99, 0.95, 0.9, 0.8, 0.75))
alltogether <- do.call("rbind", mapply(FUN = des, var = args$var, tmin = args$tmin, tmax = args$tmax, SIMPLIFY = FALSE))
Thank you for helping!
Edit:
The expected output is the one after filtering the "alltogether"-dataframe with the following code (15 obs. of 16 variables):
alltogether <- alltogether%>%
dplyr::filter((tmin == 0.01 & tmax == 0.99) |
(tmin == 0.05 & tmax == 0.95) |
(tmin == 0.1 & tmax == 0.9) |
(tmin == 0.2 & tmax == 0.8) |
(tmin == 0.25 & tmax == 0.75))
OK, here's a solution to both problems. Unfortunately, I couldn't get one using mapply so I had to rely on a good old for loop (but it's still faster, given that it doesn't have to do all the extra calculations). Also, I changed the function to give you the names of the variables as you wanted. The biggest difference is that I'm not using expand.grid but merge. Finally, it incorporates your comment from above.
des <- function(var, tmin, tmax, cor.var, cor.method = c("spearman", "pearson", "kendall")){
var[var < quantile(var, probs = tmin, na.rm = TRUE) |
var > quantile(var, probs = tmax, na.rm = TRUE)] <- NA
d <- psych::describe(var)
correlation<- cor(cor.var, var, use="pairwise.complete", match.arg(cor.method))
df <- cbind(variable = names(var), tmin = tmin, tmax = tmax, d, correlation)
names(df)[length(names(df))]<- paste0("correlation_with_", names(cor.var))
print(df)
}
minmax = data.frame(tmin = c(0.01, 0.05, 0.1, 0.2, 0.25), tmax = c(0.99, 0.95, 0.9, 0.8, 0.75))
args<- merge(c("Var2", "Var4", "Var5"), minmax)
args[,1]<- as.character(args[,1])
alltogether<- NULL
for (i in 1:nrow(args)){
alltogether<- rbind(alltogether, des(var = dataframe[args[i,1]],
tmin = args[i, 2], tmax=args[i, 3], cor.var = dataframe["Var1"]))
}
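For readers who would still prefer the mapply family, one possible sketch is to cross the variable names with the row index of the tmin/tmax table rather than with tmin and tmax separately, so each pair stays together. The assumptions are stated up front: trim_summary is a hypothetical, simplified stand-in for des (base R only, no psych::describe and no correlation), and the data are a reduced version of the question's data frame.

```r
dataframe <- data.frame(
  Var2 = c(5, 4, 5, 6, 9, 14, 14, 1, 0, NA),
  Var4 = c(0, 0, 0, 0, 1, 0, 1, 4, 2, 1)
)
minmax <- data.frame(tmin = c(0.01, 0.05, 0.1), tmax = c(0.99, 0.95, 0.9))

# Cross variable names with the ROW INDEX of minmax, so tmin/tmax stay paired
args <- expand.grid(variable = c("Var2", "Var4"),
                    pair = seq_len(nrow(minmax)),
                    stringsAsFactors = FALSE)
args <- cbind(args["variable"], minmax[args$pair, ])

# Hypothetical helper: trim one variable to [tmin, tmax] quantiles and summarise
trim_summary <- function(variable, tmin, tmax) {
  v <- dataframe[[variable]]
  q <- quantile(v, probs = c(tmin, tmax), na.rm = TRUE)
  kept <- v[!is.na(v) & v >= q[1] & v <= q[2]]
  data.frame(variable = variable, tmin = tmin, tmax = tmax,
             n = length(kept), mean = mean(kept))
}

# Map is mapply with SIMPLIFY = FALSE; one call per row of args
alltogether <- do.call(rbind, Map(trim_summary, args$variable, args$tmin, args$tmax))
rownames(alltogether) <- NULL
print(alltogether)
```

With 2 variables and 3 threshold pairs this gives 6 rows instead of the 18 a full expand.grid over tmin and tmax would produce, so no filtering is needed afterwards.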
When I use the code from the example:
library(deepnet)
Var1 <- c(rnorm(50, 1, 0.5), rnorm(50, -0.6, 0.2))
Var2 <- c(rnorm(50, -0.8, 0.2), rnorm(50, 2, 1))
x <- matrix(c(Var1, Var2), nrow = 100, ncol = 2)
y <- c(rep(1, 50), rep(0, 50))
nn <- dbn.dnn.train(x, y, hidden = c(5))
it works. But when I use this code:
Var1 <- c(rnorm(50, 1, 0.5), rnorm(50, -0.6, 0.2))
Var2 <- c(rnorm(50, -0.8, 0.2), rnorm(50, 2, 1))
x <- matrix(c(Var1, Var2), nrow = 100, ncol = 2)
y <- c(rep("1", 50), rep("0", 50))
nn <- dbn.dnn.train(x, y, hidden = c(5))
I receive an error:
Error in batch_y - nn$post[[i]] : non-numeric argument to binary operator
How can I use the deepnet package for a classification problem?
y <- c(rep("1", 50), rep("0", 50))
gives you a character vector, which the package does not accept, and that is why you get the error:
class(y)
#[1] "character"
y should be numeric instead, as follows:
y <- c(rep(1, 50), rep(0, 50))
class(y)
#[1] "numeric"
If you look inside y, you will find only the values 1 and 0, which are binary labels for classification:
> table(y)
#y
# 0 1
#50 50
If you want to train as described in the manual, you can do the following to train a model and predict a test set:
Var1 <- c(rnorm(50, 1, 0.5), rnorm(50, -0.6, 0.2))
Var2 <- c(rnorm(50, -0.8, 0.2), rnorm(50, 2, 1))
x <- matrix(c(Var1, Var2), nrow = 100, ncol = 2)
y <- c(rep(1, 50), rep(0, 50))
If you inspect x and y with str(x) and str(y), you can see that they are numeric (to make sure, you can also check class(x) and class(y)).
Once you have x and y, you can build your model:
dnn <- dbn.dnn.train(x, y, hidden = c(5, 5))
If you have a test set, you can then predict it, for example as shown in the manual:
test_Var1 <- c(rnorm(50, 1, 0.5), rnorm(50, -0.6, 0.2))
test_Var2 <- c(rnorm(50, -0.8, 0.2), rnorm(50, 2, 1))
test_x <- matrix(c(test_Var1, test_Var2), nrow = 100, ncol = 2)
nn.test(dnn, test_x, y)
#[1] 0.25
Again, test_x must be numeric. If your values are stored as characters, you can convert them with as.numeric(), e.g. y <- as.numeric(y).
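The conversion step can be sketched in base R as follows. The variable names are made up for illustration, and the factor case is included as an assumption about how labels are often stored: calling as.numeric() directly on a factor returns the internal level codes rather than the original labels.

```r
# Character labels: as.numeric() parses the strings
y_chr <- c(rep("1", 3), rep("0", 3))
y_num <- as.numeric(y_chr)

# Factor labels: as.numeric() alone returns level codes (1, 2, ...),
# so convert to character first to recover the original values
y_fac <- factor(y_chr)
y_codes <- as.numeric(y_fac)
y_from_fac <- as.numeric(as.character(y_fac))

# Non-numeric labels such as "P"/"N": build the 0/1 coding explicitly
labels <- c("P", "P", "N")
y_bin <- as.integer(labels == "P")

print(class(y_num))
```

Checking class(y) and table(y) after the conversion, as above, confirms that only numeric 0 and 1 values are passed to dbn.dnn.train.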
Why does svm from the e1071 package replace the original labels with "1" and "2" if at least one column has only one unique value?
For example, the code below works correctly:
library(e1071)  # provides svm()
trainData <- data.frame("cA" = c(1, 1, 1, 0.99),
"cB" = c(0.5, 0.6, 0.5, 0.3),
"is_match" = factor(c("N", "N", "P", "P")))
testData <- data.frame("cA" = c(1, 1, 0, 0),
"cB" = c(0.2, 0.3, 0.2, 0.1))
model <- svm(is_match ~ ., data = trainData, type = "C-classification")
pred <- predict(model, testData, type = "class")
print(pred)
It returns:
1 2 3 4
P P P P
However, if I change 0.99 to 1 in the first column - so that all values become the same - svm changes labels "N" and "P" to "1" and "2":
trainData <- data.frame("cA" = c(1, 1, 1, 1),
"cB" = c(0.5, 0.6, 0.5, 0.3),
"is_match" = factor(c("N", "N", "P", "P")))
testData <- data.frame("cA" = c(1, 1, 0, 0),
"cB" = c(0.2, 0.3, 0.2, 0.1))
model <- svm(is_match ~ ., data = trainData, type = "C-classification")
pred <- predict(model, testData, type = "class")
print(pred)
This code returns:
1 2 3 4
2 2 2 2
Additional notes:
It happens with any value in the column (zeros, NAs), as long as the value is the same for every instance
If the labels are digits, svm doesn't replace them
Other ML methods such as rpart or ada work correctly