In my model, x is a categorical variable with 3 categories (0, 1, and 2), where 0 is the reference category. However, category 0 has more observations than the others (1, 2), so to avoid a biased sample I want to do stratified bootstrapping, but I could not find any relevant method for that.
df <- data.frame (x = c(0,0,0,0,0,1,1,2,2),
y = c(10,11,10,10,12,17,16,20,19),
m = c(6,5,6,7,2,10,14,8,11)
)
df$x <- as.factor(df$x)
df$x <- relevel(df$x,ref = "0")
fit <- lm(y ~ x*m, data = df)
summary(fit)
Expanding on Roland's answer in the comments, you can harvest the confidence intervals from bootstrapping using boot.ci:
library(boot)
b <- boot(df, \(DF, i) coef(lm(y ~ x*m, data = DF[i,])), strata = df$x, R = 999)
result <- do.call(rbind, lapply(seq_along(b$t0), function(i) {
m <- boot.ci(b, type = 'norm', index = i)$normal
data.frame(estimate = b$t0[i], lower = m[2], upper = m[3])
}))
result
#> estimate lower upper
#> (Intercept) 12.9189189 10.7166127 15.08403731
#> x1 6.5810811 2.0162637 8.73184665
#> x2 9.7477477 6.9556841 11.37390826
#> m -0.4459459 -0.8010925 -0.07451434
#> x1:m 0.1959459 -0.1842914 0.55627896
#> x2:m 0.1126126 -0.2572955 0.48352616
And even plot the results like this:
library(ggplot2)
ggplot(within(result, var <- rownames(result)), aes(estimate, var)) +
geom_vline(xintercept = 0, color = 'gray') +
geom_errorbarh(aes(xmin = lower, xmax = upper), height = 0.1) +
geom_point(color = 'red') +
theme_light()
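For intuition about what strata = df$x does in the boot() call above: resampling happens within each stratum, so every bootstrap sample keeps the original number of rows per category. The following is a minimal base-R sketch of that idea, not boot's actual implementation; the helper name stratified_indices is hypothetical.

```r
# Hypothetical helper (not part of the boot package): resample row indices
# with replacement separately within each stratum.
stratified_indices <- function(strata) {
  groups <- split(seq_along(strata), strata)
  unlist(lapply(groups, function(idx) {
    # sample.int() avoids the sample() surprise when a stratum has one row
    idx[sample.int(length(idx), replace = TRUE)]
  }), use.names = FALSE)
}

x <- factor(c(0, 0, 0, 0, 0, 1, 1, 2, 2))
set.seed(42)
i <- stratified_indices(x)
table(x[i])  # stratum sizes match table(x) in every resample
```

Because the small categories (1 and 2) can never drop out of a resample, all coefficients of y ~ x*m remain in the model for every replicate.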
Say there is a formula:
f1 = as.formula(y~ var1 + var2 + var3)
f1
## y ~ var1 + var2 + var3
Then I want to update the formula by adding the variable whose name is stored in the character string a.
a = 'aaabbbccc'
f2 = update(f1, ~ . + a)
f2
## y ~ var1 + var2 + var3 + a
This is not what I expected. I want a to be evaluated in the formula. Then I tried this:
f3 = update(f1, ~ . + get(a))
f3
## y ~ var1 + var2 + var3 + get(a)
Also failed. What I expected is this:
y ~ var1 + var2 + var3 + aaabbbccc
Any help will be highly appreciated!
If you are evaluating these statements in your global environment, then you could do:
f <- y ~ var1 + var2 + var3
a <- as.name("aaabbbccc")
update(f, substitute(~ . + a, env = list(a = a)))
## y ~ var1 + var2 + var3 + aaabbbccc
Otherwise, you could do:
update(f, substitute(~ . + a, env = environment()))
## y ~ var1 + var2 + var3 + aaabbbccc
The important thing is that the value of a in env is a symbol, not a string: as.name("aaabbbccc") or quote(aaabbbccc), but not "aaabbbccc".
Somewhat unintuitively, substitute(expr, env = .GlobalEnv) is equivalent to substitute(expr, env = NULL). That is the only reason why it is necessary to pass list(a = a) (or similar) in the first case.
I should point out that, in this situation, it is not too difficult to create the substitute result yourself, "from scratch":
update(f, call("~", call("+", quote(.), a)))
## y ~ var1 + var2 + var3 + aaabbbccc
This approach has the advantage of being environment-independent, and, for that reason, is probably the one I would use.
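If this is needed more than once, the "from scratch" construction could be wrapped in a small function that takes the variable name as a string. This is only a sketch; the function name add_term is made up for illustration.

```r
# Hypothetical wrapper around the call()-based approach shown above.
# `name` is a character string; as.name() turns it into a symbol.
add_term <- function(f, name) {
  update(f, call("~", call("+", quote(.), as.name(name))))
}

f <- y ~ var1 + var2 + var3
a <- "aaabbbccc"
f2 <- add_term(f, a)
f2
```

Since the symbol is built inside the function from its argument, the result does not depend on which environment the call is made from.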
I was wondering if the syntax of the regression model below could be made more concise (shorter) than it currently is?
dat <- read.csv('https://raw.githubusercontent.com/rnorouzian/v/main/bv1.csv')
library(nlme)
model <- lme(achieve ~ 0 + D1 + D2+
D1:time + D2:time+
D1:schcontext + D2:schcontext +
D1:female + D2:female+
D1:I(female*time) + D2:I(female*time)+
D1:I(schcontext*time) + D2:I(schcontext*time), correlation = corSymm(),
random = ~0 + D1:time | schcode/id, data = dat, weights = varIdent(form = ~1|factor(math)),
na.action = na.omit, control = lmeControl(maxIter = 200, msMaxIter = 200, niterEM = 50,
msMaxEval = 400))
coef(summary(model))
Focusing on the fixed-effect component only.
Original formula:
form1 <- ~ 0 + D1 + D2+
D1:time + D2:time+
D1:schcontext + D2:schcontext +
D1:female + D2:female+
D1:I(female*time) + D2:I(female*time)+
D1:I(schcontext*time) + D2:I(schcontext*time)
X1 <- model.matrix(form1, data=dat)
I think this is equivalent:
form2 <- ~0 +
D1 + D2 +
(D1+D2):(time + schcontext + female + female:time+schcontext:time)
X2 <- model.matrix(form2, data=dat)
(Unfortunately ~ 0 + (D1 + D2):(1 + time + ...) doesn't work as I would have liked/expected.)
For a start, the model matrix has the right dimensions. Staring at the column names of the model matrices and reordering the columns manually:
X2o <- X2[,c(1:3,6,4,7,5,8,9,11,10,12)]
all.equal(c(X1),c(X2o)) ##TRUE
(For numerical predictors, you don't need I(A*B): A:B is equivalent.)
Actually, you can do a little better using the * operator:
form3 <- ~0 +
D1 + D2 +
(D1+D2):(time*(schcontext+female))
X3 <- model.matrix(form3, data=dat)
X3o <- X3[,c(1:3,6,4,7,5,8,10,12,9,11)]
all.equal(c(X1),c(X3o)) ## TRUE
Compare formula length:
sapply(list(form1,form2,form3),
function(x) nchar(as.character(x)[[2]]))
## [1] 183 84 54
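A quick way to see how the compact formula expands, without needing the data, is to inspect the term labels that terms() generates. This is just a sketch for checking the expansion symbolically; it does not replace the model.matrix() comparison above.

```r
# Expand both formulas symbolically and compare the number of terms.
form1 <- ~ 0 + D1 + D2 +
  D1:time + D2:time +
  D1:schcontext + D2:schcontext +
  D1:female + D2:female +
  D1:I(female*time) + D2:I(female*time) +
  D1:I(schcontext*time) + D2:I(schcontext*time)
form3 <- ~ 0 + D1 + D2 + (D1 + D2):(time * (schcontext + female))

labels1 <- attr(terms(form1), "term.labels")
labels3 <- attr(terms(form3), "term.labels")

length(labels1)  # number of fixed-effect terms in the long form
length(labels3)  # should agree for the compact form
labels3
```

The labels differ in how the products are written (I(female*time) versus a three-way interaction), which is why the columns have to be matched up by hand before comparing the model matrices.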
I am fitting a logit model in R using the mlogit package. All the parameters of the dummy variables (var1, var2, var3) are normally distributed. How can I put a constraint on one variable (var1) to make its parameter zero-mean? The current code returns a non-zero mean for var1. Thanks!
model <- mlogit(outcome ~ var1 + var2 + var3 | 0 ,
data=data,
rpar = c(var1 = "n",
var2 = "n",
var3 = "n"),
correlation = FALSE,
R = 100,
halton = NA)
I know it is relatively easy in Stata to put constraints on parameters. For my model, my code in Stata is:
constraint 1 _b[var1]=0
mixlogit choice var1 var2 var3, group(qid) id(id) constraint(1)
I just want to know how to do the same thing in R. Thanks!
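Based on the gmnl package, we can constrain the estimation using maxLik. For this question:
library(gmnl)
# one row per constraint, one column per parameter (3 means, then 3 sds)
A <- matrix(c(1,0,0,0,0,0, -1,0,0,0,0,0), nrow = 2, byrow = TRUE)
B <- matrix(c(0, 0))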
model <- gmnl(outcome ~ var1 + var2 + var3 | 0 ,
data=data,
rpar = c(var1 = "n",
var2 = "n",
var3 = "n"),
constraints = list (eqA=A, eqB=B),
model = "mixl",
correlation = FALSE,
R = 100,
halton = NULL)
Mathematically, the constraints mean
1*mean(var1) + 0*mean(var2) + 0*mean(var3) + 0*sd(var1) + 0*sd(var2) + 0*sd(var3) = 0
and
-1*mean(var1) + 0*mean(var2) + 0*mean(var3) + 0*sd(var1) + 0*sd(var2) + 0*sd(var3) = 0
so that the mean of var1 has to be 0 in the estimation.
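To make the constraint form concrete: maxLik expresses equality constraints as eqA %*% theta + eqB = 0. The sketch below only evaluates that expression in base R for made-up parameter vectors; it assumes the parameter order used above (three means followed by three standard deviations) and uses a single row, since one equality is enough to pin mean(var1) to zero.

```r
# One equality constraint on six parameters:
# (mean var1, mean var2, mean var3, sd var1, sd var2, sd var3).
A <- matrix(c(1, 0, 0, 0, 0, 0), nrow = 1)
B <- 0

# Left-hand side of A %*% theta + B = 0 for a given parameter vector.
constraint_value <- function(theta) drop(A %*% theta + B)

theta_ok  <- c(0.0, 0.5, -0.2, 1.0, 0.8, 1.2)  # made-up values, mean(var1) = 0
theta_bad <- c(0.3, 0.5, -0.2, 1.0, 0.8, 1.2)  # made-up values, mean(var1) = 0.3

constraint_value(theta_ok)   # satisfies the constraint
constraint_value(theta_bad)  # violates it by the value of mean(var1)
```

The actual parameter order in a fitted model should be checked (for example from the names of the estimated coefficients) before building A.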
I'm trying to make a binary classification model based on the built-in iris dataset using the glm() function.
First, I filtered the dataset so that only the species versicolor and virginica are used in the binary classification. I am getting warning messages in the code.
Is there a way to fix the code to get rid of these warning messages? Warning messages are under the ROC.train code and ggplot() code. The code is below:
> library(dplyr)
> library(forcats)
> library(ggplot2)
> iris.small <- datasets::iris %>%
+ dplyr::filter(Species != "setosa") %>%
+ dplyr::mutate(Species = fct_drop(Species)) %>%
+ dplyr::group_by(Species) %>%
+ dplyr::summarize(avg_sl = mean(Sepal.Length),
+ avg_sw = mean(Sepal.Width),
+ avg_pl = mean(Petal.Length),
+ avg_pw = mean(Petal.Width))
> set.seed(2016-11-14)
> iris.big <- data_frame(Species = as.factor(c(rep("versicolor", 500), rep("virginica", 500))),
+ sl = c(rnorm(500, iris.small$avg_sl[1]), rnorm(500, iris.small$avg_sl[2])),
+ sw = c(rnorm(500, iris.small$avg_sw[1]), rnorm(500, iris.small$avg_sw[2])),
+ pl = c(rnorm(500, iris.small$avg_pl[1]), rnorm(500, iris.small$avg_pl[2])),
+ pw = c(rnorm(500, iris.small$avg_pw[1]), rnorm(500, iris.small$avg_pw[2])))
> train_fraction <- 0.5 #fraction of data for training purposes
> n_obs <- nrow(iris.big)
> train_size <- floor(train_fraction * nrow(iris.big))
> train_indices <- sample(n_obs, size=train_size, replace=TRUE) # usage: sample(x, size, replace = FALSE, prob = NULL)
> train_data <- iris.big[train_indices, ]
> test_data <- iris.big[-train_indices, ]
> glm.out.train <- glm(Species ~ sl + sw + pl + pw, data=train_data, family = "binomial")
> test_pred <- predict(glm.out.train, test_data, type='response')
> calc_ROC <- function(probabilities, known_truth, model.name=NULL)
+ {
+ outcome <- as.numeric(factor(known_truth))-1
+ pos <- sum(outcome) # total known positives
+ neg <- sum(1-outcome) # total known negatives
+ pos_probs <- outcome*probabilities # probabilities for known positives
+ neg_probs <- (1-outcome)*probabilities # probabilities for known negatives
+ true_pos <- sapply(probabilities,
+ function(x) sum(pos_probs>=x)/pos) # true pos. rate
+ false_pos <- sapply(probabilities,
+ function(x) sum(neg_probs>=x)/neg)
+ if (is.null(model.name))
+ result <- data.frame(true_pos, false_pos)
+ else
+ result <- data.frame(true_pos, false_pos, model.name)
+ result %>% dplyr::arrange(false_pos, true_pos)
+ }
> ROC.train <- calc_ROC(probabilities=test_pred, known_truth=train_data$Species, model.name="train")
Warning messages:
1: In outcome * probabilities :
longer object length is not a multiple of shorter object length
2: In (1 - outcome) * probabilities :
longer object length is not a multiple of shorter object length
> ROC.test <- calc_ROC(probabilities=test_pred, known_truth=test_data$Species, model.name="test")
> ROCs <- rbind(ROC.train, ROC.test)
> ggplot(ROCs, aes(x=false_pos, y=true_pos, color=model.name)) + geom_line() + xlim(0, 0.25)
Warning message:
Removed 745 rows containing missing values (geom_path).
> ROCs %>% dplyr::group_by(model.name) %>% dplyr::mutate(delta=false_pos-lag(false_pos)) %>% dplyr::summarize(AUC=sum(delta*true_pos, na.rm=T)) %>% dplyr::arrange(desc(AUC))
# A tibble: 2 × 2
model.name AUC
<fctr> <dbl>
1 test 0.8700770
2 train 0.7329557
JackStat recommendation:
> library(dplyr)
> library(forcats)
> library(ggplot2)
> iris.small <- datasets::iris %>%
+ dplyr::filter(Species != "setosa") %>%
+ dplyr::mutate(Species = fct_drop(Species)) %>%
+ dplyr::group_by(Species) %>%
+ dplyr::summarize(avg_sl = mean(Sepal.Length),
+ avg_sw = mean(Sepal.Width),
+ avg_pl = mean(Petal.Length),
+ avg_pw = mean(Petal.Width))
> set.seed(2016-11-14)
> iris.big <- data_frame(Species = as.factor(c(rep("versicolor", 500), rep("virginica", 500))),
+ sl = c(rnorm(500, iris.small$avg_sl[1]), rnorm(500, iris.small$avg_sl[2])),
+ sw = c(rnorm(500, iris.small$avg_sw[1]), rnorm(500, iris.small$avg_sw[2])),
+ pl = c(rnorm(500, iris.small$avg_pl[1]), rnorm(500, iris.small$avg_pl[2])),
+ pw = c(rnorm(500, iris.small$avg_pw[1]), rnorm(500, iris.small$avg_pw[2])))
> train_fraction <- 0.5 #fraction of data for training purposes
> n_obs <- nrow(iris.big)
> train_size <- floor(train_fraction * nrow(iris.big))
> train_indices <- sample(n_obs, size=train_size, replace=TRUE) # usage: sample(x, size, replace = FALSE, prob = NULL)
> train_data <- iris.big[train_indices, ]
> test_data <- iris.big[-train_indices, ]
> glm.out.train <- glm(Species ~ sl + sw + pl + pw, data=train_data, family = "binomial")
> test_pred <- predict(glm.out.train, test_data, type='response')
> calc_ROC <- function(probabilities, known_truth, model.name=NULL)
+ {
+ outcome <- as.numeric(factor(known_truth))-1
+ pos <- sum(outcome) # total known positives
+ neg <- sum(1-outcome) # total known negatives
+ pos_probs <- outcome*probabilities # probabilities for known positives
+ neg_probs <- (1-outcome)*probabilities # probabilities for known negatives
+ true_pos <- sapply(probabilities,
+ function(x) sum(pos_probs>=x)/pos) # true pos. rate
+ false_pos <- sapply(probabilities,
+ function(x) sum(neg_probs>=x)/neg)
+ if (is.null(model.name))
+ result <- data.frame(true_pos, false_pos)
+ else
+ result <- data.frame(true_pos, false_pos, model.name)
+ result %>% dplyr::arrange(false_pos, true_pos) eps <- 1e-15; test_pred = pmax(pmin(test_pred, 1 - eps), eps)
Error: unexpected symbol in:
"result <- data.frame(true_pos, false_pos, model.name)
result %>% dplyr::arrange(false_pos, true_pos) eps"
> }
Error: unexpected '}' in "}"
> ROC.train <- calc_ROC(probabilities=test_pred, known_truth=train_data$Species, model.name="train")
Warning messages:
1: In outcome * probabilities :
longer object length is not a multiple of shorter object length
2: In (1 - outcome) * probabilities :
longer object length is not a multiple of shorter object length
> ROC.test <- calc_ROC(probabilities=test_pred, known_truth=test_data$Species, model.name="test")
> ROCs <- rbind(ROC.train, ROC.test)
> ggplot(ROCs, aes(x=false_pos, y=true_pos, color=model.name)) + geom_line() + xlim(0, 0.25)
Warning message:
Removed 745 rows containing missing values (geom_path).
> ROCs %>% dplyr::group_by(model.name) %>% dplyr::mutate(delta=false_pos-lag(false_pos)) %>% dplyr::summarize(AUC=sum(delta*true_pos, na.rm=T)) %>% dplyr::arrange(desc(AUC))
# A tibble: 2 × 2
model.name AUC
<fctr> <dbl>
1 test 0.8700770
2 train 0.7329557
The code is still giving an error message.
Am I putting this code in the right place?
eps <- 1e-15; test_pred = pmax(pmin(test_pred, 1 - eps), eps)
I edited the code to exclude 0s and 1s. But, I'm still getting errors. What can I do now to fix the error?
> library(dplyr)
> library(forcats)
> library(ggplot2)
> iris.small <- datasets::iris %>%
+ dplyr::filter(Species != "setosa") %>%
+ dplyr::mutate(Species = fct_drop(Species)) %>%
+ dplyr::group_by(Species) %>%
+ dplyr::summarize(avg_sl = mean(Sepal.Length),
+ avg_sw = mean(Sepal.Width),
+ avg_pl = mean(Petal.Length),
+ avg_pw = mean(Petal.Width))
> set.seed(2016-11-14)
> iris.big <- data_frame(Species = as.factor(c(rep("versicolor", 500), rep("virginica", 500))),
+ sl = c(rnorm(500, iris.small$avg_sl[1]), rnorm(500, iris.small$avg_sl[2])),
+ sw = c(rnorm(500, iris.small$avg_sw[1]), rnorm(500, iris.small$avg_sw[2])),
+ pl = c(rnorm(500, iris.small$avg_pl[1]), rnorm(500, iris.small$avg_pl[2])),
+ pw = c(rnorm(500, iris.small$avg_pw[1]), rnorm(500, iris.small$avg_pw[2])))
> iris.big$sl[iris.big$sl==0] <-0.0000000001
> iris.big$sw[iris.big$sw==0] <-0.0000000001
> iris.big$pl[iris.big$pl==0] <-0.0000000001
> iris.big$pw[iris.big$pw==0] <-0.0000000001
> iris.big$sl[iris.big$sl==1] <-0.99999999
> iris.big$sw[iris.big$sw==1] <-0.99999999
> iris.big$pl[iris.big$pl==1] <-0.99999999
> iris.big$pw[iris.big$pw==1] <-0.99999999
> train_fraction <- 0.5 #fraction of data for training purposes
> n_obs <- nrow(iris.big)
> train_size <- floor(train_fraction * nrow(iris.big))
> train_indices <- sample(n_obs, size=train_size, replace=TRUE) # usage: sample(x, size, replace = FALSE, prob = NULL)
> train_data <- iris.big[train_indices, ]
> test_data <- iris.big[-train_indices, ]
> glm.out.train <- glm(Species ~ sl + sw + pl + pw, data=train_data, family = "binomial")
> test_pred <- predict(glm.out.train, test_data, type='response')
> calc_ROC <- function(probabilities, known_truth, model.name=NULL)
+ {
+ outcome <- as.numeric(factor(known_truth))-1
+ pos <- sum(outcome) # total known positives
+ neg <- sum(1-outcome) # total known negatives
+ pos_probs <- outcome*probabilities # probabilities for known positives
+ neg_probs <- (1-outcome)*probabilities # probabilities for known negatives
+ true_pos <- sapply(probabilities,
+ function(x) sum(pos_probs>=x)/pos) # true pos. rate
+ false_pos <- sapply(probabilities,
+ function(x) sum(neg_probs>=x)/neg)
+ if (is.null(model.name))
+ result <- data.frame(true_pos, false_pos)
+ else
+ result <- data.frame(true_pos, false_pos, model.name)
+ result %>% dplyr::arrange(false_pos, true_pos)
+ }
> ROC.train <- calc_ROC(probabilities=test_pred, known_truth=train_data$Species, model.name="train")
Warning messages:
1: In outcome * probabilities :
longer object length is not a multiple of shorter object length
2: In (1 - outcome) * probabilities :
longer object length is not a multiple of shorter object length
> ROC.test <- calc_ROC(probabilities=test_pred, known_truth=test_data$Species, model.name="test")
> ROCs <- rbind(ROC.train, ROC.test)
> ggplot(ROCs, aes(x=false_pos, y=true_pos, color=model.name)) + geom_line() + xlim(0, 0.25)
Warning message:
Removed 745 rows containing missing values (geom_path).
> ROCs %>% dplyr::group_by(model.name) %>% dplyr::mutate(delta=false_pos-lag(false_pos)) %>% dplyr::summarize(AUC=sum(delta*true_pos, na.rm=T)) %>% dplyr::arrange(desc(AUC))
# A tibble: 2 × 2
model.name AUC
<fctr> <dbl>
1 test 0.8700770
2 train 0.7329557
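One possible reading of the "longer object length" warnings in the transcripts above: calc_ROC() is called with test_pred (one value per row of test_data) together with train_data$Species, and these two vectors have different lengths. Sampling with replace=TRUE makes this more visible, because duplicated indices leave more than half of the rows in test_data. The sketch below uses simulated data with made-up means rather than the original iris.big, and shows predictions computed separately for each split so that their lengths match the labels.

```r
set.seed(1)
n_obs <- 1000

# Simulated stand-in for iris.big with a single predictor (assumed values).
dat <- data.frame(
  Species = factor(rep(c("versicolor", "virginica"), each = n_obs / 2)),
  pl = c(rnorm(n_obs / 2, mean = 4.3), rnorm(n_obs / 2, mean = 5.6))
)

# sample() draws without replacement by default, giving a clean 50/50 split.
train_indices <- sample(n_obs, size = n_obs / 2)
train_data <- dat[train_indices, ]
test_data  <- dat[-train_indices, ]

fit <- glm(Species ~ pl, data = train_data, family = binomial)

# Predict on each split separately, then pair each with its own labels.
train_pred <- predict(fit, train_data, type = "response")
test_pred  <- predict(fit, test_data,  type = "response")

c(length(train_pred), nrow(train_data))  # lengths agree
c(length(test_pred),  nrow(test_data))   # lengths agree
```

With this setup, the training ROC would be computed from train_pred and train_data$Species, and the test ROC from test_pred and test_data$Species. The "Removed 745 rows" message is a separate matter: xlim() discards points outside the range, whereas coord_cartesian(xlim = c(0, 0.25)) zooms the plot without dropping data.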