I have 1,000 dependent variables (Yvar1 … Yvar1000) to use in an ANCOVA. The independent variables (x1, x2) are the same in all models.
Yvar1  Yvar2  ...  Yvar1000  x1  x2
    1      2  ...         5  11  16
    2      3  ...         6  18  23
I need R code that replaces example$Y.to.use with the values of each Yvar in turn.
I tried this:
for (i in 1:1000){
  example$Y.to.use <- example$paste("Yvar", i, sep = "")
  # Error: attempt to apply non-function
  paste("fitted.model", i) <- lme(log(Y.to.use + 1) ~ x1*x2,
                                  data = example, random = ~1 | Exp/Person)
}
# Error in paste("fitted.model", i, sep = "") <- lme(log(Y.to.use+1) ~ x1*x2 :
#   target of assignment expands to non-language object
I then want to create a table with the coefficients from each Yvar's fitted model.
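Both errors come from trying to use paste() on the left-hand side of `$` and `<-`: use `[[` to extract a column by a constructed name, and store the models in a list. A minimal sketch of the pattern, with three toy Yvars standing in for 1,000 and lm() standing in for nlme::lme() (the Exp/Person grouping columns aren't shown in the question, so the random-effects part is omitted here; the looping pattern is identical):

```r
set.seed(1)
# Toy data: 3 response columns stand in for the 1,000 Yvars (assumption for
# illustration only)
example <- data.frame(Yvar1 = rpois(20, 5),
                      Yvar2 = rpois(20, 5),
                      Yvar3 = rpois(20, 5),
                      x1 = rnorm(20),
                      x2 = rnorm(20))

fitted.models <- list()
for (i in 1:3) {
  # Extract a column by a constructed name with [[, not $
  example$Y.to.use <- example[[paste0("Yvar", i)]]
  # Store each model in a named list instead of assigning to a pasted name
  fitted.models[[paste0("fitted.model", i)]] <-
    lm(log(Y.to.use + 1) ~ x1 * x2, data = example)
}

# One row of coefficients per response variable
coef_table <- t(sapply(fitted.models, coef))
```

With the real data, replace the lm() call with the lme() call from the question; the list/`[[` pattern is unchanged.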
I have a two-level dataset of 37,000 instances, representing the choices of 199 subjects. I have to estimate logistic-regression coefficients for each of the 199 individuals. I have done this manually 199 times by subsetting, but I want to know whether there is a more efficient way of getting the coefficients by looping, without using the lme4 package. I also need to store the coefficients as variables within each subject.
Here is my code.
### Split of the dataset in each subject ID
mylist <- split(df_merged2, df_merged2$sjind)
### Indication of subject 1 in the first subsetting
df1 <- mylist[[1]]
### Logistic regression
glm1 <- glm(rep ~ reward_v.2 + trans_v.2 + reward_transition, data = df1,
            family = binomial)  # family = binomial is needed for logistic regression
### Extracting the coefficients
reward_transition <- coef(glm1)[4]
reward <- coef(glm1)[2]
transition <- coef(glm1)[3]
reward <- as.numeric(reward)
reward_transition <- as.numeric(reward_transition)
transition <- as.numeric(transition)
omega <- reward_transition - reward
### Computing the constant coefficients as variables
df1$rewardmix <- 1
df1$rewardmix <- reward
df1$omega <- 1
df1$omega <- omega
df1$transmix <- 1
df1$transmix <- transition
df1$reward_transitionmix <- reward_transition
You can use the by() function from the base package, whose short description is "Apply a Function to a Data Frame Split by Factors" (ref: help(by))
Here is an example using your terminology for the data frame and the subject ID variable names:
# Make the simulated data reproducible
set.seed(1717)
# The IDs can be sorted in any order
ids = c('A','B','B','A','A','B','B','B','C','C','C','B','C')
# Sample data frame with: subject ID, target variable (y), input variable (x)
df_merged2 = data.frame(sjind=ids,
y=rnorm(length(ids)),
x=rnorm(length(ids)))
head(df_merged2)
The top 6 rows of the data look like:
sjind y x
1 A -1.4548934 1.1004932
2 B -1.7084245 -0.7731208
3 B 2.1004557 -1.6229203
4 A -1.0283021 0.4233806
5 A 0.4133888 1.2398577
6 B -1.4104637 0.3746706
Now use the by() function to fit a GLM model for each group defined by the sjind unique values:
glm_by_sjind = by(df_merged2, as.factor(df_merged2$sjind),
function(df) glm(y ~ x, data=df))
The output object glm_by_sjind is a list with the following properties:
It has as many elements as the number of unique values in sjind (in this case 3)
It is indexed by the unique values of the sjind variable (in this case "A", "B", "C")
Each element contains the regression output from glm() run on each split of the input data frame (where splits are clearly defined by the sjind unique values)
So for example, you can request the summary of the regression output for subject "B" as follows:
> summary(glm_by_sjind[["B"]])
Call:
glm(formula = y ~ x, data = df)
Deviance Residuals:
2 3 6 7 8 12
-1.40226 1.59040 -0.00186 0.06400 -1.93118 1.68091
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.0487 0.7472 -1.404 0.233
x -0.9605 0.9170 -1.047 0.354
(Dispersion parameter for gaussian family taken to be 2.763681)
Null deviance: 14.087 on 5 degrees of freedom
Residual deviance: 11.055 on 4 degrees of freedom
AIC: 26.694
Number of Fisher Scoring iterations: 2
If we go a little further, we can also perform a sanity check that each GLM model is based on the expected number of cases (i.e. the number of cases in each model should be equal to the frequency distribution of the sjind variable in the input data frame).
freq_sjind_in_data = as.list( table(df_merged2$sjind) )
ncases_in_each_glm = lapply( glm_by_sjind, function(glm) NROW(glm$data) )
all.equal( freq_sjind_in_data,
ncases_in_each_glm )
which returns TRUE.
Or also inspect that visually:
as.data.frame(freq_sjind_in_data)
as.data.frame(ncases_in_each_glm)
which return
A B C
1 3 6 4
in both cases.
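To get the per-subject coefficients the question actually asks for, the list returned by by() can be collapsed into one row per subject and merged back onto the data. A sketch, reusing the simulated df_merged2 from above (the column names `intercept`/`slope` are chosen here to avoid clashing with the existing `x` column):

```r
set.seed(1717)
ids <- c('A','B','B','A','A','B','B','B','C','C','C','B','C')
df_merged2 <- data.frame(sjind = ids,
                         y = rnorm(length(ids)),
                         x = rnorm(length(ids)))

glm_by_sjind <- by(df_merged2, as.factor(df_merged2$sjind),
                   function(df) glm(y ~ x, data = df))

# One row of coefficients per subject
coef_mat <- do.call(rbind, lapply(glm_by_sjind, coef))
colnames(coef_mat) <- c("intercept", "slope")  # rename to avoid clashing with x
coef_df <- data.frame(sjind = rownames(coef_mat), coef_mat, row.names = NULL)

# Attach each subject's coefficients to every one of that subject's rows
df_out <- merge(df_merged2, coef_df, by = "sjind")
```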
I have a logistic regression model in R, where all of the predictor variables are categorical rather than continuous (in addition to the response variable, which is also obviously categorical/binary).
When calling summary(model_name), is there a way to include a column representing the number of observations within each factor level?
If all your covariates are factors (not counting the intercept), this is fairly easy, as the model matrix contains only 0s and 1s, and the number of 1s in a column gives the number of occurrences of that factor level (or interaction level) in your data. So just do colSums(model.matrix(your_glm_model_object)).
Since a model matrix has column names, colSums will give you a vector with "names" attribute, that is consistent with the "names" field of coef(your_glm_model_object).
The same solution applies to a linear model (by lm) and a generalized linear model (by glm) for any distribution family.
Here is a quick example:
set.seed(0)
f1 <- sample(gl(2, 50)) ## a factor with 2 levels, each with 50 observations
f2 <- sample(gl(4, 25)) ## a factor with 4 levels, each with 25 observations
y <- rnorm(100)
fit <- glm(y ~ f1 * f2) ## or use `lm`, as we use the `gaussian()` family object here
colSums(model.matrix(fit))
#(Intercept) f12 f22 f23 f24 f12:f22
# 100 50 25 25 25 12
# f12:f23 f12:f24
# 12 14
Here, we have 100 observations / complete-cases (indicated under (Intercept)).
Is there a way to display the count for the baseline level of each factor?
Baseline levels are contrasted, so they don't appear in the model matrix used for fitting. However, we can generate the full model matrix (without contrasts) from your formula, not your fitted model (this also offers you a way to drop numeric variables if you have them in your model):
SET_CONTRAST <- list(f1 = contr.treatment(nlevels(f1), contrasts = FALSE),
                     f2 = contr.treatment(nlevels(f2), contrasts = FALSE))
X <- model.matrix(~ f1 * f2, contrasts.arg = SET_CONTRAST)
colSums(X)
#(Intercept) f11 f12 f21 f22 f23
# 100 50 50 25 25 25
# f24 f11:f21 f12:f21 f11:f22 f12:f22 f11:f23
# 25 13 12 13 12 13
# f12:f23 f11:f24 f12:f24
# 12 11 14
Note that setting contrasts can quickly become tedious when you have many factor variables.
model.matrix is definitely not the only approach for this. The conventional way may be
table(f1)
table(f2)
table(f1, f2)
but this, too, can get tedious when your model becomes complicated.
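The contrasts list can also be built programmatically when there are many factors. factor_counts below is a hypothetical helper (not part of any package) that generates contr.treatment(..., contrasts = FALSE) for every factor column and returns the level counts in one call:

```r
set.seed(0)
d <- data.frame(f1 = sample(gl(2, 50)),   # 2 levels, 50 observations each
                f2 = sample(gl(4, 25)))   # 4 levels, 25 observations each

# Hypothetical helper: count occurrences of every factor level (and
# interaction level) via the full, uncontrasted model matrix
factor_counts <- function(formula, data) {
  facs <- Filter(is.factor, data)
  ctr <- lapply(facs, function(f) contr.treatment(nlevels(f), contrasts = FALSE))
  colSums(model.matrix(formula, data, contrasts.arg = ctr))
}

factor_counts(~ f1 * f2, d)
```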
I fit my data.frame, g:

   Day       V
1   13  211.45
2   15  274.40
3   18  381.15
4   21  499.80
5   26  614.65
6   29  723.75
7   33  931.70
8   36  996.35
9   40 1037.40
10  43 1277.75
by using the following steps
fit <- glm(V ~ Day, data = g, family = gaussian(link = "log"))
pred <- predict(fit, type = "response")
# Get intercept from glm
intercept <- fit$coefficients[[1]]
# Get slope from glm
slope <- fit$coefficients[[2]]
Now, I want to calculate the Day for a particular value of V, say V = 800.0 based on the fit. Although I know the fit coefficients, I cannot construct a formula to calculate this.
Let's assume this exponential data can be represented by a formula like

V = V0 * exp(m * t)

where

V0 is the starting value
m is the growth rate (the slope of the fit on the log scale)

Since the glm uses a log link, it fits log(E[V]) = intercept + slope * Day, with intercept = log(V0) and slope = m; then the Day for a particular V = 800.0 can be calculated by

t(V) = (log(V) - intercept) / slope
I do not know what formula the predict function uses to calculate predicted values from the fit. I tried the following:
new <- data.frame(V = c(800.00))
p <- predict(fit, newdata = new, type = "response")
I get the following error
Error in eval(predvars, data, env) : object 'Day' not found
This is because this function is intended to calculate V from a new Day, not the other way around.
But how can I do this using R?
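Since family = gaussian(link = "log") fits log(E[V]) = intercept + slope * Day, the fit can be inverted algebraically rather than through predict(). A sketch using the data above:

```r
g <- data.frame(Day = c(13, 15, 18, 21, 26, 29, 33, 36, 40, 43),
                V = c(211.45, 274.40, 381.15, 499.80, 614.65,
                      723.75, 931.70, 996.35, 1037.40, 1277.75))

fit <- glm(V ~ Day, data = g, family = gaussian(link = "log"))
intercept <- coef(fit)[[1]]
slope <- coef(fit)[[2]]

# The model is log(E[V]) = intercept + slope * Day, so invert it:
day_for_V <- function(V) (log(V) - intercept) / slope
day_for_V(800)
```

Plugging the result back through predict(fit, newdata = data.frame(Day = day_for_V(800)), type = "response") recovers 800, confirming the inversion.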
I am using the deltamethod function from package msm to derive the standard error of a transformed variable.
Example code:
require(msm)
x1 <- 1:10
x2 <- c(0,0,0,0,0,0,0,0,0,0)
y <- c(1,3,3,4,5,7,7,8,9,10)
m1 <- lm(y~x1+x2)
summary(m1)
deltamethod(~ (1-x1), coef(m1), vcov(m1))
The error I get is "Covariances should be a 3x3 matrix". The reason is that one variable has no variation (x2 is always zero) and shows up as NA in the regression output.
Is there an easy fix for this? I know I could leave the variable out, but I am running more than 1,000 regressions, each with around 15 parameters to estimate, and the NA variables (those without variation) are different variables every time.
How about:
deltamethod(~(1-x1), na.omit(coef(m1)), vcov(m1))
# [1] 0.2949063
I used a generalized linear model with multiple variables in R. My data (young) looks like the sample below: I have 5 DVs (dv1, dv2, dv3, dv4, dv5) and three IVs (IV1, IV2, IV3) in a data frame. I keep getting the error below. Can someone please tell me what I am doing wrong?
> head(young)
IV1 IV2 IV3 DVS
1 18 1 1 dv1
2 20 1 1 dv1
3 21 2 1 dv1
4 21 1 2 dv1
5 22 1 1 dv1
6 22 1 1 dv1
> models <- list()
> dvnames <- paste("DVS", 1:5, sep='')
> ivnames <- paste("IV", 1:3, sep='') ## for some value of 3
> for (y in dvnames){
+ form <- formula(paste(y, "~", ivnames))
+ models[[y]] <- glm(form, data=young, family='poisson') }
**Error in eval(expr, envir, enclos) : object 'DVS1' not found**
It is easy to see why you get the error. The first time the loop iterates, y takes the value DVS1. In a model formula, R will look for a variable in young with the name DVS1. As you have shown us, there is no variable (column) with that name (nor, presumably, an object with that name within the scope of glm()), and hence
**Error in eval(expr, envir, enclos) : object 'DVS1' not found**
Which is quite correct.
Now the more important question becomes, what are you trying to do? You seem to be fitting a Poisson model but you claim the response variables are in a single column, DVS, which R will be treating as a factor variable. Where are the count data that you wish to model as some function of IV1, IV2, and IV3?
R expects to be supplied something it can interpret as a numeric count for the left-hand side of the formula (where you are putting y).
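For what it's worth, if the counts did live in separate numeric columns DVS1..DVS5, the loop would work once the formula is built correctly: note that paste(y, "~", ivnames) produces a length-3 character vector (one element per IV), a second bug in the original loop. reformulate() builds the right-hand side from all IV names at once. A sketch on simulated data (an assumption: the question's single-column data would need reshaping into this layout first):

```r
set.seed(42)
# Simulated data with count responses in separate columns
young <- data.frame(DVS1 = rpois(30, 3), DVS2 = rpois(30, 3),
                    DVS3 = rpois(30, 3), DVS4 = rpois(30, 3),
                    DVS5 = rpois(30, 3),
                    IV1 = rnorm(30), IV2 = rnorm(30), IV3 = rnorm(30))

models <- list()
dvnames <- paste0("DVS", 1:5)
ivnames <- paste0("IV", 1:3)
for (y in dvnames) {
  # reformulate() builds e.g. DVS1 ~ IV1 + IV2 + IV3 from the name vectors;
  # paste(y, "~", ivnames) would give a length-3 vector instead
  form <- reformulate(ivnames, response = y)
  models[[y]] <- glm(form, data = young, family = poisson)
}
```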