I have 100 groups with 40 observations each. I know there might be other appropriate models, but I am currently only interested in the following.
reg <- lmList(Y ~ Intercept + a + b + c + d | grp, data = data, pool = FALSE)
In order to save the residuals (from lmList) to my table I just do
data$residual <- residuals(reg)
Now I want to save the beta coefficients to the original table as well. Since each group has a single set of coefficients, the values should be the same within a group but different across groups.
coef <- coef(reg) gives me the group coefficients. However, R shows this as a table with only 5 columns; the group names appear only as row names.
data$coef <- coef[, 1] gives me the intercepts, but I lose the group information.
I was thinking about creating a separate table with a group identifier, so that I could just merge the original table with the coefficient table, but I could not figure out how to get the group identifier alongside the coefficients.
If there is an easier way to do this please help.
If fm1 is your fitted lmList object then I think simply making the row names into an additional column should do what you want:
library(nlme)
fm1 <- lmList(distance ~ age | Subject, Orthodont)
res <- data.frame(Subject=rownames(coef(fm1)),coef(fm1),check.names=FALSE)
rownames(res) <- NULL ## now redundant
head(res)
Subject (Intercept) age
1 M16 16.95 0.550
2 M05 13.65 0.850
...
Then you should be able to merge() this with your original data (although merging with the Orthodont object itself seems problematic - I think this may be because it's a weird groupedData object ...)
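Applied to the original question, a minimal sketch of the merge step (assuming your data frame has a grp column matching the grouping factor used in lmList; the beta_ prefix avoids clashing with the predictor columns a, b, c, d already in the data):
## build a coefficient table keyed by group, then merge it back
coef_tab <- coef(reg)
names(coef_tab) <- paste0("beta_", names(coef_tab))
coef_tab$grp <- rownames(coef_tab)
data <- merge(data, coef_tab, by = "grp")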
I have a 2-level dataset of 37,000 instances, representing the choices of 199 subjects. I have to estimate logistic regression coefficients for each of the 199 individuals. I have done this manually 199 times by subsetting, but I want to know whether there is a more efficient way of getting the coefficients by looping, without using the lme4 package. I also need to store the coefficients as variables within each subject's rows.
Here is my code.
### Split of the dataset in each subject ID
mylist <- split(df_merged2, df_merged2$sjind)
### Indication of subject 1 in the first subsetting
df1 <- mylist[[1]]
### Logistic regression
glm1 <- glm(rep ~ reward_v.2 + trans_v.2 + reward_transition, family = binomial, data = df1) # family = binomial for logistic regression
### Extracting the coefficients
reward_transition <- coef(glm1)[4]
reward <- coef(glm1)[2]
transition <- coef(glm1)[3]
reward<- as.numeric(reward)
reward_transition <- as.numeric(reward_transition)
transition <- as.numeric(transition)
omega <- reward_transition - reward
### Computing the constant coefficients as variables
df1$rewardmix <- reward
df1$omega <- omega
df1$transmix <- transition
df1$reward_transitionmix <- reward_transition
You can use the by() function from the base package, whose short description is "Apply a Function to a Data Frame Split by Factors" (ref: help(by))
Here is an example using your terminology for the data frame and the subject ID variable names:
# Make the simulated data reproducible
set.seed(1717)
# The IDs can be sorted in any order
ids = c('A','B','B','A','A','B','B','B','C','C','C','B','C')
# Sample data frame with: subject ID, target variable (y), input variable (x)
df_merged2 = data.frame(sjind=ids,
y=rnorm(length(ids)),
x=rnorm(length(ids)))
head(df_merged2)
The top 6 rows of the data look like:
sjind y x
1 A -1.4548934 1.1004932
2 B -1.7084245 -0.7731208
3 B 2.1004557 -1.6229203
4 A -1.0283021 0.4233806
5 A 0.4133888 1.2398577
6 B -1.4104637 0.3746706
Now use the by() function to fit a GLM model for each group defined by the sjind unique values:
glm_by_sjind = by(df_merged2, as.factor(df_merged2$sjind),
function(df) glm(y ~ x, data=df))
The output object glm_by_sjind is a list with the following properties:
It has as many elements as the number of unique values in sjind (in this case 3)
It is indexed by the unique values of the sjind variable (in this case "A", "B", "C")
Each element contains the regression output from glm() run on each split of the input data frame (where splits are clearly defined by the sjind unique values)
So for example, you can request the summary of the regression output for subject "B" as follows:
> summary(glm_by_sjind[["B"]])
Call:
glm(formula = y ~ x, data = df)
Deviance Residuals:
2 3 6 7 8 12
-1.40226 1.59040 -0.00186 0.06400 -1.93118 1.68091
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.0487 0.7472 -1.404 0.233
x -0.9605 0.9170 -1.047 0.354
(Dispersion parameter for gaussian family taken to be 2.763681)
Null deviance: 14.087 on 5 degrees of freedom
Residual deviance: 11.055 on 4 degrees of freedom
AIC: 26.694
Number of Fisher Scoring iterations: 2
If we go a little further, we can also perform a sanity check that each GLM model is based on the expected number of cases (i.e. the number of cases in each model should be equal to the frequency distribution of the sjind variable in the input data frame).
freq_sjind_in_data = as.list( table(df_merged2$sjind) )
ncases_in_each_glm = lapply( glm_by_sjind, function(glm) NROW(glm$data) )
all.equal( freq_sjind_in_data,
ncases_in_each_glm )
which returns TRUE.
Or inspect this visually:
as.data.frame(freq_sjind_in_data)
as.data.frame(ncases_in_each_glm)
which return
A B C
1 3 6 4
in both cases.
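To also store the coefficients as variables within each subject (the last part of the question), one option is to collapse the fitted models into a coefficient table and merge() it back onto the original rows; a minimal sketch using the objects above:
# one row of coefficients per subject, with the subject ID as a column
coef_df <- do.call(rbind, lapply(glm_by_sjind, coef))
coef_df <- data.frame(sjind = rownames(coef_df), coef_df, check.names = FALSE)
rownames(coef_df) <- NULL
# merge the per-subject coefficients back onto the original data
df_merged2 <- merge(df_merged2, coef_df, by = "sjind")
head(df_merged2)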
How can I calculate the estimate of factor loading for latent variables between different data frames?
I have 3 data frames related to 1 latent variable (the same indicator variables A, B, C, D) but covering different intervals.
dataframe1 (100-120 days). (1550 records)
dataframe2 (120-150 days). (1780 records)
dataframe3 (180-250 days). (1670 records)
library(lavaan)
model1 <- 'latent_variable1 =~ A + B + C + D
A~~B'
output1 <- cfa(model1, data=dataframe1, std.lv=TRUE)
output2 <- cfa(model1, data=dataframe2, std.lv=TRUE)
output3 <- cfa(model1, data=dataframe3, std.lv=TRUE)
Now I would like to calculate the difference in the estimates between the 3 latent variables using SEM.
Could someone help me?
example:
model<- 'latent_variable1 ~ latent_variable2 ~ latent_variable3'
output4<-sem(model, dataframe =????, std.lv=TRUE)
If I understood correctly, your aim is mainly to test the differences in the factor loadings across these 3 datasets. In fact, you do not have 3 latent variables but just one, applied in 3 different contexts.
This is a specific type of measurement invariance assessment for a latent variable.
So, what you need is:
merge the datasets into just one dataset, with an indication of their origin. Let's call it group:
dataframe1$group <- "G1"
dataframe2$group <- "G2"
dataframe3$group <- "G3"
df <- rbind(dataframe1, dataframe2, dataframe3)
run a multi-group assessment, using this group variable
output1 <- cfa(model1, data=df, std.lv=TRUE, group="group")
This will return 3 sets of estimates, one per group, similar to fitting the model on the 3 different data frames separately.
You can now constrain the loadings to be equal:
output2 <- cfa(model1, data=df, std.lv=TRUE, group="group",group.equal="loadings")
and we can assess the significance of the differences between them with a likelihood-ratio test:
anova(output1, output2)
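If the test is significant and you want to locate which loading differs, lavaan's group.partial argument frees selected parameters in the otherwise-constrained model. A sketch, where freeing B's loading is just an illustrative choice:
# free only B's loading across groups; all other loadings stay constrained
output2b <- cfa(model1, data=df, std.lv=TRUE, group="group",
                group.equal="loadings",
                group.partial=c("latent_variable1 =~ B"))
anova(output2b, output2)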
I have defined a binary response mixed effects model using the R function glmer as follows:
fit <- glmer(binary_r ~ cat1 + (1 | SUBJECTIDf) + (1 | cat2) + (1 | cat1:cat2),
             family = binomial("logit"), data = mydata)
where cat1 and cat2 are categorical variables and SUBJECTIDf denotes the factor variable tagging the individual subjects of the study. Also, SUBJECTIDf and cat2 are cross-classified factors.
I would like to do the following for the above model:
1. create a table which presents the probability of a positive response for each combination of categories pertaining to cat1 and cat2;
2. create a plot (possibly a caterpillar plot) which displays the probabilities defined under 1.;
3. create summary statistics for the probabilities defined under 1., including the minimum and maximum probabilities, across all combinations of categories pertaining to cat1 and cat2.
I am assuming that, in order to achieve the above outcomes, the individual probabilities defined under 1. would be formed as averages or medians across all subjects for a given combination of cat1 and cat2.
I would be most grateful for advice on how to proceed.
Since you didn't give a reproducible example I'm going to simulate one ... this part of the answer is only setting up an example data set.
## crossed cat2 and SUBJECTIDf
dd <- expand.grid(cat2=factor(letters[1:10]),
SUBJECTIDf=factor(1:10))
## each subject gets one value of cat1:
## for example, half get A while half get B
cat_tab <- data.frame(SUBJECTIDf=factor(1:10),cat1=rep(c("A","B"),5))
dd <- merge(dd,cat_tab)
I'm having a little trouble accommodating the design consideration that cat1 and cat2 are not cross-classified. I'll just knock out some categories:
dd <- with(dd,dd[!(cat1=="A" & cat2 %in% c("a","b","c")),])
Now we have the design set up, we simulate response values:
library(lme4)
form <- binary_r ~ cat1 + (1 | SUBJECTIDf) + (1 | cat2) + (1 | cat1:cat2)
dd$binary_r <- simulate(form[-2], ## RHS only
family=binomial,
newdata=dd,
newparams=list(beta=0:1,
theta=c(2,4,1)),
seed=101)[[1]]
At this point we come in with the model fit you've suggested above.
fit <- glmer(form, family = binomial, data=dd)
create a table which presents the probability of a positive response for each combination of categories pertaining to cat1 and cat2;
By setting re.form below to exclude subject ID, we are implicitly computing the value for the hypothetical median individual (i.e., random effect set to zero; mean and median predictions coincide on the logit scale, but not once we back-transform to the probability scale).
Get unique combinations of cat1 and cat2 found in the data:
newdd <- unique(dd[,c("cat1","cat2")])
newdd$SUBJECTIDf <- NA ## need to have SUBJECTIDf in the data frame ...
t1 <- predict(fit,newdata=newdd,type="response",
re.form=~(1|cat2)+(1|cat1:cat2))
newdd <- data.frame(newdd[,c("cat1","cat2")],pred=t1)
head(newdd)
## cat1 cat2 pred
## 4 A d 0.215336024
## 5 A e 0.944897414
## 6 A f 0.036751551
## 7 A g 0.003819873
## 8 A h 0.970115614
## 9 A i 0.003819873
We could also compute predictions for all individuals as follows:
## we happen to have a factorial design, but expand.grid() would
## e.g. fill in missing values
newdd2 <- unique(dd[,c("cat1","cat2","SUBJECTIDf")])
t2 <- predict(fit,newdata=newdd2,type="response",
re.form=NULL)
newdd2$pred <- t2
head(newdd2)
In order to create the plot (see below) we would have to summarize across individuals within category combinations.
create a plot (possibly a caterpillar plot) which displays the probabilities defined under 1.;
library(ggplot2); theme_set(theme_bw())
ggplot(newdd,aes(cat2,pred,colour=cat1))+
geom_point()+scale_colour_brewer(palette="Set1")
Or, aggregating the subject-specific predictions:
ggplot(newdd2,aes(cat2,pred,colour=cat1))+
stat_summary(fun = mean, geom = "point")+
scale_colour_brewer(palette="Set1")
We could use reorder() on the cat2 categories to get a more sensible ordering, but since there is a cat1:cat2 interaction, that might not work too well. Caterpillar plots (i.e. getting uncertainties on the predictions) are a little trickier, because of the difficulty of getting uncertainty on predictions that combine uncertainty in the conditional modes (values of individual random effects) and the fixed effects. This can be done by (1) assuming conditional modes and fixed effects are independent or (2) parametric bootstrapping (bootMer), but both are a little more trouble than I'm willing to take at the moment ...
create summary statistics for the probabilities defined under 1., including the minimum and maximum probabilities, across all combinations of categories pertaining to cat1 and cat2.
This doesn't really make sense to me unless we are going the disaggregated route. If we have predicted for each combination of cat1 and cat2, then we only have a single value for each combination (i.e., no "min/max" probabilities). Aggregating is easy in base R, e.g.
aggregate(pred~cat1:cat2,data=newdd2,
FUN=function(x) c(min=min(x),max=max(x)))
or in the tidyverse:
library(dplyr)
newdd2 %>% group_by(cat1,cat2) %>%
summarise(min=min(pred),max=max(pred))
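For the overall summary across combinations (the literal reading of point 3), the aggregated predictions in newdd can be summarized directly, e.g.:
## minimum, maximum, etc. of the predicted probabilities across all cat1/cat2 combinations
summary(newdd$pred)
range(newdd$pred)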
I have figured out how to make a table in R with 4 variables, which I am using for multiple linear regressions. The dependent variable (Lung) for each regression is taken from one column of a csv table of 22,000 columns. One of the independent variables (Blood) is taken from a corresponding column of a similar table.
Each column represents the levels of a particular gene, which is why there are so many of them. There are also two additional variables (Age and Gender of each patient). When I enter the linear regression equation, I use lm(Lung[,1] ~ Blood[,1] + Age + Gender), which works for one gene.
I am looking for a way to input this equation and have R calculate all of the remaining columns for Lung and Blood, and hopefully output the coefficients into a table.
Any help would be appreciated!
You want to run 22,000 linear regressions and extract the coefficients? That's simple to do from a coding standpoint.
set.seed(1)
# number of columns in the Lung and Blood data.frames. 22,000 for you?
n <- 5
# dummy data
obs <- 50 # observations
Lung <- data.frame(matrix(rnorm(obs*n), ncol=n))
Blood <- data.frame(matrix(rnorm(obs*n), ncol=n))
Age <- sample(20:80, obs)
Gender <- factor(rbinom(obs, 1, .5))
# run n regressions
my_lms <- lapply(1:n, function(x) lm(Lung[,x] ~ Blood[,x] + Age + Gender))
# extract just coefficients
sapply(my_lms, coef)
# if you need more info, call summary() on each model; then you can extract whatever you need, like:
summaries <- lapply(my_lms, summary)
# ...coefficients with p-values:
lapply(summaries, function(x) x$coefficients[, c(1,4)])
# ...or r-squared values
sapply(summaries, function(x) c(r_sq = x$r.squared,
adj_r_sq = x$adj.r.squared))
The models are stored in a list, where model 3 (with DV Lung[, 3] and IVs Blood[,3] + Age + Gender) is in my_lms[[3]] and so on. You can use apply functions on the list to perform summaries, from which you can extract the numbers you want.
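If you want the coefficients in a table keyed by gene (assuming, as a guess, that the column names of Lung are your gene identifiers), the coefficient matrix can be transposed into a data frame and exported; a sketch:
# rows = models/genes, columns = regression coefficients
coef_table <- as.data.frame(t(sapply(my_lms, coef)))
rownames(coef_table) <- colnames(Lung)  # assumes Lung's columns are named by gene
head(coef_table)
# write.csv(coef_table, "lung_blood_coefficients.csv")  # optional export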
The question seems to be about how to call regression functions with formulas which are modified inside a loop.
Here is how you can do it (using the diamonds dataset):
strCols <- names(ggplot2::diamonds)
formula <- list(); model <- list()
for (i in 1:1) {  # widen the range (e.g. 1:3) to loop over more predictors
  formula[[i]] <- paste0(strCols[7], " ~ ", strCols[7+i])
  model[[i]] <- glm(as.formula(formula[[i]]), data = ggplot2::diamonds)
  # then you can plot or do anything else with the result ...
  png(filename = sprintf("diamonds_price=glm(%s).png", strCols[7+i]))
  par(mfrow = c(2, 2))
  plot(model[[i]])
  dev.off()
}
Sensible or not, to make the loop at least somehow work you need:
y  <- c(1,5,6,2,5,10)   # response
x1 <- c(2,12,8,1,16,17) # predictor
x2 <- c(2,14,5,1,17,17)
df <- data.frame(y, x1, x2) # the loop below needs the variables in a data frame
predictorlist <- list("x1","x2")
for (i in predictorlist){
  model <- lm(as.formula(paste("y ~", i)), data=df)
  print(summary(model))
}
The paste function will solve the problem.
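As an aside, base R's reformulate() builds the same formulas without string pasting; an equivalent sketch using the objects above:
for (i in predictorlist){
  # reformulate("x1", response = "y") returns the formula y ~ x1
  model <- lm(reformulate(i, response = "y"), data=df)
  print(summary(model))
}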
A tidyverse addition - with map2()
Another way - using map2() from the purrr package:
library(purrr)
xs <- anscombe[,1:3] # Select variables of interest
ys <- anscombe[,5:7]
map2_df(ys, xs,
        function(i, j){
          m <- lm(i ~ j + x4, data = anscombe)
          coef(m)
        })
The output is a dataframe (tibble) of all coefficients:
`(Intercept)` j x4
1 4.33 0.451 -0.0987
2 6.42 0.373 -0.253
3 2.30 0.526 0.0518
If more variables are changing, this can be done using the pmap() functions, as sketched below.
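For example, a sketch with pmap_df(), where a third set of predictor columns is added (here just x4 repeated three times, purely for illustration):
zs <- anscombe[, c(4, 4, 4)]  # illustrative third input: three copies of x4
# pmap() walks the columns of the supplied data frames in parallel
pmap_df(list(ys, xs, zs),
        function(y, x, z) coef(lm(y ~ x + z)))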
I have a dataframe with columns for time, symbol, price, and volatility. I use this dataframe to run a first-pass OLS regression using dummy variables for the symbol:
fit <- lm(volatility ~ factor(symbol) + 0)
Then I want to use the coefficients from that regression in a second-pass regression, so I save the coefficients of the regression to reuse, and then I want to use them to scale volatility:
scale <- summary(fit)$coefficients[,1]
yscale <- volatility/scale
fit2 <- lm(yscale ~ factor(time) + factor(symbol)*factor(time) + 0)
The problem I am having is that I want to use the factor coefficients that are applicable to each symbol. So in the original dataframe I want to divide the volatility by the coefficient that matches its symbol. That is, if I have symbols DDX, CTY, and LOL, then I want to divide DDX's volatility by the coefficient for factor level DDX from the regression, then do the same for CTY and LOL.
Also, I need to figure out how to handle the interaction (product) term in the second regression, fit2.
You should provide a reproducible example to get an exact answer. Here is some data:
dat <- data.frame(volatility= rnorm(30),
symbol = sample(c('DDX', 'CTY', 'LOL'),30,rep=TRUE))
fit <- lm(volatility~factor(symbol) + 0,data=dat)
mm <- coef(fit)
names(mm) <- gsub('factor\\(symbol\\)','',names(mm))
I transform the names to get pretty names that can be used later:
CTY DDX LOL
-0.1991273 0.1331980 -0.1567511
Then, using transform(), I divide each volatility by the corresponding coefficient:
transform(dat,vol.scale = volatility/mm[symbol],coef = mm[symbol])
volatility symbol vol.scale coef
1 -0.592306253 DDX -4.44680974 0.1331980
2 1.143486046 DDX 8.58485769 0.1331980
3 -0.693694139 LOL 4.42544868 -0.1567511
4 -0.166050131 LOL 1.05932325 -0.1567511
5 1.381900588 CTY -6.93978353 -0.1991273
..............................
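The second pass (fit2) is not covered above; once the scaled response is stored in the data frame, the interaction model from the question can be fit directly. A sketch, assuming your real data also contains a time column (the simulated data above does not):
## index by name rather than by factor codes, to be safe
dat$vol.scale <- dat$volatility / mm[as.character(dat$symbol)]
## hypothetical: 'time' is assumed to exist in your real data
# fit2 <- lm(vol.scale ~ factor(time) + factor(symbol)*factor(time) + 0, data = dat)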