I need to apply a linear regression formula to every column in my data frame, mydf. There are thousands of columns in mydf, so listing each of them in the formula is not feasible. Two columns, weight and age, stay the same in the formula for every other column I want to fit.
The formula for the first column (column bmd) is
fit1 <- lm(bmd ~ weight + age, data=mydf)
mydf[,"bmd"] <- fit1$fitted.values
I want to apply this formula to the other columns as well (except weight and age):
fit1 <- lm(bp ~ weight + age, data=mydf)
mydf[,"bp"] <-fit1$fitted.values
and
fit1 <- lm(choles ~ weight + age, data=mydf)
mydf[,"choles"] <-fit1$fitted.values
What would be the most time-efficient way (it currently takes a really long time) to extend this formula and store the fitted.values in the right column for all the wanted columns (bmd, bp, choles)?
mydf
bmd bp choles weight age
1 2 3 22.3 12
2 1 2 33.2 13
3 2 5 44.5 16
Try this:
apply(mydf[ ,-c(4:5)], 2, function(x) lm(x ~ mydf$weight + mydf$age)$fitted.values)
EDIT: added missing comma
One method:
# get names of all dependent variables
dependents <- setdiff(names(mydf), c("weight", "age"))
# build a little function
myFit <- function(depName) {
fit1 <- lm(as.formula(paste(depName, "~ weight + age")), data=mydf)
return(fit1$fitted.values)
}
# sapply them
fittedValues <- sapply(dependents, myFit)
Then cbind to your dataset.
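Since the question asks about speed: lm() also accepts a matrix on the left-hand side and then fits every column against the same predictors in a single call. A minimal sketch, assuming all dependent columns are numeric and contain no missing values (with NAs, whole rows are dropped for every response, which differs from fitting column by column). The toy data and the names fit_all, fitted_mat and mydf_fitted are made up for illustration:

```r
# Toy data standing in for mydf (values are invented for illustration)
set.seed(42)
mydf <- data.frame(
  bmd    = rnorm(20),
  bp     = rnorm(20),
  choles = rnorm(20),
  weight = runif(20, 20, 50),
  age    = sample(10:20, 20, replace = TRUE)
)

# Every column except the two fixed predictors is a response
dependents <- setdiff(names(mydf), c("weight", "age"))

# A matrix response makes lm() fit all columns in one call
fit_all <- lm(as.matrix(mydf[dependents]) ~ weight + age, data = mydf)

# fitted() returns one column of fitted values per response
fitted_mat <- fitted(fit_all)

# Store the fitted values under the matching column names in a copy
mydf_fitted <- mydf
mydf_fitted[dependents] <- as.data.frame(fitted_mat)
```

The model matrix is decomposed only once here, which is where the time saving over thousands of separate lm() calls would be expected to come from.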
Related
I was wondering whether it is possible to compute fitted values for a sample of observations which is different from the subsample that has been used to perform a linear regression. In particular, I have a full dataframe of 400 individuals. I want to perform two separate OLS regressions, subsampling the dataframe according to the value of a dummy.
ols1<-lm(log_consumption ~ log_wage + Age + Age2 + Education, data=df, subset = type==1)
ols2<-lm(log_consumption ~ log_wage + Age + Age2 + Education, data=df, subset = type==0)
this code obviously returns me the two separate models and the corresponding fitted values. However I would like to get the fitted values of all my dataframe (i.e. the fitted values of all the 400 individuals) firstly according to model 1, then according to model 2. Basically I want to compare the fitted values for the entire dataframe exploiting the differences among the OLS coefficients that I get under the two different "regimes".
Is there a way to do this in R??
Thanks for your help,
Marco
It looks like you want to predict(). Try: predict(ols1, df) and predict(ols2, df). Here is an example using the iris data set.
## data
df <- iris
df$type <- rep(c(0, 1), 75) # 75 type 0 and 75 type 1
## models
ols1 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
data = df, subset = type == 1)
ols2 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
data = df, subset = type == 0)
## predicted values for all the 150 observations
# just for checking: fitted(ols1) and fitted(ols2) give the 75 fitted values
length(fitted(ols1))
length(fitted(ols2))
# here, we want predicted values instead of fitted values
# with the predict() function, we can obtain predicted values for all the 150 observations
predict(ols1, df)
predict(ols2, df)
# check: we have 150 observations
length(predict(ols1, df))
length(predict(ols2, df))
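To compare the two "regimes" row by row, the predictions can be stored next to each other in the data frame. A small sketch along the lines of the iris example above; the column names pred1, pred2 and pred_diff are made up for illustration:

```r
# Same setup as the iris example: an artificial 0/1 type indicator
df <- iris
df$type <- rep(c(0, 1), 75)

f <- Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width
ols1 <- lm(f, data = df, subset = type == 1)
ols2 <- lm(f, data = df, subset = type == 0)

# Predictions for all 150 rows under each model, kept side by side
df$pred1 <- predict(ols1, newdata = df)
df$pred2 <- predict(ols2, newdata = df)

# Row-wise difference between the two regimes
df$pred_diff <- df$pred1 - df$pred2
```

For the rows that were used to fit a model, the prediction is the same number as the fitted value; only the remaining rows are genuinely out-of-sample.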
I have 100 groups with 40 observations each. I know there might be other appropriate models but i am currently only interested in the following.
reg<-lmList(Y ~ Intercept + a + b + c + d | grp,data=data, pool=F)
In order to save the residuals (from lmList) to my table I just do
data$residual <- residuals(reg)
Now I want to save the beta coefficients to the original table as well. Since each group only has a single set of coefficients it should be same for the group but different across groups.
coef <- coef(reg) gives me a list with the group coefficients. However R shows this as a list with only 5 columns (excluding the group names).
data$coef<-coef[,c(1) ] gives me the intercepts but I lose the group information.
I was thinking about creating a separate table with group identification so that I just merge the original table and the coefficient tables. But could not figure out how to get the group identification along with the coefficients.
If there is an easier way to do this please help.
If fm1 is your fitted lmList object then I think simply making the row names into an additional column should do what you want:
library(nlme)
fm1 <- lmList(distance ~ age | Subject, Orthodont)
res <- data.frame(Subject=rownames(coef(fm1)),coef(fm1),check.names=FALSE)
rownames(res) <- NULL ## now redundant
head(res)
Subject (Intercept) age
1 M16 16.95 0.550
2 M05 13.65 0.850
...
Then you should be able to merge() (although merging with the Orthodont object seems problematic - I think this may be because it's a weird groupedData object ...)
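One way to sidestep merge() altogether is to index the coefficient table by group name. A sketch under the assumption that the data are first converted to a plain data.frame (which drops the groupedData class); the names dat, cf, key, intercept and slope_age are made up for illustration:

```r
library(nlme)

# Plain data.frame copy of the example data
dat <- as.data.frame(Orthodont)

fm1 <- lmList(distance ~ age | Subject, data = dat)

# Coefficients as a numeric matrix; row names are the group labels
cf <- as.matrix(coef(fm1))

# Look up each row's group in the coefficient matrix
key <- as.character(dat$Subject)
dat$intercept <- unname(cf[key, "(Intercept)"])
dat$slope_age <- unname(cf[key, "age"])

head(dat)
```

Every row of a group then carries that group's coefficients, so they are constant within a group and differ across groups.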
I have figured out how to make a table in R with 4 variables, which I am using for multiple linear regressions. The dependent variable (Lung) for each regression is taken from one column of a csv table of 22,000 columns. One of the independent variables (Blood) is taken from a corresponding column of a similar table.
Each column represents the levels of a particular gene, which is why there are so many of them. There are also two additional variables (Age and Gender of each patient). When I enter in the linear regression equation, I use lm(Lung[,1] ~ Blood[,1] + Age + Gender), which works for one gene.
I am looking for a way to input this equation and have R calculate all of the remaining columns for Lung and Blood, and hopefully output the coefficients into a table.
Any help would be appreciated!
You want to run 22,000 linear regressions and extract the coefficients? That's simple to do from a coding standpoint.
set.seed(1)
# number of columns in the Lung and Blood data.frames. 22,000 for you?
n <- 5
# dummy data
obs <- 50 # observations
Lung <- data.frame(matrix(rnorm(obs*n), ncol=n))
Blood <- data.frame(matrix(rnorm(obs*n), ncol=n))
Age <- sample(20:80, obs)
Gender <- factor(rbinom(obs, 1, .5))
# run n regressions
my_lms <- lapply(1:n, function(x) lm(Lung[,x] ~ Blood[,x] + Age + Gender))
# extract just coefficients
sapply(my_lms, coef)
# if you need more info, get full summary call. now you can get whatever, like:
summaries <- lapply(my_lms, summary)
# ...coefficients with p values:
lapply(summaries, function(x) x$coefficients[, c(1,4)])
# ...or r-squared values
sapply(summaries, function(x) c(r_sq = x$r.squared,
adj_r_sq = x$adj.r.squared))
The models are stored in a list, where model 3 (with DV Lung[, 3] and IVs Blood[,3] + Age + Gender) is in my_lms[[3]] and so on. You can use apply functions on the list to perform summaries, from which you can extract the numbers you want.
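If the goal is a single table of coefficients with one row per gene, the result of sapply() can be transposed and labelled. A sketch on the same kind of dummy data as above; the column labels and the name coef_table are made up for illustration:

```r
set.seed(1)
n <- 5     # number of genes (columns); 22,000 in the real data
obs <- 50  # observations

# Dummy data in the same shape as in the answer above
Lung <- data.frame(matrix(rnorm(obs * n), ncol = n))
Blood <- data.frame(matrix(rnorm(obs * n), ncol = n))
Age <- sample(20:80, obs)
Gender <- factor(rbinom(obs, 1, 0.5))

# One regression per column; keep only the coefficients
coef_mat <- sapply(seq_len(n), function(x) {
  coef(lm(Lung[, x] ~ Blood[, x] + Age + Gender))
})

# Transpose so each row is a gene and each column a coefficient
coef_table <- as.data.frame(t(coef_mat))
names(coef_table) <- c("intercept", "blood", "age", "gender1")
rownames(coef_table) <- names(Lung)

coef_table
```

Keeping only the coefficients instead of the full lm objects also keeps memory use down when n is in the tens of thousands.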
The question seems to be about how to call regression functions with formulas which are modified inside a loop.
Here is how you can do it in R (using the diamonds dataset):
diamonds <- ggplot2::diamonds
strCols <- names(diamonds)
formula <- list(); model <- list()
for (i in 1:1) {
# build the formula from column names, e.g. price ~ x
formula[[i]] <- as.formula(paste0(strCols[7], " ~ ", strCols[7+i]))
model[[i]] <- glm(formula[[i]], data = diamonds)
#then you can plot or do anything else with the result ...
png(filename = sprintf("diamonds_price=glm(%s).png", strCols[7+i]))
par(mfrow = c(2, 2))
plot(model[[i]])
dev.off()
}
Sensible or not, to make the loop work at all you need:
y <- c(1,5,6,2,5,10) # response
x1 <- c(2,12,8,1,16,17) # predictor
x2 <- c(2,14,5,1,17,17)
df <- data.frame(y, x1, x2) # the data frame that lm() is pointed at
predictorlist <- list("x1","x2")
for (i in predictorlist){
model <- lm(as.formula(paste("y ~", i[[1]])), data=df)
print(summary(model))
}
The paste function will solve the problem.
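As an alternative to pasting strings, base R's reformulate() builds the formula object directly from variable names. A minimal sketch with the same toy numbers; the names predictors, models and slopes are made up for illustration:

```r
# Same toy data as above, collected in a data frame
df <- data.frame(
  y  = c(1, 5, 6, 2, 5, 10),
  x1 = c(2, 12, 8, 1, 16, 17),
  x2 = c(2, 14, 5, 1, 17, 17)
)

predictors <- c("x1", "x2")

# reformulate("x1", response = "y") yields the formula y ~ x1
models <- lapply(predictors, function(p) {
  lm(reformulate(p, response = "y"), data = df)
})
names(models) <- predictors

# Slope of each single-predictor model
slopes <- sapply(models, function(m) coef(m)[[2]])
slopes
```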
A tidyverse addition - with map()
Another way - using map2() from the purrr package:
library(purrr)
xs <- anscombe[,1:3] # Select variables of interest
ys <- anscombe[,5:7]
map2_df(ys, xs,
function(i,j){
m <- lm(i ~j + x4 , data = anscombe)
coef(m)
})
The output is a dataframe (tibble) of all coefficients:
`(Intercept)` j x4
1 4.33 0.451 -0.0987
2 6.42 0.373 -0.253
3 2.30 0.526 0.0518
If more than two variables change from model to model, the same can be done with the pmap() family of functions.
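The same idea with three changing inputs can be sketched without extra packages using base R's Map(), which walks several vectors in parallel much like pmap() does. This swaps in Map() for pmap(); the helper fit_one and the choice of columns are made up for illustration:

```r
# Fit y ~ x + z for one combination of column names from anscombe
fit_one <- function(y, x, z) {
  m <- lm(reformulate(c(x, z), response = y), data = anscombe)
  # Generic labels so the rows can be stacked into one table
  setNames(coef(m), c("intercept", "x", "z"))
}

ys <- c("y1", "y2", "y3")
xs <- c("x1", "x2", "x3")
zs <- c("x4", "x4", "x4")

# Map() calls fit_one(ys[i], xs[i], zs[i]) for each i
coefs <- do.call(rbind, Map(fit_one, ys, xs, zs))
coefs
```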
I have a dataframe that has a column for time, symbol, price, volatility. I use this dataframe to run a first pass OLS regression using dummy variables for the symbol
fit <- lm(volatility~factor(symbol) + 0)
Then I want to use the coefficients from that regression in a second-pass regression, so I save the coefficients to reuse and then want to use them to scale volatility:
scale <- summary(fit)$coefficients[,1]
yscale <- volatility/scale
fit2 <- lm(yscale~factor(time) + factor(symbol)*factor(time) + 0)
The problem I am having is that I want to use the factor coefficient that applies to each symbol. In the original data frame I want to divide each volatility by the coefficient that matches its symbol. So, if I have the symbols DDX, CTY and LOL, I want to divide DDX's volatility by the coefficient for factor DDX from the regression, then do the same for CTY and LOL.
Also, I need to figure out how to do the product in the second fit2 coefficient.
You should provide a reproducible example to get an exact answer. Here is some data:
dat <- data.frame(volatility= rnorm(30),
symbol = sample(c('DDX', 'CTY', 'LOL'),30,replace=TRUE))
fit <- lm(volatility~factor(symbol) + 0,data=dat)
mm <- coef(fit)
names(mm) <- gsub('factor\\(symbol\\)','',names(mm))
I transform the names to get clean names that can be used later:
CTY DDX LOL
-0.1991273 0.1331980 -0.1567511
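Then, using transform, I divide each volatility by the corresponding coefficient (as.character() makes sure the lookup is by name even if symbol is a factor):
transform(dat,vol.scale = volatility/mm[as.character(symbol)],coef = mm[as.character(symbol)])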
volatility symbol vol.scale coef
1 -0.592306253 DDX -4.44680974 0.1331980
2 1.143486046 DDX 8.58485769 0.1331980
3 -0.693694139 LOL 4.42544868 -0.1567511
4 -0.166050131 LOL 1.05932325 -0.1567511
5 1.381900588 CTY -6.93978353 -0.1991273
..............................
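With only the symbol dummies and no intercept, each coefficient of the first-pass regression is simply the mean volatility of that symbol, so the scaling can also be sketched without fitting a model, using ave(). A small check on made-up data; the names vol_scale_lm and vol_scale_ave are for illustration only:

```r
set.seed(1)
dat <- data.frame(
  volatility = rnorm(30),
  symbol     = rep(c("DDX", "CTY", "LOL"), each = 10)
)

# Route 1: coefficients from the dummy-variable regression
fit <- lm(volatility ~ factor(symbol) + 0, data = dat)
mm <- coef(fit)
names(mm) <- sub("factor(symbol)", "", names(mm), fixed = TRUE)
vol_scale_lm <- dat$volatility / mm[as.character(dat$symbol)]

# Route 2: per-symbol means, repeated for every row of the group
group_mean <- ave(dat$volatility, dat$symbol)
vol_scale_ave <- dat$volatility / group_mean

# Both routes give the same scaled values
all.equal(unname(vol_scale_lm), vol_scale_ave)
```

This shortcut holds only for the dummy-only, no-intercept specification; once other regressors are added, the coefficients are no longer plain group means.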
I have a list of 100 columns in a data frame Data1. One of these variables is the dependent variable. The others are predictors.
I need to extract 99 predictors into a column (say varlist) to be used in the equation below
equation <- as.formula(paste('y', "~", paste(varlist, collapse="+"),collapse=""))
I can use dput on the dataframe to extract all the columns but I could not get rid of the dependent variable y from the list:
Varlist <- dput(names(Data1))
It would be much more appropriate to go a different route. If you want to include all of the other variables in your data frame besides the response variable you can just use y ~ . to specify that.
fakedata <- as.data.frame(matrix(rnorm(100000), ncol = 100))
names(fakedata)[1] <- "y"
o <- lm(y ~ ., data = fakedata)
This fits a regression using the 99 other columns in fakedata as the predictors and 'y' as the response, and stores it in 'o'.
Edit: If you want to exclude some variables you can exclude those from the data set. The following removes the 10th column through the 100th column leaving a regression of y on columns 2-9
o <- lm(y ~ ., data = fakedata[,-(10:100)])
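If the explicit list of predictor names is still wanted, setdiff() removes the response from the column names, and reformulate() turns the rest into a formula. A sketch on a small made-up data frame standing in for Data1 (five columns instead of 100); it also shows dropping a single predictor with "- d" in the formula:

```r
set.seed(1)
# Small stand-in for Data1: response y plus four predictors
Data1 <- as.data.frame(matrix(rnorm(200 * 5), ncol = 5))
names(Data1) <- c("y", "a", "b", "c", "d")

# All column names except the response
varlist <- setdiff(names(Data1), "y")

# Builds y ~ a + b + c + d without pasting strings
equation <- reformulate(varlist, response = "y")

fit_explicit <- lm(equation, data = Data1)
fit_dot <- lm(y ~ ., data = Data1)        # same model via the dot shorthand
fit_minus <- lm(y ~ . - d, data = Data1)  # all predictors except d
```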