I'm trying to run a regression for every zipcode in my dataset and save the coefficients to a data frame but I'm having trouble.
Whenever I run the code below, I get a data frame called "coefficients" containing every zip code but with the intercept and coefficient for every zipcode being equal to the results of the simple regression lm(Sealed$hhincome ~ Sealed$square_footage).
When I run the code as indicated in Ranmath's example at the link below, everything works as expected. I'm new to R after many years with STATA, so any help would be greatly appreciated :)
R extract regression coefficients from multiply regression via lapply command
library(plyr)
Sealed <- read.csv("~/Desktop/SEALED.csv")
x <- function(df) {
lm(Sealed$hhincome ~ Sealed$square_footage)
}
regressions <- dlply(Sealed, .(Sealed$zipcode), x)
coefficients <- ldply(regressions, coef)
Because dlply takes a ... argument that allows additional arguments to be passed to the function, you can make things even simpler:
dlply(Sealed,.(zipcode),lm,formula=hhincome~square_footage)
The first two arguments to lm are formula and data. Since formula is specified here, lm will pick up the next argument it is given (the relevant zipcode-specific chunk of Sealed) as the data argument ...
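The same argument-forwarding idea can be sketched in base R, without plyr, since lapply also passes extra named arguments on to the function. This is only an illustration: the toy data frame and its values are made up, and split() plus lapply() stand in for dlply.

```r
# Hypothetical toy data standing in for Sealed (values are invented)
set.seed(1)
toy <- data.frame(
  zipcode = rep(c("10001", "10002", "10003"), each = 10),
  square_footage = runif(30, 500, 3000)
)
toy$hhincome <- 20000 + 30 * toy$square_footage + rnorm(30, sd = 5000)

# formula is matched by name, so each piece of the split lands in lm's
# next free argument, which is data
fits <- lapply(split(toy, toy$zipcode), lm, formula = hhincome ~ square_footage)

# One row of coefficients per zipcode
coefs <- t(sapply(fits, coef))
print(coefs)
```

Each row of coefs comes from a separate fit on one zipcode's rows.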
You are applying the function:
x <- function(df) {
lm(Sealed$hhincome ~ Sealed$square_footage)
}
to each subset of your data, so we shouldn't be surprised that the output each time is exactly
lm(Sealed$hhincome ~ Sealed$square_footage)
right? Try replacing Sealed with df inside your function. That way you're referring to the variables in each individual piece passed to the function, not the whole variable in the data frame Sealed.
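A small sketch of the difference, using invented data and base R's split() in place of dlply (both are assumptions made for illustration). The first version ignores its argument and refits the full data every time; the second uses the piece it is given.

```r
# Hypothetical toy data: two zipcodes with clearly different slopes
set.seed(2)
toy_sealed <- data.frame(
  zipcode = rep(c("A", "B"), each = 15),
  square_footage = runif(30, 500, 3000)
)
slope <- ifelse(toy_sealed$zipcode == "A", 10, 40)
toy_sealed$hhincome <- slope * toy_sealed$square_footage + rnorm(30, sd = 1000)

pieces <- split(toy_sealed, toy_sealed$zipcode)

# Ignores df: every piece yields the whole-data regression
wrong <- sapply(pieces, function(df) {
  coef(lm(toy_sealed$hhincome ~ toy_sealed$square_footage))
})

# Uses df: each piece yields its own regression
right <- sapply(pieces, function(df) {
  coef(lm(hhincome ~ square_footage, data = df))
})

print(wrong)
print(right)
```

The columns of wrong are identical, while the columns of right differ by zipcode.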
The issue is not with plyr but with the definition of the function: it takes an argument, df, but never uses that argument in its body.
As an analogy,
myFun <- function(x) {
3 * 7
}
> myFun(2)
[1] 21
> myFun(578)
[1] 21
If you run this function on different values of x, it will still give you 21, no matter what x is, because there is no reference to x within the function. In my silly example, the correction is obvious; in your function above, the confusion is understandable, since $hhincome and $square_footage look as if they should vary with the input.
But what needs to vary is what comes before the $. As @Joran correctly pointed out, swap Sealed$hhincome with df$hhincome (and same for $squ..) and that will help.
I have a dataset "res.sav" that I read in via haven. It contains 20 columns, called "Genes1_Acc4", "Genes2_Acc4" etc. I am trying to find a correlation coefficient between those and another column called "Condition". I want to separately list all coefficients.
I created two functions, cor.condition.cols and cor.func to do that. The first iterates through the filenames and works just fine. The second was supposed to give me my correlations which didn't work at all. I also created a new "cor.condition.Genes" which I would like to fill with the correlations, ideally as a matrix or dataframe.
I have tried to iterate through the columns with two functions. However, when I try to pass it, I get the warning "NAs introduced by coercion". This wouldn't be the end of the world (I also tried suppressWarnings()). The bigger problem is that my function does not seem to convert said columns into the numeric type I need for my cor() function. I receive the "y must be numeric" error when trying to run the cor() function. I tried putting several arguments inside and outside '' or "" without success.
When I ran str(cor.condition.cols) I only received character strings, which makes me think that my function somehow messes up the as.numeric call. Any suggestions for how else I could iterate through these columns and convert them?
Thanks guys :)
cor.condition.cols <- lapply(1:20, function(x){paste0("res$Genes", x, "_Acc4")})
#save acc_4 columns as numeric columns and calculate correlations
res <- (as.numeric("cor.condition.cols"))
cor.func <- function(x){
cor(res$Condition, x, use="complete.obs", method="pearson")
}
cor.condition.Genes <- cor.func(cor.condition.cols)
You can do:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
# convert column by column; sapply returns a numeric matrix with one column per gene
res2 <- sapply(res[cor.condition.cols], as.numeric)
cor.condition.Genes <- cor(res2, res$Condition, use="complete.obs", method="pearson")
or, if the columns are already numeric, the short variant:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
cor.condition.Genes <- cor(res[cor.condition.cols], res$Condition, use="complete.obs")
Here is an example with other data:
cor(iris[-(4:5)], iris[[4]])
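As a self-contained sketch of the same approach, the following builds a small invented data frame whose gene columns are stored as character strings, converts them, and collects the correlations in a data frame. The name res_toy, the three-column size, and the values are assumptions made for illustration.

```r
# Hypothetical stand-in for res: gene columns stored as text
set.seed(3)
n <- 25
res_toy <- data.frame(Condition = rnorm(n))
gene_cols <- paste0("Genes", 1:3, "_Acc4")
for (nm in gene_cols) {
  res_toy[[nm]] <- as.character(round(res_toy$Condition + rnorm(n), 3))
}
res_toy[2, gene_cols[1]] <- NA  # a missing value, dropped by use = "complete.obs"

# Convert each column to numeric; the result is an n x 3 numeric matrix
numeric_genes <- sapply(res_toy[gene_cols], as.numeric)

# One correlation with Condition per gene column
cors <- cor(numeric_genes, res_toy$Condition, use = "complete.obs", method = "pearson")
cor_table <- data.frame(gene = rownames(cors), r = unname(cors[, 1]))
print(cor_table)
```

Column names are passed as plain strings and used to index the data frame, rather than pasted together with "res$".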
I want to run a linear regression where my dependent variable is data$fs_deviation_score and the independent variables are multiple columns of my data frame (columns 660 to 675).
With this function it works, but I am not able to save the output (Coefficients: Estimate, Std. Error, p-value, ...):
Reg<-lapply( data[660:675], function(x) summary(lm(data$fs_deviation_score ~ x)))
I am looking for a function to save the output (Coefficients: Estimate, Std. Error, p-value, ...).
Thanks
If you have that Reg object produced by:
Reg<-lapply( data[660:675], function(x) summary(lm(data$fs_deviation_score ~ x)))
Then all you need to do to extract the coefficients matrix from each summary object is this:
CoefMats <- lapply( Reg, coef)
The summary-objects are actually named lists (as are lm and glm objects). ?summary.lm will bring up the specific help page and the Value subsection will give you all the names. The See Also links should also be reviewed.
Had you wanted the r.squared values, you could have used this:
Rsqds <- lapply(Reg, "[[", "r.squared")
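To go from the list of coefficient matrices to a single table that can be written to disk, one option is to bind them together with a column recording which predictor each row came from. This is a sketch on the built-in mtcars data; the choice of predictors and the name coef_table are assumptions made for illustration.

```r
# Regress mpg on each predictor separately, as in the Reg object above
predictors <- c("wt", "hp")
Reg <- lapply(mtcars[predictors], function(x) summary(lm(mtcars$mpg ~ x)))

# Stack the coefficient matrices, labelling each row with its predictor
coef_table <- do.call(rbind, lapply(names(Reg), function(nm) {
  cm <- coef(Reg[[nm]])
  data.frame(predictor = nm, term = rownames(cm), cm, check.names = FALSE)
}))
rownames(coef_table) <- NULL
print(coef_table)

# The table could then be saved with, for example:
# write.csv(coef_table, "coefficients.csv", row.names = FALSE)
```

Setting check.names = FALSE keeps the original column headings such as "Std. Error" and "Pr(>|t|)".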
Example:
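m <- lm(as.integer(Species) ~ ., data = iris)  # lm needs a formula; here the species code is regressed on all other columns
save(m, file = "iris_lm.RData")  # the file name must be passed as file=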
Now in a new session.
load ("iris_lm.RData");
summary (m);
names (m);
You can use the save function to save the entire object and read and use it later. Note that this is a general way to store any R object.
To store individual components of the model m, or of its summary, write them out separately. For example, to store the residuals:
write.csv(m$residuals, "residuals.csv")
To see which components are available, run names(m) for the model and names(summary(m)) for its summary; help(summary.lm) describes the latter.
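A minimal round trip might look like the sketch below. The model formula is an arbitrary choice for illustration, and tempfile() is used only so the example does not leave files in the working directory.

```r
# Fit a model and save the whole object
m <- lm(Sepal.Length ~ Petal.Length, data = iris)
path <- tempfile(fileext = ".RData")
save(m, file = path)

# Simulate a new session: remove the object, then restore it under its old name
rm(m)
load(path)

# Save just the coefficient table as CSV and read it back
coefs <- coef(summary(m))
csv_path <- tempfile(fileext = ".csv")
write.csv(coefs, csv_path)
restored <- read.csv(csv_path, row.names = 1, check.names = FALSE)
print(restored)
```

load() restores the object under the name it was saved with, which is why m is available again after rm(m).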
If you want to save the output of the regression model in tibble format, you can use the broom package. This works for other model objects too.
Also, the gtsummary package helps to get output in a presentation-ready table format.
For examples, see this
I'm a new user trying to figure out lapply.
I have two data sets with the same 30 variables in each, and I'm trying to run t-tests to compare the variables in each sample. My ideal outcome would be a table that lists each variable along with the t stat and the p-value for the difference in that variable between the two datasets.
I tried to design a function to do the t test, so that I could then use lapply. Here's my code with a reproducible example.
height<-c(2,3,4,2,3,4,5,6)
weight<-c(3,4,5,7,8,9,5,6)
location<-c(0,1,1,1,0,0,0,1)
data_test<-cbind(height,weight,location)
data_north<-subset(data_test,location==0)
data_south<-subset(data_test,location==1)
variables<-colnames(data_test)
compare_t_tests<-function(x){
model<-t.test(data_south[[x]], data_north[[x]], na.rm=TRUE)
return(summary(model[["t"]]), summary(model[["p-value"]]))
}
compare_t_tests(height)
which gets the error:
Error in data_south[[x]] : attempt to select more than one element
My plan was to use the function in lapply like this, once I figure it out.
lapply(variables, compare_t_tests)
I'd really appreciate any advice. It seems to me like I might not even be looking at this right, so redirection would also be welcome!
You're very close. There are just a few tweaks:
Data:
height <- c(2,3,4,2,3,4,5,6)
weight <- c(3,4,5,7,8,9,5,6)
location <- c(0,1,1,1,0,0,0,1)
Use data.frame instead of cbind to get a data frame with real names ...
data_test <- data.frame(height,weight,location)
data_north <- subset(data_test,location==0)
data_south <- subset(data_test,location==1)
Don't include location in the set of variables ...
variables <- colnames(data_test)[1:2] ## skip location
Use the model, not the summary; return a vector
compare_t_tests<-function(x){
model <- t.test(data_south[[x]], data_north[[x]], na.rm=TRUE)
unlist(model[c("statistic","p.value")])
}
Compare with the variable in quotation marks, not as a raw symbol:
compare_t_tests("height")
## statistic.t p.value
## 0.2335497 0.8236578
Using sapply will automatically collapse the results into a table:
sapply(variables,compare_t_tests)
## height weight
## statistic.t 0.2335497 -0.4931970
## p.value 0.8236578 0.6462352
You can transpose this (t()) if you prefer ...
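If a long table with one row per variable is preferred (as asked for in the question), the same t-tests can be collected into a data frame. This is a sketch using the example data from above; the column names t_stat and p_value are my own choice.

```r
height <- c(2, 3, 4, 2, 3, 4, 5, 6)
weight <- c(3, 4, 5, 7, 8, 9, 5, 6)
location <- c(0, 1, 1, 1, 0, 0, 0, 1)
data_test <- data.frame(height, weight, location)
variables <- c("height", "weight")

# One row per variable: south (location == 1) versus north (location == 0)
t_table <- do.call(rbind, lapply(variables, function(v) {
  south <- data_test[[v]][data_test$location == 1]
  north <- data_test[[v]][data_test$location == 0]
  tt <- t.test(south, north)
  data.frame(variable = v, t_stat = unname(tt$statistic), p_value = tt$p.value)
}))
print(t_table)
```

The values match the sapply output above, only arranged by row instead of by column.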
Bit of an R novice here, so it might be a very simple problem.
I've got a dataset with GENDER (being a binary variable) and a whole lot of numerical variables. I wanted to write a simple function that checks for equality of variance and then performs the appropriate t-test.
So my first attempt was this:
genderttest<-function(x){ # x = outcome variable
attach(Dataset)
on.exit(detach(Dataset))
VARIANCE<-var.test(Dataset[GENDER=="Male",x], Dataset[GENDER=="Female",x])
if(VARIANCE$p.value<0.05){
t.test(x~GENDER)
}else{
t.test(x~GENDER, var.equal=TRUE)
}
}
This works well outside of a function (replacing the x, of course), but gave me an error here because variable lengths differ.
So I thought it might be handling the NA cases strangely and I should clean up the dataset first and then perform the tests:
genderttest<-function(x){ # x = outcome variable
Dataset2v<-subset(Dataset,select=c("GENDER",x))
Dataset_complete<-na.omit(Dataset2v)
attach(Dataset_complete)
on.exit(detach(Dataset_complete))
VARIANCE<-var.test(Dataset_complete[GENDER=="Male",x], Dataset_complete[GENDER=="Female",x])
if(VARIANCE$p.value<0.05){
t.test(x~GENDER)
}else{
t.test(x~GENDER, var.equal=TRUE)
}
}
But this gives me the same error.
I'd appreciate if anyone could point out my (probably stupid) mistake.
I believe the problem is that inside your function x holds the name of the outcome column (a character string such as "score"), not the column itself. When you call t.test(x~GENDER), the formula refers to the symbol x literally, so R looks for an object called x: Dataset has no column with that name, and the x it does find is that single string.
A solution that should work is to build the formula from the string and call, for the unequal-variance and equal-variance branches respectively:
do.call('t.test', args=list(formula=as.formula(paste0(x,'~GENDER')), data=Dataset))
do.call('t.test', args=list(formula=as.formula(paste0(x,'~GENDER')), var.equal=TRUE, data=Dataset))
which will call t.test() and pass the value of x as part of the formula argument rather than the symbol x (i.e. score ~ GENDER instead of x ~ GENDER).
The reason for the particular error you saw is that GENDER has length equal to the number of rows in Dataset, while x has length 1, so the variable lengths differ.
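Putting this together, the whole function could be written without attach() by building the formula once and passing the data explicitly. This is a sketch under assumptions: the toy Dataset and its score column are invented, and reformulate() is used in place of as.formula(paste0(...)) to build the same formula.

```r
# Hypothetical data: a binary GENDER column and one numeric outcome with an NA
set.seed(1)
Dataset <- data.frame(
  GENDER = rep(c("Male", "Female"), each = 20),
  score = c(rnorm(20, 50, 5), rnorm(20, 53, 5))
)
Dataset$score[3] <- NA

genderttest <- function(x, data = Dataset) {
  # x is a column name as a string, e.g. "score"; this builds score ~ GENDER
  f <- reformulate("GENDER", response = x)
  # Treat variances as equal unless the F test rejects at the 5% level
  equal_var <- var.test(f, data = data)$p.value >= 0.05
  t.test(f, data = data, var.equal = equal_var)
}

result <- genderttest("score")
print(result)
```

The formula methods drop rows with missing values by default, so no separate na.omit() step is needed here.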
For each of 100 data sets, I am using lm() to generate 7 different equations and would like to extract and compare the p-values and adjusted R-squared values.
Kindly assume that lm() is in fact the best regression technique possible for this scenario.
In searching the web I've found a number of useful examples for how to create a function that will extract this information and write it elsewhere, however, my code uses paste() to label each of the functions by the data source, and I can't figure out how to include these unique pasted names in the function I create.
Here's a mini-example:
temp <- data.frame(labels=rep(1:10),LogPre= rnorm(10))
temp$labels2<-temp$labels^2
testrun<-c("XX")
for (i in testrun)
{
assign(paste(i,"test",sep=""),lm(temp$LogPre~temp$labels))
assign(paste(i,"test2",sep=""),lm(temp$LogPre~temp$labels2))
}
I would then like to extract the coefficients of each equation
But the following doesn't work:
summary(paste(i,"test",sep="")$coefficients)
and neither does this:
coef(summary(paste(i,"test",sep="")))
Both generate the error: $ operator is invalid for atomic vectors
EVEN THOUGH
summary(XXtest)$coefficients
and
coef(summary(XXtest))
work just fine.
How can I use paste() within summary() to allow me to do this for AAtest, AAtest2, ABtest, ABtest2, etc.?
Thanks!
Hard to tell exactly what your purpose is, but some kind of apply loop may do what you want in a simpler way. Perhaps something like this?
temp <- data.frame(labels=rep(1:10),LogPre= rnorm(10))
temp$labels2<-temp$labels^2
testrun<-c("XX")
names(testrun) <- testrun
out <- lapply(testrun, function(i) {
list(test1=lm(temp$LogPre~temp$labels),
test2=lm(temp$LogPre~temp$labels2))
})
Then to get all the p-values for the slopes you could do:
> sapply(out, function(i) sapply(i, function(x) coef(summary(x))[2,4]))
XX
test1 0.02392516
test2 0.02389790
Just using paste results in a character string, not the object with that name. You need to tell R to get the object with that name by using get.
summary(get(paste(i,"test",sep="")))$coefficients
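When there are many such objects, mget() fetches several by name at once and returns them as a named list, which can then be summarised in one pass. This is a sketch: the second label "YY", the seed, and the output column names are assumptions added for illustration.

```r
set.seed(42)
temp <- data.frame(labels = 1:10, LogPre = rnorm(10))
temp$labels2 <- temp$labels^2
testrun <- c("XX", "YY")  # "YY" is a hypothetical second data source

# Create the models under pasted names, as in the question
for (i in testrun) {
  assign(paste0(i, "test"), lm(LogPre ~ labels, data = temp))
  assign(paste0(i, "test2"), lm(LogPre ~ labels2, data = temp))
}

# Build every name, then fetch all objects in one call
model_names <- as.vector(outer(testrun, c("test", "test2"), paste0))
models <- mget(model_names)

# Slope p-value and adjusted R-squared for each model, one row per model
stats <- t(sapply(models, function(m) {
  s <- summary(m)
  c(slope_p = coef(s)[2, 4], adj_r2 = s$adj.r.squared)
}))
print(stats)
```

Storing the models in a list from the start, as in the lapply answer above, avoids the need for assign() and get() altogether.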