I feel like this should be the easiest thing in the world. Firstly, I am relatively new to R, but I wanted to learn it. That being said, my experience so far suggests that R is not very intuitive. What I was able to figure out in Python within a couple hours has so far taken 2 days without result in R.
I want to regress a selection of dependent variables within a selection of panel data. I have several variables with various normalization curves. I would like to be able to iterate through many instead of writing regressions 1 at a time.
I want to do something like the following: plm(dependent ~ loopedvar + var2 + var3 + var4, data=mydata, model=c("within"))
I have created a varlist using grep, which is actually very easy. Now I want to substitute in the variables in varlist 1-by-1 as the 'loopedvar.'
In python with SPSS I would do something like
nvariables=len(varlist)
for variable in xrange(nvariables):
testvariable=varlist[variable]
spss.Submit("""AREG dependent WITH
{}
var2
var3
var4
/METHOD PW.
""" .format(testvariable))
I have also found this tutorial http://www.ats.ucla.edu/stat/r/pages/looping_strings.htm, but I cannot get it to work, and I do not understand the *apply functions in R. For one, when writing lapply(varlist, function (x) [model]) how does the varlist[var] know where to go?
I have tried for loops with paste and substitute with varying errors.
for (var in 1:length(varlist)) {
models<-plm(substitute(dependent ~ i, list(i=as.name(paste0(var)), as.name("var2"), as.name("var3"), as.name("var4")) data=mydata, model=c("within")))
}
Throws "Error: unexpected symbol in: [...(var4"")) data]"
for (var in 1:length(varlist)) {
+ models<-summary(plm(paste0("dependent ~ ",var," + var2 + var3 + var4"), data=mydata, model=c("within")))
+ }
Throws "Error: inherits(object, "formula") is not TRUE"
These errors are super unhelpful, and I'm just sick of guessing. R syntax is not very straightforward in my estimation, and the chances that I will get it right are slim.
Please don't post a non-response. R people have a penchant for that in my experience. If I have insufficiently described my issue or desires just request more information, and I will be happy to oblige.
EDIT: I forgot the index argument in plm function. It should be there.
Definitely one of the harder things to wrap one's head around in R is that it does not like the "macro" approach used in some other languages (I learned to code Stata before branching out into R). Almost always there is a way to use the *apply functions instead of a loop-with-macro-reference to do what you want to do.
Here is how I would approach your particular problem.
data <- data.frame(dep = runif(100), var1=runif(100), var2=runif(100),var3=runif(100)) #Create some fake data
varlist<-c("var1","var2","var3") # Declare your varlist as a vector
lm.results<- lapply(data[,varlist],function(x) lm(dep ~ x, data=data)) # run the regression on each variable.
Let me break that last line down a little bit. A dataframes in R is actually a list with extra structure, where each item in the list is a variable/column. So lapply(data[,varlist],FUN) will evaluate the function FUN, using each column in data[,varlist] i.e. each variable in data which is named in varlist.
Since there isn't a built in function for what you need (there often isn't) you declare it on the fly. function(x) lm(dep ~ x, data=data) takes a variable as an argument (in the lapply call, each variable in varlist) and regresses dep on that variable. The results will be stored in a new list called lm.results.
Related
I am new in writing loops and I have some difficulties there. I already looked through other questions, but didn't find the answer to my specific problem.
So lets just create a random dataset, give column names and set the variables as character:
d<-data.frame(replicate(4,sample(1:9,197,rep=TRUE)))
colnames(d)<-c("variable1","variable2","trait1","trait2")
d$variable1<-as.character(d$variable1)
d$variable2<-as.character(d$variable2)
Now I define my vector over which I want to loop. It correspons to trait 1 and trait 2:
trt.nm <- names(d[c(3,4)])
Now I want to apply the following model for trait 1 and trait 2 (which should now be as column names in trt.nm) in a loop:
library(lme4)
for(trait in trt.nm)
{
lmer (trait ~ 1 + variable1 + (1|variable2) ,data=d)
}
Now I get the error that variable lengths differ. How could this be explained?
If I apply the model without loop for each trait, I get a result, so the problem has to be somewhere in the loop, I think.
trait is a string, so you'll have to convert it to a formula to work; see http://www.cookbook-r.com/Formulas/Creating_a_formula_from_a_string/ for more info.
Try this (you'll have to add a print statement or save the result to actually see what it does, but this will run without errors):
for(trait in trt.nm) {
lmer(as.formula(paste(trait, " ~ 1 + variable1 + (1|variable2)")), data = d)
}
Another suggestion would be to use a list and lapply or purrr::map instead. Good luck!
I am trying to do an anova anaysis in R on a data set with one within factor and one between factor. The data is from an experiment to test the similarity of two testing methods. Each subject was tested in Method 1 and Method 2 (the within factor) as well as being in one of 4 different groups (the between factor). I have tried using the aov, the Anova(in car package), and the ezAnova functions. I am getting wrong values for every method I try. I am not sure where my mistake is, if its a lack of understanding of R or the Anova itself. I included the code I used that I feel should be working. I have tried a ton of variations of this hoping to stumble on the answer. This set of data is balanced but I have a lot of similar data sets and many are unblanced. Thanks for any help you can provide.
library(car)
library(ez)
#set up data
sample_data <- data.frame(Subject=rep(1:20,2),Method=rep(c('Method1','Method2'),each=20),Level=rep(rep(c('Level1','Level2','Level3','Level4'),each=5),2))
sample_data$Result <- c(4.76,5.03,4.97,4.70,5.03,6.43,6.44,6.43,6.39,6.40,5.31,4.54,5.07,4.99,4.79,4.93,5.36,4.81,4.71,5.06,4.72,5.10,4.99,4.61,5.10,6.45,6.62,6.37,6.42,6.43,5.22,4.72,5.03,4.98,4.59,5.06,5.29,4.87,4.81,5.07)
sample_data[, 'Subject'] <- as.factor(sample_data[, 'Subject'])
#Set the contrats if needed to run type 3 sums of square for unblanaced data
#options(contrats=c("contr.sum","contr.poly"))
#With aov method as I understand it 'should' work
anova_aov <- aov(Result ~ Method*Level + Error(Subject/Method),data=test_data)
print(summary(anova_aov))
#ezAnova method,
anova_ez = ezANOVA(data=sample_data, wid=Subject, dv = Result, within = Method, between=Level, detailed = TRUE, type=3)
print(anova_ez)
Also, the values I should be getting as output by SAS
SAS Anova
Actually, your R code is correct in both cases. Running these data through SPSS yielded the same result. SAS, like SPSS, seems to require that the levels of the within factor appear in separate columns. You will end up with 20 rows instead of 40. An arrangmement like the one below might give you the desired result in SAS:
Subject Level Method1 Method2
I'm trying to write a function I can apply to a string vector or list instead of writing a loop. My goal is to run a regression for different endogenous variables and save the resulting tables. Since experienced R users tell us we should learn the apply functions, I want to give it a try. Here is my attempt:
Broken Example:
library(ExtremeBounds)
Data <- data.frame(var1=rbinom(30,1,0.2),var2=rbinom(30,1,0.2),var3=rnorm(30),var4=rnorm(30),var5=rnorm(30),var6=rnorm(30))
spec1 <- list(y=c("var1"),freevars=("var3"),doubtvars=c("var4","var5"))
spec2 <- list(y=c("var2"),freevars=("var4"),doubtvars=c("var3","var5","var6"))
specs <- c("spec1","spec2")
myfunction <- function(x){
eba <- eba(data=Data, y=x$y,
free=x$freevars,
doubtful=x$doubtvars,
reg.fun=glm, k=1, vif=7, draws=50, se.fun = se.robust, weights = "lri", family = binomial(logit))
output <- eba$bounds
output <- output[,-(3:7)]
}
lapply(specs,myfunction)
Which gives me an error that makes me guess that R does not understand when x should be "spec1" or "spec2". Also, I don't quite understand what lapply would try to collect here. Could you provide me with some best practice/hints how to communicate such things to R?
error: Error in x$y : $ operator is invalid for atomic vectors
Working example:
Here is a working example for spec1 without using apply that shows what I'm trying to do. I want to loop this example through 7 specs but I'm trying to get away from loops. The output does not have to be saved as a csv, a list of all outputs or any other collection would be great!
eba <- eba(data=Data, y=spec1$y,
free=spec1$freevars,
doubtful=spec1$doubtvars,
reg.fun=glm, k=1, vif=7, draws=50, se.fun = se.robust, weights = "lri", family = binomial(logit))
output <- eba$bounds
output <- output[,-(3:7)]
write.csv(output, "./Results/eba_pmr.csv")
Following the comments of #user20650, the solution is quite simple:
In the lapply command, use lapply(mget(specs),myfunction) which gets the names of the list elements of specs instead of the lists themselves.
Alternatively, one could define specs as a list: specs <- list(spec1,spec2) but that has the downside that the lapply command will return a list where the different specifications are numbered. The first version keeps the names of the specifications (spec1 and spec2) which which makes work with the resulting list much easier.
I'm trying to run a regression for every zipcode in my dataset and save the coefficients to a data frame but I'm having trouble.
Whenever I run the code below, I get a data frame called "coefficients" containing every zip code but with the intercept and coefficient for every zipcode being equal to the results of the simple regression lm(Sealed$hhincome ~ Sealed$square_footage).
When I run the code as indicated in Ranmath's example at the link below, everything works as expected. I'm new to R after many years with STATA, so any help would be greatly appreciated :)
R extract regression coefficients from multiply regression via lapply command
library(plyr)
Sealed <- read.csv("~/Desktop/SEALED.csv")
x <- function(df) {
lm(Sealed$hhincome ~ Sealed$square_footage)
}
regressions <- dlply(Sealed, .(Sealed$zipcode), x)
coefficients <- ldply(regressions, coef)
Because dlply takes a ... argument that allows additional arguments to be passed to the function, you can make things even simpler:
dlply(Sealed,.(zipcode),lm,formula=hhincome~square_footage)
The first two arguments to lm are formula and data. Since formula is specified here, lm will pick up the next argument it is given (the relevant zipcode-specific chunk of Sealed) as the data argument ...
You are applying the function:
x <- function(df) {
lm(Sealed$hhincome ~ Sealed$square_footage)
}
to each subset of your data, so we shouldn't be surprised that the output each time is exactly
lm(Sealed$hhincome ~ Sealed$square_footage)
right? Try replacing Sealed with df inside your function. That way you're referring to the variables in each individual piece passed to the function, not the whole variable in the data frame Sealed.
The issue is not with plyr but rather in the definition of the function. You are calling a function, but not doing anything with the variable.
As an analogy,
myFun <- function(x) {
3 * 7
}
> myFun(2)
[1] 21
> myFun(578)
[1] 21
If you run this function on different values of x, it will still give you 21, no matter what x is. That is, there is no reference to x within the function. In my silly example, the correction is obvious; in your function above, the confusion is understandable. The $hhincome and $square_footage should conceivably serve as variables.
But you want your x to vary over what comes before the $. As #Joran correctly pointed out, swap sealed$hhincome with df$hhincome (and same for $squ..) and that will help.
For each of 100 data sets, I am using lm() to generate 7 different equations and would like to extract and compare the p-values and adjusted R-squared values.
Kindly assume that lm() is in fact the best regression technique possible for this scenario.
In searching the web I've found a number of useful examples for how to create a function that will extract this information and write it elsewhere, however, my code uses paste() to label each of the functions by the data source, and I can't figure out how to include these unique pasted names in the function I create.
Here's a mini-example:
temp <- data.frame(labels=rep(1:10),LogPre= rnorm(10))
temp$labels2<-temp$labels^2
testrun<-c("XX")
for (i in testrun)
{
assign(paste(i,"test",sep=""),lm(temp$LogPre~temp$labels))
assign(paste(i,"test2",sep=""),lm(temp$LogPre~temp$labels2))
}
I would then like to extract the coefficients of each equation
But the following doesn't work:
summary(paste(i,"test",sep="")$coefficients)
and neither does this:
coef(summary(paste(i,"test",sep="")))
Both generating the error :$ operator is invalid for atomic vectors
EVEN THOUGH
summary(XXtest)$coefficients
and
coef(summary(XXtest))
work just fine.
How can I use paste() within summary() to allow me to do this for AAtest, AAtest2, ABtest, ABtest2, etc.
Thanks!
Hard to tell exactly what your purpose is, but some kind of apply loop may do what you want in a simpler way. Perhaps something like this?
temp <- data.frame(labels=rep(1:10),LogPre= rnorm(10))
temp$labels2<-temp$labels^2
testrun<-c("XX")
names(testrun) <- testrun
out <- lapply(testrun, function(i) {
list(test1=lm(temp$LogPre~temp$labels),
test2=lm(temp$LogPre~temp$labels2))
})
Then to get all the p-values for the slopes you could do:
> sapply(out, function(i) sapply(i, function(x) coef(summary(x))[2,4]))
XX
test1 0.02392516
test2 0.02389790
Just using paste results in a character string, not the object with that name. You need to tell R to get the object with that name by using get.
summary(get(paste(i,"test",sep="")))$coefficients