I am trying to code a fixed effects regression, but I have MANY dummy variables. Basically, I have 184 variables on the RHS of my equation. Instead of writing this out, I am trying to create a loop that will pass through each column (I have named each column with a number).
This is the code I have so far, but the paste is not working. I may be totally off base using paste, but I wasn't sure how else to approach this. I am getting an error (see below).
FE.model <- plm(avg.kw ~ 0 + (for (i in 41:87) {
paste("hour.dummy",i,sep="") + paste("dummy.CDH",i,sep="")
+ paste("dummy.MA",i,sep="") + paste("DR.variable",i,sep="")
}),
data = data.reg,
index=c('Site.ID','date.hour'),
model='within',
effect='individual')
summary(FE.model)
As an example for the column names, when i=41 the names should be "hour.dummy41" "dummy.CDH41", etc.
I'm getting the following error:
Error in paste("hour.dummy", i, sep = "") + paste("dummy.CDH", i, sep = "") : non-numeric argument to binary operator
So I'm not sure if it's the paste function that is not appropriate here, or if it's the loop. I can't seem to find a way to loop through column names easily in R.
Any help is much appreciated!
Ignoring worries about fitting a model with so many terms for the moment, you probably want to generate a string, and then cast it as a formula:
#create a data.frame where rows are the parts of the variable names, then collapse it
rhs <- do.call(paste, c(as.list(expand.grid(c("hour.dummy","dummy.CDH"), 41:87)), sep="", collapse=" + "))
fml <- as.formula(sprintf("avg.kw ~ %s", rhs))
FE.model <- plm(fml, ...
I've only put in two of the dummy name stems in the second line, but you should get the idea. Note that sep="" is needed so the names come out as "hour.dummy41" rather than "hour.dummy.41".
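As a minimal, self-contained sketch of the same idea: the short index range 41:43 below is only for illustration, and base R's reformulate() is swapped in for sprintf() plus as.formula() as one possible alternative. It only builds the formula; passing it to plm() would work the same way as above.

```r
# Name stems taken from the question; the short index range is for illustration only
stems <- c("hour.dummy", "dummy.CDH", "dummy.MA", "DR.variable")
idx   <- 41:43

# outer() pastes every stem with every index: "hour.dummy41", "dummy.CDH41", ...
rhs_terms <- as.vector(outer(stems, idx, paste0))

# reformulate() builds a formula from a character vector of term labels
fml <- reformulate(rhs_terms, response = "avg.kw")
print(fml)
```

The resulting object is an ordinary formula, so it can be handed to any modelling function that takes one.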
I have a dataset "res.sav" that I read in via haven. It contains 20 columns, called "Genes1_Acc4", "Genes2_Acc4" etc. I am trying to find a correlation coefficient between those and another column called "Condition". I want to separately list all coefficients.
I created two functions, cor.condition.cols and cor.func to do that. The first iterates through the filenames and works just fine. The second was supposed to give me my correlations which didn't work at all. I also created a new "cor.condition.Genes" which I would like to fill with the correlations, ideally as a matrix or dataframe.
I have tried to iterate through the columns with two functions. However, when I run them, I get the warning "NAs introduced by coercion". This wouldn't be the end of the world (I also tried suppressWarnings()). The bigger problem is that my function does not seem to convert the columns into the numeric type I need for cor(). I receive the "y must be numeric" error when trying to run cor(). I tried passing the arguments with and without '' or "", without success.
When I run str(cor.condition.cols) I only get character strings, which makes me think that my function somehow messes up the as.numeric conversion. Any suggestions on how else I could iterate through these columns and convert them?
Thanks guys :)
cor.condition.cols <- lapply(1:20, function(x){paste0("res$Genes", x, "_Acc4")})
#save acc_4 columns as numeric columns and calculate correlations
res <- (as.numeric("cor.condition.cols"))
cor.func <- function(x){
cor(res$Condition, x, use="complete.obs", method="pearson")
}
cor.condition.Genes <- cor.func(cor.condition.cols)
You can do:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
res2 <- sapply(res[cor.condition.cols], as.numeric)  # numeric matrix, one column per gene
cor.condition.Genes <- cor(res2, res$Condition, use="complete.obs", method="pearson")
Or possibly the short variant, if the columns are already numeric:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
cor.condition.Genes <- cor(res[cor.condition.cols], res$Condition, use="complete.obs")
Here is an example with other data:
cor(iris[-(4:5)], iris[[4]])
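If the coefficients should be listed separately, one per column, a sketch along these lines may help. It uses the built-in mtcars data as a stand-in for res, with mpg playing the role of Condition; those substitutions are assumptions for illustration only.

```r
# Built-in data: correlate every column except mpg with mpg
predictors <- setdiff(names(mtcars), "mpg")

# sapply() returns a named numeric vector, one coefficient per column
cors <- sapply(mtcars[predictors], function(col) {
  cor(as.numeric(col), mtcars$mpg, use = "complete.obs", method = "pearson")
})

# Collect the results in a data frame, one row per column
cor_table <- data.frame(variable = names(cors), r = unname(cors))
print(cor_table)
```

The key point is that the column names are used to index the data frame, rather than being pasted into strings such as "res$Genes1_Acc4".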
I first use grep to obtain all variable names that begin with the prefix "h_". I then collapse that vector into a single string, separated with plus signs. Is there a way to subsequently use this string in a linear regression?
For example:
holiday_array <- grep("h_", names(df), value=TRUE)
holiday_string = paste(holiday_array, collapse=' + ' )
r_3 <- lm(log(assaults) ~ year + month + holiday_string, data = df)
I get the straightforward error variable lengths differ (found for 'holiday_string')
I can do it like this, for example:
holiday_formula <- as.formula(paste('log(assaults) ~ attend_v + year+ month + ', paste("", holiday_vars, collapse='+')))
r_3 <- lm(holiday_formula, data = df)
But I don't want to have to type a separate formula construction for each new set of controls. I want to be able to add the "string" inside the lm function. Is this possible?
The above is problematic, because let's say I want to then add another set of control variables to the formula contained in holiday_formula, so something like this:
weather_vars <- grep("w_", names(df), value=TRUE)
weather_formula <- as.formula(paste(holiday_formula, paste("+", weather_vars, collapse='+')))
Not sure how you would do the above.
I don't know a simple method for construction of a formula argument different than the one you are rejecting (although I considered and rejected using update.formula since it would also have required using as.formula), but this is an alternate method for achieving the same goal. It uses the "."-expansion feature of R-formulas and relies on the ability of the [-function to accept character argument for column selection:
r_3 <- lm(log(assaults) ~ attend_v + year+ month + . ,
data = df[ , c('assaults', 'attend_v', 'year', 'month', holiday_vars)] )
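If a formula object is still preferred, a small helper can take any number of control sets so that no separate construction has to be typed for each one. The sketch below is only one way to do it: build_formula and the toy data frame are hypothetical, and reformulate() is used in place of paste() plus as.formula().

```r
# Toy data standing in for the real df; all column names are made up
set.seed(1)
df <- data.frame(assaults = rpois(50, 20) + 1, year = rep(2001:2005, 10),
                 h_xmas = rbinom(50, 1, 0.1), h_july4 = rbinom(50, 1, 0.1),
                 w_rain = runif(50), w_temp = rnorm(50))

# Hypothetical helper: combine any number of control sets into one formula
build_formula <- function(response, ...) {
  reformulate(unlist(list(...)), response = response)
}

holiday_vars <- grep("^h_", names(df), value = TRUE)
weather_vars <- grep("^w_", names(df), value = TRUE)

# quote() keeps log(assaults) as a call rather than a column name
fml <- build_formula(quote(log(assaults)), "year", holiday_vars, weather_vars)
fit <- lm(fml, data = df)
```

Adding another set of controls then only means passing one more character vector to the helper.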
In R, I am using readHTMLTable to read in tables from the web. The tables I want occur at indexes 16 & 17, [[16]] & [[17]].
Here is a small sample of the data for you to work with:
These are some of the urls that contain the HTML tables.
url1 = "http://www.basketball-reference.com/leagues/NBA_1980.html"
url2 = "http://www.basketball-reference.com/leagues/NBA_1981.html"
url3 = "http://www.basketball-reference.com/leagues/NBA_1982.html"
And here, I read in the tables to variables named x1, x2, and x3.
x1 = readHTMLTable(url1)
x2 = readHTMLTable(url2)
x3 = readHTMLTable(url3)
If you look at the summary of each of these summary(x1), summary(x2), summary(x3) and count down through the indexes, the tables I want are the ones named "team" and "opponent", which occur on line 16 and line 17.
I have been trying to write a loop that would cycle through these and name the "team" table from each to a variables named team.1980, team.1981, and team.1982, respectively. The "opponent" tables would follow the same trend, opp.1980, and so forth.
This is the code for the loop I have been trying:
for(i in 1:3) {
for (j in 1980:1982) {
nam1 = paste0("team.", j)
nam2 = paste0("opp.", j)
assign(nam1, paste0("x.", i)[[16]])
assign(nam2, paste0("x.", i)[[17]])
}
}
I think the theory behind this loop works, however the problem occurs with the two assign functions:
assign(nam1, paste0("x.", i)[[16]])
assign(nam2, paste0("x.", i)[[17]])
When I run the loop, I get the error message
Error in paste0("x.", i)[[16]] : subscript out of bounds
which is the same error I get if I just run:
paste0("x", 1)[[16]]
> paste0("x", 1)[[16]]
Error in paste0("x", 1)[[16]] : subscript out of bounds
So I am pretty sure this is where my problem is. Does anyone know how I could cycle through variables and pull out indexes from each?
Please keep in mind that I am rather new to R, so simplicity would be much appreciated! Thanks in advance!
The output from readHTMLTable() is a list and the elements can be referenced by name; index isn't necessary. (Though you can use it.)
Suppose x1, x2, and x3 are defined as in your post. Then you can just do this:
for (i in 1:3) {
year <- 1980 + i - 1
eval(parse(text=paste0("team.", year, " <- x", i, '[["team"]]')))
eval(parse(text=paste0("opp.", year, " <- x", i, '[["opponent"]]')))
}
This evaluates the parsed text that's constructed dynamically in the loop. It creates 6 data frames: a team.<year> and an opp.<year> for each of the years 1980 through 1982.
Let's take a closer look at what it's doing...
First a string is constructed using paste0() to concatenate the values into a string with no separator. The first call to paste0() in the first iteration yields this string:
'team.1980 <- x1[["team"]]'
Calling parse() on this tells R to turn that string into an object called an expression. Expressions can be evaluated using eval(). So this string gets turned into an R statement and executed, thereby assigning team.1980.
This process continues for each of the 3 iterations.
This may not be the best approach, but it should work in your situation. I assume you have more than just these 6, otherwise you might as well just write them as individual assignments.
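One alternative that avoids eval(parse()) is to keep the tables in named lists. The sketch below uses small mock lists in place of the real readHTMLTable() output so it runs without a network connection; the contents of x1, x2, and x3 here are made up.

```r
# Mock stand-ins for the lists returned by readHTMLTable(); contents are made up
x1 <- list(team = data.frame(pts = 1), opponent = data.frame(pts = 2))
x2 <- list(team = data.frame(pts = 3), opponent = data.frame(pts = 4))
x3 <- list(team = data.frame(pts = 5), opponent = data.frame(pts = 6))

years  <- 1980:1982
tables <- setNames(list(x1, x2, x3), years)

# Pull the named element out of each year's list
team <- lapply(tables, `[[`, "team")
opp  <- lapply(tables, `[[`, "opponent")

team[["1981"]]   # the "team" table for 1981

# If separate variables are really wanted, assign() avoids parsing strings
for (y in names(team)) assign(paste0("team.", y), team[[y]])
```

Keeping everything in one list usually makes later steps easier, since the same lapply() pattern can be reused on all years at once.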
I'm having some trouble understanding how R handles subsetting internally and this is causing me some issues while trying to build some functions. Take the following code:
f <- function(directory, variable, number_seq) {
##Create a empty data frame
new_frame <- data.frame()
## Add every data frame in the directory whose name is in the number_seq to new_frame
## the file variable specify the path to the file
for (i in number_seq){
file <- paste("~/", directory, "/",sprintf("%03d", i), ".csv", sep = "")
x <- read.csv(file)
new_frame <- rbind.data.frame(new_frame, x)
}
## calculate and return the mean
mean(new_frame[, variable], na.rm = TRUE)*
}
*While calculating the mean I tried to subset first using the $ sign, new_frame$variable, and the subset function, subset(new_frame, select = variable), but it would only return a None value. It only worked when I used new_frame[, variable].
Can anyone explain why the other subsetting didn't work? It took me a really long time to figure it out, and even though I managed to make it work I still don't know why it didn't work the other ways. I really want to look inside the black box so I won't have the same issues in the future.
Thanks for the help.
This behavior has to do with the fact that you are subsetting inside a function.
Both new_frame$variable and subset(new_frame, select = variable) look for a column in the data frame with the name variable.
On the other hand, new_frame[, variable] uses the value of the argument variable from f(directory, variable, number_seq) to select the column.
The dollar sign ($) can only be used with literal column names. That avoids confusion with
dd<-data.frame(
id=1:4,
var=rnorm(4),
value=runif(4)
)
var <- "value"
dd$var
In this case, if $ accepted both variables and column names, which would you expect to get: the dd$var column, or the dd$value column (because var == "value")? That's why dd[, var] is different: it only takes character vectors, not expressions referring to column names. You will get dd$value with dd[, var].
I'm not quite sure why you got None with subset(); I was unable to replicate that problem.
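To see the difference inside a function, here is a small sketch; col_mean and the fixed values in dd are hypothetical and chosen only so the result is easy to check.

```r
dd <- data.frame(id = 1:4, var = c(10, 20, 30, 40), value = c(1, 2, 3, 4))

# Hypothetical function: 'variable' holds the column name as a string
col_mean <- function(data, variable) {
  by_bracket <- mean(data[[variable]])  # uses the string stored in 'variable'
  by_dollar  <- data$variable           # looks for a column literally named "variable"
  list(by_bracket = by_bracket, by_dollar = by_dollar)
}

res <- col_mean(dd, "value")
res$by_bracket          # mean of dd$value
is.null(res$by_dollar)  # TRUE: dd has no column called "variable"
```

Since data$variable is NULL here, taking its mean gives no useful result, which may be the empty value described in the question.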
I sometimes vectorise the variables I used in a model and do other stuff with it (e.g. descriptives etc...). The problem is that sometimes I use "as.numeric(var)" or "as.factor(var)", or center "I(var-15)". I then need the name of the original variables.
The problem is that I can't simply gsub(lmfit$model,"as.factor(","") because I get an error, and I want to avoid deleting variables that contain I, etc. So I need to delete I(* -any number) and as.factor(*), where * is the variable name that I want to remain untouched.
Let's say I have a vector of coefficients from a model:
outcome <- c(1:9)
INDEX <- c(18,17,15,20,10,20,25,13,12)
BODYFAT <- c(18,18,15,20,20,20,15,20,15)
lmfit <- glm(outcome ~ as.factor(BODYFAT) + I(INDEX-15), family = gaussian())
names(lmfit$model)
How would you work on names(lmfit$model) to get the original variable names back (i.e. BODYFAT and INDEX)?
I've started creating some clunky code to remove all the centering numbers (assuming 1 to 500 should be enough in most cases)
b<-paste(paste0("- ",1:500,"|",collapse=""),"-501",collapse="")
library(stringr)
str_replace_all(names(lmfit$model),b, " ")
But I'm having real problems with the removing I() and as.factor(). Any suggestions?
Many thanks in advance
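One possible approach that sidesteps the string manipulation altogether is base R's all.vars(), which extracts the variable names from a formula regardless of wrappers such as as.factor() or I(). A minimal sketch using the example data from the question (the name predictors is introduced here for illustration):

```r
outcome <- 1:9
INDEX   <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
BODYFAT <- c(18, 18, 15, 20, 20, 20, 15, 20, 15)
lmfit <- glm(outcome ~ as.factor(BODYFAT) + I(INDEX - 15), family = gaussian())

# all.vars() returns the variable names used in the formula, ignoring wrappers
all.vars(formula(lmfit))

# Drop the response to keep only the predictors
predictors <- all.vars(delete.response(terms(lmfit)))
predictors
```

Because this works on the parsed formula rather than on the text of the names, constants such as the 15 used for centering are skipped automatically.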