From the FSelector manual:
data(iris)
subset <- cfs(Species~., iris)
f <- as.simple.formula(subset, "Species")
print(f)
Specifically, I mean the one in "Species~.".
Now, it's awfully tough to Google how a bit of punctuation is used (for me, anyway), and I couldn't find anything. This code is unclear to me.
I think you're referring to the period contained in Species~., in which case this is just the standard R formulation of referring to 'all other variables' in the data frame, rather than typing them out one by one, as in Species ~ Variable1 + Variable2 etc.
From the help files of ?formula:
There are two special interpretations of . in a formula. The usual one
is in the context of a data argument of model fitting functions and
means ‘all columns not otherwise in the formula’: see terms.formula.
In the context of update.formula, only, it means ‘what was previously
in this part of the formula’.
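To see that expansion concretely, a minimal base-R check shows which terms the dot stands for when paired with a data frame:

```r
# The dot expands to every column of the data frame
# not otherwise mentioned in the formula.
data(iris)
tf <- terms(Species ~ ., data = iris)
attr(tf, "term.labels")
# "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
```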
I'm trying to use lm() and matchit() on a subset of covariates. I have generated an arbitrary number of columns with prefix "covar", i.e. "covar.1", "covar.2", etc. I'd like to do something like
lm(group ~ covars, data=df)
where covars is a vector of strings c("covar.1", "covar.2", ...).
I tried several things like
cols <- colnames(df)
covars <- cols[grep("covar", colnames(df))]
m.out <- matchit(group ~ covars, data=df, method="nearest", distance="logit", caliper=.20)
but got variable lengths differ (found for 'covars').
Defining a new data frame containing only the covariates and group would work, but that defeats my purpose in using matchit: I want the matched data to keep the other columns too, not just the covariates I picked to match on.
This seems like it should be an easy task, but I can't figure it out after some googling. I'm not sure what an R formula expects there as a subset of columns. Any help is appreciated.
You might want to use as.formula.
Try doing this:
Replace group ~ covars
with as.formula(paste('group', '~', paste(covars, collapse="+")))
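As a quick sketch, using placeholder covariate names matching the question:

```r
covars <- c("covar.1", "covar.2", "covar.3")

# Build the formula text, then convert it to a real formula object
f <- as.formula(paste("group", "~", paste(covars, collapse = " + ")))
f
# group ~ covar.1 + covar.2 + covar.3
```

The resulting `f` can then be passed straight to `lm()` or `matchit()` in place of the literal formula.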
I mentioned this in your other question, but the cobalt package has a function specifically for this: f.build(). The first argument to f.build() is a string containing the name of the treatment variable (the left-hand side of the formula), and the second argument is a string vector containing the names of the variables to go on the right-hand side of the formula (i.e., the covariates). The second argument can also be a data.frame containing the covariates; f.build() simply extracts the names. It then performs the operation described in the chosen answer, but adds a few other touches that make it a little more general and robust to errors.
The cobalt documentation has a section on f.build() and demonstrates its use with glm() and matchit() as examples.
After running matchit(), you can assess balance on the covariates using the bal.tab() function in cobalt, which is compatible with MatchIt:
bal.tab(m.out, un = TRUE)
The documentation for cobalt explains its use with MatchIt in detail.
I am having trouble with mlogit() function. I am trying to predict which variables in a given set are the most preferred amongst people who took our survey. I am trying to predict the optimal combination of variables to create the most preferred option. Basically, we are measuring "Name", "Logo Size", "Design", "Theme","Flavor", and "Color".
To do this, we have a large data set and are trying to run it through mlogit.data() and mlogit(), although we keep getting the same error:
Error in if (abs(x - oldx) < ftol) { :
missing value where TRUE/FALSE needed
None of my data is negative or missing, so this is very confusing. My syntax is:
#Process data in mlogit.data()
data2 <-
mlogit.data(data=data, choice="Choice",
shape="long", varying=5:10,
alt.levels=paste("pos",1:3))
#Make character columns factors and "choice" column (the one we are
#measuring) a numeric.
data2$Name <- as.factor(data2$Name)
data2$Logo.Size <- as.factor(data2$Logo.Size)
data2$Design <- as.factor(data2$Design)
data2$Theme <- as.factor(data2$Theme)
data2$Color <- as.factor(data2$Color)
data2$Choice <- as.numeric(as.character(data2$Choice))
##### RUN MODEL #####
m1 <- mlogit(Choice ~ 0 + Name + Logo.Size + Design + Theme + Flavor
+ Color, data = data2)
m1
Does it look like there is a problem with my syntax, or is it likely my data that is the problem?
In a panel setting, it is potentially the case that one or more of your choice cards does not have a TRUE value. One fix would be to drop choice cards that are missing a choice.
## Use data.table
library(data.table)
dt <- as.data.table(data)
## Drop choice cards that received no choice
dt[, full := sum(Choice), by = Choice_id]
dt_full <- dt[full != 0, ]
This is an issue specific to mlogit(). Stata's mixed logit approach, for example, silently ignores missing response variables; R treats this as an issue that needs to be addressed.
I had the same error. It got resolved when I arranged the data by unique ID and alternative ID. For some reason, mlogit requires all the choice instances to be stacked together.
Error in if (abs(x - oldx) < ftol) { : missing value where TRUE/FALSE needed
suggests that, if your response variable is binary (i.e. 1/0), one or more of the values is something other than 1 or 0.
Look at table(data2$Choice) to see if this is the case.
I had a similar issue but eventually figured it out. In my case, it was due to missing values in the covariates, not the choice response.
I had this problem when my data included choice situations (questions that participants were asked) in which none of the choices was selected. Removing those rows fixed the problem.
Just in case others have the same issue: I got this error when I ran my choice model (a maximum difference scaling) with partial missings, e.g. when two choices per task/set had to be made by the respondent but only one choice was made.
I could solve this in the long-format data set by dropping the observations that belonged to the missing choice while keeping the observations where a valid choice was made.
E.g. assume I have a survey with 9 tasks/sets and in each task/set 5 alternatives are provided. In each task my respondents had to make two choices, i.e. selecting one of the 5 alternatives as "most important" and one of the alternatives as "least important". This results in a data set that has 5*9*2 = 90 rows per respondent. There are exactly 5 rows per task*choice combination (e.g. 5 rows for task 1 containing the alternatives, where exactly one of these 5 rows is coded as 1 in the response variable in case it was chosen as the most (or least) important alternative).
Now imagine a respondent only provides a choice for "most important" but not for "least important". In that case the 5 rows for "least important" would all have a 0 in the response variable. Excluding these 5 rows from the data solves the above error and, incidentally, leads to exactly the same results as other tools would provide (e.g. Sawtooth's Lighthouse software).
Re (1):
data2 <- mlogit.data(data=data, choice="Choice",
                     shape="long", varying=5:10,
                     alt.levels=paste("pos", 1:3))
and (2):
m1 <- mlogit(Choice ~ 0 + Name + Logo.Size + Design + Theme + Flavor + Color, data = data2)
In addition to making sure all of the data is filled in, I would highlight that: (1) the level names need to exactly match the part of the variable name after the separator, and (2) the DV in the model needs to be the variable name appearing before the separator.
Example: an original variable "Media" with 5 categories becomes 5 dummy variables ("Med_Radio", "Med_TV", etc.). The level names need to be "Radio", "TV", etc., exactly as written, and you must put "Med" into the model as the DV, not "Media".
This fixed the problem for me.
Ok, I have a data frame with 250 observations of 9 variables. For simplicity, let's just label them A - I
I've done all the standard stuff (converting things to int or factor, creating the data partition, test and train sets, etc).
What I want to do is use columns A and B, and predict column E. I don't want to use the entire set of nine columns, just these three when I make my prediction.
I tried only using the limited columns in the prediction, like this:
myPred <- predict(rfModel, newdata=myData)
where rfModel is my model, and myData only contains the two fields I want to use, as a dataframe. Unfortunately, I get the following error:
Error in predict.randomForest(rfModel, newdata = myData) :
variables in the training data missing in newdata
Honestly, I'm very new to R, and I'm not even sure this is feasible. I think the data that I'm collecting (the nine fields) are important to use for "training", but I can't figure out how to make a prediction using just the "resultant" field (in this case field E) and the other two fields (A and B), and keeping the other important data.
Any advice is greatly appreciated. I can post some of the code if necessary.
I'm just trying to learn more about things like this.
I assume you used the random forest method:
library(randomForest)
model <- randomForest(E ~ A + B, data = train)
pred <- predict(model, newdata = test)
As you can see, in this example only the A and B columns are used to build the model; the others are excluded from model building (though not removed from the dataset). If you want to include all of them, use E ~ .. This also means that if you build your model on all columns, you need those columns in the test set too; predict() won't work without them. If the test data have only the A and B columns, the model has to be built on just those.
Hope it helps.
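A minimal runnable sketch of that idea, using the built-in iris data as a stand-in for the real A, B, and E columns (requires the randomForest package):

```r
library(randomForest)

set.seed(1)
train_idx <- sample(nrow(iris), 100)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Build the model on just two predictors...
model <- randomForest(Species ~ Sepal.Length + Sepal.Width, data = train)

# ...so newdata only needs those two columns at prediction time.
pred <- predict(model, newdata = test[, c("Sepal.Length", "Sepal.Width")])
```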
As I mentioned in my comment above, perhaps you should be building your model using only the A and B columns. If you can't/don't want to do this, then one workaround perhaps would be to simply use the median values for the other columns when calling predict. Something like this:
myData <- data.frame(A = data$A, B = data$B, C = median(data$C), D = median(data$D),
                     E = median(data$E), F = median(data$F), G = median(data$G),
                     H = median(data$H), I = median(data$I))
myPred <- predict(rfModel, newdata=myData)
This would allow you to use your current model, built with 9 predictors. Of course, you would be assuming average behavior for all predictors except for A and B, which might not behave too differently from a model built solely on A and B.
I feel like this should be the easiest thing in the world. Firstly, I am relatively new to R, but I wanted to learn it. That being said, my experience so far suggests that R is not very intuitive. What I was able to figure out in Python within a couple hours has so far taken 2 days without result in R.
I want to regress a selection of dependent variables within a selection of panel data. I have several variables with various normalization curves. I would like to be able to iterate through many instead of writing regressions 1 at a time.
I want to do something like the following: plm(dependent ~ loopedvar + var2 + var3 + var4, data=mydata, model=c("within"))
I have created a varlist using grep, which is actually very easy. Now I want to substitute in the variables in varlist 1-by-1 as the 'loopedvar.'
In python with SPSS I would do something like
nvariables=len(varlist)
for variable in xrange(nvariables):
testvariable=varlist[variable]
spss.Submit("""AREG dependent WITH
{}
var2
var3
var4
/METHOD PW.
""" .format(testvariable))
I have also found this tutorial http://www.ats.ucla.edu/stat/r/pages/looping_strings.htm, but I cannot get it to work, and I do not understand the *apply functions in R. For one, when writing lapply(varlist, function (x) [model]) how does the varlist[var] know where to go?
I have tried for loops with paste and substitute with varying errors.
for (var in 1:length(varlist)) {
models<-plm(substitute(dependent ~ i, list(i=as.name(paste0(var)), as.name("var2"), as.name("var3"), as.name("var4")) data=mydata, model=c("within")))
}
Throws "Error: unexpected symbol in: [...(var4"")) data]"
for (var in 1:length(varlist)) {
  models <- summary(plm(paste0("dependent ~ ", var, " + var2 + var3 + var4"), data=mydata, model=c("within")))
}
Throws "Error: inherits(object, "formula") is not TRUE"
These errors are super unhelpful, and I'm just sick of guessing. R syntax is not very straightforward in my estimation, and the chances that I will get it right are slim.
Please don't post a non-response. R people have a penchant for that in my experience. If I have insufficiently described my issue or desires just request more information, and I will be happy to oblige.
EDIT: I forgot the index argument in plm function. It should be there.
Definitely one of the harder things to wrap one's head around in R is that it does not like the "macro" approach used in some other languages (I learned to code Stata before branching out into R). Almost always there is a way to use the *apply functions instead of a loop-with-macro-reference to do what you want to do.
Here is how I would approach your particular problem.
data <- data.frame(dep = runif(100), var1=runif(100), var2=runif(100),var3=runif(100)) #Create some fake data
varlist<-c("var1","var2","var3") # Declare your varlist as a vector
lm.results<- lapply(data[,varlist],function(x) lm(dep ~ x, data=data)) # run the regression on each variable.
Let me break that last line down a little bit. A data frame in R is actually a list with extra structure, where each item in the list is a variable/column. So lapply(data[,varlist], FUN) will evaluate the function FUN on each column in data[,varlist], i.e. each variable in data whose name is in varlist.
Since there isn't a built-in function for what you need (there often isn't), you declare one on the fly. function(x) lm(dep ~ x, data=data) takes a variable as an argument (in the lapply call, each variable in varlist) and regresses dep on that variable. The results are stored in a new list called lm.results.
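If each model instead needs a different formula (as with plm() and the fixed covariates var2, var3, var4 in the question), base R's reformulate() builds the formula from strings. A sketch with made-up data and hypothetical looped variables x1, x2; lm() stands in here for plm(), which accepts a formula the same way:

```r
set.seed(1)
mydata <- data.frame(dependent = runif(100), x1 = runif(100), x2 = runif(100),
                     var2 = runif(100), var3 = runif(100), var4 = runif(100))
varlist <- c("x1", "x2")

# For each looped variable, build "dependent ~ v + var2 + var3 + var4"
# as a real formula object and fit the model.
models <- lapply(varlist, function(v) {
  f <- reformulate(c(v, "var2", "var3", "var4"), response = "dependent")
  lm(f, data = mydata)
})
```

Each element of `models` is a fitted model whose formula names the looped variable directly, so summaries stay readable.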
I am trying to do an ANOVA analysis in R on a data set with one within factor and one between factor. The data is from an experiment to test the similarity of two testing methods. Each subject was tested in Method 1 and Method 2 (the within factor) as well as being in one of 4 different groups (the between factor). I have tried the aov, the Anova (in the car package), and the ezANOVA functions. I am getting wrong values for every method I try. I am not sure where my mistake is, if it's a lack of understanding of R or of the ANOVA itself. I have included the code I used that I feel should be working. I have tried a ton of variations of this hoping to stumble on the answer. This data set is balanced, but I have a lot of similar data sets and many are unbalanced. Thanks for any help you can provide.
library(car)
library(ez)
#set up data
sample_data <- data.frame(Subject=rep(1:20,2),Method=rep(c('Method1','Method2'),each=20),Level=rep(rep(c('Level1','Level2','Level3','Level4'),each=5),2))
sample_data$Result <- c(4.76,5.03,4.97,4.70,5.03,6.43,6.44,6.43,6.39,6.40,5.31,4.54,5.07,4.99,4.79,4.93,5.36,4.81,4.71,5.06,4.72,5.10,4.99,4.61,5.10,6.45,6.62,6.37,6.42,6.43,5.22,4.72,5.03,4.98,4.59,5.06,5.29,4.87,4.81,5.07)
sample_data[, 'Subject'] <- as.factor(sample_data[, 'Subject'])
#Set the contrasts if needed to run type 3 sums of squares for unbalanced data
#options(contrasts=c("contr.sum","contr.poly"))
#With aov method as I understand it 'should' work
anova_aov <- aov(Result ~ Method*Level + Error(Subject/Method), data=sample_data)
print(summary(anova_aov))
#ezAnova method,
anova_ez = ezANOVA(data=sample_data, wid=Subject, dv = Result, within = Method, between=Level, detailed = TRUE, type=3)
print(anova_ez)
Also, the values I should be getting as output by SAS
SAS Anova
Actually, your R code is correct in both cases. Running these data through SPSS yielded the same result. SAS, like SPSS, seems to require that the levels of the within factor appear in separate columns; you will end up with 20 rows instead of 40. An arrangement like the one below might give you the desired result in SAS:
Subject Level Method1 Method2