I can't seem to get the lm function to work properly on any columns that have 0 as a data value. Here is my code:
project.lm = lm(SalePrice ~Lot.Area + Year.Built + Year.Remod.Add + Gr.Liv.Area +
Yr.Sold + Bsmt.Unf.SF, project.table)
But when I do summary of project.lm, I get literally thousands of variables in my linear model, in fact one variable for each value of Bsmt.Unf.SF. This occurs for all columns where there is a value of 0; otherwise, everything works fine. Any ideas?!?
See the documentation for read.csv and read.table: there's an argument called stringsAsFactors, which is TRUE by default in R versions before 4.0.0. Set it to FALSE and you may be happier :-)
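A minimal sketch of how a single non-numeric entry can turn a numeric column into a factor at import time, which lm then expands into one dummy variable per level. The CSV text and the "n/a" marker are made up for illustration; the actual cause in project.table may be different.

```r
# Hypothetical data: one stray "n/a" entry in an otherwise numeric column.
csv_text <- "SalePrice,Bsmt.Unf.SF\n100,0\n200,150\n300,n/a\n"

# With stringsAsFactors = TRUE the whole column is read as a factor.
raw <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(raw$Bsmt.Unf.SF)    # "factor"

# Declaring the stray marker as missing lets R read the column as numbers.
clean <- read.csv(text = csv_text, na.strings = c("NA", "n/a"))
class(clean$Bsmt.Unf.SF)  # "integer"
```

Running str(project.table) before fitting is a quick way to see which columns were read as factor or character instead of numeric.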
I am using a dataset with a column mvar_1 that holds the name of one of 5 parties each citizen voted for last year. The other variables are demographic variables, such as the number of rallies attended for each party.
When I use the following code:
data.model.rf = randomForest(mvar_1 ~ mvar_2 + mvar_3 + mvar_4 + mvar_5 +
mvar_6 + mvar_7 + mvar_8 + mvar_9 + mvar_10 +
mvar_11 + mvar_15 + mvar_17 + mvar_18 + mvar_21 +
mvar_22 + mvar_23 + mvar_24 + mvar_25 + mvar_26 +
mvar_28, data=data.train, ntree=20000, mtry=15,
importance=TRUE, na.action = na.omit )
This error message appears:
Error in randomForest.default(m, y, ...) :
Can not handle categorical predictors with more than 53 categories.
One of your mvar is a factor with more than 53 levels.
You may have a categorical variable with lots of levels, like demographic group, and you should aggregate it into fewer levels to use this package. (See here for the best way of doing it)
More likely, you have a non-categorical variable incorrectly typed as a factor. In this case you should fix it by typing your variable correctly. E.g. to get a numeric from a factor, you call as.numeric(as.character(myfactor)).
If you don't know what a factor is, the second option is probably it. You should call summary on data.train; this will help you see which mvar columns are incorrectly typed. If an mvar is typed as numeric, you will see min, max, mean, median, etc. If a numeric variable is incorrectly typed as a factor, you will instead see the number of occurrences of each level.
In any case, calling summary will help you because it shows the number of levels for each factor. The variables with >53 levels are causing the issue.
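As a rough sketch of that check, the following counts the levels of every factor column and converts the offending one back to numeric. The data frame here is invented for illustration and only borrows the mvar_* naming from the question.

```r
# Hypothetical stand-in for data.train: mvar_3 is numeric data mistyped as a factor.
set.seed(1)
data.train <- data.frame(
  mvar_2 = factor(sample(c("a", "b", "c"), 100, replace = TRUE)),
  mvar_3 = factor(seq(0.5, 50, by = 0.5)),  # 100 distinct values, so 100 levels
  mvar_4 = rnorm(100)
)

# Number of levels per factor column (NA for non-factor columns).
n_levels <- sapply(data.train, function(col) {
  if (is.factor(col)) nlevels(col) else NA_integer_
})
too_many <- names(n_levels)[!is.na(n_levels) & n_levels > 53]
print(too_many)

# Convert via character so the original values, not the level codes, are recovered.
data.train$mvar_3 <- as.numeric(as.character(data.train$mvar_3))
```

Calling as.numeric directly on a factor would return the internal level codes, which is why the detour through as.character matters.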
I had the same problem, but solved it after seeing that my data used commas as decimal separators and I had imported the data frame without indicating it.
After importing the table using read.table(data, dec=",") the problem was solved!
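A small sketch of the decimal-comma effect, using made-up inline text in place of a file. The semicolon separator and the column names are assumptions for the example.

```r
# Hypothetical file content with decimal commas and semicolon-separated fields.
txt <- "x;y\n1,5;a\n2,25;b\n"

# Without dec = ",", "1,5" is not a valid number, so x is not read as numeric.
wrong <- read.table(text = txt, header = TRUE, sep = ";", stringsAsFactors = TRUE)
class(wrong$x)  # "factor"

# With dec = ",", the same column is read as numeric.
right <- read.table(text = txt, header = TRUE, sep = ";", dec = ",",
                    stringsAsFactors = TRUE)
class(right$x)  # "numeric"
```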
I encountered the same problem - error message suggesting >53 factor levels, but none of my variables were like that.
Upon further inspection, I found I had some factor variables with empty levels.
I used the forcats function fct_drop to remove these, then everything worked!
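For readers without forcats installed, base R's droplevels does a similar job. This is a swapped-in base-R alternative, not the fct_drop call itself, and the factor below is invented for illustration.

```r
# A factor keeps its full level set even after values are filtered out.
f <- factor(c("a", "b", "c"))
sub <- f[f != "c"]
nlevels(sub)              # 3: the empty level "c" is still there

# droplevels() removes levels that no longer occur in the data.
trimmed <- droplevels(sub)
nlevels(trimmed)          # 2
```

droplevels also accepts a whole data frame, in which case it drops unused levels from every factor column.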
As antoine-sac pointed out, in my case this error was caused by numeric variables appearing as factors, except that R did the conversion itself when importing my (numeric) file.
Casting the factors as numerics didn't work. But what worked was using strip.white = TRUE when importing the dataset. (I found this solution here.)
This error occurs when you train your model with the entire dataset and not with the training data. Try fitting the model on the training data and then use the test data to perform prediction.
First, I have 2 features which are initially character vectors.
train_address = train$address
test_address = test$address
and then I bind them together.
address = c(train_address, test_address)
Then I convert them from character to integer, because I will dummy-code them later and I want the processing to be faster (the characters are not in English).
train_address = as.integer(factor(train_address, levels = unique(address)))
test_address = as.integer(factor(test_address, levels = unique(address)))
Now, here is the problem. The code is shown below.
My goal is to set to 0 every value that appears in train but not in test.
for (a in train_address) {
if (!(train_address[a] %in% test_address)) {
train_address[a] = 0
}
}
train_address = as.factor(train_address)
test_address = as.factor(test_address)
After I process the data in this way, it should be:
the number of factor levels of test + 1 = the number of factor levels of train
(because R starts from 1, so 0 is not used until I set some of the train values to 0 via the for loop above)
But in reality, the difference between the number of factor levels of train and of test is 400+.
I know there must be something wrong with the code, but I don't know where...
The following should do the trick.
You don't need a loop for this; use vectorized manipulation instead.
train_address[!(train_address %in% test_address)] <- 0
Explanation:
(train_address %in% test_address) gives a boolean vector where TRUE means the element of train_address is in test_address
! negates that boolean vector
train_address[!(train_address %in% test_address)] gives all the elements in train_address that are not in test_address.
Finally, you set them to zero with the command train_address[!(train_address %in% test_address)] <- 0
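The explanation above can be traced on a tiny example. The integer codes below are invented for illustration and are much smaller than the real address vectors.

```r
# Hypothetical integer codes standing in for the encoded addresses.
train_address <- c(1L, 2L, 3L, 4L, 2L)
test_address  <- c(2L, 4L, 5L)

# TRUE where a train value also occurs in test.
in_test <- train_address %in% test_address
print(in_test)            # FALSE TRUE FALSE TRUE TRUE

# Zero out every train value that never occurs in test.
train_address[!in_test] <- 0L
print(train_address)      # 0 2 0 4 2
```

One likely reason the original loop misbehaves is that for (a in train_address) iterates over the values of the vector, which are then used as positions in train_address[a]. A loop over seq_along(train_address) would iterate over positions instead, but the vectorized form avoids the issue entirely.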
I am new to writing loops and I have some difficulties there. I already looked through other questions, but didn't find the answer to my specific problem.
So let's just create a random dataset, give it column names and set the variables as character:
d<-data.frame(replicate(4,sample(1:9,197,rep=TRUE)))
colnames(d)<-c("variable1","variable2","trait1","trait2")
d$variable1<-as.character(d$variable1)
d$variable2<-as.character(d$variable2)
Now I define the vector over which I want to loop. It corresponds to trait 1 and trait 2:
trt.nm <- names(d[c(3,4)])
Now I want to apply the following model for trait 1 and trait 2 (which should now be as column names in trt.nm) in a loop:
library(lme4)
for(trait in trt.nm)
{
lmer (trait ~ 1 + variable1 + (1|variable2) ,data=d)
}
Now I get the error that variable lengths differ. How could this be explained?
If I apply the model without loop for each trait, I get a result, so the problem has to be somewhere in the loop, I think.
trait is a string, so you'll have to convert it to a formula for this to work; see http://www.cookbook-r.com/Formulas/Creating_a_formula_from_a_string/ for more info.
Try this (you'll have to add a print statement or save the result to actually see what it does, but this will run without errors):
for(trait in trt.nm) {
lmer(as.formula(paste(trait, " ~ 1 + variable1 + (1|variable2)")), data = d)
}
Another suggestion would be to use a list and lapply or purrr::map instead. Good luck!
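A sketch of the lapply variant, under the assumption that the goal is to keep one fitted model per trait in a named list. The names formulas and fits are introduced here for illustration, and the model fitting only runs if lme4 is installed.

```r
# Recreate the example data from the question.
set.seed(42)
d <- data.frame(replicate(4, sample(1:9, 197, replace = TRUE)))
colnames(d) <- c("variable1", "variable2", "trait1", "trait2")
d$variable1 <- as.character(d$variable1)
d$variable2 <- as.character(d$variable2)
trt.nm <- c("trait1", "trait2")

# Build one formula per trait from its name.
formulas <- lapply(trt.nm, function(trait) {
  as.formula(paste(trait, "~ 1 + variable1 + (1 | variable2)"))
})
names(formulas) <- trt.nm

# Fit each model and keep the results, so nothing is discarded as in a bare loop.
if (requireNamespace("lme4", quietly = TRUE)) {
  fits <- lapply(formulas, function(f) lme4::lmer(f, data = d))
  print(lapply(fits, lme4::fixef))
}
```

Because the data are random, lmer may report a singular fit here; that is a property of the toy data, not of the loop.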
Bit of an R novice here, so it might be a very simple problem.
I've got a dataset with GENDER (being a binary variable) and a whole lot of numerical variables. I wanted to write a simple function that checks for equality of variance and then performs the appropriate t-test.
So my first attempt was this:
genderttest<-function(x){ # x = outcome variable
attach(Dataset)
on.exit(detach(Dataset))
VARIANCE<-var.test(Dataset[GENDER=="Male",x], Dataset[GENDER=="Female",x])
if(VARIANCE$p.value<0.05){
t.test(x~GENDER)
}else{
t.test(x~GENDER, var.equal=TRUE)
}
}
This works well outside of a function (replacing the x, of course), but gave me an error here because variable lengths differ.
So I thought it might be handling the NA cases strangely and I should clean up the dataset first and then perform the tests:
genderttest<-function(x){ # x = outcome variable
Dataset2v<-subset(Dataset,select=c("GENDER",x))
Dataset_complete<-na.omit(Dataset2v)
attach(Dataset_complete)
on.exit(detach(Dataset_complete))
VARIANCE<-var.test(Dataset_complete[GENDER=="Male",x], Dataset_complete[GENDER=="Female",x])
if(VARIANCE$p.value<0.05){
t.test(x~GENDER)
}else{
t.test(x~GENDER, var.equal=TRUE)
}
}
But this gives me the same error.
I'd appreciate if anyone could point out my (probably stupid) mistake.
I believe the problem is that t.test(x~GENDER) treats x in the formula as a literal variable name, not as the column whose name is stored in x. Dataset doesn't have a variable called x in it, so the formula ends up picking up the character string from your function instead of a data column.
A solution that should work is to call one of the following, depending on the result of the variance test:
do.call('t.test', args=list(formula=as.formula(paste0(x,'~GENDER')), data=Dataset))
do.call('t.test', args=list(formula=as.formula(paste0(x,'~GENDER')), var.equal=TRUE, data=Dataset))
which will call t.test() and pass the value of x as part of the formula argument rather than the literal name x (i.e. score ~ GENDER instead of x ~ GENDER).
The reason for the particular error you saw is that GENDER has length equal to the number of rows in Dataset, while x is a single character string, so the two variable lengths differ.
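Putting this together, here is one possible rewrite of the function that avoids attach altogether. It is only a sketch: the name gender_ttest, its data argument, and the simulated Dataset with a score column are assumptions made for the example.

```r
# Builds the formula from the column name and passes the data explicitly.
gender_ttest <- function(outcome, data) {
  f <- as.formula(paste(outcome, "~ GENDER"))
  # F test for equality of variances between the two groups.
  variance <- var.test(f, data = data)
  # Pooled-variance t-test only if the variances do not differ significantly.
  t.test(f, data = data, var.equal = variance$p.value >= 0.05)
}

# Hypothetical data with a two-level GENDER column and one numeric outcome.
set.seed(1)
Dataset <- data.frame(
  GENDER = rep(c("Male", "Female"), each = 20),
  score  = rnorm(40)
)
result <- gender_ttest("score", Dataset)
print(result)
```

Passing data = explicitly means the formula is resolved against the data frame's columns, so nothing depends on what happens to be attached or on a variable named x in the calling function.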