I have a doubt regarding the calculation of MSE in R.
I have tried two different ways and I am getting two different results, so I wanted to know which one is the correct way of finding the MSE.
First:
model1 <- lm(data=d, x ~ y)
rmse_model1 <- mean((d - predict(model1))^2)
Second:
mean(model1$residuals^2)
In principle, they should give you the same result. But in the first option, you should use d$x. If you just use d, the recycling rule in R will repeat predict(model1) twice (as d has two columns) and the computation will also involve d$y.
Note that it is recommended to pass na.rm = TRUE to mean, and newdata = d to predict, in the first option. This makes your code robust to missing values in your data. On the other hand, you don't need to worry about NA in the second option, as lm automatically drops NA cases. You may have a look at this thread for the potential effect of this feature: Aligning Data frame with missing values.
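Putting both corrections together, here is a minimal sketch (assuming, as in the question, a data frame d with columns x and y and the model x ~ y):
# Fit the model as in the question.
model1 <- lm(x ~ y, data = d)
# First way, corrected: compare the observed response d$x with the
# predictions on d, dropping any rows that produce NA.
mse1 <- mean((d$x - predict(model1, newdata = d))^2, na.rm = TRUE)
# Second way: lm has already dropped NA cases, so nothing extra is needed.
mse2 <- mean(model1$residuals^2)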
I am using "svyby" function from survey R package, and get an error I don't know how to deal with.
At first I used the variable cntry for grouping, then I used essround for grouping, and both worked smoothly. But when I use their combination, ~ cntry + essround, it returns an error.
I am puzzled that it works separately for each grouping but not for the combined grouping.
This is somehow related to omitted data, because when I drop all the empty cells (i.e. using na.omit(dat) instead of dat when defining the survey design) it starts working. But I don't want to drop all the missing values. I thought the na.rm argument of svymean would deal with it. Note that the variables cntry and essround do not contain any missing values.
library("survey")
s.w <- svydesign(ids = ~1, data = dat, weights = dat[,weight])
svyby(~ Security, by=~ essround, s.w, svymean, na.rm=T) # Works
svyby(~ Security, by=~ cntry, s.w, svymean, na.rm=T) # Also works
svyby(~ Security, by=~ essround+cntry, s.w, svymean, na.rm=T) # Gives an error
Error in tapply(1:NROW(x), list(factor(strata)), function(index) { :
arguments must have same length
So my question is - how to make it work?
UPDATE.
Sorry, I misread the documentation. The problem is solved by adding na.rm.all = TRUE to the svyby function.
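For reference, a sketch of the corrected call, reusing the design object s.w from the question (na.rm.all controls what svyby does with groups in which every value of the variable is missing):
svyby(~ Security, by = ~ essround + cntry, s.w, svymean,
      na.rm = TRUE, na.rm.all = TRUE)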
Forgive me for the late answer, but I was just looking for a solution to a similar problem and have solved it for myself just now. Check whether you have any empty cells in your cross-tabulation of essround, cntry, and Security (using table()). If you do, try transforming the grouping variables into ordered factors with ordered(), explicitly naming your levels with the levels argument, before you run svyby(). Ordered factors will show a frequency of 0 in a cross-tabulation, while regular factors will drop empty cells. A sketch of that check and conversion follows below.
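This is only a sketch, assuming the data frame is called dat as in the question and taking the levels from the data itself (cross-tabulating is.na(Security) rather than Security, since Security looks like a continuous variable):
# Look for empty cells in the cross-tabulation of the grouping variables.
table(dat$essround, dat$cntry, is.na(dat$Security))
# Convert the grouping variables to ordered factors with explicit levels.
dat$essround <- ordered(dat$essround, levels = sort(unique(dat$essround)))
dat$cntry    <- ordered(dat$cntry,    levels = sort(unique(dat$cntry)))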
I don't know exactly why, but here's how I resolved the same issue. It seems to have something to do with the way svyby deals with NA data, even if you specify na.rm = TRUE. I made subsets of my data frame and found that the error happens when a subset is smaller than a certain threshold (500 in my case, but the exact value is to be determined) AND contains NA; it works well for other subsets, such as those larger than 10,000 with NA or smaller than 500 without NA. In your case, there is probably a subset with essround == x & cntry == y which is small and where Security is NA. So clean the data so that it has no NA BEFORE you run svyby (through removal, imputation, or separate grouping, it's up to you) and then try again. It worked for me.
I am creating a correlation matrix, and via the findCorrelation() function from the caret package I am identifying parameters that have a correlation with another parameter higher than 0.75.
After that I am removing the correlated parameters coming out of the findCorrelation command.
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.75, verbose = FALSE)
correlated_var <- colnames(data[, highlyCorrelated])
data.dat <- data[!(names(data) %in% c(correlated_var))]
For completeness' sake, when presenting results later I want to show a list of which parameters were removed, and also which correlations caused their removal.
Is there a way to generate a data frame that contains in the first column the removed parameter, and in the following columns the parameter(s) that that specific parameter was correlated to?
I can call upon certain correlations by using:
correlationMatrix[correlationMatrix[x,]>0.75,x]
Where x is an identified parameter with a correlation higher than 0.75 with other parameter(s). But I am not sure how I can turn this into a data frame or table, in order to present the findings.
Help is much appreciated!
Regards,
Eddy
I got somewhere using the packages plyr and rowr:
library(rowr)  # cbind.fill() comes from the rowr package

cor.table <- matrix(, nrow = 0, ncol = 0)
for (i in sort(highlyCorrelated)) {
  cor.table.i <- c(paste(colnames(correlationMatrix)[i], ":"),
                   names(correlationMatrix[abs(correlationMatrix[i, ]) > 0.75, i]))
  cor.table <- cbind.fill(cor.table, cor.table.i, fill = NA)
}
cor.table <- t(cor.table[c(-1)])
It's a bit of a workaround, and maybe not the prettiest, but at least I get something I can export.
For some reason I can't get rid of the fact that the output also lists each parameter as correlated with itself.
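One way around that is a sketch like the following (not tested against your data, and assuming correlationMatrix has column names and the same 0.75 cutoff): exclude the diagonal when looking up the partners and collect everything into a plain data frame.
# For each removed parameter, list the other parameters it is correlated
# with above the cutoff, excluding the parameter itself.
cor.list <- lapply(sort(highlyCorrelated), function(i) {
  partners <- names(which(abs(correlationMatrix[i, -i]) > 0.75))
  data.frame(removed         = colnames(correlationMatrix)[i],
             correlated_with = paste(partners, collapse = ", "),
             stringsAsFactors = FALSE)
})
cor.table <- do.call(rbind, cor.list)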
OK, I have a data frame with 250 observations of 9 variables. For simplicity, let's just label them A through I.
I've done all the standard stuff (converting things to int or factor, creating the data partition, test and train sets, etc).
What I want to do is use columns A and B, and predict column E. I don't want to use the entire set of nine columns, just these three when I make my prediction.
I tried only using the limited columns in the prediction, like this:
myPred <- predict(rfModel, newdata=myData)
where rfModel is my model, and myData only contains the two fields I want to use, as a dataframe. Unfortunately, I get the following error:
Error in predict.randomForest(rfModel, newdata = myData) :
variables in the training data missing in newdata
Honestly, I'm very new to R, and I'm not even sure this is feasible. I think the data I'm collecting (the nine fields) is important for "training", but I can't figure out how to make a prediction using just the "resultant" field (in this case field E) and the other two fields (A and B), while keeping the other important data.
Any advice is greatly appreciated. I can post some of the code if necessary.
I'm just trying to learn more about things like this.
I assume you used the random forest method:
library(randomForest)
model <- randomForest(E ~ A + B, data = train)
pred <- predict(model, newdata = test)
As you can see, in this example only the A and B columns are used to build the model; the others are left out of the model fitting (though not removed from the dataset). If you want to include all of them, use E ~ . instead. It also means that if you build your model on all columns, you need to have those columns in the test set too; predict won't work without them. If the test data have only the A and B columns, the model has to be built on them alone.
Hope it helps.
As I mentioned in my comment above, perhaps you should be building your model using only the A and B columns. If you can't or don't want to do this, then one workaround would be to simply use the median values for the other columns when calling predict. Something like this:
# Named columns let predict() match them to the training variables; E is the response.
myData <- data.frame(A = data$A, B = data$B, C = median(data$C), D = median(data$D),
                     F = median(data$F), G = median(data$G), H = median(data$H), I = median(data$I))
myPred <- predict(rfModel, newdata=myData)
This would allow you to use your current model, built with all the predictors. Of course, you would be assuming average behavior for every predictor except A and B, so the predictions might not behave too differently from those of a model built solely on A and B.
I would like to find all combinations of vector elements that match a specific condition. The function expand.grid returns all possible combinations without checking for a specific condition. It is possible to test for the condition after using expand.grid, but in some situations the number of possible combinations is too large to generate with expand.grid. Is there therefore a function that allows me to check for a condition while generating all possible combinations?
This is a simplified version of the problem:
A <- seq.int(12, from=0, by=1)*15
B <- seq.int(27, from=0, by=1)*23
C <- seq.int(18, from=0, by=1)*18
D <- seq.int(33, from=0, by=1)*10
out <- expand.grid(A, B, C, D)  # out is a data frame with dimensions 235144 x 4
idx <- which(rowSums(out) <= 400 & rowSums(out) >= 300)  # only a small fraction of 'out' is needed
results <- out[idx, ]
In a word, no. After all, if you knew a priori which combinations were desirable/undesirable, you could exclude them from the expansion, e.g. expand.grid(A[A < 20], B[B < 15], ...). In the general case, which I'm assuming is your real question, you have no simple way to exclude portions of the input vectors.
You might just want to write a multilevel loop which tests each combination in turn and saves or rejects it. This will be slow (again, unless you come up with some clever algorithm to predict regions which are all TRUE or FALSE). So, in the long run, you may be better off using some of the R-packages which partition large calculations (and datasets) so as to avoid exceeding your memory limits.
Now that I've said all that, someone's going to post a link to a package which does exactly that :-(
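In the meantime, here is a minimal sketch of the filter-as-you-go loop described above, using the vectors A to D and the 300-400 condition from the simplified example (the names slice, total, and keep are just illustrative): only three of the vectors are expanded at once, and each slice is filtered before it is kept, so the full 235144-row grid is never held in memory.
# Expand B, C, D once; loop over A and keep only the rows of each
# slice whose total falls inside the target range.
slice <- expand.grid(Var2 = B, Var3 = C, Var4 = D)
keep <- list()
for (a in A) {
  total <- a + rowSums(slice)
  ok <- total >= 300 & total <= 400
  if (any(ok)) {
    keep[[length(keep) + 1]] <- cbind(Var1 = a, slice[ok, , drop = FALSE])
  }
}
results <- do.call(rbind, keep)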
Sorry to ask this ... it's surely a FAQ, and it's kind of a silly question, but it's been bugging me. Suppose I want to get the variance of every numeric column in a dataframe, such as
df <- data.frame(x=1:5,y=seq(1,50,10))
Naturally, I try
var(df)
Instead of giving me what I'd hoped for, which would be something like
    x    y
  2.5  250
I get this
     x   y
x  2.5  25
y 25.0 250
which has the variances on the diagonal and the covariances in the other locations. Which makes sense when I look up help(var) and read that "var is just another interface to cov". Variance is the covariance between a variable and itself, of course. The output is slightly confusing, but I can read along the diagonal, or generate only the variances using diag(var(df)), sapply(df, var), or lapply(df, var), or by calling var repeatedly on df$x and df$y.
But why? Variance is a routine, basic descriptive statistic, second only to mean. Shouldn't it be completely and totally trivial to apply it to columns of a dataframe? Why give me the covariances when I only asked for variances? Just curious. Thanks for any comments on this.
The idiomatic approach is
sapply(df, var)
var has a method for data.frames which deals with data.frames by coercing to a matrix.
Variance is a routine, basic descriptive statistic; so are covariances and correlations. They are all interlinked and interesting, especially if you are aiming to use a linear model.
You could always create your own function that behaves as you want:
Var <- function(x, ...) {
  if (is.data.frame(x)) {
    return(sapply(x, var, ...))
  } else {
    return(var(x, ...))
  }
}
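A quick usage sketch with the example data frame from the question (output shown approximately):
df <- data.frame(x = 1:5, y = seq(1, 50, 10))
Var(df)
#     x     y
#   2.5 250.0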
This is documented in ?var, namely:
Description:
‘var’, ‘cov’ and ‘cor’ compute the variance of ‘x’ and the
covariance or correlation of ‘x’ and ‘y’ if these are vectors. If
‘x’ and ‘y’ are matrices then the covariances (or correlations)
between the columns of ‘x’ and the columns of ‘y’ are computed.
where by "matrices" the text means objects of class "matrix" and "data.frame".
var doesn't have a method for data frames in the conventional sense. var simply coerces the input data frame to a matrix via as.matrix and then calls cov on that matrix.
In response to the question of why: well, I guess that the variance is closely related to the concept of covariance, and to keep the code simple R Core wrote a single implementation for the covariance of a matrix-like object and used it for the variance, as that is the most likely thing you want from a matrix.
Or more succinctly; that is how R Core implemented this. Learn to live with it. :-)
Also note that R is moving away from having functions like mean and sd operate on the components (columns) of a data frame. If you want to apply any of these functions, including var, you are required to call something like:
apply(foo, 2, mean) ## for matrices
sapply(foo, mean) ## for data frames
or faster specific alternatives
colMeans(foo)
In this instance, I doubt that diag(var(df)) will be the most efficient way to get the variances compared with calling var repeatedly via one of the apply family of functions: diag(var(df)) is unlikely to be quicker than sapply(df, var), as the former has to compute all the covariances as well as the variances.
Your actual answer has been covered by @GavinSimpson. For var you could also just use:
sd(df)^2
# x y
# 2.5 250.0
And by doing so you will see what @GavinSimpson means about R "moving away from having functions like mean and sd operate on the components (columns) of a data frame". Deprecated means the functionality may be retired in an upcoming version of R, and your code may break if you don't heed the warning and change it appropriately:
Warning message:
sd() is deprecated.
Use sapply(*, sd) instead.
So we could use:
sapply(df,sd)^2
# x y
# 2.5 250.0
Which gives us the exact same result.
However, it's kind of silly to do it this way, as you are effectively calling (sqrt(var(x, na.rm = na.rm)))^2 on each column! Instead, as @mnel suggests, sapply(df, var) is how you should obtain the variance of each column vector.