Sorry to ask this ... it's surely a FAQ, and it's kind of a silly question, but it's been bugging me. Suppose I want to get the variance of every numeric column in a dataframe, such as
df <- data.frame(x=1:5,y=seq(1,50,10))
Naturally, I try
var(df)
Instead of giving me what I'd hoped for, which would be something like
x y
2.5 250
I get this
x y
x 2.5 25
y 25.0 250
which has the variances in the diagonal, and covariances in other locations. Which makes sense when I lookup help(var) and read that "var is just another interface to cov". Variance is covariance between a variable and itself, of course. The output is slightly confusing, but I can read along the diagonal, or generate only the variances using diag(var(df)), sapply(df, var), or lapply(df, var), or by calling var repeatedly on df$x and df$y.
But why? Variance is a routine, basic descriptive statistic, second only to mean. Shouldn't it be completely and totally trivial to apply it to columns of a dataframe? Why give me the covariances when I only asked for variances? Just curious. Thanks for any comments on this.
The idiomatic approach is
sapply(df, var)
var has a method for data.frames which deals with data.frames by coercing to a matrix.
Variance is a routine basic descriptive statistic, so are covariances and correlations. They are all interlinked and interesting , especially if you are aiming to use a linear model.
You could always create your own function to perform as you want
Var <- function(x,...){
if(is.data.frame(x)) {
return(sapply(x, var,...))} else { return(var(x,...))}
}
This is documented in ?var, namely:
Description:
‘var’, ‘cov’ and ‘cor’ compute the variance of ‘x’ and the
covariance or correlation of ‘x’ and ‘y’ if these are vectors. If
‘x’ and ‘y’ are matrices then the covariances (or correlations)
between the columns of ‘x’ and the columns of ‘y’ are computed.
where by "matrices" the text means objects of class "matrix" and "data.frame".
var doesn't have a method for data frames in the conventional sense. var simply coerces the input data frame to a matrix via as.matrix and then calls cov on that matrix.
In response to the question why, well I guess that the variance is closely related to the concept of covariance and to keep code simple R Core wrote a single implementation for the covariance of a matrix-like object and used this for the variance as that is the most likely thing you want from a matrix.
Or more succinctly; that is how R Core implemented this. Learn to live with it. :-)
Also note that R is moving away from having functions like mean and sd operate on the components (columns) of a data frame. If you want to apply any of these functions, including var, you are required to call something like:
apply(foo, 2, mean) ## for matrices
sapply(foo, mean) ## for data frames
or faster specific alternatives
colMeans(foo)
In this instance, I suspect that diag(var(df)) will be the most efficient way to get the variances instead of calling var repeatedly via one of the apply family of functions. diag(var(df)) is unlikely to be quicker than sapply(df, var) as the former has to compute all the covariances as well as the variances.
Your actual answer has been covered by #GavinSimpson. For var you could also just use:
sd(df)^2
# x y
# 2.5 250.0
And by doing so you will see what #GavinSimpson means about R "moving away from having functions like mean and sd operate on the components (columns) of a data frame". Deprecated means the functionality maybe be retired with an impending version change of R and your code may break if you dont heed the warning and change appropriately:
Warning message:
sd() is deprecated.
Use sapply(*, sd) instead.
So we could use:
sapply(df,sd)^2
# x y
# 2.5 250.0
Which gives us the exact same result.
However, it's kinda silly to do it this way as you are effectively calling (sqrt(var(x, na.rm = na.rm)))^2 on each column! Instead as #mnel suggests, sapply( df , var) is how you should obtain the variance for each column vector.
Related
I'm trying to use lm() and matchit() on a subset of covariates. I have generated an arbitrary number of columns with prefix "covar", i.e. "covar.1", "covar.2", etc. I'd like to do something like
lm(group ~ covars, data=df)
where covars is a vector of strings c("covar.1", "covar.2", ...).
I tried several things like
cols <- colnames(df)
covars <- cols[grep("covar", colnames(df))]
m.out <- matchit(group ~ covars, data=df, method="nearest", distance="logit", caliper=.20)
but got variable lengths differ (found for 'covars').
Defining a new dataframe only with covars and group can work but that defeats my purpose using matchit because I want the matched data to have other columns, too, not just covars I picked to be the matched on.
This seems to be an easy task but somehow I can't figure out after some googling. Not sure what R formula expects there as subset of columns. Any help is appreciated.
You might want to use as.formula.
Try doing this:
Replace group ~ covars
with as.formula(paste('group','~', paste(covars, collapse="+"))))
I mentioned this in your other question, but the cobalt package has a function specifically for this, which is f.build(). The first argument to f.build() is a string containing the name of the treatment variable (or left hand side of the formula), and the second argument is a string vector containing the names of the variables to be on the right hand side of the formula (i.e., the covariates). The second argument can also be a data.frame containing the covariates; f.build() simply extracts the names. It then performs the operation described in the chosen answer, bit adds in a few other aspects that make it a little more general and robust to errors.
The cobalt documentation has a section on f.build() and uses its use with glm() and matchit() as examples.
After running matchit(), you can assess balance on the covariates using the bal.tab() function in cobalt, which is compatible with MatchIt:
bal.tab(m.out, un = TRUE)
The documentation for cobalt explains its use with MatchIt in detail.
I am having some doubt regarding the calculation of MSE in R.
I have tried two different ways and I am getting two different results. Wanted to know which one is the correct way of finding mse.
First:
model1 <- lm(data=d, x ~ y)
rmse_model1 <- mean((d - predict(model1))^2)
Second:
mean(model1$residuals^2)
In principle, they should give you the same result. But in the first option, you should use d$x. If you just use d, recycling rule in R will repeat predict(model1) twice (as d has two columns) and the computation will also involve d$y.
Note that it is recommended to include na.rm = TRUE to mean, and newdata = d to predict in the first option. This makes your code robust to missing values in your data. On the other hand you don't need worry about NA in the second option, as lm automatically drops NA cases. You may have a look at this thread for potential effect of this feature: Aligning Data frame with missing values.
I am still quite new to r (used to program in Matlab) and I am trying use the parallel package to speed up some calculations. Below is an example which I am trying to calculate the rolling standard deviation of a matrix (by column) with the use of zoo package, with and without parallelising the codes. However, the shape of the outputs came out to be different.
# load library
library('zoo')
library('parallel')
library('snow')
# Data
z <- matrix(runif(1000000,0,1),100,1000)
#This is what I want to calculate with timing
system.time(zz <- rollapply(z,10,sd,by.column=T, fill=NA))
# Trying to achieve the same output with parallel computing
cl<-makeSOCKcluster(4)
clusterEvalQ(cl, library(zoo))
system.time(yy <-parCapply(cl,z,function(x) rollapplyr(x,10,sd,fill=NA)))
stopCluster(cl)
My first output zz has the same dimensions as input z, whereas output yy is a vector rather than a matrix. I understand that I can do something like matrix(yy,nrow(z),ncol(z)) however I would like to know if I have done something wrong or if there is a better way of coding to improve this. Thank you.
From the documentation:
parRapply and parCapply always return a vector. If FUN always returns
a scalar result this will be of length the number of rows or columns:
otherwise it will be the concatenation of the returned values.
And:
parRapply and parCapply are parallel row and column apply functions
for a matrix x; they may be slightly more efficient than parApply but
do less post-processing of the result.
So, I'd suggest you use parApply.
I'm trying to run a regression for every zipcode in my dataset and save the coefficients to a data frame but I'm having trouble.
Whenever I run the code below, I get a data frame called "coefficients" containing every zip code but with the intercept and coefficient for every zipcode being equal to the results of the simple regression lm(Sealed$hhincome ~ Sealed$square_footage).
When I run the code as indicated in Ranmath's example at the link below, everything works as expected. I'm new to R after many years with STATA, so any help would be greatly appreciated :)
R extract regression coefficients from multiply regression via lapply command
library(plyr)
Sealed <- read.csv("~/Desktop/SEALED.csv")
x <- function(df) {
lm(Sealed$hhincome ~ Sealed$square_footage)
}
regressions <- dlply(Sealed, .(Sealed$zipcode), x)
coefficients <- ldply(regressions, coef)
Because dlply takes a ... argument that allows additional arguments to be passed to the function, you can make things even simpler:
dlply(Sealed,.(zipcode),lm,formula=hhincome~square_footage)
The first two arguments to lm are formula and data. Since formula is specified here, lm will pick up the next argument it is given (the relevant zipcode-specific chunk of Sealed) as the data argument ...
You are applying the function:
x <- function(df) {
lm(Sealed$hhincome ~ Sealed$square_footage)
}
to each subset of your data, so we shouldn't be surprised that the output each time is exactly
lm(Sealed$hhincome ~ Sealed$square_footage)
right? Try replacing Sealed with df inside your function. That way you're referring to the variables in each individual piece passed to the function, not the whole variable in the data frame Sealed.
The issue is not with plyr but rather in the definition of the function. You are calling a function, but not doing anything with the variable.
As an analogy,
myFun <- function(x) {
3 * 7
}
> myFun(2)
[1] 21
> myFun(578)
[1] 21
If you run this function on different values of x, it will still give you 21, no matter what x is. That is, there is no reference to x within the function. In my silly example, the correction is obvious; in your function above, the confusion is understandable. The $hhincome and $square_footage should conceivably serve as variables.
But you want your x to vary over what comes before the $. As #Joran correctly pointed out, swap sealed$hhincome with df$hhincome (and same for $squ..) and that will help.
New to R and having problem with a very simple task! I have read a few columns of .csv data into R, the contents of which contains of variables that are in the natural numbers plus zero, and have missing values. After trying to use the non-parametric package, I have two problems: first, if I use the simple command bw=npregbw(ydat=y, xdat=x, na.omit), where x and y are column vectors, I get the error that "number of regression data and response data do not match". Why do I get this, as I have the same number of elements in each vector?
Second, I would like to call the data ordered and tell npregbw this, using the command bw=npregbw(ydat=y, xdat=ordered(x)). When I do that, I get the error that x must be atomic for sort.list. But how is x not atomic, it is just a vector with natural numbers and NA's?
Any clarifications would be greatly appreciated!
1) You probably have a different number of NA's in y and x.
2) Can't be sure about this, since there is no example. If it is of following type:
x <- c(3,4,NA,2)
Then ordered(x) should work fine. Please provide an example of your case.
EDIT: You of course tried bw=npregbw(ydat=y, xdat=x)? ordered() makes your vector an ordered factor (see ?ordered), which is not an atomic vector (see 2.1.1 link and ?factor)
EDIT2: So the problem was the way of subsetting data. Note the difference in various ways of subsetting. data$x and data[,i] (where i = column number of column x) give you vectors, while data[c("x")] and data[i] give a data frame. Functions expect vectors, unless they call for data = (your data). In that case they work with column names