Use a function on multiple columns (variables) in R

I am trying to run tests of homogeneity of variance using the leveneTest function from the car package. I can run the test on a single variable like so (using the iris dataset as an example):
library(car)
library(datasets)
data(iris)
leveneTest(iris$Sepal.Length, iris$Species)
However, I would like to run the test on all the dependent variables in the dataset simultaneously (so Sepal.Length, Sepal.Width, Petal.Length, Petal.Width). I am guessing it has something to do with the apply family of functions (sapply, lapply, tapply) but I just can't figure out how. The closest I came is something like this:
lapply(iris, leveneTest(group = iris$Species))
However, I get the error:
Error in leveneTest.default(group = iris$Species) :
argument "y" is missing, with no default
Which I understand is probably because it isn't able to specify the outcome variables. I am certain I must be missing some obvious use of the apply functions, but I just don't understand what it is. Apologies for the basic question, but I am relatively new to R and am often applying the same function to multiple variables (usually by copying the code several times), so it would be great to understand how to use these functions properly :)

Common parameters to the function need to be passed to ... within lapply. Like this:
lapply(subset(iris, select = -Species), leveneTest, group = iris$Species)
help("lapply") explains that ... is for "optional arguments to FUN" (meaning optional for lapply not for FUN) and provides lapply(x, quantile, probs = 1:3/4) as an example.
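As a minimal sketch of how lapply forwards extra arguments, the example below uses only base R so that it runs without car. fligner.test (from the stats package) is swapped in for leveneTest purely for illustration: it is a different homogeneity-of-variance test that also takes a numeric vector and a grouping factor. The names num_vars, tests and p_values are made up for this sketch.

```r
# Columns to test: everything except the grouping variable
num_vars <- subset(iris, select = -Species)

# Each column becomes the first argument of FUN; `g = iris$Species`
# is forwarded unchanged to every call through lapply's `...`
tests <- lapply(num_vars, fligner.test, g = iris$Species)

# Each element is an "htest" object; collect the p-values in a named vector
p_values <- sapply(tests, function(res) res$p.value)
print(p_values)
```

The same pattern should carry over to leveneTest by replacing the function and naming its grouping argument (group) instead of g.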

Piggybacking on @Roland's answer, you can do the following in base R as well:
lapply(iris[,-5], leveneTest, group = iris$Species)
the -5 is obviously specific to the iris dataset. You could replace it with a variable like
lapply(iris[,-length(iris)]....
and that would let you remove the last element of the df, assuming your grouping variable is last.
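If the grouping column is not guaranteed to be last, one option is to pick the columns by type or by name rather than by position. A small base R sketch, where is_num, by_name and group_vars are illustrative names and per-group variances stand in for the actual test:

```r
# Logical vector marking numeric columns; Species (a factor) is FALSE
is_num <- sapply(iris, is.numeric)

# Equivalent selection by name, dropping the grouping column explicitly
by_name <- iris[setdiff(names(iris), "Species")]

# Apply a function to each selected column: variance within each species
group_vars <- lapply(iris[is_num], function(v) tapply(v, iris$Species, var))
print(group_vars$Sepal.Length)
```

Selecting by type keeps working if columns are reordered, whereas a negative index silently drops whatever happens to sit in that position.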
Additionally, as a data.table fanboy, I'll add an option for you to use that as well, if you're interested. Here dt.iris is assumed to be a data.table copy of iris, e.g. created with as.data.table(iris).
dt.iris[, lapply(.SD, leveneTest, group = Species), .SDcols = !'Species']
This code lets you 'remove' the Species column from the lapply call in a similar manner to the base R examples above, but by naming it explicitly via .SDcols, which controls the columns that .SD contains. Then you run your analysis in a fairly straightforward manner. Hope this helps!

Related

Creating a formula in R [duplicate]

This question already has an answer here:
Loop for Shapiro-Wilk normality test for multiple variables in R
I am trying to create a formula which I can use to quickly check different variables for normality. I'm new to R and am not quite sure how to proceed. This is my attempt, but it does not work:
normality_test <- function(my_data) { shapiro.test(my_data$"x") }
My goal is to be able to use the formula as follows:
normality_test("variable name")
Use [[ to access column data.
normality_test<- function(my_data, col) shapiro.test(my_data[[col]])
You can use it as :
normality_test(my_data, "var1")
normality_test(my_data, "var2")
To apply normality_test for all the columns, you could use :
result <- lapply(names(my_data), normality_test, my_data = my_data)
However, if you want to run this for all the columns you can directly use
result <- lapply(my_data, shapiro.test)
with no need to create normality_test function.
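To make the lapply(my_data, shapiro.test) approach concrete, here is a hedged sketch on a few columns of the built-in mtcars data. The column choice and the names vars, tests and summary_tbl are assumptions for illustration only.

```r
# Run the Shapiro-Wilk test on a few numeric columns of mtcars
vars <- c("mpg", "qsec", "wt")
tests <- lapply(mtcars[vars], shapiro.test)

# Pull the W statistic and p-value out of each "htest" result
summary_tbl <- data.frame(
  variable = names(tests),
  W        = sapply(tests, function(res) unname(res$statistic)),
  p_value  = sapply(tests, function(res) res$p.value),
  row.names = NULL
)
print(summary_tbl)
```

Collecting the results in a data frame makes it easier to scan many variables at once than printing each test separately.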
Here is a working solution for you. The main difference from yours is the use of [ ] notation as opposed to $ notation for variable extraction, and that mine passes both the data and the variable name to the function. Be sure to select only variables which are numeric or can be coerced to numeric for use with the function. Also, since the function now has two arguments and the first one is the data, you can use the magrittr pipe (%>%) to make it more readable and use the function over a data set.
test <- mtcars
normality_test<- function(my_data, x) {
return(shapiro.test(as.numeric(my_data[,x])))
}
normality_test(test, "qsec")

Fit model on a subset of columns in dataframe in R

I'm trying to use lm() and matchit() on a subset of covariates. I have generated an arbitrary number of columns with prefix "covar", i.e. "covar.1", "covar.2", etc. I'd like to do something like
lm(group ~ covars, data=df)
where covars is a vector of strings c("covar.1", "covar.2", ...).
I tried several things like
cols <- colnames(df)
covars <- cols[grep("covar", colnames(df))]
m.out <- matchit(group ~ covars, data=df, method="nearest", distance="logit", caliper=.20)
but got variable lengths differ (found for 'covars').
Defining a new dataframe with only covars and group can work, but that defeats my purpose in using matchit, because I want the matched data to have the other columns too, not just the covars I picked to be matched on.
This seems to be an easy task, but somehow I can't figure it out after some googling. I'm not sure what an R formula expects there as a subset of columns. Any help is appreciated.
You might want to use as.formula.
Try doing this:
Replace group ~ covars
with as.formula(paste('group', '~', paste(covars, collapse = "+")))
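A minimal sketch of building the formula from a character vector, using a simulated data frame as a stand-in for the asker's df (the data and the names f_paste and f_reform are assumptions). It also shows reformulate() from the stats package, which performs the same construction, and fits lm() rather than matchit() so the example needs no extra packages.

```r
set.seed(1)
# Simulated stand-in for the asker's df (column names follow the question)
df <- data.frame(
  group   = rbinom(50, 1, 0.5),
  covar.1 = rnorm(50),
  covar.2 = rnorm(50),
  other   = runif(50)   # a column that should stay out of the formula
)

# Names of all columns starting with "covar"
covars <- grep("^covar", names(df), value = TRUE)

# Two ways to build group ~ covar.1 + covar.2 from the character vector
f_paste  <- as.formula(paste("group ~", paste(covars, collapse = " + ")))
f_reform <- reformulate(covars, response = "group")

fit <- lm(f_paste, data = df)   # data = df still contains every column
print(f_paste)
```

Because the full df is passed as data, columns that are not in the formula (such as other) remain available afterwards.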
I mentioned this in your other question, but the cobalt package has a function specifically for this, which is f.build(). The first argument to f.build() is a string containing the name of the treatment variable (or left hand side of the formula), and the second argument is a string vector containing the names of the variables to be on the right hand side of the formula (i.e., the covariates). The second argument can also be a data.frame containing the covariates; f.build() simply extracts the names. It then performs the operation described in the chosen answer, but adds in a few other aspects that make it a little more general and robust to errors.
The cobalt documentation has a section on f.build() and demonstrates its use with glm() and matchit() as examples.
After running matchit(), you can assess balance on the covariates using the bal.tab() function in cobalt, which is compatible with MatchIt:
bal.tab(m.out, un = TRUE)
The documentation for cobalt explains its use with MatchIt in detail.

Levene test in R

I am having a little problem doing a Levene test in R. I do not get any output value, only NaN. Does anyone know what the problem might be?
I used the code:
with(Test,levene.test(Sample1,Sample2,location="median"))
Best regards
The levene.test function assumes the data are in a single vector. The second argument is a grouping variable.
Concatenate your data using the c() function: data = c(Sample1, Sample2). Construct a vector of group names like gp = rep(c('Gp1', 'Gp2'), each = 240). Then call the function as follows: levene.test(data, gp, location = 'median').
This can also be done directly:
levene.test(c(Sample1, Sample2), rep(c('Gp1', 'Gp2'), each = 240), location = 'median')
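The stacking step can be sketched in base R as follows. Sample1 and Sample2 are simulated here as stand-ins for the asker's data, and instead of calling levene.test the sketch computes the median-centred Levene test by hand, as a one-way ANOVA on absolute deviations from each group's median, so that it runs without any extra package. It is an illustration of the idea, not the lawstat implementation.

```r
set.seed(42)
# Simulated stand-ins for the asker's two samples of 240 values each
Sample1 <- rnorm(240, sd = 1)
Sample2 <- rnorm(240, sd = 2)

# One long vector of values plus a matching grouping factor
values <- c(Sample1, Sample2)
gp <- factor(rep(c("Gp1", "Gp2"), each = 240))

# Absolute deviation of each value from its own group's median
abs_dev <- abs(values - ave(values, gp, FUN = median))

# One-way ANOVA on those deviations: the F test is the median-centred Levene test
levene_tbl <- anova(lm(abs_dev ~ gp))
print(levene_tbl)
```

Note that the group labels must be wrapped in c(): in rep('Gp1', 'Gp2', each = 240) the second string is taken as the times argument rather than as a second label.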

Recall different data names inside loop

Here is how I created a number of data sets with names data_1, data_2, data_3, and so on.
Initially, data has 500 rows and 17 columns:
for ( i in 1:length(unique( data$cluster ))) {
assign(paste("data", i, sep = "_"),subset(data[data$cluster == i,]))
}
Up to this point everything is fine.
Now I am trying to use these inside another loop, one by one, like:
for (i in 1:5) {
data<- paste(data, i, sep = "_")
}
However, this does not give me the data in the required format.
Any help will be really appreciated.
Thank you in advance.
Let me give you a tip here: Don't just assign everything in the global environment, but use lists for this. That way you avoid all the things that can go wrong when meddling with the global environment. The code you have in your question will overwrite the original dataset data, so you'll be in trouble if you want to rerun that code when something went wrong. You'll have to reconstruct the original dataframe.
Second: If you need to split a data frame based on a factor and carry out some code on each part, you should take a look at split, by and tapply, or at the plyr and dplyr packages.
Using Base R
With base R, it depends on what you want to do. In the most general case you can use a combination of split() and lapply or even a for loop:
mylist <- split( data, f = data$cluster)
for(mydata in mylist){
print(head(mydata))  # explicit print(): values are not auto-printed inside a for loop
# ...
}
Or
mylist <- split( data, f = data$cluster)
result <- lapply(mylist, function(mydata){
doSomething(mydata)
})
Which one you use depends largely on what the result should be. If you need some kind of summary for every subset, using lapply will give you a list with the results per subset. If you need this for a simulation, for plotting or similar, you'd better use the for loop.
If you want to add some variables based on other variables, then the plyr or dplyr packages come in handy.
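As a small illustration of both styles, the sketch below splits the built-in iris data by Species (a stand-in for data and its cluster column) and then looks the pieces up by name instead of constructing variable names with paste(). The names by_species, setosa and mean_sepal are assumptions for this example.

```r
# One data frame per species, stored in a single named list
by_species <- split(iris, iris$Species)
names(by_species)

# Look up a piece by name instead of building a variable name with paste()
setosa <- by_species[["setosa"]]

# sapply/lapply when you want one result per subset
mean_sepal <- sapply(by_species, function(d) mean(d$Sepal.Length))
print(mean_sepal)

# for loop over the names when the body is run for its side effects
for (nm in names(by_species)) {
  cat(nm, "has", nrow(by_species[[nm]]), "rows\n")
}
```

Keeping the pieces in one list means the original data frame is never overwritten and the loop can be rerun safely.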
Using plyr and dplyr
These packages come in especially handy if the result of your code is going to be an array or data frame of some kind. This would be similar to using split and lapply, but in a way Hadley approves of :-)
For example:
library(plyr)
result <- ddply(data, .(cluster),
function(mydata){
doSomething(mydata)
})
Use dlply if the result should be a list.

Extract Group Regression Coefficients in R w/ PLYR

I'm trying to run a regression for every zipcode in my dataset and save the coefficients to a data frame but I'm having trouble.
Whenever I run the code below, I get a data frame called "coefficients" containing every zip code but with the intercept and coefficient for every zipcode being equal to the results of the simple regression lm(Sealed$hhincome ~ Sealed$square_footage).
When I run the code as indicated in Ranmath's example at the link below, everything works as expected. I'm new to R after many years with STATA, so any help would be greatly appreciated :)
R extract regression coefficients from multiply regression via lapply command
library(plyr)
Sealed <- read.csv("~/Desktop/SEALED.csv")
x <- function(df) {
lm(Sealed$hhincome ~ Sealed$square_footage)
}
regressions <- dlply(Sealed, .(Sealed$zipcode), x)
coefficients <- ldply(regressions, coef)
Because dlply takes a ... argument that allows additional arguments to be passed to the function, you can make things even simpler:
dlply(Sealed,.(zipcode),lm,formula=hhincome~square_footage)
The first two arguments to lm are formula and data. Since formula is specified here, lm will pick up the next argument it is given (the relevant zipcode-specific chunk of Sealed) as the data argument ...
You are applying the function:
x <- function(df) {
lm(Sealed$hhincome ~ Sealed$square_footage)
}
to each subset of your data, so we shouldn't be surprised that the output each time is exactly
lm(Sealed$hhincome ~ Sealed$square_footage)
right? Try replacing Sealed with df inside your function. That way you're referring to the variables in each individual piece passed to the function, not the whole variable in the data frame Sealed.
The issue is not with plyr but rather in the definition of the function. You are calling a function, but not doing anything with the variable.
As an analogy,
myFun <- function(x) {
3 * 7
}
> myFun(2)
[1] 21
> myFun(578)
[1] 21
If you run this function on different values of x, it will still give you 21, no matter what x is. That is, there is no reference to x within the function. In my silly example, the correction is obvious; in your function above, the confusion is understandable. The $hhincome and $square_footage should conceivably serve as variables.
But you want your x to vary over what comes before the $. As @Joran correctly pointed out, swap Sealed$hhincome with df$hhincome (and the same for $squ..) and that will help.
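The difference can be reproduced with base R alone. The sketch below uses the built-in mtcars data grouped by cyl as a stand-in for Sealed grouped by zipcode, and split() with lapply() in place of dlply() and ldply(); all function and variable names here are made up for illustration.

```r
# Buggy pattern: the argument df is ignored, so every group gets the same fit
fit_wrong <- function(df) lm(mtcars$mpg ~ mtcars$wt)

# Fixed pattern: the formula is evaluated inside the piece passed in
fit_right <- function(df) lm(mpg ~ wt, data = df)

pieces <- split(mtcars, mtcars$cyl)   # one data frame per cylinder count

coef_wrong <- do.call(rbind, lapply(pieces, function(d) coef(fit_wrong(d))))
coef_right <- do.call(rbind, lapply(pieces, function(d) coef(fit_right(d))))

print(coef_wrong)   # identical rows
print(coef_right)   # one intercept and slope per group
```

Writing the formula with bare column names and passing data = df is usually less error-prone than df$column inside the formula, since the function then cannot accidentally reach for the full data frame.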
