Relatively new R user here who has been wrestling with making code more efficient for future use, mainly by trying out functions from the apply family.
Right now, I have a script in which I pull means from a large number of variables by (manually) creating a list of variable names and passing it into a sapply.
So this is an example of how I made a list of variable names and how I passed it into sapply:
vars <- c("data$age", "data$gender", "data$PCLR")
means <- sapply(vars, fmean, data$group, na.rm=TRUE)
However, I now want to use a function that uses the argument format of function(varname, data), so I can't actually use that list of names I made. What I'm trying to do:
krusk <- sapply(vars, function(x) kruskal.test(x ~ group, data))
I feel like there is a way to pass my variable names into functions that I have been completely overlooking, instead of manually creating lists. Does anyone have any suggestions?
This can work using the iris dataset as data, similar to the great suggestion from @deschen:
#Vars
vars <- c("Sepal.Length", "Sepal.Width")
#Code
krusk <- sapply(vars, function(x) kruskal.test(iris[[x]] ~ iris[['Species']]))
Output:
krusk
Sepal.Length Sepal.Width
statistic 96.93744 63.57115
parameter 2 2
p.value 8.918734e-22 1.569282e-14
method "Kruskal-Wallis rank sum test" "Kruskal-Wallis rank sum test"
data.name "iris[[x]] by iris[["Species"]]" "iris[[x]] by iris[["Species"]]"
You were very close! You can do it by subsetting the data frame that you input to sapply using your vars vector, and changing the formula in kruskal.test:
vars <- c("Sepal.Length", "Sepal.Width")
sapply(iris[, vars], function(x) kruskal.test(x ~ iris$Species))
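If you only need one piece of each test, a small follow-up sketch (the p.value element is part of the htest object that kruskal.test returns):
# Pull just the p-values instead of the full test objects
pvals <- sapply(iris[, vars], function(x) kruskal.test(x ~ iris$Species)$p.value)
pvals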
R is a very diverse coding language and there will ultimately be many ways to do the same thing. Some functions expect a more standard evaluation while others may use NSE (non-standard evaluation).
However, you seem to be asking about functions that just expect a single vector as input, as opposed to functions that have a data argument, in which case you use variable as opposed to data$variable.
I have a few sidebars before I give some advice.
Sidebar 1 - S3 Methods
While this may be beside the point with regard to the question, the function kruskal.test has two methods.
methods("kruskal.test")
#[1] kruskal.test.default* kruskal.test.formula*
#see '?methods' for accessing help and source code
Which method is used depends on the class of the first argument in the function. In this example you are passing a formula expression, in which case the data argument is necessary, whereas the default method just requires the x and g arguments (with which you could probably use your original pipeline).
So, if you're used to doing something one way, be sure to check whether a function's documentation lists different dispatch methods that will work for you.
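For instance, using the iris data from earlier, the two dispatch routes look like this (a minimal sketch):
# Formula method: dispatches on the formula class, needs a data argument
kruskal.test(Sepal.Length ~ Species, data = iris)
# Default method: takes the vector x and the grouping factor g directly
kruskal.test(x = iris$Sepal.Length, g = iris$Species)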
Sidebar 2 - Data Frames
Data frames are really just a collection of vectors. The difference between f(data$variable) and f(x = variable, data = data) is that in the first the user explicitly tells R where to find the vector, while in the latter it is up to the function f to evaluate x in the context of data.
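A minimal illustration of the two styles, using base R's with() to supply the data context:
mean(iris$Sepal.Length)        # explicit: you tell R where the vector lives
with(iris, mean(Sepal.Length)) # Sepal.Length is evaluated in the context of iris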
I bring this up because of what I said in the beginning: there are many ways to do the same thing. So it's generally up to you what you want your standard to be.
If you prefer to be explicit:
vars <- list(data$age, data$gender, data$PCLR) # a list of the actual vectors, not character strings
means <- sapply(vars, fmean, data$group, na.rm = TRUE)
krusk <- sapply(vars, kruskal.test, g = data$group)
or you can write your functions so that they are evaluated within a certain data.frame object:
vars <- c("age", "gender", "PCLR")
means <- sapply(vars, function(x, id, data) fmean(data[[x]], g = data[[id]], na.rm = TRUE), id = "group", data = data)
krusk <- sapply(vars, function(x, id, data) kruskal.test(data[[x]], data[[id]]), id = "group", data = data)
My Advice
I recommend looking into the following packages: dplyr, tidyr, and purrr. I'm sure there are a few things in these packages that will make your life easier.
For example, you mentioned having to manually make a list before doing the sapply. With the dplyr package you can possibly circumvent this if there is a condition to select columns on.
data %>%
group_by(group) %>% #groups data
summarise_if(is.numeric, mean, na.rm = TRUE) # applies mean to every column that is a numeric vector
And similarly, we can summarise the results of the kruskal.test function if we reshape the data a bit.
data %>%
group_by(group) %>% #grouping to retain column in the next select statement
select_if(is.numeric) %>% # selecting all numeric columns
pivot_longer(cols = -group) %>% # all columns except "group" will be reshaped. Column names are stored in `name`, and values are stored in `value`
group_by(name) %>% #regroup on name variable (old numeric columns)
summarise(krusk = list(kruskal.test(value ~ as.factor(group)))) #perform test
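If you then want a flat table of statistics rather than a list-column, one option (a sketch; assumes the result of the pipeline above was saved as results, and uses broom::tidy, which works on htest objects):
library(broom)
library(tidyr)
results %>%
  mutate(tidied = lapply(krusk, tidy)) %>% # one small data frame per test
  select(-krusk) %>%
  unnest(tidied)                           # flatten into regular columns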
I've only mentioned purrr because you can almost drop-in replace all apply-style functions with their map variants. purrr is very consistent across its function variants, with a lot of options to control the output type.
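For example, a minimal sketch reusing the iris variables from earlier (map_dbl guarantees a numeric vector, where sapply simplifies heuristically):
library(purrr)
vars <- c("Sepal.Length", "Sepal.Width")
sapply(vars, function(x) mean(iris[[x]])) # base apply style
map_dbl(vars, ~ mean(iris[[.x]]))         # purrr equivalent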
I hope this helps and wish you luck on your coding adventures.
In a housing dataset there are three variables: bsqft (the building size of the house), county (a factor variable with 9 levels), and price. I would like to fit an individual regression line of price on bsqft for each county. Instead of calling the lm() function repeatedly, I would prefer using an apply-style function in R, but I have no idea how to set it up. Could anyone help me with that? Thanks a lot.
You can use dplyr and broom to do regressions by group and summarise the information back into a data frame:
library(dplyr)
library(broom)
your_dataset %>%
group_by(county) %>%
do(tidy(lm(price ~ bsqft, data=.)))
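Since the question asked about the apply family specifically, here is a base-R sketch of the same idea (assuming the dataset is called your_dataset):
# Split the data by county, then fit one lm per piece
fits <- lapply(split(your_dataset, your_dataset$county),
               function(d) lm(price ~ bsqft, data = d))
lapply(fits, coef) # intercept and slope for each county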
I want to compute the correlation between two vectors that belong to two different groups (defined by group_by). The solution needs to be based on dplyr.
My data is in the so-called CDISC format. For simplicity: here is some dummy data (note that one column ("values") holds all the data):
n <- 5
bmi <- rnorm(n = n, mean = 25)
glucose <- rnorm(n = n, mean = 5)
insulin <- rnorm(n = n, mean = 10)
id <- rep(paste0("id", 1:n), 3)
myData <- data.frame(id = id,
                     measurement = c(rep("BMI", n), rep("glucose", n), rep("insulin", n)),
                     values = c(bmi, glucose, insulin))
Keep in mind that all my functions for working with this kind of data use the dplyr package, such as:
myData %>% group_by(measurement) %>% summarise(mean(values), n())
How do I get the correlation between glucose and insulin (cor(glucose, insulin))? Or in a more general way: how do I get the correlation between two groups?
The following solution is obviously very wrong (but may help to understand my question):
myData %>% group_by(measurement) %>% summarise(cor(glucose,insulin))
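One way to get there, offered as a sketch rather than an authoritative answer: reshape the long CDISC-style data to wide with tidyr so that glucose and insulin become columns, then correlate them:
library(dplyr)
library(tidyr)
myData %>%
  pivot_wider(names_from = measurement, values_from = values) %>%
  summarise(r = cor(glucose, insulin))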
This is related to questions 1 and 2.
I have a list of objects (in my case they are also lists, as far as I can tell), as returned by running:
gof_stats <- models %>% map(gof_stats)
Where models is a list of models created by fitdistrplus and gof_stats is a function that computes goodness of fit stats for each model.
Now if I want to extract a specific stat from that list I could do something like:
gof_stats[[1]]$cvm
to get the Cramer von Mises stat. I can achieve the same over the whole list (as per the linked questions) like so:
cvms <- sapply(gof_stats, "[[", "cvm")
Is there a way to do the same using dplyr/purrr syntax?
BONUS: How would you handle the case where some of the elements in the models list are NULL?
If you prefer map to sapply for this, you can do
library(purrr)
map(gof_stats, ~ .x[["cvm"]])
If you just like pipes you could do
gof_stats %>% sapply("[[", "cvm")
Your question is about lists, not data frames, so dplyr doesn't really apply. You may want to look up ?magrittr::multiply_by to see a list of other aliases from the package that defines %>%, since you seem to like piping. For example, magrittr::extract2 is an alias of [[ that can easily be used in the middle of a piping chain.
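A tiny sketch of that alias in a pipe:
library(magrittr)
gof_stats %>% extract2(1) %>% extract2("cvm") # same as gof_stats[[1]][["cvm"]]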
As for your bonus, I would pre-filter the list to remove NULL elements before attempting to extract things.
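For instance, a sketch using purrr::compact(), which drops NULL elements from a list:
library(purrr)
gof_stats %>%
  compact() %>%       # remove NULL models first
  map(~ .x[["cvm"]])  # then extract the stat safely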
A solution composed completely of tidyverse functions is:
gof_stats %>%
map(~ .x %>% pluck("cvm"))
Legend:
map is the tidyverse function for lists, replacing the apply family
.x is the object in each iteration
~ is the purrr short syntax for the anonymous function function(x) { }
pluck extracts a list element by index or string
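As a further shorthand (assuming the same list structure), purrr also accepts the element name directly as the mapper:
map(gof_stats, "cvm") # equivalent to the pluck version above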
I have a dataset on which I will be performing cross-validation training. However, due to this splitting of the data, I sometimes encounter errors because a factor level found in the test set was not found in the training set, since that level might occur only a very limited number of times.
I would like a way to easily filter out these rows prior to doing any cross-validation to avoid errors...
For example, how would I make sure that the factor levels with 9 or fewer observations are removed?
mtcars$carb = factor(mtcars$carb)
table(mtcars$carb)
Using the dplyr library, you can try something like this:
library(dplyr)
mtcars %>% group_by(carb) %>% filter(n() > 9)
Alternatively, you can re-factor the variable in the training set and remove from the test set any levels not present in the training data.
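A sketch of that second approach on the mtcars example (the train name is just illustrative; droplevels() removes factor levels that no longer occur after filtering):
train <- mtcars %>%
  group_by(carb) %>%
  filter(n() > 9) %>%
  ungroup()
train$carb <- droplevels(train$carb) # drop the now-empty levels
table(train$carb)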