Create a variable to invoke a data.frame in R - r

So I have created a program that runs a summary and anova as well as plots some graphs for me. The problem is that for each new data frame I use I need to change the variables inside the formulas. What I want to do is create a variable at the beginning of the script that I assign to the column I'm interested in and then the program does the work:
mydata <- Leaves.data.csv
attach(mydata)
str(mydata)
var <- Leaves
avgVaL <- group_by(mydata, Treatment, Medium, Treatment:Medium) %>%
summarise(count=sum(!is.na(var)), mean = mean(var, na.rm = T), sd = sd(var, na.rm=T), se = sd/sqrt(count))
The only thing I wish to change is Leaves. The problem with this code is that summarise takes var as a single variable and returns the count, mean, sd and se of all the data points instead of each group.

In the end I needed the quo() function, which quotes its input rather than evaluating it. If I understand correctly, var then stores a reference to the column expression, so summarise() looks that column up in the original (grouped) data frame instead of working on a separate copy of the values.
You also have to put !! in front of every use of the variable inside the function of interest, as this tells the function to evaluate (unquote) the already quoted variable rather than quoting it again.
Much better explained: https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
The code:
var <- quo(Root.growth)
# the pipe must end the line; a line that starts with %>% is a syntax error
avgVar <- group_by(mydata, Treatment, Medium) %>%
summarise(count = sum(!is.na(!!var)), mean = mean(!!var, na.rm = TRUE), sd = sd(!!var, na.rm = TRUE), se = sd/sqrt(count))
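For anyone who wants to try this without the original file, here is a minimal sketch on made-up data. The toy values and the names avg, avg2 and var_name are my own assumptions, not part of the original script. It also shows an alternative I believe works in recent dplyr versions: keep the column name as a string and look it up with the .data pronoun.

```r
library(dplyr)

# Toy stand-in for the real data (values are invented for illustration)
mydata <- data.frame(
  Treatment = c("A", "A", "B", "B"),
  Medium    = c("m1", "m1", "m1", "m1"),
  Leaves    = c(2, 4, 6, NA)
)

# Option 1: quote the column once, unquote it with !! wherever it is used
var <- quo(Leaves)
avg <- mydata %>%
  group_by(Treatment, Medium) %>%
  summarise(count = sum(!is.na(!!var)),
            mean  = mean(!!var, na.rm = TRUE),
            .groups = "drop")

# Option 2: keep the column name as a string and index the .data pronoun
var_name <- "Leaves"
avg2 <- mydata %>%
  group_by(Treatment, Medium) %>%
  summarise(mean = mean(.data[[var_name]], na.rm = TRUE),
            .groups = "drop")
```

Either way, only the first assignment (var or var_name) has to change when you switch to another column.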

Related

Trying to call existing variable in for loop in r

I am trying to create a for loop where it calculates the mean of an already existing variable. The data frames are titled "mali2013", "mali2014", "mali2015", "mali2016", and "mali2017" and the variable is prop_AFR. I am trying to calculate the mean of variable per data frame.
I tried
for (i in 2014:2017) {
variable = paste0("mali", Year, "$prop_AFR")
M_mean_AFR_data <- mean(as.numeric(variable), na.rm = TRUE)
assign(paste0("Mali_prop_AFR_", i), M_mean_AFR_data)
}
but it kept yielding NaN. Is there any way to put this in a loop, or should I just do it manually?
It looks like Stata style code to me. In R, there might be several simpler ways to do it without looping. I would try this:
library(dplyr)
df <- bind_rows(mali2013, mali2014, mali2015, mali2016, mali2017)
df %>% group_by(Year) %>%
summarize(prop_AFR = mean(prop_AFR, na.rm = TRUE))
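If you do want to keep the loop idea: the original attempt returns NaN because paste0() only builds a character string, and as.numeric() on that string gives NA rather than the column. A minimal sketch of looking each data frame up by name with get() follows. The two toy data frames and their values are made up for illustration, and means is a name I picked.

```r
# Toy stand-ins for the real data frames (values are invented)
mali2013 <- data.frame(prop_AFR = c(0.1, 0.3, NA))
mali2014 <- data.frame(prop_AFR = c(0.2, 0.4))

years <- 2013:2014

# get() turns the name "mali2013" into the object itself
means <- sapply(years, function(y) {
  df <- get(paste0("mali", y))
  mean(as.numeric(df$prop_AFR), na.rm = TRUE)
})
names(means) <- paste0("Mali_prop_AFR_", years)
```

This keeps all the means in one named vector instead of creating a separate object per year with assign().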

Trouble constructing a function properly in R

In the code below, I'm trying to find the mean correct score for each item in the "category" column of the "regular season" dataset I'm working with.
rs_category <- list2env(split(regular_season, regular_season$category),
.GlobalEnv)
unique_categories <- unique(regular_season$category)
for (i in unique_categories)
Mean_[i] <- mean(regular_season$correct[regular_season$category == i], na.rm = TRUE, .groups = 'drop')
eapply(rs_category, Mean_[i])
print(i)
I'm having trouble getting this to work though. I have created a list of the items in the category as sub-datasets and separately, (I think) I have created a vector of the unique items in the category in order to run the for loop with. I have a feeling the problem may be with how I defined the mean function because an error occurs at the "eapply()" line and tells me "Mean_[i]" is not a function, but I can't think of how else to define the function. If someone could help, I would greatly appreciate it.
The issue is that Mean_ is never created before the loop, so it has no element named i. In the code below, we initialize the object 'Mean_' as a numeric vector with the same length as 'unique_categories', then loop over the sequence of 'unique_categories', take the matching subset of 'correct', apply the mean function and store the result as the ith value of 'Mean_':
Mean_ <- numeric(length(unique_categories))
for(i in seq_along(unique_categories)) {
Mean_[i] <- mean(regular_season$correct[regular_season$category
== unique_categories[i]], na.rm = TRUE)
}
If we need faster execution, use data.table:
library(data.table)
setDT(regular_season)[, .(Mean_ = mean(correct, na.rm = TRUE)), category]
Or using collapse
library(collapse)
fmean(slt(regular_season, correct), g = regular_season$category)
Instead of splitting the dataset and using a for loop, you can use the functions R already has for such grouping operations. They apply a function to each unique group (value).
library(dplyr)
regular_season %>%
group_by(category) %>%
summarise(Mean_ = mean(correct, na.rm = TRUE)) -> result
This gives you average value of correct for each category, where result$Mean_ is the vector that you are looking for.
In base R, this can be solved with aggregate.
result <- aggregate(correct~category, regular_season, mean, na.rm = TRUE)
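Another base R option, sketched here on made-up data, is tapply(), which returns the per-category means with the category names attached. The toy regular_season values and the name means_by_category are assumptions for illustration only.

```r
# Toy stand-in for the real dataset (values are invented)
regular_season <- data.frame(
  category = c("history", "history", "science", "science", "science"),
  correct  = c(1, 0, 1, 1, NA)
)

# One mean per category; na.rm = TRUE is passed on to mean()
means_by_category <- tapply(regular_season$correct,
                            regular_season$category,
                            mean, na.rm = TRUE)
```

You can then pull out a single category by name, e.g. means_by_category[["science"]].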

summarizing across multiple variables and assign to new variable names

I am using the 'across' function to get the summary statistics for a series of variables (for example, all variables that start with 'f_'). Since the across function stores the summarised results back under the original variable names, having multiple across calls with different summarising functions would overwrite the results (as shown below).
I can think of a work-around by renaming the variables after summarise() and cbind the resulting individual tables. However, that seems cumbersome, and I am wondering if there is a tidy (pun intended) way to store the series of summarised results to new variable names.
var_stats = data %>%
summarise(
across(starts_with('f_'), max, na.rm = T),
across(starts_with('f_'), min, na.rm = T)
)
With across you can use multiple summary functions at the same time and control the names of the variables. Does the following example help you?
mtcars %>%
summarise(across(starts_with("c"), list(custom_min = min, custom_max = max), .names = "{.col}_{.fn}"))
  cyl_custom_min cyl_custom_max carb_custom_min carb_custom_max
1              4              8               1               8
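Applied to the 'f_' columns from the question, the same idea might look like the sketch below. The toy data frame is made up, and I am assuming dplyr 1.0 or later for across(). The lambdas are only there so that na.rm = TRUE reaches both functions.

```r
library(dplyr)

# Toy stand-in for the real data (values are invented)
data <- data.frame(
  f_a   = c(1, NA, 3),
  f_b   = c(4, 5, NA),
  other = 1:3
)

# One across() call, two named functions, names built from column and function
var_stats <- data %>%
  summarise(across(starts_with("f_"),
                   list(max = ~ max(.x, na.rm = TRUE),
                        min = ~ min(.x, na.rm = TRUE)),
                   .names = "{.col}_{.fn}"))
```

The result has one row with the columns f_a_max, f_a_min, f_b_max and f_b_min, so nothing gets overwritten.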

How to use apply for functions that need "data$varname" vs functions that need just "varname"

Relatively new R user here that has been wrestling with making code more efficient for future uses, mainly trying out functions from the apply family.
Right now, I have a script in which I pull means from a large number of variables by (manually) creating a list of variable names and passing it into a sapply.
So this is an example of how I made a list of variable names and how I passed that into sapply
vars <- c("data$age", "data$gender", "data$PCLR")
means <- sapply(vars, fmean, data$group, na.rm=TRUE)
However, I now want to use a function that uses the argument format of function(varname, data), so I can't actually use that list of names I made. What I'm trying to do:
krusk <- sapply(vars, function(x) kruskal.test(x ~ group, data))
I feel like there is a way to pass my variable names into functions I have been completely overlooking, instead manually creating lists. Anyone have any suggestions?
This can work, using the iris dataset as example data, similar to the great suggestion from @deschen:
#Vars
vars <- c("Sepal.Length", "Sepal.Width")
#Code
krusk <- sapply(vars, function(x) kruskal.test(iris[[x]] ~ iris[['Species']]))
Output:
krusk
Sepal.Length Sepal.Width
statistic 96.93744 63.57115
parameter 2 2
p.value 8.918734e-22 1.569282e-14
method "Kruskal-Wallis rank sum test" "Kruskal-Wallis rank sum test"
data.name "iris[[x]] by iris[["Species"]]" "iris[[x]] by iris[["Species"]]"
You were very close! You can do it by subsetting the data frame that you input to sapply using your vars vector, and changing the formula in kruskal.test:
vars <- c("Sepal.Length", "Sepal.Width")
sapply(iris[, vars], function(x) kruskal.test(x ~ iris$Species))
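If you would rather keep the formula-plus-data interface, one more possibility (a sketch, not the only way) is to build the formula from each variable name with reformulate() from the stats package. The name p_values is my own, and I only extract the p-value to keep the output small.

```r
vars <- c("Sepal.Length", "Sepal.Width")

# reformulate() builds e.g. Sepal.Length ~ Species from character strings
p_values <- sapply(vars, function(v) {
  f <- reformulate("Species", response = v)
  kruskal.test(f, data = iris)$p.value
})
```

Because sapply() is given a character vector, the result is named after the variables.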
R is a very diverse coding language and there will ultimately be many ways to do the same thing. Some functions expect a more standard evaluation while others may use NSE (non-standard evaluation).
However, you seem to be asking about functions that just expect a single vector as input, as opposed to functions that have a data argument, in which you use variable as opposed to data$variable.
I have a few sidebars before I give some advice.
Sidebar 1 - S3 Methods
While this may be beside the point with regard to the question, the function kruskal.test has two methods.
methods("kruskal.test")
#[1] kruskal.test.default* kruskal.test.formula*
#see '?methods' for accessing help and source code
Which method is used depends on the class of the first argument in the function. In this example you are passing a formula, in which case the data argument is needed, whereas the default method just requires the x and g arguments (so you could probably keep using your original approach with it).
So, if you're used to doing something one way, be sure to check the documentation to see whether a function has different dispatch methods that will work for you.
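To make the two methods concrete, here is a small sketch on the built-in iris data (the names res_formula and res_default are mine). Both calls run the same test; they only differ in how the vectors are handed over.

```r
# Formula method: the variables are looked up inside `data`
res_formula <- kruskal.test(Sepal.Length ~ Species, data = iris)

# Default method: the vectors are passed explicitly as x and g
res_default <- kruskal.test(x = iris$Sepal.Length, g = iris$Species)

# Same test statistic either way
same_statistic <- isTRUE(all.equal(unname(res_formula$statistic),
                                   unname(res_default$statistic)))
```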
Sidebar 2 - Data Frames
Data frames are really just a collection of vectors. The difference between f(data$variable) and f(x = variable, data = data) is that in the first the user explicitly tells R where to find the vector, while in the latter it is up to the function f to evaluate x in the context of data.
I bring this up because of what I said in the beginning: there are many ways to do the same thing. So it's generally up to you what you want your standard to be.
If you prefer to be explicit
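# a list of the vectors themselves; strings like "data$age" would not be evaluated
vars <- list(age = data$age, gender = data$gender, PCLR = data$PCLR)
means <- sapply(vars, fmean, g = data$group, na.rm = TRUE)
krusk <- sapply(vars, kruskal.test, g = data$group)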
or you can write your functions so that they are evaluated within a given data.frame object
vars <- c("age", "gender", "PCLR")
# fmean() takes its grouping vector through the g argument
means <- sapply(vars, function(x, id, data) fmean(data[[x]], g = data[[id]], na.rm = TRUE), id = "group", data = data)
krusk <- sapply(vars, function(x, id, data) kruskal.test(data[[x]], data[[id]]), id = "group", data = data)
My Advice
I recommend looking into the following packages dplyr, tidyr, purrr. I'm sure there are a few things in these packages that will make your life easier.
For example, you mentioned having to manually make a list before calling sapply. With the dplyr package you can possibly circumvent this if there is a condition to filter on.
data %>%
group_by(group) %>% #groups data
summarise_if(is.numeric, mean, na.rm = TRUE) #applies mean to every column that is a numeric vector
And similarly, we can summarise the results of the kruskal.test function if we reshape the data a bit.
data %>%
group_by(group) %>% #grouping to retain column in the next select statement
select_if(is.numeric) %>% # selecting all numeric columns
pivot_longer(cols = -group) %>% # all columns except "group" will be reshaped. Column names are stored in `name`, and values are stored in `value`
group_by(name) %>% #regroup on name variable (old numeric columns)
summarise(krusk = list(kruskal.test(value ~ as.factor(group)))) #perform test
I've only mentioned purrr because you can almost drop and replace all apply style functions with their map variants. purrr is very consistent across its function variants with a lot of options to control the output type.
I hope this helps and wish you luck on your coding adventures.

In R, how do I compute mean and standard error of a subset of data, grouped by multiple columns, and output this into a new data frame?

I have a dataset (named 'gala') that has the columns "Day", "Tree", "Trt", and "LogColumn". The data was collected over time, so each numbered tree within a treatment is the same tree across all days. The tree numbers are repeated for each treatment (e.g. there is a tree "1" for multiple treatments).
I would like to compute the mean and standard error for the 'LogColumn' column, for each tree per each treatment per each day (e.g. I will have a mean + standard error for day 1, tree one, treatment x, etc.), and output the mean and standard error results into a new data frame that also includes the original day, Tree, Trt values.
I have been unsuccessfully trying to make a Frankenstein of codes from other Stack Overflow answers, but I cannot seem to find one that has all the components at once. If I missed this, I am sorry, and please let me know with a link to this answer. I am new to coding, and R, and do not understand well how other codes not directly relating to what I would like to do can be applied.
At this point, I have this, but do not know if it is anywhere near correct (I am also currently getting the error message "object of type 'closure' is not subsettable"):
TreeAverages <- data.table[, MeanLog=mean(gala$LogColumn), se=std.error(gala$LogColumn), by=c("Day","Tree","Trt")]
Any help is greatly appreciated. Thank you!
If you're using data.table, remember to convert gala into a data.table object first.
library(data.table)
library(plotrix) # std.error() comes from plotrix
gala = data.table(gala)
gala_output = gala[, .("MeanLog" = mean(LogColumn),
"std" = std.error(LogColumn)),
by = c("Day", "Tree", "Trt")]
You were really close, but data.table works like dplyr does, so it already knows variable names. You don't need to specify gala$LogColumn throughout, just do it by name.
.() is just a shorthand for list(), so I'm specifying that data.table should return the columns MeanLog and std grouped by Day, Tree, and Trt.
Using base R aggregate:
aggregate(LogColumn ~ Day + Tree + Trt, data = gala,
FUN = function(x) c(mean = mean(x), se = std.error(x)))
Using dplyr:
library(dplyr)
df <- gala %>%
group_by(Day, Tree, Trt) %>%
summarise(mean = mean(LogColumn),
se = sd(LogColumn) / sqrt(n())) # standard error = sd / sqrt(group size)
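If you want to check the approach without installing anything, here is a base R sketch on made-up data. The toy gala values are invented, and se() is a small helper I define here instead of relying on std.error() from a package.

```r
# Hypothetical helper: standard error = sd / sqrt(number of non-missing values)
se <- function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))

# Toy stand-in for the real dataset (values are invented)
gala <- data.frame(
  Day       = rep(c(1, 2), each = 4),
  Tree      = rep(c(1, 1, 2, 2), times = 2),
  Trt       = "x",
  LogColumn = c(1, 3, 2, 4, 5, 7, 6, 8)
)

# aggregate() returns the two statistics as one matrix column ...
out <- aggregate(LogColumn ~ Day + Tree + Trt, data = gala,
                 FUN = function(x) c(mean = mean(x), se = se(x)))

# ... so flatten it into ordinary columns LogColumn.mean and LogColumn.se
out <- do.call(data.frame, out)
```

The result is a new data frame with one row per Day, Tree and Trt combination, which is the shape asked for in the question.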

Resources