Trouble constructing a function properly in R

In the code below, I'm trying to find the mean correct score for each item in the "category" column of the "regular season" dataset I'm working with.
rs_category <- list2env(split(regular_season, regular_season$category),
.GlobalEnv)
unique_categories <- unique(regular_season$category)
for (i in unique_categories)
Mean_[i] <- mean(regular_season$correct[regular_season$category == i], na.rm = TRUE, .groups = 'drop')
eapply(rs_category, Mean_[i])
print(i)
I'm having trouble getting this to work though. I have created a list of the items in the category as sub-datasets and separately, (I think) I have created a vector of the unique items in the category in order to run the for loop with. I have a feeling the problem may be with how I defined the mean function because an error occurs at the "eapply()" line and tells me "Mean_[i]" is not a function, but I can't think of how else to define the function. If someone could help, I would greatly appreciate it.

The issue is that 'Mean_' has no element named by 'i': the loop variable holds the category value itself, and 'Mean_' is never initialized before the loop. In the code below, we initialize the object 'Mean_' as a numeric vector with the same length as 'unique_categories', then loop over the sequence of 'unique_categories', subset 'correct' for each category, apply the mean function, and store the result as the ith value of 'Mean_'.
Mean_ <- numeric(length(unique_categories))
for(i in seq_along(unique_categories)) {
Mean_[i] <- mean(regular_season$correct[regular_season$category
== unique_categories[i]], na.rm = TRUE)
}
If faster execution is needed, use data.table
library(data.table)
setDT(regular_season)[, .(Mean_ = mean(correct, na.rm = TRUE)), category]
Or using collapse
library(collapse)
with(regular_season, fmean(correct, g = category))

Instead of splitting the dataset and writing a for loop, you can use R's functions for grouped operations, which apply a function to each unique group (value).
library(dplyr)
regular_season %>%
group_by(category) %>%
summarise(Mean_ = mean(correct, na.rm = TRUE)) -> result
This gives you the average value of correct for each category, where result$Mean_ is the vector that you are looking for.
In base R, this can be solved with aggregate.
result <- aggregate(correct~category, regular_season, mean, na.rm = TRUE)
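
The split-then-apply idea from the question can also be made to work in base R. The sketch below is only an illustration on made-up data (the values in regular_season here are invented; only the column names come from the question). It shows that the apply-family functions expect a function such as mean as their argument, not an already computed value like Mean_[i].

```r
# Toy stand-in for the asker's data; the values are made up.
regular_season <- data.frame(
  category = c("history", "science", "history", "science", "sports", "sports"),
  correct  = c(1, 0, 0, 1, 1, NA)
)

# tapply() splits `correct` by `category` and applies mean() to each piece.
means_tapply <- tapply(regular_season$correct, regular_season$category,
                       mean, na.rm = TRUE)

# The same idea spelled out: split() builds the pieces, and sapply() is
# handed the function `mean` itself, plus extra arguments for it.
pieces <- split(regular_season$correct, regular_season$category)
means_split <- sapply(pieces, mean, na.rm = TRUE)

print(means_tapply)
```

Both results are named by category, so means_split["sports"] pulls out a single mean.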

Related

Trying to call existing variable in for loop in r

I am trying to create a for loop where it calculates the mean of an already existing variable. The data frames are titled "mali2013", "mali2014", "mali2015", "mali2016", and "mali2017" and the variable is prop_AFR. I am trying to calculate the mean of variable per data frame.
I tried
for (i in 2014:2017) {
variable = paste0("mali", Year, "$prop_AFR")
M_mean_AFR_data <- mean(as.numeric(variable), na.rm = TRUE)
assign(paste0("Mali_prop_AFR_", i), M_mean_AFR_data)
}
but it kept yielding NaN. Is there any way to put this in a loop, or should I just do it manually?

It looks like Stata-style code to me. In R, there are several simpler ways to do this without looping. Assuming each data frame has a Year column, I would try this:
library(dplyr)
df <- bind_rows(mali2013, mali2014, mali2015, mali2016, mali2017)
df %>% group_by(Year) %>%
summarize(prop_AFR = mean(prop_AFR, na.rm = TRUE))
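
As for why the loop returned NaN: paste0() only builds a character string, and as.numeric() on that string gives NA, so the mean of nothing is NaN. If the data frames have to stay separate, a base R sketch along these lines may help. The two data frames below are made-up stand-ins, and mget() is used to look the objects up by name.

```r
# Toy stand-ins for the asker's data frames (values are made up).
mali2013 <- data.frame(prop_AFR = c(0.10, 0.20, NA))
mali2014 <- data.frame(prop_AFR = c(0.30, 0.50))

# The string is never evaluated as code: as.numeric() turns it into NA,
# and mean(NA, na.rm = TRUE) is NaN.
bad <- suppressWarnings(mean(as.numeric("mali2013$prop_AFR"), na.rm = TRUE))

# mget() fetches the objects named by the strings and returns a named list.
frames <- mget(paste0("mali", 2013:2014))
mali_means <- sapply(frames, function(d) mean(d$prop_AFR, na.rm = TRUE))

print(mali_means)
```

The result is one named vector, which is usually easier to work with than a set of variables created with assign().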

How to use apply for functions that need "data$varname" vs functions that need just "varname"

Relatively new R user here that has been wrestling with making code more efficient for future uses, mainly trying out functions from the apply family.
Right now, I have a script in which I pull means from a large number of variables by (manually) creating a list of variable names and passing it into a sapply.
So this is an example of how I made a list of variable names and how I passed that into sapply
vars <- c("data$age", "data$gender", "data$PCLR")
means <- sapply(vars, fmean, data$group, na.rm=TRUE)
However, I now want to use a function that uses the argument format of function(varname, data), so I can't actually use that list of names I made. What I'm trying to do:
krusk <- sapply(vars, function(x) kruskal.test(x ~ group, data))
I feel like there is a way to pass my variable names into functions I have been completely overlooking, instead manually creating lists. Anyone have any suggestions?

This can work using the iris dataset as example data, similar to the great suggestion from @deschen:
#Vars
vars <- c("Sepal.Length", "Sepal.Width")
#Code
krusk <- sapply(vars, function(x) kruskal.test(iris[[x]] ~ iris[['Species']]))
Output:
krusk
          Sepal.Length                     Sepal.Width
statistic 96.93744                         63.57115
parameter 2                                2
p.value   8.918734e-22                     1.569282e-14
method    "Kruskal-Wallis rank sum test"   "Kruskal-Wallis rank sum test"
data.name "iris[[x]] by iris[["Species"]]" "iris[[x]] by iris[["Species"]]"

You were very close! You can do it by subsetting the data frame that you input to sapply using your vars vector, and changing the formula in kruskal.test:
vars <- c("Sepal.Length", "Sepal.Width")
sapply(iris[, vars], function(x) kruskal.test(x ~ iris$Species))
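
If you would rather keep the function(varname, data) style from the question, one option is to build the formula from the string. This is just a sketch on iris, using reformulate() from the stats package; the names krusk and p_values are only for illustration.

```r
vars <- c("Sepal.Length", "Sepal.Width")

# reformulate() turns strings into a formula such as Sepal.Length ~ Species,
# so the formula method can still be given a `data` argument.
krusk <- lapply(vars, function(v) {
  kruskal.test(reformulate("Species", response = v), data = iris)
})
names(krusk) <- vars

# Pull one number out of each test result.
p_values <- sapply(krusk, function(res) res$p.value)
print(p_values)
```

Using lapply() keeps each full test object, which tends to be easier to inspect than the matrix that sapply() builds from the raw results.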

R is a very diverse coding language and there will ultimately be many ways to do the same thing. Some functions expect standard evaluation while others use NSE (non-standard evaluation).
However, you seem to be asking about functions that just expect a single vector as input, as opposed to functions that have a data argument, where you pass variable rather than data$variable.
I have a few sidebars before I give some advice.
Sidebar 1 - S3 Methods
While this may be beside the point of the question, the function kruskal.test has two methods.
methods("kruskal.test")
#[1] kruskal.test.default* kruskal.test.formula*
#see '?methods' for accessing help and source code
Which method is used depends on the class of the first argument. In this example you are passing a formula, in which case the data argument is needed, whereas the default method just requires the x and g arguments (which would probably fit your original approach).
So, if you're used to doing something one way, be sure to check whether a function has different dispatch methods that will work for you.
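
To make the dispatch point concrete, here is a small sketch on iris (chosen only because it ships with R) that calls both methods and compares the results.

```r
# Formula method: the variables are looked up inside `data`.
by_formula <- kruskal.test(Sepal.Length ~ Species, data = iris)

# Default method: the vectors are passed explicitly as x and g.
by_default <- kruskal.test(x = iris$Sepal.Length, g = iris$Species)

# Both dispatch paths compute the same statistic.
print(c(formula = unname(by_formula$statistic),
        default = unname(by_default$statistic)))
```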
Sidebar 2 - Data Frames
Data frames are really just a collection of vectors. The difference between f(data$variable) and f(x = variable, data = data) is that in the first the user explicitly tells R where to find the vector, while in the second it is up to the function f to evaluate x in the context of data.
I bring this up because of what I said in the beginning: there are many ways to do the same thing. So it's generally up to you what you want your standard to be.
If you prefer to be explicit
vars <- list(age = data$age, gender = data$gender, PCLR = data$PCLR)
means <- sapply(vars, fmean, data$group, na.rm = TRUE)
krusk <- sapply(vars, kruskal.test, g = data$group)
or you can write your functions so that they are evaluated within a given data.frame object
vars <- c("age", "gender", "PCLR")
means <- sapply(vars, function(x, id, data) fmean(data[[x]], data[[id]], na.rm = TRUE), id = "group", data = data)
krusk <- sapply(vars, function(x, id, data) kruskal.test(data[[x]], data[[id]]), id = "group", data = data)
My Advice
I recommend looking into the following packages dplyr, tidyr, purrr. I'm sure there are a few things in these packages that will make your life easier.
For example, you mentioned having to build the list manually before calling sapply. With the dplyr package you can possibly avoid this if there is a condition to select the columns on.
data %>%
group_by(group) %>% #groups data
summarise_if(is.numeric, mean, na.rm = TRUE) #applies mean to every column that is a numeric vector
And similarly, we can summarise the results of the kruskal.test function if we reshape the data a bit.
data %>%
group_by(group) %>% #grouping to retain column in the next select statement
select_if(is.numeric) %>% # selecting all numeric columns
pivot_longer(cols = -group) %>% # all columns except "group" will be reshaped. Column names are stored in `name`, and values are stored in `value`
group_by(name) %>% #regroup on name variable (old numeric columns)
summarise(krusk = list(kruskal.test(value ~ as.factor(group)))) #perform test
I've only mentioned purrr because you can almost drop and replace all apply style functions with their map variants. purrr is very consistent across its function variants with a lot of options to control the output type.
I hope this helps and wish you luck on your coding adventures.

Only strings can be converted to symbols within a function in R

I have a function that is intended to operate on data obtained from a variety of sources with many manual entry fields. Since I don't know what to expect for the layout or naming convention used in these files, I want it to 'scan' a data frame for columns with the character string 'fix', 'name', or 'agent', and mutate the column to a new column with name 'Firm', then proceed to do string cleaning on the entries of that column, then finally, remove the original column. I have gotten it to work with SOME of the CSVs that I have already, but now have run into this error: ONLY STRINGS CAN BE CONVERTED TO SYMBOLS. I have checked into this thread ERROR: Only strings can be converted to symbols but to no avail.
Here is the function at the moment:
clean_firm_names2 <- function(df){
df <- df %>%
mutate(Firm := !!rlang::sym(grep(pattern = '(AGENT)|(NAME)|(FIX)',x = colnames(.), ignore.case = T, value = T)) %>%
str_replace_all(pattern = "(\\W)+"," ") %>%
...str manipulations...
str_squish()) %>%
dplyr::select(-(!!rlang::sym(grep(pattern = '(AGENT)|(NAME)|(FIX)',x = colnames(.), ignore.case = T, value = T))))
return(df)
}
I have tried using as.character() around the grep() function but that did not solve the problem. I have looked at the CSV that the function is meant to operate on and all of the column names are character strings. I read in the CSV using vroom(), as with my other CSVs, and that works fine, all of the column names appear. I can perform other dplyr functions on the df, suggesting to me that the df is behaving normally otherwise. I have run out of ideas as to why the function is choking up only on SOME of my CSVs but works as intended on others. Has anyone run into similar issues or got any clues as to what might be causing this error? This is the first time I've used SO-- I'm sorry if this question isn't very clear. I'll try and edit as needed.
Thanks!

Note that grep() returns the indices of the matches (integers) by default, and the matching strings themselves only when value = TRUE. Either way, rlang::sym() needs exactly one string, so it fails whenever the pattern matches zero or several column names. Integer indices can be passed directly to dplyr::rename, so perhaps the following may work better?
i <- grep(pattern = '(AGENT)|(NAME)|(FIX)', x = colnames(df), ignore.case = TRUE)
df <- df %>%
rename(Firm = all_of(i)) %>%
mutate(Firm = ...str manipulations... )
(There is an implicit assumption here that your grep() returns a single index. Additional code may be required to handle multiple matches.)
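
One way the "additional code" could look, as a base R sketch only: find_firm_column() is a hypothetical helper (not part of dplyr or rlang), and the data frames are made up. It fails with a readable message when the pattern matches zero or several columns, which is likely the situation in the CSVs that trigger the error.

```r
# Hypothetical helper: returns the single column name matching `pattern`,
# and stops with a readable message for zero or several matches.
find_firm_column <- function(df, pattern = "(AGENT)|(NAME)|(FIX)") {
  hits <- grep(pattern, colnames(df), ignore.case = TRUE, value = TRUE)
  if (length(hits) != 1L) {
    stop(sprintf("expected exactly one matching column, found %d", length(hits)))
  }
  hits
}

# Made-up frame with exactly one candidate column: rename it to Firm.
ok_df <- data.frame(id = 1:2, Agent = c("Acme, Inc.", "Beta LLC"))
col <- find_firm_column(ok_df)
names(ok_df)[names(ok_df) == col] <- "Firm"

# A frame with two candidate columns triggers the explicit error instead.
ambiguous <- data.frame(agent = "x", firm_name = "y")
msg <- tryCatch(find_firm_column(ambiguous), error = conditionMessage)
print(msg)
```

Running the helper over colnames() of each CSV first should show which files have no match or more than one.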

Rolling correlation with 'grouped by' - Error: incorrect number of dimensions

I'm trying to calculate rolling correlations with a five year window based on daily stock data. My dataframe test consists of 20 columns, with "logRet3" being located in column #17 and "logMarRet3" in #18. I want to calculate the correlation of these two return measures.
What makes it difficult is the fact that I want the rolling correlation to be grouped by my share indicator "PERMNO" in column #1. By that I mean that the rolling correlation "restarts" whenever the time-series data of a particular stock ends.
Through research I came up with the following code, using the dplyr, zoo and magrittr packages:
test <- test %>%
group_by(PERMNO) %>%
mutate(CorSecMar = zoo::rollapply(test, width = 1255, function(x) cor(x[,logRet3], x[,logMarRet3]), fill = NA, align = "right"))
However, when I run this code, I get the following error:
Error in x[,logMarRet3]: Incorrect number of dimensions
Me being a newbie, I tried adjusting the code by deleting the ,:
test <- test %>%
group_by(PERMNO) %>%
mutate(CorSecMar = zoo::rollapply(test, width = 1255, function(x) cor(x[logRet3], x[logMarRet3]), fill = NA, align = "right"))
resulting in the following error (translated to English):
Error in x[logMarRet3]: Only zeros are allowed to be mixed with negative indices
Any help on how to fix these errors or alternative ways of calculating the rolling correlation by group would be greatly appreciated.
EDIT: Thanks to G. Grothendieck for pointing out some flaws in my question. I'm referring to his answer for reproducible input and will keep that in mind for further posts.

There are several problems:
Rollapply applies to each column separately unless by.column = FALSE is used.
Using test within group_by will not cause test to be subset. It will refer to the entire dataset. Use individual column names instead.
The column names in the code in the question must have quotes around them; otherwise, it is saying there are variables of those names containing the column names.
When posting to SO you need to reduce your problem to a complete reproducible example and post that. I have done it this time for you in the Note at the end.
With reference to the Note, use this code:
library(dplyr)
library(zoo)
mycor <- function(x) cor(x[, 1], x[, 2])
DF %>%
group_by(stock) %>%
mutate(Cor = rollapplyr(cbind(a, b), 4, mycor, by.column = FALSE, fill = NA)) %>%
ungroup
or this code which only uses zoo. mycor is from above.
library(zoo)
n <- nrow(DF)
roll <- function(i) rollapplyr(DF[i, c("a", "b")], 4, mycor, by.column = FALSE, fill = NA)
transform(DF, Cor = ave(1:n, stock, FUN = roll))
Note
The input in reproducible form is:
DF <- data.frame(stock = rep(LETTERS[1:2], each = 6), a = 1:6, b = (1:6)^3)
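
For comparison, the same grouped rolling correlation can be sketched in base R without zoo. This is not the code from the answer above, just an illustration of what "restart per stock" means; roll_cor() is a hypothetical helper and the window width of 4 matches the example above.

```r
DF <- data.frame(stock = rep(LETTERS[1:2], each = 6), a = 1:6, b = (1:6)^3)

# Hypothetical helper: right-aligned rolling correlation of x and y over
# `width` rows, with NA until a full window is available.
roll_cor <- function(x, y, width) {
  out <- rep(NA_real_, length(x))
  if (length(x) >= width) {
    for (end in width:length(x)) {
      idx <- (end - width + 1):end
      out[end] <- cor(x[idx], y[idx])
    }
  }
  out
}

# Split the row numbers by stock so no window crosses from one stock
# into the next, then write each group's result back in place.
rows <- split(seq_len(nrow(DF)), DF$stock)
DF$Cor <- NA_real_
for (idx in rows) {
  DF$Cor[idx] <- roll_cor(DF$a[idx], DF$b[idx], width = 4)
}
print(DF)
```

The first three rows of each stock stay NA because a full window of four observations is not yet available.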

Passing column names in R function containing is.na() and median

I have data with Income, spending, population and state. Income, spending and population have missing values.
I have created a for loop to replace the missing values by median which is calculated state-wise. However I have to run the for loop separately for Income, Spending and population. I tried to create a function to pass just the column names but it is giving me an error with is.na(). Here is the for loop
for (i in (unique(data$State))) {
data$Income[is.na(data$Income) & data$State==i] <-
median(data$Income[data$State==i], na.rm = TRUE)
}
In place of income I tried making a function and passing x.. but it is not working. Can someone help me achieve this function. I tried a few things but it gave me an error with is.na
Med_sub <- function(x){
for (i in (unique(data$State))) {
data$x[is.na(data$x)&data$State==i] <- median(data$x[data$State==i], na.rm = TRUE)
}
}
Med_sub(Income)
Med_sub(Population)
I am new to R. Any help would be greatly appreciated.

Consider a base R two-liner with ave (the inline aggregate function that slices numeric columns by factors) and ifelse, all wrapped in a sapply loop:
median_fill <- function(x) ifelse(is.na(x), median(x, na.rm=TRUE), x)
data[c("Income","spending","population")] <- sapply(data[c("Income","spending","population")],
function(i) ave(i, data$State, FUN=median_fill))

A tidyverse three-liner:
library(dplyr)
data %>%
group_by(State) %>%
mutate(across(everything(), ~ coalesce(.x, median(.x, na.rm = TRUE))))
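
As for why the original Med_sub() fails: data$x looks for a column literally named "x", and the function never returns the modified data frame. A minimal sketch of a fix, assuming the column is passed as a string, is shown below. med_sub() and its arguments are hypothetical names, and the data are made up.

```r
# Made-up data; column names follow the loop in the question.
data <- data.frame(
  State  = c("A", "A", "A", "B", "B"),
  Income = c(10, NA, 30, 5, NA)
)

# Hypothetical rewrite of Med_sub(): the column name is a string accessed
# with [[ ]], and the modified data frame is returned to the caller.
med_sub <- function(df, col, by = "State") {
  for (s in unique(df[[by]])) {
    in_group <- df[[by]] == s
    fill <- median(df[[col]][in_group], na.rm = TRUE)
    df[[col]][is.na(df[[col]]) & in_group] <- fill
  }
  df
}

data <- med_sub(data, "Income")
print(data)
```

The same call can then be repeated for the other columns, for example med_sub(data, "Population").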
