Choose a function with apply() based on some condition

apply() takes a function as its FUN argument. In my case, which function to use depends on some other condition. Here is an example:
library(dplyr)
Function = TRUE
as.data.frame(matrix(1:12, 4)) %>%
  mutate(Res = apply(as.matrix(.), 1, ifelse(Function, ~mean, ~sd), na.rm = TRUE))
However, this gives the following error:
Error: Problem with `mutate()` column `Res`.
ℹ `Res = apply(as.matrix(.), 1, ifelse(Function, ~mean, ~sd), na.rm = TRUE)`.
✖ attempt to replicate an object of type 'language'
Run `rlang::last_error()` to see where the error occurred.
Can you please help me with the right way to choose the function based on a condition?

This should work:
library(dplyr)
Function = TRUE
as.data.frame(matrix(1:12, 4)) %>%
  mutate(Res = apply(as.matrix(.), 1, if (Function) mean else sd, na.rm = TRUE))
ifelse() is a vectorised function: it takes a logical vector and returns a vector of the same length, with one specified value where the condition is TRUE and another where it is FALSE. To do that it has to replicate its yes/no values, which fails for a formula such as ~mean (hence "attempt to replicate an object of type 'language'"). The separate if and else keywords are for control flow: they evaluate a single condition and return whatever the chosen branch evaluates to, which can be a function. Sometimes the two are interchangeable and sometimes they're not.
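To make the difference concrete, here is a minimal base R sketch. The names x, labels, use_mean, f and row_stats are made up for illustration and do not come from the question.

```r
# ifelse() is vectorised: one result per element of the test vector
x <- c(1, 5, 10)
labels <- ifelse(x > 3, "big", "small")   # "small" "big" "big"

# if / else evaluates a single condition and returns the chosen branch,
# which can be a function object
use_mean <- TRUE
f <- if (use_mean) mean else sd

# f can then be handed to apply() like any other function
row_stats <- apply(matrix(1:12, 4), 1, f, na.rm = TRUE)   # 5 6 7 8
```

Picking the function first and storing it in a variable also keeps the apply() call easier to read than an inline condition.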

Related

How can I add a column in R whose values reference a column in a different data frame?

So I have an R script that ranks college football teams. It outputs a rating, and I want to take that rating from one data frame and add it as a new column to another data frame containing info about the upcoming week of games. Here's what I'm currently trying to do:
random_numbers <- rnorm(130, mean = mean_value, sd = sd_value)
sample_1 <- as.vector(sample(random_numbers, 1, replace = TRUE))
upcoming_games_df <- upcoming_games_df %>%
mutate(home_rating = case_when(home_team %in% Ratings$team ~ Ratings$Rating[Ratings$team == home_team]),
TRUE ~ sample_1)
sample_2 <- as.vector(sample(random_numbers, 1, replace = TRUE))
upcoming_games_df <- upcoming_games_df %>%
mutate(away_rating = case_when(away_team %in% PrevWeek_VoA$team ~ Ratings$Rating[Ratings$team == away_team]),
TRUE ~ sample_2)
I originally had the sample(random_numbers) call inside mutate(), but I got the error "must be a vector, not a formula object." So I moved it outside mutate() and added as.vector(), but it still gave me the same error. I also got the warning "longer object length is not a multiple of shorter object length". I don't know what to do now. The code above is the last thing I tried before coming here for help.

case_when() requires all right-hand sides to have the same length, or length 1. sample_1 and sample_2 have length 1, so they can be recycled (as.vector() is not needed, since rnorm() already returns a vector). Note also that in the posted code the closing parenthesis of case_when() comes before TRUE ~ sample_1, so that formula is passed to mutate() as a separate argument, which would explain the "must be a vector, not a formula object" error.
In addition, == is an elementwise comparison, so it only works when both sides have the same length or one of them has length 1 (and is recycled). Thus Ratings$team == home_team is the likely cause of the "longer object length" warning.
Instead of case_when(), this may be done with a join (assuming the 'team' column in 'Ratings' is not duplicated):
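library(dplyr)
upcoming_games_df2 <- upcoming_games_df %>%
  left_join(Ratings, by = c("home_team" = "team")) %>%
  # drop the joined Rating column so the second join does not create Rating.x / Rating.y
  mutate(home_rating = coalesce(Rating, sample_1), Rating = NULL) %>%
  left_join(PrevWeek_VoA, by = c("away_team" = "team")) %>%
  # assumes PrevWeek_VoA also has a Rating column
  mutate(away_rating = coalesce(Rating, sample_2), Rating = NULL)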
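The join-then-coalesce pattern can be tried on toy data. The sketch below is only an illustration: the data frames ratings and games and the constant fallback are hypothetical stand-ins for the asker's tables and the randomly sampled default rating.

```r
library(dplyr)

# Hypothetical toy data standing in for the asker's tables
ratings <- data.frame(team = c("A", "B"), Rating = c(10, 20))
games <- data.frame(home_team = c("A", "C"), away_team = c("B", "A"))
fallback <- 0  # stand-in for the randomly sampled default rating

games_rated <- games %>%
  left_join(ratings, by = c("home_team" = "team")) %>%
  # teams without a match get NA from the join; coalesce() fills in the fallback
  mutate(home_rating = coalesce(Rating, fallback), Rating = NULL) %>%
  left_join(ratings, by = c("away_team" = "team")) %>%
  mutate(away_rating = coalesce(Rating, fallback), Rating = NULL)
```

Team "C" has no entry in ratings, so its home_rating falls back to the default, while matched teams keep their joined rating.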

Error when using na.rm = T for ntile() in R

Summary
I am using dplyr's ntile() to assign observations into deciles.
I want NAs to be ignored.
However, when I add na.rm = T, the call fails with the error shown below.
Code
group_by(date) %>%
mutate(aggregate_ranking = ntile(average_ranking, 10)) %>%
ungroup()
Other
What's most important to me is to assign data into deciles. If it cannot be done via ntile(), is there another function that does this where NAs are ignored?
Error message
Error: Problem with `mutate()` input `aggregate_ranking`.
x unused argument (na.rm = T)
ℹ Input `aggregate_ranking` is `ntile(average_ranking, 10, na.rm = T)`.
ℹ The error occurred in group 1: date = Jan 1993.
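
A sketch of one possible way forward, assuming a reasonably recent dplyr: ntile() has no na.rm argument, which is what the "unused argument" message is saying. As far as its documentation describes, missing values are left as NA while the remaining values are split into buckets, so dropping na.rm may already give the wanted behaviour. The data below are made up for illustration, and two buckets are used instead of ten to keep the example small.

```r
library(dplyr)

# Hypothetical data: one date group with a missing ranking
df <- data.frame(
  date = rep("Jan 1993", 4),
  average_ranking = c(10, NA, 30, 20)
)

result <- df %>%
  group_by(date) %>%
  # no na.rm argument: NA inputs are expected to stay NA in the output
  mutate(aggregate_ranking = ntile(average_ranking, 2)) %>%
  ungroup()
```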

Replace NA in selected columns with replace_na

I have a dataset that contains the columns hh_c22j, hh_r02a and hh_r02b. I want to replace the NAs in these columns with 0. The command below works, but it is redundant, because I have to specify the replacement for each column separately.
df %>% select(case_id, hh_c22j, hh_r02a, hh_r02b) %>% replace_na(list(hh_c22j=0, hh_r02a=0, hh_r02b=0))
I would like to pass the columns together as a vector/list, like below.
df %>% select(case_id, hh_c22j, hh_r02a, hh_r02b) %>% replace_na(c(hh_c22j, hh_r02a, hh_r02b), 0)
But I got an error. The error message is:
Error in is_list(replace) : object 'hh_c22j' not found
Error: 1 components of `...` were not used.
We detected these problematic arguments:
* `..1`
Did you misspecify an argument?
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/rlib_error_dots_unused>
1 components of `...` were not used.
We detected these problematic arguments:
* `..1`
Did you misspecify an argument?
Backtrace:
1. `%>%`(...)
5. ellipsis:::action_dots(...)
Run `rlang::last_trace()` to see the full context.

Assuming you have other columns in the data as well but want to change just the three columns, you can do this:
library(dplyr)
library(tidyr) # replace_na() comes from tidyr
df %>% mutate_at(vars(hh_c22j, hh_r02a, hh_r02b), list(~ replace(., which(is.na(.)), 0)))
# Alternatively, using replace_na
df %>% mutate_at(vars(hh_c22j, hh_r02a, hh_r02b), list(~ replace_na(., 0)))
Just for future reference, a small reproducible sample would go a long way to get better answers!
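In dplyr 1.0.0 and later, mutate_at() is superseded by across(), so the same replacement can also be sketched as below. The sample data frame is made up for illustration and only reuses the column names from the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample data using the column names from the question
df <- data.frame(
  case_id = 1:3,
  hh_c22j = c(NA, 2, 3),
  hh_r02a = c(1, NA, 3),
  hh_r02b = c(1, 2, NA)
)

# replace NA with 0 in the three selected columns only
df_filled <- df %>%
  mutate(across(c(hh_c22j, hh_r02a, hh_r02b), ~ replace_na(.x, 0)))
```

Columns not named inside across(), such as case_id here, are left untouched.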

One clean way to do this is to use mutate_all() and pass it the function to apply to each column. For example, here I create a dataset similar to yours and replace the NA values with 0s:
library(dplyr)
library(tidyr) # for replace_na()
data <- data.frame(hh_c22j = sample(c(NA, 1), size = 5, replace = TRUE),
                   hh_r02a = sample(c(NA, 1), size = 5, replace = TRUE),
                   hh_r02b = sample(c(NA, 1), size = 5, replace = TRUE))
data %>%
  mutate_all(replace_na, 0)
If you only want to perform this operation on some columns, mutate_at() is a similar option that lets you specify which column(s) to use.

summarize percent calculation invalid 'type' error

I have the following code that used to work but now throws an error. I think a package update may have broken it.
scorecard_data %>%
select (STABBR, HBCU, MENONLY, WOMENONLY) %>%
filter (str_detect(STABBR, "OH|PA|WV|KY|IN|MI")) %>%
group_by (STABBR) %>%
summarize (prcntHBCU = (sum(HBCU, na.rm = TRUE)/length(HBCU[!is.na(HBCU)])*100),
prcntMEN = (sum(MENONLY, na.rm = TRUE)/length(MENONLY[!is.na(MENONLY)])*100),
prcntWOMEN = (sum(WOMENONLY, na.rm = TRUE)/length(WOMENONLY[!is.na(WOMENONLY)])*100)) %>%
gather(key = 'Type.prcnt', value = 'Prcnt', prcntHBCU:prcntWOMEN) %>%
ggplot (aes (x = STABBR, y = Prcnt, fill = Type.prcnt)) +
geom_col(stat = "identity", position = "dodge") +
ggtitle ("% of HBCUs, Men Only, and Women Only Institutions - by OH and Neighboring States") +
xlab ("State") +
ylab ("Percent of Institutions")
and here is the error RStudio gives when I run it:
Error: Problem with `summarise()` input `prcntHBCU`.
x invalid 'type' (character) of argument
i Input `prcntHBCU` is `(sum(HBCU, na.rm = TRUE)/length(HBCU[!is.na(HBCU)]) * 100)`.
i The error occurred in group 1: STABBR = "IN".
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/dplyr_error>
Problem with `summarise()` input `prcntHBCU`.
x invalid 'type' (character) of argument
i Input `prcntHBCU` is `(sum(HBCU, na.rm = TRUE)/length(HBCU[!is.na(HBCU)]) * 100)`.
i The error occurred in group 1: STABBR = "IN".
Backtrace:
1. dplyr::select(., STABBR, HBCU, MENONLY, WOMENONLY)
1. dplyr::filter(., str_detect(STABBR, "OH|PA|WV|KY|IN|MI"))
1. dplyr::group_by(., STABBR)
2. dplyr::summarize(...)
14. dplyr:::h(simpleError(msg, call))
Can anyone help debug this and tell me why it isn't working?

Building on @gregmacfarlane and @Calumn_You, the cause is likely applying sum() to a character vector.
One easy way to check the types of your variables is summary(scorecard_data). Numeric variables will show min, max and median. Character variables will just be reported as class character. Factor variables will show a tally of the counts per level.
You can convert a character vector to numeric with as.numeric(), assuming the strings are numbers. If the variable is a factor, convert it to character first and then to numeric, i.e. as.numeric(as.character(x)); calling as.numeric() directly on a factor returns the internal level codes.
So you are probably looking for a solution like:
scorecard_data %>%
  mutate(HBCU = as.numeric(as.character(HBCU)), # if HBCU is of type factor
         MENONLY = as.numeric(MENONLY),         # if MENONLY is of type character
         WOMENONLY = as.numeric(WOMENONLY)) %>%
  # STABBR holds state abbreviations, so leave it as character
  # the rest of your code follows here
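The effect of the conversions can be checked in isolation. This is a small base R sketch with made-up 0/1 flags (flag_chr and flag_fct are hypothetical, not columns from scorecard_data):

```r
# Hypothetical data: a 0/1 flag stored as character and as factor
flag_chr <- c("1", "0", NA, "1")
flag_fct <- factor(c("1", "0", "0", "1"))

# sum(flag_chr) would fail with: invalid 'type' (character) of argument
# converting first makes the percentage calculation work
pct_chr <- sum(as.numeric(flag_chr), na.rm = TRUE) / sum(!is.na(flag_chr)) * 100

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(flag_fct)                  # 2 1 1 2
values <- as.numeric(as.character(flag_fct))   # 1 0 0 1
```

The codes/values difference is why the factor case needs the extra as.character() step.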

Alter Variables in List with purrr

I have three datasets a, b, c with identical variable names. I want to check whether these variables contain missing/invalid values.
I have a checking function check_variables() that checks missing or invalid values (for example the function could just be is.na).
While I could apply my checking function check_variables() explicitly to each of these datasets, like:
check.output = list(
a = check_variables(a),
b = check_variables(b),
c = check_variables(c)
)
purrr offers a nice all-in-one-step solution for this problem:
list(a,b,c) %>%
map(~ .x %>% check_variables())
But this step only maps check_variables() over the elements within each dataset in the list. Instead, I want check_variables() to be applied to each dataset as a whole. Is there a way to map a function over the datasets in the list rather than over the elements within each dataset?

If you want to modify the individual variables, you can pass a vector of variable names and then use get() and assign() to access and modify them.
library(purrr)
library(magrittr)
a = list(var = 1)
b = list(var = 2)
c = list(var = 3)
# get the current environment. alternative is to use functions like
# parent.frame from within the loop but that can get confusing
e = environment()
c('a','b','c') %>%
  map(function(x){
    ls = get(x, envir = e)
    # whatever modification you want to make on the list
    ls$var = ls$var + 1
    assign(x, ls, envir = e)
  })
Note that in real life, as @MrFlick stated, you probably don't want to do this. Keep a, b, c in a single list and your downstream analysis will be easier, since I assume they will have to be processed through the same pipeline. map() will happily return your modified list, which you can either use to overwrite the original list or assign to a new variable. Alternatively, use a for loop over list indexes to modify the original list on the go or fill a pre-allocated new variable.
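The single-list approach recommended in the note could look roughly like this. It is a minimal sketch in which the list name datasets and its contents are made up for illustration:

```r
library(purrr)

# Keep the three (hypothetical) datasets together in one named list
datasets <- list(a = list(var = 1), b = list(var = 2), c = list(var = 3))

# map() returns a new list; here it overwrites the original
datasets <- map(datasets, function(d) {
  d$var <- d$var + 1
  d  # return the modified element
})
```

Because the list is named, individual results remain available as datasets$a, datasets$b and datasets$c.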

If the purpose is to apply check_variables(), which takes a dataset (table) and returns a single TRUE or FALSE, then the issue might be related to the use of vectorized functions.
R and its packages have many vectorized functions, such as is.na(). When such a function is applied to a vector like c(1, NA, 2) or to a data frame, it is applied to each element, giving FALSE TRUE FALSE rather than a single TRUE (some element is NA) or FALSE (no element is NA).
When check_variables() is composed of these vectorized functions, we need to "aggregate" their results using functions like all() and any(). Furthermore, we need to control the scope of the aggregation to decide whether check_variables() is applied to elements, to variables (columns), or to the entire table (data frame):
library(tidyverse) # in production code, import only `dplyr` and `tidyr`
library(purrr)
a = data.frame(x = c(1,2,3), y = c(3,NA,5))
b = data.frame(x = c(1,NA,3), y = c(3,4,5))
c = data.frame(x = c(1,NA,3), y = c(3,NA,4))
# apply `check.func` to variables (columns)
# aggregation has to be limited to the scope of each variable (column)
# `dplyr::summarize_all` happens to work like this
check.vars = function(list.tbls, check.func) list.tbls %>% map(~ .x %>% summarize_all(check.func))
# apply `check.func` to the entire table
# as long as `check.func` takes a table and returns a single value
# we can apply it directly
check.tbls = function(list.tbls, check.func) list.tbls %>% map(~ check.func(.x))
## Some sample functions
# check that no element in scope is NA
# takes either a vector or a table, returns a boolean
has.no.na = . %>% is.na %>% any %>% `!`
# check that all elements in scope are less than 5; NAs count as FALSE
# takes either a vector or a table, returns a boolean
is.lt.5 = . %>% `<`(5) %>% all %>% replace_na(FALSE)
# check that all elements in scope are less than 5; NAs are ignored, all NA means TRUE
# takes either a vector or a table, returns a boolean
is.lt.5.rm.na = . %>% `<`(5) %>% all(na.rm = TRUE)
## Use the sample functions to check variables within each dataset
list(a,b,c) %>% check.vars(has.no.na)
list(a,b,c) %>% check.vars(is.lt.5)
## Use the sample functions to check each dataset as a whole
list(a,b,c) %>% check.tbls(has.no.na)
list(a,b,c) %>% check.tbls(is.lt.5)
list(a,b,c) %>% check.tbls(is.lt.5.rm.na)
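The three scopes of aggregation described above (element, column, whole table) can also be seen in a short base R sketch. The data frame df and the result names are made up for illustration:

```r
# Hypothetical data frame with one missing value
df <- data.frame(x = c(1, NA, 3), y = c(3, 4, 5))

# element scope: one logical per element of the vector
per_element <- is.na(df$x)

# column scope: aggregate within each column, one logical per column
per_column <- vapply(df, function(col) any(is.na(col)), logical(1))

# table scope: aggregate over everything, a single logical
whole_table <- any(is.na(df))
```

The same vectorized check (is.na) yields three different shapes of answer depending on where any() is applied.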
