I'm sure there's a way to do this but I can't figure it out. I'd like to be able to pass a list of arguments to mutate_at() within a function without having to specify each argument.
library(tidyverse)
fake_data <-
tibble(
id = letters[1:6],
ind_group_a = rep(0:1, times = 3),
ind_group_b = rep(1:0, each = 3)
)
# id ind_group_a ind_group_b
# a 0 1
# b 1 1
# c 0 1
# d 1 0
# e 0 0
# f 1 0
This function converts all 1's to "yes" and all 0's to "no":
recode_indicator <- function(x, if_1 = "yes", if_0 = "no") {
ifelse(x == 1, if_1, if_0)
}
And I can use it fine like so:
fake_data %>%
mutate_at(
vars(starts_with("ind_")),
recode_indicator,
if_1 = "Has",
if_0 = "Missing"
)
# id ind_group_a ind_group_b
# <chr> <chr> <chr>
# a Missing Has
# b Has Has
# c Missing Has
# d Has Missing
# e Missing Missing
# f Has Missing
This is a simplified example, but what I'd like to do is make it available in a function without having to write out all of the arguments. Ideally it would be something short like binary_values = list(...), but I can't figure out how to pass these items as the additional arguments of recode_indicator():
roll_up_indicators <- function(x,
#binary_values = list(if_1 = "yes", if_0 = "no"),
...) {
ind_cols <- grep("^ind_", names(x))
df <-
x %>%
rename_at(ind_cols, str_remove, "^ind_") %>%
mutate_at(
ind_cols,
recode_indicator # ,
# binary_values # <- here's the problem area
) %>%
group_by_at(ind_cols) %>%
count() %>%
ungroup()
knitr::kable(df, ...)
}
fake_data %>% roll_up_indicators()
# |group_a |group_b | n|
# |:-------|:-------|--:|
# |no |no | 1|
# |no |yes | 2|
# |yes |no | 2|
# |yes |yes | 1|
Update
In terms of not rewriting all of the arguments, the formals() function can be used:
roll_up_indicators <- function(x,
binary_values = formals(recode_indicator), # <--- formals
...) {
ind_cols <- grep("^ind_", names(x))
df <-
x %>%
rename_at(ind_cols, str_remove, "^ind_") %>%
mutate_at(
ind_cols,
partial(recode_indicator, !!!binary_values) # <--- the winning answer
) %>%
group_by_at(ind_cols) %>%
count() %>%
ungroup()
knitr::kable(df, ...)
}
One solution is to use purrr::partial to specify that if_1 and if_0 arguments should come from binary_values:
roll_up_indicators <- function(x,
binary_values = list(if_1 = "yes", if_0 = "no"),
...) {
ind_cols <- grep("^ind_", names(x))
df <-
x %>%
rename_at(ind_cols, str_remove, "^ind_") %>%
mutate_at(
ind_cols,
partial(recode_indicator, !!!binary_values) ## <--- partial() here
) %>%
group_by_at(ind_cols) %>%
count() %>%
ungroup()
knitr::kable(df, ...)
}
fake_data %>% roll_up_indicators()
# |group_a |group_b | n|
# |:-------|:-------|--:|
# |no |no | 1|
# |no |yes | 2|
# |yes |no | 2|
# |yes |yes | 1|
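For comparison, base R's do.call() can also splice a named list into a call without purrr. The sketch below is only an illustration: it reuses recode_indicator() from the question, and the wrapper name recode_with is hypothetical (it is not part of dplyr or purrr).

```r
recode_indicator <- function(x, if_1 = "yes", if_0 = "no") {
  ifelse(x == 1, if_1, if_0)
}

# Hypothetical wrapper: forwards a named list of extra arguments
# to recode_indicator() via do.call().
recode_with <- function(x, binary_values = list()) {
  do.call(recode_indicator, c(list(x = x), binary_values))
}

binary_values <- list(if_1 = "Has", if_0 = "Missing")

recoded <- recode_with(c(0, 1, 0), binary_values)
recoded
# [1] "Missing" "Has"     "Missing"

# An empty list falls back to the defaults of recode_indicator()
defaults <- recode_with(c(0, 1, 0))
defaults
# [1] "no"  "yes" "no"
```

Because mutate_at() forwards extra arguments to the function it applies, such a wrapper could presumably be used as mutate_at(ind_cols, recode_with, binary_values = binary_values).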
It's probably best to go with the pre-made functions, like recode, but I've also adapted your function if you wanted to add additional functionality. For that, I'm assuming that binary_values is appropriately named and will only ever include two values.
Option 1: Use recode
This requires you to put the starting and ending values within a list. You'll need to quote strings, and either quote numbers or wrap them in backticks (``) when they are used as names.
binary_values = list("1" = "yes", "0" = "no")
fake_data %>%
mutate_at(vars(starts_with("ind_")),
list(~recode(.,!!!binary_values)))
Option 2: Specify location or name in list within function
recode_value <- function(x,
binary_values = list(if_1 = "yes", if_0 = "no")
## You'll need to decide whether you'll name them as expected or always put them in this order; it's up to you
) {
if_1 = binary_values$if_1 # or binary_values[[1]]
if_0 = binary_values$if_0 # or binary_values[[2]]
ifelse(x == 1, if_1, if_0)
}
binary_values = list(if_1 = "yes", if_0 = "no")
fake_data %>%
mutate_at(
vars(starts_with("ind_")),
recode_value, ## fixed typo
binary_values
)
I am attempting to automate many of the tasks done when creating new shiny apps, by writing the needed code to files based on any given dataset. While creating code to be used as a starting point for factor levels, I have become stuck.
The idea is to gather all the unique values appearing in certain columns and create character vectors from them that can then be altered as needed. The complication is that some of the desired levels span multiple columns, because more than one option can be selected. I have written code that almost works, but it fails at one point: all but the first string are dropped when mapping over a list of vectors. This is hard to explain, so hopefully the code below shows what I am doing; please ask if anything is still unclear.
### Starting point
data <- tibble(
a = rep(c("foo", "bar"), 3),
b = rep(c("baz", "zap"), 3),
c = rep(c("yes", "no"), 3),
c_opt_one = rep(c("c_one", ""), 3),
c_opt_two = rep(c("c_two", ""), 3)
)
levels_meta <- tibble(
column = c("a", "b", "c", "c_opt", "c_opt"),
blah = rep(c("blah"), 5) # <- multiple other columns, not needed here
)
### Desired output, with problem noted
#>levels
# a_responses <- c(
# "foo" = "foo", <- only first entry kept
# "bar" = "bar" <- missing
# )
#
# b_responses <- c(
# "baz" = "baz", <- only first entry kept
# "zap" = "zap" <- missing
# )
#
# c_responses <- c(
# "yes" = "yes", <- only first entry kept
# "no" = "no" <- missing
# )
#
# c_opt_responses <- c(
# "c_opt_one" = "c_one", <- all kept as desired, but only because these
# "c_opt_two" = "c_two" <- come from single element vectors before combined
# )
### Processing code
level_names <- levels_meta %>%
select(column) %>%
group_by(column) %>%
add_count()
multi_col_level_names <- level_names %>%
filter(n > 1) %>%
pull(column) %>%
unique()
single_col_level_names <- setdiff(level_names$column, multi_col_level_names)
levels <- lapply(data, unique) %>%
lapply(setdiff, "")
levels <- map(levels, ~ paste0(" \"", .x, "\"", " = \"", .x, "\""))
# Problem occurs here - only first entry is kept.
# I did try replacing the FALSE arg with levels[[.x]], but same result.
levels <- imap(levels, ~ ifelse(length(.x) == 1, str_replace(.x, "\\w+", .y), .x))
# Rest of code does work, including in case anyone could suggest a more efficient way
multi_col_levels <- map(
multi_col_level_names,
function(prefix) levels %>%
keep(startsWith(names(.), prefix)) %>%
set_names(str_replace(names(.), names(.), prefix))
) %>% squash()
multi_col_levels <- map(
set_names(multi_col_level_names),
~ unlist(multi_col_levels[names(multi_col_levels) == .], use.names = FALSE)
)
levels <- c(levels[single_col_level_names], multi_col_levels)
levels <- map(levels, ~ paste0(.x, collapse = ",\n"))
levels <- imap(levels, ~ paste0(.y, "_responses <- c(\n", .x, "\n)"))
paste_lvls <- function(out, input) paste(out, input, sep = "\n\n")
levels <- levels %>% reduce(paste_lvls)
My suggestion is to keep it simpler than your imap/ifelse solution. The problem should be relatively small, so a simple for loop can solve it with less hassle and more clarity (given that the rest of the code does what you want):
for (eachlevel in names(levels)) {
if(length(levels[[eachlevel]]) == 1) {
levels[[eachlevel]] <- str_replace(levels[[eachlevel]], "\\w+", eachlevel)
}
}
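The reason only the first entry survives appears to be that ifelse() returns a result shaped like its test: length(.x) == 1 is a single TRUE or FALSE, so only one element comes back. Below is a minimal base R sketch of that behaviour, using made-up values, a hypothetical helper name (rename_if_single), and sub() standing in for str_replace().

```r
two_levels <- c("\"foo\" = \"foo\"", "\"bar\" = \"bar\"")

# The test has length 1, so ifelse() returns just one element of `no`
truncated <- ifelse(length(two_levels) == 1, "unused", two_levels)
length(truncated)
# [1] 1

# Hypothetical helper: a plain if/else returns the whole vector untouched
rename_if_single <- function(values, name) {
  if (length(values) == 1) sub("\\w+", name, values) else values
}

kept <- rename_if_single(two_levels, "a")
length(kept)
# [1] 2

renamed <- rename_if_single("\"c_one\" = \"c_one\"", "c_opt_one")
renamed
# [1] "\"c_opt_one\" = \"c_one\""
```

The same if/else body could be dropped into the imap() call in place of ifelse(), which keeps the pipeline style while avoiding the truncation.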
I am not sure if the approach below is what you are after:
library(tidyverse)
levels_meta$column %>%
unique %>%
set_names(., paste0(., "_response")) %>%
map(. ,
~ {
dat <- select(data, starts_with(.x) & ends_with(.x))
if(length(dat) == 0) {
dat <- select(data, starts_with(.x))
}
if (length(dat) == 1) {
set_names(unique(dat[[.x]]))
} else if (length(dat) > 1) {
map(dat, ~ unique(.x[which(.x != "")]))
} else {
NULL
}
}
)
#> $a_response
#> foo bar
#> "foo" "bar"
#>
#> $b_response
#> baz zap
#> "baz" "zap"
#>
#> $c_response
#> yes no
#> "yes" "no"
#>
#> $c_opt_response
#> $c_opt_response$c_opt_one
#> [1] "c_one"
#>
#> $c_opt_response$c_opt_two
#> [1] "c_two"
Created on 2022-06-02 by the reprex package (v2.0.1)
I have a dataframe that looks like this :
df <- data.frame(ID = rep(1:10, each = 6),
Site = rep(c("A","B","C","D"), each = 6, times = 10),
Department = rep(c("E","F","G","H"), each = 6, times = 10),
Occupation = rep(c("I","J","K","L"), each = 6, times = 10),
Construct = rep(paste0("X",1:6), times = 10),
Score = sample(c("Green","Orange","Red"), size = 60, replace = TRUE))
head(df)
Basically, each ID belongs to a site, a department and has an occupation, and is evaluated on six constructs.
I have adapted a previous function of mine to compute the N and the rate of each Score category for a given Construct, by any combination of Site, Department and Occupation :
my_function <- function(..., dimension = NULL){
df %>%
filter(Construct == dimension) %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
}
This works perfectly, as I simply have to indicate which Construct, and add any of the three factors (Site, Department, Occupation) as optional arguments to obtain a summary. For example, a summary of X1 by Site and Department would be :
my_function(dimension = "X1", Site, Department)
However, I would like to filter out some of the values of the Occupation variable, but only when looking at a summary including this variable. I tried to do so by checking whether Occupation was passed as an optional argument, and exclude the specific values when it was the case. Something like :
my_function <- function(..., dimension = NULL){
if(hasArg(Occupation)){
df %>%
filter(Construct == dimension, Occupation != "I") %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
} else {
df %>%
filter(Construct == dimension) %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
}
}
But it does not seem to work, as it consistently includes the values I'd like to filter out, even when I specify Occupation as an optional argument. I tried to fiddle with things like curly-curly {{}} but I can't seem to get this function to filter the specific values.
hasArg seems to expect all of the arguments to be named, whereas in
my_function(dimension="X1", Site, Department, Occupation)
this is not the case.
Perhaps:
my_function <- function(..., dimension = NULL){
hasOcc <- "Occupation" %in% as.character(match.call())
if (hasOcc) {
df %>%
filter(Construct == dimension, Occupation != "I") %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
} else {
df %>%
filter(Construct == dimension) %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
}
}
my_function(Site, Department, Occupation, dimension = "X1")
# # A tibble: 7 x 6
# # Groups: Site, Department, Occupation [3]
# Site Department Occupation Score n rate
# <chr> <chr> <chr> <chr> <int> <dbl>
# 1 B F J Green 6 0.6
# 2 B F J Orange 4 0.4
# 3 C G K Green 2 0.2
# 4 C G K Orange 2 0.2
# 5 C G K Red 6 0.6
# 6 D H L Green 6 0.6
# 7 D H L Orange 4 0.4
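One caveat with as.character(match.call()) is that it also contains the value of dimension, so a call with dimension = "Occupation" would be detected as well. The sketch below is one possible alternative, not a definitive fix: it inspects only the expressions captured in `...`. The function names via_match_call and has_occupation are hypothetical and exist purely for the comparison.

```r
# Detection as in the answer above: looks at every element of the call
via_match_call <- function(..., dimension = NULL) {
  "Occupation" %in% as.character(match.call())
}

# Alternative sketch: look only at the unevaluated expressions in `...`
has_occupation <- function(..., dimension = NULL) {
  dot_exprs <- as.list(substitute(list(...)))[-1]
  dot_text <- vapply(
    dot_exprs,
    function(e) paste(deparse(e), collapse = ""),
    character(1)
  )
  "Occupation" %in% dot_text
}

has_occupation(Site, Occupation, dimension = "X1")
# [1] TRUE
has_occupation(Site, dimension = "Occupation")
# [1] FALSE
via_match_call(Site, dimension = "Occupation")
# [1] TRUE
```

Neither function evaluates Site or Occupation, so both can be called without a data frame in scope.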
Some other thoughts on the function:
Reaching out of its scope to get df is not good practice: it is not really reproducible, and it can be difficult to troubleshoot. For instance, if you forget to assign your data to df, you'll see
my_function(Site, Department, Occupation, dimension = "X1")
# Error in UseMethod("filter") :
# no applicable method for 'filter' applied to an object of class "function"
(This error is because it is finding stats::df.)
Further, if you want to use it against a different non-df-named dataset, you're out of luck.
Recommendation: explicitly pass the data. A tidyverse commonality is to pass it as the first argument. One side-benefit of this is that you can (generally) use this in the middle of a %>%-pipe directly.
my_function <- function(.data, ..., dimension = NULL) { .data %>% ... }
You can reduce the number of pipelines in there by including the Occupation conditional directly in the filter(..). This is not just code-golf: in more complex code examples, it's not hard to imagine updating one of the %>%-pipes and either forgetting the other or updating it differently. Since the only difference here is a component of filter, we can add it there:
my_function <- function(..., dimension = NULL){
hasOcc <- "Occupation" %in% as.character(match.call())
df %>%
filter(Construct == dimension, !hasOcc | Occupation != "I") %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
}
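If dimension is required, don't give it a default of NULL: if it is omitted, Construct == NULL evaluates to logical(0) and filter() fails with a less obvious error, whereas an argument with no default produces a clear "argument is missing" error.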
my_function <- function(.data, ..., dimension) { ... }
If it is instead optional and you don't want to filter on it if not provided, then you need to check for that in your filter:
filter(if (is.null(dimension)) TRUE else Construct == dimension, ...)
If you can imagine wanting dimension to be either NA (matching an explicit NA value in the data) or you might want "one or more", then you may want to use %in% instead of ==:
NA == NA
# [1] NA
NA %in% NA
# [1] TRUE
So your function could use
filter(if (is.null(dimension)) TRUE else Construct %in% dimension, ...)
These points would result in your function being either
my_function <- function(.data, ..., dimension = NULL){
hasOcc <- "Occupation" %in% as.character(match.call())
.data %>%
filter(if (is.null(dimension)) TRUE else Construct %in% dimension, !hasOcc | Occupation != "I") %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
}
if dimension is optional, or
my_function <- function(.data, ..., dimension) {
hasOcc <- "Occupation" %in% as.character(match.call())
.data %>%
filter(Construct %in% dimension, !hasOcc | Occupation != "I") %>%
group_by(..., Score) %>%
summarise(n = n()) %>%
mutate(rate= round(n/sum(n),2))
}
otherwise.
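The NULL and %in% points can be checked without dplyr. The following is a minimal base R sketch on toy data; the data frame scores and the helper keep_construct are hypothetical, and plain subsetting stands in for filter().

```r
scores <- data.frame(
  Construct = c("X1", "X2", NA, "X1"),
  Score = c("Green", "Red", "Orange", "Red"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: NULL means "do not filter on Construct"
keep_construct <- function(data, dimension = NULL) {
  keep <- if (is.null(dimension)) {
    rep(TRUE, nrow(data))
  } else {
    data$Construct %in% dimension
  }
  data[keep, , drop = FALSE]
}

all_rows <- keep_construct(scores)       # no filtering, 4 rows
x1_rows <- keep_construct(scores, "X1")  # 2 rows
na_rows <- keep_construct(scores, NA)    # %in% matches the explicit NA, 1 row

# Comparing against NULL yields a zero-length vector, which is why
# an unguarded `Construct == dimension` cannot be used as a row filter
length(scores$Construct == NULL)
# [1] 0
```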
How can I select other columns in sf_MX dataframe to add in sumbyweek? I am stuck.
sumbyweek <- sf_MX %>%
filter(CVE_ENT %in% c("09","15","17")) %>%
group_by(CVE_ENT) %>%
summarise(across(starts_with('cumul')[13:32],
sum,na.rm = TRUE,.names = '{col}_total'))%>%
select(Col1, col2) # unable to get the desired result
sf_MX Data Table:
Col 1 | Col 2 | Col 3| Cumul1 |Cumul2 | Cumul3 …
Expected result:
Col 1 | Col 2 | Cumul1_total |Cumul2_total |Cumul3_total
We could do
library(dplyr)
sumbyweek <- sf_MX %>%
filter(CVE_ENT %in% c("09","15","17")) %>%
group_by(CVE_ENT) %>%
summarise(across(starts_with('cumul'),
sum, na.rm = TRUE, .names = '{col}_total'))
Consider the below given dataframe;
Sample DataFrame
| Name | Age | Type |
---------------------
| EF | 50 | A |
| GH | 60 | B |
| VB | 70 | C |
Code to perform Filter
df2 <- df1 %>% filter(Type == 'C') %>% select(Name)
The above code will provide me a dataframe with a single column and row.
I would like to perform a conditional filter where if a certain type is not present it should consider the name to be NULL/NA.
Example
df2 <- df1 %>% filter(Type == 'D') %>% select(Name)
Must give an output of;
| Name |
--------
| NA |
Instead of throwing an error. Any input would be really helpful. A dplyr solution or any other method would be appreciated.
Here is a base R approach:
name <- df[df$Type == "D", "Name"]
ifelse(identical(name, character(0)), NA, name)
[1] NA
Should no row have Type equal to D, the subset operation returns character(0). We can compare the output against this and return NA as appropriate.
Data:
df <- data.frame(Name=c("EF", "GH", "VB"),
Age=c(50, 60, 70),
Type=c("A", "B", "C"),
stringsAsFactors=FALSE)
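The same idea can be wrapped in a small function. This is a sketch only, assuming the lookup is on Type as in the question; the helper name name_or_na is hypothetical.

```r
df <- data.frame(Name = c("EF", "GH", "VB"),
                 Age = c(50, 60, 70),
                 Type = c("A", "B", "C"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: return the matching names, or NA when there are none
name_or_na <- function(data, type) {
  hits <- data$Name[data$Type == type]
  if (length(hits) == 0) NA_character_ else hits
}

name_or_na(df, "C")
# [1] "VB"
name_or_na(df, "D")
# [1] NA
```

Using if/else rather than ifelse() means that all matching names are returned when a type occurs more than once.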
An approach with complete from tidyr would be:
library(dplyr)
library(tidyr)
df1 %>%
complete(Type = LETTERS) %>% # Specify which Types you'd expect, other values are filled with NA
filter(Type == 'D') %>%
select(Name)
# A tibble: 1 x 1
# Name
# <fct>
# 1 NA
This question is essentially a duplicate of this question, except I am working in R. The pyspark solution looks solid, but I haven't been able to figure out how to apply collect_list over a window function in the same way in sparklyr.
I have a Spark DataFrame with the following structure:
------------------------------
userid | date | city
------------------------------
1 | 2018-08-02 | A
1 | 2018-08-03 | B
1 | 2018-08-04 | C
2 | 2018-08-17 | G
2 | 2018-08-20 | E
2 | 2018-08-23 | F
I am trying to group the DataFrame by userid, order each group by date, and collapse the city column into a concatenation of its values. Desired output:
------------------
userid | cities
------------------
1 | A, B, C
2 | G, E, F
The trouble is that each method I've tried to do this with has resulted in some users (appx. 3% on a test of 5000 users) not having their "cities" column in the correct order.
Attempt 1: using dplyr and collect_list.
my_sdf %>%
dplyr::group_by(userid) %>%
dplyr::arrange(date) %>%
dplyr::summarise(cities = paste(collect_list(city), sep = ", "))
Attempt 2: using replyr::gapply since the operation fits the description of "Grouped-Order-Apply".
get_cities <- . %>%
summarise(cities = paste(collect_list(city), sep = ", "))
my_sdf %>%
replyr::gapply(gcolumn = "userid",
f = get_cities,
ocolumn = "date",
partitionMethod = "group_by")
Attempt 3: write as a SQL window function.
my_sdf %>%
spark_session(sc) %>%
sparklyr::invoke("sql",
"SELECT userid, CONCAT_WS(', ', collect_list(city)) AS cities
OVER (PARTITION BY userid
ORDER BY date)
FROM my_sdf") %>%
sparklyr::sdf_register() %>%
sparklyr::sdf_copy_to(sc, ., "my_sdf", overwrite = T)
^ throws the following error:
Error: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'OVER' expecting <EOF>(line 2, pos 19)
== SQL ==
SELECT userid, conversion_location, CONCAT_WS(' > ', collect_list(channel)) AS path
OVER (PARTITION BY userid, conversion_location
-------------------^^^
ORDER BY occurred_at)
FROM paths_model
Solved! I misunderstood how collect_list() and Spark SQL could work together. I didn't realize a list could be returned; I thought the concatenation had to take place within the query. The following produces the desired result:
spark_output <- spark_session(sc) %>%
sparklyr::invoke("sql",
"SELECT userid, collect_list(city)
OVER (PARTITION BY userid
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
AS cities
FROM my_sdf") %>%
sdf_register() %>%
group_by(userid) %>%
filter(row_number(userid) == 1) %>%
ungroup() %>%
mutate(cities = paste(cities, sep = " > ")) %>%
sdf_register()
I admit that the following solution is not at all efficient (it uses a for loop and is a lot of code for what seems like a simple task), but I believe it should work:
#install.packages("tidyverse") # if needed
library(tidyverse)
df <- tribble(
~userid, ~date, ~city,
1 , "2018-08-02" , "A",
1 , "2018-08-03" , "B",
1 , "2018-08-04" , "C",
2 , "2018-08-17" , "G",
2 , "2018-08-20" , "E",
2 , "2018-08-23" , "F"
)
cityPerId <- df %>%
spread(key = date, value = city)
toMutate <- NA
for (i in 1:nrow(cityPerId)) {
cities <- cityPerId[i,][2:ncol(cityPerId)] %>% t() %>%
as.vector() %>%
na.omit()
collapsedCities <- paste(cities, collapse = ",")
toMutate <- c(toMutate, collapsedCities)
}
toMutate <- toMutate[2:length(toMutate)]
final <- cityPerId %>%
mutate(cities = toMutate) %>%
select(userid, cities)
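For a local, in-memory data frame (not a Spark DataFrame), the same result can be sketched in base R by sorting explicitly before collapsing. This is only an illustration of the "order within group, then concatenate" step; the object names visits and cities_by_user are made up, and it does not address the sparklyr ordering issue.

```r
visits <- data.frame(
  userid = c(1, 1, 1, 2, 2, 2),
  date = as.Date(c("2018-08-02", "2018-08-03", "2018-08-04",
                   "2018-08-17", "2018-08-20", "2018-08-23")),
  city = c("A", "B", "C", "G", "E", "F"),
  stringsAsFactors = FALSE
)

# Shuffle the rows to show that the result does not depend on input order
shuffled <- visits[c(3, 1, 5, 2, 6, 4), ]

# Sort by user and date first, then collapse each user's cities
ordered <- shuffled[order(shuffled$userid, shuffled$date), ]
collapsed <- tapply(ordered$city, ordered$userid, paste, collapse = ", ")

cities_by_user <- data.frame(
  userid = as.numeric(names(collapsed)),
  cities = as.vector(collapsed),
  stringsAsFactors = FALSE
)
cities_by_user$cities
# [1] "A, B, C" "G, E, F"
```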