Trying to branch out and learn some R. One thing I do often at my job is pull weighted means by some time-specific period variable. I figured out how to do that individually like this:
means_by_period <- df %>%
group_by(period) %>%
summarize(var1 = weighted.mean(var1, wgtvar),
var2 = weighted.mean(var2, wgtvar),
var3 = weighted.mean(var3, wgtvar),
var4 = weighted.mean(var4, wgtvar)
)
We do this all the time, but I won't always know how many variables (or which ones) I'll be pulling, and it would be a pain to edit this code every time. I built an Excel sheet to do it for me, but this seems like a good opportunity to learn how to write a function instead. The problem is that I'm not sure how to write it so that it works. I know my arguments will be: 1. the current data set, 2. the period, 3. the weight variable, 4. a concatenated vector of my variables?
newfunction <- function(df, period, weight, variables)
{df %>%
group_by(period) %>%
summarize(var1 = weighted.mean(var1, weight),
var2 = weighted.mean(var2, weight),
var3 = weighted.mean(var3, weight),
var4 = weighted.mean(var4, weight) )
}
I am like 2 weeks into learning so if anyone could give me some pointers on what I'd need to do here that would be great. Thanks!
If 'var1', 'var2', 'var3', 'var4' are passed as a vector of column names (as strings in 'variables'), then we can convert each one to a symbol and evaluate it (!!).
library(dplyr)
newfunction <- function(df, period, weight, variables) {
df %>%
group_by({{period}}) %>%
summarize(
!! variables[1] := weighted.mean( !! rlang::sym(variables[1]), {{weight}}),
!! variables[2] := weighted.mean( !! rlang::sym(variables[2]), {{weight}}),
!! variables[3] := weighted.mean( !! rlang::sym(variables[3]), {{weight}}),
!! variables[4] := weighted.mean( !! rlang::sym(variables[4]), {{weight}}) )
}
Here, the column names for 'period' and 'weight' are assumed to be passed unquoted, while 'variables' is a vector of strings.
As the OP mentioned that 'variables' can be of unknown length, we can loop over the vector of column names ('variables') with map, compute one summary per variable, and join the results back together on the grouping column.
library(purrr)
newfunction2 <- function(df, period, weight, variables) {
map(variables, ~ df %>%
group_by({{period}}) %>%
summarise(!! .x := weighted.mean(!! rlang::sym(.x), {{weight}}))) %>%
reduce(full_join)
}
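The loop-and-join approach above can also be written with across(), which applies one function to a whole set of columns in a single summarise() call. The following is a minimal sketch under that assumption; the function name weighted_means_by and the toy data are hypothetical, and it keeps the same convention as above (unquoted 'period' and 'weight', 'variables' as strings).

```r
library(dplyr)

# Hypothetical helper: weighted mean of every column named in `variables`,
# computed within each level of `period`.
weighted_means_by <- function(df, period, weight, variables) {
  df %>%
    group_by({{ period }}) %>%
    summarise(
      # all_of() selects columns from a character vector;
      # {{ weight }} is looked up inside each group
      across(all_of(variables), ~ weighted.mean(.x, {{ weight }})),
      .groups = "drop"
    )
}

# Toy data, made up for illustration only
toy <- data.frame(
  period = c(1, 1, 2, 2),
  wgtvar = c(1, 3, 1, 1),
  var1   = c(10, 20, 30, 40),
  var2   = c(1, 2, 3, 4)
)

res <- weighted_means_by(toy, period, wgtvar, c("var1", "var2"))
res
```

Because the columns are chosen by all_of(variables), the vector can be of any length without editing the function body.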
My question seems simple, but I just can't do it. I have a dataframe with multiple columns with the name starting with coa and another column p with values like A, D, F, and so on, which changes according to the id.
All I found is how to do this matching with a fixed value, let's say "A", as below:
df <-df %>%
mutate(ly = any(str_detect(c_across(starts_with("coa")), "A")))
However, in my case, I want to compare to the column p specifically, where p changes, something like this:
df <-df %>%
mutate(ly = any(str_detect(c_across(starts_with("coa")), p)))
In this case, I get the error:
x no applicable method for 'type' applied to an object of class "factor"
Any thoughts? Thanks!
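If we need to create a column, use if_any. The error in the question appears to come from p being a factor, so convert it to character before passing it as the pattern:
library(dplyr)
library(stringr)
df <- df %>%
mutate(ly = if_any(starts_with("coa"), ~ str_detect(.x, as.character(p))))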
I think this is a good place to use dplyr::across. You can run vignette('colwise') for a more comprehensive guide, but the key point here is that we can mutate all columns starting with "coa" simultaneously using the function == and we can pass a second argument, p, to == using the ... option provided by across.
library(dplyr)
df <- tibble(p = 1:10, coa1 = 1:10, coa2 = 11:20)
df %>%
mutate(across(.cols = starts_with('coa'), .fns = `==`, p))
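Passing extra arguments through the ... of across() has, as far as I know, been deprecated since dplyr 1.1.0, so the same comparison is sketched here with a lambda instead. It uses the same toy tibble as above; the result name res is only for illustration.

```r
library(dplyr)

df <- tibble(p = 1:10, coa1 = 1:10, coa2 = 11:20)

# Inside the lambda, .x is the current "coa" column and p is the column p
res <- df %>%
  mutate(across(starts_with("coa"), ~ .x == p))
res
```

Each "coa" column is replaced by a logical column saying whether it matches p row by row.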
Hi there, I am trying to mutate values (e.g. changing kilograms to tonnes) and replace them in the original dataset, but the change doesn't seem to persist in the original dataset.
Here is a sample dataset for reference.
Country  Type       Quantity
A        Kilograms  23132
B        Kilograms  34235
C        Tonnes     700
library(dplyr)
df %>%
filter(Type == "Kilograms") %>%
group_by(Quantity) %>%
mutate(Quantity = Quantity /1000)
But I am not sure what to do for the next step. I tried the replace function but it didn't work.
Also, I plan to add a line at the end that changes all kilograms to tonnes, something like this:
df$Type[df$Type == 'Kilograms'] <- 'Tonnes'
You can also use case_when(), which is dplyr's equivalent to SQL's CASE WHEN. Basically it allows you to vectorize multiple if_else() statements. Below, the first condition is the if statement and TRUE ~ is the else statement.
library(dplyr)
df <- data.frame(Country = c('A', 'B', 'C'),
Type = c("Kilograms", "Kilograms", "Tonnes"),
Quantity = c(23132, 34235, 700))
df <- df %>%
# convert kilogram quantities to tonnes, leave other rows as they are
mutate(Quantity = case_when(Type == 'Kilograms' ~ Quantity/1000,
TRUE ~ Quantity),
# relabel only the converted rows; the else branch keeps the existing unit
Type = case_when(Type == 'Kilograms' ~ 'Tonnes',
TRUE ~ Type)
)
Use the ifelse function to change a value based on a condition. This function also works well within the tidyverse.
Don't forget to reassign the result to the original variable, since the pipe operator does not change the input data.
library(dplyr)
df = df %>% mutate(Quantity = ifelse(Type=="Kilograms",Quantity/1000,Quantity),
Type = ifelse(Type=='Kilograms','Tonnes',Type))
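To make the point about reassignment concrete, here is a small sketch using the sample data from the question. It swaps in dplyr's if_else(), a stricter variant of ifelse() that requires both branches to have the same type; the variable name unchanged is only for illustration.

```r
library(dplyr)

df <- data.frame(Country = c("A", "B", "C"),
                 Type = c("Kilograms", "Kilograms", "Tonnes"),
                 Quantity = c(23132, 34235, 700),
                 stringsAsFactors = FALSE)

# Without assignment the converted table is only printed; df itself is untouched
df %>% mutate(Quantity = if_else(Type == "Kilograms", Quantity / 1000, Quantity))
unchanged <- df$Quantity[1]  # still the original kilogram figure

# Assigning the result back is what makes the change stick
df <- df %>%
  mutate(Quantity = if_else(Type == "Kilograms", Quantity / 1000, Quantity),
         Type = if_else(Type == "Kilograms", "Tonnes", Type))
df
```

Quantity is converted before Type is relabelled, because mutate() evaluates its arguments in order and the second expression would otherwise no longer find any "Kilograms" rows.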
I am making my first baby steps with non standard evaluation (NSE) in dplyr.
Consider the following snippet: it takes a tibble, sorts it according to the values inside a column and replaces the n-k lower values with "Other".
See for instance:
library(dplyr)
df <- cars%>%as_tibble
k <- 3
df2 <- df %>%
arrange(desc(dist)) %>%
mutate(dist2 = factor(c(dist[1:k],
rep("Other", n() - k)),
levels = c(dist[1:k], "Other")))
What I would like is a function such that:
df2bis<-df %>% sort_keep(old_column, new_column, levels_to_keep)
produces the same result, where old_column is the column "dist" (the column I use to sort the data set), new_column is "dist2" (the column I generate) and levels_to_keep is "k" (the number of values I explicitly retain).
I am getting lost in enquo, quo_name etc...
Any suggestion is appreciated.
You can do:
library(dplyr)
sort_keep=function(df,old_column, new_column, levels_to_keep){
old_column = enquo(old_column)
new_column = as.character(substitute(new_column))
df %>%
arrange(desc(!!old_column)) %>%
mutate(use = !!old_column,
!!new_column := factor(c(use[1:levels_to_keep],
rep("Other", n() - levels_to_keep)),
levels = c(use[1:levels_to_keep], "Other")),
use=NULL)
}
df%>%sort_keep(dist,dist2,3)
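With more recent versions of dplyr and rlang (assumed here: dplyr 1.0 or later), the same idea can be sketched with the embrace operator {{ }} and a glue-style name on the left of :=, which avoids enquo and substitute. The name sort_keep2 is hypothetical, chosen so it does not clash with the function above.

```r
library(dplyr)

# Hypothetical variant of sort_keep using {{ }} instead of enquo()/substitute()
sort_keep2 <- function(df, old_column, new_column, levels_to_keep) {
  df %>%
    arrange(desc({{ old_column }})) %>%
    mutate(
      # "{{ new_column }}" turns the unquoted argument into the new column name
      "{{ new_column }}" := factor(
        c({{ old_column }}[seq_len(levels_to_keep)],
          rep("Other", n() - levels_to_keep)),
        levels = c({{ old_column }}[seq_len(levels_to_keep)], "Other")
      )
    )
}

out <- as_tibble(cars) %>% sort_keep2(dist, dist2, 3)
head(out)
```

As in the original snippet, this assumes the top levels_to_keep values are distinct, since factor() does not accept duplicated levels.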
Something like this?
old_column = "dist"
new_column = "dist2"
levels_to_keep = 3
command = "df2bis<-df %>% sort_keep(old_column, new_column, levels_to_keep)"
command = gsub('old_column', old_column, command)
command = gsub('new_column', new_column, command)
command = gsub('levels_to_keep', levels_to_keep, command)
eval(parse(text=command))
I'm trying to create a function that essentially gets me the MODE, or MODE-X (the 2nd to Xth most common value), and the associated counts for each column in a data frame.
I can't figure out what I may be missing and I'm looking for some assistance? I believe it has to do with the passing in of a variable into dplyr function.
library(tidyverse)
myfunct_get_mode = function(x, rank=1){
mytable = dplyr::count(rlang::sym(x), sort = TRUE)
names(mytable)= c('variable','counts')
# return just the rank specified...such as mode or mode -1, etc
result = table %>% dplyr::slice(rlang::sym(rank))
return(result)
}
mtcars %>% lapply(. %>% (function(x) myfunct_get_mode(x, rank=2)))
There are some problems with your function:
Your function call is not doing what you think. Check with mtcars %>% lapply(. %>% (function(x) print(x))) that x is actually a whole column of mtcars, not its name. To get the column names, apply the function to names(mtcars). But then you also have to pass in the data frame you're working on.
To evaluate the symbol you get from sym, you need to put !! in front of rlang::sym(x).
rank is not a variable name, so there is no need for rlang::sym here.
table should be mytable in the second-to-last line of your function.
So how could it work (although there are probably better ways):
myfunct_get_mode = function(df, x, rank=1){
mytable = count(df, !!rlang::sym(x), sort = TRUE)
names(mytable)= c('variable','counts')
# return just the rank specified...such as mode or mode -1, etc
result = mytable %>% slice(rank)
return(result)
}
names(mtcars) %>% lapply(function(x) myfunct_get_mode(mtcars, x, rank=2))
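As an alternative to !!rlang::sym(x), a column name held in a string can be looked up with the .data pronoun. The following is only a sketch of that variant; the function name get_mode_rank and the argument name col are hypothetical.

```r
library(dplyr)

# Hypothetical variant: look the column up by name with .data[[col]]
get_mode_rank <- function(df, col, rank = 1) {
  df %>%
    # count() can name the counted expression and the count column directly
    count(variable = .data[[col]], sort = TRUE, name = "counts") %>%
    # keep only the row at the requested rank (1 = mode, 2 = second most common)
    slice(rank)
}

second_most_common <- lapply(names(mtcars),
                             function(col) get_mode_rank(mtcars, col, rank = 2))
names(second_most_common) <- names(mtcars)
second_most_common$cyl
```

When several values share the same count, which one lands at a given rank depends on how count() orders the ties.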
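If we need this in a list, we can use imap from purrr, which passes each column's name to the function as .y. Note that this version returns the top 'rank' rows for each column rather than only the row at that rank.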
f1 <- function(dat, rank = 1) {
purrr::imap(dat, ~
dat %>%
count(!! rlang::sym(.y)) %>%
rename_all(~ c('variable', 'counts')) %>%
arrange(desc(counts)) %>%
slice(seq_len(rank))) #%>%
#bind_cols - convert to a data.frame
}
f1(mtcars, 2)
I'm trying to build some functions for creating standard tables from a questionnaire, using dplyr for the data manipulation. This question was very helpful for the group_by function, passing arguments (in this case, the name of the variable I want to use to make the table) to (...), but that seems to break down when trying to pass the same arguments to other dplyr commands, specifically 'select' and 'filter'. The error message I get is '...' used in an incorrect context'.
Does anyone have any ideas on this? Thank you
For the sake of completeness (and any other hints - I'm very new to writing functions), here is the code I would like to use:
myTable <- function(x, ...) {
df <-
x %>%
group_by(Var1, ...) %>%
filter(!is.na(...) & ... != '') %>% # To remove missing values: Not working!
summarise(value = n()) %>%
group_by(Var1) %>%
mutate(Tot = sum(value)) %>%
group_by(Var1, ...) %>%
summarise(num = sum(value), total = sum(Tot), proportion = num/total*100) %>%
select(Var1, ..., proportion) # To select desired columns: Not working!
tab <- dcast(df, Var1 ~ ..., value.var = 'proportion')
tab[is.na(tab)] <- 0
print(tab)
}