I have created a user function in R to multiply two columns to create a third (within a series), so this function creates 4 new columns.
create_mult_var <- function(.data){
.data <-.data%>%
mutate(Q4_1_4 = Q4_1_2_TEXT*Q4_1_3_TEXT) %>%
mutate(Q4_2_4 = Q4_2_2_TEXT*Q4_2_3_TEXT) %>%
mutate(Q4_3_4 = Q4_3_2_TEXT*Q4_3_3_TEXT) %>%
mutate(Q4_4_4 = Q4_4_2_TEXT*Q4_4_3_TEXT)
.data
}
I am trying to modify this function so that I can apply it to a different set of columns that match the same type. For instance, if I want to repeat this on the series of columns that start with "Q8", I know I can do the following:
create_mult_var_2 <- function(.data){
.data <-.data%>%
mutate(Q8_1_4 = Q8_1_2_TEXT*Q8_1_3_TEXT) %>%
mutate(Q8_2_4 = Q8_2_2_TEXT*Q8_2_3_TEXT) %>%
mutate(Q8_3_4 = Q8_3_2_TEXT*Q8_3_3_TEXT) %>%
mutate(Q8_4_4 = Q8_4_2_TEXT*Q8_4_3_TEXT)
.data
}
Instead of creating a different function for each of the Q4 and Q8 series, I would like to add the "Q4" or "Q8" as an argument. I tried this below, but R would not accept this as an argument this way. Is there a way to achieve my desired outcome?
This does not work:
create_mult_var <- function(.data,question){
.data <-.data%>%
mutate(question_1_4 = question_1_2_TEXT*question_1_3_TEXT) %>%
mutate(question_2_4 = question_2_2_TEXT*question_2_3_TEXT) %>%
mutate(question_3_4 = question_3_2_TEXT*question_3_3_TEXT) %>%
mutate(question_4_4 = question_4_2_TEXT*question_4_3_TEXT)
.data
}
I would like to modify the function so that I can use the following:
data_in %>% create_mult_var("Q4") %>% create_mult_var("Q8")
Or something similar to create these new columns? Any suggestions are appreciated! Thank you! If this is a bad idea, any suggestions for how I should approach this?
We could build the column names with str_c() and unquote them with !! (dplyr and stringr need to be loaded):
create_mult_var_2 <- function(.data, pat){
.data <-.data%>%
mutate(!! str_c(pat, '_1_4') :=
!! rlang::sym(str_c(pat, '_1_2_TEXT')) *
!! rlang::sym(str_c(pat, '_1_3_TEXT')))
.data
}
create_mult_var_2(data_in, "Q4")
# Q4_1_2_TEXT Q4_1_3_TEXT Q4_1_4
#1 1 5 5
#2 2 6 12
#3 3 7 21
#4 4 8 32
Also, based on the pattern shown, this can be automated as well:
library(dplyr)
library(stringr)
create_mult_var_3 <- function(.data, pat) {
.data %>%
mutate(across(matches(str_c("^", pat, "_\\d+_2")), ~
.* get(str_replace(cur_column(), '_2_TEXT', '_3_TEXT')),
.names = '{.col}_new')) %>%
rename_at(vars(ends_with('_new')),
~ str_replace(., '\\d+_TEXT_new', '4'))
}
-testing
create_mult_var_3(data_in, "Q4")
# Q4_1_2_TEXT Q4_1_3_TEXT Q4_1_4
#1 1 5 5
#2 2 6 12
#3 3 7 21
#4 4 8 32
data
data_in <- data.frame(Q4_1_2_TEXT = 1:4, Q4_1_3_TEXT = 5:8)
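To get the chained call from the question (one call per question prefix, all sub-items at once), the name-building idea can be wrapped in a loop. This is only a sketch: the function name create_mult_var_all, its idx argument, and the toy_in data are assumptions made for illustration, and it uses the .data pronoun to look columns up by string.

```r
library(dplyr)

# Hypothetical helper: for each index i in `idx`, create <pat>_<i>_4 as the
# product of <pat>_<i>_2_TEXT and <pat>_<i>_3_TEXT.
create_mult_var_all <- function(df, pat, idx = 1:4) {
  for (i in idx) {
    new_col <- paste0(pat, "_", i, "_4")
    left    <- paste0(pat, "_", i, "_2_TEXT")
    right   <- paste0(pat, "_", i, "_3_TEXT")
    # `!!new_col :=` names the new column from a string;
    # `.data[[...]]` fetches an existing column by its string name.
    df <- df %>% mutate(!!new_col := .data[[left]] * .data[[right]])
  }
  df
}

# Toy data (assumed) with two sub-items for each of Q4 and Q8
toy_in <- data.frame(
  Q4_1_2_TEXT = 1:4, Q4_1_3_TEXT = 5:8,
  Q4_2_2_TEXT = 2:5, Q4_2_3_TEXT = 1:4,
  Q8_1_2_TEXT = 1:4, Q8_1_3_TEXT = rep(2L, 4),
  Q8_2_2_TEXT = 4:1, Q8_2_3_TEXT = rep(3L, 4)
)

out <- toy_in %>%
  create_mult_var_all("Q4", idx = 1:2) %>%
  create_mult_var_all("Q8", idx = 1:2)
```

A plain loop keeps the naming rule visible in one place; if a listed column is missing, the .data pronoun raises an error rather than silently skipping it.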
What is the best way to nest a function operation on a data frame in another function? I want to write a function which takes a data frame and a column name and then does something on that column and returns the modified data frame like below:
library(dplyr)
func = function(df, col){
df = df %>% mutate(col = col + 1)
return(df)
}
new_df = func(cars, 'speed')
But this raises an error because col is not a string in the function and I am not sure how to replace it with a function input argument other than strings. Any idea how to fix this with minimum effort?
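To use dplyr code inside a function you have to deal with non-standard evaluation. In this case, wrapping the argument in {{ }} inside the function is enough. Note that the column is then passed unquoted (speed, not 'speed').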
library(dplyr)
func = function(df, col) {
df = df %>% mutate({{col}} := {{col}} + 1)
return(df)
}
new_df = func(cars, speed)
head(cars)
# speed dist
#1 4 2
#2 4 10
#3 7 4
#4 7 22
#5 8 16
#6 9 10
head(new_df)
# speed dist
#1 5 2
#2 5 10
#3 8 4
#4 8 22
#5 9 16
#6 10 10
You can read more about non-standard evaluation here https://dplyr.tidyverse.org/articles/programming.html
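If the column name has to stay a string, as in the original call func(cars, 'speed'), one option is across() with all_of(). The sketch below is an illustration under that assumption; the name func_chr is hypothetical.

```r
library(dplyr)

# Hypothetical variant: `cols` is a character vector of column names
# rather than a bare (unquoted) column.
func_chr <- function(df, cols) {
  # all_of() turns the strings into a column selection;
  # the lambda adds 1 to every selected column in place.
  df %>% mutate(across(all_of(cols), ~ .x + 1))
}

new_df <- func_chr(cars, "speed")
head(new_df, 3)
```

Because cols is a character vector, the same function also accepts several columns at once, for example func_chr(cars, c("speed", "dist")).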
I think you mean that you want col to be numeric so that you can add 1. If that is correct, see below.
library(dplyr)
func = function(df, col){
# look the column up by its string name, convert it to numeric, then add 1
df = df %>% mutate(!!col := as.numeric(.data[[col]]) + 1)
return(df)
}
new_df = func(cars, 'speed')
Another alternative would be to use the index of the column name as the function argument, instead of the string name of the column.
That might look something like
library(dplyr)
func = function(df, col_index){
col_name <- colnames(df)[col_index]
# col_name is a string, so use the .data pronoun to fetch the column itself
df = df %>% mutate(!!col_name := .data[[col_name]] + 1)
return(df)
}
new_df = func(cars, 2)
I have a dataframe with a large amount of annual data. For example consider the following toy example like so:
dat <- data.frame(id = 1:2, quantity = 3:4, agg_2002 = 5:6, agg_2003 = 7:8, agg_2020 = 9:10)
What I would like to do is the following:
Look for columns named "agg_",in the set of column names, names(df)
Substitute the "agg_" in names(df) for "change_"
Calculate the relative change from year to year, so for example,
df$change_2002 <- df$agg_2002/df$agg_2002 (since 2002 is first year)
df$change_2003 <- df$agg_2003/df$agg_2002
df$change_2004 <- df$agg_2004/df$agg_2003...all the way up to 2020 or the latest value with "agg_" in the column name.
What I have so far is the following function:
func <- function(dat, overwrite = FALSE) {
nms <- grep("agg_[0-9]+$", names(dat), value = TRUE)
revnms <- gsub("agg_", "chg_", nms)
for i = 1:ncol(df) %in% revnms{
dat[, rvnms][i] <- lapply(dat[, rvnms][i], `/`, dat[, rvnms][i-1])
}
dat
}
What I am struggling with is the indexing. How do I get R to make the above calculations recursively without having to do it manually? The desired result is the "chg_" columns appended to the original dataframe:
id quantity agg_2002 agg_2003 agg_2020 chg_2002 chg_2003 chg_2020
1 1 3 5 7 9 1 1.40 1.28
2 2 4 6 8 10 1 1.33 1.25
I would like to modify the specified function above to produce the desired result via lapply if possible. All ideas are welcome. Thank you.
UPDATE: I would much prefer something using lapply or something that can accommodate differing data types.
You can reshape the table to long form, change the names (gsub would also work), then spread it back to wide form:
library(tidyverse)
library(stringr)
df <- dat %>% pivot_longer(-c(id,quantity), names_to = "agg", values_to = "year") %>%
mutate(agg = str_replace(agg, "agg", "change")) %>%
group_by(id) %>%
mutate(year = ifelse(is.na(lag(year)), year/year, year/lag(year))) %>% # Divide itself if there is no lag(year)
pivot_wider(names_from = "agg", values_from = "year")
inner_join(dat, df, by = c("id","quantity"))
id quantity agg_2002 agg_2003 agg_2020 change_2002 change_2003 change_2020
1 1 3 5 7 9 1 1.400000 1.285714
2 2 4 6 8 10 1 1.333333 1.250000
Here is a solution with dplyr and tidyr (the change for the first year comes out as NA, since it has no previous year to divide by):
library(tidyr)
library(dplyr)
dat %>%
pivot_longer(cols = starts_with("agg"),
names_to = "year",
names_prefix = "agg_",
values_to = "agg") %>%
group_by(id) %>%
arrange(year) %>%
mutate(change = agg / lag(agg, 1)) %>%
pivot_wider(names_from = year, values_from = c("agg", "change"))
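For the lapply-based version the question asks for, a base R sketch might look like the following. It assumes the year suffixes are all four digits (so sorting the names alphabetically also sorts them chronologically) and that the first year is divided by itself, as in the desired output; the function name add_changes is hypothetical.

```r
dat <- data.frame(id = 1:2, quantity = 3:4,
                  agg_2002 = 5:6, agg_2003 = 7:8, agg_2020 = 9:10)

# Hypothetical helper: append one chg_<year> column per agg_<year> column.
add_changes <- function(dat) {
  # agg_ columns in chronological order (assumes 4-digit years)
  nms  <- sort(grep("^agg_[0-9]+$", names(dat), value = TRUE))
  # previous year for each column; the first year is paired with itself
  prev <- c(nms[1], head(nms, -1))
  chg  <- lapply(seq_along(nms), function(i) dat[[nms[i]]] / dat[[prev[i]]])
  names(chg) <- sub("^agg_", "chg_", nms)
  cbind(dat, as.data.frame(chg))
}

res <- add_changes(dat)
res
```

Indexing by column name rather than by position means the id and quantity columns never enter the calculation, whatever their type.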
I'm trying to use mutate_if or select_if, etc, verbs with column names within the predicate function.
See example below:
> btest <- data.frame(
+ sjr_first = c('1','2','3',NA, NA, '6'),
+ jcr_first = c('1','2','3',NA, NA, '6'),
+ sjr_second = LETTERS[1:6],
+ jcr_second = LETTERS[1:6],
+ sjr_third = as.character(seq(6)),
+ jcr_fourth = seq(6) + 5,
+ stringsAsFactors = FALSE)
>
> btest %>% select_if(.predicate = ~ str_match(names(.), 'jcr'))
Error in selected[[i]] <- eval_tidy(.p(column, ...)) :
replacement has length zero
I'm aware I could use btest %>% select_at(vars(dplyr::matches('jcr'))) but my goal here is actually to combine the column name condition with another condition (e.g. is.numeric) using mutate_if() to operate on a subset of my columns. However I'm not sure how to get the first part with the name matching to work...
You can do:
btest %>%
select_if(str_detect(names(.), "jcr") & sapply(., is.numeric))
jcr_fourth
1 6
2 7
3 8
4 9
5 10
6 11
Tidyverse solution:
require(dplyr)
# Return (get):
btest %>%
select_if(grepl("jcr", names(.)) & sapply(., is.numeric))
# Mutate (set):
btest %>%
mutate_if(grepl("jcr", names(.)) & sapply(., is.numeric), ~ paste0("whatever", .))  # formula lambda instead of the deprecated funs()
Base R solution:
# Return (get):
btest[,grepl("jcr", names(btest)) & sapply(btest, is.numeric), drop = FALSE]
# Mutate (set):
btest[,grepl("jcr", names(btest)) & sapply(btest, is.numeric)] <- paste0("whatever", unlist(btest[,grepl("jcr", names(btest)) & sapply(btest, is.numeric)]))
You could separate two select_if calls
library(dplyr)
library(stringr)
btest %>% select_if(str_detect(names(.), 'jcr')) %>% select_if(is.numeric)
# jcr_fourth
#1 6
#2 7
#3 8
#4 9
#5 10
#6 11
We cannot combine the two conditions in a single predicate function here, because the name test operates on the entire data frame at once whereas is.numeric is applied column by column.
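With dplyr 1.0.0 or later, a name condition and a type condition can also be combined in a single selection using matches() and where(). The sketch below assumes such a version is installed and reuses the btest data from the question; doubling the values is only a placeholder operation.

```r
library(dplyr)

btest <- data.frame(
  sjr_first  = c('1', '2', '3', NA, NA, '6'),
  jcr_first  = c('1', '2', '3', NA, NA, '6'),
  sjr_second = LETTERS[1:6],
  jcr_second = LETTERS[1:6],
  sjr_third  = as.character(seq(6)),
  jcr_fourth = seq(6) + 5,
  stringsAsFactors = FALSE)

# Select: name contains "jcr" AND column is numeric
picked <- btest %>% select(matches("jcr") & where(is.numeric))

# Mutate only those columns (placeholder operation: double the values)
doubled <- btest %>%
  mutate(across(matches("jcr") & where(is.numeric), ~ .x * 2))
```

Here the name test and the type test are both evaluated per column by the selection machinery, so no sapply() over the data frame is needed.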
Well, I know that there are already tons of related questions, but none gave an answer to my particular need.
I want to use dplyr "summarize" on a table with 50 columns, and I need to apply different summary functions to these.
"Summarize_all" and "summarize_at" both seem to have the disadvantage that it's not possible to apply different functions to different subgroups of variables.
As an example, let's assume the iris dataset would have 50 columns, so we do not want to address columns by names. I want the sum over the first two columns, the mean over the third and the first value for all remaining columns (after a group_by(Species)). How could I do this?
Fortunately, there is a much simpler way available now.
With the new dplyr 1.0.0 coming out soon, you can leverage the across function for this purpose.
All you need to type is:
iris %>%
group_by(Species) %>%
summarize(
# I want the sum over the first two columns,
across(c(1,2), sum),
# the mean over the third
across(3, mean),
# the first value for all remaining columns (after a group_by(Species))
across(-c(1:3), first)
)
Great, isn't it?
I first thought the across is not necessary as the scoped variants worked just fine, but this use case is exactly why the across function can be very beneficial.
You can get the latest version of dplyr by devtools::install_github("tidyverse/dplyr")
As other people have mentioned, this is normally done by calling summarize_each / summarize_at / summarize_if for every group of columns that you want to apply a summarizing function to. As far as I know, you would have to create a custom function that performs the summarization on each subset. You can, for example, name the columns in such a way that you can use the select helpers (e.g. contains()) to pick just the columns that you want to apply the function to. If not, you can give the specific column numbers that you want to summarize.
For the example you mentioned, you could try the following:
summarizer <- function(tb, colsone, colstwo, colsthree,
funsone, funstwo, funsthree, group_name) {
return(bind_cols(
summarize_all(select(tb, colsone), .funs = funsone),
summarize_all(select(tb, colstwo), .funs = funstwo) %>%
ungroup() %>% select(-matches(group_name)),
summarize_all(select(tb, colsthree), .funs = funsthree) %>%
ungroup() %>% select(-matches(group_name))
))
}
#With colnames
iris %>% as_tibble() %>%
group_by(Species) %>%
summarizer(colsone = contains("Sepal"),
colstwo = matches("Petal.Length"),
colsthree = c(-contains("Sepal"), -matches("Petal.Length")),
funsone = "sum",
funstwo = "mean",
funsthree = "first",
group_name = "Species")
#With indexes
iris %>% as_tibble() %>%
group_by(Species) %>%
summarizer(colsone = 1:2,
colstwo = 3,
colsthree = 4,
funsone = "sum",
funstwo = "mean",
funsthree = "first",
group_name = "Species")
You could summarise the data with each function separately and then join the data later if needed.
So something like this for the iris example:
sums <- iris %>% group_by(Species) %>% summarise_at(1:2, sum)
means <- iris %>% group_by(Species) %>% summarise_at(3, mean)
firsts <- iris %>% group_by(Species) %>% summarise_at(4, first)
full_join(sums, means) %>% full_join(firsts)
Though I would try to think of something else if there are more than a handful of summarising functions you need to use.
Try this:
library(plyr)
library(dplyr)
dataframe <- data.frame(var = c(1,1,1,2,2,2),var2 = c(10,9,8,7,6,5),var3=c(2,3,4,5,6,7),var4=c(5,5,3,2,4,2))
dataframe
# var var2 var3 var4
#1 1 10 2 5
#2 1 9 3 5
#3 1 8 4 3
#4 2 7 5 2
#5 2 6 6 4
#6 2 5 7 2
funnames<-c(sum,mean,first)
colnums<-c(2,3,4)
ddply(.data = dataframe,.variables = "var",
function(x,funcs,inds){
mapply(function(func,ind){
func(x[,ind])
},funcs,inds)
},funnames,colnums)
# var V1 V2 V3
#1 1 27 3 5
#2 2 18 6 2
data is a data.frame containing the columns date, a, b, c, d. The last four are numeric.
Y.columns <- c("a")
X.columns <- c("b","c","d")
What I need:
data.mutated <- data %>%
mutate(Y = a, X = b+c+d) %>%
select(date,Y,X)
But I would like to pass the mutate arguments from a character vector.
I tried the following:
Y.string <- paste(Y.columns, collapse='+')
X.string <- paste(X.columns, collapse='+')
data.mutated <- data %>%
mutate(Y = UQ(Y.string), X = UQ(X.string)) %>%
select(date,Y,X)
But it didn't work. Any help is appreciated.
To use tidyeval with UQ, you need to first parse your expressions to a quosure with parse_quosure from rlang (Using mtcars as example, since OP's question is not reproducible):
Y.columns <- c("cyl")
X.columns <- c("disp","hp","drat")
Y.string <- paste(Y.columns, collapse='+')
X.string <- paste(X.columns, collapse='+')
library(dplyr)
library(rlang)
mtcars %>%
mutate(Y = UQ(parse_quosure(Y.string)),
X = UQ(parse_quosure(X.string))) %>%
select(Y,X)
or with !!:
mtcars %>%
mutate(Y = !!parse_quosure(Y.string),
X = !!parse_quosure(X.string)) %>%
select(Y,X)
Result:
Y X
1 6 273.90
2 6 273.90
3 4 204.85
4 6 371.08
5 8 538.15
6 6 332.76
7 8 608.21
8 4 212.39
9 4 239.72
10 6 294.52
...
Note:
mutate_ is now deprecated, so I think tidyeval with quosures and UQ is the new way to go.
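As a hedged alternative for more recent rlang and dplyr versions, the strings can be parsed with rlang::parse_expr() and unquoted with !!, or the parsing step can be avoided altogether by summing the selected columns with rowSums(across(all_of(...))). The sketch below uses mtcars as above and assumes dplyr 1.0.0 or later for across().

```r
library(dplyr)
library(rlang)

Y.columns <- c("cyl")
X.columns <- c("disp", "hp", "drat")

# Option 1: build "disp+hp+drat", parse it to an expression, unquote it
res <- mtcars %>%
  mutate(Y = !!parse_expr(paste(Y.columns, collapse = "+")),
         X = !!parse_expr(paste(X.columns, collapse = "+"))) %>%
  select(Y, X)

# Option 2: no string parsing; sum the named columns row by row
res2 <- mtcars %>%
  transmute(Y = rowSums(across(all_of(Y.columns))),
            X = rowSums(across(all_of(X.columns))))

head(res, 3)
```

The second option only covers sums, but it fails with a clear error if a column name is misspelled, whereas parsing strings will evaluate whatever expression the string happens to contain.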