Custom Function Mutate that mutates column - r

I'm having trouble with a function that I'm trying to write. I'd like to be able to change the column via the function so that I can do some mutations inside of a pipe. I'm afraid my R is a bit rusty, so I would appreciate some help.
# Create some fake data
dat <- data.frame(x = 1:4,
y = 5:8,
z = 9:12)
# Write the function
my_func <- function(data, col1, number){
x <- data %>%
# This is where I run into trouble
mutate(col4 = [[col1]] + number)
return(x)
}
# Try to call the function - gets error
df <- my_func(dat, x, 5)
I've tried all sorts of stuff with this, but nothing works or even gets close.

Use the curly-curly operator ({{ }}) when the column name is passed unquoted:
my_func <- function(data, col1, number){
x <- data %>%
mutate(col4 = {{col1}} + number)
return(x)
}
-testing
> my_func(dat, x, 5)
x y z col4
1 1 5 9 6
2 2 6 10 7
3 3 7 11 8
4 4 8 12 9
For more flexibility, i.e. to accept either a quoted or an unquoted column name, convert the argument to a symbol with rlang::ensym() and unquote it with !!:
my_func <- function(data, col1, number){
x <- data %>%
mutate(col4 = !! rlang::ensym(col1) + number)
return(x)
}
-testing
> my_func(dat, "x", 5)
x y z col4
1 1 5 9 6
2 2 6 10 7
3 3 7 11 8
4 4 8 12 9
> my_func(dat, x, 5)
x y z col4
1 1 5 9 6
2 2 6 10 7
3 3 7 11 8
4 4 8 12 9
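If the column name always arrives as a string, the .data pronoun is another option. This is a minimal sketch, assuming dplyr is attached; my_func_chr is a hypothetical name used only for illustration and is not part of the answer above.

```r
library(dplyr)

dat <- data.frame(x = 1:4, y = 5:8, z = 9:12)

# Hypothetical helper: col1 must be a character string naming a column
my_func_chr <- function(data, col1, number) {
  data %>%
    mutate(col4 = .data[[col1]] + number)
}

res <- my_func_chr(dat, "x", 5)
res$col4
# [1] 6 7 8 9
```

Unlike the ensym() version, this variant does not accept an unquoted name, which can be an advantage when the name comes from user input or a configuration file.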

Related

Perform set operations on duplicate items, in R (base R preferably) [duplicate]

I have 3 vectors
x <- c(1,3,5,7,3,8)
y <- c(3,5,7)
z <- c(3,3,8)
I want to find the elements of x that are not in y and not in z. Is there a function f that would give me the following output:
> f(x,y)
1 3 8
> f(x,z)
1 5 7
In other words, I want to find the "set difference" between 2 vectors, either of which may have repeated values. The functions %in%, match and setdiff do not work in this case for obvious reasons.
There should be some better ways to do this but here is one option
get_diff_vectors <- function(x, y) {
count_x <- table(x)
count_y <- table(y)
same_counts <- match(names(count_y), names(count_x))
count_x[same_counts] <- count_x[same_counts] - count_y
as.numeric(rep(names(count_x), count_x))
}
get_diff_vectors(x, y)
#[1] 1 3 8
get_diff_vectors(x, z)
#[1] 1 5 7
get_diff_vectors(x, c(5, 7))
#[1] 1 3 3 8
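We count the frequencies of x and y using table, match the values that occur in both, and subtract the counts of y from those of x. Finally, we recreate the remaining vector using rep. Note that the result comes back sorted by value rather than in the original order of x, and that, as written, the function assumes every value in y also occurs in x.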
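The counting idea can be made tolerant of values in y that never occur in x, or that occur more often in y than in x. The following is a sketch under those assumptions; multiset_diff is a hypothetical name, and like the function above it returns the result sorted by value.

```r
x <- c(1, 3, 5, 7, 3, 8)
y <- c(3, 5, 7)

# Hypothetical variant of the table-based approach
multiset_diff <- function(x, y) {
  count_x <- table(x)
  count_y <- table(y)
  # Work on a plain named integer vector of counts for x
  remaining <- as.integer(count_x)
  names(remaining) <- names(count_x)
  # Only values present in both vectors need adjusting
  shared <- intersect(names(count_x), names(count_y))
  left <- remaining[shared] - as.integer(count_y[shared])
  left[left < 0L] <- 0L  # y may contain a value more often than x
  remaining[shared] <- left
  as.numeric(rep(names(remaining), remaining))
}

multiset_diff(x, y)
# [1] 1 3 8
multiset_diff(x, c(5, 7, 9))
# [1] 1 3 3 8
```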
I still haven't found a better way, but here is a dplyr approach using similar logic.
library(dplyr)
get_diff_vectors_dplyr <- function(x, y) {
df1 <- data.frame(x) %>% count(x)
df2 <- data.frame(y) %>% count(y)
final <- left_join(df1, df2, by = c("x" = "y")) %>%
mutate_at(c("n.x", "n.y"), ~ replace(., is.na(.), 0)) %>%
mutate(n = n.x - n.y)
rep(final$x, final$n)
}
get_diff_vectors_dplyr(x, y)
#[1] 1 3 8
get_diff_vectors_dplyr(x, z)
#[1] 1 5 7
get_diff_vectors_dplyr(x, c(5, 7))
#[1] 1 3 3 8
The vecsets package mentioned by OP has function vsetdiff which does this very easily
vecsets::vsetdiff(x, y)
#[1] 1 3 8
vecsets::vsetdiff(x, z)
#[1] 1 5 7
vecsets::vsetdiff(x, c(5, 7))
#[1] 1 3 3 8
Here's an attempt using make.unique to account for duplicates:
dupdiff <- function(x,y) x[-match(
make.unique(as.character(y)),
make.unique(as.character(x)),
nomatch=0
)]
Testing:
dupdiff(x,y)
#[1] 1 3 8
dupdiff(x,z)
#[1] 1 5 7
dupdiff(x, c(5, 7))
#[1] 1 3 3 8
dupdiff(x, c(5, 7, 9))
#[1] 1 3 3 8
match with a little for-loop does work:
> f(x, y)
[1] 1 3 8
> f(x, z)
[1] 1 5 7
Code
f <- function(s, r) {
  # Walk through s; each match consumes one occurrence in r
  for (i in seq_along(s)) {
    j <- match(s[i], r)
    if (!is.na(j)) {
      s[i] <- NA
      r[j] <- NA
    }
  }
  # Return the elements of s that were not matched
  s[!is.na(s)]
}
There is also the newer waldo package from the tidyverse team.
It gives concise overviews of the differences between objects, not only vectors:
library(waldo)
compare(x, y)
#> `old`: 1 3 5 7 3 8
#> `new`: 3 5 7
compare(x, z)
#> `old`: 1 3 5 7 3 8
#> `new`: 3 3 8

case_when using variable name to change data value

I have the following dataframe:
df <- data.frame(var1_lag0 = c(1,2,3,4,5,6)
, var1_lag1 = c(0,1,2,3,4,5)
, var2_lag0 = c(34,5,45,7,2,1)
, var2_lag2 = c(0,0,34,5,45,7)
)
I want to change specific values in each column using the following logic:
If the variable name contains "_lag1", the first element of the column has to become NA
If the variable name contains "_lag2", the first and second elements of the column have to become NA
Otherwise the column remains as it is
The expected result should look like this:
df_new <- data.frame(var1_lag0 = c(1,2,3,4,5,6)
, var1_lag1 = c(NA,1,2,3,4,5)
, var2_lag0 = c(34,5,45,7,2,1)
, var2_lag2 = c(NA,NA,34,5,45,7)
)
Since you have the original unlagged variables in your df, you could simply recompute the lagged values, e.g. with dplyr::lag, which by default fills the leading positions with NA:
df <- data.frame(var1_lag0 = c(1,2,3,4,5,6)
, var1_lag1 = c(0,1,2,3,4,5)
, var2_lag0 = c(34,5,45,7,2,1)
, var2_lag2 = c(0,0,34,5,45,7)
)
library(dplyr)
df %>% mutate(var1_lag1 = dplyr::lag(var1_lag0, n = 1), var2_lag2 = dplyr::lag(var2_lag0, n = 2))
#> var1_lag0 var1_lag1 var2_lag0 var2_lag2
#> 1 1 NA 34 NA
#> 2 2 1 5 NA
#> 3 3 2 45 34
#> 4 4 3 7 5
#> 5 5 4 2 45
#> 6 6 5 1 7
A base R solution might look like this:
df <- data.frame(var1_lag0 = c(1,2,3,4,5,6)
, var1_lag1 = c(0,1,2,3,4,5)
, var2_lag0 = c(34,5,45,7,2,1)
, var2_lag2 = c(0,0,34,5,45,7)
)
df_new <- df
df_new[1 , grep(pattern="_lag1", colnames(df))] <- NA
df_new[c(1,2) , grep(pattern="_lag2", colnames(df))] <- NA
df_new
#> var1_lag0 var1_lag1 var2_lag0 var2_lag2
#> 1 1 NA 34 NA
#> 2 2 1 5 NA
#> 3 3 2 45 34
#> 4 4 3 7 5
#> 5 5 4 2 45
#> 6 6 5 1 7
Created on 2021-01-06 by the reprex package (v0.3.0)
Here is a for loop that checks the column names of the df for the key words "_lag1" and "_lag2" and turns the corresponding values to NA.
for (i in seq_along(df)) {
  if (grepl("_lag1", colnames(df)[i])) {
    df[1, i] <- NA
  } else if (grepl("_lag2", colnames(df)[i])) {
    df[1:2, i] <- NA
  }
}
You can try to wrap a case_when inside a helper function and use mutate_at with contains to get the proper columns.
df %>%
mutate_at(vars(contains("lag1")),
function(x, lag) fix(x, "lag1")) %>%
mutate_at(vars(contains("lag2")),
function(x, lag) fix(x, "lag2"))
Which produces
var1_lag0 var1_lag1 var2_lag0 var2_lag2
1 1 NA 34 NA
2 2 1 5 NA
3 3 2 45 34
4 4 3 7 5
5 5 4 2 45
6 6 5 1 7
Here is the helper function, called fix. It must be defined before the pipeline above is run, and note that the name masks utils::fix:
fix <- function(x, lag){
real_lag <- case_when(stringr::str_detect("lag1", lag) ~ 1,
stringr::str_detect("lag2", lag) ~ 2)
x[1:real_lag] <- NA
return(x)
}
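The same rule can also be written once for any lag by reading the number out of the column name. This is a sketch that assumes dplyr 1.0.0 or later (for across() and cur_column()) and assumes every relevant column name ends in "_lag" followed by digits; df_out is a name introduced here for illustration.

```r
library(dplyr)

df <- data.frame(var1_lag0 = c(1, 2, 3, 4, 5, 6),
                 var1_lag1 = c(0, 1, 2, 3, 4, 5),
                 var2_lag0 = c(34, 5, 45, 7, 2, 1),
                 var2_lag2 = c(0, 0, 34, 5, 45, 7))

df_out <- df %>%
  mutate(across(matches("_lag[0-9]+$"), ~ {
    # Extract the lag from the current column name, e.g. "var2_lag2" -> 2
    n <- as.integer(sub(".*_lag", "", cur_column()))
    # Blank out the first n elements; n = 0 leaves the column untouched
    replace(.x, seq_len(n), NA)
  }))
```

Because the lag is parsed from the name, columns such as a hypothetical var3_lag3 would be handled without adding another branch.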

Mutate dataframes in a nested list without for loop

I have a list of dataframes (and parameters for sensitivity analyses for a study), and I want to mutate each dataframe in the same way. The expected output is generated by the code below (a new column x2). Is there a way to assign the resulting dataframes (newdfs) to the list without using a for loop?
models <- list(m1 = list('params' = list('start'='2014-01-01'),
'data' = data.frame(y=c(1,2,3), x1=c(4,5,6))),
m2 = list('params' = list('start'='2017-01-01'),
'data' = data.frame(y=c(1,2,3), x1=c(7,8,9))))
newdfs <- lapply(models, function(z) {z$data$x2 <- z$data$x1 + 1
z$data})
# Can I do this without "for"?
for(x in 1:length(models)) models[[x]]$data <- newdfs[[x]]
You can try this:
newdfs <- lapply(models, function(z) {z$data$x2 <- z$data$x1 + 1
return(z)})
$m1
$m1$params
$m1$params$start
[1] "2014-01-01"
$m1$data
y x1 x2
1 1 4 5
2 2 5 6
3 3 6 7
$m2
$m2$params
$m2$params$start
[1] "2017-01-01"
$m2$data
y x1 x2
1 1 7 8
2 2 8 9
3 3 9 10
Revise the function in lapply() to return z instead of z$data:
lapply(models, function(z) {z$data$x2 <- z$data$x1 + 1 ; z})
To make this question complete, here are two purrr solutions:
library(purrr)
map() + map_at()
map(models, map_at, "data", transform, x2 = x1 + 1)
transpose() + map()
models %>%
transpose %>%
`[[<-`(., "data", map(.$data, transform, x2 = x1 + 1)) %>%
transpose
Output
$m1
$m1$params
$m1$params$start
[1] "2014-01-01"
$m1$data
y x1 x2
1 1 4 5
2 2 5 6
3 3 6 7
$m2
$m2$params
$m2$params$start
[1] "2017-01-01"
$m2$data
y x1 x2
1 1 7 8
2 2 8 9
3 3 9 10

Getting NULL values for a multiple column IF statement passed to MAPPLY

I have a data frame of data:
df <- data.frame(x = c(11, 3, 2, 7, 9, 4, 6, 1, 6, 7),
y = c(rep("a",5), rep("b",5)))
df
x y
1 11 a
2 3 a
3 2 a
4 7 a
5 9 a
6 4 b
7 6 b
8 1 b
9 6 b
10 7 b
What I'm trying to do is an IF statement on both columns x and y, where it assigns a new value (z) based on meeting the criteria of x and y.
myfun <- function(x,y) {
if(x < 3 & y=="a") z <- 1
if(x>=3 & x <=7 & y=="a") z <- 2
if(x>7 & y=="a") z <- 3
if(x<3 & y=="b") z <-4
if(x>=3 & x<=1 & y=="b") z <-5
if(x>7 & y=="b") z<-6
}
I am trying to get the following result based on that logic above:
df
x y z
1 11 a 3
2 3 a 2
3 2 a 1
4 7 a 2
5 9 a 3
6 4 b 5
7 6 b 5
8 1 b 4
9 6 b 5
10 7 b 5
df$z <- mapply(myfun, df$x, df$x)
This results in:
x y z
1 11 a NULL
2 3 a NULL
3 2 a NULL
4 7 a NULL
5 9 a NULL
6 4 b NULL
7 6 b NULL
8 1 b NULL
9 6 b NULL
10 7 b NULL
I have no idea why. Can someone explain where I am going wrong?
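The function never returns z. An R function returns the value of its last evaluated expression, which here is the final if(); when that condition is FALSE, an if() without an else yields NULL, so every call returns NULL. The call also passes df$x twice instead of df$x and df$y. Beyond that, if() is meant for a single TRUE/FALSE condition such as if(switch=="on"), not for vectors (or columns). For whole columns, use the vectorized ifelse() function. Your first three conditions would become: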
myfun <- function(df) {
df$z <- with(df, ifelse(x < 3 & y=="a",1,NA))
df$z <- with(df, ifelse(x>=3 & x <=7 & y=="a",2,df$z))
df$z <- with(df, ifelse(x>7 & y=="a",3,df$z))
...
}
edit: and using df$x and df$y in the function call is probably not necessary. "result <- myfun(df)" would be enough unless you want x and y to be different.
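For completeness, here is a sketch of both routes that reproduces the expected z column. It assumes the middle band for "b" was meant to be x >= 3 & x <= 7 (the x <= 1 in the question can never be true together with x >= 3), and it leans on the fact that the "b" codes are the "a" codes plus 3. myfun_scalar and z_vec are names introduced here for illustration.

```r
df <- data.frame(x = c(11, 3, 2, 7, 9, 4, 6, 1, 6, 7),
                 y = c(rep("a", 5), rep("b", 5)),
                 stringsAsFactors = FALSE)

# Scalar version for mapply(): the if/else chains always produce a value,
# and the last expression is what the function returns
myfun_scalar <- function(x, y) {
  band <- if (x < 3) 1 else if (x <= 7) 2 else 3
  offset <- if (y == "a") 0 else 3
  band + offset
}
df$z <- mapply(myfun_scalar, df$x, df$y)  # note: df$x and df$y

# Vectorized version with ifelse(), no mapply() needed
z_vec <- with(df, ifelse(x < 3, 1, ifelse(x <= 7, 2, 3)) +
                  ifelse(y == "a", 0, 3))

df$z
# [1] 3 2 1 2 3 5 5 4 5 5
```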

Why is using list() critical for .dots = setNames() uses in dplyr?

I am calling mutate using dynamic variable names. An example that mostly works is:
library(dplyr)
library(lazyeval)  # provides interp()
df <- data.frame(a = 1:5, b = 1:5)
func <- function(a,b){
return(a+b)
}
var1 = 'a'
var2 = 'b'
expr <- interp(~func(x, y), x = as.name(var1), y = as.name(var2))
new_name <- "dynamically_created_name"
temp <- df %>% mutate_(.dots = setNames(expr, nm = new_name))
Which produces
temp
a b func(a, b)
1 1 1 2
2 2 2 4
3 3 3 6
4 4 4 8
5 5 5 10
This is mostly fine, except that setNames ignored the nm argument. This is solved by wrapping my expression in list():
temp <- df %>% mutate_(.dots = setNames(list(expr), nm = new_name))
temp
a b dynamically_created_name
1 1 1 2
2 2 2 4
3 3 3 6
4 4 4 8
5 5 5 10
My question is: why is setNames ignoring its nm argument in the first place, and how does list() solve this problem?
As noted in the other answer, the .dots argument is assumed to be a list, and setNames is a convenient way to rename elements in a list.
What is the .dots argument doing? Let's first think about the actual dots ... argument. It is a series of expressions to be evaluated. Below the dots ... are the two named expressions c = ~ a * scale1 and d = ~ a * scale2.
scale1 <- -1
scale2 <- -2
df %>%
mutate_(c = ~ a * scale1, d = ~ a * scale2)
#> a b c d
#> 1 1 1 -1 -2
#> 2 2 2 -2 -4
#> 3 3 3 -3 -6
#> 4 4 4 -4 -8
#> 5 5 5 -5 -10
We could just bundle those expressions together beforehand in a list. That's where .dots comes in. That parameter lets us tell mutate_ to evaluate the expressions in the list.
bundled <- list(
c2 = ~ a * scale1,
d2 = ~ a * scale2
)
df %>%
mutate_(.dots = bundled)
#> a b c2 d2
#> 1 1 1 -1 -2
#> 2 2 2 -2 -4
#> 3 3 3 -3 -6
#> 4 4 4 -4 -8
#> 5 5 5 -5 -10
If we want to programmatically update the names of the expressions in the list, then setNames is a convenient way to do that. If we want to programmatically mix and match constants and variable names when making expressions, then the lazyeval package provides convenient ways to do that. Below I do both to create a list of expressions, name them, and evaluate them with mutate_
# Imagine some dropdown boxes in a Shiny app, and this is what user requested
selected_func1 <- "min"
selected_func2 <- "max"
selected_var1 <- "a"
selected_var2 <- "b"
# Assemble expressions from those choices
bundled2 <- list(
interp(~fun(x), fun = as.name(selected_func1), x = as.name(selected_var1)),
interp(~fun(x), fun = as.name(selected_func2), x = as.name(selected_var2))
)
bundled2
#> [[1]]
#> ~min(a)
#>
#> [[2]]
#> ~max(b)
# Create variable names
exp_name1 <- paste0(selected_func1, "_", selected_var1)
exp_name2 <- paste0(selected_func2, "_", selected_var2)
bundled2 <- setNames(bundled2, c(exp_name1, exp_name2))
bundled2
#> $min_a
#> ~min(a)
#>
#> $max_b
#> ~max(b)
# Evaluate the expressions
df %>%
mutate_(.dots = bundled2)
#> a b min_a max_b
#> 1 1 1 1 5
#> 2 2 2 1 5
#> 3 3 3 1 5
#> 4 4 4 1 5
#> 5 5 5 1 5
From vignette("nse"):
If you also want to output variables to vary, you need to pass a list of quoted objects to the .dots argument
So perhaps the reason why
temp <- df %>% mutate_(.dots = setNames(expr, nm = new_name))
doesn't do what you want is that, while you successfully set the names attribute here, expr is still a formula, not a list:
foo <- setNames(expr, nm = new_name)
names(foo) #"dynamically_created_name" ""
class(foo) #"formula"
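The difference between the bare formula and the wrapped one can be checked in base R alone. This is a small sketch that uses a plain one-sided formula in place of the interp() result; wrapped is a name introduced here for illustration.

```r
new_name <- "dynamically_created_name"

# A plain one-sided formula standing in for the interp() result
expr <- ~ func(a, b)

class(expr)    # "formula"
is.list(expr)  # FALSE
length(expr)   # 2: the `~` call itself and its right-hand side

# Wrapping it in list() gives one element, which setNames() can name
wrapped <- setNames(list(expr), new_name)
length(wrapped)  # 1
names(wrapped)   # "dynamically_created_name"
```

Naming the formula directly therefore labels its internal pieces, whereas naming the list labels the expression as a whole.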
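So if you make it a list, it works as expected. Note that writing list(new_name = expr) names the column literally new_name; use setNames(list(expr), new_name) to apply the value stored in new_name: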
expr <- interp(~func(x, y), x = as.name(var1),
y = as.name(var2))
df %>% mutate_(.dots = list(new_name = expr))
a b new_name
1 1 1 2
2 2 2 4
3 3 3 6
4 4 4 8
5 5 5 10
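In current dplyr, the underscore verbs and lazyeval are no longer the recommended route. A sketch of the same dynamic naming with tidy evaluation follows, assuming a dplyr version that supports the := operator and the .data pronoun; it is an alternative technique, not what the answers above describe.

```r
library(dplyr)

df <- data.frame(a = 1:5, b = 1:5)
func <- function(a, b) a + b

var1 <- "a"
var2 <- "b"
new_name <- "dynamically_created_name"

# `!!new_name :=` sets the output column name from a string;
# .data[[...]] looks up the input columns by name
temp <- df %>%
  mutate(!!new_name := func(.data[[var1]], .data[[var2]]))

names(temp)
# [1] "a" "b" "dynamically_created_name"
```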
