As part of a much larger project, I am trying to create a new column in a data.frame called "unique_id" based on the interaction of user-specified variables. In this use case, the number of variables needed and their names will vary quite a bit with each user, so this flexibility is important. Some data.frames, such as the toy one in my example, will even come with a "unique id" variable, but this is quite rare. I included it in my example to be clear about what my desired output is.
Consider this toy data.frame:
mini_df <- data.frame(
lat = c(41.23,37.37,41.23,39.01,32.00),
lon = c(-120.79,-120.68,-120.79,-119.13,-120.00),
station_id = c(300,527,300,228,72)
)
Outside of a proper function, it is quite easy to do something like this:
out_of_function_test_df <- mini_df %>%
mutate(id = interaction(lat, lon))
Which produces what I want, namely:
lat lon station_id id
1 41.23 -120.79 300 41.23.-120.79
2 37.37 -120.68 527 37.37.-120.68
3 41.23 -120.79 300 41.23.-120.79
4 39.01 -119.13 228 39.01.-119.13
5 32.00 -120.00 72 32.-120
I need this to work within a function in which the user specifies the interacting variables.
I have read many stack exchange posts which approach similar problems with some important differences to mine. The other questions address verbs other than mutate, attempt to apply different functions, or do not address the issue of multiple user-specified variables.
After reading these, trying many things, and reading this, the best I can come up with is the following:
create_unique_id <- function(df,
metadata_coords,
unique_id_coords) {
df <- df %>%
mutate_(id = interp(~interaction(args), # causes error with or without tilde
args = c("list", lapply(unique_id_coords, as.name))))
return(df)
}
This produces an error:
Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
Here is the full traceback, if it is helpful:
22. unique.default(x, nmax = nmax)
21. unique(x, nmax = nmax)
20. factor(x)
19. as.factor(args[[i]])
18. interaction(list("list", lat, lon))
17. mutate_impl(.data, dots, caller_env())
16. mutate.tbl_df(tbl_df(.data), ...)
15. mutate(tbl_df(.data), ...)
14. as.data.frame(mutate(tbl_df(.data), ...))
13. mutate.data.frame(.data, !!!dots)
12. mutate(.data, !!!dots)
11. mutate_.data.frame(., id = interp(~interaction(args), args = c("list",
    lapply(unique_id_coords, as.name))))
10. mutate_(., id = interp(~interaction(args), args = c("list", lapply(unique_id_coords,
    as.name))))
9. function_list[[k]](value)
8. withVisible(function_list[[k]](value))
7. freduce(value, `_function_list`)
6. `_fseq`(`_lhs`)
5. eval(quote(`_fseq`(`_lhs`)), env, env)
4. eval(quote(`_fseq`(`_lhs`)), env, env)
3. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
2. df %>% mutate_(id = interp(~interaction(args), args = c("list",
    lapply(unique_id_coords, as.name))))
1. create_unique_id(df = mini_df, metadata_coords = c("lat", "lon",
    "station_id"), unique_id_coords = c("lat", "lon"))
I do not have nearly enough background knowledge for this traceback to be helpful to me. I am confused because it seems that the issue is deep within the interaction() function. interaction() calls unique() along the way (which makes sense), but unique() is what ends up failing. Somehow, the call to interaction() within my function behaves differently from the call outside the function, but I am not sure how.
Does this application of tidyr::unite() help?
create_unique_id <- function(df,
unique_id_coords) {
df <- df %>%
tidyr::unite("id", {{ unique_id_coords }}, remove = FALSE)
return(df)
}
Result
create_unique_id(mtcars, unique_id_coords = c(wt, qsec)) %>% head()
mpg cyl disp hp drat id wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.62_16.46 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875_17.02 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.32_18.61 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215_19.44 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.44_17.02 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.46_20.22 3.460 20.22 1 0 3 1
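Here is another option using the ellipsis (...):
library(dplyr)
library(rlang)
create_unique_id <- function(df, ...){
  # capture the bare column names passed via ... as symbols and splice them into paste()
  df %>%
    mutate(id = paste(!!! ensyms(...), sep = "_"))
}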
Output
create_unique_id(mtcars, cyl, hp, vs) %>% head()
mpg cyl disp hp drat wt qsec vs am gear carb id
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 6_110_0
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 6_110_0
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 4_93_1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 6_110_1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 8_175_0
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 6_105_1
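If the column names arrive as strings, as in the question's unique_id_coords = c("lat", "lon"), the same splicing idea can be combined with interaction() to reproduce the output the question asks for. This is only a minimal sketch, assuming dplyr and rlang are installed; the function name create_unique_id_chr is hypothetical and used purely for illustration.

```r
library(dplyr)

# Toy data from the question
mini_df <- data.frame(
  lat = c(41.23, 37.37, 41.23, 39.01, 32.00),
  lon = c(-120.79, -120.68, -120.79, -119.13, -120.00),
  station_id = c(300, 527, 300, 228, 72)
)

# Hypothetical helper: unique_id_coords is a character vector of column names
create_unique_id_chr <- function(df, unique_id_coords) {
  # turn each string into a symbol, then splice the symbols in as
  # separate arguments, i.e. interaction(lat, lon)
  cols <- rlang::syms(unique_id_coords)
  df %>%
    mutate(id = interaction(!!!cols, drop = TRUE))
}

out <- create_unique_id_chr(mini_df, c("lat", "lon"))
```

Note that id is a factor here; wrapping the call in as.character() would give a plain string column instead.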
I use across() and I want to put NA where the computation fails. I tried to use tryCatch() but can't make it work in my case, whereas there are situations where it works.
This works:
library(dplyr)
head(mtcars) %>%
mutate(
across(
all_of("drat"),
function(x) tryCatch(blabla, error = function(e) NA) # create an intentional error for the example
)
)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 NA 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 NA 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 NA 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 NA 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 NA 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 NA 3.460 20.22 1 0 3 1
But this doesn't:
library(dplyr)
head(mtcars) %>%
mutate(
across(
all_of("drat"),
function(x) tryCatch(x[which(mpg == 10000)], error = function(e) NA) # create an intentional error for the example
)
)
#> Error in `mutate()`:
#> ! Problem while computing `..1 = across(...)`.
#> Caused by error in `across()`:
#> ! Problem while computing column `drat`.
Created on 2022-07-07 by the reprex package (v2.0.1)
I thought tryCatch() was supposed to catch any error. Why doesn't it work in the second situation? How to fix it?
Note: I need to use across() in my real situation (even if it's not truly needed in the examples)
The problem isn't the tryCatch because the code you run doesn't trigger an error. Basically you are running
foo <- function(x) tryCatch(x[which(mtcars$mpg == 10000)], error = function(e) NA)
foo(mtcars$drat)
# numeric(0)
Notice that no error is triggered: the expression simply returns numeric(0), because indexing with an empty set of positions is valid R. The problem is that, inside mutate(), the function must return a value of length 1 or of the same length as the column. So the error happens after your tryCatch() code has run, when dplyr tries to assign the zero-length value back into the data.frame. You will need to handle the case where no values are found separately. Perhaps
head(mtcars) %>%
mutate(
across(
all_of("drat"),
function(x) {
matches <- mpg == 10000
if (any(matches)) x[which(matches)] else NA
}
)
)
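The point that tryCatch() never sees an error can be checked outside of dplyr. The sketch below uses only base R; the helper name na_if_empty is hypothetical and is just one way of turning a zero-length result into NA.

```r
drat <- mtcars$drat

# No row has mpg == 10000, so which() returns integer(0) ...
idx <- which(mtcars$mpg == 10000)

# ... and subsetting with integer(0) is valid, so the error handler never runs
res <- tryCatch(drat[idx], error = function(e) NA)
length(res)

# Hypothetical helper: replace a zero-length result with NA
na_if_empty <- function(v) if (length(v) == 0) NA else v
fixed <- na_if_empty(res)
```

A check on length() like this one could be used inside the function passed to across() in place of tryCatch().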
It looks like you just need to reference mpg with x:
library(dplyr)
head(mtcars) %>%
mutate(
across(
all_of("drat"),
function(x) tryCatch(x[which(x$mpg == 10000)], error = function(e) NA) # create an intentional error for the example
)
)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 NA 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 NA 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 NA 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 NA 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 NA 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 NA 3.460 20.22 1 0 3 1
I have downloaded an .ods file from this website (UK office for national statistics). Because of the way the sheet is structured, I import it as two separate dataframes:
library(readODS)
income_pretax <- read_ods('/Users/c.robin/Downloads/NS_Table_3_1a_1819.ods', range = "A4:U103")
income_posttax <- read_ods('/Users/c.robin/Downloads/NS_Table_3_1a_1819.ods', range = "A104:U203")
I want to do some cleaning on both dataframes: changing the names of two of the variables and recasting one of the variables as numeric. This is what I have so far, which works on a single df:
income_pretax <- income_pretax %>%
rename(pp_tot_income_pretax = 'Percentile point\nTotal income before tax',
'2008-09' = '2008-09(a)')
income_pretax['2008-09'] <- as.numeric(income_pretax$'2008-09')
I'm struggling to get the above into a function though. I think it should be something like the below, but honestly I have no idea how to tell R that I'm passing multiple dataframes to the function, nor how to handle multiple variables. Can anyone advise on this?
##Attempting a function
cleanvars <- function(data, varlist){
data <- data %>%
rename(pp_tot_income_pretax = {{varlist}})
data['2008-09'] <- as.numeric(data$'2008-09')
}
You can pass a named vector to the function.
library(dplyr)
cleanvars <- function(data, varlist){
  # varlist is a named character vector: c(new_name = "old_name")
  data %>% rename(all_of(varlist))
}
cleanvars(mtcars %>% head, c('new_mpg' = 'mpg', 'new_cyl' = 'cyl'))
# new_mpg new_cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
We can do this in base R
nm1 <- c('mpg', 'cyl')
nm2 <- paste0("new_", nm1)
i1 <- match(nm1, names(mtcars))
names(mtcars)[i1] <- nm2
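To cover the "multiple dataframes" part of the question, one option is to keep the data frames in a list and apply the same cleaning function to each element. The following is a sketch under stated assumptions: the two small data frames are hypothetical stand-ins for the imported sheets, and clean_vars, renames and numeric_cols are illustrative names, not part of the original code.

```r
library(dplyr)

# Hypothetical stand-ins for the two imported sheets
income_pretax <- data.frame(old_name = 1:2, `2008-09(a)` = c("10", "20"),
                            check.names = FALSE, stringsAsFactors = FALSE)
income_posttax <- data.frame(old_name = 3:4, `2008-09(a)` = c("30", "40"),
                             check.names = FALSE, stringsAsFactors = FALSE)

# Rename via a named vector, then convert the listed columns to numeric
clean_vars <- function(data, varlist, numeric_cols) {
  data %>%
    rename(all_of(varlist)) %>%
    mutate(across(all_of(numeric_cols), as.numeric))
}

renames <- c(pp_tot_income = "old_name", "2008-09" = "2008-09(a)")

# Apply the same cleaning to every data frame in the list
cleaned <- lapply(
  list(pretax = income_pretax, posttax = income_posttax),
  clean_vars, varlist = renames, numeric_cols = "2008-09"
)
```

The cleaned data frames are then available as cleaned$pretax and cleaned$posttax.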
I have a dataframe with multiple columns of a numeric type, where I want to query if a range of values exist in any of them, and bring back a true/false binary flag with as.numeric.
So I can do this the long way with:
df <- df %>%
  mutate(flag = as.numeric(days_dry %in% c(1:28) |
                           days_frozen %in% c(1:28) |
                           days_fresh %in% c(1:28)))
But I have a bunch of columns I want to query. Why can't I bring back the same result with this?:
df <- df %>%
  mutate(flag = as.numeric(vars(starts_with("days_")) %in% c(1:28)))
I get no error, but it doesn't bring back any cases which match the criteria.
There might be a better way, but ...
mtcars %>%
mutate(flag = rowSums(sapply(cbind(select(., starts_with("c"))), `%in%`, 4:6)) > 0) %>%
head()
# mpg cyl disp hp drat wt qsec vs am gear carb flag
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 TRUE
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 TRUE
# Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 TRUE
# Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 TRUE
# Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 FALSE
# Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 TRUE
The premise is using cbind(select(., <>)) to form a mid-pipe inner frame. From there, we sapply over its columns, converting them to columns of logicals. The last step is using rowSums(.) > 0 to determine if a row has at least one TRUE; an alternative to rowSums can use Reduce(`|`, ...), but while that is elegant in a list-processing kind of way, it is also slower (especially with multiple matching columns).
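As for why the vars() attempt returns no matches: vars() produces a description of the selected columns (a list of quosures) rather than their values, so %in% never compares the actual data against 1:28. With a recent dplyr (if_any() was added in 1.0.4), the row-wise "any column matches" test can be written more directly. This is a sketch on a small made-up data frame, since the original df is not shown.

```r
library(dplyr)

# Hypothetical data with the column names from the question
df <- data.frame(
  days_dry    = c(0, 5, 40),
  days_frozen = c(30, 0, 50),
  days_fresh  = c(0, 0, 28)
)

# TRUE for a row if any days_* column has a value in 1:28
flagged <- df %>%
  mutate(flag = as.numeric(if_any(starts_with("days_"), ~ .x %in% 1:28)))
```

Here the first row gets 0 (no value falls in 1:28) and the other two rows get 1.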
I am trying to create a copy of a column based on a variable - that is, the new column's name is constant, but which one it copies changes. This is what I would do previously:
library(dplyr)
x <- "mpg"
mtcars %>%
mutate_(Target = x)
To receive results like this:
However, when you run this, you now receive a warning:
Warning message:
mutate_() is deprecated.
Please use mutate() instead
It suggests looking at https://tidyeval.tidyverse.org/ for guidance; I've had a quick skim, but didn't spot this as a use case in the document. (It doesn't seem to cover the problem of converting existing code, but maybe I'm just not understanding it well enough?)
How do I move this code from mutate_() to mutate()?
You need to adhere to dplyr's non-standard evaluation
mtcars %>% mutate(Target = !!sym(x))
# mpg cyl disp hp drat wt qsec vs am gear carb Target
#1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 21.0
#2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 21.0
#3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 22.8
#4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 21.4
#5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 18.7
#6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 18.1
...
Here sym takes a string as input and turns it into a symbol, which you then unquote using the bang-bang operator !!.
Also note that mutate_ has been deprecated.
We can use mutate_at, which can also be used for multiple columns
library(dplyr)
mtcars %>%
  mutate_at(vars(x), list(Target = ~ .x))
You could use rlang::sym or base R get
library(dplyr)
mtcars %>% mutate(Target = !!rlang::sym(x))
mtcars %>% mutate(Target = get(x))
You can also do it the basic way, like this:
x <- mtcars$mpg
mtcars$Target <- x
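Another way to move from mutate_() to mutate() when the column name is held in a string is the .data pronoun, which looks the column up by name. A minimal sketch, assuming dplyr is attached:

```r
library(dplyr)

x <- "mpg"

# .data[[x]] fetches the column whose name is stored in x
out <- mtcars %>%
  mutate(Target = .data[[x]])
```

This avoids converting the string to a symbol by hand, and it cannot accidentally pick up an object named mpg from the calling environment the way get(x) can.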
I'm trying to create a user-defined function which has as one output a network object that is named similarly to the input dataframe used in the function. Something like this.
node_attributes <- function(i){ #i is dataframe
j <- network(i)
##some other function stuff##
(i,'network',sep = '_')) <- j
}
The idea is to add '_network' onto the name of the i variable, which is meant to be a dataframe. So if my original dataframe is foo_bar_data, my output would be: foo_bar_data_network.
It is possible to get the name of input variables with deparse(substitute(argname)).
func <- function(x){
  # substitute() captures the unevaluated argument; deparse() turns it into a string
  deparse(substitute(x))
}
func(some_object)
## [1] "some_object"
I am not completely sure how you want to use the name of the input, so I used something similar to the answer of @JackStat:
node_attributes <- function(i){
output_name <- paste(deparse(substitute(i)), 'network', sep = '_')
## I simplified this since I don't know what the function network is
j <- i
assign(output_name, j, envir = parent.frame())
}
node_attributes(mtcars)
head(mtcars_network)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
That said I don't really see any reason to code like this. Normally, returning the output from the function is the recommended way.
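A sketch of the "return the output" style, for comparison. Here build_network is a hypothetical stand-in for network(), whose behaviour is not shown in the question; the naming is left to the caller, either directly or through a named list.

```r
# Hypothetical stand-in for network(): any function that builds the result
build_network <- function(df) df

node_attributes <- function(i) {
  j <- build_network(i)
  j  # return the result instead of assigning it in the caller's environment
}

# The caller chooses the name ...
mtcars_network <- node_attributes(mtcars)

# ... or keeps several results together in a named list
inputs <- list(mtcars = mtcars, iris = iris)
networks <- lapply(inputs, node_attributes)
names(networks) <- paste(names(inputs), "network", sep = "_")
```

Keeping the results in a named list makes it easy to loop over them later without relying on objects created as a side effect.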
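You can use assign:
j <- network(i)
# build the new name from the argument's name, then create the object in the caller's environment
assign(paste(deparse(substitute(i)), 'network', sep = '_'), j, envir = parent.frame())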