I would like to combine two data frames using crossing(), but some of their columns share the same names. To tell them apart, I would like to append "_nameofdataframe" to those columns. Here are some reproducible data frames (dput below):
> df1
person V1 V2 V3
1 A 1 3 3
2 B 4 4 5
3 C 2 1 1
> df2
V2 V3
1 2 5
2 1 6
3 1 2
When I run the following code it will return duplicated column names:
library(tidyr)
crossing(df1, df2, .name_repair = "minimal")
#> # A tibble: 9 × 6
#> person V1 V2 V3 V2 V3
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 1 3 3 1 2
#> 2 A 1 3 3 1 6
#> 3 A 1 3 3 2 5
#> 4 B 4 4 5 1 2
#> 5 B 4 4 5 1 6
#> 6 B 4 4 5 2 5
#> 7 C 2 1 1 1 2
#> 8 C 2 1 1 1 6
#> 9 C 2 1 1 2 5
As you can see, the duplicated column names are returned unchanged. My desired output should look like this:
person V1 V2_df1 V3_df1 V2_df2 V3_df2
1 A 1 3 3 1 2
2 A 1 3 3 1 6
3 A 1 3 3 2 5
4 B 4 4 5 1 2
5 B 4 4 5 1 6
6 B 4 4 5 2 5
7 C 2 1 1 1 2
8 C 2 1 1 1 6
9 C 2 1 1 2 5
So I was wondering if anyone knows a more automatic way to rename the duplicated columns as in the desired output above when using crossing()?
dput of df1 and df2:
df1 <- structure(list(person = c("A", "B", "C"), V1 = c(1, 4, 2), V2 = c(3,
4, 1), V3 = c(3, 5, 1)), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(V2 = c(2, 1, 1), V3 = c(5, 6, 2)), class = "data.frame", row.names = c(NA,
-3L))
As you probably know, the .name_repair parameter can take a function. The problem is crossing() only passes that function one argument, a vector of the concatenated column names() of both data frames. So we can't easily pass the names of the data frame objects to it. It seems to me that there are two solutions:
Manually add the desired suffix to an anonymous function.
Create a wrapper function around crossing().
1. Manually add the desired suffix to an anonymous function
We can simply supply the suffixes as a character vector in a default argument of the anonymous function passed to .name_repair, e.g. suffix = c("_df1", "_df2").
crossing(
df1,
df2,
.name_repair = \(x, suffix = c("_df1", "_df2")) {
names_to_repair <- names(which(table(x) == 2))
x[x %in% names_to_repair] <- paste0(
x[x %in% names_to_repair],
rep(
suffix,
each = length(unique(names_to_repair))
)
)
x
}
)
# person V1 V2_df1 V3_df1 V2_df2 V3_df2
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 1 3 3 1 2
# 2 A 1 3 3 1 6
# 3 A 1 3 3 2 5
# 4 B 4 4 5 1 2
# 5 B 4 4 5 1 6
# 6 B 4 4 5 2 5
# 7 C 2 1 1 1 2
# 8 C 2 1 1 1 6
# 9 C 2 1 1 2 5
The disadvantage of this is that there is room for error when typing the suffixes, and we might forget to change them if we rename the data frames.
Also note that we are checking for names which appear twice. If one of your original data frames already has broken (duplicated) names then this function will also rename those columns. But I think it would be unwise to try to do any type of join if either data frame did not have unique column names.
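To see what the repair function in approach 1 does, it can be pulled out and run on a plain character vector. This is only a minimal sketch: the name repair_names is hypothetical, and the input vector imitates the concatenated column names that crossing() would hand over for df1 and df2.

```r
# Hypothetical standalone version of the anonymous function above.
repair_names <- function(x, suffix = c("_df1", "_df2")) {
  # Names occurring exactly twice are assumed to be shared by both data frames
  names_to_repair <- names(which(table(x) == 2))
  is_dup <- x %in% names_to_repair
  # The first block of shared names comes from df1, the second block from df2
  x[is_dup] <- paste0(x[is_dup], rep(suffix, each = length(names_to_repair)))
  x
}

repair_names(c("person", "V1", "V2", "V3", "V2", "V3"))
#> "person" "V1" "V2_df1" "V3_df1" "V2_df2" "V3_df2"
```

Because the suffixes are assigned by position, this relies on all of df1's names coming before all of df2's names in the vector.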
2. Create a wrapper function around crossing()
This might be more in the spirit of the tidyverse. The crossing() docs to which you linked state that crossing() is a wrapper around expand_grid(). The source for expand_grid() shows that it is basically a wrapper which uses map() to apply vctrs::vec_rep() to some inputs. So if we want to add another function to the call stack, there are two ways I can think of:
Using deparse(substitute())
crossing_fix_names <- function(df_1, df_2) {
suffixes <- paste0(
"_",
c(deparse(substitute(df_1)), deparse(substitute(df_2)))
)
crossing(
df_1,
df_2,
.name_repair = \(x, suffix = suffixes) {
names_to_repair <- names(which(table(x) == 2))
x[x %in% names_to_repair] <- paste0(
x[x %in% names_to_repair],
rep(
suffix,
each = length(unique(names_to_repair))
)
)
x
}
)
}
# Output the same as above
crossing_fix_names(df1, df2)
The disadvantage of this is that deparse(substitute()) is ugly and can occasionally have surprising behaviour. The advantage is we do not need to remember to manually add the suffixes.
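One example of the surprising behaviour: deparse(substitute()) returns the text of whatever expression was passed, which is only a usable suffix when the argument is a bare object name. A small sketch, where label_of is a hypothetical helper used purely for illustration:

```r
# Hypothetical helper: returns the caller's expression as text
label_of <- function(x) deparse(substitute(x))

df1 <- data.frame(V2 = 1:3)

label_of(df1)
#> "df1"

# An expression instead of a bare name gives an awkward suffix
label_of(head(df1, 2))
#> "head(df1, 2)"
```

Very long expressions can also deparse to more than one string, so a wrapper like crossing_fix_names() is safest when called with plain object names.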
Using match.call()
crossing_fix_names2 <- function(df_1, df_2) {
args <- as.list(match.call())
suffixes <- paste0(
"_",
c(
args$df_1,
args$df_2
)
)
crossing(
df_1,
df_2,
.name_repair = \(x, suffix = suffixes) {
names_to_repair <- names(which(table(x) == 2))
x[x %in% names_to_repair] <- paste0(
x[x %in% names_to_repair],
rep(
suffix,
each = length(unique(names_to_repair))
)
)
x
}
)
}
# Also the same output
crossing_fix_names2(df1, df2)
As we don't have the drawbacks of deparse(substitute()) and we don't have to manually specify the suffix, I think this is probably the best approach.
Test for shared column names, using the data frames from the dput:
colnames(df1) %in% colnames(df2)
[1] FALSE FALSE TRUE TRUE
Rename the columns of df2:
colnames(df2) <- paste0(colnames(df2), '_df2')
Then cbind:
cbind(df1,df2)
person V1 V2 V3 V2_df2 V3_df2
1 A 1 3 3 2 5
2 B 4 4 5 1 6
3 C 2 1 1 1 2
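Not so elegant, but the columns stay easy to tell apart later. Note that cbind() binds rows by position, so this gives 3 rows; pass the renamed df2 to crossing() instead to get all 9 combinations.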
Another thread solved a similar problem very nicely.
But what I would like to do is get rid of some redundancy in my similar problem.
Using their example:
df <- data.frame(name = rep(letters[1:3], each = 3), foo=rep(1:9),var1 = letters[1:3], var2 = rep(3:5, each = 3))
creates:
df
name foo var1 var2
1 a 1 a 3
2 a 2 a 3
3 a 3 a 3
4 b 4 b 4
5 b 5 b 4
6 b 6 b 4
7 c 7 c 5
8 c 8 c 5
9 c 9 c 5
But what do I need to do to replace multiple characters, each with its own value?
a=1
b=2
c=3
I tried:
df[,c(4,6)] <- lapply(df[,c(4,6)], function(x) replace(x,x %in% "a", 1),
replace(x,x %in% "b", 2),
replace(x,x %in% "c", 3))
and
z<- c("a","b","c")
y<- c(1,2,3)
df[,c(1,3)] <- lapply(df[,c(1,3)], function(x) replace(x,x %in% z, y))
But neither seem to work.
Thanks.
You can use dplyr::recode
df <- data.frame(name = rep(letters[1:3], each = 3), foo=rep(1:9),var1 = letters[1:3], var2 = rep(3:5, each = 3))
library(dplyr, warn.conflicts = FALSE)
df %>%
mutate(across(c(name, var1), ~ recode(., a = 1, b = 2, c = 3)))
#> name foo var1 var2
#> 1 1 1 1 3
#> 2 1 2 2 3
#> 3 1 3 3 3
#> 4 2 4 1 4
#> 5 2 5 2 4
#> 6 2 6 3 4
#> 7 3 7 1 5
#> 8 3 8 2 5
#> 9 3 9 3 5
Created on 2021-10-19 by the reprex package (v2.0.1)
Across will apply the function defined by ~ recode(., a = 1, b = 2, c = 3) to both name and var1.
Using ~ and . is another way to define a function in across. This function is equivalent to the one defined by function(x) recode(x, a = 1, b = 2, c = 3), and you could use that code in across instead of the ~ form and it would give the same result. The only name I know for this is what it's called in ?across, which is "purrr-style lambda function", because the purrr package was the first to use formulas to define functions in this way.
If you want to see the actual function created by the formula, you can look at rlang::as_function(~ recode(., a = 1, b = 2, c = 3)), although it's a little more complex than the one above to support the use of ..1, ..2 and ..3 which are not used here.
Now that R supports the shorter \(x) syntax for defining functions, shown below, the purrr-style lambda is perhaps no longer needed; writing it that way is mostly an old habit.
df <- data.frame(name = rep(letters[1:3], each = 3), foo=rep(1:9),var1 = letters[1:3], var2 = rep(3:5, each = 3))
library(dplyr, warn.conflicts = FALSE)
df %>%
mutate(across(c(name, var1), \(x) recode(x, a = 1, b = 2, c = 3)))
#> name foo var1 var2
#> 1 1 1 1 3
#> 2 1 2 2 3
#> 3 1 3 3 3
#> 4 2 4 1 4
#> 5 2 5 2 4
#> 6 2 6 3 4
#> 7 3 7 1 5
#> 8 3 8 2 5
#> 9 3 9 3 5
A simple for loop would do the trick (with z and y as defined in the question):
for (i in seq_along(z)) {
  df[df == z[i]] <- y[i]
}
df
name foo var1 var2
1 1 1 1 3
2 1 2 2 3
3 1 3 3 3
4 2 4 1 4
5 2 5 2 4
6 2 6 3 4
7 3 7 1 5
8 3 8 2 5
9 3 9 3 5
You could use a lookup vector combined with apply:
z <- c("a","b","c")
y <- c(1,2,3)
lookup <- setNames(y, z)
df[,c(1,3)] <- apply(df[,c(1,3)], 2, function(x) lookup[x])
df
This returns
name foo var1 var2
1 1 1 1 3
2 1 2 2 3
3 1 3 3 3
4 2 4 1 4
5 2 5 2 4
6 2 6 3 4
7 3 7 1 5
8 3 8 2 5
9 3 9 3 5
If you are open to a tidyverse approach:
library(tidyverse)
df_new <- df %>%
mutate(across(c(var1, name), ~case_when(. == 'a' ~ 1,
. == 'b' ~ 2,
. == 'c' ~ 3)))
df_new
name foo var1 var2
1 1 1 1 3
2 1 2 2 3
3 1 3 3 3
4 2 4 1 4
5 2 5 2 4
6 2 6 3 4
7 3 7 1 5
8 3 8 2 5
9 3 9 3 5
Note that this code only works if you recode every value in the column. For example, if there were a "d" in your var1 column that you don't turn into a number, it would be changed to NA.
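If unmatched values should be kept rather than turned into NA, one option is a base R lookup with match(). This is a rough sketch under that assumption; recode_keep is a hypothetical helper, not part of dplyr.

```r
# Hypothetical helper: replace values found in `from` with the matching
# entry of `to`, and leave everything else untouched.
recode_keep <- function(x, from, to) {
  idx <- match(x, from)   # position of each value in `from`, NA if absent
  hit <- !is.na(idx)
  out <- x
  out[hit] <- to[idx[hit]]
  out
}

recode_keep(c("a", "b", "c", "d"), from = c("a", "b", "c"), to = c(1, 2, 3))
#> "1" "2" "3" "d"
```

The result stays a character vector, because a column that still contains "d" cannot be numeric.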
# Import data: df => data.frame
df <- data.frame(name = rep(letters[1:3], each = 3), foo=rep(1:9),var1 = letters[1:3], var2 = rep(3:5, each = 3))
# Function performing a mapping replacement:
# replaceMultipleValues => function()
replaceMultipleValues <- function(df, mapFrom, mapTo){
# Extract the values in the data.frame:
# dfVals => named character vector
dfVals <- unlist(df)
# Get all values in the mapping & data
# and assign a name to them: tmp1 => named character vector
tmp1 <- c(
setNames(mapTo, mapFrom),
setNames(dfVals, dfVals)
)
# Extract the unique values:
# valueMap => named character vector
valueMap <- tmp1[!(duplicated(names(tmp1)))]
# Recode the values, coerce vectors to appropriate
# types: res => data.frame
res <- type.convert(
data.frame(
matrix(
valueMap[dfVals],
nrow = nrow(df),
ncol = ncol(df),
dimnames = dimnames(df)
)
)
)
# Explicitly define the returned object: data.frame => env
return(res)
}
# Recode values in data.frame:
# res => data.frame
res <- replaceMultipleValues(
df,
c("a", "b", "c"),
c("1", "2", "3")
)
# Print data.frame to console:
# data.frame => stdout(console)
res
I've been struggling to add a new column only if it does not already exist. I found an answer here: "Adding column if it does not exist".
However, in my problem I must do this inside a purrr pipeline. I tried to adapt the above answer, but it doesn't fit my needs.
Here is an example what I'm dealing with:
Suppose I have a list of two data.frames:
library(tibble)
A = tibble(
x = 1:5, y = 1, z = 2
)
B = tibble(
x = 5:1, y = 3, z = 3, w = 7
)
dt_list = list(A, B)
The column I'd like to add is w:
cols = c(w = NA_real_)
Separately, if I want to add a column if it does not exist, I could do the following:
Since w already exists in B, no column is added:
B %>% tibble::add_column(!!!cols[!names(cols) %in% names(.)])
# A tibble: 5 x 4
x y z w
<int> <dbl> <dbl> <dbl>
1 5 3 3 7
2 4 3 3 7
3 3 3 3 7
4 2 3 3 7
5 1 3 3 7
In this case, since it does not exist, w is added:
A %>% tibble::add_column(!!!cols[!names(cols) %in% names(.)])
# A tibble: 5 x 4
x y z w
<int> <dbl> <dbl> <dbl>
1 1 1 2 NA
2 2 1 2 NA
3 3 1 2 NA
4 4 1 2 NA
5 5 1 2 NA
I tried the following to replicate it using purrr (I'd prefer not to use a for loop):
dt_list_2 = dt_list %>%
purrr::map(
~dplyr::select(., -starts_with("x")) %>%
~tibble::add_column(!!!cols[!names(cols) %in% names(.)])
)
But the output is not the same as doing it separately.
Note: This is an example of my real problem. In fact, I'm using purrr to read many *.csv files and then apply some data transformation. Something like this:
re_file <- list.files(path = dir_path, pattern = "*.csv")
cols_add = c(UCI = NA_real_)
file_list = re_file %>%
purrr::map(function(file_name){ # iterate through each file name
read_csv(file = paste0(dir_path, "//",file_name), skip = 2)
}) %>%
purrr::map(
~dplyr::select(., -starts_with("Textbox")) %>%
~dplyr::tibble(!!!cols[!names(cols) %in% names(.)])
)
You can use:
dt_list %>%
purrr::map(
~tibble::add_column(., !!!cols[!names(cols) %in% names(.)])
)
#[[1]]
# A tibble: 5 x 4
# x y z w
# <int> <dbl> <dbl> <dbl>
#1 1 1 2 NA
#2 2 1 2 NA
#3 3 1 2 NA
#4 4 1 2 NA
#5 5 1 2 NA
#[[2]]
# A tibble: 5 x 4
# x y z w
# <int> <dbl> <dbl> <dbl>
#1 5 3 3 7
#2 4 3 3 7
#3 3 3 3 7
#4 2 3 3 7
#5 1 3 3 7
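The column-dropping step from the question and the "add if missing" step can also live in one ordinary function that is applied to every list element. Below is a minimal base R sketch of that idea, using plain data frames in place of tibbles; prepare is a hypothetical name, and the same function could equally be passed to purrr::map().

```r
A <- data.frame(x = 1:5, y = 1, z = 2)
B <- data.frame(x = 5:1, y = 3, z = 3, w = 7)
dt_list <- list(A, B)
cols <- c(w = NA_real_)

# Hypothetical helper: drop columns starting with "x", then add any
# column named in `cols` that the data frame does not have yet.
prepare <- function(df, cols) {
  df <- df[!startsWith(names(df), "x")]
  for (nm in setdiff(names(cols), names(df))) {
    df[[nm]] <- cols[[nm]]   # the scalar default is recycled to every row
  }
  df
}

dt_list_2 <- lapply(dt_list, prepare, cols = cols)
```

Writing a single function avoids chaining two separate ~ lambdas inside one pipeline, as in the attempt in the question.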
I have a dataset where the first line is the header, the second line is some explanatory data, and rows 3 onward are numbers. Because of this explanatory second row, the columns are automatically read in as factors (or as character if I set stringsAsFactors = FALSE).
What I would like to do is remove that explanatory row and have a function that goes through all columns, detects whether they contain only numbers, and changes the class to the appropriate type. Is there something like that available? Perhaps using dplyr? I have many columns so I'd like to avoid manually reassigning them.
A simplified example below
> df <- data.frame(A = c("col 1",1,2,3,4,5), B = c("col 2",1,2,3,4,5))
> df
A B
1 col 1 col 2
2 1 1
3 2 2
4 3 3
5 4 4
6 5 5
If every column contains only numbers after the explanatory row, we can do this:
library(tidyverse)
df[-1, ] %>% mutate_all(as.numeric)
Depending on the task, you can instead convert only the columns in which at least one value parses as a number:
df <- tibble(A = c("col 1",1,2,3,4,5),
B = c("col 2",1,2,3,4,5),
C = c(letters[1:5], 6))
df[-1, ] %>% mutate_if(~ any(!is.na(as.numeric(.))), as.numeric)
A B C
<dbl> <dbl> <dbl>
1 1 1 NA
2 2 2 NA
3 3 3 NA
4 4 4 NA
5 5 5 6
or only the columns in which every value parses as a number:
df[-1, ] %>% mutate_if(~ all(!is.na(as.numeric(.))), as.numeric)
A B C
<dbl> <dbl> <chr>
1 1 1 b
2 2 2 c
3 3 3 d
4 4 4 e
5 5 5 6
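In base R, we can drop the first row and then convert every column (as.character() guards against columns that were read in as factors):
df <- df[-1, ]
df[] <- lapply(df, function(x) as.numeric(as.character(x)))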
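Base R's type.convert() can also pick a type per column automatically, which suits the "detect whether a column is numeric" part of the question. A small sketch, assuming the columns were read as character; the extra column C is made up here to show a column that stays text.

```r
df <- data.frame(
  A = c("col 1", 1, 2, 3, 4, 5),
  B = c("col 2", 1, 2, 3, 4, 5),
  C = c("col 3", "b", "c", "d", "e", 6),  # illustrative mixed column
  stringsAsFactors = FALSE
)

# Drop the explanatory row, then let type.convert() choose each column's type
clean <- type.convert(df[-1, ], as.is = TRUE)
rownames(clean) <- NULL

sapply(clean, class)
#> A: "integer", B: "integer", C: "character"
```

as.is = TRUE keeps non-numeric columns as character instead of converting them to factors.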
I have two data frames. Data frame A has many observations/rows, an ID for each observation, and many additional columns. For a subset of observations X, the values for a set of columns are missing/NA. Data frame B contains a subset of the observations in X (which can be matched across data frames using the ID) and variables with identical names as in data frame A, but containing values to replace the missing values in the set of columns with missing/NA.
My code below (using a join operation) merely adds columns rather than replacing missing values. For each of the additional variables (let's name them W) in B, the resulting table produces W.x and W.y.
library(dplyr)
foo <- data.frame(id = seq(1:6), x = c(NA, NA, NA, 1, 3, 8), z = seq_along(10:15))
bar <- data.frame(id = seq(1:2), x = c(10, 9))
dplyr::left_join(x = foo, y = bar, by = "id")
I am trying to replace the missing values in A using the values in B based on the ID, but do so in an efficient manner since I have many columns and many rows. My goal is this:
id x z
1 1 10 1
2 2 9 2
3 3 NA 3
4 4 1 4
5 5 3 5
6 6 8 6
One thought was to use ifelse() after joining, but typing out ifelse() functions for all of the variables is not feasible. Is there a way to do this simply without the database join or is there a way to apply a function across all columns ending in .x to replace the values in .x with the value in .y if the value in .x is missing?
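Another attempt, which should essentially be only one assignment operation, using @alistaire's data again. Note that pmax() keeps the larger value when both foo and bar have one, so it only acts as a pure NA fill when the two data frames never disagree: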
vars <- c("x","y")
foo[vars] <- Map(pmax, foo[vars], bar[match(foo$id, bar$id), vars], na.rm=TRUE)
foo
# id x y z
#1 1 10 1 1
#2 2 9 2 2
#3 3 NA 3 3
#4 4 1 4 4
#5 5 3 5 5
#6 6 8 6 6
EDIT
Updating the answer to use @alistaire's example data frame.
We can extend the same answer given below using mapply so that it can handle multiple columns for both foo and bar.
First find the columns common to both data frames and sort them so they are in the same order.
vars <- sort(intersect(names(foo), names(bar))[-1])
foo[vars] <- mapply(function(x, y) {
ind = is.na(x)
replace(x, ind, y[match(foo$id[ind], bar$id)])
}, foo[vars], bar[vars])
foo
# id x y z
#1 1 10 1 1
#2 2 9 2 2
#3 3 NA 3 3
#4 4 1 4 4
#5 5 3 5 5
#6 6 8 6 6
Original Answer
I think this does what you are looking for :
foo[-1] <- sapply(foo[-1], function(x) {
ind = is.na(x)
replace(x, ind, bar$x[match(foo$id[ind], bar$id)])
})
foo
# id x z
#1 1 10 1
#2 2 9 2
#3 3 NA 3
#4 4 1 4
#5 5 3 5
#6 6 8 6
For every column (except id) we find the missing value in foo and replace it with corresponding values from bar.
If you don't mind verbose base R approaches, then you can easily accomplish this using merge() and a careful subsetting of your data frame.
df <- merge(foo, bar, by="id", all.x=TRUE)
names(df) <- c("id", "x", "z", "y")
df$x[is.na(df$x)] <- df$y[is.na(df$x)]
df <- df[c("id", "x", "z")]
> df
id x z
1 1 10 1
2 2 9 2
3 3 NA 3
4 4 1 4
5 5 3 5
6 6 8 6
You can iterate dplyr::coalesce over the intersect of non-grouping columns. It's not elegant, but it should scale reasonably well:
library(tidyverse)
foo <- data.frame(id = seq(1:6),
x = c(NA, NA, NA, 1, 3, 8),
y = 1:6, # add extra shared variable
z = seq_along(10:15))
bar <- data.frame(id = seq(1:2),
y = c(1L, NA),
x = c(10, 9))
# names of non-grouping variables in both
vars <- intersect(names(foo), names(bar))[-1]
foobar <- left_join(foo, bar, by = 'id')
foobar <- vars %>%
map(paste0, c('.x', '.y')) %>% # make list of columns to coalesce
map(~foobar[.x]) %>% # for each set, subset foobar to a two-column data.frame
invoke_map(.f = coalesce) %>% # ...and coalesce it into a vector
set_names(vars) %>% # add names to list elements
bind_cols(foobar) %>% # bind into data.frame and cbind to foobar
select(union(names(foo), names(bar))) # drop duplicated columns
foobar
#> # A tibble: 6 x 4
#> id x y z
#> <int> <dbl> <int> <int>
#> 1 1 10 1 1
#> 2 2 9 2 2
#> 3 3 NA 3 3
#> 4 4 1 4 4
#> 5 5 3 5 5
#> 6 6 8 6 6
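For comparison, the same "fill only the NAs, matched by id" logic can be written without a join at all. This is a minimal base R sketch using the data from the question; patch_na is a hypothetical helper, and it assumes the key column is unique in the patch table.

```r
foo <- data.frame(id = 1:6, x = c(NA, NA, NA, 1, 3, 8), z = 1:6)
bar <- data.frame(id = 1:2, x = c(10, 9))

# Hypothetical helper: for every column shared with `patch` (other than
# the key), fill NA cells of `target` with the value from the matching row.
patch_na <- function(target, patch, key = "id") {
  shared <- setdiff(intersect(names(target), names(patch)), key)
  rows <- match(target[[key]], patch[[key]])  # NA where patch has no such id
  for (v in shared) {
    is_missing <- is.na(target[[v]])
    target[[v]][is_missing] <- patch[[v]][rows[is_missing]]
  }
  target
}

patched <- patch_na(foo, bar)
patched$x
#> 10 9 NA 1 3 8
```

Rows whose id is absent from bar simply keep their NA, which matches the desired output for id 3.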
I have a function in my real-world problem that returns a list. Is there any way to use this with the dplyr mutate()? This toy example doesn't work:
it = data.table(c("a","a","b","b","c"),c(1,2,3,4,5), c(2,3,4,2,2))
myfun = function(arg1,arg2) {
temp1 = arg1 + arg2
temp2 = arg1 - arg2
list(temp1,temp2)
}
myfun(1,2)
it%.%mutate(new = myfun(V2,V3))
I see that it is cycling through the output of the function in the first "column" of the new variable, but do not understand why.
Thanks!
The idiomatic way to do this using data.table would be to use the := (assignment by reference) operator. Here's an illustration:
it[, c(paste0("V", 4:5)) := myfun(V2, V3)]
If you really want a list, why not:
as.list(it[, myfun(V2, V3)])
Alternatively, maybe this is what you want, but why don't you just use the data.table functionality:
it[, c(.SD, myfun(V2, V3))]
# V1 V2 V3 V4 V5
# 1: a 1 2 3 -1
# 2: a 2 3 5 -1
# 3: b 3 4 7 -1
# 4: b 4 2 6 2
# 5: c 5 2 7 3
Note that if myfun were to name its output, then the names would show up in the final result columns:
# V1 V2 V3 new.1 new.2
# 1: a 1 2 3 -1
# 2: a 2 3 5 -1
# 3: b 3 4 7 -1
# 4: b 4 2 6 2
# 5: c 5 2 7 3
Given the title to this question, I thought I'd post a tidyverse solution that uses dplyr::mutate. Note that myfun needs to output a data.frame to work.
library(tidyverse)
it = data.frame(
v1 = c("a","a","b","b","c"),
v2 = c(1,2,3,4,5),
v3 = c(2,3,4,2,2))
myfun = function(arg1,arg2) {
temp1 = arg1 + arg2
temp2 = arg1 - arg2
data.frame(temp1, temp2)
}
it %>%
nest(data = c(v2, v3)) %>%
mutate(out = map(data, ~myfun(.$v2, .$v3))) %>%
unnest(cols = c(data, out))
#> # A tibble: 5 x 5
#> v1 v2 v3 temp1 temp2
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 a 1 2 3 -1
#> 2 a 2 3 5 -1
#> 3 b 3 4 7 -1
#> 4 b 4 2 6 2
#> 5 c 5 2 7 3
Created on 2020-02-04 by the reprex package (v0.3.0)
The mutate() function is designed to add new columns to an existing data frame, and each new column needs one element per row (or a single value to recycle). Here myfun() returns a list of two vectors rather than one vector of that length, so its result cannot be used directly as a single new column.
You can rewrite your function as two functions, each of which returns a vector. Then apply each of these separately using mutate() and it should work.
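A small sketch of the two-function idea, shown with base R column assignment so it runs without extra packages; the names add_args and subtract_args are hypothetical. The same two functions could be called inside mutate(), one per new column.

```r
it <- data.frame(
  V1 = c("a", "a", "b", "b", "c"),
  V2 = c(1, 2, 3, 4, 5),
  V3 = c(2, 3, 4, 2, 2)
)

# Each function returns a single vector with one value per row
add_args <- function(arg1, arg2) arg1 + arg2
subtract_args <- function(arg1, arg2) arg1 - arg2

it$V4 <- add_args(it$V2, it$V3)
it$V5 <- subtract_args(it$V2, it$V3)

it$V4
#> 3 5 7 6 7
it$V5
#> -1 -1 -1 2 3
```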