How can I pass column names to tidyr::complete - r

How do I use tidyr::complete to add additional rows to a data frame, specifying the column names wanted as an input, rather than having to hard code them?
df <- data.frame(
group = c(1:2, 1),
item_id = c(1:2, 2),
item_name = c("a", "b", "b"),
value1 = 1:3,
value2 = 4:6
)
This works:
df %>% tidyr::complete(group, item_id, item_name)
but to avoid hardcoding I ideally want this to work:
cols_wanted <- c("group", "item_id", "item_name")
df %>% tidyr::complete(cols_wanted)
But it returns the following error:
Error in `dplyr::full_join()`:
! Join columns must be present in data.
✖ Problem with `cols_wanted`.
Traceback:
1. df %>% tidyr::complete(cols_wanted)
2. tidyr::complete(., cols_wanted)
3. complete.data.frame(., cols_wanted)
4. dplyr::full_join(out, data, by = names)
5. full_join.data.frame(out, data, by = names)
6. join_mutate(x, y, by = by, type = "full", suffix = suffix, na_matches = na_matches,
. keep = keep)
7. join_cols(tbl_vars(x), tbl_vars(y), by = by, suffix = suffix,
. keep = keep, error_call = error_call)
8. standardise_join_by(by, x_names = x_names, y_names = y_names,
. error_call = error_call)
9. check_join_vars(by$y, y_names, error_call = error_call)
10. abort(bullets, call = error_call)
11. signal_abort(cnd, .file)
My current solution is:
eval(parse(text = paste("df %>% tidyr::complete(",
paste(noquote(cols_wanted), collapse = ", "),
")")))
But I would like a solution that doesn't use eval or parse

You can use !!!syms() with the vector of column names: syms() turns the strings into a list of symbols, and the unquote-splice operator, !!!, passes the elements of that list to complete() as separate arguments.
library(tidyverse)
df %>%
complete(!!!syms(cols_wanted))
Output
group item_id item_name value1 value2
<dbl> <dbl> <chr> <int> <int>
1 1 1 a 1 4
2 1 1 b NA NA
3 1 2 a NA NA
4 1 2 b 3 6
5 2 1 a NA NA
6 2 1 b NA NA
7 2 2 a NA NA
8 2 2 b 2 5
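If the same pattern is needed in several places, it can be wrapped in a small helper. The following is only a sketch: complete_cols is a hypothetical name (not part of tidyr), and rlang is loaded explicitly so that syms() is available regardless of which other packages are attached.

```r
library(tidyr)
library(rlang)

df <- data.frame(
  group = c(1:2, 1),
  item_id = c(1:2, 2),
  item_name = c("a", "b", "b"),
  value1 = 1:3,
  value2 = 4:6
)

# Hypothetical helper: complete `data` over the columns named in `cols`
complete_cols <- function(data, cols) {
  # syms() converts the strings to symbols; !!! splices them in as arguments
  complete(data, !!!syms(cols))
}

cols_wanted <- c("group", "item_id", "item_name")
out <- complete_cols(df, cols_wanted)
```

Here out should contain one row for every combination of the three columns, with NA in value1 and value2 for combinations that were not in df.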

Related

Replace a value in a data frame from other dataframe in r

Hi, I have two dataframes. Where the id matches, I want to replace table a's values with those of table b.
A sample dataset is here:
a = tibble(id = c(1, 2,3),
type = c("a", "x", "y"))
b= tibble(id = c(1,3),
type =c("d", "n"))
I'm expecting an output like the following:
c= tibble(id = c(1,2,3),
type = c("d", "x", "n"))
In dplyr v1.0.0, the rows_update() function was introduced for this purpose:
rows_update(a, b)
# Matching, by = "id"
# # A tibble: 3 x 2
# id type
# <dbl> <chr>
# 1 1 d
# 2 2 x
# 3 3 n
Here is an option using dplyr::left_join and dplyr::coalesce
library(dplyr)
a %>%
rename(old = type) %>%
left_join(b, by = "id") %>%
mutate(type = coalesce(type, old)) %>%
select(-old)
## A tibble: 3 × 2
# id type
# <dbl> <chr>
#1 1 d
#2 2 x
#3 3 n
The idea is to join a with b on column id; then replace missing values in type from b with values from a (column old is the old type column from a, avoiding duplicate column names).
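For comparison, the same replacement can be sketched in base R with match(). This is only an illustration, assuming id is unique in both tables; ids in b that do not occur in a are skipped rather than appended.

```r
a <- data.frame(id = c(1, 2, 3), type = c("a", "x", "y"))
b <- data.frame(id = c(1, 3), type = c("d", "n"))

# Position in `a` of each id in `b` (NA if the id is not present in `a`)
idx <- match(b$id, a$id)

# Only overwrite rows of `a` that actually have a match in `b`
found <- !is.na(idx)
a$type[idx[found]] <- b$type[found]
```

After this, a$type holds the values from b for ids 1 and 3 and keeps its own value for id 2.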

Adding a new column next to each existing column that matches a certain column name pattern in R / tidyverse

In a dataframe I want to add a new column next to each column whose name matches a certain pattern, for example whose name starts with "ip_" and is followed by a number. The names of the new columns should follow the pattern "newCol_" suffixed by that same number. The values of the new columns should be NA.
So in the sample data below, ip_1 should be followed by a new column newCol_1, ip_9 by newCol_9, and ip_39 by newCol_39.
A tidyverse solution using regex is much appreciated!
Sample data:
df <- data.frame(
ID = c("1", "2"),
ip_1 = c(2,3),
ip_9 = c(5,7),
ip_39 = c(11,13),
in_1 = c("B", "D"),
in_2 = c("A", "H"),
in_3 = c("D", "A")
)
Adding the columns is easy with across:
library(dplyr)
df %>%
mutate(across(starts_with('ip'), ~NA, .names = '{sub("ip", "newCol", .col)}'))
# ID ip_1 ip_9 ip_39 in_1 in_2 in_3 newCol_1 newCol_9 newCol_39
#1 1 2 5 11 B A D NA NA NA
#2 2 3 7 13 D H A NA NA NA
To get the columns in the required order:
library(dplyr)
df %>%
mutate(across(starts_with('ip'), ~NA, .names = '{sub("ip", "newCol", .col)}')) %>%
select(ID, starts_with('in'),
order(suppressWarnings(readr::parse_number(names(.))))) %>%
select(ID, ip_1:newCol_39, everything())
# ID ip_1 newCol_1 ip_9 newCol_9 ip_39 newCol_39 in_1 in_2 in_3
#1 1 2 NA 5 NA 11 NA B A D
#2 2 3 NA 7 NA 13 NA D H A
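An alternative that avoids the reordering step is to add each new column directly after its source column. The sketch below assumes dplyr 1.0.0 or later (for the .after argument of mutate()) and uses a plain loop; the variable names ip_cols, out, and new_col are illustrative only.

```r
library(dplyr)

df <- data.frame(
  ID = c("1", "2"),
  ip_1 = c(2, 3),
  ip_9 = c(5, 7),
  ip_39 = c(11, 13),
  in_1 = c("B", "D"),
  in_2 = c("A", "H"),
  in_3 = c("D", "A")
)

# Columns named "ip_" followed by one or more digits
ip_cols <- grep("^ip_[0-9]+$", names(df), value = TRUE)

out <- df
for (col in ip_cols) {
  new_col <- sub("^ip_", "newCol_", col)
  # Create the NA column and place it right after the matching ip_ column
  out <- mutate(out, "{new_col}" := NA, .after = all_of(col))
}
```

With the sample data this should leave ID first, each ip_ column followed by its newCol_ counterpart, and the in_ columns at the end.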
To add the new NA columns :
df[, sub("^ip", "newCol", grep("^ip", names(df), value = TRUE))] <- NA
To reorder them :
df <- df[, order(c(grep("newCol", names(df), invert = TRUE), grep("^ip", names(df))))]
Edit:
If it's something you (or whoever stumbles here) plan on doing often, you can use this function:
# x: data frame; ind: positions of the columns after which new columns are inserted
insertCol <- function(x, ind, col.names = ncol(x) + seq_along(ind), data = NA){
out <- x
out[, col.names] <- data
out[, order(c(seq_len(ncol(x)), ind))]
}

R: How to insert a row in Dataframe starting at a certain column?

I have the following data frame:
df <- tibble(x = 1:3, y = 3:1, z = 4:6, a = 6:4, b = 7:9)
I now need to extract the values from the second row, third to fifth column with this command:
newrow <- df[2,3:5]
I now want to insert a new row after the second row. The problem is that I need the new row to start at column 2. If I use the following code, the row will be added at the same column positions as I extracted it from:
df%>% add_row(newrow, .before = 3)
Hope anybody can help with this, any help is much appreciated.
Your newrow dataframe has the column names from columns 3:5 (z, a, b). Therefore add_row() matches newrow to these columns.
You need to rename the columns of newrow with the first three column names.
df%>% add_row(setNames(newrow, names(df)[1:ncol(newrow)]),
.before = 3)
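If the new row should start at the second column, as described in the question, the same renaming idea can be shifted by one position. A minimal sketch, where start_col and target are illustrative names not taken from the question:

```r
library(tibble)

df <- tibble(x = 1:3, y = 3:1, z = 4:6, a = 6:4, b = 7:9)
newrow <- df[2, 3:5]

# Column position at which the inserted values should begin
start_col <- 2

# Names of the columns that will receive the values (here y, z, a)
target <- names(df)[seq(start_col, length.out = ncol(newrow))]

out <- add_row(df, setNames(newrow, target), .before = 3)
```

Columns not covered by target (here x and b) are filled with NA in the inserted row.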
I'm not sure exactly what your desired outcome is, but does this achieve what you want?
library(tibble)
library(dplyr)
df <- tibble::tibble(x = 1:3, y = 3:1, z = 4:6, a = 6:4, b = 7:9)
whatrow <- 2
whatcolumns <- 3:5
beforerow <- 3
newdf <-
slice(df, whatrow) %>%
select(all_of(whatcolumns)) %>%
setNames(., names(df)[whatcolumns - 1]) %>%
add_row(df, ., .before = beforerow)
newdf
#> # A tibble: 4 x 5
#> x y z a b
#> <int> <int> <int> <int> <int>
#> 1 1 3 4 6 7
#> 2 2 2 5 5 8
#> 3 NA 5 5 8 NA
#> 4 3 1 6 4 9

Matching rows across tables for R

I'm trying to match data across two tables through two columns in R: ID number & address. I'm primarily matching through ID number, but there is missing data so address is the back-up column for matching. Any ideas on how to do it? Does merge() allow an "or" in the "by" argument?
Using left_join to get the rows that match, then filtering out the missing data and repeating, is too long.
This doesn't work, but I'm looking for something like:
merge(table1, table2, by = 'ID number' or 'address')
One way is to merge twice - first with id and then with address - and then clean up the final values -
table1 <- data.frame(
id = c(1, 2, 3),
address = letters[1:3],
stringsAsFactors = FALSE
)
table2 <- data.frame(
id = c(1, NA_integer_, 3),
address = c(letters[1:2], NA_character_),
value = 10:12,
stringsAsFactors = FALSE
)
d <- merge(table1, table2[c("id", "value")], by = "id", all.x = TRUE)
result <- merge(d, table2[c("address", "value")], by = "address", all.x = TRUE)
result$final_value <- with(result, ifelse(is.na(value.x), value.y, value.x))
address id value.x value.y final_value
1 a 1 10 10 10
2 b 2 NA 11 11
3 c 3 12 NA 12
With dplyr -
library(dplyr)
table1 %>%
left_join(select(table2, id, value), by = "id") %>%
left_join(select(table2, address, value), by = "address") %>%
mutate(
final_value = coalesce(value.x, value.y)
)
id address value.x value.y final_value
1 1 a 10 10 10
2 2 b NA 11 11
3 3 c 12 NA 12
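The same "id first, address as fallback" logic can also be sketched without joins by looking up row positions with match(). This is an illustration only: it assumes id and address are each unique in table2, and it uses incomparables = NA so that a missing key never matches another missing key.

```r
table1 <- data.frame(
  id = c(1, 2, 3),
  address = letters[1:3],
  stringsAsFactors = FALSE
)
table2 <- data.frame(
  id = c(1, NA_integer_, 3),
  address = c(letters[1:2], NA_character_),
  value = 10:12,
  stringsAsFactors = FALSE
)

# Row of table2 matched by id, and separately by address
id_idx <- match(table1$id, table2$id, incomparables = NA)
addr_idx <- match(table1$address, table2$address, incomparables = NA)

# Prefer the id match; fall back to the address match where id found nothing
idx <- ifelse(is.na(id_idx), addr_idx, id_idx)
table1$value <- table2$value[idx]
```

Rows of table1 with no match on either key end up with NA in value.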

Function using vector index not value in internal data.frame with dplyr::mutate

Problem:
I have a function that uses an argument to index into an internal data.frame, but returns an integer. However, when I run the function in dplyr::mutate to create a new variable based on another variable in a data.frame, I get an error:
Error in mutate_impl(.data, dots) :
Evaluation error: duplicate subscripts for columns.
This appears to be caused by the internal indexing of the data frame using the index position of the variable, instead of its value.
How can I solve this?
Example:
In this function I need to index into an internal data.frame and use this in the calculation of the result. Function and data:
toyfun <- function(thing1){
thing2 <- data.frame(a = 0, b = 0, c = 0, d = 0)
thing2[, thing1] <- 1
thing3 <- sum(thing2[1,]) + thing1
return(thing3)
}
toydat <- tibble(thing1 = c(4, 3, 2, 1, 1, 2))
Function does as expected:
toyfun(thing1 = toydat$thing1[1])
#[1] 5
But if I want to calculate the function with each element of a variable in a tibble or data.frame, with mutate, it fails:
toydat %>%
mutate(thing4 = toyfun(thing1 = thing1))
# Error in mutate_impl(.data, dots) :
# Evaluation error: duplicate subscripts for columns.
If we just use the first 4 rows (or fewer) of toydat, and note that the internal data.frame in toyfun is 4 columns wide, it works fine
toydat[1:4,] %>%
mutate(thing4 = toyfun(thing1 = thing1))
# # A tibble: 4 x 2
# thing1 thing4
# <dbl> <dbl>
# 1 4 5
# 2 3 4
# 3 2 3
# 4 1 2
But again, if we use 5 rows, so going over the index value of the internal data.frame, we fail again:
toydat[1:5,] %>%
mutate(thing4 = toyfun(thing1 = thing1))
# Error in mutate_impl(.data, dots) :
# Evaluation error: duplicate subscripts for columns.
Crux of the issue
This result seems to illustrate that the problem is with this internal indexing using the index value from thing1 rather than its actual value. Which is weird, because as used in the 4-row example above, we can see that the returned values in thing4 are as they should be from using the values of thing1 to calculate the result.
NB: The same problem doesn't occur with sapply:
sapply(toydat$thing1, toyfun)
# [1] 5 4 3 2 2 3
Any ideas on ways around this in the dplyr type framework so I can keep the work flow consistent?
The issue arises because mutate passes the entire column to the function at once.
Let's debug the function
toyfun <- function(thing1){
browser()
thing2 <- data.frame(a = 0, b = 0, c = 0, d = 0)
thing2[,thing1] <- 1
thing3 <- thing1 + 1
return(thing3)
}
Now we run the mutate command
toydat %>%
mutate(thing4 = toyfun(thing1 = thing1))
#Called from: toyfun(thing1 = thing1)
#Browse[1]> thing1
#[1] 4 3 2 1 1 2
Because thing1 contains duplicate values (1 and 2 each appear twice), thing2[, thing1] <- 1 selects the same columns more than once, which gives an error.
It is the same as
df <- mtcars
df[, c(5, 5)] <- 1
Error in `[<-.data.frame`(`*tmp*`, , c(5, 5), value = 1) :
duplicate subscripts for columns
Now let's look at sapply call
sapply(toydat$thing1, toyfun)
#Called from: FUN(X[[i]], ...)
#Browse[1]> thing1
#[1] 4
sapply passes the values one by one, hence there is no error.
This is the same as
df <- mtcars
df[, 5] <- 1
df[, 5] <- 1
which doesn't give any error.
To resolve the error we can use unique to get only unique entries of thing1
toyfun <- function(thing1){
thing2 <- data.frame(a = 0, b = 0, c = 0, d = 0)
thing2[,unique(thing1)] <- 1
thing3 <- thing1 + 1
return(thing3)
}
toydat %>%
mutate(thing4 = toyfun(thing1 = thing1))
# A tibble: 6 x 2
# thing1 thing4
# <dbl> <dbl>
#1 4 5
#2 3 4
#3 2 3
#4 1 2
#5 1 2
#6 2 3
and this would also continue to work with sapply
sapply(toydat$thing1, toyfun)
#[1] 5 4 3 2 2 3
If you do not want to change the function, another option is to use rowwise, which works the same way as sapply and sends each value to the function one at a time.
toydat %>%
rowwise() %>%
mutate(thing4 = toyfun(thing1 = thing1))
#Called from: toyfun(thing1 = thing1)
#Browse[1]> thing1
#[1] 4
toydat %>%
rowwise() %>%
mutate(thing4 = toyfun(thing1 = thing1))
# thing1 thing4
# <dbl> <dbl>
#1 4 5
#2 3 4
#3 2 3
#4 1 2
#5 1 2
#6 2 3
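A further option that keeps the original function from the question unchanged is to apply it element by element inside mutate. The sketch below uses purrr::map_dbl for this; it is one possible approach, not the only one, and it assumes purrr is installed.

```r
library(dplyr)
library(purrr)

# Original function from the question, unchanged
toyfun <- function(thing1){
  thing2 <- data.frame(a = 0, b = 0, c = 0, d = 0)
  thing2[, thing1] <- 1
  thing3 <- sum(thing2[1, ]) + thing1
  return(thing3)
}

toydat <- tibble(thing1 = c(4, 3, 2, 1, 1, 2))

# map_dbl() calls toyfun once per element and returns a double vector
out <- toydat %>%
  mutate(thing4 = map_dbl(thing1, toyfun))
```

Because toyfun only ever sees a single value, the duplicate-subscript error cannot occur, and the result matches the sapply output shown in the question.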
Hope this was clear and helpful.
