I'm trying to match data across two tables through two columns in R: ID number & address. I'm primarily matching through ID number, but there is missing data so address is the back-up column for matching. Any ideas on how to do it? Does merge() allow an "or" in the "by" argument?
Use left_join to get the rows that match, then filter out the missing data and repeat.
This doesn't work, but for instance:
merge(table1, table2, by = 'ID number' or 'address')
One way is to merge twice, first on id and then on address, and then combine the two value columns:
table1 <- data.frame(
  id = c(1, 2, 3),
  address = letters[1:3],
  stringsAsFactors = FALSE
)
table2 <- data.frame(
  id = c(1, NA_integer_, 3),
  address = c(letters[1:2], NA_character_),
  value = 10:12,
  stringsAsFactors = FALSE
)
# First pass: match on id
d <- merge(table1, table2[c("id", "value")], by = "id", all.x = TRUE)
# Second pass: match on address
result <- merge(d, table2[c("address", "value")], by = "address", all.x = TRUE)
# Prefer the id match and fall back to the address match
result$final_value <- with(result, ifelse(is.na(value.x), value.y, value.x))
  address id value.x value.y final_value
1       a  1      10      10          10
2       b  2      NA      11          11
3       c  3      12      NA          12
With dplyr:
library(dplyr)
table1 %>%
  left_join(select(table2, id, value), by = "id") %>%
  left_join(select(table2, address, value), by = "address") %>%
  mutate(
    # take the id match when present, otherwise the address match
    final_value = coalesce(value.x, value.y)
  )
  id address value.x value.y final_value
1  1       a      10      10          10
2  2       b      NA      11          11
3  3       c      12      NA          12
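The same fallback idea can also be sketched in base R with match(), assuming each key value appears at most once in table2. Here incomparables = NA stops a missing key in one table from matching a missing key in the other, which merge() and left_join() would otherwise do by default. The names by_id, by_address and idx are introduced only for this illustration.

```r
table1 <- data.frame(
  id = c(1, 2, 3),
  address = letters[1:3],
  stringsAsFactors = FALSE
)
table2 <- data.frame(
  id = c(1, NA, 3),
  address = c("a", "b", NA),
  value = 10:12,
  stringsAsFactors = FALSE
)

# Row of table2 that matches each row of table1, per key (NA never matches)
by_id      <- match(table1$id, table2$id, incomparables = NA)
by_address <- match(table1$address, table2$address, incomparables = NA)

# Use the id match when there is one, otherwise fall back to the address match
idx <- ifelse(is.na(by_id), by_address, by_id)
table1$value <- table2$value[idx]
table1
```

This avoids the value.x / value.y suffix columns, at the cost of only handling one-to-one matches.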
Hi, I have two data frames. Where the id matches, I want to replace table a's values with those from table b.
The sample dataset is here:
library(dplyr)
a <- tibble(id = c(1, 2, 3),
            type = c("a", "x", "y"))
b <- tibble(id = c(1, 3),
            type = c("d", "n"))
I'm expecting output like the following:
c <- tibble(id = c(1, 2, 3),
            type = c("d", "x", "n"))
In dplyr v1.0.0, the rows_update() function was introduced for this purpose:
rows_update(a, b)
# Matching, by = "id"
# # A tibble: 3 x 2
# id type
# <dbl> <chr>
# 1 1 d
# 2 2 x
# 3 3 n
Here is an option using dplyr::left_join and dplyr::coalesce
library(dplyr)
a %>%
rename(old = type) %>%
left_join(b, by = "id") %>%
mutate(type = coalesce(type, old)) %>%
select(-old)
# # A tibble: 3 × 2
#      id type
#   <dbl> <chr>
# 1     1 d
# 2     2 x
# 3     3 n
The idea is to join a with b on column id; then replace missing values in type from b with values from a (column old is the old type column from a, avoiding duplicate column names).
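For comparison, here is a minimal base R sketch of the same update, assuming id is unique in b. It uses plain data frames instead of tibbles so that it runs without dplyr, and the helper vectors m and hit are names made up for this example.

```r
a <- data.frame(id = c(1, 2, 3), type = c("a", "x", "y"), stringsAsFactors = FALSE)
b <- data.frame(id = c(1, 3), type = c("d", "n"), stringsAsFactors = FALSE)

m   <- match(a$id, b$id)   # position of each a$id in b, NA when absent
hit <- !is.na(m)           # rows of a that have a replacement in b

# Overwrite only the matched rows; unmatched rows keep their old value
a$type[hit] <- b$type[m[hit]]
a
```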
How do I use tidyr::complete to add additional rows to a data frame, specifying the column names wanted as an input, rather than having to hard code them?
df <- data.frame(
  group = c(1:2, 1),
  item_id = c(1:2, 2),
  item_name = c("a", "b", "b"),
  value1 = 1:3,
  value2 = 4:6
)
This works:
df %>% tidyr::complete(group, item_id, item_name)
but to avoid hardcoding I ideally want this to work:
cols_wanted <- c("group", "item_id", "item_name")
df %>% tidyr::complete(cols_wanted)
But it returns the following error:
Error in `dplyr::full_join()`:
! Join columns must be present in data.
✖ Problem with `cols_wanted`.
Traceback:
1. df %>% tidyr::complete(cols_wanted)
2. tidyr::complete(., cols_wanted)
3. complete.data.frame(., cols_wanted)
4. dplyr::full_join(out, data, by = names)
5. full_join.data.frame(out, data, by = names)
6. join_mutate(x, y, by = by, type = "full", suffix = suffix, na_matches = na_matches,
. keep = keep)
7. join_cols(tbl_vars(x), tbl_vars(y), by = by, suffix = suffix,
. keep = keep, error_call = error_call)
8. standardise_join_by(by, x_names = x_names, y_names = y_names,
. error_call = error_call)
9. check_join_vars(by$y, y_names, error_call = error_call)
10. abort(bullets, call = error_call)
11. signal_abort(cnd, .file)
My current solution is:
eval(parse(text = paste("df %>% tidyr::complete(",
paste(noquote(cols_wanted), collapse = ", "),
")")))
But I would like a solution that doesn't use eval or parse
You can use !!!syms() with the vector of column names: syms() (from rlang) turns the strings into a list of symbols, and the unquote-splice operator !!! passes that list to complete() as separate arguments.
library(tidyverse)
df %>%
complete(!!!syms(cols_wanted))
Output
group item_id item_name value1 value2
<dbl> <dbl> <chr> <int> <int>
1 1 1 a 1 4
2 1 1 b NA NA
3 1 2 a NA NA
4 1 2 b 3 6
5 2 1 a NA NA
6 2 1 b NA NA
7 2 2 a NA NA
8 2 2 b 2 5
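If a dependency-free route is acceptable, roughly the same completion can be sketched in base R by building every combination of the unique key values with expand.grid() and left-merging the data onto it. This is an illustrative alternative and does not reproduce how tidyr implements complete(); the names grid and completed are introduced here.

```r
df <- data.frame(
  group = c(1:2, 1),
  item_id = c(1:2, 2),
  item_name = c("a", "b", "b"),
  value1 = 1:3,
  value2 = 4:6,
  stringsAsFactors = FALSE
)
cols_wanted <- c("group", "item_id", "item_name")

# All combinations of the unique values in the wanted columns
grid <- expand.grid(lapply(df[cols_wanted], unique), stringsAsFactors = FALSE)

# Keep every combination; rows missing from df get NA in the value columns
completed <- merge(grid, df, by = cols_wanted, all.x = TRUE)
completed
```

Because cols_wanted is an ordinary character vector here, no tidy evaluation is needed.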
In a data frame I want to add a new column next to each column whose name matches a certain pattern, for example whose name starts with "ip_" followed by a number. The name of each new column should follow the pattern "newCol_" suffixed by that same number. The values of the new columns should be NA.
So the sample data frame below should be transformed so that every ip_ column is immediately followed by its newCol_ column.
A tidyverse solution using regex is much appreciated!
Sample data:
df <- data.frame(
ID = c("1", "2"),
ip_1 = c(2,3),
ip_9 = c(5,7),
ip_39 = c(11,13),
in_1 = c("B", "D"),
in_2 = c("A", "H"),
in_3 = c("D", "A")
)
Adding the columns is easy with across():
library(dplyr)
df %>%
mutate(across(starts_with('ip'), ~NA, .names = '{sub("ip", "newCol", .col)}'))
# ID ip_1 ip_9 ip_39 in_1 in_2 in_3 newCol_1 newCol_9 newCol_39
#1 1 2 5 11 B A D NA NA NA
#2 2 3 7 13 D H A NA NA NA
To get the columns in the required order:
library(dplyr)
df %>%
  mutate(across(starts_with('ip'), ~NA, .names = '{sub("ip", "newCol", .col)}')) %>%
  # order columns by the number in their name (ID has none, hence the suppressed warning)
  select(ID, starts_with('in'),
         order(suppressWarnings(readr::parse_number(names(.))))) %>%
  select(ID, ip_1:newCol_39, everything())
# ID ip_1 newCol_1 ip_9 newCol_9 ip_39 newCol_39 in_1 in_2 in_3
#1 1 2 NA 5 NA 11 NA B A D
#2 2 3 NA 7 NA 13 NA D H A
To add the new NA columns:
df[, sub("^ip", "newCol", grep("^ip", names(df), value = TRUE))] <- NA
To reorder them:
df <- df[, order(c(grep("newCol", names(df), invert = TRUE), grep("^ip", names(df))))]
Edit:
If it's something you (or whoever stumbles here) plan on doing often, you can use this function:
insertCol <- function(x, ind, col.names = ncol(x) + seq_along(ind), data = NA) {
  out <- x
  out[, col.names] <- data            # append the new columns at the end
  out[, order(c(col(x)[1, ], ind))]   # move each new column right after position ind
}
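As a self-contained illustration of the append-then-reorder trick used above, here is a small sketch that works from a name pattern rather than column positions. The helper add_na_after() and its arguments are hypothetical names for this example and do not come from any package.

```r
# Hypothetical helper: add an NA column right after every column matching `pattern`;
# the new name is the old name with `pattern` replaced by `replacement`
add_na_after <- function(x, pattern, replacement) {
  n_old <- ncol(x)
  hits <- grep(pattern, names(x))                     # positions of matching columns
  new_names <- sub(pattern, replacement, names(x)[hits])
  x[new_names] <- NA                                  # append the NA columns at the end
  # old columns sort by their own position, each new column by its source's position;
  # order() breaks ties by position, so a new column lands right after its source
  x[order(c(seq_len(n_old), hits))]
}

df <- data.frame(
  ID = c("1", "2"),
  ip_1 = c(2, 3),
  ip_9 = c(5, 7),
  ip_39 = c(11, 13),
  in_1 = c("B", "D"),
  stringsAsFactors = FALSE
)
out <- add_na_after(df, "^ip", "newCol")
names(out)
```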
Data:
ID  B   C
1   NA  x
2   x   NA
3   x   x
Results:
ID  Unified
1   C
2   B
3   B_C
I'm trying to combine columns B and C using mutate and unite, but how would I scale this up so that I can reuse it for many columns (think 100+) instead of having to write out the variables each time? Or is there a built-in function that already does this?
My current solution is this:
library(tidyverse)
Data %>%
mutate(B = replace(B, B == 'x', 'B'), C = replace(C, C == 'x', 'C')) %>%
unite("Unified", B:C, na.rm = TRUE, remove= TRUE)
We can use across() to loop over the columns, replacing each value equal to 'x' with the name of its column (cur_column()), and then unite() the result:
library(dplyr)
library(tidyr)
Data %>%
mutate(across(B:C, ~ replace(., .== 'x', cur_column()))) %>%
unite(Unified, B:C, na.rm = TRUE, remove = TRUE)
-output
ID Unified
1 1 C
2 2 B
3 3 B_C
data
Data <- structure(list(ID = 1:3, B = c(NA, "x", "x"), C = c("x", NA,
"x")), class = "data.frame", row.names = c(NA, -3L))
Here are a couple of options.
Using dplyr -
library(dplyr)
cols <- names(Data)[-1]
Data %>%
rowwise() %>%
mutate(Unified = paste0(cols[!is.na(c_across(B:C))], collapse = '_')) %>%
ungroup -> Data
Data
# ID B C Unified
# <int> <chr> <chr> <chr>
#1 1 NA x C
#2 2 x NA B
#3 3 x x B_C
Base R
Data$Unified <- apply(Data[cols], 1, function(x)
paste0(cols[!is.na(x)], collapse = '_'))
I am struggling with reordering a dataFrame in R.
My dataFrame has data coming from two different sensors. So in the beginning every column has a name with the syntax "sensor number.sample number". The rowname is a coordinate of each sample.
Sadly the columns are not ordered with an ascending sample number.
How can I make an automatic ordering where after number 1 comes 2 and not 10?
Once the columns are correctly ordered, I would like to cut all columns of the second sensor and append them as rows under those of the first sensor. This is also tricky, as the number of columns per sensor varies in reality.
To distinguish between the two sensors I would add a suffix "a" or "b" to the new row names.
My problem here is that I know rbind, but it requires identical column names, which I cannot provide. I would also need to select the columns manually, as I have no clue how to select all columns of the second sensor automatically.
My idea for the moment is to make subsets for each sensor, rename the columns and then use rbind with both subsets. Is this a good idea?
The rownames I then could modify with paste().
I now present simplified frames as the original is quite big. So the numbers (c(1:3)) are just exemplary.
This is how my dataFrame looks at the beginning:
myDf = data.frame(a.10= c(1:3),a.11= c(1:3),a.12= c(1:3),a.13= c(1:3),a.2= c(1:3),a.3= c(1:3),a.4= c(1:3),a.5= c(1:3),a.6= c(1:3),a.7= c(1:3),a.8= c(1:3),a.9= c(1:3),
b.1= c(1:3),b.10= c(1:3),b.11= c(1:3),b.2= c(1:3),b.3= c(1:3),b.4= c(1:3),b.5= c(1:3),b.6= c(1:3),b.7= c(1:3),b.8= c(1:3),b.9= c(1:3))
My goal is to transform the data frame so that it looks like this:
desiredDf =data.frame(n9=rep(c(1:3),2), n10=rep(c(1:3),2), n11=rep(c(1:3),2), n12=c(c(1:3),NA, NA, NA), n13=c(c(1:3), NA, NA, NA))
rownames(desiredDf)<-(c("1a","2a","3a","1b","2b","3b"))
Thank you very much!
Here is an option.
library(tidyverse)
myDF2 <- myDf %>% gather(measure, result, a.10:b.9) %>%
separate(measure, into = c("letter", "number"), sep = "\\.") %>%
group_by(letter, number)%>%
mutate(n = row_number()) %>%
unite(col, n, letter, sep = "") %>%
ungroup() %>%
arrange(as.numeric(number))%>%
mutate(number = paste0("n", number))%>%
mutate(number = factor(number, levels = unique(number)))%>%
spread(number, result)%>%
arrange(col)
row.names(myDF2) <- myDF2$col
myDF2$col <- NULL
Convert the row names to a column, reshape into long form, and separate the key (i.e. the original column names) into columns group and no, converting the latter to numeric. Sort, reshape back to wide form, sort again, combine rowname and group into the new row names, and prefix each column name with n.
library(dplyr)
library(tibble)
library(tidyr)
myDf %>%
rownames_to_column %>%
gather(key, value, -rowname) %>%
separate(key, c("group", "no"), convert = TRUE) %>%
arrange(group, no) %>%
spread(no, value) %>%
arrange(group, rowname) %>%
unite(rowname, rowname, group, sep = "") %>%
column_to_rownames %>%
rename_all(~ paste0("n", .))
giving:
n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n11 n12 n13
1a NA 1 1 1 1 1 1 1 1 1 1 1 1
2a NA 2 2 2 2 2 2 2 2 2 2 2 2
3a NA 3 3 3 3 3 3 3 3 3 3 3 3
1b 1 1 1 1 1 1 1 1 1 1 1 NA NA
2b 2 2 2 2 2 2 2 2 2 2 2 NA NA
3b 3 3 3 3 3 3 3 3 3 3 3 NA NA
Note
Above we used this for myDf, the input.
myDf <-
structure(list(a.10 = 1:3, a.11 = 1:3, a.12 = 1:3, a.13 = 1:3,
a.2 = 1:3, a.3 = 1:3, a.4 = 1:3, a.5 = 1:3, a.6 = 1:3, a.7 = 1:3,
a.8 = 1:3, a.9 = 1:3, b.1 = 1:3, b.10 = 1:3, b.11 = 1:3,
b.2 = 1:3, b.3 = 1:3, b.4 = 1:3, b.5 = 1:3, b.6 = 1:3, b.7 = 1:3,
b.8 = 1:3, b.9 = 1:3), class = "data.frame", row.names = c(NA,
-3L))
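For the ordering part of the question on its own (2 should come before 10), a minimal base R sketch is to split each column name at the dot and sort by sensor first and by the numeric sample number second. The vector nms below is a shortened stand-in for names(myDf), and sensor, sample_no and ordered are names introduced for this example.

```r
# Column names in alphabetical order, where "10" sorts before "2"
nms <- c("a.10", "a.11", "a.2", "a.9", "b.1", "b.10", "b.2")

sensor    <- sub("\\..*$", "", nms)               # part before the dot
sample_no <- as.integer(sub("^.*\\.", "", nms))   # part after the dot, as a number

# Sort by sensor, then by sample number compared numerically
ordered <- nms[order(sensor, sample_no)]
ordered
```

With the real data, the same order() call computed from names(myDf) could be used to index the columns, for example myDf[order(sensor, sample_no)].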