How to match multiple columns based on lookup table - r

I have the following two data frames:
lookup <- data.frame(id = c("A", "B", "C"),
price = c(1, 2, 3))
results <- data.frame(price_1 = c(2,2,1),
price_2 = c(3,1,1))
I now want to go through all columns of results and add the respective matching id from lookup as new columns. So I first want to take the price_1 column, find the ids (here: "B", "B", "A") and add them as a new column to results, and then I want to do the same for the price_2 column.
My real-life case would need to match 20+ columns, so I want to avoid a hard-coded manual solution and am looking for a dynamic approach, ideally in the tidyverse.
results <- results %>%
  left_join(., lookup, by = c("price_1" = "price"))
would give me the manual solution for the first column and I could repeat this with the second column, but I'm wondering if I can do this automatically for all my results columns.
Expected output:
price_1 price_2 id_1 id_2
      2       3  "B"  "C"
      2       1  "B"  "A"
      1       1  "A"  "A"

We could unlist the dataframe and match directly.
new_df <- results                                  # copy the price columns
names(new_df) <- paste0("id", seq_along(new_df))   # rename them id1, id2, ...
new_df[] <- lookup$id[match(unlist(new_df), lookup$price)]  # one vectorised lookup for all columns
cbind(results, new_df)
# price_1 price_2 id1 id2
#1 2 3 B C
#2 2 1 B A
#3 1 1 A A
In dplyr, we can do
library(dplyr)
bind_cols(results, results %>% mutate_all(~lookup$id[match(., lookup$price)]))
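mutate_all() has since been superseded; with dplyr 1.0 or later, a roughly equivalent sketch using across() (the .names argument and the id_ prefix below are my own choice, not part of the original answer) would be:
library(dplyr)
results %>%
  mutate(across(everything(),
                ~ lookup$id[match(.x, lookup$price)],
                .names = "id_{.col}"))
# adds id_price_1 and id_price_2 next to the original price columns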

You can use apply and match to match multiple columns against the lookup table.
cbind(results, t(apply(results, 1, function(i) lookup[match(i, lookup[,2]),1])))
# price_1 price_2 1 2
#1 2 3 B C
#2 2 1 B A
#3 1 1 A A
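If you also want the new columns labelled id_1 and id_2 as in the expected output, one possible tweak (a sketch assuming lookup$id is a character column, e.g. stringsAsFactors = FALSE or R >= 4.0) is to name the matrix columns before binding:
ids <- t(apply(results, 1, function(i) lookup$id[match(i, lookup$price)]))
colnames(ids) <- paste0("id_", seq_len(ncol(ids)))
cbind(results, ids)
#  price_1 price_2 id_1 id_2
#1       2       3    B    C
#2       2       1    B    A
#3       1       1    A    A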

Find and keep duplicated items in each column in R

Is there any way I can use something like tidyverse's add_count() %>% filter() or distinct(), or alternatively janitor's get_dupes(), to find and keep the duplicated items of each column? There is no need to compare items of different columns with each other; each column needs to be considered separately.
data1 <-tribble(
~colA, ~colB,
"a", 1,
"b", 1,
"c", 2,
"c", 3
)
Expected Output would be
colA colB
c 1
You can try map_dfc, which will map over the columns and return a data frame by column-binding the outputs:
library(tidyverse)
data1 %>%
map_dfc(~.x[duplicated(.x)])
# A tibble: 1 x 2
colA colB
<chr> <dbl>
1 c 1
However, this will result in unwanted behavior when the columns have different numbers of duplicates, due to recycling: when an operation requires two vectors to have the same length (like a column bind), R automatically repeats the shorter one until it is long enough to match the longer one.
data1 <-tribble(
~colA, ~colB,
"a", 1,
"b", 1,
"c", 2,
"c", 3,
"d", 1,
)
data1 %>%
map_dfc( ~.x[duplicated(.x)])
# A tibble: 2 x 2
colA colB
<chr> <dbl>
1 c 1
2 c 1
Here colA has been recycled to match the length of colB. In such cases you are better off returning a list with map:
data1 %>%
map( ~.x[duplicated(.x)])
#output
$colA
[1] "c"
$colB
[1] 1 1
In base R
duplicatedList <- lapply(data1, function(columnValues) {
  unique(columnValues[duplicated(columnValues)])
})
A base R option
> list2DF(Map(function(x) x[duplicated(x)], data1))
colA colB
1 c 1

tidyverse alternative to left_join & rows_update when two data frames differ in columns and rows

There might be a *_join version for this I'm missing here, but I have two data frames, where
1. The merging should happen in the first data frame, hence left_join.
2. I not only want to add columns, but also update existing columns in the first data frame; more specifically, replace NAs in the first data frame by values from the second data frame.
3. The second data frame contains more rows than the first one.
Conditions #1 and #2 make left_join fail. Condition #3 makes rows_update fail. So I need to do some steps in between and am wondering if there's an easier solution to get the desired output.
x <- data.frame(id = c(1, 2, 3),
a = c("A", "B", NA))
id a
1 1 A
2 2 B
3 3 <NA>
y <- data.frame(id = c(1, 2, 3, 4),
a = c("A", "B", "C", "D"),
q = c("u", "v", "w", "x"))
id a q
1 1 A u
2 2 B v
3 3 C w
4 4 D x
and the desired output would be:
id a q
1 1 A u
2 2 B v
3 3 C w
I know I can achieve this with the following code, but it looks unnecessarily complicated to me. So is there maybe a more direct approach without having to do the intermediate pipes in the two commands below?
library(tidyverse)
x %>%
left_join(., y %>% select(id, q), by = c("id")) %>%
rows_update(., y %>% filter(id %in% x$id), by = "id")
You can left_join and use coalesce to replace missing values.
library(dplyr)
x %>%
left_join(y, by = 'id') %>%
transmute(id, a = coalesce(a.x, a.y), q)
# id a q
#1 1 A u
#2 2 B v
#3 3 C w
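If you are on dplyr 1.0 or later, another possible sketch is rows_patch(), which (unlike rows_update()) only fills in NAs; a semi_join() first drops the rows of y that have no match in x:
library(dplyr)
x %>%
  left_join(y %>% select(id, q), by = "id") %>%
  rows_patch(y %>% semi_join(x, by = "id"), by = "id")
#   id a q
# 1  1 A u
# 2  2 B v
# 3  3 C w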

detecting sequence by group and compute new variable for the subset

I need to detect a sequence by group in a data.frame and compute new variable.
Consider I have this following data.frame:
df1 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "C", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C", "A,B,C", "A,B,C", "A,B,C,D", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))
df1
> df1
ID seqs count product stock
1 1 1 2 A A
2 1 2 1 B A,B
3 1 3 3 C A,B,C
4 1 4 1 C A,B,C
5 1 5 1 A,B A,B,C
6 1 6 2 A,B,C A,B,C
7 1 7 3 D A,B,C,D
8 2 1 1 A A
9 2 2 2 B A,B
10 2 3 1 A A,B
11 3 1 3 A A
12 3 2 1 A,B,C A,B,C
13 3 3 4 D A,B,C,D
14 3 4 1 D A,B,C,D
I am interested in computing a measure for IDs that follow this sequence:
- Count == 1
- Count > 1
- Count == 1
In the example this is true for:
- rows 2, 3, 4 for `ID==1`
- rows 8, 9, 10 for `ID==2`
- rows 12, 13, 14 for `ID==3`
For these IDs and rows, I need to compute a measure called new that takes the value of product in the last row of the sequence, provided that product appears in product in the second row of the sequence and is NOT in stock in the first row of the sequence.
The desired outcome is shown below:
> output
ID seq1 seq2 seq3 new
1 1 2 3 4 C
2 2 1 2 3
3 3 2 3 4 D
Note:
In the sequences detected for an ID, no new products are added to the stock.
In the original data there are a lot of IDs that do not have any qualifying sequence.
Some IDs have multiple qualifying sequences. All should be recorded.
count is always 1 or greater.
The original data holds millions of IDs with up to 1500 sequences each.
How would you write an efficient piece of code to get this output?
Here's a data.table option:
library(data.table)
char_cols <- c("product", "stock")
setDT(df1)[,
(char_cols) := lapply(.SD, as.character),
.SDcols = char_cols] # in case they're factors
df1[, c1 := (count == 1) &
(shift(count) > 1) &
(shift(count, 2L) == 1),
by = ID] #condition1
df1[, pat := paste0("(", gsub(",", "|", product), ")")] # pattern
df1[, c2 := mapply(grepl, pat, shift(product)) &
!mapply(grepl, pat, shift(stock, 2L)),
by = ID] # condition2
df1[(c1), new := ifelse(c2, product, "")] # create new column
df1[, paste0("seq", 1:3) := shift(seqs, 2:0)] # create seq columns
df1[(c1), .(ID, seq1, seq2, seq3, new)] # result
Here's another approach using the tidyverse; however, I think lag and lead have made this solution a bit slow. I included comments within the code to make it more legible.
But I spent enough time on it to post it anyway.
library(tidyverse)
df1 %>% group_by(ID) %>%
# this finds the rows with count > 1 where ...
#... the count of the row before and of the row after both equal 1
mutate(test = (count > 1 & c(F, lag(count==1)[-1]) & c(lead(count==1)[-n()],F))) %>%
# this makes a column that is TRUE for every row of a chunk ...
#... that meets the desired condition, to filter on later
mutate(test2 = test | c(F,lag(test)[-1]) | c(lead(test)[-n()], F)) %>%
filter(test2) %>% ungroup() %>%
# group each three occurrences in case of having multiple ones within each ID
group_by(G=trunc(3:(n()+2)/3)) %>% group_by(ID,G) %>%
# creating new column with string extracting techniques ...
#... (assuming those columns are characters)
mutate(new=
str_remove_all(
as.character(regmatches(stock[2], gregexpr(product[3], stock[2]))),
stock[1])) %>%
# selecting desired columns and adding times for long to wide conversion
select(ID,G,seqs,new) %>% mutate(times = 1:n()) %>% ungroup() %>%
# long to wide conversion using tidyr (part of tidyverse)
gather(key, value, -ID, -G, -new, -times) %>%
unite(col, key, times) %>% spread(col, value) %>%
# making the desired order of columns
select(-G,-new,new) %>% as.data.frame()
# ID seqs_1 seqs_2 seqs_3 new
# 1 1 2 3 4 C
# 2 2 1 2 3
# 3 3 2 3 4 D

Transforming directed dyads into undirected [duplicate]

This seems like such a basic question to me that I'm almost sure it must be covered somewhere around here, but I've been searching for quite some time now and just can't seem to find the right answer.
My data looks like this:
data <- data.frame(col1 = c("A","A","B","B"), col2 = c("B","C","A","C"), value = c(1,2,3,4))
col1 col2 value
1 A B 1
2 A C 2
3 B A 3
4 B C 4
I want to merge col1 and col2 into a single variable that identifies the unique dyads. It should not matter whether "A" and "B" appear in col1 or col2: each row that contains the combination "A" and "B" across col1 and col2 should get the same value of the new variable. I tried to use tidyr for this.
unite(data, col1, col2, col="dyad", sep="_")
returns
dyad value
1 A_B 1
2 A_C 2
3 B_A 3
4 B_C 4
Basically, I need dyad to contain the same value for A_B and B_A, because these pairs are equivalent for me. This is what it should look like, for example:
dyad value
1 A_B 1
2 A_C 2
3 A_B 3
4 B_C 4
Is there an easy way to do this? Thanks a lot!
There may be more elegant solutions, but perhaps this helps:
data <- data.frame(col1 = c("A","A","B","B"), col2 = c("B","C","A","C"), value = c(1,2,3,4),
stringsAsFactors = FALSE)
data$dyad <- apply(data[,c("col1","col2")], 1, FUN= function(x) paste(sort(x), collapse="_"))
So apply (with margin 1) applies the function to each row of the data frame; the function first sorts the two values and then pastes them together.
EDIT: I copied stringsAsFactors = FALSE from the other answer, as I used it as well but forgot to include it in my post :)
A solution using dplyr. Notice that I added stringsAsFactors = FALSE when creating the data frame because it is better to work on character columns in this case.
data <- data.frame(col1 = c("A","A","B","B"), col2 = c("B","C","A","C"), value = c(1,2,3,4),
stringsAsFactors = FALSE)
library(dplyr)
data2 <- data %>%
rowwise() %>%
mutate(dyad = paste(sort(c(col1, col2)), collapse = "_")) %>%
select(dyad, value) %>%
ungroup()
data2
# # A tibble: 4 x 2
# dyad value
# <chr> <dbl>
# 1 A_B 1
# 2 A_C 2
# 3 A_B 3
# 4 B_C 4
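For larger data you could also avoid the row-wise sort entirely and build the dyad from the element-wise minimum and maximum of the two columns (a sketch that assumes col1 and col2 are character, not factor):
data$dyad <- paste(pmin(data$col1, data$col2),
                   pmax(data$col1, data$col2),
                   sep = "_")
data
#   col1 col2 value dyad
# 1    A    B     1  A_B
# 2    A    C     2  A_C
# 3    B    A     3  A_B
# 4    B    C     4  B_C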

cumsum when current obs equals next obs for same variable (column)

I want to add a column to a data frame that holds a cumulative sum of another variable whenever yet another variable is equal across consecutive rows. For example:
Row Var1 Var2 CumVal
  1    A    2      2
  2    A    4      6
  3    B    5      5
So I want CumVal to cumulate/sum the Var2 column if the Var1 observation in row 2 equals the Var1 observation in row 1. In other words, if it is equal to the observation before.
If the cumsum is based on Var1 as a grouping variable
library(dplyr)
df %>%
group_by(Var1) %>%
mutate(CumVal=cumsum(Var2))
Or
library(data.table)
setDT(df)[, CumVal:=cumsum(Var2), by=Var1]
Or using base R
transform(df, CumVal=ave(Var2, Var1, FUN=cumsum))
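With the four-row df shown in the data section further down, this grouped version keeps accumulating across the non-adjacent A rows, which is what distinguishes it from the adjacent-run variant in the update below:
#  Row Var1 Var2 CumVal
#1   1    A    2      2
#2   2    A    4      6
#3   3    B    5      5
#4   4    A    6     12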
Update
If it is instead based on runs of adjacent equal elements of Var1
transform(df, CumVal = ave(Var2,
                           cumsum(c(TRUE, Var1[-1] != Var1[-nrow(df)])),
                           FUN = cumsum))
# Row Var1 Var2 CumVal
#1 1 A 2 2
#2 2 A 4 6
#3 3 B 5 5
#4 4 A 6 6
Or the dplyr approach
df %>%
group_by(indx= cumsum(c(TRUE,(lag(Var1)!=Var1)[-1]))) %>%
mutate(CumVal=cumsum(Var2)) %>%
ungroup() %>%
select(-indx)
data
df <- structure(list(Row = 1:4, Var1 = c("A", "A", "B", "A"), Var2 = c(2L,
4L, 5L, 6L)), .Names = c("Row", "Var1", "Var2"), class = "data.frame",
row.names = c(NA, -4L))
I like rle, which detects runs of identical successive values in a vector and describes them in a compact way. E.g. let's say we have a vector x of length 10:
x <- c(2, 3, 2, 2, 2, 2, 0, 0, 2, 1)
rle is able to detect that there are 4 successive 2s and 2 successive 0s:
rle(x)
# Run Length Encoding
# lengths: int [1:6] 1 1 4 2 1 1
# values : num [1:6] 2 3 2 0 2 1
(in the output, we can see that two of the run lengths differ from 1: a run of 4 for the value 2 and a run of 2 for the value 0)
We can use this function to apply cumsum to subvectors of another vector. Let's say we want to apply cumsum to a new vector y <- 1:10, but restarting for each run of repeated values of x (the run membership will be stored in a factor f):
y <- 1:10
z <- rle(x)$lengths
f <- factor(rep( seq_along(z), z) )
We can then use by or tapply (or something else) to achieve the desired output:
cumval <- unlist(tapply(y, f, cumsum))
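For reference, on the x and y above this gives (the cumulative sum restarts at every new run of x):
unname(cumval)
# [1]  1  2  3  7 12 18  7 15  9 10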
