Merge two tables and keep only duplicated values in R

I want to merge two tables and keep only the Task values that appear in both. Tasks that are not common to both tables are removed. Where a Task appears in both tables, I keep only the smaller value in each column.
The two tables look like this:
x<-data.frame("Task"=c("A","B","C","D","E"),"FC"=c(12,NA,15,14,NA),FH=c(13,15,NA,17,20))
Task FC FH
1 A 12 13
2 B NA 15
3 C 15 NA
4 D 14 17
5 E NA 20
y<-data.frame("Task"=c("B","C","F","G"),"FC"=c(NA,12,20,NA),FH=c(NA,17,18,NA))
Task FC FH
1 B NA NA
2 C 12 17
3 F 20 18
4 G NA NA
I want an output like this
Task FC FH
2 B NA 15
3 C 12 17

One dplyr possibility could be:
library(dplyr)
x %>%
bind_rows(y) %>%
group_by(Task) %>%
filter(n() > 1) %>%
summarise_all(~ ifelse(all(is.na(.)), NA, min(., na.rm = TRUE)))
Task FC FH
<chr> <dbl> <dbl>
1 B NA 15
2 C 12 17
Or if there could be duplicate tasks per single df:
x %>%
bind_rows(y, .id = "ID") %>%
group_by(Task) %>%
filter(n() > 1 & n_distinct(ID) > 1) %>%
summarise_all(~ ifelse(all(is.na(.)), NA, min(., na.rm = TRUE))) %>%
select(-ID)

You can also do:
# Perform a join
merged <- merge(x = x, y = y, by = "Task")
# Get the minimum value out of two comparable columns
merged$FC <- with(merged, pmin(FC.x, FC.y, na.rm = TRUE))
merged$FH <- with(merged, pmin(FH.x, FH.y, na.rm = TRUE))
# Delete the unwanted columns appearing out of merge
merged <- merged[-c(2:5)]
Output:
Task FC FH
1 B NA 15
2 C 12 17
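The same join-then-pmin idea can also be written as a single dplyr pipeline. This is only a sketch: it assumes dplyr is installed and that the shared value columns are exactly FC and FH, as in the question; res is a name made up for the example.

```r
library(dplyr)

x <- data.frame(Task = c("A", "B", "C", "D", "E"),
                FC = c(12, NA, 15, 14, NA),
                FH = c(13, 15, NA, 17, 20))
y <- data.frame(Task = c("B", "C", "F", "G"),
                FC = c(NA, 12, 20, NA),
                FH = c(NA, 17, 18, NA))

# inner_join keeps only Tasks present in both tables;
# pmin(..., na.rm = TRUE) returns NA only when both values are NA
res <- inner_join(x, y, by = "Task", suffix = c(".x", ".y")) %>%
  transmute(Task,
            FC = pmin(FC.x, FC.y, na.rm = TRUE),
            FH = pmin(FH.x, FH.y, na.rm = TRUE))
res
```

With more than two value columns, naming each pmin call by hand becomes tedious, and the bind_rows/group_by approach above probably scales better.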

Related

R iterating by group and mapping values based on column value

I have the following data frame in R:
df <- data.frame(name = c('p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end'),
time = c(1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31),
target = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2),
comb = c(0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1))
And another data frame:
data <- data.frame(time = c(2,5,8,14,14,20,21,26,28,28),
name = c('a','b','c','d','e','f','g','h','i','j'))
So, if we take a look at df we could sort the data by target and combination and we will notice that there are basically "groups". For example for target=1 and comb=0 there are four entries p1_start,p1_end,p2_start,p2_end and it is the same for all other target/comb combinations.
On the other side data contains entries with time being a timestamp.
Goal: I want to map the values from both data frames based on time.
Example: The first entry of data has time=2 meaning it happened between p1_start,p1_end so it should get the values target=1 and comb=0 mapped to the data data frame.
Example 2: The entries of data with time=14 happened between p2_start,p2_end so they should get the values target=1 and comb=1 mapped to the data data frame.
Idea: I thought I iterate over df by target and comb and for each combination of them check if there are rows in data whose time is between. The second could be done with the following command:
data[which(data$time > p1_start & data$time < p2_end),]
once I get the rows it is easy to append the values.
Problem: how could I do the iteration? I tried with the following:
df %>%
group_by(target, comb) %>%
print(data[which(data$time > df$p1_start & data$time < df$p2_end),])
But I am getting an error that time has not been initialized
Your problem is known as a non-equi join: for each value in one table, we need to find the range in another table that contains it. This is handled well by the data.table package.
We first reshape your df into a format suitable for the join, and then join data with df on time <= end and time >= start. Here is the code:
library(data.table)
setDT(df)[, c("type", "name") := tstrsplit(name, "_", fixed = TRUE)]
df <- dcast(df, ... ~ name, value.var = "time")
cols <- c("target", "comb", "type")
setDT(data)[df, (cols) := mget(paste0("i.", cols)), on = .(time<=end, time>=start)]
After dcast, df looks like this
target comb type end start
1: 1 0 p1 3 1
2: 1 0 p2 7 5
3: 1 1 p1 11 9
4: 1 1 p2 15 13
5: 2 0 p1 19 17
6: 2 0 p2 23 21
7: 2 1 p1 27 25
8: 2 1 p2 31 29
And the output is
> data
time name target comb type
1: 2 a 1 0 p1
2: 5 b 1 0 p2
3: 8 c NA NA <NA>
4: 14 d 1 1 p2
5: 14 e 1 1 p2
6: 20 f NA NA <NA>
7: 21 g 2 0 p2
8: 26 h 2 1 p1
9: 28 i NA NA <NA>
10: 28 j NA NA <NA>
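To see what the non-equi join computes, the lookup can be spelled out in base R. This is an illustrative sketch, not a replacement for the data.table code: intervals is a hand-typed copy of the reshaped df shown above, and idx and res are names made up for the example.

```r
# Reshaped intervals, typed in by hand from the dcast output above
intervals <- data.frame(
  target = c(1, 1, 1, 1, 2, 2, 2, 2),
  comb   = c(0, 0, 1, 1, 0, 0, 1, 1),
  type   = rep(c("p1", "p2"), 4),
  start  = c(1, 5, 9, 13, 17, 21, 25, 29),
  end    = c(3, 7, 11, 15, 19, 23, 27, 31)
)
data <- data.frame(time = c(2, 5, 8, 14, 14, 20, 21, 26, 28, 28),
                   name = letters[1:10])

# For each time, find the row of the interval that contains it (NA if none)
idx <- vapply(data$time, function(t) {
  hit <- which(intervals$start <= t & t <= intervals$end)
  if (length(hit) == 0) NA_integer_ else hit[1]
}, integer(1))

# Indexing with NA yields all-NA rows, which mirrors the unmatched rows above
res <- cbind(data, intervals[idx, c("target", "comb", "type")])
rownames(res) <- NULL
res
```

This loops over every row of data, so it is likely to be much slower than the data.table join on large inputs.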
Here is a tidyverse solution:
library(tidyr)
library(dplyr)
df %>%
rename(name_df=name) %>%
mutate(x = time +1) %>%
pivot_longer(
cols = c(time, x),
names_to = "helper",
values_to = "time"
) %>%
right_join(data, by="time") %>%
select(time, name, target, comb)
time name target comb
<dbl> <chr> <dbl> <dbl>
1 2 a 1 0
2 5 b 1 0
3 8 c 1 0
4 14 d 1 1
5 14 e 1 1
6 20 f 2 0
7 21 g 2 0
8 26 h 2 1
9 28 i 2 1
10 28 j 2 1
df <- data.frame(name = c('p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end'),
time = c(1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31),
target = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2),
comb = c(0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1))
data <- data.frame(time = c(2,5,8,14,14,20,21,26,28,28),
name = c('a','b','c','d','e','f','g','h','i','j'))
library(fuzzyjoin)
library(tidyverse)
tmp <- df %>%
separate(name,
into = c("p", "period"),
sep = "_",
remove = TRUE) %>%
pivot_wider(
id_cols = c(p, target, comb),
names_from = period,
values_from = time
) %>%
select(-p)
fuzzy_left_join(
x = data,
y = tmp,
by = c("time" = "start",
"time" = "end"),
match_fun = list(`>=`, `<=`))
#> time name target comb start end
#> 1 2 a 1 0 1 3
#> 2 5 b 1 0 5 7
#> 3 8 c NA NA NA NA
#> 4 14 d 1 1 13 15
#> 5 14 e 1 1 13 15
#> 6 20 f NA NA NA NA
#> 7 21 g 2 0 21 23
#> 8 26 h 2 1 25 27
#> 9 28 i NA NA NA NA
#> 10 28 j NA NA NA NA
Created on 2022-01-11 by the reprex package (v2.0.1)
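Newer dplyr releases can express the same range match without fuzzyjoin. The sketch below assumes dplyr 1.1.0 or later, where join_by() and its between() helper are available; intervals is a hand-typed stand-in for the tmp table built above.

```r
library(dplyr)

# One row per interval, equivalent to tmp above
intervals <- data.frame(
  target = c(1, 1, 1, 1, 2, 2, 2, 2),
  comb   = c(0, 0, 1, 1, 0, 0, 1, 1),
  start  = c(1, 5, 9, 13, 17, 21, 25, 29),
  end    = c(3, 7, 11, 15, 19, 23, 27, 31)
)
data <- data.frame(time = c(2, 5, 8, 14, 14, 20, 21, 26, 28, 28),
                   name = letters[1:10])

# between(time, start, end) matches rows where start <= time <= end
res <- left_join(data, intervals, by = join_by(between(time, start, end)))
res
```

Rows of data that fall outside every interval are kept with NA in the joined columns, as in the fuzzy_left_join output.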

Separating mixed values and generating new columns in tidyverse [duplicate]

A sample of my data is as follows:
df1 <- read.table(text = "var Time
12O 12
13O 11
22B 45
33Z 22
21L 2
11M 13", header = TRUE)
I want to separate the values in column "var" to get the following data:
df2 <- read.table(text = " Group1 Group2 Group3
1 2 O
1 3 O
2 2 B
3 3 Z
2 1 L
1 1 M", header = TRUE)
I tried the following code:
df2 <- df1 %>% separate(var, into = c('Group1', 'Group2','Group3'), sep = 1)
I get an error. I have searched for the cause of the error but could not find it.
If you want to retain the original column, you can use str_split_fixed from the stringr package and cbind the result to your existing data frame:
cbind(df1, str_split_fixed(as.character(df1$var),"", n = 3))
var Time 1 2 3
1 12O 12 1 2 O
2 13O 11 1 3 O
3 22B 45 2 2 B
4 33Z 22 3 3 Z
5 21L 2 2 1 L
6 11M 13 1 1 M
A possible base/stringr solution:
res<-as.data.frame(do.call(rbind,strsplit(stringr::str_replace_all(df1$var
,"([0-9])([0-9])([A-Z])","\\1 \\2 \\3"),
" ")))
names(res)<-paste0("Group",1:ncol(res))
cbind(df1["Time"],res)
Time Group1 Group2 Group3
1 12 1 2 O
2 11 1 3 O
3 45 2 2 B
4 22 3 3 Z
5 2 2 1 L
6 13 1 1 M
As far as I know (see "Separate outputs empty separator error for each row independently"), this cannot be done with tidyr separate() and an empty separator. A possibility is str_split() from stringr or strsplit() from base R.
So, using str_split():
df1 %>%
mutate(var = str_split(var, pattern = "")) %>%
unnest() %>%
group_by(Time) %>%
mutate(val = var,
var = paste0("Group", row_number())) %>%
spread(var, val) %>%
ungroup()
Time Group1 Group2 Group3
<int> <chr> <chr> <chr>
1 2 2 1 L
2 11 1 3 O
3 12 1 2 O
4 13 1 1 M
5 22 3 3 Z
6 45 2 2 B
Using strsplit():
df1 %>%
mutate(var = strsplit(as.character(var), split = "", fixed = TRUE)) %>%
unnest() %>%
group_by(Time) %>%
mutate(val = var,
var = paste0("Group", row_number())) %>%
spread(var, val) %>%
ungroup()
To have new columns with appropriate class (character, integer etc.), you can add convert = TRUE into spread().
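For fixed-width values like these, separate() also accepts numeric split positions in sep, which may be what the original attempt was reaching for. A minimal sketch, assuming every var value is exactly three characters long (df1 is rebuilt here with data.frame so the example is self-contained):

```r
library(tidyr)

df1 <- data.frame(var = c("12O", "13O", "22B", "33Z", "21L", "11M"),
                  Time = c(12, 11, 45, 22, 2, 13),
                  stringsAsFactors = FALSE)

# sep = c(1, 2) splits after the 1st and 2nd characters;
# sep needs one fewer position than there are names in `into`
df2 <- separate(df1, var,
                into = c("Group1", "Group2", "Group3"),
                sep = c(1, 2),
                convert = TRUE)
df2
```

With sep = 1 there is only one split point, so only two pieces are produced for the three names in into.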

dplyr collapse 'tail' rows into larger groups

library(tidyverse)
df <- tibble(a = as.factor(1:20), b = c(50, 20, 13, rep(2, 10), rep(1, 7)))
How do I make dplyr look at this data frame df and collapse all the occurrences of 2 into a single summed group, and collapse all the occurrences of 1 into a single summed group, while keeping the rest of the data frame?
Turn this:
# A tibble: 20 x 2
a b
<fct> <dbl>
1 1 50
2 2 20
3 3 13
4 4 2
5 5 2
6 6 2
7 7 2
8 8 2
9 9 2
10 10 2
11 11 2
12 12 2
13 13 2
14 14 1
15 15 1
16 16 1
17 17 1
18 18 1
19 19 1
20 20 1
into this:
# A tibble: 5 x 2
a b
<fct> <dbl>
1 1 50
2 2 20
3 3 13
4 grp2 20
5 grp1 7
[Edit] - I fixed the example data. Sorry about that.
We group by a manufactured sortkey to maintain sort order. We used the fact that b is in descending order in the input, but if that is not the case in your actual data then replace sortkey = -b with the more general sortkey = data.table::rleid(b) or the longer sortkey = cumsum(coalesce(b != lag(b), FALSE)).
We also convert b to the group names giving a new a. It wasn't clear which groups are to be converted to grp... form. Hard-coded 1 and 2? Any group with more than one row? Groups at the end with more than one row? At any rate it would be easy enough to change the condition in the if_else once that were clarified.
Finally perform the summation and then remove the sortkey.
df %>%
group_by(sortkey = -b, a = paste0(if_else(b %in% 1:2, "grp", ""), b)) %>%
summarize(b = sum(b)) %>%
ungroup %>%
select(-sortkey)
giving:
# A tibble: 5 x 2
a b
<chr> <int>
1 50 50
2 20 20
3 13 13
4 grp2 20
5 grp1 7
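The more general sort key mentioned in this answer can be sketched as follows. It assumes that "collapse" means any run of consecutive equal b values longer than one row, which is only one reading of the question; the column name run is made up for the example.

```r
library(dplyr)

df <- tibble(a = as.factor(1:20), b = c(50, 20, 13, rep(2, 10), rep(1, 7)))

res <- df %>%
  # run id: increases by one each time b changes from the previous row
  mutate(run = cumsum(coalesce(b != lag(b), FALSE))) %>%
  group_by(run) %>%
  # runs longer than one row get a "grp<b>" label, single rows keep their a
  summarise(a = if (n() > 1) paste0("grp", first(b)) else as.character(first(a)),
            b = sum(b)) %>%
  select(-run)
res
```

Because the groups are keyed on the run id rather than on b itself, the original row order is kept even when b is not sorted.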
Here's a way. I have converted a from factor to character to make things easier. You can convert it back to factor if you want. Also your test data was a bit wrong.
df <- tibble(a = as.character(1:20), b = c(50, 20, 13, rep(2, 10), rep(1, 7)))
df %>%
mutate(
a = case_when(
b == 1 ~ "grp1",
b == 2 ~ "grp2",
TRUE ~ a
)
) %>%
group_by(a) %>%
summarise(b = sum(b))
# A tibble: 5 x 2
a b
<chr> <dbl>
1 1 50
2 2 20
3 3 13
4 grp1 7
5 grp2 20
This is an approach which gives you the desired names for groups & where you don't need to think in advance how many cases like that you would need (e.g. it would create grp3, grp4, ... depending on the number in b).
library(dplyr)
df %>%
mutate(
grp = as.numeric(lag(df$b) != df$b),
grp = cumsum(ifelse(is.na(grp), 0, grp))
) %>% group_by(grp) %>%
mutate(
a = ifelse(n() > 1, paste0("grp", b), a),
b = sum(b)
) %>% ungroup() %>% distinct(a, b)
Output:
a b
<chr> <dbl>
1 1 50
2 2 20
3 3 13
4 grp2 20
5 grp1 7
Note that the code could be also condensed but that leads to a certain lack of readability in my opinion:
df %>%
group_by(grp = cumsum(ifelse(is.na(as.numeric(lag(df$b) != df$b)), 0, as.numeric(lag(df$b) != df$b)))) %>%
mutate(
a = ifelse(n() > 1, paste0("grp", b), a),
b = sum(b)
) %>% ungroup() %>% distinct(a, b)

identifying location of NA values in a data frame by ID (not row number) and column name

I have a survey where some questions were not answered by some participants. Here is a simplified version of my data
df <- data.frame(ID = c(12:16), Q1 = c("a","b","a","a",NA),
Q2 = c("a","a",NA,"b",NA), Q3 = c(NA,"a","a","a","b"))
df
I would like to see which ID numbers did not answer which questions. The following code is very close to the output I want but identifies the subject by row number - I would like the subject identified by ID number
table(data.frame(which(is.na(df), arr.ind=TRUE)))
Right now the output shows that rows 1, 3 and 5 did not answer at least one question, and it identifies the column with the missing value. I would like it to show me the same thing but with ID numbers 12, 14 and 16. It would be a bonus to have the column names (e.g. Q1, Q2, Q3) in the output as well, instead of the column numbers.
We can get the names of the columns that are NA in each row using apply, collapse them into a comma-separated string, and attach that to a new data frame along with its ID.
new_df <- data.frame(ID =df$ID, ques = apply(df, 1, function(x)
paste0(names(which(is.na(x))), collapse = ",")))
new_df
# ID ques
#1 12 Q3
#2 13
#3 14 Q2
#4 15
#5 16 Q1,Q2
An equivalent version would be:
new_df <- data.frame(ID = df$ID, ques = apply(is.na(df), 1, function(x)
paste0(names(which(x)), collapse = ",")))
In base R:
res <- df[!complete.cases(df),]
res[-1] <- as.numeric(is.na(res[-1]))
res
# ID Q1 Q2 Q3
# 12 12 0 0 1
# 14 14 0 1 0
# 16 16 1 1 0
If you wish to avoid apply type operations and continue from which(..., T), you can do something like the following:
tmp <- data.frame(which(is.na(df[, 2:4]), T))
# change to character
tmp[, 2] <- paste0('Q', tmp[, 2])
# gather column numbers together for each row number
tmp_split <- split(tmp[, 2], tmp[, 1])
# preallocate new column in df
df$missing <- vector('list', 5)
df$missing[as.numeric(names(tmp_split))] <- tmp_split
This produces
> df
ID Q1 Q2 Q3 missing
1 12 a a <NA> Q3
2 13 b a a NULL
3 14 a <NA> a Q2
4 15 a b a NULL
5 16 <NA> <NA> b Q1, Q2
You can convert the data to long format using tidyr::gather, filter for the rows where the answer is missing, and finally summarise with toString:
library(tidyverse)
df %>% gather(Question, Ans, -ID) %>%
filter(is.na(Ans)) %>%
group_by(ID) %>%
summarise(NotAnswered = toString(Question))
# # A tibble: 3 x 2
# ID NotAnswered
# <int> <chr>
# 1 12 Q3
# 2 14 Q2
# 3 16 Q1, Q2
If the OP wants to include all IDs in the result, the solution can be:
df %>% gather(Question, Ans, -ID) %>%
group_by(ID) %>%
summarise(NoAnswered = toString(Question[is.na(Ans)])) %>%
as.data.frame()
# ID NoAnswered
# 1 12 Q3
# 2 13
# 3 14 Q2
# 4 15
# 5 16 Q1, Q2
How's this with tidyverse:
data:
library(tidyverse)
df <- data.frame(ID = c(12:16), Q1 = c("a","b","a","a",NA), Q2 = c("a","a",NA,"b",NA), Q3 = c(NA,"a","a","a","b"))
code:
x <- df %>% filter(is.na(Q1) | is.na(Q2) | is.na(Q3)) # filter out NAs
y <- cbind(x %>% select(ID),
x %>% select(Q1, Q2, Q3) %>% sapply(., function(x) ifelse(is.na(x), 1, 0))
) # in 1/0 format
output:
x:
ID Q1 Q2 Q3
1 12 a a <NA>
2 14 a <NA> a
3 16 <NA> <NA> b
y:
ID Q1 Q2 Q3
1 12 0 0 1
2 14 0 1 0
3 16 1 1 0
My attempt is no better than any already offered, but it's a fun problem, so here's mine. Because why not?:
library( magrittr )
df$ques <- df %>%
is.na() %>%
apply( 1, function(x) {
x %>%
which() %>%
names() %>%
paste0( collapse = "," )
} )
df
# ID Q1 Q2 Q3 ques
# 1 12 a a <NA> Q3
# 2 13 b a a
# 3 14 a <NA> a Q2
# 4 15 a b a
# 5 16 <NA> <NA> b Q1,Q2
Most of the answer comes from your question:
df[which(is.na(df), arr.ind=TRUE)[,1],]
# ID Q1 Q2 Q3
# 5 16 <NA> <NA> b
# 3 14 a <NA> a
# 5.1 16 <NA> <NA> b
# 1 12 a a <NA>
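Staying close to the which(..., arr.ind = TRUE) call from the question, the row and column positions can be mapped back to ID values and column names directly. A minimal base R sketch; pos and missing are names made up for the example.

```r
df <- data.frame(ID = 12:16,
                 Q1 = c("a", "b", "a", "a", NA),
                 Q2 = c("a", "a", NA, "b", NA),
                 Q3 = c(NA, "a", "a", "a", "b"))

# Positions of NA among the question columns only (ID is dropped)
pos <- which(is.na(df[-1]), arr.ind = TRUE)

# Translate row positions to IDs and column positions to question names
missing <- data.frame(ID = df$ID[pos[, "row"]],
                      question = names(df)[-1][pos[, "col"]])
missing <- missing[order(missing$ID, missing$question), ]
rownames(missing) <- NULL
missing

# Same cross-tabulation as in the question, now labelled by ID and name
table(missing)
```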

Use Negation with Select in dplyr 0.7.x

I'm trying to write a function that needs to exclude a user passed variable from the resultant data frame. I'm also taking this opportunity to learn a bit more about the new dplyr syntax.
The function acts like a cross join for data frames. I want to use it as a clean way of duplicating data across parameters of a function.
The function works as follows:
crossjoin_df <- function(df1, df2, temp_col = ".k") {
df1 <- df1 %>%
mutate(!!temp_col := 1)
df2 <- df2 %>%
mutate(!!temp_col := 1)
out <- left_join(df1, df2, by = temp_col)
# I'm trying to replace the next line
out[,!names(out)==temp_col]
}
params <- data.frame(k = c(11,10),
n = c(27,26))
data <- data.frame(a = 1:3,
b = 4:6)
crossjoin_df(params, data) # 6 row data set
I want to see if it's possible to replace the last statement with a piped select statement. However, the negation does not seem to be working.
I am able to get something like:
out %>% select(!!temp_col)
to work, but that obviously only selects .k. I am not able to get anything like:
out %>% select(-!!temp_col)
to work.
You'll need rlang, the backend package for dplyr that enables tidy eval. Either keep using strings, in which case you'll need sym to turn a string into a symbol:
library(dplyr)
params <- data.frame(k = c(11,10),
n = c(27,26))
data <- data.frame(a = 1:3,
b = 4:6)
crossjoin_df <- function(df1, df2, temp_col = ".k") {
df1 <- df1 %>% mutate(!!temp_col := 1)
df2 <- df2 %>% mutate(!!temp_col := 1)
left_join(df1, df2, by = temp_col) %>%
select(-!!rlang::sym(temp_col))
}
crossjoin_df(params, data)
#> k n a b
#> 1 11 27 1 4
#> 2 11 27 2 5
#> 3 11 27 3 6
#> 4 10 26 1 4
#> 5 10 26 2 5
#> 6 10 26 3 6
...or switch to full tidy eval, in which case you'll need quo_name to turn a quosure into a name:
crossjoin_df <- function(df1, df2, temp_col = .k) {
temp_col <- enquo(temp_col)
df1 <- df1 %>% mutate(!!rlang::quo_name(temp_col) := 1)
df2 <- df2 %>% mutate(!!rlang::quo_name(temp_col) := 1)
left_join(df1, df2, by = rlang::quo_name(temp_col)) %>%
select(-!!temp_col)
}
crossjoin_df(params, data)
#> k n a b
#> 1 11 27 1 4
#> 2 11 27 2 5
#> 3 11 27 3 6
#> 4 10 26 1 4
#> 5 10 26 2 5
#> 6 10 26 3 6
Alternatively, just use tidyr::crossing:
tidyr::crossing(params, data)
#> k n a b
#> 1 11 27 1 4
#> 2 11 27 2 5
#> 3 11 27 3 6
#> 4 10 26 1 4
#> 5 10 26 2 5
#> 6 10 26 3 6
You can use one_of, and then negate the selection with -:
out %>% select(-one_of(temp_col))
crossjoin_df <- function(df1, df2, temp_col = ".k") {
# `$`(df1, temp_col) <- 1
df1 <- df1 %>%
mutate(!!temp_col := 1)
# `$`(df2, temp_col) <- 1
df2 <- df2 %>%
mutate(!!temp_col := 1)
left_join(df1, df2, by = temp_col) %>% select(-one_of(temp_col))
}
params <- data.frame(k = c(11,10),
n = c(27,26))
data <- data.frame(a = 1:3,
b = 4:6)
crossjoin_df(params, data)
# k n a b
#1 11 27 1 4
#2 11 27 2 5
#3 11 27 3 6
#4 10 26 1 4
#5 10 26 2 5
#6 10 26 3 6
This should work as well:
out %>% select_(paste0("-",temp_col))
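In more recent dplyr versions than the 0.7.x in the title, the usual way to drop a column whose name is held in a string is all_of(). This sketch assumes dplyr 1.0.0 or later; base merge is used only to build a joined frame that still carries the temporary key, and res is a name made up for the example.

```r
library(dplyr)

params <- data.frame(k = c(11, 10), n = c(27, 26))
data <- data.frame(a = 1:3, b = 4:6)
temp_col <- ".k"

# Add the temporary key to both frames and join on it
params[[temp_col]] <- 1
data[[temp_col]] <- 1
out <- merge(params, data, by = temp_col)

# all_of() selects the column named by the string; the minus sign negates it
res <- out %>% select(-all_of(temp_col))
res
```

all_of() raises an error if the named column is missing, whereas any_of() would skip it silently.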

Resources