Matching based on conditional data - r

I have two Dfs with multiple rows and columns. I want to see if Df1$Name matches Df2$Name. If it matches, I want to take the matched value and store it in a new variable in Df3. But if it doesn't match, I want to paste the value from Df1. The issue is that Df1 has 270 observations and Df2 has 277.
See example:
Df1
Name
Natalie
Desmond,James
Kylie
Df2
Name
<NA>
Desmond,James
<NA>
Df3
Merged_name
Natalie
Desmond,James
Kylie
I've tried:
Df3$Merged_name <- ifelse(Df1$Name %in% Df2$Name
& !is.na(Df2$Name), Df1$Name
, Df2$Name)
I get an error saying that the longer object length is not a multiple of shorter object length, which I'm assuming is due to the differing numbers of observations. Do I have to separate rows that contain more than one name (i.e. with separate_rows())? If so, how do I merge them back together?

You can use the cbind.fill function from the rowr package, which binds columns that have different numbers of rows, and then apply the conditional logic you described:
library(dplyr)
library(rowr)
Df1 <-
data.frame(
Name = c("Natalie", "Desmond,James", "Kylie"),
stringsAsFactors = FALSE
)
Df2 <-
data.frame(
Name = c(NA_character_, "Desmond,James", NA_character_, "Test"),
stringsAsFactors = FALSE
)
# Binding data by column and renaming similar column names
cbind.fill(Df1 %>% rename(Name1 = Name), Df2 %>% rename(Name2 = Name), fill = NA) %>%
mutate(Name = coalesce(Name2, Name1)) %>% # Conditional logic given
select(Name)
# Name
# Natalie
# Desmond,James
# Kylie
# Name1
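If rowr is not available, the same positional padding can be sketched in base R. This is a minimal sketch rather than part of the answer above: it assumes the rows of Df1 and Df2 line up by position (as they do with cbind.fill), and it relies on the fact that indexing a vector past its end returns NA.

```r
Df1 <- data.frame(
  Name = c("Natalie", "Desmond,James", "Kylie"),
  stringsAsFactors = FALSE
)
Df2 <- data.frame(
  Name = c(NA_character_, "Desmond,James", NA_character_, "Test"),
  stringsAsFactors = FALSE
)

# Pad both columns to the length of the longer one; positions past the
# end of the shorter vector become NA
n <- max(nrow(Df1), nrow(Df2))
name1 <- Df1$Name[seq_len(n)]
name2 <- Df2$Name[seq_len(n)]

# Take the Df2 value where it exists, otherwise fall back to Df1
Df3 <- data.frame(
  Merged_name = ifelse(is.na(name2), name1, name2),
  stringsAsFactors = FALSE
)
Df3
```

Because both vectors now have the same length, ifelse() no longer recycles one of them, which is what triggered the "longer object length" message in the question.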

Related

Finding Differences Between Two Dataframes

I have two dataframes: 1) an old dataframe (let's call it "df1") and 2) an updated dataframe ("df2"). I need to identify what has been added to or removed from df1 to create df2. So, I need a new dataframe with a new column identifying what rows should be added to or removed from df1 in order to get df2.
The two dataframes are of differing lengths, and Vessel_ID is the only unique identifier.
Here is a reproducible example:
df1 <- data.frame(Name=c('Vessel1', 'Vessel2', 'Vessel3', 'Vessel4', 'Vessel5'),
Vessel_ID=c('1','2','3','4','5'), special_NO=c(10,20,30,40,50),
stringsAsFactors=F)
df2 <- data.frame(Name=c('Vessel1', 'x', 'y', 'Vessel3', 'x', 'Vessel6'), Vessel_ID=c('1', '6', '7', '3', '5', '10'), special_NO=NA, stringsAsFactors=F)
Ideally I would want an output like this:
df3
Name Vessel_ID special_NO add_remove
Vessel2 2 20 remove
Vessel4 4 40 remove
Vessel6 10 NA add
x 6 NA add
y 7 NA add
Also, if the Vessel_ID matches, I want to substitute the special_NO from df1 for NA in df2...but maybe that's for another question.
I tried adding a new column to both df1 and df2 to identify which df each row originally belonged to, then merging the dataframes and using the duplicated() function. This seemed to work, but I still wasn't sure which rows to remove or to add, and I got different results depending on whether I specified fromLast=T or fromLast=F.
An approach using bind_rows
library(dplyr)
bind_rows(df1 %>% mutate(add_remove="remove"),
df2 %>% mutate(add_remove="add")) %>%
group_by(Vessel_ID) %>%
filter(n() == 1) %>%
ungroup()
# A tibble: 5 × 4
Name Vessel_ID special_NO add_remove
<chr> <chr> <dbl> <chr>
1 Vessel2 2 20 remove
2 Vessel4 4 40 remove
3 x 6 NA add
4 y 7 NA add
5 Vessel6 10 NA add
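The same comparison can also be written with %in% in base R, which makes the direction of each difference explicit. This is a sketch under the assumption that Vessel_ID alone identifies a vessel; unlike filter(n() == 1), it does not depend on each Vessel_ID appearing at most once within a data frame.

```r
df1 <- data.frame(Name = c("Vessel1", "Vessel2", "Vessel3", "Vessel4", "Vessel5"),
                  Vessel_ID = c("1", "2", "3", "4", "5"),
                  special_NO = c(10, 20, 30, 40, 50),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Name = c("Vessel1", "x", "y", "Vessel3", "x", "Vessel6"),
                  Vessel_ID = c("1", "6", "7", "3", "5", "10"),
                  special_NO = NA,
                  stringsAsFactors = FALSE)

# Rows whose ID exists only in the old data must be removed to reach df2
removed <- df1[!df1$Vessel_ID %in% df2$Vessel_ID, ]
removed$add_remove <- "remove"

# Rows whose ID exists only in the new data must be added
added <- df2[!df2$Vessel_ID %in% df1$Vessel_ID, ]
added$add_remove <- "add"

df3 <- rbind(removed, added)
rownames(df3) <- NULL
df3
```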
Thanks for the comment! That looks like it would work too. Here's another solution a friend gave me, using mostly base R (the join can be done with either dplyr or merge()):
df1$old_new <- "old"
df2$old_new <- "new"
#' Use the full_join function in the dplyr package to join both data.frames based on Name and Vessel_ID
df.comb <- dplyr::full_join(df1, df2, by = c("Name", "Vessel_ID"))
#' If you want to go fully base, you can use the merge() function to get the same result.
# df.comb <- merge(df1, df2, by = c("Name", "Vessel_ID"), all = TRUE, sort = FALSE)
#' Create a new column that sets the 'status' of a row
#' If old_new.x is NA, that row came from df2, so it is "new"
df.comb$status[is.na(df.comb$old_new.x)] <- "new"
# If old_new.x is not NA and old_new.y is NA then that row was in df1, but isn't in df2, so it has been "deleted"
df.comb$status[!is.na(df.comb$old_new.x) & is.na(df.comb$old_new.y)] <- "deleted"
# If old_new.x is not NA and old_new.y is not NA then that row was in both df1 and df2 = "same"
df.comb$status[!is.na(df.comb$old_new.x) & !is.na(df.comb$old_new.y)] <- "same"
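# only keep the columns you need; special_NO exists in both inputs, so the join
# adds suffixes to it (special_NO.x is the value from df1)
df.comb <- df.comb[, c("Name", "Vessel_ID", "special_NO.x", "status")]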
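For the follow-up raised in the question (copying special_NO from df1 into df2 where the Vessel_ID matches), one possible base R sketch uses match(). It assumes Vessel_ID is unique within df1, as the question states.

```r
df1 <- data.frame(Name = c("Vessel1", "Vessel2", "Vessel3", "Vessel4", "Vessel5"),
                  Vessel_ID = c("1", "2", "3", "4", "5"),
                  special_NO = c(10, 20, 30, 40, 50),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Name = c("Vessel1", "x", "y", "Vessel3", "x", "Vessel6"),
                  Vessel_ID = c("1", "6", "7", "3", "5", "10"),
                  special_NO = NA,
                  stringsAsFactors = FALSE)

# For each row of df2, find the position of its ID in df1 (NA if absent)
idx <- match(df2$Vessel_ID, df1$Vessel_ID)

# Indexing with NA yields NA, so unmatched rows keep a missing special_NO
df2$special_NO <- df1$special_NO[idx]
df2
```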

Subsetting data in R for repeated values

I have two data frames: df1 has 300000 rows and df2 has 100000 rows. A few values in df1 are repeated (as can be seen from the dimensions of the data), as I have to do a graphical analysis on the data. df2 contains metadata for the values in the rows of df1.
dput(df1[1:5, ])
c("ENSG00000272905.1", "ENSG00000269148.1", "ENSG00000272905.1",
"ENSG00000204581.2", "ENSG00000158486.12")
dput(df2[1:5, ])
structure(list(ensembl_gene_id = c("ENSG00000004838", "ENSG00000005206",
"ENSG00000007174", "ENSG00000009724", "ENSG00000009844"), hgnc_symbol = c("ZMYND10",
"SPPL2B", "DNAH9", "MASP2", "VTA1"), gene_biotype = c("protein_coding",
"protein_coding", "protein_coding", "protein_coding", "protein_coding"
)), row.names = c(NA, 5L), class = "data.frame")
I want to match each row in df1 and store its metadata (given in df2) in corresponding columns. My expected results are:
dput(df3[1:5, ])
c("ENSG00000000419.11 ENSG00000000419 DPM1 protein_cod",
"ENSG00000000419.11 ENSG00000000419 DPM1 protein_cod",
"ENSG00000000460.15 ENSG00000000460 C1orf112 protein_cod",
"ENSG00000000460.15 ENSG00000000460 C1orf112 protein_cod",
"ENSG00000000460.15 ENSG00000000460 C1orf112 protein_cod"
)
I tried the match function, but it returned NA because the values in column 1 of df1 have decimal (version) suffixes. I also tried the %in% operator, but that returned "Error:incorrect dimension".
What should the script look like so that I can subset my data without omitting repeated values?
merge() automatically joins the data frames on their common variable names, but you will most likely want to specify df3 <- merge(df1, df2, by = "ensembl_gene_id") to make sure that you match only on the fields you intend.
I'm always a fan of the dplyr package (part of the tidyverse).
You will likely need something like this
unique() drops duplicates:
df3 <- inner_join(unique(df1), df2, by = "ensembl_gene_id")
Alternatively, you could just filter for the desired rows:
df3 <- df2 %>% filter(ensembl_gene_id %in% pull(df1, ensembl_gene_id))
Edit: just reread the question, ignore unique. Also, the second method will drop duplicates too.
You just want df3 <- inner_join(df1, df2, by = "ensembl_gene_id")
Try the following code -
library(dplyr)
result <- df1 %>%
mutate(ensembl_gene_id = sub('\\..*', '', ensembl_gene_id)) %>%
inner_join(df2, by = 'ensembl_gene_id')
result
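The same idea can be sketched in base R with sub() and match(), which keeps every repeated row of df1. The toy data below are illustrative only: the column name ensembl_gene_id_version and the version suffixes are assumptions, since the question does not show df1's column name.

```r
# Hypothetical versioned IDs in df1, with one repeated value
df1 <- data.frame(
  ensembl_gene_id_version = c("ENSG00000004838.1", "ENSG00000005206.2",
                              "ENSG00000004838.1"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ensembl_gene_id = c("ENSG00000004838", "ENSG00000005206"),
  hgnc_symbol = c("ZMYND10", "SPPL2B"),
  gene_biotype = c("protein_coding", "protein_coding"),
  stringsAsFactors = FALSE
)

# Strip the ".<version>" suffix so the IDs are comparable with df2
df1$ensembl_gene_id <- sub("\\.[0-9]+$", "", df1$ensembl_gene_id_version)

# match() returns one df2 row index per df1 row, so repeats are preserved
idx <- match(df1$ensembl_gene_id, df2$ensembl_gene_id)
df3 <- cbind(df1, df2[idx, c("hgnc_symbol", "gene_biotype")])
rownames(df3) <- NULL
df3
```

Rows of df1 with no counterpart in df2 would get NA metadata here, whereas inner_join() would drop them.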

Filter rows in dataset for distinct words in r

Goal: To filter rows in a dataset so that only distinct words remain. At the moment, I have used inner_join to retain the rows present in both datasets, which has left duplicate rows in the result.
Attempt 1: I have tried to use distinct to retain only those rows which are unique, but this has not worked. I may be using it incorrectly.
This is my code so far; output attached in png format:
# join warriner emotion lemmas by `word` column in collocations data frame to see how many word matches there are
warriner2 <- dplyr::inner_join(warriner, coll, by = "word") # join data; retain only rows in both sets (works both ways)
warriner2 <- distinct(warriner2)
warriner2
coll2 <- dplyr::semi_join(coll, warriner, by = "word") # join all rows in a that have a match in b
# There are 8166 lemma matches (including double-ups)
# There are XXX unique lemma matches
You can try :
library(dplyr)
warriner2 <- inner_join(warriner, coll, by = "word") %>%
distinct(word, .keep_all = TRUE)
To further clarify Ronak's answer, here is an example with some mock data. Note that you can just use distinct() at the end of the pipe to keep rows that are distinct across all columns, if that's what you want. Your error might very well have occurred because you performed two operations and assigned the result to the same name both times (warriner2).
library(dplyr)
# Here's a couple sample tibbles
name <- c("cat", "dog", "parakeet")
df1 <- tibble(
x = sample(5, 99, rep = TRUE),
y = sample(5, 99, rep = TRUE),
name = rep(name, times = 33))
df2 <- tibble(
x = sample(5, 99, rep = TRUE),
y = sample(5, 99, rep = TRUE),
name = rep(name, times = 33))
# It's much less confusing if you do this in one pipe
p <- df1 %>%
inner_join(df2, by = "name") %>%
distinct()
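The difference between de-duplicating whole rows and de-duplicating on one column can be illustrated with base R's duplicated(). This is a small sketch with made-up values; the columns valence and freq are hypothetical stand-ins for whatever the joined data contains.

```r
# A joined table where one word appears twice with different freq values
joined <- data.frame(
  word = c("happy", "happy", "sad"),
  valence = c(8.5, 8.5, 2.1),
  freq = c(10, 25, 7),
  stringsAsFactors = FALSE
)

# Whole-row de-duplication (like distinct()): nothing is dropped, because
# the two "happy" rows differ in freq
all_cols_unique <- joined[!duplicated(joined), ]

# De-duplication on word only (like distinct(word, .keep_all = TRUE)):
# the first row for each word is kept
one_per_word <- joined[!duplicated(joined$word), ]

nrow(all_cols_unique)
nrow(one_per_word)
```

This may be why a bare distinct() appeared not to work in the question: rows that share a word can still differ in other columns.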

Matching columns in 2 data frames when numbers don't exactly match

How do I match two different data frames when the values I am comparing are not exactly the same?
I was thinking of using merge() but I am not sure.
Table1:
ID Value.1
10001 x
18273-9 y
12824/5/6/7 z
10283/5/9 d
Table2:
ID Value.2
10001 a
18274 b
12826 c
10289 u
How do I merge Table 1 and 2 based on ID?
Which specific function of fuzzyjoin package would I use, especially with the "/" & "-" cases? How do I expand the "-" case from 18273-9 so that R will register 18273 / 18274 / 18275 / ...?
You can write a function to extract the corresponding sequences from the strings containing "/" or "-" and recombine them into a new data.frame as follows:
df1 <- data.frame(ID=c("10001","18273-9","15273-8", "15170-4", "12824/5/6/7","10283/5/9"),
value=c("a","c","c", "d","k", "l"), stringsAsFactors = F)
df2 <- data.frame(ID=c("10001","18274","12826","10289"),
value=c("o","p","q","r"), stringsAsFactors = F)
doIt <- function(df){
listAsDF <- function(l) {
x <- stack(setNames(l, temp$value))
names(x) <- c("ID", "value")
return(x)
}
Base <- df[!grepl("\\/", df$ID) & !grepl("\\-", df$ID), ]
#1 cases when - present
temp <- df[grep("\\-", df$ID),]
temp <- listAsDF(lapply(strsplit(temp$ID, "-"), function(e) seq(e[1], paste0(strtrim(e[1], nchar(e[1])-1), e[2]), 1)))
Base <- rbind(Base, temp)
#2 cases when / present
temp <- df[grep("\\/", df$ID),]
temp <- listAsDF(lapply(strsplit(temp$ID, "/"), function(a) c(a[1], paste0(strtrim(a[1], nchar(a[1])-1), a[-1]))))
Base <- rbind(Base, temp)
return(Base)
}
Then you can merge df2 and df1:
merge(doIt(df1), df2, by = "ID", all.x = T)
Hope this helps!
You could use the fuzzy string matching function "agrep" from base R.
df1 <- data.frame(ID=c("10001","18273-9","12824/5/6/7","10283/5/9"),
value=c("a","c","d","k"))
df2 <- data.frame(ID=c("10001","18274","12826","10289"),
value=c("o","p","q","r"))
apply(df1, 1, function(x) agrep(x["ID"], df2$ID, max = 3.5))
As you see it struggles to find the match for row 4. So it might make sense to clean your ID variable (e.g., take out the "/") before running agrep.
One option is to extract the part of ID you want to keep, and then do your merge.
You can format your ID column as follows:
library(stringr)
library(dplyr)
If you want only the digits before any symbols:
Table1 %>% mutate(ID = str_extract(ID, "[0-9]+"))
If you want to keep the first sequence of 5 digits:
Table1 %>% mutate(ID = str_extract(ID, "[0-9]{5}"))
This answers your second question, but does not use the fuzzyjoin package
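To expand the "-" and "/" cases from the question before an exact merge(), one possible sketch in base R follows. The helper expand_id is hypothetical, and it assumes that each suffix replaces as many trailing digits of the first ID as the suffix itself has (so "18273-9" means 18273 through 18279).

```r
# Hypothetical helper: expand one compact ID into all IDs it stands for
expand_id <- function(id) {
  sep <- if (grepl("-", id, fixed = TRUE)) "-" else "/"
  parts <- strsplit(id, sep, fixed = TRUE)[[1]]
  if (length(parts) == 1) return(id)  # plain ID, nothing to expand
  # Rebuild each suffix into a full ID by reusing the leading digits
  prefix <- substring(parts[1], 1, nchar(parts[1]) - nchar(parts[-1]))
  full <- paste0(prefix, parts[-1])
  if (sep == "-") {
    as.character(seq(as.integer(parts[1]), as.integer(full[1])))  # a range
  } else {
    c(parts[1], full)  # an explicit list of alternatives
  }
}

Table1 <- data.frame(ID = c("10001", "18273-9", "12824/5/6/7", "10283/5/9"),
                     Value.1 = c("x", "y", "z", "d"), stringsAsFactors = FALSE)
Table2 <- data.frame(ID = c("10001", "18274", "12826", "10289"),
                     Value.2 = c("a", "b", "c", "u"), stringsAsFactors = FALSE)

# One row per expanded ID, carrying the original Value.1 along
expanded <- lapply(Table1$ID, expand_id)
long <- data.frame(ID = unlist(expanded),
                   Value.1 = rep(Table1$Value.1, lengths(expanded)),
                   stringsAsFactors = FALSE)

merged <- merge(long, Table2, by = "ID")
merged
```

Once the IDs are expanded, an ordinary exact merge is enough, so no fuzzy matching is needed for these two cases.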

R equivalent to SAS "merge" "by"

If you only use "merge" and "by" in SAS to merge datasets that contain several variables with the same names (besides the ID(s) that you merge by), SAS will combine these variables into one, using the value read last. This is described here: https://communities.sas.com/t5/SAS-Programming/Merge-step-overwriting-shared-vars/m-p/281542#M57117
Text from above link:
"There is a rule: whichever value was read last. But that rule is simple only when the merge is one-to-one. In that case, the value you get depends on the order in the MERGE statement:
merge a b;
by id;
The value of common variables (for a one-to-one merge) comes from data set B. SAS reads a value from data set A, then reads a value from data set B. The value from B is read last, and overwrites the value read from data set A.
If there is a mismatch, and an ID appears only in data set A but not in data set B, the value will be the one found in data set A."
How do I make R behave the same way without having to combine the rows afterwards after certain conditions? (in SAS, values are not overwritten by NAs)
library(tidyverse)
#create tibbles
df1 <- tibble(id = c(1:3), y = c("tt", "ff", "kk"))
df2 <- tibble(id = c(1,2,4), y = c(4,3,8))
df3 <- tibble(id = c(1:3), y = c(5,7,NA))
#combine the tibbles
combined_df <- list(df1, df2, df3) %>%
reduce(full_join, by = "id")
# desired output
combined_df_desired <- tibble(id = 1:4, y = c(5,7,"kk",8))
I don't know exactly what you mean by "certain conditions". There isn't a way to change the inner workings of full_join(), but you can do:
list(df1, df2, df3) %>%
reduce(full_join, by = "id") %>%
mutate_all(as.character) %>%
mutate(y = coalesce(y, y.y, y.x)) %>%
select(id, y)
# A tibble: 4 x 2
id y
<chr> <chr>
1 1 5
2 2 7
3 3 kk
4 4 8
coalesce() takes a set of columns and returns the first non-NA value for each row. You can order the columns inside the function according to your priorities.
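The same "last non-missing value wins" rule can be sketched in base R by folding the data frames together with Reduce(). This is an illustration rather than a drop-in equivalent of the SAS step: the helper merge_last_wins is hypothetical, and y is stored as character in every input (an assumption made here so the columns can be combined).

```r
df1 <- data.frame(id = 1:3, y = c("tt", "ff", "kk"), stringsAsFactors = FALSE)
df2 <- data.frame(id = c(1, 2, 4), y = c("4", "3", "8"), stringsAsFactors = FALSE)
df3 <- data.frame(id = 1:3, y = c("5", "7", NA), stringsAsFactors = FALSE)

# Hypothetical helper: full join on id, then prefer the value from the
# later data frame (b) unless it is missing
merge_last_wins <- function(a, b) {
  m <- merge(a, b, by = "id", all = TRUE, suffixes = c(".a", ".b"))
  m$y <- ifelse(is.na(m$y.b), m$y.a, m$y.b)
  m[, c("id", "y")]
}

# Fold the list left to right, so later data frames take priority
combined <- Reduce(merge_last_wins, list(df1, df2, df3))
combined
```

Resolving the shared column after each pairwise merge avoids accumulating y.x, y.y, ... columns, which can be convenient when many data frames are combined.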

Resources