Merging two dataframes based on conditions in multiple columns - r

I am trying to create a new df, call it df3, out of two other datasets:
df1 = data.frame("String" = c("a", "b", "c"), "Title" = c("A", "B", "C"), "Date" = c("2020-01-01", "2020-01-02", "2020-01-03"))
and:
df2 = data.frame("String" = c("a", "x", "y"), "Title" = c("ABCDEF", "XYZ", "YZ"), "Date" = c("2020-01-03", "2020-01-20", "2020-01-30"))
The conditions for the observations that should be matched, and form a new dataset, are:
df1$String %in% df2$String
grepl(df1$Title, df2$Title) == TRUE
df1$Date < df2$Date
What is the best way to do this kind of merging? I have tried to create an indicator along the lines of :
df1$indicator = ifelse(df1$String %in% df2$String & grepl(df1$Title, df2$Title) & df1$Date < df2$Date, 1, 0)
or
df1$indicator = ifelse(df1$String %in% df2$String & grepl(df1$Title, df2$Title[df1$String %in% df2$String]) & df1$Date < df2$Date[df1$String %in% df2$String], 1, 0)
to then use for merging, but I've been getting "longer object length is not a multiple of shorter object length" and "argument 'pattern' has length > 1 and only the first element will be used" warnings.

One way: use a cross join, then filter the result.
Note that grepl is not vectorized over its pattern argument, so I use mapply to pair each pattern with its own target.
df1 = data.frame("String" = c("a", "b", "c"), "Title" = c("A", "B", "C"), "Date" = c("2020-01-01", "2020-01-02", "2020-01-03"))
df2 = data.frame("String" = c("a", "x", "y"), "Title" = c("ABCDEF", "XYZ", "YZ"), "Date" = c("2020-01-03", "2020-01-20", "2020-01-30"))
merge(df1,df2, by=NULL, suffixes = c(".x", ".y")) |>
subset(String.x %in% String.y
& mapply(grepl, Title.x, Title.y)
& Date.x < Date.y )
#> String.x Title.x Date.x String.y Title.y Date.y
#> 1 a A 2020-01-01 a ABCDEF 2020-01-03
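To make the mapply point concrete, here is a minimal sketch with made-up patterns and targets (they are not the question's data, just an illustration). It compares what grepl effectively does with a vector of patterns, namely use only the first one, against the pairwise matching that mapply gives.

```r
# Illustrative vectors only (assumed for this sketch, not from the question)
patterns <- c("A", "X", "Q")
targets  <- c("ABCDEF", "XYZ", "YZ")

# With a pattern vector, grepl() uses only the first pattern (and warns),
# so it effectively computes this:
first_only <- grepl(patterns[1], targets)

# mapply() pairs pattern i with target i instead
pairwise <- mapply(grepl, patterns, targets, USE.NAMES = FALSE)

print(first_only)
print(pairwise)
```

The two results differ in the second position, which is exactly the case the warning "argument 'pattern' has length > 1" is hinting at.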

Related

Append text to a field based on another field's value

I want to append text to a field based on another field's value. For example:
This is the current df:
field_x <- c("A", "A", "C", "B", "B", "C")
field_y <- c("Axl", "Slash", "Duff", "Steven", "Izzy", "Dizzy")
df <- cbind(field_x, field_y)
I need to change the field_y based on field_x values so that it looks like this:
field_x <- c("A", "A", "C", "B", "B", "C")
field_y <- c("Axl (Apple)", "Slash (Apple)", "Duff (Cat)", "Steven (Ball)", "Izzy (Ball)", "Dizzy (Cat)")
So, basically if field_x has "A" then "(Apple)" is to be appended to field_y and so forth. Thanks in advance!
First note that your df is actually a matrix: when you cbind vectors, you get a matrix. So the first thing to do is convert it to a data frame.
Then it depends on whether or not you are using dplyr.
field_x <- c("A", "A", "C", "B", "B", "C")
field_y <- c("Axl", "Slash", "Duff", "Steven", "Izzy", "Dizzy")
df <- cbind(field_x, field_y)
df <- as.data.frame(df)
Without dplyr:
df <- within(df, {
s <- ifelse(field_x == "A", "Apple", ifelse(field_x == "B", "Ball", "Cat"))
field_y <- paste0(field_y, " (", s, ")")
rm(s)
})
With dplyr:
library(dplyr)
library(stringr)
library(magrittr)
df %<>%
mutate(
s = recode(field_x, "A" = "Apple", "B" = "Ball", "C" = "Cat"),
field_y = str_glue("{field_y} ({s})")) %>%
select(-s)
Another way, with case_when instead of recode:
df %<>%
mutate(
s = case_when(
field_x == "A" ~ "Apple",
field_x == "B" ~ "Ball",
field_x == "C" ~ "Cat"
),
field_y = str_glue("{field_y} ({s})")) %>%
select(-s)
Note that I create an auxiliary variable s: it's not really necessary, but it makes the code more readable.
Here is another approach:
We could create a look-up table to address the concerns of @Tim Biegeleisen in the comment section:
look_up <- data.frame(x = c("A", "B" ,"C"),
y = c("Apple", "Ball", "Cat"))
library(dplyr)
df %>%
as.data.frame() %>%
rowwise() %>%
mutate(field_y = paste0(field_y, ' (', look_up$y[look_up$x==field_x], ')'))
field_x field_y
<chr> <chr>
1 A Axl (Apple)
2 A Slash (Apple)
3 C Duff (Cat)
4 B Steven (Ball)
5 B Izzy (Ball)
6 C Dizzy (Cat)
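The look-up idea can also be sketched in base R with a named character vector, which avoids the rowwise() step because indexing by name is already vectorized. This is only a sketch; the data frame is rebuilt here so the block runs on its own, and the name look_up is reused purely for illustration.

```r
# Rebuild the example data as a data frame (not a matrix)
df <- data.frame(
  field_x = c("A", "A", "C", "B", "B", "C"),
  field_y = c("Axl", "Slash", "Duff", "Steven", "Izzy", "Dizzy"),
  stringsAsFactors = FALSE
)

# Named vector used as a look-up table: names are keys, values are labels
look_up <- c(A = "Apple", B = "Ball", C = "Cat")

# Indexing by name returns one label per row, in row order
df$field_y <- paste0(df$field_y, " (", look_up[df$field_x], ")")

print(df)
```

A key that is missing from look_up would produce "(NA)" in the output, so the table should cover every value of field_x.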

R merging Partial match

There are a lot of answers about this, but I couldn't find one that covers the problem I am dealing with.
I have 2 dataframes:
df1:
df2:
setA <- read.table("df1.txt",sep="\t", header=TRUE)
setB <- read.table("df2.txt",sep="\t", header=TRUE)
So, I want the matching rows by column value:
library(data.table)
setC <-merge(setA, setB, by.x = "name", by.y = "name", all.x = FALSE)
And I get this output:
df3:
The problem is that df1 also contains the value 1, but inside an entry separated by ";". How can I get the desired output?
Thanks!!
In future, please run dput(df1) and dput(df2) and paste the console output into your question.
Base R solution to two part question:
# First unstack the 1;7 row into two separate rows:
name_split <- strsplit(df1$name, ";")
# If the values of last vector uniquely identify each row in the dataframe:
df_ro <- data.frame(name = unlist(name_split),
last = rep(df1$last, sapply(name_split, length)),
stringsAsFactors = FALSE)
# Left join to achieve the same result as first solution
# without specifically naming each vector:
df1_ro <- merge(df1[,names(df1) != "name"], df_ro, by = "last", all.x = TRUE)
# Then perform an inner join preventing a name space collision:
df3 <- merge(df1_ro, setNames(df2, paste0(names(df2), ".x")),
by.x = "name", by.y = "name.x")
# If you wanted to perform an inner join on all intersecting columns (returning
# no results because values in last and colour are different then):
df3 <- merge(df1_ro, df2, by = intersect(names(df1_ro), names(df2)))
Data:
df1 <- data.frame(name = c("1;7", "3", "4", "5"),
last = c("p", "q", "r", "s"),
colour = c("a", "s", "d", "f"), stringsAsFactors = FALSE)
df2 <- data.frame(name = c("1", "2", "3", "4"),
last = c("a", "b", "c", "d"),
colour = c("p", "q", "r", "s"), stringsAsFactors = FALSE)
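As a variation on the unstacking step, the rows can be repeated by index rather than re-joined on last, which drops the assumption that last uniquely identifies each row. The following is a rough sketch on the same sample data; the names parts, long and merged and the suffixes are chosen here for illustration only.

```r
df1 <- data.frame(name = c("1;7", "3", "4", "5"),
                  last = c("p", "q", "r", "s"),
                  colour = c("a", "s", "d", "f"), stringsAsFactors = FALSE)
df2 <- data.frame(name = c("1", "2", "3", "4"),
                  last = c("a", "b", "c", "d"),
                  colour = c("p", "q", "r", "s"), stringsAsFactors = FALSE)

# Split "1;7" into c("1", "7"); single values stay as they are
parts <- strsplit(df1$name, ";", fixed = TRUE)

# Repeat each row once per piece, then overwrite name with the pieces
long <- df1[rep(seq_len(nrow(df1)), lengths(parts)), , drop = FALSE]
long$name <- unlist(parts)

# Inner join on name; suffixes keep the other columns apart
merged <- merge(long, df2, by = "name", suffixes = c(".df1", ".df2"))
print(merged)
```

With this data the join keeps the names "1", "3" and "4", and the "1" row carries the values from the original "1;7" entry.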
In the end I solved it with this (Python) solution:
co=open('NewFile.txt','w')
f=open('IndexFile.txt','r')
g=open('File.txt','r')
tabla1 = f.readlines()
tabla2 = g.readlines()
B=[]
for ln in tabla1:
    B = ln.split('\t')[3]
    for k, ln2 in enumerate(tabla2):
        if B in ln2.split('\t')[3]:
            xx=ln2
            print(xx)
            co.write(xx)
            break
co.close()

Exchange data.table columns with most prevalent value of columns

I have data
test = data.table(
a = c(1,1,3,4,5,6),
b = c("a", "be", "a", "c", "d", "c"),
c = rep(1, 6)
)
I wish to take the unique values of column a, store it in another data.table, and afterwards fill in the remaining columns with the most prevalent values of those remaining columns, such that my resulting data.table would be:
test2 = data.table(a = c(1,3,4,5,6), b = "a", c = 1)
Column b has equal amounts of "a" and "c", but it doesn't matter which is chosen in those cases.
Attempt so far:
test2 = unique(test, by = "a")
test2[, c("b", "c") := lapply(.SD, FUN = function(x){test2[, .N, by = x][order(-N)][1,1]}), .SDcols = c("b", "c")]
EDIT: I would preferably like a generic solution that works inside a function where I specify the column to be "uniqued", and the rest of the columns are filled with their single most prevalent value. Hence my use of lapply and .SD =)
EDIT2: as @MichaelChirico points out, how do we keep the class? With the following data.table some of the solutions do not work, although the solution of @chinsoon12 does:
test = data.table(a = c(1,1,3,4,5,6),
b = c("a", "be", "a", "c", "d", "c"),
c = rep(1, 6),
d = as.Date("2019-01-01"))
Another option:
dtmode <- function(x) x[which.max(rowid(x))]
test[, .(A=unique(A), B=dtmode(B), C=dtmode(C))]
data:
test = data.table(
A = c(1,1,3,4,5,6),
B = c("a", "be", "a", "c", "d", "c"),
C = rep(1, 6)
)
Not a clean way to do this but it works.
test = data.frame(a = c(1,1,3,4,5,6), b = c("a", "be", "a", "c", "d", "c"), c = rep(1, 6))
a = unique(test$a)
b = tail(names(sort(table(test$b))), 1)
c = tail(names(sort(table(test$c))), 1)
test2 = cbind(a,b,c)
Output is like this:
> test2
a b c
[1,] "1" "c" "1"
[2,] "3" "c" "1"
[3,] "4" "c" "1"
[4,] "5" "c" "1"
[5,] "6" "c" "1"
>
@EmreKiratli is very close to what I would do:
test[ , c(
list(a = unique(a)),
lapply(.SD, function(x) as(tail(names(sort(table(x))), 1L), class(x)))
), .SDcols = !'a']
The as(., class(x)) part is because names in R are always character, so we have to convert back to the original class of x.
You might like this better in magrittr form since it's many nested functions:
library(magrittr)
test[ , c(
list(a = unique(a)),
lapply(.SD, function(x) {
table(x) %>% sort %>% names %>% tail(1L) %>% as(class(x))
})
), .SDcols = !'a']
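Because the round trip through names() turns every value into character, another option is to pick the mode by indexing into the unique values, so the original class comes along for free. This is a base R sketch rather than any of the posted solutions; mode_value is a hypothetical helper name, and ties are resolved in favour of the value that appears first.

```r
# Hypothetical helper: most frequent value, keeping the class of x
mode_value <- function(x) {
  ux <- unique(x)                    # candidate values, original class kept
  counts <- tabulate(match(x, ux))   # how often each candidate occurs
  ux[which.max(counts)]              # first candidate with the highest count
}

b <- c("a", "be", "a", "c", "d", "c")
d <- as.Date(c("2019-01-01", "2019-01-02", "2019-01-02"))

print(mode_value(b))
print(mode_value(d))
print(class(mode_value(d)))
```

Since the result is produced with `[`, a Date column stays a Date, which is the case raised in EDIT2 of the question.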
I was able to make an OK solution, but if somebody can do it more elegantly, for example without the step of storing a list in refLevel below, please let me know! I'm very interested in learning data.table properly!
#solution:
test = data.table(a = c(1,1,3,4,5,6), b = c("a", "be", "a", "c", "d", "c"), c = rep(1, 6))
test2 = unique(test, by="a")
funPrev = function(x){unlist(as.data.table(x)[, .N, by=x][order(-N)][1,1], use.names = F)}
refLevel = lapply(test[, c("b", "c")], funPrev)
test2[, c("b", "c") := refLevel]
...and using a function (if anybody sees any unnecessary step, please let me know):
genData = function(dt, var_unique, vars_prev){
data = copy(dt)
data = unique(data, by = var_unique)
funPrev = function(x){unlist(as.data.table(x)[, .N, by=x][order(-N)][1,1], use.names = F)}
refLevel = lapply(dt[, .SD, .SDcols = vars_prev], funPrev)
data[, (vars_prev) := refLevel]
return(data)
}
test2 = genData(test, "a", c("b", "c"))
Here's another variant which one might find less sophisticated, yet more readable. It's essentially chinsoon12's rowid approach generalized for any number of columns. Also the classes are kept.
test = data.table(a = c(1,1,3,4,5,6),
b = c("a", "be", "a", "c", "d", "c"),
c = rep(1, 6),
d = as.Date("2019-01-01"))
test2 = unique(test, by = "a")
for (col in setdiff(names(test2), "a")) test2[[col]] = test2[[col]][which.max(rowid(test2[[col]]))]

how to duplicate rows by condition and replace content in R

I would like to duplicate rows of my data frame when a condition holds and then change the contents of some variables.
My original data frame is this :
df <- data.frame(id = c("x", "y", "w"), decision = c("partial", "refusal", "total"),
code = c("AAA20", "AAA61", "AAA77"), `2nd_decision` = c("total", "partial", NA),
`2nd_code` = c("BBB50", "BBB89", NA), varx = c("a", "v", "p"))
id  decision  code   2nd_decision  2nd_code  varx
x   partial   AAA20  total         BBB50     a
y   refusal   AAA61  partial       BBB89     v
w   total     AAA77                          p
Each time 2nd_decision is "partial" or "total", I would like to duplicate the row and replace the contents of "decision" and "code" with those of "2nd_decision" and "2nd_code". The duplicated rows should no longer show the contents of "2nd_decision" and "2nd_code", and the rest of the data frame should stay as it was, like this:
id  decision  code   2nd_decision  2nd_code  varx
x   partial   AAA20  total         BBB50     a
y   refusal   AAA61  partial       BBB89     v
w   total     AAA77                          p
x   total     BBB50                          a
y   partial   BBB89                          v
Thank you in advance
Is this what you want?
df <- data.frame(id = c("x", "y", "w"), decision = c("partial", "refusal", "total"),
code = c("AAA20", "AAA61", "AAA77"), `2nd_decision` = c("total", "partial", NA),
`2nd_code` = c("BBB50", "BBB89", NA), varx = c("a", "v", "p"))
add_rows <- unique(df[, c("id", "X2nd_decision", "X2nd_code", "varx")])
colnames(add_rows) <- c("id", "decision", "code", "varx")
add_rows <- add_rows[!is.na(add_rows$decision), ]
library(plyr)
df_final <- rbind.fill(df, add_rows)
df_final
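The same idea can be sketched in base R without plyr, by copying the qualifying rows, moving the second decision into the main columns, and blanking the second-decision columns on the copies. The column names decision2 and code2 below are stand-ins assumed for this sketch (the original names start with a digit, which data.frame() would rename), and extra and result are illustrative names.

```r
df <- data.frame(id = c("x", "y", "w"),
                 decision = c("partial", "refusal", "total"),
                 code = c("AAA20", "AAA61", "AAA77"),
                 decision2 = c("total", "partial", NA),   # assumed name
                 code2 = c("BBB50", "BBB89", NA),         # assumed name
                 varx = c("a", "v", "p"),
                 stringsAsFactors = FALSE)

# Rows whose second decision is "partial" or "total" (NA never qualifies)
keep <- df$decision2 %in% c("partial", "total")

# Copy those rows and promote the second decision/code
extra <- df[keep, ]
extra$decision <- extra$decision2
extra$code <- extra$code2
extra$decision2 <- NA_character_
extra$code2 <- NA_character_

# Original rows first, duplicated rows appended underneath
result <- rbind(df, extra)
rownames(result) <- NULL
print(result)
```

Using NA_character_ keeps the blanked columns as character, so rbind() does not have to reconcile different column types.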
You can use mutate in combination with an ifelse statement.
Let's recreate your data first.
df <- data.frame(id = c("x", "y", "w", "x", "y"),
decision = c("partial", "refusal", "total", "total", "partial"),
code = c("AAA20", "AAA61", "AAA77", "BBB50", "BBB89"),
decision2 = c("total", "partial", NA, NA, NA),
varx = c("a", "v", "p", "a", "v"))
And here is the code to test the second decision and remove the unwanted variable.
library(tidyverse)
dfnew <- df %>%
mutate(code = ifelse(decision2 == "total", "BBB50",
ifelse(decision2 == "partial", "BBB89", NA))) %>%
select(-decision2)

Inner_join with two conditions and interval within interval condition

Trying to join 2 dataframes according to multiple conditions and time interval condition like in the following example:
# two sample dataframes with time intervals
library(dplyr)      # %>%, mutate, select, inner_join
library(lubridate)  # interval, %within%
df1 <- data.frame(key1 = c("a", "b", "c", "d", "e"),
key2 = c(1:5),
time1 = as.POSIXct(hms::as.hms(c("00:00:15", "00:15:15", "00:30:15", "00:40:15", "01:10:15"))),
time2 = as.POSIXct(hms::as.hms(c("00:05:15", "00:20:15", "00:35:15", "00:45:15", "01:15:15")))) %>%
mutate(t1 = interval(time1, time2)) %>%
select(key1, key2, t1)
df2 <- data.frame(key1 = c("b", "c", "a", "e", "d"),
key2 = c(2, 6, 1, 8, 5),
sam1 = as.POSIXct(hms::as.hms(c("00:21:15", "00:31:15", "00:03:15", "01:20:15", "00:43:15"))),
sam2 = as.POSIXct(hms::as.hms(c("00:23:15", "00:34:15", "00:04:15", "01:25:15", "00:44:15")))) %>%
mutate(t2 = interval(sam1, sam2)) %>%
select(key1, key2, t2)
First, columns key1 and key2 need to match, which can be done with the following (produces an error):
df <- inner_join(df1, df2, by = c("key1", "key2"))
But there is one more condition that needs to be checked when joining and that is if the interval t2 is within t1. I can do this manually like this:
df$t2 %within% df$t1
I guess the error comes from joining data frames that contain interval columns, so this might not be the right way to do it.
# desired dataframe
df <- data.frame(key1 = c("a", "b"), key2 = c(1,2), time_condition = c(TRUE, FALSE))
If the t1 is from "00:00:15" to "00:05:15" then the corresponding t2 which is "00:03:15" to "00:04:15" is going to be within the interval t1. This would result in the time_condition column which will be TRUE if t2 is within t1, and FALSE otherwise.
Using data.table, you can perform operations while joining. Here is an example
library(data.table)
df2[df1, # left join
.(time_condition = sam1 > time1 & sam2 < time2), # condition while joining
on = .(key1, key2), # keys
by = .EACHI, # check condition per join
nomatch = 0L] # make it an inner join
# key1 key2 time_condition
# 1: a 1 TRUE
# 2: b 2 FALSE
# your data generated using data.table
df1 <- data.table(key1 = c("a", "b", "c", "d", "e"),
key2 = c(1:5),
time1 = as.ITime(c("00:00:15", "00:15:15", "00:30:15", "00:40:15", "01:10:15")),
time2 = as.ITime(c("00:05:15", "00:20:15", "00:35:15", "00:45:15", "01:15:15")))
df2 <- data.table(key1 = c("b", "c", "a", "e", "d"),
key2 = c(2, 6, 1, 8, 5),
sam1 = as.ITime(c("00:21:15", "00:31:15", "00:03:15", "01:20:15", "00:43:15")),
sam2 = as.ITime(c("00:23:15", "00:34:15", "00:04:15", "01:25:15", "00:44:15")))
How about this?
library(dplyr)
library(lubridate)  # for interval()
df1 %>%
inner_join(df2, by = c("key1", "key2")) %>%
filter(sam1 >= time1 & sam1 <= time2 & sam2 >= time1 & sam2 <= time2) %>%
mutate(t1 = interval(time1, time2),
t2 = interval(sam1, sam2)) %>%
select(key1, key2, t1, t2)
Output is:
key1 key2 t1 t2
1 a 1 1970-01-01 00:00:15 UTC--1970-01-01 00:05:15 UTC 1970-01-01 00:03:15 UTC--1970-01-01 00:04:15 UTC
Sample data:
df1 <- data.frame(key1 = c("a", "b", "c", "d", "e"),
key2 = c(1:5),
time1 = as.POSIXct(hms::as.hms(c("00:00:15", "00:15:15", "00:30:15", "00:40:15", "01:10:15"))),
time2 = as.POSIXct(hms::as.hms(c("00:05:15", "00:20:15", "00:35:15", "00:45:15", "01:15:15"))))
df2 <- data.frame(key1 = c("b", "c", "a", "e", "d"),
key2 = c(2, 6, 1, 8, 5),
sam1 = as.POSIXct(hms::as.hms(c("00:21:15", "00:31:15", "00:03:15", "01:20:15", "00:43:15"))),
sam2 = as.POSIXct(hms::as.hms(c("00:23:15", "00:34:15", "00:04:15", "01:25:15", "00:44:15"))))
You can use the built-in merge() function for joins.
df = merge(df1, df2, by = c("key1", "key2"))
df = data.frame(df[,c("key1", "key2")], time_condition = df$t2 %within% df$t1)
df
# key1 key2 time_condition
#1 a 1 TRUE
#2 b 2 FALSE
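If the interval columns are what trips up the join, one hedged alternative is to keep the start and end times as plain POSIXct columns, join on the keys with merge(), and test containment with ordinary comparisons. The sketch below uses base R only; to_time is a hypothetical helper, and treating the bounds as inclusive is an assumption about how the endpoints should be handled.

```r
# Hypothetical helper: "HH:MM:SS" on a fixed reference date, in UTC
to_time <- function(x) as.POSIXct(paste("1970-01-01", x), tz = "UTC")

df1 <- data.frame(
  key1 = c("a", "b", "c", "d", "e"),
  key2 = c(1, 2, 3, 4, 5),
  time1 = to_time(c("00:00:15", "00:15:15", "00:30:15", "00:40:15", "01:10:15")),
  time2 = to_time(c("00:05:15", "00:20:15", "00:35:15", "00:45:15", "01:15:15")),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  key1 = c("b", "c", "a", "e", "d"),
  key2 = c(2, 6, 1, 8, 5),
  sam1 = to_time(c("00:21:15", "00:31:15", "00:03:15", "01:20:15", "00:43:15")),
  sam2 = to_time(c("00:23:15", "00:34:15", "00:04:15", "01:25:15", "00:44:15")),
  stringsAsFactors = FALSE
)

# Inner join on both keys, then check that [sam1, sam2] lies inside [time1, time2]
joined <- merge(df1, df2, by = c("key1", "key2"))
joined$time_condition <- joined$sam1 >= joined$time1 & joined$sam2 <= joined$time2

result <- joined[, c("key1", "key2", "time_condition")]
print(result)
```

Only the key pairs (a, 1) and (b, 2) appear in both data frames, so the result has two rows, with the condition TRUE for the first and FALSE for the second.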
Thank You
