Creating a variable based off of matching conditions in two datasets - r

I'm attempting to create a variable in one long dataset (df1) where the value in each row needs to be based off of matching some conditions in another long dataset (df2). The conditions are:
- match on "name"
- the value for df1 should consider observations for that person that occurred before the observation in df1.
- Then I need the number of rows within that subset that meet a third condition (in the data below called "condition")
I've already tried a for loop (I know, not preferred in R) over each row in 1:nrow(df1), but in my actual data df1 and df2 do not have the same number of rows, nor is one a multiple of the other, and I keep running into problems because of that.
I've also tried writing a function and applying it to df1. apply did not work because it does not accept two data frames; passing a list of data frames to lapply returned NULL values.
Here is some generic data that fits the format of the data I'm working with.
df1 <- data.frame(
name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
date_b = sample(seq(as.Date('2014/01/01'), as.Date('2019/10/01'), by="day"), 4))
df2 <- data.frame(
name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
date_a = sample(seq(as.Date('2014/01/01'), as.Date('2019/10/01'), by="day"), 4),
condition = c("A", "B", "C", "A")
)
I know the way to get the number of rows could look something like this:
num_conditions <- nrow(df2[which(df1$name == df2$name & df2$date_a < df1$date_b & df2$condition == "A"), ])
What I would like to see in df1 is a column called "num_conditions" that shows the number of observations in df2 for that person that occurred before date_b in df1 and met condition "A".
df1 should look like this:
name date_b num_conditions
John Smith 10/1/15 1
John Smith 11/15/16 0
John Smith 9/19/19 0

I'm sure there are better ways to approach this, including data.table, but here is one using dplyr:
library(dplyr)
set.seed(12)
df2 %>%
filter(condition == "A") %>%
right_join(df1, by = "name") %>%
group_by(name, date_b) %>%
filter(date_a < date_b) %>%
mutate(num_conditions = n()) %>%
right_join(df1, by = c("name", "date_b")) %>%
mutate(num_conditions = coalesce(num_conditions, 0L)) %>%
select(-c(date_a, condition)) %>%
distinct()
# A tibble: 4 x 3
# Groups: name, date_b [4]
name date_b num_conditions
<fct> <date> <int>
1 John Smith 2016-10-13 2
2 John Smith 2015-11-10 2
3 Jane Smith 2016-07-18 1
4 Jane Smith 2018-03-13 1
R> df1
name date_b
1 John Smith 2016-10-13
2 John Smith 2015-11-10
3 Jane Smith 2016-07-18
4 Jane Smith 2018-03-13
R> df2
name date_a condition
1 John Smith 2015-04-16 A
2 John Smith 2014-09-27 A
3 Jane Smith 2017-04-25 C
4 Jane Smith 2015-08-20 A

Maybe the following is what the question is asking for.
library(tidyverse)
df1 %>%
left_join(df2 %>% filter(condition == 'A'), by = 'name') %>%
filter(date_a < date_b) %>%
group_by(name) %>%
mutate(num_conditions = n()) %>%
select(-date_a, -condition) %>%
full_join(df1) %>%
mutate(num_conditions = ifelse(is.na(num_conditions), 0, num_conditions))
#Joining, by = c("name", "date_b")
## A tibble: 4 x 3
## Groups: name [2]
# name date_b num_conditions
# <fct> <date> <dbl>
#1 John Smith 2019-05-07 2
#2 John Smith 2019-02-05 2
#3 Jane Smith 2016-05-03 0
#4 Jane Smith 2018-06-23 0
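For comparison, here is a minimal base R sketch of the same per-row count. It does not depend on df1 and df2 having the same number of rows, because each row of df1 is compared against all of df2. The dates are made up (fixed rather than sampled) so the result can be checked by hand, and stringsAsFactors = FALSE is an assumption added so the name comparison behaves the same in older R versions.

```r
# Fixed example data (hypothetical dates, chosen so the counts are easy to verify)
df1 <- data.frame(
  name   = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
  date_b = as.Date(c("2016-10-13", "2015-01-10", "2016-07-18", "2015-03-13")),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  name      = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
  date_a    = as.Date(c("2015-04-16", "2014-09-27", "2017-04-25", "2015-08-20")),
  condition = c("A", "B", "C", "A"),
  stringsAsFactors = FALSE
)

# For each row i of df1, count the df2 rows with the same name,
# a strictly earlier date, and condition "A"
df1$num_conditions <- vapply(
  seq_len(nrow(df1)),
  function(i) {
    sum(df2$name == df1$name[i] &
          df2$date_a < df1$date_b[i] &
          df2$condition == "A")
  },
  integer(1)
)

df1
```

This scans df2 once per row of df1, so it is probably only suitable for small to moderate data; the join-based answers above scale differently.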

Related

How to filter/sort duplicates based on a text/factor column?

I have a data frame with a column of names and another column that holds a factor/character status (either type works). I want to filter out duplicate names. If one of the duplicates has a certain status value, I want to keep that one; if the duplicates share the same status, I don't care which one is kept.
Name Status
1. John A
2. John B
3. Sally A
4. Alex A
5. Sarah B
6. Joe A
7. Joe A
8. Sue B
9. Sue B
I want to keep the duplicate if the factor/char is set to B. If there are two As or two Bs I don't care which one it keeps.
This is the result I want:
1. John B
2. Sally A
3. Alex A
4. Sarah B
5. Joe A
6. Sue B
I have tried the following but it's still keeping both A and B for John:
Name <- c("John","John","Sally","Alex", "Sarah", "Joe", "Joe", "Sue", "Sue")
Status <- c('A', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'B')
df_reddit <- data.frame(Name,Status)
df_reddit[, 'Status'] <- as.factor(df_reddit[, 'Status'])
df_reddit$Status <- factor(df_reddit$Status, levels = c("A", "B"))
df_reddit <- df_reddit[order(df_reddit$Status),]
df_reddit[!duplicated(df_reddit[,c('Name')]),]
Any help is appreciated! Would a loop or something be better for this?
A pure dplyr approach may look like so:
library(dplyr)
df_reddit |>
add_count(Name, wt = Status == "B") |>
filter(Status == "B" | n == 0) |>
distinct(Name, Status)
#> Name Status
#> 1 John B
#> 2 Sally A
#> 3 Alex A
#> 4 Sarah B
#> 5 Joe A
#> 6 Sue B
Here's one option using dplyr and data.table:
library(dplyr)
library(data.table)
df_reddit %>%
group_by(Name) %>%
mutate(rle = data.table::rleid(Status)) %>%
filter(rle == max(rle)) %>%
select(-rle) %>%
unique() %>%
ungroup()
# A tibble: 6 × 2
Name Status
<chr> <chr>
1 John B
2 Sally A
3 Alex A
4 Sarah B
5 Joe A
6 Sue B
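The attempt in the question was close: it sorted Status ascending, so "A" came first and was the row kept by !duplicated(). A minimal base R sketch that sorts "B" first instead is shown below. The names prefer_b, sorted and deduped are made up for illustration, and the final step that restores the original row order assumes the data frame has default integer row names.

```r
df_reddit <- data.frame(
  Name   = c("John", "John", "Sally", "Alex", "Sarah", "Joe", "Joe", "Sue", "Sue"),
  Status = c("A", "B", "A", "A", "B", "A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# FALSE sorts before TRUE, so rows with Status "B" come first;
# ties keep their original relative order
prefer_b <- order(df_reddit$Status != "B")
sorted   <- df_reddit[prefer_b, ]

# Keep the first row per Name, which is a "B" row whenever one exists
deduped <- sorted[!duplicated(sorted$Name), ]

# Restore the original row order using the retained row names
deduped <- deduped[order(as.integer(rownames(deduped))), ]
deduped
```

No loop is needed; the ordering decides which duplicate survives.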

Combining multiple filters with %ilike%

I'm working on matching last names between two tables. However, there are some variations that I have to take into account. For instance, I found that "Smith" in db1 can potentially have other forms in db2:
Smith, Smith-Whatever, Smith Jr., Smith Sr., Smith III (any Roman numeral)
Lower/uppercase is also an issue.
I'm trying to implement this logic in dplyr. I found the %ilike% operator in data.table, which seems to work kind of like the SQL equivalent. I can use it like this:
match <- db2 %>%
dplyr::filter(last_name %ilike% "^smith$" | last_name %ilike% "^smith-" | last_name %ilike% "^smith .r" | last_name %ilike% "^smith [ivx]")
Of course the strings wouldn't be hardcoded but rather obtained by iterating through db1. Either way, this is unwieldy.
Hence my question:
Is there a way to combine the functionality of %ilike% with something like %in% - by specifying a vector of regexes the 'ilikes' of which I would match against? Is there a smarter way of doing this?
You can combine the pattern with |. You may use grepl (or str_detect if you are using stringr).
library(dplyr)
db2 %>% filter(grepl("smith( .r|-| [ivx]|.*)", last_name, ignore.case = TRUE))
# last_name
#1 Smith
#2 Smith-Whatever
#3 Smith Jr.
#4 Smith Sr.
#5 Smith III
If you want to construct the pattern dynamically you can do -
pat <- c('smith', 'smith-', 'smith .r', 'smith [ivx]')
db2 %>% dplyr::filter(grepl(paste0(pat, collapse = "|"), last_name, ignore.case = TRUE))
Also, would it be enough to filter rows that have only 'smith' in them ?
db2 %>% filter(grepl('smith', last_name, ignore.case = TRUE))
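If substring matches such as "Blacksmith" or "Smithson" need to be excluded, the alternatives can be anchored and built from a vector of base names. The following is only a sketch: base_names, suffix and the sample rows in db2 are made up for illustration, and the base names are assumed to contain no regex metacharacters (otherwise they would need escaping first).

```r
# Hypothetical lookup values (as if taken from db1) and candidate rows
base_names <- c("smith", "jones")
db2 <- data.frame(
  last_name = c("Smith", "SMITH-Whatever", "Smith Jr.", "Smith III",
                "Smithson", "Blacksmith", "Jones Sr.", "Hubert"),
  stringsAsFactors = FALSE
)

# Allowed continuations after the base name: end of string, a hyphen,
# " Jr"/" Sr" with an optional dot, or a Roman numeral made of i, v, x
suffix <- "($|-| [js]r\\.?$| [ivx]+$)"

# One anchored pattern covering every base name
pattern <- paste0("^(", paste(base_names, collapse = "|"), ")", suffix)

matches <- db2[grepl(pattern, db2$last_name, ignore.case = TRUE), , drop = FALSE]
matches
```

The same pattern string could be passed to %ilike% or to stringr::str_detect() with a case-insensitive regex; grepl is used here only to keep the example free of extra packages.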
Using RonakShah's pat and my db2 below, ...
Filter
You might try an any operator to iterate through each pattern:
db2 %>%
filter(rowSums(sapply(pat, grepl, name)) > 0)
# name
# 1 smith
# 2 smith-something
And since data.table::%ilike% and data.table::%like% are really using grepl under the hood, this is about the same thing.
Merge/join
If your patterns are in a new frame, you can join them in with:
patdf <- data.frame(ptn = pat, num = seq_along(pat))
patdf
# ptn num
# 1 smith 1
# 2 smith- 2
# 3 smith .r 3
# 4 smith [ivx] 4
fuzzyjoin::regex_left_join(db2, patdf, by = c("name" = "ptn"))
# name ptn num
# 1 smith smith 1
# 2 jones <NA> NA
# 3 hubert <NA> NA
# 4 smith-something smith 1
# 5 smith-something smith- 2
Granted, this is multiplying rows, since it matches multiple times. This can be reduced. Let's assume your original data has a unique id field:
db2$id <- 10L + seq_len(nrow(db2))
fuzzyjoin::regex_left_join(db2, patdf, by = c("name" = "ptn")) %>%
filter(!is.na(ptn)) %>%
group_by(id) %>%
slice_min(num) %>%
ungroup()
# # A tibble: 2 x 4
# name id ptn num
# <chr> <int> <chr> <int>
# 1 smith 11 smith 1
# 2 smith-something 14 smith 1
Data
db2 <- structure(list(name = c("smith", "jones", "hubert", "smith-something")), class = "data.frame", row.names = c(NA, -4L))

How to return values from group_by in R dplyr?

Good morning,
I've got a two-column dataset that I'd like to spread to more columns based on a group_by in dplyr, but I'm not sure how.
My data looks like:
Person Case
John A
John B
Bill C
David F
I'd like to be able to transform it to the following structure:
Person Case_1 Case_2 ... Case_n
John A B
Bill C NA
David F NA
My original thought was along the lines of:
data %>%
group_by(Person) %>%
spread()
Error: Please supply column name
What's the easiest, or most R-like way to achieve this?
You should first add a case id to the dataset, which can be done with a combination of group_by and mutate:
dat = data.frame(Person = c('John', 'John', 'Bill', 'David'), Case = c('A', 'B', 'C', 'F'))
dat = dat %>% group_by(Person) %>% mutate(id = sprintf('Case_%d', row_number()))
dat %>% head()
# A tibble: 4 × 3
Person Case id
<fctr> <fctr> <chr>
1 John A Case_1
2 John B Case_2
3 Bill C Case_1
4 David F Case_1
Now you can use spread to transform the data:
dat %>% spread(Person, Case)
# A tibble: 2 × 4
id Bill David John
* <chr> <fctr> <fctr> <fctr>
1 Case_1 C F A
2 Case_2 NA NA B
You can get the structure you list above using:
res = dat %>% spread(Person, Case) %>% select(-id) %>% t() %>% as.data.frame()
names(res) = unique(dat$id)
res
Case_1 Case_2
Bill C <NA>
David F <NA>
John A B
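The transpose step can probably be avoided by spreading on the id column rather than on Person. Below is a minimal sketch using tidyr::pivot_wider(), which supersedes spread() in current tidyr; it assumes tidyr 1.0.0 or later is installed, and the object name wide is made up.

```r
library(dplyr)
library(tidyr)

dat <- data.frame(
  Person = c("John", "John", "Bill", "David"),
  Case   = c("A", "B", "C", "F"),
  stringsAsFactors = FALSE
)

wide <- dat %>%
  group_by(Person) %>%
  # Number the cases within each person: Case_1, Case_2, ...
  mutate(id = paste0("Case_", row_number())) %>%
  ungroup() %>%
  # One column per id; persons with fewer cases get NA
  pivot_wider(names_from = id, values_from = Case)

wide
```

With spread(), the equivalent call would be spread(id, Case) in place of pivot_wider().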

Group by a variable and merge row data from another column [duplicate]

I want to group my data by one column and paste the character strings from a different column into a single row. Suppose, for example, I have a data.frame A:
library(dplyr)
A <- data.frame(student = rep(c("John Smith", "Jane Smith"), 3),
variable1 = rep(c("Var1", "Var1", "Var2"), 2))
A <- arrange(A, student)
student variable1
1 Jane Smith Var1
2 Jane Smith Var1
3 Jane Smith Var2
4 John Smith Var1
5 John Smith Var2
6 John Smith Var1
But, I need to transform data.frame A into data.frame B, grouped by the student variable and pasting any variations from variable1 together:
B <- data.frame(student = c("John Smith", "Jane Smith"),
variable1 = c(paste("Var1", "Var2", sep = ","),
paste("Var1", "Var2", sep = ",")))
student variable1
1 John Smith Var1,Var2
2 Jane Smith Var1,Var2
I've attempted numerous group_by and mutate clauses from the dplyr package but haven't found success.
You can use the data.table package to do this easily, and quickly if you set student to be your key:
library(data.table)
A<-data.table(A)
setkey(A, student)
B <- A[, .(variable1 = paste(unique(variable1), collapse = ",")), by = student]
I believe you can use the aggregate function to do what you are looking for.
Is this what you are trying to do?
df=unique(A)
agg=aggregate(df$variable1, list(df$student), paste, collapse=",")
> agg
Group.1 x
1 Jane Smith Var1,Var2
2 John Smith Var1,Var2
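Since the question mentions dplyr, here is a minimal sketch of the same collapse with group_by and summarise (rather than mutate, which keeps one row per input row). The sort() call is an addition so the pasted values come out in a predictable order.

```r
library(dplyr)

A <- data.frame(
  student   = rep(c("John Smith", "Jane Smith"), 3),
  variable1 = rep(c("Var1", "Var1", "Var2"), 2),
  stringsAsFactors = FALSE
)

B <- A %>%
  group_by(student) %>%
  # One row per student: distinct values pasted into a single string
  summarise(variable1 = paste(sort(unique(variable1)), collapse = ","))

B
```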

How to subset R dataframe based on string matching?

Assume the dataframe is like below:
Name Score
1 John 10
2 John 2
3 James 5
I would like to compute the mean of all the Score values that belong to John.
You can easily perform a mean of every person's score with aggregate:
> aggregate(Score ~ Name, data=d, FUN=mean)
Name Score
1 James 5
2 John 6
Using dplyr:
For each name:
library(dplyr)
df %>%
group_by(Name) %>%
summarise(Value=mean(Score))
Name Value
1 James 5
2 John 6
Filtering:
filter(df, Name=="John") %>%
group_by(Name) %>%
summarise(Value=mean(Score))
Name Value
1 John 6
Using sqldf:
library(sqldf)
sqldf("SELECT Name, avg(Score) AS Score
FROM df GROUP BY Name")
Name Score
1 James 5
2 John 6
Filtering:
sqldf("SELECT Name, avg(Score) AS Score
FROM df
WHERE Name LIKE 'John'")
Name Score
1 John 6
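If only John's mean is needed, plain logical subsetting in base R is probably the shortest route. This sketch rebuilds the example data; john_mean and jo_mean are made-up names, and the grepl line shows pattern-based matching in case the names are not exact.

```r
df <- data.frame(
  Name  = c("John", "John", "James"),
  Score = c(10, 2, 5),
  stringsAsFactors = FALSE
)

# Exact match on the name
john_mean <- mean(df$Score[df$Name == "John"])

# Pattern match: every name starting with "Jo"
jo_mean <- mean(df$Score[grepl("^Jo", df$Name)])

john_mean
```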
