Count number of entries in another dataframe given a certain condition - r

I have the following problem: I have two dataframes. df1 contains, among other variables (not shown in the code below), a date variable. df2 has an id (referring to the id in df1), a factor variable (type), and another date.
df1 <- data.frame(id=1:5, referenceDate=c("2018-01-20","2018-02-03","2018-05-20", "2018-08-01", "2018-07-31"))
df2 <- data.frame(id=c(1,1,1,2,2,4,4,5,5), type=c("A", "A", "B", "A", "A", "B", "A", "B", "B"), dates=c("2018-01-10", "2018-01-23", "2018-01-24", "2018-05-21", "2018-05-18", "2018-06-01", "2018-09-01", "2018-07-10", "2018-07-20"))
My goal is to create a new column in df1 indicating the number of rows in df2 where (e.g.) df2$type=='A' and df2$dates occurs before df1$referenceDate.
In R I have the following solution, which gives me the number of rows where df2$type=='A'. But how can I additionally take the date into account? My idea was to first join the two tables to bring referenceDate from df1 into df2, do the counting, and then join the result back in the other direction (to get the count variable into df1). But this does not sound very elegant to me.
library(tidyverse)
reduced <- df2 %>% filter(type=='A') %>% group_by(id) %>% mutate(count=n()) %>% filter(!duplicated(id))
df1 %>% left_join(reduced[, c("id", "count")])

I think this might be what you want:
df1 <- tibble(id = 1:5,
              referenceDate = as.Date(c("2018-01-20","2018-02-03","2018-05-20", "2018-08-01", "2018-07-31")))
df2 <- tibble(id = c(1,1,1,2,2,4,4,5,5),
              type = c("A", "A", "B", "A", "A", "B", "A", "B", "B"),
              dates = as.Date(c("2018-01-10", "2018-01-23", "2018-01-24", "2018-05-21", "2018-05-18", "2018-06-01", "2018-09-01", "2018-07-10", "2018-07-20")))
df1 %>%
  left_join(
    df2 %>%
      left_join(df1, by = 'id') %>%
      filter(dates < referenceDate) %>%
      group_by(id) %>%
      count(type) %>%
      ungroup(),
    by = 'id'
  )
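The key is to join df1 onto df2 first, so that every df2 row carries its referenceDate, and then filter on that date. After filtering, count the remaining rows and join the result back to df1. Because the rows are counted by type, add type == 'A' to the filter if only that type is wanted.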
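To answer the original question directly (count only type "A" rows dated before the reference date, and show 0 rather than NA for ids without such rows), the same join, filter, count idea can be sketched as follows. This is a minimal sketch, assuming dplyr is installed; the names `counts` and `result` and the column name `count` are illustrative choices, not part of the answer above.

```r
library(dplyr)

df1 <- tibble(id = 1:5,
              referenceDate = as.Date(c("2018-01-20", "2018-02-03", "2018-05-20",
                                        "2018-08-01", "2018-07-31")))
df2 <- tibble(id = c(1, 1, 1, 2, 2, 4, 4, 5, 5),
              type = c("A", "A", "B", "A", "A", "B", "A", "B", "B"),
              dates = as.Date(c("2018-01-10", "2018-01-23", "2018-01-24",
                                "2018-05-21", "2018-05-18", "2018-06-01",
                                "2018-09-01", "2018-07-10", "2018-07-20")))

# Bring referenceDate into df2, keep only type "A" rows before that date,
# then count the surviving rows per id.
counts <- df2 %>%
  inner_join(df1, by = "id") %>%
  filter(type == "A", dates < referenceDate) %>%
  count(id, name = "count")

# Join the counts back; ids with no qualifying rows get 0 instead of NA.
result <- df1 %>%
  left_join(counts, by = "id") %>%
  mutate(count = coalesce(count, 0L))

print(result)
```

With the sample data, only id 1 has a type "A" row before its reference date, so every other id ends up with a count of 0.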

Related

How to duplicate rows with incontinuous dates in R

I need to duplicate rows so that every date is present in a dataframe whose dates are not consecutive.
Suppose this df:
df <- data.frame(date = c("2022-07-05", "2022-07-07", "2022-07-11", "2022-07-15", "2022-07-18"), letter = c("a", "b", "a", "b", "c"))
The desired output is this df_new:
df_new <- data.frame(date = c("2022-07-05", "2022-07-06",
"2022-07-07", "2022-07-08", "2022-07-09", "2022-07-10",
"2022-07-11", "2022-07-12", "2022-07-13", "2022-07-14",
"2022-07-15"),
letter = c("a", "a",
"b", "b", "b", "b",
"a", "a", "a", "a",
"c"))
Could you please help?
We could use complete() from tidyr to expand the data to every day between the minimum and maximum date, and then use fill() to replace the NA elements in 'letter' with the previous non-NA element:
library(dplyr)
library(tidyr)
df %>%
mutate(date = as.Date(date)) %>%
complete(date = seq(min(date), max(date), by = '1 day')) %>%
fill(letter)
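For comparison, the same "carry the last observed letter forward" idea can be sketched in base R with findInterval(), which maps each day of the full sequence to the most recent observed date. This is only an illustrative alternative, not the answer's method, and it assumes the dates in df are already sorted in ascending order; `all_dates` and `idx` are names introduced here.

```r
df <- data.frame(
  date = as.Date(c("2022-07-05", "2022-07-07", "2022-07-11",
                   "2022-07-15", "2022-07-18")),
  letter = c("a", "b", "a", "b", "c"),
  stringsAsFactors = FALSE
)

# Every calendar day between the first and last observed date.
all_dates <- seq(min(df$date), max(df$date), by = "1 day")

# For each day, the index of the latest observed date that is <= that day.
# Assumes df$date is sorted ascending (required by findInterval).
idx <- findInterval(as.numeric(all_dates), as.numeric(df$date))

df_new <- data.frame(date = all_dates,
                     letter = df$letter[idx],
                     stringsAsFactors = FALSE)
print(df_new)
```

Because the sequence runs to the last date in the input (2022-07-18), the result has 14 rows.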

How do I compare two columns and print non matching values in each dataframe?

I would like to compare df1 and df2 in R and find the entries that are present in df1 but not in df2, and those present in df2 but not in df1.
Input:
df1 <- data.frame(ID1 = c("d", "p", "n", "m", "c"))
df2 <- data.frame(ID2 = c("c", "b", "a", "d", "s", "p"))
Output:
nonMatch_Uniquedf1 <- data.frame(ID1 = c("n", "m"))
nonMatch_Uniquedf2 <- data.frame(ID2 = c("b", "a", "s"))
Please note that df1 and df2 have different numbers of rows.
Thank you for your help.
Here's another way of reaching the desired output using the anti_join function.
library(dplyr)
df1 %>%
  anti_join(df2,
            # Match df1$ID1 (left table) against df2$ID2 (right table)
            by = c("ID1" = "ID2"))
df2 %>%
  anti_join(df1,
            # Match df2$ID2 (left table) against df1$ID1 (right table)
            by = c("ID2" = "ID1"))
With dplyr:
library(dplyr)
df1 %>%
  filter(!ID1 %in% df2$ID2) # df1 values not in df2
df2 %>%
  filter(!ID2 %in% df1$ID1) # df2 values not in df1
Edit: with the expected output:
nonMatch_Uniquedf1 <- df1 %>%
  filter(!ID1 %in% df2$ID2) # df1 values not in df2
nonMatch_Uniquedf2 <- df2 %>%
  filter(!ID2 %in% df1$ID1) # df2 values not in df1
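If no extra package is wanted, base R's setdiff() expresses the same comparison. The sketch below is an illustrative alternative to the answers above; note that setdiff() also drops duplicate values, which makes no difference for this sample data.

```r
df1 <- data.frame(ID1 = c("d", "p", "n", "m", "c"), stringsAsFactors = FALSE)
df2 <- data.frame(ID2 = c("c", "b", "a", "d", "s", "p"), stringsAsFactors = FALSE)

# Values of df1$ID1 that never appear in df2$ID2, and vice versa.
nonMatch_Uniquedf1 <- data.frame(ID1 = setdiff(df1$ID1, df2$ID2),
                                 stringsAsFactors = FALSE)
nonMatch_Uniquedf2 <- data.frame(ID2 = setdiff(df2$ID2, df1$ID1),
                                 stringsAsFactors = FALSE)

print(nonMatch_Uniquedf1)
print(nonMatch_Uniquedf2)
```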

Creating duplicate in R

I have the following input data frame with 4 columns and 3 rows.
The time column can take values from 1 up to the value of the maturity column for that customer. I want to create additional observations for each customer until time equals maturity, with the other columns retaining their original values. Please see the links below for the input and expected output.
Input
Output
Here is a dplyr solution inspired by, but not identical to, this post.
library(dplyr)
df <- data.frame(custno = 1:3, time = 1, dept = c("A", "B", "A"))
df %>%
slice(rep(1:n(), each = 5)) %>%
group_by(custno) %>%
mutate(time = seq_along(time))
Edit
After the comments by the OP, the following seems to be better.
First, the data:
df <- data.frame(custno = 1:3, time = 1,
dept = c("A", "B", "A"),
maturity = c(5,4,6))
And the solution.
df %>%
tidyr::uncount(maturity) %>%
group_by(custno) %>%
mutate(time = seq_along(time))
We can also use slice with row_number, and rowid from data.table to renumber time within each customer:
library(dplyr)
library(data.table) # for rowid()
df %>%
  slice(rep(row_number(), maturity)) %>%
  mutate(time = rowid(custno))
data
df <- data.frame(custno = 1:3, time = 1,
dept = c("A", "B", "A"),
maturity = c(5,4,6))
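The row replication in these answers can also be sketched without any packages: rep() repeats each row index maturity times, and sequence() produces the 1..maturity counter for each customer. This is an illustrative base-R variant under the same sample data; `expanded` is a name introduced here.

```r
df <- data.frame(custno = 1:3, time = 1,
                 dept = c("A", "B", "A"),
                 maturity = c(5, 4, 6),
                 stringsAsFactors = FALSE)

# Repeat row i exactly maturity[i] times.
expanded <- df[rep(seq_len(nrow(df)), times = df$maturity), ]

# sequence(c(5, 4, 6)) yields 1:5, then 1:4, then 1:6.
expanded$time <- sequence(df$maturity)
rownames(expanded) <- NULL

print(expanded)
```

Unlike tidyr::uncount(), which drops the maturity column by default, this keeps it alongside the new time values.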

Apply t.test on a tidy format data

I have a data frame in tidy format as follows:
df <- data.frame(name = c("A", "C", "B", "A", "B", "C", "D") ,
group = c(rep("case", 3), rep("cntrl", 4)),
mean = rnorm(7, 0,1))
I would like to group the data by the two variables name and group and apply a t.test to the mean values of each category, for example a t.test of A_case vs. A_cntrl, and add the p-value to the table as the result.
Do you have any idea how I can do this using the tidyverse?
Thanks,
Here, a group-wise t.test by 'name' cannot be performed, as there is only a single observation for each name/group pair. Instead, we can compare the two groups across all names:
library(dplyr)
df %>%
summarise(ttest = list(t.test(mean[group == 'case'],
mean[group == 'cntrl']))) %>%
pull(ttest)
Update
If we need to create a column, use mutate
df %>%
mutate(pval = t.test(mean[group == 'case'],
mean[group == 'cntrl'])$p.value)
Or reshape to 'wide' format and then do the t.test on the columns
library(tidyr)
df %>%
pivot_wider(names_from = group, values_from = mean) %>%
summarise(ttest = list(t.test(case, cntrl))) %>%
pull(ttest)
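A per-name test does become possible once each name has several observations in both groups. The sketch below illustrates this with simulated data: the data frame `sim`, its five replicates per name and group, and the seed are all hypothetical and not taken from the question.

```r
library(dplyr)

set.seed(1)
# Hypothetical data: 5 replicates for every name/group combination.
sim <- data.frame(
  name  = rep(c("A", "B", "C"), each = 10),
  group = rep(rep(c("case", "cntrl"), each = 5), times = 3),
  mean  = rnorm(30),
  stringsAsFactors = FALSE
)

# One Welch t-test per name, comparing case against cntrl.
pvals <- sim %>%
  group_by(name) %>%
  summarise(pval = t.test(mean[group == "case"],
                          mean[group == "cntrl"])$p.value)

print(pvals)
```

The result has one row per name, which can be joined back to the original table with left_join(sim, pvals, by = "name") if a pval column is wanted there.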

tidyr::fill() with sequential integers rather than a repeated value

After grouping by id I wish to replace the NAs in dist_from_top with sequential values such that dist_from_top becomes c(5,4,3,2,1,5,4,3,2). I am using the one dist_from_top value within each id grouping as a seed of sorts to fill in the values of dist_from_top that are above and below.
tidyr::fill() can fill in the same value throughout the grouping, but I can't think of a way to make it increase and decrease by 1 as it fills. Any help is greatly appreciated.
library(dplyr)
library(tidyr)
df <-
tribble(
~id, ~mgr, ~dist_from_top,
"A", "B", NA,
"A", "C", NA,
"A", "D", 3,
"A", "E", NA,
"A", "F", NA,
"B", "C", NA,
"B", "D", 4,
"B", "E", NA,
"B", "F", NA
)
An "almost there" solution using fill()
df %>%
group_by(id) %>%
fill(dist_from_top, .direction = "up") %>%
fill(dist_from_top, .direction = "down")
Create a column that counts downwards in each group, from any starting point:
... %>% mutate(rn = -row_number())
Add the offset that is defined by the difference between dist_from_top and rn for the one row where dist_from_top is not NA:
... %>% mutate(dist_from_top = rn + max(dist_from_top - rn, na.rm = TRUE))
This uses max() merely to pick one value, assuming there is only one value that isn't NA.
Both mutate() operations operate on groups:
df %>%
group_by(id) %>%
mutate(rn = ...) %>%
mutate(dist_from_top = ...) %>%
ungroup() %>%
select(-rn)
If there is an all-NA group, you'll see a warning.
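Putting the two mutate() expressions from above into the pipeline gives the following runnable sketch. The data is the question's example rebuilt with data.frame(), and `result` is a name introduced here.

```r
library(dplyr)

df <- data.frame(
  id = c(rep("A", 5), rep("B", 4)),
  mgr = c("B", "C", "D", "E", "F", "C", "D", "E", "F"),
  dist_from_top = c(NA, NA, 3, NA, NA, NA, 4, NA, NA),
  stringsAsFactors = FALSE
)

result <- df %>%
  group_by(id) %>%
  # Counts downwards within each group: -1, -2, -3, ...
  mutate(rn = -row_number()) %>%
  # Shift the counter so it passes through the one known value;
  # assumes exactly one non-NA dist_from_top per group.
  mutate(dist_from_top = rn + max(dist_from_top - rn, na.rm = TRUE)) %>%
  ungroup() %>%
  select(-rn)

print(result)
```

For group A the known value 3 sits in row 3, so the offset is 3 - (-3) = 6 and the column becomes 5, 4, 3, 2, 1; group B works the same way and yields 5, 4, 3, 2.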
