Subsetting a dataframe conditional on minimum difference between 2 dates - r

I have a dataframe with 4 variables: id, measurement, date_a, date_b.
A single id can contribute to the df more than once. I want to subset this dataframe so that it includes only one measurement per id: the row with the minimum difference between date_b and date_a, where that minimum difference must also be at least one year. Is there a way to do this with dplyr in one line of code, rather than creating a new variable for the difference in dates?

Here's some fake data. (It's best practice to include something like this in your question to avoid ambiguity or misunderstandings about your particular situation.)
set.seed(8601)
df <- data.frame(
  id = rep(1:3, each = 5),
  measurement = "foo",
  date_a = as.Date(sample(1:3000, 15), origin = "2010-01-01")
)
df$date_b <- df$date_a + sample(1:1000, 15)
Here's a slightly-longer-than-one-line approach with dplyr:
library(dplyr)
df %>% group_by(id) %>% filter(date_b - date_a >= 365) %>% filter(date_b - date_a == min(date_b - date_a))
Result:
# A tibble: 3 x 4
# Groups:   id [3]
     id measurement date_a     date_b    
  <int> <fct>       <date>     <date>    
1     1 foo         2013-06-13 2014-11-26
2     2 foo         2014-10-05 2017-04-14
3     3 foo         2012-01-07 2014-02-11
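If your dplyr is 1.0.0 or newer (an assumption; slice_min() was added in that release), the second filter() can be swapped for slice_min(), which also guarantees exactly one row per id when two gaps tie:
df %>%
  group_by(id) %>%
  filter(date_b - date_a >= 365) %>%                    # keep gaps of at least a year
  slice_min(date_b - date_a, n = 1, with_ties = FALSE)  # single smallest gap per id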

Related

Loop through specific columns of dataframe keeping some columns as fixed

I have a large dataset whose first two columns serve as IDs (one is a proper ID and the other is a year variable). I would like to compute a count by group, looping over each variable that is not an ID. The code below shows what I want to achieve for one variable:
library(tidyverse)
df <- tibble(
  ID1 = c(rep("a", 10), rep("b", 10)),
  year = c(2001:2020),
  var1 = rnorm(20),
  var2 = rnorm(20)
)
df %>%
  select(ID1, year, var1) %>%
  filter(if_any(starts_with("var"), ~ !is.na(.))) %>%
  group_by(year) %>%
  count() %>%
  print(n = Inf)
I cannot use a loop that starts with for(i in names(df)), since I want to keep the variables "ID1" and "year". How can I run this piece of code for all the columns that start with "var"? I tried using quosures, but it did not work: I get the error "select() doesn't handle lists". I also tried to work with select(starts_with("var")), but with no success.
Many thanks!
Another possible solution:
library(tidyverse)
df %>%
  group_by(ID1) %>%
  summarise(across(starts_with("var"), ~ length(na.omit(.x))))
#> # A tibble: 2 × 3
#>   ID1    var1  var2
#>   <chr> <int> <int>
#> 1 a        10    10
#> 2 b        10    10
Alternatively, the same pipeline can be run in a loop over just the "var" columns; a sketch follows.
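A minimal sketch of that loop, assuming the goal is simply to print the per-year counts for each "var" column in turn (all_of() and the .data pronoun are tidyselect/rlang tools re-exported by dplyr):
for (i in names(df)[grepl("var", names(df))]) {
  df %>%
    select(ID1, year, all_of(i)) %>%   # keep the ID columns plus one "var" column
    filter(!is.na(.data[[i]])) %>%     # drop rows where this column is missing
    group_by(year) %>%
    count() %>%
    print(n = Inf)
}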

How to transpose character data for unique IDs

I'm trying to perform a sum function to count the number of interactions for unique IDs.
So I have something like this:
Client ID
JOE12_EMI
ABC12_CANC
ABC12_EMI
ABC12_RENE
and so on...
It'll also have a column next to it that counts how many times each unique ID repeats.
Frequency
1
2
2
1
Is there a way that I can have all the activity types (EMI, TELI, PFL) summed for each ID and then placed into new columns?
I've tried to transpose the data by separating the actual ID from the activity type, but this doesn't return the sums. I'm not sure if that's the best way, or whether I should reshape the data to wide format and then do another sum, but I'm unsure how to go about it. Thank you for any help.
separate(frequency, id, c("id", "act_code") )
nd <- melt(frequency, id=(c("id")))
Try this:
library(dplyr)
data <- data.frame(Client_ID = c("JOE12_EMI",
                                 "ABC12_CANC",
                                 "ABC12_EMI",
                                 "ABC12_RENE"),
                   frequency = c(1, 2, 2, 1))
# split the "CLIENT_ACTIVITY" strings into two separate columns
client_and_id <- as.data.frame(do.call(rbind, strsplit(as.character(data$Client_ID), "_")))
names(client_and_id) <- c("client", "id")
data <- cbind(data, client_and_id)
# sum the frequencies within each activity code (stored in 'id')
data_sum <- data %>% group_by(id) %>% mutate(sum_freq = sum(frequency))
The output:
> data_sum
# A tibble: 4 x 5
# Groups:   id [3]
  Client_ID  frequency client id    sum_freq
  <fct>          <dbl> <fct>  <fct>    <dbl>
1 JOE12_EMI          1 JOE12  EMI          3
2 ABC12_CANC         2 ABC12  CANC         2
3 ABC12_EMI          2 ABC12  EMI          3
4 ABC12_RENE         1 ABC12  RENE         1
You can also display the output by ID:
distinct(data_sum %>% dplyr::select(id, sum_freq))
# A tibble: 3 x 2
# Groups:   id [3]
  id    sum_freq
  <fct>    <dbl>
1 EMI          3
2 CANC         2
3 RENE         1
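If only that per-activity summary is needed, summarise() gets there in one step instead of mutate() plus distinct() (using the same data object as above):
data %>%
  group_by(id) %>%                       # 'id' holds the activity code here
  summarise(sum_freq = sum(frequency))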
You're on the right track; I think the only thing you need is a group_by. Something like this:
library(dplyr)
library(tidyr)
df <- data.frame(ClientID = c("JOE12_EMI",
                              "ABC12_CANC",
                              "ABC12_EMI",
                              "ABC12_RENE"))
df %>%
  separate(ClientID, into = c("id", "act_code"), sep = "_") %>%
  group_by(id) %>%
  mutate(frequency = n()) %>%
  ungroup() %>%
  group_by(id, act_code) %>%
  mutate(act_frequency = n()) %>%
  ungroup() %>%
  spread(act_code, act_frequency)
(This does the sum by user and the pivot by activity type separately; it's possible to calculate the sum by user after pivoting, but this way is easier for me to read.)
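Note that spread() has since been superseded by pivot_wider() in tidyr 1.0.0+; if you prefer the newer verb, the last step could equivalently be written as:
pivot_wider(names_from = act_code, values_from = act_frequency)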

Finding the minimum value of multiple variables by group

I would like to find the minimum value of a variable (time) at which each of several other variables equals 1 (or any other value). Basically, my application is finding the first year that x == 1, for several x. I know how to find this for one x, but would like to avoid generating multiple reduced data frames of minima and then merging them together. Is there an efficient way to do this? Here are my example data and a solution for one variable.
d <- data.frame(cat = c(rep("A", 10), rep("B", 10)),
                time = c(1:10),
                var1 = c(0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1),
                var2 = c(0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1))
library(plyr)
ddply(d[d$var1 == 1, ], .(cat), summarise,
      start = min(time))
How about this using dplyr:
d %>%
  group_by(cat) %>%
  summarise_at(vars(contains("var")), funs(time[which(. == 1)[1]]))
Which gives
# A tibble: 2 x 3
#   cat    var1  var2
#   <fct> <int> <int>
# 1 A         4     5
# 2 B         7     8
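As an aside, funs() is deprecated as of dplyr 0.8 and summarise_at() is superseded in dplyr 1.0.0; a sketch of the same logic with across():
d %>%
  group_by(cat) %>%
  summarise(across(contains("var"), ~ time[which(.x == 1)[1]]))  # first time where each var == 1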
We can use base R to get the minimum 'time' among all the columns of 'var' grouped by 'cat'
# for each level of 'cat', take the 'time' of the first row where any 'var' column equals 1
sapply(split(d[-1], d$cat), function(x)
  x$time[min(which(x[-1] == 1, arr.ind = TRUE)[, 1])])
#A B
#4 7
Is this what you are expecting?
library(dplyr)
df <- d %>%
  group_by(cat, var1, var2) %>%
  summarise(start = min(time)) %>%
  filter()
I have left the filter() argument blank so you can specify any filter condition you want (say, var1 == 1 or cat == "A").

R dplyr rowwise mutate

Good morning all, this is my first time posting on Stack Overflow. Thank you for any help!
I have 2 dataframes that I am using to analyze stock data. One dataframe has dates, among other information; we can call it df1:
library(tibble)
library(lubridate) # for ymd()
df1 <- tibble(Key = c('a','b','c'), i = 11:13, date = ymd(20110101:20110103))
The second dataframe also has dates and other important information.
df2 <- tibble(Answer = c('a','d','e','b','f','c'), j = 14:19, date = ymd(20150304:20150309))
Here is what I want to do. For each row in df1, I need to:
- Find the date in df2 that is closest to that row's date in df1, considering only the rows of df2 where df2$Answer matches df1$Key.
- Then extract the value of another column from that row of df2 and put it in a new column in df1.
The code I tried:
df1 %>%
  group_by(Key, i) %>%
  mutate(
    `New Column` = df2$j[
      which.min(subset(df2$date, df2$Answer == Key) - date)])
This has the result:
  Key       i date       `New Column`
1 a        11 2011-01-01           14
2 b        12 2011-01-02           14
3 c        13 2011-01-03           14
This is correct for the first row, a. In df2, the closest date is 2015-03-04, for which the value of j is in fact 14.
However, for the second row, Key = b, I want df2 to be subset to only look at dates for rows where df2$Answer == b. Therefore, the date should be 2015-03-07, for which j = 17.
Thank you for your help!
Jesse
This should work:
library(dplyr)
df1 %>%
  left_join(df2, by = c("Key" = "Answer")) %>%
  mutate(date_diff = abs(difftime(date.x, date.y, units = "secs"))) %>%
  group_by(Key) %>%
  arrange(date_diff) %>%
  slice(1) %>%
  ungroup()
We are first joining the two data frames with left_join. Yes, I'm aware there are possibly multiple dates for each Key; bear with me.
Next, we calculate (with mutate) the absolute value (abs) of the difference between the two dates date.x and date.y.
Now that we have this, we will group the data by Key using group_by. This will make sure that each distinct Key will be treated separately in subsequent calculations.
Since we've calculated the date_diff, we can now re-order (arrange) the data for each Key, with the smallest date_diff as first for each Key.
Finally, we are only interested in that first, smallest date_diff for each Key, so we can discard the rest using slice(1).
This pipeline gives us the following:
  Key       i date.x         j date.y     date_diff
  <chr> <int> <date>     <int> <date>     <time>   
1 a        11 2011-01-01    14 2015-03-04 131587200
2 b        12 2011-01-02    17 2015-03-07 131760000
3 c        13 2011-01-03    19 2015-03-09 131846400
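As a side note, if your dplyr is 1.0.0 or newer (an assumption), the arrange() plus slice(1) pair can be collapsed into slice_min(), which makes the intent and the tie-breaking explicit:
df1 %>%
  left_join(df2, by = c("Key" = "Answer")) %>%
  mutate(date_diff = abs(difftime(date.x, date.y, units = "secs"))) %>%
  group_by(Key) %>%
  slice_min(date_diff, n = 1, with_ties = FALSE) %>%  # single closest date per Key
  ungroup()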

Long to Wide with Non-Unique Key Combinations in R

I am trying to convert a dataset from long to wide format. I need to do this to feed the data into another program for analysis. My input data is below:
sdata <- data.frame(c(1,1,1,1,1,1,1,1,1,1,1,1,1),
                    c(1,1,1,1,1,1,1,1,1,2,2,2,2),
                    c("X1","A","B","C","D","X2","A","B","C","X1","A","B","C"),
                    c(81,31,40,5,5,100,8,90,2,50,20,24,6))
col_headings <- c("Orig","Dest","Desc","Estimate")
names(sdata) <- col_headings
Depending on the unique Orig-Dest-X1 or Orig-Dest-X2 combination above, the subcategories vary: from A,B,C, to A,B,C,D, to just A,B, and so on. I am trying to get the desired output (R code to recreate it below):
sdata_spread <- data.frame(c(1,1),c(1,2),c(81,50),c(31,20),c(40,24),c(5,6),
                           c(5,NA),c(100,NA),c(8,NA),c(90,NA),c(2,NA))
col_headings <- c("Orig","Dest","X1","X1_A","X1_B","X1_C","X1_D","X2","X2_A","X2_B","X2_C")
names(sdata_spread) <- col_headings
I tried the following:
sdata_spread <- sdata %>% spread(Desc, Estimate)
The error I got was:
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 6 rows
I also tried the accepted answers given here: "Long to wide with no unique key" and here: "Long to wide format with several duplicates. Circumvent with unique combo of columns", but neither got me the desired output.
Any insights would be much appreciated.
Thanks,
Krishnan
One option is to create a grouping variable based on whether 'Desc' starts with 'X', use it to modify 'Desc' by pasting the group's first element onto each following element via a condition in case_when, and then reshape to wide format with pivot_wider. (As of tidyr 1.0.0, spread/gather are being deprecated, and pivot_wider/pivot_longer are used in their place.)
library(dplyr)
library(tidyr)
library(stringr)
sdata %>%
  group_by(grp = cumsum(str_detect(Desc, '^X'))) %>%
  mutate(Desc = case_when(row_number() > 1 ~ str_c(first(Desc), Desc, sep = "_"),
                          TRUE ~ as.character(Desc))) %>%
  ungroup %>%
  select(-grp) %>%
  pivot_wider(names_from = Desc, values_from = Estimate)
# A tibble: 2 x 11
#   Orig  Dest    X1  X1_A  X1_B  X1_C  X1_D    X2  X2_A  X2_B  X2_C
#  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1     1     1    81    31    40     5     5   100     8    90     2
#2     1     2    50    20    24     6    NA    NA    NA    NA    NA
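To see why the cumsum(str_detect(...)) grouping works: str_detect(Desc, '^X') is TRUE exactly on the X1/X2 header rows, so the cumulative sum bumps the group id at each header and carries it through the sub-category rows that follow:
cumsum(str_detect(sdata$Desc, '^X'))
# [1] 1 1 1 1 1 2 2 2 2 3 3 3 3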
