In R: Join two dataframes based on a time period condition

Being new to R, I am trying to merge two data frames by considering a time period condition.
df1 <- data.frame("first_event" = c("4f7d", "a10a", "e79b"), "second_event" = c("9346","a839", "d939"), "device_serial" = c("123","123","123") , "start_timestamp" = c("2019-12-06 11:47:0", "2019-09-06 11:47:0", "2019-09-05 10:00:00"),"end_timestamp" = c("2020-01-10 12:59:38", "2019-11-22 12:06:28", "2019-11-22 12:06:28"), "exp_id" = NA)
df2 <- data.frame("device_serial" = c("123","123") , exp_id= c("a","b") , start_timestamp = c("2019-12-03 07:12:20", "2019-09-04 10:00:00") , end_timestamp = c("2020-01-17 00:05:10", NULL) , current_event_id = c("1", "2") ,current_event_timestamp= c("2020-01-17 00:05:09", "2020-01-17 00:05:09"))
This is a little bit difficult to explain; I will do my best to present the problem.
Basically, I am monitoring some expeditions (df2), and I want to know which events (df1) are related to a certain expedition (have a look at the exp_id column in df1: that is the column I want to fill).
Note that each expedition is created by a device and, evidently, each event is generated by a device. You may say this is feasible by joining the two tables on the device id. However, the problem is that each device can be associated with multiple expeditions.
So the objective is to work out which expedition the device was related to during a certain time period, so that we can match events with that expedition. The third row of df1 shows the difficulty with the time-period condition: given the interval in which that row was recorded, we cannot relate it to expedition a.
Here comes the other problem: sometimes the expeditions are not finished yet, so we have to fall back on the timestamp of the last event seen (current_event_timestamp in df2).
>df1
first_event second_event device_serial start_timestamp end_timestamp exp_id
4f7d 9346 123 2019-12-06 11:47:0 2020-01-10 12:59:38 NA
a10a a839 123 2019-09-06 11:47:0 2019-11-22 12:06:28 NA
e79b d939 123 "2019-09-05 10:00:00" "2019-11-22 12:06:28") NA
>df2
device_serial exp_id start_timestamp end_timestamp current_event_id current_event_timestamp
123 a 2019-12-03 07:12:20 2020-01-17 00:05:10 1 2020-01-17 00:05:09
123 b 2019-09-04 10:00:00 NULL 2 2019-11-23 12:06:28
The result that I am looking for is a table like this df3:
>df3
first_event second_event device_serial start_timestamp end_timestamp exp_id
4f7d 9346 123 2019-12-06 11:47:0 2020-01-10 12:59:38 a
a10a a839 123 2019-09-06 11:47:0 2019-11-22 12:06:28 b
e79b d939 123 "2019-09-05 10:00:00" "2019-11-22 12:06:28") b
Thanks for reading this question and helping me to solve it.

Here are some suggestions, if I understand you correctly.
First, your data, with a few edits:
- Per @r2evans' comment, I'm assuming the NULL was meant to be NA_real_.
- The current_event_timestamp in df2 in the first block of code does not match what you typed out in the second block; I used the datetime from the second block, as it led to the answer you were looking for.
df1 <- data.frame("first_event" = c("4f7d", "a10a", "e79b"),
"second_event" = c("9346","a839", "d939"),
"device_serial" = c("123","123","123") ,
"start_timestamp" = c("2019-12-06 11:47:0", "2019-09-06 11:47:0", "2019-09-05 10:00:00"),
"end_timestamp" = c("2020-01-10 12:59:38", "2019-11-22 12:06:28", "2019-11-22 12:06:28"),
"exp_id" = NA)
df2 <- data.frame("device_serial" = c("123","123") ,
exp_id= c("a","b") ,
start_timestamp = c("2019-12-03 07:12:20", "2019-09-04 10:00:00") ,
end_timestamp = c("2020-01-17 00:05:10", NA_real_) ,
current_event_id = c("1", "2") ,
current_event_timestamp= c("2020-01-17 00:05:09", "2019-11-23 12:06:28"))
Now, to tidy the data a bit. Two main points:
- It seems like the start_timestamp and end_timestamp columns in df1 refer to starts and ends of events, whereas those same column names in df2 refer to starts and ends of expeditions. If so, it's good practice to give these variables names that reflect the fact that the data they contain differ. In this case, the distinction is important when joining the two tables.
- At least in your example dfs, note that all columns were read in as factors initially. Variables are usually much easier to work with if they're stored as the type of data they represent, and this is especially true for datetime data.
library(dplyr)
library(lubridate)
df1 <- df1 %>%
as_tibble() %>% # convert to tibble; prints data type of each column
select(-exp_id, evnt_start = start_timestamp, evnt_end = end_timestamp) %>% # removing exp_id (not necessary, & messes up join) & changing names of time cols.
mutate(evnt_start = as_datetime(evnt_start), # converting time columns to datetime type
evnt_end = as_datetime(evnt_end))
df1
# A tibble: 3 x 5
first_event second_event device_serial evnt_start evnt_end
<fct> <fct> <fct> <dttm> <dttm>
1 4f7d 9346 123 2019-12-06 11:47:00 2020-01-10 12:59:38
2 a10a a839 123 2019-09-06 11:47:00 2019-11-22 12:06:28
3 e79b d939 123 2019-09-05 10:00:00 2019-11-22 12:06:28
df2 <- df2 %>%
as_tibble() %>% # convert to tibble
rename(exp_start = start_timestamp, exp_end = end_timestamp) %>% # changing names of time cols
mutate_at(.vars=c("exp_start", "exp_end", "current_event_timestamp"), ~as_datetime(.)) # converting time cols from factor into datetime type
df2
# A tibble: 2 x 6
device_serial exp_id exp_start exp_end current_event_id current_event_timestamp
<fct> <fct> <dttm> <dttm> <fct> <dttm>
1 123 a 2019-12-03 07:12:20 2020-01-17 00:05:10 1 2020-01-17 00:05:09
2 123 b 2019-09-04 10:00:00 NA 2 2019-11-23 12:06:28
Now, to try for a solution using dplyr::left_join and dplyr::filter:
df3 <- df2 %>%
mutate(exp_end_or_current = if_else(is.na(exp_end), current_event_timestamp, exp_end)) %>% #creating a new col with either exp_end OR, if NA, then current timestamp
left_join(df1, ., by = "device_serial") %>% #join df2 to df1 by serial #
filter(evnt_start > exp_start & evnt_end < exp_end_or_current) %>% #filter, keeping only records where EVENT start & end times are between expedition start & end times
select(-c(exp_end, current_event_id, current_event_timestamp))
df3
# A tibble: 3 x 8
first_event second_event device_serial evnt_start evnt_end exp_id exp_start exp_end_or_current
<fct> <fct> <fct> <dttm> <dttm> <fct> <dttm> <dttm>
1 4f7d 9346 123 2019-12-06 11:47:00 2020-01-10 12:59:38 a 2019-12-03 07:12:20 2020-01-17 00:05:10
2 a10a a839 123 2019-09-06 11:47:00 2019-11-22 12:06:28 b 2019-09-04 10:00:00 2019-11-23 12:06:28
3 e79b d939 123 2019-09-05 10:00:00 2019-11-22 12:06:28 b 2019-09-04 10:00:00 2019-11-23 12:06:28
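A side note: on newer dplyr (>= 1.1.0 — an assumption about your setup), the join-then-filter can be written directly as a non-equi join with join_by(), so the unfiltered device-level pairing is never materialized. A minimal sketch on the tidied df1/df2 from above:
library(dplyr) # join_by() needs dplyr >= 1.1.0
df3 <- df2 %>%
  mutate(exp_end_or_current = coalesce(exp_end, current_event_timestamp)) %>% # same NA fallback as above
  left_join(df1, .,
            by = join_by(device_serial,
                         evnt_start > exp_start,
                         evnt_end < exp_end_or_current)) %>%
  select(-c(exp_end, current_event_id, current_event_timestamp))
df3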

Related

R:stringr - How to locate a position of a word in a string separated by semicolons?

I'm looking for R solution to the following problem:
I have a disease registry formatted as shown below:
Patient | Diagnosis                                                                                 | Date of diagnosis 1 | ... | Date of diagnosis 47 | ... | Date of diagnosis n
ID0001  | C18.9 - Malignant neoplasm of colon [first disease mentioned]; Disease 2; ...; Disease n | 2020-01-21          | ... | ...                  | ... | ...
...     | ...                                                                                       | ...                 | ... | ...                  | ... | ...
ID18000 | [...]; C18.9 - Malignant neoplasm of colon [mentioned as 47th diagnosis out of 95]; [...] | ...                 | ... | 2005-03-04           | ... | ...
ID18001 | C18.9 - Malignant neoplasm of colon [the last of n mentioned]                             | ...                 | ... | ...                  | ... | 2011-02-11
Where, for each row (patient), there is a column with semicolon-separated disease names, and consecutive columns with the date of each diagnosis.
I want to derive from this dataset a binary variable for a particular diagnosis (for example, "colon cancer"), plus an additional column with its date. To do that, one has to know the position of the disease in the Diagnosis column, as this determines which Date column holds the matching date. As shown, the place where the disease is mentioned can vary, and therefore the column holding its date varies between patients.
My initial idea was to split the Diagnosis column into separate ones by semicolon, but considering the size of the dataset that is not optimal.
I'm wondering whether there is any function in the stringr package that could solve this without the need for a column split?
Thank you for your help!
Hopefully I've understood correctly, but here's my solution using dplyr and tidyr.
library(dplyr)
library(tidyr) # for unnest()
df <- data.frame(Patient = c("ID0001", "ID0002", "ID0003"),
                 Diagnosis = c("Disease1; Disease2; Disease3"),
                 Date_of_diagnosis1 = as.Date("2020-01-21"),
                 Date_of_diagnosis2 = as.Date("2020-01-23"),
                 Date_of_diagnosis3 = as.Date("2015-12-01"))
df %>%
  mutate(Diagnosis = strsplit(Diagnosis, ";")) %>%
  unnest(Diagnosis)
Output
Patient Diagnosis Date_of_diagnosis1 Date_of_diagnosis2 Date_of_diagnosi~
<chr> <chr> <date> <date> <date>
1 ID0001 "Disease1" 2020-01-21 2020-01-23 2015-12-01
2 ID0001 " Disease2" 2020-01-21 2020-01-23 2015-12-01
3 ID0001 " Disease3" 2020-01-21 2020-01-23 2015-12-01
4 ID0002 "Disease1" 2020-01-21 2020-01-23 2015-12-01
5 ID0002 " Disease2" 2020-01-21 2020-01-23 2015-12-01
6 ID0002 " Disease3" 2020-01-21 2020-01-23 2015-12-01
7 ID0003 "Disease1" 2020-01-21 2020-01-23 2015-12-01
8 ID0003 " Disease2" 2020-01-21 2020-01-23 2015-12-01
9 ID0003 " Disease3" 2020-01-21 2020-01-23 2015-12-01
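One nit: splitting on ";" alone keeps the space that follows each semicolon, which is why the output above shows values like " Disease2". Splitting on the semicolon plus any following whitespace avoids that; a small variation on the same pipeline:
df %>%
  mutate(Diagnosis = strsplit(Diagnosis, ";\\s*")) %>% # split on ";" plus any trailing whitespace
  unnest(Diagnosis)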
I suggest you convert your data from this embedded-and-wide format into a simpler one-diagnosis/date-per-row long format.
Your sample data is not much to work with, so here is some fake data; I hope it is somewhat representative:
dat <- data.frame(
Patient = c("ID0001","ID18000","ID18001"),
Diagnosis = c("Disease1;Disease2;Disease3", "Disease2;Disease17", "Disease1;Disease4;Disease5"),
Date_of_diagnoses1 = c("2018-01-21", "2018-01-22", "2019-01-23"),
Date_of_diagnoses2 = c("2019-02-21", "2019-02-22", "2019-02-23"),
Date_of_diagnoses3 = c("2020-03-21", NA, "2020-03-23")
)
dat
# Patient Diagnosis Date_of_diagnoses1 Date_of_diagnoses2 Date_of_diagnoses3
# 1 ID0001 Disease1;Disease2;Disease3 2018-01-21 2019-02-21 2020-03-21
# 2 ID18000 Disease2;Disease17 2018-01-22 2019-02-22 <NA>
# 3 ID18001 Disease1;Disease4;Disease5 2019-01-23 2019-02-23 2020-03-23
Using the tidyverse:
library(dplyr)
library(stringr)
library(purrr) # pmap_chr
dat %>%
tidyr::pivot_longer(-c(Patient, Diagnosis), names_to = "Sequence", values_to = "Date") %>%
filter(!is.na(Date)) %>%
mutate(
Date = as.Date(Date),
Sequence = as.integer(str_extract(Sequence, "[0-9]+$")),
Diagnosis = purrr::pmap_chr(list(strsplit(Diagnosis, ";", fixed = TRUE), Sequence), `[[`)
)
# # A tibble: 8 x 4
# Patient Diagnosis Sequence Date
# <chr> <chr> <int> <date>
# 1 ID0001 Disease1 1 2018-01-21
# 2 ID0001 Disease2 2 2019-02-21
# 3 ID0001 Disease3 3 2020-03-21
# 4 ID18000 Disease2 1 2018-01-22
# 5 ID18000 Disease17 2 2019-02-22
# 6 ID18001 Disease1 1 2019-01-23
# 7 ID18001 Disease4 2 2019-02-23
# 8 ID18001 Disease5 3 2020-03-23
Assumptions about the data:
- The number of Date_of_diagnoses# fields is always correct, i.e., there are always as many Date* columns as ;-delimited diagnoses in Diagnosis.
- The number at the end of each Date column counts correctly and can be used as an index into Diagnosis. This is not a strict requirement (it's not hard to work around), but I found it convenient to use, and it gives some more assurance that we're always pairing the correct Date with the correct Diagnosis.
- Diagnosis is perfectly formed, with no embedded semicolons that would cloud the extraction.
In general, while this does lengthen your data (perhaps significantly, depending on the number of diagnoses per patient), it also provides a much cleaner (in my opinion) view of the data: the ability to extract individual diseases much more easily.
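For what it's worth, once the data is in this long format, the original goal (a per-patient indicator plus date for one particular diagnosis) reduces to a filter and a join. A sketch, assuming the long tibble above has been saved as long (a name I'm introducing here) and that the real Diagnosis strings carry the ICD code as in the question:
long %>%
  filter(str_detect(Diagnosis, fixed("C18.9"))) %>%       # rows for the diagnosis of interest
  select(Patient, colon_cancer_date = Date) %>%
  right_join(distinct(dat, Patient), by = "Patient") %>%  # keep every patient
  mutate(colon_cancer = !is.na(colon_cancer_date))        # the binary variable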

Is x between two dates?

I have another question in the same project scope as pandas dataframe groupby datetime month; however, I fear the data structure might be too complicated, so I am trying an alternative approach. I am hoping this achieves the same result.
I am ideally looking to build a matrix of phone numbers as rows and start and end dates as columns and identify the period in which a telephone call was made.
This will be achieved by transforming a dataset of dates and phone numbers to a complete list of dates, identifying an end day match, and then seeing if the date the telephone call was made falls within that period.
The original data looks like:
Date = as.Date(c("2019-03-01", "2019-03-15","2019-03-29", "2019-04-10","2019-03-05","2019-03-20"))
Phone = c("070000001","070000001","070000001","070000001","070000002","070000002")
df<-data.frame(Date,Phone)
df
## Date Phone
## 1 2019-03-01 070000001
## 2 2019-03-15 070000001
## 3 2019-03-29 070000001
## 4 2019-04-10 070000001
## 5 2019-03-05 070000002
## 6 2019-03-20 070000002
Ideally I would want it to look like this:
## Date Phone INT_1 INT_2 INT_3 INT_4 INT_5
## 1 2019-03-01 070000001 X X X X X
## 2 2019-03-15 070000002 X X X
Where INT is a series of dates + 30 and X indicates that the telephone number appeared in that rolling period.
To do this I assume you need two datasets: the one above, of telephone numbers by date called, and a second which is the complete list of days and their +30-day counterparts.
dates <- data.frame(start = seq(as.Date("2016/7/1"), as.Date("2019/7/1"), "days"))
dates$end <- dates$start + 30
##        start        end
## 1 2016-07-01 2016-07-31
## 2 2016-07-02 2016-08-01
## 3 2016-07-03 2016-08-02
## 4 2016-07-04 2016-08-03
But how do I get the two to evaluate together? I am assuming some kind of merge and expansion of the telephone data into the date list, then spreading the dates by the row index/INT?
I think that to match the two dataframes you could use a fuzzyjoin. For example, if I define a dataframe of phone numbers and usage dates as:
library(dplyr)
library(fuzzyjoin)
fake_phone_data <- tibble(
date = as.Date(c("2019-01-03", "2019-01-27", "2019-02-12", "2019-02-25", "2019-02-26")),
phone = c("1", "1", "2", "2", "2")
)
and a dataframe of starting/ending dates (plus an ID column) as:
id_dates <- tibble(
ID = c("1", "2", "3", "4"),
starting_date = as.Date(c("2019-01-01", "2019-01-16", "2019-02-01", "2019-02-16")),
ending_date = as.Date(c("2019-01-15", "2019-01-31", "2019-02-15", "2019-02-27"))
)
then I can join the two dataframes using a fuzzyjoin, i.e. two rows are matched if the date of the phone call happens between the starting date and the end date of the corresponding period:
fuzzy_left_join(
fake_phone_data,
id_dates,
by = c(
"date" = "starting_date",
"date" = "ending_date"
),
match_fun = list(`>=`, `<`)
)
#> # A tibble: 5 x 5
#> date phone ID starting_date ending_date
#> <date> <chr> <chr> <date> <date>
#> 1 2019-01-03 1 1 2019-01-01 2019-01-15
#> 2 2019-01-27 1 2 2019-01-16 2019-01-31
#> 3 2019-02-12 2 3 2019-02-01 2019-02-15
#> 4 2019-02-25 2 4 2019-02-16 2019-02-27
#> 5 2019-02-26 2 4 2019-02-16 2019-02-27
Created on 2019-07-19 by the reprex package (v0.3.0)
Does it solve your problem?
This approach is very similar to this question.
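And if you then want the wide X-matrix sketched in the question, the join result can be spread by period ID. A sketch, assuming the fuzzy_left_join result above is saved as joined (a name I'm introducing) and tidyr >= 1.0 for pivot_wider():
library(tidyr)
joined %>%
  mutate(hit = "X") %>%
  pivot_wider(id_cols = c(date, phone),
              names_from = ID, names_prefix = "INT_",
              values_from = hit)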

How to generate a unique ID for each group based on relative date interval in R using dplyr?

I have a cohort of data with multiple person visits and want to group visits with a common ID based on person # and the time of the visit. The condition is: if an admission starts within 24 hours of the previous discharge, then I want those visits to have the same ID.
Sample of what data looks like:
dat <- data.frame(
  Person_ID = c(1, 1, 1, 2, 3, 3, 3, 4, 4),
  Admit_Date_Time = as.POSIXct(c("2017-02-07 15:26:00", "2017-04-21 10:20:00",
                                 "2017-04-22 12:12:00", "2017-10-16 01:31:00",
                                 "2017-01-24 02:41:00", "2017-01-24 05:31:00",
                                 "2017-01-28 04:26:00", "2017-12-01 01:31:00",
                                 "2017-12-01 01:31:00"),
                               format = "%Y-%m-%d %H:%M"),
  Discharge_Date_Time = as.POSIXct(c("2017-03-01 11:42:00", "2017-04-22 05:56:00",
                                     "2017-04-26 21:01:00", "2017-10-18 20:11:00",
                                     "2017-01-27 22:15:00", "2017-01-26 15:35:00",
                                     "2017-01-28 09:25:00", "2017-12-05 18:33:00",
                                     "2017-12-04 16:41:00"),
                                   format = "%Y-%m-%d %H:%M"),
  Visit_ID = c(1:9))
This is what I tried to start with:
library(dplyr)
library(lubridate)
dat1 <- dat %>%
  arrange(Person_ID, Admit_Date_Time) %>%
  group_by(Person_ID) %>%
  mutate(Previous_Visit_Interval = difftime(lag(Discharge_Date_Time, 1),
                                            Admit_Date_Time, units = "hours")) %>%
  mutate(start = c(1, Previous_Visit_Interval[-1] < hours(-24)),
         run = cumsum(start))
dat1$ID <- as.numeric(as.factor(paste0(dat1$Person_ID, dat1$run)))
Which is almost right, except it does not give the correct ID for visit 7 (person #3). Since there are three visits and the second visit is entirely within the first, and the third starts within 24 hours of the first but not the second.
There's probably a way to shorten this, but here's an approach using tidyr::gather and spread. By gathering into long format, we can track the cumulative admissions inside each visit. A new visit is recorded whenever there's a new Person_ID or that Person_ID completed a visit (cumulative admissions went to zero) at least 24 hours prior.
library(tidyr)
library(dplyr)
library(lubridate) # for ddays()
dat1 <- dat %>%
# Gather into long format with event type in one column, timestamp in another
gather(event, time, Admit_Date_Time:Discharge_Date_Time) %>%
# I want discharges to have an effect up to 24 hours later. Sort using that.
mutate(time_adj = if_else(event == "Discharge_Date_Time",
time + ddays(1),
time)) %>%
arrange(Person_ID, time_adj) %>%
# For each Person_ID, track cumulative admissions. 0 means a visit has completed.
# (b/c we sorted by time_adj, these reflect the 24hr period after discharges.)
group_by(Person_ID) %>%
mutate(admissions = if_else(event == "Admit_Date_Time", 1, -1)) %>%
mutate(admissions_count = cumsum(admissions)) %>%
ungroup() %>%
# Record a new Hosp_ID when either (a) a new Person, or (b) preceded by a
# completed visit (ie admissions_count was zero).
mutate(Hosp_ID_chg = 1 *
(Person_ID != lag(Person_ID, default = 1) | # (a)
lag(admissions_count, default = 1) == 0), # (b)
Hosp_ID = cumsum(Hosp_ID_chg)) %>%
# Spread back into original format
select(-time_adj, -admissions, -admissions_count, -Hosp_ID_chg) %>%
spread(event, time)
Results
> dat1
# A tibble: 9 x 5
Person_ID Visit_ID Hosp_ID Admit_Date_Time Discharge_Date_Time
<dbl> <int> <dbl> <dttm> <dttm>
1 1 1 1 2017-02-07 15:26:00 2017-03-01 11:42:00
2 1 2 2 2017-04-21 10:20:00 2017-04-22 05:56:00
3 1 3 2 2017-04-22 12:12:00 2017-04-26 21:01:00
4 2 4 3 2017-10-16 01:31:00 2017-10-18 20:11:00
5 3 5 4 2017-01-24 02:41:00 2017-01-27 22:15:00
6 3 6 4 2017-01-24 05:31:00 2017-01-26 15:35:00
7 3 7 4 2017-01-28 04:26:00 2017-01-28 09:25:00
8 4 8 5 2017-12-01 01:31:00 2017-12-05 18:33:00
9 4 9 5 2017-12-01 01:31:00 2017-12-04 16:41:00
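A side note: gather() and spread() used above have since been superseded by tidyr's pivot_longer()/pivot_wider(). On tidyr >= 1.0 the first reshaping step would look like this (a sketch; the rest of the pipeline is unchanged):
library(tidyr)
dat %>%
  pivot_longer(Admit_Date_Time:Discharge_Date_Time,
               names_to = "event", values_to = "time")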
Here's a data.table approach using an overlap-join
library( data.table )
library( lubridate )
setDT( dat )
setorder( dat, Person_ID, Admit_Date_Time )
#create a 1-day extension after each discharge
dt2 <- dat[, discharge_24h := Discharge_Date_Time %m+% days(1)][]
#now create id
setkey( dat, Admit_Date_Time, discharge_24h )
#create data-table with overlap-join, create groups based on overlapping ranges
dt2 <- setorder(
foverlaps( dat,
dat,
mult = "first",
type = "any",
nomatch = 0L
),
Visit_ID )[, list( Visit_ID = i.Visit_ID,
Hosp_ID = .GRP ),
by = .( Visit_ID )][, Visit_ID := NULL]
#reorder the result
setorder( dt2[ dat, on = "Visit_ID" ][, discharge_24h := NULL], Visit_ID )[]
# Visit_ID Hosp_ID Person_ID Admit_Date_Time Discharge_Date_Time
# 1: 1 1 1 2017-02-07 15:26:00 2017-03-01 11:42:00
# 2: 2 2 1 2017-04-21 10:20:00 2017-04-22 05:56:00
# 3: 3 2 1 2017-04-22 12:12:00 2017-04-26 21:01:00
# 4: 4 3 2 2017-10-16 01:31:00 2017-10-18 20:11:00
# 5: 5 4 3 2017-01-24 02:41:00 2017-01-27 22:15:00
# 6: 6 4 3 2017-01-24 05:31:00 2017-01-26 15:35:00
# 7: 7 4 3 2017-01-28 04:26:00 2017-01-28 09:25:00
# 8: 8 5 4 2017-12-01 01:31:00 2017-12-05 18:33:00
# 9: 9 5 4 2017-12-01 01:31:00 2017-12-04 16:41:00

R Function involving two for loops - baseball data

For those into sports: I am working on a function that adds a column with the pitch count for a game in a given season for a pitcher.
For example's sake, the data used is a data frame called pitcher that contains a game_date and an sv_id (the date/timestamp of the pitch). My goal is to order sv_id in ascending order within each unique game_date and then add a column numbering that order. So, for example, if for game_date = 9/9/2018 there were 3 pitches thrown with sv_ids equal to 090918_031456, 090918_031613, and 090918_031534, I would first want to sort these into chronological order (090918_031456, 090918_031534, 090918_031613) and then have a new column with the values 1, 2, 3 respectively to act as a pitch count. Below is my function so far. I originally thought I would make a list of lists, but now I am not sure that is the right way to go about this. Please help! This is also my first time posting on here, so any advice is appreciated. Thank you!
library(dplyr) # for filter()
pitchCount <- function(game_date, sv_id) {
  gameUnique <- unique(pitcher$game_date)
  PC <- list()
  for (j in 1:length(gameUnique)) {
    PCLocal <- filter(pitcher, game_date == gameUnique[j])
    PCLocal <- PCLocal[order(PCLocal$sv_id), ]
    for (i in 1:length(PCLocal$sv_id)) {
      PCLocal$PC[i] <- i
    }
    PC[[j]] <- PCLocal$PC
  }
  return(PC)
}
pitch.Count <- pitchCount(pitcher$game_date, pitcher$sv_id)
pitcher$PC <- pitch.Count
So you want to count pitches in the order they come, right? There should be no need for a loop; in R, loops are rarely needed.
Check if this is what you want: a tidyverse/dplyr solution.
The sv_id variable is in a format that can be converted to POSIX (a type of date format). This makes it simple to sort in order.
library(tidyverse)
# Create the example data frame
pitcher <- data_frame(game_date = as.Date(c("2018-09-09", "2018-09-09", "2018-09-09")),
                      sv_id = c("090918_031456", "090918_031613", "090918_031534"))
# First, convert the sv_id strings to POSIX format (this could be done in the code below, but a separate step makes it clearer).
# Note the format "%m%d%y": 090918 is month-day-year, consistent with game_date.
pitcher$sv_id <- as.POSIXct(pitcher$sv_id, format = "%m%d%y_%H%M%S", tz = "GMT")
# Create pitch count
pitcher %>%
arrange(sv_id) %>%
mutate(Count = 1, pitchcount = cumsum(Count), Count = NULL)
# A tibble: 3 x 3
game_date sv_id pitchcount
<date> <dttm> <dbl>
1 2018-09-09 2018-09-09 03:14:56 1
2 2018-09-09 2018-09-09 03:15:34 2
3 2018-09-09 2018-09-09 03:16:13 3
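If the real data spans many games, the same idea applies per game with a grouped row_number(), which also skips the helper Count column. A sketch, assuming pitcher holds the pitches for all games:
pitcher %>%
  group_by(game_date) %>%                # restart the count for every game
  arrange(sv_id, .by_group = TRUE) %>%   # chronological order within each game
  mutate(pitchcount = row_number()) %>%  # 1, 2, 3, ... per game
  ungroup()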
Try using data.table.
library(data.table)
pitcher_dt <- data.table(pitcher)
> pitcher_dt
game_date sv_id
1: 2018-01-02 090918_031456
2: 2018-01-02 090918_031613
3: 2018-01-02 090918_031534
We can add a Count column by reference with :=, ranking 'sv_id' with frank(sv_id). (order(sv_id) gives the same numbers for this particular data, but it returns sorting positions rather than ranks, so frank is the safer choice.)
pitcher_dt[, Count := frank(sv_id)]
> pitcher_dt
game_date sv_id Count
1: 2018-01-02 090918_031456 1
2: 2018-01-02 090918_031613 3
3: 2018-01-02 090918_031534 2
Since Count only records the rank of 'sv_id', in this case (1, 3, 2), we can sort the rows by either 'Count' or 'sv_id' in ascending order:
pitcher_dt[order(Count)] or pitcher_dt[order(sv_id)]
> pitcher_dt[order(Count)]
game_date sv_id Count
1: 2018-01-02 090918_031456 1
2: 2018-01-02 090918_031534 2
3: 2018-01-02 090918_031613 3
For me, it is easy to manipulate data with data.table. But, you can also use dplyr.
Introduction to data.table is a good start to learn about data.table.
I am not sure what your data looks like, but I assume the following from your description:
> df
# A tibble: 9 x 2
game_date sv_id
<date> <chr>
1 2018-09-09 090918_031456
2 2018-09-09 090918_031613
3 2018-09-09 090918_031534
4 2018-05-17 090918_031156
5 2018-05-17 090918_031213
6 2018-06-30 090918_031177
7 2018-06-30 090918_031211
8 2018-06-30 090918_031144
9 2018-06-30 090918_031203
Then you can use dplyr to generate your target:
library(dplyr)
df <- df %>%
group_by(game_date) %>%
mutate(count = n_distinct(sv_id)) %>% #count sv_id with each game_date
arrange(sv_id) # sort ascending, as in the output below
The output is:
# A tibble: 9 x 3
# Groups: game_date [3]
game_date sv_id count
<date> <chr> <int>
1 2018-06-30 090918_031144 4
2 2018-05-17 090918_031156 2
3 2018-06-30 090918_031177 4
4 2018-06-30 090918_031203 4
5 2018-06-30 090918_031211 4
6 2018-05-17 090918_031213 2
7 2018-09-09 090918_031456 3
8 2018-09-09 090918_031534 3
9 2018-09-09 090918_031613 3
I hope this helps.

Convert weekly Data frame to monthly data frame in R

My data looks like the below, with different node_desc values having weekly data for 4 years:
ID1 ID2 DATE_ value
1: 00001 436 2014-06-29 175.8164
2: 00001 436 2014-07-06 188.9264
3: 00001 436 2014-07-13 167.5376
4: 00001 436 2014-07-20 160.7907
5: 00001 436 2014-07-27 185.3018
6: 00001 436 2014-08-03 179.5748
I would like to convert the data frame to monthly. Trying the code below:
library(tidyquant)
df %>%
tq_transmute(select = c(value,ID1),
mutate_fun = apply.monthly,
FUN = mean)
But my output looks like below
DATE_ value
<dttm> <dbl>
1 2014-06-29 00:00:00 144.
2 2014-07-27 00:00:00 143.
3 2014-08-31 00:00:00 143.
4 2014-09-28 00:00:00 152.
5 2014-10-26 00:00:00 156.
6 2014-11-30 00:00:00 166.
But I would like to have ID1, ID2, Date (monthly), and value (either the mean or the max of the 4 weeks) instead of just date and value, because I have data for different ID1s over 4 years. Can someone help me in R?
Here's my take.
dta <- data.frame(id1 = rep("00001", 6), id2 = rep("436", 6),
                  date_ = as.Date(c("29jun2014", "6jul2014", "13jul2014", "20jul2014", "27jul2014", "3aug2014"), "%d%b%Y"),
                  value = c(175.8164, 188.9264, 167.5376, 160.7907, 185.3018, 179.5748))
And dplyr would do the rest. Here I summarise the data by taking the mean value per month:
library(dplyr)
library(zoo) # for as.yearmon()
my_dta <- dta %>% mutate(month_ = format(as.yearmon(date_), "%b %Y"))
my_dta %>% group_by(.dots = c("id1", "id2", "month_")) %>% summarise(mvalue = mean(value))
The problem you have is that your dataset doesn't have daily data. The apply.monthly function comes from xts, but tidyquant uses wrappers around a lot of functions so they work in a more tidy way. apply.monthly needs an xts object, which is basically a matrix with a time index.
Also know that apply.monthly returns the last available day of the month in your time series. Looking at your example set, the last day it returns for July 2014 will be the 27th. Now if you have 5 records (weeks) in a month, the mean function will run over 5 records. It will never be exactly 1 month, as weekly data never lines up exactly with calendar months.
But with tidyquant you can get sort of a monthly result with ID1 and ID2 with your data if you join the outcome with the original data. See code below. I haven't removed any unwanted columns.
df1 %>%
tq_transmute(select = c(value, ID1),
mutate_fun = apply.monthly,
FUN = mean) %>%
mutate(DATE_ = as.Date(DATE_)) %>%
inner_join(df1, by = "DATE_")
# A tibble: 3 x 5
DATE_ value.x ID1 ID2 value.y
<date> <dbl> <fct> <fct> <dbl>
1 2014-06-29 176. 00001 436 176.
2 2014-07-27 176. 00001 436 185.
3 2014-08-03 180. 00001 436 180.
data:
df1 <- data.frame(ID1 = rep("00001", 6),
ID2 = rep("436", 6),
DATE_ = as.Date(c("2014-06-29", "2014-07-06", "2014-07-13", "2014-07-20", "2014-07-27", "2014-08-03")),
value = c(175.8164,188.9264,167.5376,160.7907,185.3018,179.5748)
)
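An alternative that sidesteps xts entirely, if true calendar months are acceptable: group on lubridate::floor_date() and keep the IDs in the grouping. A sketch using df1 from above (assuming dplyr >= 1.0 for the .groups argument):
library(dplyr)
library(lubridate)
df1 %>%
  mutate(month = floor_date(DATE_, "month")) %>%   # first day of each calendar month
  group_by(ID1, ID2, month) %>%
  summarise(value = mean(value), .groups = "drop") # or max(value)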
