I have two datasets that I would like to merge together in an unusual way. One dataset is my master set that contains an identifier and a datetime relevant to that ID. An ID can appear multiple times with different dates attached to it:
> head(Master_Data)
# A tibble: 5 x 2
  ID    Date
  <chr> <dttm>
1 a     2018-03-31 00:00:00
2 a     2018-02-28 00:00:00
3 b     2018-06-07 00:00:00
4 c     2018-01-31 00:00:00
5 b     2018-02-09 00:00:00
The other dataset has the same ID, a different date and a score associated with that ID and date. IDs can also show up multiple times in this dataset as well with different dates and scores:
> head(Score_Data)
# A tibble: 6 x 3
  ID    Date                Score
  <chr> <dttm>              <dbl>
1 a     2018-01-19 00:00:00     3
2 a     2018-01-01 00:00:00     5
3 a     2018-03-05 00:00:00     7
4 b     2018-01-31 00:00:00     1
5 b     2018-08-09 00:00:00     5
6 c     2018-01-17 00:00:00    10
What I would like to do is add a column to Master_Data giving the mean Score for that ID from the Score_Data df. The tricky part is that, for each row in Master_Data, I only want to include a score in the average if its Date in Score_Data is earlier than that row's Date in Master_Data.
Example:
For row 1 in Master_Data, I would want the new column to return (3 + 5 + 7) / 3 = 5. For row 2, however, I would only want (3 + 5) / 2 = 4, since row 3 in Score_Data has a date after 2018-02-28.
Thoughts on what would be the best approach here to get this new column in Master_Data?
The solution below works for smaller data sets, but as the data grow you'll start to notice performance issues, since it filters Score_Data once per row of master_data.
library(lubridate)
library(dplyr)

master_data <- data.frame(
  ID   = c('a', 'a', 'b', 'c', 'b'),
  Date = c('2018-03-31 00:00:00',
           '2018-02-28 00:00:00',
           '2018-06-07 00:00:00',
           '2018-01-31 00:00:00',
           '2018-02-09 00:00:00'))
master_data$Date <- ymd_hms(master_data$Date)

Score_Data <- data.frame(
  ID    = c('a', 'a', 'a', 'b', 'b', 'c'),
  Date  = c('2018-01-19 00:00:00',
            '2018-01-01 00:00:00',
            '2018-03-05 00:00:00',
            '2018-01-31 00:00:00',
            '2018-08-09 00:00:00',
            '2018-01-17 00:00:00'),
  Score = c(3, 5, 7, 1, 5, 10))
Score_Data$Date <- ymd_hms(Score_Data$Date)

# For each row of master_data, keep this ID's scores that precede
# the row's date, then average them. (lapply over row indices avoids
# apply()'s coercion of the data frame to a character matrix.)
output <- lapply(seq_len(nrow(master_data)), function(i) {
  Score_Data %>%
    filter(ID == master_data$ID[i],
           Date < master_data$Date[i]) %>%
    summarise(Val = mean(Score))
})
master_data$Output <- unlist(output)
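If performance becomes an issue, a non-equi join avoids the per-row filtering entirely. A minimal sketch with data.table on the same data (`AvgScore` is just an illustrative column name):

```r
library(data.table)

master <- data.table(
  ID   = c('a', 'a', 'b', 'c', 'b'),
  Date = as.POSIXct(c('2018-03-31', '2018-02-28', '2018-06-07',
                      '2018-01-31', '2018-02-09'), tz = 'UTC'))

scores <- data.table(
  ID    = c('a', 'a', 'a', 'b', 'b', 'c'),
  Date  = as.POSIXct(c('2018-01-19', '2018-01-01', '2018-03-05',
                       '2018-01-31', '2018-08-09', '2018-01-17'), tz = 'UTC'),
  Score = c(3, 5, 7, 1, 5, 10))

# For each row of master, join to the scores of the same ID with an
# earlier Date, and average them row by row (by = .EACHI)
avg <- scores[master, on = .(ID, Date < Date), mean(Score), by = .EACHI]
master[, AvgScore := avg$V1]
```

The join condition lives in the `on` clause, so no intermediate per-row data frames are built.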
Related
I am looking to calculate a 3-month rolling sum of the values in one column of a data frame, based on the dates in another column, per product.
The newResults data frame has the columns: Product, Date, Value.
In this example, I wish to calculate the rolling 3-month sum of Value per Product. I have sorted the data frame on Product and Date.
Dataset Example:
Sample Dataset
My Code:
newResults = newResults %>%
group_by(Product) %>%
mutate(Roll_12Mth =
rollapplyr(Value, width = 1:n() - findInterval( Date %m-% months(3), date), sum)) %>%
ungroup
Error: Problem with `mutate()` input `Roll_12Mth`.
x could not find function "%m-%"
i Input `Roll_12Mth` is `rollapplyr(...)`.
Output:
Output
If the dates are always spaced 1 month apart, it is easy.
library(zoo)
library(dplyr)

dat <- data.frame(
  Date = seq(as.Date("2/1/2017", "%m/%d/%Y"),
             as.Date("1/1/2018", "%m/%d/%Y"), by = "month"),
  Product = rep(c("A", "B"), each = 6),
  Value = c(4182, 4822, 4805, 6235, 3665, 3326,
            3486, 3379, 3596, 3954, 3745, 3956))

dat %>%
  group_by(Product) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(Value = rollapplyr(Value, 3, sum, partial = TRUE))
Date Product Value
<date> <fct> <dbl>
1 2017-02-01 A 4182
2 2017-03-01 A 9004
3 2017-04-01 A 13809
4 2017-05-01 A 15862
5 2017-06-01 A 14705
6 2017-07-01 A 13226
7 2017-08-01 B 3486
8 2017-09-01 B 6865
9 2017-10-01 B 10461
10 2017-11-01 B 10929
11 2017-12-01 B 11295
12 2018-01-01 B 11655
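If the dates are not evenly spaced, the findInterval() idea from the question does work once lubridate is attached (the could not find function "%m-%" error just means lubridate was never loaded). A sketch on made-up irregular data; the per-row width counts the rows whose Date falls within the last 3 months:

```r
library(dplyr)
library(zoo)
library(lubridate)  # provides %m-%; the error in the question means it was never attached

# Made-up, irregularly spaced dates for one product
dat2 <- data.frame(
  Product = "A",
  Date = as.Date(c("2017-02-01", "2017-02-15", "2017-04-10", "2017-05-01")),
  Value = c(10, 20, 30, 40))

res <- dat2 %>%
  group_by(Product) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(Roll_3Mth = rollapplyr(
    Value,
    # per-row window: rows since 3 months before this row's Date
    width = seq_along(Date) - findInterval(Date %m-% months(3), Date),
    FUN = sum)) %>%
  ungroup()
res
```

Note the window boundary here excludes a row exactly 3 months back (findInterval counts it as outside); adjust if your definition differs.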
I have a date variable called DATE as follows:
DATE
2019-12-31
2020-01-01
2020-01-05
2020-01-09
2020-01-25
I am trying to return a result that counts how many times a date occurs in each week, where week 1 starts at the minimum of the DATE variable. So it would look something like this:
Week Count
1 3
2 1
3 0
4 1
Thanks in advance.
With base R:
dates <- c('2019-12-31','2020-01-01','2020-01-05','2020-01-09','2020-01-25')
weeks <- strftime(dates, format = "%V")
table(weeks)
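Note that %V returns ISO calendar week numbers and table() drops empty weeks, so this won't produce the `3 0` row from the desired output. To count weeks relative to the earliest date, including zero-count weeks, a base R sketch:

```r
dates <- as.Date(c('2019-12-31', '2020-01-01', '2020-01-05',
                   '2020-01-09', '2020-01-25'))

# Week number relative to the earliest date (week 1 = days 0-6, ...)
week <- as.integer(dates - min(dates)) %/% 7 + 1

# tabulate() keeps the empty weeks as zeros, unlike table()
count <- tabulate(week, nbins = max(week))
data.frame(Week = seq_len(max(week)), Count = count)
```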
We subtract the minimum DATE from each DATE to get the difference in days, divide that difference by 7 to get the week number, and count. We then use complete to fill in the missing weeks with a count of 0.
df %>%
dplyr::count(week = floor(as.integer(DATE - min(DATE))/7) + 1) %>%
tidyr::complete(week = min(week):max(week), fill = list(n = 0))
# week n
# <dbl> <dbl>
#1 1 3
#2 2 1
#3 3 0
#4 4 1
If your DATE column is not of date class, first run this :
df$DATE <- as.Date(df$DATE)
I have a strange problem with a calculation and I am not sure what I should do. I have data that look like this:
First I need to sort by ID and Date, which I did. Then I need to find the baseline date: a date can only be used as the baseline if its Duration is <= 0 and closest to 0. I then need to calculate Useable = current score / baseline score, so the final result should look like this:
What should I do? How can I find the baseline row and build "Useable" as score / baseline score?
The codes for sample data are:
ID <-c("1","1","1","1","2","2","2","2")
Date<- c("4/19/2018","7/27/2018","8/24/2018","9/21/2018","10/19/2018","12/14/2018","1/11/2019","1/24/2019")
Duration <- c("-13","-7","95","142","2","36","75","81")
score <- c("0.06","0.071","0.054","0.0258","0.0208","0.0448","0.0638","0.0227")
Sample.data <- data.frame(ID, Date, Duration, score)
The columns in 'Sample.data' are all character class because the values were quoted (this was run on R 4.0.0; before R 4.0, stringsAsFactors = TRUE was the default, so they would have been factors). We therefore use type.convert to set each column's class automatically from its values. Before arranging on 'ID' and 'Date', we convert 'Date' to Date class (in case the original data are not already in order). After grouping by 'ID', we create the new column 'Useable' with an if/else condition: if the group's first Duration is <= 0, return the scores standardized by the first 'score' value, otherwise NA.
library(dplyr)
library(lubridate)

Sample.data <- Sample.data %>%
  type.convert(as.is = TRUE) %>%
  mutate(Date = mdy(Date)) %>%
  arrange(ID, Date) %>%
  group_by(ID) %>%
  mutate(Useable = if (first(Duration) <= 0) c(NA, score[-1] / first(score))
                   else NA_real_)
Sample.data
# A tibble: 8 x 5
# Groups: ID [2]
# ID Date Duration score Useable
# <int> <date> <int> <dbl> <dbl>
#1 1 2018-04-19 -13 0.06 NA
#2 1 2018-07-27 -7 0.071 1.18
#3 1 2018-08-24 95 0.054 0.9
#4 1 2018-09-21 142 0.0258 0.43
#5 2 2018-10-19 2 0.0208 NA
#6 2 2018-12-14 36 0.0448 NA
#7 2 2019-01-11 75 0.0638 NA
#8 2 2019-01-24 81 0.0227 NA
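The code above uses the group's first row as the baseline. If you literally need the row whose Duration is <= 0 and closest to 0 (which may not be the first row), a sketch of one way to select it (`baseline` is just a helper column, and Duration/score are entered as numerics here):

```r
library(dplyr)
library(lubridate)

ID <- c("1", "1", "1", "1", "2", "2", "2", "2")
Date <- c("4/19/2018", "7/27/2018", "8/24/2018", "9/21/2018",
          "10/19/2018", "12/14/2018", "1/11/2019", "1/24/2019")
Duration <- c(-13, -7, 95, 142, 2, 36, 75, 81)
score <- c(0.06, 0.071, 0.054, 0.0258, 0.0208, 0.0448, 0.0638, 0.0227)
Sample.data <- data.frame(ID, Date, Duration, score)

res <- Sample.data %>%
  mutate(Date = mdy(Date)) %>%
  arrange(ID, Date) %>%
  group_by(ID) %>%
  mutate(
    # baseline score: the row whose Duration is <= 0 and closest to 0
    baseline = if (any(Duration <= 0))
                 score[which.max(replace(Duration, Duration > 0, -Inf))]
               else NA_real_,
    Useable = ifelse(Duration > 0, score / baseline, NA_real_)) %>%
  ungroup()
res
```

For ID 1 this picks -7 (not -13) as the baseline; for ID 2 no Duration is <= 0, so Useable stays NA throughout.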
I am new to R and I would like to ask how to transform the data set below into two outcome tables. The first should have one row per unique name, list each person's trip 1, 2, 3, 4, 5 and so on, and have the total trips as the last column and a grand-total row at the bottom.
In the second table I want the lag days between trips, with each person's average lag as the last column. Lag is the number of days between consecutive trips.
Dataset
name <- c('Mary', 'Sue', 'Peter', 'Mary', 'Mary', 'John', 'Sue', 'Peter',
'Peter', 'John', 'John', 'John', 'Mary', 'Mary')
date <- c('01/04/2018', '03/02/2017', '01/01/2019', '24/04/2017',
'02/03/2019', '31/05/2019', '08/09/2019', '17/12/2019',
'02/08/2017', '10/11/2017', '30/12/2017', '18/02/2018',
'18/02/2018', '18/10/2019')
data <- data.frame(name, date)
The desired results:
Result 1
Name   Trip 1      Trip 2      Total trips
Mary   dd/mm/yyyy  dd/mm/yyyy  2
John   dd/mm/yyyy  N/A         1
Total  2           1           3
Result 2
Name  Lag 1  Lag 2  Avg. Lag
Mary  3      4      3.5
John  5      1      3
Result 1 can be achieved by arranging the data by date (after first converting to Date format) and doing a group_by() per person to calculate the rank and count of the trips. These can then be pivoted into columns using pivot_wider() from the tidyr package (the paste0() lines are there to ensure readable column names).
For result 2, the difference in days between trips needs to be calculated with difftime(), which gives an NA for the first trip. The rest of the procedure is similar to result 1, but some columns have to be removed before the pivot.
library(dplyr)
library(tidyr)
name <- c('Mary','Sue','Peter','Mary','Mary','John','Sue','Peter','Peter','John',
'John','John','Mary','Mary')
date <- c('01/04/2018','03/02/2017','01/01/2019','24/04/2017',
'02/03/2019','31/05/2019','08/09/2019','17/12/2019',
'02/08/2017','10/11/2017','30/12/2017','18/02/2018',
'18/02/2018','18/10/2019')
data <- data.frame(name,date, stringsAsFactors = F)
data <- data %>%
  mutate(date = as.Date(date, format = '%d/%m/%Y')) %>%
  arrange(name, date) %>%
  group_by(name) %>%
  mutate(trip_nr = rank(date),
         total_trips = n()) %>%
  ungroup()

result1 <- data %>%
  mutate(trip_nr = paste0('Trip_', trip_nr)) %>%
  pivot_wider(names_from = trip_nr, values_from = date)

result2 <- data %>%
  group_by(name) %>%
  mutate(lag = difftime(date, lag(date), units = 'days'),
         lag_avg = mean(lag, na.rm = TRUE)) %>%
  ungroup() %>%
  filter(!is.na(lag)) %>%
  mutate(lag_nr = paste0('Lag_', trip_nr - 1)) %>%
  select(-date, -trip_nr, -total_trips) %>%
  pivot_wider(names_from = lag_nr, values_from = lag)
This gives the output for result1:
# A tibble: 4 x 7
name total_trips Trip_1 Trip_2 Trip_3 Trip_4 Trip_5
<chr> <int> <date> <date> <date> <date> <date>
1 John 4 2017-11-10 2017-12-30 2018-02-18 2019-05-31 NA
2 Mary 5 2017-04-24 2018-02-18 2018-04-01 2019-03-02 2019-10-18
3 Peter 3 2017-08-02 2019-01-01 2019-12-17 NA NA
4 Sue 2 2017-02-03 2019-09-08 NA NA NA
and result2:
# A tibble: 4 x 6
# Groups: name [4]
name lag_avg Lag_1 Lag_2 Lag_3 Lag_4
<chr> <drtn> <drtn> <drtn> <drtn> <drtn>
1 John 189.00 days 50 days 50 days 467 days NA days
2 Mary 226.75 days 300 days 42 days 335 days 230 days
3 Peter 433.50 days 517 days 350 days NA days NA days
4 Sue 947.00 days 947 days NA days NA days NA days
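Neither table above includes the "Total Trip" summary row from the desired output; counting the non-NA dates per Trip column gets you there. A sketch on a mock result1 of the same shape:

```r
# Mock of the result1 shape produced above (dates as Date, NA = no trip)
result1 <- data.frame(
  name   = c('John', 'Mary'),
  Trip_1 = as.Date(c('2017-11-10', '2017-04-24')),
  Trip_2 = as.Date(c('2017-12-30', NA)))

trip_cols <- grep('^Trip_', names(result1), value = TRUE)
colSums(!is.na(result1[trip_cols]))   # trips recorded in each Trip_ column
```

The resulting named vector can be bound on as a character row if you need it inside the printed table.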
library(data.table)

data$date <- as.character(data$date)
data <- data[order(as.Date(data$date, "%d/%m/%Y")), ]
data <- data.table(data)
data[, date := as.Date(date, "%d/%m/%Y")]

# trip number per person
data[, Trips := seq(.N), by = "name"]

# time difference in days between consecutive trips
# (note units = "days" must be named; the third positional argument of difftime is tz)
data[, Lag := shift(date, 1), by = "name"]
data[, diff := difftime(Lag, date, units = "days"), by = "name"]
data[, diff := abs(as.numeric(diff))]

# second summary table
data_summary_second_table <- data[, .(Avg_lag = mean(diff, na.rm = TRUE)), by = "name"]
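The data.table code above only builds the second (lag) table; the first wide table can be produced with dcast(). A sketch on a reduced sample (`Trip_` and `Total_trips` are illustrative names):

```r
library(data.table)

dt <- data.table(
  name = c('Mary', 'Sue', 'Peter', 'Mary'),
  date = as.Date(c('2018-04-01', '2017-02-03', '2019-01-01', '2017-04-24')))

setorder(dt, name, date)
dt[, trip := paste0('Trip_', seq_len(.N)), by = name]

# one row per person, one column per trip number; missing trips become NA
wide <- dcast(dt, name ~ trip, value.var = 'date')
trip_cols <- grep('^Trip_', names(wide), value = TRUE)
wide[, Total_trips := rowSums(!is.na(.SD)), .SDcols = trip_cols]
```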
I have the following dataset:
I want to measure the cumulative total at a daily level, so the result looks something like:
I can use cumsum() (e.g. inside dplyr's mutate()), but the count for "missing days" won't show up. For example, the date 1/3/18 does not exist in the original data frame. I want this missing date to be in the resulting data frame, and its cumulative sum should be the same as that of the last known date, i.e. 1/2/18, with the sum being 5.
Any help is appreciated! I am new to the language.
I'll use this second data.frame to fill out the missing dates:
daterange <- data.frame(Date = seq(min(x$Date), max(x$Date), by = "1 day"))
Base R:
transform(merge(x, daterange, all = TRUE),
Count = cumsum(ifelse(is.na(Count), 0, Count)))
# Date Count
# 1 2018-01-01 2
# 2 2018-01-02 5
# 3 2018-01-03 5
# 4 2018-01-04 5
# 5 2018-01-05 10
# 6 2018-01-06 10
# 7 2018-01-07 10
# 8 2018-01-08 11
# ...
# 32 2018-02-01 17
dplyr
library(dplyr)
x %>%
  right_join(daterange, by = "Date") %>%
  mutate(Count = cumsum(if_else(is.na(Count), 0, Count)))
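tidyr's complete() can also build the daily scaffold and the zero fill in one step, without a separate daterange data frame; a self-contained sketch on the same toy data:

```r
library(dplyr)
library(tidyr)

x <- data.frame(
  Date = as.Date(c('1/1/18', '1/2/18', '1/5/18', '1/8/18', '2/1/18'),
                 format = '%m/%d/%y'),
  Count = c(2, 3, 5, 1, 6))

res <- x %>%
  # expand to one row per day, filling absent days with Count = 0
  complete(Date = seq(min(Date), max(Date), by = '1 day'),
           fill = list(Count = 0)) %>%
  mutate(Count = cumsum(Count))
res
```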
Data:
x <- data.frame(Date = as.Date(c("1/1/18", "1/2/18", "1/5/18", "1/8/18", "2/1/18"), format="%m/%d/%y"),
Count = c(2,3,5,1,6))