Sum variable between dates in R?

How do you sum a value that occurs between two dates in R?
For example, I have two data tables: df1 has start and end dates, and df2 has values corresponding to certain dates between the start and end dates in df1. I would like to sum the values in df2 between each Start and End date in df1 and record that result in df1.
df1 <- data.frame(Start = c('1/1/20', '5/1/20', '10/1/20', '2/2/21', '3/20/21'),
                  End   = c('1/7/20', '5/7/20', '10/7/20', '2/7/21', '3/30/21'))
df2 <- data.frame(Date  = c('1/1/20', '1/3/20', '5/1/20', '5/2/20', '6/2/20',
                            '6/4/20', '10/1/20', '2/2/21', '3/20/21'),
                  value = c('1', '2', '5', '15', '20', '2', '3', '78', '100'))
I have tried following the example at this link, which covers counting between two dates in R, but I am struggling to adapt it to sum(): Sum/count between two dates in R
Thank you!

We can use a non-equi join in data.table after converting the date columns to Date class. Joining with df1 as i and aggregating with by = .EACHI returns one sum per df1 interval:
library(data.table)
library(lubridate)
setDT(df1)
setDT(df2)
df1[, value := df2[df1, on = .(Date >= Start, Date <= End),
                   sum(value), by = .EACHI]$V1]
Output:
df1
#         Start        End value
#1: 2020-01-01 2020-01-07     3
#2: 2020-05-01 2020-05-07    20
#3: 2020-10-01 2020-10-07     3
#4: 2021-02-02 2021-02-07    78
#5: 2021-03-20 2021-03-30   100
Data:
df1[] <- lapply(df1, mdy)
df2$Date <- mdy(df2$Date)
df2$value <- as.numeric(df2$value)
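For comparison, here is a base R sketch of the same interval sum (an alternative, not the answer's method); it assumes the date and numeric conversions from the data block above have already been applied:
# for each df1 row, sum the df2 values whose Date falls within [Start, End]
df1$value <- sapply(seq_len(nrow(df1)), function(i) {
  sum(df2$value[df2$Date >= df1$Start[i] & df2$Date <= df1$End[i]])
})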

Related

Join two data frames on two columns [one a datetime column] in R

I have two data frames and I'm trying to left or right join them based on two columns: ID and Datetime. How do I allow the Datetime from one df to match the other df even if it's within a 10-20 second difference range?
df1:
ID    Datetime
123   2021-04-02 09:50:11
456   2021-04-02 09:50:15
df2:
ID    Datetime
123   2021-04-02 09:50:31
456   2021-04-02 09:50:23
If the times are within a 10-20 second difference in df2, return all the columns, including the Datetime column, from df2 into a new df3. This should hold for all rows where the ID matches and the yyyy-mm-dd HH:MM part matches in both dfs; so if the difference in seconds is between 10-20 in df2, pick the row and join, and if it's not within the 10-20 second range, skip it. Someone, please help?
Your sample data is very minimalistic, and I'm not sure how you wanted to implement the 10-20 secs. I assumed everything within -20 to +20 seconds should be matched. This can easily be adjusted in the filtering part ID == i.ID & Datetime <= (i.Datetime + 20) & Datetime >= (i.Datetime - 20).
Here is a data.table approach
library(data.table)
# Sample data
DT1 <- fread("ID Datetime
123 2021-04-02T09:50:11
456 2021-04-02T09:50:15")
DT2 <- fread("ID Datetime
123 2021-04-02T09:50:31
456 2021-04-02T09:50:23")
# Set datetimes to posix
DT1[, Datetime := as.POSIXct(Datetime)]
DT2[, Datetime := as.POSIXct(Datetime)]
# possible rowwise approach
DT1[, rowid := .I]
setkey(DT1, rowid)
DT1[DT1, Datetime2 := DT2[ID == i.ID &
                          Datetime <= (i.Datetime + 20) &
                          Datetime >= (i.Datetime - 20),
                          lapply(.SD, paste0, collapse = ";"),
                          .SDcols = "Datetime"],
    by = .EACHI][, rowid := NULL][]
# ID Datetime Datetime2
# 1: 123 2021-04-02 09:50:11 2021-04-02 09:50:31
# 2: 456 2021-04-02 09:50:15 2021-04-02 09:50:23
If I understand correctly, the OP wants to retrieve those rows of df2 (including all columns) which have a matching ID in df1 and where the time difference between the Datetime stamps of df1 and df2 is less than or equal to a given value.
So, for the given sample data:
If the allowed time difference is 20 seconds at most, both rows of df2 are returned.
If the allowed time difference is 10 seconds at most, only the second row of df2 with ID == 456 is returned.
If the allowed time difference is 5 seconds at most, an empty dataset is returned because none of df2's rows fulfils the conditions.
One possible approach is to use a non-equi join which is available with data.table:
library(data.table)
timediff <- 10 # given time difference in seconds
setDT(df1)[, Datetime := as.POSIXct(Datetime)]
setDT(df2)[, Datetime := as.POSIXct(Datetime)]
df2[, c("from", "to") := .(Datetime - timediff, Datetime + timediff)]
df3 <- df2[df1, on = c("ID", "from <= Datetime", "to >= Datetime"),
           nomatch = NULL, .SD][, c("from", "to") := NULL][]
df3
ID Datetime
1: 456 2021-04-02 09:50:23
If the code is run with
timediff <- 20
the result is
df3
ID Datetime
1: 123 2021-04-02 09:50:31
2: 456 2021-04-02 09:50:23
If the code is run with
timediff <- 5
df3 becomes an empty data.table.
EDIT: Show Datetime from df1 and df2
By request of the OP, here is a version which returns both Datetime columns from df1 and df2, renamed Datetime1 and Datetime2, respectively:
library(data.table)
timediff <- 20 # given time difference in seconds
setDT(df1)[, Datetime := as.POSIXct(Datetime)]
setDT(df2)[, Datetime := as.POSIXct(Datetime)]
df2[, c("from", "to") := .(Datetime - timediff, Datetime + timediff)]
df3 <- df2[setDT(df1), on = c("ID", "from <= Datetime", "to >= Datetime"),
           nomatch = NULL,
           .(ID, Datetime1 = i.Datetime, Datetime2 = x.Datetime)]
df3
ID Datetime1 Datetime2
1: 123 2021-04-02 09:50:11 2021-04-02 09:50:31
2: 456 2021-04-02 09:50:15 2021-04-02 09:50:23
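For comparison, a dplyr sketch of the same tolerance matching: an equi join on ID followed by a filter on the time difference. This is an alternative, not the answers' approach, and it assumes the Datetime columns have already been converted with as.POSIXct() as above:
library(dplyr)
timediff <- 20  # allowed difference in seconds
# the suffixes yield Datetime1 (from df1) and Datetime2 (from df2)
inner_join(df1, df2, by = "ID", suffix = c("1", "2")) %>%
  filter(abs(as.numeric(difftime(Datetime2, Datetime1, units = "secs"))) <= timediff)
This materializes every ID match before filtering, so it is fine for small data but less efficient than a non-equi join on large tables.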

Combine data frame with one value for each date with data frame with several entries per date

I want to merge two dataframes. DF2 has one temperature value for each day, while DF1 has several entries per day. So I want to look up the temperature for a given day in DF2 and have it added to every entry for that day in DF1.
I guess a loop would work best, but being quite new to R I can't figure out what it should look like.
DF1 <- data.frame(Date = c("1.8.18", "1.8.18", "2.8.18"))
DF2 <- data.frame(Date = c("1.8.18", "2.8.18", "3.8.18"),
                  Temperature = c(17, 18, 17),
                  Difference = c(0.5, 0.4, 0.5))
This is the expected output:
DF1$Date        <- c("1.8.18", "1.8.18", "2.8.18")
DF1$Temperature <- c(17, 17, 18)
DF1$Difference  <- c(0.5, 0.5, 0.4)
I would highly recommend using the tidyverse library for general data wrangling (and lubridate for date manipulation, although you don't necessarily need lubridate for this question).
This could work in your case:
library(tidyverse)
# Create the dataframes
DF1 <- data.frame(c("1.8.18", "1.8.18", "2.8.18"))
DF2 <- data.frame(c("1.8.18", "2.8.18", "3.8.18"),
                  c(17, 18, 17),
                  c(0.5, 0.4, 0.5))
names(DF1) <- "Date"
names(DF2) <- c("Date", "Temperature", "Difference")
#### OUTPUT ####
> DF1
# Date
# 1 1.8.18
# 2 1.8.18
# 3 2.8.18
> DF2
# Date Temperature Difference
# 1 1.8.18 17 0.5
# 2 2.8.18 18 0.4
# 3 3.8.18 17 0.5
So above I just recreated your dataframes. DF1 has just the one column, DF2 has 3 columns.
# join dataframes by what the "Date" columns have in common
DF3 <- left_join(x = DF1, y = DF2, by = "Date")
This should give your expected output:
> DF3
# Date Temperature Difference
# 1 1.8.18 17 0.5
# 2 1.8.18 17 0.5
# 3 2.8.18 18 0.4
For more details check out the join function in dplyr (which is part of tidyverse library).
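If you prefer base R, merge() performs the same left join; a minimal sketch, assuming the DF1 and DF2 objects defined above:
# all.x = TRUE keeps every row of DF1, like dplyr::left_join()
DF3 <- merge(DF1, DF2, by = "Date", all.x = TRUE)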
I would treat your Date variable as either a date variable or a character variable. I would not use it as a factor variable for this purpose.
library(tidyverse)
DF1$Date = as.Date(DF1$Date, "%d.%m.%y")
DF2$Date= as.Date(DF2$Date, "%d.%m.%y")
left_join(x = DF1, y = DF2, by = "Date")
OR
DF1$Date = as.character(DF1$Date)
DF2$Date = as.character(DF2$Date)
left_join(x = DF1, y = DF2, by = "Date")
Using it as a factor, you will get an error message, and you have a good chance of getting it wrong.

Join two data frames together and use most recent result as rows added

I am trying to achieve the 'Final.Data' output shown below.
We start with the Reference.Data, and I want to add the 'Add.Data', joining on 'Person' and returning the most recent result prior to the reference Date.
I am looking for dplyr, data.table or sql solutions in R.
I then want to be able to reproduce this for 1000s of entries, so I am looking for a reasonably efficient solution.
library(tibble)
Reference.Data <- tibble(Person = "John",
                         Date = "2019-07-10")
Add.Data <- tibble(Person = "John",
                   Order.Date = c("2019-07-09", "2019-07-08"),
                   Order = 1:2)
Final.Data <- tibble(Person = "John",
                     Date = "2019-07-10",
                     Order.Date = "2019-07-09",
                     Order = 1)
A rolling join to the nearest preceding date should work pretty fast:
#data preparation:
# convert to data.tables, set dates as 'real' dates
DT1 <- setDT(Reference.Data)[, Date := as.IDate( Date )]
DT2 <- setDT(Add.Data)[, Order.Date := as.IDate( Order.Date )]
#set keys (this also orders the dates, convenient for the join later)
setkey(DT1, Person, Date)
setkey(DT2, Person, Order.Date)
#perform rolling update join on DT1
DT1[ DT2, `:=`( Order.date = i.Order.Date, Order = i.Order), roll = -Inf][]
# Person Date Order.date Order
# 1: John 2019-07-10 2019-07-09 1
An approach using data.table non-equi join and update by reference directly on Reference.Data:
library(data.table)
setDT(Add.Data)
setDT(Reference.Data)
setorder(Add.Data, Person, Order.Date)
Reference.Data[, (names(Add.Data)) :=
Add.Data[.SD, on=.(Person, Order.Date<Date), mult="last",
mget(paste0("x.", names(Add.Data)))]
]
output:
Person Date Order.Date Order
1: John 2019-07-10 2019-07-09 1
Another data.table solution:
setDT(Add.Data)[, Order.Date := as.Date(Order.Date)]
setDT(Reference.Data)[, Date := as.Date(Date)]
Reference.Data[, c("Order.Date", "Order") := Add.Data[.SD,
on = .(Person, Order.Date = Date),
roll = TRUE,
.(x.Order.Date, x.Order)]]
Reference.Data
# Person Date Order.Date Order
# 1: John 2019-07-10 2019-07-09 1
We can do an inner_join and then, grouped by 'Person', slice the row with the max 'Order.Date':
library(tidyverse)
inner_join(Add.Data, Reference.Data) %>%
group_by(Person) %>%
slice(which.max(as.Date(Order.Date)))
# A tibble: 1 x 4
# Groups: Person [1]
# Person Order.Date Order Date
# <chr> <chr> <int> <chr>
#1 John 2019-07-09 1 2019-07-10
Or using data.table:
library(data.table)
setDT(Add.Data)[as.data.table(Reference.Data), on = .(Person)][,
.SD[which.max(as.Date(Order.Date))], by = Person]
Left join the Reference.Data to the Add.Data, joining on Person and on Order.Date being at or before Date. Group that by the original Reference.Data rows and take the maximum Order.Date from those. The way it works is that the Add.Data row used for each row of Reference.Data will be the one with the maximum Order.Date, so the correct Order will be shown.
Note that the dot is an SQL operator and order is an SQL keyword, so we must surround names containing a dot, as well as the name Order (regardless of case), with square brackets.
library(sqldf)
sqldf("select r.*, max(a.[Order.Date]) as [Order.Date], a.[Order]
from [Reference.Data] as r
left join [Add.Data] as a on r.Person = a.Person and a.[Order.Date] <= r.Date
group by r.rowid")
giving:
Person Date Order.Date Order
1 John 2019-07-10 2019-07-09 1
I haven't checked how fast this is (adding indexes could speed it up if need be), but with only a few thousand rows efficiency is not likely to be as important as readability.
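With dplyr 1.1.0 or later, a rolling-style join can also be written with join_by()'s closest() helper. A sketch, offered as an alternative to the answers above; it assumes the character dates are converted first:
library(dplyr)
Reference.Data <- mutate(Reference.Data, Date = as.Date(Date))
Add.Data <- mutate(Add.Data, Order.Date = as.Date(Order.Date))
# closest() keeps, per Reference.Data row, the Add.Data row whose
# Order.Date is nearest at or before Date
left_join(Reference.Data, Add.Data,
          by = join_by(Person, closest(Date >= Order.Date)))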

How to filter large data-sets by two attributes and split into subsets? R / Grep

I have hit the limits of the grep() function, or perhaps there are more efficient ways of doing this.
Start off with a sample data frame:
Date <- c( "31-DEC-2014","31-DEC-2014","31-DEC-2014","30-DEC-2014",
"30-DEC-2014","30-DEC-2014", "29-DEC-2014","29-DEC-2014","29-DEC-2014" )
ISIN <- c("LU0168343191", "TW0002418001", "GB00B3FFY088","LU0168343191",
"TW0002418001", "GB00B3FFY088","LU0168343191", "TW0002418001", "GB00B3FFY088")
price <-c(seq(1:9))
df <- as.data.frame(cbind(Date, ISIN, price))
And the desired result is a list() containing subsets of the main data file which look like the below (x3 for the 3 individual identifiers in Result_I).
The idea is that the data should first be filtered by ISIN and then by Date; this 2-step process should keep my data intact.
Result_d <- c("31-DEC-2014", "30-DEC-2014","29-DEC-2014")
Result_I <- c("LU0168343191","LU0168343191","LU0168343191")
Result_P <- c(1,4,7)
Result_df <- cbind(Result_d, Result_I, Result_P)
Please keep in mind the above is for demo purposes; the real data set has 5M rows and 50 columns over a period of 450+ different dates as per Result_d, so I am looking for something that is applicable irrespective of nrow or ncol.
What i have so far:
I take all unique dates and store:
Unique_Dates <- unique(df$Date)
The same for the Identifiers:
Unique_ID <- unique(df$ISIN)
Now the grepping issue:
If i wanted all rows containing Unique_Dates i would do something like:
pattern <- paste(Unique_Dates, collapse = "|")
result <- as.matrix(df[grep(pattern, df$Date), ])
and this will retrieve basically the entire data set. I am wondering if anyone knows an efficient way of doing this.
Thanks in advance.
Using dplyr:
library(dplyr)
Date <- c( "31-Dec-2014","31-Dec-2014","31-Dec-2014","30-Dec-2014",
"30-Dec-2014","30-Dec-2014", "29-Dec-2014","29-Dec-2014","29-Dec-2014" )
ISIN <- c("LU0168343191", "TW0002418001", "GB00B3FFY088","LU0168343191",
"TW0002418001", "GB00B3FFY088","LU0168343191", "TW0002418001", "GB00B3FFY088")
price <-c(seq(1:9))
DF <- data.frame(Date, ISIN, price,stringsAsFactors=FALSE)
DF$Date=as.Date(DF$Date,"%d-%b-%Y")
#Examine data ranges and frequencies
#date range
range(DF$Date)
#date frequency count
table(DF$Date)
#ISIN frequency count
table(DF$ISIN)
#select ISINs for filtering, with user defined choice of filters
# numISIN = 2
# subISIN = head(names(sort(table(DF$ISIN))),numISIN)
subISIN = names(sort(table(DF$ISIN)))[2]
subDF = DF %>%
  dplyr::group_by(ISIN) %>%
  dplyr::arrange(ISIN, Date) %>%
  dplyr::filter(ISIN %in% subISIN) %>%
  as.data.frame()
#> subDF
# Date ISIN price
#1 2014-12-29 LU0168343191 7
#2 2014-12-30 LU0168343191 4
#3 2014-12-31 LU0168343191 1
We convert the 'data.frame' to 'data.table' (setDT(df)), group by 'Date', specify the 'i' as the index returned by grep, and subset the data.table (.SD) based on that index.
library(data.table)
setDT(df)[grep("LU", ISIN), .SD, by = Date]
# Date ISIN price
#1: 31-DEC-2014 LU0168343191 1
#2: 30-DEC-2014 LU0168343191 4
#3: 29-DEC-2014 LU0168343191 7
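If the goal is simply a list with one subset per identifier, base R's split() is a compact alternative (a sketch, not taken from the answers above); it assumes the df defined in the question, where all columns are character because of the cbind():
# one data frame per ISIN, each ordered by descending date as in Result_df
# (parsing the upper-case month with %b is locale-dependent)
result <- lapply(split(df, df$ISIN), function(x) {
  x[order(as.Date(x$Date, format = "%d-%b-%Y"), decreasing = TRUE), ]
})
result[["LU0168343191"]]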

R: ifelse statement test involving multiple dataframes

I am trying to create a new variable using ifelse by combining data from two data.frames (similar to this question but without factors).
My problem is that df1 features yearly data, whereas vars in df2 are temporally aggregated: e.g. df1 has multiple obs (1997,1998,...,2005) and df2 only has a range (1900-2001).
For illustration, a 2x2 example would look like
df1$id <- c("2","20")
df1$year <- c("1960","1870")
df2$id <- df1$id
df2$styear <- c("1800","1900")
df2$endyear <- c("2001","1950")
I want to combine both in such a way that the id (same variable exists in both) is matched, and further, the year in df1 is within the range of df2. I tried the following
df1$new.var <- ifelse(df1$id==df2$id & df1$year>=df2$styear &
df1$year<df2$endyear,1,0)
Which ideally should return 1 and 0, respectively.
But instead I get warning messages:
1: In df1$id == df2$id : longer object length is not a multiple of shorter object length
2: In df1$year >= df2$styear : longer object length is not a multiple of shorter object length
3: In df1$year < df2$endyear : longer object length is not a multiple of shorter object length
For the record, the 'real' df1 has 500 obs and df2 has 14. How can I make this work?
Edit: I realised some obs in df2 are repeated, with multiple periods e.g.
id styear endyear
1 1800 1915
1 1950 2002
2 1912 1988
3 1817 2000
So, I believe what I need is something like a double-ifelse:
df1$new.var <- ifelse(df1$id==df2$id & df1$year>=df2$styear &
df1$year<df2$endyear | df1$year>=df2$styear &
df1$year<df2$endyear,1,0)
Obviously, this wouldn't work, but it is a way to get out of the duplicates-problem.
For example, if id=1 in df1$year=1801, it will pass the first year-range test (1801 is between 1800-1915), but fail the second one (1801 is not between 1950-2002), so it is only coded once and no extra rows are added (currently the duplicates add extra rows).
df1$id <- c("2","20")
df1$year <- c("1960","1870")
df2$id <- df1$id
df2$styear <- c("1800","1900")
df2$endyear <- c("2001","1950")
library(dplyr)
df3 <- left_join(df1, df2, by = "id") %>% filter(year <= endyear, year >= styear)
I highly recommend the dplyr package for data manipulation.
With base R:
df1 <- data.frame(id=c(2,20,22), year=c(1960,1870, 2016))
df2 <- data.frame(id=c(2,20,21), styear=c(1800,1900,2000), endyear=c(2001,1950,2016))
df1
id year
1 2 1960
2 20 1870
3 22 2016
df2
id styear endyear
1 2 1800 2001
2 20 1900 1950
3 21 2000 2016
df1 <- merge(df1, df2, by='id', all.x = TRUE)
df1$new.var <- !is.na(df1$styear) & df1$year>=df1$styear & df1$year< df1$endyear
df1 <- df1[c('id', 'year', 'new.var')]
df1
id year new.var
1 2 1960 TRUE
2 20 1870 FALSE
3 22 2016 FALSE
Alright, I made it work for myself. Beware, it is quite convoluted and probably contains some redundancies. After a brief look at the data wrangling cheatsheet, assuming you have df1 and df2 with an identical var and df2 contains new.var, one can do the following:
library(dplyr)
#Join everything, all values and rows
df3 <- full_join(df1,df2,by="id")
#filter out obs whose year is greater than endyear
df3 <- filter(df3,df3$year<=df3$endyear)
#same, the other way around
df3 <- filter(df3,df3$year>=df3$styear)
df3 <- distinct(df3) #remove duplicate rows (at least I had some)
As far as I can tell by looking at the end result, this method only extracts information from the correct time period while dropping all other time periods in df2. Then, it is a matter of merging with the original data.frame (df1) and filling in the NAs:
df1 <- merge(df1, df3, by = "id", all.x = TRUE)
df1 <- distinct(df1) #just to make sure, I still had three
df1$new.var <- ifelse(is.na(df1$new.var),0,df1$new.var)
which is what I wanted.
This can be solved easily and efficiently using non-equi joins in data.table devel version (1.9.7+):
library(data.table)
setDT(df1); setDT(df2) # converting to data.table in place
df1[, new.var := df2[df1, on = .(id, styear <= year, endyear >= year),
.N > 0, by = .EACHI]$V1]
df1
# id year new.var
#1: 2 1960 TRUE
#2: 20 1870 FALSE
The above join looks for matches in df2 for each row of df1 (by = .EACHI), and checks the number of matching rows (.N).
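For reference, a base R sketch of the same membership test which also copes with repeated ids in df2 (an alternative to the data.table version, assuming the year columns are converted to numeric first):
df1$year <- as.numeric(df1$year)
df2$styear <- as.numeric(df2$styear)
df2$endyear <- as.numeric(df2$endyear)
# any() collapses multiple df2 periods per id into a single TRUE/FALSE
df1$new.var <- sapply(seq_len(nrow(df1)), function(i) {
  any(df2$id == df1$id[i] &
      df1$year[i] >= df2$styear &
      df1$year[i] <  df2$endyear)
})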
