I have 3 data frames: df1 holds a time interval per row, df2 is a list of IDs, and df3 is a list of IDs with an associated date.
df1 <- structure(list(season = structure(c(2L, 1L), .Label = c("summer",
"winter"), class = "factor"), mindate = structure(c(1420088400,
1433131200), class = c("POSIXct", "POSIXt")), maxdate = structure(c(1433131140,
1448945940), class = c("POSIXct", "POSIXt")), diff = structure(c(150.957638888889,
183.040972222222), units = "days", class = "difftime")), .Names = c("season",
"mindate", "maxdate", "diff"), row.names = c(NA, -2L), class = "data.frame")
df2 <- structure(list(ID = c(23796, 23796, 23796)), .Names = "ID", row.names = c(NA,
-3L), class = "data.frame")
df3 <- structure(list(ID = c("23796", "123456", "12134"), time = structure(c(1420909920,
1444504500, 1444504500), class = c("POSIXct", "POSIXt"), tzone = "US/Eastern")), .Names = c("ID",
"time"), row.names = c(NA, -3L), class = "data.frame")
The code should check whether df2$ID == df3$ID. If true, and if df3$time >= df1$mindate and df3$time <= df1$maxdate, then compute df1$maxdate - df3$time; otherwise compute df1$maxdate - df1$mindate. I tried using the ifelse function. This works when I manually specify specific cells, but that is not what I want, as I have many more rows (uneven counts) in each of the dfs.
df1$result <- ifelse(df2[1,1] == df3[1,1] & df3[1,2] >= df1$mindate & df3[1,2] <= df1$maxdate,
                     difftime(df1$maxdate, df3[1,2], units = "days"),
                     difftime(df1$maxdate, df1$mindate, units = "days"))
EDIT: The desired output is (when removing last row of df2):
season mindate maxdate diff result
1 winter 2015-01-01 2015-05-31 23:59:00 150.9576 days 141.9576
2 summer 2015-06-01 2015-11-30 23:59:00 183.0410 days 183.0410
Any ideas? I don't see how I could merge the dfs to make them the same length. Note that df2 can have any number of rows without affecting the code; issues arise when df1 and df3 differ in number of rows.
The >, <, and %in% comparisons are vectorized, so they can be applied to whole columns at once:
transform(df1, result = ifelse(df3$ID %in% df2$ID & df3$time > mindate & df3$time < maxdate,
                               difftime(maxdate, df3$time), difftime(maxdate, mindate)))
season mindate maxdate diff result
1 winter 2014-12-31 21:00:00 2015-05-31 20:59:00 150.9576 days 141.9576
2 summer 2015-05-31 21:00:00 2015-11-30 20:59:00 183.0410 days 183.0410
You can also use the between function from data.table library
library(data.table)
transform(df1, result = ifelse(df3$ID %in% df2$ID & df3$time %between% df1[2:3],
                               difftime(maxdate, df3$time), difftime(maxdate, mindate)))
season mindate maxdate diff result
1 winter 2014-12-31 21:00:00 2015-05-31 20:59:00 150.9576 days 141.9576
2 summer 2015-05-31 21:00:00 2015-11-30 20:59:00 183.0410 days 183.0410
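Note that both answers above match df3 against df1 row by row, so they rely on the two frames lining up. If df1 and df3 differ in row count, a hedged alternative is to scan all of df3 for each df1 interval. This is a sketch; where several times fall inside one interval it takes the latest matching time, so adjust that rule to your needs:
df1$result <- sapply(seq_len(nrow(df1)), function(i) {
  # times whose ID appears in df2 and that fall inside interval i
  hit <- df3$ID %in% df2$ID &
    df3$time >= df1$mindate[i] &
    df3$time <= df1$maxdate[i]
  if (any(hit)) {
    # latest matching time (an assumption; pick your own tie-break rule)
    as.numeric(difftime(df1$maxdate[i], max(df3$time[hit]), units = "days"))
  } else {
    as.numeric(difftime(df1$maxdate[i], df1$mindate[i], units = "days"))
  }
})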
Updated to a more realistic example; this time added duplicates in interp_b.
I am trying to populate a field in one dataframe (interp_b) using the values from a second dataframe (bait). I want to look at each row's obs_datetime in interp_b, and determine when that plot-station-year was last baited, prior to the obs_datetime. This will later be used to calculate a time-since-bait for each obs_datetime. Bait times are in the bait dataframe in column bait_datetime. The results should go in a field called latestbait_datetime in the interp_b dataframe.
I was picturing an iterative process where interp_b$latestbait_datetime keeps getting recalculated until the last row of the bait dataframe is reached. The for-loop I tried clearly runs through the rows and does the specified calculations, but I can't get the output into the format I want; it produces output for each iteration rather than rewriting and updating the interp_b dataframe.
Here is some code to build the two dataframes; interp_b and bait (please excuse the inelegance)
# interp_b dataframe----
structure(list(plot_station_year = c("Cow_C2_2019", "RidingStable_C3_2018",
"RidingStable_C3_2018", "Raf_C1_2018", "Metcalfe_C2_2019"), obs_datetime = structure(c(1559487600,
1544954400, 1541084400, 1515160800, 1567756800), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), latestbait_datetime = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), class = c("POSIXct",
"POSIXt"))), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L))
# bait dataframe----
structure(list(plot_station_year = c("Cow_C2_2019", "Cow_C2_2019",
"RidingStable_C3_2018", "Raf_C1_2018"), bait_datetime = structure(c(1557500400,
1559746800, 1543676400, 1491318000), class = c("POSIXct", "POSIXt"
), tzone = "UTC")), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -4L), spec = structure(list(
cols = list(plot_station_year = structure(list(), class = c("collector_character",
"collector")), bait_datetime = structure(list(format = "%d-%m-%Y %H:%M"), class = c("collector_datetime",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
and the desired result would look like this:
# A tibble: 5 x 3
  plot_station_year    obs_datetime        latestbait_datetime
  <chr>                <dttm>              <dttm>
1 Cow_C2_2019          2019-06-02 15:00:00 2019-05-10 11:00:00
2 RidingStable_C3_2018 2018-12-16 10:00:00 2018-12-01 10:00:00
3 RidingStable_C3_2018 2018-11-01 15:00:00 NA
4 Raf_C1_2018          2018-01-05 14:00:00 2017-04-04 11:00:00
5 Metcalfe_C2_2019     2019-09-06 08:00:00 NA
Below are two of my attempts. The first resulted in a dataframe that only contained the final run of the loop (each iteration rebuilds interpbait from the unchanged interp_b, so earlier updates are discarded), and the second attempt resulted in a dataframe containing all of the run results stacked together (as expected with the bind).
library(tidyverse)
#attempt #1----
for (i in 1:nrow(bait)) {
  print(paste("row =", i))
  interpbait <- interp_b %>%
    mutate(latestbait_datetime = if_else(plot_station_year == bait$plot_station_year[i] &
                                           obs_datetime >= bait$bait_datetime[i] &
                                           (is.na(latestbait_datetime) |
                                              latestbait_datetime < bait$bait_datetime[i]),
                                         bait$bait_datetime[i],
                                         latestbait_datetime))
}
#attempt #2----
resultb <- data.frame()
for (i in 1:nrow(bait)) {
  print(paste("row =", i))
  interpbait2 <- interp_b %>%
    mutate(latestbait_datetime = if_else(plot_station_year == bait$plot_station_year[i] &
                                           obs_datetime >= bait$bait_datetime[i] &
                                           (is.na(latestbait_datetime) |
                                              latestbait_datetime < bait$bait_datetime[i]),
                                         bait$bait_datetime[i],
                                         latestbait_datetime))
  resultb <- bind_rows(resultb, interpbait2)
  print(resultb)
}
Any help would be greatly appreciated.
I'm not sure how long this will take, but here is a tidyverse solution. For each row in interp_b, we filter the bait dataframe to the matching plot_station_year and keep only bait date-times earlier than that row's obs_datetime. Then we arrange the filtered bait data by descending datetime (so the most recent dates are on top) and slice off the first row, so we only get the most recent date. Finally we "pull" the date-time out of the dataframe and add it to the appropriate row in interp_b.
library(tidyverse)
library(progress) # for progress bar
# create progress bar to update, so that you can estimate the amount of time it will take to finish the entire loop
pb <- progress_bar$new(total = nrow(interp_b))
for (i in 1:nrow(interp_b)) {
  last_time_baited <- bait %>%
    # filter bait dataframe to appropriate plot, station, year based on
    # the row in interp_b
    filter(plot_station_year == interp_b$plot_station_year[i],
           # ensure all datetimes are less than that row in interp_b
           bait_datetime < interp_b$obs_datetime[i]) %>%
    # arrange by datetime (most recent datetimes first)
    arrange(desc(bait_datetime)) %>%
    # take the top row - this will be the most recent date-time that
    # the plot-station was baited
    slice(1) %>%
    # "pull" that value out of the dataframe so you have a value,
    # not a tibble
    pull(bait_datetime)
  # update the row in interp_b with the date-time baited; the length
  # check guards against stations with no bait event before obs_datetime,
  # where pull() returns a zero-length vector and plain assignment would error
  if (length(last_time_baited) == 1) {
    interp_b$latestbait_datetime[i] <- last_time_baited
  }
  pb$tick() # print progress
}
The resulting table matches your expected output (interp_b):
# A tibble: 5 x 3
plot_station_year obs_datetime latestbait_datetime
<chr> <dttm> <dttm>
1 Cow_C2_2019 2019-06-02 15:00:00 2019-05-10 11:00:00
2 RidingStable_C3_2018 2018-12-16 10:00:00 2018-12-01 10:00:00
3 RidingStable_C3_2018 2018-11-01 15:00:00 NA
4 Raf_C1_2018 2018-01-05 14:00:00 2017-04-04 11:00:00
5 Metcalfe_C2_2019 2019-09-06 08:00:00 NA
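A possible non-loop alternative is to join first and then summarise per observation. This is a sketch, assuming dplyr >= 1.0 (for the .groups argument) and that plot_station_year plus obs_datetime uniquely identify rows in interp_b:
library(tidyverse)

# latest bait event strictly before each observation
latest <- bait %>%
  inner_join(select(interp_b, plot_station_year, obs_datetime),
             by = "plot_station_year") %>%
  filter(bait_datetime < obs_datetime) %>%
  group_by(plot_station_year, obs_datetime) %>%
  summarise(latestbait_datetime = max(bait_datetime), .groups = "drop")

# observations with no prior bait event get NA via the left join
interp_b_filled <- interp_b %>%
  select(-latestbait_datetime) %>%
  left_join(latest, by = c("plot_station_year", "obs_datetime"))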
You could perform an outer join with data.table, and then select the highest bait_datetime for each plot_station_year.
Edit: I edited my answer to reflect the possibility that there could be multiple obs_datetime for a given unique plot_station_year in interp2. To preserve these, we index them and include the index in the filtering step.
One potential improvement with large files (not tested) could be to merge using roll, instead of performing an outer merge and then filtering.
That version is shown at the end of the reproducible example:
library(data.table)
interp2 <- structure(list(plot_station_year = c("Cow_C2_2019", "Cow_C2_2019", "RidingStable_C3_2018",
"Raf_C1_2018", "Metcalfe_C2_2019"), obs_datetime = structure(c(1559487600, 1559487300,
1544954400, 1515160800, 1567756800), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), latestbait_datetime = structure(c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), class = c("POSIXct", "POSIXt"))), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L))
bait2 <- structure(list(plot_station_year = c("Cow_C2_2019", "Cow_C2_2019", "Cow_C2_2019",
"RidingStable_C3_2018", "Raf_C1_2018"), bait_datetime = structure(c(1557500400,
1496674800, 1576674800, 1543676400, 1491318000), class = c("POSIXct", "POSIXt"
), tzone = "UTC")), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -5L), spec = structure(list(
cols = list(plot_station_year = structure(list(), class = c("collector_character",
"collector")), bait_datetime = structure(list(format = "%d-%m-%Y %H:%M"), class = c("collector_datetime",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
# add index idx by plot_station_year, remove empty column, set keys
setDT(interp2)[, "latestbait_datetime" := NULL][, idx := 1:.N, by=plot_station_year]
setkeyv(interp2, c("plot_station_year", "idx", "obs_datetime"))
# same for bait2: set as data.table, set keys
setDT(bait2, key=c("plot_station_year", "bait_datetime"))
## option 1: merge files, then filter
# outer join on interp2 and bait2 on first column (and order by bait_datetime)
expected_out <- merge(interp2, bait2, by="plot_station_year", all=TRUE)
# set keys for sorting
setkey(expected_out, plot_station_year, idx, bait_datetime)
# select highest bait_datetime below obs_datetime by plot_station_year and idx
expected_out <- expected_out[is.na(bait_datetime) | bait_datetime < obs_datetime][,
tail(.SD, 1), by=.(plot_station_year, idx)]
# rename and sort columns
setnames(expected_out, old="bait_datetime", new="latestbait_datetime")
setorder(expected_out, -latestbait_datetime, idx, na.last = TRUE)[]
#> plot_station_year idx obs_datetime latestbait_datetime
#> 1: Cow_C2_2019 1 2019-06-02 15:00:00 2019-05-10 15:00:00
#> 2: Cow_C2_2019 2 2019-06-02 14:55:00 2019-05-10 15:00:00
#> 3: RidingStable_C3_2018 1 2018-12-16 10:00:00 2018-12-01 15:00:00
#> 4: Raf_C1_2018 1 2018-01-05 14:00:00 2017-04-04 15:00:00
#> 5: Metcalfe_C2_2019 1 2019-09-06 08:00:00 <NA>
## option 2 (might use less memory): rolling join
bait2[, latestbait_datetime := bait_datetime]
out_alt <- bait2[interp2, .(plot_station_year, obs_datetime, idx, latestbait_datetime),
on=c("plot_station_year", "bait_datetime==obs_datetime"), roll=Inf]
# order
setorder(out_alt, -latestbait_datetime, idx, na.last = TRUE)[]
#> plot_station_year obs_datetime idx latestbait_datetime
#> 1: Cow_C2_2019 2019-06-02 15:00:00 1 2019-05-10 15:00:00
#> 2: Cow_C2_2019 2019-06-02 14:55:00 2 2019-05-10 15:00:00
#> 3: RidingStable_C3_2018 2018-12-16 10:00:00 1 2018-12-01 15:00:00
#> 4: Raf_C1_2018 2018-01-05 14:00:00 1 2017-04-04 15:00:00
#> 5: Metcalfe_C2_2019 2019-09-06 08:00:00 1 <NA>
setcolorder(out_alt, c(1,3,2,4))[]
#> plot_station_year idx obs_datetime latestbait_datetime
#> 1: Cow_C2_2019 1 2019-06-02 15:00:00 2019-05-10 15:00:00
#> 2: Cow_C2_2019 2 2019-06-02 14:55:00 2019-05-10 15:00:00
#> 3: RidingStable_C3_2018 1 2018-12-16 10:00:00 2018-12-01 15:00:00
#> 4: Raf_C1_2018 1 2018-01-05 14:00:00 2017-04-04 15:00:00
#> 5: Metcalfe_C2_2019 1 2019-09-06 08:00:00 <NA>
## test that both options give the same result:
identical(expected_out, out_alt)
#> [1] TRUE
I am very new to R programming and am trying to determine the number of days apportioned per month between two dates.
I have a dataset that has the following structure:
from_date
to_date
quantity
Example data:
from_date  to_date    quantity
2019-06-15 2019-09-10 55
2019-07-11 2019-10-05 17
I would like to call a function that returns a dataset (or vector?) holding up to 3 values, as there will be a maximum difference of 3 months between from_date and to_date.
I have tried using lubridate::floor_date() to work backward from the to_date.
Not sure if you are looking for a result like the one below:
df$quantity <- with(df,as.Date(to_date)-as.Date(from_date))
or
df$quantity <- apply(df, 1, function(v) diff(as.Date(v)))
yielding
> df
from_date to_date quantity
1 2019-06-15 2019-09-10 87
2 2019-07-11 2019-10-05 86
Data
df <- structure(list(from_date = structure(1:2, .Label = c("2019-06-15",
"2019-07-11"), class = "factor"), to_date = structure(1:2, .Label = c("2019-09-10",
"2019-10-05"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
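The above gives the total day count between the dates; if the goal is days apportioned per calendar month, here is a minimal sketch of a hypothetical helper (assuming from_date counts and to_date does not; flip the boundary handling if both ends should be included):
days_per_month <- function(from, to) {
  # enumerate the covered days, then count them per calendar month
  days <- seq(as.Date(from), as.Date(to) - 1, by = "day")
  table(format(days, "%Y-%m"))
}

days_per_month("2019-06-15", "2019-09-10")
# 2019-06 2019-07 2019-08 2019-09
#      16      31      31       9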
Error in seq.Date(as.Date(retail$Valid_from), as.Date(retail$Valid_to), :
'from' must be of length 1
How should I deal with the 'from' must be of length 1 error? I have tried both of the methods shown below.
I basically want to repeat the quantity for each day in a given date range:
HSD_RSP Valid_from Valid_to
70 1/1/2018 1/15/2018
80 1/16/2018 1/31/2018
.
.
.
Method 1 :
byDay = ddply(retail, .(HSD_RSP), transform,
day=seq(as.Date(retail$Valid_from), as.Date(retail$Valid_to), by="day"))
Method 2 :
dt <- data.table(retail)
dt <- dt[,seq(as.Date(Valid_from),as.Date(Valid_to),by="day"),
by=list(HSD_RSP)]
HSD_RSP final_date
70 1/1/2018
70 2/1/2018
70 3/1/2018
70 4/1/2018
.
.
.
output of
dput(head(retail))
structure(list(HSD_RSP = c(61.68, 62.96, 63.14, 60.51, 60.34,
61.63), Valid_from = structure(c(1483315200, 1484524800, 1487116800,
1491004800, 1491523200, 1492300800), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Valid_to = structure(c(1484438400, 1487030400,
1490918400, 1491436800, 1492214400, 1493510400), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
Convert the columns to Date class, create a sequence of dates between Valid_from and Valid_to for each row, and unnest. (seq() needs length-1 from and to, which is why map2 calls it once per row; that is what the original error was complaining about.)
library(tidyverse)
df %>%
mutate_at(vars(starts_with("Valid")), as.Date, "%m/%d/%Y") %>%
mutate(Date = map2(Valid_from, Valid_to, seq, by = "1 day")) %>%
unnest(Date) %>%
select(-Valid_from, -Valid_to)
# HSD_RSP Date
# <int> <date>
# 1 70 2018-01-01
# 2 70 2018-01-02
# 3 70 2018-01-03
# 4 70 2018-01-04
# 5 70 2018-01-05
# 6 70 2018-01-06
# 7 70 2018-01-07
# 8 70 2018-01-08
# 9 70 2018-01-09
#10 70 2018-01-10
# … with 21 more rows
data
df <- structure(list(HSD_RSP = c(70L, 80L), Valid_from = structure(1:2,
.Label = c("1/1/2018", "1/16/2018"), class = "factor"), Valid_to =
structure(1:2, .Label = c("1/15/2018", "1/31/2018"), class = "factor")),
class = "data.frame", row.names = c(NA, -2L))
Using Ronak Shah's data structure, with data.table:
library(data.table)
dt <- as.data.table(df)
dt[, .(final_date = seq(as.Date(Valid_from, "%m/%d/%Y"), as.Date(Valid_to, "%m/%d/%Y"), by = "day")),
   by = HSD_RSP]
HSD_RSP final_date
1: 70 2018-01-01
2: 70 2018-01-02
3: 70 2018-01-03
4: 70 2018-01-04
....
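Note that the dput of retail shows POSIXct columns rather than the m/d/Y factor columns in the example data above, so the format strings should be dropped there. A sketch against the real retail (assuming it is exactly as shown in the question):
library(data.table)
dt <- as.data.table(retail)
# as.Date() works directly on POSIXct, so no format string is needed
dt[, .(final_date = seq(as.Date(Valid_from), as.Date(Valid_to), by = "day")),
   by = HSD_RSP]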
I have a large dataset with multiple groups of IDs, each with START and END datetimes. Within each group I want to identify subgroups: a subgroup occurs where two IDs overlap in their START and END datetimes. Below is a script to create a sample dataset in R for one group. What I want is to create a column, "Grp", within each group that labels those subgroups of overlapping START and END datetimes.
What I have...
structure(list(ID = c(1,2,3,4), START = structure(c(1490904000, 1490918400,
1508363100, 1508379300), tzone = "UTC", class = c("POSIXct",
"POSIXt")), END = structure(c(1492050600, 1492247700,
1509062400, 1509031800), tzone = "UTC", class = c("POSIXct",
"POSIXt"))), class = "data.frame", row.names = c(NA, -4L), .Names = c("ID","START",
"END"))
What I want is...
structure(list(ID = c(1,2,3,4), START = structure(c(1490904000, 1508379300,
1508363100, 1490918400), tzone = "UTC", class = c("POSIXct",
"POSIXt")), END = structure(c(1492050600, 1509031800,
1509062400, 1492247700), tzone = "UTC", class = c("POSIXct",
"POSIXt")), Grp = c(1,2,2,1)), class = "data.frame", row.names = c(NA, -4L), .Names = c("ID","START",
"END","Grp"))
I've tried using lubridate's interval, and finding an overlap that way, but no luck. Any help would be greatly appreciated.
After sorting by START, the condition for a new group is that the END of the previous row is less than the START of the next row:
head(df1$END, -1) < tail(df1$START,-1)
df1 <- structure(list(ID = c(1,2,3,4), START = structure(c(1490904000, 1490918400,
1508363100, 1508379300), tzone = "UTC", class = c("POSIXct",
"POSIXt")), END = structure(c(1492050600, 1492247700,
1509062400, 1509031800), tzone = "UTC", class = c("POSIXct",
"POSIXt"))), class = "data.frame", row.names = c(NA, -4L), .Names = c("ID","START",
"END"))
df1
ID START END
1 1 2017-03-30 20:00:00 2017-04-13 02:30:00
2 2 2017-03-31 00:00:00 2017-04-15 09:15:00
3 3 2017-10-18 21:45:00 2017-10-27 00:00:00
4 4 2017-10-19 02:15:00 2017-10-26 15:30:00
df1a <- df1[ order(df1$START), ]
df1a$grp <- cumsum( c( 1, head(df1a$END, -1) < tail(df1a$START, -1) ))
df1a
#---------------
ID START END grp
1 1 2017-03-30 20:00:00 2017-04-13 02:30:00 1
2 2 2017-03-31 00:00:00 2017-04-15 09:15:00 1
3 3 2017-10-18 21:45:00 2017-10-27 00:00:00 2
4 4 2017-10-19 02:15:00 2017-10-26 15:30:00 2
Here's a function that answers the first part of my response to the comment below:
grp_overlaps <- function(endings, beginnings){
  cumsum(c(1, head(endings, -1) < tail(beginnings, -1)))
}
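A usage sketch against the sorted data frame above (assuming df1a is already ordered by START):
df1a$grp <- grp_overlaps(df1a$END, df1a$START)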
I am trying to do this.
library(data.table)
wd <- structure(list(Year = c(2006L, 2006L, 2006L), day = c(361L, 361L,
360L), hour = c(14L, 8L, 8L), mint = c(30L, 0L, 30L), valu1 = c(0.5,
0.3, 0.4), Date = structure(c(1167229800, 1167206400, 1167121800
), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("Year",
"day", "hour", "mint", "valu1", "Date"), row.names = c(NA, -3L
), class = "data.frame")
wg <- c("2006/12/27 14:23:59", "2006/12/27 16:47:59", "2006/12/27 19:12:00")
w <- c("0.4", "0.2", "0.5")
wf=data.frame(wg,w)
wg <- as.POSIXct(wf$wg, format = "%Y/%m/%d %T", tz = "UTC")
WG <- data.table(start = wg, end = wg)
setkey(WG)
## Do the same for `wd` adding +/- 30 minutes
setDT(wd)[, `:=`(start = Date - 1800L, end = Date + 1800L)]
## Run foverlaps and extract the match `valu1` column
foverlaps(wd, WG, nomatch = 0L)[, .(wdDate = Date, valu1, WGDate = start)]
wdDate valu1 WGDate
1: 2006-12-27 14:30:00 0.5 2006-12-27 14:23:59
As you can see, in the final result only valu1 was extracted from wd, but I would also like to extract the corresponding values of w from wf.
So I want something like this:
wdDate valu1 WGDate w
1: 2006-12-27 14:30:00 0.5 2006-12-27 14:23:59 0.4
Any idea is welcome.
Facing a problem here with my real data:
head(wf)
date1 date2 date3n wg w whyt
1 <NA> 2003-01-01 <NA> <NA> NA NA
2 <NA> 2003-01-02 <NA> <NA> NA NA
3 <NA> 2003-01-03 <NA> 2003/01/03 10:30:00 0.2137352 0.34
4 <NA> 2003-01-04 <NA> <NA> NA NA
In my previous answer I created WG because you provided wg as a single vector. If you already have a data set called wf, that whole process is not needed. You just need to adjust wf correctly and then run foverlaps. In other words, forget about WG and do the following:
setDT(wf)[, wg := as.POSIXct(wg, format = "%Y/%m/%d %T", tz = "UTC")]
wf[, `:=`(start = wg, end = wg)]
setkey(wf, start, end)
setDT(wd)[, `:=`(start = Date - 1800L, end = Date + 1800L)]
foverlaps(wd, wf, nomatch = 0L)[, .(wdDate = Date, valu1, WGDate = start, w)]
# wdDate valu1 WGDate w
# 1: 2006-12-27 14:30:00 0.5 2006-12-27 14:23:59 0.4
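One caveat with the real wf shown in the edit: it has NA values in wg, and foverlaps does not (as far as I know) accept NA in the interval columns, so a hedged pre-step is to drop those rows before keying, assuming they can simply be excluded:
setDT(wf)[, wg := as.POSIXct(wg, format = "%Y/%m/%d %T", tz = "UTC")]
wf <- wf[!is.na(wg)]  # drop rows with no timestamp before keying
wf[, `:=`(start = wg, end = wg)]
setkey(wf, start, end)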