I have a dataset of daily remotely sensed data. In short, it's reflectance (values between 0 and 1) for the last 20 years. Because it's remotely sensed data, some dates do not have a value because of clouds or some other obstruction.
I want to use rollapply() in R's zoo package to detect in the time series when the values remain at 1.0 for a certain amount of time (let's say 2 weeks) or at 0 for that same amount of time.
I have code to do this, but the width argument in the rollapply() function (the 2-week threshold mentioned in the previous paragraph) looks at data points rather than time. So it looks at 14 data values rather than 14 days, which may span over a month due to the missing data values from cloud cover etc.
Here's an example:
test_data <- data.frame(date = c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-17", "2000-01-18"),
value = c(0, 1, 1, 1, 0))
test_data$date <- ymd(test_data$date)
select_first_1_value <- test_data %>%
mutate(value = rollapply(value, width = 3, min, align = "left", fill = NA, na.rm = TRUE)) %>%
filter(value == 1) %>%
filter(row_number() == 1) %>%
ungroup
With the argument as width = 3, it works. It finds that 2000-01-02 is the first date where a value = 1 occurs for at least 3 values. However, if I change this to 14, it no longer works, because it only sees 5 values in this instance. Even if I wrote out an additional 10 values that equal 1 (for a total of 15), it would be incorrect because the value = 0 at 2000-01-18 and it is only counting data points and not dates.
But when we look at the dates, there are missing dates between 2000-01-03 and 2000-01-17. If both are a value = 1, then I want to extract 2000-01-02 as the first instance where the time series remains at 1 for at least 14 consecutive days. Here, I'm assuming that the values are 1 for the missing days.
Any help is greatly appreciated. Thank you.
There really are two problems here:
How to roll by date rather than number of points.
how to find the first stretch of 14 days of 1's assuming that missing dates are 1.
Note that (2) is not readily solved by (1) because the start of the first series of ones may not be any of the listed dates! For example, suppose we change the first date to Dec 1, 1999 giving test_data2 below. Then the start of the first period of 14 ones is Dec 2, 1999. It is not any of the dates in the test_data2 series.
test_data2 <- data.frame(
date = c("1999-12-01", "2000-01-02", "2000-01-03", "2000-01-17", "2000-01-18"),
value = c(0, 1, 1, 1, 0))
1) What we need to do is not roll by date but rather expand the series to fill in the missing dates giving zz and then use rollapply. Below do that by creating a zoo series (which also converts the dates to Date class) and then convert that to ts class. Because ts class can only represent regularly spaced series that conversion will fill in the missing dates and provide a value of NA for them. We can fill those in with 1 and then convert back to zoo with Date class index.
library(zoo)
z <- read.zoo(test_data2)
zz <- z |> as.ts() |> na.fill(1) |> as.zoo() |> aggregate(as.Date)
r <- rollapply(zz, 14, min, na.rm = TRUE, partial = TRUE, align = "left")
time(r)[which(r == 1)[1]]
## [1] "1999-12-02"
2) Another way to solve this not involving rollapply at all would be to use rle. Using zz from above
ok <- with(rle(coredata(zz)), rep(lengths >= 14 & values == 1, lengths))
tt[which(ok)[1]]
## [1] "1999-12-02"
3) Another way without using rollapply is to extract the 0 value rows and then keep only those whose difference exceeds 14 days from the next 0 value row. Finally take the first such row and use the date one day after it. This assumes that there is at least one 0 row before the first run of 14+ ones. Below we have returned back to using test_data from the question although this would have also worked with test_data2.
library(dplyr)
test_data %>%
mutate(date = as.Date(date)) %>%
filter(value == 0) %>%
mutate(diff = as.numeric(lead(date) - date)) %>%
filter(diff > 14) %>%
head(1) %>%
mutate(date = date + 1)
## date value diff
## 1 2000-01-02 0 17
rollapply over dates rather than points
4) The question also discussed using rollapply over dates rather than points which we address here. As noted above this does not actually solve the question of finding the first stretch of 14+ ones so instead we show how to find the first date in the series which starts a stretch of at least 14 ones. In general, we do this by first calculating a width vector using findInterval and then use rollapply in the usual way but with those widths rather than using a scalar width. This only involves one extra line of code to calculate the widths, w.
# using test_data from question
tt <- as.Date(test_data$date)
w <- findInterval(tt + 13, tt, rightmost.closed = TRUE) - seq_along(tt) + 1
r <- rollapply(test_data$value, w, min, fill = NA, na.rm = TRUE, align = "left")
tt[which(r == 1)[1]]
## [1] "2000-01-02"
There are further examples in ?rollapply showing how to roll by time rather than number of points.
sqldf
5) A completely different way of approaching the problem of finding the first 14+ ones with a date in the series is to use an SQL self join. It joins the first instance of test aliased to a to a second instance b associating all rows of b within the indicated date range and of a taking the minimum value of those creating a new column min14 with those minimums. The having clause then keeps only those rows for which min14 is 1 and of those the limit clause keeps the first. We then extract the date at the end.
library(sqldf)
test <- transform(test_data, date = as.Date(date))
sqldf("select a.*, min(b.value) min14
from test a
left join test b on b.date between a.date and a.date + 13
group by a.rowid
having min14 = 1
limit 1")$date
## [1] "2000-01-02"
You may look into runner package where you can pass k as days/weeks etc. See this example, to sum the last 3 days of value.
library(dplyr)
library(runner)
test_data %>%
mutate(date = as.Date(date),
sum_val = runner(value, k = "3 days", idx = date, f = sum))
# date value sum_val
#1 2000-01-01 0 0
#2 2000-01-02 1 1
#3 2000-01-03 1 2
#4 2000-01-17 1 1
#5 2000-01-18 0 1
Notice row 4 has value 1 (and not 3) because there is only 1 value that occurred in last 3 days.
Related
I have a dataframe containing timestamps of visits to a location by RFID tagged individuals. Here is a simplified example
Time<- c("07:00:48", "11:45:34", "11:46:28","11:46:29", "11:47:17","11:47:18")
ID<- c("00003F9776","01103F9702","01103FA8DD","01103FA8DD","01103F9702","01103F9702")
df<- data.frame(Time, ID)
df
Time ID
1 07:00:48 00003F9776
2 11:45:34 01103F9702
3 11:46:28 01103FA8DD
4 11:46:29 01103FA8DD
5 11:47:17 01103F9702
6 11:47:18 01103F9702
All I want to do is remove the visits which are close to each other in time, say for example, if they within 10 seconds of each other. In the above example I'd remove rows 3 and 5.
My actual data sets contain thousands of entries like this. Is there a simple way to achieve this?
Here is one dplyr answer -
library(dplyr)
df %>%
mutate(Timestamp = as.POSIXct(Time, format = '%T')) %>%
filter(difftime(Timestamp, lag(Timestamp, default = first(Timestamp) - 11), units = 'sec') > 10) %>%
select(-Timestamp)
# Time ID
#1 07:00:48 00003F9776
#2 11:45:34 01103F9702
#3 11:46:28 01103FA8DD
#4 11:47:17 01103F9702
To keep the first row in the output I used default value of lag as first(Timestamp) - 11 so that it satisfies the condition (difftime > 10) to select the row.
you could easy do this with data.table package
library(data.table)
df<- data.frame(Time, ID) %>%
as.data.table() %>% ## convert to table
mutate(Time = as.ITime(Time)) ## Time as time
create diff col to filter
df[ , diff := difftime(Time, shift(Time), units="secs") ] %>%
filter(diff > 10 | is.na(diff))
thats all.
...or calculate & filter in one line
df[difftime(Time, shift(Time), units="secs") > 10 | is.na(difftime(Time, shift(Time), units="secs"))]
is.na() added to show also first row
I am struggling a little with the logic for recoding nested data into a long "continuous" format based on dates in R
Below is a dummy example of my data. I have three sets of dates The start and stop time for a participant that is stored in long format, and then the start of another incident that is stored as wide data.
GC_ID HMIS_Start HMIS_Stop CPS Start CPS Start 2 CPS Start 3
------- ------------ ----------- ----------- ------------- -------------
1 1/10/14 1/20/14 1/15/14 6/2/14 NA
1 4/10/14 5/30/14 1/15/14 6/2/14 NA
1 12/1/14 12/2/14 1/15/14 6/2/14 NA
1 1/1/15 2/28/15 1/15/14 6/2/14 NA
2 8/13/13 8/17/14 NA NA NA
3 5/1/15 5/2/15 1/16/13 6/26/14 7/27/15
3 6/4/16 7/10/16 1/16/13 6/26/14 7/27/15
4 10/15/13 10/25/13 2/18/15 NA NA
4 12/25/13 1/18/14 2/18/15 NA NA
4 2/8/15 7/20/15 2/18/15 NA NA
My goal is to create two long continuous variables that go along with each months from August 2013 to December 2015. For one of the two variables, I would like to code a 1 for each month that target month is within an HMIS_start and HMIS_stop time for a participant AND has at least one CPS Start date within that month. The second variable would do a similar thing, but it would be if the CPS Start date happened in the month after the HMIS Stop date.
So participant 1's data could look like this:
I assume I need to create a blank data set with the ID variable and then the month/year variable. Then I would use a for loop for each ID to run an "if_then" statement comparing IF the month is greater then the HMIS start and less then the HMIS stop AND if the CPS start is within that month too.
I am mostly just struggling with how to create that process and use the for loop logically given that there are long data already in the file and multiple lines of long data per participant that need to be compared to all possible CPS start dates
Any thoughts or code tips on how to tackle this?
I am not sure how you came to your answers, and I will update this code once that is provided. But I used library(tidyverse) and library(lubridate) for this:
dat <- data.frame(GC_ID = c(1,1,1,1,2,3,3,4,4,4),
HMIS_Start = c("1/10/14", "4/10/14", "12/1/14", "1/1/15", "8/13/13", "5/1/15", "6/4/16", "10/15/13", "12/25/13","2/8/15"), HMIS_Stop = c("1/20/14", "5/30/14", "12/2/14", "2/28/15", "8/17/14", "5/2/15", "7/10/16", "10/25/13", "1/18/14", "7/20/15"), CPS_Start = c("1/15/14","1/15/14","1/15/14","1/15/14",NA, "1/16/13", "1/16/13", "2/18/15", "2/18/15", "2/18/15"), CPS_Start_2 = c("6/2/15", "6/2/15", "6/2/15", "6/2/15", NA, "6/26/14", "6/26/14", NA, NA, NA), CPS_Start_3 = c(NA,NA,NA,NA,NA,"7/27/15", "7/27/15", NA,NA,NA))
dats <- dat %>%
mutate_if(is.factor, as.character) %>%
mutate_if(is.character, ~as.Date(., format = "%m/%d/%y")) %>%
gather(Var, Dates, -GC_ID, -HMIS_Start, -HMIS_Stop) %>%
filter(!is.na(Dates)) %>%
mutate(HMIS_CPS_SAME = if_else(month(HMIS_Start) == month(HMIS_Stop) &
year(HMIS_Start) == year(HMIS_Stop) &
month(HMIS_Start) == month(Dates) &
year(HMIS_Start) == year(Dates), 1, 0 ),
CPS_After = if_else(month(HMIS_Stop) + 1 == month(Dates) &
year(HMIS_Stop) == year(Dates), 1,0 ),
Months = month(HMIS_Start),
Years = year(HMIS_Start)) %>%
arrange(GC_ID, HMIS_Start, Dates) %>%
group_by(GC_ID, Months, Years) %>%
summarise(HMIS_CPS_SAME = max(HMIS_CPS_SAME),
CPS_After = max(CPS_After)) %>%
ungroup()
full_dat <- merge(data.frame(GC_ID = unique(dat$GC_ID)), data.frame(Dates = seq.Date(as.Date("2013-08-01"), as.Date("2015-12-01"), by = "month"))) %>%
mutate(Months = month(Dates), Years = year(Dates)) %>%
left_join(dats, by = c("GC_ID", "Months", "Years")) %>%
mutate_if(is.numeric , replace_na, replace = 0)
First I created the data in R and R format. Then I converted the data to date format for the 5 columns you mentioned. I made the data long to do the comparisons specified, then found the max for each GC_ID, Months, Years. Then I used a cartesian join for each date and GC_ID and got the months and years from those and joined our dats to full_dat by GC_ID, Months, Years. The last mutate_if is to convert all NA values to 0. NO Looping Needed! :-)
I have two columns of dates. Two example dates are:
Date1= "2015-07-17"
Date2="2015-07-25"
I am trying to count the number of Saturdays and Sundays between the two dates each of which are in their own column (5 & 7 in this example code). I need to repeat this process for each row of my dataframe. The end results will be one column that represents the number of Saturdays and Sundays within the date range defined by two date columns.
I can get the code to work for one row:
sum(weekdays(seq(Date1[1,5],Date2[1,7],"days")) %in% c("Saturday",'Sunday')*1))
The answer to this will be 3. But, if I take out the "1" in the row position of date1 and date2 I get this error:
Error in seq.Date(Date1[, 5], Date2[, 7], "days") :
'from' must be of length 1
How do I go line by line and have one vector that lists the number of Saturdays and Sundays between the two dates in column 5 and 7 without using a loop? Another issue is that I have 2 million rows and am looking for something with a little more speed than a loop.
Thank you!!
map2* functions from the purrr package will be a good way to go. They take two vector inputs (eg two date columns) and apply a function in parallel. They're pretty fast too (eg previous post)!
Here's an example. Note that the _int requests an integer vector back.
library(purrr)
# Example data
d <- data.frame(
Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
# Wrapper function to compute number of weekend days between dates
n_weekend_days <- function(date_1, date_2) {
sum(weekdays(seq(date_1, date_2, "days")) %in% c("Saturday",'Sunday'))
}
# Iterate row wise
map2_int(d$Date1, d$Date2, n_weekend_days)
#> [1] 3 4 2
If you want to add the results back to your original data frame, mutate() from the dplyr package can help:
library(dplyr)
d <- mutate(d, end_days = map2_int(Date1, Date2, n_weekend_days))
d
#> Date1 Date2 end_days
#> 1 2015-07-17 2015-07-25 3
#> 2 2015-07-28 2015-08-14 4
#> 3 2015-08-15 2015-08-20 2
Here is a solution that uses dplyr to clean things up. It's not too difficult to use with to assign the columns in the dataframe directly.
Essentially, use a reference date, calculate the number of full weeks (by floor or ceiling). Then take the difference between the two. The code does not include cases in which the start date or end data fall on Saturday or Sunday.
# weekdays(as.Date(0,"1970-01-01")) -> "Friday"
require(dplyr)
startDate = as.Date(0,"1970-01-01") # this is a friday
df <- data.frame(start = "2015-07-17", end = "2015-07-25")
df$start <- as.Date(df$start,"", format = "%Y-%m-%d", origin="1970-01-01")
df$end <- as.Date(df$end, format = "%Y-%m-%d","1970-01-01")
# you can use with to define the columns directly instead of %>%
df <- df %>%
mutate(originDate = startDate) %>%
mutate(startDayDiff = as.numeric(start-originDate), endDayDiff = as.numeric(end-originDate)) %>%
mutate(startWeekDiff = floor(startDayDiff/7),endWeekDiff = floor(endDayDiff/7)) %>%
mutate(NumSatsStart = startWeekDiff + ifelse(startDayDiff %% 7>=1,1,0),
NumSunsStart = startWeekDiff + ifelse(startDayDiff %% 7>=2,1,0),
NumSatsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 1,1,0),
NumSunsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 2,1,0)
) %>%
mutate(NumSats = NumSatsEnd - NumSatsStart, NumSuns = NumSunsEnd - NumSunsStart)
Dates are number of days since 1970-01-01, a Thursday.
So the following is the number of Saturdays or Sundays since that date
f <- function(d) {d <- as.numeric(d); r <- d %% 7; 2*(d %/% 7) + (r>=2) + (r>=3)}
For the number of Saturdays or Sundays between two dates, just subtract, after decrementing the start date to have an inclusive count.
g <- function(d1, d2) f(d2) - f(d1-1)
These are all vectorized functions so you can just call directly on the columns.
# Example data, as in Simon Jackson's answer
d <- data.frame(
Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
As follows
within(d, end_days<-g(Date1,Date2))
# Date1 Date2 end_days
# 1 2015-07-17 2015-07-25 3
# 2 2015-07-28 2015-08-14 4
# 3 2015-08-15 2015-08-20 2
Suppose I have a series of observations representing date intervals, e.g.
library(dplyr)
library(magrittr)
df <-
data_frame(start = as.Date(c('2000-01-01', '2000-01-03', '2000-01-08',
'2000-01-20', '2000-01-22')),
end = as.Date(c('2000-01-02', '2000-01-05', '2000-01-10',
'2000-01-21', '2000-02-10')))
I would like to group these observations such that the start time of observation n occurs within some specified interval following the end date of observation n-1. For instance, if we set that interval to be 5 days, we would see something like:
# start end group
# (date) (date) (dbl)
# 1 2000-01-01 2000-01-02 1
# 2 2000-01-03 2000-01-05 1
# 3 2000-01-08 2000-01-10 1
# 4 2000-01-20 2000-01-21 2
# 5 2000-01-22 2000-02-10 2
(For the sake of simplicity, I'm assuming no overlap in dates, although this isn't necessarily the case in the data). I thought about using igraph to create a weighted edgelist, but that seemed overly complicated. Efficiency is, I believe, important: I'll be running this on roughly 4 million groups of data of about 5-10 rows each.
While my solution does work, to me it seems error-prone, slow, and clunky. I'm thinking using a package or some vectorization would really improve matters.
group_dates <- function(df, interval){
# assign first date to first group
df %<>% arrange(start, end)
df[1, 'group'] <- 1
# for each start date, determine if it is within `interval` days of the
# closest end date
lapply(df$start[-1], function(cur_start){
earlier_data <- df[df$end <= cur_start, ]
diffs <- cur_start - earlier_data$end
min_interval <- diffs[which.min(diffs)]
closest_group <- earlier_data$group[which.min(diffs)]
if(min_interval <= interval){
df[df$start == cur_start, 'group'] <<- closest_group
} else {
df[df$start == cur_start, 'group'] <<- closest_group + 1
}
})
return(df)
}
You can do that relatively easily with dplyr.
The idea is the following:
Lag the end data (shifting it down by one)
Calculate the difference between start date and the lagged end date
Adding 'BreakPoints' - A variable with TRUE when the difference is more than 5 days and FALSE otherwise
Calculating the cumulative sum of this break-point. This will add 1 every time it find a new breakpoint so a new interval should be started
Something like this should work for you:
df %>%
mutate(lagged_end = lag(end),
diff = start - lagged_end,
new_interval = diff > 5,
new_interval = ifelse(is.na(new_interval), FALSE, new_interval),
interval_number = cumsum(new_interval))
This should be also quite quick since it's all in dplyr
This isn't as elegant as Lorenzo Rossi's solution, but offers a slightly different approach using cut.Date and 2 lines of code:
breakpoints <- c(FALSE, sapply(2:nrow(df), function(x) df[x,"start"] - df[x-1,"end"]) > 5)
clusterLabels <- as.numeric(cut.Date(df$start, c(min(df$start), df[breakpoints, "start"], max(df$start)+1)))
I'm currently writing a script in the R Programming Language and I've hit a snag.
I have time series data organized in a way where there are 30 days in each month for 12 months in 1 year. However, I need the data organized in a proper 365 days in a year calendar, as in 30 days in a month, 31 days in a month, etc.
Is there a simple way for R to recognize there are 30 days in a month and to operate within that parameter? At the moment I have my script converting the number of days from the source in UNIX time and it counts up.
For example:
startingdate <- "20060101"
endingdate <- "20121230"
date <- seq(from = as.Date(startingdate, "%Y%m%d"), to = as.Date(endingdate, "%Y%m%d"), by = "days")
This would generate an array of dates with each month having 29 days/30 days/31 days etc. However, my data is currently organized as 30 days per month, regardless of 29 days or 31 days present.
Thanks.
The first 4 solutions are basically variations of the same theme using expand.grid. (3) uses magrittr and the others use no packages. The last two work by creating long sequence of numbers and then picking out the ones that have month and day in range.
1) apply This gives a series of yyyymmdd numbers such that there are 30 days in each month. Note that the line defining yrs in this case is the same as yrs <- 2006:2012 so if the years are handy we could shorten that line. Omit as.numeric in the line defining s if you want character string output instead. Also, s and d are the same because we have whole years so we could omit the line defining d and use s as the answer in this case and also in general if we are always dealing with whole years.
startingdate <- "20060101"
endingdate <- "20121230"
yrs <- seq(as.numeric(substr(startingdate, 1, 4)), as.numeric(substr(endingdate, 1, 4)))
g <- expand.grid(yrs, sprintf("%02d", 1:12), sprintf("%02d", 1:30))
s <- sort(as.numeric(apply(g, 1, paste, collapse = "")))
d <- s[ s >= startingdate & s <= endingdate ] # optional if whole years
Run some checks.
head(d)
## [1] 20060101 20060102 20060103 20060104 20060105 20060106
tail(d)
## 20121225 20121226 20121227 20121228 20121229 20121230
length(d) == length(2006:2012) * 12 * 30
## [1] TRUE
2) no apply An alternative variation would be this. In this and the following solutions we are using yrs as calculated in (1) so we omit it to avoid redundancy. Also, in this and the following solutions, the corresponding line to the one setting d is omitted, again, to avoid redundancy -- if you don't have whole years then add the line defining d in (1) replacing s in that line with s2.
g2 <- expand.grid(yr = yrs, mon = sprintf("%02d", 1:12), day = sprintf("%02d", 1:30))
s2 <- with(g2, sort(as.numeric(paste0(yr, mon, day))))
3) magrittr This could also be written using magrittr like this:
library(magrittr)
expand.grid(yr = yrs, mon = sprintf("%02d", 1:12), day = sprintf("%02d", 1:30)) %>%
with(paste0(yr, mon, day)) %>%
as.numeric %>%
sort -> s3
4) do.call Another variation.
g4 <- expand.grid(yrs, 1:12, 1:30)
s4 <- sort(as.numeric(do.call("sprintf", c("%d%02d%02d", g4))))
5) subset sequence Create a sequence of numbers from the starting date to the ending date and if each number is of the form yyyymmdd pick out those for which mm and dd are in range.
seq5 <- seq(as.numeric(startingdate), as.numeric(endingdate))
d5 <- seq5[ seq5 %/% 100 %% 100 %in% 1:12 & seq5 %% 100 %in% 1:30]
6) grep Using seq5 from (5)
d6 <- as.numeric(grep("(0[1-9]|1[0-2])(0[1-9]|[12][0-9]|30)$", seq5, value = TRUE))
Here's an alternative:
date <- unclass(startingdate):unclass(endingdate) %% 30L
month <- rep(1:12, each = 30, length.out = NN <- length(date))
year <- rep(1:(NN %/% 360 + 1), each = 360, length.out = NN)
(of course, we can easily adjust by adding constants to taste if you want a specific day to be 0, or a specific month, etc.)