Finding date gaps in R

I am using R and have a vector of dates as Day of Year (DOY) in which some days are missing. I want to find where these missing days are.
DOY <- c(1,2,5,6,7,10,15,16,17)
I want an output which tells me that missing days are between day:
2 to 5
7 to 10
10 to 15
(Or the indices of these locations)

rDOY <- range(DOY)
rnDOY <- seq(rDOY[1],rDOY[2])
rnDOY[!rnDOY %in% DOY]
[1] 3 4 8 9 11 12 13 14
If instead you don't want the missing days themselves but the beginnings and ends of the gaps:
> DOY[which(diff(DOY) != 1)]  # which() avoids recycling the shorter logical index
[1] 2 7 10
> DOY[-1] [ diff(DOY)!=1]
[1] 5 10 15
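The two diff() lookups above can be combined into one table of gaps, which also gives the indices the question asks for. This is a minimal sketch, not part of the original answer: find_gaps is a hypothetical helper name, and it assumes the days can be sorted and de-duplicated first, so the index column refers to positions in that sorted vector.

```r
# Hypothetical helper (name chosen for illustration): one row per gap
find_gaps <- function(x) {
  x <- sort(unique(x))
  i <- which(diff(x) != 1)            # position of the last value before each gap
  data.frame(index = i,
             from = x[i],             # last day present before the gap
             to = x[i + 1],           # first day present after the gap
             n_missing = x[i + 1] - x[i] - 1)
}

DOY <- c(1, 2, 5, 6, 7, 10, 15, 16, 17)
gaps <- find_gaps(DOY)
gaps
```

For the example vector this should list the gaps 2 to 5, 7 to 10 and 10 to 15, with 2, 2 and 4 missing days respectively.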

Related

Transpose column and group dataframe [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 5 years ago.
I'm trying to change a dataframe in R to group multiple rows by a measurement. The table has a location (km), a size (mm), a count of things in that size bin, a site and a year. I want to take the sizes, make a column from each one (2, 4 and 6 in this example), and place the corresponding count into the row for that location, site and year.
It seems like a combination of transposing and grouping, but I can't figure out a way to accomplish this in R. I've looked at t(), dcast() and aggregate(), but those aren't really close at all.
So I would go from something like this:
df <- data.frame(km=c(rep(32,3),rep(50,3)), mm=rep(c(2,4,6),2), count=sample(1:25,6), site=rep("A", 6), year=rep(2013, 6))
  km mm count site year
1 32  2    18    A 2013
2 32  4     2    A 2013
3 32  6    12    A 2013
4 50  2     3    A 2013
5 50  4    17    A 2013
6 50  6    21    A 2013
To this:
  km site year mm_2 mm_4 mm_6
1 32    A 2013   18    2   12
2 50    A 2013    3   17   21
Edit: I tried the solution in a suggested duplicate, but it did not work for me; I'm not really sure why. The answer below worked better.
As suggested in the comment above, we can use the sep argument in spread:
library(tidyr)
spread(df, mm, count, sep = "_")
km site year mm_2 mm_4 mm_6
1 32 A 2013 4 20 1
2 50 A 2013 15 14 22
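In more recent versions of tidyr, pivot_wider() is the documented successor to spread(), so the same reshaping might be sketched as below. This is an illustration rather than part of the original answer; it assumes a tidyr version that provides pivot_wider(), and it hard-codes the counts from the question's table instead of sampling them.

```r
library(tidyr)

# Counts copied from the table in the question (not randomly sampled)
df <- data.frame(km = rep(c(32, 50), each = 3),
                 mm = rep(c(2, 4, 6), 2),
                 count = c(18, 2, 12, 3, 17, 21),
                 site = "A",
                 year = 2013)

# One column per size class, named mm_2, mm_4, mm_6
wide <- pivot_wider(df, names_from = mm, values_from = count,
                    names_prefix = "mm_")
wide
```

The result is a tibble rather than a plain data frame; as.data.frame(wide) converts it back if that matters.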
As you mentioned dcast(), here is a method using it.
set.seed(1)
df <- data.frame(km=c(rep(32,3),rep(50,3)),
mm=rep(c(2,4,6),2),
count=sample(1:25,6),
site=rep("A", 6),
year=rep(2013, 6))
library(reshape2)
dcast(df, ... ~ mm, value.var="count")
# km site year 2 4 6
# 1 32 A 2013 13 10 20
# 2 50 A 2013 3 17 1
And if you want a bit of a challenge you can try the base function reshape().
df2 <- reshape(df, v.names="count", idvar="km", timevar="mm", direction="wide")
# fixed=TRUE so the "." is matched literally instead of as a regex wildcard
colnames(df2) <- sub("count.", "mm_", colnames(df2), fixed=TRUE)
df2
# km site year mm_2 mm_4 mm_6
# 1 32 A 2013 13 10 20
# 4 50 A 2013 3 17 1

R - How to sum a column based on date range? [duplicate]

This question already has an answer here:
R // Sum by based on date range
(1 answer)
Closed 7 years ago.
Suppose I have df1 like this:
Date Var1
01/01/2015 1
01/02/2015 4
....
07/24/2015 1
07/25/2015 6
07/26/2015 23
07/27/2015 15
Q1: Sum of Var1 over the 3 days before 7/27/2015 (not including 7/27).
Q2: Sum of Var1 over the 3 days before 7/25/2015 (which is not the last row). Basically, I want to choose any day as the reference day and then calculate the rolling sum.
As suggested in one of the comments in the link referenced by @SeñorO, with a little bit of work you can use zoo::rollsum:
library(zoo)
set.seed(42)
df <- data.frame(d=seq.POSIXt(as.POSIXct('2015-01-01'), as.POSIXct('2015-02-14'), by='days'),
x=sample(20, size=45, replace=TRUE))
k <- 3
df$sum3 <- c(0, cumsum(df$x[1:(k-1)]),
head(zoo::rollsum(df$x, k=k), n=-1))
head(df, 7)
## d x sum3
## 1 2015-01-01 16 0
## 2 2015-01-02 12 16
## 3 2015-01-03 15 28
## 4 2015-01-04 15 43
## 5 2015-01-05 17 42
## 6 2015-01-06 10 47
## 7 2015-01-07 11 42
The c(0, cumsum(...)) part pre-populates the first k rows (here three), where fewer than k previous days exist and rollsum has nothing to report (rollsum(x, k) returns a vector of length length(x)-k+1). The head(..., n=-1) discards the last element, because you said that the nth entry should sum the previous 3 rows and not include its own row.
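If the goal is to pick an arbitrary reference day, as in Q2, a date-based filter is an alternative to a position-based rolling sum. The sketch below is not from the original answer: sum_prev_days is a hypothetical helper, and the small data frame only contains the four July rows that are visible in the question.

```r
# Hypothetical helper: sum of `values` over the k calendar days before `ref`
sum_prev_days <- function(dates, values, ref, k = 3) {
  ref <- as.Date(ref)
  in_window <- dates >= ref - k & dates < ref   # excludes the reference day itself
  sum(values[in_window])
}

# Only the rows shown in the question; the elided rows are not reconstructed
df1 <- data.frame(Date = as.Date(c("07/24/2015", "07/25/2015",
                                   "07/26/2015", "07/27/2015"),
                                 format = "%m/%d/%Y"),
                  Var1 = c(1, 6, 23, 15))

q1 <- sum_prev_days(df1$Date, df1$Var1, "2015-07-27")  # 1 + 6 + 23
q1
```

Because the window is defined by calendar dates rather than row positions, days that are absent from the data simply contribute nothing to the sum.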

Extract intervals from time data in R

My problem is simple. I have a table where each row is an event (month, day, hour and minute are given). However, the machine was set to record 24/7, so I have more events (rows) than I need. How can I remove the surplus daytime rows and keep only the rows from the night (from sunset to sunrise)?
The dreadful thing is that the timing of sunrise/sunset is slightly different each day.
In this example I provide two tables. The first is a table with all events; the second contains the timings of sunset/sunrise for each day.
Also, please note that EACH night spans two dates. If possible, could an additional column containing the ID of the night be inserted into the table? (see scheme below)
# table with all events
my.table <- data.frame(event = 1:34,
day = rep(c(30,31,1,2,3), times = c(8,9,7,8,2)),
month = rep(c(3,4), each = 17),
hour = c(13,13,13,13,22,
22,23,23,2,2,2,
14,14,14,19,22,22,
2,2,2,14,15,22,22,
3,3,3,14,14,14,
23,23,2,14),
minute = c(11,13,44,55,27,
32,54,57,10,14,
26,12,16,46,30,
12,13,14,16,45,
12,15,12,15,24,
26,28,12,16,23,12,13,11,11))
# timings of sunset/sunrise for each day
sun.table <- data.frame(day = c(30,31,31,1,1,2,2,3),
month = rep(c(3,4), times = c(3,5)),
hour = rep(c(19,6), times = 4),
minute = c(30,30,31,29,32,
28,33,27),
type = rep(c("sunset","sunrise"), times = 4))
# right solution: the reduced table would contain only rows:
# 5,6,7,8,9,10,11,16,17,18,19,20,23,24,25,26,27,31,32,33.
# nrow("reduced table") == 20
Here's one possible strategy
# convert sunset/sunrise times to proper date-times
ss <- with(sun.table, ISOdate(2000,month,day,hour,minute))
sunset <- ss[seq(1,length(ss),by=2)]
sunrise <- ss[seq(2,length(ss),by=2)]
Here I assume the table is ordered, starts with a sunset, alternates back and forth, and ends with a sunrise. Date values also need a year; here I just hard coded 2000. As long as your data doesn't span years (or leap days) that should be fine, but you'll probably want to pop in the actual year of your observations.
Now do the same for events
tt <- with(my.table, ISOdate(2000,month,day,hour,minute))
Find rows that fall during the night, i.e. between a sunset and the sunrise that follows it
night <- sapply(tt, function(x) any(sunset<x & x<sunrise))
and extract those rows
my.table[night, ]
# event day month hour minute
# 5 5 30 3 22 27
# 6 6 30 3 22 32
# 7 7 30 3 23 54
# 8 8 30 3 23 57
# 9 9 31 3 2 10
# 10 10 31 3 2 14
# 11 11 31 3 2 26
# 16 16 31 3 22 12
# 17 17 31 3 22 13
# 18 18 1 4 2 14
# 19 19 1 4 2 16
# 20 20 1 4 2 45
# 23 23 1 4 22 12
# 24 24 1 4 22 15
# 25 25 2 4 3 24
# 26 26 2 4 3 26
# 27 27 2 4 3 28
# 31 31 2 4 23 12
# 32 32 2 4 23 13
# 33 33 3 4 2 11
Here we only grab events that fall after a sunset and before the following sunrise. Event 34 (14:11 on day 3) comes after the last sunrise listed in sun.table, so it falls in no night interval and is not returned.
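The question also asks for a column holding the ID of each night. One way this might be done is with findInterval(), which reports the most recent sunset for every event. The sketch below uses a small made-up set of times rather than the full tables, and the variable names (sunset, sunrise, events, night_id) are chosen for illustration only.

```r
# Toy data (assumed for illustration): three sunsets and the sunrises that follow
sunset  <- ISOdate(2000, c(3, 3, 4), c(30, 31, 1), 19, c(30, 31, 32))
sunrise <- ISOdate(2000, c(3, 4, 4), c(31, 1, 2), 6, c(30, 29, 28))

# Five events: afternoon, late evening, small hours, afternoon, small hours
events <- ISOdate(2000, c(3, 3, 3, 3, 4), c(30, 30, 31, 31, 1),
                  c(13, 22, 2, 14, 2), c(11, 27, 10, 12, 14))

# Index of the most recent sunset for each event (0 if no sunset has happened yet)
night_id <- findInterval(as.numeric(events), as.numeric(sunset))

# An event belongs to that night only if it also precedes the matching sunrise
is_night <- night_id > 0 & events < sunrise[pmax(night_id, 1)]
night_id[!is_night] <- NA
night_id
```

Events with a non-NA night_id are the night-time rows, so the same vector serves both as the filter and as the night label. It assumes the sunsets are sorted and that sunrise[i] is the sunrise following sunset[i].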

Prepare Time Series for Machine Learning - Long to Wide Format

I have a data frame of time series data in a 'long' format where there is 1 row/observation per day. I would like to transform this data into a 'wide' format. Each row/observation should have the time series value for the current date and the previous 2 days.
To provide a concrete example, I will use the Air Quality data available in R. This is what my input data frame looks like.
> input <- airquality[1:4,c("Month", "Day", "Ozone")]
> input
Month Day Ozone
1 5 1 41
2 5 2 36
3 5 3 12
4 5 4 18
I would like to transform this input so that it looks like the following.
output <- data.frame(Month = 5, Day = 1:4, Ozone=c(41,36,12,18), Ozone.Prev.1=c(NA,41,36,12), Ozone.Prev.2=c(NA,NA,41,36))
> output
Month Day Ozone Ozone.Prev.1 Ozone.Prev.2
1 5 1 41 NA NA
2 5 2 36 41 NA
3 5 3 12 36 41
4 5 4 18 12 36
Any suggestions on a nice, clean way to do this? Many thanks in advance.
You can use the lag function from zoo, but the following small function does the trick without using additional packages:
shift_vector = function(vec, n) c(rep(NA, n), head(vec, -n))
output = transform(input, prev_1 = shift_vector(Ozone, 1),
prev_2 = shift_vector(Ozone, 2))
output
Month Day Ozone prev_1 prev_2
1 5 1 41 NA NA
2 5 2 36 41 NA
3 5 3 12 36 41
4 5 4 18 12 36
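For more than a couple of lags, the same shifting idea can be wrapped in a loop so the columns do not have to be written out by hand. This is a sketch built on the answer's approach; add_lags is a hypothetical helper, the column naming follows the question's Ozone.Prev.k pattern, and it assumes n_lags is no larger than the number of rows.

```r
# Hypothetical helper: add columns <col>.Prev.1 ... <col>.Prev.n_lags
add_lags <- function(data, col, n_lags) {
  for (k in seq_len(n_lags)) {
    # shift the column down by k rows, padding the top with NA
    data[[paste0(col, ".Prev.", k)]] <- c(rep(NA, k), head(data[[col]], -k))
  }
  data
}

input <- airquality[1:4, c("Month", "Day", "Ozone")]
output <- add_lags(input, "Ozone", 2)
output
```

With n_lags = 2 this should reproduce the desired output from the question, and a larger window only requires changing that argument.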

Drawdown duration

How can I get the duration of the drawdowns in a zoo series?
The drawdowns can be calculated with cummax(mydata)-mydata. Whenever this value is above zero I have a drawdown.
A drawdown is the measure of the decline from a historical peak (maximum).
It lasts until this value is reached again.
The PerformanceAnalytics package has several functions to do this operation.
> library(PerformanceAnalytics)
> data(edhec)
> dd <- findDrawdowns(edhec[,"Funds of Funds", drop=FALSE])
> dd$length
[1] 3 3 6 5 4 11 14 5 2 10 2 6 3 2 4 9 2 2 13 8 5 5 4 2 7
[26] 6 11 3 2 23
As a side note, if you have two dates in a time series and need to know the time between them, just use diff. You can also use the lubridate package.
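The definition given in the question (cummax(mydata) - mydata above zero) can also be turned into durations with base R alone, using rle() to measure runs of consecutive observations below the running peak. The sketch below uses a made-up numeric series; for a zoo object, coredata(mydata) would extract the plain values first. The counting convention here is an assumption and may differ from the one used by findDrawdowns, so the numbers need not match that function's length output.

```r
# Made-up price series for illustration
x <- c(100, 105, 103, 101, 106, 104, 107, 107, 102)

in_dd <- (cummax(x) - x) > 0             # TRUE while below the running peak
runs <- rle(in_dd)
dd_length <- runs$lengths[runs$values]   # observations spent below the peak, per drawdown
dd_length
```

Note that the last run in this example has not recovered by the end of the series, so its length is only a lower bound on that drawdown's duration.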
