How to subset dataframe on dates? - r

I have a panel dataframe in R with many rows. I want to subset it to include only the last 10 days of each month (or, where those are missing, the last observation up to 10 days before the end of the month). However, the months vary and not every month includes end-of-month observations. I need the subset to contain, for every month, the final 10 (or 5) days.
CIV50s = CIV50sub %>%
  select(cusip, date, impl_volatility) %>%
  group_by(year(date), month(date), cusip) %>%
  summarize(impl_volatility = tail(impl_volatility, 1)) %>%
  mutate(date = make_date(`year(date)`, `month(date)`))
I have tried this, but it only gives me the last observation of each month. I need either the last 10 days or the last observation up to 10 days before the end of the month.
My dataset looks like this:

Here are two possible solutions. The first is quick but imprecise: extract the day of the month from each date and keep the days from 20 onward. This is only approximate, though, because months have different lengths.
library(dplyr)
library(lubridate)

df <- data.frame(t = seq(ymd('2018-01-01'), ymd('2019-01-01'), by = 'days'))
# extract day of month
df$day <- as.numeric(format(df$t, '%d'))
df %>% filter(day >= 20) # can change this to 21 or another cutoff
t day
1 2018-01-20 20
2 2018-01-21 21
3 2018-01-22 22
4 2018-01-23 23
5 2018-01-24 24
6 2018-01-25 25
7 2018-01-26 26
The other option is to add the length of each month, compute each day's distance from the month end, and filter on that difference. Either option will work even if the last few days of some months are missing from your data.
df %>%
  mutate(month = as.numeric(format(t, '%m')),
         month.length = case_when(month %in% c(1, 3, 5, 7, 8, 10, 12) ~ 31,
                                  month == 2 ~ 28, # note: ignores leap years
                                  TRUE ~ 30),
         diff = month.length - day) %>%
  filter(diff <= 10)
t day month month.length diff
1 2018-01-21 21 1 31 10
2 2018-01-22 22 1 31 9
3 2018-01-23 23 1 31 8
4 2018-01-24 24 1 31 7
5 2018-01-25 25 1 31 6
6 2018-01-26 26 1 31 5
7 2018-01-27 27 1 31 4
8 2018-01-28 28 1 31 3
9 2018-01-29 29 1 31 2
10 2018-01-30 30 1 31 1
11 2018-01-31 31 1 31 0
12 2018-02-18 18 2 28 10
13 2018-02-19 19 2 28 9
14 2018-02-20 20 2 28 8
15 2018-02-21 21 2 28 7
16 2018-02-22 22 2 28 6
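If you would rather not hard-code month lengths (the case_when above treats February as always having 28 days), here is a sketch using lubridate's days_in_month(), which accounts for leap years:
library(dplyr)
library(lubridate)

# days_in_month() gives the true length of each date's month, leap years included
df %>%
  mutate(diff = days_in_month(t) - day(t)) %>%
  filter(diff <= 10)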

Related

select rows in tibble by a random interval

I'm trying to take a sequence of dates and, starting with the first date, select subsequent dates at intervals drawn from a normal distribution. At the moment I have code that selects rows by a random step, but it uses the same step every time. In this example, it selects a row every 12 days:
set.seed(123)
library(tidyverse)
library(lubridate)
start_date <- as.Date('2018-03-01')
end_date <- as.Date('2018-07-01')
seq_dates <- seq(ymd(start_date), ymd(end_date), by='1 days')
seq_dates <- seq_dates %>%
  as.tibble()
seq_dates
seq_dates %>%
  filter(row_number() %% round(rnorm(n = 1, mean = 14, sd = 3), 0) == 1)
Is there a way I can do this with dplyr, but select a row from the start date at a random interval every time? So from 2018-03-01 the next date might be 12 days later, then 14 days later, then 19 days later, etc?
library(dplyr)
set.seed(10)
n <- rnorm(50, 14, 3)
rows <- cumsum(round(n, 0))
diff(rows) # random ~normal increments used when selecting your rows
# [1] 13 10 12 15 15 10 13 9 13 17 16 13 17 16 14 11 13 17 15 12 7 12 8 10 13 12 11 14 13 8 14 17
# [33] 15 10 10 15 9 13 12 17 12 12 17 11 14 15 13 12 16
seq_dates %>%
  slice(rows[rows <= n()])
# # A tibble: 9 x 1
# value
# <date>
# 1 2018-03-14
# 2 2018-03-27
# 3 2018-04-06
# 4 2018-04-18
# 5 2018-05-03
# 6 2018-05-18
# 7 2018-05-28
# 8 2018-06-10
# 9 2018-06-19
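Equivalently, you can skip the row bookkeeping and add the cumulative random gaps to the start date directly. A sketch under the same seed (row k of seq_dates is start_date + (k - 1)):
set.seed(10)
gaps <- round(rnorm(50, 14, 3), 0)     # one ~normal gap per step
dates <- start_date + cumsum(gaps) - 1 # row k corresponds to start_date + (k - 1)
dates[dates <= end_date]               # keep only dates inside the range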

R - Detect end of observations in groups and remove redundant rows

I have a data.frame of about 300k rows, with 24 rows for each ID, each row representing an hourly observation of that ID. My problem is that for some IDs the observations end before the 24 hours have gone by, yet those IDs still have 24 rows, with the remaining rows holding NA in their 3 observation variables.
A simplified table would look something like this:
ID HOUR OBS_1 OBS_2 OBS_3 MISC MISC_2
1 0 29 32 34 19 21
1 1 21 12 NA 19 21
1 2 NA 24 NA 19 21
1 3 NA NA NA 19 21
1 4 NA NA NA 19 21
2 0 41 16 21 13 24
2 1 NA NA NA 13 24
2 2 11 30 41 13 24
2 3 21 NA NA 13 24
2 4 24 35 21 13 24
2 5 NA NA NA 13 24
2 6 NA NA NA 13 24
3 0 NA NA NA 35 46
3 1 23 34 24 35 46
3 2 NA 26 NA 35 46
3 3 NA NA 24 35 46
3 4 12 29 42 35 46
3 5 NA NA NA 35 46
3 6 NA NA NA 35 46
In the table, each ID would represent a scenario that should be handled appropriately:
ID 1: Ordinary, with observations starting from hour 0 and the last real observation at hour 2; the rows for hours 3 and 4 of that group should therefore be removed.
ID 2: Has an hour (1) where all three observation variables are NA, but observation resumes and the last real observation is at hour 4. The hour-1 row should thus be kept (a faulty registration, not the end of observation), while the rows for hours 5 and 6 should be removed.
ID 3: Starts with a row where all three observation variables are NA, but observation begins the next hour and the last real observation is at hour 4. This is akin to the scenario for ID 2, but occurring at the very start instead of in the middle of the observations. It still represents a faulty registration, so the hour-0 row should be kept and the rows for hours 5 and 6 removed.
Conceptually, a possible solution would be to group_by ID and then have R go through each group's rows in reverse (from the bottom up) until it encounters a row where OBS_1, OBS_2 and OBS_3 are not all NA, remove the rows examined before reaching that row, and then move on to the next group.
Any help would be greatly appreciated!
If your MISC and MISC_2 values are consistent within each ID, you could filter out all rows whose observation values are all NA and then rebuild the missing hours with complete and fill:
library(dplyr)
library(tidyr)
df %>%
  filter(!(is.na(OBS_1) & is.na(OBS_2) & is.na(OBS_3))) %>%
  group_by(ID) %>%
  complete(HOUR = 0:max(HOUR)) %>%
  fill(MISC, MISC_2) %>%
  fill(MISC, MISC_2, .direction = "up")
# A tibble: 13 x 7
# Groups: ID [3]
# ID HOUR OBS_1 OBS_2 OBS_3 MISC MISC_2
# <int> <int> <int> <int> <int> <int> <int>
# 1 1 0 29 32 34 19 21
# 2 1 1 21 12 NA 19 21
# 3 1 2 NA 24 NA 19 21
# 4 2 0 41 16 21 13 24
# 5 2 1 NA NA NA 13 24
# 6 2 2 11 30 41 13 24
# 7 2 3 21 NA NA 13 24
# 8 2 4 24 35 21 13 24
# 9 3 0 NA NA NA 35 46
# 10 3 1 23 34 24 35 46
# 11 3 2 NA 26 NA 35 46
# 12 3 3 NA NA 24 35 46
# 13 3 4 12 29 42 35 46
The approach below instead removes only the trailing all-NA rows in each group, keeping any all-NA rows that do not mark the end of that day's observations. It also allows your other variables to vary during the day, because rows are dropped only once the end of the observations has been reached.
df %>%
  arrange(rev(as.numeric(rownames(.)))) %>%   # reverse the row order
  group_by(ID) %>%
  mutate(rowNum = 1:n(),
         naObs = cumsum(is.na(OBS_1) & is.na(OBS_2) & is.na(OBS_3)),
         missingBlock = naObs != rowNum) %>%  # FALSE while still inside the trailing all-NA block
  slice(min(which(missingBlock)):n()) %>%     # drop the trailing all-NA block
  ungroup() %>%
  arrange(rev(as.numeric(rownames(.)))) %>%   # restore the original order
  select(-rowNum, -naObs, -missingBlock)
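A more compact equivalent is to count the non-missing observation rows from the bottom of each group: trailing all-NA rows score zero and are filtered out, while interior all-NA rows survive. A minimal sketch:
library(dplyr)

df %>%
  group_by(ID) %>%
  # for each row, count observed rows at or below it; 0 means trailing all-NA
  filter(rev(cumsum(rev(!(is.na(OBS_1) & is.na(OBS_2) & is.na(OBS_3))))) > 0) %>%
  ungroup()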

Finding discrepancy between two data sets when setdiff is not working

I have data for the spot price and the day-ahead price for hour 2 and hour 3, shown below. They run from 2015-12-31 back to 2011-01-01.
> head(da2)
Date Price Hour
43802 2015-12-31 12.56 2
43778 2015-12-30 23.59 2
43754 2015-12-29 17.07 2
> head(sp2)
# A tibble: 6 x 3
Date Hour Price
<dttm> <chr> <dbl>
1 2015-12-31 2 17.15
2 2015-12-30 2 26.23
3 2015-12-29 2 23.01
> head(da3)
Date Price Hour
43803 2015-12-31 10.46 3
43779 2015-12-30 23.55 3
43755 2015-12-29 16.52 3
> head(sp3)
# A tibble: 6 x 3
Date Hour Price
<dttm> <chr> <dbl>
1 2015-12-31 3 12.96
2 2015-12-30 3 25.65
3 2015-12-29 3 23.59
I tried to put da2$Price and sp2$Price together, and likewise for hour 3. But unfortunately, I get this:
> rpdf2<-data.frame(da2$Date,da2$Price,sp2$Price)
Error in data.frame(da2$Date, da2$Price, sp2$Price) :
arguments imply differing number of rows: 1826, 1822
> rpdf3<-data.frame(da3$Date,da3$Price,sp3$Price)
Error in data.frame(da3$Date, da3$Price, sp3$Price) :
arguments imply differing number of rows: 1821, 1825
So I applied
> setdiff(paste(da2$Date), paste(sp2$Date))
and found
[1] "2014-03-30" "2013-03-31" "2012-03-25" "2011-03-27"
That was okay. But when I ran setdiff(paste(da3$Date), paste(sp3$Date)), it returned character(0).
There must be a difference of four observations, but I cannot find those four. Can anyone help me with this situation? Thank you.
When I run setdiff(da3$Date, sp3$Date), the result is
[1] 16800.04 16799.04 16798.04 16797.04 16796.04 16795.04 16794.04 16793.04 16792.04 16791.04 16790.04 16789.04 16788.04 16787.04 16786.04 16785.04 16784.04
[18] 16783.04 16782.04 16781.04 16780.04 16779.04 16778.04 16777.04 16776.04 16775.04 16774.04 16773.04 16772.04 16771.04 16770.04 16769.04 16768.04 16767.04
[35] 16766.04 16765.04 16764.04 16763.04 16762.04 16761.04 16760.04 16759.04 16758.04 16757.04 16756.04 16755.04 16754.04 16753.04 16752.04 16751.04 16750.04
[52] 16749.04 16748.04 16747.04 16746.04 16745.04 16744.04 16743.04 16742.04 16741.04 16740.04 16739.04 16738.04 16737.04 16736.04 16735.04 16734.04 16733.04
[69] 16732.04 16731.04 16730.04 16729.04 16728.04 16727.04 16726.04 16725.04 16724.04 16723.04 16722.04 16721.04 16720.04 16719.04 16718.04 16717.04 16716.04
[86] 16715.04 16714.04 16713.04 16712.04 16711.04 16710.04 16709.04 16708.04 16707.04 16706.04 16705.04 16704.04 16703.04 16702.04 16701.04 16700.04 16699.04
and so on.
One way (of many) to tackle this, instead of looking directly for the differences, is to find a way to join your tables that works regardless. To do so, simply generate a complete sequence of all dates from the first date in your data to the last, then left-join it to each of your day-ahead and spot price data frames in turn. Missing date rows in each table will then show up as NA columns in the joined result.
Here is an example sequence, shortened to one month for this illustration; you would start it at 2011-01-01 instead:
somedates = seq(as.Date("2015-12-01"), as.Date("2015-12-31"), by = "day")
Generate some test data, each table with four randomly missing dates, to simulate your da2, da3, sp2 and sp3 tables:
library(dplyr)

set.seed(0)
da2 = data.frame(Date = sample(somedates, 27)) %>%
  mutate(hour = 2, price = 20)
set.seed(1)
da3 = data.frame(Date = sample(somedates, 27)) %>%
  mutate(hour = 3, price = 21)
set.seed(2)
sp2 = data.frame(Date = sample(somedates, 27)) %>%
  mutate(hour = 2, price = 19)
set.seed(3)
sp3 = data.frame(Date = sample(somedates, 27)) %>%
  mutate(hour = 3, price = 18)
Joining the da2, da3, sp2 and sp3 tables
With the test data generated, joining the tables to the complete sequence of dates (as a data frame) is straightforward. (NB I haven't renamed the joined columns to more meaningful names in the result below.)
all = left_join(data.frame(Date = somedates), da2, by = "Date") %>%
  left_join(da3, by = "Date") %>%
  left_join(sp2, by = "Date") %>%
  left_join(sp3, by = "Date")
Results from the test data joined
> all
Date hour.x price.x hour.y price.y hour.x.x price.x.x hour.y.y price.y.y
1 2015-12-01 2 20 3 21 2 19 3 18
2 2015-12-02 2 20 3 21 2 19 3 18
3 2015-12-03 NA NA 3 21 2 19 3 18
4 2015-12-04 2 20 3 21 2 19 3 18
5 2015-12-05 2 20 3 21 2 19 3 18
6 2015-12-06 2 20 3 21 2 19 3 18
7 2015-12-07 2 20 3 21 2 19 NA NA
8 2015-12-08 2 20 3 21 2 19 3 18
9 2015-12-09 2 20 3 21 NA NA 3 18
10 2015-12-10 2 20 3 21 NA NA 3 18
11 2015-12-11 2 20 3 21 2 19 3 18
12 2015-12-12 NA NA 3 21 2 19 3 18
13 2015-12-13 2 20 NA NA 2 19 NA NA
14 2015-12-14 2 20 3 21 2 19 3 18
15 2015-12-15 2 20 3 21 2 19 3 18
16 2015-12-16 2 20 3 21 2 19 3 18
17 2015-12-17 2 20 3 21 2 19 3 18
18 2015-12-18 2 20 NA NA 2 19 3 18
19 2015-12-19 NA NA 3 21 2 19 3 18
20 2015-12-20 2 20 NA NA NA NA 3 18
21 2015-12-21 2 20 3 21 2 19 3 18
22 2015-12-22 2 20 3 21 2 19 3 18
23 2015-12-23 2 20 3 21 2 19 3 18
24 2015-12-24 2 20 3 21 2 19 NA NA
25 2015-12-25 2 20 3 21 2 19 3 18
26 2015-12-26 2 20 3 21 2 19 3 18
27 2015-12-27 2 20 3 21 2 19 3 18
28 2015-12-28 2 20 3 21 2 19 3 18
29 2015-12-29 2 20 3 21 2 19 3 18
30 2015-12-30 2 20 3 21 NA NA 3 18
31 2015-12-31 NA NA NA NA 2 19 NA NA
Edit: I note the numeric dates you posted as the result of your setdiff have a 0.04 time component as well as the whole-number date, which means your Date columns actually carry a time of day. To get the join to work you would either need to add that time component to the date sequence or truncate each date-time to a whole day. I have tested this, and the truncation can be done fairly simply:
da2$Date = trunc.Date(da2$Date, "days")
da3$Date = trunc.Date(da3$Date, "days")
sp2$Date = trunc.Date(sp2$Date, "days")
sp3$Date = trunc.Date(sp3$Date, "days")
You'd do this before the joins.
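Alternatively (a sketch, assuming the Date columns are POSIXct date-times, as the <dttm> type in your sp2 printout suggests), you could coerce them to plain Dates before joining; just be aware of which time zone the conversion uses:
da2$Date <- as.Date(da2$Date) # drops the time-of-day component
da3$Date <- as.Date(da3$Date)
sp2$Date <- as.Date(sp2$Date)
sp3$Date <- as.Date(sp3$Date)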

reshaping 3 columns into matrix

I have 3 columns of data that I would like to reshape into a matrix in which the columns are the created_at values and the rows are the citibike_station_id values.
head(sample)
available_bike_count created_at citibike_station_id
1 21 2015-10-08 00:00:00 72
2 7 2015-10-08 20:10:00 72
3 18 2015-10-08 06:50:00 72
4 19 2015-10-08 10:10:00 72
5 18 2015-10-08 02:30:00 72
6 17 2015-10-08 05:00:00 72
> dim(sample)
[1] 69511 3
Therefore, I have to group by created_at and by citibike_station_id
> length(unique(sample$created_at))
[1] 145
> length(unique(sample$citibike_station_id))
[1] 482
created_at represents 10-minute time intervals: there should be 145 columns, as there are 145 unique time intervals (one day of data), and 482 rows, as there are 482 unique values of citibike_station_id.
This is an example of what the data should look like in the end; note that in this example the column names are from a different day and year.
head(data[1:6])
station_id X2014.08.18.20.00.00 X2014.08.18.20.10.00 X2014.08.18.20.20.00
1 1 1 0 0
2 2 18 18 19
3 3 5 4 4
4 4 21 20 20
5 5 9 10 8
6 6 9 9 9
X2014.08.18.20.30.00 X2014.08.18.20.40.00
1 2 1
2 18 18
3 4 4
4 21 22
5 5 7
6 9 9
How would one do this with dplyr and tidyr?
library(dplyr)
library(tidyr)
matrix <- sample %>%
  group_by(created_at, citibike_station_id) %>%
  spread(citibike_station_id, created_at)
However, this does not work. Would the reshape2 package provide a better solution?
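For what it's worth, spread() expects a key column (whose values become the new column names) and a value column (which fills the cells). A minimal sketch, assuming each station has at most one reading per interval:
library(dplyr)
library(tidyr)

# key = created_at (becomes the columns), value = available_bike_count (fills the cells)
wide <- sample %>%
  spread(created_at, available_bike_count)
The group_by() step is unnecessary: spread() determines the rows from the remaining column, citibike_station_id.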

Difference in Timestamp

I want to calculate the time difference between two incidents. The first five columns give the date-time of the incident; the remaining five give the date-time of death.
dat <- read.table(header=TRUE, text="
YEAR MONTH DAY HOUR MINUTE D.YEAR D.MONTH D.DAY D.HOUR D.MINUTE
2013 1 6 0 55 2013 1 6 0 56
2013 2 3 21 24 2013 2 4 23 14
2013 1 6 11 45 2013 1 6 12 29
2013 3 6 12 25 2013 3 6 23 55
2013 4 6 18 28 2013 5 3 11 18
2013 4 8 14 31 2013 4 8 14 32")
dat
YEAR MONTH DAY HOUR MINUTE D.YEAR D.MONTH D.DAY D.HOUR D.MINUTE
2013 1 6 0 55 2013 1 6 0 56
2013 2 3 21 24 2013 2 4 23 14
2013 1 6 11 45 2013 1 6 12 29
2013 3 6 12 25 2013 3 6 23 55
2013 4 6 18 28 2013 5 3 11 18
2013 4 8 14 31 2013 4 8 14 32
I want to calculate the difference in time (in minutes). The following code is not getting me anywhere. The timestamps should look like 2013-04-06 04:08.
library(lubridate)
dat$tstamp1 <- mdy(paste(dat$YEAR, dat$MONTH, dat$DAY, dat$HOUR, dat$MINUTE,sep = "-"))
dat$tstamp2 <- mdy(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"))
dat$diff <- dat$tstamp2 -dat$tstamp2 ### want the difference in minutes
In order to parse a date/time string in the "-"-separated format you're creating, you need to supply a custom format to parse_date_time. For example:
parse_date_time(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"),
                "%Y-%m-%d-%H-%M")
Your new code would therefore look like:
library(lubridate)
dat$tstamp1 <- parse_date_time(paste(dat$YEAR, dat$MONTH, dat$DAY, dat$HOUR, dat$MINUTE, sep = "-"),
                               "%Y-%m-%d-%H-%M")
dat$tstamp2 <- parse_date_time(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"),
                               "%Y-%m-%d-%H-%M")
Then the following will get you the time difference in minutes. Use difftime with explicit units, since the units of a bare subtraction depend on the magnitude of the difference:
dat$diff <- as.numeric(difftime(dat$tstamp2, dat$tstamp1, units = "mins"))
For the sample data, the first row gives 1 minute and the second row 1550 minutes.
You can also try this, using base R's strptime:
dat$tstamp1 <- strptime(paste(dat$YEAR, dat$MONTH, dat$DAY, dat$HOUR, dat$MINUTE, sep = "-"), "%Y-%m-%d-%H-%M")
dat$tstamp2 <- strptime(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"), "%Y-%m-%d-%H-%M")
dat$diff <- difftime(as.POSIXct(dat$tstamp2), as.POSIXct(dat$tstamp1), units = "mins")
Using strptime is faster and a bit safer against unexpected data; see ?strptime for details.
