Create interval of dates for my existing data in R

I am trying to get my existing observations to 10 min intervals in R.
I did this:
data3$date= ceiling_date(as.POSIXct(data3$betdate), unit = "10 minutes")
data3 %>% group_by(date, prov) %>%
summarise(cant=n())
The problem with this code is that if an interval has no observations, it does not appear in the output, which makes sense because no rows carry a date in that interval. I need the output to also include the intervals that have no registered observations. Any ideas? Thanks in advance to all of you.

See a simplified example of @Limey's comment, using just months and data.table
# set up fake data
set.seed(1000)
library(lubridate)
# create sequence, and save it as a data.frame so it has a header
months <- seq(ymd("2022-01-01"), ymd("2022-06-01"), by = "month")
# create fake data, and remove some rows
dat <- data.frame(month = months, values = sample(100:200, length(months)))
dat <- dat[-sample(1:length(months),3),]
dat
# month values
#1 2022-01-01 167
#4 2022-04-01 150
#6 2022-06-01 128
Here we perform the merge; the NAs mark the missing observations:
library(data.table)
setDT(dat)
months_listed <- data.frame(month = seq(min(dat$month), max(dat$month), by = "month"))
setDT(months_listed)
merge.data.table(months_listed, dat, by = "month", all.x = T)
# month values
#1: 2022-01-01 167
#2: 2022-02-01 NA
#3: 2022-03-01 NA
#4: 2022-04-01 150
#5: 2022-05-01 NA
#6: 2022-06-01 128
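The same idea can be applied to the 10-minute intervals from the question. The following is a minimal sketch, not a definitive implementation: the small data3 built here is a hypothetical stand-in for the asker's data (only the column names betdate and prov come from the question), and it swaps the data.table merge for tidyr::complete(), which fills in the missing date/prov combinations.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical stand-in for data3: three bets, with a gap between them
data3 <- data.frame(
  betdate = as.POSIXct(c("2022-01-01 00:03:00",
                         "2022-01-01 00:07:00",
                         "2022-01-01 00:42:00"), tz = "UTC"),
  prov = c("A", "A", "A")
)

counts <- data3 %>%
  # round each timestamp up to the end of its 10-minute interval
  mutate(date = ceiling_date(betdate, unit = "10 minutes")) %>%
  # count observations per interval and prov
  count(date, prov, name = "cant") %>%
  # add every interval between the first and last one, with 0 where nothing was observed
  complete(date = seq(min(date), max(date), by = "10 min"),
           prov,
           fill = list(cant = 0L))

counts
```

Intervals without observations then appear with cant equal to 0 instead of being dropped; leave out the fill argument to get NA in those rows.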

Related

How to filter a data set and calculate a new variable faster in R?

I have a data set with values every minute and I want to calculate the average value for every hour. I have tried by using the group_by(), filter() and summarise() from dplyr package to reduce the data every hour. When I use only these functions I am able to get the mean value for every hour but only every month and I want it for each day.
> head(DF)
datetime pw cu year m d hr min
1 2017-08-18 14:56:00 0.0630341 1.94065 2017 8 18 14 53
2 2017-08-18 14:57:00 0.0604653 1.86771 2017 8 18 14 57
3 2017-08-18 14:58:00 0.0601318 1.86596 2017 8 18 14 58
4 2017-08-18 14:59:00 0.0599276 1.83761 2017 8 18 14 59
5 2017-08-18 15:00:00 0.0598998 1.84177 2017 8 18 15 0
I had to use a for loop to reduce my table. I wrote the following to do it:
datetime <- c()
eg_bf <- c()
# start holds the beginning of the first hour (a POSIXct value set beforehand)
for (i in 1:8760) {
  hour <- start + 3600
  # store the hourly mean in its own object so DF is not overwritten
  hourly_mean <- DF %>%
    filter(datetime >= start & datetime < hour) %>%
    summarise(eg = mean(pw))
  datetime <- append(datetime, start)
  eg_bf <- append(eg_bf, hourly_mean$eg)
  start <- hour
}
new_DF <- data.frame(datetime, eg_bf)
So I was able to get my new data set with the mean value for every hour of the year.
datetime eg_bf
1 2018-01-01 00:00:00 0.025
2 2018-01-01 01:00:00 0.003
3 2018-01-01 02:00:00 0.002
4 2018-01-01 03:00:00 0.010
5 2018-01-01 04:00:00 0.015
The problem I'm facing is that it takes a lot of time. The idea is to add this calculation to a shiny UI, so every time I make a change the recalculation must be fast. Any idea how to improve this calculation?
You can try this: use make_datetime from the lubridate package to build a new date_time column from the year, month, day and hour columns of your dataset, then group and summarise on the new column.
library(dplyr)
library(lubridate)
df %>%
mutate(date_time = make_datetime(year, m, d, hr)) %>%
group_by(date_time) %>%
summarise(eg_bf = mean(pw))
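If the year, m, d and hr columns were not already in the data, the same grouping key could be derived directly from the timestamp. A minimal sketch, assuming a POSIXct datetime column as in the question; the four-row DF is made up for illustration, and floor_date() is used in place of make_datetime():

```r
library(dplyr)
library(lubridate)

# Made-up sample: four one-minute readings straddling an hour boundary
DF <- data.frame(
  datetime = as.POSIXct("2017-08-18 14:58:00", tz = "UTC") + 60 * 0:3,
  pw = c(1, 3, 5, 7)
)

hourly <- DF %>%
  # truncate each timestamp to the start of its hour
  mutate(date_time = floor_date(datetime, unit = "hour")) %>%
  group_by(date_time) %>%
  summarise(eg_bf = mean(pw))

hourly
```

Because the key keeps the full date, each hour of each day gets its own row, so hours from different days or months are never pooled.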
@Adam Gruer's answer provides a nice solution for the date variable that should solve your question. The calculation of the mean per hour does work with just dplyr, though:
df %>%
group_by(year, m, d, hr) %>%
summarise(test = mean(pw))
# A tibble: 2 x 5
# Groups: year, m, d [?]
year m d hr test
<int> <int> <int> <int> <dbl>
1 2017 8 18 14 0.0609
2 2017 8 18 15 0.0599
You said in your question:
When I use only these functions I am able to get the mean value for every hour but only every month and I want it for each day.
What did you do differently?
Even if you've found your answer, I believe this is worth mentioning:
If you're working with a lot of data and speed is an issue, then you might want to see if you can use data.table instead of dplyr.
A simple benchmark shows how much faster data.table is:
library(dplyr)
library(lubridate)
library(data.table)
library(microbenchmark)
set.seed(123)
# dummy data, one year, one entry per minute
# first as data frame
DF <- data.frame(datetime = seq(as.POSIXct("2018-01-01 00:00:00"),
as.POSIXct("2019-01-02 00:00:00"), 60),
pw = runif(527041)) %>%
mutate(year = year(datetime), m=month(datetime),
d=day(datetime), hour = hour(datetime))
# save it as a data.table
dt <- as.data.table(DF)
# transformation with dplyr
f_dplyr <- function(){
DF %>%
group_by(year, m, d, hour) %>%
summarize(eg_bf = mean(pw))
}
# transformation with data.table
f_datatable <- function() {
dt[, mean(pw), by=.(year, m, d, hour)]
}
# benchmarking
microbenchmark(f_dplyr(), f_datatable())
#
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# f_dplyr() 41.240235 44.075019 46.85497 45.64998 47.95968 76.73714 100 b
# f_datatable() 9.081295 9.712694 12.53998 10.55697 11.33933 41.85217 100 a
Check out this post, which covers a lot: data.table vs dplyr: can one do something well the other can't or does poorly?
As I understood you have a data frame of 365 * 24 * 60 rows. The code below returns the result instantly. The outcome is mean(pw) grouped by every hour of the year.
remove(list = ls())
library(dplyr)
library(lubridate)
library(purrr)
library(tibble)
date_time <- seq.POSIXt(
as.POSIXct("2018-01-01"),
as.POSIXct("2019-01-01"),
by = "1 min"
)
n <- length(date_time)
data <- tibble(
date_time = date_time,
pw = runif(n),
cu = runif(n),
ye = year(date_time),
mo = month(date_time),
da = day(date_time),
hr = hour(date_time)
)
grouped <- data %>%
group_by(
ye, mo, da, hr
) %>%
summarise(
mean_pw = mean(pw)
)

How do I create a daily time series using data that isn't taken daily

I have a csv file that is written like this
Date Data
1/5/1980 25
1/7/1980 30
2/13/1980 44
4/13/1980 50
I'd like R to produce something like this
Date Data
1/1/1980
1/2/1980
1/3/1980
1/4/1980
1/5/1980 25
1/6/1980
1/7/1980 30
Then I would like R to bring the last observation forward like this
Date Data
1/1/1980
1/2/1980
1/3/1980
1/4/1980
1/5/1980 25
1/6/1980 25
1/7/1980 30
I'd like two separate data.tables created one with just the actual data, then another with the last observation brought forward.
Thanks for all the help!
Edit: I will also need any NAs that are populated to be changed to 0.
You could also use tidyverse:
library(tidyverse)
df %>%
mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
complete(Date = seq(as.Date(format(min(Date), "%Y-%m-01")), max(Date), by = "day")) %>%
fill(Data) %>%
replace(., is.na(.), 0)
First 10 rows:
# A tibble: 104 x 2
Date Data
<date> <dbl>
1 1980-01-01 0
2 1980-01-02 0
3 1980-01-03 0
4 1980-01-04 0
5 1980-01-05 25
6 1980-01-06 25
7 1980-01-07 30
8 1980-01-08 30
9 1980-01-09 30
10 1980-01-10 30
I've used as a starting point the 1st day of the month and year of minimum date, and maximum the maximum date; this can be of course adjusted as needed.
EDIT: @Sotos has an even better suggestion for a more concise approach (by better usage of the format argument):
df %>%
mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
complete(Date = seq(as.Date(format(min(Date), "%Y-%m-01")), max(Date), by = "day")) %>%
fill(Data)
The solution is:
1. create a data.frame with successive dates
2. merge it with your original data.frame
3. use the na.locf function from zoo to carry your data forward
Here is the code. I use lubridate to work with dates, and zoo is loaded up front because as.yearmon comes from it.
library(lubridate)
library(zoo)
df$Date <- mdy(df$Date)
successive <- data.frame(Date = seq(as.Date(as.yearmon(df$Date[1])), df$Date[length(df$Date)], by = "days"))
successive is a data.frame holding the successive dates. Now the merging:
result <- merge(df, successive, by = "Date", all.y = TRUE)
And the forward propagation:
library(zoo)
result$Data <- na.locf(result$Data,na.rm = F)
Date Data
1 1980-01-05 25
2 1980-01-06 25
3 1980-01-07 30
4 1980-01-08 30
5 1980-01-09 30
6 1980-01-10 30
7 1980-01-11 30
8 1980-01-12 30
9 1980-01-13 30
10 1980-01-14 30
11 1980-01-15 30
12 1980-01-16 30
13 1980-01-17 30
14 1980-01-18 30
15 1980-01-19 30
16 1980-01-20 30
17 1980-01-21 30
18 1980-01-22 30
19 1980-01-23 30
20 1980-01-24 30
21 1980-01-25 30
The data:
df <- read.table(text = "Date Data
1/5/1980 25
1/7/1980 30
2/13/1980 44
4/13/1980 50", header = T)
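For reference, the three steps can also be sketched in base R without zoo. This is only an illustration of what the merge and the carry-forward do, using the first two rows of the sample data; the names grid, actual and locf are made up here, and the index trick below is a stand-in for na.locf, not its implementation.

```r
# First two observations from the sample data, already parsed as Date
df <- data.frame(Date = as.Date(c("1980-01-05", "1980-01-07")),
                 Data = c(25, 30))

# Step 1: one row per day, starting at the first of the month
grid <- data.frame(Date = seq(as.Date("1980-01-01"), max(df$Date), by = "day"))

# Step 2: left join onto the grid; days without a reading get NA
actual <- merge(grid, df, by = "Date", all.x = TRUE)

# Step 3: carry the last observation forward.
# idx counts how many non-missing values have been seen so far (0 before the first one)
idx <- cumsum(!is.na(actual$Data))
locf <- actual
locf$Data <- c(NA, actual$Data[!is.na(actual$Data)])[idx + 1]

# Days before the first observation are still NA; set them to 0 as the question's edit asks
locf$Data[is.na(locf$Data)] <- 0

locf
```

Here actual is the table with just the recorded values and locf is the table with the last observation brought forward.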
Assume that the result should start at the first of the month of the first date and end at the last date, and that the input data frame is DF, shown reproducibly in the Note at the end. Convert DF to a zoo object z and create a grid of dates g. Merging them gives the zoo objects z0 (zero filled) and zz (na.locf filled). Optionally convert these back to data frames, or leave them as they are so that you can use zoo for further processing.
library(zoo)
z <- read.zoo(DF, header = TRUE, format = "%m/%d/%Y")
g <- seq(as.Date(as.yearmon(start(z))), end(z), "day")
z0 <- merge(z, zoo(, g), fill = 0) # zero filled
zz <- na.locf0(merge(z, zoo(, g))) # na.locf filled
# optional
DF0 <- fortify.zoo(z0) # zero filled
DF2 <- fortify.zoo(zz) # na.locf filled
data.table
The question mentions data tables and if that refers to the data.table package then add:
library(data.table)
DT0 <- data.table(DF0) # zero filled
DT2 <- data.table(DF2) # na.locf filled
Variations
(1) I wasn't clear on whether the question was asking for a zero filled answer and an na.locf filled answer, or just an na.locf filled answer whose remaining NA values are 0 filled, but assumed the former case. If you want to fill the NAs that are left in the na.locf filled answer then add:
zz[is.na(zz)] <- 0
(2) If you want to end at the end of the last month rather than at the last date, replace end(z) with as.Date(as.yearmon(end(z)), frac = 1).
(3) If you want to start at the first date rather than the first of the month of the first date, replace as.Date(as.yearmon(start(z))) with start(z).
(4) As an alternative to (3), to start at the first date and end at the last date we could simply convert to ts and back. Note that we need to restore Date class on the second line below since ts class cannot handle Date class directly.
z2.na <- as.zoo(as.ts(z))
time(z2.na) <- as.Date(time(z2.na))
zz20 <- replace(z2.na, is.na(z2.na), 0) # zero filled
zz2 <- na.locf0(z2.na) # na.locf filled
Note
Lines <- "
Date Data
1/5/1980 25
1/7/1980 30
2/13/1980 44
4/13/1980 50"
DF <- read.table(text = Lines, header = TRUE)

Plotting daily summed values of data against months [duplicate]

I am trying to make a ggplot of solar irradiance (from a weather file) on y-axis and time in months on x-axis.
My data consists of hourly values for 12 months, so overall there are 8760 rows of data.
Now, I want the plot to show a single point per day, obtained by adding up all the values for that day, rather than plotting every value. I believe geom_freqpoly() can plot this type of data, but I have not found enough examples of what I want. Alternatively, some other approach may achieve this plot; I am not sure what exactly I have to do to add up the points for a day, and writing code for 365 days by hand is crazy.
I want the following kind of plot
My plot is showing all the reading for a year and looks like this
My code for this plotting is :
library(ggplot2)
library(lubridate) # month(), mday()
library(tidyr) # unite()
cmsaf_data <- read.csv("C://Users//MEJA03514//Desktop//main folder//Irradiation data//tmy_era_25.796_45.547_2005_2014.csv",skip=16, header=T)
time<- strptime(cmsaf_data[,2], format = "%m/%d/%Y %H:%M")
data <- cbind(time,cmsaf_data[5])
#data %>% select(time)
data <- data.frame(data, months = month(time),days = mday(time))
data <- unite(data, date_month, c(months, days), remove=FALSE, sep="-")
data <- subset(data, data[,2]>0)
GHI <- data[,2]
date_month <- data[,3]
ggplot(data, aes(date_month, GHI))+geom_line()
whereas my data looks like this :
head(data)
time Global.horizontal.irradiance..W.m2.
1 2007-01-01 00:00:00 0
2 2007-01-01 01:00:00 0
3 2007-01-01 02:00:00 0
4 2007-01-01 03:00:00 0
5 2007-01-01 04:00:00 0
6 2007-01-01 05:00:00 159
As I want one point per day, how can I apply a sum function to get the output I require and show month names on the x-axis? Maybe something from the date and time functions can do this addition per day and give 365 values for a year.
I have no idea at all of any such function or approach.
Your help will be appreciated!
Here is a solution using the tidyverse and lubridate packages. As you haven't provided complete sample data, I've generated some random data.
library(tidyverse)
library(lubridate)
data <- tibble(
time = seq(ymd_hms('2007-01-01 00:00:00'),
ymd_hms('2007-12-31 23:00:00'),
by='hour'),
variable = sample(0:400, 8760, replace = TRUE)
)
head(data)
#> # A tibble: 6 x 2
#> time variable
#> <dttm> <int>
#> 1 2007-01-01 00:00:00 220
#> 2 2007-01-01 01:00:00 348
#> 3 2007-01-01 02:00:00 360
#> 4 2007-01-01 03:00:00 10
#> 5 2007-01-01 04:00:00 18
#> 6 2007-01-01 05:00:00 227
summarised <- data %>%
mutate(date = date(time)) %>%
group_by(date) %>%
summarise(total = sum(variable))
head(summarised)
#> # A tibble: 6 x 2
#> date total
#> <date> <int>
#> 1 2007-01-01 5205
#> 2 2007-01-02 3938
#> 3 2007-01-03 5865
#> 4 2007-01-04 5157
#> 5 2007-01-05 4702
#> 6 2007-01-06 4625
summarised %>%
ggplot(aes(date, total)) +
geom_line()
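The question also asked for month names on the x-axis. One way to get them, sketched here under the assumption that the x variable is a Date column as in summarised above, is scale_x_date() with a monthly break and the %b format. The simulated data and the names hourly, daily and p are illustrative only.

```r
library(dplyr)
library(ggplot2)

set.seed(1)
# Simulated hourly readings for one non-leap year (8760 rows)
hourly <- data.frame(
  time = seq(as.POSIXct("2007-01-01 00:00:00", tz = "UTC"),
             by = "hour", length.out = 8760),
  variable = sample(0:400, 8760, replace = TRUE)
)

# One row per calendar day holding the daily total
daily <- hourly %>%
  group_by(date = as.Date(time)) %>%
  summarise(total = sum(variable))

# Daily totals as a line, with one abbreviated month name per tick
p <- ggplot(daily, aes(date, total)) +
  geom_line() +
  scale_x_date(date_breaks = "1 month", date_labels = "%b")

# print(p) draws the chart
```

The %b labels follow the current locale, so the month abbreviations may not be in English on every machine.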
In order to get a sum for every month of every year, you need to create a Column which describes a specific month of a specific year (Yearmon).
Then you can group over that Column and sum over that group giving you one sum for every month of every year.
Then you just plot it and set the labels of the x-axis to your liking.
library(ggplot2)
library(dplyr)
library(zoo)
library(scales)
# Create dummy data for time column
time <- seq.POSIXt(from = as.POSIXct("2007-01-01 00:00:00"),
to = as.POSIXct("2017-01-01 23:00:00"),
by = "hour")
# Create dummy data.frame
data <- data.frame(Time = time,
GHI = rnorm(length(time)))
############################
# Add column Yearmon to the data.frame
# Group by Yearmon and summarise with sum
# This creates one sum per Yearmon
# ungroup is often not necessary, however
# not doing this caused problems for me in the past
# Change type of Yearmon to Date for ggplot
#
df <- mutate(data,
Yearmon = as.yearmon(Time)) %>%
group_by(Yearmon) %>%
summarise(GHI_sum = sum(GHI)) %>%
ungroup() %>%
mutate(Yearmon = as.Date(Yearmon))
# Plot the chart with special scale labels
ggplot(df, aes(Yearmon, GHI_sum))+
geom_line()+
scale_x_date(labels = date_format("%m/%y"))
I hope this helps.

Filter a data frame by two time series

Hi, I am new to R and would like to know if there is a simple way to filter data over multiple date ranges.
I have data with dates from 07/03/2003 to 31/12/2016.
I need to split/filter the data by multiple date ranges, as shown below.
Dates required in the new data frame:
07/03/2003 to 06/03/2005
and
01/01/2013 to 31/12/2016
i.e. the new data frame should not include dates from 07/03/2005 to 31/12/2012.
Let's take the following data.frame with dates:
library(dplyr)
library(lubridate)
df <- data.frame(date = c(ymd("2017-02-02"), ymd("2016-02-02"), ymd("2014-02-01"), ymd("2012-01-01")))
date
1 2017-02-02
2 2016-02-02
3 2014-02-01
4 2012-01-01
I can filter this for a range of dates using lubridate::ymd, dplyr::filter and dplyr::between:
df1 <- filter(df, between(date, ymd("2017-01-01"), ymd("2017-03-01")))
date
1 2017-02-02
Or:
df2 <- filter(df, between(date, ymd("2013-01-01"), ymd("2014-04-01")))
date
1 2014-02-01
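To keep both of the date ranges from the question in one data frame, the two between() conditions can be combined with |. A minimal sketch; the six-row df is made up so that it straddles the boundaries of the excluded period:

```r
library(dplyr)
library(lubridate)

# Made-up dates sitting on either side of each boundary
df <- data.frame(
  date = dmy(c("07-03-2003", "06-03-2005", "07-03-2005",
               "31-12-2012", "01-01-2013", "31-12-2016"))
)

kept <- df %>%
  filter(
    between(date, dmy("07-03-2003"), dmy("06-03-2005")) |  # first range
      between(date, dmy("01-01-2013"), dmy("31-12-2016"))  # second range
  )

kept
```

between() is inclusive at both ends, so the boundary dates of each range are kept and only 07/03/2005 and 31/12/2012 are dropped from this sample.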
I would go with lubridate. In particular
library(data.table)
library(lubridate)
set.seed(555)#in order to be reproducible
N <- 1000#number of pseudonumbers to be generated
date1<-dmy("07-03-2003")
date2<-dmy("06-03-2005")
date3<-dmy("01-01-2013")
date4<-dmy("31-12-2016")
Creating data table with two columns (dates and numbers):
my_dt <- data.table(date_sample = sample(seq(date1, date4, by = "day"), N), numeric_sample = sample(N, replace = FALSE))
> head(my_dt)
date_sample numeric_sample
1: 2007-04-11 2
2: 2006-04-20 71
3: 2007-12-20 46
4: 2016-05-23 78
5: 2011-10-07 5
6: 2003-09-10 47
Let's impose some cuts:
forbidden_dates<-interval(date2+1,date3-1)#create interval that dates should not fall in.
> forbidden_dates
[1] 2005-03-07 UTC--2012-12-31 UTC
test_date1<-dmy("08-03-2003")#should not fall in above range
test_date2<-dmy("08-03-2005")#should fall in above range
Therefore:
test_date1 %within% forbidden_dates
[1] FALSE
test_date2 %within% forbidden_dates
[1] TRUE
A good way of visualizing the cut is to plot the data before and after applying it:
> plot(my_dt) # before
my_dt <- my_dt[!(date_sample %within% forbidden_dates)] # applying the temporal cut
> plot(my_dt) # after

calculate stats based on dynamic window using dplyr

I am trying to use dplyr in R to calculate rolling stats (mean, sd, etc) based on a dynamic window based on dates and for specific models. For instance, within groupings of items, I would like to calculate the rolling mean for all data 10 days prior. The dates on the data are not sequential and not complete so I can't use a fixed window.
One way to do this is use rollapply referencing the window width as shown below. However, I'm having trouble calculating the dynamic width. I'd prefer a method that omits the intermediate step of calculating the window and simply calculate based on the date_lookback. Here's a toy example.
I've used for loops to do this, but they are very slow.
library(dplyr)
library(zoo)
date_lookback <- 10 #days to look back for rolling calcs
df <- data.frame(label = c(rep("a",5),rep("b",5)),
date = as.Date(c("2017-01-02","2017-01-20",
"2017-01-21","2017-01-30","2017-01-31","2017-01-05",
"2017-01-08","2017-01-09","2017-01-10","2017-01-11")),
data = c(790,493,718,483,825,186,599,408,108,666),stringsAsFactors = FALSE) %>%
mutate(.,
cut_date = date - date_lookback, #calcs based on sample since this date
dyn_win = c(1,1,2,3,3,1,2,3,4,5), ##!! need to calculate this vector??
roll_mean = rollapply(data, align = "right", width = dyn_win, mean),
roll_sd = rollapply(data, align = "right", width = dyn_win, sd))
These are the roll_mean and roll_sd results I'm looking for:
> df
label date data cut_date dyn_win roll_mean roll_sd
1 a 2017-01-02 790 2016-12-23 1 790.0000 NA
2 a 2017-01-20 493 2017-01-10 1 493.0000 NA
3 a 2017-01-21 718 2017-01-11 2 605.5000 159.0990
4 a 2017-01-30 483 2017-01-20 3 564.6667 132.8847
5 a 2017-01-31 825 2017-01-21 3 675.3333 174.9467
6 b 2017-01-05 186 2016-12-26 1 186.0000 NA
7 b 2017-01-08 599 2016-12-29 2 392.5000 292.0351
8 b 2017-01-09 408 2016-12-30 3 397.6667 206.6938
9 b 2017-01-10 108 2016-12-31 4 325.2500 222.3921
10 b 2017-01-11 666 2017-01-01 5 393.4000 245.5928
Thanks in advance.
You could try explicitly referencing your dataset inside the dplyr call:
date_lookback <- 10 #days to look back for rolling calcs
df <- data.frame(label = c(rep("a",5),rep("b",5)),
date = as.Date(c("2017-01-02","2017-01-20",
"2017-01-21","2017-01-30","2017-01-31","2017-01-05",
"2017-01-08","2017-01-09","2017-01-10","2017-01-11")),
data = c(790,493,718,483,825,186,599,408,108,666),stringsAsFactors = FALSE)
df %>%
group_by(date,label) %>%
mutate(.,
roll_mean = mean(ifelse(df$date >= date-date_lookback & df$date <= date & df$label == label,
df$data,NA),na.rm=TRUE),
roll_sd = sd(ifelse(df$date >= date-date_lookback & df$date <= date & df$label == label,
df$data,NA),na.rm=TRUE))
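Note that the answer above relies on every (date, label) group holding a single row. A variant that groups by label only and loops over the rows of each group is sketched below. It is an illustration on the same toy data, not a tuned implementation, and its cost grows quadratically with group size; the name res is made up here.

```r
library(dplyr)

date_lookback <- 10 # days to look back for rolling calcs
df <- data.frame(
  label = c(rep("a", 5), rep("b", 5)),
  date = as.Date(c("2017-01-02", "2017-01-20", "2017-01-21", "2017-01-30",
                   "2017-01-31", "2017-01-05", "2017-01-08", "2017-01-09",
                   "2017-01-10", "2017-01-11")),
  data = c(790, 493, 718, 483, 825, 186, 599, 408, 108, 666),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(label) %>%
  mutate(
    # for each row i, use the rows of the same label dated within the lookback window
    roll_mean = sapply(seq_along(date), function(i)
      mean(data[date >= date[i] - date_lookback & date <= date[i]])),
    roll_sd = sapply(seq_along(date), function(i)
      sd(data[date >= date[i] - date_lookback & date <= date[i]]))
  ) %>%
  ungroup()

res
```

Inside mutate(), date and data refer to the current group's columns, so no separate dyn_win vector is needed.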
