Can I lag daily data by months with lag ( )? - r

I have daily data over several years for several currencies. I would like to lag variables in the data set by exactly one month (i.e. June 15 to July 15, not necessarily 30 days). NA's are fine where that is not possible.
I have gotten by so far by writing something like this:
ddply(Data, .(Currency), function(x){ #First column is date, 2nd is Currency, rest data.
y=x[,-(3)] #this is all data I want lagged
y$date=as.Date(y$date) %m+% months(1) #This increases the dates by one month
x$date=as.Date(x$date) #what data I dont want lagged is x[, (1:3)] below
z=merge(x[,(1:3)], y, by=c("date", "Currency")) #merge by date,
#lagged stuff merges with non-lagged stuff 1 month later than original obs date
return(z)
})
I can include the data if necessary, but given that I have something that works already I dont want anyone to spend time on it.
I just want to check that I cant use the lubridate %m+% months(1) syntax within the lag function. I have tried the lag function from package "statar" that uses the along_by syntax but haven't been able to figure it out.
Thanks!

Related

How can i obtain a tsibble from this tibble without using the as.Date function?

I need to convert several "tibble" into "tsibble".
Here a simple example:
require(tidyverse)
require(lubridate)
time_1 <- c(ymd_hms('20210101 000000'),
ymd_hms('20210101 080000'),
ymd_hms('20210101 160000'),
# ymd_hms('20210102 000000'),
ymd_hms('20210102 080000'),
ymd_hms('20210102 160000'))
df_1 <- tibble(time_1, y=rnorm(5))
df_1 %>%
as_tsibble(index=time_1)
This chunk of code works as expected.
But, if the dates are all midnights, this code throws an error:
time_2 <- c(ymd_hms('20210101 000000'),
ymd_hms('20210102 000000'),
ymd_hms('20210103 000000'),
# ymd_hms('20210104 000000'),
ymd_hms('20210105 000000'),
ymd_hms('20210106 000000'))
df_2 <- tibble(time_2, y=rnorm(5))
df_2 %>%
as_tsibble(index=time_2)
I don't want to solve this issue in this way because the as.Date function changes the column type.
df_2 %>%
mutate(time_2=as.Date(time_2)) %>%
as_tsibble(index=time_2)
I also don't want to fix the issue in this way because after converting the tibble into tsibble i need to apply the fill_gaps function, which doesn't create the ymd_hms('20210104 000000') in this second scenario.
df_2 %>%
as_tsibble(index=time_2, regular=FALSE)
Is this a bug?
Thanks.
This behaviour is explained in tsibble's FAQ.
Essentially subdaily data (ymd_hms()) measured at midnight each day doesn't necessarily have an interval of 1 day (24 hours). Consider that some days have shifts due to daylight savings in your time zone, and so the number of hours between midnight and midnight the next day may be 23 or 25 hours.
If you're working with data measured at a daily interval, you should use a date with ymd() precision. You can covert it back to a date time using as_datetime() if you like.
Personally I don't think this should produce an error, however it is much simpler if it does. Perhaps the appropriate interval here is 1 hour or 30 minutes (or whatever is appropriate for timezone shifts in the specified timezone).

Subset a dataframe based on numerical values of a string inside a variable

I have a data frame which is a time series of meteorological measurement with monthly resolution from 1961 till 2018. I am interested in the variable that measures the monthly average temperature since I need the multi-annual average temperature for the summers.
To do this I must filter from the "DateVaraible" column the fifth and sixth digit, which are the month.
The values in time column are formatted like this
"19610701". So I need the 07(Juli) after 1961.
I start coding for 1 month for other purposes, so I did not try anything worth to mention. I guess that .grepl could do the work, but I do not know how the "matching" operator works.
So I started with this code that works.
summersmonth<- Df[DateVariable %like% "19610101" I DateVariable %like% "19610201"]
I am expecting a code like this
summermonths <- Df[DateVariable %like% "**06**" I DateVariable%like% "**07**..]
So that all entries with month digit from 06 to 09 are saved in the new dataframe summermonths.
Thanks in advance for any reply or feedback regarding my question.
Update
Thank to your answers I got the first part, which is to convert the variable in a as.date with the format "month"(Class=char)
Now I need to select months from Juni to September .
A horrible way to get the result I wanted is to do several subset and a rbind afterward.
Sommer1<-subset(Df, MonthVar == "Mai")
Sommer2<-subset(Df, MonthVar == "Juli")
Sommer3<-subset(Df, MonthVar == "September")
SummerTotal<-rbind(Sommer1,Sommer2,Sommer3)
I would be very glad to see this written in a tidy way.
Update 2 - Solution
Here is the tidy way, as here Using multiple criteria in subset function and logical operators
Veg_Seas<-subset(Df, subset = MonthVar %in% c("Mai","Juni","Juli","August","September"))
You can convert your date variable as date (format) and take the month:
allmonths <- month(as.Date(Df$DateVariable, format="%Y%m%d"))
Note that of your column has been originally imported as factor you need to convert it to character first:
allmonths <- month(as.Date(as.character(Df$DateVariable), format="%Y%m%d"))
Then you can check whether it is a summermonth:
summersmonth <- Df[allmonths %in% 6:9, ]
Example:
as.Date("20190702", format="%Y%m%d")
[1] "2019-07-02"
month(as.Date("20190702", format="%Y%m%d"))
[1] 7
We can use anydate from anytime to convert to Date class and then extract the month
library(anytime)
month(anydate(as.character(Df$DateVariable)))

Compute average over sliding time interval (7 days ago/later) in R

I've seen a lot of solutions to working with groups of times or date, like aggregate to sum daily observations into weekly observations, or other solutions to compute a moving average, but I haven't found a way do what I want, which is to pluck relative dates out of data keyed by an additional variable.
I have daily sales data for a bunch of stores. So that is a data.frame with columns
store_id date sales
It's nearly complete, but there are some missing data points, and those missing data points are having a strong effect on our models (I suspect). So I used expand.grid to make sure we have a row for every store and every date, but at this point the sales data for those missing data points are NAs. I've found solutions like
dframe[is.na(dframe)] <- 0
or
dframe$sales[is.na(dframe$sales)] <- mean(dframe$sales, na.rm = TRUE)
but I'm not happy with the RHS of either of those. I want to replace missing sales data with our best estimate, and the best estimate of sales for a given store on a given date is the average of the sales 7 days prior and 7 days later. E.g. for Sunday the 8th, the average of Sunday the 1st and Sunday the 15th, because sales is significantly dependent on day of the week.
So I guess I can use
dframe$sales[is.na(dframe$sales)] <- my_func(dframe)
where my_func(dframe) replaces every stores' sales data with the average of the store's sales 7 days prior and 7 days later (ignoring for the first go around the situation where one of those data points is also missing), but I have no idea how to write my_func in an efficient way.
How do I match up the store_id and the dates 7 days prior and future without using a terribly inefficient for loop? Preferably using only base R packages.
Something like:
with(
dframe,
ave(sales, store_id, FUN=function(x) {
naw <- which(is.na(x))
x[naw] <- rowMeans(cbind(x[naw+7],x[naw-7]))
x
}
)
)

Plotting truncated times from zoo time series

Let's say I have a data frame with lots of values under these headers:
df <- data.frame(c("Tid", "Value"))
#Tid.format = %Y-%m-%d %H:%M
Then I turn that data frame over to zoo, because I want to handle it as a time series:
library("zoo")
df <- zoo(df$Value, df$Tid)
Now I want to produce a smooth scatter plot over which time of day each measurement was taken (i.e. discard date information and only keep time) which supposedly should be done something like this: https://stat.ethz.ch/pipermail/r-help/2009-March/191302.html
But it seems the time() function doesn't produce any time at all; instead it just produces a number sequence. Whatever I do from that link, I can't get a scatter plot of values over an average day. The data.frame code that actually does work (without using zoo time series) looks like this (i.e. extracting the hour from the time and converting it to numeric):
smoothScatter(data.frame(as.numeric(format(df$Tid,"%H")),df$Value)
Another thing I want to do is produce a density plot of how many measurements I have per hour. I have plotted on hours using a regular data.frame with no problems, so the data I have is fine. But when I try to do it using zoo then I either get errors or I get the wrong results when trying what I have found through Google.
I did manage to get something plotted through this line:
plot(density(as.numeric(trunc(time(df),"01:00:00"))))
But it is not correct. It seems again that it is just producing a sequence from 1 to 217, where I wanted it to be truncating any date information and just keep the time rounded off to hours.
I am able to plot this:
plot(density(df))
Which produces a density plot of the Values. But I want a density plot over how many values were recorded per hour of the day.
So, if someone could please help me sort this out, that would be great. In short, what I want to do is:
1) smoothScatter(x-axis: time of day (0-24), y-axis: value)
2) plot(density(x-axis: time of day (0-24)))
EDIT:
library("zoo")
df <- data.frame(Tid=strptime(c("2011-01-14 12:00:00","2011-01-31 07:00:00","2011-02-05 09:36:00","2011-02-27 10:19:00"),"%Y-%m-%d %H:%M"),Values=c(50,52,51,52))
df <- zoo(df$Values,df$Tid)
summary(df)
df.hr <- aggregate(df, trunc(df, "hours"), mean)
summary(df.hr)
png("temp.png")
plot(df.hr)
dev.off()
This code is some actual values that I have. I would have expected the plot of "df.hr" to be an hourly average, but instead I get some weird new index that is not time at all...
There are three problems with the aggregate statement in the question:
We wish to truncate the times not df.
trunc.POSIXt unfortunately returns a POSIXlt result so it needs to be converted back to POSIXct
It seems you did not intend to truncate to the hour in the first place but wanted to extract the hours.
To address the first two points the aggregate statement needs to be changed to:
tt <- as.POSIXct(trunc(time(df), "hours"))
aggregate(df, tt, mean)
but to address the last point it needs to be changed entirely to
tt <- as.POSIXlt(time(df))$hour
aggregate(df, tt, mean)

Creating a single timestamp from separate DAY OF YEAR, Year and Time columns in R

I have a time series dataset for several meteorological variables. The time data is logged in three separate columns:
Year (e.g. 2012)
Day of year (e.g. 261 representing 17-September in a Leap Year)
Hrs:Mins (e.g. 1610)
Is there a way I can merge the three columns to create a single timestamp in R? I'm not very familiar with how R deals with the Day of Year variable.
Thanks for any help with this!
It looks like the timeDate package can handle gregorian time frames. I haven't used it personally but it looks straightforward. There is a shift argument in some methods that allow you to set the offset from your data.
http://cran.r-project.org/web/packages/timeDate/timeDate.pdf
Because you mentioned it, I thought I'd show the actual code to merge together separate columns. When you have the values you need in separate columns you can use paste to bring them together and lubridate::mdy to parse them.
library(lubridate)
col.month <- "Jan"
col.year <- "2012"
col.day <- "23"
date <- mdy(paste(col.month, col.day, col.year, sep = "-"))
Lubridate is a great package, here's the official page: https://github.com/hadley/lubridate
And here is a nice set of examples: http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/
You should get quite far using ISOdatetime. This function takes vectors of year, day, hour, and minute as input and outputs an POSIXct object which represents time. You just have to split the third column into two separate hour minute columns and you can use the function.

Resources