I have the following df:
Id a_min_date a_max_date b_min_date b_max_date c_min_date c_max_date d_min_date d_max_date
1 2014-01-01 2014-01-10 2014-01-05 2014-01-15 NA NA 2014-02-20 2014-05-01
2 2014-02-01 2014-02-10 NA NA 2015-02-20 2015-03-01 NA NA
I have added the intervals of each group (a, b, c, d) by Id. First, I converted the start and end dates to lubridate intervals.
I want to plot the intervals and calculate the time difference in days between the end of each group and the start of the next group if there is no overlap.
I tried to use the IRanges package and converted the dates into integers (as used here (link)), but it does not work for me.
library(IRanges)
library(ggplot2)
ir <- IRanges(start = as.integer(as.Date(df$a_min_date)),
              end   = as.integer(as.Date(df$a_max_date)))
bins <- disjointBins(IRanges(start(ir), end(ir) + 1))
dat <- cbind(as.data.frame(ir), bin = bins)
ggplot(dat) +
  geom_rect(aes(xmin = start, xmax = end,
                ymin = bin, ymax = bin + 0.9)) +
  theme_bw()
I got this error for my original df:
Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") :
solving row 1: range cannot be determined from the supplied arguments (too many NAs)
Does someone have another solution using other packages?
To my knowledge, IRanges is the best package out there to solve this problem.
IRanges needs range values (in this case dates) to compare, and it does not handle undefined values (NAs).
To solve this problem, I would remove all rows with NAs from df before doing the analysis:
df <- df[complete.cases(df[, 1:2]), ]  # columns 1:2 being the pair of date columns compared
For an explanation and other ways to remove NAs, see Remove rows with all or some NAs (missing values) in data.frame.
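For the df above, you would subset on the two date columns of the group you are analysing, e.g. for group a (a sketch using the column names from the question):
ok <- complete.cases(df[, c("a_min_date", "a_max_date")])
ir <- IRanges::IRanges(start = as.integer(as.Date(df$a_min_date[ok])),
                       end   = as.integer(as.Date(df$a_max_date[ok])))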
If this does not fix the problem, you could convert the dates into integers. What matters here is that the dates are in year-month-day format, so that the resulting integers give correct intervals.
Example:
str <- "2006-06-26"
splitted <- unlist(strsplit(str, "-"))
splitted
[1] "2006" "06"   "26"
result <- paste(splitted, collapse = "")
result
[1] "20060626"
I want to create a vector of time stamps consisting of 60 monthly dates and repeat the process n times. That means, if n = 2, the vector should contain 120 time stamps.
I am creating a single vector of time stamps this way:
t <- seq(as.Date("2014-01-01"), as.Date("2018-12-31"), by = "month")
To repeat it n times, I am doing the following:
n <- 2
X <- data.frame(replicate(n, seq(as.Date("2014-01-01"), as.Date("2018-12-31"), by = "month")))
Y <- stack(X)[,"values", drop=FALSE]
head(Y)
values
1 16071
2 16102
3 16130
4 16161
5 16191
6 16222
As you can see, the values are not in date format anymore. My question is how to retain the date format in the vector Y. Is there a smarter way to do this?
Take a look at the zoo package; there is an old thread here https://stat.ethz.ch/pipermail/r-help//2010-March/233159.html where they talk about much the same problem.
Either way, after loading zoo you can do
as.Date(16071)
and it will return the date in Date format (zoo supplies a default origin of 1970-01-01; in base R you would need as.Date(16071, origin = "1970-01-01")). Hope this makes sense.
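Alternatively, the stacking step can be avoided entirely, since rep() preserves the Date class (a sketch):
t <- seq(as.Date("2014-01-01"), as.Date("2018-12-31"), by = "month")
n <- 2
Y <- data.frame(values = rep(t, n))  # rep() keeps the Date class
head(Y)
      values
1 2014-01-01
2 2014-02-01
3 2014-03-01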
a = c(22,23,00,01,02) # hours from 22:00 to 02:00 the next morning
b = c(4,8,-12,3,5)    # some values
df = data.frame(a,b)
When I plot this data with ggplot2 it sorts the first column a, but I don't want it to be sorted.
The code used in ggplot2 is ggplot(df, aes(a,b)) + geom_line()
In this case, the x-axis is sorted, which gives wrong results: e.g. hour 0 appears with value 4, when the truth is that hour 22 has value 4.
R needs to somehow know that what you provide in vector "a" is a time. I have changed your vector slightly to give R the necessary information:
a = as.POSIXct(c("0122","0123","0200","0201","0202"), format="%d%H")
# hours from 22:00 to 02:00 the next morning (as strings)
# the day is arbitrary but must be provided
b = c(4,8,-12,3,5) #some values
df = data.frame(a,b)
ggplot(df, aes(a,b)) + geom_line()
You can use paste() to glue days and hours together automatically (e.g. paste(day, 22, sep="")).
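If the hours come as plain numbers, zero-padding with sprintf() before parsing avoids ambiguous strings (a sketch; day and hour are hypothetical vectors):
day  <- c(1, 1, 2, 2, 2)   # arbitrary day numbers
hour <- c(22, 23, 0, 1, 2) # hours as in the question
a <- as.POSIXct(sprintf("%02d%02d", day, hour), format = "%d%H")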
Has anyone dealt with calculating historical mean log returns in time series datasets?
The dataset is ordered by individual security first and by time for each respective security. I am trying to form a historical mean log return, i.e. the mean log return for the security from its first appearance in the dataset to date, for each point in time for each security.
Luckily, the return time series contains NAs between the returns of different securities. My idea is to calculate a historical mean that restarts after each NA.
A simple cumsum() probably will not do it, as the NAs will have to be dropped.
I thought about using rollmean(), if I only knew an efficient way to specify the 'width' parameter to the length of the vector of consecutive preceding non-NAs.
The current approach I am taking, based on Count how many consecutive values are true, takes far too much time given the size of the data set I am working with.
For any x of the form x = [r(1) r(2) ... r(N)], where r(i) is the log return in period i:
df <- data.frame(x, zcount = NA)
df[1, 2] <- 0  # df$x[1] = NA by construction of the data set
for (i in 2:nrow(df))
  df$zcount[i] <- ifelse(!is.na(df$x[i]), df$zcount[i - 1] + 1, 0)
Any idea how to speed this up would be highly appreciated!
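Edit: for reference, a vectorized version of the counter itself using rle() (a sketch; df$x is the stacked return vector as above):
r <- rle(!is.na(df$x))  # lengths of alternating non-NA / NA stretches
df$zcount <- unlist(mapply(function(ok, len) if (ok) seq_len(len) else rep(0L, len),
                           r$values, r$lengths, SIMPLIFY = FALSE))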
You will need to reshape the data.frame to apply the cumsum function over each security. Here's how:
First, I'll generate some data on 100 securities over 100 months, which I think corresponds to your description of the data set:
securities <- 100
months <- 100
time <- rep(seq.Date(as.Date("2010/1/1"), by = "months", length.out = months), times = securities)
ID <- rep(paste0("sec", 1:securities), each = months)
returns <- rnorm(securities * months, mean = 0.08, sd = 2)
df <- data.frame(time, ID, returns)
head(df)
time ID returns
1 2010-01-01 sec1 -3.0114466
2 2010-02-01 sec1 -1.7566112
3 2010-03-01 sec1 1.6615731
4 2010-04-01 sec1 0.9692533
5 2010-05-01 sec1 1.3075774
6 2010-06-01 sec1 0.6323768
Now, you must reshape your data so that each security column contains its returns and each row represents a date.
library(tidyr)
df_wide <- spread(df, ID, returns)
Once this is done, you can use the apply function to sum every column, each of which now represents one security, or use the cumsum function. Notice the data object df_wide[-1], which drops the time column; this is necessary to keep the sum and cumsum functions from throwing an error on the Date column.
matrix_sum <- apply(df_wide[-1], 2, FUN = sum)
matrix_cumsum <- apply(df_wide[-1], 2, FUN = cumsum)
Now, add the time column back as a data.frame if you like:
df_final <- data.frame(time = df_wide[,1], matrix_cumsum)
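Since the question asks for a historical mean rather than a sum, dividing the cumulative sum by the number of observations so far turns it into a running mean per security (a sketch, assuming no NAs remain after the reshape):
matrix_cummean <- apply(df_wide[-1], 2, function(x) cumsum(x) / seq_along(x))
df_means <- data.frame(time = df_wide[, 1], matrix_cummean)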
I'm trying to create a function in R that replicates Excel's EOMonth function (i.e. you enter a date and a number of months, and it returns the end-of-month date the given number of months before or after the input date). I've got a function that works on a single input, using the lubridate package:
EOMonth <- function(date, months)
{
  NewDate <- date %m+% months(months)
  NewDate <- ceiling_date(NewDate, "month") - days(1)
  NewDate
}
The problem is how to 'vectorise' this (not sure that's the right word). When I pass the function a vector (in this case a column in a data frame), I get the following message:
NAs are not allowed in subscripted assignments
I don't want the NAs removed from the vector (because I am sending the results to a new column in the data frame). I just want the function to return NA when it sees an NA, but to process all the valid dates as the function dictates. I'm really confused as to how to do this; most of the posts I have seen on this topic relate to how to ignore or remove NAs from the results.
Any help would be greatly appreciated.
Thanks.
Edit: I have added some sample data. Below is the sample input data:
01/07/2016
NA
22/07/2016
NA
30/06/2016
22/07/2016
22/07/2016
29/07/2016
NA
22/07/2016
30/06/2016
NA
31/01/2016
02/08/2016
So, entering the following:
newVector <- EOMonth(OldVector, 3)
Should return the end of the month for each of the dates in 3 months' time:
31/10/2016
NA
31/10/2016
NA
30/09/2016
31/10/2016
31/10/2016
31/10/2016
NA
31/10/2016
30/09/2016
NA
30/04/2016
30/11/2016
One solution is to first make a vector of NAs and then process only the non-NA elements of date. Note that the NA vector needs to be of class Date, or the dates get converted to numeric.
library(lubridate)

EOMonth <- function(date, months)
{
  NewDate <- as.Date(rep(NA, length(date)))  # NA vector of class Date
  NewDate[!is.na(date)] <- date[!is.na(date)] %m+% months(months)
  NewDate[!is.na(date)] <- ceiling_date(NewDate[!is.na(date)], "month") - days(1)
  NewDate
}
EOMonth(OldVector, 3)
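For example, parsing the dd/mm/yyyy sample strings into Date first (a sketch reproducing the first few sample inputs):
OldVector <- as.Date(c("01/07/2016", NA, "22/07/2016", NA, "30/06/2016"),
                     format = "%d/%m/%Y")
EOMonth(OldVector, 3)
[1] "2016-10-31" NA           "2016-10-31" NA           "2016-09-30"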
Say I have the following matrix:
x1 = 1:288
x2 = matrix(x1,nrow=96,ncol=3)
Is there an easy way to get the mean of rows 1:24, 25:48, 49:72, 73:96 for column 2?
Basically I have a one-year time series and I have to average some data every 24 hours.
There is. Suppose we have the days:
Days <- rep(1:4,each=24)
you could do easily
tapply(x2[,2],Days,mean)
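which returns the daily means as a named vector:
    1     2     3     4 
108.5 132.5 156.5 180.5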
If you have a data frame with a Date variable, you can use that one. You can do that for all variables at once, using aggregate:
x2 <- as.data.frame(cbind(x2,Days))
aggregate(x2[,1:3],by=list(Days),mean)
Take a look at the help files of these functions to start with. Also do a search here; there are quite a few other interesting answers on this problem:
Aggregating daily content
Compute means of a group by factor
PS: If you're going to do a lot of time series work, you should take a look at the zoo package (on CRAN: http://cran.r-project.org/web/packages/zoo/index.html).
1) ts. Since this is a regularly spaced time series, convert it to a ts series and then aggregate it from frequency 24 to frequency 1:
aggregate(ts(x2[, 2], freq = 24), 1, mean)
giving:
Time Series:
Start = 1
End = 4
Frequency = 1
[1] 108.5 132.5 156.5 180.5
2) zoo. Here it is using zoo. The zoo package can also handle irregularly spaced series (if we needed to extend this). Below day.hour is the day number (1, 2, 3, 4) plus the hour as a fraction of the day so that floor(day.hour) is just the day number:
library(zoo)
day.hour <- seq(1, length.out = length(x2[, 2]), by = 1/24)
z <- zoo(x2[, 2], day.hour)
aggregate(z, floor, mean)
## 1 2 3 4
## 108.5 132.5 156.5 180.5
If zz is the output from aggregate then coredata(zz) and time(zz) are the values and times, respectively, as ordinary vectors.
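For instance:
zz <- aggregate(z, floor, mean)
coredata(zz)
## [1] 108.5 132.5 156.5 180.5
time(zz)
## [1] 1 2 3 4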
A quite compact and computationally fast way of doing this is to reshape the vector into a suitable matrix and calculate the column means:
colMeans(matrix(x2[,2],nrow=24))
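For the example data this returns the same four daily means as the approaches above:
## [1] 108.5 132.5 156.5 180.5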