I have some data that I aggregate on a per-minute basis with the code below, based on a dataset covering one day.
I would however like to be able to run this code on a data file that combines multiple days. I have a date column in the dataset, so I can use that as a unique identifier for each day. Is there a way to aggregate the data on a 1-minute basis, given that the dates aren't the same?
The problem is that the unique function extracts the unique events that occur on the first day, and then adds all the same events that happen in that minute afterwards. If I base it on the date too, I believe I can create unique 1-minute entries for each day in one long dataset.
Below is the code that works for a single day's data.
novo <- read.csv("C:/Users/Morten/Desktop/data.csv", header = TRUE, stringsAsFactors=FALSE )
TimeStamp <- novo[,1]
price <- novo[, 2]
volume <- novo[,3]
nV <- sum(volume)
MinutesFloor <- unique(floor(TimeStamp))
nTradingMinutes <- length(MinutesFloor)
PriceMin <- rep(0, nTradingMinutes)
VolumeMin <- rep(0, nTradingMinutes)
for (j in 1:nTradingMinutes) {
  ThisMinutes <- (floor(TimeStamp) == MinutesFloor[j])
  PriceMin[j] <- mean(price[ThisMinutes])
  VolumeMin[j] <- sum(volume[ThisMinutes])
}
Thanks in advance
data format:
date,"ord","shares","finalprice","time","stock"
20100301,C,80,389,540.004,1158
20100301,C,77,389,540.004,1158
20100301,C,60,389,540.004,1158
20100301,C,28,389,540.004,1158
20100301,C,7,389,540.004,1158
20100302,C,25,394.7,540.00293333,1158
20100302,C,170,394.7,540.00293333,1158
20100302,C,40,394.7,540.00293333,1158
20100302,C,75,394.7,540.00293333,1158
20100302,C,100,394.7,540.00293333,1158
20100302,C,1,394.7,540.00293333,1158
I would like to suggest a radically simplified version of your code.
You are doing quite a few things rather inefficiently. R is made to compute summary statistics grouped by different data values, and we will use these methods heavily.
I assume your data to be of the form you provided. On my system, this looks like
novo <- read.csv("test.csv", header = TRUE, stringsAsFactors=FALSE )
This gives us:
> str(novo)
'data.frame': 11 obs. of 6 variables:
$ date : int 20100301 20100301 20100301 20100301 20100301 20100302 20100302 20100302 20100302 20100302 ...
$ ord : chr "C" "C" "C" "C" ...
$ shares : int 80 77 60 28 7 25 170 40 75 100 ...
$ finalprice: num 389 389 389 389 389 ...
$ time : num 540 540 540 540 540 ...
$ stock : int 1158 1158 1158 1158 1158 1158 1158 1158 1158 1158 ...
Now, I assume that your date is ordered YearMonthDay. If you have a different ordering, you will have to alter the format argument below. Furthermore, your time is presumably in minutes.
Then we can create timestamps containing both the date and the time using the POSIXct datatype:
timestamps <- as.POSIXct(as.character(novo$date), format='%Y%m%d') + novo$time*60
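For instance, if your dates were stored day-first instead (hypothetical values, not your data), only the format string changes:
# Day-first variant: here "01032010" means 1 March 2010
d <- c("01032010", "02032010")
as.POSIXct(d, format = '%d%m%Y') + 540 * 60  # plus 540 minutes = 09:00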
Now, we truncate the timestamps to whole minutes by creating a factor variable with the cut function:
timestampsByMinute <- droplevels(cut(timestamps, 'min'))
Note that the additional droplevels call just removes minutes for which no data items are available.
Finally, we can compute the summary statistics you computed in your for-loop.
tapply is a function that takes its first argument, divides it into groups defined by the second argument, and applies the function given as the third argument to each group. Thus we can just throw tapply at your data. (I have the feeling that the column numbers you used in your code do not match the column names in your example data; feel free to adapt to different columns if I interpreted your meaning the wrong way.)
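As a tiny self-contained illustration of tapply (toy values, not your data):
tapply(c(1, 2, 3, 4), c("a", "a", "b", "b"), sum)
#  a  b
#  3  7
Applied to your data: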
PriceMin <- tapply(novo$finalprice, timestampsByMinute, mean)
VolumeMin <- tapply(novo$shares, timestampsByMinute, sum)
This gives us
> PriceMin
2010-03-01 09:00:00 2010-03-02 09:00:00
389.0 394.7
> VolumeMin
2010-03-01 09:00:00 2010-03-02 09:00:00
252 411
which is probably what you want.
Note that tapply is much faster than the loop you used. If you have huge data files, this may be important.
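If you want to check the difference yourself, a minimal timing sketch (reusing novo and timestampsByMinute from above):
# Time the tapply aggregation; wrap your original for-loop in system.time() to compare
system.time(PriceMin <- tapply(novo$finalprice, timestampsByMinute, mean))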
I hope there are no errors left in my code; testing was not easy given that you provided only data for one minute per day.
Edit:
As per request, here a small modification that removes the time information from the data:
> unname(VolumeMin)
[1] 252 411
> unname(PriceMin)
[1] 389.0 394.7
I'm pretty new to R, and so was wondering if someone might be able to help with the error message I've been receiving.
I have a data.frame AUC_sheet which contains the column AUC_sheet$sys_time, which is in POSIXct and represents the times that blood pressure readings were taken during an operation.
I would like to convert AUC_sheet out of POSIXct, so that I can get an accurate result during subsequent area under the curve calculations. I've used the following for loop to perform the conversion:
for(i in 1:length(AUC_sheet$sys_time)){
  AUC_sheet$sys_time[i] <- as.numeric(difftime(time1 = AUC_sheet$sys_time[1],
                                               time2 = AUC_sheet$sys_time[i],
                                               units = "hours"))
}
But I keep getting an error message as follows
Error in as.POSIXct.numeric(value) : 'origin' must be supplied
I've tried using origin = "1970-01-01" but it tells me this is an unused argument.
Is there something glaringly obvious that I'm doing wrong?
Thanks in advance, and sorry if I've not provided enough data; I can post more as an edit if needed.
EDIT
AUC_sheet$sys_time looks like this
sys_value sys_time
<dbl> <time>
1 85 2013-08-28 12:48:24
2 NA 2013-08-28 12:48:39
3 NA 2013-08-28 12:48:54
4 NA 2013-08-28 12:49:08
5 NA 2013-08-28 12:49:24
6 170 2013-08-28 12:49:38
7 150 2013-08-28 12:49:54
8 167 2013-08-28 12:50:09
9 175 2013-08-28 12:50:24
10 167 2013-08-28 12:50:39
# ... with 549 more rows
Your problem is not the as.numeric call itself. The problem is that you are trying to write the result of that call to a column which is a POSIXct column. So, R tries to convert it to the correct format for you, and fails, because the conversion method requires an origin.
If you write to a new column (or, better yet, write the for loop as a single vectorised operation to avoid the issue) then you shouldn't have a problem.
# make up some dummy data for testing
AUC = data.frame(sys_value = 1:100, sys_time = as.POSIXct(Sys.time() + 1:100))
# this doesn't work, because you're mixing data types in the sys_time column
for(i in 1:length(AUC$sys_time)){
  AUC$sys_time[i] <- as.numeric(difftime(time1 = AUC$sys_time[1],
                                         time2 = AUC$sys_time[i], units = "hours"))
}
# this works, because the numeric data isn't being added to a time column
for(i in 1:length(AUC$sys_time)){
  AUC$sys_time_diff[i] <- as.numeric(difftime(time1 = AUC$sys_time[1],
                                              time2 = AUC$sys_time[i], units = "hours"))
}
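For completeness, a vectorised version that avoids the loop entirely (writing to a new numeric column):
# difftime accepts whole vectors, so no loop is needed
AUC$sys_time_diff <- as.numeric(difftime(time1 = AUC$sys_time[1],
                                         time2 = AUC$sys_time,
                                         units = "hours"))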
In R I have data
USER BIRTH
11 "2013-01-11 22:31:11"
121 "2014-12-26 04:07:35"
...
I want to create a new data set data_new that contains all USER entries whose BIRTH time falls between 10 o'clock and 11 o'clock.
The types of USER and BIRTH are strings/characters. I tried this:
data_new= data$BIRTH > as.POSIXct("10:00:00", format="%H:%M:%S")
& data$BIRTH < as.POSIXct("11:00:00", format="%H:%M:%S")
but here R gives me FALSE for all entries, so this doesn't work.
How can I solve this?
Update
Say I want to find the number of users for every hour. Using the answer, I tried this
u=c()
for(j in 1:24) {
  data_new=data[times > "00:00:00"+(j-1) & times < "01:00:00"+j ,]
  #saving the number of users in vector u
  u[j]=dim(data_new)[1]
}
but R can't evaluate the term "00:00:00"+(j-1), since a number can't be added to a character string.
If df is your data frame:
df <- read.table(text = 'USER BIRTH
11 "2013-01-11 22:31:11"
121 "2014-12-26 04:07:35"
121 "2014-12-26 10:07:35"
121 "2014-12-26 11:07:35"
121 "2014-12-26 10:38:35"', header = T)
library(lubridate)
df$BIRTH <- ymd_hms(df$BIRTH)
times <- strftime(df$BIRTH, format = "%H:%M:%S")
df[times > "10:00:00" & times < "11:00:00",]
Output:
USER BIRTH
3 121 2014-12-26 10:07:35
5 121 2014-12-26 10:38:35
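Regarding the update: you can't add a number to a character string in R, but you can build the hour boundaries as strings with sprintf. A minimal sketch reusing times from above:
# Count users in each of the 24 hours; sprintf builds "00:00:00", "01:00:00", ...
u <- integer(24)
for (j in 1:24) {
  lower <- sprintf("%02d:00:00", j - 1)
  upper <- sprintf("%02d:59:59", j - 1)
  u[j] <- sum(times >= lower & times <= upper)
}
u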
One way to do something to each subset of your data is to use the split-lapply paradigm. In this case, you would convert data$BIRTH to POSIXlt and split by the hour component of the POSIXlt object. That will give you a list where each list element contains all the data for a specific hour.
data <- read.csv(text = "USER,BIRTH
11,2013-01-11 22:31:11
12,2014-12-26 04:07:35
21,2014-12-26 10:07:35
121,2014-12-26 11:07:35
112,2014-12-26 10:38:35")
data_by_hour <- split(data, as.POSIXlt(data$BIRTH)$hour)
Then you can use lapply (or sapply) to do whatever you want to each of those subsets. To count the number of observations per hour:
# number of observations for each hour
sapply(data_by_hour, nrow)
4 10 11 22
1 2 1 1
You can also do this with xts.
library(xts)
# Create xts object from 'data' data.frame
# Note: xts objects are based on a matrix, so you cannot have columns with
# mixed types like you can with a data.frame.
x <- xts(data["USER"], as.POSIXct(data$BIRTH))
period.apply(x, endpoints(x, "hours"), nrow)
# USER
# 2013-01-11 22:31:11 1
# 2014-12-26 04:07:35 1
# 2014-12-26 10:38:35 2
# 2014-12-26 11:07:35 1
Note that you can do time-of-day subsetting with xts. It avoids potential locale-related collation order issues caused by using logical operators on character strings.
x["T10:00/T11:00"]
# USER
# 2014-12-26 10:07:35 21
# 2014-12-26 10:38:35 112
I have a data frame with 3 years' worth of sales data that I'm trying to convert to a time series. Manually creating subsets for each of the 36 months:
mydfJan2011 <- subset(myDataFrame,
as.Date("2011-01-01") <= myDataFrame$Dates &
myDataFrame$Dates <= as.Date("2011-01-31"))
...
mydfDec2013 <- subset(myDataFrame,
as.Date("2013-12-01") <= myDataFrame$Dates &
myDataFrame$Dates <= as.Date("2013-12-31"))
and then summing them up and putting them into a vector
counts[1] <- sum(mydfJan2011$itemsSold)
...
counts[36] <- sum(mydfDec2013$itemsSold)
to get the values for the time series works fine, but I'd like to make it a little more automatic as I have to create more than one time series, so I'm trying to turn it into a loop.
In order to do that, I need to create a string with a subset command like this:
"subset(myDataFrame,
as.Date("2011-01-01") <= myDataFrame$Dates &
myDataFrame$Dates <= as.Date("2011-01-31"))"
But when I use paste, the result is this:
myString
>"subset(myDataFrame, as.Date(\"2011-02-01\") <= myDataFrame$Dates & myDataFrame$Dates <= as.Date(\"2011-02-28\"))"
and
eval(parse(text = myString))
results in the following error message:
Error in charToDate(x) :
character string is not in a standard unambiguous format
whereas just typing in the command (without escapes) results in the subset I'm trying to create.
I've tried playing around with single and double quotes, substitute and deparse, but none of it results in any kind of subset of my data frame.
Any suggestions?
Even another way of splitting up the data by month and summing it up would be welcome.
Thanks,
Signe
Here is a solution using tapply:
with(sales, tapply(itemsSold, substr(Dates, 1, 7), sum))
Produces monthly sums (I limited my data to 9 months for illustrative purposes, but this extends to longer periods):
2011-01 2011-02 2011-03 2011-04 2011-05 2011-06 2011-07 2011-08 2011-09
1592.097 1468.427 1594.386 1563.014 1595.489 1560.361 1553.128 1663.705 1325.519
tapply computes the sum of values in a vector (sales$itemsSold) grouped by the values of another vector (substr(sales$Dates, 1, 7), which is basically "yyyy-mm"). with allows me to avoid typing sales$ repeatedly. You should almost never have to use eval(parse(...)); there is almost always a better, faster way to do it without resorting to that.
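If you do want explicit per-month subsets in a loop, build the boundaries as Date objects instead of pasting code strings; a minimal sketch (using the sales data defined below):
# First day of each month from Jan to Sep 2011
starts <- seq(as.Date("2011-01-01"), as.Date("2011-09-01"), by = "month")
counts <- sapply(starts, function(s) {
  end <- seq(s, by = "month", length.out = 2)[2] - 1  # last day of the month
  sum(sales$itemsSold[sales$Dates >= s & sales$Dates <= end])
})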
And here is the data I used:
set.seed(1)
sales <- data.frame(Dates=seq(as.Date("2011-01-01"), as.Date("2011-09-30"), by="+1 day"))
sales$itemsSold <- runif(nrow(sales), 1, 100)
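Since the end goal is a time series, here is a hedged sketch that turns the monthly sums into a ts object (assuming the data starts in January 2011):
monthly <- with(sales, tapply(itemsSold, substr(Dates, 1, 7), sum))
# tapply returns a named vector ordered by "yyyy-mm", so it maps directly onto ts()
counts <- ts(as.numeric(monthly), start = c(2011, 1), frequency = 12)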
For reference, there are also several 3rd party packages that simplify this type of computation (see data.table, dplyr).
Here's a data.table approach that aggregates by year and month, using the first of the month as the respective group label:
library(data.table)
##
mDt <- Dt[
,list(monthSold=sum(itemsSold)),
keyby=list(mDay=as.Date(paste0(
year(Dates),"-",month(Dates),"-01")))]
##
R> head(mDt)
mDay monthSold
1: 2012-01-01 179
2: 2012-02-01 128
3: 2012-03-01 152
4: 2012-04-01 160
5: 2012-05-01 152
6: 2012-06-01 141
Data:
set.seed(123)
Dt <- data.table(
Dates=seq.Date(
from=as.Date("2012-01-01"),
to=as.Date("2014-12-31"),
by="day"),
itemsSold=rpois(1096,5))
This is a pseudo followup to this question: Why is ggplot graphing null percentage data points?
Let's say this is my dataset:
Date AE AA AEF Percent
1/1/2012 1211 1000 3556 0.03
1/2/2012 100 2000 3221 0.43
1/3/2012 3423 10000 2343 0.54
1/4/2012 10000 3000 332 0.43
1/5/2012 2342 500 4435 0.43
1/6/2012 2342 800 2342 0.23
1/7/2012 2342 1500 1231 0.12
1/8/2012 111 2300 333
1/9/2012 1231 1313 3433
1/10/2012 3453 5654 222
1/11/2012 3453 3453 454
1/12/2012 5654 7685 3452
> str(data)
'data.frame': 12 obs. of 5 variables:
$ Date : Factor w/ 12 levels "10/11/2012","10/12/2012",..: 1 2 3 4 5 6 7 8 9 10 ...
$ AE : int 1211 100 3423 10000 2342 2342 2342 111 1231 3453 ...
$ AA : int 1000 2000 10000 3000 500 800 1500 2300 1313 5654 ...
$ AEF : int 3556 3221 2343 332 4435 2342 1231 333 3433 222 ...
$ Percent: num 0.03 0.43 0.54 0.43 0.43 0.23 0.12 NA NA NA ...
I need something to tell me that the 'Date' column is a Date type as opposed to a numeric or character type (this is because I have to convert the 'Date' column of the input data into an actual Date with as.Date(), assuming that I do not know the column names of the data set).
is.numeric(data[[1]]) returns FALSE
is.character(data[[1]]) returns FALSE
I made the 'Date' column in Excel, formatting the column in the 'Date' format, then saved the file as a csv. What type is this in R? I seek an expression similar to the above that returns TRUE.
Use inherits to detect if argument has datatype Date:
is.date <- function(x) inherits(x, 'Date')
sapply(list(as.Date('2000-01-01'), 123, 'ABC'), is.date)
#[1] TRUE FALSE FALSE
If you want to check if character argument can be converted to Date then use this:
is.convertible.to.date <- function(x) !is.na(as.Date(as.character(x), tz = 'UTC', format = '%Y-%m-%d'))
sapply(list('2000-01-01', 123, 'ABC'), is.convertible.to.date)
# [1] TRUE FALSE FALSE
You could try to coerce all the columns to as.Date and see which ones succeed. You would need to specify the format you expect dates to be in though. E.g.:
data <- data.frame(
Date=c("10/11/2012","10/12/2012"),
AE=c(1211,100),
Percent=c(0.03,0.43)
)
sapply(data, function(x) !all(is.na(as.Date(as.character(x),format="%d/%m/%Y"))))
#Date AE Percent
#TRUE FALSE FALSE
I know this question is old, but I did want to mention that there is now a function in the lubridate package for is.Date and also is.POSIXt
library(lubridate)
sapply(list(as.Date('2000-01-01'), 123, 'ABC'), is.Date)
[1] TRUE FALSE FALSE
The OP clearly asks for just a check:
I need something to tell that the 'Date' column is a Date type
So how many date classes come with R? Exactly two: Date and POSIXt (not counting derived classes like POSIXct and POSIXlt, which inherit from POSIXt).
So we can just check on that, and make it more robust than the answers already given:
is.Date <- function(x) {
inherits(x, c("Date", "POSIXt"))
}
As robust as it gets.
is.Date(as.Date("2020-02-02"))
#> [1] TRUE
is.Date(as.POSIXct("2020-02-02"))
#> [1] TRUE
is.Date(as.POSIXlt("2020-02-02"))
#> [1] TRUE
If you want to know whether a column could be successfully transformed/coerced to a Date type, that is another question. This answer addresses what was asked: 'to tell that [...] is a Date type'.
To work with dates I use a function that identifies whether a string is a date and, if it is, converts it to a predefined format (here I chose '%d/%m/%Y'):
standarDates <- function(string) {
  patterns <- c('[0-9]{4}/[0-9]{2}/[0-9]{2}',
                '[0-9]{2}/[0-9]{2}/[0-9]{4}',
                '[0-9]{4}-[0-9]{2}-[0-9]{2}')
  formatdates <- c('%Y/%m/%d', '%d/%m/%Y', '%Y-%m-%d')
  standardformat <- '%d/%m/%Y'
  for (i in 1:3) {
    if (grepl(patterns[i], string)) {
      aux <- as.Date(string, format = formatdates[i])
      if (!is.na(aux)) {
        return(format(aux, standardformat))
      }
    }
  }
  return(FALSE)
}
Suppose you have the vector
a=c("2018-24-16","1587/03/16","fhjfmk","9885/04/16")
> sapply(a,standarDates)
2018-24-16 1587/03/16 fhjfmk 9885/04/16
"FALSE" "16/03/1587" "FALSE" "16/04/9885"
with the command
"FALSE"%in%sapply(a,standarDates)
[1] True
you can figure out whether all the elements are dates.
The advantage of this function is that you can add more patterns and date formats to match the data you are working with, and end up with one standard format for all the dates. (The disadvantage is that it isn't exactly what the question is asking.)
I hope this helps.
A function that I created based on the answers here, and am using now:
is.Date <- function(date) {
  if (sapply(date, function(x)
    !all(is.na(as.Date(
      as.character(x),
      format = c("%d/%m/%Y", "%d-%m-%Y", "%Y/%m/%d", "%Y-%m-%d")
    ))))) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}
I will refer to a simple example and I hope it can be generalized.
Say that you have a date
d1<-Sys.Date()
d1
"2020-02-12"
deparse(d1)
"structure(18304, class = \"Date\")"
Thus
grep("Date",deparse(d1))>=1
TRUE
Alternatively, use
class(d1)
"Date"
I know this is an old question and I am not providing a self-developed answer, but perhaps some non-experts in R (like myself) could find it useful to use the skim function (from the skimr package) to check whether one or more variables of a data frame (df) are in the Date format.
The syntax is just skim(df), or skimr::skim(df) if one does not want to attach the package just for this check.
The obtained output is a very detailed summary of the data frame in which variables are grouped by format (character, Date, numeric...) and additional info is provided (e.g. whether there are missing values, descriptive statistics, etc).
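A minimal sketch (column names made up for illustration):
library(skimr)
df <- data.frame(d = as.Date("2020-01-01") + 0:2, x = 1:3)
skim(df)  # the summary groups d under "Date" and x under "numeric"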
This is my way of doing it. It works most of the time but needs improvement
MissLt <- function(x, ratio = 0.5){
  sum(is.na(x))/length(x) < ratio
}
IS.Date <- function(x, addformat = NULL, exactformat = NULL){
  if (is.null(exactformat)){
    format <- c("%m/%d/%Y", "%m-%d-%Y", "%Y/%m/%d", "%Y-%m-%d", addformat)
    y <- as.Date(as.character(x), format = format)
    MissLt(y, ratio = 1 - (1/length(y)))
  }
  else{
    y <- as.Date(as.character(x), format = exactformat)
    MissLt(y, ratio = 1 - (1/length(y)))
  }
}
sapply(data, IS.Date)
I am familiar with the zoo function rollapply, which allows you to do rolling computations on zoo or xts objects, where you can specify the rolling increment via the by parameter. I am specifically interested in applying a function every month but using all of the past daily data in the computation. For example, say my data set looks like this:
dte, val
1/01/2001, 10
1/02/2001, 11
...
1/31/2001, 2
2/01/2001, 54
2/02/2001, 34
...
2/30/2001, 29
I would like to select the end of each month and apply a function that uses all the daily data. This doesn't seem like it would work with rollapply since the by argument would be 30 sometimes, 29 other months, etc. My current idea is:
f <- function(xts_obj) { coef(lm(a ~ b, data=as.data.frame(xts_obj)))[1] }
month_end <- endpoints(my_xts, on="months", k=1)
rslt <- apply(month_end, 1, function(idx) { my_xts[paste0("/",idx)] })
Surely there is a better, quicker way to do this, no?
To clarify: I would like to use overlapping periods; just the rolling should be done monthly.
If I understand correctly, you can get the dates of your endpoints, then for each endpoint (i.e. using lapply or for), call rollapply using data up to that point.
getSymbols("SPY", src='yahoo', from='2012-01-01', to='2012-08-01')
idx <- index(SPY)[endpoints(SPY, 'months')]
out <- lapply(idx, function(i) {
as.xts(rollapplyr(as.zoo(SPY[paste0("/", i)]), 5,
function(x) coef(lm(x[, 4] ~ x[, 1]))[2], by.column=FALSE))
})
sapply(out, NROW)
#[1] 16 36 58 78 100 121 142 143
I temporarily coerce to zoo for the rollapplyr call to make sure the rollapply.zoo method is used (as opposed to the unexported rollapply.xts method), then coerce back to xts.
As an answer to "Is the zoo/xts conversion needed?": it isn't needed in this case, but rollapply won't work if you send it a data.frame, as I recently discovered from this StackOverflow answer.
You want period.apply(), or its convenience helper apply.monthly(), both in xts.
Example:
R> foo <- xts(1:100, order.by=Sys.Date()+0:99)
R> apply.monthly(foo, sum)
[,1]
2012-08-31 105
2012-09-30 885
2012-10-31 1860
2012-11-25 2200
R>
or equally
R> apply.monthly(foo, quantile)
0% 25% 50% 75% 100%
2012-08-31 1 4.25 7.5 10.75 14
2012-09-30 15 22.25 29.5 36.75 44
2012-10-31 45 52.50 60.0 67.50 75
2012-11-25 76 82.00 88.0 94.00 100
R>
just to prove that functions returning more than one value can be used too.