I have a huge dataset that, in .csv format, has 2 columns (one is Date_Time and the other is Q.vanda).
This is what the head and tail of the data looks like,
> head(mdf.vanda)
Date_Time Q.vanda
1 1969-12-05 21:00:00 0
2 1969-12-05 21:01:00 4
3 1969-12-05 21:05:00 11
4 1969-12-05 21:20:00 17
5 1969-12-05 22:45:00 27
6 1969-12-05 22:55:00 23
> tail(mdf.vanda)
Date_Time Q.vanda
165738 2016-01-19 10:15:00 2995.25
165739 2016-01-19 10:30:00 2858.04
165740 2016-01-19 10:45:00 2956.94
165741 2016-01-19 11:00:00 2972.52
165742 2016-01-19 11:15:00 2776.99
165743 2016-01-19 11:30:00 3082.53
There are 48 years of data in between, and I want to create a for loop to subset them by year (e.g. from 1969/10/01 to 1970/10/01, 1970/10/01 to 1971/10/01, etc.)
I wrote some code, but it's giving me an error that I am not able to resolve. I am pretty new at R, so feel free to suggest other code that you think is more efficient for my purpose.
code:
cut <- as.POSIXct(strptime(as.character(c('1969/10/01','1970/10/01','1971/10/01','1972/10/01','1973/10/01','1974/10/01','1975/10/01','1976/10/01','1977/10/01','1978/10/01','1979/10/01','1980/10/01','1981/10/01','1982/10/01','1983/10/01','1984/10/01','1985/10/01','1986/10/01','1987/10/01','1988/10/01','1989/10/01','1990/10/01','1991/10/01','1992/10/01','1993/10/01','1994/10/01','1995/10/01','1996/10/01','1997/10/01','1998/10/01',
'1999/10/01','2000/10/01','2001/10/01','2002/10/01','2003/10/01','2004/10/01',
'2005/10/01','2006/10/01','2007/10/01','2008/10/01','2009/10/01','2010/10/01',
'2011/10/01','2012/10/01','2013/10/01','2014/10/01','2015/10/01','2016/10/01')),format = "%Y/%m/%d"))
df.sub <- as.data.frame(matrix(data=NA,nrow=14496, ncol=96)) #nrow = (31+30+31+31+28)*(4*24)[days * readings/day] , ncol = (48*2)[Seasons*cols]
i.odd <- seq(1,49, by=2)
for (i in 1:48) {df.sub[1:length(mdf.vanda$Date_Time[mdf.vanda$Date_Time >= cut[i] & mdf.vanda$Date_Time < cut[i+1]])
,i.odd[i]:(i.odd[i]+1)] <- subset(mdf.vanda,mdf.vanda$Date_Time > cut[i] & mdf.vanda$Date_Time < cut[i+1])}
Error:
Error in [<-.data.frame(*tmp*, 1:length(mdf.vanda$Date_Time[mdf.vanda$Date_Time >= :
replacement element 1 has 1595 rows, need 1596
You can split your data as shown:
split(mdf.vanda, findInterval(as.Date(mdf.vanda$Date_Time), seq(as.Date("1969-10-01"), as.Date("2016-10-01"), "1 year")))
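For example, storing the result and pulling out one season (a small follow-up sketch; the name seasons is just illustrative, and each list element is one October-to-October season):
seasons <- split(mdf.vanda,
                 findInterval(as.Date(mdf.vanda$Date_Time),
                              seq(as.Date("1969-10-01"), as.Date("2016-10-01"), "1 year")))
length(seasons)     # number of seasons that actually contain data
head(seasons[[1]])  # the 1969-10-01 to 1970-10-01 subset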
There is no need for a loop here. Base R has the cut function to perform this very operation, and significantly faster than a loop, since you already have the break points defined in your cut variable.
#cut <- as.POSIXct(c('1969/10/01', ... ,'2016/10/01'),format = "%Y/%m/%d")
mytime <- cut(mdf.vanda$Date_Time, breaks = cut, include.lowest = TRUE)
The variable "mytime" is a vector the length of your data frame with a label to bin the data.
You could then use the split function to break your dataframe in a list of data frames or use the group_by function from the dplyr library for additional data processing.
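For example, a quick sketch of both routes, assuming mytime from above (the names by.season, season and mean.Q are just illustrative):
by.season <- split(mdf.vanda, mytime)   # a list of data frames, one per bin
sapply(by.season, nrow)                 # e.g. number of readings per season
library(dplyr)
mdf.vanda %>%
  group_by(season = mytime) %>%
  summarise(mean.Q = mean(Q.vanda))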
I suggest you have a look at the convenient quantmod package. Once you have time series data, you can use the apply.yearly function and apply any function to every year of data.
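A minimal sketch of that approach, assuming mdf.vanda from the question; note that apply.yearly() works on calendar years, so for the October-to-October seasons you would use period.apply() with custom endpoints instead:
library(quantmod)   # loads xts, which provides xts() and apply.yearly()
qv <- xts(mdf.vanda$Q.vanda, order.by = mdf.vanda$Date_Time)
apply.yearly(qv, mean)   # e.g. the mean of Q.vanda for each year of data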
I have here a really silly output format from observations which I have to read in with scan.
Here's a snippet from the file (data.dat), where I've marked header and data blocks:
06.02.2014 # header
PNP
-0,005
00:05#587 # values
00:15#591
23:50#587
23:55#587
07.02.2014 # header
PNP
-0,005
00:10#587 # values
00:15#590
23:55#590
24:00#593
08.02.2014 # header
PNP
-0,005
00:05#590 # value
00:10#595
00:15#600
23:50#600
23:55#607
The problems are:
I've got data for several years in 5-minute resolution,
each day has its own header (constant length), beginning with the date and two additional entries,
the length of the time series (format HH:MM#value) for each day is not constant, and data gaps exist (not shown in the example)
My aim is a data.frame of the form date, time, value.
So, I need a loop or something which analyses the single list elements (the output from scan(file = data.dat, what = " ") as character). Since the time blocks have different lengths, I'd like to subset my daily data beginning with the date, skipping some further header elements, and then strsplit the time#value elements of the list, which has been output by
crap <- scan(file = data.dat, what=" ") # import as list
the strsplit works well with
tmp <- strsplit(crap[4:8], split="#")
df <- data.frame(date=as.Date(crap[1],format = "%d.%m.%Y"), time=sapply(tmp, "[[", 1), W=sapply(tmp, "[[", 2))
However, I've no idea how to analyse the elements of the list (as characters) and tell whether they have a valid date format.
Cheers!
I have a solution, but it may be very specific to the question you asked and how I interpreted it.
First, read the data and remove the PNP and -0,005 entries.
# comment.char = " " makes read.table ignore everything after the first space on a line,
# which drops the trailing "# header" / "# values" annotations
crap <- read.table(file = "data.dat", comment.char = " ")
a <- as.vector(crap$V1)
a <- a[-grep("PNP|-0,005",x = a)]
Now I extract the dates contained in the vector a
dateId <- grep(".",x=a,fixed=T)
uniquedate <- as.matrix(a[dateId])
> uniquedate
[,1]
[1,] "06.02.2014"
[2,] "07.02.2014"
[3,] "08.02.2014"
Now I create a vector of dates with the same length as the number of values in the dataset, by repeating each date once for every value recorded on that date.
len <- length(dateId)
dateRepVal <- c(diff(dateId)-1,(length(a) - dateId[len]))
dates <- unlist(sapply(1:len,FUN = function(x){rep(uniquedate[x],dateRepVal[x])}))
All other elements except the dates in our dataset a are time-value pairs. Using this information, I now get the time and val by using the strsplit function and then create the data frame.
timeVal <- strsplit(a[-dateId],split = "#")
time <- sapply(timeVal, "[[", 1)
val <- sapply(timeVal, "[[", 2)
DF <- data.frame(date = dates,time=time,val=val)
The final required output looks like this:
>DF
date time val
1 06.02.2014 00:05 587
2 06.02.2014 00:15 591
3 06.02.2014 23:50 587
4 06.02.2014 23:55 587
5 07.02.2014 00:10 587
6 07.02.2014 00:15 590
7 07.02.2014 23:55 590
8 07.02.2014 24:00 593
9 08.02.2014 00:05 590
10 08.02.2014 00:10 595
11 08.02.2014 00:15 600
12 08.02.2014 23:50 600
13 08.02.2014 23:55 607
Hope this solves the problem.
This question asks about aggregation by time period in R, what pandas calls resampling. The most useful answer uses the xts package to group by a given time period, applying some function such as sum() or mean().
One of the comments suggested there was something similar in lubridate, but didn't elaborate. Can someone provide an idiomatic example using lubridate? I've read through the lubridate vignette a couple times and can imagine some combination of lubridate and plyr, however I want to make sure there isn't an easier way that I'm missing.
To make the example more real, let's say I want the daily sum of bicycles traveling northbound from this dataset:
library(lubridate)
library(reshape2)
bikecounts <- read.csv(url("http://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD"), header=TRUE, stringsAsFactors=FALSE)
names(bikecounts) <- c("Date", "Northbound", "Southbound")
Data looks like this:
> head(bikecounts)
Date Northbound Southbound
1 10/02/2012 12:00:00 AM 0 0
2 10/02/2012 01:00:00 AM 0 0
3 10/02/2012 02:00:00 AM 0 0
4 10/02/2012 03:00:00 AM 0 0
5 10/02/2012 04:00:00 AM 0 0
6 10/02/2012 05:00:00 AM 0 0
I don't know why you'd use lubridate for this. If you're just looking for something less awesome than xts, you could try this:
tapply(bikecounts$Northbound, as.Date(bikecounts$Date, format="%m/%d/%Y"), sum)
Basically, you just need to split by Date, then apply a function.
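The same split-then-apply idea can also be written with base R's aggregate(), which returns a data frame rather than a named vector (a sketch, assuming bikecounts as read above):
daily <- aggregate(Northbound ~ as.Date(Date, format = "%m/%d/%Y"),
                   data = bikecounts, FUN = sum)
names(daily)[1] <- "Date"
head(daily)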
lubridate could be used for creating a grouping factor for split-apply problems. So, for example, if you want the sum for each month (ignoring year)
tapply(bikecounts$Northbound, month(mdy_hms(bikecounts$Date)), sum)
But, it's just using wrappers for base R functions, and in the case of the OP, I think the base R function as.Date is the easiest (as evidenced by the fact that the other Answers also ignored your request to use lubridate ;-) ).
Something that wasn't covered by the Answer to the other Question linked to in the OP is split.xts. period.apply splits an xts at endpoints and applies a function to each group. You can find endpoints that are useful for a given task with the endpoints function. For example, if you have an xts object, x, then endpoints(x, "months") would give you the row numbers that are the last row of each month. split.xts leverages that to split an xts object -- split(x, "months") would return a list of xts objects where each component was for a different month.
Although split.xts() and endpoints() are primarily intended for xts objects, they also work on some other objects, including plain time-based vectors. Even if you don't want to use xts objects, you may still find uses for endpoints() because of its convenience or its speed (it is implemented in C).
> split.xts(as.Date("1970-01-01") + 1:10, "weeks")
[[1]]
[1] "1970-01-02" "1970-01-03" "1970-01-04"
[[2]]
[1] "1970-01-05" "1970-01-06" "1970-01-07" "1970-01-08" "1970-01-09"
[6] "1970-01-10" "1970-01-11"
> endpoints(as.Date("1970-01-01") + 1:10, "weeks")
[1] 0 3 10
I think lubridate's best use in this problem is for parsing the "Date" strings into POSIXct objects. i.e. the mdy_hms function in this case.
Here's an xts solution that uses lubridate to parse the "Date" strings.
x <- xts(bikecounts[, -1], mdy_hms(bikecounts$Date))
period.apply(x, endpoints(x, "days"), sum)
apply.daily(x, sum) # identical to above
For this specific task, xts also has an optimized period.sum function (written in Fortran) that is very fast
period.sum(x, endpoints(x, "days"))
Using ddply from the plyr package:
library(plyr)
bikecounts$Date<-with(bikecounts,as.Date(Date, format = "%m/%d/%Y"))
x<-ddply(bikecounts,.(Date),summarise, sumnorth=sum(Northbound),sumsouth=sum(Southbound))
> head(x)
Date sumnorth sumsouth
1 2012-10-02 1165 773
2 2012-10-03 1761 1760
3 2012-10-04 1767 1708
4 2012-10-05 1590 1558
5 2012-10-06 926 1080
6 2012-10-07 951 1191
> tail(x)
Date sumnorth sumsouth
298 2013-07-26 1964 1999
299 2013-07-27 1212 1289
300 2013-07-28 902 1078
301 2013-07-29 2040 2048
302 2013-07-30 2314 2226
303 2013-07-31 2008 2076
Here is an option using data.table, after importing the csv:
library(data.table)
# convert the data.frame to data.table
bikecounts <- data.table(bikecounts)
# Calculate
bikecounts[, list(NB=sum(Northbound), SB=sum(Southbound)), by=as.Date(Date, format="%m/%d/%Y")]
as.Date NB SB
1: 2012-10-02 1165 773
2: 2012-10-03 1761 1760
3: 2012-10-04 1767 1708
4: 2012-10-05 1590 1558
5: 2012-10-06 926 1080
---
299: 2013-07-27 1212 1289
300: 2013-07-28 902 1078
301: 2013-07-29 2040 2048
302: 2013-07-30 2314 2226
303: 2013-07-31 2008 2076
Note, you can also use fread() ("fast read") from the data.table package to read the CSV into a data.table in one step.
The only drawback is that you have to manually convert the date/time from a string.
eg:
bikecounts <- fread("http://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD", header=TRUE, stringsAsFactors=FALSE)
setnames(bikecounts, c("Date", "Northbound", "Southbound"))
bikecounts[, Date := as.POSIXct(Date, format="%m/%d/%Y %I:%M:%S %p")]
Here is the requested lubridate solution, which I also added to the linked question. It uses a combination of lubridate and zoo aggregate() for these operations:
ts.month.sum <- aggregate(zoo.ts, month, sum)
ts.daily.mean <- aggregate(zoo.ts, day, mean)
ts.mins.mean <- aggregate(zoo.ts, minute, mean)
Obviously, you need to first convert your data to a zoo() object, which is easy enough. You can also use yearmon() or yearqtr(), or custom functions for both split and apply. This method is as syntactically sweet as that of pandas.
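For the bike-count example above, that conversion could look roughly like this (a sketch using lubridate's mdy_hms() for parsing and only the Northbound column):
library(zoo)
library(lubridate)
zoo.ts <- zoo(bikecounts$Northbound, order.by = mdy_hms(bikecounts$Date))
daily.sum <- aggregate(zoo.ts, as.Date, sum)   # daily northbound totals
head(daily.sum)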
This is a pseudo followup to this question: Why is ggplot graphing null percentage data points?
Let's say this is my dataset:
Date AE AA AEF Percent
1/1/2012 1211 1000 3556 0.03
1/2/2012 100 2000 3221 0.43
1/3/2012 3423 10000 2343 0.54
1/4/2012 10000 3000 332 0.43
1/5/2012 2342 500 4435 0.43
1/6/2012 2342 800 2342 0.23
1/7/2012 2342 1500 1231 0.12
1/8/2012 111 2300 333
1/9/2012 1231 1313 3433
1/10/2012 3453 5654 222
1/11/2012 3453 3453 454
1/12/2012 5654 7685 3452
> str(data)
'data.frame': 12 obs. of 5 variables:
$ Date : Factor w/ 12 levels "10/11/2012","10/12/2012",..: 1 2 3 4 5 6 7 8 9 10 ...
$ AE : int 1211 100 3423 10000 2342 2342 2342 111 1231 3453 ...
$ AA : int 1000 2000 10000 3000 500 800 1500 2300 1313 5654 ...
$ AEF : int 3556 3221 2343 332 4435 2342 1231 333 3433 222 ...
$ Percent: num 0.03 0.43 0.54 0.43 0.43 0.23 0.12 NA NA NA ...
I need something to tell me that the 'Date' column is a Date type as opposed to a numeric or character type (this is because I have to convert the 'Date' column of the data input into an actual Date with as.Date(), assuming that I do not know the column names of the data set).
is.numeric(data[[1]]) returns FALSE
is.character(data[[1]]) returns FALSE
I made the 'Date' column in Excel, formatting the column in the 'Date' format, then saved the file as a csv. What type is this in R? I seek an expression similar to the above that returns TRUE.
Use inherits to detect if the argument has the datatype Date:
is.date <- function(x) inherits(x, 'Date')
sapply(list(as.Date('2000-01-01'), 123, 'ABC'), is.date)
#[1] TRUE FALSE FALSE
If you want to check if character argument can be converted to Date then use this:
is.convertible.to.date <- function(x) !is.na(as.Date(as.character(x), tz = 'UTC', format = '%Y-%m-%d'))
sapply(list('2000-01-01', 123, 'ABC'), is.convertible.to.date)
# [1] TRUE FALSE FALSE
You could try to coerce all the columns with as.Date and see which ones succeed. You would need to specify the format you expect the dates to be in, though. E.g.:
data <- data.frame(
Date=c("10/11/2012","10/12/2012"),
AE=c(1211,100),
Percent=c(0.03,0.43)
)
sapply(data, function(x) !all(is.na(as.Date(as.character(x),format="%d/%m/%Y"))))
#Date AE Percent
#TRUE FALSE FALSE
I know this question is old, but I did want to mention that the lubridate package now has functions for this: is.Date and also is.POSIXt.
library(lubridate)
sapply(list(as.Date('2000-01-01'), 123, 'ABC'), is.Date)
[1] TRUE FALSE FALSE
The OP clearly asks for just a check:
I need something to tell that the 'Date' column is a Date type
So how many date classes come with R? Exactly two: Date and POSIXt (excluding their derivatives like POSIXct and POSIXlt).
So we can just check on that, and make it more robust than the answers already given:
is.Date <- function(x) {
inherits(x, c("Date", "POSIXt"))
}
As robust as it gets.
is.Date(as.Date("2020-02-02"))
#> [1] TRUE
is.Date(as.POSIXct("2020-02-02"))
#> [1] TRUE
is.Date(as.POSIXlt("2020-02-02"))
#> [1] TRUE
If you want to know if a column would be successfully transformable/coercible to a Date type, then that's another question. This answers what was requested: 'to tell that [...] is a Date type'.
To work with dates I use a function that identifies whether the strings are dates and, if they are, converts them to a predefined format (in this case I chose '%d/%m/%Y'):
standarDates <- function(string) {
  patterns <- c('[0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]',
                '[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]',
                '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]')
  formatdates <- c('%Y/%m/%d', '%d/%m/%Y', '%Y-%m-%d')
  standardformat <- '%d/%m/%Y'
  for (i in 1:3) {
    if (grepl(patterns[i], string)) {
      aux <- as.Date(string, format = formatdates[i])
      if (!is.na(aux)) {
        return(format(aux, standardformat))
      }
    }
  }
  return(FALSE)
}
Suppose you have the vector
a=c("2018-24-16","1587/03/16","fhjfmk","9885/04/16")
> sapply(a,standarDates)
2018-24-16 1587/03/16 fhjfmk 9885/04/16
"FALSE" "16/03/1587" "FALSE" "16/04/9885"
with the command
"FALSE"%in%sapply(a,standarDates)
[1] TRUE
you can figure out whether any of the elements is not a date (i.e. whether not all the elements are dates).
The advantage of this function is that you can add more patterns and date formats according to the data you are working with, and end up with one standard format for all those dates. (The disadvantage is that it isn't exactly what the question is asking.)
I hope this helps
A function that I created based on the answers here, and am using now:
is.Date <- function(date) {
  if (sapply(date, function(x)
        !all(is.na(as.Date(as.character(x),
                           format = c("%d/%m/%Y", "%d-%m-%Y", "%Y/%m/%d", "%Y-%m-%d")))))) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}
I will refer to a simple example and I hope it can be generalized.
Say that you have a date
d1<-Sys.Date()
d1
"2020-02-12"
deparse(d1)
"structure(18304, class = \"Date\")"
Thus
grep("Date",deparse(d1))>=1
TRUE
alternatively use
class(d1)
"Date"
I know this is an old question and that I am not providing a self-developed answer, but perhaps some non-experts in R (like myself) could find it useful to use the skim function (from the skimr package) to check whether one or more variables of a data frame (df) are in the Date format.
The syntax is just skim(df), or skimr::skim(df) if one does not want to load the package just for this check.
The obtained output is a very detailed summary of the data frame in which variables are grouped by format (character, Date, numeric...) and additional info is provided (e.g. whether there are missing values, descriptive statistics, etc).
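A minimal sketch of that check, using a small made-up data frame:
library(skimr)
df <- data.frame(Date = as.Date(c("2012-01-01", "2012-01-02")),
                 AE   = c(1211, 100))
skim(df)   # the summary groups variables by type, listing Date under "Date"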
This is my way of doing it. It works most of the time but needs improvement.
MissLt <- function(x, ratio = 0.5){
  sum(is.na(x))/length(x) < ratio
}
IS.Date <- function(x, addformat = NULL, exactformat = NULL){
  if (is.null(exactformat)){
    format <- c("%m/%d/%Y", "%m-%d-%Y", "%Y/%m/%d", "%Y-%m-%d", addformat)
    y <- as.Date(as.character(x), format = format)
    MissLt(y, ratio = 1 - (1/length(y)))
  } else {
    y <- as.Date(as.character(x), format = exactformat)
    MissLt(y, ratio = 1 - (1/length(y)))
  }
}
sapply(data, IS.Date)
I have some data that I aggregate on a per-minute basis with the code below, based on a dataset for one day.
I would, however, like to be able to run this code with a data file that combines multiple days. I have a date column in the dataset, so I can use that as a unique identifier for each day. Is there a way to aggregate the data on a 1-minute basis, given that the dates aren't the same?
The problem is that the unique function extracts the unique events that occur on the first day, and then adds all the same events that happen in that minute afterwards. If I base it on the date too, I believe I can create unique 1-minute entries for each day in one long dataset.
Below is the code that works for a single day's data.
novo <- read.csv("C:/Users/Morten/Desktop/data.csv", header = TRUE, stringsAsFactors=FALSE )
TimeStamp <- novo[,1]
price <- novo[, 2]
volume <- novo[,3]
nV <- sum(volume)
MinutesFloor <- unique(floor(TimeStamp))
nTradingMinutes <- length(MinutesFloor)
PriceMin <- rep(0, nTradingMinutes)
VolumeMin <- rep(0, nTradingMinutes)
for( j in 1:nTradingMinutes){
ThisMinutes <- (floor(TimeStamp) == MinutesFloor[j])
PriceMin[j] <- mean(price[ThisMinutes])
VolumeMin[j] <- sum(volume[ThisMinutes])
}
Thanks in advance
data format:
date,"ord","shares","finalprice","time","stock"
20100301,C,80,389,540.004,1158
20100301,C,77,389,540.004,1158
20100301,C,60,389,540.004,1158
20100301,C,28,389,540.004,1158
20100301,C,7,389,540.004,1158
20100302,C,25,394.7,540.00293333,1158
20100302,C,170,394.7,540.00293333,1158
20100302,C,40,394.7,540.00293333,1158
20100302,C,75,394.7,540.00293333,1158
20100302,C,100,394.7,540.00293333,1158
20100302,C,1,394.7,540.00293333,1158
I would like to suggest a radically simplified version of your code.
You are doing quite a few things rather inefficiently. R is made to compute summary statistics grouped by different data values, and we will use these methods heavily.
I assume your data to be of the form you provided. On my system, this looks like:
novo <- read.csv("test.csv", header = TRUE, stringsAsFactors=FALSE )
This gives us:
> str(novo)
'data.frame': 11 obs. of 6 variables:
$ date : int 20100301 20100301 20100301 20100301 20100301 20100302 20100302 20100302 20100302 20100302 ...
$ ord : chr "C" "C" "C" "C" ...
$ shares : int 80 77 60 28 7 25 170 40 75 100 ...
$ finalprice: num 389 389 389 389 389 ...
$ time : num 540 540 540 540 540 ...
$ stock : int 1158 1158 1158 1158 1158 1158 1158 1158 1158 1158 ...
Now, I assume that your date is ordered Year-Month-Day. If you have a different ordering, you would have to alter the format string below. Furthermore, your time is probably in minutes.
Then we can create timestamps containing both the date and the time using the POSIXct datatype:
timestamps <- as.POSIXct(as.character(novo$date), format='%Y%m%d') + novo$time*60
Now, we do the rounding up minutes by creating a factor variable and using the cut function:
timestampsByMinute <- droplevels(cut(timestamps, 'min'))
Note that the additional droplevels call just removes the minutes for which no data items are available.
Finally, we may compute the summary statistics you did in the for-loop:
tapply is a function that takes its first argument, divides it into groups defined by the second argument, and applies the function given as the third argument to each group. Thus we may just throw the tapply function at your data. (I have the feeling that the column numbers you used in your code do not match the column names in your example data; feel free to adapt to different columns if I interpreted your meaning the wrong way.)
PriceMin <- tapply(novo$finalprice, timestampsByMinute, mean)
VolumeMin <- tapply(novo$shares, timestampsByMinute, sum)
This gives us
> PriceMin
2010-03-01 09:00:00 2010-03-02 09:00:00
389.0 394.7
> VolumeMin
2010-03-01 09:00:00 2010-03-02 09:00:00
252 411
which is probably what you want.
Note that tapply is much faster than the loop you used. If you have huge data files, this may be important.
I hope there are no errors left in my code - testing was not easy given the fact that you provided only data for one minute per day.
Edit:
As per request, here a small modification that removes the time information from the data:
> unname(VolumeMin)
[1] 252 411
> unname(PriceMin)
[1] 389.0 394.7