I have an data set which looks like this:
VisitID Start
1 0 2015-02-15 09:46:43.17
2 1 2015-02-15 09:47:37.84
3 2 2015-02-15 09:58:46.42
4 3 2015-02-15 09:58:48.46
5 4 2015-02-15 10:28:25.09
6 5 2015-02-15 10:33:43.53
I want to make a bar plot of count per one hour(y-axis) vs. absolute time(x-axis), meaning how many observations were in one hour.
can you please help?
Thanks,
Guy
Something like this should work :
DF <- read.csv(text=
"VisitID,Start
0,2015-02-15 09:46:43.17
1,2015-02-15 09:47:37.84
2,2015-02-15 09:58:46.42
3,2015-02-15 09:58:48.46
4,2015-02-15 10:28:25.09
5,2015-02-15 10:33:43.53",stringsAsFactors=FALSE)
DF$StartDate <- strptime(DF$Start, tz='GMT', format="%Y-%m-%d %H:%M:%OS")
hours <- vapply(split(1:nrow(DF),format(DF$StartDate,"%Y-%m-%d %H:00:00",tz='UTC')),length,0)
barplot(hours)
Related
Given a date and the day of the week it is, I want to know if there is a code that tells me which of those days of the month it is. For example in the picture below, given 2/12/2020 and "Wednesday" I want to be given the output "2" for it being the second Wednesday of the month.
You can do that in base R in essentially one operation. You also do not need the second input column.
Here is slower walkthrough:
Code
dates <- c("2/12/2020","2/11/2020","2/10/2020","2/7/2020","2/6/2020", "2/5/2020")
Dates <- anytime::anydate(dates) ## one of several parsers
dow <- weekdays(Dates) ## for illustration, base R function
cnt <- (as.integer(format(Dates, "%d")) - 1) %/% 7 + 1
res <- data.frame(dt=Dates, dow=dow, cnt=cnt)
res
(Final) Output
R> res
dt dow cnt
1 2020-02-12 Wednesday 2
2 2020-02-11 Tuesday 2
3 2020-02-10 Monday 2
4 2020-02-07 Friday 1
5 2020-02-06 Thursday 1
6 2020-02-05 Wednesday 1
R>
Functionality like this is often in dedicated date/time libraries. I wrapped some code from the (C++) Boost date_time library in package RcppBDH -- that allowed to easily find 'the third Wednesday in the last month each quarter' and alike.
(lubridate::day(your_date) - 1) %/% 7 + 1
The idea here is that the first 7 days of the month are all the first for their weekday. Next 7 are 2nd, etc.
> (1:30 - 1) %/% 7 + 1
# [1] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5
Just to offer an alternative calculation for the nth-weekday of the month, you can just divide the day by 7 and always round up:
date <- lubridate::mdy("02/12/2020")
ceiling(day(date)/7)
I have a dataframe as follows:
Date Price1 Price2 Price3 Price4 .... Price 24
2017-10-15 60.43 49.40 48.72 48.32
2017-10-16 38.09 30.00 24.47 24.88
2017-10-17 48.80 46.76 46.73 45.82
The goal is to turn the dataframe object into a temporal series, predicting as well the date 2017-10-18, with all the corresponding 24 price/values.
Actually, I get the ts object, but it appears the following error at time to compute Error in ets(stock_prize) : y should be a univariate time series
Any advice?
I think your data structure is not correct. I suggest you should make those dates a factor and make only one column for the values. For example you have something like this:
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
mydates2 <-as.Date(c("2008-06-22", "2005-02-13"))
mydates3 <-as.Date(c("2009-06-22", "2006-02-13"))
hours <- c(8,9)
values <- c(1,2)
a=data.frame(mydates,mydates2,mydates3,hours,values)
a
This is how your data looks:
mydates mydates2 mydates3 hours values
1 2007-06-22 2008-06-22 2009-06-22 8 1
2 2004-02-13 2005-02-13 2006-02-13 9 2
But you should transform them to look something like this:
dates=c(mydates,mydates2,mydates3)
hours_factor=rep(hours,3)
ordered_values=rep(values,3)
b=data.frame(dates,hours_factor,ordered_values)
b
This is how your data shoud look like:
dates hours_factor ordered_values
1 2007-06-22 8 1
2 2004-02-13 9 2
3 2008-06-22 8 1
4 2005-02-13 9 2
5 2009-06-22 8 1
6 2006-02-13 9 2
After that you can make the variables a ts class. You can use ts function for that. If you want to predict next date value you can do an auto-regression. It is very well documented in the Internet, but please know your data have to match some requirements first.
I start with a data frame titled 'dat' in R that looks like the following:
datetime lat long id extra step
1 8/9/2014 13:00 31.34767 -81.39117 36 1 31.38946
2 8/9/2014 17:00 31.34767 -81.39150 36 1 11155.67502
3 8/9/2014 23:00 31.30683 -81.28433 36 1 206.33342
4 8/10/2014 5:00 31.30867 -81.28400 36 1 11152.88177
What I need to do is find out what days have less than 3 entries and remove all entries associated with those days from the original data.
I initially did this by the following:
library(plyr)
datetime<-dat$datetime
###strip the time down to only have the date no hh:mm:ss
date<- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2<-cbind(date, dat)
### count using just the date so you can ID which days have fewer than 3 points
datecount<- count(dat2, "date")
datecount<- subset(datecount, datecount$freq < 3)
This end up producing the following:
row.names date freq
1 49 2014-09-26 1
2 50 2014-09-27 2
3 135 2014-12-21 2
Which is great, but I cannot figure out how to remove the entries from these days with less than three entries from the original 'dat' because this is a compressed version of the original data frame.
So to try and deal with this I have come up with another way of looking at the problem. I will use the strptime and cbind from above:
datetime<-dat$datetime
###strip the time down to only have the date no hh:mm:ss
date<- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2<-cbind(date, dat)
And I will utilize the column titled "extra". I would like to create a new column which is the result of summing the values in this "extra" column by the simplified strptime dates. But find a way to apply this new value to all entries from that date, like the following:
date datetime lat long id extra extra_sum
1 2014-08-09 8/9/2014 13:00 31.34767 -81.39117 36 1 3
2 2014-08-09 8/9/2014 17:00 31.34767 -81.39150 36 1 3
3 2014-08-09 8/9/2014 23:00 31.30683 -81.28433 36 1 3
4 2014-08-10 8/10/2014 5:00 31.30867 -81.28400 36 1 4
5 2014-08-10 8/10/2014 13:00 31.34533 -81.39317 36 1 4
6 2014-08-10 8/10/2014 17:00 31.34517 -81.39317 36 1 4
7 2014-08-10 8/10/2014 23:00 31.34483 -81.39283 36 1 4
8 2014-08-11 8/11/2014 5:00 31.30600 -81.28317 36 1 2
9 2014-08-11 8/11/2014 13:00 31.34433 -81.39300 36 1 2
The code that creates the "extra_sum" column is what I am struggling with.
After creating this I can simply subset my data to all entries that have a value >2. Any help figuring out how to use my initial methodology or this new one to remove days with fewer than 3 entries from my initial data set would be much appreciated!
The plyr way.
library(plyr)
datetime <- dat$datetime
###strip the time down to only have the date no hh:mm:ss
date <- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2 <-cbind(date, dat)
dat3 <- ddply(dat2, .(date), function(df){
if (nrow(df)>=3) {
return(df)
} else {
return(NULL)
}
})
I recommend using the data.table package
library(data.table)
dat<-data.table(dat)
dat$Date<-as.Date(as.character(dat$datetime), format = "%m/%d/%Y")
dat_sum<-dat[, .N, by = Date ]
dat_3plus<-dat_sum[N>=3]
dat<-dat[Date%in%dat_3plus$Date]
I have this dataset
Time Forums_Read
1 00:01 1
2 00:04 1
3 00:05 3
4 00:06 3
5 00:07 3
6 00:08 6
7 00:10 2
8 00:11 2
9 00:12 1
I am trying to geom_line the data. However, it needs to be of type POSIXct.
The structure of the column-Time is:
Factor w/ 1254 levels "00:01","00:04",..
Is there any solution for this?
Thanks!
The date class POSIXct requires a starting point. We can use any since the values in the Time column are being compared to each other. This function call will convert the dates to the proper format.
df$Time <- as.POSIXct(df$Time, format="%M:%S")
I used the format "%M:%S" to indicate minutes and seconds. If you have hours and minutes represented in your data, use "%H:%M". For more information on date formatting see ?strptime.
Along the lines of #Pierre Lafortune's comment,
Df$Time2 <- as.POSIXct(
paste0(Sys.Date(), " 00:", as.character(Df$Time)))
##
library(ggplot2)
##
ggplot(
data = Df,
aes(x = Time2, y = Forums_Read)) +
geom_line()
Data:
Df <- read.table(text = " Time Forums_Read
1 00:01 1
2 00:04 1
3 00:05 3
4 00:06 3
5 00:07 3
6 00:08 6
7 00:10 2
8 00:11 2
9 00:12 1",
header = TRUE,
stringsAsFactors = TRUE)
I have a dataframe that looks like this:
month create_time request_id weekday
1 4 2014-04-25 3647895 Friday
2 12 2013-12-06 2229374 Friday
3 4 2014-04-18 3568796 Friday
4 4 2014-04-18 3564933 Friday
5 3 2014-03-07 3081503 Friday
6 4 2014-04-18 3568889 Friday
And I'd like to get the count of request_ids by the weekday. How would I do this in R?
I've tried a lot of stuff based on ddply and aggregate with no luck.
Try using aggregate
> aggregate(request_id ~ weekday, FUN=length, dat=df)
weekday request_id
1 Friday 6
There are several valid ways to do it. I usually go with my trusty sqldf(). If the dataframe is named D, then
library(sqldf)
counts <- sqldf('select weekday, count(request_id) as nrequests from D group by weekday')
sqldf() can be wordy, but it is just so easy to remember and get right the first time!
or ... u could try:
count(df,"weekday")
or
library(plyr)
ddply(df,.(weekday),summarise,count=length(month))
Another option is to use a table and take the rowSums
> rowSums(with(dat, table(weekday, request_id)))
Friday
6