I was looking around the web but could not find the answer that I'm looking for.
Here is my input data:
Date Calls
2012-01-01 3
2012-01-01 3
2012-01-01 10
2012-03-02 15
2012-03-02 7
2012-03-02 5
2012-04-02 0
2012-04-02 5
2012-04-02 18
2012-04-02 1
2012-04-02 0
2012-05-02 2
I want to plot a histogram showing the sum of calls for each day in the "Date" column.
Yes, it can be done by identifying the levels of the Date column and adding up the corresponding Calls, but I am wondering
if there is a more elegant way to do it. The "Date" column is of class "Date" (according to data.class()).
According to this example, the final hist should have 4 bins of (16, 27, 24, 2).
Cheers,
Well, technically a histogram is really only meant for estimating the density function of continuous data, and the way your data is coded, Date is more like a categorical variable. So you probably want a bar chart of the summed counts rather than a true histogram. You can do that with ggplot2 using
# Sum Calls within each Date and draw the totals as bars
ggplot(dd, aes(Date, Calls)) + stat_summary(fun = "sum", geom = "bar")
Read data:
d <- read.table(text=
"Date Calls
2012-01-01 3
2012-01-01 3
2012-01-01 10
2012-03-02 15
2012-03-02 7
2012-03-02 5
2012-04-02 0
2012-04-02 5
2012-04-02 18
2012-04-02 1
2012-04-02 0
2012-05-02 2",
header=TRUE)
d$Date <- as.Date(d$Date)
library(plyr)
s <- ddply(d,"Date",summarize,Calls=sum(Calls))
library(ggplot2)
If we use Date as the x variable we get month labels:
ggplot(s,aes(x=Date,y=Calls))+geom_bar(stat="identity")
You might prefer the particular date labels:
ggplot(s,aes(x=factor(Date),y=Calls))+geom_bar(stat="identity")
Or non-default labels:
ggplot(s,aes(x=format(Date,"%d-%b"),y=Calls))+geom_bar(stat="identity")+
labs(x="Date")
It should also be possible to do this by constructing your own hist object and passing it to plot.histogram, but I think this way is easier ...
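For comparison, the same totals can be had without plyr or ggplot2. The following is only a minimal base R sketch, assuming the same data as above (rebuilt here so the block runs on its own); aggregate() and barplot() are standard functions, and the names totals and bar_mids are made up for illustration.

```r
# Rebuild the example data so the sketch is self-contained
d <- data.frame(
  Date  = as.Date(c("2012-01-01", "2012-01-01", "2012-01-01",
                    "2012-03-02", "2012-03-02", "2012-03-02",
                    "2012-04-02", "2012-04-02", "2012-04-02",
                    "2012-04-02", "2012-04-02", "2012-05-02")),
  Calls = c(3, 3, 10, 15, 7, 5, 0, 5, 18, 1, 0, 2)
)

# Sum Calls within each distinct Date
totals <- aggregate(Calls ~ Date, data = d, FUN = sum)

# One bar per date, labelled with the date itself
bar_mids <- barplot(totals$Calls, names.arg = format(totals$Date),
                    xlab = "Date", ylab = "Calls")
```

The Calls column of totals should hold the four sums the question asks for (16, 27, 24, 2).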
I am trying to get a count of active clients per month, using data that has a start and end date for each client's episode. With the code I am using, I can't work out how to count per month rather than per n days.
Here is some sample data:
Start.Date <- as.Date(c("2014-01-01", "2014-01-02","2014-01-03","2014-01-03"))
End.Date<- as.Date(c("2014-01-04", "2014-01-03","2014-01-03","2014-01-04"))
Make sure the dates are dates:
Start.Date <- as.Date(Start.Date, "%d/%m/%Y")
End.Date <- as.Date(End.Date, "%d/%m/%Y")
Here is the code I am using, which currently counts the number per day:
library(plyr)
count(Reduce(c, Map(seq, Start.Date, End.Date, by = 1)))
which returns:
x freq
1 2014-01-01 1
2 2014-01-02 2
3 2014-01-03 4
4 2014-01-04 2
The "by" argument can be changed to be however many days I want, but problems arise because months have different lengths.
Would anyone be able to suggest how I can count per month?
Thanks a lot.
note: I now realize that for my example data I have only used dates in the same month, but my real data has dates spanning 3 years.
Here's a solution that seems to work. First, I set the seed so that the example is reproducible.
# Set seed for reproducible example
set.seed(33550336)
Next, I create a dummy data frame.
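# Test data (%>% and mutate() come from dplyr; day<-, used further down, comes from lubridate)
library(dplyr)
library(lubridate)
df <- data.frame(Start_date = as.Date(sample(seq(as.Date('2014/01/01'), as.Date('2015/01/01'), by="day"), 12))) %>%
  mutate(End_date = as.Date(Start_date + sample(1:365, 12, replace = TRUE)))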
which looks like,
# Start_date End_date
# 1 2014-11-13 2015-09-26
# 2 2014-05-09 2014-06-16
# 3 2014-07-11 2014-08-16
# 4 2014-01-25 2014-04-23
# 5 2014-05-16 2014-12-19
# 6 2014-11-29 2015-07-11
# 7 2014-09-21 2015-03-30
# 8 2014-09-15 2015-01-03
# 9 2014-09-17 2014-09-26
# 10 2014-12-03 2015-05-08
# 11 2014-08-03 2015-01-12
# 12 2014-01-16 2014-12-12
The function below takes a start date and end date and creates a sequence of months between these dates.
# Sequence of months
mon_seq <- function(start, end){
  # Change each day to the first to aid month counting
  day(start) <- 1
  day(end) <- 1
  # Create a sequence of months
  seq(start, end, by = "month")
}
Right, this is the tricky bit. I apply my function mon_seq to all rows in the data frame using mapply. This gives the months between each start and end date. Then, I combine all these months together into a vector. I format this vector so that dates just contain months and years. Finally, I pipe (using dplyr's %>%) this into table which counts each occurrence of year-month and I cast as a data frame.
data.frame(format(do.call("c", mapply(mon_seq, df$Start_date, df$End_date)), "%Y-%m") %>% table)
This gives,
# . Freq
# 1 2014-01 2
# 2 2014-02 2
# 3 2014-03 2
# 4 2014-04 2
# 5 2014-05 3
# 6 2014-06 3
# 7 2014-07 3
# 8 2014-08 4
# 9 2014-09 6
# 10 2014-10 5
# 11 2014-11 7
# 12 2014-12 8
# 13 2015-01 6
# 14 2015-02 4
# 15 2015-03 4
# 16 2015-04 3
# 17 2015-05 3
# 18 2015-06 2
# 19 2015-07 2
# 20 2015-08 1
# 21 2015-09 1
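The same idea can be sketched without lubridate or dplyr. This is only an illustration under assumptions: the starts and ends vectors are made up, and first_of_month is a hypothetical helper that snaps a date to the first of its month so that month-end dates such as the 31st do not overflow into the following month when stepping by "month".

```r
# Hypothetical episodes spanning several months
starts <- as.Date(c("2014-01-15", "2014-02-20", "2014-03-31"))
ends   <- as.Date(c("2014-03-10", "2014-02-25", "2014-05-01"))

# Snap a date to the first day of its month
first_of_month <- function(x) as.Date(format(x, "%Y-%m-01"))

# One month sequence per episode, then flatten to a single Date vector
month_list <- lapply(seq_along(starts), function(i) {
  seq(first_of_month(starts[i]), first_of_month(ends[i]), by = "month")
})
all_months <- do.call(c, month_list)

# Count how many episodes touch each year-month
active <- as.data.frame(table(Month = format(all_months, "%Y-%m")))
active
```

An episode is counted in every month it touches, even if only for a single day, which matches the behaviour of the answer above.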
I have a dataframe as follows:
Date Price1 Price2 Price3 Price4 .... Price 24
2017-10-15 60.43 49.40 48.72 48.32
2017-10-16 38.09 30.00 24.47 24.88
2017-10-17 48.80 46.76 46.73 45.82
The goal is to turn the data frame into a time series and then predict the date 2017-10-18, with all 24 corresponding prices.
I do get a ts object, but when I try to fit the model the following error appears: Error in ets(stock_prize) : y should be a univariate time series
Any advice?
I think your data structure is not correct. I suggest making those dates a factor and keeping the values in a single column. For example, suppose you have something like this:
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
mydates2 <-as.Date(c("2008-06-22", "2005-02-13"))
mydates3 <-as.Date(c("2009-06-22", "2006-02-13"))
hours <- c(8,9)
values <- c(1,2)
a=data.frame(mydates,mydates2,mydates3,hours,values)
a
This is how your data looks:
mydates mydates2 mydates3 hours values
1 2007-06-22 2008-06-22 2009-06-22 8 1
2 2004-02-13 2005-02-13 2006-02-13 9 2
But you should transform them to look something like this:
dates=c(mydates,mydates2,mydates3)
hours_factor=rep(hours,3)
ordered_values=rep(values,3)
b=data.frame(dates,hours_factor,ordered_values)
b
This is how your data should look:
dates hours_factor ordered_values
1 2007-06-22 8 1
2 2004-02-13 9 2
3 2008-06-22 8 1
4 2005-02-13 9 2
5 2009-06-22 8 1
6 2006-02-13 9 2
After that you can convert the variables to class ts with the ts function. If you want to predict the value for the next date, you can fit an autoregressive model. This is well documented on the Internet, but note that your data have to meet some requirements first.
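The reshaping idea can be sketched for the layout in the question. This is a rough illustration only: the data frame name prices is made up, it uses just the 4 price columns shown in the question instead of all 24, and it assumes each PriceN column is one intra-day period in order. The point is that ts() then receives a single numeric vector, which is what the "univariate" error is asking for.

```r
# Wide data: one row per day, one column per intra-day period
prices <- data.frame(
  Date   = as.Date(c("2017-10-15", "2017-10-16", "2017-10-17")),
  Price1 = c(60.43, 38.09, 48.80),
  Price2 = c(49.40, 30.00, 46.76),
  Price3 = c(48.72, 24.47, 46.73),
  Price4 = c(48.32, 24.88, 45.82)
)

# Transpose so values run day 1 period 1..4, then day 2 period 1..4, and so on
long_values <- as.vector(t(as.matrix(prices[, -1])))

# frequency = number of periods per day (4 here, 24 for the real data)
y <- ts(long_values, frequency = ncol(prices) - 1)
```

A series like y has no dim attribute, so a univariate model can be fitted to it and asked to forecast one day ahead, that is, as many steps as there are periods per day.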
I am new to time series and was hoping someone could provide some input/ideas here.
I am trying to find ways to impute missing values.
I was hoping to find the moving average, but most of the packages (smooth, mgcv, etc.) don't seem to take time intervals into consideration.
For example, the dataset might look something like the one below, and I would want the value at 2016-01-10 to have the greatest influence in calculating the missing value:
Date Value Diff_Days
2016-01-01 10 13
2016-01-10 14 4
2016-01-14 NA 0
2016-01-28 30 14
2016-01-30 50 16
I have instances where NA might be the first observation or the last observation. Sometimes NA values also occur multiple times, at which point the rolling window would need to expand, and this is why I would like to use the moving average.
Is there a package that would take date intervals / separate weights into consideration?
Or please suggest if there is a better way to impute NA values in such cases.
You can use glm or any other model.
Input
con <- textConnection("Date Value Diff_Days
2015-12-14 NA 0
2016-01-01 10 13
2016-01-10 14 4
2016-01-14 NA 0
2016-01-28 30 14
2016-02-14 NA 0
2016-02-18 NA 0
2016-02-29 50 16")
df <- read.table(con, header = TRUE)
df$Date <- as.Date(df$Date)
df$Date.numeric <- as.numeric(df$Date)
fit <- glm(Value ~ Date.numeric, data = df)
df.na <- df[is.na(df$Value),]
predicted <- predict(fit, df.na)
df$Value[is.na(df$Value)] <- predicted
plot(df$Date, df$Value)
points(df.na$Date, predicted, type = "p", col="red")
df$Date.numeric <- NULL
rm(df.na)
print(df)
Output
Date Value Diff_Days
1 2015-12-14 -3.054184 0
2 2016-01-01 10.000000 13
3 2016-01-10 14.000000 4
4 2016-01-14 18.518983 0
5 2016-01-28 30.000000 14
6 2016-02-14 40.092149 0
7 2016-02-18 42.875783 0
8 2016-02-29 50.000000 16
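If the aim is for nearby observations to count more than distant ones, linear interpolation on the date axis is one simple alternative to a global regression. Below is a minimal sketch using base R's approx() on the same input as above. The column name Value_filled is made up, and rule = 2 is a choice made here so that a leading or trailing NA takes the nearest observed value rather than staying NA.

```r
df <- data.frame(
  Date  = as.Date(c("2015-12-14", "2016-01-01", "2016-01-10", "2016-01-14",
                    "2016-01-28", "2016-02-14", "2016-02-18", "2016-02-29")),
  Value = c(NA, 10, 14, NA, 30, NA, NA, 50)
)

known <- !is.na(df$Value)

# Interpolate on the numeric date axis, so the gap in days sets the weights
filled <- approx(x = as.numeric(df$Date[known]),
                 y = df$Value[known],
                 xout = as.numeric(df$Date),
                 rule = 2)$y

df$Value_filled <- filled
df
```

Here the value for 2016-01-14 lands closer to 14 than to 30, because 2016-01-10 is 4 days away while 2016-01-28 is 14 days away. Consecutive NAs are handled the same way, since each is interpolated between the nearest observed values on either side.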
Example data:
dates=seq(as.POSIXct("2015-01-01 00:00:00"), as.POSIXct("2015-01-07 00:00:00"), by="day")
data=rnorm(7,1,2)
groupID=c(12,14,16,24,35,46,54)
DF=data.frame(Date=dates,Data=data,groupID=groupID)
BB=c(12,12,16,24,35,35)
DF[DF$groupID %in% BB,]
Date Data groupID
1 2015-01-01 4.4104202 12
3 2015-01-03 2.1557735 16
4 2015-01-04 -0.9880946 24
5 2015-01-05 -0.3396025 35
I need to filter the data frame DF according to values in my vector BB which match the groupID column. However, if BB contains repetitions, this is not reflected in the result.
Since my vector BB includes two values of 12 and two of 35, the output should in fact be:
Date Data groupID
1 2015-01-01 4.4104202 12
1 2015-01-01 4.4104202 12
3 2015-01-03 2.1557735 16
4 2015-01-04 -0.9880946 24
5 2015-01-05 -0.3396025 35
5 2015-01-05 -0.3396025 35
Is there a way to achieve this? And to keep the ordering of the vector BB if possible?
Use match() (or findInterval()):
DF[match(BB,DF$groupID),];
## Date Data groupID
## 1 2015-01-01 1.2199835 12
## 1.1 2015-01-01 1.2199835 12
## 3 2015-01-03 1.8141556 16
## 4 2015-01-04 0.2748579 24
## 5 2015-01-05 3.2030200 35
## 5.1 2015-01-05 3.2030200 35
(Note that the Data column is different because you used rnorm() to generate it without calling set.seed() first. It is recommended to call set.seed() in any code sample where you incorporate randomness so that exact results can be reproduced.)
You can transform BB into a data.frame and use merge() to merge DF and BB according to their groupID, to be specific:
dates=seq(as.POSIXct("2015-01-01 00:00:00"), as.POSIXct("2015-01-07 00:00:00"), by="day")
groupID=c(12,14,16,24,35,46,54)
set.seed(1234)
data=rnorm(7,1,2)
DF=data.frame(Date=dates,Data=data,groupID=groupID)
BB=data.frame(groupID=c(12,12,16,24,35,35))
Test result:
>merge(DF,BB,by="groupID")
groupID Date Data
1 12 2015-01-01 -1.414131
2 12 2015-01-01 -1.414131
3 16 2015-01-03 3.168882
4 24 2015-01-04 -3.691395
5 35 2015-01-05 1.858249
6 35 2015-01-05 1.858249
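One difference between the two answers is ordering: match() returns rows in the order of the lookup vector, whereas merge() sorts by the key by default. A small sketch with a deliberately shuffled lookup vector; the Data values and the names wanted, by_match and by_merge are made up for illustration.

```r
DF <- data.frame(groupID = c(12, 14, 16, 24, 35, 46, 54),
                 Data    = c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5))
wanted <- c(35, 12, 12, 16)

# match() gives, for each element of wanted, the row index of its first match in DF
by_match <- DF[match(wanted, DF$groupID), ]

# merge() also repeats rows, but returns them sorted by groupID
by_merge <- merge(DF, data.frame(groupID = wanted), by = "groupID")

by_match$groupID   # 35 12 12 16
by_merge$groupID   # 12 12 16 35
```

Note that match() yields NA for any value of the lookup vector that is absent from DF, which would produce rows of NA, so it may be worth filtering those out first.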
I have this dataset
Time Forums_Read
1 00:01 1
2 00:04 1
3 00:05 3
4 00:06 3
5 00:07 3
6 00:08 6
7 00:10 2
8 00:11 2
9 00:12 1
I am trying to geom_line the data. However, it needs to be of type POSIXct.
The structure of the column-Time is:
Factor w/ 1254 levels "00:01","00:04",..
Is there any solution for this?
Thanks!
The date class POSIXct requires a starting point. We can use any since the values in the Time column are being compared to each other. This function call will convert the dates to the proper format.
df$Time <- as.POSIXct(df$Time, format="%M:%S")
I used the format "%M:%S" to indicate minutes and seconds. If you have hours and minutes represented in your data, use "%H:%M". For more information on date formatting see ?strptime.
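To see what the choice of format string changes, here is a small sketch that parses the same strings both ways and compares the gaps between readings. The vector times and the other names are made up for illustration, and tz = "UTC" is set only to keep the example free of daylight-saving effects.

```r
times <- c("00:01", "00:04", "00:12")

# Read as minutes:seconds versus hours:minutes
as_min_sec  <- as.POSIXct(times, format = "%M:%S", tz = "UTC")
as_hour_min <- as.POSIXct(times, format = "%H:%M", tz = "UTC")

# Gaps between consecutive readings, in seconds
diff_min_sec  <- as.numeric(diff(as_min_sec),  units = "secs")
diff_hour_min <- as.numeric(diff(as_hour_min), units = "secs")

diff_min_sec    # 3 8
diff_hour_min   # 180 480
```

Either way the missing date part is filled in with the current date, so only the spacing along the x axis differs.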
Along the lines of @Pierre Lafortune's comment,
Df$Time2 <- as.POSIXct(
paste0(Sys.Date(), " 00:", as.character(Df$Time)))
##
library(ggplot2)
##
ggplot(
data = Df,
aes(x = Time2, y = Forums_Read)) +
geom_line()
Data:
Df <- read.table(text = " Time Forums_Read
1 00:01 1
2 00:04 1
3 00:05 3
4 00:06 3
5 00:07 3
6 00:08 6
7 00:10 2
8 00:11 2
9 00:12 1",
header = TRUE,
stringsAsFactors = TRUE)