How to specify x-axis values for dates in R

I want to figure out when exactly the number of calls sharply increased.
Here is my original code:
plot(breaks, cumfreq0,
     main = "Cumulative percentage of calls happened in NOV.7th",
     xlab = "time", ylab = "cumulative percentage of calls",
     sub = "(each dot represents a single period of time on Nov.7th)")
but I don't think the time scale on the x-axis is specific enough.
How can I change it?
I tried a few things, as shown here, but the code does not seem to work with time objects.
Many many thanks for any help

Please see below; I just replicated your example with dummy data:
> df
# A tibble: 55 x 2
datetime Freq
<dttm> <int>
1 2018-11-01 12:41:57 215
2 2018-11-01 12:41:58 163
3 2018-11-01 12:47:06 225
4 2018-11-01 12:51:00 69
5 2018-11-01 12:57:37 203
6 2018-11-01 12:57:38 248
7 2018-11-01 12:57:38 58
8 2018-11-01 13:29:15 179
9 2018-11-01 13:37:45 233
10 2018-11-01 14:24:43 150
# ... with 45 more rows
And here is the code for the kind of plot you are expecting, with timestamps on the x-axis; you can use whichever format you want:
plot(df$datetime, df$Freq, xaxt = "n")
axis.POSIXct(1, at = df$datetime, labels = format(df$datetime, "%m/%d/%Y %H:%M:%S"))
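If one label per observation is too crowded, here is a minimal sketch (using the same dummy df as above) that places a tick every 30 minutes instead; the spacing is just an illustrative choice:
# regularly spaced ticks rather than one tick per data point
plot(df$datetime, df$Freq, xaxt = "n", xlab = "time", ylab = "Freq")
ticks <- seq(min(df$datetime), max(df$datetime), by = "30 min")
axis.POSIXct(1, at = ticks, labels = format(ticks, "%H:%M"), las = 2)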

Related

Calculating week numbers WITHOUT a yearwise reset (i.e. week_id = 55 is valid and shows it is a year after) + with a specified start date

This probably seems straightforward, but I am pretty stumped.
I have a set of dates ~ August 1 of each year and need to sum sales by week number. The earliest date is 2008-12-08 (YYYY-MM-DD). I need to create a "week_id" field where week #1 begins on 2008-12-08. And the date 2011-09-03 is week 142. Note that this is different since the calculation of week number does not reset every year.
I am putting up a small example dataset here:
data <- data.frame(
  dates = c("2008-12-08", "2009-08-10", "2010-03-31", "2011-10-16",
            "2008-06-03", "2009-11-14", "2010-05-05", "2011-09-03"))
data$date <- as.Date(data$dates)
Any help is appreciated
data$week_id = as.numeric(data$date - as.Date("2008-12-08")) %/% 7 + 1
This takes the day difference between the two dates and finds the number of whole 7-day periods that have elapsed. I add one so that dates where zero whole weeks have elapsed since the start get week 1 instead of week 0.
dates date week_id
1 2008-12-07 2008-12-07 0 # added for testing
2 2008-12-08 2008-12-08 1
3 2008-12-09 2008-12-09 1 # added for testing
4 2008-12-14 2008-12-14 1 # added for testing
5 2008-12-15 2008-12-15 2 # added for testing
6 2009-08-10 2009-08-10 36
7 2010-03-31 2010-03-31 69
8 2011-10-16 2011-10-16 149
9 2008-06-03 2008-06-03 -26
10 2009-11-14 2009-11-14 49
11 2010-05-05 2010-05-05 74
12 2011-09-03 2011-09-03 143
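For reuse, the same arithmetic can be wrapped in a small helper; week_id() below is just an illustrative name, not an existing function, and the origin defaults to the start date from the question:
week_id <- function(d, origin = as.Date("2008-12-08")) {
  as.numeric(as.Date(d) - origin) %/% 7 + 1
}
week_id(c("2008-12-08", "2011-10-16"))
#[1]   1 149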

Frequency not reflecting in time series object R despite specifying it using as.ts

I have a dataframe test_data. I convert it to time series.
test_data
Date Quantity Discount Segment Ship_Mode
1 2018-02-01 345 5000 20 20
2 2018-03-01 500 300 50 20
3 2018-04-01 400 400 40 30
4 2018-05-01 200 400 100 20
test_data <- as.ts(test_data, frequency = 12)
Now when I run the code below, I get a frequency of 1 for my data despite specifying 12 above. What am I doing wrong?
frequency(test_data)
[1] 1
It seems as.ts does not have a frequency parameter; you should use ts for that.
test_data <- ts(test_data, frequency = 12)
frequency(test_data)
#[1] 12
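As a side note, here is a minimal sketch of converting just one column into a monthly series, assuming the series starts in February 2018 as the Date column above suggests (run it on the original data frame, before it is overwritten by ts()):
# convert only Quantity, with an explicit monthly start so the time index is meaningful
qty_ts <- ts(test_data$Quantity, start = c(2018, 2), frequency = 12)
frequency(qty_ts)
#[1] 12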

Is my t-test syntax right for my data set? [A beginner in R]

I started using R only a little while ago, so I would like to ask whether I am doing the t-test right for my purpose.
I have two data sets like this.
> head(da1)
LiefertagDeliveryDate Price Hour
1 2015-12-31 28.82 1
25 2015-12-30 42.97 1
49 2015-12-29 43.38 1
73 2015-12-28 48.54 1
97 2015-12-27 46.36 1
121 2015-12-26 42.68 1
And,
> head(sp1)
# A tibble: 6 x 3
Date Hour Price
<dttm> <chr> <dbl>
1 2015-12-31 1 16.06
2 2015-12-30 1 28.51
3 2015-12-29 1 20.59
4 2015-12-28 1 27.94
5 2015-12-27 1 13.42
6 2015-12-26 1 -36.07
So basically it is every day, from 2015-12-31 down to 2011-01-01, for hour 1. I would like to run a t-test on those data to compare the average price for hour 1 in each data set and see whether the difference is significant.
For this purpose, I ran:
t.test(da1$Price, sp1$Price, data=rp1, var.equal=TRUE, conf.level = 0.95,
alternative = 'two.sided', paired=F)
(Equal variances assumed, with a 95% confidence level, two-sided; not paired, because the two data sets are independent.)
Did I do this right for my purpose? I am very new to R, so I am not really sure whether I did it correctly, and since I am a programming beginner, terms like strings and logicals sound very foreign, which makes the descriptions in the help pages hard to understand.
Thank you in advance for your kind advice.
Have a great day people!
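For comparison, here is a minimal sketch of the call you describe, assuming da1 and sp1 are the data frames shown above. Note that the default t.test() method does not use a data= argument when two numeric vectors are supplied, so it is dropped here:
t.test(da1$Price, sp1$Price,
       var.equal = TRUE,          # Student's two-sample t-test; omit for Welch's test
       conf.level = 0.95,
       alternative = "two.sided",
       paired = FALSE)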

Time difference between rows of a dataframe

I have been browsing the R section of Stack Overflow for quite a while looking for a proper answer, but nothing I saw seems to apply to my problem.
I have a dataset in this format (I have adapted it to what seems to be the easiest form to work with, but the stop_sequence values are normally just incremental numbers for each stop):
route_short_name trip_id direction_id departure_time stop_sequence
33A 1.1598.0-33A-b12-1.451.I 1 16:15:00 start
33A 1.1598.0-33A-b12-1.451.I 1 16:57:00 end
41C 10.3265.0-41C-b12-1.277.I 1 08:35:00 start
41C 10.3265.0-41C-b12-1.277.I 1 09:26:00 end
41C 100.3260.0-41C-b12-1.276.I 1 09:40:00 start
41C 100.3260.0-41C-b12-1.276.I 1 10:53:00 end
114 1000.987.0-114-b12-1.86.O 0 21:35:00 start
114 1000.987.0-114-b12-1.86.O 0 22:02:00 end
39 10000.2877.0-39-b12-1.242.I 1 11:15:00 start
39 10000.2877.0-39-b12-1.242.I 1 12:30:00 end
It is basically a bus trips dataset. All I want is to manage to get the duration of each trip, so something like that:
route_short_name trip_id direction_id duration
33A 1.1598.0-33A-b12-1.451.I 1 42
41C 10.3265.0-41C-b12-1.277.I 1 51
41C 100.3260.0-41C-b12-1.276.I 1 73
114 1000.987.0-114-b12-1.86.O 0 27
39 10000.2877.0-39-b12-1.242.I 1 75
I have tried a lot of things, but I have never managed to group the data by trip_id and then work on the two values for each trip. I must have misunderstood something, but I do not know what.
Does anyone have a clue?
We can also do this without converting to 'wide' format (assuming that 'stop_sequence' is 'start' followed by 'end' for each 'route_short_name', 'trip_id', and 'direction_id').
Convert 'departure_time' to a datetime column, then, grouped by 'route_short_name', 'trip_id', and 'direction_id', take the difftime of the last 'departure_time' relative to the first 'departure_time':
library(dplyr)

df1 %>%
  mutate(departure_time = as.POSIXct(departure_time, format = '%H:%M:%S')) %>%
  group_by(route_short_name, trip_id, direction_id) %>%
  summarise(duration = as.numeric(difftime(last(departure_time),
                                           first(departure_time), units = 'mins')))
# A tibble: 5 x 4
# Groups: route_short_name, trip_id [?]
# route_short_name trip_id direction_id duration
# <chr> <chr> <int> <dbl>
#1 114 1000.987.0-114-b12-1.86.O 0 27
#2 33A 1.1598.0-33A-b12-1.451.I 1 42
#3 39 10000.2877.0-39-b12-1.242.I 1 75
#4 41C 10.3265.0-41C-b12-1.277.I 1 51
#5 41C 100.3260.0-41C-b12-1.276.I 1 73
Try this. Right now your data frame is in "long" format, but it is easier to calculate the time difference in "wide" format. The spread function from the tidyverse package will take your data from long to wide. From there you can use the mutate function to add the new column; converting the times to POSIXct and passing units = "mins" to difftime keeps the difference in minutes.
library(tidyverse)

wide_df <- your_df %>%
  spread(key = stop_sequence, value = departure_time) %>%
  mutate(timediff = as.numeric(difftime(as.POSIXct(end, format = "%H:%M:%S"),
                                        as.POSIXct(start, format = "%H:%M:%S"),
                                        units = "mins")))
If you want to learn more about "tidy" data (and spreading and gathering), see this link to Hadley's book
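To get exactly the columns from the desired output, here is a small follow-up sketch using the wide_df built above:
# keep only the requested columns, renaming timediff to duration
wide_df %>%
  select(route_short_name, trip_id, direction_id, duration = timediff)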

calculating the number of "open hours" per day between two dates

I have a data frame with start dates and end dates, along with the number of people registered for an event. I would like to calculate the number of hours each party is present within a specific timeframe (e.g., 07:00 - 17:00).
If I use the following example data.frame...
d <- data.frame(startDate = c(as.POSIXct("2011-06-04 08:00:00"),
                              as.POSIXct("2011-06-03 08:00:00"),
                              as.POSIXct("2011-09-12 10:00:00")),
                endDate = c(as.POSIXct("2011-06-06 11:00:00"),
                            as.POSIXct("2011-06-04 11:00:00"),
                            as.POSIXct("2011-09-12 18:00:00")),
                partysize = c(124, 442, 323))
open <- "07:00"
close <- "17:00"
I would like my result set to look something like this:
day numhours partysize
2011-06-04 9 124
2011-06-05 10 124
2011-06-06 4 124
2011-06-03 9 442
2011-06-04 4 442
2011-09-12 7 323
note: numhours is the number of hours the date was included between the open and close times
Thanks in advance,
--JT
Sorry, it's very messy, and I used 7 and 17 instead of your open and close values.
# one POSIXct midnight for every calendar day each visit touches
app.days <- mapply(function(x, y) x + y * 60 * 60 * 24,
                   as.POSIXct(format(d$startDate, "%Y-%m-%d")),
                   lapply(floor(as.numeric(difftime(d$endDate, d$startDate, units = "days"))), seq, from = 0))
# clamp each day to the 07:00 open and 17:00 close, and to the actual start/end of the visit
start.date <- mapply(function(x, y) pmax(x + 7 * 60 * 60, y), app.days, d$startDate)
end.date <- mapply(function(x, y) pmin(x + 17 * 60 * 60, y), app.days, d$endDate)
app.hours <- mapply(function(x, y) as.numeric(difftime(x, y, units = "hours")), end.date, start.date)
res <- mapply(function(x, y, z) data.frame(day = as.Date(x), numhours = y, partysize = z),
              app.days, app.hours, as.list(d$partysize))
res1 <- data.frame(day = as.Date(unlist(res[1, ]), origin = "1970-01-01"),
                   numhours = unlist(res[2, ]), partysize = unlist(res[3, ]))
> res1
day numhours partysize
1 2011-06-04 9 124
2 2011-06-05 10 124
3 2011-06-06 4 124
4 2011-06-03 9 442
5 2011-06-04 4 442
6 2011-09-12 7 323
Basically, we identify how many days each party stays. For each of those days we find the applicable open and close times, clamp them to the actual start and end of the visit, and subtract to get the hours. The data frame is assembled at the end, although it could probably have been created in the res <- step.
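If you want to avoid hard-coding 7 and 17, here is a small sketch that derives them from the open/close strings defined in the question (it assumes whole hours such as "07:00"):
# derive hour offsets from the question's open/close strings
open.hour <- as.numeric(substr(open, 1, 2))    # 7
close.hour <- as.numeric(substr(close, 1, 2))  # 17
# these can then replace the literal 7 and 17 above, e.g.
# start.date <- mapply(function(x, y) pmax(x + open.hour * 3600, y), app.days, d$startDate)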
