I have a data frame of character times that I need to average by provider, but I'm not sure how to average them using just the time without the date. For the example below:
provider time
USA 9:26:46
USDA 9:26:18
USDA 9:10:17
OIL 10:00:00
USA 6:20:56
USDA 7:19:13
OIL 11:00:00
The correct output for OIL would be the average between 10:00 and 11:00, and would look like:
provider average
OIL 10:30
Does anyone know how to average just time without incorporating date using POSIX?
mean works as expected. You can format the date to %H:%M:%S afterwards.
df <- read.table(text="provider time
USA 9:26:46
USDA 9:26:18
USDA 9:10:17
OIL 10:00:00
USA 6:20:56
USDA 7:19:13
OIL 11:00:00",header=TRUE)
df$time <- as.POSIXct(df$time,format="%T",origin="1970-01-01")
format(as.POSIXct(tapply(df$time,df$provider,mean),origin="1970-01-01"),format="%H:%M:%S")
OIL USA USDA
"10:30:00" "07:53:51" "08:38:36"
To get the provider names as a column:
m <- format(as.POSIXct(tapply(df$time,df$provider,mean),origin="1970-01-01"),format="%H:%M:%S")
m <- as.data.frame(m)
m$provider <- row.names(m)
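If you would rather not go through POSIXct at all, here is a minimal base R sketch that averages the times as seconds since midnight and formats the result back. The helper names to_seconds() and to_hms() are made up for this example, and it assumes every time is written as H:M:S.

```r
# Same example data as above, built directly as a data frame
df <- data.frame(
  provider = c("USA", "USDA", "USDA", "OIL", "USA", "USDA", "OIL"),
  time = c("9:26:46", "9:26:18", "9:10:17", "10:00:00",
           "6:20:56", "7:19:13", "11:00:00"),
  stringsAsFactors = FALSE
)

# to_seconds() is a hypothetical helper: "H:M:S" -> seconds since midnight
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

# to_hms() is a hypothetical helper: seconds -> "HH:MM:SS", rounded to whole seconds
to_hms <- function(s) {
  s <- as.integer(round(s))
  sprintf("%02d:%02d:%02d", s %/% 3600L, (s %% 3600L) %/% 60L, s %% 60L)
}

df$secs <- to_seconds(df$time)
avg <- aggregate(secs ~ provider, data = df, FUN = mean)  # mean seconds per provider
avg$average <- to_hms(avg$secs)
avg[, c("provider", "average")]
```

Because no date or time zone is involved, the result does not depend on the local time zone settings.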
I have a data.frame representing the time sheet for several staff over a period of several months spanning 2 years. The data looks like:
Name Month 1 2 3 ... 31 Total Job ... [more columns]
John Smith Aug 2017 1:20 1:20 Typing
Mary Jones Sep 2017 Proofing
John Smith Oct 2017 0:15 1:10 1:25 Typing
...
Jim Miles Feb 2018 1:30 2:10 3:40 Admin
There are 31 columns, each representing a date in the corresponding month. There will be multiple rows with the same Name.
So looking at the first entry, John Smith did 1 hour and 20 minutes of work on 1 August 2017.
What I want to do is to analyse these data in a granular way, e.g.
How many hours did John Smith spend on Typing in Sept 2017?
How much Proofing was done in Jan-Feb 2018?
I am a bit stuck on how to proceed in order to have the data to analyse. Suggestions appreciated.
Added for clarification:
Having read three very helpful replies and looked at tidyr, I have clarified my thoughts and think that I need to modify the data so there is one row for each entry, so the example table will become:
Name Date Duration Job ... [more columns]
John Smith 01 Aug 2017 1:20 Typing
John Smith 02 Oct 2017 0:15 Typing
John Smith 31 Oct 2017 0:15 Typing
...
Jim Miles 02 Feb 2018 1:30 Admin
Jim Miles 03 Feb 2018 2:10 Admin
Date will need to be formatted correctly but that is not major. The problem is matching the day of month to the relevant Month and year to produce the composite date. Any ideas welcome.
I would approach this by converting total time spent to numeric. Depending on the structure of the data, you could split this string by a colon and convert minutes to hours and sum to get decimal hours.
Something along the lines of this:
x <- c("1:20", "1:25", "3:40")
x <- strsplit(x, ":")
sapply(x, FUN = function(m) {
m <- as.numeric(m)
sum(m[1], m[2]/60)
})
[1] 1.333333 1.416667 3.666667
Then, you could use aggregate to sum by month-year and name.
aggregate(Total ~ Name + Month + Job, data = xy, FUN = sum)
If you need to report by month only, you would have to extract month name in one way or another, but nothing hard.
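Putting the two steps together, a small self-contained sketch could look like this. The data frame xy and its rows are invented for illustration (only the column names Name, Month, Job and Total come from the question), and to_hours() is a hypothetical helper.

```r
# Made-up rows in the shape described in the question
xy <- data.frame(
  Name  = c("John Smith", "John Smith", "John Smith", "Jim Miles"),
  Month = c("Aug 2017", "Oct 2017", "Oct 2017", "Feb 2018"),
  Job   = c("Typing", "Typing", "Typing", "Admin"),
  Total = c("1:20", "1:25", "0:35", "3:40"),
  stringsAsFactors = FALSE
)

# to_hours() is a hypothetical helper: "H:MM" -> decimal hours
to_hours <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE),
         function(m) {
           m <- as.numeric(m)
           m[1] + m[2] / 60
         },
         numeric(1))
}

xy$Hours <- to_hours(xy$Total)

# Sum decimal hours for each Name / Month / Job combination
res <- aggregate(Hours ~ Name + Month + Job, data = xy, FUN = sum)
res
```

Summing the numeric Hours column rather than the original Total strings is what makes FUN = sum work here.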
After following up on the suggestions of @Khlick, I succeeded in using gather():
mydata <- mydata %>% gather(new_date, time_spent, "1":"31")
This produced two new columns, new_date and time_spent, then created a new row for each data point of columns 1–31.
So now I had, for each data point, the month (e.g. Aug 2017) in one column and the day the work was done (e.g. 12) in another. I changed the month to a date in the original spreadsheet, so it became 2017-08-01 (all dates now have day 01). Then in R I used substr() and paste() to replace the day with the correct one, e.g. 2017-08-12.
Finally, I was left with a large number of rows with no value in time_spent. I removed those rows.
I now have:
Name Date Duration Job ... [more columns]
John Smith 2017-08-01 1:20 Typing
John Smith 2017-10-02 0:15 Typing
John Smith 2017-10-31 0:15 Typing
...
Jim Miles 2018-02-02 1:30 Admin
Jim Miles 2018-02-03 2:10 Admin
I did a few spot checks and the data seem to have been transformed correctly. Thanks to all, especially to @Khlick.
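For anyone who wants to build the composite date in R instead of editing the spreadsheet, here is a rough sketch. It assumes the long data has a Month column like "Aug 2017" with English three-letter month abbreviations and a day column holding the day of month; the data frame long and the column name day are invented for the example.

```r
# Made-up long-format rows: one row per (person, month, day of month)
long <- data.frame(
  Name = c("John Smith", "John Smith", "Jim Miles"),
  Month = c("Aug 2017", "Oct 2017", "Feb 2018"),
  day = c("1", "31", "3"),
  time_spent = c("1:20", "1:10", "2:10"),
  stringsAsFactors = FALSE
)

# Split "Aug 2017" into month abbreviation and year
parts <- strsplit(long$Month, " ", fixed = TRUE)
mon <- match(vapply(parts, `[`, character(1), 1), month.abb)  # "Aug" -> 8
yr <- as.integer(vapply(parts, `[`, character(1), 2))         # "2017" -> 2017

# Assemble an ISO date string and parse it with an explicit format
long$Date <- as.Date(
  sprintf("%04d-%02d-%02d", yr, mon, as.integer(long$day)),
  format = "%Y-%m-%d"
)
long[, c("Name", "Date", "time_spent")]
```

Matching against month.abb avoids depending on the locale, which parsing month names with "%b" would do.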
My data set looks like this: Daily closing share price is given for 25 years from 1991 to 2016 for each trading date.
Company Code Company Name Daily Trading Dates Daily Closing Share price
43677 CENTURY ENKA LTD. 1/1/1991 3550.00
-do- -do- 1/2/1991 3600.00
. 3700.00
. 3800.00
12/31/1991 x
. x
. x
1/1/2016 x
. x
. x
12/31/2016 x
I think the SMA function in the TTR package may help. Here is an example:
library(TTR)
p1 <- c(45,68,98,97,45,12,46,98,45,65,97,48,65,95) #dummy price data
SMA(p1,4) #calculate a 4 period simple moving average
#here is outcome
[1] NA NA NA 77.00 77.00 63.00 50.00 50.25 50.25 63.50 76.25
[12] 63.75 68.75 76.25
So if you set the second argument of SMA to 252 (roughly the number of trading days in a year), I think you will get, for each date in your data frame, the average share price over the preceding year.
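If you want to check what a trailing simple moving average does without installing TTR, it can be sketched in base R with stats::filter. This is only an illustration on the same dummy prices, not TTR's implementation.

```r
p1 <- c(45, 68, 98, 97, 45, 12, 46, 98, 45, 65, 97, 48, 65, 95)  # dummy price data
n <- 4                                                           # window length

# sides = 1 uses only the current and previous n - 1 values (a trailing window);
# the first n - 1 results are NA because the window is incomplete
sma <- as.numeric(stats::filter(p1, rep(1 / n, n), sides = 1))
sma
```

With a real daily price series you would replace n by the window you need, e.g. 252 as suggested above.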
I would use the lubridate package and either tapply or ave. In what follows I assume that your data is in the form of a data.frame named dat.
library(lubridate)
yr <- year(mdy(dat$date)) # assumes the month/day/year dates are in a column named date
res1 <- tapply(dat$price, yr, FUN = mean)
res2 <- ave(dat$price, yr, FUN = mean)
The difference between the two is that ave returns a vector the length of the input vector, whereas tapply returns a vector with as many elements as groups defined by the grouping variable(s), in this case yr.
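To see the difference in length concretely, here is a tiny base R sketch. The column names date and price are assumptions, and the 1992 prices are made up for illustration; the year is extracted with as.Date and format instead of lubridate so the example has no dependencies.

```r
# Toy data: two trading days in each of two years (1992 prices are invented)
dat <- data.frame(
  date = c("1/1/1991", "1/2/1991", "1/1/1992", "1/2/1992"),
  price = c(3550, 3600, 3700, 3800),
  stringsAsFactors = FALSE
)

# Year of each row, taken from month/day/year strings
yr <- format(as.Date(dat$date, format = "%m/%d/%Y"), "%Y")

res1 <- tapply(dat$price, yr, FUN = mean)  # one value per year
res2 <- ave(dat$price, yr, FUN = mean)     # one value per row, repeated within each year

length(res1)  # 2
length(res2)  # 4
```

res2 is convenient when you want the annual average as a new column next to each daily price, e.g. dat$annual_avg <- res2.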
I have a dataset x_output that looks like this:
timestamp city wait_time weekday
2015-07-14 09:00:00 Boston 1.4 Tuesday
2015-07-14 09:01:00 Boston 2.5 Tuesday
2015-07-14 09:02:00 Boston 2.8 Tuesday
2015-07-14 09:03:00 Boston 1.6 Tuesday
2015-07-14 09:04:00 Boston 1.5 Tuesday
2015-07-14 09:05:00 Boston 1.4 Wednesday
I would like to find the mean wait_time, grouped by city, weekday, and time. Basically, given your city, what is the average wait time for Monday, for example? Then Tuesday?
I'm having difficulty creating the time column given x_output$timestamp; I'm currently using:
x_output$time <- strsplit(as.character(x_output$timestamp), split = " ")[[1]][2]
However, that simply puts "09:00" in every row, not the correct time for each individual row.
Secondly, I need to have a 3-way grouping to find the mean wait_time given city, weekday and time. This is something that's fairly straightforward to do in python pandas, but I can find very little documentation on it in R (and unfortunately I need to do it in R, not python).
I've looked into using data.table but that hasn't seemed to work. Is there a simple function like there would be in python pandas (eg. df.groupby(['col1', 'col2', 'col3']).mean())?
Mean wait_time grouped by city, weekday, time:
library(plyr)
ddply(x_output, .(city, weekday, time), summarize, avg=mean(wait_time))
If you prefer data.table:
x_output[, list(avg=mean(wait_time)), .(city, weekday, time)]
I'm having difficulty creating the time column given x_output$timestamp
Well, what is the time column supposed to have in it? Just the time component of timestamp? Is timestamp a POSIXct or a string?
If it is a POSIXct, then you can just format it as a character string, specifying the time format:
x_output$time <- format(x_output$timestamp, '%H:%M')
# or as.factor(format(...)) if you need it to be a factor.
# in data.table: x[, time := format(timestamp, '%H:%M')]
This will make the time column a string with the hour and minutes. See ?strptime for more options on converting that datetime to a string (e.g. if you want to include seconds).
If it is a string, you could strsplit and extract the second component:
# take the 2nd piece of each split string; character(1) is the expected result type
vapply(strsplit(as.character(x_output$timestamp), ' '), '[', character(1), 2)
which will give you "HH:MM:SS" as your time format. If you want to do a custom time format, probably best to convert your timestamp string into a POSIXct and back out to the specific format like already mentioned.
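For completeness, here is a small base R sketch of both steps on string timestamps: building a per-row time column, then the three-way grouping with aggregate. The third row of the toy data is invented so that one group has more than one value, and substr assumes the fixed "YYYY-MM-DD HH:MM:SS" layout shown in the question.

```r
x_output <- data.frame(
  timestamp = c("2015-07-14 09:00:00", "2015-07-14 09:01:00", "2015-07-21 09:00:00"),
  city = "Boston",
  wait_time = c(1.4, 2.5, 1.8),
  weekday = "Tuesday",
  stringsAsFactors = FALSE
)

# strsplit(...)[[1]][2] only looks at the first row, and that single value is
# then recycled down the column. substr() works on every row at once.
x_output$time <- substr(x_output$timestamp, 12, 16)  # characters 12-16 are "HH:MM"

# Mean wait_time for each city / weekday / time combination
res <- aggregate(wait_time ~ city + weekday + time, data = x_output, FUN = mean)
res
```

This is roughly the base R counterpart of the pandas groupby(...).mean() call mentioned in the question.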
I often used the ts() objects for yearly, quarterly or monthly time series, but now I would like to use it for weekly. Now the challenge is that not every year has the same number of weeks (either 52 or 53 weeks). How to deal with this?
I usually take the first day of the week as an identifier for the week (e.g. 2013-05-20 or 2013-05-27).
Can anybody advise how I would create a proper weekly time series for the following dataset (x)?
Date Qty
2013-05-20 25
2013-05-27 60
....
Something along the lines of:
ts <- ts(x$Qty, start=as.Date(x$Date), frequency=????)
Thank you for your assistance.
DF <- read.table(text="Date Qty
2013-05-20 25
2013-05-27 60",header=TRUE)
DF$Date <- as.Date(DF$Date)
library(xts)
my.xts <- as.xts(DF[,-1,drop=FALSE],order.by=DF$Date)
as.ts(my.xts)
# Time Series:
# Start = 1
# End = 8
# Frequency = 0.142857142857143
# [1] 25 60
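If you do want to stay with a plain ts object, one workaround that is sometimes suggested is to use the average number of weeks per year (365.25 / 7) as a non-integer frequency and a decimal year as the start. The sketch below assumes that convention; it is an approximation, since weeks will not line up exactly with calendar years.

```r
DF <- data.frame(
  Date = as.Date(c("2013-05-20", "2013-05-27")),
  Qty = c(25, 60)
)

weeks_per_year <- 365.25 / 7  # about 52.18, averages over 52- and 53-week years

# Decimal-year position of the first observation
first <- DF$Date[1]
year_start <- as.Date(format(first, "%Y-01-01"))
start_time <- as.numeric(format(first, "%Y")) +
  as.numeric(first - year_start) / 365.25

weekly <- ts(DF$Qty, start = start_time, frequency = weeks_per_year)
frequency(weekly)
```

The xts object above keeps the exact dates, so it is usually the easier choice when the actual week start dates matter.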
I have the following data, a time series of rain gauge readings. The Time Stamp is recorded each time the rain gauge registers an increased count, and the Volume is the amount of rain added to the bucket. I need to aggregate the total amount of rain added to the bucket over a few different intervals: hourly, 6-hourly, daily and weekly. I tried some of the other data aggregation methods posted around Stack Overflow, but they assume regular collection intervals. I am not very good with R, so forgive me if this is a super easy edit to code that has already been posted.
I know the data is a snapshot from Excel, but that was just so it would format nicely for visual purposes in this forum, because I can't figure out how to post a table.
Attached is the CSV of the data
Data File Here
One option is to use the lubridate package:
library(lubridate)
timeseries <- read.csv("project1.csv", sep=",", header=TRUE, dec=".")
timeseries[,1] <- mdy_hm(timeseries[,1])
The dates have now been converted to POSIXct, R's standard date-time class.
Next, each timestamp is rounded up to the next unit boundary with ceiling_date().
The unit can be set to, for instance, "hour", "day" or "month".
The rounded dates are bound to the original data.frame as a new first column.
The last step is to sum the values for each rounded date.
rdate <- ceiling_date(x=timeseries[,1],unit="hour")
temp <- cbind(rdate,timeseries)
timeseries_hour <- aggregate(x=temp[3],by=list(temp[,1]),FUN=sum)
Part of the result:
head(timeseries_hour)
Group.1 Ppt..Amount
1 1996-05-02 01:00:00 0.03
2 1996-05-02 02:00:00 0.02
3 1996-05-02 05:00:00 0.01
4 1996-05-02 06:00:00 0.04
5 1996-05-02 07:00:00 0.38
6 1996-05-02 08:00:00 0.13
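The same idea extends to the 6-hour and daily totals asked about in the question. Below is a base R sketch on made-up readings; the data frame gauge, its column names and its values are invented for illustration, and the times are treated as UTC. Unlike the ceiling_date() example above, each bin here is labelled by its start rather than its end.

```r
# Made-up irregular readings (times and volumes are invented)
gauge <- data.frame(
  time = as.POSIXct(c("1996-05-02 00:15", "1996-05-02 00:40", "1996-05-02 04:55",
                      "1996-05-02 07:10", "1996-05-03 01:05"), tz = "UTC"),
  volume = c(0.01, 0.02, 0.01, 0.04, 0.03)
)

# Round each timestamp down to the start of its 6-hour block (00, 06, 12, 18 UTC)
width <- 6 * 3600  # bin width in seconds
gauge$bin6h <- as.POSIXct(floor(as.numeric(gauge$time) / width) * width,
                          origin = "1970-01-01", tz = "UTC")
six_hourly <- aggregate(volume ~ bin6h, data = gauge, FUN = sum)

# Daily totals: group by calendar date
gauge$day <- as.Date(gauge$time, tz = "UTC")
daily <- aggregate(volume ~ day, data = gauge, FUN = sum)

six_hourly
daily
```

Only bins that contain at least one reading appear in the result, so dry periods are simply absent rather than reported as zero.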