I have the following data set, which shows the start and end (date and time) of each episode:
ep <- data.frame(start=c("2009-07-13 23:45:00", "2009-08-14 08:30:00",
"2009-09-16 15:30:00"),
end=c("2009-07-14 00:03:00", "2009-08-15 08:35:00",
"2009-09-19 07:30:00"))
I need to convert it into a data frame that shows how many minutes of episodes fall on each calendar day. For the above example it would be:
2009-07-13 15
2009-07-14 3
2009-08-14 930
2009-08-15 515
2009-09-16 510
2009-09-17 1440
2009-09-18 1440
2009-09-19 450
I appreciate any help
This works, but seems slightly inelegant. First, create a vector containing a sequence of times, one per minute, between each start and end time:
tmp <- do.call(c, apply(ep, 1,
                        function(x) head(seq(from = as.POSIXct(x[1]),
                                             to = as.POSIXct(x[2]), by = "mins"),
                                         -1)))
We use head(...., -1) to remove the last minute from each sequence so that the minute counts match what you wanted.
Next, split this vector into the minutes occurring on individual days, and count how many minutes there are per day:
tmp <- sapply(split(tmp, format(tmp, format = "%Y-%m-%d")), length)
Note that, probably for time-zone reasons (as.Date() converts date-times using UTC by default), we can't just use as.Date(tmp) to get a vector of dates; we need to explicitly format the times to show only the date parts.
The final step is to arrange the tmp object that contains everything we need into the format you requested:
mins <- data.frame(Date = names(tmp), Minutes = tmp, row.names = NULL)
This gives:
> mins
Date Minutes
1 2009-07-13 15
2 2009-07-14 3
3 2009-08-14 930
4 2009-08-15 515
5 2009-09-16 510
6 2009-09-17 1440
7 2009-09-18 1440
8 2009-09-19 450
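If the episodes are long, enumerating every minute can become expensive. A minimal sketch of an alternative, assuming all times are in a single time zone without daylight-saving shifts (UTC here) and that each end is later than its start, clips every episode to the calendar days it touches. The name minutes_per_day is a hypothetical helper, not part of the answer above:

```r
ep <- data.frame(
  start = c("2009-07-13 23:45:00", "2009-08-14 08:30:00", "2009-09-16 15:30:00"),
  end   = c("2009-07-14 00:03:00", "2009-08-15 08:35:00", "2009-09-19 07:30:00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: minutes of episode time falling on each calendar day.
# Assumes a fixed 86400-second day, i.e. no daylight-saving changes in `tz`.
minutes_per_day <- function(start, end, tz = "UTC") {
  start <- as.POSIXct(start, tz = tz)
  end   <- as.POSIXct(end, tz = tz)
  pieces <- lapply(seq_along(start), function(i) {
    # every calendar day touched by episode i
    days <- seq(as.Date(start[i], tz = tz), as.Date(end[i], tz = tz), by = "day")
    day_start <- as.numeric(as.POSIXct(format(days), tz = tz))  # midnight, in seconds
    day_end   <- day_start + 86400
    # overlap of the episode with each day, in seconds
    secs <- pmin(as.numeric(end[i]), day_end) - pmax(as.numeric(start[i]), day_start)
    data.frame(Date = format(days), Minutes = secs / 60, stringsAsFactors = FALSE)
  })
  out <- aggregate(Minutes ~ Date, data = do.call(rbind, pieces), FUN = sum)
  out[out$Minutes > 0, ]  # drop days touched only at exactly midnight
}

mins2 <- minutes_per_day(ep$start, ep$end)
```

For the example data this should reproduce the same eight rows as the minute-by-minute approach, while doing work proportional to the number of days rather than the number of minutes.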
I have a data frame that looks like this:
X id mat.1 mat.2 mat.3 times
1 1 1 Anne 1495206060 18.5639404 2017-05-19 11:01:00
2 2 1 Anne 1495209660 9.0160321 2017-05-19 12:01:00
3 3 1 Anne 1495211460 37.6559161 2017-05-19 12:31:00
4 4 1 Anne 1495213260 31.1218856 2017-05-19 13:01:00
164 164 1 Anne 1497825060 4.8098351 2017-06-18 18:31:00
165 165 1 Anne 1497826860 15.0678781 2017-06-18 19:01:00
166 166 1 Anne 1497828660 4.7636241 2017-06-18 19:31:00
What I would like is to subset the data set by time interval (all data between 11 AM and 4 PM) if there are data points for each hour at least (11 AM, 12, 1, 2, 3, 4 PM) within each day. I want to ultimately sum the values from mat.3 per time interval (11 AM to 4 PM) per day.
I tried:
sub.1 <- subset(t,format(times,'%H')>='11' & format(times,'%H')<='16')
This returns all the data from any time between 11 AM and 4 PM, but for a given day I often only have data for, e.g., 12 and 1 PM.
I only want the subset from days where I have data for each hour from 11 AM to 4 PM. Any ideas what I can try?
This complements @Henry Navarro's answer by solving an additional problem mentioned in the question.
If I understand correctly, the other concern of the question is to find the dates for which there is at least one data point for each hour of the given interval. A possible way, following the style of @Henry Navarro's solution, is as follows:
library(lubridate)  # provides ymd()
your_data$hour_only <- as.numeric(format(your_data$times, format = "%H"))
your_data$days <- ymd(format(your_data$times, "%Y-%m-%d"))
your_data_by_days_list <- split(x = your_data, f = your_data$days)
# the interval is narrowed for demonstration purposes
hours_intervals <- 11:13
# take the days from the names of the split list so that they line up with the flags
all_hours_flags <- data.frame(days = as.Date(names(your_data_by_days_list)),
  all_hours_present = sapply(your_data_by_days_list,
    function(Z) all(hours_intervals %in% Z$hour_only)),
  row.names = NULL)
your_data <- merge(your_data, all_hours_flags, by = "days")
There is now a column "all_hours_present" indicating whether the data for a given day contain at least one value for each hour in hours_intervals. You can use this column to subset your data:
subset(your_data, all_hours_present)
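To connect this with the final goal in the question (summing mat.3 per day over the 11 AM to 4 PM window), here is a minimal base-R sketch. The toy values below are hypothetical, and the window is assumed to mean hours 11 through 16 inclusive, as in the question's subset() call:

```r
# Hypothetical toy data: one complete day and one incomplete day
your_data <- data.frame(
  times = as.POSIXct(c("2017-05-19 10:01:00", "2017-05-19 11:01:00",
                       "2017-05-19 12:01:00", "2017-05-19 13:01:00",
                       "2017-05-19 14:01:00", "2017-05-19 15:01:00",
                       "2017-05-19 16:01:00", "2017-05-20 12:01:00",
                       "2017-05-20 13:01:00"), tz = "UTC"),
  mat.3 = c(100, 1, 2, 3, 4, 5, 6, 7, 8)
)

wanted <- 11:16  # every one of these hours must be present on a day
your_data$hour <- as.integer(format(your_data$times, "%H"))
your_data$day  <- format(your_data$times, "%Y-%m-%d")

# keep only rows inside the window
in_window <- your_data[your_data$hour %in% wanted, ]

# TRUE for days that have at least one reading in every wanted hour
complete <- sapply(split(in_window$hour, in_window$day),
                   function(h) all(wanted %in% h))
keep <- names(complete)[complete]

# sum mat.3 per complete day
daily_sum <- aggregate(mat.3 ~ day,
                       data = in_window[in_window$day %in% keep, ],
                       FUN = sum)
```

With the toy data only 2017-05-19 qualifies, and the 10:01 reading is excluded from its sum because it lies outside the window.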
Try to create a new variable in your data frame with only the hour.
your_data$hour<-format(your_data$times, format="%H:%M:%S")
Then, using this new variable, flag the rows that fall inside your interval:
#auxiliary variable marking rows within the time interval
your_data$aux_var<-ifelse(your_data$hour >= "11:00:00" & your_data$hour <= "16:00:00", 1, 0)
The next step is to filter your data where aux_var==1:
your_data[which(your_data$aux_var ==1),]
I have a large number of files (~1200), each of which contains a long time series of groundwater heights. The starting date and length of the series differ between files. There can be large gaps between dates, for example (a small part of such a file):
Date Height (cm)
14-1-1980 7659
28-1-1980 7632
14-2-1980 7661
14-3-1980 7638
28-3-1980 7642
14-4-1980 7652
25-4-1980 7646
14-5-1980 7635
29-5-1980 7622
13-6-1980 7606
27-6-1980 7598
14-7-1980 7654
28-7-1980 7654
14-8-1980 7627
28-8-1980 7600
12-9-1980 7617
14-10-1980 7596
28-10-1980 7601
14-11-1980 7592
28-11-1980 7614
11-12-1980 7650
29-12-1980 7670
14-1-1981 7698
28-1-1981 7700
13-2-1981 7694
17-3-1981 7740
30-3-1981 7683
14-4-1981 7692
14-5-1981 7682
15-6-1981 7696
17-7-1981 7706
28-7-1981 7699
28-8-1981 7686
30-9-1981 7678
17-11-1981 7723
11-12-1981 7803
18-2-1982 7757
16-3-1982 7773
13-5-1982 7753
11-6-1982 7740
14-7-1982 7731
15-8-1982 7739
14-9-1982 7722
14-10-1982 7794
15-11-1982 7764
14-12-1982 7790
14-1-1983 7810
28-3-1983 7836
28-4-1983 7815
31-5-1983 7857
29-6-1983 7801
28-7-1983 7774
24-8-1983 7758
28-9-1983 7748
26-10-1983 7727
29-11-1983 7782
27-1-1984 7801
28-3-1984 7764
27-4-1984 7752
28-5-1984 7795
27-7-1984 7748
27-8-1984 7729
28-9-1984 7752
26-10-1984 7789
28-11-1984 7797
18-12-1984 7781
28-1-1985 7833
21-2-1985 7778
22-4-1985 7794
28-5-1985 7768
28-6-1985 7836
26-8-1985 7765
19-9-1985 7760
31-10-1985 7756
26-11-1985 7760
20-12-1985 7781
17-1-1986 7813
28-1-1986 7852
26-2-1986 7797
25-3-1986 7838
22-4-1986 7807
27-5-1986 7785
24-6-1986 7787
26-8-1986 7744
23-9-1986 7742
22-10-1986 7752
1-12-1986 7749
17-12-1986 7758
I want to calculate the average height over 5 years. So, in the case of the example: 14-1-1980 + 5 years, 14-1-1985 + 5 years, and so on. The number of data points is different for each average. It is very likely that the date 5 years later will not be in the data set as a data point. Hence, I think I need to tell R somehow to take an average over a certain time span.
I searched the internet but didn't find anything that fitted my needs. A lot of useful packages like uts, zoo and lubridate, and the function aggregate, came up. Instead of getting closer to the solution, I get more and more confused about which approach is best for my problem.
Thanks a lot in advance!
As @vagabond points out, you'll want to combine your 1200 files into a single data frame (the plyr package would allow you to do something simple like: data.all <- adply(dir([DATA FOLDER]), 1, read.csv)).
Once you have the data, the first step would be to transform the Date column into proper Date data. Right now the data appear to be strings, and we want them to have an underlying numerical representation (which the Date class provides):
library(lubridate)  # provides dmy() and years()
df$date.new <- as.Date(dmy(df$Date))
Date Height date.new
1 14-1-1980 7659 1980-01-14
2 28-1-1980 7632 1980-01-28
3 14-2-1980 7661 1980-02-14
4 14-3-1980 7638 1980-03-14
5 28-3-1980 7642 1980-03-28
6 14-4-1980 7652 1980-04-14
Note that the date.new column looks like a string, but is in fact Date data, and can be handled with numerical operations (addition, comparison, etc.).
Next, we might construct a set of date periods over which we want to compute averages. Your example mentions 5 years, but with the data you provided, that's not a very illustrative example. So here I'm creating 1-year periods starting at every day between Jan 14 1980 and Jan 14 1985:
date.start <- as.Date(as.Date('1980-01-14') : as.Date('1985-01-14'), origin = '1970-01-01')
date.end <- date.start + years(1)
dates <- data.frame(start = date.start, end = date.end)
start end
1 1980-01-14 1981-01-14
2 1980-01-15 1981-01-15
3 1980-01-16 1981-01-16
4 1980-01-17 1981-01-17
5 1980-01-18 1981-01-18
6 1980-01-19 1981-01-19
Then we can use the dplyr package to move through each row of this data frame and compute a summary average of Height:
library(dplyr)
df.mean <- dates %>%
group_by(start, end) %>%
summarize(height.mean = mean(df$Height[df$date.new >= start & df$date.new < end]))
start end height.mean
<date> <date> <dbl>
1 1980-01-14 1981-01-14 7630.273
2 1980-01-15 1981-01-15 7632.045
3 1980-01-16 1981-01-16 7632.045
4 1980-01-17 1981-01-17 7632.045
5 1980-01-18 1981-01-18 7632.045
6 1980-01-19 1981-01-19 7632.045
The foverlaps function from the data.table package is IMHO the perfect candidate for such a situation:
library(data.table)
library(lubridate)  # for years()
# convert to a data.table with setDT()
# convert the 'Date'-column to date-format
# create a begin & end date for the required period
setDT(dat)[, Date := as.Date(Date, '%d-%m-%Y')
][, `:=` (begindate = Date, enddate = Date + years(1))]
# set the keys (necessary for the foverlaps function)
setkey(dat, begindate, enddate)
res <- foverlaps(dat, dat, by.x = c(1,3))[, .(moving.average = mean(i.Height)), Date]
the result:
> head(res,15)
Date moving.average
1: 1980-01-14 7633.217
2: 1980-01-28 7635.000
3: 1980-02-14 7637.696
4: 1980-03-14 7636.636
5: 1980-03-28 7641.273
6: 1980-04-14 7645.261
7: 1980-04-25 7644.955
8: 1980-05-14 7646.591
9: 1980-05-29 7647.143
10: 1980-06-13 7648.400
11: 1980-06-27 7652.900
12: 1980-07-14 7655.789
13: 1980-07-28 7660.550
14: 1980-08-14 7660.895
15: 1980-08-28 7664.000
Now you have, for each date, an average of all the values that lie between that date and one year ahead of it.
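For the non-overlapping 5-year windows described in the question (14-1-1980 + 5 years, 14-1-1985 + 5 years, and so on), a minimal base-R sketch could look like the following. It uses five rows taken from the sample data; the names breaks, window and avg5 are illustrative only:

```r
dat <- data.frame(
  Date   = c("14-1-1980", "14-1-1981", "28-11-1984", "28-1-1985", "17-12-1986"),
  Height = c(7659, 7698, 7797, 7833, 7758),
  stringsAsFactors = FALSE
)
dat$Date <- as.Date(dat$Date, format = "%d-%m-%Y")

# Window boundaries: first date, first date + 5 years, ...
# The upper limit is padded so that the last break lies beyond the data.
breaks <- seq(min(dat$Date), max(dat$Date) + 5 * 366, by = "5 years")

# Index of the window each observation falls into (left-closed intervals)
window <- findInterval(as.numeric(dat$Date), as.numeric(breaks))

# Mean height per window; windows without data simply do not appear
avg5 <- data.frame(
  start       = breaks[sort(unique(window))],
  height.mean = as.numeric(tapply(dat$Height, window, mean))
)
```

Because each observation is assigned to a window by its date, it does not matter that the exact boundary dates are missing from the data.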
Hey, I just tried this after seeing your question! I ran it on a sample data frame. Try it on yours once you understand the code, and then let me know!
By the way, instead of an interval of 5 years, I used just 2 months (2*30 days = approx. 2 months) as the interval.
df = data.frame(Date = c("14-1-1980", "28-1-1980", "14-2-1980", "14-3-1980", "28-3-1980",
                         "14-4-1980", "25-4-1980", "14-5-1980", "29-5-1980", "13-6-1980",
                         "27-6-1980", "14-7-1980", "28-7-1980", "14-8-1980"), height = 1:14)
# as.Date(df$Date, "%d-%m-%Y")
df1 = data.frame(orig = NULL, dest = NULL, avg_ht = NULL)
orig = as.Date(df$Date, "%d-%m-%Y")[1]
dest = as.Date(df$Date, "%d-%m-%Y")[1] + 2*30 #approx 2 months
dest_final = as.Date(df$Date, "%d-%m-%Y")[14]
while (dest < dest_final){
m = mean(df$height[which(as.Date(df$Date, "%d-%m-%Y")>=orig &
as.Date(df$Date, "%d-%m-%Y")<dest )])
df1 = rbind(df1,data.frame(orig=orig,dest=dest,avg_ht=m))
orig = dest
dest = dest + 2*30
print(paste("orig:",orig, " + ","dest:",dest))
}
> df1
orig dest avg_ht
1 1980-01-14 1980-03-14 2.0
2 1980-03-14 1980-05-13 5.5
3 1980-05-13 1980-07-12 9.5
I hope this works for you as well
This is my best try, but please keep in mind that I am working with the years instead of the full dates, i.e. based on the example you provided I am averaging from the beginning of 1980 to the end of 1984.
library(stringr)  # provides str_sub()
#extract the year of each measurement
years<-as.integer(str_sub(dat[,1], start= -4))
#find how many 5-year intervals there are
for (i in 1:groups){
#extract the indices of the dates vector within the 5-year period
colnames(meangroups)<-c("Year:Year+4","Mean Height (cm)")
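The idea of grouping by calendar year can be sketched compactly in base R. This is only one possible reading of the approach described above, not a reconstruction of the original code; the toy rows come from the question's sample, and the column names period and mean_height_cm are illustrative:

```r
dat <- data.frame(
  Date   = c("14-1-1980", "14-1-1981", "28-11-1984", "28-1-1985", "17-12-1986"),
  Height = c(7659, 7698, 7797, 7833, 7758),
  stringsAsFactors = FALSE
)

# year of each measurement
yr <- as.integer(format(as.Date(dat$Date, format = "%d-%m-%Y"), "%Y"))

# 0 for the first 5 calendar years, 1 for the next 5, and so on
first_year <- min(yr)
grp <- (yr - first_year) %/% 5

# label each group by the years it covers and average the heights
starts <- first_year + 5 * sort(unique(grp))
meangroups <- data.frame(
  period         = paste(starts, starts + 4, sep = "-"),
  mean_height_cm = as.numeric(tapply(dat$Height, grp, mean)),
  stringsAsFactors = FALSE
)
```

Integer division replaces the explicit loop over groups, so the number of 5-year intervals never has to be computed up front.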
I have a data frame with an hour stamp and the corresponding measured temperature. The measurements are taken continuously at random intervals. I would like to convert the hours to the respective date-times, keeping the measured temperature. My data frame looks like this (the measurements started on 20/05/2016):
Time, Temp
I would like to create a data.frame with respective date-time and Temp like below:
Time, Temp
2016-05-20 09:25,28
2016-05-20 10:35,28.2
2016-05-20 18:25,29
2016-05-20 23:50,30
2016-05-21 01:10,31
2016-05-21 12:00,36
2016-05-22 02:00,25
I would be thankful for any comments and tips on R packages or functions I could look at to do this. Thanks for your time.
A possible solution in base R:
df$Time <- as.POSIXct(strptime(paste('2016-05-20', sprintf('%05.2f',df$Time)), format = '%Y-%m-%d %H.%M', tz = 'GMT'))
df$Time <- df$Time + cumsum(c(0,diff(df$Time)) < 0) * 86400 # 86400 = 60 * 60 * 24
which gives:
> df
Time Temp
1 2016-05-20 09:25:00 28.0
2 2016-05-20 10:35:00 28.2
3 2016-05-20 18:25:00 29.0
4 2016-05-20 23:50:00 30.0
5 2016-05-21 01:10:00 31.0
6 2016-05-21 12:00:00 36.0
7 2016-05-22 02:00:00 25.0
An alternative with data.table (of course you can also use cumsum with diff instead of rleid & shift):
setDT(df)[, Time := as.POSIXct(strptime(paste('2016-05-20', sprintf('%05.2f',Time)), format = '%Y-%m-%d %H.%M', tz = 'GMT')) +
(rleid(Time < shift(Time, fill = Time[1]))-1) * 86400]
Or with dplyr:
df %>%
mutate(Time = as.POSIXct(strptime(paste('2016-05-20', sprintf('%05.2f',Time)),
format = '%Y-%m-%d %H.%M', tz = 'GMT')) +
cumsum(c(0,diff(Time)) < 0)*86400)
which will both give the same result.
Used data:
df <- read.table(text='Time, Temp
02.00,25', header=TRUE, sep=',')
You can use a custom date format combined with some code that detects when a new day begins (assuming the first measurement takes place earlier in the day than the last measurement of the previous day).
# starting day
start_date = "2016-05-20"
values=read.csv('values.txt', colClasses=c("character",NA))
Time = strptime(paste(start_date,values$Time), "%Y-%m-%d %H.%M")
Time = Time + day*86400
values$Time = Time
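One way the day counter could be derived is sketched below. This is an assumption about what `day` is meant to hold (the number of midnights passed so far), and the input values are reconstructed from the expected output in the question rather than taken from the original file:

```r
# Hypothetical input, reconstructed from the expected output in the question
values <- data.frame(
  Time = c("09.25", "10.35", "18.25", "23.50", "01.10", "12.00", "02.00"),
  Temp = c(28, 28.2, 29, 30, 31, 36, 25),
  stringsAsFactors = FALSE
)

start_date <- "2016-05-20"

# time of day placed on the starting date (UTC avoids daylight-saving jumps)
tod <- as.POSIXct(strptime(paste(start_date, values$Time),
                           "%Y-%m-%d %H.%M", tz = "UTC"))

# a new day begins whenever the time of day decreases
day <- cumsum(c(0, diff(as.numeric(tod))) < 0)

# shift each reading by the number of days elapsed
values$Time <- tod + day * 86400
```

As noted above, this only works if consecutive readings are less than 24 hours apart and each new day starts with an earlier time of day than the previous reading.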
We have a csv file with Dates in Excel format and Nav for Manager A and Manager B as follows:
Date,Manager A,Date,Manager B
We want to create a zoo object with the following structure [Dates, Manager A Nav, Manager B Nav].
After reading the csv file with:
data = read.csv("...", header=TRUE, sep=",")
we set an index for splitting the object and use lapply to split
INDEX <- seq(1, by = 2, length = ncol(data) / 2)
data.zoo <- lapply(INDEX, function(i, data) data[i:(i+1)], data = zoo(data))
I'm stuck on the fact that the dates are in Excel format, and I don't know how to fix that. Is the problem set up correctly?
If all you want to do is to convert the dates to proper dates you can do this easily enough. The thing you need to know is the origin date. Your numbers represent the integer and fractional number of days that have passed since the origin date. Usually this is Jan 0 1900!!! Go figure, but be careful as I don't think this is always the case. You can try this...
# Excel origin is day 0 on Jan 0 1900, but treats 1900 as leap year so...
data$Date <- as.Date( data$Date , origin = "1899/12/30")
data$Date.1 <- as.Date( data$Date.1 , origin = "1899/12/30")
# For more info see ?as.Date
If you are interested in keeping the times as well, you can use as.POSIXct, but you should also keep the time zone in mind (UTC by default):
data$Date <- as.POSIXct(data$Date, origin = "1899/12/30" )
# Date Manager.A Date.1 Manager.B
# 1 2013-03-13 16:00:00 100 2012-01-24 100.00000
# 2 2013-03-14 16:00:00 100 2012-02-03 99.99999
# 3 2013-03-15 16:00:00 100 2012-02-06 99.99999
# 4 2013-03-18 16:00:00 100 2012-02-07 99.99999
# 5 2013-03-19 16:00:00 100 2012-02-08 100.05144
# 6 2013-03-20 16:00:00 100 2012-02-09 100.04877
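When starting from the raw Excel serial numbers, keep in mind that Excel counts days while POSIXct counts seconds, so a raw number has to be multiplied by 86400 first. A minimal sketch with hypothetical serial values (the variable names are illustrative, and the 1900 date system is assumed):

```r
# Hypothetical Excel serial numbers: whole part = days, fraction = time of day
serial <- c(41346 + 16/24, 41347 + 16/24)

# date only: drop the fractional day
as_date <- as.Date(floor(serial), origin = "1899-12-30")

# date and time: convert days to seconds; round() guards against
# floating-point values that land a hair below the full second
as_time <- as.POSIXct(round(serial * 86400), origin = "1899-12-30", tz = "UTC")
```

Setting tz explicitly keeps the printed times independent of the machine's local time zone.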
I've got logs of events that contain:
start time, end time, category id and count. They cover several months.
I'd like to aggregate them over time to be able to trace histograms over a given day, week, month.
So I assume the best way to do this is to bin the periods in buckets. I think 5 minutes would be good.
e.g. If an event starts at 1.01pm and ends at 1.07pm, I'd like to obtain 2 records for it as it covers 2 periods of 5 minutes (0-5 and 5-10) and replicate the rest of the original data for these new records (category and count)
if my input logs (x) are as such:
start / end / catid / count
2012-11-17 15:05:02.0, 2012-11-17 15:12:52.0, 1, 2
2012-11-17 15:07:13.0, 2012-11-17 15:17:47.0, 2, 10
2012-11-17 15:11:00.0, 2012-11-17 15:12:33.0, 3, 5
2012-11-17 15:12:01.0, 2012-11-17 15:20:00.0, 4, 1
I'm trying to get the output bucketed in 5 minutes (b) this way:
start / catid / count
2012-11-17 15:05:00.0 1, 2
2012-11-17 15:10:00.0 1, 2
2012-11-17 15:05:00.0 2, 10
2012-11-17 15:10:00.0 2, 10
2012-11-17 15:15:00.0 2, 10
2012-11-17 15:10:00.0 3, 5
2012-11-17 15:10:00.0 4, 1
2012-11-17 15:15:00.0 4, 1
Then I can easily aggregate the new data frame (b) over category ids for the period I want (hour, day, week, month)
I'm just starting with R, and I found a lot of explanations on how to bucket a time value but not a period of time.
I've had a look at zoo and xts but I couldn't quite find what to do.
Hopefully that makes sense to some of you.
I've slightly modified Ram's suggestion to get the correct calculation of blocks using the rounded endtime rather than the original end time. (Thanks Ram!)
mnslot=15 # size of the buckets/slot in minutes
#Round down the minutes of starttime to a multiple of mnslot
st.str <- strptime(st, "%Y-%m-%d %H:%M:%S")
min_st <- as.numeric(format(st.str, "%M"))
roundedmins <- floor(min_st/mnslot) * mnslot
st.base <- strptime(st, "%Y-%m-%d %H")
rounded_start <- st.base + (roundedmins * 60)
#Round down the minutes of the endtime to a multiple of mnslot.
en.str <- strptime(en, "%Y-%m-%d %H:%M:%S")
min_en <- as.numeric(format(en.str, "%M"))
roundedmins <- floor(min_en/mnslot) * mnslot
en.base <- strptime(en, "%Y-%m-%d %H")
rounded_end<- en.base + (roundedmins * 60)
# calculate the number of blocks based on the rounded minutes of start and end
numblocks<- as.numeric(floor((rounded_end-rounded_start)/mnslot/60)+1)
# the difference of POSIXct values is in minutes,
# but the difference of POSIXlt values seems to be in seconds, so we have to divide by 60 as well
#Create REPLICATED Rows, depending on the size of the interval
replicated_cat = NULL
replicated_count = NULL
replicated_start = NULL
for (n in 1:length(numblocks)){
  for (newrow in 1:numblocks[n]){
    # step by the slot size (mnslot minutes), starting from the rounded start computed above
    replicated_start = c(replicated_start, rounded_start[n]+(newrow-1)*mnslot*60 )
    replicated_cat = c(replicated_cat, df$catid[n])
    replicated_count = c(replicated_count, df$count[n])
  }
}
#Change to readable format
POSIXT <- unix2POSIXct(replicated_start)
newdf <- data.frame(POSIXT, replicated_cat, replicated_count)
names(newdf) <- c("start", "CatId", "Count")
With mnslot set to 5, this produces the required output. It is a bit slow, though. :p
Here's a fully working version. It involves step-by-step data manipulation for what you are after.
#storing the original data as a csv
df <- read.csv("tsdata.csv")
#a utility function to convert formats
unix2POSIXct <- function (time) structure(time, class = c("POSIXt", "POSIXct") )
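#Parse the start and end times (assumes the columns are named start and end)
st <- as.POSIXct(df$start)
en <- as.POSIXct(df$end)
#For each row, determine how many replications are needed
numdups <- floor(as.numeric(difftime(en, st, units = "mins"))/5) + 1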
st.str <- strptime(st, "%Y-%m-%d %H:%M:%S")
min_st <- as.numeric(format(st.str, "%M"))
#Round down the minutes of start to 5 minute starts. 0,5,10 etc...
roundedmins <- floor(min_st/5) * 5
st.base <- strptime(st, "%Y-%m-%d %H")
df$rounded_start <- st.base + (roundedmins * 60)
#Create REPLICATED Rows, depending on the size of the interval
replicated_cat = NULL
replicated_count = NULL
replicated_start = NULL
for (n in 1:length(numdups)){
  for (newrow in 1:numdups[n]){
    replicated_start = c(replicated_start, df$rounded_start[n]+(newrow-1)*300 )
    replicated_cat = c(replicated_cat, df$catid[n])
    replicated_count = c(replicated_count, df$count[n])
  }
}
#Change to readable format
POSIXT <- unix2POSIXct(replicated_start)
newdf <- data.frame(POSIXT, replicated_cat, replicated_count)
names(newdf) <- c("start", "CatId", "Count")
Which produces:
start CatId Count
1 2012-11-17 15:05:00 1 2
2 2012-11-17 15:10:00 1 2
3 2012-11-17 15:05:00 2 10
4 2012-11-17 15:10:00 2 10
5 2012-11-17 15:15:00 2 10
6 2012-11-17 15:10:00 3 5
7 2012-11-17 15:10:00 4 1
8 2012-11-17 15:15:00 4 1
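The nested loops grow their vectors one element at a time, which gets slow on several months of logs. A vectorized sketch of the same bucketing idea follows, under two assumptions that are not stated in the answer: the end time is treated as exclusive (so an event ending exactly at 15:20:00 does not open the 15:20 bucket, matching the expected output), and every event ends after it starts. The names x, slot and b are illustrative:

```r
x <- data.frame(
  start = as.POSIXct(c("2012-11-17 15:05:02", "2012-11-17 15:07:13",
                       "2012-11-17 15:11:00", "2012-11-17 15:12:01"), tz = "UTC"),
  end   = as.POSIXct(c("2012-11-17 15:12:52", "2012-11-17 15:17:47",
                       "2012-11-17 15:12:33", "2012-11-17 15:20:00"), tz = "UTC"),
  catid = 1:4,
  count = c(2, 10, 5, 1)
)

slot <- 5 * 60  # bucket size in seconds

# first and last bucket touched by each event (end treated as exclusive)
first <- floor(as.numeric(x$start) / slot) * slot
last  <- ceiling(as.numeric(x$end) / slot) * slot - slot
n     <- (last - first) / slot + 1  # buckets per event

# repeat each row n times and step through its buckets
idx    <- rep(seq_len(nrow(x)), times = n)
offset <- (sequence(n) - 1) * slot

b <- data.frame(
  start = as.POSIXct(first[idx] + offset, origin = "1970-01-01", tz = "UTC"),
  catid = x$catid[idx],
  count = x$count[idx]
)
```

From here, b can be aggregated over catid for whatever period is needed (hour, day, week, month).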
That's not an easy one ... I am also missing the structure of the whole problem, so I hope it is OK if I limit myself to outlining the basic approach. If things are unclear, you can come back to me.
First (if I were you) I would install the 'lubridate' package, which makes playing around with dates/times a lot easier.
Then maybe try something like this:
z <- strptime("17/11/12 15:05:00.0", "%d/%m/%y %H:%M:%OS")
This will define your starting point in time. If that is supposed to be defined by the first logs(x) time, then the minute command is available, e.g.
z <- strptime("17/11/12 15:05:02.0", "%d/%m/%y %H:%M:%OS")
minute(z)<-5;second(z)<-0.0 #I guess, you get the concept
Then produce a sequence of 5 minute intervals
This will produce a sequence of twenty 5-minute time intervals. Here again, I do not know how flexible the whole thing is supposed to be.
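A sequence like the one described can be sketched in base R as follows. This is only an illustration under assumed values: the name z5s is borrowed from the text further down, log_time is a hypothetical log entry, and UTC is used to keep the example independent of the local time zone:

```r
# starting point, already rounded down to a 5-minute boundary
z <- as.POSIXct("2012-11-17 15:05:00", tz = "UTC")

# twenty interval start times, 5 minutes apart
z5s <- seq(z, by = "5 min", length.out = 20)

# hypothetical log time to be matched against the intervals
log_time <- as.POSIXct("2012-11-17 15:12:52", tz = "UTC")

# index of the interval whose start is the latest one not after log_time
slot <- findInterval(as.numeric(log_time), as.numeric(z5s))
bucket_start <- z5s[slot]
```

findInterval does the matching in one vectorized call, so a whole column of log times can be passed in place of the single log_time.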
Finally you could then play around with for instance modulo operations
z2 should be the end time, I just added 2 minutes "manually" here to illustrate the concept
(as.integer(z2-z))%%5 > 5
or, if you want to see how many full 5-minute spans are covered, use integer division instead: (as.integer(z2-z))%/%5
or whatever other functions you prefer to match/distribute the log times across the z5s POSIXlt intervals.
Hope this helps a bit i.e. gives you some direction.