Hi all, this should be a straightforward question, but I just can't seem to figure it out. I would like to break this data set up biweekly in order to look at the annual cycle in two-week intervals. I do not want to summarize or aggregate the data; I would like to do exactly what the 'week' function does, but every two weeks instead. Below is an example of the data and code. Any help would be greatly appreciated!
DF<-dput(head(indiv))
structure(list(event.id = 1142811808:1142811813, timestamp = structure(c(1323154800,
1323200450, 1323202141, 1323203545, 1323208151, 1323209966), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), argos.altitude = c(43, 43, 39, 43,
44, 42), argos.best.level = c(0, -136, -128, -136, -126, -137
), argos.calcul.freq = c(0, 676813.1, 676802.4, 676813.1, 676810,
676811.8), argos.lat1 = c(43.857, 43.916, 43.87, 43.89, 43.891,
43.89), argos.lat2 = c(43.857, 35.141, 49.688, 35.254, 40.546,
54.928), argos.lc = structure(c(7L, 6L, 2L, 3L, 4L, 3L), .Label = c("0",
"1", "2", "3", "A", "B", "G", "Z"), class = "factor"), argos.lon1 = c(-77.244,
-77.326, -77.223, -77.21, -77.208, -77.21), argos.lon2 = c(-77.244,
-121.452, -46.86, -118.496, -94.12, -16.159), argos.nb.mes.identical = c(0L,
2L, 6L, 4L, 5L, 6L), argos.nopc = c(0L, 1L, 2L, 3L, 4L, 4L),
argos.sensor.1 = c(0L, 149L, 194L, 1L, 193L, 193L), argos.sensor.2 = c(0L,
220L, 216L, 1L, 216L, 212L), argos.sensor.3 = c(0L, 1L, 1L,
0L, 3L, 1L), argos.sensor.4 = c(0L, 1L, 5L, 1L, 5L, 5L),
tag.local.identifier = c(112571L, 112571L, 112571L, 112571L,
112571L, 112571L), utm.easting = c(319655.836066914, 313250.096346666,
321382.422921619, 322486.41178559, 322650.029658403, 322486.41178559
), utm.northing = c(4858437.89950188, 4865173.18448801, 4859836.18321128,
4862029.54057323, 4862136.31345349, 4862029.54057323), utm.zone = structure(c(7L,
7L, 7L, 7L, 7L, 7L), .Label = c("12N", "13N", "14N", "15N",
"16N", "17N", "18N", "19N", "20N", "22N", "39N"), class = "factor"),
study.timezone = structure(c(2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Eastern Daylight Time",
"Eastern Standard Time"), class = "factor"), study.local.timestamp = structure(c(1323154800,
1323200450, 1323202141, 1323203545, 1323208151, 1323209966
), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = 1120:1125, class = "data.frame")
weeknumber<-week(timestamps(DF))
I don't use lubridate, but here's a base R solution to subset your data fortnightly. We keep rows where the week number (taken as numeric) modulo 2 is non-zero and the year-week combination is not yet duplicated, all using strftime.
res <- DF[as.numeric(strftime(DF$timestamp, "%U")) %% 2 != 0 &
!duplicated(strftime(DF$timestamp, "%U %y")), ]
res
# timestamp x
# 1 2011-12-06 01:00:00 0.73178884
# 13 2011-12-18 01:00:00 -0.19310018
# 27 2012-01-01 01:00:00 1.13017531
# 41 2012-01-15 01:00:00 1.06546084
# 55 2012-01-29 01:00:00 -0.16664011
# 69 2012-02-12 01:00:00 -1.86596108
# 83 2012-02-26 01:00:00 0.59200189
# 97 2012-03-11 01:00:00 1.08327366
# 111 2012-03-25 01:00:00 -0.71291090
# 125 2012-04-08 02:00:00 0.51984052
# 139 2012-04-22 02:00:00 0.32738506
# 153 2012-05-06 02:00:00 2.50837829
# 167 2012-05-20 02:00:00 0.75116168
# 181 2012-06-03 02:00:00 -0.56359736
# 195 2012-06-17 02:00:00 0.60658448
# 209 2012-07-01 02:00:00 -0.07242813
# 223 2012-07-15 02:00:00 0.13811301
# 237 2012-07-29 02:00:00 0.19454153
# 251 2012-08-12 02:00:00 0.23119092
# 265 2012-08-26 02:00:00 -0.97278351
# 279 2012-09-09 02:00:00 -1.18143276
# 293 2012-09-23 02:00:00 -0.43294048
# 307 2012-10-07 02:00:00 0.05664472
# 321 2012-10-21 02:00:00 -0.90725782
# 335 2012-11-04 01:00:00 0.78939068
# 349 2012-11-18 01:00:00 -0.46047924
# 363 2012-12-02 01:00:00 1.45941339
Check by differencing.
## check
diff(res$timestamp)
# Time differences in days
# [1] 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14
# [21] 14 14 14 14 14
Data:
DF <- data.frame(timestamp=as.POSIXct(seq(as.Date("2011-12-06"), as.Date("2012-12-06"), "day")),
x=rnorm(367))
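If, rather than subsetting to one row per fortnight, you want every row labelled with its fortnight (closer to what week() gives you), here is a hedged base R sketch along the same lines; the fortnight column and the split() call are illustrative additions, not part of the original answer.
## Label every row with a fortnight index derived from the week number, then
## split the data into two-week chunks without aggregating anything.
DF$fortnight <- as.numeric(strftime(DF$timestamp, "%U")) %/% 2
chunks <- split(DF, list(strftime(DF$timestamp, "%Y"), DF$fortnight), drop = TRUE)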
As I had said in my comment to your previous (since deleted) question, use seq.Date and either cut or findInterval.
I'll create a vector of "every other Monday", starting on the first Monday of 2011 (January 3rd). This is arbitrary, but you will want to ensure that you choose (1) a day that is meaningful to you, (2) a start-point that is before your earliest data, and (3) a length.out= that extends beyond your latest data.
every_other_monday <- seq(as.Date("2011-01-03"), by = "14 days", length.out = 26)
every_other_monday
# [1] "2011-01-03" "2011-01-17" "2011-01-31" "2011-02-14" "2011-02-28" "2011-03-14" "2011-03-28" "2011-04-11" "2011-04-25"
# [10] "2011-05-09" "2011-05-23" "2011-06-06" "2011-06-20" "2011-07-04" "2011-07-18" "2011-08-01" "2011-08-15" "2011-08-29"
# [19] "2011-09-12" "2011-09-26" "2011-10-10" "2011-10-24" "2011-11-07" "2011-11-21" "2011-12-05" "2011-12-19"
every_other_monday[ findInterval(as.Date(DF$timestamp), every_other_monday) ]
# [1] "2011-12-05" "2011-12-05" "2011-12-05" "2011-12-05" "2011-12-05" "2011-12-05"
(The choice to start on Jan 3 was conditioned on the assumption that your real data spans a much larger length of time. You don't need a full year's worth of biweeks in every_other_monday, nor does it need to be a Monday, it can be whatever base-date you choose. So long as it includes at least one date before and after the actual DF dates, you should be covered.)
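For completeness, here is a hedged sketch of the cut() route mentioned at the top of this answer; it reuses the same every_other_monday breaks, and the biweek column name is just an illustration.
## cut() labels each row with the left endpoint (start date) of the biweek it
## falls into; rows on or beyond the last break would come back as NA.
DF$biweek <- cut(as.Date(DF$timestamp), breaks = every_other_monday)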
Alternative: round to the week level, then shift back by one week those weeks whose julian day is odd modulo 2, so that consecutive weeks pair up into biweeks. (The reason I chose the julian-day modulus is to reduce the chance that the grouping could shift based on slight changes in data range.)
weeks <- lubridate::floor_date(as.Date(DF$timestamp), unit = "weeks")
weeks
# [1] "2011-12-04" "2011-12-04" "2011-12-04" "2011-12-04" "2011-12-04" "2011-12-04"
isodd <- as.POSIXlt(weeks)$yday %% 2 == 1
weeks[isodd] <- weeks[isodd] - 7L
weeks # technically, now "biweeks"
# [1] "2011-11-27" "2011-11-27" "2011-11-27" "2011-11-27" "2011-11-27" "2011-11-27"
See the example below. The function uses which.max and sapply to round the date variable to the nearest Sunday within two-week intervals.
library(lubridate)
## Create Data Frame
DF <- data.frame(timestamp=as.POSIXct(seq(as.Date("2011-12-06"), as.Date("2012-12-06"), "day")))
## Create two week intervals (change the start date if you don't want to start on Sundays)
every_other_sunday <- seq(as.Date("2011-12-18"), by = "14 days", length.out = 27)
## Make the date variable
DF$date <- as.Date(DF$timestamp)
## Function to find the closest Sunday from the intervals created above
find_closest_sunday <- function(index){
which.max(abs(every_other_sunday - DF$date[index] - 7) <= min(abs(every_other_sunday - DF$date[index] - 7)))
}
## Add the new variable to your dataset
DF$every_two_weeks <- every_other_sunday[sapply(seq_along(DF$date), function(i) find_closest_sunday(i))]
## Check that the function worked correctly
DF[,c("date", "every_two_weeks")]
## If you want the week number instead of a date, wrap the every_two_weeks variable in the week() function
week(DF$every_two_weeks)
I've got this sample dataframe, that keeps track of the time when a lamp is switched on and off.
time lamp status
1 2015-01-01 12:18:17 2 ON
2 2015-01-01 13:07:29 28 ON
3 2015-01-01 13:11:50 28 OFF
4 2015-01-01 13:18:28 2 OFF
5 2015-01-01 14:07:29 28 ON
6 2015-01-01 14:11:35 28 OFF
7 2015-01-01 14:18:28 2 ON
8 2015-01-01 14:18:57 2 OFF
What I want to achieve is to add a fourth column, containing the duration of a period where a lamp has been switched on (in seconds).
The desired output:
time lamp status duration
1 2015-01-01 12:18:17 2 ON 3611
2 2015-01-01 13:07:29 28 ON 261
3 2015-01-01 13:11:50 28 OFF NA
4 2015-01-01 13:18:28 2 OFF NA
5 2015-01-01 14:07:29 28 ON 246
6 2015-01-01 14:11:35 28 OFF NA
7 2015-01-01 14:18:28 2 ON 29
8 2015-01-01 14:18:57 2 OFF NA
I already succeeded in doing this with a custom function, involving while and for-loops. BUT...
I'm a beginner in R, and I'm pretty sure this can be done more simply and elegantly (using subsets, apply, and/or ....). I just can't figure out how.
Any ideas, or leads in the right direction?
This works for me:
library(dplyr)
df <- df %>% mutate(sec=as.numeric(time)) %>% group_by(lamp) %>% mutate(duration=c(diff(sec), NA)) %>% select(-sec)
df$duration[df$status=="OFF"] <- NA
#### 1 2015-01-01 12:18:17 2 ON 3611
#### 2 2015-01-01 13:07:29 28 ON 261
#### 3 2015-01-01 13:11:50 28 OFF NA
Your data:
df=structure(list(time = structure(c(1420111097, 1420114049, 1420114310,
1420114708, 1420117649, 1420117895, 1420118308, 1420118337), class = c("POSIXct",
"POSIXt"), tzone = ""), lamp = c(2L, 28L, 28L, 2L, 28L, 28L,
2L, 2L), status = structure(c(2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L), .Label = c("OFF",
"ON"), class = "factor"), duration = c(2952, 261, NA, NA, 246,
NA, 29, NA)), .Names = c("time", "lamp", "status", "duration"
), row.names = c(NA, -8L), class = "data.frame")
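For comparison, here is a hedged base R sketch of the same per-lamp logic, starting from the original data; it assumes the rows are ordered by time and that ON/OFF strictly alternate within each lamp, as in the sample.
## Difference the timestamps within each lamp, then blank out the OFF rows.
df$duration <- ave(as.numeric(df$time), df$lamp,
                   FUN = function(x) c(diff(x), NA))
df$duration[df$status == "OFF"] <- NA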
I love R but some problems are just plain hard.
The challenge is to find the first instance of a rolling sum that is less than 30 in an irregular time series having a time-based window greater than or equal to 6 hours. I have a sample of the series
Row Person DateTime Value
1 A 2014-01-01 08:15:00 5
2 A 2014-01-01 09:15:00 5
3 A 2014-01-01 10:00:00 5
4 A 2014-01-01 11:15:00 5
5 A 2014-01-01 14:15:00 5
6 B 2014-01-01 08:15:00 25
7 B 2014-01-01 10:15:00 25
8 B 2014-01-01 19:15:00 2
9 C 2014-01-01 08:00:00 20
10 C 2014-01-01 09:00:00 5
11 C 2014-01-01 13:45:00 1
12 D 2014-01-01 07:00:00 1
13 D 2014-01-01 08:15:00 13
14 D 2014-01-01 14:15:00 15
For Person A, Rows 1 & 5 create a minimum 6 hour interval with a running sum of 25 (which is less than 30).
For Person B, Rows 7 & 8 create a 9 hour interval with a running sum of 27 (again less than 30).
For Person C, using Rows 9 & 10, there is no minimum 6 hour interval (it is only 5.75 hours) although the running sum is 26 and is less than 30.
For Person D, using Rows 12 & 14, the interval is 7.25 hours but the running sum is 30 and is not less than 30.
Given n observations, there are n*(n-1)/2 intervals that must be compared. For example, with n=2 there is just 1 interval to evaluate. For n=3 there are 3 intervals. And so on.
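(A quick check of that count in R, for example for Person A's five rows:)
## n*(n-1)/2 candidate intervals for n observations, e.g. n = 5 for Person A.
choose(5, 2)
# [1] 10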
I assume that this is a variation of the subset sum problem (http://en.wikipedia.org/wiki/Subset_sum_problem).
While the data can be sorted I suspect this requires a brute force solution testing each interval.
Any help would be appreciated.
Edit: here's the data with DateTime column formatted as POSIXct:
df <- structure(list(Person = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 3L, 3L, 3L, 4L, 4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"),
DateTime = structure(c(1388560500, 1388564100, 1388566800,
1388571300, 1388582100, 1388560500, 1388567700, 1388600100,
1388559600, 1388563200, 1388580300, 1388556000, 1388560500,
1388582100), class = c("POSIXct", "POSIXt"), tzone = ""),
Value = c(5L, 5L, 5L, 5L, 5L, 25L, 25L, 2L, 20L, 5L, 1L,
1L, 13L, 15L)), .Names = c("Person", "DateTime", "Value"), row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14"), class = "data.frame")
I have found this to be a difficult problem in R as well. So I made a package for it!
library("devtools")
install_github("boRingTrees","mgahan")
require(boRingTrees)
Of course, you will have to figure out your units correctly for the upper bound.
Here is some more documentation if you are interested.
https://github.com/mgahan/boRingTrees
For the data df that #beginneR provided, you could use the following code to get a 6 hour rolling sum.
require(data.table)
setDT(df)
df[ , roll := rollingByCalcs(df,dates="DateTime",target="Value",
by="Person",stat=sum,lower=0,upper=6*60*60)]
Person DateTime Value roll
1: A 2014-01-01 01:15:00 5 5
2: A 2014-01-01 02:15:00 5 10
3: A 2014-01-01 03:00:00 5 15
4: A 2014-01-01 04:15:00 5 20
5: A 2014-01-01 07:15:00 5 25
6: B 2014-01-01 01:15:00 25 25
7: B 2014-01-01 03:15:00 25 50
8: B 2014-01-01 12:15:00 2 2
9: C 2014-01-01 01:00:00 20 20
10: C 2014-01-01 02:00:00 5 25
11: C 2014-01-01 06:45:00 1 26
12: D 2014-01-01 00:00:00 1 1
13: D 2014-01-01 01:15:00 13 14
14: D 2014-01-01 07:15:00 15 28
The original post is pretty unclear to me, so this might not be exactly what he wanted. If a column with the desired output was presented, I imagine I could be of more help.
We assume that an interval is defined by two rows for the same person. For each person, we want the first such interval (time-wise) of at least 6 hours for which the sum of Value of those two rows and any intermediate rows is less than 30. If there is more than one such first interval for a person, pick one arbitrarily.
This can be represented by a triple join in SQL. The inner select picks out all rows consisting of the start of interval (a.DateTime), the end of interval (b.DateTime) and rows between them (c.DateTime) grouping by Person and interval and summing over the Value provided it spans at least 6 hours. The outer select then keeps only those rows whose total is < 30 and for each Person keeps only the one whose DateTime is least. If there is more than one first row (time-wise) for a Person it picks one arbitrarily.
library(sqldf)
sqldf(
"select Person, min(Datetime) DateTime, hours, total
from (select a.Person,
a.DateTime,
(b.Datetime - a.DateTime)/3600 hours,
sum(c.Value) total
from DF a join DF b join DF c
on a.Person = b.Person and a.Person = c.Person and hours >= 6
and c.DateTime between a.DateTime and b.DateTime
group by a.Person, a.DateTime, b.DateTime)
where total < 30
group by Person"
)
giving:
Person DateTime hours total
1 A 2014-01-01 08:15:00 6.00 25
2 B 2014-01-01 10:15:00 9.00 27
3 D 2014-01-01 07:00:00 7.25 29
Note: We used this data:
DF <- data.frame( Row = 1:14,
Person = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"),
DateTime = structure(c(1388582100, 1388585700, 1388588400, 1388592900,
1388603700, 1388582100, 1388589300, 1388621700, 1388581200,
1388584800, 1388601900, 1388577600, 1388582100, 1388603700),
class = c("POSIXct", "POSIXt"), tzone = ""),
Value = c(5L, 5L, 5L, 5L, 5L, 25L, 25L, 2L, 20L, 5L, 1L, 1L, 13L, 15L) )
As of version 1.9.8 (on CRAN 25 Nov 2016), the data.table package has gained the ability to aggregate in a non-equi join.
library(data.table)
tmp <- setDT(df)[, CJ(start = DateTime, end = DateTime)[
, hours := difftime(end, start, units = "hours")][hours >= 6], by = Person]
df[tmp, on = .(Person, DateTime >= start, DateTime <= end),
.(hours, total = sum(Value)), by = .EACHI][
total < 30, .SD[1L], by = Person]
Person DateTime hours total
1: A 2014-01-01 08:15:00 6.00 hours 25
2: B 2014-01-01 10:15:00 9.00 hours 27
3: D 2014-01-01 07:00:00 7.25 hours 29
tmp contains all possible intervals of 6 or more hours for each person. It is created through a cross join CJ() and subsequent filtering:
tmp
Person start end hours
1: A 2014-01-01 08:15:00 2014-01-01 14:15:00 6.00 hours
2: B 2014-01-01 08:15:00 2014-01-01 19:15:00 11.00 hours
3: B 2014-01-01 10:15:00 2014-01-01 19:15:00 9.00 hours
4: D 2014-01-01 07:00:00 2014-01-01 14:15:00 7.25 hours
5: D 2014-01-01 08:15:00 2014-01-01 14:15:00 6.00 hours
These intervals are used to aggregate over in the non-equi join. Finally, the result is filtered for a total value of less than 30 and the first occurrence for each person is picked.
I have a data frame that currently contains two ‘time’ columns in HH:MM:SS format. I would like to condense this data frame so that I only have one row for each unique ‘id’ value. I would like to keep the row for each unique ‘id’ value which has a ‘time1’ value that is the nearest match to the ‘time2’ value. However, 'time1' needs to be greater than ‘time2’.
Here is a simple example:
> dput(df)
structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L), count = c(23L, 23L, 23L, 23L, 45L, 45L,
45L, 45L, 67L, 67L, 67L, 67L, 88L, 88L, 88L, 88L), time1 = structure(c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L), .Label = c("00:13:00",
"01:13:00", "07:18:00", "18:14:00"), class = "factor"), time2 = structure(c(4L,
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L), .Label = c("00:00:00",
"06:00:00", "12:00:00", "18:00:00"), class = "factor"), afn = c(3.36,
0.63, 1.77, 3.89, 3.36, 0.63, 1.77, 3.89, 3.36, 0.63, 1.77, 3.89,
3.36, 0.63, 1.77, 3.89), dfn = c(201.67, 157.27, 103.55, 191.41,
201.67, 157.27, 103.55, 191.41, 201.67, 157.27, 103.55, 191.41,
201.67, 157.27, 103.55, 191.41)), .Names = c("id", "count", "time1",
"time2", "afn", "dfn"), class = "data.frame", row.names = c(NA,
-16L))
> df
id count time1 time2 afn dfn
1 1 23 00:13:00 18:00:00 3.36 201.67
2 1 23 00:13:00 00:00:00 0.63 157.27
3 1 23 00:13:00 06:00:00 1.77 103.55
4 1 23 00:13:00 12:00:00 3.89 191.41
5 2 45 01:13:00 18:00:00 3.36 201.67
6 2 45 01:13:00 00:00:00 0.63 157.27
7 2 45 01:13:00 06:00:00 1.77 103.55
8 2 45 01:13:00 12:00:00 3.89 191.41
9 3 67 18:14:00 18:00:00 3.36 201.67
10 3 67 18:14:00 00:00:00 0.63 157.27
11 3 67 18:14:00 06:00:00 1.77 103.55
12 3 67 18:14:00 12:00:00 3.89 191.41
13 4 88 07:18:00 18:00:00 3.36 201.67
14 4 88 07:18:00 00:00:00 0.63 157.27
15 4 88 07:18:00 06:00:00 1.77 103.55
16 4 88 07:18:00 12:00:00 3.89 191.41
I would like to end up with this data frame in the above case:
id count time1 time2 afn dfn
1 23 00:13:00 00:00:00 0.63 157.27
2 45 01:13:00 00:00:00 0.63 157.27
3 67 18:14:00 18:00:00 3.36 201.67
4 88 07:18:00 06:00:00 1.77 103.55
I have used the ddply() function to condense data frames in the past, but not with an incorporated matching rule. I have to apply this to a data frame with lots of columns (many more than in the simple example given here), so any suggestions about how to do this would be brilliant. Any help would be greatly appreciated. Many thanks!
Here are a few solutions.
1) ave. This uses chron times as well as subset and ave from base R:
library(chron)
delta <- as.vector(times(df$time1) - times(df$time2))
df2 <- subset(df, delta > 0)
df2[ave(delta, df2$id, FUN = function(delta) delta == min(delta)) == 1, ]
2) dplyr. This uses chron times and the dplyr package:
library(chron)
library(dplyr)
df %.%
mutate(delta = as.vector(times(time1) - times(time2))) %.%
filter(delta > 0) %.%
group_by(id) %.%
filter(delta == min(delta)) %.%
select(- delta)
3) sqldf
library(sqldf)
sqldf("select *, min(strftime('%s', time1) - strftime('%s', time2)) delta
from (select * from df where strftime('%s', time1) > strftime('%s', time2))
group by id")[seq_along(df)]
or perhaps this variation where we calculate delta in R and then use sqldf:
library(sqldf)
library(chron)
df2 = transform(df, delta = as.vector(times(time1) - times(time2)))
sqldf("select *, min(delta) delta
from (select * from df2 where delta > 0)
group by id")[-ncol(df2)]
4) data.table
library(data.table)
library(chron)
DT <- data.table(df)
DT[, delta := times(time1) - times(time2)
][delta > 0
][, .SD[delta == min(delta)], by = id
][, seq_along(df), with = FALSE]
ADDED additional solutions. Corrected library and subset statements. Minor improvements.
Here's an approach with the powerful dplyr package:
library(dplyr)
(df %.%
mutate(timeDiff = as.integer(strptime(time1, "%X") - strptime(time2, "%X")),
posDiff = timeDiff >= 0) %.%
filter(posDiff) %.%
group_by(id) %.%
filter(min(timeDiff) == timeDiff))[names(df)]
# id count time1 time2 afn dfn
# 1 1 23 00:13:00 00:00:00 0.63 157.27
# 2 2 45 01:13:00 00:00:00 0.63 157.27
# 3 3 67 18:14:00 18:00:00 3.36 201.67
# 4 4 88 07:18:00 06:00:00 1.77 103.55
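A hedged side note: the %.% chaining operator above comes from an early dplyr release; with current dplyr the same pipeline would normally be written with %>% (or the native |>), roughly as in this sketch (slice_min() and all_of() assume a recent dplyr/tidyselect).
## Same logic with the modern pipe: keep non-negative differences, then the
## smallest difference per id, and finally the original columns only.
library(dplyr)
df %>%
  mutate(timeDiff = as.integer(strptime(time1, "%X") - strptime(time2, "%X"))) %>%
  filter(timeDiff >= 0) %>%
  group_by(id) %>%
  slice_min(timeDiff, n = 1) %>%
  ungroup() %>%
  select(all_of(names(df)))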
An approach using ddply and merge. (Assuming that the "nearest match times" are the minimum absolute values of the difftimes)
t1 <- strptime(df$time1, "%H:%M:%S")
t2 <- strptime(df$time2, "%H:%M:%S")
df$min.diff <- abs(as.numeric(difftime(t1, t2, units='mins')))
d1 <- ddply(df, .(id), summarize, min.diff = min(min.diff))
> merge(df, d1, by = c("id", "min.diff"))
id min.diff count time1 time2 afn dfn
1 1 13 23 00:13:00 00:00:00 0.63 157.27
2 2 73 45 01:13:00 00:00:00 0.63 157.27
3 3 14 67 18:14:00 18:00:00 3.36 201.67
4 4 78 88 07:18:00 06:00:00 1.77 103.55
I have a dataset like this:
Year MM DD HH
158 2010 7 1 5
159 2010 7 1 5
160 2010 7 1 6
161 2010 7 1 6
structure(list(Year = c(2010L, 2010L, 2010L, 2010L), MM = c(7L,
7L, 7L, 7L), DD = c(1L, 1L, 1L, 1L), HH = c(5L, 5L, 6L, 6L)), .Names = c("Year",
"MM", "DD", "HH"), row.names = 158:161, class = "data.frame")
How can I create one datetime object (a new column) from this data set?
There are a few options, here's one (where x is your data.frame):
x$datetime <- ISOdatetime(x$Year, x$MM, x$DD, x$HH, 0, 0)
You can pass in the correct time zone if need be, see ?ISOdatetime.
You can now do this in lubridate using make_date or make_datetime:
From the CRAN documentation:
make_datetime(year = 1970L, month = 1L, day = 1L, hour = 0L, min = 0L,
sec = 0, tz = "UTC")
make_date(year = 1970L, month = 1L, day = 1L)
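Applied to the example data frame x from the question, a hedged sketch (minutes and seconds default to 0; the tz argument here is an assumption, adjust as needed):
## Build the datetime column from the Year/MM/DD/HH columns.
library(lubridate)
x$datetime <- make_datetime(year = x$Year, month = x$MM, day = x$DD,
                            hour = x$HH, tz = "UTC")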
Assuming you have your data in a data frame x:
transform(x,datetime = as.POSIXct(paste(paste(Year,MM,DD,sep="-"), paste(HH,"00",sep=":"))))
Year MM DD HH datetime
158 2010 7 1 5 2010-07-01 05:00:00
159 2010 7 1 5 2010-07-01 05:00:00
160 2010 7 1 6 2010-07-01 06:00:00
161 2010 7 1 6 2010-07-01 06:00:00
I am working on GPS data right now; the position of the animal has been collected every 4 hours where possible. The data look like this (the XY data are not shown here):
ID TIME POSIXTIME date_only
1 1 12:00 2005-05-08 12:00:00 2005-05-08
2 2 16:01 2005-05-08 16:01:00 2005-05-08
3 3 20:01 2005-05-08 20:01:00 2005-05-08
4 4 0:01 2005-05-09 00:01:00 2005-05-09
5 5 8:01 2005-05-09 08:01:00 2005-05-09
6 6 12:01 2005-05-09 12:01:00 2005-05-09
7 7 16:02 2005-05-09 16:02:00 2005-05-09
8 8 20:02 2005-05-09 20:02:00 2005-05-09
9 9 0:01 2005-05-10 00:01:00 2005-05-10
10 10 4:00 2005-05-10 04:00:00 2005-05-10
I would now like to take only the first location per day. In most cases, this will be at 0:01. However, sometimes it will be 4:01 or even later, as there is missing data.
How can I get only the first locations per day? They should be included in a new dataframe. I tried it with :
tapply(as.numeric(Kandularaw$TIME),list(Kandularaw$date_only),min, na.rm=T)
However, this did not work, as R produces strange values when TIME is treated as numeric.
Is it possible to do it with an ifelse statement? If so, roughly how would it look?
I am grateful for any help I can get. Thank you for your efforts.
Cheers,
Jan
I am guessing you really want a row number as an index into a position record. If you know that these rows are ordered by date-time, and you are getting satisfactory group splits with that second argument to tapply (however it was created), then try this:
idx <- tapply(1:NROW(Kandularaw), Kandularaw$date_only, "[", 1)
If you want records (rows) in that same dataframe then just use:
Kandularaw[ idx, ]
I would approach this from a simpler point of view. First, ensure that POSIXTIME is one of the "POSIX" classes. Then order the data by POSIXTIME. At this point we can use any of the split-apply-combine idioms to do what you want, making use of the head() function. Here I use aggregate():
Using this example data set:
dat <- structure(list(ID = 1:10, TIME = structure(c(4L, 6L, 8L, 1L,
3L, 5L, 7L, 9L, 1L, 2L), .Label = c("00:01:00", "04:00:00", "08:01:00",
"12:00:00", "12:01:00", "16:01:00", "16:02:00", "20:01:00", "20:02:00"
), class = "factor"), POSIXTIME = structure(1:10, .Label = c("2005/05/08 12:00:00",
"2005/05/08 16:01:00", "2005/05/08 20:01:00", "2005/05/09 00:01:00",
"2005/05/09 08:01:00", "2005/05/09 12:01:00", "2005/05/09 16:02:00",
"2005/05/09 20:02:00", "2005/05/10 00:01:00", "2005/05/10 04:00:00"
), class = "factor"), date_only = structure(c(1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 3L, 3L), .Label = c("2005/05/08", "2005/05/09",
"2005/05/10"), class = "factor")), .Names = c("ID", "TIME", "POSIXTIME",
"date_only"), class = "data.frame", row.names = c(NA, 10L))
First, get POSIXTIME and date_only in the correct formats:
dat <- transform(dat,
POSIXTIME = as.POSIXct(POSIXTIME, format = "%Y/%m/%d %H:%M:%S"),
date_only = as.Date(date_only, format = "%Y/%m/%d"))
Next, order by POSIXTIME:
dato <- with(dat, dat[order(POSIXTIME), ])
The final step is to use aggregate() to split the data by date_only and use head() to select the first row:
aggregate(dato[,1:3], by = list(date = dato$`date_only`), FUN = head, n = 1)
Notice I pass the value 1 to the n argument of head(), indicating that it should extract only the first row of each day's observations. Because we sorted by datetime and split on date, the first row should be the first observation per day. Do be aware of rounding issues, however.
The final step results in:
> aggregate(dato[,1:3], by = list(date = dato$`date_only`), FUN = head, n = 1)
date ID TIME POSIXTIME
1 2005-05-08 1 12:00:00 2005-05-08 12:00:00
2 2005-05-09 4 00:01:00 2005-05-09 00:01:00
3 2005-05-10 9 00:01:00 2005-05-10 00:01:00
Instead of dato[,1:3], refer to whatever columns in your original data set contain the variables (locations?) you want.
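As a hedged aside, once dato is ordered by POSIXTIME, a duplicated()-based one-liner picks the same first row per day without the aggregate() call:
## Keeps the first row encountered for each calendar date (assumes dato is
## already sorted by POSIXTIME, as above).
dato[!duplicated(dato$date_only), ]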