Selecting specific rows in R

I am working with GPS data: the animal's position was recorded, where possible, every 4 hours. The data look like this (the XY coordinates are omitted here):
   ID  TIME           POSIXTIME  date_only
1   1 12:00 2005-05-08 12:00:00 2005-05-08
2   2 16:01 2005-05-08 16:01:00 2005-05-08
3   3 20:01 2005-05-08 20:01:00 2005-05-08
4   4  0:01 2005-05-09 00:01:00 2005-05-09
5   5  8:01 2005-05-09 08:01:00 2005-05-09
6   6 12:01 2005-05-09 12:01:00 2005-05-09
7   7 16:02 2005-05-09 16:02:00 2005-05-09
8   8 20:02 2005-05-09 20:02:00 2005-05-09
9   9  0:01 2005-05-10 00:01:00 2005-05-10
10 10  4:00 2005-05-10 04:00:00 2005-05-10
I would now like to take only the first location per day. In most cases this will be at 0:01. However, sometimes it will be 4:01 or even later, where data are missing.
How can I get only the first location per day? The results should go into a new data frame. I tried:
tapply(as.numeric(Kandularaw$TIME), list(Kandularaw$date_only), min, na.rm = TRUE)
However, this did not work: R produces strange values when TIME is coerced to numeric.
Is it possible to do it with an ifelse statement? If so, what would it look like, roughly?
I am grateful for any help I can get. Thank you for your efforts.
Cheers,
Jan

I am guessing you really want a row number as an index into a position record. If you know that these rows are ordered by date-time, and you are getting satisfactory group splits with that second argument to tapply (however it was created), then try this:
idx <- tapply(1:NROW(Kandularaw), Kandularaw$date_only, "[", 1)
If you want records (rows) in that same dataframe then just use:
Kandularaw[ idx, ]
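If the rows are already in time order, the same first-row-per-day pick can be written without tapply at all; a minimal sketch (duplicated() marks repeats, so negating it keeps each day's first row):
Kandularaw[ !duplicated(Kandularaw$date_only), ]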

I would approach this from a simpler point of view. First, ensure that POSIXTIME is one of the "POSIX" classes. Then order the data by POSIXTIME. At this point we can use any of the split-apply-combine idioms to do what you want, making use of the head() function. Here I use aggregate():
Using this example data set:
dat <- structure(list(ID = 1:10, TIME = structure(c(4L, 6L, 8L, 1L,
3L, 5L, 7L, 9L, 1L, 2L), .Label = c("00:01:00", "04:00:00", "08:01:00",
"12:00:00", "12:01:00", "16:01:00", "16:02:00", "20:01:00", "20:02:00"
), class = "factor"), POSIXTIME = structure(1:10, .Label = c("2005/05/08 12:00:00",
"2005/05/08 16:01:00", "2005/05/08 20:01:00", "2005/05/09 00:01:00",
"2005/05/09 08:01:00", "2005/05/09 12:01:00", "2005/05/09 16:02:00",
"2005/05/09 20:02:00", "2005/05/10 00:01:00", "2005/05/10 04:00:00"
), class = "factor"), date_only = structure(c(1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 3L, 3L), .Label = c("2005/05/08", "2005/05/09",
"2005/05/10"), class = "factor")), .Names = c("ID", "TIME", "POSIXTIME",
"date_only"), class = "data.frame", row.names = c(NA, 10L))
First, get POSIXTIME and date_only in the correct formats:
dat <- transform(dat,
POSIXTIME = as.POSIXct(POSIXTIME, format = "%Y/%m/%d %H:%M:%S"),
date_only = as.Date(date_only, format = "%Y/%m/%d"))
Next, order by POSIXTIME:
dato <- with(dat, dat[order(POSIXTIME), ])
The final step is to use aggregate() to split the data by date_only and use head() to select the first row:
aggregate(dato[, 1:3], by = list(date = dato$date_only), FUN = head, n = 1)
Notice I pass the n argument of head() the value 1, indicating that it should extract only the first row of each day's observations. Because we sorted by datetime and split on date, the first row is the first observation per day. Do be aware of potential rounding issues, however.
The final step results in:
> aggregate(dato[, 1:3], by = list(date = dato$date_only), FUN = head, n = 1)
date ID TIME POSIXTIME
1 2005-05-08 1 12:00:00 2005-05-08 12:00:00
2 2005-05-09 4 00:01:00 2005-05-09 00:01:00
3 2005-05-10 9 00:01:00 2005-05-10 00:01:00
Instead of dato[, 1:3], refer to whatever columns of your original data set contain the variables (locations?) you want.
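The same split-apply-combine can also be phrased with dplyr if you prefer; a sketch assuming the dat from above:
library(dplyr)
dat %>%
  arrange(POSIXTIME) %>%
  group_by(date_only) %>%
  slice(1) %>%
  ungroup()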

Related

make 'week' function biweek

Hi all, this should be a straightforward question; I just can't seem to figure it out. I would like to break this data set up biweekly in order to look at the annual cycle in two-week intervals. I do not want to summarize or aggregate the data; I would like to do exactly what the 'week' function does, but every two weeks instead. Below is an example of the data and code. Any help would be greatly appreciated!
DF<-dput(head(indiv))
structure(list(event.id = 1142811808:1142811813, timestamp = structure(c(1323154800,
1323200450, 1323202141, 1323203545, 1323208151, 1323209966), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), argos.altitude = c(43, 43, 39, 43,
44, 42), argos.best.level = c(0, -136, -128, -136, -126, -137
), argos.calcul.freq = c(0, 676813.1, 676802.4, 676813.1, 676810,
676811.8), argos.lat1 = c(43.857, 43.916, 43.87, 43.89, 43.891,
43.89), argos.lat2 = c(43.857, 35.141, 49.688, 35.254, 40.546,
54.928), argos.lc = structure(c(7L, 6L, 2L, 3L, 4L, 3L), .Label = c("0",
"1", "2", "3", "A", "B", "G", "Z"), class = "factor"), argos.lon1 = c(-77.244,
-77.326, -77.223, -77.21, -77.208, -77.21), argos.lon2 = c(-77.244,
-121.452, -46.86, -118.496, -94.12, -16.159), argos.nb.mes.identical = c(0L,
2L, 6L, 4L, 5L, 6L), argos.nopc = c(0L, 1L, 2L, 3L, 4L, 4L),
argos.sensor.1 = c(0L, 149L, 194L, 1L, 193L, 193L), argos.sensor.2 = c(0L,
220L, 216L, 1L, 216L, 212L), argos.sensor.3 = c(0L, 1L, 1L,
0L, 3L, 1L), argos.sensor.4 = c(0L, 1L, 5L, 1L, 5L, 5L),
tag.local.identifier = c(112571L, 112571L, 112571L, 112571L,
112571L, 112571L), utm.easting = c(319655.836066914, 313250.096346666,
321382.422921619, 322486.41178559, 322650.029658403, 322486.41178559
), utm.northing = c(4858437.89950188, 4865173.18448801, 4859836.18321128,
4862029.54057323, 4862136.31345349, 4862029.54057323), utm.zone = structure(c(7L,
7L, 7L, 7L, 7L, 7L), .Label = c("12N", "13N", "14N", "15N",
"16N", "17N", "18N", "19N", "20N", "22N", "39N"), class = "factor"),
study.timezone = structure(c(2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Eastern Daylight Time",
"Eastern Standard Time"), class = "factor"), study.local.timestamp = structure(c(1323154800,
1323200450, 1323202141, 1323203545, 1323208151, 1323209966
), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = 1120:1125, class = "data.frame")
weeknumber <- week(timestamps(DF))
I don't use lubridate, but here's a base R solution to subset your data fortnightly: keep rows where the week number (as numeric) modulo 2 is non-zero and the year-week combination is not yet duplicated, all using strftime.
res <- DF[as.numeric(strftime(DF$timestamp, "%U")) %% 2 != 0 &
!duplicated(strftime(DF$timestamp, "%U %y")), ]
res
# timestamp x
# 1 2011-12-06 01:00:00 0.73178884
# 13 2011-12-18 01:00:00 -0.19310018
# 27 2012-01-01 01:00:00 1.13017531
# 41 2012-01-15 01:00:00 1.06546084
# 55 2012-01-29 01:00:00 -0.16664011
# 69 2012-02-12 01:00:00 -1.86596108
# 83 2012-02-26 01:00:00 0.59200189
# 97 2012-03-11 01:00:00 1.08327366
# 111 2012-03-25 01:00:00 -0.71291090
# 125 2012-04-08 02:00:00 0.51984052
# 139 2012-04-22 02:00:00 0.32738506
# 153 2012-05-06 02:00:00 2.50837829
# 167 2012-05-20 02:00:00 0.75116168
# 181 2012-06-03 02:00:00 -0.56359736
# 195 2012-06-17 02:00:00 0.60658448
# 209 2012-07-01 02:00:00 -0.07242813
# 223 2012-07-15 02:00:00 0.13811301
# 237 2012-07-29 02:00:00 0.19454153
# 251 2012-08-12 02:00:00 0.23119092
# 265 2012-08-26 02:00:00 -0.97278351
# 279 2012-09-09 02:00:00 -1.18143276
# 293 2012-09-23 02:00:00 -0.43294048
# 307 2012-10-07 02:00:00 0.05664472
# 321 2012-10-21 02:00:00 -0.90725782
# 335 2012-11-04 01:00:00 0.78939068
# 349 2012-11-18 01:00:00 -0.46047924
# 363 2012-12-02 01:00:00 1.45941339
Check by differencing.
## check
diff(res$timestamp)
# Time differences in days
# [1] 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14
# [21] 14 14 14 14 14
Data:
DF <- data.frame(timestamp=as.POSIXct(seq(as.Date("2011-12-06"), as.Date("2012-12-06"), "day")),
x=rnorm(367))
As I had said in my comment to your previous (since deleted) question, use seq.Date and either cut or findInterval.
I'll create a vector of "every other Monday", starting on January 3rd, 2011. This is arbitrary, but you will want to ensure that you choose (1) a day that is meaningful to you, (2) a start point that is before your earliest data, and (3) a length.out= that extends beyond your latest data.
every_other_monday <- seq(as.Date("2011-01-03"), by = "14 days", length.out = 26)
every_other_monday
# [1] "2011-01-03" "2011-01-17" "2011-01-31" "2011-02-14" "2011-02-28" "2011-03-14" "2011-03-28" "2011-04-11" "2011-04-25"
# [10] "2011-05-09" "2011-05-23" "2011-06-06" "2011-06-20" "2011-07-04" "2011-07-18" "2011-08-01" "2011-08-15" "2011-08-29"
# [19] "2011-09-12" "2011-09-26" "2011-10-10" "2011-10-24" "2011-11-07" "2011-11-21" "2011-12-05" "2011-12-19"
every_other_monday[ findInterval(as.Date(DF$timestamp), every_other_monday) ]
# [1] "2011-12-05" "2011-12-05" "2011-12-05" "2011-12-05" "2011-12-05" "2011-12-05"
(The choice to start on Jan 3 was conditioned on the assumption that your real data spans a much larger length of time. You don't need a full year's worth of biweeks in every_other_monday, nor does it need to be a Monday, it can be whatever base-date you choose. So long as it includes at least one date before and after the actual DF dates, you should be covered.)
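Since the goal is a label per row rather than a summary, the interval start can simply be attached as a new column; a small sketch on the simulated DF from the base R answer (per the caveat above, length.out is enlarged here so the grid spans all of DF):
grid <- seq(as.Date("2011-01-03"), by = "14 days", length.out = 60)
DF$biweek <- grid[ findInterval(as.Date(DF$timestamp), grid) ]
head(DF[, c("timestamp", "biweek")])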
Alternative: round to the week level, then shift weeks whose julian day is odd back by 7 days, so each date snaps to a biweek start. (I chose the julian-day modulus to reduce the chance that the grouping shifts with slight changes in the data range.)
weeks <- lubridate::floor_date(as.Date(DF$timestamp), unit = "weeks")
weeks
# [1] "2011-12-04" "2011-12-04" "2011-12-04" "2011-12-04" "2011-12-04" "2011-12-04"
isodd <- as.POSIXlt(weeks)$yday %% 2 == 1
weeks[isodd] <- weeks[isodd] - 7L
weeks # technically, now "biweeks"
# [1] "2011-11-27" "2011-11-27" "2011-11-27" "2011-11-27" "2011-11-27" "2011-11-27"
See example below. This function uses which.max and sapply to round the date variable to the nearest Sunday within two-week intervals.
library(lubridate)
## Create Data Frame
DF <- data.frame(timestamp=as.POSIXct(seq(as.Date("2011-12-06"), as.Date("2012-12-06"), "day")))
## Create two week intervals (change the start date if you don't want to start on Sundays)
every_other_sunday <- seq(as.Date("2011-12-18"), by = "14 days", length.out = 27)
## Make the date variable
DF$date <- as.Date(DF$timestamp)
## Function to find the closest Sunday from the intervals created above
find_closest_sunday <- function(index){
  ## distance from each candidate Sunday to this date, shifted by 7 days
  dist <- abs(every_other_sunday - DF$date[index] - 7)
  which.max(dist <= min(dist))  # first position attaining the minimum
}
## Add the new variable to your dataset
DF$every_two_weeks <- every_other_sunday[sapply(seq_along(DF$date), function(i) find_closest_sunday(i))]
## Check that the function worked correctly
DF[,c("date", "every_two_weeks")]
## If you want the week number instead of a date, wrap the every_two_weeks variable in the week() function
week(DF$every_two_weeks)

Combine hour and minutes columns in one column and get the time difference in R

Dear R community members,
I have a dataset with times in 12-hour format, laid out as follows:
departurehour departureminute arrivalhour arrivalminute
4 30 4 50
9 10 9 30
8 10 8 18
And I want to get the following output, with commute time in minutes: Commutetime = Arrivaltime - Departuretime.
Departuretime Arrivaltime Commutetime
4:30 4:50 20
9:10 9:30 20
8:10 8:18 8
I would greatly appreciate your timely help.
Thank you very much in advance.
We can combine departurehour and departureminute to get departuretime, and do the same for arrivaltime. Then subtract departuretime from arrivaltime with difftime to get the time difference in minutes.
library(dplyr)
library(tidyr)
df %>%
  unite(departuretime, departurehour, departureminute, sep = ":") %>%
  unite(arrivaltime, arrivalhour, arrivalminute, sep = ":") %>%
  mutate(Commutetime = as.numeric(difftime(
    as.POSIXct(sprintf("%04s", arrivaltime), format = "%H:%M"),
    as.POSIXct(sprintf("%04s", departuretime), format = "%H:%M"),
    units = "mins")))
# departuretime arrivaltime Commutetime
#1 4:30 4:50 20
#2 9:10 9:30 20
#3 8:10 8:18 8
With dplyr:
df %>%
  mutate(ArrivalTime = paste0(arrivalhour, ":", arrivalminute),
         DepartTime = paste0(departurehour, ":", departureminute)) %>%
  select(ends_with("Time")) %>%
  mutate(DepartTime = strptime(DepartTime, format = "%H:%M"),
         ArrivalTime = strptime(ArrivalTime, format = "%H:%M"),
         Total = difftime(ArrivalTime, DepartTime))
ArrivalTime DepartTime Total
1 2020-04-16 04:50:00 2020-04-16 04:30:00 20 mins
2 2020-04-16 09:30:00 2020-04-16 09:10:00 20 mins
3 2020-04-16 08:18:00 2020-04-16 08:10:00 8 mins
NOTE: this needs some date component for difftime to work (strptime fills in today's date).
Data
df <- structure(list(departurehour = c(4L, 9L, 8L), departureminute = c(30L,
10L, 10L), arrivalhour = c(4L, 9L, 8L), arrivalminute = c(50L,
30L, 18L)), class = "data.frame", row.names = c(NA, -3L))
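Since only hours and minutes are involved, plain integer arithmetic sidesteps dates entirely; a minimal base R sketch on the same df (it assumes no trip crosses midnight):
df$Commutetime <- with(df, (arrivalhour * 60 + arrivalminute) -
                           (departurehour * 60 + departureminute))
df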
Here is an option with data.table
library(data.table)
setDT(df1)[, .(departuretime = sprintf("%02d:%02d", departurehour, departureminute),
               arrivaltime   = sprintf("%02d:%02d", arrivalhour, arrivalminute))][
  , CommuteTime := as.numeric(as.ITime(arrivaltime) - as.ITime(departuretime)) / 60][]
# departuretime arrivaltime CommuteTime
#1: 04:30 04:50 20
#2: 09:10 09:30 20
#3: 08:10 08:18 8
data
df1 <- structure(list(departurehour = c(4L, 9L, 8L), departureminute = c(30L,
10L, 10L), arrivalhour = c(4L, 9L, 8L), arrivalminute = c(50L,
30L, 18L)), class = "data.frame", row.names = c(NA, -3L))

date (time) conversion in a long shape dataframe

This is a screenshot of my data frame. The data frame is long (each row is one of multiple measurements per patient_id), and the number of repeated measurements (rows) differs by patient_id. In R, I want to generate a new variable equal to each date (in order) minus the patient's first date, saved in days.
Using dplyr, you should group by id and then mutate to add the new column, as follows.
library(tidyverse)
# example data frame (always dput a simple piece of your data)
df <- structure(list(patient_id = c(1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L,
2L, 2L), date = structure(c(17600, 17601, 17602, 17603, 17604,
17605, 17606, 17607, 17608, 17609), class = "Date")), class = "data.frame",
row.names = c(NA, -10L))
The key is to store your date variable as a Date object in your data frame; that way you can do arithmetic with it. To convert your date variable you can use the as_date function from the lubridate package.
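For instance, if date had arrived as character (a sketch; the dput above already stores it as Date, so this would be a no-op here):
library(lubridate)
df$date <- as_date(df$date)
With date as a proper Date, the grouped arithmetic below works as expected.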
df %>%
  group_by(patient_id) %>%   # group by patient
  mutate(days_since_first_time = date - min(date)) %>%
  arrange(patient_id, date)
# this is the output
patient_id date days_since_first_time
1 2018-03-10 0
1 2018-03-14 4
1 2018-03-18 8
2 2018-03-11 0
2 2018-03-12 1
2 2018-03-13 2
2 2018-03-15 4
2 2018-03-16 5
2 2018-03-17 6
2 2018-03-19 8
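If you want the new column stored as a plain number rather than a difftime, wrap the subtraction in as.numeric(); a small variation on the same pipeline:
df %>%
  group_by(patient_id) %>%
  mutate(days_since_first_time = as.numeric(date - min(date), units = "days")) %>%
  arrange(patient_id, date)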

R: Compute a rolling sum on irregular time series grouped by id variables with time-based window

I love R but some problems are just plain hard.
The challenge is to find the first instance of a rolling sum that is less than 30 in an irregular time series having a time-based window greater than or equal to 6 hours. I have a sample of the series
Row Person DateTime Value
1 A 2014-01-01 08:15:00 5
2 A 2014-01-01 09:15:00 5
3 A 2014-01-01 10:00:00 5
4 A 2014-01-01 11:15:00 5
5 A 2014-01-01 14:15:00 5
6 B 2014-01-01 08:15:00 25
7 B 2014-01-01 10:15:00 25
8 B 2014-01-01 19:15:00 2
9 C 2014-01-01 08:00:00 20
10 C 2014-01-01 09:00:00 5
11 C 2014-01-01 13:45:00 1
12 D 2014-01-01 07:00:00 1
13 D 2014-01-01 08:15:00 13
14 D 2014-01-01 14:15:00 15
For Person A, Rows 1 & 5 create a minimum 6 hour interval with a running sum of 25 (which is less than 30).
For Person B, Rows 7 & 8 create a 9 hour interval with a running sum of 27 (again less than 30).
For Person C, using Rows 9 & 10, there is no minimum 6 hour interval (it is only 5.75 hours) although the running sum is 26 and is less than 30.
For Person D, using Rows 12 & 14, the interval is 7.25 hours but the running sum is 30 and is not less than 30.
Given n observations, there are n*(n-1)/2 intervals that must be compared. For example, with n=2 there is just 1 interval to evaluate. For n=3 there are 3 intervals. And so on.
I assume this is a variation of the subset sum problem (http://en.wikipedia.org/wiki/Subset_sum_problem).
While the data can be sorted, I suspect this requires a brute-force solution testing each interval (a runnable sketch of that appears after the dput below).
Any help would be appreciated.
Edit: here's the data with DateTime column formatted as POSIXct:
df <- structure(list(Person = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 3L, 3L, 3L, 4L, 4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"),
DateTime = structure(c(1388560500, 1388564100, 1388566800,
1388571300, 1388582100, 1388560500, 1388567700, 1388600100,
1388559600, 1388563200, 1388580300, 1388556000, 1388560500,
1388582100), class = c("POSIXct", "POSIXt"), tzone = ""),
Value = c(5L, 5L, 5L, 5L, 5L, 25L, 25L, 2L, 20L, 5L, 1L,
1L, 13L, 15L)), .Names = c("Person", "DateTime", "Value"), row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14"), class = "data.frame")
I have found this to be a difficult problem in R as well. So I made a package for it!
library("devtools")
install_github("boRingTrees","mgahan")
require(boRingTrees)
Of course, you will have to figure out your units correctly for the upper bound.
Here is some more documentation if you are interested.
https://github.com/mgahan/boRingTrees
For the data df that @beginneR provided, you could use the following code to get a 6-hour rolling sum.
require(data.table)
setDT(df)
df[ , roll := rollingByCalcs(df, dates = "DateTime", target = "Value",
                             by = "Person", stat = sum, lower = 0, upper = 6*60*60)]
Person DateTime Value roll
1: A 2014-01-01 01:15:00 5 5
2: A 2014-01-01 02:15:00 5 10
3: A 2014-01-01 03:00:00 5 15
4: A 2014-01-01 04:15:00 5 20
5: A 2014-01-01 07:15:00 5 25
6: B 2014-01-01 01:15:00 25 25
7: B 2014-01-01 03:15:00 25 50
8: B 2014-01-01 12:15:00 2 2
9: C 2014-01-01 01:00:00 20 20
10: C 2014-01-01 02:00:00 5 25
11: C 2014-01-01 06:45:00 1 26
12: D 2014-01-01 00:00:00 1 1
13: D 2014-01-01 01:15:00 13 14
14: D 2014-01-01 07:15:00 15 28
The original post is pretty unclear to me, so this might not be exactly what he wanted. If a column with the desired output was presented, I imagine I could be of more help.
We assume that an interval is defined by two rows for the same person. For each person, we want the first such interval (time-wise) spanning at least 6 hours for which the sum of Value over those two rows and any intermediate rows is less than 30. If there is more than one such first interval for a person, pick one arbitrarily.
This can be represented by a triple join in SQL. The inner select picks out all rows consisting of the start of the interval (a.DateTime), the end of the interval (b.DateTime), and the rows between them (c.DateTime), grouping by person and interval and summing over Value, provided the interval spans at least 6 hours. The outer select then keeps only those rows whose total is < 30 and, for each person, keeps only the one whose DateTime is least. If there is more than one first row (time-wise) for a person, it picks one arbitrarily.
library(sqldf)
sqldf(
"select Person, min(Datetime) DateTime, hours, total
from (select a.Person,
a.DateTime,
(b.Datetime - a.DateTime)/3600 hours,
sum(c.Value) total
from DF a join DF b join DF c
on a.Person = b.Person and a.Person = c.Person and hours >= 6
and c.DateTime between a.DateTime and b.DateTime
group by a.Person, a.DateTime, b.DateTime)
where total < 30
group by Person"
)
giving:
Person DateTime hours total
1 A 2014-01-01 08:15:00 6.00 25
2 B 2014-01-01 10:15:00 9.00 27
3 D 2014-01-01 07:00:00 7.25 29
Note: We used this data:
DF <- data.frame( Row = 1:14,
Person = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"),
DateTime = structure(c(1388582100, 1388585700, 1388588400, 1388592900,
1388603700, 1388582100, 1388589300, 1388621700, 1388581200,
1388584800, 1388601900, 1388577600, 1388582100, 1388603700),
class = c("POSIXct", "POSIXt"), tzone = ""),
Value = c(5L, 5L, 5L, 5L, 5L, 25L, 25L, 2L, 20L, 5L, 1L, 1L, 13L, 15L) )
As of version 1.9.8 (on CRAN 25 Nov 2016), the data.table package has gained the ability to aggregate in a non-equi join.
library(data.table)
tmp <- setDT(df)[, CJ(start = DateTime, end = DateTime)[
  , hours := difftime(end, start, units = "hours")][hours >= 6], by = Person]
df[tmp, on = .(Person, DateTime >= start, DateTime <= end),
   .(hours, total = sum(Value)), by = .EACHI][
     total < 30, .SD[1L], by = Person]
Person DateTime hours total
1: A 2014-01-01 08:15:00 6.00 hours 25
2: B 2014-01-01 10:15:00 9.00 hours 27
3: D 2014-01-01 07:00:00 7.25 hours 29
tmp contains all possible intervals of 6 and more hours for each person. It is created through a cross join CJ() and subsequent filtering:
tmp
Person start end hours
1: A 2014-01-01 08:15:00 2014-01-01 14:15:00 6.00 hours
2: B 2014-01-01 08:15:00 2014-01-01 19:15:00 11.00 hours
3: B 2014-01-01 10:15:00 2014-01-01 19:15:00 9.00 hours
4: D 2014-01-01 07:00:00 2014-01-01 14:15:00 7.25 hours
5: D 2014-01-01 08:15:00 2014-01-01 14:15:00 6.00 hours
These intervals are being used to aggregate over in the non-equi join. The result is filtered for a total value of less than 30 and, finally, the first occurrence for each person is picked.

R for loop not working

I'm trying to use R to find the max value of each day, for 1 to n days. My issue is that there are multiple values in each day. Here's my code; after I run it I get "incorrect number of dimensions".
Any suggestions:
Days <- unique(theData$Date) #Gets each unique Day
numDays <- length(Days)
Time <- unique(theData$Time) #Gets each unique time
numTime <- length(Time)
rowCnt <- 1
for (i in 1:numDays) #Do something for each individual day. In this case find max
{
temp <- which(theData[i]$Date == numDays[i])
temp <- theData[[i]][temp,]
High[rowCnt, (i-2)+2] <- max(temp$High) #indexing for when I print to CSV
rowCnt <- rowCnt + 1
}
Here's what it should come out to, except with 1 to n days and times:
Day Time Value
20130310 09:30:00 5
20130310 09:31:00 1
20130310 09:32:00 2
20130310 09:33:00 3
20130311 09:30:00 12
20130311 09:31:00 0
20130311 09:32:00 1
20130311 09:33:00 5
so this should return:
day time value
20130310 09:33:00 3
20130311 09:30:00 12
Any help would be greatly appreciated! Thanks!
Here is a solution using the plyr package:
mydata<-structure(list(Day = structure(c(2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L), .Label = c("", "x", "y"), class = "factor"), Value = c(0L,
1L, 2L, 3L, 12L, 0L, 1L, 5L), Time = c(5L, 6L, 7L, 8L, 1L, 2L,
3L, 4L)), .Names = c("Day", "Value", "Time"), row.names = c(NA,
8L), class = "data.frame")
library(plyr)
ddply(mydata,.(Day),summarize,max.value=max(Value))
Day max.value
1 x 3
2 y 12
Updated1: if your day is, say, 10/02/2012 12:00:00 AM, then you need to use:
mydata$Day<-with(mydata,as.Date(Day, format = "%m/%d/%Y"))
ddply(mydata,.(Day),summarize,max.value=max(Value))
Updated2: as per the new data, if your day is like the one you updated, you don't need to do anything. You can just use the code as follows:
mydata1<-structure(list(Day = c(20130310L, 20130310L, 20130310L, 20130310L,
20130311L, 20130311L, 20130311L, 20130311L), Time = structure(c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("9:30:00", "9:31:00",
"9:32:00", "9:33:00"), class = "factor"), Value = c(5L, 1L, 2L,
3L, 12L, 0L, 1L, 5L)), .Names = c("Day", "Time", "Value"), class = "data.frame", row.names = c(NA,
-8L))
ddply(mydata1, .(Day), summarize, Time = Time[which.max(Value)], max.value = max(Value))
Day Time max.value
1 20130310 9:30:00 5
2 20130311 9:30:00 12
If you want the time to appear in the output, just use Time = Time[which.max(Value)], which gives the time at the maximum value.
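plyr has since been superseded; a dplyr equivalent of the same grouped lookup, as a sketch on mydata1:
library(dplyr)
mydata1 %>%
  group_by(Day) %>%
  slice(which.max(Value)) %>%
  ungroup()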
This is a base function approach:
> do.call( rbind, lapply(split(dfrm, dfrm$Day),
function (df) df[ which.max(df$Value), ] ) )
Day Time Value
20130310 20130310 09:30:00 5
20130311 20130311 09:30:00 12
To explain what's happening it's good to learn to read R functions from the inside out (since they are often built around each other.) You wanted lines from a dataframe, so you would either need to build a numeric or logical vector that spanned the number of rows, .... or you can take the route I did and break the problem up by Day. That's what split does with dataframes. Then within each dataframe I applied a function, which.max to just a single day's subset of the data. Since I only got the results back from lapply as a list of dataframes, I needed to squash them back together, and the typical method for doing so is do.call(rbind, ...).
If I took the other route of making a vector for selection that applied to the whole dataframe I would use ave:
> dfrm[ with(dfrm, ave(Value, Day, FUN=function(v) v==max(v) ) ) , ]
Day Time Value
1 20130310 09:30:00 5
1.1 20130310 09:30:00 5
Huh? That's not right... What's the problem?
with(dfrm, ave(Value, Day, FUN=function(v) v==max(v) ) )
[1] 1 0 0 0 1 0 0 0
So despite asking for a logical vector with the "==" function, I got conversion to a numeric vector; ave() returns a vector of the same mode as its first argument (here the numeric Value), so the logical results are coerced to 0/1. But converting to logical outside that result I succeed again:
> dfrm[ as.logical( with(dfrm, ave(Value, Day,
FUN=function(v) v==max(v) ) ) ), ]
Day Time Value
1 20130310 09:30:00 5
5 20130311 09:30:00 12
Also note that the ave function (unlike tapply or aggregate) requires that you offer the function as a named argument with FUN=function(.). That is a common error I make. If you see the error message "unique() applies only to vectors", it seems to come out of the blue, but it means that ave tried to group an argument that it expected to be discrete and you gave it a function.
Unlike in many other programming languages, in R it is considered good practice to avoid for loops. Instead, try something like:
index <- sapply(Days, function(d) {
  rows <- which(theData$Date == d)        # rows belonging to this day
  rows[ which.max(theData$Value[rows]) ]  # row of that day's maximum
})
theData[index, c("Date", "Time", "Value")]
This means: for each value of Days, find that day's maximum Value and return its row index. Then you can select the rows and columns of interest.
I recommend reading the help documentation for apply(), lapply(), sapply(), tapply(), and mapply() (I'm probably forgetting one of them…) in base R, and the plyr package.
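As a taste of that family, tapply alone condenses the per-day maximum into one line (a sketch assuming the value column is named Value, as in the expected output):
tapply(theData$Value, theData$Date, max)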
