R for loop not working

I'm trying to use R to find the max value of each day for 1 to n days. My issue is that there are multiple values within each day. Here's my code; after I run it I get an "incorrect number of dimensions" error.
Any suggestions?
Days <- unique(theData$Date) #Gets each unique Day
numDays <- length(Days)
Time <- unique(theData$Time) #Gets each unique time
numTime <- length(Time)
rowCnt <- 1
for (i in 1:numDays) # Do something for each individual day; in this case find the max
{
  temp <- which(theData[i]$Date == numDays[i])
  temp <- theData[[i]][temp, ]
  High[rowCnt, (i-2)+2] <- max(temp$High) # indexing for when I print to CSV
  rowCnt <- rowCnt + 1
}
Here's what it should come out to (except with 1 to n days and times):
Day Time Value
20130310 09:30:00 5
20130310 09:31:00 1
20130310 09:32:00 2
20130310 09:33:00 3
20130311 09:30:00 12
20130311 09:31:00 0
20130311 09:32:00 1
20130311 09:33:00 5
so this should return:
Day Time Value
20130310 09:30:00 5
20130311 09:30:00 12
Any help would be greatly appreciated! Thanks!

Here is a solution using the plyr package:
mydata<-structure(list(Day = structure(c(2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L), .Label = c("", "x", "y"), class = "factor"), Value = c(0L,
1L, 2L, 3L, 12L, 0L, 1L, 5L), Time = c(5L, 6L, 7L, 8L, 1L, 2L,
3L, 4L)), .Names = c("Day", "Value", "Time"), row.names = c(NA,
8L), class = "data.frame")
library(plyr)
ddply(mydata,.(Day),summarize,max.value=max(Value))
Day max.value
1 x 3
2 y 12
Update 1: If your day is, say, 10/02/2012 12:00:00 AM, then you need to use:
mydata$Day<-with(mydata,as.Date(Day, format = "%m/%d/%Y"))
ddply(mydata,.(Day),summarize,max.value=max(Value))
Update 2 (as per the new data): If your day is like the one in your update, you don't need to convert anything; you can just use the following code:
mydata1<-structure(list(Day = c(20130310L, 20130310L, 20130310L, 20130310L,
20130311L, 20130311L, 20130311L, 20130311L), Time = structure(c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("9:30:00", "9:31:00",
"9:32:00", "9:33:00"), class = "factor"), Value = c(5L, 1L, 2L,
3L, 12L, 0L, 1L, 5L)), .Names = c("Day", "Time", "Value"), class = "data.frame", row.names = c(NA,
-8L))
ddply(mydata1, .(Day), summarize, Time = Time[which.max(Value)], max.value = max(Value))
Day Time max.value
1 20130310 9:30:00 5
2 20130311 9:30:00 12
If you want the time to appear in the output, just add Time = Time[which.max(Value)], which gives the time at the maximum value.

This is a base function approach:
> do.call( rbind, lapply(split(dfrm, dfrm$Day),
function (df) df[ which.max(df$Value), ] ) )
Day Time Value
20130310 20130310 09:30:00 5
20130311 20130311 09:30:00 12
To explain what's happening, it's good to learn to read R functions from the inside out (since they are often built around each other). You wanted lines from a dataframe, so you would either need to build a numeric or logical vector that spanned the number of rows, or you can take the route I did and break the problem up by Day. That's what split does with dataframes. Then within each dataframe I applied a function, which.max, to just a single day's subset of the data. Since I only got the results back from lapply as a list of dataframes, I needed to squash them back together, and the typical method for doing so is do.call(rbind, ...).
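Here are the intermediate steps spelled out (a minimal sketch, assuming dfrm is the Day/Time/Value data frame from the question):
by_day <- split(dfrm, dfrm$Day)  # a named list with one data frame per Day
maxima <- lapply(by_day, function(df) df[which.max(df$Value), ])  # one-row frame per Day
do.call(rbind, maxima)  # stack the one-row frames back together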
If I took the other route of making a vector for selection that applied to the whole dataframe I would use ave:
> dfrm[ with(dfrm, ave(Value, Day, FUN=function(v) v==max(v) ) ) , ]
Day Time Value
1 20130310 09:30:00 5
1.1 20130310 09:30:00 5
Huh? That's not right... What's the problem?
with(dfrm, ave(Value, Day, FUN=function(v) v==max(v) ) )
[1] 1 0 0 0 1 0 0 0
So despite asking for a logical vector with the "==" function, I got coercion to a numeric vector: ave() writes the function's results back into the original numeric vector, which turns TRUE/FALSE into 1/0. Converting that result to logical, I succeed again:
> dfrm[ as.logical( with(dfrm, ave(Value, Day,
FUN=function(v) v==max(v) ) ) ), ]
Day Time Value
1 20130310 09:30:00 5
5 20130311 09:30:00 12
Also note that the ave function (unlike tapply or aggregate) requires that you offer the function as a named argument, FUN=function(.); that is a common error I make. If you see the error message "unique() applies only to vectors", it seems to come out of the blue, but it means that ave tried to group an argument it expected to be discrete, because you passed it a function positionally.
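For reference, the correct named-argument usage on the dfrm above is below; it should give 5 5 5 5 12 12 12 12, each day's maximum recycled across that day's rows:
ave(dfrm$Value, dfrm$Day, FUN = max)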

Unlike in other programming languages, in R it is considered good practice to avoid for loops. Instead, try something like:
index <- sapply(Days, function(x) {
  rows <- which(theData$Date == x)      # rows belonging to this day
  rows[which.max(theData$Value[rows])]  # row of that day's maximum Value
})
theData[index, c("Date", "Time", "Value")]
This means: for each value in Days, find that day's maximum Value and return its row index. Then you can select the rows and columns of interest.
I recommend reading the help documentation for apply(), lapply(), sapply(), tapply(), and mapply() (I'm probably forgetting one of them…) in R, as well as the plyr package.
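For instance, tapply() alone can produce the per-day maxima (a quick sketch, assuming columns named Date and Value as in the example data):
tapply(theData$Value, theData$Date, max)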

Related

Number of continuous weeks by group

How do I find the number of continuous weeks by group, counted from the max date in the dataset?
Say I have this dataframe:
id Week
1 A 2/06/2019
2 A 26/05/2019
3 A 19/05/2019
4 A 12/05/2019
5 A 5/05/2019
6 B 2/06/2019
7 B 26/05/2019
8 B 12/05/2019
9 B 5/05/2019
10 C 26/05/2019
11 C 19/05/2019
12 C 12/05/2019
13 D 2/06/2019
14 D 26/05/2019
15 D 19/05/2019
16 E 2/06/2019
17 E 19/05/2019
18 E 12/05/2019
19 E 5/05/2019
My desired output is:
id count
1: A 5
2: B 2
3: D 3
4: E 1
I am currently converting the dates to a factor to get ordered numbers, and checking them against a reference number created from the number of rows in each group.
library(data.table)
df <- structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 5L),
.Label = c("A", "B", "C", "D", "E"), class = "factor"),
Week = structure(c(3L, 4L, 2L, 1L, 5L, 3L, 4L, 1L, 5L, 4L, 2L, 1L, 3L, 4L, 2L, 3L, 2L, 1L, 5L),
.Label = c("12/05/2019", "19/05/2019", "2/06/2019", "26/05/2019", "5/05/2019"), class = "factor")),
class = "data.frame", row.names = c(NA, -19L))
dt <- data.table(df)
dt[, Week_no := as.factor(as.Date(Week, format = "%d/%m/%Y"))]
dt[, Week_no := factor(Week_no)]
dt[, Week_no := as.numeric(Week_no)]
max_no <- max(dt$Week_no)
dt[, Week_ref := max_no:(max_no - .N + 1), by = "id"]
dt[, Week_diff := Week_no - Week_ref]
dt[Week_diff == 0, list(count = .N), by = "id"]
Here's one way to do this:
dt <- dt[, Week := as.Date(Week, format = "%d/%m/%Y")]
ids_having_max <- dt[.(max(Week)), id, on = "Week"]
dt <- dt[.(ids_having_max), on = "id"
][order(-Week), .(count = sum(rleid(c(-7L, diff(Week))) == 1)), by = "id"]
Breaking it into steps:
- We leave Week as a date because dates can already be compared, and subtracting them gives time differences.
- We then get all the ids that contain the maximum date in the whole table. This uses secondary indices.
- We use secondary indices again to filter out those ids that were not part of the previous result (the dt[.(ids_having_max), on = "id"] part).
- The last frame is the tricky one. We group by id and make sure that rows are ordered by Week in descending order. Then the logic is as follows: when you have contiguous weeks, diff(Week) is always -7 with the chosen sorting. diff returns a vector one element shorter than its input (the first result is the second input element minus the first), so we prepend a -7 to make sure that -7 is the first element in the input to rleid. rleid assigns a 1 to the first -7 and keeps assigning 1 until it sees something different from -7; something different means the weeks stopped being contiguous. sum(rleid(c(-7L, diff(Week))) == 1) then simply returns how many rows had an rleid equal to 1.
Example of the last part for B:
Differences: -7, -14, -7
After prepending -7: -7, -7, -14, -7
After rleid: 1, 1, 2, 3
From the previous, two rows had rleid == 1, so the count for B is 2.
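You can check that logic directly for B (a small sketch; rleid() comes from data.table):
library(data.table)
b_weeks <- as.Date(c("2019-06-02", "2019-05-26", "2019-05-12", "2019-05-05"))  # B's weeks, descending
rleid(c(-7L, diff(b_weeks)))  # 1 1 2 3, so sum(... == 1) is 2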
Apologies for the dplyr solution, but I presume a similar approach can be achieved more concisely with data.table.
library(dplyr)
df$Week = lubridate::dmy(df$Week)
df %>%
  group_by(id) %>%
  arrange(id, Week) %>%
  # Assign a group to each new streak
  mutate(new_streak = cumsum(Week != lag(Week, default = 0) + 7)) %>%
  add_count(id, new_streak) %>%
  slice(n()) # Only keep the last week
So I would suggest converting the format of the date column to show the week number ("%W"), as follows:
dt[, Week_no := format(as.Date(Week, format = "%d/%m/%Y"),"%W")]
Then find the number of unique week numbers for each id value:
dt[,(length(unique(Week_no))),by="id"]
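data.table's uniqueN() is an equivalent, slightly more direct way to write the same count (a minor alternative sketch):
dt[, .(count = uniqueN(Week_no)), by = "id"]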
FULL DISCLOSURE
I realise that when I run this I get a different table than the one you present, as R counts the week by the week number of the given year.
If this doesn't answer your question, just let me know and I can try to update.

How can I convert GroupedData into a DataFrame in R

Consider I have the below dataframe
AccountId,CloseDate
1,2015-05-07
2,2015-05-09
3,2015-05-01
4,2015-05-07
1,2015-05-09
1,2015-05-12
2,2015-05-12
3,2015-05-01
3,2015-05-01
3,2015-05-02
4,2015-05-17
1,2015-05-12
I want to group it based on AccountId and then add another column named date_diff containing the difference in CloseDate between the current row and the previous row. Note that I want this date_diff to be calculated only between rows having the same AccountId, which is why I need to group the data before adding the column.
Below is the R code that I am using
df <- read.df(sqlContext, "/home/ubuntu/work/csv/sample.csv", source = "com.databricks.spark.csv", inferSchema = "true", header="true")
df$CloseDate <- to_date(df$CloseDate)
groupedData <- SparkR::group_by(df, df$AccountId)
SparkR::mutate(groupedData, DiffCloseDt = as.numeric(SparkR::datediff((CloseDate),(SparkR::lag(CloseDate,1)))))
To add another column I am using mutate, but as group_by returns GroupedData I am not able to use mutate here. I am getting the below error:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘mutate’ for signature ‘"GroupedData"’
So how can I convert GroupedData into Dataframe so that I can add columns using mutate?
What you want is not possible to achieve using group_by, as already explained quite a few times on SO:
Using groupBy in Spark and getting back to a DataFrame
How to do custom operations on GroupedData in Spark?
DataFrame groupBy behaviour/optimization
group_by on a DataFrame doesn't physically group the data. Moreover, the order of operations after applying group_by is nondeterministic.
To achieve the desired output you'll have to use window functions and provide an explicit ordering:
df <- structure(list(AccountId = c(1L, 2L, 3L, 4L, 1L, 1L, 2L, 3L,
3L, 3L, 4L, 1L), CloseDate = structure(c(3L, 4L, 1L, 3L, 4L,
5L, 5L, 1L, 1L, 2L, 6L, 5L), .Label = c("2015-05-01", "2015-05-02",
"2015-05-07", "2015-05-09", "2015-05-12", "2015-05-17"), class = "factor")),
.Names = c("AccountId", "CloseDate"),
class = "data.frame", row.names = c(NA, -12L))
hiveContext <- sparkRHive.init(sc)
sdf <- createDataFrame(hiveContext, df)
registerTempTable(sdf, "df")
query <- "SELECT *, LAG(CloseDate, 1) OVER (
PARTITION BY AccountId ORDER BY CloseDate
) AS DateLag FROM df"
dfWithLag <- sql(hiveContext, query)
withColumn(dfWithLag, "diff", datediff(dfWithLag$CloseDate, dfWithLag$DateLag)) %>%
head()
## AccountId CloseDate DateLag diff
## 1 1 2015-05-07 <NA> NA
## 2 1 2015-05-09 2015-05-07 2
## 3 1 2015-05-12 2015-05-09 3
## 4 1 2015-05-12 2015-05-12 0
## 5 2 2015-05-09 <NA> NA
## 6 2 2015-05-12 2015-05-09 3
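For reference, newer SparkR releases (2.0+) also expose window functions directly, so the same lag could be written without the SQL string. This is a sketch only, assuming those functions are available in your Spark version:
ws <- orderBy(windowPartitionBy("AccountId"), "CloseDate")  # window per account, ordered by date
sdfWithLag <- withColumn(sdf, "DateLag", over(lag(sdf$CloseDate, 1), ws))
head(withColumn(sdfWithLag, "diff", datediff(sdfWithLag$CloseDate, sdfWithLag$DateLag)))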

Randomly subsetting 1 observation per site and date

I have read many posts on the site about randomly subsetting a large dataset for observations based on date -- for the first, last, or a specific date. However, I have a different challenge that requires me to subsample a large dataset by site AND date. I want to keep all sites in the subsetted dataset, but only include 1 date observation per site.
More specifically, I have a large dataset (for community ecology!) of insect community observations (n=2000) across 4 years. They were observed from ~900 sites, but each site has between 1 and 6 date observations within a year, with no sites repeated between years (this is why previous posts looking to subset a specific date range do not apply here). Subsetting in this particular way is critical because of the type of statistical analysis I am using: including spatial autocorrelation terms in the analysis means that I can only include one observation per site.
So the full dataset looks something like:
Site Date Ladybug
Baumgarten 6/24/2014 2
Baumgarten 8/6/2014 0
Baumgarten 8/20/2014 3
Baumgarten 7/8/2014 0
Baumgarten 7/22/2014 1
Berkevich 7/9/2014 0
Berkevich 7/23/2014 4
Berkevich 8/8/2014 0
Berkevich 8/22/2014 0
Boehm 6/24/2014 2
# dput(data)
dd <- structure(list(Site = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L), .Label = c("Baumgarten", "Berkevich", "Boehm"), class = "factor"), Date = structure(c(1L, 8L, 6L, 4L, 2L, 5L, 3L, 9L, 7L, 1L), .Label = c("6/24/2014", "7/22/2014", "7/23/2014", "7/8/2014", "7/9/2014", "8/20/2014", "8/22/2014", "8/6/2014", "8/8/2014" ), class = "factor"), Ladybug = c(2L, 0L, 3L, 0L, 1L, 0L, 4L, 0L, 0L, 2L)), .Names = c("Site", "Date", "Ladybug"), class = "data.frame", row.names = c(NA, -10L))
And my desired subsetted dataset would look something like:
Site Date Ladybugs
Baumgarten 8/20/2014 3
Berkevich 7/9/2014 0
Boehm 6/24/2014 2
I have dates entered in both MM/DD/YYYY and DOY format (since sites don't repeat between years, DOY x site subsetting will still work to ensure no repeating sites), so code that uses either could work.
Any advice would be much appreciated. Thanks.
Assuming your data is a data.frame named df, you could use dplyr and do the following:
library(dplyr)
df %>%
  group_by(Site) %>%
  sample_n(1)
# Source: local data frame [3 x 3]
# Groups: Site [3]
#
# Site Date Ladybug
# (fctr) (fctr) (int)
# 1 Baumgarten 8/20/2014 3
# 2 Berkevich 8/22/2014 0
# 3 Boehm 6/24/2014 2
Using data.table you can use:
require(data.table)
setDT(DT)
DT[,.SD[sample(.N,1)], by=Site]
This gives you
Site Date Ladybug
1: Baumgarten 8/20/2014 3
2: Berkevich 7/23/2014 4
3: Boehm 6/24/2014 2
You could also use base R for this. It splits the data by site, samples one row from each, and returns that; then the results get bound together.
set.seed(123)
res <- do.call(rbind,lapply(split(dat,dat$Site),function(x){x[sample(nrow(x),1),]}))
Another possibility is data.table:
library(data.table)
setDT(dat)
set.seed(123)
res <- dat[,.SD[sample(.N,1)],Site]
A possibly inefficient method, but it gets the job done:
nSites <- length(unique(data$Site))
rowselect <- sapply(1:nSites, function(x) {
  elem <- which(data$Site == unique(data$Site)[x])
  if (length(elem) < 2) {
    elem             # only one row for this site, so keep it
  } else {
    sample(elem, 1)  # otherwise pick one row at random
  }
})
This gives the row index of one randomly selected row for each site.
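You would then take the one-row-per-site subset with:
data[rowselect, ]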

Counting occurrences based on a condition for each element of a column

I am analysing air traffic movements at an airport. My data set comprises aircraft block off times (leaving the gate) and the respective take-off times.
I am looking for an efficient way to count the (cumulative) occurrence of take-off events based on a condition given by the block-time of a flight.
Being relatively new to R, I have managed to code this by:
- looping over all rows of my data;
- subsetting the data for the block time (the condition event) in that row; and
- counting the number of rows of the (temporary) data frame.
My solution is already pretty slow for a month of data (~50,000 flights), so it will be cumbersome to analyse larger time frames of one or two years.
I failed to find a similar problem on stackoverflow (or elsewhere) that applies to my problem. Neither could I make an apply() or sapply() work properly.
This is my code:
## count departures before own off-block
data$CUM_DEPS <- rep(NA, nrow(data)) # initialise column for dep count
for (i in 1:nrow(data)) { # loop over the data
  data$CUM_DEPS[i] <- nrow(data[data$TAKE_OFF_TIME < data$BLOCK_OFF_TIME[i], ])
}
Any pointers?
As suggested, here is a snapshot of the data and the result column I created:
FLTID TAKE_OFF_TIME BLOCK_OFF_TIME CUM_DEPS
Flight1 2013-07-01 05:02:42 2013-07-01 04:51:00 0
Flight2 2013-07-01 05:04:30 2013-07-01 04:53:52 0
Flight3 2013-07-01 05:09:01 2013-07-01 04:55:14 0
Flight4 2013-07-01 05:10:30 2013-07-01 05:00:57 0
Flight5 2013-07-01 05:12:58 2013-07-01 05:00:06 0
Flight6 2013-07-01 05:18:45 2013-07-01 05:04:14 1
Flight7 2013-07-01 05:22:12 2013-07-01 05:03:39 1
Flight8 2013-07-01 05:26:02 2013-07-01 05:09:32 3
Flight9 2013-07-01 05:27:24 2013-07-01 05:19:24 6
Flight10 2013-07-01 05:31:32 2013-07-01 05:17:05 5
From the above code, it seems like you are doing a one-to-many comparison. What makes your code slow is that you subset the data with a boolean index at every single step.
data$CUM_DEPS <- rep(NA, nrow(data))
take_off_time <- data$TAKE_OFF_TIME
for (i in 1:nrow(data)) {
  data$CUM_DEPS[i] <- sum(data$BLOCK_OFF_TIME[i] > take_off_time)
}
This small modification will make it much faster, although I cannot say exactly how much, since I do not have a reproducible example.
The biggest differences are that I store the date vector take_off_time once rather than pulling it from the data frame on every iteration, and that I sum a single boolean vector rather than subsetting the data with it.
All of the above assumes that you have parsed the dates correctly so that they can be compared with inequalities.
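If the two time columns are still character strings, a minimal conversion sketch (assuming the format shown in the snapshot) would be:
data$TAKE_OFF_TIME <- as.POSIXct(data$TAKE_OFF_TIME, format = "%Y-%m-%d %H:%M:%S")
data$BLOCK_OFF_TIME <- as.POSIXct(data$BLOCK_OFF_TIME, format = "%Y-%m-%d %H:%M:%S")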
You could check where, in between the "TAKE_OFF_TIME"s, each "BLOCK_OFF_TIME" falls. findInterval is fast for this; the following looks valid, but you may have to check findInterval's arguments to suit your exact problem.
findInterval(as.POSIXct(DF[["BLOCK_OFF_TIME"]]),
as.POSIXct(DF[["TAKE_OFF_TIME"]]))
#[1] 0 0 0 0 0 1 1 3 6 5
And, for the record, the loop using sapply:
BOT = as.POSIXct(DF[["BLOCK_OFF_TIME"]])
TOT = as.POSIXct(DF[["TAKE_OFF_TIME"]])
sapply(BOT, function(x) sum(TOT < x))
#[1] 0 0 0 0 0 1 1 3 6 5
Where "DF":
DF = structure(list(FLTID = structure(c(1L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L, 2L), .Label = c("Flight1", "Flight10", "Flight2", "Flight3",
"Flight4", "Flight5", "Flight6", "Flight7", "Flight8", "Flight9"
), class = "factor"), TAKE_OFF_TIME = structure(1:10, .Label = c("2013-07-01 05:02:42",
"2013-07-01 05:04:30", "2013-07-01 05:09:01", "2013-07-01 05:10:30",
"2013-07-01 05:12:58", "2013-07-01 05:18:45", "2013-07-01 05:22:12",
"2013-07-01 05:26:02", "2013-07-01 05:27:24", "2013-07-01 05:31:32"
), class = "factor"), BLOCK_OFF_TIME = structure(c(1L, 2L, 3L,
5L, 4L, 7L, 6L, 8L, 10L, 9L), .Label = c("2013-07-01 04:51:00",
"2013-07-01 04:53:52", "2013-07-01 04:55:14", "2013-07-01 05:00:06",
"2013-07-01 05:00:57", "2013-07-01 05:03:39", "2013-07-01 05:04:14",
"2013-07-01 05:09:32", "2013-07-01 05:17:05", "2013-07-01 05:19:24"
), class = "factor"), CUM_DEPS = c(0L, 0L, 0L, 0L, 0L, 1L, 1L,
3L, 6L, 5L)), .Names = c("FLTID", "TAKE_OFF_TIME", "BLOCK_OFF_TIME",
"CUM_DEPS"), class = "data.frame", row.names = c(NA, -10L))

Selecting specific rows in R

I am working on GPS data right now; the position of the animal has been recorded, where possible, every 4 hours. The data look like this (the XY data are not shown here for certain reasons):
ID TIME POSIXTIME date_only
1 1 12:00 2005-05-08 12:00:00 2005-05-08
2 2 16:01 2005-05-08 16:01:00 2005-05-08
3 3 20:01 2005-05-08 20:01:00 2005-05-08
4 4 0:01 2005-05-09 00:01:00 2005-05-09
5 5 8:01 2005-05-09 08:01:00 2005-05-09
6 6 12:01 2005-05-09 12:01:00 2005-05-09
7 7 16:02 2005-05-09 16:02:00 2005-05-09
8 8 20:02 2005-05-09 20:02:00 2005-05-09
9 9 0:01 2005-05-10 00:01:00 2005-05-10
10 10 4:00 2005-05-10 04:00:00 2005-05-10
I would now like to take only the first location per day. In most cases this will be at 0:01. However, sometimes it will be 4:01 or even later, as there is missing data.
How can I get only the first locations per day? They should go into a new dataframe. I tried it with:
tapply(as.numeric(Kandularaw$TIME),list(Kandularaw$date_only),min, na.rm=T)
However, this did not work, as R produces strange values when TIME is treated as numeric.
Is it possible to do it with an ifelse statement? If yes, roughly what would it look like?
I am grateful for any help I can get. Thank you for your efforts.
Cheers,
Jan
I am guessing you really want a row number as an index into a position record. If you know that these rows are ordered by date-time, and you are getting satisfactory group splits with that second argument to tapply (however it was created), then try this:
idx <- tapply(1:NROW(Kandularaw), Kandularaw$date_only, "[", 1)
If you want records (rows) in that same dataframe then just use:
Kandularaw[ idx, ]
I would approach this from a simpler point of view. First, ensure that POSIXTIME is one of the "POSIX" classes. Then order the data by POSIXTIME. At this point we can use any of the split-apply-combine idioms to do what you want, making use of the head() function. Here I use aggregate():
Using this example data set:
dat <- structure(list(ID = 1:10, TIME = structure(c(4L, 6L, 8L, 1L,
3L, 5L, 7L, 9L, 1L, 2L), .Label = c("00:01:00", "04:00:00", "08:01:00",
"12:00:00", "12:01:00", "16:01:00", "16:02:00", "20:01:00", "20:02:00"
), class = "factor"), POSIXTIME = structure(1:10, .Label = c("2005/05/08 12:00:00",
"2005/05/08 16:01:00", "2005/05/08 20:01:00", "2005/05/09 00:01:00",
"2005/05/09 08:01:00", "2005/05/09 12:01:00", "2005/05/09 16:02:00",
"2005/05/09 20:02:00", "2005/05/10 00:01:00", "2005/05/10 04:00:00"
), class = "factor"), date_only = structure(c(1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 3L, 3L), .Label = c("2005/05/08", "2005/05/09",
"2005/05/10"), class = "factor")), .Names = c("ID", "TIME", "POSIXTIME",
"date_only"), class = "data.frame", row.names = c(NA, 10L))
First, get POSIXTIME and date_only in the correct formats:
dat <- transform(dat,
POSIXTIME = as.POSIXct(POSIXTIME, format = "%Y/%m/%d %H:%M:%S"),
date_only = as.Date(date_only, format = "%Y/%m/%d"))
Next, order by POSIXTIME:
dato <- with(dat, dat[order(POSIXTIME), ])
The final step is to use aggregate() to split the data by date_only and use head() to select the first row:
aggregate(dato[,1:3], by = list(date = dato$`date_only`), FUN = head, n = 1)
Notice I pass the value 1 to the n argument of head(), indicating that it should extract only the first row of each day's observations. Because we sorted by datetime and split on date, the first row should be the first observation per day. Do be aware of rounding issues, however.
The final step results in:
> aggregate(dato[,1:3], by = list(date = dato$`date_only`), FUN = head, n = 1)
date ID TIME POSIXTIME
1 2005-05-08 1 12:00:00 2005-05-08 12:00:00
2 2005-05-09 4 00:01:00 2005-05-09 00:01:00
3 2005-05-10 9 00:01:00 2005-05-10 00:01:00
Instead of dato[,1:3] refer to whatever columns in your original data set contain the variables (locations?) you wanted.
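Since dato is already ordered by POSIXTIME, a compact base alternative (just a sketch) is to keep the first occurrence of each date with duplicated():
dato[!duplicated(dato$date_only), ]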
