I am trying to calculate driver activity using GPS data. I've written a loop that calculates the difference in time between two consecutive points in a dataframe over the range of values, summing it as it goes.
Here is an example of my data:
DriveNo Date.and.Time Latitude Longitude
1 156 2014-01-31 23:00:00 41.88367 12.48778
2 187 2014-01-31 23:00:01 41.92854 12.46904
3 297 2014-01-31 23:00:01 41.89107 12.49270
4 89 2014-01-31 23:00:01 41.79318 12.43212
5 79 2014-01-31 23:00:01 41.90028 12.46275
6 191 2014-01-31 23:00:02 41.85231 12.57741
Reprex:
taxi_noOutlier <- structure(list(DriveNo = c(156, 187, 297, 89, 79, 191),
Date.and.Time = structure(c(1391209200.73917, 1391209201.14846,
1391209201.22007, 1391209201.47085, 1391209201.63114, 1391209202.04855),
class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Latitude = c(41.883670807, 41.928543091, 41.891067505, 41.793178558,
41.900276184, 41.852306366),
Longitude = c(12.48777771, 12.469037056, 12.492704391, 12.432122231,
12.46274662, 12.577406883)),
row.names = c(NA, 6L), class = "data.frame")
And the loop:
library(dplyr)
taxi_156 <- filter(taxi_noOutlier, DriveNo == 156)
datelist = taxi_156$Date.and.Time
dlstandard = as.POSIXlt(datelist)
diffsum <- as.numeric(sum(Filter(function(x) x <= 60, difftime(tail(dlstandard, -1), head(dlstandard, -1), units = 'secs'))))
print(paste("The total activity time for driver #156 is ", diffsum))
Which gives an output of:
[1] "The total activity time for driver #156 is 705655.37272048"
My question is, how can I expand this code to find the activity for each other driver? (There are 374 unique drivers, each with thousands of points.) I have tried to replicate the above code using a loop that would calculate the time difference for each DriveNo, but I am new to R and I my understanding of loop syntax isn't great.
Can I filter into separate dataframes using a method like this? (This gives an error to do with unexpected bracketing).
for (i in seq_along(taxi_noOutlier$DriveNo))
{
taxi_[[i]] <- filter(taxi_noOutlier, DriveNo == [[i]])
}
and then use my original code on each one? Or is there a more efficient way? Thanks
You can group_by each DriveNo, take the difference between consecutive Date.and.Time values, drop the gaps longer than a minute, and sum the remaining differences.
library(dplyr)
taxi_noOutlier %>%
group_by(DriveNo) %>%
mutate(difftime = difftime(Date.and.Time, lag(Date.and.Time), units = 'secs')) %>%
filter(difftime <= 60) %>%
summarise(diffsum = sum(as.numeric(difftime), na.rm = TRUE)) -> result
result
Is this what you need?
The separate dataframes are stored in the list taxi.list.
taxi.list <- list()
for (i in unique(taxi_noOutlier$DriveNo)){
name <- paste0("taxi_",i)
taxi.list[[name]] <- filter(taxi_noOutlier, DriveNo == i)
#same as
#taxi.list[[name]] <- taxi_noOutlier %>% filter(DriveNo == i)
}
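If you only need one data frame per driver, base R's split() builds the same list in a single call, with no loop at all:

```r
# one data frame per unique DriveNo, named by the driver number
taxi.list <- split(taxi_noOutlier, taxi_noOutlier$DriveNo)
taxi.list[["156"]]  # the rows for driver 156
```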
A little help would not go amiss.
I have to classify the hours of the day (24 h) into 2 intervals (or more).
For example, I use one interval from 08:30:00 to 18:30:00 and another from 18:31:00 to 08:29:00 (day+1).
Below is an illustration of what I did. At first view it works, but on thousands of rows I realized it was not optimal and that some data were misclassified (e.g. 20:37 ended up in "Work" instead of "Sleep").
Any methodological advice is welcome.
library("lubridate")
library("dplyr")
library("data.table")
#database
datetime <- seq(from =ymd_hms("2014-02-24 00:00:00"),to= ymd_hms("2014-03-20 23:59:59"), by="minute")
set.seed(123)
values <- sample(seq(from = 20, to = 50, by = 30), size = length(datetime), replace = TRUE)
df <- tibble(datetime, values)
#datastep
df <- df %>% mutate(Hour = hour(datetime),
Minute = minute(datetime))
setDT(df)[(Hour >= 8 & Minute > 29) & (Hour <= 18 & Minute < 31), statut := "Work"] # range between 08h30 and 18h30
df[(Hour <= 8 & Minute < 30) | (Hour >= 18 & Minute > 30), statut := "Sleep"] # range 18h31 to 08h29 day+1
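One way to avoid comparing hours and minutes separately (which is how rows like 20:37 slip into the wrong class) is to reduce each timestamp to a single minutes-since-midnight value and test one interval. A minimal sketch, reusing the df built above:

```r
library(dplyr)
library(lubridate)

df <- df %>%
  mutate(mins = hour(datetime) * 60 + minute(datetime),   # minutes since midnight
         statut = if_else(mins >= 8 * 60 + 30 & mins <= 18 * 60 + 30,
                          "Work",     # 08:30:00 to 18:30:00
                          "Sleep"))   # everything else, wrapping past midnight
```

Because the whole day is one numeric axis, a time like 20:37 (1237 minutes) falls cleanly outside the 510-1110 "Work" window.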
I have a data set with paired values which I have converted into a data frame like this:
(50.0, 0.0), (49, 27.891), (48, 28.119),
(47, 28.146), (46, 28.158), (45, 28.195),
(44, 28.261), (43, 28.274), (42, 28.316),
(41, 28.326), (40, 28.608), (39, 28.687),
(38, 28.736), (37, 28.746)
numeric_data
clean_time_numeric clean_position_numeric
1 0.000 50
2 27.891 49
3 28.119 48
4 28.146 47
5 28.158 46
This data frame has time points and the position of a slider at that time point. I want to make a time series with intervals of 0.001 with the corresponding position of the slider in the next column, so the position would be 50 until the 27,891st row.
I have tried this piece of code with the xts and zoo packages that I saw from another post:
df1.zoo <- zoo(clean_time_numeric)
df2 <- as.data.frame(as.zoo(merge(as.xts(df1.zoo), as.xts(zoo(,seq(start(df1.zoo[1]),end(df1.zoo[89]), order.by = as.POSIXct.numeric(clean_time_numeric, tryformats = "%Y%m%d%H%M%S")))))))
but this error keeps showing up:
Error in xts(coredata(x), order.by = order.by, frequency = frequency, :
order.by requires an appropriate time-based object
I am new to coding in R so I'm not really sure how to approach this or if there's an easier way to solve this, any suggestions are welcome!
Thank you,
Edit: I also tried this:
numeric_data$clean_time_numeric<- as.POSIXct.numeric(numeric_data$clean_time_numeric, tz= "GMT", origin = "1970-01-01", tryformats = "%H:%M:%S")
tseries <- data.frame(x = seq(head(numeric_data$clean_time_numeric,1),tail(numeric_data$clean_time_numeric,1),by = "sec"))
res <-merge(tseries, numeric_data, by.x="x",by.y="clean_time_numeric",all.x = TRUE)
xts(res$clean_position_numeric,order.by = res$x)
With this, only the first data point is correct; the rest are NA, and it stops well before the end.
A possible solution:
create a sequence with 0.001 interval
join this sequence to the original dataframe
use zoo::na.locf to replace each NA with the last known value
df <- read.table(text = "
clean_time_numeric clean_position_numeric
0.000 50
27.891 49
28.119 48
28.146 47
28.158 46",header=T)
time.001 <- data.frame(time = seq(min(df$clean_time_numeric), max(df$clean_time_numeric), by =0.001))
library(dplyr)
df.001 <- dplyr::full_join(df, time.001, by = c("clean_time_numeric"="time")) %>%
arrange(clean_time_numeric) %>%
mutate(clean_position_numeric = zoo::na.locf(clean_position_numeric))
head(df.001)
clean_time_numeric clean_position_numeric
1 0.000 50
2 0.001 50
3 0.002 50
4 0.003 50
5 0.004 50
6 0.005 50
tail(df.001)
clean_time_numeric clean_position_numeric
28155 28.153 47
28156 28.154 47
28157 28.155 47
28158 28.156 47
28159 28.157 47
28160 28.158 46
Using the numeric_data data frame shown reproducibly in the Note at the end, convert it to a zoo series with read.zoo. Then set its frequency to 1000 (the number of points per unit interval), convert it to ts class, and use na.locf0 (or na.approx for linear interpolation, or na.spline for spline interpolation) to fill in the NAs created by the conversion from zoo to ts.
library(zoo)
z <- read.zoo(numeric_data)
frequency(z) <- 1000
tt <- na.locf0(as.ts(z))
length(tt)
## [1] 28159
deltat(tt)
## [1] 0.001
range(time(tt))
## [1] 0.000 28.158
We can now
leave it as a ts object, tt, or
convert it to a zoo series: as.zoo(tt), or
convert it to a data frame: fortify.zoo(tt)
Note
The input in reproducible form:
numeric_data <-
structure(list(clean_time_numeric = c(0, 27.891, 28.119, 28.146,
28.158), clean_position_numeric = 50:46), class = "data.frame", row.names = c(NA, -5L))
I have a problem using the period.apply function for my high-resolution time series analysis.
I want to calculate statistics (means over different periods, standard deviations, etc.) for my data, which is in 10-minute intervals. Calculating hourly means worked fine, as described in this answer.
That creates a new xts object with means calculated for each column. How do I calculate maximum values for each column?
This reproducible example describes the structure of my data:
library(xts)
start <- as.POSIXct("2018-05-18 00:00")
tseq <- seq(from = start, length.out = 1440, by = "10 mins")
Measurings <- data.frame(
Time = tseq,
Temp = sample(10:37,1440, replace = TRUE, set.seed(seed = 10)),
Variable1 = sample(1:200,1440, replace = TRUE, set.seed(seed = 187)),
Variable2 = sample(300:800,1440, replace = TRUE, set.seed(seed = 333))
)
Measurings_xts <- xts(Measurings[,-1], Measurings$Time)
HourEnds <- endpoints(Measurings_xts, "hours")
Measurings_mean <- period.apply(Measurings_xts, HourEnds, mean)
I thought it would be easy to just change the function argument from mean to max, like this:
Measurings_max <- period.apply(Measurings_xts, HourEnds, max)
It delivers output, but only one column with the overall maximum values. I need the hourly maximums of each column. A simple solution would be much appreciated.
The mean example works by column because there's a zoo method that calls mean on each column (this method is used because xts extends zoo).
The max example returns one number because there is no max.xts or max.zoo method, so it returns the maximum of the entire xts/zoo object.
A simple solution is to define a helper function:
colMax <- function(x, na.rm = FALSE) {
apply(x, 2, max, na.rm = na.rm)
}
Then use that in your period.apply call:
epHours <- endpoints(Measurings_xts, "hours")
Measurings_max <- period.apply(Measurings_xts, epHours, colMax)
head(Measurings_max)
# Temp Variable1 Variable2
# 2018-05-18 00:50:00 29 194 787
# 2018-05-18 01:50:00 28 178 605
# 2018-05-18 02:50:00 26 188 756
# 2018-05-18 03:50:00 34 152 444
# 2018-05-18 04:50:00 33 145 724
# 2018-05-18 05:50:00 35 187 621
I have made measurements of temperature at a high time resolution of 10 minutes on different urban tree species whose reactions should be compared. I am therefore researching periods of heat in particular. The task I am failing at is to select complete days based on a maximum value: e.g., days with at least one measurement above 30 °C should be subsetted from my dataframe in full.
Below you find a reproducible example that should illustrate my problem:
In my Measurings dataframe I have calculated a column indicating whether each individual measurement is above or below 30 °C. I wanted to use that column to tell other functions whether to pick a day or not when producing a new dataframe: whenever the value is above 30 °C at any time of the day, I want to include that day, from 00:00 to 23:59, in the new dataframe for further analyses.
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
tseq <- seq(from = start, length.out = 1000, by = "hours")
Measurings <- data.frame(
Time = tseq,
Temp = sample(20:35,1000, replace = TRUE),
Variable1 = sample(1:200,1000, replace = TRUE),
Variable2 = sample(300:800,1000, replace = TRUE)
)
Measurings$heat30 <- ifelse(Measurings$Temp > 30,"heat", "normal")
Measurings$otheroption30 <- ifelse(Measurings$Temp > 30,"1", "0")
The example yields a dataframe analogous to the structure of my data:
head(Measurings)
Time Temp Variable1 Variable2 heat30 otheroption30
1 2018-05-18 00:00:00 28 56 377 normal 0
2 2018-05-18 01:00:00 23 65 408 normal 0
3 2018-05-18 02:00:00 29 78 324 normal 0
4 2018-05-18 03:00:00 24 157 432 normal 0
5 2018-05-18 04:00:00 32 129 794 heat 1
6 2018-05-18 05:00:00 25 27 574 normal 0
So how do I subset to get a new dataframe that takes all the days where at least one entry is marked "heat"?
I know that, for example, dplyr::filter could filter the individual entries (row 5 in the head of the example). But how could I tell it to take all of the day 2018-05-18?
I am quite new to analyzing data with R, so I would appreciate any suggestions for a working solution to my problem. dplyr is what I have been using for quite a few tasks, but I am open to whatever works.
Thanks a lot, Konrad
Create a variable which specifies the day (dropping hours, minutes, etc.). Then iterate over the unique dates and keep only those subsets whose heat30 contains "heat" at least once:
library(dplyr)
Measurings <- Measurings %>% mutate(Time2 = format(Time, "%Y-%m-%d"))
newdf <- lapply(unique(Measurings$Time2), function(x){
rr <- Measurings %>% filter(Time2 == x) # rows for date x
# keep this day's subset only if heat30 contains "heat" at least once
if (any(rr$heat30 == "heat")) rr else NULL
}) %>% bind_rows()
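The explicit loop can also be avoided entirely by grouping on the calendar date and keeping whole groups in which any row is marked "heat" — a sketch with dplyr:

```r
library(dplyr)

hot_days <- Measurings %>%
  group_by(Day = format(Time, "%Y-%m-%d")) %>%  # one group per calendar day
  filter(any(heat30 == "heat")) %>%             # keep complete days with >= 1 "heat" row
  ungroup()
```

filter(any(...)) is evaluated once per group, so a single hot measurement is enough to retain every row of that day.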
Below is one possible solution using the dataset provided in the question. Please note that this is not a great example as all days will probably include at least one observation marked as over 30 °C (i.e. there will be no days to filter out in this dataset but the code should do the job with the actual one).
# import packages
library(dplyr)
library(stringr)
# break the time stamp into Day and Hour
time_df <- as_data_frame(str_split(Measurings$Time, " ", simplify = T))
# name the columns
names(time_df) <- c("Day", "Hour")
# create a new measurement data frame with separate Day and Hour columns
new_measurings_df <- bind_cols(time_df, Measurings[-1])
# form the new data frame by filtering the days marked as heat
new_df <- new_measurings_df %>%
filter(Day %in% new_measurings_df$Day[new_measurings_df$heat30 == "heat"])
To be more precise, you are creating a random sample of 1000 hourly observations with temperatures varying between 20 and 35, spread across roughly 42 days. As a result, it is very likely that every single day will have at least one observation above 30 °C in your example. Additionally, it is always good practice to set a seed to ensure reproducibility.
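That seeding advice looks like this in practice (the seed value 42 is arbitrary; it just has to be fixed):

```r
set.seed(42)                                  # fix the RNG state so the draw repeats
Temp <- sample(20:35, 1000, replace = TRUE)   # same values on every run with this seed
```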
I need to calculate the number of days elapsed between multiple dates in two ways and then output those results to new columns: i) number of days that has elapsed as compared to the first date (e.g., RESULTS$FIRST) and ii) between sequential dates (e.g., RESULTS$BETWEEN). Here is an example with the desired results. Thanks in advance.
library(lubridate)
DATA = data.frame(DATE = mdy(c("7/8/2013", "8/1/2013", "8/30/2013", "10/23/2013",
"12/16/2013", "12/16/2015")))
RESULTS = data.frame(DATE = mdy(c("7/8/2013", "8/1/2013", "8/30/2013", "10/23/2013",
"12/16/2013", "12/16/2015")),
FIRST = c(0, 24, 53, 107, 161, 891), BETWEEN = c(0, 24, 29, 54, 54, 730))
#Using dplyr package
library(dplyr)
DATA %>% # your dataframe
mutate(BETWEEN0 = as.numeric(difftime(DATE, lag(DATE, 1))),
BETWEEN = ifelse(is.na(BETWEEN0), 0, BETWEEN0),
FIRST = cumsum(as.numeric(BETWEEN))) %>%
select(-BETWEEN0)
DATE BETWEEN FIRST
1 2013-07-08 0 0
2 2013-08-01 24 24
3 2013-08-30 29 53
4 2013-10-23 54 107
5 2013-12-16 54 161
6 2015-12-16 730 891
This will get you what you want:
d <- as.Date(DATA$DATE, format="%m/%d/%Y")
first <- c()
for (i in seq_along(d))
first[i] <- d[i] - d[1]
between <- c(0, diff(d))
This uses the as.Date() function in the base package to cast the vector of string dates to date values using the given format. Since you have dates as month/day/year, you specify format="%m/%d/%Y" to make sure it's interpreted correctly.
diff() is the lagged difference. Since it's lagged, it doesn't include the difference between element 1 and itself, so you can concatenate a 0.
Differences between Date objects are given in days by default.
Then constructing the output dataframe is simple:
RESULTS <- data.frame(DATE=DATA$DATE, FIRST=first, BETWEEN=between)
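Since subtraction on Date vectors is itself vectorized, the loop above can be collapsed into a single expression with the same result:

```r
d <- as.Date(DATA$DATE, format = "%m/%d/%Y")
first <- as.numeric(d - d[1])   # days elapsed since the first date
between <- c(0, diff(d))        # lagged differences, with 0 for the first row
```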
For the first part:
DATA = data.frame((c("7/8/2013", "8/1/2013", "8/30/2013", "10/23/2013","12/16/2013", "12/16/2015")))
names(DATA)[1] = "V1"
date = as.Date(DATA$V1, format="%m/%d/%Y")
print(date-date[1])
Result:
[1] 0 24 53 107 161 891
For the second part, simply use a for loop.
You can just add each column with the simple difftime and lagged diff calculations.
DATA$FIRST <- c(0,
with(DATA,
difftime(DATE[2:length(DATE)], DATE[1], units = "days")
)
)
DATA$BETWEEN <- c(0,
with(DATA,
diff(DATE)
)
)
identical(DATA, RESULTS)
[1] TRUE