How to set the unit for diff() in R?

I have a data frame with several different variables (e.g. location, species, date & time). I'm trying to find the difference between two timestamps within the same column, according to location and species.
What my data frame looks like:
dat <- data.frame(
  location = c("A","A","A","B","B","B","C","C","C"),
  ID = c("x","y","x","x","x","y","y","x","x"),
  datetime = c("2019-09-02 11:33:00","2019-09-03 10:00:00","2019-08-23 14:22:34",
               "2019-09-12 12:18:00","2019-09-15 09:40:00","2019-09-15 09:40:00",
               "2019-09-15 10:05:00","2019-08-23 13:58:18","2019-09-16 09:34:00")
)
I grouped my data frame by location and ID and calculated the time difference with this:
Data1 <- dat %>% group_by(location, ID)
Data2 <- mutate(Data1, diff = c(1000, diff(datetime)))
This successfully gives me the time difference, but for some reason they're randomly in different units (seconds, minutes, hours). I tried this instead:
Data2 <- mutate(Data1, diff = c(1000, diff(datetime, units = "mins")))
but the output doesn't change. Is there a way to set the units, and if not is there an alternative way to get the time difference in a data frame sorted by specific variables?

We can use difftime
library(dplyr)
Data1 %>%
  mutate(diff = difftime(datetime, lag(datetime, default = first(datetime)),
                         units = "mins"))
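Putting the reprex and the answer together into a runnable sketch: note that datetime in the example data is a character column, so it has to be parsed to POSIXct before any time arithmetic (ymd_hms from lubridate is assumed here), and the arrange() step is an addition, since per-group diffs only make sense on sorted timestamps:

```r
library(dplyr)
library(lubridate)

dat <- data.frame(
  location = c("A","A","A","B","B","B","C","C","C"),
  ID = c("x","y","x","x","x","y","y","x","x"),
  datetime = c("2019-09-02 11:33:00","2019-09-03 10:00:00","2019-08-23 14:22:34",
               "2019-09-12 12:18:00","2019-09-15 09:40:00","2019-09-15 09:40:00",
               "2019-09-15 10:05:00","2019-08-23 13:58:18","2019-09-16 09:34:00")
)

dat %>%
  mutate(datetime = ymd_hms(datetime)) %>%   # character -> POSIXct
  group_by(location, ID) %>%
  arrange(datetime, .by_group = TRUE) %>%    # sort within each group
  mutate(diff_mins = difftime(datetime, lag(datetime), units = "mins"))
```

The first row of each group gets NA rather than a sentinel like 1000, which is usually easier to work with downstream.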

Related

How to separate a time series panel by the number of missing observations at the end?

Consider a set of time series of the same length. Some have missing data at the end, because the product went out of stock or was delisted.
If a series ends with at least four missing observations (in my case value = 0 rather than NA), I consider it delisted.
In my time series panel, I want to separate the series with delisted ids from the others and create two different dataframes based on this separation.
I created a simple reprex to illustrate the problem:
library(tidyverse)
library(lubridate)
data <- tibble(
  id    = as.factor(c(rep("1", 24), rep("2", 24))),
  date  = rep(ymd("2013-01-01") + months(0:23), 2),
  value = c(c(rep(1, 17), 0, 0, 0, 0, 2, 2, 3), c(rep(9, 20), 0, 0, 0, 0))
)
I am searching for a pipeable tidyverse solution.
Here is one possibility to find delisted ids
data %>%
  group_by(id) %>%
  mutate(delisted = all(value[(n() - 3):n()] == 0)) %>%
  group_by(delisted) %>%
  group_split()
In the end, group_split() splits the data into two parts: one containing the delisted ids, the other containing the non-delisted ids.
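For reference, the two pieces can be pulled out of the group_split() result directly; a sketch using the reprex above (the names active and delisted are arbitrary, and the list is ordered FALSE before TRUE):

```r
library(tidyverse)
library(lubridate)

data <- tibble(
  id    = as.factor(c(rep("1", 24), rep("2", 24))),
  date  = rep(ymd("2013-01-01") + months(0:23), 2),
  value = c(c(rep(1, 17), 0, 0, 0, 0, 2, 2, 3), c(rep(9, 20), 0, 0, 0, 0))
)

split_data <- data %>%
  group_by(id) %>%
  mutate(delisted = all(value[(n() - 3):n()] == 0)) %>%
  ungroup() %>%
  group_split(delisted)

active   <- split_data[[1]]  # delisted == FALSE (id 1 ends 0,2,2,3)
delisted <- split_data[[2]]  # delisted == TRUE  (id 2 ends 0,0,0,0)
```

Note that id 1 is not delisted here even though it contains four zeros, because its last four observations are 0, 2, 2, 3.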

Converting two columns to rows to correspond with user_id from a time-series dataset

I have a time series dataset I need to modify and I'd like to convert two columns to rows with corresponding values for each user_id. Here is an image of the time series dataset called AllData.
[image: AllData time series dataset]
As you can see each user_id has 25 values for Week and Average.Steps. I was able to condense the dataset by user_id with the following code to create a new data set called Users. I've provided an image of the output following the code:
Users <- AllData %>%
  select(user_id, gender, age, income_level, Province, current_provider,
         Baseline, Experience, Engagement_pre, first_baseline) %>%
  distinct(user_id, gender, age, income_level, Province, current_provider,
           Baseline, Experience, Engagement_pre, first_baseline) %>%
  mutate(female = if_else(gender == 'Female', 1, 0))
[image: Users dataset]
Here each user_id is in a distinct row with values for the corresponding columns I selected. I would like to include Week and Average.Steps from AllData, but I don't know how to extend the code I used for Users so that each value of Week and Average.Steps corresponds to its user_id. Any assistance would be greatly appreciated!
Do you want the average from AllData or from distinct rows? There isn't enough data in the image to understand how or what is replicated.
This is the average by user and week.
library(tidyverse)

# find the averages by user and week
summaryAD <- AllData %>%
  group_by(user_id, Week) %>%
  summarise(steps = mean(Average.Steps))

# join the averages with your data
finalData <- left_join(Users, summaryAD)
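Since AllData is only shown as an image, here is a minimal made-up stand-in to illustrate the join; the column names follow the question, but every value is invented:

```r
library(dplyr)

# invented stand-in for AllData: two users, two weeks each
AllData <- tibble(
  user_id = rep(c(1, 2), each = 2),
  Week    = rep(1:2, times = 2),
  Average.Steps = c(4000, 5000, 6000, 7000),
  gender  = rep(c("Female", "Male"), each = 2)
)

# one row per user, as in the question's Users data frame
Users <- AllData %>%
  distinct(user_id, gender) %>%
  mutate(female = if_else(gender == "Female", 1, 0))

summaryAD <- AllData %>%
  group_by(user_id, Week) %>%
  summarise(steps = mean(Average.Steps), .groups = "drop")

# one row per user_id/Week pair, with the user-level columns repeated
finalData <- left_join(Users, summaryAD, by = "user_id")
```

The join re-expands Users back to one row per user and week, which is what "each value of Week and Average.Steps corresponds with user_id" amounts to.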

How to take multiple Sample() vector outputs and combine them into a data frame

Sorry if this is a somewhat basic question, but I want to take a random sample of data from each day (each day comes from a separate file) using the sample() function, and then combine the sampled rows from each day into a single week data frame containing only my sampled data.
Assume that the data is held in files named for their day, like mydata_2020_05_17.csv.
library(tidyverse)
readDay <- function(date, dir, sampleN) {
  path <- paste0(dir, "/", "mydata_", date, ".csv")
  read_csv(path) %>%
    as_tibble() %>%
    # You may not need this if the records already have the date
    mutate(DATE = date) %>%
    sample_n(sampleN, replace = FALSE)
}
Let's start on the first Sunday of the month:
answerWeek <- map_df(
  seq.Date(from = as_date("2020-05-03"), length.out = 6, by = 1),
  ~ readDay(.x, "~/nefarious/data", sampleN = 20)
)
NOT RUN because I don't have a folder full of dated csv data.
Let us know if I've mis-interpreted what you're looking for.
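To make the pattern exercisable without a real folder of dated CSVs, the same idea can be run against fabricated files in a temp directory (a sketch; the file layout and column x are invented):

```r
library(tidyverse)
library(lubridate)

dir <- tempdir()
dates <- as_date("2020-05-03") + 0:5

# fabricate one small csv per day
walk(dates, function(d) {
  tibble(x = rnorm(100)) %>%
    write_csv(file.path(dir, paste0("mydata_", d, ".csv")))
})

readDay <- function(date, dir, sampleN) {
  read_csv(file.path(dir, paste0("mydata_", date, ".csv")),
           show_col_types = FALSE) %>%
    mutate(DATE = date) %>%          # tag each row with its source day
    sample_n(sampleN, replace = FALSE)
}

# 6 days x 20 sampled rows = 120 rows in the week-level frame
answerWeek <- map_df(dates, ~ readDay(.x, dir, sampleN = 20))
```

sample_n() is the tidyverse counterpart of base sample(); in current dplyr, slice_sample(n = sampleN) does the same job.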

R aggregating irregular time series data by groups (with meta data)

Hi, I have a data frame (~4 million rows) with time series data for different sites and events.
Here is a rough idea of my data (obviously on a different scale). I have several similar time series, so I've kept it general, as I want to be able to apply the solution in different cases:
Data1 <- data.frame(
  DateTimes = as.POSIXct("1988-04-30 13:20:00") +
    c(1:10, 12:15, 20:30, 5:13, 16:20, 22:35) * 300,
  Site = c(rep("SiteA", 25), rep("SiteB", 28)),
  Quality = rep(25, 53),
  Value = round(runif(53, 0, 5), 2),
  Othermetadata = c(rep("E1", 10), rep("E2", 15), rep("E1", 10), rep("E2", 18))
)
What I'm looking for is a simple way to group and aggregate this data to different timesteps while keeping the metadata that doesn't vary within each group.
I have tried using the zoo library and zoo::aggregate, i.e.:
library(zoo)
zooData <- read.zoo(select(Data1, DateTimes, Value))
zooagg <- aggregate(zooData, time(zooData) - as.numeric(time(zooData)) %% 3600,
                    FUN = sum, reg = T)
However, when I do this I lose all my metadata and merge the different sites' data.
I wondered about using plyr or dplyr to split up the data and then applying the aggregation, but I'm still going to lose my other columns.
Is there a better way to do this? I had a brief look at the documentation for the xts library but couldn't see an intuitive solution there either.
Note: as I want this to work for a few different things, both the starting timestep and the final timestep might change, with the possibility of a random timestep, or a somewhat regular timestep with missing points. The FUN applied may vary (mostly sum or mean), as may the fields I want to split by.
Edit: I found the solution after Hercules Apergis pushed me in the right direction.
newData <- Data1 %>% group_by(timeagg, Site) %>% summarise(Total = sum(Value))
finaldata <- inner_join(Data1, newData) %>% select(-DateTimes, -Value) %>% distinct()
The original DateTimes column wasn't a grouping variable; it was the time series itself. So I added a grouping variable, timeagg, holding the aggregated time (here: the time rounded to the nearest hour) and summarised on that. The problem was that if I joined on this new column alone, I missed any points that fell within an hour but not exactly on the hour, hence the inner_join %>% select %>% distinct approach.
Now hopefully it works with my real data, not just example data!
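Spelling out the missing timeagg step as a runnable sketch: the hour bucket is computed with lubridate::floor_date here, which is an assumption, since any rounding that matches the target timestep would do:

```r
library(dplyr)
library(lubridate)

set.seed(1)
Data1 <- data.frame(
  DateTimes = as.POSIXct("1988-04-30 13:20:00") +
    c(1:10, 12:15, 20:30, 5:13, 16:20, 22:35) * 300,
  Site = c(rep("SiteA", 25), rep("SiteB", 28)),
  Quality = rep(25, 53),
  Value = round(runif(53, 0, 5), 2),
  Othermetadata = c(rep("E1", 10), rep("E2", 15), rep("E1", 10), rep("E2", 18))
)

# bucket each observation into its hour
Data1 <- Data1 %>% mutate(timeagg = floor_date(DateTimes, unit = "hour"))

newData <- Data1 %>%
  group_by(timeagg, Site) %>%
  summarise(Total = sum(Value), .groups = "drop")

# attach the hourly totals back, then drop the raw series and dedupe
finaldata <- inner_join(Data1, newData, by = c("timeagg", "Site")) %>%
  select(-DateTimes, -Value) %>%
  distinct()
```

Because every row carries its timeagg bucket, points falling mid-hour are kept, and metadata columns that are constant within a bucket collapse to one row under distinct().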
Given the function that you have on aggregation:
aggregate(zooData, time(zooData) - as.numeric(time(zooData)) %% 3600, FUN = sum, reg = T)
You want to sum the values by group of times without losing the other columns. You can do this simply with the dplyr package:
library(dplyr)
newdata <- Data1 %>% group_by(DateTimes) %>% summarise(Total = sum(Value))
finaldata <- inner_join(Data1, newdata, by = "DateTimes")
newdata is a data frame in which Value has been summed within each group of DateTimes. inner_join then merges the parts that are common to the two datasets by the DateTimes variable. Since I am not entirely sure what your desired output is, this should be a good start.

ggvis: plotting data in multiple series

Here is what I have:
A data frame which contains a date field, and a number of summary statistics.
Here's what I want:
I want a chart that allows me to compare the time series week over week, to see how the performance of the process this week compares to the previous one, for example.
What I have done so far:
## Get the week day name to display
summaryData$WeekDay <- format(summaryData$Date, format = '%A')
## Get the week number to differentiate the weeks
summaryData$Week <- format(summaryData$Date, format = '%V')

summaryData %>%
  ggvis(x = ~WeekDay, y = ~Referrers) %>%
  layer_lines(stroke = ~Week)
I expected it to create a chart with multiple coloured lines, each one representing a week in my data set, but it does not do what I expect.
Try looking at reshape2 to convert your data to long format with a factor variable for each week, or split up the data with a dplyr::lag() command.
A general way of plotting multiple columns in ggvis is to use the following format:
summaryData %>%
  ggvis() %>%
  layer_lines(x = ~WeekDay, y = ~Referrers) %>%
  layer_lines(x = ~WeekDay, y = ~Other)
I hope this helps.
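For the original week-over-week goal, ggvis also needs the data grouped so that each week is drawn as its own line; a sketch with an invented summaryData (note that ggvis is no longer actively developed, and ggplot2 is the usual choice today):

```r
library(dplyr)
library(ggvis)

# invented stand-in for summaryData: two weeks of daily counts
summaryData <- data.frame(
  Date = seq(as.Date("2020-01-06"), by = "day", length.out = 14),
  Referrers = c(5, 7, 6, 8, 9, 4, 3, 6, 8, 7, 9, 10, 5, 4)
)
summaryData$WeekDay <- format(summaryData$Date, format = "%A")
summaryData$Week <- format(summaryData$Date, format = "%V")

summaryData %>%
  group_by(Week) %>%                      # one line per week
  ggvis(x = ~WeekDay, y = ~Referrers) %>%
  layer_lines(stroke = ~Week)
```

WeekDay is a character column here, so the x axis will sort alphabetically; converting it to a factor with levels Monday through Sunday restores the natural ordering.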
