I have a df in long format with travel data.
The df looks like this:
id from to traveltime Key departuretime arrivaltime (next stop)
1 2 3 00:01:00 301 08:15:00 08:16:00
1 2 3 00:01:00 301 08:30:00 08:31:00
1 2 3 00:01:00 301 08:45:00 08:46:00
2 3 4 00:02:00 301
2 3 4 00:02:00 301
2 3 4 00:02:00 301
1 5 6 00:01:00 302 09:00:00 09:01:00
1 6 7 00:01:00 302 09:01:00 09:02:00
2 7 8 00:01:00 302
Now I want to fill the empty cells. The departure time of an edge is always the arrival time of the previous edge (which in turn is that edge's departure time plus its travel time). So the expected output is:
id from to traveltime Key departuretime arrivaltime (next stop)
1 2 3 00:01:00 301 08:15:00 08:16:00
1 2 3 00:01:00 301 08:30:00 08:31:00
1 2 3 00:01:00 301 08:45:00 08:46:00
2 3 4 00:02:00 301 08:16:00 08:18:00
2 3 4 00:02:00 301 08:31:00 08:33:00
2 3 4 00:02:00 301 08:46:00 08:48:00
1 5 6 00:01:00 302 09:00:00 09:01:00
1 6 7 00:01:00 302 09:01:00 09:02:00
2 7 8 00:01:00 302 09:02:00 09:03:00
I wrote some code that works fine, but I have to adapt it for every edge in my df:
data$arrivaltime <- data$departuretime + data$traveltime
data$departuretime[which(data$id == "2" & data$Key == "301")] <-
  data$arrivaltime[which(data$id == "1" & data$Key == "301")]
This works, but it is terribly time-consuming, because I would need to adapt the code for every edge.
What I want to do now is to automate my code, so that I don't have to change the id and Key parameters manually.
I guess that I need to store the Keys and the ids in a list and then build a loop that iterates through the df.
I'm new to R and I don't know how to build such a loop, so I hope that someone has an idea. Thank you in advance!
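A minimal sketch of one way to automate this with dplyr instead of a hand-written loop. The assumptions here are not from the question: the times are stored as numeric seconds (e.g. after hms::as_hms), within each Key the n-th row of each edge (id) belongs to the n-th run, and the first edge of every Key already has its departuretime filled.

library(dplyr)

filled <- data %>%
  group_by(Key, id) %>%
  mutate(run = row_number()) %>%        # label the n-th run of each edge
  group_by(Key, run) %>%
  arrange(id, .by_group = TRUE) %>%     # walk the edges of one run in order
  mutate(
    # departure = first departure plus all earlier travel times;
    # arrival   = departure plus this edge's travel time
    departuretime = first(departuretime) + cumsum(traveltime) - traveltime,
    arrivaltime   = departuretime + traveltime
  ) %>%
  ungroup() %>%
  select(-run)

Within one run, each edge's departure time is the previous edge's arrival time, which is exactly what the cumulative sum of travel times expresses.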
I have some air pollution data measured hourly:
Datetime             PM2.5   Station.id
2020-01-01 00:00:00  10      1
2020-01-01 01:00:00  NA      1
2020-01-01 02:00:00  15      1
2020-01-01 03:00:00  NA      1
2020-01-01 04:00:00  7       1
2020-01-01 05:00:00  20      1
2020-01-01 06:00:00  30      1
2020-01-01 00:00:00  NA      2
2020-01-01 01:00:00  17      2
2020-01-01 02:00:00  21      2
2020-01-01 03:00:00  55      2
I have a very large amount of data collected from many stations. Using R, what is the most efficient way to remove a day when it has (1) a total of at least 18 hours of missing data AND (2) a stretch of at least 8 hours of continuous missing data?
P.S. The original data may come either with the NA rows already removed or with explicit NAs inserted.
The "most efficient" way will almost certainly use data.table. Something like this:
library(data.table)
setDT(your_data)
your_data[, date := as.IDate(Datetime)][,
  # keep a day/station group unless it has >= 18 missing hours in total
  # AND a run of >= 8 consecutive missing hours
  # (the 0 guards groups that have no NAs at all)
  if (!(sum(is.na(PM2.5)) >= 18 &
        with(rle(is.na(PM2.5)), max(c(0, lengths[values]))) >= 8)) .SD,
  by = .(date, Station.id)
]
# date Station.id Datetime PM2.5
# 1: 2020-01-01 1 2020-01-01 00:00:00 10
# 2: 2020-01-01 1 2020-01-01 01:00:00 NA
# 3: 2020-01-01 1 2020-01-01 02:00:00 15
# 4: 2020-01-01 1 2020-01-01 03:00:00 NA
# 5: 2020-01-01 1 2020-01-01 04:00:00 7
# 6: 2020-01-01 1 2020-01-01 05:00:00 20
# 7: 2020-01-01 1 2020-01-01 06:00:00 30
Using this sample data (a Station.id column is added so the grouping above runs, and commas keep each datetime in a single field):
your_data = fread(text = 'Datetime, PM2.5, Station.id
2020-01-01 00:00:00, 10, 1
2020-01-01 01:00:00, NA, 1
2020-01-01 02:00:00, 15, 1
2020-01-01 03:00:00, NA, 1
2020-01-01 04:00:00, 7, 1
2020-01-01 05:00:00, 20, 1
2020-01-01 06:00:00, 30, 1')
I'm looking to aggregate some pedometer data, gathered in steps per minute, so I get the summed number of steps up until an EMA assessment. The EMA assessments happened four times per day. Examples of the two data sets are:
Pedometer Data
ID Steps Time
1 15 2/4/2020 8:32
1 23 2/4/2020 8:33
1 76 2/4/2020 8:34
1 32 2/4/2020 8:35
1 45 2/4/2020 8:36
...
2 16 2/4/2020 8:32
2 17 2/4/2020 8:33
2 0 2/4/2020 8:34
2 5 2/4/2020 8:35
2 8 2/4/2020 8:36
EMA Data
ID Time X Y
1 2/4/2020 8:36 3 4
1 2/4/2020 12:01 3 5
1 2/4/2020 3:30 4 5
1 2/4/2020 6:45 7 8
...
2 2/4/2020 8:35 4 6
2 2/4/2020 12:05 5 7
2 2/4/2020 3:39 1 3
2 2/4/2020 6:55 8 3
I'm looking to add the pedometer data to the EMA data as a new variable, where the number of steps taken is summed until the next EMA assessment. Ideally it would look something like this:
Combined Data
ID Time X Y Steps
1 2/4/2020 8:36 3 4 191
1 2/4/2020 12:01 3 5 [Sum of steps taken from 8:37 until 12:01 on 2/4/2020]
1 2/4/2020 3:30 4 5 [Sum of steps taken from 12:02 until 3:30 on 2/4/2020]
1 2/4/2020 6:45 7 8 [Sum of steps taken from 3:31 until 6:45 on 2/4/2020]
...
2 2/4/2020 8:35 4 6 38
2 2/4/2020 12:05 5 7 [Sum of steps taken from 8:36 until 12:05 on 2/4/2020]
2 2/4/2020 3:39 1 3 [Sum of steps taken from 12:06 until 3:39 on 2/4/2020]
2 2/4/2020 6:55 8 3 [Sum of steps taken from 3:40 until 6:55 on 2/4/2020]
I then need the process to continue over the entire 21 day EMA period, so the same process for the 4 EMA assessment time points on 2/5/2020, 2/6/2020, etc.
This has pushed me to the limit of my R skills, so any pointers would be extremely helpful! I'm most familiar with the tidyverse but am comfortable using base R as well. Thanks in advance for all advice.
Here's a solution using rolling joins from data.table. The basic idea is to roll each time from the pedometer data forward to the next time in the EMA data (while still matching on ID). Once the next EMA time is found, all that's left is to isolate the X and Y values and sum up Steps.
Data creation and prep:
library(data.table)
pedometer <- data.table(ID = sort(rep(1:2, 500)),
Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 09:35:00 EST"),
as.POSIXct("2020-02-08 17:00:00 EST"), length.out = 500), 2),
Steps = rpois(1000, 25))
EMA <- data.table(ID = sort(rep(1:2, 4*5)),
Time = rep(seq.POSIXt(as.POSIXct("2020-02-04 05:00:00 EST"),
as.POSIXct("2020-02-08 23:59:59 EST"), by = '6 hours'), 2),
X = sample(1:8, 2*4*5, rep = T),
Y = sample(1:8, 2*4*5, rep = T))
setkey(pedometer, Time)
setkey(EMA, Time)
EMA[, next_ema_time := Time]  # keep a copy: after the join, Time holds the pedometer times
And now the actual join and summation:
joined <- EMA[pedometer,
on = .(ID, Time),
roll = -Inf,
j = .(ID, Time, Steps, next_ema_time, X, Y)]
result <- joined[,.('X' = min(X),
'Y' = min(Y),
'Steps' = sum(Steps)),
.(ID, next_ema_time)]
result
#> ID next_ema_time X Y Steps
#> 1: 1 2020-02-04 11:00:00 1 2 167
#> 2: 2 2020-02-04 11:00:00 8 5 169
#> 3: 1 2020-02-04 17:00:00 3 6 740
#> 4: 2 2020-02-04 17:00:00 4 6 747
#> 5: 1 2020-02-04 23:00:00 2 2 679
#> 6: 2 2020-02-04 23:00:00 3 2 732
#> 7: 1 2020-02-05 05:00:00 7 5 720
#> 8: 2 2020-02-05 05:00:00 6 8 692
#> 9: 1 2020-02-05 11:00:00 2 4 731
#> 10: 2 2020-02-05 11:00:00 4 5 773
#> 11: 1 2020-02-05 17:00:00 1 5 757
#> 12: 2 2020-02-05 17:00:00 3 5 743
#> 13: 1 2020-02-05 23:00:00 3 8 693
#> 14: 2 2020-02-05 23:00:00 1 8 740
#> 15: 1 2020-02-06 05:00:00 8 8 710
#> 16: 2 2020-02-06 05:00:00 3 2 760
#> 17: 1 2020-02-06 11:00:00 8 4 716
#> 18: 2 2020-02-06 11:00:00 1 2 688
#> 19: 1 2020-02-06 17:00:00 5 2 738
#> 20: 2 2020-02-06 17:00:00 4 6 724
#> 21: 1 2020-02-06 23:00:00 7 8 737
#> 22: 2 2020-02-06 23:00:00 6 3 672
#> 23: 1 2020-02-07 05:00:00 2 6 726
#> 24: 2 2020-02-07 05:00:00 7 7 759
#> 25: 1 2020-02-07 11:00:00 1 4 737
#> 26: 2 2020-02-07 11:00:00 5 2 737
#> 27: 1 2020-02-07 17:00:00 3 5 766
#> 28: 2 2020-02-07 17:00:00 4 4 745
#> 29: 1 2020-02-07 23:00:00 3 3 714
#> 30: 2 2020-02-07 23:00:00 2 1 741
#> 31: 1 2020-02-08 05:00:00 4 6 751
#> 32: 2 2020-02-08 05:00:00 8 2 723
#> 33: 1 2020-02-08 11:00:00 3 3 716
#> 34: 2 2020-02-08 11:00:00 3 6 735
#> 35: 1 2020-02-08 17:00:00 1 5 696
#> 36: 2 2020-02-08 17:00:00 7 7 741
#> ID next_ema_time X Y Steps
Created on 2020-02-04 by the reprex package (v0.3.0)
I would left_join ema_df onto pedometer_df by ID and Time. This way you get all lines of pedometer_df, with missing values for X and Y (which I assume are identifiers) whenever a row is not an EMA assessment time.
I fill those values upward with the next available ones (i.e. the next EMA assessment's X and Y), and finally group by ID, X and Y and summarise to keep the datetime of the assessment (the max) and the sum of steps.
library(dplyr)
library(tidyr)
pedometer_df %>%
  left_join(ema_df, by = c("ID", "Time")) %>%
  fill(X, Y, .direction = "up") %>%
  group_by(ID, X, Y) %>%
  summarise(
    Time = max(Time),
    Steps = sum(Steps)
  )
I am trying to fill in the gaps in one of my time series by merging a full day time series into my original time series. But for some reason I get duplicate entries and all the rest of my data is NA.
My data looks like this:
> head(data)
TIME Water_Temperature
1 2016-08-22 00:00:00 81.000
2 2016-08-22 00:01:00 80.625
3 2016-08-22 00:02:00 85.000
4 2016-08-22 00:03:00 80.437
5 2016-08-22 00:04:00 85.000
6 2016-08-22 00:05:00 80.375
> tail(data)
TIME Water_Temperature
1398 2016-08-22 23:54:00 19.5
1399 2016-08-22 23:55:00 19.5
1400 2016-08-22 23:56:00 19.5
1401 2016-08-22 23:57:00 19.5
1402 2016-08-22 23:58:00 19.5
1403 2016-08-22 23:59:00 19.5
In between, some minutes are missing (1403 rows instead of 1440). I tried to fill them in using:
data.length <- length(data$TIME)
time.min <- data$TIME[1]
time.max <- data$TIME[data.length]
all.dates <- seq(time.min, time.max, by="min")
all.dates.frame <- data.frame(list(TIME=all.dates))
merged.data <- merge(all.dates.frame, data, all=T)
But that gives me a result of 1449 rows instead of 1440. The first nine minutes are duplicated in the time stamp column, and all other values in Water_Temperature are NA. It looks like this:
> merged.data[1:25,]
TIME Water_Temperature
1 2016-08-22 00:00:00 NA
2 2016-08-22 00:00:00 81.000
3 2016-08-22 00:01:00 NA
4 2016-08-22 00:01:00 80.625
5 2016-08-22 00:02:00 NA
6 2016-08-22 00:02:00 85.000
7 2016-08-22 00:03:00 NA
8 2016-08-22 00:03:00 80.437
9 2016-08-22 00:04:00 NA
10 2016-08-22 00:04:00 85.000
11 2016-08-22 00:05:00 NA
12 2016-08-22 00:05:00 80.375
13 2016-08-22 00:06:00 NA
14 2016-08-22 00:06:00 80.812
15 2016-08-22 00:07:00 NA
16 2016-08-22 00:07:00 80.812
17 2016-08-22 00:08:00 NA
18 2016-08-22 00:08:00 80.937
19 2016-08-22 00:09:00 NA
20 2016-08-22 00:10:00 NA
21 2016-08-22 00:11:00 NA
22 2016-08-22 00:12:00 NA
23 2016-08-22 00:13:00 NA
24 2016-08-22 00:14:00 NA
25 2016-08-22 00:15:00 NA
> tail(merged.data)
TIME Water_Temperature
1444 2016-08-22 23:54:00 NA
1445 2016-08-22 23:55:00 NA
1446 2016-08-22 23:56:00 NA
1447 2016-08-22 23:57:00 NA
1448 2016-08-22 23:58:00 NA
1449 2016-08-22 23:59:00 NA
Does anyone have an idea what's going wrong?
EDIT:
I'm now using the xts and zoo packages to do the job:
library(xts)
library(zoo)
df1.zoo <- zoo(data[, -1], data[, 1])
df2 <- as.data.frame(as.zoo(merge(
  as.xts(df1.zoo),
  as.xts(zoo(, seq(start(df1.zoo), end(df1.zoo), by = "min")))
)))
Very easy and effective!
Instead of merge, use rbind, which gives you an irregular time series without NAs to start with. If you really want a regular time series with a frequency of, say, 1 minute, you can build a time-based sequence as an index, merge it with your data afterwards (after using rbind), and fill the resulting NAs with na.locf. Hope this helps.
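A rough sketch of the regular-index part of that suggestion, using the column names from the question and assuming data$TIME is already POSIXct (the object names are illustrative, not tested against the real data):

library(zoo)

# build the full 1-minute index and merge it with the data
full.index <- data.frame(TIME = seq(min(data$TIME), max(data$TIME), by = "min"))
regular <- merge(full.index, data, by = "TIME", all.x = TRUE)

# fill the resulting NAs with the last observation carried forward
regular$Water_Temperature <- na.locf(regular$Water_Temperature, na.rm = FALSE)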
You can try merging with full_join from the tidyverse.
This works for me with data frames (daily values) sharing a column named Date; here my_data is a list of those data frames, and reduce comes from purrr:
library(dplyr)
library(purrr)
big_data <- my_data %>%
  reduce(full_join, by = "Date")
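For just two data frames you don't need the list/reduce step; a direct call does the same thing (df1 and df2 are placeholders for your two data frames):

big_data <- full_join(df1, df2, by = "Date")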
I have an R data.frame containing one value for every quarter of an hour:
Date A B
1 2015-11-02 00:00:00 0 0 //day start
2 2015-11-02 00:15:00 0 0
3 2015-11-02 00:30:00 0 0
4 2015-11-02 00:45:00 0 0
...
96 2015-11-02 23:45:00 0 0 //day end
97 2015-11-03 00:00:00 0 0 //new day
...
6 2016-03-23 01:15:00 0 0 //last record
I use xts to construct a time series:
xtsA <- xts(data$A, data$Date)
by using apply.daily I get the result I expect
apply.daily(xtsA, sum)
Date A
1 2015-11-02 23:45:00 400
2 2015-11-03 23:45:00 400
3 2015-11-04 23:45:00 500
but apply.weekly seems to use Monday as the last day of the week:
Date A
19 2016-03-07 00:45:00 6500 //Monday
20 2016-03-14 00:45:00 5500 //Monday
21 2016-03-21 00:45:00 5000 //Monday
and I do not understand why it uses 00:45:00. Does anyone know?
Data is imported from a CSV file; the Date column looks like this:
data <- read.csv("...", header=TRUE)
Date A
1 151102 0000 0
...
The error was in the date-time interpretation; using
data$Date <- as.POSIXct(strptime(data$Date, "%y%m%d %H%M"), tz = "GMT")
solves it, and now apply.weekly returns:
Date A
1 2015-11-08 23:45:00 3500 //Sunday
2 2015-11-15 23:45:00 4000 //Sunday
...
I have a data set that has dates and times for in and out. Each line is an in/out pair, but some are blank. I can remove the blanks with na.omit and a clean read-in (it was a CSV, and na.strings = c("") works in read.csv).
Of course, because the real world is never like the tutorial, some of the times are only dates, so my as.POSIXlt(Dataset$In, format = "%m/%d/%Y %H:%M") returns NA on the "only date, no time" rows.
na.omit does not remove these lines, so I have two questions:
Why doesn't na.omit work, or how can I get it to work?
Better: how can I convert one column into both dates and times (in the POSIX format) without two calls, or with some sort of optional parameter in the format string? (Or is this even possible?)
This is a sample of the dates and times. I can't share the real file: (1) it's huge, (2) it's PII.
Id,In,Out
1,8/15/2015 8:00,8/15/2015 17:00
1,8/16/2015 8:04,8/16/2015
1,8/17/2015 8:50,8/17/2015 18:00
1,8/18/2015,8/18/2015 17:00
2,8/15/2015,8/15/2015 13:00
2,8/16/2015 8:00,8/16/2015 17:00
3,8/15/2015 4:00,8/15/2015 11:00
3,8/16/2015 9:00,8/16/2015 19:00
3,8/17/2015,8/17/2015 17:00
3,,
4,,
4,8/16/2015 6:00,8/16/2015 20:00
DF <- read.table(text = "Id,In,Out
1,8/15/2015 8:00,8/15/2015 17:00
1,8/16/2015 8:04,8/16/2015
1,8/17/2015 8:50,8/17/2015 18:00
1,8/18/2015,8/18/2015 17:00
2,8/15/2015,8/15/2015 13:00
2,8/16/2015 8:00,8/16/2015 17:00
3,8/15/2015 4:00,8/15/2015 11:00
3,8/16/2015 9:00,8/16/2015 19:00
3,8/17/2015,8/17/2015 17:00", header = TRUE, sep = ",",
stringsAsFactors = FALSE) #set this option during import
# rows with no time part are shorter than 13 characters; append a midnight time
DF$In[nchar(DF$In) < 13] <- paste(DF$In[nchar(DF$In) < 13], "0:00")
DF$Out[nchar(DF$Out) < 13] <- paste(DF$Out[nchar(DF$Out) < 13], "0:00")
DF$In <- as.POSIXct(DF$In, format = "%m/%d/%Y %H:%M", tz = "GMT")
DF$Out <- as.POSIXct(DF$Out, format = "%m/%d/%Y %H:%M", tz = "GMT")
# Id In Out
#1 1 2015-08-15 08:00:00 2015-08-15 17:00:00
#2 1 2015-08-16 08:04:00 2015-08-16 00:00:00
#3 1 2015-08-17 08:50:00 2015-08-17 18:00:00
#4 1 2015-08-18 00:00:00 2015-08-18 17:00:00
#5 2 2015-08-15 00:00:00 2015-08-15 13:00:00
#6 2 2015-08-16 08:00:00 2015-08-16 17:00:00
#7 3 2015-08-15 04:00:00 2015-08-15 11:00:00
#8 3 2015-08-16 09:00:00 2015-08-16 19:00:00
#9 3 2015-08-17 00:00:00 2015-08-17 17:00:00
na.omit doesn't work with POSIXlt objects because it is documented to handle "vectors, matrices and data frames comprising vectors and matrices (only)" (see help("na.omit")), and in the strict sense POSIXlt objects are not vectors:
unclass(as.POSIXlt(DF$In))
#$sec
#[1] 0 0 0 0 0 0 0 0 0
#
#$min
#[1] 0 4 50 0 0 0 0 0 0
#
#$hour
#[1] 8 8 8 0 0 8 4 9 0
#
#$mday
#[1] 15 16 17 18 15 16 15 16 17
#
#$mon
#[1] 7 7 7 7 7 7 7 7 7
#
#$year
#[1] 115 115 115 115 115 115 115 115 115
#
#$wday
#[1] 6 0 1 2 6 0 6 0 1
#
#$yday
#[1] 226 227 228 229 226 227 226 227 228
#
#$isdst
#[1] 0 0 0 0 0 0 0 0 0
#
#attr(,"tzone")
#[1] "GMT"
There is hardly any reason to prefer POSIXlt over POSIXct, which internally is just the number of seconds since the origin and thus needs far less memory.
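Both points are easy to check (any POSIXct value will do):

x <- as.POSIXct("2015-08-15 08:00:00", tz = "GMT")
as.numeric(x)               # a single number: seconds since 1970-01-01 UTC
#> [1] 1439625600
object.size(as.POSIXlt(x))  # the list representation is several times larger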
You've been given a couple of strategies that bring these character values in and process them "in place". I almost never use as.POSIXlt, since there are so many pitfalls in dealing with the list-in-list structure it returns, especially considering its effective incompatibility with data frames. Here's a method that does the testing and coercion at read time by defining an as() method:
setOldClass("inTime", prototype="POSIXct")
setAs("character", "inTime",
function(from) structure( ifelse( is.na(as.POSIXct(from, format="%m/%d/%Y %H:%M") ),
as.POSIXct(from, format="%m/%d/%Y") ,
as.POSIXct(from, format="%m/%d/%Y %H:%M") ),
class="POSIXct" ) )
read.csv(text=txt, colClasses=c("numeric", 'inTime','inTime') )
Id In Out
1 1 2015-08-15 08:00:00 2015-08-15 17:00:00
2 1 2015-08-16 08:04:00 2015-08-16 00:00:00
3 1 2015-08-17 08:50:00 2015-08-17 18:00:00
4 1 2015-08-18 00:00:00 2015-08-18 17:00:00
5 2 2015-08-15 00:00:00 2015-08-15 13:00:00
6 2 2015-08-16 08:00:00 2015-08-16 17:00:00
7 3 2015-08-15 04:00:00 2015-08-15 11:00:00
8 3 2015-08-16 09:00:00 2015-08-16 19:00:00
9 3 2015-08-17 00:00:00 2015-08-17 17:00:00
The structure "envelope" is needed because of the rather strange behavior of ifelse, which otherwise would return a numeric object rather than an object of class-'POSIXct'.