calculations in data.table syntax [closed] - r

I would like to execute a self-join in data.table, to obtain the periods between time intervals.
Example data
active <- data.table(id=c(1,1,1,2,2,3), no=c(1,2,3,1,2,1), beg=as.POSIXct(c("2018-01-01 01:10:00","2018-01-01 01:30:00","2018-01-01 01:50:00","2018-01-01 01:30:00","2018-01-01 01:50:00","2018-01-01 01:50:00")), end=as.POSIXct(c("2018-01-01 01:20:00","2018-01-01 01:40:00","2018-01-01 02:00:00","2018-01-01 01:40:00","2018-01-01 02:00:00","2018-01-01 02:00:00")))
> active
id no beg end
1: 1 1 2018-01-01 01:10:00 2018-01-01 01:20:00
2: 1 2 2018-01-01 01:30:00 2018-01-01 01:40:00
3: 1 3 2018-01-01 01:50:00 2018-01-01 02:00:00
4: 2 1 2018-01-01 01:30:00 2018-01-01 01:40:00
5: 2 2 2018-01-01 01:50:00 2018-01-01 02:00:00
6: 3 1 2018-01-01 01:50:00 2018-01-01 02:00:00
What I want to obtain are the inactive periods in between the active ones:
> res
id no ibeg iend
1: 1 1 2018-01-01 01:20:00 2018-01-01 01:30:00
2: 1 2 2018-01-01 01:40:00 2018-01-01 01:50:00
3: 2 1 2018-01-01 01:40:00 2018-01-01 01:50:00
but my question is more generally about calculations in the syntax: when executing
res <- active[active, .(id=x.id, ibeg=i.end, iend=x.beg), on=.(no=(no-1), id=id)]
I match on on=.(no=no-1) but obtain an error message that column [no-1] cannot be found. I tried parentheses around no-1, but to no avail. Are calculations banned from the on= argument, or is there a trick?

You can use
# per id, pair each interval's end (all but the last) with the next interval's beg
inactive = active[, .(no=no[-.N], ibeg=end[-.N], iend=beg[-1]), by=id]
# id no ibeg iend
# 1: 1 1 2018-01-01 01:20:00 2018-01-01 01:30:00
# 2: 1 2 2018-01-01 01:40:00 2018-01-01 01:50:00
# 3: 2 1 2018-01-01 01:40:00 2018-01-01 01:50:00

Related

data.table in R: creating variables from x into i

I have two data tables,
a <- data.table(id=c(1,2,1,2,1,2), time=as.POSIXct(c("2018-01-01 01:10:00","2018-01-01 01:10:00","2018-01-01 01:11:00","2018-01-01 01:11:00","2018-01-01 01:12:00","2018-01-01 01:12:00")), beg=as.POSIXct(c("2018-01-01 01:00:00","2018-01-01 01:05:00","2018-01-01 01:00:00","2018-01-01 01:05:00","2018-01-01 01:01:00","2018-01-01 01:05:00")), end=as.POSIXct(c("2018-01-01 02:00:00","2018-01-01 02:05:00","2018-01-01 02:00:00","2018-01-01 02:05:00","2018-01-01 02:00:00","2018-01-01 02:05:00")))
> a
id time beg end
1: 1 2018-01-01 01:10:00 2018-01-01 01:00:00 2018-01-01 02:00:00
2: 2 2018-01-01 01:10:00 2018-01-01 01:05:00 2018-01-01 02:05:00
3: 1 2018-01-01 01:11:00 2018-01-01 01:00:00 2018-01-01 02:00:00
4: 2 2018-01-01 01:11:00 2018-01-01 01:05:00 2018-01-01 02:05:00
5: 1 2018-01-01 01:12:00 2018-01-01 01:01:00 2018-01-01 02:00:00
6: 2 2018-01-01 01:12:00 2018-01-01 01:05:00 2018-01-01 02:05:00
which has 650m lines by 4 columns, and
b <- data.table(id=c(1,2), abeg=as.POSIXct(c("2018-01-01 01:10:00","2018-01-01 01:11:00")), aend=as.POSIXct(c("2018-01-01 01:11:00","2018-01-01 01:12:00")))
> b
id abeg aend
1: 1 2018-01-01 01:10:00 2018-01-01 01:11:00
2: 2 2018-01-01 01:11:00 2018-01-01 01:12:00
which has about 13m lines by 7 columns.
I would like to join b into a but keep all lines and columns of a. I understand that this is a left-join and would execute it as
b[a, .(id=i.id, time=i.time, beg=i.beg, end=i.end, abeg=x.abeg, aend=x.aend), on=.(id=id, abeg<=time, aend>=time)]
to obtain
id time beg end abeg aend
1: 1 2018-01-01 01:10:00 2018-01-01 01:00:00 2018-01-01 02:00:00 2018-01-01 01:10:00 2018-01-01 01:11:00
2: 2 2018-01-01 01:10:00 2018-01-01 01:05:00 2018-01-01 02:05:00 <NA> <NA>
3: 1 2018-01-01 01:11:00 2018-01-01 01:00:00 2018-01-01 02:00:00 2018-01-01 01:10:00 2018-01-01 01:11:00
4: 2 2018-01-01 01:11:00 2018-01-01 01:05:00 2018-01-01 02:05:00 2018-01-01 01:11:00 2018-01-01 01:12:00
5: 1 2018-01-01 01:12:00 2018-01-01 01:01:00 2018-01-01 02:00:00 <NA> <NA>
6: 2 2018-01-01 01:12:00 2018-01-01 01:05:00 2018-01-01 02:05:00 2018-01-01 01:11:00 2018-01-01 01:12:00
However, executing this on a Mac took longer than 7 hours, at which point I had to abort. Joining on a 50m-row subset of a took about 8 minutes. I would like to avoid a loop over subsets, so I wonder whether the join can be made more efficient.
For example, I suspect the assignment operator := can be used somehow. In "data.table join then add columns to existing data.frame without re-copy" it is explained how this can be done when all variables in b are kept and amended with variables from a. However, I seem to have the reverse case: I want to keep all columns of a and amend them with columns from b.
Here's a join with update by reference which I think does what you intend to do:
a[b, on=.(id=id, time>=abeg, time<=aend), `:=`(abeg = i.abeg, aend = i.aend)]
The resulting a is then:
id time beg end abeg aend
1: 1 2018-01-01 01:10:00 2018-01-01 01:00:00 2018-01-01 02:00:00 2018-01-01 01:10:00 2018-01-01 01:11:00
2: 2 2018-01-01 01:10:00 2018-01-01 01:05:00 2018-01-01 02:05:00 <NA> <NA>
3: 1 2018-01-01 01:11:00 2018-01-01 01:00:00 2018-01-01 02:00:00 2018-01-01 01:10:00 2018-01-01 01:11:00
4: 2 2018-01-01 01:11:00 2018-01-01 01:05:00 2018-01-01 02:05:00 2018-01-01 01:11:00 2018-01-01 01:12:00
5: 1 2018-01-01 01:12:00 2018-01-01 01:01:00 2018-01-01 02:00:00 <NA> <NA>
6: 2 2018-01-01 01:12:00 2018-01-01 01:05:00 2018-01-01 02:05:00 2018-01-01 01:11:00 2018-01-01 01:12:00
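The reason this is so much faster than the original b[a, .(...)] version is that := updates a by reference: only the two new columns abeg and aend are written, and the existing columns of a are never copied into a new 650m-row result. One way to check that a was modified in place rather than copied (a small sketch using data.table's address()):

library(data.table)
addr_before <- address(a)                     # address of a before the update join
a[b, on=.(id=id, time>=abeg, time<=aend), `:=`(abeg = i.abeg, aend = i.aend)]
identical(addr_before, address(a))            # TRUE: columns were added in place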

R: data.table aggregate using external grouping vector

I have data
dt <- data.table(time=as.POSIXct(c("2018-01-01 01:01:00","2018-01-01 01:05:00","2018-01-01 01:01:00")), y=c(1,10,9))
> dt
time y
1: 2018-01-01 01:01:00 1
2: 2018-01-01 01:05:00 10
3: 2018-01-01 01:01:00 9
and I would like to aggregate by time. Usually, I would do
dt[,list(sum=sum(y),count=.N), by="time"]
time sum count
1: 2018-01-01 01:01:00 10 2
2: 2018-01-01 01:05:00 10 1
but this time, I would also like to get zero values for the minutes in between, i.e.,
time sum count
1: 2018-01-01 01:01:00 10 2
2: 2018-01-01 01:02:00 0 0
3: 2018-01-01 01:03:00 0 0
4: 2018-01-01 01:04:00 0 0
5: 2018-01-01 01:05:00 10 1
Could this be done, for example, using an external vector
times <- seq(from=min(dt$time),to=max(dt$time),by="mins")
that can be fed to the data.table function as a grouping variable?
You would typically do this with a join (either before or after the aggregation). For example:
dt <- dt[J(times), on = "time"]
dt[,list(sum=sum(y, na.rm = TRUE), count= sum(!is.na(y))), by=time]
# time sum count
#1: 2018-01-01 01:01:00 10 2
#2: 2018-01-01 01:02:00 0 0
#3: 2018-01-01 01:03:00 0 0
#4: 2018-01-01 01:04:00 0 0
#5: 2018-01-01 01:05:00 10 1
Or in a "piped" version:
dt[J(times), on = "time"][
  , .(sum = sum(y, na.rm = TRUE), count = sum(!is.na(y))),
  by = time]
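The other order, aggregating first and then joining the full minute grid, could look like this (a sketch, starting again from the original dt; the NA-to-zero step is spelled out explicitly):

agg <- dt[, .(sum = sum(y), count = .N), by = time]
res <- agg[J(times), on = "time"]              # keep every minute in times
res[is.na(count), `:=`(sum = 0, count = 0L)]   # minutes without observations become 0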

R: fast counting of rows that match vector of conditional

I have data
dt <- data.table(beg=as.POSIXct(c("2018-01-01 01:01:00","2018-01-01 01:05:00","2018-01-01 01:08:00")), end=as.POSIXct(c("2018-01-01 01:10:00","2018-01-01 01:10:00","2018-01-01 01:10:00")))
> dt
beg end
1: 2018-01-01 01:01:00 2018-01-01 01:10:00
2: 2018-01-01 01:05:00 2018-01-01 01:10:00
3: 2018-01-01 01:08:00 2018-01-01 01:10:00
and
times <- seq(from=min(dt$beg),to=max(dt$end),by="mins")
and I would like to count, as efficiently as possible, for each time in times, how many intervals in dt include that time.
I understand that
count <- NA
for (i in 1:length(times)) {
  count[i] <- sum(dt$beg < times[i] & dt$end > times[i])
}
would yield the solution
> data.table(times, count)
time count
1: 2018-01-01 01:01:00 0
2: 2018-01-01 01:02:00 1
3: 2018-01-01 01:03:00 1
4: 2018-01-01 01:04:00 1
5: 2018-01-01 01:05:00 1
6: 2018-01-01 01:06:00 2
7: 2018-01-01 01:07:00 2
8: 2018-01-01 01:08:00 2
9: 2018-01-01 01:09:00 3
10: 2018-01-01 01:10:00 0
but I am wondering whether there is a more time-efficient solution, e.g., using data.table.
This can be one solution:
times = as.data.table(times)
ans = dt[times, .(x.beg, x.end, i.x), on = .(beg < x, end > x), allow.cartesian = TRUE]
ans[, sum(!is.na(x.end)), by = .(i.x)]
i.x V1
1: 2018-01-01 01:01:00 0
2: 2018-01-01 01:02:00 1
3: 2018-01-01 01:03:00 1
4: 2018-01-01 01:04:00 1
5: 2018-01-01 01:05:00 1
6: 2018-01-01 01:06:00 2
7: 2018-01-01 01:07:00 2
8: 2018-01-01 01:08:00 2
9: 2018-01-01 01:09:00 3
10: 2018-01-01 01:10:00 0
Cheers!
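If the non-equi join grows too large (it materializes one row per time/interval overlap before counting), a base-R alternative that keeps the strict inequalities of the loop is to count interval starts and ends with findInterval(); a sketch:

# intervals strictly containing t: (# of beg < t) minus (# of end <= t)
count <- findInterval(times, sort(dt$beg), left.open = TRUE) -
         findInterval(times, sort(dt$end))
data.table(times, count)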

Changing unevenly spaced time data into evenly spaced hourly in R

I have some weather data that comes unevenly spaced, and I would like to reduce it to simple hourly values. I need hourly data so I can join it with a separate data.frame.
Example of the weather data:
> weather_df
# A tibble: 10 × 3
datetime temperature temperature_dewpoint
<dttm> <dbl> <dbl>
1 2011-01-01 00:00:00 4 -1
2 2011-01-01 00:20:00 3 -1
3 2011-01-01 00:40:00 3 -1
4 2011-01-01 01:00:00 2 -1
5 2011-01-01 01:20:00 2 0
6 2011-01-01 01:45:00 2 0
7 2011-01-01 02:05:00 1 -1
8 2011-01-01 02:25:00 2 0
9 2011-01-01 02:45:00 2 -1
10 2011-01-01 03:10:00 2 0
I would like to only have hourly data, but as you can see observations don't always fall on the hour mark. I've tried rounding but then I have multiple observations with the same time.
weather_df$datetime_rounded <- as.POSIXct(round(weather_df$datetime, units = c("hours")))
weather_df
# A tibble: 10 × 4
datetime temperature temperature_dewpoint datetime_rounded
<dttm> <dbl> <dbl> <dttm>
1 2011-01-01 00:00:00 4 -1 2011-01-01 00:00:00
2 2011-01-01 00:20:00 3 -1 2011-01-01 00:00:00
3 2011-01-01 00:40:00 3 -1 2011-01-01 01:00:00
4 2011-01-01 01:00:00 2 -1 2011-01-01 01:00:00
5 2011-01-01 01:20:00 2 0 2011-01-01 01:00:00
6 2011-01-01 01:45:00 2 0 2011-01-01 02:00:00
7 2011-01-01 02:05:00 1 -1 2011-01-01 02:00:00
8 2011-01-01 02:25:00 2 0 2011-01-01 02:00:00
9 2011-01-01 02:45:00 2 -1 2011-01-01 03:00:00
10 2011-01-01 03:10:00 2 0 2011-01-01 03:00:00
I can't easily determine which observation to keep without computing the difference between datetime and datetime_rounded. There must be a more elegant way to do this. Any help would be appreciated!
Here is my non-elegant solution.
I calculated the absolute distance between datetime and datetime_rounded
weather_df$time_dist <- abs(weather_df$datetime - weather_df$datetime_rounded)
Then I sorted by the distance
weather_df <- weather_df[order(weather_df$time_dist),]
Then I removed duplicates of the rounded column. Since it's sorted, this keeps the observation closest to the rounded hour.
weather_df <- weather_df[!duplicated(weather_df$datetime_rounded),]
Then I sorted back by time:
weather_df <- weather_df[order(weather_df$datetime_rounded),]
Surely there has to be a better way to do this. I'm not very familiar yet with working with time series in R.
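Since the rest of this page is about data.table, one possible more compact version of the same keep-the-closest-observation idea is below (a sketch; it assumes converting the tibble with setDT() is acceptable):

library(data.table)
setDT(weather_df)
weather_df[, datetime_rounded := as.POSIXct(round(datetime, units = "hours"))]
# within each rounded hour, keep the observation closest to the hour mark
hourly <- weather_df[order(abs(as.numeric(datetime - datetime_rounded))),
                     .SD[1], by = datetime_rounded]
setorder(hourly, datetime_rounded)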

How to statistic the historical data using R language?

I have a data.frame named A as the following:
uid uname csttime action_type
1 felix 2014-01-01 01:00:00 1
1 felix 2014-01-01 02:00:00 2
1 felix 2014-01-01 03:00:00 2
1 felix 2014-01-01 04:00:00 2
1 felix 2014-01-01 05:00:00 3
2 john 2014-02-01 01:00:00 1
2 john 2014-02-01 02:00:00 1
2 john 2014-02-01 03:00:00 1
2 john 2014-02-02 08:00:00 3
.......
I want to compute, for each <uid,uname,csttime> combination, statistics over the historical action_type values; for example, for <1,'felix','2014-01-01 03:00:00'> I want to know how many times each action_type occurred before that time. Here, for <1,'felix','2014-01-01 03:00:00'>, the count for action_type 1 is 1 and the count for action_type 2 is 1.
If I'm understanding your question correctly I believe there is a fairly simple dplyr answer.
library(dplyr)
group_by(stack, uid, uname, csttime) %>%
  count(uid, action_type)
This will yield:
uid action_type n
1 1 1 1
2 1 2 3
3 1 3 1
4 2 1 3
5 2 3 1
As you can see, this gives you each unique id, the action types they have taken, and the number of times. If you want to, say, change it to include the date, you can do
group_by(stack, uid, uname, csttime) %>%
  count(uid, csttime, action_type)
hope that helps.
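If the goal is really the running per-row history (how often each action_type occurred before the current csttime), one possible data.table sketch, assuming A is the data.frame from the question and action_type is numeric, is:

library(data.table)
setDT(A)
setorder(A, uid, csttime)
types <- sort(unique(A$action_type))
new_cols <- paste0("action_type_", types)     # e.g. action_type_1, action_type_2, ...
# for each row, count how many earlier rows of the same user had each action_type
A[, (new_cols) := lapply(types, function(k)
      cumsum(shift(action_type == k, fill = FALSE))),
  by = .(uid, uname)]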
