Counting group size including zero with R's data.table

I have a small (< 10M row) data.table with a variable that takes on integer values. I would like to generate a count of the number of times the variable takes on each integer value, including zeroes for values it never takes on.
For example, I might have:
library(data.table)
dt <- data.table(a = c(1,1,3,3,5,5,5))
My desired output is a data.table with values:
a N
1 2
2 0
3 2
4 0
5 3
This is an extremely basic question, but it is difficult to find data.table-specific answers for it. In my example, we can assume that the minimum is always 0, but the maximum variable value is unknown.

dt[, .N, by = .(a)                                       # count occurrences of each value present
][data.table(a = seq(min(dt$a), max(dt$a))), on = .(a)   # join the counts onto the full sequence of values
][is.na(N), N := 0][]                                    # values with no match get N = 0
# a N
# <int> <int>
# 1: 1 2
# 2: 2 0
# 3: 3 2
# 4: 4 0
# 5: 5 3
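If the values are small positive integers, a base-R tabulation gives the same counts without a join. A minimal sketch, assuming the values start at 1 (for a 0-based range you could count a factor with explicit levels instead):
# tabulate() counts occurrences of 1..nbins and returns 0 for values that never occur
data.table(a = seq_len(max(dt$a)), N = tabulate(dt$a, nbins = max(dt$a)))
# gives the same a/N pairs as above: N = 2, 0, 2, 0, 3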

Related

Data manipulation in R with data over time

Based on the R data.frame below, I am looking for an elegant solution to count the number of people transitioning between groups between times.
dat <- data.frame(people = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4),
                  time   = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
                  group  = c(5,4,4,3,2,4,4,3,2,1,5,5,4,4,4,3,3,2,2,1))
I would like a generalized solution as my problem is much larger in scale. I was considering that something with mutate could accomplish this, but I'm not sure where to start.
An example of the start of the output I am looking for would be this:
dat_result <- data.frame(time_start  = c(1,1,1,1,1),
                         time_end    = c(2,2,2,2,2),
                         group_start = c(1,1,1,1,1),
                         group_end   = c(1,2,3,4,5),
                         count       = "")
which would be repeated for all time transitions and all group transitions. Time is of course linear so 1 can only go to 2 and 2 to 3, etc. However, any group can transition to any other group including staying in the same group between times.
I'm using the data.table package because it makes it easy to work by groups. The same steps could be done with dplyr, but I'm not familiar with it.
library(data.table)
# Convert to data.table:
setDT(dat)
# Make sure your data is ordered by people and time:
setorder(dat, people, time)
# Create a new column with the next group
dat[, next.group := shift(group, -1), by = people]
# Remove rows where there's no change:
# (since this will remove data, you may want to assign the result to a new object)
new <- dat[group != next.group]
# Add end.time:
new[, end.time := shift(time, -1, max(dat$time)), by = people]
# Count the occurrences (and order the result):
new[, .N, by = .(time, end.time, group, next.group)][order(time, end.time, group)]
time end.time group next.group N
1: 1 3 5 4 1
2: 2 3 4 3 1
3: 2 4 3 2 1
4: 2 5 5 4 1
5: 3 4 3 2 1
6: 3 4 4 3 1
7: 4 5 2 1 2
8: 4 5 3 2 1
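For reference, a rough dplyr translation of the same steps might look like this (an untested sketch, starting again from the original dat data.frame):
library(dplyr)
dat %>%
  arrange(people, time) %>%
  group_by(people) %>%
  mutate(next.group = lead(group)) %>%                        # group at the next time point
  filter(group != next.group) %>%                             # keep only rows where the group changes
  mutate(end.time = lead(time, default = max(dat$time))) %>%  # time of the following change
  ungroup() %>%
  count(time, end.time, group, next.group) %>%
  arrange(time, end.time, group)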

Counting the instances of a variable that exceeds a threshold

I have a dataset with id and speed.
id <- c(1,1,1,1,2,2,2,2,3,3,3)
speed <- c(40,30,50,40,45,50,30,55,50,50,60)
i <- cbind(id, speed)
limit <- 35
Each time 'speed' crosses above 'limit' it should count as 1, and it should be counted again only after speed drops below the limit and crosses it again.
I want data to be like.
id | Speed Viol.
---|------------
 1 | 2
 2 | 2
 3 | 1
Here are the observations making up each violation, per id (violation number in parentheses):
id 1: (1) 40      (2) 50, 40
id 2: (1) 45, 50  (2) 55
id 3: (1) 50, 50, 60
How can I do this without using if()?
Here's a method using tapply, as suggested in the comments, on the original vectors.
tapply(speed, id, FUN=function(x) sum(c(x[1] > limit, diff(x > limit)) > 0))
1 2 3
2 2 1
tapply applies a function to each group, here, by ID. The function checks if the first element of an ID is over 35, and then concatenates this to the output of diff, whose argument is checking if subsequent observations are greater than 35. Thus diff checks if an ID returns to above 35 after dropping below that level. Negative values in the resulting vector are converted to FALSE (0) with > 0 and these results are summed.
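As a concrete illustration of the intermediate steps, for id 1 (speeds 40, 30, 50, 40):
x <- c(40, 30, 50, 40)                     # speeds for id 1
x > limit                                  # TRUE FALSE TRUE TRUE
diff(x > limit)                            # -1  1  0   (a 1 marks a fresh crossing above the limit)
c(x[1] > limit, diff(x > limit)) > 0       # TRUE FALSE TRUE FALSE
sum(c(x[1] > limit, diff(x > limit)) > 0)  # 2 violations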
tapply returns a named vector, which can be fairly nice to work with. However, if you want a data.frame, then you could use aggregate instead as suggested by d.b:
aggregate(speed, list(id=id), FUN=function(x) sum(c(x[1] > limit, diff(x > limit)) > 0))
id x
1 1 2
2 2 2
3 3 1
Here's a dplyr solution. I group by id then check if speed is above the limit in each row, but wasn't in the previous entry. (I get the previous row using lag). If this is the case, it produces TRUE. Or, if it's the first row for the id (i.e., row_number()==1) and it's above the limit, this gives a TRUE, too. Then, I sum all the TRUE values for each id using summarise.
id <- c(1,1,1,1,2,2,2,2,3,3,3)
speed <- c(40,30,50,40,45,50,30,55,50,50,60)
i <- data.frame(id, speed)
limit <- 35
library(dplyr)
i %>%
  group_by(id) %>%
  mutate(viol = (speed > limit & lag(speed) < limit) | (row_number() == 1 & speed > limit)) %>%
  summarise(sum(viol))
# A tibble: 3 x 2
id `sum(viol)`
<dbl> <int>
1 1 2
2 2 2
3 3 1
Here is another option with data.table,
library(data.table)
setDT(i)[, id1 := rleid(speed > limit), by = id][
speed > limit, .(violations = uniqueN(id1)), by = id][]
which gives,
id violations
1: 1 2
2: 2 2
3: 3 1
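This works because rleid() starts a new run id every time speed > limit flips, so counting the distinct run ids among the violating rows gives the number of separate violations. For id 2, for example:
rleid(c(45, 50, 30, 55) > limit)   # 1 1 2 3 -> the rows above the limit fall in runs 1 and 3, i.e. 2 violations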
aggregate(speed~id, data.frame(i), function(x) sum(rle(x>limit)$values))
# id speed
#1 1 2
#2 2 2
#3 3 1
The main idea is that x > limit will check for instances when the speed limit is violated and rle(x) will group those instances into consecutive violations or consecutive non-violations. Then all you need to do is to count the groups of consecutive violations (when rle(x>limit)$values is TRUE).
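For example, for id 3 (which never drops below the limit):
rle(c(50, 50, 60) > limit)$values        # TRUE -> a single run above the limit
sum(rle(c(50, 50, 60) > limit)$values)   # 1 violation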

Wrapping cumulative sum from a set starting row in R

I have a data frame that looks a bit like this:
wt <- data.frame(region = c(rep("A", 5), rep("B", 5)), time = c(1:5, 1:5),
                 start = c(rep(2, 5), rep(4, 5)), value = rep(1, 10))
The values in the value column could be any numbers (I am working in a very large data set), but each region will be over an equal-length time series and have a single starting point.
I want to perform a cumulative sum within each region that begins accumulating at the starting point, continues forward in the time series, and wraps to the rows before the starting point in the time series.
The full data table, WITH the intended result, would look like this:
region time start value result
A 1 2 1 5
A 2 2 1 1
A 3 2 1 2
A 4 2 1 3
A 5 2 1 4
B 1 4 1 3
B 2 4 1 4
B 3 4 1 5
B 4 4 1 1
B 5 4 1 2
A simple transformation of the time column followed by cumsum does not work, since cumsum cares about row order rather than the values of any particular column.
With that in mind, I am operating on a huge data table, and runtime is absolutely a concern, so any solution must avoid re-ordering rows.
Ideas of how to do this? Thanks in advance.
EDIT: Consider time to be a cycle, such as hours in a day. For example, if the start time is 2, observations start at one instance of time 2 and end at the next time 1.
We can do this in an efficient way with data.table
library(data.table)
setDT(wt)[time >= start, result := seq_len(.N), region]                  # number the rows from the start point onward
wt[, Max := max(result, na.rm = TRUE), region]                           # highest count reached before wrapping
wt[is.na(result), result := Max + seq_len(.N), region][, Max := NULL][]  # continue the count on the wrapped (pre-start) rows
# region time start value result
#1: A 1 2 1 5
#2: A 2 2 1 1
#3: A 3 2 1 2
#4: A 4 2 1 3
#5: A 5 2 1 4
#6: B 1 4 1 3
#7: B 2 4 1 4
#8: B 3 4 1 5
#9: B 4 4 1 1
#10: B 5 4 1 2
akrun's solution works for the example I gave (hence I accepted it as the answer), but here's a version that works for any values in the value column:
library(data.table)
setDT(wt)[time >= start, result := cumsum(value), region]
wt[, Max := max(result, na.rm = TRUE), region]
wt[is.na(result), result := Max + cumsum(value), region][, Max := NULL][]
Just adding the... unfortunately named cumsum function in place of a calculated sequence.
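A single-pass variant that avoids the temporary Max column is also possible. This is only a sketch, under the same assumption as above that rows are already ordered by time within each region (result2 is a hypothetical column name):
library(data.table)
setDT(wt)[, result2 := {
  after <- time >= start
  cs    <- cumsum(value * after)     # running total over rows at/after the start point
  wrap  <- cumsum(value * !after)    # running total over rows before the start point
  fifelse(after, cs, max(cs) + wrap) # pre-start rows continue from the post-start total
}, by = region]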

Aggregate in R taking way too long

I'm trying to count the unique values of x across groups y.
This is the function:
aggregate(x~y,z[which(z$grp==0),],function(x) length(unique(x)))
This is taking way too long (~6 hours and not done yet). I don't want to stop processing as I have to finish this tonight.
by() was taking too long as well
Any ideas what is going wrong and how I can reduce the processing time ~ 1 hour?
My dataset has 3 million rows and 16 columns.
Input dataframe z
x y grp
1 1 0
2 1 0
1 2 1
1 3 0
3 4 1
I want to get the count of unique (x) for each y where grp = 0
UPDATE: Using #eddi's excellent answer, I have:
x y
1: 2 1
2: 1 3
Any idea how I can quickly summarize this as the number of x's for each value y?
So for this it would be:
Number of x    y
          5    1
          1    3
Here you go:
library(data.table)
setDT(z) # to convert to data.table in place
z[grp == 0, uniqueN(x), by = y]
# y V1
#1: 1 2
#2: 3 1
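If you prefer a named column over the default V1, the same logic can be wrapped in .() (n_unique_x is just a hypothetical name):
z[grp == 0, .(n_unique_x = uniqueN(x)), by = y]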
library(dplyr)
z %>%
  filter(grp == 0) %>%
  group_by(y) %>%
  summarize(nx = n_distinct(x))
is the dplyr way, though it may not be as fast as data.table.

Get top k records per group, where k differs by group, in R data.table

I have two data.tables:
Values to extract the top k from, per group.
A mapping from each group to the number of values, k, to select for that group.
The question how to find the top N values by group or within category (groupwise) in an R data.frame addresses this when k does not vary by group. How can I do it when k differs by group? Here's sample data and the desired result:
Values:
(dt <- data.table(id=1:10,
                  group=c(rep(1, 5), rep(2, 5))))
# id group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 6 2
# 7: 7 2
# 8: 8 2
# 9: 9 2
# 10: 10 2
Mapping from group to k:
(group.k <- data.table(group=1:2,
                       k=2:3))
# group k
# 1: 1 2
# 2: 2 3
Desired result, which should include the first two records from group 1 and the first three records from group 2:
(result <- data.table(id=c(1:2, 6:8),
                      group=c(rep(1, 2), rep(2, 3))))
# id group
# 1: 1 1
# 2: 2 1
# 3: 6 2
# 4: 7 2
# 5: 8 2
Applying the solution to the above-linked question after merging returns this error:
merged <- merge(dt, group.k, by="group")
(result <- merged[, head(.SD, k), by=group])
# Error: length(n) == 1L is not TRUE
I'd rather do it as:
dt[group.k, head(.SD, k), by=.EACHI, on="group"]
because it makes the intended operation clear at a glance. j can be .SD[1:k], of course. Both of these expressions will very likely be (further) optimised for speed in the next release.
See this post for a detailed explanation of by = .EACHI until we wrap up those vignettes.
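Spelled out, the .SD[1:k] form mentioned above would be (same result on this data):
dt[group.k, .SD[1:k], by = .EACHI, on = "group"]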
After merging in the k by group, a similar approach to the solution at https://stackoverflow.com/a/14800271/1840471 can be applied; you just need unique() to avoid the length(n) error:
merged <- merge(dt, group.k, by="group")
(result <- merged[, head(.SD, unique(k)), by=group])
# group id k
# 1: 1 1 2
# 2: 1 2 2
# 3: 2 6 3
# 4: 2 7 3
# 5: 2 8 3
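For comparison, here is a dplyr sketch of the same idea (join the per-group k, then slice; untested here, and likely slower than the by = .EACHI join):
library(dplyr)
dt %>%
  inner_join(group.k, by = "group") %>%
  group_by(group) %>%
  slice(seq_len(first(k))) %>%   # k is constant within each group after the join
  ungroup() %>%
  select(id, group)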
