data.table equivalent of tidyr::complete() - r

tidyr::complete() adds rows to a data.frame for combinations of column values that are missing from the data. Example:
library(dplyr)
library(tidyr)
df <- data.frame(person = c(1,2,2),
observation_id = c(1,1,2),
value = c(1,1,1))
df %>%
tidyr::complete(person,
observation_id,
fill = list(value=0))
yields
# A tibble: 4 × 3
person observation_id value
<dbl> <dbl> <dbl>
1 1 1 1
2 1 2 0
3 2 1 1
4 2 2 1
where the value of the combination person == 1 and observation_id == 2 that is missing in df has been filled in with a value of 0.
What would be the equivalent of this in data.table?

I reckon that the philosophy of data.table entails fewer specially-named functions for tasks than you'll find in the tidyverse, so some extra coding is required, like:
res = setDT(df)[
CJ(person = person, observation_id = observation_id, unique=TRUE),
on=.(person, observation_id)
]
After this, you still have to manually handle the filling of values for missing levels. We can use setnafill to handle this efficiently & by-reference in recent versions of data.table:
setnafill(res, fill = 0, cols = 'value')
See @Jealie's answer regarding a feature that will sidestep this.
Certainly, it's crazy that the column names have to be entered three times here. But on the other hand, one can write a wrapper:
completeDT <- function(DT, cols, defs = NULL){
mDT = do.call(CJ, c(DT[, ..cols], list(unique=TRUE)))
res = DT[mDT, on=names(mDT)]
if (length(defs))
res[, names(defs) := Map(replace, .SD, lapply(.SD, is.na), defs), .SDcols=names(defs)]
res[]
}
completeDT(setDT(df), cols = c("person", "observation_id"), defs = c(value = 0))
person observation_id value
1: 1 1 1
2: 1 2 0
3: 2 1 1
4: 2 2 1
As a quick way of avoiding typing the names three times for the first step, here's @thelatemail's idea:
vars <- c("person","observation_id")
df[do.call(CJ, c(mget(vars), unique=TRUE)), on=vars]
# or with magrittr...
c("person","observation_id") %>% df[do.call(CJ, c(mget(.), unique=TRUE)), on=.]
Update: you no longer need to enter the names twice in CJ, thanks to an improvement by @MichaelChirico & @MattDowle.
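The shorter form mentioned in the update can be sketched as follows. This assumes a data.table version recent enough that CJ() auto-names its inputs from the column symbols; the object name DT is only for illustration.

```r
library(data.table)

# Same toy data as in the question, built directly as a data.table
DT <- data.table(person = c(1, 2, 2),
                 observation_id = c(1, 1, 2),
                 value = c(1, 1, 1))

# CJ() names its columns after the symbols passed in (assumes a recent
# data.table), so the names no longer need to be repeated inside CJ()
res <- DT[CJ(person, observation_id, unique = TRUE),
          on = .(person, observation_id)]

# Fill the NA introduced by the join, by reference
setnafill(res, fill = 0, cols = "value")
res
```

On older versions the explicit form CJ(person = person, observation_id = observation_id, unique = TRUE) shown above is the safer choice.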

There might be a better answer out there, but this works:
dt[CJ(person=unique(dt$person),
observation_id=unique(dt$observation_id)),
on=c('person','observation_id')]
Which gives:
person observation_id value
1: 1 1 1
2: 2 1 1
3: 1 2 NA
4: 2 2 1
Now, if you would like to fill with a value other than NA, I would suggest waiting for the corresponding feature to be finished, or contributing to it :)
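Until such a feature is available, the NAs can be replaced by hand after the join. Below is a minimal sketch using set(); the extra label column and the defaults list are hypothetical, added only to show that this works for non-numeric columns too.

```r
library(data.table)

# Toy data; 'label' is a hypothetical character column added for illustration
dt <- data.table(person = c(1, 2, 2),
                 observation_id = c(1, 1, 2),
                 value = c(1, 1, 1),
                 label = c("a", "b", "c"))

# Join onto every combination of the unique key values
res <- dt[CJ(person = unique(person),
             observation_id = unique(observation_id)),
          on = c("person", "observation_id")]

# Hypothetical per-column defaults; replace NAs by reference, column by column
defaults <- list(value = 0, label = "none")
for (col in names(defaults)) {
  set(res, i = which(is.na(res[[col]])), j = col, value = defaults[[col]])
}
res
```

Note that this also overwrites NAs that were already present in the original data, not just the ones created by the join.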

It is worth noting that the completeDT function above lacks many of the features of tidyr::complete. In particular, empty factor levels are dropped, whereas tidyr::complete keeps them. If you do want to keep empty factor levels, the function can be edited as below. The make_vals function below could be made more sophisticated to handle other variable classes, e.g. a full sequence for integers.
library(magrittr)
library(data.table)
dat <- data.frame(
person = c(1,2,2),
observation_id = factor(c(1,1,2), 1:3),
value = c(1,1,1))
dat %>%
tidyr::complete(
person, observation_id, fill = list(value=0))
#> # A tibble: 6 x 3
#> person observation_id value
#> <dbl> <fct> <dbl>
#> 1 1 1 1
#> 2 1 2 0
#> 3 1 3 0
#> 4 2 1 1
#> 5 2 2 1
#> 6 2 3 0
completeDT <- function(DT, cols, defs = NULL){
make_vals <- function(col) {
if(is.factor(col)) factor(levels(col), levels = levels(col)) # keep the original level order
else unique(col)
}
mDT = do.call(CJ, c(lapply(DT[, ..cols], make_vals), list(unique=TRUE)))
res = DT[mDT, on=names(mDT)]
if (length(defs))
res[, names(defs) := Map(replace, .SD, lapply(.SD, is.na), defs), .SDcols=names(defs)]
res[]
}
completeDT(DT = setDT(dat), cols = c("person", "observation_id"), defs = c(value = 0))
#> person observation_id value
#> 1: 1 1 1
#> 2: 1 2 0
#> 3: 1 3 0
#> 4: 2 1 1
#> 5: 2 2 1
#> 6: 2 3 0
Created on 2021-03-08 by the reprex package (v0.3.0)
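As one possible take on the "full sequence for integers" idea, make_vals can be given an extra branch. This is only a sketch: the names completeDT2 and dat2 are hypothetical, and it assumes that gaps should be filled between the observed minimum and maximum.

```r
library(data.table)

# Hypothetical extension of make_vals(): integers expand to their full range
make_vals <- function(col) {
  if (is.factor(col)) {
    factor(levels(col), levels = levels(col))             # keep unused levels
  } else if (is.integer(col)) {
    seq(min(col, na.rm = TRUE), max(col, na.rm = TRUE))   # fill gaps in the range
  } else {
    unique(col)
  }
}

completeDT2 <- function(DT, cols, defs = NULL) {
  # Cross join of the candidate values for each key column
  mDT <- do.call(CJ, c(lapply(DT[, ..cols], make_vals), list(unique = TRUE)))
  res <- DT[mDT, on = names(mDT)]
  if (length(defs))
    res[, names(defs) := Map(replace, .SD, lapply(.SD, is.na), defs),
        .SDcols = names(defs)]
  res[]
}

# Week 2 never occurs in the data but lies inside the observed range 1..3
dat2 <- data.table(person = c(1L, 1L, 2L),
                   week   = c(1L, 3L, 3L),
                   value  = c(5, 6, 7))
out <- completeDT2(dat2, cols = c("person", "week"), defs = c(value = 0))
out
```

Here the result has six rows (two persons by three weeks), with value set to 0 for the combinations that were absent.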

Related

Is there an R function to count strings by re-arranging the table [duplicate]

I am trying to obtain counts of each combination of levels of two variables, "week" and "id". I'd like the result to have "id" as rows, and "week" as columns, and the counts as the values.
Example of what I've tried so far (tried a bunch of other things, including adding a dummy variable = 1 and then fun.aggregate = sum over that):
library(plyr)
ddply(data, .(id), dcast, id ~ week, value_var = "id",
fun.aggregate = length, fill = 0, .parallel = TRUE)
However, I must be doing something wrong because this function is not finishing. Is there a better way to do this?
Input:
id week
1 1
1 2
1 3
1 1
2 3
Output:
1 2 3
1 2 1 1
2 0 0 1
You could just use the table command:
table(data$id,data$week)
1 2 3
1 2 1 1
2 0 0 1
If "id" and "week" are the only columns in your data frame, you can simply use:
table(data)
# week
# id 1 2 3
# 1 2 1 1
# 2 0 0 1
You don't need ddply for this. The dcast from reshape2 is sufficient:
dat <- data.frame(
id = c(rep(1, 4), 2),
week = c(1:3, 1, 3)
)
library(reshape2)
dcast(dat, id~week, fun.aggregate=length)
id 1 2 3
1 1 2 1 1
2 2 0 0 1
Edit: For a base R solution (other than table, as posted by Joshua Ulrich), try xtabs:
xtabs(~id+week, data=dat)
week
id 1 2 3
1 2 1 1
2 0 0 1
The reason ddply is taking so long is that the splitting by group is not run in parallel (only the computations on the 'splits' are). With a large number of groups it will therefore be slow, and .parallel = TRUE will not help.
An approach using data.table::dcast (data.table version >= 1.9.2) should be extremely efficient in time and memory. In this case, we can rely on default argument values and simply use:
library(data.table)
dcast(setDT(data), id ~ week)
# Using 'week' as value column. Use 'value.var' to override
# Aggregate function missing, defaulting to 'length'
# id 1 2 3
# 1: 1 2 1 1
# 2: 2 0 0 1
Or setting the arguments explicitly:
dcast(setDT(data), id ~ week, value.var = "week", fun.aggregate = length)
# id 1 2 3
# 1: 1 2 1 1
# 2: 2 0 0 1
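The same result can also be reached in two explicit steps, which may make it clearer what dcast is doing. A minimal sketch, assuming the input from the question; the names counts and wide are only for illustration.

```r
library(data.table)

# Input from the question
data <- data.table(id = c(1, 1, 1, 1, 2), week = c(1, 2, 3, 1, 3))

# Step 1: count rows per (id, week) pair; .N is the group size
counts <- data[, .N, by = .(id, week)]

# Step 2: reshape to wide, filling pairs that never occur with 0
wide <- dcast(counts, id ~ week, value.var = "N", fill = 0L)
wide
```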
For pre-data.table 1.9.2 alternatives, see edits.
A tidyverse option could be :
library(dplyr)
library(tidyr)
df %>%
count(id, week) %>%
pivot_wider(names_from = week, values_from = n, values_fill = list(n = 0))
#spread(week, n, fill = 0) #In older version of tidyr
# id `1` `2` `3`
# <dbl> <dbl> <dbl> <dbl>
#1 1 2 1 1
#2 2 0 0 1
Using only pivot_wider -
tidyr::pivot_wider(df, names_from = week,
values_from = week, values_fn = length, values_fill = 0)
Or using tabyl from janitor :
janitor::tabyl(df, id, week)
# id 1 2 3
# 1 2 1 1
# 2 0 0 1
data
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L), week = c(1L, 2L, 3L,
1L, 3L)), class = "data.frame", row.names = c(NA, -5L))

How to get the first value of a column that satisfies a condition with groupby in R?

I have the following data table
library(data.table)
dt <- data.table(id_resp = c(1,1,1,1,2,2,2,2), week=c(1,2,3,4,1,2,3,4), val=c(0,0,1,1,0,0,0,2))
I would like to get the first week that has val > 0 for every id_resp
Is there a neat way to do this in R ?
We can use .I in data.table
library(data.table)
dt[dt[, .I[first(which(val > 0))], by = id_resp]$V1, ]
# id_resp week val
#1: 1 3 1
#2: 2 4 2
Using dplyr, we could use slice using similar logic
library(dplyr)
dt %>%
group_by(id_resp) %>%
slice(first(which(val > 0)))
# id_resp week val
# <dbl> <dbl> <dbl>
#1 1 3 1
#2 2 4 2
If we are sure that every id_resp would have at least one val greater than 0, we can replace first and which with which.max.
dt[dt[, .I[which.max(val > 0)], by = id_resp]$V1, ]
and
dt %>% group_by(id_resp) %>% slice(which.max(val > 0))
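The caveat about which.max matters because which.max() returns 1 when no element is TRUE. The sketch below uses hypothetical data (dt2, where id_resp 3 never has val > 0) to show the difference.

```r
library(data.table)

# Hypothetical data: id_resp 3 never has val > 0
dt2 <- data.table(id_resp = c(1, 1, 3, 3),
                  week    = c(1, 2, 1, 2),
                  val     = c(0, 1, 0, 0))

# which.max() on an all-FALSE vector returns 1, so group 3 contributes
# its first row even though val is 0 there
wrong <- dt2[dt2[, .I[which.max(val > 0)], by = id_resp]$V1]

# which(...)[1] gives NA for group 3; dropping NA indices removes that group
idx   <- dt2[, .I[which(val > 0)[1]], by = id_resp]$V1
right <- dt2[idx[!is.na(idx)]]
```

In this example wrong has two rows (including one with val equal to 0), while right keeps only the row for id_resp 1.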
Using aggregate without assuming that dt is sorted:
aggregate(week ~ id_resp, data=dt[dt$val>0,], FUN=min)
# id_resp week
#1 1 3
#2 2 4
Getting the first value of a column that satisfies a condition can be done like:
aggregate(week ~ id_resp, data=dt[dt$val>0,], FUN=function(x) {x[1]})
# id_resp week
#1 1 3
#2 2 4
dt[val > 0][!duplicated(id_resp)]
# id_resp week val
# 1: 1 3 1
# 2: 2 4 2
We can use .SD to subset
dt[, .SD[which(val > 0)[1]], by = id_resp]
# id_resp week val
#1: 1 3 1
#2: 2 4 2
Or with .I
dt[dt[, .I[val > 0][1], id_resp]$V1]
# id_resp week val
#1: 1 3 1
#2: 2 4 2
If we need only specific column
dt[, .(week = week[which(val >0)[1]]), by = id_resp]
Or using dplyr
library(dplyr)
dt %>%
group_by(id_resp) %>%
filter(val > 0) %>%
slice(1)

Find index of first and last occurrence in data table

I have a data table that looks like
|userId|36|37|38|39|40|
|1|1|0|3|0|0|
|2|3|0|0|0|1|
where each numbered column (36-40) represents a week number. I want to calculate the number of weeks between the first occurrence of a non-zero value and the last.
For instance, for userId 1 in my dataset, the first value appears at week 36, and the last one appears at week 38, so the value I want is 2. For userId 2 it's 40-36 which is 4.
I would like to store the data like:
|userId|lifespan|
|1|2|
|2|4|
I'm struggling to do this, can someone please help?
The general method I would use is to melt the table, convert the character column names to numeric, and take the difference between the last and first non-zero week for each userId. Here is an example using data.table.
library(data.table)
dt <- fread("userId|36|37|38|39|40
1|1|0|3|0|0
2|3|0|0|0|1",
header = TRUE)
dt <- melt(dt, id.vars = "userId")
dt[, variable := as.numeric(as.character(variable))]
dt
# userId variable value
# 1: 1 36 1
# 2: 2 36 3
# 3: 1 37 0
# 4: 2 37 0
# 5: 1 38 3
# 6: 2 38 0
# 7: 1 39 0
# 8: 2 39 0
# 9: 1 40 0
# 10: 2 40 1
dt[!value == 0, .(lifespan = max(variable) - min(variable)), by = .(userId)]
# userId lifespan
# 1: 1 2
# 2: 2 4
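If reshaping is not wanted, the same quantity can be computed directly on the wide table. This is a sketch in base R, assuming every user has at least one non-zero week; the names wide, weeks and lifespan are only for illustration.

```r
# Wide table as in the question; check.names = FALSE keeps the numeric names
wide <- data.frame(userId = 1:2,
                   `36` = c(1, 3), `37` = c(0, 0), `38` = c(3, 0),
                   `39` = c(0, 0), `40` = c(0, 1),
                   check.names = FALSE)

# Week numbers taken from the column names (all columns except userId)
weeks <- as.numeric(names(wide)[-1])

# For each row: weeks with a non-zero value, then last minus first
lifespan <- unname(apply(wide[-1] != 0, 1,
                         function(nz) diff(range(weeks[nz]))))

data.frame(userId = wide$userId, lifespan = lifespan)
```

A user whose weeks are all zero would need separate handling, since range() of an empty vector is not meaningful here.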
Here's a dplyr method:
library(dplyr)
library(tidyr)
df %>%
gather(var, value, -userId) %>%
mutate(var = as.numeric(sub("X", "", var))) %>%
group_by(userId) %>%
slice(c(which.max(value!=0), max(which(value!=0)))) %>%
summarize(lifespan = var[2]-var[1])
Result:
# A tibble: 2 x 2
userId lifespan
<int> <dbl>
1 1 2
2 2 4
Data:
df = read.table(text = "userId|36|37|38|39|40
1|1|0|3|0|0
2|3|0|0|0|1", header = TRUE, sep = "|")
