I have a data.table in which I'd like to complete a column (fill in the missing values of a sequence); however, I'm having trouble filling in the other columns.
library(data.table)
library(zoo)  # for na.locf

dt = data.table(a = c(1, 3, 5), b = c('a', 'b', 'c'))
dt[, .(a = seq(min(a), max(a), 1), b = na.locf(b))]
# a b
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 4 a
# 5: 5 b
However, I'm looking for something more like this:
library(dplyr); library(tidyr)

dt %>%
  complete(a = seq(min(a), max(a), 1)) %>%
  mutate(b = na.locf(b))
# # A tibble: 5 x 2
# a b
# <dbl> <chr>
# 1 1 a
# 2 2 a
# 3 3 b
# 4 4 b
# 5 5 c
where the last value is carried forward.
Another possible solution, using only the (rolling) join capabilities of data.table:
dt[.(min(a):max(a)), on = .(a), roll = Inf]
which gives:
a b
1: 1 a
2: 2 a
3: 3 b
4: 4 b
5: 5 c
On large datasets this will probably outperform the other solutions, because the rolling join fills b directly instead of creating intermediate NA values to fill afterwards.
Credit to @Mako212, whose answer gave the hint of using seq.
My first posted solution, which also works but gives a warning:
dt[dt[, .(a = Reduce(":", a))], on = .(a), roll = Inf]
data.table recycles observations by default when you try dt[, .(a = seq(min(a), max(a), 1))], so it never generates any NA values for na.locf to fill. Pretty sure you need to use a join here to "complete" the cases, and then you can use na.locf to fill.
dt[dt[, .(a = min(a):max(a))], on = 'a'][, .(a, b = na.locf(b))]
This gives you the desired result without assigning the lookup table separately:
a b
1: 1 a
2: 2 a
3: 3 b
4: 4 b
5: 5 c
And I'll borrow @Jaap's min/max line to avoid creating a separate second table. So you can either use his rolling join solution or, if you want to use na.locf, this gets the same result.
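If you want to check the relative performance of the two approaches on data of your own size, a benchmark along these lines is a reasonable starting point (a minimal sketch, assuming the microbenchmark package and a made-up larger table; timings will depend on your data):

library(data.table)
library(zoo)
library(microbenchmark)

# hypothetical larger table with gaps in 'a'
big <- data.table(a = sort(sample(1e6, 1e5)), b = sample(letters, 1e5, TRUE))

microbenchmark(
  roll_join = big[.(min(a):max(a)), on = .(a), roll = Inf],
  join_locf = big[big[, .(a = min(a):max(a))], on = 'a'][, .(a, b = na.locf(b))],
  times = 10
)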
Let's say I have the df below:
df <- data.table(id = c(1, 2, 2, 3)
, datee = as.Date(c('2022-01-01', '2022-01-02', '2022-01-02', '2022-01-03'))
); df
id datee
1: 1 2022-01-01
2: 2 2022-01-02
3: 2 2022-01-02
4: 3 2022-01-03
and I want to keep only the non-duplicated rows:
df[!duplicated(id, datee)]
id datee
1: 1 2022-01-01
2: 2 2022-01-02
3: 3 2022-01-03
which worked.
However, with the below df_1
df_1 <- data.table(a = c(1,1,2)
, b = c(1,1,3)
); df_1
a b
1: 1 1
2: 1 1
3: 2 3
using the same method does not get rid of the duplicated rows:
df_1[!duplicated(a, b)]
a b
1: 1 1
2: 1 1
3: 2 3
What am I doing wrong?
Let's dive into why your df_1[!duplicated(a, b)] doesn't work.
duplicated uses S3 method dispatch.
library(data.table)
.S3methods("duplicated")
# [1] duplicated.array duplicated.data.frame
# [3] duplicated.data.table* duplicated.default
# [5] duplicated.matrix duplicated.numeric_version
# [7] duplicated.POSIXlt duplicated.warnings
# see '?methods' for accessing help and source code
Looking at those, we aren't using duplicated.data.table since we're calling it with individual vectors (it has no idea it is being called from within a data.table context), so it makes sense to look into duplicated.default.
> debugonce(duplicated.default)
> df_1[!duplicated(a, b)]
debugging in: duplicated.default(a, b)
debug: .Internal(duplicated(x, incomparables, fromLast, if (is.factor(x)) min(length(x),
nlevels(x) + 1L) else nmax))
Browse[2]> match.call() # ~ "how this function was called"
duplicated.default(x = a, incomparables = b)
Confirming with ?duplicated:
x: a vector or a data frame or an array or 'NULL'.
incomparables: a vector of values that cannot be compared. 'FALSE' is
a special value, meaning that all values can be compared, and
may be the only value accepted for methods other than the
default. It will be coerced internally to the same type as
'x'.
From this we can see that a is being used for deduplication, while b is being used as incomparables. Because b contains the value 1, which is the duplicated value in a, the rows where a == 1 are not tested for duplication.
To confirm: if we change b so that it does not share (duplicated) values with a, we see that the deduplication of a works as intended (though b's dupes are still silently ignored because of the argument problem):
df_1 <- data.table(a = c(1,1,2) , b = c(2,2,4))
df_1[!duplicated(a, b)] # accidentally correct, `b` is not used
# a b
# <num> <num>
# 1: 1 2
# 2: 2 4
unique(df_1, by = c("a", "b"))
# a b
# <num> <num>
# 1: 1 2
# 2: 2 4
df_2 <- data.table(a = c(1,1,2) , b = c(2,3,4))
df_2[!duplicated(a, b)] # wrong, `b` is not considered
# a b
# <num> <num>
# 1: 1 2
# 2: 2 4
unique(df_2, by = c("a", "b"))
# a b
# <num> <num>
# 1: 1 2
# 2: 1 3
# 3: 2 4
(Note that unique above is actually data.table:::unique.data.table, another S3 method dispatch provided by the data.table package.)
debug and debugonce are your friends :-)
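For completeness, a minimal sketch of the fix: pass the whole data.table (so that duplicated.data.table is dispatched) and name the columns via by, or simply use unique():

library(data.table)
df_1 <- data.table(a = c(1, 1, 2), b = c(1, 1, 3))

# dispatches duplicated.data.table and deduplicates on both columns
df_1[!duplicated(df_1, by = c("a", "b"))]

# equivalent, and arguably more idiomatic
unique(df_1, by = c("a", "b"))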
I have a data table that contains several patterns for going from a to c. These patterns are assigned to different expeditions. I want to extract similar patterns for the different expedition_id.
dt <- data.table(departure = c('a', 'a', 'a', 'b', 'a', 'd', 'a', 'b'),
                 arrival = c('a', 'a', 'b', 'c', 'd', 'c', 'b', 'c'),
                 expedition_id = c(1, 2, 1, 1, 3, 3, 2, 2))
>dt
departure arrival expedition_id
a a 1
a a 2
a b 1
b c 1
a d 3
d c 3
a b 2
b c 2
The result I am trying to get is a separate data table for each unique pattern:
>dt1
departure arrival expedition_list
a a 1,2
a b 1,2
b c 1,2
>dt2
departure arrival expedition_list
a d 3
d c 3
I'd appreciate your help on this one.
You can try:
library(data.table)
dt <- dt[, .(expedition_list = toString(expedition_id)), by = .(departure, arrival)]
dt_list <- split(dt, dt$expedition_list)
list2env(
setNames(
dt_list,
paste0('dt', 1:length(dt_list))
),
.GlobalEnv
)
Output:
dt1
departure arrival expedition_list
1: a a 1, 2
2: a b 1, 2
3: b c 1, 2
dt2
departure arrival expedition_list
1: a d 3
2: d c 3
You asked for data.table, but for others this dplyr version might also be helpful:
library(dplyr)

dt %>%
  group_by(departure, arrival) %>%
  summarise(expedition_list = paste(expedition_id, collapse = ","))
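If you'd rather not write objects into the global environment with list2env, a sketch that keeps the result as a named list instead (same grouping logic as above):

library(data.table)

dt_agg <- dt[, .(expedition_list = toString(expedition_id)), by = .(departure, arrival)]
dt_list <- split(dt_agg, dt_agg$expedition_list)
names(dt_list) <- paste0("dt", seq_along(dt_list))

dt_list$dt1  # access the individual tables by name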
It's late and I can't figure this out. I'm using lubridate and dplyr.
My data is as follows:
table1 = 1 observation per subject, with a date
table2 = 1 or more observations per subject, with associated dates
When I left join, I actually add observations, because multiple records in table 2 match the key. How can I make this a conditional join so that only one matching record from table 2 is joined, namely the one whose date is closest to the date in table 1?
Sorry if this was verbose.
Use the data.table package to join, with roll = "nearest" to get the nearest match.
library(data.table)
dt1 <- data.table( id = 1:10, date = 1:10, stringsAsFactors = FALSE )
dt2 <- data.table( date = 6:15, letter = letters[1:10], stringsAsFactors = FALSE )
dt1[, letter := dt2[dt1, letter, on = "date", roll = "nearest"] ][]
# id date letter
# 1: 1 1 a
# 2: 2 2 a
# 3: 3 3 a
# 4: 4 4 a
# 5: 5 5 a
# 6: 6 6 a
# 7: 7 7 b
# 8: 8 8 c
# 9: 9 9 d
# 10: 10 10 e
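In your actual setup, where both tables also carry a subject identifier, the same idea extends to joining on the subject and rolling on the date. A minimal sketch, assuming hypothetical column names subject and date in both tables:

library(data.table)
setDT(table1)
setDT(table2)

# for each row of table1, take the table2 row with the same subject and the
# nearest date; roll = "nearest" applies to the last column listed in 'on'
table2[table1, on = .(subject, date), roll = "nearest"]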
Apologies if this has been answered. I've gone through numerous examples today but I can't find any that match what I am trying to do.
I have a data set for which I need to calculate a 3-point moving average. I've generated some dummy data below:
set.seed(1234)
df <- data.frame(Week = rep(seq(1:5), 3),
                 Section = c(rep("a", 5), rep("b", 5), rep("c", 5)),
                 Qty = runif(15, min = 100, max = 500),
                 To = runif(15, min = 40, max = 80))
I want to calculate the MA for each group, based on the 'Section' column, for both the 'Qty' and 'To' columns. Ideally the output would be a data.table. The moving average should start at Week 3, so it would be the average of weeks 1:3.
I am trying to master the data.table package, so a solution using that would be great, but otherwise any will be much appreciated.
Just for reference, my actual data set will have approx. 70 sections and c. 1M rows in total. I've found data.table to be extremely fast at crunching these kinds of volumes so far.
We could use rollmean from the zoo package in combination with data.table:
library(data.table)
library(zoo)
setDT(df)[, c("Qty.mean","To.mean") := lapply(.SD, rollmean, k = 3, fill = NA, align = "right"),
.SDcols = c("Qty","To"), by = Section]
> df
# Week Section Qty To Qty.mean To.mean
#1: 1 a 145.4814 73.49183 NA NA
#2: 2 a 348.9198 51.44893 NA NA
#3: 3 a 343.7099 50.67283 279.3703 58.53786
#4: 4 a 349.3518 47.46891 347.3271 49.86356
#5: 5 a 444.3662 49.28904 379.1426 49.14359
#6: 1 b 356.1242 52.66450 NA NA
#7: 2 b 103.7983 52.10773 NA NA
#8: 3 b 193.0202 46.36184 217.6476 50.37802
#9: 4 b 366.4335 41.59984 221.0840 46.68980
#10: 5 b 305.7005 48.75198 288.3847 45.57122
#11: 1 c 377.4365 72.42394 NA NA
#12: 2 c 317.9899 61.02790 NA NA
#13: 3 c 213.0934 76.58633 302.8400 70.01272
#14: 4 c 469.3734 73.25380 333.4856 70.28934
#15: 5 c 216.9263 41.83081 299.7977 63.89031
A solution using dplyr (note that mutate_each is deprecated in current dplyr; across() is the modern replacement):
library(dplyr); library(zoo)
myfun = function(x) rollmean(x, k = 3, fill = NA, align = "right")
df %>% group_by(Section) %>% mutate_each(funs(myfun), Qty, To)
#### Week Section Qty To
#### (int) (fctr) (dbl) (dbl)
#### 1 1 a NA NA
#### 2 2 a NA NA
#### 3 3 a 279.3703 58.53786
#### 4 4 a 347.3271 49.86356
There is now a faster approach using the new frollmean function introduced in data.table 1.12.0:
setDT(df)[, c("Qty.mean","To.mean") := frollmean(.SD, 3),
.SDcols = c("Qty","To"),
by = Section]
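For reference, frollmean's defaults (fill = NA, align = "right") already match the rollmean call above; spelling them out makes the equivalence explicit (a sketch, assuming data.table >= 1.12.0):

library(data.table)

setDT(df)[, c("Qty.mean", "To.mean") := frollmean(.SD, n = 3, fill = NA, align = "right"),
          .SDcols = c("Qty", "To"), by = Section]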
I have a data set with individuals (ID) that can be part of more than one group.
Example:
library(data.table)
DT <- data.table(
ID = rep(1:5, c(3:1, 2:3)),
Group = c("A", "B", "C", "B",
"C", "A", "A", "C",
"A", "B", "C")
)
DT
# ID Group
# 1: 1 A
# 2: 1 B
# 3: 1 C
# 4: 2 B
# 5: 2 C
# 6: 3 A
# 7: 4 A
# 8: 4 C
# 9: 5 A
# 10: 5 B
# 11: 5 C
I want to know, for each pair of groups, how many individuals they have in common.
The result should look like this:
Group.1 Group.2 Sum
A B 2
A C 3
B C 3
Where Sum indicates the number of individuals the two groups have in common.
Here's my version:
# size-1 IDs can't contribute; skip
DT[ , if (.N > 1)
# simplify = FALSE returns a list;
# transpose turns the 3-length list of 2-length vectors
# into a length-2 list of 3-length vectors (efficiently)
transpose(combn(Group, 2L, simplify = FALSE)), by = ID
][ , .(Sum = .N), keyby = .(Group.1 = V1, Group.2 = V2)]
With output:
# Group.1 Group.2 Sum
# 1: A B 2
# 2: A C 3
# 3: B C 3
As of version 1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to do non-equi joins. So, a self non-equi join can be used:
library(data.table) # v1.9.8+
setDT(DT)[, Group := factor(Group)]  # factor, so 'Group < Group' is compared on the underlying integer codes
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)][
, .N, by = .(x.Group, i.Group)]
x.Group i.Group N
1: A B 2
2: A C 3
3: B C 3
Explanation
The non-equi join on ID, Group < Group is a data.table version of combn() (but applied group-wise):
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)]
ID x.Group i.Group
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 B C
5: 4 A C
6: 5 A B
7: 5 A C
8: 5 B C
We self-join with the same dataset on 'ID', subset the rows where the 'Group' columns are different, get the number of rows (.N) grouped by the 'Group' columns, sort the 'Group.1' and 'Group.2' columns row-wise using pmin/pmax, and take the unique value of 'N'.
library(data.table) # v1.9.6+
DT[DT, on='ID', allow.cartesian=TRUE][Group!=i.Group, .N ,.(Group, i.Group)][,
list(Sum=unique(N)) ,.(Group.1=pmin(Group, i.Group), Group.2=pmax(Group, i.Group))]
# Group.1 Group.2 Sum
#1: A B 2
#2: A C 3
#3: B C 3
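A quick illustration of the pmin/pmax trick used above to order each pair row-wise, so that e.g. ("B", "A") and ("A", "B") collapse to the same key (a small standalone sketch):

g1 <- c("B", "A", "C")
g2 <- c("A", "C", "B")
data.frame(Group.1 = pmin(g1, g2), Group.2 = pmax(g1, g2))
#   Group.1 Group.2
# 1       A       B
# 2       A       C
# 3       B       C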
Or, as mentioned in the comments by @MichaelChirico and @Frank, we can convert 'Group' to factor class, subset the rows based on as.integer(Group) < as.integer(i.Group), group by 'Group' and 'i.Group', and get the number of rows (.N):
DT[, Group:= factor(Group)]
DT[DT, on='ID', allow.cartesian=TRUE][as.integer(Group) < as.integer(i.Group), .N,
by = .(Group.1= Group, Group.2= i.Group)]
Great answers above.
Just an alternative using dplyr, in case you or someone else is interested.
library(dplyr)
cmb = combn(unique(DT$Group), 2)
data.frame(g1 = cmb[1, ],
           g2 = cmb[2, ]) %>%
  group_by(g1, g2) %>%
  summarise(l = length(intersect(DT[DT$Group == g1, ]$ID,
                                 DT[DT$Group == g2, ]$ID)))
# g1 g2 l
# (fctr) (fctr) (int)
# 1 A B 2
# 2 A C 3
# 3 B C 3
Yet another solution (base R):
tmp <- split(DT, DT$Group)
ans <- apply(combn(names(tmp), 2), 2, FUN = function(ind){
  out <- length(intersect(tmp[[ind[1]]]$ID, tmp[[ind[2]]]$ID))
  c(group1 = ind[1], group2 = ind[2], sum_ = out)
})
data.frame(t(ans))
# group1 group2 sum_
#1 A B 2
#2 A C 3
#3 B C 3
First split the data into a list of groups; then, for each unique pairwise combination of two groups, count how many subjects they have in common using length(intersect(...)).