This is what my data table looks like:
library(data.table)
dt <- fread('
Product Group LastProductOfPriorGroup
A 1 NA
B 1 NA
C 2 B
D 2 B
E 2 B
F 3 E
G 3 E
')
The LastProductOfPriorGroup column is my desired column. I am trying to fetch the product from the last row of the prior group. The first two rows have no prior group, so the value is NA. In the third row, the product in the last row of the prior group (group 1) is B. I am trying to accomplish this with
dt[,LastGroupProduct:= shift(Product,1), by=shift(Group,1)]
to no avail.
You could do
dt[, newcol := shift(dt[, last(Product), by = Group]$V1)[.GRP], by = Group]
This results in the following updated dt, where newcol matches your desired column with the unnecessarily long name. ;)
   Product Group LastProductOfPriorGroup newcol
1:       A     1                      NA     NA
2:       B     1                      NA     NA
3:       C     2                       B      B
4:       D     2                       B      B
5:       E     2                       B      B
6:       F     3                       E      E
7:       G     3                       E      E
Let's break the code down from the inside out. I will use ... to denote the accumulated code:
dt[, last(Product), by = Group]$V1 is getting the last values from each group as a character vector.
shift(...) shifts the character vector in the previous call
dt[, newcol := ...[.GRP], by = Group] groups by Group and uses the internal .GRP values for indexing
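To make the breakdown concrete, the intermediate values can be inspected one step at a time. This is only a minimal sketch on the question's data; the names lasts and shifted are introduced here for illustration and are not part of the original one-liner.

```r
library(data.table)

dt <- data.table(Product = c("A", "B", "C", "D", "E", "F", "G"),
                 Group   = c(1, 1, 2, 2, 2, 3, 3))

# Step 1: last Product of each group, as a character vector
lasts <- dt[, last(Product), by = Group]$V1

# Step 2: shift by one position, so each group lines up with the previous group's value
shifted <- shift(lasts)

# Step 3: .GRP is the group counter (1, 2, 3), used here as an index into shifted
dt[, newcol := shifted[.GRP], by = Group]
```

Printing lasts and shifted after each step shows how the value for group n ends up at position n of the shifted vector.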
Update: Frank brings up a good point about my code above calculating the shift for every group over and over again. To avoid that, we can use either
shifted <- shift(dt[, last(Product), Group]$V1)
dt[, newcol := shifted[.GRP], by = Group]
so that we don't calculate the shift for every group. Or, we can take Frank's nice suggestion in the comments and do the following.
dt[dt[, last(Product), by = Group][, v := shift(V1)], on="Group", newcol := i.v]
Another way is to save the last group's value in a variable.
this = NA_character_ # initialize
dt[, LastProductOfPriorGroup:={ last<-this; this<-last(Product); last }, by=Group]
dt
Product Group LastProductOfPriorGroup
1: A 1 NA
2: B 1 NA
3: C 2 B
4: D 2 B
5: E 2 B
6: F 3 E
7: G 3 E
NB: last() is a data.table function which returns the last item of a vector (of the Product column in this case).
This should also be fast since no logic is being invoked to fetch the last group's value; it just relies on the groups running in order (which they do).
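Since this relies on the order in which groups are processed, it may help to see what that order is. As a rough illustration (the table x below is made up for this sketch), by= runs groups in order of first appearance, while keyby= sorts them, so on unsorted data the "prior group" is not necessarily the numerically smaller one:

```r
library(data.table)

# Hypothetical unsorted data: group 2 appears before group 1
x <- data.table(Group   = c(2, 2, 1, 1, 3),
                Product = c("C", "D", "A", "B", "E"))

# by= keeps groups in order of first appearance
by_order <- x[, .(grp_index = .GRP), by = Group]$Group

# keyby= sorts the groups before processing them
keyby_order <- x[, .(grp_index = .GRP), keyby = Group]$Group
```

Here by_order is 2, 1, 3 and keyby_order is 1, 2, 3. In the question's data the groups are already sorted, so both give the same result.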
Related
I have a data.table that gives me the connections between locations (origin and destination) for different bus routes (route_id).
library(data.table)
library(magrittr)
# data for reproducible example
dt <- data.table(origin      = c('A','B','C', 'F', 'G', 'H'),
                 destination = c('B','C','D', 'G', 'H', 'I'),
                 freq        = c(2,2,2,10,10,10),
                 route_id    = c(1,1,1,2,2,2), stringsAsFactors=FALSE)
# > dt
# origin destination freq route_id
# 1: A B 2 1
# 2: B C 2 1
# 3: C D 2 1
# 4: F G 10 2
# 5: G H 10 2
# 6: H I 10 2
For the purposes of what I'd want to do, if there is a route_id that gives a connection A-B and a connection B-C, then I want to add to the data a connection A-C for that same route_id and so on.
Problems: So far, I've created a simple code that does this job but:
it uses a for loop that takes a long time (my real data has hundreds of thousands observations)
it still does not cope well with direction. The direction of the connections matter here. So although there is a B-C connection in the original data, there should be no C-B in the output.
My slow solution
# loop
# a) get a data subset corresponding to each route_id
# b) get all combinations of origin-destination pairs
# c) row bind the new pairs to original data
for (i in unique(dt$route_id)) {
  temp <- dt[route_id == i, ]
  subset_of_pairs <- expand.grid(temp$origin, temp$destination) %>% setDT()
  setnames(subset_of_pairs, c("origin", "destination"))
  dt <- rbind(dt, subset_of_pairs, fill=TRUE)
}
# assign route_id and freq to new pairs
dt[, route_id := route_id[1L], by=origin]
dt[, freq := freq[1L], by=route_id]
# Keep only distinct pairs whose origin and destination differ
dt[, origin := as.character(origin) ][, destination := as.character(destination) ]
dt <- dt[ origin != destination, ][order(route_id, origin, destination)]
dt <- unique(dt)
Desired output
origin destination freq route_id
1: A B 2 1
2: A C 2 1
3: A D 2 1
4: B C 2 1
5: B D 2 1
6: C D 2 1
7: F G 10 2
8: F H 10 2
9: F I 10 2
10: G H 10 2
11: G I 10 2
12: H I 10 2
One way:
res = dt[, {
  stops = c(origin, last(destination))
  pairs = combn(.N + 1L, 2L)
  .(o = stops[pairs[1,]], d = stops[pairs[2,]])
}, by=route_id]
route_id o d
1: 1 A B
2: 1 A C
3: 1 A D
4: 1 B C
5: 1 B D
6: 1 C D
7: 2 F G
8: 2 F H
9: 2 F I
10: 2 G H
11: 2 G I
12: 2 H I
This is assuming that c(origin, last(destination)) is a full list of stops in order. If dt does not contain enough info to construct a complete order, the task becomes much more difficult.
If vars from dt are needed, an update join like res[dt, on=.(route_id), freq := i.freq] works.
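Putting the two pieces together, a minimal end-to-end sketch on the question's data could look like this. It assumes, as the question's data suggests, that freq is constant within each route_id; the unique() call is an addition here so that each route contributes a single row to the join.

```r
library(data.table)

dt <- data.table(origin      = c("A", "B", "C", "F", "G", "H"),
                 destination = c("B", "C", "D", "G", "H", "I"),
                 freq        = c(2, 2, 2, 10, 10, 10),
                 route_id    = c(1, 1, 1, 2, 2, 2))

# All ordered stop pairs per route
res <- dt[, {
  stops <- c(origin, last(destination))   # full stop sequence, in order
  pairs <- combn(.N + 1L, 2L)             # index pairs with first < second
  .(o = stops[pairs[1, ]], d = stops[pairs[2, ]])
}, by = route_id]

# Update join: copy freq from dt into res by reference, matching on route_id
res[unique(dt[, .(route_id, freq)]), on = .(route_id), freq := i.freq]
```

Because combn() only returns pairs with the first index smaller than the second, direction is respected: B-C appears but C-B does not.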
Tasks like this always risk running out of memory. In this case, the OP has up to a million rows containing groups of up to 341 stops, so the end result could be as large as 1e6/341*choose(341,2) = 170 million rows. That's manageable, but in general this sort of analysis does not scale.
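The size estimate can be reproduced, and the exact output size for a given table computed before running the expansion. This is just a sketch; upper_bound and n_pairs are names chosen here, and the small table is the question's example rather than the OP's real data.

```r
library(data.table)

# Upper bound from the paragraph above: every route has the maximum of 341 stops
upper_bound <- 1e6 / 341 * choose(341, 2)

# Exact number of output rows for a concrete table:
# a route with .N rows has .N + 1 stops, hence choose(.N + 1, 2) pairs
dt <- data.table(origin      = c("A", "B", "C", "F", "G", "H"),
                 destination = c("B", "C", "D", "G", "H", "I"),
                 route_id    = c(1, 1, 1, 2, 2, 2))
n_pairs <- dt[, .(pairs = choose(.N + 1L, 2L)), by = route_id][, sum(pairs)]
```

For the example table n_pairs is 12, matching the desired output in the question.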
How it works
Generally, data.table syntax can be treated just like a loop over groups:
DT[, {
...
}, by=g]
This has a few advantages over loops:
Nothing created in the ... body will pollute the workspace.
All columns can be referenced by name.
Special symbols .N, .SD, .GRP and .BY are available, along with .() for list().
In the code above, pairs finds pairs of indices taken from 1 .. #stops (=.N+1 where .N is the number of rows in the subset of the data associated with a given route_id). It is a matrix with the first row corresponding to the first element of a pair; and the second row with the second. The ... should evaluate to a list of columns; and here list() is abbreviated as .().
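For a single route with four stops, the index matrix and the resulting pairs can be looked at in isolation. This base-R sketch uses a made-up stops vector standing in for c(origin, last(destination)):

```r
# Stops of one hypothetical route, in travel order
stops <- c("A", "B", "C", "D")

# 2 x 6 matrix: each column is one pair of indices, first index < second index
pairs <- combn(4L, 2L)

# Map the indices back to stop names
od <- data.frame(o = stops[pairs[1, ]], d = stops[pairs[2, ]])
```

The first row of pairs is 1 1 1 2 2 3 and the second is 2 3 4 3 4 4, which gives the pairs A-B, A-C, A-D, B-C, B-D, C-D.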
Further improvements
I guess the time is mostly devoted to computing combn many times. If multiple routes have the same #stops, this can be addressed by computing beforehand:
Ns = dt[, .N + 1L, by=route_id][, unique(V1)]
cb = lapply(setNames(, Ns), combn, 2)
Then grab pairs = cb[[as.character(.N + 1L)]] in the main code (the lookup uses .N + 1L because a route with .N rows has .N + 1 stops). Alternately, define a pairs function that uses memoization to avoid recomputing.
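The memoization idea might be sketched as follows in base R. The names pair_cache and get_pairs are hypothetical, and the environment-based cache is just one simple way to do it (a package such as memoise would be another):

```r
# Hypothetical cache: one combn() result per distinct number of stops
pair_cache <- new.env()

get_pairs <- function(n) {
  key <- as.character(n)
  if (is.null(pair_cache[[key]])) {
    pair_cache[[key]] <- combn(n, 2L)   # computed only on the first request for n
  }
  pair_cache[[key]]
}
```

In the main code this would be called with the number of stops, i.e. pairs = get_pairs(.N + 1L).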
I have a really big problem and looping through the data.table to do what I want is too slow, so I am trying to get around looping. Let's assume I have a data.table as follows:
a <- data.table(i = c(1,2,3), j = c(2,2,6), k = list(c("a","b"),c("a","c"),c("b")))
> a
i j k
1: 1 2 a,b
2: 2 2 a,c
3: 3 6 b
And I want to group based on the values in k. So something like this:
a[, sum(j), by = k]
right now I am getting the following error:
Error in `[.data.table`(a, , sum(i), by = k) :
The items in the 'by' or 'keyby' list are length (2,2,1). Each must be same length as rows in x or number of rows returned by i (3).
The answer I am looking for is to group first all the rows having "a" in column k and calculate sum(j) and then all rows having "b" and so on. So the desired answer would be:
k V1
a 4
b 8
c 2
Any hint on how to do it efficiently? I can't melt the column k by repeating the rows, since the size of the data.table would be too big in my case.
I think this might work:
a[, .(k = unlist(k)), by=.(i,j)][,sum(j),by=k]
k V1
1: a 4
2: b 8
3: c 2
If we are using tidyr, a compact option would be
library(tidyr)
unnest(a, k)[, sum(j) ,k]
# k V1
#1: a 4
#2: b 8
#3: c 2
Or using the dplyr/tidyr pipes
unnest(a, k) %>%
group_by(k) %>%
summarise(V1 = sum(j))
# k V1
# <chr> <dbl>
#1 a 4
#2 b 8
#3 c 2
Since by-group operations can be slow, I'd consider...
dat = a[rep(1:.N, lengths(k)), c(.SD, .(k = unlist(a$k))), .SDcols=setdiff(names(a), "k")]
i j k
1: 1 2 a
2: 1 2 b
3: 2 2 a
4: 2 2 c
5: 3 6 b
We're repeating rows of cols i:j to match the unlisted k. The data should be kept in this format instead of using a list column, probably. From there, as in @MikeyMike's answer, we can dat[, sum(j), by=k].
In data.table 1.9.7+, we can similarly do
dat = a[, c(.SD[rep(.I, lengths(k))], .(k = unlist(k))), .SDcols=i:j]
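If memory is the main concern, one variation (a sketch, not something from the answers above) is to expand only the column that is actually aggregated instead of every column. The name long is introduced here for illustration.

```r
library(data.table)

a <- data.table(i = c(1, 2, 3), j = c(2, 2, 6),
                k = list(c("a", "b"), c("a", "c"), c("b")))

# Unlist k and repeat j once per element of k; column i is left out entirely
long <- a[, .(k = unlist(k), j = rep(j, lengths(k)))]

# Aggregate as before
res <- long[, .(V1 = sum(j)), by = k]
```

This still materializes one row per element of k, so it reduces the width of the intermediate table rather than its length.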
I have a data.table dt:
library(data.table)
dt = data.table(a=LETTERS[c(1,1:3)],b=4:7)
a b
1: A 4
2: A 5
3: B 6
4: C 7
The result of dt[, .N, by=a] is
a N
1: A 2
2: B 1
3: C 1
I know that by=a or by="a" means grouping by column a, and that the N column holds the number of times each value of a occurs. However, I never called nrow(), yet I still get this result. Is .N more than just a column name? I can't find its documentation with ??".N" in R. I tried .K, but that doesn't work. What does .N mean?
Think of .N as a variable for the number of instances. For example:
dt <- data.table(a = LETTERS[c(1,1:3)], b = 4:7)
dt[.N] # returns the last row
# a b
# 1: C 7
Your example returns a new variable with the number of rows per case:
dt[, new_var := .N, by = a]
dt
# a b new_var
# 1: A 4 2 # 2 'A's
# 2: A 5 2
# 3: B 6 1 # 1 'B'
# 4: C 7 1 # 1 'C'
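A few more uses of .N on the same table may make the idea clearer. This is a small sketch; the variable names on the left are chosen here for illustration.

```r
library(data.table)

dt <- data.table(a = LETTERS[c(1, 1:3)], b = 4:7)

n_total   <- dt[, .N]           # number of rows in the whole table
n_subset  <- dt[b > 4, .N]      # number of rows left after filtering in i
last_b    <- dt[.N, b]          # in i, .N is the index of the last row
per_group <- dt[, .N, by = a]   # with by=, .N is the row count of each group
```

In every case .N is simply an integer: the number of rows in whatever subset of the data is currently being looked at.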
For a list of all special symbols of data.table, see also https://www.rdocumentation.org/packages/data.table/versions/1.10.0/topics/special-symbols
I am still puzzled by the behavior of data.table J.
> DT = data.table(A=7:3,B=letters[5:1])
> DT
A B
1: 7 e
2: 6 d
3: 5 c
4: 4 b
5: 3 a
> setkey(DT, A, B)
> DT[J(7,"e")]
A B
1: 7 e
> DT[J(7,"f")]
A B
1: 7 f # <- there is no such line in DT
but there is no such line in DT. Why do we get this result?
The data.table J(7, 'f') is literally a single-row data.table that you are joining your own data.table with. When you call x[i], you are looking at each row in i and finding all matches for this in x. The default is to give NA for rows in i that don't match anything, which is easier seen by adding another column to DT:
DT <- data.table(A=7:3,B=letters[5:1],C=letters[1:5])
setkey(DT, A, B)
DT[J(7,"f")]
# A B C
# 1: 7 f NA
What you are seeing is the only row in J with no match to anything in DT. To prevent data.table from reporting non-matches, you can use nomatch=0
DT[J(7,"f"), nomatch=0]
# Empty data.table (0 rows) of 3 cols: A,B,C
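The difference between the two settings can be checked programmatically. The following sketch rebuilds the three-column DT from above and compares a matching and a non-matching lookup; hit, miss and n_strict are names used only here.

```r
library(data.table)

DT <- data.table(A = 7:3, B = letters[5:1], C = letters[1:5])
setkey(DT, A, B)

hit  <- DT[J(7, "e")]   # key exists: C comes from the matching row of DT
miss <- DT[J(7, "f")]   # key absent: the row of i is returned, C is filled with NA

# With nomatch = 0L the non-matching row of i is dropped instead
n_strict <- nrow(DT[J(7, "f"), nomatch = 0L])
```

Checking is.na(miss$C), or the row count with nomatch = 0L, is a simple way to test whether a key combination is present.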
Perhaps adding an additional column will shed some light on what is going on.
DT[, C:=paste0(A, B)]
DT[J(7,"e")]
### A B C
### 1: 7 e 7e
DT[J(7,"f")]
### A B C
### 1: 7 f NA
This is the same behavior as without J:
setkey(DT, B)
DT["a"]
### B A C
### 1: a 3 3a
DT["A"]
### B A C
### 1: A NA NA
You can use the nomatch argument to change this behavior.
DT[J(7,"f"), nomatch=0L]
### Empty data.table (0 rows) of 3 cols: A,B,C
I'm trying to implement a data.table for my relatively large datasets and I can't figure out how to operate a function over multiple columns in the same row. Specifically, I want to create a new column that contains a specifically-formatted tally of the values (i.e., a histogram) in a subset of columns. It is kind of like table() but that also includes 0 entries and is sorted--so, if you know of a better/faster method I'd appreciate that too!
Simplified test case:
DF<-data.frame("A"=c("a","d","a"),"B"=c("b","a","a"),"C"=c("c","a","a"),"D"=c("a","b","c"),"E"=c("a","a","c"))
DT<-as.data.table(DF)
> DT
A B C D E
1: a b c a a
2: d a a b a
3: a a a c c
my klunky histogram function:
histo <- function(vec) {
  foo <- c("a"=0, "b"=0, "c"=0, "d"=0)
  for (i in vec) { foo[i] = foo[i] + 1 }
  return(foo)
}
>histo(unname(unlist(DF[1,])))
a b c d
3 1 1 0
>histo(unname(unlist(DF[2,])))
a b c d
3 1 0 1
>histo(unname(unlist(DF[3,])))
a b c d
3 0 2 0
pseudocode of desired function and output
>DT[,his:=some_func_with_histo(A:E)]
>DT
A B C D E his
1: a b c a a (3,1,1,0)
2: d a a b a (3,1,0,1)
3: a a a c c (3,0,2,0)
df <- data.table(DF)
df$hist <- unlist(apply(df, 1, function(x) {
  list(
    sapply(letters[1:4], function(d) {
      b <- sum(!is.na(grep(d, x)))
      assign(d, b)
    }))
}), recursive=FALSE)
Your df$hist column is a list, with each value named:
> df
A B C D E hist
1: a b c a a 3,1,2,0
2: d a a b a 3,1,1,1
3: a a a c c 3,0,3,0
> df$hist
[[1]]
a b c d
3 1 2 0
[[2]]
a b c d
3 1 1 1
[[3]]
a b c d
3 0 3 0
NOTE: Answer has been updated per the OP's request and mnel's comment
OK, how do you like that solution:
library(data.table)
DT <- data.table(A=c("a","d","a"),
                 B=c("b","a","a"),
                 C=c("c","a","a"),
                 D=c("a","b","c"),
                 E=c("a","a","c"))
fun <- function(vec, char) {
  sum(vec==char)
}
DT[, Vec_Nr := paste(Vectorize(fun, 'char')(.SD, letters[1:4]), collapse=","),
   by=1:nrow(DT),
   .SDcols=LETTERS[1:5]]
A B C D E Vec_Nr
1: a b c a a 3,1,1,0
2: d a a b a 3,1,0,1
3: a a a c c 3,0,2,0
I basically split up your problem into several steps:
First, I define a function fun that gives me the number of occurrences of one character. To see how that function works, just call
fun(c("a", "a", "b"), "b")
[1] 1
Next, I vectorize this function because you want the count not just for one character "b", but for many. To pass a vector of arguments to a function, use Vectorize. To see how that works, just type
Vectorize(fun, "char")(c("a", "a", "b"), c("a", "b"))
a b
2 1
Next, I collapse the results into one string and save that as a new column. Note that I deliberately used letters and LETTERS here to show you how to make this more dynamic.
EDIT (also see below): Provided you first convert column classes to character, e.g., with DT <- DT[,lapply(.SD,as.character)]...
By using factor, you can convert vec and pass the values (a,b,c,d) in one step:
histo2 <- function(x) table(factor(x,levels=letters[1:4]))
Then you can iterate over rows by passing by=1:nrow(DT).
DT[,as.list(histo2(.SD)),by=1:nrow(DT)]
This gives...
nrow a b c d
1: 1 3 1 1 0
2: 2 3 1 0 1
3: 3 3 0 2 0
Also, this iterates over columns. This works because .SD is a special variable holding the subset of data associated with the call to by. In this case, that subset is the data.table consisting of one of the rows. histo2(DT[1]) works the same way.
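Since by=1:nrow(DT) evaluates the expression once per row, it can become slow on large tables. As an alternative sketch (not part of the answer above, and assuming all columns are character), the counts can be computed one letter at a time on a character matrix, which keeps the work vectorised over rows:

```r
library(data.table)

DT <- data.table(A = c("a", "d", "a"),
                 B = c("b", "a", "a"),
                 C = c("c", "a", "a"),
                 D = c("a", "b", "c"),
                 E = c("a", "a", "c"))

# Character matrix with one row per row of DT
m <- as.matrix(DT)

# For each letter, count its occurrences in every row; result is a 3 x 4 matrix
counts <- sapply(letters[1:4], function(ch) rowSums(m == ch))
```

The columns of counts are named a, b, c and d, and zero counts are included, so the first row reads 3, 1, 1, 0.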
EDIT (responding to OP's comment): Oh, sorry, I instinctively replaced your first line with
DF<-data.frame("A"=c("a","d","a"),"B"=c("b","a","a"),"C"=c("c","a","a"),"D"=c("a","b","c"),"E"=c("a","a","c")
,stringsAsFactors=FALSE)
since I dislike using factors except when making tables. If you do not want to convert your factor columns to character columns in this way, this will work:
histo3 <- function(x) table(factor(sapply(x,as.character),levels=letters[1:4]))
To put the output into a single column, you use := as you suggested...
DT[,hist:=list(list(histo3(.SD))),by=1:nrow(DT)]
The list(list()) part is key; I always figure this out by trial-and-error. Now DT looks like this:
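The reason for the double list can be seen on a toy table. In this sketch (the table and the column name v are made up), the outer list() is the usual "list of columns" that := expects, and the inner list() wraps the whole vector so that it is stored as a single cell rather than spread over rows:

```r
library(data.table)

DT <- data.table(id = 1:3)

# One named vector per row, stored in a list column
DT[, v := list(list(c(x = id, y = id * 2))), by = id]

# Each cell of v is a full vector and can be indexed by name
second_y <- DT$v[[2]][["y"]]
```

With a single list(), data.table would try to treat the vector's elements as the values of the column for that group instead.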
A B C D E hist
1: a b c a a 3,1,1,0
2: d a a b a 3,1,0,1
3: a a a c c 3,0,2,0
You might find that it's a pain to access the information directly from your new column. For example, to access the "a" column of the "histogram", I think the fastest route is...
DT[,hist[[1]][["a"]],by=1:nrow(DT)]
My initial suggestion created an auxiliary data.table with just the counts. I think it's cleaner to do whatever you want to do with the counts in that data.table and then cbind it back. If you choose to store it in a column, you can always create the auxiliary data.table later with
DT[,as.list(hist[[1]]),by=1:nrow(DT)]
You are correct about using .SDcols. For your example, ...
cols = c("A","C")
histname = paste(c("hist",cols),collapse="")
DT[,(histname):=list(list(histo3(.SD))),by=1:nrow(DT),.SDcols=cols]
This gives
   A B C D E    hist  histAC
1: a b c a a 3,1,1,0 1,0,1,0
2: d a a b a 3,1,0,1 1,0,0,1
3: a a a c c 3,0,2,0 2,0,0,0