I have a data.table that gives me the connections between locations (origin and destination) for different bus routes (route_id).
library(data.table)
library(magrittr)
# data for reproducible example
dt <- data.table( origin = c('A','B','C', 'F', 'G', 'H'),
destination = c('B','C','D', 'G', 'H', 'I'),
freq = c(2,2,2,10,10,10),
route_id = c(1,1,1,2,2,2), stringsAsFactors=FALSE )
# > dt
# origin destination freq route_id
# 1: A B 2 1
# 2: B C 2 1
# 3: C D 2 1
# 4: F G 10 2
# 5: G H 10 2
# 6: H I 10 2
For my purposes, if a route_id has a connection A-B and a connection B-C, then I want to add a connection A-C for that same route_id, and so on.
Problems: So far, I've written some simple code that does this job, but:
it uses a for loop that takes a long time (my real data has hundreds of thousands of observations)
it still does not cope well with direction. The direction of the connections matters here. So although there is a B-C connection in the original data, there should be no C-B in the output.
My slow solution
# loop
# a) get a data subset corresponding to each route_id
# b) get all combinations of origin-destination pairs
# c) row bind the new pairs to original data
for (i in unique(dt$route_id)) {
temp <- dt[ route_id== i,]
subset_of_pairs <- expand.grid(temp$origin, temp$destination) %>% setDT()
setnames(subset_of_pairs, c("origin", "destination"))
dt <- rbind(dt, subset_of_pairs, fill=TRUE)
}
# assign route_id and freq to new pairs
dt[, route_id := route_id[1L], by=origin]
dt[, freq := freq[1L], by=route_id]
# Keep only distinct pairs where origin and destination differ
dt[, origin := as.character(origin) ][, destination := as.character(destination) ]
dt <- dt[ origin != destination, ][order(route_id, origin, destination)]
dt <- unique(dt)
Desired output
origin destination freq route_id
1: A B 2 1
2: A C 2 1
3: A D 2 1
4: B C 2 1
5: B D 2 1
6: C D 2 1
7: F G 10 2
8: F H 10 2
9: F I 10 2
10: G H 10 2
11: G I 10 2
12: H I 10 2
One way:
res = dt[, {
stops = c(origin, last(destination))
pairs = combn(.N + 1L, 2L)
.(o = stops[pairs[1,]], d = stops[pairs[2,]])
}, by=route_id]
route_id o d
1: 1 A B
2: 1 A C
3: 1 A D
4: 1 B C
5: 1 B D
6: 1 C D
7: 2 F G
8: 2 F H
9: 2 F I
10: 2 G H
11: 2 G I
12: 2 H I
This is assuming that c(origin, last(destination)) is a full list of stops in order. If dt does not contain enough info to construct a complete order, the task becomes much more difficult.
If vars from dt are needed, an update join like res[dt, on=.(route_id), freq := i.freq] works.
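Putting the pieces together, here is a minimal sketch that runs the grouped combn step on the example data and then pulls freq back in with an update join. It assumes freq is constant within each route_id, as in the example; route_freq is a helper name introduced here for illustration.

```r
library(data.table)

dt <- data.table(
  origin      = c("A", "B", "C", "F", "G", "H"),
  destination = c("B", "C", "D", "G", "H", "I"),
  freq        = c(2, 2, 2, 10, 10, 10),
  route_id    = c(1, 1, 1, 2, 2, 2)
)

# all ordered pairs of stops within each route
res <- dt[, {
  stops <- c(origin, destination[.N])   # assumes rows are in travel order
  pairs <- combn(.N + 1L, 2L)
  .(origin = stops[pairs[1, ]], destination = stops[pairs[2, ]])
}, by = route_id]

# one freq value per route (assumption: freq does not vary within a route)
route_freq <- unique(dt[, .(route_id, freq)])

# update join: adds freq to res by reference
res[route_freq, on = .(route_id), freq := i.freq]
res
```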
Tasks like this always risk running out of memory. In this case, the OP has up to a million rows containing groups of up to 341 stops, so the end result could be as large as 1e6/341*choose(341,2) = 170 million rows. That's manageable, but in general this sort of analysis does not scale.
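Before running the expansion on real data, the size of the result can be estimated from the group sizes alone. This is a rough sketch under the same assumption as above (each route with N rows has N + 1 stops); n_pairs is a name introduced here.

```r
library(data.table)

dt <- data.table(
  origin      = c("A", "B", "C", "F", "G", "H"),
  destination = c("B", "C", "D", "G", "H", "I"),
  route_id    = c(1, 1, 1, 2, 2, 2)
)

# number of output rows = sum over routes of choose(#stops, 2)
n_pairs <- dt[, .N, by = route_id][, sum(choose(N + 1L, 2L))]
n_pairs

# the rough upper bound quoted above, about 1.7e8
1e6 / 341 * choose(341, 2)
```

If the estimate is far beyond available memory, it is probably better to process routes in batches or to rethink whether every pair is really needed.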
How it works
Generally, data.table syntax can be treated just like a loop over groups:
DT[, {
...
}, by=g]
This has a few advantages over loops:
Nothing created in the ... body will pollute the workspace.
All columns can be referenced by name.
Special symbols .N, .SD, .GRP and .BY are available, along with .() for list().
In the code above, pairs finds pairs of indices taken from 1 .. #stops (=.N+1 where .N is the number of rows in the subset of the data associated with a given route_id). It is a matrix with the first row corresponding to the first element of a pair; and the second row with the second. The ... should evaluate to a list of columns; and here list() is abbreviated as .().
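To make the shape of pairs concrete, here is what combn returns for a route with three rows (four stops). The stops vector is just the first route from the example.

```r
stops <- c("A", "B", "C", "D")

# 2 x 6 matrix: each column is one pair of indices, first index < second index
pairs <- combn(length(stops), 2L)
pairs

# indexing stops by each row gives origins and destinations in travel order
o <- stops[pairs[1, ]]
d <- stops[pairs[2, ]]
paste(o, d, sep = "-")
```

Because combn always puts the smaller index first, reversed pairs such as C-B never appear, which is what takes care of direction.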
Further improvements
I guess the time is mostly devoted to computing combn many times. If multiple routes have the same #stops, this can be addressed by computing the index pairs beforehand. Note that a route with N rows has N+1 stops, so combn is called on N+1 while the list is named by N:
Ns = dt[,.N, by=route_id][, unique(N)]
cb = lapply(setNames(Ns + 1L, Ns), combn, 2L)
Then grab pairs = cb[[as.character(.N)]] in the main code. Alternately, define a pairs function that uses memoization to avoid recomputing.
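The memoization idea could look roughly like this. It is only a sketch: make_pairs and pairs_memo are hypothetical names, and the cache is a plain environment keyed by the number of stops.

```r
library(data.table)

# returns a function that caches combn(n, 2) per distinct n
make_pairs <- function() {
  cache <- new.env(parent = emptyenv())
  function(n) {
    key <- as.character(n)
    if (is.null(cache[[key]])) cache[[key]] <- combn(n, 2L)
    cache[[key]]
  }
}
pairs_memo <- make_pairs()

dt <- data.table(
  origin      = c("A", "B", "C", "F", "G", "H"),
  destination = c("B", "C", "D", "G", "H", "I"),
  route_id    = c(1, 1, 1, 2, 2, 2)
)

res <- dt[, {
  stops <- c(origin, destination[.N])
  pairs <- pairs_memo(.N + 1L)   # computed once per distinct number of stops
  .(o = stops[pairs[1, ]], d = stops[pairs[2, ]])
}, by = route_id]
```

Whether this helps in practice depends on how many routes share the same number of stops; it is worth timing on real data before adopting it.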
I was wondering what causes the following behavior that surprised me a bit - I defined a data table dt_3, then defined dt_1 to be equal to dt_3. When I then used set() to replace row elements in dt_1, the corresponding elements of dt_3 were changed as well:
refcols=c("A","B")
dt_3 = data.table(A=c(1,1,3,5,6,7), B = c("x","y","z","q","w","e"), C = rep("NO",6))
dt_2 = data.table(A=c(3,5,7), B = c("z","q","x"), D=c(3,5,99))
dt_1 = dt_3
dt_3
A B C
1: 1 x NO
2: 1 y NO
3: 3 z NO
4: 5 q NO
5: 6 w NO
6: 7 e NO
for(j in refcols){
set(dt_1,2,j,dt_2[3,get(j)])
}
Warning messages:
1: In set(dt_1, 2, j, dt_2[3, get(j)]) :
Coerced i from numeric to integer. Please pass integer for efficiency; e.g., 2L rather than 2
2: In set(dt_1, 2, j, dt_2[3, get(j)]) :
Coerced i from numeric to integer. Please pass integer for efficiency; e.g., 2L rather than 2
dt_3
A B C
1: 1 x NO
2: 7 x NO
3: 3 z NO
4: 5 q NO
5: 6 w NO
6: 7 e NO
What is causing this and is there an easier way to subset by explicit row indices for specific columns like this?
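Assigning with dt_1 = dt_3 does not copy the data: both names point to the same data.table, and set() modifies it by reference, so the change shows up under both names. We can use copy so that when we replace elements in one dataset, the other won't change: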
dt_1<- copy(dt_3)
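A small sketch of the difference, using a cut-down version of dt_3. The names alias and clone are introduced here for illustration only.

```r
library(data.table)

dt_3  <- data.table(A = c(1, 1, 3), C = "NO")
alias <- dt_3         # same object under a second name, no copy made
clone <- copy(dt_3)   # independent copy

# modify row 2 of column A by reference through the alias
set(alias, i = 2L, j = "A", value = 7)

dt_3$A    # changed as well, because alias and dt_3 are the same object
clone$A   # unchanged
```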
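As for the second part, it is not clear how the rows should be chosen. If whole columns are to be replaced (no row index):
for(j in refcols){
# dt_2 has 3 rows and dt_1 has 6, so recycle explicitly (recent data.table versions do not recycle silently)
set(dt_1, i=NULL, j=j, value=rep(dt_2[[j]], length.out=nrow(dt_1)))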
}
dt_1
# A B C
#1: 3 z NO
#2: 5 q NO
#3: 7 x NO
#4: 3 z NO
#5: 5 q NO
#6: 7 x NO
If the 2nd row of the "A" and "B" columns in 'dt_1' should be replaced by the 3rd row of 'dt_2' for the corresponding columns (based on 'refcols'):
for(j in refcols){
set(dt_1, i=2L, j=j, value=dt_2[[j]][3])
}
dt_1
# A B C
#1: 1 x NO
#2: 7 x NO
#3: 3 z NO
#4: 5 q NO
#5: 6 w NO
#6: 7 e NO
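The same row-wise replacement can probably also be written without the loop, by assigning several columns at once with :=. This is a sketch, not the only way; new_vals is a name introduced here.

```r
library(data.table)

refcols <- c("A", "B")
dt_3 <- data.table(A = c(1, 1, 3, 5, 6, 7),
                   B = c("x", "y", "z", "q", "w", "e"),
                   C = rep("NO", 6))
dt_2 <- data.table(A = c(3, 5, 7), B = c("z", "q", "x"), D = c(3, 5, 99))
dt_1 <- copy(dt_3)

# 3rd row of dt_2, restricted to the reference columns, as a plain list
new_vals <- as.list(dt_2[3L, refcols, with = FALSE])

# overwrite row 2 of those columns in dt_1 by reference
dt_1[2L, (refcols) := new_vals]
dt_1
```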
For dummy dataset
require(data.table)
require(reshape2)
teamid <- c(1,2,3)
member <- c("a,b","","c,g,h")
leader <- c("c", "d,e", "")
dt <- data.table(teamid, member, leader)
Now the dataset looks like this:
teamid member leader
1: 1 a,b c
2: 2 d,e
3: 3 c,g,h
3 Columns. For each team, they have team members, and team leaders in different column. Teams may have only members without leaders, and vice versa.
The following is my ALMOST desired output:
teamid value leader
1: 1 a FALSE
2: 1 b FALSE
3: 1 c TRUE
4: 1 c TRUE
5: 2 d TRUE
6: 2 e TRUE
7: 3 c FALSE
8: 3 g FALSE
9: 3 h FALSE
I want to have the two columns merged into one, and add a tag if one is a team leader.
I have an ugly solution for this,
dt1 <- dt[, strsplit(member, ","), by = teamid]
dt2 <- dt[, strsplit(leader, ","), by = teamid]
setkey(dt1,teamid)
setkey(dt2,teamid)
dt3 <- merge(dt1,dt2, all = TRUE)
dt4 <- melt(dt3, id = 1, measure = c("V1.x", "V1.y"))
dt5 <- dt4[value!="NA_real"]
dt6 <- dt5[, leader := (variable == "V1.y")][, variable := NULL]
setkey(dt6, teamid)
setnames(dt6,value,member)
Issues:
This solution is not efficient, I think: it merges first and then melts. So any ideas about other ways to do this?
There're duplicated rows, in row 3 and row 4.
When I tried to change column name, an error came up
setnames(dt6,value,member)
Error in setnames(dt6, value, member) : object 'value' not found
Maybe the most important thing,
When I tried this on my real dataset, which has more than 1 million rows and 3 columns, the following error occurred
merge(df1,df2, all = TRUE)
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), :
Join results in 238797 rows; more than 142095 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
Any suggestion? Thanks a lot!
Melt first.
result <- melt(dt,id="teamid", variable.name="status", value.name="member")
result <- result[nchar(member)>0,strsplit(member,","),by=list(teamid,status)]
setnames(result,"V1","member")
setkey(result,teamid,status)
result
# teamid status member
# 1: 1 member a
# 2: 1 member b
# 3: 1 leader c
# 4: 2 leader d
# 5: 2 leader e
# 6: 3 member c
# 7: 3 member g
# 8: 3 member h
If you want to get rid of the status column and add a "tag" to the member column, you can do it this way:
result[status=="leader",member:=paste0(member,"*")]
result[,status:=NULL]
result
# teamid member
# 1: 1 a
# 2: 1 b
# 3: 1 c*
# 4: 2 d*
# 5: 2 e*
# 6: 3 c
# 7: 3 g
# 8: 3 h
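If a logical leader column is preferred over the "*" tag, as in the output sketched in the question, the same melt-first idea can be adapted as below. This is only a sketch; long and person are names introduced here. It also shows that setnames expects column names as character strings, which is why the unquoted setnames(dt6,value,member) call in the question failed.

```r
library(data.table)

dt <- data.table(teamid = c(1, 2, 3),
                 member = c("a,b", "", "c,g,h"),
                 leader = c("c", "d,e", ""))

# long format: one row per (team, role) with the comma-separated string
long <- melt(dt, id.vars = "teamid", variable.name = "status", value.name = "person")

# split the strings, dropping empty entries
long <- long[nchar(person) > 0,
             .(value = unlist(strsplit(person, ",", fixed = TRUE))),
             by = .(teamid, status)]

# logical flag instead of a status column, then drop exact duplicates
long[, leader := status == "leader"][, status := NULL]
long <- unique(long)
setorder(long, teamid, value)

# old and new names are quoted
setnames(long, "value", "member")
long
```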
A slightly simpler approach may be
crew <- dt[, .(strsplit(member, ","))]
crew <- unlist(crew)
leads <- dt[, .(strsplit(leader, ","))]
leads <- unlist(leads)
dt_long <- data.table(people=c(crew, leads),
status = rep(c("crew", "leader"), c(length(crew), length(leads))))
It gives me
people status
1: a crew
2: b crew
3: c crew
4: g crew
5: h crew
6: c leader
7: d leader
8: e leader
You can try a tidyverse solution now (dplyr and tidyr):
library(dplyr)
library(tidyr)
dt %>%
separate_rows(member) %>%
separate_rows(leader) %>%
gather(status, member, -teamid) %>%
distinct() %>%
filter(member != "") %>%
mutate(member=ifelse(status == "leader", paste0(member, "*"), member)) %>%
select(-status)
teamid member
1 1 a
2 1 b
3 3 c
4 3 g
5 3 h
6 1 c*
7 2 d*
8 2 e*
I am still puzzled by the behavior of data.table J.
> DT = data.table(A=7:3,B=letters[5:1])
> DT
A B
1: 7 e
2: 6 d
3: 5 c
4: 4 b
5: 3 a
> setkey(DT, A, B)
> DT[J(7,"e")]
A B
1: 7 e
> DT[J(7,"f")]
A B
1: 7 f # <- there is no such line in DT
but there is no such line in DT. Why do we get this result?
The data.table J(7, 'f') is literally a single-row data.table that you are joining your own data.table with. When you call x[i], you are looking at each row in i and finding all matches for this in x. The default is to give NA for rows in i that don't match anything, which is easier seen by adding another column to DT:
DT <- data.table(A=7:3,B=letters[5:1],C=letters[1:5])
setkey(DT, A, B)
DT[J(7,"f")]
# A B C
# 1: 7 f NA
What you are seeing is the only row in J with no match to anything in DT. To prevent data.table from reporting non-matches, you can use nomatch=0
DT[J(7,"f"), nomatch=0]
# Empty data.table (0 rows) of 3 cols: A,B,C
Perhaps adding an additional column will shed some light on what is going on.
DT[, C:=paste0(A, B)]
DT[J(7,"e")]
### A B C
### 1: 7 e 7e
DT[J(7,"f")]
### A B C
### 1: 7 f NA
This is the same behavior as without J:
setkey(DT, B)
DT["a"]
### B A C
### 1: a 3 3a
DT["A"]
### B A C
### 1: A NA NA
You can use the nomatch argument to change this behavior.
DT[J(7,"f"), nomatch=0L]
### Empty data.table (0 rows) of 3 cols: A,B,C
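The same lookups can be written without setting a key by naming the join columns in on=. The sketch below assumes a reasonably recent data.table version; nomatch = NULL is the newer spelling of nomatch = 0L and may not be available in old releases.

```r
library(data.table)

DT <- data.table(A = 7:3, B = letters[5:1], C = letters[1:5])

# row present in DT: returned with its C value
hit  <- DT[.(7L, "e"), on = c("A", "B")]

# row absent from DT: the i row comes back with NA in the non-join column
miss <- DT[.(7L, "f"), on = c("A", "B")]

# drop non-matching rows instead of padding them with NA
none <- DT[.(7L, "f"), on = c("A", "B"), nomatch = NULL]
```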
I'd like to create a variable in dt according to a lookup table k. I'm getting some unexpected results depending on how I extract the variable of interest in k.
dt <- data.table(x=c(1:10))
setkey(dt, x)
k <- data.table(x=c(1:5,10), b=c(letters[1:5], "d"))
setkey(k, x)
dt[,b:=k[.BY, list(b)],by=x]
dt #unexpected results
# x b
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 4
# 5: 5 5
# 6: 6 6
# 7: 7 7
# 8: 8 8
# 9: 9 9
# 10: 10 10
dt <- data.table(x=c(1:10))
setkey(dt, x)
dt[,b:=k[.BY]$b,by=x]
dt #expected results
# x b
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 4 d
# 5: 5 e
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA
# 9: 9 NA
# 10: 10 d
Can anyone explain why this is happening?
You don't have to use by= here at all.
First solution:
Set appropriate keys and use X[Y] syntax from data.table:
require(data.table)
dt <- data.table(x=c(1:10))
setkey(dt, "x")
k <- data.table(x=c(1:5,10), b=c(letters[1:5], "d"))
setkey(k, "x")
k[dt]
# x b
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 4 d
# 5: 5 e
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA
# 9: 9 NA
# 10: 10 d
OP said that this creates a new data.table and it is undesirable for him.
Second solution
Again, without by:
dt <- data.table(x=c(1:10))
setkey(dt, "x")
k <- data.table(x=c(1:5,10), b=c(letters[1:5], "d"))
setkey(k, "x")
# solution
dt[k, b := i.b]
This does not create a new data.table and gives the solution you're expecting.
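The update join also works without keys if the join column is named explicitly with on=. A minimal sketch on the same data:

```r
library(data.table)

dt <- data.table(x = 1:10)
k  <- data.table(x = c(1:5, 10L), b = c(letters[1:5], "d"))

# for each row of k, find the matching rows of dt and copy k's b into dt
dt[k, on = "x", b := i.b]
dt
```

Rows of dt with no match in k (x = 6 to 9) are left as NA in the new column.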
To explain why the unexpected result happens:
For the first case you do, dt[,b:=k[.BY, list(b)],by=x]. Here, k[.BY, list(b)] itself returns a data.table. For example:
k[list(x=1), list(b)]
# x b
# 1: 1 a
So, basically, if you would do:
k[list(x=dt$x), list(b)]
That would give you the desired solution as well. As for why b := k[.BY, list(b)] gives what it does: the RHS returns a data.table, and when you assign it to a single column, only its first element (its first column, here x) is used and the rest is dropped. For example, do this:
dt[, c := dt[1], by=x]
# you'll get the whole column to be 1
For the second case, to understand why it works, you'll have to know the subtle difference between accessing a data.table as k[6] and as k[list(6)]:
With k[6], you are accessing the 6th row of k, which is 10 d. With k[list(6)], you're asking for a J (join) operation. So it searches for x = 6 (the key column) and, since there isn't any such row in k, it returns 6 NA. In your case, .BY is a list, so k[.BY] is a J operation, which fetches the right value.
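The difference between the two forms is easy to check directly. A short sketch with the lookup table from the question; by_position and by_key are names introduced here.

```r
library(data.table)

k <- data.table(x = c(1:5, 10L), b = c(letters[1:5], "d"))
setkey(k, x)

# numeric i: row number, so this is the 6th row (x = 10, b = "d")
by_position <- k[6]

# list i: join on the key, so this looks for x == 6 and finds nothing
by_key <- k[.(6L)]
```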
I hope this helps.