I have a data set with two columns; the rows are pairs of values (a,b).
require(data.table)
dt<-data.table(a=c(1,11,11,2,7,5,6), b = c(2,9,8,6,5,3,3))
I want to assign to each pair of values the lowest number. BUT if one of the values appears again in a later row, it must be compared against the new pair, and the lowest value in its history selected. For example, in row 7 (6,3), the value 6 already appeared in row 4, whose running result is 1, so row 7's result is 1. The result must be this one:
res.dt<-data.table(a=c(1,11,11,2,7,5,6), b = c(2,9,8,6,5,3,3), res=c(1,9,8,1,5,3,1))
a b res
1: 1 2 1
2: 11 9 9
3: 11 8 8
4: 2 6 1
5: 7 5 5
6: 5 3 3
7: 6 3 1
To state the problem differently: For each row i, we need to iteratively update res with the smallest value in rows j <= i where (a_i,b_i) and (a_j,b_j) have a non-empty intersection.
We can do this efficiently with non-equi joins in data.table (v>=1.9.8), but since this feature only allows single-element comparisons (>,>=,==,<=, or <), we need to find intersections by comparing (a_i,a_j), (a_i,b_j), (b_i,a_j), (b_i,b_j) separately. (There is an intersection if at least one of these pairs contains identical elements.) Doing this iteratively accounts for the entire history, and we can stop when the result converges:
dt[, `:=`(idx = .I, res = pmin(a, b), prev_res = NA_real_)]  # NA_real_ keeps prev_res numeric for identical()
while (dt[, !identical(res, prev_res)]) {
  dt[, prev_res := res]
  # Use non-equi joins to update 'res' for intersecting pairs downstream
  dt[dt[, .(i.a=a, i.res=res, i=.I)], on=.(a==i.a, idx > i), res := pmin(res, i.res)]
  dt[dt[, .(i.a=a, i.res=res, i=.I)], on=.(b==i.a, idx > i), res := pmin(res, i.res)]
  dt[dt[, .(i.b=b, i.res=res, i=.I)], on=.(a==i.b, idx > i), res := pmin(res, i.res)]
  dt[dt[, .(i.b=b, i.res=res, i=.I)], on=.(b==i.b, idx > i), res := pmin(res, i.res)]
}
The result:
> dt[, .(a,b,res)]
# a b res
# 1: 1 2 1
# 2: 11 9 9
# 3: 11 8 8
# 4: 2 6 1
# 5: 7 5 5
# 6: 5 3 3
# 7: 6 3 1
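As a quick sanity check against res.dt from the question (the res columns shown above agree):
identical(dt$res, res.dt$res)
# [1] TRUE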
I need to use data from a dataframe A to fill a column in my dataframe B.
Here is a subset of dataframe A:
> dfA <- data.frame(Family=c('A','A','A','B','B'), Count=c(1,2,3,1,2), Start=c(0,10,35,0,5), End=c(10,35,50,5,25))
> dfA
Family Count Start End
1 A 1 0 10
2 A 2 10 35
3 A 3 35 50
4 B 1 0 5
5 B 2 5 25
and a subset of dataframe B
> dfB <- data.frame(Family=c('A','A','A','B','B'), Start=c(1,4,36,2,10), End=c(3,6,40,4,24), BelongToCount=c(NA,NA,NA,NA,NA))
> dfB
Family Start End BelongToCount
1 A 1 3 NA
2 A 4 6 NA
3 A 36 40 NA
4 B 2 4 NA
5 B 10 24 NA
What I want to do is fill in the BelongToCount column in B according to the data from dataframe A, ending up with dataframe B filled as:
Family Start End BelongToCount
A 1 3 1
A 4 6 1
A 36 40 3
B 2 4 1
B 10 24 2
I need to do this for each family (so grouping by Family), and the condition for filling the BelongToCount column is B$Start >= A$Start && B$End <= A$End.
I can't seem to find a clean (and fast) way to do this in R.
Right now, I am doing as follows:
split_A <- split(dfA, dfA$Family)
split_A_FamilyA <- split_A[["A"]]
split_B <- split(dfB, dfB$Family)
split_B_FamilyA <- split_B[["A"]]
for (i in 1:nrow(split_B_FamilyA)) {
  row <- split_B_FamilyA[i, ]
  start <- row$Start
  end <- row$End
  for (j in 1:nrow(split_A_FamilyA)) {
    row_base <- split_A_FamilyA[j, ]
    start_base <- row_base$Start
    end_base <- row_base$End
    # B's interval must fall inside A's interval
    if ((start >= start_base) && (end <= end_base)) {
      split_B_FamilyA$BelongToCount[i] <- row_base$Count
      break
    }
  }
}
I admit this is a very bad way of handling the problem (and it is awfully slow). I usually use dplyr when it comes to applying operations on specific groups, but I can't find a way to do such a thing with it. Joining the tables does not make a lot of sense either, because the row counts don't match.
Can someone point me to a relevant R function / an efficient way of solving this problem in R?
You can do this with a non-equi join in data.table:
library(data.table)
setDT(dfB)
setDT(dfA)
set(dfB, j='BelongToCount', value = as.numeric(dfB$BelongToCount))
dfB[dfA, BelongToCount := Count, on = .(Family, Start >= Start, End <= End)]
# Family Start End BelongToCount
# 1: A 1 3 1
# 2: A 4 6 1
# 3: A 36 40 3
# 4: B 2 4 1
# 5: B 10 24 2
In case a row in dfB is contained in multiple rows of dfA:
dfA2 <- rbind(dfA, dfA)
dfA2[dfB, .(BelongToCount = sum(Count)),
on = .(Family, Start <= Start, End >= End), by = .EACHI]
# Family Start End BelongToCount
# 1: A 1 3 2
# 2: A 4 6 2
# 3: A 36 40 6
# 4: B 2 4 2
# 5: B 10 24 4
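Since the question mentions dplyr: recent dplyr versions (>= 1.1.0, an assumption about your setup) also support non-equi joins via join_by, so a rough dplyr sketch of the same fill would be:
library(dplyr)
dfB %>%
  select(-BelongToCount) %>%                           # drop the all-NA placeholder column
  left_join(dfA, by = join_by(Family, Start >= Start, End <= End),
            suffix = c("", ".A")) %>%                  # keep dfB's Start/End names
  select(Family, Start, End, BelongToCount = Count)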
I have a large data set - around 32 million rows. I have information on the telephone number, the origin of the call, and the destination.
For each telephone number, I want to count the number of times it appeared either as Origin or as Destination.
An example data table is as follows:
library(data.table)
dt <- data.table(Tel=seq(1,5,1), Origin=seq(1,5,1), Destination=seq(3,7,1))
Tel Origin Destination
1: 1 1 3
2: 2 2 4
3: 3 3 5
4: 4 4 6
5: 5 5 7
I have working code, but it takes too long for my data since it involves a for loop. How can I optimize it?
Here it is:
for (i in unique(dt$Tel)) {
  index <- (dt$Origin == i | dt$Destination == i)
  dt[dt$Tel == i, "N"] <- sum(index)
}
Result:
Tel Origin Destination N
1: 1 1 3 1
2: 2 2 4 1
3: 3 3 5 2
4: 4 4 6 2
5: 5 5 7 2
where N indicates that Tel=1 appears once, Tel=2 appears once, and Tel=3, 4, and 5 each appear twice.
We can do a melt and match:
dt[, N := melt(dt, id.var = "Tel")[, tabulate(match(value, Tel))]]
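Broken out step by step (same logic as the one-liner; the nbins argument just guards against a trailing 'Tel' value that never appears):
long <- melt(dt, id.var = "Tel")                              # stack Origin and Destination into one 'value' column
cnt <- tabulate(match(long$value, dt$Tel), nbins = nrow(dt))  # count how often each Tel appears
dt[, N := cnt]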
Another option is to loop through columns 2 and 3, use %in% to check whether each value of 'Tel' is present, sum the logical vectors with Reduce and `+`, and assign (:=) the result to 'N':
dt[, N := Reduce(`+`, lapply(.SD, function(x) Tel %in% x)), .SDcols = 2:3]
dt
# Tel Origin Destination N
#1: 1 1 3 1
#2: 2 2 4 1
#3: 3 3 5 2
#4: 4 4 6 2
#5: 5 5 7 2
A second method constructs a temporary data.table which is then joined to the original. This is longer and likely less efficient than akrun's answer above, but it can be useful to see.
# get temporary data.table as the sum of origin and destination frequencies
temp <- setnames(data.table(table(unlist(dt[, .(Origin, Destination)], use.names=FALSE))),
c("Tel", "N"))
# convert the variables to integers ('Tel' comes from the names of the table above, and is thus character)
temp <- temp[, lapply(.SD, as.integer)]
Now join it with the original table:
dt <- temp[dt, on="Tel"]
dt
Tel N Origin Destination
1: 1 1 1 3
2: 2 1 2 4
3: 3 2 3 5
4: 4 2 4 6
5: 5 2 5 7
You can get the desired column order using setcolorder
setcolorder(dt, c("Tel", "Origin", "Destination", "N"))
I have two data.tables:
Values to extract the top k from, per group.
A mapping from each group to the number of values, k, to select for that group.
The question "how to find the top N values by group or within category (groupwise) in an R data.frame" addresses this when k does not vary by group. How can I do it when k varies by group? Here's sample data and the desired result:
Values:
(dt <- data.table(id=1:10,
group=c(rep(1, 5), rep(2, 5))))
# id group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 6 2
# 7: 7 2
# 8: 8 2
# 9: 9 2
# 10: 10 2
Mapping from group to k:
(group.k <- data.table(group=1:2,
k=2:3))
# group k
# 1: 1 2
# 2: 2 3
Desired result, which should include the first two records from group 1 and the first three records from group 2:
(result <- data.table(id=c(1:2, 6:8),
group=c(rep(1, 2), rep(2, 3))))
# id group
# 1: 1 1
# 2: 2 1
# 3: 6 2
# 4: 7 2
# 5: 8 2
Applying the solution from the above-linked question after merging returns this error:
merged <- merge(dt, group.k, by="group")
(result <- merged[, head(.SD, k), by=group])
# Error: length(n) == 1L is not TRUE
I'd rather do it as:
dt[group.k, head(.SD, k), by=.EACHI, on="group"]
because the intended operation is quite clear to see. j can be .SD[1:k] of course (see the sketch below). Both these expressions will very likely be (further) optimised for speed in the next release.
See this post for a detailed explanation of by=.EACHI until we wrap up those vignettes.
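For reference, the .SD[1:k] form mentioned above looks like this (equivalent here, since every group has at least k rows):
dt[group.k, .SD[1:k], by=.EACHI, on="group"]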
After merging in the k by group, a similar approach to the solution at https://stackoverflow.com/a/14800271/1840471 can be applied; you just need unique() to avoid the length(n) error:
merged <- merge(dt, group.k, by="group")
(result <- merged[, head(.SD, unique(k)), by=group])
# group id k
# 1: 1 1 2
# 2: 1 2 2
# 3: 2 6 3
# 4: 2 7 3
# 5: 2 8 3
What is the right way to add marginal sums to a data.table?
What I do right now:
> (a <- data.table(x=c(1,2,1,2,2,3,3),y=c(10,10,20,20,30,30,40),z=1:7,key=c("x")))
x y z
1: 1 10 1
2: 1 20 3
3: 2 10 2
4: 2 20 4
5: 2 30 5
6: 3 30 6
7: 3 40 7
> (a <- a[a[,sum(z),by=x]])
x y z V1
1: 1 10 1 4
2: 1 20 3 4
3: 2 10 2 11
4: 2 20 4 11
5: 2 30 5 11
6: 3 30 6 13
7: 3 40 7 13
> setnames(a,"V1","x.z")
> setkeyv(a,"y")
> (a <- a[a[,sum(z),by=y]])
y x z x.z V1
1: 10 1 1 4 3
2: 10 2 2 11 3
3: 20 1 3 4 7
4: 20 2 4 11 7
5: 30 2 5 11 11
6: 30 3 6 13 11
7: 40 3 7 13 7
> setnames(a,"V1","y.z")
I am pretty sure this is not The Right Way.
What is?
One alternative is this:
> a[,Sum:=sum(z), by="x"]
> a
x y z Sum
1: 1 10 1 4
2: 1 20 3 4
3: 2 10 2 11
4: 2 20 4 11
5: 2 30 5 11
6: 3 30 6 13
7: 3 40 7 13
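The marginal sum over y works the same way (the column name here is just illustrative):
a[, ySum := sum(z), by="y"]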
Edit: Some more explanation on := usage:
The := operator enables add/update by reference. With this, you can:
add new columns or update existing columns by reference
DT[, x2 := x+1] # add one new column
DT[, `:=`(x2 = x+1, y2 = y+1)] # adding more than 1 col
DT[, x := x+1] # modify existing column
add or update certain rows of new or existing columns by reference
DT[x == 1L, y := NA] # modify 'y' just where expression in 'i' matches
DT[x == 1L, `:=`(y = NA, z=NA)] # same but for multiple columns
DT[x == 1L, newcol := 5L] # matched rows of 'newcol' get 5, all others NA
add or update columns while grouping, by reference - by default, the computed result is recycled within each group.
DT[, zsum := sum(z), by=x]
Here, sum(z) returns 1 value for each group in x. The result is then recycled for length of that group and is added/updated by reference to zsum.
add or update during a by-without-by operation. That is, when you perform a data.table join and you want to add/update a column while joining:
X <- data.table(x=rep(1:3, each=2), y=1:6, key="x")
Y <- data.table(x=1:3, y=c(3L, 1L, 2L), key="x")
X[Y, y.gt := y > i.y]
Finally, you can also remove columns by reference (i.e. instantly, even if it's a 20GB table):
DT[, x := NULL] # just 1 column
DT[, c("x","y") := NULL] # 1 or more columns
toRemove = c("x","y")
DT[, (toRemove) := NULL] # wrap with brackets to lookup variable
Hope this helps clarify the usage of :=. Also check out ?set. It is similar to :=, but with the limitation that it cannot be combined with joins. For the operations it does support, this makes it faster than := inside a for loop, due to the reduced overhead of not calling [.data.table.
It can be quite handy in some scenarios; see this post for a nice usage.
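A minimal sketch of set() inside a loop (the table and the doubling are purely illustrative):
DT <- data.table(x = 1:5, y = 6:10)
for (j in names(DT)) {
  set(DT, j = j, value = DT[[j]] * 2L)  # update each column by reference, without [.data.table overhead
}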
I have some data.tables like so:
x <- data.table(id=rep(1:3, 2), a=1:6)
y <- data.table(id=1:3, b=2:4)
I can merge them like this:
setkey(x, id)
setkey(y, id)
x[y]
id a b
1: 1 1 2
2: 1 4 2
3: 2 2 3
4: 2 5 3
5: 3 3 4
6: 3 6 4
Now I want to create a new column in x, based on a and b, which is their sum.
I can do this with:
x[y, val:=a + b]
However, now suppose for some reason that the '+' operator is not vectorised. How can I store a row-wise calculation into x where x[y] is needed for the calculation? Also, assume I cannot use mapply (because for my actual problem, mapply is not suited to the function).
I'm trying to use sapply like so to add in a row-wise manner:
x[y, sapply(1:nrow(x), function (i) a[i] + b[i])]
However this returns the incorrect result:
id V1
1: 1 3
2: 1 NA
3: 1 NA
4: 1 NA
5: 1 NA
6: 1 NA
7: 2 5
8: 2 NA
9: 2 NA
10: 2 NA
11: 2 NA
12: 2 NA
13: 3 7
14: 3 NA
15: 3 NA
16: 3 NA
17: 3 NA
18: 3 NA
If I do this it works:
x[y][, sapply(1:nrow(x), function (i) a[i] + b[i])]
# [1] 3 6 5 8 7 10
BUT when I try and assign this to a column in x, it is not stored (makes sense because it looks like I'm trying to save the new column into x[y]).
x[y][, val:=sapply(1:nrow(x), function (i) a[i] + b[i])]
Is there any way to do the above but save the output into x[, val]?
Is this how I am supposed to do it, or is there a more data.table-y way?
x[, val:=x[y][, sapply(1:nrow(x), function (i) a[i] + b[i])]]
You are doing by-without-by without knowing it (see below for the description from the help).
Advanced: Aggregation for a subset of known groups is particularly
efficient when passing those groups in i. When i is a data.table,
DT[i,j] evaluates j for each row of i. We call this by without by or
grouping by i. Hence, the self join DT[data.table(unique(colA)),j] is
identical to DT[,j,by=colA].
This means that j is evaluated for each row of i (cycling through y one row at a time), so if you run sapply(1:nrow(x), ...) in j it will create a vector of length nrow(x) each time, which is not what you want.
So your second option is definitely a valid approach, as it is one of the recommended approaches for doing this.
Otherwise you could use .N (when grouping by i, .N is the number of rows in x matched to, for each row of i) instead of nrow(x), but you will have to think about the lengths of your objects and how your function is to be vectorized.
Take this as an example:
x[y, {browser(); a+b}]
Called from: `[.data.table`(x, y, {
browser()
a + b
})
Browse[1]> a
[1] 1 4
Browse[1]> b
[1] 2
Browse[1]> .N
[1] 2
a has length 2 because that key value matches 2 rows in x; b has length 1 because it has length 1 in y.
I think the best approach is to correctly Vectorize your function (which is hard to advise on without more of an example).
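For illustration, a sketch with f standing in for a function that only handles scalars:
f <- function(a, b) a + b   # pretend this works only on scalar inputs
fv <- Vectorize(f)          # mapply-based wrapper that recycles its arguments
x[y, valv := fv(a, b)]      # b (length 1 per group) is recycled to length .N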
Another approach would be to replicate b to the length of a, e.g.:
x[y, val := {
  bl <- rep_len(b, .N)
  sapply(seq_len(.N), function(i) a[i] + bl[i])
}]
x
id a val
1: 1 1 3
2: 1 4 6
3: 2 2 5
4: 2 5 8
5: 3 3 7
6: 3 6 10
Or, if you know that y has only one row per value of id, you don't need to index the columns coming from it:
x[y, val2 := sapply(seq_len(.N), function(i) a[i] + b)]
# an alternative would be to use sapply on a (avoid creating another vector)
x[y, val3 := sapply(a, function(ai) ai + b)]
x
# id a val val2 val3
# 1: 1 1 3 3 3
# 2: 1 4 6 6 6
# 3: 2 2 5 5 5
# 4: 2 5 8 8 8
# 5: 3 3 7 7 7
# 6: 3 6 10 10 10