Alternatives to a for loop for searching through a large dataset in R

The goal is to count, for each entry in column b, how many entries in column a fall within a range of +/-1 (or whatever tolerance is required). A simplified version is provided:
a <- c("1231210","1231211", "1231212", "98798", "98797", "98796", "555125", "555127","555128")
b <- c("1", "2", "3", "4", "5", "6", "1231209", "98797", "555126")
df <- data.frame(a, b)
I merged these into a data frame to simulate my actual dataset, converted the columns to numeric, and wrote the following function to get my desired output. (Note: column a need not be part of the df; it could be a separate vector, I suppose.)
df$c <- mapply(
  function(x) {
    count <- 0
    for (i in df$a) {
      if (abs(i - x) <= 1) {
        count <- count + 1
      }
    }
    paste0(count)  # note: paste0() returns character, so column c ends up as text
  },
  df$b
)
        a       b c
1 1231210       1 0
2 1231211       2 0
3 1231212       3 0
4   98798       4 0
5   98797       5 0
6   98796       6 0
7  555125 1231209 1
8  555127   98797 3
9  555128  555126 2
While this appears to work fine for the trial dataset, my actual dataset has over 2 million rows, which means roughly 2M^2 iterations (it was still running after 3 hours). I was wondering if there is an alternate strategy to tackle this, preferably using base R functions only.
I'm quite new to R, and a common suggestion is to use vectorization to improve efficiency. However, looking at the examples provided on the net, I have no clue whether that is possible in this case.
Would love to hear any suggestions, and feel free to point out mistakes. Thanks!

As your data is quite large, outer and lapply approaches will be quite slow (for outer you would need 14901.2 Gb of RAM). I suggest using data.table:
require(data.table)
dt <- as.data.table(df)
dt[, id := 1:.N] # add an id, in case you have duplicated values
setkey(dt, id)
dt[, b1 := b - 1L]
dt[, b2 := b + 1L]
x <- dt[dt, on = .(a >= b1, a <= b2)] # non-equi join
x <- x[, .(c = sum(!is.na(b1))), keyby = .(id = i.id)]
dt[x, c := i.c, on = 'id']
dt
# a b id b1 b2 c
# 1: 1231210 1 1 0 2 0
# 2: 1231211 2 2 1 3 0
# 3: 1231212 3 3 2 4 0
# 4: 98798 4 4 3 5 0
# 5: 98797 5 5 4 6 0
# 6: 98796 6 6 5 7 0
# 7: 555125 1231209 7 1231208 1231210 1
# 8: 555127 98797 8 98796 98798 3
# 9: 555128 555126 9 555125 555127 2
dt[, id := NULL][, b1 := NULL][, b2 := NULL] # remove the helper columns
P.S. Check that a and b are converted to integers beforehand...
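A minimal sketch of that conversion, assuming the columns arrived as character as in the question (do this before computing b1 and b2):
dt[, `:=`(a = as.integer(a), b = as.integer(b))]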

Why are vectors a and b characters? They should be numeric:
a <- c(1231210,1231211, 1231212, 98798, 98797, 98796, 555125, 555127,555128)
b <- c(1, 2, 3, 4, 5, 6, 1231209, 98797, 555126)
You can simplify by using only one loop and vectorization:
unlist(lapply(b, function(x) sum(abs(a-x) <= limit)))
where limit is a variable describing the allowed difference. For limit <- 1 you get:
[1] 0 0 0 0 0 0 1 3 2
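If even one loop over b is too slow at 2 million rows, here is a base-R sketch with sort() and findInterval() that replaces the linear scan of a with two binary searches per element of b; it assumes integer values, so that a < x - limit is equivalent to a <= x - limit - 1:
a_sorted <- sort(a)
limit <- 1
# count of a within [x - limit, x + limit] for every x in b:
# (number of a <= x + limit) minus (number of a <= x - limit - 1)
counts <- findInterval(b + limit, a_sorted) -
  findInterval(b - limit - 1, a_sorted)
counts
# [1] 0 0 0 0 0 0 1 3 2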

What about colSums + outer?
transform(
  type.convert(data.frame(a, b), as.is = TRUE),
  C = colSums(abs(outer(a, b, `-`)) <= 1)
)
output
a b C
1 1231210 1 0
2 1231211 2 0
3 1231212 3 0
4 98798 4 0
5 98797 5 0
6 98796 6 0
7 555125 1231209 1
8 555127 98797 3
9 555128 555126 2
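For the 2M-row case mentioned above, a full outer() is out of reach memory-wise. Here is a hedged sketch with a hypothetical count_near() helper that chunks b, so each outer() call allocates only length(a) * chunk doubles; the time cost is still proportional to length(a) * length(b), and the chunk size is an arbitrary assumption to tune against available RAM:
count_near <- function(a, b, limit = 1, chunk = 500L) {
  # process b in chunks of `chunk` elements to bound outer()'s allocation
  idx <- split(seq_along(b), ceiling(seq_along(b) / chunk))
  unlist(lapply(idx, function(i)
    colSums(abs(outer(a, b[i], `-`)) <= limit)
  ), use.names = FALSE)
}
count_near(as.numeric(a), as.numeric(b))
# [1] 0 0 0 0 0 0 1 3 2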

Related

Mapping 2 unrelated data frames in R

I need to use data from a dataframe A to fill a column in my dataframe B.
Here is a subset of dataframe A:
> dfA <- data.frame(Family=c('A','A','A','B','B'), Count=c(1,2,3,1,2), Start=c(0,10,35,0,5), End=c(10,35,50,5,25))
> dfA
Family Count Start End
1 A 1 0 10
2 A 2 10 35
3 A 3 35 50
4 B 1 0 5
5 B 2 5 25
and a subset of dataframe B
> dfB <- data.frame(Family=c('A','A','A','B','B'), Start=c(1,4,36,2,10), End=c(3,6,40,4,24), BelongToCount=c(NA,NA,NA,NA,NA))
> dfB
Family Start End BelongToCount
1 A 1 3 NA
2 A 4 6 NA
3 A 36 40 NA
4 B 2 4 NA
5 B 10 24 NA
What I want to do is to fill in the BelongToCount column in B according to the data from dataframe A, which would end up with dataframe B filled as:
Family Start End BelongToCount
A 1 3 1
A 4 6 1
A 36 40 3
B 2 4 1
B 10 24 2
I need to do this for each family (so grouping by family), and the condition for filling the BelongToCount column is that B$Start >= A$Start && B$End <= A$End.
I can't seem to find a clean (and fast) way to do this in R.
Right now, I am doing as follows:
split_A <- split(dfA, dfA$Family)
split_A_FamilyA <- split_A[["A"]]
split_B <- split(dfB, dfB$Family)
split_B_FamilyA <- split_B[["A"]]
for (i in 1:nrow(split_B_FamilyA)) {
  row <- split_B_FamilyA[i, ]
  start <- row$Start
  end <- row$End
  for (j in 1:nrow(split_A_FamilyA)) {
    row_base <- split_A_FamilyA[j, ]
    start_base <- row_base$Start
    end_base <- row_base$End
    if ((start >= start_base) && (end <= end_base)) {
      split_B_FamilyA[i, "BelongToCount"] <- row_base$Count
      break
    }
  }
}
I admit this is a very bad way of handling the problem (and it is awfully slow). I usually use dplyr when it comes to applying operations on specific groups, but I can't find a way to do such a thing using it. Joining the tables does not make a lot of sense either, because the numbers of rows don't match.
Can someone point me any relevant R function / an efficient way of solving this problem in R?
You can do this with a non-equi join in data.table:
library(data.table)
setDT(dfB)
setDT(dfA)
set(dfB, j='BelongToCount', value = as.numeric(dfB$BelongToCount))
dfB[dfA, BelongToCount := Count, on = .(Family, Start >= Start, End <= End)]
# Family Start End BelongToCount
# 1: A 1 3 1
# 2: A 4 6 1
# 3: A 36 40 3
# 4: B 2 4 1
# 5: B 10 24 2
In case a row in dfB is contained in multiple rows of dfA:
dfA2 <- rbind(dfA, dfA)
dfA2[dfB, .(BelongToCount = sum(Count)),
on = .(Family, Start <= Start, End >= End), by = .EACHI]
# Family Start End BelongToCount
# 1: A 1 3 2
# 2: A 4 6 2
# 3: A 36 40 6
# 4: B 2 4 2
# 5: B 10 24 4
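If you'd rather avoid data.table, here is a base-R sketch of the same containment logic: merge on Family (a within-family Cartesian product), filter, then map the counts back. It assumes (Family, Start, End) uniquely identifies dfB rows and that each dfB row falls inside at most one dfA interval:
# within-family cross join, with dfA's interval columns suffixed ".A"
m <- merge(dfB[, c("Family", "Start", "End")], dfA,
           by = "Family", suffixes = c("", ".A"))
# keep only pairs where the dfB interval is contained in the dfA interval
m <- m[m$Start >= m$Start.A & m$End <= m$End.A, ]
# write Count back onto dfB by matching the identifying columns
dfB$BelongToCount <- m$Count[match(
  paste(dfB$Family, dfB$Start, dfB$End),
  paste(m$Family, m$Start, m$End)
)]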

Subsetting data with a vector

I want to subset a data frame by a vector, but replicate the subsetting for each value in the vector:
data = data.frame(A = c(1,2,3,1), B = c(1,2,3,4))
vec = c(1, 1, 1)
subset(data, A %in% vec)
A B
1 1 1
4 1 4
Instead of this result I want this:
A B
1 1 1
4 1 4
1 1 1
4 1 4
1 1 1
4 1 4
If you use the purrr library, you can do
map_df(vec, function(x) subset(data, A == x))
with base R, it would be
do.call("rbind", lapply(vec, function(x) subset(data, A == x)))
You need to expand it, i.e.
df2 <- subset(data, A %in% vec)
df2[rep(rownames(df2), length(vec)),]
# A B
#1 1 1
#4 1 4
#1.1 1 1
#4.1 1 4
#1.2 1 1
#4.2 1 4
One option with data.table:
library(data.table)
setDT(data, key = 'A')[.(vec)]
# A B
#1: 1 1
#2: 1 4
#3: 1 1
#4: 1 4
#5: 1 1
#6: 1 4
Or use merge, which gives the Cartesian product you need when there are duplicated values in the merge-by column:
merge(data, data.frame(A = vec))
# A B
#1: 1 1
#2: 1 1
#3: 1 1
#4: 1 4
#5: 1 4
#6: 1 4
Along the lines of a base R split-apply-combine solution, you could use
do.call(rbind, lapply(vec, function(i) data[data$A == i, ]))
A B
1 1 1
4 1 4
11 1 1
41 1 4
12 1 1
42 1 4
This could be useful if vec contained an uneven mixture of values. This solution could be expensive if there are many repetitions in vec. In that instance, computation can be reduced by combining it with the rep idea in soto's answer as follows.
# count the number of repetitions of each unique value
uni <- table(vec)
# subset once per unique value
temp <- lapply(as.numeric(names(uni)), function(i) data[data$A == i, ])
# combine results, repeating each data.frame according to its count
do.call(rbind, temp[rep(seq_along(uni), times = uni)])

Mutate with dplyr using multiple conditions

I have a data frame (df) below, and I want to use dplyr to add an additional column, result, that takes the value 1 where z == "gone" and x is the maximum value for its group y.
y x z
1 a 3 gone
2 a 5 gone
3 a 8 gone
4 a 9 gone
5 a 10 gone
6 b 1
7 b 2
8 b 4
9 b 6
10 b 7
If I were to simply select the maximum for each group it would be:
df %>%
group_by(y) %>%
slice(which.max(x))
which will return:
y x z
1 a 10 gone
2 b 7
This is not what I want. I need to take advantage of the max value of x for each group in y while checking whether z == "gone", assigning 1 if TRUE and 0 otherwise. This would look like:
y x z result
1 a 3 gone 0
2 a 5 gone 0
3 a 8 gone 0
4 a 9 gone 0
5 a 10 gone 1
6 b 1 0
7 b 2 0
8 b 4 0
9 b 6 0
10 b 7 0
I'm assuming I would use a conditional statement within mutate() but I cannot seem to find an example. Please advise.
With dplyr you can use:
df %>% group_by(y) %>% mutate(result = +(x == max(x) & z == 'gone'))
The +(..) notation is shorthand for as.integer, coercing the logical output to 1s and 0s. Some don't like it, so it's a matter of shorter code versus readability; any efficiency gains depend on the circumstance.
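The coercion on its own, for anyone new to the idiom:
+(c(TRUE, FALSE, TRUE))           # unary + coerces logical to integer
# [1] 1 0 1
as.integer(c(TRUE, FALSE, TRUE))  # the explicit equivalent
# [1] 1 0 1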
Also to appreciate what data.table and dplyr have done for data manipulation with R, let's do the same thing in the old-fashioned "split-apply-combine" way:
#split data.frame by group
split.df <- split(df, df$y)
#apply required function to each group
lst <- lapply(split.df, function(dfx) {
  dfx$result <- +(dfx$x == max(dfx$x) & dfx$z == "gone")
  dfx
})
#combine result in new data.frame
newdf <- do.call(rbind, lst)
We can do this with data.table. We convert the data.frame to a data.table (setDT(df)); then, grouped by y, we create the logical condition combining the maximum value of x with the 'gone' element in z, coerce it to integer (as.integer), and assign (:=) the output to the new column, result.
library(data.table)
setDT(df)[, result := as.integer(x==max(x) & z=='gone') , by = y]
df
# y x z result
# 1: a 3 gone 0
# 2: a 5 gone 0
# 3: a 8 gone 0
# 4: a 9 gone 0
# 5: a 10 gone 1
# 6: b 1 0
# 7: b 2 0
# 8: b 4 0
# 9: b 6 0
#10: b 7 0
Or we can use ave from base R
df$result <- with(df, +(ave(x, y, FUN=max)==x & z=='gone' ))
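To see why the ave() version works, here is its intermediate result for the example data: it returns each group's maximum of x aligned to every row, so the == x comparison flags exactly the maximal rows:
with(df, ave(x, y, FUN = max))
# [1] 10 10 10 10 10  7  7  7  7  7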

Assigning values in first rows of groups in a data.table

I'd like to assign only those values in the first row of a group in a data.table.
For example (simplified): my data.table is DT with the following content
x v
1 1
2 2
2 3
3 4
3 5
3 6
The key of DT is x.
I want to address every first line of a group.
This is working fine: DT[, .SD[1], by=x]
x v
1 1
2 2
3 4
Now, I want to set only those values of v to 0.
But none of this is working:
DT[, .SD[1], by=x]$v <- 0
DT[, .SD[1], by=x]$v := 0
DT[, .SD[1], by=x, v:=0]
I searched the R help for the package and any links provided, but I just can't get it to work.
I found notes there saying this would not work but no examples/solutions that helped me out.
I'd be very glad for any suggestions.
(I like this package very much and I don't want to go back to a data.frame... where I got this working.)
edit:
I'd like to have a result like this:
x v
1 0
2 0
2 3
3 0
3 5
3 6
This is not working:
DT[, .SD[1], by=x] <- DT[, .SD[1], by=x][, v:=0]
Another option would be:
DT[,v:={v[1]<-0L;v}, by=x]
DT
# x v
#1: 1 0
#2: 2 0
#3: 2 3
#4: 3 0
#5: 3 5
#6: 3 6
Or
DT[DT[, .I[1], by=x]$V1, v:=0]
DT
# x v
#1: 1 0
#2: 2 0
#3: 2 3
#4: 3 0
#5: 3 5
#6: 3 6
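To unpack the second option: the inner DT[, .I[1], by=x] returns the row number of each group's first row (.I is data.table's built-in vector of row indices), and those row numbers are then used to assign v := 0 by reference:
DT[, .I[1], by = x]
#    x V1
# 1: 1  1
# 2: 2  2
# 3: 3  4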
With a little help from Roland's solution, it looks like you could do the following. It simply concatenates zero with all of each group's values of v except the first.
DT[, v := c(0L, v[-1]), by = x] ## must have the "L" after 0, as 0L
which results in
DT
# x v
# 1: 1 0
# 2: 2 0
# 3: 2 3
# 4: 3 0
# 5: 3 5
# 6: 3 6
Note: the middle section j of code could also be v := c(integer(1), v[-1])

Add a countdown column to a data.table counting rows until a special row is encountered

I have a data.table with ordered, labelled data, and I want to add a column that tells me how many records there are until I get to a "special" record that resets the countdown.
For example:
DT = data.table(idx = c(1,3,3,4,6,7,7,8,9),
name = c("a", "a", "a", "b", "a", "a", "b", "a", "b"))
setkey(DT, idx)
#manually add the answer
DT[, countdown := c(3,2,1,0,2,1,0,1,0)]
Gives
> DT
idx name countdown
1: 1 a 3
2: 3 a 2
3: 3 a 1
4: 4 b 0
5: 6 a 2
6: 7 a 1
7: 7 b 0
8: 8 a 1
9: 9 b 0
See how the countdown column tells me how many rows until a row called "b".
The question is how to create that column in code.
Note that the key is not evenly spaced and may contain duplicates (so it is not very useful in solving the problem). In general the non-b names could be different, but I could add a dummy column that is just TRUE/FALSE if the solution requires this.
Here's another idea:
## Create groups that end at each occurrence of "b"
DT[, cd:=0L]
DT[name=="b", cd:=1L]
DT[, cd:=rev(cumsum(rev(cd)))]
## Count down within them
DT[, cd:=max(.I) - .I, by=cd]
# idx name cd
# 1: 1 a 3
# 2: 3 a 2
# 3: 3 a 1
# 4: 4 b 0
# 5: 6 a 2
# 6: 7 a 1
# 7: 7 b 0
# 8: 8 a 1
# 9: 9 b 0
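For the example, a quick trace makes the trick visible (comments only, values taken from the data above):
## after marking the b rows:      cd = 0 0 0 1 0 0 1 0 1
## after rev(cumsum(rev(cd))):    cd = 3 3 3 3 2 2 2 1 1   (one group id per "b"-terminated block)
## max(.I) - .I within each group then yields 3 2 1 0 / 2 1 0 / 1 0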
I'm sure (or at least hopeful) that a purely "data.table" solution would be generated, but in the meantime, you could make use of rle. In this case, you're interested in reversing the countdown, so we'll use rev to reverse the "name" values before proceeding.
output <- sequence(rle(rev(DT$name))$lengths)
makezero <- cumsum(rle(rev(DT$name))$lengths)[c(TRUE, FALSE)]
output[makezero] <- 0
DT[, countdown := rev(output)]
DT
# idx name countdown
# 1: 1 a 3
# 2: 3 a 2
# 3: 3 a 1
# 4: 4 b 0
# 5: 6 a 2
# 6: 7 a 1
# 7: 7 b 0
# 8: 8 a 1
# 9: 9 b 0
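The pieces of that, evaluated on the example (rev(DT$name) is b a b a a b a a a):
rle(rev(DT$name))$lengths
# [1] 1 1 1 2 1 3
sequence(rle(rev(DT$name))$lengths)
# [1] 1 1 1 1 2 1 1 2 3
cumsum(rle(rev(DT$name))$lengths)[c(TRUE, FALSE)]
# [1] 1 3 6    <- the positions set to zero before the final rev()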
Here's a mix of Josh's and Ananda's solutions, in that I use rle to generate the groups the way Josh's answer does:
t <- rle(DT$name)
t <- t$lengths[t$values == "a"]
DT[, cd := rep(t, t+1)]
DT[, cd:=max(.I) - .I, by=cd]
Even better: making use of the fact that there's always only one b (or assuming so here), you could do one better:
t <- rle(DT$name)
t <- t$lengths[t$values == "a"]
DT[, cd := rev(sequence(rev(t+1)))-1]
Edit: From the OP's comment, it seems clear that more than one b is possible, and in such cases all b rows should be 0. The first step is then to create groups in which each run of consecutive a's ends with its trailing b's.
DT <- data.table(idx=sample(10), name=c("a","a","a","b","b","a","a","b","a","b"))
t <- rle(DT$name)
val <- cumsum(t$lengths)[t$values == "b"]
DT[, grp := rep(seq(val), c(val[1], diff(val)))]
DT[, val := c(rev(seq_len(sum(name == "a"))),
              rep(0, sum(name == "b"))), by = grp]
# idx name grp val
# 1: 1 a 1 3
# 2: 7 a 1 2
# 3: 9 a 1 1
# 4: 4 b 1 0
# 5: 2 b 1 0
# 6: 8 a 2 2
# 7: 6 a 2 1
# 8: 3 b 2 0
# 9: 10 a 3 1
# 10: 5 b 3 0
