R - efficient comparison of subsets of rows between data frames

Thank you for any help.
I need to check the total number of matches between the elements of each row of a data frame (df1) and the rows of another data frame (df2).
The data frames have different numbers of columns (5 in the first versus 6 in the second, for instance), and there is no exact formation rule for the rows, so I cannot solve this through combinatorial analysis.
The routine must check every row of the first data frame against every row of the second, returning the total number of occurrences for a given number of matches.
Not all possible totals are of interest; I am looking for a specific total (which I call "hits" in this text).
In other words: how many times can a subset of size "hits" of each row of df2 be found in the rows of df1?
Here is an example:
> ### Example
> ### df1 and df2 here are regularly formed just for illustration purposes
>
> require(combinat)
>
> df1 <- as.data.frame(t(combn(6,5)))
> df2 <- as.data.frame(t(combn(7,6)))
>
> df1
V1 V2 V3 V4 V5
1 1 2 3 4 5
2 1 2 3 4 6
3 1 2 3 5 6
4 1 2 4 5 6
5 1 3 4 5 6
6 2 3 4 5 6
>
> df2
V1 V2 V3 V4 V5 V6
1 1 2 3 4 5 6
2 1 2 3 4 5 7
3 1 2 3 4 6 7
4 1 2 3 5 6 7
5 1 2 4 5 6 7
6 1 3 4 5 6 7
7 2 3 4 5 6 7
>
In this example, note that subsets of size 5 from row #1 of df2 can be found 6 times in the rows of df1, and so on.
I tried something like this:
> ### Check how many times subsets of size "hits" from rows from df2 are found in rows of df1
>
> myfn <- function(dfa,dfb,hits) {
+ sapply(c(1:dim(dfb)[1]),function(y) { sum(c(apply(dfa,1,function(x,i) { sum(x %in% dfb[i,]) },i=y))==hits) })
+ }
>
> r1 <- myfn(df1,df2,5)
>
> cbind(df2,"hits.eq.5" = r1)
V1 V2 V3 V4 V5 V6 hits.eq.5
1 1 2 3 4 5 6 6
2 1 2 3 4 5 7 1
3 1 2 3 4 6 7 1
4 1 2 3 5 6 7 1
5 1 2 4 5 6 7 1
6 1 3 4 5 6 7 1
7 2 3 4 5 6 7 1
This seems to do what I need, but it is too slow. I need to use this routine on large data frames (about 200K rows).
I am currently using R 3.1.2 GUI 1.65 Mavericks build (6833)
Can anyone provide a faster or more clever way of doing this? Thank you again.
Best regards,
Vaccaro

Using apply(...) on data frames is very inefficient. This is because apply(...) takes a matrix as argument, so if you pass a data frame it will coerce that to a matrix. In your example you convert df1 to a matrix every time you call apply(...), which is nrow(df2) times.
Also, by using sapply(1:nrow(df2),...) and dfb[i,] you are using data frame row indexing, which is also very inefficient. You are much better off converting everything to matrix class at the beginning and then using apply(...) twice.
Finally, there is no reason to use a call to c(...). apply(...) already returns a vector (in this case), so you are just incurring the overhead of another function call to no effect.
Doing these things alone speeds up your code by about a factor of 20.
set.seed(1)
nrows <- 100
df1 <- data.frame(matrix(sample(1:5, 5*nrows, replace=TRUE), ncol=5))
df2 <- data.frame(matrix(sample(1:6, 6*nrows, replace=TRUE), ncol=6))
myfn <- function(dfa, dfb, hits) {
  sapply(c(1:dim(dfb)[1]), function(y) { sum(c(apply(dfa, 1, function(x, i) { sum(x %in% dfb[i,]) }, i=y)) == hits) })
}
myfn.2 <- function(dfa, dfb, hits) {
  ma <- as.matrix(dfa)
  mb <- as.matrix(dfb)
  apply(mb, 1, function(y) { sum(apply(ma, 1, function(x) { sum(x %in% y) }) == hits) })
}
system.time(r1<-myfn(df1,df2,3))
# user system elapsed
# 1.99 0.00 2.00
system.time(r2<-myfn.2(df1,df2,3))
# user system elapsed
# 0.09 0.00 0.10
identical(r1,r2)
# [1] TRUE
There is another approach which takes advantage of the fact that R is extremely efficient at manipulating lists. Since a data frame is just a list of vectors, we can improve performance by putting your rows into data frame columns and then using sapply(...) on that. This is faster than myfn.2(...) above, but only by about 20%.
myfn.3 <- function(dfa, dfb, hits) {
  df1.t <- data.frame(t(dfa)) # rows into columns
  df2.t <- data.frame(t(dfb))
  sapply(df2.t, function(col2) sum(sapply(df1.t, function(col1) sum(col1 %in% col2) == hits)))
}
library(microbenchmark)
microbenchmark(myfn.2(df1,df2,5),myfn.3(df1,df2,5),times=10)
# Unit: milliseconds
# expr min lq median uq max neval
# myfn.2(df1, df2, 5) 92.84713 94.06418 96.41835 98.44738 99.88179 10
# myfn.3(df1, df2, 5) 75.53468 77.44348 79.24123 82.28033 84.12457 10
If you really have a dataset with 55MM rows, then I think you need to rethink this problem. I have no idea what you are trying to accomplish, but this seems like a brute force approach.
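If the row values really do come from a small set of positive integers (as in the combn() example in the question), one option worth trying is to replace the nested apply calls with a single matrix product on 0/1 indicator matrices. The following is only a sketch under that assumption (and assuming no value repeats within a row); count_hits is a name I made up, not something from the answer above.
count_hits <- function(dfa, dfb, hits) {
  k <- max(as.matrix(dfa), as.matrix(dfb))    # assumed value range is 1..k
  indicator <- function(d) {
    m <- as.matrix(d)
    out <- matrix(0L, nrow(m), k)
    out[cbind(rep(seq_len(nrow(m)), ncol(m)), as.vector(m))] <- 1L
    out
  }
  ## overlap[i, j] = number of values shared by row i of dfb and row j of dfa
  overlap <- indicator(dfb) %*% t(indicator(dfa))
  rowSums(overlap == hits)
}
count_hits(df1, df2, 5)   # with the combn() data frames from the question, this reproduces myfn(df1, df2, 5)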

Related

Build a data frame with overlapping observations

Let's say I have a data frame with the following structure:
> DF <- data.frame(x=1:5, y=6:10)
> DF
x y
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
I need to build a new data frame with overlapping observations from the first data frame to be used as an input for building the A matrix for the Rglpk optimization library. I would use n-length observation windows, so that if n=2 the resulting data frame would join rows 1&2, 2&3, 3&4, and so on. The length of the resulting data frame would be
(numberOfObservations-windowSize+1)*windowSize
The result for this example with windowSize=2 would have (5 - 2 + 1) * 2 = 8 rows and look like
x y
1 1 6
2 2 7
3 2 7
4 3 8
5 3 8
6 4 9
7 4 9
8 5 10
I could do a loop like
DFResult <- NULL
numBlocks <- nrow(DF)-windowSize+1
for (i in 1:numBlocks) {
  DFResult <- rbind(DFResult, DF[i:(i+windowSize-1), ])
}
But this seems very inefficient, especially for very large data frames.
I also tried
rollapply(data=DF, width=windowSize, FUN=function(x) x, by.column=FALSE, by=1)
x y
[1,] 1 6
[2,] 2 7
[3,] 2 7
[4,] 3 8
where I was trying to repeat a block of rows without applying any aggregate function. This does not work since some rows are missing from the result.
I am a bit stumped by this and have looked around for similar problems but could not find any. Does anyone have any better ideas?
We could do a vectorized approach
i1 <- seq_len(nrow(DF))
res <- DF[c(rbind(i1[-length(i1)], i1[-1])),]
row.names(res) <- NULL
res
# x y
#1 1 6
#2 2 7
#3 2 7
#4 3 8
#5 3 8
#6 4 9
#7 4 9
#8 5 10
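For what it's worth, the same indexing idea can be generalized to an arbitrary window size. A sketch (my own, not part of the answer above), with n standing for windowSize:
n <- 2   # windowSize; works for any n <= nrow(DF)
idx <- as.vector(sapply(seq_len(nrow(DF) - n + 1), function(i) i:(i + n - 1)))
res <- DF[idx, ]
row.names(res) <- NULL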

Using two grouping designations to create one 'combined' grouping variable

Given a data.frame:
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4),
                 grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10))
#> df
# grp1 grp2
#1 1 1
#2 1 2
#3 1 3
#4 2 3
#5 2 4
#6 2 5
#7 3 6
#8 3 7
#9 3 8
#10 4 6
#11 4 9
#12 4 10
Both columns are grouping variables, such that all 1's in column grp1 are known to be grouped together, and so on with all 2's, etc. The same goes for grp2: all 1's are known to be the same, all 2's the same.
Thus, if we look at the 3rd and 4th row, based on column 1 we know that the first 3 rows can be grouped together and the second 3 rows can be grouped together. Then since rows 3 and 4 share the same grp2 value, we know that all 6 rows, in fact, can be grouped together.
Based off the same logic we can see that the last six rows can also be grouped together (since rows 7 and 10 share the same grp2).
Aside from writing a fairly involved set of for() loops, is there a more straightforward approach to this? I haven't been able to think of one yet.
The final output that I'm hoping to obtain would look something like:
# > df
# grp1 grp2 combinedGrp
# 1 1 1 1
# 2 1 2 1
# 3 1 3 1
# 4 2 3 1
# 5 2 4 1
# 6 2 5 1
# 7 3 6 2
# 8 3 7 2
# 9 3 8 2
# 10 4 6 2
# 11 4 9 2
# 12 4 10 2
Thank you for any direction on this topic!
I would define a graph and label nodes according to connected components:
gmap = unique(stack(df))
gmap$node = seq_len(nrow(gmap))
oldcols = unique(gmap$ind)
newcols = paste0("node_", oldcols)
df[ newcols ] = lapply(oldcols, function(i) with(gmap[gmap$ind == i, ],
node[ match(df[[i]], values) ]
))
library(igraph)
g = graph_from_edgelist(cbind(df$node_grp1, df$node_grp2), directed = FALSE)
gmap$group = components(g)$membership
df$group = gmap$group[ match(df$node_grp1, gmap$node) ]
grp1 grp2 node_grp1 node_grp2 group
1 1 1 1 5 1
2 1 2 1 6 1
3 1 3 1 7 1
4 2 3 2 7 1
5 2 4 2 8 1
6 2 5 2 9 1
7 3 6 3 10 2
8 3 7 3 11 2
9 3 8 3 12 2
10 4 6 4 10 2
11 4 9 4 13 2
12 4 10 4 14 2
Each unique element of grp1 or grp2 is a node and each row of df is an edge.
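A slightly more compact variant of the same idea (a sketch, not code from this answer) skips the manual node numbering by using character vertex names, with a prefix to keep grp1 and grp2 values distinct:
library(igraph)
g2 <- graph_from_data_frame(data.frame(from = paste0("a", df$grp1),
                                       to   = paste0("b", df$grp2)),
                            directed = FALSE)
memb <- setNames(components(g2)$membership, V(g2)$name)
df$combinedGrp <- memb[paste0("a", df$grp1)]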
One way to do this is via a matrix that defines links between rows based on group membership.
This approach is related to @Frank's graph answer but uses an adjacency matrix rather than using edges to define the graph. An advantage of this approach is it can deal immediately with many (> 2) grouping columns with the same code (so long as you write the function that determines links flexibly). A disadvantage is you need to make all pair-wise comparisons between rows to construct the matrix, so for very long vectors it could be slow. As is, @Frank's answer would work better for very long data, or if you only ever have two columns.
The steps are
compare rows based on groups and define these rows as linked (i.e., create a graph)
determine connected components of the graph defined by the links in 1.
You could do step 2 a few ways. Below I show a brute force way where you 2a) collapse links until reaching a stable link structure using matrix multiplication and 2b) convert the link structure to a factor using hclust and cutree. You could also use igraph::clusters on a graph created from the matrix.
1. Construct an adjacency matrix (matrix of pairwise links) between rows (i.e., if two rows are in the same group the matrix entry is 1, otherwise it's 0). First, a helper function that determines whether two rows are linked:
linked_rows <- function(data){
  ## helper function
  ## returns a _function_ to compare two rows of data
  ## based on group membership.
  ## Use Vectorize so it works even on vectors of indices
  Vectorize(function(i, j) {
    ## numeric: 1 = i and j have overlapping group membership
    common <- vapply(names(data), function(name)
      data[i, name] == data[j, name],
      FUN.VALUE=FALSE)
    as.numeric(any(common))
  })
}
which I use in outer to construct a matrix,
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
2a. Collapse 2-degree links into 1-degree links. That is, if rows are linked by an intermediate node but not directly linked, lump them in the same group by defining a link between them.
One iteration involves: i) matrix multiply to get the square of A, and
ii) set any non-zero entry in the squared matrix to 1 (as if it were a first degree, pairwise link)
## define as a function to use below
lump_links <- function(A) {
  A <- A %*% A
  A[A > 0] <- 1
  A
}
Repeat this until the links are stable:
oldA <- 0
i <- 0
while (any(oldA != A)) {
  oldA <- A
  A <- lump_links(A)
}
2b. Use the stable link structure in A to define groups (connected components of the graph). You could do this in a variety of ways.
One way is to first define a distance object, then use hclust and cutree. If you think about it, we want to define linked (A[i,j] == 1) as distance 0. So the steps are a) define linked as distance 0 in a dist object, b) construct a tree from the dist object, c) cut the tree at zero height (i.e., zero distance):
df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
df
In practice you can encode steps 1 - 2 in a single function that uses the helpers lump_links and linked_rows:
lump <- function(df) {
  rows <- 1:nrow(df)
  A <- outer(rows, rows, linked_rows(df))
  oldA <- 0
  while (any(oldA != A)) {
    oldA <- A
    A <- lump_links(A)
  }
  df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
  df
}
This works for the original df and also for the structure in @rawr's answer:
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,6,7,8,9),
                 grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10,11,3,12,3,6,12))
lump(df)
grp1 grp2 combinedGrp
1 1 1 1
2 1 2 1
3 1 3 1
4 2 3 1
5 2 4 1
6 2 5 1
7 3 6 2
8 3 7 2
9 3 8 2
10 4 6 2
11 4 9 2
12 4 10 2
13 5 11 1
14 5 3 1
15 6 12 3
16 7 3 1
17 8 6 2
18 9 12 3
PS
Here's a version using igraph, which makes the connection with @Frank's answer clearer:
lump2 <- function(df) {
  rows <- 1:nrow(df)
  A <- outer(rows, rows, linked_rows(df))
  cluster_A <- igraph::clusters(igraph::graph.adjacency(A))
  df$combinedGrp <- cluster_A$membership
  df
}
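A quick way to check that lump() and lump2() agree (up to how the components happen to be numbered) is to cross-tabulate their outputs; a sketch:
table(lump(df)$combinedGrp, lump2(df)$combinedGrp)  # each row and column should have a single non-zero cell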
Hope this solution helps you a bit:
Assumption: df is ordered on the basis of grp1.
## split dataset using values of grp1
split_df <- split.default(df$grp2, df$grp1)
parent <- vector('integer', length(split_df))
## find out which combinations have values of grp2 in common
for (i in seq(1, length(split_df)-1)){
  for (j in seq(i+1, length(split_df))){
    inter <- intersect(split_df[[i]], split_df[[j]])
    if (length(inter) > 0){
      parent[j] <- i
    }
  }
}
ans <- vector('list', length(split_df))
index <- which(parent == 0)
## index contains indices of elements that have no element in common
for (i in seq_along(index)){
  ans[[index[i]]] <- rep(i, length(split_df[[index[i]]]))
}
rest_index <- seq(1, length(split_df))[-index]
for (i in rest_index){
  val <- ans[[parent[i]]][1]
  ans[[i]] <- rep(val, length(split_df[[i]]))
}
df$combinedGrp <- unlist(ans)
df
grp1 grp2 combinedGrp
1 1 1 1
2 1 2 1
3 1 3 1
4 2 3 1
5 2 4 1
6 2 5 1
7 3 6 2
8 3 7 2
9 3 8 2
10 4 6 2
11 4 9 2
12 4 10 2
Based on https://stackoverflow.com/a/35773701/2152245, I used a different implementation of igraph because I already had an adjacency matrix of sf polygons from st_intersects():
library(igraph)
library(sf)
library(dplyr)  # needed for group_by() / summarize() below
# Use example data
nc <- st_read(system.file("shape/nc.shp", package="sf"))
nc <- nc[-sample(1:nrow(nc), nrow(nc)*.75),] # drop some polygons
# Find intersections
b <- st_intersects(nc, sparse = F)
g <- graph.adjacency(b)
clu <- components(g)
gr <- groups(clu)
# Quick loop to assign the groups
for(i in 1:nrow(nc)){
  for(j in 1:length(gr)){
    if(i %in% gr[[j]]){
      nc[i,'group'] <- j
    }
  }
}
# Make a new sfc object
nc_un <- group_by(nc, group) %>%
  summarize(BIR74 = mean(BIR74), do_union = TRUE)
plot(nc_un['BIR74'])
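As an aside, the double loop above can probably be replaced by a direct assignment, since components() returns one membership value per node (row) in order; a sketch:
nc$group <- clu$membership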

How to remove outliers from multiple columns of a data frame

I would like to get a data frame that contains only data that is within 2 SD per each numeric column.
I know how to do it for a single column but how can I do it for a bunch of columns at once?
Here is the toy data frame:
df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c",header = TRUE)
Here is the code line for getting only the data that is within 2 SD for a single column (birds). How can I do it for all numeric columns at once?
df[!(abs(df$birds - mean(df$birds))/sd(df$birds)) > 2,]
target birds wolfs Country
2 3 8 4 b
3 1 2 8 c
4 1 2 3 a
5 1 8 3 a
6 6 1 2 a
7 6 7 1 b
8 6 1 5 c
We can use lapply to loop over the dataset columns and subset the numeric vectors (using an if/else condition) based on the mean and sd.
lapply(df, function(x) if(is.numeric(x)) x[!(abs((x-mean(x))/sd(x))>2)] else x)
EDIT:
I was under the impression that we need to remove the outliers for each column separately. But if we need to keep only the rows that have no outliers in any of the numeric columns, we can loop through the columns with lapply as before, but instead of returning 'x', we return the sequence of 'x', and then get the intersection of the list elements with Reduce. The numeric index can be used for subsetting the rows.
lst <- lapply(df, function(x) if(is.numeric(x))
seq_along(x)[!(abs((x-mean(x))/sd(x))>2)] else seq_along(x))
df[Reduce(intersect,lst),]
I'm guessing that you are trying to filter your data set by checking that all of the numeric columns are within 2 SD (?)
In that case I would suggest creating two filters: one that indicates the numeric columns, and a second that checks that all of them are within 2 SD. For the second condition, we can use the built-in scale function:
indx <- sapply(df, is.numeric)
indx2 <- rowSums(abs(scale(df[indx])) <= 2) == sum(indx)
df[indx2,]
# target birds wolfs Country
# 2 3 8 4 b
# 3 1 2 8 c
# 4 1 2 3 a
# 5 1 8 3 a
# 6 6 1 2 a
# 7 6 7 1 b
# 8 6 1 5 c
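For readers unfamiliar with scale(): it converts each numeric column to z-scores (subtract the column mean, divide by the column SD), so abs(scale(...)) <= 2 flags the values that lie within 2 SD of their column mean. A minimal illustration, reusing indx from above:
z <- scale(df[indx])   # one column of z-scores per numeric variable
head(abs(z) <= 2)      # TRUE where a value is within 2 SD of its column mean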

For each row in data frame, return variable with non-zero column names

I am trying to create a variable that contains a list of all of the column names that are not zero for each row.
Example of data:
set.seed(334)
DF <- matrix(sample(0:9,9),ncol=4,nrow=10)
DF <- as.data.frame.matrix(DF)
DF$id <- c("ty18","se78","first", "gh89", "sil12","seve","aga2", "second","anotherX", "CH560")
DF$count <- rowSums(DF[,2:5]>0)
DF
V1 V2 V3 V4 id count
1 9 4 0 5 ty18 3
2 4 0 5 8 se78 3
3 0 5 8 2 first 4
4 5 8 2 6 gh89 4
5 8 2 6 7 sil12 4
6 2 6 7 3 seve 4
7 6 7 3 9 aga2 4
8 7 3 9 4 second 4
9 3 9 4 0 anotherX 3
10 9 4 0 5 CH560 3
The desired output would be a new variable that was, for row 1, "V1 V2 V4" and for row 2 "V1 V3 V4". I only want to use the V1-V4 for this, and not consider id or count.
This question on SO helped: For each row return the column name of the largest value
I tried to test this out, but it ignores my selective columns, even for max, so the first test here just gives the max for the whole row, which is not always in V1-V4 in my data.
DF$max <- colnames(DF)[apply(DF[,1:4],1,which.max)]
Despite the error, I think I need to do something like this, but my DF$list attempt is clearly all wrong:
DF$list <- colnames(DF[,1:4]>0)
I'm getting
Error in `$<-.data.frame`(`*tmp*`, "list", value = c("V1", "V2", "V3", :
replacement has 4 rows, data has 10
Maybe I'm trying to put a vector into a cell, and that is why it doesn't work, but I don't know how to get this information out and then make it into a string. I also don't understand why the max on selective columns did not work.
How about this
DF$nonzeros <- simplify2array(
  apply(
    DF[1:4], 1,
    function(x) paste(names(DF[1:4])[x != 0], collapse = " ")
  )
)
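Since apply() over rows already simplifies a one-string-per-row result to a character vector, a slightly shorter equivalent (a sketch) drops simplify2array():
DF$nonzeros <- apply(DF[1:4], 1, function(x) paste(names(DF)[1:4][x != 0], collapse = " "))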

Computing difference between rows in a data frame

I have a data frame. I would like to compute how "far" each row is from a given row. Let us consider it for the 1st row. Let the data frame be as follows:
> sampleDF
X1 X2 X3
1 5 5
4 2 2
2 9 1
7 7 3
What I wish to do is the following:
Compute the difference between the 1st row & others: sampleDF[1,]-sampleDF[2,]
Consider only the absolute value: abs(sampleDF[1,]-sampleDF[2,])
Compute the sum of the newly formed data frame of differences: rowSums(newDF)
Now to do this for the whole data frame.
newDF <- sapply(2:4,function(x) { return (abs(sampleDF[1,]-sampleDF[x,]));})
This creates a problem in that the result is a transposed list. Hence,
newDF <- as.data.frame(t(sapply(2:4,function(x) { return (abs(sampleDF[1,]-sampleDF[x,]));})))
But another problem arises while computing rowSums:
> class(newDF)
[1] "data.frame"
> rowSums(newDF)
Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...) :
'x' must be numeric
> newDF
X1 X2 X3
1 3 3 3
2 1 4 4
3 6 2 2
>
Puzzle 1: Why do I get this error? I did notice that newDF[1,1] is a list & not a number. Is it because of that? How can I ensure that the result of the sapply & transpose is a simple data frame of numbers?
So I proceed to create a global data frame & modify it within the function:
sapply(2:4,function(x) { newDF <<- as.data.frame(rbind(newDF,abs(sampleDF[1,]-sampleDF[x,])));})
> newDF
X1 X2 X3
2 3 3 3
3 1 4 4
4 6 2 2
> rowSums(newDF)
2 3 4
9 9 10
>
This is as expected.
Puzzle 2: Is there a cleaner way to achieve this? How can I do this for every row in the data frame (shown above is only the "distance" from row 1; I would need to do this for the other rows as well)? Is running a loop the only option?
To put it in words, you are trying to compute the Manhattan distance:
dist(sampleDF, method = "manhattan")
#    1  2  3
# 2  9
# 3  9 10
# 4 10  9  9
Regarding your implementation, I think the problem is that your inner function is returning a data.frame when it should return a numeric vector. Doing return(unlist(abs(sampleDF[1,]-sampleDF[x,]))) should fix it.
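For Puzzle 2, note that dist() already computes the distance between every pair of rows; if a full row-by-row table is more convenient, the dist object can be expanded (a sketch):
m <- as.matrix(dist(sampleDF, method = "manhattan"))
m[1, -1]   # distances of rows 2-4 from row 1: 9 9 10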
