Finding rows of a large matrix that match specific values - r

My aim is to find the row indices of a matrix (dat) that match rows of another matrix (xy).
This is easy to do with small matrices, as shown in the examples below. But the matrices I have contain a very large number of rows.
For the toy example, the matrices dat and xy are given below; the aim is to recover the indices 14, 58, 99.
# toy data
dat <- iris
dat$Sepal.Length <- dat$Sepal.Length * (1 + runif(150))
xy <- dat[c(14, 58, 99), c(1, 5)]
For small matrices, the solutions would be:
# solution 1: loop over the rows of xy
ind <- NULL
for (j in 1:nrow(xy)) {
  ind[j] <- which((dat$Sepal.Length == xy[j, 1]) & (dat$Species == xy[j, 2]))
}
Or
# solution 2
which(outer(dat$Sepal.Length, xy[, 1], "==") &
        outer(dat$Species, xy[, 2], "=="), arr.ind = TRUE)
But given the size of my data, these methods are not feasible: the first takes a lot of time and the second fails for lack of memory.
I wish I knew more about data.table and dplyr.

With data.table, it's a join:
library(data.table)
setDT(dat); setDT(xy)
dat[xy, on=names(xy), which=TRUE]
# [1] 14 58 99
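Here which = TRUE makes the join return the matching row numbers in dat rather than the joined rows. If some rows of xy might have no counterpart in dat, those positions come back as NA by default; they can be dropped with one more argument (a sketch, assuming data.table >= 1.12.4 for nomatch = NULL):
dat[xy, on = names(xy), which = TRUE, nomatch = NULL]  # unmatched rows of xy are dropped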

You could try this dplyr solution; whether it is fast enough depends on how big your data frames are. Note that %in% checks each column independently rather than matching rows as pairs, so it can return extra rows when values repeat; it works here because the perturbed Sepal.Length values are effectively unique.
#use dplyr filter
library(dplyr)
dat %>%
  mutate(row_no = row_number()) %>%
  filter(Sepal.Length %in% xy$Sepal.Length & Species %in% xy$Species) %>%
  select(row_no)
#> row_no
#> 1 14
#> 2 58
#> 3 99

I used paste0() to concatenate Sepal.Length and Species into a temporary key for each row.
Then match() returns, for each key built from dat, the position of its match among the keys built from xy (or NA if there is none).
Negating is.na() with '!' removes the non-matches, giving a logical vector that is TRUE for the matches.
Then which() returns the indices that are TRUE.
which(!is.na(match(paste0(dat$Sepal.Length, dat$Species), paste0(xy$Sepal.Length, xy$Species))))
[1] 14 58 99
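One caveat: bare concatenation can collide (paste0(12, "34") and paste0(123, "4") both give "1234"), so a separator that cannot occur in the data is safer:
which(!is.na(match(paste(dat$Sepal.Length, dat$Species, sep = "\r"),
                   paste(xy$Sepal.Length, xy$Species, sep = "\r"))))
# [1] 14 58 99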
PS: merge() accepts combined variables in by.x and by.y:
merge(dat, xy, by.x=c("Sepal.Length", "Species"), by.y=c("Sepal.Length", "Species"), all.x=FALSE, all.y=TRUE)
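merge() returns the matching rows themselves, not their positions. To recover the indices, one option (a sketch, using a hypothetical row_id column) is to carry a row number through the merge:
dat$row_id <- seq_len(nrow(dat))  # hypothetical index column
merge(dat, xy, by = c("Sepal.Length", "Species"))$row_id
# note: merge() sorts by the join columns, so sort() the result if order matters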

Following chinsoon12's suggestion, try this:
library(dplyr)
dat$rowind <- 1:nrow(dat) # adds row index if wanted (not necessary though)
newDf <- semi_join(dat, xy, by = c("Species", "Sepal.Length"))
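With the rowind column carried along, the matching indices then fall out of the result directly:
newDf$rowind
# [1] 14 58 99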

For the setup you provided, you could use:
library(tidyverse)
dat %>%
  mutate(row_num = row_number()) %>%
  inner_join(xy, by = c("Sepal.Length", "Species")) %>%
  pull(row_num)
This adds a column for the initial row number, does an inner join to produce a data frame with the rows in dat that match rows from xy, and then pulls the indices. (An inner join returns one row per match, so a row of dat would be duplicated if it matched several rows of xy, while a semi-join returns each matching row of dat at most once.)
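If the at-most-once behaviour is what you want, the same pipeline works with semi_join() swapped in:
dat %>%
  mutate(row_num = row_number()) %>%
  semi_join(xy, by = c("Sepal.Length", "Species")) %>%
  pull(row_num)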
It's worth noting that in this example we are dealing with data frames, not matrices:
> class(xy)
[1] "data.frame"
> class(dat)
[1] "data.frame"
The above code won't work if the data is in matrix form - can you convert your matrices to data frames or tibbles?
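If the objects really are matrices, the conversion is one call each (a sketch; note that as.data.frame() on a mixed-type matrix coerces every column to character, so column types may need fixing afterwards):
dat <- as.data.frame(dat, stringsAsFactors = FALSE)
xy <- as.data.frame(xy, stringsAsFactors = FALSE)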

If your data is huge, you can hash the rows of both matrices first and then match on the hash values, using the digest package.
library(digest)
target_matrix <- iris
query_matrix <- iris[c(14, 58, 99), ]
target_row_hash <- apply(target_matrix, 1, digest)  # one hash per row
query_row_hash <- apply(query_matrix, 1, digest)
row_nums <- match(query_row_hash, target_row_hash)
row_nums
output:
[1] 14 58 99


In R, sample n rows from a df in which a certain column has non-NA values (sample conditionally)

Background
Here's a toy df:
df <- data.frame(ID = c("a","b","c","d","e","f"),
                 gender = c("f","f","m","f","m","m"),
                 zip = c(48601,NA,29910,54220,NA,44663),
                 stringsAsFactors = FALSE)
As you can see, I've got a couple of NA values in the zip column.
Problem
I'm trying to randomly sample 2 entire rows from df -- but I want them to be rows for which zip is not null.
What I've tried
This code gets me a basic (i.e. non-conditional) random sample:
df2 <- df[sample(nrow(df), 2), ]
But of course, that only gets me halfway to my goal -- a bunch of the time it's going to return a row with an NA value in zip. This code attempts to add the condition:
df2 <- df[sample(nrow(df$zip != NA), 2), ]
I think I'm close, but this yields the error "invalid first argument".
Any ideas?
We can use is.na():
tmp <- df[!is.na(df$zip), ]
tmp[sample(nrow(tmp), 2), ]
We can use rownames + na.omit to sample the rows
> df[sample(rownames(na.omit(df["zip"])), 2),]
ID gender zip
3 c m 29910
4 d f 54220
Here is a base R solution with complete.cases()
# define a logical vector to identify NA
x <- complete.cases(df)
# subset only not NA values
df_no_na <- df[x,]
# do the sample
df_no_na[sample(nrow(df_no_na), 2),]
Output:
ID gender zip
3 c m 29910
6 f m 44663
For the tidyverse lovers out there...
library("dplyr")
df %>%
  tidyr::drop_na() %>%
  dplyr::slice_sample(n = 2)
If it's only NA in the zip column you care about, then:
df %>%
  tidyr::drop_na(zip) %>%
  dplyr::slice_sample(n = 2)
The important thing here is to avoid creating an unnecessary second data frame with the NA values dropped. You could use the solution using na.omit given in another answer, but alternatively you can use which to return a list of valid rows to sample from. For example:
nsamp <- 2
df[sample(which(!is.na(df$zip)), nsamp), ]
The advantage to doing it this way is that the condition inside the which can be anything you like, whether or not it involves missing values. For example this version will sample from all the rows with female gender in zip codes starting with 336:
df[sample(which(df$gender=='f' & grepl('^336', df$zip)), nsamp), ]

How to omit rows with a value contained in a separate vector [duplicate]

This question already has answers here:
How to delete multiple values from a vector?
(9 answers)
Closed 3 years ago.
I have a vector of values and a data frame.
I would like to filter out the rows of the data frame which contain (in specific column) any of the values in my vector.
I'm trying to figure out if a person in the survey has a child who was also questioned in the survey - if so I would like to remove them from my data frame.
I have a list of respondent IDs, and vectors of mother/father personal IDs. If the ID appears in the mother/father column I would like to remove it.
df <- data.frame(ID = c(101,102,103,104,105),
                 Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103,105,108,120,150)
Output should be a dataframe with three rows - Martin, Sammie, Seamus.
ID Name
1 101 Martin
2 102 Sammie
3 104 Seamus
df[!(df$ID %in% vec), ] # Or subset(df, !(ID %in% vec))
# ID Name
# 1 101 Martin
# 2 102 Sammie
# 4 104 Seamus
Data
df <- data.frame(ID= c(101,102,103,104,105), Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103,105,108,120,150)
You can do this with filter from dplyr
library(tidyverse)
df2 <- df %>%
  filter(!ID %in% vec)
If you create this as a data.table (after loading the data.table package):
library(data.table)
df <- data.table(ID= c(101,102,103,104,105), Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103,105,108,120,150)
# solution, slightly different from base R
df[!(ID %in% vec)]
data.table is likely to run a bit quicker than base R, which makes it very useful for large datasets. Microbenchmarking a large example with base R, tidyverse and data.table shows data.table to be a bit quicker than tidyverse and a lot faster than base R.
library(tidyverse)
library(data.table)
library(microbenchmark)
n <- 10000000
df <- data.frame("ID" = c(1:n), "Name" = sample(LETTERS, size = n, replace = TRUE))
dt <- data.table(df)
vec <- sample(1:n, size = n/10, replace = FALSE)
microbenchmark(dt[!(ID %in% vec)],
               df[!(df$ID %in% vec), ],
               df %>% filter(!ID %in% vec))

Merging two data.frames by two columns each

I have a huge data.frame that I want to reorder. The idea was to split it in half (as the first half contains different information than the second half) and create a third data frame combining the two. I always need the first two columns of the first data frame, followed by the first two columns of the second data frame, then the next two columns of each, and so on.
new1<-all_cont_video_algo[,1:826]
new2<-all_cont_video_algo[,827:length(all_cont_video_algo)]
df3<-data.frame()
The new data frame should look like the following:
new3[new1[1],new1[2],new2[1],new2[2],new1[3],new1[4],new2[3],new2[4],new1[5],new1[6],new2[5],new2[6], etc.].
Pseudo-algorithmically: cbind 2 columns from data frame new1, then cbind 2 columns from data frame new2, and so on; a minimal sketch of this follows.
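For a concrete picture, here is a minimal base R sketch of that interleaving, assuming new1 and new2 have the same, even number of columns:
# build the interleaved column order: 2 from new1, 2 from new2, repeat
idx <- as.vector(rbind(matrix(seq_len(ncol(new1)), nrow = 2),
                       matrix(seq_len(ncol(new2)), nrow = 2) + ncol(new1)))
new3 <- cbind(new1, new2)[, idx]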
I tried the following now (thanks to Akrun):
new1<-all_cont_video_algo[,1:826]
new2<-all_cont_video_algo[,827:length(all_cont_video_algo)]
new1<-as.data.frame(new1, stringsAsFactors =FALSE)
new2<-as.data.frame(new2, stringsAsFactors =FALSE)
df3<-data.frame()
f1 <- function(Ncol, n) {
  as.integer(gl(Ncol, n, Ncol))
}
lst1 <- split.default(new1, f1(ncol(new1), 2))
lst2 <- split.default(new2, f1(ncol(new2), 2))
lst3 <- Map(function(x, y) df3[unlist(cbind(x, y))], lst1, lst2)
However, this gives me an "undefined columns selected" error.
See whether the code below helps:
library(tidyverse)
# Two sample data frames of equal number of columns and rows
df1 = mtcars %>% select(-1)
df2 = diamonds %>% slice(1:32)
# get the column names
dn1 = names(df1)
dn2 = names(df2)
# create new ordered list
neworder = map(seq(1, length(dn1), 2),                  # sequence with interval 2
               ~c(dn1[.x:(.x+1)], dn2[.x:(.x+1)])) %>%  # a vector of two columns from each
  unlist %>%  # flatten the list
  na.omit     # remove NAs arising from an odd number of columns
# Get the data frame ordered
df3 = bind_cols(df1, df2) %>%
  select(neworder)
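With current dplyr/tidyselect versions, selecting by an external character vector should be wrapped in all_of() to avoid the ambiguity warning:
df3 = bind_cols(df1, df2) %>%
  select(all_of(neworder))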
It is not clear without a reproducible example. Based on the description, we can split the dataset columns into a list of datasets, use Map to cbind the columns of corresponding datasets, unlist, and use that to order the third dataset.
1) Create a function to return a grouping column for splitting the dataset
f1 <- function(Ncol, n) {
  as.integer(gl(Ncol, n, Ncol))
}
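For instance, with n = 2 the function labels columns in consecutive pairs, which is what drives the split:
f1(6, 2)
# [1] 1 1 2 2 3 3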
2) split the datasets into a list
lst1 <- split.default(df1, f1(ncol(df1), 2))
lst2 <- split.default(df2, f1(ncol(df2), 2))
3) Map through the corresponding list elements, cbind and unlist and use that to subset the columns of 'df3'
lst3 <- Map(function(x, y) df3[unlist(cbind(x, y))], lst1, lst2)
data
df1 <- as.data.frame(matrix(letters[1:10], 2, 5), stringsAsFactors = FALSE)
df2 <- as.data.frame(matrix(1:10, 2, 5))

Turning a data.frame into a list of smaller data.frames in R

Suppose I have a data.frame like THIS (or see my code below). As you can see, after every some number of continuous rows, there is a row with all NAs.
I was wondering how I could split THIS data.frame based on every row of NA?
For example, in my code below, I want my original data.frame to be split into 3 smaller data.frames as there are 2 rows of NAs in the original data.frame.
Here is what I tried, with no success:
## The original data.frame:
DF <- read.csv("https://raw.githubusercontent.com/izeh/i/master/m.csv", header = T)
## the index number of rows with "NA"s; Here rows 7 and 14:
b <- as.numeric(rownames(DF[!complete.cases(DF), ]))
## split DF by rows that have "NA"s; that is rows 7 and 14:
split(DF, b)
If we also need the NA rows, create a group with cumsum on the 'study.name' column which is blank (or NA)
library(dplyr)
DF %>%
  group_split(grp = cumsum(lag(study.name == "", default = FALSE)), keep = FALSE)
Or with base R
split(DF, cumsum(c(FALSE, head(DF$study.name == "", -1))))
Or with NA
i1 <- rowSums(is.na(DF))== ncol(DF)
split(DF, cumsum(c(FALSE, head(i1, -1))))
Or based on 'b'
DF1 <- DF[setdiff(seq_len(nrow(DF)), b), ]
split(DF1, as.character(DF1$study.name))
You can find occurrence of b in sequence of rows in DF and use cumsum to create groups.
split(DF, cumsum(seq_len(nrow(DF)) %in% b))
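If the NA rows themselves should not appear in any piece, the same grouping can be applied after dropping them (a sketch reusing b from above):
grp <- cumsum(seq_len(nrow(DF)) %in% b)  # same groups as before
split(DF[-b, ], grp[-b])                 # drop the NA rows, then split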

extract highest and lowest values for columns in R, as well as row identifiers

Say I have some data of the following kind:
df<-as.data.frame(matrix(rnorm(10*10000, 1, .5), ncol=10))
I want a new dataframe that keeps the 10 original columns, but for every column retains only the highest 10 and lowest 10 values. Importantly, the rows have names corresponding to id values that need to be kept in the new data frame.
Thus, the end result data.frame will have dimensions m by 10, where m is very likely to be more than 20. But for every column, I want only 20 valid values.
The only way I can think of doing this is doing it manually per column, using dplyr and arrange, grabbing the top and bottom rows, and then creating a matrix from all the individual vectors. Clearly this is inefficient. Help?
Assuming you want to keep all the rows from the original dataset, where there is at least one value satisfying your condition (value among ten largest or ten smallest in the given column), you could do it like this:
# create a data frame
df<-as.data.frame(matrix(rnorm(10*10000, 1, .5), ncol=10))
# function to keep the lowest 10 and highest 10 values, setting the rest to NA
# (note: rank(), not order() - order() gives positions of sorted values, not ranks)
lowHigh <- function(x) {
  test <- x
  r <- rank(x, ties.method = "first")
  test[!(r <= 10 | r > (length(x) - 10))] <- NA
  test
}
# apply the function defined above
test2 <- apply(df, 2, lowHigh)
# use the original rownames
rownames(test2) <- rownames(df)
# keep only rows where there is value of interest
finalData <- test2[apply(apply(test2, 2, is.na), 1, sum) < 10, ]
Please note that there is definitely some smarter way of doing it...
Here is the data matrix with 10 highest and 10 lowest in each column,
x<-apply(df,2,function(k) k[order(k,decreasing=T)[c(1:10,(length(k)-9):length(k))]])
x is your 20 by 10 matrix.
Your requirement of rownames conflicts column by column: this matrix has only 20 row names altogether, and they cannot be the same for all 10 columns. Instead, here is your order matrix,
x_roworder<-apply(df,2,function(k) order(k,decreasing=T)[c(1:10,(length(k)-9):length(k))])
This will give you corresponding rows in original data matrix within each column.
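To turn that order matrix into the corresponding row identifiers, index the row names by it (a small sketch, assuming df carries row names):
x_rownames <- array(rownames(df)[x_roworder], dim = dim(x_roworder))  # 20 x 10 matrix of ids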
I offer a couple of answers to this.
A base R implementation (I have used %>% to make it easier to read):
ix = lapply(df, function(x) order(x)[-(1:(length(x)-20)+10)]) %>%
  unlist %>% unique %>% sort
df[ix, ]
This abuses the fact that data frames are lists, finds the row ids satisfying the condition for each column, then takes the unique ones in order as the row indices you want to keep. This should retain any row names attached to df.
An alternative using dplyr (since you mentioned it), which if I remember correctly doesn't particularly like row names:
# add id as a variable
df$id = 1:nrow(df) # or row names
df %>%
  gather("col", value, -id) %>%
  group_by(col) %>%
  filter(min_rank(value) <= 10 | min_rank(desc(value)) <= 10) %>%
  ungroup %>%
  select(id) %>%
  left_join(df)
Edited: To fix code alignment and make a neater filter
I'm not entirely sure what you're expecting for your return / output, but this will get you the appropriate indices:
# example data
set.seed(41234L)
N <- 1000
df<-data.frame(id= 1:N, matrix(rnorm(10*N, 1, .5), ncol=10))
# for each column, extract ID's for top 10 and bottom 10 values
l1 <- lapply(df[, 2:11], function(x, y, n) {
  xy <- data.frame(x, y)
  xy <- xy[order(xy[, 1]), ]
  return(xy[c(1:10, (n-9):n), 2])
}, y = df[, 1], n = N)
# check:
xx <- sort(df[,2])
all.equal(sort(df[l1[[1]], 2]), xx[c(1:10, 991:1000)])
[1] TRUE
If you want an m * 10 matrix with these unique values, where m is the number of unique indices, you could do:
l2 <- do.call("c", l1)
l2 <- unique(l2)
df2 <- df[l2,] # in this case, m == 189
This doesn't zero out or NA the columns you're not searching on for each row, but it's unclear whether your question requires that.
Note
This isn't as efficient as using data.table since you're going to get a copy of the data in xy <- data.frame(x,y)
Benchmark
library(microbenchmark)
microbenchmark(ira = {
  test2 <- apply(df[, 2:11], 2, lowHigh)
  rownames(test2) <- rownames(df)
  finalData <- test2[apply(apply(test2, 2, is.na), 1, sum) < 10, ]
},
alex = {
  l1 <- lapply(df[, 2:11], function(x, y, n) {
    xy <- data.frame(x, y)
    xy <- xy[order(xy[, 1]), ]
    return(xy[c(1:10, (n-9):n), 2])
  }, y = df[, 1], n = N)
  l2 <- unique(do.call("c", l1))
  df2 <- df[l2, ]
}, times = 50L)
Unit: milliseconds
expr min lq mean median uq max neval cld
ira 4.360452 4.522082 5.328403 5.140874 5.560295 8.369525 50 b
alex 3.771111 3.854477 4.054388 3.936716 4.158801 5.654280 50 a
