Find similar groups of numbers across rows in R

I'm trying to find similar patterns of numbers across a data frame. I have a data frame with 5 columns, and each cell holds a random number between 3 and 50; however, in some rows 2 or 3 columns don't have a number.
 A  B  C  D  E
 5 23  6 NA NA
 9 33  7  8 12
33  7 14 NA NA
 6 18 23 48 NA
 8 44 33  7  9
I want to know which numbers recur together, so I'm interested in:
Rows 1 and 4, which share the numbers 23 and 6;
Rows 2 and 5, which share 9, 33 and 8;
Rows 2, 3 and 5, which share 33 and 7.
Basically, I'm trying to get the number of different combinations.
I'm a bit stuck about how to do this. I've tried to join the numbers in a list:
# pre-allocate a list column, then fill it row by row (sort() drops the NAs)
knots_all$list_knots <- vector("list", nrow(knots_all))
for (i in seq_len(nrow(knots_all))) {
  knots_all$list_knots[[i]] <- sort(unlist(knots_all[i, 1:5]))
}
I've also tried intersect(), but it doesn't seem very efficient, as R also considers the NAs, which I want to disregard.
I would like to hear some ideas about the best way to achieve this. I've been thinking about this problem but I'm not able to understand how to get to the answer. My mind is stuck so any idea is much appreciated!
Thank you!

There's no specific target pattern you want to capture, so it seems like you need a process to identify the numbers that appear most often in your dataset and then see in which rows they appear.
I'll modify your example dataset to have number 23 appearing twice in the same row in order to illustrate some useful differences in counts.
df = read.table(text = "
A B C D E
5 23 6 23 NA
9 33 7 8 12
33 7 14 NA NA
6 18 23 48 NA
8 44 33 7 9
", header=T)
library(dplyr)
library(tidyr)
df %>%
  mutate(row_id = row_number()) %>%                         # add a row flag
  gather(col_name, value, -row_id) %>%                      # reshape to long format
  filter(!is.na(value)) %>%                                 # exclude NAs
  group_by(value) %>%                                       # for each number value
  summarise(NumOccurences = n(),                            # count occurrences
            rows = paste(sort(row_id), collapse = "_"),     # capture rows
            NumRowOccurences = n_distinct(row_id),          # count occurrences in unique rows
            unique_rows = paste(sort(unique(row_id)), collapse = "_")) %>%  # capture unique rows
  arrange(desc(NumOccurences))                              # order by number popularity (occurrences)
# # A tibble: 12 x 5
#    value NumOccurences rows  NumRowOccurences unique_rows
#    <int>         <int> <chr>            <int> <chr>
#  1     7             3 2_3_5                3 2_3_5
#  2    23             3 1_1_4                2 1_4
#  3    33             3 2_3_5                3 2_3_5
#  4     6             2 1_4                  2 1_4
#  5     8             2 2_5                  2 2_5
#  6     9             2 2_5                  2 2_5
#  7     5             1 1                    1 1
#  8    12             1 2                    1 2
#  9    14             1 3                    1 3
# 10    18             1 4                    1 4
# 11    44             1 5                    1 5
# 12    48             1 4                    1 4
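As a hedged follow-up (not part of the answer above), the row combinations the question asks about, i.e. which pairs of rows share which numbers, can be pulled from the same df with combn; a minimal sketch:
# Hedged sketch: shared numbers for every pair of rows, ignoring NAs
row_vals <- lapply(seq_len(nrow(df)), function(i) unique(na.omit(unlist(df[i, ]))))
pairs    <- combn(seq_len(nrow(df)), 2, simplify = FALSE)
shared   <- lapply(pairs, function(p) intersect(row_vals[[p[1]]], row_vals[[p[2]]]))
names(shared) <- sapply(pairs, paste, collapse = "_")
Filter(function(x) length(x) >= 2, shared)   # row pairs sharing at least two numbers
# e.g. "1_4" -> 23 6; "2_5" -> 9 33 7 8; "2_3" and "3_5" -> 33 7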

Make a list of lists, indexed by value: List = [1:[], 2:[], ..., n:[]].
Loop through your data frame and, for your example, add A to the entry at index 5, so that List = [1:[], 2:[], ..., 5:[A], ..., n:[]]. Do the same for every column.
After this, loop through the list and check which of the inner lists are filled and span multiple columns.
This should get you started (a sketch of this idea in R follows below).
Good luck.
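A minimal R sketch of this indexing idea (hedged, and mapping each value to the rows rather than the columns it appears in, since rows are what the question asks about), using the df defined in the first answer:
# Hedged sketch: index every value to the rows containing it, then keep the recurring ones
long <- data.frame(row   = rep(seq_len(nrow(df)), times = ncol(df)),  # unlist() is column-major
                   value = unlist(df, use.names = FALSE))
long <- long[!is.na(long$value), ]
value_rows <- split(long$row, long$value)                # one entry per distinct number
Filter(function(r) length(unique(r)) > 1, value_rows)    # numbers seen in more than one row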

Here is an approach which can detect numbers present in two columns (note that the data frame below stores each row of the question's example as a column):
df <- data.frame(A = c(5, 23, 6, NA, NA),
                 B = c(9, 33, 7, 8, 12),
                 C = c(33, 7, 14, NA, NA),
                 D = c(6, 18, 23, 48, NA),
                 E = c(8, 44, 33, 7, 9))
L  <- as.list(df)
LL <- rep(list(rep(list(NA), length(L))), length(L))
for (i in 1:length(L)) {
  for (j in 1:length(L))
    LL[[i]][[j]] <- intersect(L[[i]], L[[j]])
}
To see the overlapping numbers in columns 1 and 4:
LL[[1]][[4]]
[1] 23 6 NA
To see all overlapping numbers:
unique(unlist(LL))
[1]  5 23  6 NA  9 33  7  8 12 14 18 48 44
It could be changed a little bit (by adding another level to the nested loop) to see the presence of numbers in 3 different columns, etc.; a hedged sketch of that extension follows.
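For example, a minimal sketch of that three-way version (not the original answerer's code; na.omit is added here so the NAs the question complained about are dropped):
# Hedged sketch: numbers present in all three of columns i, j and k
LLL <- rep(list(rep(list(rep(list(NA), length(L))), length(L))), length(L))
for (i in seq_along(L)) {
  for (j in seq_along(L)) {
    for (k in seq_along(L)) {
      LLL[[i]][[j]][[k]] <- Reduce(intersect,
                                   list(na.omit(L[[i]]), na.omit(L[[j]]), na.omit(L[[k]])))
    }
  }
}
LLL[[2]][[3]][[5]]
# [1] 33  7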

One way of dealing with the NAs would be to temporarily fill them with randomly generated numbers:
# data
df <- data.frame(A = c(5, 9, 33, 6, 8),
                 B = c(23, 33, 7, 18, 44),
                 C = c(6, 7, 14, 23, 33),
                 D = c(NA, 8, NA, 48, 7),
                 E = c(NA, 12, NA, NA, 9))
# fill NAs with random numbers (one random value per column, recycled by ifelse)
set.seed(1)
df2 <- as.data.frame(do.call(cbind, lapply(df, function(x) ifelse(is.na(x), rnorm(1), x))))
> df2
   A  B  C          D          E
1  5 23  6 -0.6264538  0.1836433
2  9 33  7  8.0000000 12.0000000
3 33  7 14 -0.6264538  0.1836433
4  6 18 23 48.0000000  0.1836433
5  8 44 33  7.0000000  9.0000000
# split data by rows
df2 <- split(df2, seq_len(nrow(df2)))
# compare each row's values with every other row's values
temp <- lapply(lapply(df2, function(x) lapply(df2, function(y) x %in% y)), function(x) do.call(rbind, x))
# delete self comparisons
output <- lapply(1:5, function(x) temp[[x]][-x, ])
Result:
[[1]]
[,1] [,2] [,3] [,4] [,5]
2 FALSE FALSE FALSE FALSE FALSE
3 FALSE FALSE FALSE TRUE TRUE
4 FALSE TRUE TRUE FALSE TRUE
5 FALSE FALSE FALSE FALSE FALSE
[[2]]
[,1] [,2] [,3] [,4] [,5]
1 FALSE FALSE FALSE FALSE FALSE
3 FALSE TRUE TRUE FALSE FALSE
4 FALSE FALSE FALSE FALSE FALSE
5 TRUE TRUE TRUE TRUE FALSE
[[3]]
[,1] [,2] [,3] [,4] [,5]
1 FALSE FALSE FALSE TRUE TRUE
2 TRUE TRUE FALSE FALSE FALSE
4 FALSE FALSE FALSE FALSE TRUE
5 TRUE TRUE FALSE FALSE FALSE
[[4]]
[,1] [,2] [,3] [,4] [,5]
1 TRUE FALSE TRUE FALSE TRUE
2 FALSE FALSE FALSE FALSE FALSE
3 FALSE FALSE FALSE FALSE TRUE
5 FALSE FALSE FALSE FALSE FALSE
[[5]]
[,1] [,2] [,3] [,4] [,5]
1 FALSE FALSE FALSE FALSE FALSE
2 TRUE FALSE TRUE TRUE TRUE
3 FALSE FALSE TRUE TRUE FALSE
4 FALSE FALSE FALSE FALSE FALSE
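As a hedged follow-up (not part of the answer), the shared values themselves for any pair of rows can be read straight off the split list, e.g. for rows 2 and 5:
# Hedged sketch; note that rows which both had NAs filled in the same column
# share the same random fill value, which can produce spurious matches
intersect(unlist(df2[["2"]]), unlist(df2[["5"]]))
# [1]  9 33  7  8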

Scope of Aggregation Functions when nesting apply(within())

Background
I'm learning R and saw this scenario and don't understand how R handles (what I'll call) implied context transitions. The script I am trying to understand simply iterates through each row of a matrix and prints the index of the column(s) within that row that contain the minimum value of that row. What I don't understand is how R handles the context transition as different functions are applied to the dependent variable x:
x (when defined as an argument to function(x)) is an atomic vector because of the apply() function with a MARGIN = 1 argument
The which() function then iterates over the individual elements within the atomic vector x to see which ones == min(x)
This is the part that truly confuses me: Despite the fact which() is iterating over elements of atomic vector x, you can call min(x) within the which() function and R somehow switches x to be defined as the entire atomic vector again for calculating the min() across the vector vs. within the scope of a single element
Example Data Matrix
a <- matrix(c(5, 2, 7, 1, 2, 8, 4, 5, 6), 3, 3)
     [,1] [,2] [,3]
[1,]    5    1    4
[2,]    2    2    5
[3,]    7    8    6
This is the script that returns the column indexes that I am struggling to understand
apply(a, 1, function(x) which(x == min(x)))
My question:
Within the which() function, why does min(x) return the minimum of the atomic vector (as is desired) and not the minimum within the scope of an individual element within that vector, since which() is iterating over each individual element within the atomic vector x?
Edit: discussion about which and x:
The first comment on your question ("x is anonymous function, lambda") is incorrect:
x is just a variable, nothing fancy. function(x) declares it as the first (and only) argument of the anonymous function, and then every reference to x after that is referencing what is passed to this anonymous function;
the code uses an anonymous function; normally, almost everything you do in R is using named functions (e.g., mean, min). In some cases (e.g., in apply and related functions), it makes sense to define a whole function as an argument and not name it, as in
## anonymous (unnamed) function
apply(m, 1, function(x) which(x == min(x)))
## equivalently, with a named function
myfunc <- function(x) which(x == min(x))
apply(m, 1, myfunc)
In the first case, function(x) which(x == min(x)) is not named, so it is "anonymous". The results of the two apply calls are identical.
Given that context, x is the first argument to the function (myfunc or the anonymous function in your case). With the rest of the apply/MARGIN discussion below,
x (in this case) contains the whole row (when MARGIN=1);
min(x) returns the lowest value within x (it is always length 1); and
which(x == min(x)) returns the index of that lowest value within x; in this case, it will always be length 1 or more, because we are confident that there is always one element such that it is equal to the minimum of that vector ... however, there is no guarantee that which will find any matches, so the length of which(...)'s return value can be between 0 and the length of the inputs. Examples:
which(11:15 == 13)
# [1] 3
which(11:15 == 1:5)
# integer(0)
which(11:15 == 11:15)
# [1] 1 2 3 4 5
which(11:15 %in% c(12, 14))
# [1] 2 4
apply works on one or more dimensions at a time. For now, I'll stick with a 2d matrix, in which case MARGIN= selects rows or columns. (There is a caveat, see below.)
I'm going to use a step-by-step verbose function to show each step. I'll name it anonfunc, but in your mind replace apply(a, 1, anonfunc) below with apply(a, 1, function(x) { ... }) and you will see what I'm intending to do. I also have a dematrix helper function to show what's being used inside anonfunc.
dematrix <- function(m, label = "") {
  if (!is.matrix(m)) m <- matrix(m, nrow = 1)
  out <- capture.output(print(m))[-1]
  out <- gsub("^[][,0-9]+", "", out)
  paste(paste0(c(label, rep(strrep(" ", nchar(label)), length(out) - 1)), out),
        collapse = "\n")
}

anonfunc <- function(x) {
  message(dematrix(x, "Input: "))
  step1 <- x == min(x)
  message(dematrix(step1, "Step1: "))
  step2 <- which(step1)
  message("Step2: ", paste(step2, collapse = ","), "\n#\n")
  step2
}
2d arrays
I'm going to modify your sample data a little by adding a column. This helps visualize how many function calls there are and how big the function's input is.
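The modification itself isn't shown in the post; judging from the Input: lines printed below, it is presumably something like this (an assumption, not the author's code):
a <- cbind(a, 11:13)   # assumed: append a fourth column with the values 11, 12, 13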
apply(a, 1, anonfunc)
# Input: 5 1 4 11
# Step1: FALSE TRUE FALSE FALSE
# Step2: 2
# #
# Input: 2 2 5 12
# Step1: TRUE TRUE FALSE FALSE
# Step2: 1,2
# #
# Input: 7 8 6 13
# Step1: FALSE FALSE TRUE FALSE
# Step2: 3
# #
# [[1]]
# [1] 2
# [[2]]
# [1] 1 2
# [[3]]
# [1] 3
Our anonymous function is called three times, once for each row. In each call, it is passed a vector of length 4, which is the size of one row in the matrix.
Note that we get a list in return. Normally apply returns a vector or matrix: the return value has the dimensions of the MARGIN= axes, combined with an added dimension for the length of each call's return value. That is, a has dims 3x4; if the return value from each call to the anon-func is length 1, then the return value is "sort of" 3x1, but R simplifies that to a vector of length 3 (this might be construed as inconsistent mathematically, I don't disagree); if the return value from each anon-func call were length 10, the output would be a 10x3 matrix (one column per call).
However, when any of the anon-func returns is of a different length/size/class as the others, then apply will return a list. (This is the same behavior as sapply, and it can be frustrating if it changes when you are not expecting it. There is allegedly a patch in R-devel that allows us to force a list with apply(..., simplify=FALSE).)
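A small hedged illustration of that simplification behavior (the simplify= argument assumes R >= 4.1):
# Hedged sketch: apply() simplifies when every call returns the same length,
# and falls back to a list otherwise
m <- matrix(c(5, 1, 4, 2, 2, 5, 7, 8, 6), nrow = 3, byrow = TRUE)
apply(m, 1, function(x) which(x == min(x)))   # list: row 2 has two minima
apply(m, 1, which.min)                        # plain vector: every call returns length 1
apply(m, 1, function(x) which(x == min(x)), simplify = FALSE)  # always a list (R >= 4.1)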
If we instead use MARGIN=2, we'll be operating on columns:
apply(a, 2, anonfunc)
# Input: 5 2 7
# Step1: FALSE TRUE FALSE
# Step2: 2
# #
# Input: 1 2 8
# Step1: TRUE FALSE FALSE
# Step2: 1
# #
# Input: 4 5 6
# Step1: TRUE FALSE FALSE
# Step2: 1
# #
# Input: 11 12 13
# Step1: TRUE FALSE FALSE
# Step2: 1
# #
# [1] 2 1 1 1
Now, one call for each column (4 calls) and x is a vector of length 3 (number of rows in the source matrix).
It is possible to operate on more than one axis at a time; while it seems meaningless to do it with a matrix (2d array), it makes more sense with larger-dimensioned arrays.
apply(a, 1:2, anonfunc)
# Input: 5
# Step1: TRUE
# Step2: 1
# #
# Input: 2
# Step1: TRUE
# Step2: 1
# #
# Input: 7
# Step1: TRUE
# Step2: 1
# #
# ...truncated... total of 12 calls to `anonfunc`
# #
# [,1] [,2] [,3] [,4]
# [1,] 1 1 1 1
# [2,] 1 1 1 1
# [3,] 1 1 1 1
From the discussion of output dimensions, MARGIN=1:2 means the output dimensions will be the dimensions of the margin -- 3x4 -- combined with the dimension/length of each call's output. Since the output here is always length 1, that is technically 1x3x4, and R drops the length-1 dimension, which in R-speak leaves a matrix of dim 3x4.
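A quick hedged check of those dimensions, reconstructing the 3x4 matrix used above; when each call returns more than one value, the result length shows up as a leading dimension:
# Hedged sketch: per-call results of length 2 give a 2 x 3 x 4 array for MARGIN = 1:2
a4 <- cbind(matrix(c(5, 2, 7, 1, 2, 8, 4, 5, 6), 3, 3), 11:13)  # assumed copy of the 3x4 matrix above
dim(apply(a4, 1:2, function(x) rep(x, 2)))
# [1] 2 3 4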
(Pictures in the original answer illustrate what each margin uses from a matrix.)
3d array
Let's go slightly larger to see some of the "plane" operations.
a3 <- array(1:24, dim = c(3,4,2))
a3
# , , 1
# [,1] [,2] [,3] [,4]
# [1,] 1 4 7 10
# [2,] 2 5 8 11
# [3,] 3 6 9 12
# , , 2
# [,1] [,2] [,3] [,4]
# [1,] 13 16 19 22
# [2,] 14 17 20 23
# [3,] 15 18 21 24
Starting with MARGIN=1. While you have both arrays visible, look at the first Input: and see which "plane" is being used from the original a3 array. It appears transposed, sure ...
For the sake of brevity (too late!), I'll abbreviate the third and subsequent iterations of anonfunc to show just the first line (inner-matrix row) of the verbose output.
apply(a3, 1, anonfunc)
# Input: 1 13
# 4 16
# 7 19
# 10 22
# Step1: TRUE FALSE
# FALSE FALSE
# FALSE FALSE
# FALSE FALSE
# Step2: 1
# #
# Input: 2 14
# 5 17
# 8 20
# 11 23
# Step1: TRUE FALSE
# FALSE FALSE
# FALSE FALSE
# FALSE FALSE
# Step2: 1
# #
# Input: 3 15 ...
# #
# [1] 1 1 1
Similarly, MARGIN=2. I'll show a3 again so you can see which "plane" is being used:
a3
# , , 1
# [,1] [,2] [,3] [,4]
# [1,] 1 4 7 10
# [2,] 2 5 8 11
# [3,] 3 6 9 12
# , , 2
# [,1] [,2] [,3] [,4]
# [1,] 13 16 19 22
# [2,] 14 17 20 23
# [3,] 15 18 21 24
apply(a3, 2, anonfunc)
# Input: 1 13
# 2 14
# 3 15
# Step1: TRUE FALSE
# FALSE FALSE
# FALSE FALSE
# Step2: 1
# #
# Input: 4 16
# 5 17
# 6 18
# Step1: TRUE FALSE
# FALSE FALSE
# FALSE FALSE
# Step2: 1
# #
# Input: 7 19 ...
# Input: 10 22 ...
# #
# [1] 1 1 1 1
MARGIN=3 is not very exciting: anonfunc is only called twice, once for each of the front-facing "planes" (no abbreviation necessary here):
apply(a3, 3, anonfunc)
# Input: 1 4 7 10
# 2 5 8 11
# 3 6 9 12
# Step1: TRUE FALSE FALSE FALSE
# FALSE FALSE FALSE FALSE
# FALSE FALSE FALSE FALSE
# Step2: 1
# #
# Input: 13 16 19 22
# 14 17 20 23
# 15 18 21 24
# Step1: TRUE FALSE FALSE FALSE
# FALSE FALSE FALSE FALSE
# FALSE FALSE FALSE FALSE
# Step2: 1
# #
# [1] 1 1
One can use multiple dimensions here as well, and this is where I think the Input: string becomes a little clarifying:
a3
# , , 1
# [,1] [,2] [,3] [,4]
# [1,] 1 4 7 10
# [2,] 2 5 8 11
# [3,] 3 6 9 12
# , , 2
# [,1] [,2] [,3] [,4]
# [1,] 13 16 19 22
# [2,] 14 17 20 23
# [3,] 15 18 21 24
apply(a3, 2:3, anonfunc)
# Input: 1 2 3
# Step1: TRUE FALSE FALSE
# Step2: 1
# #
# Input: 4 5 6
# Step1: TRUE FALSE FALSE
# Step2: 1
# #
# Input: 7 8 9 ...
# Input: 10 11 12 ...
# Input: 13 14 15 ...
# Input: 16 17 18 ...
# Input: 19 20 21 ...
# Input: 22 23 24 ...
# #
# [,1] [,2]
# [1,] 1 1
# [2,] 1 1
# [3,] 1 1
# [4,] 1 1
And since the dimensions of a3 are 3,4,2, we're looking at margins 2:3, and each call to anonfunc returns length 1, our returned array is technically 1x4x2; the length-1 dimension is silently dropped by R, leaving a 4x2 matrix.
(To visualize what each call with each MARGIN= actually uses, the original answer includes a set of pictures.)
"Lexical scoping looks up symbol values based on how functions were nested when they were created, not how they are nested when they are called. With lexical scoping, you don't need to know how the function is called to figure out where the value of a variable will be looked up. You just need to look at the function's definition."
Source: http://adv-r.had.co.nz/Functions.html#lexical-scoping
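As a hedged aside, a minimal sketch of what that quote means in practice (a hypothetical example, not from the original post):
# f() looks up `y` in the environment where f was defined, not where it is called
y <- 10
f <- function(x) x + y
g <- function() {
  y <- 99   # a different, local `y`
  f(1)
}
g()
# [1] 11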

R Match rows in a data frame based on formula

I have a data frame containing 7 columns and I want to add a column with information about the 'parent-row'. This sounds vague, so I'll clarify with an example. Below you can see a data frame:
     Nclass0 Nclass1 BestS BestI  impurity  n
[1,]       5       5     4  36.0 0.2500000 10
[2,]       5       2     1  37.0 0.2040816  7
[3,]       4       0    -1  -1.0 0.0000000  4
[4,]       1       2     2   0.5 0.2222222  3
[5,]       1       0    -1  -1.0 0.0000000  1
[6,]       0       2    -1  -1.0 0.0000000  2
[7,]       0       3    -1  -1.0 0.0000000  3
Using Nclass0 and Nclass1, I want to add an eighth column in which matching pairs share the same id. The first row is the parent row (with id = 0). Two rows X and Y match a parent if Nclass0[X] + Nclass0[Y] equals the parent's Nclass0 and Nclass1[X] + Nclass1[Y] equals the parent's Nclass1; X and Y are the child rows and should get id = 1.
In this case the parent row [1,] has child rows [2,] and [7,], and these rows should get id = 1. After this, the second row becomes the parent row of its own child rows [3,] and [4,] with id = 2, and so on until all rows with child rows have been assigned an id.
I have made several attempts but failed miserably. Does anyone have a suggestion how this can be done? The desired output for this case would be:
     Nclass0 Nclass1 BestS BestI  impurity  n id
[1,]       5       5     4  36.0 0.2500000 10  0
[2,]       5       2     1  37.0 0.2040816  7  1
[3,]       4       0    -1  -1.0 0.0000000  4  2
[4,]       1       2     2   0.5 0.2222222  3  2
[5,]       1       0    -1  -1.0 0.0000000  1  4
[6,]       0       2    -1  -1.0 0.0000000  2  4
[7,]       0       3    -1  -1.0 0.0000000  3  1
Here's a solution that makes use of a while loop. The loop will run until either every row has an id value, or until it has evaluated all of the rows in the data frame. I'm sure there are some weaknesses, but it's a good start:
Note: I think this could get unbearably slow in a large data frame, so I hope you don't need to do this on anything large (each outer takes about 1 second to complete on a vector of 10,000).
DF <- structure(list(Nclass0 = c(5, 5, 4, 1, 1, 0, 0),
                     Nclass1 = c(5, 2, 0, 2, 0, 2, 3),
                     BestS = c(4, 1, -1, 2, -1, -1, -1),
                     BestI = c(36, 37, -1, 0.5, -1, -1, -1),
                     impurity = c(0.25, 0.2040816, 0, 0.2222222, 0, 0, 0),
                     n = c(10, 7, 4, 3, 1, 2, 3)),
                .Names = c("Nclass0", "Nclass1", "BestS", "BestI", "impurity", "n"),
                row.names = c(NA, -7L), class = "data.frame")

DF[["id"]] <- c(0, rep(NA, nrow(DF) - 1))

i <- 1
while (sum(is.na(DF[["id"]])) > 0) {
  cross0 <- outer(DF[["Nclass0"]], DF[["Nclass0"]], `+`)
  match0 <- cross0 == DF[["Nclass0"]][i] & lower.tri(cross0)

  cross1 <- outer(DF[["Nclass1"]], DF[["Nclass1"]], `+`)
  match1 <- cross1 == DF[["Nclass1"]][i] & lower.tri(cross1)

  rows <- as.vector(which(match0 & match1, arr.ind = TRUE))

  if (length(rows)) DF[["id"]][rows] <- i
  if (i == nrow(DF)) break else i <- i + 1
}
Explanation
To try to clarify your problem, you are looking for pairs where x1 + x2 = x_ref AND y1 + y2 = y_ref.
What this code does is make a matrix of all of the possible pairwise sums of a vector with itself. This is accomplished with outer.
outer(DF[["Nclass0"]], DF[["Nclass0"]], `+`)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 10 10 9 6 6 5 5
[2,] 10 10 9 6 6 5 5
[3,] 9 9 8 5 5 4 4
[4,] 6 6 5 2 2 1 1
[5,] 6 6 5 2 2 1 1
[6,] 5 5 4 1 1 0 0
[7,] 5 5 4 1 1 0 0
When trying to find the x-match for the first row, we compare this matrix to DF$Nclass0[1] (and set the upper triangle to FALSE to avoid duplicates).
match0 <- cross0 == DF[["Nclass0"]][1] & lower.tri(cross0)
match0
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[6,] TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[7,] TRUE TRUE FALSE FALSE FALSE FALSE FALSE
We repeat this process for Nclass1
cross1 <- outer(DF[["Nclass1"]], DF[["Nclass1"]], `+`)
match1 <- cross1 == DF[["Nclass1"]][1] & lower.tri(cross1)
match1
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[5,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[7,] FALSE TRUE FALSE TRUE FALSE TRUE FALSE
To find the row indices, we want to find the intersection of these two match matrices--in other words which positions in both matrices are TRUE
as.vector(which(match0 & match1, arr.ind = TRUE))
[1] 7 2
So rows 7 and 2 are related to the first row. We can repeat this operation for each subsequent row until we've assigned an ID for every row.
Turning it into a function
Here's a function that takes a data frame, a column name for the x-match, the column name for the y-match, and a character to name the id variable. I've added some bells and whistles to check the inputs.
assign_id <- function(DF, class0, class1, id_var){
  check <- require(checkmate)
  if (!check) stop("Install the checkmate package")

  checkmate::assert_character(x = class0, len = 1)
  checkmate::assert_character(x = class1, len = 1)
  checkmate::assert_character(x = id_var, len = 1)
  checkmate::assert_subset(c(class0, class1), choices = names(DF))

  i <- 1
  DF[[id_var]] <- c(0, rep(NA, nrow(DF) - 1))

  while (sum(is.na(DF[[id_var]])) > 0) {
    cross0 <- outer(DF[[class0]], DF[[class0]], `+`)
    match0 <- cross0 == DF[[class0]][i] & lower.tri(cross0)

    cross1 <- outer(DF[[class1]], DF[[class1]], `+`)
    match1 <- cross1 == DF[[class1]][i] & lower.tri(cross1)

    rows <- as.vector(which(match0 & match1, arr.ind = TRUE))

    if (length(rows)) DF[[id_var]][rows] <- i
    if (i == nrow(DF)) break else i <- i + 1
  }

  DF
}
assign_id(DF, "Nclass0", "Nclass1", "id")
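If I've traced the loop correctly, this should reproduce the desired id column from the question (a hedged expectation rather than verified output):
assign_id(DF, "Nclass0", "Nclass1", "id")$id
# expected, matching the desired output above: 0 1 2 2 4 4 1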

Check whether matrix rows equal a vector in R, vectorized

I'm very surprised this question has not been asked, maybe the answer will clear up why. I want to compare rows of a matrix to a vector and return whether the row == the vector everywhere. See the example below. I want a vectorized solution, no apply functions because the matrix is too large for slow looping. Suppose there are many rows as well, so I would like to avoid repping the vector.
set.seed(1)
M = matrix(rpois(50,5),5,10)
v = c(3 , 2 , 7 , 7 , 4 , 4 , 7 , 4 , 5, 6)
M
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 4 8 3 5 9 4 5 6 7 7
[2,] 4 9 3 6 3 1 5 7 6 1
[3,] 5 6 6 11 6 4 5 2 7 5
[4,] 8 6 4 4 3 8 3 6 5 6
[5,] 3 2 7 7 4 4 7 4 5 6
Output should be
FALSE FALSE FALSE FALSE TRUE
One possibility is
rowSums(M == v[col(M)]) == ncol(M)
## [1] FALSE FALSE FALSE FALSE TRUE
Or similarly
rowSums(M == rep(v, each = nrow(M))) == ncol(M)
## [1] FALSE FALSE FALSE FALSE TRUE
Or
colSums(t(M) == v) == ncol(M)
## [1] FALSE FALSE FALSE FALSE TRUE
v[col(M)] is just a shorter version of rep(v, each = nrow(M)), which creates a vector the same size as M (a matrix is just a vector, try c(M)) and then compares each element against its corresponding one using ==. Fortunately == is a generic function which has an array method (see methods("Ops") and is.array(M)), which allows us to run rowSums (or colSums) on the result in order to check that the number of matches equals ncol(M).
Using De Morgan's law (not all = some not, hence all equal = not(some not equal)), we also have
!colSums(t(M) != v)
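A small hedged check, using the M and v defined above, that the three variants agree:
# All three approaches should return the same logical vector
r1 <- rowSums(M == v[col(M)]) == ncol(M)
r2 <- colSums(t(M) == v) == ncol(M)
r3 <- !colSums(t(M) != v)
identical(r1, r2) && identical(r2, r3)
# [1] TRUE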
The package prodlim has a function called row.match, which is easy to use and ideal for your problem. First install and load the library: library(prodlim). In our example, row.match will return '5' because the 5th row in M is equal to v. We can then convert this into a logical vector.
library(prodlim)
m <- row.match(v, M)
m == 1:NROW(M)
# [1] FALSE FALSE FALSE FALSE TRUE

replace <NA> with NA

I have a data frame containing "<NA>" entries. It appears that these values are not treated as missing, since is.na returns FALSE. I would like to convert these values to NA but could not find the way.
Use dfr[dfr=="<NA>"]=NA where dfr is your dataframe.
For example:
> dfr<-data.frame(A=c(1,2,"<NA>",3),B=c("a","b","c","d"))
> dfr
A B
1 1 a
2 2 b
3 <NA> c
4 3 d
> is.na(dfr)
A B
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] FALSE FALSE
> dfr[dfr=="<NA>"] = NA    # the key step
> is.na(dfr)
A B
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] TRUE FALSE
[4,] FALSE FALSE
The two classes where this is likely to be an issue are character and factor. This should loop over a data frame and convert the "NA" values into true <NA>s, but just for those two classes:
make.true.NA <- function(x) {
  if (is.character(x) || is.factor(x)) {
    is.na(x) <- x == "NA"
    x
  } else {
    x
  }
}
df[] <- lapply(df, make.true.NA)
(Untested in the absence of a data example.) Using the form df_name[] attempts to retain the structure of the original data frame, which would otherwise lose its class attribute. I see that ujjwal thinks your spelling of NA has flanking "<>" characters, so you might try this function as more general:
make.true.NA <- function(x) {
  if (is.character(x) || is.factor(x)) {
    is.na(x) <- x %in% c("NA", "<NA>")
    x
  } else {
    x
  }
}
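A quick hedged check of this second version against the dfr used earlier in the thread:
# Hedged usage sketch; column A holds the literal string "<NA>"
dfr <- data.frame(A = c(1, 2, "<NA>", 3), B = c("a", "b", "c", "d"))
dfr[] <- lapply(dfr, make.true.NA)
is.na(dfr)
#          A     B
# [1,] FALSE FALSE
# [2,] FALSE FALSE
# [3,]  TRUE FALSE
# [4,] FALSE FALSE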
You can do this with the naniar package as well, using replace_with_na and associated functions.
dfr <- data.frame(A = c(1, 2, "<NA>", 3), B = c("a", "b", "c", "d"))
library(naniar)
# dev version - devtools::install_github('njtierney/naniar')
is.na(dfr)
#> A B
#> [1,] FALSE FALSE
#> [2,] FALSE FALSE
#> [3,] FALSE FALSE
#> [4,] FALSE FALSE
dfr %>% replace_with_na(replace = list(A = "<NA>")) %>% is.na()
#> A B
#> [1,] FALSE FALSE
#> [2,] FALSE FALSE
#> [3,] TRUE FALSE
#> [4,] FALSE FALSE
# You can also specify how to do this for many variables
dfr %>% replace_with_na_all(~.x == "<NA>")
#> # A tibble: 4 x 2
#> A B
#> <int> <int>
#> 1 2 1
#> 2 3 2
#> 3 NA 3
#> 4 4 4
You can read more about using replace_with_na in the naniar documentation.

Is there a way to extract continuous sequences into a 2D array

Say I have an array of numbers:
a <- c(1,2,3,6,7,8,9,10,20)
Is there a way to tell R to output just the ranges of the continuous sequences from a?
E.g., the continuous sequences in a are the following:
1,3
6,10
20
Thanks a lot!
Derek
I don't think there is a built-in way, but you could create two logical vectors telling you whether the next/previous element is 1 greater/less. E.g.:
data.frame(
  a,
  is_first = c(TRUE, diff(a) != 1),
  is_last  = c(diff(a) != 1, TRUE)
)
# Gives you:
a is_first is_last
1 1 TRUE FALSE
2 2 FALSE FALSE
3 3 FALSE TRUE
4 6 TRUE FALSE
5 7 FALSE FALSE
6 8 FALSE FALSE
7 9 FALSE FALSE
8 10 FALSE TRUE
9 20 TRUE TRUE
So ranges are:
cbind(a[c(TRUE,diff(a)!=1)], a[c(diff(a)!=1,TRUE)])
[1,] 1 3
[2,] 6 10
[3,] 20 20
I did this (not so elegant, I admit) in case you want all the numbers of each sequence in a list:
a <- c(1, 2, 3, 6, 7, 8, 9, 10, 20)
z <- c(1, which(c(1, diff(a)) != 1))
g <- lapply(seq_along(z), function(i) {
  if (i < length(z)) a[z[i]:(z[i + 1] - 1)]
  else a[z[i]:length(a)]
})
g
[[1]]
[1] 1 2 3
[[2]]
[1] 6 7 8 9 10
[[3]]
[1] 20
Then you can get a 2D array with something like this
sapply(g,function(x) c(x[1],x[length(x)]))
[,1] [,2] [,3]
[1,] 1 6 20
[2,] 3 10 20
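An alternative hedged sketch (not from the answers above): group the consecutive runs with cumsum over the breaks, then take each group's range:
a <- c(1, 2, 3, 6, 7, 8, 9, 10, 20)
grp <- cumsum(c(TRUE, diff(a) != 1))   # start a new group whenever the gap is not 1
t(sapply(split(a, grp), range))
#   [,1] [,2]
# 1    1    3
# 2    6   10
# 3   20   20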
a <- c(1, 2, 3, 6, 7, 8, 9, 10, 20)
N <- length(a)
k <- 2:(N - 1)
z <- (a[k - 1] + 1) != a[k] | (a[k + 1] - 1) != a[k]
c(a[1], a[k][z], a[N])
# [1]  1  3  6 10 20
