subset using `[`, explain NA output


If we have this data, recently used here:
data <- data.frame(name = rep(letters[1:3], each = 3),
var1 = rep(1:9), var2 = rep(3:5, each = 3))
name var1 var2
1 a 1 3
2 a 2 3
3 a 3 3
4 b 4 4
5 b 5 4
6 b 6 4
7 c 7 5
8 c 8 5
9 c 9 5
we can look for rows where var2 == 4.
data[data[,3] == 4 ,] # equally data[data$var2 == 4 ,]
# name var1 var2
#4 b 4 4
#5 b 5 4
#6 b 6 4
or rows where both var1 and var2 ==4
data[data[,2] == 4 & data[,3] == 4,]
# name var1 var2
#4 b 4 4
What I don't get is why this:
data[ data[ , 2:3 ] == 4 ,]
gives this:
name var1 var2
4 b 4 4
NA <NA> NA NA
NA.1 <NA> NA NA
NA.2 <NA> NA NA
#I would still hope to get
# name var1 var2
#4 b 4 4
Where do the NAs come from?

Your logical that you're subsetting on is a matrix:
> sel <- data[ , 2:3 ] == 4
> sel
var1 var2
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] TRUE TRUE
[5,] FALSE TRUE
[6,] FALSE TRUE
[7,] FALSE FALSE
[8,] FALSE FALSE
[9,] FALSE FALSE
According to help("[.data.frame"):
Matrix indexing (x[i] with a logical or a 2-column integer matrix i)
using [ is not recommended, and barely supported. For extraction, x is
first coerced to a matrix. For replacement, a logical matrix (only)
can be used to select the elements to be replaced in the same way as
for a matrix.
But that implies this form:
> data[ sel ]
[1] "b" "4" "5" "6" "4"
Badness. What you're doing is even less sensical, though, in that you're telling it you want only the rows (with your trailing comma), and then giving it a matrix to index on!
> data[sel,]
name var1 var2
4 b 4 4
NA <NA> NA NA
NA.1 <NA> NA NA
NA.2 <NA> NA NA
If you really wanted to use the matrix form, you could use apply to apply a logical operation across rows.
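As a rough sketch of that idea (the names keep_all and keep_any are only illustrative), apply can collapse the logical matrix to one value per row, which is what row indexing expects:

```r
data <- data.frame(name = rep(letters[1:3], each = 3),
                   var1 = 1:9, var2 = rep(3:5, each = 3))
sel <- data[, 2:3] == 4

# One logical per row: TRUE only if every selected column equals 4
keep_all <- apply(sel, 1, all)
# One logical per row: TRUE if at least one selected column equals 4
keep_any <- apply(sel, 1, any)

data[keep_all, ]   # row 4 only
data[keep_any, ]   # rows 4, 5 and 6
```

Use all when every column must match and any when a single match is enough.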

Your data[,2:3]==4 is the following :
R> data[,2:3]==4
var1 var2
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] TRUE TRUE
[5,] FALSE TRUE
[6,] FALSE TRUE
[7,] FALSE FALSE
[8,] FALSE FALSE
[9,] FALSE FALSE
Then you try to index the rows of your data frame with this matrix. To do this, R seems to first convert your matrix to a vector :
R> as.vector(data[,2:3]==4)
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE TRUE TRUE TRUE FALSE FALSE FALSE
It then selects the rows of data based on this vector. The TRUE at position 4 selects the 4th row, but the three other TRUE values sit at positions 13, 14 and 15, which are "out of bounds" for a 9-row data frame, so they return NA rows.
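A small sketch to make the out-of-bounds positions visible (idx and res are illustrative names): flattening the logical matrix in column-major order and asking which positions are TRUE shows the row numbers R ends up requesting.

```r
data <- data.frame(name = rep(letters[1:3], each = 3),
                   var1 = 1:9, var2 = rep(3:5, each = 3))
sel <- data[, 2:3] == 4

# Positions of TRUE once the 9 x 2 matrix is flattened column by column
idx <- which(as.vector(sel))
idx   # 4 13 14 15

# Asking for rows 13-15 of a 9-row data frame yields all-NA rows
res <- data[idx, ]
res
```

Position 4 comes from the var1 column; positions 13 to 15 are rows 4 to 6 of the var2 column, offset by the 9 rows of the first column.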

data[ data[ , 2 ] == 4 | data[,3] == 4,]
name var1 var2
4 b 4 4
5 b 5 4
6 b 6 4
I suspect your method does not work because c() builds a vector, whereas you need to compare the atomic elements.

Because you're not passing a vector but a matrix to the index:
> data[ , 2:3 ] == 4
var1 var2
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] TRUE TRUE
[5,] FALSE TRUE
[6,] FALSE TRUE
[7,] FALSE FALSE
[8,] FALSE FALSE
[9,] FALSE FALSE
If you want the matrix collapsed into a vector that row indexing can work with, here are two options:
data[ apply(data[ , 2:3 ] == 4, 1, all) ,]
data[ rowSums(data[ , 2:3 ] == 4) == 2 ,]

Related

How can I remove cells containing NULL from a matrix consisting of a list and vectors?

I would like to remove all cells, which contain NULL, from a matrix. However, the matrix consists of a list and vectors.
For example:
Col1<-list(NULL,2,3,4,5,NULL)
Col2<-c(0,2,3,4,5,0)
Col3<-c("Name1","Name2","Name3","Name4","Name5","Name6")
cbind(Col1,Col2,Col3)
Col1 Col2 Col3
[1,] NULL 0 "Name1"
[2,] 2 2 "Name2"
[3,] 3 3 "Name3"
[4,] 4 4 "Name4"
[5,] 5 5 "Name5"
[6,] NULL 0 "Name6"
How can I remove cell 1 and 6 from the matrix?
Thanks in advance for your help.

If you compare the matrix with the string "NULL", you get a logical matrix indicating which cells are NULL.
mat <- cbind(Col1,Col2,Col3)
mat == "NULL"
# Col1 Col2 Col3
#[1,] TRUE FALSE FALSE
#[2,] FALSE FALSE FALSE
#[3,] FALSE FALSE FALSE
#[4,] FALSE FALSE FALSE
#[5,] FALSE FALSE FALSE
#[6,] TRUE FALSE FALSE
You can use this to remove rows with NULL.
Using rowSums :
mat[rowSums(mat == "NULL") == 0, ]
# Col1 Col2 Col3
#[1,] 2 2 "Name2"
#[2,] 3 3 "Name3"
#[3,] 4 4 "Name4"
#[4,] 5 5 "Name5"
Or with apply :
mat[apply(mat != "NULL", 1, all), ]

We can also use is.null
mat[!apply(`dim<-`(sapply(mat, is.null), dim(mat)), 1, any),]
data
mat <- cbind(Col1,Col2,Col3)
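Since only the list column can actually hold NULL (atomic vectors cannot), a narrower sketch is to test just that column. The names has_null and clean are illustrative, and this assumes the NULLs can only appear in Col1:

```r
Col1 <- list(NULL, 2, 3, 4, 5, NULL)
Col2 <- c(0, 2, 3, 4, 5, 0)
Col3 <- c("Name1", "Name2", "Name3", "Name4", "Name5", "Name6")
mat <- cbind(Col1, Col2, Col3)

# mat is a list-matrix, so mat[, "Col1"] is a list of six elements
has_null <- vapply(mat[, "Col1"], is.null, logical(1))

# Keep only the rows whose Col1 entry is not NULL
clean <- mat[!has_null, , drop = FALSE]
clean
```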

R Match rows in a data frame based on formula

I have a data frame containing 7 columns and I want to add a column with information about the 'parent-row'. This sounds vague, so I'll clarify with an example. Below you can see a data frame:
` Nclass0 Nclass1 BestS BestI impurity n
[1,] 5 5 4 36.0 0.2500000 10
[2,] 5 2 1 37.0 0.2040816 7
[3,] 4 0 -1 -1.0 0.0000000 4
[4,] 1 2 2 0.5 0.2222222 3
[5,] 1 0 -1 -1.0 0.0000000 1
[6,] 0 2 -1 -1.0 0.0000000 2
[7,] 0 3 -1 -1.0 0.0000000 3`
Using Nclass0 and Nclass1, I want to add an 8th column in which matching pairs have the same id. The first row is the parent row (with id=0). Two rows match if [rowX,1] + [rowY,1] equals the parent row's Nclass0 and [rowX,2] + [rowY,2] equals the parent row's Nclass1. RowX and rowY are the child rows and should get id=1.
In this case the parent row [1,] has child rows [2,]&[7,] and these rows should get id=1. After this the second row becomes the parent row with its own child rows [3,] and [4,] with id=2, until all rows with child rows have been assigned an id.
I have made several attempts but failed miserably. Does anyone have a suggestion how this can be done? The desired output for this case would be:
` Nclass0 Nclass1 BestS BestI impurity n id
[1,] 5 5 4 36.0 0.2500000 10 0
[2,] 5 2 1 37.0 0.2040816 7 1
[3,] 4 0 -1 -1.0 0.0000000 4 2
[4,] 1 2 2 0.5 0.2222222 3 2
[5,] 1 0 -1 -1.0 0.0000000 1 4
[6,] 0 2 -1 -1.0 0.0000000 2 4
[7,] 0 3 -1 -1.0 0.0000000 3 1`

Here's a solution that makes use of a while loop. The loop will run until either every row has an id value, or until it has evaluated all of the rows in the data frame. I'm sure there are some weaknesses, but it's a good start:
Note: I think this could get unbearably slow in a large data frame, so I hope you don't need to do this on anything large (each outer takes about 1 second to complete on a vector of 10,000).
DF <-
structure(list(Nclass0 = c(5, 5, 4, 1, 1, 0, 0),
Nclass1 = c(5, 2, 0, 2, 0, 2, 3),
BestS = c(4, 1, -1, 2, -1, -1, -1),
BestI = c(36, 37, -1, 0.5, -1, -1, -1),
impurity = c(0.25, 0.2040816, 0, 0.2222222, 0, 0, 0),
n = c(10, 7, 4, 3, 1, 2, 3)),
.Names = c("Nclass0", "Nclass1", "BestS", "BestI", "impurity", "n"),
row.names = c(NA, -7L), class = "data.frame")
DF[["id"]] <- c(0, rep(NA, nrow(DF) - 1))
i <- 1
while(sum(is.na(DF[["id"]])) > 0){
cross0 <- outer(DF[["Nclass0"]], DF[["Nclass0"]], `+`)
match0 <- cross0 == DF[["Nclass0"]][i] & lower.tri(cross0)
cross1 <- outer(DF[["Nclass1"]], DF[["Nclass1"]], `+`)
match1 <- cross1 == DF[["Nclass1"]][i] & lower.tri(cross1)
rows <- as.vector(which(match0 & match1, arr.ind = TRUE))
if (length(rows)) DF[["id"]][rows] <- i
if (i == nrow(DF)) break else i <- i + 1
}
Explanation
To try to clarify your problem, you are looking for pairs where x1 + x2 = x_ref AND y1 + y2 = y_ref.
What this code does is make a matrix of all of the possible pairwise sums of a vector with itself. This is accomplished with outer.
outer(DF[["Nclass0"]], DF[["Nclass0"]], `+`)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 10 10 9 6 6 5 5
[2,] 10 10 9 6 6 5 5
[3,] 9 9 8 5 5 4 4
[4,] 6 6 5 2 2 1 1
[5,] 6 6 5 2 2 1 1
[6,] 5 5 4 1 1 0 0
[7,] 5 5 4 1 1 0 0
When trying to find the x-match for the first row, we compare this matrix to DF$Nclass0[1] (and set the upper triangle to FALSE to avoid duplicates).
match0 <- cross0 == DF[["Nclass0"]][1] & lower.tri(cross0)
match0
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[6,] TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[7,] TRUE TRUE FALSE FALSE FALSE FALSE FALSE
We repeat this process for Nclass1
cross1 <- outer(DF[["Nclass1"]], DF[["Nclass1"]], `+`)
match1 <- cross1 == DF[["Nclass1"]][1] & lower.tri(cross1)
match1
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[5,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[7,] FALSE TRUE FALSE TRUE FALSE TRUE FALSE
To find the row indices, we want to find the intersection of these two match matrices--in other words which positions in both matrices are TRUE
as.vector(which(match0 & match1, arr.ind = TRUE))
[1] 7 2
So rows 7 and 2 are related to the first row. We can repeat this operation for each subsequent row until we've assigned an ID for every row.
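The same pair search can also be sketched without building full matrices, by enumerating each unordered pair once with combn. This is only an alternative illustration, not the code used in this answer; find_children is a hypothetical helper name, and like the loop above it does not exclude the parent row itself from the candidate pairs.

```r
x <- c(5, 5, 4, 1, 1, 0, 0)   # Nclass0
y <- c(5, 2, 0, 2, 0, 2, 3)   # Nclass1

# Return the index pairs (one per column) whose x and y sums
# both equal the values in row i
find_children <- function(x, y, i) {
  pairs <- combn(seq_along(x), 2)   # 2 x choose(n, 2) matrix of indices
  hit <- x[pairs[1, ]] + x[pairs[2, ]] == x[i] &
         y[pairs[1, ]] + y[pairs[2, ]] == y[i]
  pairs[, hit, drop = FALSE]
}

find_children(x, y, 1)   # rows 2 and 7
find_children(x, y, 2)   # rows 3 and 4
find_children(x, y, 4)   # rows 5 and 6
```

The exact == comparison is fine here because the counts are whole numbers; for non-integer values a tolerance would be needed.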
Turning it into a function
Here's a function that takes a data frame, a column name for the x-match, the column name for the y-match, and a character to name the id variable. I've added some bells and whistles to check the inputs.
assign_id <- function(DF, class0, class1, id_var){
# checkmate is only called via ::, so it does not need to be attached
if (!requireNamespace("checkmate", quietly = TRUE))
stop("Install the checkmate package")
checkmate::assert_character(x = class0,
len = 1)
checkmate::assert_character(x = class1,
len = 1)
checkmate::assert_character(x = id_var,
len = 1)
checkmate::assert_subset(c(class0, class1),
choices = names(DF))
i <- 1
DF[[id_var]] <- c(0, rep(NA, nrow(DF) - 1))
while(sum(is.na(DF[[id_var]])) > 0){
cross0 <- outer(DF[[class0]], DF[[class0]], `+`)
match0 <- cross0 == DF[[class0]][i] & lower.tri(cross0)
cross1 <- outer(DF[[class1]], DF[[class1]], `+`)
match1 <- cross1 == DF[[class1]][i] & lower.tri(cross1)
rows <- as.vector(which(match0 & match1, arr.ind = TRUE))
if (length(rows)) DF[[id_var]][rows] <- i
if (i == nrow(DF)) break else i <- i + 1
}
DF
}
assign_id(DF, "Nclass0", "Nclass1", "id")

row wise comparison between a vector and a matrix in r

I have two datasets from 10 people. One is a vector, and the other is a matrix. I want to check whether the first element of the vector is included in the first row of the matrix, whether the second element of the vector is included in the second row of the matrix, and so on.
So I changed the vector into a matrix and used apply to compare them row-wise, but the result was not correct.
Here are the datasets.
df1<-matrix(c(rep(0,10),2,4,7,6,5,7,4,2,2,2),ncol=2)
df1
# [,1] [,2]
# [1,] 0 2
# [2,] 0 4
# [3,] 0 7
# [4,] 0 6
# [5,] 0 5
# [6,] 0 7
# [7,] 0 4
# [8,] 0 2
# [9,] 0 2
#[10,] 0 2
df2<-c(1,3,6,4,1,3,3,2,2,5)
df2<-as.matrix(df2)
apply(df2, 1, function(x) any(x==df1))
# [1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
However, the result should be FALSE everywhere except the 8th and 9th elements.
Can anyone correct the function? Thanks!

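This vectorized code should be very efficient (it assumes df2 is still the plain vector from before the as.matrix() call; comparing a 10x2 matrix with a 10x1 matrix raises a "non-conformable arrays" error):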
> as.logical( rowSums(df1==df2))
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE

Here are a few approaches you could take
Two calls to apply
#
# 1 by column to check if the values are equal
# then by row to see if any rows contain TRUE
apply(apply(df1,2,`==`,df2),1,any)
Use sapply and seq_along
sapply(seq_along(df2), function(x, y, i) y[i] %in% x[i, ], y = df2 ,x = df1)
repeat df2 to the same length as df1 and then compare
rowSums(df1==rep(df2, length = length(df1))) > 0
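One way to see why the original apply call over-reports matches: any(x == df1) asks whether the value occurs anywhere in the matrix, not in the matching row. The sketch below contrasts the two checks. It assumes df2 is kept as a plain vector (no as.matrix), so that == recycles it down each column of df1; the names anywhere and same_row are illustrative.

```r
df1 <- matrix(c(rep(0, 10), 2, 4, 7, 6, 5, 7, 4, 2, 2, 2), ncol = 2)
df2 <- c(1, 3, 6, 4, 1, 3, 3, 2, 2, 5)   # plain vector

# Original attempt: is each value present anywhere in df1?
anywhere <- vapply(df2, function(x) any(x == df1), logical(1))

# Row-wise check: df2 is recycled down each column, so element i
# is compared only with row i of df1
same_row <- rowSums(df1 == df2) > 0

which(anywhere)   # 3 4 8 9 10
which(same_row)   # 8 9
```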

replace <NA> with NA

I have a data frame containing <NA> entries. It appears that these values are not treated as NA, since is.na returns FALSE. I would like to convert these values to NA but could not find a way.

Use dfr[dfr=="<NA>"]=NA where dfr is your dataframe.
For example:
> dfr<-data.frame(A=c(1,2,"<NA>",3),B=c("a","b","c","d"))
> dfr
A B
1 1 a
2 2 b
3 <NA> c
4 3 d
> is.na(dfr)
A B
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] FALSE FALSE
> dfr[dfr=="<NA>"] = NA # key step
> is.na(dfr)
A B
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] TRUE FALSE
[4,] FALSE FALSE
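If the "<NA>" strings come from a text file, one alternative sketch is to declare them as missing when reading, so no replacement step is needed afterwards. The inline csv_text below is made-up sample data standing in for a real file; na.strings is a standard argument of read.csv.

```r
# Hypothetical file contents, supplied inline for illustration
csv_text <- "A,B\n1,a\n2,b\n<NA>,c\n3,d"

# Treat both NA and <NA> as missing values while reading
dfr <- read.csv(text = csv_text, na.strings = c("NA", "<NA>"))

is.na(dfr$A)   # FALSE FALSE  TRUE FALSE
```

A side effect is that column A is read as numeric rather than character, because the stray string no longer forces a character column.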

The two classes where this is likely to be an issue are character and factor. This should loop over a data frame and convert the "NA" strings into true NA values (printed as <NA>), but only for those two classes:
make.true.NA <- function(x) if(is.character(x)||is.factor(x)){
is.na(x) <- x=="NA"; x} else {
x}
df[] <- lapply(df, make.true.NA)
(Untested in the absence of a data example.) Assigning to df[] keeps the structure of the original data frame, which would otherwise lose its class attribute and become a plain list. I see that ujjwal thinks your spelling of NA has flanking "<>" characters, so you might try this function as a more general version:
make.true.NA <- function(x) if(is.character(x)||is.factor(x)){
is.na(x) <- x %in% c("NA", "<NA>"); x} else {
x}
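As a quick check of the more general version, here is a sketch that applies it to the small dfr example from the other answer (the function body is the same logic, only reformatted):

```r
make.true.NA <- function(x) {
  # Only character and factor columns can hold the literal strings
  if (is.character(x) || is.factor(x)) {
    is.na(x) <- x %in% c("NA", "<NA>")
  }
  x
}

dfr <- data.frame(A = c(1, 2, "<NA>", 3), B = c("a", "b", "c", "d"))
dfr[] <- lapply(dfr, make.true.NA)   # dfr[] keeps the data.frame class

is.na(dfr$A)   # FALSE FALSE  TRUE FALSE
```

Note that column A stays character (or factor); the other values are not converted to numbers.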

You can do this with the naniar package as well, using replace_with_na and associated functions.
dfr <- data.frame(A = c(1, 2, "<NA>", 3), B = c("a", "b", "c", "d"))
library(naniar)
# dev version - devtools::install_github('njtierney/naniar')
is.na(dfr)
#> A B
#> [1,] FALSE FALSE
#> [2,] FALSE FALSE
#> [3,] FALSE FALSE
#> [4,] FALSE FALSE
dfr %>% replace_with_na(replace = list(A = "<NA>")) %>% is.na()
#> A B
#> [1,] FALSE FALSE
#> [2,] FALSE FALSE
#> [3,] TRUE FALSE
#> [4,] FALSE FALSE
# You can also specify how to do this for many variables
dfr %>% replace_with_na_all(~.x == "<NA>")
#> # A tibble: 4 x 2
#> A B
#> <int> <int>
#> 1 2 1
#> 2 3 2
#> 3 NA 3
#> 4 4 4
You can read more about replace_with_na in the naniar package documentation.

Is there a way to extract continuous feature in an 2D array

Say I have an array of numbers
a <- c(1,2,3,6,7,8,9,10,20)
Is there a way to tell R to output just the range of each continuous sequence in "a"?
e.g., the continuous sequences in "a" are the following
1,3
6,10
20
Thanks a lot!
Derek

I don't think there is a direct way, but you could create two logical vectors telling you whether each element starts a run (the previous element is not exactly 1 less) or ends one (the next element is not exactly 1 greater). E.g.:
data.frame(
a,
is_first = c(TRUE,diff(a)!=1),
is_last = c(diff(a)!=1,TRUE)
)
# Gives you:
a is_first is_last
1 1 TRUE FALSE
2 2 FALSE FALSE
3 3 FALSE TRUE
4 6 TRUE FALSE
5 7 FALSE FALSE
6 8 FALSE FALSE
7 9 FALSE FALSE
8 10 FALSE TRUE
9 20 TRUE TRUE
So ranges are:
cbind(a[c(TRUE,diff(a)!=1)], a[c(diff(a)!=1,TRUE)])
[1,] 1 3
[2,] 6 10
[3,] 20 20

I did this (not so elegant I admit) in case you want all the numbers of each sequence in a list
a <- c(1,2,3,6,7,8,9,10,20)
z <- c(1,which(c(1,diff(a))!=1))
g <- lapply(seq_along(z), function(i) {
if (i < length(z)) a[z[i] : (z[i+1] - 1)]
else a[z[i] : length(a)]
})
[[1]]
[1] 1 2 3
[[2]]
[1] 6 7 8 9 10
[[3]]
[1] 20
Then you can get a 2D array with something like this
sapply(g,function(x) c(x[1],x[length(x)]))
[,1] [,2] [,3]
[1,] 1 6 20
[2,] 3 10 20

> a <- c(1,2,3,6,7,8,9,10,20)
> N<-length(a)
> k<-2:(N-1)
> z<-(a[k-1]+1)!=a[k] | (a[k+1]-1)!=a[k]
> c(a[1],a[k][z],a[N])
[1] 1 3 6 10 20
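Another common way to sketch the same idea is to label each run with cumsum and then take the range per label. The names run_id, runs and ranges are illustrative only:

```r
a <- c(1, 2, 3, 6, 7, 8, 9, 10, 20)

# A new run starts wherever the step from the previous element is not 1
run_id <- cumsum(c(TRUE, diff(a) != 1))   # 1 1 1 2 2 2 2 2 3

# Split the values by run, then take min and max of each run
runs <- split(a, run_id)
ranges <- t(vapply(runs, range, numeric(2)))
ranges   # one row per run: 1-3, 6-10, 20-20
```

The runs list also gives all the numbers in each sequence, so it covers both the range output and the list output from the earlier answers.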
