I have the following data frame a:
> a <- cbind(c(FALSE,FALSE,TRUE,TRUE),c(TRUE,FALSE,FALSE,TRUE))
> a
      [,1]  [,2]
[1,] FALSE  TRUE
[2,] FALSE FALSE
[3,]  TRUE FALSE
[4,]  TRUE  TRUE
I want to remove all rows whose first and second column values are both FALSE. Note that I do have some other, non-boolean columns.
So you want to keep each row that contains at least one TRUE value:
keep <- a[, 1] | a[, 2]  # TRUE wherever at least one of the two columns is TRUE
a <- a[keep, ]           # keep only the flagged rows
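Since the real data also has non-boolean columns, the same logical index carries over to a data frame unchanged. A minimal sketch, assuming hypothetical columns v1, v2, and id (the names are illustrative, not from the question):
df <- data.frame(v1 = c(FALSE, FALSE, TRUE, TRUE),
                 v2 = c(TRUE, FALSE, FALSE, TRUE),
                 id = 1:4)   # stand-in for the non-boolean columns
df[df$v1 | df$v2, ]          # drops row 2, the only all-FALSE row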
You can use rowSums, which counts the TRUE values in each row:
a[rowSums(a[, 1:2]) != 0, ]
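For the example matrix, the intermediate counts make the logic visible (rowSums coerces the logical values to 0/1):
rowSums(a[, 1:2])             # number of TRUEs per row
#[1] 1 0 1 2
a[rowSums(a[, 1:2]) != 0, ]   # keep rows with a nonzero count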
I have a vector A, which contains a list of genera, which I want to use to subset a second vector, B. I have successfully used grepl to extract anything from B that has a partial match to the genera in A. Below is a reproducible example of what I have done.
But now I would like to get a list of which genera in A matched with something in B, and which genera did not. I.e. the "matched" list would contain Cortinarius and Russula, and the "unmatched" list would contain Laccaria and Inocybe. Any ideas on how to do this? In reality my vectors are very long, and the genus names in B are not all in the same position amongst the other info.
# create some dummy vectors
A <- c("Cortinarius","Laccaria","Inocybe","Russula")
B <- c("fafsdf_Cortinarius_sdfsdf","sdfsdf_Russula_sdfsdf_fdf","Tomentella_sdfsdf","sdfas_Sebacina","sdfsf_Clavulina_sdfdsf")
# extract the elements of B that have a partial match to anything in A.
new.B <- B[grepl(paste(A,collapse="|"), B)]
# But now how do I tell which elements of A were present in B, and which ones were not?
We could use lapply or sapply to loop over the patterns and get a named output:
out <- setNames(lapply(A, function(x) grep(x, B, value = TRUE)), A)
Then it is easier to check which ones returned empty elements:
> out[lengths(out) > 0]
$Cortinarius
[1] "fafsdf_Cortinarius_sdfsdf"
$Russula
[1] "sdfsdf_Russula_sdfsdf_fdf"
> out[lengths(out) == 0]
$Laccaria
character(0)
$Inocybe
character(0)
and get the names of those:
> names(out[lengths(out) > 0])
[1] "Cortinarius" "Russula"
> names(out[lengths(out) == 0])
[1] "Laccaria" "Inocybe"
You can use sapply with grepl to check each value of A against every value of B:
sapply(A, grepl, B)
#     Cortinarius Laccaria Inocybe Russula
#[1,]        TRUE    FALSE   FALSE   FALSE
#[2,]       FALSE    FALSE   FALSE    TRUE
#[3,]       FALSE    FALSE   FALSE   FALSE
#[4,]       FALSE    FALSE   FALSE   FALSE
#[5,]       FALSE    FALSE   FALSE   FALSE
You can take the column-wise sums of these values to get the count of matches:
result <- colSums(sapply(A, grepl, B))
result
#Cortinarius Laccaria Inocybe Russula
#          1        0       0       1
#values with at least one match
names(Filter(function(x) x > 0, result))
#[1] "Cortinarius" "Russula"
#values with no match
names(Filter(function(x) x == 0, result))
#[1] "Laccaria" "Inocybe"
I have a matrix like this:
M <- rbind(c("CD4", "CD8"),
c("CD8", "CD4"),
c("DN", "CD8"),
c("CD8", "DN"),
c("CD4", "DN"),
c("DN", "CD4"))
Rows 1 and 2 are duplicates, rows 3 and 4 are duplicates, and rows 5 and 6 are duplicates, since each pair contains the same elements (regardless of order).
I know that the following code can do it:
Msort <- t(apply(M, 1, sort))
duplicated(Msort)
I want to get this Logical vector:
> duplicated(Msort)
[1] FALSE TRUE FALSE TRUE FALSE TRUE
But if the matrix is large, say 10,000 rows and 10,000 columns, how can this be handled efficiently?
Thanks.
Reusing the row-sorted matrix from the question, you can index the original matrix with the duplicated flags:
Msort <- t(apply(M, 1, sort))   # sort within each row so column order no longer matters
M[duplicated(Msort), ]          # the duplicated rows, in their original form
#     [,1]  [,2]
#[1,] "CD8" "CD4"
#[2,] "CD8" "DN"
#[3,] "DN"  "CD4"
I have a matrix that I am performing a for loop over. I want to know if the values of position i in the for loop exist anywhere else in the matrix, and if so, report TRUE. The matrix looks like this:
dim
     x y
[1,] 5 1
[2,] 2 2
[3,] 5 1
[4,] 5 9
In this case, dim[1,] is the same as dim[3,] and should therefore report TRUE if I am in position i=1 in the for loop. I could write another for loop to deal with this, but I am sure there are more clever and possibly vectorized ways to do this.
We can use duplicated:
duplicated(m1)|duplicated(m1, fromLast=TRUE)
#[1] TRUE FALSE TRUE FALSE
duplicated(m1) gives a logical vector; an element is TRUE where that row duplicates an earlier row:
duplicated(m1)
#[1] FALSE FALSE TRUE FALSE
In this case, the third row is a duplicate of the first row. If we need to flag both the first and third rows, we can also check for duplicates from the reverse side and use | to make both positions TRUE, i.e.
duplicated(m1, fromLast=TRUE)
#[1] TRUE FALSE FALSE FALSE
duplicated(m1)|duplicated(m1, fromLast=TRUE)
#[1] TRUE FALSE TRUE FALSE
According to ?duplicated, the input data can be
x: a vector or a data frame or an array or ‘NULL’.
data
m1 <- cbind(x=c(5,2,5,5), y=c(1,2,1,9))
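Tying this back to the loop in the question, the row indices that occur more than once can be pulled out directly, with no explicit for loop:
dup <- duplicated(m1) | duplicated(m1, fromLast = TRUE)
which(dup)   # rows that also appear elsewhere in the matrix
#[1] 1 3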
Let's say we have the following dataset
set.seed(144)
dat <- matrix(rnorm(100), ncol=5)
The following call creates all possible combinations of columns and removes the first (all-FALSE) row:
(cols <- do.call(expand.grid, rep(list(c(F, T)), ncol(dat)))[-1,])
#    Var1  Var2  Var3  Var4  Var5
# 2  TRUE FALSE FALSE FALSE FALSE
# 3 FALSE  TRUE FALSE FALSE FALSE
# 4  TRUE  TRUE FALSE FALSE FALSE
# ...
# 31 FALSE  TRUE  TRUE  TRUE  TRUE
# 32  TRUE  TRUE  TRUE  TRUE  TRUE
My question is: how can I calculate the single, pairwise, and triple combinations only?
Choosing the rows with no more than 3 TRUE values works for this small case: cols[rowSums(cols) < 4L, ]
However, for larger vectors it fails, mainly because expand.grid cannot build the full grid:
Error in rep.int(seq_len(nx), rep.int(rep.fac, nx)) :
invalid 'times' value
In addition: Warning message:
In rep.fac * nx : NAs produced by integer overflow
Any suggestion that would allow me to compute the single, pairwise, and triple combinations only?
You could try either
cols[rowSums(cols) < 4L, ]
Or
cols[Reduce(`+`, cols) < 4L, ]
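Both produce the same logical index; Reduce(`+`, cols) sums the columns without first converting the data frame to a matrix. A quick equivalence check (idx is just a throwaway name):
idx <- Reduce(`+`, cols) < 4L
all(idx == (rowSums(cols) < 4L))
#[1] TRUE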
You can use this solution:
col.i <- do.call(c, lapply(1:3, combn, x = 5, simplify = FALSE))
# [[1]]
# [1] 1
#
# [[2]]
# [1] 2
#
# <...skipped...>
#
# [[24]]
# [1] 2 4 5
#
# [[25]]
# [1] 3 4 5
Here, col.i is a list every element of which contains column indices.
How it works: combn generates all combinations of the numbers from 1 to 5 (requested by x=5) taken m at a time (simplify=FALSE ensures that the result keeps its list structure). lapply performs an implicit loop, iterating m from 1 to 3, and returns a list of lists. do.call(c,...) flattens that list of lists into a plain list.
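A quick sanity check on the count of combinations:
length(col.i)         # choose(5,1) + choose(5,2) + choose(5,3)
#[1] 25
sum(choose(5, 1:3))
#[1] 25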
You can use col.i to get certain columns from dat, e.g. dat[, col.i[[1]], drop = FALSE]. Here 1 is the index of the column combination, so you could use any number from 1 to 25; drop = FALSE ensures that when you pick just one column from dat, the result is not simplified to a vector, which might cause unexpected program behavior. Another option is to use lapply, e.g.
lapply(col.i, function(cols) dat[, cols, drop = FALSE])
which will return a list of matrices, each containing a certain subset of the columns of dat.
In case you want to get column indices as a boolean matrix, you can use:
col.b <- t(sapply(col.i,function(z) 1:5 %in% z))
# [,1] [,2] [,3] [,4] [,5]
# [1,] TRUE FALSE FALSE FALSE FALSE
# [2,] FALSE TRUE FALSE FALSE FALSE
# [3,] FALSE FALSE TRUE FALSE FALSE
# ...
[UPDATE]
A more efficient implementation:
library("gRbase")
coli <- function(x = 5, m = 3) {
  # all index combinations of sizes 1..m, flattened into a single list
  col.i <- do.call(c, lapply(1:m, combnPrim, x = x, simplify = FALSE))
  # linear positions of the TRUE cells in a row-major matrix with x columns
  z <- lapply(seq_along(col.i), function(i) x * (i - 1) + col.i[[i]])
  v.b <- rep(FALSE, x * length(col.i))
  v.b[unlist(z)] <- TRUE
  matrix(v.b, ncol = x, byrow = TRUE)
}
coli(70,5) # takes about 30 sec on my desktop
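With the default arguments this reproduces the small example above (assuming gRbase is installed), so the large call is a drop-in replacement:
dim(coli(5, 3))   # one row per combination, one column per variable
#[1] 25  5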
I don't understand what is going on here:
Set up:
> df = data.frame(x1= rnorm(10), x2= rnorm(10))
> df[3,1] <- "the"
> df[6,2] <- "NA"
## I want to create values that will be challenging to coerce to numeric
> df$x1.fixed <- as.numeric(df$x1)
> df$x2.fixed <- as.numeric(df$x2)
## Here is the DF
> df
                  x1                 x2   x1.fixed   x2.fixed
1  0.955965351551298 -0.320454533088042  0.9559654 -0.3204545
2  -1.87960909714257   1.61618672247496 -1.8796091  1.6161867
3                the -0.855930398468875         NA -0.8559304
4 -0.400879592905882 -0.698655375066432 -0.4008796 -0.6986554
5  0.901252404134257  -1.08020133150191  0.9012524 -1.0802013
6   0.97786920899034                 NA  0.9778692         NA
.
.
.
> table(is.na(df[,c(3,4)]))
FALSE  TRUE
   18     2
I wanted to find the rows that got converted to NAs, so I put in a complex apply that did not work as expected. I then simplified and tried again...
Question:
Simpler call:
> apply(df, 1, function(x) (any(is.na(df[x,3]), is.na(df[x,4]))))
which unexpectedly yielded:
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Instead, I'd expected:
[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
to highlight the rows (3 & 6) where an NA existed. To verify that non-apply'ed functions would work, I tried:
> any(is.na(df[3,1]), is.na(df[3,2]))
[1] FALSE
> any(is.na(df[3,3]), is.na(df[3,4]))
[1] TRUE
as expected. To further my confusion on what apply is doing, I tried:
> apply(df, 1, function(x) is.na(df[x,1]))
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE
[2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE
[3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE
[4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE  TRUE
Why is this traversing the entire DF, when I have clearly indicated both (a) that I want it in the row direction (I passed "1" into the second parameter), and (b) the value "x" is only placed in the row id, not the column id?
I understand there are other, and perhaps better, ways to do what I am trying to do (find the rows that have been changed to NA's in the new columns. But please don't supply that in the answer. Instead, please explain why apply did not work as I'd expected, and what I could do to fix it.
To find the columns that have NA's you can do:
sapply(df, function(x) any(is.na(x)))
#    x1    x2 x1.fixed x2.fixed
# FALSE FALSE     TRUE     TRUE
A data.frame is a list of vectors, so the function inside sapply evaluates any(is.na(x)) for each element of that list, i.e. for each column.
As per the OP's edit - to get the rows that have NA's, use apply(df, 1, ...) instead:
apply(df, 1, function(x) any(is.na(x)))
# [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
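To get the offending row numbers directly, wrap that in which() (unname drops the row-name labels so the result prints as plain indices):
which(unname(apply(df, 1, function(x) any(is.na(x)))))
#[1] 3 6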
apply is working exactly as it is supposed to. It is your expectations that are wrong.
apply(df, 1, function(x) is.na(df[x,1]))
The first thing that apply does (per the documentation) is coerce your data frame to a matrix. In the process, all numeric columns are coerced to character.
Next, each individual row of df is passed as the argument x to your function. In what sense is it meaningful to index df by the character values in the first row in df? So you just get a bunch of NAs. You can test this via:
> df[as.character(df[1,]),]
       x1   x2 x1.fixed x2.fixed
NA   <NA> <NA>       NA       NA
NA.1 <NA> <NA>       NA       NA
NA.2 <NA> <NA>       NA       NA
NA.3 <NA> <NA>       NA       NA
You say you want to know which columns introduced NAs, and yet you are applying over rows. If you really wanted to use apply (I recommend @eddi's method) you could do:
apply(df,2,function(x) any(is.na(x)))
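For the df above, this returns the same result as the sapply version; the strings "the" and "NA" survive apply's character coercion without becoming missing values, so only the .fixed columns are flagged:
apply(df, 2, function(x) any(is.na(x)))
#    x1    x2 x1.fixed x2.fixed
# FALSE FALSE     TRUE     TRUE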
You could use
rowSums(is.na(df))>0
[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
to find the rows containing NAs.
This is a vectorized operation, which should be faster than using apply when you are working with large data.
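For clarity, the intermediate pieces, using the df from the question (nas is just a throwaway name; unname keeps the printed output compact):
nas <- is.na(df)        # logical matrix with the same dimensions as df
unname(rowSums(nas))    # number of NAs in each row
#[1] 0 0 1 0 0 1 0 0 0 0
rowSums(nas) > 0        # TRUE for rows with at least one NA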