Why %in% in R is not matching properly?

I am new to R. I have created two vectors 'a' and 'b' in R. Both of them contain the same elements, but in a different order. Following are the details of the two vectors.
> str(a)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 369 obs. of 1 variable:
$ SKD_DOCUMENT_NO: chr "A0000514011" "A0000514012" "A0000514013" "A0000514014" ...
> str(b)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 369 obs. of 1 variable:
$ SKD_DOCUMENT_NO: chr "A0000648001" "A0000648050" "A0000648049" "A0000648048" ...
But when I try to check whether an element is in the vectors or not, I get confusing answers from R.
>'A0000648050' %in% a #[1] FALSE
>"A0000648050" %in% a #[1] FALSE
But when I use other methods to check whether the element is in 'a', I get the following results:
> any(a == "A0000648050") #[1] TRUE
> which(a == "A0000648050") #[1] 115
> grep("A0000648050", a) #[1] 1
Q1. What I don't understand is why %in% is failing.
Q2. What is the easiest way to check whether all elements of vector 'a' are present in vector 'b'? (All elements of 'a' are indeed present in 'b', but I would like R to confirm it.) Why do the following two lines give different results?
> a %in% b #[1] FALSE
> setequal(a,b) # TRUE

%in%
From ?'%in%' :
%in% is currently defined as "%in%" <- function(x, table) match(x,
table, nomatch = 0) > 0
Factors, raw vectors and lists are converted to character vectors, and
then x and table are coerced to a common type (the later of the two
types in R's ordering, logical < integer < numeric < complex <
character) before matching. If incomparables has positive length it is
coerced to the common type.
In your case, a is a tibble, which is a data.frame, which is a list, so it is converted to character before the comparison takes place.
a <- tibble(SKD_DOCUMENT_NO =c("A0000514011","A0000514012","A0000514013","A0000514014"))
as.character(a)
# [1] "c(\"A0000514011\", \"A0000514012\", \"A0000514013\", \"A0000514014\")"
This, though it's not intuitive, will return TRUE:
"c(\"A0000514011\", \"A0000514012\", \"A0000514013\", \"A0000514014\")" %in% a
any
from ?any, on the ... argument:
Other objects of zero length are ignored, and the rest are coerced to
logical ignoring any class.
a == "A0000514012"
# SKD_DOCUMENT_NO
# [1,] FALSE
# [2,] TRUE
# [3,] FALSE
# [4,] FALSE
The following happens when we coerce it to logical:
as.logical(a == "A0000514012")
# [1] FALSE TRUE FALSE FALSE
so the output you get with any(a == "A0000514012") makes sense.
The same exercise can be done with which or grep
solution
The solution is to use either:
"A0000514012" %in% a$SKD_DOCUMENT_NO # to look into a precise column
or
"A0000514012" %in% unlist(a) # to look into all columns, equivalent to your solution with `any`
or
sapply(a,`%in%`,x = "A0000514012") # to look into individual columns separately
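The second question (whether every element of 'a' is present in 'b') can be approached the same way, by comparing the columns rather than the tibbles themselves. A minimal sketch, assuming small base data frames as stand-ins for the original tibbles; the values and the names in_b, all_in_b, same_set and missing_from_b are made up for illustration:

```r
# Hypothetical stand-ins for the asker's one-column tables
a <- data.frame(SKD_DOCUMENT_NO = c("A0000514011", "A0000514012", "A0000648050"),
                stringsAsFactors = FALSE)
b <- data.frame(SKD_DOCUMENT_NO = c("A0000648050", "A0000514012", "A0000514011"),
                stringsAsFactors = FALSE)

# Element-wise membership: one TRUE/FALSE per element of a's column
in_b <- a$SKD_DOCUMENT_NO %in% b$SKD_DOCUMENT_NO

# TRUE only if every element of a's column occurs somewhere in b's column
all_in_b <- all(in_b)

# TRUE if both columns hold the same set of values, regardless of order
same_set <- setequal(a$SKD_DOCUMENT_NO, b$SKD_DOCUMENT_NO)

# Elements of a's column that are missing from b's column (empty here)
missing_from_b <- setdiff(a$SKD_DOCUMENT_NO, b$SKD_DOCUMENT_NO)
```

By contrast, a %in% b on the whole tables compares the deparsed column of a with the deparsed column of b as single strings, which is why it returns a single FALSE when the order differs.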

Related

Determine which elements of a vector partially match a second vector, and which elements don't (in R)

I have a vector A, which contains a list of genera, which I want to use to subset a second vector, B. I have successfully used grepl to extract anything from B that has a partial match to the genera in A. Below is a reproducible example of what I have done.
But now I would like to get a list of which genera in A matched with something in B, and which genera did not. I.e. the "matched" list would contain Cortinarius and Russula, and the "unmatched" list would contain Laccaria and Inocybe. Any ideas on how to do this? In reality my vectors are very long, and the genus names in B are not all in the same position amongst the other info.
# create some dummy vectors
A <- c("Cortinarius","Laccaria","Inocybe","Russula")
B <- c("fafsdf_Cortinarius_sdfsdf","sdfsdf_Russula_sdfsdf_fdf","Tomentella_sdfsdf","sdfas_Sebacina","sdfsf_Clavulina_sdfdsf")
# extract the elements of B that have a partial match to anything in A.
new.B <- B[grepl(paste(A,collapse="|"), B)]
# But now how do I tell which elements of A were present in B, and which ones were not?

We could use lapply or sapply to loop over the patterns and then get a named output
out <- setNames(lapply(A, function(x) grep(x, B, value = TRUE)), A)
Then it is easy to separate the patterns that matched from the ones returning empty elements
> out[lengths(out) > 0]
$Cortinarius
[1] "fafsdf_Cortinarius_sdfsdf"
$Russula
[1] "sdfsdf_Russula_sdfsdf_fdf"
> out[lengths(out) == 0]
$Laccaria
character(0)
$Inocybe
character(0)
and get the names of that
> names(out[lengths(out) > 0])
[1] "Cortinarius" "Russula"
> names(out[lengths(out) == 0])
[1] "Laccaria" "Inocybe"

You can use sapply with grepl to check each value of A against every value of B.
sapply(A, grepl, B)
#     Cortinarius Laccaria Inocybe Russula
#[1,]        TRUE    FALSE   FALSE   FALSE
#[2,]       FALSE    FALSE   FALSE    TRUE
#[3,]       FALSE    FALSE   FALSE   FALSE
#[4,]       FALSE    FALSE   FALSE   FALSE
#[5,]       FALSE    FALSE   FALSE   FALSE
You can take column-wise sum of these values to get the count of matches.
result <- colSums(sapply(A, grepl, B))
result
#Cortinarius    Laccaria     Inocybe     Russula
#          1           0           0           1
#values with at least one match
names(Filter(function(x) x > 0, result))
#[1] "Cortinarius" "Russula"
#values with no match
names(Filter(function(x) x == 0, result))
#[1] "Laccaria" "Inocybe"
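Both answers treat each genus name as a regular expression. If the names might contain characters that are special in a regex (a dot, for instance), a literal match with fixed = TRUE may be safer. A sketch reusing A and B from the question; has_match, matched and unmatched are names introduced here for illustration:

```r
A <- c("Cortinarius", "Laccaria", "Inocybe", "Russula")
B <- c("fafsdf_Cortinarius_sdfsdf", "sdfsdf_Russula_sdfsdf_fdf",
       "Tomentella_sdfsdf", "sdfas_Sebacina", "sdfsf_Clavulina_sdfdsf")

# For each genus, TRUE if it occurs as a literal substring of any element of B
has_match <- vapply(A, function(g) any(grepl(g, B, fixed = TRUE)), logical(1))

matched   <- A[has_match]   # genera found somewhere in B
unmatched <- A[!has_match]  # genera not found in B
```

vapply is used instead of sapply so that the result is guaranteed to be a logical vector with one entry per genus.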

subsetting a dataframe with a logical vector, which has only one `TRUE` element, returns multiple rows

I have a logical vector with only one TRUE value in it
> summary(no_in_gos)
Mode FALSE TRUE
logical 5891 1
When I use it to subset a dataframe, it returns multiple rows, which I don't understand. I'm only expecting one row, given that there is only one TRUE element in the logical vector:
> gos2[no_in_gos,]
ensembl_gene_id go_id
24 YDL245C GO:0005355
5916 YBR015C GO:0000026
11808 YPR182W GO:0046540
17700 YJR090C GO:0005634
23592 YFL034C-B GO:0007163
...
...
I'm really sure that my logical vector has only one TRUE element:
> sum(no_in_gos)
[1] 1
> which(no_in_gos == 'TRUE')
[1] 24
What can be some of the reasons that explain this behavior?
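One possible explanation, judging only from the output above: the returned row numbers (24, 5916, 11808, ...) are 5892 apart, which is exactly the length of no_in_gos (5891 FALSE plus 1 TRUE). That pattern is what you would see if gos2 has more rows than the logical vector has elements, because R then recycles the shorter index. A small sketch with made-up data (df, keep, recycled and single are hypothetical names):

```r
# Hypothetical data frame with 6 rows
df <- data.frame(id = 1:6, value = letters[1:6])

# Logical index with a single TRUE, but only 3 elements long
keep <- c(FALSE, TRUE, FALSE)

# The index is shorter than nrow(df), so it is recycled to
# FALSE TRUE FALSE FALSE TRUE FALSE and two rows come back
recycled <- df[keep, ]

# Using the position of the TRUE value avoids recycling
single <- df[which(keep), ]
```

Comparing length(no_in_gos) with nrow(gos2) would confirm or rule this out.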

After doing bind_rows() and rbind() on same data.tables , identical() = FALSE?

Caveat: novice. I have several data.tables with millions of rows each, variables are mostly dates and factors. I was using rbindlist() to combine them because. Yesterday, after breaking up the tables into smaller pieces vertically (instead of the current horizontal splicing), I was trying to understand rbind better (especially with fill = TRUE) and also tried bind_rows() and then tried to verify the results but identical() returned FALSE.
library(data.table)
library(dplyr)
DT1 <- data.table(a=1, b=2)
DT2 <- data.table(a=4, b=3)
DT_bindrows <- bind_rows(DT1,DT2)
DT_rbind <- rbind(DT1,DT2)
identical(DT_bindrows,DT_rbind)
# [1] FALSE
Visually inspecting the results from bind_rows() and rbind() says they are indeed identical. I read this and this (from where I adapted the example). My question: (a) what am I missing, and (b) if the number, names, and order of my columns is the same, should I be concerned that identical() = FALSE?

identical() also compares attributes, and these are not the same for the two objects. all.equal() has an option to skip the attribute check (check.attributes):
all.equal(DT_bindrows, DT_rbind, check.attributes = FALSE)
#[1] TRUE
If we check the str of both the datasets, it becomes clear
str(DT_bindrows)
#Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
# $ a: num 1 4
# $ b: num 2 3
str(DT_rbind)
#Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
# $ a: num 1 4
# $ b: num 2 3
# - attr(*, ".internal.selfref")=<externalptr> # reference attribute
After setting that attribute to NULL, identical() returns TRUE
attr(DT_rbind, ".internal.selfref") <- NULL
identical(DT_bindrows, DT_rbind)
#[1] TRUE
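To see which attributes differ without reading through str() output, the attribute names can be compared directly. The sketch below uses plain data frames and a made-up attribute called "note" as a stand-in for data.table's .internal.selfref, so that it runs without extra packages; x, y, strict, loose and extra_attrs are hypothetical names:

```r
x <- data.frame(a = c(1, 4), b = c(2, 3))
y <- x
attr(y, "note") <- "extra"  # hypothetical attribute standing in for .internal.selfref

# identical() compares attributes as well as values
strict <- identical(x, y)

# all.equal() can be told to ignore attributes and compare values only
loose <- isTRUE(all.equal(x, y, check.attributes = FALSE))

# Names of attributes present on y but not on x
extra_attrs <- setdiff(names(attributes(y)), names(attributes(x)))
```

If the only difference reported this way is a bookkeeping attribute, the data themselves can reasonably be treated as equal.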

Index a Particular Numeric Vector From a List of Vectors in R

In R, for the sake of example, I have a list composed of equal-length numeric vectors of form similar to:
list <- list(c(1,2,3),c(1,3,2),c(2,1,3))
[[1]]
[1] 1 2 3
[[2]]
[1] 1 3 2
[[3]]
[1] 2 1 3
...
Every element of the list is unique. I want to get the index number of the element x <- c(2,1,3), or any other particular numeric vector within the list.
I've attempted using match(x,list), which gives a vector full of NA, and which(list==c(1,2,3)), which gives me a "(list) object cannot be coerced to type 'double'" error. Coercing the list to different types didn't seem to make a difference for the which function. I also attempted various grep* functions, but these don't return exact numeric vector matches. Using find(c(1,2,3),list) or even some fancy sapply which %in% type functions didn't give me what I was looking for. I feel like I have a type problem. Any suggestions?
--Update--
Summary of Solutions
Thanks for your replies. The method in the comment for this question is clean and works well (via akrun).
> which(paste(list)==deparse(x))
[1] 25
The next method didn't work correctly
> which(duplicated(c(x, list(y), fromLast = TRUE)))
[1] 49
> y
[1] 1 2 3
This sounds good, but in the next block you can see the problem
> y<-c(1,3,2)
> which(duplicated(c(list, list(y), fromLast = TRUE)))
[1] 49
More fundamentally, there are only 48 elements in the list I was using.
The last method works well (via BondedDust), and I would guess it is more efficient using an apply function:
> which( sapply(list, identical, y ))
[1] 25

match works fine if you pass it the right data.
L <- list(c(1,2,3),c(1,3,2),c(2,1,3))
match(list(c(2,1,3)), L)
#[1] 3
Beware that this works via coercing lists to character, so fringe cases will fail - with a hat-tip to @nicola:
match(list(1:3),L)
#[1] NA
even though:
1:3 == c(1,2,3)
#[1] TRUE TRUE TRUE
Although arguably:
identical(1:3,c(1,2,3))
#[1] FALSE
identical(1:3,c(1L,2L,3L))
#[1] TRUE

You can use duplicated(). If we add the matching vector to the end of the original list and set fromLast = TRUE, we will find the duplicate(s). Then we can use which() to get the index.
which(duplicated(c(list, list(c(2, 1, 3))), fromLast = TRUE))
# [1] 3
Or you could add it as the first element and subtract 1 from the result.
which(duplicated(c(list(c(2, 1, 3)), list))) - 1L
# [1] 3
Note that the type always matters with this type of comparison. When comparing integers and numerics, you will need to convert doubles to integers for this to run without issue. For example, 1:3 is not the same type as c(1, 2, 3).
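The placement of fromLast matters here: it has to be an argument of duplicated(), not of c(). A sketch of the difference, using a list named L and a vector named target (both names are introduced here for illustration):

```r
L <- list(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3))
target <- c(2, 1, 3)

# Intended call: fromLast is an argument of duplicated(), so the earlier
# occurrence (the one inside L) is the element flagged as a duplicate
idx <- which(duplicated(c(L, list(target)), fromLast = TRUE))

# Misplaced parenthesis: fromLast = TRUE becomes an extra list element,
# duplicated() scans forward, and the appended copy itself is flagged,
# giving length(L) + 1 whatever the target's real position is
wrong <- which(duplicated(c(L, list(target), fromLast = TRUE)))
```

This would be consistent with the 49 reported in the question's update for a list of 48 elements, assuming the call there was typed with fromLast inside c().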

> L <- list(c(1,2,3),c(1,3,2),c(2,1,3))
> sapply(L, identical, c(2,1,3))
[1] FALSE FALSE TRUE
> which( sapply(L, identical, c(2,1,3)) )
[1] 3
This would be slightly less restrictive in its test:
> which( sapply(L, function(x,y){all(x==y)}, c(1:3)) )
[1] 1

Try:
vapply(list,function(z) all(z==x),TRUE)
#[1] FALSE FALSE TRUE
Wrapping the above line in which() gives you the index of the matching element in the list.

Lookup table with the query with arbitrary length without using a for loop in R

Say I have a lookup table as following
dt <- data.frame(name=c("jack","jill","sam","dan"),age=c(20,14,28,13))
  name age
1 jack  20
2 jill  14
3  sam  28
4  dan  13
Now I want to convert the following vectors to the vectors containing the ages of each element.
query1 <- c("jack","dan")
query2 <- c("sam")
query3 <- c("jack","sam", "dan")
I can build the following function(which I don't like) to accomplish this task,
library(plyr)  # provides ldply()
get.age <- function(x) {
  answer <- list()
  for (i in seq_along(x)) {
    answer[[i]] <- dt[dt$name == x[i], "age"]
  }
  ldply(answer)$V1
}
which gets the job done like this
> get.age(query1)
[1] 20 13
> get.age(query2)
[1] 28
> get.age(query3)
[1] 20 28 13
But I don't like the solution because it uses a for loop and some dirty hack. Ideally, I would want to do it more R-like using vector operations like this(which doesn't seem to work)
> dt[dt$name==c("jack","dan"),"age"]
[1] 20 13 #worked
> dt[dt$name==c("jack","sam"),"age"]
[1] 20 # not the right answer
The following solution works, but this requires the prior knowledge of how many things I lookup.
dt[dt$name=="jack" | dt$name=="sam","age"]
[1] 20 28
I would like to know a method that can handle arbitrary size of vectors that converts the keys to elements without using for loop, if there is any

For a true lookup table, the result should be the length of the query and also deal with replication in the query. The approaches using match(...) are the only ones that do this:
query4 <- c("jack","sam", "dan","sam","jack")
dt[match(query4,dt$name),]$age
# [1] 20 28 13 28 20
This is because match(LHS,RHS) returns an integer vector of length(LHS) which contains the row numbers of the RHS which match the corresponding element of LHS.
The approaches based on comparison (==) will generally not work. This is because, when comparing two vectors, R tries to replicate the shorter one however many times are needed to make it the same length as the longer one, and then does an element-by-element comparison. So in the case of dt$name==query1, for example, the RHS gets replicated twice and the comparison is between c("jack","jill","sam","dan") and c("jack","dan","jack","dan").
dt$name==query1 # RHS is: c("jack","dan","jack","dan")
# [1] TRUE FALSE FALSE TRUE
dt$name==query2 # RHS is: c("sam","sam","sam","sam")
# [1] FALSE FALSE TRUE FALSE
dt$name==query3 # RHS is: c("jack","sam", "dan","jack") with warning
# [1] TRUE FALSE FALSE FALSE
# with warning: longer object length is not a multiple of shorter object length
On the other hand, using LHS %in% RHS gives a result with length(LHS) and either T or F depending on whether that element is present in RHS.
dt$name %in% query1
# [1] TRUE FALSE FALSE TRUE
query1 %in% dt$name
# [1] TRUE TRUE
Note that it looks like dt$name %in% query1 and dt$name==query1 give the same result, but that's an artifact of query1 being replicated twice in the latter comparison. See, for example:
dt$name %in% query3
# [1] TRUE FALSE TRUE TRUE
dt$name == query3
# [1] TRUE FALSE FALSE FALSE
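The match()-based lookup can be wrapped in a small function, which also shows what happens when a name is not in the table. A minimal sketch; get_age, age_by_name and the name "bob" are made up for illustration:

```r
dt <- data.frame(name = c("jack", "jill", "sam", "dan"),
                 age  = c(20, 14, 28, 13),
                 stringsAsFactors = FALSE)

# Hypothetical helper: one age per query element, in query order
get_age <- function(query) dt$age[match(query, dt$name)]

ages_repeat  <- get_age(c("jack", "sam", "dan", "sam", "jack"))
ages_missing <- get_age(c("jack", "bob"))  # "bob" is not in the table, so NA

# Equivalent lookup through a named vector
age_by_name <- setNames(dt$age, dt$name)
ages_named  <- unname(age_by_name[c("sam", "jack")])
```

Both forms return a result of the same length as the query and keep repeated names, which the == and %in% subsetting approaches do not.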

You want %in%. It returns a logical vector that can be used to subset the data frame
dt[dt$name %in% query3,"age"]

There are a lot of ways to do this, but I'll throw out one that I find useful: match(). @jlhoward's answer goes into more detail and explains why my previous == examples were wrong.
> match(query1, dt$name) #these give us the index of the *first* matching value
[1] 1 4
> match(query2, dt$name)
[1] 3
> dt$age[match(query1, dt$name)]
[1] 20 13
> dt$age[match(query2, dt$name)]
[1] 28
You can also use %in%. Unlike match, this returns TRUE or FALSE for each element depending on whether it exists in the comparison vector. Be sure to get the order right: dt$name %in% query1 returns TRUE FALSE FALSE TRUE, whereas query1 %in% dt$name returns TRUE TRUE.
> dt[dt$name %in% query1, ][,'age',]
[1] 20 13
With dplyr you can use filter
> require(dplyr)
> filter(dt, name %in% query1)
name age
1 jack 20
2 dan 13
> filter(dt, name %in% query1)$age
[1] 20 13
