R: how does the which() function operate [duplicate]

I am confused about the which function. Basically I thought that it checks at which position of an input object (e.g., a vector) a logical condition is true. As seen in the documentation:
which(LETTERS == "R")
[1] 18
In other words, it goes through all LETTERS values and checks if value == R. But this seems to be a misunderstanding. If I input
a <- c("test","test2","test3","test4")
b <- c("test","test3")
which(a==b)
[1] 1
it returns [1] 1 although test3 does also appear in both vectors. Also, if I input a shorter vector for a, it returns a warning:
a <- c("test","test2","test3")
b <- c("test","test3")
which(a==b)
[1] 1
Warning message:
In a == b : longer object length is not a multiple of shorter object length
My question here is twofold:
How can I return the positions of a character vector a that match a character vector b?
How does which() operate because I obviously misunderstand the function.
Thank you for your answers
Edit: thank you for your quick replies, you clarified my misunderstanding!

You need to give which an input that tells it what elements of a are in b:
which(a%in%b)
[1] 1 3
which essentially identifies which elements are TRUE in a logical vector.
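A minimal sketch of the point above, using the a and b from the question. The names in_b and pos are only illustrative; match() is a base R function that is not mentioned in the answer and is shown here as one possible alternative.

```r
a <- c("test", "test2", "test3", "test4")
b <- c("test", "test3")

# %in% tests each element of a for membership in b and returns a logical vector
in_b <- a %in% b
in_b
#> [1]  TRUE FALSE  TRUE FALSE

# which() turns that logical vector into the positions of its TRUE elements
pos <- which(in_b)
pos
#> [1] 1 3

# match() goes the other way: for each element of b, its first position in a
match(b, a)
#> [1] 1 3
```

If you only need the matching values rather than their positions, a[in_b] works without which().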

== compares values element by element (a[1]==b[1], a[2]==b[2], ...), not as sets.
For set membership, use %in%.
Use a[which(a %in% b)] to get [1] "test" "test3".
which() returns the indices of the TRUE elements, not the values themselves.
which(a %in% b) will return
[1] 1 3
The reason for the warning message is R's recycling:
Warning message:
In a == b : longer object length is not a multiple of shorter object length
When you compare a vector of length 4 with a vector of length 2 element by element (using ==), R 'recycles' the shorter vector. With lengths 4 and 2 this works without complaint, and the comparison is (a[1]==b[1], a[2]==b[2], a[3]==b[1], a[4]==b[2]). With lengths 3 and 2, as in your second example, R still recycles, but it warns you that the longer length is not a whole multiple of the shorter one.
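To make the recycling visible, here is a small sketch with the vectors from the question. rep_len() is used only to spell out what == effectively compares against; b_recycled, cmp, a3 and cmp3 are illustrative names.

```r
a <- c("test", "test2", "test3", "test4")
b <- c("test", "test3")

# What a == b effectively compares against: b repeated until it is as long as a
b_recycled <- rep_len(b, length(a))
b_recycled
#> [1] "test"  "test3" "test"  "test3"

# Element-by-element comparison, so only position 1 matches
cmp <- a == b
cmp
#> [1]  TRUE FALSE FALSE FALSE
identical(cmp, a == b_recycled)
#> [1] TRUE

# With lengths 3 and 2, R still recycles but warns (suppressed here)
a3 <- c("test", "test2", "test3")
cmp3 <- suppressWarnings(a3 == b)
cmp3
#> [1]  TRUE FALSE FALSE
```

This is why which(a == b) returns only 1: "test3" sits at position 3 of a, but it is compared with the recycled "test", not with "test3".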

Related

How to test if an object is a vector in R

I want to test if an object is a vector in R. I'm confused as to why
is.vector(c(0.1))
returns TRUE and so does
is.vector(0.1)
I would like it to return false when it is just a number and true when it is a vector. Can anyone offer any help on this please?
Many thanks in advance.
In R, a single number or string never exists on its own. It is always a vector of length 1, or an element embedded in some more complex structure.
is.vector(c(0.1)) and is.vector(0.1) are absolutely identical in R.
That is also why length("this is a string/character") returns 1: length() here counts the number of elements in the vector.
You can see this if you type "this is a string/character" into the R console:
It returns [1] "this is a string/character". The [1] is the index of the first element printed on that line, which shows that the value is a vector (here of length 1).
So you have to call nchar("this is a string/character") to get the number of characters in the first element, the character string, which returns 26.
nchar(c("this is a string/character", "and this another string"))
## [1] 26 23
## nchar is vectorized as you see ...
This is an important difference from Python, where strings and numbers can stand alone.
So len("this") returns 4 in Python, whereas len(["this"]) returns 1 (one element in the list, so the length of the list is 1).
As already mentioned by @RHertel, R considers c(0.1) a vector of length 1. You may want to test for length as well. E.g.
> x <- 1
> y <- 1:2
> is.vector(x) & length(x) > 1
[1] FALSE
> is.vector(y) & length(y) > 1
[1] TRUE
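The check above could be wrapped in a small helper. This is only a sketch: is_multi_element_vector is a hypothetical name, and it uses && rather than & because both sides are single logical values.

```r
# Hypothetical helper: TRUE only for vectors with more than one element
is_multi_element_vector <- function(x) {
  is.vector(x) && length(x) > 1
}

identical(c(0.1), 0.1)                # wrapping a single value in c() changes nothing
#> [1] TRUE
is_multi_element_vector(0.1)
#> [1] FALSE
is_multi_element_vector(c(0.1, 0.2))
#> [1] TRUE
is_multi_element_vector(list(1, 2))   # is.vector() also accepts lists by default
#> [1] TRUE
```

If lists should not count, is.atomic(x) could replace is.vector(x), depending on what you mean by "vector".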

R drop by empty index on vector inconsistent behaviour

Consider removing those elements from a vector that match a certain set of criteria. The expected behaviour is to remove those that match, and, in particular, if none match then remove none:
> d = 1:20
> d
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> d[-which(d > 10)]
[1] 1 2 3 4 5 6 7 8 9 10
> d[-which(d > 100)]
integer(0)
We see here that the final statement has both done something very unexpected and silently hidden the error without even a warning.
I initially thought that this was an undesirable (but consistent) consequence of the choice that an empty index selects all elements of a vector (http://stat.ethz.ch/R-manual/R-devel/library/base/html/Extract.html), as is commonly used, e.g., to select the first column of a matrix, m, by writing
m[ , 1]
However the behaviour observed here is consistent with interpreting an empty vector as "no elements", not "all elements":
> a = integer(0)
selecting "no elements" works exactly as expected:
> v[a]
numeric(0)
however removing "no elements" does not:
> v[-a]
numeric(0)
For an empty vector to both select no elements and remove all elements requires inconsistency.
Obviously it is possible to work around this issue, either by checking that which() returns a result of non-zero length or by using a logical expression, as covered in "In R, why does deleting rows or cols by empty index results in empty data ? Or, what's the 'right' way to delete?",
but my two questions are:
Why is the behaviour inconsistent?
Why does it silently do the wrong thing without an error or warning?
This doesn't work because which(d > 100) and -which(d > 100) are the same object: there is no difference between an empty vector and the negative of that empty vector.
For example, imagine you did:
d = 1:10
indexer = which(d > 100)
negative_indexer = -indexer
The two variables would be the same (which is the only consistent behavior: turning all the elements of an empty vector negative leaves it the same, since it has no elements).
indexer
#> integer(0)
negative_indexer
#> integer(0)
identical(indexer, negative_indexer)
#> [1] TRUE
At that point, you couldn't expect d[indexer] and d[negative_indexer] to give different results. There is also no place to provide an error or warning: it doesn't know when passed an empty vector that you "meant" the negative version of that empty vector.
The solution is that for subsetting there's no reason you need which() at all: you could use d[!(d > 10)] instead of d[-which(d > 10)] in your original example, and likewise d[!(d > 100)] or d[d <= 100] for the case where nothing matches. This behaves as you'd expect because !(d > 10) and !(d > 100) are logical vectors rather than vectors of indices.
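A short sketch of the options, using a vector like the d from the question. The variable names drop, kept_logical, kept_guarded and kept_setdiff are illustrative, and the setdiff() variant is an extra alternative not mentioned above.

```r
d <- 1:20

# Negating an empty index vector leaves it empty
drop <- which(d > 100)
identical(-drop, drop)
#> [1] TRUE

# Logical indexing keeps everything when nothing matches
kept_logical <- d[!(d > 100)]
length(kept_logical)
#> [1] 20

# If positions are really needed, guard against the empty case
kept_guarded <- if (length(drop) > 0) d[-drop] else d
length(kept_guarded)
#> [1] 20

# Or select the complement of the positions instead of negating them
kept_setdiff <- d[setdiff(seq_along(d), drop)]
length(kept_setdiff)
#> [1] 20
```

The logical version is usually the simplest; the guarded and setdiff() versions are only worth it when you already hold a vector of positions.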

Types and comparisons in R

I've been working with R for a month or so, and my comprehension of some subtleties is still quite superficial.
I have had an issue, which I managed to solve (details below), but I still can't explain precisely why it did not work with the first solution.
Note that the example below makes no practical sense for I have simplified it as much as possible so that the problem is quite clear.
ISSUE :
Given a data frame with 4 columns (email, first, last, company) :
> users <- data.frame(matrix(vector(), 0, 4, dimnames=list(c(), c("email", "first", "last", "company"))), stringsAsFactors=F)
> users[1,] <- c("robert#redford.com", "Robert", "Redford", "Paramount")
> users[2,] <- c("julia#roberts.com", "Erin", "B.", "Hinkley")
> users[3,] <- c("matt#damon.com", "Will", "H.", "Stanford")
> users[4,] <- c("john#malkovitch.com", "John", "M.", "JM")
I take one particular row :
> user <- users[3,]
When I try to subset the data frame on a criterion that should have returned the previously mentioned row, it returns no result.
> users[users$email == user["email"],]
[1] email first last company
<0 rows> (or 0-length row.names)
I instantly thought it was a casting issue (sorry for this bad one)
> users[users$email == as.character(user["email"]),]
email first last company
3 matt#damon.com Will H. Stanford
However, when I tried to figure out where exactly the issue was, and tried this :
> users[users$email == "matt#damon.com",]
email first last company
3 matt#damon.com Will H. Stanford
> user["email"] == "matt#damon.com"
email
3 TRUE
> users[3,]$email == user$email
[1] TRUE
I got quite confused :
First, I thought about it as a math problem : if A == B and B == C, then A == C (according to Captain Obvious). So, just replacing a member A by another member B which is supposed to be equal to A (given the "TRUE" statement) in some expression should have no impact on the result of this expression.
3 TRUE != [1] TRUE. I think [1] TRUE is a logical vector of size 1 whose first element is TRUE. 3 TRUE is a (1x1) matrix row whose "email" column value is TRUE.
My problem is with consistency: either two objects of equal content but different types should be equal, or they should be different. I have a problem with "Sometimes there is type inference, and sometimes not". Is there a rule I can't see behind this behavior? (I guess there is one)
Another expression of the behavior I'd like to get is this one :
> unique(users$email) == "matt#damon.com"
[1] FALSE FALSE TRUE FALSE
> unique(users$email) == user["email"]
email
3 FALSE
Obviously R does get what I want (considering the fact that it gives me the matching row). But I can't explain (nor use) the result of the second statement.
Any explanations / thoughts?
In normal list situations, use [[ ]] to extract the element itself:
users$email == user[["email"]]
However, in data.frames things get inconsistent and a lot worse!
tdf=data.frame(matrix(1:100,10,10))
tdf[] # returns data.frame everything
tdf[1] # returns data.frame first column
tdf[1,1] # returns a single element, with the type of that column
tdf[,1] # returns a vector of the first column
tdf[1,] # returns a data.frame of the first row # eeeeeugh... that is odd....
tdf[2:4] # returns a data.frame with 3 columns
tdf[1,2:4] # returns a data.frame of the first row of 3 columns
tdf[2:4,2:4] # returns a 3x3 data.frame
tdf[2:4,1] # returns a vector of 2:4 row and 1st column
tdf[,2:4] # returns a data.frame with 3 columns
Then there is also the double [[ ]].
Do note that in data.frames things get horribly annoying and fugly:
tdf[[1]] # gives the first column as a vector
tdf[[1,1]] # gives first element
and pretty much all other combinations give errors.
And assigning stuff to a data.frame or matrix is an even bigger mess!
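Applied to the question, a sketch of the difference between single and double brackets. The data frame is rebuilt here in a shorter form with only two of the original columns, and matches is an illustrative name.

```r
users <- data.frame(
  email = c("robert#redford.com", "julia#roberts.com",
            "matt#damon.com", "john#malkovitch.com"),
  first = c("Robert", "Erin", "Will", "John"),
  stringsAsFactors = FALSE
)
user <- users[3, ]

class(user["email"])     # single brackets keep the data.frame wrapper
#> [1] "data.frame"
class(user[["email"]])   # double brackets return the column itself
#> [1] "character"

# Comparing a character vector with a character vector behaves as expected
matches <- users$email == user[["email"]]
matches
#> [1] FALSE FALSE  TRUE FALSE
users[matches, "first"]
#> [1] "Will"
```

user$email is equivalent to user[["email"]] here, which is why users[3,]$email == user$email in the question returned a plain [1] TRUE.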

Extract indices from array meeting a condition in R

Say I have d<-c(1,2,3,4,5,6,6,7). How can I select the indices from d that meet a certain condition such as x>3 and x<=6 (i.e. d[4], d[5], d[6], d[7])?
Use which
> which(d>3 & d<=6)
[1] 4 5 6 7
Minor: c() creates a vector, which is similar to but not exactly an array.
You can create a logical vector and use it to index d.
d[d>3 & d<=6] # the comparison operators return logical vectors; [] extracts
# only the elements where the value is TRUE
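The two answers fit together as follows; keep and idx are illustrative names in this sketch.

```r
d <- c(1, 2, 3, 4, 5, 6, 6, 7)

# Logical vector: one TRUE/FALSE per element of d
keep <- d > 3 & d <= 6

# which() gives the positions where keep is TRUE
idx <- which(keep)
idx
#> [1] 4 5 6 7

# Indexing by the logical vector or by the positions gives the same values
d[keep]
#> [1] 4 5 6 6
identical(d[idx], d[keep])
#> [1] TRUE
```

Use which() when you need the positions themselves, and the logical vector when you only need the matching values.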

How many elements in a vector are greater than x without using a loop

If I have the following vector :
x
[1] 1 5 8 9 1 0 15 15
and I want to know how many elements are greater than 10, how can I proceed without using a loop ?
I would like to get :
2
as a result
Use length or sum:
> length(x[x > 10])
[1] 2
> sum(x > 10)
[1] 2
In the first approach, you create a vector that contains only the values that match your condition, and then retrieve the length of that vector.
In the second approach, you simply create a logical vector that states whether each value matches the condition (TRUE) or doesn't (FALSE). Since TRUE and FALSE are coerced to 1 and 0 in arithmetic, you can simply use sum to get your answer.
Because the first approach requires indexing and subsetting before counting, I am almost certain that the second approach would be faster than the first.
Another way to do this:
> length(which(as.vector(x) > 10))
[1] 2
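One difference between the three approaches that only shows up with missing values, sketched here as an aside. The vector y, which appends an NA to x, is an assumption for illustration and is not part of the question.

```r
x <- c(1, 5, 8, 9, 1, 0, 15, 15)

# Without missing values, all three approaches agree
sum(x > 10)
#> [1] 2
length(x[x > 10])
#> [1] 2
length(which(x > 10))
#> [1] 2

# Hypothetical variant with a missing value
y <- c(x, NA)

length(y[y > 10])          # an NA in the index yields an NA element, so it is counted
#> [1] 3
sum(y > 10)                # NA propagates through sum()
#> [1] NA
sum(y > 10, na.rm = TRUE)  # drop NAs explicitly
#> [1] 2
length(which(y > 10))      # which() ignores NA
#> [1] 2
```

So if the data may contain NA, sum(..., na.rm = TRUE) or length(which(...)) are the safer choices.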
