Trouble with understanding explanation of %in% - r

I am having trouble with understanding %in%. In Hadley Wickham's Book "R for data science" in section 5.2.2 it says, "A useful short-hand for this problem is x %in% y. This will select every row where x is one of the values in y." Then this example is given:
nov_dec <- filter(flights, month %in% c(11, 12))
However, when I look at the syntax, it appears that it should be selecting every row where y is one of the values in x(?). So in the example, all the cases where 11 and 12 (y) appear in "month" (x).
?"%in%" doesn't make this any clearer to me. Obviously I'm missing something, but could someone please spell out exactly how this function works?

It appears that it should be selecting every row where y is one of the values in x(?) So in the example, all the cases where 11 and 12 appear in "month."
If you don't understand the behavior from looking at the example, try it out yourself. For example, you could do this:
> c(1,2,3) %in% c(2,4,6)
[1] FALSE TRUE FALSE
So it looks like %in% gives you a vector of TRUE and FALSE values that correspond to each of the items in the first argument (the one before %in%). Let's try another:
> c(1,2,3) %in% c(2,4,6,8,10,12,1)
[1] TRUE TRUE FALSE
That confirms it: the first item in the returned vector is TRUE if the first item in the first argument is found anywhere in the second argument, and so on. Compare that result to the one you get using match():
> match(c(1,2,3), c(2,4,6,8,10,12,1))
[1] 7 1 NA
So the difference between match() and %in% is that the former gives you the actual position in the second argument of the first match for each item in the first argument, whereas %in% gives you a logical vector that just tells you whether each item in the first argument appears in the second.
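As a rough sketch of how the two relate: the ?match help page describes %in% as a thin wrapper around match(). The snippet below checks that equivalence on the vectors used above. The names first, second, pos and found are only illustrative and do not come from the original answer.

```r
# Illustrative names, not from the original answer
first  <- c(1, 2, 3)
second <- c(2, 4, 6, 8, 10, 12, 1)

pos   <- match(first, second)  # positions in `second`, NA where there is no match
found <- first %in% second     # TRUE/FALSE for each element of `first`

# %in% behaves like "did match() find a position?"
identical(found, !is.na(pos))
# [1] TRUE
identical(found, match(first, second, nomatch = 0) > 0)
# [1] TRUE
```

Either way, the result always has the length of the first argument, never of the second.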
In the context of Wickham's book example, month is a vector of values representing the months in which various flights take place. So for the sake of argument, something like:
> month <- c(2,3,5,11,2,9,12,10,9,12,8,11,3)
Using the %in% operator lets you turn that vector into the answers to the question Is this flight in month 11 or 12? like this:
> month %in% c(11,12)
[1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
[13] FALSE
which gives you a logical vector, i.e. a list of true/false values. The filter() function uses that logical vector to select corresponding rows from the flights table. Used together, filter and %in% answer the question What are all the flights that occur in months 11 or 12?
If you turned the %in% around and instead asked:
> c(11,12) %in% month
[1] TRUE TRUE
you're really just asking Are there any flights in each of month 11 and month 12?
I can imagine that it might seem odd to ask whether a large vector is "in" a vector that has only two values. Consider reading x %in% y as: Is each of the values from x also in y?
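To make the link between the logical vector and the selected rows concrete, here is a minimal base R sketch. It assumes a tiny made-up table, flights_small, standing in for the real flights data, and it uses plain logical indexing rather than dplyr's filter(), so it only illustrates the idea and is not what filter() does internally.

```r
# Hypothetical miniature stand-in for the flights table
flights_small <- data.frame(
  flight = c("A1", "B2", "C3", "D4", "E5"),
  month  = c(2, 11, 5, 12, 11),
  stringsAsFactors = FALSE
)

keep <- flights_small$month %in% c(11, 12)  # one TRUE/FALSE per row
keep
# [1] FALSE  TRUE FALSE  TRUE  TRUE

# Base R row selection with that logical vector
nov_dec_small <- flights_small[keep, ]
nov_dec_small$flight
# [1] "B2" "D4" "E5"
```

Because keep has one entry per row of the table, the large vector has to go on the left of %in%.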

A quick exercise should be enough to demonstrate how the function works:
> x <- c(1, 2, 3, 4)
> y <- 4
> z <- 5
> x %in% y
[1] FALSE FALSE FALSE TRUE
So the fourth element of numeric vector x is present in numeric vector y.
> y %in% x
[1] TRUE
And the first element of y (there's only one) is in x.
> z %in% x
[1] FALSE
> x %in% z
[1] FALSE FALSE FALSE FALSE
And z is not in x, nor is any element of x in z.
Also see the help for all matching functions with ?match

I think understanding how it works is somewhat semantic, and once you can say it logically then the grammar works itself out.
The key is to form a sentence in your head as you read the code, one that captures the idea of applying the test row by row and using Boolean logic to include or exclude each row based on what is in the "filter by" list, %in% c( ).
nov_dec <- filter(flights, month %in% c(11, 12))
In this case for your example above it should read like this:
"Set the variable nov_dec equal to the subset of rows in flights, where the variable column month (from those rows) is in the list c(11,12). "
As R works through the rows from the top down, it looks at month, and if it is either 11 or 12 (the two values in your list), it includes that row in nov_dec; otherwise it just moves on.
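The "either 11 or 12" reading can be checked directly. The sketch below uses a made-up month vector (not the real flights data) and compares the spelled-out comparison with the %in% shorthand.

```r
# Made-up month values, for illustration only
month <- c(2, 3, 5, 11, 2, 9, 12, 10, 9, 12, 8, 11, 3)

long_form  <- month == 11 | month == 12  # "is it 11, or is it 12?"
short_form <- month %in% c(11, 12)       # "is it one of 11, 12?"

identical(long_form, short_form)
# [1] TRUE
sum(short_form)  # number of elements that would be kept
# [1] 4
```

With more than a handful of allowed values, the %in% form stays the same length while the chain of == and | keeps growing.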

This explicitly means: are the values from x also in y?
The best way to understand it is with an example:
x <- 1:10 # numbers from 1 to 10
y <- (1:5)*2 # even numbers from 2 to 10
y %in% x # all TRUE: every even number from 2 to 10 is among the numbers 1 to 10
x %in% y # TRUE only at the even numbers

Related

How to test whether each element in a column of values falls between values in two other columns?

This may be a very convoluted way of asking this question. I have a column of "results" that I want to test against statistics of previous results, namely calculated minimum and maximum values. If the value in the result column falls between the corresponding min and max values, I want to assign it as "1" in a fourth column named Within_Range and if not, "0".
I have tried using relational operators (<,>)
df$Within_Range <- if(df$Result > df$Min & df$Result < df$Max){"1"} else {"0"}
and got this:
In if (df$Result > df$Min & df$Result < df$Max) { :
the condition has length > 1 and only the first element will be used
R did not seem to like that I tried to use multiple conditions, so I tried using between()
df$Within_Range <- if(between(df$Result,df$Min,df$Max)){"1"} else {"0"}
and I got this:
Error: Expecting a single value: [extent=20511].
Here is some example code:
Result <- 1:5
Min <- c(2,1,2,3,4)
Max <- c(3,4,5,8,7)
df <- data.frame(Result, Min, Max)
Apologies if this is a silly question; I am still new to R and hours of searching R forums returned nothing helpful... I am stuck.
between() is not vectorized over its left and right arguments (at least in the dplyr version that produced the error above), so we need comparison operators:
df$Within_Range <- with(df, +(Result > Min & Result < Max))
NOTE: Change to >= or <= if the range should also include the Min, Max values
Also, in the first piece of code, the if/else is unnecessary for two reasons:
if/else is not vectorized, i.e. it expects a condition of length 1 and evaluates a single branch (df$Result and the other columns obviously have length greater than 1).
The TRUE/FALSE output of comparison operators is stored as 1/0 values, so we just need to coerce it to binary with as.integer or +.
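Put together with the example data from the question, the vectorized approach can be sketched as follows. The intermediate name in_range is introduced here only for readability, and ifelse() is shown as one possible alternative when character labels are wanted.

```r
# Example data from the question
Result <- 1:5
Min <- c(2, 1, 2, 3, 4)
Max <- c(3, 4, 5, 8, 7)
df <- data.frame(Result, Min, Max)

# Element-wise comparison yields one TRUE/FALSE per row
in_range <- df$Result > df$Min & df$Result < df$Max

# Coerce the logical vector to 1/0
df$Within_Range <- as.integer(in_range)
df$Within_Range
# [1] 0 1 1 1 1

# ifelse() is the vectorized counterpart of if/else
ifelse(in_range, "1", "0")
# [1] "0" "1" "1" "1" "1"
```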
library(dplyr)
df %>% mutate(Within_Range = between(Result, Min, Max))
## Output
  Result Min Max Within_Range
1      1   2   3        FALSE
2      2   1   4         TRUE
3      3   2   5         TRUE
4      4   3   8         TRUE
5      5   4   7         TRUE

How to find missing numbers in a sequence?

I have a vector containing a list of numbers. How do I find numbers that are missing from the vector?
For example:
sequence <- c(12:17,1:4,6:10,19)
The missing numbers are 5, 11 and 18.
sequence <- c(12:17,1:4,6:10,19)
seq2 <- min(sequence):max(sequence)
seq2[!seq2 %in% sequence]
...and the output:
> seq2[!seq2 %in% sequence]
[1] 5 11 18
You can use the setdiff() function to compute set differences. You want the difference between the complete sequence (from min(sequence) to max(sequence)) and the sequence with missing values.
setdiff(min(sequence):max(sequence), sequence)
This answer just gets all of the numbers from the lowest to highest in the sequence, then asks which are not present in the original sequence.
which(!(seq(min(sequence), max(sequence)) %in% sequence))
[1] 5 11 18
c(1:max(sequence))[!duplicated(c(sequence,1:max(sequence)))[-(1:length(sequence))]]
[1] 5 11 18
Not a particularly elegant solution, I admit, but it determines which elements of 1:max(sequence) are duplicates of values in sequence, and then keeps the ones that are not.
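The answers above can be wrapped into a small helper. The sketch below uses a hypothetical function name, find_missing, and assumes the input holds whole numbers. It also shows a caveat for the which() variant: which() returns positions, and those equal the missing values only when the sequence starts at 1.

```r
# Hypothetical helper, not part of base R
find_missing <- function(v) {
  full <- seq(min(v), max(v))  # every whole number from lowest to highest
  setdiff(full, v)             # those that never appear in v
}

find_missing(c(12:17, 1:4, 6:10, 19))
# [1]  5 11 18

# A sequence that does not start at 1
shifted <- c(10, 11, 13, 14)
full <- seq(min(shifted), max(shifted))
which(!(full %in% shifted))  # position within `full`, not the value
# [1] 3
full[!(full %in% shifted)]   # the missing value itself
# [1] 12
```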

understanding levels: is levels not same as unique()

I read a csv file into a data frame named rr. The character column was treated as factors which was nice.
Do I understand correctly that the levels are just the unique values of the columns? i.e.
levels(rr$col) == unique(rr$col)
Then I wanted to strip leading and trailing whitespace. (I didn't know about the strip.white option in read.csv.)
So I did
rr$col = str_trim(rr$col)
Now the rr$col is no longer a factor. So I did
rr$col = as.factor(rr$col)
But I see now that levels(rr$col) is missing some unique values !! Why?
Levels are a special property of a factor variable (column). They are handy because they are retained even if a subset does not contain any values from a specific level. Take for example:
x <- as.factor(rep(letters[1:3], each = 3))
If we subset only elements under levels a and b, c is left out of the data. It is still reported by levels(), but not by unique(), which only sees the values that actually appear in the subset.
> x[c(1,2, 4)]
[1] a a b
Levels: a b c
> levels(x[c(1,2, 4)])
[1] "a" "b" "c"
> unique(x[c(1,2, 4)])
[1] a b
Levels: a b c
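One plausible reason for the "missing" values in the question is that entries differing only by whitespace collapse into a single level once they are trimmed. The sketch below assumes a made-up column, raw, and uses base R's trimws() in place of stringr's str_trim(); it also shows droplevels() for discarding unused levels.

```r
# Hypothetical column with stray whitespace, as it might come from a CSV
raw <- factor(c("a", "a ", " b", "b", "c"))
length(levels(raw))  # five distinct strings, whitespace included
# [1] 5

# trimws() is a base R function for stripping leading/trailing whitespace
trimmed <- factor(trimws(as.character(raw)))
levels(trimmed)      # "a " and " b" have merged with "a" and "b"
# [1] "a" "b" "c"

# Unused levels linger after subsetting until dropped explicitly
sub <- trimmed[trimmed != "c"]
levels(sub)
# [1] "a" "b" "c"
levels(droplevels(sub))
# [1] "a" "b"
```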

Keep values from vector when another vector TRUE in R

I have got two vectors, both with dimensions 30000x1, so only one column and many rows. First vector contains values, second only TRUE or FALSE.
I want to keep all the rows of vector1 where at the same row vector2 equals TRUE.
I have tried combinations like:
res=apply(vector1,2,vector2)
res=vector1(vector2)
res=vector1[vector2]
but I can't figure this out. Thanks a lot for help.
Example:
vector1:
123
345
667
vector2:
TRUE
FALSE
TRUE
res:
123
667
In R you can index into one vector using a second vector of the same length that contains Boolean values, such that wherever the second vector contains TRUE you select the corresponding element of the first.
So your third way works for me
v1=c(123,345,667)
v2=c(TRUE,FALSE,TRUE)
v1[v2]
which outputs
[1] 123 667
This is because v2 contains TRUE at positions 1 and 3, and so v1[v2] is equivalent to v1[c(1,3)].
See point 1 of the introductory documentation on indexing. Specifically:
[indexing with] a logical vector. In this case the index vector must be of the same length as the vector from which elements are to be selected. Values corresponding to TRUE in the index vector are selected and those corresponding to FALSE are omitted
This works:
x = 1:3
y = c(TRUE, FALSE, TRUE) # spelled out, since T and F can be reassigned
x
#[1] 1 2 3
y
#[1] TRUE FALSE TRUE
x[y]
#[1] 1 3
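As a small sketch of why this works, the logical index can be turned into positions with which(). The second half assumes, as a guess based on the "30000x1" description, that the question's objects might be one-column matrices rather than plain vectors; m1 and m2 are hypothetical names.

```r
v1 <- c(123, 345, 667)
v2 <- c(TRUE, FALSE, TRUE)

which(v2)  # positions of the TRUE values
# [1] 1 3
identical(v1[v2], v1[which(v2)])
# [1] TRUE

# Assumption: the data may be stored as one-column matrices
m1 <- matrix(v1, ncol = 1)
m2 <- matrix(v2, ncol = 1)
m1[m2[, 1], 1]  # pick rows using the logical column, keep column 1
# [1] 123 667
```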

compare two variables of different length using R

I need to compare the values stored in two variables. The variable sizes are different. For example
x = c(1,2,3,4,5,6,7,8,9,10)
and
y = c(2,6,11,12,13)
I need an answer that 2 and 6 are present in both variables. I need this to be done in R. Can anyone help, please?
The intersect function avoids the need for @mdsumner's simple indexing:
> x = c(1,2,3,4,5,6,7,8,9,10)
> y = c(2,6,11,12,13)
> intersect(x,y)
[1] 2 6
Whole bunch of set operators to be found here: help(intersect)
Posted after the added requirement that some sort of tolerance be allowed: You could sequentially check one set of values against all the others in the second set or you could do it all at once with outer(). Once you have the outer result as a logical matrix there remains the task of referring back to the values, but expand.grid seems capable of handling that:
expand.grid(x,y)[outer(x,y, FUN=function(x,y) abs(x-y) < 0.01), ]
#   Var1 Var2
#2     2    2
#16    6    6
After posting, it occurred to me that your values were sorted. It turns out that this extraction from expand.grid() survives passing unsorted vectors.
x[x %in% y]
[1] 2 6
Or, more explicitly:
x[match(x, y, nomatch = 0) > 0]
[1] 2 6
Note that you actually chain together the results of the match with simple indexing into the input values.
See ?match.
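The two approaches agree on the question's data, but they can differ when values repeat. The sketch below uses made-up vectors with a duplicated 2 to show the difference: intersect() follows set semantics, while indexing with %in% keeps every matching element of x.

```r
# Illustrative vectors with a repeated value
x <- c(2, 2, 6, 7)
y <- c(6, 2, 11)

intersect(x, y)      # each common value once
# [1] 2 6
x[x %in% y]          # every matching element of x, duplicates included
# [1] 2 2 6
unique(x[x %in% y])  # same values as intersect(x, y) here
# [1] 2 6
```

Which one is appropriate depends on whether repeated values in x should be counted or collapsed.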

Resources