How do you read the %in% operator in plain English? - r

I'm struggling with how to read the %in% operator in R in "plain English" terms. I've seen multiple examples of code for its use, but not a clear explanation of how to read it.
For example, I've found terminology for the pipe operator %>% that suggests to read it as "and then." I'm looking for a similar translation for the %in% operator.
In the book R for Data Science in chapter 5 titled "Data Transformation" there is an example from the flights data set that reads as follows:
The following code finds all flights that departed in November or December:
filter(flights, month == 11 | month == 12)
A useful short-hand for this problem is x %in% y. This will select every row where x is one of the values in y. We could use it to rewrite the code above:
nov_dec <- filter(flights, month %in% c(11, 12))
When I read "a useful short-hand for this problem is x %in% y," and then look at the nov_dec example, it seems like this is to be understood as "select every row where month (x) is one of the values in c(11,12) (y)," which doesn't make sense to me.
However my brain wants to read it as something like, "Look for 11 and 12 in the month column." In this example, it seems like x should be the values of 11 and 12 and the %in% operator is checking if those values are in y which would be the month column. My brain is reading this example from right to left.
However, all of the code examples I've found seem to indicate that this x %in% y should be read left to right and not right to left.
Can anyone help me read the %in% operator in layman's terms please? Examples would be appreciated.

If I wanted to really "spell it out", I'd read x %in% y as "for each x value, is it in y?"
nov_dec <- filter(flights, month %in% c(11, 12))
When I read "A useful short-hand for this problem is x %in% y," and then look at the nov_dec example, it seems like this is to be understood as "select every row where month ('x') is one of the values in c(11,12) ('y')," which doesn't make sense to me.
However my brain wants to read it as something like, "Look for 11 and 12 in the month column." In this example, it seems like 'x' should be the values of 11 and 12 and the %in% operator is checking if those values are in 'y' which would be the month column. My brain is reading this example from right to left.
The left-vs-right thing is all about what you're asking about. x %in% y is asking (using my verbose phrasing above), "for each x value, is it in y?" With that phrasing, we know to expect an answer (TRUE or FALSE) for every item in x.
This might actually get clearer if we extend it a little more. Two common related questions are "are any x values in y?" and "are all the x values in y?" These can be coded naturally as
any(x %in% y) # Are any x values in y?
all(x %in% y) # Are all x values in y?
To me, at least, those seem quite natural, and they use the left-to-right reading. It would get convoluted to try to use a right-to-left reading here, something like "look for the y values in x, did you cover every x value with your matches?"
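To make the "one answer per x value" reading concrete, here is a small sketch with made-up vectors (x and y below are illustrative values, not the flights data). Swapping the operands asks a different question, so the result has a different length:

```r
# Illustrative values only: x plays the role of a "month" column,
# y is the set of values we are looking for.
x <- c(11, 3, 12, 7)
y <- c(11, 12)

# One TRUE/FALSE per element of x: "for each x value, is it in y?"
x_in_y <- x %in% y   # TRUE FALSE TRUE FALSE

# Reversed: one TRUE/FALSE per element of y: "for each y value, is it in x?"
y_in_x <- y %in% x   # TRUE TRUE

any(x %in% y)        # TRUE: at least one x value is in y
all(x %in% y)        # FALSE: 3 and 7 are not in y
```

Because filter() needs one TRUE/FALSE per row, the column has to go on the left-hand side.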

That's actually a really good question. Think about the literal nature here:
Is the answer yes?
Is the answer no?
Is the answer yes or no?
Is the answer yes and no?
When you use %in%, it is in lieu of an 'or' statement: are any of these in here?
answers = data.frame(ans = sample(rep(c("yes","no","maybe"),
each = 3, times = 2)),
ind = 1:9)
# yes or no?
answers[answers$ans == "yes"|answers$ans == "no",]
# ans ind
# 1 yes 1
# 2 yes 2
# 4 no 4
# 5 yes 5
# 6 no 6
# 8 no 8
# 10 yes 1
# 12 no 3
# 13 no 4
# 16 yes 7
# 17 no 8
# 18 yes 9
# now about %in%
answers[answers$ans %in% c("yes","no"),]
# ans ind
# 1 yes 1
# 2 yes 2
# 4 no 4
# 5 yes 5
# 6 no 6
# 8 no 8
# 10 yes 1
# 12 no 3
# 13 no 4
# 16 yes 7
# 17 no 8
# 18 yes 9
# yes and no?
answers[answers$ans == c("yes","no"),]
# ans ind
# 1 yes 1
# 4 no 4
# 5 yes 5
# 6 no 6
# 8 no 8
# 12 no 3
# what happened here? were you expecting that?
# this checked the first row for yes,
# the second row for no,
# the third row for yes,
# the fourth row for no and so on...
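Since the example above uses sample() without a seed, the exact rows will differ from run to run. A smaller, deterministic sketch of the same recycling effect (the vector ans is made up for illustration):

```r
# Illustrative vector; its length (6) is a multiple of 2, so == recycles silently.
ans <- c("yes", "no", "no", "yes", "maybe", "no")

# == compares position by position against yes, no, yes, no, yes, no
pairwise <- ans == c("yes", "no")      # TRUE TRUE FALSE FALSE FALSE TRUE

# %in% asks, for each element of ans, whether it is "yes" or "no"
membership <- ans %in% c("yes", "no")  # TRUE TRUE TRUE TRUE FALSE TRUE

# The explicit "or" version agrees with %in% here
explicit_or <- ans == "yes" | ans == "no"
```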

I think your disconnect is understanding how to apply "in" to a vector. You wrote that you want to read it as "Look for 11 and 12 in the month column." You can indeed think of it that way. Your example was:
nov_dec <- filter(flights, month %in% c(11, 12))
And that could be expressed in plain English as:
Give me all the flights where one of the values in c(11, 12) is in the month column
But we could also say that 11 and 12 are "in" the vector c(11, 12). That's what the left-to-right reading would be:
Give me all the flights whose month is in the vector c(11, 12).
Or, expressed slightly differently and more verbosely:
Give me all the flights whose month is equal to one of the values in the vector c(11, 12)
This is conceptually similar to using a bunch of | operators in a row (month == 11 | month == 12), but it's best not to think of those as exactly equivalent. Instead of explicitly comparing x to every value in y, you're asking the question "is x equal to one of the values in y?" That's different in the same way that saying "please turn off the lights" is different from saying "please walk over to that plate on the wall and pull the little stick on it downwards." It expresses what you want instead of how to figure it out, which makes your code more readable. Code is read more often than it's written, so that matters.
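Now I'm getting way out of my area, and I don't know the details of what R actually does here, but the underlying method of answering the question might also be different: rather than comparing x against each value of y in turn, R is free to use a faster lookup. For what it's worth, the help page ?match documents %in% as match(x, table, nomatch = 0) > 0.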
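One concrete way the two spellings differ is missing values. The sketch below uses a made-up month vector (not the flights data) and compares %in%, the chained | version, and the match() form given on the ?match help page:

```r
# Illustrative vector with a missing value
month <- c(11, NA, 3, 12)

via_in    <- month %in% c(11, 12)                 # TRUE FALSE FALSE TRUE
via_or    <- month == 11 | month == 12            # TRUE NA    FALSE TRUE
via_match <- match(month, c(11, 12), nomatch = 0L) > 0L

# %in% never returns NA; the == version propagates it
identical(via_in, via_match)  # TRUE
is.na(via_or[2])              # TRUE
```

Inside dplyr::filter() both versions drop the NA row, since filter() keeps only rows where the condition is TRUE, but with base subsetting such as x[cond] an NA in the condition produces an NA element in the result.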

Related

Is there an r function for counting how many data points fit into certain categories?

I am working on a data set involving the details from various fatal police shootings from Jan 2015 till now. The details involve race, presence of body camera, age and more.
From this data set, I would like to count how many situations involved a body camera (in the "body_camera" column it would read as "TRUE") and at the same time involved a black victim (which reads as "B"). I'd also like to do the same with victims who were white ("W").
If anyone could help with this it would be much appreciated, I'm struggling.
As you did not provide data, I chose to use the mtcars dataset, which is built into R. Just like your dataset, each row represents an individual (in this case, different cars). It also has variables with the same categorical properties as your "bodycam" and "ethnicity" variables.
They are called "vs" (engine shape), which is either 0 (= V-shaped) or 1 (= straight), and "cyl" (cylinders), which takes the values 4, 6 and 8. For more detail, type ?mtcars.
To count the frequency of something, there is the table function.
df <- mtcars
table(df$vs, df$cyl)
which yields these results:
> table(df$vs, df$cyl)
     4  6  8
  0  1  3 14
  1 10  4  0
Now to get the amount of cars that have a straight engine AND 4 cylinders, just read the table. The code equivalent to reading the table works by indexing:
> table(df$vs, df$cyl)[2,1]
[1] 10
Here is another approach. You can also translate the criteria (straight engine AND 4 cylinders) into code and specify a conditional statement using the & operator.
df$vs == 1 & df$cyl == 4
This yields a long vector of TRUE and FALSE values, one per row, which is hard to read by eye.
To make this better, we can use which, which returns the rows at which the condition is true.
> which(df$vs == 1 & df$cyl == 4)
[1] 3 8 9 18 19 20 21 26 28 32
Now we just have to count how many rows there are for which the condition is TRUE.
To count the overall number of items, there is length.
table does not get us anywhere for this counting task, as it would try to count the frequency of each item, and each row index occurs exactly once.
> length(which(df$vs == 1 & df$cyl == 4))
[1] 10
This corresponds to what we can read from the table for a straight engine and 4 cylinders.
Hope this helps!
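A shorter route to the same count is sum() on the logical vector, since TRUE counts as 1 and FALSE as 0. The second half of the sketch applies the idea to a tiny made-up data frame; the name shootings and its six rows are assumptions for illustration, and only the column names body_camera and race follow the question:

```r
df <- mtcars

# Same count as length(which(...)): straight engine AND 4 cylinders
n_straight_4cyl <- sum(df$vs == 1 & df$cyl == 4)   # 10, matching the table

# Hypothetical stand-in for the real data set
shootings <- data.frame(
  race        = c("B", "W", "B", "H", "W", "B"),
  body_camera = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

n_black_cam <- sum(shootings$body_camera & shootings$race == "B")  # 2
n_white_cam <- sum(shootings$body_camera & shootings$race == "W")  # 1

# Or get every race/body-camera combination at once
table(shootings$race, shootings$body_camera)
```

If the real columns contain missing values, sum(..., na.rm = TRUE) would be needed to avoid an NA result.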

How can I troubleshoot the delete row function

I am attempting to delete a row like this:
data <- data[-1645,]
However, after running the code, the row is still there. I can tell because there is an outlier in that row that shows up on all my graphs, and when I view the data I can sort a column to easily find the offending outlier. I have had no trouble deleting rows in the past. Has anyone run into anything similar? I do understand the limitations of outlier removal and I don't typically remove outliers; however, for a number of reasons I would like to see what the data look like without this one (in this case, all other values in the response variable are between -1 and 0, and in this row the value is 10^4).
You really need to provide more information, but there are several ways you can troubleshoot the problem. The first one is to print out the line you are removing:
data[1645, ]
Is that the outlier? You did not tell us how you identified the outlier. If lines have been removed from the data frame, the row names are not changed but the index values are changed, e.g.
set.seed(42)
x <- sample.int(25)
y <- sample.int(25)
data <- data.frame(x, y)
head(data)
# x y
# 1 17 2
# 2 5 8
# 3 1 3
# 4 10 1
# 5 4 10
# 6 18 11
data <- data[-c(5, 10, 15, 20, 25), ]
head(data)
# x y
# 1 17 2
# 2 5 8
# 3 1 3
# 4 10 1
# 6 18 11
# 7 25 15
data[6, ]
# x y
# 7 25 15
data["6", ]
# x y
# 6 18 11
Notice that the 6th row of the data has a row name of "7" but the row with name "6" is the 5th row in the data frame because we deleted the 5th row. The which function will give you the index value, but if you identified the outlier by looking at the printout, you got the row name and that may be different from the index. If we want to remove values in x greater than 24, here is one way to do that:
data[data$x<25, ]
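If all you have is the label shown in the printout, you can translate it into a position, or drop the row by its name directly. A minimal sketch on a made-up data frame (d and its values are illustrative only):

```r
# Illustrative data; after removing row 2 the names and positions diverge
d <- data.frame(x = c(10, 20, 30, 40, 50))
d <- d[-2, , drop = FALSE]
rownames(d)                          # "1" "3" "4" "5"

# The row labelled "4" now sits at position 3
pos <- which(rownames(d) == "4")     # 3

# Drop that row by its name rather than by a guessed position
d_clean <- d[rownames(d) != "4", , drop = FALSE]
d_clean$x                            # 10 30 50
```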
After playing around with the data, I think the best explanation is that the indexing is off. This is in line with what dcarlson was saying: the code could be removing the 1,645th row, which just isn't labelled as such. I think the best solution is to use subset:
data <- subset(data, Yield.Decline < 100)
This is a more robust solution than removing a row by its position, because the line can be accidentally run multiple times without erroneously removing additional rows.

Looping through items on a list in R

This may be a simple question, but I'm fairly new to R.
What I want to do is perform some kind of addition on the indexes of a list, but once the index passes the maximum value it should go back to the first value in that list and start over from there.
For example:
x <-2
data <- c(0,1,2,3,4,5,6,7,8,9,10,11)
data[x]
1
data[x+12]
1
data[x+13]
3
or something functionally equivalent. In the end I want to be able to do something like
v=6
x=8
y=9
z=12
values <- c(v,x,y,z)
data <- c(0,1,2,3,4,5,6,7,8,9,10,11)
set <- c(data[values[1]],data[values[2]], data[values[3]],data[values[4]])
set
5 7 8 11
values <- values + 8
set
1 3 4 7
I've tried some things with addition and subtraction using the length of my list, but it does not work well on the lower numbers.
I hope this was a clear enough explanation.
Thanks in advance!
We don't need a loop here, because a vector can be indexed with another vector of any length:
data[values]
#[1] 5 7 8 11
NOTE: Both objects are vectors, not lists.
If we need to wrap the index around once it passes the end of data:
values <- values + 8
data[ifelse(values > length(data), values - length(data), values)]
#[1] 1 3 4 7
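A more general way to wrap an index is modular arithmetic, which keeps working when the offset is larger than one full pass through the vector. This is only a sketch; the helper name wrap_index is made up for illustration:

```r
data <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)

# Hypothetical helper: map any positive integer onto 1..n, wrapping around.
# The -1 / +1 shifts are needed because R indexes from 1, not 0.
wrap_index <- function(i, n) (i - 1) %% n + 1

values <- c(6, 8, 9, 12)
first_set  <- data[wrap_index(values, length(data))]       # 5 7 8 11
second_set <- data[wrap_index(values + 8, length(data))]   # 1 3 4 7

# An offset of two full passes lands back on the starting elements
same_again <- data[wrap_index(values + 24, length(data))]  # 5 7 8 11
```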

R help converting non numeric column to numeric

I'm trying to help my friend, Director of Sales, make sense of his logged call data. There is one column in particular in which he is interested, "Disposition". This column has string values and I'm trying to convert them to numeric values (i.e. "Not Answered" converted to 1, "Answered" converted to 2, etc.) and remove any row with no values entered. I've created data frames, used as.numeric, created and deleted columns/rows, etc. to no avail. I'm just trying to run simple R code to give him some insight. Any and all help is much appreciated. Thanks in advance!
P.S. I'm unsure as to whether I should provide some code due to the fact that there is a lot of delicate information (personal phone numbers and emails).
First off: You should always provide representative sample data; if your data is sensitive in nature, provide mock-up data.
That aside, to recode a character vector as numeric you could convert to factor and then use as.numeric. For example:
# Sample data
column <- c("Not Answered", "Answered", "Something else", "Others")
# Convert character vector to factor
column <- factor(column, levels = as.character(unique(column)))
# Convert to numeric
as.numeric(column)
#[1] 1 2 3 4
The numbering can be adjusted by changing the order of the factor levels.
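If you would rather spell out the mapping than rely on the order of factor levels, a named vector can serve as a lookup table. This is a sketch with made-up values; the "Voicemail" category and the particular codes are assumptions for illustration:

```r
# Illustrative input, including a missing entry
disposition <- c("Not Answered", "Answered", NA, "Answered", "Voicemail")

# Explicit mapping from label to code (names are the labels)
codes <- c("Not Answered" = 1, "Answered" = 2, "Voicemail" = 3)

# Index the lookup vector by label; missing labels come back as NA
numeric_codes <- unname(codes[disposition])   # 1 2 NA 2 3
```

A label that is not in codes also maps to NA, which makes typos in the data easy to spot.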
Alternatively, you can create a new column and fill it with the numeric values using an ifelse statement. To illustrate, let's assume this is your dataframe:
df <- data.frame(
Disposition = c(rep(c("answer", "no answer", "whatever", NA),3)),
Anything = c(rnorm(12))
)
df
Disposition Anything
1 answer 2.54721951
2 no answer 1.07409803
3 whatever 0.60482744
4 <NA> 2.08405038
5 answer 0.31799860
6 no answer -1.17558239
7 whatever 0.94206106
8 <NA> 0.45355501
9 answer 0.01787330
10 no answer -0.07629330
11 whatever 0.83109679
12 <NA> -0.06937357
Now you define a new column, say df$Analysis, and assign to it numbers based on the information in df$Disposition:
df$Analysis <- ifelse(df$Disposition=="no answer", 1,
ifelse(df$Disposition=="answer", 2, 3))
df
Disposition Anything Analysis
1 answer 2.54721951 2
2 no answer 1.07409803 1
3 whatever 0.60482744 3
4 <NA> 2.08405038 NA
5 answer 0.31799860 2
6 no answer -1.17558239 1
7 whatever 0.94206106 3
8 <NA> 0.45355501 NA
9 answer 0.01787330 2
10 no answer -0.07629330 1
11 whatever 0.83109679 3
12 <NA> -0.06937357 NA
The advantage of this method is that you keep the original information unchanged. If you now want to remove NA values from the dataframe, use na.omit. NB: this removes not only the rows with NA in df$Disposition but any row with NA in any column:
df_clean <- na.omit(df)
df_clean
Disposition Anything Analysis
1 answer 2.5472195 2
2 no answer 1.0740980 1
3 whatever 0.6048274 3
5 answer 0.3179986 2
6 no answer -1.1755824 1
7 whatever 0.9420611 3
9 answer 0.0178733 2
10 no answer -0.0762933 1
11 whatever 0.8310968 3
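If only the missing Disposition values should trigger removal, you can filter on that one column instead. A minimal sketch on a made-up data frame (the Notes column is hypothetical and only there to show the difference):

```r
# Illustrative data: one NA in Disposition, one NA in another column
df <- data.frame(
  Disposition = c("answer", NA, "no answer", "answer"),
  Notes       = c("a", "b", NA, "d"),
  stringsAsFactors = FALSE
)

# Drop rows only where Disposition itself is missing
df_disp <- df[!is.na(df$Disposition), ]
nrow(df_disp)       # 3

# na.omit() also drops the row whose Notes value is NA
nrow(na.omit(df))   # 2
```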

Switch-like function for questionnaire grading

I've done a lot of serious PHP/JS coding recently, and I kind of lost my R muscle. While this problem can be easily tackled within PHP/JS, what is the most efficient way of solving it in R? I have to grade a questionnaire, and I have the following scenario:
raw t
5 0
6 2
7-9 3
10-12 4
15-20 5
If x equals, or is within, a range given in raw, the value in the corresponding row of t should be returned. Of course, this can be done with a for loop or switch, but just imagine a very lengthy set of value ranges in raw. How would you tackle this one?
We seem to be missing a part of the example, because there is no mention of "x".
dat <- read.table(textConnection("raw t
5 0
6 2
7-9 3
10-12 4
15-20 5"), header=TRUE, stringsAsFactors=FALSE)
dat$bot <- as.numeric( sapply( sapply(dat$raw, strsplit, "-"), "[", 1 ))
get.t <- function(x) findInterval(x, dat$bot)
get.t(8)
#[1] 3
dat$t[get.t(6)]
#[1] 2
dat$t[get.t(5)]
#[1] 0
I would simply use an indexing scheme kind of like what Corbin alluded to, but since he didn't provide an example, here's a simple one:
m <- cbind(c(5:12,15:20),
rep(c(0,2,3,4,5),times = c(1,1,3,3,6)))
m[m[,1] == 11,2]
[1] 4
Note: this is very similar to Simone's answer, as I started typing it a while back, although it has a note at the end. The indexing approach I give is essentially Simone's answer.
There will have to be a loop involved somewhere.
The pseudo code of what I would do is something like:
score = blah
for each raw => t
break raw into rMin -> rMax
if(rMin <= score and rMax >= score)
return t
It avoids having to loop over each number between rMin and rMax (which is what I'm assuming you meant), but without some kind of indexing, that is the best you're going to get.
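In R, the pseudocode above could be sketched roughly as follows. The data frame ranges with its lo and hi columns and the function name grade are assumptions made for illustration; the bounds are typed in by hand from the table in the question:

```r
# Lower bound, upper bound and grade for each row of the question's table
ranges <- data.frame(
  lo = c(5, 6, 7, 10, 15),
  hi = c(5, 6, 9, 12, 20),
  t  = c(0, 2, 3, 4, 5)
)

# Hypothetical helper: return t for each score, or NA if no range matches
grade <- function(x, ranges) {
  vapply(x, function(score) {
    hit <- which(score >= ranges$lo & score <= ranges$hi)
    if (length(hit) == 1) ranges$t[hit] else NA_real_
  }, numeric(1))
}

graded <- grade(c(5, 8, 11, 13, 20), ranges)   # 0 3 4 NA 5
```

The comparison inside which() is vectorised over the ranges, so the only explicit iteration is over the scores. A score of 13 falls into the gap between 10-12 and 15-20 and comes back as NA.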
Note: if you have a ton of calls to this, and indexing would actually be worth your while, the easiest type of indexing would just be a hash map of score -> t entries.
Basically you would parse your example data into something like:
index[5] = 0
index[6] = 2
index[7] = 3
index[8] = 3
index[9] = 3
You would need to carefully weigh if building the index would be more time consuming than just looping over the ranges.
Note: the indexing approach is actually what Simone said.
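For completeness, here is one way such an index could be built in R from the raw strings, using a named vector as the score -> t map. The variable names are made up for this sketch:

```r
raw      <- c("5", "6", "7-9", "10-12", "15-20")
t_values <- c(0, 2, 3, 4, 5)

# Split "7-9" into c(7, 9); a single value like "5" stays c(5)
bounds <- lapply(strsplit(raw, "-", fixed = TRUE), as.numeric)

# Expand every range into the individual scores it covers
scores <- lapply(bounds, function(b) seq(b[1], b[length(b)]))

# Named vector: names are scores (as strings), values are the grades
index <- setNames(rep(t_values, lengths(scores)), unlist(scores))

looked_up <- unname(index[as.character(c(8, 11, 20))])   # 3 4 5
missing   <- unname(index["13"])                         # NA: 13 is in no range
```

Building the index costs one pass over every covered score, so it only pays off when there are many lookups.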
