which() conveniently gives all the indices which are TRUE in x. What is a simple way to get all the indices of x which are FALSE?
Sample data
x <- c(TRUE, TRUE, FALSE, FALSE)
x
[1] TRUE TRUE FALSE FALSE
which() gives the indices where the value is TRUE:
which(x)
[1] 1 2
To get the indices of the FALSE values instead, negate x first:
which(!x)
[1] 3 4
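Note that !which(x) is not the same thing: it negates the indices themselves (any nonzero number counts as TRUE), so it returns one FALSE per TRUE element of x rather than any positions: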
!which(x)
[1] FALSE FALSE
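One caveat, shown as a small sketch with a made-up vector y that contains an NA: which() skips missing values, so which(y) and which(!y) together need not cover every position.

```r
# Hypothetical example vector with a missing value
y <- c(TRUE, NA, FALSE, FALSE)

idx_true  <- which(y)         # positions that are TRUE; the NA is skipped
idx_false <- which(!y)        # positions that are FALSE; the NA is skipped
idx_na    <- which(is.na(y))  # positions that are neither

idx_true
idx_false
idx_na
```

If x can contain NA, it is probably worth deciding explicitly how those positions should be treated.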
Related
I came across a question like this: "retrieve all values less than or equal to 5 from a vector of sequence 1 through 9 having a length of 9". Now based on my knowledge so far, I did trial & error, then I finally executed the following code:
vec <- c(1:9) ## assigns to vec
lessThanOrEqualTo5 <- vec[vec <= 5]
lessThanOrEqualTo5
[1] 1 2 3 4 5
I know that the code vec <= 5 returns the following logical vector:
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
So my question is: how does R use these logical values to return the elements that satisfy the condition, given that the code effectively becomes vec[c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)]?
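As a minimal sketch of what happens: a logical index keeps the elements at positions where the index is TRUE and drops those where it is FALSE, which gives the same result as indexing by the positions that which() reports. The names keep, by_logical and by_position are only illustrative.

```r
vec <- 1:9
keep <- vec <= 5                  # logical vector, one entry per element of vec

by_logical  <- vec[keep]          # keeps the elements where keep is TRUE
by_position <- vec[which(keep)]   # same elements, selected by explicit position

by_logical
by_position
```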
I wrote these lines of code to add a new column to a data frame, holding the case-insensitive match of each label against the elements of a list of strings.
However, the resulting column is correct only for the first element of the list ('seed' in this case) and not for any other match. I am not sure what is wrong in the for loop.
Here is a sample data frame to check the results against.
input.strings <- c('seed', 'fertilizer', 'fertiliser', 'loan', 'interest', 'feed', 'insurance')
polic = data.frame(policy_label=c('seed supply','energy subsidy','fertilizer distribution','loan guarantee','Interest waiver','feed purchase'))
polic$policy_class <- sapply(polic$policy_label, function(x){
  for (i in input.strings){
    if (grepl(i, tolower(x))){
      return(i)
    }
    else{
      return("others")
    }
  }
})
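A likely cause, offered as a sketch rather than the definitive fix: the else branch returns "others" as soon as the first pattern ('seed') fails to match, so the loop never reaches the remaining patterns. Moving the fallback after the loop lets every pattern be tried. The helper name classify_label is made up for illustration.

```r
input.strings <- c('seed', 'fertilizer', 'fertiliser', 'loan',
                   'interest', 'feed', 'insurance')
polic <- data.frame(policy_label = c('seed supply', 'energy subsidy',
                                     'fertilizer distribution', 'loan guarantee',
                                     'Interest waiver', 'feed purchase'))

# classify_label is a hypothetical helper, not part of the original code
classify_label <- function(x) {
  for (i in input.strings) {
    if (grepl(i, tolower(x))) {
      return(i)    # first matching pattern wins
    }
  }
  "others"         # reached only when no pattern matched
}

polic$policy_class <- vapply(polic$policy_label, classify_label,
                             character(1), USE.NAMES = FALSE)
polic
```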
base R alternative
Here's a somewhat faster and more-direct approach using sapply (and no for loops), relying on the fact that grepl can be vectorized on x=. (It is not vectorized on pattern=, requiring that to be length 1, which is one reason why we need the sapply at all.)
matches <- sapply(input.strings, grepl, x = polic$policy_label)
matches
# seed fertilizer fertiliser loan interest feed insurance
# [1,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [3,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE
# [4,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [6,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE
Because we want to assign "others" to everything without a match (and because we will need at least one TRUE in each row for the next step), we add an others column:
matches <- cbind(matches, others = rowSums(matches) == 0)
matches
# seed fertilizer fertiliser loan interest feed insurance others
# [1,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
# [3,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# [4,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
# [6,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
From here, we can find the names associated with the true values and assign them (optionally collapsed with commas) into polic:
polic$policy_class <- apply(matches, 1, function(z) toString(colnames(matches)[z]))
polic
# policy_label policy_class
# 1 seed supply seed
# 2 energy subsidy others
# 3 fertilizer distribution fertilizer
# 4 loan guarantee loan
# 5 Interest waiver others
# 6 feed purchase feed
FYI, the reason I used toString is that I did not want to assume that there would always be no more than one match; that is, if two input.strings matched one policy_label for whatever reason, then toString will combine them into one string, e.g., "seed, feed" for multi-match policies.
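Since the question asked for case-insensitive matching, one possible variation (a sketch, not part of the answer above) passes ignore.case = TRUE through sapply to grepl, so that 'Interest waiver' is classified as interest. The vector name labels is an assumption standing in for polic$policy_label.

```r
input.strings <- c('seed', 'fertilizer', 'fertiliser', 'loan',
                   'interest', 'feed', 'insurance')
# Same labels as polic$policy_label in the question
labels <- c('seed supply', 'energy subsidy', 'fertilizer distribution',
            'loan guarantee', 'Interest waiver', 'feed purchase')

# One column per pattern; extra arguments are forwarded to grepl
matches <- sapply(input.strings, grepl, x = labels, ignore.case = TRUE)
matches <- cbind(matches, others = rowSums(matches) == 0)

policy_class <- apply(matches, 1, function(z) toString(colnames(matches)[z]))
policy_class
```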
fuzzyjoin alternative
If you're familiar with merges/joins (see "What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?"), then this should seem familiar. If not, the concept of joining data in this way can be transformative for data munging and cleaning.
library(fuzzyjoin)
out <- regex_left_join(
polic, data.frame(policy_class = input.strings),
by = c("policy_label" = "policy_class"))
out
# policy_label policy_class
# 1 seed supply seed
# 2 energy subsidy <NA>
# 3 fertilizer distribution fertilizer
# 4 loan guarantee loan
# 5 Interest waiver <NA>
# 6 feed purchase feed
### clean up the NAs for "others"
out$policy_class[is.na(out$policy_class)] <- "others"
In contrast to the base-R variant above, there is no safe-guard here (yet!) to handle when multiple input.strings match one policy_label; when that happens, that row with a match will be duplicated, so you'd see (e.g.) seed supply and all other columns on that row twice. This can easily be mitigated given some effort.
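One way the duplication could be mitigated, sketched in base R on a hand-built stand-in for the join result (the data frame below is hypothetical, not real regex_left_join output): collapse the duplicated labels with aggregate and toString. It assumes the NAs have already been replaced with "others", because the formula interface of aggregate drops rows containing NA by default.

```r
# Hypothetical join result in which one label matched two patterns,
# so its row appears twice
out <- data.frame(
  policy_label = c("seed feed supply", "seed feed supply", "loan guarantee"),
  policy_class = c("seed", "feed", "loan"),
  stringsAsFactors = FALSE
)

# One row per label, with all matched classes joined by ", "
collapsed <- aggregate(policy_class ~ policy_label, data = out, FUN = toString)
collapsed
```

Note that aggregate sorts the result by the grouping column, so the original row order is not preserved.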
My question concerns the practical difference between the == and %in% operators in R.
I have run into an instance at work where filtering with either operator gives different results (e.g. one results in 800 rows, and the other in 1200). I have run into this problem in the past and am able to validate in a way that ensures I get the results I desire. However, I am still stumped regarding how they are different.
Can someone please shed some light on how these operators are different?
%in% is value matching. It is built on match(), which "returns a vector of the positions of (first) matches of its first argument in its second", while %in% itself returns a logical vector indicating whether there is a match for each element of its left operand (see help('%in%')). This means you can compare vectors of different lengths to see whether elements of one vector match at least one element of another. The length of the output is equal to the length of the vector being compared (the first one).
1:2 %in% rep(1:2,5)
#[1] TRUE TRUE
rep(1:2,5) %in% 1:2
#[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#Note that the output is longer in the second case
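To make the link to match() concrete, here is a short sketch; the variable names x, tbl, via_in and via_match are illustrative. The help page describes %in% as match() with nomatch = 0, turned into a logical by testing for a positive position.

```r
x   <- 1:10
tbl <- 3:7

match(x, tbl)    # positions in tbl, NA where there is no match

via_in    <- x %in% tbl
via_match <- match(x, tbl, nomatch = 0L) > 0L   # same answer, spelled out

via_in
via_match
```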
== is a comparison operator that tests whether two things are equal. If the vectors are of equal length, elements will be compared element-wise. If not, the shorter vector will be recycled. The length of the output will be equal to the length of the longer vector.
1:2 == rep(1:2,5)
#[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
rep(1:2,5) == 1:2
#[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
1:10 %in% 3:7
#[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
#is same as
sapply(1:10, function(a) any(a == 3:7))
#[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
NOTE: When you need a single TRUE/FALSE answer about whether two whole objects are equal, try to use identical or all.equal instead of ==.
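A small sketch of why that note matters, using floating-point numbers (the variable names are illustrative): == compares exactly and element by element, all.equal allows a small numerical tolerance, and identical returns one logical for the whole object.

```r
a <- 0.1 + 0.2
b <- 0.3

exact    <- a == b                       # exact comparison of doubles
tolerant <- isTRUE(all.equal(a, b))      # equal within a small tolerance
whole    <- identical(c(1, 2), c(1, 2))  # one logical for the whole object

exact
tolerant
whole
```

all.equal returns a description of the difference rather than FALSE when the objects differ, which is why it is wrapped in isTRUE here.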
Given two vectors, x and y, the code x == y will compare the first element of x with the first element of y, then the second element of x with the second element of y, and so on. When using x == y, the lengths of x and y should be the same; if they are not, R recycles the shorter vector. Here, compare means "is equal to" and therefore the output is a logical vector with the same length as x (or y).
In the code x %in% y, the first element of x is compared to all elements in y, then the second element of x is compared to all elements of y, and so on. Here, compare means "is the current element of x equal to any value in y" and therefore the output is a logical vector that has the same length as x and not (necessarily) y.
Here is a code snippet illustrating the difference. Note that x and y have the same lengths but the elements of y are the elements of x in different order. Note too in the final examples that x is a 3-element vector being compared to the letters vector, which contains 26 elements.
> x <- c('a','b','c')
> y <- c('c', 'b', 'a')
> x == y
[1] FALSE TRUE FALSE
> x %in% y
[1] TRUE TRUE TRUE
> x %in% letters
[1] TRUE TRUE TRUE
> letters %in% x
[1] TRUE TRUE TRUE FALSE FALSE FALSE
[7] FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE
[19] FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE
Try it for objects of different length.
ac <- c("a", "b", "c")
ae <- c("a", "b", "c", "d", "e")
ac %in% ae
[1] TRUE TRUE TRUE
ac == ae
[1] TRUE TRUE TRUE FALSE FALSE
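It becomes clear that %in% checks whether each element of one object is contained in the other object, whereas == is a comparison operator that checks equality element by element, recycling the shorter vector (R also issues a warning for ac == ae, because 5 is not a multiple of 3).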
== checks whether each element of one vector is equal to the element at the same position in another vector. Ideally the two vectors have the same size; otherwise the results may be unexpected, because R recycles the shorter vector (silently if the longer length is a multiple of the shorter one). For instance
c(1,2,3) == c(1,3,2)
[1] TRUE FALSE FALSE
or
c(1,2) == c(1,3,2)
[1] TRUE FALSE FALSE
Warning message:
In c(1, 2) == c(1, 3, 2) :
longer object length is not a multiple of shorter object length
%in%, on the other hand, checks which elements of vector 1 are included in vector 2
c(1,2,3) %in% c(1,3,2)
[1] TRUE TRUE TRUE
or
c(1,2) %in% c(1,3,2)
[1] TRUE TRUE
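This is probably what lies behind the differing row counts in the question. As a sketch with made-up data (df and keep are hypothetical), filtering on two values with == silently recycles them position by position, while %in% tests membership for each row.

```r
# Hypothetical data: six rows, three distinct groups
df <- data.frame(g = c("a", "b", "a", "b", "c", "a"),
                 stringsAsFactors = FALSE)
keep <- c("a", "b")

# keep is recycled to a, b, a, b, a, b and compared position by position
rows_eq <- df[df$g == keep, , drop = FALSE]

# each row is kept if its value appears anywhere in keep
rows_in <- df[df$g %in% keep, , drop = FALSE]

nrow(rows_eq)
nrow(rows_in)
```

No warning is raised here because 6 is a multiple of 2, so the == version drops the last "a" row without any message.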
I am trying to subset rows in R depending on the values in column y, as shown in the following:
I have the data set "data" which is like this:
data <- data.frame(y = c(0,0,2000,1500,20,77,88),
a = "bla", b = "bla")
And would end up with this:
I have this R code:
data <- arrange(subset(data, y != 0 & y < 1000 & y !=77 & [...]), desc(y))
print(head(data, n =100))
Which works.
However I would like to collect the values to exclude in a list as:
[0, 1000, 77]
And somehow loop through this, with the lowest possible running time instead of hardcoding them directly in the formula. Any ideas?
The list should only contain "!=" operations:
[0, 77]
and the "<" should remain in the formula or in another list.
I'm going to answer your original question because it's more interesting. I hope you won't mind.
Imagine you had values and operators to apply to your data:
my.operators <- c("!=","<","!=")
my.values <- c(0,1000,77)
You can use Map from base R to apply a function to two vectors. Here I'll use get so we can obtain the actual operator given by the character string.
Map(function(x,y)get(y)(data$y,x),my.values,my.operators)
[[1]]
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE
[[2]]
[1] TRUE TRUE FALSE FALSE TRUE TRUE TRUE
[[3]]
[1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE
As you can see, we get a list of logical vectors for each value, operator pair.
To better understand what's going on here, consider only the first value of each vector:
get("!=")(data$y,0)
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE
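Now we can use Reduce to combine a list of logical vectors into a single one with &. For example, applying only != to every value in my.values: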
Reduce(`&`,lapply(my.values,function(x) data$y!=x))
[1] FALSE FALSE TRUE TRUE TRUE FALSE TRUE
And finally subset the data:
data[Reduce("&",Map(function(x,y)get(y)(data$y,x),my.values,my.operators)),]
y a b
5 20 bla bla
7 88 bla bla
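For the edited version of the question, where the list holds only "!=" values and the "<" stays in the formula, a simpler sketch is possible with %in%, which avoids looping over the values altogether. The names exclude and upper are assumptions for illustration.

```r
data <- data.frame(y = c(0, 0, 2000, 1500, 20, 77, 88),
                   a = "bla", b = "bla")

exclude <- c(0, 77)   # values to drop (the "!=" list)
upper   <- 1000       # threshold for the "<" condition

# Keep rows whose y is not in exclude and is below upper
result <- data[!(data$y %in% exclude) & data$y < upper, ]
result
```

On the sample data this keeps the same rows (5 and 7) as the Map/Reduce approach above.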
I have a data.frame in which I want to perform a count by row versus a specified criterion. The part I cannot figure out is that I want a different count criterion for each row.
Say I have 10 rows, I want 10 different criteria for the 10 rows.
I tried: count.above <- rowSums(Data > rate), where rate is a vector with the 10 criteria, but R used only the first as the criterion for the whole frame.
I imagine I could split my frame into 10 vectors and perform this task, but I thought there would be some simple way to do this without resorting to that.
Edit: this depends on whether you want to operate over rows or columns. See below:
This is a job for mapply and Reduce. Suppose you have a data frame along the lines of
df1 <- data.frame(a=1:10,b=2:11,c=3:12)
Let's say we want to count the rows where a>6, b>3 and c>5. This is done with mapply:
mapply(">",df1,c(6,3,5),SIMPLIFY=FALSE)
$a
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
$b
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
$c
[1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Now we use Reduce to find those which are all TRUE:
Reduce("&",mapply(">",df1,c(6,3,5),SIMPLIFY=FALSE))
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
Lastly, we use sum to add them all up:
sum(Reduce("&",mapply(">",df1,c(6,3,5),SIMPLIFY=FALSE)))
[1] 4
If you want a result for each row rather than a global aggregate, then apply is the function to use:
apply(df1,1,function(v) sum(v>c(6,3,5)))
[1] 0 0 1 2 2 2 3 3 3 3
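As a possible alternative to both approaches above, sweep can compare every column against its own threshold in one vectorised call. This is a sketch; the names thresholds, above, per_row and all_met are illustrative.

```r
df1 <- data.frame(a = 1:10, b = 2:11, c = 3:12)
thresholds <- c(6, 3, 5)   # one threshold per column

# Logical matrix: entry [i, j] is TRUE when df1[i, j] > thresholds[j]
above <- sweep(as.matrix(df1), 2, thresholds, ">")

per_row <- rowSums(above)                        # criteria met in each row
all_met <- sum(per_row == length(thresholds))    # rows meeting every criterion

per_row
all_met
```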
Given the dummy data (from @zx8754's solution)
# dummy data
df1 <- data.frame(matrix(1:15, nrow = 3))
myRate <- c(7, 5, 1)
Solution using apply
Courtesy of @JDL
rowSums(apply(df1, 2, function(v) v > myRate))
Alternative solution using the Reduce pattern
Reduce(function(l, v) cbind(l[,1] + (l[,2] > myRate), l[,-2:-1]),
1:ncol(df1),
cbind(0, df1))
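A further sketch, assuming myRate has exactly one entry per row: a matrix is stored column by column, so comparing it with a vector of length nrow recycles that vector down each column, and row i is always compared with myRate[i]. The names counts_recycled and counts_apply are illustrative.

```r
df1 <- data.frame(matrix(1:15, nrow = 3))
myRate <- c(7, 5, 1)   # one threshold per row

# myRate is recycled down each column of the matrix
counts_recycled <- rowSums(as.matrix(df1) > myRate)

# The apply version from above, for comparison
counts_apply <- rowSums(apply(df1, 2, function(v) v > myRate))

counts_recycled
counts_apply
```

If the thresholds vector had a different length than the number of rows, the recycling would no longer line up with the rows, so this shortcut only fits the one-threshold-per-row case.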