I have two vectors, both with dimensions 30000x1, so only one column and many rows. The first vector contains values, the second only TRUE or FALSE.
I want to keep all the rows of vector1 where the same row of vector2 equals TRUE.
I have tried combinations like:
res=apply(vector1,2,vector2)
res=vector1(vector2)
res=vector1[vector2]
but I can't figure this out. Thanks a lot for help.
Example:
vector1:
123
345
667
vector2:
TRUE
FALSE
TRUE
res:
123
667
In R you can index into one vector using a second vector of the same length that contains Boolean values, such that wherever the second vector contains TRUE you select the corresponding element of the first.
So your third way works for me
v1=c(123,345,667)
v2=c(TRUE,FALSE,TRUE)
v1[v2]
which outputs
[1] 123 667
This is because v2 contains TRUE at positions 1 and 3, and so v1[v2] is equivalent to v1[c(1,3)].
See point 1 of the introductory documentation on indexing. Specifically:
[indexing with] a logical vector. In this case the index vector must be of the same length as the vector from which elements are to be selected. Values corresponding to TRUE in the index vector are selected and those corresponding to FALSE are omitted
This works:
x <- 1:3
y <- c(TRUE, FALSE, TRUE) # spelled out, since T and F are ordinary variables that can be reassigned
x
#[1] 1 2 3
y
#[1] TRUE FALSE TRUE
x[y]
#[1] 1 3
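Since the question mentions dimensions of 30000x1, the data may actually be one-column matrices rather than plain vectors. Here is a minimal sketch of that case, assuming both objects are matrices of the same shape (the names m1 and m2 are made up for illustration):

```r
# Hypothetical one-column matrices standing in for vector1 and vector2
m1 <- matrix(c(123, 345, 667), ncol = 1)
m2 <- matrix(c(TRUE, FALSE, TRUE), ncol = 1)

# A logical matrix of the same shape selects elements and returns a plain vector
res_vec <- m1[m2]

# To keep the result as a one-column matrix, index the rows and set drop = FALSE
res_mat <- m1[m2[, 1], , drop = FALSE]

# If the second object holds the strings "TRUE"/"FALSE" instead of logicals,
# convert it first with as.logical()
flags <- as.logical(c("TRUE", "FALSE", "TRUE"))
res_chr <- m1[flags]
```

If `vector1[vector2]` does not work on the real data, checking `class(vector2)` is a reasonable first step, because indexing with character strings looks up names instead of positions.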
I have a data.frame whose column will hold R formulas as a string.
df1
ID NAME FORMULA
1 a "R formula saved as text string" # will result in a vector of TRUE/FALSE
2 b "R formula saved as text string" # will result in a vector of TRUE/FALSE
i.e [1] TRUE TRUE FALSE FALSE TRUE FALSE
I have another data.frame I wish to add a RESULT column that holds the SUM of the result of each formula string.
df2
ID NAME RESULT
1 a 25
2 b 37
A caveat: I actually need to do a conditional SUM based on a column (WEIGHT) in another data.frame (COMISSION_TABLE).
The following code was a complete fail:
df2$RESULT <- sum(ifelse(eval(parse(text=df1$FORMULA)),
yes = COMISSION_TABLE$WEIGHT, no = 0))
Can anyone suggest how I could evaluate strings saved as text and apply each per row?
Great thanks for any assistance! :)
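One possible approach, sketched under assumptions: parse() applied to a whole character vector yields several expressions, eval() then returns only the value of the last one, and sum() collapses everything to a single number, which would explain why the attempt above fails. Evaluating each string separately avoids that. The formulas, the REGION column and all values below are invented for illustration; only the names df1, df2, FORMULA, RESULT, WEIGHT and COMISSION_TABLE come from the question.

```r
# Hypothetical data: column names mirror the question, the values are made up
COMISSION_TABLE <- data.frame(WEIGHT = c(10, 15, 5, 7, 20, 3),
                              REGION = c("a", "a", "b", "b", "a", "b"))
df1 <- data.frame(ID = 1:2,
                  NAME = c("a", "b"),
                  FORMULA = c('REGION == "a"', 'WEIGHT > 6'),
                  stringsAsFactors = FALSE)

# Evaluate ONE formula string against the columns of COMISSION_TABLE
# and sum the weights of the rows where it is TRUE
eval_one <- function(f) {
  keep <- eval(parse(text = f), envir = COMISSION_TABLE)
  sum(COMISSION_TABLE$WEIGHT[which(keep)])  # which() drops NA results
}

# Apply the helper to each formula string, one result per row of df1
df2 <- df1[, c("ID", "NAME")]
df2$RESULT <- vapply(df1$FORMULA, eval_one, numeric(1), USE.NAMES = FALSE)
```

Bear in mind that eval(parse(text = ...)) runs whatever code the string contains, so this is only sensible when the formula strings come from a trusted source.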
I am having trouble with understanding %in%. In Hadley Wickham's Book "R for data science" in section 5.2.2 it says, "A useful short-hand for this problem is x %in% y. This will select every row where x is one of the values in y." Then this example is given:
nov_dec <- filter(flights, month %in% c(11, 12))
However, when I look at the syntax, it appears that it should be selecting every row where y is one of the values in x(?) So in the example, all the cases where 11 and 12 (y) appear in "month" (x).
?"%in%" doesn't make this any clearer to me. Obviously I'm missing something, but could someone please spell out exactly how this function works?
It appears that it should be selecting every row where y is one of the values in x(?) So in the example, all the cases where 11 and 12 appear in "month."
If you don't understand the behavior from looking at the example, try it out yourself. For example, you could do this:
> c(1,2,3) %in% c(2,4,6)
[1] FALSE TRUE FALSE
So it looks like %in% gives you a vector of TRUE and FALSE values that correspond to each of the items in the first argument (the one before %in%). Let's try another:
> c(1,2,3) %in% c(2,4,6,8,10,12,1)
[1] TRUE TRUE FALSE
That confirms it: the first item in the returned vector is TRUE if the first item in the first argument is found anywhere in the second argument, and so on. Compare that result to the one you get using match():
> match(c(1,2,3), c(2,4,6,8,10,12,1))
[1] 7 1 NA
So the difference between match() and %in% is that the former gives you the actual position in the second argument of the first match for each item in the first argument, whereas %in% gives you a logical vector that just tells you whether each item in the first argument appears in the second.
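The relationship between the two can be checked directly: the help page ?match describes %in% as being defined in terms of match(). A small sketch reusing the vectors from above (the variable names are just for illustration):

```r
x <- c(1, 2, 3)
y <- c(2, 4, 6, 8, 10, 12, 1)

# %in% answers "is each element of x found somewhere in y?"
via_in <- x %in% y

# The same answer built from match(): a position of 0 means "no match"
via_match <- match(x, y, nomatch = 0L) > 0L

same <- identical(via_in, via_match)
```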
In the context of Wickham's book example, month is a vector of values representing the months in which various flights take place. So for the sake of argument, something like:
> month <- c(2,3,5,11,2,9,12,10,9,12,8,11,3)
Using the %in% operator lets you turn that vector into the answers to the question Is this flight in month 11 or 12? like this:
> month %in% c(11,12)
[1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE
[13] FALSE
which gives you a logical vector, i.e. a vector of TRUE/FALSE values. The filter() function uses that logical vector to select corresponding rows from the flights table. Used together, filter and %in% answer the question What are all the flights that occur in months 11 or 12?
If you turned the %in% around and instead asked:
> c(11,12) %in% month
[1] TRUE TRUE
you're really just asking Are there any flights in each of month 11 and month 12?
I can imagine that it might seem odd to ask whether a large vector is "in" a vector that has only two values. Consider reading x %in% y as Is each of the values from x also in y?
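To see how the logical vector ends up selecting rows, here is a base R sketch that does roughly what filter() does in the book example. The tiny flights_small table and its carrier column are made up; the real flights data has many more rows and columns.

```r
# Hypothetical stand-in for the flights table
flights_small <- data.frame(month   = c(2, 11, 12, 9),
                            carrier = c("AA", "UA", "DL", "AA"))

# One TRUE/FALSE per row: is this row's month one of 11 or 12?
keep <- flights_small$month %in% c(11, 12)

# Logical row indexing keeps exactly the rows where keep is TRUE
nov_dec <- flights_small[keep, ]
```

Here keep has one entry per row of the table, which is why month has to be on the left of %in%: the result must line up with the rows being filtered.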
A quick exercise should be enough to demonstrate how the function works:
> x <- c(1, 2, 3, 4)
> y <- 4
> z <- 5
> x %in% y
[1] FALSE FALSE FALSE TRUE
So the fourth element of numeric vector x is present in numeric vector y.
> y %in% x
[1] TRUE
And the first element of y (there's only one) is in x.
> z %in% x
[1] FALSE
> x %in% z
[1] FALSE FALSE FALSE FALSE
So z is not in x, and no element of x is in z.
Also see the help for all matching functions with ?match
I think understanding how it works is somewhat semantic, and once you can say it logically then the grammar works itself out.
The key is to form a sentence in your head as you read the code: one that walks through each row in turn and uses Boolean logic to include or exclude it, depending on whether its value appears in the "filter by" vector given after %in%, here c(11, 12).
nov_dec <- filter(flights, month %in% c(11, 12))
In this case for your example above it should read like this:
"Set the variable nov_dec equal to the subset of rows in flights, where the variable column month (from those rows) is in the list c(11,12). "
As R works down the rows, it looks at month: if the value is either 11 or 12, the two values in your vector, the row is included in nov_dec; otherwise it is skipped.
x %in% y explicitly means: is each value from x also in y?
The best way to understand it is with an example:
x <- 1:10 # numbers from 1 to 10
y <- (1:5)*2 # even numbers between 2 and 10
y %in% x # all TRUE: every even number between 2 and 10 is among the numbers from 1 to 10
x %in% y # TRUE only at the even numbers
I'm trying to get rid of NAs in an R data.frame. I was trying to create a new df that included only rows whose cluster was "texas" in this example.
> newdf <- df[df$cluster == "texas",]
> summary(newdf$cluster)
texas oklahoma NA's
510 0 719
I had found other questions that address getting rid of NAs, but in this case, I was only selecting those whose "cluster" column is equal to "texas" -- how did the NAs come along for the ride?
Is there a better way of doing what I want?
As @MrFlick suggests above, NA values are handled in slightly (subtly?) different ways depending on how you index.
Test data:
dd <- data.frame(cluster=c("oklahoma","texas",NA), stringsAsFactors=TRUE) # factor column, needed in R >= 4.0 to get the summaries shown below
logical indexing: a TRUE value in the index vector selects the corresponding value, FALSE drops it, and NA results in NA.
dd$cluster=="oklahoma"
## [1] TRUE FALSE NA
summary(dd[dd$cluster=="oklahoma",])
## oklahoma texas NA's
## 1 0 1
In principle you could use dd$cluster=="oklahoma" & !is.na(dd$cluster) as your criterion - since FALSE & NA is FALSE - but that's rather awkward. (Since we have specified a single-column data frame, without saying drop=FALSE, the result gets simplified to a vector before being summarized.)
subset: although its use is sometimes discouraged for non-interactive code, subset has the convenient property that it drops values where the criterion evaluates to NA. (Also, subset always returns a data frame even if the result is only one column wide.)
summary(subset(dd,cluster=="oklahoma"))
## cluster
## oklahoma:1
## texas :0
which:
which() only returns indices for TRUE values, not for NA values:
which(dd$cluster=="oklahoma")
## [1] 1
summary(dd[which(dd$cluster=="oklahoma"),])
## oklahoma texas
## 1 0
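%in%: another option, not covered above, is to build the criterion with %in% instead of ==. Since %in% reports "no match" as FALSE rather than NA, the NA rows are dropped by plain logical indexing. A minimal sketch, using the same test data (keep_in, keep_and and rows are illustrative names):

```r
dd <- data.frame(cluster = c("oklahoma", "texas", NA), stringsAsFactors = TRUE)

# %in% never returns NA: an NA in cluster simply fails to match "oklahoma"
keep_in <- dd$cluster %in% "oklahoma"

# Equivalent, more verbose criterion: NA & FALSE evaluates to FALSE
keep_and <- dd$cluster == "oklahoma" & !is.na(dd$cluster)

# drop = FALSE keeps the result a data frame even with a single column
rows <- dd[keep_in, , drop = FALSE]
```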
I have this data frame with two columns which can either take the value of left or right.
test_df <- data.frame(col1 = c("right","left","right",NA),
col2 = c("left","right",NA,"right"))
test_df
# col1 col2
# 1 right left
# 2 left right
# 3 right <NA>
# 4 <NA> right
Now I want to test this multiple condition
test_df$col1 == "left" | test_df$col2 == "right"
# [1] FALSE TRUE NA TRUE
The first three results are as expected, but why is the last result TRUE instead of NA? What's different between the results for row 3 and row 4?
In your code you are testing whether at least one of the following conditions is fulfilled; "left" in col1 or "right" in col2. In row 4 you have "right" in col2, therefore the result is TRUE, irrespective of what may or may not be in col1. The situation is different in row 3. There, col1 does not contain "left", hence it remains to be seen if col2 contains "right" in order to conclude whether the statement is FALSE or TRUE. However, since the entry in col2 for row 3 is NA, the result of the comparison cannot be decided and, accordingly, the output is NA.
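The rule behind this can be checked on single values. R treats NA as "unknown", so a logical operation only returns NA when the unknown value could still change the outcome:

```r
r1 <- NA | TRUE   # TRUE: the result is TRUE whatever the unknown value is
r2 <- NA | FALSE  # NA: the result depends on the unknown value
r3 <- NA & FALSE  # FALSE: the result is FALSE whatever the unknown value is
r4 <- NA & TRUE   # NA: the result depends on the unknown value
```

Row 4 of the data corresponds to the first case and row 3 to the second.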
If you want to have a function that performs the comparison between the entries in col1 and col2 that you mentioned but returns NA if any of the entries in those two columns is NA, you could use
as.logical((test_df$col1 == "left") + (test_df$col2 == "right"))
#[1] FALSE TRUE NA NA
In this line of code, the results of the individual comparisons, yielding TRUE or FALSE, are coerced into numerical values by the + operator. If any part of the sum is NA, the sum will be NA. This addition is done for each row of the dataframe, so the result is a vector with the length nrow(test_df).
By using as.logical(), the result of the sum calculated in the brackets is converted back into logical values. Again, this is done for each element of the vector. If the sum is zero, then the result is FALSE, if it is NA it will remain NA. Any non-zero integer will be converted into TRUE.
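Splitting the one-liner into its two steps makes the intermediate values visible. This sketch rebuilds test_df from the question; cmp_sum and res are just illustrative names:

```r
test_df <- data.frame(col1 = c("right", "left", "right", NA),
                      col2 = c("left", "right", NA, "right"))

# Step 1: adding the two logical vectors coerces TRUE/FALSE to 1/0;
# any NA in either comparison makes that row's sum NA
cmp_sum <- (test_df$col1 == "left") + (test_df$col2 == "right")

# Step 2: 0 becomes FALSE, any non-zero number becomes TRUE, NA stays NA
res <- as.logical(cmp_sum)
```

Here cmp_sum is 0, 2, NA, NA, which as.logical() turns into FALSE, TRUE, NA, NA.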
I need some help in determining more than one minimum value in a vector. Let's suppose, I have a vector x:
x<-c(1,10,2, 4, 100, 3)
and would like to determine the indexes of the smallest 3 elements, i.e. 1, 2 and 3. I need the indexes because I will be using them to access the corresponding elements in another vector. Of course, sorting will provide the minimum values but I want to know the indexes of their actual occurrence prior to sorting.
In order to find the index try this
which(x %in% sort(x)[1:3]) # this gives you an index vector
[1] 1 3 6
This says that the first, third and sixth elements hold the three lowest values in your vector. To see which values these are, try:
x[ which(x %in% sort(x)[1:3])] # this gives the vector of values
[1] 1 2 3
or just
x[c(1,3,6)]
[1] 1 2 3
If you have any duplicated values, you may want to select unique values first and then sort them in order to find the index, just like this (suggested by @Jeffrey Evans in his answer)
which(x %in% sort(unique(x))[1:3])
I think you mean you want to know what are the indices of the bottom 3 elements? In that case you want order(x)[1:3]
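A short sketch of the order() approach, including the step of indexing into a second vector as the question describes. The vector other and its contents are made up for illustration:

```r
x <- c(1, 10, 2, 4, 100, 3)
other <- c("a", "b", "c", "d", "e", "f")  # hypothetical second vector

# order() returns the positions of x arranged from smallest to largest value,
# so the first three entries are the positions of the three smallest values
idx <- order(x)[1:3]

# Use those positions to pick the matching elements of the other vector
picked <- other[idx]

# The same idea for the three largest values
idx_desc <- order(x, decreasing = TRUE)[1:3]
```

Note that order() lists the positions in order of the values (smallest first), whereas which(x %in% ...) lists them in order of position, so the two can differ in ordering when the small values are not already in ascending position order.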
You can use unique to account for duplicate minimum values.
x<-c(1,10,2,4,100,3,1)
which(x %in% sort(unique(x))[1:3])
Here's another way with rank that includes duplicates.
x <- c(1, 10, 2, 4, 100, 3, 3) # the original x with a second 3 appended
# [1] 1 10 2 4 100 3 3
which(rank(x, ties.method='min') <= 3)
# [1] 1 3 6 7