I've started learning R and got a piece of code in which a statement is:
if (sum(C == C[i]) == 1)  # C is a vector and i is the index of a value in it, which the user supplies as an argument
How can you pass a conditional statement as an argument of a function? Also explain the meaning of this statement.
Thank you.
Let's work through an example.
Take C as a numeric vector from 1 to 10 and i as 3:
C <- 1:10
i <- 3
So when we do
C == C[i]
#[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
This compares every element of C with C[i], which is 3, and returns a logical vector that is TRUE only at the 3rd position.
Summing this logical vector counts the TRUE values (FALSE is treated as 0 and TRUE as 1), which in this case gives 1:
sum(C == C[i])
#[1] 1
The sum is then compared with 1 to check that the value C[i] occurs exactly once in C:
sum(C == C[i]) == 1
#[1] TRUE
The condition is FALSE when the value C[i] is repeated in C. For example,
C <- c(1:10, 3) # add an extra 3 at the end
C
#[1] 1 2 3 4 5 6 7 8 9 10 3
i <- 3
sum(C == C[i]) == 1
#[1] FALSE
The bottom line is that the condition is TRUE only if the value C[i] occurs exactly once in C.
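On the first part of the question: the condition is an ordinary expression that evaluates to a single TRUE or FALSE, so nothing special is needed to use it with functions. As a minimal sketch (occurs_once and report are hypothetical names, not part of the original code), the check can be wrapped in a function and its result passed on like any other value:

```r
# Hypothetical helper: TRUE when the value at position i occurs exactly once in C
occurs_once <- function(C, i) {
  sum(C == C[i]) == 1
}

C <- c(1:10, 3)
occurs_once(C, 3)   # C[3] is 3, which appears twice: FALSE
occurs_once(C, 5)   # C[5] is 5, which appears once: TRUE

# The result is a plain logical value, so it can be passed as an argument
report <- function(flag) if (flag) "unique" else "repeated"
report(occurs_once(C, 5))
```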
I am new to R. I have created an object a:
a <- c(2,4,6,8,10,12,14,16,18,20)
I have performed the following operation on the vector:
a[!c(10,0,8,6,0)]
and I get the output as 4 10 14 20
I do understand that !c(10,0,8,6,0) produces FALSE TRUE FALSE FALSE TRUE
I don't understand how the final result comes out as 4 10 14 20
Can someone help?
We get this result because the logical vector is recycled: its length is only 5, while length(a) is 10, so it is repeated until it covers the whole of 'a', i.e.
i1 <- rep(!c(10,0,8,6,0), length.out = length(a))
i1
[1] FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
If we use that vector
a[i1]
[1] 4 10 14 20
It is easier to see if we pass a single TRUE: it is recycled to select every element, and a single FALSE is recycled to select none
a[TRUE]
[1] 2 4 6 8 10 12 14 16 18 20
a[FALSE]
numeric(0)
The recycling is mentioned in the documentation of ?Extract
For [-indexing only: i, j, ... can be logical vectors, indicating elements/slices to select. Such vectors are recycled if necessary to match the corresponding extent. i, j, ... can also be negative integers, indicating elements/slices to leave out of the selection.
As in many languages, R treats 0 as FALSE and any other number as TRUE when a number is coerced to logical. So negating the vector turns each 0 into TRUE and every other value into FALSE
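Putting the two steps together, here is a small sketch that makes the coercion and the recycling visible. The names idx and keep are introduced only for illustration:

```r
a <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
idx <- c(10, 0, 8, 6, 0)

as.logical(idx)   # non-zero values become TRUE, zero becomes FALSE
!idx              # negation flips that: TRUE only where idx is 0

# which() lists the positions selected once the 5-element mask
# has been recycled to the length of a
keep <- which(rep(!idx, length.out = length(a)))
keep      # 2 5 7 10
a[keep]   # 4 10 14 20
```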
I have this example data.frame:
df <- data.frame(a = c(1,2,3,5,7,8),b=c(2,3,4,6,8,9))
And I'd like to collapse every row i whose b column value equals the a column value of the next row (i+1), so that the collapsed row takes its a value from row i and its b value from row i+1. This should be repeated until no consecutive rows meet this condition.
For the example df, rows 1-3 are to be collapsed, row 4 left as is, and then rows 5-6 collapsed, giving:
res.df <- data.frame(a = c(1,5,7), b = c(4,6,9))
This isn't overly pretty, but it is vectorised: it compares df$a without its first element against df$b without its last element.
grps <- rev(cumsum(rev(c(tail(df$a,-1) != head(df$b,-1),TRUE))))
#[1] 3 3 3 2 1 1
cbind(df["a"], b=ave(df$b,grps,FUN=max) )[!duplicated(grps),]
# a b
#1 1 4
#4 5 6
#5 7 9
Breaking it down probably helps explain the first part:
tail(df$a,-1) != head(df$b,-1)
#[1] FALSE FALSE TRUE TRUE FALSE
c(tail(df$a,-1) != head(df$b,-1),TRUE)
#[1] FALSE FALSE TRUE TRUE FALSE TRUE
rev(c(tail(df$a,-1) != head(df$b,-1),TRUE))
#[1] TRUE FALSE TRUE TRUE FALSE FALSE
cumsum(rev(c(tail(df$a,-1) != head(df$b,-1),TRUE)))
#[1] 1 1 2 3 3 3
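The same grouping idea can also be written front to back, which avoids the double rev. This is a sketch of a variation, not the code from the answer above: starts, grp and res are illustrative names, and it takes b from the last row of each run rather than the maximum, which gives the same result here because b is increasing.

```r
df <- data.frame(a = c(1, 2, 3, 5, 7, 8), b = c(2, 3, 4, 6, 8, 9))

# TRUE marks a row that starts a new run: the first row, or a row
# whose a differs from the previous row's b
starts <- c(TRUE, tail(df$a, -1) != head(df$b, -1))
grp <- cumsum(starts)
grp   # 1 1 1 2 3 3

res <- data.frame(
  a = df$a[!duplicated(grp)],                   # a from the first row of each run
  b = df$b[!duplicated(grp, fromLast = TRUE)]   # b from the last row of each run
)
res
```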
I was trying this out, trying to subset a data frame based on the values in one vector being in another vector:
x <- c( 1,2,3,1,2,3 )
df <- data.frame(x=x,y=x)
df[ df$x == c(1,2), ]
expecting to get this:
x y
1 1 1
2 2 2
4 1 1
5 2 2
but I didn't, I got this:
x y
1 1 1
2 2 2
Disregarding the fact that I really wanted this (occurred to me a minute later):
df[ df$x %in% c(1,2), ]
What is the logic behind the result of this:
x == c(1,2)
being this:
[1] TRUE TRUE FALSE FALSE FALSE FALSE
I don't really get it. I am aware that this is likely a duplicate, but I couldn't find one.
It is based on the recycling of c(1,2) to the length of 'x', i.e. we are comparing df$x with
rep(c(1,2),length.out= nrow(df))
#[1] 1 2 1 2 1 2
df$x ==rep(c(1,2),length.out= nrow(df))
#[1] TRUE TRUE FALSE FALSE FALSE FALSE
That is, each element of 'x' is compared with the element at the same position in the recycled c(1,2), rather than being checked for membership in c(1,2).
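To see the two behaviours side by side, here is a short sketch using the data from the question (recycled is just an illustrative name):

```r
x <- c(1, 2, 3, 1, 2, 3)

# == compares position by position against the recycled right-hand side
recycled <- rep(c(1, 2), length.out = length(x))
x == recycled     # TRUE TRUE FALSE FALSE FALSE FALSE

# %in% asks, for each element of x, whether it appears anywhere in the set
x %in% c(1, 2)    # TRUE TRUE FALSE TRUE TRUE FALSE

df <- data.frame(x = x, y = x)
df[df$x %in% c(1, 2), ]   # rows 1, 2, 4 and 5
```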
I am writing some code in R. First I create a blank column in the data set, and then I want to assign 0 or 1 to that column according to some conditions. Here is my code
#Creating an empty column in the data file
Mydata$final <- "";
#To assign 0,1 value in final variable
if(Mydata$Default_Config == "No" & is.na(Mydata$Best_Config)=="TRUE" & (Mydata$AlmostDefaultConfig!=1 | Mydata$AlmostDefaultConfig!=3)){
Mydata$final <- 1
}else{
Mydata$final <- 0
}
And I am getting this error
Warning message:
In if (Mydata$Default_Config == "No" & is.na(Mydata$Best_Config) == :
the condition has length > 1 and only the first element will be used
How can I fix this error? Please help me out. Thanks in advance
Your problem is one of vectorisation: if is not vectorised. Each comparison in your if statement tests multiple values, and R is telling you it will use only the first of them. You need ifelse, which is vectorised:
Mydata$final <- ifelse( Mydata$Default_Config == "No" & is.na(Mydata$Best_Config)=="TRUE" & (Mydata$AlmostDefaultConfig!=1 | Mydata$AlmostDefaultConfig!=3) , 1 , 0 )
A reproducible example is below. If x is > 5 and y is even then return 1 otherwise return 0:
x <- 1:10
# [1] 1 2 3 4 5 6 7 8 9 10
y <- seq(1,30,3)
# [1] 1 4 7 10 13 16 19 22 25 28
x > 5
# [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
y %% 2 == 0
# [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
ifelse( x > 5 & y %% 2 == 0 , 1 , 0 )
# [1] 0 0 0 0 0 1 0 1 0 1
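Applied to a data frame shaped like the one in the question, it might look like the sketch below. The data values are made up for illustration. One thing worth checking in the original condition: x != 1 | x != 3 is TRUE for every non-missing x, so it filters nothing. The sketch assumes the intent was "neither 1 nor 3", which is only a guess at what was meant.

```r
# Hypothetical data using the question's column names
Mydata <- data.frame(
  Default_Config      = c("No", "No", "Yes", "No"),
  Best_Config         = c(NA, "A", NA, NA),
  AlmostDefaultConfig = c(2, 2, 2, 3)
)

# The original OR is TRUE for every row, whatever the value is
all(Mydata$AlmostDefaultConfig != 1 | Mydata$AlmostDefaultConfig != 3)

# Assumed intent: the value is neither 1 nor 3
Mydata$final <- ifelse(
  Mydata$Default_Config == "No" &
    is.na(Mydata$Best_Config) &
    !(Mydata$AlmostDefaultConfig %in% c(1, 3)),
  1, 0
)
Mydata$final   # 1 0 0 0
```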
An alternative approach is to take advantage of R's coercion. You have a set of conditions that are all vectorised, and R is happy to convert TRUE/FALSE to 1/0, so you can write it like:
Mydata$final <- ( (Mydata$Default_Config == "No") * is.na(Mydata$Best_Config) * ((Mydata$AlmostDefaultConfig != 1) + (Mydata$AlmostDefaultConfig != 3)) )
(extra parentheses added for clarity; the ones around each != comparison are required, because + binds more tightly than !=).
Apologies if I fouled up the logic there.
Edit: My code for the OR won't quite work, since if both sides are TRUE you'd get a big number ("2" :-) ). Change it to as.logical((Mydata$AlmostDefaultConfig != 1) + (Mydata$AlmostDefaultConfig != 3))
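A tiny sketch of the arithmetic involved, using two made-up logical vectors p and q, shows why the sum needs the extra conversion and how the logical operators give the same flags more directly:

```r
p <- c(TRUE, TRUE, FALSE, FALSE)
q <- c(TRUE, FALSE, TRUE, FALSE)

p * q               # 1 0 0 0: the product acts like AND
p + q               # 2 1 1 0: the sum counts TRUE values, so it can reach 2
as.logical(p + q)   # any non-zero sum becomes TRUE, which acts like OR

# The same flags via the logical operators, converted once at the end
as.integer(p & q)   # 1 0 0 0
as.integer(p | q)   # 1 1 1 0
```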
I have a data.frame with a block of columns that are logicals, e.g.
> tmp <- data.frame(a=c(13, 23, 52),
+ b=c(TRUE,FALSE,TRUE),
+ c=c(TRUE,TRUE,FALSE),
+ d=c(TRUE,TRUE,TRUE))
> tmp
a b c d
1 13 TRUE TRUE TRUE
2 23 FALSE TRUE TRUE
3 52 TRUE FALSE TRUE
I'd like to compute a summary column (say: e) that is a logical AND over the whole range of logical columns. In other words, for a given row, if all b:d are TRUE, then e would be TRUE; if any b:d are FALSE, then e would be FALSE.
My expected result is:
> tmp
a b c d e
1 13 TRUE TRUE TRUE TRUE
2 23 FALSE TRUE TRUE FALSE
3 52 TRUE FALSE TRUE FALSE
I want to indicate the range of columns by indices, as I have a bunch of columns, and the names are cumbersome. The following code works, but I'd rather use a vectorized approach to improve performance.
> tmp$e <- NA
> for(i in 1:nrow(tmp)){
+ tmp[i,"e"] <- all(tmp[i,2:(ncol(tmp)-1)]==TRUE)
+ }
> tmp
a b c d e
1 13 TRUE TRUE TRUE TRUE
2 23 FALSE TRUE TRUE FALSE
3 52 TRUE FALSE TRUE FALSE
Any way to do this without using a for loop to step through the rows of the data.frame?
You can use rowSums to loop over rows... and some fancy footwork to make it quasi-automated:
# identify the logical columns
boolCols <- sapply(tmp, is.logical)
# sum each row of the logical columns and
# compare to the total number of logical columns
tmp$e <- rowSums(tmp[,boolCols]) == sum(boolCols)
Using rowSums inside ifelse, this can be achieved in one go:
tmp$e <- ifelse(rowSums(tmp[, 2:4] == TRUE) == 3, TRUE, FALSE)
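Another option, offered as a sketch rather than as part of the answers above, is to fold & across the selected columns with Reduce. It works on column indices, as the question asks, and never leaves logical type. The name cols is introduced for illustration.

```r
tmp <- data.frame(a = c(13, 23, 52),
                  b = c(TRUE, FALSE, TRUE),
                  c = c(TRUE, TRUE, FALSE),
                  d = c(TRUE, TRUE, TRUE))

cols <- 2:4   # column range given by index, as in the question

# Fold & across the selected columns: b & c & d, evaluated element-wise
tmp$e <- Reduce(`&`, tmp[cols])
tmp$e   # TRUE FALSE FALSE

# Agrees with the rowSums comparison on this data
all(tmp$e == (rowSums(tmp[cols]) == length(cols)))
```

The two approaches can differ when the columns contain NA: rowSums returns NA for any row with a missing value unless na.rm = TRUE is set, while & can still return FALSE when another column in that row is FALSE.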