I have a dataset which contains positive, negative as well as NA values. How could I select positive-only values using a script? I would also like to replace negative numbers with NA and leave NA values as they are.
You could use the which function:
sample <- c(1, 2, -7, NA, NaN)
sample[which(sample > 0)]
[1] 1 2
For negative values assign NA.
Using which:
sample[which(sample < 0)] <- NA
You could try the following command:
> x<-c(1,2,3,-5)
> x[x>0]
[1] 1 2 3
would return all the positive values.
To replace negative numbers with NA use
> x <- ifelse(x<0, NA,x)
> x
[1] 1 2 3 NA
Another way to select positive values would be to use sign
x[sign(x) == 1]
and we can combine both these in Filter
Filter(function(i) i > 0, x)
Filter(function(i) sign(i) == 1, x)
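One caveat worth checking: the x used above has no NA in it. As a small sketch (the vector v below is made up for illustration), this is how the approaches differ once an NA is present, and how negatives can be set to NA while existing NAs are left alone:

```r
v <- c(1, 2, -7, NA, 3)   # hypothetical data containing an NA

# Plain logical indexing keeps a slot for the NA comparison
pos_logical <- v[v > 0]                   # 1 2 NA 3

# which() drops NA comparisons, so only true positives remain
pos_which <- v[which(v > 0)]              # 1 2 3

# Replace negatives with NA; existing NAs are untouched
cleaned <- replace(v, which(v < 0), NA)   # 1 2 NA NA 3
```

So if the data can contain NA, wrapping the condition in which() is the safer of the two selection styles.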
I have a vector a with values (1,2,3,4) and another vector b with values (1,1,0,1). Using the elements in b as a flag, I want to remove the elements of a at the positions where b is 0.
a <- c(1,2,3,4)
b <- c(1,1,0,1)
for(i in 1:length(b))
{
if(b[i] == 0)
{
a <- a[-i]
}
}
I get the desired output
a
[1] 1 2 4
But using ifelse, I do not get the output as required.
a <- c(1,2,3,4)
b <- c(1,1,0,1)
for(i in 1:length(b))
{
a <- ifelse(b[i] == 0,a[-i],a)
}
Output:
a
[1] 1
How to use ifelse in such situations?
I think ifelse isn't the right function here: ifelse returns a result of the same length as its test argument, whereas here we want to subset values. You don't need a loop either. You can directly do
a[b != 0]
#[1] 1 2 4
data
a <- 1:4
b <- c(1, 1, 0, 1)
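To see why the loop with ifelse collapses to a single value, here is a short sketch using the same a and b. ifelse shapes its result after its first argument, and b[i] == 0 has length 1 (first_pass and kept are names used only for this illustration):

```r
a <- c(1, 2, 3, 4)
b <- c(1, 1, 0, 1)

# The test has length 1, so ifelse returns a single element of `a`
first_pass <- ifelse(b[1] == 0, a[-1], a)   # 1

# Subsetting with a logical vector changes the length as intended
kept <- a[b != 0]                           # 1 2 4
```

After the first iteration of the loop a is already reduced to 1, which is why the final output is 1.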
Another option could be:
a[as.logical(b)]
[1] 1 2 4
If you want to use ifelse, you can use the following code
na.omit(ifelse(b==0,NA,a))
such that
> na.omit(ifelse(b==0,NA,a))
[1] 1 2 4
attr(,"na.action")
[1] 3
attr(,"class")
[1] "omit"
We can also use double negation
a[!!b]
#[1] 1 2 4
data
a <- 1:4
b <- c(1, 1, 0, 1)
Following is related to R language.
x1 <- c(1, 4, 3, NA, 7)
is.na(x1) <- which(x1 == 7)
I don't understand: the LHS in the last line gives you a boolean vector and the RHS is a value (the index where x1 == 7, 5 in this case). So what does it mean to assign a value of 5 to a boolean vector?
is.na from the docs returns:
The default method for is.na applied to an atomic vector returns a logical vector of the same length as its argument x, containing TRUE for those elements marked NA or, for numeric or complex vectors, NaN, and FALSE otherwise.
The assignment form, is.na(x) <- value, is a replacement function: value is a set of indices, and the elements of x at those indices are marked NA. It does not assign anything to the logical vector itself.
With the index coming from which, the element at that position becomes NA while everything else is left untouched, hence the change.
To put it in practice:
This is the output from is.na(x1):
is.na(x1)
[1] FALSE FALSE FALSE TRUE FALSE
The corresponding output from which(x1 == 7):
which(x1 == 7)
[1] 5
Combining the two, the element at position 5 becomes NA, so is.na() now returns TRUE for it:
is.na(x1) <- which(x1 == 7)
x1
[1] 1 4 3 NA NA
If an index lies beyond the end of the vector, the vector is extended with NAs. Starting again from the original x1, assigning c(1, 7) turns the first element into NA and appends two more NAs so that index 7 is an NA:
is.na(x1) <- c(1,7)
x1
[1] NA 4 3 NA 7 NA NA
Compare with this example from the docs:
(xx <- c(0:4))
is.na(xx) <- c(2, 4)
xx
[1] 0 NA 2 NA 4
From the above, it is clear that c(2, 4) refers to positions in xx: the elements at positions 2 and 4 become NA and the rest are unchanged.
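As a rough check of this reading, the sketch below compares the replacement form with plain indexed assignment of NA. The copies y1 and z1 are names introduced only for this illustration:

```r
x1 <- c(1, 4, 3, NA, 7)
y1 <- x1
z1 <- x1

is.na(x1) <- which(x1 == 7)        # replacement-function form
y1[which(y1 == 7)] <- NA           # ordinary indexed assignment

same_result <- identical(x1, y1)   # TRUE

# An index past the end extends the vector with NAs
is.na(z1) <- c(1, 7)
length(z1)                         # 7
```

For a plain atomic vector, the two forms give the same result here.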
This is a minor nuisance in R that I'm looking to see if there is possibly some default case for.
I create the following vector:
x <- c(1, 2, 1, NA)
I now want to select from x, only values equal to 1. I do as such:
x2 <- x[x == 1]
Now, when you see what's in x2 it has the values:
> x2
[1] 1 1 NA
It seems that R defaults to include NA values regardless of the condition. I would like it so that R by default excludes NA values from conditions (as it is true that NA does not satisfy the condition x == 1).
I'm aware of the complete.cases function, used as such:
x2 <- x[complete.cases(x == 1)]
The desired output would be the result of the complete.cases method as such:
[REMOVED CAUSE I MESSED THIS UP]
Which solves my problem, but I am curious to see if there is a setting in options or something like that where I can default R to not include NAs in a boolean condition.
I would like to see if there is a way to set R so that x2 <- x[x == 1] results in the same as x2 <- x[complete.cases(x == 1)]. Currently the difference is that the non-complete.cases (normal) method allows NAs through and I would like that to not be the case.
Hey, sorry, I realized I messed up my output with complete.cases as many of you have said, I essentially want to see if I can make this:
> x <- c(1, 2, 1, NA)
> x2 <- x[x == 1 & !is.na(x)]
> x2
[1] 1 1
Work with just this: x2 <- x[x == 1]. Can I make it so that R automatically ignores NAs? I could create a function to do this, but wanted to see if there is something built into R for single conditions that ignores NAs.
Do you need which?
> x <- c(1, 2, 1, NA)
> x[which(x==1)]
[1] 1 1
To explain, which(x==1) will give you the locations in your vector x that matches the test, x==1. You use this result to subset x, giving the output.
> which(x==1)
[1] 1 3
I don't see the functional value of returning a vector which contains the elements equal to 1, because all these elements will be 1. Rather, in production you most likely would be returning a boolean vector, where TRUE means a value matches and FALSE means a value does not match.
If you want NA values to show up as being FALSE for not matching, then you can make the comparison and simply replace these NA values with false:
x <- c(1, 2, 1, NA)
x2 <- x == 1
x2[is.na(x2)] <- FALSE
> x2
[1] TRUE FALSE TRUE FALSE
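As far as I know there is no global option that changes how [ treats NA in a logical index, but base R has a few one-liners that behave the way the question asks. A small sketch (the result names are only for illustration): subset() on a vector drops elements whose condition is NA, and %in% never returns NA.

```r
x <- c(1, 2, 1, NA)

via_subset <- subset(x, x == 1)   # 1 1
via_in     <- x[x %in% 1]         # 1 1
via_which  <- x[which(x == 1)]    # 1 1
```

Note that %in% only covers equality tests; for conditions such as x > 1, subset() or which() would be the ones to reach for.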
So I know to determine the first occurrence of a specific element in each row you use the apply function with which.max or which.min. Here is the code that I am using right now.
x <- matrix(c(20,9,4,16,6,2,14,3,1),nrow=3)
x
apply(3 >= x,1,which.max )
This produces an output of:
[1] 1 3 2
Now when I try to do the same thing on a different matrix "x2"
x2 <- matrix(c(3,9,4,16,6,2,14,3,1),nrow=3)
x2
apply(3 >= x2,1,which.max )
The output is the same:
[1] 1 3 2
But for "x2" it is correct because the "x2" matrix's first row does have a value less than or equal to three.
Now my question which is probably something simple is why do the apply functions produce the same thing for "x" and "x2". For "x" below I would want something like:
[1] 0 3 2
Or maybe even something like this:
[1] NA 3 2
I have seen questions on stack overflow before on which.max not producing NAs and the answer was to just use the which() function, but since I am using a matrix and I want the first occurrence I do not have that luxury... I think.
We could replace the values in 'x' that are >3 with a very small number, e.g. -999 or any value lower than the minimum in the dataset, get the index of the maximum of the replaced vector with which.max, and multiply by a logical flag to handle rows where no value qualifies. In the case of 'x', every value in the first row is greater than 3, so after replacing with -999, which.max returns 1 as the index, but we prefer to have NA or 0. With sum(x1>0), the first row gives 0; negating (!) converts it to TRUE, and negating once more returns FALSE. Multiplying by this logical coerces it to 0/1, so we get 0 for the first row. Note that this assumes the qualifying values are positive, and that which.max picks the largest qualifying value, which happens to be the first one in each row here.
apply(x, 1, function(x) {x1 <- ifelse(x>3, -999, x)
which.max(x1)*(!!sum(x1>0))})
#[1] 0 3 2
apply(x2, 1, function(x) {x1 <- ifelse(x>3, -999, x)
which.max(x1)*(!!sum(x1>0))})
#[1] 1 3 2
Another option is using max.col
x1 <- replace(x, which(x>3), -999)
max.col(x1)*!!rowSums(x1>0)
#[1] 0 3 2
x2N <- replace(x2, which(x2>3), -999)
max.col(x2N)*!!rowSums(x2N>0)
#[1] 1 3 2
Or a slight modification would be
indx <- x*(x <=3)
max.col(indx)*!!rowSums(indx)
#[1] 0 3 2
Put a column in front of '(3>=x)' that is Infinity, if and only if all entries in the corresponding row of 'x' are larger than 3, and otherwise NaN. Then apply 'which.max' rowwise, and finally subtract 1, because of the extra column:
x <- matrix(c(20,9,4,16,6,2,14,3,1),nrow=3)
a <- (!apply(3>=x,1,max))*Inf
apply( cbind(a,3>=x), 1, which.max ) - 1
This gives '0,3,2'. Here 'which.max' is applied to the extended matrix:
> cbind(a,3>=x)
a
[1,] Inf 0 0 0
[2,] NaN 0 0 1
[3,] NaN 0 1 1
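A shorter alternative, offered as a sketch rather than as what the answers above intended: match(TRUE, r) returns the position of the first TRUE in a logical vector r, or NA when there is none. It works on the logical matrix directly, so it does not depend on the values being positive or on which qualifying value is largest. The helper name first_le3 is hypothetical:

```r
x  <- matrix(c(20, 9, 4, 16, 6, 2, 14, 3, 1), nrow = 3)
x2 <- matrix(c(3, 9, 4, 16, 6, 2, 14, 3, 1), nrow = 3)

# Position of the first entry <= 3 in each row, NA if no entry qualifies
first_le3 <- function(m) apply(m <= 3, 1, function(r) match(TRUE, r))

first_le3(x)    # NA 3 2
first_le3(x2)   # 1 3 2
```

If 0 is preferred over NA, the NA entries of the result can be replaced afterwards, for example with res[is.na(res)] <- 0.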
Suppose I have a factor A with 3 levels A1, A2, A3 and with NA's. Each appears in 10 cases, so there is a total of 40 cases. If I do
subset1 <- df[df$A=="A1",]
dim(subset1) # 20, i.e., 10 for A1 and 10 for NA's
summary(subset1$A) # both A1 and NA have non-zero counts
subset2 <- df[df$A %in% c("A1"),]
dim(subset2) # 10, as expected
summary(subset2$A) # only A1 has non-zero count
And it is the same whether the class of the variable used for subsetting is factor or integer. Is it just how equal (and >, <) works? So should I just stick to %in% for factors and always include !is.na when using equal? Thanks!
Yes, the return values of == and %in% differ with respect to NA because of how "%in%" is defined...
# Data...
x <- c("A",NA,"A")
# When NA is encountered NA is returned
# Philosophically correct - who knows if the
# missing value at NA is equal to "A"?!
x=="A"
#[1] TRUE NA TRUE
x[x=="A"]
#[1] "A" NA "A"
# When NA is encountered by %in%, FALSE is returned, rather than NA
x %in% "A"
#[1] TRUE FALSE TRUE
x[ x %in% "A" ]
#[1] "A" "A"
This is because (from the docs)...
%in% is a wrapper around match and is currently defined as
"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
If we redefine it using match's default nomatch (NA), you will see that it behaves in the same way as ==
"%in2%" <- function(x,table) match(x, table, nomatch = NA_integer_) > 0
x %in2% "A"
#[1] TRUE NA TRUE
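To tie this back to the question, here is a sketch with a made-up data frame shaped like the one described. The name df comes from the question; the id column is an assumption added so that row subsetting returns a data frame rather than a vector:

```r
df <- data.frame(
  id = 1:40,
  A  = factor(rep(c("A1", "A2", "A3", NA), each = 10))
)

n_eq    <- nrow(df[df$A == "A1", ])          # 20: the NA rows come along
n_in    <- nrow(df[df$A %in% "A1", ])        # 10
n_which <- nrow(df[which(df$A == "A1"), ])   # 10
```

The extra 10 rows in the first case are all-NA rows produced by the NA entries in the logical index.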
There's a mismatch here between what you want (only the entries that match your filtering) and what R does.
The difference is that when the selection vector includes an NA, the corresponding entry yields an output, but the value is NA. The logical tests that you run yield NAs, which is where the problem occurs.
Consider these cases:
x <- 1:10
y <- x
y[4] <- NA
ix1 <- which(x < 5)
ix2 <- which(y < 5)
x[ix1]
y[ix2]
Versus:
x[x < 5]
y[y < 5]
And
y < 5
It is because of this behavior that I almost never use v[logicalCondition] and instead add an additional command to select the entries, e.g. ixSelect <- which(logicalCondition). If you want NAs, you can use which(logicalCondition | is.na(v)).
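For reference, a sketch of what the snippets above evaluate to for y, with the results held in variables so they can be checked (the variable names are only for illustration):

```r
y <- 1:10
y[4] <- NA

direct   <- y[y < 5]                     # 1 2 3 NA
by_which <- y[which(y < 5)]              # 1 2 3
with_na  <- y[which(y < 5 | is.na(y))]   # 1 2 3 NA
```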