Removing zero lines from dataframe yields dataframe of zero lines - r

I have a script that has a bunch of quality control checksums and it got caught on a dataset that had no need to remove any samples (rows) due to quality control. However, this script gave me an unexpected result of a dataframe with zero rows. With example data, why does this work:
data(iris)
##get rid of those pesky factors
iris$Species <- NULL
med <- which(iris[, 1] < 4.9)
medtemp <- iris[-med, ]
dim(medtemp)
[1] 134 4
but this returns a dataframe of zero rows:
small <- which(iris[, 1] < 4.0)
smalltemp <- iris[-small, ]
dim(smalltemp)
[1] 0 4
As does this:
x <- 0
zerotemp <- iris[-x, ]
dim(zerotemp)
[1] 0 4
It seems that the smalltemp dataframe should be the same size as iris since there are no rows to remove at all. Why is this?

Copied verbatim from Patrick Burns's R Inferno p. 41 (I hope this constitutes "fair use" -- if someone objects I'll remove it)
negative nothing is something
> x2 <- 1:4
> x2[-which(x2 == 3)]
[1] 1 2 4
The command above returns all of the values in x2 not equal to 3.
> x2[-which(x2 == 5)]
numeric(0)
The hope is that the above command returns all of x2 since no elements are
equal to 5. Reality will dash that hope. Instead it returns a vector of length
zero.
There is a subtle difference between the two following statements:
x[]
x[numeric(0)]
Subtle difference in the input, but no subtlety in the difference in the output.
There are at least three possible solutions for the original problem.
out <- which(x2 == 5)
if(length(out)) x2[-out] else x2
Another solution is to use logical subscripts:
x2[!(x2 %in% 5)]
Or you can, in a sense, work backwards:
x2[ setdiff(seq_along(x2), which(x2 == 5)) ]
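Applied to the data frame from the question, the three workarounds might look like the following. This is only a minimal sketch: the result names (guarded, by_logical, by_setdiff) are made up for illustration, and seq_len(nrow(iris)) stands in for seq_along because rows are being indexed rather than vector elements.

```r
data(iris)
iris$Species <- NULL

# No row has a first column below 4.0, so which() returns integer(0)
small <- which(iris[, 1] < 4.0)

# 1. Only use the negative index when there is something to drop
guarded <- if (length(small)) iris[-small, ] else iris

# 2. Logical subscript: keep the rows where the condition is not TRUE
by_logical <- iris[!(iris[, 1] < 4.0), ]

# 3. Work backwards: all row numbers minus the ones to drop
by_setdiff <- iris[setdiff(seq_len(nrow(iris)), small), ]

dim(guarded)
dim(by_logical)
dim(by_setdiff)
```

All three keep every row, whereas iris[-small, ] returns zero rows.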

Could it be that in your second example, small evaluates to 0?
Taking the zeroth element of a vector will always return the empty vector:
> foo <- 1:3
> foo
[1] 1 2 3
> foo[0]
integer(0)

Instead of using which to get your indices, I would use a boolean vector and negate it. That way you can do this:
small <- iris[, 1] < 4.0
smalltemp <- iris[!small, ]
dim(smalltemp)
[1] 150 4
EDIT: I don't think a negative index of 0 (as in your case) is allowed since there is no 0th index and thus R can't exclude that index from your selection. Negative indexing can be interpreted as: "give me back all rows except those with these indices".

It is because of the rules of what to do with an index that is zero. The non-zero indices must be either all positive or all negative, and a zero index selects nothing. As [0] returns nothing, and
R> -0 == 0
[1] TRUE
Hence you get nothing where you expected it to drop nothing.
The integer(0) issue is treated as indexing by a NULL and this is documented to work as if indexing by 0 and hence the same behaviour.
This is discussed in the R Language Definition manual
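The rules for zero and empty indices can be checked directly at the console. A small sketch, using a made-up vector x:

```r
x <- c(10, 20, 30)

x[0]            # a zero index selects nothing
x[c(0, 2)]      # zeros mixed with positive indices are ignored
x[c(0, -2)]     # zeros mixed with negative indices are ignored too
x[-0]           # -0 is 0, so this is the same as x[0]
x[NULL]         # NULL behaves like integer(0)
x[-integer(0)]  # negating an empty index leaves it empty
```

Only the second and third calls return any elements (20, and 10 30, respectively); the others return a zero-length vector.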

Related

R conditional replace in a 3d array

I want to conditionally replace values in a specific vector in a 3d array, the replacement value being a value from a probability calculation. For some reason the replacement value is the same for all values of the vector, rather than being calculated on an individual vector element basis. I must have something simple incorrect in my syntax
library (abind)
pop <- array(c (1,0,1,1,1,0,0,0,0,0,2,0,2,3,5), dim = c(1,5,3))
pop <- abind(pop,pop, along = 1)
so the particular vector I want to work on is
pop[dim(pop)[1], ,1]
[1] 1 0 1 1 1
what I want to achieve is to leave the zero value alone, and if the value is one, then run a random binomial test, to see if it changes to zero, and if it does change, do the insertion. I'm told that the ifelse is vectorized but with this syntax it is not operating individually on each element of the vector. When I try to produce a new vector as such
ifelse (pop[dim(pop)[1], ,1] == 1, rbinom(1,1,0.5), 0)
I get either no change
> ifelse (pop[dim(pop)[1], ,1] == 1, rbinom(1,1,0.5), 0)
[1] 1 0 1 1 1
or alternatively it changes all values.
> ifelse (pop[dim(pop)[1], ,1] == 1, rbinom(1,1,0.5), 0)
[1] 0 0 0 0 0
I'm expecting some of the values in the array to be changed, but not "all or nothing". What am I doing wrong? Also if there is a simple elegant way to do the substitution back into the original 3d array I'd be grateful. Thx. J
I think I did find a solution using the "modify_if" function of the purrr package (together with the %<>% assignment pipe from magrittr).
pop[dim(pop)[1], ,1] %<>% modify_if(~ .x == 1, ~ rbinom(1,1,pliv1))
HTH, J
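The "all or nothing" result most likely comes from rbinom(1, 1, 0.5) producing a single draw, which ifelse then recycles across every element. A base R sketch of one possible fix is to draw one value per element instead. The array is rebuilt here without abind so the example is self-contained, and the names last and v are illustrative only.

```r
# Same contents as abind(pop, pop, along = 1) in the question
vals <- c(1,0,1,1,1,0,0,0,0,0,2,0,2,3,5)
pop <- array(rep(vals, each = 2), dim = c(2, 5, 3))

set.seed(42)                 # only to make the sketch reproducible
last <- dim(pop)[1]
v <- pop[last, , 1]          # 1 0 1 1 1

# One independent draw per element; zeros are left untouched
pop[last, , 1] <- ifelse(v == 1, rbinom(length(v), 1, 0.5), 0)
pop[last, , 1]
```

Assigning to pop[last, , 1] writes the result straight back into the 3d array, which also covers the second part of the question.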

Get indices of two values that bracket zero in R

I have a vector x:
x <- c(-1,-0.5,-0.1,-0.001,0.5,0.6,0.9)
I want the index of the closest negative value to zero and the closest positive value to zero. In this case, 4 and 5. x is not necessarily sorted.
I can do this by setting numbers to NA:
# negative numbers only
tmp <- x
tmp[x > 0] <- NA
which.max(tmp)
# positive numbers only
tmp <- x
tmp[x < 0] <- NA
which.min(tmp)
But that seems clunky. Any tips?
good scenario
If you are in the classic case, where
1. your vector is sorted in increasing order,
2. it does not include 0,
3. it has no tied values,
you can simply do the following:
findInterval(0, x, TRUE) + 0:1
If condition 1 does not hold, but conditions 2 and 3 still hold, you can do
sig <- order(x)
sig[findInterval(0, x[sig], TRUE) + 0:1]
akrun's answer is fundamentally the same.
bad scenario
Things become tricky once your vector x contains 0 or tied / repeated values, because:
repeated values challenge sorting-based methods, as a sorting method like "quick sort" is not stable (see What is stability in sorting algorithms and why is it important? if you don't know what a stable sort is);
findInterval will locate exactly 0 when 0 is present.
In this situation, you have to adapt Ronak Shah's answer, which allows you to exclude 0. But be aware that which may give you multiple indexes if there are repeated values.
Another way could be:
#closest positive value to zero.
which(x == min(x[x > 0]))
#[1] 5
#closest negative value to zero
which(x == max(x[x < 0]))
#[1] 4
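A variant of the same idea that returns exactly one index per side, even with ties or zeros present, could use which.max and which.min on the two subsets. This is a sketch only; bracket_zero is a hypothetical helper name, and it returns fewer than two indices if one side is empty.

```r
# Hypothetical helper: index of the negative and of the positive value nearest zero
bracket_zero <- function(x) {
  neg <- which(x < 0)   # positions of strictly negative values
  pos <- which(x > 0)   # positions of strictly positive values (zeros are skipped)
  c(neg[which.max(x[neg])],   # largest negative value
    pos[which.min(x[pos])])   # smallest positive value
}

x <- c(-1, -0.5, -0.1, -0.001, 0.5, 0.6, 0.9)
bracket_zero(x)
#> [1] 4 5

y <- c(0.9, -0.001, 0.5, -1, 0, 0.6, -0.1)   # unsorted, contains 0
bracket_zero(y)
#> [1] 2 3
```

With tied values, which.max and which.min return the first match, so the result is always at most one index per side.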
We could try
rle(sign(x))$lengths[1] + 0:1
#[1] 4 5
if it is unsorted, then
x1 <- sort(x)
match(x1[rle(sign(x1))$lengths[1] + 0:1], x)

How to find if two or more continuously elements of a vector are equal in R

I want to find a way to determine if two or more consecutive elements of a vector are equal.
For example, in vector x=c(1,1,1,2,3,1,3), the first, the second and the third element are equal.
With the following command, I can determine whether a vector, say y, contains two or more consecutive elements that are equal to 2 or 3 (it returns TRUE when there are none):
all(rle(y)$lengths[which( rle(y)$values==2 | rle(y)$values==3 )]==1)
Is there any other faster way?
EDIT
Let's say we have the vector z=c(1,1,2,1,2,2,3,2,3,3).
I want a vector with three elements as output. The first element will refer to value 1, the second to 2 and the third one to 3. An element of the output vector will be 1 if the corresponding value appears in two or more consecutive positions of z, and 0 otherwise. So, the output for the vector z will be (1,1,1).
For the vector w=c(1,1,2,3,2,3,1) the output will be 1,0,0, since only the value 1 appears in two consecutive positions, namely the first and the second position of w.
I'm not entirely sure if I'm understanding your question as it could be worded better. The first part just asks how you find if continuous elements in a vector are equal. The answer is to use the diff() function combined with a check for a difference of zero:
z <- c(1,1,2,1,2,2,3,2,3,3)
sort(unique(z[which(diff(z) == 0)]))
# [1] 1 2 3
w <- c(1,1,2,3,2,3,1)
sort(unique(w[which(diff(w) == 0)]))
# [1] 1
But your edit example seems to imply you are looking to see whether any value is repeated consecutively in a vector that contains only the integers 1, 2, or 3. Your output will always be X, Y, Z, where
X is 1 if there is at least one "1" repeated, else 0
Y is 2 if there is at least one "2" repeated, else 0
Z is 3 if there is at least one "3" repeated, else 0
Is this correct?
If so, see the following
continuously <- function(x){
s <- sort(unique(x[which(diff(x) == 0)]))
output <- c(0,0,0)
output[s] <- s
return(output)
}
continuously(z)
# [1] 1 2 3
continuously(w)
# [1] 1 0 0
Assuming your series is z=c(1,1,2,1,2,2,3,2,3,3), you can do:
(unique(z[c(FALSE, diff(z) == 0)]) >= 0)+0 which will output 1, 1, 1.
When you run the above command on your other sequence:
w=c(1,1,2,3,2,3,1)
then (unique(w[c(FALSE, diff(w) == 0)]) >= 0)+0 will return 1.
You may also try this for an exact output like 1,1,1 or 1,0,0:
(unique(z[c(FALSE, diff(z) == 0)]) == unique(z))+0 #1,1,1 for z and 1,0,0 for w
Logic:
diff takes the difference between each item and the one before it. Since there is always one fewer difference than there are items, I have prepended FALSE for the first item. The resulting logical vector (is the difference zero or not?) is then used to subset the original sequence, and unique keeps each repeated value once. Finally we convert them to 1s by asking if they are greater than or equal to 0 (any other condition that is always true would do as well).
This assumes your sequence doesn't have negative numbers.
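For the 0/1 output described in the question's edit, a sketch built on a single rle() call may be a little more robust, since it does not depend on the order in which values first appear. The function name has_run and its values argument are hypothetical, chosen for illustration.

```r
# Hypothetical helper: 1 if a value occurs in a run of length >= 2, else 0
has_run <- function(x, values = sort(unique(x))) {
  r <- rle(x)                            # run lengths, computed once
  repeated <- r$values[r$lengths >= 2]   # values that head a run of 2 or more
  as.integer(values %in% repeated)
}

z <- c(1,1,2,1,2,2,3,2,3,3)
w <- c(1,1,2,3,2,3,1)

has_run(z)
#> [1] 1 1 1
has_run(w)
#> [1] 1 0 0
```

Passing values explicitly, for example has_run(w, values = 1:3), fixes the output length even when some value is absent from the input.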

R drop by empty index on vector inconsistent behaviour

Consider removing those elements from a vector that match a certain set of criteria. The expected behaviour is to remove those that match, and, in particular, if none match then remove none:
> d = 1:20
> d
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> d[-which(d > 10)]
[1] 1 2 3 4 5 6 7 8 9 10
> d[-which(d > 100)]
integer(0)
We see here that the final statement has both done something very unexpected and silently hidden the error without even a warning.
I initially thought that this was an undesirable (but consistent) consequence of the choice that an empty index selects all elements of a vector
http://stat.ethz.ch/R-manual/R-devel/library/base/html/Extract.html
as is commonly used to e.g. select the first column of a matrix, m, by writing
m[ , 1]
However the behaviour observed here is consistent with interpreting an empty vector as "no elements", not "all elements":
> a = integer(0)
selecting "no elements" works exactly as expected:
> v[a]
numeric(0)
however removing "no elements" does not:
> v[-a]
numeric(0)
For an empty vector to both select no elements and remove all elements requires inconsistency.
Obviously it is possible to work around this issue, either by checking that the which() returns non-zero length or using a logical expression as covered here In R, why does deleting rows or cols by empty index results in empty data ? Or, what's the 'right' way to delete?
but my two questions are:
Why is the behaviour inconsistent?
Why does it silently do the wrong thing without an error or warning?
This doesn't work because which(d > 100) and -which(d > 100) are the same object: there is no difference between an empty vector and the negative of that empty vector.
For example, imagine you did:
d = 1:10
indexer = which(d > 100)
negative_indexer = -indexer
The two variables would be the same (which is the only consistent behavior- turning all the elements of an empty vector negative leaves it the same since it has no elements).
indexer
#> integer(0)
negative_indexer
#> integer(0)
identical(indexer, negative_indexer)
#> [1] TRUE
At that point, you couldn't expect d[indexer] and d[negative_indexer] to give different results. There is also no place to provide an error or warning: it doesn't know when passed an empty vector that you "meant" the negative version of that empty vector.
The solution is that for subsetting there's no reason you need which() at all: you could use d[d > 10] instead of your original example. You could therefore use !(d > 100) or d <= 100 for your negative indexing. This behaves as you'd expect because d > 10 or !(d > 100) are logical vectors rather than vectors of indices.
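One caveat when swapping which() for a logical index: the two differ when the condition contains NA, because which() silently drops NA while a logical NA index yields an NA element. A short sketch with a made-up flag vector shows the difference and two ways to keep the which()-like behaviour without the empty-index trap:

```r
d <- 1:5
flag <- c(FALSE, NA, TRUE, FALSE, FALSE)   # illustrative condition with a missing value

plain_logical <- d[!flag]                               # NA in the index gives NA in the result
via_setdiff   <- d[setdiff(seq_along(d), which(flag))]  # NA treated as "do not drop"
via_in        <- d[!(flag %in% TRUE)]                   # same, staying with logical indexing

plain_logical
#> [1]  1 NA  4  5
via_setdiff
#> [1] 1 2 4 5
via_in
#> [1] 1 2 4 5

# Both alternatives also behave when nothing matches
d[setdiff(seq_along(d), which(d > 100))]
#> [1] 1 2 3 4 5
```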

ifelse with for loop

I would like to traverse through rows of a matrix and perform some operations on data entries based on a condition.
Below is my code
m = matrix(c(1,2,NA,NA,5,NA,NA,1,NA,NA,NA,NA,4,5,NA,NA,NA,NA,NA,NA), nrow = 5, ncol = 4)
if (m[,colSums(!is.na(m)) > 1, drop = FALSE]){
for(i in 1:4){
a = which(m[i,] != "NA") - mean(which(!is.na(m[i,])))
for(j in 2:5){
b = which(m[j,] != "NA") - mean(which(!is.na(m[j,])))
prod(a,b)
}
}
}
I get a warning message as below in my "if" condition
Warning message:
In if (m[, colSums(!is.na(m)) > 1, drop = FALSE]) { :
the condition has length > 1 and only the first element will be used
I know it returns a vector and I should be using ifelse block. How to incorporate for loops inside ifelse block? It seems to be a basic question, I am new to R.
Based on your description, you want to check the number of non-NA entries in each column of the matrix and then do something depending on the result (that is why you need an "if"/"ifelse" statement). So you can implement it as below, and write the inner loops in a separate function.
yourFunc <- function(x, data) {
# do what you want here, e.g. run your loops on "data"
# as a sample, this just returns 1 or 0 so you can check the result
if(x > 1) 1
else 0
}
m = matrix(c(1,2,NA,NA,5,NA,NA,1,NA,NA,NA,NA,4,5,NA,NA,NA,NA,NA,NA), nrow = 5, ncol = 4)
# use "apply" series function in here
sapply(colSums(!is.na(m)), yourFunc, data=m)
#[1] 1 0 1 0
Actually, I think you need to re-organize your problem and optimize the code, the "ifelse with for loop" may be totally unnecessary.
As you are new to R, I assume that some of the terminology is maybe a bit
confusing. So here is a little explanation regarding the if statement.
Let's look at the if condition:
m[,colSums(!is.na(m)) > 1, drop = FALSE]
[,1] [,2]
[1,] 1 NA
[2,] 2 NA
[3,] NA 4
[4,] NA 5
[5,] 5 NA
This is nothing that if can work with, as an if condition has to be
a single boolean value (evaluate to one TRUE or FALSE). So why the result? Well the result of
colSums(!is.na(m))
[1] 3 1 2 0
is a vector of counts of entries that are not NA! (= number of TRUE's in each column). Be careful as this is not the same as
colSums(m, na.rm = TRUE)
[1] 8 1 9 0
which returns a vector of sums over all five rows for each column, excluding NA's. My guess is that the latter is what you are looking for. In any case: be aware of the difference!
By asking which of those sums is greater than 1 you do get a boolean vector
colSums(!is.na(m)) > 1
[1] TRUE FALSE TRUE FALSE
However, using that boolean vector as a criterion for selecting columns, you correctly get a matrix which is obviously not boolean:
m[,colSums(!is.na(m)) > 1]
Note: drop = FALSE is unnecessary here as more than one column is selected, so no dimension can be dropped. See ?"[" or ?drop. You can verify this using identical:
identical(m[,colSums(!is.na(m)) > 1, drop = FALSE],
m[,colSums(!is.na(m)) > 1])
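To give if a usable condition, the logical vector can be collapsed to a single TRUE/FALSE with any() or all(). The sketch below assumes the intent is "proceed if at least one column has more than one non-NA entry", which is a guess at what the question wants; keep and sub are illustrative names. As far as I know, a condition of length greater than one is an error rather than a warning from R 4.2.0 onwards, so the reduction is needed either way.

```r
m <- matrix(c(1,2,NA,NA,5,NA,NA,1,NA,NA,NA,NA,4,5,NA,NA,NA,NA,NA,NA), nrow = 5, ncol = 4)

# One logical value per column: does it hold more than one non-NA entry?
keep <- colSums(!is.na(m)) > 1

# any() collapses the vector to the single TRUE/FALSE that if() expects
if (any(keep)) {
  # drop = FALSE keeps a matrix even if only one column were selected
  sub <- m[, keep, drop = FALSE]
  print(dim(sub))
}
```

Here two columns qualify, so sub is a 5 x 2 matrix.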
Now to the loop. You find tons of discussions on avoiding for loops and using the apply family of functions. I suspect you have to take some time to go through all that. Note however, that using apply - contrary to common belief - is not necessarily superior to a for loop in terms of speed, as it is actually just a fancy wrapper around a for loop (check the source code!). It is, however, clearly superior in terms of code clarity as it is compact and clear about what it is doing. So do try to use apply functions if possible!
In order to rewrite your loop it would be helpful if you could verbally
describe what you actually want to do, since I assume that what the loop
is doing right now is probably not what you want. As which() returns the index/position of an element in a vector or matrix, what you are basically
doing is:
indices of the i'th row that are not NA (for a given column) - mean over these indices
While this is theoretically possible, it usually doesn't make much sense. So with all my notes at hand: clearly state your problem so we can think of a fix.

Resources