Conditional searching which omits NA values - r

I'm doing a conditional search of part of a dataset that has multiple NA values within each row.
Something like this (a preview)..
    time1 time2 time3 time4 slice1 slice2 slice3 slice4
pt1     1     3    NA    NA     NA      1      3      5
pt2    NA     1     3     5      5      2      2      4
I want to do some conditional searching which applies a condition (comparing whether one column within a row is larger than another) to each row. I want to find all the rows (pts) where a variable column (e.g. time1) is smaller than the corresponding column (e.g. slice1).
all.smaller<-subset(patientdata, time1>slice1 & time2>slice2 & time3>slice3 & time4>slice4, na.rm=TRUE, select=c(1))
When I use this code (on a larger expanded table of this format), it only returns the rows without any NAs, where all the values are filled in. This makes sense given the use of '&'.
My question is: is there a way to run this conditional search so that it ignores the NAs, i.e. returns the rows where every pair of columns that has values in both satisfies the comparison (time1>slice1, time2>slice2, etc.)?
Any help is appreciated. Thanks.

You can make a function that takes a boolean (possibly NA) and maps it to TRUE if it is NA and its value otherwise.
na.true <- function(x) ifelse(is.na(x), TRUE, x)
You can then replace the condition in your subset call with
na.true(time1 > slice1) & na.true(time2 > slice2) & na.true(time3 > slice3) & na.true(time4 > slice4)
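For illustration, here is a minimal sketch of this approach applied to the two preview rows from the question. The patientdata values are retyped from the preview, so treat them as assumed sample data rather than the asker's real table.

```r
# Assumed sample data, retyped from the preview in the question
patientdata <- data.frame(
  time1 = c(1, NA), time2 = c(3, 1), time3 = c(NA, 3), time4 = c(NA, 5),
  slice1 = c(NA, 5), slice2 = c(1, 2), slice3 = c(3, 2), slice4 = c(5, 4),
  row.names = c("pt1", "pt2")
)

# Map NA to TRUE so a missing comparison never rules a row out
na.true <- function(x) ifelse(is.na(x), TRUE, x)

all.smaller <- subset(
  patientdata,
  na.true(time1 > slice1) & na.true(time2 > slice2) &
    na.true(time3 > slice3) & na.true(time4 > slice4)
)

rownames(all.smaller)
```

With this sample, pt1 is kept because every comparison it has data for is TRUE, while pt2 drops out because time2 > slice2 is FALSE (1 versus 2).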

You could try this.
n=1:4
cond <- paste0('((is.na(time',n,')|is.na(slice',n,'))|(time',n,'>slice',n,'))')
conds <- paste(cond, collapse=' & ')
all.smaller <- subset( patientdata, eval(parse(text=conds)) )
Essentially, for each index this checks whether either time or slice is NA and, if so, forces a TRUE; otherwise it checks whether time is greater than slice. It becomes clearer if you print out conds to see what it looks like.
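The same rule can probably be expressed without building code as a string. The sketch below is an alternative, not the method from the answer above: it compares the time and slice columns as matrices and treats NA comparisons as TRUE. The sample data and the helper names (times, slices, cmp, keep) are assumptions for illustration.

```r
# Assumed sample data, retyped from the preview in the question
patientdata <- data.frame(
  time1 = c(1, NA), time2 = c(3, 1), time3 = c(NA, 3), time4 = c(NA, 5),
  slice1 = c(NA, 5), slice2 = c(1, 2), slice3 = c(3, 2), slice4 = c(5, 4),
  row.names = c("pt1", "pt2")
)

n <- 1:4
times  <- as.matrix(patientdata[paste0("time", n)])
slices <- as.matrix(patientdata[paste0("slice", n)])

# Element-wise comparison; NA wherever either value is missing
cmp <- times > slices

# A missing comparison should not exclude the row
cmp[is.na(cmp)] <- TRUE

# Keep rows where every remaining comparison holds
keep <- apply(cmp, 1, all)
all.smaller <- patientdata[keep, , drop = FALSE]
```

Changing n is enough to handle a different number of time/slice pairs, as long as the columns follow the same naming pattern.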

Related

How to test whether each element in a column of values falls between values in two other columns?

This may be a very convoluted way of asking this question. I have a column of "results" that I want to test against statistics of previous results, namely calculated minimum and maximum values. If the value in the result column falls between the corresponding min and max values, I want to assign it as "1" in a fourth column named Within_range and if not, "0".
I have tried using relational operators (<,>)
df$Within_Range <- if(df$Result > df$Min & df$Result < df$Max){"1"} else {"0"}
and got this:
In if (df$Result > df$Min & df$Result < df$Max) { :
the condition has length > 1 and only the first element will be used
R did not seem to like that I tried to use multiple conditions, so I tried using between()
df$Within_Range <- if(between(df$Result,df$Min,df$Max)){"1"} else {"0"}
and I got this:
Error: Expecting a single value: [extent=20511].
Here is some example code:
Result <- 1:5
Min <- c(2,1,2,3,4)
Max <- c(3,4,5,8,7)
df <- data.frame(Result, Min, Max)
Apologies if this is a silly question; I am still new to R and hours of searching R forums returned nothing helpful... I am stuck.
The error suggests that, in the dplyr version used here, between() is not vectorized over its left and right arguments, so we need comparison operators instead
df$Within_Range <- with(df, +(Result > Min & Result < Max))
NOTE: Change to >= or <= if the range should also include the Min, Max values
Also, in the first piece of code, the if/else is unnecessary for two reasons:
if/else is not vectorized, i.e. it expects a condition of length 1 and returns a single value (df$Result and the other columns obviously have length greater than 1).
The TRUE/FALSE output from comparison operators is stored as 1/0 values, so we just need to coerce it to binary with as.integer or +.
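As a rough sketch of the two points above, using the example data from the question: the comparison is evaluated once for the whole column, and the logical result is then coerced. The in_range and Within_Range_chr names are introduced here only for illustration.

```r
Result <- 1:5
Min <- c(2, 1, 2, 3, 4)
Max <- c(3, 4, 5, 8, 7)
df <- data.frame(Result, Min, Max)

# Vectorized comparison: one TRUE/FALSE per row (strict bounds)
in_range <- df$Result > df$Min & df$Result < df$Max

# Coerce the logical vector to 1/0
df$Within_Range <- as.integer(in_range)

# If character "1"/"0" labels are really wanted, ifelse() is the vectorized if/else
df$Within_Range_chr <- ifelse(in_range, "1", "0")
```

For this data the first row falls outside its range and the other four fall inside.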
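With a dplyr version in which between() accepts vector bounds, this also works (note that between() includes the Min and Max values themselves):
library(dplyr)
df %>% mutate(Within_Range = between(Result, Min, Max))
## Output
  Result Min Max Within_Range
1      1   2   3        FALSE
2      2   1   4         TRUE
3      3   2   5         TRUE
4      4   3   8         TRUE
5      5   4   7         TRUE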

Different results for 2 subset data methods in R

I'm subsetting my data, and I'm getting different results for the following code:
subset(df, x==1)
df[df$x==1,]
x's type is integer
Am I doing something wrong?
Thank you in advance
Without example data, it is difficult to say what your problem is. However, my hunch is that the following probably explains your problem:
df <- data.frame(quantity=c(1:3, NA), item=c("Coffee", "Americano", "Espresso", "Decaf"))
df
quantity item
1 Coffee
2 Americano
3 Espresso
NA Decaf
Let's subset with [
df[df$quantity == 2,]
quantity item
2 Americano
NA <NA>
Now let's subset with subset:
subset(df, quantity == 2)
quantity item
2 Americano
We see that there is a difference in sub-setting output depending on how NA values are treated. I think of this as follows: With subset, you are explicitly stating you want the subset for which the condition is verifiably true. df$quantity==2 produces a vector of true/false-statements, but where quantity is missing, it is impossible to assign TRUE or FALSE. This is why we get the following output with an NA at the end:
df$quantity==2
[1] FALSE TRUE FALSE NA
The function [ takes this vector, but an NA in a logical index selects an unknown row, so instead of NA Decaf we get a row filled with NA (NA <NA>). If you prefer using [, you could use the following instead:
df[which(df$quantity == 2),]
quantity item
2 Americano
This translates the logical condition df$quantity == 2 into a vector of row numbers where the logical condition is "verifiably" satisfied.
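To see the variants side by side, here is a small sketch using the same example data. The variable names (idx, by_bracket, and so on) are made up for this illustration.

```r
df <- data.frame(
  quantity = c(1:3, NA),
  item = c("Coffee", "Americano", "Espresso", "Decaf")
)

idx <- df$quantity == 2            # FALSE TRUE FALSE NA

by_bracket <- df[idx, ]            # keeps an extra all-NA row for the NA index
by_which   <- df[which(idx), ]     # which() drops NA, keeping only TRUE positions
by_mask    <- df[!is.na(idx) & idx, ]  # explicit NA-safe logical mask
by_subset  <- subset(df, quantity == 2)  # subset() treats NA as FALSE

c(nrow(by_bracket), nrow(by_which), nrow(by_mask), nrow(by_subset))
```

Only the plain logical index returns two rows; the other three forms agree with each other.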

Excluding a number of answers from a R dataframe

I'm looking for a way to exclude a number of answers from a length function.
This is a follow-on question from "Getting R Frequency counts for all possible answers". In SQL the syntax could be
select * from someTable
where variableName not in ( 0, null )
Given
Id <- c(1,2,3,4,5)
ClassA <- c(1,NA,3,1,1)
ClassB <- c(2,1,1,3,3)
R <- c(5,5,7,NA,9)
S <- c(3,7,NA,9,5)
df <- data.frame(Id,ClassA,ClassB,R,S)
ZeroTenNAScale <- c(0:10,NA);
R.freq = setNames(nm=c('R','freq'),data.frame(table(factor(df$R,levels=ZeroTenNAScale,exclude=NULL))));
S.freq = setNames(nm=c('S','freq'),data.frame(table(factor(df$S,levels=ZeroTenNAScale,exclude=NULL))));
length(S.freq$freq[S.freq$freq!=0])
# 5
How would I change
length(S.freq$freq[S.freq$freq!=0])
to get an answer of 4 by excluding 0 and NA?
We can use colSums,
colSums(!is.na(S.freq)[S.freq$freq!=0,])[[1]]
#[1] 4
You can use sum to add up the frequencies. If the NAs were in the column being summed you could use the na.rm = TRUE argument; however, because the NA is located in a different column, you first need to remove the row containing NA.
The solution is as follows: we remove the rows containing NA by subsetting S.freq[!is.na(S.freq$S),], and select the second column freq. Note that this sums the counts, so it matches the number of distinct answers only when each answer occurs once, as in this example:
sum(S.freq[!is.na(S.freq$S), "freq"])
# 4
You can try na.omit (to remove NAs) and subset (to get rid of all rows where freq equals 0):
subset(na.omit(S.freq), freq != 0)
S freq
4 3 1
6 5 1
8 7 1
10 9 1
From here, that's straightforward:
length(subset(na.omit(S.freq), freq != 0)$freq)
[1] 4
Does it solve your problem?
Just add !is.na(S.freq$S) as a second filter:
length(S.freq$freq[S.freq$freq!=0 & !is.na(S.freq$S)])
If you want to extend it with other conditions, you could make an index vector first for readability:
idx <- S.freq$freq!=0 & !is.na(S.freq$S)
length(S.freq$freq[idx])
You're looking for values with frequency > 0, that means you're looking for unique values. You get this information directly from vector S:
length(unique(df$S))
and leaving NA aside you get answer 4 by:
length(unique(df$S[!is.na(df$S)]))
Regarding your question on how to exclude a number of items based on their value:
In R this is easily done with logical vectors, as you already did in your code:
length(S.freq$freq[S.freq$freq!=0])
you can combine different conditions to one logical vector and use it for subsetting e.g.
length(S.freq$freq[S.freq$freq!=0 & !is.na(S.freq$S)])
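As a compact cross-check, the count can also be taken straight from the raw vector. This is a sketch that assumes the 0 to 10 scale from the question and relies on factor() dropping NA from the levels by default; the counts and n_answers names are introduced here for illustration.

```r
S <- c(3, 7, NA, 9, 5)

# Frequencies over the full 0..10 scale; NA is not counted by default
counts <- table(factor(S, levels = 0:10))

# Number of answers on the scale that occur at least once
n_answers <- sum(counts > 0)

# Same number from the distinct non-missing values
n_unique <- length(unique(S[!is.na(S)]))
```

Both n_answers and n_unique come out as 4 for this data.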

Filtering rows in R unexpectedly removes NAs when using subset or dplyr::filter

I have a dataset df and I would like to remove all rows for which variable y does not have the value a. Variable y also contains some NAs:
df <- data.frame(x=1:3, y=c('a', NA, 'c'))
I can achieve this using R's indexing syntax like this:
df[df$y!='a',]
x y
2 <NA>
3 c
Note this returns both the NA and the value c - which is what I want.
However, when I try the same thing using subset or dplyr::filter, the NA gets stripped out:
subset(df, y!='a')
x y
3 c
dplyr::filter(df, y!='a')
x y
3 c
Why do subset and dplyr::filter work like this? It seems illogical to me - an NA is not the same as a, so why strip out the NA when I specify I want all rows except those where variable y equals a?
And is there some way to change the behaviour of these functions, other than explicitly asking for NAs to get returned, i.e.
subset(df, y!='a' | is.na(y))
Thanks
Your example of the "expected" behavior doesn't actually return what you display in your question. I get:
> df[df$y != 'a',]
x y
NA NA <NA>
3 3 c
This is arguably more wrong than what subset and dplyr::filter return. Remember that in R, NA really is intended to mean "unknown", so df$y != 'a' returns,
> df$y != 'a'
[1] FALSE NA TRUE
So R is being told you definitely don't want the first row, you do want the last row, but whether you want the second row is literally "unknown". As a result, it includes a row of all NAs.
Many people dislike this behavior, but it is what it is.
subset and dplyr::filter make a different default choice which is to simply drop the NA rows, which arguably is accurate-ish.
But really, the lesson here is that if your data has NAs, that just means you need to code defensively around that at all points, either by using conditions like is.na(df$y) | df$y != 'a', or as mentioned in the other answer by using %in% which is based on match.
From ?base::Extract:
When extracting, a numerical, logical or character NA index picks an unknown element and so returns NA
From ?base::subset:
missing values are taken as false [...] For ordinary vectors, the result is simply x[subset & !is.na(subset)]
From ?dplyr::filter:
Unlike base subsetting with [, rows where the condition evaluates to NA are dropped
One workaround is to use %in%:
subset(df, !y %in% "a")
dplyr::filter(df, !y %in% "a")
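The options discussed in this thread can be compared in a short base R sketch. It uses the example data from the question; the result names (keep_explicit, keep_in, dropped_na) are made up for illustration.

```r
df <- data.frame(x = 1:3, y = c("a", NA, "c"))

# NA-aware condition: keep rows where y is missing or differs from "a"
keep_explicit <- df[is.na(df$y) | df$y != "a", ]

# %in% never returns NA; a missing y simply does not match "a"
keep_in <- subset(df, !y %in% "a")

# subset() alone treats the NA comparison as FALSE and drops that row
dropped_na <- subset(df, y != "a")
```

The first two keep rows 2 and 3, including the NA row; the last keeps only row 3.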

R - How to compare values across more than two columns

I'm trying to write code to compare the values of several columns, and I don't know ahead of time how many columns I will have. The data will look like this:
X Val1 Val2 Val3 Val4
A    1    1    1    2
B   NA    2    2    2
C    3    3    3    3
The code should return a Fail for rows A and B, and a Pass for row C, but needs to be able to handle a changing number of columns. I can't figure out how to do this without nesting a couple of for loops, but there has to be some way to use apply or sapply to iterate through columns 2: length(df)
EDIT: I want to see if the values (which will be numbers) are equal
Assuming that the first column is excluded from the comparison and that all the other columns are not, you can try:
which(rowSums(df[,2]==df[,3:ncol(df)])==(ncol(df)-2))
You can use apply with a custom function, length(unique(x)), to count the number of unique values in each row across columns 2:ncol(df). You can then wrap the whole thing in ifelse to return a TRUE/FALSE vector.
ifelse(apply(df[ , 2:ncol(df)], MARGIN=1, function(x) length(unique(x))) == 1, TRUE, FALSE)
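A small end-to-end sketch of the apply idea, using the table from the question. The Status column and the vals and all_equal names are assumptions added for illustration; the Pass/Fail labels follow the question's wording.

```r
df <- data.frame(
  X = c("A", "B", "C"),
  Val1 = c(1, NA, 3), Val2 = c(1, 2, 3),
  Val3 = c(1, 2, 3), Val4 = c(2, 2, 3)
)

# All columns except the first, however many there are
vals <- df[, -1, drop = FALSE]

# A row passes when it contains exactly one distinct value;
# unique() keeps NA as its own value, so a row with NA and 2 fails
all_equal <- apply(vals, 1, function(x) length(unique(x)) == 1)

df$Status <- ifelse(all_equal, "Pass", "Fail")
```

For this data rows A and B are marked Fail and row C is marked Pass, which matches what the question asks for.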
