Why does is.na() change its argument? - r

I just discovered the following behaviour of the is.na() function which I don't understand:
df <- data.frame(a = 5:1, b = "text")
df
## a b
## 1 5 text
## 2 4 text
## 3 3 text
## 4 2 text
## 5 1 text
is.na(df)
## a b
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE FALSE
## [4,] FALSE FALSE
## [5,] FALSE FALSE
is.na(df) <- "0"
df
## a b 0
## 1 5 text NA
## 2 4 text NA
## 3 3 text NA
## 4 2 text NA
## 5 1 text NA
My question
Why does is.na() change its argument (and in this case add an extra column to the data frame)? Its behaviour here seems extra puzzling (or at least unexpected) because the result of the query is FALSE for all instances.
NB
This question is not about subsetting and changing the NA values in a data frame - I know how to do that (df[is.na(df)] <- "0"). This question is about the behaviour of the is.na function! Why does an assignment to an is.something function change the argument itself? This is unexpected.

The actual function being used here is not is.na() but the assignment function `is.na<-`, for which the default method is `is.na<-.default`. Printing that function to console we see:
function (x, value)
{
x[value] <- NA
x
}
So clearly, value is supposed to be an index here. If you index a data.frame like df["0"], it will try to select the column named "0". If you assign something to df["0"], the column will be created and filled with (in this case) NA.
To clarify, `is.na<-` sets values to NA, it does not replace NA values with something else.
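A minimal sketch of the difference between the query and the replacement form. The objects x, y and d are made up for illustration and are not from the question:

```r
# is.na(x) only queries; `is.na<-` is a separate replacement function.
x <- c(10, 20, 30)
is.na(x)            # query: x is left unchanged
is.na(x) <- 2       # replacement: sets element 2 to NA
x

# The replacement call above is shorthand for calling `is.na<-` directly.
y <- c(10, 20, 30)
y <- `is.na<-`(y, value = 2)
identical(x, y)

# With a data frame, a character value is treated as a column name,
# so a name that does not exist yet creates a new all-NA column.
d <- data.frame(a = 1:3)
is.na(d) <- "0"
names(d)
```

Any assignment of the form f(x) <- value is evaluated by R as x <- `f<-`(x, value), which is why the object on the left-hand side is modified.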

Related

Removing NA values from a calculation

I have some problems with NA values. The columns of my dataset (imported from Excel) do not all have the same length, so the shorter ones are padded with NA. All rows containing an NA value are deleted when I calculate the similarity index with the Psicalc function in the RInSp package.
B F
4 7
5 6
6 8
7 5
NA 4
NA 3
NA 2
Do you know how to handle the NA values, or remove them, without deleting whole rows or affecting the package? Besides, when I run import.RinSP I get this message:
In if (class(filename) == "character") { :
the condition has length > 1 and only the first element will be used
Thank you so much
Many R functions (specifically in base R) have an na.rm argument, which is FALSE by default. That means if you omit this argument and your data contains NA, your "calculation" will result in NA. To ignore the NA values in the calculation, set na.rm = TRUE.
Example:
x <- c(4,5,6,7,NA,NA)
mean(x) # Oops!
[1] NA
mean(x, na.rm=TRUE)
[1] 5.5
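For a data frame like the one in the question, where shorter columns are padded with NA, the same idea can be applied per column. This is only a sketch in base R; it does not use RInSp, and whether that package's functions accept NA-padded input is not covered here. The data frame d is rebuilt from the values shown in the question:

```r
# Columns of unequal length, padded with NA as in the question.
d <- data.frame(B = c(4, 5, 6, 7, NA, NA, NA),
                F = c(7, 6, 8, 5, 4, 3, 2))

# Per-column statistics that skip NA without deleting whole rows.
col_means <- colMeans(d, na.rm = TRUE)
col_means

# Number of non-missing values in each column.
n_valid <- sapply(d, function(col) sum(!is.na(col)))
n_valid

# By contrast, na.omit() drops every row that has an NA in any column.
n_complete <- nrow(na.omit(d))
n_complete
```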

How to use apply over a vector?

Suppose I have a data.frame like
a <- data.frame(col1=1:6,
col2=c('a','b',1,'c',2,3),
stringsAsFactors=F)
a
col1 col2
1 1 a
2 2 b
3 3 1
4 4 c
5 5 2
6 6 3
I want to have a vector saying which rows have col2 as a number. I'm trying something like
apply(a$col2,1,is.numeric)
or
apply(a$col2,FUN=is.numeric)
but it always says
Error in apply(a$col2, 1, is.numeric) :
dim(X) must have a positive length
If a$col2 (the X in apply) must be a matrix, then why does the help from the function say:
X: an array, including a matrix.
The help on arrays says:
An array in R can have one, two or more dimensions.
If an array can have only one dimension, then why can't a one-dimensional array be used in apply? What am I missing here?
(Beyond that, I still would like to know how to find the numeric rows in col2 without using a loop.)
First note that even the numbers in col2 are character since when combined with other elements which are character they get coerced to character.
str(a)
## 'data.frame': 6 obs. of 2 variables:
## $ col1: int 1 2 3 4 5 6
## $ col2: chr "a" "b" "1" "c" ...
1) grepl Since col2 is character, we should use character processing like this:
grepl("^\\d+$", a$col2)
## [1] FALSE FALSE TRUE FALSE TRUE TRUE
grepl is already vectorized so we don't need an apply or related function to iterate over the elements of col2.
2) (s)apply These also work but seem unnecessarily involved given that grepl alone works:
sapply(a$col2, grepl, pattern = "^\\d+$")
## a b 1 c 2 3
## FALSE FALSE TRUE FALSE TRUE TRUE
apply(array(a$col2), 1, grepl, pattern = "^\\d+$")
## [1] FALSE FALSE TRUE FALSE TRUE TRUE
3) type.convert Another approach is to use type.convert which will convert to numeric if it can be represented as one. Then we can use is.numeric.
sapply(a$col2, function(x) is.numeric(type.convert(x, as.is = TRUE)))
## a b 1 c 2 3
## FALSE FALSE TRUE FALSE TRUE TRUE
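On the original question of why apply() fails: a plain vector has no dim attribute at all, whereas a one-dimensional array does. The sketch below shows this, plus one further base R way to flag numeric-looking strings. The variable names col2 and is_num are illustrative:

```r
col2 <- c("a", "b", "1", "c", "2", "3")

# A plain vector has no dim attribute, which is why apply() rejects it.
is.null(dim(col2))

# A one-dimensional array does carry a dim attribute (here: 6).
dim(array(col2))

# Alternative check: values that fail numeric conversion become NA.
# Unlike the "^\\d+$" pattern, this also accepts forms such as
# decimals and negative numbers.
is_num <- !is.na(suppressWarnings(as.numeric(col2)))
is_num
```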

Filter dataframe based on presence of sample in a separate list

I want to filter a dataframe with 1212 rows so that it only contains the samples listed in a separate list. The list has multiple values and I can't work out how to do this.
The df below is called RNASeq2
RNASeq2Norm_samples Substrng_RNASeq2Norm
1 TCGA-3C-AAAU-01A-11R-A41B-07 TCGA.3C.AAAU
2 TCGA-3C-AALI-01A-11R-A41B-07 TCGA.3C.AALI
3 TCGA-3C-AALJ-01A-31R-A41B-07 TCGA.3C.AALJ
4 TCGA-3C-AALK-01A-11R-A41B-07 TCGA.3C.AALK
5 TCGA-4H-AAAK-01A-12R-A41B-07 TCGA.4H.AAAK
6 TCGA-5L-AAT0-01A-12R-A41B-07 TCGA.5L.AAT0
7 TCGA-5L-AAT1-01A-12R-A41B-07 TCGA.5L.AAT1
8 TCGA-5T-A9QA-01A-11R-A41B-07 TCGA.5T.A9QA
.
.
.
1212
list = intersect_samples
intersect_samples: "TCGA.3C.AAAU" "TCGA.3C.AALI" "TCGA.3C.AALJ" "TCGA.3C.AALK" ... 1097
I have tried this code but it returns all the original 1212 samples:
RNASeq_filtered <- RNASeq2[RNASeq2$Substrng_RNASeq2Norm %in% intersect_samples,]
Yet if I try
RNASeq_filtered <- RNASeq2[RNASeq2$Substrng_RNASeq2Norm %in% "TCGA.3C.AAAU",]
it will return the correct row
str(RNASeq2)
'data.frame': 1212 obs. of 2 variables:
$ RNASeq2 : Factor w/ 1212 levels "TCGA-3C-AAAU-01A-11R-A41B-07",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Substrng_RNASeq2Norm: Factor w/ 1093 levels "TCGA.3C.AAAU",..: 1 2 3 4 5 6 7 8 9 10 ...
str(intersect_samples)
chr [1:1093] "TCGA.3C.AAAU" "TCGA.3C.AALI" "TCGA.3C.AALJ" "TCGA.3C.AALK" "TCGA.4H.AAAK" ...
AFAIK R does not offer a convenience function to find a vector of search strings in a vector of strings using partial matching ("sub-strings").
%in% is not the right function if you want to find a sub-string within a string since it compares only whole strings.
Instead use base R's grepl or the probably faster stri_detect_fixed function of the excellent stringi package.
Please note that I have abstracted the code and data (instead of using your code and data) for easier understanding.
library(stringi)
pattern = c("23", "45", "999")
data <- data.frame(row_num = 1:4,
string = c("123", "234", "345", "xyz"),
stringsAsFactors = FALSE)
# row_num string
# 1 1 123
# 2 2 234
# 3 3 345
# 4 4 xyz
string <- data$string # the column that contains the values to be filtered
# Iterate over each element in pattern and apply it to the string vector.
# Returns a logical vector of the same length as string (TRUE = found, FALSE = not found)
selected <- lapply(pattern, function(x) stri_detect_fixed(string, x))
# Or shorter:
# lapply(pattern, stri_detect_fixed, str = string)
selected # show the result (it is a list of logical vectors - one per search pattern element)
# [[1]]
# [1] TRUE TRUE FALSE FALSE
#
# [[2]]
# [1] FALSE FALSE TRUE FALSE
#
# [[3]]
# [1] FALSE FALSE FALSE FALSE
# "row-wise" reduce the logical vectors into one final vector using the logical "or" operator
# WARNING: Does not handle `NA`s correctly (FALSE | NA is NA; only TRUE | NA is TRUE)
selected.rows <- Reduce("|", selected)
# [1] TRUE TRUE TRUE FALSE
# To handle NAs correctly (if you have NAs) you can use this (slower) code:
selected.rows <- rowSums(as.data.frame(selected), na.rm = TRUE) > 0
# Use the logical vector as row selector (TRUE returns the row, FALSE ignores the row):
string[selected.rows]
# [1] "123" "234" "345"
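If adding a package is not an option, roughly the same filtering can be sketched in base R with grepl. This is an alternative to the stringi code above, not a replacement for it; the names hits and selected_rows are illustrative, and an NA was added to the data to show how it is treated:

```r
pattern <- c("23", "45", "999")
string  <- c("123", "234", "345", "xyz", NA)

# %in% compares whole strings only, so no substring is found here.
whole_match <- string %in% pattern
whole_match

# One logical column per pattern; fixed = TRUE disables regex interpretation.
hits <- vapply(pattern, function(p) grepl(p, string, fixed = TRUE),
               logical(length(string)))

# grepl() returns FALSE (not NA) for NA input, so no NA handling is needed.
selected_rows <- rowSums(hits) > 0
string[selected_rows]
```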

How do I use grep on a data frame?

I have the following data frame:
> my.data
A.Seats B.Seats
1 14,15 14,15,16
2 7 7,8
3 12,13 16,17
4 <NA> 10,11
I would like to check if the string within any row in column "A.Seats" is found within the same row of column "B.Seats". So the output would look something like this:
A.Seats B.Seats Check
1 14,15 14,15,16 TRUE
2 7 7,8 TRUE
3 12,13 16,17 FALSE
4 <NA> 10,11 FALSE
But I don't know how to create this table. As a start, I tried using grep:
grep(my.data$A.Seats,my.data$B.Seats)
But I receive the following output
[1] 1
Warning message:
In grep(my.data$A.Seats, my.data$B.Seats) :
argument 'pattern' has length > 1 and only the first element will be used
...and I can't get past this error. Any ideas as to how I can get the intended result?
Many Thanks
The "stringi" library has some vectorized functions that might be useful for something like this. I would suggest the stri_detect() function. Here's an example with some reproducible sample data. Note the difference in the values in the first and last row, and the difference in the results according to whether a regex or fixed approach was taken:
my.data <- data.frame(
A.Seats = c("14,15", "7", "12,13", NA, "14,19"),
B.Seats = c("14,15,16", "7,8", "16,17", "10,11", "14,15,16"))
my.data
# A.Seats B.Seats
# 1 14,15 14,15,16
# 2 7 7,8
# 3 12,13 16,17
# 4 <NA> 10,11
# 5 14,19 14,15,16
library(stringi)
stri_detect(my.data$B.Seats, fixed = my.data$A.Seats)
# [1] TRUE TRUE FALSE NA FALSE
stri_detect(my.data$B.Seats, regex = gsub(",", "|", my.data$A.Seats))
# [1] TRUE TRUE FALSE NA TRUE
The first option above treats the values in my.data$A.Seats as a fixed string pattern. The second option treats it as a regular expression to match any of the values.
Note that this maintains NA as NA, but that can easily be changed to FALSE if you need to.
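As a sketch of the same row-wise check in base R, with a missing pattern treated as FALSE rather than NA. The vectors A and B copy the sample data above, and check is an illustrative name:

```r
A <- c("14,15", "7", "12,13", NA, "14,19")
B <- c("14,15,16", "7,8", "16,17", "10,11", "14,15,16")

# Row-wise literal substring test; a missing pattern counts as "not found".
check <- mapply(function(p, x) !is.na(p) && grepl(p, x, fixed = TRUE),
                A, B, USE.NAMES = FALSE)
check
```

Because this is plain substring matching, a pattern such as "7" would also be found inside "17", so it may be too permissive for seat numbers.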
If you don't want to think too much about mapply, you can consider Vectorize to make a vectorized version of grepl. Something like the following should do it:
vGrepl <- Vectorize(grepl)
vGrepl(my.data$A.Seats, my.data$B.Seats) # pattern is fixed
# [1] 1 1 0 NA 0
vGrepl(gsub(",", "|", my.data$A.Seats), my.data$B.Seats) # pattern is regex
# 14|15 7 12|13 <NA> 14|19
# 1 1 0 NA 1
as.logical(vGrepl(my.data$A.Seats, my.data$B.Seats)) # coerce to logical
# [1] TRUE TRUE FALSE NA FALSE
Because this calls grepl on each element in the vector, I don't think this will scale well though.
This is an approach to get what you need
> List <- lapply(my.data, function(x) strsplit(as.character(x), ","))
> transform(my.data, Check=sapply(mapply("%in%", List[[1]], List[[2]]), any))
A.Seats B.Seats Check
1 14,15 14,15,16 TRUE
2 7 7,8 TRUE
3 12,13 16,17 FALSE
4 <NA> 10,11 FALSE
Here's an alternative using grep
> transform(my.data,
Check=sapply(suppressWarnings(mapply("grep", List[[1]], List[[2]])), any))

Extract a portion of 1 column from data.frame/matrix

I get flummoxed by some of the simplest of things. In the following code I wanted to extract just a portion of one column in a data.frame called 'a'. I get the right values, but the final entity is padded with NAs which I don't want. 'b' is the extracted column, 'c' is the correct portion of data but has extra NA padding at the end.
How do I best do this so that 'c' ends up naturally only 9 elements long? (i.e. - the 15 original minus the 6 I skipped)
NumBars = 6
a = as.data.frame(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15))
a[,2] = c(11,12,13,14,15,16,17,18,19,20,21,22,23,24,25)
names(a)[1] = "Data1"
names(a)[2] = "Data2"
# Use 1st column of data only
b = as.matrix(a[,1])
c = as.matrix(b[NumBars+1:length(b)])
The immediate reason why you're getting NA's is that the sequence operator : takes precedence over the addition operator +, as is detailed in the R Language Definition. Therefore NumBars+1:length(b) is not the same as (NumBars+1):length(b). The first adds NumBars to the vector 1:length(b), while the second adds first and then takes the sequence.
ind.1 <- 1+1:3 # == 2:4
ind.2 <- (1+1):3 # == 2:3
When you index with this longer vector, you get all the elements you want, and you also are asking for entries like b[length(b)+1], which the R Language Definition tells us returns NA. That's why you have trailing NA's.
If i is positive and exceeds length(x) then the corresponding
selection is NA. A negative out of bounds value for i causes an error.
b <- c(1,2,3)
b[ind.1]
#[1] 2 3 NA
b[ind.2]
#[1] 2 3
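Applied to the data from the question, the fix is a pair of parentheses. A short sketch, where b is simplified to a plain vector and kept is an illustrative name:

```r
NumBars <- 6
b <- 1:15

# Without parentheses: NumBars + (1:15) is 7:21, so indices 16..21 yield NA.
padded <- b[NumBars + 1:length(b)]
sum(is.na(padded))

# With parentheses: (NumBars + 1):15 is 7:15, exactly the 9 wanted elements.
kept <- b[(NumBars + 1):length(b)]
kept

# tail() with a negative n drops the first NumBars elements instead.
identical(kept, tail(b, -NumBars))
```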
From a design perspective, the other solutions listed here are good choices to help avoid this mistake.
It is often easier to think of what you want to remove from your vector / matrix. Use negative subscripts to remove items.
c = as.matrix(b[-1:-NumBars])
c
## [,1]
## [1,] 7
## [2,] 8
## [3,] 9
## [4,] 10
## [5,] 11
## [6,] 12
## [7,] 13
## [8,] 14
## [9,] 15
If your goal is to remove NAs from a column, you can also do something like
c <- na.omit(a[,1])
E.g.
> x
[1] 1 2 3 NA NA
> na.omit(x)
[1] 1 2 3
attr(,"na.action")
[1] 4 5
attr(,"class")
[1] "omit"
You can ignore the attributes - they are there to let you know what elements were removed.
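If the attributes get in the way, they can be avoided or stripped. A small sketch using the same x as above; clean is an illustrative name:

```r
x <- c(1, 2, 3, NA, NA)

# Logical indexing keeps a plain vector with no extra attributes.
clean <- x[!is.na(x)]
clean

# as.vector() strips the na.action attribute left by na.omit().
identical(as.vector(na.omit(x)), clean)
```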

Resources