How are missings represented in R?

Beforehand
The most obvious answer to the question in the title is that missings are represented by NA in R. Dummy data:
x <- c("a", "NA", "<NA>", NA)
We can transform all elements of x to characters using x_paste0 <- paste0(x). After doing so, the second and fourth elements are the same ("NA"), and to my knowledge this is why there is no way to back-transform x_paste0 to x.
addNA
But working with addNA indicates that it is not just NA itself that represents missings. In x only the last element is a missing value. Let's transform the vector:
x_new <- addNA(x)
x_new
[1] a NA <NA> <NA>
Levels: <NA> a NA <NA>
Interestingly, the fourth element, i.e. the missing value, is shown as <NA> and not as NA. Further, the fourth element now looks the same as the third. And we are told that there are no missings, because any(is.na(x_new)) returns FALSE. At this point I would have thought that the information about which element is the missing one (the third or the fourth) is simply lost, as it was in x_paste0. But this is not true, because we can actually back-transform x_new. See:
as.character(x_new)
[1] "a" "NA" "<NA>" NA
How does as.character know that the third element is "<NA>" and the fourth is an actual missing, i.e. NA?

That's probably a quirk in the base:::print.factor() method.
x <- c("a", "NA", "<NA>", NA)
addNA(x)
# [1] a NA <NA> <NA>
# Levels: <NA> a NA <NA>
But:
levels(addNA(x))
# [1] "<NA>" "a" "NA" NA
So, there are no duplicated levels.
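To make the mechanism explicit: a factor is stored as integer codes plus a character vector of levels, and addNA() adds an actual NA to that levels vector. as.character() on a factor is essentially levels(f)[f], which is how the string "<NA>" and the real NA stay distinct. A minimal sketch (the level order is taken from the output above; it can depend on your collation locale):
x <- c("a", "NA", "<NA>", NA)
x_new <- addNA(x)
# the factor is integer codes plus a levels attribute that now contains a real NA
as.integer(x_new)
# [1] 2 3 1 4
levels(x_new)
# [1] "<NA>" "a"    "NA"   NA
# as.character() is essentially levels(f)[f]: code 1 maps to the string "<NA>",
# code 4 maps to the level that is an actual NA
levels(x_new)[x_new]
# [1] "a"    "NA"   "<NA>" NA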

Usually you try to prevent this when you read your data, whether from a CSV or another source. A bit of a silly demo using read.table on your sample vector:
x <- c("a", "NA", "<NA>", NA)
x <- read.table(text = x, na.strings = c("NA", "<NA>", ""), stringsAsFactors = F)$V1
x
[1] "a" NA NA NA
But if you want to fix it afterwards
x <- c("a", "NA", "<NA>", NA)
na_strings <- c("NA", "<NA>", "")
unlist(lapply(x, function(v) { ifelse(v %in% na_strings, NA, v) }))
[1] "a" NA NA NA
Some notes on factors and addNA
# not to be confused with character values that look like missing values but are not
x <- c("a", "b", "c", NA)
x_1 <- addNA(x)
x_1
# do not be confused by how the output is displayed
# [1] a b c <NA>
# Levels: a b c <NA>
str(x_1)
# Factor w/ 4 levels "a","b","c",NA: 1 2 3 4
is.na(x_1) # the underlying values are the integer codes 1, 2, 3, 4
# [1] FALSE FALSE FALSE FALSE
is.na(levels(x_1))
# [1] FALSE FALSE FALSE TRUE
# but nothing is lost
x_2 <- as.character(x_1)
str(x_2)
# chr [1:4] "a" "b" "c" NA
is.na(x_2)
# [1] FALSE FALSE FALSE TRUE

Related

Count how many IDs have more than one value in columns

Hi, I have this dataset:
id s1   s2   s3   s4
1  "A"  "A"  "NA" "A"
2  "NA" "A"  "NA" "A"
3  "Na" "NA" "NA" "A"
4  "A"  "NA" "NA" "Na"
5  "A"  "A"  "NA" "A"
I want to count how many IDs have only one value of "A" in either s1, s2, s3, or s4. In this case it is only 2 persons (IDs 3 and 4). But if I have a large dataset, how can I count this?
You can use
which(rowSums(!is.na(df[-1])) == 1)
# [1] 3 4
Replace which() with sum() to get the number of IDs that have only 1 non-missing value.
Update
If, unfortunately, you have stored missing values as the strings "NA", "Na", or "na", then use the following code to convert them back to regular NA first.
df[] <- lapply(df, \(x) { x[x %in% c('NA', 'Na', 'na')] <- NA; x })
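Putting both steps together on a reconstruction of the posted data (the data frame below is my guess at its structure, with the values copied from the question):
df <- data.frame(
  id = 1:5,
  s1 = c("A",  "NA", "Na", "A",  "A"),
  s2 = c("A",  "A",  "NA", "NA", "A"),
  s3 = c("NA", "NA", "NA", "NA", "NA"),
  s4 = c("A",  "A",  "A",  "Na", "A")
)
# turn the fake "NA"/"Na"/"na" strings into real NA first
df[] <- lapply(df, \(x) { x[x %in% c("NA", "Na", "na")] <- NA; x })
df$id[rowSums(!is.na(df[-1])) == 1]   # which ids have exactly one non-missing value
# [1] 3 4
sum(rowSums(!is.na(df[-1])) == 1)     # how many such ids
# [1] 2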
Checks for "A" in the string
library(tidyverse)
df %>%
  filter(rowSums(across(s1:s4, ~ str_detect(.x, "A")), na.rm = TRUE) == 1)
# A tibble: 2 × 5
     id s1    s2    s3    s4
  <dbl> <chr> <chr> <lgl> <chr>
1     3 Na    NA    NA    A
2     4 A     NA    NA    Na

Alternate elements of a vector with multiple NAs

I have a character vector in R, and want to make a new vector with multiple NAs between the elements of the character vector. To simplify, the character vector is:
cv <- c( "A", "B", "C" )
Let's say we just want 3 NAs (I actually need many more). The desired output vector would be:
"A", NA, NA, NA, "B", NA, NA, NA, "C", NA, NA, NA
I'm guessing this has been asked before, but it's very difficult to search for. I've tried various permutations and combinations of rep and rbind with no success. Be gentle; my first question :-)
Use sapply to concatenate c(NA, NA, NA) to each element of cv so that for each element of cv we get a 4-vector. sapply will arrange these into a 4 x n matrix (where n is the length of cv) and c on the left will unravel that matrix into a vector.
c(sapply(cv, c, rep(NA, 3)))
## [1] "A" NA NA NA "B" NA NA NA "C" NA NA NA
You can try playing with matrix() and as.vector():
v <- as.vector(rbind(cv, matrix(nrow = 3, ncol = length(cv))))
such that
> v
[1] "A" NA NA NA "B" NA NA NA "C" NA NA
[12] NA
We could create a vector of NAs and insert the cv elements at the positions generated by seq.
n <- 3
vec <- rep(NA, (n + 1) * length(cv))
vec[seq(1, length(vec), n + 1)] <- cv
vec
#[1] "A" NA NA NA "B" NA NA NA "C" NA NA NA

R Subsetting Specific Value Also Returns NA?

I am just starting out learning R and came across a piece of code as follows:
vec_1 <- c("a","b", NA, "c","d")
# create a subset of all elements which equal "a"
vec_1[vec_1 == "a"]
The result from this is
## [1] "a" NA
I'm just curious: since I am subsetting vec_1 for the value "a", why does NA also show up in my results?
This is because the result of anything == NA is NA. Even NA == NA is NA.
Here's the output of vec_1 == "a" -
[1] TRUE FALSE NA FALSE FALSE
and NA is neither TRUE nor FALSE, so when you subset anything by NA you get NA. Check this out -
vec_1[NA]
[1] NA NA NA NA NA
When dealing with NA, R tries to provide the most informative answer, e.g. T | NA returns TRUE because the result is TRUE no matter what NA stands for. Here are some more examples -
T | NA
[1] TRUE
F | NA
[1] NA
T & NA
[1] NA
F & NA
[1] FALSE
You cannot test equality with NA using == (use is.na() to test for NA itself). In your case you can use the %in% operator -
5 %in% NA
[1] FALSE
"a" %in% NA
[1] FALSE
vec_1[vec_1 %in% "a"]
[1] "a"

How do I use grepl on each column in a data frame?

I have some values in my data frames, #N/A, that I want to convert to NA. I'm trying what seems like a straightforward grepl via lapply on the data frame, but it's not working. Here's a simple example...
a = c("#N/A", "A", "B", "#N/A", "C")
b = c("d", "#N/A", "e", "f", "123")
df = as.data.frame(cbind(a,b))
lapply(df, function(x){x[grepl("#N/A", x)]=NA})
Which outputs:
$a
[1] NA
$b
[1] NA
Can someone point me in the right direction? I'd appreciate it.
Your function needs to return x as the return value.
Try:
lapply(df, function(x){x[grepl("#N/A", x)] <- NA; x})
$a
[1] <NA> A B <NA> C
Levels: #N/A A B C
$b
[1] d <NA> e f 123
Levels: #N/A 123 d e f
But you should really use gsub instead of grep:
lapply(df, function(x)gsub("#N/A", NA, x))
$a
[1] NA "A" "B" NA "C"
$b
[1] "d" NA "e" "f" "123"
A better (more flexible and possibly easier to maintain) solution might be:
replace <- function(x, ptn="#N/A") ifelse(x %in% ptn, NA, x)
lapply(df, replace)
$a
[1] NA 2 3 NA 4
$b
[1] 3 NA 4 5 2
(The numbers are factor codes: df was built with as.data.frame(cbind(a, b)), so under the old stringsAsFactors default its columns are factors, and ifelse() returns their underlying integer codes.)
You need to return x, and it's probably best to use apply in this case. Creating a data.frame with cbind is best avoided as well.
a = c("#N/A", "A", "B", "#N/A", "C")
b = c("d", "#N/A", "e", "f", "123")
df = data.frame(a=a, b=b, stringsAsFactors = FALSE)
str(df)
apply(df, 2, function(x){x[grepl("#N/A", x)] <- NA; return(x)})
If you are reading this data in from a CSV/tab delimited file, just set na.strings = "#N/A".
read.table("my file.csv", na.strings = "#N/A")
Update from comment: or maybe na.strings = c("#N/A", "#N/A#N/A").
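A self-contained way to try na.strings without a file is the text = argument of read.csv (the two-column data below is made up for illustration):
read.csv(text = "a,b\n#N/A,d\nA,#N/A", na.strings = "#N/A")
#      a    b
# 1 <NA>    d
# 2    A <NA>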
Even if you are stuck with the case you described in your question, you still don't need grepl.
df <- data.frame(
a = c("#N/A", "A", "B", "#N/A", "C"),
b = c("d", "#N/A", "e", "f", "123")
)
df[] <- lapply(
df,
function(x)
{
x[x == "#N/A"] <- NA
x
}
)
df
##      a    b
## 1 <NA>    d
## 2    A <NA>
## 3    B    e
## 4 <NA>    f
## 5    C  123
As per your example in the question, you don't need any type of apply loop; just do
df[df == "#N/A"] <- NA
As per cases when you have #N/A#N/A (although you didn't provide such data), another way to solve this would be
df[sapply(df, function(x) grepl("#N/A", x))] <- NA
In both cases the data itself is updated, rather than just printed to the console.

Select names of columns which contain specific values in a row

I'm using a data.frame:
data.frame("A"=c(NA,5,NA,NA,NA),
"B"=c(1,2,3,4,NA),
"C"=c(NA,NA,NA,2,3),
"D"=c(NA,NA,NA,7,NA))
This delivers a data.frame in this form:
A B C D
1 NA 1 NA NA
2 5 2 NA NA
3 NA 3 NA NA
4 NA 4 2 7
5 NA NA 3 NA
My aim is to check each row of the data.frame for values greater than a specific one (let's assume 2) and to get the names of the columns where this is the case.
The desired output (value greater 2) should be:
for row 1 of the data.frame
x[1,]: c()
for row 2
x[2,]: c("A")
for row3
x[3,]: c("B")
for row4
x[4,]: c("B","D")
and for row5 of the data.frame
x[5,]: c("C")
Thanks for your help!
You can use which:
lapply(apply(dat, 1, function(x)which(x>2)), names)
with dat being your data frame.
[[1]]
character(0)
[[2]]
[1] "A"
[[3]]
[1] "B"
[[4]]
[1] "B" "D"
[[5]]
[1] "C"
EDIT
Shorter version suggested by flodel:
lapply(apply(dat > 2, 1, which), names)
Edit: (from Arun)
First, there's no need for lapply and apply. You can get the same just with apply:
apply(dat > 2, 1, function(x) names(which(x)))
But, using apply on a data.frame will coerce it into a matrix, which may not be wise if the data.frame is huge.
To answer #flodel's concerns, I'll write it as a separate answer:
1) Using lapply gets a list and apply doesn't guarantee this always:
A fair point. I'll illustrate the issue with an example:
df <- structure(list(A = c(3, 5, NA, NA, NA), B = c(1, 2, 3, 1, NA),
C = c(NA, NA, NA, 2, 3), D = c(NA, NA, NA, 7, NA)), .Names = c("A",
"B", "C", "D"), row.names = c(NA, -5L), class = "data.frame")
A B C D
1 3 1 NA NA
2 5 2 NA NA
3 NA 3 NA NA
4 NA 1 2 7
5 NA NA 3 NA
# using `apply` results in a vector:
apply(df, 1, function(x) names(which(x>2)))
# [1] "A" "A" "B" "D" "C"
So, how can we guarantee a list with apply?
By creating a list within the function argument and then use unlist with recursive = FALSE, as shown below:
unlist(apply(df, 1, function(x) list(names(which(x>2)))), recursive=FALSE)
[[1]]
[1] "A"
[[2]]
[1] "A"
[[3]]
[1] "B"
[[4]]
[1] "D"
[[5]]
[1] "C"
2) lapply is overall shorter, and does not require an anonymous function:
Yes, but it's slower. Let me illustrate this on a big example.
set.seed(45)
df <- as.data.frame(matrix(sample(c(1:10, NA), 1e5 * 100, replace=TRUE),
ncol = 100))
system.time(t1 <- lapply(apply(df > 2, 1, which), names))
user system elapsed
5.025 0.342 5.651
system.time(t2 <- unlist(apply(df, 1, function(x)
list(names(which(x>2)))), recursive=FALSE))
user system elapsed
2.860 0.181 3.065
identical(t1, t2) # TRUE
3) All answers are wrong and the answer that'll work with all inputs:
lapply(split(df, rownames(df)), function(x)names(x)[which(x > 2)])
First, I don't see what's wrong. If you're talking about the list being unnamed, this can be fixed by setting the names once at the end.
Second, unfortunately, using split on a huge data.frame, which results in very many split elements, will be terribly slow (due to the huge number of factor levels).
# testing on huge data.frame
system.time(t3 <- lapply(split(df, rownames(df)), function(x)names(x)[which(x > 2)]))
user system elapsed
517.545 0.312 517.872
Third, this orders the elements as 1, 10, 100, 1000, 10000, 100000, ... instead of 1 .. 1e5. Instead, one could use setNames, or setattr from the data.table package, to set the names once at the end, as shown below:
# setting names just once
t2 <- setNames(t2, rownames(df)) # by copy
# or even better using `data.table` `setattr` function to
# set names by reference
require(data.table)
tracemem(t2)
setattr(t2, 'names', rownames(df))
tracemem(t2)
Comparing the output doesn't show any other difference between the two (t3 and t2). You could run this to verify that the outputs are the same (time-consuming):
all(sapply(names(t2), function(x) all(t2[[x]] == t3[[x]])) == TRUE) # TRUE
Why not do
colnames(df)[which(df[i, ] > 2)]
for each row, where df is your data frame and i is the row number? (Wrapping the comparison in which() drops the NAs that a plain logical index would choke on.) ;)
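Applied to every row at once (a sketch using the data frame from the question and the same > 2 threshold):
df <- data.frame(A = c(NA, 5, NA, NA, NA),
                 B = c(1, 2, 3, 4, NA),
                 C = c(NA, NA, NA, 2, 3),
                 D = c(NA, NA, NA, 7, NA))
lapply(seq_len(nrow(df)), function(i) colnames(df)[which(df[i, ] > 2)])
# [[1]]
# character(0)
#
# [[2]]
# [1] "A"
#
# [[3]]
# [1] "B"
#
# [[4]]
# [1] "B" "D"
#
# [[5]]
# [1] "C"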
