How do I use grepl on each column in a data frame? - r

Some values in my data frame are the string #N/A, and I want to convert them to NA. I'm trying what seems like a straightforward grepl via lapply on the data frame, but it's not working. Here's a simple example:
a = c("#N/A", "A", "B", "#N/A", "C")
b = c("d", "#N/A", "e", "f", "123")
df = as.data.frame(cbind(a,b))
lapply(df, function(x){x[grepl("#N/A", x)]=NA})
Which outputs:
$a
[1] NA
$b
[1] NA
Can someone point me in the right direction? I'd appreciate it.

Your function needs to return x as the return value.
Try:
lapply(df, function(x){x[grepl("#N/A", x)] <- NA; x})
$a
[1] <NA> A B <NA> C
Levels: #N/A A B C
$b
[1] d <NA> e f 123
Levels: #N/A 123 d e f
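A minimal sketch of why the original call printed a single NA per column, assuming a small character data frame (the names toy, no_return and with_return are illustrative, not from the question). In R the value of an assignment such as x[i] <- NA is its right-hand side, and a function returns the value of its last expression.

```r
toy <- data.frame(a = c("#N/A", "A"), b = c("d", "#N/A"),
                  stringsAsFactors = FALSE)

# The last expression is the assignment, so each call returns NA
no_return <- lapply(toy, function(x) {
  x[grepl("#N/A", x, fixed = TRUE)] <- NA
})

# The last expression is x, so each call returns the modified column
with_return <- lapply(toy, function(x) {
  x[grepl("#N/A", x, fixed = TRUE)] <- NA
  x
})

str(no_return)
str(with_return)
```

fixed = TRUE is optional here; it just makes grepl treat the pattern as a literal string.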
But you should really use gsub instead of grepl. With NA as the replacement, every element that matches the pattern becomes NA:
lapply(df, function(x)gsub("#N/A", NA, x))
$a
[1] NA "A" "B" NA "C"
$b
[1] "d" NA "e" "f" "123"
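A better (more flexible and possibly easier to maintain) solution might be the following. Note that the columns here are factors, so ifelse returns their integer level codes rather than the labels; apply it to character columns (or wrap x in as.character) to keep the original values. The name replace also masks base::replace, so a different name is safer: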
replace <- function(x, ptn="#N/A") ifelse(x %in% ptn, NA, x)
lapply(df, replace)
$a
[1] NA 2 3 NA 4
$b
[1] 3 NA 4 5 2
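The numbers in that output are factor level codes. A small sketch of the effect, assuming a factor with explicitly ordered levels (f, codes and kept are illustrative names):

```r
# Explicit levels so the codes do not depend on the locale's sort order
f <- factor(c("#N/A", "A", "B"), levels = c("#N/A", "A", "B"))

# ifelse() drops the factor attributes, leaving the integer codes
codes <- ifelse(f %in% "#N/A", NA, f)

# Converting to character first keeps the labels
kept <- ifelse(f %in% "#N/A", NA, as.character(f))

print(codes)
print(kept)
```

If the columns are already character (for example, created with stringsAsFactors = FALSE), the conversion is unnecessary.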

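You need to return x. apply also works in this case, though it returns a character matrix rather than a data frame. Creating a data.frame with cbind is best avoided as well, because cbind first coerces everything to a single type.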
a = c("#N/A", "A", "B", "#N/A", "C")
b = c("d", "#N/A", "e", "f", "123")
df = data.frame(a=a, b=b, stringsAsFactors = FALSE)
str(df)
apply(df, 2, function(x){x[grepl("#N/A", x)] <- NA; return(x)})
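One thing to keep in mind with apply is the type of the result. A short sketch, assuming the same kind of character data frame (fix_na and as_matrix are illustrative names), comparing it with assigning the lapply result back into the data frame:

```r
df <- data.frame(a = c("#N/A", "A", "B"), b = c("d", "#N/A", "e"),
                 stringsAsFactors = FALSE)

# Helper that blanks out matching entries and returns the column
fix_na <- function(x) {
  x[grepl("#N/A", x, fixed = TRUE)] <- NA
  x
}

# apply() over columns returns a matrix
as_matrix <- apply(df, 2, fix_na)
class(as_matrix)

# df[] <- lapply(...) keeps the data frame structure
df[] <- lapply(df, fix_na)
class(df)
```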

If you are reading this data in from a CSV/tab delimited file, just set na.strings = "#N/A".
read.table("my file.csv", na.strings = "#N/A", comment.char = "") # comment.char = "" stops "#" from starting a comment
Update from comment: or maybe na.strings = c("#N/A", "#N/A#N/A").
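A self-contained sketch of the na.strings approach, reading from an in-memory string instead of a file (csv_text and dat are made-up names, and the data is invented for illustration). It uses read.csv, whose default comment.char is "", whereas read.table defaults to comment.char = "#" and would otherwise treat #N/A as the start of a comment.

```r
# Hypothetical CSV content standing in for a real file
csv_text <- "a,b\n#N/A,d\nA,#N/A\nB,e"

dat <- read.csv(text = csv_text, na.strings = "#N/A",
                stringsAsFactors = FALSE)

# Logical matrix marking the cells that were read as NA
is.na(dat)
```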
Even if you are stuck with the case you described in your question, you still don't need grepl.
df <- data.frame(
a = c("#N/A", "A", "B", "#N/A", "C"),
b = c("d", "#N/A", "e", "f", "123")
)
df[] <- lapply(
df,
function(x)
{
x[x == "#N/A"] <- NA
x
}
)
df
## a b
## 1 <NA> d
## 2 A <NA>
## 3 B e
## 4 <NA> f
## 5 C 123

For the example in your question, you don't need any kind of apply loop; just do
df[df == "#N/A"] <- NA
As for cases where you have #N/A#N/A (although you didn't provide such data), another way to solve this would be
df[sapply(df, function(x) grepl("#N/A", x))] <- NA
In both cases the data itself is updated, rather than just printed to the console.
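To illustrate the difference between the two approaches, here is a sketch on made-up data that includes a doubled #N/A#N/A entry (exact and partial are illustrative names). The == comparison only matches the whole string, while grepl matches any cell containing the pattern.

```r
df <- data.frame(a = c("#N/A", "A", "#N/A#N/A"),
                 b = c("d", "#N/A", "e"),
                 stringsAsFactors = FALSE)

# Exact match: the doubled entry is left alone
exact <- df
exact[exact == "#N/A"] <- NA

# Substring match: the doubled entry is replaced as well
partial <- df
partial[sapply(partial, function(x) grepl("#N/A", x, fixed = TRUE))] <- NA

exact
partial
```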

Related

How are missings represented in R?

Beforehand
Most obvious answer to the title is that missings are represented with NA in R. Dummy data:
x <- c("a", "NA", "<NA>", NA)
We can transform all elements of x to characters using x_paste0 <- paste0(x). After doing so, the second and fourth elements are the same ("NA"), and to my knowledge this is why there is no way to back-transform x_paste0 to x.
addNA
But working with addNA indicates that it is not just the NA itself that represents missings. In x only the last element is a missing. Let's transform the vector:
x_new <- addNA(x)
x_new
[1] a NA <NA> <NA>
Levels: <NA> a NA <NA>
Interestingly, the fourth element, i.e. the missing value, is shown as <NA> and not as NA. Further, the fourth element now looks the same as the third. And we are told that there are no missings, because any(is.na(x_new)) returns FALSE. At this point I would have thought that the information about which element is the missing one (the third or the fourth) is simply lost, as it was in x_paste0. But this is not true, because we can actually back-transform x_new:
as.character(x_new)
[1] "a" "NA" "<NA>" NA
How does as.character know that the third element is "<NA>" and the fourth is an actual missing, i.e. NA?
That's probably an inconsistency in the base:::print.factor() method, which displays both the string "<NA>" and a real NA level as <NA>.
x <- c("a", "NA", "<NA>", NA)
addNA(x)
# [1] a NA <NA> <NA>
# Levels: <NA> a NA <NA>
But:
levels(addNA(x))
# [1] "<NA>" "a" "NA" NA
So, there are no duplicated levels.
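A sketch of what happens under the hood, using the same x (codes, lev and rebuilt are illustrative names). A factor stores integer codes that index into its levels; after addNA one of those levels is a real NA, so looking up each code in the levels recovers the original vector.

```r
x <- c("a", "NA", "<NA>", NA)
f <- addNA(x)

codes <- as.integer(f)   # integer codes, none of which is NA
lev <- levels(f)         # levels, one of which is a real NA

# Looking up each code in the levels rebuilds the original values
rebuilt <- lev[codes]
identical(rebuilt, x)
```

This is also why any(is.na(f)) is FALSE: is.na looks at the codes, and every element has a valid code.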
Usually you try to prevent this when you read your data, whether from a csv or another source. A slightly silly demo using read.table on your sample vector:
x <- c("a", "NA", "<NA>", NA)
x <- read.table(text = x, na.strings = c("NA", "<NA>", ""), stringsAsFactors = F)$V1
x
[1] "a" NA NA NA
But if you want to fix it afterwards
x <- c("a", "NA", "<NA>", NA)
na_strings <- c("NA", "<NA>", "")
unlist(lapply(x, function(v) { ifelse(v %in% na_strings, NA, v) }))
[1] "a" NA NA NA
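Since %in% is already vectorised, the same clean-up can be sketched without lapply or ifelse, using logical subsetting on the same x and na_strings:

```r
x <- c("a", "NA", "<NA>", NA)
na_strings <- c("NA", "<NA>", "")

# Replace every element that matches one of the placeholder strings
x[x %in% na_strings] <- NA
x
```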
Some notes on factors and addNA:
# not to be confused with character values that merely look like missing values
x <- c("a", "b", "c", NA)
x_1 <- addNA(x)
x_1
# do not be misled by how the output is displayed
# [1] a b c <NA>
# Levels: a b c <NA>
str(x_1)
# Factor w/ 4 levels "a","b","c",NA: 1 2 3 4
is.na(x_1) # as your actual values are 1, 2, 3, 4
# [1] FALSE FALSE FALSE FALSE
is.na(levels(x_1))
# [1] FALSE FALSE FALSE TRUE
# but nothing is lost
x_2 <- as.character(x_1)
str(x_2)
# chr [1:4] "a" "b" "c" NA
is.na(x_2)
# [1] FALSE FALSE FALSE TRUE

How to merge two lists in parallel in R?

I'm asking how to merge two lists in parallel (element by element), rather than simply appending one to the other.
For example,
A <- list(c(1,2,3), c(3,4,5), c(6,7,8))
B <- list(c("a", "b", "c"), c("d", "e", "f"), c("g", "h", "i"))
The desired result:
[[1]]
[[1]][[1]]
[1] 1 2 3
[[1]][[2]]
[1] "a" "b" "c"
[[2]]
[[2]][[1]]
[1] 3 4 5
[[2]][[2]]
[1] "d" "e" "f"
[[3]]
[[3]][[1]]
[1] 6 7 8
[[3]][[2]]
[1] "g" "h" "i"
Simply use Map:
Map(list,A,B)
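A quick check of what Map produces on the lists from the question (merged and alt are illustrative names). Map is a thin wrapper around mapply with SIMPLIFY = FALSE, so the two calls below should give the same nested list.

```r
A <- list(c(1, 2, 3), c(3, 4, 5), c(6, 7, 8))
B <- list(c("a", "b", "c"), c("d", "e", "f"), c("g", "h", "i"))

# Pair up the i-th element of A with the i-th element of B
merged <- Map(list, A, B)
str(merged)

# Equivalent call via mapply
alt <- mapply(list, A, B, SIMPLIFY = FALSE)
identical(merged, alt)
```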
A longer approach (not recursive yet, up to second level merging):
A <- list(c(1,2,3), c(3,4,5), c(6,7,8))
B <- list(c("a", "b", "c"), c("d", "e", "f"), c("g", "h", "i"))
mergepar <- function(x = A, y = B) { # merge two lists in parallel
ln <- max(length(x), length(y)) # max length
newlist <- vector("list", ln) # empty list of max length
for (i in seq_len(ln)) { # loop across the length (safe when ln is 0)
# two-level subsetting (first [ and then [[, so there is no subscript-out-of-bounds error) via lapply;
# uses the arguments x and y rather than the globals A and B
newlist[[i]] <- lapply(list(x, y), function(l) "[["("["(l, i), 1))
}
return(newlist)
}

Find names of columns which contain missing values

I want to find all the names of columns with NA or missing data and store these column names in a vector.
# create matrix
a <- c(1,2,3,4,5,NA,7,8,9,10,NA,12,13,14,NA,16,17,18,19,20)
cnames <- c("aa", "bb", "cc", "dd", "ee")
mymatrix <- matrix(a, nrow = 4, ncol = 5, byrow = TRUE)
colnames(mymatrix) <- cnames
mymatrix
# aa bb cc dd ee
# [1,] 1 2 3 4 5
# [2,] NA 7 8 9 10
# [3,] NA 12 13 14 NA
# [4,] 16 17 18 19 20
The desired result: columns "aa" and "ee".
My attempt:
bad <- character()
for (j in 1:4){
tmp <- which(colnames(mymatrix[j, ]) %in% c("", "NA"))
bad <- tmp
}
However, I keep getting integer(0) as my output. Any help is appreciated.
Like this?
colnames(mymatrix)[colSums(is.na(mymatrix)) > 0]
# [1] "aa" "ee"
Or as suggested by @thelatemail:
names(which(colSums(is.na(mymatrix)) > 0))
# [1] "aa" "ee"
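The intermediate steps can be sketched on the matrix from the question (na_counts and bad are illustrative names): is.na gives a logical matrix, colSums counts the TRUE values per column, and any column with a count above zero contains a missing value.

```r
a <- c(1, 2, 3, 4, 5, NA, 7, 8, 9, 10, NA, 12, 13, 14, NA, 16, 17, 18, 19, 20)
mymatrix <- matrix(a, nrow = 4, ncol = 5, byrow = TRUE)
colnames(mymatrix) <- c("aa", "bb", "cc", "dd", "ee")

# Number of NAs in each column
na_counts <- colSums(is.na(mymatrix))
na_counts

# Names of the columns with at least one NA
bad <- names(na_counts)[na_counts > 0]
bad
```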
R 3.1 introduced an anyNA function, which is more convenient and faster:
colnames(mymatrix)[ apply(mymatrix, 2, anyNA) ]
Old answer:
If it's a very long matrix, apply + any can short circuit and run a bit faster.
apply(is.na(mymatrix), 2, any)
# aa bb cc dd ee
# TRUE FALSE FALSE FALSE TRUE
colnames(mymatrix)[apply(is.na(mymatrix), 2, any)]
# [1] "aa" "ee"
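If you have a data frame with non-numeric columns, this solution is more general (building on previous answers). sapply iterates over columns only for a data frame (on a matrix it visits individual cells), so the matrix is converted first:
R 3.1+
names(which(sapply(as.data.frame(mymatrix), anyNA)))
or
names(which(sapply(as.data.frame(mymatrix), function(x) any(is.na(x)))))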
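A small sketch on a mixed-type data frame, which is the case this variant is aimed at (mydf and na_cols are made-up names and the data is invented for illustration):

```r
mydf <- data.frame(aa = c(1, NA, 3),
                   bb = c("x", "y", "z"),
                   cc = c("u", NA, "w"),
                   stringsAsFactors = FALSE)

# sapply() visits each column of a data frame in turn
na_cols <- names(which(sapply(mydf, anyNA)))
na_cols
```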

Select names of columns which contain specific values in row

I'm using a data.frame:
data.frame("A"=c(NA,5,NA,NA,NA),
"B"=c(1,2,3,4,NA),
"C"=c(NA,NA,NA,2,3),
"D"=c(NA,NA,NA,7,NA))
This delivers a data.frame in this form:
A B C D
1 NA 1 NA NA
2 5 2 NA NA
3 NA 3 NA NA
4 NA 4 2 7
5 NA NA 3 NA
My aim is to check, for each row of the data.frame, whether there is a value greater than a specific threshold (let's assume 2), and to get the names of the columns where this is the case.
The desired output (values greater than 2) should be:
for row 1 of the data.frame
x[1,]: c()
for row 2
x[2,]: c("A")
for row 3
x[3,]: c("B")
for row 4
x[4,]: c("B","D")
and for row 5 of the data.frame
x[5,]: c("C")
Thanks for your help!
You can use which:
lapply(apply(dat, 1, function(x)which(x>2)), names)
with dat being your data frame.
[[1]]
character(0)
[[2]]
[1] "A"
[[3]]
[1] "B"
[[4]]
[1] "B" "D"
[[5]]
[1] "C"
EDIT
Shorter version suggested by flodel:
lapply(apply(dat > 2, 1, which), names)
Edit: (from Arun)
First, there's no need for lapply and apply. You can get the same just with apply:
apply(dat > 2, 1, function(x) names(which(x)))
But, using apply on a data.frame will coerce it into a matrix, which may not be wise if the data.frame is huge.
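The coercion also matters for correctness when the columns are not all numeric. A sketch, assuming a made-up data frame with an id column (mixed, m and num_only are illustrative names): as.matrix, which apply calls on a data frame, turns everything into character, so restricting to numeric columns first is one way to avoid string comparisons.

```r
mixed <- data.frame(id = c("r1", "r2"), A = c(10, 1),
                    stringsAsFactors = FALSE)

# What apply() would work on: a character matrix
m <- as.matrix(mixed)
typeof(m)

# Keeping only the numeric columns preserves the numeric type
num_only <- mixed[sapply(mixed, is.numeric)]
typeof(as.matrix(num_only))
```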
To answer @flodel's concerns, I'll write it as a separate answer:
1) Using lapply gets a list and apply doesn't guarantee this always:
A fair point. I'll illustrate the issue with an example:
df <- structure(list(A = c(3, 5, NA, NA, NA), B = c(1, 2, 3, 1, NA),
C = c(NA, NA, NA, 2, 3), D = c(NA, NA, NA, 7, NA)), .Names = c("A",
"B", "C", "D"), row.names = c(NA, -5L), class = "data.frame")
A B C D
1 3 1 NA NA
2 5 2 NA NA
3 NA 3 NA NA
4 NA 1 2 7
5 NA NA 3 NA
# using `apply` results in a vector:
apply(df, 1, function(x) names(which(x>2)))
# [1] "A" "A" "B" "D" "C"
So, how can we guarantee a list with apply?
By creating a list within the function and then using unlist with recursive = FALSE, as shown below:
unlist(apply(df, 1, function(x) list(names(which(x>2)))), recursive=FALSE)
[[1]]
[1] "A"
[[2]]
[1] "A"
[[3]]
[1] "B"
[[4]]
[1] "D"
[[5]]
[1] "C"
2) lapply is overall shorter, and does not require anonymous function:
Yes, but it's slower. Let me illustrate this on a big example.
set.seed(45)
df <- as.data.frame(matrix(sample(c(1:10, NA), 1e5 * 100, replace=TRUE),
ncol = 100))
system.time(t1 <- lapply(apply(df > 2, 1, which), names))
user system elapsed
5.025 0.342 5.651
system.time(t2 <- unlist(apply(df, 1, function(x)
list(names(which(x>2)))), recursive=FALSE))
user system elapsed
2.860 0.181 3.065
identical(t1, t2) # TRUE
3) All answers are wrong and the answer that'll work with all inputs:
lapply(split(df, rownames(df)), function(x)names(x)[which(x > 2)])
First, I don't see what's wrong. If you mean that the list is unnamed, that can be fixed by setting the names once at the end.
Second, unfortunately, using split on a huge data.frame that results in very many split elements will be terribly slow (due to the huge number of factor levels).
# testing on huge data.frame
system.time(t3 <- lapply(split(df, rownames(df)), function(x)names(x)[which(x > 2)]))
user system elapsed
517.545 0.312 517.872
Third, this orders the elements as 1, 10, 100, 1000, 10000, 100000, ... instead of 1 .. 1e5. Instead, one could use setNames (base R) or setattr (from the data.table package) to set the names just once at the end, as shown below:
# setting names just once
t2 <- setNames(t2, rownames(df)) # by copy
# or even better using `data.table` `setattr` function to
# set names by reference
require(data.table)
tracemem(t2)
setattr(t2, 'names', rownames(df))
tracemem(t2)
Comparing the output doesn't show any other difference between the two (t3 and t2). You could run this to verify that the outputs are the same (time-consuming):
all(sapply(names(t2), function(x) all(t2[[x]] == t3[[x]])) == TRUE) # TRUE
Why not do
colnames(df[,df[i,]>2])
for each row, where df is your data frame and i is the row number ;)
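With the data from the question, the comparison for a row contains NA values, and NA entries in a logical column index typically cause an error. A hedged sketch of a variant that wraps the comparison in which, which skips NA (cols_above is a hypothetical helper, not from the answers above):

```r
dat <- data.frame(A = c(NA, 5, NA, NA, NA),
                  B = c(1, 2, 3, 4, NA),
                  C = c(NA, NA, NA, 2, 3),
                  D = c(NA, NA, NA, 7, NA))

# Names of the columns whose value in row i exceeds the threshold;
# which() ignores NA, so rows with missing values are handled
cols_above <- function(d, i, threshold = 2) {
  names(d)[which(d[i, ] > threshold)]
}

cols_above(dat, 4)
cols_above(dat, 1)
```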

R Equality while ignoring NAs

Is there an equivalent of == but with the result that x != NA if x is not NA?
The following does what I want, but it's clunky:
mapply(identical, vec1, vec2)
Just replace "==" with %in%.
Example:
> df <- data.frame(col1= c("a", "b", NA), col2= 1:3)
> df
col1 col2
1 a 1
2 b 2
3 <NA> 3
> df[df$col1=="a", ]
col1 col2
1 a 1
NA <NA> NA
> df[df$col1%in%"a", ]
col1 col2
1 a 1
> "x"==NA
[1] NA
> "x"%in%NA
[1] FALSE
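The same contrast on a plain vector, as a minimal sketch (vec, eq and inn are illustrative names). One caveat worth knowing: %in% treats NA as matching NA, so NA %in% NA is TRUE.

```r
vec <- c("a", "b", NA)

eq <- vec == "a"     # NA propagates into the result
inn <- vec %in% "a"  # always TRUE or FALSE, never NA

eq
inn
NA %in% NA
```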
1 == NA returns a logical NA rather than TRUE or FALSE. If you want to treat NA as FALSE, you could add a second condition:
set.seed(1)
x <- 1:10
x[4] <- NA
y <- sample(1:10, 10)
x <= y
# [1] TRUE TRUE TRUE NA FALSE TRUE TRUE FALSE TRUE FALSE
x <= y & !is.na(x)
# [1] TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE
You could also use a second processing step to convert all the NA values from your equality test to FALSE.
foo <- x <= y
foo[is.na(foo)] <- FALSE
foo
# [1] TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE
Also, for what it's worth, NA == NA returns NA, as does NA != NA.
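If this comes up often, the two steps can be wrapped in a small helper. This is only a sketch; eq_na_false is a hypothetical name, and it treats any comparison involving NA (including NA against NA) as FALSE.

```r
# Element-wise equality in which any NA result counts as FALSE
eq_na_false <- function(x, y) {
  res <- x == y
  !is.na(res) & res
}

eq_na_false(c(1, NA, 3, NA), c(1, 2, 4, NA))
```

If NA compared with NA should count as TRUE instead, the mapply(identical, ...) approach from the question is closer to that behaviour.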
The == operator is often used in combination with filtering data.frames.
In that situation, dplyr::filter retains only rows where the condition evaluates to TRUE, unlike [. That effectively implements an == for which 1 == NA evaluates as FALSE.
Example:
> df <- data.frame(col1= c("a", "b", NA), col2= 1:3)
> df
col1 col2
1 a 1
2 b 2
3 <NA> 3
> dplyr::filter(df, col1=="a")
col1 col2
1 a 1
Why not use base R:
df <- data.frame(col1 = c("a", "b", NA), col2 = 1:3, col3 = 11:13)
df
subset(x = df, subset = col1=="a", select = c(col1, col2))
# col1 col2
# 1 a 1
or with vectors:
df <- c("a", "b", NA)
subset(x = df, subset = df == "a")
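For comparison, a base R sketch of the three filtering styles on the same kind of data frame (bracket, with_which and with_subset are illustrative names). Plain [ keeps an all-NA row where the condition is NA, while wrapping the condition in which, or using subset, drops it.

```r
df <- data.frame(col1 = c("a", "b", NA), col2 = 1:3,
                 stringsAsFactors = FALSE)

bracket <- df[df$col1 == "a", ]            # NA condition yields an NA row
with_which <- df[which(df$col1 == "a"), ]  # which() skips NA
with_subset <- subset(df, col1 == "a")     # subset() treats NA as FALSE

nrow(bracket)
nrow(with_which)
nrow(with_subset)
```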
