My dataframe has some variables that contain missing values as strings like "NA". What is the most efficient way to parse all columns in a dataframe that contain these and convert them into real NAs that are caught by functions like is.na()?
I am using sqldf to query the database.
Reproducible example:
vect1 <- c("NA", "NA", "BANANA", "HELLO")
vect2 <- c("NA", 1, 5, "NA")
vect3 <- c(NA, NA, "NA", "NA")
df <- data.frame(vect1, vect2, vect3)
To add to the alternatives, you can also use replace instead of the typical blah[index] <- NA approach. replace would look like:
df <- replace(df, df == "NA", NA)
Another alternative to consider is type.convert. This is the function R uses when reading in data to automatically determine column types. Thus, the result differs from your current approach in that, for instance, the second column gets converted to numeric.
df[] <- lapply(df, function(x) type.convert(as.character(x), na.strings = "NA"))
df
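As a minimal illustration of what type.convert does to a single column (here the second column of the example data; as.is = TRUE keeps strings as character rather than converting them to factor):
type.convert(c("NA", "1", "5", "NA"), na.strings = "NA", as.is = TRUE)
# [1] NA  1  5 NA
The result is an integer vector with real NAs.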
Here's a performance comparison. The sample data is from @Roland's answer.
Here are the functions to test:
funop <- function() {
  df[df == "NA"] <- NA
  df
}

funr <- function() {
  ind <- which(vapply(df, function(x) class(x) %in% c("character", "factor"), FUN.VALUE = TRUE))
  as.data.table(df)[, names(df)[ind] := lapply(.SD, function(x) {
    is.na(x) <- x == "NA"
    x
  }), .SDcols = ind][]
}

funam1 <- function() replace(df, df == "NA", NA)

funam2 <- function() {
  df[] <- lapply(df, function(x) type.convert(as.character(x), na.strings = "NA"))
  df
}
Here's the benchmarking:
library(microbenchmark)
microbenchmark(funop(), funr(), funam1(), funam2(), times = 10)
# Unit: seconds
#      expr      min       lq     mean   median       uq      max neval
#   funop() 3.629832 3.750853 3.909333 3.855636 4.098086 4.248287    10
#    funr() 3.074825 3.212499 3.320430 3.279268 3.332304 3.685837    10
#  funam1() 3.714561 3.899456 4.238785 4.065496 4.280626 5.512706    10
#  funam2() 1.391315 1.455366 1.623267 1.566486 1.606694 2.253258    10
replace would be the same as @Roland's approach, which is the same as @jgozal's. However, the type.convert approach results in different column types.
all.equal(funop(), setDF(funr()))
all.equal(funop(), funam1())
str(funop())
# 'data.frame': 10000000 obs. of 3 variables:
# $ vect1: Factor w/ 3 levels "BANANA","HELLO",..: 2 2 NA 2 1 1 1 NA 1 1 ...
# $ vect2: Factor w/ 3 levels "1","5","NA": NA 2 1 NA 1 NA NA 1 NA 2 ...
# $ vect3: Factor w/ 1 level "NA": NA NA NA NA NA NA NA NA NA NA ...
str(funam2())
# 'data.frame': 10000000 obs. of 3 variables:
# $ vect1: Factor w/ 2 levels "BANANA","HELLO": 2 2 NA 2 1 1 1 NA 1 1 ...
# $ vect2: int NA 5 1 NA 1 NA NA 1 NA 5 ...
# $ vect3: logi NA NA NA NA NA NA ...
I found this nice way of doing it from this question:
So for this particular situation it would just be:
df[df == "NA"] <- NA
It only took about 30 seconds with 5 million rows and ~250 variables.
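As a quick sanity check on the reproducible example above (assuming R >= 4.0, where data.frame() keeps character columns by default):
df[df == "NA"] <- NA
is.na(df)
#      vect1 vect2 vect3
# [1,]  TRUE  TRUE  TRUE
# [2,]  TRUE FALSE  TRUE
# [3,] FALSE FALSE  TRUE
# [4,] FALSE  TRUE  TRUE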
This is slightly faster:
set.seed(42)
df <- do.call(data.frame, lapply(df, sample, size = 1e7, replace = TRUE))
df2 <- df
system.time(df[df == "NA"] <- NA)
#   user  system elapsed
#  3.601   0.378   3.984
library(data.table)
setDT(df2)
system.time({
  # find character and factor columns
  ind <- which(vapply(df2, function(x) class(x) %in% c("character", "factor"), FUN.VALUE = TRUE))
  # assign by reference
  df2[, names(df2)[ind] := lapply(.SD, function(x) {
    is.na(x) <- x == "NA"
    x
  }), .SDcols = ind]
})
#   user  system elapsed
#  2.484   0.190   2.676
all.equal(df, setDF(df2))
#[1] TRUE
I have several data frames containing 18 columns with approx. 50000 rows. Each row entry represents a measurement at a specific site (= column), and the data contain NA values.
I need to subtract the consecutive rows per column (e.g. row(i+1)-row(i)) to detect threshold values, but I need to ignore (and retain) the NAs, so that only the entries with numeric values are subtracted from each other.
I found very helpful posts with data.table solutions for a single column Iterate over a column ignoring but retaining NA values in R, and for multiple column operations (e.g. Summarizing multiple columns with dplyr?).
However, I haven't managed to combine the approaches suggested on SO (i.e., applying diff over multiple columns while ignoring the NAs).
Here's an example df for illustration and a solution I tried:
library(data.table)
df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
That's how it works for a single column:
diff_x <- df[!is.na(x), lag_diff := x - shift(x)] # actually what I want, but for more columns at once
And that's how I apply a diff function over several columns with lapply:
diff_all <- setDT(df)[,lapply(.SD, diff)] # not exactly what I want because NAs are not ignored and the difference between numeric values is not calculated
I'd very much appreciate any suggestion (base, data.table, dplyr, ...) on how to implement a valid !is.na or similar statement in this second line of code.
Defining a helper function makes things a bit cleaner:
lag_diff <- function(x) {
  which_nna <- which(!is.na(x))
  out <- rep(NA_integer_, length(x))
  out[which_nna] <- x[which_nna] - shift(x[which_nna])
  out
}
cols <- c("x", "y", "z")
setDT(df)
df[, paste0("lag_diff_", cols) := lapply(.SD, lag_diff), .SDcols = cols]
Result:
# x y z lag_diff_x lag_diff_y lag_diff_z
# 1: 1 NA 6 NA NA NA
# 2: 2 4 2 1 NA -4
# 3: 3 5 7 1 1 5
# 4: NA 6 14 NA 1 7
# 5: NA NA 20 NA NA 6
# 6: 9 15 NA 6 9 NA
# 7: 8 14 NA -1 -1 NA
# 8: 7 13 2 -1 -1 -18
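Since the question also invites dplyr solutions, here is a minimal sketch of the same idea with across() (assumes dplyr >= 1.0; lag_diff2 is a hypothetical helper that swaps data.table::shift for dplyr::lag):
library(dplyr)
lag_diff2 <- function(x) {
  nna <- !is.na(x)                # positions with real values
  x[nna] <- x[nna] - lag(x[nna])  # difference consecutive non-NA values
  x                               # NAs stay where they were
}
df %>% mutate(across(c(x, y, z), lag_diff2, .names = "lag_diff_{.col}"))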
So you are looking for:
library("data.table")
df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
setDT(df)
# diff_x <- df[!is.na(x), lag_diff := x - shift(x)] # actually what I want, but
lag_d <- function(x) { y <- x[!is.na(x)]; x[!is.na(x)] <- y - shift(y); x }
df[, lapply(.SD, lag_d)]
or
library("data.table")
df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
lag_d <- function(x) { y <- x[!is.na(x)]; x[!is.na(x)] <- y - shift(y); x }
as.data.frame(lapply(df, lag_d))
Is there a way to replace dashes with NA (or zero) without affecting negative values in a vector that resembles the following: c("-","-121512","123","-")?
Just do
x <- c("-","-121512","123","-")
x[x == "-"] <- NA
x
#[1] NA "-121512" "123" NA
If you need a numeric vector instead of a character one, wrap x in as.numeric().
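For example, continuing from the x defined above:
as.numeric(x)
# [1]      NA -121512     123      NA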
If we want to replace all "-" in a dataframe, we can use the same logic:
df1 <- data.frame(x = c("-","-121512","123","-"),
y = c("-","-121512","123","-"),
z = c("-","A","B","-"), stringsAsFactors = FALSE)
df1[df1 == "-"] <- NA
If you want the columns converted to numeric where appropriate, you can then use type.convert:
df1[] <- lapply(df1, type.convert, as.is = TRUE)
str(df1)
'data.frame': 4 obs. of 3 variables:
$ x: int NA -121512 123 NA
$ y: int NA -121512 123 NA
$ z: chr NA "A" "B" NA
We can use na_if
library(dplyr)
na_if(v1, "-") %>%
as.numeric
#[1] NA -121512 123 NA
If it is a data.frame
library(tidyverse)
df1 %>%
mutate_all(na_if, "-") %>%
type_convert
data
v1 <- c("-","-121512","123","-")
I have a large dataset with ~200 columns of various types. I need to replace NA values with "", but only in character columns.
Using the dummy data table
library(data.table)
DT <- data.table(x = c(1, NA, 2),
                 y = c("a", "b", NA))
> DT
x y
1: 1 a
2: NA b
3: 2 <NA>
> str(DT)
Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
$ x: num 1 NA 2
$ y: chr "a" "b" NA
I have tried the following for-loop with a condition, but it doesn't work.
for (i in names(DT)) {
  if (class(DT$i) == "character") {
    DT[is.na(i), i := ""]
  }
}
The loop runs with no errors, but doesn't change the DT.
The expected output I am looking for is this:
x y
1: 1 a
2: NA b
3: 2
The solution doesn't necessarily have to involve a loop, but I couldn't think of one.
One option if you don't mind using dplyr:
library(dplyr)
na_to_space <- function(x) ifelse(is.na(x), " ", x)
> DT %>% mutate_if(.predicate = is.character, .funs = na_to_space)
x y
1 1 a
2 NA b
3 2
DT[, lapply(.SD, function(x) { if (is.character(x)) x[is.na(x)] <- ' '; x })]
Or, if you don't like typing function(x)
library(purrr)
DT[, map(.SD, ~{if(is.character(.x)) .x[is.na(.x)] <- ' '; .x})]
To actually replace the values in DT, assign the result back by reference:
DT[, names(DT) := map(.SD, ~{if(is.character(.x)) .x[is.na(.x)] <- ' '; .x})]
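An alternative sketch using data.table::set(), which likewise updates the columns in place (assuming, as above, that NA should become ' '):
for (j in names(DT)) {
  if (is.character(DT[[j]])) {
    # set() modifies DT by reference, one column at a time
    set(DT, i = which(is.na(DT[[j]])), j = j, value = ' ')
  }
}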
This is an extension of this earlier question. How can I combine two columns of a data frame as
data <- data.frame('a' = c('A','B','C','D','E'),
                   'x' = c("t",2,NA,NA,NA),
                   'y' = c(NA,NA,NA,4,"r"))
displayed as
'a' 'x' 'y'
A t NA
B 2 NA
C NA NA
D NA 4
E NA r
to get
'a' 'mycol'
A t
B 2
C NA
D 4
E r
I tried this
cbind(data[1], mycol = na.omit(unlist(data[-1])))
But it obviously doesn't keep the NA row.
You could do it by using ifelse, like this:
data$mycol <- ifelse(!is.na(data$x), data$x, data$y)
> data
##   a    x    y mycol
## 1 A    t <NA>     t
## 2 B    2 <NA>     2
## 3 C <NA> <NA>  <NA>
## 4 D <NA>    4     4
## 5 E <NA>    r     r
Going with your logic, you can do the following (the condition checks whether the row has at least one non-NA value):
cbind(data[1], mycol = unlist(apply(data[2:3], 1, function(i) ifelse(
  sum(!is.na(i)) > 0,
  na.omit(i)[1],
  NA)
)))
#   a mycol
# 1 A     t
# 2 B     2
# 3 C  <NA>
# 4 D     4
# 5 E     r
This has been addressed here indirectly. Here is a simple solution based on that, using a coalesce() function (dplyr ships one, or you can define your own as in the linked question):
data$mycol <- coalesce(data$x, data$y)
Extending the answer to any number of columns, and using the neat max.col() function I've discovered thanks to this question:
coalesce <- function(value_matrix) {
  value_matrix <- as.matrix(value_matrix)
  first_non_missing <- max.col(!is.na(value_matrix), ties.method = "first")
  indices <- cbind(
    row = seq_len(nrow(value_matrix)),
    col = first_non_missing
  )
  value_matrix[indices]
}
data$mycol <- coalesce(data[, c('x', 'y')])
data
#   a    x    y mycol
# 1 A    t <NA>     t
# 2 B    2 <NA>     2
# 3 C <NA> <NA>  <NA>
# 4 D <NA>    4     4
# 5 E <NA>    r     r
max.col(..., ties.method = "first") returns, for each row, the index of the first column with the maximum value. Since we're using it on a logical matrix, the max is usually TRUE. So we'll get the first non-NA value for each row. If the entire row is NA, then we'll get an NA value as desired.
After that, the function uses a matrix of row-column indices to subset the values.
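A tiny worked example of the indexing trick (hypothetical values):
m <- matrix(c(NA, 1, NA, NA, 2, 3), nrow = 2)
max.col(!is.na(m), ties.method = "first")  # first non-NA column per row
# [1] 3 1
m[cbind(1:2, c(3, 1))]                     # row/column index pairs
# [1] 2 1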
Edit
In comparison to mrip's coalesce, my max.col is slower when there are a few long columns, but faster when there are many short columns.
coalesce_reduce <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x
  },
  list(...))
}

coalesce_maxcol <- function(...) {
  value_matrix <- cbind(...)
  first_non_missing <- max.col(!is.na(value_matrix), ties.method = "first")
  indices <- cbind(
    row = seq_len(nrow(value_matrix)),
    col = first_non_missing
  )
  value_matrix[indices]
}
set.seed(100)
wide <- replicate(
  1000,
  sample(c(NA, 1:10), 10, replace = TRUE),
  simplify = FALSE
)
long <- replicate(
  10,
  sample(c(NA, 1:10), 1000, replace = TRUE),
  simplify = FALSE
)
library(microbenchmark)
microbenchmark(
  do.call(coalesce_reduce, wide),
  do.call(coalesce_maxcol, wide),
  do.call(coalesce_reduce, long),
  do.call(coalesce_maxcol, long)
)
# Unit: microseconds
#                            expr      min        lq       mean   median       uq      max neval
#  do.call(coalesce_reduce, wide) 1879.460 1953.5695 2136.09954 2007.303 2152.654 5284.583   100
#  do.call(coalesce_maxcol, wide)  403.604  423.5280  490.40797  433.641  456.583 2543.580   100
#  do.call(coalesce_reduce, long)   36.829   41.5085   45.75875   43.471   46.942   79.393   100
#  do.call(coalesce_maxcol, long)   80.903   88.1475  175.79337   92.374  101.581 3438.329   100
I created this data frame as an illustration of a larger problem.
> df <- data.frame(x=c(NA, 12, NA, 67), y=c(32, NA, NA, NA), z=c(NA, NA, NA, NA))
> df
x y z
1 NA 32 NA
2 12 NA NA
3 NA NA NA
4 67 NA NA
I want it to look like this.
x
1 32
2 12
3 NA
4 67
Essentially, this searches through each row for a number: if one is found, return it for that row; if no number is found, return NA.
I created an empty vector.
> list <- c()
Then a for loop goes through each row, returning the element that is not NA and adding it to the 'list' vector.
> for (i in 1:4) {list <- c(list, df[i,!is.na(df[i,])])}
> list
[[1]]
[1] 32
[[2]]
[1] 12
[[3]]
[1] 67
> unlist(list)
32 12 67
This gets close, but the NA rows are ignored.
I also tried a grep pattern match, but as you can imagine, the grep family of functions is designed to run over vectors, not data frame rows.
Not sure how to move forward. Again, if it could look like:
x
1 32
2 12
3 NA
4 67
Use apply to check for values in each row:
apply(df, 1, function(x) { z <- x[!is.na(x)]; if(length(z)) z else NA})
# [1] 32 12 NA 67
Another strategy is to use rowSums, but this solution only works if there are no 0 values in your data.frame (if there are, this method will replace those results with NA):
x <- rowSums(df, na.rm = TRUE); x[x == 0] <- NA; x
# [1] 32 12 NA 67
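A quick demonstration of that caveat (a hypothetical one-row example in which 0 is a legitimate value):
x <- rowSums(data.frame(a = 0, b = NA), na.rm = TRUE)  # sums to 0
x[x == 0] <- NA                                        # the real 0 is lost and becomes NA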
You could use the Reduce function to combine columns pair by pair:
Reduce(function(x, y) {x[!is.na(y)] <- y[!is.na(y)] ; x}, df)
# [1] 32 12 NA 67
This function should work with non-numeric data, handles rows with multiple non-NA elements gracefully (it takes the rightmost), and should be a good deal more efficient than one relying on apply.
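To see the rightmost behaviour on a row where both columns have values (a hypothetical two-column example):
Reduce(function(x, y) { x[!is.na(y)] <- y[!is.na(y)]; x },
       data.frame(a = c(1, NA), b = c(9, 5)))
# [1] 9 5   (row 1 takes b's 9, not a's 1)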
df.big <- df[rep(1:4, 1000),]
library(microbenchmark)
microbenchmark(
  apply(df.big, 1, function(x) { z <- x[!is.na(x)]; if (length(z)) z else NA }),
  { x <- rowSums(df.big, na.rm = TRUE); x[x == 0] <- NA; x },
  Reduce(function(x, y) { x[!is.na(y)] <- y[!is.na(y)]; x }, df.big)
)
# Unit: microseconds
#                                                                           expr       min         lq       mean     median        uq       max neval
#  apply(df.big, 1, function(x) { z <- x[!is.na(x)]; if (length(z)) z else NA }) 14550.050 15322.4825 19124.8814 17008.2935 22037.387 43337.893   100
#                     { x <- rowSums(df.big, na.rm = TRUE); x[x == 0] <- NA; x }   239.826   257.2215   389.4275   380.6595   424.593  1585.234   100
#            Reduce(function(x, y) { x[!is.na(y)] <- y[!is.na(y)]; x }, df.big)    353.326   384.4750   457.9714   436.2400   511.085   799.992   100
Basically, the approach is about as efficient as the rowSums one proposed by @Thomas, but it can handle character and other data types.