I have a large dataset with ~200 columns of various types. I need to replace NA values with "", but only in character columns.
Using this dummy data.table:
library(data.table)
DT <- data.table(x = c(1, NA, 2),
                 y = c("a", "b", NA))
> DT
    x    y
1:  1    a
2: NA    b
3:  2 <NA>
> str(DT)
Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
 $ x: num 1 NA 2
 $ y: chr "a" "b" NA
I have tried the following for-loop with a condition, but it doesn't work.
for (i in names(DT)) {
  if (class(DT$i) == "character") {
    DT[is.na(i), i := ""]
  }
}
The loop runs without errors, but doesn't change DT.
The expected output I am looking for is this:
    x y
1:  1 a
2: NA b
3:  2
The solution doesn't necessarily have to involve a loop, but I couldn't think of one.
One option if you don't mind using dplyr:
na_to_space <- function(x) ifelse(is.na(x), " ", x)
> DT %>% mutate_if(.predicate = is.character, .funs = na_to_space)
   x y
1  1 a
2 NA b
3  2
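Note that na_to_space substitutes a single space, while the question asked for the empty string ""; swap " " for "" as needed. Also, in dplyr 1.0.0+ mutate_if is superseded by across(). A minimal sketch of the equivalent, assuming a recent dplyr:
library(dplyr)
DT %>% mutate(across(where(is.character), ~ replace(.x, is.na(.x), "")))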
DT[, lapply(.SD, function(x){if(is.character(x)) x[is.na(x)] <- ' '; x})]
Or, if you don't like typing function(x)
library(purrr)
DT[, map(.SD, ~{if(is.character(.x)) .x[is.na(.x)] <- ' '; .x})]
To replace the columns in the original DT (i.e. update it by reference):
DT[, names(DT) := map(.SD, ~{if(is.character(.x)) .x[is.na(.x)] <- ' '; .x})]
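As an aside, the loop in the question runs but does nothing because DT$i looks up a column literally named "i" (which is NULL), and is.na(i) tests the loop variable itself, a length-one character string that selects no rows. A corrected base data.table loop, sketched here with set() for assignment by reference:
for (j in names(DT)) {
  if (is.character(DT[[j]])) {
    # set() modifies DT in place; i gives the row numbers to overwrite
    set(DT, i = which(is.na(DT[[j]])), j = j, value = "")
  }
}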
Related
I'm attempting to simplify my code below. I would like to cast a list of data frames from long to wide and then add several variables to each nested data frame, conditional on variables contained in the dataset. The following code produces my preferred output; I would like help reducing the steps, ideally doing it all in one lapply call. I've attempted several trials combining with() statements, to no avail.
library(data.table)
dflist <- list(
  data.frame(ID = c(rep(1, 10), rep(2, 10)), Y = c(rep(1, 5), rep(2, 5), rep(1, 5), rep(2, 5)), b = rnorm(20), c = rnorm(20)),
  data.frame(ID = c(rep(1, 10), rep(2, 10)), Y = c(rep(1, 5), rep(2, 5), rep(1, 5), rep(2, 5)), b = rnorm(20), c = rnorm(20)),
  data.frame(ID = c(rep(1, 10), rep(2, 10)), Y = c(rep(1, 5), rep(2, 5), rep(1, 5), rep(2, 5)), b = rnorm(20), c = rnorm(20)))
wide_data <- lapply(dflist, function(x) dcast(setDT(x), ID ~ Y, value.var = c('b', 'c'), mean))
b_flag <- lapply(wide_data, function(x) with(x, ifelse(b_1 < .30 | b_2 > .95, "Flag", NA)))
c_flag <- lapply(wide_data, function(x) with(x, ifelse((c_1 < 0) & (c_1 < 0), "Flag", NA)))
wide_data <- Map(cbind, wide_data, b_flag = b_flag)
wide_data <- Map(cbind, wide_data, c_flag = c_flag)
wide_data
I agree with you that one lapply would be better:
wide_data <- lapply(dflist, function(x) {
  tmp <- dcast(setDT(x), ID ~ Y, value.var = c('b', 'c'), mean)
  tmp$b_flag <- ifelse(tmp$b_1 < .30 | tmp$b_2 > .95, "Flag", NA)
  tmp$c_flag <- ifelse(tmp$c_1 < 0 & tmp$c_2 < 0, "Flag", NA)
  tmp
})
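If the anonymous function grows any further, one possible refactor is to pull it into a named helper (the name add_flags is my own) so the lapply call stays readable:
add_flags <- function(x) {
  tmp <- dcast(setDT(x), ID ~ Y, value.var = c('b', 'c'), fun.aggregate = mean)
  tmp$b_flag <- ifelse(tmp$b_1 < .30 | tmp$b_2 > .95, "Flag", NA)
  tmp$c_flag <- ifelse(tmp$c_1 < 0 & tmp$c_2 < 0, "Flag", NA)
  tmp
}
wide_data <- lapply(dflist, add_flags)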
Another way of approaching it, without using dcast. Also, the condition for the c column is ambiguous in your question (c_1 is tested twice); please check whether it is correct and edit your question.
library(data.table)
df2 <- lapply(dflist, function(x) {
  x <- setDT(x)[, .(b = mean(b), c = mean(c)), by = .(ID, Y)]
  x[, `:=`(b_flag = ifelse(any(b[Y == 1] < 0.30, b[Y == 2] > 0.95), "Flag", NA_character_),
           c_flag = ifelse(all(c < 0), "Flag", NA_character_)),
    by = ID]
  return(x)
})
df2 <- rbindlist(l = df2)
df2
# ID Y b c b_flag c_flag
# 1: 1 1 0.198227292 0.57377712 Flag NA
# 2: 1 2 0.578991810 0.40128112 Flag NA
# 3: 2 1 0.578724225 0.30608932 NA NA
# 4: 2 2 0.619338292 0.35209122 NA NA
# 5: 1 1 0.321089583 -0.83979393 NA NA
# 6: 1 2 -0.341194581 0.52508394 NA NA
# 7: 2 1 0.179836568 -0.02041203 Flag NA
# 8: 2 2 0.482725899 0.17163968 Flag NA
# 9: 1 1 0.003591178 -0.30250232 Flag NA
# 10: 1 2 -0.230479093 0.01971357 Flag NA
# 11: 2 1 -0.038689389 0.35717286 Flag NA
# 12: 2 2 0.691364217 -0.37037455 Flag NA
My dataframe has some variables that contain missing values as strings like "NA". What is the most efficient way to parse all columns in a dataframe that contain these and convert them into real NAs that are caught by functions like is.na()?
I am using sqldf to query the database.
Reproducible example:
vect1 <- c("NA", "NA", "BANANA", "HELLO")
vect2 <- c("NA", 1, 5, "NA")
vect3 <- c(NA, NA, "NA", "NA")
df = data.frame(vect1,vect2,vect3)
To add to the alternatives, you can also use replace instead of the typical blah[index] <- NA approach. replace would look like:
df <- replace(df, df == "NA", NA)
Another alternative to consider is type.convert. This is the function that R uses when reading data in to automatically convert column types. Thus, the result is different from your current approach in that, for instance, the second column gets converted to numeric.
df[] <- lapply(df, function(x) type.convert(as.character(x), na.strings = "NA"))
df
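To see what type.convert does on a single column (a quick illustration; as.is = TRUE keeps character columns as character instead of factor):
type.convert(c("NA", "1", "5", "NA"), na.strings = "NA", as.is = TRUE)
# [1] NA  1  5 NA   <- an integer vector with real NAs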
Here's a performance comparison. The sample data is from @roland's answer.
Here are the functions to test:
funop <- function() {
df[df == "NA"] <- NA
df
}
funr <- function() {
ind <- which(vapply(df, function(x) class(x) %in% c("character", "factor"), FUN.VALUE = TRUE))
as.data.table(df)[, names(df)[ind] := lapply(.SD, function(x) {
is.na(x) <- x == "NA"
x
}), .SDcols = ind][]
}
funam1 <- function() replace(df, df == "NA", NA)
funam2 <- function() {
df[] <- lapply(df, function(x) type.convert(as.character(x), na.strings = "NA"))
df
}
Here's the benchmarking:
library(microbenchmark)
microbenchmark(funop(), funr(), funam1(), funam2(), times = 10)
# Unit: seconds
# expr min lq mean median uq max neval
# funop() 3.629832 3.750853 3.909333 3.855636 4.098086 4.248287 10
# funr() 3.074825 3.212499 3.320430 3.279268 3.332304 3.685837 10
# funam1() 3.714561 3.899456 4.238785 4.065496 4.280626 5.512706 10
# funam2() 1.391315 1.455366 1.623267 1.566486 1.606694 2.253258 10
replace would be the same as @roland's approach, which is the same as @jgozal's. However, the type.convert approach would result in different column types.
all.equal(funop(), setDF(funr()))
all.equal(funop(), funam1())
str(funop())
# 'data.frame': 10000000 obs. of 3 variables:
# $ vect1: Factor w/ 3 levels "BANANA","HELLO",..: 2 2 NA 2 1 1 1 NA 1 1 ...
# $ vect2: Factor w/ 3 levels "1","5","NA": NA 2 1 NA 1 NA NA 1 NA 2 ...
# $ vect3: Factor w/ 1 level "NA": NA NA NA NA NA NA NA NA NA NA ...
str(funam2())
# 'data.frame': 10000000 obs. of 3 variables:
# $ vect1: Factor w/ 2 levels "BANANA","HELLO": 2 2 NA 2 1 1 1 NA 1 1 ...
# $ vect2: int NA 5 1 NA 1 NA NA 1 NA 5 ...
# $ vect3: logi NA NA NA NA NA NA ...
I found this nice way of doing it in a related question.
So for this particular situation it would just be:
df[df=="NA"]<-NA
It only took about 30 seconds with 5 million rows and ~250 variables
This is slightly faster:
set.seed(42)
df <- do.call(data.frame, lapply(df, sample, size = 1e7, replace = TRUE))
df2 <- df
system.time(df[df == "NA"] <- NA)
# user system elapsed
#3.601 0.378 3.984
library(data.table)
setDT(df2)
system.time({
#find character and factor columns
ind <- which(vapply(df2, function(x) class(x) %in% c("character", "factor"), FUN.VALUE = TRUE))
#assign by reference
df2[, names(df2)[ind] := lapply(.SD, function(x) {
is.na(x) <- x == "NA"
x
}), .SDcols = ind]
})
# user system elapsed
#2.484 0.190 2.676
all.equal(df, setDF(df2))
#[1] TRUE
In data.table v.1.9.6 you can split a variable in columns like so:
library(data.table)
DT = data.table(x=c("A/B", "A", "B"), y=1:3)
DT[, c("c1", "c2") := tstrsplit(x, "/", fixed=TRUE)][]
The number of required splits [above: 2] is not always known in advance.
How can I generate the required variable names when the number of splits is known?
n = 2 # desired number of splits
# naive attempt to build the required string
m = paste0("'", "myvar", 1:n, "'", collapse = ",")
m = paste0("c(", m, ")")
# [1] "c('myvar1','myvar2')"
DT[, m := tstrsplit(x, "/", fixed=TRUE)][] # doesn't work: m is taken literally as a column name
Two methods. The first is strongly suggested:
#one
n=2
DT[, paste0("myvar", 1:n) := tstrsplit(x, "/", fixed=T)][]
# x y myvar1 myvar2
#1: A/B 1 A B
#2: A 2 A NA
#3: B 3 B NA
#two
DT[, eval(parse(text=m)) := tstrsplit(x, "/", fixed=TRUE)][]
# x y myvar1 myvar2
#1: A/B 1 A B
#2: A 2 A NA
#3: B 3 B NA
extra
If you do not know the number of splits beforehand:
splits <- max(lengths(strsplit(DT$x, "/")))
DT[, paste0("myvar", 1:splits) := tstrsplit(x, "/", fixed=T)][]
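For example, with a hypothetical column whose entries split into up to three parts:
DT2 <- data.table(x = c("A/B/C", "A", "B/C"), y = 1:3)
splits <- max(lengths(strsplit(DT2$x, "/", fixed = TRUE)))  # 3
DT2[, paste0("myvar", 1:splits) := tstrsplit(x, "/", fixed = TRUE)][]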
Another simple way of doing this: instead of making extra columns, you can stack the split strings in a single column:
DT = data.table(x=c("A/B", "A", "B"), y=1:3)
DT1 <- DT[, .(new = tstrsplit(x, "/", fixed = TRUE)), by = y]
DT1
# y new
# 1: 1 A
# 2: 1 B
# 3: 2 A
# 4: 3 B
I've stumbled upon weird data.table i behavior that returns a row of NAs where I would expect an empty data.table. See:
a = data.table(a = 1, d = NA)
a[!is.na(a) & d == "3"]
# a d
# 1: NA NA
I would expect an empty data table as a result here.
Compare to:
a = data.table(a = c(1,2), d = c(NA,3))
a[!is.na(a) & d == "3"]
# a d
# 1: 2 3
This one does not produce an extra row with NA values, though.
Is this a bug in data.table or there's some logic underlying this behavior that someone could explain?
Thanks for the ping @SergiiZaskaleta. I forgot to update this question, but this has been fixed a while ago, with this commit.
From NEWS:
Subsets using logical expressions in i never return all-NA rows. Edge case DT[NA] is now fixed, #1252. Thanks to @sergiizaskaleta.
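With a data.table build that includes that fix (an assumption about your installed version), the original example should now come back empty:
library(data.table)
a <- data.table(a = 1, d = NA)
a[!is.na(a) & d == "3"]
# Empty data.table (0 rows) of 2 cols: a,d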
Don't know if it's a bug or not, but it seems it has to do with the type of your variable d.
a = data.table(a = 1, d = NA)
str(a)
# Classes ‘data.table’ and 'data.frame': 1 obs. of 2 variables:
# $ a: num 1
# $ d: logi NA
# - attr(*, ".internal.selfref")=<externalptr>
a[!is.na(a) & d == "3"] # this returns NAs
# a d
# 1: NA NA
a[!is.na(a) & !is.na(d)] # this returns nothing
# Empty data.table (0 rows) of 2 cols: a,d
This one also works:
a = data.table(a = 1, d = 4)
str(a)
# Classes ‘data.table’ and 'data.frame': 1 obs. of 2 variables:
# $ a: num 1
# $ d: num 4
# - attr(*, ".internal.selfref")=<externalptr>
a[!is.na(a) & d == "3"]
# Empty data.table (0 rows) of 2 cols: a,d
Looks like when the variable holds NA (here of logical type), the comparison itself evaluates to NA rather than FALSE, and those NA rows come through the subset.
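The three-valued logic behind this is easy to reproduce in base R, whose data.frame subsetting also keeps an all-NA row for an NA index (the behavior data.table used to mimic):
NA == "3"  # NA, not FALSE
TRUE & NA  # NA
d <- data.frame(a = 1, d = NA)
d[!is.na(d$a) & d$d == "3", ]  # base R returns an all-NA row here
#     a  d
# NA NA NA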
However, with the dplyr package it seems to work:
library(dplyr)
a = data.table(a = 1, d = NA)
a %>% filter(!is.na(a) & d == "3")
# Empty data.table (0 rows) of 2 cols: a,d
The same with the subset command:
subset(a, !is.na(a) & d == "3")
# Empty data.table (0 rows) of 2 cols: a,d
I have a data.table and want to apply a function to each subset of rows (i.e. by group).
Normally one would do this as follows: DT[, lapply(.SD, function), by = y]
But in my case the function does not return a single atomic value; it returns a vector (here, a table of counts).
Is there a way to do something like this?
library(data.table)
set.seed(9)
DT <- data.table(x1=letters[sample(x=2L,size=6,replace=TRUE)],
x2=letters[sample(x=2L,size=6,replace=TRUE)],
y=rep(1:2,3), key="y")
DT
# x1 x2 y
#1: a a 1
#2: a b 1
#3: a a 1
#4: a a 2
#5: a b 2
#6: a a 2
DT[, lapply(.SD, table), by = y]
# Desired Result, something like this:
# x1_a x2_a x2_b
# 3 2 1
# 3 2 1
Thanks in advance. Also: I would not mind if the result of the function had to have a fixed length.
You simply need to unlist the table and then coerce back to a list:
> DTCounts <- DT[, as.list(unlist(lapply(.SD, table))), by=y]
> DTCounts
y x1.a x2.a x2.b
1: 1 3 2 1
2: 2 3 2 1
if you do not like the dots in the names, you can sub them out:
> setnames(DTCounts, sub("\\.", "_", names(DTCounts)))
> DTCounts
y x1_a x2_a x2_b
1: 1 3 2 1
2: 2 3 2 1
Note that if not all values in a column are present for each group (i.e., if x2 = c("a", "b") when y == 1, but x2 = c("b", "b") when y == 2), then the above breaks. The solution is to make the columns factors before counting.
DT[, lapply(.SD, is.factor)]  # check which columns are already factors
## OR convert explicitly:
columnsToConvert <- c("x1", "x2") # or .. <- setdiff(names(DT), "y")
DT <- cbind(DT[, lapply(.SD, factor), .SDcols = columnsToConvert], y = DT[, y])
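Why factors help: table() on a factor reports zero counts for absent levels, so every group produces the same set of names. A minimal illustration:
table(c("b", "b"))                                # only a count for "b"
table(factor(c("b", "b"), levels = c("a", "b")))  # the zero count for "a" is kept
# a b
# 0 2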