Replace dash with zero without affecting negative numbers - r

Is there a way to replace dashes by NA or zero without affecting negative values in a vector that resembles the following: c("-","-121512","123","-").

Just do
x <- c("-","-121512","123","-")
x[x == "-"] <- NA
x
#[1] NA "-121512" "123" NA
If you need a numeric vector instead of a character vector, wrap x in as.numeric().
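The question also asks about zero as a replacement. A minimal sketch of both variants, assuming the vector holds only lone dashes and numeric strings (the names x_na and x_zero are illustrative, not from the answer):

```r
x <- c("-", "-121512", "123", "-")

# Exact comparison only matches the lone dash, so "-121512" is untouched.
x_na   <- as.numeric(replace(x, x == "-", NA))   # dashes become NA
x_zero <- as.numeric(replace(x, x == "-", "0"))  # dashes become 0

x_na
x_zero
```

Because the comparison is x == "-" rather than a pattern match, the minus sign of negative numbers is never touched.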
If you want to replace all "-" in a dataframe we can use the same logic
df1 <- data.frame(x = c("-","-121512","123","-"),
y = c("-","-121512","123","-"),
z = c("-","A","B","-"), stringsAsFactors = FALSE)
df1[df1 == "-"] <- NA
If you want the columns converted to numeric where appropriate, use type.convert
df1[] <- lapply(df1, type.convert, as.is = TRUE)
str(df1)
'data.frame': 4 obs. of 3 variables:
$ x: int NA -121512 123 NA
$ y: int NA -121512 123 NA
$ z: chr NA "A" "B" NA

We can use na_if
library(dplyr)
na_if(v1, "-") %>%
as.numeric
#[1] NA -121512 123 NA
If it is a data.frame
library(tidyverse)
df1 %>%
mutate_all(na_if, "-") %>%
type_convert
data
v1 <- c("-","-121512","123","-")

Related

Replace NA with " ", but only in character columns

I have a large dataset with ~200 columns of various types. I need to replace NA values with "", but only in character columns.
Using the dummy data table
DT <- data.table(x = c(1, NA, 2),
y = c("a", "b", NA))
> DT
x y
1: 1 a
2: NA b
3: 2 <NA>
> str(DT)
Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
$ x: num 1 NA 2
$ y: chr "a" "b" NA
I have tried the following for-loop with a condition, but it doesn't work.
for (i in names(DT)) {
if (class(DT$i) == "character") {
DT[is.na(i), i := ""]
}
}
The loop runs with no errors, but doesn't change the DT.
The expected output I am looking for is this:
x y
1: 1 a
2: NA b
3: 2
The solution doesn't necessarily have to involve a loop, but I couldn't think of one.
One option if you don't mind using dplyr:
na_to_space <- function(x) ifelse(is.na(x)," ",x)
> DT %>% mutate_if(.predicate = is.character,.funs = na_to_space)
x y
1 1 a
2 NA b
3 2
DT[, lapply(.SD, function(x){if(is.character(x)) x[is.na(x)] <- ' '; x})]
Or, if you don't like typing function(x)
library(purrr)
DT[, map(.SD, ~{if(is.character(.x)) .x[is.na(.x)] <- ' '; .x})]
To replace the columns in DT by reference
DT[, names(DT) := map(.SD, ~{if(is.character(.x)) .x[is.na(.x)] <- ' '; .x})]
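As for why the loop in the question has no effect: DT$i looks for a column literally named i, finds none and returns NULL, so the class test is never TRUE. A sketch of a corrected loop, assuming data.table is installed; it looks columns up with DT[[col]], assigns with set(), and fills with "" as the question asks (the variable name col is illustrative):

```r
library(data.table)

DT <- data.table(x = c(1, NA, 2),
                 y = c("a", "b", NA))

for (col in names(DT)) {
  # DT[[col]] extracts the column whose name is stored in `col`
  if (is.character(DT[[col]])) {
    # set() modifies DT by reference, only in the rows where the value is NA
    set(DT, i = which(is.na(DT[[col]])), j = col, value = "")
  }
}

DT
```

Numeric columns are skipped, so their NA values stay as they are.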

R finding values in a data frame using | operator vs %in%

I'm trying to find all instances of certain values in a data frame, and replace them with NA. I tried this two different ways that I thought were equivalent, but I get different results. For example:
df <- data.frame(a=c(1,2),b=c(3,4))
df[df == 1 | df == 4] <- NA
gives me the expected result:
df
# a b
# 1 NA 3
# 2 2 NA
whereas
df <- data.frame(a=c(1,2),b=c(3,4))
df[df %in% c(1,4)] <- NA
does nothing:
df
# a b
# 1 1 3
# 2 2 4
This seems to be because if I use the "|" operator, it searches the data frame element by element, whereas if I use %in% it searches the data frame vector by vector (column by column), but I don't understand why.
df <- data.frame(a=c(1,2),b=c(3,4))
df == 1 | df == 4
# a b
# [1,] TRUE FALSE
# [2,] FALSE TRUE
df %in% c(1,4)
# [1] FALSE FALSE
If we look at the code for %in%
function (x, table)
match(x, table, nomatch = 0L) > 0L
So, it is basically doing a match. The output of match would be
match(df, c(1,4), nomatch = 0L) > 0L
#[1] FALSE FALSE
%in% is applied on vectors instead of data.frame. So, we loop through the columns using lapply, then do the %in%
lapply(df, `%in%`, c(1, 4))
If we need a logical matrix instead, use sapply
df[sapply(df, `%in%`, c(1, 4))] <- NA
We can check the match works on a vector
sapply(df, match, x = c(1,4), nomatch = 0L) > 0
# a b
#[1,] TRUE FALSE
#[2,] FALSE TRUE
%in% is only for vectors. In order to perform it on a dataframe you would have to use sapply to apply a function across each of the columns.
df[sapply(df, function(x) x %in% c(1, 4))] <- NA
a b
1 NA 3
2 2 NA
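A column-wise variant that avoids building a logical matrix, sketched in base R under the assumption that the target values can be compared with %in% in every column (the names targets and col are illustrative):

```r
df <- data.frame(a = c(1, 2), b = c(3, 4))
targets <- c(1, 4)

# df[] keeps the data.frame structure while every column is replaced
df[] <- lapply(df, function(col) replace(col, col %in% targets, NA))

df
```

Each column is handled as a plain vector here, which is the case %in% is designed for.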

Replace NA in Factor type data in R

The data frame X looks like this
State code
New Jersey 1
New York 2
Califronia NA
All columns are factors. I am looking to replace the NAs with text or 0 so that I can transpose them later.
When I try to run this command
X[is.na(X)] <- "0"
I get the following warnings
Warning messages:
1: In `[<-.factor`(`*tmp*`, thisvar, value = "0") :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, thisvar, value = "0") :
invalid factor level, NA generated
3: In `[<-.factor`(`*tmp*`, thisvar, value = "0") :
invalid factor level, NA generated
4: In `[<-.factor`(`*tmp*`, thisvar, value = "0") :
invalid factor level, NA generated
There is no change in NA values.
Another alternative using built-in factor:
df <- data.frame(a=letters[1:3], b=c("d", "e", NA), stringsAsFactors = TRUE)
df
a b
1 a d
2 b e
3 c <NA>
Now, recode the factor with factor:
df$b <- factor(df$b, exclude = NULL,
levels = c("d", "e", NA),
labels = c("d", "e", "f"))
df
a b
1 a d
2 b e
3 c f
And for many factors, the following may be useful:
df[] <- lapply(df, function(x){
# check if you have a factor first:
if(!is.factor(x)) return(x)
# otherwise include NAs into factor levels and change factor levels:
x <- factor(x, exclude=NULL)
levels(x)[is.na(levels(x))] <- "0"
return(x)
})
The code you wrote will work for matrices, if you don't mind converting back and forth.
> X
State code code2
1 NewJersey 1 NA
2 NewYork 2 0
3 Califronia NA 4
> X<-as.matrix(X)
> X[is.na(X)] <- "0"
> X<-as.data.frame(X)
> X
State code code2
1 NewJersey 1 0
2 NewYork 2 0
3 Califronia 0 4
> str(X)
'data.frame': 3 obs. of 3 variables:
$ State: Factor w/ 3 levels "Califronia","NewJersey",..: 2 3 1
$ code : Factor w/ 3 levels " 1"," 2","0": 1 2 3
$ code2: Factor w/ 3 levels " 0"," 4","0": 3 1 2
Simply:
X$code <- as.character(X$code) # convert to character first; as.numeric() on a factor returns the level codes, not the values
X[is.na(X)] <- "0"
X$code <- as.factor(as.numeric(X$code))
In a loop over all columns it would look like this:
for (i in 2:ncol(X)) {
X[,i] <- as.character(X[,i])
X[which(is.na(X[,i])==TRUE),i] <- "0"
X[,i] <- as.factor(as.numeric(X[,i]))
}
And for a character value like this:
for (i in 2:ncol(X)) {
X[,i] <- as.character(X[,i])
X[which(is.na(X[,i])==TRUE),i] <- "Not Assigned"
X[,i] <- as.factor(X[,i])
}
Or if you prefer not to transform to character first, assign a new level to each column:
for (i in 2:ncol(X)) {
levels(X[,i]) <- c(levels(X[,i]), "Not Assigned")
X[which(is.na(X[,i])==TRUE),i] <- "Not Assigned"
}
Let's create a random df with a factor column:
df <- data.frame(a=sample(0:10, size=10, replace=TRUE),
b=sample(20:30, size=10, replace=TRUE))
df[df$a==0,'a'] <- NA
df$a <- as.factor(df$a)
Another way to do it is:
#check levels
levels(df$a)
#[1] "3" "4" "7" "9" "10"
#add new factor level. i.e 88 in our example
df$a = factor(df$a, levels=c(levels(df$a), 88))
#convert all NA's to 88
df$a[is.na(df$a)] = 88
#check levels again
levels(df$a)
#[1] "3" "4" "7" "9" "10" "88"
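The level-adding approach can be wrapped in a small function and applied to every column. This is only a sketch: fill_factor_na is a hypothetical helper name, the data is made up to resemble the question's X, and "0" is assumed to be an acceptable replacement label.

```r
# Hypothetical helper: turn NA in a factor into an explicit level
fill_factor_na <- function(f, label = "0") {
  # leave non-factors and factors without NA untouched
  if (!is.factor(f) || !anyNA(f)) return(f)
  levels(f) <- union(levels(f), label)  # make sure the level exists
  f[is.na(f)] <- label                  # now the assignment is valid
  f
}

X <- data.frame(State = factor(c("New Jersey", "New York", "California")),
                code  = factor(c(1, 2, NA)))

X[] <- lapply(X, fill_factor_na)
str(X)
```

Adding the level before assigning is what avoids the "invalid factor level, NA generated" warning from the question.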

parsing quotes out of "NA" strings

My dataframe has some variables that contain missing values as strings like "NA". What is the most efficient way to parse all columns in a dataframe that contain these and convert them into real NAs that are caught by functions like is.na()?
I am using sqldf to query the database.
Reproducible example:
vect1 <- c("NA", "NA", "BANANA", "HELLO")
vect2 <- c("NA", 1, 5, "NA")
vect3 <- c(NA, NA, "NA", "NA")
df = data.frame(vect1,vect2,vect3)
To add to the alternatives, you can also use replace instead of the typical blah[index] <- NA approach. replace would look like:
df <- replace(df, df == "NA", NA)
Another alternative to consider is type.convert. This is the function that R uses when reading data in to automatically convert column types. Thus, the result is different from your current approach in that, for instance, the second column gets converted to numeric.
df[] <- lapply(df, function(x) type.convert(as.character(x), na.strings = "NA"))
df
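A self-contained sketch of the type.convert approach, assuming character columns (stringsAsFactors = FALSE) and as.is = TRUE so that text columns stay character instead of becoming factors; both settings are assumptions and not part of the answer above:

```r
df <- data.frame(vect1 = c("NA", "NA", "BANANA", "HELLO"),
                 vect2 = c("NA", "1", "5", "NA"),
                 vect3 = c(NA, NA, "NA", "NA"),
                 stringsAsFactors = FALSE)

# Re-parse every column, treating the string "NA" as a missing value
df[] <- lapply(df, function(x) {
  type.convert(as.character(x), na.strings = "NA", as.is = TRUE)
})

str(df)
colSums(is.na(df))
```

Note that the column types change as a side effect: a column that contains only numbers and "NA" strings no longer stays character.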
Here's a performance comparison. The sample data is from @roland's answer.
Here are the functions to test:
funop <- function() {
df[df == "NA"] <- NA
df
}
funr <- function() {
ind <- which(vapply(df, function(x) class(x) %in% c("character", "factor"), FUN.VALUE = TRUE))
as.data.table(df)[, names(df)[ind] := lapply(.SD, function(x) {
is.na(x) <- x == "NA"
x
}), .SDcols = ind][]
}
funam1 <- function() replace(df, df == "NA", NA)
funam2 <- function() {
df[] <- lapply(df, function(x) type.convert(as.character(x), na.strings = "NA"))
df
}
Here's the benchmarking:
library(microbenchmark)
microbenchmark(funop(), funr(), funam1(), funam2(), times = 10)
# Unit: seconds
# expr min lq mean median uq max neval
# funop() 3.629832 3.750853 3.909333 3.855636 4.098086 4.248287 10
# funr() 3.074825 3.212499 3.320430 3.279268 3.332304 3.685837 10
# funam1() 3.714561 3.899456 4.238785 4.065496 4.280626 5.512706 10
# funam2() 1.391315 1.455366 1.623267 1.566486 1.606694 2.253258 10
replace would be the same as @roland's approach, which is the same as @jgozal's. However, the type.convert approach would result in different column types.
all.equal(funop(), setDF(funr()))
all.equal(funop(), funam1())
str(funop())
# 'data.frame': 10000000 obs. of 3 variables:
# $ vect1: Factor w/ 3 levels "BANANA","HELLO",..: 2 2 NA 2 1 1 1 NA 1 1 ...
# $ vect2: Factor w/ 3 levels "1","5","NA": NA 2 1 NA 1 NA NA 1 NA 2 ...
# $ vect3: Factor w/ 1 level "NA": NA NA NA NA NA NA NA NA NA NA ...
str(funam2())
# 'data.frame': 10000000 obs. of 3 variables:
# $ vect1: Factor w/ 2 levels "BANANA","HELLO": 2 2 NA 2 1 1 1 NA 1 1 ...
# $ vect2: int NA 5 1 NA 1 NA NA 1 NA 5 ...
# $ vect3: logi NA NA NA NA NA NA ...
I found this nice way of doing it from this question:
So for this particular situation it would just be:
df[df=="NA"]<-NA
It only took about 30 seconds with 5 million rows and ~250 variables
This is slightly faster:
set.seed(42)
df <- do.call(data.frame, lapply(df, sample, size = 1e7, replace = TRUE))
df2 <- df
system.time(df[df=="NA"]<-NA )
# user system elapsed
#3.601 0.378 3.984
library(data.table)
setDT(df2)
system.time({
#find character and factor columns
ind <- which(vapply(df2, function(x) class(x) %in% c("character", "factor"), FUN.VALUE = TRUE))
#assign by reference
df2[, names(df2)[ind] := lapply(.SD, function(x) {
is.na(x) <- x == "NA"
x
}), .SDcols = ind]
})
# user system elapsed
#2.484 0.190 2.676
all.equal(df, setDF(df2))
#[1] TRUE

Can you use rbind.fill without having it fill in NA's?

I am trying to combine two dataframes with different number of columns and column headers. However, after I combine them using rbind.fill(), the resulting file has filled the empty cells with NA.
This is very inconvenient since one of the columns has data that is also represented as "NA" (for North America), so when I export the result to a CSV file, the spreadsheet can't tell them apart.
Is there a way for me to:
Use the rbind.fill function without having it populate the empty cells with NA
or
Change the column to replace the NA values*
*I've scoured the blogs, and have tried the two most popular solutions:
df$col[is.na(df$col)] <- 0, #it does not work
df$col = ifelse(is.na(df$col), "X", df$col), #it changes all the characters to numbers, and ruins the column
Let me know if you have any advice! I (unfortunately) cannot share the df, but will be willing to answer any questions!
NA is not the same as "NA" to R, but might be interpreted as such by your favourite spreadsheet program. NA is a special value in R just like NaN (not a number). If I understand correctly, one of your solutions is to replace the "NA" values in the column representing North America with something else, in which case you should just be able to do...
df$col[ df$col == "NA" ] <- "NorthAmerica"
This is assuming that your "NA" values are actually character strings. is.na() returns FALSE for the character string "NA", which is why df$col[ is.na(df$col) ] <- 0 doesn't change anything.
An example of the difference between NA and "NA":
x <- c( 1, 2, 3 , "NA" , 4 , 5 , NA )
> x[ !is.na(x) ]
[1] "1" "2" "3" "NA" "4" "5"
> x[ x == "NA" & !is.na(x) ]
[1] "NA"
Method to resolve this
I think you want to leave "NA" and any NAs as they are in the first df, but make all NA in the second df formed from rbind.fill() change to something like "NotAvailable". You can accomplish this like so...
library(plyr) # provides rbind.fill()
df1 <- data.frame( col = rep( "NA" , 6 ) , x = 1:6 , z = rep( 1 , 6 ) )
df2 <- data.frame( col = rep( "SA" , 2 ) , x = 1:2 , y = 5:6 )
df <- rbind.fill( df1 , df2 )
temp <- df [ (colnames(df) %in% colnames(df2)) ]
temp[ is.na( temp ) ] <- "NotAvailable"
res <- cbind( temp , df[ !( colnames(df) %in% colnames(df2) ) ] )
#df has real NA values in column z and column y. We just want to get rid of y's
df
# col x z y
# 1 NA 1 1 NA
# 2 NA 2 1 NA
# 3 NA 3 1 NA
# 4 NA 4 1 NA
# 5 NA 5 1 NA
# 6 NA 6 1 NA
# 7 SA 1 NA 5
# 8 SA 2 NA 6
#res has "NA" strings in col representing "North America" and NA values in z, whilst those in y have been removed
#More generally, any NA in df1 will be left 'as-is', whilst NA from df2 formed using rbind.fill will be converted to character string "NotAvailable"
res
# col x y z
# 1 NA 1 NotAvailable 1
# 2 NA 2 NotAvailable 1
# 3 NA 3 NotAvailable 1
# 4 NA 4 NotAvailable 1
# 5 NA 5 NotAvailable 1
# 6 NA 6 NotAvailable 1
# 7 SA 1 5 NA
# 8 SA 2 6 NA
If you have a dataframe that contains NA's and you want to replace them all you can do something like:
df[is.na(df)] <- -999
This takes care of all NAs in one shot.
If you only want to act on a single column, you can do something like
df$col[which(is.na(df$col))] <- -999
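If the real problem is telling the two apart in the exported file, another option may be to leave the data alone and change how missing values are written. A minimal sketch using the na argument of write.csv, with made-up data (the column names region and sales are illustrative):

```r
df <- data.frame(region = c("NA", "SA", NA),
                 sales  = c(10, NA, 30),
                 stringsAsFactors = FALSE)

# na = "" writes real missing values as empty fields;
# the string "NA" is ordinary text and is written quoted.
csv_lines <- capture.output(write.csv(df, na = "", row.names = FALSE))
csv_lines
```

Whether the spreadsheet then shows the cells as intended depends on how it imports empty fields, so this is worth checking on a small file first.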
