Function to change blanks to NA - r

I'm trying to write a function that turns empty strings into NA. A summary of one of my columns looks like this:
      a   b
 12 210 468
I'd like to change the 12 empty values to NA. I also have a few other factor columns for which I'd like to change empty values to NA, so I borrowed some stuff from here and there to come up with this:
# change nulls to NAs
nullToNA <- function(df){
# split df into numeric & non-numeric functions
a<-df[,sapply(df, is.numeric), drop = FALSE]
b<-df[,sapply(df, Negate(is.numeric)), drop = FALSE]
# Change empty strings to NA
b<-b[lapply(b,function(x) levels(x) <- c(levels(x), NA) ),] # add NA level
b<-b[lapply(b,function(x) x[x=="",]<- NA),] # change Null to NA
# Put the columns back together
d<-cbind(a,b)
d[, names(df)]
}
However, I'm getting this error:
> foo<-nullToNA(bar)
Error in x[x == "", ] <- NA : incorrect number of subscripts on matrix
Called from: FUN(X[[i]], ...)
I have tried the answer found here: Replace all 0 values to NA but it changes all my columns to numeric values.

You can directly index fields that match a logical criterion. So you can just write:
df[is_empty(df)] = NA
Where is_empty is your comparison, e.g. df == "":
df[df == ""] = NA
But note that is.null(df) won't work, and would be weird anyway [1]. I would advise against merging the logic for columns of different types, though! Instead, handle them separately.
[1] You'll almost never encounter NULL inside a table, since that only works if the underlying vector is a list. You can create matrices and data.frames with this constraint, but then is.null(df) will never be TRUE because the NULL values are wrapped inside the list.
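A minimal sketch of handling columns by type, as advised above, assuming a small made-up data frame (the column names and the helper blank_to_na are hypothetical, not from the question). Only factor and character columns are touched, so numeric columns keep their type:

```r
# Hypothetical example data: one numeric column and two text columns with blanks
df <- data.frame(
  n   = c(1.5, 2.5, 3.5, 4.5),
  grp = factor(c("a", "", "b", "")),
  txt = c("x", "", "y", "z"),
  stringsAsFactors = FALSE
)

# Replace "" with NA in a single factor or character vector
blank_to_na <- function(x) {
  x[!is.na(x) & x == ""] <- NA
  # drop the now-unused "" level so it no longer shows up in summary()
  if (is.factor(x)) droplevels(x) else x
}

# Apply the helper only to the text-like columns
is_text <- sapply(df, function(x) is.factor(x) || is.character(x))
df[is_text] <- lapply(df[is_text], blank_to_na)

summary(df$grp)
```

Dropping the unused level is optional; without it the factor keeps an empty "" level with a count of zero.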

This worked for me
df[df == 'NULL'] <- NA

How about just:
df[apply(df, 2, function(x) x=="")] = NA
Works fine for me, at least on simple examples.

This is the function I used to solve this issue.
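null_na <- function(vector) {
  # start from all NA, then copy over every element that is neither NA nor ""
  new_vector <- rep(NA, length(vector))
  for (i in seq_along(vector)) {
    # test is.na() first: NA == "" evaluates to NA, which makes if() fail
    if (!is.na(vector[i]) && vector[i] != "") {
      new_vector[i] <- vector[i]
    }
  }
  return(new_vector)
}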
Just plug in the column or vector you are having an issue with.

Related

Generate variable with missing values if condition doesn't hold true

I want to generate a new variable in a data frame which contains the difference of the current row and a lag-value of another variable. However, I want to assign only values for those rows, where a specific condition holds true for a second variable. In this example the new lag-difference variable should only have values for rows with the fruit "Banana". All other rows shall be empty or rather contain NA.
fruitnumbers <- data.frame(numbers=c(2,4,1,5,3,5,2,5,1,3),
fruits=c("Apple","Banana","Orange","Cherry","Strawberry","Banana","Banana",
"Apple","Cherry","Banana"))
I tried to solve this problem with an if condition:
fruitnumbers$newvar <- if(fruitnumbers$fruits=="Banana"){
fruitnumbers$numbers-lag(fruitnumbers$numbers, 1)
}
However, I've received the following warning message.
Warning message:
In if (fruits == "Banana") { :
the condition has length > 1 and only the first element will be used
From research, I assume that it has something to do with the fact R wants to check the If-condition for the whole data frame instead of row by row for each value but I'm not quite sure. I'd be grateful for any solution here.
Here fruitnumbers$fruits is a vector, so when you run if (fruitnumbers$fruits == "Banana") only the first element of fruitnumbers$fruits is tested (here "Apple" == "Banana").
If you want a vectorized test, use the case_when function from the dplyr package:
library(dplyr)
fruitnumbers$newvar <- case_when(
fruitnumbers$fruits == "Banana" ~ fruitnumbers$numbers-lag(fruitnumbers$numbers, 1),
TRUE ~ NA_real_
)
Which gives
fruitnumbers$newvar
[1] NA 2 NA NA NA 2 -3 NA NA 2
EDIT: as someone mentioned, you could also have used the ifelse function:
fruitnumbers$newvar <- ifelse(fruitnumbers$fruits == "Banana", fruitnumbers$numbers-lag(fruitnumbers$numbers, 1), NA)
I would do that in two stages:
Create a new column in the data frame:
fruitnumbers$newvar <- NA
Change the values only for bananas:
fruitnumbers$newvar[fruitnumbers$fruits=="Banana"] <-
fruitnumbers$numbers[fruitnumbers$fruits=="Banana"] - lag(fruitnumbers$numbers[fruitnumbers$fruits=="Banana"], 1)
I am not sure about the lag function in this context. It only returns zeros. Another problem might be hiding there.
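The zeros are most likely because, without dplyr loaded, lag resolves to stats::lag, which only shifts the time base of a series and leaves the values of a plain vector where they are. A minimal sketch of a manual one-step lag in base R (shift1 is a hypothetical helper name), applied to the whole numbers column as in the question:

```r
fruitnumbers <- data.frame(
  numbers = c(2, 4, 1, 5, 3, 5, 2, 5, 1, 3),
  fruits  = c("Apple", "Banana", "Orange", "Cherry", "Strawberry",
              "Banana", "Banana", "Apple", "Cherry", "Banana")
)

# stats::lag() keeps the values in place, so this difference is all zeros
x <- fruitnumbers$numbers
zeros <- as.vector(x - stats::lag(x, 1))

# Hypothetical helper: shift a vector down by one position, padding with NA
shift1 <- function(v) c(NA, head(v, -1))

# Fill the new column only for the banana rows
is_banana <- fruitnumbers$fruits == "Banana"
fruitnumbers$newvar <- NA_real_
fruitnumbers$newvar[is_banana] <- (x - shift1(x))[is_banana]
fruitnumbers$newvar
```

Note that this takes the lag over all rows and then keeps the banana rows, which is different from taking the lag within the banana rows only.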
In base R, you could try this:
fruitnumbers <- data.frame(numbers=c(2,4,1,5,3,5,2,5,1,3),
fruits=c("Apple","Banana","Orange","Cherry","Strawberry","Banana","Banana",
"Apple","Cherry","Banana"))
indexes <- which(fruitnumbers$fruits == "Banana")
# stats::lag() does not shift a plain vector, so build the lagged column by hand
lagged <- c(NA, head(fruitnumbers$numbers, -1))
fruitnumbers[indexes, 'newvar'] <- fruitnumbers[indexes, 'numbers'] - lagged[indexes]
The remaining rows of column newvar will be NA.

too many NA values in dataset for na.omit to handle

I have a text file dataset that I read as follows:
cancer1 <- read.table("cancer.txt", stringsAsFactors = FALSE, quote='', header=TRUE,sep='\t')
I then have to convert the class of the constituent values so that I can perform mathematical analyses on the df.
cancer<-apply(cancer1,2, as.numeric)
This introduces >9000 NA values into a "17980 X 598" df. Hence there are too many NA values to just simply use "na.omit" as that just removes all of the rows....
Hence my plan is to replace each NA in each row with the mean value of that row, my attempt is as follows:
for(i in rownames(cancer)){
cancer2<-replace(cancer, is.na(cancer), mean(cancer[i,]))
}
However this removes every row just like na.omit:
dim(cancer2)
[1] 0 598
Can someone tell me how to replace each of the NA values with the mean of that row?
You can use rowMeans with indexing.
k <- which(is.na(cancer1), arr.ind=TRUE)
cancer1[k] <- rowMeans(cancer1, na.rm=TRUE)[k[,1]]
Here k is a matrix holding the row and column indices of the NA values, so k[,1] picks the matching row mean for each NA.
This works better than my original answer, which was:
for(i in 1:nrow(cancer1)){
for(n in 1:ncol(cancer1)){
if(is.na(cancer1[i,n])){
cancer1[i,n] <- mean(t(cancer1[i,]), na.rm = T)# or rowMeans(cancer1[i,], na.rm=T)
}
}
}
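The rowMeans indexing can be checked on a small made-up matrix. which(..., arr.ind = TRUE) returns one row per NA with its row and column position, and the first column of that result selects the matching row mean:

```r
# Made-up 2 x 3 matrix with one NA per row
m <- matrix(c(1, NA, 3,
              4, 5, NA),
            nrow = 2, byrow = TRUE)

# Row and column position of every NA
k <- which(is.na(m), arr.ind = TRUE)

# Replace each NA with the mean of the non-missing values in its row
row_means <- rowMeans(m, na.rm = TRUE)
m[k] <- row_means[k[, 1]]
m
```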
Sorted it out with code adapted from a related post:
cancer1 <- read.table("TCGA_BRCA_Agilent_244K_microarray_genomicMatrix.txt", stringsAsFactors = FALSE, quote='' ,header=TRUE,sep='\t')
t<-cancer1[1:800, 1:400]
t<-t(t)
t<-apply(t,2, as.numeric) #constituents read as character strings need to be converted
#to numerics
cM <- rowMeans(t, na.rm=TRUE) #necessary subsequent data cleaning due to the
#introduction of >1000 NA values- converted to the mean value of that row
indx <- which(is.na(t), arr.ind=TRUE)
t[indx] <- cM[indx[,1]] #cM holds row means, so index it by the row position of each NA

Finding the unique identifier for a row with NA in a particular column in R

I have data in the following format:
ID Species Side_of_boat
1 spA Port
2 spB Starboard
3 spA NA
I would like to write a line of code that gives me the unique ID for all rows that have NA in 'side of boat'.
I have tried:
unique(df$ID[df$side_of_boat == "NA"])
But it doesn't give me the output I want. I would like the output to be:
"3"
Thanks!
Try
unique(df$ID[is.na(df$Side_of_boat)])
instead. NA is a special value in R and also has its own special function is.na() to test if an entry is NA. Check ?NA for more information.
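To see why the comparison with "NA" fails, here is a small sketch that rebuilds the example table, assuming the last value is a real missing value rather than the text "NA":

```r
df <- data.frame(
  ID = 1:3,
  Species = c("spA", "spB", "spA"),
  Side_of_boat = c("Port", "Starboard", NA)
)

# Comparing with the string "NA" gives NA for the missing entry, not TRUE
cmp <- df$Side_of_boat == "NA"

# is.na() gives a proper TRUE/FALSE vector that can be used for indexing
ids <- unique(df$ID[is.na(df$Side_of_boat)])
ids
```

Column names are also case-sensitive, so df$side_of_boat is NULL for a column called Side_of_boat.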
#Method 1
n <- which(is.na(df$Side_of_boat))
You can also use *apply across the rows, e.g. to list the non-NA column positions of each row:
lapply(apply(df, 1, function(x) which(!is.na(x))), paste, collapse=", ")
#Method 2
new_DF <- subset(df, is.na(Side_of_boat))
#Method 3
You could also write a function to do this for you:
getNa <- function(dfrm) lapply(dfrm, function(x) which(is.na(x)))
#Note
In case you have "NA" stored as character values, first run
df$Side_of_boat[df$Side_of_boat=='NA'] <- NA
Try:
df$ID[which(is.na(df$Side_of_boat))]
It should give you a vector of the IDs, regardless of whether they are numbers or characters.

Create new column with binary data based on several columns

I have a dataframe in which I want to create a new column with 0/1 (which would represent absence/presence of a species) based on the records in previous columns. I've been trying this:
update_cat$bobpresent <- NA #creating the new column
x <- c("update_cat$bob1999", "update_cat$bob2000", "update_cat$bob2001","update_cat$bob2002", "update_cat$bob2003", "update_cat$bob2004", "update_cat$bob2005", "update_cat$bob2006","update_cat$bob2007", "update_cat$bob2008", "update_cat$bob2009") #these are the names of the columns I want the new column to base its results in
bobpresent <- function(x){
if(x==NA)
return(0)
else
return(1)
} # if all the previous columns are NA then the new column should be 0, otherwise it should be 1
update_cat$bobpresence <- sapply(update_cat$bobpresent, bobpresent) #apply the function to the new column
Everything goes fine until the last line, where I get this error:
Error in if (x == NA) return(0) else return(1) :
missing value where TRUE/FALSE needed
Can somebody please advise me?
Your help will be much appreciated.
Comparisons with NA yield NA, so x == NA always evaluates to NA. If you want to check whether a value is NA, you must use the is.na function, for example:
> NA == NA
[1] NA
> is.na(NA)
[1] TRUE
The if statement inside your function needs TRUE or FALSE as its condition but gets NA instead, hence the error message. You can fix that by rewriting your function like this:
bobpresent <- function(x) { ifelse(is.na(x), 0, 1) }
In any case, based on your original post I don't understand what you're trying to do. This change only fixes the error you get with sapply, but fixing the logic of your program is a different matter, and there is not enough information in your post.
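If the goal is a 0/1 flag that is 0 only when all of the yearly columns are NA, one possible sketch counts the non-missing entries per row. It assumes columns named bob1999, bob2000, ... as in the question; the values here are made up:

```r
# Made-up stand-in for the real data
update_cat <- data.frame(
  bob1999 = c(NA, 3, NA),
  bob2000 = c(NA, NA, NA),
  bob2001 = c(NA, 1, 2)
)

# Column names as plain strings, without the "update_cat$" prefix
cols <- c("bob1999", "bob2000", "bob2001")

# is.na() on the selected columns gives a logical matrix; a row with at least
# one non-NA value counts as present (1), a row of all NAs as absent (0)
update_cat$bobpresent <- as.integer(rowSums(!is.na(update_cat[cols])) > 0)
update_cat$bobpresent
```

This works on whole columns at once, so no if statement or sapply call is needed.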

How to show indexes of NAs?

I have a snippet that displays the NAs, but I can't figure out how to get their positions.
try(na.fail(x))
> Error in na.fail.default(x) : missing values in object
# display NAs
myvector[is.na(x)]
# returns
NA NA NA NA
The only thing I get from this is the length of the NA vector, which is not very helpful when the NAs were caused by a bug in my code that I am trying to track down. How can I get the index of the NA element(s)?
I also tried:
subset(x,is.na(x))
which has the same effect.
EDIT:
y <- complete.cases(x)
x[!y]
# just returns another
NA NA NA NA
You want the which function:
which(is.na(arr))
is.na() will return a boolean index of the same shape as the original data frame.
In other words, any cells in that m x n index with the value TRUE correspond to NA values in the original data frame.
You can then use this to change the NAs, if you wish:
DF[is.na(DF)] = 999
To get the total number of data rows with at least one NA:
cc = complete.cases(DF)
num_missing = nrow(DF) - sum(cc)
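For a data frame, which can also report the row and column of each NA through its arr.ind argument. A small sketch with made-up data:

```r
# Made-up data frame with one NA in each column
DF <- data.frame(a = c(1, NA, 3), b = c(NA, "y", "z"))

# One row per NA, with its row and column position
pos <- which(is.na(DF), arr.ind = TRUE)
pos

# Number of rows with at least one NA
num_missing <- sum(!complete.cases(DF))
num_missing
```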
which(Dataset$variable=="") returns the row numbers of the empty-string entries in a particular column.
R Code using loop and condition :
# Testing for missing values
is.na(x) # returns TRUE if x is missing
y <- c(1,NA,3,NA)
is.na(y)
# returns a logical vector (F T F T)
# Print the index of NA values
for(i in 1:length(y)) {
if(is.na(y[i])) {
cat(i, ' ')
}
}
Output is:
2  4
Also :
which(is.na(y))
