Set character column of dataframe to missing if length<n - r

Very new to R here.
I have a dataframe with a character column "col1":
col1 <- c("org","blorg","forg","chorg","horg","blorg","horg","phthorg")
col2 <- c("a","b","c","d","a","b","e","f")
df<-data.frame(col1, col2)
I would like to set the values with fewer than 5 characters to missing so I end up with:
c(NA,"blorg",NA,"chorg",NA,"blorg",NA,"phthorg")
I have tried the following:
if(nchar(as.character(df$col1))<5) {df$col1<-NA}
but I get the error "the condition has length > 1".

nchar is vectorized, so we can use it directly for logical indexing instead of if/else (which expects a single TRUE/FALSE as input). Vectorized alternatives such as ifelse, replace, or case_when would also work, but they are not needed here because we only replace values with NA based on the condition.
df$col1[nchar(df$col1) <5] <- NA
-output
> df$col1
[1] NA "blorg" NA "chorg" NA "blorg" NA "phthorg"
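For comparison, here is a minimal sketch of the vectorized alternatives, run on the question's col1 values. The names short, with_replace and with_ifelse are introduced here for illustration only. If col1 is a factor (the default for data.frame() before R 4.0), wrap it in as.character() first, as the question does.

```r
col1 <- c("org", "blorg", "forg", "chorg", "horg", "blorg", "horg", "phthorg")

# Logical vector: TRUE where the string has fewer than 5 characters
short <- nchar(col1) < 5

# replace() returns a copy with the selected positions set to NA
with_replace <- replace(col1, short, NA)

# ifelse() picks NA or the original value element by element
with_ifelse <- ifelse(short, NA, col1)
```

Both give the same result as the direct assignment; the indexing form is simply the shortest.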

Related

Generate variable with missing values if condition doesn't hold true

I want to generate a new variable in a data frame which contains the difference of the current row and a lag-value of another variable. However, I want to assign only values for those rows, where a specific condition holds true for a second variable. In this example the new lag-difference variable should only have values for rows with the fruit "Banana". All other rows shall be empty or rather contain NA.
fruitnumbers <- data.frame(numbers=c(2,4,1,5,3,5,2,5,1,3),
fruits=c("Apple","Banana","Orange","Cherry","Strawberry","Banana","Banana",
"Apple","Cherry","Banana"))
I tried to solve this problem with an if condition:
fruitnumbers$newvar <- if(fruitnumbers$fruits=="Banana"){
fruitnumbers$numbers-lag(fruitnumbers$numbers, 1)
}
However, I've received the following warning message.
Warning message:
In if (fruits == "Banana") { :
the condition has length > 1 and only the first element will be used
From research, I assume that it has something to do with the fact R wants to check the If-condition for the whole data frame instead of row by row for each value but I'm not quite sure. I'd be grateful for any solution here.
Here fruitnumbers$fruits is a vector, so when you run if (fruitnumbers$fruits == "Banana") only the first element of fruitnumbers$fruits is tested (here "Apple" == "Banana"). Since R 4.2.0 a condition of length greater than one is an error rather than a warning.
If you want a vectorized test, use the case_when function from the dplyr package:
library(dplyr)
fruitnumbers$newvar <- case_when(
fruitnumbers$fruits == "Banana" ~ fruitnumbers$numbers-lag(fruitnumbers$numbers, 1),
TRUE ~ NA_real_
)
Which gives
fruitnumbers$newvar
[1] NA 2 NA NA NA 2 -3 NA NA 2
EDIT: as someone mentioned, you could also use the ifelse function (with dplyr loaded, so that lag shifts the values):
fruitnumbers$newvar <- ifelse(fruitnumbers$fruits == "Banana", fruitnumbers$numbers-lag(fruitnumbers$numbers, 1), NA)
I would do that in two stages:
Create a new column in the data frame:
fruitnumbers$newvar <- NA
Change the values only for bananas:
fruitnumbers$newvar[fruitnumbers$fruits=="Banana"] <-
fruitnumbers$numbers[fruitnumbers$fruits=="Banana"] - lag(fruitnumbers$numbers[fruitnumbers$fruits=="Banana"], 1)
I am not sure about the lag function in this context. It only returns zeros. Another problem might be hiding there.
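The zeros are consistent with base R's stats::lag being called instead of dplyr::lag: on a plain vector, stats::lag only shifts the time index and leaves the values where they are, so subtracting it gives zero. The sketch below checks that and then builds the lagged values by hand. The names prev, is_banana and newvar are illustrative, and the lag is taken over the whole column, which is an assumption about what the question wants.

```r
numbers <- c(2, 4, 1, 5, 3, 5, 2, 5, 1, 3)
fruits <- c("Apple", "Banana", "Orange", "Cherry", "Strawberry",
            "Banana", "Banana", "Apple", "Cherry", "Banana")

# stats::lag() changes only the "tsp" attribute; the values are unchanged
same_values <- all(as.numeric(stats::lag(numbers, 1)) == numbers)

# Manual lag: previous row's value, NA for the first row
prev <- c(NA, head(numbers, -1))

# Fill only the banana rows, leave the rest as NA
is_banana <- fruits == "Banana"
newvar <- rep(NA_real_, length(numbers))
newvar[is_banana] <- numbers[is_banana] - prev[is_banana]
```

The result matches the case_when output shown earlier in the thread.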
In base R, you could try this:
fruitnumbers <- data.frame(numbers=c(2,4,1,5,3,5,2,5,1,3),
fruits=c("Apple","Banana","Orange","Cherry","Strawberry","Banana","Banana",
"Apple","Cherry","Banana"))
indexes <- which(fruitnumbers$fruits == "Banana")
# diff() gives each value minus the previous one; pad with NA for the first row
fruitnumbers[indexes, 'newvar'] <- c(NA, diff(fruitnumbers$numbers))[indexes]
The remaining rows of column newvar will be NA.

Function to change blanks to NA

I'm trying to write a function that turns empty strings into NA. A summary of one of my columns looks like this:
      a   b
 12 210 468
I'd like to change the 12 empty values to NA. I also have a few other factor columns for which I'd like to change empty values to NA, so I borrowed some stuff from here and there to come up with this:
# change nulls to NAs
nullToNA <- function(df){
# split df into numeric & non-numeric columns
a<-df[,sapply(df, is.numeric), drop = FALSE]
b<-df[,sapply(df, Negate(is.numeric)), drop = FALSE]
# Change empty strings to NA
b<-b[lapply(b,function(x) levels(x) <- c(levels(x), NA) ),] # add NA level
b<-b[lapply(b,function(x) x[x=="",]<- NA),] # change Null to NA
# Put the columns back together
d<-cbind(a,b)
d[, names(df)]
}
However, I'm getting this error:
> foo<-nullToNA(bar)
Error in x[x == "", ] <- NA : incorrect number of subscripts on matrix
Called from: FUN(X[[i]], ...)
I have tried the answer found here: Replace all 0 values to NA but it changes all my columns to numeric values.
You can directly index fields that match a logical criterion. So you can just write:
df[is_empty(df)] = NA
Where is_empty is your comparison, e.g. df == "":
df[df == ""] = NA
But note that is.null(df) won’t work, and would be weird anyway [1]. I would advise against merging the logic for columns of different types, though! Instead, handle them separately.
[1] You’ll almost never encounter NULL inside a table, since that only works if the underlying vector is a list. (You can create matrices and data.frames with this constraint, but then is.null(df) will never be TRUE because the NULL values are wrapped inside the list.)
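A minimal sketch of handling the column types separately might look like this. The data frame is made up for illustration, and the helper names is_chr and is_fct are not from the original posts.

```r
# Hypothetical data: one character, one factor, one numeric column
df <- data.frame(
  x = c("a", "", "b"),
  f = factor(c("", "u", "v")),
  n = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Character columns: replace empty strings in place
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], function(col) {
  col[col == ""] <- NA
  col
})

# Factor columns: set matching entries to NA, then drop the unused "" level
is_fct <- vapply(df, is.factor, logical(1))
df[is_fct] <- lapply(df[is_fct], function(col) {
  col[col == ""] <- NA
  droplevels(col)
})
```

Numeric columns are never touched, so they keep their type.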
This worked for me
df[df == 'NULL'] <- NA
How about just:
df[apply(df, 2, function(x) x=="")] = NA
Works fine for me, at least on simple examples.
This is the function I used to solve this issue.
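null_na <- function(vector) {
  # start with all NA, then copy over the non-missing, non-empty values
  new_vector <- rep(NA, length(vector))
  for (i in seq_along(vector)) {
    # test is.na() first: comparing NA with "" yields NA, which if() rejects
    if (!is.na(vector[i]) && vector[i] != "") {
      new_vector[i] <- vector[i]
    }
  }
  return(new_vector)
}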
Just plug in the column or vector you are having an issue with.

Finding the unique identifier for a row with NA in a particular column in R

I have data in the following format:
ID Species Side_of_boat
1 spA Port
2 spB Starboard
3 spA NA
I would like to write a line of code that gives me the unique ID for all rows that have NA in 'side of boat'.
I have tried:
unique(df$ID[df$side_of_boat == "NA"])
But it doesn't give me the output I want. I would like the output to be:
"3"
Thanks!
Try
unique(df$ID[is.na(df$Side_of_boat)])
instead. NA is a special value in R and also has its own special function is.na() to test if an entry is NA. Check ?NA for more information.
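To see why the == "NA" comparison fails, here is a small sketch that rebuilds the question's table (the values are copied from the post; the names cmp and ids are illustrative):

```r
df <- data.frame(
  ID = 1:3,
  Species = c("spA", "spB", "spA"),
  Side_of_boat = c("Port", "Starboard", NA)
)

# Comparing a missing value with anything gives NA, not TRUE
cmp <- df$Side_of_boat == "NA"

# is.na() gives a clean TRUE/FALSE vector that is safe for indexing
ids <- unique(df$ID[is.na(df$Side_of_boat)])
```

Here cmp is FALSE, FALSE, NA, while ids contains only the ID 3.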
#Method1
n <- which(is.na(df$Side_of_boat))
you can also use *apply with this, e.g. across all columns of each row:
lapply(apply(df, 1, function(x) which(is.na(x))), paste, collapse=", ")
#Method 2
new_DF <- subset(df, is.na(Side_of_boat))
#Method 3
You could also write a function to do this for you:
getNa <- function(dfrm) lapply(dfrm, function(x) which(is.na(x) ) )
#Note
In case you have NA character values, first run
df$Side_of_boat[df$Side_of_boat=='NA'] <- NA
Try:
df$ID[which(is.na(df$Side_of_boat))]
It should give you a vector of the IDs, regardless of whether they are numbers or characters.

Unexpected row(s) of NAs when selecting subset of dataframe

When selecting a subset of data from a dataframe, I get row(s) entirely made up of NA values that were not present in the original dataframe. For example:
example.df[example.df$census_tract == 27702, ]
returns:
census_tract number_households_est
NA NA NA
23611 27702 2864
Where did that first row of NAs come from? And why is it returned even though example.df$census_tract != 27702 for that row?
That is because there is a missing observation:
> sum(is.na(example.df$census_tract))
[1] 1
> example.df[which(is.na(example.df$census_tract)), ]
census_tract number_households_est
64 NA NA
When == evaluates the 64th row, it returns NA because we cannot know whether 27702 is equal to the missing value. Therefore the result is missing (NA), and an NA ends up in the logical vector used for indexing. This yields a row full of NAs, because we are asking for a row but "we don't know which one".
The proper way is
> example.df[example.df$census_tract %in% 27702, ]
census_tract number_households_est
23611 27702 2864
HTH, Luca
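The behaviour can be reproduced on a small scale. The three-row data frame below is hypothetical (only the 27702 row is taken from the question), and the variable names are illustrative. It also shows which() as another way to drop the NA from the index.

```r
# Hypothetical miniature of example.df with one missing census_tract
example.df <- data.frame(
  census_tract = c(10001, NA, 27702),
  number_households_est = c(120, NA, 2864)
)

keep <- example.df$census_tract == 27702   # FALSE NA TRUE

with_na_row <- example.df[keep, ]          # the NA index adds an all-NA row
via_which <- example.df[which(keep), ]     # which() drops NA positions
via_in <- example.df[example.df$census_tract %in% 27702, ]  # %in% never returns NA
```

Both via_which and via_in contain only the matching row.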

How to show indexes of NAs?

I have a snippet that displays the NAs, but I can't figure out how to get their positions.
try(na.fail(x))
> Error in na.fail.default(x) : missing values in object
# display NAs
myvector[is.na(x)]
# returns
NA NA NA NA
The only thing I get from this is the length of the NA vector, which is actually not too helpful when the NAs were caused by a bug in my code that I am trying to track. How can I get the index of the NA element(s)?
I also tried:
subset(x,is.na(x))
which has the same effect.
EDIT:
y <- complete.cases(x)
x[!y]
# just returns another
NA NA NA NA
You want the which function:
which(is.na(arr))
is.na() will return a boolean index of the same shape as the original data frame.
In other words, any cells in that m x n index with the value TRUE correspond to NA values in the original data frame.
You can then use this to change the NAs, if you wish:
DF[is.na(DF)] = 999
To get the total number of data rows with at least one NA:
cc = complete.cases(DF)
num_missing = nrow(DF) - sum(cc)
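As a sketch of how these pieces fit together, the example below uses a made-up vector x and data frame DF. For a data frame, which() with arr.ind = TRUE reports the row and column of each NA rather than a single flat index.

```r
# Vector case: positions of the missing elements
x <- c(4, NA, 7, NA)
na_idx <- which(is.na(x))

# Data frame case (hypothetical data): row/column position of every NA
DF <- data.frame(a = c(1, NA, 3), b = c(NA, "y", "z"))
pos <- which(is.na(DF), arr.ind = TRUE)

# Number of rows with at least one NA
num_missing <- sum(!complete.cases(DF))
```

pos is a matrix with one line per NA and the columns "row" and "col".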
which(Dataset$variable=="") will return the row numbers of the empty strings in a particular column.
R Code using loop and condition :
# Testing for missing values
is.na(x) # returns TRUE if x is missing
y <- c(1,NA,3,NA)
is.na(y)
# returns a vector (F T F T)
# Print the index of NA values
for(i in 1:length(y)) {
if(is.na(y[i])) {
cat(i, ' ')
}
}
Output is :
2  4
Also :
which(is.na(y))

Resources