R Data Frame - Convert all <NA> to blank (" ") for character columns

I am looking for a way to convert all <NA> values to a blank (' ') in every character column of a data frame. I would prefer a base R solution. I tried the solution described in "Setting <NA> to blank", but it requires converting the entire data frame to factor, which creates an issue for numeric columns, e.g.
df <- data.frame(x=c(1,2,NA), y=c("a","b",NA))
To convert numeric NA to 0
df[is.na(df)] <- 0
To convert character NA to blank (" "); note that this converts all columns to character:
df <- sapply(df, as.character)
df[is.na(df)] <- " "

Create your dataframe with stringsAsFactors = FALSE
df <- data.frame(x=c(1,2,NA), y=c("a","b",NA), stringsAsFactors = FALSE)
Find character columns
cols <- sapply(df, is.character)
Turn them to blank
df[cols][is.na(df[cols])] <- ' '
df
# x y
#1 1 a
#2 2 b
#3 NA
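The two steps above can be wrapped in a small helper. This is a minimal sketch, assuming the columns are already character (not factor); the function name blank_na_chars and its blank argument are hypothetical, not part of base R.

```r
# Hypothetical helper: replace NA with a blank in character columns only.
blank_na_chars <- function(df, blank = " ") {
  # Logical vector marking the character columns
  cols <- vapply(df, is.character, logical(1))
  # Replace NA column by column; numeric columns are left untouched
  df[cols] <- lapply(df[cols], function(col) {
    col[is.na(col)] <- blank
    col
  })
  df
}

df <- data.frame(x = c(1, 2, NA), y = c("a", "b", NA), stringsAsFactors = FALSE)
out <- blank_na_chars(df)
```

The numeric column x keeps its NA, so it can still be handled separately (for example, set to 0).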

It's maybe not the most elegant way, but using dplyr you can convert all factor columns to character columns with mutate_if, and then replace every NA with " " in the character columns by using ifelse inside a second mutate_if:
library(dplyr)
df %>% mutate_if(is.factor, ~as.character(.)) %>%
mutate_if(is.character, ~ifelse(is.na(.)," ",.))
x y
1 1 a
2 2 b
3 NA
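In dplyr 1.0.0 and later, scoped verbs such as mutate_if are superseded by across(). A sketch of the same pipeline in that style, assuming dplyr >= 1.0.0 is installed:

```r
library(dplyr)

df <- data.frame(x = c(1, 2, NA), y = c("a", "b", NA), stringsAsFactors = TRUE)

out <- df %>%
  # Convert factor columns to character first
  mutate(across(where(is.factor), as.character)) %>%
  # Then replace NA with a single space in every character column
  mutate(across(where(is.character), ~ replace(.x, is.na(.x), " ")))
```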

Related

How do I change all the character values of a column that starts with specific characters?

I have a dataset with millions of observations.
One of the columns of this dataset uses 4 or 5 characters to classify these observations.
My goal is to merge this classification into smaller groups, for example, I want to replace all the values of the column that STARTS with "AA" (e.g., "AABC" or "AAUCC") for just "A". How can I do this?
To illustrate:
Considering that my data is labeled "f2016" and the column that I'm interested in is "SECT16", I've been using the following code to replace values:
f2016$SECT16[f2016$SECT16 == "AABB"] <- "A"
But I cannot do this to all combinations of letters that I have in the dataset. Is there a way that I can do the same replacement holding the first two letters constant?
Here is another base R solution:
f2016$SECT16[startsWith(f2016$SECT16, "AA")] <- "A"
# SECT16
# 1 A
# 2 A
# 3 ABBBBC
# 4 DDDDE
# 5 BABA
This replaces the values that start with the specified prefix, in this case AA. An excerpt from help(startsWith):
startsWith() is equivalent to but much faster than
substring(x, 1, nchar(prefix)) == prefix
or also
grepl("^<prefix>", x)
where prefix is not to contain special regular expression characters.
Data
f2016 <- data.frame(SECT16 = c("AAABBB", "AAAAAABBBB", "ABBBBC", "DDDDE", "BABA"), stringsAsFactors = FALSE)
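The equivalence quoted from the help page can be checked on the sample data. A small sketch; the variable names prefix, by_starts, by_substr and by_regex are only for illustration:

```r
f2016 <- data.frame(SECT16 = c("AAABBB", "AAAAAABBBB", "ABBBBC", "DDDDE", "BABA"),
                    stringsAsFactors = FALSE)
prefix <- "AA"

by_starts <- startsWith(f2016$SECT16, prefix)
by_substr <- substring(f2016$SECT16, 1, nchar(prefix)) == prefix
# paste0 builds the pattern "^AA"; safe here because "AA" has no regex metacharacters
by_regex <- grepl(paste0("^", prefix), f2016$SECT16)

# Replace only within the SECT16 column, so other columns would be unaffected
f2016$SECT16[by_starts] <- "A"
```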
We can use grep/grepl
f2016$SECT16[grep("^AA", f2016$SECT16)] <- "A"
#f2016$SECT16[grepl("^AA", f2016$SECT16)] <- "A"
Consider this dataset
df <- data.frame(A = c("ABCD", "AACD", "DASDD", "AABB"), stringsAsFactors = FALSE)
df
# A
#1 ABCD
#2 AACD
#3 DASDD
#4 AABB
df$A[grep("^AA", df$A)] <- "A"
df
# A
#1 ABCD
#2 A
#3 DASDD
#4 A
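If several prefixes need to be collapsed at once, one option is a named lookup vector keyed on the first two characters. This is only a sketch: the mapping in groups (AA to A, DA to D) is made up for illustration and is not from the question.

```r
df <- data.frame(A = c("ABCD", "AACD", "DASDD", "AABB"), stringsAsFactors = FALSE)

# Hypothetical mapping from two-letter prefix to the merged group label
groups <- c(AA = "A", DA = "D")

prefix <- substr(df$A, 1, 2)
hit <- prefix %in% names(groups)
# Values whose prefix is not in the mapping are left unchanged
df$A[hit] <- unname(groups[prefix[hit]])
```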
You can use stringr and dplyr.
Modify all columns:
df <- df %>% mutate_all(function(x) stringr::str_replace(x, "^AA.+", "A"))
Modify specific columns:
df <- df %>% mutate_at(1, function(x) stringr::str_replace(x, "^AA.+", "A"))
Data
df <- data.frame(SECT16 = c("AABC", "AABB"),
SECT17 = c("AADD", "AAEE"))
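A base R counterpart of the mutate_all call, as a sketch. Note that sub() with the pattern "^AA.*" also rewrites a value that is exactly "AA", which "^AA.+" would leave alone.

```r
df <- data.frame(SECT16 = c("AABC", "AABB"),
                 SECT17 = c("AADD", "AAEE"),
                 stringsAsFactors = FALSE)

# df[] keeps the data frame structure while replacing every column
df[] <- lapply(df, function(x) sub("^AA.*", "A", x))
```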

Replace NA in a data frame with factor variables

I would like to create a function to replace NA by the text "NR" in factor variables of a data frame.
I found the below code on the web, that works perfectly :
i <- sapply(data_5, is.factor) # Identify all factor variables in your data
data_5[i] <- lapply(data_5[i], as.character) # Convert factors to character variables
data_5[is.na(data_5)] <- 0 # Replace NA with 0
data_5[i] <- lapply(data_5[i], as.factor) # Convert character columns back to factors
But I would like to transform this code in a function called "remove_na_factor". I tried as below :
remove_na_factor <- function(x){
i <- sapply(x, is.factor) # Identify all factor variables in your data
x[i] <- lapply(x[i], as.character) # Convert factors to character variables
x[is.na(x)] <- "NR" # Replace NA with NR
x[i] <- lapply(x[i], as.factor) # Convert character columns back to factors
}
When I run the function on a data frame with NA values, nothing happens ...
Thanks in advance for your help.
Just add return(x) at the end of your function:
remove_na_factor <- function(x){
#your function body
return(x)
}
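Putting it together, here is a sketch of a version that only touches the factor columns, so numeric columns are not coerced to character by the "NR" replacement. The name replace_na_in_factors and the label argument are hypothetical. Because R functions work on a copy of their argument, the result has to be assigned back.

```r
replace_na_in_factors <- function(x, label = "NR") {
  i <- vapply(x, is.factor, logical(1))   # identify factor columns
  x[i] <- lapply(x[i], function(f) {
    f <- as.character(f)                  # factor -> character
    f[is.na(f)] <- label                  # replace NA with the label
    factor(f)                             # character -> factor
  })
  x                                       # return the modified copy
}

data_5 <- data.frame(g = factor(c("a", NA, "b")), n = c(1, NA, 3))
data_5 <- replace_na_in_factors(data_5)   # assign the result back
```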
You can also get the same result using a tidyverse approach
library(tidyverse)
x %>%
mutate_if(is.factor, as.character) %>% # Convert factors to character variables
mutate_if(is.character, replace_na, "NR") %>% # Replace NA with NR
mutate_if(is.character, as.factor) # Convert character columns back to factors

How to coerce a character column to a list column

I am trying to bind data frames rows. I generate some data frame with list columns after aggregation but some are character. I can't find a way to bind them. I tried converting the character column using as.list() but that didn't work.
library(dplyr)
df1 <- data.frame(a = c(1,2,3),stringsAsFactors = F)
df1$b <- list(c("1","2"),"4",c("5","6"))
> df1
a b
1 1 1, 2
2 2 4
3 3 5, 6
df2 <- data.frame(a=c(4,5),b=c("9","12"),stringsAsFactors = F)
> df2
a b
1 4 9
2 5 12
dplyr::bind_rows(df2,df1)
Error in bind_rows_(x, .id) :
Column `b` can't be converted from character to list
I don't know the dplyr library well, but using base R's rbind() below seems to be working:
df1 <- data.frame(a = c(1,2,3),stringsAsFactors = F)
df1$b <- list(c("1","2"),"4",c("5","6"))
df2 <- data.frame(a=c(4,5),b=c("9","12"),stringsAsFactors = F)
result <- rbind(df1, df2)
class(result$a)
[1] "numeric"
class(result$b)
[1] "list"
If you wanted to get this working with bind_rows(), start by looking at the error message. It looks like dplyr doesn't like that one data frame has character data while the other has list data. You could try converting the character column to list and then call bind_rows, e.g.
df2$b <- as.list(df2$b)
dplyr::bind_rows(df2,df1)
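To see what as.list() does to the column, here is a small base R check. It is a sketch that uses rbind() rather than bind_rows(), so it does not depend on the dplyr version:

```r
df1 <- data.frame(a = c(1, 2, 3), stringsAsFactors = FALSE)
df1$b <- list(c("1", "2"), "4", c("5", "6"))
df2 <- data.frame(a = c(4, 5), b = c("9", "12"), stringsAsFactors = FALSE)

# Each string becomes its own length-one list element
df2$b <- as.list(df2$b)

result <- rbind(df2, df1)
```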

Using dplyr, Remove all strings from a data frame

I have a data frame with 300 columns, which has a string variable somewhere that I am trying to remove. I found this solution on Stack Overflow using lapply (see below), which does what I want, but I would like to do it with the dplyr package. I have tried using the mutate_each function but can't seem to make it work.
"If your data frame (df) is really all integers except for NAs and garbage, then the following converts it.
df2 <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
You'll have a warning about NAs introduced by coercion, but that's just all those non-numeric character strings turning into NAs."
dplyr 0.5 now includes a select_if() function.
For example:
person <- c("jim", "john", "harry")
df <- data.frame(matrix(c(1:9,NA,11,12), nrow=3), person)
library(dplyr)
df %>% select_if(is.numeric)
# X1 X2 X3 X4
#1 1 4 7 NA
#2 2 5 8 11
#3 3 6 9 12
Of course you could add further conditions if necessary.
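For comparison, a base R sketch that keeps only the numeric columns without dplyr (the name numeric_only is just for illustration):

```r
person <- c("jim", "john", "harry")
df <- data.frame(matrix(c(1:9, NA, 11, 12), nrow = 3), person)

# vapply returns one TRUE/FALSE per column; use it to index the columns
numeric_only <- df[vapply(df, is.numeric, logical(1))]
```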
If you want to use this line of code:
df2 <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
with dplyr (by which I assume you mean "using pipes") the easiest would be
df2 = df %>% lapply(function(x) as.numeric(as.character(x))) %>%
as.data.frame
To "translate" this into the mutate_each idiom:
mutate_each(df, funs(as.numeric(as.character(.))))
This function will, of course, convert all columns to character, then to numeric. To improve efficiency, don't bother doing two conversions on columns that are already numeric:
mutate_each(df, funs({
if (is.numeric(.)) return(.)
as.numeric(as.character(.))
}))
Data for testing:
df = data.frame(v1 = 1:10, v2 = factor(11:20))
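mutate_each() and funs() are deprecated in later dplyr releases. A sketch of the same idea with across(), assuming dplyr >= 1.0.0:

```r
library(dplyr)

df <- data.frame(v1 = 1:10, v2 = factor(11:20))

# Convert only the non-numeric columns; numeric ones pass through untouched
df2 <- df %>%
  mutate(across(!where(is.numeric), ~ as.numeric(as.character(.x))))
```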
mutate_all works here; simply wrap the gsub in a function. (I also assume you aren't necessarily hunting for strings so much as trawling for non-integers.)
StrScrub <- function(x) {
as.integer(gsub("^\\D+$",NA, x))
}
ScrubbedDF <- mutate_all(data, funs(StrScrub))
Example dataframe:
library(dplyr)
options(stringsAsFactors = F)
data = data.frame("A" = c(2:5),"B" = c(5,"gr",3:2), "C" = c("h", 9, "j", "1"))
With reference/help from Tony Ladson.

Variable as a column name in data frame

Is there any way to use string stored in variable as a column name in a new data frame? The expected result should be:
col.name <- 'col1'
df <- data.frame(col.name=1:4)
print(df)
# Real output
col.name
1 1
2 2
3 3
4 4
# Expected output
col1
1 1
2 2
3 3
4 4
I'm aware that I can create data frame and then use names() to rename column or use df[, col.name] for existing object, but I'd like to know if there is any other solution which could be used during creating data frame.
You cannot pass a variable into the name of an argument like that.
Instead what you can do is:
df <- data.frame(placeholder_name = 1:4)
names(df)[names(df) == "placeholder_name"] <- col.name
or use the default name, which for data.frame(1:4) is "X1.4":
df <- data.frame(1:4)
names(df)[names(df) == "X1.4"] <- col.name
or assign by position:
df <- data.frame(1:4)
names(df)[1] <- col.name
or if you only have one column just replace the entire names attribute:
df <- data.frame(1:4)
names(df) <- col.name
There's also the set_names function in the magrittr package that you can use to do this last solution in one step:
library(magrittr)
df <- set_names(data.frame(1:4), col.name)
But set_names is just an alias for:
df <- `names<-`(data.frame(1:4), col.name)
which is part of base R. Figuring out why this expression works and makes sense will be a good exercise.
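Base R also ships stats::setNames(), which allows the name to be supplied while the data frame is being built. A sketch (df_a and df_b are illustrative names):

```r
col.name <- "col1"

# Name the list first, then turn it into a data frame
df_a <- data.frame(setNames(list(1:4), col.name))

# Or name the finished data frame in the same expression
df_b <- setNames(data.frame(1:4), col.name)
```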
In addition to ssdecontrol's answer, there is a second option.
You're looking for mget. First assign the name to a variable, then the value to the variable that you have previously assigned. After that, mget will evaluate the string and pass it to data.frame.
assign("col.name", "col1")
assign(paste(col.name), 1:4)
df <- data.frame(mget(col.name))
print(df)
col1
1 1
2 2
3 3
4 4
I don't recommend you do this, but:
col.name <- 'col1'
eval(parse(text=paste0('data.frame(', col.name, '=1:4)')))
