Replace NA in a data frame with factor variables - r

I would like to create a function that replaces NA with the text "NR" in the factor variables of a data frame.
I found the code below on the web, and it works perfectly:
i <- sapply(data_5, is.factor) # Identify all factor variables in your data
data_5[i] <- lapply(data_5[i], as.character) # Convert factors to character variables
data_5[is.na(data_5)] <- 0 # Replace NA with 0
data_5[i] <- lapply(data_5[i], as.factor) # Convert character columns back to factors
But I would like to turn this code into a function called "remove_na_factor". I tried the following:
remove_na_factor <- function(x){
i <- sapply(x, is.factor) # Identify all factor variables in your data
x[i] <- lapply(x[i], as.character) # Convert factors to character variables
x[is.na(x)] <- "NR" # Replace NA with NR
x[i] <- lapply(x[i], as.factor) # Convert character columns back to factors
}
But when I run the function on a data frame with NA values, nothing happens...
Thanks in advance for your help.

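Your function never returns the modified data frame: its last expression is an assignment, so the value that comes back (invisibly) is not x. Just add return(x) at the end of your function, and assign the result when you call it (for example data_5 <- remove_na_factor(data_5)):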
remove_na_factor <- function(x){
#your function body
return(x)
}
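Putting the pieces together, a complete version might look like the sketch below. It is only one way to write it: unlike the code in the question, it replaces NA inside the factor columns only (so NA in numeric columns is left alone), and the sample data frame is made up for illustration.

```r
remove_na_factor <- function(x) {
  # Identify the factor columns
  i <- vapply(x, is.factor, logical(1))
  x[i] <- lapply(x[i], function(col) {
    col <- as.character(col)   # factor -> character
    col[is.na(col)] <- "NR"    # replace NA with "NR"
    factor(col)                # character -> factor
  })
  x  # return the modified data frame
}

# Made-up example data: one factor column and one numeric column
data_5 <- data.frame(
  f = factor(c("a", NA, "b")),
  n = c(1, NA, 3)
)

# The result has to be assigned, otherwise data_5 is left unchanged
data_5 <- remove_na_factor(data_5)
str(data_5)
```

Because R passes arguments by value, the function works on a copy; the original data frame only changes if you assign the returned value back to it.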
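You can also get the same result using a tidyverse approach (note that the last step turns every character column into a factor, not only the ones that were factors originally):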
library(tidyverse)
x %>%
mutate_if(is.factor, as.character) %>% # Convert factors to character variables
mutate_if(is.character, replace_na, "NR") %>% # Replace NA with NR
mutate_if(is.character, as.factor) # Convert character columns back to factors

Related

calculating average of two column in another column

I am trying to calculate the average of two columns and store it in another column, but I am getting an error.
I converted all the "N" strings to NA using na.strings = c("N"), but afterwards the class of the columns is character.
So I now have NA in place of N in the data frame, but the columns are still of class character:
df <- data.frame("T_1_1"= c(68,24,"N",105,58,"N",135,126,24),
"T_1_2"=c(26,105,"N",73,39,97,46,108,"N"),
"T_1_3"=c(93,32,73,103,149,"N",147,113,139),
"S_2_1"=c(69,67,94,"N",77,136,137,92,73),
"S_2_2"=c(87,67,47,120,85,122,"N",96,79),
"S_2_3"= c(150,"N",132,121,29,78,109,40,"N"),
"TS1_av"=c(68.5,45.5,94,105,67.5,136,136,109,48.5),
"TS2_av"=c(56.5,86,47,96.5,62,109.5,46,102,79),
"TS3_av"=c(121.5,32,102.5,112,89,78,128,76.5,139)
)
df$TS1_av <- rowMeans(df[,c(as.numeric(as.character("T_1_1","S_2_1")))], na.rm=TRUE)
You can use:
#Change 'N' to NA
df[df == 'N'] <- NA
#Change the type of columns
df <- type.convert(df, as.is = TRUE)
#Take mean of selected columns and add a new column
df$TS1_av <- rowMeans(df[,c("T_1_1","S_2_1")], na.rm=TRUE)
df
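To see what each step does, here is the same approach on a smaller, made-up data frame (the values are chosen for illustration and the columns are created as character on purpose). It is a sketch of the idea rather than a drop-in replacement.

```r
# Made-up data: numbers stored as strings, with "N" marking missing values
df <- data.frame(
  T_1_1 = c("68", "24", "N"),
  S_2_1 = c("69", "N", "N"),
  stringsAsFactors = FALSE
)

# Step 1: turn the "N" strings into real NA values
df[df == "N"] <- NA

# Step 2: let R re-detect the column types (character -> integer here)
df <- type.convert(df, as.is = TRUE)

# Step 3: row-wise mean of the selected columns, ignoring NA
df$TS1_av <- rowMeans(df[, c("T_1_1", "S_2_1")], na.rm = TRUE)
df
```

A row in which every selected value is NA (the third row here) gets NaN as its mean, because na.rm = TRUE leaves nothing to average.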
You could use readr::parse_number to extract the numbers and replace any string that can't be converted to numeric with NA.
The na argument lets you specify strings to be interpreted as NA (here 'N'). If you don't supply this argument, you get a warning for the strings that couldn't be parsed, but they are still replaced by NA.
library(dplyr)
library(readr)
df <- df %>% mutate(across(where(is.character), ~ readr::parse_number(.x, na = "N")))
df$TS1_av <- rowMeans(df[,c("T_1_1","S_2_1")], na.rm=TRUE)
df
Two base R solutions:
# Columns to subset out: cols => character vector
cols <- c("T_1_1", "S_2_1")
# Option 1: calculate the mean row-wise: TS1_av => numeric vector
df$TS1_av <- apply(df[,cols], 1, function(x){
mean(suppressWarnings(as.numeric(x)), na.rm = TRUE)
}
)
# Columns to subset out: cols => character vector
cols <- c("T_1_1", "S_2_1")
# Option 2: Coerce to numeric and calculate the row mean:
# TS1_av => numeric vector
df$TS1_av <- rowMeans(
suppressWarnings(
vapply(df[,cols], as.numeric, numeric(nrow(df)))
),
na.rm = TRUE
)

R Data Frame - Convert all <NA> to blank(" ") for character columns

I am looking for a solution to convert all <NA> to blank (" ") for all character columns in a data frame. I would prefer a base R solution. I tried the solution described in "Setting <NA> to blank", but it requires converting the entire data frame to factor, which creates an issue for numeric columns, e.g.:
df <- data.frame(x=c(1,2,NA), y=c("a","b",NA))
To convert numeric NA to 0
df[is.na(df)] <- 0
To convert character NA to blank (" "), I tried the following, but it converts all columns to character:
df <- sapply(df, as.character)
df[is.na(df)] <- " "
Create your dataframe with stringsAsFactors = FALSE
df <- data.frame(x=c(1,2,NA), y=c("a","b",NA), stringsAsFactors = FALSE)
Find character columns
cols <- sapply(df, is.character)
Turn them to blank
df[cols][is.na(df[cols])] <- ' '
df
#   x y
#1  1 a
#2  2 b
#3 NA  
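If you need this in more than one place, the same steps can be wrapped in a small function. This is a minimal sketch: the name blank_na_chars and its blank argument are made up for illustration, and it assumes the text columns are character rather than factor.

```r
# Hypothetical helper: replace NA with a blank in character columns only
blank_na_chars <- function(df, blank = " ") {
  # Logical vector marking the character columns
  cols <- vapply(df, is.character, logical(1))
  df[cols] <- lapply(df[cols], function(col) {
    col[is.na(col)] <- blank
    col
  })
  df
}

df <- data.frame(x = c(1, 2, NA), y = c("a", "b", NA), stringsAsFactors = FALSE)
out <- blank_na_chars(df)
str(out)
```

The numeric column x is not touched, so its NA stays NA and the column stays numeric.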
It's maybe not the most elegant way, but using dplyr you can convert all factor columns to character with mutate_if, and then replace every NA with " " in the character columns by using ifelse inside a second mutate_if:
library(dplyr)
df %>% mutate_if(is.factor, ~as.character(.)) %>%
mutate_if(is.character, ~ifelse(is.na(.)," ",.))
   x y
1  1 a
2  2 b
3 NA  

Read data set from website in numeric format not character

With the code below I read data from a website.
The problem is that it reads the data as character rather than numeric, which matters especially for columns such as "Enlem(N)" and "Boylam(E)".
How can I fix this?
library(rvest)
widths <- c(11,10,10,10,14,5,5,5,48,100)
dat <- "http://www.koeri.boun.edu.tr/scripts/lst5.asp" %>%
read_html %>%
html_nodes("pre") %>%
html_text %>%
textConnection %>%
read.fwf(widths = widths, stringsAsFactors = FALSE) %>%
setNames(nm = .[6,]) %>%
tail(-7) %>%
head(-2)
If you know which specific columns should be numeric, you can convert just those columns. If you do not know, you can write a function that inspects the data and converts a column to numeric when a large enough share of its values can be parsed as numbers. I have used the function below for this purpose:
NumericColumns <- function(x, AllowedPercentNumeric =.95, PreserveDate=TRUE, PreserveColumns){
# find the counts of NA values in input data frame's columns
param_originalNA <- apply(x, 2, function(z){sum(is.na(z))})
# blindly coerce data.frame to numeric
df_JustNumbers <- suppressWarnings(as.data.frame(lapply(x, as.numeric)))
# Percent Non-NA values in each column
PercentNumeric <- (apply(df_JustNumbers, 2, function(x)sum(!is.na(x))))/(nrow(x)-param_originalNA)
rm(param_originalNA)
# identify columns which have a greater than or equal percentage of numeric as specified
param_numeric <- names(PercentNumeric)[PercentNumeric >= AllowedPercentNumeric]
# Remove columns from list to convert to numeric that are specified as to preserve
if (!missing(PreserveColumns)){param_numeric <- setdiff(param_numeric, PreserveColumns)}
# Identify columns that are dates initially
IsDateColumns <- lapply(x, function(y)(is(y, "Date")|is(y, "POSIXct")))
param_dates <- names(IsDateColumns)[IsDateColumns==TRUE]
# Remove dates from list if specified to preserve dates
if (PreserveDate){param_numeric <- setdiff(param_numeric, param_dates)}
# returns column position of numeric columns in target data frame
param_numeric <- match(param_numeric, colnames(df_JustNumbers))
# removes NA's from column list
param_numeric <- param_numeric[complete.cases(param_numeric)]
# coerces columns in param_numeric to numeric and inserts numeric columns into target data.frame
if(length(param_numeric)==1){
suppressWarnings(x[, param_numeric] <- as.numeric(x[, param_numeric]))
}
if(length(param_numeric)>1){
suppressWarnings(x[, param_numeric] <- apply(x[, param_numeric],2, function(x)as.numeric(x)))
}
return(x)
}
Once the function is created, you can use it on your data like this:
# Use function to convert to numeric
dat <- NumericColumns(dat)
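For the simpler case mentioned at the start of this answer, where you already know which columns should be numeric, a few lines of base R are enough. The sketch below uses a made-up stand-in for the scraped table: the values and the "Yer" column are invented, and it assumes the column names come out exactly as "Enlem(N)" and "Boylam(E)" (with a fixed-width read they may carry padding that needs trimming first).

```r
# Made-up stand-in for the scraped data: every column is character
dat <- data.frame(
  "Enlem(N)"  = c(" 38.1234", " 39.0500"),
  "Boylam(E)" = c(" 27.5000", " 30.2500"),
  "Yer"       = c("PLACE A", "PLACE B"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Columns that are known to hold numbers
num_cols <- c("Enlem(N)", "Boylam(E)")

# Strip the fixed-width padding, then convert to numeric
dat[num_cols] <- lapply(dat[num_cols], function(col) as.numeric(trimws(col)))
str(dat)
```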

How to create customized aggregate function for dcast that can deal with both character and numeric input?

I have a large data frame that I would like to cast into wide-form data using the dcast() function in the reshape2 package. However, the value column is a character column, but some of the values in it are numeric values in string format. I tried to create a custom aggregate function to deal with this, that will return the mean if there are numeric entries, but return the first entry if all entries are non-numeric. Although the function seems to work, it returns an error when used as fun.aggregate. Below is code with a smaller toy example to demonstrate. What I want is a 3x5 data frame with the first column the grouping variable, 3 columns of numeric values, and 1 column of character values.
mean_with_char <- function(x) {
xnum <- as.numeric(x)
if (any(!is.na(xnum))) mean(xnum, na.rm=TRUE) else x[1]
}
library(reshape2)
fakedata <- data.frame(grp1 = rep(letters[1:3],times=20), grp2 = rep(LETTERS[17:20],each=15), val=rnorm(60))
fakedata$val[46:60] <- rep(c('foo','bar','bla','bla','bla','bla'), length.out=15)
# This returns a 3x5 data frame with NA entries.
dcast(fakedata, grp1 ~ grp2, value.var='val', fun.aggregate=mean)
# This returns an error.
dcast(fakedata, grp1 ~ grp2, value.var='val', fun.aggregate=mean_with_char)
Error in vapply(indices, fun, .default) : values must be type
'character', but FUN(X[[1]]) result is type 'double'
Here is the workaround suggested by aosmith. The mean_with_char function returns only character output, and the numstring2num function converts numeric strings to numerics.
mean_with_char <- function(x) {
xnum <- as.numeric(x)
if (any(!is.na(xnum))) as.character(mean(xnum, na.rm=TRUE)) else x[1]
}
library(reshape2)
fakedata <- data.frame(grp1 = rep(letters[1:3],times=20), grp2 = rep(LETTERS[17:20],each=15), val=rnorm(60))
fakedata$val[46:60] <- rep(c('foo','bar','bla','bla','bla','bla'), length.out=15)
fakecast <- dcast(fakedata, grp1 ~ grp2, value.var='val', fun.aggregate=mean_with_char)
# Function to change columns in a df that only consist of numeric strings to numerics.
numstring2num <- function(x) {
xnum <- as.numeric(x)
if (!any(is.na(xnum)) & !is.factor(x)) xnum else x
}
fakecast[] <- lapply(fakecast[], numstring2num)
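The reason the workaround is needed is visible in the error message: the aggregation function must return the same type for every group, and the value column is character. The two helpers can be checked on their own, without reshape2, as in the sketch below. It wraps as.numeric in suppressWarnings to silence the coercion warnings, which is an addition to the code above, and the input vectors are made up.

```r
mean_with_char <- function(x) {
  xnum <- suppressWarnings(as.numeric(x))  # non-numeric strings become NA
  # Always return a character value so every group has the same type
  if (any(!is.na(xnum))) as.character(mean(xnum, na.rm = TRUE)) else x[1]
}

numstring2num <- function(x) {
  xnum <- suppressWarnings(as.numeric(x))
  # Convert only if every entry parsed as a number
  if (!anyNA(xnum) && !is.factor(x)) xnum else x
}

res_num <- mean_with_char(c("1", "3"))      # numeric strings: mean, as text
res_chr <- mean_with_char(c("foo", "bar"))  # no numbers: first entry
col_num <- numstring2num(c("1.5", "2.5"))   # all numeric: converted
col_chr <- numstring2num(c("foo", "1"))     # mixed: left as character
```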

Creating empty data frame with stringsAsFactors = FALSE

I am trying to create an empty data frame where the data will be strings and with stringsAsFactors set to FALSE. It seems that when I do that, though, it does not remember the value of stringsAsFactors.
It works if I create a blank row, like this:
> df <- data.frame(a="", b="", stringsAsFactors=FALSE)
> new.row <- c("a", "z")
> df <- rbind(df, new.row)
> df
a b
1
2 a z
> df[2,1] <- "q"
> df
a b
1
2 q z
But, I want an empty data frame. When I do that, though, it treats the strings that I later add as factors:
> df2 <- data.frame(a=character(), b=character(), stringsAsFactors=FALSE)
> df2 <- rbind(df2, new.row)
> df2
X.a. X.z.
1 a z
> df2[2,1] <- "q"
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = "q") :
invalid factor level, NA generated
How can I create the empty data frame without string factors?
rbind.data.frame first drops all zero-row and zero-column data frames, and then coerces the remaining arguments into data frames. This internal coercion uses the default value of stringsAsFactors (see the help for rbind, under "Data frame methods").
You can set this value by setting
options(stringsAsFactors=FALSE)
# now it works as you wish
str(rbind(df2,new.row))
# 'data.frame': 1 obs. of 2 variables:
# $ X.a.: chr "a"
# $ X.z.: chr "z"
I have been searching for an answer to this same problem and couldn't find anything, so I wrote my own function:
row.add <- function(x,newRow)
{
cn <- colnames(x)
x <- data.frame(lapply(x,as.character),stringsAsFactors = FALSE)
x <- rbind(x,newRow)
colnames(x) <- cn
return(x)
}
df <- data.frame("a"=character(),"b"=character())
df <- row.add(df,c("A","Z"))
df <- row.add(df,c("B","X"))
Hopefully someone searching for a similar answer will find this useful.
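Another way to avoid both the factor conversion and the mangled column names (X.a., X.z.) is to wrap the new values in a one-row data frame that reuses the existing column names before calling rbind. This is a minimal sketch; the helper name add_row_chr is made up for illustration, and it assumes the new values come in the same order as the columns.

```r
df2 <- data.frame(a = character(), b = character(), stringsAsFactors = FALSE)

# Hypothetical helper: bind one row of character values onto df
add_row_chr <- function(df, values) {
  # Name the values after the columns of df, then build a one-row data frame
  new_row <- as.data.frame(
    setNames(as.list(values), names(df)),
    stringsAsFactors = FALSE
  )
  rbind(df, new_row)
}

df2 <- add_row_chr(df2, c("a", "z"))
df2 <- add_row_chr(df2, c("b", "x"))

# The columns are character, so this assignment no longer produces NA
df2[2, 1] <- "q"
df2
```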
