Function to impute missing values using mean in R - r

My tibble:
Data in Excel:
impute <- read_excel(choose.files())
imp <- function(df) {
for(i in 1:ncol(df)){
df[is.na(df[,i]),i] <- mean(df[,i],na.rm = T)
}
}
imp(impute)
Warning messages:
1: In mean.default(df[, i], na.rm = T) :
argument is not numeric or logical: returning NA
2: In mean.default(df[, i], na.rm = T) :
argument is not numeric or logical: returning NA
The above code works fine it impute is a Data.Frame, but doesn't work if it's a Tibble. Could someone please let me know how to change the code if I were to work with Tibble.

One of the differences between a data.frame and a tibble is that data frames drop dimensions when possible by default and tibbles don't.
That is, if x is a data frame then x[, i] may or may not be a data frame, depending on i. If i is one value, then x[, i] will just be a vector. If i is a vector with multiple values then x[, i] will be a data frame. This can cause bugs when i is a variable that may or may not have multiple values, because the class may be different (with the fix being to use x[, i, drop = FALSE] to guarantee a data.frame return).
Tibbles seek to address this issue by switching the default drop = TRUE to drop = FALSE, so x[, i] is a tibble, regardless of whether i has length 1 or more.
When calculating the mean, you want df[,i] to be treated as a numeric vector, not a tibble with 1 column, so you need to specify it:
df[[i]] # This is the preferred way to extract a single column
df[, i, drop = TRUE] # this will work too (since tibble version 1.4.1)
This is explained in greater detail in the "Tibbles vs data.frames" section of the Tibbles vignette.

Related

Trying to figure out how to return the mean value of each column in a data frame using a list

I have a data frame which shows the average life expectancy of a country from 1800 to 2018.The Columns are labeled like this: XYear. For example: X2000. I made a function which returns the mean value of a selected column. Here's the part I'm struggling with: the assignment is asking me to create a list which has the mean value of every column in the data frame, using the aforementioned function.
I tried making a list element which would select all rows and columns except for the first ones (selecting them with [-1,-1]).
life_exp <- read.csv("data/life_expectancy_years.csv", stringsAsFactors = FALSE)
Write a function get_col_mean() that takes in a column name and a data frame and returns the mean of that column. Make sure to properly handle NA values
get_col_mean <- function(col_name, data_frame_name) {
return(mean(data_frame_name[, col_name], na.rm = TRUE))
}
Create a list col_means that has the mean value of each column in the data frame (except the Country column). You should use your function above.
I tried this:
column_means = get_col_mean(life_exp$life_exp[, -1], life_exp)
But I got this error message:
In mean.default(data_frame_name[, col_name], na.rm = TRUE) :
argument is not numeric or logical: returning NA
I believe you are misusing the $ operator. This is used to grab a single column by name.
#data frame
z <- data.frame(l = c(1,2,3,4), y = c(4,3,2,3), c =c(1,'',3,4)))
z$l
[1] 1 2 3 4
z$z
NULL
#numeric (note that I am providing the column name as a string
get_col_mean("l", z)
#outout
[1] 3
#this is the same as putting NULL in
get_col_mean(z$z, z)
#your presumed error
[1] NA
Warning message:
In mean.default(data_frame_name[, col_name], na.rm = TRUE) :
argument is not numeric or logical: returning NA
If you are looking to apply this to each column, a for loop or the apply family of functions is likely what you are looking for.

How to properly apply RowMeans()? "X is not numeric" error

I have two columns within OtherIncludedClean, and I would like to add another column of OtherIncludedClean$Mean; however, my efforts are in vain.
I have tried:
OtherIncludedClean$mean <- rowMeans(OtherIncludedClean, na.rm = FALSE, dims = 1)
But, the above reports the error:
"Error in base::rowMeans(x, na.rm = na.rm, dims = dims, ...) :
'x' must be numeric"
I have also attempted:
OtherIncludedClean$mean <- apply(OtherIncludedClean, 1, function(x) { mean(x, na.rm=TRUE) })
Which reports this error:
"1: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA"
For all 141 rows.
Any and all help appreciated. Thank you .
My columns are "X__1" and "X__2"
When we get error 'x' must be numeric", it is better to check the column types. An easier option is
str(OtherIncludedClean)
If we find that the types are not numeric/integer and it is character/factor, we need to convert it to numeric type (assuming that most of the values are numeric in a column and due to one or two elements which is not numeric, it changes the type).
The way to convert is as.numeric. For a single column, as.numeric(data$columnname) if it is character class and for factor class,
as.numeric(as.character(data$columnname))
Here, we need to change all the columns to numeric (assuming it is character class). For that, loop through the columns with lapply and assign the output back to the dataset
OtherIncludedClean[] <- lapplyOtherIncludedClean, as.numeric)
and then apply the rowMeans
If the class of only a subset of columns are character, then we need to only loop through those columns
i1 <- !sapply(OtherIncludedClean, is.numeric)
OtherIncludedClean[i1] <- lapplyOtherIncludedClean[i1], as.numeric)

Mean imputation issue with data.table

Trying to impute missing values in all numeric rows using this loop:
for(i in 1:ncol(df)){
if (is.numeric(df[,i])){
df[is.na(df[,i]), i] <- mean(df[,i], na.rm = TRUE)
}
}
When data.table package is not attached then code above is working as it should. Once I attach data.table package, then the behaviour changes and it shows me the error:
Error in `[.data.table`(df, , i) :
j (the 2nd argument inside [...]) is a single symbol but column name 'i'
is not found. Perhaps you intended DT[,..i] or DT[,i,with=FALSE]. This
difference to data.frame is deliberate and explained in FAQ 1.1.
I tried '..i' and 'with=FALSE' everywhere but with no success. Actually it has not passed even first is.numeric condition.
The data.table syntax is a little different in such a case. You can do it as follows:
num_cols <- names(df)[sapply(df, is.numeric)]
for(col in num_cols) {
set(df, i = which(is.na(df[[col]])), j = col, value = mean(df[[col]], na.rm=TRUE))
}
Or, if you want to keep using your existing loop, you can just turn the data back to data.frame using
setDF(df)
An alternative answer to this question, i came up with while sitting with a similar problem on a large scale. One might be interested in avoiding for loops by using the [.data.table method.
DF[i, j, by, on, ...]
First we'll create a function that can perform the imputation
impute_na <- function(x, val = mean, ...){
if(!is.numeric(x))return(x)
na <- is.na(x)
if(is.function(val))
val <- val(x[!na])
if(!is.numeric(val)||length(val)>1)
stop("'val' needs to be either a function or a single numeric value!")
x[na] <- val
x
}
To perform the imputation on the data frame, one could create and evaluate an expression in the data.table environment, but for simplicity of example here we'll overwrite using <-
DF <- DF[, lapply(.SD, impute_na)]
This will impute the mean across all numeric columns, and keep any non-numeric columns as is. If we wished to impute another value (like... 42 or whatever), and maybe we have some grouping variable, for which we only want the mean to computed over this can be included as well by
DF <- DF[, lapply(.SD, impute_na, val = 42)]
DF <- DF[, lapply(.SD, impute_NA), by = group]
Which would impute 42, and the within-group mean respectively.

Replace Some Variable Values with Variable Names

A package ('related') requires me to change some values withing variables in a largeish SNP dataframe (385x12300). This is no doubt simple but I can't find this particular question anywhere. Sample data:
binfrom<-c(1,1,1,1,0,NA)
x <- sample(binfrom, 100, replace = TRUE)
x<-data.frame(matrix(x,10,10))
I need the variable names X1,X2 etc to replace each "1" in that variable column. The values "0" and "NA" remain unchanged.
Another way is to use which (I'm assuming you have real NAs there- see #akruns comment)
indx <- which(x == 1, arr.ind = TRUE)
x[indx] <- names(x)[indx[, 2]]
This is basically identifies the locations of ones and replacing with the corresponding column names while using the columns location of the generated index.
We convert the columns of 'x' to character class from factor and use Map to replace 1 in each column with the corresponding column name.
x[] <- lapply(x, as.character)
x[] <- Map(function(y,z) replace(y, y==1, z), x, colnames(x))
In the OP's post, NA was created as character "NA". Because of that, the columns were factor while creating data.frame (with stringsAsFactors=TRUE - default option). If we used real NA, then the first step i.e. converting to character is not needed.
In case, we work with data.table, another option is set which should be fast when working with large datasets.
library(data.table)
setDT(x)
for(j in seq_along(x)){
set(x, i=NULL, j= j, value= as.character(x[[j]]))
set(x, i= which(x[[j]]==1 & !is.na(x[[j]])),
j=j, value= names(x)[j])
}
NOTE: Assumption is that we are working with real NA values.

Calculate Mean of a column in R having non numeric values

I have a column which contain numeric as well as non-numeric values. I want to find the mean of the numeric values which i can use it to replace the non-numeric values. How can this be done in R?
Say your data frame is named df and the column you want to "fix" is called df$x. You could do the following.
You have to unfactor and then convert to numeric. This will give you NAs for all the character strings that cannot be coalesced to numbers.
nums <- as.numeric(as.character(df$x))
As Richie Cotton pointed out, there is a "more efficient, but harder to remember" way to convert factors to numeric
nums <- as.numeric(levels(df$x))[as.integer(df$x)]
To get the mean, you use mean() but pass na.rm = T
m <- mean(nums, na.rm = T)
Assign the mean to all the NA values.
nums[is.na(nums)] <- m
You could then replace the old data, but I don't recommend it. Instead just add a new column
df$new.x <- nums
This is a function I wrote yesterday to combat the non-numeric types. I have a data.frame with unpredictable type for each column. I want to calculate the means for numeric, and leave everything else untouched.
colMeans2 <- function(x) {
# This function tries to guess column type. Since all columns come as
# characters, it first tries to see if x == "TRUE" or "FALSE". If
# not so, it tries to coerce vector into integer. If that doesn't
# work it tries to see if there's a ' \" ' in the vector (meaning a
# column with character), it uses that as a result. Finally if nothing
# else passes, it means the column type is numeric, and it calculates
# the mean of that. The end.
# browser()
# try if logical
if (any(levels(x) == "TRUE" | levels(x) == "FALSE")) return(NA)
# try if integer
try.int <- strtoi(x)
if (all(!is.na(try.int))) return(try.int[1])
# try if character
if (any(grepl("\\\"", x))) return(x[1])
# what's left is numeric
mean(as.numeric(as.character(x)), na.rm = TRUE)
# a possible warning about coerced NAs probably originates in the above line
}
You would use it like so:
apply(X = your.dataframe, MARGIN = 2, FUN = colMeans2)
It sort of depends on what your data looks like.
Does it look like this?
data = list(1, 2, 'new jersey')
Then you could
data.numbers = sapply(data, as.numeric)
and get
c(1, 2, NA)
And you can find the mean with
mean(data.numbers, na.rm=T)
A compact conversion:
vec <- c(0:10,"a","z")
vec2 <- (as.numeric(vec))
vec2[is.na(vec2)] <- mean(vec2[!is.na(vec2)])
as.numeric will print the warning message listed below and convert the non-numeric to NA.
Warning message:
In mean(as.numeric(vec)) : NAs introduced by coercion

Resources