I have a column which contains numeric as well as non-numeric values. I want to find the mean of the numeric values and use it to replace the non-numeric values. How can this be done in R?
Say your data frame is named df and the column you want to "fix" is called df$x. You could do the following.
You have to unfactor and then convert to numeric. This will give you NAs for all the character strings that cannot be coerced to numbers.
nums <- as.numeric(as.character(df$x))
As Richie Cotton pointed out, there is a "more efficient, but harder to remember" way to convert factors to numeric:
nums <- as.numeric(levels(df$x))[as.integer(df$x)]
To get the mean, use mean() and pass na.rm = TRUE so that the NAs are ignored.
m <- mean(nums, na.rm = TRUE)
Assign the mean to all the NA values.
nums[is.na(nums)] <- m
You could then replace the old data, but I don't recommend it. Instead, just add a new column:
df$new.x <- nums
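Put together, the steps above might look like the following sketch. The toy data frame and its values are made up for illustration, and suppressWarnings() is used only because the coercion warning is expected here.

```r
# Toy data frame (hypothetical); x is a factor mixing numbers and text
df <- data.frame(x = factor(c("1", "2", "abc", "3", "n/a")))

# Factor -> character -> numeric; non-numeric strings become NA.
# The coercion warning is expected, so it is suppressed.
nums <- suppressWarnings(as.numeric(as.character(df$x)))

# Mean of the values that parsed as numbers
m <- mean(nums, na.rm = TRUE)

# Replace the NAs with that mean and keep the result in a new column
nums[is.na(nums)] <- m
df$new.x <- nums

df
```

Note that any value that was already missing in the column would be filled with the mean as well, since after coercion it cannot be told apart from a non-numeric string.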
This is a function I wrote yesterday to combat the non-numeric types. I have a data.frame with unpredictable type for each column. I want to calculate the means for numeric, and leave everything else untouched.
colMeans2 <- function(x) {
# This function tries to guess column type. Since all columns come as
# characters, it first tries to see if x == "TRUE" or "FALSE". If
# not so, it tries to coerce vector into integer. If that doesn't
# work it tries to see if there's a ' \" ' in the vector (meaning a
# column with character), it uses that as a result. Finally if nothing
# else passes, it means the column type is numeric, and it calculates
# the mean of that. The end.
# browser()
# try if logical
if (any(levels(x) == "TRUE" | levels(x) == "FALSE")) return(NA)
# try if integer
try.int <- strtoi(x)
if (all(!is.na(try.int))) return(try.int[1])
# try if character
if (any(grepl("\\\"", x))) return(x[1])
# what's left is numeric
mean(as.numeric(as.character(x)), na.rm = TRUE)
# a possible warning about coerced NAs probably originates in the above line
}
You would use it like so:
apply(X = your.dataframe, MARGIN = 2, FUN = colMeans2)
It sort of depends on what your data looks like.
Does it look like this?
data = list(1, 2, 'new jersey')
Then you could
data.numbers = sapply(data, as.numeric)
and get
c(1, 2, NA)
And you can find the mean with
mean(data.numbers, na.rm=T)
A compact conversion:
vec <- c(0:10,"a","z")
vec2 <- (as.numeric(vec))
vec2[is.na(vec2)] <- mean(vec2[!is.na(vec2)])
as.numeric will print the warning message listed below and convert the non-numeric to NA.
Warning message:
In mean(as.numeric(vec)) : NAs introduced by coercion
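If the warning is expected, the same compact conversion can be written so that it stays quiet. This is only a sketch of one option; wrapping the call in suppressWarnings() hides every warning raised by as.numeric(), so it is best reserved for cases where the non-numeric entries are known in advance.

```r
vec <- c(0:10, "a", "z")

# Non-numeric strings become NA; the expected warning is silenced
vec2 <- suppressWarnings(as.numeric(vec))

# Fill the NAs with the mean of the values that did convert
vec2[is.na(vec2)] <- mean(vec2, na.rm = TRUE)

vec2
```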
I need your help here. I need to calculate variance manually in R. I have achieved it with this code, but it is not robust enough for missing values and non-numeric data types.
a= c(1,2,3,4,5)
k=mean(a,na.rm = T)
storage=a
for(i in 1:length(a)) {
storage[i]= ((a[i]-k)^2)
}
storage =sum((storage)/(length(a)-1))
storage
I run into trouble when I have a= c(1,2,3,4,5,c,NA).
How should I edit the code?
First, a few observations:
In R, you can do an operation on the whole vector. E.g. (c(1, 2, 3))^2 yields 1 4 9. There's no need to use a for loop.
mean isn't the only function that needs na.rm = TRUE; sum does too.
In R, atomic vectors (which are pretty much all vectors that aren't lists) can only hold elements of a single data type. There are four primary types: logical, integer, double and character. If a vector mixes types, all the elements are coerced to the most flexible type present, with precedence character → double → integer → logical. For example, c(1, 'c') returns the character vector "1", "c". That's why you were having trouble. (Note: an NA in a vector takes on the type of that vector.)
Unfortunately for that specific vector, c(1,2,3,4,5,c,NA), I don't think there's a simple way to coerce it to an integer. That's because it's a list that has a function as an element: the function c().
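The coercion rules described above can be checked directly in the console. The snippet below is a small illustration with made-up values:

```r
# Mixing a number and a string: everything becomes character
v <- c(1, "c")
typeof(v)            # "character"

# NA takes on the type of the vector it sits in
typeof(c(1L, NA))    # "integer"
typeof(c("a", NA))   # "character"

# c is a function, so the result cannot be an atomic vector;
# R builds a list instead
a <- c(1, 2, 3, 4, 5, c, NA)
is.list(a)           # TRUE
is.function(a[[6]])  # TRUE
```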
However, this function works whenever x is an atomic vector:
variance <- function(x){
x = as.numeric(x)
x = na.omit(x)
m = mean(x)
return(
sum((x-m)^2, na.rm = TRUE)/(length(x) - 1)
)
}
First we coerce the vector to numeric, so we can deal with a vector like c(1, 2, 'a'). Then we remove the NA's, so we don't have to write na.rm = TRUE in mean and sum. Then we just write down the formula.
A minor inconvenience is that when converting a character vector to numeric, we get a warning saying that NAs were generated. This can be solved if we write x = suppressWarnings(as.numeric(x)) instead.
If you want your function to be able to handle lists with functions, let me know.
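As a quick check, the function can be compared against the built-in var(). The sketch below repeats the function with the warning silenced; variance_list is a hypothetical helper, not part of the answer above, showing one possible way to cope with a list that contains a function by keeping only its numeric elements.

```r
# The function from above, with the coercion warning silenced
variance <- function(x) {
  x <- suppressWarnings(as.numeric(x))  # non-numeric strings become NA
  x <- x[!is.na(x)]                     # drop the NAs
  sum((x - mean(x))^2) / (length(x) - 1)
}

variance(c(1, 2, 3, 4, 5))           # 2.5
variance(c(1, 2, 3, 4, 5, "c", NA))  # 2.5
var(c(1, 2, 3, 4, 5))                # 2.5, built-in for comparison

# Hypothetical extension for lists such as c(1, 2, 3, 4, 5, c, NA):
# keep only the numeric elements, then reuse variance()
variance_list <- function(x) variance(unlist(Filter(is.numeric, x)))
variance_list(c(1, 2, 3, 4, 5, c, NA))  # 2.5
```

Filtering with is.numeric also drops the bare NA, because an NA on its own is logical.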
The for loop is unnecessary. You can write a vectorised function whose first step removes the NAs from the data by converting to character and then to numeric (the character step is needed because c is a function):
# Create data
set.seed(1)
x1 <- sample(1:10, 5)
x2 <- c(x1, c, NA)
# Make the function
varFunc <- function(x){
# Convert to character then numeric (non numeric become NA) then remove NAs
x <- as.numeric(as.character(x))[!is.na(as.numeric(as.character(x)))]
# Return Variance
sum((x-mean(x))^2) / (length(x)-1)
}
# Use the function
varFunc(x1)
varFunc(x2)
# Sanity check
var(x1)
var(x2, na.rm = TRUE)
One possible approach: first, clean up a. If you start with something like a = c(1, 2, 3, 4, 5, "c", NA), then a will not be stored as a numeric variable (because of the non-numeric entry). You might first coerce it to a numeric vector, which will give an extra NA entry:
a = c(1, 2, 3, 4, 5, "c", NA)
a <- as.numeric(a)
a
## 1 2 3 4 5 NA NA
Then you could keep only the entries that are not NA, by negating is.na() with !:
a <- a[!is.na(as.numeric(a))]
a
## 1 2 3 4 5
You could do these right after your initial declaration of a, for instance. Gregor Thomas also suggested na.omit(), which could work if combined properly with as.numeric().
I notice that you computed the mean by using the built-in mean() function and using na.rm = T... if you're able to use that same approach here, note that var() also has an optional na.rm = T parameter. I suspect you're not allowed to use it since you were instructed to compute the variance by hand, but perhaps you could use this to check your answers.
I have two columns within OtherIncludedClean, and I would like to add another column of OtherIncludedClean$Mean; however, my efforts are in vain.
I have tried:
OtherIncludedClean$mean <- rowMeans(OtherIncludedClean, na.rm = FALSE, dims = 1)
But, the above reports the error:
"Error in base::rowMeans(x, na.rm = na.rm, dims = dims, ...) :
'x' must be numeric"
I have also attempted:
OtherIncludedClean$mean <- apply(OtherIncludedClean, 1, function(x) { mean(x, na.rm=TRUE) })
Which reports this error:
"1: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA"
For all 141 rows.
Any and all help appreciated. Thank you .
My columns are "X__1" and "X__2"
When we get the error "'x' must be numeric", it is better to check the column types. An easy option is
str(OtherIncludedClean)
If the columns turn out to be character or factor rather than numeric/integer, we need to convert them to numeric. (This assumes that most of the values in a column are numeric and that one or two non-numeric elements changed the column's type.)
The way to convert is as.numeric. For a single column of character class, use as.numeric(data$columnname); for factor class, use
as.numeric(as.character(data$columnname))
Here, we need to change all the columns to numeric (assuming it is character class). For that, loop through the columns with lapply and assign the output back to the dataset
OtherIncludedClean[] <- lapply(OtherIncludedClean, as.numeric)
and then apply the rowMeans
If only a subset of the columns is of character class, then we only need to loop through those columns:
i1 <- !sapply(OtherIncludedClean, is.numeric)
OtherIncludedClean[i1] <- lapply(OtherIncludedClean[i1], as.numeric)
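An end-to-end sketch of the conversion followed by rowMeans is shown below. The data frame is a made-up stand-in for OtherIncludedClean, with the column names taken from the question; going through as.character() first is an assumption that makes the same code safe for factor columns.

```r
# Toy stand-in for OtherIncludedClean (values are made up); both columns
# arrive as character, and "n/a" is a non-numeric entry
OtherIncludedClean <- data.frame(
  X__1 = c("1", "4", "n/a"),
  X__2 = c("3", "6", "5"),
  stringsAsFactors = FALSE
)

# Convert every non-numeric column; unparseable entries become NA
i1 <- !sapply(OtherIncludedClean, is.numeric)
OtherIncludedClean[i1] <- lapply(
  OtherIncludedClean[i1],
  function(col) suppressWarnings(as.numeric(as.character(col)))
)

# Row means over the two data columns only
OtherIncludedClean$mean <- rowMeans(
  OtherIncludedClean[, c("X__1", "X__2")], na.rm = TRUE
)
OtherIncludedClean$mean  # 2 5 5
```

With na.rm = TRUE, a row that has one missing value gets the mean of its remaining value; use na.rm = FALSE if such rows should be NA instead.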
My data is a tibble read from an Excel file:
impute <- read_excel(choose.files())
imp <- function(df) {
for(i in 1:ncol(df)){
df[is.na(df[,i]),i] <- mean(df[,i],na.rm = T)
}
}
imp(impute)
Warning messages:
1: In mean.default(df[, i], na.rm = T) :
argument is not numeric or logical: returning NA
2: In mean.default(df[, i], na.rm = T) :
argument is not numeric or logical: returning NA
The above code works fine if impute is a data.frame, but doesn't work if it's a tibble. Could someone please let me know how to change the code to work with a tibble?
One of the differences between a data.frame and a tibble is that data frames drop dimensions when possible by default and tibbles don't.
That is, if x is a data frame then x[, i] may or may not be a data frame, depending on i. If i is one value, then x[, i] will just be a vector. If i is a vector with multiple values then x[, i] will be a data frame. This can cause bugs when i is a variable that may or may not have multiple values, because the class may be different (with the fix being to use x[, i, drop = FALSE] to guarantee a data.frame return).
Tibbles seek to address this issue by switching the default drop = TRUE to drop = FALSE, so x[, i] is a tibble, regardless of whether i has length 1 or more.
When calculating the mean, you want df[,i] to be treated as a numeric vector, not a tibble with 1 column, so you need to specify it:
df[[i]] # This is the preferred way to extract a single column
df[, i, drop = TRUE] # this will work too (since tibble version 1.4.1)
This is explained in greater detail in the "Tibbles vs data.frames" section of the Tibbles vignette.
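Applied to the imputation function from the question, the [[i]] form might look like the sketch below. Two things here go beyond the answer and are assumptions: non-numeric columns are skipped, and the modified data is returned explicitly, since the original loop changes only a local copy of df. The example data is made up.

```r
# Same idea as imp(), but using [[i]] so each column is a plain vector
# for both data frames and tibbles
imp <- function(df) {
  for (i in seq_along(df)) {
    col <- df[[i]]
    if (is.numeric(col)) {            # skip non-numeric columns
      col[is.na(col)] <- mean(col, na.rm = TRUE)
      df[[i]] <- col
    }
  }
  df                                   # return the modified copy
}

# Made-up example data; a tibble can be passed in the same way
impute <- data.frame(a = c(1, NA, 3), b = c("x", NA, "z"))
impute <- imp(impute)
impute$a  # 1 2 3
```

The result has to be assigned back, as in impute <- imp(impute), because R functions work on a copy of their argument.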
A package ('related') requires me to change some values within variables in a largeish SNP data frame (385x12300). This is no doubt simple, but I can't find this particular question anywhere. Sample data:
binfrom<-c(1,1,1,1,0,NA)
x <- sample(binfrom, 100, replace = TRUE)
x<-data.frame(matrix(x,10,10))
I need the variable names X1,X2 etc to replace each "1" in that variable column. The values "0" and "NA" remain unchanged.
Another way is to use which (I'm assuming you have real NAs there; see @akrun's comment):
indx <- which(x == 1, arr.ind = TRUE)
x[indx] <- names(x)[indx[, 2]]
This basically identifies the locations of the ones and replaces them with the corresponding column names, using the column positions from the generated index.
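On a small deterministic data frame (values made up for illustration), the index-based replacement behaves as follows:

```r
# Small example with real NA values
x <- data.frame(X1 = c(1, 0, NA), X2 = c(0, 1, 1))

# Row/column positions of every 1; NA comparisons are ignored by which()
indx <- which(x == 1, arr.ind = TRUE)

# Replace each 1 with the name of the column it sits in
x[indx] <- names(x)[indx[, 2]]

x$X1  # "X1" "0"  NA
x$X2  # "0"  "X2" "X2"
```

Because column names are strings, the affected columns become character, so the remaining zeros are stored as "0".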
We convert the columns of 'x' to character class from factor and use Map to replace 1 in each column with the corresponding column name.
x[] <- lapply(x, as.character)
x[] <- Map(function(y,z) replace(y, y==1, z), x, colnames(x))
In the OP's post, NA was created as the character "NA". Because of that, the columns became factors when creating the data.frame (with stringsAsFactors = TRUE, the default before R 4.0.0). If we use real NA values, then the first step, i.e. converting to character, is not needed.
If we work with data.table, another option is set, which should be fast when working with large datasets.
library(data.table)
setDT(x)
for(j in seq_along(x)){
set(x, i=NULL, j= j, value= as.character(x[[j]]))
set(x, i= which(x[[j]]==1 & !is.na(x[[j]])),
j=j, value= names(x)[j])
}
NOTE: Assumption is that we are working with real NA values.
When I pass a row of a data frame to a function using apply, I lose the class information of the elements of that row. They all turn into 'character'. The following is a simple example. I want to add a couple of years to the 3 stooges' ages. When I try to add 2 to a value that had been numeric, R says "non-numeric argument to binary operator." How do I avoid this?
age = c(20, 30, 50)
who = c("Larry", "Curly", "Mo")
df = data.frame(who, age)
colnames(df) <- c( '_who_', '_age_')
dfunc <- function (er) {
print(er['_age_'])
print(er[2])
print(is.numeric(er[2]))
print(class(er[2]))
return (er[2] + 2)
}
a <- apply(df,1, dfunc)
Output follows:
_age_
"20"
_age_
"20"
[1] FALSE
[1] "character"
Error in er[2] + 2 : non-numeric argument to binary operator
apply only really works on matrices (which have the same type for all elements). When you run it on a data.frame, it simply calls as.matrix first.
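The effect of that as.matrix call is easy to see on a small data frame. The sketch below uses plain column names (who, age) rather than the ones in the question, purely for readability:

```r
df <- data.frame(who = c("Larry", "Curly", "Mo"), age = c(20, 30, 50))

# One character column is enough to make the whole matrix character
m <- as.matrix(df)
typeof(m)      # "character"

# With only the numeric column, the matrix stays numeric
m_num <- as.matrix(df[, "age", drop = FALSE])
typeof(m_num)  # "double"

# Each row passed to the function is now numeric, so arithmetic works
res <- apply(m_num, 1, function(er) er["age"] + 2)
res
```

Looking up the value by name, as in er["age"], keeps working after columns are dropped, whereas a positional index such as er[2] would then point past the end of the row.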
The easiest way around this is to work on the numeric columns only:
# skips the first column
a <- apply(df[, -1, drop=FALSE],1, dfunc)
# Or in two steps:
m <- as.matrix(df[, -1, drop=FALSE])
a <- apply(m,1, dfunc)
The drop=FALSE is needed to avoid getting a single column vector.
-1 means all-but-the first column, you could instead explicitly specify the columns you want, for example df[, c('foo', 'bar')]
UPDATE
If you want your function to access one full data.frame row at a time, there are (at least) two options:
# "loop" over the index and extract a row at a time
sapply(seq_len(nrow(df)), function(i) dfunc(df[i,]))
# Use split to produce a list where each element is a row
sapply(split(df, seq_len(nrow(df))), dfunc)
The first option is probably better for large data frames since it doesn't have to create a huge list structure upfront.