I have two columns within OtherIncludedClean, and I would like to add another column of OtherIncludedClean$Mean; however, my efforts are in vain.
I have tried:
OtherIncludedClean$mean <- rowMeans(OtherIncludedClean, na.rm = FALSE, dims = 1)
But, the above reports the error:
"Error in base::rowMeans(x, na.rm = na.rm, dims = dims, ...) :
'x' must be numeric"
I have also attempted:
OtherIncludedClean$mean <- apply(OtherIncludedClean, 1, function(x) { mean(x, na.rm=TRUE) })
Which reports this error:
"1: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA"
For all 141 rows.
Any and all help appreciated. Thank you.
My columns are "X__1" and "X__2"
When we get the error "'x' must be numeric", it is better to check the column types. An easy way to do that is
str(OtherIncludedClean)
If the columns are not numeric/integer but character/factor, we need to convert them to numeric (assuming that most of the values in a column are numeric and the type only changed because of one or two non-numeric elements).
The way to convert is as.numeric. For a single column, use as.numeric(data$columnname) if it is character class; for factor class, use
as.numeric(as.character(data$columnname))
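To see why the extra as.character step matters, here is a small sketch with a made-up factor: calling as.numeric directly on a factor returns the underlying level codes, not the values.
f <- factor(c("10", "2", "5"))   # hypothetical factor column
as.numeric(f)                    # returns 1 2 3, the internal level codes
as.numeric(as.character(f))      # returns 10 2 5, the actual values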
Here, we need to change all the columns to numeric (assuming they are character class). For that, loop through the columns with lapply and assign the output back to the dataset:
OtherIncludedClean[] <- lapply(OtherIncludedClean, as.numeric)
and then apply the rowMeans
If only a subset of the columns is character, then we only need to loop through those columns:
i1 <- !sapply(OtherIncludedClean, is.numeric)
OtherIncludedClean[i1] <- lapply(OtherIncludedClean[i1], as.numeric)
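Putting it all together, a minimal sketch with a small made-up data frame (using the column names from the question) might look like this:
## hypothetical data: two character columns that should be numeric
OtherIncludedClean <- data.frame(X__1 = c("1.5", "2", "3"),
                                 X__2 = c("4", "5", "6"),
                                 stringsAsFactors = FALSE)
str(OtherIncludedClean)                                        # check the column types
OtherIncludedClean[] <- lapply(OtherIncludedClean, as.numeric)
OtherIncludedClean$mean <- rowMeans(OtherIncludedClean, na.rm = TRUE)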
I have a list of columns that contain 0 and 1 as values. Right now they are treated as numerical variables but I want them to be treated as categorical.
I tried
as.factor(df[,"diseasesA":"diseaseM"], exclude = NULL)
but received the following error message:
Error in as.factor(df[,"diseasesA":"diseaseM"], :
unused argument (exclude = NULL)
not using "exclude = NULL" gave me the following error message:
Error in "diseasesA":"diseaseM" : NA/NaN argument
In addition: Warning messages:
1: In eval(jsub, setattr(as.list(seq_along(x)), "names", names_x), :
NAs introduced by coercion
2: In eval(jsub, setattr(as.list(seq_along(x)), "names", names_x), :
NAs introduced by coercion
factor() or as.factor() works on a single column, not a data frame. So you need to apply that function to the columns you want to convert. Here are a few equivalent methods:
cols = paste0("disease", LETTERS[1:13]) # assuming your naming pattern is consistent
## base R with lapply
df[cols] = lapply(df[cols], factor)
## base R with for loop
for(i in seq_along(cols)) {
  df[[cols[i]]] = factor(df[[cols[i]]])
}
## dplyr
library(dplyr)
df = df %>%
  mutate(across(diseaseA:diseaseM, factor))
I will note that your question is inconsistent in its column naming pattern, disease vs diseases. In the base R methods I assumed that's a typo and that you want to convert columns diseaseA, diseaseB, diseaseC, ..., diseaseM. In dplyr, across() lets us use diseaseA:diseaseM to operate on all columns from diseaseA through diseaseM, but there are many other ways to select which columns to work on, e.g., starts_with("disease").
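As a quick check, here is a sketch with a small made-up data frame containing just two of the disease columns; both the lapply and dplyr routes should leave them as factors:
## hypothetical data: 0/1 columns that should be categorical
df <- data.frame(diseaseA = c(0, 1, 1), diseaseB = c(1, 0, 1))
cols <- c("diseaseA", "diseaseB")
df[cols] <- lapply(df[cols], factor)   # base R
str(df)                                # both columns are now factors
## equivalently, with dplyr
library(dplyr)
df <- df %>% mutate(across(diseaseA:diseaseB, factor))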
This appears to be quite surprising:
df1 <- data.frame(A=TRUE, B=FALSE)
df2 <- data.frame(A=1, B=2)
> any(df1)
Error in FUN(X[[i]], ...) :
only defined on a data frame with all numeric variables
> any(df2)
[1] TRUE
This doesn't seem to be a bug because the error correctly states that any() will only work in the case where all variables within a data.frame are numeric.
But what is the reason for any() to work on all numeric variables and not when values are all logical?
any works on vectors, as the documentation says:
Given a set of logical vectors, is at least one of the values true?
In the OP's post, neither example is a vector. The first is a data.frame with logical columns. To satisfy the documentation, i.e. create a logical vector, either convert to a matrix (a matrix is just a vector with some dim attributes)
any(as.matrix(df1))
#[1] TRUE
Or change it to a vector by unlisting (a data.frame is a list of vectors, i.e. columns of the same length)
any(unlist(df1))
In the second case, there is a warning because any coerces the numeric values to logical:
any(df2)
#[1] TRUE
Warning message:
In any(c(1, 2), na.rm = FALSE) : coercing argument of type 'double' to logical
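If a per-column answer is wanted instead of a single overall one, a small sketch (this goes beyond the original answer) is to apply any() to each logical column:
df1 <- data.frame(A = TRUE, B = FALSE)
sapply(df1, any)        # named logical vector: A is TRUE, B is FALSE
any(sapply(df1, any))   # TRUE, collapsing the per-column results into one value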
I have a large df with a specific numeric column named Amount.
df = data.frame(Amount = c(as.numeric(1:14)), stringsAsFactors = FALSE)
I want to select odd rows. So far, I have tried the syntax below but I always get these error messages:
df$Amount[c(FALSE, TRUE),]
Error in df$Amount[c(FALSE, TRUE), ] : incorrect number of dimensions
seq_len(ncol(df$Amount)) %% 2
Error in seq_len(ncol(df$Amount)) :
argument must be coercible to non-negative integer
In addition: Warning message:
In seq_len(ncol(df$Amount)) :
first element used of 'length.out' argument
odd = seq(1,14,1)
df$Amount[odd,1]
Error in P20$Journal.Amount[even, 1] : incorrect number of dimensions
P20$Journal.Amount[seq(2,length(14), 2),]
Error in seq.default(2, length(14), 2) : wrong sign in 'by' argument
My question is: is there a way I can do this directly? I tried the solutions from previously posted questions, but so far I keep getting these error messages.
Base R preferred.
The row/column index is used when there are dim attributes. A vector doesn't have them.
is.vector(df$Amount)
If we extract the vector, then just use the row index
df$Amount[c(FALSE, TRUE)]
If we want to subset the rows of the dataset,
df[c(FALSE, TRUE), 'Amount', drop = FALSE]
In the above code, we specify the row index (i), the column index or column name (j), and drop (see ?Extract; drop = TRUE is the default for a data.frame, so we need drop = FALSE to avoid dropping the dimensions and coercing to a vector).
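For reference, a short sketch with the example data frame from the question; note that a recycled c(TRUE, FALSE) picks rows 1, 3, 5, ... (the odd rows), while c(FALSE, TRUE) picks the even ones:
df <- data.frame(Amount = as.numeric(1:14))
df$Amount[c(TRUE, FALSE)]                   # odd elements of the extracted vector
df$Amount[seq(1, nrow(df), by = 2)]         # the same, with an explicit index
df[c(TRUE, FALSE), 'Amount', drop = FALSE]  # odd rows, kept as a one-column data frame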
My tibble:
Data in Excel:
impute <- read_excel(choose.files())
imp <- function(df) {
  for(i in 1:ncol(df)){
    df[is.na(df[,i]),i] <- mean(df[,i],na.rm = T)
  }
}
imp(impute)
Warning messages:
1: In mean.default(df[, i], na.rm = T) :
argument is not numeric or logical: returning NA
2: In mean.default(df[, i], na.rm = T) :
argument is not numeric or logical: returning NA
The above code works fine if impute is a data.frame, but doesn't work if it's a tibble. Could someone please let me know how to change the code if I were to work with a tibble?
One of the differences between a data.frame and a tibble is that data frames drop dimensions when possible by default and tibbles don't.
That is, if x is a data frame then x[, i] may or may not be a data frame, depending on i. If i is one value, then x[, i] will just be a vector. If i is a vector with multiple values then x[, i] will be a data frame. This can cause bugs when i is a variable that may or may not have multiple values, because the class may be different (with the fix being to use x[, i, drop = FALSE] to guarantee a data.frame return).
Tibbles seek to address this issue by switching the default drop = TRUE to drop = FALSE, so x[, i] is a tibble, regardless of whether i has length 1 or more.
When calculating the mean, you want df[,i] to be treated as a numeric vector, not a tibble with 1 column, so you need to extract it explicitly:
df[[i]] # This is the preferred way to extract a single column
df[, i, drop = TRUE] # this will work too (since tibble version 1.4.1)
This is explained in greater detail in the "Tibbles vs data.frames" section of the Tibbles vignette.
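As a sketch (not the only way to do it), the loop from the question could be adapted to use [[ so it behaves the same for tibbles and data frames; note that the original function also never returned df, so the result is returned explicitly here:
imp <- function(df) {
  for (i in seq_along(df)) {
    # df[[i]] is a plain vector for both tibbles and data frames
    df[[i]][is.na(df[[i]])] <- mean(df[[i]], na.rm = TRUE)
  }
  df
}
impute <- imp(impute)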
I have a column which contains numeric as well as non-numeric values. I want to find the mean of the numeric values, which I can then use to replace the non-numeric values. How can this be done in R?
Say your data frame is named df and the column you want to "fix" is called df$x. You could do the following.
You have to unfactor and then convert to numeric. This will give you NAs for all the character strings that cannot be coerced to numbers.
nums <- as.numeric(as.character(df$x))
As Richie Cotton pointed out, there is a "more efficient, but harder to remember" way to convert factors to numeric
nums <- as.numeric(levels(df$x))[as.integer(df$x)]
To get the mean, you use mean() but pass na.rm = T
m <- mean(nums, na.rm = T)
Assign the mean to all the NA values.
nums[is.na(nums)] <- m
You could then replace the old data, but I don't recommend it. Instead just add a new column
df$new.x <- nums
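Putting the steps together on a small made-up column:
df <- data.frame(x = factor(c("1", "2", "oops", "4")))  # hypothetical mixed column
nums <- as.numeric(as.character(df$x))        # "oops" becomes NA (coercion warning)
nums[is.na(nums)] <- mean(nums, na.rm = TRUE)
df$new.x <- nums                              # 1, 2, 2.333 (the mean), 4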
This is a function I wrote yesterday to combat the non-numeric types. I have a data.frame with an unpredictable type for each column. I want to calculate the means for the numeric columns and leave everything else untouched.
colMeans2 <- function(x) {
  # This function tries to guess the column type. Since all columns come as
  # characters, it first tries to see if x == "TRUE" or "FALSE". If
  # not so, it tries to coerce the vector into integer. If that doesn't
  # work, it checks whether there is a ' \" ' in the vector (meaning a
  # character column) and returns the first element as the result. Finally,
  # if nothing else passes, the column type is numeric, and it calculates
  # the mean of that. The end.
  # browser()
  # try if logical
  if (any(levels(x) == "TRUE" | levels(x) == "FALSE")) return(NA)
  # try if integer
  try.int <- strtoi(x)
  if (all(!is.na(try.int))) return(try.int[1])
  # try if character
  if (any(grepl("\\\"", x))) return(x[1])
  # what's left is numeric
  mean(as.numeric(as.character(x)), na.rm = TRUE)
  # a possible warning about coerced NAs probably originates in the above line
}
You would use it like so:
apply(X = your.dataframe, MARGIN = 2, FUN = colMeans2)
It sort of depends on what your data looks like.
Does it look like this?
data = list(1, 2, 'new jersey')
Then you could
data.numbers = sapply(data, as.numeric)
and get
c(1, 2, NA)
And you can find the mean with
mean(data.numbers, na.rm=T)
A compact conversion:
vec <- c(0:10,"a","z")
vec2 <- as.numeric(vec)
vec2[is.na(vec2)] <- mean(vec2[!is.na(vec2)])
as.numeric will print the warning message listed below and convert the non-numeric values to NA.
Warning message:
In mean(as.numeric(vec)) : NAs introduced by coercion