Convert numerical variable to categorical variable - r

I have a list of columns that contain 0 and 1 as values. Right now they are treated as numerical variables but I want them to be treated as categorical.
I tried
as.factor(df[,"diseasesA":"diseaseM"], exclude = NULL)
but received the following error message:
Error in as.factor(df[,"diseasesA":"diseaseM"], :
unused argument (exclude = NULL)
Leaving out "exclude = NULL" gave me the following error message:
Error in "diseasesA":"diseaseM" : NA/NaN argument
In addition: Warning messages:
1: In eval(jsub, setattr(as.list(seq_along(x)), "names", names_x), :
NAs introduced by coercion
2: In eval(jsub, setattr(as.list(seq_along(x)), "names", names_x), :
NAs introduced by coercion

factor() or as.factor() works on a single column, not a data frame. So you need to apply that function to the columns you want to convert. Here are a few equivalent methods:
cols = paste0("disease", LETTERS[1:13]) # assuming your naming pattern is consistent
## base R with lapply
df[cols] = lapply(df[cols], factor)
## base R with for loop (index by column name, not by position)
for (col in cols) {
  df[[col]] = factor(df[[col]])
}
## dplyr
library(dplyr)
df = df %>%
mutate(across(diseaseA:diseaseM, factor))
I will note that your question is inconsistent in its column naming pattern, disease vs diseases. In the base R methods I assumed that's a typo and further assumed you wanted to convert columns diseaseA, diseaseB, diseaseC, ..., diseaseM. In dplyr we can use X:Z inside across() to operate on all columns from X through Z, but there are many other ways to select which columns to work on, e.g., starts_with("disease").
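As a quick check of the approach above, here is a minimal sketch on a made-up data frame. The id column and the random 0/1 values are assumptions for illustration; only the diseaseA through diseaseM names come from the question. It converts the columns with lapply and confirms that only those columns became factors.

```r
# Hypothetical toy data: an id column plus 13 binary disease columns
set.seed(1)
cols <- paste0("disease", LETTERS[1:13])
df <- data.frame(id = 1:6)
for (col in cols) {
  df[[col]] <- sample(c(0, 1), size = 6, replace = TRUE)
}

# Convert only the disease columns to factors
df[cols] <- lapply(df[cols], factor)

# id stays integer; every disease column is now a factor
converted <- sapply(df, is.factor)
print(converted)
```

Indexing by name (df[cols]) rather than by position keeps the id column untouched even if the column order changes.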

Related

large rowSums() results in Inf ? large number problem in R

I have a large data.matrix and I want to calculate the sum of each row. Using the rowSums function results in Inf values because (presumably) the numbers are too large.
So I tried using Brobdingnagian numbers (from the Brobdingnag package, function as.brob) to deal with very large numbers. But that is not working. Here is an example of what I have done with the mtcars example dataset
library(dplyr)
library(brobdingnag)
mtcars <- data.matrix(mtcars)
mtcars.rowsum <- mtcars %>% as.brob(.) %>% rowSums(.)
Error in h(simpleError(msg, call)) :
Error argument 'x' during method selection for function 'rowSums':
invalid class “brob” object: invalid object for slot "positive" in class "brob":
got class "matrix", should be or extend class "logical"
Passing TRUE or FALSE as brob(., positive = ) results in an "unused argument" error.
How can I handle very large numbers with rowSums() in R? How can I use as.brob on a data.matrix?
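If all entries are strictly positive, one way to sidestep the overflow without any extra package is to work with the logarithm of each row sum (the log-sum-exp trick). This is only a sketch under that positivity assumption; log_row_sums is a hypothetical helper name, and the result is log(sum) rather than the sum itself.

```r
# Hypothetical helper: log of each row sum, computed without overflow.
# Assumes every entry of m is strictly positive and finite.
log_row_sums <- function(m) {
  lm <- log(m)
  row_max <- apply(lm, 1, max)  # largest log value in each row
  # Subtracting row_max keeps exp() in a safe range. The vector recycles
  # down the columns, so element [i, j] has row_max[i] subtracted.
  row_max + log(rowSums(exp(lm - row_max)))
}

m <- matrix(c(1e308, 1e308,
              1,     2), nrow = 2, byrow = TRUE)
rowSums(m)        # first row overflows to Inf
log_row_sums(m)   # finite: log(1e308) + log(2) and log(3)
```

Staying in log space is close in spirit to what Brobdingnagian numbers do, but it does not cover negative or zero entries.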

How to properly apply RowMeans()? "X is not numeric" error

I have two columns within OtherIncludedClean, and I would like to add another column of OtherIncludedClean$Mean; however, my efforts are in vain.
I have tried:
OtherIncludedClean$mean <- rowMeans(OtherIncludedClean, na.rm = FALSE, dims = 1)
But, the above reports the error:
"Error in base::rowMeans(x, na.rm = na.rm, dims = dims, ...) :
'x' must be numeric"
I have also attempted:
OtherIncludedClean$mean <- apply(OtherIncludedClean, 1, function(x) { mean(x, na.rm=TRUE) })
Which reports this error:
"1: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA"
For all 141 rows.
Any and all help appreciated. Thank you.
My columns are "X__1" and "X__2"
When we get the error "'x' must be numeric", it is best to check the column types. An easy way to do that is
str(OtherIncludedClean)
If the columns turn out to be character or factor rather than numeric/integer, we need to convert them to numeric (assuming that most of the values in a column are numeric and that one or two non-numeric elements changed its type).
The function to convert with is as.numeric. For a single character column, use as.numeric(data$columnname); for a factor column, use
as.numeric(as.character(data$columnname))
Here, we need to convert all the columns to numeric (assuming they are character class). To do that, loop through the columns with lapply and assign the output back to the dataset
OtherIncludedClean[] <- lapply(OtherIncludedClean, as.numeric)
and then apply rowMeans.
If only a subset of the columns is character, we only need to loop through those columns
i1 <- !sapply(OtherIncludedClean, is.numeric)
OtherIncludedClean[i1] <- lapply(OtherIncludedClean[i1], as.numeric)
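To see the whole sequence end to end, here is a small sketch on made-up data. The values, including the stray "n/a" entry, are assumptions for illustration; only the column names X__1 and X__2 come from the question.

```r
# Hypothetical data: both columns were read in as character
OtherIncludedClean <- data.frame(
  X__1 = c("1", "2", "3"),
  X__2 = c("4", "5", "n/a"),   # a stray non-numeric entry
  stringsAsFactors = FALSE
)
str(OtherIncludedClean)

# Convert only the non-numeric columns; "n/a" becomes NA
# (the coercion warning is suppressed here on purpose)
i1 <- !sapply(OtherIncludedClean, is.numeric)
OtherIncludedClean[i1] <- lapply(OtherIncludedClean[i1], function(x) {
  suppressWarnings(as.numeric(as.character(x)))  # as.character() also covers factors
})

# Compute the mean from the two original columns only
OtherIncludedClean$mean <- rowMeans(OtherIncludedClean[c("X__1", "X__2")], na.rm = TRUE)
print(OtherIncludedClean)
```

Selecting the two source columns explicitly in rowMeans avoids averaging in the mean column itself if the line is ever run a second time.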

Function to impute missing values using mean in R

My tibble:
Data in Excel:
impute <- read_excel(choose.files())
imp <- function(df) {
for(i in 1:ncol(df)){
df[is.na(df[,i]),i] <- mean(df[,i],na.rm = T)
}
}
imp(impute)
Warning messages:
1: In mean.default(df[, i], na.rm = T) :
argument is not numeric or logical: returning NA
2: In mean.default(df[, i], na.rm = T) :
argument is not numeric or logical: returning NA
The above code works fine if impute is a data.frame, but doesn't work if it's a tibble. Could someone please let me know how to change the code so that it works with a tibble?
One of the differences between a data.frame and a tibble is that data frames drop dimensions when possible by default and tibbles don't.
That is, if x is a data frame then x[, i] may or may not be a data frame, depending on i. If i is one value, then x[, i] will just be a vector. If i is a vector with multiple values then x[, i] will be a data frame. This can cause bugs when i is a variable that may or may not have multiple values, because the class may be different (with the fix being to use x[, i, drop = FALSE] to guarantee a data.frame return).
Tibbles seek to address this issue by switching the default drop = TRUE to drop = FALSE, so x[, i] is a tibble, regardless of whether i has length 1 or more.
When calculating the mean, you want df[,i] to be treated as a numeric vector, not a tibble with 1 column, so you need to specify it:
df[[i]] # This is the preferred way to extract a single column
df[, i, drop = TRUE] # this will work too (since tibble version 1.4.1)
This is explained in greater detail in the "Tibbles vs data.frames" section of the Tibbles vignette.
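Putting this together, a minimal sketch of the imputation function using [[ ]] might look like the following. It is a hypothetical rewrite, not the asker's exact code: it also skips non-numeric columns and returns the modified data, since changes made inside a function are otherwise lost. The example data is made up.

```r
# Hypothetical rewrite of imp(): uses [[ ]] so each column is a plain vector,
# skips non-numeric columns, and returns the modified data
imp <- function(df) {
  for (i in seq_along(df)) {
    if (is.numeric(df[[i]])) {
      df[[i]][is.na(df[[i]])] <- mean(df[[i]], na.rm = TRUE)
    }
  }
  df
}

# Made-up example data (a data.frame here; [[ ]] extracts a vector from a tibble too)
impute <- data.frame(a = c(1, NA, 3), b = c("x", NA, "z"), c = c(NA, 10, 20))
impute <- imp(impute)
print(impute)
```

Because the function returns df, the result has to be assigned back, as in the last assignment above.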

Removing rows in list (with ff)

I have a data set where I want to remove every row in which Dataset$a does not have the value "Right". Dataset$a is a list with three different objects "Right", "Wrong1" and "Wrong2". I tried to do this by using the code:
Dataset$a <- subset.ffdf(Dataset, a == "Right")
But I get the error
Error in if (any(B < 1)) stop("B too small") : missing value where TRUE/FALSE needed
In addition: Warning message:
In bbatch(n, as.integer(BATCHBYTES/theobytes)) :
NAs introduced by coercion to integer range
What should I do instead?
The warning that is returned has something to do with working with a large data set.
Before doing your filter, check whether column a is a factor or character using:
str(Dataset$a)
If it is in character format, this should work
finalDf <- Dataset[Dataset$a == "Right", ]
Or you can use dplyr like so:
require(dplyr)
newData <- Dataset%>% dplyr::filter(a=="Right")

R function applied on data frame grouped by multiple factors

I have a data frame called subdata, with dimensions 10299 x 81. Column 1 is called "Subject" and column 2 is called "Activity". I want to calculate the average of each column grouped by "Subject" and "Activity".
Here are the functions I tried; none of them seemed to work. Finally I used the colwise(mean) function, which seems to work. I am new to R and have just learned the sapply, lapply and tapply functions, and it seems the mean function works on columns.
Can anyone help explain what these error and warning messages mean, and whether there is a way to make these functions work?
Using the lapply function:
newdata<- subdata[, lapply(.SD, mean), by = c("Subject","Activity")]
The error message:
Error in `[.data.frame`(subdata, , lapply(.SD, mean), by = c("Subject", :
unused argument (by = c("Subject", "Activity"))
Using the by function:
newdata<-by(subdata, list(subdata$Subject, subdata$Activity), mean)
I got warning message:
Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
Then I tried ddply in plyr package
ddply(subdata, .(Subject, Activity), mean)
I got the same warning message:
Warning messages:
1: In mean.default(piece, ...) : argument is not numeric or logical: returning NA 0
Finally I used the colwise(mean) function, which seems to work
newdata<-ddply(subdata, .(Subject, Activity), colwise(mean))
It is somewhat difficult to be certain without a representative sample of your dataset. Let's create some data to work with.
# Create some random demo data
subdata <- data.frame(Subject = rep(seq(5), each=4),
Activity = rep(LETTERS[1:2], 10), v1=rnorm(20), v2=rnorm(20))
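Your first attempt uses data.table syntax (.SD and by inside the brackets), which only works when subdata is a data.table. On a plain data frame, `[` has no by argument, hence the "unused argument" error. Unless you convert subdata to a data.table, you should abandon this attempt.
Your by statement produces a warning about non-numeric data. This is because the by function isn't that smart: it hands each group's whole data frame to mean, and mean does not accept a data frame. You need to provide only the columns to be analyzed, then the indices (i.e. your factor columns), and a function that works column by column.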
by(subdata[,-c(1,2)], list(subdata$Subject, subdata$Activity), function(x) colMeans(x))
You will probably want to rbind this output and reassign the row names to correspond to the groups, though. For this purpose it may be best to just use something like aggregate and avoid the extra work.
aggregate(subdata[,-c(1,2)], list(subdata$Subject, subdata$Activity), mean)
Your ddply statements are close, but you should use numcolwise so that only your numeric columns are summarized.
library(plyr)
# summarize over all numeric columns
ddply(subdata, .(Subject, Activity), numcolwise(mean))
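As a base R alternative that needs no extra package, the grouped means can also be written with aggregate's formula interface. This is a sketch on the same kind of demo data as above; the seed and the spot-check are additions for illustration.

```r
# Reproducible demo data with the same shape as in the answer
set.seed(42)
subdata <- data.frame(Subject = rep(seq(5), each = 4),
                      Activity = rep(LETTERS[1:2], 10),
                      v1 = rnorm(20), v2 = rnorm(20))

# Mean of every remaining column (.) for each Subject/Activity combination
newdata <- aggregate(. ~ Subject + Activity, data = subdata, FUN = mean)

# Spot-check one group against a direct calculation
manual <- mean(subdata$v1[subdata$Subject == 1 & subdata$Activity == "A"])
print(newdata)
```

Note that the formula interface drops rows containing NA by default (na.action = na.omit), so results can differ from the other methods when the data has missing values.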

Resources