Implementing one-hot encoding using r - r

For the dataset I am working on, there are a lot of character variables that I want to one-hot encode in order to build some predictive models. In my code I am excluding two variables because it does not make sense to encode them: the item identifier and the establishment year of the store. Here is the code I am using:
one_hot_encoding = dummyVars("~.", data = train[,-
c("Item_Identifier", "Outlet_Establishment_Year")], fullRank = T)
ohe_df = data.table(predict(one_hot_encoding, train[,-
c("Item_Identifier", "Outlet_Establishment_Year")]))
train = cbind(train[,"Item_Identifier"], ohe_df)
When executing the first line it gives this error:
Error in -c("Item_Identifier", "Outlet_Establishment_Year") :
invalid argument to unary operator.
Why? And one question regarding the dummyVars function: does it exclude the numeric variables of the input dataset by default?

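Yes: dummyVars leaves numeric variables as they are and only expands the categorical ones.
Concerning your error: the unary minus is not defined for a character vector, so -c("Item_Identifier", "Outlet_Establishment_Year") fails before any subsetting happens. There are some workarounds: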
With the dplyr-package
select(train, -Item_Identifier, -Outlet_Establishment_Year)
And with base-R
train[, -which(names(train) %in% c("Item_Identifier", "Outlet_Establishment_Year"))]
OR just use the number of the column like
train[, -c(1,6)]
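A minimal sketch of the name-based approach on a made-up data frame (the toy values and the names drop_cols, kept and encoded are illustrative, not from the question). It also reproduces the original error, and it uses base R's model.matrix as a stand-in for dummyVars to show a numeric column passing through unencoded; caret itself is not loaded here.

```r
# Toy stand-in for the training data; values are invented for illustration.
train <- data.frame(
  Item_Identifier = c("FDA15", "DRC01", "FDN15"),
  Item_Weight = c(9.3, 5.92, 17.5),
  Outlet_Size = c("Medium", "Small", "Medium"),
  Outlet_Establishment_Year = c(1999, 2009, 1999),
  stringsAsFactors = TRUE
)
drop_cols <- c("Item_Identifier", "Outlet_Establishment_Year")

# Negating a character vector is what raises "invalid argument to unary operator".
err <- tryCatch(-drop_cols, error = function(e) e)

# Drop columns by name with a logical mask instead.
kept <- train[, !(names(train) %in% drop_cols), drop = FALSE]

# Base R analogue of one-hot encoding: factors are expanded, numerics are left alone.
encoded <- model.matrix(~ ., data = kept)
print(colnames(encoded))
```

The logical mask also behaves sensibly when none of the names match, whereas -which(...) with no matches would select zero columns.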

Related

R Tidymodels: error columns don't exist when using function argument to specify column

I'm trying to write a function to use the R tidymodels function initial_split with an argument that would let me change the strata to a different variable each time I call the function.
Using initial_split regularly like this works perfectly:
split_glab=initial_split(data,prop=0.7,strata=sp_glabrata)
Then I converted it to a function and plugged in my species parameter:
split_data=function(df,species){
initial_split(df,prop=0.7,strata=species)
}
split_data(data,species=sp_glabrata)
And get the following error:
Error: Can't subset columns that don't exist.
x Column `species` doesn't exist.
Of course, this column doesn't exist in my data since it's just an argument in my function --the column I'm trying to reference is called sp_glabrata. I can't figure out how to get my function to reference the column instead of the parameter. I don't want to just type the column name since I have to apply many similar functions to several columns and it would take forever.
Any guidance would be appreciated!
Since this is a tidyverse-style package, you can use the curly-curly operator ({{ }}) to pass the unquoted argument through as a column name:
library(tidymodels)
split_data <- function(df, species){
initial_split(df, prop=0.7, strata={{species}})
}
# testing
split_data(iris, species = Species)
#<Analysis/Assess/Total>
#<105/45/150>
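To see why the naive wrapper looks for a column called species, here is a base R analogy. It is only a sketch of the mechanism, not how rsample or {{ }} are implemented; captured_name, naive_wrapper and forwarding_wrapper are hypothetical helpers made up for this illustration.

```r
# Hypothetical helper that, like an NSE function, looks at the expression it was given.
captured_name <- function(column) deparse(substitute(column))

# Naive wrapper: the inner function only sees the symbol `species`.
naive_wrapper <- function(species) captured_name(species)

# Forwarding wrapper: splice the caller's expression into the inner call first.
forwarding_wrapper <- function(species) {
  eval(substitute(captured_name(s), list(s = substitute(species))))
}

print(naive_wrapper(sp_glabrata))       # the parameter name
print(forwarding_wrapper(sp_glabrata))  # the name the caller typed
```

The curly-curly operator plays the role of the forwarding step for tidy-evaluation functions, without the manual substitute and eval.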

Calculating multiple ROC curves in R using a for loop and pROC package. What variable to use in the predictor field?

I am using the pROC package and I want to calculate multiple ROC curve plots using a for loop.
My variables are specific column names stored as strings in a vector, and I want pROC to go through that vector sequentially and use each string in the "predictor" field, which seems to accept text/characters.
However, I cannot pass the variable correctly, as I am getting the error:
'predictor' argument should be the name of the column, optionally quoted.
Here is some example code with the aSAH dataset:
ROCvector<- c("s100b","ndka")
for (i in seq_along(ROCvector)){
a<-ROCvector[i]
pROC_obj <- roc(data=aSAH, outcome, as.character(a))
#code for output/print#
}
I have tried to call just "a" and using the functions print() or get() without any results.
Writing manually the variable (with or without quoting) works, of course.
Is there something I am missing about the type of variable I should use in the predictor field?
By passing data=aSAH as first argument, you are triggering the non-standard evaluation (NSE) of arguments, dplyr-style. Therefore you cannot simply pass the column name in a variable. Note the inconsistency with outcome that you pass unquoted and looks like a variable (but isn't)? Fortunately, functions with NSE in dplyr come with an equivalent function with standard evaluation, whose name ends with _. The pROC package follows this convention. You should usually use those if you are programming with column names.
Long story short, you should use the roc_ function instead, which accepts characters as column names (don't forget to quote "outcome"):
pROC_obj <- roc_(data=aSAH, "outcome", as.character(a))
A slightly more idiomatic version of your code would be:
for (predictor in ROCvector) {
pROC_obj <- roc_(data=aSAH, "outcome", predictor)
}
roc can also accept a formula, so we can use paste0 and as.formula to create one, i.e.:
library(pROC)
ROCvector<- c("s100b","ndka")
for (i in seq_along(ROCvector)){
a<-ROCvector[i]
pROC_obj <- roc(as.formula(paste0("outcome~",a)), data=aSAH)
print(pROC_obj)
#code for output/print#
}
To get the original call (i.e. without paste0), which you can use later for downstream calculations, use eval and bquote:
pROC_obj <- eval(bquote(roc(.(as.formula(paste0("outcome~",a))), data=aSAH)))
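As a variation on the formula approach, base R's reformulate builds the same formulas without pasting strings. This is a sketch that only constructs and prints the formulas; the roc call is left as a comment because pROC is not loaded here, and the list name formulas is made up for the example.

```r
ROCvector <- c("s100b", "ndka")

# One formula per predictor, e.g. outcome ~ s100b.
formulas <- lapply(ROCvector, reformulate, response = "outcome")
names(formulas) <- ROCvector

for (predictor in ROCvector) {
  # With pROC loaded, this is where roc(formulas[[predictor]], data = aSAH) would go.
  print(deparse(formulas[[predictor]]))
}
```

Keeping the formulas in a named list makes it easy to store one result per predictor under the same names.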

Coercing a vector to numeric mode in R

So, I have a set of data, and what I'm trying to do is find all the local maxima on the resulting curve. I read in a CSV file, which has x-values in the first column and y-values in the second, first step done, easy.
To find the maxima, I tried to use the findpeaks() function from the pracma package. However, each time I tried to run it, I got the same error:
Error: is.vector(x, mode = "numeric") is not TRUE
So, I first tried just converting this to a vector. Still got the same issue, however is.vector(x, mode = "any") was now returning true. I found some other help threads (which I can no longer find, so I can't share them, sorry!), and decided to try using lapply to coerce each entry in the new vector using as.numeric. Didn't work. Looked into ?as.numeric, and it mentioned that as.double might be better suited. Didn't work. Now I'm at a loss and not sure what to do - current working code is shown below.
plot <- read_csv("AFGP60 UV-05-04-16.csv",
col_names = FALSE, na = "null", skip = 2,n_max = numrow)
diffplot <- c(plot[1:601,2])
diffplot <- lapply(diffplot,as.double)
findpeaks(diffplot)
Try diffplot <- as.numeric(as.vector(plot[1:600, 2])).
The problem was that the data was read as character or as factor. The above code should change that. However, there are multiple issues with your code. First, plot is a base function used for plotting. Naming a variable with such a name is bad practice.
Second, the diffplot variable is a vector (first 600 rows from the second column), so there is no need to change each element separately with the lapply function.
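A small sketch of the coercion on invented data, assuming the column arrives as text. Note that c() applied to a data-frame slice returns a list, and a tibble from read_csv stays a tibble under single-bracket subsetting, so pulling the column out with [[ ]] is a safer way to get a plain vector. The peak search at the end is a simple base R stand-in, not pracma's findpeaks algorithm.

```r
# Toy stand-in for the CSV: both columns arrive as text (values invented).
raw <- data.frame(
  X1 = c("200", "201", "202", "203", "204"),
  X2 = c("0.10", "0.35", "0.20", "0.50", "0.05"),
  stringsAsFactors = FALSE
)

# c() on a data-frame slice yields a list, which is not a numeric vector.
as_list <- c(raw[1:5, 2, drop = FALSE])

# [[ ]] pulls the column out as a plain vector; as.numeric() then converts it.
y <- as.numeric(raw[[2]][1:5])
print(is.vector(y, mode = "numeric"))

# Simple interior local maxima: the slope changes from positive to negative.
peaks <- which(diff(sign(diff(y))) == -2) + 1
print(peaks)
```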

Uncommon error message converting Matrix to Sparse in R

I'm trying to run a LASSO on our dataset, and to do so, I need to convert non-numeric variables to numeric, ideally via a sparse matrix. However, when I try to use the Matrix command, I get the same error:
Error in asMethod(object) : invalid class 'NA' to dup_mMatrix_as_geMatrix
I thought this was due to NA's in my data, so I did an na.omit and got the same error. I tried again with a mini subset of my code and got the same error again:
> sparsecombined <- Matrix(combined1[1:10,],sparse=TRUE)
Error in asMethod(object) : invalid class 'NA' to dup_mMatrix_as_geMatrix
This is the data set I tried to convert with that last line of code:
Is there anything that jumps out that might prevent sparse conversion?
The easiest way to incorporate categorical variables into a LASSO is to use my glmnetUtils package, which provides a formula/data frame interface to glmnet.
library(glmnetUtils)
glmnet(ArrDelay ~ ArrTime + uniqueCarrier + TailNum + Origin + Dest,
data=combined1, sparse=TRUE)
This automatically handles categorical vars via one-hot encoding (also known as dummy variables). It can also use sparse matrices if so desired.
I think the error is due to the fact that you have non-numeric data types in your matrix.
Perhaps first convert your non-numeric columns, like UniqueCarrier, to binary vectors using one-hot encoding, and only then convert the matrix to sparse.
Here is my code that I used for that conversion:
# Convert Genre into binary variables
# Convert genreVector into a corpus in order to parse each text string into a binary vector with 1s representing the presence of a genre and 0s the absence
library(tm)
library(slam)
convertToBinary <- function(category) {
genreVector = category
genreVector = strsplit(genreVector, "(\\s)?,(\\s)?") # separate out commas
genreVector = gsub(" ", "_", genreVector) # combine DirectorNames with whitespaces
genreCorpus = Corpus(VectorSource(genreVector))
#dtm = DocumentTermMatrix(genreCorpus, list(dictionary=genreNames))
dtm = DocumentTermMatrix(genreCorpus)
binaryGenreVector = inspect(dtm)
return(binaryGenreVector)
#return(data.frame(binaryGenreVector)) # convert binaryGenreVector to dataframe
}
directorBinary = convertToBinary(x$Director)
directorBinaryDF = as.data.frame(directorBinary)
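For a plain categorical column such as UniqueCarrier, a shorter route than a text corpus is sparse.model.matrix from the Matrix package, which one-hot encodes directly into a sparse matrix. This is a sketch on invented data; the data frame flights and its values are assumptions for illustration only.

```r
library(Matrix)

# Toy data; the carrier codes and delays are invented for illustration.
flights <- data.frame(
  UniqueCarrier = factor(c("AA", "DL", "AA", "UA")),
  ArrDelay = c(5, -3, 12, 0)
)

# "- 1" drops the intercept so every carrier level gets its own 0/1 column.
carrier_onehot <- sparse.model.matrix(~ UniqueCarrier - 1, data = flights)
print(carrier_onehot)
```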
See nograpes's answer in "recommenderlab, Error in asMethod(object) : invalid class 'NA' to dup_mMatrix_as_geMatrix".
I got this error due to passing a data frame where a matrix was expected, and it looks like that's the same reason you are getting it. The solution is simple -- convert your data to a matrix before passing it to the Matrix function:
sparsecombined <- Matrix(as.matrix(combined1[1:10,]),sparse=TRUE)
In your case, this code will probably complain because you have some non-numeric data stored in there (e.g. the TailNum column). So you would need to downselect to just the numeric columns.
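Selecting only the numeric columns before the conversion could look like the sketch below. The toy combined1 and its values are invented, and numeric_cols is a name introduced for the example.

```r
library(Matrix)

# Toy stand-in for combined1; values are invented.
combined1 <- data.frame(
  ArrDelay = c(5, 0, 12),
  ArrTime = c(1030, 0, 1845),
  TailNum = c("N101", "N202", "N303"),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns, then go data frame -> matrix -> sparse Matrix.
numeric_cols <- vapply(combined1, is.numeric, logical(1))
sparsecombined <- Matrix(as.matrix(combined1[, numeric_cols, drop = FALSE]), sparse = TRUE)
print(dim(sparsecombined))
```

Without the column filter, as.matrix would turn the whole table into a character matrix because of TailNum, and the sparse conversion would fail again.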

What does "argument to 'which' is not logical" mean in FactoMineR MCA?

I'm trying to run an MCA on a datatable using FactoMineR. It contains only 0/1 numerical columns, and its size is 200.000 * 20.
require(FactoMineR)
result <- MCA(data[, colnames, with=F], ncp = 3)
I get the following error :
Error in which(unlist(lapply(listModa, is.numeric))) :
argument to 'which' is not logical
I didn't really know what to do with this error. Then I tried to turn every column to character, and everything worked. I thought it could be useful to someone else, and that maybe someone would be able to explain the error to me ;)
Cheers
Are the classes of your variables character or factor? I was having this problem. My solution was to change all variables to factor.
# my data.frame was "aux.da"
i = 0
while (i < ncol(aux.da)) {
i = i + 1
aux.da[, i] = as.factor(aux.da[, i])
}
It's difficult to tell without further input, but what you can do is:
Find the function where the error occurred (via traceback()),
Set a breakpoint and debug it:
trace(tab.disjonctif, browser)
I did the following (offline) to find the name of tab.disjonctif:
Found the package on the CRAN mirror on GitHub
Searched for the particular expression that gives the error
I just started to learn R yesterday, but the error comes from the fact that MCA is for categorical data, which is why your data cannot be numeric. To be more precise, before the MCA a "tableau disjonctif" (sorry, I don't know the English term; a complete disjunctive table) is created.
FactoMineR uses this function for that:
https://github.com/cran/FactoMineR/blob/master/R/tab.disjonctif.R
I think it looks for categorical values that can be matched to a numerical value (like Y = 1, N = 0).
For others, be careful: in R, categorical data means the factor type, so even if you have characters you could get this error.
To build off @marques, @Khaled, and @Pierre Gourseaud:
Yes, changing the format of your variables to factor should address the error message, but you shouldn't change the format of numerical data to factor if it's supposed to be continuous numerical data. Rather, if you have both continuous and categorical variables, try running a Factor Analysis for Mixed Data (FAMD) in the same FactoMineR package.
If you go the FAMD route, you can change the format of just your categorical variable columns to factor with this:
data[,c(3:5,10)] <- lapply(data[,c(3:5,10)] , factor)
(assuming column numbers 3,4,5 and 10 need to be changed).
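When every column is categorical, as with the all 0/1 table in the question, the same lapply idiom can convert the whole table at once. This sketch assumes a plain data.frame (the question uses a data.table, where the assignment syntax differs), and the data frame answers is invented for illustration.

```r
# Toy 0/1 data; assumes a plain data.frame, not a data.table.
answers <- data.frame(
  q1 = c(0, 1, 1, 0),
  q2 = c(1, 1, 0, 0),
  q3 = c(0, 0, 1, 1)
)

# Assigning into answers[] keeps the data.frame structure while replacing every column.
answers[] <- lapply(answers, factor)
str(answers)
```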
This will not work for only numeric variables. If you only have numeric use PCA. Otherwise, add a factor variable to your data frame. It seems like for your case you need to change your variables to binary factors.
Same problem here, and changing to factor did not solve it either, because I had set every variable as supplementary.
What I did first was transform all my numeric data to factor :
Xfac = factor(X[,1], ordered = TRUE)
for (i in 2:29){
tfac = factor(X[,i], ordered = TRUE)
Xfac = data.frame(Xfac, tfac)
}
colnames(Xfac)=labels(X[1,])
Still, it would not work. But my 2nd problem was that I included EVERY factor as supplementary variable !
So these :
MCA(Xfac, quanti.sup = c(1:29), graph=TRUE)
MCA(Xfac, quali.sup = c(1:29), graph=TRUE)
Would generate the same error, but this one works :
MCA(Xfac, graph=TRUE)
Not transforming the data to factors also generated the problem.
I posted the same answer to a related topic : https://stackoverflow.com/a/40737335/7193352
