KNN function in R producing "NA/NaN/Inf in foreign function call (arg 6)" error

I'm working on a project where I need to construct a knn model using R. The professor provided an article with step-by-step instructions (link to article) and some datasets to choose from (link to the data I'm using). I'm getting stuck on step 3 (creating the model from the training data).
Here's my code:
library(class)  # knn() comes from the class package

data <- read.delim("data.txt", header = TRUE, sep = "\t", dec = ".")
set.seed(2)
# Split roughly 65/35 into training and testing sets
part <- sample(2, nrow(data), replace = TRUE, prob = c(0.65, 0.35))
training_data <- data[part == 1, ]
testing_data <- data[part == 2, ]
outcome <- training_data[, 2]  # the class labels are in column 2
model <- knn(train = training_data, test = testing_data, cl = outcome, k = 10)
Here's the error message I'm getting:
Error in knn(train = training_data, test = testing_data, cl = outcome, :
  NA/NaN/Inf in foreign function call (arg 6)
I checked and found that training_data, testing_data, and outcome all look correct; the issue seems to be with the knn call itself.

The issue is with your data and the knn function you are using: it can't handle character or factor variables.
We can force this to work by doing something like this first:
library(tidyverse)
data <- data %>%
  mutate(Seeded = as.numeric(as.factor(Seeded)) - 1) %>%
  mutate(Season = as.numeric(as.factor(Season)))
But this is a bad idea in general, since Season has no natural ordering. A better approach is to treat it as a set of dummy variables instead (sketched below, after the link).
See this link for examples:
R - convert from categorical to numeric for KNN
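For completeness, here is a minimal sketch of the dummy-variable approach using base R's model.matrix(); the Seeded and Season column names come from the code above, and which other columns to keep is an assumption:
# One-hot encode Season instead of forcing an artificial 1..n ordering;
# "- 1" drops the intercept so every season gets its own 0/1 column
season_dummies <- model.matrix(~ Season - 1, data = data)
data_knn <- cbind(data[, setdiff(names(data), "Season")], season_dummies)
data_knn$Seeded <- as.numeric(as.factor(data_knn$Seeded)) - 1  # binary 0/1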

Caret train function for multiple data frames as function

A similar question to mine was asked more than 6 years ago and never solved (R -- Can I apply the train function in caret to a list of data frames?). This is why I am bringing up this topic again.
I'm writing my own functions for my big R project at the moment, and I'm wondering whether there is a way to wrap caret's model training function train() so it works for different data frames with different predictors.
My function should look like this:
lda_ex <- function(data, predictor){
  model <- train(predictor ~ ., data,
                 method = "lda",
                 trControl = trainControl(method = "none"),
                 preProc = c("center", "scale"))
  return(model)
}
Using it afterwards should work like this:
data_iris <- iris
predictor_iris <- "Species"
iris_res <- lda_ex(data = data_iris, predictor = predictor_iris)
Unfortunately, as far as I can tell, an R formula cannot take a variable as input.
Is there something I am missing?
Thank you in advance for helping me out!
Solving this would help me a lot in keeping my function file clean and would definitely save work.
By writing predictor_iris <- "Species", you are saving a string in predictor_iris. Thus, when you run lda_ex, I guess you run into an error concerning the formula object in train(), since you are effectively trying to predict a string constant from vectors of covariates.
Indeed, I tried the following toy example:
# Toy example: the left-hand side is the string "Y", not the variable Y
X = rnorm(1000)
Y = runif(1000)
predictor = "Y"
lm(predictor ~ X)
which gives an error about differences in the lengths of variables.
Let me modify your function:
lda_ex <- function(data, formula){
  model <- train(formula, data,
                 method = "lda",
                 trControl = trainControl(method = "none"),
                 preProc = c("center", "scale"))
  return(model)
}
The key difference is that now we must pass in the whole formula, instead of the predictor only. In that way, we avoid the string-related problem.
library(caret) # Recall to specify the packages needed to reproduce your examples!
data_iris <- iris
formula_iris = Species ~ . # Key difference!
iris_res <- lda_ex(data = data_iris, formula = formula_iris)
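If you would still rather pass the predictor's name as a string, a small variation (a sketch, not part of the original answer) is to build the formula inside the function with base R's reformulate():
lda_ex <- function(data, predictor){
  # Build the formula "predictor ~ ." from the string before calling train()
  f <- reformulate(".", response = predictor)
  model <- train(f, data,
                 method = "lda",
                 trControl = trainControl(method = "none"),
                 preProc = c("center", "scale"))
  return(model)
}

iris_res <- lda_ex(data = iris, predictor = "Species")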

Make predictions on new data after training the GLM Lasso model

I have trained a classification model on 13,000 rows of labels with the lasso in R's glmnet library. I have checked my accuracy and it looks decent; now I want to make predictions for the rest of the dataset, which is 300,000 rows. My approach was to label the rest of the rows using the trained model. I'm not sure if that's the most effective strategy for approximate labeling.
But, when I'm trying to label rest of the data, I'm running into this error:
Error in asMethod(object) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Even if I break the dataset to 5000 rows for predictions, I still get the same error.
Here's my code:
library(glmnet)
library(quanteda)  # corpus(), corpus_sample(), dfm(), dfm_trim(), docvars()
library(dplyr)     # %>%, filter(), mutate()
# the subset of the original dataset
data.text <- data.text_filtered %>% filter(!label1 == "NA")
#Quanteda corpus
data_corpus <- corpus(data.text$text, docvars = data.frame(labels = data.text$label1))
set.seed(1234)
dataShuffled <- corpus_sample(data_corpus, size = 12845)
dataDfm <- dfm_trim( dfm(dataShuffled, verbose = FALSE), min_termfreq = 10)
#model to train the classifier
lasso <- cv.glmnet(x = dataDfm[1:10000, ], y = trainclass[1:10000],
                   alpha = 1, nfolds = 5, family = "binomial")
#plot the lasso plot
plot(lasso)
#predictions
dataPreds <- predict(lasso, dataDfm[10001:12845, ], type = "class")  # held-out rows
(movTable <- table(dataPreds, docvars(dataShuffled, "labels")[10001:12845]))
# make predictions on the rest of the dataset, which has 300,000 rows
data.text_NAs <- data.text_filtered %>% filter(label1 == "NA")
data_NADfm <- dfm_trim( dfm(corpus(data.text_NAs$text), verbose = FALSE), min_termfreq = 10)
data.text_filtered <- data.text_filtered %>%
  mutate(label = predict(lasso, as.matrix(data_NADfm), type = "class", s = "lambda.1se"))
Thanks much for any help.
The problem lies in as.matrix(data_NADfm): this converts the dfm into a dense matrix, which is too large to handle.
Solution: keep it sparse. Either remove the as.matrix() wrapper or, if predict() does not like a raw dfm input, coerce it to a plain sparse matrix (from the Matrix package) using as(data_NADfm, "dgCMatrix"). This should be fine, since both cv.glmnet() and its predict() method can handle sparse matrix inputs.
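A minimal sketch of the sparse prediction step, assuming predict() rejects the raw dfm (the object names are from the question's code):
library(Matrix)  # provides the dgCMatrix sparse matrix class
# Coerce the dfm to a plain sparse matrix instead of a dense one
data_NA_sparse <- as(data_NADfm, "dgCMatrix")
dataPreds_NA <- predict(lasso, data_NA_sparse, type = "class", s = "lambda.1se")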

Why do I get an error when I run ggttest?

When I run the t-test for a numeric and a dichotomous variable there is no problem and I can see the results. The problem comes when I run ggttest on the same t-test: there is an error saying that one of my variables is not found. I do not know why that happens. The aml dataset I used is from the package boot. Below you can see the code:
(screenshot of the error: https://i.stack.imgur.com/7kuaA.png)
library(gginference)
library(boot)  # the aml dataset comes from the boot package
time_group.test16537 = t.test(formula = time ~ group,
                              data = aml,
                              alternative = "two.sided",
                              paired = FALSE,
                              var.equal = FALSE,
                              conf.level = 0.95)
time_group.test16537
ggttest(time_group.test16537,
        colaccept = "lightsteelblue1",
        colreject = "gray84",
        colstat = "navyblue")
The problem comes with these lines of code in ggttest:
datnames <- strsplit(t$data.name, splitter)
len1 <- length(eval(parse(text = datnames[[1]][1])))
len2 <- length(eval(parse(text = datnames[[1]][2])))
It tries to find the lengths of group and time by evaluating their names in the calling environment, but it doesn't see that they came from a data frame. Pretty bad bug...
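You can reproduce what ggttest effectively does by hand; this sketch assumes boot's aml data and the " by " splitter used for two-sample tests:
library(boot)
tt <- t.test(time ~ group, data = aml)
tt$data.name                          # "time by group"
datnames <- strsplit(tt$data.name, " by ")
eval(parse(text = datnames[[1]][2]))  # Error: object 'group' not found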
For your situation, you presumably have fewer than 30 observations in each group, so it plots a t-distribution; you can call the internal function directly:
library(gginference)
library(boot)
gginference:::normt(t.test(time ~ group, data = aml),
                    colaccept = "lightsteelblue1", colreject = "grey84",
                    colstat = "navyblue")
t.test() doesn't store your data in its output, so there is no way to extract the data from the htest object it returns.
The only way to use a formula here is to reference the columns directly:
library(gginference)
t_test <- t.test(questionnaire$pulse ~ questionnaire$gender)
ggttest(t_test)
Original answer here: How to extract the dataset from an "htest" object when using formula in r

MXNet odd error

This is my first ANN, so I imagine there might be a lot of things done wrong here.
I'm trying to predict the species of flowers from the iris data set provided with R, but I get the following error:
Error in `dimnames<-.data.frame`(`*tmp*`, value = list(n)) :
invalid 'dimnames' given for data frame
My code:
require(mxnet)
train <- iris[1:130, ]
test <- iris[131:150, ]
train.data <- as.data.frame(train[-5])
train.label <- data.frame(model.matrix(data = train, object = ~Species - 1))
test.data <- as.data.frame(test[-5])
test.label <- data.frame(model.matrix(data = test, object = ~Species - 1))
var1 <- mx.symbol.Variable("data")
layer0 <- mx.symbol.FullyConnected(var1, num.hidden = 3)
cat.out <- mx.symbol.SoftmaxOutput(layer0)
net.model <- mx.model.FeedForward.create(cat.out,
                                         array.layout = "auto",
                                         X = train.data,
                                         y = train.label,
                                         eval.data = list(data = test.data, label = test.label),
                                         num.round = 20,
                                         array.batch.size = 20,
                                         learning.rate = 0.1,
                                         momentum = 0.9,
                                         eval.metric = mx.metric.accuracy)
UPDATE:
I managed to get rid of this error by specifying which column to use as labels (training.label[,1] and test.label[,1]).
However, now my net is being trained to predict just one of my binary variables, while I have 3 (one for each species).
I had the same problem; it turned out that:
train.data should be a matrix
train.label should be a numeric vector
Check these two and hopefully it should work.
I had a similar problem, but during the prediction step. It turned out that my features were in a data frame, which was causing the issue. Once I converted the data frame into a matrix, the issue went away.
pred.values = stats::predict(model,as.matrix(features))
instead of
pred.values = stats::predict(model,features)
So, the features need to be a matrix both during training and during the process of making predictions.
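Putting the two answers together, a minimal sketch of preparing the iris inputs in the shapes suggested above (matrix features, plain numeric label vector; the zero-based class encoding is an assumption commonly used with softmax outputs):
# Features as a numeric matrix, labels as a numeric vector (one integer per class)
train.x <- data.matrix(iris[1:130, -5])
train.y <- as.numeric(iris$Species[1:130]) - 1   # 0, 1, 2
test.x  <- data.matrix(iris[131:150, -5])
test.y  <- as.numeric(iris$Species[131:150]) - 1
These can then be passed as X = train.x and y = train.y (and in eval.data) to mx.model.FeedForward.create.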

randomForest Predict error from test set

I am running into an error with the randomForest R package: after I split the data into training and testing sets using caret, the predict step fails with:
Error in predict.randomForest(randomForestFit, type = "response", newdata = testing$GEN) :
  number of variables in newdata does not match that in the training data
I split the training and test sets from the exact same file, and there are no NA or missing values in any of the data. Below is my full code, though I do not think the error is there. I am at a loss as to why this error is occurring. Any ideas would be greatly appreciated!
library(caret)
library(foreign)
library(randomForest)
library(foreach)
library(doParallel)  # provides makeCluster()/registerDoParallel() so %dopar% can run
set.seed(825)
data <- read.spss("C:/MODEL_SAMPLE.sav", use.value.labels = TRUE, to.data.frame = TRUE)
inTraining <- createDataPartition(data$GEN, p = 0.75, list = FALSE)
training <- data[inTraining, ]
testing <- data[-inTraining, ]
cl <- makeCluster(8)  # one worker per foreach task below
registerDoParallel(cl)
start.time <- Sys.time()
randomForestFit <- foreach(ntree = rep(63, 8), .combine = combine,
                           .packages = 'randomForest') %dopar%
  randomForest(training[-201],
               training$GEN,
               mtry = 40,
               ntree = ntree,
               verbose = TRUE,
               importance = TRUE,
               keep.forest = TRUE,
               do.trace = TRUE)
randomForestFit
predict = predict(randomForestFit, type="response", newdata=testing$GEN)
stopCluster(cl)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Without the data, it's hard for anyone to say exactly what the problem is.
Three suggestions:
First, check the SPSS file for stray characters in the data.
Second, check that the options passed to read.spss are set correctly, especially reencode = NA and use.missings = to.data.frame (their defaults); the latter controls whether values declared missing in SPSS are turned into NA.
Third, use str(df), plus table() with useNA = "ifany" on individual columns, and make sure your factor variables, including the response, are actually factors. Apply as.numeric(as.character()) to the numeric columns of the data frame; this will generate NA values if there are entries like VALUE! or #NA in the data.
You could also export to CSV from SPSS and do the above again.
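A short sketch of those checks (the some_numeric column is hypothetical; GEN comes from the question):
str(data)                                   # confirm each column's type
sapply(data, function(x) sum(is.na(x)))     # count NAs per column
table(data$GEN, useNA = "ifany")            # inspect the response's levels
# Coerce a supposedly numeric column; junk codes like VALUE! or #NA become NA
data$some_numeric <- as.numeric(as.character(data$some_numeric))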
The key is the following:
number of variables in newdata does not match that in the training data
I therefore guess that the training and test data are different, in particular the column names. Maybe it breaks at this line?
inTraining <- createDataPartition(data$GEN, p = 0.75, list = FALSE)
To better understand the problem, you might have to post 3 rows of the training and test data set (with column names!).
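One more check worth running (an observation, not from the original answers): newdata = testing$GEN passes a single column rather than the whole test frame, which by itself would produce this mismatch.
# Columns present in training but missing from testing (should be empty)
setdiff(names(training), names(testing))
# Pass the whole test data frame, not one column, to predict()
predict(randomForestFit, type = "response", newdata = testing)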
I hope this helps!
