knn train and class have different lengths - r

I am using a kNN model to predict new data, but the error says:
"'train' and 'class' have different lengths"
Can someone replicate and solve this error?
weather <- c(1, 1, 1, 0, 0, 0)
temperature <- c(1, 0, 0, 1, 0, 0)
golf <- c(1, 0, 1, 0, 1, 0)
df <- data.frame(weather, temperature, golf)
df_new <- data.frame(weather = c(1,1,1,1,1,1,1,1,1), temp = c(0,0,0,0,0,0,0,0,0), sunnday= c(1,1,1,0,1,1,1,0,0))
pred_knn <- knn(train=df[, c(1,2)], test=df_new, cl=df$golf, k=1)
Thank you very much!

The knn function in R requires the training data to contain only the independent variables, because the dependent variable is passed separately through the cl argument. Changing the line as below will fix this particular error.
pred_knn <- knn(train=df[,c(1,2)], test=df_new, cl=df$golf, k=1)
However, note that running the above line will throw another error. knn requires test to have the same number of columns as train, and here df_new has three columns while the training data has two. Also, since knn calculates the Euclidean distance between observations, it requires all independent variables to be numeric. These pages have helpful related information. I would recommend using a different classifier for this particular data set.
https://towardsdatascience.com/k-nearest-neighbors-algorithm-with-examples-in-r-simply-explained-knn-1f2c88da405c
https://discuss.analyticsvidhya.com/t/how-to-resolve-error-na-nan-inf-in-foreign-function-call-arg-6-in-knn/7280/4
Hope this helps.
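For reference, here is a minimal sketch of a call where the shapes line up. It assumes that the temp column of df_new corresponds to temperature in df and that sunnday is not a predictor; both are guesses about the asker's intent. knn matches columns by position rather than by name, and with k = 1 ties between equally close neighbours are broken at random, so the predicted labels may differ between runs.

```r
library(class)  # recommended package that ships with standard R installs

weather <- c(1, 1, 1, 0, 0, 0)
temperature <- c(1, 0, 0, 1, 0, 0)
golf <- c(1, 0, 1, 0, 1, 0)
df <- data.frame(weather, temperature, golf)
df_new <- data.frame(weather = c(1, 1, 1, 1, 1, 1, 1, 1, 1),
                     temp    = c(0, 0, 0, 0, 0, 0, 0, 0, 0),
                     sunnday = c(1, 1, 1, 0, 1, 1, 1, 0, 0))

# Predictors only, in the same column order for train and test.
train_x <- df[, c("weather", "temperature")]
test_x  <- df_new[, c("weather", "temp")]   # assumption: temp plays the role of temperature

# The two checks that knn() performs internally:
stopifnot(nrow(train_x) == length(df$golf))  # else "'train' and 'class' have different lengths"
stopifnot(ncol(train_x) == ncol(test_x))     # else "dims of 'test' and 'train' differ"

pred_knn <- knn(train = train_x, test = test_x, cl = factor(df$golf), k = 1)
length(pred_knn)   # one prediction per row of test_x
```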

I had a similar issue with data from the ISLR Weekly dataframe:
knn.pred = knn(train$Lag2,test$Lag2,train$Direction,k=1)
Error in knn(train$Lag2, test$Lag2, train$Direction, k = 1) :
dims of 'test' and 'train' differ
where
train = subset(Weekly, Year >= 1990 & Year <= 2008)
test = subset(Weekly, Year < 1990 | Year > 2008)
I finally solved it by wrapping both the test and the train data in as.matrix(), like this:
knn.pred = knn(as.matrix(train$Lag2),as.matrix(test$Lag2),train$Direction,k=1)
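A likely explanation, based on my reading of class::knn: a bare vector passed as test has no dimensions, so it is treated as a single observation with length(test) columns, while train is converted to a one-column matrix. Wrapping both in as.matrix() makes each a one-column matrix. The sketch below reproduces this with made-up toy values (train_x, train_y and test_x are hypothetical, not the Weekly data).

```r
library(class)

# Hypothetical one-predictor data standing in for Lag2 / Direction.
train_x <- c(0.1, 0.2, 0.9, 1.0)
train_y <- factor(c("Down", "Down", "Up", "Up"))
test_x  <- c(0.15, 0.95, 0.05)

# Passing bare vectors: capture the error message instead of stopping.
res <- tryCatch(knn(train_x, test_x, train_y, k = 1),
                error = function(e) conditionMessage(e))
print(res)

# One-column matrices have matching dimensions, so the call succeeds.
pred <- knn(as.matrix(train_x), as.matrix(test_x), train_y, k = 1)
print(pred)
```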

Related

Building General Linear Mixed Model in R

I am trying to build a glmm in R but constantly get error messages (I am a complete beginner).
I conducted an experiment with camera traps in which I tested whether they react to a target that I pulled in front of them, so my response variable is binomial.
I am trying to build a GLMM in which all the variables are fixed factors and day (on which the experiment was conducted) is a random factor. Could anyone more experienced tell me what I am doing wrong? (For now I am trying with only one explanatory variable.)
I tried it with the glmm() and with lmer():
library(glmm)
set.seed(1234)
ptm <- proc.time()
Detections <- glmm(Detection ~ 0 + Camera, random = list(~ 0 + Day),
                   varcomps.names = c("Day"), data = data1, family.glmm = bernoulli.glmm,
                   m = 10^4, debug = TRUE)
This one produces an obscenely large glmm, even with the minimal dataset.
and with
library(lme4)
Detections_glmm <- lmer(Detection ~ Camera + (1|Day), family="binomial")
This one gives the following error message:
Error in lmer(Detection ~ Camera + (1 | Day), family = "binomial") :
unused argument (family = "binomial")
Here is a minimal df:
data1 <- data.frame(
  Detection = c(1, 0, 0, 1, 1, 1, 1, 0, 0, 0),
  Temperature = as.factor(c("10","10","10","10","10","20","20",
                            "0","0","0")),
  Distance = as.factor(c("75","75","75","225","225","225",
                         "75","150","150","150")),
  Size = as.factor(c("0","0","0","0","1","1","1","1",
                     "2","2")),
  Light = as.factor(c("1","1","1","1","1","0","0","0",
                      "0","0")),
  Camera = as.factor(c("1","1","2","2","2","3","3","3",
                       "1","1")),
  Day = as.factor(c("1","1","1","2","2","2","3","3",
                    "3","2"))
)
And information about the variables:
Response variable:
Detection (binomial)
Explanatory variables:
Temperature of bottle: (0, 10, 20)
Distance from Camera (75, 150, 225)
Light(0/1)
Size of bottle (0, 1, 3)

'Error: `data` and `reference` should be factors with the same levels' doesn't return confusion matrix

I have a csv file containing estimated probabilities and actual results. I want to create a confusion matrix using a threshold of 0.5 for the estimated probabilities, but I keep getting the error message 'Error: data and reference should be factors with the same levels.' What's wrong? See the code below.
I have tried to write the code
# TURN PROBS INTO CLASSES AND DISPLAY FREQUENCIES
p_class = ifelse(probs_truth$estimated > 0.5, 1, 0)
table(p_class)
# CALCULATING CONFUSION MATRIX
predicted = p_class
actual = probs_truth$truth
library(caret)
result = confusionMatrix(data=predicted, reference=actual)
print(result)
I expected a confusion matrix table to be returned
The following code works for me; I hope it helps:
I made a small dataset which I guess resembles your data.
library(data.table)
probs_truth <- data.table(estimated = c(0.5, 0.3, 0.7, 0.8, 0.1), actual = c(1, 0, 0, 1, 0))
Added a column to your dataset with the estimated values according to your ifelse statement ('estimated2').
probs_truth$estimated2 = ifelse (probs_truth$estimated > 0.5, 1, 0)
Made sure 'estimated2' and 'actual' are factors.
probs_truth$estimated2 <- as.factor(probs_truth$estimated2)
probs_truth$actual <- as.factor(probs_truth$actual)
head(probs_truth)
library(caret)
result = confusionMatrix (data=probs_truth$estimated2, reference=probs_truth$actual)
print(result)
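One caveat with as.factor(): if every prediction falls on the same side of the threshold, the predicted factor ends up with a single level and the "same levels" error can come back. A minimal sketch of a more defensive variant, assuming the two classes are coded 0 and 1, is to set the levels explicitly. It uses base R's table() for the cross-tabulation so it runs without caret; the resulting factors could be passed to confusionMatrix() in the same way as above.

```r
# Same toy data as in the answer above, as a plain data frame.
probs_truth <- data.frame(estimated = c(0.5, 0.3, 0.7, 0.8, 0.1),
                          actual    = c(1, 0, 0, 1, 0))

# Explicit levels keep both factors aligned even if a class is never predicted.
class_levels <- c(0, 1)   # assumption: classes are coded 0/1
predicted <- factor(ifelse(probs_truth$estimated > 0.5, 1, 0), levels = class_levels)
actual    <- factor(probs_truth$actual, levels = class_levels)
stopifnot(identical(levels(predicted), levels(actual)))

# Plain cross-tabulation of predictions against the truth.
conf_tab <- table(Predicted = predicted, Actual = actual)
print(conf_tab)
```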

Confused by ROC curves and cutoffs in R

I have some data on study participants, the percent change in a biomarker, and their ultimate outcome. I'd like to use an ROC curve to find the best cutoff value of the biomarker for predicting the outcome using the Youden method, but I get different answers from different packages and need to know where I'm going wrong.
To set up the dataset:
ID <- c(1:17)
PercentChange <- c(-85.5927051671732, -85.4849965108165,
-63.302752293578, -33.5509138381201, -75, -87.0988867059594,
-93.2523616734143, 65.2037617554859, -19.226393629124,
-44.7095435684647, -65.7342657342657, -43.7831467295227,
-37.0022539444027, 518.75, -77.1014492753623, 20.6572769953052,
-72.0742534301856)
Outcome <- c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1)
df <- data.frame(ID, PercentChange, Outcome)
For outcome, 1 is favorable and 0 is unfavorable.
Now with package pROC I did:
library(pROC)
roc <- roc(df$Outcome, df$PercentChange, auc= TRUE, plot = TRUE)
coords(roc, "b", ret="t", best.method="youden")
plot(roc, print.thres="best", print.thres.best.method="youden",main = "Percent change")
This gives me a reasonable curve and a cutoff (by the Youden index) of -44.246, which I verified has the correct sensitivity and specificity listed. The cutpoint seems a bit weird since it's halfway between two of the actual values rather than an actual value, but it works.
Then using OptimalCutpoints I tried
library(OptimalCutpoints)
optimal.cutpoint.Youden <- optimal.cutpoints(X = "PercentChange", status = "Outcome", tag.healthy = 1, methods = "Youden", data = df)
summary(optimal.cutpoint.Youden)
plot(optimal.cutpoint.Youden)
This gives me a different curve, and a different cutpoint of -43.783, which is one of the two points that pROC took the midpoint of. The sensitivity and specificity are also flipped from what I calculated using that cutpoint.
Lastly, I tried the ROC function from the Epi package:
library(Epi)
ROC(form=Outcome~PercentChange, data=df, plot="ROC", PV=TRUE, MX=TRUE)
This gave me a third, completely different curve and says "24" at the cutpoint, which doesn't make any sense. Can anyone help me figure this out? I'm not asking on the stats Stack Exchange because they'd want to get into whether Youden is appropriate or not, rather than the technical application of these functions.
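One way to sanity-check any package's answer is to compute the Youden index by hand. The sketch below uses only base R and rests on two assumptions that the packages' output does not confirm here: that lower PercentChange values indicate Outcome == 1, and that candidate thresholds are the midpoints between consecutive observed values (which would explain a cutoff that falls between two data points).

```r
PercentChange <- c(-85.5927051671732, -85.4849965108165,
                   -63.302752293578, -33.5509138381201, -75, -87.0988867059594,
                   -93.2523616734143, 65.2037617554859, -19.226393629124,
                   -44.7095435684647, -65.7342657342657, -43.7831467295227,
                   -37.0022539444027, 518.75, -77.1014492753623, 20.6572769953052,
                   -72.0742534301856)
Outcome <- c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1)

# Candidate thresholds: midpoints between consecutive sorted values (assumption).
vals <- sort(unique(PercentChange))
thresholds <- (head(vals, -1) + tail(vals, -1)) / 2

# Assumed direction: a value below the threshold predicts Outcome == 1.
sens <- sapply(thresholds, function(t) mean(PercentChange[Outcome == 1] < t))
spec <- sapply(thresholds, function(t) mean(PercentChange[Outcome == 0] > t))
youden <- sens + spec - 1

best <- which.max(youden)
best_cutoff <- thresholds[best]
print(c(cutoff = best_cutoff, sensitivity = sens[best], specificity = spec[best]))
```

Under these assumptions the best threshold is the midpoint between -44.71 and -43.78. Swapping which outcome counts as "healthy" or "positive" swaps sensitivity and specificity, which may be worth checking against the tag.healthy setting used with OptimalCutpoints.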

SEM-Path analysis in R

I ran a path analysis in R, and the following matrix represents the effects between variables.
M <- c(0, 0, 0, 0, 0)
p <- c(0, 0, 0, 0, 0)
O <- c(0, 0, 0, 0, 0)
T <- c(1, 0, 1, 1, 0)
Sales <- c(1, 1, 1, 1, 0)
sales_path <- rbind(M, p, O, T, Sales)
colnames(sales_path) <- rownames(sales_path)
#innerplot(sales_pls)
sales_blocks <- list(
c("m1", "m2"),
#c("pr"),
c("R1"),
#c("C1"),
c("tt1"),
c("Sales")
)
sales_modes = rep("A", 5)
sales_pls <- plspm(input_file, sales_path, sales_blocks, scheme = "centroid", scaled = FALSE, modes = sales_modes)
I have 2 questions:
Can I use the weights I receive to calculate the value of a latent variable? For example, my M variable has manifest variables; is there a formula to calculate its value?
The main purpose of running the path analysis is to predict sales. Is that possible using the estimates (betas) for each latent variable?
"I want to know if I am able to calculate the value of the latent variable"
Yes. It's actually very easy. Your latent variable scores can be obtained by running sales_pls$scores. You can also run summary(sales_pls). For easier interpretation you may wish to have the latent variables expressed in the same scale as the indicators (manifest variables). This can be accomplished by normalizing the outer weights in each block of indicators so that the weights are expressed as proportions. However, in order to apply this normalization all the outer weights must be positive.
You can apply the function rescale() to the object sales_pls and get scores expressed in the original scale of the manifest variables. Once you get the rescaled scores you can use summary() to verify that the obtained results make sense (i.e. scores expressed in original scale of indicators):
# rescaling scores
rescaled_sales_pls = rescale(sales_pls)
# summary
summary(rescaled_sales_pls)
(again you can also run rescaled_sales_pls)
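To illustrate the normalization idea described above, here is a small base R sketch for a single block. The indicator values in X and the outer weights in w are made up, and the code shows the general idea of turning weights into proportions; it is not taken from plspm's implementation.

```r
# Hypothetical block with two manifest variables (columns) and three observations.
X <- cbind(m1 = c(1, 3, 5),
           m2 = c(2, 4, 4))

# Hypothetical outer weights for this block.
w <- c(m1 = 0.6, m2 = 0.2)

# The normalization only makes sense when every weight is positive.
stopifnot(all(w > 0))

# Express weights as proportions that sum to 1 within the block.
w_norm <- w / sum(w)

# Each score is then a weighted average of the indicators, so it stays on their scale.
scores <- as.vector(X %*% w_norm)
print(w_norm)
print(scores)
```

Because the normalized weights sum to 1, each score lies between the smallest and largest indicator value for that observation.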
"If I can predict Sales using the betas from the output"
Theoretically I guess you could, but I'm not really sure why you would want to. The utility of path analysis here is to decompose the sources of a correlation between an independent variable and a dependent variable of a multiple regression model.

How to forecast using ragged edge data in a MIDAS model using the MIDASR package?

I am trying to generate a 1-step-ahead forecast of a quarterly variable using a monthly variable with the midasr package. The trouble I am having is that I can only estimate a MIDAS model when the number of monthly observations in the sample is exactly three times the number of quarterly observations.
How can I forecast in the midasr package when the number of monthly observations is not an exact multiple of the quarterly observations (e.g. when I have a new monthly data point that I want to use to update the forecast)?
As an example, suppose I run the following code to generate a 1-step-ahead forecast when I have (n) quarterly observations and (3*n) monthly observations:
#first I create the quarterly and monthly variables
n <- 20
qrt <- rnorm(n)
mth <- rnorm(3*n)
#I convert the data to time series format
qrt <- ts(qrt, start = c(2009, 1), frequency = 4)
mth <- ts(mth, start = c(2009, 1), frequency = 12)
#now I estimate the midas model and generate a 1-step ahead forecast
library(midasr)
reg <- midas_r(qrt ~ mls(qrt, 1, 1) + mls(mth, 3:6, m = 3, nealmon), start = list(mth = c(1, 1, -1)))
forecast(reg, newdata = list(qrt = c(NA), mth =c(NA, NA, NA)))
This code works fine. Now suppose I have a new monthly data point that I want to include, so that the new monthly data is:
nmth <- rnorm(3*n +1)
I tried running the following code to estimate the new model:
reg <- midas_r(qrt ~ mls(qrt, 1, 1) + mls(nmth, 2:7, m = 3, nealmon), start = list(mth = c(1, 1, -1))) #I now use 2 lags instead 3 with the new monthly data
However I get an error message saying: 'Error in mls(nmth, 2:7, m = 3, nealmon) : Incomplete high frequency data'
I could not find anything online on how to deal with this problem.
A while ago I had to deal with a similar question. If I remember correctly, you first need to estimate the model using the old dataset with a reduced lag, so instead of using lags 3:6 you should use lags 2:6:
reg <- midas_r(qrt ~ mls(qrt, 1, 1) + mls(mth, 2:6, m = 3, nealmon), start = list(mth = c(1, 1, -1)))
Then suppose you observe a new value of the higher frequency data, new_value:
new_value <- rnorm(1)
Then you can use this newly observed value for forecasting the lower frequency variable as follows:
forecast(reg, newdata = list(mth = c(new_value, rep(NA, 2))))
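The error message in the question suggests that the high frequency series must contain a whole number of low frequency periods. The base R sketch below, which does not call midasr at all, shows why 3*n + 1 monthly values are "incomplete" and how padding the new quarter with NA, as in the newdata above, makes it complete. The names m, leftover and new_quarter are mine, and the length check is my assumption about what triggers the error.

```r
m <- 3                      # monthly observations per quarter
n <- 20                     # number of quarterly observations
set.seed(1)
nmth <- rnorm(m * n + 1)    # 61 monthly values: one month into a new quarter

# A non-zero remainder means the last quarter is only partly observed.
leftover <- length(nmth) %% m
print(leftover)

# Take the newly observed month and pad the rest of its quarter with NA.
new_value <- nmth[length(nmth)]
new_quarter <- c(new_value, rep(NA, m - leftover))
print(new_quarter)
print(length(new_quarter) %% m == 0)
```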

Resources