Prediction for new observation in knn (R)

I am trying to make an application which would predict prices based on user input. How can I predict the response for new values?
I have tried to do the following:
1. Add a new observation to the dataset
2. Train knn on all of the observations but the new one
3. Test knn on the new observation
But the prediction changes when I put different values of the response variable into the new observation, so it doesn't seem to work.
Let's say the data has 100 observations of 7 variables.
This would be the code I have tried.
data <- rbind(data, c(1,2,3,4,5,6,7))
prediction <- knn.reg(data[1:100,], test = dataset[101,],
data[1:100,]$response_variable, k = 8, algorithm="kd_tree")
prediction$pred
Thank you in advance for your help.

For one thing, you have not defined dataset. I am guessing your code is meant to read:
dataset <- rbind(data, c(1,2,3,4,5,6,7))
prediction <- knn.reg(dataset[1:100,], test = dataset[101,],
y = dataset[1:100,]$response_variable, k = 8, algorithm="kd_tree")
prediction$pred
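In any case, the response variable should not be included as a column in the training and test sets passed to knn.reg (I found this out by playing around with the function). If it is included, it is treated as one more predictor when distances are computed, which is why your prediction changed with the response value you entered. So, if your response variable is the 7th column of data, you could do this instead: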
dataset <- rbind(data, c(1,2,3,4,5,6,7))
prediction <- knn.reg(dataset[1:100,-7], test = dataset[101,-7],
y = dataset[1:100,]$response_variable, k = 8, algorithm="kd_tree")
prediction$pred
For example, here is a test case with some made-up data.
library(FNN) # provides knn.reg
set.seed(1)
data <- data.frame(matrix(sample(1:7, 700, replace=TRUE), nrow=100))
colnames(data)[7] <- "response_variable"
dataset <- rbind(data, c(1,2,3,4,5,6,7))
# drop column 7 (the response) from both the training and the test predictors
prediction <- knn.reg(dataset[1:100,-7], test = dataset[101,-7],
y = dataset[1:100,]$response_variable, k = 8, algorithm="kd_tree")
prediction$pred
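To see why the response column has to stay out of the predictor matrices, here is a minimal base-R sketch of what k-nearest-neighbour regression does. It is not FNN's implementation (no kd-tree, ties are ignored), and the helper name knn_predict is made up for illustration. Distances are computed from the predictors only, and the prediction is the mean response of the k closest training rows.

```r
# Hypothetical helper: brute-force kNN regression for a single new observation.
knn_predict <- function(train_x, train_y, new_x, k = 8) {
  train_x <- as.matrix(train_x)
  new_x <- as.numeric(new_x)
  # Euclidean distance from the new point to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # average the response of the k nearest training rows
  mean(train_y[order(d)[1:k]])
}

set.seed(1)
data <- data.frame(matrix(sample(1:7, 700, replace = TRUE), nrow = 100))
colnames(data)[7] <- "response_variable"

# The new observation carries predictors only; no response value is needed.
new_obs <- c(1, 2, 3, 4, 5, 6)
pred <- knn_predict(data[, -7], data$response_variable, new_obs, k = 8)
pred
```

Because the new observation never needs a response value, there is nothing to "put in" for it, and the prediction depends only on its six predictor values.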

Related

How to apply MICE imputations on test set?

I have two separate data sets: one for train (1000000 observations) and the other one for test (1000000 observations). I divided the train set into 3 sets (mytrain: 700000 observations, myvalid: 150000 observations, mytest: 150000 observations). The test set with 1000000 observations doesn't include the target variable, so it should be used for the final test. Since there are some missing values for categorical variables, I need to use mice to impute them. I should reuse the imputation done on the mytrain set to fill the missing values in the myvalid, mytest and test sets. Based on the answer to this question, I should do this:
data2 <- rbind(mytrain,myval,mytest,test)
data2$ST_EMPL <- as.factor(data2$ST_EMPL)
data2$TYP_RES <- as.factor(data2$TYP_RES)
imp <- mice(data2, method = "cart", m = 1, maxit = 1, seed = 123,
ignore = c(rep(FALSE, 700000),rep(TRUE, 1300000)))
data2.imp <- complete(imp,1)
summary(imp)
mytrainN <- data2.imp[1:700000,]
myvalN <- data2.imp[700001:850000,]
mytestN <- data2.imp[850001:1000000,]
testN <- data2.imp[1000001:2000000,]
However, since the test set does not have the target column, it is not possible to merge it with mytrain, mytest, and myvalid. Is it possible to add a hypothetical target column (with the value of say 10 for all 1000000 observations) to the test set?

How to use lapply with get.confusion_matrix() in R?

I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda<-plsda(X,Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[,-1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[,10] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validation[,1], predicted = prediction)
get.BER(confusion.mat)
I can do this separately 10 times, but I want to do it a little faster. Therefore I was thinking of making a list with the results of prediction for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = T, name.vector = T, factors.as.char = T)
...and then using lapply with the get.confusion_matrix and get.BER function. But then I don't know how to do that. I have searched on the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without a reproducible example there is no way to test this, but you need to convert the code you want to run each time into a function. Something like this:
confmat <- function(x) {
prediction <- myPredictions$class$max.dist[,x] #prediction based on x components
confusion.mat = get.confusion_matrix(truth = data_validation[,1], predicted = prediction)
get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of PCs, so results[[1]] will be the result for 10 PCs. You will not get values for prediction or confusion.mat unless they are included in the results returned by get.BER. If you want all of that, you need to replace the last line of the function with return(list(prediction, confusion.mat, get.BER(confusion.mat))). This will produce a list of lists, so that results[[1]][[1]] will be the prediction for 10 PCs, and results[[1]][[2]] and results[[1]][[3]] will be confusion.mat and get.BER(confusion.mat) respectively.
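As a self-contained illustration of the same pattern, the sketch below uses made-up predictions and a hand-written balanced error rate in place of mixOmics' get.confusion_matrix() and get.BER(), so it runs without the package. The names truth, pred_mat and ber are assumptions for this example only, and sapply() is used instead of lapply() to get a named numeric vector back.

```r
set.seed(42)
# Made-up ground truth and a matrix of predicted classes, one column per
# number of components (standing in for myPredictions$class$max.dist).
truth <- sample(c("absent", "present"), 50, replace = TRUE)
pred_mat <- sapply(1:10, function(j) {
  flip <- runif(50) < 0.3  # randomly re-draw about 30% of the labels
  ifelse(flip, sample(c("absent", "present"), 50, replace = TRUE), truth)
})
colnames(pred_mat) <- paste0("comp", 1:10)

# Hand-written balanced error rate: mean of the per-class error rates.
ber <- function(truth, predicted) {
  mean(tapply(predicted != truth, truth, mean))
}

# One BER per number of components, from 10 down to 2.
results <- sapply(10:2, function(x) ber(truth, pred_mat[, x]))
names(results) <- colnames(pred_mat)[10:2]
results
```

With the real model, the body of the anonymous function would call get.confusion_matrix() and get.BER() on the chosen column instead of ber().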

How to loop an arima model to run over different columns of a time series data?

I have a data set which has district-wise values of covid-19 cases. I want to run an ARIMA model on each of these districts and create a similar dataset of predicted values.
library("forecast")
df <- read.csv("D:/Hackathon/Time series/Maharashtra.csv")
z = ncol(df)
for(i in z){
x = ts(c[,i],frequency = 365, start = c(2020,1,30))
plot.ts(x)
pi = auto.arima(x)
summary(pi)
q = forecast(pi,h=30)
plot.forecast(q)
write.csv(q,"D:/Hackathon/pred.csv")
}
I know for a fact this is not correct. This is how the data is
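One possible way to structure such a loop is sketched below on simulated data. It assumes every column of the data frame is one district's daily series, and it swaps in base R's arima() with a fixed order of (1,1,0) so the example runs without extra packages. With the forecast package, auto.arima() and forecast() could take the place of arima() and predict(). The district names and values are invented.

```r
set.seed(123)
# Simulated daily series for three hypothetical districts (random walks).
df <- data.frame(
  DistrictA = 100 + cumsum(rnorm(120)),
  DistrictB = 250 + cumsum(rnorm(120)),
  DistrictC = 40 + cumsum(rnorm(120))
)

h <- 30  # forecast horizon in days
# Iterate over every column and collect the point forecasts
# in a list keyed by district name.
forecasts <- lapply(df, function(col) {
  # daily data starting on day 30 of 2020 (30 January)
  x <- ts(col, frequency = 365, start = c(2020, 30))
  fit <- arima(x, order = c(1, 1, 0), method = "ML")
  as.numeric(predict(fit, n.ahead = h)$pred)
})

# One column of predictions per district, written out once after the loop.
pred <- as.data.frame(forecasts)
# write.csv(pred, "pred.csv", row.names = FALSE)
dim(pred)
```

Compared with the code above, the loop visits all columns rather than only the index ncol(df), reads from df instead of c, and writes a single file at the end instead of overwriting it on each iteration.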

How to create a formulated table in R?

This is my reproducible example :
#http://gekkoquant.com/2012/05/26/neural-networks-with-r-simple-example/
library("neuralnet")
require(ggplot2)
traininginput <- as.data.frame(runif(50, min=0, max=100))
trainingoutput <- sqrt(traininginput)
trainingdata <- cbind(traininginput,trainingoutput)
colnames(trainingdata) <- c("Input","Output")
Hidden_Layer_1 <- 1 # value is randomly assigned
Hidden_Layer_2 <- 1 # value is randomly assigned
Threshold_Level <- 0.1 # value is randomly assigned
net.sqrt <- neuralnet(Output~Input,trainingdata, hidden=c(Hidden_Layer_1, Hidden_Layer_2), threshold = Threshold_Level)
#Test the neural network on some test data
testdata <- as.data.frame((1:13)^2) #Generate some squared numbers
net.results <- predict(net.sqrt, testdata) #Run them through the neural network
cleanoutput <- cbind(testdata,sqrt(testdata),
as.data.frame(net.results))
colnames(cleanoutput) <- c("Input","ExpectedOutput","NeuralNetOutput")
ggplot(data = cleanoutput, aes(x= ExpectedOutput, y= NeuralNetOutput)) + geom_point() +
geom_abline(intercept = 0, slope = 1
, color="brown", size=0.5)
rmse <- sqrt(sum((sqrt(testdata)- net.results)^2)/length(net.results))
print(rmse)
Here, when my Hidden_Layer_1 is 1, Hidden_Layer_2 is 2, and the Threshold_Level is 0.1, the rmse generated is 0.6717354.
As another example,
when my Hidden_Layer_1 is 2, Hidden_Layer_2 is 3, and the Threshold_Level is 0.2, the rmse generated is 0.8355925.
How can I create a table that will automatically calculate the value of rmse when the user assigns values to Hidden_Layer_1, Hidden_Layer_2, and Threshold_Level? (I know how to do it in Excel but not in R.)
The desired table should look like this:
I would like to have Trial(s), Hidden_Layer_1, Hidden_Layer_2, Threshold_Level, and rmse as columns, and the number of rows should be able to grow without limit via something like an actionButton (if possible), meaning the user can keep trying until they get the rmse they want.
How can I do that? Can anyone help me? I am quite new to R and will definitely learn from this.
Thank you very much to anyone willing to lend a helping hand.
Here is a way to create the table of values that can be displayed with the data frame viewer.
# initialize an object where we can store the parameters as a data frame
data <- NULL
# function to receive a row of parameters and add them to the
# df argument
addModelElements <- function(df,trial,layer1,layer2,threshold,rmse){
newRow <- data.frame(trial = trial,
Hidden_Layer_1 = layer1,
Hidden_Layer_2 = layer2,
Threshold = threshold,
RMSE = rmse)
rbind(df,newRow)
}
# once a model has been run, call addModelElements() with the
# model parameters
data <- addModelElements(data,1,1,2,0.1,0.671735)
data <- addModelElements(data,2,2,3,0.2,0.835593)
...and the output:
View(data)
Note that if you're going to create scores or hundreds of rows of parameters & RMSE results before displaying any of them to the end user, the code should be altered to improve the efficiency of rbind(). In this scenario, we build a list of sets of parameters, convert them into data frames, and use do.call() to execute rbind() only once.
# version that improves efficiency of rbind()
addModelElements <- function(trial,layer1,layer2,threshold,rmse){
# return row as data frame
data.frame(trial = trial,
Hidden_Layer_1 = layer1,
Hidden_Layer_2 = layer2,
Threshold = threshold,
RMSE = rmse)
}
# generate list of data frames and rbind() once
inputParms <- list(c(1,1,2,0.1,0.671735),
c(1,1,2,0.3,0.681935),
c(2,2,3,0.2,0.835593))
parmList <- lapply(inputParms,function(x){
addModelElements(x[1],x[2],x[3],x[4],x[5])
})
# bind to single data frame
data <- do.call(rbind,parmList)
View(data)

Logistic regression training and test data

I am a beginner to R and am having trouble with something that feels basic, but I am not sure how to do it. I have a data set with 1319 rows and I want to set up training data from observations 1 to 1000 and test data from observations 1001 to 1319.
In the notes from my class, the professor set this up by building a Boolean vector from the 'Year' variable in her data. For example:
train=(Year<2005)
And that returns the True/False statements.
I understand that and would be able to set up a Boolean vector if I were subsetting my data by a variable, but instead I have to split strictly by row number, which I do not understand how to accomplish. I tried
train=(data$nrow < 1001)
But got logical(0) as a result.
Can anyone lead me in the right direction?
You get logical(0) because nrow is not a column of your data frame: data$nrow is NULL, and comparing NULL with 1001 gives logical(0).
You can also subset your dataframe by using row numbers:
train = 1:1000 # vector with integers from 1 to 1000
test = 1001:nrow(data)
train_data = data[train,]
test_data = data[test,]
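If a Boolean vector in the style of train=(Year<2005) is still wanted, the row positions can be compared directly. A small sketch, using a made-up data frame with 1319 rows in place of the real data:

```r
# Stand-in for the real data set (1319 rows, invented values)
data <- data.frame(id = 1:1319, x = rnorm(1319))

# TRUE for rows 1..1000, FALSE for rows 1001..1319
train <- seq_len(nrow(data)) <= 1000

train_data <- data[train, ]
test_data <- data[!train, ]
c(nrow(train_data), nrow(test_data))
```

Negating the vector with !train selects the remaining rows, which mirrors how a Boolean vector built from a variable would be used.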
But be careful: unless the order of rows in your dataframe is completely random, you probably want to pick 1000 rows at random rather than the first 1000. You can do this using
train = sample(1:nrow(data),1000)
You can then get your train_data and test_data using
train_data = data[train,]
test_data = data[setdiff(1:nrow(data),train),]
The setdiff function is used to get all rows not selected in train.
The issue with splitting your data set by rows is the potential to introduce bias into your training and testing sets, particularly for ordered data.
# Create a data set
data <- data.frame(year = sample(seq(2000, 2019, by = 1), 1000, replace = T),
data = sample(seq(0, 1, by = 0.01), 1000, replace = T))
nrow(data)
[1] 1000
If you really want to take the first n rows then you can try:
first.n.rows <- data[1:1000, ]
The caret package provides a more reliable approach to splitting data into training and testing sets.
First create the partition rule:
library(caret)
inTrain <- createDataPartition(y = data$year,
p = 0.8, list = FALSE)
Note the argument y = data$year: it tells caret to draw a random sample stratified on the variable year, so the split does not depend on row order and does not introduce bias into the model.
The p argument tells caret how much of the original data should be partitioned to the training set, in this case 80%.
Then apply the partition to the data set:
# Create the training set
train <- data[inTrain,]
# Create the testing set
test <- data[-inTrain,]
nrow(train) + nrow(test)
[1] 1000
