problem predicting a value in R using a regression model - r

I need to predict the R&D expenses for a company with 5000 employees. I already created the linear regression model as follows:
lrmodel_2 <- lm(Data1_3$RD~Data1_3$employees, data = Data1_3)
summary(lrmodel_2)
I then used the predict function to predict expenses for 5000 employees. However when I try to predict I get a table with multiple values instead of 1:
(predict function result image)
Any help would be very much appreciated.

Related

Use glm to predict on fresh data

I'm relatively new to glm - so please bear with me.
I have created a glm (logistic regression) to predict whether an individual CONTINUES studies ("0") or does NOTCONTINUE ("1"). I am interested in predicting the latter. The glm uses seven factors in the dataset and the confusion matrices are very good for what I need and combining seven years' of data have also been done. Straight-forward.
However, I now need to apply the model to the current years' data, which of course does not have the NOTCONTINUE column in it. Lets say the glm model is "CombinedYears" and the new data is "Data2020"
How can I use the glm model to get predictions of who will ("0") or will NOT ("1") continue their studies? Do I need to insert a NOTCONTINUE column into the latest file ?? I have tried this structure
Predict2020 <- predict(CombinedYears, data.frame(Data2020), type = 'response')
but the output only holds values <0.5.
Any help very gratefully appreciated. Thank you in advance
You mentioned that you already created a prediction model to predict whether a particular student will continue studies or not. You used the glm package and your model name is CombinedYears.
Now, what you have to know is that your problem is a binary classification and you used logistic regression for this. The output of your model when you apply it on new data, or even the same data used to fit the model, is probabilities. These are values between zero and one. In the development phase of your model, you need to determine the cutoff threshold of these probabilities which you can use later on when you predict new data. For example, you may determine 0.5 as a cutoff, and every probability above that is considered NOTCONTINUE and below that is CONTINUE. However, the best threshold can be determined from your data as well by maximizing both specificity and sensitivity. This can be done by calculating the area under the receiver operating characteristic curve (AUC). There are many packages than can do this for you, such as pROC and AUC packages in R. The same packages can determine the best cutoff as well.
What you have to do is the following:
Determine the cutoff threshold after calculating the AUC
library(pROC)
roc_object = roc(your_fit_data$NOTCONTINUE ~ fitted(CombinedYears))
coords(roc.roc_object, "best", ret="threshold", transpose = FALSE)
Use your model to predict on your new data year (as you did)
Predict2020 = predict(CombinedYears, data.frame(Data2020), type = 'response')
Now, the content of Predict2020 is just probabilities for each
student. Use the cutoff you obtained from step (1) to classify your
students accordingly

Stock prediction + news sentiment with SVM in R?

I would like to predict the stock prices and news sentiment score together with SVM in R, in order to see whether news have an impact on stock price and their prediction. I read that support vector machines (svm) are a good machine learning approach for this problem. I have one column that represents the date of the stock and news, one column represents the stock prices on that day and 4 columns which represent the sentiment scores based on different lexica. I would like to test first with one of that lexica and if the models works, trying on the other. The dataset is included below. I found some examples with python but couldn't found something for R. I like to use the svm() function from the e1071 package
I split the data into train and test set:
sample <- sample(nrow(sentGI),nrow(sentGI)*0.70)
df.trainGI = sentGI[sample,]
df.testGI = sentGI[-sample,]
And I tried already this SVM code, but my wrong prediction rate is 100
plot(df.trainGI$GSPC.Close, df.trainGI$SentimentGI, pch = 19, col = c("red", "blue"))
svm_model_GI <- svm(SentimentGI.Class ~ ., df.trainGI)
print(svm_model_GI)
plot(svm_model_GI, df.trainGI)
svm_pred_GI <- predict(svm_model_GI, newdata = df.testGI, type="response")
rmse <- sqrt(mean((svm_pred_GI - df.testGI$GSPC.Close)^2))
rmse
What I am doing wrong here? Hope somebody can help me!
Dataset
You're using model accuracy to evaluate the model. Accuracy is used for classification problems but your response variable is continuous. You should use RMSE.
pred <- predict(radial.svm, newdata=df.test, type='response')
rmse <- sqrt(mean((pred - df.test$GSPC.Close)^2))
rmse
Continuation from comments:
The first plots GSPC.Close against date (left) and the second plots SentimentGI against date (right). Notice that stock prices generally increase over time whereas sentiment has a slope of 0 in that same time frame. What does that tell you?

the questions about predict function in randomForestSRC package

In common with other machine learning methods, I divided my original data set (7-training data set: 3-test data set).
Here is my code.
install.packages(randomForestSRC)
library(randomForestSRC)
data(pbc, package="randomForestSRC")
data <- na.omit(pbc)
train <- sample(1:nrow(data), round(nrow(data) * 0.70))
data.grow <- rfsrc(Surv(days, status) ~ .,
data[train, ],
ntree = 100,
tree.err=T,
importance=T,
nsplit=1,
proximity=T)
data.pred <- predict(data.grow,
data[-train , ],
importance=T,
tree.err=T)
What I have a question is that predict function in this code.
Originally, I wanted to construct a prediction model based on random survival forest to predict the diseae development.
For example, After I build the prediction model with training data set, I wanted to know the probability of disease development with test data which has no information about disease incidence for each individual becuase I would like to know the probability of diease development based on the subject's general characteristics such as age, bmi, sex, something like that.
However, unlike my intention to build a predicion model as I said above, "predict" function in this package didn't work based on the data which has no status information (event/censored).
"predict" function must work with outcome information (event/censored).
Therefore, I cannot understand what the "predict" function means.
If "precict" function works only with oucome information, then how can I make a predction for disease development based on the subject's general characteristics in the future?
In addition, if the prediction in this model is constructed with the outcome information, what the meaning is "predct" in the random survival forest model.
Please let me know what the "predict" function in this package means is.
Thank you for reading my long question.
The predict for this type of model, i.e. predict.rfsrc, works much like you'd expect it to if you've used predict with glm, lm, RRF or other models.
The predict statement does not require you to know the outcome for the prediction data set. I am trying to understand why you thought that it did.
Your example rfsrc statement does not work because it refers to columns that are not in the example data set.
I think the best plan is that I will show you using a reproducible example, below. If you have further questions you can ask me in a comment.
# Train a RFSRC model
mtcars.mreg <- rfsrc(Surv(mpg, cyl) ~., data = mtcars[1:30,],
tree.err=TRUE, importance = TRUE)
# Simulate new data
new_data <- mtcars[31:32,]
# predict
predicted <-predict(mtcars.mreg, new_data)
predicted
Sample size of test (predict) data: 2
Number of grow trees: 1000
Average no. of grow terminal nodes: 4.898
Total no. of grow variables: 9
Analysis: RSF
Family: surv-CR
Test set error rate: NA
predicted$predicted
event.1 event.2 event.3
[1,] 0.4781338 2.399299 14.71493
[2,] 3.2185606 4.720809 2.15895

Arimax forecasting

I need to forecast sales on daily basis using arima model with independent variables as weekdays.
So i build up the model :
d= data,
Total = sales Monday,tuesday...Sunday are my independent Vars
i am using library(forecast)
'fit=arima(d$Total,xreg=cbind(Sunday,Monday,Tuesday,Wednesday,Thursday,Friday),order=c(1,1,1))'
Please help me to proceed further and to predict future values.
How to decide p,d,q and to plot forecasted Vs Actual Values ? please help

Forecast future values for a time series using support vector machin

I am using support vector regression in R to forecast future values for a uni-variate time series. Splitting the historical data into test and train sets, I find a model by using svm function in R to the test data and then use the predict() command with train data to predict values for the train set. We can then compute prediction errors. I wonder what happens then? we have a model and by checking the model on the train data, we see the model is efficient. How can I use this model to predict future values out of train data? Generally speaking, we use predict function in R and give it a forecast horizon (h=12) to predict 12 future values. Based on what I saw, the predict() command for SVM does not have such coomand and needs a train dataset. How should I build a train data set for predicting future data which is not in our historical data set?
Thanks
Just a stab in the dark... SVM is not for prediction but for classification, specifically supervised. I am guessing you are trying to predict stock values, no? How about classify your existing data, using some size of your choice say 100 values at a time, for noise (N), up (U), big up (UU), down (D), and big down (DD). In this way as your data comes in you slide your classification frame and get it to tell you if the upcoming trend is N, U, UU, D, DD.
What you can do is to build a data frame with columns representing the actual stock price and its n lagged values. And use it as a train set/test set (the actual value is the output and the previous values the explanatory variables). With this method you can do a 1-day (or whatever the granularity is) into the future forecast and then you can use your prediction to make another one and so on.

Resources