Build Logistic Regression Model for shares - r

The data I am working with contains the closing prices of 10 shares from the S&P 500 index.
Data:
> dput(head(StocksData))
structure(list(ACE = c(56.86, 56.82, 56.63, 56.39, 55.97, 55.23
), AMD = c(8.47, 8.77, 8.91, 8.69, 8.83, 9.19), AFL = c(51.83,
50.88, 50.78, 50.5, 50.3, 49.65), APD = c(81.59, 80.38, 80.03,
79.61, 79.76, 79.77), AA = c(15.12, 15.81, 15.85, 15.66, 15.71,
15.78), ATI = c(53.54, 52.37, 52.53, 51.91, 51.32, 51.45), AGN = c(69.77,
69.53, 69.69, 69.98, 68.99, 68.75), ALL = c(29.32, 29.03, 28.99,
28.66, 28.47, 28.2), MO = c(20.09, 20, 20.07, 20.16, 20, 19.88
), AMZN = c(184.22, 185.01, 187.42, 185.86, 185.49, 184.68)), row.names = c(NA,
6L), class = "data.frame")
In the following part, I calculate the daily percentage changes of the 10 shares:
perc_change <- (StocksData[-1, ] - StocksData[-nrow(StocksData), ])/StocksData[-nrow(StocksData), ] * 100
perc_change
Output:
# ACE AMD AFL APD AA ATI AGN ALL MO AMZN
#2 -0.07 3.5 -1.83 -1.483 4.56 -2.19 -0.34 -0.99 -0.45 0.43
#3 -0.33 1.6 -0.20 -0.435 0.25 0.31 0.23 -0.14 0.35 1.30
#4 -0.42 -2.5 -0.55 -0.525 -1.20 -1.18 0.42 -1.14 0.45 -0.83
#5 -0.74 1.6 -0.40 0.188 0.32 -1.14 -1.41 -0.66 -0.79 -0.20
#6 -1.32 4.1 -1.29 0.013 0.45 0.25 -0.35 -0.95 -0.60 -0.44
With the code above I get the latest N rates of change (N should be in [1, 10]).
I want to build a logistic regression model to predict the change on the next day (N + 1), i.e., "increase" or "decrease".
First, I split the data into two chunks, a training set and a test set:
(NOTE: the test set must be the last 40 sessions, and the training set the 85 sessions preceding the test set!)
trainset <- head(StocksData, 870)
testset <- tail(StocksData, 40)
I then continue with fitting the model:
model <- glm(Here???,family=binomial(link='logit'),data=trainset)
The problem I am facing is that I do not understand what to put in the glm function. I have studied many logistic regression models, and I think my data does not contain the object (the response variable) that I need to place there.
Can anyone help me with this part of my code?

Based on what you shared, you need to predict an increase or a decrease in the portfolio you mentioned when new data arrives. For that, you first need to define the target variable. We can do so by counting the number of positive and negative changes in each row. With those counts, we can create a target variable that is 1 if there are more positive than negative changes (an increase) and 0 otherwise (no increase). The data you shared is quite small, but I have sketched the code so that you can apply the training/test approach to the modeling.
We start from perc_change and compute the positive and negative counts:
#Build variables
#Store number of positive and negative changes
i <- names(perc_change)
perc_change$Neg <- apply(perc_change[,i],1,function(x) length(which(x<0)))
perc_change$Pos <- apply(perc_change[,i],1,function(x) length(which(x>0)))
Now, we create the target variable with a conditional:
#Build target variable
perc_change$Target <- ifelse(perc_change$Pos>perc_change$Neg,1,0)
We make a copy of the data and remove the variables that are no longer needed:
#Replicate data
perc_change2 <- perc_change
perc_change2$Neg <- NULL
perc_change2$Pos <- NULL
With perc_change2 the input is ready, and you should split it into train/test data. I will not do that here because the sample is too small, so I go directly to the model:
#Train the model; the example has too few rows for a train/test split, but you can adjust that
model <- glm(Target~.,family=binomial(link='logit'),data=perc_change2)
With that model, you can go on to evaluate performance and so on. Please do not hesitate to tell me if more details are needed.
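Since the question asks for a prediction of the next day (N + 1), one possible variation is to pair each day's changes with the direction of the following day, so that the predictors do not contain the very values the target was computed from. The sketch below is only an illustration under stated assumptions: the prices are simulated random walks for three made-up tickers (A, B, C), the target is the next-day direction of ticker A (a hypothetical choice; the majority-based Target above could be shifted in the same way), and the 85/40 split follows the note in the question.

```r
set.seed(1)

# Hypothetical data: 127 sessions of simulated prices for 3 made-up tickers
prices <- as.data.frame(lapply(c(A = 50, B = 20, C = 100),
                               function(p0) p0 * cumprod(1 + rnorm(127, 0, 0.01))))

# Daily percentage changes, same formula as in the question (126 rows)
pc <- (prices[-1, ] - prices[-nrow(prices), ]) / prices[-nrow(prices), ] * 100

# Predictors: changes on day t; target: 1 if ticker A rises on day t + 1
n <- nrow(pc)
dat <- pc[-n, ]                          # days 1 .. n-1
dat$Target <- as.integer(pc$A[-1] > 0)   # direction on days 2 .. n

# Last 40 sessions for testing, the 85 sessions before them for training
testset  <- tail(dat, 40)
trainset <- head(dat, nrow(dat) - 40)

model <- glm(Target ~ ., family = binomial(link = "logit"), data = trainset)

# Predicted probability of an increase, turned into a label with a 0.5 cut-off
prob <- predict(model, newdata = testset, type = "response")
pred <- ifelse(prob > 0.5, "increase", "decrease")
table(predicted = pred, actual = testset$Target)
```

On simulated random walks the model should not be expected to predict well; the point is only the shape of the data passed to glm.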

Related

R For loop - estimates and values for forecasting

I'm trying to do some regression forecasting based on my estimates and actual values. I have the following
estimates=s1$coefficients[,1]
values = data.frame(cbind(sd_rgdpg,DISSIM,TRADE,SIZE,OPEN,TF,INFL,INT,NIIP))
Here estimates holds my coefficients and values holds my actual values. 'estimates' is a vector of length ten with the intercept as the first item. 'values' is a data frame with 9 columns and 21 rows. The columns of values correspond to the entries of estimates. I need to multiply estimates and values together to form an equation like y = intercept + b1x1 + b2x2 + ... + b9x9.
I'm not quite sure how to do this in a for loop. Can anyone help me out?
Here is the 'values' dataframe:
sd_rgdpg DISSIM TRADE SIZE OPEN TF INFL INT NIIP
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
0.3905156169 0.39590508 0.00000000 0.0000000000 2.629159 0.5474359 -0.40 1.43 -13.68144000
1.4482896523 0.37227806 0.03102011 0.0007919784 2.493771 0.5837563 -0.07 0.16 1.19404188
0.1698460561 0.35884028 0.10907448 0.0386795080 2.342112 0.6075000 0.22 -0.76 0.93052249
0.0020363597 0.04812418 0.24478591 0.0856910910 2.085918 0.6554404 -0.40 -1.22 0.94020757
0.3148110593 0.02315404 0.28936211 0.1649356627 2.094957 0.6589744 -3.16 -1.88 0.85515135
0.0279017603 0.02906603 0.31283051 0.2369223964 2.033051 0.6938776 -1.29 -1.36 0.57801452
0.0192319055 0.05513982 0.35421769 0.3050570794 2.137967 0.8312958 -0.02 -0.85 0.34994832
0.0358535769 0.07426063 0.48108389 0.4014364697 2.326611 0.8333333 -1.50 -0.35 -0.11022762
1.4919556927 0.05297878 0.60639908 0.4873392510 2.608321 0.8096886 -5.94 -0.76 -0.49419490
1.6980146354 0.03063955 0.75594659 0.5018749374 2.795147 0.8380282 1.27 -0.25 -0.28853577
1-10 of 21 rows | 1-5 of 9 columns
and here are the 'estimates'
(Intercept) sd_rgdpg TRADE OPEN
-1.048798e-04 -7.023954e-06 5.159287e-06 2.467633e-04
DISSIM SIZE TF INFL
-5.867023e-04 -3.927840e-04 -3.241606e-04 -2.520122e-05
INT NIIP
1.668813e-06 8.409097e-06
Just recall that your coefficients are a vector, and your values are a matrix. Hence, your result y is also a vector. You wouldn't need a for loop. As an example:
intercept <- -1.05
coef <- c(-7.02, 5.16, 2.47,-5.87,-3.93,-3.23,-2.52, 1.67, 8.41)
values <- matrix(runif(27), ncol = 9) # values is a matrix with 9 columns and 3 rows of Unif[0,1] values as an example
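Then you can just do
> intercept + drop(values %*% coef) # one value per row of values; output varies because values is random
Note that coef * values would recycle coef down the columns of the matrix instead of across each row, so rowSums(coef * values) does not give the intended sum.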
But after training a regression model, for instance with the lm() function, you would generally use the predict() function to produce results.
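One detail worth checking in the question's data is that the coefficient vector and the columns of values are listed in different orders, so matching them by name is safer than relying on position. The following is a minimal sketch with a shortened, hypothetical set of three coefficients and two rows of rounded values; the same pattern should carry over to the full nine columns, assuming the names in estimates match the column names of values.

```r
# Hypothetical, shortened inputs: intercept plus three named slopes
estimates <- c("(Intercept)" = -1.05, sd_rgdpg = -7.02, TRADE = 5.16, OPEN = 2.47)

# Columns deliberately in a different order than the coefficients
values <- data.frame(sd_rgdpg = c(0.39, 1.45),
                     OPEN     = c(2.63, 2.49),
                     TRADE    = c(0.00, 0.03))

slopes <- estimates[-1]

# Reorder the columns by coefficient name, then take the matrix product
X <- as.matrix(values[, names(slopes)])
y <- estimates[["(Intercept)"]] + drop(X %*% slopes)
y
```

Each element of y is intercept + b1x1 + b2x2 + b3x3 for one row of values, with no loop needed.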

ROC Curve Plot using R (Error code: Predictor must be numeric or ordered)

I am trying to make a ROC curve using pROC with the 2 columns below (the list goes on for more than 300 entries):
Actual_Findings_% Predicted_Finding_Prob
0.23 0.6
0.48 0.3
0.26 0.62
0.23 0.6
0.48 0.3
0.47 0.3
0.23 0.6
0.6868 0.25
0.77 0.15
0.31 0.55
The code I tried to use is:
roccurve<- plot(roc(response = data$Actual_Findings_% <0.4, predictor = data$Predicted_Finding_Prob >0.5),
legacy.axes = TRUE, print.auc=TRUE, main = "ROC Curve", col = colors)
Where the threshold for positive findings is
Actual_Findings_% <0.4
AND
Predicted_Finding_Prob >0.5
(i.e to be TRUE POSITIVE, actual_finding_% would be LESS than 0.4, AND predicted_finding_prob would be GREATER than 0.5)
but when I try to plot this roc curve, I get the error:
"Setting levels: control = FALSE, case = TRUE
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'plot': Predictor must be numeric or ordered."
Any help would be much appreciated!
This should work:
data <- read.table( text=
"Actual_Findings_% Predicted_Finding_Prob
0.23 0.6
0.48 0.3
0.26 0.62
0.23 0.6
0.48 0.3
0.47 0.3
0.23 0.6
0.6868 0.25
0.77 0.15
0.31 0.55
", header=TRUE, check.names=FALSE )
library(pROC)
roccurve <- plot(
roc(
response = data$"Actual_Findings_%" <0.4,
predictor = data$"Predicted_Finding_Prob"
),
legacy.axes = TRUE, print.auc=TRUE, main = "ROC Curve"
)
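Now, importantly: the ROC curve is there to show you what happens when you vary your classification threshold. So the one thing you did wrong was to enforce a single threshold by setting predictions > 0.5, which turns the predictor into a logical vector; roc() needs the numeric probabilities themselves.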
This does however give a perfect separation, which is nice I guess. (Though bad for educational purposes.)
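To make the role of the threshold concrete, here is a small base R sketch (not pROC's implementation) that sweeps the threshold over the ten rows shown above and records the true and false positive rates at each step. The variable names actual, prob, is_case and roc_points are made up for this illustration, and the AUC is computed as the share of case/control pairs in which the case has the higher predicted probability.

```r
# The ten rows shown in the question
actual <- c(0.23, 0.48, 0.26, 0.23, 0.48, 0.47, 0.23, 0.6868, 0.77, 0.31)
prob   <- c(0.60, 0.30, 0.62, 0.60, 0.30, 0.30, 0.60, 0.25, 0.15, 0.55)

# The response is dichotomised once; the predictor stays numeric
is_case <- actual < 0.4

# One ROC point per candidate threshold on the predictor
thresholds <- sort(unique(prob), decreasing = TRUE)
roc_points <- data.frame(
  threshold = thresholds,
  tpr = sapply(thresholds, function(t) mean(prob[is_case] >= t)),   # sensitivity
  fpr = sapply(thresholds, function(t) mean(prob[!is_case] >= t))   # 1 - specificity
)
roc_points

# AUC: probability that a random case outranks a random control (ties count 0.5)
cmp <- outer(prob[is_case], prob[!is_case],
             function(a, b) (a > b) + 0.5 * (a == b))
auc <- mean(cmp)
auc
```

On these ten rows every case has a higher predicted probability than every control, which is the perfect separation mentioned above.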

Penalized Regression: "ridge" RMSE greater than that for plain "lm"

Working with the "prostate" dataset in "ElemStatLearn" package.
set.seed(3434)
fit.lm = train(data=trainset, lpsa~., method = "lm")
fit.ridge = train(data=trainset, lpsa~., method = "ridge")
fit.lasso = train(data=trainset, lpsa~., method = "lasso")
Comparing RMSE (for bestTune in case of ridge and lasso)
fit.lm$results[,"RMSE"]
[1] 0.7895572
fit.ridge$results[fit.ridge$results[,"lambda"]==fit.ridge$bestTune$lambda,"RMSE"]
[1] 0.8231873
fit.lasso$results[fit.lasso$results[,"fraction"]==fit.lasso$bestTune$fraction,"RMSE"]
[1] 0.7779534
Comparing absolute value of coefficients
abs(round(fit.lm$finalModel$coefficients,2))
(Intercept) lcavol lweight age lbph svi lcp gleason pgg45
0.43 0.58 0.61 0.02 0.14 0.74 0.21 0.03 0.01
abs(round(predict(fit.ridge$finalModel, type = "coef", mode = "norm")$coefficients[8,],2))
lcavol lweight age lbph svi lcp gleason pgg45
0.49 0.62 0.01 0.14 0.65 0.05 0.00 0.01
abs(round(predict(fit.lasso$finalModel, type = "coef", mode = "norm")$coefficients[8,],2))
lcavol lweight age lbph svi lcp gleason pgg45
0.56 0.61 0.02 0.14 0.72 0.18 0.00 0.01
My question is: how can "ridge" RMSE be higher than that of plain "lm". Doesn't that defeat the very purpose of penalized regression vs plain "lm"?
Also, how can the absolute value of the coefficient of "lweight" be actually higher in ridge (0.62) vs that in lm (0.61)? Both coefficients are positive originally without the abs().
I was expecting ridge to perform similar to lasso, which not only reduced RMSE but also shrank the size of coefficients vs plain "lm".
Thank you!

Change factor labels in psych::fa or psych::fa.diagram

I'm using the psych package for factor analysis. I want to specify the labels of the latent factors, either in the fa() object, or when graphing with fa.diagram().
For example, with toy data:
require(psych)
n <- 100
choices <- 1:5
df <- data.frame(a=sample(choices, replace=TRUE, size=n),
b=sample(choices, replace=TRUE, size=n),
c=sample(choices, replace=TRUE, size=n),
d=sample(choices, replace=TRUE, size=n))
model <- fa(df, nfactors=2, fm="pa", rotate="promax")
model
Factor Analysis using method = pa
Call: fa(r = df, nfactors = 2, rotate = "promax", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
PA1 PA2 h2 u2 com
a 0.45 -0.49 0.47 0.53 2.0
b 0.22 0.36 0.17 0.83 1.6
c -0.02 0.20 0.04 0.96 1.0
d 0.66 0.07 0.43 0.57 1.0
I want to change PA1 and PA2 to FactorA and FactorB, either by changing the model object itself, or by adjusting the labels in the output of fa.diagram().
The docs for fa.diagram have a labels argument, but no examples, and the experimentation I've done so far hasn't been fruitful. Any help much appreciated!
With str(model) I found the $loadings attribute, which fa.diagram() uses to render the diagram. Modifying colnames() of model$loadings did the trick.
colnames(model$loadings) <- c("FactorA", "FactorB")
fa.diagram(model)

Access / save information from metafor forest plot in meta-analysis

I'm wondering if it's possible to access (in some form) the information that is presented in the -forest- command in the -metafor- package.
I am checking / verifying results, and I'd like to have the output of values produced. Thus far, the calculations all check, but I'd like to have them available for printing, saving, etc. instead of having to type them out by hand.
Sample code is below :
es <- read.table(header=TRUE, text = "
b se_b
0.083 0.011
0.114 0.011
0.081 0.013
0.527 0.017
" )
library(metafor)
es.est <- rma(yi=b, sei=se_b, dat=es, method="DL")
studies <- as.vector( c("Larry (2011)" , "Curly (2011)", "Moe (2015)" , "Shemp (2010)" ) )
forest(es.est , transf=exp , slab = studies , refline = 1 , xlim=c(0,3), at = c(1, 1.5, 2, 2.5, 3, 3.5, 4) , showweights=TRUE)
I'd like to access the values (effect size and c.i. for each study, as well as the overall estimate, and c.i.) that are presented on the right of the graphic.
Thanks so much,
-Jon
How about:
> summary(escalc(measure="GEN", yi=b, sei=se_b, data=es), transf=exp)
b se_b yi vi sei zi ci.lb ci.ub
1 0.083 0.011 1.0865 0.0001 0.0110 7.5455 1.0634 1.1102
2 0.114 0.011 1.1208 0.0001 0.0110 10.3636 1.0968 1.1452
3 0.081 0.013 1.0844 0.0002 0.0130 6.2308 1.0571 1.1124
4 0.527 0.017 1.6938 0.0003 0.0170 31.0000 1.6383 1.7512
Then yi, ci.lb, and ci.ub provide the same info as the per-study values shown in the forest plot.
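For checking purposes, the per-study columns above can also be reproduced by hand in base R. The sketch below assumes the intervals are ordinary 95% Wald intervals on the original scale that are then back-transformed with exp, which is consistent with the printed output; the data frame name out is made up for this illustration.

```r
es <- read.table(header = TRUE, text = "
b se_b
0.083 0.011
0.114 0.011
0.081 0.013
0.527 0.017
")

z <- qnorm(0.975)  # about 1.96, for a 95% interval

# Back-transformed estimate and confidence bounds for each study
out <- data.frame(
  study = c("Larry (2011)", "Curly (2011)", "Moe (2015)", "Shemp (2010)"),
  yi    = exp(es$b),
  ci.lb = exp(es$b - z * es$se_b),
  ci.ub = exp(es$b + z * es$se_b)
)

# Rounded for display; can be saved with write.csv(out, ...) if needed
print(cbind(out["study"], round(out[, -1], 4)))
```

The result is an ordinary data frame, so it can be printed, saved, or compared against the plot without retyping anything.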
