Subset named numeric in R error - r

This one is really stumping me. I cannot find what is wrong with this line of code; I'm using it in a function, and it makes the entire function fail.
Here is a vector of class 'matrix', result.1:
SPY TLT VGK EEM BIV RWX IYR EWJ DBC GLD PCY LQD TIP
-0.08 0.09 -0.07 -0.07 -0.09 -0.07 -0.07 -0.07 -0.066 -0.04 -0.08 -0.08 -0.08
I calculate the mean of this vector by running this line of code:
mean.momentum.1 <- mean(result.1)
The result is a numeric with a value of -0.062309. I want to get the names of the columns that are greater than the mean, so I run this code:
names(result.1[,(result.1 > mean.momentum.1)])
The output is as I expect, returning a character vector "TLT", "GLD".
Here is the issue: when I do this with a second, nearly identical matrix result.2 I get a NULL result every time, when I should be getting the result "TLT".
Here is the matrix result.2:
SPY TLT VGK EEM BIV RWX IYR EWJ DBC GLD TIP PCY LQD
-0.08 0.15 -0.07 -0.06 -0.054 -0.07 -0.07 -0.07 -0.06 -0.06 -0.07 -0.07 -0.06
I calculate the mean using the same method as above (-0.05298) and name it mean.momentum.2
Then, I run this line:
names(result.2[,(result.2 > mean.momentum.2)])
I get NULL back every time, when I expect "TLT". What is going wrong with the way I am subsetting? If I run the line:
result.2 > mean.momentum.2
I get a logical vector where TLT is TRUE, but it does not work when I try to subset with this method. It works perfectly in the first instance, but never works in the second instance, and since I get NULL back, my entire function fails.
Thank you...

Use colnames() and rownames() rather than names() when referring to a matrix:
mean.momentum.1 <- mean(result.1)
colnames(result.1)[which(result.1 > mean.momentum.1)]
## [1] "TLT" "GLD"
Also note the syntax of colnames() above. You are returning a subset of the vector colnames(result.1), using which() to select the elements of the vector for which result.1 is greater than mean.momentum.1.
This works fine for result.2 also:
mean.momentum.2 <- mean(result.2)
colnames(result.2)[which(result.2 > mean.momentum.2)]
## [1] "TLT"
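A minimal sketch of why indexing the column names is the safer route. The values are the rounded ones printed in the question, and the row name "mom" is a made-up placeholder, on the assumption that the real matrix carries a row name as well as column names. In that situation, selecting a single column with `[` drops the result to a length-one vector, and the names can be lost on the way; indexing colnames() directly, or subsetting with drop = FALSE, avoids the drop altogether.

```r
# Values copied (rounded) from the question; the row name "mom" is a placeholder.
tickers <- c("SPY", "TLT", "VGK", "EEM", "BIV", "RWX", "IYR",
             "EWJ", "DBC", "GLD", "TIP", "PCY", "LQD")
result.2 <- matrix(
  c(-0.08, 0.15, -0.07, -0.06, -0.054, -0.07, -0.07,
    -0.07, -0.06, -0.06, -0.07, -0.07, -0.06),
  nrow = 1, dimnames = list("mom", tickers)
)

mean.momentum.2 <- mean(result.2)
above <- as.vector(result.2 > mean.momentum.2)  # plain logical vector, one entry per column

# Option 1: index the column names directly.
winners <- colnames(result.2)[above]
print(winners)

# Option 2: keep the matrix shape so the dimnames survive the subsetting.
kept <- result.2[, above, drop = FALSE]
print(colnames(kept))
```

Either option returns a character vector even when only one column qualifies, so a function built on it does not have to handle a NULL.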

Related

ROC Curve Plot using R (Error code: Predictor must be numeric or ordered)

I am trying to make a ROC Curve using pROC with the 2 columns as below: (the list goes on to over >300 entries)
Actual_Findings_%   Predicted_Finding_Prob
0.23                0.6
0.48                0.3
0.26                0.62
0.23                0.6
0.48                0.3
0.47                0.3
0.23                0.6
0.6868              0.25
0.77                0.15
0.31                0.55
The code I tried to use is:
roccurve<- plot(roc(response = data$Actual_Findings_% <0.4, predictor = data$Predicted_Finding_Prob >0.5),
legacy.axes = TRUE, print.auc=TRUE, main = "ROC Curve", col = colors)
Where the threshold for positive findings is
Actual_Findings_% <0.4
AND
Predicted_Finding_Prob >0.5
(i.e to be TRUE POSITIVE, actual_finding_% would be LESS than 0.4, AND predicted_finding_prob would be GREATER than 0.5)
but when I try to plot this roc curve, I get the error:
"Setting levels: control = FALSE, case = TRUE
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'plot': Predictor must be numeric or ordered."
Any help would be much appreciated!
This should work:
data <- read.table( text=
"Actual_Findings_% Predicted_Finding_Prob
0.23 0.6
0.48 0.3
0.26 0.62
0.23 0.6
0.48 0.3
0.47 0.3
0.23 0.6
0.6868 0.25
0.77 0.15
0.31 0.55
", header=TRUE, check.names=FALSE )
library(pROC)
roccurve <- plot(
roc(
response = data$"Actual_Findings_%" <0.4,
predictor = data$"Predicted_Finding_Prob"
),
legacy.axes = TRUE, print.auc=TRUE, main = "ROC Curve"
)
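Now, importantly: the ROC curve is there to show you what happens when you vary your classification threshold. So one thing you are doing wrong is enforcing a threshold yourself by passing Predicted_Finding_Prob > 0.5 as the predictor. That turns the predictor into a logical vector, which is why roc() complains that it must be numeric or ordered.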
This does however give a perfect separation, which is nice I guess. (Though bad for educational purposes.)
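To make the "vary the threshold" idea concrete, here is a rough base-R sketch that does by hand what a ROC curve summarises. It is not how pROC is implemented; the variable names (actual, pred, is_case, roc_points) are made up for this illustration, and the ten rows are the ones from the example data above.

```r
# The ten rows from the example data (names shortened for this sketch).
actual <- c(0.23, 0.48, 0.26, 0.23, 0.48, 0.47, 0.23, 0.6868, 0.77, 0.31)
pred   <- c(0.6, 0.3, 0.62, 0.6, 0.3, 0.3, 0.6, 0.25, 0.15, 0.55)
is_case <- actual < 0.4   # the response: TRUE = positive finding

# Sweep every candidate threshold, from strictest to most lenient.
thresholds <- c(Inf, sort(unique(pred), decreasing = TRUE))
tpr <- sapply(thresholds, function(t) mean(pred[is_case] >= t))   # sensitivity
fpr <- sapply(thresholds, function(t) mean(pred[!is_case] >= t))  # 1 - specificity
roc_points <- data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
print(roc_points)

# AUC as the share of (case, control) pairs where the case scores higher;
# ties count as half.
wins <- outer(pred[is_case], pred[!is_case], ">")
ties <- outer(pred[is_case], pred[!is_case], "==")
auc <- mean(wins + 0.5 * ties)
print(auc)
```

Each row of roc_points is one point on the curve. Fixing the predictor at > 0.5 beforehand would collapse all of this to a single point.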

Build Logistic Regression Model for shares

The data I am working with contains the closing prices of 10 shares of the S&P 500 index.
Data :
> dput(head(StocksData))
structure(list(ACE = c(56.86, 56.82, 56.63, 56.39, 55.97, 55.23
), AMD = c(8.47, 8.77, 8.91, 8.69, 8.83, 9.19), AFL = c(51.83,
50.88, 50.78, 50.5, 50.3, 49.65), APD = c(81.59, 80.38, 80.03,
79.61, 79.76, 79.77), AA = c(15.12, 15.81, 15.85, 15.66, 15.71,
15.78), ATI = c(53.54, 52.37, 52.53, 51.91, 51.32, 51.45), AGN = c(69.77,
69.53, 69.69, 69.98, 68.99, 68.75), ALL = c(29.32, 29.03, 28.99,
28.66, 28.47, 28.2), MO = c(20.09, 20, 20.07, 20.16, 20, 19.88
), AMZN = c(184.22, 185.01, 187.42, 185.86, 185.49, 184.68)), row.names = c(NA,
6L), class = "data.frame")
In the following part, I calculate the daily percentage changes of the 10 shares:
perc_change <- (StocksData[-1, ] - StocksData[-nrow(StocksData), ])/StocksData[-nrow(StocksData), ] * 100
perc_change
Output :
# ACE AMD AFL APD AA ATI AGN ALL MO AMZN
#2 -0.07 3.5 -1.83 -1.483 4.56 -2.19 -0.34 -0.99 -0.45 0.43
#3 -0.33 1.6 -0.20 -0.435 0.25 0.31 0.23 -0.14 0.35 1.30
#4 -0.42 -2.5 -0.55 -0.525 -1.20 -1.18 0.42 -1.14 0.45 -0.83
#5 -0.74 1.6 -0.40 0.188 0.32 -1.14 -1.41 -0.66 -0.79 -0.20
#6 -1.32 4.1 -1.29 0.013 0.45 0.25 -0.35 -0.95 -0.60 -0.44
With the above code I find the latest N rates of change (N should be in [1,10]).
I want to build a logistic regression model to predict the change on the next day (N + 1), i.e., "increase" or "decrease".
First, I split the data into two chunks, a training set and a test set:
(NOTE: the test set must be the last 40 sessions, and the training set the 85 sessions immediately before the test set!)
trainset <- head(StocksData, 870)
testset <- tail(StocksData, 40)
I then continue with fitting the model:
model <- glm(Here???,family=binomial(link='logit'),data=trainset)
The problem I am facing is that I do not understand what to put in the glm function. I have studied many logistic regression models, and I think my data does not contain the object that I need to place there.
Any help with this part of my code?
Based on what you shared, you need to predict an increase or decrease when new data arrives for the portfolio you mentioned. For that, you first need to define the target variable. One way is to count the number of positive and negative changes on each day. With those two variables, we can create a target that is 1 if the positive count is greater than the negative count (an increase) and 0 otherwise (no increase). The data shared is pretty small, but I have sketched the code so that you can apply the training/test approach to the modeling. Here is the code:
We will start from perc_change and compute the positive and negative variables:
#Build variables
#Store number of positive and negative changes
i <- names(perc_change)
perc_change$Neg <- apply(perc_change[,i],1,function(x) length(which(x<0)))
perc_change$Pos <- apply(perc_change[,i],1,function(x) length(which(x>0)))
Now, we create the target variable with a conditional:
#Build target variable
perc_change$Target <- ifelse(perc_change$Pos>perc_change$Neg,1,0)
We create a copy of the data and remove the variables that are no longer needed:
#Replicate data
perc_change2 <- perc_change
perc_change2$Neg <- NULL
perc_change2$Pos <- NULL
With perc_change2 the input is ready and you should split into train/test data. I will not do that as data is too small. I will go directly to the model:
#Train the model, few data for train/test in example but you can adjust that
model <- glm(Target~.,family=binomial(link='logit'),data=perc_change2)
With that model, you know how to evaluate performance and other things. Please do not hesitate in telling me if more details are needed.
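For the train/test step that was skipped above, a rough sketch could look like the following. It runs on simulated percentage changes, not on the real StocksData; the column names S1 to S10 and the 125-row length (85 training sessions plus 40 test sessions, as in the question's note) are assumptions made for this illustration.

```r
set.seed(1)  # simulated data only, fixed seed so the sketch is reproducible

# Simulated daily percentage changes for 10 hypothetical shares.
n_days <- 125
sim_change <- as.data.frame(matrix(rnorm(n_days * 10), ncol = 10))
names(sim_change) <- paste0("S", 1:10)

# Target as defined above: 1 when more shares rose than fell that day.
sim_change$Target <- as.integer(rowSums(sim_change > 0) > rowSums(sim_change < 0))

# Last 40 sessions for testing, the 85 sessions before them for training.
testset  <- tail(sim_change, 40)
trainset <- head(sim_change, n_days - 40)

# glm() may warn about fitted probabilities of 0 or 1 on data like this.
model <- glm(Target ~ ., family = binomial(link = "logit"), data = trainset)

# Predicted probabilities on the held-out sessions, cut at 0.5 into classes.
prob <- predict(model, newdata = testset, type = "response")
pred_class <- ifelse(prob > 0.5, 1, 0)
print(table(predicted = pred_class, actual = testset$Target))
accuracy <- mean(pred_class == testset$Target)
print(accuracy)
```

With the real data, perc_change2 would take the place of sim_change, and the split sizes would follow the number of sessions actually available.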

Change factor labels in psych::fa or psych::fa.diagram

I'm using the psych package for factor analysis. I want to specify the labels of the latent factors, either in the fa() object, or when graphing with fa.diagram().
For example, with toy data:
require(psych)
n <- 100
choices <- 1:5
df <- data.frame(a=sample(choices, replace=TRUE, size=n),
b=sample(choices, replace=TRUE, size=n),
c=sample(choices, replace=TRUE, size=n),
d=sample(choices, replace=TRUE, size=n))
model <- fa(df, nfactors=2, fm="pa", rotate="promax")
model
Factor Analysis using method = pa
Call: fa(r = df, nfactors = 2, rotate = "promax", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
PA1 PA2 h2 u2 com
a 0.45 -0.49 0.47 0.53 2.0
b 0.22 0.36 0.17 0.83 1.6
c -0.02 0.20 0.04 0.96 1.0
d 0.66 0.07 0.43 0.57 1.0
I want to change PA1 and PA2 to FactorA and FactorB, either by changing the model object itself, or adjusting the labels in the output of fa.diagram():
The docs for fa.diagram have a labels argument, but no examples, and the experimentation I've done so far hasn't been fruitful. Any help much appreciated!
With str(model) I found the $loadings attribute, which fa.diagram() uses to render the diagram. Modifying colnames() of model$loadings did the trick.
colnames(model$loadings) <- c("FactorA", "FactorB")
fa.diagram(model)

Apply log and log1p to several columns with an if condition

I have a data frame and I need to calculate log for all numbers greater than 0 and log1p for numbers equal to 0. My data frame is called tcPainelLog and looks like this (str output for columns 6:8):
$ IDD: num 0.04 0.06 0.07 0.72 0.52 ...
$ Soil: num 0.25 0.22 0.16 0.00 0.00 ...
$ QAI: num 0.00 0.50 0.00 0.71 0.26 ...
Therefore, I guess I need to combine an ifelse statement with the log and log1p functions. However, I have tried several different ways to do it, and none has succeeded. For instance:
tcPainelLog <- tcPainel
cols <- names(tcPainelLog[,6:17]) # These are the columns I need to calculate
tcPainelLog$IDD <- ifelse(X = tcPainelLog$IDD>0, log(X), log1p(X))
tcPainelLog[cols] <- lapply(tcPainelLog[cols], function(x) ifelse((x > 0), log(x), log1p(x)))
tcPainelLog[cols] <- if(tcPainelLog[,cols] > 0) log(.) else log1p(.)
I haven't been able to get it to work and would appreciate any help. I am sorry if there is already an explanation for this; I searched using many keywords but didn't find one.
Best regards.
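One possible approach, sketched on a small stand-in data frame: the three columns and their first five values are taken from the str() output above, while everything else (the helper name log_or_log1p, using only these three columns instead of columns 6:17) is an assumption for illustration. The idea is to wrap the vectorised ifelse() in a function and apply it column by column with lapply().

```r
# Stand-in for the real data: first five values of the three columns shown above.
tcPainel <- data.frame(
  IDD  = c(0.04, 0.06, 0.07, 0.72, 0.52),
  Soil = c(0.25, 0.22, 0.16, 0.00, 0.00),
  QAI  = c(0.00, 0.50, 0.00, 0.71, 0.26)
)

# Hypothetical helper: log() where x > 0, log1p() otherwise.
# ifelse() evaluates both branches on the whole vector and then picks per element,
# so negative values would trigger a NaN warning from log().
log_or_log1p <- function(x) ifelse(x > 0, log(x), log1p(x))

tcPainelLog <- tcPainel
cols <- c("IDD", "Soil", "QAI")   # in the real data: names(tcPainel)[6:17]
tcPainelLog[cols] <- lapply(tcPainelLog[cols], log_or_log1p)
print(tcPainelLog)
```

Note that log1p(0) is 0, so the zeros stay at 0 while the positive values become negative logs; whether that mix is desirable depends on the analysis.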

Problems with using plotCalibration() from the predictABEL package in R

I’ve been having some trouble with the plotCalibration() function. I have managed to get it to work before, but recently, while working with another dataset (here is a link to the .Rda data file), I have been unable to shake off an error message that keeps cropping up:
> plotCalibration(data = data, cOutcome = 2, predRisk = data$sortmort)
Error in plotCalibration(data = data, cOutcome = 2, predRisk = data$sortmort) : The specified outcome is not a binary variable.
When I’ve tried to set the cOutcome column to factors or to logical, it still doesn’t work.
I’ve looked at the source of the function and the only time the error message comes up is in the first if()else{} statement:
if (length(unique(y))!=2) {stop(" The specified outcome is not a binary variable.\n")}
else{
But I have checked that the length(unique(y)) is indeed ==2, and so don’t understand why the error message still crops up!
Be sure you're passing a data frame to plotCalibration(). Passing a dplyr tibble can cause this error. Converting with the normal as.data.frame() worked for me.
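A base-R sketch of why a tibble could trip the binary check. It assumes the function extracts the outcome as data[, cOutcome], which the snippet quoted in the question does not show, and it uses a made-up data frame. A tibble keeps data[, 2] as a one-column table instead of dropping it to a vector; drop = FALSE reproduces that shape here without needing the tibble package.

```r
# Hypothetical data: column 2 is a 0/1 outcome.
data <- data.frame(id = 1:6, died = c(0, 1, 0, 0, 1, 0))

y_vec <- data[, 2]                 # plain data frame: drops to a numeric vector
y_df  <- data[, 2, drop = FALSE]   # tibble-like result: still a one-column table

n_vec <- length(unique(y_vec))  # counts distinct outcome values
n_df  <- length(unique(y_df))   # counts columns, not distinct values
print(c(vector = n_vec, table = n_df))
```

Since the length of a one-column table is 1, a `length(unique(y)) != 2` check would fail even though the outcome has exactly two values.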
Using the data you sent earlier, I do not see any error though:
The following output was produced along with a calibration plot:
> library(PredictABEL)
> plotCalibration(data = data, cOutcome = 2, predRisk = data$sortmort)
$Table_HLtest
total meanpred meanobs predicted observed
[0.000632,0.00129) 340 0.001 0.000 0.31 0
0.001287 198 0.001 0.000 0.25 0
[0.001374,0.00201) 283 0.002 0.004 0.53 1
0.002009 310 0.002 0.000 0.62 0
[0.002505,0.00409) 154 0.003 0.000 0.52 0
[0.004086,0.00793) 251 0.006 0.000 1.42 0
[0.007931,0.00998) 116 0.008 0.009 0.96 1
[0.009981,0.19545] 181 0.024 0.011 4.40 2
$Chi_square
[1] 4.906
$df
[1] 8
$p_value
[1] 0.7676
Please try using table(data[,2],useNA = "ifany") to see the number of levels of the outcome variable of your dataset.
The function plotCalibration will execute when the outcome is a binary variable (two levels).
