My predicted values are all negative. I would have expected 0s or 1s. Can anyone see where I am going wrong?
fold = 10
end = nrow(birthwt)
fold_2 = floor(end/fold)
df_i = birthwt[sample(nrow(birthwt)),] # randomly shuffle the rows of birthwt
tester = df_i[1:fold_2,] # first tenth of rows - USE PREDICT ON THIS DATA
trainer = df_i[-c(1:fold_2),] # everything except the first tenth - USE GLM ON THIS DATA
mod = glm(low~lwt,family=binomial,data=trainer)
ypred = predict(mod,data=tester) # predicted values
The default for predict.glm is to give you values on the scale of the linear predictor (the link scale), before the inverse-link transformation; for a logistic model those are log-odds, which are frequently negative. If you want to predict the response, use
ypred <- predict(mod, newdata=tester, type="response")
Note also that the argument is newdata, not data: with data=tester the new data is silently ignored and you get predictions for the training set. It may be helpful to read the ?predict.glm help file.
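For example, a minimal sketch on the full birthwt data (assuming MASS::birthwt, as in the question); the response scale gives probabilities, which you can then threshold to get the 0/1 labels you expected:
library(MASS)  # provides the birthwt data
mod <- glm(low ~ lwt, family = binomial, data = birthwt)
eta  <- predict(mod, newdata = birthwt)                    # link scale: log-odds, often negative
prob <- predict(mod, newdata = birthwt, type = "response") # response scale: probabilities in (0,1)
yhat <- as.integer(prob > 0.5)                             # hard 0/1 labels via a 0.5 cutoff
all.equal(prob, plogis(eta))                               # the two scales differ by the inverse logit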
I ran the following code for a binary classification task with an SVM, in both R (first example) and Python (second).
Given randomly generated data (X) and response (Y), this code performs leave-group-out cross validation 1000 times. Each entry of the final prediction vector is therefore the mean of that observation's predictions across CV iterations.
Computing area under the curve should give ~0.5, since X and Y are completely random. However, this is not what we see. Area under the curve is frequently significantly higher than 0.5. The number of rows of X is very small, which can obviously cause problems.
Any idea what could be happening here? I know that I can either increase the number of rows of X or decrease the number of columns to mediate the problem, but I am looking for other issues.
Y=as.factor(rep(c(1,2), times=14))
X=matrix(runif(length(Y)*100), nrow=length(Y))
library(e1071)
library(pROC)
colnames(X)=1:ncol(X)
iter=1000
ansMat=matrix(NA,length(Y),iter)
for(i in seq(iter)){
  # get train indices
  train=sample(seq(length(Y)),0.5*length(Y))
  if(min(table(Y[train]))==0)
    next
  # test is everything not in train
  test=seq(length(Y))[-train]
  # train model
  XX=X[train,]
  YY=Y[train]
  mod=svm(XX,YY,probability=FALSE)
  XXX=X[test,]
  predVec=predict(mod,XXX)
  RFans=attr(predVec,'decision.values') # NULL unless decision.values=TRUE is passed; unused below
  ansMat[test,i]=as.numeric(predVec)
}
ans=rowMeans(ansMat,na.rm=TRUE)
r=roc(Y,ans)$auc
print(r)
When I implement the same thing in Python, I get similar results.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

Y = np.array([1, 2] * 14)
X = np.random.uniform(size=[len(Y), 100])
n_iter = 1000
ansMat = np.full((len(Y), n_iter), np.nan)
for i in range(n_iter):
    # Get train/test index
    train = np.random.choice(range(len(Y)), size=int(0.5 * len(Y)), replace=False)
    if len(np.unique(Y[train])) == 1:  # skip splits whose training fold has only one class
        continue
    test = np.array([j for j in range(len(Y)) if j not in train])
    # train model
    mod = SVC(probability=False)
    mod.fit(X=X[train, :], y=Y[train])
    # predict and collect answer
    ansMat[test, i] = mod.predict(X[test, :])
ans = np.nanmean(ansMat, axis=1)
fpr, tpr, thresholds = roc_curve(Y, ans, pos_label=1)
print(auc(fpr, tpr))
You should consider each iteration of cross-validation to be an independent experiment, where you train using the training set, test using the testing set, and then calculate the model skill score (in this case, AUC).
So what you should actually do is calculate the AUC for each CV iteration, and then take the mean of those AUCs.
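A minimal sketch of that fix in R, reusing Y, X, iter, and the e1071/pROC setup from above (iterations where either fold is missing a class are skipped):
aucVec = rep(NA, iter)
for(i in seq(iter)){
  train = sample(seq(length(Y)), 0.5*length(Y))
  test = seq(length(Y))[-train]
  if(min(table(Y[train])) == 0 || min(table(Y[test])) == 0)
    next
  mod = svm(X[train,], Y[train], probability = FALSE)
  pred = as.numeric(predict(mod, X[test,]))
  aucVec[i] = roc(Y[test], pred)$auc  # skill score for this iteration only
}
mean(aucVec, na.rm = TRUE)  # hovers around 0.5 for random data, as expected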
I have been trying to calculate Cook's distance manually for a multiple linear regression dataset, but running into problems with the for loop. What I have been doing is this:
This is the original linear model and its fitted values, of length n.
fitted = lm10$fitted.values
This is the new n x n blank matrix I created to hold the new fitted values.
lev.mat <- matrix(rep(0, nrow(X.des)^2), nrow = nrow(X.des))
I wanted to save time, so I filled in the first column of the matrix manually.
newData = as.data.frame(X.des[-1,])
newModel = lm(fev~., data = newData - 1)
newFitted = newModel$fitted.values
newDist = c(fitted[1],newFitted)
lev.mat[,1] = newDist
I then tried to fill in the remaining columns of lev.mat similarly, using a for loop.
for(i in 2:nrow(lev.mat)){
  newData = as.data.frame(X.des[-i, ])
  newModel = lm(fev~., data = newData - 1)
  newFitted = newModel$fitted.values
  newDist = c(newFitted[1:(i-1)],fitted[i],newFitted[i:length(newFitted)])
  lev.mat[,i] = newDist
}
But I keep getting this error repeatedly:
Error in lev.mat[, i] <- newDist :
  number of items to replace is not a multiple of replacement length
I have been at this for three hours now, and it's getting frustrating. Can anybody point out the error and help me move along? My next steps are to calculate the difference between the original fitted values and each column of the new fitted-values matrix, sum the squared differences, and divide by the product of the number of predictors and the MSE.
Thanks!
Thanks a lot to @Harlan Nelson for providing me with a wonderful link! I used the background provided in the link to complete my work. Here is the rest of my code:
Hmat = hatvalues(lm10)
Leverage = Hmat/(1 - Hmat)
mse = (lm10$residuals)^2/var(lm10$residuals)
CooksD <- (1/6)*(mse)*Leverage
lm10 was the name of my linear model, and I had 6 predictors in the model. This helped me calculate Cook's Distance for the model. Thanks again!
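For anyone wanting to double-check the result: the textbook formula is D_i = e_i^2 / (p * MSE) * h_ii / (1 - h_ii)^2, where p is the number of estimated parameters, and base R implements it directly. A short sketch, assuming lm10 is the fitted model above:
p   <- lm10$rank                                   # number of estimated parameters
MSE <- sum(residuals(lm10)^2) / df.residual(lm10)  # residual mean square
h   <- hatvalues(lm10)
D   <- residuals(lm10)^2 / (p * MSE) * h / (1 - h)^2
all.equal(D, cooks.distance(lm10))                 # TRUE: matches the built-in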
I'm writing my own model validation function.
In doing so, I let
a=entire set of predictor variables in model-building set
b=set of response variable in model-building set
c=entire set of predictor variables in validation set
d=set of response variable in validation set
e=number of column which I have an interest
This is based on the book Applied Linear Regression Models (Kutner et al.), so I used library(ALSM).
In my case, the model-building set is SurgicalUnit and the validation set is SurgicalUnitAdditional.
Both datasets consist of 10 columns: columns 1 through 8 are the full set of independent variables, the 9th is the response variable, and the 10th is log(response variable).
So,
a=SurgicalUnit[,1:8]; b=SurgicalUnit[,10];
c=SurgicalUnitAdditional[,1:8]; d=SurgicalUnitAdditional[,10]; e=c(1,2,3,8)
since I want to fit the logged response variable, and I want to regress on variables x1, x2, x3 and x8.
(Please note that the reason I use the "entire" set of independent variables with specific column numbers, instead of passing the set of interesting independent variables directly, is that I need to obtain Mallow's Cp in my function at once.)
So my regression is asdf=lm(b~as.matrix(a[e])). The problem is that I want to predict the validation set with the model built on the model-building set. So I set preds=data.frame(c[e]) and call predict(asdf, newdata=preds), but this returns the same thing as predict(asdf), i.e. the fitted values of asdf.
Why doesn't predict work? Any help will be appreciated.
Below is my function
mod.valid=function(a,b,c,d,e){
  asdf=lm(b~as.matrix(a[e]))        # the model you want
  qwer=lm(b~as.matrix(a[1:max(e)])) # full model, needed for Cp
  mat=round(coef(summary(asdf))[,c(-3,-4)],4); mat2=matrix(0,5,2)
  mat=rbind(mat,mat2)               # matrix for coefficients and others (model-building)
  n=nrow(anova(asdf)); m=nrow(anova(qwer))
  nn=length(b)                      # sample size
  p=asdf$rank                       # number of parameters p
  cp=anova(asdf)$Sum[n] / (anova(qwer)$Mean[m]) - (nn-2*p); cp=round(cp,4)
  mat[p+1,1]=p; mat[p+1,2]=cp       # adding p and Cp
  rp=summary(asdf)$r.squared; rap=summary(asdf)$adj.r.squared; rp=round(rp,4); rap=round(rap,4)
  mat[p+2,1]=rp; mat[p+2,2]=rap     # adding Rp2 and Rap2
  sse=anova(asdf)$Sum[n]; pre=MPV::PRESS(asdf); sse=round(sse,4); pre=round(pre,4)
  mat[p+3,1]=sse; mat[p+3,2]=pre    # adding SSE and PRESS
  preds=data.frame(c[e]); predd=predict(asdf,newdata=preds)  # <-- I got the problem here!
  mspr=sum((d-predd)^2) / length(d); mse=anova(asdf)$Mean[n]; mspr=round(mspr,4); mse=round(mse,4)
  mat[p+4,1]=mse; mat[p+4,2]=mspr   # adding MSE and MSPR
  aic=nn*log(anova(asdf)$Sum[n]) - nn*log(nn) + 2*p; aic=round(aic,4)
  bic=nn*log(anova(asdf)$Sum[n]) - nn*log(nn) + log(nn)*p; bic=round(bic,4)
  mat[p+5,1]=aic; mat[p+5,2]=bic    # adding AIC and BIC
  rownames(mat)[p+1]="p&Cp"; rownames(mat)[p+2]="Rp.sq&Rap.sq"
  rownames(mat)[p+3]="SSE&PRESS"; rownames(mat)[p+4]="MSE&MSPR"; rownames(mat)[p+5]="AIC&BIC"
  asdf2=lm(d~as.matrix(c[e]))
  qwer2=lm(d~as.matrix(c[1:max(e)]))
  matt=round(coef(summary(asdf2))[,c(-3,-4)],4); matt2=matrix(0,5,2)
  matt=rbind(matt,matt2)            # matrix for coefficients and others (validation)
  n2=nrow(anova(asdf2)); m2=nrow(anova(qwer2))
  nn2=length(d)                     # validation sample size
  p2=asdf2$rank                     # number of parameters p
  cp2=anova(asdf2)$Sum[n2] / (anova(qwer2)$Mean[m2]) - (nn2-2*p2); cp2=round(cp2,4)
  matt[p2+1,1]=p2; matt[p2+1,2]=cp2 # adding p and Cp
  rp2=summary(asdf2)$r.squared; rap2=summary(asdf2)$adj.r.squared; rp2=round(rp2,4); rap2=round(rap2,4)
  matt[p2+2,1]=rp2; matt[p2+2,2]=rap2 # adding Rp2 and Rap2
  sse2=anova(asdf2)$Sum[n2]; pre2=MPV::PRESS(asdf2); sse2=round(sse2,4); pre2=round(pre2,4)
  matt[p2+3,1]=sse2; matt[p2+3,2]=pre2 # adding SSE and PRESS
  mse2=anova(asdf2)$Mean[n2]; mse2=round(mse2,4)
  matt[p2+4,1]=mse2; matt[p2+4,2]=NA  # adding MSE; MSPR is not defined for the validation fit
  aic2=nn2*log(anova(asdf2)$Sum[n2]) - nn2*log(nn2) + 2*p2; aic2=round(aic2,4)
  bic2=nn2*log(anova(asdf2)$Sum[n2]) - nn2*log(nn2) + log(nn2)*p2; bic2=round(bic2,4)
  matt[p2+5,1]=aic2; matt[p2+5,2]=bic2 # adding AIC and BIC
  mat=cbind(mat,matt); colnames(mat)=c("Estimate","Std.Error","Val.Estimate","Val.Std.Error")
  print(mat)
}
This function will provide useful statistics for model validation.
It returns a matrix with coefficients, p, Mallow's Cp, R.squared, R.adj.squared, SSE, PRESS, MSE, MSPR, AIC and BIC.
Everything works fine for typical input data, except for MSPR, because the predict function doesn't use the new data: it only returns the fitted values.
You could try something like this. You have to make sure that both the training data and the new data have the same column names.
x <- rnorm(100)
y <- x + rnorm(100)
df <- data.frame(x = x, y=y)
# model fitting
fit <- lm(y ~ x, data=df)
predict(fit)
# creating new data
newx <- rnorm(50)
newdf <- data.frame(x = newx)
# making predictions
predict(fit, newdata = newdf)
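Applied to the setup in the question, the same idea looks roughly like this (a sketch, assuming a, b, c, d, and e as defined above): fit on a named data frame rather than an as.matrix() term, so that predict() can match the validation columns by name.
train_df <- data.frame(a[e], y = b)             # model-building predictors plus response
val_df   <- data.frame(c[e])                    # validation predictors
names(val_df) <- names(train_df)[seq_along(e)]  # force identical column names
asdf  <- lm(y ~ ., data = train_df)
predd <- predict(asdf, newdata = val_df)        # now evaluated on the validation rows
mspr  <- sum((d - predd)^2) / length(d)         # MSPR as computed in the function above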
I'm working with a panel dataset (24 months of data for 210 DMAs). I'm trying to optimize the adstock decay factor for an independent variable by minimizing the standard error of a fixed effects model.
In this particular case, I want a decay factor that minimizes the SE of the adstock-transformed variable "SEM_Br_act_norm" in the model "Mkt_TRx_norm = b0 + b1*Mkt_TRx_norm_prev + b2*SEM_Br_act_norm_adstock".
So far, I've loaded the dataset in panel format using plm and created a function to generate the adstock values. The function also runs a fixed effects model on the adstock values and returns the SE. I then use optimize() to find the best decay value within the bounds (0,1). While my code does return an optimal value, I am worried something is wrong because it returns the same optimum (close to 1) for all the other variables as well.
I've attached a sample of my data, as well as key parts of my code. I'd greatly appreciate if someone could take a look and see what is wrong.
Sample Data
library(plm)   # panel data tools
library(plyr)  # for ddply()
# Set panel data structure
alldata <- plm.data(alldata, index = c("DMA", "Month_Num"))
alldata$var <- alldata$SEM_Br_act_norm +0
# Create 1 month time lag for TRx
alldata <- ddply(
alldata, .(DMA), transform,
# This assumes that the data is sorted
Mkt_TRx_norm_prev = c(NA,Mkt_TRx_norm[-length(Mkt_TRx_norm)])
)
# Create adstock function and obtain SE of regression
adstockreg <- function(decay, period, data_vector, pool_vector=0){
  data_vector <- alldata$var
  pool_vector <- alldata$DMA
  data2 <- data_vector
  l <- length(data_vector)
  # if no pool, apply zero to vector
  if(length(pool_vector)==1) pool_vector <- rep(0,l)
  # outer loop: extract data to decay from observation i
  for(i in 1:l){
    x <- data_vector[i]
    # inner loop: apply decay onto following observations after i
    for(j in 1:min(period,l)){
      # constrain decay to same pool (if data is pooled)
      if(pool_vector[i]==pool_vector[min(i+j,l)]){data2[(i+j)] <- data2[(i+j)]+(x*(decay)^j)}
    }
  }
  # reduce length of edited data to equal length of initial data
  data2 <- data2[1:l]
  # regression - excludes NA values
  alldata <- plm.data(alldata, index = c("DMA", "Month_Num"))
  var_fe <- plm(alldata$Mkt_TRx_norm ~ alldata$Mkt_TRx_norm_prev + data2, data = alldata, model = "within", na.action = na.exclude)
  se <- summary(var_fe)$coefficients["data2","Std. Error"]
  return(se)
}
# Optimize decay for adstock variable
result <- optimize(adstockreg, interval=c(0,1), period = 6)
print(result)
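As an aside on the transform itself: the usual geometric adstock recursion, adstock_t = x_t + decay * adstock_{t-1}, can be computed without explicit loops using stats::filter(). A minimal sketch under the question's column names (unlike the loop above, it does not cut the carry-over off after period months; the SEM_adstock name and the decay of 0.5 are just for illustration):
adstock <- function(x, decay){
  # recursive filter: y[t] = x[t] + decay * y[t-1]
  as.numeric(stats::filter(x, filter = decay, method = "recursive"))
}
# apply within each DMA so carry-over never crosses panel units
alldata$SEM_adstock <- ave(alldata$SEM_Br_act_norm, alldata$DMA,
                           FUN = function(x) adstock(x, decay = 0.5))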
I am having some trouble storing ARIMA predictions in an empty vector. The problem is that an ARIMA prediction gives you both predictions and standard errors: two columns of values. I cannot seem to store the values in an empty vector. I tried to create two empty vectors and bind them together, but it did not solve the problem.
My intention is to simulate 1000 observations, then use the first 900 observations to make 100 predictions, with the set of observations updating each time. For example, use 900 observations to predict the value of the 901st observation; then use 901 observations, including the predicted 901st observation, to predict the 902nd; repeat until you use 999 observations to predict the 1000th. I hope to figure out how to store multiple values in a vector.
The empty vector I hope to contain 100 predictions is called Predictions1.
# Create Arima Series #
ArimaSeries1 = arima.sim(n=1000, list(ar=c(0.99), ma=c(0.1)))+50
ts.plot(ArimaSeries1)
acf(ArimaSeries1)
ArimaSeries2 = arima.sim(n=1000, list(ar=c(0.7,0.2), ma=c(0.1,0.1)))+50
ts.plot(ArimaSeries2)
acf(ArimaSeries2)
ArimaSeries3 = arima.sim(n=1000, list(ar=c(0.6,0.2,0.1), ma=c(0.1,0.1,0.1)))+50
ts.plot(ArimaSeries3)
acf(ArimaSeries3)
# Estimate Arima Coefficients using maximum likehood #
ARC1 = arima(ArimaSeries1, order = c(1,0,1))
ARC2 = arima(ArimaSeries2, order = c(2,0,2))
ARC3 = arima(ArimaSeries3, order = c(3,0,3))
# Estimate Arima Coefficients with 900 observations #
AR1 = arima(ArimaSeries1[1:900], order = c(1,0,1))
AR2 = arima(ArimaSeries2[1:900], order = c(2,0,2))
AR3 = arima(ArimaSeries3[1:900], order = c(3,0,3))
# Create for-loop to make one prediction ahead for 100 times #
PredictionsA = rep(0,100)
PredictionsB = rep(0,100)
Predictions1 = cbind(PredictionsA,PredictionsB)
for(a in 1:100){ Forcasting1 = predict(arima(ArimaSeries1[1:900+a], order=c(1,0,1)), n.ahead=1)}
Predictions1[a] = Forcasting1
R would give me this error message:
Warning message: In Predictions1[a] = Forcasting1 : number of items
to replace is not a multiple of replacement length
I would be grateful for any suggestions. Any explanations on where I went wrong is also appreciated. Thank you for your time.
Maybe something like this:
Predictions1 <- array(NA, c(100,2))
for(a in 1:100){
  # 1:(899+a) gives the expanding window you describe; note 1:900+a would mean (1:900)+a
  Forcasting1 = predict(arima(ArimaSeries1[1:(899+a)], order=c(1,0,1)), n.ahead=1)
  Predictions1[a,] = unlist(Forcasting1)  # column 1: prediction, column 2: standard error
}
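Since predict() on an arima fit returns a list with components pred and se, the two pieces can also be stored in explicitly named columns; a variant under the same setup:
preds <- numeric(100); ses <- numeric(100)
for(a in 1:100){
  fc <- predict(arima(ArimaSeries1[1:(899 + a)], order = c(1,0,1)), n.ahead = 1)
  preds[a] <- fc$pred[1]  # one-step-ahead forecast of observation 900 + a
  ses[a]   <- fc$se[1]    # its standard error
}
Predictions1 <- cbind(pred = preds, se = ses)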