Classification table for logistic regression in R

I have a data set consisting of a dichotomous dependent variable (Y) and 12 independent variables (X1 to X12) stored in a CSV file. Here are the first 5 rows of the data:
Y,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12
0,9,3.86,111,126,14,13,1,7,7,0,M,46-50
1,7074,3.88,232,4654,143,349,2,27,18,6,M,25-30
1,5120,27.45,97,2924,298,324,3,56,21,0,M,31-35
1,18656,79.32,408,1648,303,8730,286,294,62,28,M,25-30
0,3869,21.23,260,2164,550,320,3,42,203,3,F,18-24
I constructed a logistic regression model from the data using the following code:
mydata <- read.csv("data.csv")
mylogit <- glm(Y~X1+X2+X3+X4+X5+X6+X7+X8+X9+X10+X11+X12, data=mydata,
family="binomial")
mysteps <- step(mylogit, Y~X1+X2+X3+X4+X5+X6+X7+X8+X9+X10+X11+X12, data=mydata,
family="binomial")
I can obtain the predicted probabilities for each observation using the code:
theProbs <- fitted(mysteps)
Now, I would like to create a classification table, using the first 20 rows of the data table (mydata), from which I can determine the percentage of the predicted probabilities that actually agree with the data. Note that for the dependent variable (Y), a predicted probability below 0.5 corresponds to class 0 and a predicted probability above 0.5 corresponds to class 1.
I have spent many hours trying to construct the classification table without success. I would appreciate it very much if someone could suggest code that helps solve this problem.

The question is a bit old, but I figure if someone is looking through the archives, this may help.
This is easily done with xtabs:
classDF <- data.frame(response = mydata$Y, predicted = round(fitted(mysteps),0))
xtabs(~ predicted + response, data = classDF)
which will produce a table like this:
         response
predicted   0   1
        0 339 126
        1 130 394
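The question also asks for the percentage of agreement, restricted to the first 20 rows. A minimal sketch along the same lines, assuming mydata and mysteps from the question and that no rows were dropped for missing values:
first20 <- data.frame(response  = mydata$Y[1:20],
                      predicted = round(fitted(mysteps)[1:20]))
xtabs(~ predicted + response, data = first20)        # classification table, first 20 rows
100 * mean(first20$predicted == first20$response)    # percent agreement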

I think 'round' can do the job here.
table(round(theProbs))
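That only tabulates the predicted classes on their own. If you want them cross-tabulated against the observed response, a small extension of the same idea (assuming theProbs and mydata$Y line up row for row):
table(observed = mydata$Y, predicted = round(theProbs))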


Testing and adjusting for autocorrelation / serial correlation

Unfortunately I'm not able to provide a reproducible example, but hopefully you get the idea regardless.
I am conducting some regression analyses where the dependent variable is a DCC (dynamic conditional correlation) of a pair of return series, two stocks. I'm using dummies to represent shocks in the return series, i.e. the worst 1% of observed returns. In sum:
DCC = c + 1%Dummy
When I run durbinWatsonTest I get the output:
Autocorrelation: 0.9987
D-W statistic: 0
p-value: 0
HA: rho != 0
Does this just mean that there is a highly significant presence of autocorrelation?
I also tried dwtest, but that yields NA values for both the p-value and the DW statistic.
To correct for autocorrelation I used the code:
library(lmtest)    # for coeftest()
library(sandwich)  # for vcovHC()
spx10 = lm(bit_sp500 ~ Spx_0.1)
spx10_hc = coeftest(spx10, vcov. = vcovHC(spx10, method = "arellano", type = "HC3"))
How can I be certain that it had any effect, since I cannot run the DW test on spx10_hc, nor did the regression output change noticeably? Is it common that a regression with one independent variable changes only slightly when adjusting for autocorrelation?
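One way to check whether the correction did anything is to print the naive and the robust standard errors side by side. A minimal sketch, assuming the lmtest and sandwich packages and the spx10 model above; vcovHAC is shown here as one autocorrelation-consistent estimator, not necessarily the one the poster intended:
library(lmtest)    # coeftest()
library(sandwich)  # vcovHAC()
naive  <- coeftest(spx10)                          # ordinary OLS standard errors
robust <- coeftest(spx10, vcov. = vcovHAC(spx10))  # HAC-robust standard errors
cbind(naive_se = naive[, "Std. Error"], robust_se = robust[, "Std. Error"])
The point estimates are identical by construction, so with a single regressor it is entirely normal that only the standard errors (and hence the t- and p-values) move.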

Identify the outliers with the highest squared residuals under the Linear regression model in R

I have a data set [1000 x 80] of 1000 data points, each with 80 variable values. I have to linearly regress two variables, price and area, and identify the 5 data points that have the highest squared residuals. For these identified data points, I have to display 4 of the 80 variable values.
I do not know how to use the residuals to identify the original data points. All I have at the moment is:
model_lm <- lm(log(price) ~ log(area), data = ames)
Can I please get some guidance on how I can approach the above problem?
The model_lm object will contain a component called 'residuals' that holds the residuals in the same order as the original observations. If I'm understanding the question correctly, an easy way to do this in base R is:
ames$residuals <- model_lm$residuals              ## Add the residuals to the data.frame
o <- order(ames$residuals^2, decreasing = TRUE)   ## Reorder to put the largest squared residuals first
ames[o[1:5], ]                                    ## Return the five rows with the largest squared residuals
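To display only 4 of the 80 variables for those five observations, subset the columns as well. The names below are placeholders; swap in whichever four variables you actually need:
vars_to_show <- c("price", "area", "var3", "var4")   # hypothetical column names
ames[o[1:5], vars_to_show]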

glm summary not giving coefficient values

I'm trying to apply glm to a given dataset, but summary(model1) is not giving me the correct output: it doesn't show values for Estimate, Std. Error, z value, Pr(>|z|), etc.; it just gives NA for the individual attributes.
TEXT <- c('Learned a new concept today : metamorphic testing. t.co/0is1IUs3aW','BMC Bioinformatics BioMed Central: Detecting novel ncRNAs by experimental #RNomics is not an easy task... http:/t.co/ui3Unxpx #bing #MyEN','BMC Bioinformatics BioMed Central: small #RNA with a regulatory function as a scientific ... Detecting novel… http:/t.co/wWHOEkR0vc #bing','True or false? link(#Addition, #Classification) http:/t.co/zMJuTFt8iq #Oxytocin','Biologists do have a sense of humor, especially computational bio people http:/t.co/wFZqaaFy')
NAME <- c('QSoft Consulting','Fabrice Leclerc','Sungsam Gong','Frederic','Zach Stednick')
SCREEN_NAME <-c ('QSoftConsulting','rnomics','sunggong','rnomics','jdwasmuth')
FOLLOWERS_COUNT <- c(734,1900,234,266,788)
RETWEET <- c(1,3,5,0,2)
FRIENDS_COUNT <-c(34,532,77,213,422)
STATUSES_COUNT <- c(234,643,899,222,226)
FAVOURITES_COUNT <- c(144,2677,445,930,254)
df <- data.frame(TEXT,NAME,SCREEN_NAME,RETWEET,FRIENDS_COUNT,STATUSES_COUNT,FAVOURITES_COUNT)
mydata<-df
mydata$FAVOURITES_COUNT <- ifelse( mydata$FAVOURITES_COUNT >= 445, 1, 0) #converting fav_count to binary values
Splitting data
library(caret)
split=0.60
trainIndex <- createDataPartition(mydata$FAVOURITES_COUNT, p=split, list=FALSE)
data_train <- mydata[ trainIndex,]
data_test <- mydata[-trainIndex,]
glm model
library(e1071)
model1 <- glm(FAVOURITES_COUNT~.,family = binomial, data = data_train)
summary(model1)
I want to get the p-values for further analysis. So far I think my code is right; how can I get the correct output?
A binomial distribution will only work if the dependent variable has two outcomes. You should consider a Poisson distribution when the dependent variable is a count. See here for more details: http://www.statmethods.net/advstats/glm.html
Your code for fitting the GLM is programmatically correct. However, there are a few issues:
As mentioned in the comments, for every variable that is categorical, you should use as.factor() to make it into a factor. GLM doesn't know what a "string" variable is.
As MorganBall indicated, if your data truly is count data, you may consider fitting it using a Poisson GLM, instead of converting to binary and using Logistic regression.
You indicate that you have 13 parameters and 1000 observations. While this may seem like enough data, note that some levels of your categorical variables may have very few (close to zero) observations. This is a problem.
In addition, did you make sure that your data does not perfectly separate the response? Because if there are some combinations of parameters that do separate the response perfectly, the maximum likelihood estimate won't converge and theoretically goes to infinity. Practically speaking, you'll get very large standard errors for your estimates.
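Putting those points together, a minimal sketch using the example data from the question: the free-text columns (TEXT, NAME, SCREEN_NAME) have one unique value per row here and cannot be used as predictors, so they are left out, and the raw count is modelled with a Poisson GLM instead of the 0/1 recode. With only five example rows this is purely illustrative; genuinely categorical columns with repeated levels would instead be wrapped in as.factor().
model_pois <- glm(FAVOURITES_COUNT ~ RETWEET + FRIENDS_COUNT + STATUSES_COUNT,
                  family = poisson, data = df)   # df still holds the raw counts
summary(model_pois)   # full coefficient table: estimates, SEs, z values, p-values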

how to create many linear models at once and put the coefficients into a new matrix?

I have 365 columns. In each column I have 60 values. I need to know the rate of change over time for each column (the slope, or linear coefficient). I created a generic column as a series of numbers from 1:60 to represent the 60 corresponding time intervals. I want to create 365 linear regression models, using the generic time-stamp column with each of the 365 columns of data.
In other words, I have many columns and I would like to create many linear regression models at once, extract the coefficients and put those coefficients into a new matrix.
First of all, statistically this might not be the best possible approach to analysing temporal data. That said, for the approach you propose it is very simple to build a loop to obtain this:
Coefs <- matrix(NA, ncol(Data), 2)  # Assuming your generic 1:60 column is not in the same object
for (i in 1:ncol(Data)) {
  Coefs[i, ] <- lm(Data[, i] ~ GenericColumn)$coefficients
}
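For that loop to run, Data needs to hold the 365 series as columns and GenericColumn the 1:60 time index. A minimal reproducible setup with made-up numbers:
set.seed(1)
Data <- as.data.frame(matrix(rnorm(60 * 365), nrow = 60))  # 60 rows x 365 fake series
GenericColumn <- 1:60                                      # generic time index
Coefs <- matrix(NA, ncol(Data), 2)
for (i in 1:ncol(Data)) {
  Coefs[i, ] <- lm(Data[, i] ~ GenericColumn)$coefficients
}
head(Coefs)   # column 1: intercepts, column 2: slopes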
Here's a way to do it:
# Fake data
dat = data.frame(x = 1:60, y1 = rnorm(60), y2 = rnorm(60),
                 y3 = rnorm(60))

t(sapply(names(dat)[-1], function(var) {
  coef(lm(dat[, var] ~ x, data = dat))
}))
(Intercept) x
y1 0.10858554 -0.004235449
y2 -0.02766542 0.005364577
y3 0.20283168 -0.008160786
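If all the responses sit in one data frame, lm() also accepts a matrix on the left-hand side and fits every column in a single call, which avoids the explicit loop. A sketch with the same fake data:
fit_all <- lm(as.matrix(dat[, -1]) ~ x, data = dat)  # one fit per response column
t(coef(fit_all))                                     # one row per response: intercept and slope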

plotting glm interactions: "newdata=" structure in predict() function

My problem is with the predict() function, its structure, and plotting the predictions.
Using the predictions coming from my model, I would like to visualize how my significant factors (and their interaction) affect the probability of my response variable.
My model:
m1 <- glm(mating ~ behv * pop + I(behv^2) * pop + condition,
          data = data1, family = binomial(logit))
mating: individual has mated or not (factor, binomial: 0,1)
pop: population (factor, 4 levels)
behv: behaviour (numeric, scaled & centered)
condition: relative fat content (numeric, scaled & centered)
Significant effects after running the glm:
pop1
condition
behv*pop2
behv^2*pop1
Although I have read the help pages, previous answers to similar questions, tutorials etc., I couldn't figure out how to structure the newdata= part in the predict() function. The effects I want to visualise (given above) might give a clue of what I want: For the "behv*pop2" interaction, for example, I would like to get a graph that shows how the behaviour of individuals from population-2 can influence whether they will mate or not (probability from 0 to 1).
Really, the only thing predict() expects is that the column names in newdata exactly match the names used in the model formula, and that you supply a value for each of your predictors. Here's some sample data.
#sample data
set.seed(16)
data <- data.frame(
  mating = sample(0:1, 200, replace = TRUE),
  pop = sample(letters[1:4], 200, replace = TRUE),
  behv = scale(rpois(200, 10)),
  condition = scale(rnorm(200, 5))
)
data1<-data[1:150,] #for model fitting
data2<-data[51:200,-1] #for predicting
Then this will fit the model using data1 and predict into data2
model <- glm(mating ~ behv * pop + I(behv^2) * pop + condition,
             data = data1, family = binomial(logit))
predict(model, newdata=data2, type="response")
Using type="response" will give you the predicted probabilities.
Now, to make predictions, you don't have to use a subset of the exact same data.frame. You can create a new one to investigate a particular range of values (just make sure the column names match up). So in order to explore behv*pop2 (or behv*popb in my sample data), I might create a data.frame like this:
popbbehv<-data.frame(
pop="b",
behv=seq(from=min(data$behv), to=max(data$behv), length.out=100),
condition = mean(data$condition)
)
Here I fix pop="b" so I'm only looking at that population, and since I have to supply condition as well, I fix it at the mean of the original data. (I could have just put in 0, since the data is centered and scaled.) Now I specify a range of behv values I'm interested in; here I just took the range of the original data and split it into 100 points, which gives me enough points to plot. So again I use predict to get
popbbehvpred<-predict(model, newdata=popbbehv, type="response")
and then I can plot that with
plot(popbbehvpred~behv, popbbehv, type="l")
Although nothing is significant in my fake data, we can see that higher behavior values seem to result in less mating for population B.
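To compare all four populations on one set of axes, the same idea can be repeated in a loop over the pop levels, predicting over the behv range for each one. A sketch building on the objects created above (behv_seq and pops are new names introduced here):
behv_seq <- seq(min(data$behv), max(data$behv), length.out = 100)
pops <- sort(unique(as.character(data$pop)))
plot(NULL, xlim = range(behv_seq), ylim = c(0, 1),
     xlab = "behv (scaled)", ylab = "predicted probability of mating")
for (i in seq_along(pops)) {
  nd <- data.frame(pop = pops[i], behv = behv_seq, condition = mean(data$condition))
  lines(behv_seq, predict(model, newdata = nd, type = "response"), lty = i)
}
legend("topright", legend = pops, lty = seq_along(pops))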
