I am trying to make a confusion matrix for a logistic regression model, and no matter what I do, the table() function leaves out the bottom row. To demonstrate my work, I can provide sample data below; my real data is very bulky and stored locally on my computer. The catch is that this simulation, which is based on my exact code, works properly. The difference might be size: this simulation is tiny, while my real data has 20,000+ rows.
set.seed(1)
a<-runif(10)
b<-runif(10)
c<-rnorm(10)
sample_outcome<-sample(c(0,1), replace=TRUE, size=10)
sample.df<-data.frame(a,b,c,sample_outcome)
s_logistic <- glm(formula = sample_outcome~., data=sample.df, family=binomial)
s_probs <- predict(s_logistic, type="response")
s_predict <-rep(0,nrow(sample.df))
s_predict[s_probs>.5]=1
table(s_predict,sample.df$sample_outcome)
s_predict 0 1
        0 1 0
        1 2 7
In my actual data, the bottom row, the one that corresponds to "1,2,7" is ALWAYS missing. Any idea what might be going on here?
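One likely explanation (an assumption on my part, since the real data isn't shown): if none of the 20,000+ fitted probabilities exceed 0.5, then s_predict contains only zeros, and table() simply has no "1" row to print. A minimal sketch of how to force both rows to appear, reusing the objects from the simulation above:
s_predict <- factor(ifelse(s_probs > 0.5, 1, 0), levels = c(0, 1))
# factor() keeps the "1" level even when it never occurs, so table()
# prints both rows (one of them possibly all zeros)
table(s_predict, sample.df$sample_outcome)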
I am carrying out a split-plot experiment in microbiology.
I set up 3 blocks in total: A, B, C;
each block contains 2 incubators, with temperatures set at 19 and 31 °C, respectively.
In each incubator, 2 replicate microorganism samples are cultured (a, b).
Now I want to compare the density of microorganisms between generation 27 (which I collected and stocked a year ago) and generation 2400 (which I collected and stocked now).
[split-plot experiment diagram]
[my data]
I wrote this code, and it runs:
library(nlme)  # lme() comes from the nlme package
modele.ed <- lme(density ~ temperature*generation, random = ~1|block/temperature/generation, na.action = na.omit, data = datae)
but it looks like it's wrong. I still don't know how to deal with "generation". What is the right code?
My first reaction is to remove 'generation' from the nested random effects (because I don't think it's nested inside each experiment); it should probably be 'replicate' in there, thus:
modele.ed <- lme ( density ~ temperature*generation, random = ~1|block/temperature/replicate, na.action = na.omit, data = datae)
Whether or not there are enough levels for that nested random effect to work is another question. To check the validity of your experimental design, you would do better to post on http://StackExchange.com (for stats) as opposed to here, which is mainly for code.
Also consider using library(lme4) and its lmer() function, as it is more popular and its formula syntax is easier to work with.
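For reference, a minimal sketch of what the lmer() equivalent of the model above might look like, assuming the same data frame and the replicate-within-block structure suggested here (an illustration, not a validated model for your design):
library(lme4)
# (1 | block/temperature/replicate) nests replicate within temperature within block
# as random intercepts, mirroring the nlme random = ~1|... specification above
modele.ed2 <- lmer(density ~ temperature * generation + (1 | block/temperature/replicate),
                   na.action = na.omit, data = datae)
summary(modele.ed2)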
My problem is with the predict() function, its structure, and plotting the predictions.
Using the predictions coming from my model, I would like to visualize how my significant factors (and their interaction) affect the probability of my response variable.
My model:
m1 <-glm ( mating ~ behv * pop +
I(behv^2) * pop + condition,
data=data1, family=binomial(logit))
mating: individual has mated or not (factor, binomial: 0,1)
pop: population (factor, 4 levels)
behv: behaviour (numeric, scaled & centered)
condition: relative fat content (numeric, scaled & centered)
Significant effects after running the glm:
pop1
condition
behv*pop2
behv^2*pop1
Although I have read the help pages, previous answers to similar questions, tutorials etc., I couldn't figure out how to structure the newdata= part in the predict() function. The effects I want to visualise (given above) might give a clue of what I want: For the "behv*pop2" interaction, for example, I would like to get a graph that shows how the behaviour of individuals from population-2 can influence whether they will mate or not (probability from 0 to 1).
Really the only thing that predict expects is that the names of the columns in newdata exactly match the column names used in the formula. And you must have values for each of your predictors. Here's some sample data.
#sample data
set.seed(16)
data <- data.frame(
mating=sample(0:1, 200, replace=T),
pop=sample(letters[1:4], 200, replace=T),
behv = scale(rpois(200,10)),
condition = scale(rnorm(200,5))
)
data1<-data[1:150,] #for model fitting
data2<-data[51:200,-1] #for predicting
Then this will fit the model using data1 and predict into data2
model<-glm ( mating ~ behv * pop +
I(behv^2) * pop + condition,
data=data1,
family=binomial(logit))
predict(model, newdata=data2, type="response")
Using type="response" will give you the predicted probabilities.
Now, to make predictions, you don't have to use a subset of the exact same data.frame. You can create a new one to investigate a particular range of values (just make sure the column names match up). So in order to explore behv*pop2 (or behv*popb in my sample data), I might create a data.frame like this
popbbehv<-data.frame(
pop="b",
behv=seq(from=min(data$behv), to=max(data$behv), length.out=100),
condition = mean(data$condition)
)
Here I fix pop="b" so I'm only looking at that population, and since I have to supply condition as well, I fix it at the mean of the original data. (I could have just put in 0 since the data is centered and scaled.) Then I specify a range of behv values I'm interested in; here I took the range of the original data and split it into 100 points, which gives me enough points to plot. So again I use predict to get
popbbehvpred<-predict(model, newdata=popbbehv, type="response")
and then I can plot that with
plot(popbbehvpred~behv, popbbehv, type="l")
Although nothing is significant in my fake data, we can see that higher behavior values seem to result in less mating for population B.
I have already completed a CCA plot which shows 7 sites, about 15 species, and 6 environmental variables. However, it is saying that the unconstrained axis is 0, and I cannot run an ANOVA on my CCA results to see the significance of the axes. I also attempted to use the spenvcor function to look at the species-environment correlation, and it is giving me 1's for all of the axes.
So I am definitely doing something wrong, but I just can't figure out what; any help would be much appreciated!
Here is my code:
library(vegan)  # cca() and spenvcor() come from the vegan package
MayEnviro <- read.csv("MayEnviro.csv", header=TRUE)
MaySpecies <- read.csv("MaySpecies.csv", header=TRUE)
t <- cca(MaySpecies,
         MayEnviro[, c("AFDM","Chla","Chloride","TSS","TN","TP","Velocity")])
spenvcor(t)
The number of axes you can derive from a data set with n = 7 sites, m = 15 species is min(n, m) - 1, which is 6. As you also have 6 constraints (the environmental variables) you explain the data exactly and there is no residual variance to work with. In fact there are no constraints on the solution and the result is just like CA.
In this instance, with so few sites, you should look to fit a model with fewer constraints, say 2 or 3 at most.
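For illustration, a minimal sketch of what a reduced model might look like (the choice of TN and TP here is arbitrary, purely to show the formula interface with fewer constraints than sites allow):
# Hypothetical refit with only 2 constraints, leaving residual (unconstrained)
# variance available for permutation tests
t2 <- cca(MaySpecies ~ TN + TP, data = MayEnviro)
anova(t2, by = "axis")   # permutation test of each constrained axis
spenvcor(t2)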
I have a very basic question regarding the input of weighted data into R. Currently I have to process data (mostly for curve-fitting purposes) similar to the following:
> head(mydata, 10)
v sf
1 0.3003434 3.933106
2 0.3027852 5.947432
3 0.3052270 9.832596
4 0.3076688 12.927439
5 0.3101106 14.197519
6 0.3125525 13.572904
7 0.3149943 11.691078
8 0.3174361 9.543095
9 0.3198779 8.048558
10 0.3223197 7.660252
The first column is the data (increasing and equidistant), while the second column gives the frequency (weights). Currently these weights don't add up to one, but I can easily fix that.
Now, I searched for weighted data in R, and the closest I found was the survey package and its svydesign() command, but is it really that hard?
What I did to work around my lack of knowledge, and what got me in trouble with the Kolmogorov-Smirnov test (more below), is the following:
> y <- with(mydata, c(rep(v, times=floor(10*sf))))
which will repeat the elements of the first column in proportion to the corresponding weight (times 10 to get a whole number). But now the problem is that when I conduct the Kolmogorov-Smirnov goodness-of-fit test, I get a warning that the p-value cannot be computed since the data has ties.
Question is: How can I input and process the data in its original form (i.e. as a frequency or probability table) for the purpose of curve fitting? Thanks.
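In case it helps frame the question, here is a minimal sketch of keeping the data as a value/weight table and computing weighted summaries directly in base R (this is only an illustration of working with the original form, not an answer to the curve-fitting part):
w <- mydata$sf / sum(mydata$sf)          # normalise weights to sum to one
wmean <- weighted.mean(mydata$v, w)      # weighted mean of the values
wecdf <- cumsum(w)                       # weighted empirical CDF at each v
plot(mydata$v, wecdf, type = "s", xlab = "v", ylab = "weighted ECDF")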
I have a data set consisting of a dichotomous dependent variable (Y) and 12 independent variables (X1 to X12) stored in a csv file. Here are the first 5 rows of the data:
Y,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12
0,9,3.86,111,126,14,13,1,7,7,0,M,46-50
1,7074,3.88,232,4654,143,349,2,27,18,6,M,25-30
1,5120,27.45,97,2924,298,324,3,56,21,0,M,31-35
1,18656,79.32,408,1648,303,8730,286,294,62,28,M,25-30
0,3869,21.23,260,2164,550,320,3,42,203,3,F,18-24
I constructed a logistic regression model from the data using the following code:
mydata <- read.csv("data.csv")
mylogit <- glm(Y~X1+X2+X3+X4+X5+X6+X7+X8+X9+X10+X11+X12, data=mydata,
family="binomial")
mysteps <- step(mylogit, Y~X1+X2+X3+X4+X5+X6+X7+X8+X9+X10+X11+X12, data=mydata,
family="binomial")
I can obtain the predicted probabilities for each data using the code:
theProbs <- fitted(mysteps)
Now, I would like to create a classification table, using the first 20 rows of the data table (mydata), from which I can determine the percentage of the predicted probabilities that actually agree with the data. Note that for the dependent variable (Y), 0 corresponds to a predicted probability of less than 0.5, and 1 corresponds to a probability of greater than 0.5.
I have spent many hours trying to construct the classification table without success. I would appreciate it very much if someone could suggest code that helps solve this problem.
The question is a bit old, but I figure that if someone is looking through the archives, this may help.
This is easily done with xtabs():
classDF <- data.frame(response = mydata$Y, predicted = round(fitted(mysteps),0))
xtabs(~ predicted + response, data = classDF)
which will produce a table like this:
         response
predicted   0   1
        0 339 126
        1 130 394
I think 'round' can do the job here.
table(round(theProbs))
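If the goal is the agreement percentage asked about above, one possible extension (assuming fitted(mysteps) is aligned row-for-row with mydata, which may not hold if rows were dropped as NA) is to cross-tabulate the rounded predictions against the observed outcomes and take the mean agreement:
# Hypothetical check on the first 20 rows only, per the original question
table(predicted = round(theProbs[1:20]), observed = mydata$Y[1:20])
mean(round(theProbs[1:20]) == mydata$Y[1:20])   # proportion of agreement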