#First, we'll create a fake dataset of 20 individuals of different body sizes:
bodysize=rnorm(20,30,2) # generates 20 values, with mean of 30 & s.d.=2
bodysize=sort(bodysize) # sorts these values in ascending order.
survive=c(0,0,0,0,0,1,0,1,0,0,1,1,0,1,1,1,0,1,1,1) # assign 'survival' to these 20 individuals non-randomly... most mortality occurs at smaller body size
dat=as.data.frame(cbind(bodysize,survive)) # saves a data frame with two columns: body size & survival
dat # just shows you what your dataset looks like
plot(bodysize,survive,xlab="Body size",ylab="Probability of survival") # plot with body size on x-axis and survival (0 or 1) on y-axis
g=glm(survive~bodysize,family=binomial,dat) # run a logistic regression model (in this case, generalized linear model with logit link). see ?glm
curve(predict(g,data.frame(bodysize=x),type="resp"),add=TRUE) # draws a curve based on prediction from logistic regression model
points(bodysize,fitted(g),pch=20) # optional (you could skip this): adds points at the fitted survival probabilities from the glm model; pch= changes the type of dots (20 = small filled circles)
I want to know where the 'x' (in data.frame(bodysize=x)) comes from.
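A note on that question: curve() generates the x values itself. Its first argument is an expression written in terms of a variable literally named x; curve() builds a grid of x values over the plotting range (with add=TRUE, the existing x-axis, so here a grid of body sizes) and evaluates the expression at each grid point. A small standalone illustration:
curve(x^2, from=0, to=5) # curve() creates the grid of x values between 0 and 5 and plots x^2 at each point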
I used R to compute a variable importance plot for a random forest model. When I looked at the mean decrease in Gini plot, I was confused by the numbers on the x-axis. Why are they so large? I thought they were supposed to represent a percentage decrease. The model ran normally.
[variable importance plot]
I used the following code to create the plot:
Plot <- importance(rf_best_model, scale=TRUE)
varImp.rf <- dotchart(sort(Plot[,4]), xlim=c(0,10000), xlab="MeanDecreaseGini")
varImp.rf
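For reference, here is a minimal reproducible sketch of the same kind of plot; rf_best_model and its data are not shown, so this assumes the randomForest package and uses iris as a stand-in. Note that MeanDecreaseGini is the total decrease in node impurity from splits on each variable, averaged over all trees, so its magnitude depends on the size of the data rather than being a percentage.
library(randomForest)
set.seed(1)
rf_iris <- randomForest(Species ~ ., data=iris, importance=TRUE)
imp <- importance(rf_iris) # importance matrix, one row per predictor
dotchart(sort(imp[, "MeanDecreaseGini"]), xlab="MeanDecreaseGini") # select the Gini column by name rather than by position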
I have trained SVM classification models based on probability predictions for recognising the digits 0-9.
I have a visualization of the probability for every model; it looks like this for the digit 0 (the probability data are in the variable prediction0).
Then I trained the final classifier and got 1423 correct observations (out of 1499); I have a vector c containing the correctly predicted digits.
What I need to do is: whenever a 0 was correctly predicted in vector c, mark that point in red on this graph. If it helps, I have "ck" containing the probabilities of every digit prediction for every test sample, from which I take the maximum probability as my final prediction.
You can do this by using the col argument. I'll use the mtcars dataset as an example:
plot(
mpg~disp,
data=mtcars,
col=ifelse(mtcars$am==0,"red","blue")
)
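Applied to the question's setup, a rough sketch might look like the following; this assumes prediction0 is a numeric vector of probabilities for the digit 0, ordered by test sample, and it invents a logical vector correct_zero marking the samples where a 0 was correctly predicted (it would have to be built from your own vectors c/ck and the true labels):
plot(prediction0, ylab="Probability of digit 0", pch=20,
col=ifelse(correct_zero, "red", "black")) # correct_zero is a hypothetical logical vector, one entry per test sample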
I'm running a meta-analysis of prevalence rates using the metafor package.
The dataset is named 'metaAAS' and it has the study's author/year, the number of participants exposed to the condition (ni) and the number of those presenting the symptom (nphy).
I'm using the following code to call the dataset from a csv file:
metaAAS <- read.csv("metaAAS.csv")
dat <- metaAAS # copy the data frame (get() expects a character string, so get(metaAAS) would throw an error)
Then I'm using the function 'escalc' to calculate effect sizes and outcome measures. I am using the argument 'PR' because the studies measure a dichotomous variable (either participants present the symptom or not):
metaAAS <- escalc(measure="PR", xi=nphy, ni=ni, data=metaAAS, slab=paste (authors, year))
And this code to run the meta-analysis and show the results:
res <- rma(yi, vi, data=metaAAS)
res
confint(res)
My model results show this:
estimate      se     zval     pval    ci.lb    ci.ub
  0.3497  0.0352   9.9428   <.0001   0.2807   0.4186
How can I change the results to percentages (e.g. 34.97) instead of proportions (0.3497)?
It would be easy just to change this in the description of the results, but the whole Forest Plot shows the estimated prevalence as proportions, and I need them as percentages.
Any thoughts? Thanks!
I didn't manage to change the 'model results' output, but I changed the Forest Plot, which was the main goal.
First, create a function that multiplies values by 100:
mytransf <- function(x) x * 100
Then pass this function to the forest plot's transf argument:
forest(res, ylim=c(-1,28), transf=mytransf)
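As an aside, metafor's predict() also takes a transf argument, so the same function should rescale the model estimate and confidence interval too (a sketch, not verified against this dataset):
predict(res, transf=mytransf) # estimate and CI reported as percentages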
I am interested in running some multivariable linear regression data simulations to try out some new statistical methods before I use them on my real dataset, where I regress an outcome (either continuous or categorical) on a set of predictor variables.
The goal would be to generate data with three fake exposures, and an outcome, with the option of setting the beta estimate for the relationship between each exposure and outcome (continuous), or the relative risk or odds ratio for the outcome (categorical outcome). Is that something that can be easily done in R?
For example, it would be great to set a 4 variable dataset where one variable is related to the categorical outcome with an OR/RR of 1.5 that I set, and then I would get a RR/OR of 1.5 for that relationship if I ran a logistic regression on the dataset.
Thanks!
You can generate random categorical variables and then set B0=1, B1=log(1.5), B2=1, B3=1, and generate the appropriate XB. Then, using a logit link function, you can generate P(Y=1|x) for each observation/row x and use sample to choose Y=1 or 0 with that probability. Fit a logistic regression using the binomial family and finally exponentiate the coefficient of "a" to get the odds ratio for this variable. Since we set it to log(1.5), exponentiating gives approximately 1.5.
dt=data.frame(a=sample(c(0,1), 10000, replace=TRUE),
b=sample(c(0,1), 10000, replace=TRUE),
c=sample(c(0,1), 10000, replace=TRUE))
library(dplyr)
dt=mutate(dt, xb=1+log(1.5)*a+b+c, linked=1/(1+exp(-xb))) # linear predictor, then inverse-logit to get P(Y=1|x)
y=numeric()
for (i in 1:10000) {
  y[i]=sample(c(1,0), prob=c(dt$linked[i], 1-dt$linked[i]), size=1) # draw Y=1 with probability linked[i]
}
dt$y=y
m=glm(data=dt, y ~ a+b+c, family="binomial") # fit the logistic regression
exp(m$coef["a"]) # exponentiate the coefficient of a to recover the odds ratio
1.422448
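As a side note on the design, the per-row sampling loop does the same thing as a single vectorized binomial draw, so it could be replaced with:
dt$y <- rbinom(nrow(dt), size=1, prob=dt$linked) # one Bernoulli draw per row, using that row's probability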
Hi everybody!
I have a response variable that counts successful days in a month and is distributed in a peculiar shape (see above). About 50% are zeros, and there is a heavy tail. Because of the overdispersion and the excess of zeros, I was advised to predict it with a zero-inflated negative binomial regression model.
However, no matter how significant a model I obtain, it reflects little of those distributional features (see below). For example, the predicted peaks are always around 4, and no predictions fall beyond 20.
Is this usual in fitting overdispersed, heavy-tailed count data? Are there other ways to improve the fitting? Any suggestions would be appreciated. Thank you!
P. S.
I also tried logistic regression to predict zero/non-zero only. But none of the fitted models perform better than simply guessing zeros for all cases.
I suppose you made a histogram of the fitted values, so this will only reflect the fitted means, possibly multiplied by the zero component depending on the model you use. It is not supposed to recreate the observed distribution, because how spread out your data can be is captured by the dispersion parameter.
We can use an example from the pscl package:
library(pscl)
data("bioChemists")
fit <- hurdle(art ~ ., data = bioChemists,dist="negbin",zero.dist="binomial")
par(mfrow=c(1,2))
hist(fit$y,main="Observed")
hist(fit$fitted.values,main="Fitted")
As mentioned above, in this hurdle model the fitted values you see are the predicted count means multiplied by the zero-hurdle component:
head(fit$fitted.values)
1 2 3 4 5 6
1.9642025 1.2887343 1.3033753 1.3995826 2.4560884 0.8783207
head(predict(fit,type="zero")*predict(fit,type="count"))
1 2 3 4 5 6
1.9642025 1.2887343 1.3033753 1.3995826 2.4560884 0.8783207
To simulate data based on the fitted model, we extract the parameters:
Theta=fit$theta
Means=predict(fit,type="count")
Zero_p = predict(fit,type="prob")[,1]
Define a function to simulate the counts:
simulateCounts = function(mu,theta,zero_p){
  N = length(mu)
  x = rnbinom(N, mu=mu, size=theta) # negative binomial counts with the fitted dispersion
  x[runif(N) < zero_p] = 0 # set a draw to zero with its predicted zero probability
  x
}
So run this simulation a number of times to get the spectrum of values:
set.seed(100)
simulated = replicate(10,simulateCounts(Means,Theta,Zero_p))
simulated = unlist(simulated)
par(mfrow=c(1,2))
hist(bioChemists$art,main="Observed")
hist(simulated,main="simulated")