I am having trouble writing a function that can both run the regression analysis and output the regression summary to CSV files. The setup looks like this:
I have three predictor variables:
age1 (continuous), gender1 (categorical 0/1), FLUSHOT (categorical 0/1)
In the file, the first 100 columns are the response variables (all categorical 0/1) that I want to test.
The goal is to run a regression analysis for each of the response variables (1:100) and output only the p-value, OR, and CI.
So the code I have looks something like this:
fun1 <- function(x) {
  res <- c(paste(as.character(summary(x)$call), collapse = " "),
           summary(x)$coefficients[4, 4],   # p-value for the 4th coefficient (FLUSHOT)
           exp(coef(x))[4],                 # odds ratio for FLUSHOT
           exp(confint(x))[4, 1:2],         # 95% confidence interval
           "\n")
  names(res) <- c("call", "p-value", "OR", "LCI", "UCI", "")
  return(res)
}
res2 <- NULL
lms <- list()
for (i in 1:100) {
  lms[[i]] <- glm(A[, i] ~ age1 + gender1 + as.factor(FLUSHOT),
                  family = "binomial", data = A)
  res2 <- rbind(res2, fun1(lms[[i]]))
}
write.csv(res2,"A_attempt1.csv",row.names=F)
If, for example, we have a sufficient sample size in each category, that is, if the marginal frequencies look like this:
table(variable1, FLUSHOT)
      0  1
  0  15  3
  1  11 19
then this code works well. But if we have something like:
table(variable15, FLUSHOT)
      0  1
  0  15  0
  1  11 19
the code runs into an error, reports it, and stops.
I tried multiple ways of using try() and tryCatch(), but none of them seemed to work for me.
What error message do you see? You can try using lrm from the rms package to estimate the logistic regression model, and texreg to export the output to CSV.
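A minimal sketch of the tryCatch() route the question mentions (it keeps the same A, fun1, and loop structure as above; recording a row of NAs for variables that fail is just one option):

res2 <- NULL
lms <- list()
for (i in 1:100) {
  row_i <- tryCatch({
    lms[[i]] <- glm(A[, i] ~ age1 + gender1 + as.factor(FLUSHOT),
                    family = "binomial", data = A)
    fun1(lms[[i]])
  }, error = function(e) {
    # report the failure and return a placeholder row instead of stopping
    message("variable ", i, " failed: ", conditionMessage(e))
    rep(NA, 6)
  })
  res2 <- rbind(res2, row_i)
}
write.csv(res2, "A_attempt1.csv", row.names = FALSE)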
I ran a goodness-of-fit test for a binomial regression I fitted and got these results:
[image: goodness of fit result]
In the example my teacher gave, the table has rows ordered 0 1 and columns ordered 0 1, while mine has columns ordered 1 0 and rows ordered 0 1, as seen in the image above.
Does this difference matter for the results I get?
The results won't change. But if you like you can change the order of the columns using
table()[,2:1]
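For instance (a toy illustration, not part of the original answer), with a 2x2 cross-tab:

tab <- table(y = c(0, 0, 1, 1, 1), x = c(1, 0, 1, 1, 0))
tab          # columns appear in the order 0, 1
tab[, 2:1]   # same counts, columns reordered to 1, 0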
I'm fitting count data (number of fledgling birds produced per territory) using zero-inflated Poisson models in R, and while model fitting is working fine, I'm having trouble using the predict function to get estimates for multiple values of one category (Year) averaged over the values of another category (StudyArea). Both variables are dummy coded (0,1) and are set up as factors. The data frame sent to the predict function looks like this:
  Year_d StudyArea_d
1      0         0.5
2      1         0.5
However, I get the error message:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
If instead I use a data frame such as:
  Year_d StudyArea_d
1      0           0
2      0           1
3      1           0
4      1           1
I get sensible estimates of fledgling counts per year and study site combination. However, I'm not really interested in the effect of study site (the effect is small and isn't involved in an interaction), and the year effect is really what the study was designed to examine.
I have previously used similar code to successfully get estimated counts from a model that had one categorical and one continuous predictor variable (averaging over the levels of the dummy-coded factor), using a data frame similar to:
  VegHeight StudyArea_d
1       0           0.5
2       0.5         0.5
3       1           0.5
4       1.5         0.5
So I'm a little confused why the first attempt I describe above doesn't work.
I can work on constructing a reproducible example if it would help, but I have a hunch that I'm not understanding something basic about how the predict function works when dealing with factors. If anyone can help me understand what I need to do to get estimates at both levels of one factor, averaged over the levels of another factor, I would really appreciate it.
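(One hedged sketch, not from the original post: because StudyArea_d is a factor in the model, newdata has to supply an actual factor level rather than 0.5, so a common workaround is to predict at both levels and average by hand. The object name zip_fit below is an assumption for the fitted zero-inflated model, e.g. from pscl::zeroinfl.)

# levels must match the factor coding used when the model was fitted
newdat <- expand.grid(Year_d      = factor(c(0, 1)),
                      StudyArea_d = factor(c(0, 1)))
newdat$pred <- predict(zip_fit, newdata = newdat, type = "response")

# average over StudyArea_d within each Year_d level
aggregate(pred ~ Year_d, data = newdat, FUN = mean)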
I have been preparing a survival analysis and Cox regression in R. However, my line manager is a Stata user and wants the output displayed in a similar way to how Stata would display it, e.g.
# Stata code
. strate
. stsum, by (GROUP)
stsum will output a time at risk for each group and an incidence rate, and I can't figure out how to achieve this with R.
The data look roughly like this (I can't get to it as it's in a secure environment):
PERS GROUP INJURY FOLLOWUP
111 1 0 2190
222 2 1 45
333 1 1 560
444 2 0 1200
So far I have been using fairly bog standard code:
library(survival)
library(coin)
# survival analysis
table(data$INJURY, data$GROUP)
survdiff(Surv(FOLLOWUP, INJURY)~GROUP, data=data)
surv_test(Surv(FOLLOWUP, INJURY)~factor(GROUP), data=data)
surv.all <- survfit(Surv(FOLLOWUP, INJURY)~GROUP, data=data)
print(surv.all, print.rmean = TRUE)
# cox regression
cox.all <- coxph(Surv(FOLLOWUP, INJURY) ~ GROUP, data = data)
summary(cox.all)
At the moment we have 4 lines of data and no clear description (at least to a non-user of Stata) of the desired output:
dat <- read.table(text="PERS GROUP INJURY FOLLOWUP
111 1 0 2190
222 2 1 45
333 1 1 560
444 2 0 1200",header=TRUE)
I do not know if there are functions in either the coin or the survival packages that deliver a crude event rate for such data. It is trivial to deliver crude event rates (using 'crude' in the technical sense with no disparagement intended) with ordinary R functions:
by(dat, dat$GROUP, function(d) sum(d$INJURY)/sum(d$FOLLOWUP) )
#----------------
dat$GROUP: 1
[1] 0.0003636364
------------------------------------------------------
dat$GROUP: 2
[1] 0.0008032129
The corresponding calculation for time at risk (or for printing both to the console) would be a very simple modification. It's possible that the 'Epi' or 'epiR' package, or one of the other packages devoted to teaching basic epidemiology, has functions designed for this. The 'survival' and 'coin' authors may not have seen a need to write up and document such a simple function.
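A minimal sketch of that modification (using the same dat object as above; the names time_at_risk, events, and rate are just illustrative):

by(dat, dat$GROUP, function(d) {
  c(time_at_risk = sum(d$FOLLOWUP),                  # person-time at risk in the group
    events       = sum(d$INJURY),                    # number of events observed
    rate         = sum(d$INJURY) / sum(d$FOLLOWUP))  # crude incidence rate
})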
When I needed to aggregate the ratios of actual to expected events within strata of factor covariates, I had to construct a function that created the stratified tables of events (to support confidence estimates) and the sums of "expecteds" (calculated on the basis of age, gender, and duration of observation), and then computed the actual-to-expected (A/E) ratios. I assembled them into a list object and rounded the ratios to 2 decimal places. When I got it finished, I found these most useful as a sanity check against the results I was getting with the 'survival' and 'rms' regression methods I was using. They also help explain results to a nonstatistical audience that is more familiar with tabular methods than with regression. I now have it as part of my startup .profile.
I am seeing a huge difference between my HoltWinters fitted values and the predicted values. I can understand there being a large difference after several predictions, but shouldn't the first prediction be the same number that the fitted data would show if the data set had one more observation?
Please correct me if I'm wrong, and explain why that wouldn't be the case.
Here is an example of the actual data.
1
1
1
2
1
1
-1
1
2
2
2
1
2
1
2
1
1
2
1
2
2
1
1
2
2
2
2
2
1
2
2
2
-1
1
Here is an example of the fitted data.
1.84401709401709
0.760477897417666
1.76593566042741
0.85435674207981
0.978449891674328
2.01079668445307
-0.709049507055536
1.39603638693742
2.42620183925688
2.42819282543689
2.40391946256294
1.29795840410863
2.39684770489517
1.35370435531208
2.38165200319969
1.34590347535205
1.38878761417551
2.36316132796798
1.2226736501825
2.2344269563083
2.24742853293732
1.12409156568888
Here is my R code.
randVal <- read.table("~/Documents/workspace/Roulette/play/randVal.txt", sep = "")
test<-ts(randVal$V1, start=c(1,133), freq=12)
test <- fitted(HoltWinters(test))
test.predict<-predict(HoltWinters(test), n.ahead=1*1)
Here is the predicted data after I expand it to n.ahead=1*12. Keep in mind that I only really want the first value. I don't understand why all the predict data is so low and close to 0 and -1 while the fitted data is far more accurate to the actual data. Thank you.
0.16860570380268
-0.624454483845195
0.388808753990824
-0.614404235175936
0.285645402877705
-0.746997659036848
-0.736666618626855
0.174830187188718
-1.30499945596422
-0.320145850774167
-0.0917166719596059
-0.63970713627854
It sounds like you need a statistical consultation, since the code is not throwing any errors, and you don't explain why you are dissatisfied with the results, given that the first value from those two calls is the same. With that in mind, you should realize that most time-series methods assume input from de-trended and de-meaned data, so they will return estimated parameters and values that in many cases would need to be offset by the global mean to predict on the original scale. (It's also really bad practice to overwrite intermediate values as you are doing with 'test'.) Nonetheless, if you look at the test object you will see a column of xhat values that are on the scale of the input data.
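(As a hedged illustration of the point about overwriting 'test', one way to keep the series, the model, and its outputs in separate objects; names like hw_fit are just illustrative, and the ts() arguments follow the question's code:)

rand_ts <- ts(randVal$V1, start = c(1, 133), freq = 12)
hw_fit  <- HoltWinters(rand_ts)

hw_fitted <- fitted(hw_fit)                 # in-sample fit; the "xhat" column is on the data scale
hw_pred   <- predict(hw_fit, n.ahead = 12)  # forecasts from the same fitted model

This way the forecast comes from the model fitted to the original series, rather than from a second HoltWinters model fitted to the first model's fitted values.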
Your question, "I don't understand why all the predict data is so low and close to 0 and -1 while the fitted data is far more accurate to the actual data", doesn't say in what sense you think the "predict data" [sic] is "more accurate" than the actual data. The predict results are an estimate of some sort, and they are split into components, as you would see if you ran the code on the help page for predict:
plot(test,test.predict12)
It is not "data", either. It's not at all clear how it could be "more accurate" unless you have some sort of gold standard that you are not telling us about.
I'm using the standard glm function with the step function on 100k rows and 107 variables. When I ran a regular glm the calculation finished within a minute or two, but when I added step(glm(...)) it ran for hours.
I tried to run it as a matrix, but it had still been running for about half an hour and I'm not sure it will ever finish. When I ran it on 9 variables it gave me the answers in a few seconds, but with 9 warnings, all of them "glm.fit: fitted probabilities numerically 0 or 1 occurred".
I used the line of code below: is it wrong? What should I do in order to gain better running time?
logit1back <- step(glm(IsChurn ~ var1 + var2 + var3 + var4 +
                         var5 + var6 + var7 + var8 + var9,
                       data = tdata, family = "binomial"))
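(A hedged aside, not from the original post: if an exhaustive stepwise AIC search over ~100 predictors is too slow, a penalized fit such as the lasso can do the variable selection in a single pass. A minimal sketch, assuming tdata and IsChurn as above and that the remaining columns are the candidate predictors:)

library(glmnet)

# build a numeric predictor matrix (no intercept column) from the data frame
x <- model.matrix(IsChurn ~ . - 1, data = tdata)
y <- tdata$IsChurn

# cross-validated lasso logistic regression; predictors with nonzero
# coefficients at lambda.1se are the selected variables
cv_fit <- cv.glmnet(x, y, family = "binomial")
coef(cv_fit, s = "lambda.1se")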