I ran a log-rank test with survdiff as shown below:
survdiff(formula = Surv(YearsToEvent, Event) ~ Cat, data = RegressionData)
I get the following output:
N Observed Expected (O-E)^2/E (O-E)^2/V
0 30913 487 437.9 5.50 11.9
1 3755 56 23.2 46.19 48.0
2 3322 36 45.2 1.89 2.0
3 15796 260 332.6 15.85 27.3
Chisq= 71.9 on 3 degrees of freedom, p= 0.000000000000002
How can I save this output (especially the p-value) to a .txt file? I am running a bunch of models like this in a loop and want to save all the results to a .txt file.
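One way this could be approached, as a minimal sketch rather than a definitive answer: the object returned by survdiff stores the chi-square statistic in `chisq` and the group sizes in `n`, so the p-value can be recomputed with pchisq and appended to a text file inside the loop. The built-in `lung` data, the grouping variables, and the output path are placeholders chosen for illustration, and the degrees of freedom below assume a test without strata.

```r
library(survival)

# Placeholder output file; a temporary path is used for illustration only
out_file <- file.path(tempdir(), "logrank_results.txt")
if (file.exists(out_file)) file.remove(out_file)

# Illustrative grouping variables from the built-in lung data
group_vars <- c("sex", "ph.ecog")

p_values <- numeric(0)
for (v in group_vars) {
  f <- as.formula(paste("Surv(time, status) ~", v))
  fit <- survdiff(f, data = lung)

  # p-value from the chi-square statistic; df = number of groups - 1
  # (assumes no strata() term in the formula)
  p <- pchisq(fit$chisq, df = length(fit$n) - 1, lower.tail = FALSE)
  p_values[v] <- p

  # Append the printed table, then a one-line p-value summary
  capture.output(print(fit), file = out_file, append = TRUE)
  cat(sprintf("%s: p = %.3g\n\n", v, p), file = out_file, append = TRUE)
}

p_values
```

Keeping the p-values in a named vector as well as in the file makes it easy to sort or filter them afterwards.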
I really need help with this. I want to plot predictions from my quasi-Poisson GLM. I have a problem because I think I set up the GLM wrongly with my dataset.
I wanted a prediction model based on my quasi-Poisson GLM with all my parameters, but I ended up predicting for each parameter separately, and the result is different from the full quasi-Poisson GLM.
Here is my dataset. All of my data are in a CSV file. I don't know how to upload the CSV file in this post, sorry about that.
Richness = as.matrix(dat1[,14])
Richness
8
3
3
4
3
5
4
3
7
8
Parameter = as.matrix(dat1[,15:22])
Parameter
JE Temp Hmdt Sond HE WE L MH
1 31.3 93 63.3 3.89 4.32 80 7.82
2 26.9 92 63.5 9.48 8.85 60 8.32
1 27.3 93 67.4 1.23 2.37 60 10.10
3 31.6 99 108.0 1.90 3.32 80 4.60
1 29.3 99 86.8 2.42 7.83 460 12.20
2 29.4 85 86.1 4.71 15.04 200 10.10
1 29.4 87 93.5 3.65 14.70 200 12.20
1 29.5 97 87.5 1.42 3.17 80 4.07
1 25.9 95 62.3 5.23 16.89 140 10.03
1 29.5 95 63.5 1.85 6.50 120 6.97
Rich = glm(Richness ~ Parameter, family=quasipoisson, data = dat1)
summary(Rich)
Call:
glm(formula = Richness ~ Parameter, family = quasipoisson, data = dat1)
Deviance Residuals:
1 2 3 4 5
-0.017139 0.016769 -0.008652 0.002194 -0.003153
6 7 8 9 10
-0.016828 0.022914 -0.013823 -0.012597 0.030219
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.4197959 0.5061733 -14.659 0.0434 *
ParameterJE 0.1833651 0.0224198 8.179 0.0775 .
ParameterTemp 0.2441301 0.0073380 33.269 0.0191 *
ParameterHmdt 0.0393258 0.0032176 12.222 0.0520 .
ParameterSond -0.0319313 0.0009662 -33.050 0.0193 *
ParameterHE -0.0982213 0.0060587 -16.212 0.0392 *
ParameterWE 0.1001758 0.0027575 36.329 0.0175 *
ParameterL -0.0014170 0.0001554 -9.117 0.0695 .
ParameterMH 0.0137196 0.0073704 1.861 0.3138
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasipoisson family taken to be 0.002739787)
Null deviance: 7.8395271 on 9 degrees of freedom
Residual deviance: 0.0027358 on 1 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 3
This is the plot that I tried to make with ggplot:
ggplot(dat1, aes(Temp, Richness))+
geom_point() +
geom_smooth(method = "glm", method.args = list(family = quasipoisson),
fill = "grey", color = "black", linetype = 2)
and this is the result.
I made one plot for each parameter, but I now realize this is wrong, because it fits a separate quasi-Poisson model for each parameter. What I want is the prediction based on the full quasi-Poisson model shown in the summary above.
I tried to use the code from "plot the results glm with multiple explanatories with 95% CIs", but I am really confused about how to arrange my data like the example there. The result in that example is close to what I want, though.
Can anyone help me with this? How can I put the GLM predictions for all parameters in one frame with ggplot?
I hope someone can help me fix this. Thank you so much!
Have you tried the plot_model function from the sjPlot package?
I'm writing from my phone, but the code is something like this:
library(sjPlot)
plot_model(glm_model)
More info:
http://www.strengejacke.de/sjPlot/reference/plot_model.html
code:
data("mtcars")
glm_model<-glm(am~.,data = mtcars)
glm_model
library(sjPlot)
plot_model(glm_model, vline.color = "red")
plot_model(glm_model, show.values = TRUE, value.offset = .3)
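If the goal is curves predicted from the full model rather than coefficient plots, a minimal sketch along these lines may help. It assumes the model is fitted with named columns (for example Richness ~ JE + Temp + ...) rather than a matrix term, so that predict() can be given a newdata frame. Here `mtcars` stands in for dat1, `effect_grid` is a hypothetical helper, each curve holds the other predictors at their means, and the bands are approximate 95% intervals computed on the link scale.

```r
# Illustrative data: mtcars stands in for dat1 (carb is a count response)
fit <- glm(carb ~ hp + wt + qsec, family = quasipoisson, data = mtcars)

# Hypothetical helper: predictions along one predictor,
# with the remaining predictors fixed at their means
effect_grid <- function(model, data, var, n = 50) {
  others <- setdiff(all.vars(formula(model))[-1], var)
  grid <- data.frame(seq(min(data[[var]]), max(data[[var]]), length.out = n))
  names(grid) <- var
  for (o in others) grid[[o]] <- mean(data[[o]])

  # Predict on the link scale, then back-transform fit and interval limits
  pr <- predict(model, newdata = grid, type = "link", se.fit = TRUE)
  inv <- family(model)$linkinv
  data.frame(variable = var,
             x = grid[[var]],
             fit = inv(pr$fit),
             lwr = inv(pr$fit - 1.96 * pr$se.fit),
             upr = inv(pr$fit + 1.96 * pr$se.fit))
}

vars <- c("hp", "wt", "qsec")
preds <- do.call(rbind, lapply(vars, function(v) effect_grid(fit, mtcars, v)))

# One panel per predictor, all in a single figure (needs ggplot2)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(preds, aes(x, fit)) +
    geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "grey") +
    geom_line(linetype = 2) +
    facet_wrap(~ variable, scales = "free_x")
  print(p)
}
```

Stacking the per-predictor grids into one long data frame is what lets facet_wrap draw every parameter in the same frame.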
I'm using the split() and lapply() functions to run Mann-Kendall trend tests in bulk. In the code below, split() separates the results (ConcLow) by Analyte (water quality parameter). Then lapply() runs MannKendall and summary on each. The output goes to the console (example shown below the code), but I'd like it to go into an Excel or CSV document so I can work with it. Ideally the Excel document would have the analyte (TOC, for example) in the first column, the tau value in the 2nd column, and the p-value in the 3rd column. The next tab or the following columns would then display the results from the summary function. Any assistance you can provide is greatly appreciated! I'm quite new to R.
mk.analyte <- split(BarkTop$ConcLow, BarkTop$Analyte)
lapply(mk.analyte, MannKendall)
lapply(mk.analyte, summary)
Output for each analyte looks like this (abbreviated here, but it's a long list):
$TOC
tau = 0.0108, 2-sided pvalue =0.8081
$TOC
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.378 2.054 2.255 2.434 2.600 4.530
Data look like this:
Date Location Analyte ConcLow Units
5/8/2000 Barker Res. Hardness 3.34 mg/L (as CaCO3)
11/24/2000 Barker Res. Hardness 9.47 mg/L (as CaCO3)
6/12/2001 Barker Res. Hardness 1.4 mg/L (as CaCO3)
12/29/2001 Barker Res. Hardness 21.9 mg/L (as CaCO3)
7/17/2002 Barker Res. Fe (diss 81 ug/L
2/2/2003 Barker Res. Fe (diss 90 ug/L
8/21/2003 Barker Res. Fe (diss 0.08 ug/L
3/8/2004 Barker Res. Fe (diss 15.748 ug/L
9/24/2004 Barker Res. TSS 6.2 mg/L
4/12/2005 Barker Res. TSS 8 mg/L
10/29/2005 Barker Res. TSS 10 mg/L
In my opinion, the tidyverse is the way to go here, as it is easier to read.
Short way:
# Sample data
library(tidyverse)
set.seed(42)
df <- data.frame(
  Location = replicate(1000, sample(letters[1:15], 1)),
  Analyte = replicate(1000, sample(c("Hardness", "TSS", "Fe"), 1)),
  ConcLow = runif(1000, 1, 30))
# Solution: one nested data set per Location/Analyte pair
df %>%
  nest(data = -c(Location, Analyte)) %>%
  mutate(
    mannKendall = purrr::map(data, function(x) {
      broom::tidy(Kendall::MannKendall(x$ConcLow))}),
    sumData = purrr::map(data, function(x) {
      broom::tidy(summary(x$ConcLow))})) %>%
  select(-data) %>%
  unnest(c(mannKendall, sumData)) %>%
  # write_excel_csv() writes a CSV file that Excel opens directly
  write_excel_csv(file = "mydata.csv")
#How the table looks like:
# A tibble: 45 x 13
Location Analyte statistic p.value kendall_score denominator var_kendall_sco~ minimum q1 median
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 n Fe 0.264 0.0907 61 231. 1258. 1.38 14.4 20.6
2 o Hardne~ 0.0870 0.568 24 276. 1625. 2.02 9.52 18.3
3 e Fe -0.108 0.499 -25 231. 1258. 1.14 9.24 15.9
4 m TSS -0.00654 1 -1 153 697 2.19 5.89 10.4
5 j TSS -0.158 0.363 -27 171. 817 1.20 6.44 12.8
6 h Hardne~ 0.0909 0.466 48 528 4165. 4.28 11.1 19.4
7 l TSS -0.0526 0.780 -9 171. 817 5.39 12.5 21.1
8 c Fe -0.0736 0.652 -17 231. 1258. 1.63 5.87 10.6
9 j Hardne~ 0.415 0.0143 71 171. 817 4.50 11.7 15.4
10 k Fe -0.146 0.342 -37 253. 1434. 2.68 12.3 15.4
# ... with 35 more rows, and 3 more variables: mean <dbl>, q3 <dbl>, maximum <dbl>
Long way
It's a bit backwards, but you can do something like the code below.
Please note that I used a subset of the mtcars dataset for my solution.
library(tidyverse)
df <- mtcars %>%
select(cyl, disp)
wilx <- df %>%
split(.$cyl) %>%
map(function(x) {broom::tidy(wilcox.test(x$disp, paired = FALSE,
exact = FALSE))})
sumData <- df %>%
split(.$cyl) %>%
map(function(x) {summary(x$disp)})
# Write one CSV file per group for the test results and for the summaries
for (i in seq_along(wilx)) {
  write_excel_csv(as.data.frame(wilx[i]), file = paste0(getwd(), "/wilx", i, ".csv"))
  write_excel_csv(as.data.frame(unlist(sumData[i])), file = paste0(getwd(), "/sumData", i, ".csv"))
}
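For anyone who prefers to stay in base R, here is a rough sketch of the same idea: build one row per analyte and write a single CSV. To keep it runnable without extra packages, it swaps in cor.test(..., method = "kendall") against the observation order as a stand-in for Kendall::MannKendall; the two are not guaranteed to report identical p-values. The sample values, the `one_row` helper, and the output path are made up for illustration.

```r
# Small illustrative data set (values are made up for this sketch)
set.seed(1)
BarkTop <- data.frame(
  Analyte = rep(c("TOC", "TSS", "Hardness"), each = 12),
  ConcLow = round(runif(36, 1, 30), 2)
)

mk.analyte <- split(BarkTop$ConcLow, BarkTop$Analyte)

# Hypothetical helper: one result row per analyte
one_row <- function(name) {
  x <- mk.analyte[[name]]
  # Kendall's tau between the values and their order in time
  # (stand-in for Kendall::MannKendall)
  kt <- cor.test(x, seq_along(x), method = "kendall", exact = FALSE)
  s <- summary(x)
  data.frame(Analyte = name,
             tau = unname(kt$estimate),
             p.value = kt$p.value,
             min = s[["Min."]],
             median = s[["Median"]],
             mean = s[["Mean"]],
             max = s[["Max."]])
}

results <- do.call(rbind, lapply(names(mk.analyte), one_row))

# Placeholder output path; the CSV opens directly in Excel
out_file <- file.path(tempdir(), "mk_results.csv")
write.csv(results, out_file, row.names = FALSE)
results
```

This gives the layout asked for in the question: analyte in the first column, tau in the second, the p-value in the third, and the summary statistics in the following columns.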
When I use cross-validation with my data, it gives me two types of prediction: cvpred and Predicted. What is the difference between the two? I guess cvpred is the cross-validation prediction, but what is the other?
Here is some of my code:
crossvalpredict <- cv.lm(data = total,form.lm = formula(verim~X4+X4.1),m=5)
And this is the result:
fold 1
Observations in test set: 5
3 11 15 22 23
Predicted 28.02 32.21 26.53 25.1 21.28
cvpred 20.23 40.69 26.57 34.1 26.06
verim 30.00 31.00 28.00 24.0 20.00
CV residual 9.77 -9.69 1.43 -10.1 -6.06
Sum of squares = 330 Mean square = 66 n = 5
fold 2
Observations in test set: 5
2 7 21 24 25
Predicted 28.4 32.0 26.2 19.95 25.9
cvpred 52.0 81.8 36.3 14.28 90.1
verim 30.0 33.0 24.0 21.00 24.0
CV residual -22.0 -48.8 -12.3 6.72 -66.1
Sum of squares = 7428 Mean square = 1486 n = 5
fold 3
Observations in test set: 5
6 14 18 19 20
Predicted 34.48 36.93 19.0 27.79 25.13
cvpred 37.66 44.54 16.7 21.15 7.91
verim 33.00 35.00 18.0 31.00 26.00
CV residual -4.66 -9.54 1.3 9.85 18.09
Sum of squares = 539 Mean square = 108 n = 5
fold 4
Observations in test set: 5
1 4 5 9 13
Predicted 31.91 29.07 32.5 32.7685 28.9
cvpred 30.05 28.44 54.9 32.0465 11.4
verim 32.00 27.00 31.0 32.0000 30.0
CV residual 1.95 -1.44 -23.9 -0.0465 18.6
Sum of squares = 924 Mean square = 185 n = 5
fold 5
Observations in test set: 5
8 10 12 16 17
Predicted 27.8 30.28 26.0 27.856 35.14
cvpred 50.3 33.92 45.8 31.347 29.43
verim 28.0 30.00 24.0 31.000 38.00
CV residual -22.3 -3.92 -21.8 -0.347 8.57
Sum of squares = 1065 Mean square = 213 n = 5
Overall (Sum over all 5 folds)
ms
411
You can check that by reading the help page of the function you are using, cv.lm. There you will find this paragraph:
The input data frame is returned, with additional columns
‘Predicted’ (Predicted values using all observations) and ‘cvpred’
(cross-validation predictions). The cross-validation residual sum
of squares (‘ss’) and degrees of freedom (‘df’) are returned as
attributes of the data frame.
This says that Predicted is a vector of predicted values made using all the observations. In other words, these appear to be predictions made on your "training" data, or "in sample".
To check whether this is so, you can fit the same model using lm:
fit <- lm(verim~X4+X4.1, data=total)
And see if the predicted values from this model:
predict(fit)
are the same as those returned by cv.lm
When I tried it on the iris dataset in R, the Predicted values returned by cv.lm() were the same as those from predict() on the lm fit. So in that case they are in-sample predictions, where the model is fitted and evaluated on the same observations.
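The difference between the two columns can be illustrated with a hand-rolled sketch in base R. This is not how cv.lm is implemented internally (its fold assignment may differ); `mtcars` and the formula are stand-ins for the asker's `total` data and model.

```r
set.seed(123)
k <- 5

# Randomly assign each row to one of k folds
fold <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# 'Predicted': model fitted on ALL observations, predictions in sample
fit_all <- lm(mpg ~ wt + hp, data = mtcars)
predicted <- predict(fit_all)

# 'cvpred': each row is predicted by a model that never saw that row
cvpred <- numeric(nrow(mtcars))
for (j in seq_len(k)) {
  test <- fold == j
  fit_j <- lm(mpg ~ wt + hp, data = mtcars[!test, ])
  cvpred[test] <- predict(fit_j, newdata = mtcars[test, ])
}

# Mean squared errors of the two kinds of prediction
in_sample_mse <- mean((mtcars$mpg - predicted)^2)
cv_mse <- mean((mtcars$mpg - cvpred)^2)
c(in_sample = in_sample_mse, cross_validated = cv_mse)
```

For least-squares fits the held-out error cannot be smaller than the in-sample error on the same rows, which is why the cvpred residuals in the output above look worse than the Predicted ones.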
lm() does not give "better results." I am not sure how predict() and cv.lm() can be the same. predict() returns the expected values of Y for each sample, estimated from the fitted model (the covariates (X) and their corresponding estimated Beta values). Those Beta values, and the model error (E), were estimated from that same original data. By using predict(), you therefore get an overly optimistic estimate of model performance. That is why it seems better. You get a better (more realistic) estimate of model performance using an iterated sample holdout technique, like cross-validation (CV). The least biased estimate comes from leave-one-out CV, and the estimate with the least uncertainty (prediction error) comes from 2-fold (K=2) CV.
I am working on a MARS model using the earth package in R. My dataset (CE.Rda) consists of one dependent variable (D9_RTO_avg) and 10 potential predictors (NDVI_l1, NDVI_f0, NDVI_f1, NDVI_f2, NDVI_f3, LST_l1, LST_f0, LST_f1, LST_f2, LST_f3). Here is the head of my dataset:
D9_RTO_avg NDVI_l1 NDVI_f0 NDVI_f1 NDVI_f2 NDVI_f3 LST_l1 LST_f0 LST_f1 LST_f2 LST_f3
2 1.866667 0.3082 0.3290 0.4785 0.4330 0.5844 38.25 30.87 31 21.23 17.92
3 2.000000 0.2164 0.2119 0.2334 0.2539 0.4686 35.7 29.7 28.35 21.67 17.71
4 1.200000 0.2324 0.2503 0.2640 0.2697 0.4726 40.13 33.3 28.95 22.81 16.29
5 1.600000 0.1865 0.2070 0.2104 0.2164 0.3911 43.26 35.79 30.22 23.07 17.88
6 1.800000 0.2757 0.3123 0.3462 0.3778 0.5482 43.99 36.06 30.26 21.36 17.93
7 2.700000 0.2265 0.2654 0.3174 0.2741 0.3590 41.61 35.4 27.51 23.55 18.88
After creating my earth model as follows
mymodel.mod <- earth(D9_RTO_avg ~ ., data=CE, nk=10)
I print the summary of the resulting model by typing
print(summary(mymodel.mod, digits=2, style="pmax"))
and I obtain the following output
D9_RTO_avg =
4.1
+ 38 * LST_f128.68
+ 6.3 * LST_f216.41
- 2.9 * pmax(0, 0.66 - NDVI_l1)
- 2.3 * pmax(0, NDVI_f3 - 0.23)
Selected 5 of 7 terms, and 4 of 13169 predictors
Termination condition: Reached nk 10
Importance: LST_f128.68, NDVI_l1, NDVI_f3, LST_f216.41, NDVI_f0-unused, NDVI_f1-unused, NDVI_f2-unused, ...
Number of terms at each degree of interaction: 1 4 (additive model)
GCV 2 RSS 4046 GRSq 0.29 RSq 0.29
My question is: why does earth identify 13169 predictors when there are actually only 10? It seems that MARS is treating individual observed values of the candidate predictors as predictors themselves. How can I prevent MARS from doing so?
Thanks for your help
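One plausible cause, judging only from term names such as LST_f128.68 (which read like a column name glued to a value), is that some columns are stored as factor or character rather than numeric, so every distinct value is expanded into its own dummy predictor. This is a guess from the printed output, not something confirmed for CE.Rda. The sketch below uses a tiny made-up data frame (`CE_demo`) and base R's model.matrix to show the effect and one way to convert such columns; `to_num` is a hypothetical helper.

```r
# Tiny made-up example: LST_f1 was read in as text and became a factor
CE_demo <- data.frame(
  y = c(1.87, 2.00, 1.20, 1.60),
  NDVI_l1 = c(0.3082, 0.2164, 0.2324, 0.1865),
  LST_f1 = c("31", "28.35", "28.95", "30.22"),
  stringsAsFactors = TRUE
)

# Step 1: check how each column is stored
sapply(CE_demo, class)

# A factor is expanded into one dummy column per level (minus a baseline)
mm_factor <- model.matrix(y ~ ., CE_demo)
colnames(mm_factor)

# Step 2: convert factor/character columns back to numbers
to_num <- function(col) {
  if (is.factor(col) || is.character(col)) as.numeric(as.character(col)) else col
}
CE_fixed <- as.data.frame(lapply(CE_demo, to_num))

# Now each variable contributes a single numeric column
mm_fixed <- model.matrix(y ~ ., CE_fixed)
colnames(mm_fixed)
```

If sapply(CE, class) shows factor or character for the LST columns, converting them before calling earth() should bring the predictor count back to 10.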
I have been trying to convert SAS code that fits a simple regression and a mixed model. I managed to convert the simple regression, but my attempts at the mixed model have failed. The SAS code I am trying to convert is shown below.
"parc", "m", "dap" and "ht" are the header labels of the dataset, in that order.
data algoritmo ;
input parc m dap ht ;
lnH = LOG(ht-1.3);
lnD = LOG(dap) ;
cards ;
8 1 24.3 26.7
8 1 29.9 30.7
8 1 32.6 31.7
8 1 35.9 33.7
8 1 36.5 32.5
22 2 22.3 21.0
22 2 26.9 23.1
22 2 26.9 20.5
22 2 32.4 21.5
22 2 33.5 25.0
85 3 33.6 33.5
85 3 36.0 33.0
85 3 37.0 35.0
85 3 40.8 35.0
;
run ;
/* Simple Regression Model */
PROC REG DATA=algoritmo ;
model lnH = lnD ;
output out=out p=pred ;
run ; quit ;
/* Mixed-Effects Model */
PROC MIXED DATA=algoritmo COVTEST METHOD=REML ;
TITLE ' lnH = (B0+bok)+(B1+b1k)*lnd ' ;
MODEL lnH = lnD / S OUTPM=outpm OUTP=outp ;
RANDOM intercept lnD /SUBJECT=m s G TYPE=UN ;
RUN ;
Here is the part of the code that I have converted. This part works perfectly for me.
data1 <- read.table(file.choose(), header = TRUE, sep = ",")
# Add the log-transformed columns directly, without attach()
data2 <- transform(data1, lnH = log(ht - 1.3), lnD = log(dap))
#Simple Linear Model
model1 = lm(lnH~lnD,data=data2)
summary(model1)
But for the rest I'm stuck.
model2 = lme(lnH~lnD ,data=data2,random=~1|lnD / m, method= "REML", weights=varPower(0.2,form=~dap))
summary(model2)
With the help of Roland, replacing random=~1|lnD with random=~lnD|m worked pretty well.
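For reference, here is a minimal sketch of how the PROC MIXED pieces might map onto nlme. The Orthodont data shipped with nlme stands in for data2 (distance plays the role of lnH, age of lnD, and Subject of m), because convergence on the 14-row sample above was not checked here. The comments give approximate SAS counterparts, not exact equivalences.

```r
library(nlme)

# Orthodont stands in for data2:
# distance ~ lnH, age ~ lnD, Subject ~ m
fit <- lme(distance ~ age,             # MODEL lnH = lnD
           random = ~ age | Subject,   # RANDOM intercept lnD / SUBJECT=m TYPE=UN
           data = Orthodont,
           method = "REML")            # METHOD=REML

fixef(fit)       # fixed-effect estimates (MODEL ... / S)
ranef(fit)       # per-subject intercept and slope deviations (RANDOM ... / S)
getVarCov(fit)   # random-effects covariance matrix (G)

# Population-level and subject-level predictions
preds <- data.frame(
  marginal = as.numeric(predict(fit, level = 0)),     # roughly OUTPM
  conditional = as.numeric(predict(fit, level = 1))   # roughly OUTP
)
head(preds)
```

With the real data, the call would presumably become lme(lnH ~ lnD, random = ~ lnD | m, data = data2, method = "REML"), matching the fix described above.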