Linear Regression - Append Predicted Values to Same dataset - r

I recently started with R programming. This is my dataset:
WeekOfYear Production
1 202612
2 245633
3 299653
4 252612
5 299633
6 288993
7 254653
8 288612
9 277733
10 245633
I need to predict “Production” values for the remaining Weeks of the Year
relation<-lm(Production~WeekOfYear,dataset)
predict(relation,data.frame(WeekOfYear=c(11)))
How can I append the predicted values for weeks 11 to 52 (end of year) to the same dataset, like below?
WeekOfYear Production
1 202612
2 245633
3 299653
4 252612
5 299633
6 288993
7 254653
8 288612
9 277733
10 245633
11 predicted value
12 predicted value
so on
-OR-
WeekOfYear Production Regression
1 202612 fitted value
2 245633 fitted value
3 299653 fitted value
4 252612 fitted value
5 299633 fitted value
6 288993 fitted value
7 254653 fitted value
8 288612 fitted value
9 277733 fitted value
10 245633 fitted value
11 predicted value
12 predicted value
13 predicted value
14 predicted value
.
.
52 predicted value

You could do it like this:
relation <- lm(Production ~ WeekOfYear, dat)
WeekOfYear <- 1:52
predict(relation, data.frame(WeekOfYear))
dat2 <- data.frame(WeekOfYear, regression = predict(relation, data.frame(WeekOfYear)))
merge(dat, dat2, by = 'WeekOfYear', all.y = TRUE)
The result:
WeekOfYear Production regression
1 1 202612 250517.6
2 2 245633 253864.1
3 3 299653 257210.5
4 4 252612 260557.0
5 5 299633 263903.5
6 6 288993 267249.9
7 7 254653 270596.4
8 8 288612 273942.9
9 9 277733 277289.3
10 10 245633 280635.8
11 11 NA 283982.3
12 12 NA 287328.7
----
51 51 NA 417840.9
52 52 NA 421187.4

To append your values, first define df as your data frame and fit the model on it:
df <- data.frame(WeekOfYear = 1:10,
Production = c(202612, 245633, 299653, 252612, 299633, 288993, 254653, 288612, 277733, 245633))
relation <- lm(Production ~ WeekOfYear, df)
Then predict weeks 11 to 52 and bind the new rows to the original ones:
test_data <- data.frame(WeekOfYear = 11:52)
test_data$Production <- predict(relation, test_data)
df <- rbind(df, test_data)
This gives the behaviour shown in the plot (put together very quickly).
That said, I am not sure your data follow a linear trend, but you may know your data better.
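To follow up on the doubt about linearity, one quick check is to look at the R-squared of the fit and the p-value of the slope before extrapolating 42 weeks ahead. This is only a minimal sketch using the ten observed weeks from the question; the names fit_summary, r_squared and slope_p are illustrative, not part of either answer.

```r
# Observed data from the question (weeks 1 to 10).
df <- data.frame(
  WeekOfYear = 1:10,
  Production = c(202612, 245633, 299653, 252612, 299633,
                 288993, 254653, 288612, 277733, 245633)
)

relation <- lm(Production ~ WeekOfYear, data = df)
fit_summary <- summary(relation)

# Share of the variance in Production explained by the straight line.
r_squared <- fit_summary$r.squared

# p-value of the WeekOfYear slope, taken from the coefficient table.
slope_p <- fit_summary$coefficients["WeekOfYear", "Pr(>|t|)"]

print(c(r_squared = r_squared, slope_p = slope_p))
```

A low R-squared or a large slope p-value would suggest that the predictions for weeks 11 to 52 should be treated with caution. plot(relation, which = 1) shows the residuals against the fitted values if you prefer a visual check.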

Related

Using a loop to run Linear Models over multiple populations exporting AIC values and coefficients

I have data that includes multiple populations (each containing multiple individuals). I am trying to evaluate resource selection by using linear models to evaluate use. I want to do this at the population level, so I am hoping to use a loop that runs 6 models on each population and then provides an AIC table for each population as well as the coefficients of the best-fitting model.
Here is a sample of my data:
Population WLH_ID Used Var1 Var2 Var3 Var4 Var5
Tweed 1 1 15 2 10 21 22.1
Tweed 2 1 7 3 9 20 20
Lake 3 1 11 2 7 19 20
Lake 4 1 13 2 8 21 20
Hwy 5 1 14 1 6 12 23
Hwy 6 1 10 2 7 17 20
Jasper 7 1 12 4 7 19 22
Tweed 1 0 15 2 10 21 22.1
Tweed 2 0 7 3 9 20 20
Lake 3 0 11 2 7 19 20
Lake 4 0 11 2 8 21 20
Hwy 5 0 12 1 5 23 23
Hwy 6 0 14 7 7 17 20
Jasper 7 0 17 2 4 19 21.5
So far I have tried the following
Model1 <- as.formula(Used ~ var1+var2+var3+var4+var5+(1|WLH_ID))
Model2 <- as.formula(Used ~ var1+var2+var3+var4+(1|WLH_ID))
Model3 <- as.formula(Used ~ var1+var2+var3+(1|WLH_ID))
Model4 <- as.formula(Used ~ var1+var2+(1|WLH_ID))
Model5 <- as.formula(Used ~ var1+(1|WLH_ID))
Model6<- as.formula(Used~1)
### It will use the Model1 formula entered above, so make sure you have run that
SM.split <- split(mydata,mydata$Population) #Split the data into a list with one entry for each population
for (i in 1:length (SM.split)){
poprun<-SM.split[[i]]
Cand.models[[1]]<-glmer(Model1,family = binomial,data=poprun)
Cand.models[[2]]<-glmer(Model2,family = binomial,data=poprun)
Cand.models[[3]]<-glmer(Model3,family = binomial,data=poprun)
Cand.models[[4]]<-glmer(Model4,family = binomial,data=poprun)
Cand.models[[5]]<-glmer(Model5,family = binomial,data=poprun)
Cand.models[[6]]<-glmer(Model6,family = binomial,data=poprun)
Modnames[[i]]<-paste("mod",1:length(Cand.models),sep=" ")
AICTable[[i]]<-aictab(cand.set = Cand.models, modnames = Modnames, sort = TRUE)
}
I receive the following error:
Error in pwrssUpdate(pp, resp, tol = tolPwrss, GQmat = GQmat, compDev = compDev, :
pwrssUpdate did not converge in (maxit) iterations
Ultimately, I want to split the data and have R loop through each population separately, then tell me, for each population, which model has the lowest AIC and what the coefficients of that model are. In reality I have 30+ populations and am comparing 6 models, so I am hoping not to have to write out each one.
NOTE: I do not have a good way to make up data to attach, and cannot attach my own.
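One possible shape for the per-population loop is sketched below. It is not a fix for the convergence error itself: to stay dependency-free it swaps glmer for base glm (so there is no (1|WLH_ID) random effect), and it runs on simulated data, since none was attached. The column names follow the question; the values, the reduced set of four formulas, and the helper fit_population are all assumptions made for illustration.

```r
# Simulated stand-in data; only the column names come from the question.
set.seed(1)
mydata <- data.frame(
  Population = rep(c("Tweed", "Lake", "Hwy"), each = 40),
  Var1 = rnorm(120),
  Var2 = rnorm(120),
  Var3 = rnorm(120)
)
mydata$Used <- rbinom(120, 1, plogis(0.8 * mydata$Var1))

# Candidate models, kept in a named list so they can be looped over.
formulas <- list(
  mod1 = Used ~ Var1 + Var2 + Var3,
  mod2 = Used ~ Var1 + Var2,
  mod3 = Used ~ Var1,
  mod4 = Used ~ 1
)

# Fit every candidate to one population; return its AIC table,
# the name of the lowest-AIC model, and that model's coefficients.
fit_population <- function(popdata) {
  fits <- lapply(formulas, function(f) {
    glm(f, family = binomial, data = popdata)
  })
  aic <- sapply(fits, AIC)
  aic_table <- data.frame(model = names(aic), AIC = aic,
                          dAIC = aic - min(aic),
                          row.names = NULL, stringsAsFactors = FALSE)
  aic_table <- aic_table[order(aic_table$AIC), ]
  best <- names(which.min(aic))
  list(aic_table = aic_table,
       best_model = best,
       coefficients = coef(fits[[best]]))
}

# One list entry per population, named after the population.
results <- lapply(split(mydata, mydata$Population), fit_population)

print(results$Tweed$aic_table)
print(results$Tweed$coefficients)
```

With lme4 installed, the glm call could be replaced by glmer with the random-effect formulas from the question. Wrapping that call in tryCatch would let the loop carry on when a single population fails to converge.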

R: calculating ICC across multiple columns in two dataframes

I am calculating the ICCs for 301 variables of 2 readers. Results are saved in two files with 301 columns each. The first column of file1 (reader1$Var1) corresponds to the first column of file2 (reader2$Var302). I can perform the ICC manually (see below), but I need help to automate this process using apply or a loop. Thank you.
library(irr)
irr::icc()
a= data.frame(reader1$Var1)
b= data.frame(reader2$Var302)
X= data.frame (a,b)
function.ICC <- function (X) {irr::icc(X, model =c("oneway"), type = c("consistency"), unit =("single"), r0 = 0, conf.level = 0.95)}
Results <- function.ICC(X)
Results[7]
A combination of lapply and do.call could work for your case (although there are quite a few options). You don't provide a sample of your data, so I assume you first cbind your 2 dataframes one after the other, so that in this toy example
> X = data.frame(cbind(1:10, 11:20, 21:30, 21:30))
> X
X1 X2 X3 X4
1 1 11 21 21
2 2 12 22 22
3 3 13 23 23
4 4 14 24 24
5 5 15 25 25
6 6 16 26 26
7 7 17 27 27
8 8 18 28 28
9 9 19 29 29
10 10 20 30 30
you would like to run icc of X1 vs X3 and X2 versus X4. It would be something like the following, relying on function.ICC as you've defined it:
> do.call(cbind, lapply(1:2, function(i) function.ICC(X[,c(i, i+2)])))
[,1] [,2]
subjects 10 10
raters 2 2
model "oneway" "oneway"
type "consistency" "consistency"
unit "single" "single"
icc.name "ICC(1)" "ICC(1)"
value -0.8320611 -0.4634146
r0 0 0
Fvalue 0.09166667 0.3666667
df1 9 9
df2 10 10
p.value 0.9993158 0.926668
conf.level 0.95 0.95
lbound -0.9526347 -0.8231069
ubound -0.4669701 0.1848105
So, for your cbind'ed dataframes with 301 columns, something similar to this should work:
do.call(cbind, lapply(1:301, function(i) function.ICC(X[,c(i, i+301)])))
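If only the ICC value per variable is needed, the two data frames can also be paired column by column with mapply, without the cbind step. The sketch below is an assumption-laden illustration: icc_oneway is a hand-written one-way ICC(1) for two raters, used as a stand-in for irr::icc so that the example runs without the irr package, and the two small reader data frames reuse the toy columns from above.

```r
# One-way random-effects ICC(1) for two raters, computed from the
# between-subject and within-subject mean squares.
icc_oneway <- function(a, b) {
  ratings <- cbind(a, b)
  ratings <- ratings[complete.cases(ratings), , drop = FALSE]
  n <- nrow(ratings)  # number of subjects
  k <- ncol(ratings)  # number of raters (2 here)
  subject_means <- rowMeans(ratings)
  msb <- k * sum((subject_means - mean(ratings))^2) / (n - 1)
  msw <- sum((ratings - subject_means)^2) / (n * (k - 1))
  (msb - msw) / (msb + (k - 1) * msw)
}

# Toy stand-ins for the two files; column i of reader1 pairs with
# column i of reader2.
reader1 <- data.frame(Var1 = 1:10, Var2 = 11:20)
reader2 <- data.frame(Var302 = 21:30, Var303 = 21:30)

# mapply walks both data frames column by column in parallel.
icc_values <- mapply(icc_oneway, reader1, reader2)
print(icc_values)
```

On the toy columns this reproduces the two values in the table above. With irr loaded, the same pairing should work with the function from the question, for example mapply(function(a, b) function.ICC(data.frame(a, b))$value, reader1, reader2).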

covariance structure for multilevel modelling

I have a multilevel repeated measures dataset of around 300 patients each with up to 10 repeated measures predicting troponin rise. There are other variables in the dataset, but I haven't included them here.
I am trying to use nlme to create a random slope, random intercept model where effects vary between patients, and effect of time is different in different patients. When I try to introduce a first-order covariance structure to allow for the correlation of measurements due to time I get the following error message.
Error in `coef<-.corARMA`(`*tmp*`, value = value[parMap[, i]]) : Coefficient matrix not invertible
I have included my code and a sample of the dataset, and I would be very grateful for any words of wisdom.
#baseline model includes only the intercept. Random intercept - intercept varies across patients
randomintercept <- lme(troponin ~ 1,
data = df, random = ~1|record_id, method = "ML",
na.action = na.exclude,
control = list(opt="optim"))
#random intercept and time as fixed effect
timeri <- update(randomintercept,.~. + day)
#random slopes and intercept: effect of time is different in different people
timers <- update(timeri, random = ~ day|record_id)
#model covariance structure. corAR1() first order autoregressive covariance structure, timepoints equally spaced
armodel <- update(timers, correlation = corAR1(0, form = ~day|record_id))
Error in `coef<-.corARMA`(`*tmp*`, value = value[parMap[, i]]) : Coefficient matrix not invertible
Data:
record_id day troponin
1 1 32
2 0 NA
2 1 NA
2 2 NA
2 3 8
2 4 6
2 5 7
2 6 7
2 7 7
2 8 NA
2 9 9
3 0 14
3 1 1167
3 2 1935
4 0 19
4 1 16
4 2 29
5 0 NA
5 1 17
5 2 47
5 3 684
6 0 46
6 1 45440
6 2 47085
7 0 48
7 1 87
7 2 44
7 3 20
7 4 15
7 5 11
7 6 10
7 7 11
7 8 197
8 0 28
8 1 31
9 0 NA
9 1 204
10 0 NA
10 1 19
You can fit this if you change your optimizer to "nlminb" (or at least it works with the reduced data set you posted).
armodel <- update(timers,
correlation = corAR1(0, form = ~day|record_id),
control=list(opt="nlminb"))
However, if you look at the fitted model, you'll see you have problems - the estimated AR1 parameter is -1 and the random intercept and slope terms are correlated with r=0.998.
I think the problem is with the nature of the data. Most of the data seem to be in the range 10-50, but there are excursions by one or two orders of magnitude (e.g. individual 6, up to about 45000). It might be hard to fit a model to data this spiky. I would strongly suggest log-transforming your data; the standard diagnostic plot (plot(randomintercept)) looks like this:
whereas fitting on the log scale
rlog <- update(randomintercept,log10(troponin) ~ .)
plot(rlog)
is somewhat more reasonable, although there is still some evidence of heteroscedasticity.
The AR+random-slopes model fits OK:
ar.rlog <- update(rlog,
random = ~day|record_id,
correlation = corAR1(0, form = ~day|record_id))
## Linear mixed-effects model fit by maximum likelihood
## ...
## Random effects:
## Formula: ~day | record_id
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 0.1772409 (Intr)
## day 0.6045765 0.992
## Residual 0.4771523
##
## Correlation Structure: ARMA(1,0)
## Formula: ~day | record_id
## Parameter estimate(s):
## Phi1
## 0.09181557
## ...
A quick glance at intervals(ar.rlog) shows that the confidence intervals on the autoregressive parameter are (-0.52,0.65), so it may not be worth keeping ...
With the random slopes in the model the heteroscedasticity no longer seems problematic ...
plot(ar.rlog,sqrt(abs(resid(.)))~fitted(.),type=c("p","smooth"))
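To see what the AR(1) parameter looks like when the data really are autocorrelated, it can help to fit the same kind of model to simulated data. The sketch below is not the troponin data: it assumes 30 hypothetical patients with 10 equally spaced days each, a true Phi of 0.5, and only a random intercept (simpler than the random-slope model above). All variable names other than day and record_id are made up for illustration.

```r
library(nlme)

set.seed(42)
n_id <- 30   # hypothetical number of patients
n_day <- 10  # equally spaced measurement days 0..9

# Rows are ordered by patient, then by day within patient.
sim <- expand.grid(day = 0:(n_day - 1), record_id = 1:n_id)

# Patient-level random intercepts plus AR(1) residuals with phi = 0.5.
intercepts <- rnorm(n_id, sd = 0.5)
ar_noise <- unlist(lapply(seq_len(n_id), function(i) {
  as.numeric(arima.sim(list(ar = 0.5), n = n_day, sd = 0.3))
}))
sim$log_troponin <- 1.5 + 0.05 * sim$day +
  intercepts[sim$record_id] + ar_noise
sim$record_id <- factor(sim$record_id)

# Random intercept per patient with a first-order autoregressive
# correlation structure on the within-patient residuals.
fit <- lme(log_troponin ~ day, data = sim,
           random = ~ 1 | record_id,
           correlation = corAR1(form = ~ day | record_id),
           method = "ML")

# Estimated AR(1) parameter on its natural (-1, 1) scale.
phi_hat <- coef(fit$modelStruct$corStruct, unconstrained = FALSE)
print(phi_hat)
```

An estimate pinned at -1 or 1, as with the untransformed data above, is a warning sign. A value well inside the interval is what a reasonable fit would be expected to produce.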

How to compute regressions and extract parameters on all columns of a dataset [duplicate]

I want to compute a linear regression on every column (or on selected columns) of a specific dataset. The first column represents the X axis of the regression; each of the other columns is one subject's response. The second step is to extract, for each subject, the coefficients of the regression (linear or logistic).
Currently I do this manually for each column, using lm (or glm) and extracting the coefficients into a specific variable and dataset.
Example using lm:
dataset <- as.data.frame(matrix(c(1,1,
3,7,2,1,4,5,3,2,4,6,4,2,5,8,5,5,9,9,6,4,
12,10,7,6,15,11,8,6,15,15,9,8,16,10,10,9,18,9,11,12,
20,12,12,15,21,16,13,18,22,15,14,22,21,10,15,29,24,12)
,nrow=15, ncol=4,byrow=TRUE))
colnames(dataset) <- c("X","Sj1","Sj2","Sj3")
Output:
dataset
X Sj1 Sj2 Sj3
1 1 1 3 7
2 2 1 4 5
3 3 2 4 6
4 4 2 5 8
5 5 5 9 9
6 6 4 12 10
7 7 6 15 11
8 8 6 15 15
9 9 8 16 10
10 10 9 18 9
11 11 12 20 12
12 12 15 21 16
13 13 18 22 15
14 14 22 21 10
15 15 29 24 12
Regressions:
attach (dataset)
mod1 <- lm(Sj1~X)
mod2 <- lm(Sj2~X)
mod3 <- lm(Sj3~X)
Intercept <- 0
Intercept[1] <- mod1$coefficients[[1]]
Intercept[2] <- mod2$coefficients[[1]]
Intercept[3] <- mod3$coefficients[[1]]
Slope <- 0
Slope[1] <- mod1$coefficients[[2]]
Slope[2] <- mod2$coefficients[[2]]
Slope[3] <- mod3$coefficients[[2]]
data.frame(Intercept,Slope,row.names=colnames(dataset)[-1])
and the final output is
Intercept Slope
Sj1 -4.580952 1.7392857
Sj2 1.104762 1.6035714
Sj3 6.104762 0.5285714
Is there code to do this automatically, independently of the number of columns? I tried apply and function without success.
What is the best way to do this?
lm accepts a matrix on the LHS. See the documentation.
f <- as.formula(paste0("cbind(", paste(names(dataset)[-1], collapse = ","), ") ~ X"))
mods <- lm(f, data = dataset)
coef(mods)
# Sj1 Sj2 Sj3
#(Intercept) -4.580952 1.104762 6.1047619
#X 1.739286 1.603571 0.5285714
PS: You should get out of the habit of using attach.
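The matrix left-hand side works for lm, but the question also mentions logistic fits, and glm does not take multiple response columns this way. An alternative that covers both, sketched here on the same dataset, is to loop over the response names and build each formula with reformulate. The helper fit_one and the result name params are illustrative.

```r
dataset <- as.data.frame(matrix(c(
  1, 1, 3, 7,     2, 1, 4, 5,     3, 2, 4, 6,
  4, 2, 5, 8,     5, 5, 9, 9,     6, 4, 12, 10,
  7, 6, 15, 11,   8, 6, 15, 15,   9, 8, 16, 10,
  10, 9, 18, 9,   11, 12, 20, 12, 12, 15, 21, 16,
  13, 18, 22, 15, 14, 22, 21, 10, 15, 29, 24, 12
), nrow = 15, ncol = 4, byrow = TRUE))
colnames(dataset) <- c("X", "Sj1", "Sj2", "Sj3")

# Fit one response column against X and return its two coefficients.
fit_one <- function(response) {
  coef(lm(reformulate("X", response = response), data = dataset))
}

# One row per subject, one column per coefficient.
params <- t(sapply(names(dataset)[-1], fit_one))
colnames(params) <- c("Intercept", "Slope")
print(params)
```

Replacing lm with glm (plus a family argument) inside fit_one would give per-column logistic fits, and no attach is needed because the data frame is passed explicitly.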

Positive Spearman's r value with negative abline

I'm looking over some data I collected this weekend, and one of the Spearman correlation tests gave me a positive value, but when I add an abline to the plot, it is descending. I'm curious how this can be possible.
Here are the data for Study:
Year Matches Total
1 1958 2 7
2 1959 2 14
3 1960 5 9
4 1961 2 20
5 1962 4 27
6 1963 5 20
7 1964 5 25
8 1965 5 20
9 1966 3 18
10 1967 5 28
11 1968 6 26
12 1969 4 24
13 1970 6 22
14 1971 7 32
And here's the procedure I used:
Results<- (Study$Matches/Study$Total)*100
Year<-Study$Year
plot(Year, Results, main = "MAIN")
fit1 <- lm (Results ~ Year, data = Study)
abline(fit1, lty = "dashed")
cor.test(Year, Results, method = "s")
The fit produces a linear model. Pearson measures linear correlation, and here it is negative. Spearman measures monotonic (not necessarily linear) association and is computed from the ranks of the values.
> cor.test(Year,Results,method="spearman")
Spearman's rank correlation rho
data: Year and Results
S = 438.9647, p-value = 0.9048
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.03524238
Warning message:
In cor.test.default(Year, Results, method = "spearman") :
Cannot compute exact p-value with ties
> cor(Year,Results,method="spearman")
[1] 0.03524238
> cor(Year,Results,method="pearson")
[1] -0.17501
It is correctly plotting the negative linear correlation. It just happens that Spearman is positive. Unintuitive things can happen when correlations are low. The high p-value is a clue too.
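The relationship between the three quantities can be checked directly: Spearman's rho is Pearson's r computed on the ranks, and the lm slope always has the sign of Pearson's r because it equals r times the ratio of the standard deviations. A short sketch on the data from the question (the variable names other than Study, Year and Results are illustrative):

```r
Study <- data.frame(
  Year = 1958:1971,
  Matches = c(2, 2, 5, 2, 4, 5, 5, 5, 3, 5, 6, 4, 6, 7),
  Total = c(7, 14, 9, 20, 27, 20, 25, 20, 18, 28, 26, 24, 22, 32)
)
Results <- Study$Matches / Study$Total * 100

pearson <- cor(Study$Year, Results, method = "pearson")
spearman <- cor(Study$Year, Results, method = "spearman")

# Spearman is Pearson applied to the ranks (ties get average ranks).
spearman_by_hand <- cor(rank(Study$Year), rank(Results))

# The regression slope is Pearson's r rescaled by the two standard
# deviations, so it always shares the sign of r.
slope <- coef(lm(Results ~ Year, data = Study))[["Year"]]
slope_from_r <- pearson * sd(Results) / sd(Study$Year)

print(c(pearson = pearson, spearman = spearman, slope = slope))
```

Ranking shrinks the influence of extreme values such as the 1960 ratio of 5/9, which is one way the two coefficients can end up on opposite sides of zero when both are close to it.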

Resources