How to get an adjusted dependent variable in R

I am working on adjusting urine sodium by urine creatinine and age, in order to use the adjusted variable in further analysis.
How do I create a new variable holding the adjusted version of the data? Do I divide NA24 by creatinine and age? Do I multiply them? Please help.
I ran a linear model as follows, but I am not sure what to do with the information:
Call:
lm(formula = PRENA24 ~ PRECR24mmol * PREALD, data = c1.3)
Residuals:
Min 1Q Median 3Q Max
-228.439 -43.024 -5.215 37.790 274.414
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.84482 29.60684 2.258 0.02423 *
PRECR24mmol 7.00565 2.10989 3.320 0.00094 ***
PREALD -0.66555 0.60912 -1.093 0.27488
PRECR24mmol:PREALD 0.06335 0.04392 1.442 0.14962
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 65.94 on 798 degrees of freedom
Multiple R-squared: 0.2963, Adjusted R-squared: 0.2937
F-statistic: 112 on 3 and 798 DF, p-value: < 2.2e-16
I need to adjust the PRENA24 values and want to add a new column holding them (i.e. PRENA24.ADJ).
I know the following is incorrect, but I am not sure what else to do with the information from the linear model. The post-lab data is separated by treatment type as well.
c1 <- c1.3 %>%
  mutate(PRENA24.ADJ = PRENA24 - 66.84482 + (7.00565 * PRECR24mmol) + (-0.66555 * PREALD))
c2 <- c1 %>%
  mutate(NA24.ADJ = NA24 - 24.59443 + (10.54905 * CR24mmol) + (0.58894 * ALD))
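With a fitted model in hand, a common way to build an "adjusted" outcome is to keep each subject's residual (the part of the outcome not explained by the covariates) and add back the overall mean, so the adjusted values stay on the original scale. A minimal sketch using the built-in mtcars data, since c1.3 is not available (PRENA24, PRECR24mmol and PREALD are the question's column names):

```r
# Adjust an outcome for covariates: residuals + grand mean keeps the
# original scale while removing the covariates' linear contribution.
fit <- lm(mpg ~ wt * hp, data = mtcars)   # stand-in for PRENA24 ~ PRECR24mmol * PREALD
mtcars$mpg.adj <- resid(fit) + mean(mtcars$mpg)

# On the question's data the analogous lines would be:
# fit <- lm(PRENA24 ~ PRECR24mmol * PREALD, data = c1.3)
# c1.3$PRENA24.ADJ <- resid(fit) + mean(c1.3$PRENA24)
```

The adjusted column keeps the original mean but is uncorrelated with the covariates included in the model, which is usually what "adjusted for X" is meant to deliver.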

Related

Polynomial regression with a for loop

For the Boston dataset I want to perform polynomial regression with degrees 5, 4, 3, and 2 using a loop, but I get an error:
Error in [.data.frame(data, 0, cols, drop = FALSE) :
undefined columns selected
library(caret)
train_control <- trainControl(method = "cv", number = 10)
#set.seed(5)
cv <- rep(NA, 4)
n <- c(5, 4, 3, 2)
for (i in n) {
  cv[i] = train(nox ~ poly(dis, degree = i), data = Boston, trncontrol = train_control, method = "lm")
}
Outside the loop, train(nox ~ poly(dis, degree = i), data = Boston, trncontrol = train_control, method = "lm") works fine.
Since you are using poly(..., raw = FALSE) (the default), you are getting orthogonal polynomials. Hence there is no need for a for-loop: fit the maximum degree, since the coefficient estimates of the lower-order terms will not change (the standard errors shift slightly because the residual variance is re-estimated).
Here is a quick example using lm and the iris dataset:
summary(lm(Sepal.Length~poly(Sepal.Width, 2), iris))
Call:
lm(formula = Sepal.Length ~ poly(Sepal.Width, 2), data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.63153 -0.62177 -0.08282 0.50531 2.33336
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.84333 0.06692 87.316 <2e-16 ***
poly(Sepal.Width, 2)1 -1.18838 0.81962 -1.450 0.1492
poly(Sepal.Width, 2)2 -1.41578 0.81962 -1.727 0.0862 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8196 on 147 degrees of freedom
Multiple R-squared: 0.03344, Adjusted R-squared: 0.02029
F-statistic: 2.543 on 2 and 147 DF, p-value: 0.08209
> summary(lm(Sepal.Length~poly(Sepal.Width, 3), iris))
Call:
lm(formula = Sepal.Length ~ poly(Sepal.Width, 3), data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.6876 -0.5001 -0.0876 0.5493 2.4600
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.84333 0.06588 88.696 <2e-16 ***
poly(Sepal.Width, 3)1 -1.18838 0.80687 -1.473 0.1430
poly(Sepal.Width, 3)2 -1.41578 0.80687 -1.755 0.0814 .
poly(Sepal.Width, 3)3 1.92349 0.80687 2.384 0.0184 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8069 on 146 degrees of freedom
Multiple R-squared: 0.06965, Adjusted R-squared: 0.05054
F-statistic: 3.644 on 3 and 146 DF, p-value: 0.01425
Take a look at the summary tables: the coefficient estimates are identical, and only the poly(Sepal.Width, 3)3 row was added when a degree of 3 was used. So if we fit degree 3, we can easily read off what degree 2 would look like. Hence no need for a for-loop.
Note that you could use several variables in poly, e.g. poly(cbind(Sepal.Width, Petal.Length, Petal.Width), 4), and still be able to easily recover poly(Sepal.Width, 2).
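The invariance claim is easy to verify programmatically; a minimal sketch with base lm on iris (no caret needed):

```r
# With orthogonal polynomials (poly's default, raw = FALSE), refitting with
# a higher degree leaves the lower-order coefficient estimates untouched.
fit2 <- lm(Sepal.Length ~ poly(Sepal.Width, 2), data = iris)
fit3 <- lm(Sepal.Length ~ poly(Sepal.Width, 3), data = iris)

# The first three coefficients of fit3 match fit2 exactly.
cbind(degree2 = coef(fit2), degree3 = coef(fit3)[1:3])
```

This works because poly() builds an orthonormal basis, so adding a column leaves the projections onto the existing columns unchanged.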

How do I construct a line of code to test the conditional mean difference in votes by region, controlling for other variables in the model?

Call:
lm(formula = votes ~ redist + party + region, data = prob2)
Residuals:
Min 1Q Median 3Q Max
-56.824 -15.175 0.333 13.903 55.549
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -49.4762 16.7739 -2.950 0.00339 **
redist 1.3881 0.2792 4.972 1.02e-06 ***
party 37.2574 2.1062 17.689 < 2e-16 ***
regionmidwest -9.2403 2.8632 -3.227 0.00136 **
regionsouth -23.4173 3.1394 -7.459 6.45e-13 ***
regionwest 5.3285 3.8537 1.383 0.16761
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 18.87 on 364 degrees of freedom
Multiple R-squared: 0.5756, Adjusted R-squared: 0.5698
F-statistic: 98.74 on 5 and 364 DF, p-value: < 2.2e-16
I have to conduct a joint hypothesis test. Since I called the regression above q2, I initially tried linearHypothesis(q2, c("votes = region")), but I got an error saying: The hypothesis "votes = region" is not well formed: contains bad coefficient/variable names.
Appreciate any help!
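The hypothesis must be written in terms of the coefficient names shown in the summary (regionmidwest, regionsouth, regionwest), not the variable names votes or region; with car that would be linearHypothesis(q2, c("regionmidwest = 0", "regionsouth = 0", "regionwest = 0")). The same joint F test can also be done in base R by comparing nested models; a minimal sketch using mtcars, with factor(cyl) standing in for region since prob2 is not available:

```r
# Joint test that all dummies of a factor are zero, given the other
# covariates: compare the full model against the model without the factor.
full       <- lm(mpg ~ wt + factor(cyl), data = mtcars)
restricted <- lm(mpg ~ wt, data = mtcars)
anova(restricted, full)   # F test on the factor(cyl) dummies jointly
```

The F statistic and p-value in the second row of the anova table are the joint test of the factor's dummies, controlling for everything else in the model.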

Wald test on regression coefficients of a factor variable in R

I'm a newbie in R and I have this fitted model:
> mqo_reg_g <- lm(G ~ factor(year), data = data)
> summary(mqo_reg_g)
Call:
lm(formula = G ~ factor(year), data = data)
Residuals:
Min 1Q Median 3Q Max
-0.11134 -0.06793 -0.04239 0.01324 0.85213
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.111339 0.005253 21.197 < 2e-16 ***
factor(year)2002 -0.015388 0.007428 -2.071 0.038418 *
factor(year)2006 -0.016980 0.007428 -2.286 0.022343 *
factor(year)2010 -0.024432 0.007496 -3.259 0.001131 **
factor(year)2014 -0.025750 0.007436 -3.463 0.000543 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.119 on 2540 degrees of freedom
Multiple R-squared: 0.005952, Adjusted R-squared: 0.004387
F-statistic: 3.802 on 4 and 2540 DF, p-value: 0.004361
I want to test the difference between the coefficients of factor(year)2002 and the intercept, between factor(year)2006 and factor(year)2002, and so on.
In Stata I know people use the test command, which performs Wald tests on the parameters of a fitted model, but I could not find how to do this in R.
How can I do it?
Thanks!
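Stata's test command corresponds to a Wald test, which in R can be done with car::linearHypothesis (e.g. linearHypothesis(mqo_reg_g, "factor(year)2006 = factor(year)2002")) or by hand from coef() and vcov(). A minimal by-hand sketch using iris, with Species standing in for year since the question's data isn't available:

```r
# Wald test of H0: b_j = b_k, built from the coefficient vector and the
# estimated covariance matrix of the fit.
wald_diff <- function(model, j, k) {
  b <- coef(model)
  V <- vcov(model)
  est <- unname(b[j] - b[k])
  se  <- sqrt(V[j, j] + V[k, k] - 2 * V[j, k])   # Var(b_j - b_k)
  z   <- est / se
  c(estimate = est, statistic = z, p.value = 2 * pnorm(-abs(z)))
}

fit <- lm(Sepal.Length ~ factor(Species), data = iris)
wald_diff(fit, "factor(Species)virginica", "factor(Species)versicolor")
```

The arguments j and k are coefficient names exactly as printed by summary(), so for the question's model they would be strings like "factor(year)2006" and "factor(year)2002".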

How to extract p-values from a sumurca object?

I'd like to extract the p-values from the summary output of ur.za in the urca package.
library(urca)
data(nporg)
gnp <- na.omit(nporg[, "gnp.r"])
za.gnp <- ur.za(gnp, model="both", lag=2)
summary(za.gnp)
> summary(za.gnp)
################################
# Zivot-Andrews Unit Root Test #
################################
Call:
lm(formula = testmat)
Residuals:
Min 1Q Median 3Q Max
-39.753 -9.413 2.138 9.934 22.977
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.49068 10.25301 2.096 0.04096 *
y.l1 0.77341 0.05896 13.118 < 2e-16 ***
trend 1.19804 0.66346 1.806 0.07675 .
y.dl1 0.39699 0.12608 3.149 0.00272 **
y.dl2 0.10503 0.13401 0.784 0.43676
du -25.44710 9.20734 -2.764 0.00788 **
dt 2.11456 0.84179 2.512 0.01515 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.72 on 52 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.9948, Adjusted R-squared: 0.9942
F-statistic: 1651 on 6 and 52 DF, p-value: < 2.2e-16
Teststatistic: -3.8431
Critical values: 0.01= -5.57 0.05= -5.08 0.1= -4.82
Potential break point at position: 21
None of the methods I found for lm summary objects seem to work here, and I've spent quite some time searching through str(summary(za.gnp)) to no avail. Any hints on where to look?
Objects of class ur.za are S4 objects, which behave differently from S3 objects like those produced by lm. One difference is the concept of the slot, accessed via the @ operator.
summary(za.gnp) has a pval slot, but its value is NULL:
summary(za.gnp)@pval
NULL
However, it also has a testreg slot containing an lm object with the test results, which you can use to obtain the p-values in the usual way:
coef(summary(summary(za.gnp)@testreg))[, "Pr(>|t|)"]
(Intercept) y.l1 trend y.dl1 y.dl2 du
4.096351e-02 4.007914e-18 7.674887e-02 2.716223e-03 4.367588e-01 7.884201e-03
dt
1.514797e-02
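As a general tool for this kind of digging: slotNames() lists an S4 object's slots, and slot(x, "name") is the functional form of x@name. A minimal sketch with a toy S4 class (the class and slot names here are made up for illustration, not part of urca):

```r
library(methods)  # S4 machinery; loaded by default in interactive R

# S4 objects expose their contents as slots rather than list elements.
setClass("ToyTest", slots = c(stat = "numeric", pval = "numeric"))
res <- new("ToyTest", stat = -3.84, pval = numeric(0))

slotNames(res)     # lists the available slots, like names() for a list
res@stat           # slot access uses @, not $
slot(res, "stat")  # equivalent functional form
```

Running slotNames(summary(za.gnp)) would have surfaced testreg directly, which is often faster than wading through str() output.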

Different results from lm on the same dataset written in two different languages (English and Korean)

The results of the lm function applied to two datasets (numeric plus categorical variables), one written in English and the other in Korean, are different. Apart from the categorical variables, the numeric variables are exactly the same. What could explain the difference in the results?
#data
df3 <- repmis::source_DropboxData("df3_v0.1.csv","gg30a74n4ew3zzg",header = TRUE)
#the one written in korean
out1<-lm(YD~SANJI+TAmin8+TMINup18do6+typ_rain6+DTD9,data=df3)
summary(out1)
#the one written in eng
df3$SANJI[df3$SANJI=="전북"]<-"JB"
df3$SANJI[df3$SANJI=="충북"]<-"CHB"
df3$SANJI[df3$SANJI=="경북"]<-"KB"
df3$SANJI[df3$SANJI=="전남"]<-"JN"
df3$SANJI2[df3$SANJI2=="고창"]<-"Gochang"
df3$SANJI2[df3$SANJI2=="괴산"]<-"Goesan"
df3$SANJI2[df3$SANJI2=="단양"]<-"Danyang"
df3$SANJI2[df3$SANJI2=="봉화"]<-"Fenghua"
df3$SANJI2[df3$SANJI2=="신안"]<-"Sinan"
df3$SANJI2[df3$SANJI2=="안동"]<-"Andong"
df3$SANJI2[df3$SANJI2=="영광"]<-"younggang"
df3$SANJI2[df3$SANJI2=="영양"]<-"youngyang"
df3$SANJI2[df3$SANJI2=="영주"]<-"youngju"
df3$SANJI2[df3$SANJI2=="예천"]<-"Yecheon"
df3$SANJI2[df3$SANJI2=="의성"]<-"Yusaeng"
df3$SANJI2[df3$SANJI2=="제천"]<-"Jechon"
df3$SANJI2[df3$SANJI2=="진안"]<-"Jinan"
df3$SANJI2[df3$SANJI2=="청송"]<-"Changsong"
df3$SANJI2[df3$SANJI2=="해남"]<-"Haenam"
out2<-lm(YD~SANJI+TAmin8+TMINup18do6+typ_rain6+DTD9,data=df3)
summary(out2)
#the one written in korean
#Call:
#lm(formula = YD ~ SANJI + TAmin8 + TMINup18do6 + typ_rain6 +
# DTD9, data = df3)
#Residuals:
# Min 1Q Median 3Q Max
#-98.836 -23.173 -2.261 22.626 111.367
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 970.33251 84.12479 11.534 < 2e-16 ***
#SANJI전남 -33.75664 12.53277 -2.693 0.008158 **
#SANJI전북 -44.17939 11.22274 -3.937 0.000144 ***
#SANJI충북 -44.09285 9.16736 -4.810 4.74e-06 ***
#TAmin8 -25.56618 3.36053 -7.608 9.37e-12 ***
#TMINup18do6 4.58052 0.96528 4.745 6.19e-06 ***
#typ_rain6 -0.19754 0.02862 -6.903 3.23e-10 ***
#DTD9 -16.15975 2.65128 -6.095 1.59e-08 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 37.2 on 112 degrees of freedom
#Multiple R-squared: 0.58, Adjusted R-squared: 0.5538
#F-statistic: 22.1 on 7 and 112 DF, p-value: < 2.2e-16
#the one written in eng
#Call:
#lm(formula = YD ~ SANJI + TAmin8 + TMINup18do6 + typ_rain6 +
# DTD9, data = df3)
#Residuals:
# Min 1Q Median 3Q Max
#-98.836 -23.173 -2.261 22.626 111.367
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 926.23966 84.32621 10.984 < 2e-16 ***
#SANJIJB -0.08654 12.32752 -0.007 0.994
#SANJIJN 10.33620 13.09434 0.789 0.432
#SANJIKB 44.09285 9.16736 4.810 4.74e-06 ***
#TAmin8 -25.56618 3.36053 -7.608 9.37e-12 ***
#TMINup18do6 4.58052 0.96528 4.745 6.19e-06 ***
#typ_rain6 -0.19754 0.02862 -6.903 3.23e-10 ***
#DTD9 -16.15975 2.65128 -6.095 1.59e-08 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 37.2 on 112 degrees of freedom
#Multiple R-squared: 0.58, Adjusted R-squared: 0.5538
#F-statistic: 22.1 on 7 and 112 DF, p-value: < 2.2e-16
Your overall model fits are the same; you just have different reference levels for your factor (SANJI). A different reference level also changes your intercept, but it does not change the estimation of your continuous covariates.
You can use relevel() to force a particular reference level (assuming SANJI is already a factor), or explicitly create the factor() with a levels = argument. Otherwise the default order is alphabetical, and the levels may not sort the same way in the two languages.
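The point is easy to demonstrate: releveling a factor reshuffles the dummy coefficients (and the intercept), while fitted values and continuous-covariate estimates stay the same. A minimal sketch with iris, Species standing in for SANJI:

```r
# Same model, two reference levels: coefficients differ, the fit does not.
fit_a <- lm(Sepal.Length ~ Species + Petal.Length, data = iris)

iris2 <- iris
iris2$Species <- relevel(iris2$Species, ref = "virginica")
fit_b <- lm(Sepal.Length ~ Species + Petal.Length, data = iris2)

coef(fit_a)                                # dummies relative to setosa
coef(fit_b)                                # dummies relative to virginica
all.equal(fitted(fit_a), fitted(fit_b))    # TRUE: identical fit
```

In the question's output you can see the same thing: R-squared, residuals, and all the continuous coefficients (TAmin8, TMINup18do6, typ_rain6, DTD9) are identical across the Korean and English versions; only the intercept and the SANJI dummies moved.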
