R: Rollapply lm regression on zoo matrix objects - r

I would like to perform a rolling regression using lm on many pairs of data series within a single zoo object.
While I am able to perform a rolling regression on one single pair of data series in a zoo object by the following codes:
FunLm <- function(x,Param,Days) coef(lm(AAA ~ Z, data = as.data.frame(x), weights=Param*(seq(Days,1,by=-1))))
DataLmCoef <- rollapplyr(Data, Days, FunLm, Param, Days, by.column = FALSE)
with zoo of this structure:
Z AAA
2012-07-01 1 853
2012-07-04 2 864
2012-07-05 3 865
2012-07-06 4 873
2012-07-07 5 870
2012-07-08 6 874
My question is, if I have the following zoo object:
Z AAA BBB CCC
2012-07-01 1 853 123 65
2012-07-04 2 864 124 62
2012-07-05 3 865 126 63
2012-07-06 4 873 120 66
2012-07-07 5 870 121 68
2012-07-08 6 874 123 69
without using loop, how can I perform rolling regression similarly on Z~AAA, Z~BBB, Z~CCC, Z~DDD, .... and get two zoo matrix objects with one storing intercepts and the other storing slopes?

Following the example from the rollapply man page
You can add more then one test in the roll function
for example
> seat <- as.zoo(log(UKDriverDeaths))
> time(seat) <- as.yearmon(time(seat))
> seat <- merge(y = seat, y1 = lag(seat, k = -1),
y12 = lag(seat, k = -12), all = FALSE)
> fm <- rollapply(seat, width = 36,
FUN = function(z)
data.frame(
test1 = t(coef(lm(y ~ y1 + y12, data = as.data.frame(z)))),
test3 = t(coef(lm(y ~ y12, data = as.data.frame(z))))
) ,
by.column = FALSE, align = "right")
And the result
> head(fm)
test1..Intercept. test1.y1 test1.y12 test3..Intercept. test3.y12
דצמ 1972 0.9629793 0.15344243 0.7240740 1.530598 0.8026003
ינו 1973 1.1336058 0.13920023 0.7155899 1.570067 0.7973688
פבר 1973 0.9978077 0.14346100 0.7293183 1.440635 0.8145803
מרץ 1973 0.9879002 0.12929214 0.7442218 1.375245 0.8226257
אפר 1973 1.2281307 0.11700612 0.7250115 1.545356 0.8003661
מאי 1973 1.4483700 0.08860055 0.7245032 1.706343 0.7792279

Related

Writing a function to compare differences of a series of numeric variables

I am working on a problem set and absolutely cannot figure this one out. I think I've fried my brain to the point where it doesn't even make sense anymore.
Here is a look at the data ...
sex age chol tg ht wt sbp dbp vldl hdl ldl bmi
<chr> <int> <int> <int> <dbl> <dbl> <int> <int> <int> <int> <int> <dbl>
1 M 60 137 50 68.2 112. 110 70 10 53 74 2.40
2 M 26 154 202 82.8 185. 88 64 34 31 92 2.70
3 M 33 198 108 64.2 147 120 80 22 34 132 3.56
4 F 27 154 47 63.2 129 110 76 9 57 88 3.22
5 M 36 212 79 67.5 176. 130 100 16 37 159 3.87
6 F 31 197 90 64.5 121 122 78 18 58 111 2.91
7 M 28 178 163 66.5 167 118 68 19 30 135 3.78
8 F 28 146 60 63 105. 120 80 12 46 88 2.64
9 F 25 231 165 64 126 130 72 23 70 137 3.08
10 M 22 163 30 68.8 173 112 70 6 50 107 3.66
# … with 182 more rows
I must write a function, myTtest, to perform the following task:
Perform a two-sample t-tests to compare the differences of a series of numeric variables between each level of a classification variable
The first argument, dat, is a data frame
The second argument, classVar, is a character vector of length 1. It is the name of the classification variable, such as 'sex.'
The third argument, numVar, is a character vector that contains the name of the numeric variables, such as c("age", "chol", "tg"). This means I need to perform three t-tests to compare the difference of those between males and females.
The function should return a data frame with the following variables: Varname, F.mean, M.mean, t (for t-statistics), df (for degrees of freedom), and p (for p-value).
I should be able to run this ...
myTtest(dat = chol, classVar = "sex", numVar = c("age", "chol", "tg")
... and then get the data frame to appear.
Any help is greatly appreciated. I am pulling my hair out over this one! As well, as noted in my comment below, this has to be done without Tidyverse ... which is why I'm having so much trouble to begin with.
The intuition for this solution is that you can loop over your dependent variables, and call t.test() in each loop. Then save the results from each DV and stack them together in one big data frame.
I'll leave out some bits for you to fill in, but here's the gist:
First, some example data:
set.seed(123)
n <- 20
grp <- sample(c("m", "f"), n, replace = TRUE)
df <- data.frame(grp = grp, age = rnorm(n), chol = rnorm(n), tg = rnorm(n))
df
grp age chol tg
1 m 1.2240818 0.42646422 0.25331851
2 m 0.3598138 -0.29507148 -0.02854676
3 m 0.4007715 0.89512566 -0.04287046
4 f 0.1106827 0.87813349 1.36860228
5 m -0.5558411 0.82158108 -0.22577099
6 f 1.7869131 0.68864025 1.51647060
7 f 0.4978505 0.55391765 -1.54875280
8 f -1.9666172 -0.06191171 0.58461375
9 m 0.7013559 -0.30596266 0.12385424
10 m -0.4727914 -0.38047100 0.21594157
Now make a container that each of the model outputs will go into:
fits_df <- data.frame()
Loop over each DV and append the model output to fits_df each time with rbind:
for (dv in c("age", "chol", "tg")) {
frml <- as.formula(paste0(dv, " ~ grp")) # make a model formula: dv ~ grp
fit <- t.test(frml, two.sided = TRUE, data = df) # perform the t-test
# hint: use str(fit) to figure out how to pull out each value you care about
fit_df <- data.frame(
dv = col,
f_mean = xxx,
m_mean = xxx,
t = xxx,
df = xxx,
p = xxx
)
fits_df <- rbind(fits_df, fit_df)
}
Your output will look like this:
fits_df
dv f_mean m_mean t df p
1 age -0.18558068 -0.04446755 -0.297 15.679 0.7704954
2 chol 0.07731514 0.22158672 -0.375 17.828 0.7119400
3 tg 0.09349567 0.23693052 -0.345 14.284 0.7352112
One note: When you're pulling out values from fit, you may get odd row names in your output data frame. This is due to the names property of the various fit attributes. You can get rid of these by using as.numeric() or as.character() wrappers around the values you pull from fit (for example, fit$statistic can be cleaned up with as.character(round(fit$statistic, 3))).

extract beta, sd and P-value from meta-regression using meta package in r to a nice output

I am using the code below to do meta-regression in R and repeat it several time for different variables.
My dataframe and codes are as follow
data<-read.table(text="Studlab PCI.total.FU CABG.total.FU PCI CABG Mean.Age Females..
A 4515 4485 45 51 65.1 22.35
B 4740 4785 74 49 65.95 23.15
C 3621.4 3598.6 41 31 63.15 28.65
D 2337 2314.2 20 29 60 30.5
E 1835.2 1835.2 20 16 66.2 22
F 2014.8 2033.2 11 6 64.45 28.55
G 1125 1125 4 5 61.95 20.65
H 1500 1500 6 3 62.25 23.5
I 976 1000 11 3 61.5 21
J 202 194 10 0 62.4 1", sep="", header=T)
library(meta);library(metafor)
mr <- metainc( PCI, PCI.total.FU,CABG, CABG.total.FU,
data = data, studlab = Studlab, method = "Inverse")
Then for meta-regression I used the following code
MEG<-metareg (mr, ~Mean.Age);MEG ;
#==================================
b = round(MEG[["b"]], digits = 2)
se = round(MEG[["se"]], digits = 2)
pval = round(MEG[["pval"]], digits = 2)
paste0(b,"±",se,", P=",pval)
# Then I repeat meta-regression with another variable
MEG<-metareg (mr, ~Females..);MEG
#==================================
b = round(MEG[["b"]], digits = 2)
se = round(MEG[["se"]], digits = 2)
pval = round(MEG[["pval"]], digits = 2)
paste0(b,"±",se,", P=",pval)
and so on. So; b,se, pval and paste0 steps will be repeated frequently to get the needed output
The content of MEG is shown in the screenshot below.
My question is there is anyway to repeat this function (those repeated steps) several times with different variables (here I used "Mean.Age" then I used "Females..". In another term , I reproduce several MEG with different variables. I am thinking if there is anyway like Macro or so to call those function repeatedly without continuous copy and paste the code several times
Any advice will be greatly appreciated.
I am doing that to finally create a table like this

how to add regression lines for each factor on a plot

I've created a model and I'm trying to add curves that fit the two parts of the data, insulation and no insulation. I was thinking about using the insulation coefficient as a true/false term, but I'm not sure how to translate that into code. Entries 1:56 are "w/o" and 57:101 are "w/". I'm not sure how to include the data I'm using but here's the head and tail:
month year kwh days est cost avgT dT.yr kWhd.1 id insulation
1 8 2003 476 21 a 33.32 69 -8 22.66667 1 w/o
2 9 2003 1052 30 e 112.33 73 -1 35.05172 2 w/o
3 10 2003 981 28 a 24.98 60 -6 35.05172 3 w/o
4 11 2003 1094 32 a 73.51 53 2 34.18750 4 w/o
5 12 2003 1409 32 a 93.23 44 6 44.03125 5 w/o
6 1 2004 1083 32 a 72.84 34 3 33.84375 6 w/o
month year kwh days est cost avgT dT.yr kWhd.1 id insulation
96 7 2011 551 29 e 55.56 72 0 19.00000 96 w/
97 8 2011 552 27 a 61.17 78 1 20.44444 97 w/
98 9 2011 666 34 e 73.87 71 -2 19.58824 98 w/
99 10 2011 416 27 a 48.03 64 0 15.40741 99 w/
100 11 2011 653 31 e 72.80 53 1 21.06452 100 w/
101 12 2011 751 33 a 83.94 45 2 22.75758 101 w/
bill$id <- seq(1:101)
bill$insulation <- as.factor(ifelse(bill$id > 56, c("w/"), c("w/o")))
m1 <- lm(kWhd.1 ~ avgT + insulation + I(avgT^2), data=bill)
with(bill, plot(kWhd.1 ~ avgT, xlab="Average Temperature (F)",
ylab="Daily Energy Use (kWh/d)", col=insulation))
no_ins <- data.frame(bill$avgT[1:56], bill$insulation[1:56])
curve(predict(m1, no_ins=x), add=TRUE, col="red")
ins <- data.frame(bill$avgT[57:101], bill$insulation[57:101])
curve(predict(m1, ins=x), add=TRUE, lty=2)
legend("topright", inset=0.01, pch=21, col=c("red", "black"),
legend=c("No Insulation", "Insulation"))
ggplot2 makes this a lot easier than base plotting. Something like this should work:
ggplot(bill, aes(x = avgT, y = kWhd.1, color = insulation)) +
geom_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE) +
geom_point()
In base, I'd create a data frame with point you want to predict on, something like
pred_data = expand.grid(
kWhd.1 = seq(min(bill$kWhd.1), max(bill$kWhd.1), length.out = 100),
insulation = c("w/", "w/o")
)
pred_data$prediction = predict(m1, newdata = pred_data)
And then use lines to add the predictions to your plot. My base graphics is pretty rusty, so I'll leave that to you (or another answerer) if you want it.
In base R it's important to order the x-values. Since this is to be done on multiple factors, we can do this with by, resulting in a list L.
Since your example data is not complete, here's an example with iris where we consider Species as the "factor".
L <- by(iris, iris$Species, function(x) x[order(x$Petal.Length), ])
Now we can do the plot and add loess predictions as lines with a sapply.
with(iris, plot(Sepal.Width ~ Petal.Length, col=Species))
sapply(seq(L), function(x)
lines(L[[x]]$Petal.Length,
predict(loess(Sepal.Width ~ Petal.Length, L[[x]], span=1.1)), # span=1.1 for smoothing
col=x))
Yields

How to use IQR outlier function, based on a key, in R

I want to use this IQR function:
smooth_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.3 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- round(qnt[1] - H)
y[x > (qnt[2] + H)] <- round(qnt[2] + H)
y
}
on the below df, on the total column for every specific key, based on the key column:
key total
US4ZNB 10
US4ZNB 1075
US4ZNB 10000
US4ZNB 1138
US4ZNB 1156
US4YYM 1114
US4YYM 1072
US4YYM 50
US4YYM 1181
US4YYM 8000
JM4YYM 15000
JM4YYM 2000
JM4YYM 100
JM4YYM 2200
JM4YYM 2300
ddply from the plyr package does exactly this. It applies a function to each subset of the data based off a column.
plyr::ddply(df, "key", plyr::numcolwise(smooth_outliers))
The first argument is your data with "key" and "total", the second argument is the grouping variable, in this case "key".
The final variable is the function you want to apply, the numcolwise function is used here essentially so it applied it to the column rather than a whole row. So we make the row-based smooth-outliers function a column based function.
Then voila.
You'll get a data frame that lists each each key and its IQR as calculated by the smooth_outliers function.
Here's the result.
key total
1 JM4YYM 1421
2 JM4YYM 1712
3 JM4YYM 1709
4 US4YYM 1114
5 US4YYM 1473
6 US4YYM 1181
7 US4YYM 1767
8 US4YYM 1005
9 US4ZAW 1138
10 US4ZAW 1156
11 US4ZAW 1982
12 US4ZNB 1338
13 US4ZNB 1075
14 US4ZNB 1806
As you can see, each key is matched up with one of the outputs from the smooth_outliers function.
After ideas elaboration, I manage to find solution for my issue. I just used dplyr::group_by:
df.new <- df %>%
group_by(key) %>%
mutate(val=smooth_outliers(total))
Thanks you all.

R - Sum range over lookback period, divided sum of look back - excel to R

I am looking to workout a percentage total over a look back range in R.
I know how to do this in excel with the following formula:
=SUM(B2:B4)/SUM(B2:B4,C2:C4)
This is summing column B over a range of today looking back 3 lines. It then divides this sum buy the total sum of column B + C again looking back 3 lines.
I am looking to achieve the same calculation in R to run across my matrix.
The output would look something like this:
adv dec perct
1 69 376
2 113 293
3 270 150 0.355625492
4 74 371 0.359559402
5 308 96 0.513790386
6 236 173 0.491255962
7 252 134 0.663886572
8 287 129 0.639966969
9 219 187 0.627483444
This is a line of code I could perhaps add the look back range too:
perct <- apply(data.matrix[,c('adv','dec')], 1, function(x) { (x[1] / x[1] + x[2]) } )
If i could get [1] to sum the previous 3 line range and
If i could get [2] to also sum the previous 3 line range.
Still learning how to apply forward and look back periods within R. So any additional learning on the answer would be appreciated!
Here are some approaches. The first 3 use rollsumr and/or rollapplyr in zoo and the last one uses only the base of R.
1) rollsumr Create a matrix with rollsumr whose columns contain the rollling sums, convert that to row proportions and take the "adv" column. Finally assign that to a new column frac in DF. This approach has the shortest code.
library(zoo)
DF$frac <- prop.table(rollsumr(DF, 3, fill = NA), 1)[, "adv"]
giving:
> DF
adv dec frac
1 69 376 NA
2 113 293 NA
3 270 150 0.3556255
4 74 371 0.3595594
5 308 96 0.5137904
6 236 173 0.4912560
7 252 134 0.6638866
8 287 129 0.6399670
9 219 187 0.6274834
1a) This variation is similar except instead of using prop.table we write out the ratio. The code is longer but you may find it clearer.
m <- rollsumr(DF, 3, fill = NA)
DF$frac <- with(as.data.frame(m), adv / (adv + dec))
1b) This is a variation of (1) that is the same except it uses a magrittr pipeline:
library(magrittr)
DF %>% rollsumr(3, fill = NA) %>% prop.table(1) %>% `[`(TRUE, "adv") -> DF$frac
2) rollapplyr We could use rollapplyr with by.column = FALSE like this. The result is the same.
ratio <- function(x) sum(x[, "adv"]) / sum(x)
DF$frac <- rollapplyr(DF, 3, ratio, by.column = FALSE, fill = NA)
3) Yet another variation is to compute the numerator and denominator separately:
DF$frac <- rollsumr(DF$adv, 3, fill = NA) /
rollapplyr(DF, 3, sum, by.column = FALSE, fill = NA)
4) base This uses embed followed by rowSums on each column to get the rolling sums and then uses prop.table as in (1).
DF$frac <- prop.table(sapply(lapply(rbind(NA, NA, DF), embed, 3), rowSums), 1)[, "adv"]
Note: The input used in reproducible form is:
Lines <- "adv dec
1 69 376
2 113 293
3 270 150
4 74 371
5 308 96
6 236 173
7 252 134
8 287 129
9 219 187"
DF <- read.table(text = Lines, header = TRUE)
Consider an sapply that loops through the number of rows in order to index two rows back:
DF$pred <- sapply(seq(nrow(DF)), function(i)
ifelse(i>=3, sum(DF$adv[(i-2):i])/(sum(DF$adv[(i-2):i]) + sum(DF$dec[(i-2):i])), NA))
DF
# adv dec pred
# 1 69 376 NA
# 2 113 293 NA
# 3 270 150 0.3556255
# 4 74 371 0.3595594
# 5 308 96 0.5137904
# 6 236 173 0.4912560
# 7 252 134 0.6638866
# 8 287 129 0.6399670
# 9 219 187 0.6274834

Resources