How to pass a specific column of a list to a function in a for-loop without the output being named after the iterator expression - r

Given the following example:
library(metafor)
dat <- escalc(measure = "RR", ai = tpos, bi = tneg, ci = cpos, di = cneg, data = dat.bcg, append = TRUE)
dat
rma(yi, vi, data = dat, mods = ~dat[[8]], subset = (alloc=="systematic"), knha = TRUE)
trial author year tpos tneg cpos cneg ablat alloc yi vi
1 1 Aronson 1948 4 119 11 128 44 random -0.8893 0.3256
2 2 Ferguson & Simes 1949 6 300 29 274 55 random -1.5854 0.1946
3 3 Rosenthal et al 1960 3 228 11 209 42 random -1.3481 0.4154
4 4 Hart & Sutherland 1977 62 13536 248 12619 52 random -1.4416 0.0200
5 5 Frimodt-Moller et al 1973 33 5036 47 5761 13 alternate -0.2175 0.0512
6 6 Stein & Aronson 1953 NA NA NA NA 44 alternate NA NA
7 7 Vandiviere et al 1973 8 2537 10 619 19 random -1.6209 0.2230
8 8 TPT Madras 1980 505 87886 499 87892 NA random 0.0120 0.0040
9 9 Coetzee & Berjak 1968 29 7470 45 7232 27 random -0.4694 0.0564
10 10 Rosenthal et al 1961 17 1699 65 1600 42 systematic -1.3713 0.0730
11 11 Comstock et al 1974 186 50448 141 27197 18 systematic -0.3394 0.0124
12 12 Comstock & Webster 1969 5 2493 3 2338 33 systematic 0.4459 0.5325
13 13 Comstock et al 1976 27 16886 29 17825 33 systematic -0.0173 0.0714
What I basically want is to repeat the rma() call (changing only the mods argument) over, say, columns 7:8, and to store each result in a variable named after the column.
Two problems:
1) When I enter the command:
rma(yi, vi, data = dat, mods = ~dat[[8]], subset = (alloc=="systematic"), knha = TRUE)
the moderator is labelled dat[[8]] in the output. But I want it labelled with the column name (i.e. colnames(dat[i])).
Model Results:
estimate se tval pval ci.lb ci.ub
intrcpt 0.5543 1.4045 0.3947 0.7312 -5.4888 6.5975
dat[[8]] -0.0312 0.0435 -0.7172 0.5477 -0.2185 0.1560
2) Now imagine that I have many more columns and want to iterate over [8:53], such that each result is stored in a variable named after its column.
Problem 2) has been solved:
for(i in 7:8){
assign(paste(colnames(dat[i]), i, sep=""), rma(yi, vi, data = dat, mods = ~dat[[i]], subset = (alloc=="systematic"), knha = TRUE))}

To answer the first part of your question: you can change the names by accessing the attributes of the model object.
In this case:
# inspect the attributes (model is the fitted rma object)
attr(model$vb, which = "dimnames")
# assign the name
attr(model$vb, which = "dimnames")[[1]][2] <- paste(colnames(dat)[8])
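An alternative to renaming after the fact is to build the formula from the column name, so the term is labelled correctly from the start, and to collect the fits in a named list instead of using assign(). The sketch below uses lm() on a made-up data frame as a stand-in so that it runs without metafor; `toy`, `mod_cols` and `fits` are illustrative names, not part of the original question. With metafor, the same idea would presumably be mods = reformulate(nm), since mods accepts a one-sided formula, but that is an assumption and untested here.

```r
# Toy stand-in for `dat`; the values are invented for illustration only
toy <- data.frame(
  yi    = c(-0.9, -1.6, -1.3, -1.4, -0.2, -1.6, -0.5, -1.4),
  ablat = c(44, 55, 42, 52, 13, 19, 27, 42),
  year  = c(1948, 1949, 1960, 1977, 1973, 1973, 1968, 1961)
)

mod_cols <- c("ablat", "year")   # columns to use as predictors, by name

fits <- list()
for (nm in mod_cols) {
  # reformulate("ablat", response = "yi") builds the formula yi ~ ablat,
  # so the coefficient is labelled "ablat" rather than "dat[[i]]"
  fits[[nm]] <- lm(reformulate(nm, response = "yi"), data = toy)
}

names(fits)               # one fitted model per column name
names(coef(fits$ablat))   # coefficient labels come from the column name
```

A named list keeps all results in one object (fits$ablat, fits[["year"]]) and avoids creating dozens of separate variables in the workspace.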

Related

Writing a function to compare differences of a series of numeric variables

I am working on a problem set and absolutely cannot figure this one out. I think I've fried my brain to the point where it doesn't even make sense anymore.
Here is a look at the data ...
sex age chol tg ht wt sbp dbp vldl hdl ldl bmi
<chr> <int> <int> <int> <dbl> <dbl> <int> <int> <int> <int> <int> <dbl>
1 M 60 137 50 68.2 112. 110 70 10 53 74 2.40
2 M 26 154 202 82.8 185. 88 64 34 31 92 2.70
3 M 33 198 108 64.2 147 120 80 22 34 132 3.56
4 F 27 154 47 63.2 129 110 76 9 57 88 3.22
5 M 36 212 79 67.5 176. 130 100 16 37 159 3.87
6 F 31 197 90 64.5 121 122 78 18 58 111 2.91
7 M 28 178 163 66.5 167 118 68 19 30 135 3.78
8 F 28 146 60 63 105. 120 80 12 46 88 2.64
9 F 25 231 165 64 126 130 72 23 70 137 3.08
10 M 22 163 30 68.8 173 112 70 6 50 107 3.66
# … with 182 more rows
I must write a function, myTtest, to perform the following task:
Perform a two-sample t-tests to compare the differences of a series of numeric variables between each level of a classification variable
The first argument, dat, is a data frame
The second argument, classVar, is a character vector of length 1. It is the name of the classification variable, such as 'sex.'
The third argument, numVar, is a character vector that contains the name of the numeric variables, such as c("age", "chol", "tg"). This means I need to perform three t-tests to compare the difference of those between males and females.
The function should return a data frame with the following variables: Varname, F.mean, M.mean, t (for t-statistics), df (for degrees of freedom), and p (for p-value).
I should be able to run this ...
myTtest(dat = chol, classVar = "sex", numVar = c("age", "chol", "tg"))
... and then get the data frame to appear.
Any help is greatly appreciated. I am pulling my hair out over this one! As well, as noted in my comment below, this has to be done without Tidyverse ... which is why I'm having so much trouble to begin with.
The intuition for this solution is that you can loop over your dependent variables, and call t.test() in each loop. Then save the results from each DV and stack them together in one big data frame.
I'll leave out some bits for you to fill in, but here's the gist:
First, some example data:
set.seed(123)
n <- 20
grp <- sample(c("m", "f"), n, replace = TRUE)
df <- data.frame(grp = grp, age = rnorm(n), chol = rnorm(n), tg = rnorm(n))
df
grp age chol tg
1 m 1.2240818 0.42646422 0.25331851
2 m 0.3598138 -0.29507148 -0.02854676
3 m 0.4007715 0.89512566 -0.04287046
4 f 0.1106827 0.87813349 1.36860228
5 m -0.5558411 0.82158108 -0.22577099
6 f 1.7869131 0.68864025 1.51647060
7 f 0.4978505 0.55391765 -1.54875280
8 f -1.9666172 -0.06191171 0.58461375
9 m 0.7013559 -0.30596266 0.12385424
10 m -0.4727914 -0.38047100 0.21594157
Now make a container that each of the model outputs will go into:
fits_df <- data.frame()
Loop over each DV and append the model output to fits_df each time with rbind:
for (dv in c("age", "chol", "tg")) {
frml <- as.formula(paste0(dv, " ~ grp")) # make a model formula: dv ~ grp
fit <- t.test(frml, alternative = "two.sided", data = df) # perform the t-test
# hint: use str(fit) to figure out how to pull out each value you care about
fit_df <- data.frame(
dv = dv,
f_mean = xxx,
m_mean = xxx,
t = xxx,
df = xxx,
p = xxx
)
fits_df <- rbind(fits_df, fit_df)
}
Your output will look like this:
fits_df
dv f_mean m_mean t df p
1 age -0.18558068 -0.04446755 -0.297 15.679 0.7704954
2 chol 0.07731514 0.22158672 -0.375 17.828 0.7119400
3 tg 0.09349567 0.23693052 -0.345 14.284 0.7352112
One note: When you're pulling out values from fit, you may get odd row names in your output data frame. This is due to the names property of the various fit attributes. You can get rid of these by using as.numeric() or as.character() wrappers around the values you pull from fit (for example, fit$statistic can be cleaned up with as.character(round(fit$statistic, 3))).
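For reference, here is one way the pieces above could be assembled into the requested myTtest() using base R only. It is a sketch under assumptions: the classification variable is assumed to have exactly the two levels "F" and "M", the test is the default Welch t-test (as in the loop above), and `toy` is simulated stand-in data, not the real chol data set.

```r
myTtest <- function(dat, classVar, numVar) {
  rows <- lapply(numVar, function(v) {
    # Welch two-sample t-test of v between the two levels of classVar
    fit <- t.test(reformulate(classVar, response = v), data = dat)
    # Group means computed directly, so the F/M columns do not depend on
    # the order of fit$estimate
    means <- tapply(dat[[v]], dat[[classVar]], mean, na.rm = TRUE)
    data.frame(
      Varname = v,
      F.mean  = means[["F"]],
      M.mean  = means[["M"]],
      t       = unname(fit$statistic),   # unname() drops the odd row names
      df      = unname(fit$parameter),
      p       = fit$p.value
    )
  })
  do.call(rbind, rows)   # stack one row per numeric variable
}

# Simulated stand-in data (assumption: levels are "F" and "M")
set.seed(123)
n <- 40
toy <- data.frame(
  sex  = sample(c("F", "M"), n, replace = TRUE),
  age  = rnorm(n, 30, 5),
  chol = rnorm(n, 180, 30),
  tg   = rnorm(n, 100, 40)
)

res <- myTtest(dat = toy, classVar = "sex", numVar = c("age", "chol", "tg"))
res
```

Building the list with lapply() and binding once at the end avoids growing a data frame inside the loop, which matters little for three variables but scales better for many.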

Extract beta, sd and P-value from meta-regression using meta package in R to a nice output

I am using the code below to do meta-regression in R, and I repeat it several times for different variables.
My data frame and code are as follows:
data<-read.table(text="Studlab PCI.total.FU CABG.total.FU PCI CABG Mean.Age Females..
A 4515 4485 45 51 65.1 22.35
B 4740 4785 74 49 65.95 23.15
C 3621.4 3598.6 41 31 63.15 28.65
D 2337 2314.2 20 29 60 30.5
E 1835.2 1835.2 20 16 66.2 22
F 2014.8 2033.2 11 6 64.45 28.55
G 1125 1125 4 5 61.95 20.65
H 1500 1500 6 3 62.25 23.5
I 976 1000 11 3 61.5 21
J 202 194 10 0 62.4 1", sep="", header=T)
library(meta);library(metafor)
mr <- metainc( PCI, PCI.total.FU,CABG, CABG.total.FU,
data = data, studlab = Studlab, method = "Inverse")
Then for meta-regression I used the following code
MEG<-metareg (mr, ~Mean.Age);MEG ;
#==================================
b = round(MEG[["b"]], digits = 2)
se = round(MEG[["se"]], digits = 2)
pval = round(MEG[["pval"]], digits = 2)
paste0(b,"±",se,", P=",pval)
# Then I repeat meta-regression with another variable
MEG<-metareg (mr, ~Females..);MEG
#==================================
b = round(MEG[["b"]], digits = 2)
se = round(MEG[["se"]], digits = 2)
pval = round(MEG[["pval"]], digits = 2)
paste0(b,"±",se,", P=",pval)
and so on. So the b, se, pval and paste0 steps are repeated each time to get the needed output.
The content of MEG is shown in the screenshot below.
My question: is there any way to repeat these steps several times with different variables (here I used "Mean.Age" and then "Females..")? In other words, I produce several MEG objects with different variables. Is there something like a macro that calls these functions repeatedly, so that I don't have to copy and paste the code each time?
Any advice will be greatly appreciated.
I am doing that to finally create a table like this
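One common way to avoid the copy-and-paste is to wrap the repeated steps in a small function and apply it to each fitted model. The sketch below is an illustration only: `format_fit` is a hypothetical helper name, and `mock` is an invented object that merely mimics the b, se and pval elements used above, so the example runs without the meta package.

```r
# Format estimate ± SE and p-value from any fitted object that stores
# `b`, `se` and `pval` elements (as the MEG object in the question does)
format_fit <- function(fit, digits = 2) {
  b    <- round(fit[["b"]],    digits = digits)
  se   <- round(fit[["se"]],   digits = digits)
  pval <- round(fit[["pval"]], digits = digits)
  paste0(b, "±", se, ", P=", pval)
}

# Mock object standing in for a metareg() result; the numbers are invented
mock <- list(
  b    = matrix(c(1.234, 0.1234), ncol = 1,
                dimnames = list(c("intrcpt", "Mean.Age"), NULL)),
  se   = c(0.5, 0.0456),
  pval = c(0.02, 0.0312)
)

out <- format_fit(mock)   # one string per coefficient (intercept, slope)
out
```

With the real fits, the models could be stored in a named list, for example list(Mean.Age = metareg(mr, ~Mean.Age), Females.. = metareg(mr, ~Females..)), and lapply(fits, format_fit) would then return the formatted strings for every variable in one call.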

How to add regression lines for each factor on a plot

I've created a model and I'm trying to add curves that fit the two parts of the data, insulation and no insulation. I was thinking about using the insulation coefficient as a true/false term, but I'm not sure how to translate that into code. Entries 1:56 are "w/o" and 57:101 are "w/". I'm not sure how to include the data I'm using but here's the head and tail:
month year kwh days est cost avgT dT.yr kWhd.1 id insulation
1 8 2003 476 21 a 33.32 69 -8 22.66667 1 w/o
2 9 2003 1052 30 e 112.33 73 -1 35.05172 2 w/o
3 10 2003 981 28 a 24.98 60 -6 35.05172 3 w/o
4 11 2003 1094 32 a 73.51 53 2 34.18750 4 w/o
5 12 2003 1409 32 a 93.23 44 6 44.03125 5 w/o
6 1 2004 1083 32 a 72.84 34 3 33.84375 6 w/o
month year kwh days est cost avgT dT.yr kWhd.1 id insulation
96 7 2011 551 29 e 55.56 72 0 19.00000 96 w/
97 8 2011 552 27 a 61.17 78 1 20.44444 97 w/
98 9 2011 666 34 e 73.87 71 -2 19.58824 98 w/
99 10 2011 416 27 a 48.03 64 0 15.40741 99 w/
100 11 2011 653 31 e 72.80 53 1 21.06452 100 w/
101 12 2011 751 33 a 83.94 45 2 22.75758 101 w/
bill$id <- seq(1:101)
bill$insulation <- as.factor(ifelse(bill$id > 56, c("w/"), c("w/o")))
m1 <- lm(kWhd.1 ~ avgT + insulation + I(avgT^2), data=bill)
with(bill, plot(kWhd.1 ~ avgT, xlab="Average Temperature (F)",
ylab="Daily Energy Use (kWh/d)", col=insulation))
no_ins <- data.frame(bill$avgT[1:56], bill$insulation[1:56])
curve(predict(m1, no_ins=x), add=TRUE, col="red")
ins <- data.frame(bill$avgT[57:101], bill$insulation[57:101])
curve(predict(m1, ins=x), add=TRUE, lty=2)
legend("topright", inset=0.01, pch=21, col=c("red", "black"),
legend=c("No Insulation", "Insulation"))
ggplot2 makes this a lot easier than base plotting. Something like this should work:
ggplot(bill, aes(x = avgT, y = kWhd.1, color = insulation)) +
geom_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE) +
geom_point()
In base, I'd create a data frame with the points you want to predict on (a grid over the predictor avgT for each insulation level), something like
pred_data = expand.grid(
avgT = seq(min(bill$avgT), max(bill$avgT), length.out = 100),
insulation = c("w/", "w/o")
)
pred_data$prediction = predict(m1, newdata = pred_data)
And then use lines to add the predictions to your plot. My base graphics is pretty rusty, so I'll leave that to you (or another answerer) if you want it.
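A minimal sketch of that lines() step follows. Because the bill data is not available, it uses the built-in mtcars data as a stand-in, with transmission type (am) playing the role of insulation and weight (wt) playing the role of avgT; the model form mirrors m1 but the variable names are substitutions, not the original data.

```r
# Stand-in data: mtcars, with `am` as the two-level factor
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))
fit  <- lm(mpg ~ wt + am + I(wt^2), data = cars)

# Grid over the predictor for each factor level, already ordered in wt
pred_data <- expand.grid(
  wt = seq(min(cars$wt), max(cars$wt), length.out = 100),
  am = levels(cars$am)
)
pred_data$prediction <- predict(fit, newdata = pred_data)

# Scatter plot coloured by factor level, then one fitted curve per level
plot(cars$wt, cars$mpg, col = as.integer(cars$am),
     xlab = "Weight", ylab = "Miles per gallon")
for (i in seq_along(levels(cars$am))) {
  sub <- pred_data[pred_data$am == levels(cars$am)[i], ]
  lines(sub$wt, sub$prediction, col = i)
}
legend("topright", legend = levels(cars$am), col = 1:2, lty = 1)
```

Predicting on an ordered grid rather than on the observed x-values is what keeps each curve smooth and drawn left to right.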
In base R it's important to order the x-values. Since this is to be done on multiple factors, we can do this with by, resulting in a list L.
Since your example data is not complete, here's an example with iris where we consider Species as the "factor".
L <- by(iris, iris$Species, function(x) x[order(x$Petal.Length), ])
Now we can do the plot and add loess predictions as lines with a sapply.
with(iris, plot(Sepal.Width ~ Petal.Length, col=Species))
sapply(seq(L), function(x)
lines(L[[x]]$Petal.Length,
predict(loess(Sepal.Width ~ Petal.Length, L[[x]], span=1.1)), # span=1.1 for smoothing
col=x))

Clustering biological sequences based on numeric values

I am trying to cluster several amino acid sequences of a fixed length (13) into K clusters based on the Atchley factors (5 numbers which represent each amino acid).
For example, I have an input vector of strings like the following:
key <- HDMD::AAMetric.Atchley
sequences <- sapply(1:10000, function(x) paste(sapply(1:13, function (X) sample(rownames(key), 1)), collapse = ""))
However, my actual list of sequences has over 10^5 entries, so computational efficiency matters.
I then convert these sequences into numeric vectors by the following:
key <- HDMD::AAMetric.Atchley
m1 <- key[strsplit(paste(sequences, collapse = ""), "")[[1]], ]
p = 13
output <-
do.call(cbind, lapply(1:p, function(i)
m1[seq(i, nrow(m1), by = p), ]))
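To make the reshaping step concrete, here is the same conversion on a tiny mock example that runs without HDMD. The `key` matrix below is invented (3 residues with 2 factors each instead of 20 residues with 5 Atchley factors), so the numbers carry no biological meaning.

```r
# Mock lookup table: one row per residue, one column per factor (invented)
key <- matrix(c(1, 2, 3, 10, 20, 30), nrow = 3,
              dimnames = list(c("A", "C", "D"), c("f1", "f2")))

sequences <- c("ACD", "DDA")   # two sequences of fixed length p
p <- 3

# Stack all residues of all sequences: (N * p) rows x 2 factor columns
m1 <- key[strsplit(paste(sequences, collapse = ""), "")[[1]], ]

# For each position i, take every p-th row starting at i, then bind the
# positions side by side: N rows x (p * 2) columns
output <- do.call(cbind, lapply(1:p, function(i)
  m1[seq(i, nrow(m1), by = p), , drop = FALSE]))

dim(output)     # 2 sequences x 6 values
output[1, ]     # factors of A, then C, then D
```

Each row of output is therefore one sequence, with the factors of position 1 first, then position 2, and so on; with 13 positions and 5 factors this gives the 65 columns.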
I want to cluster output (which now holds one 65-dimensional vector per sequence) in an efficient way.
I was originally using Mini-batch kmeans, but I noticed the results were very inconsistent when I repeated. I need a consistent clustering approach.
I also was concerned about the curse of dimensionality, considering at 65 dimensions, Euclidean distance doesn't work.
Many high-dimensional clustering algorithms I saw assume that outliers and noise exist in the data, but as these are biological sequences converted to numeric values, there is no noise and there are no outliers.
In addition to this, feature selection will not work, as each of the properties of each amino acid and each amino acid are relevant in the biological context.
How would you recommend clustering these vectors?
I think self-organizing maps can be of help here. At least the implementation is quite fast, so you will know soon enough whether it is helpful or not.
Using the data from the OP along with:
rownames(output) <- 1:nrow(output)
colnames(output) <- make.names(colnames(output), unique = TRUE)
library(SOMbrero)
You define the number of clusters in advance (here a 5 x 5 grid, i.e. 25 clusters):
fit <- trainSOM(x.data=output , dimension = c(5, 5), nb.save = 10, maxit = 2000,
scaling="none", radius.type = "gaussian")
nb.save sets the number of intermediate steps saved, so you can explore how the training developed over the iterations:
plot(fit, what ="energy")
It seems that more iterations are in order.
Check the frequency of clusters:
table(fit$clustering)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
428 417 439 393 505 458 382 406 271 299 390 303 336 358 365 372 332 268 437 464 541 381 569 419 467
predict clusters based on new data:
predict(fit, output[1:20,])
#output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
19 12 11 8 9 1 11 13 14 5 18 2 22 21 23 22 4 14 24 12
check which variables were important for clustering:
summary(fit)
#part of output
Summary
Class : somRes
Self-Organizing Map object...
online learning, type: numeric
5 x 5 grid with square topology
neighbourhood type: gaussian
distance type: euclidean
Final energy : 44.93509
Topographic error: 0.0053
ANOVA :
Degrees of freedom : 24
F pvalue significativity
pah 1.343 0.12156074
pss 1.300 0.14868987
ms 16.401 0.00000000 ***
cc 1.695 0.01827619 *
ec 17.853 0.00000000 ***
find optimal number of clusters:
plot(superClass(fit))
fit1 <- superClass(fit, k = 4)
summary(fit1)
#part of output
SOM Super Classes
Initial number of clusters : 25
Number of super clusters : 4
Frequency table
1 2 3 4
6 9 4 6
Clustering
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1 1 2 2 2 1 1 2 2 2 1 1 2 2 2 3 3 4 4 4 3 3 4 4 4
ANOVA
Degrees of freedom : 3
F pvalue significativity
pah 1.393 0.24277933
pss 3.071 0.02664661 *
ms 19.007 0.00000000 ***
cc 2.906 0.03332672 *
ec 23.103 0.00000000 ***
Much more in this vignette

How to estimate Residuals of Autoregressive Model AR(2)with missing data for dataframe in R

I want to obtain the residuals of an AR(2) model, where residual = actual value - fitted value. I have a large data frame of time series with 4000 columns, one per company. Each column holds a liquidity measure for that company. I need to transform each company's liquidity measure with an autoregressive model with 2 lags, so that the first-order autocorrelation is removed, following the AR(2) process outlined in the equation below, and then use the residuals for subsequent analysis.
where C_t^i is a measure of liquidity for stock i at month t, x is the number of lags included in the autoregressive process, and u_t^i is the residual in liquidity for stock i at month t.
I provide a small part of my data as below.
DATE A B C D E F
31/12/1999 79.5 NA NA 6 NA NA
03/01/2000 79.5 NA NA 6 NA NA
04/01/2000 79.5 NA 325 6 961 3081.9
05/01/2000 79.5 NA 322.5 6 945 2524.7
06/01/2000 79.5 NA 327.5 6 952 3272.3
07/01/2000 79.5 NA 327.5 6 941 2102.9
10/01/2000 79.5 7 327.5 6 946 2901.5
11/01/2000 79.5 7 327.5 6 888 9442.5
12/01/2000 79.5 7 331.5 6 870 7865.8
13/01/2000 79.5 7 334 6 853 7742.1
I found the following function in the stats package. Could you please help me adapt it so that it runs for each column (except the date) and handles missing values while using a lag of 2?
ar(x, aic = TRUE, order.max = NULL,
method = c("yule-walker", "burg", "ols", "mle", "yw"),
na.action, series, ...)
Assuming your data is called df you can add the residuals of AR(2) models for each column as follows:
df <- data.frame(df, apply(df[-1], 2, function(x) arima(x, order = c(2,0,0))$res))
df
Date Company1 Company2 Company1.1 Company2.1
1 Jan-2000 0.05365700 0.01821019 -0.036876374 0.0006985845
2 Feb-2000 0.07201239 0.01680506 -0.001093970 -0.0005063298
3 Mar-2000 0.08740745 0.01924687 -0.003796628 0.0017217050
4 Apr-2000 0.10866274 0.01792439 0.010745400 0.0007815183
5 May-2000 0.14189171 0.01848372 0.032286719 0.0014418422
6 Jun-2000 0.15228030 0.01472494 0.023719800 -0.0024127538
7 Jul-2000 0.10231285 0.01634404 -0.025709302 -0.0016760248
8 Aug-2000 0.10838209 0.01919361 0.019162611 0.0009089139
9 Sep-2000 0.08358543 0.01624093 -0.022191184 -0.0010003877
10 Oct-2000 0.10907866 0.01768522 0.022780332 0.0001924767
Edited code:
df <- data.frame(df, apply(df[-1], 2, function(x) arima(x[!is.na(x)], order = c(2,0,0), method = "ML")$res))
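One caveat with dropping the NAs inside apply() is that columns with different numbers of missing values yield residual vectors of different lengths, which no longer line up with the dates. Since arima() allows missing values in the series, a hedged alternative is to fit each column as it is and keep the residuals at full length. In the sketch below, `ar2_resid` is a hypothetical helper, the threshold of 10 observations is an arbitrary assumption, and `toy` is simulated data rather than the liquidity data from the question.

```r
# Residuals of an AR(2) fit for one column, same length as the input.
# Returns all NA if there are too few observations or the fit fails.
ar2_resid <- function(x, min_obs = 10) {
  out <- rep(NA_real_, length(x))
  if (sum(!is.na(x)) < min_obs) return(out)   # arbitrary cut-off (assumption)
  fit <- tryCatch(
    arima(x, order = c(2, 0, 0), method = "ML"),
    error = function(e) NULL                  # e.g. constant or degenerate series
  )
  if (!is.null(fit)) out <- as.numeric(fit$residuals)
  out
}

# Simulated stand-in data: column B starts with missing values
set.seed(1)
n <- 120
toy <- data.frame(
  DATE = seq_len(n),
  A = as.numeric(arima.sim(list(ar = c(0.5, 0.2)), n = n)),
  B = c(rep(NA, 6), as.numeric(arima.sim(list(ar = c(0.3, -0.2)), n = n - 6)))
)

# lapply over columns keeps every residual vector at nrow(toy)
res <- as.data.frame(lapply(toy[-1], ar2_resid))
names(res) <- paste0(names(toy)[-1], ".resid")
toy <- cbind(toy, res)
head(toy)
```

The tryCatch() wrapper matters with thousands of columns: a single series that cannot be fitted then produces a column of NA instead of stopping the whole run.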
