How to use metafor to calculate effect sizes for a pre-post one-group study

I am trying to use metafor to perform a meta-analysis exploring hormone concentration changes after an intervention.
The included studies have only one intervention group and no control group. Each study reports the pre and post means and SDs.
However, escalc() in the metafor package also requires ri, the pre-post correlation, and none of the studies reported it. How am I supposed to sort this out?
I was told to try a wide range of ri values and then perform sensitivity analyses to see whether the choice of ri influences the results. Does anyone have an idea of how to perform this analysis?
I've also attached my code here (with ri set to 0 across all 14 studies):
es <- escalc(measure = "SMCC", ni = n, ri = rep(0, 14),
             m1i = pyy_pre_m, m2i = pyy_post_m,
             sd1i = pyy_pre_sd, sd2i = pyy_post_sd,
             data = df, append = FALSE)
es
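Here is a rough sketch of the sensitivity analysis I have in mind (just my own guess at an approach: refit the pooled random-effects model under several assumed pre-post correlations and compare the estimates):
# Sketch only: try a range of assumed pre-post correlations
library(metafor)

ri_values <- c(0, 0.3, 0.5, 0.7, 0.9)   # assumed correlations

fits <- lapply(ri_values, function(r) {
  es_r <- escalc(measure = "SMCC", ni = n, ri = rep(r, 14),
                 m1i = pyy_pre_m, m2i = pyy_post_m,
                 sd1i = pyy_pre_sd, sd2i = pyy_post_sd,
                 data = df)
  rma(yi, vi, data = es_r)              # random-effects model for this ri
})

# Pooled estimate and confidence interval under each assumed ri
data.frame(ri       = ri_values,
           estimate = sapply(fits, coef),
           ci.lb    = sapply(fits, function(m) m$ci.lb),
           ci.ub    = sapply(fits, function(m) m$ci.ub))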
Many thanks,
Kim

Related

How to analyze usefulness of imputation output in R

I am working with a dataset of 3,500 observations that includes a Body Mass Index (BMI) variable. There are around 300 NA values for the BMI variable, which I have imputed using multiple imputation. Apologies that this is not reproducible; I wasn't sure how to do that quickly in this case. Here is the code I used to impute the data.
a1 <- amelia(x = df, m = 30, idvars = c("EDUC_1", "REGION_3"),
             noms = c("REGION_1", "REGION_2", "REGION_4", "SMOKE", "MARRIED",
                      "NON_WHITE", "MOD_SEV_ANX", "HYPERTEN", "DIABETES",
                      "BELOW_100_POVERTY", "IMMIGRANT", "FEMALE",
                      "EDUC_2", "EDUC_3", "EDUC_4", "EDUC_5"),
             p2s = 0)
a1
imp.mod <- zelig(BMICALC ~ AGE + FAMSIZE + factor(BELOW_100_POVERTY) + factor(IMMIGRANT) +
                   factor(FEMALE) + factor(EDUC_2) + factor(EDUC_3) + factor(EDUC_4) +
                   factor(EDUC_5) + factor(REGION_1) + factor(REGION_2) + factor(REGION_4) +
                   factor(SMOKE) + factor(MARRIED) + factor(NON_WHITE) + factor(MOD_SEV_ANX) +
                   factor(HYPERTEN) + factor(DIABETES),
                 model = "ls", data = a1, cite = F)
summary(imp.mod)
And here is the output (screenshot not included here).
From this website I found information on how to judge whether the imputation needs more investigation or whether I can go ahead and analyze the data with a regression model. I have included the code that creates the two diagnostic plots per the website's instructions, but I am unclear on how to tell whether the imputation is accurate. Do the two distributions in the first graphic need to be close? Can someone clarify what the y = x line and the blue confidence intervals/dots mean in the second image? Is the output here indicative of whether the imputation will suffice for the regression analysis? I've attached the code and output below. Thank you!
compare.density(a1, var = "BMICALC")
overimpute(a1, var = "BMICALC")

Regressing out or removing age as a confounding factor from experimental results

I have obtained cycle threshold values (CT values) for some genes for diseased and healthy samples. The healthy samples were younger than the diseased. I want to check if the age (exact age values) are impacting the CT values. And if so, I want to obtain an adjusted CT value matrix in which the gene values are not affected by age.
I have checked various sources for confounding variable adjustment, but they all deal with categorical confounding factors (like batch effects). I can't work out how to do it for age.
I have done the following:
modcombat = model.matrix(~1, data = data.frame(data_val))    # null model (intercept only)
modcancer = model.matrix(~Age, data = data.frame(data_val))  # full model including Age
combat_edata = ComBat(dat = t(data_val), batch = Age, mod = modcombat,
                      par.prior = TRUE, prior.plots = FALSE)
pValuesComBat = f.pvalue(combat_edata, modcancer, modcombat)
qValuesComBat = p.adjust(pValuesComBat, method = "BH")
data_val is the gene expression/CT values matrix.
Age is the age vector for all the samples.
For some genes the p-value is significant. So how do I correctly modify those gene values to remove the age effect?
I tried linear regression as well (after checking some blogs):
lm1 = lm(data_val[1,] ~ Age) #1 indicates first gene. Did this for all genes
cor.test(lm1$residuals, Age)
The blog suggested checking the p-value of the correlation between the residuals and the confounding factor, but I don't get why one would test the correlation of the residuals with age.
And how do I apply a correction to the CT values using regression?
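Here is roughly what I imagine the adjustment might look like (just my guess, assuming rows of data_val are genes and columns are samples, as in the lm() call above):
# My guess only: regress each gene on Age and keep the residuals, adding back
# the gene's overall mean so the values stay on the original CT scale
adjusted_val <- data_val
for (g in seq_len(nrow(data_val))) {
  fit <- lm(data_val[g, ] ~ Age)
  adjusted_val[g, ] <- residuals(fit) + mean(data_val[g, ])
}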
Please guide me as to whether what I have done is correct.
If it is incorrect, kindly tell me how to obtain a version of data_val with no age effect.
There are many methods to solve this:
Basic statistical approach
A very basic way to incorporate the effect of the Age parameter and make the final dataset age-agnostic is:
Centre and scale your data based on Age. By this I mean group your data by age, compute the mean of each group, and then standardise the data within each group using that mean (a rough sketch follows this list).
For standardising you can use several methods:
1) z-score normalisation: transform each data point to (x - mean(x)) / sd(x), using the group mean and group standard deviation.
2) mean normalisation: simply subtract the group mean from every observation.
3) min-max normalisation: rescale each data point within its group to (x - min(x)) / (max(x) - min(x)).
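A rough sketch of the group-wise z-score idea (my illustration, not from the original answer; it assumes data_val has genes in rows and samples in columns, that Age has one value per sample, and that Age is binned so each group contains several samples):
# Illustration only: z-score each gene within age groups
age_group <- cut(Age, breaks = 4)       # coarse age bins

scaled_val <- data_val
for (g in levels(age_group)) {
  cols <- which(age_group == g)
  scaled_val[, cols] <- t(scale(t(data_val[, cols, drop = FALSE])))
}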
On to more complex statistics:
You can get the importance of all the features/columns in your dataset using algorithms such as PCA (principal component analysis, https://en.wikipedia.org/wiki/Principal_component_analysis). Although it is generally used as a dimensionality-reduction algorithm, it can also be used to examine the variance in the whole dataset and to gauge the importance of individual features.
Below is a simple example explaining it:
I have plotted the importance using a biplot and a variables plot, using the decathlon2 dataset from the factoextra package:
library("factoextra")
data(decathlon2)
colnames(data)
data<-decathlon2[,1:10] # taking only 10 variables/columns for easyness
res.pca <- prcomp(data, scale = TRUE)
#fviz_eig(res.pca)
fviz_pca_var(res.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
hep.PC.cor = prcomp(data, scale=TRUE)
biplot(hep.PC.cor)
Output:
[1] "X100m" "Long.jump" "Shot.put" "High.jump" "X400m" "X110m.hurdle"
[7] "Discus" "Pole.vault" "Javeline" "X1500m"
Along similar lines, you can run PCA on your data to gauge the importance of the age parameter.
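For instance, a rough sketch of how this could look for the question's data (my illustration only; it assumes data_val has genes in rows and samples in columns and simply appends Age as an extra column):
# Illustration only: add Age next to the gene values and inspect its contribution
pca_in  <- cbind(Age = Age, t(data_val))   # samples as rows; Age plus genes as columns
res.pca <- prcomp(pca_in, scale = TRUE)
fviz_pca_var(res.pca, col.var = "contrib", repel = TRUE)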
I hope this helps; if I find more such methods I will share them.

Find the nearest neighbor using caret

I'm fitting a k-nearest neighbor model using R's caret package.
library(caret)
set.seed(0)
y = rnorm(20, 100, 15)
predictors = matrix(rnorm(80, 10, 5), ncol=4)
data = data.frame(cbind(y, predictors))
colnames(data)=c('Price', 'Distance', 'Cost', 'Tax', 'Transport')
I left one observation as the test data and fit the model using the training data.
id = sample(nrow(data)-1)
train = data[id, ]
test = data[-id,]
knn.model = train(Price~., method='knn', train)
predict(knn.model, test)
When I display knn.model, it tells me it uses k=9. I would love to know which 9 observations are actually the "nearest" to the test observation. Besides manually calculating the distances, is there an easier way to display the nearest neighbors?
Thanks!
When you use knn you are grouping points that are near each other in the space of the independent variables. Normally this is done with train(Price ~ ., method = 'knn', train), and the model chooses the best prediction based on some criterion (taking the dependent variable into account as well). Since I have not checked whether the caret object stores the predicted price for each training observation, I simply used the trained model to predict the expected price for the training data (i.e. where each expected price is located in the outcome space).
In the end, the dependent variable is just a representation of all the other variables in a common space, and the associated prices are assumed to be similar because you cluster based on proximity.
As a summary, you need to do the following:
Get the prediction for each of the training data points (done by predicting over them).
Calculate the distance between each training prediction and the prediction for your observation of interest (in absolute value, since you care only about the size of the gap, not the sign).
Take the indexes of the N smallest ones (e.g. N = 9); with these you can retrieve the observations with the smallest distances.
TestPred  <- predict(knn.model, newdata = test)
TrainPred <- predict(knn.model, newdata = train)
Nearest9neighbors <- order(abs(TestPred - TrainPred))[1:9]  # 9 smallest gaps in predicted price
train[Nearest9neighbors, ]
Price Distance Cost Tax Transport
15 95.51177 13.633754 9.725613 13.320678 12.981295
7 86.07149 15.428847 2.181090 2.874508 14.984934
19 106.53525 16.191521 -1.119501 5.439658 11.145098
2 95.10650 11.886978 12.803730 9.944773 16.270416
4 119.08644 14.020948 5.839784 9.420873 8.902422
9 99.91349 3.577003 14.160236 11.242063 16.280094
18 86.62118 7.852434 9.136882 9.411232 17.279942
11 111.45390 8.821467 11.330687 10.095782 16.496562
17 103.78335 14.960802 13.091216 10.718857 8.589131
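As a cross-check, here is the manual distance calculation in predictor space (my own addition; it assumes caret's "knn" method uses plain Euclidean distance on the unscaled predictors, since no preProcess was requested):
# Euclidean distance from the test row to every training row, predictors only
test_x  <- as.numeric(test[1, -1])
train_x <- as.matrix(train[, -1])
d <- sqrt(colSums((t(train_x) - test_x)^2))
train[order(d)[1:9], ]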

Limiting Number of Predictions - R

I'm trying to create a model that predicts whether a given team makes the playoffs in the NHL, based on a variety of available team stats. However, I'm running into a problem.
I'm using R, specifically the caret package, and I'm having fairly good success so far, with one issue: I can't limit the number of teams that are predicted to make the playoffs.
I'm using a categorical variable as the prediction -- Y or N.
For example, using the random forest method from the caret package,
rf_fit <- train(playoff ~ ., data = train_set, method = "rf")
rf_predict <- predict(rf_fit,newdata = test_data_playoffs)
mean(rf_predict == test_data_playoffs$playoff)
gives an accuracy of approximately 90% for my test set, but that's because it's overpredicting. In the NHL, 16 teams make the playoffs, but this predicts 19 teams to make the playoffs. So I want to limit the number of "Y" predictions to 16.
Is there a way to limit the number of possible responses for a categorical variable? I'm sure there is, but google searching has given me limited success so far.
EDIT: Providing sample data, which can be created with the following code:
set.seed(100) # For reproducibility
data <- data.frame(Y = sample(1:10,32,replace = T)/10, N = rep(NA,32))
data$N <- 1-data$Y
This creates a data frame similar to what you get by using the "prob" option, i.e. a column of probabilities for Y and for N:
pred <- predict(fit,newdata = test_data_playoffs, "prob")
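One possible way to cap the number of "Y" predictions (my sketch, not from the original question; it assumes the test set is a single season of teams): rank the teams by the predicted probability of "Y" and label only the 16 highest as playoff teams.
# Sketch: convert class probabilities into exactly 16 "Y" predictions
probs <- predict(rf_fit, newdata = test_data_playoffs, type = "prob")

top16 <- order(probs$Y, decreasing = TRUE)[1:16]          # rows with the highest P(Y)
rf_predict_capped <- ifelse(seq_len(nrow(probs)) %in% top16, "Y", "N")

mean(rf_predict_capped == test_data_playoffs$playoff)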

SVM in R (e1071): Give more recent data higher influence (weights for support vector machine?)

I'm working with Support Vector Machines from the e1071 package in R. This is my first project using SVM.
I have a dataset containing the order histories of ~1k customers over 1 year, and I want to predict customer purchases. For every customer I have the information whether a certain item (out of ~50) was bought or not in a certain week (for 52 weeks, i.e. 1 year).
My goal is to predict next month's purchases for every single customer.
I believe that a purchase, say, 1 month ago is more meaningful for my prediction than a purchase 10 months ago.
My question is how I can give more recent data a higher impact. There is a 'weight' option in the svm function, but I'm not sure how to use it.
Can anyone give me a hint? It would be much appreciated!
Here is my code:
# Fit model using Support Vector Machines
# install.packages("e1071")
library(e1071)

response <- train[, 5]   # purchases
formula  <- response ~ .

tuned.svm <- tune.svm(train, response, probability = TRUE,
                      gamma = 10^(-6:-3), cost = 10^(1:2))
gamma.k <- tuned.svm$best.parameters[[1]]
cost.k  <- tuned.svm$best.parameters[[2]]

svm.model <- svm(formula, data = train,
                 type = 'eps-regression', probability = TRUE,
                 gamma = gamma.k, cost = cost.k)
svm.pred <- predict(svm.model, test, probability = TRUE)
Side notes: I'm fitting a model for every single customer. Also, since I'm interested in the probability that customer i buys item j in week k, I set probability = TRUE.
The weights option in the R SVM model is aimed more at handling class imbalance: the class.weights parameter is used to assign different weights to the classes (e.g. 1/0) in an imbalanced dataset.
To answer your question: to give recent data more weight in an SVM model, a simple trick in the absence of built-in observation-level weighting is to repeat the recent rows (i.e. create duplicate rows for recent data), thereby indirectly assigning them a higher weight.
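A rough illustration of that duplication trick (my sketch; it assumes the training data has one row per customer-week and a week column, which is not shown in the question):
# Sketch: duplicate recent weeks so they count more during training
dup_times <- ifelse(train$week > 39, 3,                # roughly the last quarter: 3 copies
                    ifelse(train$week > 26, 2, 1))     # second half of the year: 2 copies, else 1
train_weighted <- train[rep(seq_len(nrow(train)), times = dup_times), ]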
Try this package: https://CRAN.R-project.org/package=WeightSVM
It uses a modified version of 'libsvm' and can deal with instance weighting, so you can assign higher weights to recent data.
For example, suppose you have simulated data (x, y):
library(WeightSVM)  # provides wsvm()

x <- seq(0.1, 5, by = 0.05)
y <- log(x) + rnorm(x, sd = 0.2)
This is an unweighted SVM:
model1 <- wsvm(x, y, weight = rep(1,99))
In the resulting plot (not shown here), the blue dots are the unweighted SVM fit, which does not fit the first instances well. We want to put more weight on the first several instances.
So we can use a weighted SVM:
model2 <- wsvm(x, y, weight = seq(99,1,length.out = 99))
The green dots are the weighted SVM fit, which fits the first instances better.
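Applied to the recency question above, the weight vector could simply increase with time, for example (hypothetical names, just to show the idea):
# Sketch only: 'train_predictors' and 'train$week' stand in for the customer's
# feature matrix and the week index of each row; larger weight = more recent
svm.model.w <- wsvm(x = train_predictors, y = response, weight = train$week)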
