Using dplyr to run rma() on multiple subsets - r

I want to run a subgroup meta-analysis with the metafor package. The simplest way to do it is:
model.s.1 <- rma(yi=ES, vi=Va, data=dataset, method="DL", subset=S=="S_Level1")
model.s.2 <- rma(yi=ES, vi=Va, data=dataset, method="DL", subset=S=="S_Level2")
...
model.s.n <- rma(yi=ES, vi=Va, data=dataset, method="DL", subset=S=="S_Leveln")
However, doing this by hand is tedious and error-prone when the subgroup factor has many levels. I tried to use dplyr to solve this and simply extract the coefficients for all subgroups:
Dataset %>%
mutate(S=as.factor(S)) %>%
group_by(S) %>%
summarize(Coeff=coef.rma(rma(yi=ES, vi=Va, method="DL", data=.)))
But the result looked like this:
S Coeff
<fct> <dbl>
1 hmdb 0.114
2 HMDB0000123 0.114
3 HMDB0000148 0.114
4 HMDB0000158 0.114
5 HMDB0000159 0.114
6 HMDB0000161 0.114
7 HMDB0000162 0.114
8 HMDB0000167 0.114
9 HMDB0000168 0.114
10 HMDB0000172 0.114
# ... with 14 more rows
It seems that the rma function ignores the group_by and calculates the pooled effect for the whole dataset each time. What might be the cause? Are there any alternatives to this approach?

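Inside summarize(), the dot placeholder (.) refers to the entire data frame that was piped in, not to the current group, so rma() is fitted to all rows every time. We may do a group_split and then loop through the list elements with map: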
library(tidyverse)
Dataset %>%
group_split(S= factor(S)) %>%
map_dfr(~ .x %>%
summarise(S = first(S), Coeff=coef.rma(rma(yi=ES,
vi=Va, method="DL", data=.))))
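The placeholder behaviour can be checked on a small example without metafor. The sketch below uses a made-up data frame (toy, with columns g and x, both hypothetical) and mean() as a stand-in for the model fit:

```r
library(dplyr)

# Made-up data: two groups with clearly different means
toy <- data.frame(g = c("a", "a", "b", "b"), x = c(1, 3, 10, 20))

# `.` is the whole piped-in data frame, so every group gets the overall mean
whole <- toy %>%
  group_by(g) %>%
  summarise(m = mean(.$x))

# Referring to the column directly respects the grouping
per_group <- toy %>%
  group_by(g) %>%
  summarise(m = mean(x))

whole$m      # same value repeated for both groups
per_group$m  # one value per group
```

This mirrors the repeated 0.114 in the question: every group was summarised from the same full dataset.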

Dear @akrun, I have one more question on a similar piece of code (the previous one was posted in the wrong window, sorry for that).
Let's assume that for every subset of studies I'd like to add a fixed-effect meta-regression with a binary factor (0/1) - we call it F.
library(tidyverse)
Dataset %>%
group_split(S=factor(S)) %>%
map_dfr(~ .x %>%
summarise(S=first(S), Coeff=coef.rma(rma(yi=ES,vi=Va, mods=~F, method="DL",
data=.))))
If a certain subset of S contains only zeros or only ones, the rma function throws an error. How can I drop such cases from the list and replace them with NA?
Thank you,
Jakub

library(metafor)
library(tidyverse)
Results <- Org %>% # Primary analysis - DerSimonian-Laird estimator
group_split(Metabolite= factor(Metabolite)) %>%
map_dfr(~ .x %>%
summarise(Metabolite = first(Metabolite),
Coeff = ifelse(nlevels(Biospecimen)>1,
ifelse((rma(yi=Est,sei=SE, method="DL", data=.))$k>=5,
coef.rma(rma(yi=Est,sei=SE, mods=~Biospecimen, method="DL", data=.)),NA),NA)))
It worked, but produced warnings from the rma function. However, the results seem to be correct. Thanks a lot @akrun
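An alternative to nested ifelse() calls is to catch the error and return NA for the subsets where the fit fails. The following is only a sketch: safe_coef and toy_fit are hypothetical helpers, the data are made up, and toy_fit uses lm() as a stand-in so the example runs without metafor. In the real pipeline the body of toy_fit would be the rma() call from the question.

```r
# Hypothetical helper: run fit_fun(d) and return NA instead of stopping on error
safe_coef <- function(d, fit_fun) {
  tryCatch(fit_fun(d), error = function(e) NA_real_)
}

# Stand-in for the rma() call: errors when the moderator is constant,
# otherwise returns the slope of a plain lm() fit
toy_fit <- function(d) {
  if (length(unique(d$F)) < 2) stop("moderator F is constant in this subset")
  unname(coef(lm(ES ~ F, data = d))["F"])
}

# Made-up data: subset B has only ones in F
toy <- data.frame(
  S  = c("A", "A", "A", "A", "B", "B", "B"),
  F  = c(0, 0, 1, 1, 1, 1, 1),
  ES = c(0.1, 0.3, 0.5, 0.7, 0.2, 0.4, 0.6)
)

# One coefficient per subset; failed fits become NA
coeffs <- vapply(split(toy, toy$S), safe_coef, numeric(1), fit_fun = toy_fit)
result <- data.frame(S = names(coeffs), Coeff = unname(coeffs))
result
```

Because the fit is attempted only once per subset, this also avoids fitting a separate model just to read its k.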

Related

I tried predicting on 5 rows of the dataset, but why does it keep predicting on the whole dataset?

I built an lm model in R on 65000 rows (mydata), and I want to see only the predictions for the first 5 rows in order to check how well my model predicts. Below is the code I wrote to do this, but it keeps predicting the values for all 65000 rows. Is someone able to help me?
lm_model2002 <- lm(`AC: Volume` ~ `Market Area (L1)`,data=mydata)
summary(lm_model2002)
df = head(data.frame(`Market Area (L1)`=mydata$`Market Area (L1)`),5)
predict(lm_model2002,newdata=df)
But now the real problem: I took the first row of mydata and copied it 5 times, then I made a vector that ranges from 1 to 2 and replaced one of the variables (price per unit) with that vector. As a result, I want to predict the exact same rows with only a different price, so that I am able to plot how the prediction changes as the price increases:
lm_model3204<- lm(`AC: Volume` ~ log(price_per_unit)*(Cluster_country_hierarchical+`Loyalty-cumulative-volume-10`+`Loyalty-cumulative-orders-10`+`Loyalty-number-of-order-10`+price_discount+Incoterms)+Cluster_spg*(price_discount+Cluster_country_hierarchical)+price_discount*(Month+`GDP per capita`+`Loyalty-cumulative-orders-10`+`Loyalty-cumulative-volume-10`)+`Payer CustGrp`+`CRU Index`,data = mydata)
summary(lm_model3204)
test_data <- mydata[1:1,]
df <- data.frame(test_data,ntimes=c(5))
df <- as.data.frame(lapply(df, rep, df$ntimes))
priceperunit<-seq(1,2,by=0.25)
df$price_per_unit<-priceperunit
pred <- predict(lm_model3204,newdata=df)
Please use a minimal reproducible example next time you post a question.
You just have to predict the first five rows. Here is an example with the built-in iris dataset:
data("iris")
lm_model2002 <- lm(Sepal.Length ~ Sepal.Width,data=iris)
summary(lm_model2002)
predict(lm_model2002,newdata=iris[1:5,])
output:
> predict(lm_model2002,newdata=iris[1:5,])
1 2 3 4 5
5.744459 5.856139 5.811467 5.833803 5.722123
Or:
df <- head(iris,5)
predict(lm_model2002,newdata=df)
EDIT
After your last comment, to see the change in prediction by changing one of the independent variables
data(iris)
df <- iris[rep(1,5),]
Petal_Length<-seq(1,2,by=0.25)
df$Petal.Length<-Petal_Length
lm_model3204 <- lm(Sepal.Length ~ Petal.Length+Sepal.Width,data=iris)
pred <- predict(lm_model3204,newdata=df)

Convert multiple moran.test outputs into structured, storable, copy-pastable strings

I wish to collapse the output of spdep::moran.test into a single string that is regularly structured with variable names and values and that can both be saved as a text value into a dataframe, and be human readable in the RStudio console and copy-pastable into MS Word to form a table without too many additional manual adjustments. (I have multiple tests to run and wish to copy-paste their output in one go.)
In the course of looking for a solution, I stumbled upon the report package which claims to turn an htest class object into a "report" (I don't know what this looks like in R) and thus may address my goal to some extent. However, the report function doesn't work on moran.test, as presented in the code below.
I am exploring and there are probably alternative and more straightforward approaches which I haven't considered. Thus my question is twofold: 1. Solve the immediate issue with report and/or 2. Provide an alternative and more efficient solution to my goal.
The data preparation below is drawn from https://mgimond.github.io/simple_moransI_example.
library(sf)
library(spdep)
library(report)
# Load shapefile
s <- readRDS(url("https://github.com/mgimond/Data/raw/gh-pages/Exercises/nhme.rds"))
# Prevent error "old-style crs object detected; please recreate object with a recent sf::st_crs()"
st_crs(s) <- st_crs(s)
# Define neighboring polygons
nb <- poly2nb(s, queen=TRUE)
# Assign weights to the neighbors
lw <- nb2listw(nb, style="W", zero.policy=TRUE)
# Run Moran’s I test
(mt <- moran.test(s$Income,lw, alternative="greater"))
#Moran I test under randomisation
#data: s$Income
#weights: lw
#Moran I statistic standard deviate = 5.8525, p-value = 2.421e-09
#alternative hypothesis: greater
#sample estimates:
# Moran I statistic Expectation Variance
#0.68279551 -0.04000000 0.01525284
# Moran’s I test is of class htest required by function report::report
class(mt)
#[1] "htest"
# Function report::report returns an error
report(mt)
#Error in `$<-.data.frame`(`*tmp*`, "tau", value = c(`Moran I statistic` = 0.68279551202875, :
# replacement has 3 rows, data has 1
The desired output could look something like:
"P-value 2.421e-09 | Statistic 0.68279551 | Expectation -0.04000000 | Variance 0.01525284"
The point is the names and values, not the separators. This is based on my current assumptions of how to approach this task, which are probably imperfect.
You might want to take a look at the broom package:
broom::tidy(mt)
#> # A tibble: 1 x 7
#> estimate1 estimate2 estimate3 statistic p.value method alternative
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 0.683 -0.04 0.0153 5.85 2.42e-9 Moran I test u… greater
library(tidyverse)
mt %>%
broom::tidy() %>%
as.list() %>%
enframe() %>%
mutate(value = value %>% as.character()) %>% unite(data, sep = "=") %>%
pull(data) %>%
paste0(collapse = ", ")
#> [1] "estimate1=0.68279551202875, estimate2=-0.04, estimate3=0.0152528397222445, statistic=c(`Moran I statistic standard deviate` = 5.85248209823413), p.value=2.42145194022024e-09, method=Moran I test under randomisation, alternative=greater"
You can make a table and create a csv file from multiple tests (e.g. having multiple objects of class htest like mt, mt2 and mt3):
list(mt, mt2, mt3) %>% map(broom::tidy) %>% bind_rows() %>% write_csv("tests.csv")
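If a single "name value | name value" string is preferred, it can also be assembled directly from the htest object. This is a rough sketch: htest_to_string is a hypothetical helper, and t.test() is used as a stand-in so the example runs without spdep. It relies only on the p.value and estimate components of an htest object, which the moran.test output above also has.

```r
# Hypothetical helper: collapse the p-value and the named estimates of an
# htest object into one "name value | name value" string
htest_to_string <- function(h) {
  vals <- c("P-value" = h$p.value, h$estimate)
  paste(names(vals), signif(vals, 8), collapse = " | ")
}

# Stand-in tests; with spdep loaded these would be moran.test() results
tests <- list(t.test(c(1, 2, 3, 4, 5)), t.test(c(2, 4, 6, 8)))

# One string per test, storable as a character column in a data frame
out <- data.frame(test = seq_along(tests),
                  summary = vapply(tests, htest_to_string, character(1)))
out$summary
```

The separator and the number of significant digits are arbitrary choices here and can be changed in paste() and signif().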

Batch distribution fitting using Tidyverse and fitdistrplus

I have a dataset that is as follows (10,000+ Rows):
P_ID    SNUM  RNUM  X
ID_233  10    2     40.31
ID_233  10    3     23.21
ID_234  12    5     11.00
ID_234  12    6     0.31
ID_234  13    1     0.00
ID_235  10    2     66.23
From this dataset, I want to fit each distinct P_ID to a Gamma distribution (ignoring the testing of how well the sampled data fits the distribution)
Using the fitdistrplus package, I can achieve this by extracting the X for an individual P_ID into a vector and then run it through fw <- fitdist(data,"gamma") and then extract the shape and rate descriptive variables out, but this is all very manual.
I would like to find a method using tidyverse to go from the data frame above to:
P_ID    Distrib  G_Shape     G_Rate
ID_233  Gamma    1.21557116  0.09206639
ID_234  Gamma    3.23234542  0.34566432
ID_235  Gamma    2.34555553  0.92344521
How would I achieve this with the tidyverse and pipes rather than a succession of for loops?
You could apply fitdist for every individual using group_by and extract shape and rate values out of each model.
library(dplyr)
library(purrr)
library(fitdistrplus)
data %>%
group_by(P_ID) %>%
summarise(model = list(fitdist(X, "gamma"))) %>%
mutate(G_Shape = map_dbl(model, pluck, 'estimate', 'shape'),
G_Rate = map_dbl(model, pluck, 'estimate', 'rate')) -> result
result
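As a quick sanity check on the per-group layout, the same table shape can be produced in base R with moment-based estimates. Note that this swaps in a different technique: the sketch below uses the method of moments (shape = mean^2 / variance, rate = mean / variance), not the maximum-likelihood fit that fitdist() performs by default, so the numbers will generally differ. The helper gamma_mom and the toy data are made up for illustration.

```r
# Hypothetical helper: method-of-moments estimates for a gamma distribution
gamma_mom <- function(x) {
  m <- mean(x)
  v <- var(x)
  c(G_Shape = m^2 / v, G_Rate = m / v)
}

# Made-up data with two IDs
toy <- data.frame(P_ID = rep(c("ID_1", "ID_2"), each = 4),
                  X = c(2, 4, 6, 8, 1, 1, 3, 3))

# One row of estimates per P_ID
est <- t(sapply(split(toy$X, toy$P_ID), gamma_mom))
result <- data.frame(P_ID = rownames(est), Distrib = "Gamma", est)
rownames(result) <- NULL
result
```

This also adds the Distrib column from the desired output, which can be attached to the fitdist() result in the same way.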

Quantile estimates for subpopulations where some subpopulations only have one case using srvyr and survey R packages

I am trying to produce estimates of the 25th percentile of a continuous variable for a series of sub-groups, where the data is taken from a survey that uses sampling weights. I am doing this in R using the survey and srvyr packages.
The issue I face is that in a small minority of cases a sub-group has only one observation, and therefore a 25th percentile is meaningless. This would be fine, except that it results in an error which prevents the percentiles from being calculated for the subgroups that do have sufficient observations.
Error in approxfun(cum.w, xx[oo], method = method, f = f, yleft = min(xx), :
need at least two non-NA values to interpolate
The code runs when the offending groups are removed, however I have had to identify them manually which is far from ideal.
Is there a way to achieve the same outcome but where for single observation groups an NA, or just the value of that observation, is outputted rather than an error? Alternatively is there a neat way of automatically excluding such groups from the calculation?
Below is a reproducible example to illustrate my issue using the apistrat dataset from the survey package.
library(dplyr)
library(survey)
library(srvyr)
data(api)
#25th percentile of api00 by school type and whether school is year round or not
apistrat %>%
as_survey(strata = stype, weights = pw) %>%
group_by(yr.rnd, stype, .drop=TRUE) %>%
summarise(survey_quantile(api00, 0.25, na.rm=T))
#Error in approxfun(cum.w, xx[oo], method = method, f = f, yleft = min(xx), :
#need at least two non-NA values to interpolate
apistrat %>% group_by(yr.rnd, stype) %>% tally() %>% filter(n==1)
#one group out of 6 has only a single api00 observation and therefore a quantile can't be interpolated
#Removing that one group means the code can now run as intended
apistrat %>%
as_survey(strata = stype, weights = pw) %>%
filter(!(yr.rnd=="Yes"&stype=="H")) %>%
group_by(yr.rnd, stype, .drop=TRUE) %>%
summarise(survey_quantile(api00, 0.25, na.rm=T))
#Get the same error if you do it the 'survey' package way
dstrat <- svydesign(id=~1,strata=~stype,data=apistrat, fpc=~fpc)
svyby(~api99, ~stype+yr.rnd, dstrat, svyquantile, quantiles=0.25)
One work-around is to wrap the call to svyquantile() using tryCatch()
> svyq<-function( ...){tryCatch(svyquantile(...), error=function(e) matrix(NA,1,1))}
> svyby(~api99, ~stype+yr.rnd, dstrat, svyq, quantiles=0.25,keep.var=FALSE,na.rm=TRUE)
stype yr.rnd statistic
E.No E No 560.50
H.No H No 532.75
M.No M No 509.00
E.Yes E Yes 456.00
H.Yes H Yes NA
M.Yes M Yes 436.00
With quantiles and svyby you need to be explicit about whether you want standard errors -- the code above doesn't. If you want standard errors, you'd need the error= branch of tryCatch to return an actual svyquantile object with NAs in it.
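For the other option raised in the question, excluding single-observation groups automatically, the group sizes can be computed on the data frame before the design is built. The sketch below shows only that step in base R on made-up data; the column name n_in_group is an assumption. The resulting flag could then be used in a filter() call after as_survey(), in the same place as the manual filter in the question.

```r
# Made-up data: the (Yes, H) combination occurs only once
toy <- data.frame(
  yr.rnd = c("No", "No", "Yes", "No", "No", "Yes", "Yes"),
  stype  = c("E", "E", "E", "H", "H", "H", "E"),
  api00  = c(600, 650, 500, 700, 720, 480, 510)
)

# Number of rows in each yr.rnd x stype combination, repeated on every row
toy$n_in_group <- ave(seq_len(nrow(toy)), toy$yr.rnd, toy$stype, FUN = length)

# Flag rows that belong to groups with at least two observations
keep <- toy$n_in_group > 1

# Groups that would be dropped from the quantile calculation
toy[!keep, c("yr.rnd", "stype")]
```

Unlike the tryCatch() approach, the dropped groups do not appear in the output at all, so they would need to be added back if an explicit NA row is wanted.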

Correlation matrix with dplyr, tidyverse and broom - P-value matrix

I want to obtain the p-values for a correlation matrix using the dplyr and/or broom packages, testing multiple variables at the same time. I'm aware of other methods, but dplyr seems easier and more intuitive to me. In addition, dplyr will need to correlate each variable to obtain the specific p-value, which makes the process easier and faster.
I checked other links, but they did not work for this question (example 1, example 2, example 3)
When I use this code, the correlation coefficients are reported. However, the P-values are not.
agreg_base_tipo_a %>%
dplyr::select(S2.RT, BIS_total, IDATE, BAI, ASRS_total) %>%
do(as.data.frame(cor(., method="spearman", use="pairwise.complete.obs")))
Please check out this reproducible code to work with:
set.seed(1164)
library(tidyverse)
ds <- data.frame(id=(1) ,a=rnorm(10,2,1), b=rnorm(10,3,2), c=rnorm(5,1,05))
ds %>%
select(a,b,c) %>%
do(as.data.frame(cor(., method="spearman", use="pairwise.complete.obs")))
This answer is based on akrun's comment from this post. By using the rcorr function, we can calculate the correlations and P values. To access these components, use ds_cor$r and ds_cor$P.
set.seed(1164)
library(tidyverse)
library(Hmisc)
ds <- data.frame(id=(1) ,a=rnorm(10,2,1), b=rnorm(10,3,2), c=rnorm(5,1,05))
ds_cor <- ds %>%
select(-id) %>%
as.matrix() %>%
rcorr(type = "spearman")
ds_cor
# a b c
# a 1.00 0.28 -0.42
# b 0.28 1.00 -0.25
# c -0.42 -0.25 1.00
#
# n= 10
#
#
# P
# a b c
# a 0.4250 0.2287
# b 0.4250 0.4929
# c 0.2287 0.4929
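If adding Hmisc is not an option, a p-value matrix can also be built with base R by calling cor.test() on each pair of columns. This is a minimal sketch on made-up data: p_matrix is a hypothetical helper, and exact = FALSE is set so that the Spearman test uses the asymptotic approximation. The p-values are not guaranteed to match rcorr() digit for digit.

```r
# Hypothetical helper: symmetric matrix of pairwise cor.test() p-values
p_matrix <- function(df, method = "spearman") {
  vars <- names(df)
  p <- matrix(NA_real_, length(vars), length(vars),
              dimnames = list(vars, vars))
  for (i in seq_along(vars)) {
    for (j in seq_along(vars)) {
      if (i < j) {
        # Fill both triangles from a single test per pair
        p[i, j] <- p[j, i] <- cor.test(df[[i]], df[[j]],
                                       method = method,
                                       exact = FALSE)$p.value
      }
    }
  }
  p
}

set.seed(1)
toy <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))
pm <- p_matrix(toy)
round(pm, 4)
```

The diagonal is left as NA, matching the blank diagonal in the rcorr() output.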

Resources