Convert multiple moran.test outputs into structured, storable, copy-pastable strings

I wish to collapse the output of spdep::moran.test into a single, regularly structured string of variable names and values that can be saved as a text value in a dataframe, read easily in the RStudio console, and copy-pasted into MS Word to form a table without too many additional manual adjustments. (I have multiple tests to run and wish to copy-paste their output in one go.)
In the course of looking for a solution, I stumbled upon the report package, which claims to turn an htest-class object into a "report" (I don't know what this looks like in R) and thus may address my goal to some extent. However, the report function doesn't work on moran.test output, as shown in the code below.
I am exploring, and there are probably alternative, more straightforward approaches which I haven't considered. Thus my question is twofold: 1. solve the immediate issue with report, and/or 2. provide an alternative, more efficient solution to my goal.
The data preparation below is drawn from https://mgimond.github.io/simple_moransI_example.
library(sf)
library(spdep)
library(report)
# Load shapefile
s <- readRDS(url("https://github.com/mgimond/Data/raw/gh-pages/Exercises/nhme.rds"))
# Prevent error "old-style crs object detected; please recreate object with a recent sf::st_crs()"
st_crs(s) <- st_crs(s)
# Define neighboring polygons
nb <- poly2nb(s, queen=TRUE)
# Assign weights to the neighbors
lw <- nb2listw(nb, style="W", zero.policy=TRUE)
# Run Moran’s I test
(mt <- moran.test(s$Income, lw, alternative = "greater"))
# Moran I test under randomisation
# data:  s$Income
# weights: lw
# Moran I statistic standard deviate = 5.8525, p-value = 2.421e-09
# alternative hypothesis: greater
# sample estimates:
# Moran I statistic       Expectation          Variance
#        0.68279551       -0.04000000        0.01525284
# Moran’s I test output is of class htest, the class required by report::report
class(mt)
#[1] "htest"
# Function report::report returns an error
report(mt)
#Error in `$<-.data.frame`(`*tmp*`, "tau", value = c(`Moran I statistic` = 0.68279551202875, :
# replacement has 3 rows, data has 1
The desired output could look something like:
"P-value 2.421e-09 | Statistic 0.68279551 | Expectation -0.04000000 | Variance 0.01525284"
The point is the names and values, not the separators. This is based on my current assumptions of how to approach this task, which are probably imperfect.

You might want to take a look at the broom package:
broom::tidy(mt)
#> # A tibble: 1 x 7
#> estimate1 estimate2 estimate3 statistic p.value method alternative
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 0.683 -0.04 0.0153 5.85 2.42e-9 Moran I test u… greater
library(tidyverse)
mt %>%
  broom::tidy() %>%
  as.list() %>%
  enframe() %>%
  mutate(value = as.character(value)) %>%
  unite(data, sep = "=") %>%
  pull(data) %>%
  paste0(collapse = ", ")
#> [1] "estimate1=0.68279551202875, estimate2=-0.04, estimate3=0.0152528397222445, statistic=c(`Moran I statistic standard deviate` = 5.85248209823413), p.value=2.42145194022024e-09, method=Moran I test under randomisation, alternative=greater"
You can make a table and create a csv file from multiple tests (e.g. having multiple objects of class htest, like mt, mt2 and mt3):
list(mt, mt2, mt3) %>% map(broom::tidy) %>% bind_rows() %>% write_csv("tests.csv")
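If you want exactly the pipe-separated format from the question, you can also build the string straight from the htest components; a minimal sketch (fmt_moran is a helper name I made up; it assumes the estimate names that moran.test reports, visible in the output above):
fmt_moran <- function(ht) {
  paste(sprintf("P-value %.3e", ht$p.value),
        sprintf("Statistic %.8f", ht$estimate[["Moran I statistic"]]),
        sprintf("Expectation %.8f", ht$estimate[["Expectation"]]),
        sprintf("Variance %.8f", ht$estimate[["Variance"]]),
        sep = " | ")
}
fmt_moran(mt)
#> [1] "P-value 2.421e-09 | Statistic 0.68279551 | Expectation -0.04000000 | Variance 0.01525284"
# For several tests, stored as a character column of a dataframe:
# data.frame(test = c("mt", "mt2", "mt3"),
#            result = vapply(list(mt, mt2, mt3), fmt_moran, character(1)))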

Related

Using outside function in dplyr to standardize values via selected geometric mean. (Getting it via sample instead of geom mean of full column)

Good evening fellow programmers/statisticians etc.
I'm trying to standardize a set of variables by dividing them by the geometric mean of a set of reference variables (the same ones or others). The problem is that when I try to do this via dplyr, I get results that don't match what I obtain case by case.
The code below explains what I have done and why it failed: dplyr is not computing my geometric mean sample/row-wise, but over the full column instead.
I have reviewed some related questions, including several about geometric means, but have not yet found a solution.
# A set of functions I'm using to calculate the geometric mean.
gm_mean <- function(x, na.rm = TRUE){
  exp(sum(log(x[x > 0]), na.rm = na.rm) / length(x))
}
gm_mean2 <- function(x, na.rm = TRUE){
  exp(mean(log(x[x > 0]), na.rm = TRUE))
}
# And also psych::geometric.mean()
# x <- c(4, 8, 9, 9, 12, 14, 17)
# gm_mean(x) # It works as intended.
# gm_mean2(x) #It works as intended.
# psych::geometric.mean(x) #Indeed it works
So, using the iris dataset, I want to standardize a set of columns (coln1) by dividing by the geometric mean of another set of columns (which I would eventually like to pass as a variable, but since I can't get that to work, I'm naming them directly for now).
For now I have tried this (and failed):
library(dplyr)
coln1 <- colnames(iris)[1:2]
iris %>% mutate(across( any_of(coln1), ~ .x / psych::geometric.mean(c(Sepal.Length,Sepal.Width)) )) ## Doesn't work as intended? No. Not at all.
# Let me illustrate: is the value we get case by case equal to the across() output?
iris[1,1] / psych::geometric.mean(c(iris[1,1], iris[1,2]))
# 1.207 != 1.2187 (the case-by-case value differs from the across() output)
iris[1,1] / psych::geometric.mean(c(iris$Sepal.Length, iris$Sepal.Width))
# 1.2187 == 1.2187 (dividing by the full-column geometric mean reproduces it)
# So across() is taking the full columns of values, not the values corresponding
# to each sample (here 2 columns, but we could have more or fewer in the
# psych::geometric.mean call).
# Notes.
# The geometric mean is the nth root of the product of n values, or e to the
# mean log of x. Useful for describing non-normal, i.e. geometric, distributions.
# We are using it via psych:: because values could be non-positive, which needs handling.
# iris %>% mutate(across( any_of(coln1), ~ .x / exp(mean(log(Sepal.Length+Sepal.Width))) )) # No: this takes the log of a single sum per row, not the mean of the two logs.
I think you've done a great job setting it up; it's just rowwise() that you're missing, really! I've rearranged the logic in the mutate() call, but it's basically just rowwise().
coln1 <- colnames(iris)[3:4]
iris %>%
  rowwise() %>%
  mutate(geo.mean = psych::geometric.mean(c(Sepal.Length, Sepal.Width)),
         across(.cols = all_of(coln1), .fns = ~ .x / geo.mean, .names = '{.col}_{.fn}'))
# A tibble: 150 x 8
# Rowwise:
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species geo.mean Petal.Length_1 Petal.Width_1
          <dbl>       <dbl>        <dbl>       <dbl> <fct>      <dbl>          <dbl>         <dbl>
 1          5.1         3.5          1.4         0.2 setosa      4.22          0.331        0.0473
# prove it's correctly functioning with first entry:
1.4 / psych::geometric.mean(c(5.1, 3.5))
[1] 0.3313667
0.2 / psych::geometric.mean(c(5.1, 3.5))
[1] 0.04733811
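If rowwise() turns out to be slow on a large dataset, the same result can be computed vectorized; a sketch assuming all values are strictly positive (as in iris), so psych::geometric.mean's handling of non-positive values isn't needed:
# Row-wise geometric mean of the two reference columns, without rowwise():
geo <- exp((log(iris$Sepal.Length) + log(iris$Sepal.Width)) / 2)
iris2 <- iris
iris2[coln1] <- iris[coln1] / geo # divides each column elementwise by geo
head(iris2[coln1])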

Using dplyr to run rma() on multiple subsets

I want to run a subgroup meta-analysis with the metafor package. The simplest way to do it is:
model.s.1 <- rma(yi=ES, vi=Va, data=dataset, method="DL", subset=S=="S_Level1")
model.s.2 <- rma(yi=ES, vi=Va, data=dataset, method="DL", subset=S=="S_Level2")
...
model.s.n <- rma(yi=ES, vi=Va, data=dataset, method="DL", subset=S=="S_Leveln")
However, it's very tedious and error-prone to do this by hand when the subgroup factor has many levels. I tried to use dplyr to solve this and simply extract the coefficients for all subgroups:
Dataset %>%
  mutate(S = as.factor(S)) %>%
  group_by(S) %>%
  summarize(Coeff = coef.rma(rma(yi = ES, vi = Va, method = "DL", data = .)))
But the result looked like this:
   S           Coeff
   <fct>       <dbl>
 1 hmdb        0.114
 2 HMDB0000123 0.114
 3 HMDB0000148 0.114
 4 HMDB0000158 0.114
 5 HMDB0000159 0.114
 6 HMDB0000161 0.114
 7 HMDB0000162 0.114
 8 HMDB0000167 0.114
 9 HMDB0000168 0.114
10 HMDB0000172 0.114
# ... with 14 more rows
It seems that the rma function ignores the group_by() and calculates the pooled effect for the whole dataset each time. What might be the cause? Are there alternatives to this approach?
The cause is that data = . inside summarize() refers to the whole piped Dataset, not to the current group, so rma() sees all rows every time. We may instead do a group_split and then loop through the list elements with map:
library(tidyverse)
Dataset %>%
  group_split(S = factor(S)) %>%
  map_dfr(~ .x %>%
            summarise(S = first(S),
                      Coeff = coef.rma(rma(yi = ES, vi = Va, method = "DL", data = .))))
Dear @akrun, I have one more question on a similar piece of code (the previous one was in the wrong window, sorry for that).
Let's assume that for every subset of studies I'd like to add a fixed-effect meta-regression with a binary factor (0/1); call it F.
library(tidyverse)
Dataset %>%
  group_split(S = factor(S)) %>%
  map_dfr(~ .x %>%
            summarise(S = first(S),
                      Coeff = coef.rma(rma(yi = ES, vi = Va, mods = ~F,
                                           method = "DL", data = .))))
If a certain subset of S has only zeros or only ones, the rma function throws an error. How can I then drop such cases from the list and replace them with NA?
Thank you,
Jakub
library(metafor)
library(tidyverse)
Results <- Org %>% # Primary analysis - DerSimonian-Laird estimator
  group_split(Metabolite = factor(Metabolite)) %>%
  map_dfr(~ .x %>%
            summarise(Metabolite = first(Metabolite),
                      Coeff = ifelse(nlevels(Biospecimen) > 1,
                                     ifelse(rma(yi = Est, sei = SE, method = "DL", data = .)$k >= 5,
                                            coef.rma(rma(yi = Est, sei = SE, mods = ~Biospecimen,
                                                         method = "DL", data = .)),
                                            NA),
                                     NA)))
It worked, but produced warnings from the rma function. However, the results seem to be correct. Thanks a lot @akrun
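An alternative to the nested ifelse() is purrr::possibly(), which wraps the model call so that subsets where rma() errors come back as NA instead of stopping the loop; a sketch assuming the same Org data and column names as above:
library(metafor)
library(tidyverse)
# possibly() returns `otherwise` whenever the wrapped function errors,
# e.g. when a subset's moderator has only one level; [1] keeps just the
# intercept, mirroring what the ifelse() version stored
safe_coef <- possibly(function(d) {
  coef.rma(rma(yi = Est, sei = SE, mods = ~Biospecimen, method = "DL", data = d))[1]
}, otherwise = NA_real_)
Results <- Org %>%
  group_split(Metabolite = factor(Metabolite)) %>%
  map_dfr(~ tibble(Metabolite = first(.x$Metabolite), Coeff = safe_coef(.x)))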

Generate a crude incidence rate table (stratified by a factor variable) from a Lexis Model

I am using the 'Epi' package in R to model follow-up data from a study.
I am having no issues with declaring the Lexis model or running Poisson and (combined with the survival package) Cox regressions.
As part of the initial data review I want a simple way to make a table of crude unadjusted incidence/event rates from the data in a Lexis model in R (before fitting any Poisson/Cox models).
I have found a coded approach which allows me to do this and to stratify by a variable as part of exploratory data analysis:
# Generic syntax example
# Add up the number of events and the amount of follow-up time
# within each level of the stratifying variable
totals <- cbind(tapply(lexis_model$lex.Xst, lexis_model$stratifying_var, sum),
                tapply(lexis_model$lex.dur, lexis_model$stratifying_var, sum))
# Rates per 1,000 person-years
rates <- tapply(lexis_model$lex.Xst, lexis_model$stratifying_var, sum) /
  tapply(lexis_model$lex.dur, lexis_model$stratifying_var, sum) * 10^3
ratetable <- cbind(totals, rates)
# Specific example based on the dataset
totals <- cbind(tapply(lexis_model$lex.Xst, lexis_model$grade, sum),
                tapply(lexis_model$lex.dur, lexis_model$grade, sum))
rates <- tapply(lexis_model$lex.Xst, lexis_model$grade, sum) /
  tapply(lexis_model$lex.dur, lexis_model$grade, sum) * 10^3
ratetable <- cbind(totals, rates)
ratetable
                    rates
1 90 20338.234 4.4251630
2 64  7265.065 8.8092811
# Shows number of events, years of follow-up, and events per 1,000 years
# of follow-up, stratified by the stratifying variable
Note these are crude unadjusted/absolute rates - not the output of a Poisson model. Whilst I appreciate that the code above does indeed produce the desired output (and is pretty straightforward), I wanted to see whether there is a command that can take a Lexis dataset and output this directly. I've had a look at the available commands in the Epi and epitools packages - I may have missed something, but I could not see an obvious way to do this.
As this is quite a common thing to want to do, I wondered if anyone was aware of a package/function that could do this by specifying simply the Lexis dataset and the stratification variable (or indeed a single function for the steps above in one go).
Ideally the output would look something like the below (which is taken from Stata, which I am trying to move away from in favour of R!):
A copy of the first twenty rows or so of the actual data is here (the data has already been put into a Lexis model using the Epi package, so all relevant Lexis variables are there):
https://www.dropbox.com/s/yjyz1kzusysz941/rate_table_data.xlsx?dl=0
I would do this simply using the tidyverse R package as such:
library(tidyverse)
lexis_model %>%
  group_by(grade) %>%
  summarise(sum_Xst = sum(lex.Xst), sum_dur = sum(lex.dur)) %>%
  mutate(rate = sum_Xst / sum_dur * 10^3) -> rateable
rateable
# A tibble: 2 x 4
#   grade sum_Xst   sum_dur     rate
#   <dbl>   <int>     <dbl>    <dbl>
# 1     1       2 375.24709 5.329821
# 2     2       0  92.44079 0.000000
And you could wrap this into a function yourself:
rateFunc <- function(data, strat_var)
{
  data %>% # use the data argument rather than the global lexis_model
    group_by(.data[[strat_var]]) %>% # group_by_() is deprecated; the .data pronoun replaces it
    summarise(sum_Xst = sum(lex.Xst), sum_dur = sum(lex.dur)) %>%
    mutate(rate = sum_Xst / sum_dur * 10^3)
}
which you would then call:
rateFunc(lexis_model, "grade")
This is useful because, with the combination of tidyverse summarise() and mutate(), it is very easy to add more summary statistics to the table.
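For instance, a sketch adding simple normal-approximation 95% limits for each crude rate; the Poisson SE of a rate is sqrt(events)/person-years (note the popEpi output below computes its intervals on the log scale, so the limits won't match exactly):
lexis_model %>%
  group_by(grade) %>%
  summarise(sum_Xst = sum(lex.Xst), sum_dur = sum(lex.dur)) %>%
  mutate(rate    = sum_Xst / sum_dur * 10^3,
         se_rate = sqrt(sum_Xst) / sum_dur * 10^3, # Poisson SE of the crude rate
         rate_lo = rate - 1.96 * se_rate,
         rate_hi = rate + 1.96 * se_rate)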
EDIT:
After clarification of the question: this can be done with the popEpi package's rate() function:
popEpi::rate(lexis_model, obs = lex.Xst, pyrs = lex.dur, print = grade)
# Crude rates and 95% confidence intervals:
#    grade lex.Xst  lex.dur       rate     SE.rate     rate.lo   rate.hi
# 1:     1       2 375.2472 0.00532982 0.003768752 0.001332942 0.0213115
# 2:     2       0  92.4408 0.00000000 0.000000000 0.000000000       NaN

R - Using nested dataframe to run function with different sets of parameters

I would like to create a wrapper for the Levenberg-Marquardt Nonlinear Least-Squares function nls.lm (minpack.lm library) similar to nls2 (nls2 library) to give a brute force method for evaluating the fit of a model to observed data.
The idea is to create a range of starting value combinations and either:
pass these to a function, then compare the function output to the observed data to create an R^2 value for each of the starting value combinations and run the nls.lm fitting with the best one of them.
or
run nls.lm on all combinations and select the best returned fit.
I wanted to do this without looping and, after inspiration from here, am trying to use nested dataframes, with one column for the parameter input list, one for the values returned by my function, one for the R^2 values, and one for the best-fit models, something like:
df
# start_val fun_out R^2
# 1 {a=2,b=2} {22,24,26...} 0.8
# 2 {a=3,b=5} {35,38,41...} 0.6
This is the code I have so far:
require(dplyr); require(tidyr)
foo <- function(x,a,b) a*x^2+b # function I am fitting
x <- 1:10 # independent variable
y_obs <- foo(x,1.5,2.5) + rnorm(length(x),0,10) # observed data (dependent variable)
start_range <- data.frame(a=c(1,2),b=c(2,3)) # range of allowed starting points for fitting
reps <- 2 # number of starting points to generate
# Create a data frame of starting points
df <- as.data.frame(sapply(start_range, function(x) runif(reps, min = x[[1]], max = x[[2]]))) %>%
  mutate(id = seq_len(reps)) %>% # fudge to make nest behave as I want
  nest(1:ncol(start_range)) %>%
  mutate(data = as.list(data)) %>%
  as.data.frame()
df
# id data
# 1 1 1.316356, 2.662923
# 2 2 1.059356, 2.723081
I get stuck now trying to pass the parameters in data into the function foo(). I've tried using do.call(), and even with constant parameters the following error appears:
mutate(df,y=do.call(foo,list(x,1,2)))
# Error: wrong result size (5), expected 2 or 1
Is there a way to create columns of a dataframe which contain lists directly without using nest()?
Also, when trying to create the list to pass to do.call() using the dataframe columns, how do you create a list where the first element is the vector x, the second is the parameter a and the third is the parameter b? The following splits the list down the column:
mutate(df,my_list=list(x,data))
#   id               data                                 my_list
# 1  1 1.316356, 2.662923           1, 2, 3, 4, 5, 6, 7, 8, 9, 10
# 2  2 1.059356, 2.723081 1.316356, 2.662923, 1.059356, 2.723081
Running nls2 with algorithm = "random-search", all = TRUE and the specified maxiter will evaluate foo at maxiter random points and return starting_fits, the fits at those points. It consists of a set of "nls" class objects evaluated at each of the randomly chosen starting values. It does not run an optimization from each of these starting values but just returns the "nls" object at each; that is, nls is not run. Now, for each starting fit, run nlsLM, giving fits, a list of nlsLM fits; from that, summarize them in data (a data frame with one row per run) and show the least.
If we only want to pick the best starting value and run nlsLM once from that, use the alternate code near the end.
library(nls2)
library(minpack.lm) # for nlsLM
fo <- y_obs ~ foo(x, a, b)
starting_fits <- nls2(fo, algorithm = "random-search",
                      start = start_range, control = nls.control(maxiter = reps), all = TRUE)
fits <- lapply(starting_fits, function(fit) nlsLM(fo, start = coef(fit)))
data <- data.frame(RSS = sapply(fits, deviance), t(sapply(fits, coef)),
                   start = t(sapply(starting_fits, coef)))
# data$fits <- fits # optional: store each row's fitted object in that row
subset(data, RSS == min(RSS)) # minimum(s)
giving:
       RSS        a        b  start.a  start.b
2 706.3956 1.396616 7.226525 1.681819 2.768374
R squared is used for linear regression. It is not valid for nonlinear regression. Residual sum of squares (RSS) is shown above instead.
Alternately, if you just want to pick out the best starting value and run nlsLM on that, then omit the all = TRUE argument from the nls2 call and do this. If you need the coefficients and RSS for later code, try coef(fit) and deviance(fit).
starting_fit <- nls2(fo, algorithm = "random-search",
                     start = start_range, control = nls.control(maxiter = reps))
fit <- nlsLM(fo, start = coef(starting_fit))
Note 1: If you are getting errors from nlsLM try replacing nlsLM(...) with try(nlsLM(...)). This will issue error messages (use try(..., silent = TRUE) if you don't want them) but will not stop processing.
Note 2: I assume that the foo shown in the question is just an example and the real function is more complex. The foo shown is linear in the coefficients so one could use lm for it. No nonlinear optimization is needed.
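For reference, a sketch of the lm route that Note 2 alludes to; for this particular foo, y = a*x^2 + b is linear in the coefficients a and b, so ordinary least squares fits it exactly:
fit_lm <- lm(y_obs ~ I(x^2))
coef(fit_lm)     # "(Intercept)" estimates b, "I(x^2)" estimates a
deviance(fit_lm) # RSS, directly comparable to the nls2/nlsLM results above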
An approach like this perhaps?
library(dplyr)
library(purrr)
foo2 <- function(x, data) data$a * x^2 + data$b
# R^2: 1 - SS_res/SS_tot, with e = model output and o = observed values
r2 <- function(e, o) 1 - sum((e - o)^2) / sum((o - mean(o))^2)
df <- as.data.frame(sapply(start_range, function(x) runif(reps, min = x[[1]], max = x[[2]]))) %>%
  mutate(id = seq_len(reps)) %>% # fudge to make nest behave as I want
  nest(1:ncol(start_range))
df %>%
  mutate(fun_out = map(data, foo2, x = x),
         R2 = map(fun_out, r2, o = y_obs))
Result:
# A tibble: 3 x 4
     id             data    fun_out        R2
  <int>           <list>     <list>    <list>
1     1 <tibble [1 x 2]> <dbl [10]> <dbl [1]>
2     2 <tibble [1 x 2]> <dbl [10]> <dbl [1]>
3     3 <tibble [1 x 2]> <dbl [10]> <dbl [1]>
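To rank the starting points and keep the best one, R2 can be pulled out as a plain numeric column; a sketch (map_dbl in place of map is my change, and it relies on the r2 helper defined above):
df %>%
  mutate(fun_out = map(data, foo2, x = x),
         R2 = map_dbl(fun_out, r2, o = y_obs)) %>%
  arrange(desc(R2)) %>%
  slice(1) # row with the best starting values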

R: Testing each level of a factor without creating new variables

Suppose I have a data frame with a binary grouping variable and a factor. An example of such a grouping variable could specify assignment to the treatment and control conditions of an experiment. In the below, b is the grouping variable while a is an arbitrary factor variable:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
I want to complete two-sample t-tests to assess the below:
For each level of a, whether there is a difference in the mean propensity to adopt that level between the groups specified in b.
I have used the dummies package to create separate dummies for each level of the factor and then manually performed t-tests on the resulting variables:
library(dummies)
new <- dummy.data.frame(df, names = "a")
t.test(new$aa, new$b)
t.test(new$ab, new$b)
I am looking for help with the following:
Is there a way to perform this without creating a large number of dummy variables via dummy.data.frame()?
If there is not a quicker way to do it without creating a large number of dummies, is there a quicker way to complete the t-test across multiple columns?
Note
This is similar to, but different from, R - How to perform the same operation on multiple variables, and nearly the same as Apply t-test on many columns in a dataframe split by factor, but the solution to that question no longer works.
Here is a base R solution implementing a chi-squared test for equality of proportions, which I believe is more likely to answer whatever question you're asking of your data (see my comment above):
set.seed(1)
## generate similar but larger/more complex toy dataset
a <- sample(letters[1:4], 100, replace = T)
b <- sample(0:1, 100, replace = T)
head((df <- data.frame(a, b)))
  a b
1 b 1
2 b 0
3 c 0
4 d 1
5 a 1
6 d 0
## create a set of contingency tables for proportions
## of each level of df$a to the others
cTbls <- lapply(unique(a), function(x) table(df$a==x, df$b))
## apply chi-squared test to each contingency table
results <- lapply(cTbls, prop.test, correct = FALSE)
## preserve names
names(results) <- unique(a)
## only one result displayed for sake of space:
results$b
	2-sample test for equality of proportions without continuity correction

data:  X[[i]]
X-squared = 0.18382, df = 1, p-value = 0.6681
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.2557295  0.1638177
sample estimates:
   prop 1    prop 2
0.4852941 0.5312500
Be aware, however, that you might not want to interpret your p-values without correcting for multiple comparisons. A quick simulation demonstrates that the chance of incorrectly rejecting the null hypothesis with at least one of your tests can be dramatically higher than 5%(!):
set.seed(11)
sum(
  replicate(1e4, {
    a <- sample(letters[1:4], 100, replace = T)
    b <- sample(0:1, 100, replace = T)
    df <- data.frame(a, b)
    cTbls <- lapply(unique(a), function(x) table(df$a == x, df$b))
    results <- lapply(cTbls, prop.test, correct = FALSE)
    any(sapply(results, function(x) x$p.value < .05)) # sapply: any() needs a logical vector, not a list
  })
) / 1e4
[1] 0.1642
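One standard remedy is to adjust the p-values after the fact with p.adjust(); a minimal sketch applied to the results list from above:
pvals <- sapply(results, function(x) x$p.value)
p.adjust(pvals, method = "holm") # or "bonferroni", "BH", ...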
I don't exactly understand what this is doing from a statistical standpoint, but this code generates a list where each element is the output of the t.test() you ran above:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
library(dplyr)
library(tidyr)
dfNew <- df %>% group_by(a) %>% summarise(count = n()) %>% spread(a, count)
lapply(1:ncol(dfNew), function(x)
  # dfNew[[1, x]] extracts a scalar so rep() gets a valid 'times' argument
  t.test(c(rep(1, dfNew[[1, x]]), rep(0, length(b) - dfNew[[1, x]])), b))
This will save you from typing t.test(foo, bar) over and over, and also eliminates the need for dummy variables.
Edit: I don't think the above method preserves the order of the columns, only the frequency of values measured as 0 or 1. If the order is important (again, I don't know the goal of this procedure), then you can use the dummy method and lapply through the data.frame you named new.
library(dummies)
new <- dummy.data.frame(df, names = "a")
lapply(1:(ncol(new) - 1), function(x)
  t.test(new[, x], new[, ncol(new)]))
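If the comparison you actually want is each level's adoption rate between the two b groups, here is a sketch without any dummy columns at all, using t.test()'s formula interface (whether a t-test is the right tool here is the same caveat the first answer raises):
results_t <- lapply(unique(df$a), function(lv)
  t.test(as.numeric(a == lv) ~ b, data = df)) # indicator for level lv, split by b
names(results_t) <- unique(df$a)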
