Correlation matrix with dplyr, tidyverse and broom - P-value matrix

Hello all. I want to obtain the p-values from a correlation matrix using the dplyr and/or broom packages, testing multiple variables at the same time. I'm aware of other methods, but dplyr seems easier and more intuitive to me. In addition, dplyr would need to correlate each pair of variables to obtain the specific p-values, which makes the process easier and faster.
I checked other links, but they did not work for this question (example 1, example 2, example 3).
When I use this code, the correlation coefficients are reported, but the p-values are not.
agreg_base_tipo_a %>%
dplyr::select(S2.RT, BIS_total, IDATE, BAI, ASRS_total) %>%
do(as.data.frame(cor(., method="spearman", use="pairwise.complete.obs")))
Please check out this reproducible example:
set.seed(1164)
library(tidyverse)
ds <- data.frame(id=(1) ,a=rnorm(10,2,1), b=rnorm(10,3,2), c=rnorm(5,1,05))
ds %>%
select(a,b,c) %>%
do(as.data.frame(cor(., method="spearman", use="pairwise.complete.obs")))

This answer is based on akrun's comment from this post. Using the rcorr() function from the Hmisc package, we can calculate the correlation coefficients and p-values together. To access these components, use ds_cor$r and ds_cor$P.
set.seed(1164)
library(tidyverse)
library(Hmisc)
ds <- data.frame(id=(1) ,a=rnorm(10,2,1), b=rnorm(10,3,2), c=rnorm(5,1,05))
ds_cor <- ds %>%
select(-id) %>%
as.matrix() %>%
rcorr(type = "spearman")
ds_cor
# a b c
# a 1.00 0.28 -0.42
# b 0.28 1.00 -0.25
# c -0.42 -0.25 1.00
#
# n= 10
#
#
# P
# a b c
# a 0.4250 0.2287
# b 0.4250 0.4929
# c 0.2287 0.4929
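If you would rather have a tidy data frame than the two matrices, broom also ships a tidier for rcorr objects, which keeps the whole thing in the tidyverse idiom. A minimal sketch, assuming a broom version that includes the rcorr method:
library(broom)
tidy(ds_cor)
# one row per variable pair, with columns along the lines of
# column1, column2, estimate, n and p.value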

Related

Convert multiple moran.test outputs into structured, storable, copy-pastable strings

I wish to collapse the output of spdep::moran.test into a single string that is regularly structured with variable names and values and that can both be saved as a text value into a dataframe, and be human readable in the RStudio console and copy-pastable into MS Word to form a table without too many additional manual adjustments. (I have multiple tests to run and wish to copy-paste their output in one go.)
In the course of looking for a solution, I stumbled upon the report package which claims to turn an htest class object into a "report" (I don't know what this looks like in R) and thus may address my goal to some extent. However, the report function doesn't work on moran.test, as presented in the code below.
I am exploring and there are probably alternative and more straightforward approaches which I haven't considered. Thus my question is twofold: 1. Solve the immediate issue with report and/or 2. Provide an alternative and more efficient solution to my goal.
The data preparation below is drawn from https://mgimond.github.io/simple_moransI_example.
library(sf)
library(spdep)
library(report)
# Load shapefile
s <- readRDS(url("https://github.com/mgimond/Data/raw/gh-pages/Exercises/nhme.rds"))
# Prevent error "old-style crs object detected; please recreate object with a recent sf::st_crs()"
st_crs(s) <- st_crs(s)
# Define neighboring polygons
nb <- poly2nb(s, queen=TRUE)
# Assign weights to the neighbors
lw <- nb2listw(nb, style="W", zero.policy=TRUE)
# Run Moran’s I test
(mt <- moran.test(s$Income,lw, alternative="greater"))
#Moran I test under randomisation
#data: s$Income
#weights: lw
#Moran I statistic standard deviate = 5.8525, p-value = 2.421e-09
#alternative hypothesis: greater
#sample estimates:
# Moran I statistic Expectation Variance
#0.68279551 -0.04000000 0.01525284
# Moran’s I test is of class htest required by function report::report
class(mt)
#[1] "htest"
# Function report::report returns an error
report(mt)
#Error in `$<-.data.frame`(`*tmp*`, "tau", value = c(`Moran I statistic` = 0.68279551202875, :
# replacement has 3 rows, data has 1
The desired output could look something like:
"P-value 2.421e-09 | Statistic 0.68279551 | Expectation -0.04000000 | Variance 0.01525284"
The point is the names and values, not the separators. This is based on my current assumptions of how to approach this task, which are probably imperfect.
You might want to take a look at the broom package:
broom::tidy(mt)
#> # A tibble: 1 x 7
#> estimate1 estimate2 estimate3 statistic p.value method alternative
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 0.683 -0.04 0.0153 5.85 2.42e-9 Moran I test u… greater
library(tidyverse)
mt %>%
broom::tidy() %>%
as.list() %>%
enframe() %>%
mutate(value = value %>% as.character()) %>% unite(data, sep = "=") %>%
pull(data) %>%
paste0(collapse = ", ")
#> [1] "estimate1=0.68279551202875, estimate2=-0.04, estimate3=0.0152528397222445, statistic=c(`Moran I statistic standard deviate` = 5.85248209823413), p.value=2.42145194022024e-09, method=Moran I test under randomisation, alternative=greater"
You can also make a table and create a CSV file from multiple tests (e.g. if you have multiple objects of class htest, such as mt, mt2 and mt3):
list(mt, mt2, mt3) %>% map(broom::tidy) %>% bind_rows() %>% write_csv("tests.csv")
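If you need the exact "name value | name value" layout from the question rather than a CSV, a small base-R helper can be assembled from the htest components directly. This is only a sketch; the element names ("Moran I statistic", "Expectation", "Variance") follow the moran.test output shown above:
fmt_htest <- function(ht) {
  # build "Name value" fragments from the htest components and join with " | "
  paste(
    paste("P-value", format.pval(ht$p.value, digits = 4)),
    paste("Statistic", format(ht$estimate[["Moran I statistic"]])),
    paste("Expectation", format(ht$estimate[["Expectation"]])),
    paste("Variance", format(ht$estimate[["Variance"]])),
    sep = " | "
  )
}
fmt_htest(mt)
#> e.g. "P-value 2.421e-09 | Statistic 0.6827955 | Expectation -0.04 | Variance 0.01525284"
sapply() over a list of tests then gives a character vector you can store as a column in a data frame.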

Batch distribution fitting using Tidyverse and fitdistrplus

I have a dataset that is as follows (10,000+ Rows):
P_ID     SNUM   RNUM   X
ID_233   10     2      40.31
ID_233   10     3      23.21
ID_234   12     5      11.00
ID_234   12     6      0.31
ID_234   13     1      0.00
ID_235   10     2      66.23
From this dataset, I want to fit a Gamma distribution to each distinct P_ID (ignoring any testing of how well the sampled data fit the distribution).
Using the fitdistrplus package, I can achieve this by extracting the X values for an individual P_ID into a vector, running it through fw <- fitdist(data, "gamma"), and then extracting the shape and rate parameters, but this is all very manual.
I would like to find a method using tidyverse to go from the data frame above to:
P_ID     Distrib   G_Shape      G_Rate
ID_233   Gamma     1.21557116   0.09206639
ID_234   Gamma     3.23234542   0.34566432
ID_235   Gamma     2.34555553   0.92344521
How would I achieve this with tidyverse and pipes, rather than a succession of for loops?
You could apply fitdist to every individual using group_by and extract the shape and rate values out of each model.
library(dplyr)
library(purrr)
library(fitdistrplus)
data %>%
  group_by(P_ID) %>%
  # one fitted model per P_ID, stored in a list-column
  summarise(model = list(fitdist(X, "gamma"))) %>%
  # pull the estimated parameters out of each model object
  mutate(G_Shape = map_dbl(model, pluck, 'estimate', 'shape'),
         G_Rate  = map_dbl(model, pluck, 'estimate', 'rate')) -> result
result
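To match the requested output exactly (including the Distrib column) and to guard against groups where the fit fails, the pipeline can be extended slightly. A sketch using purrr::possibly, with the data and column names from the question:
library(dplyr)
library(purrr)
library(fitdistrplus)
# returns NULL instead of throwing an error when a group cannot be fitted
safe_fit <- possibly(function(v) fitdist(v, "gamma"), otherwise = NULL)
result <- data %>%
  group_by(P_ID) %>%
  summarise(model = list(safe_fit(X)), .groups = "drop") %>%
  mutate(Distrib = "Gamma",
         G_Shape = map_dbl(model, ~ if (is.null(.x)) NA_real_ else .x$estimate[["shape"]]),
         G_Rate  = map_dbl(model, ~ if (is.null(.x)) NA_real_ else .x$estimate[["rate"]])) %>%
  select(P_ID, Distrib, G_Shape, G_Rate)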

Using dplyr to run rma() on multiple subsets

I want to run a subgroup meta-analysis within metafor package. The simplest way to do it is:
model.s.1 <- rma(yi=ES, vi=Va, data=dataset, method="DL", subset=S=="S_Level1")
model.s.2 <- rma(yi=ES, vi=Va, data=dataset, method="DL", subset=S=="S_Level2")
...
model.s.n <- rma(yi=ES, vi=Va, data=dataset, method="DL", subset=S=="S_Leveln")
However, it's very tedious to do this by hand if the subgrouping factor has many levels. I tried to use dplyr to solve this and simply extract the coefficients for all subgroups:
Dataset %>%
mutate(S=as.factor(S)) %>%
group_by(S) %>%
summarize(Coeff=coef.rma(rma(yi=ES, vi=Va, method="DL", data=.)))
But the result looked like this:
S Coeff
<fct> <dbl>
1 hmdb 0.114
2 HMDB0000123 0.114
3 HMDB0000148 0.114
4 HMDB0000158 0.114
5 HMDB0000159 0.114
6 HMDB0000161 0.114
7 HMDB0000162 0.114
8 HMDB0000167 0.114
9 HMDB0000168 0.114
10 HMDB0000172 0.114
# ... with 14 more rows
It seems that the rma function ignores the group_by and calculates the pooled effect for the whole dataset each time. What might be the cause? Are there any alternatives to this approach?
We may do a group_split and then loop through the list elements with map. (The reason the group_by attempt fails is that the . placeholder inside summarize still refers to the entire data frame piped into it, not the current group, so rma is fitted on the full dataset every time.)
library(tidyverse)
Dataset %>%
group_split(S= factor(S)) %>%
map_dfr(~ .x %>%
summarise(S = first(S), Coeff=coef.rma(rma(yi=ES,
vi=Va, method="DL", data=.))))
Dear @akrun, I have one more question on a similar piece of code (the previous one was in the wrong window, sorry for that).
Let's assume that for every subset of studies I'd like to add a fixed-effect meta-regression with a binary factor (0/1), which we'll call F.
library(tidyverse)
Dataset %>%
group_split(S=factor(S)) %>%
map_dfr(~ .x %>%
summarise(S=first(S), Coeff=coef.rma(rma(yi=ES,vi=Va, mods=~F, method="DL",
data=.))))
If a certain subset of S contains only zeros or only ones, the rma function will throw an error. How can I drop such cases from the list and replace them with NA?
Thank you,
Jakub
library(metafor)
library(tidyverse)
Results <- Org %>% # Primary analysis - DerSimonian-Laird estimator
group_split(Metabolite= factor(Metabolite)) %>%
map_dfr(~ .x %>%
summarise(Metabolite = first(Metabolite),
Coeff = ifelse(nlevels(Biospecimen)>1,
ifelse((rma(yi=Est,sei=SE, method="DL", data=.))$k>=5,
coef.rma(rma(yi=Est,sei=SE, mods=~Biospecimen, method="DL", data=.)),NA),NA)))
It worked, but produced warnings from the rma function. However, the results seem to be correct. Thanks a lot @akrun.
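A cleaner guard than the nested ifelse calls (ifelse is vectorised and silently returns only the first element of coef.rma's result) is to test the moderator explicitly. This is only a sketch, using the column names from the follow-up question (S, ES, Va, F) and extracting just the moderator coefficient:
library(tidyverse)
library(metafor)
Dataset %>%
  group_split(S = factor(S)) %>%
  map_dfr(~ tibble(
    S = first(.x$S),
    # fit the meta-regression only when both levels of F are present
    Coeff = if (n_distinct(.x$F) > 1)
      coef.rma(rma(yi = ES, vi = Va, mods = ~F, method = "DL", data = .x))[2]
    else NA_real_
  ))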

R: Testing each level of a factor without creating new variables

Suppose I have a data frame with a binary grouping variable and a factor. An example of such a grouping variable could specify assignment to the treatment and control conditions of an experiment. In the below, b is the grouping variable while a is an arbitrary factor variable:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
I want to complete two-sample t-tests to assess the below:
For each level of a, whether there is a difference in the mean propensity to adopt that level between the groups specified in b.
I have used the dummies package to create separate dummies for each level of the factor and then manually performed t-tests on the resulting variables:
library(dummies)
new <- dummy.data.frame(df, names = "a")
t.test(new$aa, new$b)
t.test(new$ab, new$b)
I am looking for help with the following:
Is there a way to perform this without creating a large number of dummy variables via dummy.data.frame()?
If there is not a quicker way to do it without creating a large number of dummies, is there a quicker way to complete the t-test across multiple columns?
Note
This is similar to, but different from, "R - How to perform the same operation on multiple variables", and nearly the same as "Apply t-test on many columns in a dataframe split by factor", but the solution to that question no longer works.
Here is a base R solution implementing a chi-squared test for equality of proportions, which I believe is more likely to answer whatever question you're asking of your data (see my comment above):
set.seed(1)
## generate similar but larger/more complex toy dataset
a <- sample(letters[1:4], 100, replace = T)
b <- sample(0:1, 100, replace = T)
head((df <- data.frame(a,b)))
a b
1 b 1
2 b 0
3 c 0
4 d 1
5 a 1
6 d 0
## create a set of contingency tables for proportions
## of each level of df$a to the others
cTbls <- lapply(unique(a), function(x) table(df$a==x, df$b))
## apply chi-squared test to each contingency table
results <- lapply(cTbls, prop.test, correct = FALSE)
## preserve names
names(results) <- unique(a)
## only one result displayed for sake of space:
results$b
2-sample test for equality of proportions without continuity
correction
data: X[[i]]
X-squared = 0.18382, df = 1, p-value = 0.6681
alternative hypothesis: two.sided
95 percent confidence interval:
-0.2557295 0.1638177
sample estimates:
prop 1 prop 2
0.4852941 0.5312500
Be aware, however, that you might not want to interpret your p-values without correcting for multiple comparisons. A quick simulation demonstrates that the chance of incorrectly rejecting the null hypothesis with at least one of your tests can be dramatically higher than 5%(!):
set.seed(11)
sum(
replicate(1e4, {
a <- sample(letters[1:4], 100, replace = T)
b <- sample(0:1, 100, replace = T)
df <- data.frame(a,b)
cTbls <- lapply(unique(a), function(x) table(df$a==x, df$b))
results <- lapply(cTbls, prop.test, correct = FALSE)
any(sapply(results, function(x) x$p.value < .05))
})
) / 1e4
[1] 0.1642
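One standard remedy is to adjust the p-values for multiplicity after the fact, for example with Holm's method via p.adjust (a sketch continuing from the results list created above):
pvals <- sapply(results, function(x) x$p.value)
p.adjust(pvals, method = "holm")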
I don't exactly understand what this is doing from a statistical standpoint, but this code generates a list where each element is the output of the t.test() calls you ran above:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
library(dplyr)
library(tidyr)
dfNew <- df %>% group_by(a) %>% summarise(count = n()) %>% spread(a, count)
lapply(1:ncol(dfNew), function(x)
  t.test(c(rep(1, dfNew[[x]][1]), rep(0, length(b) - dfNew[[x]][1])), b))
This saves you from typing t.test(foo, bar) over and over, and also eliminates the need for dummy variables.
Edit: I don't think the above method preserves the order of the columns, only the frequency of values measured as 0 or 1. If the order is important (again, I don't know the goal of this procedure), then you can use the dummy method and lapply through the data.frame you named new.
library(dummies)
new <- dummy.data.frame(df, names = "a")
lapply(1:(ncol(new)-1), function(x)
t.test(new[,x], new[,ncol(new)]))
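For completeness, the same per-level comparisons can be written without materialising any dummy columns at all: build the indicator on the fly and tidy each htest with broom. A sketch that mirrors the t.test(new$aa, new$b) calls above (the level column is added only for labelling):
library(dplyr)
library(purrr)
library(broom)
map_dfr(sort(unique(df$a)), function(lev)
  # compare the 0/1 indicator for this level against b, as in the question
  tidy(t.test(as.numeric(df$a == lev), df$b)) %>%
    mutate(level = lev, .before = 1))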

How do I extract ecdf values out of ecdfplot()

If I use the ecdfplot() function from the latticeExtra package, how do I get the actual values calculated, i.e. the y-values that correspond to the ~x|g input?
I've been looking at ?ecdfplot, but there's no description of this there. The usual high-level function ecdf() works with plot=FALSE, but this does not work for ecdfplot().
The reason I want to use ecdfplot() rather than ecdf() is that I need to calculate the ecdf() values for a grouping variable. I know I could do this by hand too, but I'm fairly sure there's a more direct route.
Here is a small example:
u <- rnorm(100,0,1)
mygroup <- c(rep("group1",50),rep("group2",50))
ecdfplot(~u, groups=mygroup)
I would like to extract the y-values given each group for the corresponding x-values.
If you stick with the ecdf() function from base R, you can simply do as follows:
Create ecdf function with your data:
fun.ecdf <- ecdf(x) # x is a vector of your data
Now use this "ecdf function" to generate the cumulative probabilities of any vector you feed it, including your original, sorted data:
my.ecdf <- fun.ecdf(sort(x))
I know you said you don't want to use ecdf, but in this case it is much easier to use it than to get the data out of the trellis object that ecdfplot returns. (After all, that's all ecdfplot is doing; it's just doing it behind the scenes.)
In the case of your example, the following will get you a matrix of the y values (where x is your entire input u, though you could choose a different one) for each ECDF:
ecdfs = lapply(split(u, mygroup), ecdf)
ys = sapply(ecdfs, function(e) e(u))
# output:
# group1 group2
# [1,] 0.52 0.72
# [2,] 0.68 0.78
# [3,] 0.62 0.78
# [4,] 0.66 0.78
# [5,] 0.72 0.80
# [6,] 0.86 0.94
# [7,] 0.10 0.26
# [8,] 0.90 0.94
# ...
ETA: If you just want each column to correspond to the 50 x-values in that column, you could do:
ys = sapply(split(u, mygroup), function(g) ecdf(g)(g))
(Note that if the number of values in each group aren't identical, this will end up as a list rather than a matrix with columns).
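If you prefer a tidyverse phrasing of the same idea, the per-group ECDF values can be computed in a grouped mutate, where each observation is evaluated against its own group's ECDF (a sketch equivalent to the ETA variant above, but returning a long data frame):
library(dplyr)
data.frame(u, mygroup) %>%
  group_by(mygroup) %>%
  # ecdf(u) builds the group's ECDF; ecdf(u)(u) evaluates it at the group's own values
  mutate(ecdf_y = ecdf(u)(u)) %>%
  ungroup()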
