How can I append spearman Rho stat to new object? - r

I am carrying out a number of Spearman's rank correlations and I want to produce a list of all the rho estimates automatically.
Here is some sample data:
A <- data.frame('Area' = c(4, 6, 5),
'flow' = c(1, 1, 1))
B <- data.frame('Area' = c(6, 8, 4),
'flow' = c(1, 2, 1))
files <- list(A, B)
frames <- list('A', 'B')
I currently have the following code that carries out a correlation for each data frame in the list:
lapply(files, function (x)
cor.test(~flow + Area, data = x, method = 'spearman'))
However, what I would like to do is add another line to this to extract the rho estimate for each correlation and append it to a new list.
How can I do this?
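One possible approach, sketched under the assumption that only the point estimate is needed: cor.test() returns an htest object whose estimate element holds rho, so it can be pulled out inside the same anonymous function. The name rhos is made up for illustration. suppressWarnings() is there only because the tiny sample data has ties and, in A, a constant flow column, for which rho is undefined and comes back as NA.

```r
A <- data.frame(Area = c(4, 6, 5), flow = c(1, 1, 1))
B <- data.frame(Area = c(6, 8, 4), flow = c(1, 2, 1))
files  <- list(A, B)
frames <- list("A", "B")

# Run the test for each data frame and keep only the rho estimate
rhos <- lapply(files, function(x) {
  res <- suppressWarnings(
    cor.test(~ flow + Area, data = x, method = "spearman")
  )
  res$estimate[["rho"]]
})
names(rhos) <- unlist(frames)  # label each estimate with its data frame name

rhos
```

Using sapply() in place of lapply() would return a numeric vector rather than a list, which may be more convenient for further calculations.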

Related

Multiply probability distributions in R [closed]

I'm trying to multiply some probability density functions in order to update the probability given certain factors. I've tried several things using the pdqr and bayesmeta packages, but none of them works the way I intend. What am I missing?
Here is a reproducible example with two different distributions, a and b, which I want to multiply. The reason is that b has no measurements at low values, so its probability there is 0. This should be reflected in the updated distribution.
library(tidyverse)
library(pdqr)
library(bayesmeta)
#measurements
a <- c(1, 2, 2, 4, 5, 5, 6, 6, 7, 7, 7, 8, 7, 8, 2, 6, 9, 10)
b <- c(5, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9, 7)
#create probability distribution functions
distr_a <- new_d(a, type = "continuous")
distr_b <- new_d(b, type = "continuous")
#try to combine distributions
summarized <- distr_a + distr_b
multiplied <- distr_a * distr_b
mixture <- form_mix(list(distr_a, distr_b))
convolution <- convolve(distr_a, distr_b)
The resulting PDF's are plotted like this:
bayesmeta::convolve() does the same as summing two pdqr PDFs; both seem to shift the distribution oddly to the right and make it lower than it is supposed to be.
Multiplying the pdqr PDFs directly leaves a very low probability overall.
Using pdqr::form_mix() seems to even out the PDFs in between, but leaves probabilities above 0 for the lower x-values.
So, to gain some insight into what I wanted to do, I used the PDFs for a and b to generate probabilities for each x value and multiplied those:
#multiply distributions manually
x <- c(1:10)
manual <- data.frame(x) %>%
mutate(a = distr_a(x),
b = distr_b(x),
multiplied = a*b)
This does give the shape I am after, but (logically) the probabilities are too low:
I would like to multiply (multiple) PDFs. What am I doing wrong? Are my statistics wrong, or am I missing a useful function?
UPDATE:
It seems I am a stats noob on this subject, but I would like to achieve something like the distribution below. Given that both situation a and situation b are true, I would expect the distribution to be something like the dotted line. Is that possible?
multiplied is the correct one. One can check with log-normal distributions: the product of two independent log-normal random variables is log-normal with µ = µ_a + µ_b and sigma² = sigma²_a + sigma²_b.
a <- rlnorm(25000, meanlog = 0, sdlog = 1)
b <- rlnorm(25000, meanlog = 1, sdlog = 1)
distr_a <- new_d(a, type = "continuous")
distr_b <- new_d(b, type = "continuous")
distr_ab <- form_trans(
list(distr_a, distr_b), trans = function(x, y) x*y
)
# or: distr_ab <- distr_a * distr_b
plot(distr_ab, xlim = c(0, 40))
curve(dlnorm(x, meanlog = 1, sdlog = sqrt(2)), add = TRUE, col = "red")
As demonstrated here:
https://www.r-bloggers.com/2019/05/bayesian-models-in-r-2/
# Example distributions
probs <- seq(0,1,length.out= 100)
prior <- dbinom(x = 8, prob = probs, size = 10)
lik <- dnorm(x = probs, mean = .5, sd = .1)
# Multiply distributions
unstdPost <- lik * prior
# If you wanted to get an actual posterior, it must be a probability
# distribution (integrate to 1), so we can divide by the sum:
stdPost <- unstdPost / sum(unstdPost)
# Plot
plot(probs, prior, col = "black",
  type = "l", xlab = "P(Black)", ylab = "Density")
lines(probs, lik / 15, col = "red") # rescaled to fit on the same axes
lines(probs, unstdPost, col = "green")
lines(probs, stdPost, col = "blue")
# Legend order and colours match the curves drawn above
legend("topleft", legend = c("Prior", "Lik", "Unstd Post", "Post"),
  text.col = c("black", "red", "green", "blue"), bty = "n")
Created on 2022-08-06 by the reprex package (v2.0.1)
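Applied to the a and b measurements from the question, the multiply-then-renormalise idea might look like the sketch below. It is an assumption-laden illustration rather than the pdqr way of doing it: it uses base R density() instead of new_d(), and the grid limits lo and hi as well as the names unnormalised and combined are made up for the example.

```r
a <- c(1, 2, 2, 4, 5, 5, 6, 6, 7, 7, 7, 8, 7, 8, 2, 6, 9, 10)
b <- c(5, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9, 7)

# Kernel density estimates on one shared grid (base R, not pdqr)
lo <- -5
hi <- 15  # hypothetical range, wide enough to cover both samples
dens_a <- density(a, from = lo, to = hi, n = 512)
dens_b <- density(b, from = lo, to = hi, n = 512)
grid <- dens_a$x
step <- grid[2] - grid[1]

# Pointwise product of the two densities, then rescale so that the
# resulting curve integrates to 1 again (simple Riemann sum)
unnormalised <- dens_a$y * dens_b$y
combined <- unnormalised / (sum(unnormalised) * step)

sum(combined) * step  # area under the combined curve
```

Calling plot(grid, combined, type = "l") draws the combined curve; it stays close to zero at low x-values, where b has no measurements.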

Avoid writing large number of column names in a model formula with bs() terms

I want to use bs function for numerical variables in my dataset when fitting a logistic regression model.
df <- data.frame(a = c(0,1), b = c(0,1), d = c(0,1), e = c(0,1),
f= c("m","f"), output = c(0,1))
library(splines)
model <- glm(output~ bs(a, df=2)+ bs(b, df=2)+ bs(d, df=2)+ bs(e, df=2)+
factor(f) ,
data = df,
family = "binomial")
In my actual dataset, I need to apply bs() to way more columns than this example. Is there a way I can do this without writing all the terms?
We can use some string manipulation with sprintf, together with reformulate:
predictors <- c("a", "b", "d", "e")
bspl.terms <- sprintf("bs(%s, df = 2)", predictors)
other.terms <- "factor(f)"
form <- reformulate(c(bspl.terms, other.terms), response = "output")
#output ~ bs(a, df = 2) + bs(b, df = 2) + bs(d, df = 2) + bs(e,
# df = 2) + factor(f)
If you want to use a different df and degree for each spline, it is also straightforward (note that df can not be smaller than degree).
predictors <- c("a", "b", "d", "e")
dof <- c(3, 4, 3, 6)
degree <- c(2, 2, 2, 3)
bspl.terms <- sprintf("bs(%s, df = %d, degree = %d)", predictors, dof, degree)
other.terms <- "factor(f)"
form <- reformulate(c(bspl.terms, other.terms), response = "output")
#output ~ bs(a, df = 3, degree = 2) + bs(b, df = 4, degree = 2) +
# bs(d, df = 3, degree = 2) + bs(e, df = 6, degree = 3) + factor(f)
Prof. Ben Bolker: I was going to suggest something a little bit fancier, something like predictors <- setdiff(names(df)[sapply(df, is.numeric)], "output").
Yes, that is safer. It is also an automatic way to include all numeric variables other than "output" as predictors, if that is what the OP wants.
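Putting the comment and the answer together, a minimal sketch could look like this. It assumes every numeric column other than output should get a spline term and that f is the only other predictor; is.num is a helper name introduced here.

```r
df <- data.frame(a = c(0, 1), b = c(0, 1), d = c(0, 1), e = c(0, 1),
                 f = c("m", "f"), output = c(0, 1))

# All numeric columns except the response become spline terms
is.num <- vapply(df, is.numeric, logical(1))
predictors <- setdiff(names(df)[is.num], "output")

# Build the bs() terms as strings and assemble the formula
bspl.terms <- sprintf("bs(%s, df = 2)", predictors)
form <- reformulate(c(bspl.terms, "factor(f)"), response = "output")
form
```

The resulting form can then be passed to glm() in place of the hand-written formula, with library(splines) loaded so that bs() is found.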

create multiple train, val, test splits using sample

I want to create train, val, test splits (60:20:20), and I repeat the process multiple times.
The test set should contain exactly 2 observations each time, so why does it sometimes contain 1 or 3 observations?
What is the role of replace in sample()? Should I set it to FALSE?
library(dplyr)
tbl <- tibble(id = 1:10)
train = list()
val = list()
test = list()
for (run in 1:5)
{
assignment <- sample(1:3, size = nrow(tbl), prob = c(0.6, 0.2, 0.2), replace = TRUE)
# Create a train, validation and test sets
train[[run]] <- tbl[assignment == 1, ]
val[[run]] <- tbl[assignment == 2, ]
test[[run]] <- tbl[assignment == 3, ]
}
If we want exactly 6, 2 and 2 values for 1, 2 and 3, just replicate 1, 2, 3 that many times and shuffle the result with sample():
v1 <- sample(rep(1:3, c(6, 2, 2)))
Then do a split
split(tbl, v1)
When we use prob, the group sizes can vary from run to run, because each row is assigned independently with those probabilities and the counts are not fixed. Regarding replace = TRUE: it is needed in the OP's code because 1:3 has only 3 elements whereas size = nrow(tbl) is 10, so sampling without replacement cannot fill the 7 extra elements.
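To repeat the exact 6/2/2 split over several runs, as in the question's loop, something along these lines should work. It is a sketch that uses a plain data.frame instead of a tibble to stay dependency-free; the seed and the names sizes and splits are illustrative.

```r
set.seed(42)  # hypothetical seed, only for reproducibility
tbl <- data.frame(id = 1:10)
sizes <- c(train = 6, val = 2, test = 2)

# One shuffled label vector per run, with fixed counts per group
splits <- lapply(1:5, function(run) {
  assignment <- sample(rep(names(sizes), times = sizes))
  split(tbl, factor(assignment, levels = names(sizes)))
})

# Number of rows per set in each run
sapply(splits, function(s) sapply(s, nrow))
```

Each element of splits is a list with train, val and test data frames, so splits[[3]]$test is the test set of the third run.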

standard deviation (and percentiles) of multiple arrays of the same dimension in R

I have several different arrays of the same dimension. Is there a way to find the standard deviation, mean, and some percentiles of all the arrays? My final result should be one array with the same dimension as each of the individual arrays.
I tried the following, but it clearly doesn't work:
m1 <- array(runif(8), dim = c(2, 2, 2))
m2 <- array(runif(8), dim = c(2, 2, 2))
m3 <- array(runif(8), dim = c(2, 2, 2))
sd(m1, m2, m3)
Consider stacking the arrays into a single array with one extra dimension, then using apply over the first three dimensions to get the sd:
out <- apply(array(c(m1, m2, m3), dim = c(2, 2, 2, 3)), c(1, 2, 3), sd)
Checking the output:
> sd(c(m1[1], m2[1], m3[1]))
[1] 0.1623589
> out[1]
[1] 0.1623589
Use the same method for mean
out2 <- apply(array(c(m1, m2, m3), dim = c(2, 2, 2, 3)), c(1, 2, 3), mean)
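The percentiles asked about in the question can be obtained the same way with quantile(). The sketch below assumes the 90th percentile is of interest; the probabilities, the seed and the names stacked, p90 and qs are illustrative.

```r
set.seed(1)  # hypothetical seed, only for reproducibility
m1 <- array(runif(8), dim = c(2, 2, 2))
m2 <- array(runif(8), dim = c(2, 2, 2))
m3 <- array(runif(8), dim = c(2, 2, 2))

# Stack the three arrays along a fourth dimension
stacked <- array(c(m1, m2, m3), dim = c(2, 2, 2, 3))

# A single percentile keeps the original 2 x 2 x 2 shape
p90 <- apply(stacked, c(1, 2, 3), quantile, probs = 0.9)

# Several percentiles at once add a leading dimension, one slice per probability
qs <- apply(stacked, c(1, 2, 3), quantile, probs = c(0.1, 0.5, 0.9))

dim(p90)
dim(qs)
```

With several probabilities, qs[1, , , ] is the array of 10th percentiles, qs[2, , , ] the medians, and so on.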

Creating names for a matrix of data in R

I have a simple 12 x 2 matrix called m that contains my dataset (see below).
Question
Why do I get an error when I use dimnames(m) to create two names for the two columns of my data? Is there a better way to create column names for this data in R?
Here is my R code:
Group1 = rnorm(6, 7) ; Group2 = rnorm(6, 9)
Level = gl(n = 2, k = 6)
m = matrix(c(Group1 , Group2, Level), nrow = 12, ncol = 2)
dimnames(m) <- list( DV = Group1, Level = Level)
Replace the dimnames(m) line with:
colnames(m) <- c("DV","Level")
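As for why the original line fails: dimnames() expects a list of the form list(row names, column names), so list(DV = Group1, Level = Level) is read as 6 row names and 12 column names, which does not match a 12 x 2 matrix. A sketch of an equivalent dimnames() call, with a seed added only for reproducibility:

```r
set.seed(1)  # hypothetical seed, only for reproducibility
Group1 <- rnorm(6, 7)
Group2 <- rnorm(6, 9)
Level <- gl(n = 2, k = 6)
m <- matrix(c(Group1, Group2, Level), nrow = 12, ncol = 2)

# First list element: row names (NULL leaves the rows unnamed)
# Second list element: one name per column
dimnames(m) <- list(NULL, c("DV", "Level"))

head(m, 3)
```

This gives the same result as the colnames() assignment; colnames() is simply the shorter form when only the columns need names.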
