I am carrying out a number of Spearman's rank correlations and I want to produce a list of all the rho estimates automatically.
Here is some sample data:
A <- data.frame('Area' = c(4, 6, 5),
'flow' = c(1, 1, 1))
B <- data.frame('Area' = c(6, 8, 4),
'flow' = c(1, 2, 1))
files <- list(A, B)
frames <- list('A', 'B')
I currently have the following code that carries out a correlation for each data frame in the list:
lapply(files, function (x)
cor.test(~flow + Area, data = x, method = 'spearman'))
However, I would like to add another line that extracts the rho estimate from each correlation and appends it to a new list.
How can I do this?
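One possible approach, sketched with the sample data above: `cor.test()` returns an object whose `estimate` element holds rho, so it can be pulled out inside the same anonymous function. The names `rhos` and `rho_list` are illustrative only. Because `flow` is constant in `A`, its rho is undefined and should come back as `NA`; `suppressWarnings()` is used here only to hide the tie and zero-variance warnings that this tiny sample triggers.

```r
A <- data.frame(Area = c(4, 6, 5), flow = c(1, 1, 1))
B <- data.frame(Area = c(6, 8, 4), flow = c(1, 2, 1))
files <- list(A, B)
frames <- list("A", "B")

# Run the test on each data frame and keep only the rho estimate
rhos <- vapply(files, function(x) {
  res <- suppressWarnings(
    cor.test(~ flow + Area, data = x, method = "spearman")
  )
  unname(res$estimate)  # drop the "rho" name so the result is a plain number
}, numeric(1))

# Label each estimate with the name of its data frame
names(rhos) <- unlist(frames)

# Convert to a list if a list is really needed
rho_list <- as.list(rhos)
```

`vapply()` is used instead of `lapply()` so the result is a plain numeric vector; wrapping it in `as.list()` gives the list form asked for.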
I'm trying to multiply some probability density functions so as to update the probability given certain factors. I've tried several things using the pdqr and bayesmeta packages, but none of them works the way I intend. What am I missing?
Here is a reproducible example with two different distributions, a and b, which I want to multiply. As you can see, b has no measurements at low values, so the probability there is 0. This should be reflected in the updated distribution.
library(tidyverse)
library(pdqr)
library(bayesmeta)
#measurements
a <- c(1, 2, 2, 4, 5, 5, 6, 6, 7, 7, 7, 8, 7, 8, 2, 6, 9, 10)
b <- c(5, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9, 7)
#create probability distribution functions
distr_a <- new_d(a, type = "continuous")
distr_b <- new_d(b, type = "continuous")
#try to combine distributions
summarized <- distr_a + distr_b
multiplied <- distr_a * distr_b
mixture <- form_mix(list(distr_a, distr_b))
convolution <- convolve(distr_a, distr_b)
The resulting PDFs are plotted like this:
bayesmeta::convolve() gives the same result as summing the two pdqr PDFs; both seem to shift the distribution oddly to the right and make it lower than I expected.
Simply multiplying the pdqr PDFs leaves a very low probability overall.
Using pdqr::form_mix() seems to even the PDFs out in between, but it leaves probabilities above 0 for the lower x-values.
So, to gain some insight into what I wanted to do, I used the PDFs for a and b to generate probabilities for each x value and multiplied those:
#multiply distributions manually
x <- c(1:10)
manual <- data.frame(x) %>%
mutate(a = distr_a(x),
b = distr_b(x),
multiplied = a*b)
This does give the shape I am after; however, it (logically) has probabilities that are too low:
I would like to multiply (multiple) PDFs. What am I doing wrong? Are my statistics wrong, or am I missing a useful function?
UPDATE:
It seems I am a stats noob on this subject, but I would like to achieve something like the distribution below. Given that both situation a and situation b are true, I would expect the distribution to be something like the dotted line. Is that possible?
multiplied is the correct one. One can check this with log-normal distributions: the product of two independent log-normal random variables is log-normal with µ = µ_a + µ_b and sigma² = sigma²_a + sigma²_b.
a <- rlnorm(25000, meanlog = 0, sdlog = 1)
b <- rlnorm(25000, meanlog = 1, sdlog = 1)
distr_a <- new_d(a, type = "continuous")
distr_b <- new_d(b, type = "continuous")
distr_ab <- form_trans(
list(distr_a, distr_b), trans = function(x, y) x*y
)
# or: distr_ab <- distr_a * distr_b
plot(distr_ab, xlim = c(0, 40))
curve(dlnorm(x, meanlog = 1, sdlog = sqrt(2)), add = TRUE, col = "red")
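The log-normal property used in this check can also be verified numerically in base R, without pdqr. This is only a sanity-check sketch: the seed and the names `x_a`, `x_b` and `log_prod` are arbitrary choices, not part of the answer above.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

x_a <- rlnorm(25000, meanlog = 0, sdlog = 1)
x_b <- rlnorm(25000, meanlog = 1, sdlog = 1)

# If x_a * x_b is log-normal, its logarithm is normal with
# mean 0 + 1 = 1 and variance 1 + 1 = 2
log_prod <- log(x_a * x_b)

mean(log_prod)  # should be close to 1
sd(log_prod)    # should be close to sqrt(2)
```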
As demonstrated here:
https://www.r-bloggers.com/2019/05/bayesian-models-in-r-2/
# Example distributions
probs <- seq(0, 1, length.out = 100)
lik <- dbinom(x = 8, prob = probs, size = 10)  # binomial likelihood over probs
prior <- dnorm(x = probs, mean = .5, sd = .1)  # normal prior on the probability
# Multiply distributions
unstdPost <- lik * prior
# If you wanted to get an actual posterior, it must be a probability
# distribution (sum to 1 over the grid), so we can divide by the sum:
stdPost <- unstdPost / sum(unstdPost)
# Plot
plot(probs, lik, col = "black",
type = "l", xlab = "P(Black)", ylab = "Density")
lines(probs, prior / 15, col = "red") # rescaled
lines(probs, unstdPost, col = "green")
lines(probs, stdPost, col = "blue")
legend("topleft", legend = c("Lik", "Prior", "Unstd Post", "Post"),
text.col = 1:4, bty = "n")
Created on 2022-08-06 by the reprex package (v2.0.1)
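The same multiply-then-normalize idea can be applied to the measurements a and b from the question. The following is only a sketch using base R kernel density estimates rather than pdqr; the grid limits (0 to 12), the number of grid points and the names `dens_a`, `dens_b` and `post` are assumptions made for illustration.

```r
a <- c(1, 2, 2, 4, 5, 5, 6, 6, 7, 7, 7, 8, 7, 8, 2, 6, 9, 10)
b <- c(5, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9, 7)

# Evaluate both kernel density estimates on one common grid
grid <- seq(0, 12, length.out = 512)
dens_a <- density(a, from = 0, to = 12, n = 512)$y
dens_b <- density(b, from = 0, to = 12, n = 512)$y

# Pointwise product of the two densities (not yet a density itself)
prod_ab <- dens_a * dens_b

# Rescale so that the area under the curve is 1
step <- grid[2] - grid[1]
post <- prod_ab / (sum(prod_ab) * step)

plot(grid, post, type = "l", lty = 2, xlab = "x", ylab = "Density")
lines(grid, dens_a, col = "red")
lines(grid, dens_b, col = "blue")
```

The product is close to zero wherever either input density is close to zero, and the rescaling step is what restores the height that plain multiplication loses.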
I want to use the bs function for the numerical variables in my dataset when fitting a logistic regression model.
df <- data.frame(a = c(0,1), b = c(0,1), d = c(0,1), e = c(0,1),
f= c("m","f"), output = c(0,1))
library(splines)
model <- glm(output~ bs(a, df=2)+ bs(b, df=2)+ bs(d, df=2)+ bs(e, df=2)+
factor(f) ,
data = df,
family = "binomial")
In my actual dataset, I need to apply bs() to way more columns than this example. Is there a way I can do this without writing all the terms?
We can use some string manipulation with sprintf, together with reformulate:
predictors <- c("a", "b", "d", "e")
bspl.terms <- sprintf("bs(%s, df = 2)", predictors)
other.terms <- "factor(f)"
form <- reformulate(c(bspl.terms, other.terms), response = "output")
#output ~ bs(a, df = 2) + bs(b, df = 2) + bs(d, df = 2) + bs(e,
# df = 2) + factor(f)
If you want to use a different df and degree for each spline, that is also straightforward (note that df cannot be smaller than degree).
predictors <- c("a", "b", "d", "e")
dof <- c(3, 4, 3, 6)
degree <- c(2, 2, 2, 3)
bspl.terms <- sprintf("bs(%s, df = %d, degree = %d)", predictors, dof, degree)
other.terms <- "factor(f)"
form <- reformulate(c(bspl.terms, other.terms), response = "output")
#output ~ bs(a, df = 3, degree = 2) + bs(b, df = 4, degree = 2) +
# bs(d, df = 3, degree = 2) + bs(e, df = 6, degree = 3) + factor(f)
Prof. Ben Bolker: I was going to suggest something a little bit fancier, something like predictors <- setdiff(names(df)[sapply(df, is.numeric)], "output").
Yes, this is safer. It is also an automatic way to include all numeric variables other than "output" as predictors, if that is what the OP wants.
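Putting the two suggestions together might look like the sketch below, which picks up every numeric column except the response and builds the formula from it. It reuses the toy df from the question; the assumption that every remaining numeric column should get a spline term is illustrative and may not hold for a real dataset.

```r
library(splines)

df <- data.frame(a = c(0, 1), b = c(0, 1), d = c(0, 1), e = c(0, 1),
                 f = c("m", "f"), output = c(0, 1))

# All numeric columns except the response
predictors <- setdiff(names(df)[sapply(df, is.numeric)], "output")

# One bs() term per numeric predictor, plus the factor term
bspl.terms <- sprintf("bs(%s, df = 2)", predictors)
form <- reformulate(c(bspl.terms, "factor(f)"), response = "output")

form
all.vars(form)  # variables the formula refers to
```

The resulting `form` can then be passed to `glm(form, data = df, family = "binomial")` on a dataset with enough rows to fit the model.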
I want to create train, validation, and test splits (60:20:20), and I repeat the process multiple times.
The test set should contain exactly 2 observations each time. Why does it sometimes contain 1 or 3 observations?
What is the role of replace in sample()? Should I set it to FALSE?
library(dplyr)
tbl <- tibble(id = 1:10)
train = list()
val = list()
test = list()
for (run in 1:5) {
  assignment <- sample(1:3, size = nrow(tbl), prob = c(0.6, 0.2, 0.2), replace = TRUE)
  # Create train, validation and test sets
  train[[run]] <- tbl[assignment == 1, ]
  val[[run]] <- tbl[assignment == 2, ]
  test[[run]] <- tbl[assignment == 3, ]
}
If we want exactly 6, 2, and 2 values of 1, 2, and 3 in the sample, just replicate 1, 2, 3 that many times and shuffle the result:
v1 <- sample(rep(1:3, c(6, 2, 2)))
Then do a split
split(tbl, v1)
When we use prob, the frequencies can vary from run to run, because each element is drawn independently with those probabilities; they are not fixed counts. As for replace = TRUE, it is needed in the OP's code because 1:3 has only 3 elements whereas size = nrow(tbl) is 10; without replacement, sample() cannot fill the 7 extra elements.
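The exact-count approach can be dropped into the repeated-runs loop from the question. This is a sketch that uses a base data.frame instead of a tibble so that it runs without dplyr; the seed and `n_runs` are arbitrary.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

tbl <- data.frame(id = 1:10)
n_runs <- 5
train <- vector("list", n_runs)
val <- vector("list", n_runs)
test <- vector("list", n_runs)

for (run in seq_len(n_runs)) {
  # Shuffle a vector holding exactly six 1s, two 2s and two 3s
  assignment <- sample(rep(1:3, times = c(6, 2, 2)))
  train[[run]] <- tbl[assignment == 1, , drop = FALSE]
  val[[run]] <- tbl[assignment == 2, , drop = FALSE]
  test[[run]] <- tbl[assignment == 3, , drop = FALSE]
}

sapply(test, nrow)  # every run has exactly 2 test rows
```

Here `sample()` only permutes a fixed vector, so the group sizes are the same in every run and only the membership changes.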
I have several different arrays of the same dimension. Is there a way to find the standard deviation, mean, and some percentiles of all the arrays? My final result should be one array with the same dimension as each of the individual arrays.
I tried the following, but it clearly doesn't work:
m1 <- array(runif(8), dim = c(2, 2, 2))
m2 <- array(runif(8), dim = c(2, 2, 2))
m3 <- array(runif(8), dim = c(2, 2, 2))
sd(m1, m2, m3)
Consider combining the arrays into a single array with an extra dimension, then use apply over the original dimensions to get the sd:
out <- apply(array(c(m1, m2, m3), dim = c(2, 2, 2, 3)), c(1, 2, 3), sd)
Checking the output:
> sd(c(m1[1], m2[1], m3[1]))
[1] 0.1623589
> out[1]
[1] 0.1623589
Use the same method for the mean:
out2 <- apply(array(c(m1, m2, m3), dim = c(2, 2, 2, 3)), c(1, 2, 3), mean)
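Percentiles, which the question also asks for, should work the same way with `quantile()`. The sketch below computes the 90th percentile per cell; the choice of 0.9, the seed and the names `stacked` and `p90` are illustrative only.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
m1 <- array(runif(8), dim = c(2, 2, 2))
m2 <- array(runif(8), dim = c(2, 2, 2))
m3 <- array(runif(8), dim = c(2, 2, 2))

# Stack the three arrays along a new fourth dimension
stacked <- array(c(m1, m2, m3), dim = c(2, 2, 2, 3))

# 90th percentile across the three arrays, cell by cell
p90 <- apply(stacked, c(1, 2, 3), quantile, probs = 0.9, names = FALSE)

dim(p90)  # same dimensions as each input array
```

Passing several probabilities at once (for example `probs = c(0.1, 0.9)`) returns an array with one extra leading dimension, one slice per percentile.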
I have a simple 12 x 2 matrix called m that contains my dataset (see below).
Question
Why do I get an error when I use dimnames(m) to create names for the two columns of my data? Is there a better way to create column names for this data in R?
Here is my R code:
Group1 = rnorm(6, 7) ; Group2 = rnorm(6, 9)
Level = gl(n = 2, k = 6)
m = matrix(c(Group1 , Group2, Level), nrow = 12, ncol = 2)
dimnames(m) <- list( DV = Group1, Level = Level)
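dimnames(m) expects a list whose first element holds the row names and whose second element holds the column names. Your list supplies a vector of length 6 as row names for a matrix with 12 rows, hence the error. Replace the dimnames(m) line with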
colnames(m) <- c("DV","Level")
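For comparison, `dimnames()` can still be used if the list has the expected shape: row names first (or `NULL` for none), column names second. The sketch below rebuilds the matrix from the question with an arbitrary seed; `as.integer(Level)` just makes the conversion of the factor to its integer codes explicit.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
Group1 <- rnorm(6, 7)
Group2 <- rnorm(6, 9)
Level <- gl(n = 2, k = 6)

m <- matrix(c(Group1, Group2, as.integer(Level)), nrow = 12, ncol = 2)

# First list element: row names (none here); second: column names
dimnames(m) <- list(NULL, c("DV", "Level"))

colnames(m)
head(m, 3)
```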