My problem is that when I compute a running correlation, for some odd reason I do not get the same p-value for the same estimate/correlation values.
My goal is to calculate a running Spearman correlation on two vectors in the same data.frame (subject1 and subject2 in the example below). In addition, my window (the length of the vector) and stride (the jump/step between windows) are constant. As such, looking at the t-distribution approximation from Wikipedia, t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom, I should get the same t and hence the same p-value for the same Spearman correlation: n stays the same (it's the same window size) and r is the same. However, my resulting p-values differ.
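A minimal sketch of that approximation (rho_to_p is just an illustrative helper, not from any package) confirms that a fixed rho and n can only produce one p-value:
# t-approximation for Spearman's rho: for a fixed window size n,
# identical rho values must give identical t statistics and p-values.
rho_to_p <- function(rho, n) {
  t_stat <- rho * sqrt((n - 2) / (1 - rho^2))
  2 * pt(-abs(t_stat), df = n - 2)
}
rho_to_p(0.5, 20)  # same inputs, same p-value, every time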
# Needed pkgs
require(tidyverse)
require(pspearman)
require(gtools)
# broom must also be installed; it is used via broom::tidy() below

# Sample data
set.seed(528)
subject1 <- rnorm(40, mean = 85, sd = 5)
set.seed(528)
subject2 <- c(
  lag(subject1[1:21]) - 10,
  rnorm(n = 6, mean = 85, sd = 5),
  lag(subject1[length(subject1):28]) - 10)
df <- data.frame(subject1 = subject1,
                 subject2 = subject2) %>%
  rowid_to_column(var = "Time")
df[is.na(df)] <- subject1[1] - 10
rm(subject1, subject2)
# Function for Spearman on a single window
psSpearman <- function(x, y) {
  out <- pspearman::spearman.test(x, y,
                                  alternative = "two.sided",
                                  approximation = "t-distribution") %>%
    broom::tidy()
  return(data.frame(estimate = out$estimate,
                    statistic = out$statistic,
                    p.value = out$p.value))
}
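Called on a single window (the first 20 rows here, purely as an illustration), it returns one row of estimate, statistic, and p-value:
psSpearman(df$subject1[1:20], df$subject2[1:20])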
# Running correlation along the subjects
dfRunningCor <- running(df$subject1, df$subject2,
                        fun = psSpearman,
                        width = 20,
                        allow.fewer = FALSE,
                        by = 1,
                        pad = FALSE,
                        align = "right") %>%
  t() %>%
  as.data.frame()
# Arranging the results into an easy-to-handle data.frame
Results <- do.call(rbind.data.frame, dfRunningCor) %>%
  t() %>%
  as.data.frame() %>%
  rownames_to_column(var = "Win") %>%
  gather(CorValue, Value, -Win) %>%
  separate(Win, c("fromIndex", "toIndex")) %>%
  mutate(fromIndex = as.numeric(substring(fromIndex, 2)),
         toIndex = as.numeric(toIndex)) %>%
  spread(CorValue, Value) %>%
  arrange(fromIndex) %>%
  select(fromIndex, toIndex, estimate, statistic, p.value)
My problem is that when I plot Results with the estimate (Spearman rho) against the window number (fromIndex) and color by p-value, I should get a "tunnel"/"path" of the same color across the same area, but I don't.
For example, in the picture below, points at the same height inside the red circle should have the same color, but they don't.
Code for the graph:
Results %>%
  ggplot(aes(fromIndex, estimate, color = p.value)) +
  geom_line()
What I have found so far suggests it might be due to:
1. Functions like Hmisc::rcorr() tend not to give the same p-value with small samples or many ties. This is why I use pspearman::spearman.test, which, from what I read here, is supposed to solve this problem.
2. Small sample size - I tried a bigger sample size and still get the same problem.
3. Rounding my p-values - I still get the same problem.
Thank you for your help!
Edit:
Could it be "pseudo" coloring by ggplot? Could it be that ggplot just carries the "last" color forward until the next point? That would explain why I get light blue from point 5 to 6 but dark blue from point 7 to 8.
The results you obtain for the p.value variable are consistent with the estimate values.
You can check it as follows:
Results$orderestimate <- order(-abs(Results$estimate))
Results$orderp.value <- order(abs(Results$p.value))
identical(Results$orderestimate, Results$orderp.value)
I don't think you should map the p.value to colour in the graph; it is an unnecessary visual distraction and it is hard to interpret.
If I were you, I would display only the p.value, and perhaps include points to indicate the sign of the estimate variable.
p <- Results %>%
  ggplot(aes(fromIndex, p.value)) +
  geom_line()

# If you want to display the sign of the estimate
Results$estimate.sign <- as.factor(sign(Results$estimate))
p + geom_point(aes(color = estimate.sign))
Related
Let's say that I have four groups, each of 100 items. Each item has a rating on variable X.
I would like to randomly select ten from each group, with the caveat that the ten should evenly span variable X.
In other words, I would like the ten I draw from each group to have the most uniform distribution on variable X possible.
I've tried using cut points and stratified() but this hasn't worked because the different groups have somewhat different distributions. I've also tried using quantiles, but this has led to normal rather than uniform distributions.
Thanks!
Here is some sample data:
item = 1:400
group = rep(1:4, each = 100)
x = rnorm(400)
d <- as.data.frame(cbind(item, group, x))
You can use rank order statistics (I increased the population size so that you can visually see it work):
# Prep data
item = 1:4000
group = rep(1:4, each = 1000)
x = rnorm(4000)
d <- as.data.frame(cbind(item, group, x))

# Load package (purrr must be installed as well)
library(dplyr)

# Draw the sample
sample_size <- 10
drawn_sample <- d %>%
  mutate(rank = rank(x)) %>%
  split.data.frame(.$group) %>%
  purrr::map(filter, rank %in% sample(rank, sample_size)) %>%
  bind_rows()

# Visualize sample vs original distribution
x %>%
  density() %>%
  plot()
drawn_sample$x %>%
  density() %>%
  lines(col = 'red')
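A quick sanity check that each group contributed exactly sample_size items:
table(drawn_sample$group)  # should show 10 draws for each of the 4 groups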
Could you please help me?
I'm writing code in R to automate a null model analysis of multiple networks. First, the code reads multiple TXT matrices into R. Second, it calculates a topological metric for each network. Third, it randomizes each network N times using a null model. Fourth, it calculates the same topological metric for all randomized versions of the original matrices.
In the fifth and final step, the idea is to compare the observed scores against the distributions of randomized scores: first, by simply counting how many randomized scores fall above or below the observed score, to estimate P-values; second, by plotting the distribution of randomized scores as a density and adding a vertical line for the observed score.
Here are examples of the data frames that need to be analyzed:
networks <- paste("network", 1:3, sep = "")
randomizations <- 1:10
observed.ex <- data.frame(network = networks,
                          observed = runif(3, min = 0, max = 1))
randomized.ex <- data.frame(network = sort(rep(networks, 10)),
                            randomization = rep(randomizations, 3),
                            randomized = rnorm(length(networks) *
                                                 length(randomizations),
                                               mean = 0.5, sd = 0.1))
In the first step of the final analysis, the code estimates the P-values by doing simple counts. As you can see, I have to copy the calculation call for each network:
randomized.network1 <- subset(randomized.ex, network == "network1")
sum(randomized.network1$randomized >= observed.ex$observed[1]) /
  length(randomized.network1$randomized)
sum(randomized.network1$randomized <= observed.ex$observed[1]) /
  length(randomized.network1$randomized)

randomized.network2 <- subset(randomized.ex, network == "network2")
sum(randomized.network2$randomized >= observed.ex$observed[2]) /
  length(randomized.network2$randomized)
sum(randomized.network2$randomized <= observed.ex$observed[2]) /
  length(randomized.network2$randomized)

randomized.network3 <- subset(randomized.ex, network == "network3")
sum(randomized.network3$randomized >= observed.ex$observed[3]) /
  length(randomized.network3$randomized)
sum(randomized.network3$randomized <= observed.ex$observed[3]) /
  length(randomized.network3$randomized)
In the second step of the final analysis, the code makes density plots. As you can see, I have to copy the vertical-line call for each network:
ggplot(randomized.ex, aes(randomized)) +
  geom_density() +
  facet_grid(network ~ .) +
  geom_vline(data = filter(randomized.ex, network == "network1"),
             aes(xintercept = observed.ex$observed[1]), colour = "red") +
  geom_vline(data = filter(randomized.ex, network == "network2"),
             aes(xintercept = observed.ex$observed[2]), colour = "red") +
  geom_vline(data = filter(randomized.ex, network == "network3"),
             aes(xintercept = observed.ex$observed[3]), colour = "red")
Is there a way to make this final analysis more general, so it always does the same calculations and plots, no matter how many networks are read in the beginning?
Thank you very much!
It looks like this can be neatly wrapped in an lapply loop that iterates over each file. How does the code below work for you? You could also pass in filenames rather than the number of files (currently 1:3) and have the first line "read" in your TXT matrices.
library(dplyr)  # for %>%, group_by, and summarize

output <- lapply(1:3, function(network_num) {
  network <- paste0("network", network_num)
  n_randomizations <- 10
  observed.ex <- runif(1)
  randomized.ex <- rnorm(n_randomizations, mean = 0.5, sd = 0.1)
  return(data.frame(network = network,
                    observed = observed.ex,
                    randomized = randomized.ex))
}) %>% do.call(what = rbind)
output %>%
  group_by(network) %>%
  summarize(p_value = mean(observed >= randomized))
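If you want both of the per-network proportions computed in the question (randomized scores above and below the observed score), the same grouped call extends naturally (a sketch using the columns defined above):
output %>%
  group_by(network) %>%
  summarize(p_upper = mean(randomized >= observed),
            p_lower = mean(randomized <= observed))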
library(ggplot2)

ggplot(output) +
  geom_density(aes(randomized)) +
  facet_grid(network ~ .) +
  geom_vline(aes(xintercept = observed), col = "red")
I have the following code:
require(tidyverse)
sample(x = 0:1, size = 4, replace = TRUE) %>% sum()
I have created a bootstrap simulation of this code by using the replicate function (we were simulating coin tosses and heads50 is the final data object):
heads50 <- replicate(50, sample(0:1, 4, TRUE)) %>% sum()
However, the sum function gives me the total number of heads aggregated over all replications of this experiment, not the output of each trial. What I want to know is how many heads come up in each trial of 4 tosses, not just the overall number, so that I can plot probabilities later on.
I've also created a data object to try to group by outcome (i.e. to calculate the probability of tossing 1 vs. 2 vs. 3 vs. 4 heads out of four in a trial), like so:
data50 <- tibble(heads = heads50) %>%
  group_by(heads) %>%
  summarise(n = n(), p = n / 50)
The problem is that it is not doing that: when I try to generate a histogram, it just gives me one bar with a single overall probability:
ggplot(data50, aes(x = heads, y = p)) +
  geom_bar(stat = "identity", fill = "green") +
  labs(x = "Number of Heads", y = "Probability of Heads in 4 flips (p)") +
  theme_minimal()
Does anyone have an idea of how to sum each trial separately and split out the possibilities? I have tried restarting RStudio and reloading the tidyverse package, which includes dplyr with its six core functions.
The fundamental problem here is where you're calling the sum() function. When sum() is outside replicate(), replicate() builds a 4x50 matrix of zeros and ones, and sum() then flattens that matrix and adds everything up. Instead, you want the sum taken on a per-trial basis; the addition should happen within the replication loop, not outside it. Try:
heads50 <- replicate(50, sample(0:1, size = 4, replace = T) %>% sum)
Another option would be to sum your matrix only along columns; that is,
heads50 <- replicate(50, sample(0:1, size = 4, replace = T)) %>% colSums
where this time the colSums() function sits outside the replicate() as it did in your original example.
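Either way, heads50 is now a length-50 vector of per-trial head counts, which you can check with:
table(heads50)  # how many of the 50 trials produced 0, 1, 2, 3, or 4 heads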
@Aaron has pointed it out correctly. You don't really need a pipe to sum up your counts; you can write it as follows, or, if you are familiar with the binomial distribution, note that what you have is essentially rbinom(50, 4, 0.5).
So below I have rewritten your code to simulate and plot. Hopefully it's useful in some way:
library(dplyr)
library(ggplot2)  # for the plot below

set.seed(123)
data.frame(trial = 1:50) %>%
  mutate(heads = rbinom(50, 4, 0.5)) %>%
  count(heads) %>%
  mutate(p = n / sum(n)) %>%
  ggplot(aes(x = heads, y = p)) +
  geom_bar(stat = "identity", fill = "green") +
  labs(x = "Number of heads", y = "Probability of Heads in 4 flips (p)") +
  theme_minimal()
I am trying to plot a regression model for a data set with measurements of "mue" (the friction coefficient for braking a train), train speed, and train temperature. I built a simple regression model using lm so I could test plotting with plot_ly. The plot attached shows the blue markers of the original data, but the plotted surface doesn't look right; it should look more like a regression surface. I also plotted in 2D to make sure the regression actually works, and it does. I've posted the code below and am wondering if anyone here has any advice. I've been trying everything I can find online and none of it seems to work. I think the issue might have to do with building a grid? I've tried that a few times, but I always get error messages about vectors not matching up, etc. I'd be happy to post that as well if needed. Thank you!
[Figure: 3D regression plot]
[Figure: 2D regression plot]
library(caret)   # for createDataPartition()
library(plotly)

set.seed(123)  # random number generator
training.samples <- avg.frame$avg.mue %>%
  createDataPartition(p = 0.8, list = FALSE)   # pick 80 percent of the data
train.data <- avg.frame[training.samples, ]    # 80 percent is training data
test.data <- avg.frame[-training.samples, ]    # 20 percent is test data

model_2 <- lm(avg.mue ~ avg.speed + avg.temp, data = train.data)
vals <- predict(model_2, train.data)
avg.mue <- matrix(vals, nrow = length(test.data$avg.speed), ncol = length(test.data$avg.temp))
plane <- avg.mue
p <- plot_ly(data = train.data, z = ~avg.mue, x = ~avg.speed, y = ~avg.temp, opacity = 0.6) %>%
  add_markers()
p %>%
  add_surface(z = ~plane, x = ~avg.speed, y = ~avg.temp, showscale = FALSE) %>%
  layout(showlegend = FALSE)
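For reference, a common grid-based pattern for a plotly regression surface looks roughly like this (a sketch only, assuming the avg.frame columns above; plotly expects z to be a matrix with one row per y value and one column per x value):
speed_seq <- seq(min(train.data$avg.speed), max(train.data$avg.speed), length.out = 30)
temp_seq <- seq(min(train.data$avg.temp), max(train.data$avg.temp), length.out = 30)
grid <- expand.grid(avg.speed = speed_seq, avg.temp = temp_seq)  # speed varies fastest
plane <- matrix(predict(model_2, grid),
                nrow = length(temp_seq), ncol = length(speed_seq),
                byrow = TRUE)  # rows follow temp (y), columns follow speed (x)
p %>%
  add_surface(x = speed_seq, y = temp_seq, z = plane, showscale = FALSE)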
I have a matrix with x rows (i.e. the number of draws) and y columns (the number of observations). They represent a distribution of y forecasts.
Now I would like to make a sort of "heat map" of the draws. That is, I want to plot a "confidence interval" (not really a confidence interval, just all the values with shading in between), but as a "heat map" (an example of a heat map). That means that if, for instance, a lot of draws for observation y = y* were around 1 but there was also a draw of 5 for that same observation, then the area of the interval around 1 should be darker (while the whole area between 1 and 5 is still shaded).
To be totally clear: I like, for instance, the plot in the answer here, but I would want the grey confidence interval to instead be colored by intensity (i.e. some areas darker than others).
Could someone please tell me how I could achieve that?
Thanks in advance.
Edit: As per request: example data.
Example of the first 20 values of the first column (i.e. y[1:20,1]):
 [1]  0.032067416 -0.064797792  0.035022338  0.016347263  0.034373065
 [6]  0.024793101 -0.002514447  0.091411355 -0.064263536 -0.026808208
[11]  0.125831185 -0.039428744  0.017156454 -0.061574540 -0.074207109
[16] -0.029171227  0.018906181  0.092816957  0.028899699 -0.004535961
So, the hard part of this is transforming your data into the right shape, which is why it's nice to share something that really looks like your data, not just a single column.
Let's say your data is a matrix with 10,000 rows and 10 columns. I'll just use a uniform distribution, so it will be a boring plot at the end:
n = 10000
k = 10
mat = matrix(runif(n * k), nrow = n)
Next, we'll calculate quantiles for each column using apply, transpose, and make it a data frame:
dat = as.data.frame(t(apply(mat, MARGIN = 2, FUN = quantile, probs = seq(.1, 0.9, 0.1))))
Add an x variable (since we transposed, each x value corresponds to a column in the original data)
dat$x = 1:nrow(dat)
We now need to get it into a "long" form, grouped by the min and max values for a certain deviation group around the median, and of course get rid of the pesky percent signs introduced by quantile:
library(dplyr)
library(tidyr)
dat_long = gather(dat, "quantile", value = "y", -x) %>%
  mutate(quantile = as.numeric(gsub("%", "", quantile)),
         group = abs(50 - quantile))

dat_ribbon = dat_long %>%
  filter(quantile < 50) %>%
  mutate(ymin = y) %>%
  select(x, ymin, group) %>%
  left_join(
    dat_long %>%
      filter(quantile > 50) %>%
      mutate(ymax = y) %>%
      select(x, ymax, group)
  )

dat_median = filter(dat_long, quantile == 50)
And finally we can plot. We'll plot a transparent ribbon for each "group" (the 10%-90% interval, the 20%-80% interval, ..., down to the 40%-60% interval), and then a single line at the median (50%). Using transparency, the middle comes out darker because more ribbons overlap there. This doesn't go from the minimum to the maximum, but it will if you set the probs in the quantile call to go from 0 to 1 instead of 0.1 to 0.9.
library(ggplot2)
ggplot(dat_ribbon, aes(x = x)) +
  geom_ribbon(aes(ymin = ymin, ymax = ymax, group = group), alpha = 0.2) +
  geom_line(aes(y = y), data = dat_median, color = "white")
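For the full-range version mentioned above, only the probs argument in the quantile call changes:
# Quantiles from the minimum (0%) to the maximum (100%)
dat = as.data.frame(t(apply(mat, MARGIN = 2, FUN = quantile, probs = seq(0, 1, 0.1))))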
Worth noting that this is not a conventional heatmap. A heatmap usually implies that you have 3 variables, x, y, and z (color), where there is a z-value for every x-y pair. Here you have two variables, x and y, with y depending on x.
That is not a lot to go on, but I would probably start with the hexbin package and its hexbinplot() function. Several alternatives are presented in this SO post:
Formatting and manipulating a plot from the R package "hexbin"
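A minimal sketch of that idea, assuming the draws sit in a matrix y with one column per observation (all sizes and values below are made up):
library(hexbin)
n_draws <- 1000
n_obs <- 50
y <- matrix(rnorm(n_draws * n_obs), nrow = n_draws)  # stand-in for your draws

# Flatten to (observation index, draw value) pairs and hex-bin them,
# so regions where many draws fall come out darker.
xy <- data.frame(x = rep(seq_len(n_obs), each = n_draws),
                 y = as.vector(y))
hexbinplot(y ~ x, data = xy,
           colramp = colorRampPalette(c("grey90", "black")))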