comparing a vector to a probability distribution - r

I have a vector:
r <- runif(10)
r
[1] 0.52324423 0.89110751 0.44616915 0.70163640 0.63741495 0.31263977
[7] 0.73947973 0.83278799 0.04971461 0.01820381
I also have a probability distribution
p <- c(0, cumsum(rep(0.25, 4)))
p
[1] 0.00 0.25 0.50 0.75 1.00
I would like to assign factors to r based on the probability distribution in p.
In other words, I would like my output to be:
r
[1] 3 4 2 3 3 2 3 4 1 1
When I try this, I get a warning:
which(r >= p) - 1
[1] 3
Warning message:
In r >= p : longer object length is not a multiple of shorter object length
In other words, only the first value in r is compared to p.
How would I go about converting r into a vector of levels that I can then turn into factors?

You can use cut
as.integer(cut(r, breaks=p))
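With the p above this reproduces the expected levels 3 4 2 3 3 2 3 4 1 1, and if you want a factor directly you can let cut build it (the labels below are just illustrative):
f <- cut(r, breaks = p, labels = paste0("Q", 1:4)) # factor with 4 levels, e.g. Q3 Q4 Q2 ...
table(f) # counts per level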

split() returns "longer object length is not a multiple of shorter object length"

Context
I asked this question recently:
Comparing partitions from split() using a nested for loop containing an if statement
where I needed to compare partitions generated by split() from a distance matrix, using the code fix provided by @robertdj:
set.seed(1234) # set random seed for reproducibility
# generate random normal variates
x <- rnorm(5)
y <- rnorm(5)
df <- data.frame(x, y) # merge vectors into dataframe
d <- dist(x) # generate distance matrix
splt <- split(d, 1:5) # split the distances into 5 equal-sized partitions
for (i in 1:length(splt)) {
  for (j in 1:length(splt)) {
    if (i != j) {
      a <- length(which(splt[[i]] >= min(splt[[j]]))) / length(splt[[i]])
      b <- length(which(splt[[j]] <= max(splt[[i]]))) / length(splt[[j]])
    }
  }
}
I generated a MWE where each split contained the same number of elements. I did this just for illustrative purposes, fully knowing that this would not necessarily hold for real data.
As per @Robert Hacken's comment, if I instead do
d <- na.omit(d[lower.tri(d)])
I get partitions of unequal length.
Real Data
However, my real data does not have this "same size" property; it also contains many more partitions than the 5 in my MWE.
Here is my code:
splt <- split(dist_matrix, sub("(?:(.*)\\|){2}(\\w+)\\|(\\w+)\\|.*?$", "\\1-\\2", colnames(dist_matrix)))
The distance matrix dist_matrix contains FASTA headers from which I extract the species names.
I then use splt above in the doubly nested loop.
For instance, splt[[4]] contains 5 values, whereas splt[[10]] contains 9.
splt[[4]]
[1] 0.1316667 0.1383333 0.1166667 0.1333333 0.1216667
splt[[10]]
[1] 0.1450000 0.1483333 0.1316667 0.1316667 0.1333333 0.1333333 0.1166667 0.1166667 0.1200000
Expected Output
For my real problem, each partition corresponds to the distances from a single species to all other unique species. So, if Species X has two DNA sequences representing it and there are 10 species in total, the partition for Species X should contain 20 distances. However, I don't want the partition to include the distance between the two sequences for Species X.
splt would thus contain 10 partitions (one per species, each not necessarily of the same length).
The expected output of a and b is a number between 0 and 1 inclusive. I think these numbers should be small in my real example, but they are large when I try to run my code, which I think is a consequence of the warning.
What I've Done
I've read on SO that %in% is typically used to resolve the warning
In splt[[i]] == splt[[j]] :
longer object length is not a multiple of shorter object length
except in my case, I believe I would need `%notin%` <- Negate(`%in%`).
However, %notin% gives the error from my original post:
the condition has length > 1
Question
How can my nested loop be altered to remove the warning?
I'm going to go out on a limb by interpreting parts of what you say, discarding your code, and seeing what I can come up with. If nothing else, it may spark a conversation about which of my interpretations are correct (and which are incorrect).
Starting with the splt as generated by the random data, then replacing elements 4 and 5 with longer vectors,
set.seed(1234)
x <- rnorm(5)
y <- rnorm(5)
df <- data.frame(x, y)
d <- dist(x)
splt <- split(d, 1:5)
splt[[4]] <- rnorm(4)
splt[[5]] <- rnorm(10)
We have:
splt <- list("1" = c(1.48449499149608, 2.62312694474001), "2" = c(2.29150692606848, 0.15169544670039), "3" = c(1.13863195324393, 3.43013887931241), "4" = c(-0.477192699753547, -0.998386444859704, -0.77625389463799, 0.0644588172762693), "5" = c(-0.693720246937475, -1.44820491038647, 0.574755720900728, -1.02365572296388, -0.0151383003641817, -0.935948601168394, 1.10229754620026, -0.475593078869057, -0.709440037512506, -0.501258060594761))
splt
# $`1`
# [1] 1.484495 2.623127
# $`2`
# [1] 2.2915069 0.1516954
# $`3`
# [1] 1.138632 3.430139
# $`4`
# [1] -0.47719270 -0.99838644 -0.77625389 0.06445882
# $`5`
# [1] -0.6937202 -1.4482049 0.5747557 -1.0236557 -0.0151383 -0.9359486 1.1022975 -0.4755931 -0.7094400 -0.5012581
You reference expressions like which(splt[[i]] >= min(splt[[j]])), which I'm interpreting to mean "what is the ratio of splt[[i]] that is above the max value in splt[[j]]". Since we're comparing (for example) splt[[1]] with all of splt[[2]] through splt[[5]] here, and likewise for the others, we're going to have a square matrix where the diagonal is splt[[i]]-vs-splt[[i]] (likely not interesting).
Some quick math so we know what we should end up with:
splt[[1]]
# [1] 1.484495 2.623127
range(splt[[2]])
# [1] 0.1516954 2.2915069
Since 1 of the 2 values in [[1]] is greater than [[2]]'s max of 2.29, we expect 0.5 in a comparison between the two (for >= max(.)); similarly, none of [[1]] is below [[2]]'s min of 0.15, so we expect a 0 there (for <= min(.)).
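For the record, the same kind of check on that pair, using the splt above:
### 1 of 2 is at or above [[2]]'s max
sum(splt[[1]] >= max(splt[[2]])) / length(splt[[1]])
# [1] 0.5
### none of [[1]] is at or below [[2]]'s min
sum(splt[[1]] <= min(splt[[2]])) / length(splt[[1]])
# [1] 0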
Similarly, [[5]] over [[4]]:
splt[[5]]
# [1] -0.6937202 -1.4482049 0.5747557 -1.0236557 -0.0151383 -0.9359486 1.1022975 -0.4755931 -0.7094400 -0.5012581
range(splt[[4]])
# [1] -0.99838644 0.06445882
### 2 of 10 are greater than the max
sum(splt[[5]] >= max(splt[[4]])) / length(splt[[5]])
# [1] 0.2
### 2 of 10 are less than the min
sum(splt[[5]] <= min(splt[[4]])) / length(splt[[5]])
# [1] 0.2
We can use outer, but sometimes that can be confusing, especially since in this case we'd need to Vectorize the anonymous function passed to it (a sketch of that appears after the two matrices below). I'll adapt your double-for loop premise into nested sapply calls.
Greater than the other's max
sapply(splt, function(y)
  sapply(setNames(splt, paste0("max", seq_along(splt))),
         function(z) sum(y >= max(z)) / length(y)))
# 1 2 3 4 5
# max1 0.5 0.0 0.5 0.00 0.0
# max2 0.5 0.5 0.5 0.00 0.0
# max3 0.0 0.0 0.5 0.00 0.0
# max4 1.0 1.0 1.0 0.25 0.2
# max5 1.0 0.5 1.0 0.00 0.1
Interpretation and subset validation:
1 with max of 2: comparing [[1]] (first column) with the max value from [[2]] (second row), half of 1's values are greater, so we have 0.5 (as expected).
5 with max of 4: comparing [[5]] (fifth column) with the max value from [[4]] (fourth row), 0.2 meet the condition.
Less than the other's min
sapply(splt, function(y)
  sapply(setNames(splt, paste0("min", seq_along(splt))),
         function(z) sum(y <= min(z)) / length(y)))
# 1 2 3 4 5
# min1 0.5 0.5 0.5 1.00 1.0
# min2 0.0 0.5 0.0 1.00 0.8
# min3 0.0 0.5 0.5 1.00 1.0
# min4 0.0 0.0 0.0 0.25 0.2
# min5 0.0 0.0 0.0 0.00 0.1
Same two pairs:
1 with min of 2 (row 2, column 1) is 0, as expected
5 with min of 4 (row 4, column 5) is 0.2, as expected
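For reference, this is roughly what the outer route mentioned earlier could look like; it should reproduce the "greater than the other's max" matrix (a sketch using the same splt, with the anonymous function wrapped in Vectorize as noted):
mat_max <- outer(seq_along(splt), seq_along(splt),
                 Vectorize(function(j, i) sum(splt[[i]] >= max(splt[[j]])) / length(splt[[i]])))
dimnames(mat_max) <- list(paste0("max", seq_along(splt)), names(splt))
mat_max # rows are the "max" reference, columns the partition being measured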
Edit: @compbiostats pointed out that while sum(..) should produce the same results as length(which(..)), the latter may be more robust to missing data (i.e., NA values; cf. Difference between sum(), length(which()), and nrow() in R). For sum(..) to share that resilience, we should add na.rm=TRUE to both sum(.) and max(.)/min(.) in the calls above. Thanks @compbiostats!
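For example, the "max" call with those guards added might look like this (a sketch; the "min" version is analogous):
sapply(splt, function(y)
  sapply(setNames(splt, paste0("max", seq_along(splt))),
         function(z) sum(y >= max(z, na.rm = TRUE), na.rm = TRUE) / length(y)))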

automatically try different initial values in optim

I use optim(.) to try to find the best-fitting parameters of some function fn(dat, par, out=FALSE), where par must be a vector of two elements and out determines the output format. I use
optim(par=c(1,1), fn, dat=dat)
to identify the best-fitting values of par. Depending on the data in dat, this either works or throws an error that
function cannot be evaluated at initial parameters
which I understand requires different starting values for optim(.). My problem is that I apply the function to many data sets in parallel and wonder whether I indeed need to try different values by hand or whether there is some way of automating this along the lines of
if no error then great
if error try par=c(0.5,1)
if no error then great
if error try par=c(0.5,0.5)
...
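In code, I imagine something along these lines (a rough sketch using tryCatch; the candidate starting values are just placeholders):
starts <- list(c(1, 1), c(0.5, 1), c(0.5, 0.5), c(0.1, 0.1)) # candidates, tried in order
fit <- NULL
for (par0 in starts) {
  fit <- tryCatch(optim(par = par0, fn, dat = dat),
                  error = function(e) NULL) # swallow the error and return NULL
  if (!is.null(fit)) break # keep the first starting value that works
}
if (is.null(fit)) warning("no starting value worked")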
You could run a grid search before you start and discard NA parameters. Here is an example.
A test function:
fn <- function(x) {
  if (x[1] < 0)
    NA
  else
    prod(x)
}
Now run a grid search.
library("NMOF")
res <- gridSearch(fn,
                  npar = 2,   ## length of x
                  lower = -1, ## lower bound for x
                  upper = 3,  ## upper bound for x
                  n = 5)      ## number of levels per element in x
## 2 variables with 5, 5 levels: 25 function evaluations required.
The function shows you all the parameter combinations it tried.
res$levels
## [[1]]
## [1] -1 -1
##
## [[2]]
## [1] 0 -1
##
## [[3]]
## [1] 1 -1
##
## ....
And it provides the objective function values associated with these combinations.
res$values
## [1] NA 0 -1 -2 -3 NA 0 0 0 0 NA 0 1 2 3
## [16] NA 0 2 4 6 NA 0 3 6 9
## => many objective functions values are NA
The best (non-NA) solution:
res$minlevels
## [1] 3 -1
## => your starting value for optim:
##
## optim(gridSearch(fn, npar = 2,
## lower = -1, upper = 3, n = 5)$minlevels,
## fn, dat = dat)
Of course, this won't guarantee that at least one non-NA vector is found, but the chances may improve.

Formula to compute the between group sum of squares in R

Can anyone tell me how to code the between-group sum of squares (SS between) in R?
To compute it by hand, it is ∑ nᵢ (meanXᵢ − grand mean)², where meanXᵢ is the mean of group i and nᵢ is the group's size.
thanks,
lp
If you have a vector of values x and its mean in x_mean, you can compute the sum of squares manually like this:
> x=c(1,2,3,4,5)
> x_mean = mean(x)
> x-x_mean
[1] -2 -1 0 1 2
> (x-x_mean)^2
[1] 4 1 0 1 4
> sum((x-x_mean)^2)
[1] 10
Not sure this is what you want, but
# create sample dataset: 5 groups, 10 values per group
set.seed(1)
df <- data.frame(group=rep(LETTERS[1:5],each=10),value=rnorm(50))
# calculate between-group sum of squares (SSB)
sum((aggregate(value~group,df,mean)$value-mean(df$value))^2)
# [1] 0.07938908
This calculates the mean by group using aggregate(...) and then sums the squared difference between that and the grand mean (mean(df$value)).
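Note that the formula in the question also weights each group by its size nᵢ. A sketch of that weighted version for the same df (every group here has nᵢ = 10, so it is simply 10 times the value above):
# weighted between-group SS: sum over groups of n_i * (group mean - grand mean)^2
grp_means <- tapply(df$value, df$group, mean)   # per-group means
n_i <- tapply(df$value, df$group, length)       # per-group sizes (all 10 here)
sum(n_i * (grp_means - mean(df$value))^2)
# [1] 0.7938908 (= 10 * 0.07938908)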

Calculate autocorrelation with lag u in R

Hi, I tried calculating the autocorrelation at lags u = 1, ..., 9.
I expect a 9x1 vector of autocorrelations. However, this code always gives me a 10x1 vector whose first term equals 1. I am not sure how to proceed.
# initialize a vector to store autocovariance
maxlag <- 9
varstore <- rep(NA,maxlag)
# Calculate Variance
varstore[1] <- sd(as.vector(sample1),na.rm=T)^2
# Estimate autocovariances for all residuals
for (lag in 1:maxlag)
  varstore[lag+1] <- mean(sample1[,1:(10-lag)] *
                          sample1[,(lag+1):10], na.rm=T)
print(round(varstore,3))
# calculate autocorrelations
corrstore <- varstore/varstore[1]
print(corrstore)
And this is what I get:
[1] 1.0000000 0.6578243 0.5670389 0.5292314 0.5090411 0.4743944 0.4841038 0.4756297
[9] 0.4275208 0.4048436
You get a vector of length 10 because assigning past the end of a vector extends it: varstore starts out with length maxlag = 9, but varstore[1] holds the variance and the loop then writes varstore[lag+1] for lag = 1, ..., 9, i.e. positions 2 through 10. So at the last step (lag = maxlag) a new tenth entry is created, and dividing by varstore[1] is what makes the first term of corrstore equal to 1. To see how assigning beyond the end extends a vector, try this for example:
v <- NA ## a vector of length 1
v[10] <- 2
v
[1] NA NA NA NA NA NA NA NA NA 2 ## you get a vector of length 10!
That said, why do you want a vector of length 9? Why not use the acf function? Here is the output of acf:
length(acf(1:10)$lag)
[1] 10
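If you really only want the 9 autocorrelations without the lag-0 term (which is always 1), one option is to let acf do the work and drop the first value. A sketch, assuming the series is the same vector you used for the variance (acf's estimator may differ slightly from your hand-rolled one):
x <- as.vector(sample1)
ac <- acf(x, lag.max = 9, plot = FALSE)$acf # lags 0 to 9, so 10 values
ac[-1] # drop lag 0, leaving the 9 autocorrelations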

Getting frequency values from histogram in R

I know how to draw histograms or other frequency/percentage related tables.
But now I want to know, how can I get those frequency values in a table to use after the fact.
I have a massive dataset, and I draw a histogram with a set binwidth. I want to extract the frequency value (i.e. the value on the y-axis) that corresponds to each bin and save it somewhere.
Can someone please help me with this?
Thank you!
The hist function has a return value (an object of class histogram):
R> res <- hist(rnorm(100))
R> res
$breaks
[1] -4 -3 -2 -1 0 1 2 3 4
$counts
[1] 1 2 17 27 34 16 2 1
$intensities
[1] 0.01 0.02 0.17 0.27 0.34 0.16 0.02 0.01
$density
[1] 0.01 0.02 0.17 0.27 0.34 0.16 0.02 0.01
$mids
[1] -3.5 -2.5 -1.5 -0.5 0.5 1.5 2.5 3.5
$xname
[1] "rnorm(100)"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
From ?hist:
Value
an object of class "histogram" which is a list with components:
breaks: the n+1 cell boundaries (= breaks if that was a vector). These are the nominal breaks, not with the boundary fuzz.
counts: n integers; for each cell, the number of x[] inside.
density: values f^(x[i]), as estimated density values. If all(diff(breaks) == 1), they are the relative frequencies counts/n and in general satisfy sum[i; f^(x[i]) (b[i+1]-b[i])] = 1, where b[i] = breaks[i].
intensities: same as density. Deprecated, but retained for compatibility.
mids: the n cell midpoints.
xname: a character string with the actual x argument name.
equidist: logical, indicating if the distances between breaks are all the same.
breaks and density provide just about all you need:
histrv <- hist(x)
histrv$breaks
histrv$density
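If you want those frequencies in a table for later use, one option is to bind the pieces into a data frame (a sketch using the res object from above):
# one row per bin: lower edge, upper edge, count, density
bins <- data.frame(lower = head(res$breaks, -1),
                   upper = res$breaks[-1],
                   count = res$counts,
                   density = res$density)
bins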
Just in case someone hits this question with ggplot's geom_histogram in mind, note that there is a way to extract the data from a ggplot object.
The following convenience function outputs a dataframe with the lower limit of each bin (xmin), the upper limit of each bin (xmax), the mid-point of each bin (x), as well as the frequency value (y).
## Convenience function (requires ggplot2)
library(ggplot2)
get_hist <- function(p) {
  d <- ggplot_build(p)$data[[1]]
  data.frame(x = d$x, xmin = d$xmin, xmax = d$xmax, y = d$y)
}
# make a dataframe for ggplot
set.seed(1)
x <- runif(100, 0, 10)
y <- cumsum(x)
df <- data.frame(x = sort(x), y = y)
# make geom_histogram
p <- ggplot(data = df, aes(x = x)) +
  geom_histogram(aes(y = cumsum(..count..)), binwidth = 1, boundary = 0,
                 color = "black", fill = "white")
Illustration:
hist = get_hist(p)
head(hist$x)
## [1] 0.5 1.5 2.5 3.5 4.5 5.5
head(hist$y)
## [1] 7 13 24 38 52 57
head(hist$xmax)
## [1] 1 2 3 4 5 6
head(hist$xmin)
## [1] 0 1 2 3 4 5
A related question I answered here (Cumulative histogram with ggplot2).
