How can I set the bin centre values of a histogram myself in R?

Let's say I have a data frame like the one below
mat <- data.frame(matrix(data = rexp(200, rate = 10), nrow = 100, ncol = 10))
I can then calculate a histogram for each of its columns using
matAllCols <- apply(mat, 2, hist)
Now if you look at matAllCols$breaks, you can see that the number of breaks is sometimes 11, sometimes 12, etc.
What I want is to control this: for example, the number of breaks should always be 12, and the distance between adjacent bin centres (which are stored in matAllCols$mids) should be 0.01.
Doing it for one column at a time seems simple, but when I tried to do it for all columns it did not work. Also, that only sets the breaks; how to set the mids is not straightforward either.
matAllCols <- apply(mat, 2, function(x) hist(x , breaks = 12))
Is there any way to do this?

You can solve the problem by passing all the breakpoints between histogram cells as breaks. (This is documented at stat.ethz.ch/R-manual/R-devel/library/graphics/html/hist.html, as @Colonel Beauvel said.)
set.seed(1); mat <- data.frame(matrix(data = rexp(200, rate = 10), nrow = 100, ncol = 10))
# You need to check the data range to decide the breakpoints.
range(mat) # [1] 0.002025041 0.483281274
# You can set the breakpoints manually.
matAllCols <- apply(mat, 2, function(x) hist(x , breaks = seq(0, 0.52, 0.04)))
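Because every column now uses the same breaks vector, the bin centres line up across columns. A quick check (a sketch, not part of the original answer):
# mids are the midpoints of the fixed breaks, so they are identical for every column
mids <- sapply(matAllCols, function(h) h$mids)
head(mids)  # 0.02, 0.06, 0.10, ... in every column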

You are looking for
set.seed(1)
mat <- data.frame(matrix(data = rexp(200, rate = 10), nrow = 100, ncol = 10))
matAllCols <- apply(mat, 2, function(x) hist(x , breaks = seq(0, 0.5, 0.05)))
or simply
x <- rexp(200, rate = 10)
hist(x[x>=0 & x <=0.5] , breaks = seq(0, 0.5, 0.05))
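As a side note (my addition, not part of the original answers): if you only need the counts and bin centres without drawing ten plots, hist() accepts plot = FALSE:
matAllCols <- apply(mat, 2, function(x) hist(x, breaks = seq(0, 0.5, 0.05), plot = FALSE))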

Related

Changing the colour palette based on quantile values in pheatmap

I am very new to R and I am trying to make a pheatmap from my data. I copied some existing code from a tutorial and, after some tweaking, it seems to fit my data pretty nicely. I also included some quantile code that should change the colour breaks based on my data, because most of the values are between 0 and 100 but a few of them are in the thousands.
I would like the small values to get more colour variation and keep the light yellow only for the most extreme values. From the legend it seems to be the other way around right now...
Can somebody help me to tweak the breaks?
Thanks!
Here is my code and the output heatmap (changed because it contained sensitive data).
library(pheatmap)
library(viridis)  # assuming inferno() comes from viridis (or viridisLite)

x <- read.table("proteins_cpm_commas.tsv", header = TRUE, row.names = 1)
x <- as.matrix(x)
x

# Note: mat_breaks (defined below) has to exist before this call is run.
pheatmap(x,
         drop_levels = TRUE,
         cluster_rows = FALSE,
         cluster_cols = FALSE,
         treeheight_col = 0,
         treeheight_row = 0,
         fontsize = 8,
         color = inferno(length(mat_breaks) - 1),
         breaks = mat_breaks,
         main = "Title")
mat_breaks <- seq(min(x), max(x), length.out = 20)
mat_breaks
quantile_breaks <- function(xs, n = 20) {
  breaks <- quantile(xs, probs = seq(0, 1, length.out = n))
  breaks[!duplicated(breaks)]
}
mat_breaks <- quantile_breaks(x, n = 20)
mat_breaks
While this answer does not use the pheatmap package (the pheatmap solutions I came up with were all rather hacky), I would recommend a solution using the ComplexHeatmap package. Example using mock data:
suppressPackageStartupMessages(
  lapply(c("ComplexHeatmap", "circlize", "viridisLite"),
         require, character.only = TRUE))
set.seed(23)
x <- matrix(rexp(2000, rate = .001), ncol = 20)
quantile_breaks <- function(xs, n = 10) {
  breaks <- quantile(xs, probs = seq(0, 1, length.out = n))
  breaks[!duplicated(breaks)]
}
mat_breaks <- quantile_breaks(x, n = 11)
col_fun_prop <- colorRamp2(quantile_breaks(x, n = 11), viridisLite::inferno(11))
Heatmap(x, col = col_fun_prop,
        heatmap_legend_param = list(
          labels = round(mat_breaks), at = mat_breaks, col_fun = col_fun_prop,
          title = "Prop", break_dist = 1))
Created on 2022-03-31 by the reprex package (v2.0.1)

Finding index of array of matrices, that is closest to each element of another matrix in R

I have an array Q of size nquantiles by nfeatures by nfeatures. Essentially, the slice Q[1,,] gives me the first quantile of my data across all nfeatures by nfeatures entries.
What I am interested in is taking another matrix M (again of size nfeatures by nfeatures), which represents some other data, and asking which quantile of Q each element of M lies in.
What would be the quickest way to do this?
I reckon I could do a double for loop across all rows and columns of the matrix M and come up with a solution similar to this: Finding the closest index to a value in R
But doing this over all nfeatures x nfeatures values will be very inefficient. I am hoping there might be a vectorized way of approaching this problem, but I am at a loss as to how to do it.
Here is a reproducible example of the slow approach, with O(N^2) complexity.
# Generate some data
set.seed(235)
data <- rnorm(n = 100, mean = 0, sd = 1)
list_of_matrices <- list(matrix(data = data[1:25],   ncol = 5, nrow = 5),
                         matrix(data = data[26:50],  ncol = 5, nrow = 5),
                         matrix(data = data[51:75],  ncol = 5, nrow = 5),
                         matrix(data = data[76:100], ncol = 5, nrow = 5))
# Get the quantiles (5 quantiles here)
Q <- apply(simplify2array(list_of_matrices), 1:2, quantile, prob = seq(0, 1, length = 5))
# dim(Q)
# Q should have dims nquantiles by nfeatures by nfeatures
# Generate some other matrix M (true data)
M <- matrix(data = rnorm(n = 25, mean = 0, sd = 1), nrow = 5, ncol = 5)
# Loop through rows and columns in M to find which index of the array is closest to element M[i,j]
results <- matrix(data = NA, nrow = 5, ncol = 5)
for (i in 1:nrow(M)) {
  for (j in 1:ncol(M)) {
    true_value <- M[i,j]
    # Subset Q to the ith and jth element (vector of nquantiles)
    quantiles <- Q[,i,j]
    results[i,j] <- which.min(abs(quantiles - true_value))
  }
}
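For illustration, a more compact array-based version (my sketch, not from the original post; it assumes the same Q, M, and results objects defined above):
# Broadcast M against every quantile slice at once, then take the index of the
# smallest absolute difference along the quantile dimension.
diffs <- abs(sweep(Q, 2:3, M))               # nquantiles x nfeatures x nfeatures
results_vec <- apply(diffs, 2:3, which.min)  # same answer as the double loop
all.equal(results, results_vec)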

Creating a matrix with random entries with given probabilities in R

I want to create a 100x100 matrix A with entry a_ij being randomly selected from the set {0,1} with P(a_ij=1)=0.2 and P(a_ij=0)=0.8.
This is what I’ve tried so far:
n <- 100
matrix <- matrix(0, 100, 100)
mynumbers <- c(1, 0)
myprobs <- c(0.2, 0.8)
for (i in 1:100) {
  for (j in 1:100) {
    matrix[i, j] <- sample(mynumbers, 1, replace = TRUE, prob = myprobs)
  }
}
matrix
I'm not sure about the sample size being 1, but this only seems to work if I choose size = 1... Is this the correct way to do it? Thank you in advance!
As @akrun noted, there are much easier ways. A 100 x 100 matrix means 10,000 entries. In rbinom, prob = .2 says that a success (a 1) occurs with P(a_ij = 1) = 0.2, and size = 1 means one Bernoulli trial per entry. The matrix() parameters should be pretty self-evident.
set.seed(2020)
trials <- rbinom(n = 10000, size = 1, prob = .2)
my.matrix <- matrix(trials, nrow = 100, ncol = 100)
or to more closely resemble your code
n <- 10000
mynumbers <- c(1, 0)
myprobs <- c(0.2, 0.8)
trials2 <- sample(x = mynumbers,
                  size = n,
                  replace = TRUE,
                  prob = myprobs)
my.matrix2 <- matrix(trials2, nrow = 100, ncol = 100)
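A quick sanity check (my addition, not part of the original answer): the proportion of ones should be close to 0.2 in either version.
mean(my.matrix)   # roughly 0.2
mean(my.matrix2)  # roughly 0.2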

how to draw a matrix image with R

I'm trying to draw a matrix image similar to this one using a known matrix. In the image, each square represents the frequency of the corresponding number on the vertical axis, and a darker square means a higher frequency of that number. For example, my known matrix could be generated as
Ture <- rep(8, 100)
PA <- rep(7, 100)
ED <- sample(6:8, 100, replace = T)
ER <- rep(0, 100)
IC1 <- sample(1:2, 100, replace = T)
NE <- sample(3:4, 100, replace = T)
BCV <- sample(5:7, 100, replace = T)
Oracle <- sample(5:6, 100, replace = T)
M <- rbind(Ture, PA, ED, ER, IC1, NE, BCV, Oracle)
Thanks very much!
Further to my comment above, you can do the following
image(M, axes = F, col = rev(gray.colors(12, start = 0, end = 1)))
axis(1, at = seq(0, 1, length.out = nrow(M)), labels = rownames(M))
axis(2, at = seq(0, 1, length.out = 11), labels = seq(0, 100, length.out = 11))
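If the goal is literally to show how often each value occurs in each row (as the question describes), rather than the raw matrix values, a possible sketch (my addition, not part of the original answer) is to tabulate the frequencies first and plot those:
# Count, for each row of M, how often each value 0-8 appears, then plot the counts
freq <- t(apply(M, 1, function(r) tabulate(factor(r, levels = 0:8))))
image(seq_len(nrow(freq)), 0:8, freq, axes = FALSE, xlab = "", ylab = "value",
      col = rev(gray.colors(12, start = 0, end = 1)))
axis(1, at = seq_len(nrow(freq)), labels = rownames(freq), las = 2)
axis(2, at = 0:8)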

dlm package in R: What is causing this error: `tsp<-`(`*tmp*`, value = c(1, 200, 1))

I am using the dlm package in R to perform Kalman filtering on the following simulated data.
## Multivariate time-series of dimension 200 and length 3
obsTimeSeries <- cbind(rnorm(200, 1, 2), rnorm(200, 2, 2), rnorm(200, 3, 2))
tseries <- ts(obsTimeSeries, frequency = 1)
kalmanBuild <- function(par) {
  kalmanMod <- dlm(FF = diag(1, 200), GG = diag(1, 200),
                   V = exp(par[1]) * diag(1, 200),
                   W = exp(par[2]) * diag(1, 200),
                   m0 = rep(0, 200), C0 = 1e100 * diag(1, 200))
  kalmanMod
}
kalmanMLE <- dlmMLE(tseries, parm = rep(0, 2), build = kalmanBuild)
kalmanMod <- kalmanBuild(kalmanMLE$par)
kalmanFilt <- dlmFilter (tseries, kalmanMod)
The code up to kalmanMod works fine. It gives an error in dlmFilter(tseries, kalmanMod) mentioning `tsp<-`(`*tmp*`, value = c(1, 200, 1)).
I tried to locate the error. The filtering itself seems to work fine, that is, the means and variances are estimated correctly, but the error occurs in the very last part, where the code assigns tsp(ans$a) <- ytsp.
Has anyone else faced this problem? If yes, what am I doing wrong?
Try changing your code to:
obsTimeSeries <- rbind(rnorm(200, 1, 2), rnorm(200, 2, 2), rnorm(200, 3, 2))
rather than:
obsTimeSeries <- cbind(rnorm(200, 1, 2), rnorm(200, 2, 2), rnorm(200, 3, 2))
Your time series was set up to be 3 series at 200 time points. If you change it to rbind you will have a ts with 200 series at 3 time points.
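A quick way to see the difference (my addition, a sketch of the dimension check the answer implies):
obs_cbind <- cbind(rnorm(200, 1, 2), rnorm(200, 2, 2), rnorm(200, 3, 2))
obs_rbind <- rbind(rnorm(200, 1, 2), rnorm(200, 2, 2), rnorm(200, 3, 2))
dim(ts(obs_cbind))  # 200 x 3: 200 time points of a 3-dimensional series
dim(ts(obs_rbind))  # 3 x 200: 3 time points of a 200-dimensional series, matching FF = diag(1, 200)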
