I am having trouble sampling within a restricted background, i.e. while excluding some possibilities.
I am trying to create an R function that shuffles genomic regions.
For now the function works well and follows these steps:
Retrieves the lengths and chromosomes of all the genomic regions in the query.
Calculates the possible starts as the total size of the corresponding chromosome minus the length of each query region.
Calculates the shuffled genomic regions: each start is sampled between 1 and the possible start, and each width is simply the width of the corresponding query region.
This function uses GenomicRanges objects. Here is its code:
GrShuffle <- function(regions, chromSizes = LoadChromSizes("hg19")) {
  # Gets all the regions lengths from the query.
  regionsLength <- regions@ranges@width
  # The possible starts are the chromosome sizes - the regions lengths.
  possibleStarts <- chromSizes[as.vector(regions@seqnames), ] - regionsLength
  # Gets all the random starts from sampling the possible starts.
  randomStarts <- unlist(lapply(possibleStarts, sample.int, size = 1))
  granges <- GRanges(regions@seqnames,
                     IRanges(start = randomStarts, width = regionsLength),
                     strand = regions@strand)
  return(granges)
}
But now I need to use a universe, i.e. another set of regions that determines the ranges in which the random regions may be placed. The universe works as a restriction on the sampling. It is another set of regions, like the query, and no shuffled region should fall outside of it.
Any clue on how to sample within ranges in R?
The lapply is important as it drastically reduces the execution time of the function compared to using a loop.
[EDIT]
Here is a reproducible example that does not use GenomicRanges, to simplify what I want to achieve as much as possible.
## GENERATES A RANDOM QUERY
chromSizes <- c(100,200,250)
names(chromSizes) <- c("1","2","3")
queryChrom <- sample(names(chromSizes), 100, replace = TRUE)
queryLengths <- sample(10, 100, replace = TRUE)
queryPossibleStarts <- chromSizes[queryChrom] - queryLengths
queryStarts <- unlist(lapply(queryPossibleStarts, sample.int, size = 1))
query <- data.frame(queryChrom, queryStarts, queryStarts + queryLengths)
colnames(query) <- c("chrom", "start", "end")
##
##SIMPLIFIED FUNCTION
# Gets all the regions lengths from the query.
regionsLength <- query$end - query$start
# The possible starts are the chromosome sizes - the regions lengths.
possibleStarts <- chromSizes[query$chrom] - regionsLength
# Gets all the random starts from sampling the possible starts.
randomStarts <- unlist(lapply(possibleStarts, sample.int, size = 1))
shuffledQuery <- data.frame(queryChrom, randomStarts, randomStarts + queryLengths)
colnames(shuffledQuery) <- c("chrom", "start", "end")
##
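One possible way to restrict the sampling to a universe is sketched below in base R. It assumes the universe is a data frame of non-overlapping intervals with the same chrom/start/end columns as the query, and that a region is valid only if it lies entirely inside a single universe interval. The helper name sampleStartInUniverse and the example universe are hypothetical, not part of any package.

```r
# Hypothetical helper: sample one start so that [start, start + len] lies
# entirely inside one of the universe intervals (integer coordinates).
sampleStartInUniverse <- function(len, uniStart, uniEnd) {
  # Number of valid start positions inside each universe interval.
  nValid <- pmax(uniEnd - uniStart - len + 1, 0)
  if (sum(nValid) == 0) return(NA_real_)  # the region fits nowhere
  # Pick an interval with probability proportional to its valid starts,
  # then pick a start uniformly inside that interval.
  k <- sample.int(length(nValid), size = 1, prob = nValid)
  uniStart[k] + sample.int(nValid[k], size = 1) - 1
}

set.seed(42)
# Made-up universe and query, for illustration only.
universe <- data.frame(chrom = c("1", "1", "2"),
                       start = c(10, 60, 100),
                       end   = c(40, 90, 180),
                       stringsAsFactors = FALSE)
query <- data.frame(chrom = c("1", "2", "1"),
                    start = c(5, 20, 70),
                    end   = c(12, 30, 75),
                    stringsAsFactors = FALSE)

lens <- query$end - query$start
newStarts <- mapply(function(chr, len) {
  u <- universe[universe$chrom == chr, ]
  sampleStartInUniverse(len, u$start, u$end)
}, query$chrom, lens, USE.NAMES = FALSE)

shuffled <- data.frame(chrom = query$chrom,
                       start = newStarts,
                       end   = newStarts + lens,
                       stringsAsFactors = FALSE)
```

Weighting the intervals by their number of valid starts makes every valid position equally likely; sampling the interval uniformly instead would favour positions in short intervals.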
I have a fairly large raster file with a resolution of 0.9 arcsec x 0.9 arcsec carrying values between 0 and 100 (and 255 for NA) called forest1. I want to aggregate this raster to the resolution of another raster (called dfr_2010_crop) which has a 0.5° x 0.5° resolution using the mean of values. Unfortunately, the strategy I am currently using requires too much memory. Namely, I am using
TreeCover = rasterToPoints(forest1, fun=NULL, spatial=T)
TreeCoverPercent <- rasterize(TreeCover, dfr_2010_crop, fun=function(x,...) {sum(x, na.rm=T)/(4*10^4)}, field=g )
whereby g is the correct field I have saved before. 4*10^4 is the number of 0.9 arcsec x 0.9 arcsec cells in a 0.5° x 0.5° cell. R reports that it cannot allocate a vector of size 7.9GB after running the first line. I have tried to solve this problem in the following ways:
rasterOptions(maxmemory=1e+08)
After this did not work, I tried to work in blocks, following the approach given at https://strimas.com/post/processing-large-rasters-in-r/, which uses the calc() function when working in blocks. However, I failed to customize it to my setting, as I do not know how to access the blocks as raster objects inside the loop. Here is my attempt:
canProcessInMemory(forest1, 1, TRUE)
#working in block
f_in <- f
f_out <- tempfile(fileext = ".tif")
# input and output rasters
r_in <- stack(f_in)
r_out <- raster(r_in)
# blocks
b <- blockSize(r_in)
print(b)
r_in <- readStart(r_in)
r_out <- writeStart(r_out, filename = f_out)
# loop over blocks
for (i in seq_along(b$row)) {
# read values for block
# format is a matrix with rows the cells values and columns the layers
v <- getValues(r_in, row = b$row[i], nrows = b$nrows[i])
# mean cell value across layers
v <- rasterToPoints(v, fun=NULL, spatial=T)
# write to output file
r_out <- writeValues(r_out, v, b$row[i])
}
# close files
r_out <- writeStop(r_out)
r_in <- readStop(r_in)
Looking forward to any suggestions and thanks for your help.
The ratio of your resolutions turns out to be an exact integer:
res1 = 0.9/(60*60) # resolution converted to degrees
res2 = 0.5
res.factor = res2 / res1
res.factor
# [1] 2000
You can double check this with your actual rasters using res.factor = res(dfr_2010_crop) / res(forest1). I can't do that because you did not provide a reproducible example.
This means that you can simply use raster::aggregate to change the resolution.
TreeCoverPercent = aggregate(forest1, res.factor)
If your res.factor were not an exact integer, you could still use this method by rounding to the nearest integer and then resampling to the final desired resolution.
TreeCoverPercent = aggregate(forest1, round(res.factor))
TreeCoverPercent = resample(TreeCoverPercent, dfr_2010_crop)
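To make concrete what aggregating by an integer factor with the mean amounts to, here is a minimal base R sketch on a plain matrix. It is only an illustration of the idea, not how the raster package implements aggregate; the helper name blockMean is hypothetical, and the handling of 255 as the NA flag is taken from the question.

```r
# Hypothetical helper: aggregate a matrix by an integer factor using block means.
blockMean <- function(m, fact, naValue = 255) {
  stopifnot(nrow(m) %% fact == 0, ncol(m) %% fact == 0)
  m[which(m == naValue)] <- NA          # treat the NA flag as missing
  # Block index of every cell, in column-major order like as.vector(m).
  rowBlock <- (row(m) - 1) %/% fact
  colBlock <- (col(m) - 1) %/% fact
  blockId  <- rowBlock + colBlock * (nrow(m) / fact)
  means <- tapply(as.vector(m), as.vector(blockId), mean, na.rm = TRUE)
  matrix(as.vector(means), nrow = nrow(m) / fact)
}

m <- matrix(1:16, nrow = 4)     # made-up 4 x 4 "raster"
coarse <- blockMean(m, fact = 2)

m2 <- m
m2[1, 1] <- 255                 # flagged as NA, so it is left out of the mean
coarse2 <- blockMean(m2, fact = 2)
```

Each output cell is the mean of a fact x fact block of input cells, which is why an exact integer ratio between the two resolutions makes the problem simple.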
In order to conduct least cost analyses in R, I am trying to produce a cost surface from a DEM. I want to treat land areas differently from sea areas (sea costs being half of the land travel costs on a plain). To do so, I have produced two transition layers, which I calculated with two different transition functions (altDiff_land and altDiff_sea). (I have omitted geoCorrection and the Tobler function here.)
library(raster)
library(gdistance)
# functions ---------------------------------------------------------------
altDiff_land <- function(x) (x[2] - x[1])
altDiff_sea <- function(x) if((x[2]==0) & (x[1]==0)) { 0 } else { 99 }
# DEM ---------------------------------------------------------------------
dem <- raster(nrows=18, ncols=18)
dem <- setValues(dem, runif(ncell(dem),min = 0, max = 50))
# cost surface LAND -------------------------------------------------------
dem_land <- dem
land_trans <- transition(x = dem_land, transitionFunction = altDiff_land,
directions = 8, symm = FALSE)
# cost surface SEA --------------------------------------------------------
dem_sea <- dem
dem_sea[dem_sea[]>40] <- 1
dem_sea[dem_sea[]!=1] <- 0
sea_trans <- transition(x = dem_sea, transitionFunction = altDiff_sea,
directions = 8, symm = FALSE)
The next step would be to stack the two transition layers (land_trans, sea_trans) in order to select the maximum values from them producing a final transition layer that I can use for least cost analysis.
I tried
a <- stack(land_trans,sea_trans)
conductance <- max(a)
plot(raster(conductance))
text(raster(conductance))
(which works in this example), but with my original DEM (450 MB) I have never obtained a result (even after aggregating the DEM by a factor of 500!). This is either because the max function is incredibly slow or because it does not work well with transition stacks/layers. Rasterizing the transition layers before the analysis is not an option, since the costDistance function from gdistance needs a transition layer.
Is there an alternative for max?
It seems that the two transition layers hold different numbers of values (x) in their matrices. So, when I try
a_test <- land_trans
a_test@transitionMatrix@x[a_test@transitionMatrix@x < sea_trans@transitionMatrix@x] <-
  sea_trans@transitionMatrix@x
as an alternative to max, all I get is
Warning message:
In a_test@transitionMatrix@x[a_test@transitionMatrix@x <
test_sea@transitionMatrix@x] <- test_sea@transitionMatrix@x :
number of items to replace is not a multiple of replacement length
But the DEMs are the same size! Why do the transition layers differ? Can I force them to have identical sizes?
I experimented a little with SOM. First I used MiniSOM in Python, but I was not impressed and switched to the kohonen package in R, which offers more features. I applied SOM to three use cases: (1) clustering in 2D with generated data, (2) clustering with higher-dimensional data (the built-in wine data set), and (3) outlier detection. I solved all three use cases, but I would like to raise a question about the outlier detection. For this purpose I used the vector som$distances, which contains a distance for each row of the input data set. Rows with exceptionally large distances can be outliers. However, I do not know how this distance is computed. The package description (https://cran.r-project.org/web/packages/kohonen/kohonen.pdf) states for this metric: "distance to the closest unit".
Could you please tell me how this distance is computed?
Could you please comment on the outlier detection I used? How would you have done it? (In the generated data set it really finds the outliers. In the real wine data set there are four relatively large values among the 177 wines. See the charts below. I really like the idea of using bar charts to depict this.)
Charts:
Generated data, 100 points in 2D in 5 distinct clusters and 2 outliers (Category 6 shows the outliers):
Distances shown for all 102 data points; the last two are the outliers, which were correctly identified. I repeated the test with 500 and 1000 data points, adding only 2 outliers each time. The outliers were also found in those cases.
Distances for the real wine data set with potential outliers:
The row id of the potential outliers:
# print the row id of the outliers
# the threshold 10 can be taken from the bar chart,
# below which the vast majority of the values fall
df_wine[df_wine$value > 10, ]
it produces the following output:
index value
59 59 12.22916
110 110 13.41211
121 121 15.86576
158 158 11.50079
My annotated code snippet:
library(kohonen)
library(ggplot2)
data(wines)
scaled_wines <- scale(wines)
# creating and training SOM
som.wines <- som(scaled_wines, grid = somgrid(5, 5, "hexagonal"))
summary(som.wines)
#looking for outliers, dist = distance to the closest unit
som.wines$distances
len <- length(som.wines$distances)
index_in_vector <- c(1:len)
df_wine<-data.frame(cbind(index_in_vector, som.wines$distances))
colnames(df_wine) <-c("index", "value")
po <-ggplot(df_wine, aes(index, value)) + geom_bar(stat = "identity")
po <- po + ggtitle("Outliers?") + theme(plot.title = element_text(hjust = 0.5)) + ylab("Distances in som.wines$distances") + xlab("Number of Rows in the Data Set")
plot(po)
# print the row id of the outliers
# the threshold 10 can be taken from the bar chart,
# below which the vast majority of the values fall
df_wine[df_wine$value > 10, ]
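Instead of reading the threshold off the bar chart, a data-driven cut-off could be tried. The sketch below swaps in a common robust rule (median plus three times the MAD); this is an assumption for illustration, not something the kohonen package prescribes. The small distances are made up, and only the four large values are the ones reported above.

```r
# Made-up distances for illustration: ten small values plus the four
# large ones reported in the output above.
distances <- c(2.1, 3.4, 1.8, 2.9, 4.2, 3.1, 2.5, 3.8, 2.2, 3.0,
               12.22916, 13.41211, 15.86576, 11.50079)

# Robust rule: flag values more than 3 MADs above the median.
# mad() is from the stats package and scales by 1.4826 by default.
threshold <- median(distances) + 3 * mad(distances)
flagged <- which(distances > threshold)
flagged
```

With a real som object, distances would be replaced by som.wines$distances; the factor 3 is a convention and may need tuning.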
Further Code Samples
Following the discussion in the comments, I am also posting the code snippets that were asked for. As far as I remember, I built the clustering code from samples I found in the description of the kohonen package (https://cran.r-project.org/web/packages/kohonen/kohonen.pdf). However, I am not completely sure, as it was more than a year ago. The code is provided as is, without any warranty :-). Please bear in mind that a particular clustering approach may perform with different accuracy on different data. I would also recommend comparing it with t-SNE on the wine data set (data(wines), available in R). Moreover, heat-maps could be implemented to show how the data are distributed with regard to individual variables. (In the above example with 2 variables this is not important, but it would be nice for the wine data set.)
Data Generation with Five Clusters and 2 Outliers and Plotting
library(stats)
library(ggplot2)
library(kohonen)
generate_data <- function(num_of_points, num_of_clusters, outliers=TRUE){
num_of_points_per_cluster <- num_of_points/num_of_clusters
cat(sprintf("#### num_of_points_per_cluster = %s, num_of_clusters = %s \n", num_of_points_per_cluster, num_of_clusters))
arr<-array()
standard_dev_y <- 6000
standard_dev_x <- 2
# for reproducibility setting the random generator
set.seed(10)
for (i in 1:num_of_clusters){
centroid_y <- runif(1, min=10000, max=200000)
centroid_x <- runif(1, min=20, max=70)
cat(sprintf("centroid_x = %s \n, centroid_y = %s", centroid_x, centroid_y ))
vector_y <- rnorm(num_of_points_per_cluster, mean=centroid_y, sd=standard_dev_y)
vector_x <- rnorm(num_of_points_per_cluster, mean=centroid_x, sd=standard_dev_x)
cluster <- array(c(vector_y, vector_x), dim=c(num_of_points_per_cluster, 2))
cluster <- cbind(cluster, i)
arr <- rbind(arr, cluster)
}
if(outliers){
#adding two outliers
arr <- rbind(arr, c(10000, 30, 6))
arr <- rbind(arr, c(150000, 70, 6))
}
colnames(arr) <-c("y", "x", "Cluster")
# WA to remove the first NA row
arr <- na.omit(arr)
return(arr)
}
scatter_plot_data <- function(data_in, couloring_base_indx, main_label){
df <- data.frame(data_in)
colnames(df) <-c("y", "x", "Cluster")
pl <- ggplot(data=df, aes(x = x,y=y)) + geom_point(aes(color=factor(df[, couloring_base_indx])))
pl <- pl + ggtitle(main_label) + theme(plot.title = element_text(hjust = 0.5))
print(pl)
}
##################
# generating data
data <- generate_data(100, 5, TRUE)
print(data)
scatter_plot_data(data, couloring_base_indx<-3, "Original Clusters without Outliers \n 102 Points")
Preparation, Clustering and Plotting
I used the hierarchical clustering approach with the Kohonen Map (SOM).
normalising_data <- function(data){
# normalizing data points not the cluster identifiers
mtrx <- data.matrix(data)
umtrx <- scale(mtrx[,1:2])
umtrx <- cbind(umtrx, factor(mtrx[,3]))
colnames(umtrx) <-c("y", "x", "Cluster")
return(umtrx)
}
train_som <- function(umtrx){
# unsupervised learning
set.seed(7)
g <- somgrid(xdim=5, ydim=5, topo="hexagonal")
#map<-som(umtrx[, 1:2], grid=g, alpha=c(0.005, 0.01), radius=1, rlen=1000)
map<-som(umtrx[, 1:2], grid=g)
summary(map)
return(map)
}
plot_som_data <- function(map){
par(mfrow=c(3,2))
# to plot some charactristics of the SOM map
plot(map, type='changes')
plot(map, type='codes', main="Mapping Data")
plot(map, type='count')
plot(map, type='mapping') # how many data points are held by each neuron
plot(map, type='dist.neighbours') # the darker the colours are, the closer the point are; the lighter the colours are, the more distant the points are
#to switch the plot config to the normal
par(mfrow=c(1,1))
}
plot_disstances_to_the_closest_point <- function(map){
# to see which neuron is assigned to which value
map$unit.classif
#looking for outliers, dist = distance to the closest unit
map$distances
max(map$distances)
len <- length(map$distances)
index_in_vector <- c(1:len)
df<-data.frame(cbind(index_in_vector, map$distances))
colnames(df) <-c("index", "value")
po <-ggplot(df, aes(index, value)) + geom_bar(stat = "identity")
po <- po + ggtitle("Outliers?") + theme(plot.title = element_text(hjust = 0.5)) + ylab("Distances in som$distances") + xlab("Number of Rows in the Data Set")
plot(po)
return(df)
}
###################
# unsupervised learning
umtrx <- normalising_data(data)
map<-train_som(umtrx)
plot_som_data(map)
#####################
# creating the dendogram and then the clusters for the neurons
dendogram <- hclust(object.distances(map, "codes"), method = 'ward.D')
plot(dendogram)
clusters <- cutree(dendogram, 7)
clusters
length(clusters)
#visualising the clusters on the map
par(mfrow = c(1,1))
plot(map, type='dist.neighbours', main="Mapping Data")
add.cluster.boundaries(map, clusters)
Plots with the Clusters
You can also create nice heat-maps for selected variables, but I have not implemented them because they do not really make sense for clustering with 2 variables. If you implement them for the wine data set, please add the code and the charts to this post.
#see the predicted clusters with the data set
# 1. add the vector of the neuron ids to the data
mapped_neurons <- map$unit.classif
umtrx <- cbind(umtrx, mapped_neurons)
# 2. taking the predicted clusters and adding them to the original matrix
# very good description of the apply functions:
# https://www.guru99.com/r-apply-sapply-tapply.html
get_cluster_for_the_row <- function(x, cltrs){
return(cltrs[x])
}
predicted_clusters <- sapply (umtrx[,4], get_cluster_for_the_row, cltrs<-clusters)
mtrx <- cbind(mtrx, predicted_clusters)
scatter_plot_data(mtrx, couloring_base_indx<-4, "Predicted Clusters with Outliers \n 100 points")
See the predicted clusters below in case there were outliers and in case there were not.
I am not entirely sure, but the distance between two objects in a given dimensional space is most often measured as the Euclidean distance. For example, two points A(x=3, y=4) and B(x=6, y=8) in a two-dimensional space are 5 distance units apart, which is the result of squareroot((3-6)^2 + (4-8)^2). The same applies to data of higher dimension: one more squared difference is added for each extra dimension. If A(x=3, y=4, z=5) and B(x=6, y=8, z=7), then the distance is squareroot((3-6)^2 + (4-8)^2 + (5-7)^2), and so on.
In kohonen, I think that after the model has finished the training phase, the algorithm calculates the distance of each datum to all nodes and then assigns it to the nearest node (the node with the smallest distance to it). The values in the 'distances' variable returned by the model are then the distances of every datum to its nearest node.
One thing to note from your script is that the algorithm does not measure the distance directly on the original values, because the data were scaled before being fed to the model. The distance measurement is applied to the scaled version of the data. Scaling is a standard procedure to prevent one variable from dominating the others.
I believe that your method is acceptable, because the values in the 'distances' variable are the distances of each datum to its nearest node. So if the distance between a datum and its nearest node is high, the distance of that datum to every other node is at least as high.
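The two calculations described above can be sketched in base R as follows. The codebook and data values are made up, and the function name euclid is hypothetical. The distance function the package actually uses may differ (for instance a sum of squares rather than its square root), so this only illustrates the idea of "distance to the nearest node".

```r
# Euclidean distance between two points of any dimension.
euclid <- function(a, b) sqrt(sum((a - b)^2))

d2 <- euclid(c(3, 4), c(6, 8))         # the 2D example above
d3 <- euclid(c(3, 4, 5), c(6, 8, 7))   # the 3D example above

# Made-up codebook (one row per node) and data (one row per datum).
codes <- rbind(c(0, 0), c(5, 5), c(10, 0))
dat   <- rbind(c(1, 1), c(6, 4), c(9, 1), c(30, 30))

# Distance of every datum to every node: rows are nodes, columns are data.
d <- apply(dat, 1, function(p) apply(codes, 1, euclid, b = p))

nearestNode <- apply(d, 2, which.min)  # node each datum is assigned to
nearestDist <- apply(d, 2, min)        # analogue of the 'distances' vector
```

In this toy data the last datum sits far from every node, so its nearest-node distance stands out, which is the property the outlier check relies on.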
I have a raster stack with several layers, and for each cell I want to calculate the sum over a different selection of layers, finally generating a new layer. Does anyone have a good suggestion using calc, overlay, or some other raster calculation in R?
I can do the calculation with loops, but that takes a long time and a lot of memory when I have many layers. My script is as follows:
library(raster)
make_calc <- function(rr, start, end) {
  # Convert the stack and the index rasters to arrays (rows x cols x layers).
  rr <- as.array(rr)
  start <- as.array(start)
  end <- as.array(end)
  dms <- dim(rr)
  tmp <- array(NA, dim = dms[1:2])
  for (i in 1:dms[1]) {
    for (j in 1:dms[2]) {
      # Sum the layers from start to end for this cell.
      tmp[i,j] <- sum(rr[i,j,start[i,j,1]:end[i,j,1]], na.rm = TRUE)
    }
  }
  return(tmp)
}
rr <- raster(res = 10)
rr[] <- 1
rr <- stack(rr, rr, rr, rr)
start <- raster(res = 10)
start[] <- sample(1:2, ncell(start), replace = TRUE)
end <- raster(res = 10)
end[] <- sample(3:4, ncell(end), replace = TRUE)
result <- make_calc(rr, start, end)
Why are you coercing into arrays? You can easily collapse a raster into a vector, but that does not even seem necessary here. In the future, please try to be clearer about what your expected outcome is.
Based on your code, I really don't know what you are getting at. I am going to take a few guesses: summing specified rasters in the stack, drawing a random sample across the rasters to be summed and, finally, drawing a random sample of cells to be summed.
For a sum on specified rasters in a stack, you can just index what you are after in the stack using a double bracket. In this case, rasters 1 and 3 in the stack would be the only ones summed.
library(raster)
rr <- raster(res = 10)
rr[] <- 1
rr <- stack(rr, rr, rr, rr)
( sum_1_3 <- calc(rr[[c(1,3)]], sum) )
If you are wanting a random sample of the values across rasters, for every cell, you could write a function that is passed to calc. Here is an example that grabs a random sample of n size, across the raster layers values and sums them.
rs.sum <- function(x, n=2) {sum( x[sample(1:length(x),n)], na.rm=TRUE)}
rs.sum.raster <- calc(rr, rs.sum)
If you are wanting to apply a function to a limited random selection of cells, you could create a random sample of the raster that would be used as an index. Here we create a random sample of cells, create an empty raster and pipe the sum of rasters 1 and 2 (in the stack) based on the random sample cell index. A raster in the stack is indexed using the double bracket and the raster values are indexed using a single bracket so, for raster 1 in the stack with limiting to the values in the random sample you would use: rr[[1]][rs]
rs <- sample(1:ncell(rr[[1]]), 300)
r.sum <- rr[[1]]
r.sum[] <- NA
r.sum[rs] <- rr[[1]][rs] + rr[[2]][rs]
plot(r.sum)
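If the goal is instead what the loop in the question computes, i.e. summing for each cell the layers between a per-cell first and last index, the loop can be avoided with a logical mask. The sketch below uses a plain cells-by-layers matrix with made-up values; with the raster package, getValues() on a stack returns a matrix of that shape, but wiring it to the rasters is left as an assumption.

```r
# Made-up values: 6 cells (rows) by 4 layers (columns), all ones as in the question.
vals  <- matrix(1, nrow = 6, ncol = 4)
first <- c(1, 2, 1, 2, 1, 2)   # per-cell first layer to include
last  <- c(3, 4, 4, 3, 3, 4)   # per-cell last layer to include

# TRUE where the layer index lies inside the cell's [first, last] range.
# first and last are recycled down the rows, so row i uses first[i], last[i].
keep <- col(vals) >= first & col(vals) <= last

# Zero out the layers outside the range and sum across layers.
rangeSum <- rowSums(vals * keep, na.rm = TRUE)
rangeSum
```

The result is one value per cell, which could then be written back into an empty raster with the same geometry.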
I would like to build a hexbin plot where, for every bin, the ratio between the number of class 1 and class 2 points falling into that bin is plotted (either log-transformed or not).
library(hexbin)
x <- rnorm(10000)
y <- rnorm(10000)
h <- hexbin(x,y)
plot(h)
l <- as.factor(c( rep(1,2000), rep(2,8000) ))
Any suggestions on how to implement this? Is there a way to introduce function to every bin based on bin statistics?
@cryo111's answer has the most important ingredient: IDs = TRUE. After that, it's just a matter of figuring out what you want to do with the Inf's and how much you need to scale the ratios by to get integers that will produce a pretty plot.
library(hexbin)
library(data.table)
set.seed(1)
x = rnorm(10000)
y = rnorm(10000)
h = hexbin(x, y, IDs = TRUE)
# put all the relevant data in a data.table
dt = data.table(x, y, l = c(1,1,1,2), cID = h@cID)
# group by cID and calculate whatever statistic you like
# in this case, ratio of 1's to 2's,
# and then Inf's are set to be equal to the largest ratio
dt[, list(ratio = sum(l == 1)/sum(l == 2)), keyby = cID][,
ratio := ifelse(ratio == Inf, max(ratio[is.finite(ratio)]), ratio)][,
# scale up (I chose a scaling manually to get a prettier graph)
# and convert to integer and change h
as.integer(ratio*10)] -> h@count
plot(h)
You can determine the number of class 1 and class 2 points in each bin by
library(hexbin)
library(plyr)
x=rnorm(10000)
y=rnorm(10000)
#generate hexbin object with IDs=TRUE
#the object includes then a slot with a vector cID
#cID maps point (x[i],y[i]) to cell number cID[i]
HexObj=hexbin(x,y,IDs = TRUE)
#find count statistics for first 2000 points (class 1) and the rest (class 2)
CountDF=merge(count(HexObj@cID[1:2000]),
count(HexObj@cID[2001:length(x)]),
by="x",
all=TRUE
)
#replace NAs by 0
CountDF[is.na(CountDF)]=0
#check if all points are included
sum(CountDF$freq.x)+sum(CountDF$freq.y)
But plotting them is another story. For instance, what if there are no class 2 points in a bin? The ratio is not defined then.
In addition, as far as I understand, hexbin is just a two-dimensional histogram. As such, it counts the number of points that fall into a given bin. I do not think that it can handle non-integer data, as in your case.
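One way around the undefined ratio is to add a small pseudocount before dividing. The base R sketch below uses made-up bin IDs and class labels; with hexbin, the IDs would come from the cID slot. The pseudocount of 0.5 is an arbitrary assumption, not a recommendation from the hexbin package.

```r
# Made-up bin IDs and class labels for nine points.
cellId <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)
cls    <- c(1, 2, 2, 1, 1, 2, 2, 2, 1)

# Count class 1 and class 2 points per bin; fixing the levels guarantees
# both columns exist even if one class is absent everywhere.
counts <- table(cellId, factor(cls, levels = c(1, 2)))

# The pseudocount keeps the ratio finite when a bin has no class 2 points.
logRatio <- log((counts[, "1"] + 0.5) / (counts[, "2"] + 0.5))
logRatio
```

Bin 2 has no class 2 points, yet its log ratio stays finite; the resulting values are real numbers, so they would still need rescaling or a different plotting route before being shown as bin colours.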