Binning data by order with selected bin size - r

I have a vector of values that I want to order by value in descending order, then bin in bins of size 100, with the final bin containing all of the remaining values.
#generate random data
set.seed(1)
x <- rnorm(8366)
#In descending order
y <- x[order(-x)]
Now I have used cut to bin by value before, but I want the bins to be of finite size. So the first bin will have the first 100 values in y, the second bin the next hundred etc until I have ten bins, with the final bin containing all of the remaining values. I am not sure how to go about this.

The below will return the bins as a list:
mylist <- split(y, c(rep(1:9, each = 100), rep(10, 8366 - 900)))
The first 9 elements contain 100 records each and the rest are stored in the 10th element.
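If you would rather not hard-code the lengths, the same idea can be sketched with a computed bin index. The names bin_size, n_bins and bin_id below are made up for illustration, and the sketch assumes y is already sorted in descending order:

```r
set.seed(1)
x <- rnorm(8366)
y <- sort(x, decreasing = TRUE)

bin_size <- 100   # assumed size of each regular bin
n_bins   <- 10    # assumed total number of bins, the last one is the catch-all

# Positions 1..100 get id 1, 101..200 get id 2, and so on;
# pmin() caps the id so everything past bin 9 lands in bin 10.
bin_id <- pmin(ceiling(seq_along(y) / bin_size), n_bins)

bins <- split(y, bin_id)
lengths(bins)
```

Changing bin_size or n_bins then adjusts the layout without editing the rep() calls by hand.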

I'm not sure what you mean by "bin". Do you want to summarize each 100 values in some way? For example, sum them? If so, here's one solution:
#generate random data
set.seed(1)
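x <- rnorm(8366)
# pad the length up to the next multiple of 100 with zeros
n <- ceiling(length(x)/100) * 100
y <- rep(0, n)
#In descending order
y[1:length(x)] <- x[order(-x)]
# one row per group of 100 consecutive values
X <- matrix(y, ncol = 100, byrow = TRUE)
# sum each row, i.e. each group of 100
rowSums(X)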

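You can use cut, with the break points taken from the sorted values. Note that rev(y) is in ascending order, so this example fills the bins of 100 starting from the smallest values: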
res <- cut(y,c(rev(y)[seq(1,901,100)],Inf),right = F)
table(res)
# res
# [-3.67,-2.33) [-2.33,-2.05) [-2.05,-1.87) [-1.87,-1.72)  [-1.72,-1.6)
#           100           100           100           100           100
#   [-1.6,-1.5)  [-1.5,-1.41) [-1.41,-1.34) [-1.34,-1.27)   [-1.27,Inf)
#           100           100           100           100          7466

Related

Fill matrix where submatrices are dimensions of value in a vector (vector can be random numbers) in R

I have a matrix that will represent infection values of bats (can either be 1 or 0). The animals live in larger units ("roost", each roost is a submatrix), and the full matrix is the population. For starters, I am trying to fill my matrix with submatrices, values all equal to 1.
The code is currently working for roosts where number of bats is all the same. Ex:
# Define our variables
numRoosts = 3
# Uniformly sized roosts....
roostSizes = rep(3, numRoosts) # Each roost has 3 bats, looks like c(3, 3, 3)
# Adjacency matrix describing connections between bats in all roosts
batAdjacencyMatrix <- as.data.frame(matrix(0, nrow = sum(roostSizes), ncol = sum(roostSizes)))
colnames(batAdjacencyMatrix) = rownames(batAdjacencyMatrix) = paste0("Bat_", 1:sum(roostSizes))
# Start filling in the network structure
n = 0
for(size in roostSizes){
# Fill the submatrices with dimension 'size x size' with 1's to create the subroost network
# Line below works for uniform roost sizes, but is buggy for nonuniform roost sizes
batAdjacencyMatrix[(1+n*size):((n+1)*size), (1+n*size):((n+1)*size)] <- 1 # Matrix indexing
# Increment the counter
n = n + 1
}
This gives me the output I wanted.
The issue is when I try to change the roosts to nonuniform sizes:
numRoosts = 3
# If you want variable sized roosts....
minRoostPopulation = 2 # Min number of bats in a roost
maxRoostPopulation = 5 # Maximum number of bats in a roost
roostSizes <- round(runif(numRoosts, minRoostPopulation, maxRoostPopulation)) #Here c(5, 2, 4)
batAdjacencyMatrix <- as.data.frame(matrix(0, nrow = sum(roostSizes), ncol = sum(roostSizes)))
colnames(batAdjacencyMatrix) = rownames(batAdjacencyMatrix) = paste0("Bat_", 1:sum(roostSizes))
# Start filling in the network structure
n = 0
for(size in roostSizes){
# Fill the submatrices with dimension 'size x size' with 1's to create the subroost network
# Line below works for uniform roost sizes, but is buggy for nonuniform roost sizes
batAdjacencyMatrix[(1+n*size):((n+1)*size), (1+n*size):((n+1)*size)] <- 1 # Matrix indexing
# Increment the counter
n = n + 1
}
There's something wrong with the indexing in my for loop; I can see it when I put the numbers in manually. But I can't figure out how to define my submatrices so that each one starts where the previous one ended, shifted down and to the right. Any thoughts? Thanks in advance!
Your indexing is indeed somewhat off. If I were to modify your approach, I would do it like this:
numRoosts = 3
roostSizes <- c(5, 2, 4)
batAdjacencyMatrix <- as.data.frame(matrix(0, nrow = sum(roostSizes), ncol = sum(roostSizes)))
colnames(batAdjacencyMatrix) = rownames(batAdjacencyMatrix) = paste0("Bat_", 1:sum(roostSizes))
# Start filling in the network structure
n = 0
for(size in roostSizes){
# sum of sizes of preceding roosts
size.prior <- sum(head(roostSizes, n))
# indices of the current roost
ind <- size.prior + (1:size)
batAdjacencyMatrix[ind, ind] <- 1 # Matrix indexing
# Increment the counter
n = n + 1
}
However, an easier way is to use magic::adiag() which can build such block-diagonal matrices:
library(magic)
roostSizes <- c(5, 2, 4)
# create the three matrices of 1's
mat <- lapply(roostSizes, function(n) matrix(1, n, n))
# bind them diagonally
batAdjacencyMatrix <- do.call(adiag, mat)
rownames(batAdjacencyMatrix) <- colnames(batAdjacencyMatrix) <-
paste0('Bat_', seq_len(sum(roostSizes)))
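If you would rather avoid both the loop and an extra package, here is a rough base R sketch of the same block-diagonal structure. It relies on the observation that two bats are connected exactly when they belong to the same roost; roost_id and adj are names I made up for illustration:

```r
roostSizes <- c(5, 2, 4)
total <- sum(roostSizes)

# roost_id[i] is the roost that bat i belongs to: 1 1 1 1 1 2 2 3 3 3 3
roost_id <- rep(seq_along(roostSizes), times = roostSizes)

# outer() compares every pair of bats; multiplying by 1 turns TRUE/FALSE into 1/0
adj <- outer(roost_id, roost_id, "==") * 1

bat_names <- paste0("Bat_", seq_len(total))
dimnames(adj) <- list(bat_names, bat_names)
adj
```

This returns a plain numeric matrix rather than a data frame; wrap it in as.data.frame() if you need the original structure.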

Simulations with Columns as Input for Each Row in R

I'm trying to run a simulation with a combination of static variables and values within columns, sum the output, and store the individual outputs in a vector or dataframe.
mean1 <- 2.4
sd1 <- 0.5
df <- data.frame(x = c(2, 3, 4), y = c(5, 6, 7))
What I want to do is :
divide each row in column x by each row in column y
multiply by a normal distribution using mean1 and sd1
sum the resultant row values, so I'd have a single value per simulation.
I think I understand how I'd get the value if I wasn't going row by row, so for row 1 it'd be:
v1 <- replicate(n = 1, expr = rnorm(n = 100, mean = mean1, sd = sd1) * 2 / 5, simplify = TRUE)
But where I'm drawing a blank is how to run that for each row, then sum the results of each row for each simulation, in this case sum the three values from each of the three rows 100 times, so I'd have an output with 100 values.
Dividing x by y is constant so you can do it once and save it in a variable. You can then use replicate 100 times and generate 1 random number at every iteration to multiply and take sum.
val <- df$x/df$y
n <- 100
replicate(n, {
sum(val * rnorm(n = 1, mean = mean1, sd = sd1))
})
Or you can also generate 100 random values together and sum them with sapply.
r_val <- rnorm(n, mean = mean1, sd = sd1)
sapply(r_val, function(x) sum(val * x))
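Two variations may be worth considering, sketched below under my own reading of the question. Both snippets above use one random draw per simulation, shared by all rows; in that case sum(val * z) is just z * sum(val), so no loop is needed. If each row should instead get its own independent draw, a matrix of draws does the job. The names shared, z and independent are hypothetical:

```r
set.seed(42)
mean1 <- 2.4
sd1 <- 0.5
df <- data.frame(x = c(2, 3, 4), y = c(5, 6, 7))
val <- df$x / df$y
n <- 100

# Shared draw per simulation: sum(val * z) equals z * sum(val)
shared <- sum(val) * rnorm(n, mean = mean1, sd = sd1)

# Independent draw per row: one column per simulation, one row per row of df
z <- matrix(rnorm(length(val) * n, mean = mean1, sd = sd1), nrow = length(val))
# val is recycled down each column, so column j holds val * z[, j]
independent <- colSums(val * z)
```

Both results are vectors of length n, one value per simulation.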
Ronak answered my question with:
val <- df$x/df$y
n <- 100
replicate(n, {
sum(val * rnorm(n = 1, mean = mean1, sd = sd1))
})
I had to add back the df$column reference (df$x here) as opposed to creating a constant since the actual application had more variables and math that was more complicated than the example, but the structure worked perfectly.
Thank you!

Select a sample at random and use it to generate 1000 bootstrap samples

I would like to generate 1000 samples of size 25 from a standard normal distribution, calculate the variance of each one, and create a histogram. I have the following:
samples = replicate(1000, rnorm(25,0,1), simplify=FALSE)
hist(sapply(samples, var))
Then I would like to randomly select one sample from those 1000 samples and take 1000 bootstraps from that sample. Then calculate the variance of each and plot a histogram. So far, I have:
sub.sample = sample(samples, 1)
Then this is where I'm stuck, I know a for loop is needed for bootstrapping here so I have:
rep.boot2 <- numeric(lengths(sub.sample))
for (i in 1:lengths(sub.sample)) {
index2 <- sample(1:1000, size = 25, replace = TRUE)
a.boot <- sub.sample[index2, ]
rep.boot2[i] <- var(a.boot)[1, 2]
}
but running the above produces an "incorrect number of dimensions" error. Which part is causing the error?
I can see 2 problems here. One is that you are trying to subset sub.sample with [index2, ], as you would a matrix or data frame, but it is actually a list of length 1.
a.boot <- sub.sample[index2, ]
To fix this, you can change
sub.sample = sample(samples, 1)
to
sub.sample = as.vector(unlist(sample(samples, 1)))
The second problem is that you are generating a sample of 25 indexes from between 1 and 1000
index2 <- sample(1:1000, size = 25, replace = TRUE)
but then you try to extract these indexes from a sample that has only 25 elements. So you will end up with mostly NA values in a.boot.
If I understand what you want to do correctly then this should work:
samples = replicate(1000, rnorm(25,0,1), simplify=FALSE)
hist(sapply(samples, var))
sub.sample = as.vector(unlist(sample(samples, 1)))
# numeric vector to hold the 1000 bootstrap variances
rep.boot2 <- numeric(1000)
for (i in 1:1000) {
  # resample the 25 observations with replacement
  index2 <- sample(1:25, size = 25, replace = TRUE)
  a.boot <- sub.sample[index2]
  rep.boot2[i] <- var(a.boot)
}
hist(rep.boot2)
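For what it's worth, the loop is not strictly required. A minimal sketch of the same bootstrap with replicate, assuming the goal is 1000 resamples of the 25 observations (boot.vars is a name I picked for illustration):

```r
set.seed(123)
samples <- replicate(1000, rnorm(25), simplify = FALSE)

# [[ ]] pulls out the numeric vector itself rather than a list of length 1
sub.sample <- samples[[sample(length(samples), 1)]]

# sample(v, replace = TRUE) draws length(v) values from v with replacement
boot.vars <- replicate(1000, var(sample(sub.sample, replace = TRUE)))
```

hist(boot.vars) then plots the bootstrap distribution of the variance.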

Sampling from a subset of data

I have the following problem.
I have multiple subarrays (say 2) that I have populated with character labels (1, 2, 3, 4, 5). My algorithm selects labels at random based on occurrence probabilities.
How can I get R to instead select labels 1:3 for subarray 1 and 4:5 for subarray 2, say, without using subsetting (i.e., [])? That is, I want a random subset of labels to be selected for each subarray, instead of assigning labels to each subarray manually using [].
I know sample() should help.
Using subsetting (which I don't want) one would do
x <- 1:5
sample(x[1:3], size, prob = probs[1:3])
but this assigns labels 1:3 to ALL subarrays.
Would
sample(sample(x), size, replace = TRUE, prob = probs)
work?
Any ideas? Please let me know if this is unclear.
Here is a small example, which selects labels from 1:5 for each of 10 subarrays.
set.seed(1)
N <- 10
K <- 2
Hstar <- 5
probs <- rep(1/Hstar, Hstar)
perms <- 5
## Set up container(s) to hold the identity of each individual from each permutation ##
num.specs <- ceiling(N / K)
## Create an ID for each haplotype ##
haps <- 1:Hstar
## Assign individuals (N) to each subpopulation (K) ##
specs <- 1:num.specs
## Generate permutations, assume each permutation has N individuals, and sample those individuals' haplotypes from the probabilities ##
gen.perms <- function() {
sample(haps, size = num.specs, replace = TRUE, prob = probs) # I would like each subarray to contain a random subset of 1:5.
}
pop <- array(dim = c(perms, num.specs, K))
for (i in 1:K) {
pop[,, i] <- replicate(perms, gen.perms())
}
pop
Hopefully this helps.
I think what you actually want is something like this:
num.specs <- 3
haps[sample(seq(haps),size = num.specs,replace = F)]
# [1] 3 5 4
That is, a random subset of your vector haps?
Not quite what you want (returns list of matrices instead of 3D array) but this might help
lapply(split(1:5, cut(1:5, breaks=c(0, 2, 5))), function(i) matrix(sample(i, 25, replace=TRUE), ncol=5))
Use cut and split to partition your vector of character labels before sampling them. Here I split your character labels at the value 2. Also, rather than sampling 5 numbers 5 times, you can sample 25 numbers once, and convert to matrix.
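If the result does need to be a 3D array, one possible sketch is to partition the labels at random into K groups and fill one slice per group. This is only my guess at the intended behaviour; groups, labs and idx are hypothetical names, and I use sample.int on positions because sample() treats a single number as 1:n when a group has just one label:

```r
set.seed(1)
K <- 2
Hstar <- 5
perms <- 5
num.specs <- 5
haps <- 1:Hstar
probs <- rep(1 / Hstar, Hstar)

# Shuffle the labels, then deal them out to the K subarrays in turn
groups <- split(sample(haps), rep_len(seq_len(K), Hstar))

pop <- array(dim = c(perms, num.specs, K))
for (k in seq_len(K)) {
  labs <- groups[[k]]
  # sample positions within the group, weighted by the labels' probabilities
  idx <- sample.int(length(labs), perms * num.specs, replace = TRUE,
                    prob = probs[labs])
  pop[, , k] <- labs[idx]
}
pop
```

Each slice pop[, , k] then only contains labels from its own group, and the groups do not overlap.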

How to aggregate a binary raster into percentage in R

I have a binary raster (r) at 1 meter resolution and I want to convert it into a percentage value at 4 m resolution. Each pixel of the new raster would hold the percentage of 1s among the 16 original pixels it covers. I looked at the raster package, which has an aggregate function. However, this doesn't work:
newras <-aggregate(r, fact=4, fun= percent)
What you do does not work because there is no function called percent. But you can make one. For a binary raster, the mean of the values is the fraction of 1s, so you multiply that by 100 to get the percentage.
Example data
library(raster)
r <- raster()
set.seed(0)
values(r) <- sample(0:1, ncell(r), replace=TRUE)
Aggregate
a <- aggregate(r, 4, fun=function(x,...) 100 * mean(x))
# or
a <- 100 * aggregate(r, 4, mean)
Consider NA values
r[sample(ncell(r), 0.9 * ncell(r))] <- NA
# Make a function and use it
percentage <- function(x, ...) { x <- na.omit(x); 100 * mean(x) }
a <- aggregate(r, 4, fun=percentage)
# or do
a <- 100 * aggregate(r, 4, fun=mean, na.rm=TRUE)
Here's a method just using matrices. I am using a 40 by 40 matrix. The method will require some thought if dimensions are not multiples of 4.
Original matrix:
mtx <- matrix(sample(0:1, 40^2, TRUE), 40, 40)
Indices to use as arguments for grouping:
inds <- Map(seq, seq(1, 37, 4), seq(4, 40, 4))
Group into 4 by 4 blocks. blockarray has 16 rows (each element within groups) and 100 columns (representing groups). Note that 40 x 40 = 16 x 100.
blockarray <- mapply(function(i, j) mtx[i, j],
rep(inds, times = 10),
rep(inds, each = 10))
To get the percentage matrix:
pcts <- matrix(colMeans(blockarray)*100, 10, 10)
Visual inspection of results:
image(mtx, zlim = 0:1, col = c("white", "black"))
image(pcts, zlim = c(0, 100), col = colorRampPalette(c("white", "black"))(11))
Validation of results:
sum(mtx[1:4, 5:8])/16*100
pcts[1, 2]
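A more compact variant of the same matrix idea, as a rough sketch: label every cell with the block row and block column it falls in, then average within each label pair using tapply. block_percent is a helper name I made up, and it still assumes the dimensions are multiples of the factor:

```r
block_percent <- function(m, fact) {
  stopifnot(nrow(m) %% fact == 0, ncol(m) %% fact == 0)
  # row(m) and col(m) give each cell's row and column index;
  # integer division maps them to a block index starting at 0
  rg <- (row(m) - 1) %/% fact
  cg <- (col(m) - 1) %/% fact
  100 * tapply(m, list(rg, cg), mean)
}

set.seed(0)
mtx <- matrix(sample(0:1, 40^2, TRUE), 40, 40)
pcts2 <- block_percent(mtx, 4)
dim(pcts2)
```

As in the validation above, pcts2[1, 2] should agree with sum(mtx[1:4, 5:8])/16*100.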
