Sampling from a subset of data - r

I have the following problem.
I have multiple subarrays (say 2) that I have populated with character labels (1, 2, 3, 4, 5). My algorithm selects labels at random based on occurrence probabilities.
How can I get R to instead select labels 1:3 for subarray 1 and 4:5 for subarray 2, say, without using subsetting (i.e., [])? That is, I want a random subset of labels to be selected for each subarray, instead of assigning all labels to each subarray manually using [].
I know sample() should help.
Using subsetting (which I don't want) one would do
x <- 1:5
sample(x[1:3], size, prob = probs[1:3])
but this assigns labels 1:3 to ALL subarrays.
Would
sample(sample(x), size, replace = TRUE, prob = probs)
work?
Any ideas? Please let me know if this is unclear.
Here is a small example, which selects labels from 1:5 for each of the K = 2 subarrays (N = 10 individuals in total).
set.seed(1)
N <- 10
K <- 2
Hstar <- 5
probs <- rep(1/Hstar, Hstar)
perms <- 5
## Set up container(s) to hold the identity of each individual from each permutation ##
num.specs <- ceiling(N / K)
## Create an ID for each haplotype ##
haps <- 1:Hstar
## Assign individuals (N) to each subpopulation (K) ##
specs <- 1:num.specs
## Generate permutations, assume each permutation has N individuals, and sample those individuals' haplotypes from the probabilities ##
gen.perms <- function() {
sample(haps, size = num.specs, replace = TRUE, prob = probs) # I would like each subarray to contain a random subset of 1:5.
}
pop <- array(dim = c(perms, num.specs, K))
for (i in 1:K) {
pop[,, i] <- replicate(perms, gen.perms())
}
pop
Hopefully this helps.

I think what you actually want is something like this:
num.specs <- 3
haps[sample(seq_along(haps), size = num.specs, replace = FALSE)]
[1] 3 5 4
That is, a random subset of your vector haps?
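If the goal is for each subarray to draw from its own random subset of labels, one possible sketch is below. It assumes a subset size of 3 (the name subset.size is hypothetical, not from the question) and renormalizes probs over the labels that were kept; the array dimensions follow the question's example.

```r
set.seed(1)
haps <- 1:5
probs <- rep(1 / 5, 5)
K <- 2          # number of subarrays
perms <- 5      # permutations per subarray
num.specs <- 5  # individuals per subarray
subset.size <- 3  # ASSUMPTION: how many labels each subarray may use

pop <- array(dim = c(perms, num.specs, K))
for (k in seq_len(K)) {
  # choose which labels this subarray is allowed to contain
  keep <- sort(sample(seq_along(haps), subset.size))
  # renormalize the occurrence probabilities over the kept labels
  p <- probs[keep] / sum(probs[keep])
  # draw all perms * num.specs labels at once, indexing by position
  pos <- sample.int(subset.size, perms * num.specs, replace = TRUE, prob = p)
  pop[, , k] <- matrix(haps[keep][pos], nrow = perms)
}
```

Sampling positions with sample.int() rather than passing the labels to sample() directly avoids the surprise that sample(x) treats a length-one numeric x as 1:x.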

Not quite what you want (it returns a list of matrices instead of a 3D array), but this might help:
lapply(split(1:5, cut(1:5, breaks=c(0, 2, 5))), function(i) matrix(sample(i, 25, replace=TRUE), ncol=5))
Use cut and split to partition your vector of character labels before sampling them. Here I split your character labels at the value 2. Also, rather than sampling 5 numbers 5 times, you can sample 25 numbers once, and convert to matrix.
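To get a 3D array rather than a list, the same cut/split idea can be combined with vapply(), which stacks equally sized matrices along a third dimension. This is only a sketch: the break points c(0, 3, 5) (giving labels 1:3 and 4:5) and the dimensions are assumptions taken from the question's example.

```r
set.seed(1)
haps <- 1:5
perms <- 5
num.specs <- 5

# Partition the labels: (0,3] -> 1:3 and (3,5] -> 4:5 (assumed break points)
groups <- split(haps, cut(haps, breaks = c(0, 3, 5)))

# One perms x num.specs matrix per group, stacked into a 3D array
pop <- vapply(groups, function(g) {
  pos <- sample.int(length(g), perms * num.specs, replace = TRUE)
  matrix(g[pos], nrow = perms)
}, FUN.VALUE = matrix(0L, perms, num.specs))

dim(pop)
```

Each slice pop[, , k] then contains only the labels of group k.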


Fill matrix where submatrices are dimensions of value in a vector (vector can be random numbers) in R

I have a matrix that will represent infection values of bats (can either be 1 or 0). The animals live in larger units ("roost", each roost is a submatrix), and the full matrix is the population. For starters, I am trying to fill my matrix with submatrices, values all equal to 1.
The code is currently working for roosts where number of bats is all the same. Ex:
# Define our variables
numRoosts = 3
# Uniformly sized roosts....
roostSizes = rep(3, numRoosts) # Each roost has 3 bats, looks like c(3, 3, 3)
# Adjacency matrix describing connections between bats in all roosts
batAdjacencyMatrix <- as.data.frame(matrix(0, nrow = sum(roostSizes), ncol = sum(roostSizes)))
colnames(batAdjacencyMatrix) = rownames(batAdjacencyMatrix) = paste0("Bat_", 1:sum(roostSizes))
# Start filling in the network structure
n = 0
for(size in roostSizes){
# Fill the submatrices with dimension 'size x size' with 1's to create the subroost network
# Line below works for uniform roost sizes, but is buggy for nonuniform roost sizes
batAdjacencyMatrix[(1+n*size):((n+1)*size), (1+n*size):((n+1)*size)] <- 1 # Matrix indexing
# Increment the counter
n = n + 1
}
This gives me the output I wanted:
The issue is when I try to change the roosts to nonuniform sizes:
numRoosts = 3
# If you want variable sized roosts....
minRoostPopulation = 2 # Min number of bats in a roost
maxRoostPopulation = 5 # Maximum number of bats in a roost
roostSizes <- round(runif(numRoosts, minRoostPopulation, maxRoostPopulation)) #Here c(5, 2, 4)
batAdjacencyMatrix <- as.data.frame(matrix(0, nrow = sum(roostSizes), ncol = sum(roostSizes)))
colnames(batAdjacencyMatrix) = rownames(batAdjacencyMatrix) = paste0("Bat_", 1:sum(roostSizes))
# Start filling in the network structure
n = 0
for(size in roostSizes){
# Fill the submatrices with dimension 'size x size' with 1's to create the subroost network
# Line below works for uniform roost sizes, but is buggy for nonuniform roost sizes
batAdjacencyMatrix[(1+n*size):((n+1)*size), (1+n*size):((n+1)*size)] <- 1 # Matrix indexing
# Increment the counter
n = n + 1
}
There's something wrong with the indexing in my for loop; I can see that when I put the numbers in manually. But I can't figure out how to define my submatrices so that each one starts down and to the right of where the previous one ended. Any thoughts? Thanks in advance!
Your indexing is indeed somewhat off. If I were to modify your approach, I would do it like this:
numRoosts = 3
roostSizes <- c(5, 2, 4)
batAdjacencyMatrix <- as.data.frame(matrix(0, nrow = sum(roostSizes), ncol = sum(roostSizes)))
colnames(batAdjacencyMatrix) = rownames(batAdjacencyMatrix) = paste0("Bat_", 1:sum(roostSizes))
# Start filling in the network structure
n = 0
for(size in roostSizes){
# sum of sizes of preceding roosts
size.prior <- sum(head(roostSizes, n))
# indices of the current roost
ind <- size.prior + (1:size)
batAdjacencyMatrix[ind, ind] <- 1 # Matrix indexing
# Increment the counter
n = n + 1
}
However, an easier way is to use magic::adiag() which can build such block-diagonal matrices:
library(magic)
roostSizes <- c(5, 2, 4)
# create the three matrices of 1's
mat <- lapply(roostSizes, function(n) matrix(1, n, n))
# bind them diagonally
batAdjacencyMatrix <- do.call(adiag, mat)
rownames(batAdjacencyMatrix) <- colnames(batAdjacencyMatrix) <-
paste0('Bat_', seq_len(sum(roostSizes)))
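A base-R alternative that needs neither the loop nor an extra package is to label every bat with its roost and compare the labels pairwise. This is a sketch rather than the answer's own method; the name roostId is hypothetical, and the result is a plain numeric matrix, not a data frame.

```r
roostSizes <- c(5, 2, 4)

# Roost label for every bat: 1 1 1 1 1 2 2 3 3 3 3
roostId <- rep(seq_along(roostSizes), times = roostSizes)

# Two bats are connected exactly when they share a roost;
# multiplying by 1 turns TRUE/FALSE into 1/0
batAdjacencyMatrix <- outer(roostId, roostId, "==") * 1

batNames <- paste0("Bat_", seq_along(roostId))
dimnames(batAdjacencyMatrix) <- list(batNames, batNames)
```

Because the block positions come from the labels themselves, nonuniform roost sizes need no special handling.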

Subset in the data frame rows in R

I have a data frame with 30 rows and 4 columns (namely, x, y, z, u). It is given below.
mydata = data.frame(x = rnorm(30,4), y = rnorm(30,2,1), z = rnorm(30,3,1), u = rnorm(30,5))
Further, I have a sequence of values, which represent row numbers in my data frame.
myseq = c(seq(1, 30, by = 5))
myseq
[1] 1 6 11 16 21 26
Now, I wanted to compute the prob values for each segment of 99 rows.
filt = subset(mydata[1:6,], mydata[1:6,]$x < mydata[1:6,]$y & mydata[1:6,]$z < mydata[1:6,]$u)
filt
prob = length(filt$x)/30
prob
Then I need to compute the above prob for 1:6, ..., 27:30, and so on. Here I have only 6 prob values, so I can do them one by one, but with 100 values it would be tedious. Is there any way to compute all the prob values at once?
Thank you in advance.
BTW: in subset(DF[1:99,], ...), you only need DF[1:99,] as the first argument; there is no need to repeat it inside the condition, e.g.
subset(DF[1:99,], cumsuml < inchivaluel & cumsumr < inchivaluer)
Think about how to do this in a list.
The first step is to break your data at the va starting points. I'll start with a list of the indices to break it into:
inds <- mapply(seq, va, c(va[-1], nrow(DF)), SIMPLIFY=FALSE)
This is now a list of sequences, starting with 1:99, then 100:198, etc. See str(inds) to verify.
Now we can subset a portion of the data based on each element's vector of indices:
filts <- lapply(inds, function(ind) subset(DF[ind,], cumsuml < inchivaluel & cumsumr < inchivaluer))
We now have a list of filtered data frames; let's summarize it:
results <- sapply(filts, function(filt) length(filt$cumsuml)/length(alpha))
Bottom line, it helps to think about how to break this problem into lists, examples at http://stackoverflow.com/a/24376207/3358272.
BTW: instead of initially making a list of indices, we could just break up the data in that first step, ala
DF2 <- mapply(function(a,b) DF[a:b,], va, c(va[-1], nrow(DF)), SIMPLIFY=FALSE)
filts <- lapply(DF2, function(x) subset(x, cumsuml < inchivaluel & cumsumr < inchivaluer))
results <- sapply(filts, function(filt) length(filt$cumsuml)/length(alpha))
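Applied to the question's own mydata and myseq, the same split-then-summarize idea might look like the sketch below. It assumes non-overlapping segments that start at each value of myseq (rows 1:5, 6:10, and so on) and keeps the question's denominator of 30 rows; the names segment and hit are hypothetical.

```r
set.seed(42)
mydata <- data.frame(x = rnorm(30, 4), y = rnorm(30, 2, 1),
                     z = rnorm(30, 3, 1), u = rnorm(30, 5))
myseq <- seq(1, 30, by = 5)

# Segment number for every row: rows 1-5 -> 1, rows 6-10 -> 2, ...
segment <- findInterval(seq_len(nrow(mydata)), myseq)

# Row-wise condition, evaluated once for the whole data frame
hit <- with(mydata, x < y & z < u)

# Count the hits per segment and divide by the total number of rows
prob <- tapply(hit, segment, sum) / nrow(mydata)
prob
```

This avoids building the intermediate list of data frames, which may matter when there are many segments.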

Create a matrix from a list consisting of unequal matrices for individual bootstraps

I tried to create a matrix from a list which consists of N unequal matrices...
The reason to do this is to make R individual bootstrap samples.
In the example below you can find e.g. 2 companies, where we have 1 with 10 & 1 with just 5 observations.
Data:
set.seed(7)
Time <- c(10,5)
xv <- matrix(c(rnorm(10,5,2), rnorm(5,20,1), rnorm(10,5,2), rnorm(5,20,1)), ncol=2);
y <- matrix( c(rnorm(10,5,2), rnorm(5,20,1)));
z <- matrix(c(rnorm(10,5,2), rnorm(5,20,1), rnorm(10,5,2), rnorm(5,20,1)), ncol=2)
# create data frame of input variables which helps
# to conduct the row-wise bootstrapping
data <- data.frame (y = y, xv = xv, z = z);
rows <- dim(data)[1];
cols <- dim(data)[2];
# create the index to sample from the different panels
cumTime <- c(0, cumsum (Time));
index <- findInterval (seq (1:rows), cumTime, left.open = TRUE);
# draw R individual bootstrap samples
bootList <- replicate(n = 5, list(), simplify = FALSE); # replicate() takes n, not R
bootList <- lapply (bootList, function(x) by (data, INDICES = index, FUN = function(x) dplyr::sample_n (tbl = x, size = dim(x)[1], replace = TRUE)));
---------- UNLISTING ---------
Currently, I try to do it (incorrectly) like this:
Example for just 1 entry of the list:
matrix(unlist(bootList[[1]], recursive = T), ncol = cols)
The desired output is just
bootList[[1]]
as a matrix.
Do you have an idea how to do this, ideally in a reasonably efficient way?
The matrices are then processed in unfortunately slow MLE estimations...
I found a solution for you. From what I gather, you have a data frame containing all observations of all companies, which may have different panel lengths, and as a result you would like a bootstrap sample for each company of the same size as its original panel length.
You merely have to add a company indicator:
data$company = c(rep(1, 10), rep(2, 5)) # this could even be a factor.
L1 = split(data, data$company)
L2 = lapply(L1, FUN = function(s) s[sample(x = 1:nrow(s), size = nrow(s), replace = TRUE),] )
Stop here if you would like to have separate bootstrap samples, e.g. in case you want to estimate each company separately. Otherwise, bind them together:
bootdata = do.call(rbind, L2)
Best wishes,
Tim
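To get several bootstrap samples, each as a single matrix, the split-and-resample step can be wrapped in a function and repeated. The sketch below uses made-up data with the same panel lengths as the question; the helper name boot.one and the column names are hypothetical.

```r
set.seed(7)
Time <- c(10, 5)  # panel lengths of the two companies
data <- data.frame(company = rep(seq_along(Time), times = Time),
                   y = rnorm(15), x1 = rnorm(15))
R <- 5            # number of bootstrap samples

boot.one <- function() {
  # resample row indices with replacement, separately within each company
  idx <- unlist(lapply(split(seq_len(nrow(data)), data$company),
                       function(rows) rows[sample.int(length(rows), replace = TRUE)]),
                use.names = FALSE)
  # a single numeric matrix with the companies stacked in their original order
  as.matrix(data[idx, ])
}

bootList <- replicate(R, boot.one(), simplify = FALSE)
```

Resampling indices instead of data frames keeps the work per replicate small, which may help when the samples feed into a slow estimation step.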

Extend a vector by randomly increasing values in R

In this example I'm trying to generate a random time series for 3 individuals at 4 time points (below x contains the 1st timepoints for each individual). I want the values to be randomly increasing rather than decreasing in time. Below is my current solution.
set.seed(0)
x <- rnorm(3)
x
[1] 1.2629543 -0.3262334 1.3297993
y <- c(x,
x*runif(1,.8,1.2),
x*runif(1,.9,1.3),
x*runif(1,1,1.4))
y
[1] 1.2629543 -0.3262334 1.3297993 1.4642135 -0.3782206 1.5417106 1.6138915 -0.4168839 1.6993107 1.5967772
[11] -0.4124631 1.6812906
This has some problems.
For each individual the same coefficient is used for calculating the values for same timepoint resulting in identical trends. How could I get a random coefficient for each multiplication? I could use lapply but then the vector will be "grouped" by individuals not by timepoints.
I don't wish to write the formulas for last timepoints separately and be so precise. Exact coefficients are not important, I just need the values to have a tendency to slightly increase but occasional decreasing should also be allowed. How could I extend the vector more "effectively"?
How can I make negative values increase as well?
I managed to solve this thanks to Federico Manigrasso. The solution is below.
TimeSer <- function(num.id, years, init.val) {
df <- data.frame(id = factor(rep(1:num.id, length(years))),
year = rep(years, each = num.id))
yrs <- length(years) - 1
minim <- seq(-.1, by = -.1, len = yrs)
maxim <- seq(.4, by = .4, len = yrs)
val <- list(init.val)
for (i in 1:yrs) {
val[[i + 1]] <- unlist(lapply(init.val, function (x) {
x + (x * runif(1, minim[i], maxim[i]))
}))
}
df$val <- unlist(val)
df
}
df <- TimeSer(num.id = 3, years = 2006:2016, init.val = rnorm(3,1e5, 1e5))
Visual representation of the results:
num.id <- length(unique(df$id))
par(mfrow=c(1,num.id))
lapply(1:num.id, function(x) {
plot(unique(df$year), df$val[df$id == x], type = 'l', col = x)
})
I suggest putting the output in a list. It is a lot less messy, and you can transform it into a vector later (using unlist).
This is how I would rewrite your code:
x<-rnorm(3)
time<-3
output<-list(x) #init output list with initial data
par1<-c(0.8,0.9,1)
par2<-c(1.2,1.3,1.4)
for( i in 1:time){
a<-unlist(lapply(x,function(x){x+runif(1,par1[i],par2[i])}))
output[[i+1]]<-a
x<-a
}
Let me know if this solves your problem.
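The three concerns from the question (a separate coefficient per individual, no hand-written formula per timepoint, and negative values that also tend to increase) could also be handled with a matrix of growth rates. This is only a sketch: the rate range of -0.1 to 0.4 and the names n.steps, rates and vals are assumptions, not part of the original posts.

```r
set.seed(0)
x <- rnorm(3)   # first timepoint for each individual
n.steps <- 3    # number of additional timepoints

# One growth rate per individual and step (ASSUMED range: mostly positive,
# occasionally negative, so small decreases remain possible)
rates <- matrix(runif(length(x) * n.steps, -0.1, 0.4), nrow = length(x))

vals <- matrix(NA_real_, nrow = length(x), ncol = n.steps + 1)
vals[, 1] <- x
for (k in seq_len(n.steps)) {
  # scaling the increment by abs() moves negative values upward as well
  vals[, k + 1] <- vals[, k] + abs(vals[, k]) * rates[, k]
}

# Column-major unrolling keeps the vector grouped by timepoint
y <- as.vector(vals)
```

Rows of vals are individuals and columns are timepoints, so the result can be reshaped either way later.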

Add replicate column to end of matrix

I have a matrix of one column and 6 rows. I would like to replicate that column i times but change one value randomly each time, and after each iteration, calculate the mean and variance across all columns.
For example:
values = rnorm(6, 6, 1); matrix1 = matrix(values, 6)
After i=1, would look like:
values2 = values
idx = sample(1:6, 1) # pick the cell once so the same element is read and written
values2[idx] = values2[idx] + runif(1, 0, 1)
matrix2 = matrix(c(values, values2), 6)
At the end, I would like to output a data frame that looks like so:
i mean var
1 1.23 2.31
2 1.24 2.33 etc...
For many i's. I imagine there is a way to do this with loops, but my skills are not such that I can figure it out. Thanks for all your help!
If you know how many times you're doing this, it would be best to construct your final matrix beforehand, especially if i is large. However, without that:
jitter.func <- function(x, vec) {
cell <- sample(1:length(vec), 1)
vec[cell] <- vec[cell] + runif(1, 0, 1)
return(c(mean=mean(vec), var=var(vec)))
}
i <- 10
sapply(1:i, jitter.func, vec=values)
j <- 20 # Number of columns
i <- 6 # Number of rows
vec <- matrix(rnorm(i,6,1),ncol=j,nrow=i)# vector replicated j times
idx <- sample(seq(i),j,replace=TRUE) # j random rows
vec[cbind(idx, seq(j))] <- vec[cbind(idx, seq(j))]+runif(j) # add random number to random row in each column
apply(vec,2,plyr::each(mean,var)) # summary statistics
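If the running mean and variance after each iteration are wanted as a data frame, as in the question's desired output, one option is to preallocate the full matrix and fill a results table in a loop. This sketch assumes that every new column is a copy of the original values with one randomly chosen cell jittered, and that the statistics are taken over all columns built so far; the names n.iter, mat and out are hypothetical.

```r
set.seed(1)
values <- rnorm(6, 6, 1)
n.iter <- 10

# Preallocate: column 1 is the original, columns 2..(n.iter + 1) are copies
mat <- matrix(values, nrow = 6, ncol = n.iter + 1)
out <- data.frame(i = seq_len(n.iter), mean = NA_real_, var = NA_real_)

for (i in seq_len(n.iter)) {
  cell <- sample.int(6, 1)                         # one random row
  mat[cell, i + 1] <- mat[cell, i + 1] + runif(1)  # jitter it in the new column
  so.far <- mat[, 1:(i + 1)]                       # all columns built so far
  out$mean[i] <- mean(so.far)
  out$var[i] <- var(as.vector(so.far))
}
out
```

The as.vector() call matters: var() on a matrix returns a covariance matrix rather than a single number.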
