Random subsampling in R


I am new to R, so my question might be really simple.
I have 40 sites with abundances of zooplankton.
My data looks like this (columns are species abundances and rows are sites):
0 0 0 0 0 2 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 85 0
0 0 0 0 0 45 5 57 0
0 0 0 0 0 13 0 3 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 7 0
0 3 0 0 12 8 0 57 0
0 0 0 0 0 0 0 1 0
0 0 0 0 0 59 0 0 0
0 0 0 0 4 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 105 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 100 0
0 35 0 55 0 0 0 0 0
1 4 0 0 0 0 0 0 0
0 0 0 0 0 34 21 0 0
0 0 0 0 0 9 17 0 0
0 54 0 0 0 27 5 0 0
0 1 0 0 0 1 0 0 0
0 17 0 0 0 54 3 0 0
What I would like to do is take a random sub-sample (e.g. 50 individuals) from each site without replacement, repeat this several times (bootstrap), and then calculate diversity indices on the new standardized abundances.

Try something like this:
mysample <- mydata[sample(1:nrow(mydata), 50, replace=FALSE),]

What the OP is probably looking for here is a way to bootstrap the data for a Hill or Simpson diversity index, which implies some assumptions about the data being sampled:
Each row is a site, each column is a species, and each value is a count.
Individuals are being sampled for the bootstrap, NOT THE COUNTS.
To do this, bootstrapping programs will often model the counts as a string of individuals. For instance, if we had a record like so:
a b c
2 3 4
The record would be modeled as:
aabbbcccc
Then, a sample is usually drawn WITH replacement from the string to create a larger set based on the model set.
Bootstrapping a site: In R, we have a way to do this that is actually quite simple with the 'sample' function. If you select from the column numbers, you can provide probabilities using the count data.
# Test data.
data <- data.frame(a=2, b=3, c=4)
# Sampling from first row of data.
row <- 1
N_samples <- 50
samples <- sample(1:ncol(data), N_samples, replace=TRUE, prob=unlist(data[row,]))
Converting the sample into the format of the original table: Now we have an array of samples, with each item indicating the column number that the sample belongs to. We can convert back to the original table format in multiple ways, but here is a fairly simple one using a simple counting loop:
# Count the number of each entry and store in a list.
site_sample <- list() # must exist before it is filled in the loop
for (i in 1:ncol(data)){
site_sample[[i]] <- sum(samples==i)
}
# Unlist the data to get an array that represents the bootstrap row.
site_sample <- unlist(site_sample)
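Drawing with replacement, with probabilities proportional to the counts, is the same as a multinomial draw, so the sampling and counting steps can probably be collapsed into one call. A minimal sketch using base R's rmultinom with the same test record as above (the seed is only there to make the sketch reproducible):

```r
counts <- c(a = 2, b = 3, c = 4)   # same test record as above
N_samples <- 50
set.seed(42)                        # assumption: fixed seed for reproducibility only

# One multinomial draw: N_samples individuals spread over the species,
# with probabilities proportional to the observed counts.
site_sample <- rmultinom(1, size = N_samples, prob = counts)[, 1]
site_sample
```

The result is already in the format of the original table row, one count per species. Note that this is sampling WITH replacement, so it does not match the "without replacement" requirement in the question.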

Just stumbled upon this thread, and the vegan package has a function called 'rrarefy' that does precisely what you're looking to do (and in the same ecological context, too).

This should work. It's a little more complicated than it looks at first, since each cell contains counts of a species. The solution uses the apply function to send each row of the data to the user-defined sample_species function. That function draws n random numbers between 1 and the site total, without replacement, and sorts them. If there are 15 of species 1, 20 of species 2, and 20 of species 3, the random numbers between 1 and 15 signify species 1, 16 to 35 signify species 2, and 36 to 55 signify species 3.
## Takes in a row of the data and the number of samples to take.
## Requires n <= sum(counts); otherwise sample() stops with an error.
sample_species <- function(counts, n) {
  num_species <- length(counts)
  total_count <- sum(counts)
  samples <- sample(1:total_count, n, replace=FALSE)
  samples <- samples[order(samples)]
  result <- array(0, num_species)
  total <- 0
  for (i in 1:num_species) {
    ## individuals numbered (total, total + counts[i]] belong to species i
    result[i] <- length(which(samples > total & samples <= total+counts[i]))
    total <- total+counts[i]
  }
  return(result)
}
A <- matrix(sample(0:100, 10*40, replace=TRUE), ncol=10) ## mock data
B <- t(apply(A, 1, sample_species, 50)) ## results
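The question also asks for the subsample to be repeated several times and for a diversity index to be computed on each draw. A minimal base R sketch of that loop might look like the following. The helper names subsample_site and shannon are hypothetical, the Shannon index is just one example of a diversity index, and returning NA for sites with fewer than n individuals is an assumption about how such sites should be handled.

```r
# Hypothetical helper: draw n individuals from one site without replacement.
subsample_site <- function(counts, n) {
  if (sum(counts) < n) return(rep(NA_integer_, length(counts)))  # too few individuals
  # Expand counts into one entry per individual, e.g. 2,3,4 -> 1 1 2 2 2 3 3 3 3
  individuals <- rep(seq_along(counts), times = counts)
  picked <- individuals[sample.int(length(individuals), n)]
  tabulate(picked, nbins = length(counts))  # back to one count per species
}

# Shannon index of one vector of counts (zero counts contribute nothing).
shannon <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}

set.seed(1)                                        # reproducibility of the sketch only
site <- c(0, 3, 0, 0, 12, 8, 0, 57, 0)             # one row from the question
draws <- replicate(100, subsample_site(site, 50))  # species in rows, replicates in columns
mean_shannon <- mean(apply(draws, 2, shannon))     # average index over the replicates
```

To process all sites, the same two helpers could be applied to each row of the abundance matrix with apply(), skipping the sites that come back as NA.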

Related

Count number of unique instances in a column depending on values in other columns

I've got the following table (which is called train) (in reality much bigger)
UNSPSC adaptor alert bact blood collection packet patient ultrasoft whit
514415 0 0 0 0 0 0 0 1 0
514415 0 0 0 1 0 0 0 1 0
514415 0 0 1 0 0 0 0 1 0
514415 0 0 0 0 0 0 0 1 0
514415 0 0 0 0 0 0 0 1 0
514415 0 0 0 0 0 0 0 1 0
422018 0 0 0 0 0 0 0 1 0
422018 0 0 0 0 0 0 0 1 0
422018 0 0 0 1 0 0 0 1 0
411011 0 0 0 0 0 0 0 1 0
I want to calculate the number of unique UNSPSC per column where the value is equal to 1. So for column blood it will be 2 and for column ultrasoft will be 3.
I'm doing this but don't know how to continue:
apply(train[,-1], 2, ......)
I'm trying not to use loops.

To continue from where you left, we can use apply with margin=2 and calculate the length of unique values of "UNSPSC" for each column.
apply(train[-1], 2, function(x) length(unique(train$UNSPSC[x==1])))
#  adaptor     alert      bact     blood collection    packet
#        0         0         1         2          0         0
#  patient ultrasoft      whit
#        0         3         0
A better option is sapply/lapply, which gives the same result but, unlike apply, does not convert the data frame into a matrix.
sapply(train[-1], function(x) length(unique(train$UNSPSC[x==1])))
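For anyone who wants to check the result, here is a small self-contained sketch. It rebuilds only three of the columns from the question (bact, blood, ultrasoft) and uses vapply, a stricter variant of sapply that fixes the return type; the reduced table is an assumption made to keep the example short.

```r
# Reduced copy of the question's table: UNSPSC plus three of the 0/1 columns.
train <- data.frame(
  UNSPSC    = c(514415, 514415, 514415, 514415, 514415, 514415,
                422018, 422018, 422018, 411011),
  bact      = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0),
  blood     = c(0, 1, 0, 0, 0, 0, 0, 0, 1, 0),
  ultrasoft = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
)

# For each 0/1 column, count the distinct UNSPSC codes on rows where it equals 1.
n_unique <- vapply(train[-1],
                   function(x) length(unique(train$UNSPSC[x == 1])),
                   integer(1))
n_unique
```

This reproduces the values the question expects: 2 for blood and 3 for ultrasoft.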

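If your columns contain only 0 and 1, like in the example, colSums gives the number of rows equal to 1 in each column. Note that this counts rows, not unique UNSPSC values, so ultrasoft comes out as 10 rather than 3:
colSums(train[,-1]) # remove the non-indicator columns, like UNSPSC, before use
#  adaptor     alert      bact     blood collection    packet   patient
#        0         0         1         2          0         0         0
# ultrasoft     whit
#       10         0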

Compute percentage weights on rows when one column is not numeric

I have this data called out:
Dates Consumer Staples Energy Financials Health Care
1 12/31/99 0 0 0 0 0
2 03/31/00 0 0 0 0 0
3 06/30/00 0 0 0 0 0
4 09/30/00 0 0 0 0 0
5 12/31/00 0 0 0 0 0
6 03/31/01 1000 0 0 50 0
7 06/30/01 0 0 0 0 0
I would like to compute the weights for each category on each row
but need to avoid summing the first column which is a date
Weights <- round(out[2:6]/rowSums(out[2:6])*100, 2)
1/ Is there a way to keep the dates in the first column, and compute
the weights of the next 5 columns in the same data set
2/ When a date has only 0 data, how to avoid the NAs?
Thank you for your help

outN <- out[,-1]
rownames(outN) <- out[,1]
Cap_Weights <- round(outN/rowSums(outN)*100, 2)
Cap_Weights[is.na(Cap_Weights)] <- 0
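If the dates should stay as a regular first column rather than as row names, one possible variation is sketched below. The column names in the mock data are placeholders (the header in the question is ambiguous), and replacing zero row totals by 1 before dividing is just one way to avoid the 0/0 that produces the NaN values.

```r
# Mock data with placeholder column names; two rows taken from the question.
out <- data.frame(
  Dates      = c("12/31/00", "03/31/01"),
  Consumer   = c(0, 1000),
  Staples    = c(0, 0),
  Energy     = c(0, 0),
  Financials = c(0, 50),
  HealthCare = c(0, 0)
)

num_cols <- sapply(out, is.numeric)   # TRUE for every column except Dates
totals   <- rowSums(out[num_cols])    # row totals over the numeric columns only

weights <- out                        # keeps Dates as the first column
# Divide by 1 instead of 0 on all-zero rows, so their weights stay 0.
weights[num_cols] <- round(out[num_cols] / ifelse(totals == 0, 1, totals) * 100, 2)
weights
```

The result has the same layout as out, with the numeric columns replaced by percentages and the all-zero rows left at 0.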

How to randomly divide an integer into a fixed number of integers, such that the obtained tuples are uniformly distributed?

Based on this reply: Random numbers that add to 100: Matlab
I tried to apply the suggested method to randomly divide an integer into a fixed number of integers whose sum is equal to the integer. Although that method seems to result in a uniformly distributed set of points when the values are not integers, in the case of integers, the resulting tuples are not obtained with equal probability.
This is shown by the following implementation in R, where a simple case is tested with 3 divisors and with the integer to be divided equal to 5:
# Randomly divide an integer into a defined number of integers
# Goal: obtain with equal probability any combination of variable values, with the condition that sum(variables) = dividend.
# install.packages("rgl") # Install the rgl package if not yet installed. It provides the plot3d function used to create a 3D scatterplot.
library(rgl)
n_draws = 10000
n_variables = 3 # Number of divisors. These need to be randomly calculated. Their value must be in the interval [0:dividend] and their sum must be equal to the dividend. Two variables can have the same value.
dividend = 5 # Number that needs to be divided.
rand_variables = matrix(nrow = n_draws, ncol = n_variables) # This matrix contains the final values for each variable (one column per variable).
rand_samples = matrix(nrow = n_draws, ncol = n_variables-1) # This matrix contains the intermediate values that are used to randomly divide the dividend.
for (k in 1:n_draws){
rand_samples[k,] = sample(x = c(0:dividend), size = n_variables-1, replace = TRUE) # Randomly select (n_variables - 1) values within the range 0:dividend. The values in rand_samples are uniformly distributed.
midpoints = sort(rand_samples[k,])
rand_variables[k,] = sample(diff(c(0, midpoints, dividend)), n_variables) # Calculate the values of each variable such that their sum is equal to the dividend.
}
plot3d(rand_variables) # Create a 3D scatterplot showing the values of rand_variables. This plot does not show how frequently each combination of values of the n_variables is obtained, only which combinations of values are possible.
table(data.frame(rand_variables)) # This prints out the count of each combination of values of n_variables. It shows that the combinations of values in the corners (e.g. (5,0,0)) are obtained less frequently than other combinations (e.g. (1,2,2)).
The last line gives the following output, which shows how many times each combination of values of (X1, X2, X3) satisfying X1 + X2 + X3 = 5 was obtained:
, , X3 = 0
X2
X1 0 1 2 3 4 5
0 0 0 0 0 0 397
1 0 0 0 0 471 0
2 0 0 0 469 0 0
3 0 0 446 0 0 0
4 0 456 0 0 0 0
5 358 0 0 0 0 0
, , X3 = 1
X2
X1 0 1 2 3 4 5
0 0 0 0 0 450 0
1 0 0 0 539 0 0
2 0 0 560 0 0 0
3 0 588 0 0 0 0
4 426 0 0 0 0 0
5 0 0 0 0 0 0
, , X3 = 2
X2
X1 0 1 2 3 4 5
0 0 0 0 428 0 0
1 0 0 603 0 0 0
2 0 549 0 0 0 0
3 461 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
, , X3 = 3
X2
X1 0 1 2 3 4 5
0 0 0 500 0 0 0
1 0 549 0 0 0 0
2 455 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
, , X3 = 4
X2
X1 0 1 2 3 4 5
0 0 465 0 0 0 0
1 458 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
, , X3 = 5
X2
X1 0 1 2 3 4 5
0 372 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
As the output shows, the combinations of values in the corners of the plane (e.g. (5,0,0)) are obtained less frequently than other tuples.
How can I obtain any integer tuple with the same probability?
I'm looking for a solution that is applicable for any positive integer and for any number of divisors.

I think trying to make these combinations/permutations manually is reinventing the wheel. There are efficient algorithms to do this implemented in partitions. For example,
library(partitions) # compositions, parts, restrictedparts may be of interest
sample_size <- 1000
pool <- compositions(5, 3) # pool of possible tuples
samp <- sample(ncol(pool), sample_size, TRUE) # sample uniformly
## These are your sampled tuples, one per column
z <- matrix(pool[,samp], 3)
Side note: don't use a data.frame, use a matrix to store a set of integers. data.frames will be entirely copied every time you modify something ([.data.frame is not a primitive), whereas matrices will modify in place.
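When the pool of tuples is too large to enumerate, a stars-and-bars draw is one possible alternative that should still be uniform. The idea: a tuple of n_variables non-negative integers summing to dividend corresponds to exactly one way of placing n_variables - 1 bars among dividend + n_variables - 1 slots, so picking the bar positions uniformly without replacement picks the tuple uniformly. The function sample_composition below is a hypothetical helper written for this sketch, not part of the partitions package.

```r
# Hypothetical helper: one uniformly distributed tuple of n_variables
# non-negative integers that sum to dividend.
sample_composition <- function(dividend, n_variables) {
  slots <- dividend + n_variables - 1
  # Choose distinct bar positions; every set of positions is equally likely.
  bars <- sort(sample.int(slots, n_variables - 1))
  # The gaps between consecutive bars (minus the bars themselves) are the parts.
  diff(c(0, bars, slots + 1)) - 1
}

set.seed(123)                                    # reproducibility of the sketch only
draws <- replicate(10000, sample_composition(5, 3))
freq <- table(apply(draws, 2, paste, collapse = ","))
freq                                             # counts per tuple, corners included
```

For dividend 5 and 3 variables there are choose(7, 2) = 21 possible tuples, so each should appear roughly 10000 / 21 times, including corner cases such as 5,0,0.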

Vertex names by creating a network object via an edgelist (R package: network)

I want to create a network object representing a directed network on the basis of an edgelist. The first column contains the unique IDs of project leaders, the second the IDs of project partners, let's say:
library("network")
x <- cbind(rbind(1,1,2,2,3), rbind(3,7,10,9,6))
y.nw <- network(x, matrix="edgelist", directed=TRUE, loops=FALSE)
Now my problem is: I need all vertices to have the right ID, since after creating the network object I have to convert it back to an adjacency matrix with the corresponding firm IDs. However, I am not sure in which order I should assign them, since I sorted the data frame by column 1 (project leaders), and the leaders do not always show up as project partners as well.

If your ids are sequential integers as in your example, you can produce the adjacency matrix corresponding to the edgelist in your example with:
> as.sociomatrix(y.nw)
1 2 3 4 5 6 7 8 9 10
1 0 0 1 0 0 0 1 0 0 0
2 0 0 0 0 0 0 0 0 1 1
3 0 0 0 0 0 1 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0
But maybe you have a different type of id system in your real input?
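If the IDs are not sequential, one package-independent option is to build the adjacency matrix directly in base R and label its rows and columns with the IDs that actually occur. This is only a sketch: it bypasses the network object entirely, and the variable names edges, ids and adj are made up for the example.

```r
# Edgelist from the question: column 1 = project leader, column 2 = partner.
edges <- cbind(c(1, 1, 2, 2, 3), c(3, 7, 10, 9, 6))

# Every ID that occurs in either column, in sorted order.
ids <- sort(unique(as.vector(edges)))

# Empty adjacency matrix labelled with the real IDs.
adj <- matrix(0L, nrow = length(ids), ncol = length(ids),
              dimnames = list(as.character(ids), as.character(ids)))

# Translate each ID to its row/column position and mark the directed edge.
adj[cbind(match(edges[, 1], ids), match(edges[, 2], ids))] <- 1L
adj
```

Here the matrix has only the seven IDs that appear in the edgelist (1, 2, 3, 6, 7, 9, 10), so no empty rows are created for unused numbers, and the order of the rows does not depend on how the edgelist was sorted.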

function to assign value in a matrix (j-programming)

I have two vectors (say, X and Y) which correspond to row and column numbers. I want to write a function (a verb, in J) that takes these and assigns 1 at those positions in an n x n zero matrix. Here is a simple case.
I have these vectors:
X=:1 2 1 5
Y=:0 3 3 9
and a zeros matrix:
mat=: 10 10$0
and I wrote the following function (I used boxing):
1(|:(,./<"0(|:(X,:Y)))) } 10 10$0
but the problem is that it takes these vectors and assigns 1 to every column. So if I take (1,0), it assigns 1 to rows number 1 and 0 in all the columns (like (1,:) in Matlab). How can I overcome this problem?

I understand you to want to amend a boolean noun to put 1 at designated coordinates. You start with the coordinate pairs as separate lists. I recommend stitching those lists together like this:
Y,.X
0 1
3 2
3 1
9 5
Y comes before X because in J axes are naturally arranged in decreasing sequence (that is, most fine-grained to the right.) To use these as coordinate pairs with Amend, they'll need to be boxed:
<"1 Y,.X
+---+---+---+---+
|0 1|3 2|3 1|9 5|
+---+---+---+---+
Those will work with Amend to set 1 at those particular coordinates, so:
1 (<"1 Y,.X)} 10 10$0
0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0
If I've understood your question, this is the matrix you were looking to produce.
