Function to assign values in a matrix (J programming)

I have two vectors (say, X and Y) which correspond to row and column numbers. I want to write a function (a verb, in J) that takes these and assigns 1 at those positions in an n x n zero matrix. Here is a simple case.
I have these vectors:
X=:1 2 1 5
Y=:0 3 3 9
and a zeros matrix:
mat=: 10 10$0
and I wrote the following function (I used boxing):
1(|:(,./<"0(|:(X,:Y)))) } 10 10$0
but the problem is that it takes these vectors and assigns 1 to every column. So if I take (1,0) it assigns 1 to rows 1 and 0 across all the columns (like (1,:) in Matlab). How can I overcome this problem?

I understand you to want to amend a boolean noun to put 1 at designated coordinates. You start with the coordinate pairs as separate lists. I recommend stitching those lists together like this:
Y,.X
0 1
3 2
3 1
9 5
Y comes before X because in J, axes are naturally arranged in decreasing sequence (that is, most fine-grained to the right). To use these as coordinate pairs with Amend, they'll need to be boxed:
<"1 Y,.X
+---+---+---+---+
|0 1|3 2|3 1|9 5|
+---+---+---+---+
Those will work with Amend to set 1 at those particular coordinates, so:
1 (<"1 Y,.X)} 10 10$0
0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0
If I've understood your question, this is the matrix you were looking to produce.

Related

Count number of unique instances in a column depending on values in other columns

I've got the following table, called train (in reality it is much bigger):
UNSPSC adaptor alert bact blood collection packet patient ultrasoft whit
514415       0     0    0     0          0      0       0         1    0
514415       0     0    0     1          0      0       0         1    0
514415       0     0    1     0          0      0       0         1    0
514415       0     0    0     0          0      0       0         1    0
514415       0     0    0     0          0      0       0         1    0
514415       0     0    0     0          0      0       0         1    0
422018       0     0    0     0          0      0       0         1    0
422018       0     0    0     0          0      0       0         1    0
422018       0     0    0     1          0      0       0         1    0
411011       0     0    0     0          0      0       0         1    0
I want to calculate the number of unique UNSPSC values per column where the value is equal to 1. So for column blood it will be 2 and for column ultrasoft it will be 3.
I'm doing this but don't know how to continue:
apply(train[,-1], 2, ......)
I'm trying not to use loops.
To continue from where you left off, we can use apply with MARGIN = 2 and calculate the number of unique "UNSPSC" values for each column.
apply(train[-1], 2, function(x) length(unique(train$UNSPSC[x==1])))
#adaptor alert bact blood collection packet
# 0 0 1 2 0 0
#patient ultrasoft whit
# 0 3 0
A better option is sapply/lapply, which gives the same result but, unlike apply, does not convert the data frame into a matrix.
sapply(train[-1], function(x) length(unique(train$UNSPSC[x==1])))
If instead you just want the total number of 1s per column (note this is not the number of unique UNSPSC values, as the ultrasoft column below shows), colSums is enough:
colSums(train[,-1]) # drop the non-numeric/ID columns, like UNSPSC, before use
# adaptor alert bact blood collection packet patient
# 0 0 1 2 0 0 0
# ultrasoft whit
# 10 0
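For completeness, a tidyverse equivalent of the sapply call above; this is a sketch assuming dplyr 1.0 or later for across():
library(dplyr)
# For each indicator column, count the distinct UNSPSC values
# among the rows where that column equals 1.
train %>%
  summarise(across(-UNSPSC, ~ n_distinct(UNSPSC[.x == 1])))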

Merging two columns with two values

I have two columns whose names I know and whose data are 0 and 1.
I would like to merge them into one column: if a 1 exists in a row take the 1, and if I have 1 and 1, keep 1.
Example of data:
stockI stockII
1 0
1 0
0 0
0 0
0 0
0 0
0 0
1 0
0 0
1 1
The output I would expect:
stockI/stockII
0
1
0
0
0
0
0
0
0
1
Is there any cbind-style method to do this?
We can try
as.integer(with(df1, (c(FALSE, stockI[-1] & stockI[-nrow(df1)]) & stockI) |
                     (stockI & stockII)))
#[1] 0 1 0 0 0 0 0 0 0 1
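Note that the expected output above is not a plain element-wise OR (rows 1 and 8 have a 1 in stockI but a 0 in the output), which is why this code combines a lagged condition on stockI with the OR of both columns. If you simply want a 1 wherever either column has a 1, as the question text describes, base R's pmax is enough:
# Element-wise maximum acts as an OR for 0/1 columns.
df1$merged <- pmax(df1$stockI, df1$stockII)
# equivalently: as.integer(df1$stockI | df1$stockII)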

Random subsampling in R

I am new in R, therefore my question might be really simple.
I have 40 sites with abundances of zooplankton.
My data look like this (columns are species abundances and rows are sites):
0 0 0 0 0 2 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 85 0
0 0 0 0 0 45 5 57 0
0 0 0 0 0 13 0 3 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 7 0
0 3 0 0 12 8 0 57 0
0 0 0 0 0 0 0 1 0
0 0 0 0 0 59 0 0 0
0 0 0 0 4 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 105 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 100 0
0 35 0 55 0 0 0 0 0
1 4 0 0 0 0 0 0 0
0 0 0 0 0 34 21 0 0
0 0 0 0 0 9 17 0 0
0 54 0 0 0 27 5 0 0
0 1 0 0 0 1 0 0 0
0 17 0 0 0 54 3 0 0
What I would like to do is take a random sub-sample (e.g. 50 individuals) from each site without replacement, several times (bootstrap), in order to calculate diversity indices for the new standardized abundances afterwards.
Try something like this:
mysample <- mydata[sample(1:nrow(mydata), 50, replace=FALSE),]
What the OP is probably looking for here is a way to bootstrap the data for a Hill or Simpson diversity index, which makes some assumptions about the data being sampled:
Each row is a site, each column is a species, and each value is a count.
Individuals are being sampled for the bootstrap, NOT THE COUNTS.
To do this, bootstrapping programs will often model the counts as a string of individuals. For instance, if we had a record like so:
a b c
2 3 4
The record would be modeled as:
aabbbcccc
Then, a sample is usually drawn WITH replacement from the string to create a larger set based on the model set.
Bootstrapping a site: In R, we have a way to do this that is actually quite simple with the 'sample' function. If you select from the column numbers, you can provide probabilities using the count data.
# Test data: one site (row) with counts for three species.
data <- data.frame(a = 2, b = 3, c = 4)
# Sample 50 individuals from the first row of data, with replacement,
# weighting each species (column) by its count.
row <- 1
N_samples <- 50
samples <- sample(1:ncol(data), N_samples, replace = TRUE,
                  prob = as.numeric(data[row, ]))
Converting the sample into the format of the original table: Now we have an array of samples, with each item indicating the column number that the sample belongs to. We can convert back to the original table format in several ways; here is a fairly direct one using a counting loop:
# Count the occurrences of each column number and store them in a list.
site_sample <- vector("list", ncol(data))
for (i in 1:ncol(data)) {
  site_sample[[i]] <- sum(samples == i)
}
# Unlist the data to get an array that represents the bootstrapped row.
site_sample <- unlist(site_sample)
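The counting loop can also be collapsed into a single call; base R's tabulate() counts how often each integer from 1 to nbins occurs, which is exactly what the loop does:
# One-line equivalent of the loop above.
site_sample <- tabulate(samples, nbins = ncol(data))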
Just stumbled upon this thread: the vegan package has a function called rrarefy that does precisely what you're looking to do (and in the same ecological context, too).
This should work. It's a little more complicated than it looks at first, since each cell contains counts of a species. The solution uses the apply function to send each row of the data to the user-defined sample_species function. Then we generate n random numbers and order them. If there are 15 of species 1, 20 of species 2, and 20 of species 3, the random numbers between 1 and 15 signify species 1, those between 16 and 35 signify species 2, and those between 36 and 55 signify species 3.
## Takes in one row of the count data and the number of individuals to sample
sample_species <- function(counts, n) {
  num_species <- length(counts)
  total_count <- sum(counts)
  samples <- sample(1:total_count, n, replace = FALSE)
  samples <- samples[order(samples)]
  result <- array(0, num_species)
  total <- 0
  for (i in 1:num_species) {
    result[i] <- length(which(samples > total & samples <= total + counts[i]))
    total <- total + counts[i]
  }
  return(result)
}
A <- matrix(sample(0:100, 10 * 40, replace = TRUE), ncol = 10)  ## mock data
B <- t(apply(A, 1, sample_species, 50))  ## results
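Note that because sample_species draws without replacement, n must not exceed sum(counts) for any row; sites with fewer than 50 individuals would either have to be dropped or sampled with replacement instead.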

Vertex names by creating a network object via an edgelist (R package: network)

I want to create a network object representing a directed network on the basis of an edgelist. The first column contains unique IDs of project leaders, the second those of project partners, let's say:
library("network")
x <- cbind(rbind(1,1,2,2,3), rbind(3,7,10,9,6))
y.nw <- network(x, matrix.type="edgelist", directed=TRUE, loops=FALSE)
Now my problem is: I need all vertices to have the right ID, since after creating the network object I have to convert it back to an adjacency matrix with the correct corresponding firm IDs. However, I am not sure in which order I should assign them, since I sorted the data frame by column 1 (project leaders), and those IDs do not always show up as project partners as well.
If your ids are sequential integers as in your example, you can produce the adjacency matrix corresponding to the edgelist in your example with:
> as.sociomatrix(y.nw)
1 2 3 4 5 6 7 8 9 10
1 0 0 1 0 0 0 1 0 0 0
2 0 0 0 0 0 0 0 0 1 1
3 0 0 0 0 0 1 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0
But maybe you have a different type of id system in your real input?
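If the IDs are not sequential, one option is to recode them to sequential indices first and then attach the original IDs as vertex names, which as.sociomatrix uses as dimnames. This is a sketch under that assumption, using the network package's network.vertex.names setter; note that only IDs which actually appear in the edgelist become vertices here:
library(network)
ids <- sort(unique(c(x)))  # all firm IDs appearing in the edgelist
edges <- cbind(match(x[, 1], ids), match(x[, 2], ids))  # recode IDs to 1..n
y.nw <- network(edges, matrix.type = "edgelist", directed = TRUE, loops = FALSE)
network.vertex.names(y.nw) <- as.character(ids)  # label vertices with firm IDs
as.sociomatrix(y.nw)  # rows and columns are now named by firm ID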

How can I calculate an empirical CDF in R?

I'm reading a sparse table from a file which looks like:
1 0 7 0 0 1 0 0 0 5 0 0 0 0 2 0 0 0 0 1 0 0 0 1
1 0 0 1 0 0 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1
1 0 0 1 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 2 1 0 1 0 1
Note row lengths are different.
Each row represents a single simulation. The value in the i-th column in each row says how many times value i-1 was observed in this simulation. For example, in the first simulation (first row), we got a single result with value '0' (first column), 7 results with value '2' (third column) etc.
I wish to create an average cumulative distribution function (CDF) of all the simulation results, so that I can later use it to calculate an empirical p-value for the real results.
To do this I can first sum up each column, treating the missing values in the shorter rows as zeros.
How do I read such a table with different row lengths? How do I sum up the columns, replacing the missing values with 0? And finally, how do I create the CDF? (I can do this manually, but I guess there is some package which can do that.)
This will read the data in:
dat <- textConnection("1 0 7 0 0 1 0 0 0 5 0 0 0 0 2 0 0 0 0 1 0 0 0 1
1 0 0 1 0 0 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1
1 0 0 1 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 2 1 0 1 0 1")
df <- data.frame(scan(dat, fill = TRUE, what = as.list(rep(1, 29))))
names(df) <- paste("Val", 1:29)
close(dat)
Resulting in:
> head(df)
Val 1 Val 2 Val 3 Val 4 Val 5 Val 6 Val 7 Val 8 Val 9 Val 10 Val 11 Val 12
1 1 0 7 0 0 1 0 0 0 5 0 0
2 1 0 0 1 0 0 0 3 0 0 0 0
3 0 0 0 1 0 0 0 2 0 0 0 0
4 1 0 0 1 0 3 0 0 0 0 1 0
5 0 0 0 1 0 0 0 2 0 0 0 0
....
If the data are in a file, provide the file name instead of dat. This code presumes that there are a maximum of 29 columns, as per the data you supplied. Alter the 29 to suit the real data.
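If you would rather not hard-code the column count, you can measure it from the file first with count.fields() from base R. A sketch assuming the table lives in a hypothetical file data.txt:
# Count the whitespace-separated fields on each line and take the maximum.
n_cols <- max(count.fields("data.txt"))
df <- data.frame(scan("data.txt", fill = TRUE, what = as.list(rep(1, n_cols))))
names(df) <- paste("Val", seq_len(n_cols))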
We get the column sums using
df.csum <- colSums(df, na.rm = TRUE)
the ecdf() function generates the ECDF you wanted,
df.ecdf <- ecdf(df.csum)
and we can plot it using the plot() method:
plot(df.ecdf, verticals = TRUE)
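Because the object returned by ecdf() is itself a function, you can evaluate it directly to get the empirical p-values mentioned in the question. A sketch, where obs stands in for a hypothetical observed value:
obs <- 4
df.ecdf(obs)      # proportion of simulated values <= obs
1 - df.ecdf(obs)  # empirical upper-tail p-value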
You can use the ecdf() (in base R) or Ecdf() (from the Hmisc package) functions.
