Resampling a matrix in R

I have generated an observed matrix; here is the code:
obs.matrix <- matrix(c(rep(1,10),rep(2,10)),nrow=10,ncol=2)
Now I want to generate 3000 permuted datasets; each one should still contain ten 1s and ten 2s, but they can fall in different columns.
I don't know how to do the rest. Here is my attempt, which failed:
x <- obs.matrix
theta <- function(resample){ sample(c(1,2), replace = TRUE) }
result <- bootstrap::bootstrap(x, 3000, theta)
Thanks for any help.
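One way to do this without the bootstrap package is to shuffle the 20 observed values directly with sample() inside replicate(). (The posted attempt fails because theta ignores its argument, and sample(c(1,2), replace = TRUE) draws only two values.) A minimal sketch; the name perms and the seed are my own:
set.seed(1)  # for reproducibility; any seed works
# Each list element is one permuted dataset: the twenty observed values
# (ten 1s and ten 2s) shuffled into random positions in a 10 x 2 matrix.
perms <- replicate(3000,
                   matrix(sample(obs.matrix), nrow = 10, ncol = 2),
                   simplify = FALSE)
perms[[1]]  # inspect the first permuted dataset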

Generating multinomial random data in R

I am trying to generate data from a multinomial distribution in R using the function rmultinom, but I am having some problems. I want a data frame of 50 rows and 20 columns, with a total sum of the outcomes equal to 3 times n*p.
I am using this code:
p <- 20
n <- 50
N <- 3*(n*p)
prob_true <- rep(1/p, p)
a <- rmultinom(50, N, prob_true)
But I get some very strange results and a data frame with 20 rows and 50 columns.
How can I solve this problem?
Thanks in advance!
The help available at ?rmultinom says that n in rmultinom(n, size, prob) is:
"number of random vectors to draw"
And size is:
"specifying the total number of objects that are put into K boxes in the typical multinomial experiment"
And the help says that the output is:
"For rmultinom(), an integer K x n matrix where each column is a random vector generated according to the desired multinomial law, and hence summing to size"
So you're asking for 50 vectors/variables with a total number of "objects" equal to 3000 each, which is why every column is drawn as a vector that sums to 3000.
colSums(a) does result in 3000.
Do you want your vectors/variables as rows? Then this would work just by transposing a:
t(a)
but if you want 20 columns, each of which is its own variable, you need to swap your n and p (I also substituted n into the rmultinom call):
n <- 20                  # number of vectors/variables (columns of the result)
p <- 50                  # number of boxes/categories (rows of the result)
N <- 3*(n*p)             # total objects per draw: 3000
prob_true <- rep(1/p, p)
a <- rmultinom(n, N, prob_true)  # 50 x 20 matrix; each column sums to N
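A quick check confirms the shape and totals match what was asked for, and the matrix can be converted to a data frame if one is needed:
dim(a)                    # 50 20: 50 rows, 20 columns
colSums(a)                # each of the 20 columns sums to N = 3000
a_df <- as.data.frame(a)  # as a data frame, if required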

Is there a way in R for doing a pairwise-weighted correlation matrix?

I have a survey with many numeric variables (both continuous and dummy/binary) and more than 800 observations. As usual, there is missing data in most of the variables, at different rates. I need a weighted correlation table because some samples represent more population than others. I also want to minimize the number of discarded samples, keeping the maximum number of observations for each pair of variables. I know how to compute a pairwise correlation matrix (e.g., cor(data, use = "pairwise.complete.obs")), and I know how to compute a weighted correlation matrix (e.g., cov.wt(data %>% select(-weight), wt = data$weight, cor = TRUE)). However, I couldn't find a way to combine the two. Is there a way to compute a pairwise-weighted correlation matrix in R? I would greatly appreciate any help or recommendations.
Good question. Here is how I do it. It is not fast, but it is faster than looping.
df_correlation is a data frame containing only the variables whose correlations I want to compute, and newdf is my original data frame with the weight column (here called BalancingWeights) and the other variables.
library(purrr)  # for map() and map_dbl()
library(wCorr)  # for weightedCorr()

# All variable pairs, each tagged with the name of the weight column
data_list <- combn(names(df_correlation), 2, simplify = FALSE)
data_list <- map(data_list, ~c(., "BalancingWeights"))

dimension <- length(names(df_correlation))
allcorr <- matrix(data = NA, nrow = dimension, ncol = dimension)
rownames(allcorr) <- names(df_correlation)
colnames(allcorr) <- names(df_correlation)

# Weighted Pearson correlation for one pair, using pairwise-complete rows
myfunction <- function(data, x, y, weight) {
  indice <- !(is.na(data[[x]]) | is.na(data[[y]]))
  wCorr::weightedCorr(data[[x]][indice], data[[y]][indice],
                      method = "Pearson",
                      weights = data[[weight]][indice],
                      ML = FALSE, fast = TRUE)
}

b <- map_dbl(data_list, ~myfunction(newdf, .[1], .[2], .[3]))

# combn() emits pairs in the column-major order of the lower triangle, so fill
# the lower triangle first, then mirror it to make the matrix symmetric.
allcorr[lower.tri(allcorr, diag = FALSE)] <- b
allcorr[upper.tri(allcorr)] <- t(allcorr)[upper.tri(allcorr)]
diag(allcorr) <- 1
View(allcorr)
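A hypothetical toy dataset to exercise the snippet above; define it first, then run the code. All names other than BalancingWeights are my own:
set.seed(42)
newdf <- data.frame(v1 = rnorm(100), v2 = rnorm(100), v3 = rnorm(100),
                    BalancingWeights = runif(100))
newdf$v2[sample(100, 10)] <- NA   # introduce some missingness
df_correlation <- newdf[, c("v1", "v2", "v3")]
# Running the code above now yields a symmetric 3 x 3 weighted correlation
# matrix computed on pairwise-complete observations.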

Generating n new datasets by randomly sampling existing data, and then applying a function to new datasets

For a paper I'm writing I have subsetted a larger dataset into 3 groups, because I thought the strength of correlations between 2 variables in those groups would differ (they did). I want to see if subsetting my data into random groupings would also significantly affect the strength of correlations (i.e., whether what I'm seeing is just an effect of subsetting, or if those groupings are actually significant).
To this end, I am trying to generate n new data frames by randomly sampling 150 rows from an existing dataset, and then want to calculate correlation coefficients for two variables in those n new data frames, saving the correlation coefficient and significance in a new file.
But, HOW?
I can do it manually, e.g., with dplyr, something like
newdata <- sample_n(Random_sample_data, 150)
output <- cor.test(newdata$x, newdata$y, method="kendall")
I'd obviously like not to type this out 1,000 or 100,000 times, and I have been trying things with loops and lapply (see below), but they haven't worked (undoubtedly because of something really obvious that I'm missing!).
Here I have tried to assign each row to a different group, with 10 groups in total, and then to do correlations between x and y by those groups:
Random_sample_data <- select(Range_corrected, x, y)
cat <- sample(1:10, 1229, replace = TRUE)
Random_sample_cats <- cbind(Random_sample_data, cat)
correlation <- function(c) {
  c <- cor.test(x, y, method = "kendall")
  return(c)
}
b<- daply(Random_sample_cats, .(cat), correlation)
Error message:
Error in cor.test(x, y, method = "kendall") :
object 'x' not found
Once you have the code for what you want to do once, you can put it in replicate to do it n times. (Incidentally, your attempt fails because cor.test(x, y) looks for objects named x and y rather than columns of the data frame passed to your function; inside it you would need c$x and c$y.) Here's a reproducible example on built-in data:
library(dplyr)  # for sample_n()
result = replicate(n = 10, expr = {
  newdata <- sample_n(mtcars, 10)
  output <- cor.test(newdata$wt, newdata$qsec, method = "kendall")
})
replicate will save the result of the last line of what you did (output <- ...) for each replication. It will attempt to simplify the result, in this case cor.test returns a list of length 8, so replicate will simplify the results to a matrix with 8 rows and 10 columns (1 column per replication).
You may want to clean up the results a little bit so that, e.g., you only save the p-value. Here, we store only the p-value, so the result is a vector with one p-value per replication, not a matrix:
result = replicate(n = 10, expr = {
  newdata <- sample_n(mtcars, 10)
  cor.test(newdata$wt, newdata$qsec, method = "kendall")$p.value
})
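To address the original goal of saving both the coefficient and its significance to a file, here is a sketch along the same lines (the output file name is an assumption):
result <- replicate(n = 1000, expr = {
  newdata <- sample_n(mtcars, 10)
  ct <- cor.test(newdata$wt, newdata$qsec, method = "kendall")
  c(tau = unname(ct$estimate), p.value = ct$p.value)
})
res_df <- as.data.frame(t(result))  # one row per replication
write.csv(res_df, "correlations.csv", row.names = FALSE)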

Pearson coefficient per rows on large matrices

I'm currently working with a large matrix (4 columns and around 8,000 rows).
I want to perform a correlation analysis using Pearson's correlation coefficient between the different rows of this matrix.
I would like to proceed in the following way:
Find Pearson's correlation coefficient between row 1 and row 2. Then between rows 1 and 3... and so on with the rest of the rows.
Then find Pearson's correlation coefficient between row 2 and row 3. Then between rows 2 and 4... and so on with the rest of the rows. Note I won't find the coefficient with row 1 again...
For coefficients above 0.7 or below -0.7, I would like to list in a separate file the row names corresponding to those coefficients, plus the coefficient itself. E.g.:
row 230 - row 5812 - 0.76
I wrote the following code for this. Unfortunately, it takes far too long to run (I estimated almost a week :( ).
arist1p <- NULL  # collects rows of (row name i, row name j, correlation)
for (i in 1:7999) {
  print("Analyzing row:")
  print(i)
  for (j in (i+1):8000) {
    value <- cor(alpha1k[i,], alpha1k[j,], use = "everything", method = "pearson")
    if (value > 0.7 | value < (-0.7)) {
      aristi <- c(row.names(alpha1k)[i], row.names(alpha1k)[j], value)
      arist1p <- rbind(arist1p, aristi)
    }
  }
}
My question, then, is whether there is any way to do this faster. I read about running these calculations in parallel, but I have no clue how to make that work. I hope I made myself clear enough; thank you in advance!
As Roland pointed out, you can use the matrix version of cor to simplify your task. Just transpose your matrix so that its rows become the variables being compared:
mydf <- data.frame(a = c(1,2,3,1,2,3,1,2,3,4), b = rep(5, 10), c = 1:10)
cor_mat <- cor(t(mydf))  # correlations between the rows of mydf
# Keep the upper triangle only, so each pair appears once and the all-1
# diagonal is excluded
idx <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
cbind(idx, correlation = cor_mat[idx])  # coordinates plus the correlation
Note that use = "everything" and method = "pearson" are the defaults of cor, so there is no need to specify them.
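To reproduce the requested "row 230 - row 5812 - 0.76" listing, the index pairs can be mapped back to row names and written out. A sketch; the output file name is an assumption:
pairs_df <- data.frame(row1 = rownames(cor_mat)[idx[, 1]],
                       row2 = rownames(cor_mat)[idx[, 2]],
                       r    = cor_mat[idx])
write.table(pairs_df, "high_correlations.txt", row.names = FALSE, quote = FALSE)
For the 8000-row case, cor(t(alpha1k)) computes the full 8000 x 8000 matrix in one vectorized call (about 0.5 GB of doubles), which is dramatically faster than the nested loop.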

Using mat2listw function in R to create spatial weights matrix

I am attempting to create a weights object in R with the mat2listw function. I have a very large spatial weights matrix (roughly 22,000x22,000)
that was created in Excel and read into R, and I'm now trying to implement:
library(spdep)
SW <- mat2listw(matrix)
I am getting the following error:
Error in if (any(x < 0)) stop("values in x cannot be negative") :
  missing value where TRUE/FALSE needed
What's going wrong here? My current matrix is all 0's and 1's, with no
missing values and no negative elements. What am I missing?
I'd appreciate any advice. Thanks in advance for your help!
Here is a simple test based on your previous comment:
library(spdep)
m1 <- matrix(rbinom(100, 1, 0.5), ncol = 10, nrow = 10)  # a random 10 x 10 0/1 matrix
m2 <- m1          # duplicate of the first matrix
m2[5,4] <- NA     # plant an NA value in the second matrix
SW <- mat2listw(m1)   # succeeds: the matrix is clean
SW2 <- mat2listw(m2)  # fails with your exact error because of the NA
The first call does not fail, but the second one does. The real question is why your weights matrix ends up containing NAs. Have you considered building the spatial weights matrix directly in R, using dnearneigh() or another spdep function?
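Since the error points to NAs, it is worth inspecting the matrix right after importing it from Excel. A minimal check, with m standing in for your imported matrix:
any(is.na(m))                    # TRUE would explain the error
which(is.na(m), arr.ind = TRUE)  # locate the offending cells
# Blank cells in the spreadsheet are a common culprit: they import as NA,
# so replacing them with 0 before calling mat2listw() may be all that is needed.
m[is.na(m)] <- 0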
