I want to sample 60 random rows 1000 times with replace=TRUE and calculate the correlation coefficient between the first and second columns in each sample.
I don't know how to sample rows randomly, so I tried to sample 60 numbers from 1:60 and use them as row numbers.
The raw data is a 60x2 matrix called data1.
My code is:
k <- list()
data.sam <- list()
set.seed(1)
for (j in 1:60){
for (i in 1:1000){
k[[i]] <- sample(1:60, 60, replace = TRUE)
}
data.sam[[i]][j,] <- data1[k[[i]][j],]
corr <- vector()
corr[i] <- cor(data.sam[[i]][,1],data.sam[[i]][,2])
}
It produces this error:
Error in `*tmp*`[[i]] : subscript out of bounds
It doesn't look like the j variable is doing very much. Your indexing is already vectorized by k[[i]], so you don't need two explicit loops. Also, don't reset the corr variable inside the loop.
Instead, I might write:
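set.seed(1)
data1 <- matrix(rnorm(120), 60, 2)  # example data standing in for your 60x2 matrix
k <- list()
data.sam <- list()
corr <- numeric(1000)               # create the result vector once, outside the loop
for (i in 1:1000) {
  k[[i]] <- sample(1:60, 60, replace = TRUE)  # row indices for this resample
  data.sam[[i]] <- data1[k[[i]], ]
  corr[i] <- cor(data.sam[[i]][, 1], data.sam[[i]][, 2])
}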
You can then look at the distribution of the resampled correlations:
hist(corr)
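As a more compact variant of the loop above, the same resampling can be sketched with replicate(). This is only an illustration on simulated data: data1 here is a random stand-in for the real 60x2 matrix, and the intermediate lists k and data.sam are dropped on the assumption that only the correlations are needed.

```r
set.seed(1)
data1 <- matrix(rnorm(120), 60, 2)  # assumed stand-in for the real data

# Each replication draws 60 row indices with replacement and
# returns the correlation between the two columns of that resample.
corr <- replicate(1000, {
  idx <- sample(nrow(data1), replace = TRUE)
  cor(data1[idx, 1], data1[idx, 2])
})

length(corr)
```

Because replicate() returns a plain numeric vector here, hist(corr) works on it directly.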
I have two datasets with abundance data from groups of different species. Columns are species and rows are sites. The sites (rows) are identical between the two datasets, and I am trying to correlate the columns of the first dataset with the columns of the second dataset to see whether there is a positive or a negative correlation.
library(Hmisc)
rcorr(otu.table.filter$sp1, new6$spA, type="spearman")$P
rcorr(otu.table.filter$sp1, new6$spA, type="spearman")$r
The first gives me the p value of the relation between sp1 and spA, and the second gives the r value.
I initially created a loop that allowed me to check all species of the first dataframe with a single column of the second dataframe. Needless to say if I was to make this work I would have to repeat the process a few hundred times.
My simple loop for one column of df1(new6) against all columns of df2(otu.table.filter)
pvalues = list()
for(i in 1:ncol(otu.table.filter)) {
pvalues[[i]] <-(rcorr(otu.table.filter[ , i], new6$Total, type="spearman"))$P
}
rvalues = list()
for(i in 1:ncol(otu.table.filter)) {
rvalues[[i]] <-(rcorr(otu.table.filter[ , i], new6$Total, type="spearman"))$r
}
p<-NULL
for(i in 1:length(pvalues)){
tmp <-print(pvalues[[i]][2])
p <- rbind(p, tmp)
}
r<-NULL
for(i in 1:length(rvalues)){
tmp <-print(rvalues[[i]][2])
r <- rbind(r, tmp)
}
fdr<-as.matrix(p.adjust(p, method = "fdr", n = length(p)))
sprman<-cbind(r,p,fdr)
Using the above as a starting point, I tried to create a nested loop that examines one column of df1 against all columns of df2, then proceeds to the second column of df1 against all columns of df2, and so on.
But here I am a bit lost, and I could not find a solution in R.
I would assume that the pvalues output should be a list of
pvalues[[i]][[j]]
and similarly the rvalues output
rvalues[[i]][[j]]
but I don't know how to do that. I tried:
pvalues = list()
rvalues = list()
for (j in 1:7){
for(i in 1:ncol(otu.table.filter)) {
pvalues[[i]][[j]] <-(rcorr(otu.table.filter[ , i], new7[,j], type="spearman"))$P
}
for(i in 1:ncol(otu.table.filter)) {
rvalues[[i]][[j]] <-(rcorr(otu.table.filter[ , i], new7[,j], type="spearman"))$r
}
}
But I cannot make it work because I am not sure how to direct the output into the lists. I would also appreciate help with the next part, which is to extract the p and r value for each comparison and apply the fdr adjustment (similar to what I did with my simple loop).
here is a subset of my two dataframes
Here is a small demo. Let's assume two matrices x and y with a sample size n. Then the correlations and approximate p-values can be estimated as:
n <- 100
x <- matrix(rnorm(10 * n), nrow = n)
y <- matrix(rnorm(5 * n), nrow = n)
## correlation matrix
r <- cor(x, y, method = "spearman")
## p-values
pval <- function(r, n) 2 * (1 - pt(abs(r)/sqrt((1 - r^2)/(n - 2)), n - 2))
pval(r, n)
## for comparison
cor.test(x[,1], y[,1], method = "spearman", exact = FALSE)
More details can be found here: https://stats.stackexchange.com/questions/312216/spearman-correlation-significancy-test
Edit
And finally a loop with cor.test:
## p-values from cor.test for every pair of columns
p <- matrix(NA, nrow = ncol(x), ncol=ncol(y))
for (i in 1:ncol(x)) {
for (j in 1:ncol(y)) {
p[i, j] <- cor.test(x[,i], y[,j], method = "spearman")$p.value
}
}
p
The values differ somewhat because the first approach uses the t-approximation, while the second uses the "exact AS 89 algorithm" of cor.test.
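The question also asks for an FDR adjustment of the p-values. A minimal sketch of that step, assuming the t-approximation shown earlier and simulated matrices in place of the real data, could look like the following. The names res, x_col and y_col are hypothetical and chosen only for this illustration.

```r
set.seed(42)
n <- 100
x <- matrix(rnorm(10 * n), nrow = n)  # simulated stand-in for the first dataset
y <- matrix(rnorm(5 * n), nrow = n)   # simulated stand-in for the second dataset

# Spearman correlations for every column pair (10 x 5 matrix)
r <- cor(x, y, method = "spearman")

# Approximate two-sided p-values via the t-approximation
p <- 2 * pt(-abs(r) * sqrt((n - 2) / (1 - r^2)), df = n - 2)

# p.adjust() returns a plain vector, so restore the matrix shape explicitly
fdr <- matrix(p.adjust(p, method = "fdr"), nrow = nrow(p))

# Long-format table: one row per (x column, y column) comparison
res <- data.frame(
  x_col = as.vector(row(r)),
  y_col = as.vector(col(r)),
  r     = as.vector(r),
  p     = as.vector(p),
  fdr   = as.vector(fdr)
)
head(res[order(res$p), ])
```

Adjusting all p-values in one call treats every pairwise comparison as part of a single family of tests. Whether that is the appropriate family depends on the analysis.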
I have this population:
MyPopulation <- c(1:100)
and I want to create a data frame with 40 columns and 5 rows. Each column has to be a random sample of MyPopulation, so I tried this:
MySample <- data.frame(NoSample = c(1:5))
for (i in 1:40) {
MySample$i <- sample(MyPopulation,5)
}
The result is a data frame with only 1 more column (named i) with a random sample as values.
What am I doing wrong?
The easiest solution probably would be
MyPopulation <- c(1:100)
MySample <- data.frame(NoSample = c(1:5))
for (i in 1:40) {
MySample[,i+1] <- sample(MyPopulation,5)
}
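You cannot assign new columns that way: MySample$i does not use the value of i, it always refers to a column literally named "i", which is overwritten on every iteration. Try MySample[paste(i)] = ...
That builds the column name from the loop value as a string, so each iteration creates a new column.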
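To illustrate the difference between $i and a computed column name, here is a small sketch. The "Sample" prefix in the column names is an assumption made for readability and is not part of the original question.

```r
MyPopulation <- 1:100
MySample <- data.frame(NoSample = 1:5)

for (i in 1:40) {
  # [[ ]] evaluates its argument, so the column name changes with i;
  # MySample$i would instead always target a column literally named "i".
  MySample[[paste0("Sample", i)]] <- sample(MyPopulation, 5)
}

dim(MySample)
```

The resulting data frame keeps the NoSample column and gains one column per iteration.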
Maybe you can try replicate + as.data.frame
MySample <- as.data.frame(replicate(40,sample(MyPopulation,5)))
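You can also draw a single stream of random values and specify only the number of columns in a matrix, letting the row count be inferred. Note that this samples with replacement across the whole stream, so values can repeat within a column: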
m <- matrix(sample(1:1000, 200, replace = TRUE), ncol = 40)
df <- as.data.frame(m)
I am trying to randomly sample 10 individuals from a population and repeat 1000 times. Is this possible? Here is my code so far and I am not quite sure if I am on the right track. I keep receiving the error "number of items to replace is not a multiple of replacement length".
Here is my code:
B<-1000
for (i in 1:B){
FR3_Acropora_Sample[i]<-(sample(FR3_Acropora$Ratio,size=10,replace=TRUE))
}
Consider replicate (wrapper to sapply):
# MATRIX
sample_matrix <- replicate(B, sample(FR3_Acropora$Ratio, size=10, replace=TRUE))
# LIST
sample_list <- replicate(B, sample(FR3_Acropora$Ratio, size=10, replace=TRUE),
simplify = FALSE)
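As a quick check of what the matrix form looks like, here is a sketch on made-up data. FR3_Acropora below is a hypothetical stand-in with a random Ratio column, since the real data frame is not shown, and the per-sample mean is only an example of a follow-up statistic.

```r
set.seed(123)
FR3_Acropora <- data.frame(Ratio = runif(50))  # hypothetical stand-in data
B <- 1000

# One column per repetition, one row per sampled individual
sample_matrix <- replicate(B, sample(FR3_Acropora$Ratio, size = 10, replace = TRUE))
dim(sample_matrix)

# Example follow-up: the mean of each of the B samples
sample_means <- colMeans(sample_matrix)
```

This also explains the original error: each call to sample() returns 10 values, which cannot be stored in the single vector element FR3_Acropora_Sample[i].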
I believe you can accomplish this as follows. I create a sample dataset of the numbers 1 through 50 (you'll skip this step, of course). I initialize a list of length 100. I loop from 1 to 100 and assign a random sample to each empty slot in the list. I can then access any sample with sampleList[[x]], where x is any number from 1 to 100.
x <- c(1:50)
sampleList <- vector(mode="list", length=100)
for (i in 1:100) {
sampleList[[i]] = sample(x, size = 10, replace = TRUE)
}
Using your variable names, this would look like:
B<-1000
FR3_Acropora_Sample <- vector(mode="list", length=B)
for (i in 1:B){
FR3_Acropora_Sample[[i]]=sample(FR3_Acropora$Ratio,size=10,replace=TRUE)
}
I have a dataset where a subset of measurements for each entry are randomly missing:
dat <- matrix(runif(100), nrow=10)
rownames(dat) <- letters[1:10]
colnames(dat) <- paste("time", 1:10)
dat[sample(100, 25)] <- NA
I am interested in calculating correlations between each row in this dataset (i.e., a-a, a-b, a-c, a-d, ...). However, I would like to exclude correlations where there are fewer than 5 pairwise non-NA observations by setting their value to NA in the resulting correlation matrix.
Currently I am doing this as follows:
cor <- cor(t(dat), use = 'pairwise.complete.obs')
names <- rownames(dat)
filter <- sapply(names, function(x1) sapply(names, function(x2)
sum(!is.na(dat[x1,]) & !is.na(dat[x2,])) < 5))
cor[filter] <- NA
However, this operation is very slow as the actual dataset contains >1,000 entries.
Is there way to filter cells based on the number of non-NA pairwise observations in a vectorized manner, instead of within nested loops?
You can count the number of pairwise non-NA observations using a matrix approach.
Let's use this data generation code. I made the data larger and added more NAs.
nr = 1000;
nc = 900;
dat = matrix(runif(nr*nc), nrow=nr)
rownames(dat) = paste(1:nr)
colnames(dat) = paste("time", 1:nc)
dat[sample(nr*nc, nr*nc*0.9)] = NA
Your filter code then takes about 85 seconds:
tic = proc.time()
names = rownames(dat)
filter = sapply(names, function(x1) sapply(names, function(x2)
sum(!is.na(dat[x1,]) & !is.na(dat[x2,])) < 5));
toc = proc.time();
show(toc-tic);
# 85.50 seconds
My version creates a matrix with values 1 for non-NAs in the original data. Then using matrix multiplication I calculate number of pairwise non-NAs. It ran in a fraction of a second.
tic = proc.time()
NAmat = matrix(0, nrow = nr, ncol = nc)
NAmat[ !is.na(dat) ] = 1;
filter2 = (tcrossprod(NAmat) < 5)
toc = proc.time();
show(toc-tic);
# 0.09 seconds
Simple check shows the results are the same:
all(filter == filter2)
# TRUE
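Putting the pieces together on the small example from the question, the full workflow might be sketched as follows. The names obs, n_pairs and cmat are chosen for this illustration. The correlation matrix is stored as cmat rather than cor to avoid masking the cor() function.

```r
set.seed(1)
dat <- matrix(runif(100), nrow = 10)
rownames(dat) <- letters[1:10]
colnames(dat) <- paste("time", 1:10)
dat[sample(100, 25)] <- NA

# 1 where a value is observed, 0 where it is NA
obs <- (!is.na(dat)) * 1

# Entry [a, b] counts the time points observed in both row a and row b
n_pairs <- tcrossprod(obs)

# Pairwise correlations between rows, then blank out poorly supported pairs
cmat <- cor(t(dat), use = "pairwise.complete.obs")
cmat[n_pairs < 5] <- NA
```

The diagonal of n_pairs holds the number of non-NA values in each row, so rows with fewer than 5 observations are removed entirely.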
I created a function to run cross validation on whichever model I enter into it:
# vectors to hold results
master_modelname <- vector()
master_modelmea <- vector()
# model coefficients
master_sink <- attr(mini_kitchensink$terms,"term.labels")
master_stepF <- attr(stepF$terms,"term.labels")
master_stepB <- attr(stepB$terms,"term.labels")
# folds
myKfolds <- function(preds, dataset, num_folds=5, dependant=c("loss")) {
  # create vectors for holding predictions and actuals from k folds iterations
  cv_prediction <- data.frame()
  testsetCopy <- data.frame()
  kmeas <- vector()
  folds <- createFolds(dataset[,dependant], k=num_folds)
  for ( f in folds ) {
    ktrain <- dataset[-f,]
    ktest <- dataset[f,]
    kmodel <- lm(paste(dependant,"~", paste(preds, collapse="+"),sep=""), data=ktrain)
    predictions <- predict(kmodel, interval="prediction", newdata=ktest)
    temp <- as.data.frame(predictions)
    cv_prediction <- rbind(cv_prediction, temp)
    testsetCopy <- rbind(testsetCopy, ktest)
    errors <- ktest[,dependant] - as.numeric(predictions[,"fit"])
    kmeas <- c(kmeas,mean(abs(errors)))
  }
  master_modelname <- c(master_modelname, "bla")
  master_modelmea <- c(master_modelmea, mean(kmeas))
}
After calling the function:
myKfolds(master_stepB, ptrain, num_folds=5)
It finishes after about 10 minutes.
But my vectors master_modelname and master_modelmea are empty.
However, if I just highlight and run the last two lines of my function, it works, these two vectors now have values 'bla' and a single number which is mean(kmeas)
I have to "force run" the final two lines of the function to get the expected results. Does this mean that the last two lines of my function are not running for some reason?
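One likely explanation is R's scoping rules: an assignment with <- inside a function creates or modifies a local variable, so the master_modelname and master_modelmea in the global environment are never touched, even though the two lines do run. A minimal sketch of the usual fix, returning the results and appending them outside the function, is shown below. The function run_cv is a toy stand-in for myKfolds and its arguments are hypothetical.

```r
# Toy stand-in for myKfolds(): it returns its results
# instead of trying to modify variables outside the function.
run_cv <- function(name, errors) {
  list(modelname = name, modelmea = mean(abs(errors)))
}

master_modelname <- character()
master_modelmea <- numeric()

res <- run_cv("bla", c(-1, 2, -3))

# Append in the calling environment, where the master vectors live
master_modelname <- c(master_modelname, res$modelname)
master_modelmea <- c(master_modelmea, res$modelmea)
```

Highlighting and running the last two lines by hand works for the same reason: at the top level those assignments act on the global vectors.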