I would like to randomly replace elements in a matrix with some specified value, here -99. I tried the first method below and it did not work. Then I tried a different approach, also below, and it did work.
Why does the first method not work? What am I doing incorrectly? Thank you for any advice.
I suspect the second method is better because, apart from working, it allows me to specify the percentage of the elements I want replaced. The first method does not since it can randomly draw the same i,j pairs repeatedly.
Here is the first method, the one that does not work:
# This does not work
set.seed(1234)
ncols <- 10
nrows <- 5
NA_value <- -99
my.fake.data <- round(rnorm(ncols*nrows, 20, 5))
my.fake.grid <- matrix(my.fake.data, nrow=nrows, ncol=ncols, byrow=TRUE)
my.fake.grid
random.i <- sample(ncols, round(0.40*nrows*ncols), replace = TRUE)
random.j <- sample(nrows, round(0.40*nrows*ncols), replace = TRUE)
my.fake.grid[random.j, random.i] <- NA_value
my.fake.grid
Here is the second method, the one that does work:
# This works
set.seed(1234)
ncols <- 10
nrows <- 5
NA_value <- -99
my.fake.data <- round(rnorm(ncols*nrows, 20, 5))
my.fake.grid <- matrix(my.fake.data, nrow=nrows, ncol=ncols, byrow=TRUE)
my.fake.grid
my.fake.data2 <- c(my.fake.grid)
random.x <- sample(length(my.fake.data2), round(0.40*length(my.fake.data2)), replace = FALSE)
my.fake.data2[random.x] <- NA_value
my.fake.grid2 <- matrix(my.fake.data2, nrow=nrows, ncol=ncols, byrow=FALSE)
my.fake.grid2
Could try
library(data.table) # For a faster cross join; alternatively, could use expand.grid
temp <- as.matrix(CJ(seq_len(nrows), seq_len(ncols))) # Create all possible row/column index combinations
indx <- temp[sample(nrow(temp), round(0.4 * nrow(temp))), ] # Sample 40% of them
my.fake.grid[indx] <- NA_value # Replace with -99
sum(my.fake.grid == -99)/(ncols * nrows) # Validating percentage
##[1] 0.4
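For what it's worth, a likely reason the first method fails: indexing a matrix with two separate vectors, as in my.fake.grid[random.j, random.i], selects every combination of the listed rows and columns, not the pairs (random.j[k], random.i[k]). So far more than 40% of the cells can end up replaced. A minimal sketch with made-up indices (the zero matrix m and the vectors rows and cols are illustrative, not taken from the question):

```r
m <- matrix(0, nrow = 5, ncol = 10)
rows <- c(1, 3, 5)
cols <- c(2, 4, 6)

# Two separate index vectors select every row/column combination (3 x 3 cells)
cross <- m
cross[rows, cols] <- -99
n_cross <- sum(cross == -99)

# A two-column matrix selects only the listed (row, column) pairs (3 cells)
paired <- m
paired[cbind(rows, cols)] <- -99
n_paired <- sum(paired == -99)

c(n_cross = n_cross, n_paired = n_paired)
```

Pairing the indices with cbind() fixes the indexing, but sampling with replace = TRUE can still draw the same pair twice, so the second method (or the one above) remains the way to hit an exact percentage.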
I am stuck.
We are asked to pick 30 random data from our dataset, then replace the picked values with NAs.
I'm stuck at the beginning, using the following function, as it selects 30 random data items from each column, while I want 30 random data picked among the whole dataset.
data2[sample(nrow(data2),30), ]
I hope you can help me out, thank you for your help.
Do you mean to replace 30 random rows?
data2 <- iris # as an example
throwouts <- sample(nrow(data2),30)
data2[throwouts, ] <- NA
print(data2)
Do you mean to replace 30 values in random rows and random columns?
data2 <- iris # as an example
coords <- expand.grid(1:nrow(data2),1:ncol(data2)) # all the possible values
coords <- coords[ sample(nrow(coords), 30), ] # take 30 unique ones of all possible values
for(i in 1:30) # erase each of them individually
data2[coords$Var1[i], coords$Var2[i] ] <- NA
print(data2)
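The loop over coordinates can probably be replaced by a single assignment, since a two-column numeric matrix of (row, column) pairs can be used as an index. A sketch under that assumption, using iris again (the names picked and n_missing are introduced here for illustration):

```r
set.seed(1)
data2 <- iris  # as an example; it has no missing values to begin with

# All (row, column) pairs, then 30 distinct ones chosen at random
coords <- expand.grid(row = seq_len(nrow(data2)), col = seq_len(ncol(data2)))
picked <- as.matrix(coords[sample(nrow(coords), 30), ])

# A two-column numeric matrix indexes individual cells
data2[picked] <- NA

n_missing <- sum(is.na(data2))
n_missing
```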
The following seems to be memory efficient. It builds a logical matrix that is FALSE everywhere except for 30 TRUE values in random positions, and uses it to assign the NAs.
set.seed(2020)
v <- rep(FALSE, prod(dim(df1)))
v[sample(length(v), 30)] <- TRUE
is.na(df1) <- matrix(v, nrow = nrow(df1))
rm(v)
This can easily be written as a function.
assignNA <- function(x, n){
  v <- rep(FALSE, prod(dim(x)))
  v[sample(length(v), n)] <- TRUE  # use the n argument rather than a hard-coded 30
  is.na(x) <- matrix(v, nrow = nrow(x))
  x
}
set.seed(2020)
assignNA(df1, n = 30)
Tested with the data
df1 <- iris
Hello, I am trying to speed up a block of code that currently works but is quite slow with the amount of data that I have. For each row, I need to identify the top n% highest values and then use them to subset another matrix and average the values of that subset. Any help or suggestions would be appreciated. This is my current approach:
corrMat <- matrix(runif(944*9843), nrow=944, ncol = 9843)
GeneExpression <- matrix(runif(11674*9843, min=0, max=100), nrow = 11674, ncol=9843)
cutOff <- apply(corrMat, MARGIN = 1, FUN = quantile, 0.99)
topCells <- corrMat > cutOff
data <- matrix(, nrow = nrow(topCells), ncol = nrow(GeneExpression))
colnames(data) <- rownames(GeneExpression)
for(i in colnames(data)){
for(j in 1:nrow(topCells)){
data[j,i] <- mean(t(GeneExpression[i, topCells[j,]]))
}
}
data
Here's a smaller version of your example along with my base R solution. Chances are there's also a neat tidyverse way of doing this but I wouldn't know.
corrMat <- matrix(runif(24*18), nrow=24)
GeneExpression <- matrix(runif(36*18, min=0, max=100), nrow = 36)
cutOff <- apply(corrMat, MARGIN = 1, FUN = quantile, 0.99)
topCells <- corrMat > cutOff
data <- data2 <- matrix(, nrow = nrow(topCells), ncol = nrow(GeneExpression))
colnames(data) <- rownames(GeneExpression) # rownames are NULL so this is not needed
start <- Sys.time() # benchmarking
for(i in 1:ncol(data)){ # iterate by column rather than colname
for(j in 1:nrow(topCells)){
data[j,i] <- mean(t(GeneExpression[i, topCells[j,]]))
}
}
eric <- Sys.time() - start
start <- Sys.time()
# apply over rows of topCells to take row means of GeneExpression
# per row of topCells
# then just transpose
data2 <- t(apply(topCells, 1, function(x) rowMeans(GeneExpression[, x, drop = FALSE])))
milan <- Sys.time() - start
all(data == data2)
[1] TRUE
eric
Time difference of 0.08776498 secs
milan
Time difference of 0.02593184 secs
Using your original example data, my solution takes 6.43s to run.
Hope this helps.
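Another option, sketched here on the assumption that every row of topCells contains at least one TRUE, is to express the averaging as a matrix product: summing the selected columns is a product with a 0/1 matrix, and dividing by the number of selected cells gives the mean. The names W and data3 are introduced for illustration, and no timing is claimed for the full-size data.

```r
set.seed(42)
corrMat <- matrix(runif(24 * 18), nrow = 24)
GeneExpression <- matrix(runif(36 * 18, min = 0, max = 100), nrow = 36)
cutOff <- apply(corrMat, MARGIN = 1, FUN = quantile, 0.99)
topCells <- corrMat > cutOff

# Reference result: the apply()/rowMeans() approach
data2 <- t(apply(topCells, 1, function(x) rowMeans(GeneExpression[, x, drop = FALSE])))

# 0/1 weights: W[j, k] is 1 when cell k is a top cell for row j
W <- topCells * 1

# One matrix product sums the selected columns; dividing by the
# per-row count of selected cells turns the sums into means
data3 <- (W %*% t(GeneExpression)) / rowSums(W)

isTRUE(all.equal(data2, data3))
```

The comparison uses all.equal() rather than == because the two routes add the numbers in a different order and may differ in the last few bits.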
I am trying to optimize an algorithm and I really want to avoid all my loops. Hence I am wondering if there is a way to avoid the following simple loop:
library(FNN)
data <- cbind(1:10, 1:10)
NN.index <- get.knn(data, 5)$nn.index
bc <- matrix(0, nrow(NN.index), max(NN.index))
for(i in 1:nrow(bc)){
bc[i,NN.index[i,]] <- 1
}
where bc is a matrix of zeros.
In R, if a matrix M is indexed with a single k-by-2 matrix I, then each row of I is interpreted as the row and column index of one element of M. For example
M = matrix(1:12, nrow = 4, ncol = 3) # 12 values to fill a 4-by-3 matrix
print(M)
I = rbind(c(1,2), c(4,2), c(3,3))
print(M[I])
In this case, M[1,2], M[4,2] and M[3,3] are extracted.
In your case, we can create row_index and col_index from NN.index as below, and then assign 1 to the corresponding entries.
bc <- matrix(0, nrow(NN.index), max(NN.index))
row_index <- rep(1:nrow(NN.index), times = ncol(NN.index))
col_index <- as.vector(NN.index)
bc[cbind(row_index, col_index)] <- 1
print(bc)
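As a quick check that matrix indexing reproduces the loop, here is a small sketch that does not need FNN. The NN.index below is a hand-made stand-in for get.knn(...)$nn.index, and row(NN.index) is used as an alternative way to build the row indices:

```r
# Hypothetical neighbour-index matrix standing in for get.knn(...)$nn.index
NN.index <- rbind(c(2, 3), c(1, 3), c(2, 4), c(3, 2))

# Loop version from the question
bc_loop <- matrix(0, nrow(NN.index), max(NN.index))
for (i in seq_len(nrow(bc_loop))) {
  bc_loop[i, NN.index[i, ]] <- 1
}

# Matrix-index version: row(NN.index) holds the row number of every entry,
# so pairing it with the entries themselves gives (row, column) coordinates
bc_vec <- matrix(0, nrow(NN.index), max(NN.index))
bc_vec[cbind(c(row(NN.index)), c(NN.index))] <- 1

identical(bc_loop, bc_vec)
```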
I want to calculate correlation statistics using cor.test(). I have a data matrix where the two members of each pair to be tested are on consecutive lines (I have more than a thousand pairs, so I also need to correct for multiple testing later). I was thinking that I could loop through the matrix two lines at a time and perform the test (i.e. first test the correlation between row1 and row2, then row3 and row4, row5 and row6 etc.), but I don't know how to write this kind of loop.
This is how I do the test on a single pair:
d = read.table(file="cor-test-sample-data.txt", header=T, sep="\t", row.names = 1)
d = as.matrix(d)
cor.test(d[1,], d[2,], method = "spearman")
You could try
res <- lapply(split(seq_len(nrow(mat1)), (seq_len(nrow(mat1)) - 1) %/% 2 + 1),
              function(i) {
                m1 <- mat1[i, ]
                if (NROW(m1) == 2) {
                  cor.test(m1[1, ], m1[2, ], method = "spearman")
                }
                else NA
              })
To get the p-values
resP <- sapply(res, function(x) x$p.value)
indx <- t(`dim<-`(seq_len(nrow(mat1)), c(2, nrow(mat1)/2)))
names(resP) <- paste(indx[,1], indx[,2], sep="_")
resP
# 1_2 3_4 5_6 7_8 9_10 11_12 13_14
#0.89726818 0.45191660 0.14106085 0.82532260 0.54262680 0.25384239 0.89726815
# 15_16 17_18 19_20 21_22 23_24 25_26 27_28
#0.02270217 0.16840791 0.45563229 0.28533447 0.53088721 0.23453161 0.79235990
# 29_30 31_32
#0.01345768 0.01611903
Or using mapply (assuming the number of rows is even)
ind <- seq(1, nrow(mat1), by=2) #similar to the one used by #CathG in for loop
mapply(function(i,j) cor.test(mat1[i,], mat1[j,],
method='spearman')$p.value , ind, ind+1)
data
set.seed(25)
mat1 <- matrix(sample(0:100, 20*32, replace=TRUE), ncol=20)
Try
d = matrix(rep(1:9, 3), ncol=3, byrow = T)
sapply(2*(1:(nrow(d)/2)), function(pair) unname(cor.test(d[pair-1,], d[pair,], method="spearman")$estimate))
pvalues<-c()
for (i in seq(1,nrow(d),by=2)) {
pvalues<-c(pvalues,cor.test(d[i,],d[i+1,],method="spearman")$p.value)
}
names(pvalues)<-paste(row.names(d)[seq(1,nrow(d),by=2)],row.names(d)[seq(2,nrow(d),by=2)],sep="_")
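Since the question also mentions correcting for the number of pairs, here is a minimal sketch of one way to do that with p.adjust(), reusing the simulated mat1 from the earlier answer. The choice of the Benjamini-Hochberg method and the names odd and padj are assumptions made for illustration, and an even number of rows is assumed.

```r
set.seed(25)
mat1 <- matrix(sample(0:100, 20 * 32, replace = TRUE), ncol = 20)

# First row of each consecutive pair (assumes an even number of rows)
odd <- seq(1, nrow(mat1) - 1, by = 2)

# One Spearman test per pair; exact = FALSE avoids warnings about ties
pvalues <- sapply(odd, function(i)
  cor.test(mat1[i, ], mat1[i + 1, ], method = "spearman", exact = FALSE)$p.value)
names(pvalues) <- paste(odd, odd + 1, sep = "_")

# Adjust for multiple testing; "holm" or "bonferroni" are stricter alternatives
padj <- p.adjust(pvalues, method = "BH")
head(padj)
```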
I'd like to sample a vector x of length 7 with replacement, and do that 10 separate times. I've tried something like the following but can't get the resulting 7x10 output I'm looking for. This produces a single vector of length 7, but I can't figure out how to get the other 9 vectors.
x <- runif(7, 0, 1)
for(i in 1:10){
samp <- sample(x, size = length(x), replace = T)
}
This is a very convenient way to do this:
replicate(10,sample(x,length(x),replace = TRUE))
Since you seem to want to sample with replacement, you can just get the 7*10 samples at once (which is more efficient for large sizes):
x <- runif(7)
n <- 10
xn <- length(x)
matrix(x[sample.int(xn, xn*n, replace=TRUE)], nrow=xn)
# Or slightly shorter:
matrix(sample(x, length(x)*n, replace=TRUE), ncol=n)
The second version uses sample directly, but there are some issues with that: if x is a numeric of length 1, bad things happen. sample.int is safer.
x <- c(pi, -pi)
sample(x, 5, replace=T) # OK
x <- pi
sample(x, 5, replace=T) # OOPS, interpreted as 1:3 instead of pi...
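One way to sidestep the length-1 problem is a small wrapper that always samples positions rather than values, along the lines of what the sample help page suggests. The name resample is used here as a hypothetical helper, not a base R function:

```r
# Hypothetical helper: always samples the elements of x, even when length(x) == 1
resample <- function(x, size, ...) x[sample.int(length(x), size, ...)]

x <- pi
draws <- resample(x, 5, replace = TRUE)  # five copies of pi, never 1:3
draws
```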
Looks like you got a suitable answer, but here's an approach that's similar to your first attempt. The difference is that we define samp up front with the appropriate dimensions, and then iteratively index into that object and fill it one row at a time. Note that this gives a 10x7 matrix (one sample per row); use t(samp) if you need it as 7x10:
samp <- matrix(NA, ncol = 7, nrow = 10)
for(i in 1:10){
samp[i,] <- sample(x, size = length(x), replace = T)
}