I am stuck.
We are asked to pick 30 random values from our dataset and then replace the picked values with NAs.
I'm stuck at the first step: the code below selects 30 random rows (so 30 values from every column), while I want 30 random values picked from anywhere in the whole dataset.
data2[sample(nrow(data2),30), ]
I hope you can help me out; thank you in advance.
Do you mean to replace 30 random rows?
data2 <- iris # as an example
throwouts <- sample(nrow(data2),30)
data2[throwouts, ] <- NA
print(data2)
Do you mean to replace 30 values in random rows and random columns?
data2 <- iris # as an example
coords <- expand.grid(1:nrow(data2), 1:ncol(data2)) # all possible row/column coordinates
coords <- coords[sample(nrow(coords), 30), ] # take 30 unique coordinate pairs
for(i in 1:30) # erase each of them individually
data2[coords$Var1[i], coords$Var2[i] ] <- NA
print(data2)
The following seems to be memory efficient: it uses a logical matrix of FALSE values, with 30 TRUE values in random positions, to assign the NAs.
set.seed(2020)
v <- rep(FALSE, prod(dim(df1)))
v[sample(length(v), 30)] <- TRUE
is.na(df1) <- matrix(v, nrow = nrow(df1))
rm(v)
This can easily be written as a function.
assignNA <- function(x, n){
  # logical vector with n TRUE values in random positions
  v <- rep(FALSE, prod(dim(x)))
  v[sample(length(v), n)] <- TRUE
  # reshape to the dimensions of x and use it to set the NAs
  is.na(x) <- matrix(v, nrow = nrow(x))
  x
}
set.seed(2020)
assignNA(df1, n = 30)
Tested with the data
df1 <- iris
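A quick check of the function (assuming df1 <- iris as above, which has no pre-existing NAs): exactly n cells should come back as NA.
set.seed(2020)
sum(is.na(assignNA(df1, n = 30)))  # 30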
I have a data set containing 526 rows and 560 columns. In this data set, I want to run a PCA for each group of 16 columns in a loop and save the PCA scores for each row. I tried the code below, but it did not work. I would be happy to get your advice.
Thanks in advance for your help.
for(i in 1:ncol(df)) {
df[ , i:(i+15)] <- prcomp(df[, i:(i+15)], scale. = TRUE, center = T)
}
Here is a way with an lapply loop. Create a vector f of consecutive integers, each repeated 16 times, then split the data frame's column names by this vector and apply prcomp to each subset with lapply. Finally, extract the scores.
f <- c(1, rep(0, 15))
f <- rep(f, length(names(df1))/16)
f <- cumsum(f)
nms <- split(names(df1), f)
pca_list <- lapply(nms, function(x){
prcomp(df1[x], center = TRUE, scale. = TRUE)
})
scores_list <- lapply(pca_list, '[[', 'x')
Test data creation code
set.seed(2021)
df1 <- replicate(560, rnorm(526))
df1 <- as.data.frame(df1)
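A quick sanity check of the result with this test data: 560 columns split into groups of 16 gives 35 PCAs, each with a 526 x 16 score matrix.
length(scores_list)    # 35
dim(scores_list[[1]])  # 526 16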
I am looking for an efficient way in R to derive possible combinations.
I have a data frame with 3 columns, and based on the contents of the first column I am calculating all the possible combinations.
df <- data.frame("H" = c("H1","H2","H3","H4"), "W1" = c(95, 0, 85 ,0) , "W2" = c(50, 85, 0,0))
df$H <- as.character(df$H)
nH <- nrow(df)
nW <- 2
library(plyr)
library(gtools)
if(nW<=5){
# Find all possible combinations
mat1 <- matrix(nrow = 0, ncol = nH)
for(i in 1:nH){
# mat1 <- rbind.fill.matrix(mat1, combinations(nH,nH-(i-1),df$H))
mat1 <- rbind.fill.matrix(mat1, t(combn(df$H,nH-(i-1))))
}
df_comb <- data.frame(mat1)
}
View(df_comb)
df_comb gives the correct output. The code above works well for small data sets, but when the H column has more than 15 values, R runs out of memory.
I am looking for ways to calculate these combinations efficiently in R for up to 50 values (H1, H2, ..., H49, H50).
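To see why full enumeration cannot work at that scale: the number of non-empty subsets of n items is 2^n - 1, so for 50 values there are roughly 1.1 quadrillion combinations before any of them are even stored.
sum(choose(50, 1:50))  # 1.1259e+15
2^50 - 1               # 1125899906842623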
EDIT:
I tried a different approach: once the number of possible combinations exceeds a threshold (32767 in the case below), random sampling with a ratio method is applied to generate the combinations.
nH <- 26
nW <- 2
if(nW<=5){
# Find all possible combinations ~~~~~ Random Sampling
ncomb <- 0
for(i in 1:nH){
ncomb <- ncomb + choose(nH, nH-(i-1))
}
nmax <- 10000 # Total number of combinations cannot exceed 10000
mat1 <- matrix( nrow = 0, ncol = nH)
for(i in 1:nH){ # For each Group 26C1 26C2 26C3 ..... 26C25 26C26
ncombi <- choose(nH, nH-(i-1)) # for i = 1, this is 26C26
ncombComputed <- ceiling(nmax/ncomb*choose(nH, nH-(i-1)))
if(ncomb <= 32767){ # This condition is independent of nmax - for 15 combinations
print("sefirst")
final <- mat1
print(paste(nH," ",i))
mat1 <- rbind.fill.matrix(mat1, combinations(nH, nH-(i-1), df$H))
}
else {
print(i)
print("second")
combi <- matrix( nrow = 0, ncol = nH-(i-1))
#random sampling
while(nrow(combi) < ncombComputed){
combi <- rbind(combi, sort(sample(df$H, nH-(i-1))))
combi <- unique(combi)
}
mat1 <- rbind.fill.matrix(mat1, combi)
}
}
df_comb_New <- data.frame(mat1)
}
The above code gives the result, but for 26 entries it takes 36 seconds to generate 10000 combinations. Is there a way to optimize the while loop so that execution becomes faster, or any other way to achieve the same result more efficiently?
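One possible way to speed this up, sketched here rather than tested against the full workflow, is to avoid growing combi one row at a time inside the while loop: draw more candidate combinations than needed in a single pass, deduplicate once, and keep the first ncombComputed unique rows. The helper name sample_combis is made up for illustration; it assumes the items form a character vector such as df$H and that duplicates are rare enough for a small oversampling factor to suffice.
sample_combis <- function(items, k, n_needed, oversample = 2) {
  # draw oversample * n_needed candidate combinations in one go
  draws <- matrix(
    vapply(seq_len(n_needed * oversample),
           function(...) sort(sample(items, k)),
           character(k)),
    ncol = k, byrow = TRUE)
  # deduplicate once instead of checking inside a loop
  draws <- unique(draws)
  draws[seq_len(min(nrow(draws), n_needed)), , drop = FALSE]
}
# e.g. mat1 <- rbind.fill.matrix(mat1, sample_combis(df$H, nH - (i - 1), ncombComputed))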
I have a dataset where a subset of measurements for each entry are randomly missing:
dat <- matrix(runif(100), nrow=10)
rownames(dat) <- letters[1:10]
colnames(dat) <- paste("time", 1:10)
dat[sample(100, 25)] <- NA
I am interested in calculating correlations between each row in this dataset (i.e., a-a, a-b, a-c, a-d, ...). However, I would like to exclude correlations where there are fewer than 5 pairwise non-NA observations by setting their value to NA in the resulting correlation matrix.
Currently I am doing this as follows:
cor <- cor(t(dat), use = 'pairwise.complete.obs')
names <- rownames(dat)
filter <- sapply(names, function(x1) sapply(names, function(x2)
sum(!is.na(dat[x1,]) & !is.na(dat[x2,])) < 5))
cor[filter] <- NA
However, this operation is very slow as the actual dataset contains >1,000 entries.
Is there way to filter cells based on the number of non-NA pairwise observations in a vectorized manner, instead of within nested loops?
You can count the number of non-NA pairwise observations using a matrix approach.
Let's use this data generation code; I made the data larger and added more NAs.
nr = 1000;
nc = 900;
dat = matrix(runif(nr*nc), nrow=nr)
rownames(dat) = paste(1:nr)
colnames(dat) = paste("time", 1:nc)
dat[sample(nr*nc, nr*nc*0.9)] = NA
Your filter code then takes 85 seconds:
tic = proc.time()
names = rownames(dat)
filter = sapply(names, function(x1) sapply(names, function(x2)
sum(!is.na(dat[x1,]) & !is.na(dat[x2,])) < 5));
toc = proc.time();
show(toc-tic);
# 85.50 seconds
My version creates a matrix with value 1 wherever the original data is non-NA. Using matrix multiplication, I then count the pairwise non-NA observations. It runs in a fraction of a second.
tic = proc.time()
NAmat = matrix(0, nrow = nr, ncol = nc)
NAmat[ !is.na(dat) ] = 1;
filter2 = (tcrossprod(NAmat) < 5)
toc = proc.time();
show(toc-tic);
# 0.09 seconds
A simple check shows the results are the same:
all(filter == filter2)
# TRUE
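The 0/1 matrix does not even need to be filled in by indexing: multiplying the logical non-NA matrix by 1 gives the same thing in one step, and the resulting filter is then applied to the correlation matrix exactly as in the question (cor_mat[filter2] <- NA).
filter3 <- tcrossprod((!is.na(dat)) * 1) < 5
all(filter2 == filter3)  # TRUE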
I would like to randomly replace elements in a matrix with some specified value, here -99. I tried the first method below and it did not work. Then I tried a different approach, also below, and it did work.
Why does the first method not work? What am I doing incorrectly? Thank you for any advice.
I suspect the second method is better because, apart from working, it allows me to specify the percentage of elements I want replaced. The first method does not, since it can randomly draw the same (i, j) pairs repeatedly.
Here is the first method, the one that does not work:
# This does not work
set.seed(1234)
ncols <- 10
nrows <- 5
NA_value <- -99
my.fake.data <- round(rnorm(ncols*nrows, 20, 5))
my.fake.grid <- matrix(my.fake.data, nrow=nrows, ncol=ncols, byrow=TRUE)
my.fake.grid
random.i <- sample(ncols, round(0.40*nrows*ncols), replace = TRUE)
random.j <- sample(nrows, round(0.40*nrows*ncols), replace = TRUE)
my.fake.grid[random.j, random.i] <- NA_value
my.fake.grid
Here is the second method, the one that does work:
# This works
set.seed(1234)
ncols <- 10
nrows <- 5
NA_value <- -99
my.fake.data <- round(rnorm(ncols*nrows, 20, 5))
my.fake.grid <- matrix(my.fake.data, nrow=nrows, ncol=ncols, byrow=TRUE)
my.fake.grid
my.fake.data2 <- c(my.fake.grid)
random.x <- sample(length(my.fake.data2), round(0.40*length(my.fake.data2)), replace = FALSE)
my.fake.data2[random.x] <- NA_value
my.fake.grid2 <- matrix(my.fake.data2, nrow=nrows, ncol=ncols, byrow=FALSE)
my.fake.grid2
You could try:
library(data.table) # for a fast cross-join; alternatively you could use expand.grid
temp <- as.matrix(CJ(seq_len(nrows), seq_len(ncols))) # Create all possible row/column index combinations
indx <- temp[sample(nrow(temp), round(0.4 * nrow(temp))), ] # Sample 40% of them
my.fake.grid[indx] <- NA_value # Replace with -99
sum(my.fake.grid == -99)/(ncols * nrows) # Validating percentage
##[1] 0.4
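Incidentally, the reason the first method misbehaves is that my.fake.grid[random.j, random.i] with two index vectors addresses the whole submatrix formed by those rows and columns, not individual (j, i) cells; indexing with a two-column matrix, as indx does above, hits single cells. Here is a base-R sketch of the same idea with no data.table dependency, using a fresh copy called grid3 (a name made up here) and the fact that matrices also accept linear single-index assignment:
grid3 <- matrix(my.fake.data, nrow = nrows, ncol = ncols, byrow = TRUE)  # fresh copy of the original grid
idx <- sample(length(grid3), round(0.40 * length(grid3)))                # 40% of cell positions, no repeats
grid3[idx] <- NA_value
sum(grid3 == NA_value) / length(grid3)                                   # 0.4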
I am running correlations between variables, some of which have missing data, so the sample sizes for the correlations are likely different. I tried print and summary, but neither of these shows me how big my n is for each correlation. This is a fairly simple problem that I cannot find the answer to anywhere.
like this..?
x <- c(1:100,NA)
length(x)
length(x[!is.na(x)])
you can also get the degrees of freedom like this...
y <- c(1:100,NA)
x <- c(1:100,NA)
cor.test(x,y)$parameter
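Since cor.test drops incomplete pairs and a Pearson test has n - 2 degrees of freedom, the pairwise sample size can also be read back from that value:
cor.test(x, y)$parameter + 2  # 100 complete pairs in this example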
But I think it would be best if you show the code for how you are estimating the correlation, so you can get exact help.
Here's an example of how to find the pairwise sample sizes among the columns of a matrix. If you want to apply it to (certain) numeric columns of a data frame, combine them accordingly, coerce the resulting object to matrix and apply the function.
# Example matrix:
xx <- rnorm(3000)
# Generate some NAs
vv <- sample(3000, 200)
xx[vv] <- NA
# reshape to a matrix
dd <- matrix(xx, ncol = 3)
# find the number of NAs per column
apply(dd, 2, function(x) sum(is.na(x)))
# tack on some column names
colnames(dd) <- paste0("x", seq(3))
# Function to find the number of pairwise complete observations
# among all pairs of columns in a matrix. It returns a data frame
# whose first two columns comprise all column pairs
pairwiseN <- function(mat)
{
u <- if(is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
h <- expand.grid(x = u, y = u)
f <- function(x, y)
sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
h$n <- mapply(f, h[, 1], h[, 2])
h
}
# Call it
pairwiseN(dd)
The function can easily be improved; for example, you could set h <- expand.grid(x = u[-1], y = u[-length(u)]) to cut down on the number of calculations, you could return an n x n matrix instead of a three-column data frame, etc.
Here is a for-loop implementation of Dennis's function above that outputs an n x n matrix directly, rather than having to pivot_wider() the result. On my Databricks cluster it cut the compute time for a 1865-row x 69-column matrix from 2.5-3 minutes down to 30-40 seconds.
Thanks for your answer, Dennis; this helped me with my work.
pairwise_nxn <- function(mat)
{
cols <- if(is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
nn <- data.frame(matrix(nrow = length(cols), ncol = length(cols)))
rownames(nn) <- colnames(nn) <- cols
f <- function(x, y)
sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
for (i in 1:nrow(nn))
for (j in 1:ncol(nn))
nn[i,j] <- f(rownames(nn)[i], colnames(nn)[j])
nn
}
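For what it is worth, the same n x n table can also be computed without explicit loops, using the matrix-multiplication trick from the pairwise-filter answer earlier on this page: crossprod of the 0/1 non-NA indicator counts, for every pair of columns, the rows where both entries are present. A minimal sketch (the name pairwise_nxn_fast is made up here):
pairwise_nxn_fast <- function(mat) crossprod((!is.na(mat)) * 1)
# check against the loop version on the example matrix dd from above
all(as.matrix(pairwise_nxn(dd)) == pairwise_nxn_fast(dd))  # TRUE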
If your variables are vectors named a and b, would something like sum(is.na(a) | is.na(b)) help you?
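For completeness, the complement of that count is the number of complete pairs a pairwise correlation actually uses; a tiny made-up example with hypothetical vectors a and b:
a <- c(1, 2, NA, 4, 5)
b <- c(NA, 2, 3, 4, NA)
sum(is.na(a) | is.na(b))    # 3 pairs excluded
sum(!is.na(a) & !is.na(b))  # 2 complete pairs used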