Efficiently populating rows given possible values for each variable in R

I have a dataframe with 42 variables, each of which has a different set of possible values. I am aiming to create a much larger dataframe that contains a row for every possible combination of values across the variables.
This will be millions of rows long and too large to hold in RAM. I have therefore been trying to write a script that appends each combination to an existing file. The following code works but is too slow to be practical (and it includes only 5 of the variables), taking just under 5 minutes to run on my machine.
V1 <- c(seq(0, 30, 1), NA)
V2 <- c(seq(20, 55, 1), NA)
V3 <- c(0, 1, NA)
V4 <- c(seq(1, 16, 1), NA)
V5 <- c(seq(15, 170, 1), NA)
df_empty <- data.frame(V1 = NA, V2 = NA, V3 = NA, V4 = NA)
write.csv(df_empty, "table_out.csv", row.names = FALSE)
start <- Sys.time()
for(v1 in 1:length(V1)){
V1_val <- V1[v1]
for(v2 in 1:length(V2)){
V2_val <- V2[v2]
for(v3 in 1:length(V3)){
V3_val <- V3[v3]
for(v4 in 1:length(V4)){
V4_val <- V4[v4]
row <- cbind(V1_val, V2_val, V3_val, V4_val)
write.table(as.matrix(row), file = "table_out.csv", sep = ",", append = TRUE, quote = FALSE,col.names = FALSE, row.names = FALSE)
}
}
}
}
print(abs(Sys.time() - start)) # 4.8 minutes
print(paste(nrow(read.csv("table_out.csv")), "rows in file"))
I have tested using data.table::fwrite() but this failed to be any faster than write.table(as.matrix(x))
I'm sure the issue I have is with using so many for loops but am unsure how to translate this into a more efficient approach.
Thanks

You can try the following code to generate all combinations (the pattern is anchored with $ so that helper objects such as V1_val are not picked up):
M <- as.matrix(do.call(expand.grid, mget(x = ls(pattern = "^V\\d+$"))))
and then save M to your designated file, e.g.,
write.table(M, file = "table_out.csv", sep = ",", append = TRUE, quote = FALSE, col.names = FALSE, row.names = FALSE)
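Since expand.grid over every variable at once holds the whole grid in memory, one possible variation is to build the grid for all variables except the first and write one chunk per value of the first variable. The following is only a minimal sketch under assumptions: the vars list and its small value ranges are made up for illustration, and the output goes to a temporary file. With 42 variables the full grid may still be far too large to write at all, because the row count is the product of all the vector lengths.

```r
# Hypothetical small inputs standing in for the real V1, V2, ... vectors.
vars <- list(V1 = c(0:2, NA), V2 = c(20:22, NA), V3 = c(0, 1, NA))
out_file <- tempfile(fileext = ".csv")
first <- names(vars)[1]

# Grid over every variable except the first; only this part is held in memory.
rest <- expand.grid(vars[-1], KEEP.OUT.ATTRS = FALSE)

for (k in seq_along(vars[[first]])) {
  chunk <- rest
  chunk[[first]] <- vars[[first]][k]   # recycle one value of the first variable
  chunk <- chunk[names(vars)]          # restore the original column order
  # Write the header with the first chunk only, then append.
  write.table(chunk, file = out_file, sep = ",", append = k > 1,
              quote = FALSE, col.names = k == 1, row.names = FALSE)
}

n_rows <- nrow(read.csv(out_file))
```

Each call writes a whole block of rows, so the number of write operations equals the number of values of the first variable rather than the number of rows.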

Related

unused arguments when trying to create a matrix in R

I want to create a matrix like this:
dat <- matrix(
"an_no" = c(14, 17),
"an_yes" = c(3, 1),
row.names = c("TL-MCT-t", "ops"),
stringsAsFactors = FALSE
)
but I get an "unused arguments" error.
What did I do wrong, and how do I correctly build a matrix with these arguments?
as.matrix didn't help.
Thanks for your help.
You are using the arguments that you would use to build a data frame. If you want a matrix using this syntax you can do:
dat <- as.matrix(data.frame(
an_no = c(14, 17),
an_yes = c(3, 1),
row.names = c("TL-MCT-t", "ops")))
dat
#> an_no an_yes
#> TL-MCT-t 14 3
#> ops 17 1
You don't need the stringsAsFactors = FALSE because none of your data elements are strings, and in any case, stringsAsFactors is FALSE by default unless you are using an old version of R. You also don't need quotation marks around an_no and an_yes because these are both legal variable names in R.
The structure of the matrix function is this:
matrix(data = NA,
nrow = 1,
ncol = 1,
byrow = FALSE,
dimnames = NULL)
It appears you're trying to create a data.frame:
data.frame(row_names = c("TL-MCT-t", "ops"),
an_no = c(14,17),
an_yes = c(3,1)
)
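For comparison, the same matrix can probably also be built with matrix() itself by passing the values as one vector and the names through dimnames. This is a small sketch using the values from the question; matrix() fills column by column by default.

```r
# Values are given column by column: an_no first, then an_yes.
dat <- matrix(c(14, 17, 3, 1),
              nrow = 2,
              dimnames = list(c("TL-MCT-t", "ops"),   # row names
                              c("an_no", "an_yes")))  # column names
dat
```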

Perform an operation with complete cases without changing the original vectors

I would like to calculate a rank-biserial correlation, but the (seemingly only) package for it can't handle missing values well: it has no built-in "na.omit = TRUE" option. I could remove the missing values from the data frame, but that would be a hassle with many different calculations.
n <- 500
df <- data.frame(id = seq (1:n),
ord = sample(c(0:3), n, rep = TRUE),
sex = sample(c("m", "f"), n, rep = TRUE, prob = c(0.55, 0.45))
)
# lapply keeps each column's type; apply() would coerce every column to character
df[] <- lapply(df, function(x) {x[sample(n, floor(n/10))] <- NA; x})
library(rcompanion)
wilcoxonRG(x = df$ord, g = df$sex, verbose = T)
I imagine something stupidly easy like "complete.cases(wilcoxonRG(x = df$ord, g = df$sex, verbose = T))". It's probably not that hard, but I could only find manipulations of complete data frames. Thanks in advance!
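One way to approach this, sketched under assumptions, is a small wrapper that keeps only the pairs where both vectors are non-missing and then calls the function of interest, leaving the original vectors untouched. The helper name on_complete and the toy vectors are hypothetical; a simple stand-in function is used so the sketch runs without rcompanion.

```r
# Hypothetical helper: drop pairs where either value is missing, then call `fun`.
on_complete <- function(x, g, fun, ...) {
  keep <- complete.cases(x, g)   # TRUE where both x and g are present
  fun(x[keep], g[keep], ...)
}

# Toy data with missing values in both vectors.
x <- c(0, 1, NA, 3, 2, 1, NA, 0)
g <- c("m", "f", "m", NA, "f", "m", "f", "f")

# Stand-in function that just reports how many pairs were used.
n_used <- on_complete(x, g, function(x, g) length(x))

# With rcompanion installed, the call from the question might become:
# on_complete(df$ord, df$sex, function(x, g) wilcoxonRG(x = x, g = g, verbose = TRUE))
```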

How can a blocking factor be included in makeClassifTask() from mlr package?

In some classification tasks, using mlr package, I need to deal with a data.frame similar to this one:
set.seed(pi)
# Dummy data frame
df <- data.frame(
# Repeated values ID
ID = sort(sample(c(0:20), 100, replace = TRUE)),
# Some variables
X1 = runif(10, 1, 10),
# Some Label
Label = sample(c(0,1), 100, replace = TRUE)
)
df
I need to cross-validate the model while keeping rows with the same ID together. I know from the tutorial that:
https://mlr-org.github.io/mlr-tutorial/release/html/task/index.html#further-settings
We could include a blocking factor in the task. This would indicate that some observations "belong together" and should not be separated when splitting the data into training and test sets for resampling.
The question is: how can I include this blocking factor in makeClassifTask()?
Unfortunately, I couldn't find any example.
What version of mlr do you have? Blocking has been part of it for a while. You can pass it directly as an argument to makeClassifTask.
Here is an example for your data:
df$ID = as.factor(df$ID)
df2 = df
df2$ID = NULL
df2$Label = as.factor(df$Label)
tsk = makeClassifTask(data = df2, target = "Label", blocking = df$ID)
res = resample("classif.rpart", tsk, resampling = cv10)
# check that blocking worked
lapply(1:10, function(i) {
blocks.training = df$ID[res$pred$instance$train.inds[[i]]]
blocks.testing = df$ID[res$pred$instance$test.inds[[i]]]
intersect(blocks.testing, blocks.training)
})
#all entries are empty, blocking indeed works!
The answer by @jakob-r no longer works. My guess is that something changed with cv10.
A minor edit to use "blocking.cv = TRUE" fixes it.
Complete working example:
set.seed(pi)
# Dummy data frame
df <- data.frame(
# Repeated values ID
ID = sort(sample(c(0:20), 100, replace = TRUE)),
# Some variables
X1 = runif(10, 1, 10),
# Some Label
Label = sample(c(0,1), 100, replace = TRUE)
)
df
df$ID = as.factor(df$ID)
df2 = df
df2$ID = NULL
df2$Label = as.factor(df$Label)
resDesc <- makeResampleDesc("CV",iters=10,blocking.cv = TRUE)
tsk = makeClassifTask(data = df2, target = "Label", blocking = df$ID)
res = resample("classif.rpart", tsk, resampling = resDesc)
# check that blocking worked
lapply(1:10, function(i) {
blocks.training = df$ID[res$pred$instance$train.inds[[i]]]
blocks.testing = df$ID[res$pred$instance$test.inds[[i]]]
intersect(blocks.testing, blocks.training)
})
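To make the idea behind blocking concrete, here is a base R sketch that does not use mlr at all: it assigns each distinct ID (rather than each row) to a fold, so rows sharing an ID can never be split between training and test sets. The fold count k and the variable names are assumptions for illustration, and this is not how mlr implements blocking internally.

```r
set.seed(42)
ids <- sort(sample(0:20, 100, replace = TRUE))  # repeated IDs, as in the question
unique_ids <- unique(ids)
k <- 5                                          # assumed number of folds

# Assign each distinct ID to one of k folds, then map folds back to rows.
fold_of_id <- sample(rep_len(seq_len(k), length(unique_ids)))
fold <- fold_of_id[match(ids, unique_ids)]

# For every fold, count IDs that appear in both its test and training rows.
overlaps <- vapply(seq_len(k), function(i) {
  length(intersect(ids[fold == i], ids[fold != i]))
}, integer(1))
overlaps
```

All counts are zero by construction, which is the same property the lapply() check above verifies for the mlr resampling result.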

Save clustering groups to files outside of R

I wrote the following code to cluster my data:
clusrer.data <- function(data,n) {
miRNA.exp.cluster <- scale(t(miRNA.exp))
k.means.fit <- kmeans(miRNA.exp.cluster,n)
# I tried to save the results of the k-means clustering with this code:
k.means.fit <- as.data.frame(k.means.fit)
write.csv(k.means.fit, file="k-meanReslut.csv")
#x<-k.means.fit$clusters
#write.csv(x, file="k-meanReslut.csv")
}
but I cannot save the clusters (8, 6, 7, 20, 18) outside of R. I want to save each cluster separately (with columns and rows) in a txt or CSV file.
Here is one approach of splitting the original dataset according to cluster and saving that chunk to a file. I added cluster assignment to the original dataset for easier visual check. Example is partly taken from ?kmeans. Feel free to adapt the way files are written, as well as the way file name is created.
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, 2))
x <- cbind(x, cluster = cl$cluster)
by(x, INDICES = cl$cluster, FUN = function(sp) {
write.table(sp, file = paste0("file", unique(sp$cluster), ".txt"),
row.names = TRUE, col.names = TRUE)
})
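A variation on the same idea, sketched with split() and an explicit loop, may be easier to adapt. The file name prefix "cluster_" and the use of a temporary directory are assumptions for illustration.

```r
set.seed(1)
x <- rbind(matrix(rnorm(20, sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
cl <- kmeans(x, 2)

# One data frame per cluster, named by cluster number.
parts <- split(as.data.frame(x), cl$cluster)

# Write each cluster to its own CSV file (temporary directory used here).
files <- file.path(tempdir(), paste0("cluster_", names(parts), ".csv"))
for (i in seq_along(parts)) {
  write.csv(parts[[i]], files[i], row.names = TRUE)
}
```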

R: Running row-wise operations between data frames

I'd like to run a statistical test, row-by-matching-row, between two data frames gex and mxy. The catch is that I need to run it several times, each time using a different column from gex, yielding a different vector of test results for each run.
Here is what I have so far (using example values), after much help from @kristang.
gex <- data.frame("sample" = c(987,7829,15056,15058,15072),
"TCGA-F4-6703-01" = runif(5, -1, 1),
"TCGA-DM-A28E-01" = runif(5, -1, 1),
"TCGA-AY-6197-01" = runif(5, -1, 1),
"TCGA-A6-5657-01" = runif(5, -1, 1))
colnames(gex) <- gsub("[.]", "_",colnames(gex))
listx <- c("TCGA_DM_A28E_01","TCGA_A6_5657_01")
mxy <- data.frame("TCGA-AD-6963-01" = runif(5, -1, 1),
"TCGA-AA-3663-11" = runif(5, -1, 1),
"TCGA-AD-6901-01" = runif(5, -1, 1),
"TCGA-AZ-2511-01" = runif(5, -1, 1),
"TCGA-A6-A567-01" = runif(5, -1, 1))
colnames(mxy) <- gsub("[.]", "_",colnames(mxy))
zScore <- function(x,y)((as.numeric(x) - as.numeric(rowMeans(y,na.rm=T)))/as.numeric(sd(y,na.rm=T)))
## BELOW IS FOR DIAGNOSTICS
write.table(mxy, file = "mxy.csv",
row.names=FALSE, col.names=TRUE, sep=",", quote=F)
write.table(gex, file = "gex.csv",
row.names=FALSE, col.names=TRUE, sep=",", quote=F)
## ABOVE IS FOR DIAGNOSTICS
for(i in seq(nrow(mxy)))
for(colName in listx){
zvalues <- zScore(gex[,colName[colName %in% names(gex)]],
mxy[i,])
## BELOW IS FOR DIAGNOSTICS
write.table(gex[,colName[colName %in% names(gex)]], file=paste0(colName, "column", ".csv"),
row.names=FALSE,col.names=FALSE,sep=",",quote=F)
write.table(mxy[i,], file=paste0(colName, "mxyinput", ".csv"),
row.names=FALSE,col.names=FALSE,sep=",",quote=F)
## ABOVE IS FOR DIAGNOSTICS
geneexptest <- data.frame(gex$sample, zvalues, row.names = NULL,
stringsAsFactors = FALSE)
write.csv(geneexptest, file = paste0(colName, ".csv"),
row.names=FALSE, col.names=FALSE, sep=",", quote=F)
}
The problem is that, while it seems to run through and create the correct number of output files with the correct number of rows, it does not yield correct z-scores. I want it to calculate:
((Value from row z & given column of gex) - (Mean of values in row z across mxy)) / (Standard deviation of values in row z across mxy)
Then move on to the next row, and so on, filling in the first vector. THEN, I want it to calculate the same thing using the next column of gex, filling in a separate vector. I hope this makes sense.
I have a separate script which runs the same test using a pre-determined column vs the other data frame. The relevant for loop from that script looks like this:
for(i in seq_along(mxy)){
zvalues[i] <- (gex_column_W[i] - mean(mxy[i,])) / sd(mxy[i,])
}
I think there may be a typo in your code: you say you want the "Mean of values in row z across mxy" but are using mean(mxy[,i]), which selects the i'th column, not the i'th row. I rewrote this section with for loops for clarity (not sure why you were using lapply?).
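# a function for calculating the z-score
# y arrives as a one-row data frame, so flatten it to a numeric vector first
zScore <- function(x, y) {
  y <- as.numeric(unlist(y))
  (x - mean(y, na.rm = TRUE)) / sd(y, na.rm = TRUE)
}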
for(i in seq(nrow(mxy))) # note that length(mxy) is actually the number of columns in mxy
for(colName in listx){
zvalues <- zScore(gex[,colName],# column == colName
mxy[i,])# row == i
geneexptest <- data.frame(gex$sample, zvalues, row.names = NULL,
stringsAsFactors = FALSE)
write.table(geneexptest, file = paste0(colName, "mxyinput", ".csv"),
row.names=FALSE, col.names=FALSE, quote=F,
sep = ",", dec = ".", append=(i > 1))
}
An alternative that does not rely on append:
for(colName in listx){
geneexptest <- NULL
for(i in seq(nrow(mxy))) {
zvalues <- zScore(gex[,colName],# column == colName
mxy[i,])# row == i
geneexptest <- rbind(geneexptest,
data.frame(gex$sample, zvalues, row.names = NULL,
stringsAsFactors = FALSE))
}
write.table(geneexptest, file = paste0(colName, "mxyinput", ".csv"),
row.names=FALSE, col.names=FALSE, quote=F,
sep = ",", dec = ".")
}
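If the goal is exactly the formula stated in the question (row z of a gex column minus the mean of row z of mxy, divided by the standard deviation of row z of mxy), a vectorized sketch without the inner loop over rows might look like the following. The example data are made up, and collecting results in a list named z_by_col is an assumption; writing each element to a file would follow the same write.table pattern as above.

```r
set.seed(1)
# Made-up example data with the same shape as in the question.
gex <- data.frame(sample = c(987, 7829, 15056, 15058, 15072),
                  TCGA_DM_A28E_01 = runif(5, -1, 1),
                  TCGA_A6_5657_01 = runif(5, -1, 1))
mxy <- as.data.frame(matrix(runif(25, -1, 1), nrow = 5))
listx <- c("TCGA_DM_A28E_01", "TCGA_A6_5657_01")

# Per-row mean and standard deviation across the columns of mxy.
row_mean <- rowMeans(mxy, na.rm = TRUE)
row_sd <- unname(apply(mxy, 1, sd, na.rm = TRUE))

# One result data frame per selected gex column.
z_by_col <- lapply(setNames(listx, listx), function(col_name) {
  data.frame(sample = gex$sample,
             zvalues = (gex[[col_name]] - row_mean) / row_sd)
})
```

Because row_mean and row_sd have one entry per row, the subtraction and division line up row by matching row for each gex column.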
