R: Appending to a data frame in a for loop - r

So I have this loop, and it writes multiple csv files, each one built up by appending the results of each run. As you can see below, this particular loop runs a statistical function (zScore) across each row of a subset of gex against mxy, writes out the results for each row, then moves on to the next subset of gex.
My question is, instead of writing the appended result as a csv file, is there a way that I can just build a dataframe within the loop that looks the same?
Thank you for your kind help.
gex <- data.frame("sample" = c("BIX","HEF","TUR","ZOP","VAG"),
"TCGA-F4-6703-01" = runif(5, -1, 1),
"TCGA-DM-A28E-01" = runif(5, -1, 1),
"TCGA-AY-6197-01" = runif(5, -1, 1),
"TCGA-A6-5657-01" = runif(5, -1, 1))
colnames(gex) <- gsub("[.]", "_",colnames(gex))
listx <- c("TCGA_DM_A28E_01","TCGA_A6_5657_01")
mxy <- data.frame("TCGA-AD-6963-01" = runif(5, -1, 1),
"TCGA-AA-3663-11" = runif(5, -1, 1),
"TCGA-AD-6901-01" = runif(5, -1, 1),
"TCGA-AZ-2511-01" = runif(5, -1, 1),
"TCGA-A6-A567-01" = runif(5, -1, 1))
colnames(mxy) <- gsub("[.]", "_",colnames(mxy))
zScore <- function(x,y)((as.numeric(x) - as.numeric(rowMeans(y,na.rm=T)))/as.numeric(sd(y,na.rm=T)))
for(i in seq(nrow(mxy))){
for(colName in listx){
zvalues <- zScore(gex[i,colName],
mxy[i,])
geneexptest <- data.frame(gex$sample[i], zvalues, row.names = NULL,
stringsAsFactors = TRUE)
write.table(geneexptest, file = paste0(colName, "mxyinput", ".csv"),
row.names=FALSE, col.names=FALSE, quote=F,
sep = ",", dec = ".", append=(i > 1))
}
}

In your posted code you have one csv file for each element of listx, and you are writing a number of lines one-by-one into each of these files. Instead, you could create a data frame for each element of listx and write each out with a single call to write.table.
dfs <- lapply(listx, function(colName) {
  do.call(rbind, lapply(seq(nrow(mxy)), function(i) {
    zvalues <- zScore(gex[i,colName], mxy[i,])
    data.frame(gex$sample[i], zvalues, row.names = NULL, stringsAsFactors = TRUE)
  }))
})
dfs
# [[1]]
# gex.sample.i. zvalues
# 1 BIX 1.1105593
# 2 HEF 0.5451948
# 3 TUR -1.4060388
# 4 ZOP -1.4218218
# 5 VAG 0.2780513
#
# [[2]]
# gex.sample.i. zvalues
# 1 BIX 2.0607386
# 2 HEF 1.6703912
# 3 TUR 1.3249181
# 4 ZOP 0.8865058
# 5 VAG 1.5289732
Now you can output the full data frame for each column using write.table.
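For example, something along these lines should do it (a sketch, reusing the file-name pattern from your loop):
# one write.table call per element of listx
for (k in seq_along(listx)) {
  write.table(dfs[[k]], file = paste0(listx[k], "mxyinput", ".csv"),
              row.names = FALSE, col.names = FALSE, quote = FALSE,
              sep = ",", dec = ".")
}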
Combining all the data frames together in a single call to rbind will be much more efficient than calling rbind at each loop iteration; see Circle 2 of The R Inferno for more details.

Related

Indentation in the first column of a flextable object

I am building flextable objects to show tables, and sometimes I would like to add one or several indentations in the first column, where I show some rows' names.
Below is some code to simulate data and provide a reproducible example. The true starting point of my question is ft (Table 1):
library(dplyr)
library(flextable)
# Simulate data
g_A <- expand.grid(x = "A", y = c("A_1", "A_2"), z = c("A_1_a", "A_1_b", "A_2_a", "A_2_b"))
g_B <- expand.grid(x = "B", y = c("B_1", "B_2"), z = c("B_1_a", "B_1_b", "B_2_a", "B_2_b"))
g <- rbind(g_A, g_B)
n <- 123
set.seed(1)
df <- sample_n(g, n, replace = TRUE)
# Build table
tmp <- c(table(df$x)[1],
         table(df$y)[1],
         table(df$z)[1:2],
         table(df$y)[2],
         table(df$z)[3:4],
         table(df$x)[2],
         table(df$y)[3],
         table(df$z)[5:6],
         table(df$y)[4],
         table(df$z)[7:8])
my_tab <- data.frame("tmp" = names(tmp), "counts" = tmp, "percentages" = round(tmp/n*100, 2))
# flextable operations
ft <- flextable(my_tab)
ft <- set_header_labels(ft, tmp = "")
ft <- align(ft, align = "center")
ft <- align(ft, j = 1, align = "left")
# ft
Now, I would like to indent some names in the first column. For example, to indent A_1 I have tried the following strategies:
compose(ft, i = 2, j = 1, as_paragraph(" A_1"))
compose(ft, i = 2, j = 1, as_paragraph("\t A_1"))
# Or
# colformat_char(ft, i = 2, j = 1, prefix = " ")
# colformat_char(ft, i = 2, j = 1, prefix = "\t")
But they don't work (the result is the same as in Table 1). A "second best" strategy could be the following one (Table 2):
compose(ft, i = 2, j = 1, as_paragraph("- A_1"))
# Or
# colformat_char(ft, i = 2, j = 1, prefix = "- ")
However, I would like a proper indentation.
Finally, I share Table 3, my expected final result, with an indentation in place of each "-".
Waiting for your insights!
Ciao
To indent cells in a flextable you can use the padding() function:
ft <- padding(ft, i=2, j=1, padding.left=20)
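The same call can be repeated for every row that needs indenting, with a larger padding.left for deeper levels; for example (a sketch, with row indices taken from the A block of my_tab above):
ft <- padding(ft, i = c(2, 5), j = 1, padding.left = 20)        # A_1, A_2
ft <- padding(ft, i = c(3, 4, 6, 7), j = 1, padding.left = 40)  # A_1_a, A_1_b, A_2_a, A_2_b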

Efficiently populating rows given possible values for each variable in R

I have a dataframe with 42 variables, each of which have different possible values. I am aiming to create a much larger dataframe which contains a row for each possible combination of values for each of the variables.
This will be millions of rows long and too large to hold in RAM. I have therefore been trying to make a script which appends each possible combination to an existing file. The following code works but is far too slow to be practical (and it only includes 5 of the variables), taking just under 5 minutes to run on my machine.
V1 <- c(seq(0, 30, 1), NA)
V2 <- c(seq(20, 55, 1), NA)
V3 <- c(0, 1, NA)
V4 <- c(seq(1, 16, 1), NA)
V5 <- c(seq(15, 170, 1), NA)
df_empty <- data.frame(V1 = NA, V2 = NA, V3 = NA, V4 = NA)
write.csv(df_empty, "table_out.csv", row.names = FALSE)
start <- Sys.time()
for(v1 in 1:length(V1)){
  V1_val <- V1[v1]
  for(v2 in 1:length(V2)){
    V2_val <- V2[v2]
    for(v3 in 1:length(V3)){
      V3_val <- V3[v3]
      for(v4 in 1:length(V4)){
        V4_val <- V4[v4]
        row <- cbind(V1_val, V2_val, V3_val, V4_val)
        write.table(as.matrix(row), file = "table_out.csv", sep = ",", append = TRUE, quote = FALSE,
                    col.names = FALSE, row.names = FALSE)
      }
    }
  }
}
print(abs(Sys.time() - start)) # 4.8 minutes
print(paste(nrow(read.csv("table_out.csv")), "rows in file"))
I have tested data.table::fwrite(), but it was no faster than write.table(as.matrix(x)).
I'm sure the issue I have is with using so many for loops but am unsure how to translate this into a more efficient approach.
Thanks
I guess you can try the following code to generate all combinations:
M <- as.matrix(do.call(expand.grid, mget(x = ls(pattern = "^V\\d+"))))
and then you can save M to your designated file, e.g.,
write.table(M, file = "table_out.csv", sep = ",", append = TRUE, quote = FALSE, col.names = FALSE, row.names = FALSE)
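If the full grid really is too large to hold in RAM at once, as the question suggests, one option (my sketch, not part of the original answer) is to expand the grid in chunks, one value of V1 at a time, and append each chunk to the file:
library(data.table)
vars <- mget(x = ls(pattern = "^V\\d+"))   # V1, V2, ... as a named list
for (k in seq_along(vars$V1)) {
  # full grid over the remaining variables for a single fixed value of V1
  chunk <- do.call(expand.grid, c(list(V1 = vars$V1[k]), vars[-1]))
  fwrite(chunk, "table_out.csv", append = (k > 1), col.names = (k == 1))
}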

function to sample variable number of substrings given string length

I'm trying to write an R function that will sample a variable number of 5-character substrings, based on the length of the original string in each row of a data frame. I first calculated the number of times I'd like each draw to repeat, and would like to pass this into the function so that the number of samples taken for each row is based on the "num_draws" column for that row. My thought was to write a generalized function and then use an apply statement outside of it to act on each row, but I can't figure out how to make the function refer to column 3 in a general way (without grabbing either the value of just the first row, or the values of all rows).
example data frame:
BP TF num_draws
1 CGGCGCATGTTCGGTAATGA TFTTTFTTTFFTTFTTTTTF 6
2 ATAAGATGCCCAGAGCCTTTTCATGTACTA TFTFTFTFFFFFFTTFTTTTFTTTTFFTTT 9
3 TCTTAGGAAGGATTC FTTTTTTTTTFFFFF 4
desired output:
[1]GGCGC FTTTF
AATGA TTTTF
TTFFT TGTTC
TAATG TTTTT
AATGA TTTTF
CGGCG TFTTT
[2]AGATG FTFTF
ATAAG TFTFT
ATGCC FTFFF
GCCCA FFFFF
ATAAG TFTFT
GTACT TFFTT
GCCCA FFFFF
TGCCC TFFFF
AGATG FTFTF
[3]TTAGG TTTTT
CTTAG TTTTT
GGAAG TTTTT
GGATT TTFFF
example code:
#make example data frame
BaseP1 <- paste(sample(size = 20, x = c("A","C","T","G"), replace = TRUE), collapse = "")
BaseP2 <- paste(sample(size = 30, x = c("A","C","T","G"), replace = TRUE), collapse = "")
BaseP3 <- paste(sample(size = 15, x = c("A","C","T","G"), replace = TRUE), collapse = "")
TrueFalse1 <- paste(sample(size = 20, x = c("T","F"), replace = TRUE), collapse = "")
TrueFalse2 <- paste(sample(size = 30, x = c("T","F"), replace = TRUE), collapse = "")
TrueFalse3 <- paste(sample(size = 15, x = c("T","F"), replace = TRUE), collapse = "")
my_df <- data.frame(c(BaseP1,BaseP2,BaseP3), c(TrueFalse1, TrueFalse2, TrueFalse3))
#calculate number of draws by length
frag_length<- 5
my_df<- cbind(my_df, (round((nchar(my_df[,1]) / frag_length) * 1.5, digits = 0)))
colnames(my_df) <- c("BP", "TF", "num_draws")
#function to sample x number of draws per row (this does not work)
Fragment = function(string) {
  nStart = sample(1:(nchar(string) - 5), 1)
  samp <- substr(string, nStart, nStart + 4)
  replicate(n = string[,3], expr = samp)
}
apply(my_df[,1:2], c(1,2), Fragment)
One option would be to change the function to have another argument n and create the nStart inside the replicate call
Fragment = function(string, n) {
  replicate(n = n, {
    nStart <- sample(1:(nchar(string) - 5), 1)
    samp <- substr(string, nStart, nStart + 4)
  })
}
apply(my_df, 1, function(x) data.frame(lapply(x[1:2], Fragment, n = x[3])))
$`1`
# BP TF
#1 GGCGC FFTTF
#2 GGTAA TFFTT
#3 GCGCA TTFTT
#4 CGCAT TFFTT
#5 GGCGC FTTTF
#6 TGTTC FTTFT
#$`2`
# BP TF
#1 GTACT TTTTF
#2 ATAAG FTTFT
#3 GTACT TFTFF
#4 TAAGA TTTTF
#5 CCTTT FFTTF
#6 TCATG TTTTF
#7 CCAGA TFTFT
#8 TTCAT TFTFT
#9 CCCAG FTFTF
#$`3`
# BP TF
#1 AAGGA TTTFF
#2 AGGAT TTTTT
#3 CTTAG TFFFF
#4 TAGGA TTTFF
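For what it's worth, the same idea can also be written with Map() instead of apply(), which avoids apply()'s conversion of the data frame to a character matrix (my variation, not part of the original answer):
# as.character() guards against the BP/TF columns being factors in older R versions
res <- Map(function(bp, tf, n) {
  data.frame(BP = Fragment(bp, n), TF = Fragment(tf, n))
}, as.character(my_df$BP), as.character(my_df$TF), my_df$num_draws)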

Optimize the for loop in R

DUMMY DATA SET: (the difference from my real data set is that item_code is a string in my case)
in_cluster <- data.frame(item_code = c(1:500))
in_cluster$cluster <-
  sample(5, size = nrow(in_cluster), replace = TRUE)
real_sales <- data.frame(item_code = numeric(0), sales = numeric(0))
real_sales <-
  data.frame(
    item_code = sample(500, size = 100000, replace = TRUE),
    sales = sample(500, size = 100000, replace = TRUE)
  )
mean_trajectory <- data.frame(sales = c(1:52))
mean_trajectory$sales <- sample(500, size = 52, replace = TRUE)
training_df <- data.frame(
  LTF_t_minus_1 = numeric(0),
  LTF_t = numeric(0),
  LTF_t_plus_1 = numeric(0),
  RS_t_minus_1 = numeric(0),
  RS_t = numeric(0),
  STF_t_plus_1 = numeric(0)
)
training_df[nrow(training_df) + 1, ] <-
  c(0, 0, mean_trajectory$sales[[1]], 0, 0, 19) # week 0
week = 2
I have a simple function in R in which all I do is:
system.time({
  for (r in 1:nrow(in_cluster)) {
    item <- in_cluster[r,]
    sale_row <-
      dplyr::filter(real_sales, item_code == item$item_code)
    if (nrow(sale_row) > 2) {
      new_df <- data.frame(
        LTF_t_minus_1 = mean_trajectory$sales[[week - 1]],
        LTF_t = mean_trajectory$sales[[week]],
        LTF_t_plus_1 = mean_trajectory$sales[[week + 1]],
        RS_t_minus_1 = sale_row$sales[[week - 1]],
        RS_t = sale_row$sales[[week]],
        STF_t_plus_1 = sale_row$sales[[week + 1]]
      )
      training_df <-
        bind_rows(training_df, new_df)
    }
  }
})
I am quite new to R and find this really strange given how small the data is and how long it takes to loop through the data frame (421.59 seconds to loop through 500 rows).
EDIT_IMPORTANT: However, for the dummy data set given above it took only 1.10 seconds to get the output. Could this be because item_code is a string in my real data? Does it really take that much longer to process a string item_code? (I didn't use strings for the dummy data set because I do not know how to generate 500 unique strings for item_code in in_cluster and reuse the same strings as item_code in real_sales.)
I read through a few other articles that suggested ways to optimize R code, and used bind_rows instead of rbind, or:
training_df[nrow(training_df) + 1,] <-
  c(mean_trajectory$sales[[week-1]], mean_trajectory$sales[[week]], mean_trajectory$sales[[week+1]],
    sale_row$sales[[week-1]], sale_row$sales[[week]], sale_row$sales[[week+1]])
Using bind_rows seems to have improved performance by 36 seconds when looping through the 500 rows of the data frame in_cluster.
Is it possible to use lapply in this scenario? I tried the code below and got an error:
Error in filter_impl(.data, dots) : $ operator is invalid for
atomic vectors
myfun <- function(item, sales, mean_trajectory, week) {
  sale_row <- filter(sales, item_code == item$item_code)
  data.frame(
    LTF_t_minus_1 = mean_trajectory$sales[[week-1]],
    LTF_t = mean_trajectory$sales[[week]],
    LTF_t_plus_1 = mean_trajectory$sales[[week+1]],
    RS_t_minus_1 = sale_row$sales[[week-1]],
    RS_t = sale_row$sales[[week]],
    STF_t_plus_1 = sale_row$sales[[week+1]])
}
system.time({
  lapply(in_cluster, myfun, sales = sales, mean_trajectory = mean_trajectory) %>% bind_rows()
})
Help with lapply would be appreciated; however, my main goal is to speed up the loop.
Ok, so there are a lot of bad practices in your code.
You are operating per row
You are creating 2(!) new data frames per row (very expensive)
You are growing objects in a loop (training_df <- bind_rows(training_df, new_df) keeps growing in each iteration while running a pretty expensive operation, bind_rows)
You are running the same operations over and over again when you could just run them once (why are you computing mean_trajectory$sales[[week-1]] and the like per row when mean_trajectory has nothing to do with the loop? You could just assign it afterwards).
And the list goes on...
I would suggest an alternative, simple data.table solution which will perform much better. The idea is to first make a binary join between in_cluster and real_sales (and run all the operations while joining, without creating extra data frames and then binding them). Then, run all the mean_trajectory-related lines only once. (I ignored the training_df[nrow(training_df) + 1, ] <- c(0, 0, mean_trajectory$sales[[1]], 0, 0, 19) initialization as it's irrelevant here and you can just add it back afterwards using rbind; see the sketch after the code below.)
library(data.table) #v1.10.4
## First step
res <-
  setDT(real_sales)[setDT(in_cluster),                            # binary join
                    if(.N > 2) .(RS_t_minus_1 = sales[week - 1],  # The stuff you want to do
                                 RS_t = sales[week],              # by condition
                                 STF_t_plus_1 = sales[week + 1]),
                    on = "item_code",                             # The join key
                    by = .EACHI]                                  # Do the operations per each join
## Second step (run the `mean_trajectory` only once)
res[, `:=`(LTF_t_minus_1 = mean_trajectory$sales[week - 1],
           LTF_t = mean_trajectory$sales[week],
           LTF_t_plus_1 = mean_trajectory$sales[week + 1])]
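And if you do want the week-0 initialization row back, it can be added afterwards roughly like this (a sketch; the column order and the c(0, 0, ..., 19) values come from training_df in the question):
res_final <- rbind(
  data.table(LTF_t_minus_1 = 0, LTF_t = 0, LTF_t_plus_1 = mean_trajectory$sales[[1]],
             RS_t_minus_1 = 0, RS_t = 0, STF_t_plus_1 = 19),                       # the week-0 row
  res[, .(LTF_t_minus_1, LTF_t, LTF_t_plus_1, RS_t_minus_1, RS_t, STF_t_plus_1)]   # reorder to match
)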
Some benchmarks:
### Creating your data sets
set.seed(123)
N <- 1e5
N2 <- 5e7
in_cluster <- data.frame(item_code = c(1:N))
real_sales <-
  data.frame(
    item_code = sample(N, size = N2, replace = TRUE),
    sales = sample(N, size = N2, replace = TRUE)
  )
mean_trajectory <- data.frame(sales = sample(N, size = 25, replace = TRUE))
training_df <- data.frame(
  LTF_t_minus_1 = numeric(0),
  LTF_t = numeric(0),
  LTF_t_plus_1 = numeric(0),
  RS_t_minus_1 = numeric(0),
  RS_t = numeric(0),
  STF_t_plus_1 = numeric(0)
)
week = 2
###############################
################# Your solution
system.time({
  for (r in 1:nrow(in_cluster)) {
    item <- in_cluster[r,, drop = FALSE]
    sale_row <-
      dplyr::filter(real_sales, item_code == item$item_code)
    if (nrow(sale_row) > 2) {
      new_df <- data.frame(
        LTF_t_minus_1 = mean_trajectory$sales[[week - 1]],
        LTF_t = mean_trajectory$sales[[week]],
        LTF_t_plus_1 = mean_trajectory$sales[[week + 1]],
        RS_t_minus_1 = sale_row$sales[[week - 1]],
        RS_t = sale_row$sales[[week]],
        STF_t_plus_1 = sale_row$sales[[week + 1]]
      )
      training_df <-
        bind_rows(training_df, new_df)
    }
  }
})
### Ran forever- I've killed it after half an hour
######################
########## My solution
library(data.table)
system.time({
  res <-
    setDT(real_sales)[setDT(in_cluster),
                      if(.N > 2) .(RS_t_minus_1 = sales[week - 1],
                                   RS_t = sales[week],
                                   STF_t_plus_1 = sales[week + 1]),
                      on = "item_code",
                      by = .EACHI]
  res[, `:=`(LTF_t_minus_1 = mean_trajectory$sales[week - 1],
             LTF_t = mean_trajectory$sales[week],
             LTF_t_plus_1 = mean_trajectory$sales[week + 1])]
})
# user system elapsed
# 2.42 0.05 2.47
So for 50 million rows the data.table solution ran for about 2 seconds, while your solution ran endlessly until I killed it (after half an hour).

R: Running row-wise operations between data frames

I'd like to run a statistical test, row-by-matching-row, between two data frames gex and mxy. The catch is that I need to run it several times, each time using a different column from gex, yielding a different vector of test results for each run.
Here is what I have so far (using example values), after much help from #kristang.
gex <- data.frame("sample" = c(987,7829,15056,15058,15072),
"TCGA-F4-6703-01" = runif(5, -1, 1),
"TCGA-DM-A28E-01" = runif(5, -1, 1),
"TCGA-AY-6197-01" = runif(5, -1, 1),
"TCGA-A6-5657-01" = runif(5, -1, 1))
colnames(gex) <- gsub("[.]", "_",colnames(gex))
listx <- c("TCGA_DM_A28E_01","TCGA_A6_5657_01")
mxy <- data.frame("TCGA-AD-6963-01" = runif(5, -1, 1),
"TCGA-AA-3663-11" = runif(5, -1, 1),
"TCGA-AD-6901-01" = runif(5, -1, 1),
"TCGA-AZ-2511-01" = runif(5, -1, 1),
"TCGA-A6-A567-01" = runif(5, -1, 1))
colnames(mxy) <- gsub("[.]", "_",colnames(mxy))
zScore <- function(x,y)((as.numeric(x) - as.numeric(rowMeans(y,na.rm=T)))/as.numeric(sd(y,na.rm=T)))
## BELOW IS FOR DIAGNOSTICS
write.table(mxy, file = "mxy.csv",
row.names=FALSE, col.names=TRUE, sep=",", quote=F)
write.table(gex, file = "gex.csv",
row.names=FALSE, col.names=TRUE, sep=",", quote=F)
## ABOVE IS FOR DIAGNOSTICS
for(i in seq(nrow(mxy)))
  for(colName in listx){
    zvalues <- zScore(gex[,colName[colName %in% names(gex)]],
                      mxy[i,])
    ## BELOW IS FOR DIAGNOSTICS
    write.table(gex[,colName[colName %in% names(gex)]], file=paste0(colName, "column", ".csv"),
                row.names=FALSE, col.names=FALSE, sep=",", quote=F)
    write.table(mxy[i,], file=paste0(colName, "mxyinput", ".csv"),
                row.names=FALSE, col.names=FALSE, sep=",", quote=F)
    ## ABOVE IS FOR DIAGNOSTICS
    geneexptest <- data.frame(gex$sample, zvalues, row.names = NULL,
                              stringsAsFactors = FALSE)
    write.csv(geneexptest, file = paste0(colName, ".csv"),
              row.names=FALSE, col.names=FALSE, sep=",", quote=F)
  }
The problem is that while it seems to run through and create the correct number of output files with the correct number of rows, etc., it does not yield correct z-scores. I want it to calculate:
((Value from row z & given column of gex) - (Mean of values in row z across mxy)) / (Standard deviation of values in row z across mxy)
Then move on to the next row, and so on, filling in the first vector. THEN, I want it to calculate the same thing using the next column of gex, filling in a separate vector. I hope this makes sense.
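In other words, for a given row i and column colName the intended calculation is roughly this (just a restatement of the formula above, using plain mean() and sd()):
z_i <- (gex[i, colName] - mean(as.numeric(mxy[i, ]), na.rm = TRUE)) /
  sd(as.numeric(mxy[i, ]), na.rm = TRUE)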
I have a separate script which runs the same test using a pre-determined column vs the other data frame. The relevant for loop from that script looks like this:
for(i in seq_along(mxy)){
  zvalues[i] <- (gex_column_W[i] - mean(mxy[i,])) / sd(mxy[i,])
}
I think there may be a typo in your code: specifically, you say you want the "Mean of values in row z across mxy" but are using mean(mxy[,i]), which selects the i'th column, not the i'th row. I re-wrote this section with for loops for clarity (not sure why you were using lapply?).
# a function for calculating the z score
zScore <- function(x,y)(x - mean(y,na.rm=T))/sd(y,na.rm=T)
for(i in seq(nrow(mxy))) # note that length(mxy) is actually the number of columns in mxy
  for(colName in listx){
    zvalues <- zScore(gex[,colName], # column == colName
                      mxy[i,])       # row == i
    geneexptest <- data.frame(gex$sample, zvalues, row.names = NULL,
                              stringsAsFactors = FALSE)
    write.table(geneexptest, file = paste0(colName, "mxyinput", ".csv"),
                row.names=FALSE, col.names=FALSE, quote=F,
                sep = ",", dec = ".", append=(i > 1))
  }
and an alternative that does not rely on append:
for(colName in listx){
  geneexptest <- NULL
  for(i in seq(nrow(mxy))) {
    zvalues <- zScore(gex[,colName], # column == colName
                      mxy[i,])       # row == i
    geneexptest <- rbind(geneexptest,
                         data.frame(gex$sample, zvalues, row.names = NULL,
                                    stringsAsFactors = FALSE))
  }
  write.table(geneexptest, file = paste0(colName, "mxyinput", ".csv"),
              row.names=FALSE, col.names=FALSE, quote=F,
              sep = ",", dec = ".") # no append needed: the whole data frame is written once per colName
}
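As a further variation (my sketch, echoing the first answer on this page), each per-column data frame can be built in one shot with do.call(rbind, lapply(...)) rather than grown with rbind inside the loop:
for(colName in listx){
  geneexptest <- do.call(rbind, lapply(seq(nrow(mxy)), function(i) {
    zvalues <- zScore(gex[,colName], as.numeric(mxy[i,])) # as.numeric(): give mean()/sd() a plain vector
    data.frame(gex$sample, zvalues, row.names = NULL, stringsAsFactors = FALSE)
  }))
  write.table(geneexptest, file = paste0(colName, "mxyinput", ".csv"),
              row.names=FALSE, col.names=FALSE, quote=F,
              sep = ",", dec = ".")
}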
