I am trying to optimize a simple R code I wrote on two aspects:
1) For loops
2) Writing data into my PostgreSQL database
For 1), I know for loops should be avoided at all costs and it's recommended to use lapply, but I am not clear on how to translate my code below using lapply.
For 2), what I do below works, but I am not sure it is the most efficient way (for example, doing it this way versus rbinding all the data into an R data frame and then loading the whole data frame into my PostgreSQL database).
EDIT: I updated my code with a reproducible example below.
library(rvest)
library(DBI)

for (i in 1:100) {
  search <- paste0("https://github.com/search?o=desc&p=", i, "&q=R&type=Repositories")
  download.file(search, destfile = 'scrape.html', quiet = TRUE)
  url <- read_html('scrape.html')
  github_title <- url %>% html_nodes(xpath = "//div[@class='mt-n1']") %>% html_text()
  github_link  <- url %>% html_nodes(xpath = "//div[@class='mt-n1']//@href") %>% html_text()
  df <- data.frame(github_title, github_link)
  colnames(df) <- c("title", "link")
  dbWriteTable(con, "my_database", df, append = TRUE, row.names = FALSE)
  cat(i)
}
Thanks a lot for all your inputs!
First of all, it is a myth that should be thoroughly debunked that lapply is in any way faster than equivalent code using a for loop. That difference was fixed years ago, and a properly written for loop should in every case be at least as fast as the equivalent lapply.
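A quick way to check this on your own machine (an illustrative sketch, not part of the original code; timings vary by R version and hardware) is to time both forms with system.time():
# Timing sketch: sum each element of a list with a preallocated for loop vs lapply
x <- replicate(1e5, runif(10), simplify = FALSE)
system.time({
  out_for <- vector('list', length(x))
  for (i in seq_along(x)) out_for[[i]] <- sum(x[[i]])
})
system.time(out_lapply <- lapply(x, sum))
The exact numbers will differ per machine; the point is only that neither form carries a meaningful penalty in itself.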
I will illustrate using a for loop, as you seem to find that more intuitive. Do however note that I work mostly in T-SQL, so some conversion might be necessary.
n <- 1e5
## Preallocate a list to hold one data.frame per iteration
outputDat <- vector('list', n)

for (i in seq_len(n)) {
  id <- element_a[i]
  location <- element_b[i]
  language <- element_c[i]
  date_creation <- element_d[i]
  df <- data.frame(id, location, language, date_creation)
  colnames(df) <- c("id", "location", "language", "date_creation")
  outputDat[[i]] <- df
}
## Combine data.frames
outputDat <- do.call('rbind', outputDat)
#Write the combined data.frame into the database.
##dbBegin(con) #<= might speed up might not.
dbWriteTable(con, "my_database", outputDat, append = TRUE, row.names = FALSE)
##dbCommit(con) #<= might speed up might not.
Using Transact-SQL you could alternatively combine the entire string into a single INSERT INTO statement. Here I'll deviate and use apply to iterate over the rows, as it is much more readable in this case. A for loop is once again just as fast if done properly.
# Create the VALUES part of the statement here
statement <- paste0("('", apply(outputDat, 1, paste0, collapse = "','"), "')", collapse = ",\n") #\n can be removed, but makes printing nicer.
##Optional: Print a bit of the statement
# cat(substr(statement, 1, 2000))
##dbBegin(con) #<= might speed up might not.
dbExecute(con, paste0(
  "
  /*
  SET NOCOUNT ON seems to be necessary with the DBI API:
  it seems to react to 'n rows affected' messages.
  Note this only affects this method, not the one using dbWriteTable.
  */
  --SET NOCOUNT ON
  INSERT INTO [my table] VALUES ", statement))
##dbCommit(con) #<= might speed up might not.
Note, as I comment in the code, that this might simply fail to upload the table properly, as the DBI package sometimes seems to fail this kind of transaction if it results in one or more 'n rows affected' messages.
Last but not least, once the statements are made, they can be copied and pasted from R into any GUI that accesses the database directly, using for example writeLines(statement, 'clipboard') or by writing them to a text file (a file is more stable if your data contains a lot of rows). In rare outlier cases this last resort can be faster, if for whatever reason DBI or alternative R packages run overly slow for no apparent reason. As this seems to be somewhat of a personal project, that might be sufficient for your use.
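For example, a minimal sketch of that last resort (the output file name here is just an illustration):
# Dump the generated statement to a .sql file that can be run from a database GUI
writeLines(paste("INSERT INTO [my table] VALUES", statement), "insert_my_table.sql")
# writeLines(statement, 'clipboard')  # clipboard variant, Windows only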
I'm a newbie to the bigstatsr package. I have an SQLite database which I want to convert to an FBM matrix of 40k rows (genes) by 60k columns (samples) for later consumption. I found examples of how to populate the matrix with random values, but I'm not sure of the best way to populate it with values from my SQLite database.
Currently I do it sequentially, here's some mock code:
library(bigstatsr)
library(DBI)
library(RSQLite)
library(dplyr)
number_genes <- 50e3
number_samples <- 70e3
large_genomic_matrix <- bigstatsr::FBM(nrow = number_genes,
ncol = number_samples,
type = "double",
backingfile = "fbm_large_genomic_matrix")
# Code to get a single df at the time
database_connection <- dbConnect(RSQLite::SQLite(), "database.sqlite")
sample_index_counter <- 1
for(current_sample in vector_with_sample_names){
sqlite_df <- dplyr::tbl(database_connection, "genomic_data") %>%
dplyr::filter(sample == current_sample) %>%
dplyr::collect()
large_genomic_matrix[, sample_index_counter] <- sqlite_df$value
sample_index_counter <- sample_index_counter + 1
}
big_write(large_genomic_matrix, "large_genomic_matrix.out", every_nrow = 1000, progress = interactive())
I have two questions:
Is there a way of populating the matrix more efficiently? I'm not sure whether big_apply could be used here, or perhaps foreach.
Do I always have to use big_write in order to load my matrix later? If so, why can't I just use the .bk file?
Thanks in advance
That is a very good first attempt on your own.
What is inefficient here is running dplyr::filter(sample == current_sample) once for every single sample. I would try to use match() first to get the indices. Populating each column individually is also a bit inefficient; as you said, you could use big_apply() to do this by blocks.
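As a rough, untested sketch of the block-wise idea (the SQL query, the column names sample/value, and the assumption that values come back in gene order are mine, not from the question):
# Fill the FBM in blocks of columns rather than one column at a time
big_apply(large_genomic_matrix, a.FUN = function(X, ind) {
  block <- sapply(vector_with_sample_names[ind], function(s) {
    DBI::dbGetQuery(database_connection,
                    "SELECT value FROM genomic_data WHERE sample = ?",
                    params = list(s))$value
  })
  X[, ind] <- block  # one assignment per block of columns
  NULL
}, ind = seq_along(vector_with_sample_names), block.size = 500)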
big_write() is for writing the FBM to some text file (e.g. a csv). What you want here is to use FBM()$save() (second line of the example in the README), and then use big_attach() on the .rds file (next line of the README).
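A small sketch of that, following the backingfile name used above:
large_genomic_matrix$save()  # writes fbm_large_genomic_matrix.rds next to the .bk file
# ...later, in a new session, re-attach without re-reading the data:
large_genomic_matrix <- big_attach("fbm_large_genomic_matrix.rds")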
I have 30 sas-files (dataset1.sas7bdat through dataset30.sas7bdat, approx. 10 GB per file) in a folder, and need to analyse a subset of rows in these data files (all rows where the character variable A begins with 10).
Thus, I need to read each of the SAS files into R, filter a subset with grep on variable A, and then save each of these filtered datasets as an .rds file.
I'm trying to achieve this using a for loop over list.files() and the haven package to read the SAS files. In order to avoid going out of memory, I need to remove the imported dataset in each iteration after the subset has been filtered and saved as .rds.
Though neither elegant nor satisfying, I could hard-code it manually 30 times over like this, copy/pasting and incrementing the suffixes by 1 each time:
dt1 <- haven::read_sas("~/folder/dataset1.sas7bdat")
dt1 <- data.table::as.data.table(dt1)
dt1 <- dt1[grep("^10", A)]
saveRDS(dt1, "~/folder/subset1.rds")
dt2 <- haven::read_sas("~/folder/dataset2.sas7bdat")
dt2 <- data.table::as.data.table(dt2)
dt2 <- dt2[grep("^10", A)]
saveRDS(dt2, "~/folder/subset2.rds")
etc.
While the following for loop technically works to read the files into memory, it is never going to finish because it quickly runs out of memory, so it does not let me filter the data:
folder <- "~/folder/"
file_list <- list.files(path = folder, pattern = "^dataset")
for (i in seq_along(file_list)) {
  assign(file_list[i], haven::read_sas(paste(folder, file_list[i], sep = '')))
}
Is there a way to - on each iteration in the loop - filter the dataset, remove the unfiltered dataset and save the subset in a .rds-file?
I can't seem to come up with a way to incorporate this into my approach of using the assign() function.
Is there a better way to go about this?
You could potentially free up memory by clearing the workspace after each load-and-filter step via
rm(list=ls())
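A more targeted variant (an untested sketch; the output path is just an illustration) removes only the object that was just processed and then calls gc():
library(haven)
library(data.table)
folder <- "~/folder/"
file_list <- list.files(path = folder, pattern = "^dataset")
for (f in file_list) {
  dt <- as.data.table(read_sas(paste0(folder, f)))
  dt <- dt[grep("^10", A)]
  saveRDS(dt, paste0(folder, sub("\\.sas7bdat$", ".rds", f)))
  rm(dt)  # drop the large object...
  gc()    # ...and ask R to release the memory before the next file
}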
I slept on it, and woke up with a working solution using a function to do the work, and a loop to cycle through filenames. This also enables me to save the output in a different folder (my raw data folder is read-only):
library(haven)
library(data.table)
fromFolder <- "~/folder_with_input_data/"
toFolder <- "~/folder_with_output_data/"
import_sas <- function(filename) {
dt <- read_sas(paste(fromFolder, filename, sep=''), NULL)
dt <- as.data.table(dt)
dt <- dt[grep("^10", A)]
saveRDS(dt, paste(toFolder, filename, '.rds', sep = ''), compress = FALSE)
remove(dt)
}
file_list <- list.files(path = fromFolder, pattern="^dataset")
for (filename in file_list) {
import_sas(filename)
}
I haven't tested this with the full 30 files yet. I'll do that tonight.
If I encounter problems, I will post an update tomorrow. Otherwise, this question can be closed in 48 hours.
Update: It worked without a hitch and completed the 297 GB conversion in around 13 hours. I don't think it can be optimized to run much faster; the vast majority of the computing time is spent opening the SAS files, which I don't think can be done faster by means other than haven. Unless someone has an idea to optimize the process, this question can be closed.
I'm a newbie in R, pre-processing big data of a million lines to label the connected components and send the output to a file. But it is taking an awful lot of time using a for loop and cat(). Is there a faster way to write the output file in R? I am sharing a sample of the code. Any alternative methods, or a rewrite using a function that makes it more efficient, would be highly appreciated.
# Simple example of an undirected graph
library(igraph)
g <- graph_from_literal(a--b, a--c, b--c, d--e)
plot(g)
#Connected components
#The option, mode, is ignored for undirected graphs
comp <- components(g, mode = "weak")
#output to a file
fout <- file("output.txt", "w")
for (v in V(g)) {
vn <- V(g)$name[v]
comp_id <- comp$membership[vn][[1]]
comp_size <- comp$csize[comp_id]
cat(sprintf("%s\t%s\t%s\n", vn, comp_id, comp_size), file=fout)
}
close(fout)
It seems like everything is vectorized and no for loop is needed. This gives the same output and uses data.table::fwrite, which will be quite a bit faster than cat.
vv = V(g)
vn = vv$name
comp_id = comp$membership[vv$name]
comp_size = comp$csize[comp_id]
data.table::fwrite(data.table::data.table(vn, comp_id, comp_size), "output.txt", col.names = FALSE, sep = "\t")
If you don't want the data table dependency, you could use base::write.table, which would still be better than pasting together strings with tabs yourself.
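For example, a base-only sketch along those lines:
# Base R alternative: no data.table dependency, still no explicit loop
write.table(data.frame(vn, comp_id, comp_size), "output.txt",
            sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)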
I faced a similar problem, i.e. how to write 3 million (short) lines into a text file. I found that using writeChar sped up the file-writing process considerably (from several minutes to seconds).
Below, I replaced cat with writeChar in your code:
g <- graph_from_literal(a--b, a--c, b--c, d--e)
plot(g)
#Connected components
#The option, mode, is ignored for undirected graphs
comp <- components(g, mode = "weak")
# first clean the file if it exists
fout <- file("output.txt", "wb")
close(fout)
# switch to append mode
fout <- file("output.txt", "ab")
for (v in V(g)) {
vn <- V(g)$name[v]
comp_id <- comp$membership[vn][[1]]
comp_size <- comp$csize[comp_id]
# set eos = NULL to avoid NULL terminators
writeChar(sprintf("%s\t%s\t%s\n", vn, comp_id, comp_size), con = fout, eos = NULL)
}
close(fout)
(Caveat emptor: I don't have any of your data, so this is untested.)
Instead of writing each line within your loop, generate a vector of strings (one per output line) and write once at the end. This type of file I/O is much more efficient.
all_lines <- sapply(V(g), function(v) {
vn <- V(g)$name[v]
comp_id <- comp$membership[vn][[1]]
comp_size <- comp$csize[comp_id]
sprintf("%s\t%s\t%s", vn, comp_id, comp_size)  # no trailing \n; writeLines() adds one per element
})
writeLines(all_lines, "output.txt")
The use of sapply is one of R's efficiencies, doing things as "vectors of things". It is not strictly necessary (this could be done with a for loop, though several precautions need to be taken to avoid being grossly inefficient, especially when dealing with a million lines; see the sketch below), but once one can "grok" the intent of vector mechanics, it might become easier to understand and deal with.
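For completeness, a for-loop version with the usual precaution of preallocating the output vector (a sketch equivalent to the sapply above):
# Preallocate the character vector instead of growing it inside the loop
n_v <- length(V(g))
all_lines <- character(n_v)
for (v in seq_len(n_v)) {
  vn <- V(g)$name[v]
  comp_id <- comp$membership[vn][[1]]
  comp_size <- comp$csize[comp_id]
  all_lines[v] <- sprintf("%s\t%s\t%s", vn, comp_id, comp_size)
}
writeLines(all_lines, "output.txt")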
I'm trying to update an xml file with new nodes using xml2. It's easy if I just write everything manually as text,
library(xml2)
oldXML <- read_xml("<Root><Trial><Number>3.14159 </Number><Adjective>Fast </Adjective></Trial></Root>")
but I'm developing an application that will run calculations and then put those values into the xml, so I need a mix of character and variables. It ends up looking like:
var1 <- 4.567
var2 <- "Slow"
newLine <- read_xml(paste0("<Trial><Number>",var1," </Number><Adjective>",var2," </Adjective></Trial>"))
xml_add_child(oldXML,newLine)
I suspect there's a much less kludgy way to do this than using paste0, but I can't get anything else to work. I'd like to be able to just instruct it to update the xml by reference to the dataframe, such that it can create new trials:
<Trial>
<Number>df$number[1]</Number>
<Adjective>df$adjective[1]</Adjective>
</Trial>
<Trial>
<Number>df$number[2]</Number>
<Adjective>df$adjective[2]</Adjective>
</Trial>
Is there any way to create new Trial nodes in approximately that fashion, or at least more naturally than using paste0 to insert variables? Is this something the XML package does better than xml2?
If you have your new values in a data.frame like this:
vars <- data.frame(Number = c(4.567, 3.211),
Adjective = c("Slow", "Slow"),
stringsAsFactors = FALSE)
you can convert it to a list of xml_document's as follows:
vars_xml <- lapply(purrr::transpose(vars),
function(x) {
as_xml_document(list(Trial = lapply(x, as.list)))
})
Then you can add the new nodes to the original xml:
for(trial in vars_xml) xml_add_child(oldXML, trial)
I don't know that this is better than your paste approach. Either way, you can wrap it in a function so you only have to write the ugly code once.
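For instance, a sketch of such a wrapper (the function name is just an illustration):
# Hypothetical helper: add one <Trial> node per row of a data.frame to an xml document
add_trials <- function(doc, vars) {
  vars_xml <- lapply(purrr::transpose(vars),
                     function(x) as_xml_document(list(Trial = lapply(x, as.list))))
  for (trial in vars_xml) xml_add_child(doc, trial)
  invisible(doc)
}
add_trials(oldXML, vars)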
Here's a solution that builds on @Ista's excellent answer. Basically, I've dropped the first lapply in favor of purrr::map (we could probably replace the second lapply with a map as well, but I couldn't find a more readable way to accomplish that).
library(purrr)
vars_xml <- transpose(vars) %>%
map(~as_xml_document(list(Trial = lapply(.x, as.list))))
I've seen elsewhere on this site that a single variable can be passed to sqldf, but I haven't been able to track down a way to loop through all the values in "myplace":
myplace <- c("place1", "place2", "place3", ...)
myquery <- sqldf(select things from mydata where place = myplace)
myplace.graph1 <- ggplot2(myquery)
My aim is to automate the production of a number of tables and graphs for each "myplace" element - I want to do this as my dataset is updated monthly and I have to report on it. (For this reason, I don't think grouping, as suggested by a similar query on this site, is the way to go although I stand to be corrected).
I am in the process of learning the R ropes in order to replace a bunch of muddled spreadsheets - my data is now in SQLite, but I couldn't see a way of looping through a dbGetQuery either.
It could well be that a completely different approach is required - I'm exploring R because the graphs look great, I can document my steps and it's open source - I would appreciate any advice.
Many thanks
1) A query is just a text string. You can manipulate that text string in all the usual ways before passing it to sqldf. See ?paste, ?paste0, ?sprintf, etc.
qry <- paste0("select things from mydata where place = '", myplace[1], "'")
myplace.graph1 <- plot(sqldf(qry))
To repeat this for every element of myplace in a loop:
myplace.graph <- list()
for(i in seq_along(myplace)) {
  qry <- paste0("select things from mydata where place = '", myplace[i], "'")
  myplace.graph[[i]] <- plot(sqldf(qry))
}
2) Or, without a loop:
myplace.graph <- lapply(myplace, function(x) {
  qry <- paste0("select things from mydata where place = '", x, "'")
  plot(sqldf(qry))
})
3) Or using fn$ from the gsubfn package (which is automatically loaded by sqldf, so it is available), as in Example 5 on the sqldf home page:
sql <- "select things from mydata where place = '$p' "
lapply(myplace, function(p) plot(fn$sqldf(sql)))
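Here fn$ performs string interpolation: each $p in sql is replaced by the current value of p before the query reaches sqldf, so for p equal to "place1" the query actually run is select things from mydata where place = 'place1'. This avoids having to build the string with paste yourself.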