How do you bulk insert documents in MongoDB from R?

I have a dataframe in R and I am trying to bulk insert each row of the dataframe as a separate document in MongoDB. The closest I have come is the following script, which creates a single document and stores the rows of the dataframe as its subdocuments.
x <- toJSON(unname(split(score, 1:nrow(score))))
bson <- mongo.bson.from.JSON(x)
mongo.insert(mongo, 'abc.abc', bson)
What I want instead is each row as a separate document. The method above is very fast, but looping over the rows would slow things down considerably.

The new mongolite package does this automatically:
library(mongolite)
m <- mongo("iris")
m$insert(iris)

library(rmongodb)
mongo <- mongo.create()  # assumes a MongoDB server on localhost with default settings
df <- data.frame(A=c("a","a","b","b"), B=c("X","X","Y","Z"), C=c(1,2,3,4),
stringsAsFactors = FALSE)
lst <- split(df, rownames(df))
bson_lst <- lapply(lst, mongo.bson.from.list)
mongo.insert.batch(mongo = mongo, ns = "db.collection", lst = bson_lst)
And please, don't use mongo.bson.from.JSON; use mongo.bson.from.list instead. It is a much more straightforward (and much faster) way to convert an R object to a bson object.
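One caveat about the split() step, shown as a small base R sketch: splitting on rownames(df) groups by character labels, so with ten or more rows the resulting list is ordered "1", "10", "11", ... rather than by row position. If insertion order matters, splitting on seq_len(nrow(df)) keeps the original order. The example data frame below is made up purely for illustration.

```r
# Hypothetical 12-row data frame, used only to show the ordering difference.
df <- data.frame(id = 1:12)

# Splitting on character row names sorts the groups alphabetically.
by_rowname <- split(df, rownames(df))

# Splitting on integer positions keeps the rows in their original order.
by_position <- split(df, seq_len(nrow(df)))

print(names(by_rowname))
print(names(by_position))
```

Either list can then be passed through lapply(..., mongo.bson.from.list) as in the answer above.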

Related

How to populate bigstatsr::FBM with sqlite database for later consumption?

I'm a newbie to the bigstatsr package. I have a sqlite database which I want to convert to an FBM matrix of 40k rows (genes) 60K columns (samples) for later consumption. I found examples of how to populate the matrix with random values but I'm not sure of what would be the best way to populate it with values from my sqlite database.
Currently I do it sequentially, here's some mock code:
library(bigstatsr)
library(RSQLite)
library(dplyr)
number_genes <- 50e3
number_samples <- 70e3
large_genomic_matrix <- bigstatsr::FBM(nrow = number_genes,
ncol = number_samples,
type = "double",
backingfile = "fbm_large_genomic_matrix")
# Code to get a single df at the time
database_connection <- dbConnect(RSQLite::SQLite(), "database.sqlite")
sample_index_counter <- 1
for(current_sample in vector_with_sample_names){
sqlite_df <- dplyr::tbl(database_connection, "genomic_data") %>%
dplyr::filter(sample == current_sample) %>%
dplyr::collect()
large_genomic_matrix[, sample_index_counter] <- sqlite_df$value
sample_index_counter <- sample_index_counter + 1
}
big_write(large_genomic_matrix, "large_genomic_matrix.out", every_nrow = 1000, progress = interactive())
I have two questions:
Is there a way of populating the matrix more efficiently? Not sure if big_apply could be used here, perhaps foreach
Do I always have to use big_write in order to load my matrix later? If so why can't I just use the bk file?
Thanks in advance
That is a very good first attempt.
What is inefficient here is running dplyr::filter(sample == current_sample) once for every sample. I would first use match() to get the indices. Populating each column individually is also somewhat inefficient; as you said, you could use big_apply() to do this in blocks.
big_write() is for writing the FBM to some text file (e.g. csv). What you want here is to use FBM()$save() (second line of the example in the README), and then use big_attach() on the .rds file (next line of the README).
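The match() idea can be sketched in base R as follows. This is only an illustration under assumptions: a plain matrix stands in for the FBM, the small long_df data frame stands in for the result of one query against the genomic_data table, and the gene column name is hypothetical (the question only shows sample and value).

```r
# Long-format data standing in for rows read from SQLite in one go.
# The "gene" column is an assumed name; only "sample" and "value" appear in the question.
long_df <- data.frame(
  gene   = c("g1", "g2", "g3", "g1", "g2", "g3"),
  sample = c("s1", "s1", "s1", "s2", "s2", "s2"),
  value  = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

gene_names   <- c("g1", "g2", "g3")  # fixed row order of the target matrix
sample_names <- c("s1", "s2")        # fixed column order of the target matrix

# match() turns names into row/column indices once, instead of filtering per sample.
row_idx <- match(long_df$gene, gene_names)
col_idx <- match(long_df$sample, sample_names)

# A plain matrix stands in for the FBM here.
mat <- matrix(NA_real_, nrow = length(gene_names), ncol = length(sample_names),
              dimnames = list(gene_names, sample_names))

# Fill every cell in one vectorised assignment using a two-column index matrix.
mat[cbind(row_idx, col_idx)] <- long_df$value
print(mat)
```

For a matrix of the size described in the question, the same index computation would be applied to one block of samples at a time, so that only a slice of the table is held in memory.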

Using a For-loop to create multiple objects with incremental suffixes, then reading in .csv file to each new object (also with incremental suffixes)

I've just started learning R so forgive me for my ignorance! I'm reading in lots of .csv files, each of which corresponds to a different year (2010-2019). I then filter down the .csv files based on a variable within one of the columns (because the datasets are very large). Currently I am using the code below to do this and then repeating it for each year:
data_2010 <- data.table::fread("//Project/2010 data/2010 data.csv", select = c("date", "id", "type"))
data_b_2010 <- data_2010[which(data_2010$type=="ABC123")]
rm(data_2010)
What I would like to do is use a For-loop to create new object data_20xx for each year, and then read in the .csv files (and apply the filter of "type") for each year too.
I think I know how to create the objects in a For-loop but not entirely sure how I would also assign the .csv files and change the filepath string so it updates with each year (i.e. "//Project/2010 data/2010 data.csv" to "//Project/2011 data/2011 data.csv").
Any help would be greatly appreciated!
Next time please provide a reproducible example so we can help you.
I would use data.table which contains specialized functions to do what you want.
library(data.table)
setwd("Project")
# List every file below the working directory, then keep only the .csv files
allfiles <- list.files(recursive = TRUE, full.names = TRUE)
allcsv <- allfiles[grepl("\\.csv$", allfiles)]
data_list <- list()
for(i in seq_along(allcsv)) {
print(paste(round(i/length(allcsv),2)))
# [[i]] stores the whole table as one list element
data_list[[i]] <- fread(allcsv[i])
}
data_list_filtered <- lapply(data_list, function(x) {
y <- data.frame(x)
# Keep only the rows whose type is "ABC123"
return(y[which(y[["type"]] == "ABC123"), ])
})
result <- rbindlist(data_list_filtered)
First, list.files will tell you all the files contained in your working dir by default.
Second, read each csv file into the data_list list using the fast and efficient fread function.
Third, filter each table with lapply, keeping only the rows where type is "ABC123".
Fourth, use rbindlist from data.table to rbind all of these data.table's.
Finally, if you are not familiar with the data.table syntax, you can run setDF(result) to convert your results back to a data.frame.
I strongly encourage you to learn the data.table syntax as it is quite powerful and efficient for tabular data manipulations. These vignettes will get you started.
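If you would rather build the file paths explicitly for each year, as the question asks, sprintf() can fill the year into the path pattern. The sketch below only constructs the paths; the read_one() helper is a hypothetical wrapper around the fread call from the question and is defined but not run here, since it needs the actual files.

```r
years <- 2010:2019

# Fill the year into both places of the path pattern from the question.
csv_paths <- sprintf("//Project/%d data/%d data.csv", years, years)
names(csv_paths) <- paste0("data_b_", years)
print(csv_paths)

# Hypothetical helper: read one file and keep only the rows of interest.
# Requires the data.table package and the files to exist.
read_one <- function(path) {
  dt <- data.table::fread(path, select = c("date", "id", "type"))
  dt[dt$type == "ABC123", ]
}

# A named list of filtered tables, one per year, would then be:
# data_b <- lapply(csv_paths, read_one)
```

Keeping the results in one named list (data_b[["data_b_2011"]], for example) is usually easier to work with than creating ten separate objects.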

R code optimization: For loop and writing to a database

I am trying to optimize a simple R code I wrote on two aspects:
1) For loops
2) Writing data into my PostgreSQL database
For 1) I know for loops should be avoided at all cost and it's recommended to use lapply but I am not clear on how to translate my code below using lapply.
For 2) what I do below is working but I am not sure this is the most efficient way (for example doing this way versus rbinding all data into an R dataframe and then load the whole dataframe into my PostgreSQL database.)
EDIT: I updated my code with a reproducible example below.
library(rvest)  # read_html(), html_nodes(), html_text(); also provides %>%
for (i in 1:100){
search <- paste0("https://github.com/search?o=desc&p=", i, "&q=R&type=Repositories")
download.file(search, destfile ='scrape.html',quiet = TRUE)
url <- read_html('scrape.html')
github_title <- url%>%html_nodes(xpath="//div[@class='mt-n1']")%>%html_text()
github_link <- url%>%html_nodes(xpath="//div[@class='mt-n1']//@href")%>%html_text()
df <- data.frame(github_title, github_link )
colnames(df) <- c("title", "link")
dbWriteTable(con, "my_database", df, append = TRUE, row.names = FALSE)
cat(i)
}
Thanks a lot for all your inputs!
First of all, the idea that lapply is in any way faster than equivalent code using a for loop is a myth that should be completely trashed. This has been fixed for years, and a properly written for loop should be at least as fast as the equivalent lapply.
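As a small illustration of that equivalence (a toy sketch, not a benchmark), here is the same computation written as a for loop with a preallocated list and as lapply. The data frames built here are made up; the point is only that both forms produce the same list, which can then be combined once at the end.

```r
n <- 5

# for loop: preallocate the output list, then fill it by index.
squares_loop <- vector("list", n)
for (i in seq_len(n)) {
  squares_loop[[i]] <- data.frame(id = i, square = i^2)
}

# lapply: the same body, with the list built for you.
squares_lapply <- lapply(seq_len(n), function(i) {
  data.frame(id = i, square = i^2)
})

# Both approaches yield the same list of data frames.
same_result <- identical(squares_loop, squares_lapply)
print(same_result)

# Combine once at the end rather than growing a data frame inside the loop.
combined <- do.call(rbind, squares_lapply)
print(combined)
```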
I will illustrate with a for loop, as you seem to find this more intuitive. Do note, however, that I work mostly in T-SQL, so some conversion might be necessary.
n <- 10000
outputDat <- vector('list', n)
for (i in seq_len(n)){
id <- element_a[i]
location <- element_b[i]
language <- element_c[i]
date_creation <- element_d[i]
df <- data.frame(id, location, language, date_creation)
colnames(df) <- c("id", "location", "language", "date_creation")
outputDat[[i]] <- df
}
## Combine data.frames
outputDat <- do.call('rbind', outputDat)
#Write the combined data.frame into the database.
##dbBegin(con) #<= might speed up might not.
dbWriteTable(con, "my_database", outputDat, append = TRUE, row.names = FALSE)
##dbCommit(con) #<= might speed up might not.
Using Transact-SQL you could alternatively combine the entire string into a single insert into statement. Here I'll deviate and use apply to iterate over the rows, as it is much more readable in this case. A for loop is once again just as fast if done properly.
#Create the statements here.
statement <- paste0("('", apply(outputDat, 1, paste0, collapse = "','"), "')", collapse = ",\n") #\n can be removed, but makes printing nicer.
##Optional: Print a bit of the statement
# cat(substr(statement, 1, 2000))
##dbBegin(con) #<= might speed up might not.
# SET NOCOUNT ON seems to be necessary with the DBI API, which appears to
# react to 'n rows affected' messages. This only affects this method,
# not the one using dbWriteTable.
dbExecute(con, paste0(
"--SET NOCOUNT ON
INSERT INTO [my table] VALUES ", statement))
##dbCommit(con) #<= might speed up might not.
Note that, as I mention in the comment, this might simply fail to upload the table properly: the DBI package sometimes seems to fail on this kind of statement if it results in one or more messages about n rows affected.
Last but not least, once the statements are built, they can be copied and pasted from R into any GUI that directly accesses the database, using for example writeLines(statement, 'clipboard'), or written to a text file (a file is more stable if your data contains a lot of rows). In rare cases this last resort can be faster, if for whatever reason DBI or alternative R packages run unreasonably slowly. As this seems to be somewhat of a personal project, this might be sufficient for your use.
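One thing to watch when pasting values into an INSERT string by hand: a single quote inside a value (a repository title such as O'Reilly, say) would end the SQL string literal early. A minimal base R sketch of the usual fix, doubling embedded single quotes, is below. The quote_sql() helper, the example rows and the table name my_table are all made up for illustration; with a live connection, DBI::dbQuoteString() is the more robust choice.

```r
# Hypothetical helper: wrap a value in single quotes and double any embedded ones.
quote_sql <- function(x) {
  paste0("'", gsub("'", "''", as.character(x), fixed = TRUE), "'")
}

# Made-up rows, one of which contains a single quote.
rows <- data.frame(
  title = c("O'Reilly", "plain"),
  link  = c("/a", "/b"),
  stringsAsFactors = FALSE
)

# Build one "(...)" tuple per row, then join them into a single INSERT statement.
tuples <- apply(rows, 1, function(r) {
  paste0("(", paste(quote_sql(r), collapse = ","), ")")
})
statement <- paste0("INSERT INTO my_table VALUES ", paste(tuples, collapse = ",\n"))
cat(statement, "\n")
```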

r run easyPubMed queries row wise from a data frame column

I have a dataframe with a column of search queries.
I would like to run each row's query against PubMed using the easyPubMed package. Each query should fetch a list of PMIDs, and that list should be stored in another column called 'PMID'.
This might work
library(easyPubMed)
library(purrr)
Query <- c('rituximab OR bevacizumab','meningitis OR headache')
Heading <- c('A','B')
x <- as.data.frame(cbind(Heading,Query),stringsAsFactors = FALSE)
x$PMID<- ""
ids <- map(x[,"Query"],get_pubmed_ids)
for (i in seq_along(ids)) {
x[i,"PMID"]<- paste(ids[[i]][["IdList"]],collapse = ",")
}
I think that sapply won't return the expected results, so going the map way from the purrr package is safer.

How to convert dataframe to xml using xml2 package?

I'm trying to update an xml file with new nodes using xml2. It's easy if I just write everything manually as text,
oldXML <- read_xml("<Root><Trial><Number>3.14159 </Number><Adjective>Fast </Adjective></Trial></Root>")
but I'm developing an application that will run calculations and then put those values into the xml, so I need a mix of character and variables. It ends up looking like:
var1 <- 4.567
var2 <- "Slow"
newLine <- read_xml(paste0("<Trial><Number>",var1," </Number><Adjective>",var2," </Adjective></Trial>"))
xml_add_child(oldXML,newLine)
I suspect there's a much less kludgy way to do this than using paste0, but I can't get anything else to work. I'd like to be able to just instruct it to update the xml by reference to the dataframe, such that it can create new trials:
<Trial>
<Number>df$number[1]</Number>
<Adjective>df$adjective[1]</Adjective>
</Trial>
<Trial>
<Number>df$number[2]</Number>
<Adjective>df$adjective[2]</Adjective>
</Trial>
Is there any way to create new Trial nodes in approximately that fashion, or at least more naturally than using paste0 to insert variables? Is this something the XML package does better than xml2?
If you have your new values in a data.frame like this:
vars <- data.frame(Number = c(4.567, 3.211),
Adjective = c("Slow", "Slow"),
stringsAsFactors = FALSE)
you can convert it to a list of xml_document's as follows:
vars_xml <- lapply(purrr::transpose(vars),
function(x) {
as_xml_document(list(Trial = lapply(x, as.list)))
})
Then you can add the new nodes to the original xml:
for(trial in vars_xml) xml_add_child(oldXML, trial)
I don't know that this is better than your paste approach. Either way, you can wrap it in a function so you only have to write the ugly code once.
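A sketch of such a wrapper, using a different technique from the one above: instead of converting lists with as_xml_document, it calls xml_add_child with a node name and a text value, which creates the element directly. The function name add_trials is made up, and the sketch assumes every column of the data frame should become one child element of each Trial.

```r
library(xml2)

# Hypothetical helper: append one <Trial> per row of `vars` to `parent`,
# with one child element per column (named after the column).
add_trials <- function(parent, vars) {
  for (i in seq_len(nrow(vars))) {
    trial <- xml_add_child(parent, "Trial")
    for (col in names(vars)) {
      # An unnamed extra argument becomes the text of the new node.
      xml_add_child(trial, col, as.character(vars[[col]][i]))
    }
  }
  invisible(parent)
}

doc <- read_xml("<Root><Trial><Number>3.14159</Number><Adjective>Fast</Adjective></Trial></Root>")
vars <- data.frame(Number = c(4.567, 3.211),
                   Adjective = c("Slow", "Slow"),
                   stringsAsFactors = FALSE)

add_trials(doc, vars)
cat(as.character(doc))
```

Because xml2 nodes are modified by reference, doc is updated in place and the return value can be ignored.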
Here's a solution that builds on @Ista's excellent answer. Basically, I've dropped the first lapply in favor of purrr::map (we could probably replace the second lapply with a map, but I couldn't find a more readable way to accomplish that).
library(purrr)
vars_xml <- transpose(vars) %>%
map(~as_xml_document(list(Trial = lapply(.x, as.list))))
