Load data from MongoDB into R

I am trying to query my MongoDB database from R, but I think I lost part of the data in the process.
Does R have any limit, and how can I ensure all my records are loaded into R?
Code:
# inspect the number of records in MongoDB
db.complaints.count()
> 395853
# write a query to load data into R
library(rmongodb)   # provides mongo.find(), mongo.cursor.*() and mongo.bson.to.list()
library(plyr)       # provides rbind.fill()

mongo <- mongo.create()                  # connect to the local MongoDB instance
complaints <- data.frame(stringsAsFactors = FALSE)
db <- "customers.complaints"             # namespace: database.collection
cursor <- mongo.find(mongo, db)

while (mongo.cursor.next(cursor)) {
  tmp <- mongo.bson.to.list(mongo.cursor.value(cursor))
  tmp.df <- as.data.frame(t(unlist(tmp)), stringsAsFactors = FALSE)
  complaints <- rbind.fill(complaints, tmp.df)
}
I get [1] 47077 15 when I check the result in R with dim(complaints).
How can I make sure I get the whole collection into R?

http://www.analyticbridge.com/profiles/blogs/time-issue-in-creating-a-huge-data-frame-from-mongodb-collection
The code at the link above, which uses environment variables, might help you. Please do comment here if you find a solution.
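If switching packages is an option, the mongolite package pulls a whole collection in one call, which avoids the cursor loop entirely. A minimal sketch, assuming the database is customers, the collection is complaints, and MongoDB is running on localhost:

library(mongolite)

# connect to the collection (adjust db/collection/url to your setup)
m <- mongo(collection = "complaints", db = "customers",
           url = "mongodb://localhost")

m$count()                    # should report 395853, matching the mongo shell
complaints <- m$find('{}')   # an empty query returns every document
dim(complaints)              # verify the row count matches m$count()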

Related

Sparklyr performance in R compared to other on-disk solutions like SAS: removing duplicates with distinct takes hours in Sparklyr, seconds in SAS

I was hoping to receive some clarification on optimizing Sparklyr performance in R on my local machine.
I have imported a CSV file with 211 million rows (the CSV is 17 gigabytes, so it won't fit in memory) and just a few columns, and I would like to select only the distinct values of one of the columns. To do this I imported the data as "test" using spark_read_csv with Memory = FALSE and a schema generated from the data and saved separately in its own object (the import took a few minutes).
After importing with that function I ran very basic code to deduplicate one column.
It has been running for 2 hours, so I decided to try SAS instead; there I was able to accomplish what I needed in a few minutes.
This seems very problematic to me; even though I am on a local machine, this does not seem like a very difficult problem.
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local", version = "2.3")

download <- function(datapath, dataname) {
  # infer column classes from the first 1000 rows of the CSV
  spec_with_r <- sapply(read.csv(datapath, nrows = 1000), class)
  #spec_explicit <- c(x = "character", y = "numeric")
  system.time(dataname <- spark_read_csv(
    sc,
    path = datapath,
    columns = spec_with_r,
    Memory = FALSE
  ))
  return(dataname)
}
test <- download("./data/metastases17.csv", test)
test2 <- test %>% select(DX) %>% distinct()
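One detail worth checking (this is an assumption about a possible contributing factor, not a confirmed fix): spark_read_csv()'s caching argument is spelled memory in lowercase. R argument matching is case-sensitive, so Memory = FALSE is absorbed by ... and the default memory = TRUE remains in effect, which may make Spark try to cache the full 17 GB table. A sketch of the call with the lowercase argument, reusing datapath and spec_with_r from the function above:

test <- spark_read_csv(
  sc,
  name    = "test",          # explicit table name registered in Spark
  path    = datapath,
  columns = spec_with_r,
  memory  = FALSE            # lowercase: do not cache the full table in memory
)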

Creating database with csv files in Rstudio

I was trying to create a database, and when I looked it up online I found a tutorial (linked here).
The step it took was to use:
my_db_file <- "data/portal-database-output.sqlite"
my_db <- src_sqlite(my_db_file, create = TRUE)
When I do file.exists("database.sqlite"), it prints FALSE. I was wondering if there's a way to get "database.sqlite" so I can finish creating this database? Is it from a package? Any help would be appreciated!
The file that you created with those two lines was portal-database-output.sqlite under the data/ directory. If you were to run
file.exists("data/portal-database-output.sqlite")
then it should return TRUE.
You need to read in the data, create the database, then you can add your data to it.
library(tidyverse)

dir.create("data", showWarnings = FALSE)          # make sure the data/ directory exists
download.file("https://ndownloader.figshare.com/files/3299483",
              "data/species.csv")                 # save the CSV into data/
species <- read_csv("data/species.csv")

my_db_file <- "data/portal-database-output.sqlite"
my_db <- src_sqlite(my_db_file, create = TRUE)
copy_to(my_db, species)                           # copy the table we just read into the database
Output
my_db
src: sqlite 3.35.5 [portal-database-output.sqlite]
tbls: species, sqlite_stat1, sqlite_stat4
file.exists("data/portal-database-output.sqlite")
[1] TRUE
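Note that src_sqlite() has been superseded in newer versions of dplyr; a sketch of the same last step using DBI and RSQLite directly, with the same file path as above:

library(DBI)

con <- dbConnect(RSQLite::SQLite(), "data/portal-database-output.sqlite")
dbWriteTable(con, "species", species, overwrite = TRUE)   # write the data frame as a table
dbListTables(con)                                          # confirm the table is there
dbDisconnect(con)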

How to write contents of data frame back to range?

I need to perform the following sequence:
Open Excel Workbook
Read specific worksheet into R dataframe
Read from a database updating dataframe
Write dataframe back to worksheet
I have steps 1-3 working OK using the BERT tool (the R scripting interface for Excel).
For step 2 I use range.to.data.frame from BERT.
Any pointers on how to perform step 4? There is no data.frame.to.range.
I tried range$put_Value(df), but it returns no error and does not update Excel.
I can update a single cell from R using put_Value, which I cannot find documented.
#
# manipulate status data using R BERT tool
#
wb <- EXCEL$Application$get_ActiveWorkbook()
wbname = wb$get_FullName()
ws <- EXCEL$Application$get_ActiveSheet()
topleft = ws$get_Range( "a1" )
rng = topleft$get_CurrentRegion()
#rngbody = rng$get_Offset(1,0)
ssot = rng$get_Value()
ssotdf = range.to.data.frame( ssot, headers=T )
# emulate data update on 2 columns
ssotdf$ServerStatus = "Disposed"
ssotdf$ServerID = -1
# try to write df back
retcode = rng$put_Value(ssotdf)
This answer doesn't use BERT.
Try the openxlsx package; you can probably do all the steps with it. For step 4, after installing openxlsx, the following code will write a file:
openxlsx::write.xlsx(ssotdf, 'Dataframe.xlsx',asTable = T)
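A fuller sketch of the whole round trip with openxlsx (the workbook path and sheet name here are assumptions; substitute your own):

library(openxlsx)

wb_path <- "status.xlsx"                      # assumed path to the workbook
wb      <- loadWorkbook(wb_path)              # step 1: open the workbook
ssotdf  <- read.xlsx(wb, sheet = "Status")    # step 2: read the worksheet into a data frame

# step 3: update the data frame (emulated as in the question)
ssotdf$ServerStatus <- "Disposed"
ssotdf$ServerID     <- -1

# step 4: write the data frame back to the same sheet and save
writeData(wb, sheet = "Status", x = ssotdf)
saveWorkbook(wb, wb_path, overwrite = TRUE)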
I think your problem is that you are not changing the size of the range, so you are not going to see your new columns. Try creating a new range that has two extra columns before you insert the data.
I just had the same question and was able to resolve it by converting the data.frame to a matrix in the call to put_Value. I figured this out after playing with the old version in excel-functions.r. Try something like:
retcode = rng$put_Value(as.matrix(ssotdf))
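Combining this with the earlier point about range size, a sketch (reusing the rng and ssotdf objects from the question, so treat it as untested) that writes the header row and the data together, keeping the dimensions equal to the original CurrentRegion:

# build a character matrix with the column names as its first row,
# since put_Value() appears to want a matrix rather than a data.frame
out <- rbind(colnames(ssotdf), as.matrix(ssotdf))
retcode <- rng$put_Value(out)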
You may have already solved your problem but, if not, the following stripped down R function does what I think you need:
testDF <- function(rng1,rng2){
app <- EXCEL$Application
ref1 <- app$get_Range( rng1 ) # get source range reference
data <- ref1$get_Value() # get source range data
#
ref2 <- app$get_Range( rng2 ) # get destination range reference
ref2$put_Value( data ) # put data in destination range
}
I simulated a data frame by setting values in range "D4:F6" of the spreadsheet to:
col1 col2 col3
1 2 txt1
7 3 txt2
then ran
testDF("D4:F6","H10:J12")
in the BERT console. The data frame contents then appear in range "H10:J12".

Subset large .csv file at reading in R

I have a very large .csv file (~4 GB) that I'd like to read and then subset.
The problem comes at reading time (memory allocation error): because the file is so large, reading it crashes, so what I'd like is a way to subset the file before or while reading it, so that I only get the rows for one city (Cambridge).
The file f looks like this:
id City Value
1 London 17
2 Coventry 21
3 Cambridge 14
......
I've already tried the usual approaches:
f <- read.csv(f, stringsAsFactors=FALSE, header=T, nrows=100)
f.colclass <- sapply(f,class)
f <- read.csv(f,sep = ",",nrows = 3000000, stringsAsFactors=FALSE,
header=T,colClasses=f.colclass)
which seems to work for up to 1-2M rows, but not for the whole file.
I've also tried subsetting while reading, using pipe():
f<- read.table(file = f,sep = ",",colClasses=f.colclass,stringsAsFactors = F,pipe('grep "Cambridge" f ') )
and this also seems to crash.
I thought the sqldf or data.table packages would have something, but no success yet!
Thanks in advance, p.
I think this was alluded to already, but just in case it wasn't completely clear: the sqldf package creates a temporary SQLite DB on your machine from the csv file and lets you write SQL queries to subset the data before saving the result to a data.frame.
library(sqldf)
query_string <- "select * from file where City=='Cambridge' "
f <- read.csv.sql(file = "f.csv", sql = query_string)
#or rather than saving all of the raw data in f, you may want to perform a sum
f_sum <- read.csv.sql(file = "f.csv",
sql = "select sum(Value) from file where City=='Cambridge' " )
One workaround for this type of error is to convert your csv file to an Excel file first.
Then you can load the Excel file into a MySQL table; using Toad for MySQL makes that easy, just check the data types of the variables.
You can then access such a large dataset from R via the RODBC package.
I work with datasets of more than 20 GB this way.
Although there's nothing wrong with the existing answers, they miss the most conventional/common way of dealing with this: chunks (here's an example from one of the multitude of similar questions/answers).
The only difference is that, unlike most of the answers that load the whole file, you read it chunk by chunk and keep only the subset you need at each iteration.
# open connection to file (mostly convenience)
file_location = "C:/users/[insert here]/..."
file_name = 'name_of_file_i_wish_to_read.csv'
con <- file(paste(file_location, file_name,sep='/'), "r")
# set chunk size - basically want to make sure its small enough that
# your RAM can handle it
chunk_size = 1000 # the larger the chunk the more RAM it'll take but the faster it'll go
i = 0 # set i to 0 as it'll increase as we loop through the chunks
# loop through the chunks and select rows that contain cambridge
repeat {
  # things to do only on the first read-through
  if (i == 0) {
    # read in columns only on the first go
    grab_header = TRUE
    # load the chunk
    tmp_chunk = read.csv(con, nrows = chunk_size, header = grab_header)
    # subset only to desired criteria
    cond = tmp_chunk[, 'City'] == "Cambridge"
    # initiate container for desired data
    df = tmp_chunk[cond, ]   # save desired subset in initial container
    cols = colnames(df)      # save column names to re-use on next chunks
  }
  # things to do on all subsequent non-first chunks
  else if (i > 0) {
    grab_header = FALSE
    tmp_chunk = read.csv(con, nrows = chunk_size, header = grab_header, col.names = cols)
    # set stopping criteria for the loop
    # when it reads in 0 rows, exit loop
    if (nrow(tmp_chunk) == 0) { break }
    # subset only to desired criteria
    cond = tmp_chunk[, 'City'] == "Cambridge"
    # append to existing dataframe
    df = rbind(df, tmp_chunk[cond, ])
  }
  # add 1 to i to avoid the things needed to do on the first read-in
  i = i + 1
}
close(con) # close connection
# check out the results
head(df)
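The same chunked idea is also packaged up in readr, in case the manual loop feels heavy; a sketch assuming the file is f.csv:

library(readr)

# read the file in chunks, keeping only the Cambridge rows from each chunk;
# the per-chunk results are row-bound into a single data frame
df <- read_csv_chunked(
  "f.csv",
  callback   = DataFrameCallback$new(function(chunk, pos) {
    chunk[chunk$City == "Cambridge", ]
  }),
  chunk_size = 100000
)
head(df)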

Shiny + SQLite - why is Shiny extremely slow?

We have been developing a Shiny app for a few months now, but it is extremely slow when it tries to load a huge amount of data. We even use reactive expressions to reuse the data, but it is still as slow as before when we request different sets of data.
We have a log file and it shows that Shiny takes at least 30.12672 seconds or 52.24799 seconds each time to load the data from our database.
What makes Shiny so slow? Is it the server or the database? What can we do to speed it up?
We are using an SQLite database. Is that what makes Shiny slow?
If so, what other type of database system should we go for to process huge data sets? Cassandra? HBase? Apache Spark?
EDIT:
For instance,
query <- "SELECT
s.timestamp,
s.particle_concentration as `PM2.5`,
n.code as site
FROM speckdata AS s
LEFT JOIN nodes AS n
ON n.nid = s.nid
AND n.datatype = 'speck'
WHERE strftime('%Y', s.localdate) = 'YEAR'
"
# Match the pattern and replace it.
dataQuery <- sub("YEAR", as.character(year), query)
# Store the result in data.
data = dbGetQuery(DB, dataQuery)
if(nrow(data) > 0) {
# Convert timestamp to date and bind it to the data.
data$date <- as.POSIXct(as.numeric(as.character(data$timestamp)), origin="1970-01-01", tz="GMT")
}
# timePlot() from the openair package; options chosen to group the data in one panel.
timePlot(
data,
pollutant = c(species, condition),
avg.time = avg_time,
lwd = 2,
lty = 1,
name.pol = c(species_text_value, condition_text_value),
type = "site",
group = TRUE,
auto.text = FALSE
)
That is extremely slow in Shiny.
But when we query the data set using the SQLite manager, it only takes 1.9 seconds for 4719282 rows!
I would suggest testing the performance of your SQLite query directly off the database. If that's your slow point you will want to optimize your query to make it more efficient. Before I can help further it would be good to know exactly where the performance issues are.
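As a first step, a sketch of timing the database call on its own, outside Shiny, using the same DB connection and dataQuery string as in the question (names assumed from the code above):

# time just the database step to see where the 30-50 seconds go
db_time <- system.time(
  data <- dbGetQuery(DB, dataQuery)
)
print(db_time)  # if "elapsed" is already ~30 s here, the bottleneck is the query, not Shiny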
