Fetching data in Jupyter Notebook is taking too long - jupyter-notebook

I want to fetch all rows from a table, using the following code:
table_row_count = 1000000
batch_size = 10000
sql = """SELECT t.*
         FROM (
             SELECT ROWNUM AS row_num,
                    sub_t.*
             FROM (
                 SELECT name_employer
                 FROM my_table
                 WHERE section = 'OTHER'
             ) sub_t
         ) t
         WHERE t.row_num BETWEEN :LOWER_BOUND AND :UPPER_BOUND"""
data = []
for lower_bound in range(0, table_row_count, batch_size):
    cursor.execute(sql, {'LOWER_BOUND': lower_bound,
                         'UPPER_BOUND': lower_bound + batch_size - 1})
    for row in cursor.fetchall():
        data.append(row)
The original source of the code: cx_Oracle: fetchall() stops working with big SELECT statements
However, it is taking forever. My table has 5 million rows. Is there any other way to do this?

For big result sets, increase arraysize. Try something like cursor.arraysize = 10000 and then tune the size to suit your data and performance requirements.
Refer to the Tuning cx_Oracle manual.
You may also want to look at the best practices in https://github.com/cjbj/cx-oracle-notebooks
Also see the latest cx_Oracle release announcement - it's time to upgrade to python-oracledb.
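As a rough illustration of the arraysize advice, here is a minimal sketch that runs the query once and pulls rows in arraysize-sized batches instead of re-executing a ROWNUM window per batch. The helper name fetch_all_rows is hypothetical, and the standard-library sqlite3 module is only a stand-in so the sketch runs without an Oracle instance; with cx_Oracle or python-oracledb you would pass your own cursor and use named binds such as :section. The 10000 value is a starting point to tune, not a recommendation for every table.

```python
import sqlite3


def fetch_all_rows(cursor, sql, params=(), arraysize=10000):
    """Run one query and collect the result set in batches of `arraysize` rows."""
    # Set arraysize before execute(); for Oracle drivers this controls how many
    # rows are fetched per round trip to the database.
    cursor.arraysize = arraysize
    cursor.execute(sql, params)
    rows = []
    while True:
        batch = cursor.fetchmany()  # fetchmany() defaults to cursor.arraysize rows
        if not batch:
            break
        rows.extend(batch)
    return rows


# Stand-in database (assumption: sqlite3 instead of Oracle, made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (name_employer TEXT, section TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(f"employer_{i}", "OTHER" if i % 2 == 0 else "MAIN") for i in range(25000)],
)

cursor = conn.cursor()
data = fetch_all_rows(
    cursor,
    "SELECT name_employer FROM my_table WHERE section = ?",
    ("OTHER",),
)
print(len(data))
```

Because the statement is executed only once, the database does not have to re-evaluate the inner query for every batch, which is usually where the ROWNUM-window loop loses time.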

Related

R: how to query from a database multiple times based on different dates

I have an R Markdown file in which I establish a database connection, query data, and store the data in a CSV file. The query is based on a specific date range. How can I automatically run multiple queries, so that each week, for example, is queried from the database one after another? I cannot run one query for the whole year, because I need to store the data separately for each week. I could build a data frame with two columns for the start and end dates, which I would like to use for the query.
But how can I automatically run the queries multiple times depending on the date data frame?
My code so far:
#load libraries
drv <- PostgreSQL()
db_con <- dbConnect(drv, host=my_host, user=my_user, dbname=my_name, port=my_port, password=my_password)
start = "2015-01-01"
end = "2015-01-02"
result <- dbGetQuery(
  db_con,
  "SELECT * FROM table WHERE date >= start AND date <= end;")
st_write(result, pathname)
Consider parameterization using DBI::sqlInterpolate with a Map (wrapper to mapply) iteration:
db_con <- dbConnect(
  PostgreSQL(), host=my_host, user=my_user, dbname=my_name,
  port=my_port, password=my_password
)
# ALL WEEKLY DATES IN 2015
dates_df <- data.frame(
  start=seq.Date(as.Date("2015-01-01"), as.Date("2016-01-01"), by="week"),
  end=seq.Date(as.Date("2015-01-08"), as.Date("2016-01-08"), by="week")
)
# USER DEFINED METHOD TO QUERY AND WRITE DATA
query_db <- function(s, e) {
  # PREPARED STATEMENT WITH PLACEHOLDERS
  sql <- "SELECT * FROM table WHERE date >= ?start AND date <= ?end;"
  # BIND PARAMETERS AND QUERY
  stmt <- DBI::sqlInterpolate(db_con, sql, start=s, end=e)
  result <- dbGetQuery(db_con, stmt)
  # WRITE DATA TO DISK
  st_write(result, pathname)
  # RETURN QUERY RESULTSET
  return(result)
}
# WRITE AND STORE DATA IN MEMORY
df_list <- Map(query_db, dates_df$start, dates_df$end)
I suggest DBI::dbBind and a frame (similar to @Parfait's answer).
For demonstration, I have a "sessions" table on my pg instance that has a field ScheduledStart. In this case, it's a TIMESTAMPTZ column, not a Date column, so I need to take one more step in my demo (convert from R's Date to POSIXt classes).
# pg <- DBI::dbConnect(...)
ranges <- data.frame(
  start = seq(as.Date("2020-03-01"), length.out = 4, by = "week")
)
ranges$end <- ranges$start + 6
# this line is only necessary because of my local table
ranges[] <- lapply(ranges, as.POSIXct)
Here is the bulk of the "query multiple weeks":
res <- DBI::dbSendQuery(pg, "select count(*) as n from Sessions where ScheduledStart between ? and ?")
DBI::dbBind(res, ranges)
out <- DBI::dbFetch(res)
DBI::dbClearResult(res)
out
#   n
# 1 8
# 2 1
# 3 0
# 4 0
While sqlInterpolate is much better than forming your own query strings (e.g., with sprintf or paste), using dbBind allows for internal iteration like above, and allows the DBMS to optimize the query with the binding parameter ? instead of actual data. (Using sqlInterpolate, the DBMS would see four different queries. Using dbBind, it sees one query, optimizes it, and uses it four times.)
That query was really boring (select * ... works, too), but I think it gets the point across. The only downside of this method is that, while it makes it really easy to get all of the data, nothing here inherently tells you which of your queries a particular row came from. I suspect you can determine that from the data itself, and that your main reason for breaking it down by week is to limit the amount retrieved per query.
Side note: I often use code like this in functions, where it is feasible that something between dbSendQuery and dbClearResult might interrupt operation. In that case, I tend to reorder my code a little, like this:
somefunc <- function(...) {
  # ...
  res <- DBI::dbSendQuery(pg, "select count(*) as n from Sessions where ScheduledStart between ? and ?")
  # make sure the result is cleared even if something below is interrupted
  on.exit({
    suppressWarnings(DBI::dbClearResult(res))
  }, add = TRUE)
  DBI::dbBind(res, ranges)
  out <- DBI::dbFetch(res)
  DBI::dbClearResult(res)
  return(out)
}
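For readers working in Python rather than R, the same idea of one statement text with placeholders, executed once per weekly range, can be sketched as follows. This is only an illustration under stated assumptions: the in-memory sqlite3 database, the sessions table, and its three rows are made up, and the loop runs the statement per range (rather than binding a whole frame at once) so that each result can be labelled with the week it came from, which addresses the downside mentioned above.

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical stand-in for the "Sessions" table (assumption: made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (scheduled_start TEXT)")
conn.executemany(
    "INSERT INTO sessions VALUES (?)",
    [("2020-03-02",), ("2020-03-05",), ("2020-03-10",)],
)

# Four weekly ranges starting 2020-03-01; each end date is start + 6 days.
first = date(2020, 3, 1)
ranges = [
    (first + timedelta(weeks=i), first + timedelta(weeks=i, days=6))
    for i in range(4)
]

# One statement text with placeholders, executed once per range.
sql = "SELECT COUNT(*) FROM sessions WHERE scheduled_start BETWEEN ? AND ?"
counts = {}
for start, end in ranges:
    (n,) = conn.execute(sql, (start.isoformat(), end.isoformat())).fetchone()
    counts[start.isoformat()] = n  # keyed by range start, so the origin is known

print(counts)
```

Keying the results by range start keeps the values bound as parameters, never pasted into the SQL string, while still recording which week produced each result.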

RODBC sqlQuery returns different number of rows every time I run it

I have a simple SQL query that should return 74m rows. I know this because I ran the same query in SSMS using count(*). However, RODBC's sqlQuery in R returns fewer rows, and when I ran the same code many times, the returned data frame contained a different number of rows each time (ranging from 1.1m to 20m). I cannot reproduce the SQL query exactly here due to confidentiality, but conceptually it is similar to the one below.
con <- odbcConnect("db_name", uid="my_user_id", pwd="my_password")
df_2 <- sqlQuery(con, paste("SELECT *
                             FROM table_name
                             WHERE CountryName <> 'US' and Month in (1,2,3)"),
                 stringsAsFactors = FALSE)
Why would the exact same code return different results every time? I tried running this more than 10 times.

Improving Performance with large Data Sizes in RMySQL & R

I need to create a data downloader that may be asked to fetch anywhere from a few thousand to 30-40 million records.
In RMySQL, I am currently fetching a constant 4000 records per round using the code below:
rs = dbSendQuery(con, query)
totRow = 0
resultDF = data.frame()
while (!dbHasCompleted(rs)) {
  chunk <- dbFetch(rs, 4000)
  resultDF <- rbind(resultDF, chunk)
  totRow = totRow + nrow(chunk)
  #Status Print
  if (totRow %% 100000 == 0) {
    print(paste(OutFile, format(totRow, scientific=F)))
  }
}
dbClearResult(rs)
Since I am doing an rbind every time, this method slows down horribly for large data sizes. What would be a better approach? Does RMySQL support returning how many records are present in the result?
I was thinking of performing the fetch with a dynamic size that keeps increasing with every fetch, just like C++ vectors. Are there any better alternatives here?

Shiny + SQLite - why is Shiny extremely slow?

We have been developing a Shiny app for a few months now. But the app is extremely slow when it tries to load a huge amount of data. We even use the reactive function to reuse the data, but it is just as slow as before when we request different sets of data.
We have a log file, and it shows that Shiny takes at least 30.12672 seconds or 52.24799 seconds each time to load the data from our database.
What makes Shiny so slow? Is it the server or the database? What can we do to speed it up?
We are using an SQLite database. Is that what makes Shiny slow?
If so, what other database system should we use to process huge data sets? Cassandra? HBase? Apache Spark?
EDIT:
For instance,
query <- "SELECT
    s.timestamp,
    s.particle_concentration as `PM2.5`,
    n.code as site
FROM speckdata AS s
LEFT JOIN nodes AS n
    ON n.nid = s.nid
    AND n.datatype = 'speck'
WHERE strftime('%Y', s.localdate) = 'YEAR'
"
# Match the pattern and replace it.
dataQuery <- sub("YEAR", as.character(year), query)
# Store the result in data.
data = dbGetQuery(DB, dataQuery)
if (nrow(data) > 0) {
  # Convert timestamp to date and bind it to the data.
  data$date <- as.POSIXct(as.numeric(as.character(data$timestamp)), origin="1970-01-01", tz="GMT")
}
# Chosen to group the data in one panel.
timePlot(
  data,
  pollutant = c(species, condition),
  avg.time = avg_time,
  lwd = 2,
  lty = 1,
  name.pol = c(species_text_value, condition_text_value),
  type = "site",
  group = TRUE,
  auto.text = FALSE
)
That is extremely slow in Shiny.
But when we query the data set using the SQLite manager, it only takes 1.9 seconds for 4719282 rows!
I would suggest testing the performance of your SQLite query directly off the database. If that's your slow point you will want to optimize your query to make it more efficient. Before I can help further it would be good to know exactly where the performance issues are.

rmongodb is very slow in creating data.frame

I am using MongoDB to do tick data analysis in R. Initially I used MySQL, which worked fine, but I wanted to test MongoDB for this purpose. The data set contains about 200 million entries at the moment. Using RODBC I could get the query result into a data.frame very quickly using sqlQuery(conn, "select * from td where prd = 'TY' and date = '2012-01-03'")
In MongoDB I have documents like Document{{_id=5537ca647a3ad42a84374f0a, prd=TY, time=1325661600043, px=130.6875, sz=11}}
In Java I can retrieve a day's worth of tick data (roughly 100,000 entries), create Tick objects, and add them to an array, all in less than 2 seconds.
Using rmongodb, the code below takes forever. Any ideas on how to improve this?
query <- mongo.bson.from.list( list(product = "TY", date = as.POSIXct("2012-01-04")) )
res.cursor <- mongo.find(mongo, db.coll, query, limit = 100e3, options=mongo.find.exhaust)
resdf <- mongo.cursor.to.data.frame(res.cursor)
Using find.all is equally slow.
