We have been developing a Shiny app for a few months now, but it is extremely slow when it loads a large amount of data. We even use a reactive expression to reuse the data, yet it is just as slow as before whenever we request a different set of data.
We have a log file and it shows that Shiny takes at least 30.12672 seconds or 52.24799 seconds each time to load the data from our database.
What makes Shiny so slow? Is it the server or the database? What can we do to speed it up?
We are using an SQLite database. Is that the reason Shiny is slow?
If so, what other database systems should we consider for processing huge data sets? Cassandra? HBase? Apache Spark?
EDIT:
For instance,
query <- "SELECT
s.timestamp,
s.particle_concentration as `PM2.5`,
n.code as site
FROM speckdata AS s
LEFT JOIN nodes AS n
ON n.nid = s.nid
AND n.datatype = 'speck'
WHERE strftime('%Y', s.localdate) = 'YEAR'
"
# Match the pattern and replace it.
dataQuery <- sub("YEAR", as.character(year), query)
# Store the result in data.
data = dbGetQuery(DB, dataQuery)
if(nrow(data) > 0) {
# Convert timestamp to date and bind it to the data.
data$date <- as.POSIXct(as.numeric(as.character(data$timestamp)), origin="1970-01-01", tz="GMT")
}
# Chosen to group the data in one panel.
timePlot(
data,
pollutant = c(species, condition),
avg.time = avg_time,
lwd = 2,
lty = 1,
name.pol = c(species_text_value, condition_text_value),
type = "site",
group = TRUE,
auto.text = FALSE
)
That is extremely slow in Shiny.
But when we query the data set using the SQLite manager, it only takes 1.9 seconds for 4719282 rows!
I would suggest testing the performance of your SQLite query directly off the database. If that's your slow point you will want to optimize your query to make it more efficient. Before I can help further it would be good to know exactly where the performance issues are.
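One way to find the slow point is to time each stage separately with system.time(). The sketch below is only an illustration: it uses a small in-memory SQLite table as a hypothetical stand-in for the real speckdata table (the join to nodes is left out), and it assumes the DBI and RSQLite packages are installed. The same pattern can be wrapped around the timePlot() call.

```r
library(DBI)

# Hypothetical stand-in for the real database: a small in-memory SQLite table.
DB <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(DB, "speckdata", data.frame(
  timestamp = as.numeric(as.POSIXct("2015-01-01", tz = "GMT")) + 0:999 * 60,
  particle_concentration = runif(1000),
  localdate = "2015-01-01",
  stringsAsFactors = FALSE
))

# Stage 1: time the query on its own.
t_query <- system.time(
  data <- dbGetQuery(
    DB,
    "SELECT timestamp, particle_concentration
     FROM speckdata
     WHERE strftime('%Y', localdate) = '2015'"
  )
)

# Stage 2: time the timestamp conversion on its own.
t_convert <- system.time(
  data$date <- as.POSIXct(data$timestamp, origin = "1970-01-01", tz = "GMT")
)

# Elapsed seconds per stage; whichever dominates is the place to optimize.
timings <- c(query = t_query[["elapsed"]], convert = t_convert[["elapsed"]])
print(timings)

dbDisconnect(DB)
```

If the query stage dominates, note that wrapping s.localdate in strftime() generally prevents SQLite from using a plain index on that column, so a range condition on the raw column may be worth testing.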
I am connecting to a SQL Server database through the ODBC connection in R. I have two potential methods to get data, and am trying to determine which would be more efficient. The data is needed for a Shiny dashboard, so the data needs to be pulled while the app is loading rather than querying on the fly as the user is using the app.
Method 1 is to use over 20 stored procedures to query all of the needed data and store them for use. Method 2 is to query all of the tables individually.
Here is the method I used to query one of the stored procedures:
get_proc_data <- function(proc_name, url, start_date, end_date){
dbGetQuery(con, paste0(
"EXEC dbo.", proc_name, " ",
"@URL = N'", url, "', ",
"@Startdate = '", start_date, "', ",
"@enddate = '", end_date, "' "
))
}
data <- get_proc_data(proc_name, url, today() %m-% years(5), today())
However, each of the stored procedures has a slightly different setup for the parameters, so I would have to define each of them separately.
I have started to implement Method 2, but have run into issues with iteratively querying each table.
# use dplyr create list of table names
db_tables <- dbGetQuery(con, "SELECT * FROM [database_name].INFORMATION_SCHEMA.TABLES;") %>% select(TABLE_NAME)
# use dplyr pull to create list
table_list <- pull(db_tables , TABLE_NAME)
# get a quick look at the first few rows
tbl(con, "[TableName]") %>% head() %>% glimpse()
# iterate through all table names, get the first five rows, and export to .csv
for (table in table_list){
write.csv(
tbl(con, table) %>% head(), str_glue("{getwd()}/00_exports/tables/{table}.csv")
)
}
selected_tables <- db_tables %>% filter(TABLE_NAME %in% c("TableName1","TableName2"))
Ultimately this method was just to test how long it would take to iterate through the ~60 tables and perform the required function. I have tried putting this into a function instead but have not been able to get it to iterate through while also pulling the name of the table.
Pro/Con for Method 1: The stored procs are currently powering a metrics plug-in that was written in C++ and is displaying metrics on the webpage. This is for internal use to monitor website performance. However, the stored procedures are not all visible to me and the client needs me to extend their current metrics. I also do not have a DBA at my disposal to help with the SQL Server side, and the person who wrote the procs is unavailable. The procs are also using different logic than each other, so joining the results of two different procs gives drastically different values. For example, depending on the proc, each date will list total page views for each day or already be aggregated at the weekly or monthly scale then listed repeatedly. So joining and grouping causes drastic errors in actual page views.
Pro/Con for Method 2: I am familiar with dplyr and would be able to join the tables together to pull the data I need. However, I am not as familiar with SQL and there is no Entity-Relationship Diagram (ERD) of any sort to refer to. Otherwise, I would build each query individually.
Either way, I am trying to come up with a way to proceed with either a named function, lambda function, or vectorized method for iterating. It would be great to name each variable and assign them appropriately so that I can perform the data wrangling with dplyr.
Any help would be appreciated; I am overwhelmed by which direction to go. I researched the equivalent of Python list comprehensions in R but have not been able to get the function in R to perform similarly.
> db_table_head_to_csv <- function(table) {
+ write.csv(
+ tbl(con, table) %>% head(), str_glue("{getwd()}/00_exports/bibliometrics_tables/{table}.csv")
+ )
+ }
>
> bibliometrics_tables %>% db_table_head_to_csv()
Error in UseMethod("as.sql") :
no applicable method for 'as.sql' applied to an object of class "data.frame"
Consider storing all table data in a named list (counterpart to Python dictionary) using lapply (counterpart to Python's list/dict comprehension). And if you use its sibling, sapply, the character vector passed in will return as names of elements:
# RETURN VECTOR OF TABLE NAMES
db_tables <- dbGetQuery(
con, "SELECT [TABLE_NAME] FROM [database_name].INFORMATION_SCHEMA.TABLES"
)$TABLE_NAME
# RETURN NAMED LIST OF DATA FRAMES FOR EACH DB TABLE
df_list <- sapply(db_tables, function(t) dbReadTable(con, t), simplify = FALSE)
You can extend the lambda function for multiple steps like write.csv or use a defined method. Just be sure to return a data frame as the last line. The example below uses the native pipe, |>, available in base R 4.1.0+:
db_table_head_to_csv <- function(table) {
head_df <- dbReadTable(con, table) |> head()
write.csv(
head_df,
file.path(
"00_exports", "bibliometrics_tables", paste0(table, ".csv")
)
)
return(head_df)
}
df_list <- sapply(db_tables, db_table_head_to_csv, simplify = FALSE)
You lose no functionality of a data frame object when it is stored in a list, and you can extract it by name with $ or [[:
# EXTRACT SPECIFIC ELEMENT
head(df_list$table_1)
tail(df_list[["table_2"]])
summary(df_list$`table_3`)
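To see the naming behaviour of sapply(..., simplify = FALSE) without a database connection, here is a minimal sketch. The fake_read() function is a hypothetical stand-in for dbReadTable(con, t), and the table names are made up for illustration.

```r
# Character vector of (made-up) table names
table_names <- c("table_1", "table_2")

# Hypothetical stand-in for dbReadTable(con, t): returns a small data frame
fake_read <- function(t) {
  data.frame(source = t, n_chars = nchar(t), stringsAsFactors = FALSE)
}

# simplify = FALSE keeps a list; the input strings become the element names
df_list <- sapply(table_names, fake_read, simplify = FALSE)

names(df_list)
str(df_list$table_1)
```

Because the list is named, each data frame can later be picked out by its table name and passed to dplyr as usual.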
I have an R Markdown file in which I establish a database connection, query data, and store the data in a csv file. The query is based on a specific date range. How can I automatically make multiple queries, so that, for example, each week is queried from the database one after another? I cannot make a single query for the whole year, because I need to store the data separately for each week. I could make a data frame with two columns, for the start and end dates, that I would like to use for the query.
But how can I automatically run the query multiple times based on that date data frame?
My code so far:
#load libraries
drv <- PostgreSQL()
db_con <- dbConnect(drv, host=my_host, user=my_user, dbname=my_name, port=my_port, password=my_password)
start = "2015-01-01"
end = "2015-01-02"
result <- dbGetQuery(
db_con,
"SELECT * FROM table WHERE date >= start AND date <= end;")
st_write(result, pathname)
Consider parameterization using DBI::sqlInterpolate with a Map (wrapper to mapply) iteration:
db_con <- dbConnect(
PostgreSQL(), host=my_host, user=my_user, dbname=my_name,
port=my_port, password=my_password
)
# ALL WEEKLY DATES IN 2015
dates_df <- data.frame(
start=seq.Date(as.Date("2015-01-01"), as.Date("2016-01-01"), by="week"),
end=seq.Date(as.Date("2015-01-08"), as.Date("2016-01-08"), by="week")
)
# USER DEFINED METHOD TO QUERY AND WRITE DATA
query_db <- function(s, e) {
# PREPARED STATEMENT WITH PLACEHOLDERS
sql <- "SELECT * FROM table WHERE date >= ?start AND date <= ?end;"
# BIND PARAMETERS AND QUERY
stmt <- DBI::sqlInterpolate(db_con, sql, start=s, end=e)
result <- dbGetQuery(db_con, stmt)
# WRITE DATA TO DISK
st_write(result, pathname)
# RETURN QUERY RESULTSET
return(result)
}
# WRITE AND STORE DATA IN MEMORY
df_list <- Map(query_db, dates_df$start, dates_df$end)
I suggest DBI::dbBind and a frame (similar to @Parfait's answer).
For demonstration, I have a "sessions" table on my pg instance that has a field ScheduledStart. In this case, it's a TIMESTAMPTZ column, not a Date column, so I need to take one more step in my demo (convert from R's Date to POSIXt classes).
# pg <- DBI::dbConnect(...)
ranges <- data.frame(
start = seq(as.Date("2020-03-01"), length.out = 4, by = "week")
)
ranges$end <- ranges$start + 6
# this line is only necessary because of my local table
ranges[] <- lapply(ranges, as.POSIXct)
Here is the bulk of the "query multiple weeks":
res <- DBI::dbSendQuery(pg, "select count(*) as n from Sessions where ScheduledStart between ? and ?")
DBI::dbBind(res, ranges)
out <- DBI::dbFetch(res)
DBI::dbClearResult(res)
out
# n
# 1 8
# 2 1
# 3 0
# 4 0
While sqlInterpolate is much better than forming your own query strings (e.g., with sprintf or paste), using dbBind allows for internal iteration like above, and allows the DBMS to optimize the query with the binding parameter ? instead of actual data. (Using sqlInterpolate, the DBMS would see four different queries. Using dbBind, it sees one query, optimizes it, and uses it four times.)
That query was really boring (select * ... works, too), but I think it gets the point across. The only downside of this method is that while it makes it really easy to get all of the data, there is nothing here that inherently tells you which of your queries a particular row came from. I suspect that you can determine that from your data, and that your main reason for breaking it down by week is to limit the amount retrieved per query.
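For anyone who wants to try the dbBind pattern without a PostgreSQL instance, here is a minimal sketch against an in-memory SQLite database. It assumes the DBI and RSQLite packages are installed; the sessions table, its scheduled column, and the dates are made up for illustration. Because count(*) returns exactly one row per bound parameter set, the week start can be attached afterwards; that labelling trick would not carry over to a select * query.

```r
library(DBI)

# Hypothetical demo data in an in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "sessions", data.frame(
  scheduled = c("2020-03-02", "2020-03-03", "2020-03-10", "2020-03-25"),
  stringsAsFactors = FALSE
))

# One row per week to query (ISO date strings compare correctly as text)
ranges <- data.frame(
  start = c("2020-03-01", "2020-03-08", "2020-03-15"),
  end   = c("2020-03-07", "2020-03-14", "2020-03-21"),
  stringsAsFactors = FALSE
)

res <- dbSendQuery(
  con,
  "SELECT count(*) AS n FROM sessions WHERE scheduled BETWEEN ? AND ?"
)
# Positional placeholders: pass an unnamed list with one vector per "?"
dbBind(res, unname(as.list(ranges)))
out <- dbFetch(res)
dbClearResult(res)

# One result row per parameter set, so the week can be attached by position
out$week_start <- ranges$start
print(out)

dbDisconnect(con)
```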
Side note: I often use code like this in functions, where it is feasible that something between dbSendQuery and dbClearResult might interrupt operation. In that case, I tend to reorder my code a little, like this:
somefunc <- function(...) {
# ...
res <- DBI::dbSendQuery(pg, "select count(*) as n from Sessions where ScheduledStart between ? and ?")
on.exit({
suppressWarnings(DBI::dbClearResult(res))
}, add = TRUE)
DBI::dbBind(res, ranges)
out <- DBI::dbFetch(res)
return(out)
}
I am using the package qualtRics in TERR in Spotfire to pull in data directly from specific surveys in Qualtrics. The code I am using is:
registerApiKey(API.TOKEN = "xxxx")
df <- getSurvey(surveyID = "xxxx",
root_url = "https://az1.qualtrics.com", verbose = TRUE)
My output df is a data table. I have 2 different surveys that I am pulling in 4 different times, 2 of those times I am unpivoting data, for a total of 4 data tables.
I want to be able to refresh this data. If I click Reload Data or try to refresh each table individually, nothing happens. I'm assuming I need to add some code that refreshes the data function (?), and I am trying to avoid replacing the data tables each time because, for 2 of those, I have to manually select which columns I am unpivoting (and I have 75+ columns).
Is there a way I can accomplish what I'm looking for? I am a beginner Spotfire/R user, so I am learning as I go!
I am not able to comment on your question because I don't have enough permission, so I am posting this as a separate answer.
Replacing the table each time is a good idea.
That way you can fix the number of columns used for pivoting/unpivoting.
------R Code
row <- data.frame(Data_Points = nrows,
Col1 = col1, Col2 = col2, YStart = y1, YEnd = y2)
row <- cbind(df, row)
return(row)
You can also list your fixed columns in a DocumentProperty and loop over them in your data function.
Instead of using Spotfire's pivot/unpivot, you can try doing the unpivot within the R code of the data function.
I am trying to query my MongoDB database from R.
I think I lost part of the data in the process.
Does R have any limit, and how can I ensure all my records are loaded into R?
Code:
# inspect number of records in mongodb
db.complaints.count()
>395 853
# write a query to load data into R
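library(rmongodb)  # mongo.find(), mongo.cursor.next(), mongo.bson.to.list()
library(plyr)      # rbind.fill() comes from plyr, not dplyr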
complaints = data.frame(stringsAsFactors = FALSE)
db = "customers.complaints"
cursor = mongo.find(mongo, db)
i = 1
while (mongo.cursor.next(cursor))
{
tmp = mongo.bson.to.list(mongo.cursor.value(cursor))
tmp.df = as.data.frame(t(unlist(tmp)), stringsAsFactors=F)
complaints = rbind.fill(complaints, tmp.df)
}
I get [1] 47077 15 after checking the loading in R with dim(complaints).
How can I make sure I get all the records from my collection in R?
http://www.analyticbridge.com/profiles/blogs/time-issue-in-creating-a-huge-data-frame-from-mongodb-collection
The code at the above link, which uses environment variables, might help you! Please comment here if you find a solution.
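Separately from the linked approach, a common way to keep such a loop fast is to collect each record in a preallocated list and bind everything once at the end, rather than growing the data frame inside the loop. The sketch below is a different technique in the same spirit, not the code from the link: the records list is a hypothetical stand-in for what mongo.bson.to.list() would return per document, and it does not by itself explain the missing rows.

```r
# Hypothetical stand-in for documents already converted with mongo.bson.to.list()
records <- list(
  list(id = 1, product = "A"),
  list(id = 2, product = "B", note = "late")
)

# Preallocate one slot per record instead of growing a data frame
rows <- vector("list", length(records))
for (i in seq_along(records)) {
  rows[[i]] <- as.data.frame(t(unlist(records[[i]])), stringsAsFactors = FALSE)
}

# Pad missing fields with NA so every row has the same columns
all_cols <- unique(unlist(lapply(rows, names)))
pad <- function(df) {
  for (col in setdiff(all_cols, names(df))) df[[col]] <- NA
  df[all_cols]
}

# Bind once at the end
complaints <- do.call(rbind, lapply(rows, pad))
dim(complaints)
```

Comparing nrow(complaints) with the count reported by the database after the loop is a quick check that no documents were dropped along the way.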
I am using MongoDB to do tick data analysis in R. Initially I used MySQL, which worked fine, but I wanted to test MongoDB for this purpose. The data set contains about 200 million entries at the moment. Using RODBC I could get the query result into a data.frame very quickly using sqlQuery(conn, "select * from td where prd = 'TY' and date = '2012-01-03'")
In MongoDB I have documents like Document{{_id=5537ca647a3ad42a84374f0a, prd=TY, time=1325661600043, px=130.6875, sz=11}}
In Java I can retrieve a day's worth of tick data (roughly 100,000 entries), create Tick objects, and add them to an array, all in less than 2 seconds.
Using rmongodb, the below takes forever. Any ideas how to improve this?
query <- mongo.bson.from.list( list(product = "TY", date = as.POSIXct("2012-01-04")) )
res.cursor <- mongo.find(mongo, db.coll, query, limit = 100e3, options=mongo.find.exhaust)
resdf <- mongo.cursor.to.data.frame(res.cursor)
Using find.all is equally slow.