I am loading a dataset called df from a 10 GB table, and the dbGetQuery call is taking 6-8 minutes. Any suggestions to speed up this data loading script?
library(DBI)
library(RMySQL)

conn <- dbConnect(MySQL(),
                  host = DB$host,
                  port = DB$port,
                  user = DB$user,
                  password = DB$password,
                  dbname = "device_onboard")
df <- dbGetQuery(conn, "select * from device_onboard.onboard")
I would like to dump my whole dataframe from R to SQL Server, preferably using RODBC with a sqlSave statement (not sqlQuery). Here is my sample code.
library(RODBC)
myconn <- odbcDriverConnect("some connection string")
mydf <- data.frame(col_1 = c(1,2,3), col_2 = c(2,3,4))
sqlSave(myconn, mydf, tablename = '[some_db].[some_schema].[my_table]', append = F, rownames = F, verbose=TRUE)
odbcClose(myconn)
After I execute it, I get back this error message:
Error in sqlColumns(channel, tablename) :
‘my_table’: table not found on channel
When I check in SQL Server, an empty table is present.
If I run the same code again, I get this error message:
Error in sqlSave(myconn, mydf, tablename = "[some_db].[some_schema].[my_table]", :
42S01 2714 [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]There is already an object named 'my_table' in the database.
[RODBC] ERROR: Could not SQLExecDirect 'CREATE TABLE [some_db].[some_schema].[my_table] ("col_1" float, "col_2" float)'
Any suggestions on how to troubleshoot?
UPDATE
In SSMS I can run the following commands successfully:
CREATE TABLE [some_db].[some_schema].[my_table] (
test int
);
drop table [some_db].[some_schema].[my_table]
Here are the details of the connection string:
Driver=ODBC Driver 17 for SQL Server; Server=someserveraddress; Uid=user_login; Pwd=some_password
To avoid the error, you could specify the database in the connection string:
Driver=ODBC Driver 17 for SQL Server; Server=someserveraddress; database=some_db; Uid=user_login; Pwd=some_password
and avoid using brackets:
sqlSave(myconn, mydf, tablename = 'some_schema.my_table', append = F, rownames = F, verbose=TRUE)
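Putting the two changes together, the full call might look like this (a minimal sketch reusing the placeholder credentials from the question):
library(RODBC)
# the connection string now names the target database, so sqlSave only needs schema.table
myconn <- odbcDriverConnect("Driver=ODBC Driver 17 for SQL Server; Server=someserveraddress; database=some_db; Uid=user_login; Pwd=some_password")
mydf <- data.frame(col_1 = c(1,2,3), col_2 = c(2,3,4))
sqlSave(myconn, mydf, tablename = 'some_schema.my_table', append = FALSE, rownames = FALSE, verbose = TRUE)
odbcClose(myconn)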
I am trying to do some data exploration for this dataset I have. The table I want to import has 11 million rows. Here are the script and the output:
#Loading the BigQuery client package that provides query_exec
library(bigrquery)

#Creating a variable for our BQ project space
project_id = 'project space'
#Query
Step1 <-
"
insertquery
"
#Executing the query from the variable above
Step1_df <- query_exec(Step1, project = project_id, use_legacy_sql = FALSE, max_pages = Inf, page_size = 99000)
Error:
Error in curl::curl_fetch_memory(url, handle = handle) :
Operation was aborted by an application callback
Is there a different BigQuery library I can use? I am also looking to speed up the load time.
When I run the insert statement with the odbc driver, everything is fine.
drv <- odbc::odbc()
conn <- DBI::dbConnect(drv, trusted_connection = T, dsn = "mydsn", uid = "myuid", pwd = "mypwd")
DBI::dbSendQuery(conn, "INSERT INTO \"dbo\".\"testjdbc\" (d) values('4')")
When I run the select statement with JDBC, everything is fine too:
drv <- RJDBC::JDBC(driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver", classPath = "C:\\mssql-jdbc-7.0.0.jre8.jar")
conn <- DBI::dbConnect(drv, trusted_connection = T, url = "jdbc:sqlserver://myserver\\\\myinstance:1111;databaseName=mydatabasename", user="myuid", password="mypwd")
DBI::dbGetQuery(conn, "Select * from dbo.mytable")
and the JDBC connection is valid:
drv <- RJDBC::JDBC(driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver", classPath = "C:\\mssql-jdbc-7.0.0.jre8.jar")
conn <- DBI::dbConnect(drv, trusted_connection = T, url = "jdbc:sqlserver://myserver\\\\myinstance:1111;databaseName=mydatabasename", user="myuid", password="mypwd")
DBI::dbIsValid(conn) # TRUE
But when I try the insert statement (the same as the first one) with the JDBC driver, like this:
drv <- RJDBC::JDBC(driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver", classPath = "C:\\mssql-jdbc-7.0.0.jre8.jar")
conn <- DBI::dbConnect(drv, trusted_connection = T, url = "jdbc:sqlserver://myserver\\\\myinstance:1111;databaseName=mydatabasename", user="myuid", password="mypwd")
DBI::dbSendQuery(conn, "INSERT INTO \"dbo\".\"testjdbc\" (d) values('4')")
then I get the error:
Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
Unable to retrieve JDBC result set for INSERT INTO "dbo"."testjdbc" (d) values('4') (The statement did not return a result set.)
So a JDBC select is OK, but inserts, updates, and deletes give errors, while with odbc I can do everything.
The solution is to run inserts not with DBI::dbSendQuery but with RJDBC::dbSendUpdate.
Thank you @Mark Rotteveel for your answer. Thanks to you I have found the solution.
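For reference, the working insert then looks like this (a minimal sketch reusing the drv and conn objects defined above; RJDBC::dbSendUpdate is meant for statements that do not return a result set):
# send the insert as an update statement instead of expecting a result set back
RJDBC::dbSendUpdate(conn, "INSERT INTO \"dbo\".\"testjdbc\" (d) values('4')")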
Question: How do I pass a variable in the RPostgreSQL query?
Example: In the example below I try to pass the date '2018-01-03' to the query
library(RPostgreSQL)
dt <- '2018-01-03'
connect <- dbConnect(PostgreSQL(),
dbname="test",
host="localhost",
port=5432,
user="user",
password="...")
result <- dbGetQuery(connect,
"SELECT * FROM sales_tbl WHERE date = #{dt}")
You can use paste0 to generate your query and pass it to dbGetQuery:
library(RPostgreSQL)
dt <- '2018-01-03'
connect <- dbConnect(PostgreSQL(),
dbname="test",
host="localhost",
port=5432,
user="user",
password="...")
query <- paste0("SELECT * FROM sales_tbl WHERE date='", dt, "'")
result <- dbGetQuery(connect, query)
The safest way is to parameterize the query as mentioned here
Example:
library(RPostgreSQL)
dt <- '2018-01-03'
connect <- dbConnect(drv = PostgreSQL(),
dbname ="test",
host = "localhost",
port = 5432,
user = "user",
password = "...")
query <- "SELECT * FROM sales_tbl WHERE date= ?"
sanitized_query <- dbSendQuery(connect, query)
dbBind(sanitized_query, list(dt))
result <- dbFetch(sanitized_query)
Here, by passing ?, you are parameterizing your query to avoid SQL injection attacks.
Another thing I like to do is to create a .Renviron file to store my credentials. For example, for the connection above, the .Renviron file will look like this:
dbname = test
dbuser = me
dbpass = mypass
dbport = 5432
dbhost = localhost
Save the file and restart RStudio (to load the .Renviron file at startup). Then access the credentials using Sys.getenv():
#example:
connect <- dbConnect(drv = PostgreSQL(),
dbname = Sys.getenv("dbname"),
host = Sys.getenv("dbhost"),
port = Sys.getenv("dbport"),
user = Sys.getenv("dbuser"),
password = Sys.getenv("dbpass"))
I have this simple R program that reads a table (1,000,000 rows, 10 columns) from a SQLite database into an R data.table, and then I do some operations on the data and try to write it back into a new table of the same SQLite database. Reading the data takes a few seconds, but writing the table back into the SQLite database takes hours. I don't know exactly how long because it has never finished; the longest I have tried is 8 hours.
This is the simplified version of the program:
library(DBI)
library(RSQLite)
library(data.table)
driver = dbDriver("SQLite")
con = dbConnect(driver, dbname = "C:/Database/DB.db")
DB <- data.table(dbGetQuery(con, "SELECT * from Table1"))
dbSendQuery(con, "DROP TABLE IF EXISTS Table2")
dbWriteTable(con, "Table2", DB)
dbDisconnect(con)
dbUnloadDriver(driver)
I'm using R version 2.15.2; the package versions are:
data.table_1.8.6 RSQLite_0.11.2 DBI_0.2-5
I have tried on multiple systems and on different Windows versions, and in all cases it takes an incredible amount of time to write this table into the SQLite database. Watching the file size of the SQLite database, it grows at about 50 KB per minute.
My question is: does anybody know what causes this slow write speed?
Tim had the answer but I can't flag it as such because it is in the comments.
As in:
ideas to avoid hitting memory limit when using dbWriteTable to save an R data table inside a SQLite database
I wrote the data to the database in chunks:
chunks <- 100

# split the row indices into roughly equal chunks
starts.stops <- floor( seq( 1 , nrow( DB ) , length.out = chunks ) )

system.time({
  for ( i in 2:( length( starts.stops ) ) ){
    # the first chunk keeps its starting row; later chunks start one past the previous stop
    if ( i == 2 ){
      rows.to.add <- ( starts.stops[ i - 1 ] ):( starts.stops[ i ] )
    } else {
      rows.to.add <- ( starts.stops[ i - 1 ] + 1 ):( starts.stops[ i ] )
    }
    # append each chunk to Table2 instead of writing everything in one call
    dbWriteTable( con , 'Table2' , DB[ rows.to.add , ] , append = TRUE )
  }
})
It takes:
user system elapsed
4.49 9.90 214.26
time to finish writing the data to the database. Apparently I was hitting the memory limit without knowing it.
Use a single transaction (commit) for all the records. Add a
dbSendQuery(con, "BEGIN")
before the insert and a
dbSendQuery(con, "END")
to complete. Much faster.
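Applied to the code from the question, a minimal sketch (reusing con and DB from above) would be:
dbSendQuery(con, "BEGIN")         # open one explicit transaction
dbWriteTable(con, "Table2", DB)   # write all rows inside that transaction
dbSendQuery(con, "END")           # commit once at the end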