Hi, I would like to know: what is the best way to copy large data into an SQL database using dplyr?
copy_to fails because the data is larger than memory. I am using this code:
library(DBI)
library(dplyr)

db <- dbConnect(
  RPostgreSQL::PostgreSQL(),
  dbname = "postgres",
  host = "localhost",
  user = "user",
  password = "password"
)

Data <- rio::import("Data/data.feather")
Data <- copy_to(db, Data, temporary = FALSE)
This results in the following error:
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: out of memory
DETAIL: Cannot enlarge string buffer containing 0 bytes by 1305608043 more bytes.
)
Is there a way to do this without having to import the data into memory first?
Is there a way to do this through sparklyr, since copy_to is using only one core?
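One workaround that avoids building a single enormous INSERT statement is to skip copy_to() and append the data to the target table in chunks with DBI::dbWriteTable(). This is only a sketch under the assumptions that Data fits in memory once imported (as in the code above) and that db is the same connection; the chunk size and the target table name "data" are arbitrary placeholders:
chunk_size <- 100000
n <- nrow(Data)
starts <- seq(1, n, by = chunk_size)
for (i in seq_along(starts)) {
  s <- starts[i]
  e <- min(s + chunk_size - 1, n)
  chunk <- Data[s:e, , drop = FALSE]
  if (i == 1) {
    # the first chunk creates (or replaces) the table
    dbWriteTable(db, "data", chunk, overwrite = TRUE, row.names = FALSE)
  } else {
    # later chunks are appended, so no single statement grows too large
    dbWriteTable(db, "data", chunk, append = TRUE, row.names = FALSE)
  }
}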
I have searched high and low for answers, so apologies if this has already been answered!
Using R, I am trying to perform lazy evaluation against Oracle 11.1 databases. I have used JDBC to establish the connection and I can confirm it works fine. I am also able to query tables using dbGetQuery, although the results are so large that I quickly run out of memory.
I have tried dbplyr/dplyr's tbl(con, "ORACLE_TABLE"), but I get the following error:
Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
Unable to retrieve JDBC result set for SELECT *
FROM "ORACLE_TABLE" AS "zzz39"
WHERE (0 = 1) (ORA-00933: SQL command not properly ended)
I have also tried using db_table <- tbl(con, in_schema('ORACLE_TABLE'))
This is happening with all databases I am connected to, despite being able to perform a regular dbGetQuery.
Full Code:
# Libraries
library(odbc)
library(DBI)
library(config)
library(RJDBC)
library(dplyr)
library(tidyr)
library(magrittr)
library(stringr)
library(xlsx)
library(RSQLite)
library(dbplyr)
# Oracle Connection
db <- config::get('db')
drv1 <- JDBC(driverClass=db$driverClass, classPath=db$classPath)
con_db <- dbConnect(drv1, db$connStr, db$orauser, db$orapw, trusted_connection = TRUE)
# Query (This one works but the data set is too large)
db_data <- dbSendQuery(con_db, "SELECT end_dte, reference, id_number FROM ORACLE_TABLE where end_dte > '01JAN2019'")
# Query (this one won't work)
oracle_table <- tbl(con_db, "ORACLE_TABLE")
Solved:
Updated RStudio and packages.
Follow this guide:
https://www.linkedin.com/pulse/connect-oracle-database-r-rjdbc-tianwei-zhang/
Insert the following code after creating the connection:
sql_translate_env.JDBCConnection <- dbplyr:::sql_translate_env.Oracle
sql_select.JDBCConnection <- dbplyr:::sql_select.Oracle
sql_subquery.JDBCConnection <- dbplyr:::sql_subquery.Oracle
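With those translations defined, a quick sanity check (just a sketch, reusing con_db and ORACLE_TABLE from the code above) that lazy evaluation now works:
oracle_table <- tbl(con_db, "ORACLE_TABLE")
# the SQL is only executed when the result is collected
oracle_table %>%
  head(10) %>%
  collect()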
From RStudio's ODBC database documentation I can see a simple example of how to read a SQL table into an R data frame:
data <- dbReadTable(con, "flights")
Let me paste a graphic of the BGBUref table(?) I'm trying to read into an R data frame. This is from the Connections pane in RStudio.
If I use the same syntax as above, where con is the output of my dbConnect(...), I get the following:
df <- dbReadTable(con, "BGBURef")
#> Error: <SQL> 'SELECT * FROM "BGBURef"' nanodbc/nanodbc.cpp:1587: 42S02:
#> [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Invalid object name
#> 'BGBURef'.
Is my understanding of what a "table" is incorrect? Or do I need to do something like this to get to the nested BGBUref table:
df <- dbReadTable(con, "QnRStore\dbo\BGBURef")
#> Error: '\d' is an unrecognized escape in character string starting ""QnRStore\d"
The BGBUref data frame will come up in RStudio if I click on the little spreadsheet icon. I just can't figure out how to get it into a defined data frame, in my case df.
Here's the output when I run these commands:
df <- dbReadTable(con, "QnRStore")
#> Error: <SQL> 'SELECT * FROM "QnRStore"'
#> nanodbc/nanodbc.cpp:1587: 42S02: [Microsoft][ODBC Driver 17 for SQL
#> Server][SQL Server]Invalid object name 'QnRStore'.
and:
dbListTables(con)
#> [1] "spt_fallback_db"
#> [2] "spt_fallback_dev"
#> [3] "spt_fallback_usg"
#> [4] "spt_monitor"
#> [5] "trace_xe_action_map"
#> [6] "trace_xe_event_map"
#> [7] "spt_values"
#> [8] "CHECK_CONSTRAINTS"
#> [9] "COLUMN_DOMAIN_USAGE"
#> [10] "COLUMN_PRIVILEGES"
#> ...
#> [650] "xml_schema_types"
#> [651] "xml_schema_wildcard_namespaces"
#> [652] "xml_schema_wildcards"
General Background
Before anything, consider reading up on relational database architecture, where tables are encapsulated in schemas, which are themselves encapsulated in databases, which are in turn encapsulated in servers or clusters. Notice that the icons in your image correspond to the object types:
cluster/server < catalog/database < schema/namespace < table
Hence, there are no nested tables in your situation, just a typical hierarchy:
myserver < QnRStore < dbo < BGBURef
To access this hierarchy from the server level in an SQL query, you would use period-qualified names:
SELECT * FROM database.schema.table
SELECT * FROM QnRStore.dbo.BGBURef
The default schema for SQL Server is dbo (for comparison, in Postgres it is public). Usually, DB-APIs like R's odbc connect to a database, which allows access to any underlying schemas and their corresponding tables, assuming the connected user has permission on those schemas. Please note this rule is not universal: for example, Oracle's schema aligns with the owner, and MySQL's database is synonymous with schema.
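As a quick way to see this hierarchy from R, DBI's dbListObjects() can enumerate the objects visible on a connection (a sketch, assuming con is the odbc connection from the question):
# list top-level objects (schemas, catalogs, tables) visible on this connection
dbListObjects(con)
# list the tables inside the dbo schema
dbListObjects(con, prefix = Id(schema = "dbo"))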
See further reading:
What is the difference between a schema and a table and a database?
Differences between Database and Schema using different databases?
Difference between database and schema
What's the difference between a catalog and a schema in a relational database?
A database schema vs a database tablespace?
Specific Case
Therefore, to connect to an SQL Server table in the default schema, simply reference the table, BGBURef; this assumes the table resides in the dbo schema of the database you connect to.
df <- dbReadTable(con, "BGBURef")
If you use a non-default schema, you will need to specify it explicitly, which you can now do with DBI::Id; it works the same way for dbReadTable and dbWriteTable:
s <- Id(schema = "myschema", table = "mytable")
df <- dbReadTable(con, s)
dbWriteTable(con, s, mydataframe)
Alternatively, you can run the needed period-qualified SQL query:
df <- dbGetQuery(con, "SELECT * FROM [myschema].[mytable]")
And you can use SQL() for writing to persistent tables:
dbWriteTable(con, SQL("myschema.mytable"), mydataframe)
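If you want to double-check that a qualified name resolves before reading or writing, dbExistsTable() accepts the same Id() object (a small sketch using the placeholder names above):
# returns TRUE if myschema.mytable is visible on this connection
dbExistsTable(con, Id(schema = "myschema", table = "mytable"))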
When using dbplyr, it appears that
df = tbl(con, from = 'BGBUref')
is roughly equivalent to
USE QnRStore
GO
SELECT * FROM BGBUref;
From @David_Browne's comment and the image, it looks like you have:
A table named 'BGBUref'
In a schema named 'dbo'
In a database called 'QnRStore'
In this case you need the in_schema command.
If your connection (con) is to the QnRStore database then try this:
df = tbl(con, from = in_schema('dbo', 'BGBUref'))
If your connection (con) is not to the QnRStore database directly then this may work:
df = tbl(con, from = in_schema('QnRStore.dbo', 'BGBUref'))
(I use this form when accessing multiple databases via the same connection, because dbplyr performs best if you use the same connection when joining tables from different databases.)
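To illustrate that last point, here is a sketch of a cross-database join over a single connection; the second database, table, and join column are made up purely for illustration:
refs  <- tbl(con, from = in_schema('QnRStore.dbo', 'BGBURef'))
other <- tbl(con, from = in_schema('OtherDB.dbo', 'SomeTable'))  # hypothetical table
# the join is translated to SQL and executed on the server over one connection
joined <- inner_join(refs, other, by = "bgbu_id")  # assumed join key
collect(joined)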
I am trying to transfer data from the ThingSpeak API into a Postgres database. The API limits each request to 8000 observations, but I need to pull millions! I'm using R to iteratively pull from the API, do a bunch of wrangling, and then submit the results as a data.frame to my table within the db.
The current way I am doing this relies on the dbWriteTable() function from the RPostgres package. However, this method does not account for existing observations in the db. I have to manually DELETE FROM table_name before running the script or I'll end up writing duplicate observations each time I try to update the db. I'm still wasting time re-writing observations that I deleted, and the script takes ~2 days to complete because of this.
I would prefer a script that incorporates the functionality of Postgres 9.5's ON CONFLICT DO NOTHING clause, so I don't have to waste time re-uploading observations that are already in the db. I've found the st_write() and st_read() functions from the sf package useful for running SQL queries directly from R, but I have hit a roadblock. Currently, I'm stuck trying to upload the 8000 observations within each df from R to my db. I am getting the error shown below.
Connecting to database:
# db, host, port, pw, and user are all objects in my R environment
con <- dbConnect(drv = RPostgres::Postgres()
,dbname = db
,host = host
,port = port
,password = pw
,user = user)
Current approach using RPostgres:
dbWriteTable(con
,"table_name"
,df
,append = TRUE
,row.names = FALSE)
New approach using sf:
st_write(conn = conn
,obj = df
,table = 'table_name'
,query = "INSERT INTO table_name ON CONFLICT DO NOTHING;"
,drop_table = FALSE
,try_drop = FALSE
,debug = TRUE)
Error message:
Error in UseMethod("st_write") :
no applicable method for 'st_write' applied to an object of class "c('tbl_df', 'tbl', 'data.frame')"
Edit:
Converting to strictly a dataframe, i.e. df <- as.data.frame(df) or attributes(df)$class <- "data.frame", resulted in a similar error message, only without the tbl_df or tbl classes.
Most recent approach with sf:
I'm making some progress with using st_write() by changing to the following:
# convert geom from WKT to feature class
df$geom <- st_as_sfc(df$geom)
# convert from data.frame to sf class
df <- st_as_sf(df)
# write sf object to db
st_write(dsn = con # changed from drv to dsn argument
,geom_name = "geom"
,table = "table_name"
,query = "INSERT INTO table_name ON CONFLICT DO NOTHING;"
,drop_table = FALSE
,try_drop = FALSE
,debug = TRUE
)
New Error:
Error in result_create(conn@ptr, statement) :
Failed to fetch row: ERROR: type "geometry" does not exist at character 345
I'm pretty sure that this is because I have not yet installed the PostGIS extension in my PostgreSQL database. If anyone could confirm, I'd appreciate it! Installing PostGIS is a pretty lengthy process, so I won't be able to provide an update for a few days. I'm hoping I've solved the problem with the st_write() function, though!
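Separately, for the original goal of not re-uploading rows that already exist, one pattern that avoids PostGIS entirely is to stage each batch in a temporary table with dbWriteTable() and then let Postgres resolve duplicates with INSERT ... ON CONFLICT DO NOTHING via dbExecute(). This is only a sketch: it assumes the RPostgres connection con from above, a plain (non-spatial) data frame df whose columns match the target table, and a target table table_name with a unique constraint on an id column (both names are placeholders):
# stage the batch in a temporary table (dropped automatically at the end of the session)
dbWriteTable(con, "staging_batch", df, temporary = TRUE, overwrite = TRUE, row.names = FALSE)
# insert only the rows that do not violate the unique constraint
dbExecute(con, "
  INSERT INTO table_name
  SELECT * FROM staging_batch
  ON CONFLICT (id) DO NOTHING;
")
# drop the staging table before the next batch
dbExecute(con, "DROP TABLE staging_batch;")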
I have a really large table (8M rows) that I need to import into R for some processing. The problem is that when I try to bring it into R using the DBI package, I get an error.
My code is below:
options(java.parameters = "-Xmx8048m")
psql.jdbc.driver <- "../postgresql-42.2.1.jar"
jdbc.url <- "jdbc:postgresql://server_url:port"
pgsql <- JDBC("org.postgresql.Driver", psql.jdbc.driver, "`")
con <- dbConnect(pgsql, jdbc.url, user="", password= '')
tbl <- dbGetQuery(con, "SELECT * FROM my_table;")
And the error I get is
Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
Unable to retrieve JDBC result set for SELECT * FROM my_table; (Ran out of memory retrieving query results.)
I can understand it's because the result set is too big, but I am not sure how to retrieve it in batches instead of all at once. I have tried dbSendQuery, dbReadTable and dbGetQuery; all of them give the same error.
Any help would be appreciated!
I got it to work by using the RPostgreSQL package instead of the RJDBC driver with DBI.
I was able to use dbSendQuery and then fetch the results iteratively in chunks of 10,000 rows.
# send the query first, then fetch the result in chunks
postgres_query <- dbSendQuery(con, "SELECT * FROM my_table")
# main_tbl <- dbFetch(postgres_query, n = -1)  # fetching everything at once didn't work, so fetch in chunks
df <- data.frame()
while (!dbHasCompleted(postgres_query)) {
  chunk <- dbFetch(postgres_query, 10000)
  print(nrow(chunk))
  df <- rbind(df, chunk)
}
dbClearResult(postgres_query)
While debugging the R script, I have come across a strange error: "Error in debug(fun, text, condition) : argument must be a closure".
PC setup: Win7 64-bit, Oracle Client 12 (both 32- and 64-bit), R (64-bit)
Earlier the script debugged fine without errors. I have searched the internet but found no clear explanation of what the mistake is or how to remove it.
Running the code as a plain script rather than as a function produces no errors.
I would be very grateful for your ideas.
The source script (it connects to an Oracle DB and executes a simple query) is as follows:
download1 <- function() {
  # install the packages if they are missing
  if (!require("dplyr")) {
    install.packages("dplyr")
  }
  if (!require("RODBC")) {
    install.packages("RODBC")
  }
  library(RODBC)
  library(dplyr)
  # establish the connection with the DB or schema
  con <- odbcConnect("DB", uid = "ANALYTICS", pwd = "122334fgcx",
                     rows_at_time = 500, believeNRows = FALSE)
  # check that the connection is working (optional)
  odbcGetInfo(con)
  # query the database and time how long it takes
  ptm <- proc.time()
  x <- sqlQuery(con, "select * from my_table")
  proc.time() - ptm
  # to extract all field names to a separate vector:
  # field_names <- sqlQuery(con, "SELECT column_name FROM all_tab_cols WHERE table_name = 'MY_TABLE'")
  close(con)
}
debug(download1(),text = "", condition = NULL)
Use
debug(download1)
download1()
debug() expects the function object itself (a closure) as its argument; debug(download1(), ...) passes the value returned by calling download1(), which is why R complains that the argument must be a closure.
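A minimal illustration of the difference (a made-up example, not from the original post):
f <- function() 42
debug(f)       # works: f is a closure (a function object)
f()            # the debugger now steps through f
# debug(f())   # error: f() evaluates to 42, which is not a closure
undebug(f)     # turn debugging back off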