I set up an ODBC connection to a Netezza database. The connection itself works fine, but R only pulls 256 rows by default and limits the result set to that size.
When I run the query directly in Netezza it returns about 300k rows, and I expect the same number in R. Instead it returns only 256 rows, far short of 300k.
The driver I am using is NetezzaSQL version 7.00.02 (NSQLODBC.DLL).
I tried changing the prefetch count to zero in the "Drivers Option" tab under
Control Panel > Administrative Tools > Data Sources (ODBC) > System DSN.
It didn't work. Any ideas?
I think RODBC behaves poorly with Netezza. A solution is described at http://datamining.togaware.com/survivor/Database_Connection.html:
just add believeNRows=FALSE to either your sqlQuery or odbcConnect call (use the latter if you also use sqlFetch).
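For example, a minimal sketch (the DSN name and table are placeholders, not taken from the question):
library(RODBC)
# believeNRows=FALSE tells RODBC not to trust the driver's row-count report
ch <- odbcConnect("myNetezzaDSN", believeNRows = FALSE)
res <- sqlQuery(ch, "select * from mytable", believeNRows = FALSE)
odbcClose(ch)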
You can also try using the JDBC driver:
library(RJDBC)
drv <- JDBC("org.netezza.Driver", "nzjdbc.jar", "'")
conn <- dbConnect(drv, "jdbc:netezza://host:5480/database", "user", "password")
res <- dbSendQuery(conn, "select * from mytable")
dat <- fetch(res, n = -1)  # dbSendQuery only submits the query; fetch() retrieves the rows
That way you don't have to deal with DSNs, etc.
I know this is somewhat outdated, but the problem is not with the RODBC package. It lies in how you set up the ODBC connection: if you configure the connection in Windows, the last tab of the DSN settings lets you specify the number of rows to prefetch, and the default is 256.
I have several large R data.frames that I would like to put into a local duckdb database. The problem I am having is that duckdb seems to load everything into memory even though I am specifying a file as the location.
Also, the correct way to establish a connection isn't clear to me (so I'm not sure whether this has something to do with it). I have tried:
library(DBI)
library(duckdb)
duckdrv <- duckdb(dbdir="dt.db", read_only=FALSE)
dkCon <- dbConnect(drv=duckdrv)
and also:
duckdrv <- duckdb()
dkCon <- dbConnect(drv=duckdrv, dbdir="dt.db", read_only=FALSE)
Both work fine, meaning I can create tables, use dbWriteTable, run queries, etc. However, the memory usage is very high (about the same size as the data.frames). I think I read somewhere that duckdb defaults to using a certain percentage of the available memory, which won't work for me because the system I am using is a shared resource. I also want to run some queries in parallel, which will drive memory usage even higher.
I have tried this:
dbExecute(dkCon, "PRAGMA memory_limit='1GB';")
but that doesn't seem to make a difference, even if I close the connection, shut down the instance, and reconnect.
Does anyone know how I can fix this problem? RSQLite also has high memory usage temporarily while I am writing data to a table, but then it goes back to normal, and if I open a read-only connection it isn't an issue at all. I would like to get duckdb working because I think the queries are supposed to be much faster. Any help would be appreciated!
The memory limit can be set using a PRAGMA or SET statement in DuckDB. By default, the limit is 75% of the RAM.
con.execute("PRAGMA memory_limit='200MB'")
OR
con.execute("SET memory_limit='200MB'")
I can confirm that this limit works. However, it is not a hard limit and may sometimes be exceeded, depending on the volume of data, the format of the data you are querying (e.g. Parquet from S3), and the type of query; there are certain limitations and constraints around it at the moment.
Below is an example where the volume of data in plain text (CSV) was around 4.23 GB. The data was first loaded into DuckDB and then some SQL queries were run with memory_limit='200MB'. The screenshot below shows the maximum memory recorded for the Python script.
Your approach of using the memory_limit pragma is correct, but you were using an outdated version.
For example, using DuckDB version 0.5.1:
library("DBI")
con <- dbConnect(duckdb::duckdb(), dbdir = "my-db.duckdb")
dbExecute(con, "PRAGMA memory_limit='500MB'")
dbGetQuery(con, "PRAGMA version")
dbExecute(con, "CREATE TABLE gen AS SELECT * FROM 'gen1GB.csv'")
dbGetQuery(con, "select count(*) from gen")
This outputs for me:
library_version source_id
1 0.5.1 7c111322d
count_star()
1 1e+08
Memory usage stays below 500 MB. On macOS it can be checked using:
ps axu | grep 'lib\/R' | awk '{print $6 " " $11}'
464768 /usr/local/Cellar/r/4.2.1_4/lib/R/bin/exec/R
You can generate a test csv-file using:
import numpy as np
import pandas as pd
rng = np.random.default_rng()
df = pd.DataFrame(rng.integers(0, 100, size=(100000000, 4)), columns=list('ABCD'))
df.to_csv('gen1GB.csv', index=False)
I need to update column values conditionally on other columns in a PostgreSQL database table. I managed to do it by writing an SQL statement in R and executing it with dbExecute from the DBI package.
library(dplyr)
library(DBI)
# Establish connection with database
con <- dbConnect(RPostgreSQL::PostgreSQL(), dbname = "myDb",
host="localhost", port= 5432, user="me",password = myPwd)
# Write SQL update statement
request <- paste("UPDATE table_to_update",
"SET var_to_change = 'new value' ",
"WHERE filter_var = 'filter' ")
# Back-end execution
con %>% dbExecute(request)
Is it possible to do this using only dplyr syntax? Out of curiosity, I tried
con %>% tbl("table_to_update") %>%
mutate(var_to_change = if (filter_var == 'filter') 'new value' else var_to_change)
which works in R but obviously does nothing in the database, since it issues a SELECT statement. copy_to allows only the append and overwrite options, so I can't see how to use it short of deleting and then appending the filtered observations...
Current dplyr 0.7.1 (with dbplyr 1.1.0) doesn't support this, because it assumes that all data sources are immutable. Issuing an UPDATE via dbExecute() seems to be the best bet.
For replacing a larger chunk in a table, you could also:
Write the data frame to a temporary table in the database via copy_to().
Start a transaction.
Issue a DELETE FROM ... WHERE id IN (SELECT id FROM <temporary table>)
Issue an INSERT INTO ... SELECT * FROM <temporary table>
Commit the transaction
Depending on your schema, you might be able to do a single INSERT INTO ... ON CONFLICT DO UPDATE instead of DELETE and then INSERT.
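A minimal sketch of the DELETE-then-INSERT approach with DBI (the table my_table, its id column, the data frame new_rows and the temporary table name are hypothetical placeholders; dbplyr needs to be installed for copy_to() on a DBI connection):
library(DBI)
library(dplyr)
# copy the replacement rows into a temporary table on the server
copy_to(con, new_rows, name = "tmp_new_rows", temporary = TRUE)
dbWithTransaction(con, {
  # delete the rows that are about to be replaced, then insert the new versions
  dbExecute(con, "DELETE FROM my_table WHERE id IN (SELECT id FROM tmp_new_rows)")
  dbExecute(con, "INSERT INTO my_table SELECT * FROM tmp_new_rows")
})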
Background: I am developing an R script that pulls data from a MySQL database, performs a logistic regression, and then inserts the predictions back into the database. I want the entire system to be self-contained in the script in case of database failure. This includes all MySQL stored procedures that the script depends on to aggregate the data on the back end, since these would be deleted in such a database failure.
Question: I'm having trouble creating a stored procedure from an R script. I am running the following:
library(RMySQL)
mySQLDriver <- dbDriver("MySQL")
connect <- dbConnect(mySQLDriver, group = connection)
query <-
"
DROP PROCEDURE IF EXISTS Test.Tester;
DELIMITER //
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
END //
DELIMITER ;
"
sendQuery <- dbSendQuery(connect, query)
dbClearResult(dbListResults(connect)[[1]])
dbDisconnect(connect)
However, I get the following error, which seems to involve the DELIMITER change.
Error in .local(conn, statement, ...) :
could not run statement: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'DELIMITER //
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
EN' at line 2
What I've Done: I have spent quite a bit of time searching for the answer, but have come up with nothing. What am I missing?
Just wanted to follow up on this thread of comments. Thank you for your thoughts on this issue. I have a couple of Python scripts that need the same functionality, so I began researching the topic for Python and found a question that indicates the answer. The question states:
"The DELIMITER command is a MySQL shell client builtin, and it's recognized only by that program (and MySQL Query Browser). It's not necessary to use DELIMITER if you execute SQL statements directly through an API.
The purpose of DELIMITER is to help you avoid ambiguity about the termination of the CREATE FUNCTION statement, when the statement itself can contain semicolon characters. This is important in the shell client, where by default a semicolon terminates an SQL statement. You need to set the statement terminator to some other character in order to submit the body of a function (or trigger or procedure)."
Hence the following code will run in R:
mySQLDriver <- dbDriver("MySQL")
connect <- dbConnect(mySQLDriver, group = connection)
query <-
"
CREATE PROCEDURE Test.Tester()
BEGIN
/***DO DATA AGGREGATION***/
END
"
sendQuery <- dbSendQuery(connect, query)
dbClearResult(dbListResults(connect)[[1]])
dbDisconnect(connect)
R Version : 2.14.1 x64
Running on Windows 7
Connecting to a database on a remote Microsoft SQL Server 2012
I have an unordered vector of names, say:
names <- c("A", "B", "A", "C", "C")
each of which have an id in a table in my db. I need to convert the names to their corresponding ids.
I currently have the following code to do it.
###
names <- c("A", "B", "A", "C", "C")
dbConn <- odbcDriverConnect(connection="connection string") #successfully connects
nameToID <- function(name, dbConn){
    #dbConn : active db connection formed via odbcDriverConnect
    #name : a char string
    sqlQuery(dbConn, paste("select id from table where name='", name, "'", sep=""))
}
sapply(names, nameToID, dbConn=dbConn)
###
Barring better ways to do this, which could involve loading the table into R and working with the problem there (which is possible), I understand why the following doesn't work, but I cannot seem to find a solution. Attempting to use parallelization via the 'parallel' package:
###
names <- c("A", "B", "A", "C", "C")
dbConn <- odbcDriverConnect(connection="connection string") #successfully connects
nameToID <- function(name, dbConn){
    #dbConn : active db connection formed via odbcDriverConnect
    #name : a char string
    sqlQuery(dbConn, paste("select id from table where name='", name, "'", sep=""))
}
mc <- detectCores()
cl <- makeCluster(mc)
clusterExport(cl, c("sqlQuery", "dbConn"))
parSapply(cl, names, nameToID, dbConn=dbConn) #incorrect passing of nameToID's second argument
###
As in the comment, this is not the correct way to assign the second argument to nameToID.
I have also tried the following:
parSapply(cl, names, function(x) nameToID(x, dbConn))
in place of the previous parSapply call, but that does not work either; the error thrown says "the first parameter is not an open RODBC connection", presumably referring to the first parameter of sqlQuery(). dbConn remains open, though.
The following code does work with parallelization.
###
names <- c("A", "B", "A", "C", "C")
dbConn <- odbcDriverConnect(connection="connection string") #successfully connects
nameToID <- function(name){
    #name : a char string
    dbConn <- odbcDriverConnect(connection="string")
    result <- sqlQuery(dbConn, paste("select id from table where name='", name, "'", sep=""))
    odbcClose(dbConn)
    result
}
mc <- detectCores()
cl <- makeCluster(mc)
clusterExport(cl, c("sqlQuery", "odbcDriverConnect", "odbcClose", "dbConn", "nameToID")) #throwing everything in
parSapply(cl, names, nameToID)
###
But the constant opening and closing of the connection ruins the gains from parallelization, and seems just a bit silly.
So the overall question is: how do I pass the second parameter (the open db connection) to the function within parSapply, in much the same way as with the regular sapply? In general, how does one pass a second, third, or nth parameter to a function within a parallel routine?
Thanks and if you need any more information let me know.
-DT
Database connection objects can't be exported or passed as function arguments because they contain socket connections. If you try, the object will be serialized, sent to the workers and deserialized, but it won't work correctly because the socket connection won't be valid.
The solution is to create the database connection on each worker before calling parSapply. I often do that using clusterEvalQ:
clusterEvalQ(cl, {
library(RODBC)
dbConn <- odbcDriverConnect(connection="connection string")
NULL
})
Now the worker function can be written as:
nameToID <- function(name) {
sqlQuery(dbConn, paste("select id from table where name='", name, "'", sep=""))
}
and called with:
parSapply(cl, names, nameToID)
Also note that since RODBC is loaded on each of the workers you don't have to export functions defined in it, which I think is good programming practice.
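Putting it together, a minimal sketch of the whole pattern (the connection string, table name and column name are placeholders):
library(parallel)
library(RODBC)
names <- c("A", "B", "A", "C", "C")
cl <- makeCluster(detectCores())
# open one connection per worker; it lives for the life of the cluster
clusterEvalQ(cl, {
  library(RODBC)
  dbConn <- odbcDriverConnect(connection = "connection string")
  NULL
})
nameToID <- function(name) {
  sqlQuery(dbConn, paste("select id from table where name='", name, "'", sep = ""))
}
ids <- parSapply(cl, names, nameToID)
# close the per-worker connections before shutting the cluster down
clusterEvalQ(cl, odbcClose(dbConn))
stopCluster(cl)
For ordinary, serializable extra arguments, parSapply forwards them to the function just as sapply does, e.g. parSapply(cl, names, someFun, extra = 2); it is only the connection object that cannot be shipped that way.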
Using the RODBC package, I have successfully created an ODBC connection, but I receive error messages when I try to query the database. I am using the INFORMIX 3.31 32-bit driver (version 3.31.00.10287).
channel <- odbcConnect("exampleDSN")
unclass(channel)
[1] 3
attr(,"connection.string")
[1] "DSN=exampleDSN;UID=user;PWD=****;DB=exampleDB;HOST=exampleHOST;SRVR=exampleSRVR;SERV=exampleSERV;PRO=onsoctcp ... (more parameters)"
attr(,"handle_ptr")
<pointer: 0x0264c098>
attr(,"case")
[1] "nochange"
attr(,"id")
[1] 4182
attr(,"believeNRows")
[1] TRUE
attr(,"colQuote")
[1] "\""
attr(,"tabQuote")
[1] "\""
attr(,"interpretDot")
[1] TRUE
attr(,"encoding")
[1] ""
attr(,"rows_at_time")
[1] 100
attr(,"isMySQL")
[1] FALSE
attr(,"call")
odbcDriverConnect(connection = "DSN=exampleDSN")
When I try to query and investigate the structure of the returned object, I receive an error message 'chr [1:2] "42000 -201 [Informix][Informix ODBC Driver][Informix]A syntax error has occurred." ...'
Specifically, I wrote an expression to loop through all tables in the database, retrieve 10 rows, and investigate the structure of the returned object.
for (i in 1:153){res <- sqlFetch(channel, sqlTables(channel, tableType="TABLE")$TABLE_NAME[i], max=10); str(res)}
Each iteration returns the same error message. Any ideas where to start?
ADDITIONAL INFO: When I return the object 'res', I receive the following -
> res
[1] "42000 -201 [Informix][Informix ODBC Driver][Informix]A syntax error has occurred."
[2] "[RODBC] ERROR: Could not SQLExecDirect 'SELECT * FROM \"exampleTABLE\"'"
The error message you quote is:
"[RODBC] ERROR: Could not SQLExecDirect 'SELECT * FROM \"exampleTABLE\"'"
Informix only recognizes table names enclosed in double quotes if DELIMIDENT is set in the environment of either the server or the client (or both). It doesn't much matter what it is set to; I use DELIMIDENT=1 when I want delimited identifiers.
How did you create the table in the Informix database? Unless you created the table with DELIMIDENT set, the table name will not be case sensitive; you do not need the quotes around the table name.
The fact that you're getting error -201 means you've got through the connection process; that is a good start, and simplifies what follows.
I'm not sure whether you're on a Unix machine or a Windows machine - it often helps to indicate that. On Windows, you might have to set the environment with SETNET32 (an Informix program), or there may be a way to specify the DELIMIDENT in the connect string. On Unix, you probably set it in your environment and the R software picks it up. However, there might be problems if you launch R via some sort of menu button or option in a GUI environment; the chances are that the profile is not executed before the R program is.
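For instance, a hedged sketch of setting it on the client side from within R before the connection is opened (whether the Informix ODBC driver picks it up from the process environment or needs it in the DSN/connection string depends on your client setup):
library(RODBC)
Sys.setenv(DELIMIDENT = "1")  # must be set before the connection is opened
channel <- odbcConnect("exampleDSN")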
You can try using the sqlQuery() function in RODBC to retrieve your results. This is the function I use at work and have never had a problem with it:
sqlQuery(channel, "select first 10 * from exampleTABLE")  # Informix uses FIRST n rather than TOP n
You should be able to put all of your queries into a list and iterate through them as you were before:
dat <- lapply(queries, function(x) sqlQuery(channel, x))
where queries is your list of queries and channel is your open ODBC connection. I guess I should also encourage you to close said connection when you're done, with odbcCloseAll().