How to connect to a Teradata database using Dask?

The equivalent pandas code I have used for connecting to Teradata is:
import pandas as pd
import teradatasql

# config (e.g. a configparser object) and tables are defined elsewhere
database = config.get('Teradata connection', 'database')
host = config.get('Teradata connection', 'host')
user = config.get('Teradata connection', 'user')
pwd = config.get('Teradata connection', 'pwd')

with teradatasql.connect(host=host, user=user, password=pwd) as connect:
    query1 = "SELECT * FROM {}.{}".format(database, tables)
    df = pd.read_sql_query(query1, connect)
Now I need to use the Dask library for loading this big data as an alternative to pandas.
Please suggest a method for making the same connection to Teradata with Dask.

Teradata appears to have a SQLAlchemy dialect, so you should be able to install that, set your connection string appropriately, and use Dask's existing read_sql_table function.
Alternatively, you could do this by hand: you need to decide on a set of conditions that will partition the data for you, each partition being small enough for your workers to handle. Then you can build a set of delayed partitions and combine them into a Dask dataframe as follows:
import dask
import dask.dataframe as dd

def get_part(condition):
    # each call fetches one partition's worth of rows
    with teradatasql.connect(host=host, user=user, password=pwd) as connect:
        query1 = "SELECT * FROM {}.{} WHERE {}".format(database, tables, condition)
        return pd.read_sql_query(query1, connect)

parts = [dask.delayed(get_part)(cond) for cond in conditions]
df = dd.from_delayed(parts)
(Ideally, you can derive the meta= parameter for from_delayed beforehand, perhaps by fetching the first 10 rows of the original query and passing that frame as meta.)

Related

Is it possible to send hive conf variables via a hive odbc connection when attempting a query?

I have a hive script that has some hive conf variables along the top. This query works fine when I run it on our EMR cluster and the expected data are returned, e.g.:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=10000;
set mapreduce.map.memory.mb=7168;
set mapreduce.reduce.memory.mb=7168;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.execution.engine=mr;
select
fruits,
count(1) as n
from table
group by fruits;
I would like to run this query on another server that has an ODBC connection to Hive (I'm in R):
library(odbc)
hive_conn <- DBI::dbConnect(odbc(), dsn = "Hive")
results <- DBI::dbGetQuery(hive_conn, "select fruits, count(1) as n from table group by fruits")
This runs fine and returns a data frame as expected.
However, if I want to set some hive configurations, I do not know how to send those with odbc.
How can I tell hive via odbc to run my query with my chosen hive conf settings?
I found the solution to this in the documentation for the driver: https://www.simba.com/products/Hive/doc/ODBC_InstallGuide/linux/content/odbc/hi/configuring/serverside.htm
I needed to add these 'server side properties' when I create the connection. You prefix each property name with the string 'SSP_' (server side property) and then add them as name-value pairs, e.g.:
hive_conn <- dbConnect(odbc(),
dsn = "Hive",
SSP_hive.execution.engine = "mr",
SSP_hive.exec.dynamic.partition.mode = "nonstrict",
SSP_hive.exec.dynamic.partition = "true",
SSP_hive.exec.max.dynamic.partitions = 10000,
SSP_mapreduce.map.memory.mb = 7168,
SSP_mapreduce.reduce.memory.mb = 7168,
SSP_hive.exec.max.dynamic.partitions.pernode = 10000,
SSP_hive.exec.compress.output = "true",
SSP_mapred.output.compression.codec = "org.apache.hadoop.io.compress.SnappyCodec"
)

How to use glue_data_sql to write safe parameterized queries on an SQL server database?

The problem
I want to write a wrapper around some DBI functions that allows safe execution of parameterized queries. I've found this resource that explains how to use the glue package to insert parameters into an SQL query. However, there seem to be two distinct ways to use the glue package to insert parameters:
Method 1 involves using ? in the SQL query where the parameters need to be inserted, and then using dbBind to fill them in. Example from the link above:
library(glue)
library(DBI)
airport_sql <- glue_sql("SELECT * FROM airports WHERE faa = ?")
airport <- dbSendQuery(con, airport_sql)
dbBind(airport, list("GPT"))
dbFetch(airport)
Method 2 involves using glue_sql or glue_data_sql to fill in the parameters directly (no use of dbBind). Again, an example from the link above:
airport_sql <- glue_sql(
  "SELECT * FROM airports WHERE faa IN ({airports*})",
  airports = c("GPT", "MSY"),
  .con = con
)
airport <- dbSendQuery(con, airport_sql)
dbFetch(airport)
I would prefer the second method because it has a lot of extra functionality, such as collapsing multiple values for an IN statement in the WHERE clause of an SQL query. See the second example above for how that works (note the * after the parameter, which indicates it must be collapsed). The question is: is this safe against SQL injection? (Are there other things I need to worry about?)
My code
This is currently the code I have for my wrapper.
paramQueryWrapper <- function(
  sql,
  params = NULL,
  dsn = standard_dsn,
  login = user_login,
  pw = user_pw
){
  if(missing(sql) || length(sql) != 1 || !is.character(sql)){
    stop("Please provide sql as a character vector of length 1.")
  }
  if(!is.null(params)){
    if(!is.list(params)) stop("params must be a (named) list (or NULL).")
    if(length(params) < 1) stop("params must be either NULL, or contain at least one element.")
    if(is.null(names(params)) || any(names(params) == "")) stop("All elements in params must be named.")
  }
  con <- DBI::dbConnect(
    odbc::odbc(),
    dsn = dsn,
    UID = login,
    PWD = pw
  )
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  # Replace params with corresponding values and execute query
  sql <- glue::glue_data_sql(.x = params, sql, .con = con)
  query <- DBI::dbSendQuery(conn = con, sql)
  on.exit(DBI::dbClearResult(query), add = TRUE, after = FALSE)
  return(tibble::as_tibble(DBI::dbFetch(query)))
}
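For illustration, a call to this wrapper using the collapsing IN syntax from Method 2 might look like the following (a sketch only: it assumes standard_dsn, user_login and user_pw are defined, and reuses the hypothetical airports table and faa column from the linked resource):
result <- paramQueryWrapper(
  sql = "SELECT * FROM airports WHERE faa IN ({airports*})",
  # {airports*} is collapsed by glue_data_sql into a properly quoted IN list
  params = list(airports = c("GPT", "MSY"))
)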
My question
Is this safe against SQL injection? Especially since I am not using dbBind.
Epilogue
I know that there already exists a wrapper called dbGetQuery that allows parameters (see this question for more info; look for the answer by @krlmlr for an example with a parameterized query). But this again relies on the first method using ?, which is much more basic in terms of functionality.
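For comparison, a minimal sketch of that placeholder-based approach with dbGetQuery (assuming con is an open DBI connection to the same airports data):
airport <- DBI::dbGetQuery(
  con,
  "SELECT * FROM airports WHERE faa = ?",
  # positional placeholder filled in by the driver, not by string interpolation
  params = list("GPT")
)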

Use R's mongolite to correctly (insert? update?) add data to an existing collection

I have the following function written in R that (I think) is doing a poor job of updating my mongo database's collections.
library(mongolite)
con <- mongolite::mongo(collection = "mongo_collection_1", db = 'mydb', url = 'myurl')
myRdataframe1 <- con$find(query = '{}', fields = '{}')
rm(con)
con <- mongolite::mongo(collection = "mongo_collection_2", db = 'mydb', url = 'myurl')
myRdataframe2 <- con$find(query = '{}', fields = '{}')
rm(con)
... code to update my dataframes (rbind additional rows onto each of them) ...
# write dataframes to database
write.dfs.to.mongodb.collections <- function() {
  collections <- c("mongo_collection_1", "mongo_collection_2")
  my.dataframes <- c("myRdataframe1", "myRdataframe2")
  # loop over the dataframes, writing each to its collection
  for(i in 1:length(collections)) {
    # connect and add data to this collection
    con <- mongo(collection = collections[i], db = 'mydb', url = 'myurl')
    con$remove('{}')
    con$insert(get(my.dataframes[i]))
    con$count()
    rm(con)
  }
}
write.dfs.to.mongodb.collections()
My dataframes myRdataframe1 and myRdataframe2 are very large dataframes, currently ~100K rows and ~50 columns. Each time my script runs, it:
uses con$find('{}') to pull the mongodb collection into R, saved as a dataframe myRdataframe1
scrapes new data from a data provider that gets appended as new rows to myRdataframe1
uses con$remove() and con$insert to fully remove the data in the mongodb collection, and then re-insert the entire myRdataframe1
This last bullet point is iffy, because I run this R script daily in a cron job and I don't like that each time I am entirely wiping the MongoDB collection and re-inserting the entire R dataframe into the collection.
If I remove the con$remove() line, I receive an error that states I have duplicate _id keys. It appears I cannot simply append using con$insert().
Any thoughts on this are greatly appreciated!
When you attempt to insert documents into MongoDB that already exist in the database according to their primary key, you will get a duplicate key exception. To work around that, you can drop the _id column from the data frame with something like this before the con$insert:
df <- get(my.dataframes[i])   # my.dataframes[i] holds the data frame's name, so fetch the object itself
df$`_id` <- NULL              # backticks are needed: _id is not a syntactic name in R
This way, the newly inserted document will automatically get a new _id assigned.
You can use upsert, which matches a document against the first condition: if found, it is updated; if not, a new one is inserted.
First you need to separate the _id from each document:
df <- get(my.dataframes[i])
ids <- df$`_id`
df$`_id` <- NULL
Then upsert each document (there might be an easier way to build the JSON strings in R than paste; jsonlite::toJSON is used here):
for(j in seq_len(nrow(df))) {
  query  <- paste0('{"_id":"', ids[j], '"}')
  update <- paste0('{"$set":', jsonlite::toJSON(as.list(df[j, ]), auto_unbox = TRUE), '}')
  con$update(query, update, upsert = TRUE)
}

dbplyr in_schema() function behaving strangely

I am using the in_schema() function from the dbplyr package to create a table in a named schema of a PostgreSQL database from R.
It is not a new piece of code and it used to work as expected, creating a table called 'my_table' in schema 'my_schema'.
library(DBI)
library(dbplyr)   # for in_schema()
con <- dbConnect(odbc::odbc(),
driver = "PostgreSQL Unicode",
server = "server",
port = 5432,
uid = "user name",
password = "password",
database = "dbase")
dbWriteTable(con,
in_schema('my_schema', 'my_table'),
value = whatever) # assume that 'whatever' is a data frame...
This piece of code has now developed issues and has unexpectedly started to create a table literally called 'my_schema.my_table' in the default public schema of my database, instead of the expected my_schema.my_table.
Has anybody else noticed such behaviour, and is there a solution (other than using the default postgresql schema, which is not practical in my case)?
For that, I would recommend using copy_to() instead of dbWriteTable():
copy_to(con, iris, in_schema("production", "iris"))
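Applied to the names from the question, a minimal sketch might look like this (assuming dplyr and dbplyr are loaded, whatever is the data frame to write, and a persistent rather than temporary table is wanted):
# writes whatever to my_schema.my_table, rather than a public table literally named 'my_schema.my_table'
copy_to(con, whatever,
        in_schema("my_schema", "my_table"),
        temporary = FALSE)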

dplyr & monetdb - appropriate syntax for querying schema.table?

In monetdb I have set up a schema main and my tables are created in this schema.
For example, the department table is main.department.
With dplyr I try to query the table:
mdb <- src_monetdb(dbname="model", user="monetdb", password="monetdb")
tbl(mdb, "department")
But I get
Error in .local(conn, statement, ...) :
Unable to execute statement 'PREPARE SELECT * FROM "department"'.
Server says 'SELECT: no such table 'department'' [#42S02].
I tried to use "main.department" and other similar combinations with no luck.
What is the appropriate syntax?
There is a somewhat hacky workaround for this: we can manually set the default schema for the connection. I have a database testing; in it there is a schema foo with a table called bar.
mdb <- src_monetdb("testing")
dbSendQuery(mdb$con, "SET SCHEMA foo");
t <- tbl(mdb, "bar")
The dbplyr package (a backend of dplyr for database connections) has an in_schema() function for these cases:
conn <- dbConnect(
MonetDB.R(),
host = "localhost",
dbname = "model",
user = "monetdb",
password = "monetdb",
timeout = 86400L
)
department = tbl(conn, dbplyr::in_schema("main", "department"))
