Difference between src_postgres and dbConnect function to connect R with postgres

What is the difference between the src_postgres and dbConnect functions? Both can be used to connect R with PostgreSQL using the RPostgreSQL package. In my experiments I could only use src_postgres to read and dbConnect to write to the database.
When I tried them in other combinations I only received errors.
This seems fairly strange to me.

src_postgres is a function from the dplyr package for creating a connection to a PostgreSQL database. The RPostgreSQL package implements a method for the generic dbConnect from the DBI package, and src_postgres calls dbConnect from RPostgreSQL (I assume).
The connection object returned by dbConnect is meant to be an open-ended interface for sending SQL queries to the database. This means you can feed it any SELECT, UPDATE, INSERT, DELETE, etc. query that you like.
src_postgres is part of the higher-level interface to working with data from databases that Hadley built in dplyr. The src_* functions connect to a database, and the tbl functions then specify a more specific data source (table, view, arbitrary SELECT query) to pull data from. There are some basic table manipulation functions in dplyr, but I don't believe it is intended to be a tool for doing UPDATE or INSERT type things in the database; that's just not what the tool is for. Note that the "verbs" implemented in dplyr are all focused on pulling data out and summarising it (select, filter, mutate, etc.).
If you need to alter data in a database at the row level, you'll need to send SQL queries to a connection created by dbConnect. If all you're doing is pulling data from a database and analyzing it in R, that is what dplyr is for.
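For illustration, here is a minimal sketch of the two styles, assuming a database named "mydb" with a "sales" table; the connection details, table, and column names are placeholders, not anything from the question:
library(dplyr)
library(DBI)
library(RPostgreSQL)
# Read-only, dplyr style: src_postgres() + tbl() give you a lazy remote table
src <- src_postgres(dbname = "mydb", host = "localhost", user = "me", password = "secret")
sales_tbl <- tbl(src, "sales")
monthly <- sales_tbl %>%
  group_by(month) %>%
  summarise(total = sum(amount)) %>%
  collect()   # pull the summarized result into R
# Read/write, DBI style: dbConnect() gives a raw connection for arbitrary SQL
con <- dbConnect(PostgreSQL(), dbname = "mydb", host = "localhost", user = "me", password = "secret")
dbSendQuery(con, "UPDATE sales SET amount = 0 WHERE month = 'Jan'")
dbWriteTable(con, "monthly_summary", as.data.frame(monthly), row.names = FALSE)
dbDisconnect(con)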

Related

How to apply dtplyr with SQL Server database

I am trying to apply dtplyr to a SQL Server database.
I succeeded in applying dplyr as shown below, but I don't know how to apply dtplyr.
How can I do this?
library(odbc)
library(DBI)
library(tidyverse)
library(dtplyr)
library(dbplyr)
con <- DBI::dbConnect(odbc::odbc(),
                      Driver = "SQL Server",
                      Server = "address",
                      Database = "dbname",
                      UID = "ID",
                      PWD = "password")
dplyr::tbl(con, dbplyr::in_schema("dbo", "table1"))
The comments by #Waldi capture the essence of the matter: you cannot use dtplyr with SQL Server, because dtplyr only translates dplyr code into data.table code for in-memory objects.
The official dtplyr documentation states:
The goal of dtplyr is to allow you to write dplyr code that is automatically translated to the equivalent ... data.table code
The official dbplyr documentation states:
It allows you to use remote database tables as if they are in-memory data frames by automatically converting dplyr code into SQL
Both dbplyr and dtplyr translate dplyr commands. Which one you use depends on whether you are working with data.table type objects (in R memory) or remote SQL databases (whichever flavor of SQL you prefer).
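For illustration, a hedged sketch of the two back ends side by side, reusing the con and table from the question; the column names used in the verbs are placeholders:
library(dplyr)
library(dbplyr)
library(dtplyr)
library(data.table)
# dbplyr: dplyr verbs are translated to SQL and run on the SQL Server side
remote <- dplyr::tbl(con, dbplyr::in_schema("dbo", "table1")) %>%
  filter(year == 2020) %>%
  count(category)
local_df <- collect(remote)   # bring only the summarized result into R memory
# dtplyr: dplyr verbs are translated to data.table code, for in-memory objects only
lazy <- dtplyr::lazy_dt(as.data.table(local_df)) %>%
  mutate(share = n / sum(n))
as_tibble(lazy)   # forces evaluation of the generated data.table code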

How to use R DBI to create a view?

I'm trying to use R's DBI library to create a view on an Athena database, connected via JDBC. The dbSendStatement command, which is supposed to submit and execute arbitrary SQL without returning a result, throws an error because no result set is returned:
DBI::dbSendStatement(athena_con, my_query)
Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set", :
Unable to retrieve JDBC result set
JDBC ERROR: [Simba][JDBC](11300) A ResultSet was expected but not generated from query <query repeated here>
In addition, the view is not created.
I've tried other DBI commands that seemed promising (dbExecute, dbGetQuery, dbSendQuery), but they all throw the same error. (Actually, I expect them all to - dbSendStatement is the one that, from the manual, should work.)
Is there some other way to create a view using DBI, dbplyr, etc.? Or am I doing this right and it's a limitation of RJDBC or the driver?
RJDBC pre-dates the more recent DBI specification and uses a different function to access this functionality: RJDBC::dbSendUpdate(con, query).
DBI's dbSendStatement() doesn't work here yet. For best compatibility, RJDBC could implement this method and forward it to its dbSendUpdate().
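A minimal sketch of that workaround, assuming athena_con is the RJDBC connection from the question; the view and table names here are placeholders:
library(RJDBC)
# dbSendUpdate() executes SQL that does not return a result set, such as CREATE VIEW
my_query <- "CREATE VIEW my_view AS SELECT col1, col2 FROM my_table"
RJDBC::dbSendUpdate(athena_con, my_query)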
Without more details of your query I cannot promise this helps, but in my case:
nrow <- dbExecute(con, paste0("CREATE VIEW ExampleView AS ",
                              "Random statements"))
would create a view on your backend.
One difference: I'm using SQLite.

SQL on large datasets from several Access databases

I'm working on a process improvement that will use SQL in R to work with large datasets. Currently the source data is stored in several different MS Access databases. My initial approach was to use RODBC to read all of the source data into R and then use sqldf() to summarize the data as needed, but I'm running out of RAM before I can even begin to use sqldf().
Is there a more efficient way for me to complete this task using R? I've been looking for a way to run a SQL query that joins the separate databases before reading them into R, but so far I haven't found any packages that support this functionality.
If your data is in a database, dplyr (a part of the tidyverse) would be the tool you are looking for.
You can use it to connect to a local / remote database, push your joins / filters / whatever there, and collect() the result as a data frame. You will find the process neatly summarized at http://db.rstudio.com/dplyr/
What I am not quite certain of - but it is an MS Access issue rather than an R issue - is the means of accessing data across multiple MS Access databases.
You may need to write custom SQL code for that and pass it to one of the databases via DBI::dbGetQuery(), letting MS Access handle the database links; a sketch of the dplyr workflow is below.
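A hedged sketch of that workflow, assuming the Microsoft Access ODBC driver is installed; the driver string, file path, table, and column names are placeholders:
library(DBI)
library(odbc)
library(dplyr)
# Connect to one Access database through its ODBC driver
con <- dbConnect(odbc::odbc(),
                 .connection_string = "Driver={Microsoft Access Driver (*.mdb, *.accdb)};Dbq=C:/data/source1.accdb;")
# Build the query lazily; the join, filter and count are executed inside Access
orders  <- tbl(con, "orders")
clients <- tbl(con, "clients")
summary_df <- orders %>%
  inner_join(clients, by = "client_id") %>%
  filter(order_year == 2023) %>%
  count(region) %>%
  collect()   # only the summarized result is read into R
dbDisconnect(con)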
The link you posted looks promising. If it doesn't yield the intended results, consider linking one Access DB to all the others. Links take almost no memory. Union the links and fetch the data from there.
# Load RODBC package
library(RODBC)
# Connect to Access db
channel <- odbcConnectAccess("C:/Documents/Name_Of_My_Access_Database")
# Get data
data <- sqlQuery(channel, "SELECT * FROM Name_of_table_in_my_database")
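If you go the linking route, the query itself might look like this sketch; Table1_Link and Table2_Link are hypothetical names of linked tables pointing at the other databases:
data <- sqlQuery(channel,
                 "SELECT * FROM Table1_Link
                  UNION ALL
                  SELECT * FROM Table2_Link")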
These URLs may help as well.
https://www.r-bloggers.com/getting-access-data-into-r/
How to connect R with Access database in 64-bit Window?

Spark newbie (ODBC/SparkSQL)

I have a Spark cluster set up and have tried both native Scala and Spark SQL on my dataset, and the setup seems to work for the most part. I have the following questions.
From ODBC/external connectivity to the cluster, what should I expect?
- the admin/developer shapes the data and persists/caches a few RDDs that will be exposed? (Thinking along the lines of Hive tables)
- What would be the equivalent of connecting to a "Hive metastore" in Spark/Spark SQL?
Is thinking along the lines of Hive flawed?
My other questions were:
- when I issue Hive queries (and, say, create tables and such), it uses the same Hive metastore as Hadoop/Hive
- Where do the tables get created when I issue SQL queries using sqlContext?
- If I persist a table, is it the same concept as persisting an RDD?
Appreciate your answers
Nithya
(This is written with Spark 1.1 in mind; be aware that new features tend to be added quickly, and some limitations mentioned below might very well disappear at some point in the future.)
You can use Spark SQL with Hive syntax and connect to a Hive metastore, which will result in your Spark SQL Hive commands being executed on the same data space as if they were executed through Hive directly.
To do that you simply need to instantiate a HiveContext as explained here and provide a hive-site.xml configuration file that specifies, among other things, where to find the Hive metastore.
The result of a SELECT statement is a SchemaRDD, which is an RDD of Row objects that has an associated schema. You can use it just like any other RDD, including cache and persist, and the effect is the same (the fact that the data comes from Hive has no influence here).
If your hive command is creating data, e.g. "CREATE TABLE ... ", the corresponding table gets created in exactly the same place as with regular Hive, i.e. /var/lib/hive/warehouse by default.
Executing Hive SQL through Spark provides you with all the caching benefits of Spark: executing a 2nd SQL query on the same data set within the same spark context will typically be much faster than the first query.
Since Spark 1.1, it is possible to start the Thrift JDBC server, which is essentially an equivalent of HiveServer2 and thus allows you to execute Spark SQL commands through a JDBC connection.
Note that not all Hive features are available (yet?), see details here.
Finally, you can also discard Hive syntax and metastore and execute SQL queries directly on CSV and Parquet files. My best guess is that this will become the preferred approach in the future, although at the moment the set of SQL features available like this is smaller than when using the Hive syntax.

Export R results to vertica DB

I have a Vertica table that should store the results of the analytics I have performed. The connection between R and Vertica is set up and I am able to extract data from Vertica tables, but I am not able to store the results of my analysis in the Vertica table.
Can someone help with how to insert records into Vertica through R commands via RODBC?
Here is the code I tried with Oracle:
install.packages("RODBC")
library("RODBC")
channeldev <- odbcConnect("Dev_k", uid = "krish", pwd = "****", believeNRows = FALSE)
odbcGetInfo(channeldev)
dataframe_dev <- sqlQuery(channeldev, "
  SELECT input_stg_id
  FROM k.input_stg
  WHERE emp_ID = 85
    AND update_timestamp > to_date('8/5/2013 04.00.00', 'mm/dd/yyyy HH24.MI.SS')")
dataframe_dev
sqlSave(channeldev, dataframe_dev, tablename = "K.R2_TEST", append = TRUE)
sqlUpdate(channeldev, dataframe_dev, tablename = "K.R2_TEST", index = "INPUT_STG_ID")
You can use basically the same RODBC command sequence you have for Oracle:
Load the RODBC library: library(RODBC)
Connect to the Vertica database: odbcConnect()
Save data: sqlSave(). For performance reasons I'd suggest setting fast=TRUE, disabling auto-commit, and committing the transaction at the end.
However, also consider the Vertica bulk-load utility COPY with the LOCAL option: it works transparently through ODBC and is much, much faster than sqlSave().
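A hedged sketch of the sqlSave() route with those settings, reusing the connection details and data frame from the question's code:
library(RODBC)
channel <- odbcConnect("Dev_k", uid = "krish", pwd = "****", believeNRows = FALSE)
# Turn off auto-commit so the whole load runs as a single transaction
odbcSetAutoCommit(channel, autoCommit = FALSE)
# fast = TRUE transfers the rows with a parametrized INSERT instead of one statement per row
sqlSave(channel, dataframe_dev, tablename = "K.R2_TEST",
        append = TRUE, rownames = FALSE, fast = TRUE)
# Commit the transaction and close the channel
odbcEndTran(channel, commit = TRUE)
odbcClose(channel)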
