Query MS SQL using R with criteria from an R data frame

I have a rather large table in MS SQL Server (120 million rows) which I would like to query. I also have a data frame in R that has unique IDs that I would like to use as part of my query criteria. I am familiar with the dplyr package but not sure if it's possible to have the R query execute on the MS SQL server rather than bring all the data into my laptop's memory (which would likely crash my laptop).
Of course, the other option is to load the data frame into SQL as a table, which is what I am currently doing, but I would prefer not to.

Depending on what exactly you want to do, you may find value in the RODBCext package.
Let's say you want to pull columns from an MS SQL table where the IDs are in a vector that you have in R. You might try code like this:
library(RODBC)
library(RODBCext)
library(tidyverse)

dbconnect <- odbcDriverConnect('driver={SQL Server};
server=servername;database=dbname;trusted_connection=true')

v1 <- c(34, 23, 56, 87, 123, 45)
qdf <- data.frame(idlist = v1)

# note: the placeholder is plain SQL "?", not R's %in%;
# sqlExecute() runs the query once per row of qdf, binding one id each time
sqlq <- "SELECT * FROM tablename WHERE idcol IN ( ? )"
qr <- sqlExecute(dbconnect, sqlq, qdf, fetch = TRUE)
Basically, you want to put all the info you want to pass to the query into a data frame. Think of the columns as variables or parameters for your query: for each parameter, you want one column in the data frame. Then you write the query as a character string, store it in a variable, and put it all together using the sqlExecute function.
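Since the question specifically mentions dplyr, a hedged sketch of the dbplyr route may also help: dplyr verbs applied to a remote table are translated to SQL and executed on the server, and `collect()` brings back only the matching rows. The table name (`tablename`) and column name (`idcol`) below are placeholders; `lazy_frame()`/`simulate_mssql()` merely preview the SQL that a real connection would send.

```r
library(dplyr)
library(dbplyr)

v1 <- c(34, 23, 56, 87, 123, 45)

# With a live connection you would write:
#   tbl(con, "tablename") %>% filter(idcol %in% v1) %>% collect()
# lazy_frame() lets us inspect the generated SQL without a server:
lf <- lazy_frame(idcol = integer(), con = simulate_mssql())
lf %>% filter(idcol %in% v1) %>% show_query()
```

The `%in%` filter is rendered as a SQL `IN` clause, so the 120-million-row table never leaves the server.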

Related

sqlQuery to append new data to R object based on R object

I have created an R data frame that currently has 691,221 rows of data, and I want to keep adding to it without repeating or recreating the whole data frame every time. So, I just want to append the new data. The original data is in a SQL database that I have to access, and this is my first time ever using the RODBC library.
# this was my initial query to get the first batch of data and create the 691,000-row df
locs <- sqlQuery(con, 'SELECT * FROM v_AllLocs', rows_at_time = 1)
Now tomorrow, for example, I want to append only the new data that comes in. Is there some command in the RODBC library that can recognize this from an R object and previous command lines? Or, since I have a date/time stamp as one of the columns, I thought I could reference that somehow. I was thinking something like:
lastloc<-max(locs$acq_time_ak)
new<-sqlQuery(con, 'SELECT * FROM v_AllLocs where acq_time_ak'> lastloc , rows_at_time = 1)
locs<-rbind(locs, new)
However, I don't think sqlQuery can recognize the R object inside its query string, or perhaps the database can't handle lastloc being a POSIXct; either way, it doesn't work. Also, this is really simplistic, because in reality I have subsets of information where individual X may have a time stamp different from individual Y's. But maybe just to get started: how can I get the latest data to add to the R object?
Or, regardless of the data within SQL, can I ask for all data the SQL db has received since a given date? So, no matter any attribute within the database, I just know that as of November 16, 2021, any new data coming in would be selected. Then for subsequent queries I'd have to change the date, or something?
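No answer is shown for this question, but the immediate problem in the snippet is the quoting: the R object cannot sit inside the quoted SQL string, so the comparison value has to be formatted into the query text (or bound as a parameter) before calling sqlQuery(). A minimal sketch, assuming the view and column names from the question:

```r
# in practice: lastloc <- max(locs$acq_time_ak)
lastloc <- as.POSIXct("2021-11-16 08:30:00")

# build the SQL string with the timestamp rendered as a literal
sqlq <- sprintf(
  "SELECT * FROM v_AllLocs WHERE acq_time_ak > '%s'",
  format(lastloc, "%Y-%m-%d %H:%M:%S")
)

# with an open RODBC connection `con`:
# new  <- sqlQuery(con, sqlq, rows_at_time = 1)
# locs <- rbind(locs, new)
sqlq
```

Whether the literal format is accepted depends on the database's timestamp parsing, so treat this as a sketch rather than a guaranteed recipe.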

R and PostgreSQL - pre-specify possible column names and types

I have multiple large, similar data files stored in .csv format. These data files are released annually. Most of them have the same variables, but in some years variables were added or renamed.
I am looping through my directory of files (~30 .csv files), converting them to data frames, and importing them to a Google Cloud SQL PostgreSQL 12 database via:
DBI::dbAppendTable(con, tablename, df)
where con is my connection to the database, tablename is the table name, and df is the data frame produced from a .csv.
The problem is each of these .csv files will have a different number of columns and some won't have columns others have.
Is there an easy way to pre-define a structure in the PostgreSQL 12 database that specifies "any of these .csv columns all go into this one database column" and also "any columns not included in the .csv should be filled with NA in the database"? I think I could come up with something in R to make all the data frames look similar prior to uploading to the database, but it seems cumbersome. I am imagining a document, like a JSON, that the SQL database compares against, kind of like below:
SQL | Data frame
----------------------------------
age = "age","Age","AGE"
sex = "Sex","sex","Gender","gender"
...
fnstatus = "funcstatus","FNstatus"
This would specify to the database all the possible columns it might see and how to parse those. And for columns it doesn't see in a given .csv, it would fill all records with NA.
While I cannot say whether such a feature is available, as Postgres has many novel methods and extended data types, I would be hesitant to rely on one, since maintainability can be a challenge.
Enterprise, server-based relational databases like PostgreSQL should be planned infrastructure. As r2evans comments, tables (including schemas, columns, users, etc.) should be defined up front: designers need to think through all uses and needs before any data migration or interaction. Dynamically adjusting database tables and columns for one-off application needs is usually not recommended. Instead, clients like R should dynamically align their data to meet the planned, relational database specification.
One approach is to use a temporary table as staging for all raw CSV data, possibly with every column set to VARCHAR. Populate this table with the raw data, then migrate it into the final destination in a single append query, using COALESCE and :: for type casting to the final column types.
# BUILD LIST OF DFs FROM ALL CSVs
df_list <- lapply(list_of_csvs, read.csv)

# NORMALIZE ALL COLUMN NAMES TO LOWER CASE
df_list <- lapply(df_list, function(df) setNames(df, tolower(names(df))))

# RETURN VECTOR OF UNIQUE NAMES ACROSS ALL DFs
all_names <- unique(unlist(lapply(df_list, names)))

# CREATE TABLE QUERY
dbExecute(con, "DROP TABLE IF EXISTS myTempTable")

sql <- paste("CREATE TABLE myTempTable (",
             paste(all_names, collapse = " VARCHAR(100), "),
             "VARCHAR(100)",
             ")")
dbExecute(con, sql)

# APPEND DATA FRAMES TO TEMP TABLE
lapply(df_list, function(df) DBI::dbAppendTable(con, "myTempTable", df))
# RUN FINAL CLEANED APPEND QUERY
sql <- "INSERT INTO myFinalTable (age, sex, fnstatus, ...)
SELECT COALESCE(age)::int
, COALESCE(sex, gender)::varchar(5)
, COALESCE(funcstatus, fnstatus)::varchar(10)
...
FROM myTempTable"
dbExecute(con, sql)
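The mapping table the question sketches can also be applied client-side in R, before any upload, by renaming each data frame's columns against a lookup vector and filling absent columns with NA. A sketch, using the hypothetical column variants from the question:

```r
# Hypothetical rename map: names are the canonical database columns,
# values are the variants seen across the CSVs
rename_map <- c(age = "age", age = "Age", age = "AGE",
                sex = "Sex", sex = "sex", sex = "Gender", sex = "gender",
                fnstatus = "funcstatus", fnstatus = "FNstatus")
all_cols <- unique(names(rename_map))

normalize_df <- function(df, map, all_cols) {
  hits <- match(names(df), map)                 # which variants matched
  names(df)[!is.na(hits)] <- names(map)[hits[!is.na(hits)]]
  for (m in setdiff(all_cols, names(df))) {     # absent columns -> NA
    df[[m]] <- NA
  }
  df[all_cols]                                  # consistent column order
}

normalize_df(data.frame(Age = c(30, 40), Gender = c("m", "f")),
             rename_map, all_cols)
```

Each normalized data frame then has an identical shape, so `DBI::dbAppendTable(con, tablename, df)` works unchanged in the existing loop.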

Analyze big data in R on EC2 server

I managed to load and merge the 6 heavy Excel files I had from my RStudio instance (on an EC2 server) into one single table in PostgreSQL (linked with RDS).
This table now has 14 columns and 2.4 million rows.
The size of the table in PostgreSQL is 1059 MB.
The EC2 instance is a t2.medium.
I wanted to analyze it, so I thought I could simply load the table with DBI package and perform different operations on it.
So I did:
my_big_df <- dbReadTable(con, "my_big_table")
my_big_df <- unique(my_big_df)
and my RStudio froze, out of memory...
My questions would be:
1) Is what I have been doing an OK/good practice for handling big tables like this?
2) If yes to 1), is increasing the EC2 server's memory the only way to perform unique() and similar operations?
3) If yes to 2), how do I know by how much I should increase the EC2 server's memory?
Thanks!
dbReadTable converts the entire table to a data.frame, which is not what you want to do for such a big table.
As #cory told you, you need to extract the required info using SQL queries.
You can do that with DBI using combinations of dbSendQuery, dbBind, dbFetch, or dbGetQuery.
For example, you could define a function to get the required data:
filterBySQLString <- function(databaseDB, sqlString) {
  sqlString <- as.character(sqlString)
  dbResponse <- dbSendQuery(databaseDB, sqlString)
  requestedData <- dbFetch(dbResponse)
  dbClearResult(dbResponse)
  return(requestedData)
}

# write your query so the de-duplication happens in the database,
# making the R-side unique() unnecessary
SQLquery <- "SELECT DISTINCT ... FROM my_big_table"

my_big_df <- filterBySQLString(myDB, SQLquery)
If you cannot use SQL, then you have two options:
1) stop using RStudio and run your code from the terminal or via Rscript, or
2) beef up your instance.

Get all rows that DBI::dbWriteTable has just written

I want to use dbWriteTable() of R's DBI package to write data into a database. Usually, the respective tables are already present, so I use the argument append = TRUE. How do I get which rows were added to the table by dbWriteTable()? Most of the tables have certain columns with UNIQUE values, so a SELECT will work (see below for a simple example). However, this is not true for all of them, or only several columns together are UNIQUE, making the SELECT more complicated. In addition, I would like to put the writing and querying into a function, so I would prefer a consistent approach for all cases.
I mainly need this to get the PRIMARY KEY's added by the database and to allow a user to quickly see what was added. If important, my database is PostgreSQL and I would like to use the odbc package for connection.
I have something like this in mind, however, I am looking for a more general solution:
library(DBI)
con <- dbConnect(odbc::odbc(), dsn = "database")
dbWriteTable(con,
             name = "site",
             value = data.frame(name = c("abcd", "efgh")),
             append = TRUE)
dbGetQuery(con,
           "SELECT * FROM site WHERE name IN ('abcd', 'efgh');")

Using R to Insert Records into a Database using apply

I have a table I wish to insert records into in a Teradata environment using R.
I have connected to the DB and created my table using JDBC.
From reading the documentation, there doesn't appear to be an easy way to insert records into the system except to create your own manual INSERT statements. I am trying to do this with a vectorized approach using apply (or anything similar).
Below is my code, but I'm clearly not using apply correctly. Can anyone help?
s <- seq(1:1000)
str_update_table <- sprintf("INSERT INTO foo VALUES (%s)", s)
# Set Up the Connections
myconn <- dbConnect(drv,service, username, password)
# Attempt to run each of the 1000 sql statements
apply(str_update_table,2,dbSendUpdate,myconn)
I don't have the infrastructure to test this, but you are passing a vector to apply, where apply expects an array (a matrix or higher-dimensional object). With your vector str_update_table, the MARGIN of 2 in apply does not make sense.
Try Map like in
Map(function(x) dbSendUpdate(myconn, x), str_update_table)
(untested)
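An alternative sketch, assuming the connection comes from RJDBC: dbSendUpdate() accepts bind parameters for a prepared statement, so the values need not be pasted into the SQL text at all (also untested here):

```r
# one parameterized statement, reused for every value
stmt <- "INSERT INTO foo VALUES (?)"

# with an open RJDBC connection `myconn`:
# for (val in 1:1000) dbSendUpdate(myconn, stmt, val)
```

Binding parameters avoids quoting problems that sprintf-built statements can run into with character or date columns.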
