Clearing specific rows using RODBC - r

I would like to use the RODBC package to partially overwrite a Microsoft Access table with a data frame. Rather than overwriting the entire table, I am looking for a way to remove only specific rows from that table and then append my data frame to its end.
My method for appending the frame is pretty straightforward. I would use the following function:
sqlSave(ch, df, tablename = "accessTable", rownames = F, append = T)
The challenge is finding a function that will let me clear specific rows from the Access table ahead of time. The sqlDrop and sqlClear functions do not seem to get me there, since they delete or clear the entire table.
Any recommendation to achieve this task would be much appreciated!

Indeed, consider using sqlQuery to pull only the rows you want to keep from the Access table, then rbind them with your current data frame, and finally sqlSave the result, deliberately overwriting the original Access table with append = FALSE.
# IMPORT QUERY RESULTS INTO DATAFRAME
keeprows <- sqlQuery(ch, "SELECT * FROM [accesstable] WHERE timedata >= somevalue")
# CONCATENATE df to END
finaldata <- rbind(keeprows, df)
# OVERWRITE ORIGINAL ACCESS TABLE
sqlSave(ch, finaldata, tablename = "accessTable", rownames = FALSE, append = FALSE)
Of course you can also do the reverse: delete the unwanted rows from the table with an action query, then append (not overwrite) with sqlSave:
# ACTION QUERY TO RUN IN DATABASE
sqlQuery(ch, "DELETE FROM [accesstable] WHERE timedata <= somevalue")
# APPEND TO ACCESS TABLE
sqlSave(ch, df, tablename = "accessTable", rownames = FALSE, append = TRUE)
The key is finding the SQL logic that specifies the rows you intend to keep.
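For example, if the cutoff lives in an R variable, you can build the DELETE statement from it before appending. A minimal sketch, reusing the timedata column from above with a hypothetical cutoff date (Access wraps date literals in #):
# BUILD ACTION QUERY FROM AN R VALUE (hypothetical cutoff)
cutoff <- as.Date("2020-01-01")
del_sql <- sprintf("DELETE FROM [accesstable] WHERE timedata <= #%s#",
                   format(cutoff, "%m/%d/%Y"))
sqlQuery(ch, del_sql)
# THEN APPEND THE DATA FRAME AS BEFORE
sqlSave(ch, df, tablename = "accessTable", rownames = FALSE, append = TRUE)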

Related

R query database tables iteratively without for loop with lambda or vectorized function for Shiny app

I am connecting to a SQL Server database through the ODBC connection in R. I have two potential methods to get data, and am trying to determine which would be more efficient. The data is needed for a Shiny dashboard, so the data needs to be pulled while the app is loading rather than querying on the fly as the user is using the app.
Method 1 is to use over 20 stored procedures to query all of the needed data and store them for use. Method 2 is to query all of the tables individually.
Here is the method I used to query one of the stored procedures:
get_proc_data <- function(proc_name, url, start_date, end_date){
  dbGetQuery(con, paste0(
    "EXEC dbo.", proc_name, " ",
    "@URL = N'", url, "', ",
    "@Startdate = '", start_date, "', ",
    "@enddate = '", end_date, "' "
  ))
}
data <- get_proc_data(proc_name, url, today(), today() %m-% years(5))
However, each of the stored procedures has a slightly different setup for the parameters, so I would have to define each of them separately.
I have started to implement Method 2, but have run into issues with iteratively querying each table.
# use dplyr to create a list of table names
db_tables <- dbGetQuery(con, "SELECT * FROM [database_name].INFORMATION_SCHEMA.TABLES;") %>% select(TABLE_NAME)
# use dplyr's pull to create a vector of names
table_list <- pull(db_tables, TABLE_NAME)
# get a quick look at the first few rows
tbl(con, "[TableName]") %>% head() %>% glimpse()
# iterate through all table names, get the first few rows, and export to .csv
for (table in table_list){
  write.csv(
    tbl(con, table) %>% head(), str_glue("{getwd()}/00_exports/tables/{table}.csv")
  )
}
selected_tables <- db_tables %>% filter(TABLE_NAME %in% c("TableName1", "TableName2"))
Ultimately this method was just to test how long it would take to iterate through the ~60 tables and perform the required function. I have tried putting this into a function instead but have not been able to get it to iterate through while also pulling the name of the table.
Pro/Con for Method 1: The stored procs are currently powering a metrics plug-in that was written in C++ and is displaying metrics on the webpage. This is for internal use to monitor website performance. However, the stored procedures are not all visible to me and the client needs me to extend their current metrics. I also do not have a DBA at my disposal to help with the SQL Server side, and the person who wrote the procs is unavailable. The procs are also using different logic than each other, so joining the results of two different procs gives drastically different values. For example, depending on the proc, each date will list total page views for each day or already be aggregated at the weekly or monthly scale then listed repeatedly. So joining and grouping causes drastic errors in actual page views.
Pro/Con for Method 2: I am familiar with dplyr and would be able to join the tables together to pull the data I need. However, I am not as familiar with SQL and there is no Entity-Relationship Diagram (ERD) of any sort to refer to. Otherwise, I would build each query individually.
Either way, I am trying to come up with a way to proceed with either a named function, lambda function, or vectorized method for iterating. It would be great to name each variable and assign them appropriately so that I can perform the data wrangling with dplyr.
Any help would be appreciated, I am overwhelmed with which direction to go. I researched the R equivalent of Python list comprehension but have not been able to get a function in R to perform similarly.
> db_table_head_to_csv <- function(table) {
+ write.csv(
+ tbl(con, table) %>% head(), str_glue("{getwd()}/00_exports/bibliometrics_tables/{table}.csv")
+ )
+ }
>
> bibliometrics_tables %>% db_table_head_to_csv()
Error in UseMethod("as.sql") :
no applicable method for 'as.sql' applied to an object of class "data.frame"
Consider storing all table data in a named list (the counterpart to a Python dictionary) using lapply (the counterpart to Python's list/dict comprehension). And if you use its sibling, sapply, with simplify = FALSE, the character vector you pass in becomes the names of the list elements:
# RETURN VECTOR OF TABLE NAMES
db_tables <- dbGetQuery(
  con, "SELECT [TABLE_NAME] FROM [database_name].INFORMATION_SCHEMA.TABLES"
)$TABLE_NAME

# RETURN NAMED LIST OF DATA FRAMES FOR EACH DB TABLE
df_list <- sapply(db_tables, function(t) dbReadTable(con, t), simplify = FALSE)
You can extend the anonymous function to multiple steps such as write.csv, or use a separately defined function. Just be sure to return a data frame as the last line. The version below uses the native pipe, |>, available in base R 4.1.0+:
db_table_head_to_csv <- function(table) {
  head_df <- dbReadTable(con, table) |> head()
  write.csv(
    head_df,
    file.path("00_exports", "bibliometrics_tables", paste0(table, ".csv"))
  )
  return(head_df)
}

df_list <- sapply(db_tables, db_table_head_to_csv, simplify = FALSE)
You lose no data frame functionality when the frames are stored in a list, and you can extract any element by name with $ or [[:
# EXTRACT SPECIFIC ELEMENT
head(df_list$table_1)
tail(df_list[["table_2"]])
summary(df_list$`table_3`)
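Because the result is a plain named list, the same wrangling step can also be mapped over every table at once. A minimal sketch (the head(100) step and the two table names are placeholders, and stacking only makes sense if the tables share columns):
library(dplyr)
# APPLY THE SAME (PLACEHOLDER) STEP TO EVERY IMPORTED TABLE; ELEMENT NAMES ARE KEPT
wrangled_list <- lapply(df_list, function(d) d %>% head(100))
# OR STACK SELECTED TABLES INTO ONE FRAME, TAGGING EACH ROW WITH ITS SOURCE TABLE
combined <- bind_rows(df_list[c("TableName1", "TableName2")], .id = "source_table")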

How to append data from R to Oracle db table with identity column

I created a table in Oracle like
Create table t1
(id_record NUMERIC GENERATED AS IDENTITY START WITH 500000 NOT NULL,
col1 numeric(2,0),
col2 varchar(10),
primary key(id_record))
where id_record is an identity column whose value is generated automatically when rows are appended to the table.
I create a data.frame in R with two columns (table_in_R <- data.frame(col1, col2)); let's skip the actual values for simplicity.
When I append data from R to Oracle db using the following code
dbWriteTable(con, 't1', table_in_R,
append =T, row.names=F, overwrite = F)
where con is a connection object, the error ORA-00947 ("not enough values") arises and no data is appended.
When I slightly modify my code (append = F, overwrite = T):
dbWriteTable(con_dwh, 't1', table_in_R,
append =FALSE, row.names=F, overwrite = TRUE)
the data is appended, but the identity column id_record is dropped.
How can I append data to Oracle db without dropping the identity column?
I'd never (based on this answer) recommend the one-step approach where dbWriteTable directly maintains the target table.
Instead I'd recommend a two-step approach, where the R part fills a temporary table (with overwrite = TRUE, i.e. DROP and CREATE):
df <- data.frame(col1 = c(as.integer(1),as.integer(0)), col2 = c('x',NA))
dbWriteTable(jdbcConnection,"TEMP", df, rownames=FALSE, overwrite = TRUE, append = FALSE)
In the second step you simply add the new rows to the target table using
insert into t1(col1,col2) select col1,col2 from temp;
You may run it directly against the database or from R:
res <- dbSendUpdate(jdbcConnection,"insert into t1(col1,col2) select col1,col2 from temp")
Note there is a workaround anyway:
Define the identity column as
id_record NUMERIC GENERATED BY DEFAULT ON NULL AS IDENTITY
This configuration of the identity column supplies the correct sequence value instead of the NULL value - but you will then hit the above-linked problem of inserting NULL into a NUMBER column.
So the second trick is to add the identity column to your data.frame and fill it entirely with as.character(NA):
df <- data.frame(id_record =c(as.character(NA),as.character(NA) ), col1 = c(as.integer(1),as.integer(0)), col2 = c('x',NA))
dbWriteTable(jdbcConnection,"T1", df, rownames=FALSE, overwrite = F, append = T)
The test works fine, but as mentioned, I'd recommend the two-step approach.
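If you end up running this regularly, the two steps can be wrapped in one small helper. A minimal sketch reusing the names from the two-step example above (RJDBC connection, no error handling):
append_via_temp <- function(conn, df) {
  # STEP 1: DROP/CREATE THE STAGING TABLE FROM THE DATA FRAME
  dbWriteTable(conn, "TEMP", df, rownames = FALSE, overwrite = TRUE, append = FALSE)
  # STEP 2: LET THE DATABASE FILL THE IDENTITY COLUMN ON INSERT
  dbSendUpdate(conn, "insert into t1(col1, col2) select col1, col2 from temp")
}

append_via_temp(jdbcConnection, df)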

Using clinicaltrials.gov database in R

I am trying to use R to access the clinicaltrials.gov AACT database to create a list of facility_investigators for a specific topic.
The following code is an example of how to get a list of all clinical trials on the topic TP53
library(dplyr)
library(RPostgreSQL)
aact = src_postgres(dbname = 'aact',
host = "aact-db.ctti-clinicaltrials.org",
user = 'aact',
password = 'aact')
study_tbl = tbl(src=aact, 'studies')
x = study_tbl %>% filter(official_title %like% '%TP53%') %>% collect()
Similarly, if I want a list of principal investigators,
library(dplyr)
library(RPostgreSQL)
aact = src_postgres(dbname = 'aact',
host = "aact-db.ctti-clinicaltrials.org",
user = 'aact',
password = 'aact')
study_tbl = tbl(src=aact, 'facility_investigators')
I am unable to make a list of facility_investigators restricted to TP53 studies only (something like TP53 & facility_investigators). Any help would be appreciated.
This is a link where some explanation is provided, but my problem is not resolved - http://www.cancerdatasci.org/post/2017/03/approaches-to-accessing-clinicaltrials.gov-data/
Is this what you're asking? You're pulling from two different tables in the same database: the first one is 'studies' and the second one is 'facility_investigators'. What you need to do is run head() on each table (or glimpse() or str()) and see if the two tables have a common variable you can merge on after you load them into R. If they do, then you would do something like this:
library(dplyr)
merged_table <- inner_join(x, study_table, by = "common column")
If the columns have different names, it would look like this:
library(dplyr)
merged_table <- inner_join(x, study_table, by = c("x_column_name" = "study_table_column_name"))
From there you can limit your dataset to just facility investigators that have the characteristics you want.
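For example, a minimal dplyr sketch, assuming the two AACT tables share the trial identifier column nct_id (confirm the common column with glimpse() first):
library(dplyr)
fi_tbl <- tbl(src = aact, 'facility_investigators')
# KEEP ONLY INVESTIGATORS WHOSE TRIAL ID APPEARS IN THE TP53 STUDIES (x) COLLECTED ABOVE
tp53_investigators <- fi_tbl %>%
  filter(nct_id %in% !!x$nct_id) %>%
  collect() %>%
  inner_join(x, by = "nct_id")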
If you want to do it in one PostgreSQL query, you can do it like so (note that string literals in PostgreSQL take single quotes, while double quotes are reserved for identifiers). For more information about this syntax in particular, see page 18:
con <- dbConnect() # use the same parameters you use above to connect
query <- dbSendQuery(con,
  "select s.*, fi.*
   from (select * from studies where official_title like '%TP53%') as s
   inner join facility_investigators as fi
     on s.\"joining column\" = fi.\"joining column\""
)
r_dataset <- fetch(query)
# I would just close the connection in RStudio using the connection tab.
The above query has an inner join in the main query and a subquery in the FROM statement. The subquery performs the filtering you were trying to do in R; it essentially allows you to select only from the table where the results are already filtered. An inner join combines all the records the two tables have in common and puts them into one table. If you need to join on more than one column, add an AND between the two conditions in the ON clause (see the short example below).
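A short sketch of such a two-column ON clause, still using placeholder column names:
query <- dbSendQuery(con,
  "select s.*, fi.*
   from (select * from studies where official_title like '%TP53%') as s
   inner join facility_investigators as fi
     on  s.\"joining column 1\" = fi.\"joining column 1\"
     and s.\"joining column 2\" = fi.\"joining column 2\""
)
r_dataset <- fetch(query)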

creating a looped SQL QUERY using RODBC in R

First and foremost - thank you for taking your time to view my question, regardless of if you answer or not!
I am trying to create a function that loops through my df and pulls the necessary data from SQL using the RODBC package in R. However, I am having trouble setting up the query, since the parameters of the query change with each iteration (example below).
So my df looks like this:
ID Start_Date End_Date
1 2/2/2008 2/9/2008
2 1/1/2006 1/1/2007
1 5/7/2010 5/15/2010
5 9/9/2009 10/1/2009
How would I go about specifying the start date and end date in my sql program?
Here's what I have so far:
data_pull <- function(df) {
  a <- data.frame()
  b <- data.frame()
  for (i in df$id) {
    dbconnection <- odbcDriverConnect(".....")
    query <- paste("Select ID, Date, Account_Balance from Table where ID = (",i,") and Date > (",df$Start_Date,") and Date <= (",df$End_Date,")")
    a <- sqlQuery(dbconnection, paste(query))
    b <- rbind(b,a)
  }
  return(b)
}
However, this doesn't query in anything. I believe it has something to do with how I am specifying the start and the end date for the iteration.
If anyone can help on this it would be greatly appreciated. If you need further explanation, please don't hesitate to ask!
A couple of syntax issues arise from the current setup:
LOOP: You do not iterate through the rows of the data frame but only through the atomic ID values in the single column df$ID. In that same loop you pass the entire df$Start_Date and df$End_Date vectors into the query concatenation.
DATES: Your date format does not match the 'YYYY-MM-DD' format most databases expect, and some databases such as Oracle require an explicit string-to-date conversion: TO_DATE(mydate, 'YYYY-MM-DD'). A one-line conversion example follows.
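A minimal sketch of that conversion in R:
# CONVERT AN m/d/Y STRING TO THE ISO FORMAT MOST DATABASES EXPECT
start_iso <- format(as.Date("2/2/2008", format = "%m/%d/%Y"), "%Y-%m-%d")
# start_iso is "2008-02-02", ready to drop into a WHERE clause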
A couple of performance / best-practice issues also arise:
PARAMETERIZATION: While parameterization is not strictly needed for security here, since your values do not come from user input that could inject malicious SQL, parameterized queries are still advised for maintainability and readability.
GROWING OBJECTS: According to Patrick Burns' The R Inferno, Circle 2: Growing Objects, R programmers should avoid growing multi-dimensional objects such as data frames inside a loop, which causes excessive copying in memory. Instead, build a list of data frames and rbind them once outside the loop (see the sketch after this list).
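A minimal sketch of the loop corrected along those lines (RODBC itself does not bind query parameters, so values are interpolated with sprintf here; the column and table names come from the question):
dbconnection <- odbcDriverConnect(".....")

# CONVERT THE m/d/Y STRINGS ONCE, UP FRONT
starts <- format(as.Date(df$Start_Date, format = "%m/%d/%Y"), "%Y-%m-%d")
ends   <- format(as.Date(df$End_Date,   format = "%m/%d/%Y"), "%Y-%m-%d")

# BUILD ONE QUERY PER ROW (NOT PER ID), THEN IMPORT INTO A LIST OF DATA FRAMES
queries <- Map(function(id, s, e) {
  sprintf("SELECT ID, Date, Account_Balance FROM Table WHERE ID = %s AND Date > '%s' AND Date <= '%s'",
          id, s, e)
}, df$ID, starts, ends)

df_list <- lapply(queries, function(q) sqlQuery(dbconnection, q))

# BIND ONCE, OUTSIDE THE LOOP
final_df <- do.call(rbind, df_list)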
With that said, you can avoid any looping or list-building entirely by saving your data frame as a database table and joining it to the final table for a filtered query import. This assumes your database user has CREATE TABLE and DROP TABLE privileges.
# CONVERT DATE FIELDS TO DATE TYPE
df <- within(df, {
  Start_Date = as.Date(Start_Date, format = "%m/%d/%Y")
  End_Date = as.Date(End_Date, format = "%m/%d/%Y")
})
# SAVE DATA FRAME TO DATABASE
sqlSave(dbconnection, df, "myRData", rownames = FALSE, append = FALSE)
# IMPORT JOINED AND DATE FILTERED QUERY
q <- "SELECT ID, Date, Account_Balance
FROM Table t
INNER JOIN myRData r
ON r.ID = t.ID
AND t.Date BETWEEN r.Start_Date AND r.End_Date"
final_df <- sqlQuery(dbconnection, q)

How to force Hive to distribute data equally among different reducers?

Imagine I want to send the Iris dataset, which I have as a Hive table, to different reducers in order to run the same task in parallel in R. I can execute my R script through the TRANSFORM function and use LATERAL VIEW EXPLODE in Hive to do a Cartesian product of the iris dataset and an array containing my "partition" variable, as in the query below:
set source_table = iris;
set x_column_names = "sepallenght|sepalwidth|petallength|petalwidth";
set y_column_name = "species";
set output_dir = "/r_output";
set model_name ="paralelism_test";
set param_var = params;
set param_array = array(1,2,3);
set mapreduce.job.reduces=3;
select transform(id, sepallenght, sepalwidth, petallength, petalwidth, species, ${hiveconf:param_var})
using 'controlScript script.R ${hiveconf:x_column_names}${hiveconf:y_column_name}${hiveconf:output_dir}${hiveconf:model_name}${hiveconf:param_var}'
as (script_result string)
from
(select *
from ${hiveconf:source_table}
lateral view explode ( ${hiveconf:param_array} ) temp_table
as ${hiveconf:param_var}
distribute by ${hiveconf:param_var}
) data_query;
I call a memory control script, so please ignore it for the sake of objectivity.
What my script.R returns is the unique parameter value it received (the "params" column populated with the "param_var" array values) and the number of rows in the partition it got, as follows:
#The aim of this script is to validate the parallel computation of R scripts through Hive.
compute_model <- function(data){
  paste("parameter ", unique(data[ncol(data)]), ", ", nrow(data), "lines")
}

main <- function(args){
  #Reading the input parameters
  #These inputs were passed along the transform's "using" clause, on Hive.
  x_column_names <- as.character(unlist(strsplit(gsub(' ', '', args[1]), '\\|')))
  y_column_name <- as.character(args[2])
  target_dir <- as.character(args[3])
  model_name <- as.character(args[4])
  param_var_name <- as.character(args[5])

  #Reading the data table
  f <- file("stdin")
  open(f)
  data <- tryCatch({
      as.data.frame(
        read.table(f, header = FALSE, sep = '\t', stringsAsFactors = T, dec = '.')
      )
    },
    warning = function(w) cat(w),
    error = function(e) stop(e),
    finally = close(f)
  )

  #Computes the model. Here, the model can be any computation.
  instance_result <- as.character(compute_model(data))

  #Writes the result to "stdout" separated by '\t'. This output must be a data frame where
  #each column represents a Hive table column.
  write.table(instance_result,
              quote = FALSE,
              row.names = FALSE,
              col.names = FALSE,
              sep = "\t",
              dec = '.'
  )
}

#Main code
###############################################################
main(commandArgs(trailingOnly = TRUE))
What I want Hive to do is replicate the Iris dataset equally among these reducers. It works fine when I put sequential values in my param_array variable, but for values like array(10, 100, 1000, 10000) with mapreduce.job.reduces=4, or array(-5,-4,-3,-2,-1,0,1,2,3,4,5) with mapreduce.job.reduces=11, some reducers won't receive any data and others will receive more than one key.
The question is: is there a way to make sure Hive distributes each partition to a different reducer?
Did I make myself clear?
It may look silly, but I want to run a grid search on Hadoop and have restrictions on using other technologies that would be more suitable for this task.
Thank you!
