How to fetch data in batches using R

I have a dataframe in R with the following structure:
ID Date
ID-1 2020-02-10 13:12:04
ID-2 2020-02-12 15:02:24
ID-3 2020-02-14 12:25:32
I am using the following query to fetch the data from MySQL. This is where I run into a problem, because I have a large number of IDs (~90K). When I pass 500-1000 IDs it works fine, but passing all 90K IDs throws an error.
Data_frame <- paste0("
SELECT c.ID, e.name, d.output
FROM Table1 c
LEFT OUTER JOIN Table2 d ON d.ID = c.ID
LEFT OUTER JOIN Table1 e ON e.ID_2 = d.ID_2
WHERE e.name IN ('Name1','Name2')
AND c.ID IN (", paste(shQuote(DF$ID, type = "sh"), collapse = ', '), ")
;")
The query returns output in the following form, which I then need to merge with DF by ID.
Query_Output:
ID Name output
ID-1 Name1 23
ID-1 Name2 20
ID-2 Name1 40
ID-2 Name2 97
ID-3 Name1 34
ID-3 Name2 53
Required Output:
ID Date Name1 Name2
ID-1 2020-02-10 13:12:04 23 20
ID-2 2020-02-12 15:02:24 40 97
ID-3 2020-02-14 12:25:32 34 53
I have tried the below-mentioned code:
createIDBatchVector <- function(x, batchSize){
  paste0(
    "'",
    sapply(
      split(x, ceiling(seq_along(x) / batchSize)),
      paste,
      collapse = "','"
    ),
    "'"
  )
}
# second helper function: builds one SQL query per batch
# (paste0 is vectorized, so passing the batch vector returns
#  one query string per element of IDbatches)
createQueries <- function(IDbatches){
  paste0("
SELECT c.ID, e.name, d.output
FROM Table1 c
LEFT OUTER JOIN Table2 d ON d.ID = c.ID
LEFT OUTER JOIN Table1 e ON e.ID_2 = d.ID_2
WHERE e.name IN ('Name1','Name2')
AND c.ID IN (", IDbatches, ")
;")
}
# ------------------------------------------------------------------
# and now the actual script
# first we create a vector that contains one batch per element
IDbatches <- createIDBatchVector(DF$ID, 2)
# It looks like this:
# [1] "'ID-1','ID-2'" "'ID-3','ID-4'" "'ID-5'"
# now we create a vector of SQL-queries out of that
queries <- createQueries(IDbatches)
df_final <- data.frame()  # initialize an empty dataframe
conn <- database          # open a connection (placeholder for your DBI connection object)
for (query in queries){   # iterate over the queries
  df_final <- rbind(df_final, dbGetQuery(conn, query))
}

Surprised 90k IDs kills your SQL, but such is life.
Not sure I understand why you are doing what you are doing rather than looping with a for:
for (batches in 0:90) {
b = batches*1000
SELECT ...
... WHERE ID > b & < b+1000
rbind(myData, result)
}
(That's not the solution, just the method.)
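Spelled out against the question's query, the method might look like this minimal sketch. It assumes a numeric ID column and an open DBI connection conn, neither of which the question shows; string IDs like 'ID-1' would need a numeric key to range over.
library(DBI)
myData <- data.frame()
for (batch in 0:90) {
  b <- batch * 1000
  q <- sprintf(
    "SELECT c.ID, e.name, d.output
     FROM Table1 c
     LEFT OUTER JOIN Table2 d ON d.ID = c.ID
     LEFT OUTER JOIN Table1 e ON e.ID_2 = d.ID_2
     WHERE e.name IN ('Name1','Name2')
     AND c.ID > %d AND c.ID <= %d", b, b + 1000)
  myData <- rbind(myData, dbGetQuery(conn, q))  # stack each batch
}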
But if your method is working, then what you want is tidyr::pivot_wider().
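That reshape, plus the merge back onto DF, might look like the following sketch, assuming the batched results were collected into Query_Output with the columns shown above:
library(tidyr)
library(dplyr)
# spread the Name1/Name2 rows into columns, one row per ID
wide <- pivot_wider(Query_Output, names_from = Name, values_from = output)
# attach the dates from DF
final <- left_join(DF, wide, by = "ID")
#   ID   Date                Name1 Name2
# 1 ID-1 2020-02-10 13:12:04    23    20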

Related

Appending and updating DB2 table rows in R object using ibmdbR package

I've got a data.table DT that I'd like to write to DB2 and update using the ibmdbR package.
I upload the first batch using as.ida.data.frame.
> DT<- data.table(A = c(111,222,333,444), MONTH= c('2018-01', '2018-02', '2018-03', '2018-04'), B= c(11,22,33,44))
> DT
A MONTH B
1: 111 2018-01 11
2: 222 2018-02 22
3: 333 2018-03 33
4: 444 2018-04 44
> db2_test <- as.ida.data.frame(DT, table='myschema.TEST', clear.existing=FALSE, case.sensitive=FALSE,
rownames=NULL, dbname='DB_NAME', asAOT=FALSE)
This creates a DB2 table named TEST in my schema in the database.
Then I try to update TEST based on column MONTH using another data.table DT2 by doing:
> DT2 <- data.table(A = c(999,888), MONTH = c('2018-01', '2019-02'), B = c(99,77))
> DT2
A MONTH B
1: 999 2018-01 99
2: 888 2019-02 77
> idaUpdate(myconnection, updf = 'myschema.TEST', dfrm = DT2, idaIndex = 'MONTH')
Error in sqlUpdate(db2Conn, dfrm, updf, index = idaIndex, fast = ifelse(idaIsOracleMode(), :
[RODBC] Failed exec in Update02000 100 [IBM][CLI Driver][DB2/NT64] SQL0100W No row was found for FETCH, UPDATE or DELETE; or the result of a query is an empty table. SQLSTATE=02000
While I receive this error, when I look at the data in the TEST table in the DB2 database, the first entry has changed, which is expected:
A MONTH B
999 2018-01 99
222 2018-02 22
333 2018-03 33
444 2018-04 44
So I think the error comes from the second entry in DT2: there are no rows in TEST with MONTH = '2019-02', so it fails.
But I thought the point of updating with an index column was to substitute the rows that match on the index column and add the rows that don't?
How can I update TEST properly with DT2, so that rows are updated where the month already exists, but new rows are added where nothing in TEST matches the MONTH column of DT2?
Basically, how can I append data properly to a DB2 table from an R object?
I never had issues with AWS. DB2 is a nightmare.
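One way to get that upsert behavior is to split DT2 by whether its MONTH already exists in TEST: update the matches, insert the rest. A minimal sketch, assuming ibmdbR's idaQuery() for the lookup and that myconnection is the underlying RODBC channel (ibmdbR sits on RODBC), so the new rows can go through RODBC::sqlQuery():
# months already present in the target table
existing <- idaQuery("SELECT DISTINCT MONTH FROM myschema.TEST")
to_update <- DT2[MONTH %in% existing$MONTH]   # keys that match  -> update
to_insert <- DT2[!MONTH %in% existing$MONTH]  # new keys         -> insert
if (nrow(to_update) > 0)
  idaUpdate(myconnection, updf = 'myschema.TEST', dfrm = to_update, idaIndex = 'MONTH')
if (nrow(to_insert) > 0)
  for (i in seq_len(nrow(to_insert)))
    RODBC::sqlQuery(myconnection, sprintf(
      "INSERT INTO myschema.TEST (A, MONTH, B) VALUES (%s, '%s', %s)",
      to_insert$A[i], to_insert$MONTH[i], to_insert$B[i]))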

How to pass multiple IDs in batches and then merge them in R

I have the below-mentioned dataframe in R.
DF1
ID Sales Cost Value
RTT-123 10 10000 15000
RTT-456 15 12000 17000
RTT-789 14 14000 19000
The dataframe contains almost ~30K unique IDs. While passing these IDs to Redshift using the below-mentioned query, I am getting the error Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
How can I pass these IDs automatically in batches of 2K while querying, and then merge the output into one single data frame in R?
Query:
df2<-paste0("SELECT ID,list1,list2, date1 FROM table1 b
WHERE b.ID IN (", paste(shQuote(DF1$ID , type = "sh"),collapse = ','),");")
output<-dbGetQuery(link,df2)
Something like this (not tested); here we take 1000 IDs at a time, adjust as per your needs:
library(data.table) # rbindlist
output <- rbindlist(
  lapply(
    # split the IDs into chunks of 1000
    split(DF1$ID, ceiling(seq_along(DF1$ID) / 1000)),
    function(i){
      df2 <- paste0("SELECT ID,list1,list2, date1 FROM table1 b
                     WHERE b.ID IN (",
                    paste(shQuote(i, type = "sh"), collapse = ','),
                    ");")
      dbGetQuery(link, df2)
    }))
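If the combined output then needs to be joined back onto DF1 by ID, a base merge() as a left join (keeping all rows of DF1) should do:
final <- merge(DF1, output, by = "ID", all.x = TRUE)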

Writing a nested for loop to concatenate rows that share a key in a dataframe in R

So, I have a key dataframe of IDs
IDs <- data.frame(c(123,456,789))
I also have a dataframe of split SQL queries that need to be concatenated (there was an issue of the queries truncating due to their length, so I had to split them into pieces)
splitQueriesdf <- data.frame(ID = c(123,123,123,456,456,456,789,789,789), SplitQUery = c("SELECT", "* FROM", "tablename1","SELECT", "* FROM", "tablename2","SELECT", "* FROM", "tablename3"))
I need to write a loop that concatenates the queries, by the IDs present in the IDs dataframe, into a 3rd dataframe. nrow(IDs) will vary, so I need that to be dynamic as well.
So I need the 3rd dataframe to look like:
ID FullQuery
1 123 SELECT * FROM tablename1
2 456 SELECT * FROM tablename2
3 789 SELECT * FROM tablename3
I have an idea that I need a loop that goes through the length of IDs (so 3 times) and a nested loop that appends the correct rows together, but I'm fairly new to R and I'm getting stuck. Here's what I have so far:
dataframe3 = NULL
for (index in 1:nrow(IDs)){
  for (index2 in 1:nrow(splitQueriesdf)){
    dataframe3[index] <- rbind(splitQueriesdf[index2,4])
  }
}
Any help is much appreciated!
One option is aggregate from base R, grouping by 'ID' and then pasting the 'SplitQUery' column:
splitQueriesdf$SplitQUery <- as.character(splitQueriesdf$SplitQUery)
aggregate(cbind(FullQuery = SplitQUery) ~ ID, splitQueriesdf,
FUN = paste, collapse = ' ')
# ID FullQuery
#1 123 SELECT * FROM tablename1
#2 456 SELECT * FROM tablename2
#3 789 SELECT * FROM tablename3
Using the data.table package you can do:
library(data.table)
IDs <- data.frame(ID = c(123,456,789))
splitQueriesdf <- data.frame(ID = c(123,123,123,456,456,456,789,789,789), SplitQUery = c("SELECT", "* FROM", "tablename1","SELECT", "* FROM", "tablename2","SELECT", "* FROM", "tablename3"))
setDT(splitQueriesdf)
splitQueriesdf[ID %in% IDs$ID, .(FullQuery = paste(SplitQUery, collapse = " ")), by = .(ID)]
ID FullQuery
1: 123 SELECT * FROM tablename1
2: 456 SELECT * FROM tablename2
3: 789 SELECT * FROM tablename3
With the tidyverse:
library(dplyr)
splitQueriesdf %>% group_by(ID) %>% summarise(query = paste(SplitQUery, collapse = " "))
## A tibble: 3 x 2
# ID query
# <dbl> <chr>
#1 123 SELECT * FROM tablename1
#2 456 SELECT * FROM tablename2
#3 789 SELECT * FROM tablename3

Own function in RSQLite engine in R

I found this SQL code for SAS and I want to translate it into RSQLite.
proc sql;
create table crspcomp as
select a.*, b.ret, b.date
from ccm1 as a left join crsp.msf as b
on a.permno=b.permno
and intck('month',a.datadate,b.date)
between 3 and 14;
quit;
The first problem that occurred: R does not provide the intck function, which returns the difference in months between two dates. I found a similar function (on Stack Overflow) which looks like this:
# requires the lubridate package (ymd, interval, as.period)
mob <- function (begin, end) {
  begin <- paste(substr(begin, 1, 6), "01", sep = "")
  end   <- paste(substr(end, 1, 6), "01", sep = "")
  mob1  <- as.period(interval(ymd(begin), ymd(end)))
  mob   <- mob1@year * 12 + mob1@month # Period slots: whole years and months
  mob
}
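As a quick sanity check with lubridate loaded, using the date format from the files below:
mob("19990131", "20000103") # Jan 1999 to Jan 2000
# [1] 12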
I've tested the mob function outside of SQL and it works fine so far. Now I want to put the mob function into the SQL statement written above.
In the SQL code I want to merge the data on permno, and in addition I want to lag the data by 3 months (that's why I use the mob function).
The Annual_File looks like this:
GVKEY,datadate,fyear,fyr,bkvlps,permno
14489,19980131,1997,1,4.0155,11081
14489,19990131,1998,1,1.8254,11081
14489,20000131,1999,1,2.0614,11081
14489,20010131,2000,1,2.1615,11081
14489,20020131,2001,1,1.804,11081
The CRSP file looks like this
permno,date,ret
11081,20000103,0.1
11081,20000104,0.2
install.packages('DBI')
install.packages('RSQLite')
library(DBI)
library(RSQLite)
library(lubridate) # ymd, interval, as.period used in mob()
mob <- function (begin, end) {
  begin <- paste(substr(begin, 1, 6), "01", sep = "")
  end   <- paste(substr(end, 1, 6), "01", sep = "")
  mob1  <- as.period(interval(ymd(begin), ymd(end)))
  mob   <- mob1@year * 12 + mob1@month
  mob
}
Annual_File <- "C:/Users/XYZ"
Annual_File <- paste0(Annual_File, ".csv")
inputFile <- "C:/Users/XYZ"
inputFile <- paste0(inputFile, ".csv")
con <- dbConnect(RSQLite::SQLite(), dbname='CCM')
dbWriteTable(con, name="CRSP", value=inputFile, row.names=FALSE, header=TRUE, overwrite=TRUE)
dbWriteTable(con, name="Annual_File", value=Annual_File, row.names=FALSE, header=TRUE, overwrite=TRUE)
DSQL <- "select a.*, b.ret, b.date
from Annual_File as a left join
CRSP as b
on a.permno=b.PERMNO
and mob(a.datadate,b.date)
between 3 and 14"
yourData <- dbGetQuery(con, DSQL)
Even though I defined the function, the error looks as follows:
Error in sqliteSendQuery(con, statement, bind.data) :
error in statement: no such function: mob
You can only use SQL functions in SQLite (and functions written in C); you can't use R functions.
Also, SQLite is not very good for date handling since it has no date and time types. Workarounds are possible with the functions SQLite provides (see the note at the end), but I suggest you use the H2 database instead, which has datediff built in. Note that depending on what you want, you may need to reverse the order of the last two arguments to datediff.
library(RH2)
library(sqldf)
# create test data frames
Lines1 <- "GVKEY,datadate,fyear,fyr,bkvlps,permno
14489,19980131,1997,1,4.0155,11081
14489,19990131,1998,1,1.8254,11081
14489,20000131,1999,1,2.0614,11081
14489,20010131,2000,1,2.1615,11081
14489,20020131,2001,1,1.804,11081"
Lines2 <- "permno,date,ret
11081,20000103,0.1
11081,20000104,0.2"
fmt <- "%Y%m%d"
Annual_File <- read.csv(text = Lines1)
Annual_File$datadate <- as.Date(as.character(Annual_File$datadate), format = fmt)
CRSP <- read.csv(text = Lines2)
CRSP$date <- as.Date(as.character(CRSP$date), format = fmt)
# run SQL statement using sqldf
sqldf("select a.*, b.ret, b.date, datediff('month', a.datadate, b.date) diff
from Annual_File as a
left join CRSP as b
on a.permno = b.permno and
datediff('month', a.datadate, b.date) between 3 and 14")
giving:
GVKEY datadate fyear fyr bkvlps permno ret date diff
1 14489 1998-01-31 1997 1 4.0155 11081 NA <NA> NA
2 14489 1999-01-31 1998 1 1.8254 11081 0.1 2000-01-03 12
3 14489 1999-01-31 1998 1 1.8254 11081 0.2 2000-01-04 12
4 14489 2000-01-31 1999 1 2.0614 11081 NA <NA> NA
5 14489 2001-01-31 2000 1 2.1615 11081 NA <NA> NA
6 14489 2002-01-31 2001 1 1.8040 11081 NA <NA> NA
Note: To use SQLite instead, use the following, where 2440588.5 converts between R's UNIX-epoch date origin and the Julian-day origin assumed by SQLite's date functions.
library(sqldf)
try(detach("package:RH2"), silent = TRUE) # detach RH2 if present
sqldf("select a.*, b.ret, b.date
from Annual_File as a
left join CRSP as b
on a.permno = b.permno and
b.date + 2440588.5 between julianday(a.datadate + 2440588.5, '+3 months') and
julianday(a.datadate + 2440588.5, '+12 months')")

Removing loops in RecordLinkage

I am using the RecordLinkage package in R to deduplicate a dataset. The deduped output from the RecordLinkage package has loops in it.
For example:
Table rlinkage
id name id2 name2
1 Jane Johnson 5 Jane Johnson
5 Jane Johnson 17 Jane Johnson
I am trying to make a table that lists each id associated with all other id numbers in the loop of records.
For example:
id1 id2 id3 Name
1 5 17 Jane Johnson
or
Name Ids
Jane Johnson 1,5,17
Is this possible in R? I tried using the sqldf package to join the dataset onto itself multiple times to try to get all IDs on the same line.
For example:
rlinkage2 <- sqldf('select a.id,
                           a.id2,
                           b.id as id3,
                           b.id2 as id4
                    from rlinkage a
                    left join rlinkage b
                      on a.id = b.id
                      or a.id = b.id2
                      or a.id2 = b.id
                      or a.id2 = b.id2')
This creates a very messy dataset and will not put all of the IDs on the same line unless I join the table rlinkage to itself many times. Is there a better way to do this?
1) sqldf: To do this using sqldf, union the two sets of columns and then use group_concat:
sqldf("select name, group_concat(distinct id) ids from (
select id, name from rlinkage
union
select id2 id, name2 name from rlinkage
) group by name")
giving:
name ids
1 Jane Johnson 1,5,17
2) rbind/aggregate: With plain R:
long <- rbind(rlinkage[1:2], setNames(rlinkage[3:4], names(rlinkage)[1:2]))
aggregate(id ~ name, long, function(x) toString(unique(x)))
giving:
name id
1 Jane Johnson 1, 5, 17
Note: We used this as the data:
Lines <- "id,name,id2,name2
1,Jane Johnson,5,Jane Johnson
5,Jane Johnson,17,Jane Johnson"
rlinkage <- read.csv(text = Lines, as.is = TRUE)
The answer to this question is to use a graph to identify all connected components. If the nodes in the graph are the IDs listed in the question above, we can create an edge list like this:
1 -> 5
5 -> 17
The graph would look like this: 1 -> 5 -> 17. Finding the connected components within the graph reveals all of the groups.
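A minimal sketch of that approach, using the igraph package (an assumption; the answer names no package) and the rlinkage data from the Note above:
library(igraph)
# each (id, id2) pair becomes an undirected edge
g <- graph_from_data_frame(rlinkage[, c("id", "id2")], directed = FALSE)
# group the vertex names by their connected component
comp <- components(g)
split(names(comp$membership), comp$membership)
# $`1`
# [1] "1"  "5"  "17"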
