Own function in RSQLite engine in R - r

I found this SQL code for SAS and I want to translate it into RSQLite.
proc sql;
create table crspcomp as
select a.*, b.ret, b.date
from ccm1 as a left join crsp.msf as b
on a.permno=b.permno
and intck('month',a.datadate,b.date)
between 3 and 14;
quit;
The first problem that occurred was that R does not provide the intck function, which returns the difference in months between two dates. I found a similar function (on Stack Overflow) which looks like this:
library(lubridate) # provides ymd(), interval() and as.period()
mob<-function (begin, end) {
begin<-paste(substr(begin,1,6),"01",sep="")
end<-paste(substr(end,1,6),"01",sep="")
mob1<-as.period(interval(ymd(begin),ymd(end)))
mob<-mob1@year*12+mob1@month
mob
}
I've tested the mob function outside RSQLite and it works fine so far. Now I want to use the mob function in the SQL statement written above.
In the SQL code I want to merge the data on permno, and in addition I want to lag the data by 3 months (that's why I use the mob function).
The Annual_File looks like this:
GVKEY,datadate,fyear,fyr,bkvlps,permno
14489,19980131,1997,1,4.0155,11081
14489,19990131,1998,1,1.8254,11081
14489,20000131,1999,1,2.0614,11081
14489,20010131,2000,1,2.1615,11081
14489,20020131,2001,1,1.804,11081
The CRSP file looks like this:
permno,date,ret
11081,20000103,0.1
11081,20000104,0.2
install.packages('DBI')
install.packages('RSQLite')
library(DBI)
library(lubridate) # provides ymd(), interval() and as.period()
mob<-function (begin, end) {
begin<-paste(substr(begin,1,6),"01",sep="")
end<-paste(substr(end,1,6),"01",sep="")
mob1<-as.period(interval(ymd(begin),ymd(end)))
mob<-mob1@year*12+mob1@month
mob
}
Annual_File <- "C:/Users/XYZ"
Annual_File <- paste0(Annual_File, ".csv")
inputFile <- "C:/Users/XYZ"
inputFile <- paste0(inputFile, ".csv")
con <- dbConnect(RSQLite::SQLite(), dbname='CCM')
dbWriteTable(con, name="CRSP", value=inputFile, row.names=FALSE, header=TRUE, overwrite=TRUE)
dbWriteTable(con, name="Annual_File", value=Annual_File, row.names=FALSE, header=TRUE, overwrite=TRUE)
DSQL <- "select a.*, b.ret, b.date
from Annual_File as a left join
CRSP as b
on a.permno=b.PERMNO
and mob(a.datadate,b.date)
between 3 and 14"
yourData <- dbGetQuery(con,DSQL)
Even though I defined the function, the error looks as follows:
Error in sqliteSendQuery(con, statement, bind.data) :
error in statement: no such function: mob

You can only use SQL functions in SQLite (and functions written in C). You can't use R functions.
Also, SQLite is not very good for date handling since it has no date and time types. Workarounds are possible with the functions SQLite provides (see Note at end) but I suggest you use the H2 database instead. It has datediff built in. Note that depending on what you want you may need to reverse the order of the last two arguments to datediff.
library(RH2)
library(sqldf)
# create test data frames
Lines1 <- "GVKEY,datadate,fyear,fyr,bkvlps,permno
14489,19980131,1997,1,4.0155,11081
14489,19990131,1998,1,1.8254,11081
14489,20000131,1999,1,2.0614,11081
14489,20010131,2000,1,2.1615,11081
14489,20020131,2001,1,1.804,11081"
Lines2 <- "permno,date,ret
11081,20000103,0.1
11081,20000104,0.2"
fmt <- "%Y%m%d"
Annual_File <- read.csv(text = Lines1)
Annual_File$datadate <- as.Date(as.character(Annual_File$datadate), format = fmt)
CRSP <- read.csv(text = Lines2)
CRSP$date <- as.Date(as.character(CRSP$date), format = fmt)
# run SQL statement using sqldf
sqldf("select a.*, b.ret, b.date, datediff('month', a.datadate, b.date) diff
from Annual_File as a
left join CRSP as b
on a.permno = b.permno and
datediff('month', a.datadate, b.date) between 3 and 14")
giving:
GVKEY datadate fyear fyr bkvlps permno ret date diff
1 14489 1998-01-31 1997 1 4.0155 11081 NA <NA> NA
2 14489 1999-01-31 1998 1 1.8254 11081 0.1 2000-01-03 12
3 14489 1999-01-31 1998 1 1.8254 11081 0.2 2000-01-04 12
4 14489 2000-01-31 1999 1 2.0614 11081 NA <NA> NA
5 14489 2001-01-31 2000 1 2.1615 11081 NA <NA> NA
6 14489 2002-01-31 2001 1 1.8040 11081 NA <NA> NA
Note: To use SQLite use this where 2440588.5 is used to convert between R's UNIX epoch date origin and the date origin assumed by SQLite's functions.
library(sqldf)
try(detach("package:RH2"), silent = TRUE) # detach RH2 if present
sqldf("select a.*, b.ret, b.date
from Annual_File as a
left join CRSP as b
on a.permno = b.permno and
b.date + 2440588.5 between julianday(a.datadate + 2440588.5, '+3 months') and
julianday(a.datadate + 2440588.5, '+12 months')")
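Another possible workaround that stays within plain SQLite is sketched below. It assumes the dates are stored as yyyymmdd integers, as in the question's CSV files, and computes the whole-month difference with integer arithmetic, which mirrors what the mob function does (it ignores the day of the month). The in-memory database and the small sample tables are made up for illustration and are not the original data.

```r
library(DBI)

# Hypothetical in-memory database with integer yyyymmdd dates (assumption)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
annual <- data.frame(permno = 11081L,
                     datadate = c(19980131L, 19990131L, 20000131L))
crsp <- data.frame(permno = 11081L,
                   date = c(20000103L, 20000104L),
                   ret = c(0.1, 0.2))
dbWriteTable(con, "Annual_File", annual)
dbWriteTable(con, "CRSP", crsp)

# date / 10000 is the year and date / 100 % 100 is the month,
# because SQLite uses integer division for integer operands
sql <- "
select a.*, b.ret, b.date,
       (b.date / 10000 - a.datadate / 10000) * 12
       + (b.date / 100 % 100 - a.datadate / 100 % 100) as diff
from Annual_File as a
left join CRSP as b
  on a.permno = b.permno
 and (b.date / 10000 - a.datadate / 10000) * 12
     + (b.date / 100 % 100 - a.datadate / 100 % 100) between 3 and 14"
res <- dbGetQuery(con, sql)
dbDisconnect(con)
res
```

Only the 1999 row should pick up the two CRSP rows (a difference of 12 months); the other annual rows are kept with NA because of the left join.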

Related

Override bad/wrong values in a main table with NA or null values listed on another lookup table in R

The main table is large.
It has certain undesired values that I want to override.
I am writing the keys and the new_value (NA) to override into a lookup table.
Both tables have 2 keys (session_id and datetime) rather than a single unique one.
Other similar questions deal with replacing an NA with a value, but I want to replace a value with NA, i.e. clear the cell's contents.
Having 2 keys limits the use of match(), which can handle only one key and returns only first occurrences.
left_join or merge operations would create a new large data frame with an added column filled with NA for every row, and would also require some kind of 'coalescing' into an NA value, which I guess doesn't exist.
I don't want to remove the entire row, as there are many other columns with their own values. I just want to delete the value from those cells.
I think that, in short, it is just an assignment operation on a filtered subset based on 2 keys. Something like:
table[ lookup_paired_keys(session_ids, lookup_datetimes) ] <- NA
Below is a sample dataset with undesired "0" values to replace with NA. The real dataset may contain other kinds of values.
table <- read.table(text = "
session_id datetime CaloriesDaily
1233815059 2016-05-01 5555
8583815123 2016-05-03 4444
8512315059 2016-05-04 2432
8583815059 2016-05-12 0
6290855005 2016-05-10 0
8253242879 2016-04-30 0
1503960366 2016-05-20 0
1583815059 2016-05-19 2343
8586545059 2016-05-20 1111
1290855005 2016-05-11 5425
1253242879 2016-04-25 1234
1111111111 2016-05-09 6542", header = TRUE)
table$datetime = as.POSIXct(table$datetime, tz='UTC')
table
lookup <- read.table(text = "
session_id datetime CaloriesDaily
8583815059 2016-05-12 NA
6290855005 2016-05-10 NA
8253242879 2016-04-30 NA
1503960366 2016-05-12 NA", header = TRUE)
lookup$datetime = as.POSIXct(lookup$datetime, tz='UTC')
lookup$CaloriesDaily = as.numeric(lookup$CaloriesDaily)
lookup
SOLVED
After reading the accepted answer, I want to share the final outcome.
My main table is a data.table and I got some warnings regarding nomenclature. Be aware that I am no expert, but this works with the example dataset and with my own.
lookup_by : Standard Lookup operation
lookup_by <- function(table, lookup, by) {
merge( table, lookup, by=by )
}
### usage ###
keys = c('session_id','datetime')
lookup_by( table, lookup, keys)
Adopted solution: match_by
Like match(), but with multiple keys.
It returns the row numbers (in both tables) where the keys match,
so that an assignment like table[ ..matches.. ] <- NA is possible.
match_by <- function(table, lookup, by) {
table <- setDT(table)[,..by]
table$idx1 <- 1:nrow(table)
lookup <- setDT(lookup)[,..by]
lookup$idx2 <- 1:nrow(lookup)
m <- merge( table , lookup, by=by )
return( m[ ,c('idx1','idx2') ] )
}
### usage ###
keys = c('session_id','datetime')
rows = match_by( table, lookup, keys)
overrides <- c(lookup[ rows$idx2, 'CaloriesDaily' ])
table[ rows$idx1, 'CaloriesDaily' ] <- overrides
table
Here’s a solution using dplyr::semi_join() and dplyr::anti_join() to split your dataframe based on whether the id and date keys match your lookup table. I then assign NAs in just the subset with matching keys, then row-bind the subsets back together. Note that this solution doesn’t preserve the original row order.
library(dplyr)
table_ok_vals <- table %>%
anti_join(lookup, by = c("session_id", "datetime"))
table_replaced_vals <- table %>%
semi_join(lookup, by = c("session_id", "datetime")) %>%
mutate(CaloriesDaily = NA_real_)
table <- bind_rows(table_ok_vals, table_replaced_vals)
table
Output:
session_id datetime CaloriesDaily
1 1233815059 2016-05-01 5555
2 8583815123 2016-05-03 4444
3 8512315059 2016-05-04 2432
4 1503960366 2016-05-20 0
5 1583815059 2016-05-19 2343
6 8586545059 2016-05-20 1111
7 1290855005 2016-05-11 5425
8 1253242879 2016-04-25 1234
9 1111111111 2016-05-09 6542
10 8583815059 2016-05-12 NA
11 6290855005 2016-05-10 NA
12 8253242879 2016-04-30 NA
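If the original row order has to be preserved, one base R option is to build a combined key from both columns and assign NA to the matching rows in place. This is only a sketch under the assumption that pasting session_id and a formatted datetime gives a unique, comparable key; the names tbl, lkp and make_key are hypothetical and the data is a shortened copy of the question's sample.

```r
# Shortened, hypothetical copy of the question's data
tbl <- data.frame(
  session_id = c(1233815059, 8583815059, 6290855005, 1503960366),
  datetime = as.POSIXct(c("2016-05-01", "2016-05-12", "2016-05-10", "2016-05-20"),
                        tz = "UTC"),
  CaloriesDaily = c(5555, 0, 0, 0)
)
lkp <- data.frame(
  session_id = c(8583815059, 6290855005, 1503960366),
  datetime = as.POSIXct(c("2016-05-12", "2016-05-10", "2016-05-12"), tz = "UTC")
)

# Combine both key columns into one string so that %in% can compare row pairs
make_key <- function(df) {
  paste(df$session_id, format(df$datetime, "%Y-%m-%d %H:%M:%S"), sep = "|")
}

# TRUE for rows of tbl whose (session_id, datetime) pair occurs in lkp
hit <- make_key(tbl) %in% make_key(lkp)

# Overwrite only the matching cells; all other rows and columns stay untouched
tbl$CaloriesDaily[hit] <- NA
tbl
```

The last row keeps its 0 because its datetime differs from the one in the lookup table, even though the session_id matches.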

RSQlite - Find values with most occurences in group

I'm using RSQLite to import datasets from an SQLite database. There are multiple millions of observations within the database. Therefore I'd like to do as much of the data selection and aggregation as possible within the database.
At some point I need to aggregate a character variable. I want to get the value which occurs most often within each group. How can I edit the following dplyr chain so that it also works with RSQLite?
library(tidyverse)
library(RSQLite)
# Path to Database
DATABASE="./xxx.db"
# Connect Database
mydb <- dbConnect(RSQLite::SQLite(), DATABASE)
# Load Database
data = tbl(mydb, "BigData")
# Query Database
Summary <- data %>%
filter(year==2020) %>%
group_by(Grouping_variable) %>%
summarize(count=n(),
Item_variable=names(which.max(table(Item_variable))))
Within R that code would do its job. When querying the database I get the error Error: near "(": syntax error
The original pipe contains more filters and steps.
Example Database would basically look like:
data.frame(Grouping_variable=c("A","A","B","C","C","C","D","D","D","D"),
year=c(2019,2020,2019,2020,2020,2020,2020,2020,2020,2021),
Item_variable=c("X","Y","Y","X","X","Y","Y","Y","X","X"))
Grouping_variable year Item_variable
1 A 2019 X
2 A 2020 Y
3 B 2019 Y
4 C 2020 X
5 C 2020 X
6 C 2020 Y
7 D 2020 Y
8 D 2020 Y
9 D 2020 X
10 D 2021 X
Result should look like:
Grouping_variable count Item_variable
<chr> <int> <chr>
1 A 1 Y
2 C 3 X
3 D 3 Y
Assuming that DF is the data frame defined in the question, we use SQL to calculate the count of each item within each group for the year 2020, giving tmp, and then take the row whose count is the maximum within each group, giving tmp2. SQLite guarantees that when max is used together with group by, the other fields come from the row where the maximum was found. In tmp2 we also take the sum of the counts, and finally we select just the desired columns.
library(sqldf)
sql <- "with tmp as (
select Grouping_variable, count(*) count, Item_variable from DF
where year = 2020
group by Grouping_variable, Item_variable
),
tmp2 as (
select Grouping_variable, max(count), sum(count) count, Item_variable
from tmp
group by Grouping_variable
)
select Grouping_variable, count, Item_variable
from tmp2
"
sqldf(sql)
giving:
Grouping_variable count Item_variable
1 A 1 Y
2 C 3 X
3 D 3 Y
Added
Suppose that DF were a table in your database. This code creates such a database.
library(RSQLite)
m <- dbDriver("SQLite")
con <- dbConnect(m, dbname = "database.sqlite")
dbWriteTable(con, 'DF', DF, row.names = FALSE)
dbDisconnect(con)
Then this would run the SQL command in the sql string defined above on that database and return the result.
library(RSQLite)
m <- dbDriver("SQLite")
con <- dbConnect(m, dbname = "database.sqlite")
result <- dbGetQuery(con, sql)
dbDisconnect(con)

How to fetch data in batches using R

I have a dataframe in R with the following structure:
ID Date
ID-1 2020-02-10 13:12:04
ID-2 2020-02-12 15:02:24
ID-3 2020-02-14 12:25:32
I am using the following query to fetch the data from MySQL. That is where I'm running into a problem, because I have a large number of IDs (i.e. ~90K). When I pass 500-1000 IDs it works fine, but passing 90K IDs throws an error.
Data_frame<-paste0("
SELECT c.ID, e.name,d.output
FROM Table1 c
left outer join Table2 d ON d.ID=c.ID
LEFT outer JOIN Table1 e ON e.ID_2=d.ID_2
WHERE e.name in ('Name1','Name2')
AND c.ID IN (", paste(shQuote(DF$ID, type = "sh"),collapse = ', '), ")
;")
The query returns output in the following form, which I need to combine with DF by ID.
Query_Output<-
ID Name output
ID-1 Name1 23
ID-1 Name2 20
ID-2 Name1 40
ID-2 Name2 97
ID-3 Name1 34
ID-3 Name2 53
Required Output:
ID Date Name1 Name2
ID-1 2020-02-10 13:12:04 23 20
ID-2 2020-02-12 15:02:24 40 97
ID-3 2020-02-14 12:25:32 34 53
I have tried the below-mentioned code:
createIDBatchVector <- function(x, batchSize){
paste0(
"'"
, sapply(
split(x, ceiling(seq_along(x) / batchSize))
, paste
, collapse = "','"
)
, "'"
)
}
# second helper function
createQueries <- function(IDbatches){
paste0("
SELECT c.ID, e.name,d.output
FROM Table1 c
left outer join Table2 d ON d.ID=c.ID
LEFT outer JOIN Table1 e ON e.ID_2=d.ID_2
WHERE e.name in ('Name1','Name2')
AND c.ID IN (", paste(shQuote(DF$ID, type = "sh"),collapse = ', '), ")
;")}
# ------------------------------------------------------------------
# and now the actual script
# first we create a vector that contains one batch per element
IDbatches <- createIDBatchVector(DF$ID, 2)
# It looks like this:
# [1] "'ID-1','ID-2'" "'ID-3','ID-4'" "'ID-5'"
# now we create a vector of SQL-queries out of that
# queries <- createQueries(IDbatches)
df_final <- data.frame() # initialize a dataframe
conn <- database # open a connection
for (query in queries){ # iterate over the queries
df_final <- rbind(df_final, dbGetQuery(conn,query))}
I'm surprised that 90k rows kills your SQL, but such is life.
I'm not sure I understand why you are doing it this way rather than looping with a for:
for (batches in 0:90) {
b = batches*1000
SELECT ...
... WHERE ID > b & < b+1000
rbind(myData, result)
}
(That's not the solution, just the method.)
But if your method is working, then what you want is tidyr::pivot_wider().
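The batching idea can be sketched in base R as follows. The query text is taken from the question; make_query, batch_size and the small stand-in data frames are hypothetical, and the reshaping step uses base reshape() as one alternative to pivot_wider(). The database call is left as a comment because it needs a live connection.

```r
# Hypothetical stand-in for the question's DF
DF <- data.frame(ID = paste0("ID-", 1:5), stringsAsFactors = FALSE)

# Split the IDs into batches of at most batch_size elements
batch_size <- 2
batches <- split(DF$ID, ceiling(seq_along(DF$ID) / batch_size))

# Build one query per batch; only the IN (...) list differs between queries
make_query <- function(batch) {
  paste0(
    "SELECT c.ID, e.name, d.output ",
    "FROM Table1 c ",
    "LEFT OUTER JOIN Table2 d ON d.ID = c.ID ",
    "LEFT OUTER JOIN Table1 e ON e.ID_2 = d.ID_2 ",
    "WHERE e.name IN ('Name1', 'Name2') ",
    "AND c.ID IN (", paste0("'", batch, "'", collapse = ", "), ");"
  )
}
queries <- vapply(batches, make_query, character(1))

# With a live connection the batches would be fetched and stacked (not run):
# results <- do.call(rbind, lapply(queries, function(q) dbGetQuery(conn, q)))

# Stand-in for the stacked query output (assumption, for illustration only)
results <- data.frame(ID = rep(c("ID-1", "ID-2"), each = 2),
                      name = rep(c("Name1", "Name2"), 2),
                      output = c(23, 20, 40, 97),
                      stringsAsFactors = FALSE)

# Long to wide: one row per ID, one column per name
wide <- reshape(results, idvar = "ID", timevar = "name", direction = "wide")
names(wide) <- sub("^output\\.", "", names(wide))

# Attach the wide columns to the original data frame by ID
final <- merge(DF, wide, by = "ID", all.x = TRUE)
final
```

Pasting IDs into the query text is only reasonable when the IDs are trusted; for untrusted input a parameterised query would be the safer choice.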

Replace wrong values in df2 with true values in df1 by using 2 common columns in R

I have 2 data frames like this
TEAM <- c("PE","PE","MPI","TDT","HPT")
EmpID <- c (444452,444456,16822,339862,14828)
ManagerID <- c(11499,11599,11899,11339,11559)
CODE <- c("F",NA,"A","H","G")
df1 <- data.frame(TEAM,EmpID,ManagerID,CODE)
TEAM <- c("MPI","TDT","HPT","PE","TDT","PE","MPI","TDT","HPT","PE")
EmpID <- c(444452,444452,444452,339862,339862,16822,339862,16822,14828,14828)
ManagerID <- c(11499,11499,11499,11339,11339,11899,11339,11899,11559,11559)
CODE <- c("A234","H665","G654","F616","H626","F234","H695","G954","G616",NA)
df2 <- data.frame(TEAM,EmpID,ManagerID,CODE)
I am trying to update the wrong values of ManagerID & EmpID in df2 with the true values of ManagerID & EmpID in df1, but only when both the TEAM and the CODE match (comparing the letter in the CODE column of df1 with the first letter of the CODE column of df2). If the team matches but the code does not, the wrong values stay and shouldn't be replaced with the values from df1.
My desired output is
TEAM EmpID ManagerID CODE
1 MPI 16822 11899 A234
2 TDT 339862 11339 H665
3 HPT 14828 11559 G654
4 PE 444452 11499 F616
5 TDT 339862 11339 H626
6 PE 444452 11499 F234
7 MPI 339862 11339 H695
8 TDT 16822 11899 G954
9 HPT 14828 11559 G616
10 PE 444452 11599 <NA>
You can see that rows 7 & 8 remain unchanged since the codes don't match.
I tried doing it this way, with help from Gregor on my previous question:
df2$ManagerID = df1$ManagerID[match(substr(df2$CODE, 1, 1), df1$CODE)]
df2$EmpID = df1$EmpID [match(substr(df2$CODE, 1, 1), df1$CODE)]
I am not sure if I am headed in the right direction. Kindly help me with suggestions on how to solve this efficiently.
Since you are doing text subsetting here, I would initialize as character vectors, not factors:
df1 <- data.frame(TEAM,EmpID,ManagerID,CODE, stringsAsFactors = FALSE)
df2 <- data.frame(TEAM,EmpID,ManagerID,CODE, stringsAsFactors = FALSE)
I would then use data table (not necessary, but cleaner):
library(data.table)
# convert data frames to data.table
setDT(df1)
setDT(df2)
You want to create a column with the value to merge here, just taking the first character of df2's CODE
df2[ , C_SHORT := substr(CODE,1,1)]
We then merge the two data frames on the TEAM/CODE combination. This creates NAs where any matches do not exist. Then, test if any of the columns are NA, and insert the initial value if they are
merge(x = df2,y = df1, by.x = c("TEAM","C_SHORT"), by.y = c("TEAM","CODE"), all.x = TRUE)[
,
.(
TEAM,
EmpID = ifelse(is.na(EmpID.y), EmpID.x, EmpID.y),
ManagerID = ifelse(is.na(ManagerID.y), ManagerID.x, ManagerID.y),
CODE
)
]
TEAM EmpID ManagerID CODE
1: HPT 14828 11559 G654
2: HPT 14828 11559 G616
3: MPI 16822 11899 A234
4: MPI 339862 11339 H695
5: PE 444456 11599 NA
6: PE 444452 11499 F616
7: PE 444452 11499 F234
8: TDT 16822 11899 G954
9: TDT 339862 11339 H665
10: TDT 339862 11339 H626
One note: You are using NA as a lookup value here. This works in the merge function (which I did not realize), but IMO this is bad practice (NA in R really refers to missing data, whereas here it encodes something). I would think about changing how this is represented in your data.

Update a column in df2 by matching patterns in columns in df1 & df2 using R

I have 2 data frames like this
TEAM <- c("PE","PE","MPI","TDT","HPT","ATD")
CODE <- c(NA,"F","A","H","G","D")
df1 <- data.frame(TEAM,CODE)
CODE <- c(NA,"F100","A234","D664","H435","G123","A666","D345","G324",NA)
TEAM <- c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA)
df2 <- data.frame(CODE,TEAM)
I am trying to update TEAM in df2 by matching the CODE column in df1 with the first letter of the CODE column in df2.
My desired output for df2
CODE TEAM
1 NA PE
2 F100 PE
3 A234 MPI
4 D664 ATD
5 H435 TDT
6 G123 HPT
7 A666 MPI
8 D345 ATD
9 G324 HPT
10 NA PE
I tried it this way with sqldf, but it is not right:
library(sqldf)
df2 <- sqldf(c("update df2 set TEAM =
case
when CODE like '%F%' then 'PE'
when CODE like '%A%' then 'MPI'
when CODE like '%D%' then 'ATD'
when CODE like '%G%' then 'HPT'
when CODE like '%H%' then 'TDT'
else 'NA'
end"))
Can someone give me some directions on achieving this without sqldf?
Using match and substr (both in base R):
df2$TEAM = df1$TEAM[match(substr(df2$CODE, 1, 1), df1$CODE)]
df2
# CODE TEAM
# 1 <NA> PE
# 2 F100 PE
# 3 A234 MPI
# 4 D664 ATD
# 5 H435 TDT
# 6 G123 HPT
# 7 A666 MPI
# 8 D345 ATD
# 9 G324 HPT
# 10 <NA> PE
This is expedient for a single case - if you're doing things like this frequently I would encourage you to just extract the first letter of code into its own column, CODE_1, and then do a regular merge or join.
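The separate-column approach mentioned above might look like the sketch below. It uses the data from the question, keeps a helper column row (a hypothetical name) to restore the original order after merge(), and relies on merge() matching NA keys to each other by default.

```r
df1 <- data.frame(TEAM = c("PE", "PE", "MPI", "TDT", "HPT", "ATD"),
                  CODE = c(NA, "F", "A", "H", "G", "D"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(CODE = c(NA, "F100", "A234", "D664", "H435", "G123",
                           "A666", "D345", "G324", NA),
                  stringsAsFactors = FALSE)

# Extract the first letter into its own column; NA stays NA
df2$CODE_1 <- substr(df2$CODE, 1, 1)

# merge() sorts by the key, so remember the original row order
df2$row <- seq_len(nrow(df2))

# Regular left join on the one-letter code
merged <- merge(df2, df1, by.x = "CODE_1", by.y = "CODE", all.x = TRUE)

# Restore the original order and keep only the columns of interest
merged <- merged[order(merged$row), c("CODE", "TEAM")]
merged
```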
Assuming you are looking for an sqldf solution try this:
sqldf("select CODE,
case
when CODE like 'F%' then 'PE'
when CODE like 'A%' then 'MPI'
when CODE like 'D%' then 'ATD'
when CODE like 'G%' then 'HPT'
when CODE like 'H%' then 'TDT'
else 'PE'
end TEAM from df2", method = "raw")
or this:
sqldf("select df2.CODE, coalesce(df1.TEAM, 'PE') TEAM
from df2
left join df1 on substr(df2.CODE, 1, 1) = df1.CODE")
