Bulk-Update an SQLite column in R using RSQLite and bind.data - r

I am using R in combination with SQLite, via RSQLite, to persist my data, since I do not have enough RAM to keep all columns in memory and compute on them. I have added an empty column to the SQLite database using:
dbGetQuery(db, "alter table test_table add column newcol real")
Now I want to fill this column with data I calculated in R, which is stored in the data.table column dtab$newcol. I have tried the following approach:
dbGetQuery(db, "update test_table set newcol = ? where id = ?", bind.data = data.frame(transactions$sum_year, transactions$id))
Unfortunately, R looks as if it is doing something, but it uses no CPU time and allocates no RAM. The database file does not change size, and even after 24 hours nothing has happened, so I assume it has crashed without producing any output.
Am I using the update statement wrong? Is there an alternative way of doing this?
UPDATE
I have also tried the RSQLite functions dbSendQuery and dbGetPreparedQuery, both with the same result. However, what does work is updating a single row without bind.data. A loop over the rows therefore seems possible, but I will have to evaluate the performance, since the dataset is huge.

As mentioned by @jangorecki, the problem had to do with SQLite performance. I disabled synchronous writes and set journal_mode to OFF (which has to be done for every session).
dbGetQuery(transDB, "PRAGMA synchronous = OFF")
dbGetQuery(transDB, "PRAGMA journal_mode = OFF")
I also changed my RSQLite code to use dbBegin(), dbSendPreparedQuery() and dbCommit(). It takes a while, but at least it works now and has acceptable performance.
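A minimal sketch of that approach, assuming the table and column names from the question; the database file name is an assumption, and dbSendPreparedQuery() is the older RSQLite API named above (newer releases replace it with dbSendQuery() plus dbBind()):
library(RSQLite)
transDB <- dbConnect(SQLite(), "transactions.sqlite")  # file name assumed
# Per-session pragmas: trade crash safety for bulk-update speed
dbGetQuery(transDB, "PRAGMA synchronous = OFF")
dbGetQuery(transDB, "PRAGMA journal_mode = OFF")
# Wrap all row updates in one transaction so SQLite commits once, not once per row
dbBegin(transDB)
res <- dbSendPreparedQuery(
  transDB,
  "update test_table set newcol = ? where id = ?",
  bind.data = data.frame(transactions$sum_year, transactions$id)
)
dbClearResult(res)
dbCommit(transDB)
dbDisconnect(transDB)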

Related

RJDBC queries against Oracle NUMBER(22,0) fields get wrong results

I have been submitting queries to a particular Oracle database, from R, over RJDBC, for over three years. All that time the results in R seemed to match the results I would get in a database browser (DBeaver in my case).
Now I'm writing queries against a table that has NUMBER(22,0) primary and foreign keys with values in the billions (>1,000,000,000).
And now the results in R don't match the results in DBeaver. These data are sensitive, and I haven't figured out a way to illustrate my problem without disclosing the data, nor have I gotten to the point of replicating the problem in an Oracle database of my own creation yet. I'm open to any suggestions!
A sample query is
SELECT
PK_PATIENT_IDENTIFIER_ID
FROM
PATIENT_IDENTIFIERS pi1
WHERE
FK_PATIENT_ID = <SECRET>
ORDER BY
PK_PATIENT_IDENTIFIER_ID
In DBeaver, I get 6 rows with values in the 70,000,000 - 75,000,000 range.
When submitted in R (see below), I get 8 rows with values in the 100,000,000 - 300,000,000 range.
Are my values in the where block getting cast to some semi-compatible type? Do I need to assert somewhere that these are long integers?
driver <-
  JDBC(driverClass = "oracle.jdbc.OracleDriver",
       classPath = config$oracle.jdbc.path)
con.string <- paste0("jdbc:oracle:thin:@//",
                     config$host,
                     ":",
                     config$port,
                     "/",
                     config$database)
Connection <-
  dbConnect(driver,
            con.string,
            config$user,
            config$pw)
query <- "SELECT
PK_PATIENT_IDENTIFIER_ID
FROM
PATIENT_IDENTIFIERS pi1
WHERE
FK_PATIENT_ID = <SECRET>
ORDER BY
PK_PATIENT_IDENTIFIER_ID"
query.res <- dbGetQuery(Connection, query)
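One hedged way to narrow this down (a diagnostic sketch only, not a confirmed fix): fetch the same keys as text with TO_CHAR() and compare them to the DBeaver results. If the text values match DBeaver while the numeric fetch does not, the mismatch happens in the NUMBER-to-double conversion or in how the WHERE value is sent, rather than in the query itself.
check.query <- "SELECT
TO_CHAR(PK_PATIENT_IDENTIFIER_ID) AS PK_AS_TEXT
FROM
PATIENT_IDENTIFIERS pi1
WHERE
FK_PATIENT_ID = <SECRET>
ORDER BY
PK_PATIENT_IDENTIFIER_ID"
check.res <- dbGetQuery(Connection, check.query)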

How to unit test PostGIS database queries with R

I've been playing around with database queries in R that are executed on a Postgres database with the PostGIS extension. This means I use some of the PostGIS functions that do not have an R equivalent. If it wasn't for that, I could probably just execute the same function on a local test dataframe instead of a database connection, but due to the PostGIS functions, that's not possible.
Is there a simple approach to create test data in a test database, run the query on that, and assess the outcome? I have a WKB column, which R does not directly support, so I'm not even sure a simple copy_to would work when inserting a character vector into a geometry column, not to speak of resolving potential key constraints.
A local SQLite database does not work, because it does not provide these functions.
Has someone found a viable solution to this problem?
It sounds like you cannot collect tables from PostgreSQL back into R, so your comparison has to happen in SQL.
I would do the following:
define text strings to generate SQL tables
execute the strings to generate the tables
run your code
make your comparison
For checking in SQL that two tables are identical, I would follow the method in this question or this one.
This would look something like this:
# Packages assumed by this sketch
library(DBI)
library(dplyr)
library(dbplyr)
# Define text strings (use double quotes in R so the SQL string literal can contain single quotes)
create_string = "CREATE TABLE test1 (code VARCHAR(4), size INTEGER);"
insert_string = "INSERT INTO test1 (code, size) VALUES ('AAA', 123);"
# Execute strings
db_con = create_connection()   # placeholder for however you open your connection
dbExecute(db_con, create_string)
dbExecute(db_con, insert_string)
# optional: validate that the new table with contents now exists in the database
# run code
test1 = tbl(db_con, "test1")
test2 = my_function_to_test_that_does_nothing(test1)
# comparison
num_records_not_in_both = test1 %>%
  full_join(test2, by = colnames(test2), suffix = c("_1", "_2")) %>%
  filter(is.na(id_1) | is.na(id_2)) %>%
  ungroup() %>%
  summarise(num = n()) %>%
  collect()
stopifnot(num_records_not_in_both$num == 0)
# optional: drop the test tables

RODBC gives proper row count but yields empty query

Using R-3.5.0 and RODBC v. 1.3-15 on Windows.
I am trying to query data from a remote database. I can connect fine, and if I run a query to count the rows, the answer comes out correctly. But if I remove the count statement select count(*) and actually fetch the data via select *, I get an empty result (with some rather strange headers). Only two of the column names come out correctly; the rest are question marks and a number (as shown below). I can query the data in SQL Developer with no problem.
I include the simplest version of the code below, but I get the same results if I try to limit to just a few rows or add certain conditions, etc. Sorry I cannot create a reproducible example, but as this is a remote db and I have no idea what the problem is, I'm not sure how I could even do that.
I can query other tables from different schemas within the same odbc connection, so I don't think it is that. I have tried with and without the believeNRows and the rows_at_time.
Thank you for any thoughts.
channel <- odbcConnect("mydb", uid="myuser", pwd="mypass", believeNRows=FALSE,rows_at_time = 1)
myquery <- paste("select count(*) from MYSCHEMA.MYTABLE")
sqlQuery(channel, myquery)
COUNT(*)
1 149712361
myquery <- paste("select * from MYSCHEMA.MYTABLE")
sqlQuery(channel, myquery)
[1] ID FMC_IN_ID ? ?.1 ?.2 ?.3 ?.4 ?.5 ?.6 ?.7 ?.8 ?.9 ?.10 ?.11 ?.12 ?.13 ?.14 ?.15
<0 rows> (or 0-length row.names)
I would try the following:
add a simple limit 100 to your query to see if you can get some data back
add the believeNRows option to the sqlQuery call -- in my experience it is needed at that level
In case it helps others, the problem was that the database contained an Oracle spatial field (MDSYS.SDO_GEOMETRY). R did not know what to do with it. I assumed it would just be converted to a character string, but instead the driver got confused. By omitting the spatial field, the query worked fine.
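To make "omitting the spatial field" concrete, a minimal sketch is to list the wanted columns explicitly instead of using select *. Only ID and FMC_IN_ID appear in the output above; any further column names would be specific to the table, and believeNRows is passed per the suggestion above:
myquery <- "select ID, FMC_IN_ID from MYSCHEMA.MYTABLE"
query.res <- sqlQuery(channel, myquery, believeNRows = FALSE)
head(query.res)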

SQLite3 split date while creating index

I'm using a SQLite3 database with a table that contains, among other columns, a QUOTE_DATETIME column. The database is quite big and running queries is very slow, so I'm trying to speed things up by indexing some of the columns; one of them is QUOTE_DATETIME.
Problem: I want to index by date (YYYY-MM-DD) only, not by date and time (YYYY-MM-DD HH:MM:SS), which is the data I currently have in QUOTE_DATETIME.
Question: How can I use CREATE INDEX to create an index that uses only dates in the format YYYY-MM-DD? Should I split QUOTE_DATETIME into 2 columns: QUOTE_DATE and QUOTE_TIME? If so, how can I do that? Is there an easier solution?
Thanks for helping! :D
Attempt 1: I tried running CREATE INDEX id ON DATA (date(QUOTE_DATETIME)) but I got the error Error: non-deterministic functions prohibited in index expressions.
Attempt 2: I ran ALTER TABLE data ADD COLUMN QUOTE_DATE TEXT to create a new column to hold the date only. And then INSERT INTO data(QUOTE_DATE) SELECT date(QUOTE_DATETIME) FROM data. The date(QUOTE_DATETIME) should convert the date + time to only date, and the INSERT INTO should add the new values to QUOTE_DATE. However, it doesn't work and I don't know why. The new column ends up not having anything added to it.
Expression indexes must not use functions that might change their return value based on data not mentioned in the function call itself. The date() function is such a function because it might use the current time zone setting.
However, in SQLite 3.20 or later, you can use date() in indexes as long as you are not using any time zone modifiers.
INSERT adds new rows. To modify existing rows, use UPDATE:
UPDATE Data SET Quote_Date = date(Quote_DateTime);
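Putting both points together in RSQLite (R is the context of the surrounding questions; the database file name is an assumption, the table and column names come from the question):
library(RSQLite)
con <- dbConnect(SQLite(), "quotes.sqlite")   # file name assumed
# Option 1 (SQLite 3.20 or later): index the date() expression directly,
# which is allowed because no time zone modifier is used
dbExecute(con, "CREATE INDEX idx_quote_date ON data (date(QUOTE_DATETIME))")
# Option 2: fill the added QUOTE_DATE column with UPDATE (not INSERT), then index it
dbExecute(con, "UPDATE data SET QUOTE_DATE = date(QUOTE_DATETIME)")
dbExecute(con, "CREATE INDEX idx_quote_date_col ON data (QUOTE_DATE)")
dbDisconnect(con)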

R: Updating SQL table loop using R

I am using the RODBC package on R which allows me to connect to SQL using R.
As an example of my problem, I have a table [Sales] within SQL with 3 columns (Alpha, Beta, BetaDistribution).
1.50,77,x
2.99,53,x
4.50,122,x
Note that the 3rd column (BetaDistribution) is not populated, and it needs to be populated using a statistical R function.
I have assigned my table to the variable select:
select <- sqlQuery(dbhandle, 'select * from dbo.sales')
How do I run a loop to update my SQL table so that the BetaDistribution column is updated with the calculated beta distribution, pbeta(alpha, beta)?
Something like this. Basically you make a temp table and then update the existing table. There's a reasonable chance you need to tweak that update statement since I, obviously, can't test it.
select$BetaDistribution <- yourfunc(select$Alpha, select$Beta)  # e.g. your pbeta()-based calculation
sqlSave(dbhandle, select, tablename = "dbo.salestemp", rownames = FALSE,
        varTypes = list(Alpha = "decimal(10,10)",
                        Beta = "decimal(10,10)",
                        BetaDistribution = "decimal(10,10)"))
sqlQuery(dbhandle, "update sales
                    set sales.BetaDistribution = salestemp.BetaDistribution
                    from dbo.sales sales
                    inner join dbo.salestemp salestemp
                    on sales.Alpha = salestemp.Alpha and
                       sales.Beta = salestemp.Beta")
sqlQuery(dbhandle, "drop table dbo.salestemp")
