I've been playing around with database queries in R that are executed on a Postgres database with the PostGIS extension. This means I use some PostGIS functions that have no R equivalent. If it weren't for that, I could probably just execute the same function on a local test data frame instead of a database connection, but because of the PostGIS functions, that's not possible.
Is there a simple approach to create test data in a test database and run the query on that and assess the outcome? I have a WKB column which R does not directly support, so I'm not even sure a simple copy_to could work with inserting a character vector into a geometry column, not to speak of resolving potential key constraints.
A local SQLite database does not work because it does not provide these functions.
Has anyone found a viable solution to this problem?
It sounds like you cannot collect tables from PostgreSQL back into R, hence your comparison has to happen in SQL.
I would do the following:
define text strings to create the SQL tables
execute the strings to create the tables
run your code
make your comparison
For checking in SQL that two tables are identical, I would follow the method in this question or this one; a sketch of that idea also appears after the code below.
This would look something like this:
# Libraries
library(DBI)
library(dplyr)

# Define text strings
create_string = "CREATE TABLE test1 (code VARCHAR(4), size INTEGER);"
insert_string = "INSERT INTO test1 (code, size) VALUES ('AAA', 123);"

# Execute strings
db_con = create_connection()  # however you normally open your PostGIS connection
dbExecute(db_con, create_string)
dbExecute(db_con, insert_string)
# optional: validate that the new table and its contents now exist in the database
# run code
test1 = tbl(db_con, "test1")
test2 = my_function_to_test_that_does_nothing(test1)
# comparison
# records present in one table but not the other (check both directions)
num_records_not_in_both = union_all(
    anti_join(test1, test2, by = colnames(test2)),
    anti_join(test2, test1, by = colnames(test2))
  ) %>%
  summarise(num = n()) %>%
  collect()

stopifnot(num_records_not_in_both$num == 0)
# optional: drop the test tables when done
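If you would rather do the comparison entirely in SQL (the approach described in the linked questions), a symmetric difference with EXCEPT is one option. A minimal sketch, assuming both results exist as tables named test1 and test2 in the database (e.g. after dplyr::compute()):
# Count rows that appear in one table but not the other; 0 means the tables match
diff_query = "
  SELECT COUNT(*) AS num FROM (
    (SELECT * FROM test1 EXCEPT SELECT * FROM test2)
    UNION ALL
    (SELECT * FROM test2 EXCEPT SELECT * FROM test1)
  ) AS diff;"
num_diff = dbGetQuery(db_con, diff_query)
stopifnot(num_diff$num == 0)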
Related
For example, I made an SQL table with column names "Names", "Class", "age", and I have a data frame which I made using this R code:
data_structure1<- as.data.frame("Name")
data_structure1
data_structure2<-as.data.frame("Class")
data_structure2
data_structure3<-as.data.frame("age")
data_structure3
final_df<- cbind(data_structure1,data_structure2,data_structure3)
final_df
#dataframe "Name" contains multiple entries as different names of students.
#dataframe "Class" contains multiple entries as classes in which student are studying like 1,2,3,4 etc.
#dataframe "age" contains multiple entries as ages of students.
I want to insert final_df in my SQL table containing column names as "Names", "Class", "age".
Below is the command I am using for inserting a single column into one column of the SQL table, and this is working fine for me.
title<- sqlQuery(conn,paste0("INSERT INTO AMPs(Names) VALUES('",Names, "')"))
title
But this time I want to insert a data frame with multiple columns into multiple columns of the SQL table.
Please help me with this. Thanks in advance.
Using the DBI package and a package for your database (e.g., RPostgres) you can do something like this, assuming the table already exists:
AMPs <- data.frame(name = c("Foo", "Bar"), Class = "Student", age = c(21L, 22L))
conn <- DBI::dbConnect(...) # read doc to set parameters
## Create a parameterized INSERT statement
db_handle <- DBI::dbSendStatement(
conn,
"INSERT INTO AMPs (Name, Class, age) VALUES ($1, $2, $3)"
)
## Passing the data.frame
DBI::dbBind(db_handle, params = unname(as.list(AMPs)))
DBI::dbHasCompleted(db_handle)
## Close statement handle and connection
DBI::dbClearResult(db_handle)
DBI::dbDisconnect(conn)
You may also want to look at DBI::dbAppendTable.
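For instance, if the data frame's column names already match the table's columns, a single call may be enough; a minimal sketch, assuming the AMPs table already exists with matching column names:
## Hypothetical alternative: append the whole data frame in one call
DBI::dbAppendTable(conn, "AMPs", AMPs)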
Edit:
The example above works with PostgreSQL, and I see that the OP tagged the question with 'sql-server'. I don't have the means to test it with SQL Server, but the DBI interface aims at being general, so I believe it should work. One important thing to note is the syntax, in particular the syntax for the bind parameters. In PostgreSQL the parameters are written as $1 for the first parameter, $2 for the second, and so on. This might be different for other databases. I think SQL Server uses ?, and that the binding is done by position - i.e., the first ? is bound to the first supplied parameter, the second ? to the second, and so on.
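If so, the same pattern would look roughly like this - a sketch only, since I have not run it against SQL Server, and the ?-style placeholders are an assumption:
## Hypothetical ?-style placeholders, e.g. when connecting through the odbc package
db_handle <- DBI::dbSendStatement(
  conn,
  "INSERT INTO AMPs (Name, Class, age) VALUES (?, ?, ?)"
)
DBI::dbBind(db_handle, params = unname(as.list(AMPs)))
DBI::dbClearResult(db_handle)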
I have a database called "db" with a table called "company" which has a column named "name".
I am trying to look up a company name in db using the following query:
dbGetQuery(db, 'SELECT name,registered_address FROM company WHERE LOWER(name) LIKE LOWER("%APPLE%")')
This give me the following correct result:
name
1 Apple
My problem is that I have a bunch of companies to look up and their names are in the following data frame
df <- as.data.frame(c("apple", "microsoft","facebook"))
I have tried the following method to get the company name from my df and insert it into the query:
sqlcomp <- paste0("'SELECT name, ","registered_address FROM company WHERE LOWER(name) LIKE LOWER(",'"', df[1,1],'"', ")'")
dbGetQuery(db,sqlcomp)
However this gives me the following error:
tinyformat: Too many conversion specifiers in format string
I've tried several other methods but I cannot get it to work.
Any help would be appreciated.
This code should work:
df <- as.data.frame(c("apple", "microsoft","facebook"))
comparer <- paste(paste0(" LOWER(name) LIKE LOWER('%",df[,1],"%')"),collapse=" OR ")
sqlcomp <- sprintf("SELECT name, registered_address FROM company WHERE %s",comparer)
dbGetQuery(db,sqlcomp)
Hope this helps you move on.
Please vote my solution if it is helpful.
Using paste to splice data into a query is generally a bad idea, due to SQL injection (whether true injection or just accidental corruption of the query). It's also better to keep the query free of "raw data" because DBMSes tend to optimize a query once and reuse that optimized plan every time they see the same query; if you encode data in the query text, it is a new query each time, so the optimization is defeated.
It's generally better to use parameterized queries; see https://db.rstudio.com/best-practices/run-queries-safely/#parameterized-queries.
For you, I suggest the following:
df <- data.frame(names = c("apple", "microsoft","facebook"))
qmarks <- paste(rep("?", nrow(df)), collapse = ",")
qmarks
# [1] "?,?,?"
dbGetQuery(con, sprintf("select name, registered_address from company where lower(name) in (%s)", qmarks),
params = tolower(df$names))
This takes advantage of three things:
the SQL IN operator, which takes a list (vector in R) of values and conditions on "set membership";
optimized queries; if you subsequently run this query again (with three arguments), then it will reuse the query. (Granted, if you run with other than three companies, then it will have to reoptimize, so this is limited gain);
no need to deal with quoting/escaping your data values; for instance, if it is feasible that your company names might include single or double quotes (perhaps typos on user-entry), then adding the value to the query itself is either going to cause the query to fail, or you will have to jump through some hoops to ensure that all quotes are escaped properly for the DBMS to see it as the correct strings.
Today, for the first time I discovered sqldf package which I found to be very useful and convenient. Here is what the documentation says about the package:
https://www.rdocumentation.org/packages/sqldf/versions/0.4-11
sqldf is an R package for running SQL statements on R data frames,
optimized for convenience. The user simply specifies an SQL statement
in R using data frame names in place of table names and a database
with appropriate table layouts/schema is automatically created, the
data frames are automatically loaded into the database, the specified
SQL statement is performed, the result is read back into R and the
database is deleted all automatically behind the scenes making the
database's existence transparent to the user who only specifies the
SQL statement.
So if I understand correctly, a data.frame holding data in the computer's RAM is temporarily mapped to a table in a database on disk, the query does whatever calculation it is supposed to do, and finally the result is read back into R and everything that was created temporarily in the database goes away as if it never existed.
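For example, the forward direction I have in mind looks like this (a minimal sketch using sqldf's default backend):
library(sqldf)
students <- data.frame(name = c("a", "b", "b"), age = c(10, 11, 12))
# sqldf loads the data frame into a temporary table, runs the query,
# returns the result to R and then discards the temporary table
sqldf("select name, avg(age) as avg_age from students group by name")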
My question is, does it work the other way around? That is, assuming there is already a table, let's say named my_table (just an example), in the database (I use PostgreSQL), is there any way to import its data from the database into a data.frame in R via sqldf? Currently the only way that I know of is RPostgreSQL.
Thanks to G. Grothendieck for the answer. Indeed, it is perfectly possible to select data from already existing tables in the database. My mistake was thinking that the name of the data frame and the corresponding table must always be the same, whereas, if I understand correctly, that is only the case when a data.frame is mapped to a temporary table in the database. As a result, when I tried to select data, I got an error message saying that a table with the same name already existed in my database.
Anyway, just as a test to see whether this works, I did the following in PostgreSQL (as the postgres user, on the test database, which is owned by postgres):
test=# create table person(fname text, lname text, email text);
CREATE TABLE
test=# insert into person(fname, lname, email) values ('fname-01', 'lname-01', 'fname-01.lname-01#gmail.com'), ('fname-02', 'lname-02', 'fname-02.lname-02#gmail.com'), ('fname-03', 'lname-03', 'fname-03.lname-03#gmail.com');
INSERT 0 3
test=# select * from person;
fname | lname | email
----------+----------+-----------------------------
fname-01 | lname-01 | fname-01.lname-01#gmail.com
fname-02 | lname-02 | fname-02.lname-02#gmail.com
fname-03 | lname-03 | fname-03.lname-03#gmail.com
(3 rows)
test=#
Then I wrote the following in R
options(sqldf.RPostgreSQL.user = "postgres",
sqldf.RPostgreSQL.password = "postgres",
sqldf.RPostgreSQL.dbname = "test",
sqldf.RPostgreSQL.host = "localhost",
sqldf.RPostgreSQL.port = 5432)
###
###
library(tidyverse)
library(RPostgreSQL)
library(sqldf)
###
###
result_df <- sqldf("select * from person")
And indeed we can see that result_df contains the data stored in the table person.
> result_df
fname lname email
1 fname-01 lname-01 fname-01.lname-01#gmail.com
2 fname-02 lname-02 fname-02.lname-02#gmail.com
3 fname-03 lname-03 fname-03.lname-03#gmail.com
>
>
I can't figure out how to update an existing DB2 database in R or update a single value in it.
I can't find much information on this topic online other than very general material; there are no specific examples.
library(RJDBC)
teachersalaries=data.frame(name=c("bob"), earnings=c(100))
dbSendUpdate(conn, "UPDATE test1 salary",teachersalaries[1,2])
AND
teachersalaries=data.frame(name=c("bob",'sally'), earnings=c(100,200))
dbSendUpdate(conn, "INSERT INTO test1 salary", teachersalaries[which(teachersalaries$earnings>200,] )
Have you tried passing a regular SQL statement like you would in other languages?
dbSendUpdate(conn, "UPDATE test1 set salary=? where id=?", teachersalary, teacherid)
or
dbSendUpdate(conn,"INSERT INTO test1 VALUES (?,?)",teacherid,teachersalary)
Basically you specify the regular SQL DML statement using parameter markers (those question marks) and provide a list of values as comma-separated parameters.
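For the data frame in the question, that could look something like this (a sketch, assuming the test1 table has name and salary columns with one row per teacher):
# Hypothetical example: update each teacher's salary row by row
teachersalaries <- data.frame(name = c("bob", "sally"), earnings = c(100, 200))
for (i in seq_len(nrow(teachersalaries))) {
  dbSendUpdate(conn, "UPDATE test1 SET salary = ? WHERE name = ?",
               teachersalaries$earnings[i], teachersalaries$name[i])
}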
Try this; it worked well for me.
dbSendUpdate(conn,"INSERT INTO test1 VALUES (?,?)",teacherid,teachersalary)
You just need to pass a regular SQL statement in the same way you do in other programming languages. Try it out.
To update multiple rows at the same time, I have built the following function.
I have tested it with batches of up to 10,000 rows and it works perfectly.
# Libraries
library(RJDBC)
library(dplyr)
# Function upload data into database
db_write_table <- function(conn, table, df) {
  # Format data to write: one quoted, comma-separated tuple per row
  batch <- apply(df, 1, FUN = function(x) paste0("'", trimws(x), "'", collapse = ",")) %>%
    paste0("(", ., ")", collapse = ",\n")
  # Build query
  query <- paste("INSERT INTO", table, "VALUES", batch)
  # Send update
  dbSendUpdate(conn, query)
}
# Push data
db_write_table(conn,"schema.mytable",mydataframe)
Thanks to the other authors.
I am using R in combination with SQLite via RSQLite to persist my data, since I do not have sufficient RAM to keep all columns in memory and calculate with them. I have added an empty column to the SQLite database using:
dbGetQuery(db, "alter table test_table add column newcol real)
Now I want to fill this column using data I calculated in R and which is stored in my data.table column dtab$newcol. I have tried the following approach:
dbGetQuery(db, "update test_table set newcol = ? where id = ?", bind.data = data.frame(transactions$sum_year, transactions$id))
Unfortunately, R seems to be doing something, but it is not using any CPU time or allocating RAM. The database does not change size, and even after 24 hours nothing has changed. Therefore, I assume it has crashed - without any output.
Am I using the update statement wrong? Is there an alternative way of doing this?
UPDATE
I have also tried the RSQLite functions dbSendQuery and dbGetPreparedQuery - both with the same result. However, what does work is updating a single row without the use of bind.data. A loop to update the column, therefore, seems possible but I will have to evaluate the performance since the dataset is huge.
As mentioned by @jangorecki, the problem had to do with SQLite performance. I disabled synchronous and set journal_mode to off (which has to be done for every session).
dbGetQuery(transDB, "PRAGMA synchronous = OFF")
dbGetQuery(transDB, "PRAGMA journal_mode = OFF")
Also, I changed my RSQLite code to use dbBegin(), dbSendPreparedQuery() and dbCommit(). It takes a while, but at least it works now and has acceptable performance.
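For reference, a minimal sketch of that pattern with the current DBI API (dbSendStatement()/dbBind() in place of the older dbSendPreparedQuery()), assuming the id and newcol columns from above:
library(DBI)
library(RSQLite)

dbBegin(transDB)
res <- dbSendStatement(transDB, "UPDATE test_table SET newcol = ? WHERE id = ?")
# Bind all rows at once; the statement is executed once per bound row
dbBind(res, params = list(transactions$sum_year, transactions$id))
dbClearResult(res)
dbCommit(transDB)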