Define mlr3 task using data from a database (different tables)?

This is a newbie question.
How do I define a (classification) task that uses data from an (SQLite) database? The mlr3db example seems to write data from memory first. In my case, the data is already in the database. What may be a bigger problem: the target and the features are in different tables.
What I tried:
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "my_data.db")
my_features <- dplyr::tbl(con, "my_features")
my_target <- dplyr::tbl(con, "my_targets")
task <- mlr3::TaskClassif$new("my_task", backend=my_features, target="???")
and then I don't know how to specify the target argument.
Maybe a solution would be to create a VIEW in the database that joins features and targets?

Having the data split across (1) multiple tables or (2) multiple databases is possible. In your case, it looks like the data is just split across multiple tables, so you can use the same DBI connection to access both.
All you need is a key column to join the two tables. In the following example I'm using a simple integer key column and an inner_join() to merge the two tables into a single new table, but the details depend on your database schema.
library(mlr3)
library(mlr3db)
# base data set
data = iris
data$row_id = 1:nrow(data)
# create data base with two tables, split data into features and target and
# keep key column `row_id` in both tables
path = tempfile()
con = DBI::dbConnect(RSQLite::SQLite(), dbname = path)
DBI::dbWriteTable(con, "features", subset(data, select = - Species))
DBI::dbWriteTable(con, "target", subset(data, select = c(row_id, Species)))
DBI::dbDisconnect(con)
# re-open the connection
con = DBI::dbConnect(RSQLite::SQLite(), dbname = path)
# access tables with dplyr
tbl_features = dplyr::tbl(con, "features")
tbl_target = dplyr::tbl(con, "target")
# join tables with an inner_join
tbl_joined = dplyr::inner_join(tbl_features, tbl_target, by = "row_id")
# convert to a backend and create the task
backend = as_data_backend(tbl_joined, primary_key = "row_id")
mlr3::TaskClassif$new("my_task", backend, target = "Species")

Related

How to import data from Azure to R faster

I'm dealing with a huge database (67 million rows and 55 columns). This database is on Azure and I want to analyse it in R, so I'm using the following:
library("odbc")
library("DBI")
library("tidyverse")
library("data.table")
con <- DBI::dbConnect(odbc::odbc(),
UID = rstudioapi::askForPassword("myEmail"),
Driver="ODBC Driver 17 for SQL Server",
Server = server, Database = database,
Authentication = "ActiveDirectoryInteractive")
#selecting columns
myData = data.table::setDT(DBI::dbGetQuery(conn = con, statement =
'SELECT Column1, Column2, Column3
FROM myTable'))
I'm trying to use data.table::setDT to convert the data.frame to a data.table, but it is still taking a very long time to load.
Any hint on how can I load this data faster?
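One common pattern (not part of the original thread, just a sketch) is to avoid pulling everything through a single dbGetQuery() call and instead stream the result in chunks with DBI, binding the pieces with data.table; the chunk size of 1e6 rows is an arbitrary assumption you would tune:
library(DBI)
library(data.table)
res <- dbSendQuery(con, "SELECT Column1, Column2, Column3 FROM myTable")
chunks <- list()
while (!dbHasCompleted(res)) {
  # fetch up to one million rows at a time to keep memory pressure manageable
  chunks[[length(chunks) + 1]] <- dbFetch(res, n = 1e6)
}
dbClearResult(res)
myData <- data.table::rbindlist(chunks)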

Using clinicaltrials.gov database in R

I am trying to use R to access the clinicaltrials.gov AACT database to create a list of facility_investigators for a specific topic.
The following code is an example of how to get a list of all clinical trials on the topic TP53
library(dplyr)
library(RPostgreSQL)
aact = src_postgres(dbname = 'aact',
host = "aact-db.ctti-clinicaltrials.org",
user = 'aact',
password = 'aact')
study_tbl = tbl(src=aact, 'studies')
x = study_tbl %>% filter(official_title %like% '%TP53%') %>% collect()
Similarly, if I want a list of principal investigators,
library(dplyr)
library(RPostgreSQL)
aact = src_postgres(dbname = 'aact',
host = "aact-db.ctti-clinicaltrials.org",
user = 'aact',
password = 'aact')
study_tbl = tbl(src=aact, 'facility_investigators')
I am unable to restrict the list of facility_investigators to TP53 trials only (something like TP53 & facility_investigators). Any help would be appreciated.
Here is a link where some explanation is provided, but it does not resolve my problem: http://www.cancerdatasci.org/post/2017/03/approaches-to-accessing-clinicaltrials.gov-data/
Is this what you're asking? You're pulling from two different tables in the same database: the first one is 'studies' and the second one is 'facility_investigators'. Run head() (or glimpse(), or str()) on each table and check whether the two tables share a common variable you can merge on after you load them into R. If they do, you would do something like this:
library(dplyr)
merged_table <- inner_join(x, study_table, by = "common column")
If the columns have different names, it would look like:
library(dplyr)
merged_table <- inner_join(x, study_table, by = c("x_column_name" = "study_table_column_name"))
From there you can limit your dataset to just facility investigators that have the characteristics you want.
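For example, to keep only principal investigators you could filter the merged table (the role column name and its values here are placeholders, not taken from the thread):
library(dplyr)
merged_table %>% filter(role == "Principal Investigator")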
If you want to do it in one PostgreSQL query, you can do it like so. For more information about this syntax in particular, see page 18:
con <- dbConnect() # use the same parameters you used above to connect
query <- dbSendQuery(con,
  "select s.*, fi.*
   from (select * from studies where official_title like '%TP53%') as s
   inner join facility_investigators as fi
   on s.\"joining column\" = fi.\"joining column\""
)
r_dataset <- dbFetch(query)  # fetches all remaining rows of the result
# I would just close the connection in RStudio using the connection tab.
The above query has an inner join in the main query and a subquery in the FROM clause. The subquery performs the filtering you were trying to do in R: it essentially lets you select only from a table whose results are already filtered. An inner join combines all the records the two tables have in common and puts them into one table. If you need to join on more than one column, add an AND between the two conditions in the ON clause.
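For completeness, the same join can also be expressed lazily with dplyr/dbplyr on the AACT connection so that the database does the filtering and joining before collect(); this is a sketch, and the join key (assumed here to be nct_id) is something you would need to confirm against both tables:
library(dplyr)
library(RPostgreSQL)
aact = src_postgres(dbname = 'aact',
                    host = "aact-db.ctti-clinicaltrials.org",
                    user = 'aact', password = 'aact')
study_tbl = tbl(src = aact, 'studies')
investigator_tbl = tbl(src = aact, 'facility_investigators')
result = study_tbl %>%
  filter(official_title %like% '%TP53%') %>%
  inner_join(investigator_tbl, by = "nct_id") %>%  # assumed join key
  collect()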

List tables within a Postgres schema using R

I'm connecting to a PostgreSQL db using R and the RPostgreSQL package. The db has a number of schemas and I would like to know which tables are associated with a specific schema.
So far I have tried:
dbListTables(db, schema="sch2014")
dbGetQuery(db, "dt sch2014.*")
dbGetQuery(db, "\dt sch2014.*")
dbGetQuery(db, "\\dt sch2014.*")
None of which have worked.
This related question also exists: Setting the schema name in postgres using R, which would solve the problem by defining the schema at the connection. However, it's not yet been answered!
Reading this answer https://stackoverflow.com/a/15644435/2773500 helped. I can use the following to get the tables associated with a specific schema:
dbGetQuery(db,
"SELECT table_name FROM information_schema.tables
WHERE table_schema='sch2014'")
The following should work (using DBI_v1.1.1)
DBI::dbListObjects(conn, DBI::Id(schema = 'schema_name'))
While it has all the info you want, it's hard to access and hard to read.
I would recommend something that produces a data frame:
# get a hard to read table given some Postgres connection `conn`
x = DBI::dbListObjects(conn, DBI::Id(schema = 'schema_name'))
# - extract column "table" comprising a list of Formal class 'Id' objects then
# - extract the 'name' slot for each S4 object element
# could also do `lapply(x$table, function(x) x@name)`
v = lapply(x$table, function(x) slot(x, 'name'))
# create a dataframe with header 'schema', 'table'
d = as.data.frame(do.call(rbind, v))
Or in one line:
d = as.data.frame(do.call(rbind, lapply(DBI::dbListObjects(conn, DBI::Id(schema = 'schema_name'))$table, function(x) slot(x, 'name'))))
Or in a more "tidy" way:
conn %>%
DBI::dbListObjects(DBI::Id(schema = 'schema_name')) %>%
dplyr::pull(table) %>%
purrr::map(~slot(.x, 'name')) %>%
dplyr::bind_rows()
OUTPUT is something like
> d
schema table
1 schema_name mtcars
You can use the table_schema option rather than just schema to see a list of tables within the specific schema. So keeping with your example code snippet above, the below line should work:
dbListTables(db, table_schema="sch2014")

Clearing specific rows using RODBC

I would like to use the RODBC package to partially overwrite a Microsoft Access table with a data frame. Rather than overwriting the entire table, I am looking for a way in which to remove only specific rows from that table -- and then to append my data frame to its end.
My method for appending the frame is pretty straightforward. I would use the following function:
sqlSave(ch, df, tablename = "accessTable", rownames = F, append = T)
The challenge is finding a function that will allow me to clear specific row numbers from the Access table ahead of time. The sqlDrop and sqlClear functions do not seem to get me there, since they will either delete or clear the entire table as a whole.
Any recommendation to achieve this task would be much appreciated!
Indeed, consider using sqlQuery to subset your Access table to the rows you want to keep, then rbind with the current data frame, and finally sqlSave, purposely overwriting the original Access table with append = FALSE.
# IMPORT QUERY RESULTS INTO DATAFRAME
keeprows <- sqlQuery(ch, "SELECT * FROM [accesstable] WHERE timedata >= somevalue")
# CONCATENATE df to END
finaldata <- rbind(keeprows, df)
# OVERWRITE ORIGINAL ACCESS TABLE
sqlSave(ch, finaldata, tablename = "accessTable", rownames = FALSE, append = FALSE)
Of course you can also do the converse: delete rows from the table per the specified logic and then append (NOT overwrite) with sqlSave:
# ACTION QUERY TO RUN IN DATABASE
sqlQuery(ch, "DELETE FROM [accesstable] WHERE timedata <= somevalue")
# APPEND TO ACCESS TABLE
sqlSave(ch, df, tablename = "accessTable", rownames = FALSE, append = TRUE)
The key is finding the SQL logic that specifies the rows you intend to keep.
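If the cutoff lives in an R variable, one small illustration (not from the original answer; the column name and cutoff value are placeholders) is to splice it into the action query before sending it, remembering that Access SQL delimits date literals with #:
# hypothetical cutoff date held in R
cutoff <- "2020-01-01"
# build and run the DELETE statement against the Access table
sqlQuery(ch, sprintf("DELETE FROM [accesstable] WHERE timedata <= #%s#", cutoff))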

Reading huge csv files into R with sqldf works but sqlite file takes twice the space it should and needs "vacuuming"

Reading around, I found out that the best way to read a larger-than-memory csv file is to use read.csv.sql from the sqldf package. This function reads the data directly into an SQLite database and then executes a SQL statement.
I noticed the following: it seems that the data read into SQLite is stored in a temporary table, so that in order to make it available for future use, this has to be requested explicitly in the SQL statement.
As an example, the following code reads some sample data into SQLite:
# generate sample data
sample_data <- data.frame(col1 = sample(letters, 100000, TRUE), col2 = rnorm(100000))
# save as csv
write.csv(sample_data, "sample_data.csv", row.names = FALSE)
# create a sample sqlite database
library(sqldf)
sqldf("attach sample_db as new")
# read the csv into the database and create a table with its content
read.csv.sql("sample_data.csv", sql = "create table data as select * from file",
dbname = "sample_db", header = T, row.names = F, sep = ",")
The data can then be accessed with sqldf("select * from data limit 5", dbname = "sample_db").
The problem is the following: the sqlite file takes up twice as much space as it should. My guess is that it contains the data twice: once for the temporary read, and once for the stored table. It is possible to clean up the database with sqldf("vacuum", dbname = "sample_db"). This will reclaim the empty space, but it takes a long time, especially when the file is big.
Is there a better solution that doesn't create this data duplication in the first place?
Solution: using RSQLite without going through sqldf:
library(RSQLite)
con <- dbConnect("SQLite", dbname = "sample_db")
# read csv file into sql database
dbWriteTable(con, name="sample_data", value="sample_data.csv",
row.names=FALSE, header=TRUE, sep = ",")
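As a quick check that the import worked (a minimal follow-up, not part of the original answer), you can query the new table and then close the connection:
# peek at the first rows of the imported table
dbGetQuery(con, "SELECT * FROM sample_data LIMIT 5")
dbDisconnect(con)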
