Can you parse an RODBC query through readr? - r

I am trying to improve my analytics script for a database and want to use the output of the query in my Access DB directly, instead of exporting to a linked Excel file that I refresh and then save as a csv. I can get the table into R, but I would like it parsed through readr. Is there a way to do this?
I have tried writing to a csv straight away and then using read_csv to reimport it, but that produced parsing errors.
dbdata <- sqlQuery(db, "SELECT *
                        FROM qRoutput", stringsAsFactors = FALSE)
That is how I currently import my query.
I would just like to run that query output through readr's parsing functions, with columns coming back as characters and doubles and the class of the result being "spec_tbl_df" "tbl_df" "tbl" "data.frame".

This should already give the output as a data.frame.
Maybe you could try df = sqlGetResults(db) after your sqlQuery.
Alternatively, you can go for the dbplyr approach. Below is an example; this approach is valid for RODBC connections too. Intro to dbplyr
library(RSQLite)
library(dplyr)  # tbl() and %>%; the dbplyr package must also be installed
# Create a db
db <- dbConnect(SQLite(), "./Data/TestDB.db")
# Write data to the db
dbWriteTable(db, "testtable", mtcars)
# Check whether the table is present
dbListTables(db)
# Use dplyr syntax to query
tbl(db, "testtable") %>% select(mpg, cyl)

Related

Speed up odbc::dbFetch

I'm trying to analyze data stored in an SQL database (MS SQL Server) in R, on a Mac. Typical queries might return a few GB of data, and the entire database is a few TB. So far, I've been using the R package odbc, and it seems to work pretty well.
However, dbFetch() seems really slow. For example, a somewhat complex query returns all results in ~6 minutes in SQL Server, but if I run it with odbc and then try dbFetch, it takes close to an hour to get the full 4 GB into a data.frame. I've tried fetching in chunks, which helps modestly (https://stackoverflow.com/a/59220710/8400969); a rough sketch of that approach is below. I'm wondering if there is another way to more quickly pipe the data to my Mac, and I like the line of thinking here: Quickly reading very large tables as dataframes
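For reference, a minimal sketch of that chunked fetch, assuming con is the odbc connection and qry is the query string:
library(DBI)

res <- dbSendQuery(con, qry)
chunks <- list()
while (!dbHasCompleted(res)) {
  # pull a fixed number of rows per round trip instead of everything at once
  chunks[[length(chunks) + 1]] <- dbFetch(res, n = 100000)
}
dbClearResult(res)
result <- do.call(rbind, chunks)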
What are some strategies for speeding up dbFetch when the results of queries are a few GB of data? If the issue is generating a data.frame object from larger tables, are there savings available by "fetching" in a different manner? Are there other packages that might help?
Thanks for your ideas and suggestions!
My answer uses a different package: RODBC, which is on CRAN at https://cran.r-project.org/web/packages/RODBC/index.html.
This has saved me SO MUCH frustration and wasted time that came from my previous method of exporting each query result to .csv to load it into my R environment. I found regular ODBC to be much slower than RODBC.
I use the following functions:
sqlQuery() takes the connection opened by odbcConnect() as its first argument (in parentheses) and the query itself as the second argument. Put the query itself in quote marks.
odbcConnect() is itself the first argument in sqlQuery(). The argument in odbcConnect() is the name of your connection to the SQL db. Put the connection name in quote marks.
odbcCloseAll() is the final function for this task set. Use this after each sqlQuery() to close the connection and save yourself from annoying warning messages.
Here is a simple example.
library(RODBC)
result <- sqlQuery(odbcConnect("ODBCConnectionName"),
                   "SELECT *
                    FROM dbo.table
                    WHERE Collection_ID = 2498")
odbcCloseAll()
Here is the same example PLUS data manipulation directly from the query result.
library(dplyr)
library(RODBC)
result <- sqlQuery(odbcConnect("ODBCConnectionName"),
                   "SELECT *
                    FROM dbo.table
                    WHERE Collection_ID = 2498") %>%
  mutate(matchid = paste0(schoolID, "-", studentID)) %>%
  distinct(matchid, .keep_all = TRUE)
odbcCloseAll()
I would suggest using the dbcooper package found on GitHub: https://github.com/chriscardillo/dbcooper
I have found huge improvements in speed when querying large datasets.
First, add your connection to your environment.
conn <- DBI::dbConnect(odbc::odbc(),
                       Driver = "",
                       Server = "",
                       Database = "",
                       UID = "",
                       PWD = "")
devtools::install_github("chriscardillo/dbcooper")
library(dbcooper)
dbcooper::dbc_init(con = conn,
                   con_id = "test",
                   tables = c("schema.table"))
This adds the function test_schema_table() to your environment, which is used to call the data. To collect it into your environment use test_schema_table() %>% collect().
Here is a microbenchmark I did to compare the results of both DBI and dbcooper.
mbm <- microbenchmark::microbenchmark(
  DBI = DBI::dbFetch(DBI::dbSendQuery(conn, qry)),
  dbcooper = ava_qry() %>% collect(),
  times = 5
)

Reading csv into Rstudio from google cloud storage

I want to read a csv file from Google Cloud Storage with a function similar to read.csv.
I used the googleCloudStorageR library and I can't find a function for that. I don't want to download the file, I just want to read it into my environment as a data frame.
If you download a .csv file, googleCloudStorageR will by default put it into a data.frame for you via write.csv - you can turn off that behaviour by specifying saveToDisk.
# will make a data.frame
gcs_get_object("mtcars.csv")
# save to disk as a CSV
gcs_get_object("mtcars.csv", saveToDisk = "mtcars.csv")
You can specify your own parse function by supplying it via parseFunction.
## default gives a warning about missing column name.
## custom parse function to suppress warning
f <- function(object){
suppressWarnings(httr::content(object, encoding = "UTF-8"))
}
## get mtcars csv with custom parse function.
gcs_get_object("mtcars.csv", parseFunction = f)
I've tried running a sample csv file with the as.data.frame() function.
In order to run this code snippet, make sure you have installed data.table (install.packages("data.table")) and loaded it with library(data.table).
Also be sure that you wrap fread() inside the as.data.frame() call in order to read the file from its location.
Here is the code snippet I ran and managed to display the data frame for my data set:
library(data.table)
MyData <- as.data.frame(fread(file = "$FILE_PATH", header = TRUE, sep = ','))
print(MyData)
Reading Data with TensorFlow:
There is one other way you can read a csv from your cloud storage, with the TensorFlow API. I would assume you are accessing this data from a bucket? First, you would need to install the "readr" and "cloudml" packages for this to work. Then you would use gs_data_dir("gs://your-bucket-name") and build the file path with file.path(data_dir, "something.csv"). You would then read the data from that path with read_csv(file.path(data_dir, "something.csv")). If you want it formatted as a data frame it should look something like this.
library(data.table)
library(cloudml)
library(readr)
data_dir <- gs_data_dir("gs://your-bucket-name")
MyData <- as.data.frame(read_csv(file.path(data_dir, "something.csv")))
print(MyData)
Make sure you have properly authenticated access to your storage.
More information in this link.

Import data into R from MongoDB in JSON format

I want to import data in JSON format from MongoDB into R. I am using the mongolite package to connect MongoDB to R, but when I use mongo$find('{}') the data is stored as a data frame. Please check my R code:
library(mongolite)
mongo <- mongolite::mongo(collection = "Attributes", db = "Test",
                          url = "mongodb://IP:PORT", verbose = TRUE)
df1 <- mongo$find('{}')
df1 is stored as a data frame, but I want the data in JSON format only. Please give your suggestions.
Edit -
This is the actual JSON structure converted into a list.
But when I load data from MongoDB into R using the mongolite package, the data is stored as a data frame, and if I then convert it to a list the structure changes and a few extra columns are inserted into the list.
Please let me know how to solve this issue.
Thanks
SJB
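One way to keep the documents as JSON rather than a data frame is mongolite's export() method, which writes newline-delimited JSON to a connection. A minimal sketch, reusing the connection from the question (the temp file name is arbitrary):
library(mongolite)

mongo <- mongolite::mongo(collection = "Attributes", db = "Test",
                          url = "mongodb://IP:PORT")

# export the collection as newline-delimited JSON instead of a data frame
tmp <- tempfile(fileext = ".json")
mongo$export(file(tmp))

# each element is one document as a JSON string
json_docs <- readLines(tmp)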

How to import tab delimited data to sqlite using RSQLite?

I'd like to import a bunch of large text files to a SQLite db using RSQLite. If my data were comma delimited, I'd do this:
library(DBI)
library(RSQLite)
db <- dbConnect(SQLite(), dbname = 'my_db.sqlite')
dbWriteTable(conn=db, name='my_table', value='my_file.csv')
But how about with '\t'-delimited data? I know I could read the data into an R data.frame and then create the table from that, but I'd like to go straight into SQLite, since there are lots of large files. When I try the above with my data, I get one single character field.
Is there a sep='\t' sort of option I can use? I tried just adding sep='\t', like this:
dbWriteTable(conn=db, name='my_table', value='my_file.csv', sep='\t')
EDIT: And in fact that works great, but a flaw in the file I was working with was producing an error. Also good to add header=TRUE, if you have headers as I do.
Try the following:
dbWriteTable(conn=db, name='my_table', value='my_file.csv', sep='\t')
Per the following toward the top of page 21 of http://cran.r-project.org/web/packages/RMySQL/RMySQL.pdf
When dbWriteTable is used to import data from a file, you may optionally specify header=,
row.names=, col.names=, sep=, eol=, field.types=, skip=, and quote=.
[snip]
sep= specifies the field separator, and its default is ','.
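Putting the question and answer together, a minimal sketch for a tab-delimited file with a header row (my_file.csv stands in for the real file):
library(DBI)
library(RSQLite)

db <- dbConnect(SQLite(), dbname = 'my_db.sqlite')

# import the tab-delimited file directly, telling RSQLite about the
# separator and the header row
dbWriteTable(conn = db, name = 'my_table', value = 'my_file.csv',
             sep = '\t', header = TRUE)

dbGetQuery(db, "SELECT * FROM my_table LIMIT 5")
dbDisconnect(db)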

How to import CSV into sqlite using RSQLite?

As in the question, I found that I can use .import in the sqlite shell, but it does not seem to work in the R environment. Any suggestions?
You can use read.csv.sql in the sqldf package. It is only one line of code to do the read. Assuming you want to create a new database, testingdb, and then read a file into it try this:
library(sqldf)

# create a test file
write.table(iris, "iris.csv", sep = ",", quote = FALSE, row.names = FALSE)

# create an empty database.
# can skip this step if database already exists.
sqldf("attach testingdb as new")
# or: cat(file = "testingdb")

# read into table called iris in the testingdb sqlite database
read.csv.sql("iris.csv", sql = "create table main.iris as select * from file",
             dbname = "testingdb")

# look at first three lines
sqldf("select * from main.iris limit 3", dbname = "testingdb")
The above uses sqldf, which uses RSQLite. You can also use RSQLite directly; see ?dbWriteTable in RSQLite. Note that there can be problems with line endings if you use dbWriteTable directly, which sqldf will (usually) handle automatically.
If your intention was to read the file into R immediately after reading it into the database and you don't really need the database after that then see:
http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql
I tend to do that with the sqldf package: Quickly reading very large tables as dataframes in R
Keep in mind that in the above example I read the csv into a temp sqlite db. You'll obviously need to change that bit.
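For completeness, a sketch of the temp-database variant referred to above (my_file.csv is a placeholder): passing dbname = tempfile() to read.csv.sql loads the csv into a throwaway SQLite database, runs the query there, and returns only the result to R.
library(sqldf)

# the csv is imported into a temporary SQLite db, queried there,
# and the result comes back as an ordinary data.frame
DF <- read.csv.sql("my_file.csv", sql = "select * from file",
                   dbname = tempfile())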
