I connect to Redshift using R remotely from my workstation.
install.packages("RPostgreSQL")
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con1 <- dbConnect(drv, host = "url", port = "xxxx",
                  dbname = "db_name", user = "id", password = "password")
dbGetInfo(con1)
Then I create a table:
dbSendQuery(con1, "create table schema.table_name as select * from schema.table_name;")
Now I want to export this table to a .csv file on my workstation. How can I do this? Again, I don't have a Postgres database installed on my workstation; I'm only using R to reach it.
Also, this table is LARGE: 4 columns and 14 million rows.
Thanks!
You'll need to pull down the results of a query into a local object, then dump the object to a CSV. Something along these lines should get you started:
res <- dbSendQuery(con1, "select * from schema.table_name")
dat <- dbFetch(res)
readr::write_csv(dat, "~/output.csv")
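For a table this size you may not want to hold everything in memory at once. Here is a rough sketch that streams the result in chunks instead, assuming the RPostgreSQL driver supports incremental dbFetch(); the chunk size and output path are arbitrary:
res <- dbSendQuery(con1, "select * from schema.table_name")
wrote_header <- FALSE
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 500000)            # pull 500k rows at a time
  readr::write_csv(chunk, "~/output.csv",
                   append = wrote_header,      # append after the first chunk
                   col_names = !wrote_header)  # write the header only once
  wrote_header <- TRUE
}
dbClearResult(res)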
I figured this out after posting; sharing in case it helps.
library(data.table)   # for fwrite()
system.time(
  fwrite(dbReadTable(con1, c("schema", "table")),
         file = "file.csv", sep = ",", na = "",
         row.names = FALSE, col.names = TRUE)
)
I hear feather is even faster?
This was for 43 million rows with 4 columns; it took 15 minutes.
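On the feather question: I haven't benchmarked it against fwrite here, but a minimal sketch via the arrow package (package choice is an assumption; the feather package also works) would be:
library(arrow)
dat <- dbReadTable(con1, c("schema", "table"))
write_feather(dat, "file.feather")   # columnar binary format, fast to write and re-read
# read it back later with arrow::read_feather("file.feather")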
Related
I'm dealing with a huge database (67 million rows and 55 columns). The database is on Azure, and I want to analyze it in R, so I'm using the following:
library("odbc")
library("DBI")
library("tidyverse")
library("data.table")
con <- DBI::dbConnect(odbc::odbc(),
                      UID = rstudioapi::askForPassword("myEmail"),
                      Driver = "ODBC Driver 17 for SQL Server",
                      Server = server, Database = database,
                      Authentication = "ActiveDirectoryInteractive")
#selecting columns
myData = data.table::setDT(DBI::dbGetQuery(conn = con, statement =
'SELECT Column1, Column2, Column3
FROM myTable'))
I'm trying to use data.table::setDT to convert the data.frame to a data.table, but it is still taking a very long time to load.
Any hints on how I can load this data faster?
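One pattern that sometimes helps (a sketch, not a guaranteed speed-up): stream the result in batches with dbSendQuery()/dbFetch() and bind them with data.table::rbindlist(), so you avoid one giant allocation and can watch progress; the batch size below is arbitrary:
res <- DBI::dbSendQuery(con, "SELECT Column1, Column2, Column3 FROM myTable")
batches <- list()
i <- 1
while (!DBI::dbHasCompleted(res)) {
  batches[[i]] <- DBI::dbFetch(res, n = 1e6)   # 1 million rows per batch
  i <- i + 1
}
DBI::dbClearResult(res)
myData <- data.table::rbindlist(batches)       # returns a data.table directly, no setDT needed
Beyond that, the biggest wins usually come from pushing filters and aggregations into the SQL itself so fewer rows cross the network.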
Here is my code, where I am trying to write data from R to an SQLite database file.
library(DBI)
library(RSQLite)
library(dplyr)
library(data.table)
library(readr)   # for read_rds()
con <- dbConnect(RSQLite::SQLite(), "data.sqlite")
### Read the file you want to load to the SQLite database
data <- read_rds("data.rds")
# make column names lowercase and SQL-safe
dbSafeNames <- function(names) {
  names <- gsub('[^a-z0-9]+', '_', tolower(names))
  names <- make.names(names, unique = TRUE, allow_ = TRUE)
  names <- gsub('.', '_', names, fixed = TRUE)
  names
}
colnames(data) = dbSafeNames(colnames(data))
### Load the dataset to the SQLite database
dbWriteTable(conn=con, name="data", value= data, row.names=FALSE, header=TRUE)
While writing the 80 GB of data, I see the size of data.sqlite increase up to about 45 GB; then it stops and throws the following errors:
Error in rsqlite_send_query(conn@ptr, statement) : disk I/O error
Error in rsqlite_send_query(conn@ptr, statement) :
  no such savepoint: dbWriteTable
What is the fix, and what should I do? If this is specific to RSQLite, please suggest a more robust way to build the database, e.g. RMySQL, RPostgreSQL, etc.
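One thing worth trying (a sketch, assuming the drive holding data.sqlite and SQLite's temp files actually has enough free space, since a full disk is a common cause of this disk I/O error): write the table in chunks with append = TRUE instead of a single 80 GB dbWriteTable() call, so no single transaction/savepoint has to cover the whole load. The chunk size is arbitrary.
chunk_size <- 1e6
n <- nrow(data)
for (s in seq(1, n, by = chunk_size)) {
  rows <- s:min(s + chunk_size - 1, n)
  dbWriteTable(con, "data", data[rows, ],
               overwrite = (s == 1),   # replace any partial table left by a failed run
               append    = (s > 1),    # then append the remaining chunks
               row.names = FALSE)
}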
All,
I'm trying to use the packages RJDBC, rJava, and DBI in R to extract data from a big Hive table sitting in a MapR Hive/Hadoop cluster on a remote Linux machine.
I don't have any issues connecting to the Hive cluster. The table (table1) I'm trying to extract data from is about 500 million rows x 16 columns.
This is the code:
options(java.parameters = "-Xmx32g" )
library(RJDBC)
library(DBI)
library(rJava)
hive_dir <- "/opt/mapr/hive/hive-0.13/lib"
.jinit()
.jaddClassPath(paste(hive_dir,"hadoop-0.20.2-core.jar", sep="/"))
.jaddClassPath(c(list.files("/opt/mapr/hadoop/hadoop-0.20.2/lib",pattern="jar$",full.names=T),
list.files("/opt/mapr/hive/hive-0.13/lib",pattern="jar$",full.names=T),
list.files("/mapr/hadoop-dir/user/userid/lib",pattern="jar$",full.names=T)))
drv <- JDBC("org.apache.hive.jdbc.HiveDriver","hive-jdbc-0.13.0-mapr-1504.jar",identifier.quote="`")
hive.master <- "xx.xxx.xxx.xxx:10000"
url.dbc <- paste0("jdbc:hive2://", hive.master)
conn = dbConnect(drv, url.dbc, "user1", "xxxxxxxx")
dbSendUpdate(conn, "set hive.resultset.use.unique.column.names=false")
df <- dbGetQuery(conn, "select * from dbname.table1 limit 1000000 ") # basically 1 million rows
The above works perfectly, and the df data.frame contains exactly what I want. However, if I remove the limit from the last code segment, I get an error:
df <- dbGetQuery(conn, "select * from dbname.table1 ")
Error:
Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve JDBC result set for select * from dbname.table1 (Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask)
I googled the last part of the error (Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask) and tried putting these two statements before the dbGetQuery, but couldn't get rid of the error:
dbSendUpdate(conn, "SET hive.auto.convert.join=false")
dbSendUpdate(conn, "SET hive.auto.convert.join.noconditionaltask=false")
Does anybody have any idea why I'm getting the error when I remove the limit from my select statement? It even works for 120 million rows with the limit, but takes a long time. At this point, the time taken is not a concern for me.
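The MapRedTask failure itself is happening on the Hive side, so it has to be sorted out there; but once the unbounded query does run, a sketch like the following pulls the result down in blocks so the JVM never has to materialize all 500M rows from a single dbGetQuery() call (the block size and output file are arbitrary, and data.table::fwrite is just one possible writer):
res <- dbSendQuery(conn, "select * from dbname.table1")
repeat {
  block <- fetch(res, n = 1000000)                         # 1 million rows per block
  if (nrow(block) == 0) break                              # no rows left
  data.table::fwrite(block, "table1.csv", append = TRUE)   # spill each block to disk
}
dbClearResult(res)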
I am using SparkR on RStudio Server. After creating a sqlContext, I processed a few tables in SparkR and am left with a final table of 2.2 million records, which I want to convert to an R data.frame in order to develop regression models using R functions. However, as.data.frame(finaltable) never finishes, even after 2 hours, because of memory issues.
library(SparkR)
sc <- sparkR.init(master = "yarn-client",
                  sparkEnvir = list(spark.yarn.keytab = "/home/teja_kolli/teja_kolli.keytab",
                                    spark.yarn.principal = "teja_kolli@HADOOP.QA.AWS.CHOTEL.COM",
                                    spark.driver.memory = "4g"))
sqlContext <- sparkRSQL.init(sc)
customer_activity_bookings <- parquetFile(sqlContext, "s3a:/parquet/customer_activity_bookings/.parquet")
registerTempTable(customer_activity_bookings, "customer_activity_bookings")
I use some 4 tables in a similar way and do further processing to arrive at the table t3 below, which has about 2.2 million records:
t3 <- sql(sqlContext,
"select a.visitor_id,a.timestamp,a.sort_number,a.property_id,a.brand_name,a.distance_value,a.guest_recommends,a.guest_reviews,a.min_avg_nightly_before_tax,a.rating_value,
a.relevance,a.relevance_distance_index,a.relevance_rate_index,a.relevance_rating_index,a.hotel_selection_type,a.pid,c.p_key,c.sum_key
from t1 a left outer join t2 c on a.visitor_id = c.visitor_id and a.timestamp = c.timestamp where c.p_key=1 and sum_key=1")
modeldata1 <- as.data.frame(t3)
The as.data.frame() call above takes very long to run and eventually throws an "out of memory: Java heap space" error. In the sparkR.init connection I already went up to spark.driver.memory = "4g" and can't go beyond that due to memory restrictions.
Is there any workaround to bring this final table of 2.2 million records into R so that I can use R functions, libraries, and commands?
Converting a SparkR data frame to a local R data frame is not a good idea: you are moving all of the distributed data to one point, which causes a lot of network traffic and throws away the advantage of having the data distributed. You should look more closely at the SparkR package; there is probably a command for what you want to calculate:
https://spark.apache.org/docs/latest/api/R/
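For example, SparkR can fit a regression directly on the distributed data, or you can collect only a sample small enough for local R. A rough sketch, assuming the SparkR 1.x API from the question; the formula below is illustrative only, not taken from the original post:
# Option 1: fit the model on the SparkDataFrame itself, no collect needed
model <- SparkR::glm(relevance ~ rating_value + guest_reviews + distance_value,
                     data = t3, family = "gaussian")
summary(model)

# Option 2: collect only a random sample that fits in local memory
t3_sample <- SparkR::sample(t3, withReplacement = FALSE, fraction = 0.1)
modeldata_sample <- as.data.frame(t3_sample)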
Reading around, I found out that the best way to read a larger-than-memory csv file is to use read.csv.sql from the sqldf package. This function reads the data directly into a SQLite database and then executes a SQL statement against it.
I noticed the following: the data read into SQLite seems to be stored in a temporary table, so to make it available for future use you have to ask for that explicitly in the sql statement.
As an example, the following code reads some sample data into SQLite:
# generate sample data
sample_data <- data.frame(col1 = sample(letters, 100000, TRUE), col2 = rnorm(100000))
# save as csv
write.csv(sample_data, "sample_data.csv", row.names = FALSE)
# create a sample sqlite database
library(sqldf)
sqldf("attach sample_db as new")
# read the csv into the database and create a table with its content
read.csv.sql("sample_data.csv", sql = "create table data as select * from file",
dbname = "sample_db", header = T, row.names = F, sep = ",")
The data can then be accessed with sqldf("select * from data limit 5", dbname = "sample_db").
The problem is the following: the sqlite file takes up twice as much space as it should. My guess is that it contains the data twice: once for the temporary read, and once for the stored table. It is possible to clean up the database with sqldf("vacuum", dbname = "sample_db"). This will reclaim the empty space, but it takes a long time, especially when the file is big.
Is there a better solution that doesn't create this data duplication in the first place?
Solution: using RSQLite without going through sqldf:
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "sample_db")
# read csv file into sql database
dbWriteTable(con, name="sample_data", value="sample_data.csv",
row.names=FALSE, header=TRUE, sep = ",")
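The imported table can then be queried through the same connection, e.g.:
dbGetQuery(con, "select * from sample_data limit 5")
dbDisconnect(con)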