Create database in Azure Databricks using R dataframe

I have my final output in an R dataframe and need to write it to a database in Azure Databricks. Can someone help me with the syntax? I used this code:
require(SparkR)
data1 <- createDataFrame(output)
write.df(data1, path = "dbfs:/datainput/sample_dataset.parquet",
         source = "parquet", mode = "overwrite")
This code runs without error, but I don't see a database in the datainput folder (mentioned in the path). Is there some other way to do it?

I believe you are looking for the saveAsTable function. write.df only saves the data to the file system; it does not register the data as a table.
require(SparkR)
data1 <- createDataFrame(output)
saveAsTable(data1, tableName = "default.sample_table", source="parquet", mode="overwrite")
In the code above, default is an existing database name, under which a new table named sample_table will be created. If you specify just sample_table instead of default.sample_table, it will likewise be saved in the default database.
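To verify the table was registered, SparkR's SQL interface can list it and read it back (a quick check, assuming the table name from the answer above):

```r
library(SparkR)

# List tables in the default database; sample_table should appear
# once saveAsTable() has run.
showDF(sql("SHOW TABLES IN default"))

# Read the table back as a SparkDataFrame to inspect its contents.
df <- tableToDF("default.sample_table")
head(df)
```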

How to save an empty data to adlsgen2 using R?

I've searched but didn't find any suggestions.
I want to know how to save an empty file to ADLS gen2 using R and then read it back in the same code.
Thanks for the help.
Since you want to write an empty file to ADLS gen2 using R and read it from that location, first create a dataframe.
The dataframe can either be completely empty, or it can have column names but no rows. You can use the following code to create one:
df <- data.frame()
# or, with typed columns:
df <- data.frame(<col_name> = <type>)
Once you have created a dataframe, you must establish connection to ADLS gen2 account. You can use the method specified in the following link to do that.
https://saketbi.wordpress.com/2019/05/11/how-to-connect-to-adls-gen2-using-sparkr-from-databricks-rstudio-while-integrating-securely-with-azure-key-vault/
After making the connection you can use read and write functions in R language using the path to ADLS gen2 storage. The following link provides numerous functions that can be used according to your requirement.
https://www.geeksforgeeks.org/reading-files-in-r-programming/
Sharing the solution to my question:
I used SparkR to create an empty data frame and saved it to ADLS gen2 after setting up the Spark context.
Solution is below:
Step 1: Create a schema
fcstSchema <- structType(structField("ABC", "string", TRUE))
new_df <- data.frame(ABC = NULL, stringsAsFactors = FALSE)
n <- createDataFrame(new_df, fcstSchema)
SparkR::saveDF(n, path = "abfss://<account_name>@<datalake>.dfs.core.windows.net/<path>/", source = "delta", mode = "overwrite")
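To read the empty dataframe back in the same code, SparkR's read.df should work against the same path (a sketch; the placeholder path is the same assumption as above):

```r
library(SparkR)

# Read the Delta data back from the same ADLS gen2 location.
empty_df <- read.df("abfss://<account_name>@<datalake>.dfs.core.windows.net/<path>/",
                    source = "delta")

# The schema survives even though there are no rows.
printSchema(empty_df)
count(empty_df)  # 0 rows
```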

Cannot save tables with get() in R

I scan through an SQL database, load every table via ODBC, and then want to store each one in a file whose name matches the table name. I compose the filename with paste(path, variablename, Sys.Date(), sep="_"). I also want to mirror the SQL tables in R by storing each table's value in a variable of the same name as the corresponding SQL table. I achieve this by loading the data into a temporary variable and then assigning its content to a variable whose name is stored in variablename, via the assign(variablename, temporarytable) function.
I would like to save such an R variable with the save() function, but I need to refer to it by the name stored in the variablename variable. get(variablename) returns its content, but save(get(variablename), file=paste(...,variablename,...)) raises an error that the object ‘get(variablename)’ cannot be found.
What is the problem with get() inside save()? How can I save the variable's content in this situation?
PS: I scan through the SQL database tables with a for loop; in each iteration, variablename stores the current SQL table name, and assign(variablename, temporarytable) loads the data into an object with the required name.
It can be solved like this:
save(list = variablename, file = paste(..., variablename, ...))
As for why save(get(variablename), ...) does not work: save() does not evaluate its ... arguments as values. It deparses them into character names and looks those names up in the environment, so it searches for an object literally called "get(variablename)", which does not exist. The list argument exists precisely for this programmatic case.
Alternatively, you can copy the data to an intermediate object (note it will then be saved under the name temp):
temp <- get(variablename)
save(temp, file = file.path(..., variablename, ...))
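A minimal, self-contained illustration of the whole pattern, using a hypothetical table name and a temp file:

```r
variablename <- "mytable"              # name that would come from the SQL loop
temporarytable <- data.frame(x = 1:3)  # stand-in for the ODBC result

# Bind the data to a variable named after the SQL table.
assign(variablename, temporarytable)

# save(list = ...) takes object *names* as strings, so this works;
# save(get(variablename), ...) would fail because save() deparses its
# arguments to names and looks up an object called "get(variablename)".
f <- tempfile(fileext = ".RData")
save(list = variablename, file = f)

# Load into a fresh environment to check the round trip.
e <- new.env()
load(f, envir = e)
identical(e$mytable, temporarytable)  # TRUE
```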

Speed up INSERT of 1 million+ rows into Postgres via R using COPY?

I would like to bulk-INSERT/UPSERT a moderately large number of rows into a PostgreSQL database using R. To do so I am preparing a multi-row INSERT string with R.
query <- sprintf("BEGIN;
CREATE TEMPORARY TABLE
md_updates(ts_key varchar, meta_data hstore) ON COMMIT DROP;
INSERT INTO md_updates(ts_key, meta_data) VALUES %s;
LOCK TABLE %s.meta_data_unlocalized IN EXCLUSIVE MODE;
UPDATE %s.meta_data_unlocalized
SET meta_data = md_updates.meta_data
FROM md_updates
WHERE md_updates.ts_key = %s.meta_data_unlocalized.ts_key;
COMMIT;", md_values, schema, schema, schema, schema)
DBI::dbGetQuery(con,query)
The entire function can be found here. Surprisingly (at least to me), I learned that the UPDATE part is not the problem: I left it out, ran the query again, and it wasn't much faster. INSERTing a million+ records seems to be the issue here.
I did some research and found quite some information:
bulk inserts
bulk inserts II
what causes large inserts to slow down
The answers from @Erwin Brandstetter and @Craig Ringer were particularly helpful. I was able to speed things up quite a bit by dropping indices and following a few other suggestions.
However, I struggled to implement another suggestion which sounded promising: COPY. The problem is I can't get it done from within R.
The following works for me:
sql <- sprintf('CREATE TABLE
md_updates(ts_key varchar, meta_data hstore);
COPY md_updates FROM STDIN;')
dbGetQuery(sandbox,"COPY md_updates FROM 'test.csv' DELIMITER ';' CSV;")
But I can't get it done without reading from an extra .csv file. So my questions are:
Is COPY really a promising approach here (over the multi-row INSERT I have)?
Is there a way to use COPY from within R without writing the data to a file? The data fits in memory, and since it's already in memory, why write it to disk at all?
I am using PostgreSQL 9.5 on OS X and 9.5 on RHEL respectively.
RPostgreSQL has a postgresqlCopyInDataframe function that looks like it should do what you want:
install.packages("RPostgreSQL")
library(RPostgreSQL)
con <- dbConnect(PostgreSQL(), user="...", password="...", dbname="...", host="...")
dbSendQuery(con, "copy foo from stdin")
postgresqlCopyInDataframe(con, df)
where table foo has the same columns as dataframe df.
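For what it's worth, the newer RPostgres driver (the DBI-based successor to RPostgreSQL) uses COPY internally for dbWriteTable, so a sketch like the following may bulk-load a dataframe without either a multi-row INSERT string or an intermediate CSV (connection details and the md_updates_df dataframe are assumptions here):

```r
library(DBI)

# RPostgres::dbWriteTable streams the dataframe over a COPY channel,
# avoiding per-row INSERTs and any temporary file on disk.
con <- dbConnect(RPostgres::Postgres(), dbname = "...", host = "...",
                 user = "...", password = "...")
dbWriteTable(con, "md_updates", md_updates_df, temporary = TRUE)
dbDisconnect(con)
```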

How to use Rsqlite to save and load data in sqlite

I am learning SQLite. I have a big data frame in CSV format, and I imported it into SQLite:
db <- dbConnect(SQLite(), dbname = "myDB.sqlite")
dbWriteTable(conn = db, name = "myDB", dataframe, overwrite = TRUE,
             row.names = FALSE)
After that, I saw there was a myDB.sqlite file in my directory, but with zero bytes. How can I save the dataframe in SQLite so that I don't need to write the table every time? Thanks.
It should be written to your db. As I said before, your code works for me; it's just that the RStudio file viewer doesn't automatically refresh when files are written to.
Just to be sure the data was written to your db, try running dbGetQuery(conn = db, "SELECT * FROM myDB"). That should return your data frame.
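A round-trip sketch showing that the data persists across connections (the file and table names follow the question; the sample dataframe is just for illustration):

```r
library(DBI)
library(RSQLite)

# Session 1: write the dataframe once.
db <- dbConnect(SQLite(), dbname = "myDB.sqlite")
dbWriteTable(db, "myDB", data.frame(x = 1:3), overwrite = TRUE, row.names = FALSE)
dbDisconnect(db)

# Session 2 (later): reconnect and read the table back -- no need to
# call dbWriteTable again.
db <- dbConnect(SQLite(), dbname = "myDB.sqlite")
dataframe <- dbReadTable(db, "myDB")
dbDisconnect(db)
```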

integrating hadoop, revo-scaleR and hive

I have a requirement to fetch data from HIVE tables into a CSV file and use it in RevoScaleR.
Currently we pull the data from HIVE and manually put it into a file on the Unix file system for ad hoc analysis. The requirement, however, is to redirect the result directly into an HDFS location and use RevoScaleR from there.
How do I do that? What sort of connection do I need to establish?
If I understand your question correctly, you could use a RevoScaleR ODBC connection to import the HIVE table and do further analysis from there.
Here is example of using Hortonworks provided ODBC driver:
OdbcConnString <- "DSN=Sample Hortonworks Hive DSN"
odbcDS <- RxOdbcData(sqlQuery = "SELECT * FROM airline",
                     connectionString = OdbcConnString,
                     stringsAsFactors = TRUE,
                     useFastRead = TRUE,
                     rowsPerRead = 150000)
xdfFile <- "airlineHWS.xdf"
if (file.exists(xdfFile)) file.remove(xdfFile)
Flights <- rxImport(odbcDS, outFile = xdfFile, overwrite = TRUE)
rxGetInfo(data = "airlineHWS.xdf", getVarInfo = TRUE, numRows = 10)
Chenwei's approach is OK, but there is one problem: the data is temporarily held in memory as a data frame via the odbcDS object. If the Hive table is huge, that becomes a serious bottleneck.
I would suggest keeping everything on disk by using external tables in Hive and then reading the backing data directly in Revolution R.
Something in these lines:
Create external table from the existing hive tables in textfile(csv, tab etc) format.
CREATE EXTERNAL TABLE ext_table
LIKE your_original_table_name
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/your/hdfs/location';
Here we are creating external table which is stored as csv file in hdfs.
Next, copy the original table into the external table using an INSERT OVERWRITE command:
insert overwrite table ext_table select * from your_original_table_name
If we want to check the backend data on hdfs type:
hadoop fs -ls /your/hdfs/location/
We can see the part files stored at that location. Go ahead and cat them to be doubly sure.
Now we can use the RxTextData function to read the data from the step above:
hive_data <- RxTextData(file = '/your/hdfs/location/', delimiter = ',')
You can then create an .xdf file from hive_data (passing it as the inFile argument when building an RxXdfData source) for more efficient further processing; above all, the data never has to be held in memory.
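Putting the last step together (a sketch; the RxHdfsFileSystem argument is an assumption here, so that RxTextData resolves the path against HDFS rather than the local file system; adjust to your cluster setup):

```r
library(RevoScaleR)

# Point RevoScaleR at HDFS so the paths resolve against the cluster.
hdfsFS <- RxHdfsFileSystem()

hive_data <- RxTextData(file = "/your/hdfs/location/",
                        delimiter = ",",
                        fileSystem = hdfsFS)

# Import into an .xdf file on HDFS for efficient chunked processing.
xdf_out <- RxXdfData("/your/hdfs/location_xdf/", fileSystem = hdfsFS)
rxImport(inData = hive_data, outFile = xdf_out, overwrite = TRUE)
```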