I have a requirement to fetch data from HIVE tables into a csv file and use it in RevoScaleR.
Currently we pull the data from Hive, manually put it into a file, and use it on the Unix file system for ad hoc analysis. However, the requirement is to redirect the results directly to an HDFS location and use RevoScaleR from there.
How do I do that? Or what sort of connection do I need to establish for this?
If I understand your question correctly, you could use a RevoScaleR ODBC connection to import the Hive table and do further analysis from there.
Here is an example using the Hortonworks-provided ODBC driver:
OdbcConnString <- "DSN=Sample Hortonworks Hive DSN"
# ODBC data source pointing at the Hive table, read in chunks of 150,000 rows
odbcDS <- RxOdbcData(sqlQuery = "SELECT * FROM airline",
                     connectionString = OdbcConnString,
                     stringsAsFactors = TRUE,
                     useFastRead = TRUE,
                     rowsPerRead = 150000)
# Import into a local XDF file and inspect it
xdfFile <- "airlineHWS.xdf"
if (file.exists(xdfFile)) file.remove(xdfFile)
Flights <- rxImport(odbcDS, outFile = xdfFile, overwrite = TRUE)
rxGetInfo(data = xdfFile, getVarInfo = TRUE, numRows = 10)
Chenwei's approach is OK, but there is one problem: the data passes through memory as a data frame via the odbcDS object. If we have a huge table in Hive, that will not work.
I would suggest keeping everything on disk by using external tables in Hive and then using the backing data directly in Revolution R.
Something along these lines:
Create an external table from the existing Hive table in a text format (CSV, tab-delimited, etc.):
CREATE EXTERNAL TABLE ext_table
LIKE your_original_table_name
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/your/hdfs/location';
Here we are creating an external table that is stored as CSV files in HDFS.
Next, copy the original table into the external table using the insert overwrite command:
insert overwrite table ext_table select * from your_original_table_name
If we want to check the backing data on HDFS, type:
hadoop fs -ls /your/hdfs/location/
We can see the part files stored at that location. Go ahead and cat them to be doubly sure.
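For example, something along these lines should print the first few rows of one of the part files (the name 000000_0 is just a typical default and may differ on your cluster):
hadoop fs -cat /your/hdfs/location/000000_0 | head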
Now we can use the RxTextData function to read the data from the step above:
hive_data <- RxTextData(file = '/your/hdfs/location/', delimiter = ',')
Now you can create an XDF file from hive_data (using it as the input to rxImport with an RxXdfData target) to make further processing more efficient; above all, the data never has to be loaded into memory in full.
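A minimal sketch of that last step, assuming the data stays on HDFS (the output path and the composite-XDF option are assumptions, adjust them to your cluster):
hdfs <- RxHdfsFileSystem()
# Text source on HDFS -- the same location the external table writes to
hive_data <- RxTextData(file = "/your/hdfs/location/", delimiter = ",",
                        fileSystem = hdfs)
# XDF target also kept on HDFS (composite XDF, i.e. a directory of .xdf parts)
hive_xdf <- RxXdfData("/your/hdfs/location_xdf", fileSystem = hdfs,
                      createCompositeSet = TRUE)
rxImport(inData = hive_data, outFile = hive_xdf, overwrite = TRUE)
rxGetInfo(hive_xdf, getVarInfo = TRUE, numRows = 5)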
I have created and filled an SQLite database within R using the packages DBI and RSQLite, e.g. like this:
con_sqlite <- DBI::dbConnect(RSQLite::SQLite(), "../mydatabase.sqlite")
DBI::dbWriteTable(con_sqlite, "mytable", d_table, overwrite = TRUE)
...
Now the SQLite file has become too big and I have reduced the tables. However, the file size does not decrease, and I found out that I have to use the VACUUM command. Is there a way to run this command from within R?
I think this should do the trick:
DBI::dbExecute(con, "VACUUM;")
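To confirm that space was actually reclaimed, you can compare the file size before and after (using the con_sqlite connection and file name from the question):
file.size("../mydatabase.sqlite")        # size before
DBI::dbExecute(con_sqlite, "VACUUM;")
file.size("../mydatabase.sqlite")        # size after, should be smaller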
I have a 20GB dataset in csv format and I am trying to trim it down with a read.csv.sql command.
I am successfully able to load the first 10,000 observations with the following command:
testframe = read.csv(file.choose(),nrows = 10000)
(The column names were shown in a screenshot that is not reproduced here.)
I then tried to build my trimmed down dataset with the following command, and get an error:
reduced = read.csv.sql(file.choose(),
sql = 'select * from file where "country" = "Poland" OR
country = "Germany" OR country = "France" OR country = "Spain"',
header = TRUE,
eol = "\n")
The error is:
Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) :
  RS_sqlite_import: C:\Users\feded\Desktop\AWS\biodiversity-data\occurence.csv line 262 expected 37 columns of data but found 38
Why can I load the first 10,000 observations with ease, while problems arise with the second command? I hope this is all the information needed to help with this issue.
Note that with the latest version of all packages read.csv.sql is working again.
RSQLite had made breaking changes in its interface to SQLite, which meant that read.csv.sql, and any other software that read files into SQLite from R through the old interface, no longer worked. (Other aspects of sqldf still worked.)
findstr/grep
If the only reason you are doing this is to cut the file down to the four countries indicated, you could simply preprocess the csv file like this on Windows, assuming abc.csv is your csv file and it is in the current directory. We also assume that XYZ is a string that appears in the header line, so the header row is retained in the output.
DF <- read.csv(pipe('findstr "XYZ France Germany Poland Spain" abc.csv'))
On other platforms use grep:
DF <- read.csv(pipe('grep -E "XYZ|France|Germany|Poland|Spain" abc.csv'))
The above could retrieve some extra rows if those words can also appear in fields other than the intended one; if that is a concern, use subset or filter in R, once the data is loaded, to narrow it down to just the desired rows, as shown below.
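For example (assuming the column is literally named country, as in the question):
DF <- subset(DF, country %in% c("France", "Germany", "Poland", "Spain"))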
Other utilities
There are also numerous command line utilities that can be used as an alternative to findstr and grep such as sed, awk/gawk (mentioned in the comments) and utilities specifically geared to csv files such as csvfix (C++), miller (go), csvkit (python), csvtk (go) and xsv (rust).
xsv
Taking xsv as an example, binaries can be downloaded from its GitHub releases page, and then we can write the following, assuming xsv is in the current directory or on the path. This instructs xsv to extract the rows for which the indicated regular expression matches the country column.
cmd <- 'xsv search -s country "France|Germany|Poland|Spain" abc.csv'
DF <- read.csv(pipe(cmd))
SQLite command line tool
You can use the SQLite command line program to read the file into an SQLite database, which it will create for you. Search for "download sqlite", download the SQLite command line tools for your platform, and unpack them. Then from the command line (not from R) run something like this to create the abc.db SQLite database from abc.csv.
sqlite3 --csv abc.db ".import abc.csv abc"
Then assuming that the database is in current directory run this in R:
library(sqldf)
sqldf("select count(*) from abc", dbname = "abc.db")
I am not sure that SQLite is a good choice for such a large file, but you can try it.
H2
Another possibility, if you have sufficient memory to hold the database (possibly after using findstr/grep/xsv or another utility on the command line rather than in R), is to use the H2 database backend to sqldf from R.
If sqldf sees that the RH2 package containing the H2 driver is loaded, it will use that instead of SQLite. (It would also be possible to use the MySQL or PostgreSQL backends, which are much more likely to handle the size you have, but they are more involved to install, so we won't cover them here.)
Note that the RH2 driver requires the rJava R package, which in turn requires Java itself, although Java is very easy to install. The H2 database itself is included in the RH2 driver package, so it does not have to be installed separately. Also, the first time in a session that you access Java code through rJava, it has to load Java, which takes some time, but thereafter it is faster within that session.
library(RH2)
library(sqldf)
abc3 <- sqldf("select * from csvread('abc.csv') limit 3") |>
  type.convert(as.is = TRUE)
I have two very large csv files that contain the same variables. I want to combine them into one table inside a sqlite database - if possible using R.
I successfully managed to put both csv files in separate tables inside one database using inborutils::csv_to_sqlite that one imports small chunks of data at a time.
Is there a way to create a third tables where both tables are simply appended using R (keeping in mind the limited RAM)? And if not - how else can I perform this task? Maybe via the terminal?
We assume that when the question refers to the "same variables" it means that the two tables have the same column names. Below we create two such test tables, BOD and BOD2, and then in the create statement we combine them, creating the table both. This does the combining entirely on the SQLite side. Finally we read back table both to check it.
library(RSQLite)
con <- dbConnect(SQLite()) # modify to refer to existing SQLite database
dbWriteTable(con, "BOD", BOD)
dbWriteTable(con, "BOD2", 10 * BOD)
dbExecute(con, "create table both as select * from BOD union select * from BOD2")
dbReadTable(con, "both")
dbDisconnect(con)
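Note that union removes duplicate rows across the two tables. If the two CSVs may contain identical rows that should all be kept (a pure append), union all is probably what you want instead, e.g.:
dbExecute(con, "create table both2 as
                select * from BOD union all select * from BOD2")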
I have my final output in R dataframe. I need to write this output to a database in Azure Databricks. Can someone help me with the syntax? I used this code:
require(SparkR)
data1 <- createDataFrame(output)
write.df(data1, path="dbfs:/datainput/sample_dataset.parquet",
source="parquet", mode="overwrite")
This code runs without error, but I don't see the database in the datainput folder (mentioned in the path). Is there another way to do this?
I believe you are looking for the saveAsTable function. write.df only saves the data to the file system; it does not register the data as a table.
require(SparkR)
data1 <- createDataFrame(output)
saveAsTable(data1, tableName = "default.sample_table", source="parquet", mode="overwrite")
In the above code, default is an existing database name, under which a new table named sample_table will be created. If you specify just sample_table instead of default.sample_table, it will likewise be saved in the default database.
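To double-check that the table is registered, something like this should work from the same SparkR session (the table name is the one assumed above):
result <- sql("SELECT * FROM default.sample_table LIMIT 5")
head(result)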
I am learning SQLite. I have a big data frame in csv format and I imported it into SQLite.
db <- dbConnect(SQLite(), dbname="myDB.sqlite")
dbWriteTable(conn = db, name = "myDB", dataframe, overwrite = TRUE,
             row.names = FALSE)
After that, I saw that there is a myDB.sqlite file in my directory, but it is zero bytes. How can I save the data frame in SQLite so that I don't need to write the table every time? Thanks.
It should be written to your db. Like I said before, your code works for me; it's just that the RStudio file viewer doesn't automatically refresh when files are written to.
Just to be sure that the data was written to your db, try running dbGetQuery(conn = db, "SELECT * FROM myDB"). That should return your data frame.
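As a further check, a quick way to verify from a fresh R session that the data really persisted (the file and table names are the ones from the question):
library(RSQLite)
db <- dbConnect(SQLite(), dbname = "myDB.sqlite")
dbListTables(db)                           # should list "myDB"
head(dbGetQuery(db, "SELECT * FROM myDB")) # first rows of the saved table
dbDisconnect(db)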