Facing issue while loading HDFS file into Hive - R

I am trying to write data into Hive using RStudio: first I store the data in HDFS, and from there I want to read it into Hive.
The data is stored in HDFS as:
["TER0626974_achieved","TER0630327_achieved","TER0630520_achieved","TER0537124_achieved","TER0404705_achieved"]
Issue: the problem now is reading this data back from Hive. The table was created as:
CREATE EXTERNAL TABLE dbname.table_name (
id string
) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION "/hdfs/path/to/file";
We are able to load this result into Hive, but when we try to read it we get the error below.
archive_data <- dbGetQuery(hivecon, "SELECT * from Table")

Error in .jcall(rp, "I", "fetch", stride, block) :
org.apache.hive.service.cli.HiveSQLException: java.io.IOException:
org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException:
Start token not found where expected
Could this be the issue: that each JSON record should start with { rather than the whole file being a single array ([)?
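For reference, the JsonSerDe generally expects one JSON object per line rather than a single top-level array, so a file laid out like the sketch below would match the table definition above. This is only an illustrative sketch written from R; the output file name is hypothetical.

ids <- c("TER0626974_achieved", "TER0630327_achieved", "TER0630520_achieved",
         "TER0537124_achieved", "TER0404705_achieved")

# Write one JSON object per line (newline-delimited JSON) so the SerDe
# finds a start token '{' on every record, e.g. {"id":"TER0626974_achieved"}
json_lines <- sprintf('{"id":"%s"}', ids)
writeLines(json_lines, "records.json")   # then put this file at /hdfs/path/to/file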

Related

Does read.csv.sql puts the result dataframe in memory or in the database?

I'm using the sqldf package to import CSV files into R and then produce statistics based on the data loaded into the resulting data frames. We have a shared lab environment with a lot of users, which means that we all share the available RAM on the same server. Although there is a large amount of RAM available, given the number of users who are often connected simultaneously, the administrator of the environment recommends importing our files into some database (PostgreSQL, SQLite, etc.) instead of importing everything into memory.
I was checking the documentation of the sqldf package and the read.csv.sql function caught my attention. Here is what the documentation says:
Reads the indicated file into an sql database creating the database if
it does not already exist. Then it applies the sql statement returning
the result as a data frame. If the database did not exist prior to
this statement it is removed.
However, what I don't understand is whether the returned data frame will be in memory (therefore in the RAM of the server) or, like the imported CSV file, in the specified database. If it is in memory, it doesn't meet my requirement, which is to reduce the use of the shared RAM as much as possible given the huge size of my CSV files (tens of gigabytes, often more than 100,000,000 lines per file).
Curious to see how this works, I wrote the following program:
df_data <- suppressWarnings(read.csv.sql(
  file = "X:/logs/data.csv",
  sql = "
    select
      nullif(timestamp, '') as timestamp_value,
      nullif(user_account, '') as user_account,
      nullif(country_code, '') as country_code,
      nullif(prefix_value, '') as prefix_value,
      nullif(user_query, '') as user_query,
      nullif(returned_code, '') as returned_code,
      nullif(execution_time, '') as execution_time,
      nullif(output_format, '') as output_format
    from
      file
  ",
  header = FALSE,
  sep = "|",
  eol = "\n",
  `field.types` = list(
    timestamp_value = c("TEXT"),
    user_account = c("TEXT"),
    country_code = c("TEXT"),
    prefix_value = c("TEXT"),
    user_query = c("TEXT"),
    returned_code = c("TEXT"),
    execution_time = c("REAL"),
    output_format = c("TEXT")
  ),
  dbname = "X:/logs/sqlite_tmp.db",
  drv = "SQLite"
))
I ran the above program to import a big CSV file (almost 150,000,000 rows). It took around 30 minutes. During the execution, as specified via the dbname parameter in the program source code, I saw that a SQLite database file was created at X:/logs/sqlite_tmp.db. As the rows in the file were being imported, this file became bigger and bigger, which indicated that all rows were indeed being inserted into the database file on disk and not into memory (the server's RAM). At the end of the import, the database file had reached 30 GB. As stated in the documentation, the database was removed automatically once the import finished. Yet even after the created SQLite database had been removed automatically, I was still able to work with the result data frame (that is, df_data in the above code).
What I understand is that the returned data frame was in the RAM of the server, otherwise I wouldn't have been able to refer to it after the created SQLite database had been removed. Please correct me if I'm wrong, but if that is the case, I think I misunderstood the purpose of this R package. My aim was to keep everything, even the result data frame, in a database and use the RAM only for calculations. Is there any way to keep everything in the database until the end of the program?
The purpose of sqldf is to process data frames using SQL. If you want to create a database and read a file into it, you can use dbWriteTable from RSQLite directly (a sketch of that route follows the code below); however, if you want to use sqldf anyway, then first create an empty database, mydb, then read the file into it, and finally check that the table is there. Ignore the read.csv.sql warning. If you add the verbose = TRUE argument to read.csv.sql, it will show the RSQLite statements it is using.
Also you may wish to read https://avi.im/blag/2021/fast-sqlite-inserts/ and https://www.pdq.com/blog/improving-bulk-insert-speed-in-sqlite-a-comparison-of-transactions/
library(sqldf)
sqldf("attach 'mydb' as new")
read.csv.sql("myfile.csv", sql =
"create table mytab as select * from file", dbname = "mydb")
## data frame with 0 columns and 0 rows
## Warning message:
## In result_fetch(res@ptr, n = n) :
## SQL statements must be issued with dbExecute() or
## dbSendStatement() instead of dbGetQuery() or dbSendQuery().
sqldf("select * from sqlite_master", dbname = "mydb")
## type name tbl_name rootpage
## .. info on table that was created in mydb ...
sqldf("select count(*) from mytab", dbname = "mydb")

How to save to pre-existing Snowflake table from R using pool

I am using pool to handle connections to my Snowflake warehouse. I have created a connection to my database and can read data from a pre-existing table with no issues, e.g.:
my_pool <- dbPool(odbc::odbc(),
                  Driver = "Snowflake",
                  Server = Sys.getenv('WH_URL'),
                  UID = Sys.getenv('WH_USER'),
                  PWD = Sys.getenv('WH_PW'),
                  Warehouse = Sys.getenv('WH_WH'),
                  Database = "MY_DB")

my_data <- tbl(my_pool, in_schema(sql("schema_name"), sql("table_name"))) %>%
  collect()
I would like to save back to a table (table_name) and I believe the best way to do this is with pool::dbWriteTable:
# Create some data to save to db
data <- data.frame("user_email" = "tim@apple.com",
                   "query_run" = "arrivals_departures",
                   "data_downloaded" = FALSE,
                   "created_at" = as.character(Sys.time()))
# Define where to save the data
table_id <- Id(database="MY_DB", schema="MY_SCHEMA", table="TABLE_NAME")
# Write to database
pool::dbWriteTable(my_pool, table_id, data, append=TRUE)
However this returns the error:
Error in new_result(connection@ptr, statement, immediate) :
nanodbc/nanodbc.cpp:1594: 00000: SQL compilation error:
Object 'MY_DB.MY_SCHEMA.TABLE_NAME' already exists.
I have read/write/update permissions for this database for the user specified in my_pool.
I have explored the accepted answers here and here to create the above attempt and can't figure out what I'm doing wrong. It's probably something simple that I've forgotten to do - any thoughts?
EDIT: Wondering if my issue is anything to do with: https://github.com/r-dbi/odbc/issues/480
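Not a confirmed fix, but one workaround that may be worth trying (sketch only, untested against Snowflake) is to check a connection out of the pool and append with DBI::dbAppendTable(), which only issues an INSERT and never attempts to create the table:

conn <- pool::poolCheckout(my_pool)        # plain DBI connection from the pool
DBI::dbAppendTable(conn, table_id, data)   # append rows; assumes the table already exists
pool::poolReturn(conn)                     # hand the connection back to the pool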

How to use Rsqlite to save and load data in sqlite

I am learning SQLite. I have a big data frame in CSV format and I imported it into SQLite:
library(RSQLite)

db <- dbConnect(SQLite(), dbname = "myDB.sqlite")
dbWriteTable(conn = db, name = "myDB", dataframe, overwrite = TRUE,
             row.names = FALSE)
After that, I saw there was a myDB.sqlite file in my directory, but with zero bytes. How can I save the data frame in SQLite so that I don't need to write the table every time? Thanks.
It should be written to your db. Like I said before, your code works for me; it's just that the RStudio file viewer doesn't automatically refresh when files have been written to.
Just to be sure that the data was written to your db, try running dbGetQuery(conn = db, "SELECT * FROM myDB"). That should return your data frame.
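To convince yourself the data really persists on disk, you can close the connection, reopen the same file, and read the table back. A small sketch using the names from the question:

dbDisconnect(db)                           # close the current connection

db <- dbConnect(SQLite(), dbname = "myDB.sqlite")
dbListTables(db)                           # should list "myDB"
dataframe2 <- dbReadTable(db, "myDB")      # reload without calling dbWriteTable again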

R Shiny: Unable to retrieve JDBC result set for vertica DB

I am getting the error below while using Vertica COPY ... FROM LOCAL to load a table from a local file.
Please suggest a fix.
Error:Unable to retrieve JDBC result set for COPY
Monetisation_Base_table FROM LOCAL 'E://testCSV.csv' delimiter ','
([Vertica]JDBC A ResultSet was expected but not generated
from query "COPY Monetisation_Base_table FROM LOCAL 'E://testCSV.csv'
delimiter ','". Query not executed. )
Code Used:
library(RJDBC)
vDriver <- JDBC(driverClass = "com.vertica.jdbc.Driver",
                classPath = "full/path/to/driver/vertica_jdbc_VERSION.jar")
vertica <- dbConnect(vDriver, "jdbc:vertica://30.0.9.163:5433/db", "sk14930IU", "Snapdeal_40")
myframe <- dbGetQuery(vertica, "COPY Monetisation_Base_table FROM LOCAL 'E://testCSV.csv' delimiter ','")
dbSendUpdate should do the job in this case, since this COPY statement does not return a result set.
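A minimal sketch of that suggestion, reusing the statement from the question: COPY ... FROM LOCAL produces no ResultSet, so send it with RJDBC's dbSendUpdate() rather than dbGetQuery().

# Executes the statement without expecting a ResultSet back
dbSendUpdate(vertica,
             "COPY Monetisation_Base_table FROM LOCAL 'E://testCSV.csv' DELIMITER ','")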

integrating hadoop, revo-scaleR and hive

I have a requirement to fetch data from Hive tables into a CSV file and use it in RevoScaleR.
Currently we pull the data from Hive and manually put it into a file on the Unix file system for ad hoc analysis; however, the requirement is to redirect the result directly into an HDFS location and use RevoScaleR from there.
How do I do that, and what sort of connection do I need to establish for this?
If I understand your question correctly, you could use a RevoScaleR ODBC connection to import the Hive table and do further analysis from there.
Here is an example using the Hortonworks-provided ODBC driver:
OdbcConnString <- "DSN=Sample Hortonworks Hive DSN"

odbcDS <- RxOdbcData(sqlQuery = "SELECT * FROM airline",
                     connectionString = OdbcConnString,
                     stringsAsFactors = TRUE,
                     useFastRead = TRUE,
                     rowsPerRead = 150000)

xdfFile <- "airlineHWS.xdf"
if (file.exists(xdfFile)) file.remove(xdfFile)

Flights <- rxImport(odbcDS, outFile = xdfFile, overwrite = TRUE)
rxGetInfo(data = "airlineHWS.xdf", getVarInfo = TRUE, numRows = 10)
Chenwei's approach is OK, but there is one problem: the data is temporarily stored in memory as a data frame in the odbcDS object. If we have a huge table in Hive, then we are stuck.
I would suggest keeping everything on disk by using external tables in Hive and then using the backing data directly in Revolution R.
Something along these lines:
Create an external table from the existing Hive table in a text file format (CSV, tab-delimited, etc.):
CREATE EXTERNAL TABLE ext_table
LIKE your_original_table_name
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/your/hdfs/location';
Here we are creating an external table that is stored as a CSV file in HDFS.
Next, copy the original table into the external table using the INSERT OVERWRITE command:
insert overwrite table ext_table select * from your_original_table_name
If we want to check the backing data on HDFS, type:
hadoop fs -ls /your/hdfs/location/
We can see the part files stored at that location. Go ahead and cat them to be doubly sure.
Now we can use the RxTextData function to read the data from the step above:
hive_data <- RxTextData(file='/your/hdfs/location/', delimiter = ',')
Now you can create an XDF file from hive_data (using it as the input to an RxXdfData conversion) to make further processing more efficient; above all, the data has never touched memory.
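As a hedged sketch (paths hypothetical, assuming RevoScaleR with an HDFS file system object is available), the conversion to XDF could look like this:

hdfsFS    <- RxHdfsFileSystem()                          # point RevoScaleR at HDFS
hive_data <- RxTextData(file = "/your/hdfs/location/", delimiter = ",",
                        fileSystem = hdfsFS)
hive_xdf  <- RxXdfData("/your/hdfs/location_xdf", fileSystem = hdfsFS)
rxImport(inData = hive_data, outFile = hive_xdf, overwrite = TRUE)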
