This question already has answers here:
Overwrite only some partitions in a partitioned spark Dataset
(3 answers)
Closed 4 years ago.
I'm using the spark_write_table function from sparklyr to write tables into HDFS, using the partition_by parameter to define how to store them:
my_table %>%
  spark_write_table(.,
    name = "mytable",
    mode = "append",
    partition_by = c("col1", "col2")
  )
However, now I want to update the table by altering just one partition, instead of writing the whole table again.
In Hadoop-SQL I would do something like:
INSERT INTO TABLE mytable
PARTITION (col1 = 'my_partition')
VALUES (myvalues..)
Is there an equivalent option to do this in sparklyr correctly? I cannot find it in the documentation.
Regarding the duplicate note: this question is specifically about how to do this in R with the sparklyr function, while the other question is about general Hive syntax.
Thanks all for the comments.
It seems there is no way to do this with sparklyr directly, but this is what I am going to do.
In short, I'll save the new partition's data in a temporary table, use a Hadoop SQL command to drop the existing partition, then use another SQL command to insert the temporary table's contents into it.
dbGetQuery(con,
  "ALTER TABLE mytable DROP IF EXISTS PARTITION (mycol='partition1');")
spark_write_table(new_partition, "tmp_partition_table")
dbGetQuery(con,
  "INSERT INTO TABLE mytable
   PARTITION (mycol='partition1')
   SELECT *
   FROM tmp_partition_table"
)
Related
Today I discovered the sqldf package for the first time, and I found it very useful and convenient. Here is what the documentation says about the package:
https://www.rdocumentation.org/packages/sqldf/versions/0.4-11
sqldf is an R package for running SQL statements on R data frames,
optimized for convenience. The user simply specifies an SQL statement
in R using data frame names in place of table names and a database
with appropriate table layouts/schema is automatically created, the
data frames are automatically loaded into the database, the specified
SQL statement is performed, the result is read back into R and the
database is deleted all automatically behind the scenes making the
database's existence transparent to the user who only specifies the
SQL statement.
So if I understand correctly, a data.frame whose data is held in the computer's RAM is temporarily mapped to a table in a database, the query is run against it, the result is returned to R, and everything that was temporarily created in the database goes away as if it had never existed.
My question is: does it work the other way around? That is, assuming there is already a table in the database, say my_table (just an example; I use PostgreSQL), is there any way to import its data from the database into a data.frame in R via sqldf? Currently the only way I know is RPostgreSQL.
Thanks to G. Grothendieck for the answer. Indeed it is perfectly possible to select data from already existing tables in the database. My mistake was that I was thinking that the name of the dataframe and the corresponding table must always be the same, whereas if I understand correctly, this is only the case when a data.frame data is mapped to a temporary table in the database. As a result when I tried to select data, I had an error message saying that a table with the same name already existed in my database.
Anyway, just as a test to see whether this works, I did the following in PostgreSQL (postgres user and test database which is owned by postgres)
test=# create table person(fname text, lname text, email text);
CREATE TABLE
test=# insert into person(fname, lname, email) values ('fname-01', 'lname-01', 'fname-01.lname-01#gmail.com'), ('fname-02', 'lname-02', 'fname-02.lname-02#gmail.com'), ('fname-03', 'lname-03', 'fname-03.lname-03#gmail.com');
INSERT 0 3
test=# select * from person;
  fname   |  lname   |            email
----------+----------+-----------------------------
 fname-01 | lname-01 | fname-01.lname-01#gmail.com
 fname-02 | lname-02 | fname-02.lname-02#gmail.com
 fname-03 | lname-03 | fname-03.lname-03#gmail.com
(3 rows)
test=#
Then I wrote the following in R
options(sqldf.RPostgreSQL.user = "postgres",
sqldf.RPostgreSQL.password = "postgres",
sqldf.RPostgreSQL.dbname = "test",
sqldf.RPostgreSQL.host = "localhost",
sqldf.RPostgreSQL.port = 5432)
###
###
library(tidyverse)
library(RPostgreSQL)
library(sqldf)
###
###
result_df <- sqldf("select * from person")
And indeed we can see that result_df contains the data stored in the table person.
> result_df
     fname    lname                       email
1 fname-01 lname-01 fname-01.lname-01#gmail.com
2 fname-02 lname-02 fname-02.lname-02#gmail.com
3 fname-03 lname-03 fname-03.lname-03#gmail.com
I am new to RSQLite.
I have an input document in text format in which values are separated by '|'.
I created a table with the required variables (dummy code as follows)
db <- dbConnect(SQLite(), dbname = "test.sqlite")
dbSendQuery(conn = db,
  "CREATE TABLE TABLE1(
     MARKS INTEGER,
     ROLLNUM INTEGER,
     NAME CHAR(25),
     DATED DATE)"
)
However, I am stuck on how to import values into the created table.
I cannot use the INSERT INTO ... VALUES command, as there are thousands of rows and more than 20 columns in the original data file, and it is impossible to type in each data point manually.
Can someone suggest an alternative efficient way to do so?
You are using a scripting language. The whole point of it is to avoid manually typing each data point.
You have two routes:
1: You have correctly opened a database connection and created an empty table in your SQLite database. Nice!
To load data into the table, first load your text file into R using e.g. df <- read.table('textfile.txt', sep='|') (modify the arguments to fit your text file).
To have a 'dynamic' INSERT statement, you can use placeholders. RSQLite allows both named and positional placeholders. To insert a single row, you can do:
dbSendQuery(db, 'INSERT INTO table1 (MARKS, ROLLNUM, NAME) VALUES (?, ?, ?);', list(1, 16, 'Big fellow'))
You see? The first ? got value 1, the second ? got value 16, and the last ? got the string Big fellow. Also note that you do not enclose placeholders for text in quotation marks (' or ")!
Now, you have thousands of rows. Or just more than one. Either way, you can send in your data frame. dbSendQuery has some requirements. 1) That each vector has the same number of entries (not an issue when providing a data.frame). And 2) You may only submit the same number of vectors as you have placeholders.
I assume your data frame, df, contains the columns mark, roll, and name, corresponding to the table's columns. Then you may run:
dbSendQuery(db, 'INSERT INTO table1 (MARKS, ROLLNUM, NAME) VALUES (:mark, :roll, :name);', df)
This will execute an INSERT statement for each row in df!
TIP! Because an INSERT statement is executed for each row, inserting thousands of rows can take a long time: after each insert, data is written to file and indices are updated. Instead, enclose the inserts in a transaction:
dbBegin(db)
res <- dbSendQuery(db, 'INSERT ...;', df)
dbClearResult(res)
dbCommit(db)
and SQLite will save the data to a journal file, and only save the result when you execute the dbCommit(db). Try both methods and compare the speed!
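For anyone who wants to try the transaction approach end to end, here is a minimal sketch. It assumes a recent DBI/RSQLite, uses an in-memory database, and the three sample rows are made up for illustration. It uses dbExecute() with the params argument rather than dbSendQuery(), and dbWithTransaction() in place of the explicit dbBegin()/dbCommit() pair; treat it as one possible way to write it, not the only one.

```r
library(DBI)

# In-memory database so the sketch leaves no file behind.
db <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(db, "CREATE TABLE table1 (MARKS INTEGER, ROLLNUM INTEGER, NAME TEXT)")

# Hypothetical stand-in for the data frame read from the text file.
df <- data.frame(
  mark = c(71L, 85L, 90L),
  roll = c(1L, 2L, 3L),
  name = c("Ann", "Bob", "Cy"),
  stringsAsFactors = FALSE
)

# All row-wise INSERTs run inside one transaction;
# dbWithTransaction() commits on success and rolls back on error.
dbWithTransaction(db, {
  dbExecute(
    db,
    "INSERT INTO table1 (MARKS, ROLLNUM, NAME) VALUES (:mark, :roll, :name)",
    params = as.list(df)  # list names must match the placeholder names
  )
})

n_rows <- dbGetQuery(db, "SELECT COUNT(*) AS n FROM table1")$n
dbDisconnect(db)
```

Wrapping the call in system.time() with and without the transaction is an easy way to compare the two on your own data.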
2: Ah, yes. The second way. This can be done in SQLite entirely.
With the SQLite command-line utility (sqlite3 from your command line, not R), you can attach a text file as a table and simply run an INSERT INTO ... SELECT ... ; command. Alternatively, read the text file in sqlite3 into a temporary table and run an INSERT INTO ... SELECT ... ;.
Useful site to remember: http://www.sqlite.com/lang.html
A little late to the party, but DBI provides dbAppendTable() which will write the contents of a dataframe to an SQL table. Column names in the dataframe must match the field names in the database. For your example, the following code would insert the contents of my random dataframe into your newly created table.
library(DBI)
db <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbExecute(db,
"CREATE TABLE TABLE1(
MARKS INTEGER,
ROLLNUM INTEGER,
NAME TEXT
)"
)
df <- data.frame(MARKS = sample(1:100, 10),
ROLLNUM = sample(1:100, 10),
NAME = stringi::stri_rand_strings(10, 10))
dbAppendTable(db, "TABLE1", df)
I don't think there is a nice way to do a large number of inserts directly from R. SQLite does have a bulk insert functionality, but the RSQLite package does not appear to expose it.
From the command line you may try the following:
.separator |
.import your_file.csv your_table
where your_file.csv is the CSV (or pipe delimited) file containing your data and your_table is the destination table.
See the documentation under CSV Import for more information.
I am using the RODBC package on R which allows me to connect to SQL using R.
As an example to my problem, I have a table [Sales] within SQL with 3 Columns (Alpha, Beta, BetaDistribution).
1.50,77,x
2.99,53,x
4.50,122,x
Note that the 3rd column (BetaDistribution) is not populated, and this needs to be populated using a Statistical R Function.
I have assigned my table to the variable select:
select <- sqlQuery(dbhandle, 'select * from dbo.sales')
How do I run a loop to update my SQL table so that the BetaDistribution column is filled with the calculated beta distribution, pbeta(alpha, beta)?
Something like this. Basically you make a temp table and then update the existing table. There's a reasonable chance you need to tweak that update statement since I, obviously, can't test it.
select$BetaDistribution<-yourfunc(x,y)
sqlSave(dbhandle, select, tablename="dbo.salestemp", rownames=FALSE,varTypes=list(Alpha="decimal(10,10)", Beta="decimal(10,10)", BetaDistribution="decimal(10,10)"))
sqlQuery(dbhandle, "update dbo.sales
set sales.BetaDistribution=salestemp.BetaDistribution
from dbo.sales
inner join
salestemp
on
sales.Alpha=salestemp.Alpha and
sales.Beta=salestemp.Beta")
sqlQuery(dbhandle, "drop table salestemp")
I am using R in combination with SQLite, via RSQLite, to persist my data, since I do not have sufficient RAM to keep all columns in memory and calculate with them. I have added an empty column to the SQLite database using:
dbGetQuery(db, "alter table test_table add column newcol real")
Now I want to fill this column using data I calculated in R and which is stored in my data.table column dtab$newcol. I have tried the following approach:
dbGetQuery(db, "update test_table set newcol = ? where id = ?", bind.data = data.frame(transactions$sum_year, transactions$id))
Unfortunately, R seems like it is doing something but is not using any CPU time or RAM allocation. The database does not change size and even after 24 hours nothing has changed. Therefore, I assume it has crashed - without any output.
Am I using the update statement wrong? Is there an alternative way of doing this?
UPDATE
I have also tried the RSQLite functions dbSendQuery and dbGetPreparedQuery - both with the same result. However, what does work is updating a single row without the use of bind.data. A loop to update the column, therefore, seems possible but I will have to evaluate the performance since the dataset is huge.
As mentioned by @jangorecki, the problem had to do with SQLite performance. I disabled synchronous and set journal_mode to off (which has to be done for every session).
dbGetQuery(transDB, "PRAGMA synchronous = OFF")
dbGetQuery(transDB, "PRAGMA journal_mode = OFF")
I also changed my RSQLite code to use dbBegin(), dbSendPreparedQuery() and dbCommit(). It takes a while, but at least it works now and has acceptable performance.
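For reference, here is a small sketch of the same idea (a parameterised UPDATE run inside one transaction) written against the current DBI interface, assuming a recent RSQLite. The table contents and the new_values data frame are invented for illustration. Note that id is declared as a primary key here; without an index on the column in the WHERE clause, each bound row would likely trigger a full table scan, which could be part of why a large update appears to hang.

```r
library(DBI)

db <- dbConnect(RSQLite::SQLite(), ":memory:")

# Toy table; id is indexed through the PRIMARY KEY declaration.
dbExecute(db, "CREATE TABLE test_table (id INTEGER PRIMARY KEY, newcol REAL)")
dbExecute(db, "INSERT INTO test_table (id) VALUES (1), (2), (3)")

# Hypothetical values computed in R, keyed by id.
new_values <- data.frame(id = 1:3, newcol = c(0.5, 1.5, 2.5))

# One UPDATE per row of new_values, all inside a single transaction.
dbWithTransaction(db, {
  dbExecute(
    db,
    "UPDATE test_table SET newcol = :newcol WHERE id = :id",
    params = as.list(new_values)
  )
})

updated <- dbGetQuery(db, "SELECT id, newcol FROM test_table ORDER BY id")
dbDisconnect(db)
```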
I have a data frame data_frm which has the following columns:
emp_id | emp_sal | emp_bonus | emp_desig_level
| | |
| | |
I want to insert all the records in this data frame into the database table tab1. I executed this code but got an error:
for(record in data_frm)
{
write_sql <- paste("Insert into tab1 (emp_id,emp_sal,emp_bonus,emp_desig_level) values (",data_frm[,"emp_id"],",",data_frm[,"emp_sal"],",",data_frm[,"emp_bonus"],",",data_frm[,"emp_desig_level"],")",sep="")
r <- dbSendQuery(r,write_sql)
}
The error is:
Error in data_frm[, "emp_id"] : incorrect number of dimensions
How do I insert all the records from the data frame into database?
NOTE: I want to insert all the records of the data frame using insert statement.
dbWriteTable(conn, "RESULTS", results2000, append = T) # to protect current values
dbWriteTable(conn, "RESULTS", results2000, append = F) # to overwrite values
From the RDBI homepage at sourceforge. Hope that helps...
In your for loop, you need to put:
data_frm[record,"column_name"]
Otherwise your loop is trying to insert the entire column instead of just the particular record.
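# Loop over row indices; for(record in data_frm) would iterate over columns instead.
for(record in seq_len(nrow(data_frm)))
{
write_sql <- paste("Insert into tab1 (emp_id,emp_sal,emp_bonus,emp_desig_level) values (",data_frm[record,"emp_id"],",",data_frm[record,"emp_sal"],",",data_frm[record,"emp_bonus"],",",data_frm[record,"emp_desig_level"],")",sep="")
# r is assumed to be the open connection; keep the result in a separate variable
# so the connection is not overwritten.
res <- dbSendQuery(r, write_sql)
dbClearResult(res)
}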
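To make the row-versus-column point concrete, here is a small base R sketch with no database involved. The data_frm values are made up. It first shows that looping over a data frame directly visits its columns, then builds one INSERT statement per row in a vectorised way with sprintf(). Pasting values into SQL like this is only reasonable for trusted numeric data; for anything else, bound parameters are the safer choice.

```r
# Hypothetical data with the same columns as in the question.
data_frm <- data.frame(
  emp_id = c(1, 2),
  emp_sal = c(1000, 2000),
  emp_bonus = c(100, 200),
  emp_desig_level = c(3, 4)
)

# Iterating over a data frame directly visits one column per step.
visited <- 0
for (record in data_frm) {
  visited <- visited + 1
}
# visited now equals ncol(data_frm), not nrow(data_frm)

# sprintf() is vectorised, so this yields one statement per row.
statements <- sprintf(
  "INSERT INTO tab1 (emp_id, emp_sal, emp_bonus, emp_desig_level) VALUES (%s, %s, %s, %s)",
  data_frm$emp_id, data_frm$emp_sal, data_frm$emp_bonus, data_frm$emp_desig_level
)
```

Each element of statements could then be sent over the connection in turn, ideally inside a single transaction.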
Answered here
Copied one more time:
Recently I had similar issue.
Problem description: an MS SQL Server database with a schema. The task is to save an R data.frame object to a predefined database table without dropping it.
Problems I faced:
Some packages' functions do not support schemas, or require installing the development version from GitHub
You can save data.frame only after drop (delete table) operation (I needed just "clear table" operation)
How I solved the issue
Using simple RODBC::sqlQuery, writing a data.frame row by row.
The solution (couple of functions) is available here or here