I used the following code in IPython to load a table from a database into a pandas DataFrame.
import sqlite3
import pandas as pd
con = sqlite3.connect('-----.db')
a = pd.read_sql('SELECT * FROM table1', con)
c = con.cursor()
I now have table1 as a DataFrame named a. However, I need to carry out a number of inner joins between different tables in the database. How can I run SQL commands within IPython against these DataFrames? I tried c.execute(''' sql command for inner join''') but got an error saying that the DataFrames I referenced are not tables.
Any help?
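You can write the full SQL query, join included, and pass it directly to read_sql. The join then runs inside the database on its own tables, so no DataFrames are involved.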
sql = """
select col1 from
tablea inner join tableb
on tablea.col2 = tableb.col2
where tablea.col3 < 10
limit 10
"""
a = pd.read_sql(sql, con)
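To make the point concrete, here is a minimal sketch that runs the same kind of join against an in-memory SQLite database using only the standard library. The table contents and the extra `label` column are hypothetical sample data, and the `ORDER BY` is added only to make the output deterministic. The same SQL string and connection could be handed to `pd.read_sql` to get a DataFrame instead of a list of tuples.

```python
import sqlite3


def run_join(con):
    """Run an inner join inside SQLite and return the matching rows."""
    sql = """
        SELECT tablea.col1
        FROM tablea
        INNER JOIN tableb ON tablea.col2 = tableb.col2
        WHERE tablea.col3 < 10
        ORDER BY tablea.col1
        LIMIT 10
    """
    # pd.read_sql(sql, con) accepts this same string and connection.
    return con.execute(sql).fetchall()


# Hypothetical sample data in a throwaway in-memory database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tablea (col1 TEXT, col2 INTEGER, col3 INTEGER)")
con.execute("CREATE TABLE tableb (col2 INTEGER, label TEXT)")
con.executemany(
    "INSERT INTO tablea VALUES (?, ?, ?)",
    [("x", 1, 5), ("y", 2, 50), ("z", 3, 7)],
)
con.executemany("INSERT INTO tableb VALUES (?, ?)", [(1, "one"), (2, "two")])
con.commit()

rows = run_join(con)
print(rows)
```

The join refers to `tablea` and `tableb` by their names in the database, which is why referencing a DataFrame variable inside the SQL text fails.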
Background
I am using RStudio to connect R to Microsoft SQL Server Management Studio. I am reading tables into R as follows:
library(sqldf)
library(DBI)
library(odbc)
library(data.table)
TableX <- dbGetQuery(con, statement = "SELECT * FROM [dim1].[dimA].[TableX]")
This works fine for some tables. However, for most tables that have a binary ID variable, the following happens:
TableA <- dbGetQuery(con, statement = "SELECT * FROM [dim1].[dimA].[TableA]")
Error in result_fetch(res@ptr, n) :
nanodbc/nanodbc.cpp:xxx: xxxxx: [Microsoft][ODBC SQL Server Driver]Invalid Descriptor Index
Warning message:
In dbClearResult(rs) : Result already cleared
I figured out that the problem is caused by the first column, which I can select like this:
TableA <- dbGetQuery(con, statement = "SELECT ID FROM [dim1].[dimA].[TableA]")
and looks as follows:
AlwaysLearning mentioned in the comments that this is a recurring problem (1, 2, 3). The query only works when ID is selected last:
TableA <- dbGetQuery(con, statement = "SELECT AEE, ID FROM [dim1].[dimA].[TableA]")
Updated Question
The question is essentially how I can read in the table with the ID variable last, without specifying all table variables each time (because this would be unworkable).
Possible Workaround
I thought a workaround could be to select ID as an integer:
TableA <- dbGetQuery(con, statement = "SELECT CAST(ID AS int), COL2 FROM [dim1].[dimA].[TableA]")
However, how do I select the whole table in this case?
I am an SQL beginner, but I thought I could solve it by using something like this (from this link):
TableA <- dbGetQuery(con, statement = "SELECT * EXCEPT(ID), SELECT CAST(ID AS int) FROM [dim1].[dimA].[TableA]")
Here I select everything except the ID column, and then the ID column last. However, the statement I suggest is not valid syntax.
Other links
A similar problem for Java can be found here.
I believe I have found a workaround that meets your requirements using a table alias.
By assigning the alias T to the table I want to query, it allows me to select both a specific column ([ID]) as well as all columns in the aliased table without the need to explicitly specify them all by name.
This returns all columns of the table (including the ID column) as well as a copy of the ID column at the end of the table.
I then remove the ID column from the resulting table.
This leaves you with the desired result: all columns of a table in the order that they appear with the exception of the ID column that is placed at the end.
PS: For the sake of completeness, I have provided a template of my own DBIConnection object. You can substitute this with the specifics of your own DBIConnection object.
library(sqldf)
library(DBI)
library(odbc)
library(data.table)
con <- dbConnect(odbc::odbc(),
.connection_string = 'driver={YourDriver};
server=YourServer;
database=YourDatabase;
Trusted_Connection=yes'
)
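# Select every column through the alias T, plus a second copy of ID at the end
dataframe <- dbGetQuery(con, statement = 'SELECT T.*, T.[ID] FROM [SCHEMA_NAME].[TABLE_NAME] AS T')
# Drop the first column (this assumes ID is the first column of the table)
dataframe_scoped <- dataframe[, -1]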
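The column ordering produced by the alias trick can be checked with a small Python sketch. It uses an in-memory SQLite database as a stand-in, which is an assumption: SQLite does not reproduce the ODBC "Invalid Descriptor Index" error, so this only illustrates which columns come back and in what order. The table and column names come from the question; the row of data is hypothetical.

```python
import sqlite3

# Stand-in table: ID is a binary (BLOB) column and comes first.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TableA (ID BLOB, AEE TEXT, COL2 INTEGER)")
con.execute("INSERT INTO TableA VALUES (x'01', 'a', 10)")

# T.* expands to every column; T.ID appends a second copy of ID at the end.
cur = con.execute("SELECT T.*, T.ID FROM TableA AS T")
columns = [d[0] for d in cur.description]
rows = cur.fetchall()

# Drop the first column (the original ID), keeping the copy at the end.
columns_scoped = columns[1:]
rows_scoped = [row[1:] for row in rows]

print(columns)
print(columns_scoped)
```

Dropping by position only works when ID really is the first column, so it is worth checking the column list before slicing.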
Today I discovered the sqldf package for the first time, and I found it very useful and convenient. Here is what the documentation says about the package:
https://www.rdocumentation.org/packages/sqldf/versions/0.4-11
sqldf is an R package for running SQL statements on R data frames,
optimized for convenience. The user simply specifies an SQL statement
in R using data frame names in place of table names and a database
with appropriate table layouts/schema is automatically created, the
data frames are automatically loaded into the database, the specified
SQL statement is performed, the result is read back into R and the
database is deleted all automatically behind the scenes making the
database's existence transparent to the user who only specifies the
SQL statement.
So if I understand correctly, a data.frame held in the computer's RAM is temporarily mapped into a database on disk as a table, the query is run there, the result is returned to R, and everything that was temporarily created in the database is removed as if it had never existed.
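The load, query, and discard cycle described above can be sketched in Python. This is not how sqldf is implemented; it is an illustration of the idea only, and the helper name `query_rows`, the use of an in-memory SQLite database, and the sample rows are all assumptions.

```python
import sqlite3


def query_rows(table_name, columns, rows, sql):
    """Load rows into a throwaway SQLite table, run sql, and return the result."""
    con = sqlite3.connect(":memory:")
    try:
        col_list = ", ".join(f'"{c}"' for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        # Create a table whose layout mirrors the in-memory data.
        con.execute(f'CREATE TABLE "{table_name}" ({col_list})')
        con.executemany(
            f'INSERT INTO "{table_name}" VALUES ({placeholders})', rows
        )
        return con.execute(sql).fetchall()
    finally:
        # Closing the connection discards the temporary database.
        con.close()


# Hypothetical in-memory data standing in for a data frame.
people = [
    ("fname-01", "lname-01"),
    ("fname-02", "lname-02"),
    ("fname-03", "lname-03"),
]
result = query_rows(
    "person",
    ["fname", "lname"],
    people,
    "SELECT lname FROM person WHERE fname > 'fname-01' ORDER BY lname",
)
print(result)
```

The caller only ever sees the input rows and the query result; the database exists solely for the duration of the call.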
My question is: does it work the other way around? That is, assuming there is already a table in the database, say my_table (I use PostgreSQL), is there any way to import its data into a data.frame in R via sqldf? Currently the only way I know is RPostgreSQL.
Thanks to G. Grothendieck for the answer. It is indeed perfectly possible to select data from tables that already exist in the database. My mistake was thinking that the name of the data.frame and the corresponding table must always be the same, whereas, if I understand correctly, this is only the case when a data.frame is mapped to a temporary table in the database. As a result, when I tried to select data, I got an error message saying that a table with the same name already existed in my database.
Anyway, just as a test to see whether this works, I did the following in PostgreSQL (postgres user and test database which is owned by postgres)
test=# create table person(fname text, lname text, email text);
CREATE TABLE
test=# insert into person(fname, lname, email) values ('fname-01', 'lname-01', 'fname-01.lname-01#gmail.com'), ('fname-02', 'lname-02', 'fname-02.lname-02#gmail.com'), ('fname-03', 'lname-03', 'fname-03.lname-03#gmail.com');
INSERT 0 3
test=# select * from person;
fname | lname | email
----------+----------+-----------------------------
fname-01 | lname-01 | fname-01.lname-01#gmail.com
fname-02 | lname-02 | fname-02.lname-02#gmail.com
fname-03 | lname-03 | fname-03.lname-03#gmail.com
(3 rows)
test=#
Then I wrote the following in R
options(sqldf.RPostgreSQL.user = "postgres",
sqldf.RPostgreSQL.password = "postgres",
sqldf.RPostgreSQL.dbname = "test",
sqldf.RPostgreSQL.host = "localhost",
sqldf.RPostgreSQL.port = 5432)
###
###
library(tidyverse)
library(RPostgreSQL)
library(sqldf)
###
###
result_df <- sqldf("select * from person")
And indeed we can see that result_df contains the data stored in the table person.
> result_df
fname lname email
1 fname-01 lname-01 fname-01.lname-01#gmail.com
2 fname-02 lname-02 fname-02.lname-02#gmail.com
3 fname-03 lname-03 fname-03.lname-03#gmail.com
>
>
This question already has answers here:
Overwrite only some partitions in a partitioned spark Dataset
(3 answers)
Closed 4 years ago.
I'm using the spark_write_table function from sparklyr to write tables into HDFS, using the partition_by parameter to define how to store them:
R> my_table %>%
spark_write_table(.,
path="mytable",
mode="append",
partition_by=c("col1", "col2")
)
However, now I want to update the table by altering just one partition, instead of writing the whole table again.
In Hadoop-SQL I would do something like:
INSERT INTO TABLE mytable
PARTITION (col1 = 'my_partition')
VALUES (myvalues..)
Is there an equivalent option to do this in sparklyr correctly? I cannot find it in the documentation.
Re: duplication note: this question is specifically about how to do this in R with the sparklyr function, while the other question is about general Hive syntax.
Thanks all for the comments.
It seems there is no way to do this with sparklyr directly, but this is what I am going to do.
In short, I'll save the new partition data in a temporary table, use a Hadoop SQL command to drop the existing partition, then use another SQL command to insert the contents of the temporary table into it.
> dbGetQuery(con,
"ALTER TABLE mytable DROP IF EXISTS PARTITION (mycol='partition1');")
> spark_write_table(new_partition, "tmp_partition_table")
> dbGetQuery(con,
"INSERT INTO TABLE mytable
PARTITION (mycol='partition1')
SELECT *
FROM tmp_partition_table "
)
I am using the RODBC package in R, which allows me to connect to a SQL database from R.
As an example of my problem, I have a table [Sales] in the database with 3 columns (Alpha, Beta, BetaDistribution).
1.50,77,x
2.99,53,x
4.50,122,x
Note that the 3rd column (BetaDistribution) is not populated, and this needs to be populated using a Statistical R Function.
I have assigned my table to the variable select:
select <- sqlQuery(dbhandle, 'select * from dbo.sales')
How do I run a loop to update my SQL table so that the BetaDistribution column is filled with the calculated Beta distribution, pbeta(alpha,beta)?
Something like this. Basically you make a temp table and then update the existing table. There's a reasonable chance you need to tweak that update statement since I, obviously, can't test it.
select$BetaDistribution<-yourfunc(x,y)
sqlSave(dbhandle, select, tablename="dbo.salestemp", rownames=FALSE,varTypes=list(Alpha="decimal(10,10)", Beta="decimal(10,10)", BetaDistribution="decimal(10,10)"))
sqlQuery(dbhandle, "update dbo.sales
set sales.BetaDistribution=salestemp.BetaDistribution
from dbo.sales
inner join
salestemp
on
sales.Alpha=salestemp.Alpha and
sales.Beta=salestemp.Beta")
sqlQuery(dbhandle, "drop table salestemp")
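The same staging pattern (compute outside the database, write to a temporary table, update the real table from it, drop the temporary table) can be sketched in Python against an in-memory SQLite database. Several things here are assumptions: SQLite stands in for the real server, a correlated subquery replaces the `UPDATE ... FROM ... INNER JOIN` form, and `placeholder_stat` is a hypothetical stand-in for the real statistical function, not an implementation of `pbeta`.

```python
import sqlite3


def placeholder_stat(alpha, beta):
    # Hypothetical stand-in for the real statistical function.
    return alpha / (alpha + beta)


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (Alpha REAL, Beta REAL, BetaDistribution REAL)")
con.executemany(
    "INSERT INTO sales (Alpha, Beta) VALUES (?, ?)",
    [(1.50, 77), (2.99, 53), (4.50, 122)],
)

# 1. Read the key columns and compute the new values outside the database.
computed = [
    (a, b, placeholder_stat(a, b))
    for a, b in con.execute("SELECT Alpha, Beta FROM sales").fetchall()
]

# 2. Stage the computed values in a temporary table.
con.execute(
    "CREATE TEMP TABLE salestemp (Alpha REAL, Beta REAL, BetaDistribution REAL)"
)
con.executemany("INSERT INTO salestemp VALUES (?, ?, ?)", computed)

# 3. Update the real table from the staged values, matching on Alpha and Beta.
con.execute(
    """
    UPDATE sales
    SET BetaDistribution = (
        SELECT t.BetaDistribution
        FROM salestemp AS t
        WHERE t.Alpha = sales.Alpha AND t.Beta = sales.Beta
    )
    """
)

# 4. Remove the staging table.
con.execute("DROP TABLE salestemp")
con.commit()

updated = con.execute(
    "SELECT Alpha, Beta, BetaDistribution FROM sales ORDER BY Alpha"
).fetchall()
print(updated)
```

Matching on Alpha and Beta assumes that pair identifies a row uniquely; with a real primary key, joining on that key would be more robust.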
create table tmp as select code,avg(how-low)/low as vol from quote group by code;
select avg(vol) from tmp
I create a new table with the first statement, then select avg(vol) from the tmp table. How can I combine the two SQLite statements into one statement?
If you do not need the temporary table later, use a common table expression:
WITH tmp AS (SELECT avg(how-low)/low AS vol
FROM quote
GROUP BY code)
SELECT avg(vol)
FROM tmp
If you have an outdated SQLite version (older than 3.8.3), you could use a subquery instead:
SELECT avg(vol)
FROM (SELECT avg(how-low)/low AS vol
FROM quote
GROUP BY code)
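A quick way to confirm that the two forms are interchangeable is to run both against the same data. The sketch below does that from Python with an in-memory SQLite database; the rows are hypothetical sample data, chosen so that `low` is constant within each `code` group, because `low` is not aggregated in the query.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE quote (code TEXT, how REAL, low REAL)")
con.executemany(
    "INSERT INTO quote VALUES (?, ?, ?)",
    [("A", 12, 10), ("A", 14, 10), ("B", 30, 20), ("B", 40, 20)],
)

# Common table expression form (needs SQLite 3.8.3 or newer).
cte_sql = """
    WITH tmp AS (SELECT avg(how-low)/low AS vol
                 FROM quote
                 GROUP BY code)
    SELECT avg(vol)
    FROM tmp
"""

# Equivalent subquery form for older SQLite versions.
subquery_sql = """
    SELECT avg(vol)
    FROM (SELECT avg(how-low)/low AS vol
          FROM quote
          GROUP BY code)
"""

cte_result = con.execute(cte_sql).fetchone()[0]
subquery_result = con.execute(subquery_sql).fetchone()[0]
print(cte_result, subquery_result)
```

Neither form leaves a `tmp` table behind in the database, which is the practical difference from the original two-statement version.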