I have the following data being sent to Redshift with a replace-table command. Is there a command to add new rows to the table instead of replacing the entire thing?
PipelineSimulation <- matrix(NA, 42, 7)
PipelineSimulation <- as.data.frame(PipelineSimulation)
PipelineSimulation[1,1]<-"APAC"
PipelineSimulation[1,2]<-"Enterprise"
and so on through
PipelineSimulation[42,3]<-"Commit"
PipelineSimulation[42,4]<-"Upsell"
PipelineSimulation[42,5]<-NAMEFURate
PipelineSimulation[42,6]<-mean(NFUEntTotals)
PipelineSimulation[,7]<-Sys.time()
Then, to get it into Redshift, I use:
library(RPostgres)
library(redshiftTools)
library(RPostgreSQL)
library("aws.s3")
library("DBI")
drv <- dbDriver('PostgreSQL')
con <- dbConnect(RPostgres::Postgres(),
                 host = 'bi-prod-dw-instance.cceimtxgnc4w.us-west-2.redshift.amazonaws.com',
                 port = '5439', dbname = '***', user = "***", password = "***",
                 sslmode = 'require')
query="select * from everyonesdb.jet_pipelinesimulation_historic;"
result<-dbGetQuery(con,query)
print (nrow(result))
Sys.setenv("AWS_ACCESS_KEY_ID" = "***",
"AWS_SECRET_ACCESS_KEY" = "***",
"AWS_DEFAULT_REGION" = "us-west-2")
b=get_bucket(bucket = 'bjnbi-bjnrd/jetPipelineSimulation')
rs_replace_table(PipelineSimulation, con,
                 tableName = 'everyonesdb.jet_pipelinesimulation_historic',
                 bucket = 'bjnbi-bjnrd/jetPipelineSimulation',
                 split_files = 2)
So instead of rs_replace_table, I want to preserve the old data and simply add new rows to the existing table, if that's possible.
From How to bulk upload your data from R into Redshift:
rs_replace_table truncates the target table and then loads it entirely from the data frame, only do this if you don't care about the current data it holds.
On the other hand, rs_upsert_table replaces rows which have coinciding keys, and inserts those that do not exist in the table.
Does using rs_upsert_table instead of rs_replace_table solve your issue?
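For example, keeping the arguments from the rs_replace_table() call in the question, the upsert would look roughly like this. Treat it as a sketch: the keys value is a hypothetical placeholder for whichever column(s) uniquely identify a row, and argument names can differ between redshiftTools versions, so check the signature of your installed version.
# a sketch mirroring the rs_replace_table() call above; keys is a hypothetical
# placeholder for the column(s) that uniquely identify a row in the historic table
rs_upsert_table(PipelineSimulation, con,
                tableName = 'everyonesdb.jet_pipelinesimulation_historic',
                bucket = 'bjnbi-bjnrd/jetPipelineSimulation',
                split_files = 2,
                keys = c('V7'))
If the key values never coincide with rows already in the table (for instance, because the Sys.time() column differs on every run), the upsert should amount to a plain append of the new rows.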
The pandas code I have used for connecting to Teradata is:
database = config.get('Teradata connection', 'database')
host = config.get('Teradata connection', 'host')
user = config.get('Teradata connection', 'user')
pwd = config.get('Teradata connection', 'pwd')
with teradatasql.connect(host=host, user=user, password=pwd) as connect:
    query1 = "SELECT * FROM {}.{}".format(database, tables)
    df = pd.read_sql_query(query1, connect)
Now I need to use the Dask library as an alternative to pandas for loading this big data.
Please suggest a way to make the same Teradata connection with Dask.
Teradata appears to have a SQLAlchemy engine, so you should be able to install that, set your connection string appropriately and use Dask's existing read_sql_table function.
Alternatively, you could do this by hand: you need to decide on a set of conditions which will partition the data for you, each partition being small enough for your workers to handle. Then you can build a set of partitions and combine them into a dask dataframe as follows:
import dask
import dask.dataframe as dd
import pandas as pd
import teradatasql

def get_part(condition):
    with teradatasql.connect(host=host, user=user, password=pwd) as connect:
        query1 = "SELECT * FROM {}.{} WHERE {}".format(database, tables, condition)
        return pd.read_sql_query(query1, connect)

parts = [dask.delayed(get_part)(cond) for cond in conditions]
df = dd.from_delayed(parts)
(ideally, you can derive the meta= parameter for from_delayed beforehand, perhaps by getting the first 10 rows of the original query).
I have the following dataframe:
library(rpostgis)
library(RPostgreSQL)
library(glue)
df <- data.frame(elevation = c(450, 900),
                 id = c(1, 2))
Now I try to upload this to a table in my PostgreSQL/PostGIS database. My connection (dbConnect) works properly for SELECT statements. However, I tried two ways of updating a database table with this dataframe and both failed.
First:
pgInsert(postgis, name = "fields", data.obj = df, overwrite = FALSE,
         partial.match = TRUE, row.names = FALSE, upsert.using = TRUE,
         df.geom = NULL)
2 out of 2 columns of the data frame match database table columns and will be formatted for database insert.
Error: x must be character or SQL
I do not know what the error is trying to tell me as both the values in the dataframe and table are set to integer.
Second:
sql <- glue_sql("UPDATE fields SET elevation ={df$elevation} WHERE
id = {df$id};", .con = postgis)
> sql
<SQL> UPDATE fields SET elevation =450 WHERE
id = 1;
<SQL> UPDATE fields SET elevation =900 WHERE
id = 2;
dbSendStatement(postgis,sql)
<PostgreSQLResult>
In both cases no data is transferred to the database and I do not see any Error logs within the database.
Any hint on how to solve this problem?
It was a mistake on my side; I got glue_sql wrong. To correctly update the database with every query created by glue_sql, you have to loop over the created object, as in the following example:
for (i in seq_along(sql)) {
  dbSendStatement(postgis, sql[i])
}
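For completeness, here is a variant of that loop using dbExecute(), which runs each statement and frees the result in a single call. This is only a sketch, assuming the same postgis connection and the sql vector built with glue_sql() above:
# a sketch assuming the postgis connection and the sql vector from above;
# dbExecute() runs each statement, clears the result and returns the number
# of affected rows
for (i in seq_along(sql)) {
  rows_affected <- DBI::dbExecute(postgis, sql[i])
  message("statement ", i, " updated ", rows_affected, " row(s)")
}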
I have the following function written in R that (I think) is doing a poor job of updating my Mongo database's collections.
library(mongolite)
con <- mongolite::mongo(collection = "mongo_collection_1", db = 'mydb', url = 'myurl')
myRdataframe1 <- con$find(query = '{}', fields = '{}')
rm(con)
con <- mongolite::mongo(collection = "mongo_collection_2", db = 'mydb', url = 'myurl')
myRdataframe2 <- con$find(query = '{}', fields = '{}')
rm(con)
... code to update my dataframes (rbind additional rows onto each of them) ...
# write dataframes to database
write.dfs.to.mongodb.collections <- function() {
  collections <- c("mongo_collection_1", "mongo_collection_2")
  my.dataframes <- c("myRdataframe1", "myRdataframe2")
  # loop over the dataframes, writing each one to its collection
  for (i in seq_along(collections)) {
    # connect and add data to this collection
    con <- mongo(collection = collections[i], db = 'mydb', url = 'myurl')
    con$remove('{}')
    con$insert(get(my.dataframes[i]))
    con$count()
    rm(con)
  }
}
write.dfs.to.mongodb.collections()
My dataframes myRdataframe1 and myRdataframe2 are very large dataframes, currently ~100K rows and ~50 columns. Each time my script runs, it:
uses con$find('{}') to pull the mongodb collection into R, saved as a dataframe myRdataframe1
scrapes new data from a data provider that gets appended as new rows to myRdataframe1
uses con$remove() and con$insert to fully remove the data in the mongodb collection, and then re-insert the entire myRdataframe1
This last point is the iffy one: I run this R script daily in a cronjob, and I don't like that each run entirely wipes the MongoDB collection and re-inserts the whole R dataframe.
If I remove the con$remove() line, I receive an error that states I have duplicate _id keys. It appears I cannot simply append using con$insert().
Any thoughts on this are greatly appreciated!
When you attempt to insert documents into MongoDB that already exist in the database as per their primary key, you will get the duplicate key exception. To work around that, you can simply unset the _id column with something like this before the con$insert (note that _id needs backticks because it is not a syntactic name in R):
df <- get(my.dataframes[i])
df$`_id` <- NULL
This way, the newly inserted document will automatically get a new _id assigned.
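For reference, a sketch of the loop body with that change applied; it assumes the connection details from the question, and that the dataframe at this point holds only the rows you want to add (otherwise the previously fetched documents would be inserted a second time):
# a sketch assuming the collections, my.dataframes, db and url values from the
# question; dropping _id lets MongoDB assign fresh ids on insert
for (i in seq_along(collections)) {
  con <- mongo(collection = collections[i], db = 'mydb', url = 'myurl')
  df <- get(my.dataframes[i])
  df$`_id` <- NULL
  con$insert(df)
  rm(con)
}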
You can use an upsert, which matches a document on the query condition: if a match is found it is updated, otherwise a new document is inserted.
First, separate the _id values from the dataframe:
updateData <- get(my.dataframes[i])
ids <- updateData$`_id`
updateData$`_id` <- NULL
Then upsert each document; jsonlite is used here to build the JSON strings, since pasting a whole dataframe into a string does not produce valid JSON:
library(jsonlite)
for (j in seq_len(nrow(updateData))) {
  doc <- toJSON(as.list(updateData[j, ]), auto_unbox = TRUE)
  con$update(paste0('{"_id":"', ids[j], '"}'),
             paste0('{"$set":', doc, '}'), upsert = TRUE)
}
I have created a table in a sqlite3 database from R using the following code:-
con <- DBI::dbConnect(drv = RSQLite::SQLite(),
dbname="data/compfleet.db")
s<- sprintf("create table %s(%s, primary key(%s))", "PositionList",
paste(names(FinalTable), collapse = ", "),
names(FinalTable)[2])
dbGetQuery(con, s)
dbDisconnect(con)
The second column of the table is UID, which is the primary key. I then run a script to update the data in the table. The updated data could contain UIDs that already exist in the table. I don't want these existing records to be updated; I just want the new records (with new UID values) to be appended to this table. The code I am using is:
DBI::dbWriteTable(con, "PositionList", FinalTable, append = TRUE,
                  row.names = FALSE, overwrite = FALSE)
Which returns an error:
Error in result_bind(res@ptr, params) :
  UNIQUE constraint failed: PositionList.UID
How can I append only the records with new UID values, leaving the existing records untouched even if their UIDs appear again when I run my update script?
You can query the existing UIDs (as a one-column data frame) and remove corresponding rows from the table you want to insert.
uid_df <- dbGetQuery(con, "SELECT UID FROM PositionList")
dbWriteTable(con, "PositionList", FinalTable[!(FinalTable$UID %in% uid_df[[1]]), ], ...)
When you are about to insert data, first query the database by UID. If the UID already exists, do nothing; otherwise insert the new record with its new UID. The error appears because a record with a duplicate primary key (UID) is not allowed.
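A row-by-row sketch of that check-then-insert approach, assuming con and FinalTable from the question; the parameterized query uses DBI's params argument, which RSQLite supports:
# check each UID before inserting; only rows whose UID is absent get appended
for (i in seq_len(nrow(FinalTable))) {
  n <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM PositionList WHERE UID = ?",
                  params = list(FinalTable$UID[i]))$n
  if (n == 0) {
    dbWriteTable(con, "PositionList", FinalTable[i, ], append = TRUE, row.names = FALSE)
  }
}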
I am using the in_schema() function from dbplyr package to create a table in a named schema of a postgresql database from R.
It is not a new piece of code, and it used to work as expected, creating a table called 'my_table' in schema 'my_schema'.
con <- dbConnect(odbc::odbc(),
driver = "PostgreSQL Unicode",
server = "server",
port = 5432,
uid = "user name",
password = "password",
database = "dbase")
dbWriteTable(con,
in_schema('my_schema', 'my_table'),
value = whatever) # assume that 'whatever' is a data frame...
This piece of code has now developed issues and has unexpectedly started to create a table called 'my_schema.my_table' in the default public schema of my database, instead of the expected my_schema.my_table.
Has anybody else noticed this behaviour, and is there a solution (other than using the default PostgreSQL schema, which is not practical in my case)?
For that, I would recommend using copy_to() instead of dbWriteTable(): copy_to(con, iris, in_schema("production", "iris"))
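Applied to the example in the question, that would look roughly like the sketch below; it assumes the con connection and the whatever data frame from above, and temporary = FALSE is needed so the table persists after the session ends:
library(dplyr)
library(dbplyr)
# a sketch assuming con and whatever from the question; temporary = FALSE
# keeps the table rather than creating a temporary one
copy_to(con, whatever,
        name = in_schema("my_schema", "my_table"),
        temporary = FALSE)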