I have two very large CSV files that contain the same variables. I want to combine them into one table inside an SQLite database, if possible using R.
I successfully managed to put both CSV files into separate tables inside one database using inborutils::csv_to_sqlite, which imports small chunks of data at a time.
Is there a way to create a third table where both tables are simply appended using R (keeping in mind the limited RAM)? And if not, how else can I perform this task? Maybe via the terminal?
We assume that when the question refers to the "same variables" it means that the two tables have the same column names. Below we create two such test tables, BOD and BOD2, and then combine them in a single CREATE TABLE ... AS SELECT statement, creating the table both. The combining happens entirely on the SQLite side, so the rows never need to fit in R's memory. Finally, we look at both.
library(RSQLite)

con <- dbConnect(SQLite())  # modify to refer to an existing SQLite database
dbWriteTable(con, "BOD", BOD)        # first test table
dbWriteTable(con, "BOD2", 10 * BOD)  # second test table with the same columns

# UNION ALL appends all rows; a plain UNION would also remove duplicate rows
dbExecute(con, "create table both as select * from BOD union all select * from BOD2")

dbReadTable(con, "both")
dbDisconnect(con)
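If one of the files had not already been loaded into its own table, a similar low-memory route is to stream the CSV into the combined table chunk by chunk. The sketch below is only an illustration under assumed names (append_csv_chunked is a made-up helper; big2.csv and both are placeholders) and uses a naive comma split for the header line:

library(RSQLite)

# Stream a large CSV into an existing SQLite table in chunks,
# so the whole file never has to fit in RAM.
append_csv_chunked <- function(con, csv_file, table, chunk_size = 50000) {
  input <- file(csv_file, open = "r")
  on.exit(close(input))
  # naive header split; fine for simple headers without quoted commas
  header <- strsplit(readLines(input, n = 1), ",")[[1]]
  repeat {
    chunk <- tryCatch(
      read.csv(input, header = FALSE, col.names = header, nrows = chunk_size),
      error = function(e) NULL)  # read.csv errors once the connection is exhausted
    if (is.null(chunk) || nrow(chunk) == 0) break
    # consider fixing colClasses so every chunk is parsed with the same types
    dbWriteTable(con, table, chunk, append = TRUE)  # append, never overwrite
  }
  invisible(NULL)
}

# e.g. append_csv_chunked(con, "big2.csv", "both")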
Related
I have multiple large, similar data files stored in .csv format. These are data files released annually. Most of them have the same variables, but in some years variables were added or renamed.
I am looping through my directory of files (~30 .csv files), converting them to data frames, and importing them to a Google Cloud SQL PostgreSQL 12 database via:
DBI::dbAppendTable(con, tablename, df)
where con is my connection to the database, tablename is the table name, and df is the data frame produced from a .csv.
The problem is that each of these .csv files can have a different number of columns, and some won't have columns that others have.
Is there an easy way to pre-define a structure in the PostgreSQL 12 database that specifies "any of these .csv columns all go into this one database column" and also "any columns not included in the .csv should be filled with NA in the database"? I think I could come up with something in R to make all the data frames look similar prior to uploading to the database, but it seems cumbersome. I am imagining a document, like a JSON file, that the SQL database compares against, kind of like below:
SQL | Data frame
----------------------------------
age = "age","Age","AGE"
sex = "Sex","sex","Gender","gender"
...
fnstatus = "funcstatus","FNstatus"
This would specify to the database all the possible columns it might see and how to parse those. And for columns it doesn't see in a given .csv, it would fill all records with NA.
While I cannot say whether such a feature is available (Postgres has many novel methods and extended data types), I would be hesitant to rely on one, as maintainability can be a challenge.
Enterprise, server-side relational databases like PostgreSQL should be treated as planned infrastructure. As r2evans comments, tables (including schemas, columns, users, etc.) should be defined up front: designers need to think through all uses and needs before any data migration or interaction. Dynamically adjusting database tables and columns for one-off application needs is usually not recommended. So clients like R should dynamically align their data to meet the planned, relational database specification, for example as sketched below.
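As a hedged illustration of that client-side alignment (the mapping list reuses the synonyms from the question; align_to_schema and the example file name are made up for illustration), one could rename known synonyms to the canonical database names and fill missing columns with NA before appending:

name_map <- list(
  age      = c("age", "Age", "AGE"),
  sex      = c("sex", "Sex", "gender", "Gender"),
  fnstatus = c("funcstatus", "FNstatus")
)

align_to_schema <- function(df, name_map) {
  for (canon in names(name_map)) {
    hit <- intersect(name_map[[canon]], names(df))      # synonym present in this CSV?
    if (length(hit)) names(df)[names(df) == hit[1]] <- canon
  }
  missing <- setdiff(names(name_map), names(df))
  if (length(missing)) df[missing] <- NA                 # columns absent from this CSV
  df[names(name_map)]                                    # canonical columns, canonical order
}

# df_ready <- align_to_schema(read.csv("file_2019.csv"), name_map)
# DBI::dbAppendTable(con, tablename, df_ready)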
Another approach is to use a temporary table as a staging area for all raw CSV data, possibly with every column declared as VARCHAR. Populate this table with the raw data, then migrate it to the final destination table in a single append query, using COALESCE to merge synonymous columns and :: to cast types.
library(DBI)

# BUILD LIST OF DFs FROM ALL CSVs
df_list <- lapply(list_of_csvs, read.csv)

# NORMALIZE ALL COLUMN NAMES TO LOWER CASE
df_list <- lapply(df_list, function(df) setNames(df, tolower(names(df))))

# RETURN VECTOR OF UNIQUE NAMES ACROSS ALL CSVs
all_names <- unique(unlist(lapply(df_list, names)))

# CREATE STAGING TABLE WITH EVERY COLUMN AS VARCHAR
dbExecute(con, "DROP TABLE IF EXISTS myTempTable")

sql <- paste("CREATE TABLE myTempTable (",
             paste(all_names, collapse = " VARCHAR(100), "),
             "VARCHAR(100)",
             ")")
dbExecute(con, sql)

# APPEND DATA FRAMES TO STAGING TABLE (MISSING COLUMNS ARE LEFT NULL)
lapply(df_list, function(df) DBI::dbAppendTable(con, "myTempTable", df))
# RUN FINAL CLEANED APPEND QUERY
sql <- "INSERT INTO myFinalTable (age, sex, fnstatus, ...)
SELECT COALESCE(age)::int
, COALESCE(sex, gender)::varchar(5)
, COALESCE(funcstatus, fnstatus)::varchar(10)
...
FROM myTempTable"
dbExecute(con, sql)
One thing I ran into using the RPostgreSQL package is that dbWriteTable(…, overwrite=TRUE) seems to destroy the existing table structure (data types and constraints), and dbRemoveTable() is equivalent to DROP TABLE.
So I’ve used:
ltvTable <- "the_table_to_use"
dfLTV <- data.frame(x, y, z)

sql_truncate <- paste("TRUNCATE", ltvTable)  ## unconditional DELETE FROM …
res <- dbSendQuery(conn = con, statement = sql_truncate)

dbWriteTable(conn = con, name = ltvTable, value = dfLTV, row.names = FALSE, append = TRUE)
Is the TRUNCATE step necessary, or is there a dbWriteTable method that overwrites just the content, not the structure?
I experience different behaviour from that described in Manura Omal's answer to "How we can write data to a postgres DB table using R?", as overwrite=TRUE does not appear to truncate first.
I'm using: RPostgreSQL 0.4-1; PostgreSQL 9.4
best wishes - JS
As far as I know, overwrite=TRUE does three things:
DROPs the table
CREATEs the table
fills the table with the new data
So if you want to preserve the structure, you need the TRUNCATE step.
The different behaviour might be caused by the presence or absence of foreign keys preventing the DROP TABLE step.
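As a hedged sketch of that truncate-then-append pattern (refresh_table is a made-up helper, shown with the DBI generics of a modern backend such as RPostgres; the table must already exist with the desired types and constraints), wrapping both steps in a transaction keeps the table from being left empty if the append fails:

library(DBI)

refresh_table <- function(con, tbl, df) {
  dbWithTransaction(con, {
    # TRUNCATE keeps data types and constraints, unlike DROP/overwrite
    dbExecute(con, paste("TRUNCATE", dbQuoteIdentifier(con, tbl)))
    dbAppendTable(con, tbl, df)  # write the content only
  })
}

# refresh_table(con, "the_table_to_use", dfLTV)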
I would like to bulk-INSERT/UPSERT a moderately large number of rows into a PostgreSQL database using R. In order to do so, I am preparing a multi-row INSERT string using R.
query <- sprintf("BEGIN;
CREATE TEMPORARY TABLE
md_updates(ts_key varchar, meta_data hstore) ON COMMIT DROP;
INSERT INTO md_updates(ts_key, meta_data) VALUES %s;
LOCK TABLE %s.meta_data_unlocalized IN EXCLUSIVE MODE;
UPDATE %s.meta_data_unlocalized
SET meta_data = md_updates.meta_data
FROM md_updates
WHERE md_updates.ts_key = %s.meta_data_unlocalized.ts_key;
COMMIT;", md_values, schema, schema, schema, schema)
DBI::dbGetQuery(con,query)
The entire function can be found here. Surprisingly (at least to me), I learned that the UPDATE part is not the problem: I left it out, ran the query again, and it wasn't much faster. INSERTing a million+ records seems to be the issue here.
I did some research and found quite a bit of information:
bulk inserts
bulk inserts II
what causes large inserts to slow down
The answers from Erwin Brandstetter and Craig Ringer were particularly helpful. I was able to speed things up quite a bit by dropping indices and following a few other suggestions.
However, I struggled to implement another suggestion which sounded promising: COPY. The problem is that I can't get it done from within R.
The following works for me:
sql <- sprintf('CREATE TABLE
md_updates(ts_key varchar, meta_data hstore);
COPY md_updates FROM STDIN;')
dbGetQuery(sandbox,"COPY md_updates FROM 'test.csv' DELIMITER ';' CSV;")
But I can't get it done without reading from an extra .csv file. So my questions are:
Is COPY really a promising approach here (over the multi-row INSERT I have)?
Is there a way to use COPY from within R without writing the data to a file? The data fits in memory, and since it's already in memory, why write it to disk?
I am using PostgreSQL 9.5 on OS X and 9.5 on RHEL, respectively.
RPostgreSQL has a postgresqlCopyInDataframe function that looks like it should do what you want:
install.packages("RPostgreSQL")
library(RPostgreSQL)
con <- dbConnect(PostgreSQL(), user="...", password="...", dbname="...", host="...")
dbSendQuery(con, "copy foo from stdin")
postgresqlCopyInDataframe(con, df)
where table foo has the same columns as the data frame df.
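A hedged sketch of how this could apply to the md_updates case from the question (md_updates_df and schema are placeholder names; the hstore column is assumed to be stored as its text representation in the data frame):

library(RPostgreSQL)

# Stage the data frame via COPY ... FROM STDIN, without an intermediate CSV file
dbSendQuery(con, "CREATE TEMPORARY TABLE md_updates (ts_key varchar, meta_data hstore)")
dbSendQuery(con, "COPY md_updates FROM STDIN")
postgresqlCopyInDataframe(con, md_updates_df)  # columns: ts_key, meta_data

# Then run the UPDATE as in the question
dbSendQuery(con, sprintf(
  "UPDATE %s.meta_data_unlocalized u
      SET meta_data = m.meta_data
     FROM md_updates m
    WHERE m.ts_key = u.ts_key", schema))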
I loaded some spatial data from my PostgreSQL DB into R with the help of the RPostgreSQL-package and ST_AsText:
dbGetQuery(con, "SELECT id, ST_AsText(geom) FROM table;")
After having done some analyses, I want to write the data back the same way. My geom column is still formatted as character / WKT. Unfortunately, dbWriteTable doesn't accept SQL expressions (such as ST_GeomFromText) the way dbGetQuery does.
What is the best way to import spatial data from R to PostgreSQL?
Up to now, the only way I have found is to import the data into the DB and then use ST_GeomFromText in an additional step to get my geometry data type.
I created an empty table in my DB using dbWriteTable() and then used the postGIStools package to INSERT the data (it must be a SpatialPolygonsDataFrame).
## Create an empty table on DB
dbWriteTable(con, name=c("public", "<table>"), value=(dataframe[-c(1:nrow(dataframe)), ]))
require(postGIStools)
## INSERT INTO
postgis_insert(con, df=SpatialPolygonsDataFrame, tbl="table", geom_name="st_astext")
dbSendQuery(con, "ALTER TABLE <table> RENAME st_astext TO geom;")
dbSendQuery(con, "ALTER TABLE <table> ALTER COLUMN geom TYPE geometry;")
I have a requirement to fetch data from Hive tables into a CSV file and use it in RevoScaleR.
Currently we pull the data from Hive, manually put it into a file, and use it on the Unix file system for ad hoc analysis; however, the requirement is to redirect the results directly into an HDFS location and use RevoScaleR from there.
How do I do that, or what sort of connection do I need to establish for this?
If I understand your question correctly, you could use a RevoScaleR ODBC connection to import the Hive table and do further analysis from there.
Here is an example using the Hortonworks-provided ODBC driver:
OdbcConnString <- "DSN=Sample Hortonworks Hive DSN"
odbcDS <- RxOdbcData(sqlQuery = "SELECT * FROM airline",
connectionString=OdbcConnString,
stringsAsFactors=TRUE,
useFastRead = TRUE,
rowsPerRead=150000)
xdfFile <- "airlineHWS.xdf"
if(file.exists(xdfFile)) file.remove(xdfFile)
Flights<-rxImport(odbcDS, outFile=xdfFile,overwrite=TRUE)
rxGetInfo(data="airlineHWS.xdf", getVarInfo=TRUE,numRows = 10)
Chenwei's approach is OK, but there is one problem: the data is temporarily stored in memory as a data frame via the odbcDS object. If we have a huge table in Hive, that will not work.
I would suggest keeping everything on disk by using external tables in Hive and then using the backing data directly in Revolution R.
Something along these lines:
Create an external table from the existing Hive table in text file (CSV, tab-delimited, etc.) format.
CREATE EXTERNAL TABLE ext_table
LIKE your_original_table_name
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/your/hdfs/location';
Here we are creating an external table which is stored as CSV files in HDFS.
Next, copy the original table into the external table using an INSERT OVERWRITE command.
insert overwrite table ext_table select * from your_original_table_name
If we want to check the backing data on HDFS, type:
hadoop fs -ls /your/hdfs/location/
We should see the part files stored at that location. Go ahead and cat them to be doubly sure.
Now we can use the RxTextData function to read the data from the step above:
hive_data <- RxTextData(file='/your/hdfs/location/', delimiter = ',')
Now you can import hive_data into an XDF file for more efficient further processing (a sketch follows below); above all, the data never has to be loaded into memory in full.
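A hedged sketch of that last step (the paths are placeholders; this assumes a RevoScaleR setup with HDFS access via RxHdfsFileSystem):

# Import the delimited HDFS data into a (composite) XDF file so later
# RevoScaleR calls do not have to re-parse the text files each time.
hdfsFS   <- RxHdfsFileSystem()
hive_txt <- RxTextData(file = "/your/hdfs/location/",
                       delimiter = ",", fileSystem = hdfsFS)
hive_xdf <- RxXdfData("/your/hdfs/location_xdf", fileSystem = hdfsFS)

rxImport(inData = hive_txt, outFile = hive_xdf, overwrite = TRUE)
rxGetInfo(hive_xdf, getVarInfo = TRUE, numRows = 5)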