I've been processing data with R, and each run results in a data frame of typically 4860 observations. I write that to a Results table in a SQLite database like this:
db = dbConnect(RSQLite::SQLite(), dbname=DATAFILE)
dbWriteTable(db, "Results", my_dataframe, append = TRUE)
dbDisconnect(db)
Then I process some more data and later write it to the same table using this same code.
The problem is, every now and again, what's written to my SQLite file is some multiple of the 4860 records I expect. Just now it was 19440 (exactly 4X the 4860 records that I can see in RStudio are in my data frame).
This seems such a random problem. Since I know the data frame's contents are correct, I feel as though the problem must be in my use of dbWriteTable(). Any guidance would be appreciated. Thank you.
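For what it's worth, one way to narrow this down is to count the rows before and after each append and wrap the write in an explicit transaction. This is only a defensive sketch using DATAFILE and my_dataframe from the question, not a known fix for the root cause:

library(DBI)
library(RSQLite)

db <- dbConnect(RSQLite::SQLite(), dbname = DATAFILE)

# row count before the append (0 if the table doesn't exist yet)
before <- if (dbExistsTable(db, "Results")) {
  dbGetQuery(db, "SELECT COUNT(*) AS n FROM Results")$n
} else 0

# write inside an explicit transaction so a partial or failed append rolls back
dbWithTransaction(db, {
  dbWriteTable(db, "Results", my_dataframe, append = TRUE)
})

# verify that exactly nrow(my_dataframe) rows were added
after <- dbGetQuery(db, "SELECT COUNT(*) AS n FROM Results")$n
if (after - before != nrow(my_dataframe)) {
  warning("Expected ", nrow(my_dataframe), " new rows but got ", after - before)
}

dbDisconnect(db)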
I have a .csv file that contains 105M rows and 30 columns that I would like to query for plotting in an R shiny app.
It contains alphanumeric data that looks like:
#Example data
df <- data.frame(percent = as.numeric(rep(c("50", "80"), each = 5e2)),
                 maskProportion = as.numeric(rep(c("50", "80"), each = 5e2)),
                 dose = runif(1e3),
                 origin = as.factor(rep(c("ABC", "DEF"), each = 5e2)),
                 destination = as.factor(rep(c("XYZ", "GHI"), each = 5e2))
                 )
write.csv(df,"PassengerData.csv")
In the terminal, I have ingested it into an SQLite database as follows:
$ sqlite3 -csv PassengerData.sqlite3 '.import PassengerData.csv df'
which is from:
Creating an SQLite DB in R from an CSV file: why is the DB file 0KB and contains no tables?
So far so good.
The problem I have is the speed of querying from R, so I tried indexing the DB back in the terminal.
In sqlite3, I tried creating indexes on percent, maskProportion, origin and destination following this link https://data.library.virginia.edu/creating-a-sqlite-database-for-use-with-r/ :
$ sqlite3 PassengerData.sqlite3 'CREATE INDEX "percent" ON PassengerData("percent");'
$ sqlite3 PassengerData.sqlite3 'CREATE INDEX "origin" ON PassengerData("origin");'
$ sqlite3 PassengerData.sqlite3 'CREATE INDEX "destination" ON PassengerData("destination");'
$ sqlite3 PassengerData.sqlite3 'CREATE INDEX "maskProp" ON PassengerData("maskProp");'
I run out of disk space because the DB seems to grow every time I create an index; e.g., after running the first command, the file is 20GB. How can I avoid this?
I assume the concern is that running collect() to transfer data from SQL to R is too slow for your app. It is not clear how / whether you are processing the data in SQL before passing to R.
Several things to consider:
Indexes are not copied from SQL to R. SQL works with data off disk, so knowing where to look up specific parts of your data results in time savings. R works on data in memory, so indexes are not required.
collect transfers data from a remote table (in this case SQLite) into R memory. If your goal is to transfer data into R, you could read the csv directly into R instead of writing it to SQL and then reading from SQL into R.
SQL is a better choice for data crunching / preparation of large datasets, and R is a better choice for detailed analysis and visualisation. But if both R and SQL are running on the same machine, then both are limited by the CPU speed. This is not a concern if SQL and R are running on separate hardware.
Some things you can do to improve performance:
(1) Only read the data you need from SQL into R. Prepare the data in SQL first. For example, contrast the following:
# collect last
local_r_df = remote_sql_df %>%
  group_by(origin) %>%
  summarise(number = n()) %>%
  collect()

# collect first
local_r_df = remote_sql_df %>%
  collect() %>%
  group_by(origin) %>%
  summarise(number = n())
Both of these will produce the same output. However, in the first example, the summary takes place in SQL and only the final result is copied to R; while in the second example, the entire table is copied to R where it is then summarized. Collect last will likely have better performance than collect first because it transfers only a small amount of data between SQL and R.
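For completeness, remote_sql_df here is assumed to be a lazy dbplyr reference to the SQLite table rather than a local data frame. A minimal sketch of setting that up, using the table name df from the .import step above:

library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(RSQLite::SQLite(), "PassengerData.sqlite3")
remote_sql_df <- tbl(con, "df")   # lazy reference; nothing is pulled into R yet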
(2) Preprocess the data for your app. If your app will only examine the data from a limited number of directions, then the data could be preprocessed / pre-summarized.
For example, suppose users can pick at most two dimensions and receive a cross-tab, then you could calculate all the two-way cross-tabs and save them. This is likely to be much smaller than the entire database. Then at runtime, your app loads the prepared summaries and shows the user any summary they request. This will likely be much faster.
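As a rough sketch of that pre-summarising idea (column names taken from the example data, remote_sql_df as above, and the output file name is made up):

library(dplyr)

dims  <- c("percent", "maskProportion", "origin", "destination")
pairs <- combn(dims, 2, simplify = FALSE)

# one summarised cross-tab per pair of dimensions, computed in SQL, then collected
crosstabs <- lapply(pairs, function(p) {
  remote_sql_df %>%
    group_by(!!!rlang::syms(p)) %>%
    summarise(number = n()) %>%
    collect()
})
names(crosstabs) <- sapply(pairs, paste, collapse = "_")

# the shiny app can load this small file at startup instead of querying the DB
saveRDS(crosstabs, "precomputed_crosstabs.rds")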
I have multiple large similar data files stored in .csv format. These are data files released annually. Most of these have the same variables but in some years they have added variables or changed the names of variables.
I am looping through my directory of files (~30 .csv files), converting them to data frames, and importing them to a Google Cloud SQL PostgreSQL 12 database via:
DBI::dbAppendTable(con, tablename, df)
where con is my connection to the database, tablename is the table name, and df is the data frame produced from a .csv.
The problem is that each of these .csv files has a different number of columns, and some lack columns that others have.
Is there an easy way to pre-define a structure in the PostgreSQL 12 database that specifies "any of these .csv columns should all go into this one database column" and "any columns not included in the .csv should be filled with NA in the database"? I think I could come up with something in R to make all the data frames look alike before uploading to the database, but it seems cumbersome. I am imagining a document, something like a JSON, that the SQL database compares against, kind of like below:
SQL | Data frame
----------------------------------
age = "age","Age","AGE"
sex = "Sex","sex","Gender","gender"
...
fnstatus = "funcstatus","FNstatus"
This would specify to the database all the possible columns it might see and how to parse those. And for columns it doesn't see in a given .csv, it would fill all records with NA.
While I cannot say whether such a feature is available (Postgres has many novel methods and extended data types), I would be hesitant to rely on one, as maintainability can be a challenge.
Enterprise, server, relational databases like PostgreSQL should be planned infrastructure systems. As r2evans comments, tables (including schemas, columns, users, etc.) should be defined up front. Designers need to think through all uses and needs before any data migration or interaction. Dynamically adjusting database tables and columns to one-off application needs is usually not recommended. So clients like R should dynamically align data to meet the planned, relational database specifications.
One approach is to use a temporary table as a staging area for all raw CSV data, possibly with every column typed as VARCHAR. Populate this table with the raw data, then migrate it to the final destination in a single append query, using COALESCE to merge alternative column names and :: to cast to the final types.
# BUILD LIST OF DFs FROM ALL CSVs
df_list <- lapply(list_of_csvs, read.csv)
# NORMALIZE ALL COLUMN NAMES TO LOWER CASE
df_list <- lapply(df_list, function(df) setNames(df, tolower(names(df))))
# RETURN VECTOR OF UNIQUE COLUMN NAMES ACROSS ALL CSVs
all_names <- unique(unlist(lapply(df_list, names)))
# CREATE TABLE QUERY
dbSendQuery(con, "DROP TABLE IF EXISTS myTempTable")
sql <- paste("CREATE TABLE myTempTable (",
paste(all_names, collapse = " VARCHAR(100), "),
"VARCHAR(100)",
")")
dbExecute(con, sql)
# APPEND DATAFRAMES TO TEMP TABLE
lapply(df_list, function(df) DBI::dbAppendTable(con, "myTempTable", df))
# RUN FINAL CLEANED APPEND QUERY
sql <- "INSERT INTO myFinalTable (age, sex, fnstatus, ...)
SELECT COALESCE(age)::int
, COALESCE(sex, gender)::varchar(5)
, COALESCE(funcstatus, fnstatus)::varchar(10)
...
FROM myTempTable"
dbExecute(con, sql)
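Alternatively, the name mapping sketched in the question can live on the R side: rename incoming columns against a lookup vector and pad missing ones with NA before calling dbAppendTable. A rough sketch (the mapping, the file name, and the target columns below are illustrative):

# lookup: names are raw CSV headers, values are the planned database columns
name_map <- c(age = "age", Age = "age", AGE = "age",
              sex = "sex", Sex = "sex", gender = "sex", Gender = "sex",
              funcstatus = "fnstatus", FNstatus = "fnstatus")

align_df <- function(df, map, target_cols) {
  hits <- names(df) %in% names(map)
  names(df)[hits] <- map[names(df)[hits]]
  # columns the CSV lacks are added and filled with NA
  missing <- setdiff(target_cols, names(df))
  if (length(missing) > 0) df[missing] <- NA
  df[target_cols]
}

df_aligned <- align_df(read.csv("one_year.csv"), name_map,
                       target_cols = c("age", "sex", "fnstatus"))
DBI::dbAppendTable(con, "myFinalTable", df_aligned)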
I would like to bulk-INSERT/UPSERT a moderately large number of rows into a PostgreSQL database using R. To do so, I am preparing a multi-row INSERT string using R.
query <- sprintf("BEGIN;
CREATE TEMPORARY TABLE
md_updates(ts_key varchar, meta_data hstore) ON COMMIT DROP;
INSERT INTO md_updates(ts_key, meta_data) VALUES %s;
LOCK TABLE %s.meta_data_unlocalized IN EXCLUSIVE MODE;
UPDATE %s.meta_data_unlocalized
SET meta_data = md_updates.meta_data
FROM md_updates
WHERE md_updates.ts_key = %s.meta_data_unlocalized.ts_key;
COMMIT;", md_values, schema, schema, schema, schema)
DBI::dbGetQuery(con,query)
The entire function can be found here. Surprisingly (at least to me), I learned that the UPDATE part is not the problem. I left it out, ran the query again, and it wasn't much faster. INSERTing a million+ records seems to be the issue here.
I did some research and found quite a bit of information:
bulk inserts
bulk inserts II
what causes large inserts to slow down
Answers from @Erwin Brandstetter and @Craig Ringer were particularly helpful. I was able to speed things up quite a bit by dropping indices and following a few other suggestions.
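For reference, the drop-and-recreate-index pattern those answers describe can be scripted from R as well; a minimal sketch (the index name is hypothetical and schema qualification is omitted):

# hypothetical index on the target table: drop it, do the bulk load, recreate it
DBI::dbExecute(con, "DROP INDEX IF EXISTS meta_data_unlocalized_ts_key_idx")

# ... run the bulk INSERT / UPDATE shown above ...

DBI::dbExecute(con, "CREATE INDEX meta_data_unlocalized_ts_key_idx
                       ON meta_data_unlocalized (ts_key)")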
However, I struggled to implement another suggestion which sounded promising: COPY. The problem is I can't get it done from within R.
The following works for me:
sql <- sprintf('CREATE TABLE
md_updates(ts_key varchar, meta_data hstore);
COPY md_updates FROM STDIN;')
dbGetQuery(sandbox,"COPY md_updates FROM 'test.csv' DELIMITER ';' CSV;")
But I can't get it done without reading from an extra .csv file. So my questions are:
Is COPY really a promising approach here (over the multi-row INSERT I have)?
Is there a way to use COPY from within R without writing the data to a file? The data fits in memory, and since it's already in memory, why write it to disk?
I am using PostgreSQL 9.5, on OS X and on RHEL.
RPostgreSQL has a "CopyInDataframe" function that looks like it should do what you want:
install.packages("RPostgreSQL")
library(RPostgreSQL)
con <- dbConnect(PostgreSQL(), user="...", password="...", dbname="...", host="...")
dbSendQuery(con, "copy foo from stdin")
postgresqlCopyInDataframe(con, df)
Here, table foo must have the same columns as data frame df.
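Tying this back to the md_updates table from the question, a rough end-to-end sketch (the data frame name is made up, and the hstore column is assumed to arrive in R as its text representation, e.g. 'key=>value'):

library(RPostgreSQL)

con <- dbConnect(PostgreSQL(), user = "...", password = "...",
                 dbname = "...", host = "...")

# the destination table must exist with matching columns before the COPY
dbSendQuery(con, "CREATE TABLE md_updates (ts_key varchar, meta_data hstore)")

# stream the data frame over COPY ... FROM STDIN, no intermediate file on disk
dbSendQuery(con, "COPY md_updates FROM STDIN")
postgresqlCopyInDataframe(con, md_updates_df)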
I am trying to import data from an Excel sheet into an Oracle table. I am able to extract the correct data, but when I run the following code:
$bulkCopy = new-object ("Oracle.DataAccess.Client.OracleBulkCopy") $oraClientConnString
$bulkCopy.DestinationTableName = $entity
$bulkCopy.BatchSize = 5000
$bulkCopy.BulkCopyTimeout = 10000
$bulkCopy.WriteToServer($dt)
$bulkcopy.close()
$bulkcopy.Dispose()
The data inserted into the table is garbage: just 0s and 10s.
The values read from Excel are stored in a data table ($dt).
Any help will be highly appreciated.
Please check the data type of the values in your data table. I have experienced this issue in .NET with the data type Double. When I changed the data type to Decimal, everything was fine.
I am learning SQLite. I have a big data frame in .csv format, and I imported it into SQLite:
db <- dbConnect(SQLite(), dbname="myDB.sqlite")
dbWriteTable(conn = db, name = "myDB", dataframe, overwrite = TRUE,
             row.names = FALSE)
After that, I saw there is a myDB.sqlite file in my directory, but it is zero bytes. How can I save the data frame in SQLite so that I don't need to write the table every time? Thanks.
It should be written to your db. Like I said before, your code works for me; it's just that the RStudio file viewer doesn't automatically refresh when files have been written to.
Just to be sure that the data was written to your db, try running dbGetQuery(conn = db, "SELECT * FROM myDB"). That should return your data frame.
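To confirm the data actually persists between sessions, you can also disconnect, reconnect, and query again; a quick sketch:

library(DBI)
library(RSQLite)

dbDisconnect(db)

db <- dbConnect(SQLite(), dbname = "myDB.sqlite")
dbListTables(db)                                  # should include "myDB"
dbGetQuery(db, "SELECT COUNT(*) AS n FROM myDB")  # row count of the stored table
dbDisconnect(db)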