I am learning SQLite. I have a big data frame in CSV format and I imported it into SQLite:
library(DBI)
library(RSQLite)

db <- dbConnect(SQLite(), dbname = "myDB.sqlite")
dbWriteTable(conn = db, name = "myDB", dataframe, overwrite = TRUE,
             row.names = FALSE)
After that, I saw there is a myDB.sqlite file in my directory, but with zero bytes. How can I save the data frame in SQLite so that I don't need to write the table every time? Thanks.
It should be written to your db. Like I said before, your code works for me; it's just that the RStudio file viewer doesn't automatically refresh when some of the files have been written to.
Just to be sure that the data was written to your db, try running dbGetQuery(conn = db, "SELECT * FROM myDB"). That should return your data frame.
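For example, a minimal round trip could look like this (a sketch only; dataframe is your data frame, and the file and table names come from your code above). Once the table is written, later sessions only need to reconnect and read:
library(DBI)
library(RSQLite)
# write the data frame once
db <- dbConnect(SQLite(), dbname = "myDB.sqlite")
dbWriteTable(conn = db, name = "myDB", dataframe, overwrite = TRUE, row.names = FALSE)
dbDisconnect(db)
# in a later session: reconnect and read the table back, no rewrite needed
db <- dbConnect(SQLite(), dbname = "myDB.sqlite")
stored_data <- dbGetQuery(db, "SELECT * FROM myDB")
dbDisconnect(db)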
I'm using the sqldf package to import CSV files into R and then produce statistics based on the data inserted into the created data frames. We have a shared lab environment with a lot of users, which means we all share the available RAM on the same server. Although there is a high capacity of RAM available, given the number of users who are often connected simultaneously, the administrator of the environment recommends importing our files into some database (PostgreSQL, SQLite, etc.) instead of importing everything into memory.
I was checking the documentation of the sqldf package and the read.csv.sql function caught my attention. Here is what the documentation says:
Reads the indicated file into an sql database creating the database if
it does not already exist. Then it applies the sql statement returning
the result as a data frame. If the database did not exist prior to
this statement it is removed.
However, what I don't understand is whether the returned result, as a data frame, will be in memory (therefore in the RAM of the server) or whether, like the imported CSV file, it will be in the specified database. If it is in memory, it doesn't meet my requirement, which is to reduce the use of the available shared RAM as much as possible, given the huge size of my CSV files (tens of gigabytes, often more than 100 000 000 lines per file).
Curious to see how this works, I wrote the following program:
library(sqldf)

df_data <- suppressWarnings(read.csv.sql(
  file = "X:/logs/data.csv",
  sql = "
    select
      nullif(timestamp, '')      as timestamp_value,
      nullif(user_account, '')   as user_account,
      nullif(country_code, '')   as country_code,
      nullif(prefix_value, '')   as prefix_value,
      nullif(user_query, '')     as user_query,
      nullif(returned_code, '')  as returned_code,
      nullif(execution_time, '') as execution_time,
      nullif(output_format, '')  as output_format
    from
      file
  ",
  header = FALSE,
  sep = "|",
  eol = "\n",
  `field.types` = list(
    timestamp_value = c("TEXT"),
    user_account    = c("TEXT"),
    country_code    = c("TEXT"),
    prefix_value    = c("TEXT"),
    user_query      = c("TEXT"),
    returned_code   = c("TEXT"),
    execution_time  = c("REAL"),
    output_format   = c("TEXT")
  ),
  dbname = "X:/logs/sqlite_tmp.db",
  drv = "SQLite"
))
I ran the above program to import a big CSV file (almost 150 000 000 rows). It took around 30 minutes. During the execution, as specified via the dbname parameter in the source code, I saw that a SQLite database file was created at X:/logs/sqlite_tmp.db. As the rows in the file were being imported, this file became bigger and bigger, which indicated that all rows were indeed being inserted into the database file on disk and not into memory (the server's RAM). At the end of the import, the database file had reached 30 GB. As stated in the documentation, at the end of the import process, this database was removed automatically. Yet after the created SQLite database had been automatically removed, I was still able to work with the result data frame (that is, df_data in the above code).
What I understand is that the returned data frame was in the RAM of the server; otherwise I wouldn't have been able to refer to it after the created SQLite database had been removed. Please correct me if I'm wrong, but if that is the case, I think I misunderstood the purpose of this R package. My aim was to put everything, even the result data frame, in a database, and to use the RAM only for calculations. Is there any way to keep everything in the database until the end of the program?
The purpose of sqldf is to process data frames using SQL. If you want to create a database and read a file into it, you can use dbWriteTable from RSQLite directly; however, if you want to use sqldf anyway, then first create an empty database, mydb, then read the file into it, and finally check that the table is there. Ignore the read.csv.sql warning. If you add the verbose = TRUE argument to read.csv.sql, it will show the RSQLite statements it is using.
Also, you may wish to read https://avi.im/blag/2021/fast-sqlite-inserts/ and https://www.pdq.com/blog/improving-bulk-insert-speed-in-sqlite-a-comparison-of-transactions/
library(sqldf)
sqldf("attach 'mydb' as new")
read.csv.sql("myfile.csv", sql =
"create table mytab as select * from file", dbname = "mydb")
## data frame with 0 columns and 0 rows
## Warning message:
## In result_fetch(res@ptr, n = n) :
## SQL statements must be issued with dbExecute() or
## dbSendStatement() instead of dbGetQuery() or dbSendQuery().
sqldf("select * from sqlite_master", dbname = "mydb")
## type name tbl_name rootpage
## .. info on table that was created in mydb ...
sqldf("select count(*) from mytab", dbname = "mydb")
I have my final output in an R data frame. I need to write this output to a database in Azure Databricks. Can someone help me with the syntax? I used this code:
require(SparkR)
data1 <- createDataFrame(output)
write.df(data1, path="dbfs:/datainput/sample_dataset.parquet",
source="parquet", mode="overwrite")
This code runs without error, but I don't see the database in the datainput folder (mentioned in the path). Is there some other way to do it?
I believe you are looking for the saveAsTable function. write.df only saves the data to a file system; it does not register the data as a table.
require(SparkR)
data1 <- createDataFrame(output)
saveAsTable(data1, tableName = "default.sample_table", source="parquet", mode="overwrite")
In the above code, default is an existing database name, under which a new table named sample_table will be created. If you specify just sample_table instead of default.sample_table, it will still be saved in the default database.
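To confirm the write, you can read the table back as a SparkDataFrame and preview a few rows; a quick check might look like this (using the table name from the example above):
require(SparkR)
# read the saved table back and preview it
saved <- tableToDF("default.sample_table")
head(saved)
# or equivalently via SQL
head(sql("SELECT * FROM default.sample_table LIMIT 10"))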
I have a *.csv file containing columnar numbers and strings (13 GB on disk) that I imported into a new duckdb (or sqlite) database and saved, so I can access it later from R. But reconnecting seems to duplicate it and is very slow. Am I doing this wrong?
From within R, I am doing the following:
library(duckdb)
library(dplyr)
library(DBI)
#Create the DB
con <- dbConnect(duckdb::duckdb(), "FINAL_data.duckdb")
#Read in the csv
duckdb_read_csv(con, "data", "FINAL_data_new.csv")
I then close R and restart it to see if it has worked:
# This is super slow (about 10 minutes), because it looks like it's writing the DB again somewhere. But why?
con <- dbConnect(duckdb::duckdb(), "FINAL_data.duckdb")
NB: I've added sqlite as a tag because I don't think this is particular to duckdb.
The slowdown you experienced is due to the database checkpointing on startup. This has been fixed on the master branch already.
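In the meantime, a quick way to confirm that the data is already persisted (and not being re-imported) is to reconnect and query the existing table directly. A small sketch, where the table name data comes from the duckdb_read_csv() call above:
library(DBI)
library(duckdb)
con <- dbConnect(duckdb::duckdb(), "FINAL_data.duckdb")
# the table written earlier is still there; only the checkpoint on startup takes time
dbGetQuery(con, "SELECT count(*) AS n FROM data")
dbDisconnect(con, shutdown = TRUE)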
I've been processing data with R, which results in a data frame of typically 4860 observations. I write that to a Results table in a SQLite database like this:
db = dbConnect(RSQLite::SQLite(), dbname=DATAFILE)
dbWriteTable(db, "Results", my_dataframe, append = TRUE)
dbDisconnect(db)
Then I process some more data and later write it to the same table using this same code.
The problem is, every now and again, what's written to my SQLite file is some multiple of the 4860 records I expect. Just now it was 19448 (exactly 4X the 4860 records that I can see in RStudio are in my data frame).
This seems like such a random problem. As I know the data frame contents are correct, I feel as though the problem must be in my use of dbWriteTable(). Any guidance would be appreciated. Thank you.
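One hedged way to narrow this down is to instrument the write itself, checking the table's row count before and after each append so a single run can report how many rows it actually added. A diagnostic sketch (it assumes the Results table already exists):
library(DBI)
library(RSQLite)
db <- dbConnect(RSQLite::SQLite(), dbname = DATAFILE)
before <- dbGetQuery(db, "SELECT count(*) AS n FROM Results")$n
dbWriteTable(db, "Results", my_dataframe, append = TRUE)
after <- dbGetQuery(db, "SELECT count(*) AS n FROM Results")$n
# if more rows arrived than the data frame contains, the write ran more than once
stopifnot(after - before == nrow(my_dataframe))
dbDisconnect(db)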
I have a requirement to fetch data from HIVE tables into a CSV file and use it in RevoScaleR.
Currently we pull the data from HIVE, manually put it into a file, and use it from the Unix file system for ad hoc analysis. However, the requirement is to redirect the result directly into an HDFS location and use RevoScaleR from there.
How do I do that, or what sort of connection do I need to establish for this?
If I understand your question correctly, you could use a RevoScaleR ODBC connection to import the HIVE table and do further analysis from there.
Here is an example using the Hortonworks-provided ODBC driver:
library(RevoScaleR)

OdbcConnString <- "DSN=Sample Hortonworks Hive DSN"

odbcDS <- RxOdbcData(sqlQuery = "SELECT * FROM airline",
                     connectionString = OdbcConnString,
                     stringsAsFactors = TRUE,
                     useFastRead = TRUE,
                     rowsPerRead = 150000)

xdfFile <- "airlineHWS.xdf"
if (file.exists(xdfFile)) file.remove(xdfFile)

Flights <- rxImport(odbcDS, outFile = xdfFile, overwrite = TRUE)

rxGetInfo(data = "airlineHWS.xdf", getVarInfo = TRUE, numRows = 10)
Chenwei's approach is OK, but there is just one problem: the data is temporarily held in memory as a data frame via the odbcDS object. If we have a huge table in Hive, then we are in trouble.
I would suggest keeping everything on disk by using external tables in Hive and then using the backend data directly in Revolution R.
Something along these lines:
Create an external table from the existing Hive table in a text file format (CSV, tab-delimited, etc.):
CREATE EXTERNAL TABLE ext_table
LIKE your_original_table_name
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/your/hdfs/location';
Here we are creating an external table that is stored as CSV files in HDFS.
Next, copy the original table into the external table using the INSERT OVERWRITE command:
insert overwrite table ext_table select * from your_original_table_name
If we want to check the backend data on HDFS, type:
hadoop fs -ls /your/hdfs/location/
We can see the part files stored at that location. Go ahead and cat them to be doubly sure.
Now we can use the RxTextData function to read the data from the step above:
hive_data <- RxTextData(file='/your/hdfs/location/', delimiter = ',')
Now you can create an xdf file from hive_data (for example, by using it as the input to rxImport) for more efficient further processing; above all, the data has never touched memory.
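For example, that last step could look something like this (a sketch only; the paths are placeholders and should be adjusted to your cluster):
library(RevoScaleR)
hdfsFS <- RxHdfsFileSystem()
# the CSV part files written by the Hive INSERT OVERWRITE above
hive_data <- RxTextData(file = "/your/hdfs/location/", delimiter = ",",
                        fileSystem = hdfsFS)
# target composite xdf on HDFS (placeholder path)
xdf_out <- RxXdfData("/your/hdfs/location_xdf", fileSystem = hdfsFS)
rxImport(inData = hive_data, outFile = xdf_out, overwrite = TRUE)
rxGetInfo(xdf_out, getVarInfo = TRUE, numRows = 5)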