RPostgreSQL loading multiple CSV files into a PostgreSQL table - r

I'm new to PostgreSQL, and I'm having trouble populating a table I created with multiple *.csv files. I was working first in pgAdmin4, then I decided to switch to RPostgreSQL, as R is my main language.
Anyway, I am dealing (for now) with 30 csv files located in one folder. All have the same headers and general structure, for instance:
Y:/Clickstream/test1/video-2016-04-01_PARSED.csv
Y:/Clickstream/test1/video-2016-04-02_PARSED.csv
Y:/Clickstream/test1/video-2016-04-03_PARSED.csv
... and so on.
I tried to load all the csv files by following the RPostgreSQL-specific answer from Parfait. Sadly, it didn't work. My code is below:
library(RPostgreSQL)

dir = list.dirs(path = "Y:/Clickstream/test1")
num = (length(dir))

psql.connection <- dbConnect(PostgreSQL(),
                             dbname = "coursera",
                             host = "127.0.0.1",
                             user = "postgres",
                             password = "xxxx")

for (d in dir){
  filenames <- list.files(d)
  for (f in filenames){
    csvfile <- paste0(d, '/', f)
    # IMPORT USING COPY COMMAND
    sql <- paste("COPY citl.courses FROM '", csvfile, "' DELIMITER ',' CSV ;")
    dbSendQuery(psql.connection, sql)
  }
}

# CLOSE CONNECTION
dbDisconnect(psql.connection)
I'm not understanding the error I got:
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: could not open file
" Y:/Clickstream/test1/video-2016-04-01_PARSED.csv " for reading: Invalid
argument
)
If I'm understanding correctly, there is an invalid argument in the name of my first file. I'm not very sure about that, but again, I have only recently started using PostgreSQL and RPostgreSQL in R. Any help will be much appreciated.
Thanks in advance!
Edit: I found the problem, but for some reason I cannot solve it. When I build the statement inside the for loop:
# IMPORT USING COPY COMMAND
sql <- paste("COPY citl.courses FROM '",csvfile,"' DELIMITER ',' CSV ;")
I have the following result:
sql
[1] "COPY citl.courses FROM ' Y:/Clickstream/test1/video-2016-04-01_PARSED.csv ' DELIMITER ',' CSV ;"
This means that the invalid argument is the blank space on each side of the file path. I've tried to remove it without success. Any help will be deeply appreciated!

Try something like this
Files <- list.files("Y:/Clickstream/test1", pattern = "*.csv", full.names = TRUE)
CSVs <- lapply(Files, read.csv)

psql.connection <- dbConnect(PostgreSQL(),
                             dbname = "coursera",
                             host = "127.0.0.1",
                             user = "postgres",
                             password = "xxxx")

for (i in 1:length(Files)){
  dbWriteTable(psql.connection
               # schema and table
               , c("citl", "courses")
               , CSVs[[i]]        # use [[ ]] so a data frame, not a one-element list, is passed
               , append = TRUE    # add rows to the bottom
               , row.names = FALSE
  )
}
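If you would rather keep the server-side COPY approach from the question, the stray spaces can be avoided by building the statement with paste0(), which concatenates without paste()'s default space separator. A minimal sketch, assuming the same connection, schema, and folder as above:
library(RPostgreSQL)

# assumes psql.connection is an open connection, as in the question
files <- list.files("Y:/Clickstream/test1", pattern = "*.csv",
                    recursive = TRUE, full.names = TRUE)

for (csvfile in files) {
  # paste0() inserts no separator, so the path is not padded with spaces
  sql <- paste0("COPY citl.courses FROM '", csvfile, "' DELIMITER ',' CSV;")
  dbSendQuery(psql.connection, sql)
}
Note that COPY ... FROM reads the file on the database server, so the path must be accessible to the server process; for files that only exist on the client, dbWriteTable() (as above) or psql's \copy is the alternative.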

Related

Import multiple csv files into postgresql database using r (memory error)

I am trying to import a dataset (with many csv files) into R and afterwards write the data into a table in a PostgreSQL database.
I successfully connected to the database, created a loop to import the csv files, and tried to import them.
R then returns an error because my PC runs out of memory.
My question is:
Is there a way to create a loop, which imports the files one after another, writes them into the postgresql table and deletes them afterwards?
That way I would not run out of memory.
Code which returns the memory error:
# connect to PostgreSQL database
db_tankdata <- 'tankdaten'
host_db <- 'localhost'
db_port <- '5432'
db_user <- 'postgres'
db_password <- 'xxx'

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = db_tankdata, host = host_db,
                 port = db_port, user = db_user, password = db_password)

# check whether the connection was successful
dbExistsTable(con, "prices")

# create a function to load multiple csv files
import_csvfiles <- function(path){
  files <- list.files(path, pattern = "*.csv", recursive = TRUE, full.names = TRUE)
  lapply(files, read_csv) %>% bind_rows() %>% as.data.frame()
}

# import files
prices <- import_csvfiles("path...")
dbWriteTable(con, "prices", prices, append = TRUE, row.names = FALSE)
Thanks in advance for the feedback!
If you change the lapply() to include an anonymous function, you can read each file and write it to the database, reducing the amount of memory required. Since lapply() acts as an implied for() loop, you don't need an extra looping mechanism.
import_csvfiles <- function(path){
  files <- list.files(path, pattern = "*.csv", recursive = TRUE, full.names = TRUE)
  lapply(files, function(x){
    prices <- read.csv(x)
    dbWriteTable(con, "prices", prices, append = TRUE, row.names = FALSE)
  })
}
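A usage sketch of the same idea, with the connection passed in explicitly so the function does not rely on con from the enclosing environment, and with each file's data freed before the next one is read (the path placeholder is the one from the question):
# hypothetical variant that takes the connection and table name as arguments
import_csvfiles <- function(path, con, table = "prices") {
  files <- list.files(path, pattern = "*.csv", recursive = TRUE, full.names = TRUE)
  for (f in files) {
    chunk <- read.csv(f)
    dbWriteTable(con, table, chunk, append = TRUE, row.names = FALSE)
    rm(chunk)
    gc()   # release the memory used by this file before reading the next one
  }
  invisible(files)
}

import_csvfiles("path...", con)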
I assume the csv files you are importing to your database are very large? As far as I know, with the code you have written R first wants to store all the data in one data frame, holding everything in memory. The alternative is to read each CSV file in chunks, as you would with Python's pandas.
When calling ?read.csv I saw the following output:
nrows : the maximum number of rows to read in. Negative and other invalid values are ignored.
skip : the number of lines of the data file to skip before beginning to read data.
Why don't you try reading 5000 rows at a time into the data frame, writing them to the PostgreSQL database, and then repeating this for each file?
For example, for each file do the following:
number_of_lines = 5000   # Number of data rows to read at a time
row_skip = 0             # Number of data rows already read (and therefore to skip)
keep_reading = TRUE      # We will change this value to stop the while loop
column_names = names(read.csv(x, nrows = 1))  # read the header once

while (keep_reading) {
  # The "1 +" skips the header line; header = FALSE stops a data row being re-read as a header
  my_data <- read.csv(x, nrows = number_of_lines, skip = 1 + row_skip,
                      header = FALSE, col.names = column_names)
  dbWriteTable(con, "prices", my_data, append = TRUE, row.names = FALSE) # Write to the DB
  row_skip = row_skip + number_of_lines
  # Exit statement: stop once a chunk comes back shorter than a full chunk
  if (nrow(my_data) < number_of_lines) {
    keep_reading = FALSE
  } # end-if
} # end-while
By doing this you are breaking the csv up into smaller parts. You can play around with the number_of_lines variable to reduce the number of loop iterations. This may seem a bit hacky with a loop involved, but I'm sure it will work.
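Since the question already uses readr, its chunked readers take care of the skip and header bookkeeping for you. A sketch, assuming con is the open connection from the question and the same prices table:
library(readr)

# read each file in 5000-row chunks and append every chunk to the prices table
write_chunk <- function(chunk, pos) {
  dbWriteTable(con, "prices", as.data.frame(chunk), append = TRUE, row.names = FALSE)
}

files <- list.files("path...", pattern = "*.csv", recursive = TRUE, full.names = TRUE)
for (f in files) {
  read_csv_chunked(f, callback = write_chunk, chunk_size = 5000)
}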

How to ignore delimiters inside quoted strings when importing a csv file with RSQLite?

I want to import a csv file that has a structure similar to the example below:
var1;var2;var3
"a";1;"Some text"
"b";0;"More text"
"c;0;"Delimiter in ; middle of the text"
Traditional parsers such as the one used by data.table::fread deal with that by default. I want to import this data to a SQLite database with RSQLite::dbWriteTable.
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "mydb.sqlite")
dbWriteTable(conn = con, name = "my_table", value = "data_file.csv")
There is no option in dbWriteTable to provide quotes and thus the function throws an error when the problematic line is found. How could I import this data? The only constraint I have is that I don't have enough memory to parse the data with R before importing into SQLite.
Install the csvfix utility, which is available on Windows and Linux, and then try this test code. It worked for me on Windows; you may need to adjust it slightly for other platforms, particularly the shell line and the eol= argument, which you may not need or may need a different value for. We use csvfix to remove the quotes and replace the semicolons that separate fields (but not those inside quoted text) with #. Then we use # as the separator when reading the file in.
First we create the test data.
# if (file.exists("mydb")) file.remove("mydb")
# if (file.exists("data_file2.csv")) file.remove("data_file2.csv")
# write out test file
cat('var1;var2;var3\n"a";1;"Some text"\n"b";0;"More text"\n"c";0;"Delimiter in ; middle of the text"', file = "data_file.csv")
# create database (can omit if it exists)
cat(file = "mydb")
csvfix
Now process data file with csvfix
library(RSQLite)
# preprocess file using csvfix - modify next line as needed depending on platform
shell("csvfix write_dsv -sep ; -s # data_file.csv > data_file2.csv")
file.show("data_file2.csv") # omit this line for real data
# write file to database
con <- dbConnect(SQLite(), "mydb")
dbWriteTable(con, "myFile", "data_file2.csv", sep = "#", eol = "\r\n")
dbGetQuery(con, "select * from myFile") # omit this line for real data
dbDisconnect(con)
xsv
Alternatively, install the xsv (releases) Rust utility. This worked for me on Windows.
library(RSQLite)
shell("xsv fmt -d ; -t # data_file.csv > data_file2.csv")
file.show("data_file2.csv") # omit this line for real data
# write file to database
con <- dbConnect(SQLite(), "mydb")
dbWriteTable(con, "myFile", "data_file2.csv", sep = "#")
dbGetQuery(con, "select * from myFile") # omit this line for real data
dbDisconnect(con)

Looping through files using dynamic name variable in R

I have a large number of files to import which are all saved as zip files.
From reading other posts it seems I need to pass the zip file name and then the name of the file I want to open. Since I have a lot of them I thought I could loop through all the files and import them one by one.
Is there a way to pass the name dynamically or is there an easier way to do this?
Here is what I have so far:
Temp_Data <- NULL
Master_Data <- NULL
file.names <- c("f1.zip", "f2.zip", "f3.zip", "f4.zip", "f5.zip")
for (i in 1:length(file.names)) {
  zipFile <- file.names[i]
  dataFile <- sub(".zip", ".csv", zipFile)
  Temp_Data <- read.table(unz(zipFile, dataFile), sep = ",")
  Master_Data <- rbind(Master_Data, Temp_Data)
}
I get the following error:
In open.connection(file, "rt") :
I can import them manually using:
dt <- read.table(unz("D:/f1.zip", "f1.csv"), sep = ",")
I can create the string dynamically, but it feels long-winded - and it doesn't work when I wrap it with read.table(unz(...)). It seems it can't find the file name and so throws an error.
cat(paste(toString(shQuote(paste("D:/",zipFile, sep = ""))),",",
toString(shQuote(dataFile)), sep = ""), "\n")
But if I then print this to the console I get:
"D:/f1.zip","f1.csv"
I can then paste this into read.table(unz(...)) and it works, so I feel like I am close.
I've tagged data.table since it is what I almost always use, so if this can be done with fread that would be great.
Any help is appreciated
You can use the list.files command here.
First, set your working directory to the folder where all your files are stored:
setwd("C:/Users/...")
Then:
file.names = list.files(pattern = "*.zip", recursive = F)
Then your for loop will be:
for (i in 1:length(file.names)) {
  # open the files
  zipFile <- file.names[i]
  dataFile <- sub(".zip", ".csv", zipFile)
  Temp_Data <- read.table(unz(zipFile, dataFile), sep = ",")
  # your function for the opened file
  Master_Data <- rbind(Master_Data, Temp_Data)
  # finally write the file (write_delim() is from the readr package)
  write_delim(x = Master_Data, path = paste(file.names[[i]]), delim = "\t",
              col_names = T)
}
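Since the question mentions data.table, here is a sketch of the same import with fread(); it assumes the unzip command-line utility is available on the PATH, which is what fread()'s cmd= argument shells out to:
library(data.table)

file.names <- list.files("D:/", pattern = "*.zip", full.names = TRUE)

# read the csv inside each zip via `unzip -p` and stack everything into one table
Master_Data <- rbindlist(lapply(file.names, function(zipFile) {
  fread(cmd = paste("unzip -p", shQuote(zipFile)))
}))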

ODBC Connection error for merging files in R

I am trying to read Excel files using the odbcConnectExcel2007 function from the RODBC package in R. Reading an individual file works, but when I try to run it in a for loop, it throws the following error:
3 stop(sQuote(tablename), ": table not found on channel")
2 odbcTableExists(channel, sqtable)
1 sqlFetch(conn1, sqlTables(conn1)$TABLE_NAME[1])
Below is the code:
file_list <- list.files("./Raw Data")
file_list

for (i in 1:length(file_list)){
  conn1 = odbcConnectExcel2007(paste0("./Raw Data/", file_list[i])) # open a connection to the Excel file
  sqlTables(conn1)$TABLE_NAME
  data = sqlFetch(conn1, sqlTables(conn1)$TABLE_NAME[1])
  close(conn1)
  data <- data[, c("Branch", "Custome", "Category", "Sub Category", "SKU",
                   "Weight", "Order Type", "Invoice Date")]
  if (i == 1) alldata = data else {
    alldata = rbind(alldata, data)
  }
}
I would appreciate any kind of help.
Thanks in advance.
I think it's getting messed up because the table name returned by sqlTables(conn1)$TABLE_NAME contains quotes. Try removing the quotes from the table name. Something like this:
table <- sqlTables(conn1)$TABLE_NAME
table <- noquote(table)
table <- gsub("\'", "", table)
And then just do:
data=sqlFetch(conn1, table)
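Put together inside the loop from the question, that would look roughly like this (a sketch; the column selection is omitted for brevity):
library(RODBC)

file_list <- list.files("./Raw Data")
for (i in 1:length(file_list)) {
  conn1 <- odbcConnectExcel2007(paste0("./Raw Data/", file_list[i]))
  # strip any quotes from the first sheet/table name before fetching it
  table <- gsub("'", "", sqlTables(conn1)$TABLE_NAME[1])
  data <- sqlFetch(conn1, table)
  close(conn1)
  if (i == 1) alldata <- data else alldata <- rbind(alldata, data)
}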

How to import a CSV into SQLite in R when one of the variables has a comma (,) within quotes?

This is driving me mad.
I have a csv file "hello.csv"
a,b
"drivingme,mad",1
I just want to convert this into an SQLite database from within R (I need to do this because the actual file is 10G and won't fit into a data.frame, so I will use SQLite as an intermediate datastore).
dbWriteTable(conn = dbConnect(SQLite(),
                              dbname = "c:/temp/data.sqlite3"),
             name = "data",
             value = "c:/temp/hello.csv",
             row.names = FALSE, header = TRUE)
The above code failed with error
Error in try({ :
RS-DBI driver: (RS_sqlite_import: c:/temp/hello.csv line 2 expected 2 columns of data but found 3)
In addition: Warning message:
In read.table(fn, sep = sep, header = header, skip = skip, nrows = nrows, :
incomplete final line found by readTableHeader on 'c:/temp/hello.csv'
How do I tell it that a comma (,) inside quotes ("") should be treated as part of the string and not as a separator?
I tried adding in the argument
quote="\""
But it didn't work. Help!! read.csv works just fine; it only fails when reading large files.
Update
A much better way now is to use readr's chunked functions, e.g.
#setting up sqlite
con_data = dbConnect(SQLite(), dbname="yoursqlitefile")
readr::read_delim_chunked(file, function(chunk) {
dbWriteTable(con_data, chunk, name="data", append=TRUE )) #write to sqlite
})
Original, more cumbersome way
One way to do this is to read from the file in pieces, since read.csv itself works; it just cannot load all the data into memory at once.
n = 100000 # experiment with this number
f = file(csv)
open(f) # open a connection to the file
data <- read.csv(f, nrows = n, header = TRUE)
var.names = names(data)

# setting up sqlite
con_data = dbConnect(SQLite(), dbname = "yoursqlitefile")

while (nrow(data) == n) { # while a full chunk was read, i.e. not yet at the end of the file
  dbWriteTable(con_data, data, name = "data", append = TRUE) # write to sqlite
  data <- read.csv(f, nrows = n, header = FALSE)
  names(data) <- var.names
}
close(f)

if (nrow(data) != 0) {
  dbWriteTable(con_data, data, name = "data", append = TRUE)
}
Improving the proposed answer:
data_full_path <- paste0(data_folder, data_file)

con_data <- dbConnect(SQLite(),
                      dbname = ":memory:") # you can also store in a .sqlite file if you prefer

readr::read_delim_chunked(file = data_full_path,
                          callback = function(chunk,
                                              dummyVar # https://stackoverflow.com/a/42826461/9071968
                                              ) {
                            dbWriteTable(con_data, chunk, name = "data", append = TRUE) # write to sqlite
                          },
                          delim = ";",
                          quote = "\"")
(The other, current answer with readr does not work: parentheses are not balanced, the chunk function requires two parameters, see https://stackoverflow.com/a/42826461/9071968)
You could write a parser to handle it yourself.
string = yourline[i];
if (string.contains(",")) string = string.replace(",", "%40");
yourline[i] = string;
or something of that nature. You could also use:
string.split(",");
and rebuild your string that way. That's how I would do it.
Keep in mind that you'll have to "de-parse" it when you want to get the values back. Commas in SQL separate columns, so they can really screw things up, not to mention JSONArrays or JSONObjects.
Also keep in mind that this might be very costly for 10 GB of data, so you might want to start by parsing the input before it even gets to the CSV, if possible.
