How to store a .pdf file in SQLite using R programming

I am creating an application in R Shiny where I take inputs from users and store them in an SQLite database on the backend. My concern is that my form has one file-upload input, which accepts files like .pdf, .jpeg, and .png.
If a user uploads a file with it, I want that file to be stored in my SQLite database table for further use, but I am not aware of how to achieve this using R.
Any help would be appreciated.

You can store objects (any R output that is not tabular, a model's output for example) as BLOBs in SQLite. In R, use serialize/unserialize for this, but first you need to read the raw PDF with readBin. Here is an example:
path <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
# see the PDF
browseURL(path)
# Read raw PDF
pdf <- readBin(con = path, what = raw(), n = file.info(path)$size)
library(RSQLite)
# Connect to DB
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# Serialize raw pdf
pdf_serialized <- serialize(pdf, NULL)
# Create a data.frame
df <- data.frame(pdfs = I(list(pdf_serialized)))
# Write the table in database
dbWriteTable(conn = con, name = "mytable", value = df, overwrite = TRUE)
# Read your table
out <- dbReadTable(conn = con, name = "mytable")
# unserialize
pdf_out <- unserialize(out$pdfs[[1]])
# Write the PDF in a temporary file
tmp <- tempfile(fileext = ".pdf")
writeBin(object = pdf_out, con = tmp)
# open it
browseURL(tmp)
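To tie this back to the Shiny upload in the question, here is a minimal sketch of a small app (the input id "upload", the table name "uploads" and the database file name are illustrative assumptions, not from the question). It reads the uploaded file from input$upload$datapath and stores it the same way as above:
library(shiny)
library(RSQLite)
ui <- fluidPage(
  fileInput("upload", "Upload a file (.pdf, .jpeg, .png)"),
  textOutput("status")
)
server <- function(input, output, session) {
  # illustrative database file name
  con <- dbConnect(RSQLite::SQLite(), "uploads.sqlite")
  onStop(function() dbDisconnect(con))
  output$status <- renderText({
    req(input$upload)
    # input$upload$datapath is the temporary path of the uploaded file
    raw_file <- readBin(input$upload$datapath, what = raw(),
                        n = file.info(input$upload$datapath)$size)
    df <- data.frame(name = input$upload$name,
                     data = I(list(serialize(raw_file, NULL))))
    dbWriteTable(con, "uploads", df, append = TRUE)
    paste("Stored", input$upload$name, "in the database")
  })
}
shinyApp(ui, server)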

Similar to Victorp's answer, you can also base64 encode the data and store it as text in a database:
file <- "mypdf.pdf"
con <- dbConnect(...)
table <- "mypdf_table"
bin <- readBin(file, raw(), n = file.size(file))
enc_data <- base64enc::base64encode(bin)
dd <- data.frame(
  file = file,
  data = enc_data
)
DBI::dbWriteTable(con, table, dd, append = TRUE)
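To get the file back out, read the row, decode the text and write the bytes to disk again (a minimal sketch, assuming the table and column names used above):
row <- DBI::dbGetQuery(con, "SELECT data FROM mypdf_table LIMIT 1")
bin_out <- base64enc::base64decode(row$data[[1]])
tmp <- tempfile(fileext = ".pdf")
writeBin(bin_out, tmp)
browseURL(tmp)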

Related

Copy Multiple data frames to SQLite db in R

I have ~250 csv files I want to load into an SQLite db. I've loaded all the csv files into my global environment as data frames. I'm using the following function to copy all of them to the db, but I get Error: df must be local dataframe or a remote tbl_sql
library(DBI)
library(odbc)
library(rstudioapi)
library(tidyverse)
library(dbplyr)
library(RSQLite)
library(dm)
# Create DB Instance ---------------------------------------------
my_db <- dbConnect(RSQLite::SQLite(), "test_db.sqlite", create = TRUE)
# Load all csv files ---------------------------------------------
filenames <- list.files(pattern = ".*csv")
names <- substr(filenames, 1, nchar(filenames)-4)
for (i in names) {
  filepath <- file.path(paste(i, ".csv", sep = ""))
  assign(i, read.csv(filepath, sep = ","))
}
# Get list of data.frames ----------------------------------------
tables <- as.data.frame(sapply(mget(ls(), .GlobalEnv), is.data.frame))
colnames(tables) <- "is_data_frame"
tables <- tables %>%
  filter(is_data_frame == "TRUE")
table_list <- row.names(tables)
# Copy dataframes to db ------------------------------------------
for (j in table_list) {
  copy_to(my_db, j)
}
I have had mixed success using copy_to. I recommend the dbWriteTable command from the DBI package. Example code below:
DBI::dbWriteTable(
  db_connection,
  DBI::Id(
    catalog = db_name,
    schema = schema_name,
    table = table_name
  ),
  r_table_name
)
This would replace your copy_to call. The DBI::Id() (or just a string) names the table and r_table_name is the data frame object itself; the catalog and schema parts are likely optional and can probably be omitted.
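Applied to the loop in the question, a minimal sketch (reusing my_db and table_list from above) could look like this:
for (j in table_list) {
  # get() retrieves the data.frame object by name; copy_to() was failing
  # because it received only the name as a character string
  DBI::dbWriteTable(my_db, name = j, value = get(j, envir = .GlobalEnv),
                    overwrite = TRUE)
}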

Import multiple csv files into postgresql database using r (memory error)

I am trying to import a dataset (with many csv files) into R and afterwards write the data into a table in a PostgreSQL database.
I successfully connected to the database, created a loop to import the csv files and tried to import them.
R then returns an error, because my PC runs out of memory.
My question is:
Is there a way to create a loop, which imports the files one after another, writes them into the postgresql table and deletes them afterwards?
That way I would not run out of memory.
Code which returns the memory error:
# Required packages
library(RPostgreSQL)
library(readr)
library(dplyr)
# Connect to PostgreSQL database
db_tankdata <- 'tankdaten'
host_db <- 'localhost'
db_port <- '5432'
db_user <- 'postgres'
db_password <- 'xxx'
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = db_tankdata, host = host_db,
                 port = db_port, user = db_user, password = db_password)
# Check if the connection was successful
dbExistsTable(con, "prices")
# Create a function to load multiple csv files
import_csvfiles <- function(path){
  files <- list.files(path, pattern = "*.csv", recursive = TRUE, full.names = TRUE)
  lapply(files, read_csv) %>% bind_rows() %>% as.data.frame()
}
# Import files
prices <- import_csvfiles("path...")
dbWriteTable(con, "prices", prices, append = TRUE, row.names = FALSE)
Thanks in advance for the feedback!
If you change the lapply() to include an anonymous function, you can read each file and write it to the database, reducing the amount of memory required. Since lapply() acts as an implied for() loop, you don't need an extra looping mechanism.
import_csvfiles <- function(path){
  files <- list.files(path, pattern = "*.csv", recursive = TRUE, full.names = TRUE)
  lapply(files, function(x){
    prices <- read.csv(x)
    dbWriteTable(con, "prices", prices, append = TRUE, row.names = FALSE)
  })
}
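A short usage sketch (the folder path is a placeholder); because each file is written as soon as it is read, only one file's rows are held in memory at a time:
import_csvfiles("path/to/csv/folder")
dbDisconnect(con)   # close the connection when done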
I assume the csv files you are importing into your database are very large? As far as I know, with the code you have written R first stores all the data in a data frame, i.e. in memory. The alternative is to read each CSV file in chunks, as you would with Python's pandas.
When calling ?read.csv I saw the following output:
nrows : the maximum number of rows to read in. Negative and other invalid values are ignored.
skip : the number of lines of the data file to skip before beginning to read data.
Why don't you try reading 5000 rows at a time into a data frame, writing them to the PostgreSQL database, and then repeating for each file?
For example, for each file do the following:
number_of_lines <- 5000   # number of data rows to read at a time
row_skip <- 0             # number of data rows already read
header <- names(read.csv(x, nrows = 1))   # read the column names once
keep_reading <- TRUE      # we will change this value to stop the while loop
while (keep_reading) {
  # skip the header line plus the rows already written, and reuse the
  # original column names so every chunk has the same structure
  my_data <- read.csv(x, header = FALSE, col.names = header,
                      nrows = number_of_lines, skip = row_skip + 1)
  dbWriteTable(con, "prices", my_data, append = TRUE, row.names = FALSE) # write to the DB
  row_skip <- row_skip + nrow(my_data)
  # exit: the last chunk is shorter than a full chunk
  if (nrow(my_data) < number_of_lines) {
    keep_reading <- FALSE
  }
}
By doing this you break the csv up into smaller parts. You can play around with the number_of_lines variable to reduce the number of iterations. It may seem a bit hacky with a loop involved, but I'm sure it will work.
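To apply this to every file, here is a sketch that wraps the chunk loop in a helper (the function name is made up; the tryCatch guards against reading past the end of a file whose row count is an exact multiple of the chunk size):
write_csv_in_chunks <- function(x, con, chunk = 5000) {
  header <- names(read.csv(x, nrows = 1))
  skipped <- 0
  repeat {
    my_data <- tryCatch(
      read.csv(x, header = FALSE, col.names = header,
               nrows = chunk, skip = skipped + 1),
      error = function(e) data.frame()   # no lines left to read
    )
    if (nrow(my_data) == 0) break
    dbWriteTable(con, "prices", my_data, append = TRUE, row.names = FALSE)
    skipped <- skipped + nrow(my_data)
    if (nrow(my_data) < chunk) break
  }
}
# one call per file; files is the vector of csv paths from list.files() above
lapply(files, write_csv_in_chunks, con = con)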

How do I get the file path of a file saved using write.xlsx or another function in R?

I am creating two data frames and one graph in RStudio. I wrote code to transfer them to an Excel file on different sheets, but each time I have to choose the file path using file.choose(). Is it possible to assign the file path to a variable when saving the file for the first time? If such a method exists, how can it be done?
I would also welcome comments on how to export my data frames to an Excel file more easily. My code is below.
Thank you, everyone.
library(openxlsx)
dataframe1 <- data.frame("A" = 1, "B" = 2)
dataframe2 <- data.frame("C" = 3, "D" = 4)
list_of_datasets <- list("Name of DataSheet1" = dataframe1, "Name of Datasheet2" = dataframe2)
write.xlsx(list_of_datasets, file = "writeXLSX2.xlsx")
dflist <- list("Sonuçlar" = yazılacakdosya0, "Frame" = dtf, "Grafik" = "")
edc <- write.xlsx(dflist, file.choose(new = TRUE), colNames = TRUE,
                  borders = "surrounding",
                  firstRow = TRUE,
                  headerStyle = hs)
require(ggplot2)
q1 <- qplot(hist(yazılacakdosya0$Puan))
print(q1)
insertPlot(wb = edc, sheet = "Grafik")
saveWorkbook(edc, file = file.choose(), overwrite = TRUE)
Just save the file path before you call saveWorkbook:
file = file.choose()
saveWorkbook(edc, file = file, overwrite = T)
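Applied to the question's code, a sketch that chooses the path once and reuses it for both calls (dflist, hs and q1 come from the question's code):
out_path <- file.choose(new = TRUE)
edc <- write.xlsx(dflist, out_path, colNames = TRUE,
                  borders = "surrounding", firstRow = TRUE, headerStyle = hs)
print(q1)                               # the plot from the question's code
insertPlot(wb = edc, sheet = "Grafik")  # inserts the currently displayed plot
saveWorkbook(edc, file = out_path, overwrite = TRUE)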

How can I use do.call() to read specific rows?

I'm using the do.call() command to read a list of csv files and combine all the data points into one table. I have been using the following:
files = list.files(path = "G:/SafeGraph201708MidWest",
                   pattern = "*.csv",
                   recursive = TRUE,
                   full.names = TRUE)
library(data.table)
DT = do.call(rbind, lapply(files, fread))
Instead of reading all the rows in each file, I want to read only the ones within this range:
Data <- filter(DT, longitude >= -86.97 & longitude <= -86.78,
               latitude >= 40.35 & latitude <= 40.49)
Is there a way I can do that using do.call()? Looking forward to your reply. Thank you!
There are several strategies to tackle this. You can import all the data into a list using lapply, filter each list element, and then use data.table::rbindlist to make the final data.table. Another option is to do it in one step, e.g. (not tested, obviously):
library(data.table)
library(dplyr)   # for filter()
files = list.files(path = "G:/SafeGraph201708MidWest",
                   pattern = "*.csv",
                   recursive = TRUE,
                   full.names = TRUE)
xy <- lapply(files, FUN = function(x) {
  out <- fread(x)
  out <- filter(out, longitude >= -86.97 & longitude <= -86.78,
                latitude >= 40.35 & latitude <= 40.49)
  out
})
xy <- rbindlist(xy)
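The same filter can also be written with data.table's own subsetting inside the lapply(), so dplyr is not needed (a sketch, assuming the same column names):
xy <- rbindlist(lapply(files, function(x) {
  out <- fread(x)
  out[longitude >= -86.97 & longitude <= -86.78 &
        latitude >= 40.35 & latitude <= 40.49]
}))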
Assuming you use a Windows PC and have Microsoft Office 2007+ installed, consider directly querying the CSV with the JET/ACE SQL engine (.dll files), which is the very engine of MS Access.
Below are two connection strings, one using Access and one using Excel. Either version works; the referenced files do need to exist, but they are never used except for connecting to ACE. Once connected, the CSV files are queried from the same or a different path.
library(odbc)
# VERIFY AVAILABLE DSNs AND DRIVERS
odbcListDataSources()
# DSN VERSIONS
conn <- dbConnect(odbc::odbc(), DSN = "MS Access Database;DBQ=C:\\Path\\To\\Access.accdb;")
conn <- dbConnect(odbc::odbc(), DSN = "Excel Files;DBQ=C:\\Path\\To\\Excel.xlsx;")
# DRIVER VERSIONS
conn <- dbConnect(odbc::odbc(),
                  .connection_string = "Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:\\Path\\To\\Access.accdb;")
conn <- dbConnect(odbc::odbc(),
                  .connection_string = "Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};DBQ=C:\\Path\\To\\Excel.xlsx;")
# CSV QUERY
df <- dbGetQuery(conn, "SELECT t.*
                        FROM [text;database=C:\\Path\\To\\CSV_Folder].Name_of_File.csv AS t
                        WHERE t.longitude BETWEEN -86.97 AND -86.78
                          AND t.latitude BETWEEN 40.35 AND 40.49;")
head(df)
dbDisconnect(conn)
And in a loop:
library(data.table)   # for rbindlist()
files = list.files(path = "G:/SafeGraph201708MidWest",
                   pattern = "*.csv",
                   recursive = TRUE,
                   full.names = TRUE)
df_list <- lapply(files, function(f)
  dbGetQuery(conn,
             paste0("SELECT t.* ",
                    # basename() keeps only the file name; the folder is given in the database= part
                    " FROM [text;database=G:\\SafeGraph201708MidWest].", basename(f), " AS t ",
                    " WHERE t.longitude BETWEEN -86.97 AND -86.78",
                    " AND t.latitude BETWEEN 40.35 AND 40.49;"))
)
final_dt <- rbindlist(df_list)
You can use data.table::fread()'s ability to execute a command and 'read' the results.
I assume you are using Windows, so you have access to the findstr function in your command prompt.
So, if you can build a regex that 'hits' on the lines you want to extract, you can filter the wanted lines before reading the entire file into R. This is (potentially) a huge memory saver on larger files, and may speed up your workflow considerably.
sample data
let's say coords.csv looks like this:
id,latitude,longitude
1,10,11
2,11,12
3,12,13
4,13,14
5,14,15
In this example, you want to extract lines with latitudes between 12 and 14 AND longitudes between 11 and 13.
code
library(data.table)
# Build a list of files (I created only one)
# Make sure you use the full path (not relative)
x <- list.files( path = "C:/folder", pattern = "coord.csv", full.names = TRUE )
# Build a regex that only hits on rows with:
#   latitude 12-14
#   longitude 11-13
pattern = "^[0-9],1[2-4],1[1-3]$"
# Read the file(s), extract the lines that match the regex pattern
# and bind the result to a data.table
rbindlist( lapply( x, function(x) {
  fread( cmd = paste0( "findstr /R ", pattern, " ", x ), header = FALSE )
} ) )
output
   V1 V2 V3
1:  3 12 13
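On Linux or macOS the same idea works with grep instead of findstr (a sketch, reusing the x and pattern objects from above):
rbindlist( lapply( x, function(x) {
  fread( cmd = paste0( "grep -E '", pattern, "' ", x ), header = FALSE )
} ) )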

How to read a table from filename received as argument in R?

I was trying to read a table in R using an argument received from a PHP program, like:
echo exec("Rscript /var/www/html/genome/coex.R $db");
In R,
args = commandArgs(trailingOnly=FALSE)
db <- args[1]
db <-paste(db, "txt", sep=".")
dat <- as.matrix(read.table("/var/www/html/genome/coex_db/db", header = TRUE, fill = TRUE))
I am not sure how the filename for the db can be used to get the table content into dat.
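One way to do this (a minimal sketch, not from the original thread): use trailingOnly = TRUE so the first element of args is the value passed after the script name, then build the path with paste0() and file.path() instead of hard-coding "db" inside the string:
args <- commandArgs(trailingOnly = TRUE)
db <- args[1]                     # the value of $db passed from PHP
filename <- paste0(db, ".txt")
path <- file.path("/var/www/html/genome/coex_db", filename)
dat <- as.matrix(read.table(path, header = TRUE, fill = TRUE))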
