I am trying to load a largish CSV file into an SQLite database using the RSQLite package (I have also tried the sqldf package). The file contains all UK postcodes and a variety of lookup values for them.
I wanted to avoid loading it into R and just directly load it into the database. Whilst this is not strictly necessary for this task, I want to do so in order to have the technique ready for larger files which won't fit in memory should I have to handle them in the future.
Unfortunately the CSV is provided with the values in double quotes, and the dbWriteTable function doesn't seem able to strip or ignore them. Here is the download location of the file: http://ons.maps.arcgis.com/home/item.html?id=3548d835cff740de83b527429fe23ee0
Here is my code:
# Load library
library("RSQLite")
# Create a temporary directory
tmpdir <- tempdir()
# Set the file name
file <- "data\\ONSPD_MAY_2017_UK.zip"
# Unzip the ONS Postcode Data file
unzip(file, exdir = tmpdir )
# Create a path pointing at the unzipped csv file
ONSPD_path <- paste0(tmpdir,"\\ONSPD_MAY_2017_UK.csv")
# Create a SQL Lite database connection
db_connection <- dbConnect(SQLite(), dbname="ons_lkp_db")
# Now load the data into our SQL lite database
dbWriteTable(conn = db_connection,
             name = "ONS_PD",
             value = ONSPD_path,
             row.names = FALSE,
             header = TRUE,
             overwrite = TRUE)
# Check the data upload
dbListTables(db_connection)
dbGetQuery(db_connection,"SELECT pcd, pcd2, pcds from ONS_PD LIMIT 20")
Having hit this issue, I found a reference tutorial (https://www.r-bloggers.com/r-and-sqlite-part-1/) which recommended using the sqldf package, but unfortunately when I try the relevant function in sqldf (read.csv.sql) I get the same issue with double quotes.
This feels like a fairly common issue when importing CSV files into an SQL system; most import tools can handle double quotes, so I'm surprised to be hitting a problem here (unless I've missed an obvious help file somewhere along the way).
EDIT 1
Here is some example data from my csv file in the form of a dput output of the SQL table:
structure(list(pcd = c("\"AB1 0AA\"", "\"AB1 0AB\"", "\"AB1 0AD\"",
"\"AB1 0AE\"", "\"AB1 0AF\""), pcd2 = c("\"AB1 0AA\"", "\"AB1 0AB\"",
"\"AB1 0AD\"", "\"AB1 0AE\"", "\"AB1 0AF\""), pcds = c("\"AB1 0AA\"",
"\"AB1 0AB\"", "\"AB1 0AD\"", "\"AB1 0AE\"", "\"AB1 0AF\"")), .Names = c("pcd",
"pcd2", "pcds"), class = "data.frame", row.names = c(NA, -5L))
EDIT 2
Here is my attempt using the filter argument of sqldf's read.csv.sql function (note that Windows users will need Rtools installed for this). Unfortunately this still doesn't remove the quotes from my data, although it does mysteriously remove all the spaces.
library("sqldf")
sqldf("attach 'ons_lkp_db' as new")
db_connection <- dbConnect(SQLite(), dbname="ons_lkp_db")
read.csv.sql(ONSPD_path,
sql = "CREATE TABLE ONS_PD AS SELECT * FROM file",
dbname = "ons_lkp_db",
filter = 'tr.exe -d ^"'
)
dbGetQuery(db_connection,"SELECT pcd, pcd2, pcds from ONS_PD LIMIT 5")
Also, thanks for the close vote from whoever felt this wasn't a programming question in the scope of Stack Overflow(?!).
The CSV importer in the RSQLite package is derived from the sqlite3 shell, which itself doesn't seem to offer support for quoted values when importing CSV files (How to import load a .sql or .csv file into SQLite?, doc). You could use readr::read_delim_chunked():
callback <- function(data, pos) {
  name <- "ONS_PD"
  exists <- dbExistsTable(db_connection, name)
  dbWriteTable(db_connection, name, data, append = exists)
}
readr::read_delim_chunked(ONSPD_path, callback, ...)
Substitute ... with any extra arguments you need for your CSV file.
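For a concrete illustration, here is a self-contained sketch of that approach; the tiny temporary file and the in-memory database are stand-ins for ONSPD_path and ons_lkp_db:

```r
library(readr)
library(RSQLite)

# Stand-ins for the real CSV and database
path <- tempfile(fileext = ".csv")
writeLines(c('pcd,pcds', '"AB1 0AA","AB1 0AA"', '"AB1 0AB","AB1 0AB"'), path)
con <- dbConnect(SQLite(), ":memory:")

# readr parses the double quotes properly; each chunk is appended in turn
callback <- function(data, pos) {
  exists <- dbExistsTable(con, "ONS_PD")
  dbWriteTable(con, "ONS_PD", as.data.frame(data), append = exists)
}
read_delim_chunked(path, callback, delim = ",", chunk_size = 1)

res <- dbGetQuery(con, "SELECT pcd, pcds FROM ONS_PD")
res  # the quotes are gone
```

A small chunk_size is used here only to force several chunks; in practice you would leave it at the default or larger.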
Use read.csv.sql from the sqldf package with the filter argument and provide any utility which strips out double quotes or which translates them to spaces.
The question does not provide a fully reproducible minimal example but I have provided one below. If you are using read.csv.sql in order to pick out a subset of rows or columns then just add the appropriate sql argument to do so.
First set up the test input data and then try any of the one-line solutions shown below. Assuming Windows, ensure that the tr utility (found in R's Rtools distribution), the third-party csvfix utility, or the trquote2space.vbs vbscript utility (see Note at end) is on your path:
library(sqldf)
cat('a,b\n"1","2"\n', file = "tmp.csv")
# 1 - corrected from FAQ
read.csv.sql("tmp.csv", filter = "tr.exe -d '^\"'")
# 2 - similar but does not require Windows cmd quoting
read.csv.sql("tmp.csv", filter = "tr -d \\42")
# 3 - using csvfix utility (which must be installed first)
read.csv.sql("tmp.csv", filter = "csvfix echo -smq")
# 4 - using trquote2space.vbs utility as per Note at end
read.csv.sql("tmp.csv", filter = "cscript /nologo trquote2space.vbs")
any of which give:
a b
1 1 2
You could also use any other language or utility that is appropriate. For example, your PowerShell suggestion could be used, although I suspect that dedicated utilities such as tr and csvfix would run faster.
The first solution above is corrected from the FAQ. (It did work at the time the FAQ was written many years back, but testing it now on Windows 10 it seems to require the indicated change; or possibly the markdown did not survive intact in the move from Google Code, where it was originally located, to GitHub, which uses a slightly different markdown flavor.)
For Linux, tr is available natively although quoting differs from Windows and can even depend on the shell. csvfix is available on Linux too but would have to be installed. The csvfix example shown above would work identically on Windows and Linux. vbscript is obviously specific to Windows.
Note: sqldf comes with a mini-tr utility written in vbscript. If you change the relevant lines to:
Dim sSearch : sSearch = chr(34)
Dim sReplace : sReplace = " "
and change the name to trquote2space.vbs then you will have a Windows specific utility to change double quotes to spaces.
Honestly, I could not find anything to solve this problem. The sqldf documentation says:
"so, one limitation with .csv files is that quotes are not regarded as special within files so a comma within a data field such as "Smith, James" would be regarded as a field delimiter and the quotes would be entered as part of the data which probably is not what is intended"
So it looks like there is no solution as far as I know.
One possible suboptimal approach (other than the obvious find-and-replace in a text editor) is to use SQL commands like this (dbExecute, rather than dbSendQuery, is the right RSQLite call for statements that return no rows):
dbExecute(db_connection, "UPDATE ONS_PD SET pcd = REPLACE(pcd, '\"', '')")
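To show that cleanup end to end, here is a minimal self-contained sketch; the table and column names are illustrative, and an in-memory database stands in for the real one:

```r
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "ONS_PD",
             data.frame(pcd  = c('"AB1 0AA"', '"AB1 0AB"'),
                        pcds = c('"AB1 0AA"', '"AB1 0AB"'),
                        stringsAsFactors = FALSE))

# Strip the embedded double quotes from each affected column
for (col in c("pcd", "pcds")) {
  dbExecute(con, sprintf("UPDATE ONS_PD SET %s = REPLACE(%s, '\"', '')", col, col))
}

cleaned <- dbGetQuery(con, "SELECT pcd, pcds FROM ONS_PD")
```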
Related
I'm using the sqldf package to import CSV files into R and then produce statistics based on the data inserted into the created data frames. We have a shared lab environment with a lot of users, which means that we all share the available RAM on the same server. Although there is a high capacity of RAM available, given the number of users who are often simultaneously connected, the administrator of the environment recommends importing our files into some database (PostgreSQL, SQLite, etc.) instead of loading everything into memory.
I was checking the documentation of the sqldf package, and the read.csv.sql function caught my attention. Here is what we can read in the documentation:
Reads the indicated file into an sql database creating the database if
it does not already exist. Then it applies the sql statement returning
the result as a data frame. If the database did not exist prior to
this statement it is removed.
However, what I don't understand is whether the returned data frame will be in memory (therefore in the RAM of the server) or, like the imported CSV file, in the specified database. If it is in memory, it doesn't meet my requirement of reducing the use of the shared RAM as much as possible, given the huge size of my CSV files (tens of gigabytes, often more than 100,000,000 lines each).
Curious to see how this works, I wrote the following program
df_data <- suppressWarnings(read.csv.sql(
  file = "X:/logs/data.csv",
  sql = "
    select
      nullif(timestamp, '') as timestamp_value,
      nullif(user_account, '') as user_account,
      nullif(country_code, '') as country_code,
      nullif(prefix_value, '') as prefix_value,
      nullif(user_query, '') as user_query,
      nullif(returned_code, '') as returned_code,
      nullif(execution_time, '') as execution_time,
      nullif(output_format, '') as output_format
    from
      file
  ",
  header = FALSE,
  sep = "|",
  eol = "\n",
  field.types = list(
    timestamp_value = "TEXT",
    user_account = "TEXT",
    country_code = "TEXT",
    prefix_value = "TEXT",
    user_query = "TEXT",
    returned_code = "TEXT",
    execution_time = "REAL",
    output_format = "TEXT"
  ),
  dbname = "X:/logs/sqlite_tmp.db",
  drv = "SQLite"
))
I ran the above program to import a big CSV file (almost 150,000,000 rows). It took around 30 minutes. During execution, as specified via the dbname parameter, I saw that an SQLite database file was created at X:/logs/sqlite_tmp.db. As the rows were imported, this file grew bigger and bigger, which indicated that the rows were indeed being inserted into the database file on disk and not into memory (the server's RAM). By the end of the import, the database file had reached 30 GB. As stated in the documentation, at the end of the import process this database was removed automatically. Yet after the created SQLite database was removed, I was still able to work with the result data frame (that is, df_data in the above code).
What I understand is that the returned data frame was in the RAM of the server; otherwise I wouldn't have been able to refer to it after the created SQLite database had been removed. Please correct me if I'm wrong, but if that is the case, I think I misunderstood the purpose of this R package. My aim was to keep everything, even the result data frame, in a database and use the RAM only for calculations. Is there any way to put everything in the database until the end of the program?
The purpose of sqldf is to process data frames using SQL. If you want to create a database and read a file into it, you can use dbWriteTable from RSQLite directly; however, if you want to use sqldf anyway, then first create an empty database, mydb, then read the file into it, and finally check that the table is there. Ignore the read.csv.sql warning. If you add the verbose = TRUE argument to read.csv.sql, it will show the RSQLite statements it is using.
Also you may wish to read https://avi.im/blag/2021/fast-sqlite-inserts/ and https://www.pdq.com/blog/improving-bulk-insert-speed-in-sqlite-a-comparison-of-transactions/
library(sqldf)
sqldf("attach 'mydb' as new")
read.csv.sql("myfile.csv", sql =
"create table mytab as select * from file", dbname = "mydb")
## data frame with 0 columns and 0 rows
## Warning message:
## In result_fetch(res@ptr, n = n) :
## SQL statements must be issued with dbExecute() or
## dbSendStatement() instead of dbGetQuery() or dbSendQuery().
sqldf("select * from sqlite_master", dbname = "mydb")
## type name tbl_name rootpage
## .. info on table that was created in mydb ...
sqldf("select count(*) from mytab", dbname = "mydb")
I have a 20GB dataset in csv format and I am trying to trim it down with a read.csv.sql command.
I am successfully able to load the first 10,000 observations with the following command:
testframe = read.csv(file.choose(),nrows = 10000)
The data includes a country column, which I filter on below.
I then tried to build my trimmed down dataset with the following command, and get an error:
reduced = read.csv.sql(file.choose(),
sql = 'select * from file where "country" = "Poland" OR
country = "Germany" OR country = "France" OR country = "Spain"',
header = TRUE,
eol = "\n")
The error is:

Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) :
  RS_sqlite_import: C:\Users\feded\Desktop\AWS\biodiversity-data\occurence.csv line 262 expected 37 columns of data but found 38
Why is it that I can load the first 10,000 observations with ease and problems arise with the second command? I hope you have all the information needed to be able to provide some help on this issue.
Note that with the latest version of all packages read.csv.sql is working again.
RSQLite made breaking changes in their interface to SQLite which mean read.csv.sql and any other software that reads files into SQLite from R that used their old interface no longer work. (Other aspects of sqldf still work.)
findstr/grep
If the only reason you are doing this is to cut the file down to the four countries indicated, perhaps you could just preprocess the CSV file like this on Windows, assuming that abc.csv is your CSV file and is in the current directory. We have also assumed that XYZ is a string that appears in the header.
DF <- read.csv(pipe('findstr "XYZ France Germany Poland Spain" abc.csv'))
On other platforms use grep:
DF <- read.csv(pipe('grep -E "XYZ|France|Germany|Poland|Spain" abc.csv'))
The above could retrieve some extra rows if those words also appear in fields other than the intended one; if that is a concern, use subset or filter in R once you have the data to narrow it down to just the desired rows.
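As a sketch of that two-step pattern on a platform with grep (the file here is a made-up miniature of abc.csv):

```r
# A tiny stand-in for abc.csv; "country" appears in the header line,
# so the header survives the grep
f <- tempfile(fileext = ".csv")
writeLines(c("id,country",
             "1,France",
             "2,Brazil",
             "3,Germany",
             "4,Air France flight"), f)

# grep -E enables the alternation in the pattern
cmd <- sprintf('grep -E "country|France|Germany|Poland|Spain" "%s"', f)
DF <- read.csv(pipe(cmd))

# Row 4 over-matches (France appears in another field), so narrow it in R
DF <- subset(DF, country %in% c("France", "Germany", "Poland", "Spain"))
```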
Other utilities
There are also numerous command line utilities that can be used as an alternative to findstr and grep such as sed, awk/gawk (mentioned in the comments) and utilities specifically geared to csv files such as csvfix (C++), miller (go), csvkit (python), csvtk (go) and xsv (rust).
xsv
Taking xsv as an example, binaries can be downloaded from its GitHub releases page; then we can write the following, assuming xsv is in the current directory or on the path. This instructs xsv to extract the rows for which the indicated regular expression matches the country column.
cmd <- 'xsv search -s country "France|Germany|Poland|Spain" abc.csv'
DF <- read.csv(pipe(cmd))
SQLite command line tool
You can use the SQLite command-line program to read the file into an SQLite database, which it will create for you. Download the SQLite command-line tools for your platform from sqlite.org and unpack them. Then, from the command line (not from R), run something like this to create the abc.db SQLite database from abc.csv:
sqlite3 --csv abc.db ".import abc.csv abc"
Then assuming that the database is in current directory run this in R:
library(sqldf)
sqldf("select count(*) from abc", dbname = "abc.db")
I am not sure that SQLite is a good choice for such a large file, but you can try it.
H2
Another possibility if you have sufficient memory to hold the database (possibly after using findstr/grep/xsv or other utility on the command line rather than R) is to then use the H2 database backend to sqldf from R.
If sqldf sees that the RH2 package containing the H2 driver is loaded it will use that instead of SQLite. (It would also be possible to use MySQL or PostgreSQL backends but these are more involved to install so we won't cover them although these are much more likely to be able to handle the large size you have.)
Note that the RH2 driver requires the rJava R package, which in turn requires Java itself, although Java is easy to install. The H2 database itself is included in the RH2 driver package, so it does not have to be installed separately. Also, the first time in a session that you access Java code via rJava, Java itself has to be loaded, which takes some time; thereafter it is faster within that session.
library(RH2)
library(sqldf)
abc3 <- sqldf("select * from csvread('abc.csv') limit 3") |>
type.convert(as.is = TRUE)
I have a number of large data files (.csv) on my local drive that I need to read in R, filter rows/columns, and then combine. Each file has about 33,000 rows and 575 columns.
I read this post: Quickly reading very large tables as dataframes and decided to use "sqldf".
This is the short version of my code:
Housing <- file("file location on my disk")
Housing_filtered <- sqldf('SELECT Var1 FROM Housing', file.format = list(eol = "/n"))  # I am using Windows
I see "Housing_filtered" data.frame is created with Var1, but zero observations. This is my very first experience with sqldf. I am not sure why zero observations are returned.
I also used read.csv.sql and still see zero observations:
Housing_filtered <- read.csv.sql(file = "file location on my disk",
sql = "select Var01 from file",
eol = "/n",
header = TRUE, sep = ",")
You never really imported the file as a data.frame like you think.
You've opened a connection to a file. You mentioned that it is a CSV. Your code should look something like this if it is a normal CSV file:
Housing <- read.csv("my_file.csv")
Housing_filtered <- sqldf('SELECT Var1 FROM Housing')
If there's something non-standard about this CSV file please mention what it is and how it was created.
Also, to another point made in the comments: if you do for some reason need to specify the line breaks manually, use \n where you were using /n. Any remaining error is not caused by that change; rather, you're getting past one problem and on to another, probably due to improperly handled missing data, spaces, or commas in text fields, etc.
If there are still data errors, can you please use R code to create a small file that is reflective of the relevant characteristics of your data and which produces the same error when you import it? This may help.
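For reference, once the file itself is well-formed, a minimal working read.csv.sql call looks like this (a made-up two-column file; note that eol is "\n", not "/n"):

```r
library(sqldf)

# A small stand-in for the real CSV
f <- tempfile(fileext = ".csv")
writeLines(c("Var1,Var2", "1,a", "2,b"), f)

# eol is "\n" (backslash-n); "/n" would leave the reader unable to find row ends
Housing_filtered <- read.csv.sql(file = f, sql = "select Var1 from file",
                                 header = TRUE, sep = ",", eol = "\n")
```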
Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?
In Python there is PDFMiner, but I would like to keep this analysis all in R if possible.
Any suggestions?
Linux systems have pdftotext, which I have had reasonable success with. By default, it creates foo.txt from a given foo.pdf.
That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.
This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.
A colleague turned me on to this handy open-source tool: http://tabula.nerdpower.org/. Install, upload the PDF, and select the table in the PDF that requires data-ization. Not a direct solution in R, but certainly better than manual labor.
A purely R solution could be:
library('tm')
file <- 'namefile.pdf'
Rpdf <- readPDF(control = list(text = "-layout"))
corpus <- VCorpus(URISource(file),
readerControl = list(reader = Rpdf))
corpus.array <- content(content(corpus)[[1]])
then you'll have the PDF's lines in a character vector.
install.packages("pdftools")
library(pdftools)
download.file("http://www.nfl.com/liveupdate/gamecenter/56901/DEN_Gamebook.pdf",
"56901.DEN.Gamebook", mode = "wb")
txt <- pdf_text("56901.DEN.Gamebook")
cat(txt[1])
The tabula PDF table extractor app is built around a Java command-line application, tabula-extractor.
The R tabulizer package provides an R wrapper that makes it easy to pass in the path to a PDF file and get data extracted from data tables out.
Tabula will have a good go at guessing where the tables are, but you can also tell it which part of a page to look at by specifying a target area of the page.
Data can be extracted from multiple pages, and a different area can be specified for each page, if required.
For an example use case, see: When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.
I used an external utility to do the conversion and called it from R. All files had a leading table with the desired information.
Set the path to pdftotext.exe and convert the PDFs to text:
# pdfFracList holds the PDF file names and reportDir their directory;
# str_sub() comes from the stringr package
library(stringr)

exeFile <- "C:/Projects/xpdfbin-win-3.04/bin64/pdftotext.exe"
for (i in seq_along(pdfFracList)) {
  fileNumber <- str_sub(pdfFracList[i], start = 1, end = -5)
  pdfSource <- paste0(reportDir, "/", fileNumber, ".pdf")
  txtDestination <- paste0(reportDir, "/", fileNumber, ".txt")
  print(paste0("File number ", i, ", processing file ", pdfSource))
  system(paste(exeFile, "-table", pdfSource, txtDestination), wait = TRUE)
}
I have a text file of 4.5 million rows and 90 columns to import into R. Using read.table I get the "cannot allocate vector of size..." error message, so I am trying to import using the ff package before subsetting the data to extract the observations which interest me (see my previous question for more details: Add selection crteria to read.table).
So, I use the following code to import:
test<-read.csv2.ffdf("FD_INDCVIZC_2010.txt", header=T)
but this returns the following error message:
Error in read.table.ffdf(FUN = "read.csv2", ...) :
only ffdf objects can be used for appending (and skipping the first.row chunk)
What am I doing wrong?
Here are the first 5 rows of the text file:
CANTVILLE.NUMMI.AEMMR.AGED.AGER20.AGEREV.AGEREVQ.ANAI.ANEMR.APAF.ARM.ASCEN.BAIN.BATI.CATIRIS.CATL.CATPC.CHAU.CHFL.CHOS.CLIM.CMBL.COUPLE.CS1.CUIS.DEPT.DEROU.DIPL.DNAI.EAU.EGOUL.ELEC.EMPL.ETUD.GARL.HLML.ILETUD.ILT.IMMI.INAI.INATC.INFAM.INPER.INPERF.IPO ...
1 1601;1;8;052;54;051;050;1956;03;1;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;1;1;Z;16;Z;03;16;Z;Z;Z;21;2;2;2;Z;1;2;1;1;1;4;4;4,02306147485403;ZZZZZZZZZ;1;1;1;4;M;22;32;AZ;AZ;00;04;2;2;0;1;2;4;1;00;Z;54;2;ZZ;1;32;2;10;2;11;111;11;11;1;2;ZZZZZZ;1;2;1;4;41;2;Z
2 1601;1;8;012;14;011;010;1996;03;3;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;2;8;Z;16;Z;ZZ;16;Z;Z;Z;ZZ;1;2;2;2;Z;2;1;1;1;4;4;4,02306147485403;ZZZZZZZZZ;3;3;3;1;M;11;11;ZZ;ZZ;00;04;2;2;0;1;2;4;1;14;Z;54;2;ZZ;1;32;Z;10;2;23;230;11;11;Z;Z;ZZZZZZ;1;2;1;4;41;2;Z
3 1601;1;8;006;05;005;005;2002;03;3;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;2;8;Z;16;Z;ZZ;16;Z;Z;Z;ZZ;1;2;2;2;Z;2;1;1;1;4;4;4,02306147485403;ZZZZZZZZZ;3;3;3;1;M;11;11;ZZ;ZZ;00;04;2;2;0;1;2;4;1;14;Z;54;2;ZZ;1;32;Z;10;2;23;230;11;11;Z;Z;ZZZZZZ;1;2;1;4;41;2;Z
4 1601;1;8;047;54;046;045;1961;03;2;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;1;6;Z;16;Z;14;974;Z;Z;Z;16;2;2;2;Z;2;2;4;1;1;4;4;4,02306147485403;ZZZZZZZZZ;2;2;2;1;M;22;32;MN;GU;14;04;2;2;0;1;2;4;1;14;Z;54;2;ZZ;2;32;1;10;2;11;111;11;11;1;4;ZZZZZZ;1;2;1;4;41;2;Z
5 1601;2;9;053;54;052;050;1958;02;1;ZZZZZ;2;Z;Z;Z;1;0;Z;2;Z;Z;2;1;2;Z;16;Z;12;87;Z;Z;Z;22;2;1;2;Z;1;2;3;1;1;2;2;4,21707670353782;ZZZZZZZZZ;1;1;1;2;M;21;40;GZ;GU;00;07;0;0;0;0;0;2;1;00;Z;54;2;ZZ;1;30;2;10;3;11;111;ZZ;ZZ;1;1;ZZZZZZ;2;2;1;4;42;1;Z
I encountered a similar problem reading a CSV into ff objects. Using
read.csv2.ffdf(file = "FD_INDCVIZC_2010.txt")
instead of the implicit call
read.csv2.ffdf("FD_INDCVIZC_2010.txt")
got rid of the error. Passing the file by argument name matters because the first positional argument of read.csv2.ffdf is x, an existing ffdf object to append to, not the file name; that is why the error complains that "only ffdf objects can be used for appending".
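A minimal sketch of the working call, using a small made-up semicolon-separated file (ff must be installed):

```r
library(ff)

f <- tempfile(fileext = ".txt")
writeLines(c("a;b", "1;2", "3;4"), f)

# The first positional argument of read.csv2.ffdf is x (an ffdf to
# append to), so the file must be passed by name
dat <- read.csv2.ffdf(file = f)
dim(dat)
```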
You could try the following code:
read.csv2.ffdf(file = "FD_INDCVIZC_2010.txt",
               sep = "\t",
               VERBOSE = TRUE,
               first.rows = 100000,
               next.rows = 200000,
               header = TRUE)
I am assuming that since it's a .txt file, it's tab-delimited (though the sample rows above are semicolon-separated, which is read.csv2's default sep, so you may not need to override it).
Sorry, I came across the question only just now. Using the VERBOSE option, you can actually see how much time each block of data takes to be read. Hope this helps.
If possible, try to filter the data at the OS level, that is, before it is loaded into R. The simplest way to do this from R is a combination of pipe and the grep command:
textpipe <- pipe('grep XXXX file.name')
mutable <- read.table(textpipe)
You can use grep, awk, sed and basically all the machinery of Unix command-line tools to apply the necessary selection criteria and edit the CSV files before they are imported into R. This works very fast, and by this procedure you can strip unnecessary data before R begins to read it from the pipe.
This works well under Linux and Mac; on Windows you may need to install Cygwin or use other Windows-specific utilities.
Perhaps you could try the following code:
read.table.ffdf(x = NULL, file = 'your/file/path', sep = ';')