I am using read.csv.sql to conditionally read in data (my data set is extremely large, so this was the solution I chose to filter it and reduce its size before reading it in). I was running into memory issues when reading in the full data and then filtering it, so it is important that I use a conditional read that pulls in only the subset rather than the full data set.
Here is a small data set so my problem can be reproduced:
write.csv(iris, "iris.csv", row.names = F)
I am finding the notation required by read.csv.sql extremely awkward. The following is the first way I tried reading in the file; it works, but it is messy:
library(sqldf)
csvFile <- "iris.csv"
spec <- 'setosa'
sql <- paste0("select * from file where Species = '\"", spec,"\"'")
d1 <- read.csv.sql(file = csvFile, sql = sql)
I then found a slightly cleaner way of writing the same query:
sql <- paste0("select * from file where Species = '", spec,"'")
d2 <- read.csv.sql(file = csvFile, sql = sql,
filter = list('gawk -f prog', prog = '{ gsub(/"/, ""); print }'))
Next, I wanted to read in a case where I select multiple values from the same column, so I tried this and it works:
d3 <- read.csv.sql(file = csvFile,
sql = "select * from file where Species in
('\"setosa\"', '\"versicolor\"') ")
However, I want to avoid hard coding the values like that so I tried:
spec2 <- c('setosa', 'versicolor')
sql2 <- paste0("select * from file where Species in '", spec2,"'")
d4 <- read.csv.sql(file = csvFile, sql = sql2,
filter = list('gawk -f prog', prog = '{ gsub(/"/, ""); print }'))
But this does not work (it seems to only read the first value from the vector and then tries to match it as a table). I'm sure this is another notation issue, and I would like help cleaning up this chunk of code.
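For reference, here is what that paste0 call actually produces; it is vectorized over spec2, so I end up with two separate, incomplete SQL statements rather than one IN clause:
spec2 <- c('setosa', 'versicolor')
paste0("select * from file where Species in '", spec2, "'")
# [1] "select * from file where Species in 'setosa'"
# [2] "select * from file where Species in 'versicolor'"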
Also, if you have any tips/tricks on using read.csv.sql and dealing with the notation issues, I would like to hear them!
The problem is that sqldf provides text preprocessing facilities, but the code shown in the question does not use them, making it overly complex.
1) Regarding text substitution, use fn$ (from gsubfn, which sqldf automatically loads), as discussed on the github page for sqldf. This assumes we used quote = FALSE in the write.csv, since SQLite does not handle quotes natively:
spec <- 'setosa'
out <- fn$read.csv.sql("iris.csv", "select * from file where Species = '$spec' ")
spec <- c("setosa", "versicolor")
string <- toString(sprintf("'%s'", spec)) # add quotes and make comma-separated
out <- fn$read.csv.sql("iris.csv", "select * from file where Species in ($string) ")
2) Regarding deleting double quotes, a simpler way would be to use the following filter= argument:
read.csv.sql("iris.csv", filter = "tr -d \\042") # Windows
or
read.csv.sql("iris.csv", filter = "tr -d \\\\042") # Linux / bash
depending on your shell. The first one worked for me on Windows (with Rtools installed and on the PATH) and the second worked for me on Linux with bash. It is possible that other variations could be needed for other shells.
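Putting 1) and 2) together, here is a sketch for the case where the csv still contains quotes; the tr escaping shown is the Windows form, so adjust it for your shell as discussed above:
library(sqldf)
spec <- c("setosa", "versicolor")
string <- toString(sprintf("'%s'", spec))
out <- fn$read.csv.sql("iris.csv",
                       "select * from file where Species in ($string)",
                       filter = "tr -d \\042")  # strip double quotes before SQLite sees them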
2a) Another possibility for removing quotes is to install the free csvfix utility (available on Windows, Linux and Mac) and then use the following filter= argument. It does not involve any characters that are typically interpreted specially by either R or most shells, so it should work in all shells and on all platforms.
read.csv.sql("iris.csv", filter = "csvfix echo -smq")
2b) Another cross-platform utility that could be used is xsv. The eol= argument is only needed on Windows, since xsv produces UNIX-style line endings, but it won't hurt on other platforms, so the following line should work everywhere.
read.csv.sql("iris.csv", eol = "\n", filter = "xsv fmt")
2c) sqldf also includes an awk program (csv.awk) that can be used. It outputs UNIX-style newlines, so specify eol = "\n" on Windows. On other platforms it won't hurt if you specify it, but you can omit it since that is already the default there.
csv.awk <- system.file("csv.awk", package = "sqldf")
rm_quotes_cmd <- sprintf('gawk -f "%s"', csv.awk)
read.csv.sql("iris.csv", eol = "\n", filter = rm_quotes_cmd)
3) Regarding general tips, note that the verbose = TRUE argument to read.csv.sql can be useful for seeing what is going on.
read.csv.sql("iris.csv", verbose = TRUE)
Related
I am having problems using fread with " " as the delimiter and interspersed blank values. For example:
dt <- data.table(1:5,1:5,1:5) #make a simple table
dt[3,"V2" := NA] #add a blank in the middle to illustrate the problem
fwrite(dt, file = "dt.csv", sep = " ") #save to file
dt <- fread("dt.csv", sep = " ") #try to retrieve
The fread fails with: "Stopped early on line 4. Expected 3 fields but found 2." The problem seems to be that, with the NA value in the middle column, fwrite writes value|space|space|value, and fread then doesn't recognize the implied blank value in the middle.
I understand it would be simple to use another delimiter in the first place. However, is it possible to get fread to reproduce the original dt here?
EDIT WITH A READ-SIDE SOLUTION:
I found the same question here. It's a bit confusing because it gives a solution, but that solution later stopped working. Pursuing some other leads, the closest I've now found to a read-side solution with fread() is a Unix command like this:
dt <- fread(cmd="wsl sed -r 's/ /,/g' dt.csv") #converts spaces to commas on the way in
On Windows 10 I had to do some trial and error to get my system to run Unix commands; the "wsl" part seems to depend on the system. This video was helpful, and I used the first method described there. This question and this one provide a bit more on using sed with fread; the latter says sed comes with Rtools, although I did not try that.
Maybe export NA as something other than "" (the default); here I use #:
library(data.table)
dt <- data.table(1:5,1:5,1:5) #make a simple table
dt[3,"V2" := NA] #add a blank in the middle to illustrate the problem
fwrite(dt, file = "dt.csv", sep = " ", na="#") #save to file
dt <- fread("dt.csv", sep = " ",na.strings = "#") #try to retrieve
I have a folder with thousands of comma delimited CSV files, totaling dozens of GB. Each file contains many records, which I'd like to separate and process separately based on the value in the first field (for example, aa, bb, cc, etc.).
Currently, I'm importing all the files into a dataframe and then subsetting in R into smaller, individual dataframes. The problem is that this is very memory intensive - I'd like to filter the first column during the import process, not once all the data is in memory.
This is my current code:
setwd("E:/Data/")
files <- list.files(path = "E:/Data/",pattern = "*.csv")
temp <- lapply(files, fread, sep=",", fill=TRUE, integer64="numeric",header=FALSE)
DF <- rbindlist(temp)
DFaa <- subset(DF, V1 =="aa")
If possible, I'd like to move the "subset" process into lapply.
Thanks
1) read.csv.sql This will read the file directly into a temporarily set up SQLite database (which it creates for you) and then read only the aa records into R. The rest of the file is never read into R at any time, and the table is deleted from the database afterwards.
File is a character string that contains the file name (or pathname if not in the current directory). Other arguments may be needed depending on the format of the data.
library(sqldf)
read.csv.sql(File, "select * from file where V1 == 'aa'", dbname = tempfile())
2) grep/findstr Another possibility is to use grep (Linux) or findstr (Windows) to extract the lines containing aa. That should get you the desired lines plus possibly a few others, and at that point the input is much smaller, so it can be subset in R without memory problems. For example,
fread("findstr aa File")[V1 == 'aa'] # Windows
fread("grep aa File")[V1 == 'aa'] # Linux
sed or gawk could also be used and are included with Linux and in Rtools on Windows.
3) csvfix The free csvfix utility is available on all platforms that R supports and can be used to select rows by field value -- there are also numerous other similar utilities, such as csvkit, csvtk, miller and xsv.
The line below says to return only lines for which the first comma separated field equals aa. This line may need to be modified slightly depending on the cmd line or shell processor used.
fread("csvfix find -if $1==aa File") # Windows
fread("csvfix find -if '$1'==aa File") # Linux bash
setwd("E:/Data/")
files <- list.files(path = "E:/Data/",pattern = "*.csv")
temp <- lapply(files, function(x) subset(fread(x, sep=",", fill=TRUE, integer64="numeric",header=FALSE), V1=="aa"))
DF <- rbindlist(temp)
Untested, but this will probably work - replace your function call with an anonymous function.
This could help, but you will have to adapt the function to your needs:
#Function
myload <- function(x)
{
y <- fread(x, sep=",", fill=TRUE, integer64="numeric",header=FALSE)
y <- subset(y, V1 =="aa")
return(y)
}
#Apply
temp <- lapply(files, myload)
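Then combine the filtered pieces just as in the original code:
DF <- rbindlist(temp)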
If you don't want to muck with SQL, consider using fread's skip and nrows arguments in a loop. Slower, but that way you read in a block of lines, filter it, then read in the next block (into the same temp variable so as not to take extra memory), and so on.
Inside your lapply call, use either a second lapply or, equivalently, something like:
mydata <- list()
for (jj in 0:N) {
  # skip takes a single line offset (not a vector of line numbers),
  # so skip the blocks already read and pull the next 1000 rows with nrows
  foo <- fread(filename, skip = jj*1000, nrows = 1000, sep = ",", fill = TRUE,
               integer64 = "numeric", header = FALSE)
  mydata[[jj + 1]] <- do_something_to_filter(foo)
}
When exporting data from R using haven::write_sas(), the resulting sas7bdat file is not recognized (i.e. cannot be loaded) by SAS EG/9.4. Although there are several other packages such as foreign that provide alternative approaches, I was hoping to find a relatively automated way to push a dataset from my R session directly into SAS.
When using haven, the file is created but cannot be opened by either SAS EG or 9.4:
# Load package
library(haven)
# Save data
write_sas(mtcars, "mtcars.sas7bdat")
Using foreign as alternative to haven:
library(foreign)
write.foreign(df = mtcars,
datafile = 'mtcars.txt',
codefile = 'mtcars.sas',
dataname = 'libraryname.tablename', # Destination in SAS to save the data
package = 'SAS')
Running the SAS code output from foreign is successful.
* Written by R;
* write.foreign(df = mtcars, datafile = "mtcars.txt", codefile = "mtcars.sas", ;
DATA libraryname.tablename ;
INFILE "mtcars.txt"
DSD
LRECL= 43 ;
INPUT
mpg
cyl
disp
hp
drat
wt
qsec
vs
am
gear
carb
;
RUN;
However, neither of these methods helps with automatically pushing the data directly from R into a SAS library, which would be preferable.
There is a lengthy discussion on GitHub describing some of the challenges when exporting data from R for use in SAS via haven. In addition to providing a solution on how to automate data transfer from R to SAS, I hope this can serve as an answer to some related questions.
If one wants to use tools designed by SAS for interoperability with R, RSWAT on GitHub is likely a more robust option. However, this will assume that you have access to SAS Cloud Analytics Services configured for this purpose.
If you are working with SAS 9.4 on your machine (and perhaps also connect to SAS servers, i.e. using rsubmit; commands), it should be relatively straightforward to pass a data set directly from R into a SAS library. There are three steps:
Format the dataset for SAS. Although foreign will do a lot of the formatting changes, I prefer converting factors back to characters and replacing NA with "". I find this ensures that colleagues need no special formatting to open the final table in SAS.
# Example data
data <- data.frame(ID = c(123, NA, 125),
disease = factor(c('syphilis', 'gonorrhea', NA)),
AdmitDate = as.Date(c("2014-04-05", NA, "2016-02-03")),
DOB = as.Date(c("1990-01-01", NA, NA)))
# Function for converting factors to character and NA to blanks
# (the pipe needs dplyr attached; tidyr must be installed for replace_na)
library(dplyr)
convert_format_r2sas <- function(data){
  data <- data %>%
    dplyr::mutate_if(is.factor, as.character) %>%
    dplyr::mutate_if(is.character, tidyr::replace_na, replace = "")
  return(data)
}
# Convert some formatting
data <- convert_format_r2sas(data)
Use foreign to export the data and associated code
library(foreign)
# Ensure the data and code files are saved in an easily accessible location (ideally in or downstream of your R project directory)
write.foreign(df = data ,
datafile = 'data.txt',
codefile = 'data.sas',
dataname = 'libraryname.tablename', # Destination in SAS to save the data
package = 'SAS')
Pass the code to your local SAS installation using a custom function. You may need to adjust the location of the SAS.exe as well as the configuration file. This works both when passing a list of SAS files and when passing SAS code written directly in R as a character vector.
# Define function for passing the code to SAS and upload data (may require tweaking the local SAS installation location and configuration file)
pass_code_to_sas <- function(sas_file_list = NULL, inputstring = NULL,
sas_path = "C:/LocationTo/SASHome/SASFoundation/9.4/sas.exe",
configFile = "C:/LocationTo/SASHome/SASFoundation/9.4/SASV9.CFG") {
# If provided list of scripts, check they are all valid
if(!is.null(sas_file_list)){
if(!is.list(sas_file_list) || !all(purrr::map_lgl(sas_file_list, file.exists))){
stop("You entered an invalid file location or did not provide the locations as a list of characters")
}
}
sink(file.path(R.home(), "temp_codePass.sas"))
if(!is.null(sas_file_list)){
for(i in 1:length(sas_file_list)){
cat(readLines(sas_file_list[[i]]), sep = "\n")
}
}
cat(inputstring)
sink()
# Output message to view what code was sent...
message(paste0("The above info was passed to SAS: ",
if(!is.null(sas_file_list)){for(i in 1:length(sas_file_list)){cat(readLines(sas_file_list[[i]]), sep = "\n")}},
print(inputstring)))
# Run SAS
system2(sas_path,
args = paste0(
"\"", file.path(R.home(), "temp_codePass.sas"), "\"",
if(!is.null(configFile)) { paste0(" -config \"", configFile, "\"")}
)
)
# Delete the SAS file
file.remove(file.path(R.home(), "temp_codePass.sas"))
}
# Pass data to SAS
pass_code_to_sas(sas_file_list = 'path2codefile/data.sas')
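Since the function also accepts SAS code written directly in R as a character vector, you could, for example, check the upload afterwards with a quick PROC CONTENTS (libraryname.tablename as used above):
pass_code_to_sas(inputstring = "proc contents data = libraryname.tablename; run;")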
I have a massive (8GB) dataset, which I am simply unable to read into R with my existing setup. Attempting to use fread on the dataset crashes the R session immediately, and attempting to read in random lines from the underlying file was insufficient because: (1) I don't have a good way of knowing the total number of rows in the dataset; (2) my method was not a true "random sampling."
These attempts to get the number of rows have failed (they take as long as simply reading the data in):
length(count.fields("file.dat", sep = "|"))
read.csv.sql("file.dat", header = FALSE, sep = "|", sql = "select
count(*) from file")
Is there any way via R or some other program to generate a random sample from a large underlying dataset?
Potential idea: Is it possible, given a "sample" of the first several rows, to get a sense of the average amount of information contained per row, and then back out how many rows there must be given the size of the dataset (8 GB)? This wouldn't be accurate, but it might give a ball-park figure that I could just under-cut.
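A rough sketch of that estimate: read the first few thousand lines, compute the average bytes per line, and divide the file size by it (ball-park only, and it assumes reasonably uniform row lengths):
sample_head <- readLines("file.dat", n = 5000)                # first few thousand rows
avg_bytes   <- mean(nchar(sample_head, type = "bytes")) + 1   # +1 for the newline
est_rows    <- file.size("file.dat") / avg_bytes
est_rows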
Here's one option, using the ability of fread to accept a shell command that preprocesses the file as its input. With this we can run a gawk script to extract the required lines. Note that you may need to install gawk if it is not already on your system; if you have plain awk instead, you can use that.
First let's create a dummy file to test on:
library(data.table)
dt = data.table(1:1e6, sample(letters, 1e6, replace = TRUE))
write.csv(dt, 'test.csv', row.names = FALSE)
Now we can use the shell command wc to find how many lines there are in the file:
nl = read.table(pipe("wc -l test.csv"))[[1]]
Take a sample of line numbers and write them (in ascending order) to a temp file, which makes them easily accessible to gawk.
N = 20 # number of lines to sample
sample.lines = sort(sample(2:nl, N)) #start sample at line 2 to exclude header
cat(paste0(sample.lines, collapse = '\n'), file = "lines.txt")
Now we are ready to read in the sample using fread and gawk (based on this answer). You can also try some of the other gawk scripts in the linked question, which could possibly be more efficient on very large data.
dt.sample = fread("gawk 'NR == FNR {nums[$1]; next} FNR in nums' lines.txt test.csv")
The following works, but I'm missing a functional programming technique, an indexing approach, or a better way of structuring my data. After a month it will take me a while to remember exactly how this works, rather than it being easy to maintain; it feels like a workaround when it shouldn't be. I want to use regex to decide which function to use for each expected group of files. When a new file format comes along, I can write the read function, then add that function along with its regex to the data.frame so it runs alongside all the rest.
I have different formats of Excel and csv files that need to be read in and standardized. I want to maintain a list or data.frame of filename regexes and the appropriate read function for each. Sometimes there will be new file formats that aren't matched, and old formats with no new files, but then it gets complicated, which is something I would prefer to avoid.
# files to read in based on filename
fileexamples <- data.frame(
filename = c('notanyregex.xlsx','regex1today.xlsx','regex2today.xlsx','nomatch.xlsx','regex1yesterday.xlsx','regex2yesterday.xlsx','regex3yesterday.xlsx'),
readfunctionname = NA
)
# regex and corresponding read function
filesourcelist <- read.table(header = T,stringsAsFactors = F,text = "
greptext readfunction
'.*regex1.*' 'readsheettype1'
'.*nonematchthis.*' 'readsheetwrench'
'.*regex2.*' 'readsheettype2'
'.*regex3.*' 'readsheettype3'
")
# list of grepped files
fileindex <- lapply(filesourcelist$greptext, function(greptext, files){
  grep(pattern = greptext, x = files, ignore.case = TRUE)
}, files = fileexamples$filename)
# run function on files based on fileindex from grep
for(i in 1:length(fileindex)){
fileexamples[fileindex[[i]],'readfunctionname'] <- filesourcelist$readfunction[i]
}
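For completeness, a hypothetical sketch of the dispatch step once readfunctionname has been filled in (readsheettype1 etc. are assumed to be the read functions mentioned above; unmatched files keep NA and are skipped):
matched <- !is.na(fileexamples$readfunctionname)
standardised <- Map(function(fun, file) do.call(fun, list(file)),
                    fileexamples$readfunctionname[matched],
                    fileexamples$filename[matched])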