R: Simple Random Sample of Massive Dataframe - r

I have a massive (8GB) dataset, which I am simply unable to read into R using my existing setup. Attempting to use fread on the dataset crashes the R session immediately, and attempting to read in random lines from the underlying file was insufficient because: (1) I don't have a good way of knowing the total number of rows in the dataset; (2) my method was not a true "random sampling."
These attempts to get the number of rows have failed (they take as long as simply reading the data in):
length(count.fields("file.dat", sep = "|"))
read.csv.sql("file.dat", header = FALSE, sep = "|", sql = "select
count(*) from file")
Is there any way via R or some other program to generate a random sample from a large underlying dataset?
Potential idea: Is it possible, given a "sample" of the first several rows, to get a sense of the average amount of information contained on a per-row basis, and then back out how many rows there must be given the size of the dataset (8 GB)? This wouldn't be accurate, but it might give a ballpark figure that I could just undercut.
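In code, that idea might look like the following sketch (assuming the first 1000 lines are representative of the average row length):
# Rough row-count estimate: average bytes per row over the first lines,
# then divide the total file size by it. Not exact, but a ballpark.
first_lines <- readLines("file.dat", n = 1000)
avg_bytes <- mean(nchar(first_lines, type = "bytes")) + 1  # +1 for the newline
est_rows <- floor(file.size("file.dat") / avg_bytes)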

Here's one option, using the ability of fread to accept a shell command that preprocesses the file as its input. Using this option we can run a gawk script to extract the required lines. Note you may need to install gawk if it is not already on your system. If you have awk rather than gawk, that should work too.
First let's create a dummy file to test on:
library(data.table)
dt = data.table(1:1e6, sample(letters, 1e6, replace = TRUE))
write.csv(dt, 'test.csv', row.names = FALSE)
Now we can use the shell command wc to find how many lines there are in the file:
nl = read.table(pipe("wc -l test.csv"))[[1]]
Take a sample of line numbers and write them (in ascending order) to a temp file, which makes them easily accessible to gawk.
N = 20 # number of lines to sample
sample.lines = sort(sample(2:nl, N)) #start sample at line 2 to exclude header
cat(paste0(sample.lines, collapse = '\n'), file = "lines.txt")
Now we are ready to read in the sample using fread and gawk (based on this answer). You can also try some of the other gawk scripts in the linked question, which could possibly be more efficient on very large data.
dt.sample = fread("gawk 'NR == FNR {nums[$1]; next} FNR in nums' lines.txt test.csv")

Related

Saving .dta: data frame with very long strings using R

I have a df with multiple variables, some of which are very long strings with up to 4500 characters. I would like to export this database as a .dta file.
I try to save it using haven's write_dta() function, but I get the following error message: Error in write_dta_(data, normalizePath(path, mustWork = FALSE), version = stata_file_format(version), : Writing failure: A provided string value was longer than the available storage size of the specified column.
Here is an example of the issue:
library(haven)
longFun <- function(n) {
  do.call(paste0, replicate(5000, sample(LETTERS, n, TRUE), FALSE))
}
longString <- data.frame(VeryveryveryveryveryveryveryveryveryveryVeryveryveryveryveryveryveryveryveryverylongname = longFun(1),
                         stringsAsFactors = FALSE)
write_dta(longString,"tst.dta")
I am aware that write_dta has issues handling long strings (https://github.com/tidyverse/haven/issues/437) and that one possibility is to trim the strings (Error in write_dta : A provided string value was longer than the available storage size of the specified column). But it is essential for me to keep the full strings.
Is there any way to save variables with long strings as .dta files using R?
Edit:
I have tried the readstata13::save.dta13 option suggested by @jay.sf, but this has two issues: 1) it is not able to manage - i.e. it truncates - long variable names above 32 characters, which write_dta() handles well; 2) it is significantly slower than write_dta(). Given that I have to save a very large dataset, this is a relevant concern.
In sum, is there any other approach that would allow me to:
a) save as .dta a df with very long strings,
b) retain the original variable names (longer than 32 characters), and
c) do this in a relatively fast manner?
Use save.dta13 from the readstata13 package.
R:
readstata13::save.dta13(longString, "tst.dta")
Stata:
. use "V:\tst.dta"
. list
+------------------------------------------------------------------------------------------------------+
| V1 |
|------------------------------------------------------------------------------------------------------|
1. | GZSPZGLLKOQHETKURLPKQDTZWTNHLDJDUSAFAXHFMPKUDIZURKIFLWQSXSFBLTPBGBLJKTDYJCHZOPZCFYKIMLGTQGDKRNBGUI.. |
+------------------------------------------------------------------------------------------------------+

Read a 20GB file in chunks without exceeding my RAM - R

I'm currently trying to read a 20GB file. I only need 3 columns of that file.
My problem is that I'm limited to 16 GB of RAM. I tried using readr and processing the data in chunks with the function read_csv_chunked and read_csv with the skip parameter, but those both exceeded my RAM limits.
Even the read_csv(file, ..., skip = 10000000, nrow = 1) call that reads one line uses up all my RAM.
My question now is, how can I read this file? Is there a way to read chunks of the file without using that much ram?
The LaF package can read in ASCII data in chunks. It can be used directly, or, if you are using dplyr, the chunked package builds on it to provide an interface for use with dplyr.
The readr package has read_csv_chunked and related functions.
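As an illustration, a sketch with read_csv_chunked, assuming the three needed columns are called col1, col2 and col3 (adjust the names and chunk_size to your file):
library(readr)
# keep only the three columns from each chunk; the callback row-binds the results
keep3 <- DataFrameCallback$new(function(chunk, pos) chunk[, c("col1", "col2", "col3")])
DF <- read_csv_chunked("myfile.csv", callback = keep3, chunk_size = 100000)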
The section of this web page entitled The Loop as well as subsequent sections of that page describes how to do chunked reads with base R.
It may be that if you remove all but the first three columns, the result will be small enough to just read in and process in one go.
vroom in the vroom package can read in files very quickly and also has the ability to read in just the columns named in its col_select= argument, which may make the result small enough to read in one go.
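A minimal sketch, again assuming the three columns are named col1, col2 and col3:
library(vroom)
DF <- vroom("myfile.csv", col_select = c(col1, col2, col3))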
fread in the data.table package is a fast reading function that also supports a select= argument which can select only specified columns.
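For example (the column names are assumed):
library(data.table)
DT <- fread("myfile.csv", select = c("col1", "col2", "col3"))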
read.csv.sql in the sqldf package (also see its github page) can read a file larger than R can handle into a temporary external SQLite database, which it creates for you and removes afterwards, and then reads the result of the given SQL statement into R. If the first three columns are named col1, col2 and col3 then try the code below. See ?read.csv.sql and ?sqldf for the remaining arguments, which will depend on your file.
library(sqldf)
DF <- read.csv.sql("myfile", "select col1, col2, col3 from file",
                   dbname = tempfile(), ...)
read.table and read.csv in base R have a colClasses= argument which takes a vector of column classes. If the file has nc columns then use colClasses = rep(c(NA, "NULL"), c(3, nc-3)) to read only the first 3 columns.
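A sketch of that, with the column count nc taken from the header row (the file name is assumed):
nc <- length(read.csv("myfile.csv", nrows = 1, header = TRUE))  # number of columns
DF <- read.csv("myfile.csv", colClasses = rep(c(NA, "NULL"), c(3, nc - 3)))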
Another approach is to pre-process the file using cut, sed or awk (available natively on UNIX and in the Rtools bin directory on Windows) or one of a number of free command line utilities such as csvfix, outside of R, to remove all but the first three columns and then see if that makes it small enough to read in one go.
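For instance, a pipe through cut (a sketch that assumes a comma separator and no commas embedded inside quoted fields):
DF <- read.csv(pipe("cut -d, -f1-3 myfile.csv"))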
Also check out the High Performance Computing task view.
We can try something like this; first, a small example csv:
X = data.frame(id = 1:1e5, matrix(runif(1e6), ncol = 10))
write.csv(X, "test.csv", quote = FALSE, row.names = FALSE)
You can use the nrows argument of read.csv: instead of providing a file path, you provide a connection, and you store the results inside a list, for example:
data = vector("list",200)
con = file("test.csv","r")
data[[1]] = read.csv(con, nrows=1000)
dim(data[[1]])
COLS = colnames(data[[1]])
data[[1]] = data[[1]][,1:3]
head(data[[1]])
id X1 X2 X3
1 1 0.13870273 0.4480100 0.41655108
2 2 0.82249489 0.1227274 0.27173937
3 3 0.78684815 0.9125520 0.08783347
4 4 0.23481987 0.7643155 0.59345660
5 5 0.55759721 0.6009626 0.08112619
6 6 0.04274501 0.7234665 0.60290296
In the above, we read the first chunk, collected the colnames and subsetted. If you carry on reading through the connection, the headers will be missing, and we need to specify that:
for (i in 2:200) {
  data[[i]] = read.csv(con, nrows = 1000, col.names = COLS, header = FALSE)[, 1:3]
}
Finally, we bind all of those into a data.frame:
close(con)
data = do.call(rbind, data)
all.equal(data[,1:3],X[,1:3])
[1] TRUE
You can see that I specified a much larger list than required; this is to show that if you don't know how long the file is, specifying something larger should still work. This is a bit better than writing a while loop.
So we wrap it into a function, specifying the file, number of rows to read at one go, the number of times, and the column names (or position) to subset:
read_chunkcsv = function(file, rows_to_read, ntimes, col_subset) {
  data = vector("list", ntimes)              # one slot per chunk
  con = file(file, "r")
  data[[1]] = read.csv(con, nrows = rows_to_read)
  COLS = colnames(data[[1]])
  data[[1]] = data[[1]][, col_subset]
  for (i in 2:ntimes) {
    data[[i]] = read.csv(con, nrows = rows_to_read,
                         col.names = COLS, header = FALSE)[, col_subset]
  }
  close(con)
  return(do.call(rbind, data))
}
all.equal(X[, 1:3],
          read_chunkcsv("test.csv", rows_to_read = 10000, ntimes = 10, 1:3))

CSV with multiple datasets/different-number-of-columns

Similar to How can you read a CSV file in R with different number of columns, I have some complex CSV-files. Mine are from SAP BusinessObjects and hold challenges different to those of the quoted question. I want to automate the capture of an arbitrary number of datasets held in one CSV file. There are many CSV-files, but let's start with one of them.
Given: One CSV file containing several flat tables.
Wanted: Several dataframes or other structure holding all data (S4?)
The method so far:
get line numbers of header rows by counting the number of columns per line
get headers by reading every line index held in the vector described above
read each dataset by calculating skip and nrows from the gaps between the header line indices found above
give the read data column names from the corresponding header
I need help getting on the right track to avoid loops and to make the code more readable and compact when reading the headers and datasets.
These CSVs are formatted as normal CSVs, only that they contain a more or less arbitrary number of subtables. For each dataset I export, the structure is different. In the current example I will suppose there are five tables included in the CSV.
In order to give you an idea, here is some fictitious sample data with line numbers. Separator and quotes have been stripped:
1: n, Name, Species, Description, Classification
2: 90, Mickey, Mouse, Big ears, rat
3: 45, Minnie, Mouse, Big bow, rat
...
16835: Code, Species
16836: RT, rat
...
22673: n, Code, Country
22674: 1, RT, Murica
...
33211: Activity, Code, Descriptor
33212: running, RU, senseless activity
...
34749: Last update
34750: 2017/05/09 02:09:14
There are a number of ways to go about reading each dataset. Here is what I have come up with so far:
filepath <- file.path(paste0(Sys.getenv("USERPROFILE"), "\\SAMPLE.CSV"))
# Make a vector with column number per line
fieldVector <- utils::count.fields(filepath, sep = ",", quote = "\"")
# Make a vector with unique number of fields in file
nFields <- base::unique(fieldVector)
# Make a vector with indices for position of new dataset
iHeaders <- base::match(nFields, fieldVector)
With this, I can do things like:
header <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4], nrows = iHeaders[5]-iHeaders[4]-1)
data <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4] + 1, nrows = iHeaders[5]-iHeaders[4]-1)
names(data) <- header
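To illustrate what I am after, here is a sketch of looping that skip/nrows pattern over every header index (assuming each sub-table has a distinct column count, so match() finds exactly one header per table, and that line numbers are 1-based as returned by count.fields):
nLines <- length(fieldVector)            # total number of lines in the file
starts <- iHeaders                       # header line of each sub-table
ends   <- c(iHeaders[-1] - 1, nLines)    # last line belonging to each sub-table
datasets <- lapply(seq_along(starts), function(i) {
  header <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"",
                             skip = starts[i] - 1, nrows = 1,
                             stringsAsFactors = FALSE)
  data <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"",
                           skip = starts[i], nrows = ends[i] - starts[i],
                           stringsAsFactors = FALSE)
  names(data) <- unlist(header)
  data
})
The result would simply be a list of data.frames, one per sub-table, which may already be structure enough.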
As in the intro of this post, I have made a couple of functions which make it easier to get headers for each dataset:
Headers <- GetHeaders(filepath, iHeaders)
colnames(data) <- Headers[[4]]
I have two functions now - one is GetHeader, which captures one line from the file with utils::read.csv2 while ensuring safe headernames (no æøå % etc).
The other returns a list of string vectors holding all headers:
GetHeaders <- function(filepath, linenums) {
  # init an empty list of length(linenums)
  l.headers <- vector(mode = "list", length = length(linenums))
  for (i in seq_along(linenums)) {
    # read.csv2(filepath, skip = linenums[i] - 1, nrows = 1)
    l.headers[[i]] <- GetHeader(filepath, linenums[i])
  }
  l.headers
}
What I struggle with is how to read in all possible datasets in one go. Specifically, the last set is a bit hard to wrap my head around: if I write a common function, I only know the line number of the header, and not the number of lines in the data that follows.
Also, what is the best data structure for holding such a collection of subtables? The data in the subtables are all related to each other (they can be used to normalize parts of the data). I understand that I must do manual work for each CSV I read, but as I have to read TONS of these files, some common functions to structure them in a predictable manner at each pass would be excellent.
Before you answer, please keep in mind that, no, using a different export format is not an option.
Thank you so much for any pointers. I am a beginner in R and haven't completely wrapped my head around all possible solutions in this particular domain.

read large csv files in R inside for loop

To speed things up I'm setting colClasses; my readfile function looks like the following:
readfile = function(name, save = 0, rand = 1)
{
  data = data.frame()
  tab5rows <- read.table(name, header = TRUE, nrows = 5, sep = ",")
  classes <- sapply(tab5rows, class)
  data <- read.table(pipe(paste("cat", name, "| ./myscript.sh", rand)),
                     header = TRUE, colClasses = classes, sep = ",")
  if (save == 1)
  {
    out = paste(name, "Rdata", sep = ".")
    save(data, file = out)
  }
  else
  {
    data
  }
}
Contents of myscript.sh:
#!/bin/sh
awk -v prob="$1" 'BEGIN {srand()} {if(NR==1)print $0; else if(rand() < prob) print $0;}'
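As an aside, the same pipe pattern can be tried interactively outside readfile; a sketch, assuming the script is executable and a hypothetical file.csv (keeping roughly 10% of rows plus the header):
dat <- read.table(pipe("cat file.csv | ./myscript.sh 0.1"),
                  header = TRUE, sep = ",")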
As an extension to this, I needed to read the file incrementally. Say, if the file had 10 lines at 10 am and 100 lines at 11 am, I needed those newly added 90 lines plus the header (without which I would not be able to implement further R processing). I made a change to the readfile function using the command:
data <- read.table(pipe(paste("(head -n1 && tail -n", skip, ") <", name, "| ./myscript.sh", rand)),
                   header = TRUE, colClasses = classes, sep = ",")
Here skip gives me the number of lines to be tailed (calculated by some other script; let's say I have these already). I call this function readfileIncrementally.
a, b, c, d are csv files, each with 18 columns. Now I run this inside a for loop, say for i in a b c d.
a, b, c, d are 4 files which have different values of skip; let's say skip = 10,000 for a and 20,000 for b. If I run these individually (not in the for loop), it runs fine. But in the case of the loop it gives me an error in scan: line "n" does not have 18 columns. Usually this happens when the skip value is greater than 3000 (approx).
However, I cross-checked the number of columns using the command awk -F "," 'NF != 18' ./a.csv; the file surely has 18 columns.
It looks like a timing issue to me. Is there any way to give R the required amount of time before going on to the next file? Or is there anything I'm missing? On running individually it runs fine (takes a few seconds though).
data <- read.table(pipe(paste("(head -n1 && tail -n", skip, "| head -n", as.integer(skip) - 1,
                              ") <", name, "| ./myscript.sh", rand)),
                   header = TRUE, colClasses = classes, sep = ",")
worked for me. Basically the last line was not getting written completely by the time R was reading the file, and hence R displayed the error that line n didn't have 18 columns. Making it read one line less works fine for me.
Apart from this I didn't find any R feature to overcome such scenarios.

Load a small random sample from a large csv file into R data frame

The csv file to be processed does not fit into the memory. How can one read ~20K random lines of it to do basic statistics on the selected data frame?
You can also just do it in the terminal with perl.
perl -ne 'print if (rand() < .01)' biglist.txt > subset.txt
This won't necessarily get you exactly 20,000 lines. (Here it'll grab about .01 or 1% of the total lines.) It will, however, be really really fast, and you'll have a nice copy of both files in your directory. You can then load the smaller file into R however you want.
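If you would rather not create an intermediate file, the same filter can be run through a pipe from within R (a sketch; it assumes perl is on the PATH, a hypothetical bigfile.csv, and a header line you want to keep):
DF <- read.csv(pipe("perl -ne 'print if ($. == 1 || rand() < .01)' bigfile.csv"))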
Try this based on examples 6e and 6f on the sqldf github home page:
library(sqldf)
DF <- read.csv.sql("x.csv", sql = "select * from file order by random() limit 20000")
See ?read.csv.sql and use other arguments as needed based on the particulars of your file.
This should work:
RowsInCSV = 10000000 # or however many rows there are
List <- lapply(1:20000, function(x) read.csv("YourFile.csv", nrows = 1,
                                             skip = sample(RowsInCSV, 1), header = FALSE))
DF = do.call(rbind, List)
The following can be used in case you have an ID or something similar in your data.
Take a sample of IDs, then take the subset of the data using the sampled ids.
sampleids <- sample(data$id,1000)
newdata <- subset(data, data$id %in% sampleids)
