Quickest way to read a subset of rows of a CSV in R

I have a 5 GB CSV with 2 million rows. The header is a row of comma-separated strings, and each data row consists of comma-separated doubles with no missing or corrupted data. The file is rectangular.
My objective is to read a random 10% (with or without replacement, doesn't matter) of the rows into RAM as fast as possible. An example of a slow solution (but faster than read.csv) is to read in the whole matrix with fread and then keep a random 10% of the rows.
require(data.table)
X <- data.matrix(fread('/home/user/test.csv'))  # reads the full data.matrix
X <- X[sample(nrow(X), round(nrow(X) / 10)), ]  # keep a random 10% of the rows
However, I'm looking for the fastest possible solution (this one is slow because the whole file has to be read in first and then trimmed afterwards).
The solution deserving of a bounty will give system.time() estimates of different alternatives.
Other:
I am using Linux
I don't need exactly 10% of the rows. Just approximately 10%.

I think this should be pretty quick, but let me know, since I have not tried it with big data yet.
write.csv(iris,"iris.csv")
fread("shuf -n 5 iris.csv")
V1 V2 V3 V4 V5 V6
1: 37 5.5 3.5 1.3 0.2 setosa
2: 88 6.3 2.3 4.4 1.3 versicolor
3: 84 6.0 2.7 5.1 1.6 versicolor
4: 125 6.7 3.3 5.7 2.1 virginica
5: 114 5.7 2.5 5.0 2.0 virginica
This takes a random sample of N=5 for the iris dataset.
To avoid the chance of the header row being drawn into the sample, this might be a useful modification:
fread("tail -n+2 iris.csv | shuf -n 5", header=FALSE)

Here's a file with 100000 lines in it like this:
"","a","b","c"
"1",0.825049088569358,0.556148858508095,0.591679535107687
"2",0.161556158447638,0.250450366642326,0.575034103123471
"3",0.676798462402076,0.0854280597995967,0.842135070590302
"4",0.650981109589338,0.204736212035641,0.456373531138524
"5",0.51552157686092,0.420454133534804,0.12279288447462
$ wc -l d.csv
100001 d.csv
So that's 100,000 data lines plus a header. We want to keep the header, and keep each data line whenever a random number drawn between 0 and 1 is greater than 0.9.
$ awk 'NR==1 {print} ; rand()>.9 {print}' < d.csv >sample.csv
check:
$ head sample.csv
"","a","b","c"
"12",0.732729186303914,0.744814146542922,0.199768838472664
"35",0.00979996216483414,0.633388962829486,0.364802648313344
"36",0.927218825090677,0.730419414117932,0.522808947600424
"42",0.383301998255774,0.349473554175347,0.311060158303007
and it has 10027 lines:
$ wc -l sample.csv
10027 sample.csv
This took 0.033 s of real time on my 4-year-old box; the HD speed is probably the limiting factor here. It should scale linearly, since the file is dealt with strictly line by line.
You then read in sample.csv using read.csv or fread as desired:
> s = fread("sample.csv")
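For the question's 5 GB file you could also skip the intermediate sample.csv and pipe awk straight into fread (a sketch, with the path assumed from the question; the srand() call is an addition so repeated runs don't reuse awk's default seed):
require(data.table)
# keep the header (NR==1) plus roughly 10% of the data lines
s <- fread(cmd = "awk 'BEGIN{srand()} NR==1 || rand() < 0.1' /home/user/test.csv")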

You could use sqldf::read.csv.sql and an SQL command to pull the data in:
library(sqldf)
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE) # write a csv file to test with
read.csv.sql("iris.csv","SELECT * FROM file ORDER BY RANDOM() LIMIT 10")
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1 6.3 2.8 5.1 1.5 virginica
2 4.6 3.1 1.5 0.2 setosa
3 5.4 3.9 1.7 0.4 setosa
4 4.9 3.0 1.4 0.2 setosa
5 5.9 3.0 4.2 1.5 versicolor
6 6.6 2.9 4.6 1.3 versicolor
7 4.3 3.0 1.1 0.1 setosa
8 4.8 3.4 1.9 0.2 setosa
9 6.7 3.3 5.7 2.5 virginica
10 5.9 3.2 4.8 1.8 versicolor
It doesn't calculate the 10% for you, but you can choose an absolute limit on the number of rows to return.
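If you want roughly 10% instead of a fixed count, one variation (a sketch; read.csv.sql is backed by SQLite by default, where RANDOM() returns a 64-bit integer) is to filter in the WHERE clause:
library(sqldf)
# keeps a row whenever its random integer is divisible by 10, i.e. ~10% of rows
read.csv.sql("iris.csv", "SELECT * FROM file WHERE RANDOM() % 10 = 0")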

Related

Is it possible to combine parameters to a subset function that is generated programmatically in R?

Before my question, here is a little background.
I am creating a general purpose data shaping and charting library for plotting survey data of a particular format.
As part of my scripts, I am using the subset function on my data frame. I have a parameter file through which I can pass subsetting criteria into my functions (so I don't need to edit my main library directly). I do this as follows:
subset_criteria <- expression(variable1 != "" & variable2 == TRUE)
(where variable1 and variable2 are columns in my data frame, for example).
Then in my function, I call this as follows:
my.subset <- subset(my.data, eval(subset_criteria))
This part works exactly as I want it to. But now I want to augment that subsetting criteria inside the function, based on some other calculations that can only be performed inside the function. So I am trying to find a way to combine these subsetting expressions.
Imagine inside my function I create some new column in my data frame automatically, and then I want to add a condition to my subsetting that says that this additional column must be TRUE.
Essentially, I do the following:
my.data$newcolumn <- with(my.data, ifelse(...some condition..., TRUE, FALSE))
Then I want my subsetting to end up being:
my.subset <- subset(my.data, eval(subset_criteria & newcolumn == TRUE))
But simply doing what I list above does not seem to be valid; I get the wrong result. So I'm looking for a way of combining these expressions using expression and eval so that I essentially get the conjunction of all the conditions.
Thanks for any pointers. It would be great if I can do this without having to rewrite how I do all my expressions, but I understand that might be what is needed...
Bob
You should probably avoid two things: using subset in a non-interactive setting (see the warning in its help page) and eval(parse()). Here we go.
You can change the expression into a string and append to it whatever you want. The trick is to convert the string back to an expression; this is where the aforementioned parse comes in.
sub1 <- expression(Species == "setosa")
subset(iris, eval(sub1))
sub2 <- paste(sub1, '&', 'Petal.Width > 0.2')
subset(iris, eval(parse(text = sub2))) # your case
> subset(iris, eval(parse(text = sub2)))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
16 5.7 4.4 1.5 0.4 setosa
17 5.4 3.9 1.3 0.4 setosa
18 5.1 3.5 1.4 0.3 setosa
19 5.7 3.8 1.7 0.3 setosa
20 5.1 3.8 1.5 0.3 setosa
22 5.1 3.7 1.5 0.4 setosa
24 5.1 3.3 1.7 0.5 setosa
27 5.0 3.4 1.6 0.4 setosa
32 5.4 3.4 1.5 0.4 setosa
41 5.0 3.5 1.3 0.3 setosa
42 4.5 2.3 1.3 0.3 setosa
44 5.0 3.5 1.6 0.6 setosa
45 5.1 3.8 1.9 0.4 setosa
46 4.8 3.0 1.4 0.3 setosa
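If you would rather avoid parse() altogether, one sketch is to keep the conditions as unevaluated calls and combine them with call():
sub1     <- quote(Species == "setosa")
extra    <- quote(Petal.Width > 0.2)
combined <- call("&", sub1, extra)   # Species == "setosa" & Petal.Width > 0.2
iris[eval(combined, iris), ]         # evaluate the combined condition inside the data frame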

R package: read in data via system.file() and read.table() from R data.table

I'm creating an R package with several files in data/. The way one loads a file in an R package is to use system.file():
system.file(..., package = "base", lib.loc = NULL, mustWork = FALSE)
The file in data/ that I would like to load into an R data.table has the extension .txt.gz (my_file.txt.gz). How do I load it into a data.table via read.table() or fread()?
Within the R script, I tried :
#' @import data.table
#' @export
my_function = function(){
  my_table = read.table(system.file("data", "my_file.txt.gz", package = "FusionVizR"), header = TRUE)
}
This leads to an error when running devtools::document():
Error in read.table(system.file("data", "my_file.txt.gz", package = "FusionVizR"), header = TRUE) (from script1.R#7) :
no lines available in input
In addition: Warning message:
In file(file, "rt") :
file("") only supports open = "w+" and open = "w+b": using the former
I appear to get the same issue via fread()
#' @import data.table
#' @export
my_function = function(){
  my_table = fread(system.file("data", "my_file.txt.gz", package = "FusionVizR"), header = TRUE)
}
This outputs the error:
Input is either empty or fully whitespace after the skip or autostart. Run again with verbose=TRUE.
So it appears that system.file() doesn't give me a path to the file that I can load into an R data.table. How do I do this?
Do yourself a HUGE favour and study fread() closely: it is one of the very best features in data.table. I have examples (at work) of reading from a pipe of other commands, of reading compressed data, and more.
Here is a simple mock example:
R> write.csv(iris, file="/tmp/demo.csv")
R> system("gzip /tmp/demo.csv") # to be very plain
R> fread("zcat /tmp/demo.csv.gz")
V1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 1 5.1 3.5 1.4 0.2 setosa
2: 2 4.9 3.0 1.4 0.2 setosa
3: 3 4.7 3.2 1.3 0.2 setosa
4: 4 4.6 3.1 1.5 0.2 setosa
5: 5 5.0 3.6 1.4 0.2 setosa
---
146: 146 6.7 3.0 5.2 2.3 virginica
147: 147 6.3 2.5 5.0 1.9 virginica
148: 148 6.5 3.0 5.2 2.0 virginica
149: 149 6.2 3.4 5.4 2.3 virginica
150: 150 5.9 3.0 5.1 1.8 virginica
R>
It seems in my haste I wrote one column too many (the row names), but you get the idea.
Now, you don't even need fread (though it is still more powerful than the alternatives):
R> head(read.csv(file="/tmp/demo.csv.gz"))
X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 1 5.1 3.5 1.4 0.2 setosa
2 2 4.9 3.0 1.4 0.2 setosa
3 3 4.7 3.2 1.3 0.2 setosa
4 4 4.6 3.1 1.5 0.2 setosa
5 5 5.0 3.6 1.4 0.2 setosa
6 6 5.4 3.9 1.7 0.4 setosa
R>
R figured out by itself that it needed to decompress the file.
Edit: I was editing this question earlier when it was deleted under me, which is about as de-motivating as it gets. In a nutshell:
system.file() works, e.g. file <- system.file("rawdata", "population.csv", package="gunsales") does return the complete path, as the file exists: "/usr/local/lib/R/site-library/gunsales/rawdata/population.csv". But this is easy to mess up. (Needless to say, I do have the package and the file.)
Look into the data/ directory and what Writing R Extensions says about it. It is a good mechanism.
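Building on that last point, one common arrangement (a sketch, not taken from the answer itself) is to ship raw files under inst/extdata/ rather than data/, since data/ is reserved for objects loaded with data(). After installation the file sits under extdata/ and system.file() returns its full path:
path <- system.file("extdata", "my_file.txt.gz",
                    package = "FusionVizR", mustWork = TRUE)  # error out if the file is missing
my_table <- read.table(path, header = TRUE)   # file() connections decompress .gz transparently
# or, with a recent data.table (its .gz support needs the R.utils package):
# my_table <- data.table::fread(path, header = TRUE)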

Smart spreadsheet parsing (managing group sub-header and sum rows, etc)

Say you have a set of spreadsheets formatted with group sub-header rows and summary ("Mean") rows (the original post showed a screenshot; the demo CSV further down reproduces the layout).
Is there an established method/library to parse this into R without having to individually edit the source spreadsheets? The aim is to parse out the group header rows and dispense with the sum rows, so the output is the raw data, like so:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 7.0 3.2 4.7 1.4 versicolor
5 6.4 3.2 4.5 1.5 versicolor
6 6.9 3.1 4.9 1.5 versicolor
7 5.7 2.8 4.1 1.3 versicolor
8 6.3 3.3 6.0 2.5 virginica
9 5.8 2.7 5.1 1.9 virginica
10 7.1 3.0 5.9 2.1 virginica
I can certainly hack a tailored solution to this, but I'm wondering if there is something a bit more developed/elegant than read.csv and a load of logic.
Here's a reproducible demo CSV dataset (you can't assume an equal number of lines per group), although I'm hoping the solution can carry over to .xlsx:
,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,,,,
Setosa,,,,
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
Mean,4.9,3.23,1.37,0.2
,,,,
Versicolor,,,,
1,7,3.2,4.7,1.4
2,6.4,3.2,4.5,1.5
3,6.9,3.1,4.9,1.5
Mean,6.77,3.17,4.7,1.47
,,,,
Virginica,,,,
1,6.3,3.3,6,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3,5.9,2.1
Mean,6.4,3,5.67,2.17
There are a variety of ways to present spreadsheets, so it would be hard to have one consistent methodology for all of them. However, it is possible to transform the data once it is loaded into R. Here's an example with your data; it uses the function na.locf from the zoo package.
x <- read.csv(text=",Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,,,,
Setosa,,,,
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
Mean,4.9,3.23,1.37,0.2
,,,,
Versicolor,,,,
1,7,3.2,4.7,1.4
2,6.4,3.2,4.5,1.5
3,6.9,3.1,4.9,1.5
Mean,6.77,3.17,4.7,1.47
,,,,
Virginica,,,,
1,6.3,3.3,6,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3,5.9,2.1
Mean,6.4,3,5.67,2.17", header=TRUE, stringsAsFactors=FALSE)
library(zoo)
x <- x[x$X!="Mean",] #remove Mean line
x$Species <- x$X #create species column
x$Species[grepl("[0-9]",x$Species)] <- NA #put NA if Species contains numbers
x$Species <- na.locf(x$Species) #carry last observation if NA
x <- x[!rowSums(is.na(x))>0,] #remove lines with NA
X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
3 1 5.1 3.5 1.4 0.2 Setosa
4 2 4.9 3.0 1.4 0.2 Setosa
5 3 4.7 3.2 1.3 0.2 Setosa
9 1 7.0 3.2 4.7 1.4 Versicolor
10 2 6.4 3.2 4.5 1.5 Versicolor
11 3 6.9 3.1 4.9 1.5 Versicolor
15 1 6.3 3.3 6.0 2.5 Virginica
16 2 5.8 2.7 5.1 1.9 Virginica
17 3 7.1 3.0 5.9 2.1 Virginica
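Since the question hopes the approach carries over to .xlsx, here is a sketch of the same cleanup applied to a hypothetical workbook iris_groups.xlsx with the identical layout, read via readxl (everything is read as text first so the mixed first column does not confuse the type guessing):
library(readxl)  # read_xlsx()
library(zoo)     # na.locf()
x <- as.data.frame(read_xlsx("iris_groups.xlsx", col_types = "text"))
names(x)[1] <- "X"                                   # the first column has no header
x$Species <- ifelse(grepl("[0-9]", x$X), NA, x$X)    # species names live on the group rows
x$Species <- na.locf(x$Species, na.rm = FALSE)       # carry the group name down the block
x <- x[grepl("^[0-9]+$", x$X) %in% TRUE, ]           # keep only the numbered data rows
x[2:5] <- lapply(x[2:5], as.numeric)                 # measurements back to numeric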
I just recently did something similar. Here was my solution:
iris <- read.csv(text=",Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,,,,
Setosa,,,,
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
Mean,4.9,3.23,1.37,0.2
,,,,
Versicolor,,,,
1,7,3.2,4.7,1.4
2,6.4,3.2,4.5,1.5
3,6.9,3.1,4.9,1.5
Mean,6.77,3.17,4.7,1.47
,,,,
Virginica,,,,
1,6.3,3.3,6,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3,5.9,2.1
Mean,6.4,3,5.67,2.17", header=TRUE, stringsAsFactors=FALSE)
First I used a function which splits a data frame at given row indices.
split_at <- function(x, index) {
  N <- NROW(x)
  s <- cumsum(seq_len(N) %in% index)
  unname(split(x, s))
}
Then you define that index using:
iris[,1] <- stringr::str_trim(iris[,1])
index <- which(iris[,1] %in% c("Virginica", "Versicolor", "Setosa"))
The rest is just using purrr::map_df to perform actions on each data.frame in the list that's returned. You can add some additional flexibility for removing unwanted rows if needed.
library(magrittr)  # for the %>% pipe

split_at(iris, index) %>%
  .[2:length(.)] %>%                          # drop the chunk before the first group
  purrr::map_df(function(x) {
    Species <- x[1, 1]                        # the group name sits in the first row
    x <- x[-1, ]                              # drop the group-name row
    x <- x[!(x[[1]] %in% c("Mean", "")), ]    # drop the Mean row and blank separator rows
    data.frame(x, Species = Species)
  })

Using data.table's fread with varying lengths for blank missing values

I have a dataset with many missing values. Some of the missing values are NAs, some are Nulls, and others have varying lengths of blank spaces. I would like to utilize the fread function in R to be able to read all these values as missing.
Here is an example:
library(data.table)
# Make some fake data
iris <- data.table(iris)[1:5]
# Add missing values non-uniformly
iris[1, Species := '    ']   # several blank spaces
iris[2, Species := ' ']      # a single blank space
iris[3, Species := 'NULL']
# Write to csv and read back in using fread
write.csv(iris, file = "iris.csv")
fread("iris.csv", na.strings = c("NULL", " "))
V1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 1 5.1 3.5 1.4 0.2
2: 2 4.9 3.0 1.4 0.2 NA
3: 3 4.7 3.2 1.3 0.2 NA
4: 4 4.6 3.1 1.5 0.2 setosa
5: 5 5.0 3.6 1.4 0.2 setosa
From the above example, we see that I am unable to account for the first missing value, since it contains many blank spaces. Anyone know of a way to account for this?
Thanks so much for the wonderful answer from @eddi:
fread("sed 's/ *//g' iris.csv",na.strings=c("",NA,"NULL"))

R programming: transferring data from Excel with missing values to R

So I have an Excel spreadsheet with NA values. What is the best way to copy the data and put it into R? I usually use data = read.delim("clipboard"), but because of those missing values I keep getting this error:
Error in if (del == 0 && to == 0) return(to) :
missing value where TRUE/FALSE needed
What are the possible ways I can get rid of this error? I tried putting zeros in place of the NA values, but that rather screws up what the code is doing.
Here's the link to the code that I'm using: R programming fixing error. It was really helpful for my data problems.
I was going to post the whole dataset, but there's a limit of 30,000 characters.
You need to set the option fill to TRUE. When rows have unequal length, this adds NA fields so they can still be read.
read.table(fileName,header=TRUE,fill=TRUE)
fileName here is the path to your exported file, for example fileName = 'c:/temp/myfile.csv' (in R paths use forward slashes or doubled backslashes).
This should also work with read.delim, which is a wrapper around read.table. You can also give read.table a string, but then you set the text argument rather than the file one. For example:
read.table(text = ' Time Speed Time Speed
0.8 2.9 0.3 2.7
1.3 2.8 0.9 2.7
1.7 2.3 2.5 3.1
2.0 0.6
2.3 1.7 13.6 3.3
3.0 1.4 15.1 3.5
3.5 1.3 17.5 3.3',head=T,fill=T)
Time Speed Time.1 Speed.1
1 0.8 2.9 0.3 2.7
2 1.3 2.8 0.9 2.7
3 1.7 2.3 2.5 3.1
4 2.0 0.6 NA NA
5 2.3 1.7 13.6 3.3
6 3.0 1.4 15.1 3.5
7 3.5 1.3 17.5 3.3
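Applied back to the asker's clipboard workflow this would be (a sketch; "clipboard" is the Windows connection name, on other platforms read from the exported file or a pipe instead). Note that read.delim already defaults to fill = TRUE, so spelling it out mainly documents the intent:
data <- read.delim("clipboard", header = TRUE, fill = TRUE)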
