I am currently attempting to use the read_table() function from the readr package on a few large data files. I only want the second column, so I drop all the other columns with this argument in the function call:
col_types = c(paste("_", "c", paste(rep("_", 20000), sep = "", collapse = ""), sep = "", collapse = ""))
EDIT: Note that the first and third quoted strings in the code above are underscores ("_"), not empty strings.
However, read_table seems to insist on reading in the entire data file (using up excessive memory and causing a crash) instead of just reading in column 2.
With read.table(), I have tried a similar argument, colClasses = c("NULL", "character", rep("NULL", 20000)), which works perfectly without taking up excess memory, but I would like to use read_table since it is supposedly faster. Any ideas on why read_table is using so much memory even though I am including an argument to keep only one column?
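For reference, the same column specification can be written more compactly (a sketch; it assumes the file really has 20,002 columns, i.e. skip one, keep one as character, skip the remaining 20,000, and "name_of_file.txt" is a placeholder):
library(readr)

# Compact col_types string: "_" skips a column, "c" reads it as character.
# strrep() is base R (>= 3.3.0).
spec <- paste0("_c", strrep("_", 20000))
dat  <- read_table("name_of_file.txt", col_types = spec)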
If you only want to read the second column of a large data file, you can also use the fread function from the data.table package, which was likewise developed for (very) fast file reading.
fread has a select argument with which you can specify which columns to load. In your case it would be something like:
dt <- fread("name_of_file.csv", select=2)
This selects only the second column. You can also give it a vector of columns:
dt <- fread("name_of_file.csv", select=c(2,5,10))
or a vector of column names:
dt <- fread("name_of_file.csv", select=c("id","time"))
Related
I'm currently trying to read a 20GB file. I only need 3 columns of that file.
My problem is that I'm limited to 16 GB of RAM. I tried using readr, processing the data in chunks with read_csv_chunked and with read_csv plus the skip parameter, but both exceeded my RAM limit.
Even the read_csv(file, ..., skip = 10000000, n_max = 1) call that reads a single line uses up all my RAM.
My question now is, how can I read this file? Is there a way to read chunks of the file without using that much ram?
The LaF package can read ASCII data in chunks. It can be used directly, or, if you are using dplyr, the chunked package builds on it to provide a dplyr-style interface.
The readr package has read_csv_chunked and related functions.
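For example, a minimal sketch with readr's chunked reader (the file name and chunk size are placeholders; it assumes the wanted columns are the first three):
library(readr)

# Keep only the first three columns of each chunk; DataFrameCallback
# stacks the processed chunks into one data frame at the end.
keep_first3 <- function(chunk, pos) chunk[, 1:3]

dat <- read_csv_chunked("myfile.csv",
                        callback = DataFrameCallback$new(keep_first3),
                        chunk_size = 100000)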
The section of this web page entitled The Loop as well as subsequent sections of that page describes how to do chunked reads with base R.
It may be that if you remove all but the first three columns, the data will be small enough to just read in and process in one go.
vroom in the vroom package can read files very quickly and can read in just the columns named in its col_select= argument, which may make the data small enough to read in one go.
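For instance, a minimal sketch (the file name is a placeholder; col_select also accepts column names):
library(vroom)

# Keep only the first three columns of the file.
dat <- vroom("myfile.csv", col_select = 1:3)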
fread in the data.table package is a fast reading function that also supports a select= argument which can select only specified columns.
read.csv.sql in the sqldf package (also see its github page) can read a file larger than R can handle into a temporary external SQLite database, which it creates for you and removes afterwards, and then reads the result of the given SQL statement into R. If the first three columns are named col1, col2 and col3, then try the code below. See ?read.csv.sql and ?sqldf for the remaining arguments, which will depend on your file.
library(sqldf)
DF <- read.csv.sql("myfile", "select col1, col2, col3 from file",
dbname = tempfile(), ...)
read.table and read.csv in base R have a colClasses= argument which takes a vector of column classes. If the file has nc columns, then use colClasses = rep(c(NA, "NULL"), c(3, nc - 3)) to read only the first 3 columns.
Another approach is to pre-process the file outside of R using cut, sed or awk (available natively on UNIX and in the Rtools bin directory on Windows), or any of a number of free command line utilities such as csvfix, to remove all but the first three columns, and then see if that makes it small enough to read in one go.
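For example, the pre-processing can even be driven from within R through a pipe() connection (a sketch; it assumes a comma-separated file and that cut is available on the PATH):
# cut -d, -f1-3 keeps only the first three comma-separated fields,
# so read.csv never sees the remaining columns.
dat <- read.csv(pipe("cut -d, -f1-3 myfile.csv"))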
Also check out the High Performance Computing task view.
We can try something like this, first a small example csv:
X = data.frame(id = 1:1e5, matrix(runif(1e6), ncol = 10))
write.csv(X, "test.csv", quote = FALSE, row.names = FALSE)
You can use the nrows argument of read.csv: instead of providing a file name, you provide a connection, and you store the results inside a list, for example:
data = vector("list",200)
con = file("test.csv","r")
data[[1]] = read.csv(con, nrows=1000)
dim(data[[1]])
COLS = colnames(data[[1]])
data[[1]] = data[[1]][,1:3]
head(data[[1]])
id X1 X2 X3
1 1 0.13870273 0.4480100 0.41655108
2 2 0.82249489 0.1227274 0.27173937
3 3 0.78684815 0.9125520 0.08783347
4 4 0.23481987 0.7643155 0.59345660
5 5 0.55759721 0.6009626 0.08112619
6 6 0.04274501 0.7234665 0.60290296
In the above, we read the first chunk, collected the column names and subsetted to the first three columns. If you carry on reading from the same connection, the header will not be repeated, so we need to tell read.csv that:
for(i in 2:200){
  data[[i]] = read.csv(con, nrows = 1000, col.names = COLS, header = FALSE)[, 1:3]
}
close(con)
Finally, we bind all of those into a data.frame:
data = do.call(rbind,data)
all.equal(data[,1:3],X[,1:3])
[1] TRUE
You can see that I specified a much larger list than required; this is to show that if you don't know how long the file is, specifying something larger still works, because rbind simply drops the unused NULL entries. This is a bit better than writing a while loop.
So we wrap it into a function, specifying the file, the number of rows to read at one go, the number of times to read, and the column names (or positions) to subset:
read_chunkcsv = function(file, rows_to_read, ntimes, col_subset){
  data = vector("list", ntimes)          # one slot per chunk
  con = file(file, "r")
  on.exit(close(con))                    # make sure the connection is closed
  data[[1]] = read.csv(con, nrows = rows_to_read)
  COLS = colnames(data[[1]])
  data[[1]] = data[[1]][, col_subset]
  for(i in 2:ntimes){
    data[[i]] = read.csv(con, nrows = rows_to_read,
                         col.names = COLS, header = FALSE)[, col_subset]
  }
  return(do.call(rbind, data))
}
all.equal(X[,1:3],
read_chunkcsv("test.csv",rows_to_read=10000,ntimes=10,1:3))
I have many years of data to read from .txt (tab-delimited) files into data.frame or data.table format to work with in R. For each year, quarterly files need to be appended. My searching has turned up some nice code to find all quarterly files and, using fread and bind_rows, create one annual file. #Maiasaura
One oddity I've found: using fread instead of read.table leads to different classes for some columns. pat_age should be character ("00", "01", "02"). read.table handles this as expected, while fread reads it as integer, so I've added colClasses to control the pat_age class.
Unfortunately, column names across the quarterly files are sometimes upper case and sometimes lower case (PAT_AGE vs pat_age). Is there any way to control that as I read in the .txt files? colClasses with tolower didn't work for me.
tabtest <- list.files(pattern = ".*PUDF.*base.*tab.*", full.names = TRUE) %>%
  lapply(fread, header = TRUE, colClasses = c(pat_age = "character")) %>%
  dplyr::bind_rows()
I expect messy data - and may need to adjust other column names and classes as I move from year to year.
NOTE: Am I correct that if I can't change case within the lapply statement, I'd need to do it in the .txt files themselves? The colClasses argument requires "pat_age" to be lower case across all files.
NOTE: Came across this question:
fread (data.table) select columns, throw error if column not found
Could it be modified to read and modify the header - and then read the entire .txt file with corrected headers?
Latest attempt below; I think it might work okay. Lots of effort/syntax just to change the case of column names!
read_cols <- function(x) {
  titles <- fread(x, nrows = 0, header = TRUE, stringsAsFactors = FALSE)
  var.names <- tolower(colnames(titles))
  rest <- fread(x, skip = 1)
  names(rest) <- var.names
  return(rest)
}
tabtest2 <- list.files(pattern = ".*PUDF.*base.*tab.*", full.names = TRUE) %>%
  lapply(read_cols) %>%
  dplyr::bind_rows()
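An alternative sketch along the same lines (read_cols_cc is just an illustrative name): read only the header first, find the pat_age column whatever its case, pass that exact name to colClasses so the leading zeros survive, then lower-case all the names.
library(data.table)

read_cols_cc <- function(x) {
  hdr     <- names(fread(x, nrows = 0, header = TRUE))
  age_col <- hdr[tolower(hdr) == "pat_age"]          # "PAT_AGE" or "pat_age"
  dt <- fread(x, header = TRUE,
              colClasses = setNames("character", age_col))
  setnames(dt, tolower(names(dt)))                   # lower-case every name
  dt
}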
Thank you.
My issue is likely with how I'm exporting the data from the for loop, but I'm not sure how to fix it.
I've got over 200 files in a folder, all structured in the same way, from each of which I'd like to pull the maximum value of a single column. I've made a for loop to do this based on code from here: http://www.r-bloggers.com/looping-through-files/
What I have running so far looks like this:
fileNames<-Sys.glob("*.csv")
for(i in 1:length(fileNames)){
  data <- read.csv(fileNames[i])
  VelM = max(data[, 8], na.rm = TRUE)
  write.table(VelM, "Summary", append = TRUE, sep = ",",
              row.names = FALSE, col.names = FALSE)
}
This works, but I need to figure out a way to have a second column in my summary file that contains the original file name the data in that row came from for reference.
I tried making both a matrix and a data frame instead of going straight to the table writing, but in both cases I wasn't able to append the data and ended up with values from only the last file.
Any ideas would be greatly appreciated!
Here's what I would recommend to improve your current method, also going with fread() because it's very fast and has the select argument. Notice I have moved the write.table() call outside the for() loop. This allows a cleaner way of adding the new column of file names alongside the max column, and eliminates the need to append to the file on every iteration.
library(data.table)
fileNames <- Sys.glob("*.csv")
VelM <- numeric(length(fileNames))
for(i in seq_along(fileNames)) {
  VelM[i] <- max(fread(fileNames[i], select = 8)[[1L]], na.rm = TRUE)
}
write.table(data.frame(VelM, fileNames), "Summary", sep = ",",
row.names = FALSE, col.names = FALSE)
If you want to quickly read files, you should consider using data.table::fread or readr::read_csv instead of base read.csv.
For example:
fileNames <- list.files(path = your_path, pattern = '\\.csv', full.names = TRUE) # instead of Sys.glob
library('data.table')
dt <- rbindlist(lapply(fileNames, fread, select = 8), idcol = TRUE)
result <- dt[, .(max_val = max(your_var)), by = .id]
write.table(result, 'yourfile.csv', sep = ',', row.names = FALSE, col.names = FALSE)
Explanation: data.table::fread reads in only the 8th column (select = 8) from each file (via lapply over fileNames, which returns a list of data.tables). Then data.table::rbindlist combines this list of one-column data.tables into a single data.table, and idcol = TRUE adds an extra .id column recording which list element each row came from. From ?rbindlist, note that
If input is a named list, ids are generated using them
Here lapply returns an unnamed list, so the ids fall back to the integers 1, 2, 3, ..., i.e. the index into fileNames, which is an easy way of recovering the source file for grouping.
The rest is data.table syntax. It wasn't clear from your question if there is a header row and whether you know the heading in advance. If so, you can either keep header=TRUE and use the header name for your_var, or you can do skip=1, header=FALSE, col.names = 'your_var'.
I have a data.table in R that I'm trying to write out to a .txt file, and then input back into R.
It's a sizeable table of 6.5M observations and 20 variables, so I want to use fread().
When I use
write.table(data, file = "data.txt")
a table of about 2.2GB is written in data.txt. In manually inspecting it, I can see that there are column names, that it's separated by " ", and that there are quotes on character variables. So everything should be fine.
However,
data <- fread("data.txt")
returns a data.table of 6.5M observations and 1 variable. OK, maybe for some reason fread() isn't automatically understanding the separator string:
data <- fread("data.txt", sep = " ")
All the data is in the proper variables now, but:
- R has added an unnecessary row-number column
- in one (only one) of my columns, all NAs have been replaced by 9218868437227407266
- all variable names are missing
Maybe fread() isn't recognizing the header, somehow.
data <- fread("data.txt", sep = " ", header = T)
Now my first set of observations is my column names. Not very useful.
I'm completely baffled. Does anyone understand what's happening here?
EDIT:
row.names = F solved the names problem, thanks Ananda Mahto. (With the default row.names = TRUE, write.table writes a header line with one fewer field than the data lines, which is what was throwing fread off.)
Ran
datasub <- data[runif(1000,1,6497651), ]
write.table(datasub, file = "datasub.txt", row.names = F)
fread("datasub.txt")
fread() seems to work fine for the smaller dataset.
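For anyone hitting the same thing, here is a tiny sketch of why the default row names trip up fread: the header line has one fewer field than the data lines.
# write.table defaults to row.names = TRUE
write.table(data.frame(a = 1:2, b = 3:4), "tmp.txt")
readLines("tmp.txt")
# [1] "\"a\" \"b\""   "\"1\" 1 3"   "\"2\" 2 4"
# Header: 2 fields; data rows: 3 fields. fread has to guess how to reconcile
# that, which is where the odd column shifts come from.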
EDIT:
Here is the subset of data I created above:
https://github.com/cbcoursera1/ExploratoryDataAnalysisProject2/blob/master/datasub.txt
This data comes from the National Emissions Inventory (NEI) and is made available by the EPA. More information is available here:
http://www.epa.gov/ttn/chief/eiinformation.html
EDIT:
I can no longer reproduce this issue. It may be that row.names = F solved the issue, or possibly restarting R/clearing my environment/something random fixed the problem.
I have a .csv files with many rows. However, I want to read only the first row in a vector format. I know that this works:
names(read.csv("file.csv",nrows=1L))
However, it creates a data.frame first before extracting the names, which seems very inefficient. Weirdly, this doesn't seem to work:
names(read.csv("file.csv",nrows=0L))
I also tried using strsplit(readLines()), but the row contains quotes, which come through with embedded backslashes, so this method doesn't work.
I have also tried using fread, but it is as slow as read.csv.
Does anyone have a solution to this problem? For reference, here's what the first row looks like:
"Timestamp","Parameter_1","Parameter_2","Parameter_3"
con <- file("somefile.csv")
st <- scan(con, what = "", nlines = 1, sep=",", quote = "\"",)
class(st): returns a character vector
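If the earlier fread attempt was slow because it parsed the whole file, a sketch that restricts it to the header only (the file name is a placeholder):
library(data.table)

# nrows = 0 reads just the header, so only the first line is parsed.
hdr <- names(fread("file.csv", nrows = 0))
hdr
# e.g. "Timestamp" "Parameter_1" "Parameter_2" "Parameter_3"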