read.csv to import more than 2105 columns?

Leaving question for archival purposes only.
(Read.csv did read in all columns, I just did not see them in the preview when opening the data.frame)
Related to this question:
Maximum number of columns that can be read using read.csv
I would like to import a csv file into R that contains about 3200 columns (100 rows). I am used to working with data.frames and read.csv, but my usual approach failed because
data <- read.csv("data.csv", header=TRUE)
only imported the first 2105 columns. It did not display an error message.
How can I read in a csv file with more than 2105 columns?
without specifying column classes
into a data frame
the file contains different data types (dates, strings, numbers, ..)
speed is not my biggest concern
I did not manage to apply the solutions in Quickly reading very large tables as dataframes in R to my situation. Tried this, but it does not seem to work without information on column classes:
df <- as.data.frame(scan("data.csv",sep=','))
There are already several questions about reading in large datafiles with millions of rows/columns and how to speed up the process, but my files are much smaller, so I am hoping that there is an easier solution that I overlooked.

Try using data.table.
library(data.table)
data <- fread("data.csv")
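fread returns a data.table rather than a plain data.frame. If a data.frame is needed (as the question asks), a quick follow-up conversion, assuming the read above worked, is:
setDF(data)                       # convert the data.table to a data.frame in place
# or equivalently: data <- as.data.frame(data)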

(Posted answer on behalf of the OP).
read.csv did read in all columns; I just did not see them in the preview when opening the data.frame.

Related

Processing very large files in R

I have a dataset that is 188 million rows with 41 columns. It comes as a massive compressed fixed width file and I am currently reading it into R using the vroom package like this:
d <- vroom_fwf('data.dat.gz',
               fwf_positions([41 column position],
                             [41 column names]))
vroom does a wonderful job here in the sense that the data are actually read into an R session on a machine with 64Gb of memory. When I run object.size on d, it is a whopping 61Gb in size. When I turn around to do anything with this data, I can't. All I get back is Error: cannot allocate vector of size {x} Gb, because there really isn't any memory left to do much of anything with that data. I have tried base R with [, dplyr::filter, and converting to a data.table via data.table::setDT, each with the same result.
So my question is: what are people's strategies for this type of thing? My main goal is to convert the compressed fixed-width file to parquet format, but I would like to split it into smaller, more manageable files based on the values of a column in the data and then write them to parquet (using arrow::write_parquet).
My idea at this point is to read a subset of columns, keeping the column that I want to subset by, write the parquet files, and then bind/merge the columns back together. This seems like a more error-prone solution, though, so I thought I would turn here and see what else is available.
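For what it's worth, a hedged sketch of that column-subset idea, with made-up positions and a made-up split column name ("split_col"): fixed-width readers only read the positions you list, and arrow::write_dataset writes one parquet partition per value of the partitioning column, so the split and the write happen in one step.
library(vroom)
library(arrow)

# Hypothetical positions/names: read only the columns needed for one pass,
# always including the column used for splitting.
pos <- fwf_positions(start = c(1, 10, 25),
                     end   = c(9, 24, 40),
                     col_names = c("split_col", "var1", "var2"))
d_sub <- vroom_fwf("data.dat.gz", col_positions = pos)

# Write a parquet directory partitioned by the values of split_col.
write_dataset(d_sub, "parquet_out", format = "parquet",
              partitioning = "split_col")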

Is it possible to import a subset of big .rds or .feather files into R?

I've found good tips about fast ways to import files into R, but I'm wondering if it is possible to import only a subset of a given file into a variable.
In my case, I have a file with 16 million rows saved as .rds (and also as .feather, as I was playing with the speed of both formats) and I'd like to import a subset of it (say, a few rows or a few columns) for initial analysis.
Is it possible? readRDS() does not seem to accept any subsetting, while read_feather() does not seem to allow row selection (although you can specify the columns). Should I consider another data format?
The short answer is 'no'. A nice alternative is the fst file format, which does allow the retrieval of a selection of columns and rows from a large dataset. More info here.
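A minimal sketch of the fst route, assuming the data is re-saved once in fst format (the object and file names here are made up):
library(fst)

write_fst(big_df, "data.fst")            # one-time conversion of the full data

# Read only two columns and only rows 1 to 10000 of the saved file.
sub <- read_fst("data.fst",
                columns = c("col_a", "col_b"),
                from = 1, to = 10000)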
Using readr::read_csv you could use the n_max parameter and read only as many rows as you like.
With readRDS, I suppose you could read the whole file, take a sample with dplyr::sample_n, and then erase the full object from memory with rm(object).
If you cannot read the whole file into memory, you could use either sqlite or another database (which is the preferred way), or you could try something along the lines of readr::read_delim_chunked, which allows you to read a file in chunks, do something with each chunk (like sample_n), delete the chunk from memory, keep just the callback's result, and go on like that until the file is over.
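A hedged sketch of that chunked approach with readr (the file name, sample size, and chunk size are arbitrary); the callback keeps only a small sample from each chunk, so the full file never sits in memory:
library(readr)
library(dplyr)

# Keep up to 100 random rows from every chunk of 100000 rows;
# only the combined samples are retained.
cb <- DataFrameCallback$new(function(chunk, pos) {
  sample_n(chunk, min(100, nrow(chunk)))
})
sampled <- read_delim_chunked("big_file.csv", callback = cb,
                              delim = ",", chunk_size = 100000)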

Reading subset of large data

I have a LARGE dataset with over 100 million rows. I only want to read the part of the data that corresponds to one particular level of a factor, say column1 == A. How do I accomplish this in R using read.csv?
Thank you
You can't filter rows using read.csv. You might try sqldf::read.csv.sql as outlined in answers to this question.
But I think most people would process the file using another tool first. For example, csvkit allows filtering by rows.
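A hedged sketch of the sqldf route (the file name is made up; inside the SQL string the data is always referred to as file):
library(sqldf)

# Only rows where column1 equals 'A' are ever loaded into R.
subset_A <- read.csv.sql("bigdata.csv",
                         sql = "select * from file where column1 = 'A'")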

Load large datasets into data frame [duplicate]

This question already has answers here:
Quickly reading very large tables as dataframes
(12 answers)
Closed 8 years ago.
I have a dataset stored in a text file; it has 997 columns and 45000 rows. All values are doubles except the row names and column names. I used RStudio with the read.table command to read the data file, but it seemed to be taking hours, so I aborted it.
Even Excel opens it in about 2 minutes.
RStudio seems to lack efficiency for this task. Any suggestions on how to make it faster? I don't want to read the data file every time.
I plan to load it once and store it in an .RData object, which should make loading the data faster in the future. But the first load does not seem to work.
I am not a computer graduate, so any help will be much appreciated.
I recommend data.table, although you will end up with a data.table after this. If you prefer not to work with a data.table, you can simply convert it back to a normal data frame.
require(data.table)
data <- fread('yourpathhere/yourfile')
As documented in the ?read.table help file there are three arguments that can dramatically speed up and/or reduce the memory required to import data. First, by telling read.table what kind of data each column contains you can avoid the overhead associated with read.table trying to guess the type of data in each column. Secondly, by telling read.table how many rows the data file has you can avoid allocating more memory than is actually required. Finally, if the file does not contain comments, you can reduce the resources required to import the data by telling R not to look for comments. Using all of these techniques I was able to read a .csv file with 997 columns and 45000 rows in under two minutes on a laptop with relatively modest hardware:
tmp <- data.frame(matrix(rnorm(997*45000), ncol = 997))
write.csv(tmp, "tmp.csv", row.names = FALSE)
system.time(x <- read.csv("tmp.csv", colClasses="numeric", comment.char = ""))
#    user  system elapsed
# 115.253   2.574 118.471
I tried reading the file using the default read.csv arguments, but gave up after 30 minutes or so.
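For completeness, a variant of the call above that also passes nrows (45000 is the row count stated in the question; supplying it in advance lets read.csv avoid over-allocating memory):
x <- read.csv("tmp.csv", colClasses = "numeric", nrows = 45000,
              comment.char = "")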

How do I stack data in R?

I have 20 different .csv files and I need to somehow stack the data in R so that I can get an overall picture of the data.
Presently I am copying and pasting the columns in Excel to make one big data set.
However, I am sure there is a quicker and more efficient way of doing this in R as this would ultimately take a while.
Also, to make things worse, some of the variable names are not the same in each data set.
E.g. VARIABLE1 is written as variable1 in some datasets. How would I rectify this in R, as I understand that R is case sensitive?
Any help would be greatly appreciated. Thanks!
The easiest and fastest way to do this, if you are (or wish to be) familiar with the data.table package, is the following (not tested):
require(data.table)
in_pth <- "path_to_csv_files" # directory where CSV files are located, not the files.
files <- list.files(in_pth, full.names=TRUE, recursive=FALSE, pattern="\\.csv$")
out <- rbindlist(lapply(files, fread))
list.files parameters:
full.names = TRUE will return the full path to each file. Suppose your in_pth <- "c:\\my_csv_folder" and inside it you have two files, 01.csv and 02.csv. Then full.names=TRUE will return c:\\my_csv_folder\\01.csv and c:\\my_csv_folder\\02.csv (the full paths).
recursive = FALSE will not search inside subdirectories of your in_pth folder. Assume you have two more csv files in c:\\my_csv_folder\\another_folder. If you want to load those as well, set recursive=TRUE, which will search all subdirectories.
pattern="\\.csv$": This is a regular expression telling list.files which files to match. If your folder, in addition to csv files, also contains text files (.txt), then specifying this pattern will load only the csv files. If your folder has only csv files, this is not necessary.
data.table functions:
rbindlist avoids conflicts in column names by retaining the names of the first data.table. That is, if you have two data.tables dt1 and dt2 with column names x,y and a,b respectively, then rbindlist(list(dt1, dt2)) will take care of changing a,b to x,y, and rbindlist(list(dt2, dt1)) will take care of changing x,y to a,b.
fread takes care of columns, headers, separators, etc. automatically most of the time, and is extremely fast (although it was still experimental at the time of writing, so you may want to check the output to be sure it's all fine).
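To handle the VARIABLE1 / variable1 issue from the question, one hedged extension of the code above is to lower-case every column name before binding, so differently cased names end up in the same column:
dt_list <- lapply(files, fread)
dt_list <- lapply(dt_list, function(dt) setnames(dt, tolower(names(dt))))
out <- rbindlist(dt_list, use.names = TRUE, fill = TRUE)  # fill missing columns with NA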
@Denis: It is also worth looking into the plyr package for the same purpose. rbind.fill(...) allows you to combine data.frames by row.
install.packages("plyr")
library(plyr)
help(rbind.fill) gives you the following details:
rbinds a list of data frames filling missing columns with NA.
Usage
rbind.fill(...)
Arguments
...
input data frames to row bind together. The first argument can be a list of data frames, in which case all other arguments are ignored.
Details
This is an enhancement to rbind that adds in columns that are not present in all inputs, accepts a list of data frames, and operates substantially faster.
Column names and types in the output will appear in the order in which they were encountered. No checking is performed to ensure that each column is of consistent type in the inputs.
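A minimal usage sketch (the folder path is made up), relying on the fact that the first argument can be a list of data frames:
csv_files <- list.files("path_to_csv_files", pattern = "\\.csv$", full.names = TRUE)
stacked <- rbind.fill(lapply(csv_files, read.csv))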
To my knowledge, plyr has no cbind.fill; however, there is a user-written cbind.fill function that allows you to combine data.frames by column. Details here.
There are two solutions there: one depends on rbind.fill from the plyr package and the other is independent of rbind.fill.
Another way, without using external packages, is to use the cbind() command: it binds per column. So if you have two different tables, you can just pass them as arguments to cbind() and they will be appended.
