Large csv file fails to fully read in to R data.frame - r

I am trying to load a fairly large csv file into R. It has about 50 columns and 2million row.
My code is pretty basic, and I have used it to open files before but none this large.
mydata <- read.csv('file.csv', header = FALSE, sep=",", stringsAsFactors = FALSE)
The result is that it reads in the data but stops after 1080000 rows or so. This is roughly where excel stops as well. Is their way to get R to read the whole file in? Why is it stopping around half way.
Update: (11/30/14)
After speaking with the provider of the data it was discovered that they may have been some corruption issue with the file. A new file was provided which also is smaller and loads into R easily.

As, "read.csv()" read up to 1080000 rows, "fread" from library(data.table) should read it with ease. If not, there exists two other options, either try with library(h20) or with "fread" you can use select option to read required columns (or read in two halves, do some cleaning and can merge them back).

You can try using read.table and include the parameter colClasses to specify the type of the individual columns.
With your current code, R will read all data first as strings and then check for each column if it is convertible e. g. to a numeric type, which needs more memory than reading right away as numeric. colClasses will also allow you to ignore columns you might not need.

Related

Reading large numeric TSV file into memory in R

I am trying to read a file representing a numeric matrix with 4.5e5 rows and 2e3 columns. First line is the header with ncol+1 words, while each row begins with a row name. In txt format it is around 17G in size.
I tried using:
read.table(fname, header=TRUE)
but the operation ate all 64G of RAM available. I assume it loaded it in a wrong structure.
Usually people discuss speed, is there a way to import it so it fits properly? Performance is not a primary issue.
EDIT: I managed to read it with read.table:
colclasses = c("character",rep("numeric",2000))
betas = read.table(beta_fname, header=TRUE, colClasses=colclasses, row.names=1)
But documentation still recommends "scan" for memory usage. What would be the "scan" alternative?
There are several things you might try. Google about reading large files and they might point you to using 'fread' in data.table. You can also try 'read_delim_chunked' that might help. Also break the file into smaller pieces, read each one in, write out an RDS file. When complete you might be able to read in the RDS files and combine using less space.

When I upload my excel file to R, the column titles are in the rows and the data seems all jumbled. How do I fix this?

hi literally day one new coder
On the excel sheet, my data looks organized, but when I upload my file to R, it's not able to read the excel properly and the column headers are in the rows and the data seems randomized.
So far I have tried:
library(readxl)
dataset <-read_excel("pathname")
View(dataset)
Also tried:
dataset <-read_excel("pathname", sheet=1, colNames=TRUE)
Also tried to use the package openxlsx
but nothing is giving me the correct, organized data set.
I tried formatting my Excel to a CSV file, and the CSV file looks exactly like the data that shows up on R (both are messed up).
How should I approach this problem?
I deal with importing .xlsx into R frequently. It can be challenging due to the flexibility of the excel platform. I generally use readxl::read_xlsx() to fetch data from .xlsx files. My suggestions:
First, specify exactly the data you want to import with the range argument.
A cell range to read from, as described in cell-specification. Includes typical Excel
ranges like "B3:D87", possibly including the sheet name like "Budget!B2:G14"
Second, if there are there merged cells or other formatting challenges in column headers, I resort to setting col_names = FALSE. And supplying clean names after import with names(df) <- c("first_col", "second_col")
Third, if there are merged cells elsewhere in the spreadsheet I generally I resort to "fixing" them in excel (not ideal but easier for my use case), however, others may have suggestions on a programmatic fix.
It may be helpful to provide a screenshot of your spreadsheet.

How can I load a large (3.96 gb) .tsv file in R studio

I want to load a 3.96 gigabyte tab separated value file to R and I have 8 ram in my system. How can I load this file to R to do some manipulation on it.
I tried library(data.table) to load my data
but I´ve got this error message (Error: cannot allocate vector of size 965.7 Mb)
I also tried fread with this code but it was not working either: it took a lot of time and at last it showed an error.
as.data.frame(fread(file name))
If I were you, I probably would
1) try your fread code once more without the typo (closing parenthesis was initially missing):
as.data.frame(fread(file name))
2) try to read the file in parts by specifying number of rows to read. This can be done in read.csv and fread with nrow arguments. By reading a small number of rows one could check and confirm that the file is actually readable before doing anything else. Sometimes files are malformed, there could be some special characters, wrong end-of-line characters, escaping or something else which needs to be addressed first.
3) have a look at bigmemory package which have read.big.matrix function. Also ff package has the desired functionalities.
Alternatively, I probably would also try to think "outside the box": do I need all of the data in the file? If not, I could preprocess the file for example with cut or awk to remove unnecessary columns. Do I absolutely need to read it as one file and have all data simultaneously in memory? If not, I could split the file or maybe use readLines..
ps. This topic is covered quite nicely in this post.
pps. Thanks to #Yuriy Barvinchenko for comment on fread
You are reading the data (which puts it in memory) and then storing it as a data.frame (which makes another copy). Instead, read it directly into a data.frame with
fread(file name, data.table=FALSE)
Also, it wouldn't hurt to run garbage collection.
gc()
From my experience and in addition to #Oka answer:
fread() have nrows= argument, so you can read first 10 lines.
If you found out that you don't need all lines and/or all columns, so you can set condition and list of fields just after fread()[]
You can use data.table as dataframe in many cases, so you can try to read without as.data.frame()
This way I worked with 5GB csv file.

How do I get EXCEL to interpret character variable without scientific notation in R using fwrite?

I have a relatively simple issue when writing out in R with fwrite from the data.table package I am getting a character vector interpreted as scientific notation by Excel. You can run the following code to create the data issue:
#create example
samp = data.table(id = c("7E39", "7G32","5D99999"))
fwrite(samp,"test.csv",row.names = F)
When you read this back into R you get values back no problem if you have scinote disable. My less code capable colleagues work with the csv directly in excel and they see this:
They can attempt to change the variable to text but excel then interprets all the zeros. I want them to see the original "7E39" from the data table created. Any ideas how to avoid this issue?
PS: I'm working with millions of rows so write.csv is not really an option
EDIT:
One workaround I've found is to just create a mock variable with quotes:
samp = data.table(id = c("7E39", "7G32","5D99999"))[,id2:=shQuote(id)]
I prefer a tidyr solution (pun intended), as I hate unnecessary columns
EDIT2:
Following R2Evan's solution I adapted it to data table with the following (factoring another numerical column, to see if any changes occured):
#create example
samp = data.table(id = c("7E39", "7G32","5D99999"))[,second_var:=c(1,2,3)]
fwrite(samp[,id:=sprintf("=%s", shQuote(id))],
"foo.csv", row.names=FALSE)
It's a kludge, and dang-it for Excel to force this (I've dealt with it before).
write.csv(data.frame(id=sprintf("=%s", shQuote(c("7E39", "7G32","5D99999")))),
"foo.csv", row.names=FALSE)
This is forcing Excel to consider that column a formula, and interpret it as such. You'll see that in Excel, it is a literal formula that assigns a static string.
This is obviously not portable and prone to all sorts of problems, but that is Excel's way in this regard.
(BTW: I used write.csv here, but frankly it doesn't matter which function you use, as long as it passes the string through.)
Another option, but one that your consumers will need to do, not you.
If you export the file "as is", meaning the cell content is just "7E39", then an auto-import within Excel will always try to be smart about that cell's content. However, you can manually import the data.
Using Excel 2016 (32bit, on win10_64bit, if it matters):
Open Excel (first), have an (optionally empty) worksheet already open
On the ribbon: Data > Get External Data > From Text
Navigate to the appropriate file (CSV)
Select "Delimited" (file type), click Next, select "Comma" (and optionally deselect any others that may default to selected), Next
Click on the specific column(s) and set the "Default data format" to "Text" (this will need to be done for any/all columns where this is a problem). Multiple columns can be Shift-selected (for a range of columns), but not Ctrl-selected. Finish.
Choose the top-left cell to import/paste the data (or a new worksheet)
Select Properties..., and deselect "Save query definition". Without this step, the data is considered a query into an external data source, which may not be a problem but makes some things a little annoying. (For example, try to highlight all data and delete it ... Excel really wants to make sure you know what you're doing there.)
This method provides a portable solution. It "punishes" the Excel users, but anybody/anything else will still be able to consume the files directly without change. The biggest disadvantage with this method is that you won't know if somebody loads it incorrectly unless/until they get odd results when the try to use the data and some fields are silently converted.

Split big data in R

I have a big data file (~1GB) and I want to split it into smaller ones. I have R in hand and plan to use it.
Loading the whole into memory cannot be done as I would get the "cannot allocate memory for vector of xxx" error message.
Then I want to use the read.table() function with the parameters skip and nrows to read only parts of the file in. Then save out to individual files.
To do this, I'd like to know the number of lines in the big file first so I can workout how many rows should I set to individual files and how many files should I split into.
My question is: how can I get the number of lines from the big data file without fully loading it into R?
Suppose I can only use R. So cannot use any other programming languages.
Thank you.
Counting the lines should be pretty easy -- check this tutorial http://www.exegetic.biz/blog/2013/11/iterators-in-r/ (the "iterating through lines part).
The gist is to use ireadLines to open an iterator over the file
For Windows, something like this should work
fname <- "blah.R" # example file
res <- system(paste("find /v /c \"\"", fname), intern=T)[[2]]
regmatches(res, gregexpr("[0-9]+$", res))[[1]]
# [1] "39"

Resources