Error: Invalid: File is too small to be a well-formed file - error when using feather in R

I'm trying to use feather (v. 0.0.1) in R to read a fairly large (3.5 GB) csv file with 21178665 rows and 16 columns.
I use the following lines to load the file:
library(feather)
path <- "pp-complete.csv"
df <- read_feather(path)
But I get the following error:
Error: Invalid: File is too small to be a well-formed file
There's no explanation of this error in the read_feather documentation, so I'm not sure what the problem is. I guess this function expects a different file format, but I'm not sure what that would be.
By the way, I can read the file with read_csv from the readr library, but it takes a while.

The feather file format is distinct from the CSV file format; they are not interchangeable, and read_feather cannot read plain CSV files.
If you want to read CSV files quickly, your best bets are probably readr::read_csv or data.table::fread. For a large file, it will still usually take a while just to read it from disk.
After you've loaded the data into R, you can create a file in the feather format with write_feather so you can read it with read_feather the next time.
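For example, a minimal sketch of that workflow (file names taken from the question; the .feather path is just a suggestion):
library(readr)
library(feather)
df <- read_csv("pp-complete.csv")          # slow one-time read of the CSV
write_feather(df, "pp-complete.feather")   # cache a copy in the feather format
df <- read_feather("pp-complete.feather")  # fast reload in later sessions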

Related

How to read a csv file with unknown formatting and unknown encoding in R? (example file provided)

I have tried my best to read a CSV file in R but failed. I have provided a sample of the file at the following Google Drive link.
Data
By opening the file in a text editor, I found that it is tab-delimited. Excel reads the file without issues, but when I try to read it in R using the "readr" package or base R functions, it fails, and I'm not sure why. I have tried different encodings like UTF-8, UTF-16, and UTF-16LE. Could you please help me write the correct script to read this file? Currently, I am converting the file in Excel to comma-delimited form to read it into R, but I am sure there must be something I am doing wrong. Any help would be appreciated.
Thanks
Amal
PS: What I don't understand is how Excel reads the file without any parameters being provided. Can we build the same logic in R to read any file?
This is a Windows-related encoding problem.
When I open your file in Notepad++ it tells me it is encoded as UCS-2 LE BOM. There is a trick to reading in files with unusual encodings into R. In your case this seems to do the trick:
read.delim(file("temp.csv", encoding = "UCS-2LE"))
(adapted from R: can't read unicode text files even when specifying the encoding).
BTW "CSV" stands for "comma separated values". This file has tab-separated values, so you should give it either a .tsv or .txt suffix, not .csv, to avoid confusion.
In terms of your second question, could we build the same logic in R to guess encoding, delimiters and read in many types of file without us explicitly saying what the encoding and delimiter is - yes, this would certainly be possible. Whether it is desirable I'm not sure.
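As a starting point, readr can already take a guess at the encoding; a minimal sketch (the file name is a stand-in, the guess is heuristic, and readr's handling of UTF-16-family encodings can vary, so check the result against what Notepad++ reports):
library(readr)
guess_encoding("temp.csv", n_max = 10000)  # candidate encodings with confidence scores
df <- read_tsv("temp.csv", locale = locale(encoding = "UTF-16LE"))  # feed in the best guess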

How to write data into a macro-enabled Excel file (write.xlsx corrupts my document)?

I'm trying to write a table into a macro-enabled Excel file (.xlsm) from R. The write.xlsx (openxlsx) and writeWorksheetToFile (XLConnect) functions don't work.
When I used the openxlsx package, as seen below, the resulting .xlsm files ended up getting corrupted.
Code:
library(XLConnect)
library(openxlsx)
for (i in 1:3){
write.xlsx(Input_Files[[i]], Inputs[i], sheetName="Input_Sheet")
}
#Input_Files[[i]] are the R data.frames which need to be inserted into the .xlsm file
#Inputs[i] are the Excel files the tables should be written into
Corrupted .xlsm file error message after write.xlsx:
Excel cannot open the file 'xxxxx.xlsm' because the file format or file extension is not valid. Verify that the file has not been corrupted and that the file extension matches the format of the file.
After researching this problem extensively, I found that the XLConnect package offers the writeWorksheetToFile function, which does work with .xlsm, although after running it a few times it yields an error message saying there is no more free space. It also runs for 20+ minutes for tables with approximately 10,000 lines. I tried adding xlcFreeMemory at the beginning of the for loop, but it doesn't solve the issue.
Code:
library(XLConnect)
library(openxlsx)
for (i in 1:3){
xlcFreeMemory()
writeWorksheetToFile(Inputs[i], Input_Files[[i]], "Input_Sheet")
}
#Input_Files[[i]] are the R data.frames which need to be inserted into the .xlsm file
#Inputs[i] are the Excel files the tables should be written into
Could anyone recommend a way to easily and quickly transfer an R table into an xlsm file without corrupting it?

Importing into R an Excel file saved as a web page

I would like to use R to open an Excel file that was saved as a web page, but I keep getting error messages.
The desired steps are:
1) Upload the file into RStudio
2) Change the format into a data frame / tibble
3) Save the file as an xls
The message I get when I open the file in Excel is that the file format (Excel web-page format) and the file extension (xls) differ. I have tried the steps in this answer, but to no avail. I would be grateful for any help!
I don't expect anybody will be able to give you a definitive answer without a link to the actual file. The complication is that many services will write files as .xls or .xlsx without them being in a valid Excel format. This is done because Excel is so common and some non-technical people feel more confident working with Excel files than with a csv file. Now, the files will have been stored in a format that Excel can deal with (hence your warning message), but R's libraries are stricter and don't see the file type they were expecting, so they fail.
That said, the below steps worked for me when I last encountered this problem. A service was outputting .xls files which were actually just HTML tables saved with an .xls file extension.
1) Download the file to work with it locally. You can script this of course, e.g. with download.file(), but this step helps eliminate other errors involved in working directly with a webpage or connection.
2) Load the full file with readHTMLTable() from the XML package
library(XML)
dTemp <- readHTMLTable(filename, stringsAsFactors = FALSE)  # filename is the path to your local copy
This will return a list of data frames. Your result set will quite likely be the second element or later (see ?readHTMLTable for an example with explanation). You will probably need to experiment here and explore the list structure, as it may contain nested lists.
3) Extract the relevant list element, e.g.
df <- dTemp[[2]]  # double brackets extract the data frame itself rather than a one-element list
You also mention writing out the final data frame as an xls file, which suggests you want the old-style format. I would suggest the WriteXLS package for this purpose.
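A minimal sketch with WriteXLS (the output name is a placeholder; note that WriteXLS requires a working Perl installation):
library(WriteXLS)
WriteXLS(df, ExcelFileName = "output.xls")  # writes df to an old-style .xls workbook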
I seriously doubt the Excel file was 'saved as a web page'. I'm pretty sure the file just sits on a server and all you have to do is go fetch it. Some kinds of files (in particular Excel and h5) are binary rather than text files. Reading them needs an added setting to warn R that the file is binary and should be handled appropriately.
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download.file(url=myurl, destfile="localcopy.xlsx", mode="wb")
or, to use the downloader package, try something like this:
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download(myurl, destfile="localcopy.xlsx", mode="wb")
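If the download really is a binary workbook (an assumption; as discussed above, it may just be an HTML table in disguise), readxl should then be able to open the local copy:
library(readxl)
df <- read_excel("localcopy.xlsx")  # reads the first sheet into a tibble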

Writing and loading expression sets to and from csv files

In R, one can write a Bioconductor ExpressionSet to a csv file using write.csv. For example, using the standard bladderbatch data available as a Bioconductor package, the following code writes a csv file to the current working directory:
library("bladderbatch")
data("bladderdata")
write.csv(bladderEset, "bladderEset.csv")
Is there a tool which can read the produced csv file back into R as an ExpressionSet?
If not, is there an ExpressionSet ↔ csv serialiser/deserialiser, which can both output ExpressionSets as csv files and read csv files as ExpressionSets?
The reason I'm asking is that I need to interact with ExpressionSets from Python and Java code, and I can easily work with "csv" files, but not with ".rda", ".CEL", or other binary files.
If you just want to interact with the data using R and Python, consider saving the ExpressionSet as a feather object.
https://github.com/wesm/feather
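For instance, a minimal sketch (my own, not from the feather docs) that saves the assay and phenotype data separately, since feather stores plain data frames rather than whole ExpressionSets:
library(Biobase)
library(feather)
expr_df <- as.data.frame(exprs(bladderEset))   # expression matrix as a data frame
expr_df$probe <- rownames(expr_df)             # feather drops row names, so keep them in a column
write_feather(expr_df, "bladderEset_exprs.feather")
write_feather(pData(bladderEset), "bladderEset_pData.feather")  # phenotype data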
The comment from @Nathan Werth is what I think you are looking for: by calling readExpressionSet, you can easily read a CSV file back in as an ExpressionSet.
First, write out the CSV file as in your initial code:
library("bladderbatch")
data("bladderdata")
write.csv(bladderEset, "bladderEset.csv")
Then read it back in:
temp <- Biobase::readExpressionSet("bladderEset.csv")
> class(temp)
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"

How to read an Excel file with Chinese characters in R?

I always convert an Excel file into a CSV file to import it into R, as follows:
myDataFrame <- read.csv("mydatafile.csv", stringsAsFactors=F)
But I ran into a serious problem when I converted an xlsx file written in Chinese. Most of the characters (not all of them) show '??' because of the encoding.
So, I decided to use the xlsx package to import the file directly. But the problem is that the Excel file exceeds 10 MB in size.
It gave me an error message because of the JVM's memory limit. (I assume that xlsx uses Java internally.)
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.OutOfMemoryError: GC overhead limit exceeded
How can I import a Chinese Excel file into R? I tried 'Save As...' to a CSV file, opened it in Notepad, and saved it with the 'UTF-8' option, but the result was the same ('??').
FYI, I can see all the Chinese characters in the original Excel file.
Your question is a mixed one. Let's assume that you have converted the xlsx file into a csv. If you haven't, please refer to other threads like this one. I think this step is best carried out in some external tool rather than in R.
Now that we've got a csv, two problems remain: size and encoding. For the encoding, as you mentioned in the comment, you can use the encoding options of several R functions like read.csv. For Chinese files coming out of Excel, the encoding is most probably "GB18030". If you cannot decide, the open-file dialog of LibreOffice Calc may give you some clue.
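A minimal sketch (note that for read.csv it is the fileEncoding argument that converts the file's encoding as it is read; "GB18030" is the guess from above):
myDataFrame <- read.csv("mydatafile.csv", fileEncoding = "GB18030", stringsAsFactors = FALSE)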
If the file size is large, you may first convert the encoding using the Linux command iconv, and then further process it in R.
Now for the size part. A 50 MB or even 500 MB csv can easily be handled by read.csv, although not necessarily quickly, provided that you have enough memory. If the file is larger than 1 GB, there are two options:
Use the sqldf package, which reads the csv into a temporary database, and then into a data.frame.
Process the csv line by line: first use file() to create a connection, then use readLines() to process it in chunks, and finally combine the results into a data.frame or other appropriate structure.
The first is simpler; the second can handle really large files (see the sketches below).
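A minimal sketch of each option (the file name, chunk size, and processing step are placeholders, not from the original answer):
# Option 1: sqldf stages the csv in a temporary SQLite database
library(sqldf)
df <- read.csv.sql("big.csv", sql = "select * from file")
# Option 2: read and process the file in chunks over a connection
con <- file("big.csv", open = "r")
header <- readLines(con, n = 1)                          # keep the header line
while (length(chunk <- readLines(con, n = 100000)) > 0) {
  # parse each chunk (e.g. with strsplit) and accumulate the results
}
close(con)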
Hope it helps.
