alternatives to read.csv(textConnection()) - r

I am downloading a 120 MB csv file from a web server using read.csv(textConnection(binarydata1)) and this is painfully slow. I tried pipe(), like this: read.csv(pipe(binarydata1)), but I get an error: Error in pipe(binarydata1) : invalid 'description' argument. Any help regarding this issue is much appreciated.
#jeremycg, #hrbrmstr
Suggestion
fread from the data.table package.
Save to local storage via download.file or functions in curl or httr, then use data.table::fread as #jeremycg suggested, or readr::read_csv.
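A minimal sketch of that suggestion, assuming the server needs HTTP authentication as in the original code (url, user and password here are placeholders):
library(httr)
library(data.table)
tmp <- tempfile(fileext = ".csv")
# stream the response straight to disk instead of holding it in memory
GET(url, authenticate(user, password), write_disk(tmp, overwrite = TRUE))
dt <- fread(tmp)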
Response
The csv file I am dealing with is in binary format, so I am converting it to standard format using these functions:
t1 = getURLContent(url,userpwd,httpauth = 1L, binary=TRUE)
t2 = readBin(t1, what='character', n=length(t1)/4)
When I try fread(t2) after converting from binary to standard format, I get an error:
Error in fread(t61) :
'input' must be a single character string containing a
file name, a command, full path to a file, a URL starting
'http://' or 'file://', or the input data itself
If I try fread directly, without converting from binary to standard format, it works with no problem; if I convert from binary to standard format first, it does not work.
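One possible workaround (a sketch, assuming t1 still holds the raw bytes from getURLContent): write the bytes to a temporary file and let fread read that file, since fread's input must be a single string rather than the character vector that readBin returns.
tmp <- tempfile(fileext = ".csv")
# t1 is the raw vector returned by getURLContent(..., binary = TRUE)
writeBin(t1, tmp)
t3 <- data.table::fread(tmp)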

Even though the question is 4 years old, it helped me with my current problem, where I also have a 300 MB connection for which read.csv took ages.
I found the vroom function from the vroom package helpful here. It loaded my data like a charm. It took one minute for my data, where I don't even know whether read.csv(textConnection(...)) would ever have given me a result (I usually terminated R after 30 minutes with no result).
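A rough sketch of what that looks like, with the download saved to disk first (url is a placeholder):
library(vroom)
tmp <- tempfile(fileext = ".csv")
download.file(url, tmp, mode = "wb")
# vroom indexes the file and reads values lazily, which is where the speed comes from
dat <- vroom(tmp)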

Related

Attempts to parse bencode / torrent file in R

I wish I could parse torrent files automatically via R. I tried to use R-bencode package:
library('bencode')
test_torrent <- readLines('/home/user/Downloads/some_file.torrent', encoding = "UTF-8")
decoded_torrent <- bencode::bdecode(test_torrent)
but ran into this error:
Error in bencode::bdecode(test_torrent) :
input string terminated unexpectedly
In addition if I try to parse just part of this file bdecode('\xe7\xc9\xe0\b\xfbD-\xd8\xd6(\xe2\004>\x9c\xda\005Zar\x8c\xdfV\x88\022t\xe4գi]\xcf'), I get
Error in bdecode("\xe7\xc9\xe0\b\xfbD-\xd8\xd6(\xe2\004>\x9c\xda\005Zar\x8c\xdfV\x88\022t\xe4գi]\xcf") :
Wrong encoding '�'. Allowed values are i, l, d or a digit.
Are there other ways to do this in R? Or could I perhaps call code written in another language from my R script?
Thanks in advance!
It might be that the torrent file is somehow corrupted.
A bencode value must begin with the character i (for integers), l (for lists), d (for dictionaries) or a number (for the length of a string).
The example string ('\xe7\xc9...') doesn't start with any of those characters, and hence it can't be decoded.
See this for more info on the bencode format.
There seem to be several issues here.
Firstly, your code should not treat torrent files as text files in UTF-8 encoding. The content a torrent describes is split into equally-sized pieces (except for the last piece), and the torrent file contains a concatenation of the SHA1 hashes of those pieces. SHA1 hashes are unlikely to be valid UTF-8 strings.
So, you should not read the file into memory using readLines, because that is for text files. Instead, you should use a connection:
test_torrent <- file("/home/user/Downloads/some_file.torrent")
open(test_torrent, "rb")
bencode::bdecode(test_torrent)
Secondly, it seems that this library suffers from a similar issue: readChar, which it uses internally, also assumes that it is dealing with text.
This might be due to recent changes in R, seeing as the library is over 6 years old. I was able to apply a quick hack and get it working by passing useBytes=TRUE to readChar.
https://github.com/UkuLoskit/R-bencode/commit/b97091638ee6839befc5d188d47c02567499ce96
You can install my version as follows:
install.packages("devtools")
library(devtools)
devtools::install_github("UkuLoskit/R-bencode")
Caveat lector! I'm not an R programmer :).

Is there a way to set character encoding when reading sas files to spark or when pulling the data to the r session?

So, I have sas7bdat files that are huge and I would like to read them into Spark, process them there, and then collect the results into an R session. I'm reading them into Spark using the spark.sas7bdat package and the spark_read_sas function. So far so good. The problem is that the character encoding of the sas7bdat files is iso-8859-1, but to show the content correctly in R it would need to be UTF-8. When I pull the results into R, my data looks like this. (Let's first create an example that has the same raw bytes that my results have.)
mydf <- data.frame(myvar = rawToChar(as.raw(c(0xef, 0xbf, 0xbd, 0x62, 0x63))))
head(mydf$myvar,1) # should get äbc if it was originally read correctly
> �bc
Changing the encoding afterwards doesn't work for some reason.
iconv(head(mydf$myvar,1), from = 'iso-8859-1', to = 'UTF-8')
> �bc
If I use the haven package and read_sas('myfile.sas7bdat', encoding = 'iso-8859-1') to read the file directly into my R session, everything works as expected.
head(mydf$myvar,1)
> äbc
I would be very grateful for a solution that enables me to do the processing in Spark and then collect only the results to the R session, because the files are so big. I guess this could potentially be solved either a) when reading the file into Spark (but I did not find an option that would work) or b) by correcting the encoding in R (I could not get that to work but do not understand why; maybe it has something to do with the special character encoding in the sas7bdat file).
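Not an answer, but a diagnostic sketch of why iconv cannot repair the value afterwards: the bytes 0xef 0xbf 0xbd are the UTF-8 encoding of U+FFFD, the Unicode replacement character, so the original byte for 'ä' was already lost before the data reached R.
# inspect the raw bytes of the garbled value
charToRaw(as.character(head(mydf$myvar, 1)))
# [1] ef bf bd 62 63   <- already U+FFFD followed by "bc"; there is nothing left to re-encode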

Error while parsing a very large (10 GB) XML file in R, using the XML package

Context
I'm currently working on a project involving OSM data (OpenStreetMap). In order to manipulate geographic objects, I have to convert the data (an OSM XML file) into an object. The osmar package lets me do this, but it fails to parse the raw XML data.
The error
Error in paste(file, collapse = "\n") : result would exceed 2^31-1 bytes
The code
require(osmar)
osmar_obj <- get_osm("anything", source = osmsource_file("my filename"))
Inside the get_osm function, the code calls ret <- xmlParse(raw), which triggers the error after a few seconds.
The question
How am I supposed to read a large XML file (here 10 GB), knowing that I have 64 GB of memory?
Thanks a lot !
This is the solution I came up with, even though it is not 100% satisfying.
Transform the .osm file by removing every newline (but the last) in your shell
Run the exact same code as before, skipping the paste that is not needed anymore (since you just did the equivalent in shell)
Profit :)
Obviously, I'm not very happy with it, because modifying the data file in the shell is more a trick than an actual solution :(
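For what it's worth, the shell step can be driven from R; a rough sketch (file names are placeholders, and a Unix-like shell with tr is assumed):
infile <- "my_data.osm"
outfile <- "my_data_oneline.osm"
# strip every newline, then append a single trailing one
system(paste0("tr -d '\\n' < ", shQuote(infile),
              " > ", shQuote(outfile),
              " && echo '' >> ", shQuote(outfile)))
osmar_obj <- get_osm("anything", source = osmsource_file(outfile))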

how to have fread perform like read.delim

I've got a large tab-delimited data table that I am trying to read into R using the data.table package fread function. However, fread encounters an error. If I use read.delim, the table is read in properly, but I can't figure out how to configure fread such that it handles the data properly.
In an attempt to find a solution, I've installed the development version of data.table, so I am currently running data.table v1.9.7, under R v3.2.2, running on Ubuntu 15.10.
I've isolated the problem to a few lines from my large table, and you can download it here.
When I used fread:
> fread('problemRows.txt')
Error in fread("problemRows.txt") :
Expecting 8 cols, but line 3 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.
I tried using the parameters used by read.delim:
fread('problemRows.txt', sep="\t", quote="\"")
but I get the same error.
Any thoughts on how to get this to read in properly? I'm not sure what exactly the problem is.
Thanks!
With this recent commit c1b7cda, fread's quote logic got a bit cleverer in handling such tricky cases. With this:
require(data.table) # v1.9.7+
fread("my_file.txt")
should just work. The error message is now more informative as well if it is unable to handle the input. See #1462.
As explained in the comments, specifying the quote argument did the trick.
fread("my_file.txt", quote="")

Reading excel with R

I am trying to decide whether to read Excel files directly from R or to convert them to csv first. I have researched the various options for reading Excel files. I also found out that reading Excel directly might have its cons, like conversion of date and numeric column data types, etc.
XLConnect - dependent on Java
read.xlsx - slow for large data sets
read.xlsx2 - fast but needs the colClasses argument to specify the desired column classes
ODBC - may have conversion issues
gdata - dependent on Perl
I am looking for a solution that will be fast enough for at least a million rows, with minimal data conversion issues. Any suggestions?
EDIT
So finally I have decided to convert to csv and then read the csv file, but now I have to figure out the best way to read a large csv file (with at least 1 million rows).
I found out about the read.csv.ffdf function (from the ff package), but that does not let me set my own colClasses. Specifically this:
setAs("character","myDate", function(from){ classFun(from) } )
colClasses =c("numeric", "character", "myDate", "numeric", "numeric", "myDate")
z<-read.csv.ffdf(file=pathCsv, colClasses=colClassesffdf)
This does not work and I get the following error:
Error in ff(initdata = initdata, length = length, levels = levels,
ordered = ordered, : vmode 'list' not implemented
I am also aware of the RSQLite and ODBC functionality but do not wish to use them. Is there a solution to the above error, or any other way around this?
Since this question was asked, Hadley Wickham has released the R package readxl, which wraps C and C++ libraries to read .xls and .xlsx files, respectively. It is a big improvement on the previous possibilities, but not without problems. It is fast and simple, but if you have messy data, you will have to do some work whichever method you choose. Going down the .csv route isn't a terrible idea, but it does introduce a manual step in your analysis, and relies on whichever version of Excel you happen to use giving consistent CSV output.
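A minimal readxl sketch (the file name and sheet are placeholders); column types can be pinned with col_types if the type guessing mangles dates or numbers:
library(readxl)
dat <- read_excel("my_big_file.xlsx", sheet = 1)
# or force the types explicitly, e.g.
# dat <- read_excel("my_big_file.xlsx", col_types = c("numeric", "text", "date"))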
All the solutions you mentioned will work - but if manually converting to .csv and reading with read.csv is an option, I'd recommend that. In my experience it is faster and easier to get right.
If you want speed and large data, then you might consider converting your Excel file(s) to a database format, then connecting R to the database.
A quick Google search showed several links for converting Excel files to SQLite databases; then you could use the RSQLite or sqldf package to read them into R.
Or use the ODBC package if you convert to one of the databases that work with ODBC. Field conversion problems should be fewer if you do the conversion to the database correctly.
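A sketch of that SQLite route, assuming the Excel data has already been converted into a table called mytable inside mydata.sqlite (both names are placeholders):
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "mydata.sqlite")
dat <- dbGetQuery(con, "SELECT * FROM mytable")
dbDisconnect(con)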
