I have .cdb files in binary format, each around 630 MB in size.
I use read.cdb(file, type='cdb') from the cdb package to read them in, but it takes forever to load (20+ minutes).
Is this normal for a large file like this?
Is there a way to extract information about a .wav file's length/duration without having to read the file into R? I have thousands of those files and it would take a long time if I had to read in every single one to find its duration. Windows File Explorer gives you an option to turn on the Length field so you can see the file duration, but is there a way to extract that information to use it in R?
This is what I tried and would like to avoid doing since reading in tens of thousands of audio files in R will take a long time:
library(tuneR)
audio <- readWave("AudioFile.wav")
round(length(audio@left) / audio@samp.rate, 2)
You can use the readWave function from the tuneR package with header=TRUE. This reads only the file's header metadata, not the entire file.
library(tuneR)
audio <- readWave("AudioFile.wav", header = TRUE)
round(audio$samples / audio$sample.rate, 2)
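If you have many files, you can wrap this in an apply call. A minimal sketch, assuming all your wav files sit in a single folder (the folder path and object names here are placeholders):

library(tuneR)

# Placeholder folder; point this at wherever your wav files live
wav_files <- list.files("path/to/wavs", pattern = "\\.wav$", full.names = TRUE)

# header = TRUE reads only the header, so this stays fast even for thousands of files
durations <- sapply(wav_files, function(f) {
  hdr <- readWave(f, header = TRUE)
  round(hdr$samples / hdr$sample.rate, 2)
})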
I am writing SPSS .sav files from R using the package haven, which works very well for me in general. However, I have noticed that the .sav file size written to disk using write_sav() seems to be much bigger than necessary. Whenever I open and save a .sav file written by write_sav() in SPSS, the file size is reduced by a factor of up to ~10!
This matters to me as I am writing rather big data to SPSS for others, and sometimes SPSS refuses to open a very big file. Maybe this problem would not arise if write_sav() stored the data more efficiently, in a "real" native SPSS way?
Does anyone know this issue and maybe has a helpful comment on it?
(An SPSS installation is needed to replicate this issue.)
It's not clear from the Haven write_sav() documentation, but it sounds like it is saving them as uncompressed .sav files. The default for (most) SPSS installations would be to save as compressed files. SPSS has an extra compression option of 'zCompressed' which will produce even smaller files but these generally can't be opened outside of SPSS.
You can experiment with this like so:
Save outfile = 'Uncompressed file.sav'
/UnCompressed.
Save outfile = 'Compressed file.sav'
/Compressed.
Save outfile = 'ZCompressed file.zsav'
/ZCompressed.
Note the .zsav file extension isn't necessary (could be .sav) but it's considered best practice to use this to make it clear where compatibility might be an issue.
See https://www.ibm.com/support/knowledgecenter/en/SSLVMB_21.0.0/com.ibm.spss.statistics.help/syn_save_compressed_uncompressed.htm for more info.
What form does your actual data take? Is it Codepage or Unicode, and what is haven doing? Since SPSS 16.0 and the introduction of the UNICODE setting, there has been a tripling of string field widths when converting from Codepage to Unicode. This is a pain best suffered only once: get your data to Unicode and then stay there.
See https://www.ibm.com/support/knowledgecenter/SSLVMB_26.0.0/statistics_reference_project_ddita/spss/base/syn_set_unicode.html for more information.
If the output size is a problem, you could have a look at my package readspss. Using compression and zsav you should be able to get the best available compression. Compression in sav files depends on how the file is written. SPSS has different compression methods for storing numeric information: numerics can be stored only as doubles (no compression) or as a mix of doubles and int8_t (compression 1). zsav uses zlib to compress whatever the initial input was (compression 2). Eight int8_t values fit in the space of one double, hence the difference in file size.
There are three variants of the SPSS (.sav) file format:
Uncompressed (.sav). This is haven's default output, but is rarely used in my experience.
Compressed (.sav). This is what most people use, and it has been the default save format for SPSS for many, many years.
Zcompressed (.zsav, but sometimes .sav). Added to SPSS a few years ago, but it doesn't seem to be used much. You can get this from haven by adding compress=TRUE to write_spss().
I have submitted a pull request to make the compressed (2) format the default.
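For example, a minimal sketch (the data-frame and file names are placeholders, and the exact behaviour of the compress argument may depend on your haven version):

library(haven)

# compress = TRUE asks haven to write a zlib-compressed (zsav-style) file
write_sav(mydata, "mydata.zsav", compress = TRUE)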
I'm trying to use feather (v. 0.0.1) in R to read a fairly large (3.5 GB) csv file with 21178665 rows and 16 columns.
I use the following lines to load the file:
library(feather)
path <- "pp-complete.csv"
df <- read_feather(path)
But I get the following error:
Error: Invalid: File is too small to be a well-formed file
There's no explanation in the documentation of read_feather, so I'm not sure what the problem is. I guess this function expects a different file format, but I'm not sure what that would be.
Btw, I can read the file with read_csv from the readr library, but it takes a while.
The feather file format is distinct from a CSV file format. They are not interchangeable. The read_feather function cannot read simple CSV files.
If you want to read CSV files quickly, your best bets are probably readr::read_csv or data.table::fread. For large files, it will usually still take a while just to read the data from disk.
After you've loaded the data into R, you can create a file in the feather format with write_feather so you can read it with read_feather the next time.
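A minimal sketch of that workflow (the csv name comes from the question; the feather file name is just an example):

library(feather)
library(readr)

# One-time conversion: read the csv (slow), then save it in the feather format
df <- read_csv("pp-complete.csv")
write_feather(df, "pp-complete.feather")

# Later sessions can load the feather file directly, which is much faster
df <- read_feather("pp-complete.feather")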
I always convert Excel files into CSV to import into R, as follows:
myDataFrame <- read.csv("mydatafile.csv", stringsAsFactors = FALSE)
But I ran into a serious problem when converting an xlsx file written in Chinese: most of the characters (not all of them) show up as '??' because of encoding issues.
So I decided to use the xlsx package to import the file directly. But the problem is that the Excel file exceeds 10 MB in size.
That gave me an error message because of the JVM's memory limit. (I assume that xlsx uses Java internally.)
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.OutOfMemoryError: GC overhead limit exceeded
How can I import a Chinese Excel file into R? I tried 'Save as...' to a CSV file, opened it in Notepad, and saved it again with the 'UTF-8' option, but the result was the same (it still shows '??').
FYI, I can see the full Chinese characters in the original Excel file.
Your question is a mixed one. Let's assume that you have converted the xlsx file into csv. If you haven't, please refer to other threads like this one. I think this step is best carried out in some external tool rather than in R.
Now that we've got a csv, two problems remain: size and encoding. For encoding, as you mentioned in the comment, you can use the encoding-related arguments of several R functions like read.csv. For Chinese files coming out of Excel, the encoding is most probably "GB18030". If you cannot decide, the open-file dialog of LibreOffice Calc may give you some clue.
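For example, a minimal sketch using read.csv's fileEncoding argument (assuming the file really is GB18030-encoded; adjust the encoding name as needed):

myDataFrame <- read.csv("mydatafile.csv",
                        fileEncoding = "GB18030",
                        stringsAsFactors = FALSE)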
If the file size is large, you may first convert the encoding using the Linux command iconv, and then further process it in R.
Now for the size part. A 50 MB or even 500 MB csv can easily be handled by read.csv, although not necessarily fast, provided that you have enough memory. If the file is larger than 1 GB, there are two options:
Use the sqldf package, which reads the csv into a temporary database, and then into a data.frame.
Process the csv line by line: first use file() to create a connection, then use readLines() to process it in chunks, and finally combine the results manually into a data.frame or other appropriate structure (see the sketch after this list).
The first one is simpler; the second one can handle really large files.
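A minimal sketch of the line-by-line approach (the file name, chunk size, and parsing are placeholders you would adapt to your data):

con <- file("mydatafile.csv", open = "r")
header <- readLines(con, n = 1)            # keep the header line separately
chunks <- list()
repeat {
  lines <- readLines(con, n = 100000)      # read 100,000 lines at a time
  if (length(lines) == 0) break
  # parse each chunk with read.csv via a text connection
  chunks[[length(chunks) + 1]] <- read.csv(textConnection(lines),
                                           header = FALSE,
                                           stringsAsFactors = FALSE)
}
close(con)
df <- do.call(rbind, chunks)
names(df) <- strsplit(header, ",")[[1]]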
Hope it helps.
I am a newbie to R, but I am aware that it chokes on "big" files. I am trying to read a 200 MB data file. I have tried it in csv format and also converted it to tab-delimited txt, but in both cases I use up my 4 GB of RAM before the file loads.
Is it normal that R would use 4 GB of memory to load a 200 MB file, or could there be something wrong with the file that is causing R to keep reading a bunch of nothingness in addition to the data?
From ?read.table
Less memory will be used if colClasses is specified as one of the six atomic vector classes.
...
Using nrows, even as a mild over-estimate, will help memory usage.
Use both of these arguments.
Ensure that you properly specify numeric for your numeric data. See here: Specifying colClasses in the read.csv
And do not under-estimate nrows.
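For example, a minimal sketch (the file name, column classes, and row count are placeholders; match them to your actual data):

# Suppose the file has one character column followed by three numeric columns
df <- read.csv("mydatafile.csv",
               colClasses = c("character", "numeric", "numeric", "numeric"),
               nrows = 1100000)   # a mild over-estimate of the true row count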
If you're running 64-bit R, you might try the 32-bit version. It will use less memory to hold the same data.
See here also: Extend memory size limit in R