I have downloaded a huge file (~300 MB) from the UCI Machine Learning dataset library.
Is there a way to predict the memory required to load the dataset before loading it into R?
I googled a lot, but all I could find was how to measure memory with the R profiler and several other packages, and only after the objects are loaded into R.
Based on the "R Programming" Coursera course, you can approximate the memory usage from the number of rows and columns in the data. You can get that information from the codebook or metadata file.
memory required = no. of columns * no. of rows * 8 bytes/numeric
So, for example, if you have 1,500,000 rows and 120 columns, you will need more than 1.34 GB of spare memory.
You can apply the same approach to other types of data, paying attention to the number of bytes used to store each data type.
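The arithmetic above can be checked directly in R. This is a minimal sketch that assumes every column is stored as a double (8 bytes per value) and ignores object overhead; the variable names are illustrative only.

```r
# Rough in-memory size of an all-numeric table (assumes 8 bytes per value).
n_rows <- 1500000
n_cols <- 120
bytes_per_value <- 8

est_bytes <- n_rows * n_cols * bytes_per_value
est_gib <- est_bytes / 2^30   # convert bytes to GiB

round(est_gib, 2)
```

For integer columns the same calculation would use 4 bytes per value instead of 8.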
If your data's stored in a csv file, you could first read in a subset of the file and calculate the memory usage in bytes with the object.size function. Then, you could compute the total number of lines in the file with the wc command-line utility and use the line count to scale the memory usage of your subset to get an estimate of the total usage:
top.size <- object.size(read.csv("simulations.csv", nrows = 1000))
lines <- as.numeric(gsub("[^0-9]", "", system("wc -l simulations.csv", intern = TRUE)))
size.estimate <- lines / 1000 * top.size
Presumably there's some object overhead, so I would expect size.estimate to be an overestimate of the total memory usage when you load the whole csv file; this effect will be diminished if you use more lines to compute top.size. Of course, this approach could be inaccurate if the first 1000 lines of your file are not representative of the overall file contents.
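Where the wc utility is not available (for example on Windows), the same scaling idea can be sketched in base R only. The example below is an illustration under assumptions: it builds its own small CSV in a temporary file, and count_rows is a hypothetical helper, not part of any package.

```r
# Build a small example CSV so the sketch is self-contained (made-up data).
csv_path <- tempfile(fileext = ".csv")
set.seed(1)
write.csv(data.frame(x = rnorm(20000), y = rnorm(20000)),
          csv_path, row.names = FALSE)

# Count data rows without loading the table: total lines minus the header.
count_rows <- function(path, chunk = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0
  while (length(lines <- readLines(con, n = chunk)) > 0) {
    n <- n + length(lines)
  }
  n - 1
}

# Measure a 1000-row subset and scale it by the total row count.
sample_rows <- 1000
sample_bytes <- as.numeric(object.size(read.csv(csv_path, nrows = sample_rows)))
total_rows <- count_rows(csv_path)
estimate_bytes <- total_rows / sample_rows * sample_bytes

# Compare against the real size (only feasible here because the file is small).
actual_bytes <- as.numeric(object.size(read.csv(csv_path)))
c(estimate = estimate_bytes, actual = actual_bytes)
```

Because the fixed per-object overhead is counted once per 1000 rows in the estimate, the estimate should come out slightly above the actual size for a file like this one.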
R has the function object.size(), which provides an estimate of the memory being used to store an R object.
Its print method can also be used to format a predicted size, like this:
predict_data_size <- function(numeric_size, number_type = "numeric") {
  if (number_type == "integer") {
    byte_per_number <- 4
  } else if (number_type == "numeric") {
    byte_per_number <- 8  # 8 bytes per number
  } else {
    stop(sprintf("Unknown number_type: %s", number_type))
  }
  estimate_size_in_bytes <- numeric_size * byte_per_number
  class(estimate_size_in_bytes) <- "object_size"
  print(estimate_size_in_bytes, units = "auto")
}
# Example
# Matrix (rows=2000000, cols=100)
predict_data_size(2000000*100, "numeric") # 1.5 Gb
I've got a CSV file that I import into R and then split into several subsets, which constitute my list "importedData":
filePath <- "Test.csv"
rowsPerBatch <- 58
numRows <- length(count.fields(file = filePath, sep = ","))
readSegment <- function(x) fread(file = filePath, sep = ",", header = TRUE, skip = rowsPerBatch*(x-1), nrows = rowsPerBatch-1)
importedData <- lapply(1:(numRows/rowsPerBatch), readSegment)
The raw CSV file is just 4 MB. However, the list object within R is 17.8 MB. Why is that the case? Is there a way to do the above more memory-efficiently?
I am planning on scaling up the algorithm above to handle several dozen CSV files, each larger than 200 MB. If each of their corresponding list objects in R is 3x its original size, I'm afraid the memory usage will get out of control very quickly.
As noted in the Advanced R book's section on memory usage, numeric vectors occupy 8 bytes per element, integer vectors occupy 4 bytes per element, and complex vectors occupy 16 bytes per element.
Therefore, depending on the number of rows and columns in the input CSV file, the resulting R object can be significantly larger than the input CSV file.
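A small experiment can illustrate the point. The sketch below uses made-up data and assumes short numeric values: a single-digit integer takes about 2 bytes in a text file (the digit plus a newline, or 3 bytes with Windows line endings), but 4 bytes as an R integer and 8 bytes as an R double.

```r
# 100,000 single-digit integers written one per line.
vals <- rep(1:9, length.out = 100000)
csv_path <- tempfile(fileext = ".csv")
writeLines(as.character(vals), csv_path)

disk_bytes <- file.size(csv_path)
int_bytes  <- as.numeric(object.size(as.integer(vals)))  # 4 bytes per element
dbl_bytes  <- as.numeric(object.size(as.numeric(vals)))  # 8 bytes per element

c(disk = disk_bytes, integer = int_bytes, double = dbl_bytes)
```

Splitting a file into many small tables, as in the question, adds a fixed overhead per table on top of this per-value cost.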
Depending on the amount of RAM available on the machine used to process the data, R users rely on strategies such as the following to deal with limited memory:
sampling: analyze a random sample of the input data,
subsetting: process the data in subsets, then combine results, and
aggregating: aggregate data to higher unit of analysis, then analyze it.
Since R loads all objects into memory in order to process them, one must not only have enough RAM to load an object, but also enough RAM to process the object, including writing additional output objects.
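The subsetting strategy can be sketched in base R by reading a file in chunks from an open connection and keeping only running totals. This is a minimal illustration on a made-up file with one column named value; the chunk size and all names are assumptions, not a recommended configuration.

```r
# Self-contained example file (made-up data): one numeric column.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(value = 1:25000), csv_path, row.names = FALSE)

chunk_rows <- 10000
con <- file(csv_path, open = "r")

# Read the header line once and strip the quotes write.csv() adds.
header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

total <- 0
n_seen <- 0
repeat {
  # read.csv() on an open connection continues from the current position.
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_rows, col.names = header),
    error = function(e) NULL  # raised when no lines are left
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  total <- total + sum(chunk$value)   # keep only the running aggregate
  n_seen <- n_seen + nrow(chunk)
  if (nrow(chunk) < chunk_rows) break # last, partial chunk
}
close(con)

overall_mean <- total / n_seen
overall_mean
```

Only one chunk is held in memory at a time, so peak usage depends on chunk_rows rather than on the size of the file.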
Please note that storage formats such as data.table and tibble are more efficient than the base R data.frame, and can save as much as 50% in RAM usage, as I illustrate in the American Community Survey example.
I'm trying to complete the very simple task of reading in an unphased FASTA file and phasing it using ape, and then calculating Tajima's D using pegas, but my data doesn't seem to be reading in correctly. Input and output are as follows:
DNAbin8c18 <- read.dna(file="fasta8c18.fa", format="f")
I shouldn't need to attach any data since I've just generated the file, but since the data() command was in the manual, I executed it
and got
Warning message: In data(DNAbin8c18) : data set ‘DNAbin8c18’ not found
I know that data() only works in certain contexts, so maybe this isn't a big deal. I looked at what had been loaded
817452 DNA sequences in binary format stored in a matrix.
All sequences of same length: 96
CLocus_12706_Sample_1_Locus_34105_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_2_Locus_31118_Allele_0 [BayOfIslands_s08...
CLocus_12706_Sample_3_Locus_30313_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_5_Locus_33345_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_7_Locus_37388_Allele_0 [BayOfIslands_s09...
CLocus_12706_Sample_8_Locus_29451_Allele_0 [BayOfIslands_s09... ...
More than 10 million nucleotides: not printing base composition
so it looks like the data should be fine. Because of this, I tried what I want to do
and got
Error: cannot allocate vector of size 2489.3 Gb
Many people have completed this same test using as many or more SNPs than I have, and also using FASTA files, but is it possible that mine is too big, or can you see another issue?
The data file can be downloaded at the following link
I have also sent an earlier version of this question, with the data, to the r-sig-genetics mailing list, but I have not heard back.
Any thoughts would be much appreciated.
Thank you for the comment. Indeed, you are correct. The developer just emailed me with the following very helpful comments.
The problem is that your data are too big (too many sequences) and tajima.test() needs to compute the matrix of all pairwise distances. You could check this by trying:
dist.dna(DNAbin8c18, "N")
One possibility for you is to randomly sample some observations, and repeat this many times, e.g.:
n <- nrow(DNAbin8c18) # total number of sequences
tajima.test(DNAbin8c18[sample(n, size = 1000), ])
This could be:
N <- 1000 # number of repeats
RES <- matrix(NA, 3, N) # one column per repeat, one row per returned statistic
for (i in 1:N)
  RES[, i] <- unlist(tajima.test(DNAbin8c18[sample(n, size = 10000), ]))
You may adjust N and 'size =' to have something not too long to run. Then you may look at the distribution of each row of RES.
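The size in the error message is consistent with storing one distance per pair of sequences. The sketch below is a back-of-the-envelope check, assuming each distance is held as an 8-byte double and only the distinct pairs are stored; the helper name is hypothetical.

```r
# Memory needed for all pairwise distances among n sequences, in GiB
# (assumes one 8-byte double per distinct pair).
pairwise_gib <- function(n) {
  n_pairs <- n * (n - 1) / 2
  n_pairs * 8 / 2^30
}

full_gib   <- pairwise_gib(817452)  # all sequences reported above
sample_gib <- pairwise_gib(10000)   # one random subsample of 10,000

round(full_gib, 1)
round(sample_gib, 2)
```

Because the requirement grows with the square of the number of sequences, subsampling reduces it far more than proportionally.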
I am using the bigmemory package to load a heavy dataset, but when I check the size of the object (with the function object.size), it always returns 664 bytes. As far as I understand, the size should be almost the same as that of a classic R matrix, depending on the type (double or integer). Why, then, do I get 664 bytes as an answer? Below is reproducible code. The first chunk is really slow, so feel free to reduce the number of simulated values; 10^6 * 20 will be enough.
# CREATE BIG DATABASE -----------------------------------------------------
data <- as.data.frame(matrix(rnorm(6 * 10^6 * 20), ncol = 20))
write.table(data, file = "big-data.csv", sep = ",", row.names = FALSE)
format(object.size(data), units = "auto")
rm(list = ls())
# BIGMEMORY READ ----------------------------------------------------------
library(bigmemory)
ini <- Sys.time()
data <- read.big.matrix(file = "big-data.csv", header = TRUE, type = "double")
print(Sys.time() - ini)
print(object.size(data), units = "auto")
To determine the size of the bigmemory matrix use:
> GetMatrixSize(data)
[1] 9.6e+08
Data stored in big.matrix objects can be of type double (8 bytes, the default), integer (4 bytes), short (2 bytes), or char (1 byte).
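The figure reported above matches the element count multiplied by the bytes per element. The following is a small arithmetic sketch in base R, not a call into bigmemory; big_matrix_bytes is a hypothetical helper and the byte sizes are the ones listed above.

```r
# Bytes per element for each big.matrix storage type listed above.
bytes_per_type <- c(double = 8, integer = 4, short = 2, char = 1)

# Expected data size of a big.matrix (hypothetical helper, ignores metadata).
big_matrix_bytes <- function(n_rows, n_cols, type = "double") {
  stopifnot(type %in% names(bytes_per_type))
  n_rows * n_cols * bytes_per_type[[type]]
}

# The simulated data set from the question: 6e6 rows by 20 columns of doubles.
big_matrix_bytes(6e6, 20, "double")
```

The same matrix stored as type "integer" would need half of that.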
The reason for the size disparity is that data stores a pointer to a memory-mapped file. You should be able to find the new file in the temporary directory of your machine. - [Paragraph quoted from R High Performance Programming]
Essentially, bigmatrix maintains a binary data file on the disk called a backing file that holds all of the values in a data set. When values from a bigmatrix object are needed by R, a check is performed to see if they are already in RAM (cached). If they are, then the cached values are returned. If they are not cached, then they are retrieved from the backing file. These caching operations reduce the amount of time needed to access and manipulate the data across separate calls, and they are transparent to the statistician.
See page 8 of the documentation for a description
R High Performance Programming By: Aloysius Lim; William Tjhi
Data Science in R By: Duncan Temple Lang; Deborah Nolan
Using R, I am trying to open my NetCDF data, which contain a 5-dimensional space with 15 variables. (The variable used for the calculation is a 1000 x 920 matrix.)
This problem looks the same as an earlier question.
I got an explanation from here and from other sources.
At first I used the RNetCDF package, but after some trials I found inconsistencies in how the package read my data. Things finally improved after I switched to the ncdf package.
There is no problem opening the data in a single file, but when I try to loop over more than a hundred files in a folder for a specific variable (for example, variable no. 15), the program fails.
days = formatC(1:4, width=3, flag="0")
ncfiles = lapply(days, function(d) {
  filename = paste("data", d, ".nc", sep="")
  open.ncdf(filename)
})
It also fails when I try a command like this for a specific variable:
sapply(ncfiles, function(file) {get.var.ncdf(file, "var15")})
So my question is: is there any way to read a specific variable from all the NetCDF files and then do the calculation in one data frame? With the earlier solution I failed to extract variable no. 15 from the whole set of NetCDF data.
Thanks for any solution to this problem.
This is the last thing I have done.
When I write
for(i in seq_along(files)) {
  nc <- lapply(files[i],open.ncdf)
  lw = get.var.ncdf(nc,"var15")
}
I can see all the NetCDF data by printing nc.
So how can I get the variable data under new names automatically, like lw1, lw2, etc.?
I can't apply
var1 <- lapply(files, FUN = get.var.ncdf, variable = "var15")
and then do the calculation with all the data.
The other technique I tried was using the RNetCDF package and looping:
# Declare data frame
# Open all files
files = list.files("allnc/", pattern='*.nc', full.names=TRUE)
# Loop over files
for(i in seq_along(files)) {
  nc = open.nc(files[i])
  # Read the whole nc file and read the length of the varying dimension (here, the 3rd dimension, specifically time)
  lw = var.get.nc(nc,'DBZH')
  # Vary the time dimension for each file as required
  lw = var.get.nc(nc,'var15')
  # Add the values from each file to a single data.frame
}
I can read the variable data, but I only get the data from one of my .nc files.
Note: sample of my data file names (data20150102001.nc, data20150102002.nc, etc.)
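One way to keep the values from every file, instead of overwriting lw on each iteration, is to collect them in a named list. The sketch below is only an outline under assumptions: read_var15 uses the RNetCDF functions already shown above, while read_all and mean_across are hypothetical helpers, and every file is assumed to hold var15 with the same dimensions.

```r
# Reader for one file; the RNetCDF package is needed only when this is called.
read_var15 <- function(path) {
  nc <- RNetCDF::open.nc(path)
  on.exit(RNetCDF::close.nc(nc))
  RNetCDF::var.get.nc(nc, "var15")
}

# Apply a reader to every file and name the results lw1, lw2, ...
read_all <- function(files, read_one) {
  out <- lapply(files, read_one)
  names(out) <- paste0("lw", seq_along(files))
  out
}

# Element-wise mean across files (all arrays must share the same dimensions).
mean_across <- function(arrays) Reduce(`+`, arrays) / length(arrays)

# Intended use (not run here; directory name taken from the loop above):
# files   <- list.files("allnc/", pattern = "\\.nc$", full.names = TRUE)
# lw      <- read_all(files, read_var15)
# lw_mean <- mean_across(lw)
```

Individual results are then available as lw$lw1, lw$lw2 and so on, which avoids creating separately named variables.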
This solution uses NCO, not R. You may use it to check your R solution:
ncra -v var15 data20150102*.nc out.nc
That is all.
Full documentation in NCO User Guide.
You can use the ensemble statistics capabilities of CDO, but note that on some systems the number of files is limited to 256:
cdo ensmean data20150102*.nc ensmean.nc
You can replace "mean" with the statistic of your choice: max, std, var, min, etc.
Good evening,
I am trying to analyse the aforementioned data (edgelist or Pajek format). My first thought was R with the igraph package, but with my memory limit (6 GB) that won't do the trick. Will a 128 GB PC be able to handle the data? Are there any alternatives that don't require the whole graph in RAM?
Thanks in advance.
P.S.: I have found several programs, but I would like to hear some pro (yeah, that's you) opinions on the matter.
If you only want degree distributions, you likely don't need a graph package at all. I recommend the bigtabulate package so that
your R objects are file-backed, so you aren't limited by RAM
you can parallelize the degree computation using foreach
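Check out their website for more details. To give a quick example of this approach, let's first create an example with an edgelist involving 1 million edges among 1 million nodes.
N <- 1e6
M <- 1e6
edgelist <- cbind(sample(1:N,M,replace=TRUE),
                  sample(1:N,M,replace=TRUE))
colnames(edgelist) <- c("sender","receiver")
# Write the edgelist without a header so the copies can be concatenated
write.table(edgelist, file = "edgelist-small.csv", sep = ",",
            row.names = FALSE, col.names = FALSE)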
I next concatenate this file 10 times to make the example a bit bigger.
for i in $(seq 1 10); do
  cat edgelist-small.csv >> edgelist.csv
done
Next we load the bigtabulate package and read in the text file with our edgelist. The command read.big.matrix() creates a file-backed object in R.
library(bigtabulate)
x <- read.big.matrix("edgelist.csv", header = FALSE,
                     type = "integer", sep = ",",
                     backingfile = "edgelist.bin",
                     descriptorfile = "edgelist.desc")
nrow(x) # 1e7 as expected
We can compute the outdegrees by using bigtable() on the first column.
outdegree <- bigtable(x,1)
Quick sanity check to make sure table is working as expected:
# Check table worked as expected for first "node"
j <- as.numeric(names(outdegree[1])) # get name of first node
all.equal(as.numeric(outdegree[1]), # outdegree's answer
sum(x[,1]==j)) # manual outdegree count
To get indegree, just do bigtable(x,2).
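For a small edgelist that fits in memory, the same counts can be reproduced in base R, which may help when checking the file-backed results on a sample. This is a sketch on made-up data; tabulate() stands in for bigtable() here and is not the file-backed approach described above.

```r
# Small in-memory edgelist (made-up data): 50 edges among 10 nodes.
set.seed(42)
n_nodes <- 10
edges <- cbind(sender   = sample(n_nodes, 50, replace = TRUE),
               receiver = sample(n_nodes, 50, replace = TRUE))

# tabulate() counts occurrences of each node id from 1 to n_nodes,
# including nodes that never appear (count 0).
outdegree <- tabulate(edges[, "sender"],   nbins = n_nodes)
indegree  <- tabulate(edges[, "receiver"], nbins = n_nodes)

# Degree distribution: how many nodes have each outdegree.
table(outdegree)
```

Unlike table(), tabulate() keeps zero counts for nodes with no outgoing or incoming edges, which matters when building the degree distribution.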