Removing lines in data.table and spiking memory usage - r

I have a data.table of a decent size: 89M rows, 3.7GB. Keys are in place, so everything is set up properly. However, I am experiencing a problem when I remove rows based on a column's value: the memory usage just goes through the roof!
Just for the record, I have read the other posts here about this, but they don't really help much. Also, I am using RStudio, which I am pretty sure is not ideal, but it helps while experimenting; I notice the same behaviour in the plain R console too. I am using Windows.
Let me post an example (taken from a similar question regarding removal of rows) that creates a very big data.table, approx 1e6 x 100:
library(data.table)

rm(list = ls(all = TRUE))    # clean up the workspace
gc(reset = TRUE)             # call gc (not really helping, but whatever..)
dimension <- 1e6             # let's say a million
DT <- data.table(col1 = 1:dimension)
cols <- paste0('col', 2:100) # let these be conditions as columns
for (col in cols) { DT[, (col) := 1:dimension] }
DT.m <- melt(DT, id.vars = c('col1', 'col2', 'col3'))
OK, so now we have a data.table with 97M rows, approx 1.8GB. This is our starting point.
Let's remove all rows where the value column (after the melt) equals, e.g., 4:
DT.m <- DT.m[value != 4]
The last line takes a huge amount of memory! Prior to executing this line, the memory usage on my PC is approx 4.3GB, and just after the line is executed, it goes to 6.9GB!
This is the correct way to remove the lines, right? (just checking). Has anyone come across this behaviour before?
I thought of looping over all parameters and keeping the rows I am interested in in another data.table, but somehow I doubt that this is a proper way of working.
I am looking forward to your help.
Thanks
Nikos

Update: With this commit, the logical vector is replaced by row indices to save memory (Read the post below for more info). Fixed in 1.9.5.
Doing sum(DT.m$value == 4L) gives me 97. That is, you're removing a total of 97 rows from 97 million. This in turn implies that the subset operation would return ~1.8GB data set as well.
Your memory usage was 4.3GB to begin with.
The condition you provide, value != 4, takes the space of a logical vector of length 97 million, i.e. ~360MB.
data.table then computes which() on that logical vector to fetch the row indices; since almost all rows are kept, that is another ~360MB.
The subset that's being created has to be allocated elsewhere first, and that's another ~1.8GB.
The total comes to 4.3 + 1.8 + 0.72 ≈ 6.8GB.
And garbage collection hasn't happened yet. If you now do gc(), the memory corresponding to old DT.m should be released.
The only place where I can see that we can save space is by storing the integer indices in place of the logical vector (rather than holding both vectors at once), to save the extra ~360MB.
Usually which() results in a much smaller (often negligible) index vector, and the subset is therefore faster; that is the reason for using which() in the first place. But in this case you remove only 97 rows, so the index vector is almost as long as the table itself.
But good to know that we can save a bit of memory. Could you please file an issue here?
Removing rows by reference, #635, when implemented, should both be fast and memory efficient.
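If you want to see where those intermediate vectors go, you can measure them yourself. A rough sketch, assuming the 97M-row DT.m built in the question (object.size() reports are approximate):
mask <- DT.m$value != 4   # logical vector, one element per row (the ~360MB above)
idx  <- which(mask)       # integer row indices; almost all rows survive, so another ~360MB
print(object.size(mask), units = "MB")
print(object.size(idx),  units = "MB")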

Related

Read huge csv file using `read.csv` by divide-and-conquer strategy?

I am supposed to read a big csv file (5.4GB, with 7m lines and 205 columns) in R. I have successfully read it using data.table::fread(). But I want to know whether it is possible to read it using the basic read.csv().
I tried brute force, but my 16GB of RAM cannot hold that. Then I tried the 'divide-and-conquer' (chunking) strategy below, but it still didn't work. How should I do this?
# read the first chunk (skip = 1 drops the header row)
dt1 <- read.csv('./ss13hus.csv', header = FALSE, nrows = 721900, skip = 1)
print(paste(1, 'th chunk completed'))

system.time(
  for (i in 1:9) {
    # read the next chunk and append it to what has been read so far
    tmp <- read.csv('./ss13hus.csv', header = FALSE, nrows = 721900, skip = i * 721900 + 1)
    dt1 <- rbind(dt1, tmp)
    print(paste(i + 1, 'th chunk completed'))
  }
)
Also, I want to know how fread() works such that it can read all the data at once so efficiently, in terms of both memory and time.
Your issue is not fread(); it's the memory bloat caused by not defining colClasses for all your (205) columns. But be aware that trying to read all 5.4GB into 16GB of RAM is really pushing it in the first place: you almost surely won't be able to hold that entire dataset in memory, and even if you could, you'll blow out memory whenever you try to process it. So your approach is not going to fly; you seriously have to decide which subset you can handle, i.e. which fields you absolutely need to get started:
Define colClasses for your 205 columns: 'integer' for integer columns, 'numeric' for double columns, 'logical' for boolean columns, 'factor' for factor columns. Otherwise things get stored very inefficiently (e.g. millions of strings are very wasteful), and the result can easily be 5-100x larger than the raw file.
If you can't fit all 7m rows x 205 columns (which you almost surely can't), then you'll need to aggressively reduce memory by doing some or all of the following:
read in and process chunks (of rows) (use skip, nrows arguments, and search SO for questions on fread in chunks)
filter out all unneeded rows (e.g. you may be able to do some crude processing to form a row-index of the subset rows you care about, and import that much smaller set later)
drop all unneeded columns (use fread's select/drop arguments to specify vectors of column names to keep or drop)
Make sure stringsAsFactors = FALSE; it's a notoriously bad default in R which causes no end of memory grief.
Date/datetime fields are currently read as character (which is bad news for memory usage: millions of unique strings). Either drop the date columns entirely to begin with, or read the data in chunks and convert them with the fasttime package or standard base functions.
Look at the args for NA treatment. You might want to drop columns with lots of NAs, or messy unprocessed string fields, for now.
Please see ?fread and the data.table docs for the syntax of the above. If you encounter a specific error, post a snippet of, say, 2 lines of data (head(data)), your code, and the error.
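To make that concrete, here is a minimal sketch of an fread() call that puts those options together. The column names and types are placeholders, not the real ss13hus layout; substitute the fields you actually need:
library(data.table)

# Placeholder column selection and types -- replace with the fields you actually need
keep_cols <- c("SERIALNO", "ST", "NP", "WGTP")
col_types <- c(SERIALNO = "character", ST = "integer", NP = "integer", WGTP = "integer")

dt <- fread("./ss13hus.csv",
            select           = keep_cols,     # read only the needed columns
            colClasses       = col_types,     # avoid type guessing and wasteful storage
            na.strings       = c("", "NA"),   # treat empty fields as NA
            stringsAsFactors = FALSE)         # never auto-convert strings to factors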

column slots in data.table

I have a dataset x with 350m rows and 4 columns. When joining two columns from a dataset i of 13m rows and 19 columns, I encounter the following error:
Internal logical error. DT passed to assign has not been allocated enough column slots. l=4, tl=4, adding 1
I have checked Not Enough Columns Slots, but there the problem appears to be the number of columns. Since I have only a few, I would be surprised if this were the issue.
Also, I found https://github.com/Rdatatable/data.table/issues/1830, where the error is related to "column slots", but I do not understand what they are. When checking truelength, I obtain
> truelength(x)
[1] 0
> truelength(i)
[1] 0
My understanding is that setting, for example, alloc.col(x, 32) or alloc.col(i, 32), or both, could solve the issue. However, I don't understand what this does or what the issue is. Can anyone offer an explanation?
Part of what makes data.table so efficient is that it tries to be smart about memory usage (whereas base data.frames tend to end up getting copied left and right in regular usage; e.g., setting names(DF) <- col_names can actually copy all of DF despite only manipulating an attribute of the object).
Part of this, in turn, is that a data.table is always allocated a certain size in memory to allow for adding/subtracting column pointers more fluidly (from a memory perspective).
So, while actual columns take memory greedily (when they're created, sufficient memory is claimed to store the nrow(DT)-size vector), the column pointers, which store the addresses where the actual data can be found (you can think of them roughly like column names, if you don't know the grittier details of pointers), have a fixed memory slot reserved upon creation.
alloc.col forces the column pointer address reserve process; this is most commonly used in two cases:
Your data needs a lot of columns (by default, room is allocated for 1024 pointers more than there are columns at definition)
You've loaded your data from RDS (since readRDS/load don't know to allocate this memory for a data.table upon loading, we have to trigger this ourselves)
I assume Frank is right and that you're experiencing the latter. See ?alloc.col for some more details, but in most cases you should just run alloc.col(x) and alloc.col(i) -- except on highly constrained machines, allocating 1024 column pointers requires relatively little memory, so you shouldn't spend too much effort skimping and trying to figure out the right quantity.
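As a rough sketch of what that looks like in practice (the readRDS() call and file name are just for illustration):
library(data.table)

x <- readRDS("x.rds")   # hypothetical file; tables restored this way lose their over-allocation
truelength(x)           # 0, as in the question: no spare column-pointer slots
alloc.col(x)            # re-allocate the default number of spare slots (see ?alloc.col)
truelength(x)           # now larger than ncol(x), so joins and := can add columns safely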

r: managing memory allocation in loops

First, this question is NOT about
Error: cannot allocate vector of size n
I accept this error as a given, and I am trying to avoid the error in my code.
I have a dataset of 3000+ variables and 120000 cases
All columns are numeric
I need to reset NA with zero
If I reassign values to 0 for the entire dataset, I get the memory allocation error.
So I am reassigning the values to zero one column at a time:
resetNA <- function(results)
{
  # zero out NAs column by column, skipping the first 10 columns
  for (i in 1:ncol(results))
  {
    if (i > 10)
    {
      results[, i][is.na(results[, i])] <- 0
    }
  }
  print(head(results))
}
After about 1000 columns, I still get the memory allocation error.
Now, this seems strange to me: somehow memory usage keeps increasing after each loop iteration, and I don't see why this would be the case.
Also, I tried calling the garbage collection function after each iteration, and I still got the memory allocation error.
Can someone explain to me how I can manage the variables to avoid the incremental increase in memory allocation (after all, the data frame's size has not changed)?
As noted in the comments above, the answer is here:
Fastest way to replace NAs in a large data.table
I tried it and it works very well
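For reference, that linked answer boils down to data.table's set(), which replaces the NAs column by column, by reference, so no copies of the whole table are made. A minimal sketch (assuming the data has first been converted to a data.table; all columns here are numeric, per the question):
library(data.table)

DT <- as.data.table(results)   # 'results' is the data frame from the question
for (j in seq_along(DT)) {
  # set() modifies the column in place; which() keeps the index vector small
  set(DT, i = which(is.na(DT[[j]])), j = j, value = 0)
}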
I have learned an important general principle about R memory usage.
See this discussion.
Wherever possible, avoid looping over a data frame; use lapply() instead. lapply() treats the data frame as a list of columns, runs the relevant function on each column, and returns a list, which can then be converted back to a data frame.
The following example recodes numeric frequencies to a categorical variable. It is fast and does not increase memory usage.
list1 <- lapply(mybigdataframe, function(x) ifelse(x > 0, "Yes", "No"))
newdf1 <- as.data.frame(list1)

How to efficiently merge these data.tables

I want to create a certain data.table to be able to check for missing data.
Missing data in this case does not mean there will be an NA; the entire row is simply left out. So I need to be able to see, for a certain time-dependent column, which values are missing for which level of another column. It is also important to know whether the missing values are clustered together or spread across the dataset.
So I have this 6,000,000x5 data.table (call it TableA) containing the time-dependent variable, an ID for the level, and the value N which I would like to add to my final table.
I have another table (TableB) which is 207x2. This couples the IDs for the factor to the columns of TableC.
TableC is 1,500,000x207, of which each of the 207 columns corresponds to an ID according to TableB, and the rows correspond to the time-dependent variable in TableA.
These tables are large, and although I recently acquired extra RAM (now totalling 8GB), my computer keeps swapping out TableC; for each write it has to be paged back in, only to be swapped out again afterwards. This swapping is what is consuming all my time: about 1.6 seconds per row of TableA, and as TableA has 6,000,000 rows, this operation would take more than 100 days running non-stop.
Currently I am using a for-loop to loop over the rows of TableA. Doing no operation, this for-loop runs almost instantly. I made a one-line command that looks up the correct column and row number for TableC in TableA and TableB and writes the value from TableA to TableC.
I broke up this one-liner to do a system.time analysis, and each step takes about 0 seconds except writing to the big TableC.
This showed that writing the value to the table was the most time-consuming part; looking at my memory use, I can see a huge chunk appearing whenever a write happens, and it disappears as soon as the write is finished.
library(data.table)

TableA <- data.table("Id" = round(runif(200, 1, 100)),
                     "TimeCounter" = round(runif(200, 1, 50)),
                     "N" = round(rnorm(200, 1, 0.5)))
TableB <- data.table("Id" = c(1:100), "realID" = c(100:1))
TSM <- matrix(0, ncol = nrow(TableB), nrow = 50)
TableC <- as.data.table(TSM)
rm(TSM)

for (row in 1:nrow(TableA))
{
  # look up the TableC column for this Id via TableB, and the TableC row via TimeCounter
  TableCcol <- TableB[realID == TableA[row, Id], Id]
  TableCrow <- TableA[row, TimeCounter]
  val <- TableA[row, N]
  TableC[TableCrow, TableCcol] <- val
}
Can anyone advise me on how to make this operation faster, by preventing the memory swap at the last step in the for-loop?
Edit: On the advice of @Arun I took some time to develop some dummy data to test on. It is now included in the code given above.
I did not include wanted results because the dummy data is random and the routine does work. It's the speed that is the problem.
Not entirely sure about the results, but give it a shot with the dplyr/tidyr packages, as they seem to be more memory efficient than for loops.
install.packages("dplyr")
install.packages("tidyr")
library(dplyr)
library(tidyr)
TableC <- TableC %>% gather(tableC_id, value, 1:207)
This turns TableC from 1,500,000x207 into a long-format 310,500,000x2 table with 'tableC_id' and 'value' columns.
TableD <- TableA %>%
  left_join(TableB, by = c("LevelID" = "TableB_ID")) %>%
  left_join(TableC, by = c("TableB_value" = "TableC_id"))
These are a couple of packages I've been using of late, and they seem to be very efficient, but the data.table package is designed specifically for managing large tables, so there could be useful functions there. I'd also take a look at sqldf, which allows you to query your data.frames via SQL commands.
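As a small illustration of the sqldf suggestion, run against the dummy TableA above (the filter itself is just an example):
library(sqldf)

# e.g. pull only the rows of TableA with N above 1, using plain SQL
subset_A <- sqldf("SELECT Id, TimeCounter, N FROM TableA WHERE N > 1")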
Rethinking my problem, I came to a solution which works much faster.
The thing is that it does not follow from the question posed above, because I had already done a couple of steps to arrive at the situation described in my question.
Enter TableX, from which I aggregated TableA. TableX contains Ids and TimeCounters and much more; that's why I thought it would be best to create a smaller table containing only the information I needed.
TableX also contains only the relevant times, while in my question I was using a complete time series from the beginning of time (01-01-1970 ;) ). It was much smarter to use the levels in my TimeCounter column to build TableC.
Also, I had forced myself to set values individually, while merging is a lot faster in data.table. So my advice is: whenever you need to set a lot of values, try to find a way to merge instead of copying them in individually.
Solution:
# Create a table with time on the row dimension by just using the TimeCounters we find in our original data.
TableC <- data.table(TimeCounter = as.numeric(levels(factor(TableX[, TimeCounter]))))
setkey(TableC, TimeCounter) # important to set the correct key for the merge
# Loop over all unique Ids (maybe this can be reworked into something *apply()ish)
for (i in levels(factor(TableX[, Id])))
{
  # Count how many samples we have for this Id per TimeCounter
  TableD <- TableX[Id == i, .N, by = TimeCounter]
  setkey(TableD, TimeCounter) # set key for the merge
  # Merge with Id on the column dimension
  TableC[TableD, paste("somechars", i, sep = "") := N]
}
There could be steps missing in TimeCounter, so now I have to check for gaps in TableC and insert the rows that were missing for all Ids. Then I can finally check where my data gaps are and how big they are.
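One possible way (not from the original post) to fill in those missing TimeCounter rows is to join TableC against a complete sequence of TimeCounters:
library(data.table)

# Build the full range of TimeCounters and right-join TableC onto it;
# missing time steps appear as rows filled with NA in every Id column.
all_times <- data.table(TimeCounter = seq(min(TableC$TimeCounter), max(TableC$TimeCounter)))
setkey(all_times, TimeCounter)
TableC_full <- TableC[all_times]   # TableC is keyed on TimeCounter (see above)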

Why does changing a column name take an extremely long time with a large data.frame?

I have a data.frame in R with 19 million rows and 90 columns. I have plenty of spare RAM and CPU cycles. It seems that changing a single column name in this data frame is a very intense operation for R.
system.time(colnames(my.df)[1] <- "foo")
   user  system elapsed
 356.88   16.54  373.39
Why is this so? Does every row store the column name somehow? Is this creating an entirely new data frame? It seems this operation should complete in negligible time. I don't see anything obvious in the R manual entry.
I'm running build 7600 of R (64bit) on Windows 7, and in my current workspace, setting colnames on a small data.frame takes '0' time according to system.time().
Edit: I'm aware of the possibility of using data.table, and, honestly, I can wait 5 minutes for the rename to complete whilst I go get some tea. What I'm interested in is what is happening and why?
As several commenters have mentioned, renaming data frame columns is slow, because (depending on how you do it) it makes between 1 and 4 copies of the entire data.frame. Here, from data.table's ?setkey help page, is the nicest way of demonstrating this behavior that I've seen:
DF = data.frame(a=1:2,b=3:4) # base data.frame to demo copies
try(tracemem(DF)) # try() for non-Windows where R is
# faster without memory profiling
colnames(DF)[1] <- "A" # 4 copies of entire object
names(DF)[1] <- "A" # 3 copies of entire object
names(DF) <- c("A", "b") # 1 copy of entire object
`names<-`(DF,c("A","b")) # 1 copy of entire object
x=`names<-`(DF,c("A","b")) # still 1 copy (so not print method)
# What if DF is large, say 10GB in RAM. Copy 10GB just to change a column name?
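(By contrast, data.table's setnames() assigns the new name by reference, so tracemem() reports no copy; a minimal sketch, not taken from the ?setkey page itself:)
library(data.table)

DT = data.table(a = 1:2, b = 3:4)
try(tracemem(DT))        # as above, to watch for copies
setnames(DT, "a", "A")   # renames in place -- no copy of the columns, however large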
To (start) understanding why things are done this way, you'll probably need to delve into some of the related discussions on R-devel. Here are a couple: R-devel: speeding up perception and R-devel: Confused about NAMES
My impressionistic reading of those threads is that:
At least one copy is made so that modifications to it can be 'tried out' before overwriting the original. Thus, if something is wrong with the value-to-be-reassigned, [<-.data.frame or names<- can 'back out' and deliver an error message without having done any damage to the original object.
Several members of R-core aren't completely satisfied with how things are working right now. Several folks explain that in some cases "R loses track"; Luke Tierney indicates that he's tried some modifications relating to this copying in the past "in a few cases and always had to back off"; and Simon Urbanek hints that "there may be some things coming up, too"
(As I said, though, that's just impressionistic: I'm simply not able to follow a full conversation about the details of R's internals!)
Also relevant, in case you haven't seen it, here's how something like names(z)[3] <- "c2" "really" works:
# From ?names<-
z <- "names<-"(z, "[<-"(names(z), 3, "c2"))
Note: Much of this answer comes from Matthew Dowle's answer to this other question. (I thought it was worth placing it here, and giving it some more exposure, since it's so relevant to your own question).
