Running out of memory with merge - R

I have panel data which looks like this (showing only the part relevant to my question):
Persno  122   122   122   333   333   333   333   333   444   444
Income  1500  1500  2000  2000  2100  2500  2500  1500  2000  2200
year    1990  1991  1992  1990  1991  1992  1993  1994  1992  1993
Now I would like to output, for every row (Persno), the years of work experience at the beginning of the year. I use ddply:
hilf3 <- ddply(data, .(Persno), summarize, Bgwork = 1:(max(year) - min(year)))
To produce output looking like this:
Workexperience: 1 2 3 1 2 3 4 5 1 2
Now I want to merge the ddply results to my original panel data:
data <- merge(data, hilf3, by.x = "Persno", by.y = "Persno")
The panel data set is very large. The code stops because of a memory size error.
Error message:
1: In make.unique(as.character(rows)) :
Reached total allocation of 4000Mb: see help(memory.size)
What should I do?

Re-reading your question, I think you don't actually want to use merge here at all. Just sort your original data frame and add the Bgwork column from hilf3. Also, your ddply call could result in a 1:0 sequence (when a person appears in only one year, max(year) - min(year) is 0), which is most likely not what you want. Try
data <- data[order(data$Persno, data$year), ]
hilf3 <- ddply(data, .(Persno), summarize, Bgwork = year - min(year) + 1)
stopifnot(nrow(data) == nrow(hilf3))
stopifnot(all(data$Persno == hilf3$Persno))
data$Bgwork <- hilf3$Bgwork

Well, perhaps the surest way of fixing this is to get more memory. However, that isn't always an option. What you can do depends somewhat on your platform. On Windows, check the result of memory.size() and compare it to your available RAM; if the allocation limit is lower than your RAM, you can raise it with memory.limit(). This is not an option on Linux, where R will by default already use all of your available memory.
Another issue that can complicate matters is whether you are running a 32-bit or 64-bit system, as 32-bit Windows can only address a limited amount of RAM (2-4 GB, depending on settings). This is not an issue on 64-bit Windows 7, which can address far more memory.
A more practical solution is to eliminate all unnecessary objects from your workspace before performing the merge. Run gc() to see how much memory you have and are using, and to collect any objects that no longer have references. Personally, I would run your ddply() from a script, save the resulting data frame as a CSV file, close your workspace, reopen it, and then perform the merge.
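A rough sketch of that workflow, reusing the object names from the question (the CSV file name is illustrative, and rm() plus gc() stand in for actually closing and reopening the session):
library(plyr)

hilf3 <- ddply(data, .(Persno), summarize, Bgwork = year - min(year) + 1)
write.csv(hilf3, "hilf3.csv", row.names = FALSE)   # park the result on disk

rm(list = setdiff(ls(), "data"))   # drop everything except the panel data
gc()                               # report memory use and release freed blocks

hilf3 <- read.csv("hilf3.csv")
data  <- merge(data, hilf3, by = "Persno")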
Finally, the worst possible option (but one that requires a whole lot less memory) is to create a new data frame and use R's subsetting commands to copy over the columns you want, one by one. I really don't recommend this, as it is tiresome and error prone, but I have had to do it once when there was no other way to complete my analysis (I ended up investing in a new computer with more RAM shortly afterwards).
Hope this helps.

If you need to merge large data frames in R, one good option is to do it in pieces of, say, 10,000 rows. If you're merging data frames x and y, loop over 10,000-row pieces of x, merge each piece (or rather use plyr::join) with y, and immediately append the result to a single CSV file. After all pieces have been merged and written out, read that CSV file back in. This is very memory-efficient with proper use of logical index vectors and well-placed rm and gc calls. It's not fast, though.
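A minimal sketch of that loop, assuming x and y share a key column called "id" and writing to a file named merged.csv (both names are illustrative):
library(plyr)

chunk_size <- 10000
out_file   <- "merged.csv"
starts     <- seq(1, nrow(x), by = chunk_size)

for (i in seq_along(starts)) {
  rows  <- starts[i]:min(starts[i] + chunk_size - 1, nrow(x))
  piece <- join(x[rows, ], y, by = "id", type = "left")
  # write the header only for the first chunk, then append
  write.table(piece, out_file, sep = ",", row.names = FALSE,
              col.names = (i == 1), append = (i > 1))
  rm(piece); gc()
}

merged <- read.csv(out_file)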

Since this question was posted, the data.table package has provided a re-implementation of data frames and a merge function that I have found to be much more memory-efficient than R's default. Converting the default data frames to data tables with as.data.table may avoid memory issues.
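For instance, a minimal sketch for the merge from the question, assuming data and hilf3 both carry a Persno column:
library(data.table)

dt_data  <- as.data.table(data)
dt_hilf3 <- as.data.table(hilf3)
setkey(dt_data, Persno)       # keyed joins are fast and memory-friendly
setkey(dt_hilf3, Persno)
data <- merge(dt_data, dt_hilf3, by = "Persno")   # data.table's merge method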

Related

Memory management in R ComplexUpset Package

I'm trying to plot a stacked barplot inside an upset plot using the ComplexUpset package. The plot I'd like to get looks something like this (where mpaa would be component in my example):
I have a dataframe of size 57244 by 21, where one column is ID, another is the type of recording, and the other 19 columns are components 1 through 19:
ID  component1  component2  ...  component19  type
1   1           0           ...  1            a
2   0           0           ...  1            b
3   1           1           ...  0            b
Ones and zeros indicate affiliation with a certain component. As shown in the example in the docs, I first convert these ones and zeros to logical, and then try to plot the basic upset plot. Here's the code:
df <- df %>% mutate(across(where(is.numeric), as.logical))
components <- colnames(df)[2:20]
upset(df, components, name='protein', width_ratio = 0.1)
But unfortunately, after processing the last line for a while, it spits out an error message like this:
Error: cannot allocate vector of size 176.2 Mb
Though I know I'm on a machine with 32 GB of RAM, I'm sure I couldn't have flooded the memory so much that 176 MB can't be allocated, so my guess is that I'm managing memory in R incorrectly somehow. Could you please explain what's faulty in my code, if possible?
I also know that the UpSetR package plots the same data, but as far as I know it provides no way to make the stacked barplot.
Somehow, it works if you:
Tweak the min_size parameter so that the plot is not overloaded and makes a better impression.
Make the first argument of upset() a sample of your data; this helps even if the sample is the whole dataset.
Both workarounds are sketched below.
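A hedged sketch, reusing df and components from the question (the min_size value and the use of dplyr::slice_sample are illustrative):
library(ComplexUpset)
library(dplyr)

# thin out the plot by only showing intersections with at least 100 members
upset(df, components, name = 'protein', width_ratio = 0.1, min_size = 100)

# or pass a sample as the data argument, even a sample the size of the full data
upset(slice_sample(df, n = nrow(df)), components,
      name = 'protein', width_ratio = 0.1, min_size = 100)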

How to reduce the size of the data in R?

I have a CSV file with 600,000 rows and 1,339 columns, making 1.6 GB. 1,337 of the columns are binary, taking either 1 or 0 values, and the other 2 columns are numeric and character variables.
I pulled the data in using the readr package with the following code:
VLU_All_Before_Wide <- read_csv("C:/Users/petas/Desktop/VLU_All_Before_Wide_Sample.csv")
When I checked the object size using the following code, it was about 3 GB.
> print(object.size(VLU_All_Before_Wide),units="Gb")
3.2 Gb
In the next step, using the code below, I want to create training and test sets for LASSO regression.
set.seed(1234)
train_rows <- sample(1:nrow(VLU_All_Before_Wide), .7*nrow(VLU_All_Before_Wide))
train_set <- VLU_All_Before_Wide[train_rows,]
test_set <- VLU_All_Before_Wide[-train_rows,]
yall_tra <- data.matrix(subset(train_set, select=VLU_Incidence))
xall_tra <- data.matrix(subset(train_set, select=-c(VLU_Incidence,Replicate)))
yall_tes <- data.matrix(subset(test_set, select=VLU_Incidence))
xall_tes <- data.matrix(subset(test_set, select=-c(VLU_Incidence,Replicate)))
When I started my R session the RAM usage was at ~3 GB, and by the time I had executed all the above code it was at 14 GB, leaving me with an error saying it can't allocate a vector of size 4 GB. There were no other applications running other than 3 Chrome windows. I removed the original dataset and the training and test datasets, but that only freed 0.7 to 1 GB of RAM.
rm(VLU_All_Before_Wide)
rm(test_set)
rm(train_set)
I'd appreciate it if someone could show me a way to reduce the size of the data.
Thanks
R struggles when it comes to huge datasets because it tries to load and keep all the data in RAM. You can use other packages available in R which are made to handle big datasets, like bigmemory and ff. Check my answer here, which addresses a similar issue.
You can also choose to do some data processing & manipulation outside R and remove unnecessary columns and rows. But still, to handle big datasets, it's better to use the capable packages.
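As an illustration, a rough sketch of reading such a file into a file-backed big.matrix with bigmemory (the file names are assumptions, and since a big.matrix holds a single numeric type, the character column would need to be dropped or recoded first):
library(bigmemory)

X <- read.big.matrix("VLU_All_Before_Wide_Sample.csv", header = TRUE,
                     type = "short",       # 2-byte integers are enough for 0/1 indicators
                     backingfile = "vlu.bin", descriptorfile = "vlu.desc")
dim(X)   # the data lives on disk in the backing file, not in RAM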

Criteria for deciding which character columns should be converted to factors

I have been working through the book "Analyzing Baseball Data with R" by Marchi and Albert and am wondering about an issue which they don't address.
Many of the datasets I need to import are fairly large (though not really "Big" in the sense of "Big Data"). For example, the Retrosheet Game Logs have 1 csv file per year dating back to 1871, where each file has a row for each game played that year and 161 columns. When I read one into a dataframe using read.csv() with the default setting of stringsAsFactors, fully 75 of the 161 columns become factors. Some of these columns conceptually are factors (such as one containing "D" or "N" for day or night games), but others are probably better left as strings (many of the columns contain names of starting pitchers, closers, etc.). I know how to convert columns from factors to strings or vice versa, but I don't want to have to scan through 161 columns, making an explicit decision for 75 of them.
The reason I think it important is that I've noticed that conceptually small dataframes obtained by subsetting these game logs are surprisingly large, given the need to retain the full factor information. For example, given the dataframe GL2016 obtained by downloading, unzipping and then reading in the file, object.size(GL2016) is about 2.8 MB, and when I use:
df <- with(GL2016,GL2016[V7 == "CLE" & V13 == "D",])
to extract the home day games played by the Cleveland Indians in 2016, I get a df with 26 rows. 26/2428 (where 2428 is the number of rows in the whole dataframe) is slightly more than 1%, but object.size(df) is around 1.3 MB, which is far more than 1% of the size of GL2016.
I came up with an ad-hoc solution. I first defined a function:
big.factor <- function(v, k) { is.factor(v) && length(levels(v)) > k }
And then used mutate_if from dplyr like this:
GL2016 %>% mutate_if(function(v) big.factor(v, 30), as.character) -> GL2016
30 is the number of teams in the MLB and I somewhat arbitrarily decided that any factor with more than 30 levels should probably be treated as a string.
After this code has been run, the number of factor variables has been reduced from 75 to 12. It works in the sense that even though now GL2016 is around 3.2 MB (slightly larger than before), if I now subset the dataframe to pull out the Cleveland day games, the resulting dataframe is just 0.1 MB.
Questions:
1) What criteria (hopefully less ad-hoc than what I used above) are relevant for deciding which character columns should be converted to factors when importing a large data set?
2) I am aware of the cost in terms of memory footprint of converting all character data to factors, but am I incurring any hidden costs (say in processing time) when I convert most of these factors back into strings?
Essentially, I think what you need to do is:
df <- with(GL2016,GL2016[V7 == "CLE" & V13 == "D",])
df <- droplevels(df)
The droplevels function will remove all the unused factor levels, and thus reduce the size of df immensely.
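As a quick illustration (the exact sizes will vary with the data):
object.size(df)               # subset still carries every level of every factor column
object.size(droplevels(df))   # unused levels dropped, so the footprint shrinks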

Removing duplicates requires a transpose, but my dataframe is too large

I had asked a question here. I had a simple dataframe, for which I was attempting to remove duplicates. Very basic question.
Akrun gave a great answer, which was to use this line:
df[!duplicated(data.frame(t(apply(df[1:2], 1, sort)), df$location)),]
I went ahead and did this, which worked great on the dummy problem. But I have 3.5 million records that I'm trying to filter.
In an attempt to see where the bottleneck is, I broke the code into steps.
step1 <- apply(df1[1:2], 1, sort)
step2 <- t(step1)
step3 <- data.frame(step2, df1$location)
step4 <- !duplicated(step3)
final <- df1[step4, ]
Step 1 took quite a long time, but it wasn't the worst offender.
Step 2, however, is clearly the culprit.
So I'm in the unfortunate situation where I'm looking for a way to transpose 3.5 million rows in R. (Or maybe not in R. Hopefully there is some way to do it somewhere).
Looking around, I saw a few ideas:
Install the WGCNA library, which has a transposeBigData function. Unfortunately this package is no longer being maintained, and I can't install all of its dependencies.
Write the data to a CSV file, then read it back in line by line, transposing each line one at a time. For me, even writing the file ran overnight without completing.
This is really strange. I just want to remove duplicates. For some reason, I have to transpose a dataframe in this process. But I can't transpose a dataframe this large.
So I need a better strategy for either removing duplicates, or for transposing. Does anyone have any ideas on this?
By the way, I'm using Ubuntu 14.04 with 15.6 GiB RAM, for which cat /proc/cpuinfo returns:
model name : Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
cpu MHz : 1200.000
cache size : 6144 KB
Thanks.
df <- data.frame(id1 = c(1,2,3,4,9), id2 = c(2,1,4,5,10), location=c('Alaska', 'Alaska', 'California', 'Kansas', 'Alaska'), comment=c('cold', 'freezing!', 'nice', 'boring', 'cold'))
A faster option would be to use pmin/pmax with data.table:
library(data.table)
setDT(df)[!duplicated(data.table(pmin(id1, id2), pmax(id1, id2)))]
# id1 id2 location comment
#1: 1 2 Alaska cold
#2: 3 4 California nice
#3: 4 5 Kansas boring
#4: 9 10 Alaska cold
If 'location' also needs to be included to find the unique rows:
setDT(df)[!duplicated(data.table(pmin(id1, id2), pmax(id1, id2), location))]
So after struggling with this for most of the weekend (grateful for plenty of selfless help from the illustrious @akrun), I realized that I would need to go about this in a completely different manner.
Since the dataframe was simply too large to process in memory, I ended up using a strategy where I pasted together a (string) key and column-bound it onto the dataframe. Next, I collapsed each key and sorted its characters. Here I could use which to get the indices of the rows with non-duplicate keys, and with that I could filter my dataframe.
df_with_key <- within(df, key <- paste(boxer1, boxer2, date, location, sep=""))
strSort <- function(x)
  sapply(lapply(strsplit(x, NULL), sort), paste, collapse = "")
df_with_key$key <- strSort(df_with_key$key)
idx <- which(!duplicated(df_with_key$key))
final_df <- df[idx,]

R readr package - written and read in file doesn't match source

I apologize in advance for the somewhat limited reproducibility here. I am doing an analysis on a very large (for me) dataset from the CMS Open Payments database.
There are four files I downloaded from that website, read into R using readr, manipulated a bit to make them smaller (column removal), and then stuck together using rbind. I would like to write my pared-down file out to an external hard drive so I don't have to read in all the data and redo the paring each time I want to work on it. (Obviously, it's all scripted, but it takes about 45 minutes, so I'd like to avoid it if possible.)
So I wrote out the data and read it back in, but now I am getting different results. Below is about as close as I can get to a good example. The data is named sa_all. There is a column in the table for the source; it can only take on two values: gen or res. It is a column that is actually added as part of the analysis, not one that comes with the data.
table(sa_all$src)
gen res
14837291 822559
So I save the sa_all dataframe into a CSV file.
write.csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv',
row.names = FALSE)
Then I open it:
sa_all2 <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
table(sa_all2$src)
g gen res
1 14837289 822559
I did receive the following parsing warnings.
Warning: 4 parsing failures.
row col expected actual
5454739 pmt_nature embedded null
7849361 src delimiter or quote 2
7849361 src embedded null
7849361 NA 28 columns 54 columns
Since I manually add the src column and it can only take on two values, I don't see how this could cause any parsing errors.
Has anyone had any similar problems using readr? Thank you.
Just to follow up on the comment:
write_csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv')
sa_all2a <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
Warning: 83 parsing failures.
row col expected actual
1535657 drug2 embedded null
1535657 NA 28 columns 25 columns
1535748 drug1 embedded null
1535748 year an integer No
1535748 NA 28 columns 27 columns
Even more parsing errors, and it looks like some columns are getting shuffled entirely:
table(sa_all2a$src)
100000000278 Allergan Inc. gen GlaxoSmithKline, LLC.
1 1 14837267 1
No res
1 822559
There are columns for manufacturer names and it looks like those are leaking into the src column when I use the write_csv function.
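One quick way to see how many rows are affected after re-reading (sa_all2a and the two legitimate src values come from the output above; this is just a diagnostic, not a fix):
bad_rows <- which(!(sa_all2a$src %in% c("gen", "res")))
length(bad_rows)          # how many rows picked up a foreign value
sa_all2a[bad_rows, ]      # inspect the shifted rows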
