Merging 2 large data sets in R

I am trying to merge two large data sets, as I need to create a final training set for my models to run.
head(TrainWithAppevents_rel4)
  event_id           device_id gender age group phone_brand device_model numbrand nummodel              app_id
6        6 1476664663289716480      M  19  M22-        华为       Mate 7       29      919 4348659952760821248
and
head(app_labels)
               app_id label_id
1 7324884708820028416      251
The first dataset now has unique rows, as I have already removed all duplicates.
I want my final set to have the columns below:
event_id device_id gender age group phone_brand device_model numbrand nummodel app_id label_id
However, when I try to merge using the following in R (an RStudio session):
TrainWithLabels = merge(x = TrainWithAppevents_rel4, y = app_labels, by = "app_id", all.x = TRUE)
I get the following error:
**Error: cannot allocate vector of size 512.0 Mb**
The error varies if I run it again, but only in the reported vector size.
The sizes of my datasets are as below :
> dim(TrainWithAppevents_rel4)
[1] 4787796 10
> dim(app_labels)
[1] 459943 2
More information about the machine/R i use :
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
I use an Intel 2.6 GHz CPU / 16 GB RAM / 64-bit OS / Windows 10 / x64-based processor.
I have tried the following:
- Reducing the dataset by removing duplicates and unwanted columns; all rows in the first dataset are now unique.
- Closing all other applications on my laptop and then running the merge; it still fails.
- Executing gc() and then running the merge.
I have gone through similar questions on SO for R, but none of them offered a way forward, and none were specific to merges failing on a 64-bit machine.
Can anyone please suggest a solution or a workaround to move forward?
Please assume that this is the only machine where I can execute the code, and that running this R script on AWS via Zeppelin is not possible at the moment.
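One workaround, sketched under the assumption that the data.table package can be installed (it is not mentioned in the question), is to do the left join with data.table, which usually needs far less working memory than base merge():
library(data.table)

# setDT() converts the existing data frames in place, so no extra copies
# of the large tables are made.
setDT(TrainWithAppevents_rel4)
setDT(app_labels)

# Left join on app_id, keeping every row of TrainWithAppevents_rel4
# (the data.table equivalent of merge(..., all.x = TRUE)).
TrainWithLabels <- app_labels[TrainWithAppevents_rel4, on = "app_id"]
Note that because a single app_id can carry several label_id values in app_labels, the joined table can have more rows than TrainWithAppevents_rel4; that growth, rather than the join itself, may be what exhausts the 16 GB of RAM.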

Related

Mclust freezes with small dataset

I am trying to use the Mclust() function from the R-package mclust on a dataset with 500 observations and 2 variables, and I want to identify 2 clusters.
> head(data)
x y
1 0.9929185 -1.9662945
2 8.2259360 -0.7240049
3 3.3866952 -1.8054764
4 -0.5161490 -2.3096992
5 1.8931073 -1.8928091
6 4.0833228 -1.9045669
> Mclust(data, G = 2)
fitting ...
|=============================================================== | 67%
This should produce output relatively quickly, but it freezes at 67%.
I ran this function multiple times over different datasets, and had no problems whatsoever. It even works if I only include observations up to row 498, but fails as soon as row 499+ is included.
498 -1.710175250 -1.612248596
499 -5.666497204 5.565422240
500 -3.649579976 1.552779499
I have uploaded the whole dataset in my GitHub repository: https://github.com/fstermann/bthesis/tree/main/MclustFreeze
I would greatly appreciate it if anyone has an idea why this is happening with this specific dataset.
> sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mclust_5.4.7
loaded via a namespace (and not attached):
[1] compiler_4.0.5 tools_4.0.5
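A diagnostic sketch (an assumption on my part, not a known fix): Mclust() by default cycles through all of mclust's covariance models, so the progress bar alone does not reveal which model is stalling. Fitting the models one at a time via the modelNames argument can narrow it down:
library(mclust)

# Fit each covariance model separately to see which one hangs at 67%.
# mclust.options("emModelNames") lists the standard mclust model names.
for (m in mclust.options("emModelNames")) {
  cat("fitting", m, "\n")
  fit <- try(Mclust(data, G = 2, modelNames = m), silent = TRUE)
}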

How to reduce the size of the data in R?

I have a CSV file with 600,000 rows and 1,339 columns, making 1.6 GB. 1,337 of the columns are binary, taking either 1 or 0 values, and the other 2 columns are numeric and character variables.
I pulled the data in with the readr package using the following code:
VLU_All_Before_Wide <- read_csv("C:/Users/petas/Desktop/VLU_All_Before_Wide_Sample.csv")
When I checked the object size using the following code, it was about 3 GB.
> print(object.size(VLU_All_Before_Wide),units="Gb")
3.2 Gb
In the next step, using the code below, I want to create training and test sets for LASSO regression.
set.seed(1234)
train_rows <- sample(1:nrow(VLU_All_Before_Wide), .7*nrow(VLU_All_Before_Wide))
train_set <- VLU_All_Before_Wide[train_rows,]
test_set <- VLU_All_Before_Wide[-train_rows,]
yall_tra <- data.matrix(subset(train_set, select=VLU_Incidence))
xall_tra <- data.matrix(subset(train_set, select=-c(VLU_Incidence,Replicate)))
yall_tes <- data.matrix(subset(test_set, select=VLU_Incidence))
xall_tes <- data.matrix(subset(test_set, select=-c(VLU_Incidence,Replicate)))
When I started my R session the RAM usage was at ~3 GB, and by the time I had executed all the above code it was at 14 GB, leaving me with an error saying a vector of size 4 GB cannot be allocated. There was no other application running apart from 3 Chrome windows. I removed the original dataset and the training and test datasets, but that only freed 0.7 to 1 GB of RAM.
rm(VLU_All_Before_Wide)
rm(test_set)
rm(train_set)
I would appreciate it if someone could show me a way to reduce the size of the data.
Thanks
R struggles with huge datasets because it tries to load and keep all the data in RAM. You can use other packages available in R that are made to handle big datasets, like bigmemory and ff. Check my answer here, which addresses a similar issue.
You can also do some data processing and manipulation outside R to remove unnecessary columns and rows. But still, to handle big datasets, it's better to use packages built for the purpose.
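Since 1,337 of the columns are 0/1, another option (my own suggestion, not part of the answer above, and assuming the Matrix and glmnet packages are available) is to hold the predictors as a sparse matrix, which glmnet accepts directly for the LASSO fit:
library(Matrix)
library(glmnet)

# 0/1 predictors compress well as a sparse dgCMatrix. The dense matrix is
# still built once here, but the dense training copies can then be dropped
# with rm() and gc() before fitting.
x_tra <- Matrix(data.matrix(subset(train_set, select = -c(VLU_Incidence, Replicate))),
                sparse = TRUE)
y_tra <- train_set$VLU_Incidence

# glmnet works on sparse matrices without densifying them; use
# family = "binomial" if VLU_Incidence is binary (a guess about the data).
fit <- glmnet(x_tra, y_tra, alpha = 1)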

Storing plot objects in R [duplicate]

This question already has answers here:
Save a plot in an object
(4 answers)
Closed 7 years ago.
Two methods of storing plot objects in a list or under a name string are mentioned on this page: Generating names iteratively in R for storing plots. But neither of them seems to work on my system.
> plist = list()
> plist[[1]] = plot(1:30)
>
> plist
list()
>
> plist[[1]]
Error in plist[[1]] : subscript out of bounds
Second method:
> assign('pp', plot(1:25))
>
> pp
NULL
I am using:
> R.version
_
platform i486-pc-linux-gnu
arch i486
os linux-gnu
system i486, linux-gnu
status
major 3
minor 2.0
year 2015
month 04
day 16
svn rev 68180
language R
version.string R version 3.2.0 (2015-04-16)
nickname Full of Ingredients
Where is the problem?
Base graphics functions such as plot() draw to the current device and return NULL (invisibly), so assigning their result stores nothing. To capture what is on the device, use recordPlot and replayPlot:
plot(BOD)             # draw a plot on the current device
plt <- recordPlot()   # snapshot the displayed plot into an R object
plot(0)               # draw something else over it
replayPlot(plt)       # redraw the saved plot
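As a side note (not part of the answer above, and assuming ggplot2 is acceptable): grid-based systems such as ggplot2 return real plot objects, so the original list approach works with them directly:
library(ggplot2)

plist <- list()
plist[[1]] <- ggplot(data.frame(x = 1:30, y = 1:30), aes(x, y)) + geom_point()

plist[[1]]   # printing the stored object draws the plot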

nodesize parameter ignored in randomForest package

Does the randomForest package ignore the nodesize parameter? When I predict the terminal nodes for a dataset and check the counts, I see values that are less than the nodesize. I would submit a fix for this myself but the underlying code was written in Fortran. If someone can confirm this behavior I will reach out to the package maintainer and hopefully start a fix.
> library(randomForest)
> set.seed(1)
> rf <- randomForest(mtcars[,-1], mtcars[,1], nodesize = 5)
> nodes <- attr(predict(rf, mtcars[,-1], nodes = TRUE), 'nodes')
# node counts of first tree
> table(nodes[,1])
# first row is the terminal node ID#, second row is the count
2 6 9 10 11 14 15 16 18 19
5 3 3 6 4 2 3 1 3 2
Adding system info:
Session info----------------------------------------------------------------
setting value
version R version 3.1.1 (2014-07-10)
system x86_64, mingw32
ui RStudio (0.98.1049)
language (EN)
collate English_United States.1252
tz America/Chicago
Packages--------------------------------------------------------------------
package * version date source
randomForest * 4.6.10 2014-07-17 CRAN (R 3.1.1)
Response from package maintainer:
That parameter behaves the way that Leo Breiman intended. The bug
is in how the parameter was described. It’s the same as minsplit in
the rpart:::rpart.control() function:
the minimum number of observations that must exist in a node in order
for a split to be attempted.
I will change the description in the help file in the next version to
resolve this confusion.
Best, Andy
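To make the rpart analogy concrete, here is a sketch (my own illustration, not from the thread) showing that minsplit likewise bounds when a node may still be split rather than the size of the terminal nodes (that lower bound is minbucket):
library(rpart)

# minsplit = 5: a node with fewer than 5 observations is not split further;
# terminal nodes themselves may still end up smaller than 5.
fit <- rpart(mpg ~ ., data = mtcars,
             control = rpart.control(minsplit = 5))

# Observations per terminal node; counts below 5 are allowed here too.
table(fit$where)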

convert a character to a table in R

I am trying to get a .csv file from an S3 bucket.
I am using
x = getFile()
from the RAmazonS3 package and converting the result to a character string using
y = rawToChar(x).
I wish to convert y to a data table with the same structure as the CSV file (2 columns); the line separator is \n and the field separator is ",".
Code:
table = rawToChar(getFile("bucket/file.csv"))
Output:
"\"id\",\"name\"\n\"12\",\"Member 12\"\n\"123\",\"Member 123\"\n\"1234\",\"Member 1234\"\n\"12345\....
And I wish it to be a 2-column data.table/frame of the form:
id name
12 Member 12
123 Member 123
1234 Member 1234
12345 Member 12345
Any suggestions on an efficient way to perform this conversion?
Any other (more useful) suggestions on how to retrieve CSV files from an Amazon S3 bucket into a table in R are more than welcome, of course.
I am using:
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
Any help with this issue would be appreciated.
Thank you
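A sketch of the conversion using only base R (getFile() and the bucket path are copied from the question; reading everything as character is an assumption to keep the long ids from being coerced to numeric):
library(RAmazonS3)

txt <- rawToChar(getFile("bucket/file.csv"))

# read.csv can parse an in-memory string via its `text` argument,
# so no temporary file is needed.
dat <- read.csv(text = txt, stringsAsFactors = FALSE,
                colClasses = "character")

head(dat)
data.table::fread also accepts a string containing newlines as its input, which tends to be faster on large files.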
