nodesize parameter ignored in randomForest package - r

Does the randomForest package ignore the nodesize parameter? When I predict the terminal nodes for a dataset and check the counts, I see values that are less than the nodesize. I would submit a fix for this myself but the underlying code was written in Fortran. If someone can confirm this behavior I will reach out to the package maintainer and hopefully start a fix.
> library(randomForest)
> set.seed(1)
> rf <- randomForest(mtcars[,-1], mtcars[,1], nodesize = 5)
> nodes <- attr(predict(rf, mtcars[,-1], nodes = TRUE), 'nodes')
# node counts of first tree
> table(nodes[,1])
# first row is the terminal node ID#, second row is the count
2 6 9 10 11 14 15 16 18 19
5 3 3 6 4 2 3 1 3 2
Adding system info:
Session info----------------------------------------------------------------
setting value
version R version 3.1.1 (2014-07-10)
system x86_64, mingw32
ui RStudio (0.98.1049)
language (EN)
collate English_United States.1252
tz America/Chicago
Packages--------------------------------------------------------------------
package * version date source
randomForest * 4.6.10 2014-07-17 CRAN (R 3.1.1)

Response from package maintainer:
That parameter behaves as the way that Leo Breiman intended. The bug
is in how the parameter was described. It’s the same as minsplit in
the rpart:::rpart.control() function:
the minimum number of observations that must exist in a node in order
for a split to be attempted.
I will change the description in the help file in the next version to
resolve this confusion.
Best, Andy
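The distinction described in the reply can be seen with rpart itself. This is a minimal sketch, assuming the rpart package is installed (it ships with standard R distributions). The minbucket = 1 and cp = 0 settings are illustrative choices that let the tree grow freely; they are not from the original post.

```r
library(rpart)

# minsplit = 5: a node is only considered for splitting if it holds >= 5 rows.
# minbucket = 1 and cp = 0 are illustrative settings that remove other limits.
fit <- rpart(mpg ~ ., data = mtcars,
             control = rpart.control(minsplit = 5, minbucket = 1, cp = 0))

is_leaf <- fit$frame$var == "<leaf>"

# Every node that was actually split held at least minsplit observations ...
split_sizes <- fit$frame$n[!is_leaf]
print(min(split_sizes))

# ... but the terminal nodes produced by those splits may be smaller.
leaf_sizes <- fit$frame$n[is_leaf]
print(sort(leaf_sizes))
```

If nodesize in randomForest follows the same rule, terminal node counts below nodesize, like those in the question, are expected rather than a bug.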


different behaviors of the pseudo RNG depending on the version of R

I observed different behaviors of the pseudo RNG depending on the version of R (see code below).
Can someone explain this difference in behavior to me?
Best Regards,
> version; set.seed(1); rnorm(3); sample(1:100, 10);
_
platform x86_64-conda_cos6-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 5.1
year 2018
month 07
day 02
svn rev 74947
language R
version.string R version 3.5.1 (2018-07-02)
nickname Feather Spray
[1] -0.6264538 0.1836433 -0.8356286
[1] 95 66 62 6 20 17 65 36 71 46
> version; set.seed(1); rnorm(3); sample(1:100, 10);
_
platform x86_64-apple-darwin13.4.0
arch x86_64
os darwin13.4.0
system x86_64, darwin13.4.0
status
major 3
minor 6.0
year 2019
month 04
day 26
svn rev 76424
language R
version.string R version 3.6.0 (2019-04-26)
nickname Planting of a Tree
[1] -0.6264538 0.1836433 -0.8356286
[1] 87 43 14 82 59 51 85 21 54 74
sample() was changed in R 3.6.0. From the release announcement at https://stat.ethz.ch/pipermail/r-announce/2019/000641.html:
The default method for generating from a discrete uniform
distribution (used in sample(), for instance) has been changed.
This addresses the fact, pointed out by Ottoboni and Stark, that
the previous method made sample() noticeably non-uniform on large
populations. See PR#17494 for a discussion. The previous method
can be requested using RNGkind() or RNGversion() if necessary for
reproduction of old results. Thanks to Duncan Murdoch for
contributing the patch and Gabe Becker for further assistance.
The output of RNGkind() has been changed to also return the
'kind' used by sample().
PR#17494 is shown here https://bugs.r-project.org/show_bug.cgi?id=17494.
3.6 introduced some changes. You can read about them here:
https://stat.ethz.ch/pipermail/r-announce/2019/000641.html
If necessary, you can reproduce results from earlier versions by calling RNGkind() or RNGversion(). See ?RNGkind or ?RNGversion for details.
The relevant section from the news file:
The previous method
can be requested using RNGkind() or RNGversion() if necessary for
reproduction of old results.
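A minimal sketch of switching between the two samplers, assuming it is run on R 3.6.0 or later with the default generators. The values in the comments are the ones reported in the question, not independently derived.

```r
# Default sampler of the running R version ("Rejection" in R >= 3.6.0).
set.seed(1); invisible(rnorm(3))
new_draw <- sample(1:100, 10)

# Switch to the pre-3.6.0 generators. R warns that the old "Rounding"
# sampler is non-uniform, so the warning is suppressed here.
suppressWarnings(RNGversion("3.5.0"))
set.seed(1); invisible(rnorm(3))
old_draw <- sample(1:100, 10)

# Restore the current default sampler for the rest of the session.
RNGkind(sample.kind = "Rejection")

print(new_draw)  # the question reports 87 43 14 82 59 51 85 21 54 74 on R 3.6.0
print(old_draw)  # the question reports 95 66 62 6 20 17 65 36 71 46 on R 3.5.1
```

The rnorm(3) results are unaffected because only the discrete uniform sampler changed. Only sample() and functions built on it differ between versions.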

Problem Printing Regression Tree, "Error in cat(x, ..., sep = sep) : argument 1 (type 'list') cannot be handled by 'cat'"

I am attempting to construct a tree on some prostate cancer data.
> head(prostate)
# A tibble: 6 x 6
lcavol age lbph lcp gleason lpsa
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.580 50 -1.39 -1.39 6 -0.431
2 -0.994 58 -1.39 -1.39 6 -0.163
3 -0.511 74 -1.39 -1.39 7 -0.163
4 -1.20 58 -1.39 -1.39 6 -0.163
5 0.751 62 -1.39 -1.39 6 0.372
6 -1.05 50 -1.39 -1.39 6 0.765
I declared all necessary packages and began to construct my tree.
> library(tree)
> pstree <- tree(lcavol ~., data=prostate, mindev=0.1, mincut=1)
> pstree <- tree(lcavol ~., data=prostate, mincut=1)
The commands run with no issue. However, when I try and print my tree, I encounter an error.
> pstree
Error in cat(x, ..., sep = sep) :
argument 1 (type 'list') cannot be handled by 'cat'
When I examine the structure of 'prostate', it shows it to be a data frame.
> str(prostate)
tibble [97 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
The weirder part may be that when I plot the tree, I receive my plots as if the previous command worked
plot(pstree, col=8)
text(pstree, digits=2)
When I was loading the 'tree' package I had to update my R as I was running an older version. Could this perhaps be why I am encountering an error? Here is the version of R I am running
> version
_
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 4
minor 0.3
year 2020
month 10
day 10
svn rev 79318
language R
version.string R version 4.0.3 (2020-10-10)
nickname Bunny-Wunnies Freak Out
My understanding is cat is part of the base package, so I'm not sure why this would be causing problems. Is there perhaps some other package that got removed in the update that caused this problem? I installed and loaded the package 'Cat' in case, but that did not solve it.
Also, note that this code is for a Data Mining class I am currently taking. The code and accompanying text are from Johannes Ledolter's book Data Mining and Business Analytics with R. You can see the full code for chapter 13 Here and the data Here. Thank you very much for your time, thought and help!
Cheers,
Chris
I faced the same problem, and did not find any other support for this online. I realised that I am getting this error when tidyverse and tree packages are both loaded simultaneously. Unloading tidyverse solved the problem for me.
Note: Here are the versions I am running. I made sure I am running the latest versions.
R Studio: 1.3.1093
R: 4.0.3 (2020-10-10)
tree package: 1.0.40
tidyverse package: 1.0.40
I have the same error as you when trying to call the print function of the tree class.
>tree_OJ
Error in cat(x, ..., sep = sep) :
argument 1 (type 'list') cannot be handled by 'cat'
The error is caused by loading the "cli" package, which is loaded implicitly when loading "tidyverse". When "cli" is loaded, it registers its own print method for objects of class "tree", which overwrites the one registered by the tree package:
Registered S3 method overwritten by 'cli':
method from
print.tree tree
If you have to use the "tidyverse" package, you can use the following code, which explicitly calls the print method from the tree package. Remember to use ":::" instead of the more common "::", because print.tree is not exported by the tree package.
tree:::print.tree(tree_OJ)
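The mechanism can be reproduced in base R without either package. This is a toy sketch: the class name toy_tree and both method functions are hypothetical stand-ins, and registerS3method() is called directly here, whereas real packages register methods while their namespaces load.

```r
x <- structure(list(), class = "toy_tree")  # hypothetical class, not tree's

# Stand-in for the method that the first package registers.
first_method <- function(x, ...) {
  cat("printed by first package\n")
  invisible(x)
}
registerS3method("print", "toy_tree", first_method)
out_before <- capture.output(print(x))

# A second registration for the same generic/class pair replaces the first.
second_method <- function(x, ...) {
  cat("printed by second package\n")
  invisible(x)
}
registerS3method("print", "toy_tree", second_method)
out_after <- capture.output(print(x))

# Calling the original function directly bypasses S3 dispatch,
# which is what tree:::print.tree(tree_OJ) does.
out_direct <- capture.output(first_method(x))

print(c(out_before, out_after, out_direct))
```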

R chart.Timeseries in loop not working

I recently updated to the most recent versions of R and RStudio, and suddenly chart.TimeSeries from the PerformanceAnalytics package is not working inside a loop.
For example, if I highlight the code below in RStudio and run it, it executes without errors (which you can confirm by checking that i = 3 after running), but no plots are produced.
library(PerformanceAnalytics)
library(xts)
ts1 <- xts(1:12, order.by = as.Date("2018-05-01") + (-11:0))
i <- 0
for (i in 1:3) chart.TimeSeries(ts1)
However if I replace
for (i in 1:3) chart.TimeSeries(ts1)
with
chart.TimeSeries(ts1)
chart.TimeSeries(ts1)
chart.TimeSeries(ts1)
then 3 plots are produced as expected. Has anyone seen or noted this before or have an explanation for it ?
Update : The same happens if I use plot.xts (which is what chart.TimeSeries uses under the hood) in place of chart.TimeSeries.
> version
_
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 5.0
year 2018
month 04
day 23
svn rev 74626
language R
version.string R version 3.5.0 (2018-04-23)
nickname Joy in Playing
RStudio version 1.1.423, PerformanceAnalytics version 1.5.2, xts version 0.10-2
I just ran your example and indeed, my result is the same as yours.
I changed
for (i in 1:3) chart.TimeSeries(ts1)
to
for (i in 1:3) print(PerformanceAnalytics::chart.TimeSeries(ts1))
and now all 3 charts are showing properly in my plots panel inside rstudio (I also use up-to-date versions)
Hope this answers your issue.
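A likely explanation, stated as an assumption rather than something confirmed in this thread: newer versions of plot.xts return a plot object that is only drawn when it is printed. R auto-prints visible values at the top level but not inside a for loop. The sketch below uses a hypothetical toy_chart class in place of the real plot object to show the difference.

```r
# Hypothetical stand-in for a function that returns a plot object
# instead of drawing immediately.
chart <- function() structure(list(), class = "toy_chart")

# The "drawing" happens in the print method, as with print-to-draw objects.
print.toy_chart <- function(x, ...) {
  cat("drawing chart\n")
  invisible(x)
}

# Inside a loop, the returned value is discarded and never printed.
silent <- capture.output(for (i in 1:3) chart())

# Wrapping the call in print() triggers the method on every iteration.
drawn <- capture.output(for (i in 1:3) print(chart()))

print(length(silent))
print(drawn)
```

The same applies to any function whose result is rendered by its print method, for example ggplot2 or lattice plots created inside loops or functions.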

Merging 2 large data sets in R

I am trying to merge two large data sets, as I need to create a final training set for my models.
head(TrainWithAppevents_rel4)
  event_id device_id           gender age group phone_brand device_model numbrand nummodel app_id
6 6        1476664663289716480 M      19  M22-  华为        Mate 7       29       919      4348659952760821248
and
head(app_labels)
  app_id              label_id
1 7324884708820028416 251
The first dataset now has unique rows, as I have already removed all duplicates.
I want my final set to have the following columns:
event_id device_id gender age group phone_brand device_model numbrand nummodel app_id label_id
However, when I try to merge using the call below in R (RStudio session)
TrainWithLabels=merge(x=TrainWithAppevents_rel4,y=app_labels,by="app_id",all.x = TRUE)
I get the following error:
Error: cannot allocate vector of size 512.0 Mb
The error varies between runs, but only in the size of the vector.
The sizes of my datasets are as below :
> dim(TrainWithAppevents_rel4)
[1] 4787796 10
> dim(app_labels)
[1] 459943 2
More information about the machine and R version I use:
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
I use an Intel 2.6 GHz x64 processor with 16 GB RAM on 64-bit Windows 10.
I have tried the following:
- Reducing the dataset by removing duplicates and unwanted columns; all rows in the first dataset are now unique
- Closing all other applications on my laptop and then running the merge (still fails)
- Executing gc() and then running the merge
I have gone through similar questions on SO for R; however, none of them offered a way forward, and none were specific to merges failing on a 64-bit machine.
Can anyone please suggest a solution or a workaround?
Please assume that this is the only machine where I can execute the code, and that running this R script on AWS via Zeppelin is not possible at the moment.

Storing plot objects in R [duplicate]

Two methods of storing plot objects, in a list or under a name built from a string, are mentioned on the page "Generating names iteratively in R for storing plots", but neither seems to work on my system.
> plist = list()
> plist[[1]] = plot(1:30)
>
> plist
list()
>
> plist[[1]]
Error in plist[[1]] : subscript out of bounds
Second method:
> assign('pp', plot(1:25))
>
> pp
NULL
I am using:
> R.version
_
platform i486-pc-linux-gnu
arch i486
os linux-gnu
system i486, linux-gnu
status
major 3
minor 2.0
year 2015
month 04
day 16
svn rev 68180
language R
version.string R version 3.2.0 (2015-04-16)
nickname Full of Ingredients
Where is the problem?
Use recordPlot and replayPlot:
plot(BOD)
plt <- recordPlot()
plot(0)
replayPlot(plt)
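The attempts in the question fail because base plot() draws as a side effect and returns NULL, so there is nothing to store. A minimal sketch of keeping several recorded plots in a list follows. The off-screen pdf(NULL) device and the dev.control() call are assumptions added so the sketch also runs non-interactively; in an interactive session with a screen device they can be dropped.

```r
# Open an off-screen device and turn on its display list so that
# recordPlot() has something to capture (needed for non-screen devices).
pdf(NULL)
dev.control(displaylist = "enable")

plist <- list()

plot(1:30)
plist[[1]] <- recordPlot()   # snapshot of the current plot

plot(1:25)
plist[[2]] <- recordPlot()

# Redraw the first stored plot on the current device.
replayPlot(plist[[1]])

invisible(dev.off())

print(length(plist))
print(class(plist[[1]]))
```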
