Comparing Multiple lm() Results within ggplot2 [duplicate]

Is there a way to extract the values of the fitted line returned from stat_smooth?
The code I am using looks like this:
p <- ggplot(df1, aes(x = Days, y = Qty, group = Category, color = Category))
p <- p + stat_smooth(method = glm, fullrange = TRUE) + geom_point()
This new R user would greatly appreciate any guidance.

Riffing off of @James's example:
p <- qplot(hp,wt,data=mtcars) + stat_smooth()
You can use the intermediate stages of the ggplot building process to pull out the plotted data. The result of ggplot_build() is a list, one component of which is data: a list of data frames containing the computed values to be plotted. In this case the list holds two data frames, since the original qplot creates one for the points and stat_smooth creates one for the smoothed curve.
> ggplot_build(p)$data[[2]]
geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
x y ymin ymax se PANEL group
1 52.00000 1.993594 1.149150 2.838038 0.4111133 1 1
2 55.58228 2.039986 1.303264 2.776709 0.3586695 1 1
3 59.16456 2.087067 1.443076 2.731058 0.3135236 1 1
4 62.74684 2.134889 1.567662 2.702115 0.2761514 1 1
5 66.32911 2.183533 1.677017 2.690049 0.2465948 1 1
6 69.91139 2.232867 1.771739 2.693995 0.2244980 1 1
7 73.49367 2.282897 1.853241 2.712552 0.2091756 1 1
8 77.07595 2.333626 1.923599 2.743652 0.1996193 1 1
9 80.65823 2.385059 1.985378 2.784740 0.1945828 1 1
10 84.24051 2.437200 2.041282 2.833117 0.1927505 1 1
11 87.82278 2.490053 2.093808 2.886297 0.1929096 1 1
12 91.40506 2.543622 2.145018 2.942225 0.1940582 1 1
13 94.98734 2.597911 2.196466 2.999355 0.1954412 1 1
14 98.56962 2.652852 2.249260 3.056444 0.1964867 1 1
15 102.15190 2.708104 2.303465 3.112744 0.1969967 1 1
16 105.73418 2.764156 2.357927 3.170385 0.1977705 1 1
17 109.31646 2.821771 2.414230 3.229311 0.1984091 1 1
18 112.89873 2.888224 2.478136 3.298312 0.1996493 1 1
19 116.48101 2.968745 2.531045 3.406444 0.2130917 1 1
20 120.06329 3.049545 2.552102 3.546987 0.2421773 1 1
21 123.64557 3.115893 2.573577 3.658208 0.2640235 1 1
22 127.22785 3.156368 2.601664 3.711072 0.2700548 1 1
23 130.81013 3.175495 2.625951 3.725039 0.2675429 1 1
24 134.39241 3.181411 2.645191 3.717631 0.2610560 1 1
25 137.97468 3.182252 2.658993 3.705511 0.2547460 1 1
26 141.55696 3.186155 2.670350 3.701961 0.2511175 1 1
27 145.13924 3.201258 2.687208 3.715308 0.2502626 1 1
28 148.72152 3.235698 2.721744 3.749652 0.2502159 1 1
29 152.30380 3.291766 2.782767 3.800765 0.2478037 1 1
30 155.88608 3.353259 2.857911 3.848607 0.2411575 1 1
31 159.46835 3.418409 2.938257 3.898561 0.2337596 1 1
32 163.05063 3.487074 3.017321 3.956828 0.2286972 1 1
33 166.63291 3.559111 3.092367 4.025855 0.2272319 1 1
34 170.21519 3.634377 3.165426 4.103328 0.2283065 1 1
35 173.79747 3.712729 3.242093 4.183364 0.2291263 1 1
36 177.37975 3.813399 3.347232 4.279565 0.2269509 1 1
37 180.96203 3.910849 3.447572 4.374127 0.2255441 1 1
38 184.54430 3.977051 3.517784 4.436318 0.2235917 1 1
39 188.12658 4.037302 3.583959 4.490645 0.2207076 1 1
40 191.70886 4.091635 3.645111 4.538160 0.2173882 1 1
41 195.29114 4.140082 3.700184 4.579981 0.2141624 1 1
42 198.87342 4.182676 3.748159 4.617192 0.2115424 1 1
43 202.45570 4.219447 3.788162 4.650732 0.2099688 1 1
44 206.03797 4.250429 3.819579 4.681280 0.2097573 1 1
45 209.62025 4.275654 3.842137 4.709171 0.2110556 1 1
46 213.20253 4.295154 3.855951 4.734357 0.2138238 1 1
47 216.78481 4.308961 3.861497 4.756425 0.2178456 1 1
48 220.36709 4.317108 3.859541 4.774675 0.2227644 1 1
49 223.94937 4.319626 3.851025 4.788227 0.2281358 1 1
50 227.53165 4.316548 3.836964 4.796132 0.2334829 1 1
51 231.11392 4.308435 3.818728 4.798143 0.2384117 1 1
52 234.69620 4.302276 3.802201 4.802351 0.2434590 1 1
53 238.27848 4.297902 3.787395 4.808409 0.2485379 1 1
54 241.86076 4.292303 3.772103 4.812503 0.2532567 1 1
55 245.44304 4.282505 3.754087 4.810923 0.2572576 1 1
56 249.02532 4.269040 3.733184 4.804896 0.2608786 1 1
57 252.60759 4.253361 3.710042 4.796680 0.2645121 1 1
58 256.18987 4.235474 3.684476 4.786473 0.2682509 1 1
59 259.77215 4.215385 3.656265 4.774504 0.2722044 1 1
60 263.35443 4.193098 3.625161 4.761036 0.2764974 1 1
61 266.93671 4.168621 3.590884 4.746357 0.2812681 1 1
62 270.51899 4.141957 3.553134 4.730781 0.2866658 1 1
63 274.10127 4.113114 3.511593 4.714635 0.2928472 1 1
64 277.68354 4.082096 3.465939 4.698253 0.2999729 1 1
65 281.26582 4.048910 3.415849 4.681971 0.3082025 1 1
66 284.84810 4.013560 3.361010 4.666109 0.3176905 1 1
67 288.43038 3.976052 3.301132 4.650972 0.3285813 1 1
68 292.01266 3.936392 3.235952 4.636833 0.3410058 1 1
69 295.59494 3.894586 3.165240 4.623932 0.3550782 1 1
70 299.17722 3.850639 3.088806 4.612473 0.3708948 1 1
71 302.75949 3.804557 3.006494 4.602619 0.3885326 1 1
72 306.34177 3.756345 2.918191 4.594499 0.4080510 1 1
73 309.92405 3.706009 2.823813 4.588205 0.4294926 1 1
74 313.50633 3.653554 2.723308 4.583801 0.4528856 1 1
75 317.08861 3.598987 2.616650 4.581325 0.4782460 1 1
76 320.67089 3.542313 2.503829 4.580796 0.5055805 1 1
77 324.25316 3.483536 2.384853 4.582220 0.5348886 1 1
78 327.83544 3.422664 2.259739 4.585589 0.5661643 1 1
79 331.41772 3.359701 2.128512 4.590891 0.5993985 1 1
80 335.00000 3.294654 1.991200 4.598107 0.6345798 1 1
Knowing a priori where the one you want is in the list isn't easy, but if nothing else you can look at the column names.
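For instance, a quick way to see which element of the list holds the smoother (a minimal sketch, using the p defined above):
# Each layer contributes one data frame; the smoother's has y/ymin/ymax/se columns
lapply(ggplot_build(p)$data, names)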
It is still better to do the smoothing outside the ggplot call, though.
EDIT:
It turns out that replicating what ggplot2 does to make the loess is not as straightforward as I thought, but the following will work; I copied it out of some internal ggplot2 functions.
# Fit the same loess model stat_smooth() uses by default
model <- loess(wt ~ hp, data = mtcars)
# Evaluate it on an evenly spaced grid of 80 points, as stat_smooth() does
xrange <- range(mtcars$hp)
xseq <- seq(from = xrange[1], to = xrange[2], length.out = 80)
pred <- predict(model, newdata = data.frame(hp = xseq), se = TRUE)
y <- pred$fit
# 95% pointwise confidence band from the standard errors
ci <- pred$se.fit * qt(0.95 / 2 + 0.5, pred$df)
ymin <- y - ci
ymax <- y + ci
loess.DF <- data.frame(x = xseq, y, ymin, ymax, se = pred$se.fit)
ggplot(mtcars, aes(x = hp, y = wt)) +
  geom_point() +
  # aes_auto() from the original answer has since been removed from ggplot2,
  # so the mapping is spelled out explicitly here
  geom_smooth(aes(x = x, y = y, ymin = ymin, ymax = ymax), data = loess.DF,
              stat = "identity")
That gives a plot that looks identical to
ggplot(mtcars, aes(x=hp, y=wt)) +
geom_point() +
geom_smooth()
(which is the expanded form of the original p).
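As a sanity check (a sketch, assuming the objects built above are still around), you can compare the manual predictions against what ggplot2 computed for its smoother layer:
# Should be TRUE (up to numerical tolerance) if the replication matches
built <- ggplot_build(qplot(hp, wt, data = mtcars) + stat_smooth())
all.equal(built$data[[2]]$y, as.numeric(y))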

stat_smooth does produce output that you can use elsewhere, and in a slightly hacky way you can put it into a variable in the global environment.
You enclose the output variable in .. on either side to use it. So if you add an aes in the stat_smooth call and use the global assignment operator, <<-, to assign the output to a variable in the global environment, you can get the fitted values, or others; see below.
qplot(hp,wt,data=mtcars) + stat_smooth(aes(outfit=fit<<-..y..))
fit
[1] 1.993594 2.039986 2.087067 2.134889 2.183533 2.232867 2.282897 2.333626
[9] 2.385059 2.437200 2.490053 2.543622 2.597911 2.652852 2.708104 2.764156
[17] 2.821771 2.888224 2.968745 3.049545 3.115893 3.156368 3.175495 3.181411
[25] 3.182252 3.186155 3.201258 3.235698 3.291766 3.353259 3.418409 3.487074
[33] 3.559111 3.634377 3.712729 3.813399 3.910849 3.977051 4.037302 4.091635
[41] 4.140082 4.182676 4.219447 4.250429 4.275654 4.295154 4.308961 4.317108
[49] 4.319626 4.316548 4.308435 4.302276 4.297902 4.292303 4.282505 4.269040
[57] 4.253361 4.235474 4.215385 4.193098 4.168621 4.141957 4.113114 4.082096
[65] 4.048910 4.013560 3.976052 3.936392 3.894586 3.850639 3.804557 3.756345
[73] 3.706009 3.653554 3.598987 3.542313 3.483536 3.422664 3.359701 3.294654
The outputs you can obtain are:
y, the predicted value
ymin, the lower pointwise confidence interval around the mean
ymax, the upper pointwise confidence interval around the mean
se, the standard error
Note that by default it predicts on 80 data points, which may not be aligned with your original data.
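If the default 80-point grid doesn't suit you, stat_smooth()'s n parameter controls how many points the smoother is evaluated at, and it can be combined with the same hack (a sketch):
# Evaluate the smoother at 200 points before capturing the fitted values
qplot(hp, wt, data = mtcars) + stat_smooth(n = 200, aes(outfit = fit <<- ..y..))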

A more general approach could be to simply use the predict() function to predict over any range of values of interest.
# define the model
model <- loess(wt ~ hp, data = mtcars)
# predict fitted values for each observation in the original dataset
modelFit <- data.frame(predict(model, se = TRUE))
# define data frame for ggplot
df <- data.frame(hp = mtcars$hp,
                 wt = mtcars$wt,
                 fit = modelFit$fit,
                 upperBound = modelFit$fit + 2 * modelFit$se.fit,
                 lowerBound = modelFit$fit - 2 * modelFit$se.fit)
# build the plot using the fitted values from the predict() function
# geom_linerange() and the second geom_point() in the code are built using the values from the predict() function
# for comparison ggplot's geom_smooth() is also shown
g <- ggplot(df, aes(hp, wt))
g <- g + geom_point()
g <- g + geom_linerange(aes(ymin = lowerBound, ymax = upperBound))
g <- g + geom_point(aes(hp, fit), size = 1)
g <- g + geom_smooth(method = "loess")
g
# Predict any range of values and include the standard error in the output
predict(model, newdata = 100:300, se = TRUE)
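As a usage example (a sketch under the same model), the custom-grid predictions can be drawn as their own layer:
# Overlay predictions on a custom grid of hp values over the raw points
grid <- data.frame(hp = 100:300)
gridFit <- predict(model, newdata = grid, se = TRUE)
grid$fit <- gridFit$fit
ggplot(mtcars, aes(hp, wt)) + geom_point() + geom_line(data = grid, aes(hp, fit))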

If you want to bring in the power of the tidyverse, you can use the broom package to add the predicted values from the loess model to your original dataset. This builds on @phillyooo's solution.
library(tidyverse)
library(broom)
# original graph with smoother
ggplot(data=mtcars, aes(hp,wt)) +
stat_smooth(method = "loess", span = 0.75)
# Create model that will do the same thing as under the hood in ggplot2
model <- loess(wt ~ hp, data = mtcars, span = 0.75)
# Add predicted values from model to original dataset using broom library
mtcars2 <- augment(model, mtcars)
# Plot both lines
ggplot(data=mtcars2, aes(hp,wt)) +
geom_line(aes(hp, .fitted), color = "red") +
stat_smooth(method = "loess", span = 0.75)
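A quick look at the augmented frame shows the model columns broom added alongside the originals (.fitted and .resid are broom's naming convention):
head(mtcars2[, c("hp", "wt", ".fitted", ".resid")])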

Save the graph object and use ggplot_build() or layer_data() to obtain the elements/estimates for the layers, e.g.
pp <- ggplot(mtcars, aes(x = hp, y = wt)) + geom_point() + geom_smooth()
ggplot_build(pp)
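layer_data() takes the layer index as its second argument; here the smoother is layer 2 (a minimal sketch):
# Pull only the smoother's computed data (x, y, ymin, ymax, se, ...)
head(layer_data(pp, 2))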

Related

"variable lengths differ" error while running regressions in a loop

I am trying to run a regression loop based on code that I found in a previous answer (How to Loop/Repeat a Linear Regression in R), but I keep getting an error. My outcomes (dependent variables) are 940 variables (metabolites) and my exposures (independent variables) are "bmi", "Age", "sex", "lpa2c", and "smoking", where BMI and Age are continuous. BMI is the main exposure, and I am controlling for the others.
So I'm testing the effect of BMI on 940 metabolites.
Also, I would like to know how I can extract the coefficient, p-value, standard error, and confidence interval for BMI only, and only when it is significant.
This is the code I have used:
y<- c(1653:2592) # response
x1<- c("bmi","Age", "sex","lpa2c", "smoking") # predictor
for (i in x1){
model <- lm(paste("y ~", i[[1]]), data= QBB_clean)
print(summary(model))
}
And this is the error:
Error in model.frame.default(formula = paste("y ~", i[[1]]), data = QBB_clean, :
variable lengths differ (found for 'bmi').
y1 y2 y3 y4 bmi age sex lpa2c smoking
1 0.2875775201 0.59998896 0.238726027 0.784575267 24 18 1 0.470681834 1
2 0.7883051354 0.33282354 0.962358936 0.009429905 12 20 0 0.365845473 1
3 0.4089769218 0.48861303 0.601365726 0.779065883 18 15 0 0.121272054 0
4 0.8830174040 0.95447383 0.515029727 0.729390652 16 21 0 0.046993681 0
5 0.9404672843 0.48290240 0.402573342 0.630131853 18 28 1 0.262796304 1
6 0.0455564994 0.89035022 0.880246541 0.480910830 13 13 0 0.968641168 1
7 0.5281054880 0.91443819 0.364091865 0.156636851 11 12 0 0.488495482 1
8 0.8924190444 0.60873498 0.288239281 0.008215520 21 23 0 0.477822030 0
9 0.5514350145 0.41068978 0.170645235 0.452458394 18 17 1 0.748792881 0
10 0.4566147353 0.14709469 0.172171746 0.492293329 20 15 1 0.667640231 1
If you want to loop over responses you will want something like this:
respvars <- names(QBB_clean[1653:2592])
predvars <- c("bmi","Age", "sex","lpa2c", "smoking")
results <- list()
for (v in respvars) {
form <- reformulate(predvars, response = v)
results[[v]] <- lm(form, data = QBB_clean)
}
You can then print the results with something like lapply(results, summary), extract coefficients, etc. (Though I have a little trouble seeing how it's going to be useful to just print the results of 940 regressions; are you really going to inspect them all?)
If you want coefficients etc. for BMI, I think this should work (not tested):
t(sapply(results, function(m) coef(summary(m))["bmi",]))
Or for confidence intervals:
t(sapply(results, function(m) confint(m)["bmi",]))
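To also handle the "only when it is significant" part of the question, here is a sketch that collects the bmi row from every model and filters on the p-value (the column name Pr(>|t|) comes from coef(summary(...))):
# One row per response: estimate, std. error, t value, and p value for bmi
bmi_tab <- data.frame(outcome = names(results),
                      t(sapply(results, function(m) coef(summary(m))["bmi", ])),
                      check.names = FALSE)
# Keep only the responses where bmi is significant at the 5% level
subset(bmi_tab, `Pr(>|t|)` < 0.05)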

Implementing XGBoost on Methyl450k data set in R

I'm attempting to implement XGBoost on a Methyl450k data set. The data has approximately 480,000+ specific CpG sites with beta values between 0 and 1. Here is a look at the data (a sample of 10 columns plus the response):
cg13869341 cg14008030 cg12045430 cg20826792 cg00381604 cg20253340 cg21870274 cg03130891 cg24335620 cg16162899 response
1 0.8612869 0.6958909 0.07918330 0.10816711 0.03484078 0.4875475 0.7475878 0.11051578 0.7120003 0.8453396 0
2 0.8337106 0.6276754 0.09811698 0.08934333 0.03348864 0.6300766 0.7753453 0.08652890 0.6465146 0.8137132 0
3 0.8516102 0.6575332 0.13310207 0.07990076 0.04195286 0.4325115 0.7257208 0.14334007 0.7384455 0.8054013 0
4 0.8970384 0.6955810 0.08134887 0.08950676 0.03578006 0.4711689 0.7214661 0.08299838 0.7718571 0.8151683 0
5 0.8562323 0.7204416 0.08078766 0.14902533 0.04274820 0.4769631 0.8034706 0.16473891 0.7143823 0.8475410 0
6 0.8613325 0.6527599 0.10158672 0.15459204 0.04839691 0.4805285 0.8004808 0.12598627 0.8218743 0.8222552 0
7 0.9168869 0.5963966 0.11457045 0.13245761 0.03720798 0.5067649 0.6806004 0.13601034 0.7063457 0.8509160 0
8 0.9002366 0.6898320 0.07029171 0.07158694 0.03875135 0.7065322 0.8167016 0.15394095 0.7226098 0.8310477 0
9 0.8876504 0.6172154 0.13511072 0.15276686 0.06149520 0.5642073 0.7177438 0.14752285 0.6846876 0.8360360 0
10 0.8992898 0.6361644 0.15423780 0.19111275 0.05700406 0.4941239 0.7819968 0.10109936 0.6680640 0.8504023 0
11 0.8997905 0.5906462 0.10411472 0.15006796 0.04157008 0.4931531 0.7857664 0.13430963 0.6946644 0.8326747 0
12 0.9009607 0.6721858 0.09081460 0.11057752 0.05824153 0.4683763 0.7655608 0.01755990 0.7113345 0.8346149 0
13 0.9036750 0.6313643 0.07477824 0.12089404 0.04738597 0.5502747 0.7520128 0.16332395 0.7036665 0.8564414 0
14 0.8420276 0.6265071 0.15351674 0.13647090 0.04901864 0.5037902 0.7446693 0.10534171 0.7727812 0.8317943 0
15 0.8995276 0.6515500 0.09214429 0.08973162 0.04231420 0.5071999 0.7484940 0.21822470 0.6859165 0.7775508 0
16 0.9071643 0.7945852 0.15809474 0.11264440 0.04793316 0.5256078 0.8425513 0.17150603 0.7581367 0.8271037 0
17 0.8691358 0.6206902 0.11868549 0.15944891 0.03523320 0.4581166 0.8058461 0.11557264 0.6960848 0.8579109 1
18 0.8330247 0.7030860 0.12832663 0.12936172 0.03534059 0.4687507 0.7630222 0.12176819 0.7179690 0.8775521 1
19 0.9015574 0.6592869 0.12693119 0.14671845 0.03819418 0.4395692 0.7420882 0.10293369 0.7047038 0.8435531 1
20 0.8568249 0.6762936 0.18220218 0.10123198 0.04963466 0.5781550 0.6324743 0.06676272 0.6805745 0.8291353 1
21 0.8799152 0.6736554 0.15056617 0.16070673 0.04944037 0.4015415 0.4587438 0.10392791 0.7467060 0.7396137 1
22 0.8730770 0.6663321 0.10802390 0.14481460 0.04448009 0.5177664 0.6682854 0.16747621 0.7161234 0.8309462 1
23 0.9359656 0.7401368 0.16730300 0.11842173 0.03388908 0.4906018 0.5730439 0.15970761 0.7904663 0.8136450 1
24 0.9320397 0.6978085 0.10474803 0.10607080 0.03268366 0.5362214 0.7832729 0.15564091 0.7171350 0.8511477 1
25 0.8444256 0.7516799 0.16767449 0.12025258 0.04426417 0.5040725 0.6950104 0.16010829 0.7026808 0.8800469 1
26 0.8692707 0.7016945 0.10123979 0.09430876 0.04037325 0.4877716 0.7053603 0.09539885 0.8316933 0.8165352 1
27 0.8738410 0.6230674 0.12793232 0.14837137 0.04878595 0.4335648 0.6547601 0.13714725 0.6944921 0.8788708 1
28 0.9041870 0.6201079 0.12490195 0.16227251 0.04812720 0.4845896 0.6619842 0.13093443 0.7415606 0.8479339 1
29 0.8618622 0.7060291 0.09453812 0.14068246 0.04799782 0.5474036 0.6088231 0.23338428 0.6772588 0.7795908 1
30 0.8776350 0.7132561 0.12100425 0.17367148 0.04399987 0.5661632 0.6905305 0.12971867 0.6788903 0.8198201 1
31 0.9134456 0.7249370 0.07144695 0.08759897 0.04864476 0.6682650 0.7445900 0.16374150 0.7322691 0.8071598 1
32 0.8706637 0.6743936 0.15291891 0.11422262 0.04284591 0.5268217 0.7207478 0.14296945 0.7574967 0.8609048 1
33 0.8821504 0.6845216 0.12004074 0.14009196 0.05527732 0.5677475 0.6379840 0.14122421 0.7090634 0.8386022 1
34 0.9061180 0.5989445 0.09160787 0.14325261 0.05142950 0.5399465 0.6718870 0.08454002 0.6709083 0.8264233 1
35 0.8453511 0.6759766 0.13345672 0.16310764 0.05107034 0.4666146 0.7343603 0.12733287 0.7062292 0.8471812 1
36 0.9004188 0.6114532 0.11837118 0.14667433 0.05050403 0.4975502 0.7258132 0.14894363 0.7195090 0.8382364 1
37 0.9051729 0.6652954 0.15153241 0.14571184 0.05026702 0.4855397 0.7226850 0.12179138 0.7430388 0.8342340 1
38 0.9112012 0.6314450 0.12681305 0.16328649 0.04076789 0.5382251 0.7404122 0.13971506 0.6607798 0.8657917 1
39 0.8407927 0.7148585 0.12792107 0.15447060 0.05287096 0.6798039 0.7182050 0.06549068 0.7433669 0.7948445 1
40 0.8554747 0.7356683 0.22698080 0.21692162 0.05365043 0.4496654 0.7353112 0.13341649 0.8032266 0.7883068 1
41 0.8535359 0.5729331 0.14392737 0.16612463 0.04651752 0.5228045 0.7397588 0.09967424 0.7906682 0.8384434 1
42 0.8059968 0.7148594 0.16774123 0.19006840 0.04990847 0.5929818 0.7011064 0.17921090 0.8121909 0.8481069 1
43 0.8856906 0.6987405 0.19262137 0.18327412 0.04816967 0.4340002 0.6569263 0.13724290 0.7600389 0.7788117 1
44 0.8888717 0.6760166 0.17025712 0.21906969 0.04812641 0.4173613 0.7927178 0.17458413 0.6806101 0.8297604 1
45 0.8691575 0.6682723 0.11932277 0.13669098 0.04014911 0.4680455 0.6186511 0.10002737 0.8012731 0.7177891 1
46 0.9148742 0.7797494 0.13313955 0.15166151 0.03934042 0.4818276 0.7484973 0.16354624 0.6979735 0.8164431 1
47 0.9226736 0.7211714 0.08036409 0.10395457 0.04063595 0.4014187 0.8026643 0.17762644 0.7194800 0.8156545 1
I've attempted to implement the algorithm in R but I'm continuing to get errors.
Attempt:
> train <- beta_values1_updated[training1, ]
> test <- beta_values1_updated[-training1, ]
> labels <- train$response
> ts_label <- test$response
> new_tr <- model.matrix(~.+0,data = train[,-c("response"),with=F])
Error in `[.data.frame`(train, , -c("response"), with = F) :
unused argument (with = F)
> new_ts <- model.matrix(~.+0,data = test[,-c("response"),with=F])
Error in `[.data.frame`(test, , -c("response"), with = F) :
unused argument (with = F)
I am attempting to follow the tutorial here:
https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/
Any insight as to how I could correctly implement the XGBoost algorithm would be greatly appreciated.
Edit:
I'm adding additional code to show the point in the tutorial where I get stuck:
train<-data.table(train)
test<-data.table(test)
new_tr <- model.matrix(~.+0,data = train[,-c("response"),with=F])
new_ts <- model.matrix(~.+0,data = test[,-c("response"),with=F])
#convert factor to numeric
labels <- as.numeric(labels)-1
ts_label <- as.numeric(ts_label)-1
#preparing matrix
dtrain <- xgb.DMatrix(data = new_tr,label = labels)
dtest <- xgb.DMatrix(data = new_ts,label=ts_label)
params <- list(booster = "gbtree", objective = "binary:logistic", eta=0.3, gamma=0, max_depth=6, min_child_weight=1, subsample=1, colsample_bytree=1)
xgbcv <- xgb.cv( params = params, data = dtrain, nrounds = 100, nfold = 5, showsd = T, stratified = T, print.every.n = 10, early.stop.round = 20, maximize = F)
[1] train-error:0.000000+0.000000 test-error:0.000000+0.000000
Multiple eval metrics are present. Will use test_error for early stopping.
Will train until test_error hasn't improved in 20 rounds.
[11] train-error:0.000000+0.000000 test-error:0.000000+0.000000
[21] train-error:0.000000+0.000000 test-error:0.000000+0.000000
Stopping. Best iteration:
[1] train-error:0.000000+0.000000 test-error:0.000000+0.000000
Warning messages:
1: 'print.every.n' is deprecated.
Use 'print_every_n' instead.
See help("Deprecated") and help("xgboost-deprecated").
2: 'early.stop.round' is deprecated.
Use 'early_stopping_rounds' instead.
See help("Deprecated") and help("xgboost-deprecated").
The author of the tutorial is using the data.table package, where the with = FALSE argument makes j evaluate as in a data frame, so you can select or drop columns by name. Make sure you have installed and loaded data.table and the other packages needed to follow the tutorial, and that your data set is a data.table object.
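If you'd rather keep train and test as plain data frames, a base-R equivalent of the tutorial's column drop would be (a sketch):
# Drop the response column by name without data.table syntax
new_tr <- model.matrix(~ . + 0, data = train[, setdiff(names(train), "response")])
new_ts <- model.matrix(~ . + 0, data = test[, setdiff(names(test), "response")])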

Overlaying ggplot data layers

I am trying to overlay two datasets of different lengths within ggplot.
Dataset 1: data frame r, where m is the date and V2 is a value in the range -1 to +1:
> r
m V2
19991221 1
19910703 -0.396825397
19850326 0.916666667
19890328 -0.473053892
19610912 -0.75
20021106 -0.991525424
19940324 -1
19840522 -0.502145923
19780718 1
19811222 -0.447154472
19781017 0
19761108 -0.971014493
19791006 1
19891219 0.818181818
19851217 0.970149254
19980818 0.808219178
19940816 -0.985185185
19790814 -0.966666667
19990203 -0.882352941
19831220 1
19830114 -1
19980204 -0.991489362
19941115 -0.966101695
19860520 -0.986206897
19761019 -0.666666667
19900207 -0.983870968
19731010 0
19821221 -0.833333333
19770517 1
19800205 0.662337662
19760329 -0.545454545
19810224 -0.957446809
20000628 -0.989473684
19911105 -0.988571429
19960924 -0.483870968
19880816 1
19860923 1
20030506 -1
20031209 -1
19950201 -0.974025974
19790206 1
19811117 -0.989304813
19950822 -1
19860212 0.808219178
19730821 -0.463203463
Use these lines to generate r:
m<-gsub("-", "/", as.Date(as.character(fileloc$V1), "%Y%m%d"))
r<-cbind(m, fileloc[2])
colnames(r)
r
Dataset 2: the following defines the recession periods in the US:
library(quantmod)
getSymbols("USREC",src="FRED")
getSymbols("UNRATE", src="FRED")
unrate.df <- data.frame(date= index(UNRATE),UNRATE$UNRATE)
start <- index(USREC[which(diff(USREC$USREC)==1)])
end <- index(USREC[which(diff(USREC$USREC)==-1)-1])
recession.df <- data.frame(start = start, end = end[-1])
recession.df <- subset(recession.df, start >= min(unrate.df$date))
The resulting recession.df
> recession.df
start end
1 1948-12-01 1949-10-01
2 1953-08-01 1954-05-01
3 1957-09-01 1958-04-01
.....
11 2008-01-01 2009-06-01
Plotting:
I can generate separate scatter plots with the following:
ggplot(r, aes(V2, r$m, colour=V2))+
geom_point()+xlab(label='Tone Score')+ylab(label='Dates')
and a time series with a shaded region for recessions with:
ggplot()+
geom_line(data=unrate.df, aes(x=date, y=UNRATE)) +
geom_rect(data=recession.df,
aes(xmin=start,xmax=end, ymin=0,ymax=max(unrate.df$UNRATE)),
fill="red", alpha=0.2)
How do I merge these plots to overlay the scatter plot on the time series?
Since you haven't provided the full dataset for the question, I have generated some random data covering the dates in your sample:
set.seed(123)
r <- data.frame(m = seq.Date(as.Date("2017/12/21"), as.Date("1950/08/21"),
length.out = 135),
V2 = rnorm(n = 135, mean = 0, sd = 0.5))
You can overlay multiple layers within a ggplot by giving each of the geom_ calls its own data and aes arguments.
ggplot() +
geom_point(data = r, aes(x = m, y = V2, colour=V2))+
geom_line(data=unrate.df, aes(x=date, y=UNRATE)) +
geom_rect(data=recession.df,
aes(xmin=start, xmax=end, ymin=0, ymax=max(unrate.df$UNRATE)),
fill="red", alpha=0.2) +
xlab(label='Tone Score')+ylab(label='Dates')
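Note that the two series are on very different scales (V2 lies in [-1, 1] while UNRATE is in percent), so the points will sit near the bottom of the plot. One simple option is to rescale V2 before plotting (a sketch; V2scaled is a new helper column, not from the original question):
# Map V2 from [-1, 1] onto [0, max(UNRATE)] so both layers share a scale
r$V2scaled <- (r$V2 + 1) / 2 * max(unrate.df$UNRATE)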

How to interpret H2O's confusion matrix?

I am using h2o version 3.10.4.8.
library(magrittr)
library(h2o)
h2o.init(nthreads = -1, max_mem_size = "6g")
data.url <- "https://raw.githubusercontent.com/DarrenCook/h2o/bk/datasets/"
iris.hex <- paste0(data.url, "iris_wheader.csv") %>%
h2o.importFile(destination_frame = "iris.hex")
y <- "class"
x <- setdiff(names(iris.hex), y)
model.glm <- h2o.glm(x, y, iris.hex, family = "multinomial")
preds <- h2o.predict(model.glm, iris.hex)
h2o.confusionMatrix(model.glm)
h2o.table(preds["predict"])
This is the output of h2o.confusionMatrix(model.glm):
Confusion Matrix: vertical: actual; across: predicted
Iris-setosa Iris-versicolor Iris-virginica Error Rate
Iris-setosa 50 0 0 0.0000 = 0 / 50
Iris-versicolor 0 48 2 0.0400 = 2 / 50
Iris-virginica 0 1 49 0.0200 = 1 / 50
Totals 50 49 51 0.0200 = 3 / 150
Since it says across:predicted, I interpret this to mean that the model made 50 (0 + 48 + 2) predictions that are Iris-versicolor.
This is the output of h2o.table(preds["predict"]):
predict Count
1 Iris-setosa 50
2 Iris-versicolor 49
3 Iris-virginica 51
This tells me that the model made 49 predictions that are Iris-versicolor.
Is the confusion matrix incorrectly labelled or did I make a mistake in interpreting the results?
Row names (vertical) are the actual labels.
Column names (across) are the predicted labels.
You did not make a mistake; the labels are confusing (and have caused people to think that the rows and columns were switched). This was fixed recently and will be included in the next release of H2O.
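If you want to double-check the orientation yourself, one option (a sketch) is to pull the predictions into R and cross-tabulate them against the actual labels:
# Rows are the actual classes, columns the predicted ones
actual <- as.data.frame(iris.hex[y])
predicted <- as.data.frame(preds["predict"])
table(actual = actual$class, predicted = predicted$predict)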

fitting more than one regression line to a scatterplot in R

I'm trying to fit regression lines to the relation angc ~ ext. The variable pch divides the data into two sets, and I want to fit a regression line with its confidence intervals to each. Here's my data frame (C):
"ext" "angc" "pch"
25 3.76288002820208 0
29 4.44255895177431 0
21 2.45214044383301 0
35 4.01334352881766 0
35 9.86225452423762 0
28 19.9304126868056 1
32 25.6984064030981 1
20 5.10582966112880 0
36 5.75603291081328 0
11 4.62311785943305 0
33 4.94401591414043 0
27 8.10039123328465 0
29 16.3882499757369 1
30 29.3492784626796 1
29 3.85960848290140 0
32 5.35857680326963 0
26 4.86451443776053 0
16 8.22008387344697 0
30 10.2212259432413 0
32 17.2519440101067 1
29 27.5011256290209 1
My code:
c0 <- C[C$pch == 0, ]
c1 <- C[C$pch == 1, ]
prd0 <- as.data.frame( predict( lm(c0$angc ~ c0$ext), interval = c("confidence") ) )
prd1 <- as.data.frame( predict( lm(c1$angc ~ c1$ext), interval = c("confidence") ) )
dev.new()
plot( C$angc ~ C$ext, type = 'n' )
points( c0$angc ~ c0$ext, pch = 17 ) # triangles
abline(lm(c0$angc ~ c0$ext)) # regression line
lines(prd0$lwr) # lower CI
lines(prd0$upr) # upper CI
points( c1$angc ~ c1$ext, pch = 1 ) # circles
abline(lm(c1$angc ~ c1$ext))
lines(prd1$lwr, type = 'l', lty = 3 )
lines(prd1$upr, type = 'l', lty = 3 )
I have two problems:
How can I get the desired regression line for the circles? It should be an almost vertical line (check c1)
I don't get correct confidence intervals
Thank you for your help,
Santi
In ggplot2 you can do this rather efficiently:
ggplot(C, aes(x = ext, y = angc, shape = factor(pch))) + geom_point() +
geom_smooth(method = "lm")
This will create a scatterplot (geom_point()) of angc vs ext, where the shape of the points is based on pch (converted to a factor so it can be mapped to a discrete aesthetic). In addition, a separate regression line is drawn for each unique element of pch. The name geom_smooth() comes from the fact that it draws a smoothed version of the data, in this case a linear regression.
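If you want to stay in base graphics, the fix for the original code's confidence bands is to plot them against ext (sorted) rather than against row index (a sketch for the circles group):
# Sort by ext so the CI curves are drawn against x rather than row order
fit1 <- lm(angc ~ ext, data = c1)
prd1 <- predict(fit1, interval = "confidence")
o <- order(c1$ext)
plot(angc ~ ext, data = C, type = "n")
points(angc ~ ext, data = c1, pch = 1)
abline(fit1)
lines(c1$ext[o], prd1[o, "lwr"], lty = 3)
lines(c1$ext[o], prd1[o, "upr"], lty = 3)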
