Why does MARS (earth package) generate so many predictors? - r

I am working on a MARS model using the earth package in R. My dataset (CE.Rda) consists of one dependent variable (D9_RTO_avg) and 10 potential predictors (NDVI_l1, NDVI_f0, NDVI_f1, NDVI_f2, NDVI_f3, LST_l1, LST_f0, LST_f1, LST_f2, LST_f3). Here is the head of my dataset:
D9_RTO_avg NDVI_l1 NDVI_f0 NDVI_f1 NDVI_f2 NDVI_f3 LST_l1 LST_f0 LST_f1 LST_f2 LST_f3
2 1.866667 0.3082 0.3290 0.4785 0.4330 0.5844 38.25 30.87 31 21.23 17.92
3 2.000000 0.2164 0.2119 0.2334 0.2539 0.4686 35.7 29.7 28.35 21.67 17.71
4 1.200000 0.2324 0.2503 0.2640 0.2697 0.4726 40.13 33.3 28.95 22.81 16.29
5 1.600000 0.1865 0.2070 0.2104 0.2164 0.3911 43.26 35.79 30.22 23.07 17.88
6 1.800000 0.2757 0.3123 0.3462 0.3778 0.5482 43.99 36.06 30.26 21.36 17.93
7 2.700000 0.2265 0.2654 0.3174 0.2741 0.3590 41.61 35.4 27.51 23.55 18.88
After creating my earth model as follows
mymodel.mod <- earth(D9_RTO_avg ~ ., data=CE, nk=10)
I print the summary of the resulting model by typing
print(summary(mymodel.mod, digits=2, style="pmax"))
and I obtain the following output
D9_RTO_avg =
4.1
+ 38 * LST_f128.68
+ 6.3 * LST_f216.41
- 2.9 * pmax(0, 0.66 - NDVI_l1)
- 2.3 * pmax(0, NDVI_f3 - 0.23)
Selected 5 of 7 terms, and 4 of 13169 predictors
Termination condition: Reached nk 10
Importance: LST_f128.68, NDVI_l1, NDVI_f3, LST_f216.41, NDVI_f0-unused, NDVI_f1-unused, NDVI_f2-unused, ...
Number of terms at each degree of interaction: 1 4 (additive model)
GCV 2 RSS 4046 GRSq 0.29 RSq 0.29
My question is: why does earth report 13169 predictors when there are actually only 10? It seems that MARS is treating individual observations of the candidate predictors as predictors themselves. How can I prevent this?
Thanks for your help
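One possible explanation, offered as a sketch rather than a confirmed diagnosis: term names such as LST_f128.68 look like the column name LST_f1 fused with the value 28.68, which is how a factor level would be labelled. If the LST columns were read as factor or character instead of numeric, each distinct value would become its own indicator column. The example below uses a small made-up data frame (CE_demo, not the real CE) to show how to check the column classes and convert them; it does not call earth itself.

```r
# Hypothetical stand-in for CE: LST_f1 has been read as a factor
CE_demo <- data.frame(
  D9_RTO_avg = c(1.87, 2.00, 1.20, 1.60),
  NDVI_l1    = c(0.3082, 0.2164, 0.2324, 0.1865),
  LST_f1     = factor(c("31", "28.35", "28.95", "30.22"))
)

# Step 1: inspect the classes; factor or character columns are the suspects
print(sapply(CE_demo, class))

# Step 2: convert through character so the printed values are kept
# (as.numeric() on a factor alone would return the internal level codes)
not_num <- !sapply(CE_demo, is.numeric)
CE_demo[not_num] <- lapply(CE_demo[not_num],
                           function(x) as.numeric(as.character(x)))

str(CE_demo)
```

Entries that cannot be parsed as numbers become NA with a warning, which can help locate whatever caused the column to be read as text in the first place.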

Related

How can I extract specific data points from a wide-formatted text file in R?

I have datasheets with multiple measurements that look like the following:
FILE DATE TIME LOC QUAD LAI SEL DIFN MTA SEM SMP
20 20210805 08:38:32 H 1161 2.80 0.68 0.145 49. 8. 4
ANGLES 7.000 23.00 38.00 53.00 68.00
CNTCT# 1.969 1.517 0.981 1.579 1.386
STDDEV 1.632 1.051 0.596 0.904 0.379
DISTS 1.008 1.087 1.270 1.662 2.670
GAPS 0.137 0.192 0.288 0.073 0.025
A 1 08:38:40 31.66 33.63 34.59 39.13 55.86
1 2 08:38:40 -5.0e-006
B 3 08:38:48 25.74 20.71 15.03 2.584 1.716
B 4 08:38:55 0.344 1.107 2.730 0.285 0.265
B 5 08:39:02 3.211 5.105 13.01 4.828 1.943
B 6 08:39:10 8.423 22.91 48.77 16.34 3.572
B 7 08:39:19 12.58 14.90 18.34 18.26 4.125
I would like to read the entire datasheet and extract the values for 'QUAD' and 'LAI' only. For example, for the data above I would only be extracting a QUAD of 1161 and an LAI of 2.80.
In the past the datasheets were formatted as long data, and I was able to use the following code:
library(stringr)
QUAD <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^QUAD).*$")))
LAI <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^LAI).*$")))
data_extract <- data.frame(
QUAD = QUAD[!is.na(QUAD)],
LAI = LAI[!is.na(LAI)]
)
data_extract
Unfortunately, this does not work because of the wide formatting in the current datasheet. Any help would be hugely appreciated. Thanks in advance for your time.
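A minimal base R sketch of one way to handle the wide layout, assuming every record starts with a header line beginning with FILE, that the values sit on the very next line, and that no field contains spaces. The sample text is copied from the excerpt above; the helper name extract_one is made up for illustration.

```r
# Sample lines copied from the datasheet excerpt in the question
txt <- c(
  "FILE DATE TIME LOC QUAD LAI SEL DIFN MTA SEM SMP",
  "20 20210805 08:38:32 H 1161 2.80 0.68 0.145 49. 8. 4",
  "ANGLES 7.000 23.00 38.00 53.00 68.00",
  "CNTCT# 1.969 1.517 0.981 1.579 1.386"
)

# Locate every header line; the matching values are on the following line
hdr_idx <- grep("^\\s*FILE\\s", txt)

# Hypothetical helper: pair header names with the values below them
extract_one <- function(i) {
  keys <- strsplit(trimws(txt[i]), "\\s+")[[1]]
  vals <- strsplit(trimws(txt[i + 1]), "\\s+")[[1]]
  names(vals) <- keys  # assumes both lines have the same number of fields
  data.frame(QUAD = as.numeric(vals[["QUAD"]]),
             LAI  = as.numeric(vals[["LAI"]]))
}

data_extract <- do.call(rbind, lapply(hdr_idx, extract_one))
data_extract
```

For a real file, txt would come from readLines() on the datasheet path; a file with several records yields one row per FILE header.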

How to evaluate a string variable as factor in the emmeans() command in R?

I would like to pass a variable holding the name of a factor from an ANOVA model to emmeans(). Here I use the oranges dataset that ships with the emmeans package to make the code reproducible. This is my model and how I would usually calculate the emmeans of the factor store:
library(emmeans)
oranges$store<-as.factor(oranges$store)
model <- lm (sales1 ~ 1 + price1 + store ,data=oranges)
means<-emmeans(model, pairwise ~ store, adjust="tukey")
Now I would like to assign a variable (lsmeanfact) defining the factor for which the lsmeans are calculated.
lsmeanfact<-"store"
However, when I try to evaluate this variable inside emmeans(), it returns an error: it does not find the variable lsmeanfact, so the variable is never evaluated.
means<-emmeans(model, pairwise ~ eval(parse(lsmeanfact)), adjust="tukey")
Error in emmeans(model, pairwise ~ eval(parse(lsmeanfact)), adjust = "tukey") :
No variable named lsmeanfact in the reference grid
How should I change my code so that lsmeanfact is evaluated and the lsmeans for the factor it names (here "store") are correctly calculated?
You can use the reformulate() function.
library(emmeans)
lsmeanfact<-"store"
means <- emmeans(model, reformulate(lsmeanfact, 'pairwise'), adjust="tukey")
Or construct a formula with formula/as.formula.
means <- emmeans(model, formula(paste('pairwise', lsmeanfact, sep = '~')), adjust="tukey")
Here both reformulate(lsmeanfact, 'pairwise') and formula(paste('pairwise', lsmeanfact, sep = '~')) return pairwise ~ store.
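As a quick check that needs only base R (no model or emmeans call involved), the two constructions can be built and compared directly. The names f1 and f2 are just illustrative.

```r
lsmeanfact <- "store"

# Build the two-sided formula in both ways described above
f1 <- reformulate(lsmeanfact, response = "pairwise")
f2 <- as.formula(paste("pairwise", lsmeanfact, sep = " ~ "))

print(f1)
print(f2)

# Compare the deparsed text, which ignores the formulas' environments
identical(deparse(f1), deparse(f2))
```

Because the factor name is only ever handled as a string, the same lines work for any other column name stored in lsmeanfact.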
You do not need to do anything special at all. The specs argument to emmeans() can be a character value. You can get the pairwise comparisons in a separate call, which is actually a better way to go anyway.
library(emmeans)
model <- lm(sales1 ~ price1 + store, data = oranges)
lsmeanfact <- "store"
( EMM <- emmeans(model, lsmeanfact) )
## store emmean SE df lower.CL upper.CL
## 1 8.01 2.61 29 2.67 13.3
## 2 9.60 2.30 29 4.89 14.3
## 3 7.84 2.30 29 3.13 12.6
## 4 10.44 2.35 29 5.63 15.2
## 5 10.19 2.28 29 5.53 14.9
## 6 15.22 2.28 29 10.56 19.9
##
## Confidence level used: 0.95
pairs(EMM)
## contrast estimate SE df t.ratio p.value
## 1 - 2 -1.595 3.60 29 -0.443 0.9976
## 1 - 3 0.165 3.60 29 0.046 1.0000
## 1 - 4 -2.428 3.72 29 -0.653 0.9856
## 1 - 5 -2.185 3.50 29 -0.625 0.9882
## 1 - 6 -7.209 3.45 29 -2.089 0.3206
## 2 - 3 1.761 3.22 29 0.546 0.9936
## 2 - 4 -0.833 3.23 29 -0.258 0.9998
## 2 - 5 -0.590 3.23 29 -0.182 1.0000
## 2 - 6 -5.614 3.24 29 -1.730 0.5239
## 3 - 4 -2.593 3.23 29 -0.802 0.9648
## 3 - 5 -2.350 3.23 29 -0.727 0.9769
## 3 - 6 -7.375 3.24 29 -2.273 0.2373
## 4 - 5 0.243 3.26 29 0.075 1.0000
## 4 - 6 -4.781 3.28 29 -1.457 0.6930
## 5 - 6 -5.024 3.23 29 -1.558 0.6314
##
## P value adjustment: tukey method for comparing a family of 6 estimates
Created on 2021-06-29 by the reprex package (v2.0.0)
In any case, what specs needs is the name(s) of the factors involved, not the factors themselves. Note also that it was unnecessary to convert store to a factor before fitting the model.

Computing regression intercepts and saving them in a separate column

Our dataframe has the following structure:
RecordID datecode Name Betha RF MKTRF Ri
60 1 2014-12-01 1290 GAMCO Small/Mid Cap Value A 0.7256891 0.0000 -0.06 1.61
61 1 2015-01-01 1290 GAMCO Small/Mid Cap Value A 0.7256891 0.0000 -3.11 -3.53
62 1 2015-02-01 1290 GAMCO Small/Mid Cap Value A 0.7256891 0.0000 6.13 5.49
63 1 2015-03-01 1290 GAMCO Small/Mid Cap Value A 0.7256891 0.0000 -1.12 0.29
64 1 2015-04-01 1290 GAMCO Small/Mid Cap Value A 0.7256891 0.0000 0.59 0.67
65 1 2015-05-01 1290 GAMCO Small/Mid Cap Value A 0.7256891 0.0000 1.36 0.57
392035 3267 2019-07-01 Wasatch Core Growth Institutional 0.6421722 0.0019 1.19 6.75
392036 3267 2019-08-01 Wasatch Core Growth Institutional 0.6421722 0.0016 -2.58 0.09
392037 3267 2019-09-01 Wasatch Core Growth Institutional 0.6421722 0.0018 1.44 4.99
392038 3267 2019-10-01 Wasatch Core Growth Institutional 0.6421722 0.0015 2.06 -3.68
392039 3267 2019-11-01 Wasatch Core Growth Institutional 0.6421722 0.0012 3.87 5.35
392040 3267 2019-12-01 Wasatch Core Growth Institutional 0.6421722 0.0014 2.77 1.12
We need to compute yearly Jensen's Alpha and Fama & French 3-factor alphas and store them in separate columns in order to run regressions on them. The formulas for both regressions are illustrated below:
Jensen Alpha: Ri ~ a + B1*MKTRF + e
3-factor Alpha: Ri ~ a + B1*MKTRF + B2*SMB + B3*HML + B4*UMD + e
We have tried saving the data both as a data table and in a panel data format, and ran this regression from a similar post to compute the Jensen Alpha:
dt[, alpha:= roll_regres.fit(x = cbind(1, .SD[["MKTRF"]]), y = .SD[["Ri"]], width = 12L)$coefs[, 1], by = RecordID]
The post: Rollregres with multiple regression and panel data
However, it did not work and kept giving the error message:
Error in roll_regres.fit(x = cbind(1, .SD[["MKTRF"]]), y = .SD[["Ri"]], :
subscript out of bounds
We are using the rollRegres package; no additional packages have been used.
What are we doing wrong, and can anybody help us compute the yearly Jensen Alpha so that we can store it in a separate column? :)
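A minimal sketch of a different route than the rolling regression above: fit one ordinary lm() per fund and calendar year and keep the intercept. It assumes that "yearly" means calendar year, follows the Jensen formula exactly as written above (Ri on MKTRF), and uses simulated returns in place of the real data frame; the column name jensen_alpha is made up.

```r
set.seed(1)

# Simulated stand-in for the real data: two funds, 24 monthly rows each
df <- data.frame(
  RecordID = rep(c(1, 3267), each = 24),
  datecode = rep(seq(as.Date("2018-01-01"), by = "month", length.out = 24),
                 times = 2),
  MKTRF    = rnorm(48),
  Ri       = rnorm(48)
)
df$year <- as.integer(format(df$datecode, "%Y"))

# One regression per fund-year; the intercept is the alpha for that group
groups <- split(df, list(df$RecordID, df$year), drop = TRUE)
alphas <- do.call(rbind, lapply(groups, function(g) {
  data.frame(
    RecordID     = g$RecordID[1],
    year         = g$year[1],
    jensen_alpha = unname(coef(lm(Ri ~ MKTRF, data = g))[1])
  )
}))

# Attach the yearly alpha to every monthly row of the same fund and year
df <- merge(df, alphas, by = c("RecordID", "year"))
head(df)
```

The multi-factor alpha would follow the same pattern with the extra factor columns added to the formula, provided those columns exist in the data.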

R - two types of prediction in cross validation

When I use cross-validation on my data, the output contains two types of prediction: Predicted and cvpred. What is the difference between the two? I guess cvpred is the cross-validation prediction, but what is the other one?
Here is some of my code:
crossvalpredict <- cv.lm(data = total,form.lm = formula(verim~X4+X4.1),m=5)
And this is the result:
fold 1
Observations in test set: 5
3 11 15 22 23
Predicted 28.02 32.21 26.53 25.1 21.28
cvpred 20.23 40.69 26.57 34.1 26.06
verim 30.00 31.00 28.00 24.0 20.00
CV residual 9.77 -9.69 1.43 -10.1 -6.06
Sum of squares = 330 Mean square = 66 n = 5
fold 2
Observations in test set: 5
2 7 21 24 25
Predicted 28.4 32.0 26.2 19.95 25.9
cvpred 52.0 81.8 36.3 14.28 90.1
verim 30.0 33.0 24.0 21.00 24.0
CV residual -22.0 -48.8 -12.3 6.72 -66.1
Sum of squares = 7428 Mean square = 1486 n = 5
fold 3
Observations in test set: 5
6 14 18 19 20
Predicted 34.48 36.93 19.0 27.79 25.13
cvpred 37.66 44.54 16.7 21.15 7.91
verim 33.00 35.00 18.0 31.00 26.00
CV residual -4.66 -9.54 1.3 9.85 18.09
Sum of squares = 539 Mean square = 108 n = 5
fold 4
Observations in test set: 5
1 4 5 9 13
Predicted 31.91 29.07 32.5 32.7685 28.9
cvpred 30.05 28.44 54.9 32.0465 11.4
verim 32.00 27.00 31.0 32.0000 30.0
CV residual 1.95 -1.44 -23.9 -0.0465 18.6
Sum of squares = 924 Mean square = 185 n = 5
fold 5
Observations in test set: 5
8 10 12 16 17
Predicted 27.8 30.28 26.0 27.856 35.14
cvpred 50.3 33.92 45.8 31.347 29.43
verim 28.0 30.00 24.0 31.000 38.00
CV residual -22.3 -3.92 -21.8 -0.347 8.57
Sum of squares = 1065 Mean square = 213 n = 5
Overall (Sum over all 5 folds)
ms
411
You can check that by reading the help of the function you are using cv.lm. There you will find this paragraph:
The input data frame is returned, with additional columns
‘Predicted’ (Predicted values using all observations) and ‘cvpred’
(cross-validation predictions). The cross-validation residual sum
of squares (‘ss’) and degrees of freedom (‘df’) are returned as
attributes of the data frame.
This says that Predicted is a vector of predicted values made using all the observations. In other words, these are predictions made on your "training" data, or "in sample".
To check whether this is so, you can fit the same model using lm:
fit <- lm(verim~X4+X4.1, data=total)
And see if the predicted values from this model:
predict(fit)
are the same as those returned by cv.lm
When I tried it on the iris dataset in R, the Predicted values from cv.lm() were the same as those from predict() on the lm fit. So in that case they are in-sample predictions: the model is fitted and evaluated on the same observations.
lm() does not give "better results." I am not sure how predict() and cv.lm() can be the same. predict() returns the expected values of Y for each sample, estimated from the fitted model (covariates (X) and their corresponding estimated Beta values). Those Beta values, and the model error (E), were estimated from that original data. By using predict(), you get an overly optimistic estimate of model performance. That is why it seems better. You get a better (more realistic) estimate of model performance using an iterated sample holdout technique, like cross validation (CV). The least biased estimate comes from leave-one-out CV and the estimate with the least uncertainty (prediction error) comes from 2-fold (K=2) CV.
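The distinction can be reproduced by hand without cv.lm. The sketch below is an illustration on iris with an arbitrarily chosen formula and a manual 5-fold split; it is not the internals of cv.lm, and the object names are made up.

```r
set.seed(42)
dat  <- iris
form <- Sepal.Length ~ Sepal.Width + Petal.Length

# "Predicted": one model fitted on all rows, then used on those same rows
fit_all   <- lm(form, data = dat)
predicted <- predict(fit_all)

# "cvpred": each row is predicted by a model that never saw that row
k      <- 5
fold   <- sample(rep(seq_len(k), length.out = nrow(dat)))
cvpred <- numeric(nrow(dat))
for (i in seq_len(k)) {
  test  <- fold == i
  fit_i <- lm(form, data = dat[!test, ])
  cvpred[test] <- predict(fit_i, newdata = dat[test, ])
}

# Mean squared error of the in-sample and the held-out predictions
mse_in <- mean((dat$Sepal.Length - predicted)^2)
mse_cv <- mean((dat$Sepal.Length - cvpred)^2)
c(in_sample = mse_in, cross_validated = mse_cv)
```

Comparing the two error figures shows how much the in-sample number flatters the model.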

Extracting logLik from multiple glmer models into a data frame

I'm trying to make an AIC table as described by Anderson et al. (Suggestions for Presenting the Results of Data Analyses. J.Wildl Manage. 65(3): 2001). To do so, I have been using ICtab in the bbmle package.
aic2 <- ICtab(du1500o,du1500a,du1500b,du1500c,du1500d,du1500e,du1500f,
du1500g,du1500h,du1500i,du1500j,du1500k,du1500l,
type="AICc", weights=T,delta=T,sort=T,base=T,nobs=87)
This gives me a nice little ranked table.
AICc df dAICc weight
du1500a 402.4 6 0.0 0.4201
du1500k 403.4 6 1.0 0.2580
du1500j 404.4 6 2.0 0.1520
du1500f 405.6 6 3.2 0.0842
du1500g 406.8 6 4.4 0.0459
du1500e 408.7 6 6.3 0.0176
du1500d 410.2 6 7.8 0.0084
du1500l 410.7 6 8.3 0.0065
du1500i 412.3 6 9.9 0.0030
du1500o 412.4 4 10.0 0.0029
du1500c 415.2 6 12.8 <0.001
du1500h 416.6 6 14.2 <0.001
du1500b 416.6 6 14.2 <0.001
I want to add the log likelihood of each model to this table. I can extract one log likelihood at a time with the logLik function. However, I have been unable to find a more efficient way to extract the LL from all of the models.
head(logLik(du1500a))
[1] -194.6726
Should I turn aic2 into a data frame? How?
If so, how can I then efficiently append the LL of each model to that data frame?
Much appreciated,
Nava
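One way to avoid extracting each value by hand is to keep the fitted models in a named list and loop over it. The sketch below builds its own table instead of converting the ICtab object, uses plain lm() fits on mtcars as stand-ins for the glmer models, and reports AIC from stats rather than AICc; all object names are illustrative.

```r
# Stand-in models; in practice this would be list(du1500a = du1500a, ...)
models <- list(
  m_wt    = lm(mpg ~ wt, data = mtcars),
  m_hp    = lm(mpg ~ hp, data = mtcars),
  m_wt_hp = lm(mpg ~ wt + hp, data = mtcars)
)

# One log likelihood per model, collected in a single pass
ll <- sapply(models, function(m) as.numeric(logLik(m)))

tab <- data.frame(
  model  = names(models),
  logLik = ll,
  df     = sapply(models, function(m) attr(logLik(m), "df")),
  AIC    = sapply(models, AIC),
  row.names = NULL
)

# Rank the models by their difference from the best AIC
tab$dAIC <- tab$AIC - min(tab$AIC)
tab <- tab[order(tab$dAIC), ]
tab
```

The same sapply() call over a named list should apply to glmer fits as well, since lme4 provides a logLik method for them.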
