How to apply a fastai tabular model to new data?

I trained a model with fastai.tabular, so now I have a fitted learner. Ultimately, models exist to be applied to new data, not just to be fitted on a training set and evaluated on a test set. I tried different things, all resulting in errors or some weirdness. Is there a way to apply a model trained with fastai to previously unseen data? Or do I have to retrain the model again and again, feeding the new test data in each time? That does not seem plausible.
df_test = pd.read_parquet('generated_test.parquet').head(100)
test_data = TabularList.from_df(df_test, cat_names=cat_names, cont_names=cont_names)
prediction = learn.predict(test_data)
KeyError: 'atomic_distance'
atomic_distance is the name of a column present in both the training and test data and also contained in cont_names.
prediction = learn.get_preds(kaggle_test_data)
This runs, but it returns something puzzling:
[tensor([[136.0840],
[ -2.0286],
[ -2.0944],
...,
[135.6165],
[ 2.7626],
[ 8.0316]]),
tensor([ 84.8076, -11.2570, -11.2548, ..., 81.0491, 0.8874, 4.1235])]
The documentation says:
Docstring: Return predictions and targets on ds_type dataset.
This is new, unlabeled data, so I don't know why the returned object should contain targets, or where they are coming from. The size does not make sense either: I am expecting something with 100 values.
I found a way by passing in the dataframe row by row:
prediction = [float(learn.predict(df_test.loc[i])[0].data) for i in df_test.index]
There is also the method predict_batch available, but it does not seem to accept dataframes. Are there better ways to do this?

I use:
data_test = (TabularList.from_df(DF_TEST, path=path, cat_names=cat_names,cont_names=cont_vars, procs=procs)
                           .split_none()
                           .label_from_df(cols=dep_var))
data_test.valid = data_test.train
data_test = data_test.databunch()
learn.data.valid_dl = data_test.valid_dl
pred = learn.get_preds(ds_type=DatasetType.Valid)[0]
Where DF_TEST is the test dataframe, dep_var is the dependent variable, and learn is your fitted model.
To be honest, it works most of the time; other times it gives a weird error and then I have to iterate over each row to get the predictions.
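A minimal sketch that wraps the recipe above in a helper (fastai v1; it assumes path, cat_names, cont_vars, procs, and dep_var are defined as at training time, and, like the snippet above, it needs the dep_var column present in the new frame, dummy values being fine):
from fastai.tabular import TabularList
from fastai.basic_data import DatasetType

def predict_on_df(learn, df):
    # Rebuild a labelled TabularList from the new frame, reusing the training procs
    data = (TabularList.from_df(df, path=path, cat_names=cat_names,
                                cont_names=cont_vars, procs=procs)
            .split_none()
            .label_from_df(cols=dep_var))
    data.valid = data.train                      # expose the new data as a validation set
    learn.data.valid_dl = data.databunch().valid_dl
    return learn.get_preds(ds_type=DatasetType.Valid)[0]

pred = predict_on_df(learn, DF_TEST)             # one tensor of predictions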

Related

What is this syntax, and how do I use it?

I came across this syntax from a previous question on Stack Overflow, and I am unfamiliar with it.
However, it seems to work pretty well, and I've been able to work out how to use it, but that doesn't mean I understand it.
Is this base R, or a library?
cor.test( ~ hp + qsec, mtcars)
I am referring to the usage of ~, and the subsequent use of + in the call, and how that allows the specification of columns in a dataframe.
The help page for cor.test lists one form of the function as
        cor.test(formula, data, subset, na.action, ...)
and in the description of the arguments it says:
        formula: a formula of the form ~ u + v
~ hp + qsec is a formula, so you can get a lot of information by looking at the help page help(formula). However, that page emphasizes formulas of the form a ~ b, which can be interpreted as something like "a as a function of b". This formula (~ a + b) has no dependent variable; it can be interpreted as something like "using the variables a and b".
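A minimal illustration of the equivalence, using the built-in mtcars data (both calls run the same Pearson correlation test):
# One-sided formula: "use the variables hp and qsec from mtcars"
cor.test(~ hp + qsec, data = mtcars)
# The same test, passing the two vectors directly
cor.test(mtcars$hp, mtcars$qsec)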

R predict factor variables

I am trying to do some prediction in R. I loaded and cleaned the data, fitted a model, and made a prediction that looks pretty good. My problem is that the prediction gives me the probability of each factor level occurring instead of the level itself.
I have a dataset on how well people perform some exercise. This performance is measured as A-E (a factor variable in my dataset). When I do the prediction I get a matrix of per-level probabilities, but I want to have it like this:
[ B A E A A C D A A A C ]
How would I do that? This is my code:
modFitA1 <- rpart(classe ~ ., data=PML_Train_red, method="class")
Predictn<-predict(modFitA1, newdata= PML_Test_red)
Predictn
Even though you put method="class" in your model statement, you need to add type="class" to your predict statement.
Predictn<-predict(modFitA1, newdata= PML_Test_red, type="class")
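For classification trees, predict.rpart() returns a matrix of class probabilities by default; a minimal self-contained illustration using the built-in iris data:
library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")
head(predict(fit, newdata = iris))                  # per-class probability matrix
head(predict(fit, newdata = iris, type = "class"))  # the factor levels themselves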

mboost/glmboost doesn't return proper result and na.action doesn't handle my NA issue

I have a problem with glmboost from the mboost package. After fitting the model (99 obs, 311 variables) it only returns an intercept with value 0 and no other variables:
glmboost(myformula, data = train, family = Gaussian(), na.action = na.pass, control = boost_control(nu = 0.1, mstop = 2000, trace = TRUE))
Generalized Linear Models Fitted via Gradient Boosting
Call:
glmboost.formula(formula = myformula, data = train, na.action = na.pass, control = boost_control(nu = 0.1, mstop = 2000, trace = TRUE), family = Gaussian())
Squared Error (Regression)
Loss function: (y - f)^2
Number of boosting iterations: mstop = 2000
Step size: 0.1
Offset: 6.536275
Coefficients:
(Intercept)
0
attr(,"offset")
[1] 6.536275
I chose trace = TRUE to print the risk:
[ 1] ................................................................................................................................................... -- risk: 0
[ 150] ................................................................................................................................................... -- risk: 0
[ 299] ................................................................................................................................................... -- risk: 0
and so on for all 2000 steps.
The risk already starts at 0, and I have no intuition for why. Do you?
I played with the formula and noticed that with some variables it works properly and with others it doesn't.
One "problem variable" I found contains only NAs. I thought that na.action = na.pass could handle that. Am I wrong, or could there be another issue causing this?
I also played with nu, mstop, and family, but that didn't help me solve the problem. I have read a lot of the mboost papers but couldn't find an issue like mine.
In another attempt I removed all variables with more than 52% NAs in the data before fitting the glmboost. Then everything worked fine.
To sum up:
What could be the reason that I get no coefficients and the risk starts at 0?
Why do variables with a lot of NAs cause problems, and why doesn't na.action handle that?
Please remove the variables that contain only NAs before fitting the model; this should solve your problem. In general, any regression model will fail if a variable consists entirely of missing values. That is a user problem which needs to be solved in advance of fitting the model.
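A minimal sketch of that pre-filtering step, using the 52% threshold mentioned in the question (train is assumed to be the training data frame):
# Fraction of missing values in each column
na_frac <- colMeans(is.na(train))
# Keep only columns that are less than 52% NA (this also drops all-NA columns)
train_clean <- train[, na_frac < 0.52, drop = FALSE]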
We are currently investigating the behaviour of mboost when NAs are present (issue 12) and will take your problem into consideration as well.
In the future, please try to post such issues directly on the GitHub development page. This will allow us (i.e., the authors/maintainers of mboost) to chime in directly.

R's randomForest() function error - any way I can get more info?

I'm getting the error message that "Type of predictors in new data do not match that of the training data."
This confuses me, since I am able to get the same data sets working under rpart and ctree. Those functions conveniently report which factors are causing the problem, so it's easy to debug. Right now I'm not sure which of my many variables are causing problems.
Is there a simple way to know which columns/variables are throwing randomForest off?
For what it's worth:
> write.csv(predict(object=train_comp.rp, newdata = test_w_age, type = c("prob")), file="test_predict_rp_w_age.csv")
> write.csv(predict(object=train_comp.rf, newdata = test_w_age, type = c("prob")), file="test_predict_rf_w_age.csv")
Error in predict.randomForest(object = train_comp.rf, newdata = test_w_age, : Type of predictors in new data do not match that of the training data.
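randomForest does not report which columns are at fault, but a common first diagnostic is to compare column classes and factor levels between the two data frames; a minimal sketch (train_df is a hypothetical name for the training data frame; test_w_age is as in the question):
# Compare only the columns the two frames share
shared <- intersect(names(train_df), names(test_w_age))
# Columns whose class differs between training and test data
train_cls <- sapply(train_df[shared], function(x) class(x)[1])
test_cls  <- sapply(test_w_age[shared], function(x) class(x)[1])
shared[train_cls != test_cls]
# Factor columns whose level sets differ between the two frames
fac <- shared[sapply(train_df[shared], is.factor)]
fac[sapply(fac, function(v) !identical(levels(train_df[[v]]), levels(test_w_age[[v]])))]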

Piece-wise linear and non-linear regression in R

I have a question which is perhaps more a statistical query than one related to R directly; however, it may be that I am just invoking an R package incorrectly, so I will post the question here. I have the following dataset:
x<-c(1e-08, 1.1e-08, 1.2e-08, 1.3e-08, 1.4e-08, 1.6e-08, 1.7e-08,
1.9e-08, 2.1e-08, 2.3e-08, 2.6e-08, 2.8e-08, 3.1e-08, 3.5e-08,
4.2e-08, 4.7e-08, 5.2e-08, 5.8e-08, 6.4e-08, 7.1e-08, 7.9e-08,
8.8e-08, 9.8e-08, 1.1e-07, 1.23e-07, 1.38e-07, 1.55e-07, 1.76e-07,
1.98e-07, 2.26e-07, 2.58e-07, 2.95e-07, 3.25e-07, 3.75e-07, 4.25e-07,
4.75e-07, 5.4e-07, 6.15e-07, 6.75e-07, 7.5e-07, 9e-07, 1.15e-06,
1.45e-06, 1.8e-06, 2.25e-06, 2.75e-06, 3.25e-06, 3.75e-06, 4.5e-06,
5.75e-06, 7e-06, 8e-06, 9.25e-06, 1.125e-05, 1.375e-05, 1.625e-05,
1.875e-05, 2.25e-05, 2.75e-05, 3.1e-05)
y2<-c(-0.169718017273307, 7.28508517630734, 71.6802510299446, 164.637259265704,
322.02901173786, 522.719633360006, 631.977073772459, 792.321270345847,
971.810607095548, 1132.27551798986, 1321.01923840546, 1445.33152600664,
1568.14204073109, 1724.30089942149, 1866.79717333592, 1960.12465709003,
2028.46548012508, 2103.16027631327, 2184.10965255236, 2297.53360080873,
2406.98288043262, 2502.95194879366, 2565.31085776325, 2542.7485752473,
2499.42610084412, 2257.31567571328, 2150.92120390084, 1998.13356362596,
1990.25434682546, 2101.21333152526, 2211.08405955931, 1335.27559108724,
381.326449703455, 430.9020598199, 291.370887491989, 219.580548355043,
238.708972427248, 175.583544448326, 106.057481792519, 59.8876372379487,
26.965143266819, 10.2965349811467, 5.07812046132922, 3.19125838983254,
0.788251933518549, 1.67980552001939, 1.97695007279929, 0.770663673279958,
0.209216903989619, 0.0117903221723813, 0.000974437796492681,
0.000668823762763647, 0.000545308757270207, 0.000490042305650751,
0.000468780182460397, 0.000322977916070751, 0.000195423690538495,
0.000175847622407421, 0.000135771259866332, 9.15607623591363e-05)
which when plotted looks like this:
I have then attempted to use the segmented package to generate three linear regressions (solid black line) in three regions (10^-8 to 10^-7, 10^-7 to 10^-6, and >10^-6), since I have a theoretical basis for expecting different relationships in these different regions. Clearly, however, my attempt using the following code was unsuccessful:
lin.mod <- lm(y2~x)
segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi=c(0.0000001,0.000001))
Thus my first question: are there further parameters of the segmentation I can tweak other than the breakpoints? As far as I understand, I have the iterations set to their default maximum here.
My second question: could I perhaps attempt a segmentation using nls? It looks as though the first two regions on the plot (10^-8 to 10^-7 and 10^-7 to 10^-6) are further from linear than the final section, so perhaps a polynomial function would be better there?
As an example of a result I find acceptable, I have annotated the original plot by hand:
Edit: The reason for using linear fits is the simplicity they provide; to my untrained eye, it would require a fairly complex nonlinear function to regress the dataset as a single unit. One thought that had crossed my mind was to fit a lognormal model to the data, as this might work given the skew along a log x-axis. However, I do not have enough competence in R to do this, as my knowledge only extends to fitdistr, which as far as I understand would not work here.
Any help or guidance in a relevant direction would be most appreciated.
If you are not satisfied with the segmented package, you can try the earth package, which implements the MARS algorithm. But here I find the result of the segmented model very acceptable; see the R-squared below.
lin.mod <- lm(y2~x)
segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi=c(0.0000001,0.000001))
summary(segmented.mod)
Meaningful coefficients of the linear terms:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.163e+02  1.143e+02  -1.893   0.0637 .
x            4.743e+10  3.799e+09  12.485   <2e-16 ***
U1.x        -5.360e+10  3.824e+09 -14.017       NA
U2.x         6.175e+09  4.414e+08  13.990       NA
Residual standard error: 232.9 on 54 degrees of freedom
Multiple R-Squared: 0.9468, Adjusted R-squared: 0.9419
Convergence attained in 5 iterations with relative change 3.593324e-14
You can check the result by plotting the model:
plot(segmented.mod)
To get the coefficients of the fitted segments, you can do this:
> intercept(segmented.mod)
$x
              Est.
intercept1 -216.30
intercept2 3061.00
intercept3   46.93
> slope(segmented.mod)
$x
             Est.   St.Err.  t value  CI(95%).l  CI(95%).u
slope1  4.743e+10 3.799e+09  12.4800  3.981e+10  5.504e+10
slope2 -6.177e+09 4.414e+08 -14.0000 -7.062e+09 -5.293e+09
slope3 -2.534e+06 5.396e+06  -0.4695 -1.335e+07  8.285e+06
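For completeness, a minimal sketch of the earth alternative mentioned above (standard formula interface of the earth package; x and y2 as defined in the question):
library(earth)                            # MARS: multivariate adaptive regression splines
mars.mod <- earth(y2 ~ x, data = data.frame(x, y2))
summary(mars.mod)                         # hinge terms play the role of breakpoints
plot(x, y2, log = "x")
lines(x, predict(mars.mod), col = "red")  # overlay the fitted curve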
