I'm currently using the Naive Bayes classifier from the e1071 package to classify data, and I was wondering whether there is any way to interact with and edit the fitted model's data.
For example, using the iris dataset and the methods described here to build a classifier from it, I want to be able to select the individual tables inside the classifier.
I would like to select a specific table (such as the Sepal.Length table) and compare its values against each other in order to get more information.
Does anyone have any methods for doing this?
Just figured it out
Essentially, the classifier is a list of four components: the a priori class probabilities, the conditional tables (the mean and standard deviation of each predictor for each class), the class levels, and the original call.
Each of those components is a nested list, and if you keep delving into the individual lists you can get at the individual items, including the individual probability matrices, and work from there. In each table, the first column is the mean and the second is the standard deviation. From there you can pull whatever data you want and edit to your heart's content.
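For concreteness, here is a minimal sketch of that drill-down on the iris example; the component names are what e1071's naiveBayes() returns:

library(e1071)

# Fit a classifier on iris
classifier <- naiveBayes(Species ~ ., data = iris)

# The conditional tables live in classifier$tables, one matrix per predictor;
# rows are classes, column 1 is the mean, column 2 the standard deviation
sepal <- classifier$tables$Sepal.Length
sepal

sepal[, 1]            # class means for Sepal.Length
sepal["setosa", 2]    # standard deviation for the setosa class

# The tables are ordinary matrices, so they can be edited in place:
classifier$tables$Sepal.Length["setosa", 1] <- 5.0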
I have a dataset, df, read in from a .csv file, with one column containing weights, as below:
Outcome,Heat,Mobility,Time,weights
Good,125,0.2,9,2
Neutral,250,0.5,10,2
Bad,12,1.6,1,3
Good,162,0.1,9,1
Good,150,0.3,9,1
Bad,8,5.2,2,4
Neutral,330,0.2,12,3
Neutral,200,0.6,8,1
Bad,50,12,4,3
Good,130,0.9,10,4
I usually begin a PCA analysis with prcomp(df[,2:4]), but there doesn't seem to be any option to add the weights.
I tried prcomp(df[,2:4], scale. = as.numeric(unlist(df[5]))), but that gave errors stating that the number of columns provided was not suitable. Is there a way to add the weights associated with each row here, somehow?
Also, how do I go about cross-validating the model I generate here using the "leave-one-out" approach?
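A hedged sketch of one workaround: prcomp() itself has no row-weight argument, but stats::cov.wt() does accept observation weights, so a weighted PCA can be assembled from a weighted covariance matrix (column indices and the weights name below match the df shown above):

# cov.wt() normalizes the weights internally and returns a weighted
# covariance matrix; eigen() then plays the role prcomp() would play
X <- as.matrix(df[, 2:4])
wc <- cov.wt(X, wt = df$weights, center = TRUE)
pc <- eigen(wc$cov)
loadings <- pc$vectors                                   # like prcomp()$rotation
scores <- scale(X, center = wc$center, scale = FALSE) %*% loadings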
I am using a random forest model. I want to compare the outcomes with OLS, and mainly see whether and how the contributions of individual variables differ between the two. So I used partial dependence plots to see the effect of the variables. However, this still does not give me a clear coefficient.
Is there another way to extract coefficients, or is it possible to extract a coefficient from the PDP?
I tried different ways to find the underlying data of the PDP. I also tried creating simulated data with only one variable varied to see how the predictions changed.
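A hedged sketch of one way to do this, assuming the plots come from the pdp package: partial() returns its grid as a plain data frame, so a rough per-variable "coefficient" can be read off as the slope of a line fitted to that grid.

library(randomForest)
library(pdp)   # assumption: the PDPs were made with the pdp package

set.seed(1)
rf <- randomForest(mpg ~ ., data = mtcars)

# partial() returns the underlying grid (columns: wt, yhat) when plot = FALSE
pd <- partial(rf, pred.var = "wt", train = mtcars)

# Slope of the partial dependence curve as a rough analogue of an OLS coefficient
coef(lm(yhat ~ wt, data = pd))
coef(lm(mpg ~ wt, data = mtcars))   # OLS comparison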
I wanted to try the global xgboost model from: https://business-science.github.io/modeltime/articles/modeling-panel-data.html
On a smaller scale it works fine (like the walmart data: 7 departments, 7 ids), but what if I would like to run it on 200,000 time series (ids)? step_dummy then creates another 200k columns and my PC can't handle it (it can't even handle 14k ids).
I tried removing step_dummy, but then xgboost forecasts the same values for all ids.
My question is: how can I forecast 200k time series with a global xgboost model and get proper values for each one of the 200k ids?
Or is step_dummy necessary in order to create proper forecasts for all ids?
PS: the code should be the same as in the link; the only difference is that my dataset has 50 monthly observations for each id.
For this model, the data must be given to xgboost in the format of a sparse matrix. That means that there should not be any non-numeric columns in the data prior to the conversion (which tidymodels does under the hood at the last minute).
The traditional method for converting a qualitative predictor into a quantitative one is to use dummy variables, but there are a lot of other choices: you can use an effect encoding, feature hashing, or other techniques.
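For example, a minimal sketch of swapping step_dummy() for an effect encoding from the embed package, so the id column becomes a single numeric column instead of 200k indicators (the names rec, train_tbl, value, and id are placeholders for the panel data in the linked article):

library(recipes)
library(embed)   # supplies effect- (likelihood-) encoding steps

# Encode each id by a per-level estimate of the outcome instead of indicators
rec <- recipe(value ~ ., data = train_tbl) %>%
  step_lencode_mixed(id, outcome = vars(value))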
I think there is no single proper answer to the question of how to forecast 200k time series. Global models are the way to go here, but you need to experiment to find out which series do not belong inside the global forecast model.
There will be a threshold, determined mostly by the length of the series, for which series you put inside the global model.
Also keep in mind to use several global models with different feature recipes.
If you want to avoid the step_dummy function, use lightgbm from the bonsai package; it handles categorical predictors natively and is considerably faster and more accurate.
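A minimal sketch of that swap, assuming a tidymodels workflow like the one in the linked article: bonsai registers the "lightgbm" engine for boost_tree(), and lightgbm accepts factor columns directly, so no step_dummy() is needed.

library(parsnip)
library(bonsai)   # registers the "lightgbm" engine for boost_tree()

spec <- boost_tree(trees = 500) %>%
  set_engine("lightgbm") %>%
  set_mode("regression")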
In a paper under review, I have a very large dataset with a relatively small number of imputations. The reviewer asked me to report how many nodes were in the tree I generated using the CART method within MICE. I don't know why this is important, but after hunting around for a while, my own interest is piqued.
Below is a simple example using this method to impute a single value. How many nodes are in the tree that the missing value is being chosen from? And how many members are in each node?
data(whiteside, package = "MASS")
data <- whiteside
data[1, 2] <- NA                                  # introduce a single missing value
library(mice)
impute <- mice(data, m = 100, method = "cart")    # 100 imputations via CART
impute2 <- complete(impute, "long")               # stack the imputed datasets
I guess whiteside is only used as an example here, so your actual data looks different.
You can't easily get the number of nodes for the trees generated in mice. The first problem is that it isn't just one tree: as the package name says, mice stands for Multivariate Imputation by Chained Equations, which means you are sequentially creating multiple CART trees. Also, each incomplete variable is imputed by a separate model.
From the mice documentation:
The mice package implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model.
If you really want to get the number of nodes for each model used, you would probably have to adjust the mice package itself and add some logging there.
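Alternatively, a hedged sketch that avoids patching the package: mice looks up imputation routines by name (mice.impute.<method>), so a thin wrapper around mice.impute.cart can record each tree's node count. The wrapper name cartlog is made up, and the signature assumes a recent mice version.

library(mice)
library(rpart)

# mice dispatches to any function named mice.impute.<method> on the search
# path, so a user-defined wrapper can log tree sizes per imputation
tree_nodes <- integer(0)
mice.impute.cartlog <- function(y, ry, x, wy = NULL, minbucket = 5, cp = 1e-04, ...) {
  # refit the same kind of tree on the observed cases, just to record its size
  fit <- rpart(y ~ ., data = data.frame(y = y[ry], x[ry, , drop = FALSE]),
               method = if (is.factor(y)) "class" else "anova",
               control = rpart.control(minbucket = minbucket, cp = cp))
  tree_nodes <<- c(tree_nodes, nrow(fit$frame))   # one entry per fitted tree
  mice.impute.cart(y, ry, x, wy = wy, minbucket = minbucket, cp = cp, ...)
}
impute <- mice(data, m = 100, method = "cartlog")
tree_nodes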
Here is a simpler way you might approach this:
Calling impute <- mice(data, m = 100, method = "cart") gives you an S3 object of class mids that contains information about the imputation (but not the number of nodes for each tree).
You can, however, inspect impute$formulas, impute$method, and impute$nmis to see which formulas were used and which variables actually had missing values.
From the mice.impute.cart documentation you can see that mice uses rpart internally for creating the classification and regression trees.
Since the mids object does not contain information about the fitted trees, I'd suggest you use rpart manually with the formula from impute$formulas.
Like this:
library("rpart")
rpart(Temp ~ 0 + Insul + Gas, data = data)
This will print the nodes/tree. It wouldn't really be the tree used in mice; as I said, mice means multiple chained equations, i.e. multiple models after each other, and thus multiple, possibly different, trees after each other (take a look at the algorithm description for the univariate missingness case with CART here: https://stefvanbuuren.name/fimd/sec-cart.html). But it could at least be an indicator of whether applying rpart to your specific data provides a useful model and thus leads to good imputation results.
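To answer the node-count questions for such a manually fitted tree (a hedged follow-up; the $frame component is documented in rpart.object):

library(rpart)

fit <- rpart(Temp ~ 0 + Insul + Gas, data = data)

# fit$frame has one row per node of the tree
nrow(fit$frame)                   # total number of nodes
sum(fit$frame$var == "<leaf>")    # number of terminal (leaf) nodes
fit$frame$n                       # number of observations in each node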
I have a weighting variable that I'd like to apply to my dataset so that I have weighted totals. In SPSS, this is straightforward enough. In R, however, I've been multiplying the variable by the weight variable to create a new variable, as shown in the following example:
https://stats.stackexchange.com/questions/210697/weighting-variable-based-on-another-variable
Is there a more sophisticated way of applying weights in R?
Thanks.
If you need to work with a weighted dataset and define a complex survey sample, you can use the survey package: https://cran.r-project.org/web/packages/survey/survey.pdf
You can then use all sorts of summary statistics once you have defined the weights to be applied.
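A minimal sketch, assuming the df with a weights column from the earlier question: declare the design once, and every summary respects the weights.

library(survey)

# ids = ~1 means no clustering; weights come from the weights column
des <- svydesign(ids = ~1, weights = ~weights, data = df)

svytotal(~Heat, des)   # weighted total
svymean(~Heat, des)    # weighted mean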
However, I would mainly advise this for complex weighted analyses.
Otherwise, there are several other packages dealing with weights, such as questionr.
It all depends on whether you have to do a simple weighted sum or go on to other types of analysis that require more sophisticated methods.
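For the simple case, a hedged one-liner with questionr (again assuming the df from above):

library(questionr)

# Weighted frequency table of Outcome, using the weights column
wtd.table(df$Outcome, weights = df$weights)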