Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I have to calculate the probability distribution function of a random variable that is built from several simple random variables through operations such as sum, division, product, and exponentiation. The expression is fairly complex, so I am more than happy to get a numerical solution.
While I thought this was a very standard thing to do, I was unable to find a framework for it. I would prefer R, but any major language will do.
What I would like, therefore, is a library that allows me to:
i) create numerical random variables from classic distributions
ii) compose them by simple operations (+,-,*,/, exp,min, max,...)
Of course I could work with vectors and use convolutions and the like, but I wanted something more polished.
I am also aware that it is possible to use simulation to create the variables, compose them with the operations, and finally estimate the PDF from a histogram, but again, I would prefer a non-simulation approach.
Try the rv package. Note that if X is an exponential random variable with mean 1, then -log(X) has a standard Gumbel distribution.
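For simple cases, the non-simulation route the question asks about can be sketched in base R. This is only an illustration, not the rv package's API: `conv_density` is a hypothetical helper name, the distributions are chosen so the exact answer is known, and it covers just the sum of two independent variables. The second part checks the -log(X) statement above on the CDF.

```r
# Density of Z = X + Y for independent X and Y, by numerical convolution:
#   f_Z(z) = integral over x of f_X(x) * f_Y(z - x)
# `conv_density` is a hypothetical helper, not part of any package.
conv_density <- function(dx, dy, lower = -Inf, upper = Inf) {
  function(z) {
    vapply(z, function(zi) {
      integrate(function(x) dx(x) * dy(zi - x), lower, upper)$value
    }, numeric(1))
  }
}

# Example: X ~ N(0, 1) and Y ~ N(1, 2), so X + Y ~ N(1, sqrt(5)) exactly.
d_sum <- conv_density(function(x) dnorm(x, 0, 1),
                      function(y) dnorm(y, 1, 2))
z <- c(-2, 0, 1, 3)
numeric_pdf <- d_sum(z)
exact_pdf   <- dnorm(z, mean = 1, sd = sqrt(5))
print(cbind(z, numeric_pdf, exact_pdf))

# Transformation check: for X ~ Exp(1),
#   P(-log(X) <= t) = P(X >= exp(-t)),
# which should equal the standard Gumbel CDF exp(-exp(-t)).
t_grid <- c(-1, 0, 0.5, 2)
cdf_neg_log_exp <- pexp(exp(-t_grid), rate = 1, lower.tail = FALSE)
cdf_gumbel      <- exp(-exp(-t_grid))
print(cbind(t_grid, cdf_neg_log_exp, cdf_gumbel))
```

Products, quotients, min and max each need their own integral or CDF identity, so a sketch like this gets tedious quickly for long expressions.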
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I have a set of customers with different kinds of attributes: continuous, categorical, binary, and ordinal.
How can I cluster them, given that the same distance metric cannot be applied to all of these attribute types?
Thank you in advance
As mentioned already, daisy (from the cluster package) is an option; it automatically picks a suitable dissimilarity measure based on the data types. But I would suggest the following approach, and I would ask the experts to chime in.
Rather than relying on automatic selection, first identify and remove correlated variables, for example with:
Pearson correlation: for continuous variables
Chi-square test: for categorical variables
One-way ANOVA: for categorical vs. numerical variables, etc.
Taking the subset of useful variables, consider one-hot encoding the categorical variables and perhaps converting the ordinal ones to continuous (or treating them as categorical and one-hot encoding them too). Then test different distance metrics, such as Euclidean and Manhattan, and compare the results. This way you will get a clearer picture of the overall clustering process.
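The encoding and distance comparison described above might look roughly like this in base R. The `customers` data frame, its column names, and the choice of average-linkage hierarchical clustering with three clusters are all made up for illustration.

```r
# Toy customer data (entirely hypothetical) with mixed attribute types.
set.seed(1)
n <- 30
customers <- data.frame(
  age          = round(rnorm(n, mean = 40, sd = 10)),             # continuous
  spend        = rexp(n, rate = 1 / 200),                         # continuous
  segment      = factor(rep(c("retail", "online", "wholesale"),
                            length.out = n)),                     # categorical
  is_member    = rep(c(0, 1), length.out = n),                    # binary
  satisfaction = factor(rep(c("low", "mid", "high", "mid", "low"),
                            length.out = n),
                        levels = c("low", "mid", "high"),
                        ordered = TRUE)                           # ordinal
)

# Quick redundancy check between the two continuous variables.
num_cor <- cor(customers$age, customers$spend)

# Ordinal -> integer scores 1, 2, 3; categorical -> one-hot columns.
encoded <- customers
encoded$satisfaction <- as.integer(encoded$satisfaction)
X <- model.matrix(~ . - 1, data = encoded)  # "- 1" keeps all levels of `segment`
X <- scale(X)                               # put columns on a comparable scale

# Cluster with a given distance metric and compare the assignments.
cluster_with <- function(X, metric, k = 3) {
  cutree(hclust(dist(X, method = metric), method = "average"), k = k)
}
cl_euclidean <- cluster_with(X, "euclidean")
cl_manhattan <- cluster_with(X, "manhattan")
print(table(cl_euclidean, cl_manhattan))
```

The cross-table shows how much the two metrics agree; large disagreement is a hint that the result depends heavily on the encoding and scaling choices.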
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
One task of Machine Learning / Data Science is making predictions. But I also want to get more insight into the variables of my model.
To get more insights, I tried different methods:
Logistic Regression (The output provides some 'insights' in the influence of the different variables, see: Checking interpretation of GLM summary in R)
The xgb.plot.importance function applied on a Boosting Tree, see picture below (applied on the Titanic Data Set).
And I saw a great article (but unfortunately, it is not working) how to explain a boosting tree (see: https://medium.com/applied-data-science/new-r-package-the-xgboost-explainer-51dd7d1aa211).
My question: are there other methods to give yourself (or even better: the business) more insight into which variables have an influence on the target variable? And of course: is the influence positive or negative, and how big is it?
You could also try lasso regression (https://stats.stackexchange.com/questions/17251/what-is-the-lasso-in-regression-analysis), which basically selects the variables that influence the response variable the most.
The glmnet package provides support for this type of regression.
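To show what the lasso does to the coefficients, here is a small hand-written coordinate-descent version in base R. It is a teaching sketch and not glmnet's implementation or API: `lasso_cd` and `soft_threshold` are hypothetical names, the data are simulated, and the penalty value is picked by hand rather than by cross-validation.

```r
# Soft-thresholding operator used by the lasso coordinate update.
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Minimizes (1 / (2n)) * ||y - X b||^2 + lambda * ||b||_1 on standardized
# predictors by cyclic coordinate descent (fixed number of sweeps).
lasso_cd <- function(X, y, lambda, n_iter = 100) {
  X <- scale(X)       # standardize predictors
  y <- y - mean(y)    # center the response, so no intercept is needed
  n <- nrow(X)
  beta <- numeric(ncol(X))
  for (it in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      # Residual with predictor j left out of the fit.
      r_j <- y - as.vector(X[, -j, drop = FALSE] %*% beta[-j])
      rho <- sum(X[, j] * r_j) / n
      beta[j] <- soft_threshold(rho, lambda) / (sum(X[, j]^2) / n)
    }
  }
  setNames(beta, colnames(X))
}

# Simulated data: only x1 (positive) and x2 (negative) drive the response.
set.seed(1)
n <- 200
X <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("x", 1:5)))
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(n)

beta <- lasso_cd(X, y, lambda = 0.5)
print(round(beta, 3))
```

The sign of each surviving coefficient gives the direction of the influence and, because the predictors are standardized, the magnitudes are roughly comparable. Keep in mind that the lasso shrinks the estimates toward zero, so they understate the true effect sizes.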
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
SAS MODEL Statement introduction
INCLUDE=n
forces the first n independent variables listed in the MODEL statement to be included in all models. The selection methods are performed on the other variables in the MODEL statement. The INCLUDE= option is not available with SELECTION=NONE.
I think you will find that R users are mostly averse (on solid theoretical grounds) to mimicking SAS's stepwise regression functions. However, the scope argument of step() accepts a list with 'upper' and 'lower' components, so you should probably first read the ?step help page and then create a value for 'lower'.
scope
defines the range of models examined in the stepwise search. This should be either a single formula, or a list containing components upper and lower, both formulae. See the details for how to specify the formulae and how they are used.
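A minimal sketch of how 'lower' can play the role of INCLUDE=n, using the built-in mtcars data purely as a stand-in. The choice of wt and hp as the forced variables is arbitrary.

```r
# Full model on the built-in mtcars data (stand-in for the real problem).
full <- lm(mpg ~ wt + hp + qsec + drat + disp, data = mtcars)

# `lower` lists the terms that must stay in every candidate model;
# `upper` is the largest model the search may reach.
selected <- step(
  full,
  scope = list(lower = ~ wt + hp,
               upper = ~ wt + hp + qsec + drat + disp),
  direction = "both",
  trace = 0            # suppress the step-by-step printout
)

kept <- attr(terms(selected), "term.labels")
print(kept)
```

Whatever else the search drops, wt and hp remain in `kept`, which is the behaviour INCLUDE=2 would give for the first two variables in the SAS MODEL statement.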
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
My current linear model is: fit<-lm(ES~Area+Anear+Dist+DistSC+Elevation)
I have been asked to further this by:
Fit a linear model for ES using the five explanatory variables and
include up to quadratic terms and first order interactions (i.e. allow
Area^2 and Area*Elevation, but don't allow Area^3 or
Area*Elevation*Dist).
From my research I can do +I(Area^2) and +(Area*Elevation) but this would make a huge list.
Assuming I am understanding the question correctly, I would be adding 5 squared terms and 10 interaction (*) terms to the 5 main effects, giving 20 terms in total. Or do I not need all of these?
Is that really the most efficient way of going about it?
EDIT:
Note that I am planning on carrying out a stepwise regression for the null model and the full model after. I am seemingly having trouble with this when using poly.
Look at ?formula to further your education:
fit<-lm( ES~ (Area+Anear+Dist+DistSC+Elevation)^2 )
Those will not be squared terms but rather part of what you were asked to provide: all the two-way interactions (plus the main effects). Formula "mathematics" is different from the regular use of powers; in a formula, ^2 means crossing the terms up to second order. To add the squared terms in a manner that allows proper statistical interpretation, use poly:
fit<-lm( ES~ (Area+Anear+Dist+DistSC+Elevation)^2 +
poly(Area,2) +poly(Anear,2)+ poly(Dist,2)+ poly(DistSC,2)+ poly(Elevation,2) )
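To see how the formula expands, the term counts can be checked on simulated data. The variable names come from the question, but the values below are made up. This sketch uses I(x^2) for the squared terms instead of poly, which is one way to keep the full model free of aliased columns for the stepwise search mentioned in the EDIT; the last two lines show the aliasing that appears when a main effect and poly() of the same variable are both in the model.

```r
# Simulated stand-in for the real data (values are arbitrary).
set.seed(42)
n <- 100
d <- data.frame(Area = runif(n), Anear = runif(n), Dist = runif(n),
                DistSC = runif(n), Elevation = runif(n))
d$ES <- 1 + d$Area + 0.5 * d$Area^2 + d$Area * d$Elevation +
  rnorm(n, sd = 0.1)

# Main effects plus all two-way interactions: 5 + 10 = 15 terms.
f_int <- ES ~ (Area + Anear + Dist + DistSC + Elevation)^2
n_int <- length(attr(terms(f_int), "term.labels"))    # 15

# Adding the five squared terms with I(): 20 terms.
f_full <- ES ~ (Area + Anear + Dist + DistSC + Elevation)^2 +
  I(Area^2) + I(Anear^2) + I(Dist^2) + I(DistSC^2) + I(Elevation^2)
n_full <- length(attr(terms(f_full), "term.labels"))  # 20

fit_full <- lm(f_full, data = d)
fit_null <- lm(ES ~ 1, data = d)

# Stepwise search between the null model and the full model.
sel <- step(fit_null,
            scope = list(lower = ~ 1, upper = formula(fit_full)),
            direction = "both", trace = 0)

# Area and the linear column of poly(Area, 2) span the same space,
# so one coefficient cannot be estimated and is reported as NA.
fit_alias <- lm(ES ~ Area + poly(Area, 2), data = d)
n_aliased <- sum(is.na(coef(fit_alias)))
```

So the count of 20 terms in the question is right, and the formula interface saves you from typing the 10 interactions by hand.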
Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
I am running simple linear models (Y~X) in R where my predictor is a categorical variable (0-10). However, this variable is not normally distributed, and none of the available transformation techniques are helpful (e.g. log, sq etc.), as the data is not negatively or positively skewed but rather all over the place. I am aware that for lm the outcome variable (Y) has to be normally distributed, but is this also required for predictors? If yes, any suggestions on how to do this would be more than welcome.
Also, as the data I am looking at has two groups, patients vs controls (I am interested in group differences, as you can guess), do I have to look at whether the data is normally distributed within the two groups or overall across the two groups?
Thanks.
See @Roman Luštrik's comment above: it does not matter how your predictors are distributed (except for problems with multicollinearity). What is important is that the residuals be normal (and with homogeneous variances).
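A rough way to check this on simulated data, where the predictor is deliberately far from normal. The data, the group effect, and the choice of Shapiro-Wilk and Bartlett tests are assumptions made for the sketch; a Q-Q plot of the residuals (qqnorm) is a common visual alternative.

```r
# Simulated data: a 0-10 predictor with an irregular distribution and two groups.
set.seed(123)
n <- 200
group <- factor(rep(c("control", "patient"), each = n / 2))
x <- sample(0:10, n, replace = TRUE,
            prob = c(6, 1, 1, 5, 1, 1, 1, 4, 1, 1, 6))  # weights, rescaled by sample()
y <- 2 + 0.5 * x + 1.5 * (group == "patient") + rnorm(n)

fit <- lm(y ~ x + group)
res <- residuals(fit)

# Normality is assessed on the residuals of the fitted model,
# not on x and not on y within each group.
normality <- shapiro.test(res)

# Homogeneity of residual variance across the two groups.
variances <- bartlett.test(res ~ group)

print(normality$p.value)
print(variances$p.value)
```

Since the group term is already in the model, the residuals are checked once for the whole fit rather than separately for patients and controls.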