How to identify the distribution of the given data using R [closed]

I have the data below and I need to identify its distribution. Please help.
x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40)

A neat approach is to use the fitdistrplus package, which provides tools for distribution fitting. An example with your data:
library(fitdistrplus)
descdist(x, discrete = FALSE)  # Cullen and Frey graph: skewness-kurtosis plot
Now you can attempt to fit different distributions. For example:
normal_dist <- fitdist(x, "norm")
and subsequently inspect the fit:
plot(normal_dist)
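To compare several candidates numerically rather than one at a time, a minimal sketch using gofstat() from the same package (the distribution choices here are only examples):
# Fit a few candidate distributions and compare goodness-of-fit statistics
normal_dist <- fitdist(x, "norm")
logis_dist  <- fitdist(x, "logis")
gamma_dist  <- fitdist(x, "gamma")
gofstat(list(normal_dist, logis_dist, gamma_dist),
        fitnames = c("normal", "logistic", "gamma"))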
As a general point, I would suggest you have a look at this discussion on Cross Validated, where the subject is discussed at length. You may also want to look at the paper by Delignette-Muller and Dutang, fitdistrplus: An R Package for Fitting Distributions (available here), if you are interested in a more detailed explanation of how to use the Cullen and Frey graph.

The first thing you can do is plot the histogram and overlay the density:
hist(x, freq = FALSE)
lines(density(x))
Then you see that the distribution is bimodal, so it could be a mixture of two distributions, or something else entirely.
Once you have identified a candidate distribution, a Q-Q plot can help you visually compare the quantiles.
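For instance, a minimal base-R sketch for checking a normal candidate (for other candidates, qqplot() against theoretical quantiles works the same way):
# Compare sample quantiles against the normal reference distribution
qqnorm(x)
qqline(x)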

Related

Calculating AWE from mclust package [closed]

Is it possible to calculate the Approximate Weight of Evidence (AWE) from information obtained via the mclust R package?
According to the R documentation, you should have access to the function awe(tree, data) since version R1.1.7.
From the example on the linked page (reproduced here in case the link breaks):
data(iris)
iris.m <- iris[,1:4]
awe.val <- awe(mhtree(iris.m), iris.m)
plot(awe.val)
Following the formula from Banfield, J. and Raftery, A. (1993), Model-based Gaussian and non-Gaussian clustering, Biometrics, 49, 803-821:
AWE = -2*model$loglik + model$d*(log(model$n) + 1.5)
where model is the fitted model with the selected number of cluster solutions. Keeping this question up in the hope that it may help someone in the future.
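A minimal sketch of that computation with a current mclust, assuming its Mclust object (whose loglik, df, and n components play the roles of model$loglik, model$d, and model$n above):
library(mclust)
fit <- Mclust(iris[, 1:4])  # default model search over numbers of clusters
# AWE per Banfield & Raftery (1993): -2*logLik + d*(log(n) + 1.5)
awe <- -2 * fit$loglik + fit$df * (log(fit$n) + 1.5)
awe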

Sampling weights for subpopulations in R [closed]

I'm working with a large, national survey that was collected using complex survey methods. As such, I need to account for sample weights and other survey design features (e.g., sampling strata). I'm new to this methodology, so apologies if the answers here are obvious.
I've had success running path analysis models using the 'lavaan' package paired with the 'lavaan.survey' package. However, some of my models involve only a subset of the data (e.g., only female participants).
How can I adjust the sample weights to reflect the fact that I am only analyzing a subsample (e.g., females)?
The subset() function in the survey package handles subpopulations correctly, and since lavaan.survey uses the survey package to get the basic standard errors for the population covariance matrix, it should all flow through properly.
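A minimal sketch, assuming hypothetical design variables psu, stratum, and wt and a sex column in your data frame dat (all names are placeholders for your survey's actual variables):
library(survey)
library(lavaan.survey)
# Declare the full design first; do not subset the raw data frame
des <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt,
                 data = dat, nest = TRUE)
# subset() on the design object keeps the full design information,
# so subpopulation standard errors are estimated correctly
des_female <- subset(des, sex == "female")
# Fit the lavaan model as usual, then re-estimate it with the subset design
fit <- lavaan::sem(model_syntax, data = dat)  # model_syntax: your lavaan model string
fit_female <- lavaan.survey(fit, des_female)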

R Stats Coding questions (fitting log onto normal model) [closed]

Hey guys, I'm having trouble with R and figuring out which code to use (I'm a newbie):
So I have some data and I've performed a simple linear regression (SLR); however, I found that this doesn't give me the relationship I want.
So I performed another SLR using the log (base e) of both variables, and it seems to be a better fit.
My question is: what code do I use to plot the log SLR fit onto my original data (as a smooth curve), i.e., how do I map the fitted model from the log scale back onto my original data?
From what I understand I use the following:
1) plot(x,y) <-- Original data
2) lines(fitted(exp(____))) <-- Not sure about what to do here
If you guys could help me that would be great!
Welcome to StackOverflow!
To get better help around here, I recommend you read and follow How to make a great R reproducible example? the next time you have a question. This way you can increase the quality of your question significantly. And while creating a reproducible example, one often solves the problem on one's own.
Anyway, I created a sample model that I assume fits your situation quite well:
sampledata <- data.frame(x=1:50, y=1.1^(1:50+rnorm(50)))
model <- lm(log(y) ~ x, sampledata)
summary(model)
Plotting the model using R's default tools is now easy:
plot(sampledata$x, sampledata$y)
lines(sampledata$x, exp(predict(model)))
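If your x values are few or unevenly spaced, predicting over a fine grid keeps the back-transformed curve smooth; a small sketch building on the model above:
# Predict on a dense grid so the exponentiated fit plots as a smooth curve
grid <- data.frame(x = seq(min(sampledata$x), max(sampledata$x), length.out = 200))
lines(grid$x, exp(predict(model, newdata = grid)), col = "red")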

gblinear xgboost in R [closed]

Let's say a dataset has both numeric and categorical features, and I've created an xgboost model using gblinear. I've analyzed the xgboost model with xgb.importance; how can I express the categorical variables' weights?
While XGBoost is often considered a black-box model, you can understand the feature importance (for both categorical and numeric features) by averaging the gain of each feature across all splits and all trees.
The code below produces a graph of this.
# Get the feature real names
names <- dimnames(trainMatrix)[[2]]
# Compute feature importance matrix
importance_matrix <- xgb.importance(names, model = bst)
# Nice graph
xgb.plot.importance(importance_matrix[1:10,])
In the resulting feature importance plot, we can see the 10 most important features.
The function also gives a color to each bar: a k-means clustering is applied to group features by importance.
Alternatively, this could be represented in a tree diagram (see the link above).
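For the gblinear case specifically, a minimal sketch, assuming the classic xgboost R interface and a hypothetical data frame df with a numeric outcome y: one-hot encoding gives every factor level its own column, and hence its own weight.
library(xgboost)
library(Matrix)
# sparse.model.matrix() one-hot encodes each factor level into its own column
X <- sparse.model.matrix(y ~ . - 1, data = df)
bst <- xgboost(data = X, label = df$y, booster = "gblinear",
               nrounds = 10, objective = "reg:squarederror")
# With gblinear there are no trees or gains: xgb.importance() reports the
# coefficient weight of every encoded column, one per categorical level
imp <- xgb.importance(feature_names = colnames(X), model = bst)
head(imp)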

Should Categorical predictors within a linear model be normally distributed? [closed]

I am running simple linear models (Y ~ X) in R where my predictor is a categorical variable (0-10). However, this variable is not normally distributed, and none of the available transformation techniques are helpful (e.g., log, sqrt, etc.), as the data is not negatively/positively skewed but rather all over the place. I am aware that for lm the outcome variable (Y) has to be normally distributed, but is this also required for predictors? If yes, any suggestions of how to do this would be more than welcome.
Also, as the data I am looking at has two groups, patients vs controls (I am interested in group differences, as you can guess), do I have to look at whether the data is normally distributed within the two groups or overall across the two groups?
Thanks.
See @Roman Luštrik's comment above: it does not matter how your predictors are distributed (except for problems with multicollinearity). What is important is that the residuals be normal (and have homogeneous variances).
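A minimal sketch of the residual checks this implies, assuming a data frame dat with outcome Y and predictor X (placeholder names):
fit <- lm(Y ~ X, data = dat)
par(mfrow = c(1, 2))
# Normality and spread of the residuals, not of the predictors, is what matters
hist(resid(fit), main = "Residuals", xlab = "Residual")
qqnorm(resid(fit)); qqline(resid(fit))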
