Correlations, Scatter Plots and P-Value [closed] - r

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I have a data set from a customer survey (for a shoe company). Two of the columns are GENDER and INCOME. I am supposed to test whether there is a significant difference in income between genders, and give the corresponding p-value.
I'm still a n00b when it comes to R. I'm still learning, and I've been struggling for 3 days now to find the functions to do this. Does anyone have a lead, or could anyone help me with it? That would be awesome.

I am editing this because I realized my other answer was not correct.
What you want is a linear model.
Say you have
GENDER <- factor(c(0, 1, 1, 0, 1))
INCOME <- c(20000, 30000, 40000, 50000, 550000)
Then you want
model <- lm(INCOME ~ GENDER)
and
summary(model)
anova(model)
will give you the information you are after.
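In case it helps, here is a minimal sketch of pulling the p-value out directly, using the same toy values as above (they are made up, not your survey data). With a two-level factor, the linear model is equivalent to an equal-variance two-sample t-test, so `t.test()` with `var.equal = TRUE` should report the same p-value. The object names `p_lm`, `p_t` and `p_anova` are just mine.

```r
# Toy data (hypothetical values, not real survey data)
GENDER <- factor(c(0, 1, 1, 0, 1))
INCOME <- c(20000, 30000, 40000, 50000, 550000)

model <- lm(INCOME ~ GENDER)

# p-value for the GENDER effect, taken from the coefficient table
p_lm <- summary(model)$coefficients["GENDER1", "Pr(>|t|)"]

# The same p-value from the ANOVA table (first row is GENDER)
p_anova <- anova(model)[["Pr(>F)"]][1]

# Equal-variance two-sample t-test on the same data
p_t <- t.test(INCOME ~ GENDER, var.equal = TRUE)$p.value

print(c(lm = p_lm, anova = p_anova, t_test = p_t))
```

If the equal-variance assumption looks doubtful for your real data, the default `t.test(INCOME ~ GENDER)` (Welch) is an alternative, but its p-value will generally differ from the `lm()` one.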
Good luck,
Bryan


Looking for resources for help modeling logistic regression with an event/trial syntax in R? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 days ago.
I am modeling bird nesting success, and my advisor wants me to use event/trials syntax in R to model the number of eggs that hatched vs. the total number of eggs per nest (i.e. the events/trials) against a variety of predictor variables, essentially using the logistic regression format.
This is totally new to me, so any online resources or code help would be incredibly useful! Thank you!
Haven't tried much yet, can't find the resources! I can only find info for SAS.
When specifying a logistic regression with glm(), the response can be specified in several ways. One is as a two-column matrix, where the first column is the number of successes and the second is the number of failures.
So if you have two variables, total_eggs and hatched, try
mod <- glm( cbind(hatched, total_eggs-hatched) ~ x + ...,
data=your_data_frame, family=binomial)
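Here is a small runnable sketch of that call. The data frame `nests`, the predictor `canopy`, and all of the numbers are invented for illustration only. It also shows a second way to specify the same model: the proportion hatched as the response, with the number of trials passed as `weights`. Both forms should give the same coefficients.

```r
# Hypothetical nest data: one row per nest
nests <- data.frame(
  total_eggs = c(4, 5, 3, 6, 4, 5, 3, 4),
  hatched    = c(1, 2, 2, 5, 1, 4, 1, 3),
  canopy     = c(0.1, 0.3, 0.5, 0.7, 0.2, 0.9, 0.4, 0.6)  # made-up predictor
)

# Events/trials form: two-column matrix of (successes, failures)
mod <- glm(cbind(hatched, total_eggs - hatched) ~ canopy,
           data = nests, family = binomial)

# Equivalent form: proportion as response, trials as prior weights
mod_w <- glm(hatched / total_eggs ~ canopy, weights = total_eggs,
             data = nests, family = binomial)

print(coef(mod))
print(coef(mod_w))
```

`summary(mod)` then gives the usual coefficient table on the log-odds scale.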

Binning categorical variables in SPSS [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I have two independent variables, both categorical. One is the seriousness of the sickness, in degrees. The other is the type of surgery. My dependent variable is the number of recovery days after the surgery. Because the recovery days are not normally distributed, I have to use a non-parametric method.
So I need to combine the independent variables into a new variable: 1: sickness1Surgery1, 2: sickness1Surgery2, 3: sickness2Surgery1, 4: sickness2Surgery2. That way I will be able to test it.
I have checked YouTube, but the videos are all about how to bin scale variables into categories.
If you provide more details about the structure of your data (preferably with some sample data), we could provide better-suited code. Still, the basic idea should be the same:
if sickness=1 and surgery=1 combinedVar=1.
if sickness=2 and surgery=1 combinedVar=2.
if sickness=1 and surgery=2 combinedVar=3.
if sickness=2 and surgery=2 combinedVar=4.
value labels combinedVar
1 "sickness=1, surgery=1"
2 "sickness=2, surgery=1"
3 "sickness=1, surgery=2"
4 "sickness=2, surgery=2".
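For anyone doing the same thing in R rather than SPSS, a rough equivalent is sketched below. The data frame `dat`, the column `recovery_days`, and all values are hypothetical. `interaction()` builds the combined factor, and `kruskal.test()` is one possible non-parametric test across the resulting groups (an assumption on my part, since the question does not name a specific test).

```r
# Hypothetical data: two categorical predictors and recovery days
dat <- data.frame(
  sickness      = factor(c(1, 1, 2, 2, 1, 2, 1, 2)),
  surgery       = factor(c(1, 2, 1, 2, 1, 1, 2, 2)),
  recovery_days = c(5, 9, 7, 14, 6, 8, 10, 12)
)

# One level per sickness/surgery combination, e.g. "1_1", "2_1", ...
dat$combinedVar <- interaction(dat$sickness, dat$surgery, sep = "_")

# Kruskal-Wallis rank sum test across the combined groups
kw <- kruskal.test(recovery_days ~ combinedVar, data = dat)

print(table(dat$combinedVar))
print(kw)
```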

Interpretation of ACF plot [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I need help interpreting an ACF plot that shows a sine-wave pattern.
You may need to examine the PACF. You have a large peak at the first lag, followed by a decaying wave that alternates between positive and negative correlations. This can indicate a higher-order autoregressive term in the data.
Use the partial autocorrelation function to determine the order of the autoregressive term.
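To see what this pattern looks like, here is a small simulation sketch. It assumes an AR(2) process with coefficients 1.0 and -0.6, which I picked only because they give a damped sine wave in the ACF; your series may be something else entirely. For an AR(2) process the PACF should cut off after lag 2, with the lag-2 value close to the second coefficient.

```r
set.seed(123)

# Simulated AR(2) series (coefficients are an assumption for illustration)
x <- arima.sim(model = list(ar = c(1.0, -0.6)), n = 1000)

# Sample ACF and PACF; drop plot = FALSE to draw the usual plots
acf_vals  <- acf(x, lag.max = 20, plot = FALSE)
pacf_vals <- pacf(x, lag.max = 20, plot = FALSE)

# ACF oscillates and decays; note acf_vals$acf[1] is lag 0
print(round(acf_vals$acf[2:11], 2))

# PACF: lags 1 and 2 stand out, later lags should be near zero
print(round(pacf_vals$acf[1:6], 2))
```

On real data, the number of leading PACF lags that clearly stand outside the confidence band is a reasonable first guess for the AR order.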

Qualitative data analysis using data mining techniques [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I have responses from 22 companies to 22 questions/parameters, stored in a 22x22 matrix. I applied a clustering technique, which gives me groups of similar companies.
Now I would like to find correlations between the parameters and the companies' preferences. Which technique is more suitable in R?
Normally we build a Bayesian network to find a graphical relationship between different parameters from data. As this data is very limited, how can I build a Bayesian network for it?
Any suggestions for analyzing this data?
Try looking at feature selection and feature importance in R; it's simple.
This could get you started: http://machinelearningmastery.com/feature-selection-with-the-caret-r-package/
Some good packages: https://cran.r-project.org/web/packages/FSelector/FSelector.pdf, https://cran.r-project.org/web/packages/varSelRF/varSelRF.pdf
This is a good SE question with good answers: https://stats.stackexchange.com/questions/56092/feature-selection-packages-in-r-which-do-both-regression-and-classification
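As a simpler first look before any of those packages, pairwise correlations between the parameters can be computed in base R. The sketch below uses a randomly generated 22x22 matrix as a stand-in (the 1-5 response scale and the row/column names are assumptions), and Spearman correlation as one reasonable choice for ordinal answers. With only 22 rows, the estimates will be noisy.

```r
set.seed(42)

# Hypothetical 22 x 22 matrix: rows = companies, columns = parameters
responses <- matrix(
  sample(1:5, 22 * 22, replace = TRUE), nrow = 22,
  dimnames = list(paste0("company", 1:22), paste0("q", 1:22))
)

# Rank correlation between every pair of parameters (columns)
param_cor <- cor(responses, method = "spearman")

# Find the most strongly correlated pair, ignoring the diagonal
off_diag <- param_cor
diag(off_diag) <- NA
strongest <- which(abs(off_diag) == max(abs(off_diag), na.rm = TRUE),
                   arr.ind = TRUE)[1, ]

print(colnames(responses)[strongest])
print(round(off_diag[strongest[1], strongest[2]], 2))
```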

Machine learning - Calculating the importance of a "value" in a variable [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I’m analyzing a medical dataset containing 15 variables and 1.5 million data points. I would like to predict hospitalization and, more importantly, which type of medication may be responsible. The medicine variable has around 700 types of drugs. Does anyone know how to calculate the importance of a "value" (type of drug in this case) in a variable for boosting? I need to know if ‘drug A’ is better for prediction than ‘drug B’, both in a variable called ‘medicine’.
The logistic regression model is able to give such information in terms of p-values for each drug, but I would like to use a more complex method. Of course you can create a binary variable for each type of drug, but this gives 700 extra variables and does not seem to work very well. I’m currently using R. I really hope you can help me solve this problem. Thanks in advance! Kind regards, Peter
See varImp() in the caret package, which supports all the ML algorithms you referenced.
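For reference, here is a minimal base-R sketch of the per-drug p-values from logistic regression that the question mentions. The data are simulated: three made-up drug levels instead of 700, and an assumed effect for "drugB". A factor predictor is expanded into one indicator per level automatically, so the coefficient table has one row per non-reference drug.

```r
set.seed(7)
n <- 600

# Simulated data (hypothetical drug names and effect sizes)
medicine <- factor(sample(c("drugA", "drugB", "drugC"), n, replace = TRUE))
prob <- plogis(-1 + 1.5 * (medicine == "drugB"))  # assumed: drugB raises risk
hospitalized <- rbinom(n, size = 1, prob = prob)

fit <- glm(hospitalized ~ medicine, family = binomial)

# One row per drug level relative to the reference level (drugA)
level_table <- summary(fit)$coefficients
print(round(level_table, 4))

# If caret is installed, caret::varImp(fit) can be called on this model
# to get a per-term importance score.
```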
