I need to find all predictors(p-value < 0.05) from my dataset using loops. Is there any way to do it? - r

I am new to R and I am using glm() function to fit a logistic model. I have 5 columns. I need to find all possible predictors using a loop based on their p-values(less than 0.05).
My dataset has 40,000 entries which contains numerical and categorical variables and it looks more or less like this:
"Age" "Sex" "Occupation" "Education" "Income"
50 Male Farmer High School False
30 Female Maid High School False
25 Male Engineer Graduate True
The target variable "Income" denotes whether the person earns more or less than 30K: True means they earn more than 30K, and False means they do not. I would like to use loops to find the predictor variables that can be used to predict the target. Also, can I find the best 3 predictors based on their p-values?
Thanks in Advance!

If I understood your question correctly, you are looking for a way to fit a univariable model for each predictor in your data frame. (I am not sure whether you also want to test every combination of these variables, including interactions between them.)
My suggestion is to use purrr::map function and create list for every column. Check the following example based on your information:
library(purrr)
## Sample data (simulated; Income is the 4th column)
df <- data.frame(
  Age = rnorm(n = 40000,
              mean = mean(c(50, 30, 25)),
              sd = sd(c(50, 30, 25))),
  Occupation = sample(x = c("Farmer", "Maid", "Engineer"),
                      size = 40000,
                      replace = TRUE),
  Education = sample(x = c("High School", "Graduate", "UnderGraduate"),
                     size = 40000,
                     replace = TRUE),
  Income = as.logical(rbinom(40000, 1, 0.5))
)
## Split dataframe into lists
list_df <- Map(cbind, split.default(df[-4], names(df)[-4]))
list_df <- lapply(list_df, cbind, "target" = df[4])
## Use map to fit a model for each list
list_models <- map(.x = list_df,
.f = ~glm(Income ~ ., data = .x, family = binomial))
You can access each model with list_models[[i]] (double brackets; list_models[i] returns a one-element list rather than the model itself).
Now for the second part of your question, concerning p-values. Every project is different, and so are its metrics, so I suggest you double-check how you are using p-values. Granted, they are important, but a p-value only tells you how likely a result at least as extreme as the observed one would be if the null hypothesis were true, and the threshold you compare it against (0.05 here) depends on context. It is a fundamental tool for statistical quality and decision making (not only for the t-test, but for the F-test and so on). But for ranking? Hmm, I would say that is a little odd. Just saying :)
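If you still want one p-value per predictor, here is a minimal sketch of how that could be pulled out in a loop. It is not part of the code above: the data are simulated, the object names (predictors, p_values, significant, top3) are made up for illustration, and I am assuming a likelihood-ratio test via anova() so that a categorical predictor with several levels yields a single p-value instead of one per level.

```r
# Sketch only: simulated data standing in for the real 40,000-row dataset
set.seed(1)
n <- 1000
df <- data.frame(
  Age        = rnorm(n, mean = 35, sd = 13),
  Occupation = sample(c("Farmer", "Maid", "Engineer"), n, replace = TRUE),
  Education  = sample(c("High School", "Graduate", "UnderGraduate"), n, replace = TRUE),
  Income     = as.logical(rbinom(n, 1, 0.5))
)

# Every column except the target is a candidate predictor
predictors <- setdiff(names(df), "Income")

# Fit one univariable logistic model per predictor and keep the
# likelihood-ratio p-value for that predictor (row 2 of the anova table;
# row 1 is the null model)
p_values <- sapply(predictors, function(v) {
  fit <- glm(reformulate(v, response = "Income"), data = df, family = binomial)
  anova(fit, test = "Chisq")[2, "Pr(>Chi)"]
})

# Predictors below the 0.05 threshold, and the three smallest p-values
significant <- names(p_values)[p_values < 0.05]
top3 <- head(sort(p_values), 3)
print(p_values)
```

Keep in mind that screening on univariable p-values ignores how the predictors relate to each other, so treat the result as a first look rather than a final variable selection.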

Related

Binary outcome, different trial #s across low N?

I have a sample of 4 individuals, all who have a varying number of trials (I work with a special population so what I get is what I get!)
The outcome is a binary yes/no
I want to know:
did the total sample select yes more often than chance?
did each individual select yes more often than chance?
Here is dummy data in R.
SbjEL <- data.frame(Sbj = c('EL'),
TrialNum = c(1:12),
Choice = c(0,0,1,1,1,1,1,1,1,1,1, NA))
SbjKZ <- data.frame(Sbj = c('KZ'),
TrialNum = c(1:12),
Choice = c(0,1,1,1,1,1,1,1,1,1,1, 1))
SbjMA <- data.frame(Sbj = c('MA'),
TrialNum = c(1:12),
Choice = c(0,0,1,1,1,1,1,1,1,1,1, 1))
SbjTC <- data.frame(Sbj = c('TC'),
TrialNum = c(1:12),
Choice = c(1,1,1,1,1,1,1,1, NA,NA,NA, NA))
For a different experiment with the same sample, I had more trials and did a one sample t test for the sample, and a binomial distribution to see what # of trials of Yes would be higher than chance.
# Did group select YES more than chance? --> 43 yes/48
Response_v <- c(21,22)
t.test(Response_v, mu = 12, alternative = "two.sided")
# How many YES selections would be more often than chance?
# 24 trials were completed --> 21 yes / 24
binom.test(21, 24, 1/2)
My issue is this starts to fall apart when I get down to 8-12 trials.
Any ideas? I am lost
A t-test is not appropriate here for either Q1 or Q2. With large samples you can use some approximations, but your counts are very small. So, you’re on the right track with the binomial test, but not the t-test.
For your Q1: you first ought to decide how the subjects are assumed to relate to each other. Are you pretty confident that each is providing an estimate of the same Bernoulli probability, p? Or instead, a-priori do you want to allow the possibility that subjects have different p’s? There are further questions to answer, overlapping with those you need to consider for Q2.
For your Q2: The exact method of choice depends on a number of things: For example, do you want to incorporate prior information (e.g. using historical data as a reference)? If not, there are purely frequentist methods to use off the shelf. Next, do you expect the yes/no’s to be independent, or are they more like a ‘signal’ in which the order matters? Third, is it possible that there is a mixture of Bernoullis for any of the subjects? And so on. These questions can be considered through software such as that found at www.datatrie.com/advisor
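As a rough starting point, the off-the-shelf frequentist route could look like the sketch below, using the dummy data from the question. The assumptions are mine: trials are independent, chance is 0.5, the alternative is one-sided ("more often than chance"), and the pooled test additionally assumes all four subjects share the same p. The names choices, yes, n, p_each and p_pooled are only for illustration.

```r
# Choices per subject, copied from the dummy data in the question
choices <- list(
  EL = c(0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA),
  KZ = c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  MA = c(0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  TC = c(1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, NA)
)

# Number of "yes" answers and number of completed (non-missing) trials
yes <- sapply(choices, sum, na.rm = TRUE)
n   <- sapply(choices, function(x) sum(!is.na(x)))

# Q2: exact binomial test per subject against chance (p = 0.5)
p_each <- mapply(function(x, size) {
  binom.test(x, size, p = 0.5, alternative = "greater")$p.value
}, yes, n)

# Q1: pooled exact binomial test (assumes a common p across subjects)
p_pooled <- binom.test(sum(yes), sum(n), p = 0.5,
                       alternative = "greater")$p.value

print(p_each)
print(p_pooled)
```

With only four subjects and 8 to 12 trials each, the per-subject tests have little power, and four tests at once raise the multiple-comparison question as well.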

Need help applying regression model to dataset in R (sports data)

Update: Solved!
I'm currently trying to create a regression model for football that predicts a team's total points based on their pass yards and rush yards. I was able to get all the way to figuring out the regression equation but from here I do not know how to "plug in" the formula.
The data table is essentially all 32 NFL teams listed in rows and their offensive stats listed in columns
Code:
# 1. Import
Offense <- read.csv(file.choose(), header=TRUE)
#2 View
show (Offense)
#3 Attach so headers can be referenced
attach (Offense)
#4 Create Regression Model
mod1 <-lm(Total.Points ~ Pass.Yds + Rush.Yds)
summary(mod1)
#Formula obtained from summary: -255.60178 + .10565(Pass) + .12154(Rush)
#Plug in the Regression Equation
predict(mod1)
Output: https://imgur.com/a/AbTNF
I see that at the end it applied the regression equation to all 32 rows, but how do I
get it to display in a ranked list
get it to display, say, the team name as well as the projected score (so I don't have to wonder what team "1" or "2" refers to)?
Since I have the equation, could I also just write a loop function that ran the equation for every row of data I have and print the results?
I'm a beginner so much appreciated!
Update: Came up with this
####Part 2. Interpretation
#1. Examining quality of model
summary(mod1)
cor(Pass.Yds, Rush.Yds)
#2. Formula obtained from summary: -255.60178 + .10565(Pass) + .12154(Rush)
#3. Predicted Points (Descending Order)
proj <- sort(predict(mod1), decreasing = TRUE)
proj
#4. Corresponding Name (Descending)
name <- Team[order(predict(mod1), decreasing = TRUE)]
name
#Data Frame
Projections <- data.frame(name, proj)
Projections
While bbrot provided a much simpler version
Assuming that Teams is the vector of team names, something like cbind(Teams[order(predict(mod1), decreasing = TRUE)], sort(predict(mod1), decreasing = TRUE)) should do...
Edit: Your Teams vector seems to be a factor. In this case, the following commands are going to work:
# returns a character matrix
cbind(as.character(Teams)[order(predict(mod1), decreasing = TRUE)],
sort(predict(mod1), decreasing = TRUE))
# returns a data frame
data.frame(Teams = Teams[order(predict(mod1), decreasing = TRUE)],
Points = sort(predict(mod1), decreasing = TRUE))
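For reference, here is a sketch of the same workflow without attach(), passing the data frame through the data argument instead. The numbers are simulated because the original CSV is not available; the column names Team, Pass.Yds, Rush.Yds and Total.Points follow the question, while Projected, by_hand and Projections are names made up for this example. It also shows that "plugging in" the equation by hand gives the same values as predict().

```r
# Simulated stand-in for the 32-team offense table (not real NFL data)
set.seed(42)
Offense <- data.frame(
  Team     = paste("Team", 1:32),
  Pass.Yds = round(rnorm(32, mean = 3600, sd = 400)),
  Rush.Yds = round(rnorm(32, mean = 1800, sd = 250))
)
Offense$Total.Points <- round(-255 + 0.105 * Offense$Pass.Yds +
                                0.12 * Offense$Rush.Yds + rnorm(32, sd = 30))

# Fit the model with data = ..., so no attach() is needed
mod1 <- lm(Total.Points ~ Pass.Yds + Rush.Yds, data = Offense)

# "Plugging in" the regression equation by hand, row by row (vectorised)
b <- coef(mod1)
by_hand <- b[["(Intercept)"]] +
  b[["Pass.Yds"]] * Offense$Pass.Yds +
  b[["Rush.Yds"]] * Offense$Rush.Yds

# predict() does the same thing; store it next to the team names
Offense$Projected <- predict(mod1)

# Ranked list: team name and projected points, highest first
Projections <- Offense[order(Offense$Projected, decreasing = TRUE),
                       c("Team", "Projected")]
print(head(Projections))
```

Keeping the projection as a column of the data frame means team names and predictions can never get out of sync when sorting.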

Effects from multinomial logistic model in mlogit

I received some good help getting my data formatted properly to produce a multinomial logistic model with mlogit here (Formatting data for mlogit)
However, I'm trying now to analyze the effects of covariates in my model. I find the help file in mlogit.effects() to be not very informative. One of the problems is that the model appears to produce a lot of rows of NAs (see below, index(mod1) ).
Can anyone clarify why my data is producing those NAs?
Can anyone help me get mlogit.effects to work with the data below?
I would consider shifting the analysis to multinom(). However, I can't figure out how to format the data to fit the formula for use with multinom(). My data is a series of rankings of seven different items (Accessible, Information, Trade offs, Debate, Social, Officials and Responsive). Would I just model whatever they picked as their first rank and ignore what they chose in other ranks? I can get that information.
Reproducible code is below:
#Loadpackages
library(RCurl)
library(mlogit)
library(tidyr)
library(dplyr)
#URL where data is stored
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
#Get data
dat <- read.csv(dat.url)
#Complete cases only as it seems mlogit cannot handle missing values or tied data which in this case you might get because of median imputation
dat <- dat[complete.cases(dat),]
#Change the choice index variable (X) to have no interruptions, as a result of removing some incomplete cases
dat$X <- seq(1,nrow(dat),1)
#Tidy data to get it into long format
dat.out <- dat %>%
gather(Open, Rank, -c(1,9:12)) %>%
arrange(X, Open, Rank)
#Create mlogit object
mlogit.out <- mlogit.data(dat.out, shape='long',alt.var='Open',choice='Rank', ranked=TRUE,chid.var='X')
#Fit Model
mod1 <- mlogit(Rank~1|gender+age+economic+Job,data=mlogit.out)
Here is my attempt to set up a data frame similar to the one portrayed in the help file. It doesn't work. I confess that although I know the apply family pretty well, tapply is murky to me.
with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt, mean)))
Compare from the help:
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
m <- mlogit(mode ~ price | income | catch, data = Fish)
# compute a data.frame containing the mean value of the covariates in
# the sample data in the help file for effects
z <- with(Fish, data.frame(price = tapply(price, index(m)$alt, mean),
catch = tapply(catch, index(m)$alt, mean),
income = mean(income)))
# compute the marginal effects (the second one is an elasticity)
effects(m, covariate = "income", data = z)
I'll try Option 3 and switch to multinom(). This code will model the log-odds of ranking an item as 1st, compared to a reference item (e.g., "Debate" in the code below). With K = 7 items, if we call the reference item Item_K, then we're modeling
$$\log\left[\frac{\Pr(\text{Item}_k \text{ is 1st})}{\Pr(\text{Item}_K \text{ is 1st})}\right] = \alpha_k + x^{T}\beta_k$$
for k = 1, ..., K-1, where Item_k is one of the other (i.e. non-reference) items. The choice of reference level will affect the coefficients and their interpretation, but it will not affect the predicted probabilities. (Same story for reference levels for the categorical predictor variables.)
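To make the point about the reference level concrete, here is a small numeric sketch. The values in eta are hypothetical linear predictors (the right-hand side of the equation above for one respondent), not estimates from the survey data, and only four of the seven items are used to keep it short.

```r
# Hypothetical linear predictors for three non-reference items,
# each measured relative to the reference item "Debate"
eta <- c(Accessible = 0.4, Information = -0.2, Social = 1.1)

# The reference item has a linear predictor of 0 by construction
eta_full <- c(Debate = 0, eta)

# Predicted probabilities of being ranked 1st
probs <- exp(eta_full) / sum(exp(eta_full))

# Switch the reference to "Social": every log-odds is now relative to
# Social, i.e. shifted by a constant, so the coefficients change ...
eta_alt <- eta_full - eta_full[["Social"]]

# ... but the predicted probabilities do not
probs_alt <- exp(eta_alt) / sum(exp(eta_alt))

print(round(probs, 3))
print(all.equal(probs, probs_alt))
```

The shift by a constant cancels between numerator and denominator, which is why only the coefficients, and not the fitted probabilities, depend on the reference item.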
I'll also mention that I'm handling missing data a bit differently here than in your original code. Since my model only needs to know which item gets ranked 1st, I only need to throw out records where that info is missing. (E.g., in the original dataset record #43 has "Information" ranked 1st, so we can use this record even though 3 other items are NA.)
# Get data
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
dat <- read.csv(dat.url)
# dataframe showing which item is ranked #1
ranks <- (dat[,2:8] == 1)
# for each combination of predictor variable values, count
# how many times each item was ranked #1
dat2 <- aggregate(ranks, by=dat[,9:12], sum, na.rm=TRUE)
# remove cases that didn't rank anything as #1 (due to NAs in original data)
dat3 <- dat2[rowSums(dat2[,5:11])>0,]
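# (optional) set the reference levels for the categorical predictors
# (factor() is needed because read.csv() returns character columns
# by default since R 4.0, and relevel() only works on factors)
dat3$gender <- relevel(factor(dat3$gender), ref="Female")
dat3$Job <- relevel(factor(dat3$Job), ref="Government backbencher")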
# response matrix in format needed for multinom()
response <- as.matrix(dat3[,5:11])
# (optional) set the reference level for the response by changing
# the column order
ref <- "Debate"
ref.index <- match(ref, colnames(response))
response <- response[,c(ref.index,(1:ncol(response))[-ref.index])]
# fit model (note that age & economic are continuous, while gender &
# Job are categorical)
library(nnet)
fit1 <- multinom(response ~ economic + gender + age + Job, data=dat3)
# print some results
summary(fit1)
coef(fit1)
cbind(dat3[,1:4], round(fitted(fit1),3)) # predicted probabilities
I didn't do any diagnostics, so I make no claim that the model used here provides a good fit.
You are working with ranked data, not just multinomial choice data. For ranked data, mlogit structures the records so that the first set for a person contains all options, the second contains all options except the one ranked first, and so on. But the index assumes an equal number of options each time, which leaves a bunch of NAs. We just need to get rid of them.
> with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt[complete.cases(index(mod1)$alt)], mean)))
economic
Accessible 5.13
Debate 4.97
Information 5.08
Officials 4.92
Responsive 5.09
Social 4.91
Trade.Offs 4.91

performing a chi square test across multiple variables and extracting the relevant p value in R

Ok straight to the question. I have a database with lots and lots of categorical variable.
Sample database with a few variables as below
gender <- as.factor(sample( letters[6:7], 100, replace=TRUE, prob=c(0.2, 0.8) ))
smoking <- as.factor(sample(c(0,1),size=100,replace=T,prob=c(0.6,0.4)))
alcohol <- as.factor(sample(c(0,1),size=100,replace=T,prob=c(0.3,0.7)))
htn <- as.factor(sample(c(0,1),size=100,replace=T,prob=c(0.2,0.8)))
tertile <- as.factor(sample(c(1,2,3),size=100,replace=T,prob=c(0.3,0.3,0.4)))
df <- as.data.frame(cbind(gender,smoking,alcohol,htn,tertile))
I want to test the hypothesis, using a chi-square test, that the proportion of smokers, alcohol users, people with hypertension (htn), etc. differs by tertile (a factor with 3 levels). I then want to extract the p-value for each variable.
Now I know I can test each variable individually using a 2-by-3 cross-tabulation, but is there more efficient code to derive the test statistic and p-value for all variables in one go, and to extract the p-value for each variable?
Thanks in advance
Anoop
If you want to do all the comparisons in one statement, you can do
mapply(function(x, y) chisq.test(x, y)$p.value, df[, -5], MoreArgs=list(df[,5]))
# gender smoking alcohol htn
# 0.4967724 0.8251178 0.5008898 0.3775083
Of course, doing the tests this way means you are running multiple tests at once, so some correction is required to maintain an appropriate type 1 error rate.
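One way such a correction could be applied is with p.adjust() from base R, sketched below. The data are simulated along the lines of the question, the choice of the Holm method is my assumption (other methods such as "bonferroni" or "BH" are available), and p_raw and p_adj are illustrative names.

```r
# Simulated data similar to the question (values are random, not real)
set.seed(123)
n <- 100
df <- data.frame(
  gender  = sample(c("f", "g"), n, replace = TRUE, prob = c(0.2, 0.8)),
  smoking = sample(c(0, 1), n, replace = TRUE, prob = c(0.6, 0.4)),
  alcohol = sample(c(0, 1), n, replace = TRUE, prob = c(0.3, 0.7)),
  htn     = sample(c(0, 1), n, replace = TRUE, prob = c(0.2, 0.8)),
  tertile = sample(1:3, n, replace = TRUE, prob = c(0.3, 0.3, 0.4))
)

# Raw chi-square p-value of each variable against tertile (column 5)
p_raw <- sapply(df[, -5], function(x) {
  chisq.test(table(x, df$tertile))$p.value
})

# Holm-adjusted p-values, controlling the family-wise error rate
p_adj <- p.adjust(p_raw, method = "holm")

print(data.frame(p_raw, p_adj))
```

The adjusted values are never smaller than the raw ones, so a variable that is borderline at 0.05 before adjustment may no longer be significant afterwards.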
You can run the following code chunk if you want to get the test result in details:
lapply(df[,-5], function(x) chisq.test(table(x,df$tertile), simulate.p.value = TRUE))
You can get just p-values:
lapply(df[,-5], function(x) chisq.test(table(x,df$tertile), simulate.p.value = TRUE)$p.value)
This is to get the p-values in the data frame:
data.frame(lapply(df[,-5], function(x) chisq.test(table(x,df$tertile), simulate.p.value = TRUE)$p.value))
Thanks to RPub for inspiring.
http://www.rpubs.com/kaz_yos/1204

T-test with grouping variable

I've got a data frame with 36 variables and 74 observations. I'd like to run a paired two-sample t-test on 35 of the variables, using the remaining variable (with two levels) as the grouping variable.
For example: the data frame contains "age" "body weight" and "group" variables.
Now I suppose I can do the ttest for each variable with this code:
t.test(age~group)
But, is there a way to test all the 35 variables with one code, and not one by one?
Sven has provided you with a great way of implementing what you wanted to have implemented. I, however, want to warn you about the statistical aspect of what you are doing.
Recall that if you are using the standard significance level of 0.05, this means that for each t-test performed, you have a 5% chance of committing a Type 1 error (incorrectly rejecting a true null hypothesis). Running 35 individual t-tests compounds that risk. Assuming the tests are independent and every null hypothesis is true, the probability of at least one Type 1 error is:
Pr(at least one Type 1 error) = 1 - (0.95)^35 = 0.834
Meaning that you have about an 83.4% chance of falsely rejecting at least one null hypothesis. Basically what this means is that, by running so many t-tests, there is a very high probability that at least one of them is going to provide an incorrect result.
Just FYI.
An example data frame:
dat <- data.frame(age = rnorm(10, 30), body = rnorm(10, 30),
weight = rnorm(10, 30), group = gl(2,5))
You can use lapply:
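lapply(dat[1:3], function(x)
  # pass the two groups as separate vectors; rows are paired by their
  # order within each group
  t.test(x[dat$group == 1], x[dat$group == 2], paired = TRUE))
In the command above, 1:3 gives the positions of the columns that hold the variables. The argument paired = TRUE is necessary to perform a paired t-test. The two groups are passed as separate vectors because recent versions of R (4.4.0 and later, as far as I know) no longer accept paired = TRUE in the formula interface (x ~ group).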
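Putting the two answers together, a sketch along these lines collects the p-values from all tests and adjusts them for multiple testing. It uses the example data frame from above with a fixed seed; the two-vector form of t.test(), the Bonferroni method and the names vars, p_raw, p_adj and fwer are my choices for illustration, and rows are assumed to be paired by their order within each group.

```r
# Example data as above, with a seed so the run is reproducible
set.seed(1)
dat <- data.frame(age = rnorm(10, 30), body = rnorm(10, 30),
                  weight = rnorm(10, 30), group = gl(2, 5))

# All columns except the grouping variable are tested
vars <- setdiff(names(dat), "group")

# Paired t-test per variable: i-th row of group 1 vs i-th row of group 2
p_raw <- sapply(dat[vars], function(x) {
  t.test(x[dat$group == "1"], x[dat$group == "2"], paired = TRUE)$p.value
})

# Bonferroni correction: multiply by the number of tests, capped at 1
p_adj <- p.adjust(p_raw, method = "bonferroni")

# Chance of at least one false positive across 35 independent tests
# at the 0.05 level, if every null hypothesis is true
fwer <- 1 - (1 - 0.05)^35

print(data.frame(p_raw, p_adj))
print(round(fwer, 3))
```

With 35 variables the Bonferroni threshold becomes 0.05 / 35, which is strict; less conservative methods such as "holm" or "BH" are drop-in alternatives in p.adjust().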

Resources