How to get all the columns in model.matrix [duplicate] - r

I am trying to convert a data frame with categorical variables to a model.matrix but am losing levels of variables.
Here's my code:
df1 <- data.frame(id = 1:200, y =rbinom(200, 1, .5), var1 = factor(rep(c('abc','def','ghi','jkl'),50)))
df1$var2 <- factor(rep(c('ab c','ghi','jkl','def'),50))
df1$var3 <- factor(rep(c('abc','ghi','nop','xyz'),50))
df1$var2 <- as.character(df1$var2)
df1$var2 <- gsub('\\s','',df1$var2)
df1$var2 <- factor(df1$var2)
sapply(df1, levels)
mm1 <- model.matrix(~ 0+.,df1)
head(mm1)
Any suggestions? Is this a matrix non-invertibility issue?

The model matrix is perfectly correct. For each factor, the model matrix contains one column fewer than the factor has levels, because the information for the omitted level is already carried by the (Intercept) column. You are missing the (Intercept) column because you have specified 0 + in your model formula. Try this:
mm2 <- model.matrix(~., df1)
head(mm2)
You will now see the (Intercept) column, which encodes the "default" information, and the first level of var1 is now also missing from the column names. The (Intercept) represents an observation at the "reference level", which is the combination of the first level of each categorical variable. Any deviation from this reference level is encoded in the var*??? columns, and since your model assumes no interactions between these variables, you get (4 - 1) * 3 = 9 var*??? columns plus the (Intercept) column (whose role is taken by var1abc in your initial model matrix).
Unfortunately I lack the precise terms to describe this. Anyone help me out?
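For reference, the default coding that drops the first level of each factor is called treatment contrasts in R. If the goal is to keep one indicator column for every level of every factor, one commonly used approach is to switch contrasts off per factor through the contrasts.arg argument of model.matrix. The following is a minimal sketch on a data frame shaped like the one in the question; the seed and the names is_fac, full_contrasts and mm_all are assumptions made for illustration. The resulting matrix is not of full column rank, so it is better suited to methods that tolerate that (for example, regularised models) than to plain least squares.

```r
# Data shaped like the question's example (seed chosen arbitrarily)
set.seed(1)
df1 <- data.frame(
  id   = 1:200,
  y    = rbinom(200, 1, 0.5),
  var1 = factor(rep(c("abc", "def", "ghi", "jkl"), 50)),
  var2 = factor(rep(c("abc", "ghi", "jkl", "def"), 50)),
  var3 = factor(rep(c("abc", "ghi", "nop", "xyz"), 50))
)

# For every factor, request the full indicator coding (no level dropped)
is_fac <- sapply(df1, is.factor)
full_contrasts <- lapply(df1[is_fac], contrasts, contrasts = FALSE)

# Build the model matrix with one column per level of each factor
mm_all <- model.matrix(~ ., data = df1, contrasts.arg = full_contrasts)
colnames(mm_all)
```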

Related

How to reverse the group comparison order for t.test()?

I would like to conduct a two-sample independent t-test and my grouping variable, "group" is factored as '0' and '1'. When calling the t.test(), it calculates the difference as mean for group 0 - mean for group 1, giving me a negative difference and negative confidence intervals.
How can I reverse the order of comparison so that R estimates mean for group 1 - mean for group 0? I know it orders the groups alphabetically, so I assume it orders numbers by increasing value. I already tried fct_rev() from the forcats package to set group '1' as the reference group, but that did not change the order in which t.test compared the groups, and I still got negative outputs. This is important because it is a one-sided test, so I have expectations about the sign of the difference.
Thanks!
EDIT: I'd like to keep the names as '0' and '1' because they need to be coded as such for subsequent analyses to work
You can use the levels argument when constructing the factor. It is possible to reverse the levels (see code below), and it is also possible to specify an arbitrary order:
# two samples with different mean
x <- c(rnorm(10, 5, 1), rnorm(10, 8, 1))
# the grouping variable
f1 <- factor(rep(0:1, each=10))
# the grouping variable with reverse order of levels
f2 <- factor(f1, levels=rev(levels(f1)))
t.test(x ~ f1)
t.test(x ~ f2)
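As a quick check of the approach above, the sketch below compares the two orderings and adds a one-sided alternative. The seed and the variable names (group, group_rev, res_default, res_reversed) are assumptions made for the example; the labels stay '0' and '1', only the level order changes.

```r
set.seed(123)  # arbitrary seed so the example is reproducible

# Two samples with different means, and a 0/1 grouping factor
x <- c(rnorm(10, 5, 1), rnorm(10, 8, 1))
group <- factor(rep(0:1, each = 10))

# Same labels, but with "1" as the first level
group_rev <- factor(group, levels = c("1", "0"))

# Default order: mean(group 0) - mean(group 1)
res_default <- t.test(x ~ group)

# Reversed order: mean(group 1) - mean(group 0), tested one-sided
res_reversed <- t.test(x ~ group_rev, alternative = "greater")

res_default$estimate
res_reversed$estimate
```

The t statistic of the reversed test is the negative of the default one, so a one-sided alternative can be stated in the direction that matches the hypothesis.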
I cannot comment for some reason, so I am posting this as an answer: can you not simply recode your variable and rerun the test so that the groups are compared in the order you want?
e.g.:
mydata <- read.csv("yourdataset.csv", na.strings = "")
# make a new "recoded group" variable; "group2" sorts after "1", so group 1 becomes the first level
mydata$group.recoded <- as.character(mydata$group)
mydata$group.recoded[mydata$group.recoded == "0"] <- "group2"

generating a sum coding scheme

I have a dataframe which looks like this:
df <- data.frame(id= rep(seq(1:125),3),
timpoint= c(rep("T1", 125), rep("T2", 125), rep("T3", 125)),
treatment=c(rep("A",25),rep("B",25),rep("C",25),rep("D",25),rep("E",25)))
interaction.col <- paste(df$timpoint, df$treatment, sep = "_")
df <- cbind(df, interaction.col)
I am trying to generate a sum coding scheme for the interaction column, which is a combination of the timpoint and treatment columns. According to this paper I should get a matrix with (a−1)×(b−1) columns and n rows (in this case 375).
I have read up on using contrasts:
contrasts(df$interaction.col) <- "contr.sum"
df.c <- contrasts(df$interaction.col)
However, somehow the output is a 15 x 14 matrix, while it should be 375 x 8.
Also, only the very last row is set to -1, which shouldn't be the case. For all the IDs of the last treatment (E), the interaction column should be set to -1 for the corresponding timepoint.
The last ID in treatment group E should be -1 for all columns. What am I doing wrong here?
Depending on which effects you are interested in and how you fit the model, you will end up with a different number of effects. For example, if you fit both the main and interaction effects, you should end up with 8 columns for the interactions, i.e. (a-1) x (b-1). If you do not fit the main effects, you end up with a*b - 1:
Here is how to create your matrix:
With main effects:
model.matrix(~treatment * timpoint, df, list(treatment = contr.sum, timpoint=contr.sum))
In this case, the last 8 columns are the ones you are interested in
Without main effects:
model.matrix(~treatment:timpoint, df, list(treatment = contr.sum, timpoint=contr.sum))
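To make the "with main effects" case concrete, the sketch below rebuilds a data frame shaped like the one in the question (keeping the original column name timpoint) and extracts only the interaction columns. The names mm, inter_cols and mm_inter are assumptions made for illustration, and the columns are converted to factors explicitly so the example does not depend on how the R version handles strings.

```r
# Data shaped like the question's: 125 ids, 3 timepoints, 5 treatments
df <- data.frame(
  id        = rep(1:125, 3),
  timpoint  = factor(rep(c("T1", "T2", "T3"), each = 125)),
  treatment = factor(rep(rep(LETTERS[1:5], each = 25), 3))
)

# Sum coding for both factors, main effects plus interaction
mm <- model.matrix(
  ~ treatment * timpoint,
  data = df,
  contrasts.arg = list(treatment = contr.sum, timpoint = contr.sum)
)

# Interaction columns are the ones whose names contain ":"
inter_cols <- grepl(":", colnames(mm), fixed = TRUE)
mm_inter <- mm[, inter_cols]

dim(mm)        # rows and columns of the full design matrix
dim(mm_inter)  # rows and columns of the interaction part only
```

Each interaction column is the product of one treatment contrast column and one timepoint contrast column, so its entries take the values -1, 0 and 1.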

Manually implementing Regression Likelihood Ratio Test

I'm trying to implement my own linear regression likelihood ratio test.
The test is where you take the sum of squares of a reduced model and the sum of squares of a full model and compare it to the F statistic.
However, I am having some trouble implementing the function, especially when dealing with dummy variables.
This is the dataset I am working with and testing the function on.
Here is the code so far:
The function inputs are the setup matrix mat, the response matrix (which has just one column), the indices of the variables being tested, and the alpha level of the test.
linear_regression_likelihood <- function(mat, response, indices, alpha) {
mat <- as.matrix(mat)
reduced <- mat[,c(1, indices)]
q <- 1 #set q = 1 just to test on data
p <- dim(mat)[2]
n <- dim(mat)[1]
f_stat <- qf(1-alpha, df1 = p-q, df2 = n-(p+1))
beta_hat_full <- qr.solve(t(mat)%*%mat)%*%t(mat)%*%response
y_hat_full <- mat%*%beta_hat_full
SSRes_full <- t(response - y_hat_full)%*%(response-y_hat_full)
beta_hat_red <- qr.solve(t(reduced)%*%reduced)%*%t(reduced)%*%response
y_hat_red <- reduced%*%beta_hat_red
SSRes_red <- t(response - y_hat_red)%*%(response-y_hat_red)
s_2 <- (t(response - mat%*%beta_hat_full)%*%(response - mat%*%beta_hat_full))/(n-p+1)
critical_value <- ((SSRes_red - SSRes_full)/(p-q))/s_2
print(critical_value)
if (critical_value > f_stat) {
return ("Reject H0")
}
else {
return ("Fail to Reject H0")
}
}
Here is the setup code, where I setup the matrix in the correct format. Data is the read in CSV file.
data <- data[, 2:5]
mat <- data[, 2:4]
response <- data[, 1]
library(ade4)
df <-data.frame(mat$x3)
dummy <- acm.disjonctif(df)
dummy
mat <- cbind(1, mat[1:2], dummy)
linear_regression_likelihood(mat, response, 2:3, 0.05)
This is the error I keep getting.
Error in solve.default(as.matrix(c)) : system is computationally singular: reciprocal condition number = 1.63035e-18
I know it has to do with taking the inverse of the matrix after it is multiplied, but the function is unable to do so. I thought it may be due to the dummy variables having too small of values, but I am not sure of any other way to include the dummy variables.
The test I am doing is to check whether the factor variable x3 has any effect on the response y. The actual answer, which I verified using the built-in functions, is that we fail to reject the null hypothesis.
The error originates from line
beta_hat_full <- qr.solve(t(mat)%*%mat)%*%t(mat)%*%response
If you go through your function step-by-step you will see an error
Error in qr.solve(t(mat) %*% mat) : singular matrix 'a' in solve
The problem here is that your model matrix does not have full column rank, which translates to your regression coefficients not being unique. This is a result of the way you "dummyfied" x3. In order to ensure full rank, you need to remove one dummy column (or manually remove the intercept).
In the following example I remove the A column from dummy which means that resulting x3 coefficients measure the effect of a unit-change in B, C, and D against A.
# Read data
data <- read.csv("data_hw5.csv")
data <- data[, 2:5]
# Extract predictor and response data
mat <- data[, 2:4]
response <- data[, 1]
# Dummify categorical predictor x3
library(ade4)
df <-data.frame(mat$x3)
dummy <- acm.disjonctif(df)
dummy <- dummy[, -1] # Remove A to have A as baseline
mat <- cbind(1, mat[1:2], dummy)
# Apply linear_regression_likelihood
linear_regression_likelihood(mat, response, 2:3, 0.05);
# [,1]
#[1,] 8.291975
#[1] "Reject H0"
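The rank problem described above can be seen directly on a small example. This is a minimal sketch with a made-up four-level factor (x3, full_dummies, X_bad and X_ok are illustrative names, not objects from the original data): an intercept column plus one indicator per level is rank deficient, because the indicators sum to the intercept column.

```r
# A toy factor with levels A, B, C, D
x3 <- factor(rep(c("A", "B", "C", "D"), times = 5))

# One 0/1 indicator column per level (complete disjunctive coding)
full_dummies <- sapply(levels(x3), function(l) as.numeric(x3 == l))

# Intercept plus all four indicators: 5 columns, but they are linearly dependent
X_bad <- cbind(Intercept = 1, full_dummies)
c(columns = ncol(X_bad), rank = qr(X_bad)$rank)

# Dropping the indicator for the baseline level A restores full column rank
X_ok <- cbind(Intercept = 1, full_dummies[, -1])
c(columns = ncol(X_ok), rank = qr(X_ok)$rank)
```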
A note
The error could have been avoided if you had used base R's function model.matrix which ensures full rank when "dummyfying" categorical variables (model.matrix is also implicitly called in lm and glm to deal with categorical, i.e. factor variables).
Take a look at
mm <- model.matrix(y ~ x1 + x2 + x3, data = data)
which by default omits the first level of factor variable x3. mm is identical to mat after (correct) "dummification".
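For comparison, here is a hedged sketch of a partial F-test built on model.matrix and checked against anova(). It uses simulated data because the original CSV is not available, and it uses the textbook degrees of freedom (numerator p - q, denominator n - p, where p and q count the columns of the full and reduced design matrices including the intercept). That choice is an assumption about what the test is meant to compute and is not taken from the function in the question; all object names are illustrative.

```r
set.seed(1)  # arbitrary seed
n <- 60
dat <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = factor(rep(c("A", "B", "C", "D"), each = 15))
)
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)

# Full-rank design matrices for the full and reduced models
X_full <- model.matrix(y ~ x1 + x2 + x3, data = dat)
X_red  <- model.matrix(y ~ x1 + x2, data = dat)

# Residual sum of squares from a least-squares fit
rss <- function(X, y) sum(lm.fit(X, y)$residuals^2)
rss_full <- rss(X_full, dat$y)
rss_red  <- rss(X_red, dat$y)

p <- ncol(X_full)  # parameters in the full model
q <- ncol(X_red)   # parameters in the reduced model

# F statistic and p-value for H0: all x3 coefficients are zero
f_manual <- ((rss_red - rss_full) / (p - q)) / (rss_full / (n - p))
p_manual <- pf(f_manual, df1 = p - q, df2 = n - p, lower.tail = FALSE)

# Same test with built-in functions
f_builtin <- anova(lm(y ~ x1 + x2, data = dat),
                   lm(y ~ x1 + x2 + x3, data = dat))$F[2]
c(manual = f_manual, builtin = f_builtin)
```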

Linear regression on subsets with dependent variable per column using dlply() in R

I would like to automatically produce linear regressions for a data frame for each category separately.
My data frame includes one column with time categories, one column (slope$Abs) as the dependent variable, and several columns that should be used as the independent variables.
head(slope)
timepoint Abs In1 In2 In3 Out1 Out2 Out3 ...
1: t0 275.0 2.169214 2.169214 2.169214 2.069684 2.069684 2.069684
2: t0 275.5 2.163937 2.163937 2.163937 2.063853 2.063853 2.063853
3: t0 276.0 2.153298 2.158632 2.153298 2.052088 2.052088 2.057988
4: ...
All in all for each timepoint I have 40 variables, and I want to end up with a linear regression for each combination. Such as In1~Abs[t0], In1~Abs[t1] and so on for each column.
Of course I can do this manually, but I guess there must be a more elegant way to do the work.
I did my research and found out that dlply() might be the function I'm looking for. However, my attempt results in an error.
So I somehow tried to combine the answers from previous questions I have found:
On individual variables per column and on subsets per category
I came up with a function like this:
lm.fun <- function(x) {summary(lm(x ~ slope$Abs, data=slope))}
lm.list <- dlply(.data=slope, .variables=slope$timepoint, .fun=lm.fun )
But I get the following error:
Error in eval.quoted(.variables, data) :
envir must be either NULL, a list, or an environment.
Hope someone can help me out.
Thanks a lot in advance!
Based on my research, the dplyr package in R does not do well at accepting formulas of the form y~x in its functions. So the other alternative is to calculate the regression somewhat manually. First, recall that slope = cor(x,y)*sd(y)/sd(x) (reference found here: http://faculty.cas.usf.edu/mbrannick/regression/regbas.html) and that intercept = mean(y) - slope*mean(x). The intercept is computed from the means because the least-squares line always passes through the centroid (mean(x), mean(y)). Using a single observed point instead would only give you the intercept of a line through that point, not the overall intercept.
Now for this explanation, I will be using the mtcars data set. I only wanted a subset of the data so I am using variables c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec') to basically mimic your dataset. In my example, my grouping variable is 'cyl', which is the equivalent of your 'timepoint' variable. The variable 'mpg' is the y-variable in this case, which is equivalent to 'Abs' in your data.
Based on my explanation of slope and intercept above, it is clear that we need three tables/datasets: a correlation dataset for your y with respect to your x for each group, a standard deviation table for each variable and group, and a table of means for each group and each variable.
To get the correlation dataset, we want to group by 'cyl' and calculate the correlation coefficients of the y-variable with every other variable. You should use:
library(dplyr)
k1 <- mtcars[c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec')]
corrs <- data.frame(k1 %>% group_by(cyl) %>% do(head(data.frame(cor(.[,c(1,3:7)])), n = 1)))
Because of the way my dataset is structured, the second variable (k1[ ,2]) is 'cyl'. For you, you should use
do(head(data.frame(cor(.[,c(2:40)])), n = 1))
since your first column is the grouping variable and it is not numeric. Essentially, you want to go across all numeric variables. Not using head will produce a correlation matrix, but since you are interested in finding the slope independent of each other x-variable, you only need the row that has the correlation coefficient of your y-variable equal to 1 (r_yy = 1).
To get standard deviation and means for each group, each variable, use
# summarise_all() replaces the deprecated summarise_each(funs(...)) idiom
sds <- data.frame(k1 %>% group_by(cyl) %>% summarise_all(sd))
means <- data.frame(k1 %>% group_by(cyl) %>% summarise_all(mean))
Your group names will be the first column, so make sure to rename your rows for each dataset corrs, sds, and means and delete column 1.
rownames(corrs) <- rownames(means) <- rownames(sds) <- corrs[ ,1]
corrs <- corrs[ ,-1]; sds <- sds[ ,-1]; means <- means[ ,-1]
Now we need to calculate sd(y)/sd(x). The best way I have found to do this is with a function from the apply family.
sdst <- data.frame(t(apply(sds, 1, function(X) X[1]/X)))
I use X[1] because the first variable in sds is my y-variable. The first variable after you have deleted timepoint is Abs which is your y-variable. So use that.
Now the rest is pretty straightforward. Since everything is saved as a data frame, all you need to do to find the slopes and intercepts is
slopes <- sdst*corrs
inter <- slopes*means
intercept <- data.frame(t(apply(inter, 1, function(x) x[1]-x)))
Again here, since our y-variable is in the first column, we use x[1]. To check if all is well, your slopes for your y-variable should be 1 and the intercept should be 0.
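The slope and intercept formulas used in this answer can be checked against lm() for a single pair of variables. This is a minimal sketch on mtcars, with mpg as y and wt as x; the names slope_manual, intercept_manual and fit are illustrative.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Closed-form simple linear regression coefficients
slope_manual     <- cor(x, y) * sd(y) / sd(x)
intercept_manual <- mean(y) - slope_manual * mean(x)

# Reference fit with lm()
fit <- lm(mpg ~ wt, data = mtcars)

rbind(manual = c(intercept_manual, slope_manual),
      lm     = unname(coef(fit)))
```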
I have solved the issue with a simpler approach, so I wanted to update the answer.
To make life easier, I reshaped the data frame into long format with the melt() function of the reshape package, so that the measurement columns are stacked into rows.
library(reshape)
library(plyr)
slope <- melt(slope, id = c("Abs", "timepoint"), variable_name = "Sites")
The output's value column is named "value" by default.
Then create one column that combines both grouping variables with paste().
slope$FullTreat <- paste(slope$Sites,slope$timepoint, sep="_")
Run a function through the dataset to create separate models for each treatment combination.
models <- dlply(slope, ~ FullTreat, function(df) {
lm(value ~ Abs, data = df)
})
To extract the coefficients simply run
coefs <- ldply(models, coef)
Then split the FullTreat column into separate columns again with colsplit() also from reshape. Plus, add the Intercept and slope to the new data frame:
coefs <- cbind(colsplit(coefs$FullTreat, split="_",
c("Sites","Timepoint")), coefs[,2:3])
I haven't worked on a function that plots all the regressions from the models, but I guess this is feasible with the ldply() function.
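The same per-group, per-column fitting can also be sketched in base R without reshaping, using split() and lapply(). The data below are simulated to resemble the question's layout (a timepoint column, Abs, and two measurement columns); the seed, the column values and the names sites, groups and coefs are all assumptions made for the example.

```r
set.seed(42)  # arbitrary seed

# Simulated data: two timepoints, 20 Abs values each, two measurement columns
slope <- data.frame(
  timepoint = rep(c("t0", "t1"), each = 20),
  Abs       = rep(seq(275, by = 0.5, length.out = 20), 2)
)
slope$In1  <- 2 + 0.01 * slope$Abs + rnorm(40, sd = 0.01)
slope$Out1 <- 1 - 0.02 * slope$Abs + rnorm(40, sd = 0.01)

sites  <- c("In1", "Out1")                 # columns to regress on Abs
groups <- split(slope, slope$timepoint)    # one data frame per timepoint

# Fit <site> ~ Abs within each timepoint and collect the coefficients
coefs <- do.call(rbind, lapply(names(groups), function(tp) {
  do.call(rbind, lapply(sites, function(s) {
    fit <- lm(reformulate("Abs", response = s), data = groups[[tp]])
    data.frame(Timepoint = tp, Sites = s,
               Intercept = unname(coef(fit)[1]),
               Slope     = unname(coef(fit)[2]))
  }))
}))
coefs
```

Keeping the fitted models in a named list instead of only their coefficients would make it straightforward to loop over them later for plotting.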

Effects from multinomial logistic model in mlogit

I received some good help getting my data formatted properly to produce a multinomial logistic model with mlogit here (Formatting data for mlogit).
However, I'm trying now to analyze the effects of covariates in my model. I find the help file in mlogit.effects() to be not very informative. One of the problems is that the model appears to produce a lot of rows of NAs (see below, index(mod1) ).
Can anyone clarify why my data is producing those NAs?
Can anyone help me get mlogit.effects to work with the data below?
I would consider shifting the analysis to multinom(). However, I can't figure out how to format the data to fit the formula for use with multinom(). My data is a series of rankings of seven different items (Accessible, Information, Trade offs, Debate, Social, Officials and Responsive). Would I just model whatever they picked as their first rank and ignore what they chose in the other ranks? I can get that information.
Reproducible code is below:
#Loadpackages
library(RCurl)
library(mlogit)
library(tidyr)
library(dplyr)
#URL where data is stored
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
#Get data
dat <- read.csv(dat.url)
#Complete cases only as it seems mlogit cannot handle missing values or tied data which in this case you might get because of median imputation
dat <- dat[complete.cases(dat),]
#Change the choice index variable (X) to have no interruptions, as a result of removing some incomplete cases
dat$X <- seq(1,nrow(dat),1)
#Tidy data to get it into long format
dat.out <- dat %>%
gather(Open, Rank, -c(1,9:12)) %>%
arrange(X, Open, Rank)
#Create mlogit object
mlogit.out <- mlogit.data(dat.out, shape='long',alt.var='Open',choice='Rank', ranked=TRUE,chid.var='X')
#Fit Model
mod1 <- mlogit(Rank~1|gender+age+economic+Job,data=mlogit.out)
Here is my attempt to set up a data frame similar to the one portrayed in the help file. It doesn't work. I confess that although I know the apply family pretty well, tapply is murky to me.
with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt, mean)))
Compare from the help:
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
m <- mlogit(mode ~ price | income | catch, data = Fish)
# compute a data.frame containing the mean value of the covariates in
# the sample data in the help file for effects
z <- with(Fish, data.frame(price = tapply(price, index(m)$alt, mean),
catch = tapply(catch, index(m)$alt, mean),
income = mean(income)))
# compute the marginal effects (the second one is an elasticity)
effects(m, covariate = "income", data = z)
I'll try Option 3 and switch to multinom(). This code will model the log-odds of ranking an item as 1st, compared to a reference item (e.g., "Debate" in the code below). With K = 7 items, if we call the reference item Item_K, then we're modeling
$$\log\left[ \frac{\Pr(\text{Item}_k \text{ is 1st})}{\Pr(\text{Item}_K \text{ is 1st})} \right] = \alpha_k + x^T \beta_k$$
for k = 1,...,K-1, where Item_k is one of the other (i.e. non-reference) items. The choice of reference level will affect the coefficients and their interpretation, but it will not affect the predicted probabilities. (Same story for reference levels for the categorical predictor variables.)
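The claim that the reference item changes the coefficients but not the predicted probabilities can be illustrated numerically. This is a minimal sketch with made-up linear predictors for four items (the values and the names softmax, eta_ref_K and eta_ref_1 are assumptions for the example, not output from the fitted model): switching the reference shifts every linear predictor by the same constant, which cancels when the log-odds are converted to probabilities.

```r
# Convert linear predictors (log-odds relative to a reference) to probabilities
softmax <- function(eta) exp(eta) / sum(exp(eta))

# Hypothetical linear predictors with ItemK as reference (its value is 0)
eta_ref_K <- c(Item1 = 0.4, Item2 = -1.1, Item3 = 0.7, ItemK = 0)

# Same model expressed with Item1 as reference: subtract Item1's value
eta_ref_1 <- eta_ref_K - eta_ref_K["Item1"]

p_ref_K <- softmax(eta_ref_K)
p_ref_1 <- softmax(eta_ref_1)

# The linear predictors differ, the probabilities do not
rbind(eta_ref_K, eta_ref_1)
rbind(p_ref_K, p_ref_1)
```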
I'll also mention that I'm handling missing data a bit differently here than in your original code. Since my model only needs to know which item gets ranked 1st, I only need to throw out records where that info is missing. (E.g., in the original dataset record #43 has "Information" ranked 1st, so we can use this record even though 3 other items are NA.)
# Get data
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
dat <- read.csv(dat.url)
# dataframe showing which item is ranked #1
ranks <- (dat[,2:8] == 1)
# for each combination of predictor variable values, count
# how many times each item was ranked #1
dat2 <- aggregate(ranks, by=dat[,9:12], sum, na.rm=TRUE)
# remove cases that didn't rank anything as #1 (due to NAs in original data)
dat3 <- dat2[rowSums(dat2[,5:11])>0,]
# (optional) set the reference levels for the categorical predictors
# factor() makes this work even when read.csv() returns character columns
dat3$gender <- relevel(factor(dat3$gender), ref="Female")
dat3$Job <- relevel(factor(dat3$Job), ref="Government backbencher")
# response matrix in format needed for multinom()
response <- as.matrix(dat3[,5:11])
# (optional) set the reference level for the response by changing
# the column order
ref <- "Debate"
ref.index <- match(ref, colnames(response))
response <- response[,c(ref.index,(1:ncol(response))[-ref.index])]
# fit model (note that age & economic are continuous, while gender &
# Job are categorical)
library(nnet)
fit1 <- multinom(response ~ economic + gender + age + Job, data=dat3)
# print some results
summary(fit1)
coef(fit1)
cbind(dat3[,1:4], round(fitted(fit1),3)) # predicted probabilities
I didn't do any diagnostics, so I make no claim that the model used here provides a good fit.
You are working with ranked data, not just multinomial choice data. In mlogit, ranked data are structured so that the first set of records for a person contains all options, the second set contains all options except the one ranked first, and so on. But the index assumes an equal number of options each time, which produces a bunch of NAs. We just need to get rid of them.
> with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt[complete.cases(index(mod1)$alt)], mean)))
economic
Accessible 5.13
Debate 4.97
Information 5.08
Officials 4.92
Responsive 5.09
Social 4.91
Trade.Offs 4.91
