I have a dataframe that looks like this:
Date A B MONTH
2016-01-01 3 10 January
2016-01-02 5 13 January
2016-01-03 8 12 January
.
.
.
2016-12-29 4 13 December
2016-12-30 5 12 December
2016-12-31 6 4 December
With this dataframe, I want to run a regression model with the MONTH column as dummy variables.
I have tried two methods to run this and each time I do it, it always excludes the month "April".
Any idea why this may be happening?
1st method:
lm(A ~ MONTH + B, data = df)
Example output:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.248e+01 3.600e+01 0.902 0.36754
MONTHAugust 7.425e+02 3.630e+01 6.680 9.29e-11 ***
MONTHDecember -1.840e+02 3.277e+01 -5.613 4.02e-08 ***
MONTHFebruary -8.673e+00 2.855e+01 -0.129 0.89770
MONTHJanuary -4.084e+01 2.945e+01 -0.368 0.71291
MONTHJuly 9.407e+02 3.100e+01 4.540 7.73e-06 ***
MONTHJune 3.387e+01 3.077e+01 2.401 0.01687 *
MONTHMarch 2.797e+02 2.884e+01 6.231 1.32e-09 ***
MONTHMay -9.500e+01 3.122e+01 -3.043 0.00252 **
MONTHNovember -1.321e+01 3.555e+01 -1.778 0.07626 .
MONTHOctober 7.145e+01 3.200e+01 0.983 0.32637
MONTHSeptember 9.691e+02 3.916e+01 4.319 2.04e-05 ***
B 5.279e-02 1.161e-03 11.013 < 2e-16 ***
2nd method:
A <- model.matrix(A ~ B + MONTH, df)
head(A)
(Intercept) Sum.of.Media.Cost MONTHAugust MONTHDecember MONTHFebruary MONTHJanuary MONTHJuly MONTHJune MONTHMarch MONTHMay
1 1 0 0 0 0 1 0 0 0 0
2 1 0 0 0 0 1 0 0 0 0
3 1 0 0 0 0 1 0 0 0 0
4 1 0 0 0 0 1 0 0 0 0
5 1 0 0 0 0 1 0 0 0 0
6 1 0 0 0 0 1 0 0 0 0
MONTHNovember MONTHOctober MONTHSeptember
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
Try A ~ B + MONTH - 1. If your dummies are complete, their linear combination is the same as the constant, so the design matrix has reduced rank. You cannot fit that, so something has to give.
Either you keep the constant (and drop one monthly dummy) to get a "per-month offset to the intercept", or, and that is what I would do, remove the constant to get a "monthly intercept" for each month.
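A minimal sketch of the no-intercept version, assuming the df from the question:
# Every month gets its own coefficient; no level is dropped
fit <- lm(A ~ B + MONTH - 1, data = df)
summary(fit)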
This is normal when you deal with dummy variables: if your factor has n levels, you only need n - 1 dummy variables, because the remaining level corresponds to all dummies being zero. April is the excluded month because it comes first in alphabetical ordering, so R uses it as the reference level.
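If you want to keep the intercept but control which month is dropped, change the factor's reference level; a sketch assuming the df from the question:
# Make January the reference level instead of April
df$MONTH <- relevel(factor(df$MONTH), ref = "January")
fit <- lm(A ~ MONTH + B, data = df)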
I'm struggling to get mice to impute all the variables with missing values in my dataset. It works perfectly for 4 of the variables but not for 3 others, and I get 3 logged events, which I suspect correspond to the 3 variables in question: GCSPupils, Hypoxia, and Hypotension. I can't figure out the issue. There is variability in those variables (they aren't constants), so mice should work. I want to do single imputation of 7 variables (the other variables have complete data).
# We run the mice code with 0 iterations
imp <- mice(TXAIMPACT_final, maxit = 0)
# Extract the predictor matrix and imputation methods
predM <- imp$predictorMatrix
meth <- imp$method
# Set variables I'd like to leave out to 0 in the predictor matrix
predM[, c("subjectId")] <- 0
# Specify a separate imputation model for the variables of interest
# Dichotomous variables
log <- c("Hypotension", "Hypoxia")
# Unordered categorical variables
poly2 <- c("GCSPupils", "GCSMotor")
# Set their entries in the methods vector to the specified imputation models
meth[log] <- "logreg"
meth[poly2] <- "polyreg"
Here, I check to make sure "meth" is correct, and it is:
meth
subjectId Age GCS GCSMotor GCSPupils Glucose Hemoglobin
"" "" "" "polyreg" "polyreg" "pmm" "pmm"
Hypotension Hypoxia MarshallCT SAH EDH GOS GFAP
"logreg" "logreg" "pmm" "" "" "" ""
The methods are all correct, as I specified. I do notice something funny about the predictor matrix, though: the 3 variables that aren't imputing show only "0" in their rows and columns:
predM
subjectId Age GCS GCSMotor GCSPupils Glucose Hemoglobin Hypotension Hypoxia MarshallCT SAH EDH GOS GFAP
subjectId 0 1 1 1 0 1 1 0 0 1 1 1 1 1
Age 0 0 1 1 0 1 1 0 0 1 1 1 1 1
GCS 0 1 0 1 0 1 1 0 0 1 1 1 1 1
GCSMotor 0 1 1 0 0 1 1 0 0 1 1 1 1 1
GCSPupils 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Glucose 0 1 1 1 0 0 1 0 0 1 1 1 1 1
Hemoglobin 0 1 1 1 0 1 0 0 0 1 1 1 1 1
Hypotension 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Hypoxia 0 0 0 0 0 0 0 0 0 0 0 0 0 0
MarshallCT 0 1 1 1 0 1 1 0 0 0 1 1 1 1
SAH 0 1 1 1 0 1 1 0 0 1 0 1 1 1
EDH 0 1 1 1 0 1 1 0 0 1 1 0 1 1
GOS 0 1 1 1 0 1 1 0 0 1 1 1 0 1
GFAP 0 1 1 1 0 1 1 0 0 1 1 1 1 0
I think this is the problem, but I'm not sure how to solve it. Finally, here is my single imputation:
imp2 <- complete(mice(TXAIMPACT_final, maxit = 1,
+ predictorMatrix = predM,
+ method = meth, print = TRUE))
iter imp variable
1 1 GCSMotor Glucose Hemoglobin MarshallCT
1 2 GCSMotor Glucose Hemoglobin MarshallCT
1 3 GCSMotor Glucose Hemoglobin MarshallCT
1 4 GCSMotor Glucose Hemoglobin MarshallCT
1 5 GCSMotor Glucose Hemoglobin MarshallCT
Warning: Number of logged events: 3
Thanks in advance!
Figured it out--posting here in case someone else has this issue. My variables that were not imputing were stored as character classes, which blocked imputation. As soon as I switched them to numeric, my issues disappeared.
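For anyone hitting this, a minimal sketch of the fix (column names from the question; whether factor or numeric is appropriate depends on how the values are coded):
# character columns blocked imputation for me, so give them a usable class first
TXAIMPACT_final$GCSPupils   <- as.factor(TXAIMPACT_final$GCSPupils)
TXAIMPACT_final$Hypotension <- as.factor(TXAIMPACT_final$Hypotension)
TXAIMPACT_final$Hypoxia     <- as.factor(TXAIMPACT_final$Hypoxia)
# as.numeric() also works if, as in my case, the values are numeric codes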
I have to perform a nonlinear multiple regression with data that looks like the following:
ID Customer Country Industry Machine-type Service hours
1 A China mass A1 120
2 B Europe customized A2 400
3 C US mass A1 60
4 D Rus mass A3 250
5 A China mass A2 480
6 B Europe customized A1 300
7 C US mass A4 250
8 D Rus customized A2 260
9 A China Customized A2 310
10 B Europe mass A1 110
11 C US Customized A4 40
12 D Rus customized A2 80
Dependent variable: Service hours
Independent variables: Customer, Country, Industry, Machine type
I did a linear regression, but because the assumption of linearity does not hold, I have to perform a nonlinear regression.
I know nonlinear regression can be done with the nls function. How do I add the categorical variables to the nonlinear regression so that I get the statistical summary in R?
Column names after adding dummies:
ID Customer.a Customer.b Customer.c Customer.d Country.China Country.Europe Country.Rus Country.US Industry.customized industry.Customized Industry.mass Machine type.A1 Machine type.A2 Machine type.A3 Service hours
1 1 0 0 0 1 0 0 0 0 0 1 1 0 0 120
2 0 1 0 0 0 1 0 0 1 0 0 0 1 0 400
3 0 0 1 0 0 0 0 1 0 0 1 0 0 1 60
4 0 0 0 1 0 0 1 0 0 0 1 1 0 0 250
5 1 0 0 0 1 0 0 0 1 0 0 0 0 1 480
6 0 1 0 0 0 1 0 0 0 1 0 1 0 0 300
7 0 0 1 0 0 0 0 1 0 0 1 0 0 1 250
8 0 0 0 1 0 0 1 0 1 0 0 0 1 0 260
9 1 0 0 0 1 0 0 0 0 0 1 0 1 0 210
10 0 1 0 0 0 1 0 0 1 0 0 0 1 0 110
11 0 0 1 0 0 0 0 1 0 0 1 0 0 1 40
12 0 0 0 1 0 0 1 0 0 0 1 1 0 0 80
How you handle a categorical predictor depends on the number of levels it can take.
For predictors such as gender, which can only take 2 forms (male or female), you can simply represent them as a binary (1, 0) variable.
For predictors with greater than 2 levels, we use 1-of-k dummy encoding where k is the number of levels the particular variable takes. See the dummies package for useful functions!
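Alternatively, base R's model.matrix() builds the 1-of-k encoding for you; a hedged sketch, assuming the question's data frame is called df and the columns got syntactic names like Machine.type on import:
X <- model.matrix(~ Customer + Country + Industry + Machine.type, data = df)
head(X)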
After this, you can fit the model with nls(). Note that nls() needs an explicitly parameterised formula and starting values; a purely additive formula such as y ~ x1 + x2 contains no parameters to estimate. For example:
nls(Service.hours ~ b0 + b1 * predictor1 + b2 * predictor2, data = df, start = list(b0 = 1, b1 = 1, b2 = 1))
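For instance, a hedged sketch of an exponential model built from the dummy columns in the table above (the functional form and starting values are only illustrations):
fit <- nls(Service.hours ~ exp(b0 + b1 * Customer.b + b2 * Industry.mass),
           data = df,
           start = list(b0 = 5, b1 = 0, b2 = 0))
summary(fit)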
I'm trying to do discrete choice modeling on the data below. Basically, 30 customers have 16 different choices of pizza. They can choose more than one type of pizza, and the ones they choose are indicated by the choice variable.
pizza cust choice pan thin pineapple veggie sausage romano mozarella oz
1 1 Cust1 0 1 0 1 0 0 1 0 1
2 2 Cust1 1 0 1 1 0 0 0 0 0
3 3 Cust1 0 0 0 1 0 0 0 1 1
4 4 Cust1 1 0 1 1 0 0 0 0 0
5 5 Cust1 1 1 0 0 1 0 0 0 1
6 6 Cust1 0 0 1 0 1 0 1 0 0
7 7 Cust1 0 0 0 0 1 0 0 0 1
8 8 Cust1 1 0 1 0 1 0 0 1 0
9 9 Cust1 0 1 0 0 0 1 0 1 0
10 10 Cust1 1 0 1 0 0 1 0 0 1
11 11 Cust1 0 0 0 0 0 1 1 0 0
12 12 Cust1 0 0 1 0 0 1 0 0 1
13 13 Cust1 0 1 0 0 0 0 0 0 0
14 14 Cust1 1 0 1 0 0 0 0 1 1
15 15 Cust1 0 0 0 0 0 0 0 0 0
16 16 Cust1 0 0 1 0 0 0 1 0 1
17 1 Cust10 0 1 0 1 0 0 1 0 1
18 2 Cust10 0 0 1 1 0 0 0 0 0
19 3 Cust10 0 0 0 1 0 0 0 1 1
20 4 Cust10 0 0 1 1 0 0 0 0 0
I use the command below to transform my data. I tried making a few changes here, like adding chid.var = "chid" and alt.levels = c(1:16). If I use both alt.levels and alt.var, it gives me an error saying pizza already exists and will be replaced. However, I get no error if I use only one of them.
pz <- mlogit.data(pizza,shape = "long",choice = "choice",
varying = 4:8, id = "cust", alt.var = "pizza")
Finally, when I use the mlogit command, I get this error.
mlogit(choice ~ pan + thin + pineapple + veggie + sausage + romano + mozarella + oz, pz)
Error in solve.default(H, g[!fixed]) :
system is computationally singular: reciprocal condition number = 8.23306e-19
This is my first post on Stack Overflow. I visit this site very often and so far have never needed to post, as I always found existing solutions. I went through almost all similar posts, like this one, but in vain. I'm new to discrete choice modeling, so I don't know if I'm making a fundamental mistake here.
Also, I'm not really sure what chid.var does.
I couldn't solve this problem directly, but you can use the multinom function from the nnet package instead. It seems to work; I verified the answer.
The dataset remains the same as shown in the question, so no transformation is needed.
library("nnet")
pizza_model <- multinom(choice ~ Price + IsThin + IsPan, data = pizza_all)
summary(pizza_model)
where choice is the dependent categorical variable you want to predict, and Price, IsThin, and IsPan are independent variables. Below is the output:
Call:
multinom(formula = choice ~ Price + I_cPan + I_cThin, data = pizza_all)
Coefficients:
Values Std. Err.
(Intercept) 0.007192623 1.3298018
Price -0.149665357 0.1464976
I_cPan 0.098438084 0.3138538
I_cThin 0.624447867 0.2637110
Residual Deviance: 553.8519
AIC: 561.8519
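A hedged usage note: once the model is fitted, predicted choice probabilities come from predict() (using the pizza_model object from above):
head(predict(pizza_model, type = "probs"))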
I am working with a large dataset of a fishing fleet and I need to format it for a Poisson regression and other count models. See below for a subset of the data. The count variable is 'days'; p1:p3 are indicator variables for port group and f1:f4 are indicator variables for other fishing activity.
yr week id days rev p1 p2 p3 f1 f2 f3 f4
2016 3 1 1 5568.3 0 1 0 0 0 0 0
2016 4 1 3 8869.53 0 1 0 0 0 0 0
2016 5 1 2 12025.8 0 1 0 0 0 0 0
2016 6 1 2 9126.6 0 1 0 0 0 0 0
2016 7 1 3 4415.4 0 1 0 0 0 0 0
2016 8 1 2 11586.6 0 1 0 0 0 0 0
2016 10 1 1 2144.4 0 1 0 0 0 0 0
2016 11 1 1 2183.25 0 1 0 0 0 0 0
2016 14 1 2 4998 0 1 0 0 0 0 0
2016 15 1 3 117 0 1 0 0 0 0 0
2016 1 2 4 12743.3 0 0 1 1 1 0 0
2016 2 2 2 7473.48 0 0 1 1 0 0 0
2016 5 2 2 8885.52 0 0 1 1 0 0 0
2016 7 2 1 15330.6 0 0 1 1 1 0 0
2016 8 2 2 3763.8 0 0 1 1 1 0 0
2016 9 2 1 2274.05 0 0 1 1 1 0 0
These rows only represent active weeks but I need to incorporate each vessel's inactive weeks. For example, for id=1, in year (yr) 2016 I need to add rows that start at week=1, and then rows for weeks 9,12, and 13. These rows will need to maintain the same information in the dummy categories (these don't change by yr), and have zeros in the 'days' column. I don't need to add rows after the last value of 'week' for that year and vessel.
This is where things get really complicated:
In the revenue (rev) column for these newly created rows I need to add the average revenue for that week and year for all vessels that share the same port group (p1:p3).
Finally, I need to add a new column of lagged revenues. For each row, the value for lagged revenue should be the value in the 'rev' column for the previous week for that vessel in that year.
The value for week 1 for each vessel should be the average of the first 2 weeks of revenue for that vessel in that year.
This task blows my data manipulation skills to smithereens and banging my head against the wall is starting to hurt. Any suggestions would be well appreciated! Thanks.
Thanks to https://stackoverflow.com/users/3001626/david-arenburg, and https://stackoverflow.com/users/2802241/user2802241, the issue has been solved. You can see a post on the adding rows part at:
Adding rows to a data.table according to column values
library(dplyr)
library(tidyr)
test <- data.frame(DT %>%
  group_by(yr, id) %>%
  complete(week = 1:max(week)) %>%
  replace_na(list(days = 0)) %>%
  group_by(yr, id) %>%
  mutate_each(funs(replace(., is.na(.), mean(., na.rm = TRUE))), p1:f4))
poisson <- data.table(test)
setkey(poisson, yr, id, week)
avrev <- poisson[, .(avrev = mean(rev, na.rm = TRUE)), by = .(p1, p2, p3, week, yr)]
avrev <- transform(avrev, xyz = interaction(p1, p2, p3, week, yr, sep = ''))
poisson <- transform(poisson, xyz = interaction(p1, p2, p3, week, yr, sep = ''))
poisson <- transform(poisson, uniqueid = interaction(id, yr, sep = ''))
poisson$rev[is.na(poisson$rev)] <- avrev$avrev[match(poisson$xyz[is.na(poisson$rev)], avrev$xyz)]
poisson[, lagrev := c(rev[1], rev[-.N]), by = uniqueid]
I'm sure there is a much nicer and neater way to accomplish the task, but this works. David Arenburg also posted an answer in the comments that uses data.table to create the new rows; see the other post.
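For reference, a hedged data.table-only sketch of the row-completion step (reusing the DT object from the snippet above; this is not the linked answer verbatim):
library(data.table)
# build the full yr/id/week grid up to each vessel's last active week
grid <- DT[, .(week = seq_len(max(week))), by = .(yr, id)]
full <- DT[grid, on = .(yr, id, week)]   # right join adds the missing weeks
full[is.na(days), days := 0]             # inactive weeks get zero days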
To get the average revenue by week, year, p1, p2, and p3, just use the aggregate function:
average_rev <- aggregate(rev ~ week + yr + p1 + p2 + p3, data = your_dataframe, FUN = mean)
To add a new column of lagged revenues:
your_dataframe$lagged_rev <- c(NA, your_dataframe$rev[1:(nrow(your_dataframe) - 1)])
To get the average revenue over the current and previous week:
your_dataframe$avg_rev <- rowMeans(your_dataframe[,c('rev','lagged_rev')])
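Note that the base-R shift above lags across the whole frame; a hedged per-vessel, per-year variant, assuming dplyr and the yr/id/week columns from the question:
library(dplyr)
your_dataframe <- your_dataframe %>%
  arrange(yr, id, week) %>%
  group_by(yr, id) %>%                 # lag within each vessel-year
  mutate(lagged_rev = lag(rev)) %>%
  ungroup()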
I am performing a research project into the factors that make someone more likely to vote, with a focus on the distance people live from a polling place. I have the full voter registration and voter histories for millions of individuals. There are several ways in which someone can vote (in person, absentee, early, or provisional) or not vote (not registered, registered but didn't vote, or ineligible to vote). My data comes with a column (29) for how someone voted in a given election: NULL means not registered, V means in person, etc.
For regression analysis, I want to create a different column for each voter type (1 for yes, 0 for no, column numbers 68-74) and another 1/0 column (number 75) for whether or not someone voted at all. The code I wrote below should do the trick, but it's running impossibly slowly on my computer and hasn't even been able to get to the 1000th row after an hour. It works perfectly, except the speed. I've been approved to use my university's supercomputer*, but I want to figure out a faster algorithm. I have R and STATA both on my laptop and the supercomputer* and would be happy to use either.
dcv.new <- read.csv("VoterHist.csv", header=TRUE)
# I previously set columns 68-75 to default to 0
for(i in 1:nrow(dcv.new))
{
if(is.na(dcv.new[i,29]))
{
dcv.new[i,69] <- 1
}
else if(dcv.new[i,29]=="V")
{
dcv.new[i,68] <- 1
dcv.new[i,75] <- 1
}
else if(dcv.new[i,29]=="A")
{
dcv.new[i,70] <- 1
dcv.new[i,75] <- 1
}
else if(dcv.new[i,29]=="N")
{
dcv.new[i,71] <- 1
}
else if(dcv.new[i,29]=="E")
{
dcv.new[i,72] <- 1
}
else if(dcv.new[i,29]=="Y")
{
dcv.new[i,73] <- 1
}
else if(dcv.new[i,29]=="P")
{
dcv.new[i,74] <- 1
dcv.new[i,75] <- 1
}
else if(dcv.new[i,29]=="X")
{
dcv.new[i,74] <- 1
dcv.new[i,75] <- 1
}
}
*Technically "High performance computing cluster", but let's be honest, supercomputer sounds way cooler.
R is vectorised, in the main, so look for vectorised operations in place of loops. In this case you can vectorise each operation so it works on the entire matrix rather than on individual rows.
Here are the first three of your if else statements:
dcv.new[is.na(dcv.new[,29]), 69] <- 1
dcv.new[dcv.new[,29]=="V", c(68,75)] <- 1
dcv.new[dcv.new[,29]=="A", c(70,75)] <- 1
....
You should get the idea.
Some explanation:
What we are doing is selecting rows from certain columns of dcv.new that meet criteria (such as == "V") and then we assign the value 1 to each of those selected elements of dcv.new in a single operation. R recycles the 1 that we assigned such that it becomes the same length as that required to fill all the selected elements.
Note how we select more than one column at once for updating: dcv.new[x , c(68,75)] updates columns 68 and 75 for rows x only, where x is a logical vector indexing the rows we need to update. The logical vector is produced by statements like dcv.new[,29]=="V". These return a TRUE if an element of dcv.new[,29] equals "V" and FALSE if not.
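If you'd rather not write out every remaining case, a hedged sketch that finishes the chain with a small lookup table (vote codes and column numbers taken from the question):
# Map each vote code to its indicator column, per the question's scheme
code_col <- c(V = 68, A = 70, N = 71, E = 72, Y = 73, P = 74, X = 74)
for (code in names(code_col)) {
    dcv.new[which(dcv.new[, 29] == code), code_col[[code]]] <- 1
}
# Column 75 flags whether the person voted at all
dcv.new[which(dcv.new[, 29] %in% c("V", "A", "P", "X")), 75] <- 1
# NA (not registered) goes in column 69
dcv.new[is.na(dcv.new[, 29]), 69] <- 1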
However...!
In the case of regression, we can let R make the matrix of dummy variables for us; we don't need to do it by hand. Say the column dcv.new[, 29] was named voterType. If we coerce it to be a factor
dcv.new <- transform(dcv.new, voterType = factor(voterType))
when we fit a model using the formula notation we can do:
mod <- lm(response ~ voterType, data = dcv.new)
and R will create the appropriate contrasts to make voterType use the correct degrees of freedom. By default R uses the first level of a factor as the base level, and hence model coefficients represent deviations from this reference level. To see the reference level for voterType after converting it to a factor, do
with(dcv.new, levels(voterType)[1])
Note that most modelling functions that take a formula, like the one shown above, work as I described and show below. You aren't restricted to lm() models.
Here is a small example
set.seed(42)
dcv.new <- data.frame(response = rnorm(20),
voterType = sample(c("V","A","N","E","Y","P","X",NA), 20,
replace = TRUE))
head(dcv.new)
> head(dcv.new)
response voterType
1 1.3709584 E
2 -0.5646982 E
3 0.3631284 V
4 0.6328626 <NA>
5 0.4042683 E
6 -0.1061245 <NA>
The model can then be fitted as
mod <- lm(response ~ voterType, data = dcv.new)
summary(mod)
giving in this case
> mod <- lm(response ~ voterType, data = dcv.new)
> summary(mod)
Call:
lm(formula = response ~ voterType, data = dcv.new)
Residuals:
Min 1Q Median 3Q Max
-2.8241 -0.4075 0.0000 0.5856 1.9030
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.656 1.425 -1.864 0.0952 .
voterTypeE 2.612 1.593 1.639 0.1356
voterTypeN 3.040 1.646 1.847 0.0978 .
voterTypeP 2.742 1.646 1.666 0.1300
voterTypeV 2.771 1.745 1.588 0.1468
voterTypeX 2.378 2.015 1.180 0.2684
voterTypeY 3.285 1.745 1.882 0.0925 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.425 on 9 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.3154, Adjusted R-squared: -0.1411
F-statistic: 0.6909 on 6 and 9 DF, p-value: 0.6635
The magic all happens with the formula code, but essentially what happens behind the scenes is that once R has located all the variables named in the formula, it ends up calling something like
model.matrix( ~ voterType, data = dcv.new)
which generates the covariate matrix needed for the underlying matrix algebra and QR decomposition. That code above, for the small example gives:
> model.matrix(~ voterType, data = dcv.new)
(Intercept) voterTypeE voterTypeN voterTypeP voterTypeV voterTypeX
1 1 1 0 0 0 0
2 1 1 0 0 0 0
3 1 0 0 0 1 0
5 1 1 0 0 0 0
8 1 0 0 1 0 0
10 1 0 0 0 0 0
11 1 0 1 0 0 0
12 1 0 1 0 0 0
13 1 1 0 0 0 0
14 1 0 0 0 0 1
15 1 0 0 0 1 0
16 1 0 0 1 0 0
17 1 0 0 1 0 0
18 1 0 0 0 0 0
19 1 0 1 0 0 0
20 1 0 0 0 0 0
voterTypeY
1 0
2 0
3 0
5 0
8 0
10 1
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 1
attr(,"assign")
[1] 0 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$voterType
[1] "contr.treatment"
This is what you are trying to do with your code. So if you really need it, you could use model.matrix() as I show to generate the matrix, stripping off the attributes as you don't need them.
In this case the reference level is "A":
> with(dcv.new, levels(voterType)[1])
[1] "A"
which is represented by the (Intercept) column in the output from model.matrix. Note that these treatment contrasts code for deviations from the reference level. You can get dummy values by suppressing the intercept in the formula, by adding -1 (or + 0):
> model.matrix(~ voterType - 1, data = dcv.new)
voterTypeA voterTypeE voterTypeN voterTypeP voterTypeV voterTypeX voterTypeY
1 0 1 0 0 0 0 0
2 0 1 0 0 0 0 0
3 0 0 0 0 1 0 0
5 0 1 0 0 0 0 0
8 0 0 0 1 0 0 0
10 0 0 0 0 0 0 1
11 0 0 1 0 0 0 0
12 0 0 1 0 0 0 0
13 0 1 0 0 0 0 0
14 0 0 0 0 0 1 0
15 0 0 0 0 1 0 0
16 0 0 0 1 0 0 0
17 0 0 0 1 0 0 0
18 1 0 0 0 0 0 0
19 0 0 1 0 0 0 0
20 0 0 0 0 0 0 1
attr(,"assign")
[1] 1 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$voterType
[1] "contr.treatment"
You should vectorize your code and forget about so many ifs:
dcv.new[is.na(dcv.new[,29]),69] <- 1
dcv.new[dcv.new[,29] == "V", c(68, 75)] <- 1
....
Continue as needed