Handling alternative-specific NA values in mlogit - r

It is common in mode choice models to have variables that vary with alternatives ("generic variables") but that are undefined for certain modes. For example, transit fare is present for bus and light rail, but undefined for automobiles and biking. Note that the fare is not zero.
I'm trying to make this work with the mlogit package for R. In this MWE I've asserted that price is undefined for fishing from the beach. This results in a singularity error.
library(mlogit)
#> Warning: package 'mlogit' was built under R version 3.5.2
#> Loading required package: Formula
#> Loading required package: zoo
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#> as.Date, as.Date.numeric
#> Loading required package: lmtest
data("Fishing", package = "mlogit")
Fishing$price.beach <- NA
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
head(Fish)
#> mode income alt price catch chid
#> 1.beach FALSE 7083.332 beach NA 0.0678 1
#> 1.boat FALSE 7083.332 boat 157.930 0.2601 1
#> 1.charter TRUE 7083.332 charter 182.930 0.5391 1
#> 1.pier FALSE 7083.332 pier 157.930 0.0503 1
#> 2.beach FALSE 1250.000 beach NA 0.1049 2
#> 2.boat FALSE 1250.000 boat 10.534 0.1574 2
mlogit(mode ~ catch + price | income, data = Fish, na.action = na.omit)
#> Error in solve.default(H, g[!fixed]): system is computationally singular: reciprocal condition number = 3.92205e-24
Created on 2019-07-08 by the reprex package (v0.2.1)
This happens when price is moved to the alternative-specific variable position as well. I think the issue may lie in the na.action function argument, but I can't find any documentation on this argument beyond the basic documentation tag:
na.action: a function which indicates what should happen when the data contains NAs
There appear to be no examples showing how this term is used differently and what the results are. There's a related unanswered question here.

There appear to be a few things going on.
I am not quite sure how na.action = na.omit works under the hood, but it sounds to me like it will drop the entire row. I always find it better to do this explicitly.
When you drop the entire row, you will have choice occasions where no choice was made. This is not going to work. Remember, we are working with logit type probabilities. Furthermore, if no choice is made, no information is gained, so we need to drop these choice observations entirely. Doing these two steps in combination, I am able to run the model you propose.
Here is a commented working example:
library(mlogit)
# Read in the data
data("Fishing", package = "mlogit")
# Set price for the beach option to NA
Fishing$price.beach <- NA
# Scale income
Fishing$income <- Fishing$income / 10000
# Turn into 'mlogit' data
fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
# Explicitly drop the alts with NA in price
fish <- fish[fish$alt != "beach", ]
# Dropping all NA also means that we now have choice occasions where no choice
# was made and we need to get rid of these as well
fish$choice_made <- rep(colSums(matrix(fish$mode, nrow = 3)), each = 3)
fish <- fish[fish$choice_made == 1, ]
fish <- mlogit.data(fish, shape = "long", alt.var = "alt", choice = "mode")
# Run an MNL model
mnl <- mlogit(mode ~ catch + price | income, data = fish)
summary(mnl)
In general, when working with these models, I find it very useful to always make all data transformations before running a model rather than rely on functions such as na.action.
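The two clean-up steps described above can also be sketched in base R on a small made-up data set. The data frame `toy` and its values are invented for illustration and are not part of the Fishing data. Unlike the `rep(colSums(...))` approach, this version groups by `chid` with `ave()`, so it does not assume the rows are sorted or that every choice situation has the same number of remaining alternatives.

```r
# Toy long-format choice data: 3 choice situations, 3 alternatives each.
# All values are made up for illustration.
toy <- data.frame(
  chid  = rep(1:3, each = 3),
  alt   = rep(c("beach", "boat", "pier"), times = 3),
  mode  = c(TRUE,  FALSE, FALSE,   # situation 1: beach chosen
            FALSE, TRUE,  FALSE,   # situation 2: boat chosen
            FALSE, FALSE, TRUE),   # situation 3: pier chosen
  price = c(NA, 10, 12,  NA, 20, 22,  NA, 30, 32)
)

# Step 1: explicitly drop the alternatives whose price is undefined
toy_kept <- toy[!is.na(toy$price), ]

# Step 2: count the chosen alternatives left in each choice situation
n_chosen <- ave(as.integer(toy_kept$mode), toy_kept$chid, FUN = sum)

# Keep only the situations that still contain exactly one chosen alternative
toy_clean <- toy_kept[n_chosen == 1, ]
unique(toy_clean$chid)
```

Situation 1 disappears because its chosen alternative (beach) was removed in step 1, which is the case the answer warns about.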

Related

Trying to create a data frame for mlogit and keep running into this error Error in names(data)[ix] : invalid subscript type 'language'

I am trying to use this data set https://data.cityofnewyork.us/Transportation/Citywide-Mobility-Survey-Person-Survey-2019/6bqn-qdwq to create an mnl model but every time I try to change my original data frame like this
nydata_df = dfidx(nydata, shape="wide",choice="work_mode",varying = sort)
I get this error here.
Error in names(data)[ix] : invalid subscript type 'language'
I'm unclear about what is causing this error. I think it is something wrong with dplyr, but I am not sure.
According to this vignette from the mlogit package, the varying argument should be used to specify which variables should be "lengthened" when converting a dataframe from wide to long using dfidx. Are you actively trying to lengthen your dataframe (like in the style of dplyr::pivot_longer())?
If you aren't, I don't believe that you need the varying argument (see ?stats::reshape for more info on varying). If you want to use the varying argument, you should specify specific variables rather than only "sort" (example1, example2). Additionally, when I run your models, I don't get a NaN for McFadden's R2, p-value, or chi-square test. Are your packages fully updated?
library(dfidx)
library(mlogit)
library(performance) # to extract McFadden's R2 easily
packageVersion("dfidx")
#> [1] '0.0.5'
packageVersion("mlogit")
#> [1] '1.1.1'
packageVersion("dplyr")
#> [1] '1.0.10'
# currently running RStudio Version 2022.7.2.576
nydata <- read.csv(url("https://data.cityofnewyork.us/api/views/6bqn-qdwq/rows.csv?accessType=DOWNLOAD"))
nydata_df <- dfidx(data = nydata,
shape = "wide",
choice = "work_mode")
m <- mlogit(work_mode ~ 1, nydata_df)
#summary(m)
r2_mcfadden(m)
#> McFadden's R2
#> 1.110223e-16
m3 <- mlogit(work_mode ~ 1 | harassment_mode + age, nydata_df)
#summary(m3)
r2_mcfadden(m3)
#> McFadden's R2
#> 0.03410362
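As a rough illustration of what `varying` is for, here is a base R sketch using `stats::reshape()`, which the answer points to for details. The data frame `wide` and its column names (`price.bus`, `time.car`, and so on) are invented for this example and have nothing to do with the survey data. The point is that `varying` is given column names or indices, not a function such as `sort`.

```r
# Made-up wide data: one row per person, one column per alternative/attribute
wide <- data.frame(
  id        = 1:2,
  choice    = c("bus", "car"),
  price.bus = c(2, 3),
  price.car = c(5, 6),
  time.bus  = c(30, 40),
  time.car  = c(15, 20)
)

# varying = 3:6 names the columns to lengthen; sep = "." tells reshape()
# to split "price.bus" into the variable "price" and the alternative "bus"
long <- reshape(wide,
                direction = "long",
                varying   = 3:6,
                sep       = ".",
                idvar     = "id",
                timevar   = "alt")
long
```

The result has one row per person and alternative, with `alt`, `price` and `time` columns replacing the four wide columns.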

Why is the quantile function not working for this dplyr function?

I'm working through Faraway's 2016 book Extending the Linear Model with R and have encountered an issue with the code that I don't know how to fix. Here is the relevant syntax leading up to the error:
#### Load Data & Libraries ####
library(faraway)
library(tidyverse)
data(wcgs)
#### Add Variables ####
wcgs$y <- ifelse(wcgs$chd == "no",0,1) # create binary response from chd
wcgs$bmi <- with(wcgs,
703*wcgs$weight/(wcgs$height^2)) # create BMI variable
#### Create GLM Model ####
lmod <- glm(chd ~ height + cigs,
family = binomial,
wcgs)
#### Mutate Data ####
wcgs <- mutate(wcgs,
residuals=residuals(lmod),
linpred=predict(lmod)) # create residuals/pred values
And this is the part where the error arises (the third statement, the group_by call that bins linpred):
#### Error Code (Last Line) ####
wcgsm <- na.omit(wcgs) # omit NA values
wcgsm <- mutate(wcgsm,
predprob=predict(lmod,
type="response")) # make pred data
gdf <- group_by(wcgsm,
cut(linpred,
breaks=unique(quantile(linpred,
(1:100)/101)))) # bin NA
Which gives me this error:
Error in `group_by()`:
! Problem adding computed columns.
Caused by error in `mutate()`:
! Problem while computing `..1 = cut(linpred, breaks = unique(quantile(linpred,
(1:100)/101)))`.
✖ `..1` must be size 3140 or 1, not 3154.
I don't understand what this error means. When I run dim(wcgs), I get 3154 rows, and when I run dim(na.omit(wcgs)), I get 3140 rows. The only thing I can think of is that the predicted model values don't line up with the new na.omit data, but I'm not sure how to work around that, given that the rest of this chapter uses this data manipulation.
When newdata is not supplied, the predict methods for R's modeling functions return predictions for the data set the model was fitted to. To predict for a different data set, in this case wcgsm, a subset of wcgs, the newdata argument must be set explicitly.
The error in the predict line at the bottom is therefore expected behavior.
#### Load Data & Libraries ####
suppressPackageStartupMessages({
#library(faraway)
library(dplyr)
})
data(wcgs, package = "faraway")
#### Add Variables ####
wcgs$y <- as.integer(wcgs$chd == "yes") # create binary response from chd
wcgs$bmi <- with(wcgs, 703*weight/(height^2)) # create BMI variable
#### Create GLM Model ####
lmod <- glm(chd ~ height + cigs, family = binomial, data = wcgs)
#### Mutate Data ####
# create residuals/pred values
wcgs <- mutate(wcgs,
residuals = residuals(lmod),
linpred = predict(lmod))
wcgsm <- na.omit(wcgs) # omit NA values
wcgsm <- mutate(wcgsm,
predprob = predict(lmod, type="response")) # make pred data
#> Error in `mutate()`:
#> ! Problem while computing `predprob = predict(lmod, type = "response")`.
#> ✖ `predprob` must be size 3140 or 1, not 3154.
Created on 2022-07-16 by the reprex package (v2.0.1)
See where the error comes from.
predprob_all <- predict(lmod, type = "response")
predprob_na.omit <- predict(lmod, newdata = wcgsm, type = "response")
length(predprob_all)
#> [1] 3154
length(predprob_na.omit)
#> [1] 3140
Created on 2022-07-16 by the reprex package (v2.0.1)
These lengths are the values in the error message, once again, as expected.
There is also the problem of the quantiles in cut(., breaks) not spanning the entire range of linpred. Values outside the quantiles' range will become NA. This is solved with the two endpoints of the breaks vector.
And I have given a name to the binned vector.
The following code works and, I believe, does what is needed.
wcgsm <- na.omit(wcgs) # omit NA values
wcgsm <- mutate(wcgsm,
predprob = predict(lmod, newdata = wcgsm, type="response")) # make pred data
breaks <- c(-Inf,
unique(quantile(wcgsm$linpred, (1:100)/101)),
Inf)
gdf <- group_by(wcgsm,
bins = cut(linpred, breaks = breaks)) # bin NA
anyNA(gdf$bins)
#> [1] FALSE
Created on 2022-07-16 by the reprex package (v2.0.1)
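The point about the breaks not spanning the whole range can be seen in isolation with a small made-up vector. This is only a sketch with invented numbers, not the wcgs data.

```r
x <- 1:10

# Interior quantiles only: the smallest and largest values fall outside
inner <- unique(quantile(x, c(0.25, 0.5, 0.75)))

bins_inner <- cut(x, breaks = inner)
anyNA(bins_inner)   # values below the first or above the last break become NA

# Padding the breaks with -Inf and Inf gives every value a bin
bins_full <- cut(x, breaks = c(-Inf, inner, Inf))
anyNA(bins_full)
```

With the interior breaks alone, the values 1 to 3 and 8 to 10 get no bin. With the padded breaks, every value is assigned to one of four bins.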

Error in t.default(x) : argument is not a matrix

I am trying to see what people's willingness to pay is for either nuclear or wind energy (far away or local) through a stated choice preference. I used the multinomial logit model; however, when estimating the discrete choice for the different scenarios, I keep getting an error:
Error in t.default(x) : argument is not a matrix
While gmnl gives this error, mixl seems to be working fine.
Code:
install.packages("gmnl")
install.packages("mlogit")
library("gmnl") # Load gmnl package
library("mlogit") # Load mlogit package
library(readxl)
Example_data <- read_excel("Example data.xlsx")
View(Example_data)
data <- as.data.frame(Example_data)
df01 <- mlogit.data(data,
id.var = "id",
choice = "Choice",
varying = 3:17,
shape = "wide",
sep = "")
lc <- gmnl(Choice ~ MODE + DWELLING + SIZE + COST + DISTANCE | 0 | 0 | 0 | 1 ,
data = df01,
model = 'lc',
Q = 3,
panel = TRUE,
method = "bhhh")
It could be that there is something wrong with my data. However, when comparing previous works from other people my data is setup in a similar way and I cannot run their calculations either.
From what I have seen in earlier posts, it could also be a package problem. But what can I do to fix it, or to continue, if that is the case?
The picture below shows an example of the data for 2 individuals, which consists of 15 scenarios with 3 options to choose from.
Example_data

De-identifying survival or flexsurvreg objects in R

Please consider the following:
I need to provide some R code to analyse data with the flexsurv package. I am not allowed to receive or analyse the data directly or on-site. I am, however, allowed to receive the analysis results.
Problem
When we run the flexsurvreg() function on some data (here ovarian from the flexsurv package), the created object (here fitw) contains enough information to "re-create" or "back-engineer" the actual data. But then I would technically have access to the data I am not allowed to have.
# Load package
library("flexsurv")
#> Loading required package: survival
# Run flexsurvreg with data = ovarian
fitw <- flexsurvreg(formula = Surv(futime, fustat) ~ factor(rx) + age,
data = ovarian, dist="weibull")
# Look at first observation in ovarian
ovarian[1, ]
#> futime fustat age resid.ds rx ecog.ps
#> 1 59 1 72.3315 2 1 1
# With the following from the survival object, the data could be re-created
fitw$data$Y[1, ]
#> time status start stop time1 time2
#> 59 1 0 59 59 Inf
fitw$data$m[1, ]
#> Surv(futime, fustat) factor(rx) age (weights)
#> 1 59 1 72.3315 1
Potential solution
We could write the code so that it also sets all those data that might be used for this back-engineering to NA as follows:
# Setting all survival object observation to NA
fitw$data$Y <- NA
fitw$data$m <- NA
fitw$data$mml$scale <- NA
fitw$data$mml$rate <- NA
fitw$data$mml$mu <- NA
Created on 2021-08-27 by the reprex package (v2.0.0)
Question
If I proceed as the above and set all these parameters to NA, could I then receive the fitw object (e.g. as an .RDS file) without ever being able to "back-engineer" the original data? Or is there any other way to share fitw without the attached data?
Thanks!
Setting, e.g., fitw$data <- NULL will remove all the individual-level data from the fitted model object. Some of the output functions may not work with objects stripped of data, however. In the current development version on GitHub, printing the model object should work. The summary and predict methods should also work, as long as covariate values are supplied in newdata; omitting them won't work, since the default is to take the covariate values from the observed data.

Error "$ operator is invalid for atomic vectors" despite not using atomic vectors or $

Hello fellow Stackers! This is my first question, so I am curious whether you can help me! :)
First: I checked similar questions and unfortunately none of the solutions worked for me. I have been trying for nearly 3 days now :/ Since I am working with sensitive data, I cannot provide the original table for a reprex, unfortunately. However, I will create a small substitute example table for testing.
To get to the problem:
I want to predict a norm value using the package "cNORM". It requires raw data, classification data, a model, min/max values, and some other things that are less important. The problem is: whatever I do, whatever data type and working directory I use, it gives me the error "$ operator is invalid for atomic vectors". To change that, I transformed the original .sav file to a data frame, but nothing happened. I tested the type of the data, and it said data frame, not atomic vector. I also tried using "[1]" for location or ["Correct"] for names, but the same error showed up. The same happened when using two single data frames and when using lists. I tried using $ to check whether I would get a different error, but got the same one. I even used another workspace to check whether the old workspace was bugged.
So maybe I just made really stupid mistakes, but I really tried and it did not work out, so I am asking you what the solution might be. Here is some data to test! :)
install.packages("haven")
library(haven)
install.packages("cNORM") # package names are case-sensitive
library(cNORM)
SpecificNormValue <- predictNorm((Data_4[1]),(Data_4[2]),model = T,minNorm = 12, maxNorm = 75, force = FALSE, covariate = NULL)
So that is one of the commands I used on the data frame "Data_4". I also tried not using brackets, or using "xxx" to get the column names, but to no avail.
The following is the example data frame. To test it more realistically, I would recommend an Excel file with 2 columns and 900 rows (plus column titles), like the original. The "Correct" values can be randomly selected by Excel and range from 35 to 50; the age ranges from 6 to 12.
Correct  Age
40       6
45       7
50       6
35       6
I really hope one of you can figure out the problem and how I can get the command running properly. I really have no other idea right now.
Thanks for checking my question and thanks in advance for your time! I would be glad to hear from you!
The source of that error isn't your data, it's the third argument to predictNorm: model = T. According to the predictNorm documentation, this is supposed to be a "regression model or a cnorm object". Instead you are passing in a logical value (T = TRUE) which is an atomic vector and causes this error when predictNorm tries to access the components of the model with $.
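The error itself can be reproduced in base R without cNORM. In this minimal sketch, `model` stands in for the value passed as `model = T`, and `fit` is just an ordinary `lm` object on the built-in `cars` data, used here as an example of something that `$` can be applied to.

```r
# A logical value is an atomic vector, so `$` cannot extract components from it
model <- TRUE
msg <- tryCatch(model$coefficients,
                error = function(e) conditionMessage(e))
print(msg)

# A fitted model is a list-like object, so `$` works as expected
fit <- lm(dist ~ speed, data = cars)
names(fit$coefficients)
```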
I don't know enough about your problem to say what kind of model you need to use to get the answer you want, but for example passing it an object constructed by cnorm() returns without an error using your data and parameters (there are some warnings because of the small size of your test dataset):
library(haven)
library(cNORM)
#> Good morning star-shine, cNORM says 'Hello!'
Data_4 <- data.frame(correct = c(40, 45, 50, 35),
age = c(6,7,6,6))
SpecificNormValue <- predictNorm(Data_4$correct,
Data_4$age,
model = cnorm(Data_4$correct, Data_4$age),
minNorm = 12,
maxNorm = 75,
force = FALSE,
covariate = NULL)
#> Warning in rankByGroup(raw = raw, group = group, scale = scale, weights =
#> weights, : The dataset includes cases, whose percentile depends on less than
#> 30 cases (minimum is 1). Please check the distribution of the cases over the
#> grouping variable. The confidence of the norm scores is low in that part of the
#> scale. Consider redividing the cases over the grouping variable. In cases of
#> disorganized percentile curves after modelling, it might help to reduce the 'k'
#> parameter.
#> Multiple R2 between raw score and explanatory variable: R2 = 0.0667
#> Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
#> force.in, : 21 linear dependencies found
#> Reordering variables and trying again:
#> Warning in log(vr): NaNs produced
#> Warning in log(vr): NaNs produced
#> Specified R2 falls below the value of the most primitive model. Falling back to model 1.
#> R-Square Adj. = 0.993999
#> Final regression model: raw ~ L4A3
#> Regression function: raw ~ 30.89167234 + (6.824413606e-09*L4A3)
#> Raw Score RMSE = 0.35358
#>
#> Use 'printSubset(model)' to get detailed information on the different solutions, 'plotPercentiles(model) to display percentile plot, plotSubset(model)' to inspect model fit.
Created on 2020-12-08 by the reprex package (v0.3.0)
Note I used Data_4$age and Data_4$correct for the first two arguments. Data_4[,1] and Data_4[[1]] also work, but Data_4[1] doesn't, because that returns a subset of a data frame not a vector as expected by predictNorm.
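The difference between these subsetting forms can be checked directly in base R, using the same small example data frame as above:

```r
Data_4 <- data.frame(correct = c(40, 45, 50, 35),
                     age     = c(6, 7, 6, 6))

# Single brackets with one index keep the data frame structure
one_col_df <- Data_4[1]
is.data.frame(one_col_df)

# Double brackets, $, and [, 1] on a plain data frame return the bare vector
v1 <- Data_4[[1]]
v2 <- Data_4$correct
v3 <- Data_4[, 1]
is.numeric(v1)
identical(v1, v2)
identical(v1, v3)
```

One caveat: data read with haven arrive as a tibble, and for tibbles `[, 1]` also returns a one-column tibble rather than a vector, so `[[1]]` or `$` are the safer choices there.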
