How to use a multinomial logistic regression model to predict future observations - r

My question seems a little vague so I will provide background context and my reproducible code to try and clarify.
I am interested in classifying crime occurrences in various neighbourhoods of a city, based on each neighbourhood's socioeconomic indicators. My end goal is to generate a reasonably accurate prediction of the neighbourhood in which the next crime is most likely to occur. I chose to fit a multinomial logistic regression model, and I am having a hard time interpreting its results.
Here is how my data looks:
> str(df)
'data.frame': 1796 obs. of 12 variables:
$ Time : chr "14:37:00" "14:37:00" "16:23:00" "00:10:00" ...
$ Neighbourhood : chr "Grand Boulevard" "Grand Boulevard" "West Town" "West Englewood" ...
$ Population : num 22209 22209 84698 26346 24976 ...
$ Area : num 1.74 1.74 4.58 3.15 2.55 2.95 3.15 1.04 7.15 1.28 ...
$ Density : chr "12,763.79" "12,763.79" "18,493.01" "8,363.81" ...
$ Crowded.Housing: num 3.3 3.3 2.3 4.8 2.7 3.3 4.8 2.4 6.3 9.4 ...
$ Poverty : num 29.3 29.3 14.7 34.4 8.9 27.8 34.4 21.7 28.6 41.7 ...
$ Unemployment : num 24.3 24.3 6.6 35.9 9.5 24 35.9 15.7 22.6 25.8 ...
$ Education : num 15.9 15.9 12.9 26.3 18.8 14.5 26.3 11.3 24.4 24.5 ...
$ Age : num 39.5 39.5 21.7 40.7 37.6 40.3 40.7 35.4 37.9 43.6 ...
$ Income : num 23472 23472 43198 11317 25113 ...
$ Hardship : num 57 57 10 89 29 60 89 26 73 92 ...
Here is the code for my model:
c.nnet <- nnet::multinom(
  Neighbourhood ~ Crowded.Housing + Poverty + Unemployment +
    Education + Income + Hardship,
  data = df,
  MaxNWts = 100000
)
Here are some classification accuracy metrics (classAgreement() comes from the e1071 package):
> odds <- c.nnet[["fitted.values"]]
> pd = predict(c.nnet,type="class")
> table = table(df$Neighbourhood, pd); classAgreement(table)
$diag
[1] 0.6631403
$kappa
[1] 0.6451884
$rand
[1] 0.9560459
$crand
[1] 0.6035169
> sum(diag(table))/sum(table)
[1] 0.6631403
Lastly, here is the output of the predicted classes and the associated class probabilities.
> head(pd)
[1] Chatham Chatham West Town West Englewood New City Chatham
72 Levels: Albany Park Archer Heights Armour Square Ashburn Auburn Gresham Austin Avalon Park Avondale Belmont Cragin Bridgeport Brighton Park ... Woodlaw
> head(odds)
Albany Park Archer Heights Armour Square Ashburn Auburn Gresham Austin Avalon Park Avondale Belmont Cragin Bridgeport Brighton Park
1 8.293444e-04 3.078169e-04 3.394213e-04 5.070003e-04 0.0333699087 8.205015e-03 0.0140058699 3.519157e-04 0.0005199967 3.962345e-04 1.796575e-05
2 8.293444e-04 3.078169e-04 3.394213e-04 5.070003e-04 0.0333699087 8.205015e-03 0.0140058699 3.519157e-04 0.0005199967 3.962345e-04 1.796575e-05
3 7.276802e-04 2.796196e-06 1.540627e-03 9.642981e-03 0.0001623333 4.575838e-05 0.0004173684 1.229428e-03 0.0007718075 2.308536e-02 9.021844e-03
4 7.168266e-05 7.869570e-04 1.743114e-05 3.519012e-05 0.0473000895 9.256728e-02 0.0058524740 4.373425e-05 0.0002943829 4.752441e-06 6.214005e-07
5 2.376865e-03 3.647976e-04 3.261888e-03 5.958128e-02 0.0090540446 4.103546e-02 0.0028125946 9.329274e-03 0.0339153709 1.394973e-02 9.034131e-02
6 7.735586e-04 5.958576e-04 2.345032e-04 4.058962e-04 0.0833015893 2.374063e-02 0.0169124221 3.038695e-04 0.0005576943 2.163316e-04 1.263609e-05
As far as my understanding goes, the latter output (odds) represents the probability of each crime occurrence belonging to each of the 72 unique neighbourhoods in my data, while the former (pd) gives the predicted class for each observation. This leads to my specific question: how can I use these predicted classes to generate some sort of forecast of where the next crime is likely to occur (i.e. something like a one-step-ahead time-series forecast)?

You can create a newdata data frame with the values you want to predict over and then use the predict function to obtain predicted probabilities for each class. For example,
# estimate model
library(nnet)
dat <- mtcars
dat$gear <- factor(dat$gear)
mod <- multinom(gear ~ mpg + hp, data = dat)
# what values we want predictions for
out_of_sample_data <- data.frame(mpg = c(19, 20), hp = c(130, 140))
# generate predicted probabilities
predict(mod, newdata = out_of_sample_data, type = "probs")
#> 3 4 5
#> 1 0.6993027 0.2777716 0.02292562
#> 2 0.6217686 0.2750779 0.10315351
Obviously, you'll need to populate your out-of-sample data with values you believe will occur in the future, which can be tricky (to say the least).
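If what you want is a single "most likely" class per new observation (the analogue of your one-step-ahead forecast), you can take the column with the highest predicted probability for each row. A minimal sketch continuing the mtcars example above; max.col() is base R, and the model setup is the same as in the answer:

```r
library(nnet)

# same toy model as above
dat <- mtcars
dat$gear <- factor(dat$gear)
mod <- multinom(gear ~ mpg + hp, data = dat, trace = FALSE)

out_of_sample_data <- data.frame(mpg = c(19, 20), hp = c(130, 140))

# one row of class probabilities per new observation
probs <- predict(mod, newdata = out_of_sample_data, type = "probs")

# the most likely class for each row is the column with the largest probability
most_likely <- colnames(probs)[max.col(probs)]
most_likely
```

In your case the columns would be the 72 neighbourhoods, so most_likely would name the neighbourhood with the highest predicted probability for each hypothetical future observation.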

Related

Dealing with outliers by using interquartile ranges - but afterwards just different outliers

I am dealing with a dataset consisting of several key banking balance sheet and income statement figures (deleted some variables for this post):
'data.frame': 52028 obs. of 38 variables:
$ institutionid : int 4307883 4209717 4558501 4392480 4306242 4303334 114518 4183859 4307849 4256486 ...
$ fiscalyear : Factor w/ 8 levels "2010","2011",..: 1 1 1 1 1 1 1 1 1 1 ...
$ institutionname : chr "Kure Shinkin Bank" "Shinkin Central Bank" "Shibata Shinkin Bank" "Takasaki Shinkin Bank" ...
$ Tier 1 Ratio : num 9.8 20.68 13.93 6.84 19.43 ...
$ snlindustryid : int 28 2 28 2 2 1 1 1 2 1 ...
$ snlindustryname : chr "Other Banking" "Savings Bank/Thrift/Mutual" "Other Banking" "Savings Bank/Thrift/Mutual" ...
$ countryname : chr "Japan" "Japan" "Japan" "Japan" ...
$ Interest Income : num 141.3 3330.3 16.2 83.6 289.8 ...
$ Net Interest Income : num 122.8 756.4 14.1 74.4 250.4 ...
$ Operating Revenue : num 137.8 NA 13.8 80.1 NA ...
$ Provision for Loan Losses: num 27.546 NA 0.535 13.26 NA ...
$ Compensation and Benefits: num NA NA 6.07 36.8 NA ...
$ EBIT : num 27.04 2814.57 5.05 16.67 88.05 ...
$ Net Income befoire Taxes : num 8.57 224.58 2.98 7.42 48.62 ...
$ Provision for Taxes : num -7.861 -113.864 0.159 0.125 14.525 ...
$ Net Income : num 16.43 338.45 2.83 7.29 34.1 ...
$ net_margin : num 2.98 1.06 3.56 3.05 2.5 ...
I am trying to run a DiD regression using net_margin, a figure calculated as net income / total gross loans. When I first plot net_margin it looks like this:
Clearly, there are values included that don't make economic sense. This is partly due to the fact that some banks in the dataset have unreasonable figures for e.g. gross loans: if you divide by something close to zero, some unreasonably large numbers will come out.
My first intuition was to just get rid of the outliers by doing this:
library(dplyr)

Q   <- quantile(dataset$net_margin, probs = c(0.25, 0.75))
iqr <- IQR(dataset$net_margin)  # avoid shadowing the IQR() function
up  <- Q[2] + 1.5 * iqr  # upper fence
low <- Q[1] - 1.5 * iqr  # lower fence
# eliminating outliers
dataset_cleaned <- dataset %>%
  filter(net_margin < up & net_margin > low)
If I plot the data now it looks like this:
By removing the outliers I effectively created new medians and interquartile ranges, so my data is still heavily plagued by outliers.
Other posts that suggested using the IQR to remove outliers did not report this problem, however.
I am at a dead end with my own statistical (and R) knowledge. Is this the right way to remove outliers for such a dataset? Thank you!
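The shrinking effect described in the question is easy to reproduce: each pass of the 1.5 × IQR rule computes fresh quartiles on the trimmed sample and flags new "outliers". A small sketch on simulated right-skewed values (simulated data, not the banking dataset):

```r
set.seed(1)
x <- rlnorm(10000, meanlog = 0, sdlog = 1.5)  # heavily right-skewed

# one pass of the 1.5 * IQR trimming rule
iqr_trim <- function(v) {
  q <- quantile(v, probs = c(0.25, 0.75))
  spread <- q[2] - q[1]
  v[v > q[1] - 1.5 * spread & v < q[2] + 1.5 * spread]
}

x1 <- iqr_trim(x)   # first pass removes the extreme tail
x2 <- iqr_trim(x1)  # second pass flags NEW outliers on the trimmed data
c(length(x), length(x1), length(x2))
```

On skewed data the rule never converges quickly, which is why a single IQR pass leaves the data "still plagued by outliers"; a transformation or a robust estimator is usually a better fit than repeated trimming.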

Reverse Johnson transformation

I want to perform a regression and I have a data set with a left-skewed target variable (Murder) like this:
data("USArrests")
str(USArrests)
'data.frame': 50 obs. of 4 variables:
$ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
$ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
$ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
$ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
hist(USArrests$Murder)
Since the data is skewed, I can do a log transformation of the target in order to improve the performance of the model.
train = USArrests[1:30,]
train$Murder = log(train$Murder)
test = USArrests[31:50,]
If I want to apply this model to the test set, I have to reverse the transformation to get the actual result. This I can do with exp().
fit = lm(Murder~., data = train)
pred = predict(fit, test)
exp(pred)
However, in my case, the log transformation is not enough to get a normal distribution of the target. So I used the Johnson transformation.
library(bestNormalize)
train$Murder = yeojohnson(train$Murder)$x.t
Is there a possibility to reverse this transformation like the log transformation like above?
As noted by Rui Barradas, the predict function can be used here. Instead of directly pulling x.t out of the yeojohnson object, you can do the following:
# Store the transformation object
yj_obj <- yeojohnson(train$Murder)
# Perform transformation
yj_vals <- predict(yj_obj)
# Reverse transformation
orig_vals <- predict(yj_obj, newdata = yj_vals, inverse = TRUE)
# Should be the same as the original values
all.equal(orig_vals, train$Murder)
The same workflow works for the log transformation via the log_x function (again using predict with the inverse = TRUE argument).
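For comparison, the plain log/exp round trip from the question follows the same transform-then-invert shape. A base-R-only sketch (no bestNormalize required), using the USArrests split from the question:

```r
data("USArrests")
train <- USArrests[1:30, ]
test  <- USArrests[31:50, ]

# transform the target, then fit on the log scale
train$Murder <- log(train$Murder)
fit <- lm(Murder ~ ., data = train)

# predictions come back on the log scale; invert with exp()
pred <- exp(predict(fit, newdata = test))

# the round trip on the training target recovers the original values
all.equal(exp(train$Murder), USArrests$Murder[1:30])
```

The bestNormalize objects simply package this same idea so that predict(obj) applies the forward transformation and predict(obj, newdata = ..., inverse = TRUE) undoes it.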

PLM is not recognizing my id variable name

I'm doing a regression analysis considering fixed effects using plm() from the package plm. I have selected the twoways method to account for both time and individual effects. However, after running the code below I keep receiving this message:
Error in pdata.frame(data, index) :
variable id does not exist (individual index)
Here the code:
pdata <- DATABASE[,c(2:4,13:21)]
pdata$id <- group_indices(pdata,ISO3.p,Productcode)
coutnin <- dcast.data.table(pdata,ISO3.p+Productcode~.,value.var = "id")
setcolorder(pdata,neworder=c("id","Year"))
pdata <- pdata.frame(pdata,index=c("id","Year"))
reg <- plm(pdata,diff(TV,1) ~ diff(RERcp,1)+diff(GDPR.p,1)-diff(GDPR.r,1), effect="twoways", model="within", index = c("id","Year"))
Note that the structure of pdata shows multiple levels in the id variable, which is numeric. I initially tried a string variable instead, but I keep receiving the same outcome:
Classes ‘data.table’ and 'data.frame': 1211800 obs. of 13 variables:
$ id : int 4835 6050 13158 15247 17164 18401 19564 23553 24895 27541 ...
$ Year : int 1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
$ Productcode: chr "101" "101" "101" "101" ...
$ ISO3.p : Factor w/ 171 levels "ABW","AFG","AGO",..: 8 9 20 22 27 28 29 34 37 40 ...
$ e : num 0.245 -0.238 1.624 0.693 0.31 ...
$ RERcp : num -0.14073 -0.16277 1.01262 0.03908 -0.00243 ...
$ RERpp : num -0.1712 NA NA NA -0.0952 ...
$ RER_GVC : num -3.44 NaN NA NA NaN ...
$ GDPR.p : num 27.5 26.6 23.5 20.3 27.8 ...
$ GDPR.r : num 30.4 30.4 30.4 30.4 30.4 ...
$ GVCPos : num 0.141 0.141 0.141 0.141 0.141 ...
$ GVCPar : num 0.436 0.436 0.436 0.436 0.436 ...
$ TV : num 17.1 17.1 17.1 17.1 17.1 ...
- attr(*, ".internal.selfref")=<externalptr>
When I convert the data.table into a pdata.frame I do not receive any warning; the error appears only after I run the plm function. Running View(table(index(pdata), useNA = "ifany")) displays no value larger than 1, so I assume there are no duplicate observations in my data.
In plm(), the formula must come first, so move the data argument to second place (or pass it by name). Since pdata has already been converted to a pdata.frame, you can also leave out the index argument, i.e., try this:
reg <- plm(diff(TV,1) ~ diff(RERcp,1)+diff(GDPR.p,1)-diff(GDPR.r,1), data = pdata, effect = "twoways", model = "within")

Data Subset error in R using %in% wildcard

My df:
> str(merged)
'data.frame': 714 obs. of 9 variables:
$ Date : Date, format: "2013-03-29" "2013-03-29" "2013-03-29" "2013-03-29" ...
$ patch : Factor w/ 7 levels "BVG1","BVG11",..: 1 2 3 4 5 6 7 1 2 3 ...
$ prod : num 2.93 2.77 2.86 2.87 3.01 ...
$ workmix_pct : int 100 10 16 13 17 21 22 100 11 19 ...
$ jobcounts : int 9480 968 1551 1267 1625 1946 2123 7328 810 1374 ...
$ travel : num 30.7 34.3 33.8 29.1 28.1 24.9 34 31.8 32.7 36.4 ...
$ FWIHweeklyAvg: num 1.63 4.48 3.1 1.36 1.55 ...
$ CST.NAME : Factor w/ 7 levels "Central Scotland",..: 4 2 3 1 5 7 6 4 2 3 ...
$ month : chr "March" "March" "March" "March" ...
> head(merged)
Date patch prod workmix_pct jobcounts travel FWIHweeklyAvg CST.NAME month
1 2013-03-29 BVG1 2.932208 100 9480 30.7 1.627024 Scotland March
2 2013-03-29 BVG11 2.769156 10 968 34.3 4.475714 Highlands & Islands March
3 2013-03-29 BVG12 2.857344 16 1551 33.8 3.098571 North East Scotland March
4 2013-03-29 BVG13 2.870111 13 1267 29.1 1.361429 Central Scotland March
5 2013-03-29 BVG14 3.011260 17 1625 28.1 1.550000 South East Scotland March
6 2013-03-29 BVG15 3.236246 21 1946 24.9 1.392857 West Central Scotland March
I am trying to subset on patch BVG1 by:
data=merged[patch %in% c("BVG1"),]
But getting an error:
Error in match(x, table, nomatch = 0L) : object 'patch' not found
Don't understand why...
I am trying to plot separate timeseries per patch using ggplot
This is what I have tried:
ggplot(data=merged, aes(x=merged$Date, y=merged$prod, group=patch)) + geom_line() + xlab("") + ylab("Weekly Prods")+ scale_x_date(labels = date_format("%b-%Y"),breaks = "1 month")
This plots all patches on one graph... But I want to show BVG1 timeseries only and this is what I was trying:
ggplot(data=merged[patch %in% c("BVG1"),], aes(x=merged$Date, y=merged$prod, group=patch)) + geom_line() + xlab("") + ylab("Weekly Prods")+ scale_x_date(labels = date_format("%b-%Y"),breaks = "1 month")
But getting the same error.
Any ideas?
UPDATE
Problem solved using [merged$patch %in% c("BVG1"),]
You could also do
data <- subset(merged, patch == "BVG1")
Since you're only conditioning on patch being a single value, you don't need %in%; you can just test for equality.
When you use subset(), R evaluates the condition within the data frame, so the merged$ prefix is unnecessary.
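A quick sanity check on a built-in data set (mtcars standing in for merged, cyl for patch) shows the two forms select the same rows:

```r
# subset() evaluates the condition inside the data frame
a <- subset(mtcars, cyl == 4)

# the bracket form needs the explicit mtcars$ prefix,
# which is exactly what the original error was about
b <- mtcars[mtcars$cyl %in% c(4), ]

all.equal(a, b)
```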
Try
data=merged[merged$patch %in% c("BVG1"),]
That should solve your problem. patch is defined inside your data frame, so you need to tell R where to find it.
Additionally, you may want to look at facet_wrap instead of subsetting. For instance, adding + facet_wrap(~ patch) to your plot command should show you all patches at once. I am not sure this is what you desire as output, but I thought I should point it out as an idea...

how to convert data.frame to numeric/matrix AND create barplot from time series from a csv file

I want to make a barplot of snow data. The data is stored in a .csv file and has a date column and 12 location columns holding integer SWE values.
In order to create a barplot the data has to be either a vector or a matrix. So my question is how I can transform the file (data.frame) to a matrix and create a grouped barplot from it. The x-axis should be "date", the y-axis "SWE [mm]".
My .csv-file looks like this:
Date SB1 SB2 SB3 ...
1.1.2013 95 90 91 ...
1.2.2013 87 80 82 ...
1.3.2013 45 30 15 ...
1.4.2013 23 18 3 ...
so far I tried:
setwd("path")
swe = read.csv("name.csv", header=TRUE, sep=";")
swe$new = paste(swe$Date," ")
swe$new = strptime(swe$new, "%d.%m.%Y")
swe2 <- data.matrix(swe)
dimnames(swe2) <- NA
jpeg("swe_sb1.jpg")
barplot(swe2$Date, swe2$SWE_SB1, ..., beside = TRUE)
dev.off()
it gives me the error message:
> setwd("path")
> swe = read.csv("name.csv", header=TRUE, sep=";")
> swe$new = paste(swe$Date," ")
> swe$new = strptime(swe$new, "%d.%m.%Y")
> swe2 <- data.matrix(swe)
> dimnames(swe2) <- NA
Error in dimnames(swe2) <- NA : 'dimnames' must be a list
> str(swe2)
num [1:4, 1:38] 2 1 3 4 119 117 87 118 54 35 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:38] "Date" "SWE_SB1" "SH_SB1" "SD_SB1" ...
> jpeg("swe_sb1.jpg")
> barplot(swe2$Date, swe2$SWE_SB1)
Error in swe2$Date : $ operator is invalid for atomic vectors
> dev.off()
jpeg:75:swe_all.jpg
2
any help would be greatly appreciated!
You're making this harder than it is. R has great examples for all of its functions, so ?barplot might have been a better place to start.
Anyway, what you have is a matrix that you want to make a grouped barplot from. Take a matrix like the example one you'd see by typing VADeaths:
Rural Male Rural Female Urban Male Urban Female
50-54 11.7 8.7 15.4 8.4
55-59 18.1 11.7 24.3 13.6
60-64 26.9 20.3 37.0 19.3
65-69 41.0 30.9 54.6 35.1
70-74 66.0 54.3 71.1 50.0
To create a grouped barplot from it, you simply type barplot(VADeaths, beside = TRUE).
If you want to switch which variable defines the groups, all you have to do is transpose first: barplot(t(VADeaths), beside = TRUE). So all you have to do is read in your data using read.csv or whatever, transpose it if needed, and plot it!
read.csv() returns a data.frame, and barplot() does not accept this class.
Use as.matrix() before plotting to convert your data to an accepted class:
x <- as.matrix(x)
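Putting the pieces together with the built-in VADeaths matrix mentioned above (beside = TRUE is the barplot() argument that produces grouped bars; barplot() invisibly returns the bar midpoints, which is handy for checking the layout):

```r
# VADeaths is already a numeric matrix with informative dimnames,
# which is the shape your own data needs to be in after as.matrix()
grouped <- barplot(VADeaths, beside = TRUE, ylab = "Deaths per 1000")

# transpose first to swap which variable defines the groups
swapped <- barplot(t(VADeaths), beside = TRUE)
```

For the SWE data in the question, the equivalent shape would be a matrix with the date labels as dimnames on one margin and the location columns on the other before calling barplot(..., beside = TRUE).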