Reverse Yeo-Johnson transformation in R

I want to perform a regression and I have a data set with a left-skewed target variable (Murder) like this:
data("USAArrests")
str(USAArrests)
'data.frame': 50 obs. of 4 variables:
$ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
$ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
$ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
$ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
hist(USAArrests&Murder)
Since the data is left-skewed. I can do a log transformation of the target in order to improve the performance of the model.
train = USArrests[1:30,]
train$Murder = log(train$Murder)
test = USArrests[31:50,]
If I want to apply this model to the test set, I have to reverse the transformation to get the actual result. This I can do with exp.
fit = lm(Murder~., data = train)
pred = predict(fit, test)
exp(pred)
However, in my case the log transformation is not enough to get a normal distribution of the target, so I used the Yeo-Johnson transformation.
library(bestNormalize)
train$Murder = yeojohnson(train$Murder)$x.t
Is there a way to reverse this transformation, the way exp reverses the log transformation above?

As noted by Rui Barradas, the predict function can be used here. Instead of directly pulling x.t out of the yeojohnson object, you can do the following:
# Store the transformation object
yj_obj <- yeojohnson(train$Murder)
# Perform transformation
yj_vals <- predict(yj_obj)
# Reverse transformation
orig_vals <- predict(yj_obj, newdata = yj_vals, inverse = TRUE)
# Should be the same as the original values
all.equal(orig_vals, train$Murder)
The same workflow can be done with the log and exponentiation transformation via the log_x function (together with the predict function and the inverse = TRUE argument).
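To answer the original question end to end, here is a minimal sketch of the full regression workflow, assuming the model is fit on the transformed target and predictions are mapped back with inverse = TRUE:
library(bestNormalize)

train <- USArrests[1:30, ]
test  <- USArrests[31:50, ]

# Fit the transformation on the training target and keep the object
yj_obj <- yeojohnson(train$Murder)
train$Murder <- predict(yj_obj)

# Fit the regression on the transformed scale
fit <- lm(Murder ~ ., data = train)

# Predict on the test set, then back-transform to the original scale
pred_t <- predict(fit, newdata = test)
pred <- predict(yj_obj, newdata = pred_t, inverse = TRUE)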

Related

Adding new variable to an imputed dataset based on imputed values

I wish to create a variable which is computed from the values of two other variables in an imputed dataset, and I was wondering if there's a way to achieve this?
e.g. if I wanted to add a new variable var_new, equal to chl - bmi, to the nhanes dataset that I've imputed with m = 16 (in the mice package below), is there a way to achieve this?
library(mice)
aux_vart <- mice::quickpred(
  nhanes,
  mincor = 0.1
)
imp <- mice::mice(nhanes, pred = aux_vart, m = 16, meth = "pmm")
I tried creating the variable in my original dataset and then imputing from that, but because the new variable is a function of the others, it has resulted in non-convergence of my models and wildly inaccurate parameter estimates in other models I've created.
First create the completed data sets, then add the column:
all_sets <- lapply(1:16, function(x) complete(imp, x))
final <- lapply(all_sets, function(x) cbind(x, var_new=x$chl - x$bmi))
Now final is a list containing all 16 data sets, final[[1]] to final[[16]], for example:
str(final[[1]])
# 'data.frame': 25 obs. of 5 variables:
# $ age : num 1 2 1 3 1 3 1 1 2 2 ...
# $ bmi : num 28.7 22.7 22 22.7 20.4 25.5 22.5 30.1 22 26.3 ...
# $ hyp : num 1 1 1 1 1 1 1 1 1 2 ...
# $ chl : num 187 187 187 218 113 184 118 187 238 206 ...
# $ var_new: num 158.3 164.3 165 195.3 92.6 ...
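If you need a mids object again afterwards (e.g. for with() and pool()), an alternative sketch, assuming a reasonably recent version of mice, is to add the column in long format and convert back with as.mids():
# All imputations in long format, including the original data
long <- mice::complete(imp, action = "long", include = TRUE)

# Derive the new variable in every data set at once
long$var_new <- long$chl - long$bmi

# Convert back to a mids object
imp2 <- mice::as.mids(long)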

How to use a multinomial logistic regression model to predict future observations

My question may seem a little vague, so I will provide background context and my reproducible code to clarify.
I am interested in classifying crime occurrences in various neighbourhoods of a city, based on each neighbourhood's socioeconomic indicators. My end goal is to generate a reasonably accurate prediction of the neighbourhood in which the next crime is most likely to occur. I chose to fit a multinomial regression model, and I am having a hard time interpreting its results.
Here is how my data looks:
> str(df)
'data.frame': 1796 obs. of 12 variables:
$ Time : chr "14:37:00" "14:37:00" "16:23:00" "00:10:00" ...
$ Neighbourhood : chr "Grand Boulevard" "Grand Boulevard" "West Town" "West Englewood" ...
$ Population : num 22209 22209 84698 26346 24976 ...
$ Area : num 1.74 1.74 4.58 3.15 2.55 2.95 3.15 1.04 7.15 1.28 ...
$ Density : chr "12,763.79" "12,763.79" "18,493.01" "8,363.81" ...
$ Crowded.Housing: num 3.3 3.3 2.3 4.8 2.7 3.3 4.8 2.4 6.3 9.4 ...
$ Poverty : num 29.3 29.3 14.7 34.4 8.9 27.8 34.4 21.7 28.6 41.7 ...
$ Unemployment : num 24.3 24.3 6.6 35.9 9.5 24 35.9 15.7 22.6 25.8 ...
$ Education : num 15.9 15.9 12.9 26.3 18.8 14.5 26.3 11.3 24.4 24.5 ...
$ Age : num 39.5 39.5 21.7 40.7 37.6 40.3 40.7 35.4 37.9 43.6 ...
$ Income : num 23472 23472 43198 11317 25113 ...
$ Hardship : num 57 57 10 89 29 60 89 26 73 92 ...
Here is the code for my model:
c.nnet = nnet::multinom(Neighbourhood ~
Crowded.Housing +
Poverty +
Unemployment +
Education +
Income +
Hardship,
data = df,
MaxNWts = 100000)
Here are some classification accuracy metrics:
> odds <- c.nnet[["fitted.values"]]
> pd = predict(c.nnet,type="class")
> table = table(df$Neighbourhood, pd); classAgreement(table)
$diag
[1] 0.6631403
$kappa
[1] 0.6451884
$rand
[1] 0.9560459
$crand
[1] 0.6035169
> sum(diag(table))/sum(table)
[1] 0.6631403
Lastly, here is the output of the predicted classes and the associated class probabilities.
> head(pd)
[1] Chatham Chatham West Town West Englewood New City Chatham
72 Levels: Albany Park Archer Heights Armour Square Ashburn Auburn Gresham Austin Avalon Park Avondale Belmont Cragin Bridgeport Brighton Park ... Woodlaw
> head(odds)
Albany Park Archer Heights Armour Square Ashburn Auburn Gresham Austin Avalon Park Avondale Belmont Cragin Bridgeport Brighton Park
1 8.293444e-04 3.078169e-04 3.394213e-04 5.070003e-04 0.0333699087 8.205015e-03 0.0140058699 3.519157e-04 0.0005199967 3.962345e-04 1.796575e-05
2 8.293444e-04 3.078169e-04 3.394213e-04 5.070003e-04 0.0333699087 8.205015e-03 0.0140058699 3.519157e-04 0.0005199967 3.962345e-04 1.796575e-05
3 7.276802e-04 2.796196e-06 1.540627e-03 9.642981e-03 0.0001623333 4.575838e-05 0.0004173684 1.229428e-03 0.0007718075 2.308536e-02 9.021844e-03
4 7.168266e-05 7.869570e-04 1.743114e-05 3.519012e-05 0.0473000895 9.256728e-02 0.0058524740 4.373425e-05 0.0002943829 4.752441e-06 6.214005e-07
5 2.376865e-03 3.647976e-04 3.261888e-03 5.958128e-02 0.0090540446 4.103546e-02 0.0028125946 9.329274e-03 0.0339153709 1.394973e-02 9.034131e-02
6 7.735586e-04 5.958576e-04 2.345032e-04 4.058962e-04 0.0833015893 2.374063e-02 0.0169124221 3.038695e-04 0.0005576943 2.163316e-04 1.263609e-05
As far as my understanding goes, the latter output (odds) represents the probability of each crime occurrence belonging to each of the 72 unique neighbourhoods in my data, while the former (pd) represents the predicted classes for my data set. This leads to my specific question: how can I use these predicted classes to generate some sort of forecast of where the next crime is likely to occur (i.e. something like a one-step-ahead time-series forecast)?
You can create a newdata data frame with the values you want to predict over, and then use the predict function to obtain predicted probabilities for each class. For example:
# estimate model
library(nnet)
dat <- mtcars
dat$gear <- factor(dat$gear)
mod <- multinom(gear ~ mpg + hp, data = dat)
# what values we want predictions for
out_of_sample_data <- data.frame(mpg = c(19, 20), hp = c(130, 140))
# generate predicted probabilities
predict(mod, newdata = out_of_sample_data, type = "probs")
#> 3 4 5
#> 1 0.6993027 0.2777716 0.02292562
#> 2 0.6217686 0.2750779 0.10315351
Obviously, you'll need to populate your out-of-sample data with values you believe will occur in the future, which can be tricky (to say the least).
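If you want the single most likely neighbourhood rather than the whole probability vector, a short follow-up sketch using the same toy model:
# Most probable class for each new observation
predict(mod, newdata = out_of_sample_data, type = "class")

# Equivalently, pick the highest-probability column by hand
probs <- predict(mod, newdata = out_of_sample_data, type = "probs")
colnames(probs)[apply(probs, 1, which.max)]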

Data partitioning function createDataPartition cross-validation problem

I am trying to get predictions from a multiple-variable model on a dataset called neplt. It is made of 7 subject scores and one final exam score, moy_exam2, and I want to predict the latter using the 7 scores. I have 29441 obs, like this:
'data.frame': 19643 obs. of 8 variables:
$ HG : num 11.5 14 7.5 10.5 9.5 9.5 10 14 11.5 14 ...
$ Math : num 8 7.25 9.25 13.25 4.25 ...
$ Ar : num 11.2 12.8 8.5 11.5 9.5 ...
$ Fr : num 4 4.25 6.5 6.75 5.5 ...
$ EI : num 8 10.5 2.5 4 7 9.5 8.5 9.5 12 14 ...
$ SVT : num 5.25 9.25 7 11.5 12.5 ...
$ PC : num 11.5 16.75 4.25 13.75 10 ...
$ moy_exam2: num 8.15 9.48 7.23 10.33 7.44 ...
I decided on 85% for training and 15% for testing the model, so to partition the data with createDataPartition I try this:
# Load the data
data("neplt")
# Inspect the data
library(tidyverse)
sample_n(neplt, 3)
# Split the data into training and test set
library(caret)
library(Rcpp)
set.seed(1, sample.kind = "Rounding")
training.samples <- neplt$moy_exam2 %>%
  createDataPartition(p = 0.85, list = FALSE, times = 1)
train.data <- neplt[training.samples, ]
test.data <- neplt[-training.samples, ]
# Build the model
model <- lm(moy_exam2 ~., data = train.data, na.action=na.omit)
# Make predictions and compute the R2, RMSE and MAE
predictions <- model %>% predict(test.data)
data.frame(R2 = R2(predictions, test.data$moy_exam2),
           RMSE = RMSE(predictions, test.data$moy_exam2),
           MAE = MAE(predictions, test.data$moy_exam2))
I get the error
Error in split_indices(as.integer(splitv), attr(splitv, "n")) :
function 'Rcpp_precious_remove' not provided by package 'Rcpp'
I don't call any split_indices function here, and Rcpp is already loaded. I continued anyway, but the program gets stuck on the createDataPartition line.
I cleaned the data neplt with na.omit and also with na.exclude, to remove any doubt about NA missing values.
Then I tried adding the sample.kind = "Rounding" argument to set.seed to get it to work; still, RStudio keeps loading indefinitely, and the console shows a + sign:
Does this seem related to memory capacity? Or is the sampling somehow unbounded, so it couldn't finish in 100 years? It's been running for hours with no results!
I had a similar problem and error code when running summarySE. It seems like others have had issues like this too: Rcpp package doesn't include Rcpp_precious_remove
I installed and loaded Rcpp again, and it worked thereafter!
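The error typically means the installed Rcpp predates Rcpp_precious_remove (added around Rcpp 1.0.7), while another package was compiled against a newer version; rebuilding Rcpp is the usual fix. A minimal sketch:
# Reinstall Rcpp so it matches the packages compiled against it,
# then restart R and reload it
install.packages("Rcpp")
library(Rcpp)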

Converting SAS code to R code

I have been trying to convert SAS code that fits a simple regression and a mixed model. I managed to convert the simple regression, but my attempts at the mixed model keep failing. The SAS code shown below is the code I am trying to convert.
"parc", "m", "dap" and "ht" are the header labels of the dataset, respectively.
data algoritmo ;
input parc m dap ht ;
lnH = LOG(ht-1.3);
lnD = LOG(dap) ;
cards ;
8 1 24.3 26.7
8 1 29.9 30.7
8 1 32.6 31.7
8 1 35.9 33.7
8 1 36.5 32.5
22 2 22.3 21.0
22 2 26.9 23.1
22 2 26.9 20.5
22 2 32.4 21.5
22 2 33.5 25.0
85 3 33.6 33.5
85 3 36.0 33.0
85 3 37.0 35.0
85 3 40.8 35.0
;
run ;
/* Simple Regression Model */
PROC REG DATA=algoritmo ;
model lnH = lnD ;
output out=out p=pred ;
run ; quit ;
/* Mixed-Effects Model */
PROC MIXED DATA=algoritmo COVTEST METHOD=REML ;
TITLE ' lnH = (B0+bok)+(B1+b1k)*lnd ' ;
MODEL lnH = lnD / S OUTPM=outpm OUTP=outp ;
RANDOM intercept lnD /SUBJECT=m s G TYPE=UN ;
RUN ;
Here is the part of the code that I already converted. This part works perfectly for me.
data1 = read.table(file.choose(), header = TRUE, sep = ",")
attach(data1)
lnH = log(ht - 1.3)
lnD = log(dap)
data2 = cbind(data1, lnH, lnD)
# Simple Linear Model
model1 = lm(lnH ~ lnD, data = data2)
summary(model1)
But for the rest I'm stuck.
library(nlme)
model2 = lme(lnH ~ lnD, data = data2, random = ~1|lnD/m, method = "REML",
             weights = varPower(0.2, form = ~dap))
summary(model2)
With the help of Roland, replacing random = ~1|lnD/m with random = ~lnD|m worked pretty well.
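For reference, a sketch of the corrected call that mirrors the PROC MIXED specification (random intercept and lnD slope per subject m; nlme's default random-effects covariance is a general positive-definite matrix, matching TYPE=UN):
library(nlme)

# Random intercept and lnD slope, grouped by m, estimated by REML
model2 <- lme(lnH ~ lnD, data = data2, random = ~ lnD | m, method = "REML")
summary(model2)

# Fixed effects (the S option) and random-effects covariance (the G option)
fixef(model2)
getVarCov(model2)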

How to convert a data.frame to a numeric matrix AND create a barplot from a time series in a csv file

I want to make a barplot of snow data. The data is stored in a .csv file and has a date column and 12 location columns, each holding an integer SWE value.
In order to create a barplot, the data has to be either a vector or a matrix. So my question is how I can transform the file (data.frame) into a matrix and create a grouped barplot from it. The x-axis should be "Date", the y-axis "SWE [mm]".
My .csv-file looks like this:
Date SB1 SB2 SB3 ...
1.1.2013 95 90 91 ...
1.2.2013 87 80 82 ...
1.3.2013 45 30 15 ...
1.4.2013 23 18 3 ...
so far I tried:
setwd("path")
swe = read.csv("name.csv", header=TRUE, sep=";")
swe$new = paste(swe$Date," ")
swe$new = strptime(swe$new, "%d.%m.%Y")
swe2 <- data.matrix(swe)
dimnames(swe2) <- NA
jpeg("swe_sb1.jpg")
barplot(swe2$Date, swe2$SWE_SB1, ..., beside = TRUE)
dev.off()
it gives me the error message:
> setwd("path")
> swe = read.csv("name.csv", header=TRUE, sep=";")
> swe$new = paste(swe$Date," ")
> swe$new = strptime(swe$new, "%d.%m.%Y")
> swe2 <- data.matrix(swe)
> dimnames(swe2) <- NA
Error in dimnames(swe2) <- NA : 'dimnames' must be a list
> str(swe2)
num [1:4, 1:38] 2 1 3 4 119 117 87 118 54 35 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:38] "Date" "SWE_SB1" "SH_SB1" "SD_SB1" ...
> jpeg("swe_sb1.jpg")
> barplot(swe2$Date, swe2$SWE_SB1)
Error in swe2$Date : $ operator is invalid for atomic vectors
> dev.off()
jpeg:75:swe_all.jpg
2
any help would be greatly appreciated!
You're making this way harder than it is. R has great examples for all of its functions, so ?barplot might have been a better place to start.
Anyway, what you have is a matrix that you want to make a grouped barplot from. Suppose you have a matrix like the example one you'd see by typing VADeaths:
Rural Male Rural Female Urban Male Urban Female
50-54 11.7 8.7 15.4 8.4
55-59 18.1 11.7 24.3 13.6
60-64 26.9 20.3 37.0 19.3
65-69 41.0 30.9 54.6 35.1
70-74 66.0 54.3 71.1 50.0
If you wanted to create a barplot from it, you would simply type barplot(VADeaths, beside = TRUE) and you'd end up with a grouped barplot.
If you want to switch the grouping, all you have to do is barplot(t(VADeaths), beside = TRUE). So all you have to do is read in your data using read.csv or whatever, transpose it, and plot it!
read.csv() returns a data.frame, and barplot() does not accept this class.
Use as.matrix() before plotting to turn your data into an accepted class:
x <- as.matrix(x)
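Putting both answers together for the snow data, a minimal sketch (using the Date and SB1–SB3 columns from the sample shown above):
swe <- read.csv("name.csv", header = TRUE, sep = ";")

# Keep only the numeric SWE columns and label rows by date
m <- as.matrix(swe[, c("SB1", "SB2", "SB3")])
rownames(m) <- swe$Date

# Transpose so the bars are grouped by date along the x-axis
barplot(t(m), beside = TRUE,
        xlab = "Date", ylab = "SWE [mm]", legend.text = TRUE)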
