Finding correlation of multiple columns in R - r

My requirement is to find the correlation between E_Id, IncomeType and Tax, to help understand whether any E_Id or IncomeType always leads to higher Tax. My sample data for the required columns is:
E_id IncomeType Tax
1 1 121
2 1 11.23
2 3 51.623
1 1 115.23
3 4 675.1
I have around 5 lakh (500,000) rows, 4 types of IncomeType, and 340 unique E_id values. I grouped the data, and it now looks something like this:
E_Id Tax_Income_1 Tax_Income_2 Tax_Income_3 Tax_Income_4
1 118025 66513.25 148134 274072.16
2 200527 235278 247536.42 487333.98
3 3376.93 11279 114312.5 130463.97
4 44630 22285.95 20830.55 2375
5 42902.63 15649 7602.01 3624
Now I don't have any idea how to find the correlation. This is my first analytics project, please provide some guidance.

I would like to draw your attention to correlation_table() from the funModeling package:
library(funModeling)
data(mtcars)
correlation_table(data=mtcars, target="mpg")
Variable mpg
1 mpg 1.00
2 drat 0.68
3 gear 0.48
4 qsec 0.42
5 carb -0.55
6 hp -0.78
7 cyl -0.85
8 disp -0.85
9 wt -0.87

Also using the mtcars data set as an example, the cor() function will produce a matrix of variable correlations.
data(mtcars)
cor(mtcars)
You can also represent these correlations graphically with corrgram() from the corrgram package:
library(corrgram)
corrgram(mtcars)

Using the mtcars dataset as an example you can visualize the correlations of all the variables like this:
data(mtcars)
pairs(mpg ~ ., data = mtcars)
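Applied to the data in the question, the same cor() call might look like the sketch below. It assumes the grouped table is held in a data frame (here called tax_wide, a made-up name) and uses only the five rows shown in the question, so the resulting numbers are illustrative only. Because E_Id and IncomeType are identifiers and categories rather than measurements, a per-group mean of Tax is probably more informative for them than a correlation coefficient; the second half sketches that with aggregate().

```r
# Wide table copied from the question (only the five rows shown there)
tax_wide <- data.frame(
  E_Id         = 1:5,
  Tax_Income_1 = c(118025, 200527, 3376.93, 44630, 42902.63),
  Tax_Income_2 = c(66513.25, 235278, 11279, 22285.95, 15649),
  Tax_Income_3 = c(148134, 247536.42, 114312.5, 20830.55, 7602.01),
  Tax_Income_4 = c(274072.16, 487333.98, 130463.97, 2375, 3624)
)

# Pairwise Pearson correlations between the four income-type totals.
# E_Id is an identifier, not a measurement, so it is left out.
tax_cor <- cor(tax_wide[, -1])
round(tax_cor, 2)

# Long-format sample from the question: mean Tax per IncomeType
tax_long <- data.frame(
  E_id       = c(1, 2, 2, 1, 3),
  IncomeType = factor(c(1, 1, 3, 1, 4)),
  Tax        = c(121, 11.23, 51.623, 115.23, 675.1)
)
mean_tax <- aggregate(Tax ~ IncomeType, data = tax_long, FUN = mean)
mean_tax
```

With the full data set, the same two calls would run on all 340 E_Id rows and all records respectively.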

Related

How to find the range of a continuous variable where a count variable is non-zero in R

I am trying to find the range of the variable lat for each of the other columns, which contain occurrence records (e.g. 0, 1, 2, 3), restricted to the rows where the record of occurrence is non-zero (the range of lat where occurrence > 0). I've tried to subset the data for each column, dropping rows with 0 individuals recorded, but I can't get it to work.
I tried to extract the minimum and maximum of lat for each species column where the occurrence was > 0 using which.max/which.min:
allfreq$lat[which.min(allfreq$lat[allfreq$Fem.mad !=0])]
However, the results made no sense: the values were nowhere near the minimum and maximum I observed visually.
Using the mtcars dataset as an example:
> sapply(mtcars,function(x){range(x[x!=0])})
mpg cyl disp hp drat wt qsec vs am gear carb
[1,] 10.4 4 71.1 52 2.76 1.513 14.5 1 1 3 1
[2,] 33.9 8 472.0 335 4.93 5.424 22.9 1 1 5 8
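The attempt in the question fails because which.min() returns a position within the filtered vector, and that position is then used to index the unfiltered allfreq$lat. A minimal sketch of one way around this is shown below: filter lat by each count column first, then take the range. The data frame is a made-up stand-in for allfreq; only the names lat and Fem.mad come from the question, and Sp.two is hypothetical.

```r
# Hypothetical stand-in for the asker's data: lat plus species count columns
allfreq <- data.frame(
  lat     = c(50.1, 51.3, 52.7, 53.9, 55.2),
  Fem.mad = c(0, 2, 1, 0, 0),
  Sp.two  = c(1, 0, 0, 3, 1)   # made-up second species column
)

# Every column except lat is treated as a count column
species_cols <- setdiff(names(allfreq), "lat")

# For each count column, keep lat where the count is > 0, then take the range.
# which() drops NA counts instead of propagating them into the index.
lat_range <- sapply(allfreq[species_cols], function(counts) {
  range(allfreq$lat[which(counts > 0)])
})
rownames(lat_range) <- c("min_lat", "max_lat")
lat_range
```

A column with no non-zero records would make range() return Inf and -Inf with a warning, so such columns may need to be handled separately.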

How do I only report selected summary statistics in a table that lists variables as rows using R?

I have a dataset and I need to create a simple table with the number of observations, means, and standard deviations of all the variables (columns). I can't find a way to get only the required 3 summary statistics. Everything I tried keeps giving me min, max, median, 1st and 3rd quartiles, etc. The table should look something like this (with a title):
Table 1: Table Title
_______________________________________
Variables Observations Mean Std.Dev
_______________________________________
Age 30 24 2
... . . .
... . . .
_______________________________________
summary() does not work because it gives too many other summary statistics. I have done this:
sapply(dataset, function(x) list(means=mean(x,na.rm=TRUE), sds=sd(x,na.rm=TRUE)))
But how do I form the table from this? And is there a better way to do this than using "sapply"?
sapply does return the values that you want but it is not properly structured.
Using mtcars data as an example :
# Get the required statistics and convert the result into a data frame
summ_data <- data.frame(t(sapply(mtcars, function(x)
  list(means = mean(x, na.rm = TRUE), sds = sd(x, na.rm = TRUE)))))
# Copy the row names into a new column
summ_data$variables <- rownames(summ_data)
# Remove the row names
rownames(summ_data) <- NULL
# Move the variables column to the first position
cbind(summ_data[ncol(summ_data)], summ_data[-ncol(summ_data)])
Another way would be using dplyr functions :
library(dplyr)
mtcars %>%
  summarise(across(everything(), .fns = list(means = mean, sds = sd),
                   .names = '{col}_{fn}')) %>%
  tidyr::pivot_longer(cols = everything(),
                      names_to = c('variable', '.value'),
                      names_sep = '_')
# A tibble: 11 x 3
# variable means sds
# <chr> <dbl> <dbl>
# 1 mpg 20.1 6.03
# 2 cyl 6.19 1.79
# 3 disp 231. 124.
# 4 hp 147. 68.6
# 5 drat 3.60 0.535
# 6 wt 3.22 0.978
# 7 qsec 17.8 1.79
# 8 vs 0.438 0.504
# 9 am 0.406 0.499
#10 gear 3.69 0.738
#11 carb 2.81 1.62
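Neither approach above reports the number of observations that the question asks for. A minimal base R sketch that adds it is shown below; the column names follow the table layout in the question, and counting non-missing values as "observations" is an assumption about what is wanted.

```r
# One row per variable: non-missing count, mean, and standard deviation
summ_table <- data.frame(
  Variables    = names(mtcars),
  Observations = sapply(mtcars, function(x) sum(!is.na(x))),
  Mean         = sapply(mtcars, mean, na.rm = TRUE),
  Std.Dev      = sapply(mtcars, sd, na.rm = TRUE),
  row.names    = NULL
)

# Print with a title line and without row numbers
cat("Table 1: Table Title\n")
print(summ_table, row.names = FALSE, digits = 3)
```

Each sapply() call here returns a plain numeric vector, so the columns are ordinary numeric columns rather than list columns.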

Summary Statistics table with factors and continuous variables

I am trying to create a simple summary statistics table (min, max, mean, n, etc) that handles both factor variables and continuous variables, even when there is more than one factor variable. I'm trying to produce good looking HTML output, eg stargazer or huxtable output.
For a simple reproducible example, I'll use mtcars but change two of the variables to factors, and simplify to three variables.
library(tidyverse)
library(stargazer)
mtcars_df <- mtcars
mtcars_df <- mtcars_df %>%
  mutate(vs = factor(vs),
         am = factor(am)) %>%
  select(mpg, vs, am)
head(mtcars_df)
So the data has two factor variables, vs and am. mpg is left as a double:
#> mpg vs am
#> <dbl> <fctr> <fctr>
#> 1 21.0 0 1
#> 2 21.0 0 1
#> 3 22.8 1 1
#> 4 21.4 1 0
#> 5 18.7 0 0
#> 6 18.1 1 0
My desired output would look something like this (format only, the numbers aren't all correct for am0):
======================================================
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
------------------------------------------------------
mpg 32 20.091 6.027 10 15.4 22.8 34
vs0 32 0.562 0.504 0 0 1 1
vs1 32 0.438 0.504 0 0 1 1
am0 32 0.594 0.499 0 0 1 1
am1 32 0.406 0.499 0 0 1 1
------------------------------------------------------
A straight call to stargazer does not handle factors (but we have a solution for summarising one factor, below)
# this doesn't give factors
stargazer(mtcars_df, type = "text")
======================================================
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
------------------------------------------------------
mpg 32 20.091 6.027 10 15.4 22.8 34
------------------------------------------------------
This previous answer from @jake-fisher works very well to summarise one factor variable.
https://stackoverflow.com/a/26935270/8742237
The code below, from that previous answer, gives both values of the first factor vs (vs0 and vs1), but for the second factor, am, it lists summary statistics for only one value: am0 is missing.
I do realise that this is because we want to avoid the dummy variable trap when modeling, but my issue is not about modeling, it's about creating a summary table with all values of all factor variables.
options(na.action = "na.pass") # so that we keep missing values in the data
X <- model.matrix(~ . - 1, data = mtcars_df)
X.df <- data.frame(X) # stargazer only does summary tables of data.frame objects
#names(X) <- colnames(X)
stargazer(X.df, type = "text")
======================================================
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
------------------------------------------------------
mpg 32 20.091 6.027 10 15.4 22.8 34
vs0 32 0.562 0.504 0 0 1 1
vs1 32 0.438 0.504 0 0 1 1
am1 32 0.406 0.499 0 0 1 1
------------------------------------------------------
While use of stargazer or huxtable would be preferred, if there's an easier way to produce this sort of summary table with a different library, that would still be very helpful.
In the end, instead of using model.matrix(), which by default drops the base level when creating dummy variables, a simple fix is to use mlr::createDummyFeatures(), which creates a dummy column for every level, including the base level.
library(tidyverse)
library(stargazer)
library(mlr)
mtcars_df <- mtcars
mtcars_df <- mtcars_df %>%
  mutate(vs = factor(vs),
         am = factor(am)) %>%
  select(mpg, vs, am)
head(mtcars_df)
X <- mlr::createDummyFeatures(obj = mtcars_df)
X.df <- data.frame(X) # stargazer only does summary tables of data.frame objects
#names(X) <- colnames(X)
stargazer(X.df, type = "text")
which does give the desired output:
======================================================
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
------------------------------------------------------
mpg 32 20.091 6.027 10 15.4 22.8 34
vs.0 32 0.562 0.504 0 0 1 1
vs.1 32 0.438 0.504 0 0 1 1
am.0 32 0.594 0.499 0 0 1 1
am.1 32 0.406 0.499 0 0 1 1
------------------------------------------------------
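If adding the mlr dependency is not desirable, a similar result can probably be obtained with model.matrix() itself by passing full (non-reduced) contrasts for every factor through contrasts.arg. The sketch below is an alternative to the approach above, not part of the original answer; the object names are made up, and the resulting columns are named vs0/vs1/am0/am1 rather than vs.0/vs.1/am.0/am.1.

```r
# Same three-column example as above, built with base R only
mtcars_df <- mtcars[, c("mpg", "vs", "am")]
mtcars_df$vs <- factor(mtcars_df$vs)
mtcars_df$am <- factor(mtcars_df$am)

# Identity contrasts for each factor, so that no level is dropped
factor_cols    <- names(mtcars_df)[sapply(mtcars_df, is.factor)]
full_contrasts <- lapply(mtcars_df[factor_cols], contrasts, contrasts = FALSE)

# One indicator column per factor level, plus the numeric column
X_full <- model.matrix(~ . - 1, data = mtcars_df,
                       contrasts.arg = full_contrasts)
X_full_df <- data.frame(X_full)
colnames(X_full_df)
```

X_full_df can then be passed to stargazer(X_full_df, type = "text") in the same way as X.df above.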

Obtain importance of individual trees in a RandomForest

Question: Is there a way to extract the variable importance for each individual CART model from a randomForest object?
rf_mod$forest doesn't seem to have this information, and the docs don't mention it.
In R's randomForest package, the average variable importance for the entire forest of CART models is given by importance(rf_mod).
library(randomForest)
df <- mtcars
set.seed(1)
rf_mod = randomForest(mpg ~ .,
data = df,
importance = TRUE,
ntree = 200)
importance(rf_mod)
%IncMSE IncNodePurity
cyl 6.0927875 111.65028
disp 8.7730959 261.06991
hp 7.8329831 212.74916
drat 2.9529334 79.01387
wt 7.9015687 246.32633
qsec 0.7741212 26.30662
vs 1.6908975 31.95701
am 2.5298261 13.33669
gear 1.5512788 17.77610
carb 3.2346351 35.69909
We can also extract individual tree structure with getTree. Here's the first tree.
head(getTree(rf_mod, k = 1, labelVar = TRUE))
left daughter right daughter split var split point status prediction
1 2 3 wt 2.15 -3 18.91875
2 0 0 <NA> 0.00 -1 31.56667
3 4 5 wt 3.16 -3 17.61034
4 6 7 drat 3.66 -3 21.26667
5 8 9 carb 3.50 -3 15.96500
6 0 0 <NA> 0.00 -1 19.70000
One workaround is to grow many single-tree forests (i.e., ntree = 1), get the variable importance of each tree, and average the resulting %IncMSE:
# number of trees to grow
nn <- 200
# function to run nn CART models
run_rf <- function(rand_seed){
  set.seed(rand_seed)
  one_tr <- randomForest(mpg ~ .,
                         data = df,
                         importance = TRUE,
                         ntree = 1)
  return(one_tr)
}
# fit nn single-tree models and collect them in a list
l <- lapply(1:nn, run_rf)
The extraction, averaging, and comparison step.
# extract importance of each CART model
library(dplyr); library(purrr)
map(l, importance) %>%
map(as.data.frame) %>%
map( ~ { .$var = rownames(.); rownames(.) <- NULL; return(.) } ) %>%
bind_rows() %>%
group_by(var) %>%
summarise(`%IncMSE` = mean(`%IncMSE`)) %>%
arrange(-`%IncMSE`)
# A tibble: 10 x 2
var `%IncMSE`
<chr> <dbl>
1 wt 8.52
2 cyl 7.75
3 disp 7.74
4 hp 5.53
5 drat 1.65
6 carb 1.52
7 vs 0.938
8 qsec 0.824
9 gear 0.495
10 am 0.355
# compare to the RF model above
importance(rf_mod)
%IncMSE IncNodePurity
cyl 6.0927875 111.65028
disp 8.7730959 261.06991
hp 7.8329831 212.74916
drat 2.9529334 79.01387
wt 7.9015687 246.32633
qsec 0.7741212 26.30662
vs 1.6908975 31.95701
am 2.5298261 13.33669
gear 1.5512788 17.77610
carb 3.2346351 35.69909
I'd like to be able to extract the variable importance of each tree directly from a randomForest object, without this roundabout method that involves completely re-running the RF. The goal is to facilitate reproducible cumulative variable importance plots like this one, and the one below shown for mtcars. Minimal example here.
I'm aware that a single tree's variable importance is not statistically meaningful, and it's not my intention to interpret trees in isolation. I want them for the purpose of visualization and communicating that as trees increase in a forest, the variable importance measures jump around before stabilizing.
When training a randomForest model, the importance scores are computed for the entire forest and stored directly inside the object. Tree-specific scores are not kept and so cannot be directly retrieved from a randomForest object.
Unfortunately, you are correct about having to incrementally construct a forest. The good news is that a randomForest object is self-contained, and you don't need to implement your own run_rf. Instead, you can use stats::update to re-fit the random forest model with a single tree and randomForest::grow to add additional trees one at a time:
## Starting with a random forest having a single tree,
## grow it 9 times, one tree at a time
rfs <- purrr::accumulate( .init = update(rf_mod, ntree=1),
rep(1,9), randomForest::grow )
## Retrieve the importance scores from each random forest
imp <- purrr::map( rfs, ~importance(.x)[,"%IncMSE"] )
## Combine all results into a single data frame
dplyr::bind_rows( !!!imp )
# # A tibble: 10 x 10
# cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0 18.8 8.63 1.05 0 1.17 0 0 0 0.194
# 2 0 10.0 46.4 0.561 0 -0.299 0 0 0.543 2.05
# 3 0 22.4 31.2 0.955 0 -0.199 0 0 0.362 5.1
# 4 1.55 24.1 23.4 0.717 0 -0.150 0 0 0.272 5.28
# 5 1.24 22.8 23.6 0.573 0 -0.178 0 0 -0.0259 4.98
# 6 1.03 26.2 22.3 0.478 1.25 0.775 0 0 -0.0216 4.1
# 7 0.887 22.5 22.5 0.406 1.79 -0.101 0 0 -0.0185 3.56
# 8 0.776 19.7 21.3 0.944 1.70 0.105 0 0.0225 -0.0162 3.11
# 9 0.690 18.4 19.1 0.839 1.51 1.24 1.01 0.02 -0.0144 2.77
# 10 0.621 18.4 21.2 0.937 1.32 1.11 0.910 0.0725 -0.114 2.49
The data frame shows how feature importance changes with each additional tree. This is the right panel of your plot example. The trees themselves (for the left panel) can be retrieved from the final forest, which is given by dplyr::last( rfs ).
Disclaimer: This is not really an answer, but too long to post as a comment. Will remove if deemed not appropriate.
While I (think I) understand your question, to be honest I am unsure whether your question makes sense from a statistics/ML point-of-view. The following is based on my obviously limited understanding of RF and CART. Perhaps my comment-post will lead to some insights.
Let's start with some general random forest (RF) theory on variable importance from Hastie, Tibshirani, Friedman, The Elements of Statistical Learning, p. 593 (bold-face mine):
At each split in each tree, the improvement in the split-criterion is the
importance measure attributed to the splitting variable, and is accumulated
over all the trees in the forest separately for each variable. [...]
Random forests also use the oob samples to construct a different variable-importance measure, apparently to measure the prediction strength of each variable.
So the variable importance measure in RF is defined as a measure accumulated over all trees.
In traditional single classification trees (CARTs), variable importance is characterised through the Gini index that measures node impurity (see e.g. How to measure/rank “variable importance” when using CART? (specifically using {rpart} from R) and Carolin Strobl's PhD thesis)
More complex measures to characterise variable importance in CART-like models exist; for example in rpart:
An overall measure of variable importance is the sum of the goodness of split
measures for each split for which it was the primary variable, plus goodness * (adjusted
agreement) for all splits in which it was a surrogate. In the printout these are scaled to sum
to 100 and the rounded values are shown, omitting any variable whose proportion is less
than 1%.
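For reference, the rpart measure quoted above can be inspected on a fitted tree. The sketch below assumes the rpart package is installed (it ships with most R installations) and uses mtcars purely as an example; the rescaling to 100 mimics what the quoted passage describes for the printout and is done by hand here.

```r
library(rpart)

# Fit a single regression tree (CART) on mtcars
cart_fit <- rpart(mpg ~ ., data = mtcars)

# Raw importance: sums of goodness-of-split over primary and surrogate splits
raw_imp <- cart_fit$variable.importance

# Rescale so the values sum to 100, as described for the printed summary
scaled_imp <- round(100 * raw_imp / sum(raw_imp))
scaled_imp
```

These values are on a different scale from randomForest's %IncMSE, which is part of why the two are hard to compare directly.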
So the bottom line here is the following: At the very least it won't be easy (and in the worst case it won't make sense) to compare variable measures from single classification trees with variable importance measures applied to ensemble-based methods like RF.
Which leads me to ask: Why do you want to extract variable importance measures for individual trees from an RF model? Even if you came up with a method to calculate variable importances from individual trees, I believe they wouldn't be very meaningful, and they wouldn't have to "converge" to the ensemble-accumulated values.
We can simplify the averaging step by summing the importance matrices and dividing by the number of models:
library(tidyverse)
out <- map(seq_len(nn), ~ run_rf(.x) %>% importance) %>%
  reduce(`+`) %>%
  magrittr::divide_by(nn)
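The sum-and-divide idea can also be written in base R with Reduce(), and its accumulate = TRUE option gives running averages, which is closer to what a cumulative importance plot needs. The sketch below uses two small made-up matrices in place of real importance() output, so the numbers mean nothing; only the mechanics are being illustrated.

```r
# Toy stand-ins for per-model importance matrices (values are made up)
dn <- list(c("wt", "hp"), c("%IncMSE", "IncNodePurity"))
imp_list <- list(
  matrix(c(1, 2, 3, 4), nrow = 2, dimnames = dn),
  matrix(c(3, 4, 5, 6), nrow = 2, dimnames = dn)
)

# Element-wise mean across all matrices
avg_imp <- Reduce(`+`, imp_list) / length(imp_list)

# Running element-wise means after 1, 2, ... matrices
cum_sums    <- Reduce(`+`, imp_list, accumulate = TRUE)
running_avg <- Map(`/`, cum_sums, seq_along(cum_sums))
running_avg[[2]]
```

With real models, imp_list would be lapply(l, importance), where l is the list of single-tree models built earlier.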

In the "Tables"-package: How to get column percentages of a subset of a variable?

In the table below the column named "Percent" shows the total column percent. How do I get it to show the column percent of each level of "am" within each level of "vs"?
This is what I've got:
This is what I'm looking for:
Knitr chunk below:
<<echo=FALSE,results='asis'>>=
#
# library(tables)
# library(Hmisc)
# library(Formula)
## This gives me column percentages for the total table.
latex( tabular( Factor(vs)*Factor(am) ~ gear*Percent("col"), data=mtcars ) )
## I am trying to get column percentages for each level of "vs"
#
I think you would need to change your formula to do this. Like this for example:
tabular(Factor(vs) ~ gear*Percent("row")*Factor(am), data = mtcars)
# gear
# Percent
# am
#vs 0 1
#0 66.67 33.33
#1 50.00 50.00
You can use the Equal() pseudofunction for the denom option to make levels of factor vs the denominator.
library(tables)
tabular( Factor(vs)*Factor(am) ~ gear*Percent(denom = Equal(vs)), data=mtcars)
#>
#> gear
#> vs am Percent
#> 0 0 66.67
#> 1 33.33
#> 1 0 50.00
#> 1 50.00
Created on 2020-09-07 by the reprex package (v0.3.0)
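As a quick cross-check of these percentages outside the tables package, the same within-vs breakdown can be computed in base R with table() and prop.table(). This is only a sanity check of the numbers, not a replacement for the LaTeX output that tabular() produces; the object names are made up.

```r
# Counts of am within each level of vs
counts <- table(vs = mtcars$vs, am = mtcars$am)

# margin = 1 makes each row (each level of vs) sum to 100
pct_within_vs <- prop.table(counts, margin = 1) * 100
round(pct_within_vs, 2)
```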
