Using groups as table header in kableExtra - r

I recently discovered kableExtra. Making tables in R instead of manually entering values in a Word-table is much faster and less prone to error.
I want to make tables that have a (or several) grouping variable(s) as a header.
Basically, instead of this
iris %>%
group_by(Species) %>%
summarise(mean = mean(Sepal.Length), sd = sd(Sepal.Length)) %>%
kbl(digits = 1,format = "pipe")
|Species | mean| sd|
|:----------|----:|---:|
|setosa | 5.0| 0.4|
|versicolor | 5.9| 0.5|
|virginica | 6.6| 0.6|
Instead I want to accomplish this, or a variation of this.
| | Setosa | Versicolor | Virginica | |
|------|--------|------------|-----------|---|
| mean | 5.0 | 5.9 | 6.6 | |
| sd | 0.4 | 0.5 | 0.5 | |
| | | | | |
For multiple headers, I was thinking something in the lines of this
iris %>%
mutate(long = ifelse(Sepal.Length > 5,TRUE,FALSE)) %>%
group_by(Species,long) %>%
summarise(mean = mean(Sepal.Length), sd = sd(Sepal.Length)) %>%
kbl(digits = 1)
|Species |long | mean| sd|
|:----------|:-----|----:|---:|
|setosa |FALSE | 4.8| 0.2|
|setosa |TRUE | 5.3| 0.2|
|versicolor |FALSE | 5.0| 0.1|
|versicolor |TRUE | 6.0| 0.5|
|virginica |FALSE | 4.9| NA|
|virginica |TRUE | 6.6| 0.6|
But instead producing
| | Setosa | Setosa | Versicolor | Versicolor | Virginica | Virginica | |
|------|--------|--------|------------|------------|-----------|-----------|---|
| long | TRUE | FALSE | TRUE | FALSE | TRUE | FALSE | |
| mean | 5.3 | 4.8 | 6.0 | 5.0 | 6.6 | 4.9 | |
| sd | 0.2 | 0.2 | 0.5 | 0.1 | 0.6 | NA | |
| | | | | | | | |
Bonus points for not repeating the table header, but having a merged cell.
Can anyone point me to any examples or relevant documentation?
I am using R 4.0.2, and as such only have access to kableExtra, not kable.

Related

Top 3 Box without TRUE FALSE in R / Rstudio

I'm new to R. I'm able to create top 3 and bottom 3 boxes in my tables, but it displays as "TRUE" and "FALSE" like this...
The code that i used is...
library(expss)
X4607 %>%
tab_cells(qcs1a_SQ001, "Top 3 Box"=qcs1a_SQ001>7 & qcs1a_SQ001<11, "Bottom 3 Box"=qcs1a_SQ001<=2) %>%
tab_cols(total(), spcode) %>%
tab_stat_cpct() %>%
tab_last_sig_cpct() %>%
tab_pivot()
Is there any way to just have the number of 'TRUE' come in under the "top 3 box" label and get rid of the "TRUE" and "FALSE" displaying.
There is a special function subtotal for that:
library(expss)
set.seed(123)
N = 100
X4607 = data.frame(
spcode = sample(c("South", "North"), size = N, replace = TRUE),
qcs1a_SQ001 = sample(c(1:10, 99), size = N, replace = TRUE)
)
X4607 %>%
tab_cells(subtotal(qcs1a_SQ001, "Bottom 3 Box" = 1:3, "Top 3 Box" = 7:10, position = "bottom")) %>%
tab_cols(total(), spcode) %>%
tab_stat_cpct() %>%
tab_last_sig_cpct() %>%
tab_pivot()
# | | #Total | spcode | |
# | | | North | South |
# | | | A | B |
# | ------------ | ------ | ------ | ------ |
# | 1 | 5.0 | 9.3 | 1.8 |
# | 2 | 5.0 | 4.7 | 5.3 |
# | 3 | 9.0 | 4.7 | 12.3 |
# | 4 | 11.0 | 7.0 | 14.0 |
# | 5 | 6.0 | 9.3 | 3.5 |
# | 6 | 10.0 | 2.3 | 15.8 A |
# | 7 | 13.0 | 16.3 | 10.5 |
# | 8 | 14.0 | 16.3 | 12.3 |
# | 9 | 10.0 | 11.6 | 8.8 |
# | 10 | 10.0 | 9.3 | 10.5 |
# | 99 | 7.0 | 9.3 | 5.3 |
# | Bottom 3 Box | 19.0 | 18.6 | 19.3 |
# | Top 3 Box | 47.0 | 53.5 | 42.1 |
# | #Total cases | 100 | 43 | 57 |

expss table with row percentage within nested variables in R

When using the expss package in R for creating tables, how does one get the row_percentages to be calculated within a nested variable? In the example below, I would like the row percentage to be calculated within each time period. Thus, I would like the row percentage to sum to 100% within each time period (2015-2016 and 2017-2018). Now however, the percentage is calculated over the entire row.
library(expss)
data(mtcars)
mtcars$period <- "2015-2016"
mtcars <- rbind(mtcars, mtcars)
mtcars$period[33:64] <- "2017-2018"
mtcars = apply_labels(mtcars,
cyl = "Number of cylinders",
am = "Transmission",
am = c("Automatic" = 0,
"Manual"=1),
period = "Measurement period"
)
mtcars %>%
tab_cells(cyl) %>%
tab_cols(period %nest% am) %>%
tab_stat_rpct(label = "row_perc") %>%
tab_pivot()
Created on 2019-09-28 by the reprex package (v0.3.0)
| | | | Measurement period | | | |
| | | | 2015-2016 | | 2017-2018 | |
| | | | Transmission | | Transmission | |
| | | | Automatic | Manual | Automatic | Manual |
| ------------------- | ------------ | -------- | ------------------ | ------ | ------------ | ------ |
| Number of cylinders | 4 | row_perc | 13.6 | 36.4 | 13.6 | 36.4 |
| | 6 | row_perc | 28.6 | 21.4 | 28.6 | 21.4 |
| | 8 | row_perc | 42.9 | 7.1 | 42.9 | 7.1 |
| | #Total cases | row_perc | 19.0 | 13.0 | 19.0 | 13.0 |
I believe this is what you are after:
library(expss)
data(mtcars)
mtcars$period <- "2015-2016"
mtcars <- rbind(mtcars, mtcars)
mtcars$period[33:64] <- "2017-2018"
mtcars = apply_labels(mtcars,
cyl = "Number of cylinders",
am = "Transmission",
am = c("Automatic" = 0,
"Manual"=1),
period = "Measurement period"
)
mtcars %>%
tab_cells(cyl) %>%
tab_cols(period %nest% am ) %>%
tab_subgroup(period =="2015-2016") %>%
tab_stat_rpct(label = "row_perc") %>%
tab_subgroup(period =="2017-2018") %>%
tab_stat_rpct(label = "row_perc") %>%
tab_pivot(stat_position = "inside_rows")
Pay attention to the use of tab_subgroup() which determines which subgroup of year period we want to calculate the percentage as well as to stat_position = "inside_rows" which determines where we want to put the calculated output in the final table.
Output:
| | | | Measurement period | | | |
| | | | 2015-2016 | | 2017-2018 | |
| | | | Transmission | | Transmission | |
| | | | Automatic | Manual | Automatic | Manual |
| ------------------- | ------------ | -------- | ------------------ | ------ | ------------ | ------ |
| Number of cylinders | 4 | row_perc | 27.3 | 72.7 | | |
| | | | | | 27.3 | 72.7 |
| | 6 | row_perc | 57.1 | 42.9 | | |
| | | | | | 57.1 | 42.9 |
| | 8 | row_perc | 85.7 | 14.3 | | |
| | | | | | 85.7 | 14.3 |
| | #Total cases | row_perc | 19.0 | 13.0 | | |
| | | | | | 19.0 | 13.0 |
EDIT:
We do not need %nest% if we do not want nested rows(i.e. twice more rows). In this case, the final part of the code should be modified as follows:
mtcars %>%
tab_cells(cyl) %>%
tab_cols(period,am) %>%
tab_subgroup(period ==c("2015-2016")) %>%
tab_stat_rpct(label = "row_perc") %>%
tab_subgroup(period ==c("2017-2018")) %>%
tab_stat_rpct(label = "row_perc") %>%
tab_pivot(stat_position = "outside_columns")
Output:
| | | Measurement period | Transmission | | |
| | | 2015-2016 | Automatic | Manual | Automatic |
| | | row_perc | row_perc | row_perc | row_perc |
| ------------------- | ------------ | ------------------ | ------------ | -------- | --------- |
| Number of cylinders | 4 | 100 | 27.3 | 72.7 | 27.3 |
| | 6 | 100 | 57.1 | 42.9 | 57.1 |
| | 8 | 100 | 85.7 | 14.3 | 85.7 |
| | #Total cases | 32 | 19.0 | 13.0 | 19.0 |
| Measurement period |
Manual | 2017-2018 |
row_perc | row_perc |
-------- | ------------------ |
72.7 | 100 |
42.9 | 100 |
14.3 | 100 |
13.0 | 32 |

For each combination of a set of variables in a list, calculating correlations between this combination and another variable in R

In R I want to generate correlation co-efficients by comparing 2 variables whilst also retaining a phylogenetic signal.
The initial way I thought to do this is not computationally efficient, and I think there is a much simpler, but I do not have the skills in R to do it.
I have a csv file which looks like this:
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
| Species | OGT | Domain | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
| Aeropyrum pernix | 95 | Archaea | 9.7659115711 | 0.6720465616 | 4.3895390781 | 7.6501943794 | 2.9344881615 | 8.8666657183 | 1.5011817208 | 5.6901432494 | 4.1428307243 | 11.0604191603 | 2.21143353 | 1.9387130928 | 5.1038552753 | 1.6855017182 | 7.7664358772 | 6.266067034 | 4.2052190807 | 9.2692433532 | 1.318690698 | 3.5614200159 |
| Argobacterium fabrum | 26 | Bacteria | 11.5698896021 | 0.7985475923 | 5.5884500155 | 5.8165463343 | 4.0512504104 | 8.2643271309 | 2.0116736244 | 5.7962804605 | 3.8931525401 | 9.9250463349 | 2.5980609708 | 2.9846761128 | 4.7828063605 | 3.1262365491 | 6.5684282943 | 5.9454781844 | 5.3740045968 | 7.3382308193 | 1.2519739683 | 2.3149400984 |
| Anaeromyxobacter dehalogenans | 27 | Bacteria | 16.0337898849 | 0.8860252895 | 5.1368827707 | 6.1864992608 | 2.9730203513 | 9.3167603253 | 1.9360386851 | 2.940143349 | 2.3473650439 | 10.898494736 | 1.6343905351 | 1.5247123262 | 6.3580285706 | 2.4715303021 | 9.2639057482 | 4.1890063803 | 4.3992339725 | 8.3885969061 | 1.2890166336 | 1.8265589289 |
| Aquifex aeolicus | 85 | Bacteria | 5.8730327277 | 0.795341216 | 4.3287799008 | 9.6746388172 | 5.1386954322 | 6.7148035486 | 1.5438364179 | 7.3358775924 | 9.4641440609 | 10.5736658776 | 1.9263080969 | 3.6183861236 | 4.0518679067 | 2.0493569604 | 4.9229955632 | 4.7976564501 | 4.2005259246 | 7.9169763709 | 0.9292167138 | 4.1438942987 |
| Archaeoglobus fulgidus | 83 | Archaea | 7.8742687687 | 1.1695110027 | 4.9165979364 | 8.9548767369 | 4.568636662 | 7.2640358917 | 1.4998752909 | 7.2472039919 | 6.8957233203 | 9.4826333048 | 2.6014466253 | 3.206476915 | 3.8419576418 | 1.7789787933 | 5.7572748236 | 5.4763351139 | 4.1490633048 | 8.6330814159 | 1.0325605451 | 3.6494619148 |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
What I want to do is, for each possible combination of the percentages within the 20 single letter columns (amino acids, so 10 million combinations). Is to calculate the correlation between each different combination and the OGT variable in the CSV.... (whilst retaining a phylogenetic signal)
My current code is this:
library(parallel)
library(dplyr)
library(tidyr)
library(magrittr)
library(ape)
library(geiger)
library(caper)
taxonomynex <- read.nexus("taxonomyforzeldospecies.nex")
zeldodata <- read.csv("COMPLETECOPYFORR.csv")
Species <- dput(zeldodata)
SpeciesLong <-
Species %>%
gather(protein, proportion,
A:Y) %>%
arrange(Species)
S <- unique(SpeciesLong$protein)
Scombi <- unlist(lapply(seq_along(S),
function(x) combn(S, x, FUN = paste0, collapse = "")))
joint_protein <- function(protein_combo, data){
sum(data$proportion[vapply(data$protein,
grepl,
logical(1),
protein_combo)])
}
SplitSpecies <-
split(SpeciesLong,
SpeciesLong$Species)
cl <- makeCluster(detectCores() - 1)
clusterExport(cl, c("Scombi", "joint_protein"))
SpeciesAggregate <-
parLapply(cl,
X = SplitSpecies,
fun = function(data){
X <- lapply(Scombi,
joint_protein,
data)
names(X) <- Scombi
as.data.frame(X)
})
Species <- cbind(Species, SpeciesAggregate)
`
Which attempts to feed in each combination into memory and then calculate the sum of each proportion of each of the acids, but this takes forever to finish and crashes before completion.
I think it would be better to feed in correlation co-efficents into a vector, and then just print out the relative co-efficients of each different combination for each species, but I don't know the best way of doing this in R.
I also aim to retain a phylogenetic signal using the ape package using something along the lines of this:
pglsModel <- gls(OGT ~ AminoAcidCombination, correlation = corBrownian(phy = taxonomynex),
data = zeldodata, method = "ML")
summary(pglsModel)
Apologies for how unclear this is, if anyone has any advice, much appreciated!
Edit: Link to taxonomyforzeldospecies.nex
Output from dput(Zeldodata):
1 Species OGT Domain A C D E F G H I K L M N P Q R S T V W Y
------------------------------- ----- ---------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
2 Aeropyrum pernix 95 Archaea 9.7659115711 0.6720465616 4.3895390781 7.6501943794 2.9344881615 8.8666657183 1.5011817208 5.6901432494 4.1428307243 11.0604191603 2.21143353 1.9387130928 5.1038552753 1.6855017182 7.7664358772 6.266067034 4.2052190807 9.2692433532 1.318690698 3.5614200159
3 Argobacterium fabrum 26 Bacteria 11.5698896021 0.7985475923 5.5884500155 5.8165463343 4.0512504104 8.2643271309 2.0116736244 5.7962804605 3.8931525401 9.9250463349 2.5980609708 2.9846761128 4.7828063605 3.1262365491 6.5684282943 5.9454781844 5.3740045968 7.3382308193 1.2519739683 2.3149400984
4 Anaeromyxobacter dehalogenans 27 Bacteria 16.0337898849 0.8860252895 5.1368827707 6.1864992608 2.9730203513 9.3167603253 1.9360386851 2.940143349 2.3473650439 10.898494736 1.6343905351 1.5247123262 6.3580285706 2.4715303021 9.2639057482 4.1890063803 4.3992339725 8.3885969061 1.2890166336 1.8265589289
5 Aquifex aeolicus 85 Bacteria 5.8730327277 0.795341216 4.3287799008 9.6746388172 5.1386954322 6.7148035486 1.5438364179 7.3358775924 9.4641440609 10.5736658776 1.9263080969 3.6183861236 4.0518679067 2.0493569604 4.9229955632 4.7976564501 4.2005259246 7.9169763709 0.9292167138 4.1438942987
6 Archaeoglobus fulgidus 83 Archaea 7.8742687687 1.1695110027 4.9165979364 8.9548767369 4.568636662 7.2640358917 1.4998752909 7.2472039919 6.8957233203 9.4826333048 2.6014466253 3.206476915 3.8419576418 1.7789787933 5.7572748236 5.4763351139 4.1490633048 8.6330814159 1.0325605451 3.6494619148
this will give you a long data frame with each combination and sum per Species (takes about 35 seconds on my machine)...
zeldodata <-
Species %>%
gather(protein, proportion, A:Y) %>%
group_by(Species) %>%
mutate(combo = sapply(1:n(), function(i) combn(protein, i, FUN = paste0, collapse = ""))) %>%
mutate(sum = sapply(1:n(), function(i) combn(proportion, i, FUN = sum))) %>%
unnest() %>%
select(-protein, -proportion)
an example of calculating each species separately and saving the data to disk before reading each one in and combining them...
library(readr)
library(dplyr)
library(tidyr)
library(purrr)
# read in CSV file
zeldodata <-
read_delim(
delim = "|",
trim_ws = TRUE,
col_names = TRUE,
col_types = "cicdddddddddddddddddddd",
file = "Species | OGT | Domain | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y
Aeropyrum pernix | 95 | Archaea | 9.7659115711 | 0.6720465616 | 4.3895390781 | 7.6501943794 | 2.9344881615 | 8.8666657183 | 1.5011817208 | 5.6901432494 | 4.1428307243 | 11.0604191603 | 2.21143353 | 1.9387130928 | 5.1038552753 | 1.6855017182 | 7.7664358772 | 6.266067034 | 4.2052190807 | 9.2692433532 | 1.318690698 | 3.5614200159
Argobacterium fabrum | 26 | Bacteria | 11.5698896021 | 0.7985475923 | 5.5884500155 | 5.8165463343 | 4.0512504104 | 8.2643271309 | 2.0116736244 | 5.7962804605 | 3.8931525401 | 9.9250463349 | 2.5980609708 | 2.9846761128 | 4.7828063605 | 3.1262365491 | 6.5684282943 | 5.9454781844 | 5.3740045968 | 7.3382308193 | 1.2519739683 | 2.3149400984
Anaeromyxobacter dehalogenans | 27 | Bacteria | 16.0337898849 | 0.8860252895 | 5.1368827707 | 6.1864992608 | 2.9730203513 | 9.3167603253 | 1.9360386851 | 2.940143349 | 2.3473650439 | 10.898494736 | 1.6343905351 | 1.5247123262 | 6.3580285706 | 2.4715303021 | 9.2639057482 | 4.1890063803 | 4.3992339725 | 8.3885969061 | 1.2890166336 | 1.8265589289
Aquifex aeolicus | 85 | Bacteria | 5.8730327277 | 0.795341216 | 4.3287799008 | 9.6746388172 | 5.1386954322 | 6.7148035486 | 1.5438364179 | 7.3358775924 | 9.4641440609 | 10.5736658776 | 1.9263080969 | 3.6183861236 | 4.0518679067 | 2.0493569604 | 4.9229955632 | 4.7976564501 | 4.2005259246 | 7.9169763709 | 0.9292167138 | 4.1438942987
Archaeoglobus fulgidus | 83 | Archaea | 7.8742687687 | 1.1695110027 | 4.9165979364 | 8.9548767369 | 4.568636662 | 7.2640358917 | 1.4998752909 | 7.2472039919 | 6.8957233203 | 9.4826333048 | 2.6014466253 | 3.206476915 | 3.8419576418 | 1.7789787933 | 5.7572748236 | 5.4763351139 | 4.1490633048 | 8.6330814159 | 1.0325605451 | 3.6494619148"
)
# save an RDS file for each species
for(species in unique(zeldodata$Species)) {
zeldodata %>%
filter(Species == species) %>%
gather(protein, proportion, A:Y) %>%
mutate(combo = sapply(1:n(), function(i) combn(protein, i, FUN = paste0, collapse = ""))) %>%
mutate(sum = sapply(1:n(), function(i) combn(proportion, i, FUN = sum))) %>%
unnest() %>%
select(-protein, -proportion) %>%
saveRDS(file = paste0(species, ".RDS"))
}
# read in and combine all the RDS files
zeldodata <-
list.files(pattern = "\\.RDS") %>%
map(read_rds) %>%
bind_rows()

org-mode: add a header to a table programmatically

I have a table definied in org-mode:
`#+RESULTS[4fc5d440d2954e8355d32d8004cab567f9918a64]: table
| 7.4159 | 3.0522 | 5.9452 |
| -1.0548 | 12.574 | -6.5001 |
| 7.4159 | 3.0522 | 5.9452 |
| 5.1884 | 4.9813 | 4.9813 |
`
and I want to produce the following table:
#+caption: Caption of my table
| | group 1 | group 2 | group 3 |
|--------+---------+---------+---------|
| plan 1 | 7.416 | 3.052 | 5.945 |
| plan 2 | -1.055 | 12.574 | -6.5 |
| plan 3 | 7.416 | 3.052 | 5.945 |
| plan 4 | 5.1884 | 4.9813 | 4.9813 |
How can I accomplish that? Here is what I tried (in R):
`
#+begin_src R :colnames yes :var table=table :session
data.frame(table)
#+end_src
`
But of course it doesn't work, here is what I get:
`#RESULTS:
| X7.4159 | X3.0522 | X5.9452 |
|---------+---------+---------|
| -1.0548 | 12.574 | -6.5001 |
| 7.4159 | 3.0522 | 5.9452 |
| 5.1884 | 4.9813 | 4.9813 |`
Any suggestions?
thanks!
This gets pretty close. first define this function:
#+BEGIN_SRC emacs-lisp
(defun add-caption (caption)
(concat (format "org\n#+caption: %s" caption)))
#+END_SRC
Next, use this kind of src block. I use python, but it should work in R too, you just need the :wrap. I passed your data in through the var, you don't need it if you generate the data in the block.
#+BEGIN_SRC python :results value :var data=data :wrap (add-caption "Some really long, uninteresting, caption about data that is in this table.")
data.insert(0, ["", "group 1", "group 2", "group 3"])
data.insert(1, None)
return data
#+END_SRC
This outputs
#+BEGIN_org
#+caption: Some really long, uninteresting, caption about data that is in this table.
| | group 1 | group 2 | group 3 |
|--------+---------+---------+---------|
| plan 1 | 7.416 | 3.052 | 5.945 |
| plan 2 | -1.055 | 12.574 | -6.5 |
| plan 3 | 7.416 | 3.052 | 5.945 |
| plan 4 | 5.1884 | 4.9813 | 4.9813 |
#+END_org
and it exports ok too I think.

How to find correlations in a dataset containing over 350 columns in R

I have a dataset with ~360 measurement types listed as columns and has 200 rows each with unique ID.
+-----+-------+--------+--------+---------+---------+---------+---+---------+
| | ID | M1 | M2 | M3 | M4 | M5 | … | M360 |
+-----+-------+--------+--------+---------+---------+---------+---+---------+
| 1 | 6F0ZC | 0.068 | 0.0691 | 37.727 | 42.6139 | 41.7356 | … | 44.9293 |
| 2 | 6F0ZY | 0.0641 | 0.0661 | 37.2551 | 43.2009 | 40.8979 | … | 45.7524 |
| 3 | 6F106 | 0.0661 | 0.0676 | 36.9686 | 42.9519 | 41.262 | … | 45.7038 |
| 4 | 6F108 | 0.0685 | 0.069 | 38.3026 | 43.5699 | 42.3 | … | 46.1701 |
| 5 | 6F10A | 0.0657 | 0.0668 | 37.8442 | 43.2453 | 41.7191 | … | 45.7597 |
| 6 | 6F19W | 0.0682 | 0.071 | 38.6493 | 42.4611 | 42.2224 | … | 45.3165 |
| 7 | 6F1A0 | 0.0681 | 0.069 | 39.3956 | 44.2963 | 44.1344 | … | 46.5918 |
| 8 | 6F1A6 | 0.0662 | 0.0666 | 38.5942 | 42.6359 | 42.2369 | … | 45.4439 |
| . | . | . | . | . | . | . | . | . |
| . | . | . | . | . | . | . | . | . |
| . | . | . | . | . | . | . | . | . |
| 199 | 6F1AA | 0.0665 | 0.0672 | 40.438 | 44.9896 | 44.9409 | … | 47.5938 |
| 200 | 6F1AC | 0.0659 | 0.0681 | 39.528 | 44.606 | 43.2454 | … | 46.4338 |
+-----+-------+--------+--------+---------+---------+---------+---+---------+
I am trying to find correlations within these measurements and check for highly correlated features and visualize them. With so many columns, I am not able to do the regular correlation plots. (chart.Correlation,corrgram,etc..)
I also tried using qgraph but the measurements get cluttered at one place and is not very intuitive.
library(qgraph)
qgraph(cor(df[-c(1)], use="pairwise"),
layout="spring",
label.cex=0.9,
minimum = 0.90,
label.scale=FALSE)
Is there a good approach to visualize it & tell how these measurements are correlated with each other?
As mentioned in a comment, corrplot(...) might be a good option. Here is a ggplot option that does something similar. The basic idea is to draw a heat map, where color represents the correlation coefficient.
# create artificial dataset - you have this already
set.seed(1) # for reproducible example
df <- matrix(rnorm(180*100),nr=100)
df <- do.call(cbind,lapply(1:180,function(i)cbind(df[,i],2*df[,i])))
# you start here
library(ggplot2)
library(reshape2)
cor.df <- as.data.frame(cor(df))
cor.df$x <- factor(rownames(cor.df), levels=rownames(cor.df))
gg.df <- melt(cor.df,id="x",variable.name="y", value.name="cor")
# tiles colored continuously based on correlation coefficient
ggplot(gg.df, aes(x,y,fill=cor))+
geom_tile()+
scale_fill_gradientn(colours=rev(heat.colors(10)))
coord_fixed()
# tiles colors based on increments in correlation coefficient
gg.df$level <- cut(gg.df$cor,breaks=6)
ggplot(gg.df, aes(x,y,fill=level))+
geom_tile()+
scale_fill_manual(values=rev(heat.colors(5)))+
coord_fixed()
Note the diagonal. This is by design - the contrived data is set up so that rows i and i+1 are perfectly correlated, for every other row.

Resources