How to find correlations in a dataset containing over 350 columns in R

I have a dataset with ~360 measurement types as columns and 200 rows, each with a unique ID.
+-----+-------+--------+--------+---------+---------+---------+---+---------+
| | ID | M1 | M2 | M3 | M4 | M5 | … | M360 |
+-----+-------+--------+--------+---------+---------+---------+---+---------+
| 1 | 6F0ZC | 0.068 | 0.0691 | 37.727 | 42.6139 | 41.7356 | … | 44.9293 |
| 2 | 6F0ZY | 0.0641 | 0.0661 | 37.2551 | 43.2009 | 40.8979 | … | 45.7524 |
| 3 | 6F106 | 0.0661 | 0.0676 | 36.9686 | 42.9519 | 41.262 | … | 45.7038 |
| 4 | 6F108 | 0.0685 | 0.069 | 38.3026 | 43.5699 | 42.3 | … | 46.1701 |
| 5 | 6F10A | 0.0657 | 0.0668 | 37.8442 | 43.2453 | 41.7191 | … | 45.7597 |
| 6 | 6F19W | 0.0682 | 0.071 | 38.6493 | 42.4611 | 42.2224 | … | 45.3165 |
| 7 | 6F1A0 | 0.0681 | 0.069 | 39.3956 | 44.2963 | 44.1344 | … | 46.5918 |
| 8 | 6F1A6 | 0.0662 | 0.0666 | 38.5942 | 42.6359 | 42.2369 | … | 45.4439 |
| . | . | . | . | . | . | . | . | . |
| . | . | . | . | . | . | . | . | . |
| . | . | . | . | . | . | . | . | . |
| 199 | 6F1AA | 0.0665 | 0.0672 | 40.438 | 44.9896 | 44.9409 | … | 47.5938 |
| 200 | 6F1AC | 0.0659 | 0.0681 | 39.528 | 44.606 | 43.2454 | … | 46.4338 |
+-----+-------+--------+--------+---------+---------+---------+---+---------+
I am trying to find correlations within these measurements, check for highly correlated features, and visualize them. With so many columns, the regular correlation plots (chart.Correlation, corrgram, etc.) are not workable.
I also tried qgraph, but the measurements get cluttered in one place and the plot is not very intuitive.
library(qgraph)
qgraph(cor(df[-c(1)], use = "pairwise"),
       layout = "spring",
       label.cex = 0.9,
       minimum = 0.90,
       label.scale = FALSE)
Is there a good approach to visualize this and tell how these measurements are correlated with each other?

As mentioned in a comment, corrplot(...) might be a good option. Here is a ggplot option that does something similar. The basic idea is to draw a heat map, where color represents the correlation coefficient.
# create artificial dataset - you have this already
set.seed(1)   # for reproducible example
df <- matrix(rnorm(180 * 100), nrow = 100)
df <- do.call(cbind, lapply(1:180, function(i) cbind(df[, i], 2 * df[, i])))
# you start here
library(ggplot2)
library(reshape2)
cor.df <- as.data.frame(cor(df))
cor.df$x <- factor(rownames(cor.df), levels = rownames(cor.df))
gg.df <- melt(cor.df, id = "x", variable.name = "y", value.name = "cor")
# tiles colored continuously based on the correlation coefficient
ggplot(gg.df, aes(x, y, fill = cor)) +
  geom_tile() +
  scale_fill_gradientn(colours = rev(heat.colors(10))) +
  coord_fixed()
# tiles colored in discrete increments of the correlation coefficient
gg.df$level <- cut(gg.df$cor, breaks = 6)
ggplot(gg.df, aes(x, y, fill = level)) +
  geom_tile() +
  scale_fill_manual(values = rev(heat.colors(5))) +
  coord_fixed()
Note the bright stripe right next to the main diagonal. This is by design: the contrived data is set up so that columns i and i+1 are perfectly correlated for every odd i.
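If the goal is also to extract the highly correlated features rather than only look at them, here is a small sketch using caret::findCorrelation() (an assumption on my part that caret is available; this goes back to the question's data frame, with the ID in column 1):
library(caret)   # assumption: caret is installed; findCorrelation() comes from there
cor.mat  <- cor(df[, -1], use = "pairwise.complete.obs")  # drop the ID column first
high.cor <- findCorrelation(cor.mat, cutoff = 0.90)       # columns suggested for removal (|r| >= 0.90 pairs)
colnames(cor.mat)[high.cor]                               # names of the flagged measurements
The flagged names can then be fed back into qgraph or the heat map above to zoom in on just the problematic block.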

Related

Relabel of rowname column in R dataframe

When I bind multiple dataframes together using Out2 = do.call(rbind.data.frame, Out), I obtain the following output. How do I relabel the first column such that it only contains the numbers within the square brackets, i.e. 1 to 5 for each trial number? Is there a way to add a column name to the first column too?
|        | V1           | V2           | Trial |
+--------+--------------+--------------+-------+
| [1,] | 0.130880519 | 0.02085533 | 1 |
| [2,] | 0.197243133 | -0.000502744 | 1 |
| [3,] | -0.045241653 | 0.106888902 | 1 |
| [4,] | 0.328759949 | -0.106559163 | 1 |
| [5,] | 0.040894969 | 0.114073454 | 1 |
| [1,]1 | 0.103130056 | 0.013655756 | 2 |
| [2,]1 | 0.133080106 | 0.038049071 | 2 |
| [3,]1 | 0.067975054 | 0.03036033 | 2 |
| [4,]1 | 0.132437217 | 0.022887103 | 2 |
| [5,]1 | 0.124950463 | 0.007144698 | 2 |
| [1,]2 | 0.202996317 | 0.004181205 | 3 |
| [2,]2 | 0.025401354 | 0.045672932 | 3 |
| [3,]2 | 0.169469266 | 0.002551237 | 3 |
| [4,]2 | 0.2303046 | 0.004936579 | 3 |
| [5,]2 | 0.085702254 | 0.020814191 | 3 |
+--------+--------------+--------------+-------+
We can use parse_number to extract the first occurrence of numbers in the row names:
library(dplyr)
df1 %>%
  mutate(newcol = readr::parse_number(row.names(df1)))
Or in base R, use sub to capture the digits after the [ in the row names
df1$newcol <- sub("^\\[(\\d+).*", "\\1", row.names(df1))
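A quick check of that pattern on a few of the row names from the output above (a hypothetical snippet, just to show what the capture returns):
sub("^\\[(\\d+).*", "\\1", c("[1,]", "[5,]1", "[3,]2"))
# [1] "1" "5" "3"
# wrap in as.numeric() if numbers rather than characters are needed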

Data Cleanup with R: Getting rid of extra full stops

I am cleaning the data using R.
Below is my data format
Input
1) 100 | 101.25 | 102.25. | . | .. | 201.5. |
2) 200.05. | 200.56. | 205 | .. | . | 3000 |
3) 300.98 | 300.26. | 2001.56.| ... | 0.2| 5.65. |
expected output:
1) 100 | 101.25 | 102.25 |NA | NA |201.5
2) 200.05|200.26 | 205 |NA | NA |3000
3) 300.98|300.26 |2001.26 |NA |0.2 |5.65
There are extra full stops in the table which I am trying to clean out, while keeping the genuine decimal points intact.
I tried a blanket replace-all in R, but that removes every full stop and distorts the decimal numbers.
If the trailing full stop is really the only manifestation of the problem, then you may try just removing it with sub:
x <- c("101.25", "200.56.", "300.26")
x <- sub("\\.$", "", x)
You can use a look-ahead to remove the dots (.) that are immediately followed by a space or a |:
x <- '1) 100 | 101.25 | 102.25. | . | .. | 201.5. |
2) 200.05. | 200.56. | 205 | .. | . | 3000 |
3) 300.98 | 300.26. | 2001.56.| ... | 0.2| 5.65. |'
y <- gsub("([.]+)(?=[[:blank:]|])","",x,perl = TRUE)
cat(y)
# 1) 100 | 101.25 | 102.25 | | | 201.5 |
# 2) 200.05 | 200.56 | 205 | | | 3000 |
# 3) 300.98 | 300.26 | 2001.56| | 0.2| 5.65 |
Regex explanation:
([.]+) - matches a run of one or more literal dots
(?=[[:blank:]|]) - look-ahead: the dots must be immediately followed by a blank or a |
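To also get the NA cells from the expected output, one further step (a sketch, shown on a single column of values) is to strip the trailing dots and let as.numeric() coerce the now-empty fields:
vals <- c("102.25.", ".", "..", "0.2", "5.65.")
vals <- gsub("[.]+$", "", trimws(vals))    # strip trailing runs of full stops
suppressWarnings(as.numeric(vals))         # empty strings coerce to NA
# [1] 102.25     NA     NA   0.20   5.65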

org-mode: add a header to a table programmatically

I have a table defined in org-mode:
#+RESULTS[4fc5d440d2954e8355d32d8004cab567f9918a64]: table
|  7.4159 |  3.0522 |  5.9452 |
| -1.0548 | 12.574  | -6.5001 |
|  7.4159 |  3.0522 |  5.9452 |
|  5.1884 |  4.9813 |  4.9813 |
and I want to produce the following table:
#+caption: Caption of my table
| | group 1 | group 2 | group 3 |
|--------+---------+---------+---------|
| plan 1 | 7.416 | 3.052 | 5.945 |
| plan 2 | -1.055 | 12.574 | -6.5 |
| plan 3 | 7.416 | 3.052 | 5.945 |
| plan 4 | 5.1884 | 4.9813 | 4.9813 |
How can I accomplish that? Here is what I tried (in R):
#+begin_src R :colnames yes :var table=table :session
data.frame(table)
#+end_src
But of course it doesn't work; here is what I get:
#+RESULTS:
| X7.4159 | X3.0522 | X5.9452 |
|---------+---------+---------|
| -1.0548 | 12.574  | -6.5001 |
|  7.4159 |  3.0522 |  5.9452 |
|  5.1884 |  4.9813 |  4.9813 |
Any suggestions?
thanks!
This gets pretty close. First, define this function:
#+BEGIN_SRC emacs-lisp
(defun add-caption (caption)
  (concat (format "org\n#+caption: %s" caption)))
#+END_SRC
Next, use this kind of src block. I use Python, but it should work in R too; you just need the :wrap. I passed your data in through the :var header; you don't need that if you generate the data in the block.
#+BEGIN_SRC python :results value :var data=data :wrap (add-caption "Some really long, uninteresting, caption about data that is in this table.")
data.insert(0, ["", "group 1", "group 2", "group 3"])
data.insert(1, None)
return data
#+END_SRC
This outputs
#+BEGIN_org
#+caption: Some really long, uninteresting, caption about data that is in this table.
| | group 1 | group 2 | group 3 |
|--------+---------+---------+---------|
| plan 1 | 7.416 | 3.052 | 5.945 |
| plan 2 | -1.055 | 12.574 | -6.5 |
| plan 3 | 7.416 | 3.052 | 5.945 |
| plan 4 | 5.1884 | 4.9813 | 4.9813 |
#+END_org
and it exports ok too I think.
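Since the answer says it should work in R too, here is a rough, untested sketch of what an equivalent R block might look like. Two assumptions: the table arrives through :var as a header-less data frame, and :colnames is left alone because :colnames yes is what consumed the first data row and produced the X7.4159 header above. Unlike the Python version, this does not insert the |---| rule under the header row.
#+BEGIN_SRC R :results value :var table=table :wrap (add-caption "Caption of my table")
  rbind(c("", "group 1", "group 2", "group 3"),                        # hand-built header row
        cbind(paste("plan", seq_len(nrow(table))), as.matrix(table)))  # row labels + original values
#+END_SRC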

How to make multiple corpora in R

This is car review data with more than 40,000 rows, and each review has more than 500 characters. Here is sample data: https://drive.google.com/open?id=1ZRwzYH5McZIP2NLKxncmFaQ0mX1Pe0GShTMu57Tac_E
| brand | review | favorite | c4 | c5 | c6 | c7 | c8 |
| brand1 | 500 characters1 | 100 characters1 | | | | | |
| brand2 | 500 characters2 | 100 Characters2 | | | | | |
| brand2 | 500 characters3 | 100 Characters3 | | | | | |
| brand2 | 500 characters4 | 100 Characters4 | | | | | |
| brand3 | 500 characters5 | 100 Characters5 | | | | | |
| brand3 | 500 characters6 | 100 characters6 | | | | | |
I'd like to merge the review column by brand, like this:
| Brand | review | favorite | c4 | c5 | c6 | c7 | c8 |
| brand1 | 500 characters1 | 100 characters1 | | | | | |
| brand2 | 500 characters2 | 100 Characters2 | | | | | |
| | 500 characters3 | 100 Characters3 | | | | | |
| | 500 characters4 | 100 Characters4 | | | | | |
| brand3 | 500 characters5 | 100 Characters5 | | | | | |
| | 500 characters6 | 100 characters6 | | | | | |
So, I tried to use aggregate():
temp <- aggregate(data$review ~ data$brand, data, as.list)
But it takes very long.
Is there a simpler way to merge these?
Thank you in advance!
Try splitting them on each factor and then pasting them together. aggregate() is a horribly slow function and should be avoided for all but the smallest datasets.
This should do the trick: (note I downloaded your Google file as sampleDF.csv here)
sampleDF <- read.csv("~/Downloads/sampleDF.csv", stringsAsFactors = FALSE)
# aggregate review text by brand
brand.split   <- split(sampleDF$text, as.factor(sampleDF$Brand))
brand.grouped <- sapply(brand.split, paste, collapse = " ")
# aggregate favorite by brand
favorite.split   <- split(sampleDF$favorite, as.factor(sampleDF$Brand))
favorite.grouped <- sapply(favorite.split, paste, collapse = " ")
newDf <- data.frame(brand    = names(brand.split),
                    text     = brand.grouped,
                    favorite = favorite.grouped,
                    stringsAsFactors = FALSE)
If you want to bring in other variables they will need to vary at the brand level only.
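For comparison, here is a dplyr sketch of the same grouping (assuming the columns are named brand, review and favorite as in the question's table):
library(dplyr)
newDf <- sampleDF %>%
  group_by(brand) %>%
  summarise(review   = paste(review, collapse = " "),
            favorite = paste(favorite, collapse = " "))
group_by() plus summarise() generally scales much better than aggregate() on 40,000+ rows.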

Hmisc Table Creation

Just starting out with R and trying to figure out what works for my needs when it comes to creating "summary tables." I am used to Custom Tables in SPSS, and the CrossTable function in the package gmodels gets me almost where I need to be; not to mention it is easy to navigate for someone just starting out in R.
That said, it seems like Hmisc's summary tables are very good at creating various summaries and exporting to LaTeX (ultimately what I need to do).
My questions are: 1) Can you create the table below easily with Hmisc? 2) If so, can I interact variables (two in the column)? And finally, 3) can I access the p-values of significance tests (chi-square)?
Thanks in advance,
Brock
Cell Contents
|-------------------------|
| Count |
| Row Percent |
| Column Percent |
|-------------------------|
Total Observations in Table: 524
| asq[, 23]
asq[, 4] | 1 | 2 | 3 | 4 | 5 | Row Total |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
0 | 76 | 54 | 93 | 46 | 54 | 323 |
| 23.529% | 16.718% | 28.793% | 14.241% | 16.718% | 61.641% |
| 54.286% | 56.250% | 63.265% | 63.889% | 78.261% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
1 | 64 | 42 | 54 | 26 | 15 | 201 |
| 31.841% | 20.896% | 26.866% | 12.935% | 7.463% | 38.359% |
| 45.714% | 43.750% | 36.735% | 36.111% | 21.739% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
Column Total | 140 | 96 | 147 | 72 | 69 | 524 |
| 26.718% | 18.321% | 28.053% | 13.740% | 13.168% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
The gmodels package has a function called CrossTable, which is very nice for those used to SPSS and SAS output. Try this example:
library(gmodels) # run install.packages("gmodels") if you haven't installed the package yet
x <- sample(c("up", "down"), 100, replace = TRUE)
y <- sample(c("left", "right"), 100, replace = TRUE)
CrossTable(x, y, format = "SPSS")
This should provide you with an output just like the one you displayed on your question, very SPSS-y. :)
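For question 3, if memory serves CrossTable() also accepts chisq = TRUE to print the test under the table; either way, the p-value itself can be pulled from base R's chisq.test() (a small sketch on the same x and y):
CrossTable(x, y, format = "SPSS", chisq = TRUE)  # prints the chi-square test below the table
chisq.test(table(x, y))$p.value                  # the p-value as a plain number for reuse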
If you are coming from SPSS, you may be interested in the package Deducer ( http://www.deducer.org ). It has a contingency table function:
> library(Deducer)
> data(tips)
> tables<-contingency.tables(
+ row.vars=d(smoker),
+ col.vars=d(day),data=tips)
> tables<-add.chi.squared(tables)
> print(tables,prop.r=T,prop.c=T,prop.t=F)
================================================================================================================
==================================================================================
========== Table: smoker by day ==========
| day
smoker | Fri | Sat | Sun | Thur | Row Total |
-----------------------|-----------|-----------|-----------|-----------|-----------|
No Count | 4 | 45 | 57 | 45 | 151 |
Row % | 2.649% | 29.801% | 37.748% | 29.801% | 61.885% |
Column % | 21.053% | 51.724% | 75.000% | 72.581% | |
-----------------------|-----------|-----------|-----------|-----------|-----------|
Yes Count | 15 | 42 | 19 | 17 | 93 |
Row % | 16.129% | 45.161% | 20.430% | 18.280% | 38.115% |
Column % | 78.947% | 48.276% | 25.000% | 27.419% | |
-----------------------|-----------|-----------|-----------|-----------|-----------|
Column Total | 19 | 87 | 76 | 62 | 244 |
Column % | 7.787% | 35.656% | 31.148% | 25.410% | |
Large Sample
Test Statistic DF p-value | Effect Size est. Lower (%) Upper (%)
Chi Squared 25.787 3 <0.001 | Cramer's V 0.325 0.183 (2.5) 0.44 (97.5)
-----------
================================================================================================================
You can get the counts and test to latex or html using the xtable package:
> library(xtable)
> xtable(drop(extract.counts(tables)[[1]]))
> test <- contin.tests.to.table((tables[[1]]$tests))
> xtable(test)
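Since the end goal is LaTeX, note that print.xtable() can write the markup straight to a file (a small sketch; the file name is arbitrary):
print(xtable(test), type = "latex", file = "crosstab.tex")  # writes the LaTeX table markup to disk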
