correlation heatmap using heatmaply R - r

I'm trying to create a heatmap based on Spearman correlation, with dendrograms that correspond to the Spearman correlation values.
My input file is structured as follows:
> data[1:6,1:6]
group EG PN C0 C10 C10.1
1 Patients 24 729 352.66598 43.80707 75.16226
2 Patients 24 729 195.48486 17.15763 33.60365
3 Patients 24 729 106.85937 15.13400 34.47340
4 Patients 27 1060 76.70645 14.98315 22.09885
5 Patients 27 1060 354.07169 50.61995 98.36765
6 Patients 27 1060 331.84956 92.00343 125.46658
> data[150:160,1:6]
group EG PN C0 C10 C10.1
150 Controls 27 1011 99.94756 9.018773 20.207498
151 Controls 30 616 300.20203 25.667548 37.363280
152 Controls 30 616 190.38030 18.811198 46.417332
153 Controls 26 930 79.44666 7.801935 4.569444
154 Controls 24 724 381.74026 39.842241 42.144842
155 Controls 24 724 191.39962 19.008729 31.064398
I'm able to make a simple correlation plot, but I would like to create a single heatmap with both protein and subject dendrograms based on Spearman correlation. Does anyone know how to do this? Thanks in advance.

The following code displays an interactive heatmap using Spearman's rank correlation to cluster both rows and columns (in this case for the mtcars dataset).
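library(heatmaply)
# distfun receives a matrix and must return distances between its rows, hence t(x)
heatmaply(mtcars,
distfun = function(x) as.dist(1 - cor(t(x), method = "spearman")))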
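
If the goal is a heatmap of the correlation matrix itself, as the question describes, one possible sketch is to compute the Spearman correlations first and cluster on 1 - rho. This is an assumption about what is wanted rather than part of the answer above: mtcars stands in for the numeric columns of the real data, and average linkage is an arbitrary choice.

```r
# mtcars is a stand-in for the numeric columns of the real data (assumption)
num <- mtcars

# Spearman correlation between columns (variables)
rho <- cor(num, method = "spearman")

# Turn correlation into a dissimilarity and cluster on it
d  <- as.dist(1 - rho)
hc <- hclust(d, method = "average")  # linkage method is an arbitrary choice

# Draw the correlation matrix with the same dendrogram on rows and columns.
# Guarded so the clustering part still runs when heatmaply is not installed.
if (requireNamespace("heatmaply", quietly = TRUE)) {
  p <- heatmaply::heatmaply(rho,
                            Rowv = as.dendrogram(hc),
                            Colv = as.dendrogram(hc))
  # print(p) to display the interactive heatmap
}
```

To cluster subjects instead of variables, the same steps can be applied to `cor(t(num), method = "spearman")`.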

Related

Autocorrelation Functions - time series analysis in R

I have the following df:
TS A_f1 A_p B_f1 B_p C_f1 C_p
1 10 100 15 150 17 170
2 20 200 25 250 27 270
3 30 300 35 350 37 370
This is, however, only a simplification of my real df with 40k+ observations and 100+ features.
TS are timestamps - in every row there are stores listed ("A","B","C", n ...) with features (f1, p, f_n ...)
Before training an LSTM on my df, I want to use the acf function (or pacf) to find patterns in my data and do some feature selection.
Any idea how I can do this with my data?
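
A minimal sketch of one way to start, assuming each feature column can be treated as its own regularly spaced series: run acf() on every column except TS and summarise the result. The simulated data, the lag.max of 20, and the "largest absolute autocorrelation" summary are all illustrative assumptions, not something from the question.

```r
set.seed(1)

# Simulated stand-in for the real data frame (assumption):
# A_f1 is autocorrelated, A_p is white noise
n  <- 200
df <- data.frame(
  TS   = seq_len(n),
  A_f1 = as.numeric(arima.sim(list(ar = 0.7), n = n)),
  A_p  = rnorm(n)
)

# Every column except the timestamp is treated as a feature series
features <- setdiff(names(df), "TS")

# Autocorrelation per feature, without plotting
acf_list <- lapply(df[features], function(x) acf(x, lag.max = 20, plot = FALSE))

# One summary number per feature: largest |autocorrelation| at lag >= 1
# (the first element of $acf is lag 0, which is always 1)
max_acf <- sapply(acf_list, function(a) max(abs(a$acf[-1])))
print(max_acf)
```

Replacing acf with pacf gives the partial autocorrelations; note that pacf output starts at lag 1, so the `[-1]` would then have to be dropped.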

How are we supposed to get at matrix diagonals and partial regression plots using r programming?

Given the data
farm up right left
24.3 34.3 50 45
30.2 35.3 54 45
49 45 540 4353
70 60 334 343
69 80 54 342
# for finding Studentized residuals vs fitted value
mod1<-lm(farm~up+right+left)
plot(mod1)
# for finding cooks distance
plot(cookd(lm(farm~up+right+left, data=data)))
could not find function "cookd"
I don't know how to obtain the partial regression plots or the matrix diagonals, and I couldn't find much information online.
Please help or correct me if I am wrong.
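
A possible sketch using base R only: cooks.distance() in place of cookd, hatvalues() for the diagonal of the hat matrix (assuming that is what "matrix diagonals" refers to), rstudent() for Studentized residuals, and a partial regression (added-variable) plot built by hand. The mtcars model is a stand-in, since five observations leave too few residual degrees of freedom for a four-parameter model.

```r
# Stand-in model on a built-in data set (assumption)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

cd <- cooks.distance(fit)  # Cook's distance for each observation
h  <- hatvalues(fit)       # diagonal of the hat matrix (leverages)
rs <- rstudent(fit)        # externally Studentized residuals

# Studentized residuals against fitted values
plot(fitted(fit), rs, xlab = "Fitted values", ylab = "Studentized residuals")

# Partial regression plot for wt: residuals of the response and of wt,
# each after regressing on the remaining predictors
ry <- resid(lm(mpg ~ hp + qsec, data = mtcars))
rx <- resid(lm(wt  ~ hp + qsec, data = mtcars))
plot(rx, ry, xlab = "wt | others", ylab = "mpg | others")
abline(lm(ry ~ rx))
```

The slope of the line in the partial regression plot equals the coefficient of wt in the full model, which is a quick way to check the construction.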

dplyr sample_n returns different number of rows in table

I am working with dplyr and sample_n in R and trying to get an even group of rows to work with in my data frame.
So, I have a data set, head of data as follows:
> head(SEH)
Time.Level Demo.Age SEH.Total
92 PRE 12 110
335 PRE 12 80
720 MID 14 85
196 MID 11 95
408 POST 18 60
184 POST 10 99
I separated the data into three data frames according to time level, so I have an SEH.pre, an SEH.mid and an SEH.post. I then ran describe on each, and I know the pre, mid and post groups are uneven. So, I want to randomly sample the pre, mid and post groups down to an even size. For example, here are the n sizes for the SEH.pre and SEH.mid groups:
> describe(SEH.pre)
vars n
Time.Level* 1 887
Demo.Age 2 883
SEH.Total 3 887
> describe(SEH.mid)
vars n
Time.Level* 1 894
Demo.Age 2 872
SEH.Total 3 894
So, now I run sample_n on the SEH.pre thinking that I can re-sample to an n of 860 across all columns. I run the following command:
SEH.pre2 <- sample_n(SEH.pre, 860, replace = FALSE)
And then I describe and the Demo.Age is less than the rest:
> describe(SEH.pre2)
vars n ...
Time.Level* 1 860
Demo.Age 2 856
SEH.Total 3 860
I feel like a big idiot but I cannot figure out why this is. I have tried it multiple times and Demo.Age varies from 856 to 859, but is never 860. I want all three columns to be 860. How do I do this? And why am I mis-thinking that sample_n should create even groups out of uneven?
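
One likely explanation, assuming describe() here is psych::describe (whose n column counts non-missing values): Demo.Age contains a few NAs, so any sample of 860 rows can include some of them. A sketch with made-up data that drops incomplete rows before sampling; base R sample() is used in place of sample_n so the block has no package dependencies.

```r
set.seed(123)

# Made-up stand-in for SEH.pre: 887 rows, 4 missing ages (assumption)
SEH.pre <- data.frame(
  Time.Level = "PRE",
  Demo.Age   = sample(10:18, 887, replace = TRUE),
  SEH.Total  = sample(60:110, 887, replace = TRUE)
)
SEH.pre$Demo.Age[sample(887, 4)] <- NA

# Non-missing counts per column: 887, 883, 887
print(colSums(!is.na(SEH.pre)))

# Keep only rows without missing values, then sample 860 of them
SEH.pre.complete <- SEH.pre[complete.cases(SEH.pre), ]
SEH.pre2 <- SEH.pre.complete[sample(nrow(SEH.pre.complete), 860), ]

# Every column now has 860 non-missing values
print(colSums(!is.na(SEH.pre2)))
```

With dplyr, calling sample_n on the complete-case data frame should behave the same way.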

R One sample test for set of columns for each row

I have a data set where I have the Levels and Trends for say 50 cities for 3 scenarios. Below is the sample data -
City <- paste0("City",1:50)
L1 <- sample(100:500,50,replace = T)
L2 <- sample(100:500,50,replace = T)
L3 <- sample(100:500,50,replace = T)
T1 <- runif(50,0,3)
T2 <- runif(50,0,3)
T3 <- runif(50,0,3)
df <- data.frame(City,L1,L2,L3,T1,T2,T3)
Now, across the 3 scenarios I find the minimum Level and Minimum Trend using the below code -
df$L_min <- apply(df[,2:4],1,min)
df$T_min <- apply(df[,5:7],1,min)
Now I want to check if these minimum values are significantly different between the levels and trends respectively. So check L_min with columns 2-4 and T_min with columns 5-7. This needs to be done for each city (row) and if significant then return which column it is significantly different with.
It would help if someone could guide me on how this can be done.
Thank you!!
I'll put my idea here; nevertheless, I'm looking forward to ideas from others.
> head(df)
City L1 L2 L3 T1 T2 T3 L_min T_min
1 City1 251 176 263 1.162313 0.07196579 2.0925715 176 0.07196579
2 City2 385 406 264 0.353124 0.66089524 2.5613980 264 0.35312402
3 City3 437 333 426 2.625795 1.43547766 1.7667891 333 1.43547766
4 City4 431 405 493 2.042905 0.93041254 1.3872058 405 0.93041254
5 City5 101 429 100 1.731004 2.89794314 0.3535423 100 0.35354230
6 City6 374 394 465 1.854794 0.57909775 2.7485841 374 0.57909775
> df$FC <- rowMeans(df[,2:4])/df[,8]
> df <- df[order(-df$FC), ]
> head(df)
City L1 L2 L3 T1 T2 T3 L_min T_min FC
18 City18 461 425 117 2.7786757 2.6577894 0.75974121 117 0.75974121 2.857550
38 City38 370 117 445 0.1103141 2.6890014 2.26174542 117 0.11031411 2.655271
44 City44 101 473 222 1.2754675 0.8667007 0.04057544 101 0.04057544 2.627063
10 City10 459 361 132 0.1529519 2.4678493 2.23373484 132 0.15295194 2.404040
16 City16 232 393 110 0.8628494 1.3995549 1.01689217 110 0.86284938 2.227273
15 City15 499 475 182 0.3679611 0.2519497 2.82647041 182 0.25194969 2.117216
Now the rows that differ most across columns 2:4 are at the top. Columns 5:7 can be handled in an analogous way.
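
A sketch of the same ratio for both column groups, written as a script rather than console output. The seed and the names FC_L, FC_T and top_T are assumptions added for illustration.

```r
set.seed(42)  # seed added only to make the sketch reproducible

# Sample data as in the question
df <- data.frame(
  City = paste0("City", 1:50),
  L1 = sample(100:500, 50, replace = TRUE),
  L2 = sample(100:500, 50, replace = TRUE),
  L3 = sample(100:500, 50, replace = TRUE),
  T1 = runif(50, 0, 3),
  T2 = runif(50, 0, 3),
  T3 = runif(50, 0, 3)
)
df$L_min <- apply(df[, c("L1", "L2", "L3")], 1, min)
df$T_min <- apply(df[, c("T1", "T2", "T3")], 1, min)

# Ratio of the row mean to the row minimum, for levels and for trends
df$FC_L <- rowMeans(df[, c("L1", "L2", "L3")]) / df$L_min
df$FC_T <- rowMeans(df[, c("T1", "T2", "T3")]) / df$T_min

# Rows whose trend minimum stands out most come first
top_T <- df[order(-df$FC_T), ]
print(head(top_T))
```

Because T_min can be very close to zero, FC_T can become very large for some rows, so the ranking is easier to interpret than the raw values.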
And some tips for statistical tests:
Prefer t.test (parametric, compares means) over the Wilcoxon / Mann-Whitney U test (non-parametric, rank-based) when its assumptions hold, because it has more power; HOWEVER:
- The data sets should be large. Example hypothesis: Montreal has taller citizens than Quebec. t.test will work fine when you take 100 people from each city, giving height measurements of 200 people, 100 vs 100.
- The distribution should be close to normal in all samples, or both samples should have a similar non-normal distribution (it may be binomial). Either way, we can't use this test when one sample is normally distributed and the other isn't.
- The two samples should be of equal size: 100 vs 100 is OK, but 87 vs 234 is not ideal; the p-value may come out below 0.05 yet be misleading.
If your data doesn't meet the above conditions, I prefer a non-parametric test: less power, but more robust.
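
The height example from the tips can be sketched as follows. The means, the standard deviation and the seed are invented for illustration, and the Shapiro-Wilk check is just one possible way to look at normality.

```r
set.seed(1)

# Hypothetical height samples in cm, 100 people per city (invented numbers)
montreal <- rnorm(100, mean = 172, sd = 7)
quebec   <- rnorm(100, mean = 170, sd = 7)

# Rough normality check for each sample
shapiro_p <- c(montreal = shapiro.test(montreal)$p.value,
               quebec   = shapiro.test(quebec)$p.value)

# Parametric comparison of means (t.test defaults to the Welch variant)
tt <- t.test(montreal, quebec)

# Non-parametric, rank-based alternative
wt <- wilcox.test(montreal, quebec)

print(shapiro_p)
print(c(t_test = tt$p.value, wilcoxon = wt$p.value))
```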

multidimensional data clustering

Problem: I have two groups of multidimensional heterogeneous data. I have concocted a simple illustrative example below. Notice that some columns are discrete (age) while some are binary (gender) and another is even an ordered pair (pant size).
Person Age gender height weight pant_size
Control_1 55 M 167.6 155 32,34
Control_2 68 F 154.1 137 28,28
Control_3 53 F 148.9 128 27,28
Control_4 57 M 167.6 165 38,34
Control_5 62 M 147.4 172 36,32
Control_6 44 M 157.6 159 32,32
Control_7 76 F 172.1 114 30,32
Control_8 49 M 161.8 146 34,34
Control_9 53 M 164.4 181 32,36
Person Age gender height weight pant_size
experiment_1 39 F 139.6 112 26,28
experiment_2 52 M 154.1 159 32,32
experiment_3 43 F 148.9 123 27,28
experiment_4 55 M 167.6 188 36,38
experiment_5 61 M 161.4 171 36,32
experiment_6 48 F 149.1 144 28,28
The question is: does the entire experimental group differ significantly from the entire control group?
Or, roughly speaking, do they form two distinct clusters in the space of [age,gender,height,weight,pant_size]?
The general idea of what I've tried so far is a metric that compares corresponding columns of the experimental group to those of the control; the metric then takes the sum of the column scores (see below). A somewhat arbitrary threshold is picked to decide whether the two groups are different. This arbitrariness is compounded by the weighting of the columns, which is also somewhat arbitrary. Remarkably, this approach is performing well for the actual problem I have, but it needs to be formalized. I'm wondering whether it is similar to any existing approach, or whether other well-established approaches are more widely accepted.
Person Age gender height weight pant_size metric
experiment_1 39 F 139.6 112 26,28
experiment_2 52 M 154.1 159 32,32
experiment_3 43 F 148.9 123 27,28
experiment_4 55 M 167.6 188 36,38
experiment_5 61 M 161.4 171 36,32
experiment_6 48 F 149.1 144 28,28
column score 2 1 5 1 7 16
Treat this as a classification problem rather than a clustering problem if you assume the results "cluster".
You don't need to find these clusters, because they are predefined classes.
The "rewritten" approach is as follows:
Train different classifiers to predict whether a point is from data A or data B. If you can get a much better accuracy than 50% (assuming balanced data), then the groups do differ. If all your classifiers are only as good as random (and you didn't make mistakes), then the two sets are probably just too similar.
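
A minimal sketch of this idea on the example data from the question, assuming logistic regression as the classifier and leave-one-out cross-validation as the accuracy estimate (both are illustrative choices). Gender and pant_size are left out for brevity. Since the groups are 9 vs 6 rather than balanced, the reference accuracy here is the majority-class rate, not 50%.

```r
# Example data from the question; group 0 = control, 1 = experiment
controls <- data.frame(
  age    = c(55, 68, 53, 57, 62, 44, 76, 49, 53),
  height = c(167.6, 154.1, 148.9, 167.6, 147.4, 157.6, 172.1, 161.8, 164.4),
  weight = c(155, 137, 128, 165, 172, 159, 114, 146, 181),
  group  = 0
)
experiment <- data.frame(
  age    = c(39, 52, 43, 55, 61, 48),
  height = c(139.6, 154.1, 148.9, 167.6, 161.4, 149.1),
  weight = c(112, 159, 123, 188, 171, 144),
  group  = 1
)
dat <- rbind(controls, experiment)

# Leave-one-out cross-validation: fit on all rows but one, predict the held-out row
pred <- vapply(seq_len(nrow(dat)), function(i) {
  fit <- glm(group ~ age + height + weight, data = dat[-i, ], family = binomial)
  as.numeric(predict(fit, newdata = dat[i, ], type = "response") > 0.5)
}, numeric(1))

accuracy <- mean(pred == dat$group)              # cross-validated accuracy
baseline <- max(table(dat$group)) / nrow(dat)    # always guessing the larger group
print(c(accuracy = accuracy, baseline = baseline))
```

With so few rows the estimate is very noisy; on the real data, an accuracy clearly above the baseline across several classifiers would be the signal that the groups differ.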

Resources