Calculate the Hopkins statistic for groups of values with different IDs in R - r

I have a dataset "data_file" which contains five columns and 1 million rows:
X1 "ID_Number" (numeric),
X2 “Sample_Type”,
X3 “Signal_X” (numeric),
X4 “Signal_Y” (numeric),
X5 “Signal_Z” (numeric).
Each ID value corresponds to a set of "Signal_X", "Signal_Y" and "Signal_Z" values.
ID_Number Sample_Type Signal_X Signal_Y Signal_Z
2 Sample 337 1538 0.6314152
2 Sample 106 1840 0.9923422
…
2 Sample 94 1445 0.9967044
10 Sample 164 1777 0.9950826
10 Sample 183 1933 0.9931457
10 Sample 176 1590 0.9690951
…
10 Sample 139 1339 0.9820210
12 Sample 154 1397 0.9700886
12 Sample 144 1206 0.9457763
… etc
Grouping by ID, I found the correlation coefficient between "Signal_X" and "Signal_Y" using the following code:
library(plyr)
dataAE <- ddply(data_file, "ID_Number", summarise, CorrelationCoefficient = cor(Signal_X, Signal_Y))
View(dataAE)
The output looks like this:
ID_Number CorrelationCoefficient
1 2 0.48083503
2 3 -0.81036062
3 10 -0.32098672
4 12 -0.20251427
5 24 -0.18004939
6 51 -0.45803370
7 54 -0.59001642
8 63 -0.53976850
etc …
By analogy, I'm now trying to compute the Hopkins statistic and find the optimal number of clusters for each ID in my dataset.
library(clustertend)
set.seed(123)
hopkins(data_file, n = nrow(data_file)-1)
I tried replacing CorrelationCoefficient = cor(Signal_X, Signal_Y) with HopkinsStatistics = hopkins(Signal_X, Signal_Y), but without results.
Manually & without problem for each ID set I used the following code
library(factoextra)  # get_clust_tendency() comes from factoextra, not clustertend
# Compute the Hopkins statistic for one ID subset
set.seed(123)
subset$Sample_Type <- NULL  # drop the non-numeric column
df <- scale(subset)
res <- get_clust_tendency(df, 40, graph = FALSE)
# Hopkins statistic
res$hopkins_stat
res
The problem is how to automate these calculations over all IDs, using loops. Please help me. Thanks in advance.
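A minimal sketch of one way to automate this, assuming the factoextra package and that each ID group has more rows than the sample size passed to get_clust_tendency (the names hopkins_by_id and result are my own):
library(factoextra)  # provides get_clust_tendency()
set.seed(123)
# Split the data frame into one subset per ID, then compute the
# Hopkins statistic for each subset from its scaled signal columns.
hopkins_by_id <- sapply(split(data_file, data_file$ID_Number), function(d) {
  df <- scale(d[, c("Signal_X", "Signal_Y", "Signal_Z")])
  # the sample size n must be smaller than the number of rows in the group
  res <- get_clust_tendency(df, n = min(40, nrow(df) - 1), graph = FALSE)
  res$hopkins_stat
})
result <- data.frame(ID_Number = names(hopkins_by_id),
                     HopkinsStatistic = hopkins_by_id,
                     row.names = NULL)
View(result)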

Related

Autocorrelation Functions - time series analysis in R

I have the following df:
TS A_f1 A_p B_f1 B_p C_f1 C_p
1 10 100 15 150 17 170
2 20 200 25 250 27 270
3 30 300 35 350 37 370
This is, however, only a simplification of my real df with 40k+ observations and 100+ features.
TS is a timestamp; each row lists several stores ("A", "B", "C", n ...) with their features (f1, p, f_n ...).
Before training an LSTM on my df, I want to use the acf function (or pacf) to find patterns in my data and do some feature selection beforehand.
Any idea, how I can do this with my data?
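A minimal sketch of one way to start, assuming the real 40k-row data, that all feature columns are numeric, and that the rows are ordered by TS (max_lag and the lag-1 ranking are arbitrary choices):
# compute the autocorrelation function for every feature column
features <- setdiff(names(df), "TS")
max_lag <- 20  # arbitrary; must be well below the number of rows
acf_mat <- sapply(features, function(col) {
  acf(df[[col]], lag.max = max_lag, plot = FALSE)$acf[-1]  # drop lag 0
})
# acf_mat is a max_lag x n_features matrix; rank features, for example,
# by the magnitude of their lag-1 autocorrelation
sort(abs(acf_mat[1, ]), decreasing = TRUE)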

How to find correlation coefficients in a loop?

I have a dataset like this:
Account_tenure_years = c(982,983,984,985,986,987,988)
N=c(12328,18990,21255,27996,32014,15487,4347)
Y=c(76,64,61,76,94,55,11)
df_table_account_tenure_vs_PPC = data.frame(Account_tenure_years,N,Y)
The dataset looks like this:
Account_tenure_years N Y
982 12328 76
983 18990 64
984 21255 61
985 27996 76
986 32014 94
987 15487 55
988 4347 11
What I want to do is this:
I want to take any two of the Account_tenure_years rows, for example 982 and 983, and find the correlation coefficient with the N and Y columns, i.e. I want to find the correlation coefficient of the table below:
Account_tenure_years N Y
982 12328 76
983 18990 64
Now I want to repeat this 7C2 = 21 times, taking different pairs of rows and finding the correlation coefficient in each case.
I.e. in the next iteration I would want:
Account_tenure_years N Y
983 18990 64
984 21255 61
And find its correlation coefficient. After I have all 21 correlation coefficients, I average them to get a mean correlation coefficient for the entire dataset.
How do I do this in R?
OK, let's get this straight: suppose I find the correlation coefficient between the columns
Account_tenure_years and N,
and also the correlation coefficient between the columns
Account_tenure_years and Y.
If I find negative correlation coefficients in each case, can we infer anything from that?
Calculating a correlation coefficient for each pair of rows is not ideal; it should be calculated over the entire dataset:
Account_tenure_years = c(982,983,984,985,986,987,988)
N=c(12328,18990,21255,27996,32014,15487,4347)
Y=c(76,64,61,76,94,55,11)
df = data.frame(Account_tenure_years,N,Y)
cor(df$Account_tenure_years,df$N)
cor(df$Account_tenure_years,df$Y)
Output is as shown below:
> cor(df$Account_tenure_years,df$N)
[1] -0.1662244
> cor(df$Account_tenure_years,df$Y)
[1] -0.5332263
You can infer that the data is negatively correlated: an increase in Account_tenure_years goes with a decrease in N and Y, and vice versa.
Please feel free to correct me!
It should be easier to do this by transposing your data, and the best part is that you don't even need to write a loop.
Try this:
dt <- data.table::fread("
Account_tenure_years N Y
982 12328 76
983 18990 64
984 21255 61
985 27996 76
986 32014 94
987 15487 55
988 4347 11
")
# transpose so that each year becomes a column
dt.t <- as.data.frame(t(dt[, 2:3]))
colnames(dt.t) <- dt$Account_tenure_years
dt.t
#> 982 983 984 985 986 987 988
#> N 12328 18990 21255 27996 32014 15487 4347
#> Y 76 64 61 76 94 55 11
# calculate the correlation matrix; see help(cor)
cor(dt.t)
#> 982 983 984 985 986 987 988
#> 982 1 1 1 1 1 1 1
#> 983 1 1 1 1 1 1 1
#> 984 1 1 1 1 1 1 1
#> 985 1 1 1 1 1 1 1
#> 986 1 1 1 1 1 1 1
#> 987 1 1 1 1 1 1 1
#> 988 1 1 1 1 1 1 1
Created on 2018-07-20 by the reprex package (v0.2.0.9000).
I do not understand how you want to compute correlation coefficients between two variables with only one observation for each. Therefore, I assume you have more rows than provided here.
First define all combinations:
combinations <- combn(df_table_account_tenure_vs_PPC$Account_tenure_years, 2)
For each combination, you want to extract the corresponding rows and compute the correlation coefficients for each variable:
coefficients <- apply(combinations, 2, function(x, df) {
  coef <- sapply(c("N", "Y"), function(v, x, df) {
    # correlate the values of variable v for the two selected rows
    cor(df[df$Account_tenure_years == x[1], v],
        df[df$Account_tenure_years == x[2], v])
  }, x, df)
  c(x, coef)
}, df_table_account_tenure_vs_PPC)
Then, you can aggregate your results in a data.frame:
df <- as.data.frame(t(coefficients))
colnames(df) <- c("Year1", "Year2", "N_cor", "Y_cor")
This should work. Please tell me if you have any problems.
Again, make sure you have more than one observation in each condition if you want a meaningful correlation coefficient.
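To get the single averaged value the question asks for, one could then average the pairwise coefficients (a sketch, assuming the df built above; na.rm drops NAs from degenerate pairs):
mean(df$N_cor, na.rm = TRUE)  # mean pairwise correlation for N
mean(df$Y_cor, na.rm = TRUE)  # mean pairwise correlation for Y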

Clustering biological sequences based on numeric values

I am trying to cluster several amino acid sequences of a fixed length (13) into K clusters based on the Atchley factors (5 numbers which represent each amino acid).
For example, I have an input vector of strings like the following:
key <- HDMD::AAMetric.Atchley
sequences <- sapply(1:10000, function(x) paste(sapply(1:13, function (X) sample(rownames(key), 1)), collapse = ""))
However, my actual list of sequences is over 10^5 entries long (hence the need for computational efficiency).
I then convert these sequences into numeric vectors as follows:
key <- HDMD::AAMetric.Atchley
# look up the 5 Atchley factors for every residue of every sequence
m1 <- key[strsplit(paste(sequences, collapse = ""), "")[[1]], ]
p <- 13
# reshape so that each row is one sequence: 13 residues x 5 factors = 65 columns
output <- do.call(cbind, lapply(1:p, function(i) m1[seq(i, nrow(m1), by = p), ]))
I want to cluster the output (which is now a set of 65-dimensional vectors) in an efficient way.
I was originally using mini-batch k-means, but I noticed the results were very inconsistent when I repeated it. I need a consistent clustering approach.
I was also concerned about the curse of dimensionality, considering that at 65 dimensions Euclidean distance stops working well.
Many high-dimensional clustering algorithms I saw assume that outliers and noise exist in the data, but as these are biological sequences converted to numeric values, there is no noise and there are no outliers.
In addition, feature selection will not work, as each amino acid and each of its properties is relevant in the biological context.
How would you recommend clustering these vectors?
I think self-organizing maps can be of help here; at least the implementation is quite fast, so you will know soon enough whether it is helpful or not.
Using the data from the OP, along with:
rownames(output) <- 1:nrow(output)
colnames(output) <- make.names(colnames(output), unique = TRUE)
library(SOMbrero)
You define the number of clusters (the grid dimensions) in advance:
fit <- trainSOM(x.data = output, dimension = c(5, 5), nb.save = 10,
                maxit = 2000, scaling = "none", radius.type = "gaussian")
nb.save stores intermediate steps, for further exploration of how the training developed over the iterations:
plot(fit, what ="energy")
It seems more iterations are in order.
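If the energy curve has not flattened out by the last iteration, retraining with a larger maxit may help (a sketch; 10000 is an arbitrary choice):
# retrain with more iterations and re-check the energy curve
fit <- trainSOM(x.data = output, dimension = c(5, 5), nb.save = 10,
                maxit = 10000, scaling = "none", radius.type = "gaussian")
plot(fit, what = "energy")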
Check the frequency of the clusters:
table(fit$clustering)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
428 417 439 393 505 458 382 406 271 299 390 303 336 358 365 372 332 268 437 464 541 381 569 419 467
Predict clusters based on new data:
predict(fit, output[1:20, ])
#output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
19 12 11 8 9 1 11 13 14 5 18 2 22 21 23 22 4 14 24 12
Check which variables were important for the clustering:
summary(fit)
#part of output
Summary
Class : somRes
Self-Organizing Map object...
online learning, type: numeric
5 x 5 grid with square topology
neighbourhood type: gaussian
distance type: euclidean
Final energy : 44.93509
Topographic error: 0.0053
ANOVA :
Degrees of freedom : 24
F pvalue significativity
pah 1.343 0.12156074
pss 1.300 0.14868987
ms 16.401 0.00000000 ***
cc 1.695 0.01827619 *
ec 17.853 0.00000000 ***
Find the optimal number of clusters:
plot(superClass(fit))
fit1 <- superClass(fit, k = 4)
summary(fit1)
#part of output
SOM Super Classes
Initial number of clusters : 25
Number of super clusters : 4
Frequency table
1 2 3 4
6 9 4 6
Clustering
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1 1 2 2 2 1 1 2 2 2 1 1 2 2 2 3 3 4 4 4 3 3 4 4 4
ANOVA
Degrees of freedom : 3
F pvalue significativity
pah 1.393 0.24277933
pss 3.071 0.02664661 *
ms 19.007 0.00000000 ***
cc 2.906 0.03332672 *
ec 23.103 0.00000000 ***
There is much more in the SOMbrero vignette.

R: Convert consensus output into a data frame

I'm currently performing a multiple sequence alignment using the 'msa' package from Bioconductor. I'm using this to calculate the consensus sequence (msaConsensusSequence) and conservation score (msaConservationScore). This gives me outputs that are values ...
e.g.
ConsensusSequence:
i.llE etc (str = chr)
(lower case = 20%+ conservation, uppercase = 80%+ conservation, . = <20% conservation)
ConservationScore:
221 -296 579 71 423 etc (str = named num)
I would like to convert these into a table where the first row holds the letters of the consensus sequence, one per column, and the second row holds the corresponding conservation scores.
e.g.
i . l l E
221 -296 579 71 423
Could people please advise on the best way to go about this?
Thanks
Natalie
From what you have said in the comments, you can get a data frame like this:
library(msa)  # also loads Biostrings, which provides the BLOSUM62 data
data(BLOSUM62)
alignment <- msa(mySequences)
conservation <- msaConservationScore(alignment, BLOSUM62)
# Now create the data frame
df <- data.frame(consensus = names(conservation), conservation = conservation)
head(df)
consensus conservation
1 T 141
2 E 160
3 E 165
4 E 325
5 ? 179
6 ? 71
7 T 216
8 W 891
9 ? 38
10 T 405
11 L 204
If you prefer, you can transpose it:
df <- t(df)
colnames(df) <- 1:ncol(df)

How to obtain a new table after filtering only one column in an existing table in R?

I have a data frame with 20 columns. I need to filter/remove noise from one column. After filtering using the convolve function I get a new vector of values; many values from the original column are lost in the filtering process. The problem is that I need the whole table (for later analysis) with only those rows where the filtered column has values, but I can't bind the filtered column to the original table as the numbers of rows differ. Let me illustrate using the 'age' column of the 'Orange' dataset in R:
> head(Orange)
Tree age circumference
1 1 118 30
2 1 484 58
3 1 664 87
4 1 1004 115
5 1 1231 120
6 1 1372 142
The convolve filter used:
smooth <- function(x, D, delta) {
  # exponential kernel of width 2*D + 1
  z <- exp(-abs(-D:D / delta))
  # weighted moving average, normalised by the kernel mass
  r <- convolve(x, z, type = "filter") / convolve(rep(1, length(x)), z, type = "filter")
  r <- head(tail(r, -D), -D)
  r
}
Filtering the 'age' column:
age2 <- smooth(Orange$age, 5,10)
data.frame(age2)
The numbers of rows in the age and age2 columns are 35 and 15, respectively. The original dataset has two more columns, and I would like to work with them as well. I only need the 15 rows of each column corresponding to the 15 rows of the age2 column. The filter here removed the first and last ten values from the age column. How can I apply the filter in a way that gives me the truncated dataset with all columns and only the filtered rows?
You would need to figure out how the variables line up. If you can pad age2 with NAs, then Orange$age2 <- age2 followed by na.omit(Orange) should give what you want. Or, equivalently, perhaps this is what you are looking for?
df <- tail(head(Orange, -10), -10) # chop off the first and last 10 observations
df$age2 <- age2
df
Tree age circumference age2
11 2 1004 156 915.1678
12 2 1231 172 876.1048
13 2 1372 203 841.3156
14 2 1582 203 911.0914
15 3 118 30 948.2045
16 3 484 51 1008.0198
17 3 664 75 955.0961
18 3 1004 108 915.1678
19 3 1231 115 876.1048
20 3 1372 139 841.3156
21 3 1582 140 911.0914
22 4 118 32 948.2045
23 4 484 62 1008.0198
24 4 664 112 955.0961
25 4 1004 167 915.1678
Edit: If you know the first and last x observations will be removed, then the following works:
x <- 2
df <- tail(head(Orange, -x), -x) # chop off the first and last x observations
df$age2 <- age2
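With the smooth() function above, each end loses D rows to the convolution and another D to the head()/tail() trim, so x is 2 * D. For the D = 5, delta = 10 call used on Orange$age:
D <- 5
x <- 2 * D                        # 10 rows dropped from each end
df <- tail(head(Orange, -x), -x)  # 35 - 2*10 = 15 rows remain
df$age2 <- age2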
