How to find correlation coefficients in a loop? - r

I have a dataset like this:
Account_tenure_years = c(982,983,984,985,986,987,988)
N=c(12328,18990,21255,27996,32014,15487,4347)
Y=c(76,64,61,76,94,55,11)
df_table_account_tenure_vs_PPC = data.frame(Account_tenure_years,N,Y)
The dataset looks like this:
Account_tenure_years N Y
982 12328 76
983 18990 64
984 21255 61
985 27996 76
986 32014 94
987 15487 55
988 4347 11
What I want to do is this:
I want to take any two of the Account_tenure_years values, for example 982 and 983, and find the correlation coefficients for the N and Y columns using just those two rows, i.e. I want to find the correlation coefficient of the table below:
Account_tenure_years N Y
982 12328 76
983 18990 64
Now I want to repeat this 7C2 times, i.e. 21 times, taking different pairs of rows and finding the correlation coefficient in each case.
i.e in the next iteration I would want :
Account_tenure_years N Y
983 18990 64
984 21255 61
And find its correlation coefficient. Then, after I have all those 21 correlation coefficients, I average them to get a mean correlation coefficient for the entire dataset.
How do I do this in R?
OK, let's get this straight: suppose I find the correlation coefficient between the columns Account_tenure_years and N, and also the correlation coefficient between the columns Account_tenure_years and Y. If I find negative correlation coefficients in each case, can we infer anything from that?

Calculating a correlation coefficient for each pair of rows is not ideal; it should be calculated over the entire dataset:
Account_tenure_years = c(982,983,984,985,986,987,988)
N=c(12328,18990,21255,27996,32014,15487,4347)
Y=c(76,64,61,76,94,55,11)
df = data.frame(Account_tenure_years,N,Y)
cor(df$Account_tenure_years,df$N)
cor(df$Account_tenure_years,df$Y)
Output is as shown below:
> cor(df$Account_tenure_years,df$N)
[1] -0.1662244
> cor(df$Account_tenure_years,df$Y)
[1] -0.5332263
You can infer that the data are negatively correlated: as Account_tenure_years increases, N and Y tend to decrease, and vice versa.
Please feel free to correct me!
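With only seven observations, the sign of the coefficient alone is weak evidence. Base R's cor.test() also reports a p-value and a confidence interval for the correlation; a minimal sketch using the df defined above:
# test whether each correlation is distinguishable from zero
cor.test(df$Account_tenure_years, df$N)  # Pearson estimate, p-value, 95% CI
cor.test(df$Account_tenure_years, df$Y)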

It should be easier to transpose your data, and the best part is that you don't even need to write a loop. Try this:
dt <- data.table::fread("
Account_tenure_years N Y
982 12328 76
983 18990 64
984 21255 61
985 27996 76
986 32014 94
987 15487 55
988 4347 11
")
dt.t <- as.data.frame(t(dt[, 2:3]))
colnames(dt.t) = dt$Account_tenure_years
# transpose
dt.t
#> 982 983 984 985 986 987 988
#> N 12328 18990 21255 27996 32014 15487 4347
#> Y 76 64 61 76 94 55 11
# calculate correlation matrix, see help(cor). Note: each column here has only two
# observations (the N and Y rows), so every pairwise correlation is exactly +1 or -1.
cor(dt.t)
#> 982 983 984 985 986 987 988
#> 982 1 1 1 1 1 1 1
#> 983 1 1 1 1 1 1 1
#> 984 1 1 1 1 1 1 1
#> 985 1 1 1 1 1 1 1
#> 986 1 1 1 1 1 1 1
#> 987 1 1 1 1 1 1 1
#> 988 1 1 1 1 1 1 1
Created on 2018-07-20 by the reprex package (v0.2.0.9000).

I do not understand how you want to compute correlation coefficients between two variables with only one observation for each. Therefore, I assume you have more rows than provided here.
First define all combinations:
combinations <- combn(df_table_account_tenure_vs_PPC$Account_tenure_years, 2)
For each combination, you want to extract the corresponding rows and compute the correlation coefficients for each variable:
coefficients <- apply(combinations, 2, function(x, df_table_account_tenure_vs_PPC){
  coef <- sapply(c("N", "Y"), function(v, x, df_table_account_tenure_vs_PPC){
    # match rows on the Account_tenure_years column, not the whole data frame
    cor(df_table_account_tenure_vs_PPC[df_table_account_tenure_vs_PPC$Account_tenure_years == x[1], v],
        df_table_account_tenure_vs_PPC[df_table_account_tenure_vs_PPC$Account_tenure_years == x[2], v])
  }, x, df_table_account_tenure_vs_PPC)
  return(c(x, coef))
}, df_table_account_tenure_vs_PPC)
Then, you can aggregate your results in a data.frame:
df <- as.data.frame(t(coefficients))
colnames(df) <- c("Year1", "Year2", "N_cor", "Y_cor")
This should work. Please tell me if you have any problem.
Again, make sure you have more than one observation in each condition if you want a meaningful correlation coefficient.
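To get the single averaged coefficient the question asks for, you can then average the pairwise results (a sketch, assuming the df built just above; na.rm drops NA coefficients from degenerate pairs):
colMeans(df[, c("N_cor", "Y_cor")], na.rm = TRUE)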

Related

R group data into equal groups with a metric variable

I'm struggling to get a well-performing script for this problem: I have a table with score, x, y. I want to sort the table by score and then build groups based on the x value. Each group should have an equal sum (not count) of x. x is a metric number in the dataset and represents the historic turnover of a customer.
score x y
0.436024136 3 435
0.282303336 46 56
0.532358015 24 34
0.644236597 0 2
0.99623626 0 4
0.557673456 56 46
0.08898779 0 7
0.702941303 453 2
0.415717835 23 1
0.017497461 234 3
0.426239166 23 59
0.638896238 234 86
0.629610596 26 68
0.073107526 0 35
0.85741877 0 977
0.468612039 0 324
0.740704267 23 56
0.720147257 0 68
0.965212467 23 0
A good way to do this is to add a group variable to the data.frame with cumsum(). Then you can easily sum the groups with e.g. subset():
df$group <- cumsum(as.numeric(df$x)) %/% ceiling(sum(df$x) / 3) + 1  # df is your data.frame; 3 = number of groups
remarks:
for big data.frames, cumsum(as.numeric()) avoids integer overflow and works reliably
%/% is integer division (you get an integer back)
the '+ 1' just makes your groups start at 1 instead of 0
Thank you @Ronak Shah!
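For illustration, a minimal sketch on made-up x values (the toy data and the target of 3 groups are my assumptions, not from the question):
df <- data.frame(x = c(10, 20, 30, 15, 25, 19))
df$group <- cumsum(as.numeric(df$x)) %/% ceiling(sum(df$x) / 3) + 1
df$group                    # 1 1 2 2 3 3
tapply(df$x, df$group, sum) # per-group sums: 30 45 44, roughly equal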

Calculate Hopkins statistic (coefficient) between groups of values that have different IDs in R

I have a dataset “data_file” which contains five columns and 1 million rows:
X1 "ID_Number" (numeric),
X2 “Sample_Type”,
X3 “Signal_X” (numeric),
X4 “Signal_Y” (numeric),
X5 “Signal_Z” (numeric).
Each value of the ID corresponds to a set of values “Signal_X”, “Signal_Y” and “Signal_Z”.
ID_Number :: Sample_Type :: Signal_X :: Signal_Y :: Signal_Z
2 Sample 337 1538 0.6314152
2 Sample 106 1840 0.9923422
…
2 Sample 94 1445 0.9967044
10 Sample 164 1777 0.9950826
10 Sample 183 1933 0.9931457
10 Sample 176 1590 0.9690951
…
10 Sample 139 1339 0.9820210
12 Sample 154 1397 0.9700886
12 Sample 144 1206 0.9457763
… etc
By scanning over the IDs, I found the correlation coefficient between “Signal_X” and “Signal_Y” using the following code:
library(plyr)
dataAE <- ddply(data_file, "ID_Number", summarise, CorrelationCoefficient = cor(Signal_X, Signal_Y))
View(dataAE)
The output should look like this.
datasetID Correlation Coefficient
1 2 0.48083503
2 3 -0.81036062
3 10 -0.32098672
4 12 -0.20251427
5 24 -0.18004939
6 51 -0.45803370
7 54 -0.59001642
8 63 -0.53976850
etc …
By analogy, I'm trying to compute the Hopkins statistic and find the optimal number of clusters for my dataset.
library(clustertend)
set.seed(123)
hopkins(data_file, n = nrow(data_file)-1)
I tried to replace CorrelationCoefficient = cor(Signal_X, Signal_Y) with HopkinsStatistics = hopkins(Signal_X, Signal_Y), but without results.
Manually, and without problems, I used the following code for each ID's subset:
library(factoextra)  # get_clust_tendency() comes from factoextra, not clustertend
# Compute Hopkins statistic for the subset
set.seed(123)
subset$sampletype <- NULL
df <- scale(subset)
res <- get_clust_tendency(df, 40, graph = FALSE)
# Hopkins statistic
res$hopkins_stat
res
The problem is how to automate these calculations using loops. Please help me. Thanks in advance.
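One way to automate this over all IDs (a sketch, not tested on the real data; it assumes clustertend::hopkins() returns a list with component H, and that data_file has the columns listed above):
library(clustertend)
set.seed(123)
# split the data by ID and compute the Hopkins statistic for each subset
results <- lapply(split(data_file, data_file$ID_Number), function(d) {
  m <- scale(d[, c("Signal_X", "Signal_Y", "Signal_Z")])
  hopkins(m, n = nrow(m) - 1)$H
})
data.frame(ID_Number = names(results), Hopkins = unlist(results))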

Divide a vector by different values based on the result of the division

I have a Df like this:
x y z
<dbl> <dbl> <dbl>
1 408001.9 343 0
2 407919.2 343 0
3 407839.6 343 0
4 407761.2 343 0
5 407681.7 343 0
6 407599.0 343 0
7 407511.0 343 0
8 407420.5 343 0
9 407331.0 343 0
10 407242.0 343 0
11 407152.7 343 0
12 407062.5 343 0
13 406970.7 343 0
14 406876.6 342 0
15 406777.1 342 0
16 406671.0 342 0
17 406560.9 342 0
18 406449.4 342 0
19 406339.0 342 0
20 406232.5 342 0
... ... ... ...
with x decreasing.
And a vector like
vec = c(a1, a2, a3, a4, a5, a6, ...)
with a1 < a2 < a3 < a4 < ...
Now I want to divide df$x by vec[1], which gives the same (rounded) result as df$y.
But when the value in df$z drops by one to 342, I want to divide the values in df$x by vec[2] from then on, to get the new df$z values.
From here on the result will differ from df$y, since for df$y the divisor is always vec[1] and never changes.
Every time the value I get for df$z drops by one, the subsequent df$z values shall be calculated with the corresponding vec[i], where i is the number of drops + 1 so far.
In the end I want a vector df$z where each value is df$x / vec[i], and vec[i] depends on what the last value of df$z was.
reproducible example:
test <- data.frame(x = sort(seq(500, 600, 2), decreasing = TRUE))
vec <- seq(10, 10.9, 0.03)
for(i in 1:31){
  test[i + 1] <- round(test$x / vec[i])
}
This will give you a df with one column for every value of vec that test$x was divided by.
Now, in the end, my vector shall contain the values of col2 until the value in col2 drops from 60 to 59. After that I want the values from col3 until the value in col3 drops from 59 to 58. Then I want the values from col4, and so on.
How can I achieve this with any data (like mine above, which is not linearly distributed like this example)?
I tried some for and while loops, but none worked. I didn't even get close to what I want.
I think my problem is that I don't know how to make the condition dependent on a value (the value of df$z at point i) that I want to calculate in the same operation. I want to calculate the value of df$z[i] with the value of vec[t] that has been used so far. But if the value of df$z drops by one at a certain observation i, the value of vec[t+1] shall be used for the division from then on.
Thanks for your help.
I hope I've understood what you are asking. This might be it...
test <- data.frame(x = sort(seq(500, 600, 2), decreasing = TRUE))
vec <- seq(10, 10.9, 0.03)
# this function determines the index of `vec` to use
# (note: it reads `vec` from the enclosing environment)
xcol <- function(v){
  x <- rep(NA, length(v))
  x[1] <- 1
  for(i in 2:length(v)){
    x[i] <- x[i-1]
    if(round(v[i] / vec[x[i]]) < round(v[i-1] / vec[x[i]])){
      x[i] <- x[i] + 1
    }
  }
  return(x)
}
test$xcol <- xcol(test$x)
test$z <- round(test$x/vec[test$xcol])
test
x xcol z
1 600 1 60
2 598 1 60
3 596 1 60
4 594 2 59
5 592 2 59
6 590 2 59
7 588 2 59
8 586 3 58
9 584 3 58
10 582 3 58
11 580 3 58
12 578 4 57
...

Custom sorting of a dataframe in R

I have a binomial dataset that looks like this:
df <- data.frame(replicate(4,sample(1:200,1000,rep=TRUE)))
addme <- data.frame(replicate(1,sample(0:1,1000,rep=TRUE)))
df <- cbind(df,addme)
df <- df[order(df$replicate.1..sample.0.1..1000..rep...TRUE..), ]
The data is currently sorted to show the instances belonging to the 0 group, then the ones belonging to the 1 group. Is there a way I can sort the data in a 0-1-0-1-0... fashion? I mean, show a row that belongs to the 0 group, then a row from the 1 group, then the 0 group, and so on...
All I can think about is complex functions. I hope there's a simple way around it.
Thank you,
Here's an attempt, which will add any extra 1's at the end:
First make some example data:
set.seed(2)
df <- data.frame(replicate(4,sample(1:200,10,rep=TRUE)),
addme=sample(0:1,10,rep=TRUE))
Then order:
with(df, df[unique(as.vector(rbind(which(addme==0),which(addme==1)))),])
# X1 X2 X3 X4 addme
#2 141 48 78 33 0
#1 37 111 133 3 1
#3 115 153 168 163 0
#5 189 82 70 103 1
#4 34 37 31 174 0
#6 189 171 98 126 1
#8 167 46 72 57 0
#7 26 196 30 169 1
#9 94 89 193 134 1
#10 110 15 27 31 1
#Warning message:
#In rbind(which(addme == 0), which(addme == 1)) :
# number of columns of result is not a multiple of vector length (arg 1)
Here's another way using dplyr, which would make it suitable for within-group ordering. It's also probably pretty quick. If there are unbalanced numbers of 0's and 1's, it will leave the extras at the end.
library(dplyr)
df %>%
  arrange(addme) %>%
  mutate(n0 = sum(addme == 0),
         orderme = seq_along(addme) - (n0 * addme) + (0.5 * addme)) %>%
  arrange(orderme) %>%
  select(-n0, -orderme)
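An equivalent base-R one-liner (my addition, a sketch): rank each row within its addme group with ave(), then order by that within-group rank, breaking ties by addme; this interleaves the groups and leaves any surplus rows at the end.
df[order(ave(seq_along(df$addme), df$addme, FUN = seq_along), df$addme), ]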

How to obtain a new table after filtering only one column in an existing table in R?

I have a data frame with 20 columns. I need to filter / remove noise from one column. After filtering with the convolve function I get a new vector of values; many values in the original column become NA due to the filtering process. The problem is that I need the whole table (for later analysis) with only those rows where the filtered column has values, but I can't bind the filtered column to the original table because the number of rows differs. Let me illustrate using the 'age' column of the 'Orange' dataset in R:
> head(Orange)
Tree age circumference
1 1 118 30
2 1 484 58
3 1 664 87
4 1 1004 115
5 1 1231 120
6 1 1372 142
Convolve filter used:
smooth <- function(x, D, delta){
  z <- exp(-abs(-D:D / delta))
  r <- convolve(x, z, type = 'filter') / convolve(rep(1, length(x)), z, type = 'filter')
  r <- head(tail(r, -D), -D)
  r
}
Filtering the 'age' column
age2 <- smooth(Orange$age, 5,10)
data.frame(age2)
The age and age2 columns have 35 and 15 rows, respectively. The original dataset has 2 more columns, and I'd like to work with them as well. Now I only need the 15 rows of each column corresponding to the 15 rows of age2; the filter removed the first and last ten values from the age column. How can I apply the filter so that I get a truncated dataset with all columns and only the filtered rows?
You would need to figure out how the variables line up. If you can add NA's to age2 and then do Orange$age2 <- age2 followed by na.omit(Orange) you should have what you want. Or, equivalently, perhaps this is what you are looking for?
df <- tail(head(Orange, -10), -10) # chop off the first and last 10 observations
df$age2 <- age2
df
Tree age circumference age2
11 2 1004 156 915.1678
12 2 1231 172 876.1048
13 2 1372 203 841.3156
14 2 1582 203 911.0914
15 3 118 30 948.2045
16 3 484 51 1008.0198
17 3 664 75 955.0961
18 3 1004 108 915.1678
19 3 1231 115 876.1048
20 3 1372 139 841.3156
21 3 1582 140 911.0914
22 4 118 32 948.2045
23 4 484 62 1008.0198
24 4 664 112 955.0961
25 4 1004 167 915.1678
Edit: If you know the first and last x observations will be removed then the following works:
x <- 2
df <- tail(head(Orange, -x), -x) # chop off the first and last x observations
df$age2 <- age2
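For the smooth() defined in this question, each side loses D values inside convolve(type = 'filter') and another D to head()/tail(), so x here is 2 * D; a sketch for the original example:
D <- 5
x <- 2 * D                          # 10 rows trimmed from each end
df <- tail(head(Orange, -x), -x)    # 35 - 2*x = 15 rows, matching age2
df$age2 <- smooth(Orange$age, D, 10)
df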
