I work in healthcare and I need help with R.
Here is my situation: I have a data set like this:
S1 S2 S3 S4 S5
0.498 1.48 1.43 0.536 0.548
2.03 1.7 3.74 2.13 2.02
0.272 0.242 0.989 0.534 0.787
0.986 2.03 2.53 1.65 2.31
0.307 0.934 0.633 0.36 0.281
0.78 0.76 0.706 0.81 1.11
0.829 2.03 0.667 1.48 1.42
0.497 1.27 0.952 1.23 1.73
0.553 0.286 0.513 0.422 0.573
Here are my objectives:
Compute the correlation between every pair of columns
Calculate the p-values
Calculate R-squared
Only show the pairs where R2 > 0.5 and the p-value < 0.05
Here is my code so far (it's not the most efficient, but it works):
> e <- read.table("Workbook8nm.csv", header=TRUE, sep=",", dec=".", na.strings="NA")
> f <- data.frame(e)
> M <- cor(f, use="complete") # Does the correlation like I want
> library(psych)
> N <- corr.test(f) # Gives me the p-values
So far I have my correlations in M and my p-values in N.
First, how do I show R2?
Second, how can I make R show only the pairs where, for example, R2 > 0.5 and the p-value < 0.05? I used this line:
P <- M[which(M > 0.9)]
to show only the Pearson coefficients greater than 0.9, as practice. But that just gives me a flat list of every value above 0.9, so I can't tell which pair of columns each coefficient comes from. Ideally the significant values would be shown in a table with the column names, so I can identify them easily.
The reason I want this is that my table is 570 by 570, so I can't inspect every p-value by hand to keep only the significant ones.
I hope I was clear! It's my first post here, so tell me if I made any mistakes!
Thanks for your help!
I'm sure there is a function somewhere in the R ecosystem to do this more quickly, but I wrote a quick function to expand a matrix into a data.frame with the "row" and "column" as columns, and the value as a third column.
matrixToFrame <- function(m, name) {
  # One row per matrix cell: "row" and "col" identify the cell,
  # and the column `name` holds its value
  e <- expand.grid(row=rownames(m), col=colnames(m))
  e[name] <- as.vector(m)
  e
}
We can transform the correlation matrix into a data frame like so:
> matrixToFrame(cor(f), "cor")
row col cor
1 S1 S1 1.0000000
2 S2 S1 0.5322052
3 S3 S1 0.8573687
4 S4 S1 0.8542438
5 S5 S1 0.6820144
6 S1 S2 0.5322052
....
And we can merge the result of corr.test and cor, because the columns match up:
> b <- merge(matrixToFrame(corr.test(f)$p, "p"), matrixToFrame(cor(f), "cor"))
> head(b)
row col p cor
1 S1 S1 0.0000000000 1.0000000
2 S1 S2 0.2743683745 0.5322052
3 S1 S3 0.0281656707 0.8573687
4 S1 S4 0.0281656707 0.8542438
5 S1 S5 0.2134783039 0.6820144
6 S2 S1 0.1402243214 0.5322052
Then we can just filter for the elements we want (the thresholds here are arbitrary, chosen only to demonstrate the filtering on this small example):
> b[b$cor > .5 & b$p > .2,]
row col p cor
2 S1 S2 0.2743684 0.5322052
5 S1 S5 0.2134783 0.6820144
8 S2 S3 0.2743684 0.5356585
10 S2 S5 0.2134783 0.6724486
15 S3 S5 0.2134783 0.6827349
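Note that for a plain pairwise correlation, R2 is simply the square of the correlation coefficient, so the question's actual thresholds translate directly into a filter on b. A minimal sketch:
b[b$cor^2 > 0.5 & b$p < 0.05, ]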
EDIT: I found "R matrix to rownames colnames values", which provides a couple of attempts at matrixToFrame; nothing particularly more elegant than what I have here, though.
EDIT2: Make sure to read the docs carefully for corr.test -- it looks like different information gets encoded in the upper and lower triangles (?), so the results here may be deceptive. You may want to do some filtering with lower.tri or upper.tri before the final filtering step.
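For example, one way (a sketch) to keep each unordered pair only once is to keep the entries above the diagonal, using the fact that expand.grid created row and col as factors with identical levels; check the corr.test docs for which triangle holds the adjusted p-values:
b2 <- b[as.integer(b$row) < as.integer(b$col), ] # one row per unordered pair
b2[b2$cor^2 > 0.5 & b2$p < 0.05, ]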
The acf function in the stats package returns a complex object. For example
x = rnorm(1000, mean=100, sd=10)
acf(x)
draws a plot. If I do
acf_x = acf(x)
acf_x
it returns
Autocorrelations of series ‘x’, by lag
0 1 2 3 4 5 6 7 8 9 10 11
1.000 0.000 -0.031 -0.002 -0.052 0.017 -0.014 0.030 0.011 0.002 -0.044 0.000
12 13 14 15 16 17 18 19 20 21 22 23
0.055 -0.007 0.049 0.025 -0.027 -0.048 0.033 0.027 0.043 -0.007 -0.010 0.025
24 25 26 27 28 29 30
-0.083 0.045 -0.074 0.016 0.041 -0.046 0.010
If I look at class(acf_x) it returns 'acf'.
How do I extract the autocorrelation versus lag into a data_frame?
More generally, when presented with a function that returns a complex object, how do I extract the data from it, i.e. is there a general pattern for this type of function?
If you look at the help page for acf via ?acf, you'll see under "Value" what the output will look like.
In this case, the acf object is a list with several elements.
If you e.g. want the lags, you can simply access this via:
my_lags <- acf_x$lag
Deschen's answer to the original question gives the general response to "how do I discover the elements in a complex model object": str(). One can also use the names() function for S3 objects; the result lists the names one can use to extract elements from the list() with the $ or [[ forms of the extract operator.
set.seed(95014)
x = rnorm(1000, mean=100, sd=10)
acf_x <- acf(x)
> names(acf_x)
[1] "acf"    "type"   "n.used" "lag"    "series" "snames"
Since the acf and lag elements are stored as arrays, we'll need to extract just the first dimension to obtain a simple vector. We can accomplish this by chaining the [ form of the extract operator onto the object that is generated by the [[ extract on the model object.
> head(acf_x[["acf"]][,1,1]) # second extract returns a simple vector
[1] 1.000000000 -0.034863150 0.037745441 -0.020464290 -0.004974406
[6] 0.016770363
In this case R performs the extraction left to right - first acf_x[["acf"]] is evaluated, and then [,1,1] is applied to the result.
As for the concrete part of the question, "how do I create a data frame with this data?", one can build a data frame from the output of acf() as follows.
set.seed(95014)
x = rnorm(1000, mean=100, sd=10)
acf_x <- acf(x)
results <- data.frame(acf_value = acf_x$acf[,1,1],
acf_lag = acf_x$lag[,1,1])
head(results)
...and the output:
> head(results)
acf_value acf_lag
1 1.000000000 0
2 -0.034863150 1
3 0.037745441 2
4 -0.020464290 3
5 -0.004974406 4
6 0.016770363 5
Try
str(acf_x)
or
print.default(acf_x)
Either will give you an idea of what the object looks like internally and how to access the elements in it.
This question is sort of a follow-up to how to extract intragroup and intergroup distances from a distance matrix? in R. In that question, they first computed the distance matrix for all points, and then simply extracted the inter-class distance matrix. I have a situation where I'd like to bypass the initial computation and skip right to extraction, i.e. I want to directly compute the inter-class distance matrix. Drawing from the linked example, with tweaks, let's say I have some data in a dataframe called df:
values<-c(0.002,0.3,0.4,0.005,0.6,0.2,0.001,0.002,0.3,0.01)
class<-c("A","A","A","B","B","B","B","A","B","A")
df<-data.frame(values, class)
What I'd like is a distance matrix:
1 2 3 8 10
4 .003 .295 .395 .003 .005
5 .598 .300 .200 .598 .590
6 .198 .100 .200 .198 .190
7 .001 .299 .399 .001 .009
9 .298 .000 .100 .298 .290
Does there already exist in R an elegant and fast way to do this?
EDIT After receiving a good solution for the 1D case above, I thought of a bonus question: what about a higher-dimensional case, say if instead df looks like this:
values1<-c(0.002,0.3,0.4,0.005,0.6,0.2,0.001,0.002,0.3,0.01)
values2<-c(0.001,0.1,0.1,0.001,0.1,0.1,0.001,0.001,0.1,0.01)
class<-c("A","A","A","B","B","B","B","A","B","A")
df<-data.frame(values1, values2, class)
And I'm interested in again getting a matrix of the Euclidean distance between points in class B with points in class A.
For general n-dimensional Euclidean distance, we can exploit the equation (not R, but algebra):
square_dist(b,a) = sum_i(b[i]*b[i]) + sum_i(a[i]*a[i]) - 2*inner_prod(b,a)
where the sums are over the dimensions of vectors a and b for i=[1,n]. Here, a and b are one pair from A and B. The key here is that this equation can be written as a matrix equation for all pairs in A and B.
In code:
## First split the data with respect to the class
n <- 2 ## the number of dimensions; 2 for this example
tmp <- split(df[,1:n], df$class)
d <- sqrt(matrix(rowSums(expand.grid(rowSums(tmp$B*tmp$B),rowSums(tmp$A*tmp$A))),
nrow=nrow(tmp$B)) -
2. * as.matrix(tmp$B) %*% t(as.matrix(tmp$A)))
Notes:
The inner rowSums compute sum_i(b[i]*b[i]) and sum_i(a[i]*a[i]) for each b in B and a in A, respectively.
expand.grid then generates all pairs between B and A.
The outer rowSums computes the sum_i(b[i]*b[i]) + sum_i(a[i]*a[i]) for all these pairs.
This result is then reshaped into a matrix. Note that the number of rows of this matrix is the number of points of class B as you requested.
Then subtract two times the inner product of all pairs. This inner product can be written as a matrix multiply tmp$B %*% t(tmp$A) where I left out the coercion to matrix for clarity.
Finally, take the square root.
Using this code with your data:
print(d)
## 1 2 3 8 10
##4 0.0030000 0.3111688 0.4072174 0.0030000 0.01029563
##5 0.6061394 0.3000000 0.2000000 0.6061394 0.59682493
##6 0.2213707 0.1000000 0.2000000 0.2213707 0.21023796
##7 0.0010000 0.3149635 0.4110985 0.0010000 0.01272792
##9 0.3140143 0.0000000 0.1000000 0.3140143 0.30364453
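As a quick cross-check (a sketch), the same matrix can be recovered the long way from the full distance matrix, which is what the linked question did:
D <- as.matrix(dist(df[,1:n])) # all pairwise distances
d_check <- D[df$class=="B", df$class=="A"] # rows from class B, columns from class A
all.equal(c(d), c(d_check)) # should be TRUE up to floating point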
Note that this code will work for any n > 1. We can recover your previous 1-d result by setting n to 1 and omitting the inner rowSums (because there is now only one column in tmp$A and tmp$B):
n <- 1 ## the number of dimensions, set this now to 1
tmp <- split(df[,1:n], df$class)
d <- sqrt(matrix(rowSums(expand.grid(tmp$B*tmp$B,tmp$A*tmp$A)),
nrow=length(tmp$B)) -
2. * as.matrix(tmp$B) %*% t(as.matrix(tmp$A)))
print(d)
## [,1] [,2] [,3] [,4] [,5]
##[1,] 0.003 0.295 0.395 0.003 0.005
##[2,] 0.598 0.300 0.200 0.598 0.590
##[3,] 0.198 0.100 0.200 0.198 0.190
##[4,] 0.001 0.299 0.399 0.001 0.009
##[5,] 0.298 0.000 0.100 0.298 0.290
Here's an attempt that generates each combination and then simply takes the difference within each pair:
abs(matrix(Reduce(`-`, expand.grid(split(df$values, df$class))), nrow=5, byrow=TRUE))
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0.003 0.295 0.395 0.003 0.005
#[2,] 0.598 0.300 0.200 0.598 0.590
#[3,] 0.198 0.100 0.200 0.198 0.190
#[4,] 0.001 0.299 0.399 0.001 0.009
#[5,] 0.298 0.000 0.100 0.298 0.290
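For the 1-D case, a compact base-R alternative (a sketch) is outer(), which applies a function to every B/A pair directly:
abs(outer(df$values[df$class=="B"], df$values[df$class=="A"], "-"))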
A sample of 100 subjects responded to two personality tests. These tests have slightly different wordings but are generally the same, i.e. they both measure the same 4 attitudes. Therefore, I have 2 matrices like this, with 4 scores per subject:
>test1
subj A1 A2 A3 A4
1 -2.14 1.21 0.93 -1.72
2 0.25 1.17 0.67 0.67
>test2
subj A1 A2 A3 A4
1 -1.99 1.11 1.00 -1.52
2 0.24 1.20 0.71 0.65
I'd like to evaluate the similarity of profiles in the two tests, i.e. the similarity of the two sets of 4 scores for each individual. I feel like the Mahalanobis distance is the measure I need, and I checked some packages (HDMD, StatMatch) but couldn't find the right function.
One approach to this is to create a difference score matrix and then calculate the Mahalanobis distances on the difference scores.
testDiff <- test1[,-1] - test2[,-1] # drop the subj column first, or its zero-variance
                                    # difference makes the covariance matrix singular
testDiffMahalanobis <- mahalanobis(testDiff,
                                   center = colMeans(testDiff),
                                   cov = cov(testDiff))
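One caveat: stats::mahalanobis() returns squared distances, so if you want values on the scale of the scores themselves, take the square root:
profileDistance <- sqrt(testDiffMahalanobis) # one distance per subject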
I have a data frame of n columns and r rows. I want to determine which column is most correlated with column 1, and then aggregate these two columns. The aggregated column becomes the new column 1. Then I remove the most-correlated column from the set, so the data shrinks by one column. I repeat the process until the result data frame has n columns, with the second column being the aggregation of two columns, the third column the aggregation of three columns, etc. I'm wondering if there is an efficient or quicker way to get the result I'm going for. I've tried various things, but without success so far. Any suggestions?
n <- 5
r <- 6
> df
X1 X2 X3 X4 X5
1 0.32 0.88 0.12 0.91 0.18
2 0.52 0.61 0.44 0.19 0.65
3 0.84 0.71 0.50 0.67 0.36
4 0.12 0.30 0.72 0.40 0.05
5 0.40 0.62 0.48 0.39 0.95
6 0.55 0.28 0.33 0.81 0.60
This is what result should look like:
> result
X1 X2 X3 X4 X5
1 0.32 0.50 1.38 2.29 2.41
2 0.52 1.17 1.78 1.97 2.41
3 0.84 1.20 1.91 2.58 3.08
4 0.12 0.17 0.47 0.87 1.59
5 0.40 1.35 1.97 2.36 2.84
6 0.55 1.15 1.43 2.24 2.57
I think most of the slowness and the eventual crash come from memory overhead during the loop, not from the correlations (though those could be improved too, as @coffeeinjunky says). This is most likely a result of the way data.frames are modified in R. Consider switching to data.table and taking advantage of its "assignment by reference" paradigm. For example, below is your code translated into data.table syntax. You can time the two loops, compare performance, and comment on the results. Cheers.
library(data.table) # for setDT, copy and set

n <- 5L
r <- 6L
result <- setDT(data.frame(matrix(NA, nrow=r, ncol=n)))
temp <- copy(df) # a temporary table in which the correlations are calculated
set(result, j=1L, value=temp[[1]]) # the first column is unchanged
for (icol in 2:n) {
  mch <- match(max(cor(temp)[-1,1]), cor(temp)[,1]) # which column is correlated most with column 1
  set(x=result, i=NULL, j=icol, value=temp[[1]] + temp[[mch]]) # aggregate and store in result
  set(x=temp, i=NULL, j=1L, value=result[[icol]]) # the aggregate becomes the new 1st column
  set(x=temp, i=NULL, j=mch, value=NULL) # remove the absorbed column
}
Try the plain data.frame version below (initialization added so the snippet runs standalone, mirroring the setup above):
temp <- df # a working copy that shrinks by one column per iteration
result <- data.frame(matrix(NA, nrow=r, ncol=n))
result[,1] <- df[,1] # the first column is unchanged
for (i in 2:n) {
  maxcor <- names(which.max(sapply(temp[,-1, drop=FALSE], function(x) cor(temp[,1], x))))
  result[,i] <- temp[,1] + temp[,maxcor]
  temp[,1] <- result[,i] # the aggregate becomes the new 1st column
  temp[,maxcor] <- NULL # remove the absorbed column
}
The original error arose because in the last iteration, subsetting temp yielded a single column, and standard R behavior is to drop the result from data frame to vector in such cases, which made sapply loop over the elements of that vector rather than over columns; drop=FALSE prevents this.
One more comment: currently you are using the most positive correlation, not the strongest correlation, which may also be negative. Make sure this is what you want.
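If you do want the strongest correlation regardless of sign, a one-line tweak (a sketch) is to rank by absolute value:
maxcor <- names(which.max(sapply(temp[,-1, drop=FALSE], function(x) abs(cor(temp[,1], x)))))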
To address your question in the comment: note that your old code could be improved by avoiding repeated computation. For instance,
mch <- match(c(max(cor(temp)[-1,1])),cor(temp)[,1])
contains the command cor(temp) twice. This means each and every correlation is computed twice. Replacing it with
cortemp <- cor(temp)
mch <- match(c(max(cortemp[-1,1])),cortemp[,1])
should cut the computational burden of the initial code line in half.
I have a nasty problem that bugs me a lot.
I have a list (dataframe) that looks like this:
a b c
1 1.00234 1.05667 1.00198
I want to round the numbers in this data frame to two decimal places, but the trailing zeros have to be kept, like the following:
a b c
1 1.00 1.06 1.00
I tried round() and sprintf() and so on, but it doesn't work because my data is a list and can't be coerced. However, I'd like to keep my data structure.
Does anyone know how to solve this? I'd appreciate it very much!
Here you go:
df <- data.frame(a=1.00234, b=1.0567, c=1.00198, d=99999, e=.00001)
format(round(df, 2), nsmall=2)
# a b c d e
# 1 1.00 1.06 1.00 99999.00 0.00
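Note that format() returns character columns, so it is best kept for display; if you need the values to stay numeric, keep the result of round() and format only when printing. A small sketch:
rounded <- round(df, 2) # still numeric; 1.00 will print as 1
format(rounded, nsmall=2) # character representation with trailing zeros kept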
You can use sapply to effectively loop the rounding over the columns. I'm having trouble determining exactly what your data looks like. But if it's a data.frame with numerous rows and columns, see this example. You'll need to convert back to class data.frame if that matters, as this method of sapply will coerce the data frame to a matrix.
> dd <- data.frame(a = runif(6), b = runif(6), c = rnorm(6))
> dd
a b c
1 0.3992252 0.9905755 -0.2557345
2 0.5052276 0.7990887 -0.7557547
3 0.3215714 0.1134675 -0.4389722
4 0.1794793 0.5372685 1.1657751
5 0.9543305 0.8908360 -1.5966621
6 0.9525730 0.5991279 -0.4819168
> as.data.frame(sapply(dd, round, 2))
a b c
1 0.40 0.99 -0.26
2 0.51 0.80 -0.76
3 0.32 0.11 -0.44
4 0.18 0.54 1.17
5 0.95 0.89 -1.60
6 0.95 0.60 -0.48
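Since round() is covered by the Math group generic for data frames, you can also skip sapply entirely; round(dd, 2) returns a rounded data.frame directly, with no coercion to matrix (though, as noted above, printing it will not show trailing zeros):
round(dd, 2) # stays a data.frame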
The same idea in C++ rather than R (round to two decimals, then print with trailing zeros):
#include <cstdio>
#include <cmath>

int main()
{
    float a = 1.00634f;
    float p = std::round(a * 100); // shift two places and round to the nearest integer
    float q = p / 100;             // shift back: 1.01
    std::printf("%.2f\n", q);      // prints 1.01 with both decimals kept
    return 0;
}