First of all, I'd like to say that I'm completely new to R, and I'm just trying to accomplish this one task.
What I'd like to do is create a network diagram from a weighted matrix. I made an example:
The CSV is a simple correlation matrix that looks like this:
,A,B,C,D,E,F,G
A,1,0.9,0.64,0.43,0.38,0.33,0.33
B,0.9,1,0.64,0.33,0.43,0.38,0.38
C,0.64,0.64,1,0.59,0.69,0.64,0.64
D,0.43,0.33,0.59,1,0.28,0.23,0.28
E,0.38,0.43,0.69,0.28,1,0.95,0.9
F,0.33,0.38,0.64,0.23,0.95,1,0.9
G,0.33,0.38,0.64,0.28,0.9,0.9,1
I tried to draw the desired result by hand and came up with this:
To be more precise, I drew the diagram first; then, using a ruler, I took note of the distances, worked out an equation to convert them into weights, and made the CSV table.
The higher the value, the closer the two points are to each other.
However, whatever I do, the best result I get is this:
And this is how I'm trying to accomplish it, using this tutorial:
First of all, I import my matrix:
> matrix <- read.csv(file = 'test_dataset.csv')
But after printing the matrix out with head(), this already somehow cuts off the last row of the matrix:
> head(matrix)
ï.. A B C D E F G
1 A 1.00 0.90 0.64 0.43 0.38 0.33 0.33
2 B 0.90 1.00 0.64 0.33 0.43 0.38 0.38
3 C 0.64 0.64 1.00 0.59 0.69 0.64 0.64
4 D 0.43 0.33 0.59 1.00 0.28 0.23 0.28
5 E 0.38 0.43 0.69 0.28 1.00 0.95 0.90
6 F 0.33 0.38 0.64 0.23 0.95 1.00 0.90
> dim(matrix)
[1] 7 8
I then proceed with removing the first column so the matrix is square again...
> matrix <- data.matrix(matrix)[,-1]
> head(matrix)
A B C D E F G
[1,] 1.00 0.90 0.64 0.43 0.38 0.33 0.33
[2,] 0.90 1.00 0.64 0.33 0.43 0.38 0.38
[3,] 0.64 0.64 1.00 0.59 0.69 0.64 0.64
[4,] 0.43 0.33 0.59 1.00 0.28 0.23 0.28
[5,] 0.38 0.43 0.69 0.28 1.00 0.95 0.90
[6,] 0.33 0.38 0.64 0.23 0.95 1.00 0.90
> dim(matrix)
[1] 7 7
Then I create the graph and try to plot it:
> network <- graph_from_adjacency_matrix(matrix, weighted=T, mode="undirected", diag=F)
> plot(network)
And the result above appears...
After spending the last few hours googling and trying way, way more things, this is the closest I've been able to get.
So I'm asking for your help, thank you very much!
This is all fine.
head() just prints the first 6 rows of a matrix or data frame; if you want to see all of it, use print() or just the name of the matrix variable.
graph_from_adjacency_matrix produces a link between two nodes if the value is non-zero. That's why you are getting every node linked to every other node.
To get what that tutorial is doing, you need to add a line like
matrix[matrix < 0.5] <- 0
to remove the edges for correlations below a cutoff before you create the graph.
It's still not going to produce a chart like your hand-drawn one (where closeness roughly reflects correlation); it will just clump nodes together when they are correlated above 0.5.
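Putting the whole pipeline together, here is a minimal sketch, assuming the igraph package and the CSV from the question; the 0.5 cutoff and the edge-width scaling are arbitrary choices, not anything from the tutorial:

library(igraph)

# read the CSV, using the first column as row names so the matrix is square
m <- as.matrix(read.csv("test_dataset.csv", row.names = 1))

# drop weak correlations so they don't become edges
m[m < 0.5] <- 0

network <- graph_from_adjacency_matrix(m, weighted = TRUE,
                                       mode = "undirected", diag = FALSE)

# force-directed layouts treat larger weights as stronger attraction,
# which loosely approximates "higher correlation = closer together"
plot(network,
     layout = layout_with_fr(network),
     edge.width = E(network)$weight * 4)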
I conducted a factor analysis and wanted to create the latent concepts (postmaterialism and materialism) from the correlated variables (see the fa output below). Later on I want to merge the data set I used for the fa with another data set, so I kept the ID variable in order to use it later as the key variable. Now my problem is that I need to exclude the factor loadings of the ID variable, because otherwise they will distort each individual's score on the latent concepts. I tried different commands like
!("ID"), with = FALSE
-("ID"), with = FALSE
setdiff(names(expl_fa2), "ID"), with = FALSE
but nothing worked.
This is my code for the latent variables:
data_fa_1 <- data_fa_1 %>% mutate(postmat = expl_fa2$scores[,1], mat = expl_fa2$scores[,2])
And this is the output from the factor analysis:
Standardized loadings (pattern matrix) based upon correlation matrix
MR1 MR2 h2 u2 com
import_of_new_ideas 0.48 0.06 0.233 0.77 1.0
import_of_safety 0.06 0.61 0.375 0.63 1.0
import_of_trying_things 0.66 0.03 0.435 0.57 1.0
import_of_obedience 0.01 0.49 0.240 0.76 1.0
import_of_modesty 0.01 0.44 0.197 0.80 1.0
import_of_good_time 0.62 0.01 0.382 0.62 1.0
import_of_freedom 0.43 0.16 0.208 0.79 1.3
import_of_strong_gov 0.15 0.57 0.350 0.65 1.1
import_of_adventures 0.64 -0.15 0.427 0.57 1.1
import_of_well_behav 0.03 0.64 0.412 0.59 1.0
import_of_traditions 0.03 0.50 0.253 0.75 1.0
import_of_fun 0.67 0.03 0.449 0.55 1.0
ID 0.07 0.04 0.007 0.99 1.7
Can anyone help me with the command I need to use in order to exclude the factor loadings from the ID variable (see output fa) from the creation of the latent variables "postmat" and "mat"?
Not sure if this is really your question, but assuming you just want to remove the first column from a data.table, here is an example data.table and three ways you could exclude the ID column:
library(data.table)

DT <- data.table(
  ID = LETTERS[1:10],
  matrix(rnorm(50), nrow = 10, dimnames = list(NULL, paste0("col", 1:5)))
)

DT[, -1]                                        # drop the first column by position
DT[, -"ID"]                                     # drop it by name
DT[, setdiff(colnames(DT), "ID"), with = FALSE] # keep everything except ID
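If the goal is to keep ID out of the loadings entirely, another option is to drop it before running the factor analysis, so it never enters the model. A minimal sketch, assuming the raw data is a data.table called data_fa_1 and two factors are wanted (the object names and settings here are guesses based on the question):

library(psych)
library(data.table)

# run the fa on everything except ID, so ID never gets a loading
expl_fa2 <- fa(data_fa_1[, -"ID"], nfactors = 2)

# attach the factor scores as before
data_fa_1[, postmat := expl_fa2$scores[, 1]]
data_fa_1[, mat     := expl_fa2$scores[, 2]]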
I am trying to find a function that allows me to easily get the confidence interval of the difference between two means.
I am pretty sure t.test has this functionality, but I haven't been able to make it work. Below is a screenshot of what I have tried so far:
[screenshot of the attempted t.test calls, not reproduced here]
This is the dataset I am using
Indoor Outdoor
1 0.07 0.29
2 0.08 0.68
3 0.09 0.47
4 0.12 0.54
5 0.12 0.97
6 0.12 0.35
7 0.13 0.49
8 0.14 0.84
9 0.15 0.86
10 0.15 0.28
11 0.17 0.32
12 0.17 0.32
13 0.18 1.55
14 0.18 0.66
15 0.18 0.29
16 0.18 0.21
17 0.19 1.02
18 0.20 1.59
19 0.22 0.90
20 0.22 0.52
21 0.23 0.12
22 0.23 0.54
23 0.25 0.88
24 0.26 0.49
25 0.28 1.24
26 0.28 0.48
27 0.29 0.27
28 0.34 0.37
29 0.39 1.26
30 0.40 0.70
31 0.45 0.76
32 0.54 0.99
33 0.62 0.36
and I have been trying to use the t.test function after installing
install.packages("ggpubr")
I am pretty new to R, so sorry if there is a simple answer to this question. I have searched around quite a bit and haven't been able to find what I am looking for.
Note: The output I am looking for is Between -1.224 and 0.376
Edit:
The CI of the difference between means that I am looking for is for the case where a random 34th data point is added to the data by picking a random value in the Indoor column and a random value in the Outdoor column and duplicating them. Running t.test outputs the correct CI for the difference of means for the given sample size of 33.
How can I go about doing this pretending the sample size is 34?
There's probably something more convenient in the standard library, but it's pretty easy to calculate. Given your df variable, we can just do:
# calculate mean of difference
d_mu <- mean(df$Indoor) - mean(df$Outdoor)
# calculate SD of difference
d_sd <- sqrt(var(df$Indoor) + var(df$Outdoor))
# calculate 95% CI of this
d_mu + d_sd * qt(c(0.025, 0.975), nrow(df)*2)
giving me: -1.2246 0.3767
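For comparison, here is a sketch of what base R's t.test reports: the Welch confidence interval for the difference of the two means. That is a different (and much narrower) quantity than the interval above, because it scales each variance down by the sample size:

# CI for the difference of the two means (not for a single new
# Indoor-minus-Outdoor difference, which is what the code above computes)
t.test(df$Indoor, df$Outdoor)$conf.int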
Mostly for #AkselA: I often find it helpful to check my work by sampling simpler distributions; in this case I'd do something like:
a <- mean(df$Indoor) + sd(df$Indoor) * rt(1000000, nrow(df)-1)
b <- mean(df$Outdoor) + sd(df$Outdoor) * rt(1000000, nrow(df)-1)
quantile(a - b, c(0.025, 0.975))
which gives me answers much closer to the CI I gave in the comment
Even though I find the approach of manually calculating the results, as shown by #Sam Mason, the most insightful, there are some who want a shortcut. And sometimes it's also OK to be lazy :)
So among the different ways to calculate CIs, this is IMHO the most comfortable:
DescTools::MeanDiffCI(Indoor, Outdoor)
Here's a reprex:
library(ggplot2)  # for the diamonds example data
IV <- diamonds$price
DV <- rnorm(length(IV), mean = mean(IV), sd = sd(IV))
DescTools::MeanDiffCI(IV, DV)
gives
meandiff lwr.ci upr.ci
-18.94825 -66.51845 28.62195
For the bootstrap-based methods, this is calculated with 999 resamples by default; if you want 1000 or more, you can just pass that in the argument R:
DescTools::MeanDiffCI(IV, DV, R = 1000)
Here is the result of a PCA.
For RC1 and RC3 I can interpret which variables are related to them, but I cannot interpret RC2.
Checking the eigenvalues suggests the number of factors is 3.
But can there really be only two? Or, which variables should be related to RC2?
There are 7 input variables, and I used the principal() function.
names(mydata)
[1] "A" "B" "C" "D" "E" "F" "G"
> x<-cbind(A, B, C, D, E, F, G)
> e_value<-eigen(cor(x))
> e_value
eigen() decomposition
$values
[1] 2.3502254 1.4170606 1.2658360 0.8148231 0.5608698 0.3438629 0.2473222
$vectors
           [,1]        [,2]        [,3]        [,4]        [,5]         [,6]         [,7]
[1,]  0.2388621  0.46839043  0.37003850  0.47205027 -0.58802244 -0.133939151 -0.009233395
[2,]  0.1671739 -0.71097984 -0.14062597  0.25083439 -0.26726985 -0.502411130 -0.244983436
[3,]  0.2132841 -0.19677142  0.64662974  0.34508779  0.61416969 -0.003950736  0.036814153
[4,]  0.1697817 -0.24468987  0.55631886 -0.69016805 -0.34039757  0.039899816  0.089531675
[5,]  0.4857016  0.36681570 -0.09905329 -0.31456085  0.26225761 -0.344919726 -0.577088755
[6,] -0.5359245  0.20164924  0.17958243 -0.13144417  0.11755661 -0.748885304  0.218966481
[7,]  0.5635252  0.03619081 -0.27131854 -0.05105919  0.08439733 -0.219629096  0.741315659
> PCA<-principal(x,nfactors = 3, rotate = "varimax")
> print(PCA)
Principal Components Analysis
Call: principal(r = x, nfactors = 3, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
RC1 RC2 RC3 h2 u2 com
A 0.24 0.69 0.29 0.62 0.38 1.6
B 0.25 -0.83 0.24 0.81 0.19 1.3
C 0.06 0.05 0.83 0.69 0.31 1.0
D 0.03 -0.04 0.74 0.54 0.46 1.0
E 0.76 0.42 -0.01 0.76 0.24 1.5
F -0.83 0.24 -0.17 0.77 0.23 1.3
G 0.92 -0.01 0.00 0.84 0.16 1.0
RC1 RC2 RC3
SS loadings 2.23 1.40 1.40
Proportion Var 0.32 0.20 0.20
Cumulative Var 0.32 0.52 0.72
Proportion Explained 0.44 0.28 0.28
Cumulative Proportion 0.44 0.72 1.00
Mean item complexity = 1.3
Test of the hypothesis that 3 components are sufficient.
The root mean square of the residuals (RMSR) is 0.11
with the empirical chi square 63.33 with prob < 1.1e-13
Fit based upon off diagonal values = 0.84
The relationship between the principal components and the 7 variables is right there in the output:
Standardized loadings (pattern matrix) based upon correlation matrix
RC1 RC2 RC3 h2 u2 com
A 0.24 0.69 0.29 0.62 0.38 1.6
B 0.25 -0.83 0.24 0.81 0.19 1.3
C 0.06 0.05 0.83 0.69 0.31 1.0
D 0.03 -0.04 0.74 0.54 0.46 1.0
E 0.76 0.42 -0.01 0.76 0.24 1.5
F -0.83 0.24 -0.17 0.77 0.23 1.3
G 0.92 -0.01 0.00 0.84 0.16 1.0
This matrix tells you which variables are correlated with each principal component. So you can see that A and B are strongly correlated with the second rotated component (labeled RC2).
PCA is a rotation of the data, so there are as many principal components as there are variables (7 here). But most people are interested in visualization, so generally only the first 2 or 3 principal components are kept for plotting.
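If it helps to see this programmatically, here is a small sketch, assuming the PCA object from above: the print method for loadings takes a cutoff argument that hides small loadings, and psych provides a biplot method for a quick visual check of the first two components:

# show only loadings with absolute value above 0.4
print(PCA$loadings, cutoff = 0.4)

# plot the variables against the first two rotated components
biplot(PCA)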
The psych::print.psych() function produces beautiful output for the factor analysis objects produced by psych::fa(). I would like to obtain the table that follows the text "Standardized loadings (pattern matrix) based upon correlation matrix" as a data frame without cutting and pasting.
library(psych)
my.fa <- fa(Harman74.cor$cov, 4)
my.fa #Equivalent to print.psych(my.fa)
Yields the following (I'm showing the first four items here):
Factor Analysis using method = minres
Call: fa(r = Harman74.cor$cov, nfactors = 4)
Standardized loadings (pattern matrix) based upon correlation matrix
MR1 MR3 MR2 MR4 h2 u2 com
VisualPerception 0.04 0.69 0.04 0.06 0.55 0.45 1.0
Cubes 0.05 0.46 -0.02 0.01 0.23 0.77 1.0
PaperFormBoard 0.09 0.54 -0.15 0.06 0.34 0.66 1.2
Flags 0.18 0.52 -0.04 -0.02 0.35 0.65 1.2
I tried examining the source code for print.psych (using View(print.psych) in RStudio), but could only find a section for printing standardized loadings for 'Factor analysis by Groups'.
The my.fa$weights are not standardized, and the table is missing the h2, u2, and com columns. If they can be standardized, the following code could work:
library(data.table)
library(psych)
my.fa <- fa(Harman74.cor$cov,4)
my.fa.table <- data.table(dimnames(Harman74.cor$cov)[[1]],
my.fa$weights, my.fa$communalities, my.fa$uniquenesses, my.fa$complexity)
setnames(my.fa.table, old = c("V1", "V3", "V4", "V5"),
new = c("item", "h2", "u2", "com"))
Printing my.fa.table gives the following (I show the first four lines), which indicates $weights is incorrect:
item MR1 MR3 MR2 MR4 h2 u2 com
1: VisualPerception -0.021000973 0.28028576 0.006002429 -0.001855021 0.5501829 0.4498201 1.028593
2: Cubes -0.003545975 0.11022570 -0.009545919 -0.012565221 0.2298420 0.7701563 1.033828
3: PaperFormBoard 0.028562047 0.13244895 -0.019162262 0.014448449 0.3384722 0.6615293 1.224154
4: Flags 0.009187032 0.14430196 -0.025374834 -0.033737089 0.3497962 0.6502043 1.246102
Replacing $weights with $loadings gives the following error message:
Error in as.data.frame.default(x, ...) :
cannot coerce class ‘"loadings"’ to a data.frame
Update:
Adding [,] fixed the class issue:
library(data.table)
library(psych)
my.fa <- fa(Harman74.cor$cov,4)
my.fa.table <- data.table(dimnames(Harman74.cor$cov)[[1]],
my.fa$loadings[,], my.fa$communalities, my.fa$uniquenesses, my.fa$complexity)
setnames(my.fa.table, old = c("V1", "V3", "V4", "V5"),
new = c("item", "h2", "u2", "com"))
my.fa.table
item MR1 MR3 MR2 MR4 h2 u2 com
1: VisualPerception 0.04224875 0.686002901 0.041831185 0.05624303 0.5501829 0.4498201 1.028593
2: Cubes 0.05309628 0.455343417 -0.022143990 0.01372376 0.2298420 0.7701563 1.033828
3: PaperFormBoard 0.08733001 0.543848733 -0.147686005 0.05523805 0.3384722 0.6615293 1.224154
4: Flags 0.17641395 0.517235582 -0.038878915 -0.02229273 0.3497962 0.6502043 1.246102
I would still be happy to get an answer that does this more elegantly or explains why this isn't built in.
It is not built in because each person wants something slightly different. As you discovered, you can create a table by combining four objects from fa: the loadings, the communalities, the uniqueness, and the complexity.
df <- data.frame(unclass(f$loadings), h2 = f$communalities,
                 u2 = f$uniquenesses, com = f$complexity)
round(df, 2)
so, for the Thurstone correlation matrix:
f <- fa(Thurstone, 3)
df <- data.frame(unclass(f$loadings), h2 = f$communalities,
                 u2 = f$uniquenesses, com = f$complexity)
round(df, 2)
Produces
MR1 MR2 MR3 h2 u2 com
Sentences 0.90 -0.03 0.04 0.82 0.18 1.01
Vocabulary 0.89 0.06 -0.03 0.84 0.16 1.01
Sent.Completion 0.84 0.03 0.00 0.74 0.26 1.00
First.Letters 0.00 0.85 0.00 0.73 0.27 1.00
Four.Letter.Words -0.02 0.75 0.10 0.63 0.37 1.04
Suffixes 0.18 0.63 -0.08 0.50 0.50 1.20
Letter.Series 0.03 -0.01 0.84 0.73 0.27 1.00
Pedigrees 0.38 -0.05 0.46 0.51 0.49 1.96
Letter.Group -0.06 0.21 0.63 0.52 0.48 1.25
Or, you can try fa2latex for nice LaTeX-based formatting:
fa2latex(f)
which produces a LaTeX table in quasi-APA style.
I always transpose using the t(file) command in R.
But it is not running properly (not running at all) on a big data file (250,000 rows and 200 columns). Any ideas?
I need to calculate the correlation between the 2nd row (PTBP1) and all other rows (except 8 rows, including the header). In order to do this I transpose rows to columns and then use the cor function.
But I am stuck at the transpose step. Any help would be really appreciated!
I copied this example from another Stack Overflow post (they discuss almost the same problem, but there seems to be no answer yet):
ID A B C D E F G H I [200 columns]
Row0$-1 0.08 0.47 0.94 0.33 0.08 0.93 0.72 0.51 0.55
Row02$1 0.37 0.87 0.72 0.96 0.20 0.55 0.35 0.73 0.44
Row03$ 0.19 0.71 0.52 0.73 0.03 0.18 0.13 0.13 0.30
Row04$- 0.08 0.77 0.89 0.12 0.39 0.18 0.74 0.61 0.57
Row05$- 0.09 0.60 0.73 0.65 0.43 0.21 0.27 0.52 0.60
Row06-$ 0.60 0.54 0.70 0.56 0.49 0.94 0.23 0.80 0.63
Row07$- 0.02 0.33 0.05 0.90 0.48 0.47 0.51 0.36 0.26
Row08$_ 0.34 0.96 0.37 0.06 0.20 0.14 0.84 0.28 0.47
........
250,000 rows
Use a matrix instead. The only advantage of a data frame over a matrix is the capacity to have different classes in the columns, and you clearly do not have that situation, since a transposed data frame could not support such a result.
I don't get why you want to transpose the data.frame. If you just use cor, it doesn't matter whether your data is in rows or columns.
Actually, it is one of the major advantages of R that it doesn't matter whether your data fits the classical row-column pattern that SPSS and other programs require.
There are numerous ways to correlate the first row with all other rows (I don't get which rows you want to exclude). One is using a loop (here the loop is implicit in the call to one of the *apply family functions):
lapply(2:nrow(fn), function(x) cor(unlist(fn[1, -1]), unlist(fn[x, -1])))
Note that I expect your data.frame to be called fn; the -1 drops the non-numeric ID column, and unlist() turns each one-row data frame into a plain numeric vector that cor() accepts. To skip some rows, change the 2 to the number you want. Furthermore, I would probably use vapply here.
I hope this answer points you in the correct direction, which is to not use t() if you absolutely don't need it.
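As a follow-up to both answers: cor() also accepts a vector and a matrix, so the whole computation can be done in one vectorised call with no explicit loop. A minimal sketch, assuming the data frame is called fn, its first column holds the row IDs, and one of those IDs is "PTBP1":

# coerce to a numeric matrix; transposing a matrix is fast,
# unlike transposing a large data frame
m <- as.matrix(fn[, -1])
rownames(m) <- fn[[1]]

target <- m["PTBP1", ]           # the row of interest
cors <- cor(target, t(m))[1, ]   # correlate it against every row at once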