I was trying to find associations between top 10 frequent words with the rest of the frequent words int the input text.
When I look at the individual output of findAssocs():
findAssocs(dtm, "good", corlimit=0.4)
It gives the output clearly by printing the word 'good' with which associations have been sought.
$good
better got hook next content fit person
0.44 0.44 0.44 0.44 0.43 0.43 0.43
But when I try to automate this process for a character vector having top 10 words:
t10 <- c("busi", "entertain", "topic", "interact", "track", "content", "paper", "media", "game", "good")
the output is a list of correlations for each of those elements BUT WITHOUT THE WORD WITH WHICH THE ASSOCIATIONS HAVE BEEN SOUGHT. The sample output is as below (plz notice that the word at t10[i] is not printed, unlike the above individual output where 'good' was clearly printed):
for(i in 1:10) {
t10_words[i] <- as.list(findAssocs(dtm, t10[i], corlimit=0.4))
}
> t10_words
[[1]]
littl descript disrupt enter model
0.50 0.48 0.48 0.48 0.48
[[2]]
immers anyth effect full holodeck iot problem say startrek such suspect wow
0.68 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48
[[3]]
area captur give overal like alon avid begin
0.51 0.47 0.47 0.47 0.44 0.43 0.43 0.43
circuit cloud collaboration communic communiti concis confus defin
0.43 0.43 0.43 0.43 0.43 0.43 0.43 0.43
discord doesnt drop enablesupport esport event everi everyon
0.43 0.43 0.43 0.43 0.43 0.43 0.43 0.43
How do I print the output along with the actual association word?
Can somebody please help me with this??
Thanks.
After running your for loop, add the following piece of code:
names(t10_words) <- t10
This will name the lists with the words specified in t10.
Related
First of all, I'd like to say that I'm completely new to R, and I'm just trying to accomplish this one task.
So, what I'm trying to do is that I'd like to create an network diagram from a weighted matrix. I made an example:
The CSV is a simple correlation matrix that looks like this:
,A,B,C,D,E,F,G
A,1,0.9,0.64,0.43,0.38,0.33,0.33
B,0.9,1,0.64,0.33,0.43,0.38,0.38
C,0.64,0.64,1,0.59,0.69,0.64,0.64
D,0.43,0.33,0.59,1,0.28,0.23,0.28
E,0.38,0.43,0.69,0.28,1,0.95,0.9
F,0.33,0.38,0.64,0.23,0.95,1,0.9
G,0.33,0.38,0.64,0.28,0.9,0.9,1
I tried to draw the wanted result by myself and came up with this:
To be more precise, I draw the diagram first, then, using a ruler, I took note of the distances, calculated an equation to get the weights and made the CSV table.
The higher the value is, the closer the two points are to each other.
However, whatever I do, the best result I get is this:
And this is how I'm trying to accomplish it, using this tutorial:
First of all, I import my matrix:
> matrix <- read.csv(file = 'test_dataset.csv')
But after printing the matrix out with head(), this already somehow cuts the last line of the matrix:
> head(matrix)
ï.. A B C D E F G
1 A 1.00 0.90 0.64 0.43 0.38 0.33 0.33
2 B 0.90 1.00 0.64 0.33 0.43 0.38 0.38
3 C 0.64 0.64 1.00 0.59 0.69 0.64 0.64
4 D 0.43 0.33 0.59 1.00 0.28 0.23 0.28
5 E 0.38 0.43 0.69 0.28 1.00 0.95 0.90
6 F 0.33 0.38 0.64 0.23 0.95 1.00 0.90
> dim(matrix)
[1] 7 8
I then proceed with removing the first column so the matrix is square again...
> matrix <- data.matrix(matrix)[,-1]
> head(matrix)
A B C D E F G
[1,] 1.00 0.90 0.64 0.43 0.38 0.33 0.33
[2,] 0.90 1.00 0.64 0.33 0.43 0.38 0.38
[3,] 0.64 0.64 1.00 0.59 0.69 0.64 0.64
[4,] 0.43 0.33 0.59 1.00 0.28 0.23 0.28
[5,] 0.38 0.43 0.69 0.28 1.00 0.95 0.90
[6,] 0.33 0.38 0.64 0.23 0.95 1.00 0.90
> dim(matrix)
[1] 7 7
Then I create the graph and try to plot it:
> network <- graph_from_adjacency_matrix(matrix, weighted=T, mode="undirected", diag=F)
> plot(network)
And the result above appears...
So, after spending the last few hours googling and trying way, way more things, this is the closest I've been able to get to.
So I'm asking for your help, thank you very much!
This is all fine.
head() just prints out the first 6 rows of a matrix or dataframe, if you want to see all of it use print() or just the name of the matrix variable.
graph_from_adjacency_matrix produces a link between two nodes if the value is non-zero. That's why you are getting every node linked to every other node.
To get what that tutorial is doing you need to add a line like
matrix[matrix<0.5] <- 0
to remove the edges for correlations below a cut off before you create the graph.
It's still not going to produce a chart like your hand drawn one (where closeness is roughly the correlation), just clump them together if they are above 0.5 correlation.
The psych::print.psych() function produces beautiful output for the factor analysis objects produced by psych::fa(). I would like to obtain the table that follows the text "Standardized loadings (pattern matrix) based upon correlation matrix" as a data frame without cutting and pasting.
library(psych)
my.fa <- fa(Harman74.cor$cov, 4)
my.fa #Equivalent to print.psych(my.fa)
Yields the following (I'm showing the first four items here):
Factor Analysis using method = minres
Call: fa(r = Harman74.cor$cov, nfactors = 4)
Standardized loadings (pattern matrix) based upon correlation matrix
MR1 MR3 MR2 MR4 h2 u2 com
VisualPerception 0.04 0.69 0.04 0.06 0.55 0.45 1.0
Cubes 0.05 0.46 -0.02 0.01 0.23 0.77 1.0
PaperFormBoard 0.09 0.54 -0.15 0.06 0.34 0.66 1.2
Flags 0.18 0.52 -0.04 -0.02 0.35 0.65 1.2
I tried examining the source code for print.psych (Using View(print.psych) in RStudio), but could only find a section for printing standardized loadings for 'Factor analysis by Groups'.
The my.fa$weights are not standardized, and the table is missing the h2, u2, and com columns. If they can be standardized, the following code could work:
library(data.table)
library(psych)
my.fa <- fa(Harman74.cor$cov,4)
my.fa.table <- data.table(dimnames(Harman74.cor$cov)[[1]],
my.fa$weights, my.fa$communalities, my.fa$uniquenesses, my.fa$complexity)
setnames(my.fa.table, old = c("V1", "V3", "V4", "V5"),
new = c("item", "h2", "u2", "com"))
Printing my.fa.table gives the following (I show the first four lines), which indicates $weights is incorrect:
item MR1 MR3 MR2 MR4 h2 u2 com
1: VisualPerception -0.021000973 0.28028576 0.006002429 -0.001855021 0.5501829 0.4498201 1.028593
2: Cubes -0.003545975 0.11022570 -0.009545919 -0.012565221 0.2298420 0.7701563 1.033828
3: PaperFormBoard 0.028562047 0.13244895 -0.019162262 0.014448449 0.3384722 0.6615293 1.224154
4: Flags 0.009187032 0.14430196 -0.025374834 -0.033737089 0.3497962 0.6502043 1.246102
Replacing $weights with $loadings gives the following error message:
Error in as.data.frame.default(x, ...) :
cannot coerce class ‘"loadings"’ to a data.frame
Update:
Adding [,] fixed the class issue:
library(data.table)
library(psych)
my.fa <- fa(Harman74.cor$cov,4)
my.fa.table <- data.table(dimnames(Harman74.cor$cov)[[1]],
my.fa$loadings[,], my.fa$communalities, my.fa$uniquenesses, my.fa$complexity)
setnames(my.fa.table, old = c("V1", "V3", "V4", "V5"),
new = c("item", "h2", "u2", "com"))
my.fa.table
item MR1 MR3 MR2 MR4 h2 u2 com
1: VisualPerception 0.04224875 0.686002901 0.041831185 0.05624303 0.5501829 0.4498201 1.028593
2: Cubes 0.05309628 0.455343417 -0.022143990 0.01372376 0.2298420 0.7701563 1.033828
3: PaperFormBoard 0.08733001 0.543848733 -0.147686005 0.05523805 0.3384722 0.6615293 1.224154
4: Flags 0.17641395 0.517235582 -0.038878915 -0.02229273 0.3497962 0.6502043 1.246102
I would still be happy to get an answer that does this more elegantly or explains why this isn't built in.
It is not built in because each person wants something slightly different. As you discovered, you can create a table by combining four objects from fa: the loadings, the communalities, the uniqueness, and the complexity.
df <- data.frame(unclass(f$loadings), h2=f$communalities, u2= f$uniqueness,com=f$complexity)
round(df,2)
so, for the Thurstone correlation matrix:
f <- fa(Thurstone,3)
df <- data.frame(unclass(f$loadings), h2=f$communalities, u2= f$uniqueness,com=f$complexity)
round(df,2)
Produces
MR1 MR2 MR3 h2 u2 com
Sentences 0.90 -0.03 0.04 0.82 0.18 1.01
Vocabulary 0.89 0.06 -0.03 0.84 0.16 1.01
Sent.Completion 0.84 0.03 0.00 0.74 0.26 1.00
First.Letters 0.00 0.85 0.00 0.73 0.27 1.00
Four.Letter.Words -0.02 0.75 0.10 0.63 0.37 1.04
Suffixes 0.18 0.63 -0.08 0.50 0.50 1.20
Letter.Series 0.03 -0.01 0.84 0.73 0.27 1.00
Pedigrees 0.38 -0.05 0.46 0.51 0.49 1.96
Letter.Group -0.06 0.21 0.63 0.52 0.48 1.25
Or, you can try the fa2latex for nice LaTex based formatting.
fa2latex(f)
which produces a LateX table in quasi APA style.
So I am trying to get an output in csv file but I am having trouble formatting as per my need.
My Code
method.metric <- mmetric(testCenScal[[course_name]], method.pred, c("RMSE", "R2", "MAE", "COR"))
write.table(method.metric, "metric.csv", sep = ",", col.names = T, append = T)
Current Output
"x"
"MAE",0.636059658390333
"RMSE",0.814405873704867
"COR",0.581863604936215
"R2",0.338565254749368
"x"
"MAE",0.636059658390333
"RMSE",0.814405873704867
"COR",0.581863604936215
"R2",0.338565254749368
"x"
"RMSE",0.869309100173694
"R2",0.356594555638249
"MAE",0.653084184175849
"COR",0.597155386510286
"x"
"RMSE",0.869309100173694
"R2",0.356594555638249
"MAE",0.653084184175849
"COR",0.597155386510286
It would be nice if I could format this output into something like:
RMSE R2 MAE COR param1 param2
0.89 0.35 0.65 0.59 courseA Blackboost
0.89 0.35 0.65 0.59 courseB Blackboost
0.89 0.35 0.65 0.59 courseC Blackboost
0.89 0.35 0.65 0.59 courseD Blackboost
0.89 0.35 0.65 0.59 courseE Blackboost
0.89 0.35 0.65 0.59 courseA Rpart
0.89 0.35 0.65 0.59 courseB Rpart
0.89 0.35 0.65 0.59 courseC Rpart
0.89 0.35 0.65 0.59 courseD Rpart
0.89 0.35 0.65 0.59 courseE Rpart
I dont know what is "x" and where is it coming from, I guess I don't have the column name mentioned therefore it prints default as "x"?
I have this code in a function so I am passing two parameters one is the method and another is the target field. I would like to print those while appending it to a CSV file.
If I type dput(method.metric)
I get output as:
structure(c(0.869309100173694, 0.356594555638249, 0.653084184175849,
0.597155386510286), .Names = c("RMSE", "R2", "MAE", "COR"))
I already tried using the code write.csv(method.metric, file ="metric.csv", row.names=FALSE, eol=",", append=T) but it did not help much.
I will try to work on what you said formatting in R using cbind and other functions. If I get the output in above format, I will be able to create graphs with ease as I have lot of predictive model results being output.
I want to save the following output I get in the R console into a csv or txt file.
Discordancy measures (critical value 3.00)
0.17 3.40 1.38 0.90 1.62 0.13 0.15 1.69 0.34 0.39 0.36 0.68 0.39
0.54 0.70 0.70 0.79 2.08 1.14 1.23 0.60 2.00 1.81 0.77 0.35 0.15
1.55 0.78 2.87 0.34
Heterogeneity measures (based on 100 simulations)
30.86 14.23 3.75
Goodness-of-fit measures (based on 100 simulations)
glo gev gno pe3 gpa
-3.72 -12.81 -19.80 -32.06 -37.66
This is the outcome I get when I run the following
Heter<-regtst(regsamlmu(-extremes), nsim=100)
where Heter is a list (i.e., is.list(Heter) returns TRUE)
You could use capture.output:
capture.output(regtst(regsamlmu(-extremes), nsim=100), file="myoutput.txt")
Or for capturing output coming from several consequential commands:
sink("myfile.txt")
#
# [commands generating desired output]
#
sink()
You could make a character vector which you write to a file. Each entry in the vector will be separated by a newline character.
out <- capture.output(regtst(regsamlmu(-extremes), nsim=100))
write(out, "output.txt", sep="\n")
If you would like to add more lines just do something like c(out, "hello Kostas")
I always transpose by using t(file) command in R.
But i it is not running properly (not running at all) on big data file (250,000 rows and 200 columns). Any ideas.
I need to calculate correlation between 2nd row (PTBP1) with all other rows (except 8 rows including header). In order to do this I transpose rows to columns and then use cor function.
But I struck at transpose fn. Any help would be really appreciated!
I copied example from one of the post in stackoverflow (They are also almost discussing the same problem but seems no answer yet!)
ID A B C D E F G H I [200 columns]
Row0$-1 0.08 0.47 0.94 0.33 0.08 0.93 0.72 0.51 0.55
Row02$1 0.37 0.87 0.72 0.96 0.20 0.55 0.35 0.73 0.44
Row03$ 0.19 0.71 0.52 0.73 0.03 0.18 0.13 0.13 0.30
Row04$- 0.08 0.77 0.89 0.12 0.39 0.18 0.74 0.61 0.57
Row05$- 0.09 0.60 0.73 0.65 0.43 0.21 0.27 0.52 0.60
Row06-$ 0.60 0.54 0.70 0.56 0.49 0.94 0.23 0.80 0.63
Row07$- 0.02 0.33 0.05 0.90 0.48 0.47 0.51 0.36 0.26
Row08$_ 0.34 0.96 0.37 0.06 0.20 0.14 0.84 0.28 0.47
........
250,000 rows
Use a matrix instead. The only advantage of a dataframe over a matrix is the capacity to have different classes in the columns and you clearly do not have that situation, since a transposed dataframe could not support such a result.
I don't get why you want to transpose the data.frame. If you just use cor it doesn't matter if your data is in rows or columns.
Actually, it is one of the major advantages of R that it doen's matter if your data fits in the classical row-column pattern as SPSS and others programs require data to be.
There are numerous ways to correlate the first row with all other rows (I don't get which rows you want to exclude). One is using a loop (here the loop is implicit in the call to one of the *apply family functions):
lapply(2:(dim(fn)[1]), function(x) cor(fn[1,],fn[x,]))
Note that I expect you data.frame to ba called fn. To skip some rows change the 2 to the number you want. Furthermore, I would probably use vapply here.
I hope this answer points you in the correct direction and that is to not use t() if you absolutely don't need it.