First of all, I'd like to say that I'm completely new to R, and I'm just trying to accomplish this one task.
So, what I'm trying to do is create a network diagram from a weighted matrix. I made an example:
The CSV is a simple correlation matrix that looks like this:
,A,B,C,D,E,F,G
A,1,0.9,0.64,0.43,0.38,0.33,0.33
B,0.9,1,0.64,0.33,0.43,0.38,0.38
C,0.64,0.64,1,0.59,0.69,0.64,0.64
D,0.43,0.33,0.59,1,0.28,0.23,0.28
E,0.38,0.43,0.69,0.28,1,0.95,0.9
F,0.33,0.38,0.64,0.23,0.95,1,0.9
G,0.33,0.38,0.64,0.28,0.9,0.9,1
I tried to draw the wanted result by myself and came up with this:
To be more precise, I drew the diagram first; then, using a ruler, I took note of the distances, derived the weights from them with an equation, and made the CSV table.
The higher the value is, the closer the two points are to each other.
However, whatever I do, the best result I get is this:
And this is how I'm trying to accomplish it, using this tutorial:
First of all, I import my matrix:
> matrix <- read.csv(file = 'test_dataset.csv')
But after printing the matrix out with head(), the last row already somehow seems to be cut off:
> head(matrix)
ï.. A B C D E F G
1 A 1.00 0.90 0.64 0.43 0.38 0.33 0.33
2 B 0.90 1.00 0.64 0.33 0.43 0.38 0.38
3 C 0.64 0.64 1.00 0.59 0.69 0.64 0.64
4 D 0.43 0.33 0.59 1.00 0.28 0.23 0.28
5 E 0.38 0.43 0.69 0.28 1.00 0.95 0.90
6 F 0.33 0.38 0.64 0.23 0.95 1.00 0.90
> dim(matrix)
[1] 7 8
I then proceed with removing the first column so the matrix is square again...
> matrix <- data.matrix(matrix)[,-1]
> head(matrix)
A B C D E F G
[1,] 1.00 0.90 0.64 0.43 0.38 0.33 0.33
[2,] 0.90 1.00 0.64 0.33 0.43 0.38 0.38
[3,] 0.64 0.64 1.00 0.59 0.69 0.64 0.64
[4,] 0.43 0.33 0.59 1.00 0.28 0.23 0.28
[5,] 0.38 0.43 0.69 0.28 1.00 0.95 0.90
[6,] 0.33 0.38 0.64 0.23 0.95 1.00 0.90
> dim(matrix)
[1] 7 7
Then I create the graph with igraph and try to plot it:
> library(igraph)
> network <- graph_from_adjacency_matrix(matrix, weighted=TRUE, mode="undirected", diag=FALSE)
> plot(network)
And the result above appears...
So, after spending the last few hours googling and trying way, way more things, this is the closest I've been able to get.
So I'm asking for your help, thank you very much!
This is all fine.
head() just prints the first 6 rows of a matrix or data frame; nothing has been cut. If you want to see all of it, use print(), type the name of the matrix variable, or pass n, e.g. head(matrix, n = 10).
graph_from_adjacency_matrix produces a link between two nodes if the value is non-zero. That's why you are getting every node linked to every other node.
To get what that tutorial is doing you need to add a line like
matrix[matrix<0.5] <- 0
to remove the edges for correlations below a cut off before you create the graph.
It's still not going to produce a chart like your hand-drawn one (where distance roughly reflects correlation); it will just clump nodes together when their correlation is above 0.5.
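Putting the whole thing together, a minimal sketch (assuming the file is named test_dataset.csv as in your question, and a 0.5 cut-off):

library(igraph)

# row.names = 1 uses the first column as row names, so no manual column
# removal is needed; fileEncoding = "UTF-8-BOM" also avoids the stray
# "ï.." column name you saw.
m <- as.matrix(read.csv("test_dataset.csv", row.names = 1,
                        fileEncoding = "UTF-8-BOM"))

# Drop weak correlations so only strong associations become edges.
m[m < 0.5] <- 0

network <- graph_from_adjacency_matrix(m, weighted = TRUE,
                                       mode = "undirected", diag = FALSE)

# layout_with_fr treats edge weights as attraction, so more strongly
# correlated nodes tend to end up closer together.
plot(network, layout = layout_with_fr(network))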
I was trying to find associations between the top 10 most frequent words and the rest of the frequent words in the input text.
When I look at the individual output of findAssocs():
findAssocs(dtm, "good", corlimit=0.4)
The output clearly shows the word 'good' for which the associations were sought:
$good
better got hook next content fit person
0.44 0.44 0.44 0.44 0.43 0.43 0.43
But when I try to automate this process for a character vector having top 10 words:
t10 <- c("busi", "entertain", "topic", "interact", "track", "content", "paper", "media", "game", "good")
the output is a list of correlations for each of those elements, but without the word for which the associations were sought. The sample output is below (please notice that the word at t10[i] is not printed, unlike the individual output above where 'good' was clearly shown):
t10_words <- vector("list", 10)  # pre-allocate the result list
for(i in 1:10) {
  t10_words[i] <- as.list(findAssocs(dtm, t10[i], corlimit=0.4))
}
> t10_words
[[1]]
littl descript disrupt enter model
0.50 0.48 0.48 0.48 0.48
[[2]]
immers anyth effect full holodeck iot problem say startrek such suspect wow
0.68 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48
[[3]]
area captur give overal like alon avid begin
0.51 0.47 0.47 0.47 0.44 0.43 0.43 0.43
circuit cloud collaboration communic communiti concis confus defin
0.43 0.43 0.43 0.43 0.43 0.43 0.43 0.43
discord doesnt drop enablesupport esport event everi everyon
0.43 0.43 0.43 0.43 0.43 0.43 0.43 0.43
How do I print the output along with the actual association word?
Can somebody please help me with this?
Thanks.
After running your for loop, add the following piece of code:
names(t10_words) <- t10
This will name the lists with the words specified in t10.
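Alternatively, if I recall the tm API correctly, findAssocs() also accepts a character vector of terms and returns a list already named by those terms, so the loop can be replaced by a single call:

t10_words <- findAssocs(dtm, t10, corlimit = 0.4)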
I am having trouble interpolating the values of two data series. The first column holds a reference time. The second column holds the times at which the values of P130 (third column) were recorded. I want to interpolate new values of P130 at the reference times (the results column).
The reference time and timeP130 share the same first and last values, and both advance in variable steps, so there is no regular pattern.
Reference_time timeP130 P130 results
0.0001 0.0001 0.2194 0.2194
0.000694 0.003 0.25 0.22552
0.00138889 0.0035 0.26 0.23164
0.00208333 0.006 0.24 0.23776
0.00277778 0.009 0.245 0.24388
0.003 0.009 0.255 0.25
0.00416667 0.0125 0.27 ETC
0.00486111 0.015 0.21
0.00555556 0.018 0.20
0.00625 0.0208 0.2194
0.00694444 0.021 0.2194
0.00763889 0.0211 0.2194
0.00833333 0.0215 0.2194
0.00902778 0.022 0.2195
0.00972222 0.0327 0.2591
0.0104167 0.0433 0.3664
0.0111111 0.0839 0.4068
0.0118056 2.5 0.4087
0.0125 0.27
0.0141944
0.0158889
0.0165833
0.0182778
2.5 0.4087
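Base R's approx() performs exactly this kind of piecewise-linear interpolation. A minimal sketch, assuming the columns are available as vectors named ref_time, time_p130 and p130 (the names are mine):

# Known points are (time_p130, p130); evaluate the interpolant at the
# reference times. Because the first and last times coincide, every
# reference time falls inside the known range.
results <- approx(x = time_p130, y = p130, xout = ref_time)$y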
I want to save the following output I get in the R console into a csv or txt file.
Discordancy measures (critical value 3.00)
0.17 3.40 1.38 0.90 1.62 0.13 0.15 1.69 0.34 0.39 0.36 0.68 0.39
0.54 0.70 0.70 0.79 2.08 1.14 1.23 0.60 2.00 1.81 0.77 0.35 0.15
1.55 0.78 2.87 0.34
Heterogeneity measures (based on 100 simulations)
30.86 14.23 3.75
Goodness-of-fit measures (based on 100 simulations)
glo gev gno pe3 gpa
-3.72 -12.81 -19.80 -32.06 -37.66
This is the outcome I get when I run the following:
Heter<-regtst(regsamlmu(-extremes), nsim=100)
where Heter is a list (i.e., is.list(Heter) returns TRUE)
You could use capture.output:
capture.output(regtst(regsamlmu(-extremes), nsim=100), file="myoutput.txt")
Or for capturing output coming from several consecutive commands:
sink("myfile.txt")
#
# [commands generating desired output]
#
sink()
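For this particular output the sink() version could look like this (print() makes the printing explicit, since auto-printing can be suppressed, for example inside functions or loops):

sink("myfile.txt")
print(regtst(regsamlmu(-extremes), nsim=100))
sink()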
You could make a character vector which you write to a file. Each entry in the vector will be separated by a newline character.
out <- capture.output(regtst(regsamlmu(-extremes), nsim=100))
write(out, "output.txt", sep="\n")
If you would like to add more lines, just do something like c(out, "hello Kostas").
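writeLines() is a close base-R alternative that writes one element of the character vector per line:

writeLines(out, "output.txt")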
The product of one simulation is a large data.frame, with fixed columns and rows. I ran several hundreds of simulations, with each result stored in a separate RData file (for efficient reading).
Now I want to gather all those files together and compute statistics for each field of this data.frame in the "cells" structure, which is basically a nested list of vectors (one vector of length simcount per cell). This is how I do it:
#colscount, rowscount - number of columns and rows from each simulation
#simcount - number of simulations
#colnames - names of the columns of the simulation's data frame
#simfilenames - vector of filenames, one per simulation
cells<-as.list(rep(NA, colscount))
for(i in 1:colscount)
{
cells[[i]]<-as.list(rep(NA,rowscount))
for(j in 1:rowscount)
{
cells[[i]][[j]]<-rep(NA,simcount)
}
}
names(cells)<-colnames
addcells<-function(simnr)
# This function reads and appends simdata to "simnr" position in each cell in the "cells" structure
{
simdata <- readRDS(simfilenames[[simnr]])
for(i in 1:colscount)
{
for(j in 1:rowscount)
{
if (!is.na(simdata[j,i]))
{
cells[[i]][[j]][simnr]<-simdata[j,i]
}
}
}
}
library(plyr)
a_ply(1:simcount,1,addcells)
The problem is the huge difference in timing. Reading one simulation file takes:
> system.time(dane<-readRDS(path.cat(args$rdatapath,pliki[[simnr]]))$dane)
user system elapsed
0.088 0.004 0.093
While
> system.time(addcells(1))
user system elapsed
147.328 0.296 147.644
I would expect both commands to have comparable execution times (or at least the latter to be at most 10 times slower). I guess I am doing something very inefficient there, but what? The whole cells data structure is rather big; it takes around 1 GB of memory.
I need to transpose the data in this way because later I compute many descriptive statistics on the results (means, sd, quantiles, and maybe histograms), so it is important that the data for each cell is stored as a (one-dimensional) vector.
Here is profiling output:
> summaryRprof('/tmp/temp/rprof.out')
$by.self
self.time self.pct total.time total.pct
"[.data.frame" 71.98 47.20 129.52 84.93
"names" 11.98 7.86 11.98 7.86
"length" 10.84 7.11 10.84 7.11
"addcells" 10.66 6.99 151.52 99.36
".subset" 10.62 6.96 10.62 6.96
"[" 9.68 6.35 139.20 91.28
"match" 6.06 3.97 11.36 7.45
"sys.call" 4.68 3.07 4.68 3.07
"%in%" 4.50 2.95 15.86 10.40
"all" 4.28 2.81 4.28 2.81
"==" 2.34 1.53 2.34 1.53
".subset2" 1.28 0.84 1.28 0.84
"is.na" 1.06 0.70 1.06 0.70
"nargs" 0.62 0.41 0.62 0.41
"gc" 0.54 0.35 0.54 0.35
"!" 0.42 0.28 0.42 0.28
"dim" 0.34 0.22 0.34 0.22
".Call" 0.12 0.08 0.12 0.08
"readRDS" 0.10 0.07 0.12 0.08
"cat" 0.10 0.07 0.10 0.07
"readLines" 0.04 0.03 0.04 0.03
"strsplit" 0.04 0.03 0.04 0.03
"addParaBreaks" 0.02 0.01 0.04 0.03
It looks like indexing the list structure takes a lot of time. But I can't make it an array, because not all cells are numeric, and R doesn't easily support hash maps...
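For what it's worth, the profile points at [.data.frame: simdata[j,i] dispatches the expensive data-frame subscript method once per cell. A minimal sketch of a faster fill, extracting each column as an atomic vector first (note it also needs <<-, since the original addcells only modifies a local copy of cells, so its changes are silently discarded):

addcells_fast <- function(simnr)
{
  simdata <- readRDS(simfilenames[[simnr]])
  for(i in 1:colscount)
  {
    # one cheap column extraction instead of rowscount data.frame subscripts
    col <- simdata[[i]]
    for(j in which(!is.na(col)))
    {
      cells[[i]][[j]][simnr] <<- col[j]
    }
  }
}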