Generating an adjacency table with R

I generated an adjacency table, mytable, using cosine similarity; m1 is a document-term matrix (DTM):
cosineSim <- function(x){
as.dist(x%*%t(x)/(sqrt(rowSums(x^2) %*% t(rowSums(x^2)))))
}
cs <- cosineSim(m1)
mytable
"";"1";"2";"3";"4";"5";"6";"7";"8"
"1";0;0;0;0;0;0;0;0
"2";0;0;0;0;0;0;0;0
"3";0;0;0;0.259;0;0;0;0
"4";0;0;0;0;0;0;0;0.324
"5";0;0;0;0;0;0;0;0
"6";0;0;0;0;0;0;0;0
"7";0;0;0;0;0;0;0;0
"8";0;0;0;0;0;0;0;0
When I open it with Gephi, I find that the nodes include all the numbers in the table
Id label
" "
1" 1"
2" 2"
3" 3"
4" 4"
5" 5"
6" 6"
7" 7"
8 8
0 0
0.259 0.259
0.324 0.324
8" 8"
I expected the nodes to include only 1-8 as ids, not "", 0, and the other numbers. Is there something wrong with my adjacency table?

Remove the double quotes and try to reimport. Since you are using R, I would suggest automating your pipeline with igraph, in your case with graph_from_adjacency_matrix. You then need to export the graph as GraphML, which Gephi can easily read.
Here is some example code for the sake of completeness:
library(igraph)
t <- ';1;2;3;4;5;6;7;8
1;0;0;0;0;0;0;0;0
2;0;0;0;0;0;0;0;0
3;0;0;0;0.259;0;0;0;0
4;0;0;0;0;0;0;0;0.324
5;0;0;0;0;0;0;0;0
6;0;0;0;0;0;0;0;0
7;0;0;0;0;0;0;0;0
8;0;0;0;0;0;0;0;0'
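# read the semicolon-separated text; the first column holds the row names
f <- read.csv(textConnection(t), sep = ";", header = TRUE, row.names = 1)
m <- as.matrix(f, rownames.force = TRUE)
# read.csv prefixes numeric column names with "X", so reset both dimnames
colnames(m) <- seq_len(nrow(f))
rownames(m) <- seq_len(nrow(f))
graph <- graph_from_adjacency_matrix(m, mode = "directed", weighted = TRUE)
# write_graph() is the current name for write.graph()
write_graph(graph, "mygraph.graphml", format = "graphml")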
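If you would rather keep the CSV route, the quotes can be left out at the point where the table is written. The question does not show the export step, so this is only a sketch under the assumption that write.table is used; the small matrix m below is made up and stands in for something like as.matrix(cs).

```r
# Made-up 4 x 4 adjacency matrix standing in for the real similarity table
m <- matrix(0, nrow = 4, ncol = 4,
            dimnames = list(as.character(1:4), as.character(1:4)))
m[1, 2] <- 0.259
m[2, 4] <- 0.324

# quote = FALSE leaves out the double quotes around the row and column names;
# col.names = NA writes an empty header cell above the row-name column
out_file <- tempfile(fileext = ".csv")
write.table(m, file = out_file, sep = ";", quote = FALSE, col.names = NA)

lines_written <- readLines(out_file)
print(lines_written)
```

The header line then starts with an empty field followed by the ids, which matches the layout of the table in the question minus the quotes.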


Sequence of numbers by hyphen without hyphenating single occurrences

I want to generate readable number sequences (e.g. 1, 2, 3, 4 = 1-4), but for a set of data where each number in the sequence must have four digits (e.g. 99 = 0099 or 1 = 0001 or 1022 = 1022) AND where there are different letters in front of each number.
I was looking at the answer to this question, which managed to do almost exactly as I want with two caveats:
If there is a stand-alone number that does not appear in a sequence, it will appear twice with a hyphen in between
If there are several stand-alone numbers that do not appear in a sequence, they won't be included in the result
### Create Data Set ====
## Create the data for different tags. I'm only using two unique levels here, but in my dataset I've got
## 400+ unique levels.
FM <- paste0('FM', c('0001', '0016', '0017', '0018', '0019', '0021', '0024', '0026', '0028'))
SC <- paste0('SC', c('0002', '0003', '0004', '0010', '0012', '0014', '0033', '0036', '0039'))
## Combine data
my.seq1 <- c(FM, SC)
## Sort data by number in sequence
my.seq1 <- my.seq1[order(substr(my.seq1, 3, 7))]
### Attempt Number Sequencing ====
## Get the letters
sp.tags <- substr(my.seq1, 1, 2)
## Get the readable number sequence
my.seq2 <- lapply(split(my.seq1, sp.tags), ## Split data by the tag ID
function(x){
## Get the run lengths as per the previous answer
rl <- rle(c(1, pmin(diff(as.numeric(substr(x, 3, 7))), 2)))
## Generate number sequence by separator as per the previous answer
seq2 <- paste0(x[c(1, cumsum(rl$lengths))], c("-", ",")[rl$values], collapse="")
return(substr(seq2, 1, nchar(seq2)-1))
})
## Combine lists and sort elements
my.seq2 <- unlist(strsplit(do.call(c, my.seq2), ","))
my.seq2 <- my.seq2[order(substr(my.seq2, 3, 7))]
names(my.seq2) <- NULL
my.seq2
[1] "FM0001-FM0001" "SC0002-SC0004" "FM0016-FM0019" "FM0028" "SC0039"
my.seq1
[1] "FM0001" "SC0002" "SC0003" "SC0004" "SC0010" "SC0012" "SC0014" "FM0016" "FM0017" "FM0018" "FM0019" "FM0021"
[13] "FM0024" "FM0026" "FM0028" "SC0033" "SC0036" "SC0039"
The major problems with this are:
Some values are completely missing from the data set (e.g. FM0021, FM0024, FM0026)
The first number in the sequence (FM0001) appears with a hyphen in between
I feel like I'm getting warmer by using A5C1D2H2I1M1N2O1R2T1's answer to utilize seqToHumanReadable because it's quite elegant AND solves both problems. Two more problems are that I'm not able to tag the ID before each number and can't force the number of digits to four (e.g. 0004 becomes 4).
library(R.utils)
lapply(split(my.seq1, sp.tags), function(x){
return(unlist(strsplit(seqToHumanReadable(substr(x, 3, 7)), ',')))
})
$FM
[1] "1" " 16-19" " 21" " 24" " 26" " 28"
$SC
[1] "2-4" " 10" " 12" " 14" " 33" " 36" " 39"
Ideally the result would be:
"FM0001, SC0002-SC0004, SC0010, SC0012, SC0014, FM0016-FM0019, FM0021, FM0024, FM0026, FM0028, SC0033, SC0036, SC0039"
Any ideas? It's one of those things that's really simple to do by hand but would take blinking ages, and you'd think a function would exist for it but I haven't found it yet or it doesn't exist :(
This should do?
# get the prefix/tag and number
tag <- gsub("(^[A-Za-z]+)(.+)", "\\1", my.seq1)
num <- gsub("([A-Za-z]+)(\\d+$)", "\\2", my.seq1)
# get a sequence id
n <- length(tag)
do_match <- c(FALSE, diff(as.numeric(num)) == 1 & tag[-1] == tag[-n])
seq_id <- cumsum(!do_match) # a sequence id
# tapply to combine the result
res <- setNames(tapply(my.seq1, seq_id, function(x)
if(length(x) < 2)
return(x)
else
paste(x[1], x[length(x)], sep = "-")), NULL)
# show the result
res
#R> [1] "FM0001" "SC0002-SC0004" "SC0010" "SC0012" "SC0014" "FM0016-FM0019" "FM0021"
#R> [8] "FM0024" "FM0026" "FM0028" "SC0033" "SC0036" "SC0039"
# compare with
my.seq1
#R> [1] "FM0001" "SC0002" "SC0003" "SC0004" "SC0010" "SC0012" "SC0014" "FM0016" "FM0017" "FM0018" "FM0019" "FM0021" "FM0024"
#R> [14] "FM0026" "FM0028" "SC0033" "SC0036" "SC0039"
Data
FM <- paste0('FM', c('0001', '0016', '0017', '0018', '0019', '0021', '0024', '0026', '0028'))
SC <- paste0('SC', c('0002', '0003', '0004', '0010', '0012', '0014', '0033', '0036', '0039'))
my.seq1 <- c(FM, SC)
my.seq1 <- my.seq1[order(substr(my.seq1, 3, 7))]
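If you want the single comma-separated string from the question rather than a vector, the same steps can be wrapped in a function and pasted together at the end. This is only a sketch: collapse_runs is a made-up name, and it assumes the ids are already sorted and consist of letters followed by digits.

```r
# Hypothetical helper: collapse consecutive ids with the same tag into "first-last"
collapse_runs <- function(ids) {
  tag <- sub("[0-9]+$", "", ids)                  # leading letters
  num <- as.numeric(sub("^[A-Za-z]+", "", ids))   # trailing digits as numbers
  n <- length(ids)
  # TRUE where an id continues the run started by the previous id
  continues <- c(FALSE, diff(num) == 1 & tag[-1] == tag[-n])
  run_id <- cumsum(!continues)
  runs <- tapply(ids, run_id, function(x) {
    if (length(x) < 2) x else paste(x[1], x[length(x)], sep = "-")
  })
  paste(unname(runs), collapse = ", ")
}

ids <- c("FM0001", "SC0002", "SC0003", "SC0004", "SC0010",
         "FM0016", "FM0017", "FM0021")
collapsed <- collapse_runs(ids)
print(collapsed)
```

Because the original strings are pasted back unchanged, the tag and the four-digit padding are kept without any extra formatting.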

Format function output as a customised multiple lines string

I'm trying to make a function which gives output with simple format.
If I have already calculated the estimated values of the betas, how can I produce the following output format?
Coefficients
-------------
Constant: 5.2
Beta1: 4
Beta2: 9
Beta3: 2
.
.
.
I tried the cat() function, but with cat() I have to write every line manually, like:
cat("Coefficients","\n","-------------","\n","Constant: 5.2","\n","Beta1: 4",....)
Is there any way to make that simple result format?
If you have a vector of 10 results and you want to label them Beta1 to Beta10 you could do:
result = 10:1
b_order = 1:10
paste0("beta", b_order, ": ", result)
This gives:
[1] "beta1: 10" "beta2: 9" "beta3: 8" "beta4: 7" "beta5: 6" "beta6: 5" "beta7: 4" "beta8: 3" "beta9: 2" "beta10: 1"
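To get from that vector to the multi-line layout in the question, the lines can be handed to cat() with sep = "\n", so nothing has to be typed out by hand. A minimal sketch, assuming the first estimate is the constant; format_coefficients is a made-up name and the numbers are the ones from the question.

```r
# Hypothetical helper: build the report lines from a vector of estimates,
# treating the first element as the constant
format_coefficients <- function(estimates) {
  labels <- c("Constant", sprintf("Beta%d", seq_len(length(estimates) - 1)))
  c("Coefficients", "-------------", paste0(labels, ": ", estimates))
}

report <- format_coefficients(c(5.2, 4, 9, 2))
# sep = "\n" prints one element per line
cat(report, sep = "\n")
```

Returning the lines as a character vector keeps the formatting separate from the printing, so the same lines could also go to writeLines() or a file.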

R Error when using beside=TRUE parameter

I am plotting a graph with barplot(), and any attempt to use the beside=TRUE parameter returns the error "Error in -0.01 * height : non-numeric argument to binary operator".
The following is the code for the graph:
combi <- as.matrix(combine)
barplot(combi, main="Top 5 hospitals in California",
ylab="Mortality/Admission Rates", col = heat.colors(5), las=1)
The output of the graph is that the bars are stacked on each other instead of being beside each other.
The issue is not reproducible when combine is a data.frame:
combine <- data.frame(
HeartAttack = c(13.4,12.3,16,13,15.2),
HeartFailure = c(11.1,7.3,10.7,8.9,10.8),
Pneumonia = c(11.8,6.8,10,9.9,9.5),
HeartAttack2 = c(18.3,19.3,21.8,21.6,17.3),
HeartFailure2 = c(24,23.3,24.2,23.8,24.6),
Pneumonia2 = c(17.4,19,17,18.4,18.2)
)
combi <- as.matrix(combine)
barplot(combi, main="Top 5 hospitals in California",
ylab="Mortality/Admission Rates", col = heat.colors(5), las=1, beside = TRUE)
I had the same issue earlier (with a different dataset, though) and resolved it by calling as.numeric() on my data after converting it to a matrix with as.matrix(). Leaving as.numeric() out leads to "Error in -0.01 * height : non-numeric argument to binary operator".
¯\(ツ)/¯
My df called tmp:
> tmp
125 1245 1252 1254 1525 1545 12125 12425 12525 12545 125245 125425
Freq.x.2d "14" " 1" " 1" " 1" " 3" " 2" " 1" " 1" " 9" " 4" " 1" " 5"
Freq.x.3d "13" " 0" " 1" " 0" " 4" " 0" " 0" " 0" "14" " 4" " 1" " 2"
> dim(tmp)
[1] 2 28
> is(tmp)
[1] "matrix" "array" "structure" "vector"
> tmp <- as.matrix(tmp)
> dim(tmp)
[1] 2 28
> is(tmp)
[1] "matrix" "array" "structure" "vector"
> tmp <- as.numeric(tmp)
> dim(tmp)
NULL
> is(tmp)
[1] "numeric" "vector"
barplot(tmp, las=2, beside=TRUE, col=c("grey40","grey80"))
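One thing to watch: as.numeric() drops the dim attribute (dim(tmp) is NULL in the session above), so barplot() gets a plain vector and there are no groups left to put beside each other. A sketch of an alternative, using a small made-up character matrix in the shape of tmp: changing the storage mode converts the values but keeps the matrix shape and the dimnames.

```r
# Made-up character matrix mimicking the structure of tmp above
tmp <- matrix(c("14", "13", " 1", " 0", " 3", " 4"), nrow = 2,
              dimnames = list(c("Freq.x.2d", "Freq.x.3d"),
                              c("125", "1245", "1525")))

# Convert the values to numbers in place; dim and dimnames are preserved
tmp_num <- tmp
storage.mode(tmp_num) <- "double"

# With a numeric matrix, beside = TRUE draws the two rows side by side
bar_pos <- barplot(tmp_num, las = 2, beside = TRUE, col = c("grey40", "grey80"))
```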

how to sort out a nested list in R

The original data was a simple list named "data" like this
[1] "score: 10 / review 1 / ID 1"
[2] "score: 9 / review 2 / ID 2"
[3] "score: 8 / review 3 / ID 3"
----
[30] "score: 7 / review 30 / ID&DATE: 30"
To separate the scores, reviews, and ID&DATEs, I first made it a matrix and then split each entry on "/" using str_split from stringr.
The whole process went like this:
library(stringr)
a1 <- readLines("data.txt")
a2 <- t(a1) # 1-row matrix
a3 <- t(a2) # transposed into a 1-column matrix
b1 <- str_split(a3, "/")
Here is the problem: b1 came out as a nested list like this.
[[1]]
[1] "score: 10"
[2] "review 1"
[3] "ID 1"
[[2]]
[1] "score: 9"
[2] "review 2"
[3] "ID 2"
[[3]]
[1] "score: 8"
[2] "review 3"
[3] "ID 3"
------
[[30]]
[1] "score: 7"
[2] "review 30"
[3] "ID 30"
I want to extract the values of [[1]][1], [[2]][1], [[3]][1], ... [[30]][1], [[n]][2], and [[n]][3] SEPARATELY, and make each one of them a dataframe.
Any clues?
The following would work for a particular type of nested list that looks like your data. Without a reproducible example, I don't know for sure:
# create nested list
temp <- list(a=c(list("score: 10"), "review 1", "ID 1"),
b=c("score: 9", "review 2", "ID 2"),
c=c("score: 8", "review 3","ID 3"))
# create data frame from this list
df <- data.frame(score=unlist(sapply(temp, function(i) i[1])),
review=unlist(sapply(temp, function(i) i[2])),
ID=unlist(sapply(temp, function(i) i[3])))
I use sapply to pull out elements from each list item. Then, unlist is applied to the output so that it becomes a vector. All of this output is wrapped in a data.frame. Note that you can rearrange the output so that the variables are arranged differently.
An even cleaner method, mentioned by @parfait, uses do.call and rbind:
# construct data.frame, rbinding each list item
df <- data.frame(do.call(rbind, temp))
# add the desired names
names(df) <- c('score', 'review', 'ID')
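For lines shaped like the ones in the question, the same do.call/rbind idea can be applied directly to the split strings. This is a sketch in base R only (strsplit instead of str_split), and it assumes every line has exactly three fields separated by " / "; the reviews vector is made up from the sample shown.

```r
# Made-up sample in the format shown in the question
reviews <- c("score: 10 / review 1 / ID 1",
             "score: 9 / review 2 / ID 2",
             "score: 8 / review 3 / ID 3")

# Split each line on the literal separator, then stack the pieces row by row
parts <- strsplit(reviews, " / ", fixed = TRUE)
fields <- do.call(rbind, parts)

df <- data.frame(score = fields[, 1], review = fields[, 2], ID = fields[, 3],
                 stringsAsFactors = FALSE)

# Single-bracket indexing by name keeps each column as its own data frame
score_df <- df["score"]
review_df <- df["review"]
id_df <- df["ID"]
```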

fuse some information in a vector

This may be obvious, but I can't seem to see it.
I have a vector like this:
vec<-c("i: 1","n: alpha","a: term1","a: term2", "i: 2","n: beta","a: term3","i: 3","n: gamma","a: term4","a: term5","a: term6")
and I need to get this:
out<-c("i: 1","n: alpha","a: term1;term2", "i: 2","n: beta","a: term3","i: 3","n: gamma","a: term4;term5;term6")
That is, for each unique i:, fuse the a: when there are more than one.
I tried with diff and rle, but the resulting code (see below) is too long and I think I'm needlessly complicating the problem.
My code:
out<-vec
a<-which(grepl("^a: ",vec))
diffa<-diff(a)
diffa1<-which(diffa==1)
rle_a<-rle(diffa)$lengths[rle(diffa)$values==1]
indwh<-1
for(ind in 1:length(rle_a)){
allindwh<-indwh:(indwh+rle_a[ind]-1)
out[a[c(diffa1[allindwh],diffa1[allindwh[length(allindwh)]]+1)]]<-paste(out[a[diffa1[allindwh[1]]]],paste(gsub("a: ","",out[a[c(diffa1[allindwh[-1]],diffa1[allindwh[length(allindwh)]]+1)]]),collapse=";"),sep=";")
indwh<-indwh+rle_a[ind]
}
out<-unique(out)
So I get what I want but I would really appreciate any hint to simplify it.
Here's an easier approach with tapply:
# index of 'a's
idx <- grepl("^a", vec)
# find groups
grp <- c(0, cumsum(diff(idx) < 0))
# apply function to vector based on groups
unlist(tapply(vec, grp, FUN = function(x)
c(x[1:2], paste("a:", paste(sub("^a:\\s*", "", x[-(1:2)]), collapse = ";")))),
use.names = FALSE)
# [1] "i: 1" "n: alpha" "a: term1;term2"
# [4] "i: 2" "n: beta" "a: term3"
# [7] "i: 3" "n: gamma" "a: term4;term5;term6"
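The grouping above relies on each record having exactly two lines before its a: entries. If that is not guaranteed, a variant that starts a new record at every i: line might be safer. A sketch only: fuse_terms is a made-up name, and it assumes the a: lines come last within each record.

```r
# Hypothetical helper: start a new record at each "i: " line and fuse its "a: " lines
fuse_terms <- function(vec) {
  record <- cumsum(grepl("^i: ", vec))
  fused <- lapply(split(vec, record), function(x) {
    is_a <- grepl("^a: ", x)
    # keep the other lines as they are, then append one combined "a: " line
    c(x[!is_a],
      if (any(is_a)) paste0("a: ", paste(sub("^a: ", "", x[is_a]), collapse = ";")))
  })
  unlist(fused, use.names = FALSE)
}

vec <- c("i: 1", "n: alpha", "a: term1", "a: term2", "i: 2", "n: beta",
         "a: term3", "i: 3", "n: gamma", "a: term4", "a: term5", "a: term6")
fused_vec <- fuse_terms(vec)
print(fused_vec)
```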
