R network graphs - r

I have data where the X column identifies the review and the remaining columns are words that most reviews contain. Is it possible to create a graph where the nodes are reviews and the edges are words?
X action age ago amazing american art author back bad beautiful beginning
1 1 1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 1 0 0
4 1 3 0 0 0 2 1 0 0 0 0
5 0 0 1 0 1 0 0 2 0 1 0
Another idea is to cluster the reviews in the graph according to the words used and their frequency.
Thank you very much. Any help is appreciated.

Here are three approaches to explore the relationships in your data:
par(mfrow=c(1,3))
# two mode network (reviews+words)
library(igraph)
set.seed(1)
g <- graph_from_data_frame(subset(reshape2::melt(df, 1), !!value, -value)[2:1])
V(g)$type <- bipartite.mapping(g)$type
plot(g, layout = layout_as_bipartite(g)[, 2:1], vertex.color = V(g)$type+1L)
# just the reviews:
library(reshape2)
lst <- with(subset(melt(df, 1), !!value)[2:1], split(X, variable))
lst <- lst[lengths(lst)>1]
lst <- lapply(lst, function(x) t(combn(x, m=2)))
g <- graph_from_edgelist(do.call(rbind, lst), dir = F)
E(g)$label <- rep(names(lst), sapply(lst, nrow))
plot(g)
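# review clustering (binary distance between reviews, then hierarchical clustering)
library(magrittr) # provides %>%
df[-1] %>% dist(method="binary") %>% hclust %>% plot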
Output:
Data:
df <- read.table(header=T, text="
X action age ago amazing american art author back bad beautiful beginning
1 1 1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 1 0 0
4 1 3 0 0 0 2 1 0 0 0 0
5 0 0 1 0 1 0 0 2 0 1 0")
PS: There may be a shortcut to no. 2 (reviews as nodes & words as edges) - feel free to add it.

With your example data one could create a graph consisting of nodes (one per review) and edges (two reviews are connected when they use the same word). You could also weight the edges by how many words two reviews have in common, and use different edge shapes or colors to represent the different words.
There are several ways to create a graph from your data. First, you could create an adjacency matrix in which each row and each column represents a review. The adjacency matrix only records whether two reviews share a word: an entry is 1 if the two reviews share at least one word and 0 otherwise.
The adjacency matrix would look similar to this, where the letters denote the row and column labels:
Review A B C D
A 0 1 1 1
B 1 0 0 1
C 1 0 0 1
D 1 1 1 0
With the function graph_from_adjacency_matrix( ) in the igraph package you could then create a graph and use the plot functions.
Second, you could create a weight matrix, which counts how many words two reviews share. Using the same function with the weighted argument, graph_from_adjacency_matrix( , weighted=TRUE), you could create a graph from that matrix.
You can find a good introduction to network analysis with the igraph package here: http://kateto.net/networks-r-igraph
Review A B C D
A 0 2 3 1
B 2 0 0 2
C 3 0 0 2
D 1 2 2 0
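Both matrices can be derived from the word-count table directly. The following is a minimal sketch, assuming the data frame df from the question and treating any count above zero as "the review uses the word". The names present, weights and adjacency are illustrative. The igraph step only runs if the package is installed.

```r
# Example data from the question
df <- read.table(header = TRUE, text = "
X action age ago amazing american art author back bad beautiful beginning
1 1 1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 1 0 0
4 1 3 0 0 0 2 1 0 0 0 0
5 0 0 1 0 1 0 0 2 0 1 0")

# 0/1 matrix: does review i use word j at all?
present <- (as.matrix(df[-1]) > 0) + 0
rownames(present) <- df$X

# Weight matrix: entry (i, j) counts the words shared by reviews i and j
weights <- tcrossprod(present)
diag(weights) <- 0              # a review is not connected to itself

# Adjacency matrix: 1 if at least one word is shared, 0 otherwise
adjacency <- (weights > 0) + 0

# Optional: turn the weight matrix into an igraph object
if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::graph_from_adjacency_matrix(weights, mode = "undirected",
                                           weighted = TRUE)
}
```

With this sample only reviews 1 and 4 share words (action and age), so the graph has a single edge of weight 2.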
Third, you could specify the graph from an edges data frame and a nodes data frame.
The nodes data frame would contain a short id for each node and, optionally, its name and any other information you want to include about the nodes:
id long_review_name
R1 A
R2 B
R3 C
R4 D
The edges data frame collects all the information about the connections between two reviews. First and most important, it records every edge in the columns from and to. It could also hold the frequency as a weight on each edge, while type denotes which word the two nodes share:
from to weight type
R1 R2 1 american
R1 R2 1 age
R1 R3 2 american
R1 R3 1 age
R1 R4 1 age
R2 R4 2 american
To turn the edges and the node data frame into a graph you would need to use the command graph_from_data_frame(d=links, vertices=nodes).
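A sketch of how such nodes and links data frames could be built from the question's word-count table. The weight rule used here (the smaller of the two reviews' counts for the shared word) is an assumption made for illustration only, since the answer does not define it. The igraph call is skipped when the package is not installed.

```r
# Example data from the question
df <- read.table(header = TRUE, text = "
X action age ago amazing american art author back bad beautiful beginning
1 1 1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 1 0 0
4 1 3 0 0 0 2 1 0 0 0 0
5 0 0 1 0 1 0 0 2 0 1 0")

counts <- as.matrix(df[-1])
rownames(counts) <- df$X

# One block of edges per word: every pair of reviews that both use it
edges_per_word <- lapply(colnames(counts), function(w) {
  r <- rownames(counts)[counts[, w] > 0]
  if (length(r) < 2) return(NULL)        # word used by fewer than two reviews
  pairs <- t(combn(r, 2))
  data.frame(from = pairs[, 1], to = pairs[, 2],
             # assumed weight: the smaller of the two word counts
             weight = unname(pmin(counts[pairs[, 1], w], counts[pairs[, 2], w])),
             type = w)
})
links <- do.call(rbind, Filter(Negate(is.null), edges_per_word))
nodes <- data.frame(id = rownames(counts))

if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::graph_from_data_frame(d = links, vertices = nodes, directed = FALSE)
}
```

Here links has one row per shared word, so two reviews that share several words are joined by several parallel edges, each labelled by its type.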

Related

How to add multiple values to data.frame without loop?

Suppose I have matrix D which consists of death counts per year by specific ages.
I want to fill this matrix with appropriate death counts that is stored in
vector Age, but the following code gives me wrong answer. How should I write the code without making a loop?
# Year and age grid for tables
Years=c(2007:2017)
Ages=c(60:70)
#Data.frame of deaths
D=data.frame(matrix(ncol=length(Years),nrow=length(Ages))); D[is.na(D)]=0
colnames(D)=Years
rownames(D)=Ages
Age=c(60,61,62,65,65,65,68,69,60)
year=2010
D[as.character(Age),as.character(year)]<-
D[as.character(Age),as.character(year)]+1
D[,'2010'] # 1 1 1 0 0 1 0 0 1 1 0
# Should be 2 1 1 0 0 3 0 0 1 1 0
You need to use table
AgeTable = table(Age)
D[names(AgeTable), as.character(year)] = AgeTable
D[,'2010']
[1] 2 1 1 0 0 3 0 0 1 1 0
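The original attempt fails because an indexed assignment with repeated row names writes each row only once, so duplicates are never accumulated. A variant of the table approach, sketched here, counts against all age levels and adds the counts to whatever is already stored in the column (the names counts and col are illustrative):

```r
Years <- 2007:2017
Ages <- 60:70

# Data frame of deaths, initialised to zero
D <- data.frame(matrix(0, nrow = length(Ages), ncol = length(Years)))
colnames(D) <- Years
rownames(D) <- Ages

Age <- c(60, 61, 62, 65, 65, 65, 68, 69, 60)
year <- 2010

# Count every age in the grid, including ages with zero deaths
counts <- table(factor(Age, levels = Ages))

# Add the counts to the existing values instead of overwriting them
col <- as.character(year)
D[, col] <- D[, col] + as.vector(counts)

D[, "2010"]
```

Because the factor carries all levels of Ages, the count vector lines up with the rows of D without any name matching.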

Finding "similar" rows performing a conditional join with sqldf

Say I got a data.table (can also be data.frame, doesn't matter to me) which has numeric columns a, b, c, d and e.
Each row of the table represents an article and a-e are numeric characteristics of the articles.
What I want to find out is which articles are similar to each other, based on columns a, b and c.
I define "similar" by allowing a, b and c to vary +/- 1 at most.
That is, article x is similar to article y if neither a, b nor c differs by more than 1. Their values for d and e don't matter and may differ significantly.
I've already tried a couple of approaches but didn't get the desired result. What I want to achieve is to get a result table which contains only those rows that are similar to at least one other row. Plus, duplicates must be excluded.
Particularly, I'm wondering if this is possible using the sqldf library. My idea is to somehow join the table with itself under the given conditions, but I don't get it together properly. Any ideas (not necessarily using sqldf)?
Suppose our input data frame is the built-in 11x8 anscombe data frame. Its first three column names are x1, x2 and x3. Then here are some solutions.
1) sqldf This returns the pairs of row numbers of similar rows:
library(sqldf)
ans <- anscombe
ans$id <- 1:nrow(ans)
sqldf("select a.id, b.id
from ans a
join ans b on abs(a.x1 - b.x1) <= 1 and
abs(a.x2 - b.x2) <= 1 and
abs(a.x3 - b.x3) <= 1")
Append the condition "and a.id < b.id" to the join if a row should not be paired with itself and the reverse of each pair should be excluded, or append "and not a.id = b.id" to exclude only self pairs.
2) dist This returns a matrix m whose i,j-th element is 1 if rows i and j are similar and 0 if not based on columns 1, 2 and 3.
# matrix of pairs (1 = similar, 0 = not)
m <- (as.matrix(dist(anscombe[1:3], method = "maximum")) <= 1) + 0
giving:
1 2 3 4 5 6 7 8 9 10 11
1 1 0 0 1 1 0 0 0 0 0 0
2 0 1 0 1 0 0 0 0 0 1 0
3 0 0 1 0 0 1 0 0 1 0 0
4 1 1 0 1 0 0 0 0 0 0 0
5 1 0 0 0 1 0 0 0 1 0 0
6 0 0 1 0 0 1 0 0 0 0 0
7 0 0 0 0 0 0 1 0 0 1 1
8 0 0 0 0 0 0 0 1 0 0 1
9 0 0 1 0 1 0 0 0 1 0 0
10 0 1 0 0 0 0 1 0 0 1 0
11 0 0 0 0 0 0 1 1 0 0 1
We could add m[lower.tri(m, diag = TRUE)] <- 0 to exclude self pairs and the reverse of each pair if desired or diag(m) <- 0 to just exclude self pairs.
We can create a data frame of similar row number pairs like this. To keep the output short we have excluded self pairs and the reverse of each pair.
# two-column data.frame of pairs excluding self pairs and reverses
subset(as.data.frame.table(m), c(Var1) < c(Var2) & Freq == 1)[1:2]
giving:
Var1 Var2
34 1 4
35 2 4
45 1 5
58 3 6
91 3 9
93 5 9
101 2 10
106 7 10
117 7 11
118 8 11
Here is a network graph of the above. Note that the answer continues after the graph:
# network graph
library(igraph)
g <- graph_from_adjacency_matrix(m) # graph.adjacency() is the older name for this function
plot(g)
# raster plot
library(ggplot2)
ggplot(as.data.frame.table(m), aes(Var1, Var2, fill = factor(Freq))) +
geom_raster()
I am quite new to R, so don't expect too much.
What if you build, for each column, a matrix of the pairwise differences between its values? From the matrix for a you can find the pairs of rows that differ by at most 1. Repeat this for b and c, and keep the pairs that appear in all three.
Alternatively, this can probably be done with a three-dimensional array (a cube) as well.
Just a hint to think about.
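The per-column idea can be sketched in base R as follows, assuming the first three columns of the built-in anscombe data stand in for a, b and c. The helper close_on is a hypothetical name introduced here.

```r
x <- anscombe[1:3]   # stand-ins for columns a, b and c

# TRUE where two rows differ by at most 1 in a single column
close_on <- function(v) abs(outer(v, v, "-")) <= 1

# Rows are similar only if they are close in every column
similar <- Reduce(`&`, lapply(x, close_on))
diag(similar) <- FALSE   # ignore each row's match with itself

# Row-number pairs, each pair listed once
pairs <- which(similar & upper.tri(similar), arr.ind = TRUE)

# Rows that are similar to at least one other row
result <- x[rowSums(similar) > 0, ]
```

Combining the three logical matrices with & is equivalent to thresholding the "maximum" distance at 1.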

R: clustering documents

I've got a documentTermMatrix that looks as follows:
artikel naam product personeel loon verlof
doc 1 1 1 2 1 0 0
doc 2 1 1 1 0 0 0
doc 3 0 0 1 1 2 1
doc 4 0 0 0 1 1 1
In the package tm, it's possible to calculate the hamming distance between 2 documents. But now I want to cluster all the documents that have a hamming distance smaller than 3.
So here I would like that cluster 1 is document 1 and 2, and that cluster 2 is document 3 and 4. Is there a possibility to do that?
I saved your table to myData:
myData
artikel naam product personeel loon verlof
doc1 1 1 2 1 0 0
doc2 1 1 1 0 0 0
doc3 0 0 1 1 2 1
doc4 0 0 0 1 1 1
Then I used the hamming.distance() function from the e1071 package. You can use your own distances (as long as they are in matrix form):
library(e1071)
distMat <- hamming.distance(myData)
Followed by hierarchical clustering using "complete" linkage method to make sure that the maximum distance within one cluster could be specified later.
dendrogram <- hclust(as.dist(distMat), method="complete")
Select groups according to the maximum distance between points in a group (maximum = 5)
groups <- cutree(dendrogram, h=5)
Finally plot the results:
plot(dendrogram) # main plot
points(c(-100, 100), c(5,5), col="red", type="l", lty=2) # add cutting line
rect.hclust(dendrogram, h=5, border=c(1:length(unique(groups)))+1) # draw rectangles
Another way to see the cluster membership for each document is with table:
table(groups, rownames(myData))
groups doc1 doc2 doc3 doc4
1 1 1 0 0
2 0 0 1 1
So documents 1 and 2 fall into one group, while documents 3 and 4 fall into another.
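For the threshold in the question (Hamming distance smaller than 3), the same steps can be sketched in base R alone. This assumes the Hamming distance is the number of columns in which two rows differ. Since these distances are whole numbers, cutting the complete-linkage tree at h = 2.5 keeps every within-cluster distance below 3.

```r
myData <- matrix(c(1, 1, 2, 1, 0, 0,
                   1, 1, 1, 0, 0, 0,
                   0, 0, 1, 1, 2, 1,
                   0, 0, 0, 1, 1, 1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:4),
                                 c("artikel", "naam", "product",
                                   "personeel", "loon", "verlof")))

# Hamming distance: number of positions in which two rows differ
n <- nrow(myData)
distMat <- outer(seq_len(n), seq_len(n),
                 Vectorize(function(i, j) sum(myData[i, ] != myData[j, ])))
dimnames(distMat) <- list(rownames(myData), rownames(myData))

# Complete linkage bounds the largest distance inside each cluster
dendrogram <- hclust(as.dist(distMat), method = "complete")
groups <- cutree(dendrogram, h = 2.5)
groups
```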

Data Transformations in R

I have a need to look at the data in a data frame in a different way. Here is the problem..
I have a data frame as follows
Person Item BuyOrSell
1 a B
1 b S
1 a S
2 d B
3 a S
3 e S
I need it to be transformed as follows, showing the number of transactions made by each Person on each individual item.
Person a b d e
1 2 1 0 0
2 0 0 1 0
3 1 0 0 1
I was able to achieve the above by using the
table(Person,Item) in R
The new requirement I have is to see the data as follows. Show the sum of all transactions made by the Person on individual items broken by the transaction type (B or S)
Person aB aS bB bS dB dS eB eS
1 1 1 0 1 0 0 0 0
2 0 0 0 0 1 0 0 0
3 0 1 0 0 0 0 0 1
So I created a new column by pasting together the values of Item and BuyOrSell:
df$newcol <- with(df, paste(Item, BuyOrSell, sep=""))
with(df, table(Person, newcol))
and was able to achieve the above results.
Is there a better way in R to do this type of transformation ?
Your way (creating a new column via paste) is probably the easiest. You could also do this:
require(reshape2)
dcast(Person~Item+BuyOrSell,data=df,fun.aggregate=length,drop=FALSE)
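A base R variant of the same idea, sketched under the assumption that df holds the sample data from the question: interaction() builds the combined factor without adding a column, and it keeps combinations that never occur, so the table has all eight columns.

```r
df <- data.frame(Person    = c(1, 1, 1, 2, 3, 3),
                 Item      = c("a", "b", "a", "d", "a", "e"),
                 BuyOrSell = c("B", "S", "S", "B", "S", "S"))

# Combined Item/BuyOrSell factor; lex.order = TRUE gives aB, aS, bB, bS, ...
combo <- interaction(df$Item, df$BuyOrSell, sep = "", lex.order = TRUE)

# Count transactions per person and item/type combination
tab <- table(Person = df$Person, combo)
tab
```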

Restructure Data in R

I am just starting to get beyond the basics in R and have come to a point where I need some help. I want to restructure some data. Here is what a sample dataframe may look like:
ID Sex Res Contact
1 M MA ABR
1 M MA CON
1 M MA WWF
2 F FL WIT
2 F FL CON
3 X GA XYZ
I want the data to look like:
ID SEX Res ABR CON WWF WIT XYZ
1 M MA 1 1 1 0 0
2 F FL 0 1 0 1 0
3 X GA 0 0 0 0 1
What are my options? How would I do this in R?
In short, I am looking to keep the values of the Contact column and use them as column names in the restructured data frame. I want to hold a variable set of columns constant (in the example above, I held ID, Sex, and Res constant).
Also, is it possible to control the values in the restructured data? I may want to keep the data as binary. I may want some data to have the value be the count of times each contact value exists for each ID.
The reshape package is what you want. Documentation here: http://had.co.nz/reshape/. Not to toot my own horn, but I've also written up some notes on reshape's use here: http://www.ling.upenn.edu/~joseff/rstudy/summer2010_reshape.html
For your purpose, this code should work
library(reshape)
data$value <- 1
cast(data, ID + Sex + Res ~ Contact, fun = "length")
model.matrix works great (this was asked recently, and gappy had this good answer):
> model.matrix(~ factor(d$Contact) -1)
factor(d$Contact)ABR factor(d$Contact)CON factor(d$Contact)WIT factor(d$Contact)WWF factor(d$Contact)XYZ
1 1 0 0 0 0
2 0 1 0 0 0
3 0 0 0 1 0
4 0 0 1 0 0
5 0 1 0 0 0
6 0 0 0 0 1
attr(,"assign")
[1] 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$`factor(d$Contact)`
[1] "contr.treatment"
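The question also asks about keeping the ID columns and choosing between counts and binary values. A minimal base R sketch, assuming d holds the sample data from the question (the names counts, wide and wide_binary are illustrative):

```r
d <- data.frame(ID      = c(1, 1, 1, 2, 2, 3),
                Sex     = c("M", "M", "M", "F", "F", "X"),
                Res     = c("MA", "MA", "MA", "FL", "FL", "GA"),
                Contact = c("ABR", "CON", "WWF", "WIT", "CON", "XYZ"))

# One row per ID with the columns that are held constant
ids <- unique(d[c("ID", "Sex", "Res")])

# Number of times each Contact value occurs for each ID
counts <- as.data.frame.matrix(table(d$ID, d$Contact))

# Count version: attach the counts to the constant columns
wide <- cbind(ids, counts[as.character(ids$ID), , drop = FALSE])
rownames(wide) <- NULL

# Binary version: 1 if the contact occurs at all, 0 otherwise
wide_binary <- wide
contact_cols <- names(counts)
wide_binary[contact_cols] <- (wide[contact_cols] > 0) + 0
```

In this sample every contact occurs at most once per ID, so the count and binary versions coincide.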
