Colors and plotting a term-document matrix in R

Following the example of plotting a term-document matrix below,
library("tm")
data("crude")
tdm <- TermDocumentMatrix(crude, control = list(removePunctuation = TRUE,
removeNumbers = TRUE,
stopwords = TRUE))
plot(tdm, terms = findFreqTerms(tdm, lowfreq = 6)[1:25], corThreshold = 0.5)
Is there a way to colorize the nodes based on how many vertices they have? Is there an example of making the nodes with more vertices larger or something to that effect as well?

It appears that the nodes that end up being plotted are of class AgNode. The properties that you can set on an AgNode are listed on the ?AgNode help page. Once you know which properties you would like to set, you can pass a list to the nodeAttrs parameter of your plotting command. (EDIT: actually a better list is probably the node attributes description in the Rgraphviz documentation.)
The nodeAttrs parameter takes a list where each named element corresponds to one of the properties of AgNode. At each position, you store a named vector where the names correspond to the node names (i.e., the words in your term-document matrix) and the values give the value for that attribute. For example
list(
color=c(futures="blue", demand="red"),
shape=c(crude="ellipse", budget="circle")
)
So when you say you want to color the terms by the number of vertices they have, I'm going to assume you mean edges, since each word is a single vertex in the graph. So, using your tdm object
freqterms <- findFreqTerms(tdm, lowfreq = 6)[1:25]
vtxcnt <- rowSums(cor(as.matrix(t(tdm[freqterms,])))>.5)-1
I save the terms you wanted, and then I basically copied the code inside the plot command to calculate the correlations with your cutoff of 0.5 to see how many other words each word in this subset is connected to. That's the vtxcnt variable. (There may be a more efficient way to extract this but I could not find it). Now I'm ready to assign colors
mycols<-c("#f7fbff","#deebf7","#c6dbef",
"#9ecae1","#6baed6","#4292c6",
"#2171b5", "#084594")
vc <- mycols[vtxcnt+1]
names(vc) <- names(vtxcnt)
Here I grabbed some colors from ColorBrewer. I have 8 values because the values of vtxcnt here range from 0 to 7, so vtxcnt+1 indexes 1 through 8. If you had a wider range or wanted to collapse categories, you could use the cut() command to categorize the counts (see the sketch after the output below). Then I created a named vector vc that matches each word to the appropriate color. vc should look like this
head(vc)
# ability accord agreement ali also analysts
# "#084594" "#c6dbef" "#2171b5" "#9ecae1" "#f7fbff" "#4292c6"
And now we are ready to make the plot
pp <- plot(tdm,
terms = freqterms,
corThreshold = 0.5,
nodeAttrs=list(fillcolor=vc))
So as you can see, customizing the nodes is pretty flexible. You can color them however you like if you pass the correct values to nodeAttrs.
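The same mechanism should cover the size part of your question, since width and height are standard Graphviz node attributes. A hedged sketch (the scaling is arbitrary, and I have only verified that nodeAttrs is forwarded for fillcolor, so treat width here as an assumption):
# Make better-connected words larger; Graphviz widths are in inches
sz <- as.character(0.75 + vtxcnt / max(vtxcnt)) # hypothetical scaling
names(sz) <- names(vtxcnt)
plot(tdm,
terms = freqterms,
corThreshold = 0.5,
nodeAttrs = list(fillcolor = vc, width = sz))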

Related

How can I get the same piece (duplicate code) of an image from many different photos every time?

From 5000 photos of license plates I want to determine which duplicate code these license plates have.
Here are 2 examples of a duplicate code on a license plate.
In the first example the duplicate code is 2 and in the second example the duplicate code is 1.
With the packages magick and tesseract (see code below), I was able to retrieve the piece of the photo from the first example where the duplicate code is, and to read the duplicate code. But in the second example and the other photos, that region is in a different place.
So I am looking for something that can recognize where the duplicate code is and that will read the duplicate code. Note: The duplicate code is always above the 1st indent mark.
Does someone have an idea how to read the duplicate code automatically from 5000 different photos?
library(magick)
library(tesseract)
# Load photo:
foto <- image_read("C:/Users/camie/OneDrive/Documenten/kenteken3.jpg")
# Get the piece of the photo where the duplicate code is:
foto2 <- image_crop(foto, "10x24+620+170")
# Read the duplicate code:
cat(ocr(foto2))
Here is an approach based on the package EBImage. ImageMagick is great for image manipulation, but I think EBImage may provide more quantitative tools that are useful here. As with all image processing, the quality of the input image matters a great deal. The approach suggested here would likely benefit from noise and artifact removal, scaling, and possibly cropping.
Also, some licenses seem to have additional symbols in the position of interest that are not numbers. Clearly more pre-processing and filtering are needed for such cases.
Sample image
# Starting from EBImage
if (!require(EBImage)) {
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage") # on current Bioconductor releases use BiocManager::install("EBImage")
library(EBImage)
}
# Test images
# setwd(<image directory>)
f1 <- "license1.jpg"
f2 <- "license2.jpg"
# Read image and convert to normalized greyscale
img0 <- readImage(f1)
img <- channel(img0, "grey")
img <- normalize(img)
# plot(img) # insert plot or display commands as desired
# Rudimentary image process for ~300 pixel wide JPEG
xmf <- medianFilter(img, 1)
xgb <- gblur(xmf, 1)
xth <- xgb < otsu(xgb) # Otsu's algorithm to determine best threshold
xto <- opening(xth, makeBrush(3, shape = "diamond"))
A binary (thresholded) image has been produced and cleaned up to identify objects as shown here.
# Create object mask with unique integer for each object
xm <- bwlabel(xto)
# plot(colorLabels(xm)) # optional code to visualize the objects
In addition to the rudimentary image process, some "object processing" can be applied as shown here. Objects along the edge are not going to be of interest so they are removed. Similarly, artifacts that give rise to horizontal (wide) streaks can be removed as well.
# Drop objects touching the edge
nx <- dim(xm)[1]
ny <- dim(xm)[2]
sel <- unique(c(xm[1,], xm[nx,], xm[,1], xm[,ny]))
sel <- sel[sel != 0]
xm <- rmObjects(xm, sel, reenumerate = TRUE)
# Drop exceptionally wide objects (33% of image width)
major <- computeFeatures.moment(xm)[,"m.majoraxis"]
sel <- which(major > nx/3)
xm <- rmObjects(xm, sel, reenumerate = TRUE)
The following logic identifies the center of mass for each object with the computeFeatures.moment function of EBImage. It seems that the main symbols will be along a horizontal line while the candidate object will be above that line (lower y-value in EBImage Image object). An alternative approach would be to find objects stacked on one another, i.e., objects with similar x-values.
For the examples I explored, one standard deviation below the median y-value of the centers of mass appears to be sufficient to identify the candidate object. This is used to determine the limit shown below. Of course, this logic should be adjusted as dictated by the actual data.
# Determine center of mass for remaining objects
M <- computeFeatures.moment(xm)
x <- M[,1]
y <- M[,2]
# Show suggested limit on image (y coordinates are inverted)
plot(img)
limit <- median(y) - sd(y)
abline(h = limit, col = "red")
# Show centers of mass on original image
ok <- y < limit
points(x[!ok], y[!ok], pch = 16, col = "blue")
points(x[ok], y[ok], pch = 16, col = "red")
The image shows the segmented objects after having discarded objects along the edge. Red shows the candidate, blue shows the non-candidates.
Because some licenses have two symbols above the dash, the following code selects the leftmost of possible candidates, expands the object mask and returns a rectangular crop of the image that can be passed to ocr().
# Accept leftmost (first) of candidate objects
left <- min(x[which(ok)])
sel <- which(x == left)
# Enlarge object mask and extract the candidate image
xm <- dilate(xm, makeBrush(7, "disc"))
px <- which(imageData(xm) == sel, arr.ind = TRUE) # pixel coordinates of the candidate object
xx <- min(px[, 1]):max(px[, 1])
yy <- min(px[, 2]):max(px[, 2])
# "Return" selected portion of image
ans <- img[xx, yy] # this is what can be passed to tesseract
plot(ans, interpolate = FALSE)
Here is the unscaled and extracted candidate image from example 1:
Another sample image
The same code applied to this example gives the following:
With a few more checks for errors and for illogical conditions, the code could be assembled into a single function and applied to the list of 5000 files! But of course that assumes they are properly formatted, etc.
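For what it's worth, here is a minimal sketch of that assembly, assuming the photos sit in one folder; the folder name "photos", the helpers extract_code() and ocr_one(), and the tryCatch error handling are my own illustration, not part of the answer above:
library(EBImage)
library(tesseract)
extract_code <- function(f) {
  # condensed version of the pipeline above
  img <- normalize(channel(readImage(f), "grey"))
  xgb <- gblur(medianFilter(img, 1), 1)
  xm <- bwlabel(opening(xgb < otsu(xgb), makeBrush(3, shape = "diamond")))
  nx <- dim(xm)[1]; ny <- dim(xm)[2]
  edge <- setdiff(unique(c(xm[1,], xm[nx,], xm[,1], xm[,ny])), 0)
  xm <- rmObjects(xm, edge, reenumerate = TRUE)
  wide <- which(computeFeatures.moment(xm)[, "m.majoraxis"] > nx/3)
  xm <- rmObjects(xm, wide, reenumerate = TRUE)
  M <- computeFeatures.moment(xm)
  ok <- M[, 2] < median(M[, 2]) - sd(M[, 2])   # candidates above the main line
  sel <- which(M[, 1] == min(M[ok, 1]))        # leftmost candidate
  xm <- dilate(xm, makeBrush(7, "disc"))
  px <- which(imageData(xm) == sel, arr.ind = TRUE)
  img[min(px[, 1]):max(px[, 1]), min(px[, 2]):max(px[, 2])]
}
ocr_one <- function(f) {
  tf <- tempfile(fileext = ".png")
  writeImage(extract_code(f), tf) # tesseract::ocr() wants a file, not an Image
  ocr(tf)
}
files <- list.files("photos", pattern = "\\.jpg$", full.names = TRUE)
codes <- vapply(files, function(f) tryCatch(ocr_one(f), error = function(e) NA_character_), character(1))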
Given the existence of multiple layouts for Dutch license plates, I'm not sure you can just hardcode a method to extract the duplicate code. You also don't mention whether every image always has the same quality, orientation, scale, skew, etc.
You could in theory apply a convolutional neural network that categorizes license plates into several categories (0 for n/a, 1 for 1, 2 for 2, etc.). However, I am not familiar with related packages in R, so I won't be able to point you to any.

Setting networkD3 Node Size with vector of weights

I'm trying to use data.tree and networkD3 in R to create a tree representation of a file system where the nodes of the graph are weighted by file size.
library(data.tree)
library(networkD3)
repo <- Node$new("Repository")
git <- repo$AddChild(".git")
prod <- repo$AddChild("Production")
exp <- repo$AddChild("Experimental")
repo$size <- 866000
git$size <- 661000
prod$size <- 153000
exp$size <- 48000
I can get a vector of these sizes using Get, so that
sizes <- repo$Get("size")
But when I try to put it all together, I'm not sure how to include this weight information in the network visualization step. Trying to do something like this...
reponet <- ToDataFrameNetwork(repo,"repo")
net <- forceNetwork(reponet, Nodesize = repo$Get("size"))
to no avail. Basically I'm trying to do what Julia Silge did in this great SO blog post. Does anyone know how to set this?
Check the help file for forceNetwork... there are numerous mandatory parameters that you have not set.
You can use simpleNetwork to plot a network with just a links data frame like you have, but it doesn't allow you to control the node size... for example...
simpleNetwork(reponet)
To control the node size, you need to use forceNetwork, but it requires a links data frame and a nodes data frame. You could build the nodes data frame from the sizes object you created, and then adjust the source and target IDs in your links data frame to match the indexes of the appropriate node in your nodes data frame (0 indexed because it's sent to JavaScript)... for example...
nodesdf <- data.frame(name = names(sizes), nodesize = sizes / 10000, group = 1)
reponet$from <- match(reponet$from, nodesdf$name) - 1
reponet$to <- match(reponet$to, nodesdf$name) - 1
forceNetwork(reponet, Nodes = nodesdf, Source = "from", Target = "to",
NodeID = "name", Group = "group", Nodesize = "nodesize")

(igraph) Grouped layout based on attribute

I'm using the igraph package in R to lay out a network graph, and I would like to group the vertex coordinates based on attribute values.
My question is similar to the answered question How to make grouped layout in igraph?, but differs in that the nodes needn't be grouped by a community membership derived from a community detection algorithm.
Rather, I want a layout with groups based on attribute values that are known in advance for each vertex.
For example, if each vertex has an attribute "Master.Org", and there are ~10 to ~20 distinct values of Master.Org, how can I lay out the graph such that all vertices within the same Master.Org are grouped?
Thanks!
Additional Detail
In fact, two separate attributes provide nested levels of grouping.
My goal is to lay out the graph such that the "Master.Org" and "Org.Of" values are grouped together in their XY coordinates on the graph.
For example, each node belongs to an "Org.Of", and there can be multiple "Org.Of" values within a "Master.Org".
Thoughts?
Thanks!
While this question is rather old, it is a reasonable question and deserves an answer.
No data was provided so I will generate an arbitrary example.
library(igraph)
set.seed(1234)
G = erdos.renyi.game(20, 0.25)
V(G)$Group1 = sample(3,20, replace=TRUE)
plot(G, vertex.color=rainbow(3, alpha=0.4)[V(G)$Group1])
Without doing anything special, the grouping is ignored in the layout.
Now we need to create a layout that plots nodes in the same group close together. We can do this by creating a graph with the same nodes but with additional links between nodes in the same group. The within-group links are given a high weight and the original links a small weight, which pulls nodes in the same group together. We then use the resulting layout to plot the original graph, without the extra links; they were only there to produce a good layout.
G_Grouped = G
E(G_Grouped)$weight = 1
## Add edges with high weight between all nodes in the same group
for(i in unique(V(G)$Group1)) {
GroupV = which(V(G)$Group1 == i)
G_Grouped = add_edges(G_Grouped, combn(GroupV, 2), attr=list(weight=5))
}
## Now create a layout based on G_Grouped
set.seed(567)
LO = layout_with_fr(G_Grouped)
## Use the layout to plot the original graph
plot(G, vertex.color=rainbow(3, alpha=0.4)[V(G)$Group1], layout=LO)
If you want to go beyond this to have multiple levels of grouping, just add additional links with appropriate weights to connect the subgroups too.
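For example, here is a hedged sketch of a two-level version, where the attribute Group2 (invented here) nests inside Group1 and the weights 10 and 3 are arbitrary:
## Hypothetical second level: Group2 nests inside Group1
V(G)$Group2 <- paste(V(G)$Group1, sample(2, 20, replace = TRUE))
G2 <- G
E(G2)$weight <- 1
for (g in unique(V(G)$Group2)) {   # strongest pull within subgroups
  v <- which(V(G)$Group2 == g)
  if (length(v) > 1) G2 <- add_edges(G2, combn(v, 2), attr = list(weight = 10))
}
for (g in unique(V(G)$Group1)) {   # intermediate pull within supergroups
  v <- which(V(G)$Group1 == g)
  if (length(v) > 1) G2 <- add_edges(G2, combn(v, 2), attr = list(weight = 3))
}
plot(G, vertex.color = rainbow(3, alpha = 0.4)[V(G)$Group1], layout = layout_with_fr(G2))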

Finding the size of biconnected components in R

I am analyzing an undirected graph in R. I'm trying to (eventually) write a function to get the ratio of the size (number of vertices) of the largest connected component to the size of the largest biconnected component - of any random graph. I was able to extract the size of the largest connected component, but am having trouble with the size of the largest biconnected component. I started off using the igraph function biconnected_components on graph g:
bicomponent_list <- biconnected_components(g)
bicomponent_list$components # lists all of the components, including size and vertex names
length(bicomponent_list$components[[1]]) # returns number of vertices of first bicomponent
Then my half-baked idea was to somehow order this list in decreasing number of vertices, so that I can always call length(bicomponent_list$components[[1]]) and it will be the largest biconnected component. But I don't know how to sort this correctly. Perhaps I have to convert it to a vector? But I also don't know how to specify that I want the number of vertices in the vector. Does anyone know, or have a better way to do it? Thanks so much!
library(igraph)
# generating sample graph
g1 <- barabasi.game(100, 1, 5)
V(g1)$name <- as.character(1:100)
g2 <- erdos.renyi.game(50, graph.density(g1), directed = TRUE)
V(g2)$name <- as.character(101:200)
g3 <- graph.union(g1, g2, byname = TRUE)
# analyzing the bicomponents
bicomponent_list <- biconnected_components(g3)
bi_list <- as.list(bicomponent_list$components)
bi_list <- lapply(bi_list, length) # lists the sizes of all of the components that I want to reorder
My desired outcome would be ordering bi_list such that length(bicomponent_list$components[[1]]) returns the bicomponent with the most vertices.
The components property is a list of vertex lists. You can iterate over them and find their lengths like so
sapply(bicomponent_list$components, length)
and if you just want the largest size, wrap that in max()
max(sapply(bicomponent_list$components, length))
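Building on that, the vector of sizes gives you both the ordering you were after and the largest component directly:
sizes <- sapply(bicomponent_list$components, length)
# the largest biconnected component itself (a vertex list)
largest <- bicomponent_list$components[[which.max(sizes)]]
# reorder so that components[[1]] is always the largest
comps <- bicomponent_list$components[order(sizes, decreasing = TRUE)]
length(comps[[1]]) # size of the largest biconnected component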

Self organising map visualisation result interpretation

Using the R kohonen package, I have obtained a "codes" plot which shows the codebook vectors.
Shouldn't the codebook vectors of neighbouring nodes be similar? Why are the top two nodes on the left so different?
Is there a way to organise it into a meaningful arrangement such as the image below (source here), where the countries with high poverty are clustered at the bottom?
library("kohonen")
data("wines")
wines.sc <- scale(wines)
set.seed(7)
wine.som <- som(data = wines.sc, grid = somgrid(5, 4, "hexagonal"))
# types of plots
plot(wine.som, type="codes", main = "Wine data")
Map 1 is the average vector result for each node. The top 2 nodes that you highlighted are very similar.
Map 2 is a kind of similarity index between the nodes.
If you want to obtain that kind of map using the map 1 result, you may have to develop your own plotting function along these lines:
Pick the most relevant nodes or the most different ones (manually or automatically), then attribute a color to each of these nodes.
Give a color to the neighbouring nodes using the average distance between the center of each node and the selected nodes: shorter distance = close color, larger distance = fading color.
To sum up, that's a lot of work for nearly nothing. Map 1 is better and contains a lot of information. Map 2 is just nice looking...
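That said, if a rough built-in shading is enough, the kohonen package can color units by any per-unit property. A minimal sketch using the wine.som object from above, fading with grid distance from an arbitrarily chosen reference unit (unit 1 is my assumption, not a recommendation):
# Shade units by grid distance to an arbitrary reference unit
d <- unit.distances(wine.som$grid)[1, ]
plot(wine.som, type = "property", property = d, main = "Grid distance to reference unit")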
