assortativity.nominal in igraph - r

I am analysing a coauthorship network (net) with 1327 vertices and 4121 edges.
I have tried to compute
assortativity.nominal(net, types=V(net)$field)
which is based on the field each researcher belongs to. I do not know why, but R crashes each time I run this. I have run
assortativity(net, types=V(net)$publication)
which is based on the number of publications or coauthorships each researcher has, and in this case there is no problem.
I suspect that the problem with assortativity.nominal has to do with the size and order of the network.

Something that might work:
assortativity_nominal(graph,as.numeric(as.factor(V(graph)$attribute)))

The problem is that the trait is NA for some of the vertices. If you change all the NAs to "none", then it works; however, you probably don't want to do that, and neither do I, and I don't know how to get around it.
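A minimal sketch of one way around that (my own workaround, not from the original answer): drop the vertices whose field is NA and compute the nominal assortativity on the induced subgraph, accepting that those vertices and their incident edges are discarded.

library(igraph)

keep    <- which(!is.na(V(net)$field))   # vertices with a known field
net_sub <- induced_subgraph(net, keep)   # subgraph without the NA vertices

assortativity_nominal(net_sub,
                      types = as.integer(as.factor(V(net_sub)$field)))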

Related

How to combine two actions into one object

I just started with R a few weeks ago at university. We were given a problem to solve, and I find that there are two answers that fit the question:
Verify that you created lo_heval correctly (incl. missing values). Store your verification in the object proof2.
So I find this correct:
proof2 <- soep[1:100, c("heval", "lo_heval")]
But I think that this answer is also correct:
proof2 <- table(soep$heval, soep$lo_heval, useNA = "always")
Instead of having to decide on one answer, how do I combine them both into one object? I tried to use &, but I get an error; I may be using it wrong.
Prof. if you're seeing this, please don't fail me. I just can't decide between them.
Thanks in advance!
R lists can hold any arbitrary objects in them, so you could use
proof2 <- list(
soep[1:100, c("heval", "lo_heval")],
table(soep$heval, soep$lo_heval, useNA = "always")
)
However, to my mind 100 rows of two columns isn't proof - it's an exercise to look through those and verify things are right. (And what about the rows past 100? It's a decent spot check, but if there are more rows in the data it is strong evidence rather than proof.) The table approach, on the other hand, seems succinct and effective.
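If you do keep both pieces of evidence, a small variant of the list above (same soep data) is to name the elements so each one can be retrieved by name later:

proof2 <- list(
  rows = soep[1:100, c("heval", "lo_heval")],
  xtab = table(soep$heval, soep$lo_heval, useNA = "always")
)
proof2$xtab   # pull out the cross-tabulation by name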

R cluster analysis and dendrogram with correlation matrix

I have to perform a cluster analysis on a large amount of data. Since I have a lot of missing values, I made a correlation matrix:
corloads = cor(df1[,2:185], use = "pairwise.complete.obs")
Now I am not sure how to proceed. I have read a lot of articles and examples, but nothing really works for me. How can I find out how many clusters are appropriate?
I already tried this:
dissimilarity = 1 - corloads
distance = as.dist(dissimilarity)
plot(hclust(distance), main="Dissimilarity = 1 - Correlation", xlab="")
I got a plot, but it's very messy and I don't know how to read it or how to go on. It looks like this:
Any idea how to improve it? And what can I actually get out of it?
I also wanted to create a scree plot. I read that it shows a curve from which you can see how many clusters are appropriate.
I also performed a cluster analysis and chose 2-20 clusters, but the output is so long that I have no idea how to handle it or what to look at.
Several methods are available to determine the "optimal number of clusters", although it is a controversial topic.
The kgs function (the Kelley-Gardner-Sutcliffe penalty, from the maptree package) is helpful for getting the optimal number of clusters.
Following your code one would do:
library(maptree)  # provides kgs()
clus <- hclust(distance)
op_k <- kgs(clus, distance, maxclus = 20)
plot(as.numeric(names(op_k)), op_k, xlab = "# clusters", ylab = "penalty")
So the optimal number of clusters according to the kgs function is the minimum value of op_k, as you can see in the plot.
You can get it with
min(op_k)
Note that I set the maximum number of clusters allowed to 20. You can set this argument to NULL.
Check this page for more methods.
Hope it helps you.
Edit
To find which is the optimal number of clusters, you can do
op_k[which(op_k == min(op_k))]
Plus
Also see this post for the great graphical answer from @Ben
Edit
op_k[which(op_k == min(op_k))]
still gives penalty. To find the optimal number of clusters, use
as.integer(names(op_k[which(op_k == min(op_k))]))
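Once you have that k, a short follow-up (using the clus object from above) is to cut the tree there and attach the cluster labels:

k_opt  <- as.integer(names(which.min(op_k)))   # optimal number of clusters from the kgs penalty
groups <- cutree(clus, k = k_opt)              # cluster membership for each variable
table(groups)                                  # how many variables fall into each cluster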
I'm happy to learn about the kgs function. Another option is using the find_k function from the dendextend package (it uses the average silhouette width). But given the kgs function, I might just add it as another option to the package.
Also note the dendextend::color_branches function, to color your dendrogram with the number of clusters you end up choosing (you can see more about this here: https://cran.r-project.org/web/packages/dendextend/vignettes/introduction.html#setting-a-dendrograms-branches )
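As a hedged sketch of those two functions together (assuming the distance object from the question, and that find_k's suggestion is stored in its nc component):

library(dendextend)

dend <- as.dendrogram(hclust(distance))
k    <- find_k(dend)$nc                 # suggested number of clusters (average silhouette width)
dend <- color_branches(dend, k = k)     # colour the branches by cluster
plot(dend, main = paste("Dissimilarity = 1 - Correlation,", k, "clusters"))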

How to check that a user-defined function works in r?

This is probably a very silly question, but how can I check whether a function I have written will work or not?
I'm writing a fairly complex function involving many other functions and loops, and was wondering whether there are any ways to check for errors/bugs, or simply to check if the function will work. Do I just create a simple fake data frame and test on it?
As suggested by other users in the comments, I have added the part of the function that I have written. Basically I have a data frame with good and bad data, where the bad data are marked with flags. I want to write a function that produces the plots as usual (including the flagged points) when the user sets flag.option to 1, and removes the flagged points from the plot when the user sets flag.option to 0.
AIR.plot <- function(mydata, flag.option) {
  if (flag.option == 1) {
    par(mfrow = c(2, 1))
    conc  <- tapply(mydata$CO2, format(mydata$date, "%Y-%m-%d %T"), mean)
    dates <- seq(mydata$date[1], mydata$date[nrow(mydata)], length.out = length(conc))
    tryCatch(
      plot(dates, conc,
           type = "p",
           col  = "blue",
           xlab = "day",
           ylab = "CO2"),
      error = function(e) plot.new())
    barplot(mydata$lines, horiz = TRUE, col = c("red", "blue")) # a small bar plot on the bottom that shows which sample-taking line (red or blue) is providing the samples
  } else if (flag.option == 0) {
    # I haven't figured out how to write this part yet, but essentially I want to remove all
    # of the rows with flags on
  }
}
Thanks in advance, I'm not an experienced R user yet so please help me.
Before we (meaning, at my workplace) release any code to our production environment we run through a series of testing procedures to make sure our code behaves the way we want it to. It usually involves several people with different perspectives on the code.
Ideally, such verification should start before you write any code. Some questions you should be able to answer are:
What should the code do?
What inputs should it accept? (including type, ranges, etc)
What should the output look like?
How will it handle missing values?
How will it handle NULL values?
How will it handle zero-length values?
If you prepare a list of requirements and write your documentation before you begin writing any code, the probability of success goes up pretty quickly. Naturally, as you begin writing your code, you may find that your requirements need to be adjusted, or the function arguments need to be modified. That's okay, but document those changes when they happen.
While you are writing your function, use a package like assertthat or checkmate to write as many argument checks as you need in your code. Some of the best, most reliable code where I work consists of about 100 lines of argument checks and 3-4 lines of what the code actually is intended to do. It may seem like overkill, but you prevent a lot of problems from bad inputs that you never intended for users to provide.
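As a rough illustration, using the AIR.plot function from the question (the specific checks are my own assumptions about its inputs, not requirements from the question), checkmate-style argument checks might look like:

library(checkmate)

AIR.plot <- function(mydata, flag.option) {
  # fail fast, with readable messages, before doing any real work
  assert_data_frame(mydata, min.rows = 1)
  assert_names(names(mydata), must.include = c("date", "CO2"))
  assert_choice(flag.option, choices = c(0, 1))
  # ... the actual plotting code goes here ...
}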
When you've finished writing your function, you should at this point have a list of requirements and clearly documented expectations of your arguments. This is where you make use of the testthat package.
Write tests that verify all of the requirements you wrote are met.
Write tests that verify that unintended inputs are rejected or handled the way you want.
Write tests that verify you get the output you intended on your test data.
Write tests that test any edge cases you can think of.
It can take a long time to write all of these tests, but once it is done, any further development is easier to check since anything that violates your existing requirements should fail the test.
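A minimal sketch of what a couple of such tests could look like with testthat, assuming argument checks like the ones sketched above are in place (the toy data and expectations are invented for illustration):

library(testthat)

test_that("AIR.plot rejects bad inputs", {
  expect_error(AIR.plot(mydata = "not a data frame", flag.option = 1))
  expect_error(AIR.plot(mydata = data.frame(), flag.option = 2))
})

test_that("AIR.plot runs quietly on a tiny valid data set", {
  toy <- data.frame(date  = as.POSIXct("2020-01-01 00:00:00") + 0:9 * 3600,
                    CO2   = rnorm(10, mean = 400),
                    lines = rep(1:2, 5))
  expect_silent(AIR.plot(toy, flag.option = 1))
})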
That being said, I'm really bad at following this process in my own work. I have the tendency to write code, then document what I did. But the best code I've written has been where I've planned it out conceptually, wrote my documentation, coded, and then tested against my documentation.
As @antoine-sac pointed out in the links, some things cannot be checked programmatically; for example, whether your function terminates.
Looking at it pragmatically, have a look at the packages assertthat and testthat. assertthat will help you insert checks of intermediate results; testthat is for writing proper tests. And yes, the usual way of writing tests is to create a small test example, including test data.

Nodes Size Relative to Associated Value

I'm fairly new to R and haven't been able to find an answer for this. Someone else asked a similar question, but no solution was ever reported. If I should have posted this Q on a different stackexchange, I apologize and will delete if it can't be migrated.
Using data I pulled from the FDIC on US based financial institutions and their total asset holdings, I would like to create a basic network graph where each node is proportionally sized to each other node in the graph. Each node would also be labeled with the name of the financial institution.
The edges of the graph actually don't matter for now, but I want each node connected to the network by at least one edge.
As of now, I've already successfully created a very basic network with 8 banks, connected by edges I randomly assigned, as shown here (I apparently can't embed pictures yet, sorry about that):
My .csv file will be formatted as:
id, bank, assets
1, JP Morgan Chase, 16928000
2, Bank of America, 19075000
... ... ...
The file for the graph I already created is the same as above, except without the assets column. It also had only 8 banks, whereas the file I hope to use will have 25.
Like I already said, as for edges, I just randomly assigned some. If someone knows an easier way of creating random edges that connect the nodes I create, please let me know. Otherwise, this is how my file is formatted as of now:
to, from
1, 2
1, 3
...
And I created the graph I linked with the following commands:
> nodes <- read.csv("~/foo/foo/foo.csv")
> links <- read.csv("~/blah/taco/burrito/blah.csv")
> net <- graph_from_data_frame(d=links, vertices = nodes, directed = F)
> class(net)
> net
IGRAPH UN-- 8 10 --
+ attr: name (v/c), bank (v/c)
+ edges (vertex names):
[1] 1--2 1--3 1--4 1--5 2--3 2--4 2--7 4--5 5--8 7--8
> plot(net, main = "Financial Intermediaries", edge.arrow.size=.4, vertex.size=25, vertex.label.cex=1.5, vertex.label.color="black", vertex.label=V(net)$bank)
I hope I was clear with my problem and gave the necessary details/code. If not, please just let me know and I'll post it up here. Like I said, I'm really new to R (I literally picked it up today, lol), and much of the code I've used so far was more or less taken from Katya Ognyanova's examples/presentations on her blog.
For the sake of clarity, I'm currently using RStudio (most recent stable) and R v3.2.5.
I have been only using the igraph package, but if what I want can't be done with that, I am more than willing to switch over to a different package. That said, I would like to stay with R (unless there really is something so much easier for this it can't be ignored. I would like to stick with and learn R).
Thank you for any and all help, I really appreciate it.
As @Osssan linked to in the comments, there was a partial solution floating around.
That said, I think I created more of a 'hack' solution than a proper one with what I gleaned from the previous question. Here is what I did.
In my csv file, I had four columns. The third column held the assets for a given bank. NOTE: Since I don't know how to do data manipulation inside R, I had to do some work beforehand to adjust the asset values so that they did not produce nodes that covered the entirety of the graph. With my solution, you will NOT get proportionally sized nodes automatically; you must do that scaling first.
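For what it's worth, that adjustment can also be done inside R. A rough sketch (assuming the assets column from the csv layout above, and that net has been built as in the question) is to rescale the raw asset values into a sensible vertex-size range before plotting:

nodes <- read.csv("~/foo/foo/foo.csv")   # id, bank, assets
# map raw assets linearly onto vertex sizes between 5 and 30
sizes <- 5 + 25 * (nodes$assets - min(nodes$assets)) /
                 (max(nodes$assets) - min(nodes$assets))
plot(net, main = "Financial Intermediaries",
     vertex.size = sizes,
     vertex.label = V(net)$bank,
     vertex.label.color = "black")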
Since I wanted to create a network with nodes(banks) that were variable in size according to their respective asset holdings, what I did was create a separate vector like so
> df <- read.csv("~/blah/blah/blah.csv", colClasses = c("NULL","NULL", NA, "NULL"))
This command reads in the csv file and, via colClasses, tells the reader to keep only the column not marked "NULL" (here, the assets column). I then plugged this vector into the plot function like so:
> plot(net, main = "Financial Intermediaries", edge.arrow.size=.4, vertex.size=as.matrix(df), vertex.label.color="black")
where I turn df into a matrix with as.matrix(df) and pass it to vertex.size=. Given a single-column data frame, R makes the appropriate one-column matrix, which plot.igraph accepts as the vertex sizes.
I still have to do some relabeling and connect things with edges, but the sizing worked. I graphed the largest 26 commercial banks by total asset holdings (adjusted to % of total commercial bank assets in the US), so you will see that the node sizes increase from 26 to 1. Here's the output.
Like I said, this solution works, but I am far from sure whether it would be considered proper or kosher. I welcome anyone to edit this solution so that it clarifies what is actually happening in my code, and/or to post a proper/optimized solution if it exists. I'm going to give this post a solid few days before marking it solved, as I would still like to get a solid answer on this confusing problem.
P.S. If anyone knows of a way to force nodes not to overlap, I would appreciate a comment explaining how to do that. If you look at my picture, you'll see that the effect of dwarfing the other nodes is diminished when the largest node is covered by its closely sized peers.

Fuzzy matching of product names

I need to automatically match product names (cameras, laptops, tv-s etc) that come from different sources to a canonical name in the database.
For example "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS"
should all match "Canon PowerShot A20 IS". I've worked with Levenshtein distance with some added heuristics (removing obvious common words, assigning higher cost to number changes, etc.), which works to some extent, but unfortunately not well enough.
The main problem is that even single-letter changes in relevant keywords can make a huge difference, but it's not easy to detect which are the relevant keywords. Consider for example three product names:
Lenovo T400
Lenovo R400
New Lenovo T-400, Core 2 Duo
The first two are ridiculously similar strings by any standard (OK, Soundex might help distinguish the T and R in this case, but the names might as well be 400T and 400R); the first and the third are quite far from each other as strings, but are the same product.
Obviously, the matching algorithm cannot be 100% precise; my goal is to automatically match around 80% of the names with high confidence.
Any ideas or references are much appreciated.
I think this will boil down to distinguishing key words such as Lenovo from chaff such as New.
I would run some analysis over the database of names to identify key words. You could use code similar to that used to generate a word cloud.
Then I would hand-edit the list to remove anything obviously chaff, like maybe New is actually common but not key.
Then you will have a list of key words that can be used to help identify similarities. You would associate the "raw" name with its keywords, and use those keywords when comparing two or more raw names for similarities (literally, percentage of shared keywords).
Not a perfect solution by any stretch, but I don't think you are expecting one?
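A rough sketch of that keyword-overlap comparison in R (the chaff list here is invented; in practice it would come from the hand-edited list described above):

chaff <- c("new", "digital", "camera", "from")      # hand-edited list of non-key words

keywords <- function(name) {
  words <- tolower(strsplit(name, "[^[:alnum:]]+")[[1]])
  setdiff(words, chaff)                             # keep only the (presumed) key words
}

keyword_similarity <- function(a, b) {
  ka <- keywords(a); kb <- keywords(b)
  length(intersect(ka, kb)) / length(union(ka, kb)) # shared keywords as a proportion
}

keyword_similarity("Canon PowerShot a20IS", "NEW powershot A20 IS from Canon")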
The key understanding here is that you do have a proper distance metric. That is in fact not your problem at all. Your problem is in classification.
Let me give you an example. Say you have 20 entries for the Foo X1 and 20 for the Foo Y1. You can safely assume they are two groups. On the other hand, if you have 39 entries for the Bar X1 and 1 for the Bar Y1, you should treat them as a single group.
Now, the distance X1 <-> Y1 is the same in both examples, so why is there a difference in the classification? That is because Bar Y1 is an outlier, whereas Foo Y1 isn't.
The funny part is that you do not actually need to do a whole lot of work to determine these groups up front. You simply do a recursive classification. You start out with one node per group, and then add a supernode for the two closest nodes. In the supernode, store the best assumption, the size of its subtree and the variation within it. As many of your strings will be identical, you'll soon get large subtrees with identical entries. The recursion ends with a single supernode at the root of the tree.
Now map the canonical names against this tree. You'll quickly see that each will match an entire subtree. Now, use the distances between these trees to pick the distance cutoff for that entry. If you have both Foo X1 and Foo Y1 products in the database, the cut-off distance will need to be lower to reflect that.
edg's answer is in the right direction, I think - you need to distinguish key words from fluff.
Context matters. To take your example, Core 2 Duo is fluff when looking at two instances of a T400, but not when looking at a CPU OEM package.
If you can mark in your database which parts of the canonical form of a product name are more important and must appear in one form or another to identify a product, you should do that. Maybe through the use of some sort of semantic markup? Can you afford to have a human mark up the database?
You can try to define equivalency classes for things like "T-400", "T400", "T 400" etc. Maybe a set of rules that say "numbers bind more strongly than letters attached to those numbers."
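As a small sketch of one such equivalence rule in R (my own illustration, not a full rule set): collapse the separator between a letter prefix and the digits that follow it, so that "T-400", "T 400" and "T400" normalise to the same token.

normalise_model <- function(x) {
  gsub("([A-Za-z])[ -]+([0-9])", "\\1\\2", x)   # drop spaces/hyphens between letter and number
}
normalise_model(c("Lenovo T-400", "Lenovo T 400", "Lenovo T400"))
# all three become "Lenovo T400"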
Breaking down into cases based on manufacturer, model number, etc. might be a good approach. I would recommend that you look at techniques for term spotting to try and accomplish that: http://www.worldcat.org/isbn/9780262100854
Designing everything in a flexible framework that's mostly rule driven, where the rules can be modified based on your needs and emerging bad patterns (read: things that break your algorithm) would be a good idea, as well. This way you'd be able to improve the system's performance based on real world data.
You might be able to make use of a trigram search for this. I must admit I've never seen the algorithm to implement an index, but have seen it working in pharmaceutical applications, where it copes very well indeed with badly misspelt drug names. You might be able to apply the same kind of logic to this problem.
This is a problem of record linkage. The dedupe python library provides a complete implementation, but even if you don't use python, the documentation has a good overview of how to approach this problem.
Briefly, within the standard paradigm, this task is broken into three stages
Compare the fields, in this case just the name. You can use one or more comparators for this, for example an edit distance like the Levenshtein distance, or something like the cosine distance that compares the number of common words.
Turn an array of distance scores into a probability that a pair of records are truly about the same thing.
Cluster those pairwise probability scores into groups of records that likely all refer to the same thing.
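Even if you stay in R rather than Python, the same three stages can be sketched with the stringdist package; the comparator, linkage and cutoff below are my own choices, and the probability step is collapsed into a simple distance threshold:

library(stringdist)

names_raw <- c("Canon PowerShot a20IS",
               "NEW powershot A20 IS from Canon",
               "Digital Camera Canon PS A20IS",
               "Lenovo T400",
               "New Lenovo T-400, Core 2 Duo")

# 1. compare the name field: pairwise cosine distance over character 3-grams
d <- stringdistmatrix(tolower(names_raw), tolower(names_raw),
                      method = "cosine", q = 3)

# 2.-3. turn the scores into groups: cluster and cut at a chosen threshold
grp <- cutree(hclust(as.dist(d), method = "average"), h = 0.4)
split(names_raw, grp)   # records that (hopefully) refer to the same product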
You might want to create logic that ignores the letter/number combination of model numbers (since they're nigh always extremely similar).
I don't have any experience with this type of problem, but I think a very naive implementation would be to tokenize the search term and search for matches that happen to contain any of the tokens.
"Canon PowerShot A20 IS", for example, tokenizes into:
Canon
Powershot
A20
IS
which would match each of the other items you want to show up in the results. Of course, this strategy will likely produce a whole lot of false matches as well.
Another strategy would be to store "keywords" with each item, such as "camera", "canon", "digital camera", and searching based on items that have matching keywords. In addition, if you stored other attributes such as Maker, Brand, etc., you could search on each of these.
Spell checking algorithms come to mind.
Although I could not find a good sample implementation, I believe you can modify a basic spell-checking algorithm to come up with satisfactory results, i.e. working with words as the unit instead of characters.
The bits and pieces left in my memory:
Strip out all common words (a, an, the, new). What is "common" depends on context.
Take the first letter of each word and its length, and make that a word key.
When a suspect word comes up, look for words with the same or a similar word key.
It might not solve your problems directly... but you say you were looking for ideas, right?
:-)
That is exactly the problem I'm working on in my spare time. What I came up with is:
Based on keywords, narrow down the scope of the search. In this case you could have a hierarchy:
type --> company --> model
so that you'd match "Digital Camera" for the type and "Canon" for the company, and then you'd be left with a much narrower scope to search.
You could work this down even further by introducing product lines etc.
But the main point is, this probably has to be done iteratively.
We can use the Datadecision service for matching products.
It will allow you to automatically match your product data using statistical algorithms. This operation is done after defining a threshold score of confidence.
All data that cannot be automatically matched will have to be manually reviewed through a dedicated user interface.
The online service uses lookup tables to store synonyms as well as your manual matching history. This allows you to improve the data matching automation next time you import new data.
I worked on exactly the same thing in the past. What I did was use an NLP method, a TF-IDF vectorizer, to assign weights to each word. For example, in your case:
Canon PowerShot a20IS
Canon --> weight = 0.05 (not a very distinguishing word)
PowerShot --> weight = 0.37 (can be distinguishing)
a20IS --> weight = 0.96 (very distinguishing)
This will tell your model which words to care about and which to ignore. I had quite good matches thanks to TF-IDF.
But note this: a20IS cannot be recognized as a20 IS; you may want to use some kind of regex to handle such cases.
After that, you can use a numeric calculation like cosine similarity.
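A small sketch of that idea in base R (the resulting weights will differ from the illustrative numbers above):

names_raw <- c("Canon PowerShot a20IS",
               "NEW powershot A20 IS from Canon",
               "Digital Camera Canon PS A20IS")

docs  <- lapply(names_raw, function(x) tolower(strsplit(x, "[^[:alnum:]]+")[[1]]))
vocab <- sort(unique(unlist(docs)))

# term-frequency matrix (documents x terms), then inverse document frequency
tf    <- t(sapply(docs, function(d) table(factor(d, levels = vocab))))
idf   <- log(length(docs) / colSums(tf > 0))
tfidf <- tf * rep(idf, each = nrow(tf))

# cosine similarity between the first two product names
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
cosine(tfidf[1, ], tfidf[2, ])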
