I'm using the textrank method explained here to get a summary of the text. Is there a way to plot the output of textrank_sentences as a network of all the textrank_ids connected to each other?
library(textrank)
data(joboffer)
library(udpipe)
tagger <- udpipe_download_model(language = "english")  # download the English udpipe model first
tagger <- udpipe_load_model(tagger$file_model)
joboffer <- udpipe_annotate(tagger, job_rawtxt)
joboffer <- as.data.frame(joboffer)
joboffer$textrank_id <- unique_identifier(joboffer, c("doc_id","paragraph_id", "sentence_id"))
sentences <- unique(joboffer[, c("textrank_id", "sentence")])
terminology <- subset(joboffer, upos %in% c("NOUN", "ADJ"))
terminology <- terminology[, c("textrank_id", "lemma")]
tr <- textrank_sentences(data = sentences, terminology = terminology)
This question is rather old, but it is a good one and deserves an answer.
Yes! textrank returns all the information that you need. Just look
at the output of str(tr). Part of it says:
$ sentences_dist:Classes ‘data.table’ and 'data.frame': 666 obs. of 3 variables:
..$ textrank_id_1: int [1:666] 1 1 1 1 1 1 1 1 1 1 ...
..$ textrank_id_2: int [1:666] 2 3 4 5 6 7 8 9 10 11 ...
..$ weight : num [1:666] 0.1429 0.4167 0 0.0625 0 ...
This lists, for every pair of sentences, the weight of their connection (one triangle of the pairwise weight matrix in long form). Two sentences are connected if the weight of their connection is greater than zero. To visualize the graph, use the pairs with non-zero weight as an edge list and build the graph.
Links = which(tr$sentences_dist$weight > 0)
EdgeList = cbind(tr$sentences_dist$textrank_id_1[Links],
tr$sentences_dist$textrank_id_2[Links])
library(igraph)
SGraph1 = graph_from_edgelist(EdgeList, directed=FALSE)
set.seed(42)
plot(SGraph1)
We see that 11 of the nodes (sentences) are not connected to any other node.
For example, sentences 15 and 36
tr$sentences$sentence[c(36,15)]
[1] "Contact:"
[2] "Integration of the models into the existing architecture."
But other nodes do connect up; for example, node 1 is connected to node 2.
tr$sentences$sentence[c(1,2)]
[1] "Statistical expert / data scientist / analytical developer"
[2] "BNOSAC (Belgium Network of Open Source Analytical Consultants),
is a Belgium consultancy company specialized in data analysis and
statistical consultancy using open source tools."
because those sentences share the (important) words "statistical", "data", and "analytical".
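You can verify this from the terminology table built in the question; a quick check of which lemmas the two sentences share:
# lemmas used as terminology for both sentence 1 and sentence 2
intersect(terminology$lemma[terminology$textrank_id == 1],
          terminology$lemma[terminology$textrank_id == 2])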
The singleton nodes take up a lot of space in the graph, making the other nodes rather crowded, so I will also show the graph with those removed.
which(degree(SGraph1) == 0)
[1] 4 7 15 20 21 23 25 26 29 30 36
SGraph2 = delete.vertices(SGraph1, which(degree(SGraph1) == 0))
set.seed(42)
plot(SGraph2)
That shows the relations between sentences somewhat better, but I expect that you can find a nicer layout for the graph that better shows the relations. However, that is not the thrust of the question and I leave it to you to make the graph pretty.
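If you also want the plot to reflect how strongly sentences are related, one option (just a sketch, using the objects created above) is to carry the textrank weights over as edge attributes and map them to edge width:
# the edges were built from the rows indexed by Links, so the weights line up
E(SGraph1)$weight <- tr$sentences_dist$weight[Links]
SGraph2 <- delete.vertices(SGraph1, which(degree(SGraph1) == 0))
set.seed(42)
plot(SGraph2, edge.width = 3 * E(SGraph2)$weight / max(E(SGraph2)$weight))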
Related
I have a dataset full of IDs and qualification strings. My issue with this is twofold:
how to deal with splits between different symbols, and
how to iterate the output down a dataframe while retaining an ID.
ID <- c(1,2,3)
Qualstring <- c("LE:Science = 45 Distinctions",
"A:Chemistry = A A:Biology = A A:Mathematics = A",
"A:Biology = A A:Chemistry = A A:Mathematics = A B:Baccalaureate Advanced Diploma = Pass"
)
s <- data.frame(ID, Qualstring)
The desired output would be:
ID Qualification Subject Grade
1 1 LE: Science 45 Distinctions
2 2 A: Chemistry A
3 2 A: Biology A
4 2 A: Mathematics A
5 3 A: Biology A
6 3 A: Chemistry A
7 3 A: Mathematics A
8 3 WB: Welsh Baccalaureate Advanced Diploma Pass
The commonality of the splits is the ":" and "=", and the codes/words around those.
Looking at the problem from my perspective, it appears complex, and I wonder whether a continued fudge in Excel is ultimately the way to go for data structured like this. I would love to know if there are any recommendations or directions to take instead.
A solution using data.table and stringr. The use of data.table is just for my personal convenience; you could use a data.frame with do.call(rbind, .) instead of rbindlist().
library(stringr)
library(data.table)
library(magrittr)  # provides the %>% pipe used below

qual    <- str_extract_all(s$Qualstring, "[A-Z]+(?=\\:)")
subject <- str_extract_all(s$Qualstring, "(?<=\\:)[\\w ]+")
grade   <- str_extract_all(s$Qualstring, "(?<=\\= )[A-Za-z0-9]+")

df <- lapply(seq_along(s$ID), function(i){
  N = length(qual[[i]])
  data.table(ID            = rep(s[i, "ID"], N),
             Qualification = qual[[i]],
             Subject       = subject[[i]],
             Grade         = grade[[i]])
}) %>% rbindlist()
df
ID Qualification Subject Grade
1: 1 LE Science 45
2: 2 A Chemistry A
3: 2 A Biology A
4: 2 A Mathematics A
5: 3 A Biology A
6: 3 A Chemistry A
7: 3 A Mathematics A
8: 3 B Baccalaureate Advanced Diploma Pass
In short, I use positive lookbehind (?<=) and positive lookahead (?=). [A-Z]+ matches a group of upper-case letters, [\\w ]+ a group of word characters and spaces, and [A-Za-z0-9]+ letters (upper and lower case) and digits. str_extract_all returns a list with all the matches for each element of the character vector tested.
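If the lookarounds are unfamiliar, here is a minimal illustration on a single string (the second row of the example data):
x <- "A:Chemistry = A A:Biology = A A:Mathematics = A"
str_extract_all(x, "[A-Z]+(?=\\:)")         # codes before ":"   -> "A" "A" "A"
str_extract_all(x, "(?<=\\:)[\\w ]+")       # text after ":"     -> the subjects
str_extract_all(x, "(?<=\\= )[A-Za-z0-9]+") # token after "= "   -> the grades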
I've been building simulators in Excel with VBA to understand the distribution of outcomes a player may experience as they open up collectible card packs. These were largely built with nested for loops and, as you can imagine, were slow as molasses.
I've been spinning up on R over the last couple of months and have come up with a function that handles a particular definition of a pack (i.e., two cards with particular drop rates for n characters on either card). Now I am trying to abstract my function so that it can take any number of cards of whatever type of thing you want to throw at it (e.g., currency, gear, materials, etc.).
What this simulator is basically doing is saying "I want to watch 10,000 people open up 250 packs of 2 cards"; after the results are generated I perform some analysis to ask questions like "How many $ will you need to spend to acquire character x?" or "What's the distribution of outcomes for getting x, y or z pieces of a character?"
Here's my generic function and then I'll provide some inputs that the function operates on:
mySimAnyCard <- function(observations, packs, lookup, droptable, cardNum){
  obvs <- rep(1:observations, each = packs)
  pks <- rep(1:packs, times = observations)

  # draw an award for each opened card, using the drop rates if supplied
  if("prob" %in% colnames(lookup))
  {
    awrd = sample(lookup[,"award"], length(obvs), replace = TRUE, prob = lookup[,"prob"])
  } else {
    awrd = sample(unique(lookup[,"award"]), length(obvs), replace = TRUE)
  }

  # draw a quantity for each card from the drop table
  qty = sample(droptable[,"qty"], length(obvs), prob = droptable[,"prob"], replace = TRUE)

  df <- data.frame(observation = obvs, pack = pks, card = cardNum, award = awrd, quantity = qty)
  df
}
observations and packs are integers.
lookup takes a dataframe:
award prob
1 Nick 0.5
2 Alex 0.4
3 Sam 0.1
and droptable takes a similar dataframe :
qty prob
1 10 0.1355
2 12 0.3500
3 15 0.2500
4 20 0.1500
5 25 0.1000
6 50 0.0080
... continued
cardNum also takes an integer.
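For reference, a minimal call might look like the sketch below; the droptable here is a made-up stand-in for the truncated table above, and smaller observation/pack counts are used just for a quick check:
lookup <- data.frame(award = c("Nick", "Alex", "Sam"),
                     prob  = c(0.5, 0.4, 0.1),
                     stringsAsFactors = FALSE)
# probabilities are rescaled by sample(), so the truncated drop table still works
droptable <- data.frame(qty  = c(10, 12, 15, 20, 25, 50),
                        prob = c(0.1355, 0.35, 0.25, 0.15, 0.10, 0.008))

sim1 <- mySimAnyCard(observations = 100, packs = 10,
                     lookup = lookup, droptable = droptable, cardNum = 1)
head(sim1)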
It's fine to run this multiple times, assign the output to variables, and then rbind and order, but what I'd really like to do is feed a master function a dataframe that specifies which cards it needs to provision and which lookup and droptable it should pull from for each card, a la:
card lookup droptable
1 1 char1 chardrops
2 2 char1 chardrops
3 3 char2 <NA>
4 4 credits <NA>
5 5 credits creditdrops
6 6 abilityMats abilityMatDrops
7 7 abilityMats abilityMatDrops
It's probably never going to be more than 20 cards...so I'm willing to take the speed of a for loop, but I'm curious how the SO community would approach this problem.
Here's what I put together thus far:
mySimAllCards <- function(observations, packs, cards){
  full <- data.frame()
  for(i in 1:length(cards$card)){
    tmp <- mySimAnyCard(observations, packs, cards[i,2], cards[i,3], i)
    full <- rbind(full, tmp)
  }
  full
}
which trips over
Error in `[.default`(lookup, , "award") : incorrect number of dimensions
I can work through the issues above, but is there a better approach to consider?
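One possible direction, purely as a sketch (the function and object names here are hypothetical, NA droptable rows from the table above are not handled, and the example lookup/droptable data frames shown earlier are reused): keep the actual tables in a named list, store only their names in the cards dataframe, and let the master function fetch them by name:
# hypothetical container for the real tables; extend with char2, credits, etc.
tables <- list(char1 = lookup, chardrops = droptable)

cards <- data.frame(card      = 1:2,
                    lookup    = "char1",
                    droptable = "chardrops",
                    stringsAsFactors = FALSE)

simAllCards <- function(observations, packs, cards, tables){
  out <- lapply(seq_len(nrow(cards)), function(i){
    mySimAnyCard(observations, packs,
                 lookup    = tables[[cards$lookup[i]]],
                 droptable = tables[[cards$droptable[i]]],
                 cardNum   = cards$card[i])
  })
  do.call(rbind, out)
}

allSims <- simAllCards(observations = 100, packs = 10, cards, tables)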
I'm moderately experienced using R, but I'm just starting to learn to write functions to automate tasks. I'm currently working on a project to run sentiment analysis and topic models of speeches from the five remaining presidential candidates and have run into a snag.
I wrote a function to do a sentence-by-sentence analysis of positive and negative sentiments, giving each sentence a score. Miraculously, it worked and gave me a dataframe with scores for each sentence.
score text
1 1 iowa, thank you.
2 2 thanks to all of you here tonight for your patriotism, for your love of country and for doing what too few americans today are doing.
3 0 you are not standing on the sidelines complaining.
4 1 you are not turning your backs on the political process.
5 2 you are standing up and fighting back.
What I'm trying to do now is create a function that takes the scores, figures out what percentage of the total the count of each score represents, and then plots it using plotly. Here is the function I've written:
scoreFun <- function(x){{
tbl <- table(x)
res <- cbind(tbl,round(prop.table(tbl)*100,2))
colnames(res) <- c('Score', 'Count','Percentage')
return(res)
}
percent = data.frame(Score=rownames, Count=Count, Percentage=Percentage)
return(percent)
}
Which returns this:
saPct <- scoreFun(sanders.scores$score)
saPct
Count Percentage
-6 1 0.44
-5 1 0.44
-4 6 2.64
-3 13 5.73
-2 20 8.81
-1 42 18.50
0 72 31.72
1 34 14.98
2 18 7.93
3 9 3.96
4 6 2.64
5 2 0.88
6 1 0.44
9 1 0.44
11 1 0.44
What I had hoped it would return is a dataframe in which what ended up as the rownames is a variable called Score, with the next two columns called Count and Percentage, respectively. Then I want to plot Score on the x-axis and Percentage on the y-axis using this code:
d <- subplot(
plot_ly(clPct, x = rownames, y=Percentage, xaxis="x1", yaxis="y1"),
plot_ly(saPct, x = rownames, y=Percentage, xaxis="x2", yaxis="y2"),
margin = 0.05,
nrows=2
) %>% layout(d, xaxis=list(title="", range=c(-15, 15)),
xaxis2=list(title="Score", range=c(-15,15)),
yaxis=list(title="Clinton", range=c(0,50)),
yaxis2=list(title="Sanders", range=c(0,50)),showlegend = FALSE)
d
I'm pretty certain I've made some obvious mistakes in my function and my plot_ly code, because clearly it's not returning the dataframe I want, and running the plotly code leads to the error Error in list2env(data) : first argument must be a named list. Again, though, I'm not very experienced at writing functions and I've not found a similar issue when I Google, so I don't know how to fix this.
Any advice would be most welcome. Thanks!
@MLavoie, this code from the question I referenced in my comment did the trick. Many thanks!
scoreFun <- function(x){
tbl <- data.frame(table(x))
colnames(tbl) <- c("Score", "Count")
tbl$Percentage <- tbl$Count / sum(tbl$Count) * 100
return(tbl)
}
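With Score now a proper column, the plotting call needs a small change as well; a minimal sketch for one candidate, assuming plotly >= 4 with the formula interface:
library(plotly)
saPct <- scoreFun(sanders.scores$score)
plot_ly(saPct, x = ~Score, y = ~Percentage, type = "bar") %>%
  layout(xaxis = list(title = "Score"), yaxis = list(title = "Percentage"))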
I need to plot this structure in R:
(1000) Diseases of the genitourinary system
    (1580) Nephritis, nephrotic syndrome, and nephrosis
        (580) Acute glomerulonephritis
            (580.9) Glomerulonephritis, acute, unspec.
        (581) Nephrotic syndrome
            (581.9) Nephrotic syndrome, unspec.
        (582) Chronic glomerulonephritis
            (582.9) Glomerulonephritis, chronic, unspec.
        (583) Nephritis and nephropathy, not specified as acute or chronic
        (584) Acute renal failure
            (584.5) Renal failure, acute w/ tubular necrosis
as a nice JPG/PDF (or other format) with nodes, connections and labels for the structure above.
I looked at libraries that require Graphviz to be installed and had no luck, so a native solution (e.g., using ggplot2) would be best. I was not able to put the code together using igraph. I am also new to graphics in R, with no textbook-based foundation. Any hints or advice would be appreciated.
The structure above is only an example. Other structures might have 50+ concepts to plot and be printed as a very large PDF/poster. The plot would be static (no interaction with it using the mouse).
Here is an effort at an answer. It is an approximation to what you need; the final result is the plot produced by the code below.
I made this in igraph, and the code uses a mock-up of the type of data that you describe.
library("igraph")
vertex.df <- read.table(text = "id code name
0 1000 'Diseases of the genitourinary system '
1 1580 'Nephritis, nephrotic syndrome, and nephrosis '
2 580 'Acute glomerulonephritis'
3 580.9 'Glomerulonephritis, acute, unspec.'
4 581 'Nephrotic syndrome'
5 581.9 'Nephrotic syndrome, unspec.'
6 582 'Chronic glomerulonephritis'
7 582.9 'Glomerulonephritis, chronic, unspec.'
8 583 'Nephritis and nephropathy, not specified as acute or chronic'
9 584 'Acute renal failure'
10 584.5 'Renal failure, acute w/ tubular necrosis'",
header = TRUE,
stringsAsFactors = FALSE)
vertex.df$code <- as.character( vertex.df$code )
edge.df <- read.table(text = "from to
0 1
1 2
1 4
1 6
1 8
1 9
2 3
4 5
6 7
9 10",
header = TRUE)
edges <- matrix(c(edge.df$from, edge.df$to), nc=2)
g <- graph.empty()
g <- add.vertices(g, nrow(vertex.df),
id=vertex.df$id,
code=vertex.df$code,
name=vertex.df$name)
g <- add.edges(g, t(edges))
plot(g,
layout = layout.kamada.kawai,
vertex.label = V(g)$code,
vertex.size = 35,
vertex.color = "white",
vertex.label.family = "sans")
I use the ICD Code as vertex label. This is because the long text of the disease names looks untidy when plotted at this scale.
When plotting, you can change the vertex.label parameter to V(g)$name if you want the disease name instead of the ICD code. I suspect that if you print to a large pdf and remove the vertex outlines, you may be able to get a nice looking tree. Look at ?igraph.plotting for the details on parameters that you can change.
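Since the data here is strictly hierarchical, a tree layout may read even better than Kamada-Kawai; a quick sketch reusing the graph g built above (the first vertex, code 1000, is the root):
# Reingold-Tilford layout draws the hierarchy as a tree, root at the top
plot(g,
     layout = layout.reingold.tilford(g, root = 1),
     vertex.label = V(g)$code,
     vertex.size = 35,
     vertex.color = "white",
     vertex.label.family = "sans")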
I hope it gives you a step up in your experimentation.
I have a dataset with a column that is currently being treated as a factor with 1000+ levels. These are values for the column. I would like to clean up this data.
Some values are strings like "-18 + 5 = -13" and "5 - 18 = -13"; I would like the clustering to group these differently than, say, "R3no4".
Is this possible in R? I looked at the natural language processing task view http://cran.r-project.org/web/views/NaturalLanguageProcessing.html but I need to be pushed in the right direction.
the dataset is from the kdd 2010 cup
I would like to create meaningful new columns from this column to aid in building a predictive model. For example, it would be nice to know whether the string contains a certain operation, or whether it contains no operations and instead describes the problem.
My data frame looks like this:
str(data1)
'data.frame': 809694 obs. of 19 variables:
$ Row : int 1 2 3 4 5 6 7 8 9 10 ...
$ Anon.Student.Id : Factor w/ 574 levels "02i5jCrfQK","02ZjVTxC34",..: 7 7 7 7 7 7 7 7 7 7 ...
$ Problem.Hierarchy : Factor w/ 138 levels "Unit CTA1_01, Section CTA1_01-1",..: 80 80 80 80 80 80 80 80 80 80 ...
$ Problem.Name : Factor w/ 1084 levels "1PTB02","1PTB03",..: 377 377 378 378 378 378 378 378 378 378 ...
$ Problem.View : int 1 1 1 1 2 2 3 3 4 4 ...
$ Step.Name : Factor w/ 187539 levels "-(-0.24444444-y) = -0.93333333",..: 116742 177541 104443 64186 58776 58892 153246 153078 45114 163923 ...
I'm most interested in the Step.Name feature, since it contains the greatest number of unique factor values.
and some example values for step name:
[97170] (1+7)/4 = x
[97171] (1-sqrt(1^2-4*2*-6))/4 = x
[97172] (1-sqrt(1^2-(-48)))/4 = x
[97173] (1-sqrt(1-(-48)))/4 = x
[97174] (1-sqrt(49))/4 = x
[97175] (1-7)/4 = x
[97176] x^2+15x+44 = 0
[97177] a-factor-node
[97178] b-factor-node
[97179] c-factor-node
[97180] num1-factor-node
[97181] num2-factor-node
[97182] den1-factor-node
[97183] (-15±sqrt((-15)^2-4*1*44))/2 = x
[97184] (-15+sqrt((-15)^2-4*1*44))/2 = x
[97185] (-15+sqrt((-15)^2-176))/2 = x
[97186] (-15+sqrt(225-176))/2 = x
[97187] (-15+sqrt(49))/2 = x
[97188] (-15+7)/2 = x
[97189] (-15-sqrt((-15)^2-4*1*44))/2 = x
[97190] (-15-sqrt((-15)^2-176))/2 = x
[97191] (-15-sqrt(225-176))/2 = x
[97192] (-15-sqrt(49))/2 = x
[97193] (-15-7)/2 = x
[97194] 2x^2+x = 0
[97195] a-factor-node
[97196] b-factor-node
[97197] c-factor-node
[97198] num1-factor-node
[97199] num2-factor-node
[97200] den1-factor-node
[97201] (-1±sqrt((-1)^2-4*2*0))/4 = x
[97202] (-1+sqrt((-1)^2-4*2*0))/4 = x
[97203] (-1+sqrt((-1)^2-0))/4 = x
[97204] (-1+sqrt((-1)^2))/4 = x
[97205] (-1+1)/4 = x
[97206] (-1-sqrt((-1)^2-4*2*0))/4 = x
[97207] (-1-sqrt((-1)^2-0))/4 = x
[97208] (-1-sqrt((-1)^2))/4 = x
[97209] (-1-1)/4 = x
[97210] x^2-6x = 0
[97211] a-factor-node
[97212] b-factor-node
Clustering is just scoring each instance in a data array according to some metric, sorting the array by that score, then slicing it into some number of segments and assigning a label to each one.
In other words, you can cluster any data for which you can formulate some meaningful function to calculate similarity of each data point w/r/t the others; this is usually referred to as a similarity metric.
There are a lot of these, but only a small subset of them are useful to evaluate strings. Of these, perhaps the most commonly used is Levenshtein Distance (aka Edit Distance).
This metric is expressed as an integer, and it increments by one unit (+1) for each 'edit'--inserting, deleting, or substituting a letter--required to transform one word into another. Summing those individual edits gives you the Levenshtein Distance.
The R Package vwr includes an implementation:
> library(vwr)
> levenshtein.distance('cat', 'hat')
hat
1
> levenshtein.distance('cat', 'catwalk')
catwalk
4
> levenshtein.distance('catwalk', 'sidewalk')
sidewalk
4
> # using a data set supplied with the vwr library
> EW = english.words
> ew1 = sample(EW, 20) # randomly select 20 words from EW
> # the second argument is a vector of words, returns a vector of distances
> dx = levenshtein.distance('cat', ew1)
> dx
furriers graves crooned cursively gabled caparisons drainpipes
8 5 6 8 5 8 9
patricians medially beholder chirpiness fluttered bobolink lamentably
8 7 8 9 8 8 8
depredations alights unearthed thimbles supersede dissembler
10 6 7 8 9 10
While Levenshtein Distance can be used to cluster your data, whether it should be used for your data is a question I'll leave to you (the primary use case for L/D is clearly pure text data).
(Perhaps the next-most-common similarity metric that operates on strings is Hamming Distance. Hamming Distance (unlike Levenshtein) requires that the two strings be of equal length, hence it won't work for your data.)
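If you do want to experiment with this, note that base R already ships an edit-distance function (adist in the utils package), so a minimal clustering sketch on a few of the Step.Name values above could look like this:
# a handful of example step names
steps <- c("(1+7)/4 = x", "(1-7)/4 = x", "x^2+15x+44 = 0",
           "a-factor-node", "b-factor-node", "R3no4")

d  <- adist(steps)                    # matrix of pairwise Levenshtein distances
hc <- hclust(as.dist(d), method = "average")
cutree(hc, k = 3)                     # assign each string to one of 3 clusters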
Perhaps:
> grepl("[[:alpha:]]", c("-18 + 5 = -13", "5 - 18 = -13","R3no4") )
[1] FALSE FALSE TRUE