Text mining: how to count the frequency of two words occurring close together in R

Let's say I have the following data frame, df.
speaker <- c('Lincoln','Douglas')
text <- c('The framers of the Constitution, those framers...',
'The framers of our great Constitution.')
df <- data.frame(speaker,text)
I want to find (or write) a function that can count the frequency of two words occurring close together. Say I want to count instances of "framers" occurring within three words of the word "Constitution", and vice versa.
Thus, for Lincoln, the function would return 2 because you have one instance of "framers" followed by "Constitution" and another instance of "Constitution" followed by "framers." For Douglas, the function would return 0 because "Constitution" is four words away from "framers."
I'm new to text mining, so I apologize if I'm missing an easy solution.
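One way to approach this (a sketch, not from the original post: it assumes simple tokenization on non-alphanumeric characters, and count_near is a hypothetical helper name):

count_near <- function(text, w1, w2, window = 3){
  # split into lower-case word tokens, dropping punctuation
  tokens <- unlist(strsplit(tolower(as.character(text)), "[^[:alnum:]]+"))
  tokens <- tokens[tokens != ""]
  i <- which(tokens == tolower(w1))
  j <- which(tokens == tolower(w2))
  # count every (w1, w2) pair of positions at most `window` words apart
  sum(outer(i, j, function(a, b) abs(a - b) <= window))
}

sapply(as.character(df$text), count_near, w1 = "framers", w2 = "Constitution")
# should give 2 for Lincoln and 0 for Douglas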

Related

R: how to extract the first integer or decimal number from a text, and if the first number equals a specific number, extract the second integer/decimal

The data is like this:
example - the name of the data table
detail - the first column, which contains strings with numbers in them (a number can be attached to $ etc., like 25m$, and can also be decimal, like 1.2m$ or $1.2M)
Let's say the data table looks like this:
example$detail<- c("The cole mine market worth every year 100M$ and the equipment they use worth 30$m per capita", "In 2017 the first enterpenur realized there is a potential of 500$M in cole mining", "The cole can make 23b$ per year ans help 1000000 familys living on it")
I want to add a column to the example data table, named "number", that will contain the first number extracted from the string in column "detail". BUT if this number is equal to one of the numbers in the vector "years" (it's not in the example data table; it's a separate list I created), I want it to extract the second number of the string example$detail instead.
So I created the years list (separate from the data table):
years <- c(2016:2030)
I'm trying to create the new column, "number".
What I did so far:
I managed to add a variable that extracts the first number of a string by writing the following commands:
library(stringr)
example$number <- as.integer(sub("\\D*(\\d+).*", "\\1", example$detail)) # EXTRACT ONLY INTEGERS
example$number1 <- format(round(as.numeric(str_extract(example$detail, "\\d+\\.*\\d*")), 2), nsmall = 2) # EXTRACT THE NUMBERS AS DECIMALS WITH TWO DIGITS AFTER THE . (IT'S ENOUGH FOR ME)
example$number1 <- ifelse(example$number %in% years, TRUE, example$number1) # IF THE FIRST NUMBER EXTRACTED IS IN THE YEARS VECTOR, RETURN TRUE
Then I tried to write code that extracts the second number according to this condition, and it's not working; it just returns errors.
I tried:
gsub("[^\d]*[\d]+[^\d]+([\d]+)", example$detail)
str_extract(example$detail, "\d+(?=[A-Z\s.]+$)",[[2]])
as.integer( sub("\\D*(\\d+).*", "\\1", example$detail) )
as.numeric(strsplit(example$detail, "\\D+")[1])
I don't understand how to match any number (integer/decimal), or how to match THE SECOND number in a string.
Thanks a lot!!
Since no good example data is provided I'm just going to 'wing-it' here.
Imagine the data frame df has the columns year (int) and details (char); then:
library(dplyr)

df <- df %>%
  mutate(
    # strip everything except digits, dots and minus signs
    clean_details = gsub("[^0-9.-]", "", details),
    # integer parts before and after the first dot
    clean_details_part1 = as.integer(sapply(strsplit(clean_details, "[.]"), `[`, 1)),
    clean_details_part2 = as.integer(sapply(strsplit(clean_details, "[.]"), `[`, 2))
  )
This works with the code I wrote up. I didn't apply the year logic because I see you're proficient enough to do that. I believe a simple ifelse statement would do to create a boolean, and then you can filter on that boolean, or use a more direct approach.
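For completeness, here is a minimal sketch of that year logic (my own addition, not part of the original answer; it assumes the stringr package and the example and years objects from the question):

library(stringr)
years <- 2016:2030

# pull every number (integer or decimal) out of each string
nums <- str_extract_all(example$detail, "\\d+\\.?\\d*")

# take the first number, unless it is a year, in which case take the second
example$number <- sapply(nums, function(x){
  x <- as.numeric(x)
  if (length(x) == 0) return(NA_real_)
  if (x[1] %in% years && length(x) >= 2) x[2] else x[1]
})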

Gathering the correct number of digits for numbers when text mining

I need to search for specific information within a set of documents that follows the same standard layout.
After I used grep to find the keywords in every document, I went on collecting the numbers or characters of interest.
One piece of data I have to collect is the Total Power that appears as following:
TotalPower: 986559. (UoPow)
Since I had already correctly selected this excerpt, I created the following function that takes the characters between positions n and m, where n and m start counting up from right to left.
substrRight <- function(x, n, m){
  substr(x, nchar(x) - n + 1, nchar(x) - m)
}
It's important to say that from the ":" to the number 986559, there are 2 spaces; and from the "." to the "(", there's one space.
So I wrote:
TotalP = substrRight(myDf[i],17,9) [1]
where myDf is a character vector with all the relevant observations.
Line [1], after I loop over all my observations, gives me the numbers I want, but I noticed that when the number was 986559, the result was 98655. It simply doesn't "see" 9 as the last number.
The code seems to work fine for the rest of the data. This number (986559) is indeed the highest number in the data and is the only one with order 10^5 of magnitude.
How can I make sure that I will gather all digits in every number?
Thank you for the help.
We can extract the digits before a . by using a regex lookaround:
library(stringr)
str1 <- "TotalPower:  986559. (UoPow)"  # the excerpt from the question (two spaces after the colon)
str_extract(str1, "\\d+(?=\\.)")
#[1] "986559"
The \\d+ matches one or more digits, and the lookahead (?=\\.) requires that they be followed by a literal dot.
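Since str_extract is vectorized, the same pattern can be applied to every observation at once (assuming myDf is the character vector from the question):

str_extract(myDf, "\\d+(?=\\.)")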

Find specific strings, count their frequency in a given text, and report it as a proportion of the number of words

Trying to write a function in R that would:
1) look through each observation's string variables
2) identify and count certain strings that the user defines
3) report the findings as a proportion of the total number of words each observation contains.
Here's a sample dataset:
df <- data.frame(essay1=c("OMG. american sign language. knee-slides in leather pants", "my face looks totally different every time. lol."),
essay2=c("cheez-its and dried cranberries. sparkling apple juice is pretty\ndamned coooooool too.<br />\nas for music, movies and books: the great american authors, mostly\nfrom the canon, fitzgerald, vonnegut, hemmingway, hawthorne, etc.\nthen of course the europeans, dostoyevski, joyce, the romantics,\netc. also, one of the best books i have read is all quiet on the\nwestern front. OMG. I really love that. lol", "at i should have for dinner\nand when; some random math puzzle, which I loooooove; what it means to be alive; if\nthe meaning of life exists in the first place; how the !##$ can the\npolitical mess be fixed; how the %^&* can the education system\nbe fixed; my current game design project; my current writing Lol"),
essay3=c("Lol. I enjoy life and then no so sure what else to say", "how about no?"))
The furthest I managed to get is these two functions:
find.query <- function(char.vector, query){
  which.has.query <- grep(query, char.vector, ignore.case = TRUE)
  length(which.has.query) != 0
}

profile.has.query <- function(data.frame, query){
  query <- tolower(query)
  has.query <- apply(data.frame, 1, find.query, query = query)
  return(has.query)
}
This lets the user detect whether a given value appears in the essays for a given user, but that's not enough for the three goals outlined above. What this function would ideally do is count the number of words identified, then divide that count by the total count of words in the overall essays (the row sum of counts for each user).
Any advice on how to approach this?
Using the stringi package as in this post:
How do I count the number of words in a text (string) in R?
library(stringi)
words.identified.over.total.words <- function(dataframe, query){
  # make the query (and, below, the text) all lower-case so matching is case-insensitive
  query <- tolower(query)
  # count the total number of words (runs of non-whitespace) in each cell
  total.words <- apply(dataframe, 2, stri_count, regex = "\\S+")
  # count the number of words matching the query in each cell
  number.query <- apply(dataframe, 2, function(col) stri_count(tolower(col), regex = query))
  # divide the matches by the total words for each column
  final.result <- colSums(number.query) / colSums(total.words)
  return(final.result)
}
(The df in your question has each essay in a column, so the function sums each column. However, in the text of your question you say you want row sums. If the input data frame was meant to have one essay per row, then you can change the function to reflect that.)
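A quick usage sketch with the df from the question (shape of the output only; the exact proportions depend on the essays):

words.identified.over.total.words(df, "lol")
# a named numeric vector with one proportion per column (essay1, essay2, essay3),
# each equal to that column's "lol" match count divided by its total word count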

How to create a word grouping report using the R language and .Net?

I would like to create a simple application in C# that takes in a group of words, then returns all groupings of those individual words from a data set.
For example, given car and bike, return a list of groups/combinations of words (with the number of combinations found) from a data set.
To further clarify - given a category named "car", I would like to see a list of word groupings with the word "car". This category could also be several words rather than just one.
With a sample data set of:
CAR:
Another car for sale
Blue car on the horizon
For Sale - used car
this car is painted blue
should return
car : for sale : 2
car : blue : 2
I'd like to set a threshold, say 20 or greater, so if there are 20 or more instances of the word(s) with car, then display them - category, words, count - where only the category is known; the words and count are determined by the algorithm.
The data set is in a SQL Server 2008 table, and I was hoping to use something like a .Net implementation of R to accomplish this.
I am guessing that the best way to accomplish this may be with the R programming language, and am only now looking at R.Net.
I would prefer to do this with .Net, as that is what I am most familiar with, but am open to suggestions.
Can someone with some experience with this lead me in the right direction?
Thanks.
It seems your question consists of 4 parts:
1) Getting data from SQL Server 2008
2) Extracting substrings from a set of strings
3) Setting a threshold for when to accept that number
4) Producing some document or other output (?) containing this.
For 1, I think that's a different question (see the RODBC package), but I won't be dealing with that here as that's not the main part of your question. You've left 4. a little vague and I think that's also peripheral to the meat of your question.
Part 2 can be easily dealt with using regular expressions:
countstring <- function(string, pattern){
  stringcount <- sum(grepl(pattern, string, ignore.case = TRUE), na.rm = TRUE)
  paste(deparse(substitute(string)), pattern, stringcount, sep = " : ")
}
This function takes a vector of strings and a pattern to search for. It finds which of the strings match and sums the number that do (i.e. the count). It then pastes these together into one string. For example:
car <- c("Another car for sale", "Blue car on the horizon", "For Sale - used car", "this car is painted blue")
countstring(car, "blue")
## [1] "car : blue : 2"
Part 3 requires a small change to the function
countstring <- function(string, pattern, threshold = 20){
  stringcount <- sum(grepl(pattern, string, ignore.case = TRUE), na.rm = TRUE)
  if(stringcount >= threshold){
    paste(deparse(substitute(string)), pattern, stringcount, sep = " : ")
  }
}
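For example, with the car vector from above (a usage sketch; the threshold is lowered so the two matches clear it, since with the default of 20 the if branch is skipped and the function returns NULL invisibly):

countstring(car, "blue", threshold = 2)
## [1] "car : blue : 2"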

Counting specific characters in a string across a data frame with sapply

I have found similar problems to this here:
Count the number of words in a string in R?
and here
Faster way to split a string and count characters using R?
but I can't get either to work in my example.
I have quite a large dataframe. One of the columns has genomic locations for features and the entries are formatted as follows:
[hg19:2:224840068-224840089:-]
[hg19:17:37092945-37092969:-]
[hg19:20:3904018-3904040:+]
[hg19:16:67000244-67000248,67000628-67000647:+]
I am splitting these entries out into their individual elements to get the following (i.e. for the first entry):
hg19 2 224840068 224840089 -
But in the case of the fourth entry, I would like to parse this into two separate locations, i.e.
hg19:16:67000244-67000248,67000628-67000647:+]
becomes
hg19 16 67000244 67000248 +
hg19 16 67000628 67000647 +
(with all the associated data in the adjacent columns filled in from the original)
An easy way for me to identify which rows need this action is to simply count the rows with commas ',' as they don't appear in any other text in any other columns, except where there are multiple genomic locations for the feature.
However I am failing at the first hurdle because the sapply command incorrectly returns '1' for every entry.
testdat$multiple <- sapply(gregexpr(",", testdat$genome_coordinates), length)
(or)
testdat$multiple <- sapply(gregexpr("\\,", testdat$genome_coordinates), length)
table(testdat$multiple)
1
4
Using the example I have posted above, I would expect the output to be
testdat$multiple
0
0
0
1
Actually doing
grep -c
on the same data in the command line shows I have 10 entries containing ','.
So initially I would like to get this working, but I am also a bit stumped for ideas as to how to then extract the two (or more) locations and put them on their own rows, filling in the adjacent data.
Actually, what I intended to do was stick to something I know (on the command line): grep out the rows with ',', duplicate the file, split and awk the selected columns (first and second locations in the respective files), then cat and sort them. If there is a niftier way for me to do this in R, then I would love a pointer.
gregexpr does in fact return an object of length 1 for every string here: when there is no match it returns -1, which is still a length-1 vector, so taking the length gives 1 for every entry. If you want to find the rows which have a match vs. the ones which don't, you need to look at the returned value, not the length. A match failure returns -1.
Try foo <- sapply(gregexpr(",", testdat$genome_coordinates), function(m) m[1] > 0) to get a logical vector marking the rows with a comma (note that as.logical would not work here, since as.logical(-1) is TRUE).
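And if you want the comma count itself, to reproduce the expected 0 0 0 1 output above, one option (a sketch, assuming genome_coordinates is a character column) is to count the positive match positions per row:

testdat$multiple <- sapply(gregexpr(",", testdat$genome_coordinates),
                           function(m) sum(m > 0))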
