Count how many times specific words are used - r

I want to perform text mining on several bank account descriptions. My first step would be to get a ranking of the words that are used most often in the descriptions.
So let's say I have a data frame that looks like this:
a b
1 1 House expenses
2 2 Office furniture bought
3 3 Office supplies ordered
Then I want to create a ranking of the use of the words. Like this:
Name Times
1. Office 2
2. Furniture 1
Etc...
Any thoughts on how I can quickly get an overview of the words that are used most in the description?

Another approach is to use the tm package.
First create a corpus from the description column (assumed here to be column b of a data frame called data) and build a document-term matrix:
library(tm)
corpus <- Corpus(VectorSource(data$b))
dtm <- DocumentTermMatrix(corpus)
dtmDataFrame <- as.data.frame(as.matrix(dtm))
By default, DocumentTermMatrix uses plain term-frequency weighting ("weightTf"). The last line converts the document-term matrix into a data frame.
You now have one row per document and one column per term, and each value is the frequency of that term in that document. To build the ranking, sum each column:
colSums(dtmDataFrame)
You can sort the result afterwards. The advantage of tm is that it makes it easy to filter words out and to preprocess the text: removing stop words and punctuation, stemming, and dropping sparse terms if you need to.

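# Example data
d <- data.frame(a = c(1, 2, 3), b = c("1 House expenses", "2 Office furniture bought", "3 Office supplies ordered"), stringsAsFactors = FALSE)
# Split every description into words
e <- unlist(strsplit(d$b, " "))
# Drop empty strings left behind by repeated spaces
f <- e[!e %in% c("")]
# Count how often each word occurs
g <- sapply(f, function(x) sum(f %in% c(x)))
h <- data.frame(Name = names(g), Times = g)
# Keep one row per word
h[!duplicated(h), ]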
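A shorter route to the same ranking is table() followed by sort(). This is only a sketch, not the answerer's code: it assumes the descriptions are in a character column b without the leading row numbers, and the names words, counts and ranking are made up for illustration.

```r
# Example data modelled on the question (descriptions only, no row numbers)
d <- data.frame(
  a = 1:3,
  b = c("House expenses", "Office furniture bought", "Office supplies ordered"),
  stringsAsFactors = FALSE
)

# Split on runs of whitespace and drop any empty strings
words <- unlist(strsplit(d$b, "\\s+"))
words <- words[words != ""]

# Count each word and sort from most to least frequent
counts <- sort(table(words), decreasing = TRUE)

# Turn the counts into a two-column ranking
ranking <- data.frame(
  Name = names(counts),
  Times = as.integer(counts),
  stringsAsFactors = FALSE
)
ranking
```

Note that the counting is case-sensitive, so "Office" and "office" would be ranked separately unless you apply tolower() first.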

Related

Calculate similarity within a dataframe across specific rows (R)

I have a dataframe that looks something like this:
df <- data.frame("index" = 1:10, "title" = c("Sherlock","Peaky Blinders","Eastenders","BBC News", "Antiques Roadshow","Eastenders","BBC News","Casualty", "Dragons Den","Peaky Blinders"), "date" = c("01/01/20","01/01/20","01/01/20","01/01/20","01/01/20","02/01/20","02/01/20","02/01/20","02/01/20","02/01/20"))
The output looks like this:
Index Title Date
1 Sherlock 01/01/20
2 Peaky Blinders 01/01/20
3 Eastenders 01/01/20
4 BBC News 01/01/20
5 Antiques Roadshow 01/01/20
6 Eastenders 02/01/20
7 BBC News 02/01/20
8 Casualty 02/01/20
9 Dragons Den 02/01/20
10 Peaky Blinders 02/01/20
I want to be able to determine the number of times that a title appears on different dates. In the example above, "BBC News", "Peaky Blinders" and "Eastenders" all appear on 01/01/20 and 02/01/20. The similarity between the two dates is therefore 60% (3 out of 5 titles are identical across both dates).
It's probably also worth mentioning that the actual dataframe is much larger, and has 120 titles per day, and spans some 700 days. I need to compare the "titles" of each "date" with the previous "date" and then calculate their similarity. So to be clear, I need to determine the similarity of 01/01/20 with 02/01/20, 02/01/20 with 03/01/20, 03/01/20 with 04/01/20, and so on...
Does anyone have any idea how I might go about doing this? My eventual aim is to use Tableau to visualise similarity/difference over time, but I fear that such a calculation would be too complicated for that particular software and I'll have to somehow add it into the actual data itself.
Here is another possibility. You can create a simple function to calculate the similarity or other index between groups. Then, split your data frame by date into a list, and lapply the custom function to each in the list (final result will be a list).
calc_similar <- function(i) {
sum(s[[i]] %in% s[[i-1]])/length(s[[i-1]])
}
s <- split(df$title, df$date)
setNames(lapply(seq_along(s)[-1], calc_similar), names(s)[-1])
Output
$`2020-01-02`
[1] 0.6
I have come up with this solution. However, I'm unsure how it will behave when the number of records per day differs (e.g. 8 titles for day n and 15 titles for day n+1). I guess you would want to normalize with respect to the day with more records, so that is what the code below does:
divide <- split.data.frame(df, as.factor(df$date))
similarity <- vector()
for(i in 1:(length(divide)-1)){
index <- sum((divide[[i]]$title) %in% divide[[i+1]]$title)/max(c(length(divide[[i]]$title), length((divide[[i+1]]$title))))
similarity <- c(similarity, index)
}
similarity
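One thing to watch in both answers: splitting on the raw "dd/mm/yy" strings orders the groups alphabetically, which stops being chronological once the data spans more than one month. The sketch below converts the column to Date first and returns a data frame that could be exported for plotting. It is an illustration under assumptions, not a definitive implementation: by_day and result are hypothetical names, and the denominator is the number of distinct titles on the current day, which matches the answers above only when every day has the same number of titles.

```r
# Example data from the question
df <- data.frame(
  title = c("Sherlock", "Peaky Blinders", "Eastenders", "BBC News",
            "Antiques Roadshow", "Eastenders", "BBC News", "Casualty",
            "Dragons Den", "Peaky Blinders"),
  date = rep(c("01/01/20", "02/01/20"), each = 5),
  stringsAsFactors = FALSE
)

# Parse the dates so that the groups sort chronologically
df$date <- as.Date(df$date, format = "%d/%m/%y")

# One character vector of titles per day
by_day <- split(df$title, df$date)

# Share of each day's distinct titles that also appeared on the previous day
similarity <- vapply(seq_along(by_day)[-1], function(i) {
  mean(unique(by_day[[i]]) %in% by_day[[i - 1]])
}, numeric(1))

result <- data.frame(
  date = as.Date(names(by_day)[-1]),
  similarity = similarity
)
result
```

This compares each date with the previous date that is present in the data, so a missing day is silently skipped rather than treated as zero similarity.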

Extract words starting with # in R dataframe and save as new column

My dataframe column looks like this:
head(tweets_date$Tweet)
[1] b"It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
These are tweets that contain mentions starting with '#'. I need to extract all of them and save the mentions of each tweet as a single string such as "#mention1 #mention2". Currently my code just extracts them as lists.
My code:
tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "#\\w+")
How do I collapse the list in each row into a single string separated by spaces, as described above?
Thanks in advance.
I think it would be best to use an AsIs list column in this case:
extract words:
library(stringr)
Mentions <- str_extract_all(lis, "#\\w+")
some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
create a list column:
df$Mentions <- I(Mentions)
df
#output
col lett Mentions
1 1 A #DineshK....
2 2 B #IPL, #p....
3 3 C
4 4 D
5 5 E #ChennaiIPL
6 6 F
I think this is better since it allows for quite easy subsetting:
df$Mentions[[1]]
#output
[1] "#DineshKarthik" "#KKRiders"
df$Mentions[[1]][1]
#output
[1] "#DineshKarthik"
and it succinctly shows what's inside the column when printing the df.
data:
lis <- c("b'It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")
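The str_extract_all function from the stringr package returns a list of character vectors. If you instead want a single comma-separated string per tweet, you can collapse each vector with sapply and paste from base R (use collapse = " " for the space-separated form asked for in the question):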
tweets <- str_extract_all(tweets_date$Tweet, "#\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
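For the space-separated format the question asks for, the same idea works without stringr. The sketch below swaps in base R's regmatches() and gregexpr() for str_extract_all(); the tweets vector is shortened sample text and the variable names are hypothetical.

```r
# Shortened sample tweets based on the question
tweets <- c(
  "It is #DineshKarthik's birthday, captain of #KKRiders",
  "CHAMPIONS!!",
  "CHAMPIONS - 2018 #IPLFinal"
)

# Extract every '#word' match: one character vector per tweet
tags <- regmatches(tweets, gregexpr("#\\w+", tweets))

# Collapse each vector into a single space-separated string
mentions <- vapply(tags, paste, character(1), collapse = " ")
mentions
```

Tweets without any match end up as an empty string, so the result is a plain character vector of the same length as the input and can be assigned directly to a data frame column.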
Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that email addresses can be in tweets as can URLs with #'s in them (and not just the silly URLs with username/password in the host component). Thus, something like:
(^|[^[[:alnum:]_]#/\\!?=&])#([[:alnum:]_]{1,15})\\b
is likely a better, safer choice

Extracting a value based on multiple conditions in R

Quick question - I have a dataframe (severity) that looks like,
industryType relfreq relsev
1 Consumer Products 2.032520 0.419048
2 Biotech/Pharma 0.650407 3.771429
3 Industrial/Construction 1.327913 0.609524
4 Computer Hardware/Electronics 1.571816 2.019048
5 Medical Devices 1.463415 3.028571
6 Software 0.758808 1.314286
7 Business/Consumer Services 0.623306 0.723810
8 Telecommunications 0.650407 4.247619
if I wanted to pull the relfreq of Medical Devices (row 5) - how could I subset just that value?
I was thinking about just indexing and doing severity$relfreq[[5]], but I'd be using this line in a bigger function where the user would specify the industry i.e.
example <- function(industrytype) {
weight <- relfreq of industrytype parameter
thing2 <- thing1*weight
return(thing2)
}
So if I do subset by an index, is there a way R would know which index corresponds to the industry type specified in the function parameter? Or is it easier/a way to just subset the relfreq column by the industry name?
You first need to select the row of interest and then keep the two columns you want (industryType and relfreq).
The tidyverse lets you write this in a readable way:
library(tidyverse)
data_want <- severity %>%
  filter(industryType == "Medical Devices") %>%
  select(industryType, relfreq)
The pipe operator %>% passes the result of each step on to the next one, so the code reads from top to bottom instead of being nested.
I think it is simpler to select the whole row first and then pick the column you want to see.
frame <- severity[severity$industryType == 'Medical Devices',]
frame$relfreq
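To use this inside a function where the user supplies the industry, you can look the value up by name instead of by row index. The following is a minimal sketch with assumptions: get_weight is a hypothetical helper, thing1 is passed in as an argument because the question does not say where it comes from, and the data frame contains only three of the rows from the question.

```r
# A few rows of the severity table from the question
severity <- data.frame(
  industryType = c("Consumer Products", "Biotech/Pharma", "Medical Devices"),
  relfreq = c(2.032520, 0.650407, 1.463415),
  relsev = c(0.419048, 3.771429, 3.028571),
  stringsAsFactors = FALSE
)

# Look up relfreq by industry name (hypothetical helper)
get_weight <- function(industrytype, data = severity) {
  weight <- data$relfreq[data$industryType == industrytype]
  # Fail loudly if the name is missing or appears more than once
  if (length(weight) != 1) {
    stop("Expected exactly one row for industry: ", industrytype)
  }
  weight
}

# The function from the question, with thing1 as an explicit argument
example <- function(industrytype, thing1) {
  thing1 * get_weight(industrytype)
}

example("Medical Devices", 2)
```

The explicit length check is a design choice: a misspelled industry name would otherwise return a zero-length vector and propagate silently through the rest of the calculation.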

row wise count the number of the words in a review text in an R dataframe

I want to count the number of words in each row:
Review_ID Review_Date Review_Content Listing_Title Star Hotel_Name
1 1/25/2016 I booked both the Crosby and Four Seasons but decided to cancel the Four Seasons closer to the arrival date based on reviews. Glad I did. The Crosby is an outstanding hotel. The rooms are immaculate and luxurious, with real attention to detail and none of the bland furnishings you find in even the top chain hotels. Staff on the whole were extremely attentive and seemed to enjoy being there. Breakfast was superb and facilities at ground level gave an intimate and exclusive feel to the hotel. It's a fairly expensive place to stay but is one of those hotels where you feel you're getting what you pay for, helped by an excellent location. Hope to be back! Outstanding 5 Crosby Street Hotel
2 1/18/2016 We've stayed many times at the Crosby Street Hotel and always have an incredible, flawless experience! The staff couldn't be more accommodating, the housekeeping is immaculate, the location's awesome and the rooms are the coolest combination of luxury and chic. During our most recent trip over The New Years holiday, we stayed in the stunning Crosby Suite which has the most extraordinary, gorgeous decor. The Crosby remains our absolute favorite in NYC. Can't wait to return! Always perfect! 5 Crosby Street Hotel
I was thinking something like:
WordFreqRowWise %>%
rowwise() %>%
summarise(n = n())
To get the results something like..
Review_ID Review_Content total_Words Min_occrd_word Max Average
1 .... 230 great: 1 the: 25 total_unique/total_words in the row
But I have no idea how to do it.
Here is a method in base R using strsplit and sapply. Let's say the data is stored in a data.frame df and the reviews are stored in the variable Review_Content
# break up the strings in each row by " "
temp <- strsplit(df$Review_Content, split=" ")
# count the number of words as the length of the vectors
df$wordCount <- sapply(temp, length)
In this instance, sapply will return a vector of the counts for each row.
Since the word count is now stored as a column, you can run whatever analysis you want on it. Here are some examples:
summarize the distribution of word counts: summary(df$wordCount)
maximum word count: max(df$wordCount)
mean word count: mean(df$wordCount)
range of word counts: range(df$wordCount)
interquartile range of word counts: IQR(df$wordCount)
Adding to @lmo's answer above:
The code below generates a data frame that contains every word in each row together with its frequency in that row:
temp2 <- data.frame()
for (i in seq_along(temp)) {
  # Frequency table of the words in row i
  temp1 <- as.data.frame(table(temp[[i]]))
  # Label the rows with the review they came from
  temp1$ID <- paste0("Row_", i)
  temp2 <- rbind(temp2, temp1)
}
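The per-row summary sketched in the question (total words, most frequent word, share of unique words) can be built from the same strsplit() result. This is a rough sketch on made-up review text; the column names are hypothetical, and lowercasing before splitting is an added assumption so that "The" and "the" count as one word.

```r
# Made-up review text standing in for Review_Content
reviews <- c(
  "the room was great and the staff was great",
  "great location"
)

# Lowercase, then split each review on runs of whitespace
words <- strsplit(tolower(reviews), "\\s+")

stats <- data.frame(
  # Number of words in each review
  total_words = lengths(words),
  # Number of distinct words in each review
  unique_words = vapply(words, function(w) length(unique(w)), integer(1)),
  # Most frequent word (ties go to the first word in sorted order)
  top_word = vapply(words, function(w) names(which.max(table(w))), character(1)),
  stringsAsFactors = FALSE
)

# Share of distinct words, as asked for in the question
stats$unique_ratio <- stats$unique_words / stats$total_words
stats
```

Punctuation stays attached to the words here ("hotel." and "hotel" differ), so you may want to strip it with gsub() before splitting on real reviews.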

How to write a for-loop that searches names from data.frame in a character vector?

I have a data.frame with names of football players, for example:
names <- data.frame(id=c(1,2,3,4,5,6,7),
year=c('Maradona', 'Cruyff', 'Messi', 'Ronaldo', 'Pele', 'Van Basten', 'Diego'))
> names
id year
1 1 Maradona
2 2 Cruyff
3 3 Messi
4 4 Ronaldo
5 5 Pele
6 6 Van Basten
7 7 Diego
I also have 6,000 scraped text files containing stories about these football players. These stories are stored as 6,000 elements in a large vector called stories.
Is there a way to write a loop (or an apply function) that searches each story for the names of the football players? If one or more matches occur, I would like to record the element number and the name(s) of the football player(s).
For example, consider the following text in stories[1]:
Diego Armando Maradona (born 30 October 1960) is a retired Argentine
professional footballer. He has served as a manager and coach at other
clubs as well as the national team of Argentina. Many in the sport,
including football writers, former players, current players and
football fans, regard Maradona as the greatest football player of all
time. He was joint FIFA Player of the 20th Century
with Pele.
The ideal data.frame would have the following structure:
> outcome
element name1 name2
1 1 Maradona Pele
Does somebody know how to write code that produces one data.frame with this information for all football players?
I did it with a loop, but you could probably do it with an apply function as well.
# Include stringsAsFactors = FALSE so the names are stored as character strings, not factors
football_names <- data.frame(id = 1:7,
  year = c('Maradona', 'Cruyff', 'Messi', 'Ronaldo', 'Pele', 'Van Basten', 'Diego'), stringsAsFactors = FALSE)
# One row per story
outcome <- data.frame(element = seq_along(stories))
for (i in seq_along(stories)) {
  # Names that occur as whole space-separated words in story i
  # (multi-word names such as 'Van Basten', or names followed by punctuation, will not match)
  names_in_story <- football_names$year[football_names$year %in% unlist(strsplit(stories[i], split = " "))]
  # seq_along() skips the inner loop when no name is found
  for (j in seq_along(names_in_story)) {
    outcome[i, j + 1] <- names_in_story[j]
  }
}
names(outcome) <- c("element", paste0("name", 1:(ncol(outcome) - 1)))
I don't understand your question exactly, but you can try string matching with a stringr function and lapply.
I assumed that your data stories is a list.
The function finds every name you pass in as a vector and counts its occurrences. The output is again a list.
library(stringr)
foo <- function(x, y) table(unlist(str_match_all(x, paste0(y, collapse = "|"))))
The result:
res <- lapply(stories, foo, names$year)
Then you can merge the data and sum it up (with rowSums()), for example like this:
Reduce(function(...) merge(..., all = TRUE, by = "Var1"), res)
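If multi-word names such as "Van Basten" matter, substring matching with grepl(fixed = TRUE) avoids splitting the story into words. The sketch below is one possible approach in base R, not a definitive implementation: the stories are shortened made-up text, player_names and found are hypothetical names, and the result is in long format (one row per story and name) rather than the wide name1/name2 layout from the question.

```r
player_names <- c("Maradona", "Cruyff", "Messi", "Ronaldo", "Pele",
                  "Van Basten", "Diego")

# Shortened made-up stories
stories <- c(
  "Diego Armando Maradona was joint FIFA Player of the 20th Century with Pele.",
  "Van Basten scored while Cruyff watched.",
  "No players are mentioned here."
)

# For each story, keep the names that occur anywhere in the text
found <- lapply(stories, function(s) {
  hits <- vapply(player_names, grepl, logical(1), x = s, fixed = TRUE)
  player_names[hits]
})

# Long format: one row per (story, name) pair; stories without a match drop out
outcome <- data.frame(
  element = rep(seq_along(found), lengths(found)),
  name = unlist(found),
  stringsAsFactors = FALSE
)
outcome
```

Because this is plain substring matching, a short name also matches inside longer words, so you may want a regular expression with word boundaries if that becomes a problem.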
