How do I turn this code into a working function? - r

This is my data
[1] "the rooms were clean very comfortable and the staff was amazing they went over and beyond to help make our stay enjoyable i highly recommend this hotel for anyone visiting downtown "
[2] "excellent property and very convenient to activities front desk staff is extremely efficient, pleasant and helpful property is clean and has a fantastic old time charm "
[3] "the rooftop cafeteria of hotel was great. wen i say food was great "
I want to create a function that returns the count of positive sentiments per row. This is my code so far.
x=data$sentences[1:3]
library(tidytext)
library(tm)
bing <- get_sentiments("bing")
positive = bing %>% filter(sentiment %in% "positive")
positive = subset(positive, select = -c(sentiment))
positive = as.vector(positive$word)
positive = paste0(positive," ")
positive_reviews <- function(x) {
data = as.vector(x)
data = Corpus(VectorSource(data))
data = tm_map(data, removePunctuation)
data = as.character(data)
positive_count = sapply(positive, function(x) str_count(data,x))
return(sum(positive_count))
}
print(positive_reviews(x))
I am running into an error that I do not know how to fix.
Error: no function to return from, jumping to top level
How would I write my code to make this work?
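A minimal sketch of one way to get a count per row, in base R only. The error above usually appears when return() is evaluated outside a function body (for example when the lines are run one at a time), so the version below keeps everything inside a single function. The name count_positive and the short positive_words vector are hypothetical stand-ins; in practice the lexicon would be the word column from get_sentiments("bing") filtered to positive entries.

```r
# Hypothetical stand-in for the positive words from the bing lexicon
positive_words <- c("clean", "comfortable", "amazing", "enjoyable",
                    "recommend", "excellent", "great")

# Returns one integer per input sentence: the number of tokens found in `lexicon`
count_positive <- function(sentences, lexicon) {
  # Remove punctuation and lower-case so "great." matches "great"
  cleaned <- tolower(gsub("[[:punct:]]", "", sentences))
  # Split each sentence into words on whitespace
  tokens <- strsplit(trimws(cleaned), "\\s+")
  # Count the tokens that appear in the lexicon, row by row
  vapply(tokens, function(words) sum(words %in% lexicon), integer(1))
}

reviews <- c("the room was clean and great",
             "nothing to report here",
             "great. great food")
count_positive(reviews, positive_words)
```

Matching whole tokens with %in% avoids the substring hits that str_count on "word " can produce, and vapply keeps one result per row instead of a single sum.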

Related

Text Mining Scraped Data (R)

I wrote the code below to look for the word "nationality" in a job postings dataset, where I am essentially trying to see how many employers specify that a given candidate must be of a particular visa type or nationality.
I know that in the raw data itself (in Excel), there are several job descriptions where the word "nationality" is mentioned.
nationality_finder = function(string){
nationality = c(" ")
split_string = strsplit(string, split = NULL)
split_string = split_string[[1]]
flag = 0
for(letter in split_string){
if(flag > 0){nationality = append(nationality, letter)}
if(letter == "nationality "){flag = 1}
if(letter == " "){flag = flag-0.5}
}
nationality = paste(nationality, collapse = '')
return(nationality)
}
for(n in 1:length(df2$description)){
df2$nationality[n] <- nationality_finder(df2$description[n])
}
df2%>%
view()
The code runs without errors, but it is not producing what I am looking for. I essentially want to create another variable where 1 indicates that the word "nationality" is mentioned, and 0 otherwise. Specifically, I am looking for words such as "citizen" and "nationality" in the job description variable. The text under each job description is extremely long, so here I just give a summarized version for brevity.
Text example for a job description in the dataset
Title: Event Planner
Nationality: Saudi National
Location: Riyadh, Saudi Arabia
Salary: Open
Salary depends on the candidates skills, experience, and other attributes.
Another job description:
- Have recently graduated or looking for a career change and be looking for
an entry level role (we will offer full training)
- Priority will be taken for applications by U.S. nationality holders
You can try something like this. I'm assuming you have a data.frame called dats, and you want to add a new column.
dats$check <- as.numeric(grepl("nationality",dats$description,ignore.case=TRUE))
dats$check
[1] 1 1 0 1
grepl() is going to detect in the column dats$description the string nationality, ignoring case (ignore.case = TRUE) and as.numeric() is going to convert TRUE FALSE into 1 0.
With fake data:
dats <- structure(list(description = c("Title: Event Planner\n \n Nationality: Saudi National\n \n Location: Riyadh, Saudi Arabia\n \n Salary: Open\n \n Salary depends on the candidates skills, experience, and other attributes.",
"- Have recently graduated or looking for a career change and be looking for\n an entry level role (we will offer full training) \n \n - Priority will be taken for applications by U.S. nationality holders ",
"do not have that word here", "aaaaNationalitybb"), check = c(1,
1, 0, 1)), row.names = c(NA, -4L), class = "data.frame")
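Since the question also mentions "citizen", here is a small sketch of how the same grepl() call could cover several keywords at once by joining them with the regex alternation operator. The toy data and the keywords vector are assumptions for illustration only.

```r
# Toy data; the column name mirrors the answer above
dats <- data.frame(
  description = c(
    "Nationality: Saudi National",
    "Applicants must be a U.S. citizen",
    "do not have that word here"
  ),
  stringsAsFactors = FALSE
)

# Assumed keyword list; extend as needed
keywords <- c("nationality", "citizen")

# "nationality|citizen" matches a description containing either word
pattern <- paste(keywords, collapse = "|")

# TRUE/FALSE converted to 1/0
dats$check <- as.integer(grepl(pattern, dats$description, ignore.case = TRUE))
dats$check
```

As with the single-word version, this is a substring match, so "citizenship" would also be flagged.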

Removing Apostrophies while text mining, or don't

This is absolutely driving me nuts, and I'm ashamed to say that I've spent the last 3 hours trying to figure this out.
I'm mining Twitter data, and I'd like to do some text analysis, but words like "doesn't" are throwing me off. I either want to keep the ' or replace it with an empty string (""). I've tried:
tweets$text <- gsub("\'", "", tweets$text)
tweets$text <- gsub("\\'", "", tweets$text)
tweets$text <- gsub("'", "", tweets$text)
tweets$text <- gsub("\W", "", tweets$text)
What I WANT is doesn't -> doesnt OR doesn't.
I want to remove the rest of the special characters, but because what comes after ' changes the word, I want to keep that. Later in the code I'm using gsub("[^A-Za-z]", " ", twt_txt_url) to clean the special characters.
I shared this code earlier for a different question, but it should still get the point across. Note that it is split into two code blocks: one to pull the data, and one to show how I'm cleaning it.
PULLING DATA:
library(rtweet)
library(tidyverse)
library(httpuv)
# API access keys
app_name = "app_name"
consumer_key <- "consumer_key"
consumer_secret <- "consumer_secret"
access_token <- "access_token"
access_secret <- "access_secret"
# Create twitter connection
create_token(app = app_name,
consumer_key = consumer_key,
consumer_secret = consumer_secret,
access_token = access_token,
access_secret = access_secret)
# who do we want to observe
account <- "@BillGates"
# 3200 is the max we can pull at once
account.timeline <- get_timeline(account, n=100, includeRts =TRUE)
# create data frame and csv from tweets
write_as_csv(account.timeline, "BillGates.csv", fileEncoding = "UTF-8")
CLEANING DATA
library(tidyverse)
library(qdapRegex)
library(tm)
library(qdap)
library(wordcloud)
tweets <- read_csv("BillGates.csv")
# This is where I'm trying to remove the ' from words
tweets$text <- gsub("\\'","", tweets$text)
# Separate out the text column
twt_txt <- tweets$text
# remove URLs
twt_txt_url <- rm_twitter_url(twt_txt)
# remove special characters
twt_txt_chrs <- gsub("[^A-Za-z]", " ", twt_txt_url)
# convert to a text corpus
twt_corpus <- twt_txt_chrs %>%
VectorSource() %>%
Corpus()
Here's some of the data that you can manipulate as well. It seems like I can remove the first ', but nothing after that.
df = data.frame(
tweet = c(1, 2, 3, 4, 5),
text = c(
"Standing up for science has never been more important. Congratulations to Dr. Anthony Fauci and Dr. Salim Abdool Karim on receiving this honor.",
"I've known and learned from #RonConway for more than 40 years. I'm glad to see #svangel team up with #bchesky to mentor and support companies working to create more economic empowerment opportunities for people across the world.",
"This book has nothing to do with viruses or pandemics. But it is surprisingly relevant for these times. #exlarson provides a brilliant and gripping account of another era of widespread anxiety: the years 1940 and 1941.",
"The season finale of our podcast features two incredible people who are using their positions as artists to change the world for the better.",
"Like many people, I’ve tried to deepen my understanding of systemic racism in recent months. If you’re interested in learning more about the lives caught up in our country's justice system, I highly recommend #thenewjimcrow by Michelle Alexander."
)
)
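One likely cause, assuming the tweets contain typographic apostrophes: in the sample above, "I’ve" and "you’re" use the curly character U+2019, while "country's" uses the straight '. A pattern containing only the straight quote never matches the curly one, and the later gsub("[^A-Za-z]", " ", ...) then turns it into a space. A minimal sketch that removes both forms (the function name strip_apostrophes is made up for this example):

```r
# Remove both the straight apostrophe (') and the typographic one (U+2019)
strip_apostrophes <- function(text) {
  gsub("['\u2019]", "", text)
}

sample_text <- "I\u2019ve seen our country's system and it doesn\u2019t work"
strip_apostrophes(sample_text)
```

Running this before the special-character cleanup keeps "doesnt" as a single token instead of splitting it into "doesn" and "t".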

Split a sentence to word columns using loops or functions in R?

I have a dataframe corpus in R (sample structure below), and I want to create n-grams (up to 5-grams) using loops or functions. Currently, I am doing it manually in this way:
Sample corpus structure:
{"colleagues were also at the other two events in aberystwyth and flint and by all accounts had a great time",
"the lineup was whittled down to a more palatable five in when the bing crosby souffle going my way bested both gaslight and double indemnity proving oscar voters have always had a taste for pabulum",
"felt my first earthquake today whole building at work was shaking",
"she is the kind of mother friend and woman i aspire everyday to be",
"she was processed and released pending a court appearance",
"watching some sunday night despite the sadness i have been feeling i also feel very blessed and happy to be carrying another miracle",
"every night when we listen to poohs heartbeat our hearts feel so much happiness and peace",}
onegram <- NGramTokenizer(corpusdf, Weka_control(min=1, max=1))
onegram <- data.frame(table(onegram))
onegram <- onegram[order(onegram$Freq, decreasing = TRUE),]
colnames(onegram) <- c("Word", "Freq")
onegram [1:15,]
bigram <- NGramTokenizer(corpusdf, Weka_control(min=2, max=2, delimiters = tokendelim))
bigram <- data.frame(table(bigram))
bigram <- bigram[order(bigram$Freq, decreasing = TRUE),]
colnames(bigram) <- c("Word", "Freq")
bigram [1:15,]
Any ideas?
I don't know the function NGramTokenizer and couldn't get it to work. So here is a solution in quanteda, which produces individual tokens objects for each iteration (gram_1 for onegram, gram_2 for bigrams and so on):
corpusdf <- data.frame(text = c("colleagues were also at the other two events in aberystwyth and flint and by all accounts had a great time", "the lineup was whittled down to a more palatable five in when the bing crosby souffle going my way bested both gaslight and double indemnity proving oscar voters have always had a taste for pabulum", "felt my first earthquake today whole building at work was shaking", "she is the kind of mother friend and woman i aspire everyday to be", "she was processed and released pending a court appearance", "watching some sunday night despite the sadness i have been feeling i also feel very blessed and happy to be carrying another miracle", "every night when we listen to poohs heartbeat our hearts feel so much happiness and peace"),
stringsAsFactors = FALSE)
library("quanteda")
tokens <- tokens(corpusdf$text, what = "word")
for (n in seq_len(5)) {
temp <- tokens_ngrams(tokens, n = n, skip = 0L, concatenator = "_")
temp <- data.frame(table(unlist(temp)),
stringsAsFactors = FALSE)
colnames(temp) <- c("Word", "Freq")
temp <- temp[order(temp$Freq, decreasing = TRUE),]
assign(paste0("gram_", n), temp)
}
head(gram_2)
Output looks like this:
> head(gram_2)
Word Freq
53 had_a 2
101 to_be 2
1 a_court 1
2 a_great 1
3 a_more 1
4 a_taste 1
Update: After I realised NGramTokenizer belongs to the RWeka package and not tm, @phiver's answer works for me:
ngrams <- RWeka::NGramTokenizer(corpusdf, RWeka::Weka_control(min=1, max=5))
ngrams <- data.frame(table(ngrams),
stringsAsFactors = FALSE)
ngrams <- ngrams[order(ngrams$Freq, decreasing = TRUE),]
head(ngrams)
However, this mixes up all ngrams which does not make much sense if you want to rank frequencies (onegrams will naturally be on top). So here is a loop solution:
for (n in seq_len(5)) {
temp <- RWeka::NGramTokenizer(corpusdf, RWeka::Weka_control(min=n, max=n))
temp <- data.frame(table(unlist(temp)),
stringsAsFactors = FALSE)
colnames(temp) <- c("Word", "Freq")
temp <- temp[order(temp$Freq, decreasing = TRUE),]
assign(paste0("gram_", n), temp)
}
head(gram_2)
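For comparison, here is a rough function-based sketch in base R that returns the frequency tables in a list instead of creating gram_1 ... gram_5 with assign(). It is not the quanteda or RWeka tokenizer: it simply splits on whitespace, and the name ngram_counts is made up for this example.

```r
# Count n-grams of size n across a character vector of sentences
ngram_counts <- function(texts, n) {
  grams <- unlist(lapply(strsplit(texts, "\\s+"), function(words) {
    # Sentences shorter than n words contribute no n-grams
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    # Join each window of n consecutive words with "_"
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = "_"),
           character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(Word = names(counts), Freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

texts <- c("she was happy", "she was sad")

# One data frame per n, collected in a list: all_grams[[2]] holds the bigrams
all_grams <- lapply(1:3, function(n) ngram_counts(texts, n))
head(all_grams[[2]])
```

Keeping the results in a list makes it easy to loop over them later, which is harder with separately named objects.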

R - extracting multiple patterns from string using gregexpr

I am working with a dataset where I have a column describing different products. In the product description is also the weight of the product, which is what I'd like to extract. My problem is that some products come in dual-packs, meaning that the description starts with '2x', while the actual weight is at the end of the description. For example:
x = '2x pet food brand 12kg'
What I'd like to do is to shorten this to just 2x12kg.
I'm not great at using regexp in R and was hoping that someone here could help me.
I have tried doing this using gregexpr in the following way:
m <- gregexpr("(^[0-9]+x [0-9]+kg)", x)
Unfortunately this only gives me '12kg', not including the '2x'.
I would appreciate any help at all with this.
EDIT ----
After sorting out my initial problem, I found that there were a few instances in the data of a different format, which I would also like to extract:
x = 'Pet food brand 15x85g'
# Should be:
x = '15x85g'
I have tried to play around with OR statements in gsub, like:
m <- gsub('^([0-9]+x)?[^0-9]*([0-9.]+kg)|([0-9]+x)?[^0-9]*([0-9.]+g)', '\\1\\2', x)
#And
m <- gsub('^([0-9]+x)?[^0-9]*([0-9.]+(kg|g))', '\\1\\2', x)
While this still extracts the kilos, it only removes the instances with grams and leaves the rest of the string, like:
x = 'Pet food brand '
Or running gsub a second time using:
m <- gsub('([0-9]+x[0-9]+g)', '\\1', x)
The latter option does not extract the product weights at all, and just leaves the string intact.
Sorry for not noticing that the strings were formatted differently earlier. Again, any help would be appreciated.
You could use this regular expression
m = gregexpr("([0-9]+x|[0-9.]+kg)", string, ignore.case = T)
result = regmatches(string, m)
r = paste0(unlist(result),collapse = "")
For string = "2x pet food brand 12kg" you get "2x12kg"
This also works if kilograms have decimals:
For string = "23x pet food 23.5Kg" you get "23x23.5Kg"
(edited to correct mistake pointed out by @R. Schifini)
You can use regex like this:
x <- '2x pet food brand 12kg'
gsub('^([0-9]+x)?[^0-9]*([0-9]+kg)', '\\1\\2', x)
## "2x12kg"
This would get you the weight even if there is no "2x" in the beginning of the string:
x <- 'pet food brand 12kg'
gsub('^([0-9]+x)?[^0-9]*([0-9]+kg)', '\\1\\2', x)
## "12kg"
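To cover the second format from the EDIT as well ('Pet food brand 15x85g'), one option is to extract the pieces rather than substitute around them. The sketch below combines regmatches() and gregexpr() with a pattern that accepts either a multiplier ("2x") or a weight in kg or g. The function name extract_weight is hypothetical, and the pattern assumes no other number followed by "x" or "g" appears in the description.

```r
# Pull out "<n>x" multipliers and "<n>kg"/"<n>g" weights and glue them together
extract_weight <- function(descriptions) {
  pattern <- "[0-9]+x|[0-9.]+k?g"
  matches <- regmatches(descriptions,
                        gregexpr(pattern, descriptions, ignore.case = TRUE))
  # Concatenate the matched pieces of each description in order of appearance
  vapply(matches, paste0, character(1), collapse = "")
}

products <- c("2x pet food brand 12kg",
              "Pet food brand 15x85g",
              "pet food brand 12kg")
extract_weight(products)
```

Descriptions with no match come back as an empty string, which makes them easy to filter afterwards.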

Split Speaker and Dialogue in RStudio

I have documents such as :
President Dr. Norbert Lammert: I declare the session open.
I will now give the floor to Bundesminister Alexander Dobrindt.
(Applause of CDU/CSU and delegates of the SPD)
Alexander Dobrindt, Minister for Transport and Digital Infrastructure:
Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.
(Volker Kauder [CDU/CSU]: Genau!)
(Applause of the CDU/CSU and the SPD)
And when I read those .txt documents I would like to create a second column indicating the speaker name.
So what I tried was to first create a list of all possible names and replace them..
library(qdap)
members <- c("Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","President Dr. Norbert Lammert:")
members_r <- c("#Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","#President Dr. Norbert Lammert:")
prok <- scan(".txt", what = "character", sep = "\n")
prok <- mgsub(members,members_r,prok)
prok <- as.data.frame(prok)
prok$speaker <- grepl("#[^\\#:]*:",prok$prok, ignore.case = T)
My plan was to then get the name between # and : via regex if speaker == true and apply it downwards until there is a different name (and remove all applause/shout brackets obviously), but that is also where I am not sure how I could do that.
Here is the approach:
require (qdap)
#text is the document text
# remove round brackets and text b/w ()
a <- bracketX(text, "round")
names <- c("President Dr. Norbert Lammert","Alexander Dobrindt" )
searchString <- paste(names[1],names[2], sep = ".+")
# Get string from names[1] till names[2] with the help of searchString
string <- regmatches(a, regexpr(searchString, a))
# remove names[2] from string
string <- gsub(names[2],"",string)
This code can be looped when there are more than 2 names
Here is an approach leaning heavily on dplyr.
First, I added a sentence to your sample text to illustrate why we can't just use a colon to identify speaker names.
library(dplyr)
sampleText <-
"President Dr. Norbert Lammert: I declare the session open.
I will now give the floor to Bundesminister Alexander Dobrindt.
(Applause of CDU/CSU and delegates of the SPD)
Alexander Dobrindt, Minister for Transport and Digital Infrastructure:
Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.
(Volker Kauder [CDU/CSU]: Genau!)
(Applause of the CDU/CSU and the SPD)
This sentence right here: it is an example of a problem"
I then split the text to simulate the format that it appears you are reading it in (which also puts each speech in a part of a list).
splitText <- strsplit(sampleText, "\n")
Then, I pull out all of the potential speakers (anything that precedes a colon):
allSpeakers <- lapply(splitText, function(thisText){
grep(":", thisText, value = TRUE) %>%
gsub(":.*", "", .) %>%
gsub("\\(", "", .)
}) %>%
unlist() %>%
unique()
Which gives us:
[1] "President Dr. Norbert Lammert"
[2] "Alexander Dobrindt, Minister for Transport and Digital Infrastructure"
[3] "Volker Kauder [CDU/CSU]"
[4] "This sentence right here"
Obviously, the last one is not a legitimate name, so should be excluded from our list of speakers:
legitSpeakers <-
allSpeakers[-4]
Now, we are ready to work through the speech. I have included stepwise comments below, instead of describing in text here
speechText <- lapply(splitText, function(thisText){
# Remove applause and interjections (things in parentheses)
# along with any blank lines; though you could leave blanks if you want
cleanText <-
grep("(^\\(.*\\)$)|(^$)", thisText
, value = TRUE, invert = TRUE)
# Split each line by a colon
strsplit(cleanText, ":") %>%
lapply(function(x){
# Check if the first element is a legit speaker
if(x[1] %in% legitSpeakers){
# If so, set the speaker, and put the statement in a separate portion
# taking care to re-collapse any breaks caused by additional colons
out <- data.frame(speaker = x[1]
, text = paste(x[-1], collapse = ":"))
} else{
# If not a legit speaker, set speaker to NA and reset text as above
out <- data.frame(speaker = NA
, text = paste(x, collapse = ":"))
}
# Return whichever version we made above
return(out)
}) %>%
# Bind all of the rows together
bind_rows %>%
# Identify clusters of speech that go with a single speaker
mutate(speakingGroup = cumsum(!is.na(speaker))) %>%
# Group by those clusters
group_by(speakingGroup) %>%
# Collapse that speaking down into a single row
summarise(speaker = speaker[1]
, fullText = paste(text, collapse = "\n"))
})
This yields
[[1]]
speakingGroup speaker fullText
1 President Dr. Norbert Lammert I declare the session open.\nI will now give the floor to Bundesminister Alexander Dobrindt.
2 Alexander Dobrindt, Minister for Transport and Digital Infrastructure Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.\nThis sentence right here: it is an example of a problem
If you prefer to have each line of text separately, replace the summarise at the end with mutate(speaker = speaker[1]) and you will get one line for each line of the speech, like this:
speaker text speakingGroup
President Dr. Norbert Lammert I declare the session open. 1
President Dr. Norbert Lammert I will now give the floor to Bundesminister Alexander Dobrindt. 1
Alexander Dobrindt, Minister for Transport and Digital Infrastructure 2
Alexander Dobrindt, Minister for Transport and Digital Infrastructure Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective. 2
Alexander Dobrindt, Minister for Transport and Digital Infrastructure This sentence right here: it is an example of a problem 2
This seems to work
library(qdap)
members <- c("Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","President Dr. Norbert Lammert:")
members_r <- c("#Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","#President Dr. Norbert Lammert:")
testprok <- read.table("txt",header=FALSE,quote = "\"",comment.char="",sep="\t")
testprok$V1 <- mgsub(members,members_r,testprok$V1)
testprok$V2 <- ifelse(grepl("#[^\\#:]*:",testprok$V1),testprok$V1,NA)
####function from http://stackoverflow.com/questions/7735647/replacing-nas-with-latest-non-na-value
repeat.before = function(x) { # repeats the last non NA value. Keeps leading NA
ind = which(!is.na(x)) # get positions of nonmissing values
if(is.na(x[1])) # if it begins with a missing, add the
ind = c(1,ind) # first position to the indices
rep(x[ind], times = diff( # repeat the values at these indices
c(ind, length(x) + 1) )) # diffing the indices + length yields how often
} # they need to be repeated
testprok$V2 = repeat.before(testprok$V2)
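The same idea (mark the lines that start with a known speaker, then carry the name down) can be sketched in base R without qdap. This is only an illustration under the assumption that the full list of speaker labels is known in advance; fill_down is a hypothetical helper that behaves like repeat.before above.

```r
lines <- c(
  "President Dr. Norbert Lammert: I declare the session open.",
  "I will now give the floor to Bundesminister Alexander Dobrindt.",
  "(Applause of CDU/CSU and delegates of the SPD)",
  "Alexander Dobrindt, Minister for Transport and Digital Infrastructure:",
  "Ladies and Gentleman."
)
speakers <- c(
  "President Dr. Norbert Lammert",
  "Alexander Dobrindt, Minister for Transport and Digital Infrastructure"
)

# Drop applause and interjections (lines fully wrapped in parentheses)
lines <- lines[!grepl("^\\(.*\\)$", lines)]

# For each line, the known speaker it starts with, or NA if there is none
speaker_at <- vapply(lines, function(l) {
  hit <- speakers[startsWith(l, paste0(speakers, ":"))]
  if (length(hit) == 1) hit else NA_character_
}, character(1), USE.NAMES = FALSE)

# Carry the last non-NA value forward; leading NAs stay NA
fill_down <- function(x) {
  idx <- cumsum(!is.na(x))
  c(NA, x[!is.na(x)])[idx + 1]
}

# Strip the "Speaker:" prefix from the lines that carry one
text <- lines
has_speaker <- !is.na(speaker_at)
text[has_speaker] <- trimws(substring(lines[has_speaker],
                                      nchar(speaker_at[has_speaker]) + 2))

result <- data.frame(speaker = fill_down(speaker_at), text = text,
                     stringsAsFactors = FALSE)
result
```

Matching against the known labels rather than any text before a colon avoids treating ordinary sentences that contain a colon as speakers.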

Resources