Excluding words in sentimentr - r

How do you drop multiple terms from the sentimentr dictionary?
For example, the words "please" and "advise" are associated with positive sentiment, but I do not want those particular words to influence my analysis.
I've figured out a way to exclude one word with the following script, but I need to exclude many more:
mysentiment <- lexicon::hash_sentiment_jockers_rinker[x != "please"]
mytext <- c(
'Hello, We are looking to purchase this material for a part we will be making, but your site doesnt state that this is RoHS complaint. Is it possible that its just not listed as such online, but it actually is RoHS complaint? Please advise. '
)
sentiment_by(mytext, polarity_dt = mysentiment)
extract_sentiment_terms(mytext, polarity_dt = mysentiment)

You can subset the mysentiment data.table. Just create a vector of the words you don't want included and use it to subset.
mysentiment <- lexicon::hash_sentiment_jockers_rinker
words_to_exclude <- c("please", "advise")
mysentiment <- mysentiment[!x %in% words_to_exclude]
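To confirm the exclusion worked, you can check that the terms are gone from the table and then re-run the analysis with the trimmed dictionary (a quick check, assuming the objects above and the question's mytext are in the workspace):
mysentiment[x %in% words_to_exclude]   # empty data.table: the terms are no longer in the dictionary
sentiment_by(mytext, polarity_dt = mysentiment)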


Identify elements with a specific language, e.g. Chinese

I have a dataset that looks simplified similar to this:
call_id<- c("001","002","003","004","005","012","024")
transcript <- c("All the best and happy birthday",
"万事如意,生日快乐",
"See you tomorrow",
"Nice hearing from you",
"再相见",
"玩",
"恭喜你 ")
df <- as.data.frame(cbind(call_id, transcript))
I need code that gives me the call_id or row numbers of the observations where the transcript column contains Chinese, because my final goal is to exclude those rows. Since my data set has 250,000 observations, the code must do this automatically rather than by hand as in this small example. I have already done some analysis with quanteda. Is there any possibility to do this in quanteda? Thanks in advance.
How about using the Unicode character class for Chinese characters?
> txt <- c("All the best and happy birthday", "万事如意,生日快乐")
> stringi::stri_detect_regex(txt, "\\p{Han}")
[1] FALSE TRUE
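Applied to the question's df, something along these lines would drop the rows whose transcript contains Chinese characters (assuming the stringi package is available):
# keep only the rows with no Han (Chinese) characters in the transcript
df_no_chinese <- df[!stringi::stri_detect_regex(df$transcript, "\\p{Han}"), ]
df_no_chinese$call_id   # "001" "003" "004"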
You can use the textcat package in R to detect multiple languages. It can detect up to 74 languages and uses a reduced n-gram approach designed to remove redundancies of the original approach.
Here's an example that removes the rows detected as Chinese:
library("textcat")
out_df <- df[textcat(df$transcript) != "chinese",]

How do I use the which function to search my dataframe?

I have a bunch of PDFs that I would like to search through in order to quickly locate tables and graphs relevant to my research.
#I load the following libraries
library(pdfsearch)
library(tm)
library(pdftools)
#I assign the directory of my PDF files to the path where they are located
directory <- '/References'
#and then I search the directory for the keywords "table", "graph", and "chart"
txt <- keyword_directory(directory,
                         keyword = c('table', 'graph', 'chart'),
                         split_pdf = TRUE,
                         remove_hyphen = TRUE,
                         full_names = TRUE)
#Up to this point everything works fine. I get a nice data.frame called "txt"
#with 1356 objects in 7 columns. However, when I try to search the data.frame
#I start running into trouble.
#I start with "hunter" a term that I know resides in the token_text column
txt[which(txt$token_text == 'hunter'), ]
#executing this code produces the following message
[1] ID pdf_name keyword page_num line_num line_text token_text
<0 rows> (or 0-length row.names)
Am I using the right tool to search through my data.frame? Is there an easier way to cross-reference this data? Is there a package out there designed to help one crawl through a mountain of PDFs? Thanks for your time.
The condition you pass to which() (here txt$token_text == 'hunter') returns TRUE or FALSE for every value in the column, and which() then returns the indices of the values where it is TRUE. You can subset a data frame by supplying either the logical vector or those indices for the rows you want to keep or discard.
Combining this you get:
txt[which(txt$token_text == 'hunter'), ], which is what you did, and it returned no rows. As was pointed out in the comments, == does exact matching, so you may simply have no exact matches.
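A tiny illustration of the difference, using a made-up vector rather than the question's data:
x <- c("hunter", "gatherer", "hunter-gatherer")
x == "hunter"         # TRUE FALSE FALSE  (exact matches only)
which(x == "hunter")  # 1                 (index of the TRUE value)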
To get TRUE/FALSE based on partial matches or a regular expression, you can use the grepl function instead:
txt[grepl("hunter", txt$token_text, ignore.case=TRUE), ]
For easier understanding I prefer doing this with the dplyr package:
library(dplyr)
txt %>% filter(grepl("hunter",token_text, ignore.case=TRUE))

Add new document to R corpus to find unique words

I have a corpus of speeches and I would like to identify the unique words within one kind of speeches.
This is what I did: I extracted two corpora from the larger one, EUP_control_corpus and IMF_control_corpus in the script below. I collapsed IMF_control_corpus into one text, which I want to combine with EUP_control_corpus; then, using tf-idf, I want to find out which terms are unique to the IMF speeches relative to the EUP speeches.
However, I'm stuck at the part of adding to (combining with) a corpus. It seems like this should be very simple, so I don't understand why I can't find anything on it. Is it so simple that no one has asked this question?
I tried making both into a dfm and then joining them, or turning the text file back into a corpus before joining, but in both cases the single text ended up containing a great number of documents again.
#Create date format
base_corpus$documents$int_date <-
as.Date( base_corpus$documents$date, format = "%d-%m-%Y")
head(as.Date( base_corpus$documents$date, format = "%d-%m-%Y"))
#Select pre-crisis EUP speeches for control group
EUP_control_corpus<-
corpus_subset(base_corpus, country == "European Parliament" & int_date < as.Date( '31-12-2012', format = "%d-%m-%Y"))
head(docnames(EUP_control_corpus), 50)
ndoc(EUP_control_corpus)
#Create dfm out of EUP corpus
EUP_control_dfm <-
dfm(EUP_control_corpus, tolower = TRUE, stem = FALSE)
ndoc(EUP_control_dfm)
#Select pre-crisis IMF speeches for control group
IMF_control_corpus<-
corpus_subset(base_corpus, country == "International Monetary Fund" & int_date < as.Date( '31-12-2012', format = "%d-%m-%Y"))
head(docnames(IMF_control_corpus), 50)
ndoc(IMF_control_corpus)
#Combine IMF_control_corpus into one text
IMF_control_text<-
texts(corpus(texts(IMF_control_corpus, groups = "texts")))
IMF_control_dfm<-
dfm(IMF_control_text)
ndoc(IMF_control_dfm)
#Add IMF_control_text to EUP_control_dfm
plus_dfm<-
dfm(rbind(EUP_control_dfm, IMF_control_dfm))
ndoc((plus_dfm))
#Add IMF_control_text to EUP_control_corpus/ doesn't work, make text into single text corpus and then add?
total_control_corpus<-
corpus(EUP_control_corpus, IMF_control_text)
ndoc(total_control_corpus)
I have the idea that the group function in quanteda could be useful to do this in another way, but I decided to post the question first, as I have been searching for a couple of days already.
Thank you for reading this question.
This is not a question with a reproducible example, so it is hard to provide a correct answer. Here are some suggestions:
Create a new document variable called control that takes on one of two values, IMF or EU. Set it using the conditionals that you were previously using with the corpus_subset() command. From that, you can easily create a dfm that will continue to include this docvar, or you can use the groups = "control" argument to dfm() to collapse the counts by the values of this variable.
Use docvars(thecorpus, "thevariable") <- newvalue instead of addressing the inner contents of the corpus object. That method is not stable since we may change the internal contents of the corpus at any time.
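A minimal sketch of that suggestion, assuming base_corpus and the int_date docvar from the question, and a quanteda version whose dfm() accepts a corpus and a groups argument:
library(quanteda)
# keep only the pre-crisis speeches from the two institutions
control_corpus <- corpus_subset(
  base_corpus,
  country %in% c("European Parliament", "International Monetary Fund") &
    int_date < as.Date("31-12-2012", format = "%d-%m-%Y"))
# one docvar labelling each document as EU or IMF
docvars(control_corpus, "control") <-
  ifelse(docvars(control_corpus, "country") == "International Monetary Fund", "IMF", "EU")
# collapse counts by that label; tf-idf can then be computed on the two rows
control_dfm <- dfm(control_corpus, groups = "control")
dfm_tfidf(control_dfm)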
I found a solution. It might not be the prettiest one, but it works.
#Loop through the corpus and paste all documents into one document
temp <- IMF_control_corpus$documents$texts[1]
for (i in 2:337) {
  temp <- paste(temp, IMF_control_corpus$documents$texts[i])
}
#Create corpus out of text and add docvars, make sure it looks the same as EUP_control_corpus
single_IMF_corpus <- corpus(temp)
single_IMF_corpus$documents$title <- "IMF Text"
single_IMF_corpus$documents$date <- ""
single_IMF_corpus$documents$country <- "International Monetary Fund"
single_IMF_corpus$documents$speaker <- "IMF"
single_IMF_corpus$documents$length <- ""
single_IMF_corpus$documents$language <- "en"
single_IMF_corpus$documents$language2 <- "english"
single_IMF_corpus$documents$int_date <- as.Date("", format = "%d-%m-%Y")
#Combine single_IMF_corpus and EUP_control_corpus
total_control_corpus<-
c(EUP_control_corpus, single_IMF_corpus)
ndoc(total_control_corpus)
ndoc(EUP_control_corpus)

Pairing qualitative user data with text-mining results

I have customer feedback data in a CSV with paired columns: "rec", denoting whether the customer recommended the service they received (1 or 0), and "comment", the associated free-text comment. I am trying to compare the feedback of customers who recommended the service with that of those who did not.
I have used the tm package to simply read all the lines in a CSV with only comments and do some follow-on text-mining on all the comments, which worked:
file_loc <- "C:/Users/..(etc)...file.csv"
x <- read.csv(file_loc, header = TRUE)
require(tm)
fdbk <- Corpus(DataframeSource(x))
Now I am trying to compare the comments of customers who recommend and those who do not by including the "rec" column, but I have not been able to create a corpus from a single column of the CSV. I tried the following:
file_loc <- "C:/Users/..(etc)...file.csv"
x <- read.csv(file_loc, header = TRUE)
require(tm)
fdbk <- Corpus(DataframeSource(x$comment))
But I get an error saying
"Error in if (vectorized && (length <= 0))
stop("vectorized sources must have positive length") :
missing value where TRUE/FALSE needed"
I also tried binding the "rec" codes to the comments after creating a topic model, but certain comments end up getting filtered out by the topic function, so the "rec" column is longer than the number of documents in the resulting topic model.
Is this something I can do simply with the tm package? I haven't worked with the qdap package at all, but would that be more appropriate here?
As ben mentioned:
vec <- as.character(x[, "comment"])
Corpus(VectorSource(vec))
Perhaps adding some customer id as metadata would be nice, too.
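Building on that, one simple way to compare the two groups is to split the data frame on rec before building the corpora (a sketch, assuming x has the comment and rec columns described in the question):
library(tm)
# one corpus per group, analysed side by side
rec_corpus   <- Corpus(VectorSource(as.character(x$comment[x$rec == 1])))
norec_corpus <- Corpus(VectorSource(as.character(x$comment[x$rec == 0])))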
Hope this helps.

Data cleaning in Excel sheets using R

I have data in Excel sheets and I need a way to clean it. I would like to remove inconsistent values; for example, the branch name is written variously as Computer Science and Engineering, C.S.E, C.S, or Computer Science. How can I bring all of them into a single notation?
The car package has a recode function. See its help page for worked examples, and the sketch after the links below.
In fact an argument could be made that this should be a closed question:
Why is recode in R not changing the original values?
How to recode a variable to numeric in R?
Recode/relevel data.frame factors with different levels
And a few more questions easily identifiable with a search: [r] recode
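For the branch-name case, a hedged sketch of what a car::recode() call could look like (the variant spellings are just the ones given in the question):
library(car)
branch <- c("Computer Science and Engineering", "C.S.E", "C.S", "Computer Science")
recode(branch,
       "'C.S.E' = 'Computer Science and Engineering';
        'C.S' = 'Computer Science and Engineering';
        'Computer Science' = 'Computer Science and Engineering'")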
EDIT:
I liked Marek's comment so much I decided to make a function that implemented it. (Factors have always been one of those R traps for me, and his approach seemed very intuitive.) The function is designed to take character or factor input and return a grouped result that also collects any remaining levels into an "all_others" level.
my_recode <- function(fac, levslist){
  # levslist is of the form:
  #   list(animal = c("cow", "pig"),
  #        bird   = c("eagle", "pigeon"))
  nfac <- factor(fac)
  inlevs <- levels(nfac)
  othrlevs <- inlevs[!inlevs %in% unlist(levslist)]
  levels(nfac) <- c(levslist, all_others = othrlevs)
  nfac
}
df <- data.frame(name = c('cow','pig','eagle','pigeon', "zebra"),
stringsAsFactors = FALSE)
df$type <- my_recode(df$name, list(
animal = c("cow", "pig"),
bird = c("eagle", "pigeon") ) )
df
#-----------
name type
1 cow animal
2 pig animal
3 eagle bird
4 pigeon bird
5 zebra all_others
You want a way to clean your data and you specify R. Is there a reason for that (automation, remote control from a console, ...)?
If not, I would suggest Open Refine. It is a great tool for exactly this job. It is not hosted, so you can safely download it and run it against your dataset (xls/xlsx files work fine); you then create a text facet and group away.
It uses advanced algorithms (and even gives you a choice) and is really helpful. I have cleaned a lot of data in no time.
The videos at the official web site are useful.
There are no one-size-fits-all solutions for these types of problems. From what I understand, you have branch names that are inconsistently labelled.
You would like to see C.S.E. but what you actually have is CS, Computer Science, CSE, etc., and perhaps a number of other branch names that are inconsistent.
The first thing I would do is get a unique list of the branch names in the file. I'll provide an example using the built-in letters vector so you can see what I mean:
your_df <- data.frame(ID = 1:2000)
your_df$BranchNames <- sample(letters, 2000, replace = TRUE)
your_df$BranchNames <- as.character(your_df$BranchNames)  # only needed if it's a factor
unique.names <- sort(unique(your_df$BranchNames))
Now that we have a sorted list of unique values, we can create a listing of recodes:
Let's say we wanted to recode the first seven names (a through g here) as just "A":
your_df$BranchNames[your_df$BranchNames %in% unique.names[1:7]] <- "A"
And you'd repeat the process, eliminating or grouping the unique names as appropriate.
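Applied to the branch names from the question, the same idea might look like this (the variant list is only the one given in the question; you would extend it after inspecting unique.names):
# map every known variant of the branch name onto one canonical label
cse_variants <- c("Computer Science and Engineering", "C.S.E", "C.S", "Computer Science")
your_df$BranchNames[your_df$BranchNames %in% cse_variants] <- "Computer Science and Engineering"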
