Identify elements with a specific language, e.g. Chinese - R

I have a dataset that looks simplified similar to this:
call_id <- c("001", "002", "003", "004", "005", "012", "024")
transcript <- c("All the best and happy birthday",
                "万事如意,生日快乐",
                "See you tomorrow",
                "Nice hearing from you",
                "再相见",
                "玩",
                "恭喜你 ")
df <- as.data.frame(cbind(call_id, transcript))
I need code that gives me the call_id or row numbers of the observations where the transcript column contains Chinese. My final goal is to exclude those rows. As I have a data set with 250,000 observations, it obviously has to be code that does this automatically, not by hand as would be possible for this small example. I have already done some analysis with Quanteda. Is there any possibility in Quanteda for this? Thanks in advance.

How about using the Unicode character class for Chinese characters?
> txt <- c("All the best and happy birthday", "万事如意,生日快乐")
> stringi::stri_detect_regex(txt, "\\p{Han}")
[1] FALSE TRUE
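Applied to the data frame from the question, a minimal sketch (using the column names above) to pull the matching call_ids and drop those rows could look like this:
library(stringi)
# rows whose transcript contains at least one Han (Chinese) character
has_chinese <- stri_detect_regex(df$transcript, "\\p{Han}")
df$call_id[has_chinese]    # "002" "005" "012" "024"
df_clean <- df[!has_chinese, ]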

You can use the textcat package in R to detect multiple languages. It can detect up to 74 languages and uses a reduced n-gram approach designed to remove redundancies of the original approach.
Here's an example that removes rows detected as Chinese:
library("textcat")
out_df <- df[textcat(df$transcript) != "chinese",]
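Note that textcat() may return NA (or a misleading label) for very short strings; since comparing NA with != yields NA and would produce NA rows in the subset, a slightly more defensive variant (a sketch, not part of the original answer) is:
detected <- textcat(df$transcript)
out_df <- df[is.na(detected) | detected != "chinese", ]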

Related

How can you exclude certain words before periods from being used as sentence breaks in quanteda's corpus_reshape?

In some cases, certain periods are mistakenly treated as sentence breaks when using corpus_reshape. I have a corpus from the pharmaceutical industry, and in many cases "Dr." is mistakenly treated as a sentence break.
This post (Quanteda's corpus_reshape function: how not to break sentences after abbreviations (like "e.g.")) is similar but unfortunately does not solve the problem. Here is an example:
library("quanteda")
txt <- c(
  d1 = "With us we have Dr. Smith. We are not sure... where we stand.",
  d2 = "The U.S. is south of Canada."
)
corpus(txt) %>%
  corpus_reshape(to = "sentences")
Corpus consisting of 4 documents.
d1.1 :
"With us we have Dr."
d1.2 :
"Smith."
d1.3 :
"We are not sure... where we stand."
d2.1 :
"The U.S. is south of Canada."
The approach from that post works for only a few of the "Dr." cases. I was wondering whether words to be excluded can be added to the function, because I would like to avoid using an alternative function to break the text into sentences. Thanks!
Please use corpus_segment with pattern and valuetype = "regex".
You may find an example here:
https://quanteda.io/reference/corpus_segment.html
You may also use the use_docvars option.
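If corpus_segment does not fit your workflow, another workaround (a rough sketch, not part of the answer above) is to mask the period in known abbreviations before corpus_reshape and restore it afterwards:
library(quanteda)
abbrevs <- c("Dr.", "Prof.")   # the abbreviation list is assumed for illustration
masked  <- txt
for (a in abbrevs) {
  masked <- gsub(a, sub("\\.$", "_ABBR_", a), masked, fixed = TRUE)
}
sents <- corpus(masked) %>%
  corpus_reshape(to = "sentences")
# "Dr_ABBR_ Smith" now stays inside one sentence; put the periods back afterwards
sents_fixed <- gsub("_ABBR_", ".", as.character(sents), fixed = TRUE)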

Conditional string replacement in R

I have a dataframe containing a column of broker names, handwritten by customers, which I would like to go through in order to replace the handwritten broker names with the unique broker names I have in a list.
A snippet of my data looks like this:
Data <- data.frame(Date = c("01-10-2020", "01-10-2020", "01-11-2020", "01-11-2020"),
                   Broker = c("RealEstate", "REALestate", "Estate", "ESTATE"))
My list of unique broker names looks like this:
Unique_brokers <- list("REALESTATE", "ESTATE")
Based on some sort of pattern-recognition, I would like to replace the brokernames in my Data dataframe with the unique brokernames in my Unique_brokers list.
I've partially managed to do this somewhat manually using a combination of case_when and str_detect from dplyr and stringr respectively.
Data <- Data %>%
  mutate(UniqueBroker = case_when(
    str_detect(Broker, regex("realestate", ignore_case = TRUE)) ~ "REALESTATE",
    str_detect(Broker, regex("estate", ignore_case = TRUE)) ~ "ESTATE",
    TRUE ~ "OTHER"
  ))
However, this is fairly time-consuming with >100 unique brokers and more than 12,500 combinations of handwritten broker names in ~80,000 records.
I was wondering whether it would be possible to make this replacement using mapply; I haven't been able to so far, however.
Many thanks in advance!
EDIT
Data$Broker consists of all kinds of combinations in terms of spelling, information included, etc.
E.g.
Data$Broker <- c("Real-estate", "Real estate", "Real estate department 788", "Michael / REAL Estate")
You can use the (?i) flag for case-insensitive regex and use str_replace_all -
library(stringr)
Unique_brokers <- c("REALESTATE", "ESTATE")
Data$Unique_brokers <- str_replace_all(Data$Broker,
setNames(Unique_brokers, str_c('(?i)', Unique_brokers)))
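For reference, the named vector that str_replace_all() receives here looks like this (names are the case-insensitive patterns, values the standardized replacements):
setNames(Unique_brokers, str_c('(?i)', Unique_brokers))
(?i)REALESTATE     (?i)ESTATE
  "REALESTATE"       "ESTATE"
Note that these patterns only match the exact spellings case-insensitively, so free-form variants such as "Real-estate" or "Real estate department 788" from the EDIT above would still need their own patterns.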
A dplyr and stringr solution:
Data %>%
  mutate(Broker_unique = if_else(str_detect(Broker, "(?i)real(-|\\s)?estate"),
                                 "REALESTATE",
                                 "ESTATE"))
Date Broker Broker_unique
1 01-10-2020 Real-estate REALESTATE
2 01-10-2020 estate ESTATE
3 01-11-2020 Real estate department 788 REALESTATE
4 01-11-2020 Michael / REAL Estate REALESTATE
The pattern works like this:
(?i): make match case-insensitive
real: match literal real
(-|\\s)?: optionally match - OR \\s (i.e., one whitespace character)
estate: match estate literally
Test data:
Data <- data.frame(Date = c("01-10-2020", "01-10-2020", "01-11-2020", "01-11-2020"),
Broker = c("RealEstate", "REALestate", "Estate", "ESTATE"))
Data$Broker <- c("Real-estate", "estate", "Real estate department 788", "Michael / REAL Estate")

Excluding words in sentimentr

How do you drop multiple terms from the sentimentr dictionary?
For example, the words "please" and "advise" are associated with positive sentiment, but I do not want those particular words to influence my analysis.
I've figured out a way with the following script to exclude 1 word but need to exclude many more:
mysentiment <- lexicon::hash_sentiment_jockers_rinker[x != "please"]
mytext <- c(
'Hello, We are looking to purchase this material for a part we will be making, but your site doesnt state that this is RoHS complaint. Is it possible that its just not listed as such online, but it actually is RoHS complaint? Please advise. '
)
sentiment_by(mytext, polarity_dt = mysentiment)
extract_sentiment_terms(mytext,polarity_dt = mysentiment)
You can subset the mysentiment data.table. Just create a vector of the words you don't want included and use it to subset.
mysentiment <- lexicon::hash_sentiment_jockers_rinker
words_to_exclude <- c("please", "advise")
mysentiment <- mysentiment[!x %in% words_to_exclude]
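As a quick sanity check (the lexicon's terms live in its x column), the excluded words should no longer be found:
words_to_exclude %in% mysentiment$x
[1] FALSE FALSE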

Text Mining: Getting a Sentence-Term Matrix

I'm currently running into trouble finding anything relevant to creating a sentence-term matrix in R using text mining.
I'm using the tm package and the only thing that I can find is converting to a tdm or dtm.
I'm using only one Excel file, and I'm only interested in text mining one column of it. That column has about 1200 rows. I want to create a row (sentence)-term matrix that tells me the frequency of words in each row (sentence).
I want to create a matrix of 1's and 0's that I can run a PCA analysis on later.
A dtm in my case is not helpful because, since I'm only using one file, the number of rows is 1 and the columns are the frequencies of words in that whole document.
Instead, I want to treat the sentences as documents, if that makes sense. From there, I want a matrix with the frequency of words in each sentence.
Thank you!
When using text2vec you just need to feed the content of your column as a character vector into the tokenizer function - see the example below.
Concerning your downstream analysis, I would not recommend running PCA on count data / integer values; PCA is not designed for this kind of data. You should either apply normalization, tf-idf weighting, etc. to your dtm to turn it into continuous data before feeding it to PCA, or otherwise apply correspondence analysis instead.
library(text2vec)
docs <- c("the coffee is warm",
          "the coffee is cold",
          "the coffee is hot",
          "the coffee is warm",
          "the coffee is hot",
          "the coffee is perfect")

# Generate document-term matrix with text2vec
tokens = word_tokenizer(docs)
it = itoken(tokens,
            ids = paste0("sent_", 1:length(docs)),
            progressbar = FALSE)
vocab = create_vocabulary(it)
vectorizer = vocab_vectorizer(vocab)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
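Since the stated goal is a 1/0 matrix for a later PCA-style analysis, a hedged follow-up sketch is to binarize the dtm, or weight it (e.g. with text2vec's TfIdf) as suggested above, before any decomposition:
# presence/absence version (stays a sparse Matrix)
dtm_binary <- (dtm > 0) * 1
# or tf-idf weighting
tfidf <- TfIdf$new()
dtm_tfidf <- fit_transform(dtm, tfidf)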
With the corpus library:
library(corpus)
library(Matrix)
corpus <- federalist # sample data
x <- term_matrix(text_split(corpus, "sentences"))
Although, in your case, it sounds like you have already split the text into sentences. If that is true, then there is no need for the text_split call; just do
x <- term_matrix(data$your_column_with_sentences)
(replacing data$your_column_with_sentences with whatever is appropriate for your data).
Can't add comments so here's a suggestion:
# Read data from file using fread (for .csv, from the data.table package)
library(data.table)
library(stringr)
dat <- fread(filename, <add parameters as needed - col.names, nrows, etc.>)
counts <- sapply(row_start:row_end, function(z) str_count(dat[z, selected_col_name], "the"))
This will give you all occurrences of "the" in the column of interest for the selected rows. You could also use apply if it's for all rows, or other nested functions for different variations. Bear in mind that you would need to check for lowercase/uppercase letters - you can use tolower to achieve that. Hope this is helpful!
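For instance, a case-insensitive count over the whole column (keeping the placeholder column name from above) could be:
str_count(tolower(dat$selected_col_name), "the")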

Data cleaning in Excel sheets using R

I have data in Excel sheets and I need a way to clean it. I would like to remove inconsistent values, e.g. the Branch name is specified as (Computer Science and Engineering, C.S.E, C.S, Computer Science). How can I bring all of them into a single notation?
The car package has a recode function. See its help page for worked examples.
In fact an argument could be made that this should be a closed question:
Why is recode in R not changing the original values?
How to recode a variable to numeric in R?
Recode/relevel data.frame factors with different levels
And a few more questions easily identifiable with a search: [r] recode
EDIT:
I liked Marek's comment so much I decided to make a function that implemented it. (Factors have always been one of those R-traps for me and his approach seemed very intuitive.) The function is designed to take character or factor class input and return a grouped result that also classifies an "all_others" level.
my_recode <- function(fac, levslist){
  # levslist of the form:
  #   list(animal = c("cow", "pig"),
  #        bird   = c("eagle", "pigeon"))
  nfac <- factor(fac)
  inlevs <- levels(nfac)
  othrlevs <- inlevs[ !inlevs %in% unlist(levslist) ]
  levels(nfac) <- c(levslist, all_others = othrlevs)
  nfac
}
df <- data.frame(name = c('cow','pig','eagle','pigeon', "zebra"),
stringsAsFactors = FALSE)
df$type <- my_recode(df$name, list(
animal = c("cow", "pig"),
bird = c("eagle", "pigeon") ) )
df
#-----------
name type
1 cow animal
2 pig animal
3 eagle bird
4 pigeon bird
5 zebra all_others
You want a way to clean your data and you specify R. Is there a reason for it? (automation, remote control [console], ...)
If not, I would suggest Open Refine. It is a great tool exactly for this job. It is not hosted; you can safely download it and run it against your dataset (xls/xlsx work fine), then create a text facet and group away.
It uses advanced algorithms (and even gives you a choice) and is really helpful. I have cleaned a lot of data in no time.
The videos at the official web site are useful.
There is no one-size-fits-all solution for these types of problems. From what I understand, you have Branch Names that are inconsistently labelled.
You would like to see C.S.E. but what you actually have is CS, Computer Science, CSE, etc. And perhaps a number of other Branch Names that are inconsistent.
The first thing I would do is get a unique list of Branch Names in the file. I'll provide an example using the built-in letters vector so you can see what I mean:
your_df <- data.frame(ID=1:2000)
your_df$BranchNames <- sample(letters,2000, replace=T)
your_df$BranchNames <- as.character(your_df$BranchNames) # only if it's a factor
unique.names <- sort(unique(your_df$BranchNames))
Now that we have a sorted list of unique values, we can create a listing of recodes:
Let's say we wanted to rename the first seven unique names ("a" through "g" here) as just "A":
your_df$BranchNames[your_df$BranchNames %in% unique.names[1:7]] <- "A"
And you'd repeat the process above, eliminating or grouping the unique names as appropriate.
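For the Branch Name case specifically, a lookup-table sketch along the same lines might look like this (the raw spellings and the canonical label are assumed for illustration):
lookup <- c("Computer Science and Engineering" = "CSE",
            "C.S.E"            = "CSE",
            "C.S"              = "CSE",
            "Computer Science" = "CSE")
idx <- your_df$BranchNames %in% names(lookup)
your_df$BranchNames[idx] <- lookup[your_df$BranchNames[idx]]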
