Regex match for singular version BUT NOT plural in R [duplicate] - r

This question already has answers here:
Using regex in R to find strings as whole words (but not strings as part of words)
(2 answers)
Closed 2 years ago.
I might be missing something very obvious but how can I write efficient code to get all matches of a singular version of a noun but NOT its plural? for example, I want to match
angel investor
angel
BUT NOT
angels
try angels
If I try
grep("angel ", string)
Then a string with JUST the word
angel
won't match.
Please help!

Use word-boundary markers \\b:
x <- c("angel investor", "angel","angels", "try angels")
grep("\\bangel\\b", x, value = T)
[1] "angel investor" "angel"

You can try the following approach. It still believe there are other excellent ways to solve this problem.
df <- data.frame(obs = 1:4, words = c("angle", "try angles", "angle investor", "angles"))
df %>%
filter(!str_detect(words, "(?<=[ertkgwmnl])s\\b"))
# obs words
# 1 1 angle
# 2 3 angle investor

Related

looping to tokenize using text2vec

Edited to shorten and provide sample data.
I have text data consisting of 8 questions asked of a number of participants twice. I want to use text2vec to compare the similarity of their responses to these questions at the two points in time (duplicate detection). Here is how my initial data is structured (in this example there are just 3 participants, 4 questions instead of 8, and 2 quarters/time periods). I want to do similarity comparison for each participant's response in the first quarter vs. the second quarter. I intend to use package text2vec's psim command to do this.
df<-read.table(text="ID,Quarter,Question,Answertext
Joy,1,And another question,adsfjasljsdaf jkldfjkl
Joy,2,And another question,dsadsj jlijsad jkldf
Paul,1,And another question,adsfj aslj sd afs dfj ksdf
Paul,2,And another question,dsadsj jlijsad
Greg,1,And another question,adsfjasljsdaf
Greg,2,And another question, asddsf asdfasd sdfasfsdf
Joy,1,this is the first question that was asked,this is joys answer to this question
Joy,2,this is the first question that was asked,this is joys answer to this question
Paul,1,this is the first question that was asked,this is Pauls answer to this question
Paul,2,this is the first question that was asked,Pauls answer is different
Greg,1,this is the first question that was asked,this is Gregs answer to this question nearly the same
Greg,2,this is the first question that was asked,this is Gregs answer to this question
Joy,1,This is the text of another question,more random text
Joy,2,This is the text of another question, adkjjlj;ds sdafd
Paul,1,This is the text of another question,more random text
Paul,2,This is the text of another question, adkjjlj;ds sdafd
Greg,1,This is the text of another question,more random text
Greg,2,This is the text of another question,sdaf asdfasd asdff
Joy,1,this was asked second.,some random text
Joy,2,this was asked second.,some random text that doesn't quite match joy's response the first time around
Paul,1,this was asked second.,some random text
Paul,2,this was asked second.,some random text that doesn't quite match Paul's response the first time around
Greg,1,this was asked second.,some random text
Greg,2,this was asked second.,ada dasdffasdf asdf asdfa fasd sdfadsfasd fsdas asdffasd
", header=TRUE,sep=',')
I've done some more thinking and I believe the right approach is to split the dataframe into a list of dataframes, not separate items.
questlist<-split(df,f=df$Question)
then write a function to create the vocabulary for each question.
library(text2vec)
vocabmkr<-function(x) {
itoken(x$AnswerText, ids=x$ID) %>% create_vocabulary()%>% prune_vocabulary(term_count_min = 2) %>% vocab_vectorizer()
}
test<-lapply(questlist, vocabmkr)
But then I think I need to split the original dataframe into question-quarter combinations and apply the vocab from the other list to it and am not sure how to go about that.
Ultimately, I want a similarity score telling me if the participants are duplicating some or all of their responses from the first and second quarters.
EDIT: Here is how I would do this for a single question starting with the above dataframe.
quest1 <- filter(df,Question=="this is the first question that was asked")
quest1vocab <- itoken(as.character(quest1$Answertext), ids=quest1$ID) %>% create_vocabulary()%>% prune_vocabulary(term_count_min = 1) %>% vocab_vectorizer()
quest1q1<-filter(quest1,Quarter==1)
quest1q1<-itoken(as.character(quest1q1$Answertext),ids=quest1q1$ID) # tokenize question1 quarter 1
quest1q2<-filter(quest1,Quarter==2)
quest1q2<-itoken(as.character(quest1q2$Answertext),ids=quest1q2$ID) # tokenize question1 quarter 2
#now apply the vocabulary to the two matrices
quest1q1<-create_dtm(quest1q1,quest1vocab)
quest1q2<-create_dtm(quest1q2,quest1vocab)
similarity<-psim2(quest1q1, quest1q2, method="jaccard", norm="none") #row by row similarity.
b<-data.frame(ID=names(similarity),Similarity=similarity,row.names=NULL) #make dataframe of similarity scores
endproduct<-full_join(b,quest1)
Edit:
Ok, I have worked with the lapply some more.
df1<-split.data.frame(df,df$Question) #now we have 4 dataframes in the list, 1 for each question
vocabmkr<-function(x) {
itoken(as.character(x$Answertext), ids=x$ID) %>% create_vocabulary()%>% prune_vocabulary(term_count_min = 1) %>% vocab_vectorizer()
}
vocab<-lapply(df1,vocabmkr) #this gets us another list and in it are the 4 vocabularies.
dfqq<-split.data.frame(df,list(df$Question,df$Quarter)) #and now we have 8 items in the list - each list is a combination of question and quarter (4 questions over 2 quarters)
How do I apply the vocab list (consisting of 4 elements) to the dfqq list (consisting of 8)?
I'm sorry, that sounds frustrating. In case you have more to do and did want a more automatic way to do it, here's one approach that might work for you:
First, convert your example code for a single dataframe into a function:
analyze_vocab <- function(df_) {
quest1vocab =
itoken(as.character(df_$Answertext), ids = df_$ID) %>%
create_vocabulary() %>%
prune_vocabulary(term_count_min = 1) %>%
vocab_vectorizer()
quarter1 = filter(df_, Quarter == 1)
quarter1 = itoken(as.character(quarter1$Answertext),
ids = quarter1$ID)
quarter2 = filter(df_, Quarter == 2)
quarter2 = itoken(as.character(quarter2$Answertext),
ids = quarter2$ID)
q1mat = create_dtm(quarter1, quest1vocab)
q2mat = create_dtm(quarter2, quest1vocab)
similarity = psim2(q1mat, q2mat, method = "jaccard", norm = "none")
b = data.frame(
ID = names(similarity),
Similarity = similarity)
output <- full_join(b, df_)
return(output)
}
Now, you can split if you want and then use lapply like this: lapply(split(df, df$Question), analyze_vocab). However, you already seem comfortable with piping so you might as well go with that approach:
similarity_df <- df %>%
group_by(Question) %>%
do(analyze_vocab(.))
Output:
> head(similarity_df, 12)
# A tibble: 12 x 5
# Groups: Question [2]
ID Similarity Quarter Question Answertext
<fct> <dbl> <int> <fct> <fct>
1 Joy 0 1 And another question adsfjasljsdaf jkldfjkl
2 Joy 0 2 And another question "dsadsj jlijsad jkldf "
3 Paul 0 1 And another question adsfj aslj sd afs dfj ksdf
4 Paul 0 2 And another question dsadsj jlijsad
5 Greg 0 1 And another question adsfjasljsdaf
6 Greg 0 2 And another question " asddsf asdfasd sdfasfsdf"
7 Joy 1 1 this is the first question that was asked this is joys answer to this question
8 Joy 1 2 this is the first question that was asked this is joys answer to this question
9 Paul 0.429 1 this is the first question that was asked this is Pauls answer to this question
10 Paul 0.429 2 this is the first question that was asked "Pauls answer is different "
11 Greg 0.667 1 this is the first question that was asked this is Gregs answer to this question nearly the same
12 Greg 0.667 2 this is the first question that was asked this is Gregs answer to this question
The values in similarity match the ones shown in your example endproduct (note that values shown are rounded for tibble display), so it seems to be working as intended.
I gave up and did this manually one dataframe at a time. I'm sure there's a simple way to do it as a list but I can't for the life of me figure out how to apply a list of functions (the vocab vectorizers) to the "Answertext" column in the list of dataframes.
As powerful as R is, a simple for loop that allows text swapping into the command (a la Stata's "foreach") is grossly lacking. I get that there is a different workflow involving breaking a dataframe into a list and iterating over that but for some activities this complicates matters grossly, necessitating complex indexes to refer not just to the list but also to the specific vectors contained in the list. I also recognize that the Stata-like behavior can be achieved using assign and paste0 but this, like most code in R, is terribly clunky and obtuse. sigh.

Extract words starting with # in R dataframe and save as new column

My dataframe column looks like this:
head(tweets_date$Tweet)
[1] b"It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
These are tweets which have mentions starting with '#', I need to extract all of them and save each mention in that particular tweet as "#mention1 #mention2". Currently my code just extracts them as lists.
My code:
tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "#\\w+")
How do I collapse those lists in each row to a form a string separated by spaces as mentioned earlier.
Thanks in advance.
I trust it would be best if you used an asis column in this case:
extract words:
library(stringr)
Mentions <- str_extract_all(lis, "#\\w+")
some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
create a list column:
df$Mentions <- I(Mentions)
df
#output
col lett Mentions
1 1 A #DineshK....
2 2 B #IPL, #p....
3 3 C
4 4 D
5 5 E #ChennaiIPL
6 6 F
I think this is better since it allows for quite easy sub setting:
df$Mentions[[1]]
#output
[1] "#DineshKarthik" "#KKRiders"
df$Mentions[[1]][1]
#output
[1] "#DineshKarthik"
and it succinctly shows whats inside the column when printing the df.
data:
lis <- c("b'It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")
The str_extract_all function from the stringr package returns a list of character vectors. So, if you instead want a list of single CSV terms, then you may try using sapply for a base R option:
tweets <- str_extract_all(tweets_date$Tweet, "#\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
Demo
Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that email addresses can be in tweets as can URLs with #'s in them (and not just the silly URLs with username/password in the host component). Thus, something like:
(^|[^[[:alnum:]_]#/\\!?=&])#([[:alnum:]_]{1,15})\\b
is likely a better, safer choice

gsub to create new variables [duplicate]

This question already has answers here:
How to split a string on first number only
(4 answers)
Closed 4 years ago.
I want to create 2 variables from 1 variable in R.
I have following character variable for gas station:
station
Valero 1810 N Foster Rd & IH-10 E
from this variable I want to create 2: station_id and address
station_id
Valero
address
1810 N Foster Rd & IH-10 E
In my data set all strings in station variable begin with words (up to 3 words, eg: EZ Mart) and all addresses begin with numeric value.
I was trying to achieve this goal using gsub for last couple hours but I couldn't do it.
Thank you
Base R solution: This works for the sample string you give. You need to test if this works for your other cases. It would've been good to include more than one sample string.
ss <- "Valero 1810 N Foster Rd & IH-10 E";
station_id <- trimws(gsub("(\\w+\\s+){1,3}(\\d+.+)$", "\\1", ss));
address <- gsub("(\\w+\\s+){1,3}(\\d+.+)$", "\\2", ss);
station_id;
#[1] "Valero"
address;
#[1] "1810 N Foster Rd & IH-10 E"

removing variables containing certain string in r [duplicate]

This question already has answers here:
Remove Rows From Data Frame where a Row matches a String
(6 answers)
Delete rows containing specific strings in R
(7 answers)
Closed 4 years ago.
I'd have hundreds of observations and I'd like to remove the ones that contain the string "english basement". I can't seem to find the right syntax to do so. I can only figure out how to keep observations with the that string. For instance, I used the code below to get only observations containing the string, and it worked perfectly:
eng_base <- zdata %>%
filter(str_detect(zdata$ListingDescription, “english basement”))
Now I want a data set,top_10mpEB, that excludes observations containing "english basement". Your help is greatly appreciated.
I do not know how your data looks like, but maybe this example helps you - I think you just need to negate the logical vector returned by str_detect:
library(dplyr)
library(stringr)
zdata <- data.frame(ListingDescription = c(rep("english basement, etc",3), letters[1:2] ))
zdata
# ListingDescription
#1 english basement, etc
#2 english basement, etc
#3 english basement, etc
#4 a
#5 b
zdata %>%
filter(!str_detect(ListingDescription, "english basement"))
# ListingDescription
#1: a
#2: b
Or using data.table package (no need of stringr::str_detect):
library(data.table)
setDT(zdata)
zdata[! ListingDescription %like% "english basement"]
# ListingDescription
#1: a
#2: b
You can do this using grepl():
x <- data.frame(ListingDescription = c('english basement other words description continued',
'great fireplace and an english basement',
'no basement',
'a house with a sauna!',
'the pool is great... and wait till you see the english basement!',
'new listing...will go fast'),
rent = c(3444, 23444, 346, 9000, 1250, 599))
x_english_basement <- x[grepl('english basement',
x$ListingDescription)==FALSE, ]
You can use dplyr to easily filter your dataframe.
library(dplyr)
new_data <- data %>%
filter(!ListingDescription=="english basement")
The ! became my best friend once I realized it meant "doesnt equal"

Find the most frequently occuring words in a text in R

Can someone help me with how to find the most frequently used two and three words in a text using R?
My text is...
text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a phrase is usually a group of words with some special idiomatic meaning or other significance, such as \"all rights reserved\", \"economical with the truth\", \"kick the bucket\", and the like. It may be a euphemism, a saying or proverb, a fixed expression, a figure of speech, etc. In grammatical analysis, particularly in theories of syntax, a phrase is any group of words, or sometimes a single word, which plays a particular role within the grammatical structure of a sentence. It does not have to have any special meaning or significance, or even exist anywhere outside of the sentence being analyzed, but it must function there as a complete grammatical unit. For example, in the sentence Yesterday I saw an orange bird with a white neck, the words an orange bird with a white neck form what is called a noun phrase, or a determiner phrase in some theories, which functions as the object of the sentence. Theorists of syntax differ in exactly what they regard as a phrase; however, it is usually required to be a constituent of a sentence, in that it must include all the dependents of the units that it contains. This means that some expressions that may be called phrases in everyday language are not phrases in the technical sense. For example, in the sentence I can't put up with Alex, the words put up with (meaning \'tolerate\') may be referred to in common language as a phrase (English expressions like this are frequently called phrasal verbs\ but technically they do not form a complete phrase, since they do not include Alex, which is the complement of the preposition with.")
The tidytext package makes this sort of thing pretty simple:
library(tidytext)
library(dplyr)
data_frame(text = text) %>%
unnest_tokens(word, text) %>% # split words
anti_join(stop_words) %>% # take out "a", "an", "the", etc.
count(word, sort = TRUE) # count occurrences
# Source: local data frame [73 x 2]
#
# word n
# (chr) (int)
# 1 phrase 8
# 2 sentence 6
# 3 words 4
# 4 called 3
# 5 common 3
# 6 grammatical 3
# 7 meaning 3
# 8 alex 2
# 9 bird 2
# 10 complete 2
# .. ... ...
If the question is asking for counts of bigrams and trigrams, tokenizers::tokenize_ngrams is useful:
library(tokenizers)
tokenize_ngrams(text, n = 3L, n_min = 2L, simplify = TRUE) %>% # tokenize bigrams and trigrams
as_data_frame() %>% # structure
count(value, sort = TRUE) # count
# Source: local data frame [531 x 2]
#
# value n
# (fctr) (int)
# 1 of the 5
# 2 a phrase 4
# 3 the sentence 4
# 4 as a 3
# 5 in the 3
# 6 may be 3
# 7 a complete 2
# 8 a phrase is 2
# 9 a sentence 2
# 10 a white 2
# .. ... ...
Your text is:
text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a phrase is usually a group of words with some special idiomatic meaning or other significance, such as \"all rights reserved\", \"economical with the truth\", \"kick the bucket\", and the like. It may be a euphemism, a saying or proverb, a fixed expression, a figure of speech, etc. In grammatical analysis, particularly in theories of syntax, a phrase is any group of words, or sometimes a single word, which plays a particular role within the grammatical structure of a sentence. It does not have to have any special meaning or significance, or even exist anywhere outside of the sentence being analyzed, but it must function there as a complete grammatical unit. For example, in the sentence Yesterday I saw an orange bird with a white neck, the words an orange bird with a white neck form what is called a noun phrase, or a determiner phrase in some theories, which functions as the object of the sentence. Theorists of syntax differ in exactly what they regard as a phrase; however, it is usually required to be a constituent of a sentence, in that it must include all the dependents of the units that it contains. This means that some expressions that may be called phrases in everyday language are not phrases in the technical sense. For example, in the sentence I can't put up with Alex, the words put up with (meaning \'tolerate\') may be referred to in common language as a phrase (English expressions like this are frequently called phrasal verbs\ but technically they do not form a complete phrase, since they do not include Alex, which is the complement of the preposition with.")
In Natural Language Processing, 2-word phrases are referred to as "bi-gram", and 3-word phrases are referred to as "tri-gram", and so forth. Generally, a given combination of n-words is called an "n-gram".
First, we install the ngram package (available on CRAN)
# Install package "ngram"
install.packages("ngram")
Then, we will find the most frequent two-word and three-word phrases
library(ngram)
# To find all two-word phrases in the test "text":
ng2 <- ngram(text, n = 2)
# To find all three-word phrases in the test "text":
ng3 <- ngram(text, n = 3)
Finally, we will print the objects (ngrams) using various methods as below:
print(ng, output="truncated")
print(ngram(x), output="full")
get.phrasetable(ng)
ngram::ngram_asweka(text, min=2, max=3)
We can also use Markov Chains to babble new sequences:
# if we are using ng2 (bi-gram)
lnth = 2
babble(ng = ng2, genlen = lnth)
# if we are using ng3 (tri-gram)
lnth = 3
babble(ng = ng3, genlen = lnth)
We can split the words and use table to summarize the frequency:
words <- strsplit(text, "[ ,.\\(\\)\"]")
sort(table(words, exclude = ""), decreasing = T)
Simplest?
require(quanteda)
# bi-grams
topfeatures(dfm(text, ngrams = 2, verbose = FALSE))
## of_the a_phrase the_sentence may_be as_a in_the in_common phrase_is
## 5 4 4 3 3 3 2 2
## is_usually group_of
## 2 2
# for tri-grams
topfeatures(dfm(text, ngrams = 3, verbose = FALSE))
## a_phrase_is group_of_words of_a_sentence of_the_sentence for_example_in example_in_the
## 2 2 2 2 2 2
## in_the_sentence an_orange_bird orange_bird_with bird_with_a
# 2 2 2 2
Here's a simple base R approach for the 5 most frequent words:
head(sort(table(strsplit(gsub("[[:punct:]]", "", text), " ")), decreasing = TRUE), 5)
# a the of in phrase
# 21 18 12 10 8
What it returns is an integer vector with the frequency count and the names of the vector correspond to the words that were counted.
gsub("[[:punct:]]", "", text) to remove punctuation since you don't want to count that, I guess
strsplit(gsub("[[:punct:]]", "", text), " ") to split the string on spaces
table() to count unique elements' frequency
sort(..., decreasing = TRUE) to sort them in decreasing order
head(..., 5) to select only the top 5 most frequent words

Resources