looping to tokenize using text2vec - r

Edited to shorten and provide sample data.
I have text data consisting of 8 questions asked of a number of participants twice. I want to use text2vec to compare the similarity of their responses to these questions at the two points in time (duplicate detection). Here is how my initial data is structured (in this example there are just 3 participants, 4 questions instead of 8, and 2 quarters/time periods). I want to do a similarity comparison for each participant's response in the first quarter vs. the second quarter. I intend to use the text2vec package's psim2 function to do this.
df<-read.table(text="ID,Quarter,Question,Answertext
Joy,1,And another question,adsfjasljsdaf jkldfjkl
Joy,2,And another question,dsadsj jlijsad jkldf
Paul,1,And another question,adsfj aslj sd afs dfj ksdf
Paul,2,And another question,dsadsj jlijsad
Greg,1,And another question,adsfjasljsdaf
Greg,2,And another question, asddsf asdfasd sdfasfsdf
Joy,1,this is the first question that was asked,this is joys answer to this question
Joy,2,this is the first question that was asked,this is joys answer to this question
Paul,1,this is the first question that was asked,this is Pauls answer to this question
Paul,2,this is the first question that was asked,Pauls answer is different
Greg,1,this is the first question that was asked,this is Gregs answer to this question nearly the same
Greg,2,this is the first question that was asked,this is Gregs answer to this question
Joy,1,This is the text of another question,more random text
Joy,2,This is the text of another question, adkjjlj;ds sdafd
Paul,1,This is the text of another question,more random text
Paul,2,This is the text of another question, adkjjlj;ds sdafd
Greg,1,This is the text of another question,more random text
Greg,2,This is the text of another question,sdaf asdfasd asdff
Joy,1,this was asked second.,some random text
Joy,2,this was asked second.,some random text that doesn't quite match joy's response the first time around
Paul,1,this was asked second.,some random text
Paul,2,this was asked second.,some random text that doesn't quite match Paul's response the first time around
Greg,1,this was asked second.,some random text
Greg,2,this was asked second.,ada dasdffasdf asdf asdfa fasd sdfadsfasd fsdas asdffasd
", header=TRUE,sep=',')
I've done some more thinking and I believe the right approach is to split the dataframe into a list of dataframes, not separate items.
questlist<-split(df,f=df$Question)
then write a function to create the vocabulary for each question.
library(text2vec)
library(dplyr) # provides %>% and filter()
vocabmkr<-function(x) {
itoken(as.character(x$Answertext), ids=x$ID) %>% create_vocabulary()%>% prune_vocabulary(term_count_min = 2) %>% vocab_vectorizer()
}
test<-lapply(questlist, vocabmkr)
But then I think I need to split the original dataframe into question-quarter combinations and apply the vocab from the other list to it and am not sure how to go about that.
Ultimately, I want a similarity score telling me if the participants are duplicating some or all of their responses from the first and second quarters.
EDIT: Here is how I would do this for a single question starting with the above dataframe.
quest1 <- filter(df,Question=="this is the first question that was asked")
quest1vocab <- itoken(as.character(quest1$Answertext), ids=quest1$ID) %>% create_vocabulary()%>% prune_vocabulary(term_count_min = 1) %>% vocab_vectorizer()
quest1q1<-filter(quest1,Quarter==1)
quest1q1<-itoken(as.character(quest1q1$Answertext),ids=quest1q1$ID) # tokenize question1 quarter 1
quest1q2<-filter(quest1,Quarter==2)
quest1q2<-itoken(as.character(quest1q2$Answertext),ids=quest1q2$ID) # tokenize question1 quarter 2
#now apply the vocabulary to the two matrices
quest1q1<-create_dtm(quest1q1,quest1vocab)
quest1q2<-create_dtm(quest1q2,quest1vocab)
similarity<-psim2(quest1q1, quest1q2, method="jaccard", norm="none") #row by row similarity.
b<-data.frame(ID=names(similarity),Similarity=similarity,row.names=NULL) #make dataframe of similarity scores
endproduct<-full_join(b,quest1)
Edit:
Ok, I have worked with the lapply some more.
df1<-split.data.frame(df,df$Question) #now we have 4 dataframes in the list, 1 for each question
vocabmkr<-function(x) {
itoken(as.character(x$Answertext), ids=x$ID) %>% create_vocabulary()%>% prune_vocabulary(term_count_min = 1) %>% vocab_vectorizer()
}
vocab<-lapply(df1,vocabmkr) #this gets us another list and in it are the 4 vocabularies.
dfqq<-split.data.frame(df,list(df$Question,df$Quarter)) #and now we have 8 items in the list - each list is a combination of question and quarter (4 questions over 2 quarters)
How do I apply the vocab list (consisting of 4 elements) to the dfqq list (consisting of 8)?

I'm sorry, that sounds frustrating. In case you have more to do and did want a more automatic way to do it, here's one approach that might work for you:
First, convert your example code for a single dataframe into a function:
analyze_vocab <- function(df_) {
  quest1vocab =
    itoken(as.character(df_$Answertext), ids = df_$ID) %>%
    create_vocabulary() %>%
    prune_vocabulary(term_count_min = 1) %>%
    vocab_vectorizer()
  quarter1 = filter(df_, Quarter == 1)
  quarter1 = itoken(as.character(quarter1$Answertext),
                    ids = quarter1$ID)
  quarter2 = filter(df_, Quarter == 2)
  quarter2 = itoken(as.character(quarter2$Answertext),
                    ids = quarter2$ID)
  q1mat = create_dtm(quarter1, quest1vocab)
  q2mat = create_dtm(quarter2, quest1vocab)
  similarity = psim2(q1mat, q2mat, method = "jaccard", norm = "none")
  b = data.frame(
    ID = names(similarity),
    Similarity = similarity)
  output <- full_join(b, df_)
  return(output)
}
Now, you can split if you want and then use lapply like this: lapply(split(df, df$Question), analyze_vocab). However, you already seem comfortable with piping so you might as well go with that approach:
similarity_df <- df %>%
group_by(Question) %>%
do(analyze_vocab(.))
Output:
> head(similarity_df, 12)
# A tibble: 12 x 5
# Groups: Question [2]
ID Similarity Quarter Question Answertext
<fct> <dbl> <int> <fct> <fct>
1 Joy 0 1 And another question adsfjasljsdaf jkldfjkl
2 Joy 0 2 And another question "dsadsj jlijsad jkldf "
3 Paul 0 1 And another question adsfj aslj sd afs dfj ksdf
4 Paul 0 2 And another question dsadsj jlijsad
5 Greg 0 1 And another question adsfjasljsdaf
6 Greg 0 2 And another question " asddsf asdfasd sdfasfsdf"
7 Joy 1 1 this is the first question that was asked this is joys answer to this question
8 Joy 1 2 this is the first question that was asked this is joys answer to this question
9 Paul 0.429 1 this is the first question that was asked this is Pauls answer to this question
10 Paul 0.429 2 this is the first question that was asked "Pauls answer is different "
11 Greg 0.667 1 this is the first question that was asked this is Gregs answer to this question nearly the same
12 Greg 0.667 2 this is the first question that was asked this is Gregs answer to this question
The values in similarity match the ones shown in your example endproduct (note that values shown are rounded for tibble display), so it seems to be working as intended.
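As a sanity check, the Jaccard scores in the output can be reproduced by hand in base R. This is only a sketch: it assumes lower-casing and whitespace splitting, which approximates but is not text2vec's own tokenizer, and `jaccard_words` is a hypothetical helper name.

```r
# Jaccard similarity of two answers: |shared words| / |all distinct words|.
# Assumption: lower-case and split on whitespace (not text2vec's tokenizer).
jaccard_words <- function(a, b) {
  wa <- unique(strsplit(tolower(a), "[[:space:]]+")[[1]])
  wb <- unique(strsplit(tolower(b), "[[:space:]]+")[[1]])
  length(intersect(wa, wb)) / length(union(wa, wb))
}

paul <- jaccard_words("this is Pauls answer to this question",
                      "Pauls answer is different")
greg <- jaccard_words("this is Gregs answer to this question nearly the same",
                      "this is Gregs answer to this question")
round(c(Paul = paul, Greg = greg), 3)
```

Paul shares 3 of 7 distinct words across his two answers and Greg shares 6 of 9, which matches the 0.429 and 0.667 shown in the tibble.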

I gave up and did this manually one dataframe at a time. I'm sure there's a simple way to do it as a list but I can't for the life of me figure out how to apply a list of functions (the vocab vectorizers) to the "Answertext" column in the list of dataframes.
As powerful as R is, a simple for loop that allows text swapping into the command (a la Stata's "foreach") is grossly lacking. I get that there is a different workflow involving breaking a dataframe into a list and iterating over that but for some activities this complicates matters grossly, necessitating complex indexes to refer not just to the list but also to the specific vectors contained in the list. I also recognize that the Stata-like behavior can be achieved using assign and paste0 but this, like most code in R, is terribly clunky and obtuse. sigh.
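For what it's worth, pairing a list of per-question functions with a list of question-quarter dataframes can be sketched with Map(). The example below is base R only and rests on assumptions: `toy` is made-up data, and `make_counter` is a hypothetical stand-in that, like the vocabmkr function in the question, returns a function built from one question's answers.

```r
# Toy data: 2 questions x 2 quarters x 2 participants (hypothetical values).
toy <- data.frame(
  ID = rep(c("Joy", "Paul"), times = 4),
  Quarter = rep(c(1, 1, 2, 2), times = 2),
  Question = rep(c("Q one", "Q two"), each = 4),
  Answertext = c("a b", "a c", "a b", "c d", "x y", "x z", "x y z", "x"),
  stringsAsFactors = FALSE
)

# Stand-in for vocabmkr(): builds a vocabulary from one question's answers
# and returns a function that turns text into a 0/1 document-term matrix.
make_counter <- function(d) {
  vocab <- sort(unique(unlist(strsplit(d$Answertext, " ", fixed = TRUE))))
  function(text) {
    t(vapply(strsplit(text, " ", fixed = TRUE),
             function(w) as.numeric(vocab %in% w), numeric(length(vocab))))
  }
}

vocab <- lapply(split(toy, toy$Question), make_counter)  # 2 functions
dfqq  <- split(toy, list(toy$Question, toy$Quarter))     # 4 dataframes

# Look up which question each piece belongs to, then index the function
# list by that name so both lists have the same length and order.
key  <- vapply(dfqq, function(d) as.character(d$Question[1]), character(1))
mats <- Map(function(d, vec) vec(d$Answertext), dfqq, vocab[key])
```

With text2vec, the same pattern should carry over by using vocabmkr in place of make_counter and create_dtm(itoken(as.character(d$Answertext), ids = d$ID), vec) as the body of the Map() function. Taking the key from the data rather than from names(dfqq) avoids parsing the "Question.Quarter" names, which is awkward when a question itself contains a period.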


Regex match for singular version BUT NOT plural in R [duplicate]

I might be missing something very obvious, but how can I write efficient code to match the singular form of a noun but NOT its plural? For example, I want to match
angel investor
angel
BUT NOT
angels
try angels
If I try
grep("angel ", string)
Then a string with JUST the word
angel
won't match.
Please help!
Use word-boundary markers \\b:
x <- c("angel investor", "angel","angels", "try angels")
grep("\\bangel\\b", x, value = TRUE)
[1] "angel investor" "angel"
You can try the following approach, though I believe there are other excellent ways to solve this problem.
library(dplyr)
library(stringr)
df <- data.frame(obs = 1:4, words = c("angle", "try angles", "angle investor", "angles"))
df %>%
filter(!str_detect(words, "(?<=[ertkgwmnl])s\\b"))
# obs words
# 1 1 angle
# 2 3 angle investor

How do I find the sum of a category under a subset?

So... I'm very new to RStudio and I'm using this program for a class... I'm trying to figure out how to sum a subset of a category. I apologize in advance if this doesn't make sense, but I'll do my best to explain because I have no clue what I'm doing, and I would also appreciate an explanation of why and not just what the answer would be. Note: The two lines I included are part of the directions I have to follow, not something I typed in because I knew how to - I don't... It's the last part, the sum, that was not explained to me, so I don't know what to do and would appreciate help figuring it out.
For example,
I have this:
category_name category2_name
1 ABC
2 ABC
3 ABC
4 ABC
5 ABC
6 BDE
5 EFG
7 EFG
I wanted to find the sum of these numbers, so I was told to put in this:
sum(dataname$category_name)
After doing this, I'm asked to type this in, apparently creating a subset.
allabc <- subset(dataname, dataname$category_name2 == "abc")
I created this subset and now I have a new table popped up with this subset. I'm asked to sum only the numbers of this ABC subset... I have absolutely no clue on how to do this. If someone could help me out, I'd really appreciate it!
R is the software you are using. It is case-sensitive. So "abc" is not equal to "ABC".
The arguments are the "things" you put inside functions. Some arguments have the same name as the functions (which is a little confusing at first, but you get used to this eventually). So when I say the subset argument, I am talking about your second argument to the subset function, which you didn't name. That's ok, but when starting to learn R, try to always name your arguments.
So,
allabc <- subset(dataname, dataname$category_name2 == "abc")
Needs to be changed to:
allabc <- subset(dataname, subset=category2_name == "ABC")
And you also don't need to specify the name of the data again in the subset argument, since you've done that already in the first argument (which you didn't name, but almost no one bothers to do that).
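To finish the last step, the sum of the subset works the same way as the sum of the full data. A minimal sketch, assuming the data frame is called `dataname` and holds the values shown in the question:

```r
# Small data frame mirroring the example table in the question.
dataname <- data.frame(
  category_name  = c(1, 2, 3, 4, 5, 6, 5, 7),
  category2_name = c("ABC", "ABC", "ABC", "ABC", "ABC", "BDE", "EFG", "EFG")
)

# Keep only the rows where category2_name is "ABC" (case matters).
allabc <- subset(dataname, subset = category2_name == "ABC")

# Sum the numeric column of the subset, exactly as for the full data.
abc_total <- sum(allabc$category_name)
abc_total
```

Here `abc_total` is 15, the sum of the five ABC rows (1 + 2 + 3 + 4 + 5).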
This is most easily done with the tidyverse.
# Your data
data <- data.frame(category_name = 1:8, category_name2 = c(rep("ABC", 5), "BDE", "EFG", "EFG"))
# Installing tidyverse
install.packages("tidyverse")
# Loading tidyverse
library(tidyverse)
# For each category_name2 the category_name is summed
data %>%
group_by(category_name2) %>%
summarise(sum_by_group = sum(category_name))
# Output
category_name2 sum_by_group
ABC 15
BDE 6
EFG 15

R Question - Trying to use separate to split data with a non-constant delimiter

One of the variables is participant age groups, an example of one of the records is shown below,
0::Adult 18+||1:: Adult 18+||2::Adult 18+||3::Child 0-11
How do you best split this out so that it will give Adult 18+ with a count of 3 and Child 0-11 with a count of 1?
I tried using separate, but as the delimiter is not constant, it was omitting a lot of the records. Any suggestions would be helpful, thank you! As this is my first post, let me know if I need to add more information.
Here is one way:
library(magrittr)
vals <- "0::Adult 18+||1:: Adult 18+||2::Adult 18+||3::Child 0-11"
strsplit(gsub("[^[:alpha:][:space:]]","", vals), "\\s+") %>% as.data.frame() %>% table()
Adult Child
3 1
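If the full labels ("Adult 18+", "Child 0-11") should be kept, one alternative sketch is to split on the literal "||" and strip the leading index from each piece. This is base R only, and it assumes every piece starts with digits followed by "::" as in the example record.

```r
vals <- "0::Adult 18+||1:: Adult 18+||2::Adult 18+||3::Child 0-11"

# Split on the literal "||" (fixed = TRUE avoids regex interpretation).
parts <- strsplit(vals, "||", fixed = TRUE)[[1]]

# Drop the leading "<index>::" and any spaces that follow it.
groups <- sub("^[0-9]+::[[:space:]]*", "", parts)

# Count how often each age group occurs.
counts <- table(groups)
counts
```

Because the varying part (the index and the optional space) is removed before counting, the inconsistent delimiter no longer matters.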

removing variables containing certain string in r [duplicate]

I have hundreds of observations and I'd like to remove the ones that contain the string "english basement". I can't seem to find the right syntax to do so. I can only figure out how to keep observations with that string. For instance, I used the code below to get only observations containing the string, and it worked perfectly:
eng_base <- zdata %>%
filter(str_detect(ListingDescription, "english basement"))
Now I want a data set, top_10mpEB, that excludes observations containing "english basement". Your help is greatly appreciated.
I do not know what your data looks like, but maybe this example helps you - I think you just need to negate the logical vector returned by str_detect:
library(dplyr)
library(stringr)
zdata <- data.frame(ListingDescription = c(rep("english basement, etc",3), letters[1:2] ))
zdata
# ListingDescription
#1 english basement, etc
#2 english basement, etc
#3 english basement, etc
#4 a
#5 b
zdata %>%
filter(!str_detect(ListingDescription, "english basement"))
# ListingDescription
#1 a
#2 b
Or using data.table package (no need of stringr::str_detect):
library(data.table)
setDT(zdata)
zdata[! ListingDescription %like% "english basement"]
# ListingDescription
#1: a
#2: b
You can do this using grepl():
x <- data.frame(ListingDescription = c('english basement other words description continued',
'great fireplace and an english basement',
'no basement',
'a house with a sauna!',
'the pool is great... and wait till you see the english basement!',
'new listing...will go fast'),
rent = c(3444, 23444, 346, 9000, 1250, 599))
x_english_basement <- x[!grepl('english basement',
x$ListingDescription), ]
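If the whole field is exactly "english basement", you can use dplyr to filter your dataframe with a negated equality test.
library(dplyr)
new_data <- data %>%
filter(ListingDescription != "english basement")
The ! became my best friend once I realized it negates a condition, so != means "doesn't equal". Note that this drops only exact matches, not descriptions that merely contain the string.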
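The difference between an exact comparison and a pattern match is easy to see on a small example. This is a base R sketch with made-up listings; `ignore.case = TRUE` is an optional extra, included on the assumption that capitalization varies in real descriptions.

```r
# Hypothetical listings: one exact match, one containing the phrase, one without.
listings <- data.frame(
  ListingDescription = c("english basement",
                         "Great English basement unit",
                         "no basement"),
  stringsAsFactors = FALSE
)

# Exact comparison: removes only the row that equals the string.
exact_only <- listings[listings$ListingDescription != "english basement", ,
                       drop = FALSE]

# Pattern match: removes every row that contains the phrase, in any case.
no_match <- listings[!grepl("english basement", listings$ListingDescription,
                            ignore.case = TRUE), , drop = FALSE]
```

`exact_only` keeps two rows, including the "Great English basement unit" listing, while `no_match` keeps only "no basement".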

Extracting a value based on multiple conditions in R

Quick question - I have a dataframe (severity) that looks like,
industryType relfreq relsev
1 Consumer Products 2.032520 0.419048
2 Biotech/Pharma 0.650407 3.771429
3 Industrial/Construction 1.327913 0.609524
4 Computer Hardware/Electronics 1.571816 2.019048
5 Medical Devices 1.463415 3.028571
6 Software 0.758808 1.314286
7 Business/Consumer Services 0.623306 0.723810
8 Telecommunications 0.650407 4.247619
if I wanted to pull the relfreq of Medical Devices (row 5) - how could I subset just that value?
I was thinking about just indexing and doing severity$relfreq[[5]], but I'd be using this line in a bigger function where the user would specify the industry i.e.
example <- function(industrytype) {
weight <- relfreq of industrytype parameter
thing2 <- thing1*weight
return(thing2)
}
So if I do subset by an index, is there a way R would know which index corresponds to the industry type specified in the function parameter? Or is it easier/a way to just subset the relfreq column by the industry name?
You first need to select the row of interest and then keep the two columns you want (industryType and relfreq).
The tidyverse lets you do this intuitively:
library(tidyverse)
data_want <- severity %>%
subset(industryType =="Medical Devices") %>%
select(industryType, relfreq)
Here you read from left to right, with %>% passing each result on to the next step instead of nesting the calls.
I think it is better to select the whole row first, then pick the column you would like to see.
frame <- severity[severity$industryType == 'Medical Devices',]
frame$relfreq
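Inside a function, the same lookup can use the parameter instead of a hard-coded name. A minimal sketch in base R: `weighted_value` and its `thing1` argument are hypothetical names standing in for the function outlined in the question, and only three rows of the `severity` data are reproduced.

```r
# A few rows of the severity data from the question.
severity <- data.frame(
  industryType = c("Consumer Products", "Biotech/Pharma", "Medical Devices"),
  relfreq = c(2.032520, 0.650407, 1.463415),
  stringsAsFactors = FALSE
)

# Hypothetical helper: scale `thing1` by the relfreq of the named industry.
weighted_value <- function(industrytype, thing1, data = severity) {
  # Logical indexing finds the row by name, so no row number is needed.
  weight <- data$relfreq[data$industryType == industrytype]
  if (length(weight) != 1) {
    stop("expected exactly one row for: ", industrytype)
  }
  thing1 * weight
}

result <- weighted_value("Medical Devices", thing1 = 2)
```

The length check is a design choice: a misspelled or duplicated industry name fails loudly instead of silently returning an empty or multi-element result.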
