before I get started, I would like you to know that I am completely new to coding in R. For a group assignment our professor set up a database by scraping data from Amazon. Within the database, which is called 'dat', there is a column named 'product_name'. We were given a set group of utilitarian words. I think you can guess where this is going. Within the column 'product_name' we have to find for each product name whether any of the utilitarian words appeared. If yes, how many times. We were given the following code by our professor to use for this assignment:
nb_words <- function(lexicon,corpus){
rowSums(sapply(lexicon, function(x) grepl(x, corpus)))
}
after which i created the following codes:
uti_words <-c("additives","antioxidant","artificial", "busy", "calcium","calories", "carb", "carbohydrates", "chemicals", "cholesterol", "convenient", "dense", "diet", "fast")
sentences <- (dat$product_name)
nb_words (lexicon=uti_words,corpus=sentences)
when i run nb_words, however, I noticed something went wrong. A sentence contained the word 'breakfast'. My code counted this as a match because the word 'fast' from 'uti_words' matched with it. I don't want this to happen, does anyone know how to make it so that I only get exact matches and no partial matches?
We may have to add word boundary (\\b) to avoid partial matches
uti_words <- paste0("\\b", trimws(uti_words), "\\b")
Or another option is to change the grepl part of the code with fixed = TRUE
nb_words <- function(lexicon,corpus){
rowSums(sapply(lexicon, function(x) grepl(x, corpus, fixed = TRUE)))
}
Related
I have a sentence that may contain keywords. I search for them, if one is true, I want the word before and after the keyword.
cont <- c("could not","would not","does not","will not","do not","were not","was not","did not")
text <- "this failed to increase incomes and production did not improve"
str_extract(text,"([^\\s]+\\s+){1}names(which(sapply(cont,grepl,text)))(\\s+[^\\s]+){1}")
This fails when I dynamically search using the names function but if I input:
str_extract(text,"([^\\s]+\\s+){1}did not(\\s+[^\\s]+){1}")
it correctly returns: production did not improve.
How can I get this to function without directly inputing the keywords?
Final note: I do not completely understand the syntax used to get surrounding objects. Basic r books have not covered this. Can someone explain please?
You could use your cont vector to create a vector of regex strings:
targets <- paste0("([^\\s]+\\s+){1}", cont, "(\\s+[^\\s]+){1}")
Which you can feed into str_extract_all and then unlist:
unlist(stringr::str_extract_all(text, targets))
#> [1] "production did not improve"
If this is something you need to do quite frequently, you could wrap it in a function:
get_surrounding <- function(string, keywords) {
targets <- paste0("([^\\s]+\\s+){1}", keywords, "(\\s+[^\\s]+){1}")
unlist(stringr::str_extract_all(string, targets))
}
With which you can easily run the query on new strings:
new_text <- "The production did not increase because the manager would not allow it."
get_surrounding(new_text, cont)
#> [1] "manager would not allow" "production did not increase"
Perhaps we can try this
> regmatches(text, gregexpr(sprintf("\\w+\\s(%s)\\s\\w+", paste0(cont, collapse = "|")), text))[[1]]
[1] "production did not improve"
Each match of the following regular expression will save the preceding and following words in capture groups 1 and 2, respectively.
\\b([a-z]+) +(?:could|would|does|will|do|were|was|did) +not +([a-z]+)\\b
You will of course have to form this expression programmatically, but that should be straightforward.
Hover the cursor over each element of the expression at this demo to obtain an explanation of its function.
For the string
"she could not believe that production did not improve"
there are two matches. For the first ("she could not believe") "she" and "believe" are saved to capture groups 1 and 2, respectively. For the second ("production did not improve") "production" and "improve" are saved to capture groups 1 and 2, respectively.
Hi I am trying to join two large datasets >10000 entries each. To do this I have created a ‘unique ID’ - a combination of full name and date of birth which are present in both. However, the datasets have spelling mistakes/ different characters in the IDs so when using left join many won’t match. I don’t have access to fuzyjoin/ match so can’t use this to partially match them. Someone has suggested using adist(). How am I able to use this to match and merge the datasets or to flag ones which are close to matching? As simple as possible please I have only been using R for a few weeks!
Examples of code would be amazing
You could just rename them to names that are spelled correctly:
df$correct_spelling <- df$incorrect_spelling
This may a bit of a manual solution, but perhaps a base - R solution would be to look through unique values of the join fields and correct any that are misspelled using the grep() function and creating a crosswalk to merge into the dataframes with misspelled unique IDs. Here's a trivial example of what I mean:
Let's say we have a dataframe of scientists and their year of birth, and then we have a second dataframe with the scientists' names and their field of study, but the "names" column is riddled with spelling errors. Here is the code to make the example dataframes:
##Fake Data##
Names<-c("Jill", "Maria", "Carlos", "DeAndre") #Names
BirthYears<-c(1974, 1980, 1991, 1985) # Birthyears
Field<-c("Astronomy", "Geology", "Medicine", "Ecology") # Fields of science
Mispelled<-c("Deandre", "Marai", "Jil", "Clarlos")# Names misspelled
##Creating Dataframes##
DF<-data.frame(Names=Names, Birth=BirthYears) # Dataframe with correct spellings
DF2<-data.frame(Names=Mispelled, Field=Field) # Dataframe with incorrect spellings we want to fix and merge
What we can do is find all the unique values of the correctly spelled and the incorrectly spelled versions of the scientists' names using a regular expression replacement function gsub().
Mispelled2<-unique(DF2$Names) # Get unique values of names from misspelled dataframe
Correct<-unique(DF$Names) # Get unique values of names from correctly spelled dataframe
fix<-NULL #blank vector to save results from loop
for(i in 1:length(Mispelled2)){#Looping through all unique mispelled values
ptn<-paste("^",substring(Mispelled2[i],1,1), "+", sep="") #Creating a regular expression string to find patterns similar to the correct name
fix[i]<-grep(ptn, Correct, value=TRUE) #Finding and saving replacement name values
}#End loop
You'll have to come up with the regular expressions necessary for your situation, here is a link to a cheatsheet with how to build regular expressions
https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
Now we can make a dataframe crosswalking the misspelled names with the correct spelling ie., Row 1 would have "Deandre" and "DeAndre" Row 2 would have "Jil" and "Jill."
CWX<-data.frame(Name_wrong=Mispelled2, Name_correct=fix)
Finally we merge the crosswalk to the dataframe with the incorrect spellings, and then merge the resultant dataframe to the dataframe with the correct spellings
Mispelled3<-merge(DF2, CWX, by.x="Names", by.y="Name_wrong")
Joined_DF<-merge(DF, Mispelled3[,-1], by.x="Names", by.y="Name_correct")
Here is what I was able to come up with for your question about matching in multiple ways. It's a bit clunky, but it works with this below example data. The trick is making the call to agrep() sensitive enough to not match names that partially match but are truly different, but flexible enough that it allows for partial matches and misspellings:
Example1<-"deborahoziajames04/14/2000"
Example2<-"Somepersonnotdeborah04/15/2002"
Example3<-"AnotherpersonnamedJames01/23/1995"
Misspelled1<-"oziajames04/14/2000"
Misspelled2<-"deborahozia04/14/2000"
Misspelled3<-"deborahoziajames10/14/1990"
Misspelled4<-"personnamedJames"
String<-c(Example1, Example2, Example3)
Misspelled<-c(Misspelled1, Misspelled2, Misspelled3, Misspelled4)
Spell_Correct<-function(String, Misspelled){
out<-NULL
for(i in 1:length(Misspelled)){
ptn_front<-paste('^', Misspelled[i], "\\B", sep="")
ptn_mid<-paste('\\B', Misspelled[i], "\\B", sep="")
ptn_end<-paste('\\B', Misspelled[i], "$", sep="")
ptn<-c(ptn_front, ptn_mid, ptn_end)
Nchar_M<-nchar(Misspelled[i])
Nchar_S<-nchar(String)
out_front<-agrep(pattern=ptn[1], x=String, value=TRUE, max.distance=0.3, ignore.case=TRUE, costs = c(0.6, 0.6, 0.6))
out_mid<-agrep(pattern=ptn[2], x=String, value=TRUE, max.distance=0.3, ignore.case=TRUE, costs = c(0.6, 0.6, 0.6))
out_end<-agrep(pattern=ptn[3], x=String, value=TRUE, max.distance=0.3, ignore.case=TRUE, costs = c(0.6, 0.6, 0.6))
out_test<-list(out_front, out_mid, out_end)
for (j in 1:length(out_test)){
if(length(out_test[j])==1)
use_me<-out_test[j]
}
out[i]<-use_me
}
return(unlist(out))
}
Spell_Correct(String, Misspelled)
Basically this just repeating the previous answer multiple times by using the loop and tweaking the regular expression to try a beginning, middle, and end call to agrep(). Depending on how bad the misspellings are, you may need to play around with the max.distance and cost arguments. Good Luck.
Take Care,
-Sean
I am having issue with partial string matching. I have pairs of people, and I need to compare their names. To do this I have run a charmatch both directions on the two last names, to see if name1 is part of name2, and vice versa. I have a small dataset below to demonstrate the question. I use charmatch below; I have used pmatch as well and it returns the same result.
When charmatch says seeks matches for the seeks matches for the elements of its first argument among those of its second... I take that to mean it will treat each group of characters in element1 as a pattern n see if same group exists in element2. But that's obviously not what's happening, it looks like it's direction specific.
So...is it direction specific? And if so...what else can I use to do what I am describing? My EG names pun intended, what I actually run into are lots of last names where husband has his name and wife has hers + husband. I need to be able to see if husband last name exists within wife last name.
I know it can be done with regular expressions but I am not familiar with them, probably should be, but am not, so I'd prefer an answer that does not use regex.
eg_data <- data.frame(name1 = c('Jimmy Conway', 'Jimmy'),
name2 = c('Conway','Jimmy Conway'))
eg_data$share_name1 <- mapply(charmatch, eg_data$name1, eg_data$name2)
eg_data$share_name2 <- mapply(charmatch, eg_data$name2, eg_data$name1)
eg_data$share_name <- 0
eg_data$share_name [(eg_data$share_name1==1 | eg_data$share_name2==1)]
<- 1
Same two lines, only string detect, not charmatch.
eg_data$share_name1 <- mapply(str_detect,eg_data$name1, eg_data$name2)
eg_data$share_name2 <- mapply(str_detect,eg_data$name2, eg_data$name1)
OR even
eg_data$share_name1 <- ifelse(mapply(str_detect,eg_data$name1, eg_data$name2)==TRUE,1,0)
eg_data$share_name2 <- ifelse(mapply(str_detect,eg_data$name2, eg_data$name1)==TRUE,1,0)
Thanks for anyone who looked. I hope this helps others.
This could be useful
> with(eg_data, intersect(name1, name2))
[1] "Jimmy Conway"
Continuing with my exploration into text analysis, i have encountered yet another roadblock.I understand the logic but don't know how to do it in R.
Here's what i want to do:
I have 2 CSVs- 1. contains 10,000 comments 2. containing a list of words
I want to select all those comments that have any of the words in the 2nd CSV. How can i go about it?
example:
**CSV 1:**
this is a sample set
the comments are not real
this is a random set of words
hope this helps the problem case
thankyou for helping out
i have learned a lot here
feel free to comment
**CSV 2**
sample
set
comment
**Expected output:**
this is a sample set
the comments are not real
this is a random set of words
feel free to comment
Please note:
the different forms of words is also considered, eg, comment and comments are both considered.
We can use grep after pasteing the elements in the second dataset.
v1 <- scan("file2.csv", what ="")
lines1 <- readLines("file1.csv")
grep(paste(v1, collapse="|"), lines1, value=TRUE)
#[1] "this is a sample set" "the comments are not real"
#[3] "this is a random set of words" "feel free to comment"
First create two objects called lines and words.to.match from your files. You could do it like this:
lines <- read.csv('csv1.csv', stringsAsFactors=F)[[1]]
words.to.match <- read.csv('csv2.csv', stringsAsFactors=F)[[1]]
Let's say they look like this:
lines <- c(
'this is a sample set',
'the comments are not real',
'this is a random set of words',
'hope this helps the problem case',
'thankyou for helping out',
'i have learned a lot here',
'feel free to comment'
)
words.to.match <- c('sample', 'set', 'comment')
You can then compute the matches with two nested *apply-functions:
matches <- mapply(
function(words, line)
any(sapply(words, grepl, line, fixed=T)),
list(words.to.match),
lines
)
matched.lines <- lines[which(matches)]
What's going on here? I use mapply to compute a function over each line in lines, taking words.to.match as the other argument. Note that the cardinality of list(words.to.match) is 1. I just recycle this argument across each application. Then, inside the mapply function I call an sapply function to check whether any of the words match the line (I check for the match via grepl).
This is not necessarily the most efficient solution, but it's a bit more intelligible to me. Another way you could compute matches is:
matches <- lapply(words.to.match, grepl, lines, fixed=T)
matches <- do.call("rbind", matches)
matches <- apply(matches, c(2), any)
I dislike this solution because you need to do a do.call("rbind",...), which is a bit hacky.
I want to remove multiple patterns from multiple character vectors. Currently I am going:
a.vector <- gsub("#\\w+", "", a.vector)
a.vector <- gsub("http\\w+", "", a.vector)
a.vector <- gsub("[[:punct:]], "", a.vector)
etc etc.
This is painful. I was looking at this question & answer: R: gsub, pattern = vector and replacement = vector but it's not solving the problem.
Neither the mapply nor the mgsub are working. I made these vectors
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")
substitute <- c("")
Neither mapply(gsub, remove, substitute, a.vector) nor mgsub(remove, substitute, a.vector) worked.
a.vector looks like this:
[4951] "#karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
[4952] "#stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
I want:
[4951] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
[4952] "you are phenomenal #mental #Writing" `
I know this answer is late on the scene but it stems from my dislike of having to manually list the removal patterns inside the grep functions (see other solutions here). My idea is to set the patterns beforehand, retain them as a character vector, then paste them (i.e. when "needed") using the regex seperator "|":
library(stringr)
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")
a.vector <- str_remove_all(a.vector, paste(remove, collapse = "|"))
Yes, this does effectively do the same as some of the other answers here, but I think my solution allows you to retain the original "character removal vector" remove.
Try combining your subpatterns using |. For example
>s<-"#karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
> gsub("#\\w+|http\\w+|[[:punct:]]", "", s)
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
But this could become problematic if you have a large number of patterns, or if the result of applying one pattern creates matches to others.
Consider creating your remove vector as you suggested, then applying it in a loop
> s1 <- s
> remove<-c("#\\w+","http\\w+","[[:punct:]]")
> for (p in remove) s1 <- gsub(p, "", s1)
> s1
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
This approach will need to be expanded to apply it to the entire table or vector, of course. But if you put it into a function which returns the final string, you should be able to pass that to one of the apply variants
In case the multiple patterns that you are looking for are fixed and don't change from case-to-case, you can consider creating a concatenated regex that combines all of the patterns into one uber regex pattern.
For the example you provided, you can try:
removePat <- "(#\\w+)|(http\\w+)|([[:punct:]])"
a.vector <- gsub(removePat, "", a.vector)
I had a vector with statement "my final score" and I wanted to keep on the word final and remove the rest. This what worked for me based on Marian suggestion:
str_remove_all("my final score", "my |score")
note: "my final score" is just an example. I was dealing with a vector.