I'm using the stringr library to count the number of occurrences of an array of strings in a column from an Excel sheet.
string.arr = c(
  "I can't handle this.",
  "I shouldn't be this stressed out.",
  ... more possible strings ...
)
Sample data:
1 col_name
2 “I’m never going to succeed.”,“The professor will be disappointed in me.”,“Other students won’t want to work with me.”,“I shouldn't be this stressed out.",“Other people can handle this situation - what's wrong with me?"
3 “Everyone will think I am dumb.”,“People will make jokes about me if I get the wrong answer.”,“I shouldn't be this stressed out.",“Other people can handle this situation - what's wrong with me?"
4 ... more such rows ...
As you can see from the sample data, two kinds of apostrophes appear: straight ' and curly ’. In R, however, I only typed ' when creating string.arr. Consequently, the code below does not count the strings that contain ’.
library(stringr)
for (string in string.arr) {
  # Print the total count of this string across the column
  print(sum(str_count(deidentified_data_text_df$col_name, string), na.rm = TRUE))
}
It's not feasible to modify the data. Can I solve this in the code, so that both ' and ’ in the data are matched by the ' in string.arr?
I'm open to using any other package in R.
EDIT:
If string.arr is essentially a list of key words (or sentences) that you want to match in a larger text, and the problem is that the larger text may contain two kinds of apostrophes, then you can simply replace every apostrophe in string.arr with a regex alternation group:
string.arr <- gsub("’|'", "(’|')", string.arr)
Result:
string.arr
[1] "I can(’|')t handle this."
[2] "They won(’|')t handle this"
[3] "I shouldn(’|')t be this stressed out."
[4] "no apostrophe"
Data:
string.arr = c(
  "I can’t handle this.",              # curly apostrophe
  "They won't handle this",            # straight apostrophe
  "I shouldn't be this stressed out.", # straight apostrophe
  "no apostrophe"                      # no apostrophe
)
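The rewritten patterns can then be used directly as regular expressions. A minimal sketch of the counting step, reusing the data frame and column names from the question; note that the literal periods in string.arr are regex metacharacters, so they should be escaped before building the alternation:
library(stringr)
string.arr <- gsub(".", "\\.", string.arr, fixed = TRUE)  # escape literal periods
string.arr <- gsub("’|'", "(’|')", string.arr)            # apostrophe alternation
# Total occurrences of each pattern across the column
counts <- sapply(string.arr, function(p) {
  sum(str_count(deidentified_data_text_df$col_name, p), na.rm = TRUE)
})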
Related
I've scraped chunks of text from XML files that are often missing whitespace between sentences. I've used str_split with great success to break the chunks into digestible sentences, like below:
list_of_strings <- str_split(chunk_of_text, pattern = boundary("sentence"))
This works pretty well, but it can't deal with situations where the terminal period is not followed by a space. For example, "This sentence ends.This sentence continues." comes back as one sentence, not two; boundary("sentence") simply doesn't see the break.
If I search and replace periods with period-space, of course that screws up numbers like 1.5 pounds.
I've explored using wildcards to detect the situation, e.g.,
str_view_all(x, "[[:alpha:]]\\.[[:alpha:]]")
but I can't figure out how to either 1) insert a space after the period so a subsequent call to str_split works correctly, or 2) split at the period.
Any advice on separating sentences when this occurs?
Newbie R programmer here, thanks for your help!
library(stringr)
x <- "This sentence ends.This sentence continues. It costs 1.5 pounds.They needed it A.S.A.P.Here's one more sentence."
str_split(x, "\\.\\s?(?=[A-Z][^\\.])")
[[1]]
[1] "This sentence ends" "This sentence continues"
[3] "It costs 1.5 pounds" "They needed it A.S.A.P"
[5] "Here's one more sentence."
Explanation:
\\.        # literal period
\\s?       # optional whitespace
(?=[A-Z]   # followed by a capital letter
[^\\.])    # which isn't itself followed by another period
Also note this doesn't account for every possibility. For instance, it'll erroneously split after "Dr." in "Dr. Perez is on call." You could handle that case by adding a negative lookbehind:
"(?<!Dr|Mr|Mrs|Ms|Mx)\\.\\s?(?=[A-Z][^\\.])"
But the specific contents, and other edge cases to handle, will depend on your data.
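If you prefer your option 1 (insert the missing space, then let boundary("sentence") do the splitting), here is a sketch of the same idea using the lookahead above; x is a made-up example:
library(stringr)
x <- "This sentence ends.This sentence continues. It costs 1.5 pounds."
# Add a space after any period directly followed by a capital letter
# that isn't itself followed by another period:
x_fixed <- str_replace_all(x, "\\.(?=[A-Z][^\\.])", ". ")
str_split(x_fixed, boundary("sentence"))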
When using paste() with a given separator, as in the example below, is it possible to suppress the separator at individual positions?
The example is part of a ggplot() + labs(subtitle = paste(...)) call with a whitespace separator:
paste("Minimum/maximum number of observations per point:", min(currdataset_S03$nobs), "* /", max(currdataset_S03$nobs), sep = " ")
This results in:
Minimum/maximum number of observations per point: 7 * / 14
Now I'd like to suppress/skip only the whitespace between the 7 (= min(currdataset_S03$nobs)) and the asterisk, but I don't know how.
I'm fairly sure I've seen a simple solution to this some time ago, but I can't remember it, or my memory may not serve me right.
However, I could not find any helpful post so far. So, does anybody have an idea for me, please?
You could solve this using sprintf, which lets you place every space explicitly:
sprintf("Minimum/maximum number of observations per point: %s* / %s",
        min(currdataset_S03$nobs), max(currdataset_S03$nobs))
# [1] "Minimum/maximum number of observations per point: 7* / 14"
Data:
currdataset_S03 <- data.frame(nobs=7:14)
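If you'd rather stay in the paste family, paste0 uses no separator at all, so every space is placed explicitly:
paste0("Minimum/maximum number of observations per point: ",
       min(currdataset_S03$nobs), "* / ", max(currdataset_S03$nobs))
# [1] "Minimum/maximum number of observations per point: 7* / 14"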
I am scraping comments from Reddit and trying to remove empty rows/comments.
A number of rows appear empty, though I cannot seem to remove them. When I check them with is_empty, they do not register as empty.
> Reddit[25,]
[1] ""
> is_empty(Reddit$text[25])
[1] FALSE
> Reddit <- subset(Reddit, text != "")
> Reddit[25,]
[1] ""
Am I missing something? I've tried a couple of other methods to remove these rows and they haven't worked either.
Edit:
Included a reproducible example in response to the comments:
RedditSample <- data.frame(text =
  c("I liked coinbase, used it before. But the fees are simply too much. If they were to take 1% instead 2.5% I would understand. It's much simpler and long term it doesn't matter as much.",
    "But Binance only charges 0.1% so making the switch is worth it fairly quickly. They also have many more coins. Approval process took me less than 10 minutes, but always depends on how many register at the same time.",
    "\u200b",  # the "empty" row: a zero-width space, not an empty string
    "Here's a 10%/10% referal code if you chose to register: KHELMJ94",
    "What is a spot wallet?"))
Actually, the data you shared doesn't contain an empty string: it contains a Unicode zero-width space character (U+200B). You can see that with
charToRaw(RedditSample$text[3])
# [1] e2 80 8b
You could keep only the rows whose text contains at least one "word" character, using a regular expression:
subset(RedditSample, grepl("\\w", text))
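With the sample above, this keeps the four real comments and drops the zero-width-space row:
nrow(subset(RedditSample, grepl("\\w", text)))
# [1] 4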
You could use the string length functions, but note that the zero-width space counts as one character (str_length("\u200b") is 1), so strip it first. In the tidyverse, which includes the stringr package:
library(tidyverse)
Reddit %>%
  mutate(text = str_remove_all(text, "\u200b")) %>%
  filter(str_length(text) > 0)
Or base R:
Reddit[nchar(gsub("\u200b", "", Reddit$text)) > 0, ]
I have the data frame below:
head(df)
index song year artist genre lyrics
2 Till i am gone 2010 Eminem Rap Chorus:It's too much, it's too tough
I have done other data cleanups, such as converting everything to lower case using gsub and removing words between brackets. However, I can't find the syntax to remove just a word and the colon that follows it; for example, in my row I want to remove "Chorus:".
After the cleanup, it should be:
lyrics
It's too much, it's too tough
The following code deletes everything before the colon, which I don't want, as the colon can be anywhere in the cell:
gsub(".*:","",foo)
You can restrict the pattern to remove only the word immediately before the colon (plus the colon itself).
I expanded your test set to show that it works.
foo = c("Chorus:It's too much, it's too tough ",
"ABC Chorus:It's too much, it's too tough ")
gsub("\\w+:", "", foo)
[1] "It's too much, it's too tough " "ABC It's too much, it's too tough "
I looked around both here and elsewhere and found many similar questions, but none that exactly answer mine. I need to clean up naming conventions: specifically, replace or remove certain words and phrases in one column/variable, not the entire dataset. I am migrating from SPSS to R; an example of the SPSS code is below, but I am not sure how to do the same in R.
E.g.:
"Acadia Parish" --> "Acadia" (removes Parish and space before Parish)
"Fifth District" --> "Fifth" (removes District and space before District)
SPSS syntax:
COMPUTE county=REPLACE(county,' Parish','').
There are only a few instances of this issue in the column of 32,000 cases, and what needs replacing or removing varies; affected cases can repeat (there are dozens of instances of phrases containing 'Parish'). So it's much faster to code exactly what needs to be removed or replaced; it's not as simple or clean as a single regular expression that removes all spaces, all characters after a given word or character, all special characters, etc. And the match must include leading spaces.
I have looked at replace(), gsub(), and similar functions in R, but they all seem to involve creating vectors. What I'd like is syntax that looks for characters I specify (which can include leading or trailing spaces) and replaces them with something I specify (which can be nothing at all); if the specified characters are not found, the case is left unchanged.
Yes, I will end up repeating the same syntax many times; it's probably easier to create a vector, but if possible I'd like the syntax I described, as there are other similar operations I need to do as well.
Thank you for looking.
> x <- c("Acadia Parish", "Fifth District")
> x2 <- gsub("^(\\w*).*$", "\\1", x)
> x2
[1] "Acadia" "Fifth"
Legend:
^ Start of string.
() Capturing group.
\w* Zero or more word characters.
.* Zero or more occurrences of any character except newline \n.
$ End of string.
\1 Backreference to the first captured group.
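Note that this keeps only the first word, so multi-word names lose information:
gsub("^(\\w*).*$", "\\1", "East Baton Rouge Parish")
# [1] "East"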
Maybe I'm missing something, but I don't see why you can't simply use alternation in your regex, then trim out the annoying whitespace.
string <- c("Arcadia Parish", "Fifth District")
bad_words <- c("Parish", "District") # Write all the words you want removed here!
bad_regex <- paste(bad_words, collapse = "|")
trimws( sub(bad_regex, "", string) )
# [1] "Arcadia" "Fifth"
The direct equivalent of your SPSS REPLACE line is:
dataframename$varname <- gsub(" Parish", "", dataframename$varname)
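If a phrase to remove ever contains regex metacharacters, fixed = TRUE makes gsub treat it literally; repeating one line per phrase mirrors your SPSS COMPUTE statements (dataframename/varname are placeholders):
dataframename$varname <- gsub(" Parish", "", dataframename$varname, fixed = TRUE)
dataframename$varname <- gsub(" District", "", dataframename$varname, fixed = TRUE)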