Extracting specified word from a vector using R

I have a text, e.g.
text <- "i am happy today :):)"
I want to extract :) from the text vector and report its frequency.

Here's one idea, which would be easy to generalize:
text <- c("i was happy yesterday :):)",
          "i am happy today :)",
          "will i be happy tomorrow?")
(nchar(text) - nchar(gsub(":)", "", text))) / 2
# [1] 2 1 0
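This counting trick generalizes by dividing by the pattern's length instead of a hard-coded 2. A minimal sketch wrapping it in a helper (count_fixed is a hypothetical name, not from the answer; fixed = TRUE is added so the smiley's ")" needs no regex escaping):

```r
# Sketch: count non-overlapping fixed-string occurrences via length difference.
# count_fixed is an illustrative helper name, not from the original answer.
count_fixed <- function(x, pattern) {
  (nchar(x) - nchar(gsub(pattern, "", x, fixed = TRUE))) / nchar(pattern)
}

text <- c("i was happy yesterday :):)",
          "i am happy today :)",
          "will i be happy tomorrow?")
count_fixed(text, ":)")
# [1] 2 1 0
```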

I assume you only want the count, or do you also want to remove :) from the string?
For the count you can do:
length(gregexpr(":)",text)[[1]])
which gives 2. A more generalized solution for a vector of strings is:
sapply(gregexpr(":)",text),length)
Edit:
Josh O'Brien pointed out that this also returns 1 if there is no :), since gregexpr returns -1 in that case. To fix this you can use:
sapply(gregexpr(":)",text),function(x)sum(x>0))
Which does become slightly less pretty.
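As a sketch of an alternative that sidesteps the -1 edge case entirely (my own variant, not from the answer above): regmatches() returns only the actual matches for each string, so lengths() gives the count directly, with 0 for strings that contain no smiley.

```r
text <- c("i was happy yesterday :):)",
          "i am happy today :)",
          "will i be happy tomorrow?")
# regmatches() drops the -1 sentinel, returning character(0) for non-matches
lengths(regmatches(text, gregexpr(":)", text, fixed = TRUE)))
# [1] 2 1 0
```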

This does the trick but might not be the most direct way:
mytext<- "i am happy today :):)"
# The following line inserts semicolons to split on
myTextSub<-gsub(":)", ";:);", mytext)
# Then split and unlist
myTextSplit <- unlist(strsplit(myTextSub, ";"))
# Then see how many times the smiley turns up
length(grep(":)", myTextSplit))
EDIT
To handle vectors of text with length > 1, don't unlist:
mytext<- rep("i am happy today :):)",2)
myTextSub<-gsub(":\\)", ";:\\);", mytext)
myTextSplit <- strsplit(myTextSub, ";")
sapply(myTextSplit, function(x){
  length(grep(":)", x))
})
But I like the other answers better.

Related

Unexpected outcome, not replacing, in R out of a gsub function

As the output of a certain operation, I have the following data frame with 729 observations.
> head(con)
Connections
1 r_con[C3-C3,Intercept]
2 r_con[C3-C4,Intercept]
3 r_con[C3-CP1,Intercept]
4 r_con[C3-CP2,Intercept]
5 r_con[C3-CP5,Intercept]
6 r_con[C3-CP6,Intercept]
As can be seen, the pattern to be removed is everything but the pair of electrode information; for instance, in the first observation this would be C3-C3. Now, this is my take on the issue, which I'd expect to leave the data frame with everything else removed. If I'm not wrong (which I probably am), the regex syntax is OK, and from my understanding I believe fixed=TRUE is also necessary. However, I do not understand the R output: where I would expect the pattern to be replaced by nothing (""), it returns this output, which doesn't make sense to me.
> gsub("r_con\\[\\,Intercept\\]\\","",con,fixed=TRUE)
[1] "3:731"
I believe this will probably be a silly question for an expert programmer, which I am far from being, and any insight would be much appreciated.
[UPDATE WITH SOLUTION]
Thanks to Tim and Ben I realised I was using the wrong regex syntax and the wrong source; this is what worked for me:
con2 <- sub("^r_con\\[([^,]+),Intercept\\]", "\\1", con$Connections)
I think your problem is that you're accessing "con" in your sub call. Also, as the user above me pointed out, you probably don't want to use sub.
I'm assuming that your data is consistent, i.e. the strings in con$Connections follow more or less the same pattern. Then this works:
I have set up this example:
con <- data.frame(Connections = c("r_con[C3-C3,Intercept]", "r_con[C3-CP1,Intercept]"))
library(stringr)
f <- function(x){
  part <- str_split(x, ",")[[1]][1]
  str_sub(part, 7, -1)
}
f(con$Connections[1])
sapply(con$Connections, f)
The sub function doesn't work this way. One viable approach would be to capture the quantity you want, then use this capture group as the replacement:
x <- "r_con[C3-C3,Intercept]"
term <- sub("^r_con\\[([^,]+),Intercept\\]", "\\1", x)
term
[1] "C3-C3"
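Since sub() is vectorized, the same capture-group replacement applies to the whole column at once; a minimal sketch using the two-row example data frame from earlier in this thread:

```r
con <- data.frame(Connections = c("r_con[C3-C3,Intercept]",
                                  "r_con[C3-CP1,Intercept]"))
# Capture everything between "r_con[" and ",Intercept]" and keep only that
sub("^r_con\\[([^,]+),Intercept\\]$", "\\1", con$Connections)
# [1] "C3-C3"  "C3-CP1"
```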

How do I extract names with initials in R using sub?

I have several paragraphs that I am trying to extract initials with their correlative name.
For example, I might have a paragraph with lots of text that has the name "A. J. Balfour" in it, or "J. Balfour".
This is what I am writing right now and it doesn't work. I would love your feedback!
z = "This is a bunch of text. I would like to extract A J Balfour"
sub("^(([A]\\\S+\\\s){1}\\\S+).*", "\\1", z, perl = TRUE)
I am thinking the best option is using sub, but I am having issues getting my regular expression to work. I am having trouble finding good info on writing a regular expression that will extract characters.
Thank you.
The stringr library has the str_extract functions with an easier syntax than just using sub.
library(stringr)
str_extract(z, "[A]\\S{0,1}\\s(\\S\\S{0,1}\\s){0,1}.*")
#[1] "A J Balfour"
Edit:
Here is another attempt, but since you are asking for a more general solution, it is very difficult to get an exact match.
z<-c( "This is a bunch of text. I would like to extract A J Balfour",
"J Balfour",
'This is a bunch of text. G. Balfour'
)
str_extract_all(z, "([A-Z]+[\\. ]{1,2}){1,2}.*")
# ( - start of grouping
# [A-Z] - Any capital letter
# + - at least 1 times
# [\\. ] - a period or a space
# {1,2} - one or two times
# ){1,2} - 1 or 2 times for the grouping
# .* - any character zero or more times
In fact this attempt fails on the first test. Narrowing down to [A-J] would help.
Good luck.
Thank you! I ended up using str_extract_all to look like this:
z = "This is a bunch of text. I would like to extract A. J. Balfour and maybe some other words or another A. F. Balfour or even G. G. Balfour or maybe even A. G. Balfour"
str_extract_all(z, regex("[A-Z]. [A-Z]. Balfour", simplify = TRUE))
Thanks for all the thoughts!
Consider
using regmatches in base R.
z = "This is a bunch of text. I would like to extract A J Balfour"
regmatches(z,regexpr("[A]\\s{1}\\S+.*", z))
#[1] "A J Balfour"
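If the initials in the text actually carry periods, as in the asker's later examples, note that an unescaped "." in the pattern matches any character; escaping it restricts the match to literal periods. A hedged sketch combining regmatches with gregexpr for multiple hits:

```r
z <- "extract A. J. Balfour and maybe another A. F. Balfour or even G. G. Balfour"
# \\. matches a literal period; a bare "." would also match any character
regmatches(z, gregexpr("[A-Z]\\. [A-Z]\\. Balfour", z))[[1]]
# [1] "A. J. Balfour" "A. F. Balfour" "G. G. Balfour"
```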

How to remove last n characters from every element in the R vector

I am very new to R, and I could not find a simple example online of how to remove the last n characters from every element of a vector (array?)
I come from a Java background, so what I would like to do is to iterate over every element of a$data and remove the last 3 characters from every element.
How would you go about it?
Here is an example of what I would do. I hope it's what you're looking for.
char_array = c("foo_bar","bar_foo","apple","beer")
a = data.frame("data"=char_array,"data2"=1:4)
a$data = substr(a$data,1,nchar(a$data)-3)
a should now contain:
data data2
1 foo_ 1
2 bar_ 2
3 ap 3
4 b 4
Here's a way with gsub:
cs <- c("foo_bar","bar_foo","apple","beer")
gsub('.{3}$', '', cs)
# [1] "foo_" "bar_" "ap" "b"
Although this is mostly the same as the answer by @nfmcclure, I prefer using the stringr package as it provides a set of functions whose names are more consistent and descriptive than those in base R (in fact I always google "how to get the number of characters in R" as I can't remember the name nchar()).
library(stringr)
str_sub(iris$Species, end=-4)
#or
str_sub(iris$Species, 1, str_length(iris$Species)-3)
This removes the last 3 characters from each value in the Species column.
The same may be achieved with the stringi package:
library('stringi')
char_array <- c("foo_bar","bar_foo","apple","beer")
a <- data.frame("data"=char_array, "data2"=1:4)
(a$data <- stri_sub(a$data, 1, -4)) # from the first character to the 4th-from-last, i.e. drop the last 3
## [1] "foo_" "bar_" "ap" "b"
Similar to @Matthew_Plourde's answer using gsub, but with a pattern that will trim to zero characters, i.e. return "" if the original string is shorter than the number of characters to cut:
cs <- c("foo_bar","bar_foo","apple","beer","so","a")
gsub('.{0,3}$', '', cs)
# [1] "foo_" "bar_" "ap" "b" "" ""
The difference is that the {0,3} quantifier indicates 0 to 3 matches, whereas {3} requires exactly 3 matches; otherwise no match is found, in which case gsub returns the original, unmodified string.
N.B. using {,3} would be equivalent to {0,3}, I simply prefer the latter notation.
See here for more information on regex quantifiers:
https://www.regular-expressions.info/refrepeat.html
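The difference between the two quantifiers is easy to demonstrate side by side; a small sketch on strings shorter than the cut length:

```r
cs <- c("foo_bar", "so", "a")
gsub(".{3}$",   "", cs)  # {3}: needs exactly 3 chars before the end, else string unchanged
# [1] "foo_" "so"   "a"
gsub(".{0,3}$", "", cs)  # {0,3}: matches whatever is there, up to 3 chars
# [1] "foo_" ""     ""
```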
A friendly hint when cutting off or replacing the last n characters of a string: be aware of whitespace in your strings!
Use base::gsub(' ', '', x, fixed = TRUE) to get rid of unwanted whitespace. I spent quite some time finding out why the great solutions provided above did not work for me; thought it might be useful for others as well ;)

Removing duplicate words in a string in R

Just to help someone who's just voluntarily removed their question, following a request for code he tried and other comments. Let's assume they tried something like this:
str <- "How do I best try and try and try and find a way to to improve this code?"
d <- unlist(strsplit(str, split=" "))
paste(d[-which(duplicated(d))], collapse = ' ')
and wanted to learn a better way. So what is the best way to remove a duplicate word from the string?
If you are still interested in alternate solutions you can use unique which slightly simplifies your code.
paste(unique(d), collapse = ' ')
As per the comment by Thomas, you probably do want to remove punctuation. R's gsub has some nice internal patterns you can use instead of strict regex. Of course you can always specify specific instances if you want to do some more refined regex.
d <- gsub("[[:punct:]]", "", d)
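Putting the two steps together, a minimal sketch of the full pipeline (split on spaces, strip punctuation so "to" and "code?"/"code" deduplicate cleanly, then rejoin):

```r
str <- "How do I best try and try and try and find a way to to improve this code?"
d <- unlist(strsplit(str, split = " "))
d <- gsub("[[:punct:]]", "", d)   # drop punctuation so "code?" matches "code"
paste(unique(d), collapse = " ")
# [1] "How do I best try and find a way to improve this code"
```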
No additional packages are needed:
str <- c("How do I best try and try and try and find a way to to improve this code?",
"And and here's a second one one and not a third One.")
Atomic function:
rem_dup.one <- function(x){
  paste(unique(tolower(trimws(unlist(strsplit(x, split = "(?!')[ [:punct:]]", fixed = F, perl = T))))), collapse = " ")
}
rem_dup.one("And and here's a second one one and not a third One.")
Vectorize
rem_dup.vector <- Vectorize(rem_dup.one,USE.NAMES = F)
rem_dup.vector(str)
Result:
"how do i best try and find a way to improve this code" "and here's a second one not third"
To remove duplicate words while leaving special characters intact, use this function:
rem_dup_word <- function(x){
  x <- tolower(x)
  paste(unique(trimws(unlist(strsplit(x, split = " ", fixed = F, perl = T)))), collapse = " ")
}
Input data:
duptest <- "Samsung WA80E5LEC samsung Top Loading with Diamond Drum, 6 kg (Silver)"
rem_dup_word(duptest)
output: samsung wa80e5lec top loading with diamond drum 6 kg (silver)
It will treat "Samsung" and "SAMSUNG" as duplicates.
I'm not sure if string case is a concern. This solution uses qdap with the add-on qdapRegex package to make sure that punctuation and beginning string case doesn't interfere with the removal but is maintained:
str <- c("How do I best try and try and try and find a way to to improve this code?",
"And and here's a second one one and not a third One.")
library(qdap)
library(dplyr) # so that the pipe function (%>%) can work
str %>%
tolower() %>%
word_split() %>%
sapply(., function(x) unbag(unique(x))) %>%
rm_white_endmark() %>%
rm_default(pattern="(^[a-z]{1})", replacement = "\\U\\1") %>%
unname()
## [1] "How do i best try and find a way to improve this code?"
## [2] "And here's a second one not third."

R gsub add leading line break

I need to add a leading line break "\n" to a list of axis label names in R, and I cannot work out how to do this with gsub. For example, I need "Q1\n/\n15" to read "\nQ1\n/\n15". Neither Google nor the help commands are leading me to the answer. Any advice?
Thanks in advance.
So there are about 4 answers in the comments (as of this writing), so I'll just summarize them in a proper answer.
examp <- "Q1\n/\n15"
paste("\n", examp, sep="")
gsub("^(.)","\n\\1",examp)
sprintf("\n%s", examp)
gsub("^", "\n", examp)
all of which give
[1] "\nQ1\n/\n15"
And all of which are properly vectorized (that is, if examp <- c("Q1\n/\n15", "Q1\n/\n16"), all return [1] "\nQ1\n/\n15" "\nQ1\n/\n16").