I've got the following code, which I expect to give me a list of 3, since there are 3 elements in texts:
library(stringr)
texts <- c("I doubt it! :)", ";) disagree, but ok.", "No emoticons here!!!")
smileys <- c(":)","(:",";)",":D")
str_extract_all(texts, fixed(smileys))
Instead, I get a list of four (the length of my "pattern" parameter, here the smileys. Additionally, I get the following warning message:
Warning message: In stri_extract_all_fixed(string, pattern, simplify =
simplify, : longer object length is not a multiple of shorter object
length```
Well, I don't imagine length will match, as I'm looking for any hits on any of the smileys in each text. It's not like I want to match string 1 with pattern 1, string 2 with pattern 2, etc.
Aware that I am messing up stringi's understanding of vectorizing, I have tried this instead:
texts %>% map(~ str_extract_all(.x, fixed(smileys)))
This is much better, as it gives me a list of 3, but each element is in turn a list of four.
What I'm trying to get to is a list of 3 that is as little nested as possible. Someone, somewhere, has solved this, but I can't for the life of me figure it out or get how to google it. I could do a for loop over this, but I consider myself a citizen of the tidyverse...
Grateful for any assistance.
You can use paste to wrap each element of smiley with \\Q and \\E and collapse on the regex "or" metacharacter (|) to form a single pattern. As mentioned in the link Henrik shared and documented on ?regex and in the stringi manual, characters between \\Q and \\E are interpreted literally.
pattern <- paste("\\Q", smileys, "\\E", sep = "", collapse = "|")
# [1] "\\Q:)\\E|\\Q(:\\E|\\Q;)\\E|\\Q:D\\E"
library(stringi)
stri_extract_all_regex(texts, pattern)
#[[1]]
#[1] ":)"
#
#[[2]]
#[1] ";)"
#
#[[3]]
#[1] NA
Base R:
regmatches(texts, gregexpr(pattern, texts))
#[[1]]
#[1] ":)"
#
#[[2]]
#[1] ";)"
#
#[[3]]
#character(0)
# If you want an NA, instead of a zero-length vector,
# then you could do something like:
# lapply(
# regmatches(texts, gregexpr(pattern, texts)),
# function(ii) ifelse(is.character(ii) & length(ii) == 0L, NA, ii))
And if you do want to use purrr and avoid regular expressions, one idea would be something like this:
library(purrr)
library(stringr)
texts %>%
map(~ unlist(str_extract_all(.x, fixed(smileys))))
#[[1]]
#[1] ":)"
#
#[[2]]
#[1] ";)"
#
#[[3]]
#character(0)
# if you want NA, not a zero-length vector, you could add:
# %>% map(~ ifelse(is.character(.x) & length(.x) == 0L, NA, .x))
Related
I've got a column people$food that has entries like chocolate or apple-orange-strawberry.
I want to split people$food by - and get the first entry from the split.
In python, the solution would be food.split('-')[0], but I can't find an equivalent for R.
If you need to extract the first (or nth) entry from each split, use:
word <- c('apple-orange-strawberry','chocolate')
sapply(strsplit(word,"-"), `[`, 1)
#[1] "apple" "chocolate"
Or faster and more explictly:
vapply(strsplit(word,"-"), `[`, 1, FUN.VALUE=character(1))
#[1] "apple" "chocolate"
Both bits of code will cope well with selecting whichever value in the split list, and will deal with cases that are outside the range:
vapply(strsplit(word,"-"), `[`, 2, FUN.VALUE=character(1))
#[1] "orange" NA
For example
word <- 'apple-orange-strawberry'
strsplit(word, "-")[[1]][1]
[1] "apple"
or, equivalently
unlist(strsplit(word, "-"))[1].
Essentially the idea is that split gives a list as a result, whose elements have to be accessed either by slicing (the former case) or by unlisting (the latter).
If you want to apply the method to an entire column:
first.word <- function(my.string){
unlist(strsplit(my.string, "-"))[1]
}
words <- c('apple-orange-strawberry', 'orange-juice')
R: sapply(words, first.word)
apple-orange-strawberry orange-juice
"apple" "orange"
I would use sub() instead. Since you want the first "word" before the split, we can simply remove everything after the first - and that's what we're left with.
sub("-.*", "", people$food)
Here's an example -
x <- c("apple", "banana-raspberry-cherry", "orange-berry", "tomato-apple")
sub("-.*", "", x)
# [1] "apple" "banana" "orange" "tomato"
Otherwise, if you want to use strsplit() you can round up the first elements with vapply()
vapply(strsplit(x, "-", fixed = TRUE), "[", "", 1)
# [1] "apple" "banana" "orange" "tomato"
I would suggest using head rather than [ in R.
word <- c('apple-orange-strawberry','chocolate')
sapply(strsplit(word, "-"), head, 1)
# [1] "apple" "chocolate"
dplyr/magrittr approach:
library(magrittr)
library(dplyr)
word = c('apple-orange-strawberry', 'chocolate')
strsplit(word, "-") %>% sapply(extract2, 1)
# [1] "apple" "chocolate"
Using str_remove() to delete everything after the pattern:
df <- data.frame(words = c('apple-orange-strawberry', 'chocolate'))
mutate(df, short = stringr::str_remove(words, "-.*")) # mutate method
stringr::str_remove(df$words, "-.*") # str_remove example
stringr::str_replace(df$words, "-.*", "") # str_replace method
stringr::str_split_fixed(df$words, "-", n=2)[,1] # str_split method similar to original question's Python code
tidyr::separate(df, words, into = c("short", NA)) # using the separate function
words short
1 apple-orange-strawberry apple
2 chocolate chocolate
stringr 1.5.0 introduced str_split_i to do this easily:
library(stringr)
str_split_i(c('apple-orange-strawberry','chocolate'), "-", 1)
[1] "apple" "chocolate"
The third argument represents the index you want to extract. Also new is that you can use negative values to index from the right:
str_split_i(c('apple-orange-strawberry','chocolate'), "-", -1)
[1] "strawberry" "chocolate"
I have a character string and what to extract the information inside of multiple parentheses. Currently I can extract the information from the last parenthesis with the code below. How would I do it so it extracts multiple parentheses and returns as a vector?
j <- "What kind of cheese isn't your cheese? (wonder) Nacho cheese! (groan) (Laugh)"
sub("\\).*", "", sub(".*\\(", "", j))
Current output is:
[1] "Laugh"
Desired output is:
[1] "wonder" "groan" "Laugh"
Here is an example:
> gsub("[\\(\\)]", "", regmatches(j, gregexpr("\\(.*?\\)", j))[[1]])
[1] "wonder" "groan" "Laugh"
I think this should work well:
> regmatches(j, gregexpr("(?=\\().*?(?<=\\))", j, perl=T))[[1]]
[1] "(wonder)" "(groan)" "(Laugh)"
but the results includes parenthesis... why?
This works:
regmatches(j, gregexpr("(?<=\\().*?(?=\\))", j, perl=T))[[1]]
Thanks #MartinMorgan for the comment.
Using the stringr package we can reduce this a little bit.
library(stringr)
# Get the parenthesis and what is inside
k <- str_extract_all(j, "\\([^()]+\\)")[[1]]
# Remove parenthesis
k <- substring(k, 2, nchar(k)-1)
#kohske uses regmatches but I'm currently using 2.13 so don't have access to that function at the moment. This adds the dependency on stringr but I think it is a little easier to work with and the code is a little clearer (well... as clear as using regular expressions can be...)
Edit: We could also try something like this -
re <- "\\(([^()]+)\\)"
gsub(re, "\\1", str_extract_all(j, re)[[1]])
This one works by defining a marked subexpression inside the regular expression. It extracts everything that matches the regex and then gsub extracts only the portion inside the subexpression.
I think there are basically three easy ways of extracting multiple capture groups in R (without using substitution); str_match_all, str_extract_all, and regmatches/gregexpr combo.
I like #kohske's regex, which looks behind for an open parenthesis ?<=\\(, looks ahead for a closing parenthesis ?=\\), and grabs everything in the middle (lazily) .+?, in other words (?<=\\().+?(?=\\))
Using the same regex:
str_match_all returns the answer as a matrix.
str_match_all(j, "(?<=\\().+?(?=\\))")
[,1]
[1,] "wonder"
[2,] "groan"
[3,] "Laugh"
# Subset the matrix like this....
str_match_all(j, "(?<=\\().+?(?=\\))")[[1]][,1]
[1] "wonder" "groan" "Laugh"
str_extract_all returns the answer as a list.
str_extract_all(j, "(?<=\\().+?(?=\\))")
[[1]]
[1] "wonder" "groan" "Laugh"
#Subset the list...
str_extract_all(j, "(?<=\\().+?(?=\\))")[[1]]
[1] "wonder" "groan" "Laugh"
regmatches/gregexpr also returns the answer as a list. Since this is a base R option, some people prefer it. Note the recommended perl = TRUE.
regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))
[[1]]
[1] "wonder" "groan" "Laugh"
#Subset the list...
regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))[[1]]
[1] "wonder" "groan" "Laugh"
Hopefully, the SO community will correct/edit this answer if I've mischaracterized the most popular options.
Using rex may make this type of task a little simpler.
matches <- re_matches(j,
rex(
"(",
capture(name = "text", except_any_of(")")),
")"),
global = TRUE)
matches[[1]]$text
#>[1] "wonder" "groan" "Laugh"
I have a dataframe comprising two columns of words. For each row I'd like to identify any letters that occur in only the word in the second column e.g.
carpet carpelt #return 'l'
bag flag #return 'f' & 'l'
dog dig #return 'i'
I'd like to use R to do this automatically as I have 6126 rows.
As an R newbie, the best I've got so far is this, which gives me the unique letters across both words (and is obviously very clumsy):
x<-(strsplit("carpet", ""))
y<-(strsplit("carpelt", ""))
z<-list(l1=x, l2=y)
unique(unlist(z))
Any help would be much appreciated.
The function you’re searching for is setdiff:
chars_for = function (str)
strsplit(str, '')[[1]]
result = setdiff(chars_for(word2), chars_for(word1))
(Note the inverted order of the arguments in setdiff.)
To apply it to the whole data.frame, called x:
apply(x, 1, function (words) setdiff(chars_for(words[2]), chars_for(words[1])))
Use regex :) Paste your word with brackets [] and then use replace function for regex. This regex finds any letter from those in brackets and replaces it with empty string (you can say that it "removes" these letters).
require(stringi)
x <- c("carpet","bag","dog")
y <- c("carplet", "flag", "smog")
pattern <- stri_paste("[",x,"]")
pattern
## [1] "[carpet]" "[bag]" "[dog]"
stri_replace_all_regex(y, pattern, "")
## [1] "l" "fl" "sm"
x <- c("carpet","bag","dog")
y <- c("carpelt", "flag", "dig")
Following (somewhat) with what you were going for with strsplit, you could do
> sx <- strsplit(x, "")
> sy <- strsplit(y, "")
> lapply(seq_along(sx), function(i) sy[[i]][ !sy[[i]] %in% sx[[i]] ])
#[[1]]
#[1] "l"
#
#[[2]]
#[1] "f" "l"
#
#[[3]]
#[1] "i"
This uses %in% to logically match the characters in y with the characters in x. I negate the matching with ! to determine those those characters that are in y but not in x.
I have this variable
x= "379_exp_mirror1.csv"
I need to extract the number ("379") at the beggining (which doesn't always have 3 characters), i.e. everything before the first "". And then I need to extract everything between the second "" and the ".", in this case "mirror1".
I have tried several combinations with sub and gsub with no success, can anyone give me some indications please?
Thank you
You can use regular expression. For your problem ^(?<Number>[0-9]*)_.* do the job
1/ Test your regular expression with this website : http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
Or you can split string with underscore and then try parse (int.TryParse). I think the second is better but if you want to be a regular expression master try the first method
You can use sub to extract the substrings:
x <- "379_exp_mirror1.csv"
sub("_.*", "", x)
# [1] "379"
sub("^(?:.*_){2}(.*?)\\..*", "\\1", x)
# [1] "mirror1"
Another approach with gregexpr:
regmatches(x, gregexpr("^.*?(?=_)|(?<=_)[^_]*?(?=\\.)", x, perl = TRUE))[[1]]
# [1] "379" "mirror1"
May be you can try:
library(stringr)
x <- "379_exp_mirror1.csv"
str_extract_all(x, perl('^[0-9]+(?=_)|[[:alnum:]]+(?=\\.)'))[[1]]
#[1] "379" "mirror1"
Or
strsplit(x, "[._]")[[1]][c(T,F)]
#[1] "379" "mirror1"
Or
scan(text=gsub("[.]","_", x),what="",sep="_")[c(T,F)]
#Read 4 items
#[1] "379" "mirror1"
I want to see, if "001" or "100" or "000" occurs in a string of 4 characters of 0 and 1. For example, a 4 character string could be like "1100" or "0010" or "1001" or "1111". How do I match many strings in a string with a single command?
I know grep could be used for pattern matching, but using grep, I can check only one string at a time. I want to know if multiple strings can be used with some other command or with grep itself.
Yes, you can. The | in a grep pattern has the same meaning as or. So you can test for your pattern by using "001|100|000" as your pattern. At the same time, grep is vectorised, so all of this can be done in one step:
x <- c("1100", "0010", "1001", "1111")
pattern <- "001|100|000"
grep(pattern, x)
[1] 1 2 3
This returns an index of which of your vectors contained the matching pattern (in this case the first three.)
Sometimes it is more convenient to have a logical vector that tells you which of the elements in your vector were matched. Then you can use grepl:
grepl(pattern, x)
[1] TRUE TRUE TRUE FALSE
See ?regex for help about regular expressions in R.
Edit:
To avoid creating pattern manually we can use paste:
myValues <- c("001", "100", "000")
pattern <- paste(myValues, collapse = "|")
Here is one solution using stringr package
require(stringr)
mylist = c("1100", "0010", "1001", "1111")
str_locate(mylist, "000|001|100")
Use the -e argument to add additional patterns:
echo '1100' | grep -e '001' -e '110' -e '101'
If you want logical vector then you should check stri_detect function from stringi package. In your case the pattern is regex, so use this one:
stri_detect_regex(x, pattern)
## [1] TRUE TRUE TRUE FALSE
And some benchmarks:
require(microbenchmark)
test <- stri_paste(stri_rand_strings(100000, 4, "[0-1]"))
head(test)
## [1] "0001" "1111" "1101" "1101" "1110" "0110"
microbenchmark(stri_detect_regex(test, pattern), grepl(pattern, test))
Unit: milliseconds
expr min lq mean median uq max neval
stri_detect_regex(test, pattern) 29.67405 30.30656 31.61175 30.93748 33.14948 35.90658 100
grepl(pattern, test) 36.72723 37.71329 40.08595 40.01104 41.57586 48.63421 100
Sorry for making this an additonal answer, but it is too many lines for a comment.
I just wanted to remind, that the number of items that can be pasted together via paste(..., collapse = "|") to be used as a single matching pattern is limited - see below. Maybe somebody can tell where exactly the limit is? Admittedly the number might not be realistic, but depending on the task to be performed it should not entirely be excluded from our considerations.
For a really large number of items, a loop would be required to check each item of the pattern.
set.seed(0)
samplefun <- function(n, x, collapse){
paste(sample(x, n, replace=TRUE), collapse=collapse)
}
words <- sapply(rpois(10000000, 8) + 1, samplefun, letters, '')
text <- sapply(rpois(1000, 5) + 1, samplefun, words, ' ')
#since execution takes a while, I have commented out the following lines
#result <- grepl(paste(words, collapse = "|"), text)
# Error in grepl(pattern, text) :
# invalid regular expression
# 'wljtpgjqtnw|twiv|jphmer|mcemahvlsjxr|grehqfgldkgfu|
# ...
#result <- stringi::stri_detect_regex(text, paste(words, collapse = "|"))
# Error in stringi::stri_detect_regex(text, paste(words, collapse = "|")) :
# Pattern exceeds limits on size or complexity. (U_REGEX_PATTERN_TOO_BIG)
You can also use the %like% operator from data.table library.
library(data.table)
# input
x <- c("1100", "0010", "1001", "1111")
pattern <- "001|100|000"
# check for pattern
x %like% pattern
> [1] TRUE TRUE TRUE FALSE