Using Regex to edit a column in R [duplicate]

I've got a column people$food that has entries like chocolate or apple-orange-strawberry.
I want to split people$food by - and get the first entry from the split.
In Python, the solution would be food.split('-')[0], but I can't find an equivalent for R.

If you need to extract the first (or nth) entry from each split, use:
word <- c('apple-orange-strawberry','chocolate')
sapply(strsplit(word,"-"), `[`, 1)
#[1] "apple" "chocolate"
Or faster and more explicitly:
vapply(strsplit(word,"-"), `[`, 1, FUN.VALUE=character(1))
#[1] "apple" "chocolate"
Both versions can select any position in the split list, and both return NA when the requested position is out of range:
vapply(strsplit(word,"-"), `[`, 2, FUN.VALUE=character(1))
#[1] "orange" NA
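Applied to the data frame column from the question, the same idea might look like the sketch below. The `people` data frame here is made up for illustration, and `first_food` is a hypothetical column name.

```r
# Hypothetical data frame standing in for the asker's `people`
people <- data.frame(
  food = c("chocolate", "apple-orange-strawberry"),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats "-" as a literal string rather than a regex
pieces <- strsplit(people$food, "-", fixed = TRUE)

# Take the first piece of every split; vapply() enforces one string per row
people$first_food <- vapply(pieces, `[`, character(1), 1)

people$first_food
```

Storing the result in a new column keeps the original values available for checking.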

For example
word <- 'apple-orange-strawberry'
strsplit(word, "-")[[1]][1]
[1] "apple"
or, equivalently
unlist(strsplit(word, "-"))[1]
Essentially, strsplit() returns a list, whose elements have to be accessed either by indexing into the list (the former case) or by unlisting it (the latter).
If you want to apply the method to an entire column:
first.word <- function(my.string){
  unlist(strsplit(my.string, "-"))[1]
}
words <- c('apple-orange-strawberry', 'orange-juice')
sapply(words, first.word)
apple-orange-strawberry orange-juice
"apple" "orange"
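The result above is a named vector because sapply() uses the input strings as names. If plain values are preferred, one option is to switch the names off, as in this small sketch that reuses the function from this answer:

```r
# Same helper as in the answer above
first.word <- function(my.string) {
  unlist(strsplit(my.string, "-"))[1]
}

words <- c("apple-orange-strawberry", "orange-juice")

# USE.NAMES = FALSE drops the input strings as names on the result
first_words <- sapply(words, first.word, USE.NAMES = FALSE)
first_words
```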

I would use sub() instead. Since you want the first "word" before the split, we can simply remove everything after the first - and that's what we're left with.
sub("-.*", "", people$food)
Here's an example -
x <- c("apple", "banana-raspberry-cherry", "orange-berry", "tomato-apple")
sub("-.*", "", x)
# [1] "apple" "banana" "orange" "tomato"
Otherwise, if you want to use strsplit() you can round up the first elements with vapply()
vapply(strsplit(x, "-", fixed = TRUE), "[", "", 1)
# [1] "apple" "banana" "orange" "tomato"
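The sub() idea can also be turned around to keep the last entry instead of the first. This is a sketch of that variation, not part of the original answer: because `.*` is greedy, the pattern `.*-` matches everything up to the final hyphen.

```r
x <- c("apple", "banana-raspberry-cherry", "orange-berry", "tomato-apple")

# Remove everything from the first hyphen onwards: keeps the first entry
first_piece <- sub("-.*", "", x)

# Remove everything up to the last hyphen: keeps the last entry
last_piece <- sub(".*-", "", x)

first_piece
last_piece
```

Strings without a hyphen do not match either pattern and are returned unchanged.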

I would suggest using head rather than [ in R.
word <- c('apple-orange-strawberry','chocolate')
sapply(strsplit(word, "-"), head, 1)
# [1] "apple" "chocolate"
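One difference between the two worth knowing about is how they treat an empty split. The sketch below uses an empty string as an assumed edge case: `[` returns NA for it, while head() returns a zero-length vector, so sapply() can no longer simplify to a character vector.

```r
word <- c("apple-orange", "")

# strsplit() gives character(0) for the empty string
pieces <- strsplit(word, "-")

# `[` pads the missing element with NA, so the result stays a vector
with_bracket <- sapply(pieces, `[`, 1)

# head() returns character(0) for the empty split, so sapply() returns a list
with_head <- sapply(pieces, head, 1)

with_bracket
is.list(with_head)
```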

dplyr/magrittr approach:
library(magrittr)
library(dplyr)
word = c('apple-orange-strawberry', 'chocolate')
strsplit(word, "-") %>% sapply(extract2, 1)
# [1] "apple" "chocolate"

Using str_remove() to delete everything after the pattern:
df <- data.frame(words = c('apple-orange-strawberry', 'chocolate'))
mutate(df, short = stringr::str_remove(words, "-.*")) # mutate method
stringr::str_remove(df$words, "-.*") # str_remove example
stringr::str_replace(df$words, "-.*", "") # str_replace method
stringr::str_split_fixed(df$words, "-", n=2)[,1] # str_split method similar to original question's Python code
tidyr::separate(df, words, into = c("short", NA)) # using the separate function
words short
1 apple-orange-strawberry apple
2 chocolate chocolate

stringr 1.5.0 introduced str_split_i to do this easily:
library(stringr)
str_split_i(c('apple-orange-strawberry','chocolate'), "-", 1)
[1] "apple" "chocolate"
The third argument represents the index you want to extract. Also new is that you can use negative values to index from the right:
str_split_i(c('apple-orange-strawberry','chocolate'), "-", -1)
[1] "strawberry" "chocolate"

Related

How to add single quotes around multiple strings

strings <- c("apple", "banana", "029")
> strings
[1] "apple" "banana" "029"
I would like to add single quotes to each element in strings and separate the strings with ,. My desired output is this:
desired_strings <- "'apple','banana','029'"
> desired_strings
[1] "'apple','banana','029'"
My attempt:
a <- "'"
paste0(mapply(paste0, a, strings, a), ",")
[1] "'apple'," "'banana'," "'029',"
However, this is not quite right.
You can use sQuote() and then collapse to a single string with paste():
paste(sQuote(strings, q = FALSE), collapse = ",")
[1] "'apple','banana','029'"
Using sprintf.
toString(sprintf("'%s'", strings))
# [1] "'apple', 'banana', '029'"
or
paste(sprintf("'%s'", strings), collapse=",")
# [1] "'apple','banana','029'"
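For comparison with the attempt in the question, here is a minimal base R sketch that needs neither mapply() nor sprintf(). paste0() is already vectorized over `strings`, and the `collapse` argument is what joins the quoted pieces into a single string.

```r
strings <- c("apple", "banana", "029")

# Wrap each element in single quotes, then join with commas
desired_strings <- paste0("'", strings, "'", collapse = ",")
desired_strings
```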

Removing text between parentheses with unmatched pairs

I am trying to remove characters/numbers between parentheses. Firstly, the numbered parentheses - i.e. ("(3)") - at the start, and then anything in the second pair of parentheses. Sometimes this second pair of parentheses has an unmatched bracket which complicates things. An example:
library(qdapRegex)
n <- c("(1) Apple (Pe(ar)", "(2) Apple (Or(ang)e)", "(3) Banana (Hot(dog)")
c <- rm_between(n,"(",")", extract = TRUE)
To ideally get:
c
> "Apple" "Apple" "Banana"
It seems that you always need the second word. If that is the case, then here are a couple of (straightforward) ways of doing it:
#Base R
sapply(strsplit(n, ' '), `[`, 2)
[1] "Apple" "Apple" "Banana"
#The always fun, word() from stringr package
stringr::word(n, 2)
[1] "Apple" "Apple" "Banana"
If you want to use regex, then you could use a replace regex with empty string like this:
[^A-Za-z ]
Or with insensitive flag
(?i)[^a-z ]
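A different regex route, sketched here under the assumption that every string looks like the examples in the question (a numbered prefix, the word to keep, then a parenthesised tail), is to strip the two unwanted parts in separate steps. Because the second pattern removes everything from the first remaining opening parenthesis to the end, unmatched brackets in the tail do not matter.

```r
n <- c("(1) Apple (Pe(ar)", "(2) Apple (Or(ang)e)", "(3) Banana (Hot(dog)")

# Step 1: drop the leading "(number) " prefix
no_prefix <- sub("^\\([0-9]+\\) *", "", n)

# Step 2: drop everything from the first remaining "(" to the end
cleaned <- sub(" *\\(.*$", "", no_prefix)
cleaned
```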

Vectorized stringr with fixed (literal) characters

I've got the following code, which I expect to give me a list of 3, since there are 3 elements in texts:
library(stringr)
texts <- c("I doubt it! :)", ";) disagree, but ok.", "No emoticons here!!!")
smileys <- c(":)","(:",";)",":D")
str_extract_all(texts, fixed(smileys))
Instead, I get a list of four (the length of my "pattern" parameter, here the smileys). Additionally, I get the following warning message:
Warning message: In stri_extract_all_fixed(string, pattern, simplify = simplify, : longer object length is not a multiple of shorter object length
Well, I don't imagine length will match, as I'm looking for any hits on any of the smileys in each text. It's not like I want to match string 1 with pattern 1, string 2 with pattern 2, etc.
Aware that I am messing up stringi's understanding of vectorizing, I have tried this instead:
texts %>% map(~ str_extract_all(.x, fixed(smileys)))
This is much better, as it gives me a list of 3, but each element is in turn a list of four.
What I'm trying to get to is a list of 3 that is as little nested as possible. Someone, somewhere, has solved this, but I can't for the life of me figure it out or get how to google it. I could do a for loop over this, but I consider myself a citizen of the tidyverse...
Grateful for any assistance.
You can use paste to wrap each element of smileys with \\Q and \\E and collapse on the regex "or" metacharacter (|) to form a single pattern. As mentioned in the link Henrik shared and documented on ?regex and in the stringi manual, characters between \\Q and \\E are interpreted literally.
pattern <- paste("\\Q", smileys, "\\E", sep = "", collapse = "|")
# [1] "\\Q:)\\E|\\Q(:\\E|\\Q;)\\E|\\Q:D\\E"
library(stringi)
stri_extract_all_regex(texts, pattern)
#[[1]]
#[1] ":)"
#
#[[2]]
#[1] ";)"
#
#[[3]]
#[1] NA
Base R:
regmatches(texts, gregexpr(pattern, texts))
#[[1]]
#[1] ":)"
#
#[[2]]
#[1] ";)"
#
#[[3]]
#character(0)
# If you want an NA, instead of a zero-length vector,
# then you could do something like:
# lapply(
# regmatches(texts, gregexpr(pattern, texts)),
# function(ii) ifelse(is.character(ii) & length(ii) == 0L, NA, ii))
And if you do want to use purrr and avoid regular expressions, one idea would be something like this:
library(purrr)
library(stringr)
texts %>%
map(~ unlist(str_extract_all(.x, fixed(smileys))))
#[[1]]
#[1] ":)"
#
#[[2]]
#[1] ";)"
#
#[[3]]
#character(0)
# if you want NA, not a zero-length vector, you could add:
# %>% map(~ ifelse(is.character(.x) & length(.x) == 0L, NA, .x))
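If no packages are wanted at all, a base R sketch along the same lines is possible with grepl() and fixed = TRUE. Note the difference from the answers above: this version reports which smileys occur in each text (each at most once, in the order of `smileys`), not every individual occurrence.

```r
texts <- c("I doubt it! :)", ";) disagree, but ok.", "No emoticons here!!!")
smileys <- c(":)", "(:", ";)", ":D")

found <- lapply(texts, function(txt) {
  # Test every smiley literally (no regex) against this one text
  present <- vapply(
    smileys,
    function(s) grepl(s, txt, fixed = TRUE),
    logical(1),
    USE.NAMES = FALSE
  )
  smileys[present]
})

found
```

The result is a list with one element per text, and a zero-length character vector where nothing was found.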

R programming : select element from split string based on value in another column

I have a data frame having one column of words, with syllables separated by hyphens. I want to extract the nth syllable, where n is given in another column. Like this:
word <- c("to-ma-to", "cheese", "ta-co")
whichSyl <- c(2, 1, 1)
mydf <- data.frame(word, whichSyl)
mydf$word <- as.character(mydf$word)
desired: a vector containing
ma
cheese
ta
If this were, say, awk, I would just do
'{split($1,a,"-"); print a[$2]}'
The words don't always have the same number of syllables.
It seems likely that there is a straightforward way to do this, but I'm not seeing it. Thanks
You can use mapply and strsplit to get,
mapply('[', strsplit(mydf$word, '-'), whichSyl)
#[1] "ma" "cheese" "ta"
Here I wrote a function that does one row at a time, and then uses lapply() to iterate over all rows and do.call(rbind()) to bind all of those responses together.
getSyl <- function(i){
  strsplit(mydf$word[i], '-')[[1]][mydf$whichSyl[i]]
}
do.call(rbind, lapply(1:nrow(mydf), getSyl))
[,1]
[1,] "ma"
[2,] "cheese"
[3,] "ta"
We can use read.table and row/column indexing
read.table(text=mydf$word, sep="-", header=FALSE,
fill=TRUE)[cbind(1:nrow(mydf), mydf$whichSyl)]
#[1] "ma" "cheese" "ta"
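Another way to write the same lookup, sketched here with a hypothetical new column `syl`, is to split once and then index each split by the matching row of whichSyl. vapply() guarantees a character vector, and an index past the end of a word yields NA rather than an error.

```r
word <- c("to-ma-to", "cheese", "ta-co")
whichSyl <- c(2, 1, 1)
mydf <- data.frame(word, whichSyl, stringsAsFactors = FALSE)

# Split every word once; fixed = TRUE treats "-" literally
pieces <- strsplit(mydf$word, "-", fixed = TRUE)

# For row i, pick the syllable whose position is given in whichSyl[i]
mydf$syl <- vapply(
  seq_along(pieces),
  function(i) pieces[[i]][mydf$whichSyl[i]],
  character(1)
)

mydf$syl
```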

Using R to compare two words and find letters unique to second word (across c. 6000 cases)

I have a dataframe comprising two columns of words. For each row I'd like to identify any letters that occur only in the word in the second column, e.g.
carpet carpelt #return 'l'
bag flag #return 'f' & 'l'
dog dig #return 'i'
I'd like to use R to do this automatically as I have 6126 rows.
As an R newbie, the best I've got so far is this, which gives me the unique letters across both words (and is obviously very clumsy):
x<-(strsplit("carpet", ""))
y<-(strsplit("carpelt", ""))
z<-list(l1=x, l2=y)
unique(unlist(z))
Any help would be much appreciated.
The function you’re searching for is setdiff:
chars_for = function (str)
  strsplit(str, '')[[1]]
result = setdiff(chars_for(word2), chars_for(word1))
(Note the inverted order of the arguments in setdiff.)
To apply it to the whole data.frame, called x:
apply(x, 1, function (words) setdiff(chars_for(words[2]), chars_for(words[1])))
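As a self-contained sketch of the setdiff() approach, the example below runs it over two columns with Map(). The data frame `words` and its column names `first` and `second` are assumptions made for illustration; the three rows are the examples from the question.

```r
# Hypothetical two-column data frame built from the question's examples
words <- data.frame(
  first  = c("carpet", "bag", "dog"),
  second = c("carpelt", "flag", "dig"),
  stringsAsFactors = FALSE
)

# Split a single word into its characters
chars_for <- function(str) strsplit(str, "")[[1]]

# Letters of the second word that do not occur in the first, row by row
new_letters <- unname(Map(
  function(a, b) setdiff(chars_for(b), chars_for(a)),
  words$first, words$second
))

# Optionally collapse each result into one string per row
collapsed <- vapply(new_letters, paste, character(1), collapse = "")
collapsed
```

Map() always returns a list, which avoids the case where apply() returns a matrix or a list depending on how many letters each row produces.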
Use regex :) Paste your word with brackets [] and then use replace function for regex. This regex finds any letter from those in brackets and replaces it with empty string (you can say that it "removes" these letters).
require(stringi)
x <- c("carpet","bag","dog")
y <- c("carplet", "flag", "smog")
pattern <- stri_paste("[",x,"]")
pattern
## [1] "[carpet]" "[bag]" "[dog]"
stri_replace_all_regex(y, pattern, "")
## [1] "l" "fl" "sm"
x <- c("carpet","bag","dog")
y <- c("carpelt", "flag", "dig")
Following (somewhat) with what you were going for with strsplit, you could do
> sx <- strsplit(x, "")
> sy <- strsplit(y, "")
> lapply(seq_along(sx), function(i) sy[[i]][ !sy[[i]] %in% sx[[i]] ])
#[[1]]
#[1] "l"
#
#[[2]]
#[1] "f" "l"
#
#[[3]]
#[1] "i"
This uses %in% to logically match the characters in y with the characters in x. I negate the matching with ! to find the characters that are in y but not in x.

Resources