R: split a string into pairs of characters

How can I split a string in R in the following way? Please look at the example:
example:
c("ex", "xa", "am", "mp", "pl", "le") ?

x = "example"
substring(x, first = 1:(nchar(x) - 1), last = 2:nchar(x))
# [1] "ex" "xa" "am" "mp" "pl" "le"
You could, of course, wrap it into a function, maybe omit non-letters (I don't know if the colon was supposed to be part of your string or not), etc.
To do this to a vector of strings, you can use it as an anonymous function with lapply:
lapply(month.name, function(x) substring(x, first = 1:(nchar(x) - 1), last = 2:nchar(x)))
# [[1]]
# [1] "Ja" "an" "nu" "ua" "ar" "ry"
#
# [[2]]
# [1] "Fe" "eb" "br" "ru" "ua" "ar" "ry"
#
# [[3]]
# [1] "Ma" "ar" "rc" "ch"
# ...
Or make it into a named function and use it by name. This would make sense if you'll use it somewhat frequently.
str_split_pairs = function(x) {
substring(x, first = 1:(nchar(x) - 1), last = 2:nchar(x))
}
lapply(month.name, str_split_pairs)
## same result as above
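One caveat: for a string with fewer than two characters, 1:(nchar(x) - 1) counts backwards (1:0), so the function above does not return an empty result. Here is a minimal sketch of a guarded variant; the name str_split_pairs_safe is made up for illustration, and it assumes x is a single string:

```r
# Hypothetical guarded variant of str_split_pairs (name is illustrative only).
# Assumes x is a single, non-NA string.
str_split_pairs_safe <- function(x) {
  n <- nchar(x)
  # Strings shorter than two characters contain no pairs
  if (n < 2) return(character(0))
  starts <- seq_len(n - 1)
  substring(x, first = starts, last = starts + 1)
}

str_split_pairs_safe("example")
str_split_pairs_safe("a")
```

seq_len() is used instead of 1:(n - 1) because it yields an empty sequence rather than counting down when its argument is 0.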

Here's another option (though it's slower than @Gregor's answer); it uses lead() from dplyr:
library(dplyr)
x = c("example", "stackoverflow", "programming")
lapply(x, function(i) {
  i = unlist(strsplit(i, ""))
  paste0(i, lead(i))[-length(i)]
})
[[1]]
[1] "ex" "xa" "am" "mp" "pl" "le"
[[2]]
[1] "st" "ta" "ac" "ck" "ko" "ov" "ve" "er" "rf" "fl" "lo" "ow"
[[3]]
[1] "pr" "ro" "og" "gr" "ra" "am" "mm" "mi" "in" "ng"
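If you would rather avoid the dplyr dependency, the same shift-and-paste idea can be sketched in base R with head() and tail(). The function name pairs_base is just an illustrative choice:

```r
# Base-R sketch of the shift-and-paste idea (pairs_base is a made-up name).
pairs_base <- function(s) {
  ch <- strsplit(s, "")[[1]]          # individual characters
  # Pair every character except the last with its successor
  paste0(head(ch, -1), tail(ch, -1))
}

x <- c("example", "stackoverflow", "programming")
lapply(x, pairs_base)
```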

Related

R transform character vectors in a list, conditional on the content of the vector

The problem is I have a list of character vectors.
example:
mylist <- list( c("once","upon","a","time"),
c("once", "in", "olden", "times"),
c("Let","all","good","men"),
c("Let","This"),
c("once", "is","never","enough"),
c("in","the"),
c("Come","dance","all","around"))
and I want to prepend c("one", "two") to those vectors starting "once" to end up with the list
mylist <- list( c("one", "two", "once","upon","a","time"),
c("one", "two", "once", "in", "olden", "times"),
c("Let","all","good","men"),
c("Let","This"),
c("one", "two", "once", "is","never","enough"),
c("in","the"),
c("Come","dance","all","around"))
So far, I can select the relevant vectors
mylist[grep("once",mylist)]
and I can prepend "one" and "two" to create a results list
resultlist <- lapply(mylist[grep("once",mylist)],FUN = function(listrow) prepend(listrow,c("One","Two")))
But putting the results in the correct place in mylist?
Nope, that escapes me!
Hints, tips and solutions most welcome :-)
We can use
lapply(mylist , \(x) if(grepl("once" , x[1]))
append(x, c("one", "two") , 0) else x)
Output
[[1]]
[1] "one" "two" "once" "upon" "a" "time"
[[2]]
[1] "one" "two" "once" "in" "olden" "times"
[[3]]
[1] "Let" "all" "good" "men"
[[4]]
[1] "Let" "This"
[[5]]
[1] "one" "two" "once" "is" "never" "enough"
[[6]]
[1] "in" "the"
[[7]]
[1] "Come" "dance" "all" "around"
I don't think you need grep at all. Loop over the list, checking the first value for "once" and appending via c() the extra values:
lapply(mylist, \(x) if(x[1] == "once") c("one", "two", x) else x)
##[[1]]
##[1] "one" "two" "once" "upon" "a" "time"
##
##[[2]]
##[1] "one" "two" "once" "in" "olden" "times"
##
##[[3]]
##[1] "Let" "all" "good" "men"
##
##[[4]]
##[1] "Let" "This"
##
##[[5]]
##[1] "one" "two" "once" "is" "never" "enough"
##
##[[6]]
##[1] "in" "the"
##
##[[7]]
##[1] "Come" "dance" "all" "around"
Another option with map_if (first() comes from dplyr, so both packages are loaded):
library(dplyr)
library(purrr)
map_if(mylist, .p = ~ first(.x) == "once", .f = ~ c("one", "two", .x))
Output
[[1]]
[1] "one" "two" "once" "upon" "a" "time"
[[2]]
[1] "one" "two" "once" "in" "olden" "times"
[[3]]
[1] "Let" "all" "good" "men"
[[4]]
[1] "Let" "This"
[[5]]
[1] "one" "two" "once" "is" "never" "enough"
[[6]]
[1] "in" "the"
[[7]]
[1] "Come" "dance" "all" "around"
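Since the question was specifically about putting the results back in the correct place, here is a sketch that modifies mylist in place using a logical index instead of building a new list. The variable name starts_once is only illustrative:

```r
mylist <- list(c("once", "upon", "a", "time"),
               c("once", "in", "olden", "times"),
               c("Let", "all", "good", "men"),
               c("Let", "This"),
               c("once", "is", "never", "enough"),
               c("in", "the"),
               c("Come", "dance", "all", "around"))

# TRUE for each vector whose first element is exactly "once";
# isTRUE() guards against empty vectors, where v[1] is NA
starts_once <- vapply(mylist, function(v) isTRUE(v[1] == "once"), logical(1))

# Assign the modified vectors back into the same positions
mylist[starts_once] <- lapply(mylist[starts_once], function(v) c("one", "two", v))
mylist
```

Because the same index is used on both sides of the assignment, the modified vectors land exactly where the originals were.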

Is there a way to "undo" a paste

I am looking for a way to split up a string, but instead of splitting by an underscore or specific word, I would want to split from a series of words - and also not have that word deleted. For example,
a <- c("Hello", "Joe", "Simpsons", "Oh_No", "Hiya_Hi", "oh")
b <- c("sum", "sum_one")
x <- paste(a, b, sep = "_")
I then would like a way to separate x into a and b.
This is a bit difficult, because the 4th and 5th values contain the very separator you used to paste the strings. In general, strsplit() splits strings on a given separator, but here you run into trouble: to avoid splitting in the wrong place you have to know at least what b looks like (or use a unique separator):
strsplit(x, split = "_")
[[1]]
[1] "Hello" "sum"
[[2]]
[1] "Joe" "sum" "one"
[[3]]
[1] "Simpsons" "sum"
[[4]]
[1] "Oh" "No" "sum" "one"
[[5]]
[1] "Hiya" "Hi" "sum"
[[6]]
[1] "oh" "sum" "one"
The result is a list in which each element is a character vector, and the vectors have different lengths.
An option can be to use the value of b as splitter:
rd <- strsplit(x, split = paste0(paste0("_",b), collapse = "|"))
rd
[[1]]
[1] "Hello"
[[2]]
[1] "Joe"
[[3]]
[1] "Simpsons"
[[4]]
[1] "Oh_No"
[[5]]
[1] "Hiya_Hi"
[[6]]
[1] "oh"
# convert this to a vector:
a <- unlist(rd)
a
[1] "Hello" "Joe" "Simpsons" "Oh_No" "Hiya_Hi" "oh"
Now you can use this information the other way around:
b <- unique(gsub(paste0(paste0(a, "_"), collapse = "|"),"", x))
b
[1] "sum" "sum_one"
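A slightly more defensive sketch of the same idea anchors the known suffixes at the end of each string, so that an occurrence of "_sum" inside a value of a would not be split. It assumes the values of b are known and contain no regex metacharacters; the names pattern, a_back and b_back are made up for illustration:

```r
a <- c("Hello", "Joe", "Simpsons", "Oh_No", "Hiya_Hi", "oh")
b <- c("sum", "sum_one")
x <- paste(a, b, sep = "_")

# Put longer suffixes first and anchor the match at the end of the string
b_sorted <- b[order(nchar(b), decreasing = TRUE)]
pattern <- paste0("_(", paste(b_sorted, collapse = "|"), ")$")

a_back <- sub(pattern, "", x)                             # strip the suffix
b_back <- sub("^_", "", regmatches(x, regexpr(pattern, x)))  # keep only the suffix
a_back
b_back
```

Note that b_back has one entry per element of x (the recycled b), so unique(b_back) recovers the original b.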
As @Gregor Thomas already said in the comments, the information is lost. However, depending on the context, there is a way of storing the information in an attribute using a home-grown my_paste function, for which we also write a print method and a my_unpaste function.
Here a sketch of the idea:
my_paste <- \(..., sep=" ", collapse=NULL, recycle0=FALSE) { ## new paste fun
o <- `attr<-`(paste(..., sep=sep), 'unpaste', list(...))
return(structure(o, class=c('character', 'my_paste')))
}
print.my_paste <- function(x) { ## print method for class `my_paste'
print(as.character(x))
}
my_unpaste <- \(x, warn=TRUE) { ## the un-paste function
if (!inherits(x, 'my_paste')) {
if (warn) warning('Nothing to unpaste.')
return(x)
} else {
return(attr(x, 'unpaste'))
}
}
Usage
x <- my_paste(a, b, sep='_')
Looks like this,
str(x)
# 'my_paste' chr [1:6] "Hello_sum" "Joe_sum_one" "Simpsons_sum" "Oh_No_sum_one" "Hiya_Hi_sum" "oh_sum_one"
# - attr(*, "unpaste")=List of 2
# ..$ : chr [1:6] "Hello" "Joe" "Simpsons" "Oh_No" ...
# ..$ : chr [1:2] "sum" "sum_one"
but prints normally:
x ## or more verbose `print(x)`
# [1] "Hello_sum" "Joe_sum_one" "Simpsons_sum" "Oh_No_sum_one" "Hiya_Hi_sum" "oh_sum_one"
Now un-paste!
my_unpaste(x)
# [[1]]
# [1] "Hello" "Joe" "Simpsons" "Oh_No" "Hiya_Hi" "oh"
#
# [[2]]
# [1] "sum" "sum_one"
Has a warning:
my_unpaste(a)
# [1] "Hello" "Joe" "Simpsons" "Oh_No" "Hiya_Hi" "oh"
# Warning message:
# In my_unpaste(a) : Nothing to unpaste.
my_unpaste(a, warn=FALSE)
# [1] "Hello" "Joe" "Simpsons" "Oh_No" "Hiya_Hi" "oh"
Note: R >= 4.1 used.
Data:
a <- c("Hello", "Joe", "Simpsons", "Oh_No", "Hiya_Hi", "oh")
b <- c("sum", "sum_one")

How to expand loop n times in R programming?

I have a vector of characters 'A', 'B', 'C', 'D' and would like to loop n times to get all possible combinations (4^n) of the characters. How do I write a function that will perform this given input n?
For example, if n=2, my loop will look something like this:
string <- c('A','B','C','D')
combination = c()
count = 1
for (j in string) {
for (k in string) {
combination[count] <- paste0(j,k)
count = count + 1
}
}
which will yield:
> combination
[1] "AA" "AB" "AC" "AD" "BA" "BB" "BC" "BD" "CA" "CB" "CC" "CD" "DA" "DB" "DC" "DD"
and if n=3, the code will be like this:
combination = c()
count = 1
for (j in string) {
for (k in string) {
for (l in string) {
combination[count] <- paste0(j,k,l)
count = count + 1
}
}
}
which yields
> combination
[1] "AAA" "AAB" "AAC" "AAD" "ABA" "ABB" "ABC" "ABD" "ACA" "ACB" "ACC" "ACD" "ADA" "ADB" "ADC" "ADD" "BAA" "BAB" "BAC"
[20] "BAD" "BBA" "BBB" "BBC" "BBD" "BCA" "BCB" "BCC" "BCD" "BDA" "BDB" "BDC" "BDD" "CAA" "CAB" "CAC" "CAD" "CBA" "CBB"
[39] "CBC" "CBD" "CCA" "CCB" "CCC" "CCD" "CDA" "CDB" "CDC" "CDD" "DAA" "DAB" "DAC" "DAD" "DBA" "DBB" "DBC" "DBD" "DCA"
[58] "DCB" "DCC" "DCD" "DDA" "DDB" "DDC" "DDD"
In a word, recursion:
#' n: length of each combination string
#' basis: the starting vector of characters to combine
#' extras: the vector of characters to be combined with basis. Defaults to basis
#' i: The current depth of recursion. Users should generally not need to access
#' this parameter.
combine <- function(n, basis=c('A','B','C','D'), extras=basis, i=1) {
x <- expand.grid(basis, extras)
y <- paste0(x$Var1, x$Var2)
if (i == n-1) {
return(y)
} else {
return(combine(n, y, extras, i+1))
}
}
Giving, for example,
> combine(2)
[1] "AA" "BA" "CA" "DA" "AB" "BB" "CB" "DB" "AC" "BC" "CC" "DC" "AD" "BD" "CD" "DD"
and
> combine(3)
[1] "AAA" "BAA" "CAA" "DAA" "ABA" "BBA" "CBA" "DBA" "ACA" "BCA" "CCA" "DCA" "ADA" "BDA" "CDA" "DDA" "AAB" "BAB" "CAB" "DAB" "ABB" "BBB" "CBB" "DBB" "ACB" "BCB" "CCB"
[28] "DCB" "ADB" "BDB" "CDB" "DDB" "AAC" "BAC" "CAC" "DAC" "ABC" "BBC" "CBC" "DBC" "ACC" "BCC" "CCC" "DCC" "ADC" "BDC" "CDC" "DDC" "AAD" "BAD" "CAD" "DAD" "ABD" "BBD"
[55] "CBD" "DBD" "ACD" "BCD" "CCD" "DCD" "ADD" "BDD" "CDD" "DDD"
etc.
Feel free to sort the output if another order is more desirable.
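For comparison, here is a non-recursive sketch that lets expand.grid build all n columns at once. The name combine_all is made up for illustration; reversing the columns is my own choice so that the output follows the nested-loop order from the question:

```r
# Non-recursive sketch (combine_all is an illustrative name).
combine_all <- function(n, chars = c("A", "B", "C", "D")) {
  # One column per position; expand.grid varies the first column fastest
  grid <- expand.grid(rep(list(chars), n), stringsAsFactors = FALSE)
  # Reverse the columns so the last position varies fastest,
  # then paste the columns together row by row
  do.call(paste0, rev(grid))
}

combine_all(2)
length(combine_all(3))
```

This also handles n = 1, where it simply returns the input characters.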

Error using lapply to pass dataframe variable through custom function

I have a function that was suggested by a user as an answer to my previous question:
word_string <- function(x) {
inds <- seq_len(nchar(x))
start = inds[-length(inds)]
stop = inds[-1]
substring(x, start, stop)
}
The function works as expected and breaks down a given word into component parts as per my specifications:
word_string('microwave')
[1] "mi" "ic" "cr" "ro" "ow" "wa" "av" "ve"
What I now want to be able to do is have the function applied to all rows of a specified column in a dataframe.
Here's a dataframe for purposes of illustration:
word <- c("House", "Motorcar", "Boat", "Dog", "Tree", "Drink")
some_value <- c("2","100","16","999", "65","1000000")
my_df <- data.frame(word, some_value, stringsAsFactors = FALSE )
my_df
      word some_value
1    House          2
2 Motorcar        100
3     Boat         16
4      Dog        999
5     Tree         65
6    Drink    1000000
Now, if I use lapply to work the function on my dataframe, not only do I get incorrect results but also a warning message.
lapply(my_df['word'], word_string)
$word
[1] "Ho" "ot" "at" "" "Tr" "ri"
Warning message:
In seq_len(nchar(x)) : first element used of 'length.out' argument
So you can see that the function is being applied, but it's being applied such that it's evaluating each row partially.
The desired output would be something like:
[1] "ho" "ou" "us" "se"
[2] "mo" "ot" "to" "or" "rc" "ca" "ar"
[3] "bo" "oa" "at"
[4] "do" "og"
[5] "tr" "re" "ee"
[6] "dr" "ri" "in" "nk"
Any guidance greatly appreciated.
The reason is that single-bracket indexing with [ (without a comma) still returns a data.frame with one column, so the unit lapply works on is that whole column.
str(my_df['word'])
'data.frame': 6 obs. of 1 variable:
# $ word: chr "House" "Motorcar" "Boat" "Dog" ...
lapply therefore loops over that single column rather than over each element of it.
We need either $ or [[ to extract the column as a vector:
lapply(my_df[['word']], word_string)
#[[1]]
#[1] "Ho" "ou" "us" "se"
#[[2]]
#[1] "Mo" "ot" "to" "or" "rc" "ca" "ar"
#[[3]]
#[1] "Bo" "oa" "at"
#[[4]]
#[1] "Do" "og"
#[[5]]
#[1] "Tr" "re" "ee"
#[[6]]
#[1] "Dr" "ri" "in" "nk"
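To get closer to the lowercase output shown in the question and keep track of which word each result belongs to, one possible sketch lowercases the column first and names the list by the original words. The use of tolower() and setNames() here is my own addition, not part of the original function:

```r
word_string <- function(x) {
  inds <- seq_len(nchar(x))
  start <- inds[-length(inds)]
  stop <- inds[-1]
  substring(x, start, stop)
}

my_df <- data.frame(word = c("House", "Motorcar", "Boat", "Dog", "Tree", "Drink"),
                    some_value = c("2", "100", "16", "999", "65", "1000000"),
                    stringsAsFactors = FALSE)

# [[ extracts the column as a character vector, so lapply sees one word at a time
pairs <- setNames(lapply(tolower(my_df[["word"]]), word_string), my_df[["word"]])
pairs[["Dog"]]
```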

Extract elements between a character and space

I'm having a hard time extracting the elements between a / and a blank space. I can do this when I have two delimiting characters, like < and > for instance, but the space is throwing me. I'd like the most efficient way to do this in base R, as this will be lapplied to thousands of vectors.
I'd like to turn this:
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"
This:
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
EDIT:
Thank you all for the answers. I'm going for speed, so Andres's code wins out. Dwin's code wins for the shortest amount of code. Dirk, yours was the second fastest. The stringr solution was the slowest (I figured it would be) and isn't in base R, but it is pretty understandable (which really is the intent of the stringr package, I think, as this seems to be Hadley's philosophy with most things).
I appreciate your assistance. Thanks again.
I thought I'd include the benchmarking since this will be lapplied over several thousand vectors:
    test replications elapsed relative user.self sys.self
1 ANDRES        10000    1.06 1.000000      1.05        0
3   DIRK        10000    1.29 1.216981      1.20        0
2   DWIN        10000    1.56 1.471698      1.43        0
4 FLODEL        10000    8.46 7.981132      7.70        0
Similar but a bit more succinct:
#1- Separate the elements by the blank space
y=unlist(strsplit(x,' '))
#2- extract just what you want from each element:
sub('^.*/([^ ]+).*$','\\1',y)
Here ^ and $ anchor the beginning and end of the string, and .* matches any sequence of characters.
[^ ]+ matches one or more non-blank characters.
\\1 refers to the first captured group, i.e. the part in parentheses.
Use regex pattern that is fwd-slash or space:
strsplit(x, "/|\\s" )
[[1]]
[1] "This" "DT" "is" "VBZ" "a" "DT" "short"
[8] "JJ" "sentence" "NN" "consisting" "VBG" "of" "IN"
[15] "some" "DT" "nouns," "JJ" "verbs," "NNS" "and"
[22] "CC" "adjectives." "VBG"
Didn't read the Q closely enough. One could use that result to extract the even numbered elements:
strsplit(x, "/|\\s")[[1]][seq(2, 24, by=2)]
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
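The hard-coded 24 only fits this particular sentence. A sketch that picks every second element regardless of length, using a recycled logical index (the variable names parts and tags are illustrative):

```r
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"

parts <- strsplit(x, "/|\\s")[[1]]
# c(FALSE, TRUE) is recycled along parts, keeping elements 2, 4, 6, ...
tags <- parts[c(FALSE, TRUE)]
tags
```

This assumes, as in the example, that every word carries exactly one tag, so that words and tags strictly alternate.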
Here is a one-liner:
R> x <- paste("This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG",
+ "of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG")
R> matrix(do.call(c, strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")),
+ ncol=2, byrow=TRUE)[,2]
[1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
R>
The key is to get rid of 'text before slash':
R> gsub("[a-zA-Z.,]*/", " ", x)
[1] " DT VBZ DT JJ NN VBG IN DT JJ NNS CC VBG"
R>
after which it is just a matter of splitting the string
R> strsplit(gsub("[a-zA-Z.,]*/", " ", x), " ")
[[1]]
[1] "" "DT" "" "VBZ" "" "DT" "" "JJ" "" "NN"
[11] "" "VBG" "" "IN" "" "DT" "" "JJ" "" "NNS"
[21] "" "CC" "" "VBG"
and filtering the "". There may well be more compact ways for the last bit.
R>
The stringr package has nice functions for working with strings, with very intuitive names. Here you can use str_extract_all to get all matches (including the leading slash), then str_sub to remove the slashes:
library(stringr)
str_extract_all(x, "/\\w*")
# [[1]]
# [1] "/DT" "/VBZ" "/DT" "/JJ" "/NN" "/VBG" "/IN" "/DT" "/JJ" "/NNS"
# [11] "/CC" "/VBG"
str_sub(str_extract_all(x, "/\\w*")[[1]], start = 2)
# [1] "DT" "VBZ" "DT" "JJ" "NN" "VBG" "IN" "DT" "JJ" "NNS" "CC" "VBG"
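The same extract-then-trim idea can be sketched in base R in one step with regmatches() and a Perl-style lookbehind, which matches the tag without including the slash. This is an alternative I am adding for comparison; it was not part of the benchmark above:

```r
x <- "This/DT is/VBZ a/DT short/JJ sentence/NN consisting/VBG of/IN some/DT nouns,/JJ verbs,/NNS and/CC adjectives./VBG"

# (?<=/) asserts that a slash precedes the match; [^ ]+ takes the non-blank run after it
tags <- regmatches(x, gregexpr("(?<=/)[^ ]+", x, perl = TRUE))[[1]]
tags
```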

Resources