Dictionary of words separated by their length - r

I have a character vector of words like so: chr "ABC" "ABM" "AG" "AGB" "AGP" "AD".
I would like to convert it into a list (dictionary) of lists (of words), divided by length:
:chr NULL
:chr [1:2] "AD" "AG"
:chr [1:4] "ABC" "ABM" "AGB" "AGP"

You can use split:
split(words, nchar(words)) # split the words vector by the number of characters
# $`2`
# [1] "AG" "AD"
# $`3`
# [1] "ABC" "ABM" "AGB" "AGP"
Data:
words <- c("ABC", "ABM", "AG", "AGB", "AGP", "AD")
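The desired output in the question also has an empty entry for length 1 and lists the words alphabetically. A minimal sketch of one way to get that, assuming an empty character vector (rather than NULL) is acceptable for lengths with no words; the names lens and by_len are only illustrative:

```r
words <- c("ABC", "ABM", "AG", "AGB", "AGP", "AD")
lens <- nchar(words)

# A factor with explicit levels makes split() keep lengths that have no words.
by_len <- split(words, factor(lens, levels = seq_len(max(lens))))

# Sort the words within each length group.
by_len <- lapply(by_len, sort)
```

by_len[["2"]] then holds the two-letter words, and by_len[["1"]] is an empty character vector.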

Related

How to generate all permutations of lists of string?

I have character data like this
[[1]]
[1] "F" "S"
[[2]]
[1] "Y" "Q" "Q"
[[3]]
[1] "C" "T"
[[4]]
[1] "G" "M"
[[5]]
[1] "A" "M"
And I want to generate all permutations for each individual list (not mixed between lists) and combine them together into one big list.
For example, for the first and second lists, which are "F" "S" and "Y" "Q" "Q", I want to get the permutation lists as c("FS", "SF"), and c("YQQ", "QYQ", "QQY"), and then combine them into one.
Here's an approach with combinat::permn:
library(combinat)
lapply(data,function(x)unique(sapply(combinat::permn(x),paste,collapse = "")))
#[[1]]
#[1] "FS" "SF"
#
#[[2]]
#[1] "YQQ" "QYQ" "QQY"
#
#[[3]]
#[1] "CT" "TC"
#
#[[4]]
#[1] "GM" "MG"
#
#[[5]]
#[1] "AM" "MA"
Or together with unlist:
unlist(lapply(data,function(x)unique(sapply(combinat::permn(x),paste,collapse = ""))))
# [1] "FS" "SF" "YQQ" "QYQ" "QQY" "CT" "TC" "GM" "MG" "AM" "MA"
Data:
data <- list(c("F", "S"), c("Y", "Q", "Q"), c("C", "T"), c("G", "M"),
c("A", "M"))
It looks like your desired output is not exactly the same as this related post (Generating all distinct permutations of a list in R). But we can build on the answer there.
library(combinat)
# example data, based on your description
X <- list(c("F","S"), c("Y", "Q", "Q"))
result <- lapply(X, function(x1) {
unique(sapply(permn(x1), function(x2) paste(x2, collapse = "")))
})
print(result)
Output
[[1]]
[1] "FS" "SF"
[[2]]
[1] "YQQ" "QYQ" "QQY"
The first (outer) lapply iterates over each element of the list, each of which is a vector of individual letters. In each iteration, permn takes those letters (e.g. "F" and "S") and returns a list of all possible permutations (e.g. "F" "S" and "S" "F"). To format the output as you described, the inner sapply collapses each of those permutations into a single character value, and unique filters out duplicates.
library(combinat)
final <- unlist(lapply(X , function(test_X) lapply(permn(test_X), function(x) paste(x,collapse='')) ))
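If installing combinat is not an option, the same idea can be sketched in base R. The helper names perms and perm_strings below are hypothetical, and the recursion is only practical for short vectors, since the number of permutations grows factorially:

```r
# Recursively build every ordering of v: pick each element in turn as the
# head, then append every permutation of the remaining elements.
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in perms(v[-i])) {
      out[[length(out) + 1]] <- c(v[i], p)
    }
  }
  out
}

# Collapse each permutation to one string and drop duplicates
# (repeated letters such as "Q" "Q" produce identical strings).
perm_strings <- function(v) {
  unique(vapply(perms(v), paste, character(1), collapse = ""))
}

data <- list(c("F", "S"), c("Y", "Q", "Q"))
result <- lapply(data, perm_strings)
combined <- unlist(result)
```

Here result keeps one character vector per input vector, while combined flattens them into a single vector.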

Error using lapply to pass dataframe variable through custom function

I have a function that was suggested by a user as an answer to my previous question:
word_string <- function(x) {
inds <- seq_len(nchar(x))
start = inds[-length(inds)]
stop = inds[-1]
substring(x, start, stop)
}
The function works as expected and breaks down a given word into component parts as per my specifications:
word_string('microwave')
[1] "mi" "ic" "cr" "ro" "ow" "wa" "av" "ve"
What I now want to be able to do is have the function applied to all rows of a specified column in a dataframe.
Here's a dataframe for purposes of illustration:
word <- c("House", "Motorcar", "Boat", "Dog", "Tree", "Drink")
some_value <- c("2","100","16","999", "65","1000000")
my_df <- data.frame(word, some_value, stringsAsFactors = FALSE )
my_df
word some_value
1 House 2
2 Motorcar 100
3 Boat 16
4 Dog 999
5 Tree 65
6 Drink 1000000
Now, if I use lapply to apply the function to my dataframe, not only do I get incorrect results but also a warning message.
lapply(my_df['word'], word_string)
$word
[1] "Ho" "ot" "at" "" "Tr" "ri"
Warning message:
In seq_len(nchar(x)) : first element used of 'length.out' argument
So you can see that the function is being applied, but it's being applied such that it's evaluating each row partially.
The desired output would be something like:
[1] "ho" "ou" "us" "se"
[2] "mo" "ot" "to" "or" "rc" "ca" "ar"
[3] "bo" "oa" "at"
[4] "do" "og"
[5] "tr" "re" "ee"
[6] "dr" "ri" "in" "nk"
Any guidance greatly appreciated.
The reason is that indexing with [ (without a comma) still returns a data.frame with one column, so the unit lapply iterates over is that single column.
str(my_df['word'])
# 'data.frame': 6 obs. of 1 variable:
# $ word: chr "House" "Motorcar" "Boat" "Dog" ...
lapply therefore loops over that single column instead of over each element of the column.
We need either $ or [[ to extract the column as a vector:
lapply(my_df[['word']], word_string)
#[[1]]
#[1] "Ho" "ou" "us" "se"
#[[2]]
#[1] "Mo" "ot" "to" "or" "rc" "ca" "ar"
#[[3]]
#[1] "Bo" "oa" "at"
#[[4]]
#[1] "Do" "og"
#[[5]]
#[1] "Tr" "re" "ee"
#[[6]]
#[1] "Dr" "ri" "in" "nk"
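The difference between the two kinds of indexing can be checked directly. The following is a small sketch on a cut-down version of the data frame; the name pairs is illustrative, and lowercasing with tolower is an assumption based on the lowercase letters in the desired output:

```r
word_string <- function(x) {
  inds <- seq_len(nchar(x))
  substring(x, inds[-length(inds)], inds[-1])
}

my_df <- data.frame(word = c("House", "Dog"), stringsAsFactors = FALSE)

is.data.frame(my_df["word"])   # TRUE: single bracket keeps the data.frame
is.character(my_df[["word"]])  # TRUE: double bracket extracts the vector

# Lowercase first (as in the desired output), then name each result by its word.
pairs <- setNames(lapply(tolower(my_df[["word"]]), word_string), my_df[["word"]])
```

Naming the list elements makes it easy to look up the pairs for one word, e.g. pairs[["Dog"]].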

R, split string to pairs of character

How do I split a string in R in the following way? Please look at the example.
example:
c("ex", "xa", "am", "mp", "pl", "le") ?
x = "example"
substring(x, first = 1:(nchar(x) - 1), last = 2:nchar(x))
# [1] "ex" "xa" "am" "mp" "pl" "le"
You could, of course, wrap it into a function, maybe omit non-letters (I don't know if the colon was supposed to be part of your string or not), etc.
To do this to a vector of strings, you can use it as an anonymous function with lapply:
lapply(month.name, function(x) substring(x, first = 1:(nchar(x) - 1), last = 2:nchar(x)))
# [[1]]
# [1] "Ja" "an" "nu" "ua" "ar" "ry"
#
# [[2]]
# [1] "Fe" "eb" "br" "ru" "ua" "ar" "ry"
#
# [[3]]
# [1] "Ma" "ar" "rc" "ch"
# ...
Or make it into a named function and use it by name. This would make sense if you'll use it somewhat frequently.
str_split_pairs = function(x) {
substring(x, first = 1:(nchar(x) - 1), last = 2:nchar(x))
}
lapply(month.name, str_split_pairs)
## same result as above
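One caveat: 1:(nchar(x) - 1) counts down as 1:0 for a one-character string, so the function above does not return an empty result in that case. A hedged variant that guards against short inputs (the name str_split_pairs_safe is hypothetical) might look like this:

```r
str_split_pairs_safe <- function(x) {
  n <- nchar(x)
  # Strings shorter than two characters contain no pairs.
  if (n < 2) return(character(0))
  starts <- seq_len(n - 1)
  substring(x, first = starts, last = starts + 1)
}

pairs_example <- str_split_pairs_safe("example")
pairs_single <- str_split_pairs_safe("a")
```

Like the original, this handles one string at a time, so it is still meant to be used with lapply over a vector of strings.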
Here's another option using lead from dplyr (though it's slower than @Gregor's answer):
library(dplyr) # provides lead()
x=c("example", "stackoverflow", "programming")
lapply(x, function(i) {
i = unlist(strsplit(i,""))
paste0(i, lead(i))[-length(i)]
})
[[1]]
[1] "ex" "xa" "am" "mp" "pl" "le"
[[2]]
[1] "st" "ta" "ac" "ck" "ko" "ov" "ve" "er" "rf" "fl" "lo" "ow"
[[3]]
[1] "pr" "ro" "og" "gr" "ra" "am" "mm" "mi" "in" "ng"

Sort character list in ascending order based on order of numerical list in r

I want to sort a list of character vectors based on the ordering of another list.
For example, given a list char and a list of values (numbers) mini, I can get a sorted char list:
sorted<-mapply(function(x, y) y[x], lapply(mini, order), char)
I want to use arrange/order to sort the char list by ascending values in the mini list
(when values in mini are tied, I want the char values in ascending alphabetical order).
Suggestions?
EDIT: dummy example
char <- list(A=c("dd", "aa", "cc", "ff"), B=c("rr", "ee", "tt", "aa"))
mini <- list(A=c(4,2,4,4), B=c(5,5,7,1))
char
$A
"dd" "aa" "cc" "ff" ...
$B
"rr" "ee" "tt" "aa" ...
mini
$A
4 2 4 4 ...
$B
5 5 7 1 ...
expected result:
sorted
$A
"aa" "cc" "dd" "ff"
$B
"aa" "ee" "rr" "tt"
Try this:
Map(function(x, y) y[order(x, y)], mini, char)
lapply( names(char), function(nm) char[[nm]][order(mini[[nm]], char[[nm]])])
#------
[[1]]
[1] "aa" "cc" "dd" "ff"
[[2]]
[1] "aa" "ee" "rr" "tt"
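Both versions rely on order accepting more than one key: order(x, y) sorts by x first and uses y only to break ties. A short sketch on the dummy data, using Map so that the list names A and B are kept:

```r
char <- list(A = c("dd", "aa", "cc", "ff"), B = c("rr", "ee", "tt", "aa"))
mini <- list(A = c(4, 2, 4, 4), B = c(5, 5, 7, 1))

# For each pair of vectors: order by the numbers, break ties alphabetically,
# then use that ordering to index the character vector.
sorted <- Map(function(x, y) y[order(x, y)], mini, char)
```

In A the value 2 puts "aa" first, and the three entries tied at 4 come out alphabetically as "cc", "dd", "ff".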

splitting a character vector at specified intervals in R

I have some sentences in specific format and I need to split them at regular intervals.
The sentences look like this
"abxyzpqrst34245"
"mndeflmnop6346781"
I want to split each of these sentences after the following character positions: c(2,5,10), so that the output will be:
[1] c("ab", "xyz", "pqrst", "34245")
[2] c("mn", "def", "lmnop", "6346781")
NOTE: The numeric part after the 3rd split is of variable length, whereas the previous parts are of fixed length.
I tried to use cut, but it only works with integer vectors.
I looked at split, but I'm not sure if it works without factors.
So, I finally went with substr to divide each of the sentences separately like this:
substr("abxyzpqrst34245", 1,2)
[1] "ab"
substr("abxyzpqrst34245", 3,5)
[1] "xyz"
substr("abxyzpqrst34245", 6,10)
[1] "pqrst"
substr("abxyzpqrst34245", 11,10000)
[1] "34245"
I'm using this long process to split these strings. Is there any easier way to achieve this splitting?
You're looking for (the often overlooked) substring:
x <- "abxyzpqrst34245"
substring(x,c(1,3,6,11),c(2,5,10,nchar(x)))
[1] "ab" "xyz" "pqrst" "34245"
which is handy because it is fully vectorized. If you want to do this over multiple strings in turn, you might do something like this:
x <- c("abxyzpqrst34245","mndeflmnop6346781")
lapply(x,function(y) substring(y,first = c(1,3,6,11),last = c(2,5,10,nchar(y))))
[[1]]
[1] "ab" "xyz" "pqrst" "34245"
[[2]]
[1] "mn" "def" "lmnop" "6346781"
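The start and end positions can also be derived from the cut points c(2,5,10) given in the question instead of being typed by hand. A minimal sketch, where split_at is a hypothetical helper name:

```r
split_at <- function(x, cuts) {
  # Each piece starts one character after the previous cut; the first starts at 1.
  starts <- c(1, cuts + 1)
  # Each piece ends at its cut; the last piece runs to the end of the string.
  lapply(x, function(s) substring(s, first = starts, last = c(cuts, nchar(s))))
}

x <- c("abxyzpqrst34245", "mndeflmnop6346781")
pieces <- split_at(x, c(2, 5, 10))
```

Because the last end position is taken from nchar(s) for each string, the variable-length numeric tail is handled without a large placeholder such as 10000.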
If you have a vector of strings to be split, you might also find read.fwf() handy. Use it like so:
x <- c("abxyzpqrst34245", "mndeflmnop6346781")
df <- read.fwf(file = textConnection(x),
widths = c(2,3,5,10000),
colClasses = "character")
df
# V1 V2 V3 V4
# 1 ab xyz pqrst 34245
# 2 mn def lmnop 6346781
str(df)
# 'data.frame': 2 obs. of 4 variables:
# $ V1: chr "ab" "mn"
# $ V2: chr "xyz" "def"
# $ V3: chr "pqrst" "lmnop"
# $ V4: chr "34245" "6346781"
