For example, I have an element "computer" in a vector. I need to get a vector consisting of "c", "o", "m", "p", "u", "t", "e", "r".
The second part of my question is optional. How can I create a vector of letter combinations from the elements of the above-mentioned vector, such that the letters in each combination appear in the same order as in the original word? For instance, I want to get something like "puter" or "mpu" in this vector, but not "tumpo".
You can use
strsplit("computer", "\\b")
and
library("RWeka")
gsub(" ", "",
NGramTokenizer(paste(strsplit("computer", "\\b")[[1]], collapse=" "),
Weka_control(min=2,
max=5)),
fixed=TRUE)
# [1] "compu" "omput" "mpute" "puter" "comp"
# [6] "ompu" "mput" "pute" "uter" "com"
# [11] "omp" "mpu" "put" "ute" "ter"
# [16] "co" "om" "mp" "pu" "ut"
# [21] "te" "er"
to create n-grams with 2 <= n <=5.
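If installing RWeka (which needs Java) is not an option, the same n-grams can be sketched in base R with substring. The helper name char_ngrams and its arguments are made up for this illustration, and the results come out shortest first rather than in RWeka's order:

```r
# Character n-grams of a single word, base R only.
# char_ngrams is a hypothetical helper, not part of RWeka or base R.
char_ngrams <- function(word, min_n = 2, max_n = 5) {
  len <- nchar(word)
  if (min_n > len) return(character(0))
  sizes <- seq(min_n, min(max_n, len))
  unlist(lapply(sizes, function(n) {
    # every start position that leaves room for n letters
    starts <- seq_len(len - n + 1)
    substring(word, starts, starts + n - 1)
  }))
}

grams <- char_ngrams("computer")
length(grams)
# [1] 22
```

Because substring only ever takes contiguous pieces, letter order is preserved: "puter" and "mpu" are in grams, "tumpo" is not.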
The first part of the question is easy:
splits <- unlist(strsplit("computer",split=""))
> splits
[1] "c" "o" "m" "p" "u" "t" "e" "r"
For the second part you can use the following code:
subseqs <-
unlist(
lapply(1:length(splits),FUN=function(x){
lapply(1:(length(splits)+1-x),FUN=function(y){
paste(splits[y:(y+x-1)],collapse="") })
})
)
> subseqs
[1] "c" "o" "m" "p" "u" "t" "e"
[8] "r" "co" "om" "mp" "pu" "ut" "te"
[15] "er" "com" "omp" "mpu" "put" "ute" "ter"
[22] "comp" "ompu" "mput" "pute" "uter" "compu" "omput"
[29] "mpute" "puter" "comput" "ompute" "mputer" "compute" "omputer"
[36] "computer"
For three consecutive letter combinations (note the [[1]]: strsplit returns a list, and combn needs the character vector itself):
x <- strsplit("computer", "")[[1]]
y <- combn(seq_along(x), 3); m <- match(1:6, y[1, ])
combn(x, 3)[, m]
I have character data like this
[[1]]
[1] "F" "S"
[[2]]
[1] "Y" "Q" "Q"
[[3]]
[1] "C" "T"
[[4]]
[1] "G" "M"
[[5]]
[1] "A" "M"
And I want to generate all permutations for each individual list (not mixed between lists) and combine them together into one big list.
For example, for the first and second lists, which are "F" "S" and "Y" "Q" "Q", I want to get the permutation lists as c("FS", "SF"), and c("YQQ", "QYQ", "QQY"), and then combine them into one.
Here's an approach with combinat::permn:
library(combinat)
lapply(data,function(x)unique(sapply(combinat::permn(x),paste,collapse = "")))
#[[1]]
#[1] "FS" "SF"
#
#[[2]]
#[1] "YQQ" "QYQ" "QQY"
#
#[[3]]
#[1] "CT" "TC"
#
#[[4]]
#[1] "GM" "MG"
#
#[[5]]
#[1] "AM" "MA"
Or together with unlist:
unlist(lapply(data,function(x)unique(sapply(combinat::permn(x),paste,collapse = ""))))
# [1] "FS" "SF" "YQQ" "QYQ" "QQY" "CT" "TC" "GM" "MG" "AM" "MA"
Data:
data <- list(c("F", "S"), c("Y", "Q", "Q"), c("C", "T"), c("G", "M"),
c("A", "M"))
It looks like your desired output is not exactly the same as this related post (Generating all distinct permutations of a list in R). But we can build on the answer there.
library(combinat)
# example data, based on your description
X <- list(c("F","S"), c("Y", "Q", "Q"))
result <- lapply(X, function(x1) {
unique(sapply(permn(x1), function(x2) paste(x2, collapse = "")))
})
print(result)
Output
[[1]]
[1] "FS" "SF"
[[2]]
[1] "YQQ" "QYQ" "QQY"
The first (outer) lapply iterates over each element of the list, and each element is a vector of individual letters. On each iteration, permn takes those letters (e.g. "F" and "S") and returns a list with all possible permutations (e.g. "F" "S" and "S" "F"). To format the output as you described, the inner sapply takes each of those permutations and collapses it into a single character value, and unique then drops the duplicates.
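For readers who would rather not depend on combinat, here is a rough base R sketch of the same idea. The names perms and distinct_perm_strings are invented for this example, and the recursion is only sensible for short vectors because the number of permutations grows factorially:

```r
# All permutations of a vector, as a list of vectors (hypothetical helper).
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    # fix element i in front, permute the rest
    for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# Collapse each permutation to one string and keep the distinct ones.
distinct_perm_strings <- function(v) {
  unique(vapply(perms(v), paste, character(1), collapse = ""))
}

X <- list(c("F", "S"), c("Y", "Q", "Q"))
result <- lapply(X, distinct_perm_strings)
```

Here result[[2]] holds the three distinct strings "YQQ", "QYQ" and "QQY", even though c("Y", "Q", "Q") has six raw permutations.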
library(combinat)
final <- unlist(lapply(X , function(test_X) lapply(permn(test_X), function(x) paste(x,collapse='')) ))
How can I create a list of sequential letters like Excel's column headers in R? I have 559 columns in my Excel file, so I would like to create a vector of sequential letters like "A, B,... Z, AA, AB,... AZ,... BA, BB,..." etc. I want this so that I can create my own data dictionary and "map" it to Excel's column headers.
If you are interested in a base R solution, you can try this:
all <- expand.grid(LETTERS, LETTERS)
all <- all[order(all$Var1,all$Var2),]
out <- c(LETTERS, do.call('paste0',all))
out is a character vector of 702 values (26 single letters plus 26 * 26 pairs). I believe you only need the first 559, so you can write: out[1:559].
To rename your columns you can use the following, where data_frame is the name of your data frame:
names(data_frame) <- out[1:559]
One important note: I am assuming here that you only need column names of at most two characters.
A generic approach using gtools
comb <- lapply(1:3, function(x)gtools::permutations(26,x, LETTERS, repeats.allowed = TRUE))
## 1:3 covers Excel headers of one to three letters
unlist(lapply(comb, function(x)do.call('paste0', data.frame(x,stringsAsFactors = FALSE))))
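The same up-to-three-letter sequence can also be sketched without gtools. This is only an illustration: excel_headers and max_len are names made up here. It relies on expand.grid varying its first column fastest, so the columns are reversed before pasting to get Excel's ordering:

```r
# Excel-style headers of up to max_len letters, base R only.
# excel_headers is a hypothetical helper name.
excel_headers <- function(max_len = 3) {
  unlist(lapply(seq_len(max_len), function(k) {
    g <- expand.grid(rep(list(LETTERS), k), stringsAsFactors = FALSE)
    # rev() puts the fastest-varying column last: AA, AB, ..., AZ, BA, ...
    do.call(paste0, rev(g))
  }))
}

h <- excel_headers(3)
h[c(1, 26, 27, 28, 559, 702, 703)]
# [1] "A"   "Z"   "AA"  "AB"  "UM"  "ZZ"  "AAA"
```

For the 559 columns in the question, h[1:559] ends at "UM".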
Some observations:
> out[1:50]
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K"
[12] "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V"
[23] "W" "X" "Y" "Z" "AA" "AB" "AC" "AD" "AE" "AF" "AG"
[34] "AH" "AI" "AJ" "AK" "AL" "AM" "AN" "AO" "AP" "AQ" "AR"
[45] "AS" "AT" "AU" "AV" "AW" "AX"
num2let takes a number or vector of numbers and converts it to letter notation. Pass the sequence 1:n to get sequential output.
We also provide the inverse function let2num, which takes codes and returns the numbers.
The main advantages of this approach are that (1) it also works on input vectors that are not sequences, e.g. num2let(599) finds the letter code for column number 599; (2) if used with a sequence, the sequence length does not have to be a multiple or a power of 26; (3) it can be used with other codes and bases, not just the 26 LETTERS, e.g. num2let(1:40, head(letters)); (4) there is no upper bound on n, e.g. num2let(1e10) works; (5) both directions are provided; and (6) only base R is used.
num2let <- function(n, lets = LETTERS) {
base <- length(lets)
if (length(n) > 1) return(sapply(n, num2let, lets = lets))
stopifnot(n > 0)
out <- ""
repeat {
if (n > base) {
rem <- (n-1) %% base
n <- (n-1) %/% base
out <- paste0(lets[rem+1], out)
} else return( paste0(lets[n], out) )
}
}
let2num <- function(x, lets = LETTERS) {
base <- length(lets)
s <- strsplit(x, "")
sapply(s, function(x) sum((match(x, lets)) * base ^ seq(length(x) - 1, 0)))
}
Test
lets <- num2let(1:40)
lets
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O"
## [16] "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z" "AA" "AB" "AC" "AD"
## [31] "AE" "AF" "AG" "AH" "AI" "AJ" "AK" "AL" "AM" "AN"
let2num(lets)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
num2let(599)
## [1] "WA"
let2num("WA")
## [1] 599
Other
I previously answered this question under "Numeric to Alphabetic Lettering Function in R". My answers there take different approaches from the one here, so you can look at those too.
Base R one liner:
c(LETTERS, unlist(sapply(seq_along(LETTERS), function(i){paste0(LETTERS[i], LETTERS)})))
I would like to apply grep() in R, but I am not very good with lapply(). I understand that lapply takes a list, applies a function to each member and returns a list. For instance, let x be a list consisting of 2 members.
> x<-strsplit(docs$Text," ")
>
> x
[[1]]
[1] "I" "lovehttp" "my" "mum." "I" "love"
[7] "my" "dad." "I" "love" "my" "brothers."
[[2]]
[1] "I" "live" "in" "Eastcoast" "now." "Job.I"
[7] "used" "to" "live" "in" "WestCoast."
I would like to apply the grep() function to remove words containing http. So, I would apply:
> lapply(x,grep(pattern="http",invert=TRUE, value=TRUE))
But it does not work and it says
Error in grep(pattern = "http", invert = TRUE, value = TRUE) :
argument "x" is missing, with no default
So, I tried
> lapply(x,grep(pattern="http",invert=TRUE, value=TRUE,x))
But it says
Error in match.fun(FUN) :
'grep(pattern = "http", invert = TRUE, value = TRUE, x)' is not a
function, character or symbol
Any help would be appreciated, thanks!
This can be done in one line:
lst <- lapply(lst, grep, pattern="http", value=TRUE, invert=TRUE)
#lst
#[[1]]
# [1] "I" "my" "mum." "I" "love" "my" "dad." "I" "love" "my" "brothers."
#
#[[2]]
# [1] "I" "live" "in" "Eastcoast" "now." "Job.I" "used" "to" "live" "in" "WestCoast."
If you don't want to remove the entire word that contains the pattern and remove only the pattern itself while retaining the rest of the word (as discussed in the comments), you can use gsub instead of grep:
lapply(lst, gsub, pattern="http", replacement="")
#[[1]]
# [1] "I" "love" "my" "mum." "I" "love" "my" "dad." "I" "love" "my" "brothers."
#
#[[2]]
# [1] "I" "live" "in" "Eastcoast" "now." "Job.I" "used" "to" "live" "in" "WestCoast."
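The reason the attempts in the question fail is that lapply expects a function as its second argument, whereas grep(pattern = "http", ...) is a call that R tries to evaluate immediately. Passing grep itself and handing the remaining arguments through lapply's ... works, as does an anonymous function. A small sketch comparing the two on shortened example data (lst_small is a made-up name for this illustration):

```r
# Shortened example data, made up for this sketch.
lst_small <- list(c("I", "lovehttp", "my", "mum."),
                  c("I", "live", "in", "Eastcoast"))

# Form 1: pass the function, extra arguments go through lapply's `...`.
a <- lapply(lst_small, grep, pattern = "http", value = TRUE, invert = TRUE)

# Form 2: anonymous function with logical indexing via grepl.
b <- lapply(lst_small, function(words) words[!grepl("http", words, fixed = TRUE)])

identical(a, b)
# [1] TRUE
```

fixed = TRUE treats "http" as a literal string instead of a regular expression, which makes no difference for this pattern but is a safer habit when the pattern may contain characters such as "." or "?".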
The following function removes all entries containing the substring http from a vector, and lapply applies it to every vector in your list:
repx <- function(x) {
y <- grep("http", x)
vec <- rep(TRUE, length(x))
vec[y] <- FALSE
x <- x[vec]
return(x)
}
lapply(lst, function(x) { repx(x) })
Data:
x1 <- c("I", "lovehttp", "my", "mum.", "I", "love", "my", "dad.", "I", "love", "my", "brothers.")
x2 <- c("I", "live", "in", "Eastcoast", "now.", "Job.I", "used", "to", "live", "in", "WestCoast.")
lst <- list(x1, x2)
I want to use the blank row represented by "" that exists in my list so I can group all the rows in between into sublists.
For example I have a long list that looks like this:
> data
[1] "data science"
[2] "big data"
[3] "machine learning"
[4] "BI"
[5] "analytics"
[6] ""
[7] "SAS"
[8] "R"
[9] "Python"
[10] "Spark"
[11] ""
[12] "Hive"
[13] "PIG"
[14] "IMPALA"
....
And I want something like this:
> output
[[1]] [1] "data science" "big data" "machine learning" "BI" "analytics"
[[2]] [1] "SAS" "R" "Python" "Spark"
[[3]] [1] "Hive" "PIG" "IMPALA"
The indexation in my output is maybe wrong but overall it's what I want.
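Maybe something with split would do it.
You are correct that split can help you. cumsum applied to the logical vector x == "" increases by one at every blank, which gives a group id for each element, and split then breaks the original vector apart by those ids. Each group after the first starts with the "" separator, and that is what tail(..., -1) drops inside the lapply. Note that the first group has no leading "", so tail also removes its first real element; filtering each group with g[g != ""] instead avoids that.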
set.seed(201)
x <- sample(letters, 20, replace = TRUE)
x[c(6,12)] <- ""
> lapply(split(x, cumsum(x == "")), tail, -1)
$`0`
[1] "p" "p" "q" "r"
$`1`
[1] "v" "n" "g" "l" "t"
$`2`
[1] "p" "p" "n" "e" "t" "c" "j" "m"
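A variant that keeps the first element of the first group is to drop the blank entries before splitting, using the same cumsum trick for the group ids. This is a minimal sketch on the data from the question; is_blank and groups are names chosen for the example:

```r
data <- c("data science", "big data", "machine learning", "BI", "analytics",
          "", "SAS", "R", "Python", "Spark",
          "", "Hive", "PIG", "IMPALA")

is_blank <- data == ""
# cumsum(is_blank) is 0 before the first blank, 1 after it, 2 after the second, ...
groups <- split(data[!is_blank], cumsum(is_blank)[!is_blank])
groups <- unname(groups)  # plain [[1]], [[2]], [[3]] instead of $`0`, $`1`, $`2`

lengths(groups)
# [1] 5 4 3
```

Since the blanks are removed up front, no group ever contains a "" entry, and runs of consecutive blanks do not create empty groups.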
I would like to generate all combinations of two vectors, given two constraints: there can never be more than 3 characters from the first vector, and there must always be at least one character from the second vector. I would also like to vary the final number of characters in the combination.
For instance, here are two vectors:
vec1=c("A","B","C","D")
vec2=c("W","X","Y","Z")
Say I wanted 3 characters in the combination. Possible acceptable permutations would be: "A" "B" "X" or "A" "Y" "Z". An unacceptable permutation would be: "A" "B" "C" since there is not at least one character from vec2.
Now say I wanted 5 characters in the combination. Possible acceptable permutations would be: "A" "C" "Z" "Y" or "A" "Y" "Z" "X". An unacceptable permutation would be: "A" "C" "D" "B" "X" since there are >3 characters from vec1.
I suppose I could use expand.grid to generate all combinations and then somehow subset, but there must be an easier way. Thanks in advance!
I'm not sure whether this is easier, but you can avoid generating permutations that do not satisfy your conditions with this strategy:
1. Generate all combinations from vec1 that are acceptable.
2. Generate all combinations from vec2 that are acceptable.
3. Generate all combinations taking one solution from 1. plus one solution from 2. Here I'd do the filtering on the third condition (the total number of characters) afterwards.
4. (If you're looking for combinations, you're done; otherwise:) produce all permutations of the letters within each result.
Now, let's have
vec1 <- LETTERS [1:4]
vec2 <- LETTERS [23:26]
## lists can eat up lots of memory, so use character vectors instead.
combine <- function (x, y)
combn (y, x, paste, collapse = "")
res1 <- unlist (lapply (0:3, combine, vec1))
res2 <- unlist (lapply (1:length (vec2), combine, vec2))
Now we have:
> res1
[1] "" "A" "B" "C" "D" "AB" "AC" "AD" "BC" "BD" "CD" "ABC"
[13] "ABD" "ACD" "BCD"
> res2
[1] "W" "X" "Y" "Z" "WX" "WY" "WZ" "XY" "XZ" "YZ"
[11] "WXY" "WXZ" "WYZ" "XYZ" "WXYZ"
res3 <- outer (res1, res2, paste0)
res3 <- res3 [nchar (res3) == 5]
So here you are:
> res3
[1] "ABCWX" "ABDWX" "ACDWX" "BCDWX" "ABCWY" "ABDWY" "ACDWY" "BCDWY" "ABCWZ"
[10] "ABDWZ" "ACDWZ" "BCDWZ" "ABCXY" "ABDXY" "ACDXY" "BCDXY" "ABCXZ" "ABDXZ"
[19] "ACDXZ" "BCDXZ" "ABCYZ" "ABDYZ" "ACDYZ" "BCDYZ" "ABWXY" "ACWXY" "ADWXY"
[28] "BCWXY" "BDWXY" "CDWXY" "ABWXZ" "ACWXZ" "ADWXZ" "BCWXZ" "BDWXZ" "CDWXZ"
[37] "ABWYZ" "ACWYZ" "ADWYZ" "BCWYZ" "BDWYZ" "CDWYZ" "ABXYZ" "ACXYZ" "ADXYZ"
[46] "BCXYZ" "BDXYZ" "CDXYZ" "AWXYZ" "BWXYZ" "CWXYZ" "DWXYZ"
If you prefer the results split into single letters:
res <- matrix (unlist (strsplit (res3, "")), nrow = length (res3), byrow = TRUE)
> res
[,1] [,2] [,3] [,4] [,5]
[1,] "A" "B" "C" "W" "X"
[2,] "A" "B" "D" "W" "X"
[3,] "A" "C" "D" "W" "X"
[4,] "B" "C" "D" "W" "X"
(snip)
[51,] "C" "W" "X" "Y" "Z"
[52,] "D" "W" "X" "Y" "Z"
Which are your combinations.
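To vary the final number of characters, the same strategy can be wrapped in a function that only pairs sizes adding up to k, so no filtering with nchar is needed afterwards. This is a sketch under the question's constraints; constrained_combos and its argument names are invented here, and the zero-letter case is handled explicitly rather than relying on combn with m = 0:

```r
# Combinations of k letters with at most max_from_vec1 letters from vec1
# and at least min_from_vec2 letters from vec2 (hypothetical helper).
constrained_combos <- function(vec1, vec2, k, max_from_vec1 = 3, min_from_vec2 = 1) {
  stopifnot(k >= min_from_vec2)
  out <- character(0)
  for (n1 in 0:min(max_from_vec1, length(vec1), k - min_from_vec2)) {
    n2 <- k - n1                       # letters still needed from vec2
    if (n2 > length(vec2)) next        # vec2 cannot supply that many
    left  <- if (n1 == 0) "" else combn(vec1, n1, paste, collapse = "")
    right <- combn(vec2, n2, paste, collapse = "")
    out <- c(out, as.vector(outer(left, right, paste0)))
  }
  out
}

vec1 <- c("A", "B", "C", "D")
vec2 <- c("W", "X", "Y", "Z")
length(constrained_combos(vec1, vec2, 5))
# [1] 52
length(constrained_combos(vec1, vec2, 3))
# [1] 52
```

For k = 5 this yields the same 52 strings as res3 above, although in a different order, since it builds them by the number of letters taken from vec1.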