For simulation purposes, I'm trying to figure out how to generate a long finite string of text of English letters with spaces and periods. I've been playing with concatenate and reg. exp, and looking at help files, but being a newbie to R, I'm not making much progress. All I've got so far is x <- sample(c("a", "b", "c", " ", "."), 100, replace=TRUE). Well, that's only three letters. I've been trying [a:z] and things like that, but c() doesn't seem to like that and gives errors.
Then, even if I enumerate every single letter in c(), the sample function returns a character vector with each letter being an element:
str(x)
chr [1:100] "c" "b" "a" "b" ...
but I need the whole string to be just one element in a character vector. An example of a function I'm looking for would generate a text string like "asdf twdjk.fd alw" of any length I want, in this case 17. So if I do a str() on the result, it should give me:
str(x)
chr "asdf twdjk.fd alw"
Thank you in advance for any tips.
R provides all the letters in the alphabet as a built-in constant, so to get the characters to generate from, you can just do:
chars = c(letters, " ", ".")
Then use paste0 to combine the results of your sampling into a single string:
paste0(sample(chars, 100, replace=TRUE), collapse="")
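To get a string of any requested length, the two steps above can be wrapped in a small function. This is only a sketch: random_string and its arguments are hypothetical names, not something built into R.

```r
# Hypothetical helper: build ONE random string of n characters
# drawn from lowercase letters, space and period.
random_string <- function(n, chars = c(letters, " ", ".")) {
  # sample() returns n separate one-character elements;
  # paste0(..., collapse = "") glues them into a single string
  paste0(sample(chars, n, replace = TRUE), collapse = "")
}

set.seed(42)            # only to make the example reproducible
x <- random_string(17)
str(x)                  # a character vector of length 1
```

Calling nchar(x) should give back the length that was asked for.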
Hopefully someone can do better:
Alphabet <- c(LETTERS, " ", ".")
set.seed(123)
x <- sample(Alphabet, 17) # maybe replace=TRUE
x <- as.list(x)
pasteNoSpaces <- function(...) paste(..., sep="") # paste0 is better
do.call("pasteNoSpaces", x)
I am trying to make a word scrambler in R. So I have put some words in a collection and tried to use strsplit() to split the letters of each word in the collection.
But I don't understand how to jumble the letters of a word and merge them back into one word in R. Does anyone know how I can solve this?
This is what I have done: [image]
Once you've split the words, you can use sample() to rescramble the letters, and then paste0() with collapse="", to concatenate back into a 'word'
lapply(words, function(x) paste0(sample(strsplit(x, split="")[[1]]), collapse=""))
You can use the stringi package if you want:
> stringi::stri_rand_shuffle(c("hello", "goodbye"))
[1] "oellh" "deoygob"
Here's a one-liner:
lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = "")
[[1]]
[1] "elfi"
[[2]]
[1] "vleo"
[[3]]
[1] "rmsyyet"
Use unlist to get rid of the list:
unlist(lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = ""))
Data:
strings <- c("life", "love", "mystery")
You can use the sample function for this.
Here is an example of doing it for a single word. You can use this within your for-loop:
yourword <- "hello"
# strsplit() returns a list with one character vector in it.
# We only want to work with the vector, not the list, so we extract the first
# (and only) element with "[[1]]"
jumble <- strsplit(yourword, "")[[1]]
jumble <- sample(jumble,                # sample random elements from jumble
                 size = length(jumble), # as many times as the length of jumble,
                                        # ergo all letters
                 replace = FALSE        # do not sample an element multiple times
)
restored <- paste0(jumble,
                   collapse = ""        # join the letters with no separator
)
As the answer from langtang suggests, you can use the apply family for this, which is more efficient. But maybe this answer helps the understanding of what R is actually doing here.
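For completeness, here is a sketch of how the single-word steps could sit inside a for-loop. The words vector is a made-up example input, and scrambled is just a name chosen for the result.

```r
words <- c("life", "love", "mystery")     # hypothetical input
scrambled <- character(length(words))     # pre-allocate the result

for (i in seq_along(words)) {
  letters_i <- strsplit(words[i], "")[[1]]          # one word -> vector of letters
  scrambled[i] <- paste0(sample(letters_i),         # shuffle the letters
                         collapse = "")             # and glue them back together
}

scrambled
```

Each scrambled word keeps the same letters and the same length as the original, only the order changes.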
How to split a string into elements of fixed length in R is a commonly asked question to which typical answers either rely on substring(x) or strsplit(x, split="") followed by paste(y, collapse = "").
For instance, one would split the string "azertyuiop" into "aze", "rty", "uio", "p" by specifying a fixed length of 3 characters.
I'm looking for the fastest way possible.
After some testing with long strings (> 1000 chars), I have found that substring() is way too slow. The strategy is hence to split the string into individual characters, and then paste them back into groups of the desired length, by applying some cleverness.
Here is the fastest function I could come up with. The idea is to split the string into individual chars, then have a separator interspersed in the character vector at the right positions, collapse the characters (and separators) back into a string, then split the string again, but this time specifying the separator.
splitInParts <- function(string, size) { # can process a vector of strings; "size" is the length of the desired substrings
  chars <- strsplit(string, "", fixed = TRUE)
  lengths <- nchar(string)
  nFullGroups <- floor(lengths / size) # the number of complete substrings of the desired size
  # Here we prepare a vector of separators (commas), which we will replace by the characters,
  # except at the positions that will have to separate substring groups of length "size".
  # Assumes that the string doesn't contain any commas.
  seps <- Map(rep, ",", lengths + nFullGroups) # the seps vector is longer than the chars vector, because there are separators (as many as there are groups)
  indices <- Map(seq, 1, lengths + nFullGroups) # the positions at which separators will be replaced by the characters
  indices <- lapply(indices, function(x) which(x %% (size + 1) != 0)) # exclude the positions at which we want to retain the separators (I haven't found a better way to generate such a vector of indices)
  temp <- function(x, y, z) { # a function describing the replacement, because we call it in the Map() call below
    x[y] <- z
    x
  }
  res <- Map(temp, seps, indices, chars) # now we have a vector of chars with separators interspersed
  res <- sapply(res, paste, collapse = "", USE.NAMES = FALSE) # collapses the characters and separators
  strsplit(res, ",", fixed = TRUE) # at last, split the strings into elements of the desired length
}
This looks quite tedious, but I have tried to simply put the chars vector into a matrix with the adequate number of rows, then collapse the matrix columns with apply(mat, 2, paste, collapse=""). This is MUCH slower. And splitting the character vector with split() into a list of vectors of the right length, so as to collapse elements, is even slower.
So if you can find something faster, let me know. If not, well my function may be of some use. :)
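For reference, the matrix-based alternative mentioned above could look roughly like this. It is a sketch only: split_matrix is a hypothetical name, it handles a single string rather than a vector, and padding the last column with empty strings is an assumption about how to deal with a remainder.

```r
# Hypothetical matrix-based splitter (single string only)
split_matrix <- function(string, size) {
  chars <- strsplit(string, "", fixed = TRUE)[[1]]
  pad <- (-length(chars)) %% size            # cells missing from the last column
  mat <- matrix(c(chars, rep("", pad)),      # fill column by column,
                nrow = size)                 # one substring per column
  apply(mat, 2, paste, collapse = "")        # collapse each column
}

split_matrix("azertyuiop", 3)
# [1] "aze" "rty" "uio" "p"
```

It gives the same pieces as splitInParts() for one string, but the apply() call over the columns is the slow part.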
Was fun reading the updates, so I benchmarked:
> nchar(mystring)
[1] 260000
My idea was nearly the same as @akrun's (str_extract_all uses the same function under the hood, IIRC):
library(stringr)
tensiSplit <- function(string,size) {
str_extract_all(string, paste0('.{1,',size,'}'))
}
And the results on my machine:
> microbenchmark(splitInParts(mystring,3),akrunSplit(mystring,3),splitInParts2(mystring,3),tensiSplit(mystring,3),gsubSplit(mystring,3),times=3)
Unit: milliseconds
                       expr        min         lq       mean     median         uq        max neval
  splitInParts(mystring, 3)   64.80683   64.83033   64.92800   64.85384   64.98858   65.12332     3
    akrunSplit(mystring, 3) 4309.19807 4315.29134 4330.40417 4321.38461 4341.00722 4360.62983     3
 splitInParts2(mystring, 3)   21.73150   21.73829   21.90200   21.74507   21.98725   22.22942     3
    tensiSplit(mystring, 3)   21.80367   21.85201   21.93754   21.90035   22.00447   22.10859     3
     gsubSplit(mystring, 3)   53.90416   54.28191   54.55416   54.65966   54.87915   55.09865     3
We can split by specifying a regex lookbehind that matches the position preceded by 'n' characters. For example, if we are splitting by 3 characters, we match the position/boundary preceded by 3 characters ((?<=.{3})).
splitInParts <- function(string, size){
pat <- paste0('(?<=.{',size,'})')
strsplit(string, pat, perl=TRUE)
}
splitInParts(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"
splitInParts(str1, 4)
#[[1]]
#[1] "azer" "tyui" "op"
splitInParts(str1, 5)
#[[1]]
#[1] "azert" "yuiop"
Or another approach is using stri_extract_all from library(stringi).
library(stringi)
splitInParts2 <- function(string, size){
pat <- paste0('.{1,', size, '}')
stri_extract_all_regex(string, pat)
}
splitInParts2(str1, 3)
#[[1]]
#[1] "aze" "rty" "uio" "p"
stri_extract_all_regex(str1, '.{1,3}')
data
str1 <- "azertyuiop"
Alright, there was a faster solution published here (d'oh!)
Simply
strsplit(gsub(paste0("([[:alnum:]]{", size, "})"), "\\1 ", string), " ", fixed = TRUE)
Here using a space as separator, so the string itself must not contain spaces.
(didn't think about [[:alnum:]]{}).
How can I mark my own question as a duplicate? :(
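Wrapped as a function, the gsub() idea might be sketched as below. The name gsubSplit is borrowed from the benchmark earlier in the thread; the body is an assumption about what it looked like. It only groups alphanumeric characters, and the input must not already contain spaces.

```r
# Sketch of the gsub-based splitter
gsubSplit <- function(string, size) {
  # build the pattern with the actual size, e.g. "([[:alnum:]]{3})"
  pat <- paste0("([[:alnum:]]{", size, "})")
  # put a space after every full group, then split on that space
  strsplit(gsub(pat, "\\1 ", string), " ", fixed = TRUE)
}

gsubSplit("azertyuiop", 3)
# [[1]]
# [1] "aze" "rty" "uio" "p"
```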
I have a dataframe comprising two columns of words. For each row I'd like to identify any letters that occur in only the word in the second column e.g.
carpet carpelt #return 'l'
bag flag #return 'f' & 'l'
dog dig #return 'i'
I'd like to use R to do this automatically as I have 6126 rows.
As an R newbie, the best I've got so far is this, which gives me the unique letters across both words (and is obviously very clumsy):
x<-(strsplit("carpet", ""))
y<-(strsplit("carpelt", ""))
z<-list(l1=x, l2=y)
unique(unlist(z))
Any help would be much appreciated.
The function you’re searching for is setdiff:
chars_for = function (str)
strsplit(str, '')[[1]]
result = setdiff(chars_for(word2), chars_for(word1))
(Note the inverted order of the arguments in setdiff.)
To apply it to the whole data.frame, called x:
apply(x, 1, function (words) setdiff(chars_for(words[2]), chars_for(words[1])))
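Here is a self-contained sketch on the three example rows. The column names word1 and word2 and the new_letters column are made up for illustration, and mapply() is used in place of apply() so that the result is one string per row.

```r
chars_for <- function(str) strsplit(str, "")[[1]]

# hypothetical column names; the data are the three rows from the question
x <- data.frame(word1 = c("carpet", "bag", "dog"),
                word2 = c("carpelt", "flag", "dig"),
                stringsAsFactors = FALSE)

# letters of word2 that do not occur in word1, pasted into one string per row
x$new_letters <- mapply(
  function(w1, w2) paste(setdiff(chars_for(w2), chars_for(w1)), collapse = ""),
  x$word1, x$word2, USE.NAMES = FALSE
)

x$new_letters
# [1] "l"  "fl" "i"
```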
Use regex :) Paste your word with brackets [] and then use replace function for regex. This regex finds any letter from those in brackets and replaces it with empty string (you can say that it "removes" these letters).
require(stringi)
x <- c("carpet","bag","dog")
y <- c("carplet", "flag", "smog")
pattern <- stri_paste("[",x,"]")
pattern
## [1] "[carpet]" "[bag]" "[dog]"
stri_replace_all_regex(y, pattern, "")
## [1] "l" "fl" "sm"
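If you would rather not depend on stringi, the same idea can be sketched in base R with gsub(). Two caveats that apply to either version: repeated letters are kept (unlike with setdiff), and words containing regex-special characters such as ], ^ or - would need escaping first.

```r
x <- c("carpet", "bag", "dog")
y <- c("carplet", "flag", "smog")

pattern <- paste0("[", x, "]")   # one character class per word in x

# gsub() is not vectorised over the pattern, so pair them up with mapply()
res <- mapply(function(p, s) gsub(p, "", s), pattern, y, USE.NAMES = FALSE)
res
# [1] "l"  "fl" "sm"
```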
x <- c("carpet","bag","dog")
y <- c("carpelt", "flag", "dig")
Following (somewhat) with what you were going for with strsplit, you could do
> sx <- strsplit(x, "")
> sy <- strsplit(y, "")
> lapply(seq_along(sx), function(i) sy[[i]][ !sy[[i]] %in% sx[[i]] ])
#[[1]]
#[1] "l"
#
#[[2]]
#[1] "f" "l"
#
#[[3]]
#[1] "i"
This uses %in% to logically match the characters in y with the characters in x. I negate the matching with ! to determine those characters that are in y but not in x.
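One difference from setdiff() worth knowing: the %in% version keeps repeated letters, while setdiff() returns each letter once. A small sketch with a made-up pair of words:

```r
sx <- strsplit("bag", "")[[1]]
sy <- strsplit("fluff", "")[[1]]    # hypothetical second word with repeats

kept_dups <- sy[!sy %in% sx]        # every letter of sy not found in sx
no_dups   <- setdiff(sy, sx)        # same letters, duplicates removed

kept_dups
# [1] "f" "l" "u" "f" "f"
no_dups
# [1] "f" "l" "u"
```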
Suppose a vector:
xx.1 <- c("zz_ZZ_uu_d", "II_OO_d")
I want to get a new vector by splitting each element at its rightmost delimiter, and only once. The expected result would be:
c("zz_ZZ_uu", "d", "II_OO", "d").
It would be like Python's rsplit() function. My current idea is to reverse the string, and split it with str_split() in stringr.
Any better solutions?
update
Here is my solution returning n splits, depending on stringr and stringi. It would be nice that someone provides a version with base functions.
rsplit <- function (x, s, n) {
cc1 <- unlist(stringr::str_split(stringi::stri_reverse(x), s, n))
cc2 <- rev(purrr::map_chr(cc1, stringi::stri_reverse))
return(cc2)
}
Negative lookahead:
unlist(strsplit(xx.1, "_(?!.*_)", perl = TRUE))
# [1] "zz_ZZ_uu" "d" "II_OO" "d"
Where a(?!b) says to find such an a which is not followed by a b. In this case .*_ means that no matter how far (.*) there should not be any more _'s.
However, it seems to be not that easy to generalise this idea. First, note that it can be rewritten as positive lookahead with _(?=[^_]*$) (find _ followed by anything but _, here $ signifies the end of a string). Then a not very elegant generalisation would be
rsplit <- function(x, s, n) {
p <- paste0("[^", s, "]*")
rx <- paste0(s, "(?=", paste(rep(paste0(p, s), n - 1), collapse = ""), p, "$)")
unlist(strsplit(x, rx, perl = TRUE))
}
vec <- c("a_b_c_d_e_f_g", "a_b") # example input, inferred from the outputs below
rsplit(vec, "_", 1)
# [1] "a_b_c_d_e_f" "g" "a" "b"
rsplit(vec, "_", 3)
# [1] "a_b_c_d" "e_f_g" "a_b"
where e.g. in case n=3 this function uses _(?=[^_]*_[^_]*_[^_]*$).
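To make the two claims above concrete, here is a short sketch that checks that the negative and positive lookahead give the same split on the question's data, and prints the pattern that gets built for n = 3. The variable names neg, pos and rx are just for illustration.

```r
xx.1 <- c("zz_ZZ_uu_d", "II_OO_d")

neg <- unlist(strsplit(xx.1, "_(?!.*_)", perl = TRUE))     # negative lookahead
pos <- unlist(strsplit(xx.1, "_(?=[^_]*$)", perl = TRUE))  # positive lookahead
identical(neg, pos)
# [1] TRUE

# the pattern assembled inside rsplit() for s = "_" and n = 3
s <- "_"
n <- 3
p <- paste0("[^", s, "]*")
rx <- paste0(s, "(?=", paste(rep(paste0(p, s), n - 1), collapse = ""), p, "$)")
rx
# [1] "_(?=[^_]*_[^_]*_[^_]*$)"
```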
Another two. In both I use "(.*)_(.*)" as the pattern to capture both parts of the string. Remember that * is greedy so the first (.*) will match as many characters as it can.
Here I use regexec to capture where your substrings start and end, and regmatches to reconstruct them:
unlist(lapply(regmatches(xx.1, regexec("(.*)_(.*)", xx.1)),
tail, -1))
And this one is a little less academic but easy to understand:
unlist(strsplit(sub("(.*)_(.*)", "\\1###\\2", xx.1), "###"))
What about just pasting it back together after it's split?
rsplit <- function( x, s ) {
spl <- strsplit( x, s, fixed=TRUE )[[1]]
res <- paste( spl[-length(spl)], collapse=s, sep="" )
c( res, spl[length(spl)] )
}
> rsplit("zz_ZZ_uu_d", "_")
[1] "zz_ZZ_uu" "d"
I also thought about a very similar approach to that of Ari
> res <- lapply(strsplit(xx.1, "_"), function(x){
c(paste0(x[-length(x)], collapse="_" ), x[length(x)])
})
> unlist(res)
[1] "zz_ZZ_uu" "d" "II_OO" "d"
This gives exactly what you want and is the simplest approach:
require(stringr)
as.vector(t(str_match(xx.1, '(.*)_(.*)') [,-1]))
[1] "zz_ZZ_uu" "d" "II_OO" "d"
Explanation:
str_split() is not the droid you're looking for, because it only does left-to-right split, and splitting then repasting all the (n-1) leftmost matches is a total waste of time. So use str_match() with a regex with two capture groups. Note the first (.*)_ will greedily match everything up to the last occurrence of _, which is what you want. (This will fail if there isn't at least one _, and return NAs)
str_match() returns a matrix where the first column is the entire string, and subsequent columns are individual capture groups. We don't want the first column, so drop it with [,-1]
as.vector() will unroll that matrix column-wise, which is not what you want, so we use t() to transpose it to unroll row-wise
str_match(string, pattern) is vectorized over both string and pattern, which is neat
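If strings without any _ can occur, a base R sketch along the same lines could fall back to the original string instead of producing NAs. rsplit_once is a hypothetical name, and "nodelim" is a made-up input used to show the fallback.

```r
# Hypothetical helper: split once at the last "_",
# returning the string unchanged when there is no "_"
rsplit_once <- function(x) {
  # each match is c(whole match, group 1, group 2); no match gives character(0)
  m <- regmatches(x, regexec("(.*)_(.*)", x))
  Map(function(parts, orig) if (length(parts)) parts[-1] else orig, m, x)
}

res <- rsplit_once(c("zz_ZZ_uu_d", "nodelim"))
unname(res)
# [[1]]
# [1] "zz_ZZ_uu" "d"
#
# [[2]]
# [1] "nodelim"
```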
I have fullname data that I have used strsplit() to get each element of the name.
# Dataframe with a `names` column (complete names)
df <- data.frame(
names =
c("Adam, R, Goldberg, MALS, MBA",
"Adam, R, Goldberg, MEd",
"Adam, S, Metsch, MBA",
"Alan, Haas, MSW",
"Alexandra, Dumas, Rhodes, MA",
"Alexandra, Ruttenberg, PhD, MBA"),
stringsAsFactors=FALSE)
# Add a column with the split names (it is actually a list)
df$splitnames <- strsplit(df$names, ', ')
I also have a list of degrees below
degrees<-c("EdS","DEd","MEd","JD","MS","MA","PhD","MSPH","MSW","MSSA","MBA",
"MALS","Esq","MSEd","MFA","MPA","EdM","BSEd")
I would like to get the intersection for each name and respective degrees.
I'm not sure how to flatten the name list so I can compare the two vectors using intersect. When I tried unlist(df$splitname,recursive=F) it returned each element separately. Any help is appreciated.
Try
df$intersect <- lapply(X=df$splitnames, FUN=intersect, y=degrees)
That will give you a list of the intersection of each element in df$splitnames with degrees (e.g. intersect(df$splitnames[[1]], degrees)). If you want it as a vector:
sapply(X=df$intersect, FUN=paste, collapse=', ')
I assume you need it as a vector, since possibly the complete names came from one (for instance, from a dataframe), but strsplit outputs a list.
Does that work? If not, please try to clarify your intention.
Good luck!
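A compact, self-contained sketch of the same idea, using three of the names from the question and a shortened degrees vector. The degrees_found column name is made up.

```r
df <- data.frame(
  names = c("Adam, R, Goldberg, MALS, MBA",
            "Alan, Haas, MSW",
            "Alexandra, Ruttenberg, PhD, MBA"),
  stringsAsFactors = FALSE)
degrees <- c("MALS", "MBA", "MSW", "PhD")   # shortened list, for illustration only

df$splitnames <- strsplit(df$names, ", ")

# one string per row: the parts of the name that are also in degrees
df$degrees_found <- vapply(
  df$splitnames,
  function(parts) paste(intersect(parts, degrees), collapse = ", "),
  character(1)
)

df$degrees_found
# [1] "MALS, MBA" "MSW"       "PhD, MBA"
```

The degrees come out in the order they appear in each name, because intersect() keeps the order of its first argument.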
For continuity, you can use unlist:
hh <- unlist(df$splitnames)
intersect(hh,degrees)
For example :
ll <- list(c("Adam" , "R" , "Goldberg" ,"MALS" , "MBA "),
c("Adam" , "R" , "Goldberg", "MEd" ))
intersect(hh,degrees)
[1] "MEd"
or equivalent to :
hh[hh %in% degrees]
[1] "MEd"
To get differences you can use
setdiff(hh,degrees)
[1] "Adam" "R" "Goldberg" "MALS" "MBA "
...