How to randomly reshuffle letters in words - r

I am trying to make a word scrambler in R. So I have put some words in a collection and tried to use strsplit() to split the letters of each word in the collection.
But I don't understand how to jumble the letters of a word and merge them back into one word in R. Does anyone know how I can solve this?
This is what I have done so far: [screenshot of code, not preserved]

Once you've split the words, you can use sample() to shuffle the letters, and then paste0() with collapse="" to concatenate them back into a 'word':
lapply(words, function(x) paste0(sample(strsplit(x, split="")[[1]]), collapse=""))
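The call above returns a list. As a minimal sketch (scramble_words is a hypothetical helper name, not part of the original answer), the same steps can be wrapped in a function that returns a plain character vector by using vapply() instead of lapply():

```r
# Hypothetical helper: shuffle the letters of every word in a character vector.
scramble_words <- function(words) {
  vapply(
    strsplit(words, split = ""),                           # one vector of letters per word
    function(chars) paste0(sample(chars), collapse = ""),  # shuffle, then glue back together
    character(1)                                           # each result is a single string
  )
}

set.seed(42)  # only so that the shuffle is reproducible
words <- c("life", "love", "mystery")
scrambled <- scramble_words(words)
scrambled
```

Each element of the result has the same letters as the corresponding input word, just in a random order.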

You can use the stringi package if you want:
> stringi::stri_rand_shuffle(c("hello", "goodbye"))
[1] "oellh" "deoygob"

Here's a one-liner:
lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = "")
[[1]]
[1] "elfi"
[[2]]
[1] "vleo"
[[3]]
[1] "rmsyyet"
Use unlist to get rid of the list:
unlist(lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = ""))
Data:
strings <- c("life", "love", "mystery")

You can use the sample function for this.
Here is an example of doing it for a single word. You can use this within your for-loop:
yourword <- "hello"
# strsplit() returns a list with one character vector in it.
# We only want to work with the vector, not the list, so we extract the first
# (and only) element with "[[1]]".
jumble <- strsplit(yourword, "")[[1]]
jumble <- sample(jumble,                # sample random elements from jumble
                 size = length(jumble), # as many times as the length of jumble,
                                        # ergo all letters
                 replace = FALSE        # do not sample an element multiple times
)
restored <- paste0(jumble,
                   collapse = ""        # join the letters with no separator
)
As the answer from langtang suggests, you can use the apply family for this, which is more efficient. But maybe this answer helps the understanding of what R is actually doing here.
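For completeness, here is a rough sketch of how the single-word steps might sit inside a for-loop over several words. The vector names words and scrambled are made up for the illustration:

```r
words <- c("life", "love", "mystery")

# Pre-allocate the result so the loop fills it in place.
scrambled <- character(length(words))

for (i in seq_along(words)) {
  chars <- strsplit(words[i], "")[[1]]                 # letters of the i-th word
  scrambled[i] <- paste0(sample(chars), collapse = "") # shuffle and rejoin
}

scrambled
```

The result differs on every run unless you call set.seed() first.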

Related

Extracting Headers from a list [duplicate]

I have a character string and want to extract the information inside multiple parentheses. Currently I can extract the information from the last parenthesis with the code below. How would I do it so it extracts from all the parentheses and returns the result as a vector?
j <- "What kind of cheese isn't your cheese? (wonder) Nacho cheese! (groan) (Laugh)"
sub("\\).*", "", sub(".*\\(", "", j))
Current output is:
[1] "Laugh"
Desired output is:
[1] "wonder" "groan" "Laugh"
Here is an example:
> gsub("[\\(\\)]", "", regmatches(j, gregexpr("\\(.*?\\)", j))[[1]])
[1] "wonder" "groan" "Laugh"
I think this should work well:
> regmatches(j, gregexpr("(?=\\().*?(?<=\\))", j, perl=T))[[1]]
[1] "(wonder)" "(groan)" "(Laugh)"
but the result includes the parentheses... why?
This works:
regmatches(j, gregexpr("(?<=\\().*?(?=\\))", j, perl=T))[[1]]
Thanks @MartinMorgan for the comment.
Using the stringr package we can reduce this a little bit.
library(stringr)
# Get the parenthesis and what is inside
k <- str_extract_all(j, "\\([^()]+\\)")[[1]]
# Remove parenthesis
k <- substring(k, 2, nchar(k)-1)
@kohske uses regmatches, but I'm currently using R 2.13 so I don't have access to that function at the moment. This adds a dependency on stringr, but I think it is a little easier to work with and the code is a little clearer (well... as clear as using regular expressions can be...)
Edit: We could also try something like this -
re <- "\\(([^()]+)\\)"
gsub(re, "\\1", str_extract_all(j, re)[[1]])
This one works by defining a marked subexpression inside the regular expression. It extracts everything that matches the regex and then gsub extracts only the portion inside the subexpression.
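The same capture-group idea can be sketched in base R, without stringr. This is only an illustration; the variable names hits and inside are made up here:

```r
j <- "What kind of cheese isn't your cheese? (wonder) Nacho cheese! (groan) (Laugh)"

# One capture group: the text between an opening and a closing parenthesis.
re <- "\\(([^()]+)\\)"

# Step 1: pull out every full match, parentheses included.
hits <- regmatches(j, gregexpr(re, j))[[1]]

# Step 2: replace each full match with the contents of its capture group.
inside <- sub(re, "\\1", hits)
inside
# [1] "wonder" "groan"  "Laugh"
```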
I think there are basically three easy ways of extracting multiple capture groups in R (without using substitution); str_match_all, str_extract_all, and regmatches/gregexpr combo.
I like @kohske's regex, which looks behind for an open parenthesis ?<=\\(, looks ahead for a closing parenthesis ?=\\), and grabs everything in the middle (lazily) .+?, in other words (?<=\\().+?(?=\\))
Using the same regex:
str_match_all returns the answer as a list of matrices (one matrix per input string).
str_match_all(j, "(?<=\\().+?(?=\\))")
[[1]]
     [,1]
[1,] "wonder"
[2,] "groan"
[3,] "Laugh"
# Subset the matrix like this....
str_match_all(j, "(?<=\\().+?(?=\\))")[[1]][,1]
[1] "wonder" "groan" "Laugh"
str_extract_all returns the answer as a list.
str_extract_all(j, "(?<=\\().+?(?=\\))")
[[1]]
[1] "wonder" "groan" "Laugh"
#Subset the list...
str_extract_all(j, "(?<=\\().+?(?=\\))")[[1]]
[1] "wonder" "groan" "Laugh"
regmatches/gregexpr also returns the answer as a list. Since this is a base R option, some people prefer it. Note the recommended perl = TRUE.
regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))
[[1]]
[1] "wonder" "groan" "Laugh"
#Subset the list...
regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))[[1]]
[1] "wonder" "groan" "Laugh"
Hopefully, the SO community will correct/edit this answer if I've mischaracterized the most popular options.
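One reason all three options wrap their result in a list is that the input may contain several strings, each with a different number of matches. A small base R sketch with made-up input strings:

```r
# Hypothetical input: three strings with 2, 0 and 1 parenthesised chunks.
x <- c("a (one) b (two)", "no parentheses here", "(three)")

found <- regmatches(x, gregexpr("(?<=\\().+?(?=\\))", x, perl = TRUE))

# One list element per input string; strings without a match give character(0).
lengths(found)
# [1] 2 0 1

found[[1]]
# [1] "one" "two"
```

Subsetting with [[1]], as in the examples above, is therefore only appropriate when there is a single input string.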
Using rex may make this type of task a little simpler.
library(rex)
matches <- re_matches(j,
  rex(
    "(",
    capture(name = "text", except_any_of(")")),
    ")"),
  global = TRUE)
matches[[1]]$text
#>[1] "wonder" "groan" "Laugh"

Finding number of r's in the vector (Both R and r) before the first u

rquote <- "R's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
In the above code, we need to find the number of r's (both R and r) in rquote.
You could use substrings.
## find position of first 'u'
u1 <- regexpr("u", rquote, fixed = TRUE)
## get count of all 'r' or 'R' before 'u1'
lengths(gregexpr("r", substr(rquote, 1, u1), ignore.case = TRUE))
# [1] 5
This follows what you ask for in the title of the post. If you want the count of all the "r", case insensitive, then simplify the above to
lengths(gregexpr("r", rquote, ignore.case = TRUE))
# [1] 6
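One caveat with the lengths(gregexpr(...)) idiom: when there is no match at all, gregexpr() returns -1, which still has length 1. A minimal sketch of a safer counter follows; count_matches is a hypothetical helper name, not a base R function:

```r
# Hypothetical helper: count regex matches, returning 0 when nothing matches.
count_matches <- function(pattern, x, ...) {
  m <- gregexpr(pattern, x, ...)
  # Real match positions are >= 1; the "no match" marker is -1.
  vapply(m, function(pos) sum(pos > 0L), integer(1))
}

rquote <- "R's internals are irrefutably intriguing"
count_matches("r", rquote, ignore.case = TRUE)
# [1] 6

lengths(gregexpr("r", "abc"))  # the plain idiom reports 1 here
# [1] 1
count_matches("r", "abc")
# [1] 0
```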
Then there's always stringi
library(stringi)
## count before first 'u'
stri_count_regex(stri_sub(rquote, 1, stri_locate_first_regex(rquote, "u")[,1]), "r|R")
# [1] 5
## count all R or r
stri_count_regex(rquote, "r|R")
# [1] 6
To get the number of R's before the first u, you need to make an intermediate step. (You probably don't need to. I'm sure akrun knows some incredibly cool regular expression to get the job done, but it won't be as easy to understand as this).
rquote <- "R's internals are irrefutably intriguing"
before_u <- gsub("u[[:print:]]+$", "", rquote)
length(stringr::str_extract_all(before_u, "(R|r)")[[1]])
You may try this,
> length(str_extract_all(rquote, '[Rr]')[[1]])
[1] 6
To get the count of all r's before the first u
> length(str_extract_all(rquote, perl('u.*(*SKIP)(*F)|[Rr]'))[[1]])
[1] 5
EDIT: Just saw before the first u. In that case, we can get the position of the first 'u' from either which or match.
Then apply grepl with ignore.case=TRUE to the elements of 'chars' (the strsplit output from the OP's code) up to that position (ind) to get a logical index of the r's, and use sum to count them.
ind <- which(chars=='u')[1]
Or
ind <- match('u', chars)
sum(grepl('r', chars[seq(ind)], ignore.case=TRUE))
#[1] 5
Or we can use sub and gsub on the original string ('rquote'). The first call removes the characters starting with the first u until the end of the string (u.*$), and the second matches all characters except R and r ([^Rr]) and replaces them with ''. We can use nchar to get the count of the characters remaining.
nchar(gsub('[^Rr]', '', sub('u.*$', '', rquote)))
#[1] 5
Or if we want to count the 'r' in the entire string, gregexpr to get the position of matching characters from the original string ('rquote') and get the length
length(gregexpr('[rR]', rquote)[[1]])
#[1] 6
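Pulling the character-vector approach together, here is a small sketch of a helper that also copes with strings that contain no 'u' at all. The name count_r_before_u and the choice to count the whole string in that case are assumptions made for this illustration:

```r
# Hypothetical helper: count 'r'/'R' strictly before the first 'u'.
# If there is no 'u', the whole string is counted (an assumption of this sketch).
count_r_before_u <- function(x) {
  chars <- strsplit(x, split = "")[[1]]
  first_u <- match("u", chars)                 # NA when there is no 'u'
  if (is.na(first_u)) first_u <- length(chars) + 1L
  before <- chars[seq_len(first_u - 1L)]       # everything before the first 'u'
  sum(tolower(before) == "r")
}

rquote <- "R's internals are irrefutably intriguing"
count_r_before_u(rquote)
# [1] 5
```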

Consecutive string matching in a sentence using R

I have the names of 7 countries, which are stored like this:
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
Now, I have to find out, using R, whether a given sentence contains these words.
Sometimes the name of a country is hidden in consecutive letters within a sentence.
For example:
You all must pay it bac**k, or ea**ch of you will be in trouble.
If this sentence is passed it should return "korea"
I have tried:
grep('You|all|must|pay|it|back|or|each|of|you|will|be|in|trouble',Random, value = TRUE,ignore.case=TRUE,
fixed = FALSE)
it should return korea
but it's not working. Perhaps I should not use partial matching, but I don't have much knowledge regarding it.
Any help is appreciated.
You can use the handy stringr library for this. First, remove all the punctuation and spaces from your sentence that we want to match.
> library(stringr)
> txt <- "You all must pay it back, or each of you will be in trouble."
> g <- gsub("[^a-z]", "", tolower(txt))
# [1] "youallmustpayitbackoreachofyouwillbeintrouble"
Then we can use str_detect to find the matches.
> Random[str_detect(g, Random)]
# [1] "korea"
Basically you're just looking for a sub-string within a sentence, so collapsing the sentence first seems like a good way to go. Alternatively, you could use str_locate with str_sub to find the relevant sub-strings.
> no <- na.omit(str_locate(g, Random))
> str_sub(g, no[,1], no[,2])
# [1] "korea"
Edit: Here's one more I came up with:
> Random[Vectorize(grepl)(Random, g)]
# [1] "korea"
Using base functions only:
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
Random2=paste(Random,collapse="|") #creating pattern for match
text="bac**k, or ea**ch of you will be in trouble."
text2=gsub("[[:punct:][:space:]]","",text,perl=T) #removing punctuations and space characters
regmatches(text2,gregexpr(Random2,text2))
[[1]]
[1] "korea"
You could use stringi which is faster for these operations
library(stringi)
Random[stri_detect_regex(gsub("[^A-Za-z]", "", txt), Random)]
#[1] "korea"
#data
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
txt <- "You all must pay it back, or each of you will be in trouble."
Try:
Random <- c('norway', 'india', 'china', 'korea', 'france','japan','iran')
txt <- "You all must pay it back, or each of you will be in trouble."
tt <- gsub("[[:punct:]]|\\s+", "", txt)
unlist(sapply(Random, function(r) grep(r, tt)))
korea
1
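The answers above all follow the same two steps: squash the sentence down to its letters, then test each country name as a substring. A minimal base R sketch of that idea as a reusable function (find_hidden is a hypothetical name, and lower-casing the sentence is a choice of this sketch):

```r
# Hypothetical helper: which candidate words are hidden in the sentence?
find_hidden <- function(sentence, candidates) {
  # Keep only letters, lower-cased, so matches can span word boundaries.
  squashed <- gsub("[^a-z]", "", tolower(sentence))
  # fixed = TRUE treats each candidate as a literal substring, not a regex.
  hit <- vapply(candidates, grepl, logical(1), x = squashed, fixed = TRUE,
                USE.NAMES = FALSE)
  candidates[hit]
}

Random <- c('norway', 'india', 'china', 'korea', 'france', 'japan', 'iran')
txt <- "You all must pay it back, or each of you will be in trouble."
find_hidden(txt, Random)
# [1] "korea"
```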

generating words in R

For simulation purposes, I'm trying to figure out how to generate a long finite string of text of English letters with spaces and periods. I've been playing with concatenate and regular expressions, and looking at help files, but being a newbie to R, I'm not making much progress. All I've got so far is x <- sample(c("a", "b", "c", " ", "."), 100, replace=TRUE). Well, that's only three letters. I've been trying [a:z] and things like that, but c() doesn't seem to like that and gives errors.
Then, even if I enumerate every single letter in c(), the sample function returns a character vector with each letter being an element:
str(x)
chr [1:100] "c" "b" "a" "b" ...
but I need the whole string to be just one element in a character vector. An example of a function I'm looking for would generate a text string like "asdf twdjk.fd alw" of any length I want, in this case 17. So if I do a str() on the result, it should give me:
str(x)
chr "asdf twdjk.fd alw"
Thank you in advance for any tips.
R provides all the letters in the alphabet as a built-in constant, so to get the characters to generate from, you can just do:
chars = c(letters, " ", ".")
Then use paste0 to combine the results of your sampling into a single string:
paste0(sample(chars, 100, replace=TRUE), collapse="")
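As a sketch, the two lines above can be wrapped in a small function that takes the desired length. The name random_text and its default alphabet are assumptions for the illustration:

```r
# Hypothetical helper: one random string of n characters drawn from 'chars'.
random_text <- function(n, chars = c(letters, " ", ".")) {
  paste0(sample(chars, n, replace = TRUE), collapse = "")
}

set.seed(1)  # only for reproducibility
x <- random_text(17)

length(x)  # a single element, as asked for in the question
# [1] 1
nchar(x)
# [1] 17
```

Calling str(x) on the result shows a single chr string rather than a vector of 17 one-letter elements.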
Hopefully someone can do better:
Alphabet <- c(LETTERS, " ", ".")
set.seed(123)
x <- sample(Alphabet, 17) # maybe replace=TRUE
x <- as.list(x)
pasteNoSpaces <- function(...) paste(..., sep="") # paste0 is better
do.call("pasteNoSpaces", x)

R: Replacing rownames of data frame by a substring[2]

I have a question about the use of gsub. The rownames of my data share the same partial names. See below:
> rownames(test)
[1] "U2OS.EV.2.7.9" "U2OS.PIM.2.7.9" "U2OS.WDR.2.7.9" "U2OS.MYC.2.7.9"
[5] "U2OS.OBX.2.7.9" "U2OS.EV.18.6.9" "U2O2.PIM.18.6.9" "U2OS.WDR.18.6.9"
[9] "U2OS.MYC.18.6.9" "U2OS.OBX.18.6.9" "X1.U2OS...OBX" "X2.U2OS...MYC"
[13] "X3.U2OS...WDR82" "X4.U2OS...PIM" "X5.U2OS...EV" "exp1.U2OS.EV"
[17] "exp1.U2OS.MYC" "EXP1.U20S..PIM1" "EXP1.U2OS.WDR82" "EXP1.U20S.OBX"
[21] "EXP2.U2OS.EV" "EXP2.U2OS.MYC" "EXP2.U2OS.PIM1" "EXP2.U2OS.WDR82"
[25] "EXP2.U2OS.OBX"
In my previous question, I asked if there is a way to get the same names for the same partial names. See this question: Replacing rownames of data frame by a sub-string
The answer is a very nice solution. The function gsub is used in this way:
transfecties = gsub(".*(MYC|EV|PIM|WDR|OBX).*", "\\1", rownames(test))
Now, I have another problem, the program I run with R (Galaxy) doesn't recognize the | characters. My question is, is there another way to get to the same solution without using this |?
Thanks!
If you don't want to use the "|" character, you can try something like :
Rnames <-
c( "U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9" ,
"U2OS.OBX.2.7.9" , "U2OS.EV.18.6.9" ,"U2O2.PIM.18.6.9" ,"U2OS.WDR.18.6.9" )
Rlevels <- c("MYC","EV","PIM","WDR","OBX")
tmp <- sapply(Rlevels,grepl,Rnames)
apply(tmp,1,function(i)colnames(tmp)[i])
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR"
But I would seriously consider mentioning this to the team of galaxy, as it seems to be rather awkward not to be able to use the symbol for OR...
I wouldn't recommend doing this in general in R as it is far less efficient than the solution @csgillespie provided, but an alternative is to loop over the various strings you want to match and do the replacements on each string separately, i.e. search for "MYC" and replace only in those rownames that match "MYC".
Here is an example using the x data from @csgillespie's answer:
x <- c("U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9",
"U2OS.OBX.2.7.9", "U2OS.EV.18.6.9", "U2O2.PIM.18.6.9","U2OS.WDR.18.6.9",
"U2OS.MYC.18.6.9","U2OS.OBX.18.6.9", "X1.U2OS...OBX","X2.U2OS...MYC")
Copy the data so we have something to compare with later (this just for the example):
x2 <- x
Then create a list of strings you want to match on:
matches <- c("MYC","EV","PIM","WDR","OBX")
Then we loop over the values in matches and do three things (numbered ##X in the code):
1. Create the regular expression by pasting together the current match string i with the other bits of the regular expression we want to use.
2. Using grepl(), return a logical indicator for those elements of x2 that contain the string i.
3. Use the same style of gsub() call as you were already shown, but only on the elements of x2 that matched the string, replacing only those elements.
The loop is:
for(i in matches) {
    rgexp <- paste(".*(", i, ").*", sep = "") ## 1
    ind <- grepl(rgexp, x)                    ## 2
    x2[ind] <- gsub(rgexp, "\\1", x2[ind])    ## 3
}
x2
Which gives:
> x2
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR" "MYC" "OBX" "OBX" "MYC"
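Another possibility, sketched here under the assumption that the tool only rejects the literal bar character typed in the script (this is a guess and is not confirmed in the question), is to build that character from its code point with intToUtf8() and paste the pattern together at run time:

```r
x <- c("U2OS.EV.2.7.9", "U2O2.PIM.18.6.9", "X1.U2OS...OBX", "X2.U2OS...MYC")
targets <- c("MYC", "EV", "PIM", "WDR", "OBX")

# Code point 124 is the vertical bar, so it never appears literally in the script.
bar <- intToUtf8(124L)

# Builds the same alternation pattern as the original gsub() call.
pattern <- paste0(".*(", paste(targets, collapse = bar), ").*")

result <- gsub(pattern, "\\1", x)
result
# [1] "EV"  "PIM" "OBX" "MYC"
```

If the tool filters the character at some other stage, this will not help and the loop above remains the safer route.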

Resources