Splitting merged words (with mini-dictionary) - r

I have a set of words, some of which are merged terms and others that are just simple words. I also have a separate list of words that I want to use as a dictionary to 'un-merge' certain words in the first list.
Here's an example:
ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")
My general procedure would be something like this:
1. Search for a pair of patterns from ListB that together make up a word in ListA, where the merged terms are consecutive (no spare letters in the word). So for example, 'lowerswim' from ListA would match 'lower' and 'swim', not 'owe' and 'swim'.
2. For each selected word, check whether that word itself exists in ListB. If yes, keep it as is in ListA. Otherwise, split the word into the two words matched from ListB.
Does this sound sensible? And if so, how do I implement it in R? Maybe it sounds quite routine but at the moment I'm having trouble with:
- Searching for words inside words. I can match words from lists no problem, but I'm not sure how to use grep or an equivalent to go further than this.
- Declaring that the words must be consecutive. I've been thinking about this for a while, but nothing I have tried has worked.
Can anyone please send me in the right direction?

I think the first step would be to build all the combined pairs from ListB:
pairings <- expand.grid(ListB, ListB)
combos <- apply(pairings, 1, function(x) paste0(x[1], x[2]))
combos
# [1] "dodo" "minedo" "anddo" "thedo" "lowerdo" "owedo" "swimdo"
# [8] "domine" "minemine" "andmine" "themine" "lowermine" "owemine" "swimmine"
# [15] "doand" "mineand" "andand" "theand" "lowerand" "oweand" "swimand"
# [22] "dothe" "minethe" "andthe" "thethe" "lowerthe" "owethe" "swimthe"
# [29] "dolower" "minelower" "andlower" "thelower" "lowerlower" "owelower" "swimlower"
# [36] "doowe" "mineowe" "andowe" "theowe" "lowerowe" "oweowe" "swimowe"
# [43] "doswim" "mineswim" "andswim" "theswim" "lowerswim" "oweswim" "swimswim"
You can use str_extract from the stringr package to extract the element of combos that is contained within each element of ListA, if such an element exists:
library(stringr)
matches <- str_extract(ListA, paste(combos, collapse="|"))
matches
# [1] NA "andthe" "lowerswim" NA NA
Finally, you want to split the words in ListA that matched a pair of elements from ListB, unless this word is already in ListB. I suppose there are lots of ways to do this, but I'll use lapply and unlist:
newA <- unlist(lapply(seq_along(ListA), function(idx) {
if (is.na(matches[idx]) | ListA[idx] %in% ListB) {
return(ListA[idx])
} else {
return(as.vector(as.matrix(pairings[combos == matches[idx],])))
}
}))
newA
# [1] "dopamine" "and" "the" "lower" "swim" "other" "different"
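For what it's worth, the whole-word requirement from the question ("no spare letters") can also be checked without building every pair, by trying each split point and looking both halves up in ListB. This is only a minimal base R sketch, not part of the answer above; split_merged is a hypothetical helper name, and it returns the first valid split it finds.

```r
ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")

# split_merged() is a hypothetical helper, not from any package.
# It keeps a word unchanged if it is already in the dictionary,
# otherwise it tries every split point and returns the first split
# whose left and right halves are both dictionary words.
split_merged <- function(word, dict) {
  if (word %in% dict || nchar(word) < 2) return(word)
  for (k in seq_len(nchar(word) - 1)) {
    left  <- substr(word, 1, k)
    right <- substr(word, k + 1, nchar(word))
    if (left %in% dict && right %in% dict) return(c(left, right))
  }
  word  # no valid split found: leave the word as it is
}

newA <- unlist(lapply(ListA, split_merged, dict = ListB))
newA
```

Note that, unlike str_extract with an unanchored pattern, this only splits a word when the two dictionary entries account for every letter in it.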

Related

How to randomly reshuffle letters in words

I am trying to make a word scrambler in R. So I have put some words in a collection and tried to use strsplit() to split the letters of each word in the collection.
But I don't understand how to jumble the letters of a word and merge them back into one word in R. Does anyone know how I can solve this?
This is what I have done
Once you've split the words, you can use sample() to rescramble the letters, and then paste0() with collapse="", to concatenate back into a 'word'
lapply(words, function(x) paste0(sample(strsplit(x, split="")[[1]]), collapse=""))
You can use the stringi package if you want:
> stringi::stri_rand_shuffle(c("hello", "goodbye"))
[1] "oellh" "deoygob"
Here's a one-liner:
lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = "")
[[1]]
[1] "elfi"
[[2]]
[1] "vleo"
[[3]]
[1] "rmsyyet"
Use unlist to get rid of the list:
unlist(lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = ""))
Data:
strings <- c("life", "love", "mystery")
You can use the sample function for this.
Here is an example of doing it for a single word. You can use this within your for-loop:
yourword <- "hello"
# split: Split will return a list with one char vector in it.
# We only want to interact with the vector not the list, so we extract the first
# (and only) element with "[[1]]"
jumble <- strsplit(yourword,"")[[1]]
jumble <- sample(jumble, # sample random element from jumble
size = length(jumble), # as many times as the length of jumble
# ergo all Letters
replace = FALSE # do not sample an element multiple times
)
restored <- paste0(jumble,
collapse = "" # join the letters back into one string
)
As the answer from langtang suggests, you can use the apply family for this, which is more efficient. But maybe this answer helps you understand what R is actually doing here.
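Putting the pieces from these answers together, a small wrapper might look like the sketch below. shuffle_word is a hypothetical helper name, and set.seed() is only there so the run is repeatable; the actual letter order depends on the random number generator, so no output is shown.

```r
# shuffle_word() is a hypothetical helper: split one word into letters,
# permute them with sample(), and glue them back together.
shuffle_word <- function(w) {
  paste(sample(strsplit(w, "")[[1]]), collapse = "")
}

words <- c("life", "love", "mystery")

set.seed(1)  # optional, for reproducibility only
# vapply() returns a plain character vector, so no unlist() is needed
scrambled <- vapply(words, shuffle_word, character(1), USE.NAMES = FALSE)
scrambled
```

Each result has the same letters as its input word, just in a random order (which can occasionally be the original order).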

See which vector in a list is contained within a vector from another list (finding people's name matches)

I have one list of vectors of people's names, where each vector just has the first and last name and I have another list of vectors, where each vector has the first, middle, last names. I need to match the two lists to find people who are included in both lists. Because the names are not in order (some vectors have the first name as the first value, while others have the last name as the first value), I would like to match the two vectors by finding which vector in the second list (full name) contains all the values of a vector in the first list (first and last names only).
What I have done so far:
#reproducible example
first_last_names_list <- list(c("boy", "boy"),
c("bob", "orengo"),
c("kalonzo", "musyoka"),
c("anami", "lisamula"))
full_names_list <- list(c("boy", "juma", "boy"),
c("stephen", "kalonzo", "musyoka"),
c("james", "bob", "orengo"),
c("lisamula", "silverse", "anami"))
First, I tried to make a function that checks whether one vector is contained in another vector (heavily based on the code from here).
my_contain <- function(values,x){
tx <- table(x)
tv <- table(values)
z <- tv[names(tx)] - tx
if(all(z >= 0 & !is.na(z))){
paste(x, collapse = " ")
}
}
#value would be the longer vector (from full_name_list)
#and x would be the shorter vector(from first_last_name_list)
Then, I tried to put this function within sapply() so that I can work with lists and that's where I got stuck. I can get it to see whether one vector is contained within a list of vectors, but I'm not sure how to check all the vectors in one list and see if it is contained within any of the vectors from a second list.
#testing with the first vector from first_last_names_list.
#Need to make it run through all the vectors from first_last_names_list.
sapply(1:length(full_names_list),
function(i) any(my_contain(full_names_list[[i]],
first_last_names_list[[1]]) ==
paste(first_last_names_list[[1]], collapse = " ")))
#[1] TRUE FALSE FALSE FALSE
Lastly- although it might be too much to ask in one question- if anyone could give me any pointers on how to incorporate agrep() for fuzzy matching to account for typos in the names, that would be great! If not, that's okay too, since I want to get at least the matching part right first.
Since you are dealing with lists, it is easier to collapse each vector into a single string so that you can use regular expressions. If you first sort the names within each vector, their original order no longer matters and you can match them easily:
lst=sapply(first_last_names_list,function(x)paste0(sort(x),collapse=" "))
lst1=gsub("\\s|$",".*",lst)
lst2=sapply(full_names_list,function(x)paste(sort(x),collapse=" "))
(lst3 = Vectorize(grep)(lst1,list(lst2),value=T,ignore.case=T))
boy.*boy.* bob.*orengo.* kalonzo.*musyoka.* anami.*lisamula.*
"boy boy juma" "bob james orengo" "kalonzo musyoka stephen" "anami lisamula silverse"
Now if you want to link first_last_names_list and full_names_list then:
setNames(full_names_list[ match(lst3,lst2)],sapply(first_last_names_list[grep(paste0(names(lst3),collapse = "|"),lst1)],paste,collapse=" "))
$`boy boy`
[1] "boy" "juma" "boy"
$`bob orengo`
[1] "james" "bob" "orengo"
$`kalonzo musyoka`
[1] "stephen" "kalonzo" "musyoka"
$`anami lisamula`
[1] "lisamula" "silverse" "anami"
where the names are from first_last_names_list and the elements are from full_names_list. It is generally easier to work with character vectors than with lists.
Edit: I've modified the solution to satisfy the constraint that a repeated name such as 'John John' should not match against 'John Smith'.
apply(sapply(first_last_names_list, unlist), 2, function(x){
any(sapply(full_names_list, function(y) sum(unlist(y) %in% x) >= length(x)))
})
This solution still uses %in% and the apply functions, but it now does a kind of reverse search: for every element of first_last_names_list it looks at how many words in each name within full_names_list are matched. If this number is greater than or equal to the number of words in the first_last_names_list item under consideration (always 2 words in your examples, but the code will work for any number), it returns TRUE. This logical array is then aggregated with any() to pass back a single vector showing whether each first/last name is matched to any full name.
So for example, 'John John' would not be matched to 'John Smith Random', as only 1 of the 3 words in 'John Smith Random' are matched. However, it would be matched to 'John Adam John', as 2 of the 3 words in 'John Adam John' are matched, and 2 is equal to the length of 'John John'. It would also match to 'John John John John John' as 5 of the 5 words match, which is greater than 2.
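One case the count in the code above does not distinguish is a repeated name inside the full name: sum(y %in% x) counts every word of y that appears anywhere in x, so c("bob", "bob", "smith") would still score 2 against c("bob", "orengo"). If that matters for your data, comparing per-name counts is one way around it. A minimal sketch, assuming exact (case-sensitive) matches; contains_all is a hypothetical helper name:

```r
first_last_names_list <- list(c("boy", "boy"),
                              c("bob", "orengo"),
                              c("kalonzo", "musyoka"),
                              c("anami", "lisamula"))
full_names_list <- list(c("boy", "juma", "boy"),
                        c("stephen", "kalonzo", "musyoka"),
                        c("james", "bob", "orengo"),
                        c("lisamula", "silverse", "anami"))

# contains_all() is a hypothetical helper: TRUE when every name in
# `short` occurs in `full` at least as many times as it does in `short`.
contains_all <- function(short, full) {
  need <- table(short)
  have <- table(factor(full, levels = names(need)))
  all(as.vector(have) >= as.vector(need))
}

# For each first/last pair, is it contained in any full name?
matched <- vapply(first_last_names_list, function(s) {
  any(vapply(full_names_list, function(f) contains_all(s, f), logical(1)))
}, logical(1))
matched
```

With the example data every pair finds a full name, so matched is TRUE for all four entries.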
Instead of my_contain, try
x %in% values
Maybe also unlist and work with data frames? Not sure if you considered it--might make things easier:
# unlist to vectors
fl <- unlist(first_last_names_list)
fn <- unlist(full_names_list)
# grab individual names and convert to dfs;
# assumptions: first_last_names_list only contains 2-element vectors
# full_names_list only contains 3-element vectors
first_last_df <- data.frame(first_fl=fl[c(T, F)],last_fl=fl[c(F, T)])
full_name_df <- data.frame(first_fn=fn[c(T,F,F)],mid_fn=fn[c(F,T,F)],last_fn=fn[c(F,F,T)])
Or you could do this:
first_last_names_list <- list(c("boy", "boy"),
c("bob", "orengo"),
c("kalonzo", "musyoka"),
c("anami", "lisamula"))
full_names_list <- list(c("boy", "juma", "boy"),
c("stephen", "kalonzo", "musyoka"),
c("james", "bob", "orengo"),
c("lisamula", "silverse", "anami"),
c("musyoka", "jeremy", "kalonzo")) # added just to test
# create copies of full_names_list without middle name;
# one list with matching name order, one with inverted order
full_names_short <- lapply(full_names_list,function(x){x[c(1,3)]})
full_names_inv <- lapply(full_names_list,function(x){x[c(3,1)]})
# check if names in full_names_list match either
full_names_list[full_names_short %in% first_last_names_list | full_names_inv %in% first_last_names_list]
In this case %in% does exactly what you want it to do, it checks if the complete name vector matches.
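On the fuzzy-matching part of the question: none of the answers above cover it, so here is only a rough sketch. Instead of agrep() it uses adist() from the utils package, which gives the edit distance between whole strings. fuzzy_contains and the max_dist threshold of 1 are assumptions made for illustration, and the threshold would need tuning, because short names such as "boy" and "bob" are already only one edit apart.

```r
# fuzzy_contains() is a hypothetical helper: TRUE when every name in
# `short` is within `max_dist` edits of at least one name in `full`.
fuzzy_contains <- function(short, full, max_dist = 1) {
  d <- utils::adist(short, full)      # rows: short names, cols: full names
  all(apply(d <= max_dist, 1, any))   # each short name needs some close match
}

# "kalonso" is a deliberate typo of "kalonzo"
fuzzy_contains(c("kalonzo", "musyoka"), c("stephen", "kalonso", "musyoka"))
fuzzy_contains(c("anami", "lisamula"), c("stephen", "kalonzo", "musyoka"))
```

The first call tolerates the single-letter typo; the second finds no close names at all. Note that this sketch does not enforce one-to-one matching, so it ignores how often a name is repeated.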

R programming : select element from split string based on value in another column

I have a data frame having one column of words, with syllables separated by hyphens. I want to extract the nth syllable, where n is given in another column. Like this:
word <- c("to-ma-to", "cheese", "ta-co")
whichSyl <- c(2, 1, 1)
mydf <- data.frame(word, whichSyl)
mydf$word <- as.character(mydf$word)
desired: a vector containing
ma
cheese
ta
If this were, say, awk, I would just do
'{split($1,a,"-"); print a[$2]}'
The words don't always have the same number of syllables.
It seems likely that there is a straightforward way to do this, but I'm not seeing it. Thanks
You can use mapply and strsplit to get,
mapply('[', strsplit(mydf$word, '-'), whichSyl)
#[1] "ma" "cheese" "ta"
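To spell out what that one-liner does, here is the same idea with the intermediate step named. parts and picked are just illustrative variable names; the data are the ones from the question.

```r
word <- c("to-ma-to", "cheese", "ta-co")
whichSyl <- c(2, 1, 1)

# One character vector of syllables per word
parts <- strsplit(word, "-", fixed = TRUE)

# Walk over both inputs in parallel and take the n-th syllable of each word
picked <- mapply(function(p, n) p[n], parts, whichSyl)
picked
# [1] "ma"     "cheese" "ta"
```

If whichSyl asks for a syllable that a word does not have, the indexing simply yields NA for that word rather than an error.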
Here I wrote a function that does one row at a time, and then uses lapply() to iterate over all rows and do.call(rbind()) to bind all of those responses together.
getSyl <- function(i){
strsplit(mydf$word[i], '-')[[1]][mydf$whichSyl[i]]
}
do.call(rbind, lapply(1:nrow(mydf), getSyl))
[,1]
[1,] "ma"
[2,] "cheese"
[3,] "ta"
We can use read.table and row/column indexing
read.table(text=mydf$word, sep="-", header=FALSE,
fill=TRUE)[cbind(1:nrow(mydf), mydf$whichSyl)]
#[1] "ma" "cheese" "ta"

need to flatten list to use intersect in R

I have fullname data that I have used strsplit() to get each element of the name.
# Dataframe with a `names` column (complete names)
df <- data.frame(
names =
c("Adam, R, Goldberg, MALS, MBA",
"Adam, R, Goldberg, MEd",
"Adam, S, Metsch, MBA",
"Alan, Haas, MSW",
"Alexandra, Dumas, Rhodes, MA",
"Alexandra, Ruttenberg, PhD, MBA"),
stringsAsFactors=FALSE)
# Add a column with the split names (it is actually a list)
df$splitnames <- strsplit(df$names, ', ')
I also have a list of degrees below
degrees<-c("EdS","DEd","MEd","JD","MS","MA","PhD","MSPH","MSW","MSSA","MBA",
"MALS","Esq","MSEd","MFA","MPA","EdM","BSEd")
I would like to get the intersection for each name and respective degrees.
I'm not sure how to flatten the name list so I can compare the two vectors using intersect. When I tried unlist(df$splitname,recursive=F) it returned each element separately. Any help is appreciated.
Try
df$intersect <- lapply(X=df$splitnames, FUN=intersect, y=degrees)
That will give you a list of the intersection of each element in df$splitnames (e.g. intersect(df$splitnames[[1]], degrees)). If you want it as a vector:
sapply(X=df$intersect, FUN=paste, collapse=', ')
I assume you need it as a vector, since possibly the complete names came from one (for instance, from a dataframe), but strsplit outputs a list.
Does that work? If not, please try to clarify your intention.
Good luck!
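As a small self-contained illustration of the per-name intersect, here is a sketch on a cut-down version of the data. names_vec and the shortened degrees vector are made up for the example, and the split pattern ",\\s*" is an assumption that also absorbs any spaces after the commas.

```r
names_vec <- c("Adam, R, Goldberg, MALS, MBA", "Alan, Haas, MSW")
degrees <- c("MEd", "MBA", "MALS", "MSW")

# Split each full name into its comma-separated parts
parts <- strsplit(names_vec, ",\\s*")

# intersect() is applied to each name separately, so the list is never flattened
held <- lapply(parts, intersect, y = degrees)
held

# Collapse to one string per person if a plain vector is preferred
vapply(held, paste, character(1), collapse = ", ")
```

Here held[[1]] is c("MALS", "MBA") and held[[2]] is "MSW"; the order follows the order in which the degrees appear in each name.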
For completeness, you can also flatten the list with unlist:
hh <- unlist(df$splitnames)
intersect(hh,degrees)
For example (note the trailing space in "MBA ", which stops it from matching):
ll <- list(c("Adam" , "R" , "Goldberg" ,"MALS" , "MBA "),
c("Adam" , "R" , "Goldberg", "MEd" ))
hh <- unlist(ll)
intersect(hh,degrees)
[1] "MALS" "MEd"
which here gives the same result as (intersect also drops duplicates):
hh[hh %in% degrees]
[1] "MALS" "MEd"
To get differences you can use
setdiff(hh,degrees)
[1] "Adam" "R" "Goldberg" "MBA "
...

R: Replacing rownames of data frame by a substring[2]

I have a question about the use of gsub. The rownames of my data share the same partial names. See below:
> rownames(test)
[1] "U2OS.EV.2.7.9" "U2OS.PIM.2.7.9" "U2OS.WDR.2.7.9" "U2OS.MYC.2.7.9"
[5] "U2OS.OBX.2.7.9" "U2OS.EV.18.6.9" "U2O2.PIM.18.6.9" "U2OS.WDR.18.6.9"
[9] "U2OS.MYC.18.6.9" "U2OS.OBX.18.6.9" "X1.U2OS...OBX" "X2.U2OS...MYC"
[13] "X3.U2OS...WDR82" "X4.U2OS...PIM" "X5.U2OS...EV" "exp1.U2OS.EV"
[17] "exp1.U2OS.MYC" "EXP1.U20S..PIM1" "EXP1.U2OS.WDR82" "EXP1.U20S.OBX"
[21] "EXP2.U2OS.EV" "EXP2.U2OS.MYC" "EXP2.U2OS.PIM1" "EXP2.U2OS.WDR82"
[25] "EXP2.U2OS.OBX"
In my previous question, I asked if there is a way to get the same names for the same partial names. See this question: Replacing rownames of data frame by a sub-string
The answer is a very nice solution. The function gsub is used in this way:
transfecties = gsub(".*(MYC|EV|PIM|WDR|OBX).*", "\\1", rownames(test))
Now, I have another problem, the program I run with R (Galaxy) doesn't recognize the | characters. My question is, is there another way to get to the same solution without using this |?
Thanks!
If you don't want to use the "|" character, you can try something like :
Rnames <-
c( "U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9" ,
"U2OS.OBX.2.7.9" , "U2OS.EV.18.6.9" ,"U2O2.PIM.18.6.9" ,"U2OS.WDR.18.6.9" )
Rlevels <- c("MYC","EV","PIM","WDR","OBX")
tmp <- sapply(Rlevels,grepl,Rnames)
apply(tmp,1,function(i)colnames(tmp)[i])
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR"
But I would seriously consider mentioning this to the team of galaxy, as it seems to be rather awkward not to be able to use the symbol for OR...
I wouldn't recommend doing this in general in R as it is far less efficient than the solution @csgillespie provided, but an alternative is to loop over the various strings you want to match and do the replacements on each string separately, i.e. search for "MYC" and replace only in those rownames that match "MYC".
Here is an example using the x data from @csgillespie's answer:
x <- c("U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9",
"U2OS.OBX.2.7.9", "U2OS.EV.18.6.9", "U2O2.PIM.18.6.9","U2OS.WDR.18.6.9",
"U2OS.MYC.18.6.9","U2OS.OBX.18.6.9", "X1.U2OS...OBX","X2.U2OS...MYC")
Copy the data so we have something to compare with later (this is just for the example):
x2 <- x
Then create a list of strings you want to match on:
matches <- c("MYC","EV","PIM","WDR","OBX")
Then we loop over the values in matches and do three things (numbered ##X in the code):
1. Create the regular expression by pasting together the current match string i with the other bits of the regular expression we want to use.
2. Using grepl(), return a logical indicator for those elements of x2 that contain the string i.
3. Use the same style of gsub() call as you were already shown, but only on the elements of x2 that matched the string, and replace only those elements.
The loop is:
for(i in matches) {
rgexp <- paste(".*(", i, ").*", sep = "") ## 1
ind <- grepl(rgexp, x) ## 2
x2[ind] <- gsub(rgexp, "\\1", x2[ind]) ## 3
}
x2
Which gives:
> x2
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR" "MYC" "OBX" "OBX" "MYC"
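If regular expressions are a problem altogether in that environment, the same loop can be written with fixed-string matching and direct assignment of the label. This is only a sketch on a handful of the row names; labels and out are illustrative names, and it assumes each row name contains at most one of the labels (the first label found wins otherwise).

```r
x <- c("U2OS.EV.2.7.9", "U2O2.PIM.18.6.9", "X2.U2OS...MYC", "EXP2.U2OS.WDR82")
labels <- c("MYC", "EV", "PIM", "WDR", "OBX")

out <- rep(NA_character_, length(x))
for (lab in labels) {
  hit <- grepl(lab, x, fixed = TRUE)  # plain substring test, no regex, no "|"
  out[hit & is.na(out)] <- lab        # label rows that are not labelled yet
}
out
# [1] "EV"  "PIM" "MYC" "WDR"
```

Row names that contain none of the labels stay NA, which makes them easy to spot.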

Resources