How to count unique values in strings - r

I am trying to find common words that contain all 5 distinct vowels (i.e. "aeiuo") within a single word, regardless of repetition.
I tried this:
library(tidyverse)
x<-c("appropriate","associate","available","colleague","experience","encourage","encouragi","associetu")
x[str_count(x,"[aeiuo]")>4]
Note that the words "encouragi" and "associetu" are made up and were included only to verify my intended answer.
The results I am getting are the following:
[1] "appropriate" "associate"
[3] "available" "colleague"
[5] "experience" "encourage"
[7] "encouragi" "associetu"
While I wanted to get only:
"encouragi" "associetu" which fulfill the criteria of having 5 distinct vowels (i.e: "aeiuo").
Is there any function that works like a string_count_unique? If yes, which one? If not, what other function would you recommend so that I can meet these criteria?
Thank you in advance for your help!

One option could be:
x[lengths(lapply(str_extract_all(x, "a|e|i|u|o"), unique)) == 5]
[1] "encouragi" "associetu"

Maybe strsplit could help you
> x[sapply(strsplit(x,""),function(v) sum(unique(v)%in%c("a","e","i","o","u"))>4) ]
[1] "encouragi" "associetu"

Here's a way to do it using strsplit and setdiff. We loop over each string using sapply, we split each string into its letters, then we check if all vowels are present in the vector resulting from strsplit. If the length of the setdiff is greater than 0, one or more vowels are not present in the string.
keep <- sapply(x, FUN = function(x){
length(setdiff(c("a", "e", "i", "o", "u"), el(strsplit(x, "")))) == 0
})
x[keep]
# [1] "encouragi" "associetu"

The problem with your code is that it counts the total number of vowel characters in each word, so a repeated vowel is counted every time it appears, and it checks whether that total is >4. What you want is to check that the count of a is >0 AND that the count of e is >0 and so on. So you could check the following:
x[str_count(x,"[a]")>0 & str_count(x,"[e]")>0 & str_count(x,"[i]")>0 & str_count(x,"[o]")>0 & str_count(x,"[u]")>0]
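As far as I know stringr has no ready-made string_count_unique, but the idea is easy to wrap up. Here is a minimal base R sketch; the helper name count_unique_chars is made up for illustration, and it assumes lowercase input and single-character targets.

```r
# Hypothetical helper (not part of stringr): for each string, count how many
# of the characters in `chars` occur in it at least once.
count_unique_chars <- function(strings, chars = c("a", "e", "i", "o", "u")) {
  vapply(strsplit(strings, ""),
         function(word_chars) sum(chars %in% word_chars),
         integer(1))
}

x <- c("appropriate", "associate", "available", "colleague",
       "experience", "encourage", "encouragi", "associetu")

# Keep only the words in which all five vowels are present
result <- x[count_unique_chars(x) == 5]
```

Because each vowel is tested for presence rather than counted, repeated vowels no longer inflate the result.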

Related

How to check if a character vector contains a string

I'm very new to R, just got RStudio last week, so this might be a dumb question, but I think I'm getting contradictory statements about whether or not my string "rs2418691" is in my vector rsIDcolumn. When I use the %in% operator it says no, but using the which function does give me a position for it in the vector:
> "rs2418691" %in% rsIDcolumn
[1] FALSE
> which(rsIDcolumn == "rs2418691")
[1] 137853
Does anyone know what's going on please? Thank you!
I think you are referring to a dataframe column. If you have a dataframe called df, which has a column named rsIDcolumn, you can check if a string is inside of it by doing:
"rs2418691" %in% df$rsIDcolumn
Just summing up what is in the comment from @Adamm:
x <- data.frame(a=c("b", "c"))
"c" %in% x
#[1] FALSE
which(x == "c")
#[1] 2
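A short sketch of why the two calls disagree, assuming rsIDcolumn really is a data frame rather than a plain character vector (the question does not show its class). It reuses the toy data frame x from above.

```r
x <- data.frame(a = c("b", "c"))

# %in% treats the data frame as a list of columns, so "c" is compared
# against each column as a whole rather than against the individual values.
in_df <- "c" %in% x

# Selecting the column first compares against the values themselves.
in_col <- "c" %in% x$a

# == is applied element-wise, which is why which() still finds a position.
pos <- which(x$a == "c")
```

Checking class(rsIDcolumn) or str(rsIDcolumn) should show quickly which case applies.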

Extracting a certain substring (email address)

I'm attempting to pull a certain substring from a variable that looks like this:
v1 <- c("Persons Name <personsemail#email.com>","person 2 <person2#email.com>")
(this variable has hundreds of observations)
I want to eventually make a second variable that pulls their email to give this output:
v2 <- c("personsemail#email.com", "person2#email.com")
How would I do this? Is there a certain package I can use? Or do I need to make a function incorporating grep and substr?
Those look like what R might call a "person". There is an as.person() function that can split out the email address. For example
v1 <- c("Persons Name <personsemail#email.com>","person 2 <person2#email.com>")
unlist(as.person(v1)$email)
# [1] "personsemail#email.com" "person2#email.com"
For more information, see the ?person help page.
One option with str_extract from stringr
library(stringr)
str_extract(v1, "(?<=\\<)[^>]+")
#[1] "personsemail#email.com" "person2#email.com"
You can look for the pattern "anything**, then <, then (anything), then >, then anything" and replace that pattern with the part between the parentheses, indicated by \1 (and an extra \ to escape).
sub('.*<(.*)>.*', '\\1', v1)
# [1] "personsemail#email.com" "person2#email.com"
** "anything" actually means anything but line breaks
You can look for a pattern that looks like an email address using regexpr. If a match is found, extract the relevant part using substring. The starting position and match length are provided by regexpr, which returns -1 when there is no match.
inds = regexpr(pattern = "<(.*#.*\\..*)>", v1)
ifelse(inds > 0,
       substring(v1, inds + 1, inds + attr(inds, "match.length") - 2),
       NA)
#[1] "personsemail#email.com" "person2#email.com"
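Another base R variant, as a sketch: regexpr with a Perl-style lookbehind and lookahead, combined with regmatches. The third element of v1 is added here only to illustrate the no-match case and is not from the question.

```r
v1 <- c("Persons Name <personsemail#email.com>",
        "person 2 <person2#email.com>",
        "no address here")  # extra element, assumed for illustration

# Match everything between "<" and ">" without including the brackets
m <- regexpr("(?<=<)[^>]+(?=>)", v1, perl = TRUE)

# regmatches() returns only the matched pieces, so fill them into an
# NA vector at the positions where a match was found
v2 <- rep(NA_character_, length(v1))
v2[m > 0] <- regmatches(v1, m)
```

Elements without angle brackets come back as NA instead of being silently dropped.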

Finding number of r's in the vector (Both R and r) before the first u

rquote <- "R's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
In the above code we need to find the number of r's (both R and r) in rquote before the first u.
You could use substrings.
## find position of first 'u'
u1 <- regexpr("u", rquote, fixed = TRUE)
## get count of all 'r' or 'R' before 'u1'
lengths(gregexpr("r", substr(rquote, 1, u1), ignore.case = TRUE))
# [1] 5
This follows what you ask for in the title of the post. If you want the count of all the "r", case insensitive, then simplify the above to
lengths(gregexpr("r", rquote, ignore.case = TRUE))
# [1] 6
Then there's always stringi
library(stringi)
## count before first 'u'
stri_count_regex(stri_sub(rquote, 1, stri_locate_first_regex(rquote, "u")[,1]), "r|R")
# [1] 5
## count all R or r
stri_count_regex(rquote, "r|R")
# [1] 6
To get the number of R's before the first u, you need to make an intermediate step. (You probably don't need to. I'm sure akrun knows some incredibly cool regular expression to get the job done, but it won't be as easy to understand as this).
rquote <- "R's internals are irrefutably intriguing"
before_u <- gsub("u[[:print:]]+$", "", rquote)
length(stringr::str_extract_all(before_u, "(R|r)")[[1]])
You may try this,
> length(str_extract_all(rquote, '[Rr]')[[1]])
[1] 6
To get the count of all r's before the first u
> length(str_extract_all(rquote, perl('u.*(*SKIP)(*F)|[Rr]'))[[1]])
[1] 5
EDIT: I just noticed the "before the first u" part. In that case, we can get the position of the first 'u' with either which or match.
Then, working on the strsplit output from the OP's code, use grepl with ignore.case=TRUE on 'chars' up to that position (ind) to get a logical index of the r's, and sum it.
ind <- which(chars=='u')[1]
Or
ind <- match('u', chars)
sum(grepl('r', chars[seq(ind)], ignore.case=TRUE))
#[1] 5
Or we can use a sub and a gsub on the original string ('rquote'). The sub removes the characters from the first u until the end of the string (u.*$) and the gsub matches all characters except R, r ([^Rr]) and replaces them with ''. We can use nchar to get the count of the characters remaining.
nchar(gsub('[^Rr]', '', sub('u.*$', '', rquote)))
#[1] 5
Or, if we want to count the 'r' in the entire string, we can use gregexpr to get the positions of the matching characters in the original string ('rquote') and take the length
length(gregexpr('[rR]', rquote)[[1]])
#[1] 6
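The match-based approach can be wrapped into a small function, as a sketch. The name count_before and its arguments are made up for illustration; it also covers strings that contain no 'u', where it counts over the whole string.

```r
# Hypothetical helper: count occurrences of `target` (case-insensitive)
# that appear before the first `stop_at` character in the string `s`.
count_before <- function(s, target = "r", stop_at = "u") {
  chars <- strsplit(s, "")[[1]]
  stop_pos <- match(stop_at, chars)           # NA if `stop_at` never occurs
  if (is.na(stop_pos)) stop_pos <- length(chars) + 1L
  before <- chars[seq_len(stop_pos - 1L)]     # characters before the stop
  sum(tolower(before) == tolower(target))
}

rquote <- "R's internals are irrefutably intriguing"
n_before_u <- count_before(rquote)
```

Using seq_len avoids the seq(ind) pitfall when the stop character is missing or comes first.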

Splitting merged words (with mini-dictionary)

I have a set of words: some of which are merged terms, and others that are just simple words. I also have a separate list of words that I am going to use to compare with my first list (as a dictionary) in order to 'un-merge' certain words.
Here's an example:
ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")
My general procedure would be something like this:
1. Search for two patterns from ListB that together make up a word in ListA, where the merged terms are consecutive (no spare letters in the word). So for example, from ListA 'lowerswim' would match with 'lower' and 'swim', not 'owe' and 'swim'.
2. For each selected word, check if that word exists in ListB. If yes, then keep it in ListA. Otherwise, split the word into the two words matched with words from ListB.
Does this sound sensible? And if so, how do I implement it in R? Maybe it sounds quite routine but at the moment I'm having trouble with:
Searching for words inside words. I can match words between lists no problem, but I'm not sure how to use grep or an equivalent to go further than this.
Declaring that the words must be consecutive. I've been thinking about this for a while but I can't seem to come up with anything that works.
Can anyone please send me in the right direction?
I think the first step would be to build all the combined pairs from ListB:
pairings <- expand.grid(ListB, ListB)
combos <- apply(pairings, 1, function(x) paste0(x[1], x[2]))
combos
# [1] "dodo" "minedo" "anddo" "thedo" "lowerdo" "owedo" "swimdo"
# [8] "domine" "minemine" "andmine" "themine" "lowermine" "owemine" "swimmine"
# [15] "doand" "mineand" "andand" "theand" "lowerand" "oweand" "swimand"
# [22] "dothe" "minethe" "andthe" "thethe" "lowerthe" "owethe" "swimthe"
# [29] "dolower" "minelower" "andlower" "thelower" "lowerlower" "owelower" "swimlower"
# [36] "doowe" "mineowe" "andowe" "theowe" "lowerowe" "oweowe" "swimowe"
# [43] "doswim" "mineswim" "andswim" "theswim" "lowerswim" "oweswim" "swimswim"
You can use str_extract from the stringr package to extract the element of combos that is contained within each element of ListA, if such an element exists:
library(stringr)
matches <- str_extract(ListA, paste(combos, collapse="|"))
matches
# [1] NA "andthe" "lowerswim" NA NA
Finally, you want to split the words in ListA that matched a pair of elements from ListB, unless this word is already in ListB. I suppose there are lots of ways to do this, but I'll use lapply and unlist:
newA <- unlist(lapply(seq_along(ListA), function(idx) {
if (is.na(matches[idx]) | ListA[idx] %in% ListB) {
return(ListA[idx])
} else {
return(as.vector(as.matrix(pairings[combos == matches[idx],])))
}
}))
newA
# [1] "dopamine" "and" "the" "lower" "swim" "other" "different"
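If the "no spare letters" rule should be strict, one possible variation is to compare whole words with match instead of searching for substrings with str_extract. A base R sketch; the column names first and second are chosen here for illustration.

```r
ListA <- c("dopamine", "andthe", "lowerswim", "other", "different")
ListB <- c("do", "mine", "and", "the", "lower", "owe", "swim")

# All ordered pairs of ListB words, kept as character rather than factor
pairings <- expand.grid(first = ListB, second = ListB,
                        stringsAsFactors = FALSE)
combos <- paste0(pairings$first, pairings$second)

# Exact match: NA unless the whole word is the concatenation of one pair
idx <- match(ListA, combos)

newA <- unlist(lapply(seq_along(ListA), function(i) {
  if (is.na(idx[i]) || ListA[i] %in% ListB) {
    ListA[i]                                         # keep the word as is
  } else {
    c(pairings$first[idx[i]], pairings$second[idx[i]])  # split into the pair
  }
}))
```

With this version a word such as "xandthe" would be left alone, whereas the substring search would still find "andthe" inside it.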

R: Replacing rownames of data frame by a substring[2]

I have a question about the use of gsub. The rownames of my data, have the same partial names. See below:
> rownames(test)
[1] "U2OS.EV.2.7.9" "U2OS.PIM.2.7.9" "U2OS.WDR.2.7.9" "U2OS.MYC.2.7.9"
[5] "U2OS.OBX.2.7.9" "U2OS.EV.18.6.9" "U2O2.PIM.18.6.9" "U2OS.WDR.18.6.9"
[9] "U2OS.MYC.18.6.9" "U2OS.OBX.18.6.9" "X1.U2OS...OBX" "X2.U2OS...MYC"
[13] "X3.U2OS...WDR82" "X4.U2OS...PIM" "X5.U2OS...EV" "exp1.U2OS.EV"
[17] "exp1.U2OS.MYC" "EXP1.U20S..PIM1" "EXP1.U2OS.WDR82" "EXP1.U20S.OBX"
[21] "EXP2.U2OS.EV" "EXP2.U2OS.MYC" "EXP2.U2OS.PIM1" "EXP2.U2OS.WDR82"
[25] "EXP2.U2OS.OBX"
In my previous question, I asked if there is a way to get the same names for the same partial names. See this question: Replacing rownames of data frame by a sub-string
The answer is a very nice solution. The function gsub is used in this way:
transfecties = gsub(".*(MYC|EV|PIM|WDR|OBX).*", "\\1", rownames(test))
Now, I have another problem, the program I run with R (Galaxy) doesn't recognize the | characters. My question is, is there another way to get to the same solution without using this |?
Thanks!
If you don't want to use the "|" character, you can try something like :
Rnames <-
c( "U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9" ,
"U2OS.OBX.2.7.9" , "U2OS.EV.18.6.9" ,"U2O2.PIM.18.6.9" ,"U2OS.WDR.18.6.9" )
Rlevels <- c("MYC","EV","PIM","WDR","OBX")
tmp <- sapply(Rlevels,grepl,Rnames)
apply(tmp,1,function(i)colnames(tmp)[i])
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR"
But I would seriously consider mentioning this to the Galaxy team, as it seems rather awkward not to be able to use the symbol for OR...
I wouldn't recommend doing this in general in R as it is far less efficient than the solution @csgillespie provided, but an alternative is to loop over the various strings you want to match and do the replacements on each string separately, i.e. search for "MYC" and replace only in those rownames that match "MYC".
Here is an example using the x data from @csgillespie's answer:
x <- c("U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9",
"U2OS.OBX.2.7.9", "U2OS.EV.18.6.9", "U2O2.PIM.18.6.9","U2OS.WDR.18.6.9",
"U2OS.MYC.18.6.9","U2OS.OBX.18.6.9", "X1.U2OS...OBX","X2.U2OS...MYC")
Copy the data so we have something to compare with later (this is just for the example):
x2 <- x
Then create a list of strings you want to match on:
matches <- c("MYC","EV","PIM","WDR","OBX")
Then we loop over the values in matches and do three things (numbered ##X in the code):
1. Create the regular expression by pasting together the current match string i with the other bits of the regular expression we want to use.
2. Using grepl(), return a logical indicator for those elements of x that contain the string i.
3. Use the same style of gsub() call as you were already shown, but only on the elements of x2 that matched the string, and replace only those elements.
The loop is:
for(i in matches) {
rgexp <- paste(".*(", i, ").*", sep = "") ## 1
ind <- grepl(rgexp, x) ## 2
x2[ind] <- gsub(rgexp, "\\1", x2[ind]) ## 3
}
x2
Which gives:
> x2
[1] "EV" "PIM" "WDR" "MYC" "OBX" "EV" "PIM" "WDR" "MYC" "OBX" "OBX" "MYC"
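If the tool also has trouble with other regular-expression characters, a variation that avoids patterns entirely is to match the labels as fixed strings. This is only a sketch, assuming each row name contains exactly one of the labels; the names labels and out are made up here.

```r
x <- c("U2OS.EV.2.7.9", "U2OS.PIM.2.7.9", "U2OS.WDR.2.7.9", "U2OS.MYC.2.7.9",
       "U2OS.OBX.2.7.9", "U2OS.EV.18.6.9", "U2O2.PIM.18.6.9", "U2OS.WDR.18.6.9",
       "U2OS.MYC.18.6.9", "U2OS.OBX.18.6.9", "X1.U2OS...OBX", "X2.U2OS...MYC")
labels <- c("MYC", "EV", "PIM", "WDR", "OBX")

# Start with NA so that names matching no label are easy to spot afterwards
out <- rep(NA_character_, length(x))
for (lab in labels) {
  hit <- grepl(lab, x, fixed = TRUE)  # plain substring test, no regex
  out[hit] <- lab
}
```

Names that match none of the labels stay NA, which the gsub version would instead leave unchanged.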
