R: How do you apply grep() in lapply() - r

I would like to apply grep() in R, but I am not very good with lapply(). I understand that lapply takes a list, applies a function to each member, and returns a list. For instance, let x be a list consisting of 2 members.
> x<-strsplit(docs$Text," ")
>
> x
[[1]]
[1] "I" "lovehttp" "my" "mum." "I" "love"
[7] "my" "dad." "I" "love" "my" "brothers."
[[2]]
[1] "I" "live" "in" "Eastcoast" "now." "Job.I"
[7] "used" "to" "live" "in" "WestCoast."
I would like to use grep() to remove the words containing http. So I tried:
> lapply(x,grep(pattern="http",invert=TRUE, value=TRUE))
But it does not work and it says
Error in grep(pattern = "http", invert = TRUE, value = TRUE) :
argument "x" is missing, with no default
So, I tried
> lapply(x,grep(pattern="http",invert=TRUE, value=TRUE,x))
But it says
Error in match.fun(FUN) :
'grep(pattern = "http", invert = TRUE, value = TRUE, x)' is not a
function, character or symbol
Any help, please, and thanks!

This can be done in one line:
lst <- lapply(lst, grep, pattern="http", value=TRUE, invert=TRUE)
#lst
#[[1]]
# [1] "I" "my" "mum." "I" "love" "my" "dad." "I" "love" "my" "brothers."
#
#[[2]]
# [1] "I" "live" "in" "Eastcoast" "now." "Job.I" "used" "to" "live" "in" "WestCoast."
If you want to keep the word and remove only the pattern itself (as discussed in the comments), use gsub instead of grep:
lapply(lst, gsub, pattern="http", replacement="")
#[[1]]
# [1] "I" "love" "my" "mum." "I" "love" "my" "dad." "I" "love" "my" "brothers."
#
#[[2]]
# [1] "I" "live" "in" "Eastcoast" "now." "Job.I" "used" "to" "live" "in" "WestCoast."
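As a minimal sketch of why this form works while the attempts in the question failed: lapply() needs the function itself as its second argument, not the result of calling it, and it forwards any further named arguments to that function. The small list x below is made up for illustration.

```r
x <- list(
  c("I", "lovehttp", "my", "mum."),
  c("I", "live", "in", "Eastcoast")
)

# Extra named arguments after the function are forwarded to grep() by lapply().
a <- lapply(x, grep, pattern = "http", value = TRUE, invert = TRUE)

# Equivalent form with an explicit anonymous function.
b <- lapply(x, function(words) grep("http", words, value = TRUE, invert = TRUE))

identical(a, b)
# [1] TRUE
```

The anonymous-function form is longer but makes it explicit which argument receives each list element.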

The following function removes all entries from the vectors in your list that contain the substring http:
repx <- function(x) {
y <- grep("http", x)
vec <- rep(TRUE, length(x))
vec[y] <- FALSE
x <- x[vec]
return(x)
}
lapply(lst, function(x) { repx(x) })
Data:
x1 <- c("I", "lovehttp", "my", "mum.", "I", "love", "my", "dad.", "I", "love", "my", "brothers.")
x2 <- c("I", "live", "in", "Eastcoast", "now.", "Job.I", "used", "to", "live", "in", "WestCoast.")
lst <- list(x1, x2)
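The logical mask that repx builds by hand can probably be obtained directly from grepl(). A short sketch of that variant, using a shortened copy of the data and fixed = TRUE on the assumption that "http" should be matched literally rather than as a regular expression:

```r
lst <- list(
  c("I", "lovehttp", "my", "mum."),
  c("I", "live", "in", "Eastcoast")
)

# grepl() returns TRUE/FALSE per element; negate it to keep the non-matching words.
kept <- lapply(lst, function(x) x[!grepl("http", x, fixed = TRUE)])
kept[[1]]
# [1] "I"    "my"   "mum."
```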

Related

How to generate all permutations of lists of string?

I have character data like this
[[1]]
[1] "F" "S"
[[2]]
[1] "Y" "Q" "Q"
[[3]]
[1] "C" "T"
[[4]]
[1] "G" "M"
[[5]]
[1] "A" "M"
And I want to generate all permutations for each individual list (not mixed between lists) and combine them together into one big list.
For example, for the first and second lists, which are "F" "S" and "Y" "Q" "Q", I want to get the permutation lists as c("FS", "SF"), and c("YQQ", "QYQ", "QQY"), and then combine them into one.
Here's an approach with combinat::permn:
library(combinat)
lapply(data,function(x)unique(sapply(combinat::permn(x),paste,collapse = "")))
#[[1]]
#[1] "FS" "SF"
#
#[[2]]
#[1] "YQQ" "QYQ" "QQY"
#
#[[3]]
#[1] "CT" "TC"
#
#[[4]]
#[1] "GM" "MG"
#
#[[5]]
#[1] "AM" "MA"
Or together with unlist:
unlist(lapply(data,function(x)unique(sapply(combinat::permn(x),paste,collapse = ""))))
# [1] "FS" "SF" "YQQ" "QYQ" "QQY" "CT" "TC" "GM" "MG" "AM" "MA"
Data:
data <- list(c("F", "S"), c("Y", "Q", "Q"), c("C", "T"), c("G", "M"),
c("A", "M"))
It looks like your desired output is not exactly the same as this related post (Generating all distinct permutations of a list in R). But we can build on the answer there.
library(combinat)
# example data, based on your description
X <- list(c("F","S"), c("Y", "Q", "Q"))
result <- lapply(X, function(x1) {
unique(sapply(permn(x1), function(x2) paste(x2, collapse = "")))
})
print(result)
Output
[[1]]
[1] "FS" "SF"
[[2]]
[1] "YQQ" "QYQ" "QQY"
The first (outer) lapply iterates over each element of the list, each of which is a vector of individual letters. On each iteration, permn takes those letters (e.g. "F" and "S") and returns a list with all possible permutations (e.g. "F" "S" and "S" "F"). To format the output as you described, the inner sapply takes each of those permutations and collapses it into a single character value, and unique then drops the duplicates.
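If the combinat package is not available, the same idea can be sketched in base R. The recursive helper perms below is hypothetical (it is not part of combinat or base R) and is only meant for short vectors, since the number of orderings grows factorially.

```r
# Hypothetical helper: all orderings of a vector, returned as a list of vectors.
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    # Put element i first, then append every ordering of the remaining elements.
    for (p in perms(v[-i])) {
      out[[length(out) + 1]] <- c(v[i], p)
    }
  }
  out
}

X <- list(c("F", "S"), c("Y", "Q", "Q"))
result <- lapply(X, function(x1) {
  # Collapse each ordering to one string, then drop repeats caused by equal letters.
  unique(vapply(perms(x1), paste, character(1), collapse = ""))
})
result[[2]]
# [1] "YQQ" "QYQ" "QQY"
```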
library(combinat)
final <- unlist(lapply(X , function(test_X) lapply(permn(test_X), function(x) paste(x,collapse='')) ))

Remove duplicates within consecutive runs of characters

I have strings containing lots of duplicates, like this:
tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B",
"A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")
I'd like to remove all consecutive duplicated upper-case and "*" characters, so the expected result is this:
[1] "CBC*C" "A" "*B" "A*A*A" "*C" "A"
I've successfully extracted the duplicated capitals:
library(stringr)
unlist(str_extract_all(gsub(">", "", tst), "(.)(?=\\1)"))
[1] "C" "C" "B" "B" "B" "C" "*" "*" "*" "*"
but am somewhat stuck here. My hunch is that the function which, which returns indices, might be of help but don't know how to implement it in this case.
Any ideas?
EDIT:
I wasn't that far from the solution myself - just using a negative lookahead (instead of the positive lookahead) does the trick:
str_extract_all(gsub(">", "", tst), "(.)(?!\\1)")
[[1]]
[1] "C" "B" "C" "*" "C"
[[2]]
[1] "A"
[[3]]
[1] "*" "B"
[[4]]
[1] "A" "*" "A" "*" "A"
[[5]]
[1] "*" "C"
[[6]]
[1] "A"
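The same lookahead idea can also be sketched in base R, without stringr: instead of extracting the characters that are not followed by a copy of themselves, delete the ones that are. This assumes perl = TRUE, since lookaheads are not available in R's default regex engine.

```r
tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B",
         "A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")

# Drop the separators first, then delete every character that is
# immediately followed by the same character (positive lookahead).
flat <- gsub(">", "", tst, fixed = TRUE)
res <- gsub("(.)(?=\\1)", "", flat, perl = TRUE)
res
# [1] "CBC*C" "A"     "*B"    "A*A*A" "*C"    "A"
```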
We can use gsub
gsub("([A-Z*]>)\\1+", "\\1", tst)
#[1] "C>B>C>*>C"
In order to get the second result, remove the >
gsub(">", "", gsub("([A-Z*]>)\\1+", "\\1", tst), fixed = TRUE)
#[1] "CBC*C"
Based on the OP's comments below, maybe
gsub("(.)\\1+", "\\1", gsub(">", "", tst))
#[1] "CBC*C"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A>A>A"))
#[1] "A"
Another way to get CBC*C could be using 2 groups and using group 2 in the replacement.
((.)>)\1+
Example
tst <- "C>C>C>B>B>B>B>C>C>*>*>*>*>*>C"
gsub("((.)>)\\1+", "\\2", tst)
Output
[1] "CBC*C"
For those of us allergic to regex:
paste(rle(strsplit(tst, ">")[[1]])$values, collapse = ">") # or collapse = ""
[1] "C>B>C>*>C"
...which of course fails for strings with runs of lowercase letters, like "A>A>a>a>A>A"
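The rle line above only handles the first element of tst. A sketch that applies the same run-length idea to every string in the vector might look like this; dedupe_runs is a hypothetical helper name.

```r
tst <- c("C>C>C>B>B>B>B>C>C>*>*>*>*>*>C", "A>A>A", "*>B>B",
         "A>A>A>A>A>*>A>A>A>*>*>*>*>A>A", "*>C>C", "A")

# Hypothetical helper: keep one value per run of identical tokens.
dedupe_runs <- function(tokens) paste(rle(tokens)$values, collapse = "")

# strsplit() returns one token vector per input string; vapply() handles each in turn.
res <- vapply(strsplit(tst, ">", fixed = TRUE), dedupe_runs, character(1))
res
# [1] "CBC*C" "A"     "*B"    "A*A*A" "*C"    "A"
```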
A somewhat universal base R approach without regexps.
The idea here is to collapse the string into its characters and then keep a character only when the next one differs, which removes consecutive repeats (unlike unique, which would drop every later repeat):
tst <- "C>C>C>B>B>B>B>C>C>*>*>*>*>*>C"
st <- paste(unlist(strsplit(tst,">")),collapse="")
#[1] "CCCBBBBCC*****C"
paste( unlist( sapply( 1:nchar(st), function(x){
if( substr(st,x,x) != substr(st,(x+1),(x+1)) ){ substr(st,x,x) } } ) ), collapse="" )
#[1] "CBC*C"
Oh, and if you want lowercase functionality (excluding lowercase letters from removal), use this instead:
paste( unlist( sapply( 1:nchar(st), function(x){
a=substr(st,x,x); b=substr(st,(x+1),(x+1));
if( a != b & toupper(a) == a ){ a } else if( toupper(a) != a ){ a } } ) ), collapse="" )

Change the null values of multiple lists that only differ by a number

I want to change the null values of multiple lists that only differ by a number. In this example I have 3 lists: "a1", "a2" and "a3", and I want to replace their null values with "THERE'S NO VALUE". I've tried a for loop using the "paste" function, but it doesn't work. This is a simplified version of my code:
a1<-list(NULL, "a","b")
a2<-list("d", NULL,"m")
a3<-list("k", NULL,"l")
for (i in 1:3){
var<-paste("a", i, sep = "")
var[var=='NULL']<-"THERE'S NO VALUE"
}
I've also tried the assign function, but it changes all variables, and I only want to change the null element of each one (I suspect why, but I don't know how to change the function to make it work):
for (i in 1:3){
var<-paste("a", i, sep = "")
assign(var,var[var=='NULL']<-"THERE'S NO VALUE")
}
Thanks in advance.
We use mget to collect the objects into a list, loop over that list with lapply, and replace the elements that are NULL with the new value. If needed, list2env then writes the modified objects back to the global environment:
list2env(lapply(mget(paste0("a", 1:3)), function(x) {
x[sapply(x, is.null)] <- "THERE'S NO VALUE"
x}),
.GlobalEnv)
Now check the objects:
a1
[[1]]
[1] "THERE'S NO VALUE"
[[2]]
[1] "a"
[[3]]
[1] "b"
a2
[[1]]
[1] "d"
[[2]]
[1] "THERE'S NO VALUE"
[[3]]
[1] "m"
a3
[[1]]
[1] "k"
[[2]]
[1] "THERE'S NO VALUE"
[[3]]
[1] "l"
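An alternative sketch, assuming the lists can be kept together in one named list instead of in separate variables a1, a2 and a3: the replacement then needs neither mget nor list2env. The names lists and replace_null are made up for this example.

```r
lists <- list(
  a1 = list(NULL, "a", "b"),
  a2 = list("d", NULL, "m"),
  a3 = list("k", NULL, "l")
)

# Hypothetical helper: replace every NULL element of a list with a default value.
replace_null <- function(x, value = "THERE'S NO VALUE") {
  x[vapply(x, is.null, logical(1))] <- value
  x
}

lists <- lapply(lists, replace_null)
lists$a1[[1]]
# [1] "THERE'S NO VALUE"
```

Individual lists are then reached as lists$a1 or lists[["a2"]], which avoids building variable names with paste.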

Removing empty words in a list in R

I have a long list of words, some of which are empty strings. This is part of the list.
[[95]]
[1] "while" "" "however" "" "the" "right" "is" "unsettled"
[9] "" "we" "have" "avoided" "changing" "the" "state"
[[96]]
[1] "of" "things" "by" "taking" "new" "posts"
[7] "or" "strengthening" "ourselves" "in" "the" "disputed"
I'm trying to get rid of the empty strings in each element of the list. I don't know how to do this using regular expressions, and can't figure why the following lapply doesn't work either:
new_list = lapply(list, function(x) x = x[x != ""])
Can you help correct the code? Also, do you know how to use regexp for that? Thanks.
We can use grep
lapply(list, function(x) lapply(x, grep, pattern = "^$", value = TRUE, invert = TRUE))
Or, as @thelatemail mentioned, the recursive apply (rapply) can be used
rapply(list, grep, pattern = "^$", value = TRUE, invert= TRUE, how = "list")
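If the object is a flat list of character vectors, a regex is probably not needed at all. A minimal sketch with nzchar(), using a shortened made-up sample and the name words rather than list, which would mask the base function:

```r
words <- list(
  c("while", "", "however", "", "the", "right"),
  c("of", "things", "by", "taking")
)

# nzchar() is TRUE for strings with at least one character, so indexing
# with it keeps only the non-empty words of each vector.
cleaned <- lapply(words, function(x) x[nzchar(x)])
cleaned[[1]]
# [1] "while"   "however" "the"     "right"
```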

How to split a certain element in a vector by letters?

For example, I have an element "computer" in a vector. I need to get a vector consisting of "c", "o", "m", "p", "u", "t", "e", "r".
And the second part of my question is optional. How can I create a vector containing letter combinations of the elements of the above mentioned vector and letters in the resulting combinations will be only in such order as in the original word? For instance, I want to get something like "puter" or "mpu" in this vector instead of "tumpo".
You can use
strsplit("computer", "\\b")
and
library("RWeka")
gsub(" ", "",
NGramTokenizer(paste(strsplit("computer", "\\b")[[1]], collapse=" "),
Weka_control(min=2,
max=5)),
fixed=TRUE)
# [1] "compu" "omput" "mpute" "puter" "comp"
# [6] "ompu" "mput" "pute" "uter" "com"
# [11] "omp" "mpu" "put" "ute" "ter"
# [16] "co" "om" "mp" "pu" "ut"
# [21] "te" "er"
to create n-grams with 2 <= n <=5.
The first part of the question is really easy:
splits <- unlist(strsplit("computer",split=""))
> splits
[1] "c" "o" "m" "p" "u" "t" "e" "r"
For the second part you can use the following code:
subseqs <-
unlist(
lapply(1:length(splits),FUN=function(x){
lapply(1:(length(splits)+1-x),FUN=function(y){
paste(splits[y:(y+x-1)],collapse="") })
})
)
> subseqs
[1] "c" "o" "m" "p" "u" "t" "e"
[8] "r" "co" "om" "mp" "pu" "ut" "te"
[15] "er" "com" "omp" "mpu" "put" "ute" "ter"
[22] "comp" "ompu" "mput" "pute" "uter" "compu" "omput"
[29] "mpute" "puter" "comput" "ompute" "mputer" "compute" "omputer"
[36] "computer"
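The nested lapply above could also be sketched with the vectorised substring() function, by listing every start and end position up front. The results come out grouped by start position rather than by length, and the variable names are only illustrative.

```r
word <- "computer"
n <- nchar(word)

# Every start position i is paired with every end position from i to n.
starts <- rep(seq_len(n), times = n:1)
ends <- unlist(lapply(seq_len(n), function(i) i:n))

# substring() is vectorised over its start and end arguments.
subseqs <- substring(word, starts, ends)
length(subseqs)
# [1] 36
```

Because only contiguous ranges are generated, letters always stay in their original order, so "puter" and "mpu" appear but "tumpo" cannot.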
For three consecutive letter combinations:
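x <- strsplit("computer", "")[[1]]  # character vector of single letters
y <- combn(seq_along(x), 3); m <- match(1:6, y[1,])  # first combination starting at each position 1..6
combn(x, 3)[, m]  # each column is one run of three consecutive letters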

Resources