Removing empty words in a list in R

I have a long list of words, some of which are empty strings. This is part of the list.
$`17`[[95]]
[1] "while" "" "however" "" "the" "right" "is" "unsettled"
[9] "" "we" "have" "avoided" "changing" "the" "state"
$`17`[[96]]
[1] "of" "things" "by" "taking" "new" "posts"
[7] "or" "strengthening" "ourselves" "in" "the" "disputed"
I'm trying to get rid of the empty strings in each element of the list. I don't know how to do this using regular expressions, and I can't figure out why the following lapply call doesn't work either:
new_list = lapply(list, function(x) x = x[x != ""])
Can you help correct the code? Also, do you know how to use regexp for that? Thanks.

We can use grep
lapply(list, function(x) lapply(x, grep, pattern = "^$", value = TRUE, invert = TRUE))
Or, as @thelatemail mentioned, the recursive apply (rapply) can be used
rapply(list, grep, pattern = "^$", value = TRUE, invert= TRUE, how = "list")
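A minimal sketch of the same idea without a regular expression. It assumes, as the printout in the question suggests, that each top-level element is itself a list of character vectors, so a single lapply would not reach the individual words. The nested list below is made-up illustration data, and nzchar() is swapped in for grep as a non-regex test for empty strings.

```r
# Hypothetical nested list mimicking the structure shown in the question
nested <- list(`17` = list(c("while", "", "however", ""),
                           c("of", "things", "")))

# rapply visits every character vector at any depth;
# nzchar() returns TRUE for non-empty strings
cleaned <- rapply(nested, function(x) x[nzchar(x)], how = "list")

cleaned[["17"]][[1]]
# [1] "while"   "however"
```

With how = "list" the nesting of the input is preserved, so only the empty strings disappear.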

Related

R: Access the last subelement of each element in a list

Say I have a character vector like:
x <- c('A__B__Mike','A__Paul','Daniel','A__B__C__Martha','A__John','A__B__C__D__Laura')
I want a vector of only the names in the last position; I guess I could do it removing the first chunk using regular expressions, but say I want to use strsplit() to split by '__':
x.list <- strsplit(x, '__')
How would I access the last subelement (the names) of each element in this list? I only know how to do it if I know the position:
sapply(x.list, "[[", 1)
But how to access the last when the position is variable? Thanks!
In any case, what would be the fastest way to extract the names out of x in the first place? Anything faster than the strsplit approach?
We can do this with base R. Either using sub
sub(".*__", "", x)
#[1] "Mike" "Paul" "Daniel" "Martha" "John" "Laura"
or with strsplit, we get the last element with tail
sapply(strsplit(x, '__'), tail, 1)
#[1] "Mike" "Paul" "Daniel" "Martha" "John" "Laura"
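Or to find the position, we can use gregexpr and then extract using substring (note that the character class [^__] is the same as [^_], so this assumes the names themselves contain no underscores)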
substring(x, sapply(gregexpr("[^__]+", x), tail, 1))
#[1] "Mike" "Paul" "Daniel" "Martha" "John" "Laura"
Or with stri_extract_last
library(stringi)
stri_extract_last(x, regex="[^__]+")
#[1] "Mike" "Paul" "Daniel" "Martha" "John" "Laura"
Use the word function of the stringr package
library(stringr)
word(x,start = -1,sep = "\\_+")
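To address the question of indexing the last subelement directly when its position varies, here is a small sketch that uses each vector's own length as the index. The helper name last_of is hypothetical; fixed = TRUE is added on the assumption that '__' is meant literally.

```r
x <- c('A__B__Mike', 'A__Paul', 'Daniel', 'A__B__C__Martha',
       'A__John', 'A__B__C__D__Laura')
x.list <- strsplit(x, "__", fixed = TRUE)

# Hypothetical helper: the last element sits at position length(parts)
last_of <- function(parts) parts[length(parts)]

# vapply checks that every result is a single string
last_names <- vapply(x.list, last_of, character(1))
last_names
# [1] "Mike"   "Paul"   "Daniel" "Martha" "John"   "Laura"
```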

R: How do you apply grep() in lapply()

I would like to apply grep() in R, but I am not very good with lapply(). I understand that lapply takes a list, applies a function to each member, and outputs a list. For instance, let x be a list consisting of 2 members.
> x<-strsplit(docs$Text," ")
>
> x
[[1]]
[1] "I" "lovehttp" "my" "mum." "I" "love"
[7] "my" "dad." "I" "love" "my" "brothers."
[[2]]
[1] "I" "live" "in" "Eastcoast" "now." "Job.I"
[7] "used" "to" "live" "in" "WestCoast."
I would like to apply the grep() function to remove words containing http. So, I would apply:
> lapply(x,grep(pattern="http",invert=TRUE, value=TRUE))
But it does not work and it says
Error in grep(pattern = "http", invert = TRUE, value = TRUE) :
argument "x" is missing, with no default
So, I tried
> lapply(x,grep(pattern="http",invert=TRUE, value=TRUE,x))
But it says
Error in match.fun(FUN) :
'grep(pattern = "http", invert = TRUE, value = TRUE, x)' is not a
function, character or symbol
Any help would be appreciated, thanks!
This can be done in one line:
lst <- lapply(lst, grep, pattern="http", value=TRUE, invert=TRUE)
#lst
#[[1]]
# [1] "I" "my" "mum." "I" "love" "my" "dad." "I" "love" "my" "brothers."
#
#[[2]]
# [1] "I" "live" "in" "Eastcoast" "now." "Job.I" "used" "to" "live" "in" "WestCoast."
If you want to keep the word and remove only the pattern itself (as discussed in the comments), you can use gsub instead of grep:
lapply(lst, gsub, pattern="http", replacement="")
#[[1]]
# [1] "I" "love" "my" "mum." "I" "love" "my" "dad." "I" "love" "my" "brothers."
#
#[[2]]
# [1] "I" "live" "in" "Eastcoast" "now." "Job.I" "used" "to" "live" "in" "WestCoast."
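The attempts in the question fail because lapply expects a function as its second argument, whereas grep(pattern = "http", ...) is a call that R tries to evaluate immediately. A short sketch of the two equivalent ways to hand grep to lapply follows; the shortened lst and the variable names a and b are illustrative only.

```r
# Shortened illustration data
lst <- list(c("I", "lovehttp", "my", "mum."),
            c("I", "live", "in", "Eastcoast"))

# Form 1: pass the function itself; the extra named arguments are
# forwarded to grep, and each list element fills the remaining x argument
a <- lapply(lst, grep, pattern = "http", value = TRUE, invert = TRUE)

# Form 2: wrap the call in an anonymous function
b <- lapply(lst, function(words) grep("http", words, value = TRUE, invert = TRUE))

identical(a, b)
# [1] TRUE
```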
The following function removes every entry that contains the substring http from the vectors in your list:
repx <- function(x) {
  # indices of the elements that contain "http"
  y <- grep("http", x)
  # logical mask: TRUE = keep, FALSE = drop
  vec <- rep(TRUE, length(x))
  vec[y] <- FALSE
  x[vec]
}
lapply(lst, repx)
Data:
x1 <- c("I", "lovehttp", "my", "mum.", "I", "love", "my", "dad.", "I", "love", "my", "brothers.")
x2 <- c("I", "live", "in", "Eastcoast", "now.", "Job.I", "used", "to", "live", "in", "WestCoast.")
lst <- list(x1, x2)

using paste with a list

I'm trying to understand the behavior of strsplit and paste, which are inverse functions. However, when I strsplit a vector, a list is returned, like so:
> strsplit(c("on,e","tw,o","thre,e","fou,r"),",")
[[1]]
[1] "on" "e"
[[2]]
[1] "tw" "o"
[[3]]
[1] "thre" "e"
[[4]]
[1] "fou" "r"
I tried using lapply to cat the elements of the list back together, but it doesn't work:
> lapply(strsplit(c("on,e","tw,o","thre,e","fou,r"),","),cat)
on etw othre efou r[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
The same formula with paste instead of cat actually does nothing at all! Why am I getting these results? and how can I get the result I want, which is the original vector back again?
(Obviously, in my actual code I'm trying to do more with the strsplit and cat than just return the original vector, but I think a solution to this problem will work for mine. Thanks!)
While cat does concatenate and print to the console, it returns NULL rather than a character vector, so it does not work the way paste does. Its behavior is best explained in help("cat").
The collapse argument in paste is effectively the opposite of the split argument in strsplit. And you can use sapply to return the simplified pasted vector.
x <- c("on,e","tw,o","thre,e","fou,r")
( y <- sapply(strsplit(x, ","), paste, collapse = ",") )
# [1] "on,e" "tw,o" "thre,e" "fou,r"
( z <- vapply(strsplit(x, ","), paste, character(1L), collapse = ",") )
# [1] "on,e" "tw,o" "thre,e" "fou,r"
identical(x, y)
# [1] TRUE
identical(x, z)
# [1] TRUE
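Note that for cases like this, vapply is generally preferable to sapply: it guarantees the type and length of each result and can be slightly faster. Adding fixed = TRUE in strsplit should increase efficiency as well, since the separator is then matched literally rather than as a regular expression.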
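As a small illustration of why paste appeared to do "nothing at all" in the question: without collapse, paste returns a vector of the same length as its input, so each split piece stays separate. The vector parts is made-up example data.

```r
parts <- c("on", "e")

# No collapse: the result is still a length-2 vector
no_collapse <- paste(parts)

# With collapse: the pieces are joined into a single string
collapsed <- paste(parts, collapse = ",")

length(no_collapse)  # 2
collapsed            # "on,e"
```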

How to use the strsplit function with a period

I would like to split the following string by its periods. I tried strsplit() with "." in the split argument, but did not get the result I want.
s <- "I.want.to.split"
strsplit(s, ".")
[[1]]
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
The output I want is to split s into 4 elements in a list, as follows.
[[1]]
[1] "I" "want" "to" "split"
What should I do?
When using a regular expression in the split argument of strsplit(), you have to escape the . with \\., or put it in a character class, [.]. Otherwise the . keeps its special regex meaning, "any single character".
s <- "I.want.to.split"
strsplit(s, "[.]")
# [[1]]
# [1] "I" "want" "to" "split"
But the more efficient method here is to use the fixed argument in strsplit(). Using this argument will bypass the regex engine and search for an exact match of ".".
strsplit(s, ".", fixed = TRUE)
# [[1]]
# [1] "I" "want" "to" "split"
And of course, you can see help(strsplit) for more.
You need to either place the dot . inside a character class or precede it with two backslashes to escape it, because the dot is a special character in regex that means "match any single character (except newline)".
s <- 'I.want.to.split'
strsplit(s, '\\.')
# [[1]]
# [1] "I" "want" "to" "split"
Besides strsplit(), you can also use scan(). Try:
scan(what = "", text = s, sep = ".")
# Read 4 items
# [1] "I" "want" "to" "split"
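A brief sketch that puts the strsplit variants from the answers side by side and checks that they agree. The variable names are illustrative; the last line reproduces the behavior from the question, where the unescaped dot treats every character as a separator.

```r
s <- "I.want.to.split"

by_class  <- strsplit(s, "[.]")              # dot inside a character class
by_escape <- strsplit(s, "\\.")              # escaped dot
by_fixed  <- strsplit(s, ".", fixed = TRUE)  # literal match, no regex

identical(by_class, by_escape) && identical(by_class, by_fixed)
# [1] TRUE

# Unescaped dot: one empty string per character of s
lengths(strsplit(s, ".")) == nchar(s)
# [1] TRUE
```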

Character "|" in strsplit function (vertical bar / pipe)

I was curious about:
> strsplit("ty,rr", split = ",")
[[1]]
[1] "ty" "rr"
> strsplit("ty|rr", split = "|")
[[1]]
[1] "t" "y" "|" "r" "r"
Why don't I get c("ty","rr") from strsplit("ty|rr", split="|")?
It's because the split argument is interpreted as a regular expression, and | is a special character in a regex.
To get round this, you have two options:
Option 1: Escape the |, i.e. split = "\\|"
strsplit("ty|rr", split = "\\|")
[[1]]
[1] "ty" "rr"
Option 2: Specify fixed = TRUE:
strsplit("ty|rr", split = "|", fixed = TRUE)
[[1]]
[1] "ty" "rr"
Please also note the See Also section of ?strsplit, which tells you to read ?"regular expression" for details of the pattern specification.
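A character class is a third way to take away the special meaning of |, in the same way [.] works for the dot. The sketch below compares it with the two options above on a made-up string rec.

```r
rec <- "ty|rr|zz"

by_class  <- strsplit(rec, "[|]")[[1]]              # | inside a character class
by_escape <- strsplit(rec, "\\|")[[1]]              # escaped |
by_fixed  <- strsplit(rec, "|", fixed = TRUE)[[1]]  # literal match

by_class
# [1] "ty" "rr" "zz"
identical(by_class, by_escape) && identical(by_class, by_fixed)
# [1] TRUE
```

When the separator is a plain literal, fixed = TRUE is usually the simplest choice because no escaping is needed at all.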
