Removing "" elements from a list in R

I have a list where I would like to remove empty characters: "".
I seem to be subsetting the elements incorrectly:
> sample2[which(sample2 == "")]
list()
> sample2[which(sample2 != "")]
[[1]]
[1] "" "03JAN1990" "" "" ""
[6] "" "23.4" "0.4" "" ""
[11] "" "" "25.1" "0.3" ""
[16] "" "" "" "26.6" "0.0"
[21] "" "" "" "" "28.6"
[26] "0.3"
What should I do to subset and remove the empty characters?

From your output, it looks like sample2 is not a character vector, but it is a list containing a character vector. You should be using
sample2[[1]][which(sample2[[1]] != "")]
(It would help to include dput(sample2) just to confirm)
Or even better, take the character vector out of the list first
sample3 <- sample2[[1]]
# or maybe sample3 <- unlist(sample2)
sample3[which(sample3 != "")]
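A minimal sketch of the difference, in case it helps. The real sample2 was never posted, so the object below is a made-up stand-in with the same shape (a list holding one character vector):

```r
# Hypothetical stand-in for the asker's object
sample2 <- list(c("", "03JAN1990", "", "23.4", "0.4", ""))

# Comparing the list itself yields one result per list element (here: one),
# not one per string, so nothing is ever equal to ""
outer_test <- sample2 == ""
print(outer_test)

# Comparing the inner character vector tests every string
sample3 <- sample2[[1]]
cleaned <- sample3[sample3 != ""]
print(cleaned)
```

The which() call is optional here; logical indexing gives the same result as long as the vector contains no NA values.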

A very basic solution:
> lst = list(1,2,"dog","","boss","")
> x = unlist(lst)
> list(x[x!=""])
[[1]]
[1] "1" "2" "dog" "boss"

Related

Extract words that are repeated from one sentence to the next

I have sentences from spoken conversation and would like to identify the words that are repeated from sentence to sentence; here's some illustrative data (in reproducible format below)
df
# A tibble: 10 x 1
Orthographic
<chr>
1 "like I don't understand sorry like how old's your mom"
2 "eh sixty-one"
3 "yeah (...) yeah yeah like I mean she's not like in the risk age group but still"
4 "yeah"
5 "HH"
6 "I don't know"
7 "yeah I talked to my grandparents last night and last time I talked to them it was like two weeks…
8 "yeah"
9 "she said you should come home probably "
10 "no and like why would you go to the airport where people have corona sit in the plane where peop…
I've had some success extracting the repeated words with a for loop, but I also get some strange results. Here's what I've been doing so far:
# initialize pattern and new column `rept` in `df`:
pattern1 <- c()
df$rept <- NA
# for loop:
for(i in 2:nrow(df)){
pattern1[i-1] <- paste0("\\b(", paste0(unlist(str_split(df$Orthographic[i-1], " ")), collapse = "|"), ")\\b")
df$rept[i] <- str_extract_all(df$Orthographic[i], pattern1[i-1])
}
The results are these; result # 10 is strange/incorrect - it should be character(0). How can the code be improved so that no such strange results are obtained?
df$rept
[[1]]
[1] NA
[[2]]
character(0)
[[3]]
character(0)
[[4]]
[1] "yeah"
[[5]]
character(0)
[[6]]
character(0)
[[7]]
[1] "I" "I" "don't" "I" "I" "don't" "I"
[[8]]
[1] "yeah"
[[9]]
character(0)
[[10]]
[1] "" "" "" "" "" "" "" "" "" "" "you" "" "" "" "" ""
[17] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[33] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[49] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[65] "" "" "" "" "" "" "" "" "" "" "" ""
Reproducible data:
structure(list(Orthographic = c("like I don't understand sorry like how old's your mom",
"eh sixty-one", "yeah (...) yeah yeah like I mean she's not like in the risk age group but still",
"yeah", "HH", "I don't know", "yeah I talked to my grandparents last night and last time I talked to them it was like two weeks ago and they at that time they were already like maybe you should just get on a plane and come home and like you can't just be here and and then last night they were like are you sure you don't wanna come home and I was I don't think I can and my mom said the same thing",
"yeah", "she said you should come home probably ", "no and like why would you go to the airport where people have corona sit in the plane where people have corona to get there where people have corona and then go and take it to your family"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
When you debug regex issues that involve dynamically built patterns with word boundaries, there are several things to keep in mind before deciding how to approach the problem.
First, check the patterns your original code produces:
for(i in 2:nrow(df)) {
  pattern1[i-1] <- paste0("\\b(", paste0(unlist(str_split(df$Orthographic[i-1], " ")), collapse = "|"), ")\\b")
  print(pattern1[i-1])
}
Here is the list of regexps:
[1] "\\b(like|I|don't|understand|sorry|like|how|old's|your|mom)\\b"
[1] "\\b(eh|sixty-one)\\b"
[1] "\\b(yeah|(...)|yeah|yeah|like|I|mean|she's|not|like|in|the|risk|age|group|but|still)\\b"
[1] "\\b(yeah)\\b"
[1] "\\b(HH)\\b"
[1] "\\b(I|don't|know)\\b"
[1] "\\b(yeah|I|talked|to|my|grandparents|last|night|and|last|time|I|talked|to|them|it|was|like|two|weeks|ago|and|they|at|that|time|they|were|already|like|maybe|you|should|just|get|on|a|plane|and|come|home|and|like|you|can't|just|be|here|and|and|then|last|night|they|were|like|are|you|sure|you|don't|wanna|come|home|and|I|was|I|don't|think|I|can|and|my|mom|said|the|same|thing)\\b"
[1] "\\b(yeah)\\b"
[1] "\\b(she|said|you|should|come|home|probably|)\\b"
Look at the second pattern: \b(eh|sixty-one)\b. What if the first word were sixty? The \b(sixty|sixty-one)\b regex will never match sixty-one, because sixty matches first (the hyphen after it satisfies \b) and the other alternative is never tried. When you use word boundaries and the alternatives can contain more than one word, always sort them by length in descending order to ensure the longest alternative is tried first. Here, you do not need to sort the alternatives because you only have single-word alternatives.
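The ordering effect can be checked with a small sketch. It uses base R's regexpr with perl = TRUE rather than stringr (an assumption made to keep the example dependency-free), and the sentence and the alternatives are made up for illustration:

```r
# Hypothetical input and alternatives
x <- "she is sixty-one"
alts <- c("sixty", "sixty-one")

# Shorter alternative first: "sixty" wins, because \b holds before the hyphen
pat_unsorted <- paste0("\\b(", paste(alts, collapse = "|"), ")\\b")
short_first <- regmatches(x, regexpr(pat_unsorted, x, perl = TRUE))

# Longest alternative first
alts_sorted <- alts[order(nchar(alts), decreasing = TRUE)]
pat_sorted <- paste0("\\b(", paste(alts_sorted, collapse = "|"), ")\\b")
long_first <- regmatches(x, regexpr(pat_sorted, x, perl = TRUE))

print(short_first)
print(long_first)
```

perl = TRUE matters in this sketch: R's default regex engine follows different matching rules for alternation, so the ordering effect is shown here only for the Perl-style engine.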
Now look at the next pattern, which contains the |(...)| alternative. Unescaped, it matches any three chars other than line break chars and captures them into a group. However, the string contained a (...) substring where the parentheses and dots are literal chars. To match them with a regex, you need to escape all special chars.
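A quick way to see the difference between the unescaped and the escaped form, sketched in base R with perl = TRUE. The escaped pattern is written out by hand here, and the input is a shortened, made-up version of the sentence:

```r
x <- "yeah (...) yeah"

# Unescaped: a group of any three characters, matched repeatedly
unescaped <- regmatches(x, gregexpr("(...)", x, perl = TRUE))[[1]]

# Escaped: literal parentheses and dots only
escaped <- regmatches(x, gregexpr("\\(\\.\\.\\.\\)", x, perl = TRUE))[[1]]

print(unescaped)
print(escaped)
```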
Next, you treat "words" as chunks of non-whitespace chars, because you split with str_split(df$Orthographic[i-1], " "). This invalidates the \b approach altogether: you need whitespace boundaries, (?<!\S) at the start and (?!\S) at the end, instead of \b. Moreover, since you split on a single space, you may get empty alternatives if the input string contains two or more consecutive spaces. Split with the \s+ pattern instead, which matches one or more whitespace chars.
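Next, there is a trailing space in the second-to-last string, and it creates an empty alternative. That empty alternative matches the empty string at every word boundary, which is where the "" entries in result 10 come from. You need to trimws your input before splitting it into tokens/words.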
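The empty-alternative problem and the trim-then-split fix can be sketched as follows. This is a base R approximation (regmatches/gregexpr with perl = TRUE instead of stringr), and the second sentence is a shortened, hypothetical version of row 10. None of the tokens contain regex special chars, so escaping is left out:

```r
# The first sentence ends with a space, as in the question's data
prev_sentence <- "she said you should come home probably "
next_sentence <- "no and like why would you go to the airport"

# Tokens written out by hand the way stringr::str_split(prev_sentence, " ")
# returns them: note the trailing "" (base strsplit() would drop it)
raw_tokens <- c("she", "said", "you", "should", "come", "home", "probably", "")
bad_pattern <- paste0("\\b(", paste(raw_tokens, collapse = "|"), ")\\b")
bad <- regmatches(next_sentence,
                  gregexpr(bad_pattern, next_sentence, perl = TRUE))[[1]]

# Trim first, then split on runs of whitespace: no empty token survives
tokens <- strsplit(trimws(prev_sentence), "\\s+")[[1]]
good_pattern <- paste0("(?<!\\S)(?:", paste(unique(tokens), collapse = "|"), ")(?!\\S)")
good <- regmatches(next_sentence,
                   gregexpr(good_pattern, next_sentence, perl = TRUE))[[1]]

print(bad)   # empty strings mixed in with the real match
print(good)
```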
This is what you need to do with the regex solution: add the escape.for.regex function:
## Escape for regex
escape.for.regex <- function(string) {
gsub("([][{}()+*^${|\\\\?.])", "\\\\\\1", string)
}
and then use it to escape the tokens that you obtain by splitting the trimmed df$Orthographic[i-1] with the \s+ regex, apply unique to remove duplicates (which makes the pattern shorter and more efficient), and add the whitespace boundaries:
for(i in 2:nrow(df)){
pattern1[i-1] <- paste0("(?<!\\S)(?:", paste0(escape.for.regex(unique(unlist(str_split(trimws(df$Orthographic[i-1]), "\\s+")))), collapse = "|"), ")(?!\\S)")
df$rept[i] <- str_extract_all(df$Orthographic[i], pattern1[i-1])
}
See the list of regexps:
[1] "(?<!\\S)(?:like|I|don't|understand|sorry|how|old's|your|mom)(?!\\S)"
[1] "(?<!\\S)(?:eh|sixty-one)(?!\\S)"
[1] "(?<!\\S)(?:yeah|\\(\\.\\.\\.\\)|like|I|mean|she's|not|in|the|risk|age|group|but|still)(?!\\S)"
[1] "(?<!\\S)(?:yeah)(?!\\S)"
[1] "(?<!\\S)(?:HH)(?!\\S)"
[1] "(?<!\\S)(?:I|don't|know)(?!\\S)"
[1] "(?<!\\S)(?:yeah|I|talked|to|my|grandparents|last|night|and|time|them|it|was|like|two|weeks|ago|they|at|that|were|already|maybe|you|should|just|get|on|a|plane|come|home|can't|be|here|then|are|sure|don't|wanna|think|can|mom|said|the|same|thing)(?!\\S)"
[1] "(?<!\\S)(?:yeah)(?!\\S)"
[1] "(?<!\\S)(?:she|said|you|should|come|home|probably)(?!\\S)"
Output:
> df$rept
[[1]]
NULL
[[2]]
character(0)
[[3]]
character(0)
[[4]]
[1] "yeah"
[[5]]
character(0)
[[6]]
character(0)
[[7]]
[1] "I" "I" "don't" "I" "I" "don't" "I"
[[8]]
[1] "yeah"
[[9]]
character(0)
[[10]]
[1] "you"
Depending on whether it is sufficient to identify repeated words, or also their repeat frequencies, you might want to modify the function, but here is one approach using the dplyr::lead function:
library(stringr)
library(dplyr)
# general function that identifies intersecting words from multiple strings
getRpt <- function(...){
l <- lapply(list(...), function(x) unlist(unique(
str_split(as.character(x), pattern=boundary(type="word")))))
Reduce(intersect, l)
}
df$rept <- mapply(getRpt, df$Orthographic, lead(df$Orthographic), USE.NAMES=FALSE)
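As far as I can tell, getRpt pairs each sentence with the next one (because of lead) and returns each shared word once. If you want the question's layout instead (compare with the previous sentence and keep every repetition), a regex-free sketch in base R could look like this. It is a different technique from the one above, and the three sentences are a made-up mini-sample:

```r
# Hypothetical mini-sample in the spirit of the question's data
sentences <- c("I don't know",
               "yeah I talked and I don't think I can ",
               "yeah")

# Whitespace-delimited tokens; trimws() avoids empty tokens at the edges
tokens <- strsplit(trimws(sentences), "\\s+")

# Pair every sentence with the one before it (the first has no predecessor)
prev_tokens <- c(list(character(0)), head(tokens, -1))

# Keep each token of the current sentence that also occurs in the previous one
rept <- Map(function(cur, prev) cur[cur %in% prev], tokens, prev_tokens)
print(rept)
```

Because no pattern is built, there is nothing to escape and no empty alternative to worry about.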

Filter list in R which has nchar > 1

I have a list of names
> x <- c("Test t", "Cuma Nama K", "O", "Test satu dua t")
> name <- strsplit(x, " ")
> name
[[1]]
[1] "Test" "t"
[[2]]
[1] "Cuma" "Nama" "K"
[[3]]
[1] "O"
[[4]]
[1] "Test" "satu" "dua" "t"
How can I filter a list so that it can become like this?
I am trying to find out how to filter the list which has nchar > 1
> name
[[1]]
[1] "Test"
[[2]]
[1] "Cuma" "Nama"
[[4]]
[1] "Test" "satu" "dua"
lapply(name, function(x) x[nchar(x)>1])
Results in:
[[1]]
[1] "Test"
[[2]]
[1] "Cuma" "Nama"
[[3]]
character(0)
[[4]]
[1] "Test" "satu" "dua"
We can loop over the list elements, subset the elements that have nchar greater than 1, and use Filter to remove the list elements that have 0 elements left
Filter(length,lapply(name, function(x) x[nchar(x) >1 ]))
#[[1]]
#[1] "Test"
#[[2]]
#[1] "Cuma" "Nama"
#[[3]]
#[1] "Test" "satu" "dua"
If we want to remove the words with one character from the string, we can also do this without splitting
setdiff(gsub("(^| ).( |$)", "", x), "")
#[1] "Test" "Cuma Nama" "Test satu dua"
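Note that Filter renumbers the result (1, 2, 3), whereas the desired output in the question keeps the original positions (1, 2, 4). One possible way to keep track of them, sketched here as an assumption about what the asker wants, is to name the elements by position before filtering:

```r
x <- c("Test t", "Cuma Nama K", "O", "Test satu dua t")
name <- strsplit(x, " ")

# Label each element with its original position
names(name) <- seq_along(name)

# Drop one-character words, then drop elements left empty
kept <- Filter(length, lapply(name, function(w) w[nchar(w) > 1]))

print(names(kept))
print(kept)
```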

How to delete "" from a list of character vectors

I have a list of character vectors, where some elements are actual strings, such as "FA" and "EX". However, some others are just "". I want to delete these.
list1 <- c("FA", "EX", "")
list2 <- c("FA")
list3 <- c("")
list <- list(list1, list2, list3)
> list
[[1]]
[1] "FA" "EX" ""
[[2]]
[1] "FA"
[[3]]
[1] ""
Should then be
[[1]]
[1] "FA" "EX"
[[2]]
[1] "FA"
How can I accomplish this?
Try
lapply(list[list!=''], function(x) x[x!=''])
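A slightly more defensive variant, offered as a sketch: as far as I can tell, the one-liner above only drops list elements that are exactly a single "", so a vector such as c("", "") would survive as character(0). Cleaning first and dropping empty vectors afterwards covers that case too. The fourth element below is a hypothetical addition to the question's data, and the list is called lst to avoid masking the list function:

```r
# Question's data plus a hypothetical all-blank vector
lst <- list(c("FA", "EX", ""), "FA", "", c("", ""))

# Step 1: drop "" inside every vector; nzchar() is TRUE for non-empty strings
cleaned <- lapply(lst, function(x) x[nzchar(x)])

# Step 2: drop vectors that are now empty
cleaned <- cleaned[lengths(cleaned) > 0]
print(cleaned)
```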

R: Removing blanks from the list

I'm wondering if there is any way to remove blanks from the list.
As far as I've searched, I found out that there are many Q&As for removing
the whole element from the list, but couldn't find the one regarding
a specific component of the element.
To be specific, the list now I'm working with looks like this:
[[1]]
[1] "1" "" "" "2" "" "" "3"
[[2]]
[1] "weak"
[[3]]
[1] "22" "33"
[[4]]
[1] "44" "34p" "45"
From above, you can find " ", which should be removed.
I've tried different commands like
text.words.bl <- text.words.ll[-which(text.words.ll==" ")]
text.words.bl <- text.words.ll[!sapply(text.words.ll, is.null)]
etc, but seems like " "s in [[1]] of the list still remains.
Is it impossible to apply commands to small pieces in each element of the list?
(e.g. 1, 2, weak, 22, 33... respectively)
I've used "lapply" function to run specific commands to each elements,
and it seemed like those lapply commands all worked....
JY
Use %in%, but negate it with !:
## Sample data:
L <- list(c(1, 2, "", "", 4), c(1, "", "", 2), c("", "", 3))
L
# [[1]]
# [1] "1" "2" "" "" "4"
#
# [[2]]
# [1] "1" "" "" "2"
#
# [[3]]
# [1] "" "" "3"
The replacement:
lapply(L, function(x) x[!x %in% ""])
# [[1]]
# [1] "1" "2" "4"
#
# [[2]]
# [1] "1" "2"
#
# [[3]]
# [1] "3"
Obviously, assign the output to "L" if you want to overwrite the original dataset:
L[] <- lapply(L, function(x) x[!x %in% ""])
Another way would be to use nchar(). I borrowed L from @Ananda Mahto.
lapply(L, function(x) x[nchar(x) >= 1])
#[[1]]
#[1] "1" "2" "4"
#
#[[2]]
#[1] "1" "2"
#
#[[3]]
#[1] "3"

Using lapply to subset a list of character lists in R

I have the following line that subsets a character list correctly:
> cpc_data2[[1]]
[1] "" "Week" "" "" "" "" "" "" "" ""
[11] "" "SST" "SSTA" "" "" "" "" "SST" "SSTA" ""
[21] "" "" "" "SST" "SSTA" "" "" "" "" "SST"
[31] "SSTA"
> cpc_data2[[1]][which(cpc_data2[1][[1]] != "")]
[1] "Week" "SST" "SSTA" "SST" "SSTA" "SST" "SSTA" "SST" "SSTA"
I would like to subset every list in cpc_data2. How should I do this? I have tried the following, but clearly my syntax is incorrect:
> cpc_data3 = lapply(cpc_data2, function(x) x[which(x[[1]] != "")])
> head(cpc_data3)
[[1]]
character(0)
[[2]]
character(0)
[[3]]
character(0)
You could try
lapply(cpc_data2, function(x) x[x!=''])
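To see why the attempt in the question returned character(0): inside lapply, x is already the character vector, so x[[1]] is only its first string. A small sketch with made-up data in the same shape as cpc_data2:

```r
# Hypothetical stand-in for cpc_data2
cpc_data2 <- list(c("", "Week", "", "SST", "SSTA"),
                  c("", "01JAN1990", "23.4"))

# x[[1]] is the first string (""), so which() returns integer(0)
# and the subset is empty
attempt <- lapply(cpc_data2, function(x) x[which(x[[1]] != "")])

# Compare the whole vector instead
fixed <- lapply(cpc_data2, function(x) x[x != ""])

print(attempt)
print(fixed)
```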
