How to flag missing left-hand collocates with NA - r

I want to compute collocates of the lemma GO, including all its forms such as go, goes, gone, etc.:
go <- c("go after it", "here we go", "he went bust", "go get it go", "i 'm gon na go", "she 's going berserk")
The lemma forms are stored in this vector:
lemma_GO <- c("go", "goes", "going", "gone", "went", "gon na")
and this vector turns them into an alternation pattern:
pattern_GO <- paste0("\\b(", paste0(lemma_GO, collapse = "|"), ")\\b")
However, when I use the pattern with str_extract_all to extract the immediately left-hand collocate of GO, nothing is returned for occurrences where GO is the first word in the string, even when GO occurs again later in the same string:
library(stringr)
str_extract_all(go, paste0("'?\\b[a-z']+\\b(?=\\s?", pattern_GO, ")"))
[[1]]
character(0)
[[2]]
[1] "we"
[[3]]
[1] "he"
[[4]]
[1] "it"
[[5]]
[1] "'m" "na"
[[6]]
[1] "'s"
The expected result is this:
[[1]]
[1] NA
[[2]]
[1] "we"
[[3]]
[1] "he"
[[4]]
[1] NA "it"
[[5]]
[1] "'m" "na"
[[6]]
[1] "'s"
How can the extraction be mended to also return NA in the absence of a left-hand collocate?

You can add an alternative that matches at the start of the string (^) next to your consuming pattern, so that a string-initial GO yields an empty match instead of no match:
str_extract_all(go, paste0("('?\\b[a-z']+\\b|^)(?=\\s?", pattern_GO, ")"))
See the R demo:
go <- c("go after it", "here we go", "he went bust", "go get it go", "i 'm gon na go", "she 's going berserk")
lemma_GO <- c("go", "goes", "going", "gone", "went", "gon na")
pattern_GO <- paste0("\\b(", paste0(lemma_GO, collapse = "|"), ")\\b")
library(stringr)
str_extract_all(go, paste0("('?\\b[a-z']+\\b|^)(?=\\s?", pattern_GO, ")"))
Output:
[[1]]
[1] ""
[[2]]
[1] "we"
[[3]]
[1] "he"
[[4]]
[1] "" "it"
[[5]]
[1] "'m" "na"
[[6]]
[1] "'s"
If you want, you can turn all empty items into NA (using NA_character_ keeps every list element a character vector):
res <- str_extract_all(go, paste0("('?\\b[a-z']+\\b|^)(?=\\s?", pattern_GO, ")"))
res <- lapply(res, function(x) ifelse(x == "", NA_character_, x))
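The two steps above could be wrapped into a small helper. This is only a sketch: the function name left_collocates is hypothetical, and it assumes the stringr package is installed and lower-case input as in the question. Empty matches are replaced by index rather than with ifelse, so elements with no match at all stay character(0).

```r
library(stringr)

go <- c("go after it", "here we go", "he went bust", "go get it go",
        "i 'm gon na go", "she 's going berserk")
lemma_GO <- c("go", "goes", "going", "gone", "went", "gon na")
pattern_GO <- paste0("\\b(", paste0(lemma_GO, collapse = "|"), ")\\b")

# Hypothetical helper: extract the word immediately to the left of the
# lemma pattern and flag a missing left-hand collocate with NA.
left_collocates <- function(x, pattern) {
  # "|^" lets the regex match the empty string at the start of the input
  rx <- paste0("('?\\b[a-z']+\\b|^)(?=\\s?", pattern, ")")
  res <- str_extract_all(x, rx)
  lapply(res, function(m) {
    m[m == ""] <- NA_character_  # empty match = no left-hand collocate
    m
  })
}

res <- left_collocates(go, pattern_GO)
res
```

Strings that do not contain any form of GO still come back as character(0), which keeps "no GO at all" distinguishable from "GO without a left-hand collocate".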

Related

Iteratively extract repeated word forms across speaking turns

I'm working on speaking turns in conversation. My interest is in the words that get repeated from a prior turn to a next turn:
turnsX <- data.frame(
  speaker = c("A","B","A","B"),
  speech = c("let's have a look",
             "yeah let's take a look",
             "yeah okay so where to start",
             "let's start here"),
  stringsAsFactors = FALSE
)
I want to extract the repeated word forms. To this end I've run a for loop, iteratively defining each speech turn as a regex pattern for the next speech turn and str_extracting the words that get repeated from turn to turn:
library(stringr)
pattern <- c()
extracted <- c()
for(i in 1:nrow(turnsX)){
pattern[i] <- paste0(unlist(str_split(turnsX$speech[i], " ")), collapse = "|")
extracted[i+1] <- str_extract_all(turnsX$speech[i+1], pattern[i])
}
The result however is partly incorrect:
extracted
[[1]]
NULL
[[2]]
[1] "a" "let's" "a" "a" "look"
[[3]]
[1] "yeah" "a" "a"
[[4]]
[1] "start"
[[5]]
[1] NA
The correct result should be:
extracted
[[1]]
NULL
[[2]]
[1] "let's" "a" "look"
[[3]]
[1] "yeah"
[[4]]
[1] "start"
Where's the mistake? How can the code be mended, or what other approach is there, to get the correct result?
Maybe you can use Map and %in%.
x <- strsplit(turnsX$speech, " ")
Map(function(y,z) y[y %in% z], x[-length(x)], x[-1])
#[[1]]
#[1] "let's" "a" "look"
#
#[[2]]
#[1] "yeah"
#
#[[3]]
#[1] "start"
Here's a base R approach using Map and intersect:
tmp <- strsplit(turnsX$speech, ' ')
c(NA, Map(intersect, tmp[-1], tmp[-length(tmp)]))
#[[1]]
#[1] NA
#[[2]]
#[1] "let's" "a" "look"
#[[3]]
#[1] "yeah"
#[[4]]
#[1] "start"
You need word boundaries ("\\b"); without them, a short word such as "a" also matches inside longer words like "yeah" or "take":
library(stringr)
pattern <- c()
extracted <- c()
for(i in 2:nrow(turnsX)){
pattern[i - 1] <- paste0(unlist(str_split(turnsX$speech[i - 1], " ")), collapse = "|\\b")
extracted[i] <- str_extract_all(turnsX$speech[i], pattern[i - 1])
}
# [[1]]
# NULL
#
# [[2]]
# [1] "let's" "a" "look"
#
# [[3]]
# [1] "yeah"
#
# [[4]]
# [1] "start"
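A variant of the loop that puts a word boundary on both sides of every alternative might look like the sketch below. It uses base R only (regmatches and gregexpr) and assumes the words contain no regex metacharacters; the object names words and repeated are illustrative, not from the original post.

```r
turnsX <- data.frame(
  speaker = c("A", "B", "A", "B"),
  speech = c("let's have a look",
             "yeah let's take a look",
             "yeah okay so where to start",
             "let's start here"),
  stringsAsFactors = FALSE
)

words <- strsplit(turnsX$speech, " ", fixed = TRUE)
repeated <- vector("list", nrow(turnsX))  # first turn has no prior turn

for (i in seq_len(nrow(turnsX))[-1]) {
  # Build \b(word1|word2|...)\b from the previous turn
  pattern <- paste0("\\b(", paste(unique(words[[i - 1]]), collapse = "|"), ")\\b")
  # Extract every whole-word match from the current turn
  repeated[[i]] <- regmatches(
    turnsX$speech[i],
    gregexpr(pattern, turnsX$speech[i], perl = TRUE)
  )[[1]]
}

repeated
```

Wrapping the whole alternation in one group means a previous-turn word such as "a" can only match as a complete word, not as a prefix of a longer one.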

ft_tokenizer tokenizes words to lower, I want it to be as they are

I am using ft_tokenizer on a Spark data frame in R.
It tokenizes each word and converts it to lower case, but I want the words to keep their original case.
text_data <- data_frame(
x = c("This IS a sentence", "So is this")
)
tokenized <- text_data_tbl %>%
ft_tokenizer("x", "word")
tokenized$word
## [[1]]
## [[1]][[1]]
## [1] "this"
##
## [[1]][[2]]
## [1] "is"
##
## [[1]][[3]]
## [1] "a"
I want:
tokenized$word
## [[1]]
## [[1]][[1]]
## [1] "This"
##
## [[1]][[2]]
## [1] "IS"
##
## [[1]][[3]]
## [1] "a"
I don't think this is possible with ft_tokenizer. From ?ft_tokenizer:
A tokenizer that converts the input string to lowercase and then splits it by white spaces.
So its basic behaviour is to convert the string to lowercase and split on whitespace, which I don't think can be changed. If the data is available as a local data frame, consider doing
text_data$new_x <- lapply(strsplit(text_data$x, "\\s+"), as.list)
which gives the expected output, and you can continue your process from there.
text_data$new_x
#[[1]]
#[[1]][[1]]
#[1] "This"
#[[1]][[2]]
#[1] "IS"
#[[1]][[3]]
#[1] "a"
#[[1]][[4]]
#[1] "sentence"
#[[2]]
#[[2]][[1]]
#[1] "So"
#[[2]][[2]]
#[1] "is"
#[[2]][[3]]
#[1] "this"

How to split string in R with regular expression when parts of the regular expression are to be kept in the subsequent splitted strings?

I have a vector of character strings like this x = c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff").
I'd like to split each of these into two components: 1) the set of capital letters (2 or 3 letters, separated by commas), and 2) everything after the last "[A-Z][A-Z], ".
The results should be
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, DEF" "2 stems"
[[4]]
[1] "DE" "other comments, and stuff"
I tried strsplit(x, "[A-Z][A-Z], [a-z0-9]") and strsplit(x, "(?:[A-Z][A-Z], )[a-z0-9]"), both of which returned
[[1]]
[1] "ABC"
[[2]]
[1] "ABC, EF"
[[3]]
[1] "ABC, D" " stems"
[[4]]
[1] "" "ther comments, and stuff"
The identification of where to split depends on a combination of the end of the first substring and the beginning of the second substring, and so those parts get excluded from the final result.
Any help appreciated in splitting as indicated above while including the relevant parts of the split regex in each substring!
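One option is str_split with a lookahead: the comma and space are consumed as the separator, while the following lower-case letter or digit is only checked, not removed. n = 2 limits the result to two pieces: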
library(stringr)
str_split(x, ", (?=[a-z0-9])", n = 2)
#[[1]]
#[1] "ABC"
#[[2]]
#[1] "ABC, EF"
#[[3]]
#[1] "ABC, DEF" "2 stems"
#[[4]]
#[1] "DE" "other comments, and stuff"
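The same idea can be sketched in base R without stringr. The helper name split_first is hypothetical; it locates the first ", " that is followed by a lower-case letter or digit and cuts the string there.

```r
x <- c("ABC", "ABC, EF", "ABC, DEF, 2 stems", "DE, other comments, and stuff")

# Hypothetical helper: split a single string at the first ", " that is
# followed by a lower-case letter or a digit (checked via lookahead).
split_first <- function(s) {
  pos <- as.integer(regexpr(", (?=[a-z0-9])", s, perl = TRUE))
  if (pos == -1) return(s)  # no split point: return the string unchanged
  c(substr(s, 1, pos - 1),          # everything before the comma
    substr(s, pos + 2, nchar(s)))   # everything after ", "
}

res <- lapply(x, split_first)
res
```

Because regexpr only reports the first match, later commas such as the one in "other comments, and stuff" are left untouched, which mirrors the effect of n = 2.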

Filter list in R which has nchar > 1

I have a list of names
> x <- c("Test t", "Cuma Nama K", "O", "Test satu dua t")
> name <- strsplit(x, " ")
> name
[[1]]
[1] "Test" "t"
[[2]]
[1] "Cuma" "Nama" "K"
[[3]]
[1] "O"
[[4]]
[1] "Test" "satu" "dua" "t"
How can I filter a list so that it can become like this?
I am trying to find out how to filter the list which has nchar > 1
> name
[[1]]
[1] "Test"
[[2]]
[1] "Cuma" "Nama"
[[4]]
[1] "Test" "satu" "dua"
lapply(name, function(x) x[nchar(x)>1])
Results in:
[[1]]
[1] "Test"
[[2]]
[1] "Cuma" "Nama"
[[3]]
character(0)
[[4]]
[1] "Test" "satu" "dua"
We can loop over the list elements, subset the elements that have nchar greater than 1, and use Filter to remove the list elements that have 0 elements left:
Filter(length,lapply(name, function(x) x[nchar(x) >1 ]))
#[[1]]
#[1] "Test"
#[[2]]
#[1] "Cuma" "Nama"
#[[3]]
#[1] "Test" "satu" "dua"
If we want to remove the words with one character from the string, we can also do this without splitting
setdiff(gsub("(^| ).( |$)", "", x), "")
#[1] "Test" "Cuma Nama" "Test satu dua"
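If single-character words can also occur next to each other or in the middle of a string, a split-filter-paste approach may be more predictable than a single regex, since the regex above consumes the separating space along with each match. The following is a sketch in base R; the extra test string "a b cd" is a made-up example, not part of the original data.

```r
x <- c("Test t", "Cuma Nama K", "O", "Test satu dua t",
       "a b cd")  # last element is a hypothetical extra case

words <- strsplit(x, " ", fixed = TRUE)

# Keep only words longer than one character
kept <- lapply(words, function(w) w[nchar(w) > 1])

# Drop strings that have no words left, then glue the rest back together
kept <- kept[lengths(kept) > 0]
cleaned <- vapply(kept, paste, character(1), collapse = " ")
cleaned
```

Unlike setdiff, this keeps duplicate results if two input strings reduce to the same text.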

Extract a fixed-length character in R

I have an attribute consisting of DNA sequences and would like to translate them into amino acid names.
So I need to split each sequence into fixed-length chunks of 3 characters.
Here is the sample of the data
data=c("AATAGACGT","TGACCC","AAATCACTCTTT")
How can I extract it into:
[1] "AAT" "AGA" "CGT"
[2] "TGA" "CCC"
[3] "AAA" "TCA" "CTC" "TTT"
So far I can only find how to split a string given a certain regex as the separator
Try
strsplit(data, '(?<=.{3})', perl=TRUE)
Or
library(stringi)
stri_extract_all_regex(data, '.{1,3}')
Another solution, still one liner, but less elegant than the other ones (using lapply):
lapply(data, function(u) substring(u, seq(1, nchar(u), 3), seq(3, nchar(u),3)))
#[[1]]
#[1] "AAT" "AGA" "CGT"
#[[2]]
#[1] "TGA" "CCC"
#[[3]]
#[1] "AAA" "TCA" "CTC" "TTT"
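Using gsub (note that this inserts a space after each triplet rather than splitting the string into separate elements):
as.list(gsub("(.{3})", "\\1 ", data))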
[[1]]
[1] "AAT AGA CGT "
[[2]]
[1] "TGA CCC "
[[3]]
[1] "AAA TCA CTC TTT "
or
regmatches(data, gregexpr(".{3}", data))
[[1]]
[1] "AAT" "AGA" "CGT"
[[2]]
[1] "TGA" "CCC"
[[3]]
[1] "AAA" "TCA" "CTC" "TTT"
Another:
library(gsubfn)
strapply(data, "...")
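For sequences whose length is not a multiple of three, the substring approach can be adapted so that a trailing partial chunk is kept. This is a minimal sketch; the helper name split_fixed and its width argument are illustrative, and it assumes non-empty input strings.

```r
data <- c("AATAGACGT", "TGACCC", "AAATCACTCTTT")

# Hypothetical helper: cut a single non-empty string into chunks of
# `width` characters; a shorter final chunk is kept as is.
split_fixed <- function(s, width = 3) {
  starts <- seq(1, nchar(s), by = width)
  substring(s, starts, pmin(starts + width - 1, nchar(s)))
}

codons <- lapply(data, split_fixed)
codons

# A length that is not a multiple of 3 leaves a shorter last element
partial <- split_fixed("AATAG")
partial
```

Whether a trailing partial codon should be kept or dropped depends on the downstream translation step, so treat that choice as an assumption of this sketch.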
