I'm writing a function (NextWordPrediction) in R to predict the next word given some words. The basic structure is as follows:
If input exists in dat such that nrow(dat) != 0 return input and answer
If input doesn't exist such that nrow(dat) == 0, recurse and attempt input-1 (e.g. if input is "hello great world", try "great world", and so on) until nrow(dat) != 0
If after step 2 nrow(dat) == 0 return string "Word not in dictionary. We added this to our database!" and add original input to dataset
Here is the full code:
NextWordPrediction <- function(input) {
dat <- training %>%
filter(., N_gram == str_count(input, "\\S+") + 1) %>%
filter(grepl(paste("^", tolower(str_squish(input)), sep = ""), Word)) %>%
arrange(., desc(Prop))
if (nrow(dat) != 0) {
assign("training",
training %>%
mutate(Frequency = ifelse(Word == input &
N_gram == str_count(input, "\\S+"),
Frequency + 1,
Frequency)) %>%
group_by(., N_gram) %>%
mutate(., Prop = Frequency/ sum(Frequency)) %>%
data.frame(.),
envir = .GlobalEnv)
val <- dat$Word_to_Predict[1]
ans <- paste(str_squish(input), val)
return(list(ans, head(dat,5)))
} else if (nrow(dat) == 0 & word(input, 1) != "NA") {
input_1 <- Reduce(paste, word(input, 2:str_count(input,"\\S+")))
return(NextWordPrediction(input_1))
} else if (nrow(dat) == 0 & word(input, 1) == "NA") {
assign("training",
training %>%
add_row(., Word = tolower(input), Frequency = 1, N_gram = str_count(input, "\\S+")),
envir = .GlobalEnv)
ans <- paste("Word not in dictionary. We added this to our database!")
return(ans)
}
}
The issue I'm having happens somewhere between step 2 and 3. If input is not found after the recursion call, the added input to the database is input-1 ("great world") where I'd like the original input ("hello great world"). This is my first attempt to implement recursion and would like to understand the mistake in my code.
Thanks :)
Update to be Reproducible:
library(dplyr); library(stringr)
training <- data.frame(Word = c("hello", "she was great", "this is", "long time ago in"), Frequency = c(4, 3, 10, 1),
N_gram = c(1, 3, 2, 4), Prop = c(4/18, 3/18, 10/18, 1/18), Word_to_Predict = c(NA, "great", "is", "in"))
NextWordPrediction("she was") ## returns "she was" & "great"
NextWordPrediction("hours ago") ## returns "hours ago" & "in"
NextWordPrediction("words not in data") ## returns "Word not in dictionary. We added this to our database!" after trying "not in data", "in data" and adds "words not in data" to dataset
Here is an imperfect and overly-complicated demonstration of a recursive function operating on strings. Ideally there are some more safeguards that could be put into place, and there are of course much faster, more efficient, smarter ways of doing this one task, but ... perhaps you'll get the point.
I'm going to change every "e" to an "a", one word at a time.
e_to_a <- function(strings) {
# unnecessarily complex
message("# Called : ", sQuote(strings))
if (!nzchar(strings)) return(strings)
word1 <- sub("^([^[:space:]]*)[[:space:]]?.*", "\\1", strings)
others <- sub("^[^[:space:]]*[[:space:]]?", "", strings)
message("# - word1 : ", sQuote(word1))
message("# - others: ", sQuote(others))
# operate on the first word
word1 <- gsub("e", "a", word1)
if (nzchar(others)) {
others <- e_to_a(others)
return(paste(word1, others))
} else {
return(word1)
}
}
In action:
e_to_a("hello great world")
# # Called : 'hello great world'
# # - word1 : 'hello'
# # - others: 'great world'
# # Called : 'great world'
# # - word1 : 'great'
# # - others: 'world'
# # Called : 'world'
# # - word1 : 'world'
# # - others: ''
# [1] "hallo graat world"
The key is that when you make the recursive call, what you're currently doing
return(NextWordPrediction(input_1))
is going to return just the recursive part, dismissing the first word. That would be analogous to me doing
if (nzchar(others)) {
others <- e_to_a(others)
# return(paste(word1, others))
return(others)
} else {
return(word1)
}
I hope you can apply this to your function.
Bottom line, since your question is not reproducible, I'll guess that your fix is something like:
} else if (nrow(dat) == 0 & word(input, 1) != "NA") {
input_vec <- str_split(input, "\\s+")[[1]] # str_split() returns a list; take its first element
input_firstword <- input_vec[1]
input_otherwords <- paste(input_vec[-1], collapse = " ")
return(paste(input_firstword, NextWordPrediction(input_otherwords)))
} else if (nrow(dat) == 0 & word(input, 1) == "NA") {
Stream-of-consciousness answer. It doesn't solve anything, but it highlights some areas where code can or must be changed. Up front: == NA fails; you're always discarding the first word in recursion; NA (the object meaning "could be anything") is being coerced into "NA", the literal string.
Starting with a fresh training, I'll debug(NextWordPrediction) and trace line-by-line. It gets to input_1 <- ..., the first thing I notice is:
first time, input_1 is "great world";
next time, it is "world";
next time, it is "na world", fail.
This is a classic fail on two counts:
the code assumes that there are multiple words, even though str_count(input,"\\S+") returns 1 here; and
it is a common mistake to assume that 2:... is always increasing and will not go over a certain count, but unfortunately 2:1 returns c(2L, 1L) ... perhaps you should check the length of your vectors before arbitrarily counting past them.
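A minimal sketch of that second pitfall, using only base R. The variable n is just a stand-in for the word count; it is not part of the original function.

```r
n <- 1                   # stand-in for str_count(input, "\\S+") on a one-word input

down <- 2:n              # the colon operator counts DOWN when n < 2
print(down)

safe <- seq_len(n)[-1]   # empty when there is no second word
print(safe)
```

With seq_len(n)[-1] an input that has no second word yields an empty index vector instead of indices that run past the end.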
I think you're trying to guard against this with your previous test of word(input,1) != NA (which is also a mistake), but the only time that's going to happen is when input is 0-length vector (character(0)), not empty-string "". You won't get that with the current code, and I think your intent is for it to reduce to "".
I'm going to change your word(input, 2:str_count(...)) to
input_1 <- sub("^\\S*\\s?", "", input)
You have word(input, 1) != "NA" (and ==). Either that mistakes R's native NA object for a string, or you think you should be checking for the literal string "NA"; granted, English doesn't use that much as a real word, but some languages do. I'm not certain if you intend that to be the NA literal or if for some reason your function will convert NA to "NA" and you want to guard against that.
That last assumption is fixing a symptom, not a problem. Never allow your function to return "NA" (this happens here in a couple of places), you need to guard against it. To me, it is perfectly reasonable to see a word "NA" and differentiate it from the R native NA. Data missingness is important to differentiate.
Assuming you meant != NA instead ... word(input, 1) != NA will never work. Let's run through some examples:
word("hello", 1)
# [1] "hello"
word("", 1)
# [1] ""
word(c(), 1)
# Warning in rep(string, length.out = n) :
# 'x' is NULL so the result will be NULL
# Error in mapply(function(word, loc) word[loc, "start"], words, start) :
# zero-length inputs cannot be mixed with those of non-zero length
word(character(0), 1)
# [1] NA
Okay, so it can return an NA, when the input vector is a 0-length character vector, but ...
word(character(0), 1) == NA
# [1] NA
word(character(0), 1) == NA_character_
# [1] NA
That's right, you cannot check for NA-ness that way. (Did you know that there are several kinds of NA, such as NA, NA_integer_, NA_real_, NA_character_, and NA_complex_? They are not the same: identical(NA, NA_real_) is FALSE.)
Use is.na(.):
is.na(word(character(0), 1))
# [1] TRUE
(That's assuming we can see it in normal operation.)
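The same point as a small base-R sketch that does not depend on word(); the variable names here are only illustrative.

```r
x <- NA_character_

eq_test <- x == NA                     # comparing with NA yields NA, never TRUE
na_test <- is.na(x)                    # the reliable test for missingness
same_kind <- identical(NA, NA_real_)   # logical NA versus double NA

print(eq_test)
print(na_test)
print(same_kind)
```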
I'm going to change that if condition to:
} else if (nrow(dat) == 0 && nzchar(input) && !is.na(word(input, 1))) {
We're getting closer. Now I can get into the third call of the function, where input is finally "" and we go into the first conditional block, assigning the new content to training. Unfortunately, dat$Word_to_Predict[1] is NA, so your ans is " NA", which just doesn't seem logical. Granted, your default training dataset has this explicitly, and while I don't know what you mean to happen here, I suggest stringifying an R object of NA into " NA" seems wrong.
I don't have a fundamental fix to this flow, though: you want to concatenate the val found with the previous input string, but ... if Word_to_Predict is NA (not a normal string), then ... what do you do? For the sake of moving forward, I'll dismiss concatenating "NA" onto a string ... though it's producing results that are "wrong" from a linguistic standpoint, I believe. (I'll just interpret "NA" as "(I don't have a great value for this spot)" or similar :-)
You are always pasteing a squished input with val, but ... if input is "", then paste still adds a space between them, which seems unnecessary. You can always "patch" this later by repeatedly squishing the strings, but ... symptom/problem again. I suggest instead using
ans <- str_squish(paste(input, val))
And my original point ...
When you start with "she was", it will find something on the first invocation, and we paste the input with the val to get the answer. However, when you have to go into recursion, you call the function again with the rest of the sentence and completely discard the first word. For instance:
NextWordPrediction("hello great world")
#1> `input` is "hello great world", second `if` block, `input_1` is "great world"
#2> `input` is "great world", second `if` block, `input_1` is "world"
#3> `input` is "world", second `if` block, `input_1` is `""`
#4> `input` is "", first `if` block, `val` is `NA`, and `ans` is "NA"
#3> blindly returns list("NA", head(dat)) (discarding "world")
#2> blindly returns list("NA", head(dat)) (discarding "great")
#1> blindly returns list("NA", head(dat)) (discarding "hello")
Do you see the problem now? Instead of return(NextWordPrediction(input_rest)), you need to capture the result, prepend the word you stripped from input, and continue passing the updated return value up the chain. I suggest
input_1 <- gsub("\\s\\S*", "", input)
input_rest <- sub("^\\S*\\s?", "", input)
out <- NextWordPrediction(input_rest)
out[[1]] <- str_squish(paste(input_1, out[[1]]))
return(out)
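To see the "capture, prepend, pass it up" pattern in isolation, here is a toy sketch in base R. The lookup vector and the function predict_next are made up for illustration; they are not your training data or your function.

```r
# Hypothetical toy dictionary: known phrase -> next word
lookup <- c("great world" = "tour", "world" = "cup")

predict_next <- function(input) {
  if (!nzchar(input)) return(NA_character_)   # nothing left to try
  if (input %in% names(lookup)) {
    return(paste(input, lookup[[input]]))     # found: phrase plus prediction
  }
  first <- sub("\\s.*$", "", input)           # the word being stripped
  rest <- sub("^\\S*\\s?", "", input)         # the remaining words
  out <- predict_next(rest)                   # capture the inner result ...
  if (is.na(out)) NA_character_ else paste(first, out)   # ... and prepend
}

res <- predict_next("hello great world")
print(res)
```

Each level puts back the word it removed, so the outermost call returns the full sentence plus the predicted word.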
After all of that, I now see
NextWordPrediction("hello great world")
# [[1]]
# [1] "hello great world NA"
# [[2]]
# Word Frequency N_gram Prop Word_to_Predict
# 1 hello 4 1 1 <NA>
which, according to your initial training, is correct.
Unfortunately, this breaks something else.
"words not in data" always eventually matches something (as will anything not in training), since it reduces to an empty string "", and your first logic of grepl(paste("^", tolower(str_squish(input)), sep = ""), Word) will always match something with input of "".
We can fix this with a simple additional condition in your first filtering:
filter(nzchar(input) & grepl(paste("^", tolower(str_squish(input)), sep = ""), Word)) %>%
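A quick base-R check of why the empty string matches everything, and what the extra condition changes. The words vector is only a stand-in for the Word column.

```r
words <- c("hello", "she was great", "this is")   # stand-in for training$Word
input <- ""

pattern <- paste("^", tolower(input), sep = "")   # collapses to just "^"
unguarded <- grepl(pattern, words)                # "^" matches every string
guarded <- nzchar(input) & grepl(pattern, words)  # empty input matches nothing

print(unguarded)
print(guarded)
```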
And finally, when you get to the final if block when you need to add data to training, if this is the first/outer call of the function, then input truly reflects the entire sentence, which is what you want. However, if you've done one or more calls of recursion, then input is merely one word in the chain, not the entire thing. And due to some of the assumptions above, at this stage input is "", so ... any addition would be useless.
There are two strategies for dealing with this:
Keep track of whether this is the outer (first) call or some inner call. When you recursively call, check the return value ... if empty and this is an inner call, return empty; if empty and this is the first/outer call, then append to training; or
Always pass the entire string along with the current input. This would reverse my recommendation in bullet 6 above, so your second if block would just call NextWordPrediction(input_rest, input_1) (using my variables) and not str_squish after it. The squishing/pasting would be handled in the first if block, where you would need to prepend the value (if any) of preceding.
NextWordPrediction <- function(input, preceding = "") {
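A rough sketch of a variant of the second strategy: instead of the preceding words, carry the untouched original string through the recursion. Everything here (known, db, predict_or_add, the original argument) is hypothetical and only stands in for your training data and function.

```r
known <- c("world" = "cup")     # hypothetical dictionary: phrase -> next word
db <- new.env()
db$added <- character(0)        # stands in for rows appended to training

predict_or_add <- function(input, original = input) {
  if (input %in% names(known)) {
    return(paste(original, known[[input]]))   # answer uses the full sentence
  }
  if (!grepl("\\s", input)) {                 # the last remaining word failed too
    db$added <- c(db$added, original)         # store the ORIGINAL input
    return("Word not in dictionary. We added this to our database!")
  }
  predict_or_add(sub("^\\S*\\s+", "", input), original)   # drop the first word
}

hit <- predict_or_add("hello great world")
miss <- predict_or_add("words not in data")
print(hit)
print(db$added)
```

Because original is passed along unchanged, the innermost call still knows the whole sentence when it has to add something.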
Side notes, not wrong per se but still not good.
& (single) in an if condition works but is bad practice: & does vector logic, which means it can return vectors of length other than 1; if conditions must be length exactly 1, not 0 or 2 or more. Use && here.
Reduce(paste, ...) is just unnecessary. Use paste(...).
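Both side notes in a few lines of base R, as a sketch. Note that the plain paste() replacement needs collapse to join a vector into one string.

```r
words <- c("great", "big", "world")

via_reduce <- Reduce(paste, words)           # works, but roundabout
via_paste <- paste(words, collapse = " ")    # same result

vec <- c(TRUE, FALSE) & TRUE   # & is element-wise, so this has length 2
print(via_paste)
print(vec)
```

In recent versions of R, a condition of length 2 such as vec is an error inside if (...), which is one more reason to prefer && there.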
After understanding the implications of recursion in my function thanks to @r2evans, I realized that a recursive solution would be too complicated. The following code meets all my conditions and works as expected:
NextWordPrediction <- function(input) {
dat <- training %>%
filter(., N_gram == str_count(input, "\\S+") + 1) %>%
filter(grepl(paste("^", tolower(str_squish(input)), sep = ""), Word)) %>%
arrange(., desc(Prop))
if (nrow(dat) != 0) {
assign("training",
training %>%
mutate(Frequency = ifelse(Word == input &
N_gram == str_count(input, "\\S+"),
Frequency + 1,
Frequency)) %>%
group_by(., N_gram) %>%
mutate(., Prop = Frequency/ sum(Frequency)) %>%
data.frame(.),
envir = .GlobalEnv)
val <- dat$Word_to_Predict[1]
ans <- paste(str_squish(input), val)
return(list(ans, head(dat,5)))
} else {
for (i in 2:str_count(input, "\\S+")) {
input_1 <- word(input, start = i, end = str_count(input,"\\S+"))
dat <- training %>%
filter(., N_gram == str_count(input_1, "\\S+") + 1) %>%
filter(grepl(paste("^", tolower(str_squish(input_1)), sep = ""), Word)) %>%
arrange(., desc(Prop))
if (nrow(dat) != 0) {
val <- dat$Word_to_Predict[1]
ans <- paste(str_squish(input), val)
return(list(ans, head(dat,5)))
} else if (nrow(dat) == 0 & i == str_count(input, "\\S+")) {
assign("training",
training %>%
add_row(., Word = tolower(input), Frequency = + 1, N_gram = str_count(input, "\\S+"),
Word_to_Predict = word(input, -1)) %>%
group_by(., N_gram) %>%
mutate(., Prop = Frequency/ sum(Frequency)) %>%
data.frame(.),
envir = .GlobalEnv)
ans <- paste("Word not in dictionary. We added this to our database!")
return(ans)
}
}
}
}
It loops through input-1 until a value is found in the dataframe and when this happens an answer is returned, otherwise we add the original input to the dataframe.
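For reference, the sequence of shortened inputs that such a loop walks through can be produced in one step. This is only a sketch in base R; suffixes is a made-up helper name, not part of the function above.

```r
suffixes <- function(input) {
  w <- strsplit(trimws(input), "\\s+")[[1]]
  # i never exceeds length(w), so i:length(w) always counts upward
  vapply(seq_along(w),
         function(i) paste(w[i:length(w)], collapse = " "),
         character(1))
}

s <- suffixes("hello great world")
print(s)
```

Iterating over seq_along(w) also covers one-word inputs without the 2:1 problem.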
I have the R code below.
OBJECTIVE: I am trying to check whether each string in the kind object can be composed from the characters of the word object, by iterating over both and comparing character positions. If it can, the code returns POSITIVE, otherwise NEGATIVE.
PROBLEM STATEMENT:
If the strings in kind are short, e.g. c('abcde','crnas','onarous','ravus'), I get a fast response. If the strings in kind are long (around 10^5 characters each), e.g. c('cdcdc.....{1LCharacters}','fffw....{1LCharacters}','efefefef..{1LCharacters}'), processing takes much longer. Is there a better way to write this so that the run time stays relatively small?
Suggestions and corrections are highly appreciated.
word <- "coronavirus"
total <- "3"
kind <- c('abcde','crnas','onarous','ravus')
invisible(lapply(kind,function(x) {
if (length(x) > length(word)) {
cat("NEGATIVE",sep='\n')
}
index=1;
for (i in seq(from=1,to=nchar(word)-1,by=1)) {
if(substr(word,i,i) == substr(x,index,index))
{
index<-index+1;
}
}
if (index == nchar(x))
{
cat("POSITIVE",sep='\n')
}
else
{
cat("NEGATIVE",sep='\n')
}
}))
Output :
NEGATIVE
POSITIVE
NEGATIVE
POSITIVE
You could also do:
vals <- attr(adist(kind, word,counts = TRUE), 'counts')[,,3]
ifelse(vals>0, 'NEGATIVE', 'POSITIVE')
[1] "NEGATIVE" "POSITIVE" "NEGATIVE"
Update
If you want to print the result vertically, you can try cat like below
cat(
paste0(c("NEGATIVE", "POSITIVE")[
1 +
sapply(
gsub("(?<=.)(?=.)", ".*", kind, perl = TRUE),
grepl,
x = word
)
], collapse = "\n"),
"\n"
)
which gives
NEGATIVE
POSITIVE
NEGATIVE
I guess you can try gsub + grepl like below
c("NEGATIVE", "POSITIVE")[
1 +
sapply(
gsub("(?<=.)(?=.)", ".*", kind, perl = TRUE),
grepl,
x = word
)
]
which gives
[1] "NEGATIVE" "POSITIVE" "NEGATIVE"
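If the regular expression gets slow or fragile on very long strings (characters such as "." or "*" in kind would be read as regex syntax), an explicit single pass is another option. This is a sketch in base R; is_subsequence is a made-up helper, and it assumes that "composite" means the characters of x appear in word in the same order.

```r
is_subsequence <- function(x, word) {
  xs <- utf8ToInt(x)       # code points of the candidate string
  ws <- utf8ToInt(word)    # code points of the reference word
  j <- 1L                  # next position in xs that still needs a match
  for (code in ws) {
    if (j > length(xs)) break
    if (code == xs[j]) j <- j + 1L
  }
  j > length(xs)           # TRUE when every character of x was matched
}

kind <- c("abcde", "crnas", "onarous", "ravus")
res <- ifelse(vapply(kind, is_subsequence, logical(1),
                     word = "coronavirus", USE.NAMES = FALSE),
              "POSITIVE", "NEGATIVE")
cat(res, sep = "\n")
```

Each string is scanned once, so the work grows linearly with the length of word.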
Could you please help me?
I'm trying to modify an R function written by a colleague. This function receives a character vector with scientific names (Latin binomials), just like this one:
Name
Cerradomys scotti
Oligoryzomys sp
Philander frenatus
Byrsonima sp
Campomanesia adamantium
Cecropia pachystachya
Cecropia sp
Erythroxylum sp
Ficus sp
Leandra aurea
Then, it should abbreviate the scientific names, using only the first three letters of the genus (first term) and the epithet (second term) to make a short code. For instance, Cerradomys scotti should become Cersco.
This is the original function:
AbbreviatedNames <- function(vector) {
abbreviations <- character(length = length(vector))
splitnames <- strsplit(vector, " ")
for (i in 1:length(vector)) {
vector[i] <- if(splitnames[[i]][2] == "^sp") {
paste(substr(splitnames[[i]][1],1,3),
splitnames[[i]][2], sep = "")
}
else {
paste(substr(splitnames[[i]][1],1,3),
substr(splitnames[[i]][2],1,3), sep = "")
}
}
vector
}
With a simple list like that one, the function works perfectly. However, when the list has some missing or extra elements, it does not work. The loop stops when it meets the first row that does not match the pattern. Let's take this more complex list as an example:
Name
Cerradomys scotti
Oligoryzomys sp
Philander frenatus
Byrsonima sp
Campomanesia adamantium
Cecropia pachystachya
Cecropia sp
Erythroxylum sp
Ficus sp
Leandra aurea
Morfosp1
Vismia cf brasiliensis
See that Morfosp1 has only 1 term. And Vismia cf brasiliensis has an additional term (cf) in the middle.
I've tried adapting the function, for instance, this way:
AbbreviatedNames <- function(vector) {
abbreviations <- character(length = length(vector))
splitnames <- strsplit(vector, " ")
for (i in 1:length(vector)) {
vector[i] <- if(splitnames[[i]][2] == "^sp" & is.na(splitnames[[i]][2])) {
paste(substr(splitnames[[i]][1],1,3),
splitnames[[i]][2], sep = "")
}
else {
paste(substr(splitnames[[i]][1],1,3),
substr(splitnames[[i]][2],1,3), sep = "")
}
}
vector
}
Nevertheless, it does not work. I get this error message:
Error in if (splitnames[[i]][2] == "^sp" & is.na(splitnames[[i]][2])) { :
missing value where TRUE/FALSE needed
How could I make the function:
Deal also with names that have only 1 term?
Expected outcome: Morfosp1 -> Morfosp1 (stays the same)
Deal also with names that have an additional term in the middle?
Expected outcome: Vismia cf brasiliensis -> Visbra (term in the middle is ignored)
Thank you very much!
Something like this is pretty concise:
test <- c("Cerradomys scotti", "Oligoryzomys sp", "Latingstuff", "Latin staff more")
# function to truncate a given name
trunc_str <- function(latin_name) {
# split it on a space
name_split <- unlist(strsplit(latin_name, " ", fixed = TRUE))
# if one name, just return it
if (length(name_split) == 1) return(name_split)
# truncate to first 3 letters
name_trunc <- substr(name_split, 1, 3)
# paste the first and last term together (skipping any middle ones)
paste0(head(name_trunc, 1), tail(name_trunc, 1))
}
# iterate over all
vapply(test, trunc_str, "")
# Cerradomys scotti Oligoryzomys sp Latingstuff Latin staff more
# "Cersco" "Olisp" "Latingstuff" "Latmor"
If you don't want a named vector output, you can use USE.NAMES = FALSE in vapply(). Or feel free to use a loop here.
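For example, a sketch with USE.NAMES = FALSE on the two problem cases from the question. The helper shorten is just a compact restatement of the function above, not a different method.

```r
shorten <- function(latin_name) {
  parts <- strsplit(latin_name, " ", fixed = TRUE)[[1]]
  if (length(parts) == 1) return(parts)        # single term stays as it is
  # first three letters of the first and of the last term
  paste0(substr(parts[1], 1, 3), substr(parts[length(parts)], 1, 3))
}

codes <- vapply(c("Cerradomys scotti", "Morfosp1", "Vismia cf brasiliensis"),
                shorten, character(1), USE.NAMES = FALSE)
print(codes)
```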
AbbreviatedNames <- function(vector) {
abbreviations <- character(length = length(vector))
splitnames <- strsplit(vector, " ")
for (i in 1:length(vector)){
# One name
if(length(splitnames[[i]])==1){
vector[i] <- splitnames[[i]][1] # a single term (e.g. Morfosp1) is returned unchanged
}
# Two names
else if(length(splitnames[[i]])==2){
vector[i] <- if(splitnames[[i]][2] == "^sp") {
paste(substr(splitnames[[i]][1],1,3),
splitnames[[i]][2], sep = "")
}
else {
paste(substr(splitnames[[i]][1],1,3),
substr(splitnames[[i]][2],1,3), sep = "")
}
}
# Three names
else if(length(splitnames[[i]])==3){
vector[i] <- paste(substr(splitnames[[i]][1],1,3),
substr(splitnames[[i]][3],1,3), sep = "")
# Assuming that the unwanted word is always in the middle
}
}
return(vector)
}
I tested it on the list you gave and it seems to work. Tell me if you need more general code.
Thank you very much for the help, Ricardo and Adam! I've made the code available on GitHub to other people who work with interaction networks, and need to abbreviate scientific names to be used in graphs.
I'm searching R scripts and not sure why as.character() drops the `[`. Is there some way to get the code back correctly as a vector of strings?
Notice the `[`(. < 5) turns into (. < 5)[]
Note: I'm not looking for better ways to make this call as this isn't my code.
code <-
"1:10 %>% `[`(. < 5) %>% mean()
a <- 1:3"
# fine
parse(text = code)
#> expression(1:10 %>% `[`(. < 5) %>% mean(), a <- 1:3)
# not fine
as.character(parse(text = code))
#> [1] "1:10 %>% (. < 5)[] %>% mean()"
#> [2] "a <- 1:3"
Created on 2020-07-01 by the reprex package (v0.3.0)
The issue is that I need to substitute out parts of the code so that my function works. The function finds reactive commands and changes them to functions for the user to access in their environment. It's part of my shinyobjects package (shameless plug).
a <- reactive({
input$n * 100
})
and converts it to
a <- function() {
input$n *100
}
The methods I have been using have been fine until this edge case with the `[`.
The solution should be able to return each expression as something I can manipulate. This is a more complex example and should return a string vector of length 5. I'm also happy to take this discussion offline as I'm open to a better method overall for this functionality. You can find my contact here
code <-
'library(tidyverse)
library(shiny)
1:10 %>% `[`(. < 5) %>% mean()
df <- reactive({
mpg %>%
filter(cty > input$cty)
})
renderPlot(
ggplot(df(), aes(class)) +
geom_bar()
)'
(I've edited this a bit to explain the behaviour some more):
The problem is that magrittr's pipe operator uses non-standard evaluation inconsistently.
The expression
`[`(. < 5)
is legal R code that is equivalent to what was deparsed:
(. < 5)[]
However, it's a weird enough expression that magrittr gets confused by it, and doesn't transform
1:10 %>% (. < 5)[]
the same way it would transform
1:10 %>% `[`(. < 5)
I wouldn't call this a bug in magrittr (it's documented behaviour, if you look closely enough), but it's certainly an inconvenience caused by the inconsistent handling of dots. Normally if you put a dot in a term in a magrittr chain, that's the only place that the previous result is put in. For example, this doesn't print "foobar" twice:
"foobar" %>% cat("arg1", ., "arg3")
However, if the dot is in a function call in the chain, it is also inserted at the start:
"foobar" %>% cat("arg1", identity(.), "arg3")
does print it twice.
magrittr is evaluating 1:10 %>% `[`(. < 5) as
`[`(1:10, 1:10 < 5)
i.e.
(1:10)[1:10 < 5]
Really for consistency it would require you to type
1:10 %>% `[`(., . < 5)
but it is trying to be helpful, which is what is so unhelpful for what you want to do.
I suppose you could write a function to detect these cases yourself, and insert the extra dot explicitly.
Edited to add: Here's such a function:
explicitDots <- function(expr) {
nestedDot <- function(lang) {
if (is.call(lang)) {
for (i in seq_along(lang)) {
if (nestedDot(lang[[i]]))
return(TRUE)
}
return(FALSE)
} else
identical(lang, quote(.))
}
fixLang <- function(lang) {
if (is.call(lang)) {
fn <- lang[[1]]
if (as.character(fn) == "%>%") {
lang[[2]] <- fixLang(lang[[2]])
lang[[3]] <- fixLang(lang[[3]])
} else {
hasTopLevelDot <- FALSE
hasNestedDot <- FALSE
for (i in seq_along(lang)[-1]) {
if (identical(lang[[i]], quote(.))) {
hasTopLevelDot <- TRUE
break
}
hasNestedDot <- hasNestedDot || nestedDot(lang[[i]])
}
if (hasNestedDot && !hasTopLevelDot) {
# Insert a dot in position 2
lang <- lang[c(1,seq_along(lang))]
lang[[2]] <- quote(.)
}
}
}
lang
}
expr <- removeSource(expr)
for (i in seq_along(expr)) {
expr[[i]] <- fixLang(expr[[i]])
}
expr
}
And here's an example using it:
code <-
"1:10 %>% `[`(. < 5) %>% mean()
a <- 1:3"
p <- parse(text = code)
explicitDots(p)
which produced this output:
expression(1:10 %>% .[. < 5] %>% mean(), a <- 1:3)
If we need to get a vector of strings, one option is strsplit on the newline character followed by zero or more spaces
out <- strsplit(code, "\n\\s*")[[1]]
out
#[1] "1:10 %>% `[`(. < 5) %>% mean()"
#[2] "a <- 1:3"
sapply(out, function(x) eval(parse(text = x)))
#$`1:10 %>% `[`(. < 5) %>% mean()`
#[1] 2.5
#$`a <- 1:3`
#[1] 1 2 3
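Splitting on newlines breaks down once an expression spans several lines. One possible alternative, sketched here in base R, is to keep the source references when parsing and read the original text of each expression back from them. The name src is just illustrative, and keep.source = TRUE is passed explicitly because non-interactive sessions usually default to FALSE.

```r
code <- "1:10 %>% `[`(. < 5) %>% mean()\na <- 1:3"

# Parsing only builds the expressions; nothing is evaluated, so magrittr is not needed
p <- parse(text = code, keep.source = TRUE)

# Each top-level expression carries a srcref that points at its original text
src <- vapply(attr(p, "srcref"),
              function(s) paste(as.character(s), collapse = "\n"),
              character(1))
print(src)
```

Since the text comes from the source and not from deparsing, the backticks survive as typed.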
First of all, you should be aware of the "never use parse" rule -- there are always better ways.
Next, what you get back is an expression, not an object which tells you about an expression.
Similarly, you're explicitly barred from trying to pull this trick with a closure:
bar <- as.character(function(x) x+3)
Error in as.character(function(x) x + 3) :
cannot coerce type 'closure' to vector of type 'character'
Now,
foo <- parse(text = code)
as.character(deparse(foo))
[1] "structure(expression(1:10 %>% (. < 5)[] %>% mean(), a <- 1:3), srcfile = <environment>, wholeSrcref = structure(c(1L, "
[2] "0L, 3L, 0L, 0L, 0L, 1L, 3L), srcfile = <environment>, class = \"srcref\"))"
Should give you a feel for what you're dealing with.
But it's not clear why you want/need to take a string, parse it, and then try to get it back again. Either follow akrun's approach or take another route to turn poorly structured text into executable commands.
I need some pointers on this. Actually, I don't necessarily need a fully-fledged solution here - some pointers to functions and/or packages would be great.
The problem: I want to find specific sequences in a character vector. The sequences can be somewhat "underspecified". That means that some of the elements should be fixed, but for some elements it does not matter how long they are or what they are exactly.
An example: Suppose I want to find the following pattern in a character vector:
The sequence should begin with "Out of" or "out of"
The sequence should end with "reasons"
In between, there should be other elements. But it does not matter how many elements (also zero would be OK) and what the elements exactly are.
In between 1. and 2., there shouldn't be a ".", "!" or "?".
There should be a parameter that controls how long the sequence in 3. can maximally be to still produce a result.
Return value of the function should be the intervening elements and/or their indices in the vector.
So, the function should "behave" like this:
c("Out", "of", "specific", "reasons", ".") Return "specific"
c("Out", "of", "very", "specific", "reasons", ".") Return c("very", "specific")
c("out", "of", "curiosity", ".", "He", "had", "his", "reasons") Return "" or NA or NULL, which one doesn't matter - just a signal that there is no result.
As I said: I don't need a full solution. Any pointers to packages that already implement such functionality are appreciated!
Optimally, I don't want to rely on a solution that first pastes the text and then uses regex for matching.
Thanks a lot!
I would be really curious to learn of a package that serves your needs. My inclination would be to collapse the strings and use regular expressions or find a programmer or use perl. But here's one extensible solution in R with a few more cases to experiment on. Not very elegant, but see if this has some utility.
# Recreate data as a list with a few more edge cases
txt1 <- c(
"Out of specific reasons.",
"Out of very specific reasons.",
"Out of curiosity. He had his reasons.",
"Out of reasons.",
"Out of one's mind.",
"For no particular reason.",
"Reasons are out of the ordinary.",
"Out of time and money and for many good reasons, it seems.",
"Out of a box, a car, and for random reasons.",
"Floop foo bar.")
txt2 <- strsplit(txt1, "[[:space:]]+") # remove space
txt3 <- lapply(txt2, strsplit, "(?=[[:punct:]])", perl = TRUE) # separate punctuation marks into their own tokens
txt <- lapply(txt3, unlist) # create list of tokens from each line
# Define characters to exclude: [. ! and ?] but not [,]
exclude <- "[.!?]"
# Assign acceptable limit to separation
lim <- 5 # try 7 and 12 to experiment
# Create indices identifying each of the enumerated conditions
fun1 <- function(x, pat) grep(pat, x, ignore.case = TRUE)
index1 <- lapply(txt, fun1, "out")
index2 <- lapply(txt, fun1, "of")
index3 <- lapply(txt, fun1, "reasons")
index4 <- lapply(txt, fun1, exclude)
# Create logical vectors from indices satisfying the conditions
fun2 <- function(set, val) val[1] %in% set
cond1 <- sapply(index1, fun2, val = 1) & sapply(index2, fun2, val = 2)
cond2 <- sapply(index3, "[", 1) < lim + 2 + 2 # position of 'of' + 2
cond3 <- sapply(index3, max, -Inf) < sapply(index4, min, Inf)
# Combine logical vectors to a single logical vector
valid <- cond1 & cond2 & cond3
valid <- ifelse(is.na(valid), FALSE, valid)
# Examine selected original lines
print(txt1[valid])
# Helper function to extract the starting and the ending element
fun3 <- function(index2, index3, valid) {
found <- rep(list(NULL), length(index2))
found[valid] <- Map(seq, index2[valid], index3[valid])
found <- lapply(found, tail, -1)
found <- lapply(found, head, -1)
}
# Extract starting and ending element from valid list members
idx <- fun3(index2, index3, valid)
# Return the results or "" for no intervening text or NULL for no match
ans <- Map(function(x, i) {
if (is.null(i)) NULL # no match found
else if (length(i) == 0) "" # no intervening elements
else x[i]}, # all intervening elements <= lim
txt, idx)
# Show found (non-NULL) values
ans[!sapply(ans, is.null)]
So let's assume your example
x <- c("Out", "of", "very", "specific", "reasons", ".")
We first need to get the beginning of the indicator
i_Beginning <- as.numeric(grep("Out|out", x))
and the ending
i_end <- as.numeric(grep("reasons", x))
Need to also check that Out is followed by of
Is_Of <- grepl("Of|of", x[i_Beginning +1])
And if this is true we extract the other elements
if(Is_Of){
extraction <- x[seq_len(i_end - i_Beginning - 2) + i_Beginning + 1] # every element strictly between "of" and "reasons"
}
print(extraction)
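The conditions from the question can also be checked directly on the token vector, without pasting and without regular expressions. This is only a sketch in base R; find_between and max_len are made-up names, and it returns NULL when nothing matches.

```r
find_between <- function(tokens, max_len = 5) {
  n <- length(tokens)
  for (i in seq_len(max(n - 2, 0))) {
    if (tokens[i] %in% c("Out", "out") && tokens[i + 1] == "of") {
      j <- i + 2
      # walk forward until a sentence-ending mark or the end of the vector
      while (j <= n && !(tokens[j] %in% c(".", "!", "?"))) {
        if (tokens[j] == "reasons") {
          idx <- seq_len(j - i - 2) + i + 1      # positions strictly in between
          if (length(idx) <= max_len) return(tokens[idx]) else return(NULL)
        }
        j <- j + 1
      }
    }
  }
  NULL   # no match
}

r1 <- find_between(c("Out", "of", "specific", "reasons", "."))
r2 <- find_between(c("Out", "of", "very", "specific", "reasons", "."))
r3 <- find_between(c("out", "of", "curiosity", ".", "He", "had", "his", "reasons"))
print(r2)
```

Returning idx instead of tokens[idx] would give the positions, if those are what you need.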
I am trying to find a group of functions in R that would operate on word level. e.g. a function that could return the position of the word. For example given the following sentence and query
sentence <- "A sample sentence for demo"
query <- "for"
the function would return 4. for is 4th word.
It would be great if I could get a utility function that would allow me to extend query both in left and right direction.
e.g. extend(query, 'right') would return for demo and extend(query, 'left') would return sentence for
I have already gone through functions like grep, gregexpr, word from the stringr package and others. All seem to operate on character level.
If you use scan, it will split input at whitespace:
> s.scan <- scan(text=sentence, what="")
Read 5 items
> which(s.scan == query)
[1] 4
Need the what="" to tell scan to expect character rather than numeric input. Might need to replace punctuation using gsub with patt="[[:punct:]]" if your input is ever full English sentences. May also need to look at the tm (text mining) package if you are trying to classify parts of speech or handle large documents.
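A sketch of the punctuation step mentioned above, assuming it is acceptable to drop punctuation entirely. The example sentence is made up, and quiet = TRUE only suppresses the "Read 5 items" message.

```r
sentence <- "A sample sentence, for demo."
query <- "for"

clean <- gsub("[[:punct:]]", "", sentence)           # strip punctuation first
tokens <- scan(text = clean, what = "", quiet = TRUE)
pos <- which(tokens == query)
print(pos)
```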
As I mentioned in my comment, stringr is useful in these instances.
library(stringr)
sentence <- "A sample sentence for demo"
wordNumber <- 4L
fourthWord <- word(string = sentence,
start = wordNumber)
previousWords <- word(string = sentence,
start = wordNumber - 1L,
end = wordNumber)
laterWords <- word(string = sentence,
start = wordNumber,
end = wordNumber + 1L)
And this yields:
> fourthWord
[1] "for"
> previousWords
[1] "sentence for"
> laterWords
[1] "for demo"
I hope that helps you.
I have written my own functions. The indexOf method returns the index of the word if it is found in the sentence and -1 otherwise, very much like Java's indexOf():
indexOf <- function(sentence, word){
listOfWords <- strsplit(sentence, split = " ")
sentenceAsVector <- unlist(listOfWords)
if(word %in% sentenceAsVector == FALSE){
result=-1
}
else{
result = which(sentenceAsVector==word)
}
return(result)
}
The extend method works properly, but it is quite lengthy and doesn't look like idiomatic R code. If query is a word on the boundary of the sentence, i.e. the first word or the last word, the first two words or the last two words are returned.
extend <- function(sentence, query, direction){
listOfWords = strsplit(sentence, split = " ")
sentenceAsVector = unlist(listOfWords)
lengthOfSentence = length(sentenceAsVector)
location = indexOf(sentence, query)
boundary = FALSE
if(location == 1 | location == lengthOfSentence){
boundary = TRUE
}
else{
boundary = FALSE
}
if(!boundary){
if(location> 1 & direction == "right"){
return(paste(sentenceAsVector[location],
sentenceAsVector[location + 1],
sep=" ")
)
}
else if(location < lengthOfSentence & direction == "left"){
return(paste(sentenceAsVector[location - 1],
sentenceAsVector[location],
sep=" ")
)
}
}
else{
if(location == 1 ){
return(paste(sentenceAsVector[1], sentenceAsVector[2], sep = " "))
}
if(location == lengthOfSentence){
return(paste(sentenceAsVector[lengthOfSentence - 1],
sentenceAsVector[lengthOfSentence], sep = " "))
}
}
}
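A shorter variant is possible with vector indexing. This is a sketch, not a drop-in replacement: extend_query is a made-up name, it returns NA when the query is absent, and at a sentence boundary it returns only the query word instead of the two nearest words.

```r
extend_query <- function(sentence, query, direction = c("right", "left")) {
  direction <- match.arg(direction)
  words <- strsplit(sentence, "\\s+")[[1]]
  pos <- match(query, words)                 # first occurrence, NA if absent
  if (is.na(pos)) return(NA_character_)
  idx <- if (direction == "right") c(pos, pos + 1) else c(pos - 1, pos)
  idx <- idx[idx >= 1 & idx <= length(words)]   # drop positions outside the sentence
  paste(words[idx], collapse = " ")
}

right <- extend_query("A sample sentence for demo", "for", "right")
left <- extend_query("A sample sentence for demo", "for", "left")
print(c(right, left))
```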
The answer depends on what you mean by a "word". If you mean whitespace-separated token, then @imran-ali's answer works fine. If you mean word as defined by Unicode, with special attention to punctuation, then you need something more sophisticated.
The following handles punctuation correctly:
library(corpus)
sentence <- "A sample sentence for demo"
query <- "for"
# use text_locate to find all instances of the query, with context
text_locate(sentence, query)
## text before instance after
## 1 1 A sample sentence for demo
# find the number of tokens before, then add 1 to get the position
text_ntoken(text_locate(sentence, query)$before) + 1
## 4
This also works if there are multiple matches:
sentence2 <- "for one, for two! for three? for four"
text_ntoken(text_locate(sentence2, query)$before) + 1
## [1] 1 4 7 10
We can verify that this is correct:
text_tokens(sentence2)[[1]][c(1, 4, 7, 10)]
## [1] "for" "for" "for" "for"