Extract words from strings without spaces or delimiters using R

babybag - baby bag
badshelter - bad shelter
themoderncornerstore - the modern corner store
hamptonfamilyguidebook - hampton family guide book
Is there a way to use R to extract words from strings that do not have spaces or other delimiters? I have a list of URLs and I am trying to figure out which words are included in them.
input <- c("babybag", "badshelter", "themoderncornerstore", "hamptonfamilyguidebook")

Here is a naive approach that might give you inspiration. I used the hunspell library, but you could test substrings against any dictionary.
I start from the right, try every substring ending at the current position, and keep the longest one found in the dictionary, then move the end position to just before that word. It is quite slow, so I hope you don't have 4 million of those. "hampton" is not in this dictionary, so the last example is not split correctly:
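split_words <- function(x){
  words <- NULL
  j <- nchar(x)                      # end of the part still to be split
  while(j != 0){
    word <- NULL
    # try every substring ending at j; the last hit is the longest one
    for (i in j:1){
      candidate <- substr(x, i, j)
      # hunspell_find() reports no misspelling for a dictionary word
      if(!length(hunspell::hunspell_find(candidate)[[1]])) word <- candidate
    }
    if(is.null(word)) return("")     # nothing matched: give up
    words <- c(word, words)
    j <- j - nchar(word)
  }
  words
}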
input <- c("babybag", "badshelter", "themoderncornerstore", "hamptonfamilyguidebook")
lapply(input,split_words)
# [[1]]
# [1] "baby" "bag"
#
# [[2]]
# [1] "bad" "shelter"
#
# [[3]]
# [1] "the" "modern" "corner" "store"
#
# [[4]]
# [1] "h" "amp" "ton" "family" "guidebook"
#
Here's a quick fix, adding words manually to the dictionary:
split_words <- function(x, additional = c("hampton","otherwordstoadd")){
  words <- NULL
  j <- nchar(x)                      # end of the part still to be split
  while(j != 0){
    word <- NULL
    for (i in j:1){
      candidate <- substr(x, i, j)
      # words listed in `additional` are treated as correctly spelled
      if(!length(hunspell::hunspell_find(candidate, ignore = additional)[[1]])) word <- candidate
    }
    if(is.null(word)) return("")     # nothing matched: give up
    words <- c(word, words)
    j <- j - nchar(word)
  }
  words
}
input <- c("babybag", "badshelter", "themoderncornerstore", "hamptonfamilyguidebook")
lapply(input,split_words)
# [[1]]
# [1] "baby" "bag"
#
# [[2]]
# [1] "bad" "shelter"
#
# [[3]]
# [1] "the" "modern" "corner" "store"
#
# [[4]]
# [1] "hampton" "family" "guidebook"
#
You can only cross your fingers and hope there are no ambiguous expressions, though. Note that "guidebook" comes out as one word in my output, so there is already an edge case among your four examples.
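One way to reduce the ambiguity mentioned above is to search all possible splits instead of committing greedily to the longest word from the right. The following is a minimal sketch, not the method used in the answer: it assumes a small hand-made dictionary `dict` (hypothetical; a real word list would replace it) and uses dynamic programming to find a split with the fewest words. The function name `segment` is also made up for this illustration.

```r
# Hypothetical toy dictionary; in practice use a real word list.
dict <- c("baby", "bag", "bad", "shelter", "the", "modern", "corner",
          "store", "hampton", "family", "guide", "book", "guidebook")

segment <- function(x, dict) {
  n <- nchar(x)
  # cost[k + 1]: fewest dictionary words needed to cover the first k characters
  cost <- c(0, rep(Inf, n))
  # from[k + 1]: start index of the last word in that best split
  from <- integer(n + 1)
  for (end in seq_len(n)) {
    for (start in seq_len(end)) {
      piece <- substr(x, start, end)
      if (piece %in% dict && cost[start] + 1 < cost[end + 1]) {
        cost[end + 1] <- cost[start] + 1
        from[end + 1] <- start
      }
    }
  }
  # No split into dictionary words exists
  if (is.infinite(cost[n + 1])) return(character(0))
  # Walk the back-pointers from the end of the string to recover the words
  words <- character(0)
  end <- n
  while (end > 0) {
    start <- from[end + 1]
    words <- c(substr(x, start, end), words)
    end <- start - 1
  }
  words
}

segment("babybag", dict)
# [1] "baby" "bag"
segment("hamptonfamilyguidebook", dict)
# [1] "hampton"   "family"    "guidebook"
```

Preferring the fewest words is only one possible tie-breaking rule; it explains why "guidebook" wins over "guide" + "book" here, and a frequency-based score would be a reasonable alternative.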

Related

How to insert back a character in a string at the exact position where it was originally

I have strings that have dots here and there and I would like to remove them - that is done, and after some other operations - these are also done, I would like to insert the dots back at their original place - this is not done. How could I do that?
library(stringr)
stringOriginal <- c("abc.def","ab.cd.ef","a.b.c.d")
dotIndex <- str_locate_all(pattern ='\\.', stringOriginal)
stringModified <- str_remove_all(stringOriginal, "\\.")
I see that str_sub() may help, for example str_sub(stringModified[2], 3,2) <- "." gets me somewhere, but it is still far from the right place, and also I have no idea how to do it programmatically. Thank you for your time!
update
stringOriginal <- c("11.123.100","11.123.200","1.123.1001")
stringOriginalF <- as.factor(stringOriginal)
dotIndex <- str_locate_all(pattern ='\\.', stringOriginal)
stringModified <- str_remove_all(stringOriginal, "\\.")
stringNumFac <- sort(as.numeric(stringModified))
stringi::stri_sub(stringNumFac[1:2], 3, 2) <- "."
stringi::stri_sub(stringNumFac[1:2], 7, 6) <- "."
stringi::stri_sub(stringNumFac[3], 2, 1) <- "."
stringi::stri_sub(stringNumFac[3], 6, 5) <- "."
factor(stringOriginal, levels = stringNumFac)
After this manipulation, I am able to order the numbers, convert them back to strings, and use them later for plotting.
But since I won't know the positions of the dots in advance, I want to do this programmatically. Another approach to factor ordering is also welcome, although I am still curious about how to programmatically insert a character back into a string at the exact position where it was originally.
This might be one of the cases for using base R's strsplit, which gives you a list, with a vector of substrings for each entry in your original vector. You can manipulate these with lapply or sapply very easily.
split_string <- strsplit(stringOriginal, "[.]")
#> split_string
#> [[1]]
#> [1] "11" "123" "100"
#>
#> [[2]]
#> [1] "11" "123" "200"
#>
#> [[3]]
#> [1] "1" "123" "1001"
Now you can do this to get the numbers
sapply(split_string, function(x) as.numeric(paste0(x, collapse = "")))
# [1] 11123100 11123200 11231001
And this to put the dots (or any replacement for the dots) back in:
sapply(split_string, paste, collapse = ".")
# [1] "11.123.100" "11.123.200" "1.123.1001"
And you could get the location of the dots within each element of your original vector like this (the last value in each result is one past the end of the string, not an actual dot):
lapply(split_string, function(x) cumsum(nchar(x) + 1))
# [[1]]
# [1] 3 7 11
#
# [[2]]
# [1] 3 7 11
#
# [[3]]
# [1] 2 6 11
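To answer the "insert back at the original position" part directly, here is a minimal base R sketch. It assumes the dot positions are recorded from the original strings with `gregexpr()` before the dots are removed; the helper name `reinsert` is hypothetical. Because the positions refer to the original string and are processed in increasing order, each insertion lands where the dot used to be.

```r
# Put `char` back at the given positions (positions refer to the ORIGINAL string)
reinsert <- function(stripped, positions, char = ".") {
  positions <- sort(positions[positions > 0])  # gregexpr() gives -1 for "no match"
  for (p in positions) {
    stripped <- paste0(substr(stripped, 1, p - 1), char,
                       substr(stripped, p, nchar(stripped)))
  }
  stripped
}

original <- c("11.123.100", "11.123.200", "1.123.1001")
dot_pos  <- lapply(gregexpr(".", original, fixed = TRUE), as.integer)
stripped <- gsub(".", "", original, fixed = TRUE)

restored <- mapply(reinsert, stripped, dot_pos, USE.NAMES = FALSE)
restored
# [1] "11.123.100" "11.123.200" "1.123.1001"
```

This only works if the strings keep their length and order between removing and re-inserting the dots; if they are sorted in between, the positions have to be reordered the same way.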

Extracting coefficients while looping over variable names

I'm working on some time-series stuff in R (version 3.4.1), and would like to extract coefficients from regressions I ran, in order to do further analysis.
All results are so far saved as uGARCHfit objects, which are basically complicated list objects, from which I want to extract the coefficients in the following manner.
What I want is in essence this:
for(i in list){
i_GARCH_mxreg <- i_GARCH@fit$robust.matcoef[5,1]
}
"list" is a list object, where every element is the name of one observation. For now, I want my loop to create a new numeric object named as I specified in the loop.
Now this obviously doesn't work because the index, 'i', isn't replaced as I would want it to be.
How do I rewrite my loop appropriately?
Minimal working example:
list <- as.list(c("one", "two", "three"))
one_a <- 1
two_a <- 2
three_a <- 3
for (i in list){
i_b <- i_a
}
what this should give me would be:
> one_b
[1] 1
> two_b
[1] 2
> three_b
[1] 3
Clarification:
I want to extract the coefficients from multiple list objects. These are named in the manner 'string'_obj. The problem is that I don't have a function that would extract these coefficients, and the list "is not subsettable", so I have to call the individual objects via obj@fit$robust.matcoef[5,1] (or is there another way?). I wanted to use the loop to take my list of strings and, in every iteration, take one string, evaluate 'string'_obj@fit$robust.matcoef[5,1], and save this value into an object, named again with " 'string'_name ".
It might well be easier to have this in a list rather than in individual objects, as someone suggested with lapply, but this is not my primary concern right now.
There is likely an easy way to do this, but I am unable to find it. Sorry for any confusion and thanks for any help.
The following should match your desired output:
# your list
l <- as.list(c("one", "two", "three"))
one_a <- 1
two_a <- 2
three_a <- 3
# my workspace: note that there is no one_b, two_b, three_b
ls()
[1] "l" "one_a" "three_a" "two_a"
for (i in l){
# first, let's define the names as characters, using paste:
dest <- paste0(i, "_b")
orig <- paste0(i, "_a")
# then let's assign the values. Since we are working with
# characters, the functions assign and get come in handy:
assign(dest, get(orig) )
}
# now let's check my workspace again. Note one_b, two_b, three_b
ls()
[1] "dest" "i" "l" "one_a" "one_b" "orig" "three_a"
[8] "three_b" "two_a" "two_b"
# let's check that the values are correct:
one_b
[1] 1
two_b
[1] 2
three_b
[1] 3
To comment on the functions used: assign takes a character as first argument, which is supposed to be the name of the newly created object. The second argument is the value of that object. get takes a character and looks up the value of the object in the workspace with the same name as that character. For instance, get("one_a") will yield 1.
Also, just to follow up on my comment earlier: If we already had all the coefficients in a list, we could do the following:
# hypothetical coefficients stored in list:
lcoefs <- list(1,2,3)
# let's name the coefficients:
lcoefs <- setNames(lcoefs, paste0(c("one", "two", "three"), "_c"))
# push them into the global environment:
list2env(lcoefs, envir = .GlobalEnv)
# look at environment:
ls()
[1] "dest" "i" "l" "lcoefs" "one_a" "one_b" "one_c"
[8] "orig" "three_a" "three_b" "three_c" "two_a" "two_b" "two_c"
one_c
[1] 1
two_c
[1] 2
three_c
[1] 3
And to address the comments, here is a slightly more realistic example that takes the list structure into account:
l <- as.list(c("one", "two", "three"))
# let's "hide" the values in a list:
one_a <- list(val = 1)
two_a <- list(val = 2)
three_a <- list(val = 3)
for (i in l){
dest <- paste0(i, "_b")
orig <- paste0(i, "_a")
# let's get the list-object:
tmp <- get(orig)
# extract value:
val <- tmp$val
assign(dest, val )
}
one_b
[1] 1
two_b
[1] 2
three_b
[1] 3
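As the question itself notes, keeping the results in one named vector or list is often easier than creating separate objects. The following is a minimal sketch of that alternative, assuming the same toy objects as above; `ids`, `objs` and `vals` are names made up for this illustration. For the real uGARCHfit objects, the extractor inside `vapply()` would presumably be `function(o) o@fit$robust.matcoef[5, 1]` instead of `o$val`.

```r
# Toy objects standing in for the fitted models
one_a   <- list(val = 1)
two_a   <- list(val = 2)
three_a <- list(val = 3)

ids <- c("one", "two", "three")

# Fetch all objects by name at once, then pull one value out of each
objs <- mget(paste0(ids, "_a"), envir = environment())
vals <- vapply(objs, function(o) o$val, numeric(1))
names(vals) <- paste0(ids, "_b")

vals[["two_b"]]
# [1] 2
```

Everything stays in `vals`, so later steps can loop over it or index it by name without further calls to assign() or get().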

Using str_view with a list of words in R

I want to use str_view from stringr in R to find all the words that start with "y" and all the words that end with "x." I have a list of words generated by Corpora, but whenever I launch the code, it returns a blank view.
Common_words<-corpora("words/common")
#start with y
start_with_y <- str_view(Common_words, "^[y]", match = TRUE)
start_with_y
#finish with x
str_view(Common_words, "$[x]", match = TRUE)
Also, I would like to find the words that are only 3 letters long, but I have no ideas so far.
I'd say this is not about programming with stringr but learning some regex. Here are some sites I have found useful for learning:
http://www.regular-expressions.info/tutorial.html
http://www.rexegg.com/
https://www.debuggex.com/
Here \\w, the shorthand class for word characters (i.e., [A-Za-z0-9_]), is useful with quantifiers (+ and {3} in these two cases). PS: I use stringi here because stringr uses it in the backend anyway, so this just skips the middleman.
x <- c("I like yax because the rock to the max!",
"I yonx & yix to pick up stix.")
library(stringi)
stri_extract_all_regex(x, 'y\\w+x')
stri_extract_all_regex(x, '\\b\\w{3}\\b')
## > stri_extract_all_regex(x, 'y\\w+x')
## [[1]]
## [1] "yax"
##
## [[2]]
## [1] "yonx" "yix"
## > stri_extract_all_regex(x, '\\b\\w{3}\\b')
## [[1]]
## [1] "yax" "the" "the" "max"
##
## [[2]]
## [1] "yix"
EDIT: It seems these may be of use too:
## Just words starting with y
stri_extract_all_regex(x, '\\by\\w*\\b')
## Just words ending with x
stri_extract_all_regex(x, '\\b\\w*x\\b')
## Words with 4 or more characters (change 4 to any n)
stri_extract_all_regex(x, '\\b\\w{4,}\\b')
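Since the question works on a vector of single words rather than sentences, anchors may be all that is needed. This is a small base R sketch with a made-up word vector (the real one would come from the corpus in the question). Note that `$` goes after the character it anchors, so "ends with x" is `x$`, not `$[x]`.

```r
# Hypothetical stand-in for the word list from the question
words <- c("yes", "box", "yellow", "fix", "young", "tax", "apple", "yax")

starts_y   <- words[grepl("^y", words)]      # ^ anchors at the start
ends_x     <- words[grepl("x$", words)]      # $ anchors at the end
three_long <- words[grepl("^.{3}$", words)]  # exactly three characters

starts_y
# [1] "yes"    "yellow" "young"  "yax"
ends_x
# [1] "box" "fix" "tax" "yax"
three_long
# [1] "yes" "box" "fix" "tax" "yax"
```

The same patterns should work unchanged in `str_view()` or `str_subset()` from stringr.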

Dimension of a function

Because this would simplify my code only slightly, I'm mainly asking out of curiosity. Say you have a function
f <- function(x) {
c(x[1] - x[2],
x[2] - x[3],
x[3] - x[1])
}
Is there a way of finding out the dimension of the input required, e.g.
dim(f) = 3
Here's a very questionable solution that involves computing on the language.
I've written three functions that allow matching, searching, and extracting pieces of parse trees based on a sort of "template" parse tree piece. Here they are:
ptlike <- function(lhs,rhs,wcs='.any.',wcf=NULL) if (is.symbol(rhs) && as.character(rhs)%in%wcs) !is.function(wcf) || isTRUE(wcf(lhs,as.character(rhs))) else typeof(lhs) == typeof(rhs) && length(lhs) == length(rhs) && if (is.call(lhs)) all(sapply(seq_along(lhs),function(i) ptlike(lhs[[i]],rhs[[i]],wcs,wcf))) else lhs == rhs;
ptfind <- function(ptspace,ptpat,wcs='.any.',wcf=NULL,locf=NULL,loc=integer(),ptspaceorig=ptspace) c(list(),if (ptlike(ptspace,ptpat,wcs,wcf) && (!is.function(locf) || isTRUE(locf(ptspaceorig,loc)))) list(loc),if (is.call(ptspace)) do.call(c,lapply(seq_along(ptspace),function(i) ptfind(ptspace[[i]],ptpat,wcs,wcf,locf,c(loc,i),ptspaceorig))));
ptextract <- function(ptspace,ptpat,gets='.get.',wcs='.any.',wcf=NULL,locf=NULL,getf=NULL) { getlocs <- do.call(c,lapply(gets,function(get) ptfind(ptpat,as.symbol(get),character()))); if (length(getlocs)==0L) getlocs <- list(integer()); c(list(),do.call(c,lapply(ptfind(ptspace,ptpat,unique(c(gets,wcs)),wcf,locf),function(loc) do.call(c,lapply(getlocs,function(getloc) { cloc <- c(loc,getloc); ptget <- if (length(cloc)>0) ptspace[[cloc]] else ptspace; if (!is.function(getf) || isTRUE(getf(if (missing(ptget)) substitute() else ptget,loc,getloc,as.character(ptpat[[getloc]])))) list(if (missing(ptget)) substitute() else ptget); }))))); };
ptlike() matches two parse tree pieces against each other, allowing for wildcards on the RHS to match anything. For example:
ptlike(1L,2L);
## [1] FALSE
ptlike(1L,1L);
## [1] TRUE
ptlike(quote(a),quote(b));
## [1] FALSE
ptlike(quote(a),quote(a));
## [1] TRUE
ptlike(quote(sum(a+1)),quote(sum(b+1)));
## [1] FALSE
ptlike(quote(sum(a+1)),quote(sum(.any.+1)));
## [1] TRUE
ptlike(quote(sum(a+1)),quote(.any.(a+1)));
## [1] TRUE
ptlike(quote(sum(a+1)),quote(.any.));
## [1] TRUE
ptfind() returns a recursive index vector for each match of a parse tree pattern (RHS, if you like) in a given parse tree space (LHS), combined into a list. For example:
sp <- quote({ a+b*c:d; e+f*g:h; });
ptfind(sp,quote(.any.:.any.));
## [[1]]
## [1] 2 3 3
##
## [[2]]
## [1] 3 3 3
##
sp[[c(2L,3L,3L)]];
## c:d
sp[[c(3L,3L,3L)]];
## g:h
ptextract() is similar to ptfind(), but returns the matched piece of the parse tree space or a subset thereof:
ptextract(sp,quote(c:d));
## [[1]]
## c:d
##
ptextract(sp,quote(.any.:.any.));
## [[1]]
## c:d
##
## [[2]]
## g:h
##
ptextract(sp,quote(.any.:.get.));
## [[1]]
## d
##
## [[2]]
## h
##
So what we can do is extract from (the parse tree that comprises the body of) your function all the subscripts of the x argument and get the maximum literal subscript value:
f <- function(x) c(x[1L]-x[2L],x[2L]-x[3L],x[3L]-x[1L]);
max(unlist(Filter(is.numeric,ptextract(body(f),quote(x[.get.])))));
## [1] 3
The Filter(is.numeric,...) piece isn't really necessary here, since all subscripts are literal numbers (doubles in your definition, integers in mine, although that's inconsequential), but if there were ever non-numeric-literal subscripts, then it would be necessary.
Caveats:
It's absurd.
It would not take into account variable subscripts that might rise above the maximum literal subscript value in the parse tree. (Although, strictly speaking, in the general case, it is impossible to statically analyze all possible subscripts that might occur at run-time, due to the halting problem or something.)
If the vector was ever assigned to a different variable name, say y, and then that variable was indexed with a different subscript, even a numeric literal subscript, this algorithm would not take that into account.
And there's probably some more.
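For this specific use case, a much shorter recursive walk over the function body gives the same number. This is only a sketch under the same caveats as above: it looks solely at calls of the form `x[<numeric literal>]`, it assumes no empty subscripts such as `x[]` occur in the body, and the helper name `max_subscript` is made up for this illustration.

```r
max_subscript <- function(fn, arg = "x") {
  found <- numeric(0)
  walk <- function(e) {
    if (is.call(e)) {
      # A call like x[3] is `[`(x, 3): function name, object, subscript
      if (identical(e[[1]], as.name("[")) && length(e) == 3 &&
          identical(e[[2]], as.name(arg)) && is.numeric(e[[3]])) {
        found <<- c(found, e[[3]])
      }
      # Recurse into every component of the call
      for (part in as.list(e)) walk(part)
    }
  }
  walk(body(fn))
  if (length(found)) max(found) else NA_real_
}

f <- function(x) {
  c(x[1] - x[2],
    x[2] - x[3],
    x[3] - x[1])
}
max_subscript(f)
# [1] 3
```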

subset() drops attributes on vectors; how to maintain/persist them?

Let's say I have a vector where I've set a few attributes:
vec <- sample(50:100,1000, replace=TRUE)
attr(vec, "someattr") <- "Hello World"
When I subset the vector, the attributes are dropped. For example:
tmp.vec <- vec[which(vec > 80)]
attributes(tmp.vec) # Now NULL
Is there a way to subset and persist attributes without having to save them to another temporary object?
Bonus: Where would one find documentation of this behaviour?
I would write a method for [ or subset() (depending on how you are subsetting) and arrange for it to preserve the attributes. This also requires adding a "class" attribute to your vector so that method dispatch occurs.
vec <- 1:10
attr(vec, "someattr") <- "Hello World"
class(vec) <- "foo"
At this point, subsetting removes attributes:
> vec[1:5]
[1] 1 2 3 4 5
If we add a method [.foo we can preserve the attributes:
`[.foo` <- function(x, i, ...) {
attrs <- attributes(x)
out <- unclass(x)
out <- out[i]
attributes(out) <- attrs
out
}
Now the attributes are preserved, as desired:
> vec[1:5]
[1] 1 2 3 4 5
attr(,"someattr")
[1] "Hello World"
attr(,"class")
[1] "foo"
And the answer to the bonus question:
From ?"[" in the details section:
Subsetting (except by an empty index) will drop all attributes except names, dim and dimnames.
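If adding a class is not desirable, a plain helper function can copy the attributes over after subsetting. This is a minimal sketch, assuming a simple vector and a hypothetical helper name `subset_keep_attrs`; it deliberately skips names, dim and dimnames, since those are either handled by `[` itself or no longer fit the shorter result.

```r
subset_keep_attrs <- function(x, i) {
  out <- x[i]                      # drops most attributes
  keep <- attributes(x)
  # names/dim/dimnames would have the wrong length after subsetting
  keep <- keep[setdiff(names(keep), c("names", "dim", "dimnames"))]
  attributes(out) <- c(attributes(out), keep)
  out
}

vec <- c(55, 90, 85, 60)
attr(vec, "someattr") <- "Hello World"

tmp <- subset_keep_attrs(vec, vec > 80)
attr(tmp, "someattr")
# [1] "Hello World"
as.vector(tmp)
# [1] 90 85
```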
Thanks to a similar answer to my question by @G. Grothendieck, you can use collapse::fsubset.
library(collapse)
#tmp_vec <- fsubset(vec, vec > 80)
tmp_vec <- sbt(vec, vec > 80) # Shortcut for fsubset
attributes(tmp_vec)
# $someattr
# [1] "Hello World"

Resources