beg2char function in R (qdap package)

I am trying to keep only the part of the string to the left of "keyword". Anything to the right of "keyword" should be removed. beg2char seems like the best choice, but it's not doing what I thought it would do.
Please advise:
x <-"/index.php/front/yellow/searchHeading/heading/926/h_name/Architects/keyword/A//"
beg2char(x,"keyword")
# [1] "/in"

We could use gsub, as below:
gsub("keyword.*", "", x)
# [1] "/index.php/front/yellow/searchHeading/heading/926/h_name/Architects/"

If we want to keep the "keyword" in the output, then set include = TRUE:
library(qdap)
x <-"/index.php/front/yellow/searchHeading/heading/926/h_name/Architects/keyword/A//"
beg2char(x, "keyword", include = TRUE)
# [1] "/index.php/front/yellow/searchHeading/heading/926/h_name/Architects/keyword"
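If we want to exclude "keyword", then we would do as you did, which doesn't work: the output "/in" stops just before the first "d" (in "index"), and "d" is one of the letters in "keyword", so the characters seem to be matched individually rather than as a whole word. Looks like a bug to me; I submitted an issue at GitHub:qdap.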
But this works:
beg2char(x, "k")
# [1] "/index.php/front/yellow/searchHeading/heading/926/h_name/Architects/"
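If you would rather not depend on how beg2char treats its argument, a base R sketch that searches for the literal word is shown here. The variable names pos and left are only illustrative, and fixed = TRUE is used so that "keyword" is matched as plain text rather than as a regular expression.

```r
x <- "/index.php/front/yellow/searchHeading/heading/926/h_name/Architects/keyword/A//"

# Position of the first literal occurrence of "keyword" (-1 if absent)
pos <- regexpr("keyword", x, fixed = TRUE)

# Keep everything before the match; leave x untouched if there is no match
left <- if (pos > 0) substr(x, 1, pos - 1) else x
left
# [1] "/index.php/front/yellow/searchHeading/heading/926/h_name/Architects/"
```

This gives the same result as the gsub call above for this input.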

How to randomly reshuffle letters in words

I am trying to make a word scrambler in R. So I have put some words in a collection and tried to use strsplit() to split the letters of each word in the collection.
But I don't understand how to jumble the letters of a word and merge them back into one word in R. Does anyone know how I can solve this?
This is what I have done (the original screenshot is not available).
Once you've split the words, you can use sample() to rescramble the letters, and then paste0() with collapse="", to concatenate back into a 'word'
lapply(words, function(x) paste0(sample(strsplit(x, split="")[[1]]), collapse=""))
You can use the stringi package if you want:
> stringi::stri_rand_shuffle(c("hello", "goodbye"))
[1] "oellh" "deoygob"
Here's a one-liner:
lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = "")
[[1]]
[1] "elfi"
[[2]]
[1] "vleo"
[[3]]
[1] "rmsyyet"
Use unlist to get rid of the list:
unlist(lapply(lapply(strsplit(strings, ""), sample), paste0, collapse = ""))
Data:
strings <- c("life", "love", "mystery")
You can use the sample function for this.
Here is an example of doing it for a single word. You can use this within your for-loop:
yourword <- "hello"
# strsplit() returns a list with one character vector in it.
# We only want to work with the vector, not the list, so we extract the first
# (and only) element with "[[1]]".
jumble <- strsplit(yourword, "")[[1]]
jumble <- sample(jumble,                # sample random elements from jumble
                 size = length(jumble), # as many times as the length of jumble,
                                        # ergo all letters
                 replace = FALSE        # do not sample an element multiple times
)
restored <- paste0(jumble,
                   collapse = ""        # join the letters without a separator
)
As the answer from langtang suggests, you can use the apply family for this, which is more efficient. But maybe this answer helps the understanding of what R is actually doing here.
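To tie the answers together, here is a small sketch that wraps the split, sample, and paste steps in a helper and applies it to a whole vector. The name scramble_word and the seed value are made up for illustration; vapply is used instead of lapply only so that the result is a plain character vector.

```r
# Hypothetical helper: shuffle the letters of a single word
scramble_word <- function(word) {
  letters_vec <- strsplit(word, split = "")[[1]]  # one element per letter
  paste0(sample(letters_vec), collapse = "")      # shuffle, then glue back
}

words <- c("life", "love", "mystery")

set.seed(42)  # only to make the shuffle reproducible
scrambled <- vapply(words, scramble_word, character(1), USE.NAMES = FALSE)
scrambled
```

Each result has the same letters as its input, just in a random order, so a shuffle can occasionally return the original word.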

word segmentation for hashtag using R

I would like to do word segmentation for hashtags, i.e. split a hashtag into its words. This is my attempt, but obviously it didn't work.
What I am trying to do:
INPUT: #sometrendingtopic
OUTPUT: some trending topic
my attempt:
s<- "#sometrendingtopic"
tokenize_character_shingles(s)
tokenize_words(s)
tokenize_characters(s)
I found some information, but it is for Python: https://stackoverflow.com/.../r-split-string-by-symbol
Thanks in advance for any ideas and guidance.
So ... this is an absolutely non-trivial task, and I think it cannot be solved in general. Since there is no delimiter between your words, you basically need to extract substrings and check them against a dictionary of your desired language.
A very crude method, which will only extract the longest matches it can find from left to right, is to use hunspell, which is designed for spell checking but can be "misused" to maybe solve this task:
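split_words <- function(cat.string) {
  split <- NULL
  start.char <- 1
  # "<=" so that a final one-letter word is still examined
  while (start.char <= nchar(cat.string)) {
    result <- NULL
    # Grow the candidate one character at a time; the longest candidate
    # that passes the spell check wins
    for (cur.char in start.char:nchar(cat.string)) {
      test.string <- substr(cat.string, start.char, cur.char)
      # hunspell() returns the misspelled words, so length 0 means "known word"
      test <- hunspell::hunspell(test.string)[[1]]
      if (length(test) == 0) result <- test.string
    }
    # No dictionary word starts at this position: give up
    if (is.null(result)) return("")
    split <- c(split, result)
    start.char <- start.char + nchar(result)
  }
  split
}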
input <- c("#sometrendingtopic","#anothertrendingtopic","#someveryboringtopic")
# Clean the hashtag from the input
input <- sub("#","",input)
#apply word split
result <- lapply(input,split_words)
result
[[1]]
[1] "some" "trending" "topic"
[[2]]
[1] "another" "trending" "topic"
[[3]]
[1] "some" "very" "boring" "topic"
Please keep in mind that this method is far from perfect in multiple ways:
It is relatively slow.
It will greedily match from left to right. So if, for example, we have the hashtag
input <- "#averyboringtopic", the result will be
[[3]]
[1] "aver" "y" "boring" "topic"
Since "aver" apparently is a possible word in this specific dictionary.
So: Use at your own risk and improve upon this!
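One possible improvement on the greedy behaviour is to backtrack: if the longest prefix leaves a remainder that cannot be split, try a shorter prefix. The sketch below illustrates that idea only. It uses a tiny made-up dictionary (toy_dict) in place of hunspell, and the function name segment is hypothetical.

```r
# Hypothetical backtracking segmenter over a plain character-vector dictionary.
# Returns the words, character(0) for an empty string, or NULL if no split exists.
segment <- function(s, dict) {
  if (nchar(s) == 0) return(character(0))
  # Try longer prefixes first, but fall back to shorter ones when needed
  for (len in nchar(s):1) {
    prefix <- substr(s, 1, len)
    if (prefix %in% dict) {
      rest <- segment(substr(s, len + 1, nchar(s)), dict)
      if (!is.null(rest)) return(c(prefix, rest))
    }
  }
  NULL  # nothing in the dictionary fits here
}

# Made-up dictionary for the demonstration
toy_dict <- c("a", "aver", "very", "boring", "topic", "some", "trending")

segment("averyboringtopic", toy_dict)
# [1] "a"      "very"   "boring" "topic"
segment("sometrendingtopic", toy_dict)
# [1] "some"     "trending" "topic"
```

Here "aver" is tried first, but the remainder "yboringtopic" cannot be split, so the function falls back to "a". The recursion can become very slow on long inputs, and it still returns the first split it finds rather than the most plausible one.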

What is the mistake in these two lines of code? Getting pattern: "http:// blabla .nc"

I have hundreds of TXT files which contain many things and some download links.
The pattern of the download links is like this:
start with: http://
and
end with: .nc
I created a sample text file for your convenience that you could download from this link:
https://www.dropbox.com/s/5crmleli2ppa1rm/textfile_including_https.txt?dl=1
Based on this topic in Stackoverflow, I tried to extract all download links from the text file:
Extract websites links from a text in R
Here is my code:
download_links <- readLines(file.choose())
All_my_links <- gsub(download_links, pattern=".*(http://.*nc).*", replace="\\1")
But it returns all lines, too, while I only want to extract the http links ending with .nc
Here is the result:
head(All_my_links )
tail(All_my_links )
> head(All_my_links )
[1] "#!/bin/bash"
[2] "##############################################################################"
[3] "version=1.3.2"
[4] "CACHE_FILE=.$(basename $0).status"
[5] "openId="
[6] "search_url='https://esgf-node.llnl.gov/esg-search/wget/?distrib=false&dataset_id=CMIP6.HighResMIP.MIROC.NICAM16-9S.highresSST-present.r1i1p1f1.day.pr.gr.v20190830|esgf-data2.diasjp.net'"
> tail(All_my_links )
[1] "MYPROXY_STATUS=$HOME/.MyProxyLogon"
[2] "COOKIE_JAR=$ESG_HOME/cookies"
[3] "MYPROXY_GETCERT=$ESG_HOME/getcert.jar"
[4] "CERT_EXPIRATION_WARNING=$((60 * 60 * 8)) #Eight hour (in seconds)"
[5] ""
[6] "WGET_TRUSTED_CERTIFICATES=$ESG_HOME/certificates"
What is my mistake in the code?
Any comment would be highly appreciated.
gsub() is not for extracting, that's what's wrong with your code. It's for replacing. (See help("gsub")). For the purposes of this demonstration, I will use the following data:
x <- c("abc", "123", "http://site.nc")
(As a rule, I will not download data posted here as a link, and most others won't either. If you want to share example data, it's best to include the output of dput() in your question.)
Let's see what happens with your gsub() approach:
gsub(pattern = ".*(http://.*nc).*", replacement = "\\1", x = x)
# [1] "abc" "123" "http://site.nc"
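Looks familiar. What's going on here is that gsub() looks at each element of x and replaces each match of pattern with replacement. An element that does not match the pattern at all is returned unchanged, which is why the non-link lines are still in your result; and when the whole element is the link, as in the third element here, the replacement is the element itself.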
I would suggest stringr::str_extract():
stringr::str_extract(string = x, pattern = ".*http://.*nc.*")
# [1] NA NA "http://site.nc"
If you wrap this in na.omit(), it gives you the output I think you want:
na.omit(stringr::str_extract(string = x, pattern = ".*http://.*nc.*"))
# [1] "http://site.nc"
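If the links are embedded in longer lines, or a line holds more than one link, a tighter pattern may be needed. The following base R sketch assumes that a link contains no whitespace and ends at the first ".nc"; the sample lines and host names are made up for illustration.

```r
# Made-up lines standing in for the contents of the text file
txt <- c(
  "#!/bin/bash",
  "wget 'http://host.example/data/file1.nc' -O file1.nc",
  "no link on this line",
  "http://a.example/x.nc and http://b.example/y.nc"
)

# Lazy match: "http://", then as few non-space characters as possible, then ".nc"
matches <- gregexpr("http://\\S*?\\.nc", txt, perl = TRUE)

# regmatches() returns one character vector per line; unlist() flattens them
links <- unlist(regmatches(txt, matches))
links
# [1] "http://host.example/data/file1.nc" "http://a.example/x.nc"
# [3] "http://b.example/y.nc"
```

Lines without a link simply contribute nothing, so no na.omit() step is needed with this approach.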

Use gsub to replace curly apostrophe with straight apostrophe in R list of character vectors

Looking for some guidance on how to replace a curly apostrophe with a straight apostrophe in an R list of character vectors.
The reason I'm replacing the curly apostrophes - later in the script, I check each list item, to see if it's found in a dictionary (using qdapDictionary) to ensure it's a real word and not garbage. The dictionary uses straight apostrophes, so words with the curly apostrophes are being "rejected."
A sample of the code I have currently follows. In my test list, item #6 contains a curly apostrophe, and item #2 has a straight apostrophe.
Example:
list_TestWords <- as.list(c("this", "isn't", "ideal", "but", "we", "can’t", "fix", "it"))
func_ReplaceTypographicApostrophes <- function(x) {
gsub("’", "'", x, ignore.case = TRUE)
}
list_TestWords_Fixed <- lapply(list_TestWords, func_ReplaceTypographicApostrophes)
The result: No change. Item 6 still using curly apostrophe. See output below.
list_TestWords_Fixed
[[1]]
[1] "this"
[[2]]
[1] "isn't"
[[3]]
[1] "ideal"
[[4]]
[1] "but"
[[5]]
[1] "we"
[[6]]
[1] "can’t"
[[7]]
[1] "fix"
[[8]]
[1] "it"
Any help you can offer will be most appreciated!
This might work: gsub("[\u2018\u2019\u201A\u201B\u2032\u2035]", "'", x)
I found it over here: http://axonflux.com/handy-regexes-for-smart-quotes
You might be running up against a bug in R on Windows. Try using utf8::as_utf8 on your input. Alternatively, this also works:
library(utf8)
list_TestWords <- as.list(c("this", "isn't", "ideal", "but", "we", "can’t", "fix", "it"))
lapply(list_TestWords, utf8_normalize, map_quote = TRUE)
This will replace the following characters with ASCII apostrophe:
U+055A ARMENIAN APOSTROPHE
U+2018 LEFT SINGLE QUOTATION MARK
U+2019 RIGHT SINGLE QUOTATION MARK
U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
U+FF07 FULLWIDTH APOSTROPHE
It will also convert your text to composed normal form (NFC).
I see a problem in your call to gsub:
gsub("/’", "/'", x, ignore.case = TRUE)
You are prefixing the curly single quote with a forward slash. I don't know why you are doing this. I could speculate that you are trying to escape the quote characters, but this is having the side effect that your pattern is now trying to match a forward slash followed by a quote. As this never occurs in your text, no replacements are being made. You should be doing this:
gsub("’", "'", x, ignore.case = TRUE)
Follow the link below for a demo which shows that using the above gsub calls works as you expect.
Demo
Was about to say the same thing.
Try using str_replace from the stringr package; you will not need to use slashes.
I was facing a similar problem. Somehow none of the solutions worked for me. So I devised an indirect way of doing it by identifying the apostrophe and replacing it with the required format.
gsub("(\\w)(\\W)(\\w\\s)", "\\1'\\3","sid’s bicycle")
[1] "sid's bicycle"
Hope it helps someone.
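If the cause is the encoding of the script file rather than the pattern itself, one thing to try is writing the curly apostrophe as a Unicode escape in both the data and the pattern, so that nothing depends on how the file was saved. This is only a sketch under that assumption; the helper name replace_curly is made up.

```r
# "\u2019" is RIGHT SINGLE QUOTATION MARK, written as an escape on purpose
list_TestWords <- as.list(c("this", "isn't", "ideal", "but", "we", "can\u2019t", "fix", "it"))

# fixed = TRUE treats the pattern as a literal character, not a regex
replace_curly <- function(x) gsub("\u2019", "'", x, fixed = TRUE)

list_fixed <- lapply(list_TestWords, replace_curly)
list_fixed[[6]]
# [1] "can't"
```

Words that already use the straight apostrophe, such as item 2, pass through unchanged.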

strsplit not behaving as expected R

I have a basic problem in R, everything I'm working with is familiar to me (data, functions) but for some reason I can't get the strsplit or the gsub function to work as expected. I also tried the stringr package. I'm not going to bother putting up code using that package because I know this problem is simple and can be done with the two functions mentioned above. Personally, I feel like putting up a page for this isn't even necessary but my patience is pretty thin at this point.
I am trying to remove the "." and the number following the '.' in an Ensembl gene ID. Simple, I know.
id <- "ENSG00000223972.5"
gsub(".*", "", id)
strsplit(id, ".")
The asterisk symbol was meant to catch anything after the '.' and remove it but I don't know for sure if that's what it does. The strsplit should definitely output a list of two items, the first being everything before the '.' and the second being the one digit after. All it returns is a list with 17 "" symbols, for no space and one for each character in the string. I think it's an obvious thing that I'm missing but I haven't been able to figure it out. Thanks in advance.
Read the help file, ?strsplit: you cannot use "." as-is, because split is a regular expression and "." matches any character.
id <- "ENSG00000223972.5"
gsub("[.]", "", id)
strsplit(id, split = "[.]")
Output:
> gsub("[.]", "", id)
[1] "ENSG000002239725"
> strsplit(id, split = "[.]")
[[1]]
[1] "ENSG00000223972" "5"
Help:
unlist(strsplit("a.b.c", "."))
## [1] "" "" "" "" ""
## Note that 'split' is a regexp!
## If you really want to split on '.', use
unlist(strsplit("a.b.c", "[.]"))
## [1] "a" "b" "c"
## or
unlist(strsplit("a.b.c", ".", fixed = TRUE))
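For the original goal of dropping the dot together with the version number, a short sketch along the same lines follows. It assumes the version suffix is always a dot followed by digits at the end of the ID; the names base_id and parts are only illustrative.

```r
id <- "ENSG00000223972.5"

# Remove a literal dot plus the digits after it, anchored at the end
base_id <- sub("[.][0-9]+$", "", id)
base_id
# [1] "ENSG00000223972"

# Or split on a literal dot and keep both pieces
parts <- strsplit(id, ".", fixed = TRUE)[[1]]
parts
# [1] "ENSG00000223972" "5"
```

Either "[.]" or fixed = TRUE stops the dot from being read as "any character".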
