Extracting part of the string using strapplyc in R - r

I have the following sentence, sent:
> "#stance=iPhone : The next revolution of Apple"
And I would like to extract the stance.
After using
strapplyc(sent, "stance=(.*)", simplify = TRUE)
I obtained the following:
> "iPhone : The next revolution of Apple"
Anyone knows if there is a better way to just extract out the "iphone" in this case?

Try this (add : in this regex):
strapplyc(sent, "stance=(.*) :", simplify = TRUE)

Related

Generate full width character string in R

In R, how can I make the following:
convert this string: "my test string"
to something like this ( a full width character string): "my  test  string"
is there a way to do this through hexidecimal character encodings?
Thanks for your help, I'm really not sure how to even start. Perhaps something with {stringr}
I'm trying to get an output similar to what I would expect from this online conversion tool:
http://www.linkstrasse.de/en/%EF%BD%86%EF%BD%95%EF%BD%8C%EF%BD%8C%EF%BD%97%EF%BD%89%EF%BD%84%EF%BD%94%EF%BD%88%EF%BC%8D%EF%BD%83%EF%BD%8F%EF%BD%8E%EF%BD%96%EF%BD%85%EF%BD%92%EF%BD%94%EF%BD%85%EF%BD%92
Here is a possible solution using a function from the archived Nippon package. This is the han2zen function, which can be found here.
x <- "my test string"
han2zen <- function(s){
stopifnot(is.character(s))
zenEisu <- paste0(intToUtf8(65295 + 1:10), intToUtf8(65312 + 1:26),
intToUtf8(65344 + 1:26))
zenKigo <- c(65281, 65283, 65284, 65285, 65286, 65290, 65291,
65292, 12540, 65294, 65295, 65306, 65307, 65308,
65309, 65310, 65311, 65312, 65342, 65343, 65372,
65374)
s <- chartr("0-9A-Za-z", zenEisu, s)
s <- chartr('!#$%&*+,-./:;<=>?#^_|~', intToUtf8(zenKigo), s)
s <- gsub(" ", intToUtf8(12288), s)
return(s)
}
han2zen(x)
# [1] "my test string"

How to remove paranthesis but keep the text in it in R

I am trying to clean a dataset with the column: ltaCpInfoDF$weekdays_rate_1
For some of the rows, I would like to do this:
input: Daily(7am-11pm): $1.20 ; output: 7am-11pm: $1.20
The values within the bracket can be different timings for the rows.
Initially, I was thinking of removing by part such as removing "Daily(" with gsub first then removing ")". However, I seem to be facing issues with that.
ltaCpInfoDF$weekdays_rate_1 <- gsub("Daily(", "", ltaCpInfoDF$weekdays_rate_1)
Here is the error shown:
Error in gsub("Daily(", "", ltaCpInfoDF$weekdays_rate_1) :
invalid regular expression 'Daily(', reason 'Missing ')''
In addition: Warning message:
In gsub("Daily(", "", ltaCpInfoDF$weekdays_rate_1) :
TRE pattern compilation error 'Missing ')''
Could someone share with me a better way? Thank you in advance!
Use sub with a capture group:
input <- "Daily(7am-11pm): $1.20"
output <- gsub("\\S+\\s*\\((.*?)\\)", "\\1", input)
output
[1] "7am-11pm: $1.20"
We may use without capturing
gsub("^[^(]+\\(|\\)", "", str1)
[1] "7am-11pm: $1.20"
data
str1 <- "Daily(7am-11pm): $1.20"

paste specific text to strings that do not have it

I would like to paste "miR" to strings that do not have "miR" already, and skipping those that have it.
paste("miR", ....)
in
c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
out
c("miR-26b", "miR-26a", "miR-1297", "miR-4465", "miR-26b", "miR-26a")
One way could be by removing "miR" if it is present in the beginning of the string using sub and pasting it to every string irrespectively.
paste0("miR-", sub("^miR-","", x))
#[1] "miR-26b" "miR-26a" "miR-1297" "miR-4465" "miR-26b" "miR-26a"
data
x <- c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
vec <- c("miR-26b", "miR-26a", "1297", "4465", "miR-26b", "miR-26a")
sub("^(?!miR)(.*)$", "miR-\\1", vec, perl = T)
#[1] "miR-26b" "miR-26a" "miR-1297" "miR-4465" "miR-26b" "miR-26a"
If you want to learn more:
type ?sub into R console
learn regex, have a closer look at negative look ahead, capturing groups LEARN REGEX
I've used perl = T because I get an error if I don't. READ MORE

String pulled directly from source data seems to not match string in source data

I have a string that is failing to evaluate as a match with itself. I am trying to do a simple subset based on one of 8 possible values in a column,
out <- df[df$`Var name` == "string",]
I've had it work multiple times with different strings but for some reason this string fails. I have tried to get the exact string (thinking there may be some character encoding issue) from the source using the four below avenues but have had no success. Even when I make an explicit call to a cell I know contains that string and copy that into an evaluation statement it fails
> df[i,j]
[1] "string"
df[i,j]=="string" # pasted from above line
I don't understand how I can be explicitly pasting the output I was just given and it not match.
## attempts to get exact string to paste into subset statement
# from dput
"IF APPLICABLE – Which of the following best characterizes the expectations with"
# from calling a specific row/col (df[i, j])
[1] "IF APPLICABLE – Which of the following best characterizes the expectations with"
# from the source pane of rstudio
IF APPLICABLE – Which of the following best characterizes the expectations with
# from the source excel file
IF APPLICABLE – Which of the following best characterizes the expectations with
I don't have a clue what could be going on here. I am explicitly drawing the string straight from the data and yet it still fails to evaluate as true. Is there something going on in the background that I'm not seeing? Am I overlooking something ridiculously simple?
edit:
I subset based on another way, below is a dput and actual example of what I'm doing:
> dput(temp)
structure(list(`Item Stem` = "IF APPLICABLE – Which of the following best characterizes the expectations with",
`Item Response` = "It was required.", orgchar_group = "locale",
`Org Characteristic` = "Rural", N = 487, percent = 34.5145287030475,
`Graphs note` = NA_character_, `Report note` = NA_character_,
`Other note` = NA_character_, subsig = 1, overall = 0, varname = NA_character_,
statsig = NA_real_, use = NA_real_, difference = 9.16044821292665), .Names = c("Item Stem",
"Item Response", "orgchar_group", "Org Characteristic", "N",
"percent", "Graphs note", "Report note", "Other note", "subsig",
"overall", "varname", "statsig", "use", "difference"), row.names = 288L, class = "data.frame")
> temp[1,1]
[1] "IF APPLICABLE – Which of the following best characterizes the expectations with"
> temp[1,1] == "IF APPLICABLE – Which of the following best characterizes the expectations with"
[1] FALSE
Turns out it was in fact a non-printable character, shoutout to the commenters for helping me figure it out by 1) suggesting it and 2) showing that it worked for them.
I was able to figure it out using insights from here (& here) and here.
I used a grep command (from #Tyler Rinker) to determine that there was in fact a non-ASCII character in my string, and a stringi command (from #hadley) to determine what kind. I then used base solution from #Josh O'Brien to remove it. Turns out it was the heiphen.
# working in the temp df
> x <- temp[1,1]
> grepl("[^ -~]", x)
[1] TRUE
> stringi::stri_enc_mark(x)
[1] "UTF-8"
> iconv(x, "UTF-8", "ASCII", sub="")
[1] "IF APPLICABLE Which of the following best characterizes the expectations with"
# set x as df$`Var name` and reassign it to fix
df$`Var name` <- iconv(df$`Var name`, "UTF-8", "ASCII", sub="")
Still don't understand it enough to explain why it happened but it's fixed now.

Get function's title from documentation

I would like to get the title of a base function (e.g.: rnorm) in one of my scripts. That is included in the documentation, but I have no idea how to "grab" it.
I mean the line given in the RD files as \title{} or the top line in documentation.
Is there any simple way to do this without calling Rd_db function from tools and parse all RD files -- as having a very big overhead for this simple stuff? Other thing: I tried with parse_Rd too, but:
I do not know which Rd file holds my function,
I have no Rd files on my system (just rdb, rdx and rds).
So a function to parse the (offline) documentation would be the best :)
POC demo:
> get.title("rnorm")
[1] "The Normal Distribution"
If you look at the code for help, you see that the function index.search seems to be what is pulling in the location of the help files, and that the default for the associated find.packages() function is NULL. Turns out tha tthere is neither a help fo that function nor is exposed, so I tested the usual suspects for which package it was in (base, tools, utils), and ended up with "utils:
utils:::index.search("+", find.package())
#[1] "/Library/Frameworks/R.framework/Resources/library/base/help/Arithmetic"
So:
ghelp <- utils:::index.search("+", find.package())
gsub("^.+/", "", ghelp)
#[1] "Arithmetic"
ghelp <- utils:::index.search("rnorm", find.package())
gsub("^.+/", "", ghelp)
#[1] "Normal"
What you are asking for is \title{Title}, but here I have shown you how to find the specific Rd file to parse and is sounds as though you already know how to do that.
EDIT: #Hadley has provided a method for getting all of the help text, once you know the package name, so applying that to the index.search() value above:
target <- gsub("^.+/library/(.+)/help.+$", "\\1", utils:::index.search("rnorm",
find.package()))
doc.txt <- pkg_topic(target, "rnorm") # assuming both of Hadley's functions are here
print(doc.txt[[1]][[1]][1])
#[1] "The Normal Distribution"
It's not completely obvious what you want, but the code below will get the Rd data structure corresponding to the the topic you're interested in - you can then manipulate that to extract whatever you want.
There may be simpler ways, but unfortunately very little of the needed coded is exported and documented. I really wish there was a base help package.
pkg_topic <- function(package, topic, file = NULL) {
# Find "file" name given topic name/alias
if (is.null(file)) {
topics <- pkg_topics_index(package)
topic_page <- subset(topics, alias == topic, select = file)$file
if(length(topic_page) < 1)
topic_page <- subset(topics, file == topic, select = file)$file
stopifnot(length(topic_page) >= 1)
file <- topic_page[1]
}
rdb_path <- file.path(system.file("help", package = package), package)
tools:::fetchRdDB(rdb_path, file)
}
pkg_topics_index <- function(package) {
help_path <- system.file("help", package = package)
file_path <- file.path(help_path, "AnIndex")
if (length(readLines(file_path, n = 1)) < 1) {
return(NULL)
}
topics <- read.table(file_path, sep = "\t",
stringsAsFactors = FALSE, comment.char = "", quote = "", header = FALSE)
names(topics) <- c("alias", "file")
topics[complete.cases(topics), ]
}

Resources