String pulled directly from source data seems to not match string in source data - r

I have a string that is failing to evaluate as a match with itself. I am trying to do a simple subset based on one of 8 possible values in a column,
out <- df[df$`Var name` == "string",]
I've had it work multiple times with different strings, but for some reason this string fails. I have tried to get the exact string from the source (thinking there may be some character encoding issue) using the four avenues below, but have had no success. Even when I make an explicit call to a cell I know contains that string and copy that into an evaluation statement, it fails.
> df[i,j]
[1] "string"
df[i,j]=="string" # pasted from above line
I don't understand how I can be explicitly pasting the output I was just given and it not match.
## attempts to get exact string to paste into subset statement
# from dput
"IF APPLICABLE – Which of the following best characterizes the expectations with"
# from calling a specific row/col (df[i, j])
[1] "IF APPLICABLE – Which of the following best characterizes the expectations with"
# from the source pane of rstudio
IF APPLICABLE – Which of the following best characterizes the expectations with
# from the source excel file
IF APPLICABLE – Which of the following best characterizes the expectations with
I don't have a clue what could be going on here. I am explicitly drawing the string straight from the data and yet it still fails to evaluate as true. Is there something going on in the background that I'm not seeing? Am I overlooking something ridiculously simple?
edit:
I subset based on another way, below is a dput and actual example of what I'm doing:
> dput(temp)
structure(list(`Item Stem` = "IF APPLICABLE – Which of the following best characterizes the expectations with",
`Item Response` = "It was required.", orgchar_group = "locale",
`Org Characteristic` = "Rural", N = 487, percent = 34.5145287030475,
`Graphs note` = NA_character_, `Report note` = NA_character_,
`Other note` = NA_character_, subsig = 1, overall = 0, varname = NA_character_,
statsig = NA_real_, use = NA_real_, difference = 9.16044821292665), .Names = c("Item Stem",
"Item Response", "orgchar_group", "Org Characteristic", "N",
"percent", "Graphs note", "Report note", "Other note", "subsig",
"overall", "varname", "statsig", "use", "difference"), row.names = 288L, class = "data.frame")
> temp[1,1]
[1] "IF APPLICABLE – Which of the following best characterizes the expectations with"
> temp[1,1] == "IF APPLICABLE – Which of the following best characterizes the expectations with"
[1] FALSE

Turns out it was in fact a non-ASCII character, shoutout to the commenters for helping me figure it out by 1) suggesting it and 2) showing that it worked for them.
I was able to figure it out using insights from here (& here) and here.
I used a grep command (from @Tyler Rinker) to determine that there was in fact a non-ASCII character in my string, and a stringi command (from @hadley) to determine the encoding. I then used a base solution from @Josh O'Brien to remove it. Turns out it was the hyphen (it is an en dash, not a plain ASCII hyphen).
# working in the temp df
> x <- temp[1,1]
> grepl("[^ -~]", x)
[1] TRUE
> stringi::stri_enc_mark(x)
[1] "UTF-8"
> iconv(x, "UTF-8", "ASCII", sub="")
[1] "IF APPLICABLE Which of the following best characterizes the expectations with"
# set x as df$`Var name` and reassign it to fix
df$`Var name` <- iconv(df$`Var name`, "UTF-8", "ASCII", sub="")
Still don't understand it enough to explain why it happened but it's fixed now.
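For anyone hitting the same thing, here is a minimal sketch of how to see which character is the culprit and swap it for a plain hyphen instead of deleting it. The example string is a shortened stand-in for the real cell value, and the assumption is that the offending character is the en dash (U+2013) visible in the dput output above.

```r
# Shortened stand-in for the problem string; \u2013 is an en dash
x <- "IF APPLICABLE \u2013 Which of the following"

# Code points of every character; anything above 127 is non-ASCII
code_points <- utf8ToInt(x)
non_ascii <- code_points[code_points > 127]
non_ascii  # 8211, i.e. U+2013

# Replace the en dash with an ASCII hyphen rather than dropping it
fixed <- gsub("\u2013", "-", x, fixed = TRUE)
fixed
```

Replacing rather than stripping keeps the string readable, and typing the comparison string with an explicit "\u2013" escape avoids depending on how the editor or clipboard encodes the dash.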


How does R Markdown automatically format print effects into dataframes? Or how can I access special print methods?

I'm working with the WRS2 package and there are cases where it'll output its analysis (bwtrim) into a list with a special class of the analysis type class = "bwtrim". I can't as.data.frame() it, but I found that there is a custom print method called print.bwtrim associated with it.
As an example let's say this is the output: bwtrim.out <- bwtrim(...). When I run the analysis output in an Rmarkdown chunk, it seems to "steal" part of the text output and make it into a dataframe.
So here's my question, how can I either access print.bwtrim or how does R markdown automatically format certain outputs into dataframes? Because I'd like to take this outputted dataframe and use it for other purposes.
Update: Here is a minimal working example; put the following in a chunk in an Rmd file.
```{r}
library(WRS2)
df <-
data.frame(
subject = rep(c(1:100), each = 2),
group = rep(c("treatment", "control"), each = 2),
timepoint = rep(c("pre", "post"), times = 2),
dv = rnorm(200, mean = 2)
)
analysis <- WRS2::bwtrim(dv ~ group * timepoint,
id = subject,
data = df,
tr = .2)
analysis
```
With this, a data.frame automatically shows up in the chunk afterwards and it shows all the values very nicely. My main question is how I can get this data.frame for my own uses. If you do str(analysis), you see that it's a list. If you do class(analysis), you get "bwtrim". If you do methods(class = "bwtrim"), you get the print method. And methods(print) will have a line that says print.bwtrim*. But I can't seem to figure out how to call print.bwtrim myself.
Regarding what Rmarkdown is doing, compare the following
If you run this in a chunk, it actually steals the data.frame part and puts it into a separate figure.
```{r}
capture.output(analysis)
```
However, if you run the same line in the console, the entire output comes out properly. What's also interesting is that if you try to assign it to another object, the output will be stolen before it can be assigned.
Compare x when you run the following in either a chunk or the console.
```{r}
x<-capture.output(analysis)
```
This is what I get from the chunk approach when I call x
[1] "Call:"
[2] "WRS2::bwtrim(formula = dv ~ group * timepoint, id = subject, "
[3] " data = df, tr = 0.2)"
[4] ""
[5] ""
This is what I get when I do it all in the console
[1] "Call:"
[2] "WRS2::bwtrim(formula = dv ~ group * timepoint, id = subject, "
[3] " data = df, tr = 0.2)"
[4] ""
[5] " value df1 df2 p.value"
[6] "group 1.0397 1 56.2774 0.3123"
[7] "timepoint 0.0001 1 57.8269 0.9904"
[8] "group:timepoint 0.5316 1 57.8269 0.4689"
[9] ""
My question is: how can I call whatever RStudio/R Markdown is doing to make data.frames, so that I can easily get a data.frame myself?
Update 2: This is probably not a bug, as discussed here https://github.com/rstudio/rmarkdown/issues/1150.
Update 3: You can access the method by using WRS2:::print.bwtrim(analysis), though I'm still interested in what Rmarkdown is doing.
Update 4: It might not be the case that Rmarkdown is stealing the output and automatically making dataframes from it, as you can see when you call x after you've already captured the output. Looking at WRS2:::print.bwtrim, it prints a dataframe that it creates, which I'm guessing Rmarkdown recognizes then formats it out.
See below for the print.bwtrim.
function (x, ...)
{
cat("Call:\n")
print(x$call)
cat("\n")
dfx <- data.frame(value = c(x$Qa, x$Qb, x$Qab), df1 = c(x$A.df[1],
x$B.df[1], x$AB.df[1]), df2 = c(x$A.df[2], x$B.df[2],
x$AB.df[2]), p.value = c(x$A.p.value, x$B.p.value, x$AB.p.value))
rownames(dfx) <- c(x$varnames[2], x$varnames[3], paste0(x$varnames[2],
":", x$varnames[3]))
dfx <- round(dfx, 4)
print(dfx)
cat("\n")
}
<bytecode: 0x000001f587dc6078>
<environment: namespace:WRS2>
In R Markdown documents, automatic printing is done by knitr::knit_print rather than print. I don't think there's a knit_print.bwtrim method defined, so it will use the default method, which is defined as
function (x, ..., inline = FALSE)
{
if (inline)
x
else normal_print(x)
}
and normal_print will call print().
You are asking why the output is different. I don't see that when I knit the document to html_document, but I do see it with html_notebook. I don't know the details of what is being done, but if you look at https://rmarkdown.rstudio.com/r_notebook_format.html you can see a discussion of "output source functions", which manipulate chunks to produce different output.
The fancy output you're seeing looks a lot like what knitr::knit_print does for a dataframe, so maybe html_notebook is substituting that in place of print.
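Since print.bwtrim builds its table from fields stored on the object, one way to get the data.frame for your own use is to rebuild it the same way. This is only a sketch: bwtrim_table is a made-up helper name, the field names are copied from the print method shown above, and the fake object (with values taken from the console output above) stands in for a real bwtrim() result so the snippet runs without WRS2 installed.

```r
# Hypothetical helper: rebuild the table that print.bwtrim() prints
bwtrim_table <- function(x) {
  out <- data.frame(
    value   = c(x$Qa, x$Qb, x$Qab),
    df1     = c(x$A.df[1], x$B.df[1], x$AB.df[1]),
    df2     = c(x$A.df[2], x$B.df[2], x$AB.df[2]),
    p.value = c(x$A.p.value, x$B.p.value, x$AB.p.value)
  )
  rownames(out) <- c(x$varnames[2], x$varnames[3],
                     paste0(x$varnames[2], ":", x$varnames[3]))
  out
}

# Stand-in for a bwtrim result (assumed structure, not real WRS2 output)
fake <- list(
  Qa = 1.0397, Qb = 0.0001, Qab = 0.5316,
  A.df = c(1, 56.2774), B.df = c(1, 57.8269), AB.df = c(1, 57.8269),
  A.p.value = 0.3123, B.p.value = 0.9904, AB.p.value = 0.4689,
  varnames = c("dv", "group", "timepoint")
)
res <- bwtrim_table(fake)
res
```

With a real fit, bwtrim_table(analysis) should give the unrounded version of the printed table, assuming the object carries the same fields as in the print method.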

Using 'ignore' argument in hunspell function

I'm attempting to exclude some words when running hunspell_check on a text block in Rstudio.
ignore_me <- c("Daniel")
hunspell_check(unlist(some_text), ignore = ignore_me, dict = dictionary("en_GB"))
However, whenever I run I get the following error:
Error in hunspell_check(unlist(some_text, dict = dictionary("en_GB"), :
unused argument (ignore = ignore_me))
I've had a look around SO and trawled the documentation but am struggling to figure out what's gone wrong.
It looks like you’ve missed a closing bracket after some_text, so it’s passing ignore as an argument to unlist() rather than hunspell_check().
UPDATE: Ok, I think you were looking at an old version of the documentation. At least that's what I did at first (https://www.rdocumentation.org/packages/hunspell/versions/1.1/topics/hunspell_check). In the current version, 2.9, ignore is no longer an argument for hunspell_check(). Instead, use add_words in the call to dictionary():
library(hunspell)
some_text <- list("hello", "there", "Daniell")
hunspell_check(unlist(some_text), dict = dictionary("en_GB"))
# [1] TRUE TRUE FALSE
ignore_me <- "Daniell"
hunspell_check(unlist(some_text), dict = dictionary("en_GB", add_words = ignore_me))
# [1] TRUE TRUE TRUE

Give a new variable value 0 or 1 based on the distance between two words in another variable

I am new to R. In my dataset, I have a variable called Reason. I want to create a new column called Price. If any of the following conditions is met:
word "Price" and word "High" are both mentioned in Reason and the distance between them is less than 6 words
word "Price" and word "expensive" are both mentioned in Reason and the distance between them is less than 6 words
word "Price" and word "increase" are both mentioned in Reason and the distance between them is less than 6 words
then Price = 1. Otherwise, Price = 0.
I found the following user defined function to get the distance between 2 words
distance <- function(string, term1, term2) {
words <- strsplit(string, "\\s")[[1]]
indices <- 1:length(words)
names(indices) <- words
abs(indices[term1] - indices[term2])
}
but I don't know how to apply it to the whole column to get the expected results. I tried the following code but it only gives me "logical(0)" as the result.
for (j in seq(Survey$Reason))
{
Survey$Price[[j]]<- distance(Survey$Reason[[j]], " price ", " high ") <=6
}
Any help is highly appreciated.
Thanks
Starting from your sample data:
survey <- structure(list(Reason = c("Their price are extremely high.", "Because my price was increased so much, I wouldn't want anyone else to have to deal with that.", "Just because the intial workings were fine, but after we realised it would affect our contract, it left a sour taste in our mouth.", "Problems with the repair", "They did not handle my complaint as well I would have liked.", "Bad service overall.")), .Names = "Reason", row.names = c(NA, 6L), class = "data.frame")
First, I updated your function to remove punctuation and directly return the result of your position test:
distanceOK <- function(string, term1, term2,n=6) {
words <- strsplit(gsub("[[:punct:]]", "", string), "\\s")[[1]]
indices <- 1:length(words)
names(indices) <- words
dist <- abs(indices[term1] - indices[term2])
ifelse(is.na(dist)|dist>n,0,1)
}
Then we apply:
survey$Price <- sapply(survey$Reason, FUN=function(str) distanceOK(str, "price","high"))
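The call above only checks the "price"/"high" pair. A rough sketch of extending it to all three pairs from the question follows. The helper name within_n is made up, and it makes a few assumptions that go beyond the original function: text is lowercased, words are matched by prefix (so "increased" counts as "increase"), repeated words are all considered, and the threshold is "at most n words apart" with n = 6 as in the answer's default.

```r
# Hypothetical helper: TRUE if any occurrence of term1 is within n words
# of any occurrence of term2 (prefix match, case-insensitive)
within_n <- function(string, term1, term2, n = 6) {
  clean <- tolower(gsub("[[:punct:]]", "", string))
  words <- strsplit(clean, "\\s+")[[1]]
  pos1 <- which(startsWith(words, term1))
  pos2 <- which(startsWith(words, term2))
  if (length(pos1) == 0 || length(pos2) == 0) return(FALSE)
  min(abs(outer(pos1, pos2, "-"))) <= n
}

reasons <- c(
  "Their price are extremely high.",
  "Because my price was increased so much, I wouldn't want anyone else to have to deal with that.",
  "Bad service overall."
)
partners <- c("high", "expensive", "increase")

# 1 if "price" is close to any of the partner words, else 0
price_flag <- vapply(reasons, function(r) {
  hits <- vapply(partners, function(p) within_n(r, "price", p), logical(1))
  as.integer(any(hits))
}, integer(1), USE.NAMES = FALSE)
price_flag
```

On the real data this would be survey$Price <- vapply(survey$Reason, ...) with the same inner function. Change `<= n` to `< n` if "less than 6 words" should be strict.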

Reshaping data for use with geeglm()

Could you please help me figure out why I am getting an error?
Initially my data looks like this:
> attributes(compl)$names
[1] "UserID" "compl_bin" "Sex.x" "PHQ_base" "PHQ_Surv1" "PHQ_Surv2" "PHQ_Surv3"
[8] "PHQ_Surv4" "EFE" "Neuro" "Intervention.x" "depr0" "error1_1.x" "error1_2.x"
[15] "error1_3.x" "error1_4.x" "stress0" "stress1" "stress2" "stress3" "stress4"
[22] "hours1" "hours2" "hours3" "hours4" "subject"
First I reshape my data to prepare for geeglm:
compl$subject <- factor(rownames(compl))
nobs <- nrow(compl)
compl_long <- reshape(compl, idvar = "subject",
varying = list(c("PHQ_Surv1", "PHQ_Surv2" ,
"PHQ_Surv3", "PHQ_Surv4"),
c("error1_1.x", "error1_2.x",
"error1_3.x", "error1_4.x"),
c("stress1", "stress2", "stress3",
"stress4"),
c("hours1", "hours2", "hours3",
"hours4")),
v.names = c("PHQ", "error", "stress", "hours"),
times = c("1", "2", "3", "4"), direction = "long")
(Editor's note: not sure what this next output is from...)
[1] "UserID" "compl_bin" "Sex.x" "PHQ_base" "EFE" "Neuro" "Intervention.x"
[8] "depr0" "stress0" "subject" "time" "PHQ" "error" "stress"
[15] "hours"
Then I use geeglm function:
library(geepack)
geeSand=(geeglm(PHQ~as.factor(compl_bin) + Neuro+PHQ_base+as.factor(depr0) +
EFE+as.factor(Sex.x) + as.factor(error)+stress+hours,
family = poisson, data=compl_long,
id=subject, corst="exchangeable"))
I am getting an error:
"Error in geese.fit(xx, yy, id, offset, soffset, w, waves = waves, zsca, :
nrow(zsca) and length(y) not match"
If I remove variables as.factor(error) and hours, geeglm does not complain, and I am getting the output. The function does not work with error and hours variables. I check the length of all the variables, they are equal. Could you please help me figure out what is wrong?
Many thanks!
Found this at: https://stat.ethz.ch/pipermail/r-help/2008-October/178337.html
"I'm pretty sure this is a bug in geese(), which should be reported to the maintainer of geepack. The problem is with the treatment of missing values.
If one looks at dim(na.omit(dat[,c("id","score","chem","time")])) one gets 44. In geese.fit() zsca is set equal to matrix(1,N,1) where N is set equal to length(id). But id has length 46 whereas the response y has been trimmed down to length 44 by eliminating any rows of the data where any of the variables involved are missing. Hence a problem.
The solution of the problem requires some code re-writing by the maintainer of geepack."
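If missing values are indeed the cause, a possible workaround is to drop the incomplete rows yourself before fitting, so that id and the response end up the same length. The sketch below uses a small made-up data frame (toy_long, with invented values) in place of compl_long, and leaves the geeglm() call commented out so the snippet runs without geepack. Whether this resolves the error for the real data is an assumption based on the quoted message.

```r
# Toy long-format data standing in for compl_long (hypothetical values)
toy_long <- data.frame(
  subject = factor(rep(1:4, each = 2)),
  PHQ     = c(3, 5, NA, 2, 4, 4, 1, 0),
  stress  = c(1, 2, 2, NA, 3, 1, 2, 2),
  hours   = c(40, 45, 50, 38, NA, 42, 41, 39)
)

# Keep only the variables used in the model, then drop incomplete rows
model_vars <- c("subject", "PHQ", "stress", "hours")
toy_complete <- na.omit(toy_long[, model_vars])

# Drop subjects that no longer have any rows and keep rows grouped by id
toy_complete <- droplevels(toy_complete[order(toy_complete$subject), ])

nrow(toy_long)      # 8
nrow(toy_complete)  # 5

# Then fit on the complete rows only, for example:
# fit <- geepack::geeglm(PHQ ~ stress + hours, family = poisson,
#                        data = toy_complete, id = subject,
#                        corstr = "exchangeable")
```

For the real data, model_vars would list every variable appearing in the formula plus subject, including error and hours.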

Get function's title from documentation

I would like to get the title of a base function (e.g.: rnorm) in one of my scripts. That is included in the documentation, but I have no idea how to "grab" it.
I mean the line given in the RD files as \title{} or the top line in documentation.
Is there any simple way to do this without calling Rd_db function from tools and parse all RD files -- as having a very big overhead for this simple stuff? Other thing: I tried with parse_Rd too, but:
I do not know which Rd file holds my function,
I have no Rd files on my system (just rdb, rdx and rds).
So a function to parse the (offline) documentation would be the best :)
POC demo:
> get.title("rnorm")
[1] "The Normal Distribution"
If you look at the code for help, you see that the function index.search seems to be what is pulling in the location of the help files, and that the default for the associated find.package() function is NULL. Turns out that there is neither a help page for that function nor is it exported, so I tested the usual suspects for which package it was in (base, tools, utils), and ended up with utils:
utils:::index.search("+", find.package())
#[1] "/Library/Frameworks/R.framework/Resources/library/base/help/Arithmetic"
So:
ghelp <- utils:::index.search("+", find.package())
gsub("^.+/", "", ghelp)
#[1] "Arithmetic"
ghelp <- utils:::index.search("rnorm", find.package())
gsub("^.+/", "", ghelp)
#[1] "Normal"
What you are asking for is \title{Title}, but here I have shown you how to find the specific Rd file to parse, and it sounds as though you already know how to do that.
EDIT: @Hadley has provided a method for getting all of the help text, once you know the package name, so applying that to the index.search() value above:
target <- gsub("^.+/library/(.+)/help.+$", "\\1", utils:::index.search("rnorm",
find.package()))
doc.txt <- pkg_topic(target, "rnorm") # assuming both of Hadley's functions are here
print(doc.txt[[1]][[1]][1])
#[1] "The Normal Distribution"
It's not completely obvious what you want, but the code below will get the Rd data structure corresponding to the topic you're interested in - you can then manipulate that to extract whatever you want.
There may be simpler ways, but unfortunately very little of the needed code is exported and documented. I really wish there was a base help package.
pkg_topic <- function(package, topic, file = NULL) {
# Find "file" name given topic name/alias
if (is.null(file)) {
topics <- pkg_topics_index(package)
topic_page <- subset(topics, alias == topic, select = file)$file
if(length(topic_page) < 1)
topic_page <- subset(topics, file == topic, select = file)$file
stopifnot(length(topic_page) >= 1)
file <- topic_page[1]
}
rdb_path <- file.path(system.file("help", package = package), package)
tools:::fetchRdDB(rdb_path, file)
}
pkg_topics_index <- function(package) {
help_path <- system.file("help", package = package)
file_path <- file.path(help_path, "AnIndex")
if (length(readLines(file_path, n = 1)) < 1) {
return(NULL)
}
topics <- read.table(file_path, sep = "\t",
stringsAsFactors = FALSE, comment.char = "", quote = "", header = FALSE)
names(topics) <- c("alias", "file")
topics[complete.cases(topics), ]
}
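Once you have an Rd object, pulling out the title by its tag is a bit more robust than indexing by position. A minimal sketch: get_title is a made-up helper name, and the tiny Rd page written to a temp file is only there so the example runs without the internal functions above.

```r
# Hypothetical helper: extract the \title text from a parsed Rd object
get_title <- function(rd) {
  tags <- vapply(rd, function(el) {
    tag <- attr(el, "Rd_tag")
    if (is.null(tag)) "" else tag
  }, character(1))
  title <- rd[[which(tags == "\\title")[1]]]
  trimws(paste(unlist(title), collapse = ""))
}

# Toy Rd page, parsed with the exported tools::parse_Rd()
rd_file <- tempfile(fileext = ".Rd")
writeLines(c("\\name{demo}",
             "\\title{The Normal Distribution}",
             "\\description{Toy page.}"), rd_file)
rd <- tools::parse_Rd(rd_file)
get_title(rd)
```

With the two functions above defined, get_title(pkg_topic("stats", "rnorm")) should give the same kind of result, assuming the object returned by tools:::fetchRdDB has the same structure as parse_Rd() output.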
