R: stop guessing names when root is similar

Is there an option in R that prevents it from returning values from field names with the same beginning if the one you asked for does not exist? This is causing me a fair amount of problems as my fields may or may not be present, and they have similar root names.
d <- data.frame(areallylongname = -99, y = 2, z = 0)
# How do I stop this returning a value
d$a
#[1] -99
# it should return NULL like this
d$jjj
# NULL

You can switch to bracket notation, which requires exact column names:
> d['a']
Error in `[.data.frame`(d, "a") : undefined columns selected
> d['y']
y
1 2
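If you would rather get NULL than an error for a missing column (which is what the question asks for), a minimal sketch is to use [[ ]], which matches names exactly by default. The data frame below just repeats the one from the question; the variable names missing_col and present_col are only for illustration.

```r
d <- data.frame(areallylongname = -99, y = 2, z = 0)

# `[[` matches column names exactly by default, so a prefix does not match
missing_col <- d[["a"]]   # no column is named exactly "a"
present_col <- d[["y"]]   # the y column, as a plain vector

is.null(missing_col)
present_col
```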

If you want to avoid partial matching and get an error instead, the following could work.
However, this will turn all other warnings into errors as well.
options(warnPartialMatchDollar = TRUE, warn = 2)
# test
d$a
Error in $.data.frame(d, a) :
(converted from warning) Partial match of 'a' to 'areallylongname' in data frame
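If converting every warning into an error is too broad, one possible variation (a sketch, not part of the original answer) is to enable only warnPartialMatchDollar and handle the warning around the single expression you care about. Here the handler returns NULL, which is an assumption about what the asker wants; it could just as well call stop().

```r
d <- data.frame(areallylongname = -99, y = 2, z = 0)

# Enable only the partial-match warning; the global `warn` option is left alone
old <- options(warnPartialMatchDollar = TRUE)

# A warning raised while evaluating this one expression is turned into NULL
partial <- tryCatch(d$a, warning = function(w) NULL)
exact <- tryCatch(d$y, warning = function(w) NULL)

options(old)  # restore the previous setting
```

Note that the tryCatch() here catches any warning raised by the wrapped expression, not only partial-match warnings, so it is best kept around a single `$` access.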


Separating error message from error condition in package

Background
Packages can include a lot of functions. Some of them require informative error messages, and perhaps some comments in the function to explain what/why is happening. An example, f1 in a hypothetical f1.R file. All documentation and comments (both why the error and why the condition) in one place.
f1 <- function(x){
if(!is.character(x)) stop("Only characters supported")
# user input ...
# .... NaN problem in g()
# ....
# ratio of magnitude negative integer i base ^ i is positive
if(x < .Machine$longdouble.min.exp / .Machine$longdouble.min.exp) stop("oof, an error")
log(x)
}
f1(-1)
# >Error in f1(-1) : oof, an error
Instead, I create a separate conds.R that defines an error function e (and, similarly, w for warnings, s for suggestions, etc.), for example:
e <- function(x){
switch(
as.character(x),
"1" = "Only character supported",
# user input ...
# .... NaN problem in g()
# ....
"2" = "oof, and error") |>
stop()
}
Then in, say, f.R script I can define f2 as
f2 <- function(x){
if(!is.character(x)) e(1)
# ratio of magnitude negative integer i base ^ i is positive
if(x < .Machine$longdouble.min.exp / .Machine$longdouble.min.exp) e(2)
log(x)
}
f2(-1)
#> Error in e(2) : oof, and error
Which does throw the error, and on top of it a nice traceback & rerun with debug option in the console. Further, as package maintainer I would prefer this as it avoids considering writing terse if statements + 1-line error message or aligning comments in a tryCatch statement.
Question
Is there a reason (not opinion on syntax) to avoid writing a conds.R in a package?
There is no reason to avoid writing conds.R. This is very common and good practice in package development, especially as many of the checks you want to do will be applicable across many functions (like asserting the input is character, as you've done above). Here's a nice example from dplyr.
library(dplyr)
df <- data.frame(x = 1:3, x = c("a", "b", "c"), y = 4:6)
names(df) <- c("x", "x", "y")
df
#> x x y
#> 1 1 a 4
#> 2 2 b 5
#> 3 3 c 6
df2 <- data.frame(x = 2:4, z = 7:9)
full_join(df, df2, by = "x")
#> Error: Input columns in `x` must be unique.
#> x Problem with `x`.
nest_join(df, df2, by = "x")
#> Error: Input columns in `x` must be unique.
#> x Problem with `x`.
traceback()
#> 7: stop(fallback)
#> 6: signal_abort(cnd)
#> 5: abort(c(glue("Input columns in `{input}` must be unique."), x = glue("Problem with {err_vars(vars[dup])}.")))
#> 4: check_duplicate_vars(x_names, "x")
#> 3: join_cols(tbl_vars(x), tbl_vars(y), by = by, suffix = c("", ""), keep = keep)
#> 2: nest_join.data.frame(df, df2, by = "x")
#> 1: nest_join(df, df2, by = "x")
Here, both functions rely on code written in join-cols.R. Both call join_cols(), which in turn calls check_duplicate_vars(), whose source code I've copied here:
check_duplicate_vars <- function(vars, input, error_call = caller_env()) {
dup <- duplicated(vars)
if (any(dup)) {
bullets <- c(
glue("Input columns in `{input}` must be unique."),
x = glue("Problem with {err_vars(vars[dup])}.")
)
abort(bullets, call = error_call)
}
}
Although different in syntax from what you wrote, it's designed to provide the same behaviour, and shows it is possible to include in a package and no reason (from my understanding) not to do this. However, I would add a few syntax points based on your code above:
I would bundle the check (if() statement) inside the package with the error raising to reduce repeating yourself in other areas you use the function.
It's often nicer to include the name of the variable or argument passed in so the error message is explicit, such as in the dplyr example above. This makes the error more clear to the user what is causing the problem, in this case, that the x column is not unique in df.
The message #> Error in e(2) : oof, and error in your example is more obscure to the user, especially as e() is likely not exported in the NAMESPACE and they would need to parse the source code to understand where the error is generated. If you use stop(..., call. = FALSE), or pass the calling environment through the nested functions as in join-cols.R, then you can keep unhelpful information out of the error message and traceback(). This is for instance suggested in Hadley's Advanced R:
By default, the error message includes the call, but this is typically not useful (and recapitulates information that you can easily get from traceback()), so I think it’s good practice to use call. = FALSE
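To make the points above concrete, here is a minimal sketch of what a helper in conds.R might look like when the check and the error are bundled together, the argument name is included in the message, and call. = FALSE is used. The names check_character and f3 are hypothetical and not taken from any package.

```r
# Hypothetical helper that could live in conds.R: check and error in one place
check_character <- function(x, arg = deparse(substitute(x))) {
  if (!is.character(x)) {
    # call. = FALSE keeps the internal helper call out of the message
    stop(sprintf("`%s` must be a character vector, not <%s>.", arg, class(x)[1]),
         call. = FALSE)
  }
  invisible(x)
}

# Hypothetical caller: the helper reports the caller's argument name
f3 <- function(input) {
  check_character(input)
  nchar(input)
}

msg <- tryCatch(f3(-1), error = function(e) conditionMessage(e))
msg
```

Because the default for arg is computed with deparse(substitute(x)), the message names the variable as it was written at the call site (here input), which is usually more helpful to the user than a generic message.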

How to check if vector is a single NA value without length warning and without suppression

I have a function with NA as a default, but if not NA should be a character vector not restricted to size 1. I have a check to validate these, but is.na produces the standard warning when the vector is a character vector with length greater than 1.
so_function <- function(x = NA) {
if (!(is.na(x) | is.character(x))) {
stop("This was just an example for you SO!")
}
}
so_function(c("A", "B"))
#> Warning in if (!(is.na(x) | is.character(x))) {: the condition has length >
#> 1 and only the first element will be used
An option to prevent the warning I came up with was to use identical:
so_function <- function(x = NA) {
if (!(identical(x, NA) | is.character(x))) {
stop("This was just an example for you SO!")
}
}
My issue here is that this function will generally be taking Excel sheet data loaded into R as inputs, and the NA values generated from that are often NA_character_, NA_integer_, and NA_real_, so identical(x, NA) is often FALSE when I actually need it to be TRUE.
For the broader context, I am experiencing this issue for S3 classes I am creating for a package, and the function below approximates how I am validating multiple attributes for that class, which is when the warnings are appearing. Because of this, I am trying to avoid suppressing warnings as the solution, so would be interested to know what best practice exists to solve this issue.
Edit
In order to make use cases clearer, this is validating attributes for a class, where I want to ensure the attribute is either a single NA value, or a character vector of any length:
so_function(NA_character_) # should pass
so_function(NA_integer_) # should pass
so_function(c(NA, NA)) # should fail
so_function(c("A", "B")) # should pass
so_function(c(1, 2, 3)) # should fail
The length warning comes from the use of if, which expects a length 1 vector, and is.na which is vectorised.
You could wrap the is.na call in any or all to compress it to a length 1 vector, but there may be edge cases where that doesn't work as you expect. Instead, I would use short-circuit evaluation (&&) so that is.na is only evaluated once x is known to have length 1:
so_function <- function(x = NA) {
if (!((length(x)==1 && is.na(x)) | is.character(x))) {
stop("This was just an example for you SO!")
}
}
so_function(NA_character_) # should pass
so_function(NA_integer_) # should pass
so_function(c(NA, NA)) # should fail
Error in so_function(c(NA, NA)) : This was just an example for you SO!
so_function(c("A", "B")) # should pass
so_function(c(1, 2, 3)) # should fail
Error in so_function(c(1, 2, 3)) : This was just an example for you SO!
Another option is to use NULL as the default value instead.
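As a sketch of how the short-circuit check could be factored out for reuse across several attribute validators, the hypothetical helper is_valid_attr below returns TRUE or FALSE instead of stopping, so the caller decides how to report the problem. The name and the TRUE/FALSE return convention are assumptions, not part of the original answer.

```r
# Hypothetical helper: TRUE for a single NA of any type, or for any character vector
is_valid_attr <- function(x) {
  # && short-circuits, so is.na() is only evaluated when x has length 1
  is_single_na <- length(x) == 1L && is.na(x)
  is_single_na || is.character(x)
}

is_valid_attr(NA_character_)  # TRUE
is_valid_attr(NA_integer_)    # TRUE
is_valid_attr(c(NA, NA))      # FALSE
is_valid_attr(c("A", "B"))    # TRUE
is_valid_attr(c(1, 2, 3))     # FALSE
```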
I don't think the problem arises from is.na() - it is a vectorized function which produces a vector as an output. is.character(x) on the other hand is not vectorized so it only will output a single value.
You can leverage apply-like functions to overcome this e.g.
sapply(c("a", NA, 5), is.character)
if is likewise not vectorized - you are better off using ifelse for element-wise comparison.
I don't think I quite grasped what you want to do with your function, but it could be rewritten like this:
so_function_2 <- function(x = NA) {
condit <- !(is.na(x) | sapply(x, is.character))
ifelse(condit, "This was just an example for you SO!", "FALSE")
}

Calculating distance using latitude and longitude error [duplicate]

When working with R I frequently get the error message "subscript out of bounds". For example:
# Load necessary libraries and data
library(igraph)
library(NetData)
data(kracknets, package = "NetData")
# Reduce dataset to nonzero edges
krack_full_nonzero_edges <- subset(krack_full_data_frame, (advice_tie > 0 | friendship_tie > 0 | reports_to_tie > 0))
# convert to graph data frame
krack_full <- graph.data.frame(krack_full_nonzero_edges)
# Set vertex attributes
for (i in V(krack_full)) {
for (j in names(attributes)) {
krack_full <- set.vertex.attribute(krack_full, j, index=i, attributes[i+1,j])
}
}
# Calculate reachability for each vertex
reachability <- function(g, m) {
reach_mat = matrix(nrow = vcount(g),
ncol = vcount(g))
for (i in 1:vcount(g)) {
reach_mat[i,] = 0
this_node_reach <- subcomponent(g, (i - 1), mode = m)
for (j in 1:(length(this_node_reach))) {
alter = this_node_reach[j] + 1
reach_mat[i, alter] = 1
}
}
return(reach_mat)
}
reach_full_in <- reachability(krack_full, 'in')
reach_full_in
This generates the following error Error in reach_mat[i, alter] = 1 : subscript out of bounds.
However, my question is not about this particular piece of code (even though it would be helpful to solve that too), but my question is more general:
What is the definition of a subscript-out-of-bounds error? What causes it?
Are there any generic ways of approaching this kind of error?
This is because you try to access an array out of its boundary.
I will show you how you can debug such errors.
I set options(error=recover)
I run reach_full_in <- reachability(krack_full, 'in')
I get :
reach_full_in <- reachability(krack_full, 'in')
Error in reach_mat[i, alter] = 1 : subscript out of bounds
Enter a frame number, or 0 to exit
1: reachability(krack_full, "in")
I enter 1 and I get
Called from: top level
I type ls() to see my current variables
[1] "*tmp*" "alter" "g" "i" "j" "m" "reach_mat" "this_node_reach"
Now, I will see the dimensions of my variables :
Browse[1]> i
[1] 1
Browse[1]> j
[1] 21
Browse[1]> alter
[1] 22
Browse[1]> dim(reach_mat)
[1] 21 21
You see that alter is out of bounds (22 > 21) in the line:
reach_mat[i, alter] = 1
To avoid such errors, personally I do this:
Try to use the apply family of functions (lapply, sapply, ...). They are safer than for loops.
I use seq_along rather than 1:n, because 1:n becomes 1:0 (that is, c(1, 0)) when n is 0.
Try to think of a vectorized solution if you can, to avoid mat[i,j] index access.
EDIT vectorize the solution
For example, here I see that you don't use the fact that set.vertex.attribute is vectorized.
You can replace:
# Set vertex attributes
for (i in V(krack_full)) {
for (j in names(attributes)) {
krack_full <- set.vertex.attribute(krack_full, j, index=i, attributes[i+1,j])
}
}
by this:
## set.vertex.attribute is vectorized!
## no need to loop over vertex!
for (attr in names(attributes))
krack_full <- set.vertex.attribute(krack_full,
attr, value = attributes[,attr])
It just means that either alter > ncol( reach_mat ) or i > nrow( reach_mat ), in other words, your indices exceed the array boundary (i is greater than the number of rows, or alter is greater than the number of columns).
Just run the above tests to see what and when is happening.
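The tests mentioned above can be sketched as follows. The matrix size and index values are taken from the debugging session earlier in the thread; row_ok, col_ok, and msg are just illustrative names.

```r
reach_mat <- matrix(0, nrow = 21, ncol = 21)
i <- 1
alter <- 22

# Check each index against the matrix dimensions before using it
row_ok <- i >= 1 && i <= nrow(reach_mat)
col_ok <- alter >= 1 && alter <= ncol(reach_mat)

# The same failure, captured instead of aborting the script
msg <- tryCatch({
  reach_mat[i, alter] <- 1
  "no error"
}, error = function(e) conditionMessage(e))

row_ok
col_ok
msg
```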
Only an addition to the above responses: A possibility in such cases is that you are calling an object, that for some reason is not available to your query. For example you may subset by row names or column names, and you will receive this error message when your requested row or column is not part of the data matrix or data frame anymore.
Solution: As a short version of the responses above: you need to find the last working row name or column name, and the next called object should be the one that could not be found.
If you run parallel codes like "foreach", then you need to convert your code to a for loop to be able to troubleshoot it.
If this helps anybody, I encountered this while using purrr::map() with a function I wrote which was something like this:
find_nearby_shops <- function(base_account) {
states_table %>%
filter(state == base_account$state) %>%
left_join(target_locations, by = c('border_states' = 'state')) %>%
mutate(x_latitude = base_account$latitude,
x_longitude = base_account$longitude) %>%
mutate(dist_miles = geosphere::distHaversine(p1 = cbind(longitude, latitude),
p2 = cbind(x_longitude, x_latitude))/1609.344)
}
nearby_shop_numbers <- base_locations %>%
split(f = base_locations$id) %>%
purrr::map_df(find_nearby_shops)
I would get this error sometimes with samples, but most times I wouldn't. The root of the problem is that some of the states in the base_locations table (PR) did not exist in the states_table, so essentially I had filtered out everything, and passed an empty table on to mutate. The moral of the story is that you may have a data issue and not (just) a code problem (so you may need to clean your data.)
Thanks to agstudy and zx8754 for their answers above, which helped with the debugging.
I sometimes encounter the same issue. I can only answer your second bullet, because I am not as expert in R as I am with other languages. I have found that the standard for loop has some unexpected results. Say x = 0
for (i in 1:x) {
print(i)
}
The output is
[1] 1
[1] 0
Whereas with python, for example
for i in range(x):
print i
does nothing. The loop is not entered.
I expected that if x = 0 that in R, the loop would not be entered. However, 1:0 is a valid range of numbers. I have not yet found a good workaround besides having an if statement wrapping the for loop
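For what it's worth, base R's seq_len() and seq_along() are the usual way around this, since they return an empty sequence when there is nothing to iterate over. A small sketch (the counter variables are only there to show how many times each loop body runs):

```r
x <- 0

# 1:0 counts down and is c(1, 0), so this body runs twice
n_colon <- 0
for (i in 1:x) n_colon <- n_colon + 1

# seq_len(0) is an empty vector, so this body never runs
n_seq <- 0
for (i in seq_len(x)) n_seq <- n_seq + 1

# seq_along() does the same for the indices of an existing (possibly empty) vector
v <- character(0)
n_along <- 0
for (i in seq_along(v)) n_along <- n_along + 1

c(n_colon, n_seq, n_along)
```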
This came from Stanford's free SNA tutorial,
which states:
# Reachability can only be computed on one vertex at a time. To
# get graph-wide statistics, change the value of "vertex"
# manually or write a for loop. (Remember that, unlike R objects,
# igraph objects are numbered from 0.)
OK, so whenever you use igraph, the first row/column is 0 rather than 1, but a matrix starts at 1, so for any calculation under igraph you would need x-1, as shown at
this_node_reach <- subcomponent(g, (i - 1), mode = m)
but for the alter calculation, there is a typo here
alter = this_node_reach[j] + 1
delete +1 and it will work alright
What did it for me was going back in the code and check for errors or uncertain changes and focus on need-to-have over nice-to-have.

Data.frames in R: name autocompletion?

Sorry if this is trivial. I am seeing the following behaviour in R:
> myDF <- data.frame(Score=5, scoreScaled=1)
> myDF$score ## forgot that the Score variable was capitalized
[1] 1
Expected result: returns NULL (even better: throws error).
I have searched for this, but was unable to find any discussion of this behaviour. Is anyone able to provide any references on this, the rationale on why this is done and if there is any way to prevent this? In general I would love a version of R that is a little stricter with its variables, but it seems that will never happen...
The $ operator needs only a unique prefix of a column name to index a data frame (this is called partial matching). So for example:
> d <- data.frame(score=1, scotch=2)
> d$sco
NULL
> d$scor
[1] 1
A way of avoiding this behavior is to use the [[]] operator, which will behave like so:
> d <- data.frame(score=1, scotch=2)
> d[['scor']]
NULL
> d[['score']]
[1] 1
I hope that was helpful.
Cheers!
Using [,""] instead of $ will throw an error in case the name is not found.
myDF$score
#[1] 1
myDF[,"score"]
#Error in `[.data.frame`(myDF, , "score") : undefined columns selected
myDF[,"Score"]
#[1] 5
myDF[,"score", drop=TRUE] #More explicit and will also work with tibble::as_tibble
#Error in `[.data.frame`(myDF, , "score", drop = TRUE) :
# undefined columns selected
myDF[,"Score", drop=TRUE]
#[1] 5
as.data.frame(myDF)[,"score"] #Will work also with tibble::as_tibble and data.table::as.data.table
#Error in `[.data.frame`(as.data.frame(myDF), , "score") :
# undefined columns selected
as.data.frame(myDF)[,"Score"]
#[1] 5
unlist(myDF[,"score"], use.names = FALSE) #Will work also with tibble::as_tibble and data.table::as.data.table
#Error in `[.data.frame`(myDF, , "score") : undefined columns selected
unlist(myDF[,"Score"], use.names = FALSE)
#[1] 5
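If you want the stricter behaviour everywhere, one option is to wrap the lookup in a small accessor that fails loudly. This is only a sketch: get_col is a hypothetical name, and the wording of the error message is an assumption.

```r
myDF <- data.frame(Score = 5, scoreScaled = 1)

# Hypothetical strict accessor: exact, case-sensitive match or an error
get_col <- function(df, name) {
  if (!name %in% names(df)) {
    stop(sprintf("Column '%s' not found. Available: %s",
                 name, paste(names(df), collapse = ", ")),
         call. = FALSE)
  }
  df[[name]]
}

get_col(myDF, "Score")

# The misspelled name now fails instead of silently matching scoreScaled
msg <- tryCatch(get_col(myDF, "score"), error = function(e) conditionMessage(e))
msg
```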

R - tryCatch warning message being written to data

What I'm trying to achieve
I'm trying to write my own 'impute' function in R with a tryCatch statement which:
1. outputs a warning/error message containing the function name so I can debug easier.
2. Raises a warning if the function runs ok but doesn't impute all the missing values.
ImputeVariables <- function(impute.var, impute.values,
filter.var){
# function to impute values.
# impute.var = variables with NAs
# impute.values = the missing value(s) to replace NAs, value labels are levels
# filter.var = the variables to filter on.
# filter.levels = the categories of filter.var
tryCatch({
filter.levels <- names(impute.values)
# Validation
stopifnot(class(impute.var) == class(impute.values),
length(impute.values) > 0,
sum(is.na(impute.values)) == 0)
# Impute values
for(level in filter.levels){
impute.var[which(filter.var == level & is.na(impute.var))] <-
impute.values[level]
}
# Check if all NAs removed. Throw warning if not.
if(sum(is.na(impute.var)) > 0){
warning("Not all NAs removed")
}
# Return values
return(impute.var)
},
error = function(err) print(paste0("ImputeValues: ",err)),
warning = function(war) print(paste0("ImputeValues: ",war))
)
}
impute.var and filter.var are vectors taken from a data.frame (they are vectors of Ages and Titles (e.g. 'Mr', 'Mrs')
impute.values is a vector of the same type as impute.var but with labels taken from filter.var (i.e. is of the form c('Mr' = 30, 'Mrs' = 25...))
The problem
To check if my validation was working I supplied the function with a named vector of NAs, thusly:
ages <- c(34, 22, NA, 17, 38, NA)
titles <- c("Mr", "Mr", "Mr", "Mrs", "Mrs", "Mrs")
ages.values <- c("Mr" = NA, "Mrs" = NA)
ages.new <- ImputeVariables(ages, ages.values, titles)
print(ages.new)
But it outputs the following:
"ImputeValues: Error: class(impute.var) == class(impute.values) is not TRUE\n"
"ImputeValues: Error: class(impute.var) == class(impute.values) is not TRUE\n"
The two lines are due to the function printing the ages.new vector and the following print statement printing ages.new (why?)
If I comment out the validation (the stopifnot function) then I just get:
"ImputeValues: simpleWarning in doTryCatch(return(expr), name, parentenv, handler): Not all NAs removed\n"
What I'm asking
Why does the tryCatch block behave this way?
Is my validation and error handling strategy optimal (obviously without the bug)?
Many thanks for your time.
Rob
Thanks Oliver.
The working code is now:
ImputeVariables <- function(impute.var, impute.values,
filter.var){
# function to impute values.
# impute.var = variables with NAs
# impute.values = the missing value(s) to replace NAs, value labels are levels
# filter.var = the variables to filter on.
# filter.levels = the categories of filter.var
tryCatch({
filter.levels <- names(impute.values)
# Validation
stopifnot(class(impute.var) == class(impute.values),
length(impute.values) > 0,
sum(is.na(impute.values)) == 0)
# Impute values
for(level in filter.levels){
impute.var[which(filter.var == level & is.na(impute.var))] <-
impute.values[level]
}
# Check if all NAs removed. Throw warning if not.
if(sum(is.na(impute.var)) > 0){
warning("Not all NAs removed")
}
# Return values
return(impute.var)
},
error = function(err) stop(paste0("ImputeValues: ",err)),
warning = function(war) {
message(paste0("ImputeValues: ",war))
return(impute.var)}
)
}
This is essentially two different problems. The first is what print() does inside a function: it writes to the console and also returns its argument (invisibly). As an example:
> foo <- function(){
print("bar")
}
> foo()
[1] "bar"
"bar" was printed to your screen, and it was also returned from foo(): the print() call is the last expression evaluated, so (lacking an explicit return() call) its value is what the function returns.
So, your code is (in sequence):
Throwing an error;
Not treating that error normally, but instead passing it into tryCatch's error handler, where it is printed;
Returning the printed message: the handler's value becomes the value of tryCatch(), and since the return() statement is never hit due to the error, that value is what the function returns. ages.new therefore holds the message, and print(ages.new) prints it a second time.
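The sequence above can be reproduced in isolation. In this small sketch (res and printed are illustrative names), the value of the error handler becomes the value of tryCatch(), whether or not the handler prints anything:

```r
# The handler's return value becomes the value of tryCatch()
res <- tryCatch(
  stop("boom"),
  error = function(e) paste0("handled: ", conditionMessage(e))
)

# print() returns its argument invisibly, so a handler that only prints
# still hands the message back as the result
printed <- tryCatch(
  stop("boom"),
  error = function(e) print(conditionMessage(e))
)

res
printed
```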
If you really want to continue processing the input values even if the stopifnot() conditions are not met, you don't want a stopifnot(): however you structure it, it's likely to prevent the return() call from running and cause weirdness. What I'd suggest instead is moving the conditional checks currently in stopifnot() outside the tryCatch, and putting them in a series of if() statements that throw warnings (not errors) if they don't match up. tryCatch isn't really necessary in this situation.
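A minimal sketch of that suggestion, under the assumption that a warning (rather than an error) is acceptable for both checks: the function below drops tryCatch entirely, uses plain if() statements, and always returns the (possibly partially) imputed vector. The name impute_sketch and the exact warning texts are hypothetical.

```r
# Hypothetical rewrite: plain if() checks that warn, no tryCatch
impute_sketch <- function(impute.var, impute.values, filter.var) {
  if (anyNA(impute.values)) {
    warning("impute_sketch: impute.values contains NA", call. = FALSE)
  }
  # Replace NAs level by level, using the names of impute.values as levels
  for (level in names(impute.values)) {
    idx <- which(filter.var == level & is.na(impute.var))
    impute.var[idx] <- impute.values[level]
  }
  if (anyNA(impute.var)) {
    warning("impute_sketch: not all NAs removed", call. = FALSE)
  }
  impute.var
}

ages <- c(34, 22, NA, 17, 38, NA)
titles <- c("Mr", "Mr", "Mr", "Mrs", "Mrs", "Mrs")
ages.new <- impute_sketch(ages, c(Mr = 30, Mrs = 25), titles)
ages.new
```

Because the warnings carry the function name in their text, they stay easy to trace while the function still returns the data.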
