good practice to use "$" and run a function in one line in R [closed] - r

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 1 year ago.
Today I saw a "strange" thing and am wondering whether it is good practice. Basically, there is a list:
testList <- list("columnA" = c(1, 2, 3),
"columnB" = c(11,22,33))
and then a function:
calculateMean <- function(input){
out <- lapply(input, mean)
return(out)
}
and then this:
resultTest <- calculateMean(testList)$columnA
Question: Is it good practice to refer to a function's result without storing it in an intermediate step?

We may use sapply to return a named vector and store it as a single object that can be reused in other cases. For example, if we want the max of that vector, max() can be applied directly instead of having to unlist the list first.
calculateMean <- function(input){
out <- sapply(input, mean)
return(out)
}
-output
calculateMean(testList)
columnA columnB
2 22
Whether to store the output depends on the use case. If we later want the result for 'columnB' as well, the one-line form means running the function again and extracting that element. Instead, save the result as a single object and extract from it as needed.
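A minimal sketch of the "store once, extract as needed" approach, reusing testList and the sapply version of calculateMean from above. The names means, meanA, meanB and largest are only illustrative. Note that $ does not work on an atomic vector such as the one sapply returns here, so the elements are extracted with [[ ]] instead.

```r
testList <- list("columnA" = c(1, 2, 3),
                 "columnB" = c(11, 22, 33))

calculateMean <- function(input) {
  sapply(input, mean)              # named numeric vector
}

means <- calculateMean(testList)   # the function runs only once
meanA <- means[["columnA"]]        # extract by name
meanB <- means[["columnB"]]
largest <- max(means)              # works directly on the vector
```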

You ask if this is good practice. I'd say there are good and bad aspects to it.
On the positive side, it keeps your code simpler than if you defined a new variable to hold calculateMean(testList) when all you are interested in is one element of it. In some cases (probably not yours though) that could save a lot of memory: that variable might hold a lot of stuff that is of no interest, and it takes up space.
On the negative side, it makes your code harder to debug. Keeping expressions simple makes it easier to see when and why things aren't working. Each line of
temp <- calculateMean(testList)
resultTest <- temp$columnA
is simpler than the one line
resultTest <- calculateMean(testList)$columnA
In some situations you could use an informative name in the two-line version to partially document what you had in mind here (not temp!), making your code easier to understand.
If you were trying to single step through the calculation in a debugger, it would be more confusing, because you'd jump from the calculateMean source to the source for $ (or more likely, to the final result, since that's a primitive function).
Since the one-line version is relatively simple in your case, I'd probably use it, but in other situations I might split it into two lines.

Related

Errors in Executing While loop [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
I am trying to read values from a dataset in a while loop, but I'm getting errors. Can anyone here help me?
How do I calculate the percentage of points in the sample that are greater than 100?
I tried the following method:
dataset = 1:100
i=0
while(dataset[i] > condition) #compare every value in dataset
{
percent_age= dataset[i] + percent_age
i=i+1
if(i=100)
{break}
}
But it gives me only errors.
The while condition is evaluated before anything in the body. The first time it is evaluated, i is 0, so dataset[i] is dataset[0], which is an empty object (a vector of length 0). You also have not defined condition in the code that you gave us. So while is looking for a single logical value, but you are giving it the result of comparing a zero-length vector to an undefined value. That is going to give at least one error.
You can fix that by starting i at 1 and defining condition before the while.
In your if statement you have i=100. A single = is assignment, not comparison, and R rejects it inside if() as a syntax error; to compare (and return a logical) it should be i == 100.
Because R can be used interactively, it tries to evaluate code as early as possible, therefore it is best to put the opening curly bracket { on the same line as the keywords like if and while.
A couple of nit-picky things that probably will not resolve errors, but could help for better programming in the future:
Use more whitespace within lines: i = i + 1 can be easier to read than i=i+1 and mistakes like i=100 vs i == 100 are easier to catch when whitespace is used appropriately.
I find the arrow assignment in R i <- 1 reads easier and lessens chances of confusing different uses of =, so I would recommend using it for all assignments.
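Putting the fixes above together, a corrected loop might look like the sketch below. The threshold of 50 is an assumption for illustration (the question never defines condition), and count and percent_above are hypothetical names. The last line shows a vectorised alternative that avoids the loop entirely.

```r
dataset <- 1:100
condition <- 50        # assumed threshold, not given in the question

count <- 0             # how many values exceed the threshold
i <- 1                 # R vectors are indexed from 1, not 0
while (i <= length(dataset)) {
  if (dataset[i] > condition) {
    count <- count + 1
  }
  i <- i + 1
}
percent_above <- 100 * count / length(dataset)

# Vectorised equivalent: the mean of a logical vector is the share of TRUEs
percent_vectorised <- 100 * mean(dataset > condition)
```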

Why does R allow to refer to a column in a data frame unquoted? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
For example, in Pandas, you always need to refer to a column in DataFrame by its name in a string:
df = pd.DataFrame(list(range(1,10)),columns = ["a"])
df["a"]
But in R, including some of its packages, such as data.table and dplyr, you are allowed to refer to a column without quotes, like in this way:
dt <- data.table(a = 1:10)
dt[,.(a)]
In my opinion, referring to column name unquoted is a disaster. The only benefit you get is that you don't need to type "". But the downsides are unlimited:
1) Very often you will need to select columns programmatically. With column name unquoted, you need to differentiate the variables in "outer" and "inner" context.
col_name <- "a"
dt[,..col_name]
2) Even if you manage to select the columns specified in a vector of strings, it's very hard to do (complex) operations on them. As mentioned in this question, you need to do it this way:
diststr = "dist"
valstr = "val"
x[get(valstr) < 5, c(diststr) :=
get(diststr)*sum(get(diststr))]
All in all, my feeling is that wrangling data in R is not straightforward or natural at all compared to the way it is done in pandas. Could someone please explain whether there are any upsides to this?
In Pandas you can refer to suitably named columns without quotes, e.g.:
df = pd.DataFrame(dict(
a=[1,2,3],
b=[5,6,7],
))
print(df.a)
is valid and concise, and similar syntax works in R.
The choice depends on how much the code's author knows about the dataset and what is convenient at the time: for quick analyses this is great, for more repeatable workflows it can be awkward.
I also tend to use unquoted variable accessors a lot when working with databases, where column names are basically always valid identifiers:
df = pd.read_sql('select a, b from foo', dbcon)
df.a
or
df <- dbGetQuery(dbcon, 'select a, b from foo')
df$a
for Pandas and R respectively.
Each language/library provides the tools; it's up to you to use them appropriately!
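For the programmatic case raised in the question, base R also accepts a quoted column name through [[ ]], so both styles are available on the same data frame. A small sketch with made-up data (df and col_name are illustrative names):

```r
df <- data.frame(a = 1:3, b = 5:7)

unquoted <- df$a            # convenient when the name is known while typing

col_name <- "a"             # column chosen at run time
quoted <- df[[col_name]]    # same column, selected by a string

same_column <- identical(unquoted, quoted)
```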

How to report error indices in R? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
Update: Question here is closed, now discussed on RStudio Community Platform.
I'm trying to program defensively in my package development, using a lot of input validation.
In particular, I'm relying on a lot of the ready-made assertions in checkmate, testthat and the like, which makes life a lot easier (and code shorter).
Hadley Wickham's tidyverse style guide for error messages suggests that error messages should point users to the exact source of the problem, like so:
#> Error: Can't find column `b` in `.data`
(Columns are just an example, sometimes it might be rows, or some other index).
I'm now wondering how this can be implemented elegantly and consistently in a package, given that a lot of the existing assertions (from the above packages, but also base R) don't give you any indices back in their errors.
Here's an example:
m <- matrix(data = c(0, 1, 5, -2), nrow = 2)
# arbitrary assertion
assert_positive <- function(x) {
if (any(x < 0)) {
stop(call. = FALSE,
"All numbers must be non-negative")
} else {
return(invisible(x))
}
}
# (there are *lots* of these in packages such as checkmate, testthat or assertr that should be reused)
assert_positive(m)
gives:
## Error: All numbers must be non-negative
So far so good, but this does not give the desired indices of the errors.
Yes, I know that I could just change the above assert_positive() function to do that, but I would like to reuse a lot of the functions in checkmate, testthat and friends, so I can't touch them, and there's too many of them anyway.
So I should probably wrap something around these existing tests, such as a simple for loop:
# via for-loops
assert_positive2 <- function(x) {
for (r in 1:nrow(x)) {
res <- try(expr = assert_positive(x[r, ]), silent = TRUE)
if (inherits(x = res, what = "try-error")) {
stop(
call. = FALSE,
paste0(
"in row ",
r,
": ",
attr(x = res, which = "condition")$message,
"."
)
)
}
}
}
assert_positive2(m)
gives:
## Error: in row 2: All numbers must be non-negative.
That gets the job done, but it's a lot of clutter and the code is not very expressive.
I've also thought about Reduce() with try(), but that won't give indices, and neither would any apply() action.
I guess, finally, a closure or function factory would be helpful to generalise this to many assertions.
This just feels like a problem that many other people (crafting better error messages) must have already run into, so:
What's an elegant/canonical way to do this?
I know this isn't the place for discussions and opinions; but it's still the best forum for such a problem, so please don't shut this down.
I don't see how wrapping many functions would be less work than just changing them / writing your own versions. Plus, like you say, the way you've wrapped the example is anything but cute.
As a short answer, I could imagine using the assertthat package (which you have not mentioned explicitly) and in particular the functions assert_that() (for basic cases) and on_failure() (for broader user-defined assertion functions).
I don't think the assert_positive example does what you want, so maybe you should not try to recycle it. Similarly, the assert_positive2 might also not do what you want in other cases, because you may want to report the specific indices per row that are in violation, not just the rows. But with your own functions, you can maybe write something more flexible that covers multiple cases.
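As a rough illustration of "writing your own", the sketch below reports every offending [row, col] position rather than only the row. assert_non_negative is a hypothetical helper, not a function from checkmate, testthat or assertthat, and it assumes the input is a matrix. It relies on base R's which(..., arr.ind = TRUE).

```r
# Hypothetical helper: fails with the positions of all negative cells.
assert_non_negative <- function(x) {
  stopifnot(is.matrix(x))
  bad <- which(x < 0, arr.ind = TRUE)   # one row per offending cell
  if (nrow(bad) > 0) {
    where <- paste0("[", bad[, "row"], ", ", bad[, "col"], "]",
                    collapse = ", ")
    stop("Negative values at ", where, call. = FALSE)
  }
  invisible(x)
}

m <- matrix(data = c(0, 1, 5, -2), nrow = 2)

# Capture the message instead of aborting, to show what the user would see
msg <- tryCatch(assert_non_negative(m),
                error = function(e) conditionMessage(e))
```

For the matrix m from the question, the captured message points at row 2, column 2.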

Understanding the logic of R code [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 5 years ago.
I am learning R through tutorials, but I have difficulties in "how to read" R code, which in turn makes it difficult to write R code. For example:
dir.create(file.path("testdir2","testdir3"), recursive = TRUE)
vs
dup.names <- as.character(data.combined[which(duplicated(as.character(data.combined$name))), "name"])
While I know what these lines of code do, I cannot read or interpret the logic of each line, whether I read left to right or right to left. What strategies should I use when reading/writing R code?
dup.names <- as.character(data.combined[which(duplicated(as.character(data.combined$name))), "name"])
Don't let lines of code like this ruin writing R code for you
I'm going to be honest here. The code is bad. And for many reasons.
Not a lot of people can read a line like this and intuitively know what the output is.
The point is you should not write lines of code that you don't understand. This is not Excel, you do not have but 1 single line to fit everything within. You have a whole deliciously large script, an empty canvas. Use that space to break your code into smaller bits that make a beautiful mosaic piece of art! Let's dive in~
Dissecting the code: Data Frames
Reading a line of code is like looking at a face for familiar features. You can read left to right, middle to out, whatever -- as long as you can lock onto something that is familiar.
Okay you see data.combined. You know (hope) it has rows and columns... because it's data!
You spot a $ in the code and you know it has to be a data.frame. This is because only lists and data.frames (which are really just lists) allow you to extract columns using $ followed by the column name. Subsetting, by the way, just means looking at a portion of the whole. In R, subsetting for data.frames and matrices can be done using single brackets [ ], within which you will see [row, column]. Thus if we type data.combined[1,2], it would give you the value in row 1 of column 2.
Now, if you knew that the name of column 2 was name you can use data.combined[1,"name"] to get the same output as data.combined$name[1]. Look back at that code:
dup.names <- as.character(data.combined[which(duplicated(as.character(data.combined$name))), "name"])
Okay, so now our eyes should be locked on data.combined[SOMETHING IS IN HERE?!] and slowly be picking out data.combined[ ?ROW? , Oh the "name" column]. Cool.
Finding those ROW values!
which(duplicated(as.character(data.combined$name)))
Anytime you see the which function, it is just giving you locations. An example: for the numeric vector a = c(1,2,2,1), which(a == 1) would give you 1 and 4, the locations of the 1s in a.
Now duplicated is simple too. duplicated(a) (which is just duplicated(c(1,2,2,1))) will give you back FALSE FALSE TRUE TRUE. If we ran which(duplicated(a)) it would return 3 and 4. Now here is a secret you will learn: if you have TRUEs and FALSEs, you don't need to use the which function! So maybe which was unnecessary here. The same goes for as.character, since duplicated works on numbers and strings.
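The which and duplicated behaviour described above can be checked in the console with the same small vector. The variable names on the left are only illustrative.

```r
a <- c(1, 2, 2, 1)

ones_at <- which(a == 1)          # positions where a equals 1
is_dupe <- duplicated(a)          # TRUE for repeats of an earlier value
dupes_at <- which(is_dupe)        # positions of those repeats

# Subsetting with the logical vector or with the positions gives the same values
by_logical <- a[is_dupe]
by_position <- a[dupes_at]
```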
What You Should Be Writing
Who am I to tell you how to write code? But here's my take.
Don't mix up ways of subsetting: use EITHER data.frame[,column] or data.frame$column...
The code could have been written a little bit more legibly as:
dupes <- duplicated(data.combined$name)
dupe.names <- data.combined$name[dupes]
or equally:
dupes <- duplicated(data.combined[,"name"])
dupe.names <- data.combined[dupes,"name"]
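To see that the shorter version matches the original one-liner, here is a sketch on a made-up data.combined. The five names are invented for illustration; the real data set from the tutorial is not shown in the question.

```r
# Toy stand-in for the tutorial's data frame
data.combined <- data.frame(name = c("Ann", "Bob", "Ann", "Cy", "Bob"),
                            stringsAsFactors = FALSE)

# Two-step version
dupes <- duplicated(data.combined$name)
dupe.names <- data.combined$name[dupes]

# Original one-liner, for comparison
dup.names <- as.character(data.combined[which(duplicated(as.character(data.combined$name))), "name"])

same_result <- identical(dupe.names, dup.names)
```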
I know this was lengthy but I hope it helps.
An easier way to read any code is to break it up into its components.
dup.names <-
  as.character(
    data.combined[which(
      duplicated(
        as.character(
          data.combined$name
        )
      )
    ), "name"]
  )
For each of the functions - those parts with rounded brackets following them e.g. as.character() you can learn more about what they do and how they work by typing ?as.character in the console
Square brackets [] are used to subset data frames, which are stored in your environment (the box to the upper right if you're using R within RStudio contains your values as well as any defined functions). In this case, you can tell that data.combined is the name that has been given to such a data frame in this example (type ?data.frame to find out more about data frames).
"Unwrapping" long lines of code can be daunting at first. Start by breaking the line down into parentheses, brackets, and commas. Parentheses directly tacked onto a word indicate a function, and any commas that lie within them (unless they are part of another nested function or bracket) separate the arguments, which supply the values that modify the way the function behaves. We can reduce your 2nd line to an outer function as.character and its arguments:
dup.names <- as.character(argument_1)
Just from this, we know that dup.names will be assigned a value with the data type "character" off of a single argument.
Two functions in the first line, file.path() and dir.create(), contain a comma to denote two arguments. Arguments can either be a single value or be named with an equals sign. In this case, the output of file.path serves as argument #1 of dir.create().
file.path(argument_1,argument_2)
dir.create(argument_1,argument_2)
Brackets are a way of subsetting data frames, with the general notation of dataframe_object[row,column]. Within your second line is a dataframe object, data.combined. You know it's a dataframe object because of the brackets directly tacked onto it, and knowing this tells you that any functions inside the brackets contribute to subsetting this data frame.
data.combined[row, column]
So from there, we can see that the internal functions within this bracket will produce an output that specifies the rows of data.combined that will contribute to the subset, and that only columns with name "name" will be selected.
Use the help function to start to unpack these lines by discovering what each function does, and what its arguments are.
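Applied to the first line from the question, the same unwrapping can be written out as two steps, with the inner call evaluated first. This is only a sketch: building the path under tempdir() is an assumption added here so that the example does not create folders in your working directory, and path is an illustrative name.

```r
# Inner call first: file.path() joins its arguments into one path string
path <- file.path(tempdir(), "testdir2", "testdir3")

# Outer call second: argument 1 is the path, recursive = TRUE is a named argument
dir.create(path, recursive = TRUE)

created <- dir.exists(path)
```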

Efficiently packing and unpacking function arguments in R [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I have some R code where I'm starting to get too many arguments in my functions, like this
f<-function(a,b,c,d,e,f,g,...){
#do stuff with a,b,c,d,e,f,g
return(list(q=q,r=r,s=s,...))
}
I was thinking of collapsing arguments into lists of related parameters and then extracting out the parameters from the lists inside the function. This is annoying though since I have to use a lot of boilerplate code
list_of_params<-list(a=a,b=b,...)
f<-function(list_of_params){
a<-list_of_params[["a"]]
b<-list_of_params[["b"]]
c<-list_of_params[["c"]]
...
#do stuff with a,b,c,...
return(list(q=q,r=r,s=s,...))
}
I was thinking about using something like list2env to automatically extract the variables from the list into the environment of the function. Does anyone have opinions about whether that is a reasonable approach? I read somewhere that using assign is a bad idea and this seems similar. My proposed function would look like this:
f<-function(list_of_params){
list2env(list_of_params, envir=as.environment(-1)) #-1 means current environment
#do stuff with a,b,c...
return(list(q=q,r=r,s=s,...))
}
I have never used assign() or list2env() before. I am concerned they may have treacherous pitfalls I should watch out for, in the same manner as attach(). Is the use of list2env() here appropriate? If not, what is the appropriate use of this function?
A long list of parameters is probably a code-smell.
The easiest thing to do is to stop and think about what type of object should encapsulate your parameters. It's probably not just a simple list.
Another option applies if many of the function parameters are held fixed in terms of procedural or lexical scope. Then you could use the fact that functions in R are closures. Example:
make_f <- function(object, params){
  e <- calculate_e(object, params)
  f <- calculate_f(object, params)
  g <- calculate_g(object, params)
  f<-function(a,b,c,d,...){
    #do stuff with a,b,c,d,e,f,g
    return(list(q=q,r=r,s=s,...))
  }
  return(f)
}
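A runnable toy version of the closure idea, under stated assumptions: scale and offset stand in for whatever parameters stay fixed, and the arithmetic is invented purely for illustration. None of these names come from the original post.

```r
# Hypothetical factory: fixed parameters are captured once by the closure
make_f <- function(scale, offset) {
  shift <- offset * scale          # computed once, when the factory is called
  function(a, b) {
    # only the varying arguments are passed on each call
    list(q = a * scale + shift,
         r = b * scale + shift)
  }
}

f <- make_f(scale = 2, offset = 1)
res <- f(a = 1, b = 10)
```

The returned function takes two arguments instead of four, because scale and shift live in the environment where it was created.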
