Sometimes I read posts where people use the print() function and I don't understand why it is used. Here for example in one answer the code is
print(fitted(m))
# 1 2 3 4 5 6 7 8
# 0.3668989 0.6083009 0.4677463 0.8685777 0.8047078 0.6116263 0.5688551 0.4909217
# 9 10
# 0.5583372 0.6540281
But using fitted(m) would give the same output. I know there are situations where we need print(), for example if we want to create plots inside of loops. But why is the print() function used in cases like the one above?
I guess that in many cases usage of print is just a bad/redundant habit, however print has a couple of interesting options:
Data:
x <- rnorm(5)
y <- rpois(5, exp(x))
m <- glm(y ~ x, family="poisson")
m2 <- fitted(m)
# 1 2 3 4 5
# 0.8268702 1.0523189 1.9105627 1.0776197 1.1326286
digits - minimum number of significant digits to show
print(m2, digits = 3) # at least 3 significant digits per value (not decimal places as in round(m2, 3))
# 1 2 3 4 5
# 0.827 1.052 1.911 1.078 1.133
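Note that digits counts significant digits rather than decimal places, so it is not always interchangeable with round(). A minimal sketch, where v is a made-up value (not part of the fitted model above) and capture.output is used only to make the printed text inspectable:

```r
# 'v' is an illustrative value, not part of the original example
v <- 123.456

# digits = 3 asks for 3 significant digits
sig <- capture.output(print(v, digits = 3))

# round(v, 3) keeps 3 decimal places instead
dec <- capture.output(print(round(v, 3)))

sig  # "[1] 123"
dec  # "[1] 123.456"
```

For values below 1, as in m2, the two happen to look alike, which is probably why they are easy to confuse.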
na.print - string to display in place of NA values (very similar to the zero.print argument)
m2[1] <- NA
print(m2, na.print = "Failed")
# 1 2 3 4 5
# Failed 1.052319 1.910563 1.077620 1.132629
max - maximum number of values to print
print(m2, max = 2) # similar to head(m2, 2)
# 1 2
# NA 1.052319
I'm guessing, as I rarely use print myself:
using print() makes it obvious which lines of your code do printing and which ones do actual stuff. It might make re-reading your code later easier.
using print() explicitly might make it easier to later refactor your code into a function: you just need to change the print into a return
programmers coming from a language with strict syntax might have a strong dislike of R's automatic printing feature
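One concrete case where print() is not optional: auto-printing only happens for visible values at top level, so a bare expression inside a for loop displays nothing by itself. A small sketch (capture.output is used here only to make the difference inspectable):

```r
# A bare expression inside a loop is evaluated but not auto-printed
silent <- capture.output(for (i in 1:3) i)

# An explicit print() call produces output on every iteration
shown <- capture.output(for (i in 1:3) print(i))

length(silent)  # 0
shown           # "[1] 1" "[1] 2" "[1] 3"
```

The same applies to expressions in the middle of a function body, which is why plots and intermediate values created there usually need an explicit print().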
I've read the other answers for issues related to the "promise already under evaluation" error, but I am unable to see how they can help me avoid this problem.
Here I have a function that for one method, takes a default argument value that is a function of another value.
myfun <- function(x, ones = NULL) {
UseMethod("myfun")
}
myfun.list <- function(x, ones = NA) {
data.frame(x = x[[1]], ones)
}
ones <- function(x) {
rep(1, length(x))
}
So far, so good:
myfun(list(letters[1:5]))
## x ones
## 1 a NA
## 2 b NA
## 3 c NA
## 4 d NA
## 5 e NA
But when I define another method that sets the default for the ones argument as the function ones(x), I get an error:
myfun.character <- function(x, ones = ones(x)) {
myfun(as.list(x), ones)
}
myfun(letters[1:5])
## Error in data.frame(x = x[[1]], ones) :
## promise already under evaluation: recursive default argument reference or earlier problems?
For various reasons, I need to keep the argument name the same as the function name (for ones). How can I force evaluation of the argument within myfun.character? I also need this to work (which it does):
myfun(letters[1:5], 1:5)
## x ones
## 1 a 1
## 2 a 2
## 3 a 3
## 4 a 4
## 5 a 5
Thanks!
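To understand exactly where R tries to find ones, one would need to look deep into R's (notorious) environments. The problem lies in the way supplied and default arguments are evaluated within a function: a default argument is evaluated inside the function's own environment, where ones already names the argument itself, so the default ones(x) ends up referring to the very promise that is being evaluated. You can see this link from the R manual and also an explanation here.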
The easy solution is to tell R where to look for it. It will save you the hassle. In your case that's the global environment.
Changing method myfun.character to tell it to look for ones in the global environment:
myfun.character <- function(x, ones = get('ones', envir = globalenv())(x)) {
myfun(as.list(x), ones)
}
will be enough here.
Out:
myfun(letters[1:5])
# x ones
#1 a 1
#2 a 1
#3 a 1
#4 a 1
#5 a 1
myfun(letters[1:5], 1:5)
# x ones
#1 a 1
#2 a 2
#3 a 3
#4 a 4
#5 a 5
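To see why the lookup fails and why pointing at an explicit environment helps, here is a minimal sketch outside of S3 dispatch. The names f_bad and f_ok are made up for illustration, and parent.env(environment()) is a variation on globalenv(): it refers to wherever the function was defined, which is the global environment when the function is defined at top level.

```r
ones <- function(x) rep(1, length(x))

# The default refers to 'ones', which inside the function is the argument
# itself, so forcing the argument requires the argument: a recursive reference
f_bad <- function(x, ones = ones(x)) ones

# Look the function up in the enclosing environment instead
f_ok <- function(x, ones = get("ones", envir = parent.env(environment()))(x)) ones

# Capture the error message instead of stopping
msg <- tryCatch(f_bad(1:3), error = function(e) conditionMessage(e))
msg             # the "promise already under evaluation" message

f_ok(1:3)       # 1 1 1
f_ok(1:3, 4:6)  # 4 5 6
```

Supplying the argument explicitly never touches the default, which is why the second call in the question already worked.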
I have two data frames
d1 = data.frame(a=1:4,b=2:5)
d2 = data.frame(a=0:3,b=3:6)
and I would like to evaluate the same block of code, for example
c<-exp(a)
d<-b^2
within each data frame. At the moment I have to duplicate the code block as follows:
d1t = within(d1, {
c<-exp(a)
d<-b^2
})
d2t = within(d2, {
c<-exp(a)
d<-b^2
})
which makes my code prone to errors if I make changes to one of the code blocks (they should be the same).
I am not so familiar with environments in R, but I think it should be possible to use them to solve this problem nicely. How can I do it?
This is the perfect situation to write the repeated code blocks into a function:
MyFun <- function(df) {
out = within(df, {
c<-exp(a)
d<-b^2
})
return(out)
}
This works as long as the variable names are the same across datasets.
To run the code, just do:
d1t <- MyFun(d1)
d2t <- MyFun(d2)
That should work.
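A quick way to convince yourself that the function is a drop-in replacement is to compare its result with the hand-written within() call. A minimal sketch reusing the data from the question (add_cols is just an illustrative name for the same kind of function):

```r
d1 <- data.frame(a = 1:4, b = 2:5)
d2 <- data.frame(a = 0:3, b = 3:6)

# The shared block lives in exactly one place
add_cols <- function(df) {
  within(df, {
    c <- exp(a)
    d <- b^2
  })
}

d1t <- add_cols(d1)
d2t <- add_cols(d2)

# Same result as writing the within() call out by hand
same <- identical(d1t, within(d1, { c <- exp(a); d <- b^2 }))
same  # TRUE
```

Any later change to the block then applies to every data frame that goes through the function.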
We could place the dataframe objects in a list. We search the global environment for object names matching the pattern ^d\\d+, i.e. 'd' followed by numbers. If there are multiple objects (in this case, 2 objects, i.e. 'd1' and 'd2'), we can get the values using mget in a list.
lst <- mget(ls(pattern='^d\\d+'))
Now, we loop through the list with lapply and create new variables 'c' and 'd' using transform.
lst1 <- lapply(lst, transform, c=exp(a), d= b^2)
It is better to keep the 'data.frames' within the list. But, if we need to update the original datasets or create new objects i.e. 'd1t' and 'd2t' (not recommended), we can change the names of the list elements with setNames and use list2env to create objects in the global environment.
list2env(setNames(lst1, paste0(names(lst1), 't')), envir=.GlobalEnv)
d1t
# a b c d
#1 1 2 2.718282 4
#2 2 3 7.389056 9
#3 3 4 20.085537 16
#4 4 5 54.598150 25
d2t
# a b c d
#1 0 3 1.000000 9
#2 1 4 2.718282 16
#3 2 5 7.389056 25
#4 3 6 20.085537 36
I have a dataframe with multiple columns. For each row in the dataframe, I want to call a function on the row, and the input of the function is using multiple columns from that row. For example, let's say I have this data and this testFunc which accepts two args:
> df <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
> df
x y z
1 1 3 5
2 2 4 6
> testFunc <- function(a, b) a + b
Let's say I want to apply this testFunc to columns x and z. So, for row 1 I want 1+5, and for row 2 I want 2 + 6. Is there a way to do this without writing a for loop, maybe with the apply function family?
I tried this:
> df[,c('x','z')]
x z
1 1 5
2 2 6
> lapply(df[,c('x','z')], testFunc)
Error in a + b : 'b' is missing
But I got an error. Any ideas?
EDIT: the actual function I want to call is not a simple sum, but it is power.t.test. I used a+b just for example purposes. The end goal is to be able to do something like this (written in pseudocode):
df = data.frame(
delta=c(delta_values),
power=c(power_values),
sig.level=c(sig.level_values)
)
lapply(df, power.t.test(delta_from_each_row_of_df,
power_from_each_row_of_df,
sig.level_from_each_row_of_df
))
where the result is a vector of outputs for power.t.test for each row of df.
You can apply apply to a subset of the original data.
dat <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
apply(dat[,c('x','z')], 1, function(x) sum(x) )
or if your function is just sum use the vectorized version:
rowSums(dat[,c('x','z')])
[1] 6 8
If you want to use testFunc
testFunc <- function(a, b) a + b
apply(dat[,c('x','z')], 1, function(x) testFunc(x[1],x[2]))
EDIT To access columns by name and not index you can do something like this:
testFunc <- function(a, b) a + b
apply(dat[,c('x','z')], 1, function(y) testFunc(y['z'],y['x']))
A data.frame is a list, so ...
For vectorized functions do.call is usually a good bet. But the names of arguments come into play. Here your testFunc is called with args x and z in place of a and b. The ... allows irrelevant args to be passed without causing an error:
do.call( function(x,z,...) testFunc(x,z), df )
For non-vectorized functions, mapply will work, but you need to match the ordering of the args or explicitly name them:
mapply(testFunc, df$x, df$z)
Sometimes apply will work - as when all args are of the same type so coercing the data.frame to a matrix does not cause problems by changing data types. Your example was of this sort.
If your function is to be called within another function into which the arguments are all passed, there is a much slicker method than these. Study the first lines of the body of lm() if you want to go that route.
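A runnable sketch of the do.call and mapply variants described above, using the data frame and testFunc from the question:

```r
df <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6))
testFunc <- function(a, b) a + b

# The data frame's columns are passed as named arguments x, y, z;
# '...' absorbs the unused column y
via_do_call <- do.call(function(x, z, ...) testFunc(x, z), df)

# mapply calls testFunc once per row, pairing elements of x and z
via_mapply <- mapply(testFunc, df$x, df$z)

via_do_call  # 6 8
via_mapply   # 6 8
```

The do.call version makes a single call on whole columns, so it only gives the right answer when the function is vectorized; mapply makes one call per row.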
Use mapply
> df <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
> df
x y z
1 1 3 5
2 2 4 6
> mapply(function(x,y) x+y, df$x, df$z)
[1] 6 8
> cbind(df,f = mapply(function(x,y) x+y, df$x, df$z) )
x y z f
1 1 3 5 6
2 2 4 6 8
New answer with dplyr package
If the function that you want to apply is vectorized,
then you could use the mutate function from the dplyr package:
> library(dplyr)
> myf <- function(tens, ones) { 10 * tens + ones }
> x <- data.frame(hundreds = 7:9, tens = 1:3, ones = 4:6)
> mutate(x, value = myf(tens, ones))
hundreds tens ones value
1 7 1 4 14
2 8 2 5 25
3 9 3 6 36
Old answer with plyr package
In my humble opinion,
the tool best suited to the task is mdply from the plyr package.
Example:
> library(plyr)
> x <- data.frame(tens = 1:3, ones = 4:6)
> mdply(x, function(tens, ones) { 10 * tens + ones })
tens ones V1
1 1 4 14
2 2 5 25
3 3 6 36
Unfortunately, as Bertjan Broeksema pointed out,
this approach fails if you don't use all the columns of the data frame
in the mdply call.
For example,
> library(plyr)
> x <- data.frame(hundreds = 7:9, tens = 1:3, ones = 4:6)
> mdply(x, function(tens, ones) { 10 * tens + ones })
Error in (function (tens, ones) : unused argument (hundreds = 7)
Others have correctly pointed out that mapply is made for this purpose, but (for the sake of completeness) a conceptually simpler method is just to use a for loop.
for (row in seq_len(nrow(df))) {
df$newvar[row] <- testFunc(df$x[row], df$z[row])
}
Many functions are already vectorized, so there is no need for any iteration (neither for loops nor *apply functions). Your testFunc is one such example. You can simply call:
testFunc(df[, "x"], df[, "z"])
In general, I would recommend trying such vectorization approaches first and see if they get you your intended results.
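Alternatively, if you need to pass multiple arguments to a function which is not vectorized, mapply might be what you are looking for. Name the arguments so that each column reaches the intended parameter, and set SIMPLIFY = FALSE to get one result object per row (this assumes df has the delta and power columns from the question's edit):
mapply(power.t.test, delta = df[, "delta"], power = df[, "power"], SIMPLIFY = FALSE)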
Here is an alternate approach. It is more intuitive.
One key aspect I feel some of the answers did not take into account, which I point out for posterity, is that apply() lets you do row calculations easily, but only for matrix (all numeric) data.
Operations on columns are still possible for data frames:
as.data.frame(lapply(df, myFunctionForColumn))
To operate on rows, we make the transpose first.
tdf<-as.data.frame(t(df))
as.data.frame(lapply(tdf, myFunctionForRow))
The downside is that I believe R will make a copy of your data table, which could be a memory issue. (This is truly sad, because it is programmatically simple for tdf to just be an iterator to the original df, thus saving memory, but R does not allow pointer or iterator referencing.)
Also, a related question is how to operate on each individual cell in a dataframe.
newdf <- as.data.frame(lapply(df, function(x) sapply(x, myFunctionForEachCell)))
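To make the three patterns concrete, here is a small sketch in which the placeholder functions are swapped for max, sum and squaring. These stand-ins are assumptions chosen for illustration; any function with a matching input would do.

```r
df <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6))

# Column-wise: lapply visits each column of the data frame
col_max <- as.data.frame(lapply(df, max))

# Row-wise: after transposing, each column of tdf is a row of df
tdf <- as.data.frame(t(df))
row_sum <- as.data.frame(lapply(tdf, sum))

# Cell-wise: sapply visits each element within each column
cell_sq <- as.data.frame(lapply(df, function(col) sapply(col, function(v) v^2)))

unlist(col_max)  # x = 2, y = 4, z = 6
unlist(row_sum)  # 9 and 12, one value per original row
cell_sq$z        # 25 36
```

Note that t() goes through a matrix, so the transposing step has the same type-coercion caveat as apply() when columns have mixed types.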
data.table has a really intuitive way of doing this as well:
library(data.table)
sample_fxn = function(x,y,z){
return((x+y)*z)
}
df = data.table(A = 1:5,B=seq(2,10,2),C = 6:10)
> df
A B C
1: 1 2 6
2: 2 4 7
3: 3 6 8
4: 4 8 9
5: 5 10 10
The := operator can be called within brackets to add a new column using a function
df[,new_column := sample_fxn(A,B,C)]
> df
A B C new_column
1: 1 2 6 18
2: 2 4 7 42
3: 3 6 8 72
4: 4 8 9 108
5: 5 10 10 150
It's also easy to pass constants as arguments using this method:
df[,new_column2 := sample_fxn(A,B,2)]
> df
A B C new_column new_column2
1: 1 2 6 18 6
2: 2 4 7 42 12
3: 3 6 8 72 18
4: 4 8 9 108 24
5: 5 10 10 150 30
@user20877984's answer is excellent. Since they summed it up far better than my previous answer, here is my (possibly still shoddy) attempt at an application of the concept:
Using do.call in a basic fashion:
powvalues <- list(power=0.9,delta=2)
do.call(power.t.test,powvalues)
Working on a full data set:
# get the example data
df <- data.frame(delta=c(1,1,2,2), power=c(.90,.85,.75,.45))
#> df
# delta power
#1 1 0.90
#2 1 0.85
#3 2 0.75
#4 2 0.45
lapply the power.t.test function to each of the rows of specified values:
result <- lapply(
split(df,1:nrow(df)),
function(x) do.call(power.t.test,x)
)
> str(result)
List of 4
$ 1:List of 8
..$ n : num 22
..$ delta : num 1
..$ sd : num 1
..$ sig.level : num 0.05
..$ power : num 0.9
..$ alternative: chr "two.sided"
..$ note : chr "n is number in *each* group"
..$ method : chr "Two-sample t test power calculation"
..- attr(*, "class")= chr "power.htest"
$ 2:List of 8
..$ n : num 19
..$ delta : num 1
..$ sd : num 1
..$ sig.level : num 0.05
..$ power : num 0.85
... ...
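Since each element of result is a full power.htest object, a natural follow-up is to pull one component out into a plain vector. A minimal sketch, assuming the per-group sample size n is the quantity of interest (n_per_group is an illustrative name):

```r
df <- data.frame(delta = c(1, 1, 2, 2), power = c(0.90, 0.85, 0.75, 0.45))

# One power.t.test call per row; column names become argument names
result <- lapply(
  split(df, seq_len(nrow(df))),
  function(row) do.call(power.t.test, row)
)

# Extract the solved-for sample size from each result
n_per_group <- vapply(result, function(r) r$n, numeric(1))

# Round up, since group sizes must be whole numbers
ceiling(n_per_group)
```

vapply is used rather than sapply so that a result of unexpected shape raises an error instead of silently changing the return type.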
I came here looking for the tidyverse function name, which I knew existed. Adding this for (my) future reference and for tidyverse enthusiasts: purrrlyr::invoke_rows (purrr::invoke_rows in older versions).
With connection to standard stats methods as in the original question, the broom package would probably help.
If data.frame columns are different types, apply() has a problem.
A subtlety about row iteration is how apply(a.data.frame, 1, ...) does
implicit type conversion to character types when columns are different types,
e.g. a factor and a numeric column. Here's an example, using a factor
in one column to modify a numeric column:
mean.height = list(BOY=69.5, GIRL=64.0)
subjects = data.frame(gender = factor(c("BOY", "GIRL", "GIRL", "BOY"))
, height = c(71.0, 59.3, 62.1, 62.1))
apply(subjects, 1, function(x) x[2] - mean.height[[x[1]]])
The subtraction fails because the columns are converted to character types.
One fix is to back-convert the second column to a number:
apply(subjects, 1, function(x) as.numeric(x[2]) - mean.height[[x[1]]])
But the conversions can be avoided by keeping the columns separate
and using mapply() (as.character() makes the lookup go by name rather than by the factor's integer codes):
mapply(function(x,y) y - mean.height[[x]], as.character(subjects$gender), subjects$height)
mapply() is needed because [[ ]] does not accept a vector argument. Alternatively, the
lookup can be done for the whole column at once by passing a vector to [ ],
at the cost of slightly uglier code:
subjects$height - unlist(mean.height[as.character(subjects$gender)])
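The pieces above can be put together into a runnable sketch. The lookups go through as.character() so that they match list names rather than the factor's integer codes; by_row and vectorized are illustrative names.

```r
mean.height <- list(BOY = 69.5, GIRL = 64.0)
subjects <- data.frame(
  gender = factor(c("BOY", "GIRL", "GIRL", "BOY")),
  height = c(71.0, 59.3, 62.1, 62.1)
)

# apply() converts the data frame with as.matrix(): every cell becomes a string
m <- as.matrix(subjects)
typeof(m)  # "character"

# Index the lookup list by name, one row at a time
by_row <- mapply(function(g, h) h - mean.height[[g]],
                 as.character(subjects$gender), subjects$height)

# Or do the lookup for the whole column in one vectorized step
vectorized <- subjects$height - unlist(mean.height[as.character(subjects$gender)])

unname(by_row)  # 1.5 -4.7 -1.9 -7.4
```

Both results carry the gender labels as names, hence the unname() when only the numbers matter.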
A really nice function for this is adply from plyr, especially if you want to append the result to the original dataframe. This function and its cousin ddply have saved me a lot of headaches and lines of code!
df_appended <- adply(df, 1, mutate, sum=x+z)
Alternatively, you can call the function you desire.
df_appended <- adply(df, 1, mutate, sum=testFunc(x,z))
Puzzle for the R cognoscenti: Say we have a data-frame:
df <- data.frame( a = 1:5, b = 1:5 )
I know we can do things like
with(df, a)
to get a vector of results.
But how do I write a function that takes an expression (such as a or a > 3) and does the same thing inside. I.e. I want to write a function fn that takes a data-frame and an expression as arguments and returns the result of evaluating the expression "within" the data-frame as an environment.
Never mind that this sounds contrived (I could just use with as above), but this is just a simplified version of a more complex function I am writing. I tried several variants ( using eval, with, envir, substitute, local, etc) but none of them work. For example if I define fn like so:
fn <- function(dat, expr) {
eval(expr, envir = dat)
}
I get this error:
> fn( df, a )
Error in eval(expr, envir = dat) : object 'a' not found
Clearly I am missing something subtle about environments and evaluation. Is there a way to define such a function?
The lattice package does this sort of thing in a different way. See, e.g., lattice:::xyplot.formula.
fn <- function(dat, expr) {
eval(substitute(expr), dat)
}
fn(df, a) # 1 2 3 4 5
fn(df, 2 * a + b) # 3 6 9 12 15
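One refinement worth considering if fn is called from inside other functions: names that are not columns of dat are looked up through eval's enclos argument, and pointing it at parent.frame() makes the caller's local variables visible. A sketch under that assumption, where fn2, caller and cutoff are illustrative names:

```r
# Variation on fn: 'enclos' tells eval() where to look for names
# that are not columns of 'dat'
fn2 <- function(dat, expr) {
  eval(substitute(expr), dat, enclos = parent.frame())
}

df <- data.frame(a = 1:5, b = 1:5)

# 'cutoff' exists only inside caller(), not in the data frame
caller <- function() {
  cutoff <- 3
  fn2(df, a > cutoff)
}

caller()  # FALSE FALSE FALSE  TRUE  TRUE
```

This is the same pattern that base functions such as subset() use when they evaluate an expression against a data frame.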
That's because you're not passing an expression.
Try:
fn <- function(dat, expr) {
mf <- match.call() # makes expr an expression that can be evaluated
eval(mf$expr, envir = dat)
}
> df <- data.frame( a = 1:5, b = 1:5 )
> fn( df, a )
[1] 1 2 3 4 5
> fn( df, a+b )
[1] 2 4 6 8 10
A quick glance at the source code of functions using this (eg lm) can reveal a lot more interesting things about it.
A late entry, but the data.table approach and syntax would appear to be what you are after.
This is exactly how [.data.table works with the j, i and by arguments.
If you need it in the form fn(x,expr), then you can use the following
library(data.table)
DT <- data.table(a = 1:5, b = 2:6)
`[`(x=DT, j=a)
## [1] 1 2 3 4 5
`[`(x=DT, j=a * b)
## [1] 2 6 12 20 30
I think it is easier to use in more native form
DT[,a]
## [1] 1 2 3 4 5
and so on. In the background this is using substitute and eval
?within might also be of interest.
df <- data.frame( a = 1:5, b = 1:5 )
within(df, cx <- a > 3)
a b cx
1 1 1 FALSE
2 2 2 FALSE
3 3 3 FALSE
4 4 4 TRUE
5 5 5 TRUE