Apply a function that returns a data.frame/tibble to a vector/data.frame column and bind the results - r

I have a function that fetches some data from a database. It takes a single parameter and returns a data.frame. I would like to take an input vector of these parameters and pipe it to map or a similar function that takes each element and returns the db results. The results can differ in the number of rows, but the columns are always the same. How do I go about this without looping and row-binding (for i in ..)?
I tried the following route:
myfuncSingleRow<-function(nbr){
data.frame(a=nbr,b=nbr^2,c=nbr^3)}
myfuncMultipleRow<-function(nbr){
data.frame(a=rep(nbr,3),b=rep(nbr^2,3),c=rep(nbr^3,3))}
a<-data.frame(count=c(1,2,3))
myfuncSingleRow(2)
myfuncMultipleRow(2)
a %>% select(count) %>% map_dfr(.f=myfuncSingleRow) #output as expected
a %>% select(count) %>% map_dfr(.f=myfuncMultipleRow) #output not as expected
This does not work as intended. With myfuncMultipleRow, I was expecting the first 3 rows to be equal, the next 3 to be equal, and the same for the final 3. Example using myfuncMultipleRow:
Getting
a b c
1 1 1 1
2 2 4 8
3 3 9 27
4 1 1 1
5 2 4 8
6 3 9 27
7 1 1 1
8 2 4 8
9 3 9 27
Wanting:
a b c
1 1 1 1
2 1 1 1
3 1 1 1
4 2 4 8
5 2 4 8
6 2 4 8
7 3 9 27
8 3 9 27
9 3 9 27
As usual, I am probably not using the functions correctly, but I am a bit stuck here and do not want to resort to the old loop and rbind, which would probably be a performance bottleneck. Any takers?
EDIT: As pointed out, the "each" argument of "rep" does solve this particular example, but it does not solve the main issue. If map iterated and called the function once for each element, then the "each" and "times" arguments of "rep" would yield the same result. The function passed to map is not vectorized; it assumes a single parameter of length 1.
The solution needs to do the equivalent of:
res<-data.frame()
for(i in a$count) res<-rbind(res,myfuncMultipleRow(i))

After upgrading to purrr 0.3.0 (I was on an older version), map_depth() pointed me in the right direction:
a %>% select(count)%>% map_depth(.depth=2,.f=myfuncMultipleRow) %>% map_dfr(.f=bind_rows)
Alternatively, dropping map_depth() and bind_rows() and nesting map_dfr() calls instead:
a %>% select(count)%>% map_dfr(~map_dfr(.,myfuncMultipleRow))
a %>% select(count)%>% map_dfr(.f=function(x) map_dfr(x,.f=myfuncMultipleRow))
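The underlying cause appears to be that a data frame is a list of columns, so mapping over `a %>% select(count)` hands the whole `count` column to the function in one call. A minimal base-R sketch, assuming it is acceptable to pull the column out as a plain vector first, would iterate over its elements instead:

```r
# Same toy function as in the question: returns three identical rows per input.
myfuncMultipleRow <- function(nbr) {
  data.frame(a = rep(nbr, 3), b = rep(nbr^2, 3), c = rep(nbr^3, 3))
}

a <- data.frame(count = c(1, 2, 3))

# a$count is a vector, so lapply() calls the function once per element;
# do.call(rbind, ...) then binds the resulting data frames in a single step.
res <- do.call(rbind, lapply(a$count, myfuncMultipleRow))
res

# With purrr loaded, map_dfr(a$count, myfuncMultipleRow) should give the same rows.
```

Binding once at the end also avoids growing `res` inside a loop, which was the performance concern in the question.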

Related

Equivalent to first./last. SAS processing in R

I did find a thread on this (R equivalent of .first or .last sas operator) but it did not fully answer my question.
I come from a SAS background and a common operation is, for example, when you have your patient ID with several different values, and you want to keep only the row with the minimum/maximum value for another variable for each ID. For example, I might have data with dates of a certain medical problem for each ID, and I want a dataset with just the first/last problem date for each patient.
Here's a simple example that gets me what I want, but I want to know if there's a better way to do it. I sort by ID and then by count, and I want to keep just the row with the largest count for each ID.
testdata<-data.frame(id=c(1,1,1,2,3,3,4,3,4,4,4),
count=c(5,9,2,6,16,12,0,11,8,8,7))
library(dplyr)
testdata2<-arrange(testdata,id,count)
testdata3<-cbind(testdata2,!duplicated(testdata2$id,fromLast=TRUE))
testdata4<-subset(testdata3,testdata3[,3]=='TRUE')[,-3]
> testdata4
id count
3 1 9
4 2 6
7 3 16
11 4 8
Is there a more compact way to do this?
Thank you.
do.call(rbind.data.frame,
c(by(testdata, testdata$id, function(d) d[c(1L,nrow(d)),]), stringsAsFactors=FALSE))
# id count
# 1.1 1 5
# 1.3 1 2
# 2.4 2 6
# 2.4.1 2 6
# 3.5 3 16
# 3.8 3 11
# 4.7 4 0
# 4.11 4 7
Breaking it down:
d[c(1L,nrow(d)),] returns the first and last row from the dataframe. (I'm assuming the frame has already been ordered appropriately.)
by(testdata, testdata$id, function breaks the larger frame into smaller frames by $id, and passes each smaller frame to the anonymous function. This returns a by-list of each return value.
do.call(rbind.data.frame, grabs the list and row-binds its elements back together into a single frame. Since the default (before R 4.0.0) was to convert strings to factors, I added stringsAsFactors=FALSE.
If you want to use dplyr, you can do:
library(dplyr)
group_by(testdata, id) %>%
slice(c(1,n())) %>%
ungroup()
# # A tibble: 8 × 2
# id count
# <dbl> <dbl>
# 1 1 5
# 2 1 2
# 3 2 6
# 4 2 6
# 5 3 16
# 6 3 11
# 7 4 0
# 8 4 7
where n() is a special function, usable inside dplyr verbs such as slice(), that returns the number of rows in the current (optionally grouped) frame.
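The answers above return the first and last rows in the original row order. For the question's actual goal (one row per id, with the largest count), a compact base-R sketch is to sort first and then keep the last row of each id, which mirrors SAS last. processing. The names `sorted`, `last_per_id` and `first_per_id` are only illustrative:

```r
testdata <- data.frame(id    = c(1, 1, 1, 2, 3, 3, 4, 3, 4, 4, 4),
                       count = c(5, 9, 2, 6, 16, 12, 0, 11, 8, 8, 7))

# Sort by id, then by count within id.
sorted <- testdata[order(testdata$id, testdata$count), ]

# last.id equivalent: keep the final row of each id (largest count).
last_per_id <- sorted[!duplicated(sorted$id, fromLast = TRUE), ]

# first.id equivalent: keep the first row of each id (smallest count).
first_per_id <- sorted[!duplicated(sorted$id), ]

last_per_id
first_per_id
```

This is the same duplicated() idea as in the question, without the intermediate cbind() and subset() steps.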

Getting stale values on using ifelse in a dataframe

Hi I am aggregating values from two columns and creating a final third column, based on priorities. If values in column 1 are missing or are NA then I go for column 2.
df=data.frame(internal=c(1,5,"",6,"NA"),external=c("",6,8,9,10))
df
  internal external
1        1
2        5        6
3                 8
4        6        9
5       NA       10
df$final <- df$internal
df$final <- ifelse((df$final=="" | df$final=="NA"),df$external,df$final)
df
  internal external final
1        1              2
2        5        6     3
3                 8     4
4        6        9     4
5       NA       10     2
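Why do I get final values of 4 and 2 for rows 3 and 5 when external is 8 and 10? I don't know what's wrong, but these values don't make any sense to me.
The issue arises because data.frame() converts your character values to factors (the default before R 4.0.0), and ifelse() then returns the factors' underlying integer codes instead of their labels.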
Your code will work fine with
df=data.frame(internal=c(1,5,"",6,"NA"),external=c("",6,8,9,10),stringsAsFactors = FALSE)
PS: this hideous conversion to factors should definitely belong to the R Inferno, http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
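To see the effect on any R version, here is a small sketch that forces the old behaviour with stringsAsFactors = TRUE and compares it with an explicit as.character() conversion. The names `use_external`, `bad` and `good` are only illustrative:

```r
# Force factor columns so the example behaves the same before and after R 4.0.0.
df <- data.frame(internal = c(1, 5, "", 6, "NA"),
                 external = c("", 6, 8, 9, 10),
                 stringsAsFactors = TRUE)

use_external <- df$internal == "" | df$internal == "NA"

# ifelse() drops the factor labels and keeps only the integer codes.
bad <- ifelse(use_external, df$external, df$internal)

# Converting to character first keeps the values that were typed in.
good <- ifelse(use_external,
               as.character(df$external),
               as.character(df$internal))

bad
good
```

Converting with as.character() is an alternative when the data frame has already been built with factor columns and cannot easily be recreated.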

Pasting as object names

I am trying to use paste0 with merge, so that I can merge a bunch of stuff in a loop. However, I'm having trouble with calling specific columns from data.frames
To illustrate, I'll use head
Example:
df <- data.frame(x=1:10,y=1:10)
head(df)
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
head(get("df"))
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
head(df$x)
[1] 1 2 3 4 5 6
head(get("df$x"))
Error in get("df$x") : object 'df$x' not found
Is there a way to get a specific column?
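The function get looks up an object by name in an environment. "df$x" is an expression, not the name of an object, so it cannot be found. If you do not specify the environment, the search starts in the environment get was called from (your global workspace at top level).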
You need to coerce df into an environment using as.environment, and then call get using this environment, e.g.:
get("x", as.environment(get("df")))
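A possibly simpler route, sketched here on the assumption that both the data frame name and the column name are available as strings, is to fetch the data frame with get and then index the column with [[. The names `obj_name`, `col_name` and `col` are made up for the illustration:

```r
df <- data.frame(x = 1:10, y = 1:10)

# Build the object name with paste0, as one would inside a loop.
obj_name <- paste0("d", "f")
col_name <- "x"

# get() returns the data frame; [[ then extracts the column by its name.
col <- get(obj_name)[[col_name]]
head(col)
```

Because [[ accepts a character string, no conversion to an environment is needed.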

Storing an output in the same data.frame when row size of output different

Sometimes I want to perform a function (eg difference calculation) on a dataset and store the results directly in the data frame
df <- data.frame(a$C, diff(a$C))
But I cannot do that because the number of rows is different.
Is there some syntax that will allow me to do that, perhaps filling in NA where the function (diff()) gives no result?
There isn't a general solution to this without making vast assumptions about the whole panoply of function one may wish to use.
For the example you show, we can easily work out that the first value from diff() would be an NA if it returned it:
set.seed(5)
d <- rpois(10, 5)
> d
[1] 3 6 8 4 2 6 5 7 9 2
> diff(d)
[1] 3 2 -4 -2 4 -1 2 2 -7
So if you are using diff() then you can always just do:
> dd <- data.frame(d, Diff = c(NA, diff(d)))
> dd
d Diff
1 3 NA
2 6 3
3 8 2
4 4 -4
5 2 -2
6 6 4
7 5 -1
8 7 2
9 9 2
10 2 -7
But now consider what you would do with any other function that you might wish to use that doesn't always return NA in the correct place.
For this example, we can use the zoo package which has an na.pad argument:
require(zoo)
d2 <- as.zoo(d)
ddd <- data.frame(d, Diff = diff(d2, na.pad = TRUE))
> ddd
d Diff
1 3 NA
2 6 3
3 8 2
4 4 -4
5 2 -2
6 6 4
7 5 -1
8 7 2
9 9 2
10 2 -7
If you are using a modelling function with a formula interface (e.g. lm()) and that function has an na.action argument, then you can set na.action = na.exclude in the function call and extractor functions such as fitted(), resid() etc will add back in to their output NA in the correct places so that the output is of the same length as the data passed to the modelling function.
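As a small illustration of the na.exclude behaviour described above, using made-up data (the values in `d_na` are purely illustrative):

```r
# Toy data with one missing predictor value in row 3.
d_na <- data.frame(x = c(1, 2, NA, 4, 5),
                   y = c(2.1, 3.9, 6.2, 8.1, 9.8))

# na.exclude drops row 3 for fitting but remembers where it was.
fit <- lm(y ~ x, data = d_na, na.action = na.exclude)

# resid() and fitted() pad the result with NA, so the lengths match the data
# and the values can be stored straight back into the data frame.
d_na$resid  <- resid(fit)
d_na$fitted <- fitted(fit)
d_na
```

With the default na.action = na.omit, the same calls would return only four values and the assignment into the data frame would fail.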
If you have other, more specific cases you want to explore, please edit your question. In specific cases there will usually be a simple answer. In the general case the answer is no, it is not possible to do what you ask.
The standard method is, as you say, to create a vector that is extended at one end or the other with an NA:
dfrm$diffvec <- c(NA, diff(firstvec) )
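If this padding comes up often, it could be wrapped in a small helper. The sketch below uses a hypothetical function `pad_front` and assumes the shortened result is missing values only at the start, which holds for diff() but not for functions in general:

```r
# Hypothetical helper: left-pad x with NA until it has length n.
# Only appropriate when the "missing" results belong at the front.
pad_front <- function(x, n) {
  c(rep(NA, n - length(x)), x)
}

d <- c(3, 6, 8, 4, 2)

dd <- data.frame(d,
                 Diff  = pad_front(diff(d), length(d)),
                 Diff2 = pad_front(diff(d, lag = 2), length(d)))
dd
```

The helper handles diff() with any lag, since the number of NA values is computed from the lengths rather than hard-coded.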

ave function in R: First argument a vector

I'm trying to use the following code in R:
ID=seq(1,11)
g=c(1,2,3,1,1,2,3,4,4,1,3)
x <- sample(11)
d <- data.frame(ID,g, x)
Ranking_Categoria<-function(d,var,category)
{
d$rank<-ave(d$var,d$category,FUN=rank)
return(d)
}
and I get the following error message:
Error in split.default(x, g) : first argument must be a vector.
The variables var and category (character) are names of columns of the data frame d that the user needs to specify in order to get the desired result. I need to refer to these names when I call ave(), as you can see.
You need to use [[ to get the var and category columns by name:
Ranking_Categoria<-function(d,var,category)
{
d$rank<-ave(d[[var]],d[[category]],FUN=rank)
return(d)
}
... because d$var does not evaluate var; it tries to get the column literally called "var", and there is none, so it returns NULL.
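A self-contained way to check the [[ version is sketched below. The seed is an assumption added only to make the run reproducible, and `rank_by_category` is an illustrative name for the same function:

```r
set.seed(1)  # assumption: any seed works, it just fixes the sample
d <- data.frame(ID = 1:11,
                g  = c(1, 2, 3, 1, 1, 2, 3, 4, 4, 1, 3),
                x  = sample(11))

rank_by_category <- function(d, var, category) {
  # d[[var]] looks up the column whose name is stored in var;
  # d$var would look for a column literally named "var" and return NULL.
  d$rank <- ave(d[[var]], d[[category]], FUN = rank)
  d
}

out <- rank_by_category(d, "x", "g")
is.null(d$var)  # TRUE: there is no column called "var"
out
```

Within each value of g, the rank column should run from 1 up to the number of rows in that group.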
UPDATE
> Ranking_Categoria(d, "x", "g")
ID g x rank
1 1 1 10 3
2 2 2 9 2
3 3 3 4 1
4 4 1 11 4
5 5 1 1 1
6 6 2 8 1
7 7 3 6 2
8 8 4 2 1
9 9 4 5 2
10 10 1 3 2
11 11 3 7 3
The best solution would be not to use names at all:
Ranking_Categoria<-function(d,var,category)
{
d$rank<-ave(var,category,FUN=rank)
return(d)
}
Then call it as
Ranking_Categoria(d,d$x,d$g)
The reason why the function in your question didn't work as you thought it would is partially because R's syntax and DWIM-ness for string manipulation sucks. Here's a hacky, fragile solution using eval and parse:
Ranking_Categoria<-function(d,var,category)
{
string=paste('d$rank<-ave(d$',var,',d$',category,',FUN=rank)',sep="")
eval(parse(text=string))
return(d)
}
However, you still have to call it as
Ranking_Categoria(d,"x","g")
And if you already have objects with the names of x and g, then may the gods help you if you try to do Ranking_Categoria(d,x,g)... Crap like this is why I've gone from using Perl and R equally to sticking with Perl (my first and native programming language) and using R only when necessary.
