Convert a "by" object to a data frame in R

I'm using the "by" function in R to chop up a data frame and apply a function to different parts, like this:
pairwise.compare <- function(x) {
Nright <- ...
Nwrong <- ...
Ntied <- ...
return(c(Nright=Nright, Nwrong=Nwrong, Ntied=Ntied))
}
Z.by <- by(rankings, INDICES=list(rankings$Rater, rankings$Class), FUN=pairwise.compare)
The result (Z.by) looks something like this:
: 4
: 357
Nright Nwrong Ntied
3 0 0
------------------------------------------------------------
: 8
: 357
NULL
------------------------------------------------------------
: 10
: 470
Nright Nwrong Ntied
3 4 1
------------------------------------------------------------
: 11
: 470
Nright Nwrong Ntied
12 4 1
What I would like is to have this result converted into a data frame (with the NULL entries not present) so it looks like this:
Rater Class Nright Nwrong Ntied
1 4 357 3 0 0
2 10 470 3 4 1
3 11 470 12 4 1
How do I do that?

The by function returns a list, so you can do something like this:
data.frame(do.call("rbind", by(x, column, mean)))
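As a minimal sketch of that idea on made-up data: the rankings values and the summarise_group helper below are hypothetical stand-ins, since the body of pairwise.compare is not shown in the question. rbind() skips the NULL entries that by() produces for empty groups, so only non-empty groups end up as rows.

```r
# Hypothetical toy data; only the column names Rater and Class come from the question.
rankings <- data.frame(
  Rater = c(4, 4, 10, 10, 11),
  Class = c(357, 357, 470, 470, 470),
  Score = c(1, 2, 3, 4, 5)
)

# Stand-in for pairwise.compare(): returns a named numeric vector per group.
summarise_group <- function(d) c(N = nrow(d), Total = sum(d$Score))

Z.by <- by(rankings, list(rankings$Rater, rankings$Class), summarise_group)

# Each non-NULL element becomes one row; NULL (empty) groups are skipped by rbind().
Z.df <- data.frame(do.call(rbind, Z.by))
Z.df
```

Note that this form carries no Rater or Class columns, so the group labels have to be added separately if they are needed.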

Consider using ddply in the plyr package instead of by. It handles the work of adding the column to your dataframe.

Old thread, but for anyone who searches for this topic:
analysis = by(...)
data.frame(t(vapply(analysis,unlist,unlist(analysis[[1]]))))
unlist() will take an element of a by() output (in this case, analysis) and express it as a named vector.
vapply() applies unlist() to every element of analysis and collects the results. It requires a template argument describing the type and length of each result, which is what unlist(analysis[[1]]) is there for. You may need to add a check that analysis is not empty if that is possible.
Each output will be a column, so t() transposes it to the desired orientation where each analysis entry becomes a row.
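The three steps can be sketched on a small example. The data and the summary function here are hypothetical, and the sketch assumes there are no empty groups: a NULL element would not match the template, so vapply() would stop with an error.

```r
# Hypothetical toy data with a single grouping variable and no empty groups.
df <- data.frame(g = c("a", "a", "b", "b", "b"), v = c(1, 2, 3, 4, 5))
analysis <- by(df, df$g, function(d) c(n = nrow(d), total = sum(d$v)))

# The template is a named numeric vector of length 2, taken from the first group.
template <- unlist(analysis[[1]])

wide <- vapply(analysis, unlist, template)  # one column per group
result <- data.frame(t(wide))               # transpose: one row per group
result
```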

This expands upon Shane's solution of using rbind() but also adds columns identifying groups and removes NULL groups - two features which were requested in the question. By using base package functions, no other dependencies are required, e.g., plyr.
simplify_by_output = function(by_output) {
null_ind = unlist(lapply(by_output, is.null)) # by() returns NULL for combinations of grouping variables for which there are no data. rbind() ignores those, so you have to keep track of them.
by_df = do.call(rbind, by_output) # Combine the results into a data frame.
return(cbind(expand.grid(dimnames(by_output))[!null_ind, ], by_df)) # Add columns identifying groups, discarding names of groups for which no data exist.
}
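A usage sketch for this function on hypothetical data follows. Naming the elements of INDICES (Rater and Class) is an assumption that goes slightly beyond the call in the question; it makes expand.grid() produce columns with those names rather than Var1 and Var2.

```r
# Same logic as the function above, repeated so the sketch runs on its own.
simplify_by_output <- function(by_output) {
  null_ind <- unlist(lapply(by_output, is.null))  # TRUE for empty groups
  by_df <- do.call(rbind, by_output)              # rbind() skips the NULLs
  cbind(expand.grid(dimnames(by_output))[!null_ind, ], by_df)
}

# Hypothetical toy data; the summary function stands in for pairwise.compare().
rankings <- data.frame(
  Rater = c(4, 4, 10, 10, 11),
  Class = c(357, 357, 470, 470, 470),
  Score = c(1, 2, 3, 4, 5)
)
Z.by <- by(rankings,
           list(Rater = rankings$Rater, Class = rankings$Class),
           function(d) c(N = nrow(d), Total = sum(d$Score)))

out <- simplify_by_output(Z.by)
out
```

The Rater and Class columns come back as factors, so convert them with as.character() or as.numeric(as.character()) if plain values are needed.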

I would do
x = by(data, list(data$x, data$y), function(d) whatever(d))
array(x, dim(x), dimnames(x))
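A sketch of what this does, assuming the function returns a single number per group (in that case by() returns a numeric array, and array() just strips the "by" class). The data are hypothetical, and the as.data.frame(as.table()) step is an addition to the answer, shown as one possible way to get a long data frame from the array.

```r
# Hypothetical toy data with two grouping variables; group (b, v) is empty.
d <- data.frame(x = c("a", "a", "b"), y = c("u", "v", "u"), val = c(1, 2, 3))
res <- by(d, list(x = d$x, y = d$y), function(g) sum(g$val))

# Plain numeric array with the same dimensions and dimnames; empty groups are NA.
arr <- array(res, dim(res), dimnames(res))

# Long format: one row per group, then drop the empty groups.
long <- as.data.frame(as.table(arr), responseName = "total")
long <- long[!is.na(long$total), ]
long
```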

Related

create dataframes and column names using R equivalent of python *args

I learnt that in R you can pass a variable number of parameters to your function with ...
I'm now trying to create a function and loop through ..., for example to create a dataframe.
create_df <- function(...) {
for(i in ...){
... <- data.frame(...=c(1,2,3,4),column2=c(1,2,3,4))
}
}
create_df(hello,world)
I would like to create two dataframes with this code one named hello and the other world. One of the columns should also be named hello and world respectively. Thanks
This is my error:
Error in create_df(hello, there) : '...' used in an incorrect context
It's generally not a good idea for functions to inject variables into their calling environment. It's better to return values from functions and keep related values in a list. In this case you could do this instead:
create_df <- function(...) {
names <- sapply(match.call(expand.dots = FALSE)$..., deparse)
Map(function(x) {
setNames(data.frame(a1=c(1,2,3,4),a2=c(1,2,3,4)), c(x, "column2"))
}, names)
}
create_df(hello,world)
# $hello
# hello column2
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
# $world
# world column2
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
This returns a named list which is much easier to work with in R. We use match.call to turn the ... names into strings and then use those strings with functions that expect them like setNames() to change data.frame column names. Map is also a great helper for generating lists. It's often easier to use map style functions rather than bothering with explicit for loops.
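If quoting the names is acceptable, a simpler variant avoids match.call() altogether. This is only a sketch of an alternative, not the approach in the answer above; create_df_from_names is a hypothetical helper name.

```r
# Hypothetical alternative: take the names as a character vector.
create_df_from_names <- function(col_names) {
  dfs <- lapply(col_names, function(nm) {
    # First column is named after the data frame, second is always "column2".
    setNames(data.frame(c(1, 2, 3, 4), c(1, 2, 3, 4)), c(nm, "column2"))
  })
  setNames(dfs, col_names)  # name the list elements as well
}

dfs <- create_df_from_names(c("hello", "world"))
names(dfs)
dfs$world
```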

pass a function with several arguments and some constants down a dataframe column, and output a new column for each term in the vector

I am trying to use a column in my df, the df is called combo, as an argument 'h' in my function. As well as a vector 'v'. The df is:
'data.frame': 10 obs. of 2 variables:
$ fake : num 2.24e-05 2.40e-05 2.69e-05 2.87e-05 3.14e-05 ...
$ funny: int 1 2 3 4 5 6 7 8 9 10
The vector is ve:
str(ve)
 num [1:6] 1.37 2.4 2.23 3.2 2.9 3.22
The Function is:
f<-function(h, v){
m_k<- (Density*h)/(Cd*v)
y<- m_k*(v*sin(a)+(m_k*g))
return (y)
}
where Density and Cd are constants.
I get the following error when I run f(combo$fake, ve)
Warning messages:
1: In (Density * h)/(Cd * v) :
longer object length is not a multiple of shorter object length
I understand that they are not the same length. What I would like is for R to apply the function using the first term of ve for each term in combo$fake, producing one column for that first ve, and then repeat with the 2nd term in ve, so that in the end I have 6 columns and 10 rows of results given by the function.
I have tried using the apply functions, and a for loop, as well as referring to the arguments explicitly as with combo$fake, but I want to avoid hardcoding the function (otherwise I wouldn't use a function). This is just a sample; my real dataset is much bigger.
Here are some examples of what I've tried and a tibble of the data frame.
combo$fake $funny
<dbl> <int>
1 0.0000224 1
2 0.000024 2
3 0.0000269 3
4 0.0000287 4
5 0.0000314 5
6 0.0000324 6
y<- for (i in seq_len(ve)) {
for(j in seq_along(combo$fake))
f(combo$fake, ve)
}
y<- mapply(f, ve, combo$fake)
I've tried reading other similar questions on stackoverflow, but I just cannot get it to work :-( . Needless to say, I am very new in R, please help and thank you in advance.
mapply takes the first element of each argument and puts it into your function, then proceeds with the second element of each argument and so on. You can get your arguments into the shape needed by repeating their elements, such that element one of ve is combined with each element of df$fake before the next value of ve is used. So essentially, use rep with its arguments times and each.
If I understand your question correctly, you can use something like this:
# your function and its constants
f<-function(h, v){
Density <- 10
Cd <- 10
a <- 10
g <- 10
m_k<- (Density*h)/(Cd*v)
y<- m_k*(v*sin(a)+(m_k*g))
return (y)
}
# some dataframe and vector
df <- data.frame(fake = 1:10)
ve <- 20:25
# call mapply by repeating df$fake by the length of vector
# and repeat each element in the vector by the number of elements in df$fake
results <- mapply(f, rep(df$fake, times = length(ve)),
rep(ve, each = length(df$fake)))
# reshape the output
matrix(results, nrow=length(df$fake), ncol=length(ve))
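Because f uses only vectorised arithmetic, outer() is another option worth considering: it builds the same pairs of arguments internally and returns the matrix directly. This is a sketch under the same assumed constants as above (the values 10 are placeholders, not values from the question), with the constants moved into default arguments.

```r
# Same function, with the placeholder constants as default arguments.
f <- function(h, v, Density = 10, Cd = 10, a = 10, g = 10) {
  m_k <- (Density * h) / (Cd * v)
  m_k * (v * sin(a) + (m_k * g))
}

fake <- 1:10   # stands in for combo$fake
ve <- 20:25

# Rows correspond to elements of fake, columns to elements of ve.
res_outer <- outer(fake, ve, f)

# The mapply() construction from above, for comparison.
res_mapply <- matrix(
  mapply(f, rep(fake, times = length(ve)), rep(ve, each = length(fake))),
  nrow = length(fake), ncol = length(ve)
)
all.equal(res_outer, res_mapply)
```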

R use of lapply() to populate and name one column in list of dataframes

After searching for some time, I cannot find a smooth R-esque solution.
I have a list of vectors that I want to convert to dataframes and add a column with the names of the vectors. I can't do this with cbind() and melt() to a single dataframe because there are vectors with different numbers of rows.
Basic example would be:
list<-list(a=c(1,2,3),b=c(4,5,6,7))
var<-"group"
What I have come up with and works is:
list<-lapply(list, function(x) data.frame(num=x,grp=""))
for (j in 1:length(list)){
list[[j]][,2]<-names(list[j])
names(list[[j]])[2]<-var
}
But I am trying to better use lapply() and have cleaner coding practices. Right now I rely so heavily on for and if statements, which a lot of the base functions do already and much more efficiently than I can code at this point.
The pseudo code I would like is something like:
list<-lapply(list, function(x) data.frame(num=x,get(var)=names(x))
Is there a clean way to get this done?
Second closely related question, if I already have a list of dataframes, why is it so hard to reassign column values and names using lapply()?
So using something like:
list<-list(a=data.frame(num=c(1,2,3),grp=""),b=data.frame(num=c(4,5,6,7),grp=""))
var<-"group"
#pseudo code
list<-lapply(list, function(x) x[,2]<-names(x)) #populate second col with name of df[x]
list<-lapply(list, function(x) names[[x]][2]<-var) #set 2nd col name to 'var'
The first line of pseudo code throws an error about matching row lengths. Why does lapply() not just loop over and repeat names(x) like the same function on a single dataframe does in a for loop?
For the second line, as I understand it I can use setNames() to reassign all the column names, but how do I make this work for just one of the col names?
Many thanks for any ideas or pointing to other threads that cover this and helping me understand the behavior of lapply() in this context.
A full R base approach without using loops
> l<-list(a=c(1,2,3),b=c(4,5,6,7))
> data.frame(grp=rep(names(l), lengths(l)), num=unlist(l), row.names = NULL)
grp num
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
7 b 7
Related to your first/main question you can use the function enframe from package tibble for this purpose
library(tibble)
library(tidyr)
library(dplyr)
l<-list(a=c(1,2,3),b=c(4,5,6,7))
l %>%
enframe(name = "group", value="value") %>%
unnest(value) %>%
group_split(group)
Try this:
library(dplyr)
mylist <- list(a = c(1,2,3), b = c(4,5,6,7))
bind_rows(lapply(names(mylist), function(x) tibble(grp = x, num = mylist[[x]])))
# A tibble: 7 x 2
grp num
<chr> <dbl>
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
7 b 7
This is essentially a lapply-based solution where you iterate over the names of your list, and not the individual list elements themselves. If you prefer to do everything in base R, note that the above is equivalent to
do.call(rbind, lapply(names(mylist), function(x) data.frame(grp = x, num = mylist[[x]], stringsAsFactors = FALSE)))
Having said that, tibbles, as a modern implementation of data frames, are often preferred, as is bind_rows over the do.call(rbind, ...) construct.
As to the second question, note the following:
lapply(mylist, function(x) str(x))
num [1:3] 1 2 3
num [1:4] 4 5 6 7
....
lapply(mylist, function(x) names(x))
$a
NULL
$b
NULL
What you see here is that the function inside of lapply gets the elements of mylist. In this case, it gets to work with the numeric vector. This does not have any name as far as the function that is called inside lapply is concerned. To highlight this, consider the following:
names(c(1,2,3))
NULL
Which is the same: the vector c(1,2,3) does not have a name attribute.
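Since the function only sees the values, one way to keep a list of data frames (rather than one combined data frame) is to iterate over values and names together with Map(). This is a sketch, not part of the answer above; vec_list is a hypothetical name used to avoid masking list(). It also shows the second question: renaming a single column works inside the function as long as the modified data frame is returned.

```r
vec_list <- list(a = c(1, 2, 3), b = c(4, 5, 6, 7))
var <- "group"

df_list <- Map(function(x, nm) {
  out <- data.frame(num = x, grp = nm)  # nm is recycled down the rows
  names(out)[2] <- var                  # rename only the second column
  out                                   # return the data frame, not the assignment
}, vec_list, names(vec_list))

df_list$b
```

The list names (a, b) are carried over from vec_list, so each data frame can still be looked up by name.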

trouble setting up iteration on multiple data.frames in r

I am having a recurring issue of performing specific tasks on multiple data.frames. Here is my working example data.frame, which was imported from text files.
cellID X Y Area AVGFP DeviationGFP AvgRFP DeviationsRFP Slice GUI.ID
1 1 18.20775 26.309859 568 5.389085 7.803248 12.13028 5.569880 0 1
2 2 39.78755 9.505495 546 5.260073 6.638375 17.44505 17.220153 0 1
3 3 30.50000 28.250000 4 6.000000 4.000000 8.50000 1.914854 0 1
4 4 38.20233 132.338521 257 3.206226 5.124264 14.04669 4.318130 0 1
5 5 43.22467 35.092511 454 6.744493 9.028574 11.49119 5.186897 0 1
6 6 57.06534 130.355114 352 3.781250 5.713022 20.96591 14.303546 0 1
7 7 86.81765 15.123529 1020 6.043137 8.022179 16.36471 19.194279 0 1
8 8 75.81932 132.146417 321 3.666667 5.852172 99.47040 55.234726 0 1
9 9 110.54277 36.339233 678 4.159292 6.689660 12.65782 4.264624 0 1
10 10 127.83480 11.384886 569 4.637961 6.992881 11.39192 4.287963 0 1
As in previous questions I have posted, there are 40 of these data.frames named slice1...slice40.
What I want to do is add a new column to each of these data.frames that contains the product of AVGFP and Area. I can perform this on one data.frame easily by using
stats[[1]]$totalGFP <- stats[[1]]$AVGFP * stats[[1]]$Area
I am stuck trying to apply this command to every data.frame in stats
I appreciate any and all help. To help moving forward when you post a solution can you please describe the details of the commands used to help me follow along, thank you!
Like this:
stats <- lapply(stats, transform, totalGFP = AVGFP * Area)
I'll do my best to explain but please refer to ?lapply and ?transform for the full docs.
transform is a function to add columns to a data.frame, according to formulas of the type totalGFP = AVGFP * Area passed as arguments. For example, to add the totalGFP column to your first data.frame, you could run transform(stats[[1]], totalGFP = AVGFP * Area).
lapply applies a function (here transform) to each element of a list or a vector (here stats), and returns a list. If the function to be applied requires more arguments, they can be passed at the end of the lapply call, here totalGFP = AVGFP * Area. So here lapply is an elegant way of running transform on each element of stats.
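A small runnable sketch of that call, using two made-up data frames in place of the 40 slices (the values and the two-element stats list are hypothetical):

```r
# Hypothetical stand-in for the real list of 40 data frames.
stats <- list(
  slice1 = data.frame(AVGFP = c(1, 2), Area = c(10, 20)),
  slice2 = data.frame(AVGFP = c(3, 4), Area = c(30, 40))
)

# transform() is called once per data frame; totalGFP = AVGFP * Area is
# passed through lapply() and evaluated inside each data frame.
stats <- lapply(stats, transform, totalGFP = AVGFP * Area)

stats$slice1
```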
Given that you wrote "please describe the details of the commands", try this simple example:
# create two small data frames
df1 <- data.frame(AVGFP = 1:3, Area = 4:6)
df2 <- data.frame(AVGFP = 7:9, Area = 1:3)
# create a list with named objects: the two data frames.
# ?list: "The arguments to list [...] of the form [...] tag = value
ll <- list(df1 = df1, df2 = df2)
str(ll)
# apply a function on each element in the list
# each element is a single data frame
# Use an 'anonymous function', function(x), where 'x' corresponds to each single data frame
# The function does this:
# (1) calculate the new variable 'total', and (2) add it to the data frame
ll2 <- lapply(X = ll, FUN = function(x){
total <- x$AVGFP * x$Area
x <- data.frame(x, total)
})
# check ll2
str(ll2)

Custom function within subset of data, base functions, vector output?

Apologies for a semi 'double post'. I feel I should be able to crack this but I'm going round in circles. This is on a similar note to my previously well answered question:
Within ID, check for matches/differences
test <- data.frame(
ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05",
"2004-03-05","2004-06-05","2004-09-05","2005-01-05",
"2006-10-03","2007-02-05")
)
What I want to do is tag the subjects whose first visit (as at DOV) was less than 180 days from their diagnosis (DOD). I have the following from the plyr package.
ddply(test, "ID", function(x) ifelse( (as.numeric(x$DOV[1]) - as.numeric(x$DOD[1])) < 180,1,0))
Which gives:
ID V1
1 A 1
2 B 0
3 C 1
What I would like is a vector 1,1,1,0,0,0,0,1,1 so I can append it as a column to the data frame. Basically this ddply function is fine: it makes a 'lookup' table where I can see which IDs have their first visit within 180 days of their diagnosis, which I could then use to go through my original test and make an indicator variable, but I'd have thought I should be able to do this in one step.
I'd also like to use base if possible. I had a method with 'by', but again it only gave one result per ID and was also a list. Have been trying with aggregate but getting things like 'by has to be a list', then 'it's not the same length' and using the formula method of input I'm stumped 'cbind(DOV,DOD) ~ ID'...
Appreciate the input, keen to learn!
After wrapping as.Date around the creation of those date columns, this returns the desired marking vector assuming the df named 'test' is sorted by ID (and done in base):
# could put an ordering operation here if needed
0 + unlist( # to make vector from list and coerce logical to integer
lapply(split(test, test$ID), # to apply fn with ID
function(x) rep( # to extend a listwise value across all ID's
min(x$DOV-x$DOD) <180, # compare the minimum of a set of intervals
NROW(x)) ) )
11 12 13 21 22 23 24 31 32 # the labels
1 1 1 0 0 0 0 1 1 # the values
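For comparison, the same indicator can be sketched with ave(), which returns a vector aligned with the original rows and so does not depend on the data frame being sorted by ID. This is an alternative sketch rather than the method above; it treats the earliest visit per ID as the "first visit", and within180 is a hypothetical column name.

```r
test <- data.frame(
  ID  = c(rep(1, 3), rep(2, 4), rep(3, 2)),
  DOD = as.Date(c(rep("2000-03-01", 3), rep("2002-05-01", 4), rep("2006-09-01", 2))),
  DOV = as.Date(c("2000-03-05", "2000-06-05", "2000-09-05",
                  "2004-03-05", "2004-06-05", "2004-09-05", "2005-01-05",
                  "2006-10-03", "2007-02-05"))
)

# Days from diagnosis to each visit, as plain numbers.
days <- as.numeric(test$DOV - test$DOD)

# Within each ID, flag 1 if the shortest interval is under 180 days;
# ave() repeats that single value across all rows of the ID.
test$within180 <- ave(days, test$ID, FUN = function(d) as.numeric(min(d) < 180))
test$within180
```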
I have added stringsAsFactors=FALSE to the data.frame() call:
test <- data.frame(ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05","2004-03-05",
"2004-06-05","2004-09-05","2005-01-05","2006-10-03","2007-02-05")
, stringsAsFactors=FALSE)
CODE
test$V1 <- ifelse(c(FALSE, diff(test$ID) == 0), 0,
1*(as.numeric(as.Date(test$DOV)-as.Date(test$DOD))<180))
test$V1 <- ave(test$V1,test$ID,FUN=max)
