I would like to convert this data frame
data <- data.frame(
  color  = c("red", "red", "red", "green", "green", "green", "blue", "blue", "blue"),
  object = c("box", "chair", "table", "box", "chair", "table", "box", "chair", "table"),
  units  = 1:9,
  price  = c(11.5, 12.5, 13.5, 14.5, 15.5, 16.5, 17.5, 18.5, 19.5)
)
to this other one
output <- data.frame(
  color       = c("red", "green", "blue"),
  units_box   = c(1, 4, 7),
  price_box   = c(11.5, 14.5, 17.5),
  units_chair = c(2, 5, 8),
  price_chair = c(12.5, 15.5, 18.5),
  units_table = c(3, 6, 9),
  price_table = c(13.5, 16.5, 19.5)
)
Therefore, I am using reshape2::melt and reshape2::dcast to build a user-defined function, as follows:
fun <- function(df, var, group) {
  r <- reshape2::melt(df, id.vars = var)
  r <- reshape2::dcast(r, var ~ group)
  return(r)
}
When I use the function as follows:
fun(data,color,object)
I get the following error message:
Error in melt_check(data, id.vars, measure.vars, variable.name, value.name) :
  object 'color' not found
Do you know how I can solve it? I think the problem is that I should pass the variables to reshape2::melt as quoted names, but I do not know how.
Note 1: I would like to keep the original number format of the variables (i.e. units without decimals and price with one decimal).
Note 2: I would like to stress that my real code (this is just a simplified example) is much longer and involves dplyr functions (including enquo() and UQ()). Therefore solutions for this case should be compatible with dplyr.
Note 3: I do not use tidyr (I am a big fan of the whole tidyverse) because the current tidyr still uses the old interface for these functions, and I share the script with other people who might not be willing to use the development version of tidyr.
We can use dcast from data.table, which handles multiple value.var columns directly (no aggregation function is needed here, since each color/object combination occurs exactly once):
library(data.table)
dcast(setDT(data), color ~ object, value.var = c("units", "price"))
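With the sample data above, this should return one row per color, with the units and price columns spread by object (data.table sorts the rows by the left-hand-side variable):

   color units_box units_chair units_table price_box price_chair price_table
1:  blue         7           8           9      17.5        18.5        19.5
2: green         4           5           6      14.5        15.5        16.5
3:   red         1           2           3      11.5        12.5        13.5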
I solved the issue myself (although I do not fully understand the reasons behind it).
The main problem, as I suspected, was that passing the user-defined function's variables to melt and dcast causes some kind of conflict, maybe due to the lack of quotes (?).
Anyway, I renamed the variables using dplyr::rename so that the names no longer depend on the function arguments but are plain character strings. Here is the final code I am applying:
library(dplyr)   # %>% and rename()
library(rlang)   # enquo() and UQ()

fun <- function(df, var, group) {
  enquo_var   <- enquo(var)
  enquo_group <- enquo(group)   # captured but not used below
  r <- df %>%
    reshape2::melt(., id.var = 1, variable.name = "parameter") %>%
    dplyr::rename(var = UQ(enquo_var)) %>%   # give the id column the fixed name "var"
    reshape2::dcast(data = ., formula = var ~ parameter, value.var = "value")
  return(r)
}

funx <- fun(data, color, object)
Although I found a solution to my particular problem, I would very much appreciate it if someone could explain the reasons behind it.
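A plausible explanation of the original error (my reading, not verified against reshape2 internals): melt() and dcast() do not use tidy evaluation, so a bare argument such as var is evaluated as an ordinary object. When fun(data, color, object) forces id.vars = var, R looks for an object named color in the calling environment, finds none, and melt_check reports "object 'color' not found". A minimal sketch of an alternative that converts the captured arguments to strings first (fun2 is a hypothetical name, not from any package; as_name() needs rlang >= 0.4, older versions use quo_name()):

library(rlang)

fun2 <- function(df, var, group) {
  # Convert the bare column names captured from the caller into strings,
  # which is what reshape2's interface actually expects.
  var_name   <- rlang::as_name(rlang::enquo(var))    # e.g. "color"
  group_name <- rlang::as_name(rlang::enquo(group))  # e.g. "object"
  m <- reshape2::melt(df, id.vars = c(var_name, group_name))
  reshape2::dcast(m, stats::as.formula(paste(var_name, "~ variable +", group_name)))
}

fun2(data, color, object)
# columns: color, units_box, units_chair, units_table, price_box, ...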
PS: I hope the new version of tidyr is ready soon to make such tasks easier. Thanks @hadley for your fantastic work.
This is a problem I often encounter: I try to access an object's own name when using a function from the apply family and spend hours figuring out how to do it... For instance (this is not the core of my question), today I wanted to inspect an attached package to figure out whether it contained any non-function objects. After a lot of trial and error, I finally came up with the following (for the rrapply package - I know looking at the documentation is also easy, but this one illustrates the problem well):
library(rrapply)
library(magrittr)  # for %>%

eapply(rlang::pkg_env('rrapply'), function(x) {if (!is.function(x)) x}) %>%
  `[`(sapply(., function(x) !is.null(x))) %>%
  names()
## [1] "renewable_energy_by_country" "pokedex"
I feel that this is really too complicated for such a simple test!
So my question: is there an easy way to loop through an object in base R (or maybe the tidyverse) and return only the names of the elements that satisfy a certain condition? rrapply seems able to achieve that, but:
it is fairly complicated
and it seems to work on lists only, and to loop through all sub-elements as well, which is not desired.
Thanks!
Identify the environment of interest, e, and then use eapply with the indicated function, taking the names of the extracted elements at the end. This isn't conceptually different from the code in the question, but it seems somewhat less complex when done in base R in the following way:
e <- as.environment("package:rrapply")
names(Filter(`!`, eapply(e, is.function)))
or the same code written as a pipeline:
library(magrittr)

"package:rrapply" %>%
  as.environment %>%
  eapply(is.function) %>%
  Filter(`!`, .) %>%
  names
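If this comes up often, the same idea wraps naturally into a small reusable helper (ls_where is a hypothetical name, not from any package):

ls_where <- function(env, predicate) {
  # Apply `predicate` to every object in `env` and keep only the
  # names of the elements for which it returns TRUE.
  names(Filter(isTRUE, eapply(env, predicate)))
}

ls_where(as.environment("package:rrapply"), Negate(is.function))
## [1] "renewable_energy_by_country" "pokedex"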
I am using magrittr and was able to pass one variable to an R function via magrittr pipes, and also to pick which parameter goes where in a multivariable function F(x, y, z, ...).
But I want to pass 2 parameters at the same time.
For example, I will use the select function from dplyr and pass in tableName and ColumnName.
I thought I could do it like this:
tableName %>% ColumnName %>% select(.,.)
But this did not work.
Hope someone can help me on this.
EDIT:
Some people below are saying that this is a duplicate of a question linked by others.
But based on the algebraic structure of magrittr's definition of the pipe for multivariable functions, it should be "doable" purely from the definition of the pipe.
The linked question goes beyond the base definition and employs other external functions and/or libraries to achieve passing multiple parameters to a function.
I am looking for a solution, IF POSSIBLE, using just the magrittr library and other base operations.
So this is the restriction placed on this problem.
In most of my university courses in math and computer science, we were restricted to using only the things taught in the course. So when I say I am using dplyr and magrittr, that should imply those are the only things one is permitted to use; the question is posed under this constraint.
Hope this clarifies the scope of possible solutions here.
And if it is not possible with just these libraries, I want someone to tell me that it cannot be done.
I think you need to give a little more detail about exactly what you want, but as I understand the problem, one solution might be:
list(x = tableName, y = "ColumnName") %>% { select(eval(.$x), .$y) }
This is just a modification of the code linked in the chat. The issue with other implementations is that the first and second inputs to select() must be of specific (and different) types, so just plugging in two strings or two objects won't work.
In the same spirit, you can also use either:
list(x = "tableName", y = "ColumnName") %>% { select(get(.$x),.$y) }
or
list(tableName, "ColumnName") %>% do.call("select", .)
Note, however, that all of these functions (i.e., get(), eval(), and do.call()) take an environment specification and could produce errors if it is improperly specified. They work just fine in these examples because everything happens in the global environment, but that might change if they were called, e.g., inside a function.
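As a concrete check, here is the first pattern applied to the built-in mtcars data set (standing in for tableName, with "mpg" as the column name):

library(dplyr)
library(magrittr)

list(x = mtcars, y = "mpg") %>% { select(eval(.$x), .$y) } %>% head()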
I'm often writing things like:
dataframe$this_column <- as.Date(dataframe$this_column)
That is, when changing some column in my data frame [table], I'm constantly writing the column name twice. Is there some function that allows me to modify the data frame directly, without the explicit reassignment? Say: ch(dataframe$this_column, as.Date())
EDIT: While similar, the proposed duplicate is not the same question. I am not looking for a way to shorten self-referential reassignments; I'm looking to avoid the explicit reassignment altogether. The answer I accepted here is an appropriate solution (and much better than the answers provided in the "duplicate" question, in regard to their relevance to my question).
Here is an example using the magrittr package:
library(magrittr)

x <- c('2015-12-12', '2015-12-13', '2015-12-14')
df <- data.frame(x)
df$x %<>% as.Date   # compound assignment pipe: shorthand for df$x <- df$x %>% as.Date
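Afterwards the column has been converted in place:

class(df$x)
## [1] "Date"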
I have a project that was originally written using data.frames. To improve calculation times, I'm trying to leverage the speed of data.table instead. My approach has been to construct wrapper functions that read in frames, convert them to tables, do the calculations, and then convert back to frames. Here's one simple example...
FastAgg <- function(x, FUN, aggFields, byFields = NULL, ...) {
  require('data.table')
  y <- setDT(x)
  y <- y[, lapply(X = .SD, FUN = FUN, ...), .SDcols = aggFields, by = byFields]
  y <- data.frame(y)
  y
}
The problem I'm having is that after running this function, x has been converted to a data.table, and lines of code that I wrote using data.frame notation then fail. How do I make sure that the data.frame I feed in is unchanged by the function?
For your case, I'd recommend (of course) using data.table throughout, not just inside a function :-).
But if that's not likely to happen, then I'd recommend the setDT + setDF setup: use setDT outside the function (and provide the data.table as input) to convert your data.frame to a data.table by reference, and then, after finishing the operations you'd like, use setDF to convert the result back to a data.frame and return that from the function. Be aware, however, that setDT(x) changes x itself to a data.table, as it operates by reference.
If that is not ideal, then use as.data.table(.) inside your function, as it operates on a copy. You can then still use setDF() to convert the resulting data.table back to a data.frame and return that data.frame from your function.
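A minimal sketch of the question's FastAgg rewritten along those lines (same arguments as the original):

FastAgg <- function(x, FUN, aggFields, byFields = NULL, ...) {
  require('data.table')
  y <- as.data.table(x)  # operates on a copy, so the caller's x is untouched
  y <- y[, lapply(.SD, FUN, ...), .SDcols = aggFields, by = byFields]
  setDF(y)               # convert the result back to a data.frame by reference
  y
}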
These functions were introduced recently (mostly due to user requests). The idea for avoiding this confusion is to export a shallow() function and keep track of objects whose columns require copying, and do it all internally (and automatically). It's all at a very early stage right now. When we've managed it, I'll update this post.
Also have a look at ?copy, ?setDT and ?setDF. The first paragraph of these functions' help pages is:
In data.table parlance, all set* functions change their input by reference. That is, no copy is made at all, other than temporary working memory, which is as large as one column. The only other data.table operator that modifies input by reference is :=. Check out the See Also section below for other set* functions data.table provides.
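A two-line demonstration of what "change their input by reference" means in practice:

library(data.table)

df <- data.frame(a = 1:3)
setDT(df)   # no assignment needed; df itself is converted
class(df)
## [1] "data.table" "data.frame"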
And the example for setDT:
set.seed(45L)
X = data.frame(A = sample(3, 10, TRUE),
               B = sample(letters[1:3], 10, TRUE),
               C = sample(10), stringsAsFactors = FALSE)

# get the frequency of each "A,B" combination
setDT(X)[, .N, by = "A,B"][]
does no assignment (although I admit it could be explained slightly better here).
In setDF:
X = data.table(x = 1:5, y = 6:10)
## convert 'X' to data.frame, without any copy.
setDF(X)
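Checking the class afterwards confirms the in-place conversion:

class(X)
## [1] "data.frame"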
I think this is pretty clear, but I'll try to provide more clarity. I'll also try to add guidance on how best to use these functions to the documentation.
I'm trying to use the ddply method to take a data frame with various info about 3000 movies and then calculate the mean gross of each genre. I'm new to R, and I've read all the questions on here relating to ddply, but I still can't seem to get it right. Here's what I have now:
> attach(movies)
> ddply(movies, Genre, mean(Gross))
Error in llply(.data = .data, .fun = .fun, ..., .progress = .progress, :
  .fun is not a function.
How am I supposed to write a function that takes the mean of the values in the "Gross" column for each set of movies, grouped by genre? I know this seems like a simple question, but the documentation is really confusing to me, and I'm not too familiar with R syntax yet.
Is there a method other than ddply that would make this easier?
Thanks!!
Here is an example using the tips dataset from the reshape2 package:
library(reshape2)  # for the tips dataset
library(plyr)      # for ddply and summarize

mean_tip_by_day = ddply(tips, .(day), summarize, mean_tip = mean(tip / total_bill))
Hope this is useful.
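Applied to the question's data (the Genre and Gross column names are taken from the question), the call would be:

library(plyr)
mean_gross_by_genre <- ddply(movies, .(Genre), summarize, mean_gross = mean(Gross))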
You probably don't need plyr for a simple operation like that. tapply() does the job easily, and you won't need to load additional packages. The syntax also seems simpler than Ramnath's:
tapply(tips$tip, tips$day, mean)
Note that plyr is a fantastic tool for many tasks; to me, it just seems like overkill here.
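For the question's data, the equivalent call (same assumed column names) would be:

tapply(movies$Gross, movies$Genre, mean)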