R: using ddply to apply functions to subsets of data

I'm trying to use the ddply method to take a dataframe with various info about 3000 movies and then calculate the mean gross of each genre. I'm new to R, and I've read all the questions on here relating to ddply, but I still can't seem to get it right. Here's what I have now:
> attach(movies)
> ddply(movies, Genre, mean(Gross))
Error in llply(.data = .data, .fun = .fun, ..., .progress = .progress, :
.fun is not a function.
How am I supposed to write a function that takes the mean of the values in the "Gross" column for each set of movies, grouped by genre? I know this seems like a simple question, but the documentation is really confusing to me, and I'm not too familiar with R syntax yet.
Is there a method other than ddply that would make this easier?
Thanks!!

Here is an example using the tips dataset (shipped with the reshape2 package):
library(plyr)      # provides ddply() and summarize()
library(reshape2)  # provides the tips dataset
mean_tip_by_day <- ddply(tips, .(day), summarize, mean_tip = mean(tip / total_bill))
Hope this is useful.

You probably don't need plyr for a simple operation like that. tapply() does the job easily and you won't need to load additional packages. The syntax also seems simpler than Ramnath's:
tapply(tips$tip, tips$day, mean)
Note that plyr is a fantastic tool for many tasks. To me, it just seems like overkill here.
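Applied to the original question, the same idea can be sketched in base R. The `movies` data frame below is a made-up stand-in (the real one has about 3000 rows); only the column names `Genre` and `Gross` come from the question. The original call failed because `mean(Gross)` is evaluated immediately and yields a number, while ddply expects a function in that position.

```r
# Hypothetical miniature of the asker's data; only the column names
# Genre and Gross are taken from the question.
movies <- data.frame(
  Genre = c("Action", "Comedy", "Action", "Drama", "Comedy"),
  Gross = c(100, 40, 300, 20, 60),
  stringsAsFactors = FALSE
)

# Pass the function itself (mean), not the result of calling it.
mean_gross <- tapply(movies$Gross, movies$Genre, mean)

# aggregate() returns a data frame instead of a named vector.
mean_gross_df <- aggregate(Gross ~ Genre, data = movies, FUN = mean)

# With plyr loaded, the equivalent call would follow the pattern of the
# first answer: ddply(movies, .(Genre), summarize, mean_gross = mean(Gross))
```

No `attach(movies)` is needed in either case, since the columns are referenced through the data frame.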

Related

Translate manipulate data Stata to R: if and recursive value in the same line

I am at an intermediate level in Stata and feel comfortable working there. I always find a way to do the hardest tasks in Stata instead of R. However, I must present this one in R, so I cannot avoid it this time (even if Stata is always simpler for me).
I want to translate this code:
gen new_variable = 0
replace new_variable = 1 if old_variable != old_variable[_n-1]
As per this website (https://www.matthieugomez.com/statar/manipulate-data.html), I should use the dplyr library, specifically the ifelse and reduce functions, which I tried with the following code:
database$new_variable <- mutate(database$new_variable = ifelse(database$old_variable != Reduce(sum, database$old_variable, accumulate = TRUE), 1, database$new_variable))
However, it is not working. I know this code may be quite messy, but I'm so used to Stata.
The question is: how can I successfully translate that code from Stata to R with the dplyr library? (A simpler approach would be great too.)
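Based on the information you gave, try this:
library(dplyr)
database <- database %>%
  mutate(new_variable = ifelse(old_variable != lag(old_variable), 1, 0))
Note that the first row has no previous value, so lag() returns NA there and new_variable is NA for that row.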
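For comparison, here is a minimal base R sketch of the same translation. The example values in `database` are invented for illustration. It assumes that the first row should get 1, which is how I understand Stata to treat the comparison with the missing `old_variable[0]`; change the leading `1L` to `0L` if that assumption does not fit.

```r
# Made-up example data; only the column name old_variable is from the question.
database <- data.frame(
  old_variable = c("a", "a", "b", "b", "b", "c", "a"),
  stringsAsFactors = FALSE
)

x <- database$old_variable
n <- length(x)

# Compare each value (from the second row on) with the one before it.
changed <- x[-1] != x[-n]

# Assumption: the first row counts as a change, so it is set to 1.
database$new_variable <- c(1L, as.integer(changed))
```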

Calling variable in user-defined function with reshape2::melt and reshape2::dcast

I would like to convert this data frame
data <- data.frame(color=c("red","red","red","green","green","green","blue","blue","blue"),object=c("box","chair","table","box","chair","table","box","chair","table"),units=c(1:9),price=c(11.5,12.5,13.5,14.5,15.5,16.5,17.5,18.5,19.5))
to this other one
output <- data.frame(color=c("red","green","blue"),units_box=c(1,4,7),price_box=c(11.5,14.5,17.5), units_chair=c(2,5,8),price_chair=c(12.5,15.5,18.5),units_table=c(3,6,9),price_table=c(13.5,16.5,19.5))
Therefore, I am using reshape2::melt and reshape2::dcast to build a user-defined function as follows
fun<-function(df,var,group){
r<-reshape2::melt(df,id.vars=var)
r<-reshape2::dcast(r,var~group)
return(r)
}
When I use the function as follows
fun(data,color,object)
I get the following error message
Error in melt_check(data, id.vars, measure.vars, variable.name,
value.name) : object 'color' not found
Do you know how I can solve it? I think the problem is that I should pass the variables to reshape2::melt in quotes, but I do not know how.
Note 1: I would like to keep the original number format of the variables (i.e. units without decimals and price with one decimal).
Note 2: My real code (this is just a simplified example) is much longer and involves dplyr functions (including enquo() and UQ()). Therefore, solutions for this case should be compatible with dplyr.
Note 3: I do not use tidyr (although I am a big fan of the whole tidyverse) because the current tidyr still uses the old language for functions, and I share the script with other people who might not be willing to use the development version of tidyr.
We can use dcast from data.table:
library(data.table)
# each color/object pair occurs exactly once, so no aggregation function is needed
dcast(setDT(data), color ~ object, value.var = c("units", "price"))
I solved the issue myself (although I do not fully understand the reasons behind it).
The main problem, as I suspected, was that passing the function's arguments to melt and dcast caused some kind of conflict, maybe due to the lack of quotes (?).
Anyway, I renamed the variables using dplyr::rename so that the names no longer depend on variables but are character strings. Here is the final code I am applying:
fun<-function(df,var,group){
enquo_var<-enquo(var)
enquo_group<-enquo(group)
r<-df%>%
reshape2::melt(., id.var=1, variable.name = "parameter")%>%
dplyr::rename(var = UQ(enquo_var))%>%
reshape2::dcast(data=., formula = var~parameter, value.var = "value")
return(r)
}
funx<-fun(data,color,object)
Although I found a solution to my particular problem, I would very much appreciate it if someone could explain the reasons behind it.
PS: I hope that the new version of tidyr is ready soon to make such tasks easier. Thanks @hadley for your fantastic work.
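On the reasons behind the error: in `fun(data, color, object)` the bare name `color` is only looked up when `melt` touches `id.vars = var`, and at that point R searches for an object called `color`, which does not exist. Likewise, the formula `var ~ group` is taken literally, so `dcast` looks for columns named `var` and `group`. One way around both issues is to pass the column names as strings. The sketch below does that with base R's `reshape()` rather than reshape2 (a swapped-in technique, chosen so the example has no package dependencies); `widen` is a hypothetical helper name.

```r
# Hypothetical helper: column names arrive as strings, so nothing tries
# to evaluate an object called `color` in the calling environment.
widen <- function(df, var, group) {
  wide <- reshape(df, idvar = var, timevar = group,
                  direction = "wide", sep = "_")
  rownames(wide) <- NULL  # drop the leftover row names 1, 4, 7
  wide
}

data <- data.frame(
  color  = c("red", "red", "red", "green", "green", "green",
             "blue", "blue", "blue"),
  object = c("box", "chair", "table", "box", "chair", "table",
             "box", "chair", "table"),
  units  = 1:9,
  price  = c(11.5, 12.5, 13.5, 14.5, 15.5, 16.5, 17.5, 18.5, 19.5)
)

out <- widen(data, "color", "object")
```

Because the values are only rearranged, never aggregated, `units` stays integer and `price` keeps its one decimal. Inside a longer dplyr pipeline, the same strings can still be turned into symbols where needed.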

Using plyr to obtain all strings in dataframe beginning with a string

I have a data frame and I'm trying to obtain all strings in there that begin with "RLF" and put them in a list. I tried using the dlply function in plyr, but I couldn't get the syntax quite right.
dlply(.data = unformatted_table,.variables = 1:ncol(unformatted_table), .fun = strsplit("RLF") ,.inform = TRUE)
I looked around a lot and couldn't apply the solutions to my problem. Also, please let me know if there's a better way than using dlply.
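A possible approach without plyr, sketched in base R: flatten every column into one character vector and keep the entries that start with "RLF". The contents of `unformatted_table` below are invented for illustration; only the table name and the "RLF" prefix come from the question.

```r
# Made-up stand-in for the asker's table.
unformatted_table <- data.frame(
  a = c("RLF123", "abc"),
  b = c("xyz", "RLF9"),
  stringsAsFactors = FALSE
)

# Flatten all columns into a single character vector (column by column).
values <- unlist(lapply(unformatted_table, as.character), use.names = FALSE)

# "^RLF" anchors the match at the start of the string; grepl() returns
# FALSE for NA entries, so missing values are dropped automatically.
rlf_strings <- as.list(values[grepl("^RLF", values)])
```

The original call failed for a reason similar to the ddply question above: `.fun = strsplit("RLF")` calls `strsplit` immediately instead of passing a function.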

How to operate non-standard-evaluation in correct manner for summarize{dplyr}

I want to pass variables to 'summarize' by way of non-standard-evaluation approach (see http://adv-r.had.co.nz/Computing-on-the-language.html#capturing-expressions).
My script is as follows:
library(dplyr)
library(pryr)
x2<-data.frame(x=runif(1000,1,10),y=rnorm(1:1000))
y2<-group_by(x2,x)
field2<-"x"
z<-substitute(summarize(y2,check=sum(x)),list(x=as.name(field2)))
eval(quote(z),parent.frame())
But the output is not a data frame as I expected, but a string:
>eval(quote(z),parent.frame())
summarize(y2, check = sum(x))
I am a little bit confused with non-standard-evaluation although I have looked through a number of examples.
Could you specify what is wrong with my approach?
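A likely explanation, shown with a small base R sketch that avoids dplyr: `quote(z)` produces the symbol `z`, and evaluating that symbol simply returns the value stored in `z`, which is the unevaluated call (a call object that prints like a string). Evaluating `z` itself runs the call. The names `field`, `expr`, and the vector `x` below are illustrative only.

```r
field <- "x"
x <- c(1, 2, 3)

# Build the call sum(x) by substituting the placeholder v with the symbol x.
expr <- substitute(sum(v), list(v = as.name(field)))

# Evaluating the quoted symbol just looks up `expr` and returns the call.
unevaluated <- eval(quote(expr))

# Evaluating the call itself actually runs sum(x).
result <- eval(expr)
```

By the same reasoning, `eval(z)` in the script above should evaluate the `summarize` call rather than print it.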

How to change the options of a function (sum()) called inside a function (by()) without giving sum() a specific argument in R

I think this is a simple syntax question, but it's messing with my brain:
data <- data.frame(y=c(1,1,0,NA,1,1),
iso3=c(rep("USA",3),rep("RUS",3)),
year=rep(1999:2001,2))
I simply want to summarize y by year:
summarized <- by(data$y,data$year,sum)
but without losing the information for 1999, as happens above. I think this could be done with sum(, na.rm = TRUE), but if I try that in the code above, sum wants an argument. How can I change the options of sum and still use it inside by as the function applied to the data? I'm very grateful for any hints or how-tos!
p.s.: While I'm grateful for any solution, it would be great if you could give me one specific to the 'wrapped functions' problem above, as it's not the first time I have run into this problem and I would like to understand it.
Try
by(data$y,data$year,sum, na.rm=TRUE)
If we are using dplyr
library(dplyr)
data %>%
group_by(year) %>%
summarise(Sum= sum(y, na.rm=TRUE))
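On the general 'wrapped functions' question: `by()`, like `tapply()` and the apply family, forwards any extra arguments through `...` to the function it calls. When a function does not offer that, an anonymous wrapper does the same job. A short sketch using the data from the question (the names `via_dots` and `via_wrapper` are illustrative):

```r
data <- data.frame(
  y    = c(1, 1, 0, NA, 1, 1),
  iso3 = c(rep("USA", 3), rep("RUS", 3)),
  year = rep(1999:2001, 2)
)

# Option 1: extra arguments after the function are passed on to sum().
via_dots <- by(data$y, data$year, sum, na.rm = TRUE)

# Option 2: wrap sum() in an anonymous function with the option fixed.
via_wrapper <- by(data$y, data$year, function(v) sum(v, na.rm = TRUE))
```

Both give the same sums per year; the wrapper form is the more general pattern, since it works even when the outer function has no `...` argument.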
