For the example dataset below, I need to find the number of unique customers per product, summarised per year. The output has to be a data.frame with the columns: year, product, number of customers.
Thanks for your help.
year <- c("2009", "2010")
product <- c("a", "b", "c")
df <- data.frame(customer = sample(letters, 50, replace = TRUE),
                 product = sample(product, 50, replace = TRUE),
                 year = sample(year, 50, replace = TRUE))
With aggregate() (in the included-with-R stats package):
agdf<-aggregate(customer~product+year,df,function(x)length(unique(x)))
agdf
# product year customer
#1 a 2009 7
#2 b 2009 8
#3 c 2009 10
#4 a 2010 7
#5 b 2010 7
#6 c 2010 6
Using plyr's summarise:
require(plyr)
ddply(df, .(product, year), summarise, customers=length(unique(customer)))
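For readers who prefer dplyr, the same count can be sketched with n_distinct(). The small data frame below is made up for illustration (it is not the question's random sample), so the counts can be checked by hand; the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data, small enough to verify by eye
df <- data.frame(
  customer = c("x", "x", "y", "z", "z", "y"),
  product  = c("a", "a", "a", "b", "b", "a"),
  year     = c("2009", "2009", "2009", "2009", "2010", "2010")
)

# One row per year/product pair with the number of distinct customers
res <- df %>%
  group_by(year, product) %>%
  summarise(customers = n_distinct(customer), .groups = "drop")

res
```

Here product "a" in 2009 has customers x, x and y, so it counts 2 unique customers; every other year/product pair counts 1.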
I would like to replace the CATEGORY column with the category corresponding to the max value in the sales column.
My data looks as follows:
df <- data.frame(CATEGORY = c("A","A","A","B","B"), SALES = c(10,20,30,40,50))
I'm looking to fill the CATEGORY variable with "B" since the max value in SALES has a CATEGORY of B
df <- data.frame(CATEGORY = c("B","B","B","B","B"), SALES = c(10,20,30,40,50))
If this can be achieved using dplyr syntax I'd be very grateful if anyone could give me a few pointers.
A possible solution:
library(tidyverse)
df <- data.frame(CATEGORY = c("A","A","A","B","B"), SALES = c(10,20,30,40,50))
df %>%
mutate(CATEGORY = CATEGORY[which.max(SALES)])
#> CATEGORY SALES
#> 1 B 10
#> 2 B 20
#> 3 B 30
#> 4 B 40
#> 5 B 50
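If loading the tidyverse is not desired, the same idea can be written in base R. This is only a sketch of the equivalent assignment, using the data frame from the question:

```r
df <- data.frame(CATEGORY = c("A", "A", "A", "B", "B"),
                 SALES = c(10, 20, 30, 40, 50))

# which.max() returns the index of the first maximum of SALES;
# the single category found there is recycled over every row
df$CATEGORY <- df$CATEGORY[which.max(df$SALES)]

df
```

Note that which.max() picks the first maximum, so if several rows tie for the highest SALES, the category of the earliest one is used.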
How would I divide one data frame by another? The two data frames have the same columns and same rows, but I need to divide every intersect with its corresponding intersect into a new data frame, e.g. below:
DF1
Name Jan Feb Mar
Aaron 2 4 3
Blake 5 6 4
DF2
Name Jan Feb Mar
Aaron 4 6 6
Blake 7 6 5
DF1/DF2 = DF3
DF3 (result)
Name Jan Feb Mar
Aaron 0.5 0.7 0.5
Blake 0.7 1.0 0.8
I'm using subset then dcast to build each data frame, but having a hard time figuring out how to divide them. Thanks for your help!
We divide the numeric columns in both 'DF1' and 'DF2' (by removing the first column) and cbind with the first column.
DF3 <- cbind(DF1[1],round(DF1[-1]/DF2[-1],1))
DF3
# Name Jan Feb Mar
# 1 Aaron 0.5 0.7 0.5
# 2 Blake 0.7 1.0 0.8
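The division above relies on both data frames having their rows in the same order. If that is not guaranteed, one possible safeguard is to align the rows by Name first. The sketch below uses the question's values, with the rows of DF2 swapped on purpose; DF2_aligned is a name introduced here for illustration.

```r
DF1 <- data.frame(Name = c("Aaron", "Blake"), Jan = c(2, 5), Feb = c(4, 6), Mar = c(3, 4))
# Same values as the question's DF2, but with the rows deliberately swapped
DF2 <- data.frame(Name = c("Blake", "Aaron"), Jan = c(7, 4), Feb = c(6, 6), Mar = c(5, 6))

# Reorder DF2 so that its rows line up with DF1 before dividing
DF2_aligned <- DF2[match(DF1$Name, DF2$Name), ]
rownames(DF2_aligned) <- NULL

# Divide the numeric columns element-wise and put the Name column back
DF3 <- cbind(DF1["Name"], round(DF1[-1] / DF2_aligned[-1], 1))
DF3
```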
Since you mention that you used subset and dcast to build each data frame, I suspect you already have these data in one data frame, in which case assigning the roles of numerator and denominator might be all you need to do in order to run the calculation using ddply. For instance, taking your example data and melting it back into a long-form data frame would give you the following with a single ddply:
# data
DF1 <- data.frame(Name = c("Aaron", "Blake"), Jan = c(2, 5), Feb = c(4, 6), Mar = c(3, 4))
DF2 <- data.frame(Name = c("Aaron", "Blake"), Jan = c(4, 7), Feb = c(6, 6), Mar = c(6, 5))
# long format with 'numerator' and 'denominator' roles assigned
# (unnecessary if you already have long format, just assign numerator/denominator)
library(reshape2)
df <- rbind(
transform(
melt(DF1, id.vars = "Name", variable.name = "Month"),
role = "numerator"),
transform(
melt(DF2, id.vars = "Name", variable.name = "Month"),
role = "denominator")
)
# ddply
library(plyr)
ddply(df, .(Name, Month), summarize,
Result = value[role == "numerator"] / value[role == "denominator"])
# Name Month Result
# 1 Aaron Jan 0.5000000
# 2 Aaron Feb 0.6666667
# 3 Aaron Mar 0.5000000
# 4 Blake Jan 0.7142857
# 5 Blake Feb 1.0000000
# 6 Blake Mar 0.8000000
I have a set of 85 possible combinations from two variables, one with five values (years) and one with 17 values (locations). I make a dataframe that has the years in the first column and the locations in the second column. For each combination of year and location I want to calculate the weighted mean value and then add it to the third column, according to the year and location values.
My code is as follows:
for (i in unique(data1$year)) {
for (j in unique(data1$location)) {
data2 <- crossing(data1$year, data1$location)
dataname <- subset(data1, year %in% i & location %in% j)
result <- weighted.mean(dataname$length, dataname$raising_factor, na.rm = T)
}
}
The result I get puts the last calculated mean in the third column for every row.
How can I get it to add according to matching year and location combination?
thanks.
A base R option would be by():
by(df[c('x', 'y')], df[c('group', 'year')],
function(x) weighted.mean(x[,1], x[,2]))
Based on @LAP's example
As @A.Suleiman suggested, we can use dplyr::group_by.
Example data:
df <- data.frame(group = rep(letters[1:5], each = 4),
year = rep(2001:2002, 10),
x = 1:20,
y = rep(c(0.3, 1, 1/0.3, 0.4), each = 5))
library(dplyr)
df %>%
group_by(group, year) %>%
summarise(test = weighted.mean(x, y))
# A tibble: 10 x 3
# Groups: group [?]
#    group  year      test
#   <fctr> <int>     <dbl>
#  1 a      2001  2.000000
#  2 a      2002  3.000000
#  3 b      2001  6.538462
#  4 b      2002  7.000000
#  5 c      2001 10.538462
#  6 c      2002 11.538462
#  7 d      2001 14.000000
#  8 d      2002 14.214286
#  9 e      2001 18.000000
# 10 e      2002 19.000000
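Applied to the column names used in the question (year, location, length, raising_factor), a base R version might look like the sketch below. The values in data1 are invented for illustration, and wmean is a hypothetical name for the new third column.

```r
# Hypothetical data with the question's column names
data1 <- data.frame(
  year = c(2001, 2001, 2001, 2002, 2002),
  location = c("x", "x", "y", "x", "x"),
  length = c(10, 20, 30, 40, 50),
  raising_factor = c(1, 3, 2, 1, 1)
)

# Split into one piece per year/location pair that actually occurs
groups <- split(data1, list(data1$year, data1$location), drop = TRUE)

# Compute one weighted mean per piece and stack the one-row results
data2 <- do.call(rbind, lapply(groups, function(g) {
  data.frame(
    year = g$year[1],
    location = g$location[1],
    wmean = weighted.mean(g$length, g$raising_factor, na.rm = TRUE)
  )
}))
rownames(data2) <- NULL

data2
```

Unlike the loop in the question, which overwrites result on every iteration, each group here produces its own row, so every mean stays attached to its year and location.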
I want to calculate the score difference after grouping by Year, State, Tier, Group. A stylised representation of my data would look like:
dat2 <- data.frame(
Year = sample(1990:1996, 10, replace = TRUE),
State = sample(c("AL", "CA", "NY"), 10, replace = TRUE),
Tier = sample(1:2),
Group = sample(c("A", "B"), 10, replace = TRUE),
Score = rnorm(10))
I tried mutate with group_by_ and .dots; however, it takes values from the next row regardless of group (i.e. the grouping does not seem to work). I am mostly interested in plotting the yearly differences (à la time series, even though some years would be NA), so this can be solved by either lagging or calculating the next year's score.
Edit: So, if the dataset looks like:
Year State Tier Group Score
1990 AL 1 A 75
1990 AL 2 A 100
1990 AL 1 B 5
1990 AL 2 B 10
1991 AL 1 A 95
1991 AL 2 A 80
1991 AL 1 B 5
1991 AL 2 B 15
The desired end result would be:
Year State Tier Group Score Diff
1991 AL 1 A 95 20
1991 AL 1 B 5 0
1991 AL 2 A 80 -20
1991 AL 2 B 15 5
If I understand correctly, you are trying to calculate the difference in Score within each combination of Year, State, Tier, Group? Presumably, your data will be sorted chronologically for the difference to make any sense. Your example is small for these combinations to be repeated but I believe the solution you are looking for would be:
library(dplyr)
dat2 %>%
arrange(Year) %>%
group_by(State, Tier, Group) %>%
mutate(ScoreDiff = Score - lag(Score))
With your current data, the ScoreDiff column has a lot of NAs, because ten rows will rarely contain the same combination of State, Tier and Group more than once. You can try it with a larger sample (I've also changed the starting year from 1990 to 1890):
n <- 100
dat2 <- data.frame(
Year = sample(1890:1996, n, replace = TRUE),
State = sample(c("AL", "CA", "NY"), n, replace = TRUE),
Tier = sample(1:2),
Group = sample(c("A", "B"), n, replace = TRUE),
Score = rnorm(n))
dat2 %>%
arrange(Year) %>%
group_by(State, Tier, Group) %>%
mutate(ScoreDiff = Score - lag(Score))
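To check the approach against the small table from the question's edit, here is a base R sketch that uses ave() in place of the dplyr pipeline. It is an alternative formulation, not the code from the answer above; dat and res are names chosen for illustration.

```r
# The eight rows from the question's edit
dat <- data.frame(
  Year  = c(1990, 1990, 1990, 1990, 1991, 1991, 1991, 1991),
  State = "AL",
  Tier  = c(1, 2, 1, 2, 1, 2, 1, 2),
  Group = c("A", "A", "B", "B", "A", "A", "B", "B"),
  Score = c(75, 100, 5, 10, 95, 80, 5, 15)
)

# Sort chronologically so that diff() compares consecutive years
dat <- dat[order(dat$Year), ]

# Within each State/Tier/Group, subtract the previous score;
# the first row of each group has no predecessor and gets NA
dat$Diff <- ave(dat$Score, dat$State, dat$Tier, dat$Group,
                FUN = function(s) c(NA, diff(s)))

# Keep only rows that have a previous year to compare against
res <- dat[!is.na(dat$Diff), ]
res <- res[order(res$Tier, res$Group), ]
res
```

The Diff column comes out as 20, 0, -20 and 5, which matches the desired result listed in the question.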
I have a data.frame with two columns: year and score. The years go from 2000-2012 and each year can be listed multiple times. In the score column I list all the scores for each year with each row having a different score.
What I'd like to do is filter the data.frame so only the rows with the maximum scores for each year remain.
So as a tiny example if I have
year score
2000 18
2001 22
2000 21
I would want to return just
year score
2001 22
2000 21
If you know SQL, this is easier to understand:
library(sqldf)
sqldf('select year, max(score) from mydata group by year')
Update (2016-01): Now you can also use dplyr
library(dplyr)
mydata %>% group_by(year) %>% summarise(max = max(score))
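Note that summarise() returns only the year and the maximum. If the goal is to keep the original rows, as the question asks, a grouped filter() is one way to sketch it. The three-row mydata below is the tiny example from the question; top is a name introduced here.

```r
library(dplyr)

mydata <- data.frame(year = c(2000, 2001, 2000), score = c(18, 22, 21))

# Keep every row whose score equals the maximum within its year
top <- mydata %>%
  group_by(year) %>%
  filter(score == max(score)) %>%
  ungroup()

top
```

If several rows tie for the maximum within a year, all of them are kept.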
using plyr
require(plyr)
set.seed(45)
df <- data.frame(year=sample(2000:2012, 25, replace=T), score=sample(25))
ddply(df, .(year), summarise, max.score=max(score))
using data.table
require(data.table)
dt <- data.table(df, key="year")
dt[, list(max.score=max(score)), by=year]
using aggregate:
o <- aggregate(df$score, list(df$year) , max)
names(o) <- c("year", "max.score")
using ave:
df1 <- df
df1$max.score <- ave(df1$score, df1$year, FUN=max)
df1 <- df1[!duplicated(df1$year), ]
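A variation on the ave() idea is to compare each score with its group maximum and keep the matching rows, which preserves any other columns. This is a sketch on a hypothetical three-row data frame with an extra hrs column; best is a name chosen for illustration.

```r
df <- data.frame(year = c(2000, 2001, 2000),
                 score = c(18, 22, 21),
                 hrs = c(10, 11, 12))

# ave() returns, for every row, the maximum score of that row's year;
# rows whose own score equals that maximum are kept, with all columns
best <- df[df$score == ave(df$score, df$year, FUN = max), ]
best
```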
Edit: In case of more columns, a data.table solution would be the best (my opinion :))
set.seed(45)
df <- data.frame(year=sample(2000:2012, 25, replace=T), score=sample(25),
alpha = sample(letters[1:5], 25, replace=T), beta=rnorm(25))
# convert to data.table with key=year
dt <- data.table(df, key="year")
# get the subset of data that matches this criterion
dt[, .SD[score %in% max(score)], by=year]
# year score alpha beta
# 1: 2000 20 b 0.8675148
# 2: 2001 21 e 1.5543102
# 3: 2002 22 c 0.6676305
# 4: 2003 18 a -0.9953758
# 5: 2004 23 d 2.1829996
# 6: 2005 25 b -0.9454914
# 7: 2007 17 e 0.7158021
# 8: 2008 12 e 0.6501763
# 9: 2011 24 a 0.7201334
# 10: 2012 19 d 1.2493954
using base packages
> df
year score
1 2000 18
2 2001 22
3 2000 21
> aggregate(score ~ year, data=df, max)
year score
1 2000 21
2 2001 22
EDIT
If you have additional columns that you need to keep, then you can use merge with aggregate to get those columns
> df <- data.frame(year = c(2000, 2001, 2000), score = c(18, 22, 21) , hrs = c( 10, 11, 12))
> df
year score hrs
1 2000 18 10
2 2001 22 11
3 2000 21 12
> merge(aggregate(score ~ year, data=df, max), df, all.x=T)
year score hrs
1 2000 21 12
2 2001 22 11
data <- data.frame(year = c(2000, 2001, 2000), score = c(18, 22, 21))
new.year <- unique(data$year)
new.score <- sapply(new.year, function(y) max(data[data$year == y, ]$score))
data <- data.frame(year = new.year, score = new.score)
One-liner:
df_2 <- data.frame(year = sort(unique(df$year)), score = tapply(df$score, df$year, max))