Calculate average over many data frames having same format? - r

I have 15 data frames with an identical structure: the header row is exactly the same in each, but the values stored in the columns differ.
Here's an example data frame, call it "A":
Product Q1 Q2
1 Product X 10 15
2 Product Y 20 40
3 Product Z 30 50
And here's another, call it "B":
Product Q1 Q2
1 Product X 12 5
2 Product Y 25 44
3 Product Z 32 51
I would like to calculate the average value across all 15 data frames. Using my two examples, the output would be a similar data frame but with averages. Something like this:
Product Q1 Q2
1 Product X 11.0 10.0
2 Product Y 22.5 42.0
3 Product Z 31.0 50.5
I've searched around for a solution, but to no avail. It seems like the mapply function might be what I need, but I'm not sure how best to put it to use here.

aggregate(.~Product, rbind(A, B), mean)
# Product Q1 Q2
#1 Product X 11.0 10.0
#2 Product Y 22.5 42.0
#3 Product Z 31.0 50.5
DATA
A = structure(list(Product = c("Product X", "Product Y", "Product Z"
), Q1 = c(10L, 20L, 30L), Q2 = c(15L, 40L, 50L)), .Names = c("Product",
"Q1", "Q2"), class = "data.frame", row.names = c("1", "2", "3"
))
B = structure(list(Product = c("Product X", "Product Y", "Product Z"
), Q1 = c(12L, 25L, 32L), Q2 = c(5L, 44L, 51L)), .Names = c("Product",
"Q1", "Q2"), class = "data.frame", row.names = c("1", "2", "3"
))
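If every data frame lists the same products in the same row order, the numeric columns can also be averaged element by element. The following is only a sketch under that assumption; the list name dfs and the vector num_cols are hypothetical names introduced here, and A and B are re-created so the block runs on its own.

```r
# Example inputs from the question
A <- data.frame(Product = c("Product X", "Product Y", "Product Z"),
                Q1 = c(10, 20, 30), Q2 = c(15, 40, 50))
B <- data.frame(Product = c("Product X", "Product Y", "Product Z"),
                Q1 = c(12, 25, 32), Q2 = c(5, 44, 51))

# Hypothetical list of all data frames; the remaining ones would be added here
dfs <- list(A, B)
num_cols <- c("Q1", "Q2")

# Add the numeric columns of all data frames cell by cell, then divide by the count
total <- Reduce(`+`, lapply(dfs, function(d) d[num_cols]))
avg <- A
avg[num_cols] <- total / length(dfs)
avg
```

This relies on the rows lining up across data frames; if the row order can differ, grouping by Product as in the aggregate call above is the safer route.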

Since the headers match, let's stack all of your data frames into one data frame. The "..." below stands for the data frames between B and O, which are not shown here.
df <- rbind(A,B,... O)
Then we'll use dplyr to summarize by product:
library(dplyr)
df %>% group_by(Product) %>%
summarize(Q1_Avg = mean(Q1), Q2_Avg = mean(Q2))
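With 15 data frames, typing every name into rbind gets tedious. A minimal base R sketch, assuming the data frames have been collected in a list (the name dfs is hypothetical, and only the two example data frames are included here):

```r
# Example inputs from the question
A <- data.frame(Product = c("Product X", "Product Y", "Product Z"),
                Q1 = c(10, 20, 30), Q2 = c(15, 40, 50))
B <- data.frame(Product = c("Product X", "Product Y", "Product Z"),
                Q1 = c(12, 25, 32), Q2 = c(5, 44, 51))

# Hypothetical list holding all data frames
dfs <- list(A, B)

# Stack the whole list at once, then average every other column per Product
stacked <- do.call(rbind, dfs)
avg_by_product <- aggregate(. ~ Product, data = stacked, FUN = mean)
avg_by_product
```

Because the averaging is grouped by Product, this version does not depend on the row order within each data frame.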

Related

Finding sum of data frame column in rows that contain certain value in R

I'm working on a March Madness project. I have a data frame df.A with every team and season.
For example:
Season Team Name Code
2003 Creighton 2003-1166
2003 Notre Dame 2003-1323
2003 Arizona 2003-1112
And another data frame df.B with the results of every game in every season:
WTeamScore LTeamScore WTeamCode LTeamCode
15 10 2003-1166 2003-1323
20 15 2003-1323 2003-1112
10 5 2003-1112 2003-1166
I'm trying to get a column in df.A that totals the number of points in both wins and losses. Basically:
Season Team Name Code Points
2003 Creighton 2003-1166 20
2003 Notre Dame 2003-1323 30
2003 Arizona 2003-1112 25
There are obviously thousands more rows in each data frame, but this is the general idea. What would be the best way of going about this?
Here is another option using tidyverse, where we can pivot df.B to long form, then get the sum for each team, then join back to df.A.
library(tidyverse)
df.B %>%
pivot_longer(everything(),names_pattern = "(WTeam|LTeam)(.*)",
names_to = c("rep", ".value")) %>%
group_by(Code) %>%
summarise(Points = sum(Score)) %>%
left_join(df.A, ., by = "Code")
Output
Season Team.Name Code Points
1 2003 Creighton 2003-1166 20
2 2003 Notre Dame 2003-1323 30
3 2003 Arizona 2003-1112 25
Data
df.A <- structure(list(Season = c(2003L, 2003L, 2003L), Team.Name = c("Creighton",
"Notre Dame", "Arizona"), Code = c("2003-1166", "2003-1323",
"2003-1112")), class = "data.frame", row.names = c(NA, -3L))
df.B <- structure(list(WTeamScore = c(15L, 20L, 10L), LTeamScore = c(10L,
15L, 5L), WTeamCode = c("2003-1166", "2003-1323", "2003-1112"
), LTeamCode = c("2003-1323", "2003-1112", "2003-1166")), class = "data.frame", row.names = c(NA,
-3L))
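We can use match (from base R) to look up each 'Code' of df.A in 'WTeamCode' and in 'LTeamCode' of df.B. The resulting indices extract the corresponding score columns, which are then added together (+). Note that match returns only the first matching position, so this works when each code appears at most once per column, as in the example.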
df.A$Points <- with(df.A, df.B$WTeamScore[match(Code,
df.B$WTeamCode)] +
df.B$LTeamScore[match(Code, df.B$LTeamCode)])
-output
> df.A
Season TeamName Code Points
1 2003 Creighton 2003-1166 20
2 2003 Notre Dame 2003-1323 30
3 2003 Arizona 2003-1112 25
If there are nonmatches resulting in missing values (NA) from match, cbind the vectors to create a matrix and use rowSums with na.rm = TRUE
df.A$Points <- with(df.A, rowSums(cbind(df.B$WTeamScore[match(Code,
df.B$WTeamCode)],
df.B$LTeamScore[match(Code, df.B$LTeamCode)]), na.rm = TRUE))
data
df.A <- structure(list(Season = c(2003L, 2003L, 2003L), TeamName = c("Creighton",
"Notre Dame", "Arizona"), Code = c("2003-1166", "2003-1323",
"2003-1112")), class = "data.frame", row.names = c(NA, -3L))
df.B <- structure(list(WTeamScore = c(15L, 20L, 10L), LTeamScore = c(10L,
15L, 5L), WTeamCode = c("2003-1166", "2003-1323", "2003-1112"
), LTeamCode = c("2003-1323", "2003-1112", "2003-1166")),
class = "data.frame", row.names = c(NA,
-3L))
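When a team plays many games per season, every one of its scores has to be counted. A base R sketch of that idea, assuming the column names from the example data: stack the winner and loser columns into one long table, total the scores per code, and look the totals up for df.A. The names long and totals are hypothetical.

```r
df.A <- data.frame(Season = c(2003L, 2003L, 2003L),
                   TeamName = c("Creighton", "Notre Dame", "Arizona"),
                   Code = c("2003-1166", "2003-1323", "2003-1112"))
df.B <- data.frame(WTeamScore = c(15L, 20L, 10L),
                   LTeamScore = c(10L, 15L, 5L),
                   WTeamCode = c("2003-1166", "2003-1323", "2003-1112"),
                   LTeamCode = c("2003-1323", "2003-1112", "2003-1166"))

# One row per team appearance, whether the team won or lost
long <- data.frame(Code = c(df.B$WTeamCode, df.B$LTeamCode),
                   Score = c(df.B$WTeamScore, df.B$LTeamScore))

# Total points per team code across all of its games
totals <- aggregate(Score ~ Code, data = long, FUN = sum)

# Attach the totals to df.A; teams without any game would get NA
df.A$Points <- totals$Score[match(df.A$Code, totals$Code)]
df.A
```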

Transform to wide format from long in R

I have a data frame in R which looks like below
Model Month Demand Inventory
A Jan 10 20
B Feb 30 40
A Feb 40 60
I want the data frame to look like this:
Jan Feb
A_Demand 10 40
A_Inventory 20 60
A_coverage
B_Demand 30
B_Inventory 40
B_coverage
A_coverage and B_coverage will be calculated in Excel using a formula. The problem I need help with is pivoting the data frame from its original long format to the wide format shown above.
I tried to implement the solution from the linked duplicate but I am still having difficulty:
HD_dcast <- reshape(data,idvar = c("Model","Inventory","Demand"),
timevar = "Month", direction = "wide")
Here is a dput of my data:
data <- structure(list(Model = c("A", "B", "A"), Month = c("Jan", "Feb",
"Feb"), Demand = c(10L, 30L, 40L), Inventory = c(20L, 40L, 60L
)), class = "data.frame", row.names = c(NA, -3L))
Thanks
Here's an approach with dplyr and tidyr, two popular R packages for data manipulation:
library(dplyr)
library(tidyr)
data %>%
mutate(coverage = NA_real_) %>%
pivot_longer(-c(Model,Month), names_to = "Variable") %>%
pivot_wider(id_cols = c(Model, Variable), names_from = Month ) %>%
unite(Variable, c(Model,Variable), sep = "_")
## A tibble: 6 x 3
# Variable Jan Feb
# <chr> <dbl> <dbl>
#1 A_Demand 10 40
#2 A_Inventory 20 60
#3 A_coverage NA NA
#4 B_Demand NA 30
#5 B_Inventory NA 40
#6 B_coverage NA NA
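For comparison, the same reshaping can be sketched in base R with reshape. The key step, as in the tidyr version, is to first stack Demand and Inventory into a single value column so that the row label (Model plus variable name) becomes the only id variable. The names long and wide are hypothetical, and the empty coverage rows are left out of this sketch.

```r
data <- data.frame(Model = c("A", "B", "A"),
                   Month = c("Jan", "Feb", "Feb"),
                   Demand = c(10L, 30L, 40L),
                   Inventory = c(20L, 40L, 60L))

# Stack the two measures into one long table with a combined row label
long <- rbind(
  data.frame(Variable = paste(data$Model, "Demand", sep = "_"),
             Month = data$Month, value = data$Demand),
  data.frame(Variable = paste(data$Model, "Inventory", sep = "_"),
             Month = data$Month, value = data$Inventory)
)

# One row per label, one column per month (named value.Jan, value.Feb)
wide <- reshape(long, idvar = "Variable", timevar = "Month", direction = "wide")
wide
```

Month and label combinations that do not occur in the data, such as B in Jan, come out as NA.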

Assign max value of group to all rows in that group

I would like to assign the max value of a group to all rows within that group. How do I do that?
I have a dataframe containing the names of the group and the max number of credits that belongs to it.
course_credits <- aggregate(bsc_academic$Credits, by = list(bsc_academic$Course_code), max)
which gives
Course Credits
1 ABC1000 6.5
2 ABC1003 6.5
3 ABC1004 6.5
4 ABC1007 5.0
5 ABC1010 6.5
6 ABC1021 6.5
7 ABC1023 6.5
The main dataframe looks like this:
Appraisal.Type Resits Credits Course_code Student_ID
Final result 0 6.5 ABC1000 10
Final result 0 6.5 ABC1003 10
Grade supervisor 0 0 ABC1000 10
Grade supervisor 0 0 ABC1003 10
Final result 0 12 ABC1294 23
Grade supervisor 0 0 ABC1294 23
As you see, student 10 took course ABC1000, worth 6.5 credits. For each course (per student), however, two rows exist: Final result and Grade supervisor. In the end, Final result should be deleted, but the credits should be kept. Therefore, I want to assign the max value of 6.5 to the Grade supervisor row.
Likewise, student 23 has followed course ABC1294, worth 12 credits.
In the end, this should be the result:
Appraisal.Type Resits Credits Course_code Student_ID
Grade supervisor 0 6.5 ABC1000 10
Grade supervisor 0 6.5 ABC1003 10
Grade supervisor 0 12 ABC1294 23
How do I go about this?
An option would be to group by 'Student_ID', mutate the 'Credits' with max of 'Credits' and filter the rows with 'Appraisal.Type' as "Grade supervisor"
library(dplyr)
df1 %>%
group_by(Student_ID) %>%
dplyr::mutate(Credits = max(Credits)) %>%
ungroup %>%
filter(Appraisal.Type == "Grade supervisor")
# A tibble: 2 x 5
# Appraisal.Type Resits Credits Course_code Student_ID
# <chr> <int> <dbl> <chr> <int>
#1 Grade supervisor 0 6.5 ABC1000 10
#2 Grade supervisor 0 6.5 ABC1003 10
If we also need 'Course_code' to be included in the grouping
df2 %>%
group_by(Student_ID, Course_code) %>%
dplyr::mutate(Credits = max(Credits)) %>%
filter(Appraisal.Type == "Grade supervisor")
# A tibble: 3 x 5
# Groups: Student_ID, Course_code [3]
# Appraisal.Type Resits Credits Course_code Student_ID
# <chr> <int> <dbl> <chr> <int>
#1 Grade supervisor 0 6.5 ABC1000 10
#2 Grade supervisor 0 6.5 ABC1003 10
#3 Grade supervisor 0 12 ABC1294 23
NOTE: If the plyr package is also loaded, some functions can be masked, especially summarise and mutate, which also exist in plyr. To prevent this, either run the code in a fresh session without loading plyr or explicitly call dplyr::mutate.
data
df1 <- structure(list(Appraisal.Type = c("Final result", "Final result",
"Grade supervisor", "Grade supervisor"), Resits = c(0L, 0L, 0L,
0L), Credits = c(6.5, 6.5, 0, 0), Course_code = c("ABC1000",
"ABC1003", "ABC1000", "ABC1003"), Student_ID = c(10L, 10L, 10L,
10L)), class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(Appraisal.Type = c("Final result", "Final result",
"Grade supervisor", "Grade supervisor", "Final result", "Grade supervisor"
), Resits = c(0L, 0L, 0L, 0L, 0L, 0L), Credits = c(6.5, 6.5,
0, 0, 12, 0), Course_code = c("ABC1000", "ABC1003", "ABC1000",
"ABC1003", "ABC1294", "ABC1294"), Student_ID = c(10L, 10L, 10L,
10L, 23L, 23L)), class = "data.frame", row.names = c(NA, -6L))
Generate a sample dataset.
data <- as.data.frame(list(Appraisal.Type = c(rep("Final result", 2), rep("Grade supervisor", 2)),
Resits = rep(0, 4),
Credits = c(rep(6.5, 2), rep(0, 2)),
Course_code = rep(c("ABC1000", "ABC1003"), 2),
Student_ID = rep(10, 4)))
Assign the max value of each group to all rows in that group, and then delete the rows that contain "Final result".
## Reassign the values of the "Credits" column, one course at a time
for (code in unique(data$Course_code)) {
  in_course <- data$Course_code == code
  data$Credits[in_course] <- max(data$Credits[in_course])
}
## New dataset without the "Final result" rows
data <- data[data$Appraisal.Type != "Final result", ]
Here is the result.
data
Appraisal.Type Resits Credits Course_code Student_ID
3 Grade supervisor 0 6.5 ABC1000 10
4 Grade supervisor 0 6.5 ABC1003 10
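Here's a data.table solution, where DT is assumed to be the main data frame converted to a data.table:
library(data.table)
DT <- as.data.table(df1)  # assumption: df1 is the main data frame from the answer above
DT[, Credits := max(Credits), by = Student_ID]
Result <- DT[Appraisal.Type == "Grade supervisor"]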
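The same idea can be sketched in base R without a loop by using ave, which returns a group statistic repeated for every row of its group. This is only an illustration on the six-row example; the column key is a hypothetical helper that combines student and course into one grouping label.

```r
df2 <- data.frame(
  Appraisal.Type = c("Final result", "Final result", "Grade supervisor",
                     "Grade supervisor", "Final result", "Grade supervisor"),
  Resits = c(0L, 0L, 0L, 0L, 0L, 0L),
  Credits = c(6.5, 6.5, 0, 0, 12, 0),
  Course_code = c("ABC1000", "ABC1003", "ABC1000", "ABC1003", "ABC1294", "ABC1294"),
  Student_ID = c(10L, 10L, 10L, 10L, 23L, 23L)
)

# Hypothetical helper: one label per student/course combination
key <- paste(df2$Student_ID, df2$Course_code)

# Replace Credits by the maximum within each student/course group
df2$Credits <- ave(df2$Credits, key, FUN = max)

# Keep only the "Grade supervisor" rows
result <- df2[df2$Appraisal.Type == "Grade supervisor", ]
result
```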

Using dplyr to summarize by multiple groups

I'm trying to use dplyr to summarize a dataset based on 2 groups: "year" and "area". This is what the dataset looks like:
Year Area Num
1 2000 Area 1 99
2 2001 Area 3 85
3 2000 Area 1 60
4 2003 Area 2 90
5 2002 Area 1 40
6 2002 Area 3 30
7 2004 Area 4 10
...
The end result should look something like this:
Year Area Mean
1 2000 Area 1 100
2 2000 Area 2 80
3 2000 Area 3 89
4 2001 Area 1 80
5 2001 Area 2 85
6 2001 Area 3 59
7 2002 Area 1 90
8 2002 Area 2 88
...
Excuse the values for "mean", they're made up.
The code for the example dataset:
df <- structure(list(
Year = c(2000, 2001, 2000, 2003, 2002, 2002, 2004),
Area = structure(c(1L, 3L, 1L, 2L, 1L, 3L, 4L),
.Label = c("Area 1", "Area 2", "Area 3", "Area 4"),
class = "factor"),
Num = structure(c(7L, 5L, 4L, 6L, 3L, 2L, 1L),
.Label = c("10", "30", "40", "60", "85", "90", "99"),
class = "factor")),
.Names = c("Year", "Area", "Num"),
class = "data.frame", row.names = c(NA, -7L))
df$Num <- as.numeric(df$Num)
Things I've tried:
df.meanYear <- df %>%
group_by(Year) %>%
group_by(Area) %>%
summarize_each(funs(mean(Num)))
But it just replaces every value with the mean, instead of the intended result.
If possible, please also provide alternative (i.e. non-dplyr) methods, because I'm still new to R.
Is this what you are looking for?
library(dplyr)
df <- group_by(df, Year, Area)
df <- summarise(df, avg = mean(Num))
We can use data.table
library(data.table)
setDT(df)[, .(avg = mean(Num)) , by = .(Year, Area)]
I had a similar problem in my code; I fixed it with the .groups argument:
df %>%
group_by(Year,Area) %>%
summarise(avg = mean(Num), .groups="keep")
Also verified with the added example. Calling as.numeric directly on the factor corrupted the Num values, because it returns the internal level codes, so I used as.numeric(as.character(df$Num)) to fix it:
Year Area avg
<dbl> <fct> <dbl>
1 2000 Area 1 79.5
2 2001 Area 3 85
3 2002 Area 1 40
4 2002 Area 3 30
5 2003 Area 2 90
6 2004 Area 4 10
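Since the question also asks for a non-dplyr method, here is a minimal base R sketch using aggregate with a formula. It rebuilds the example data with Num as a factor, as in the question, and converts it through as.character so the original numbers are kept; the name res is hypothetical.

```r
df <- data.frame(
  Year = c(2000, 2001, 2000, 2003, 2002, 2002, 2004),
  Area = factor(c("Area 1", "Area 3", "Area 1", "Area 2", "Area 1", "Area 3", "Area 4")),
  Num  = factor(c("99", "85", "60", "90", "40", "30", "10"))
)

# Convert the factor via character so the values, not the level codes, are used
df$Num <- as.numeric(as.character(df$Num))

# Mean of Num for every Year/Area combination that occurs in the data
res <- aggregate(Num ~ Year + Area, data = df, FUN = mean)
res
```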

R - Adding numbers within a data frame cell together

I have a data frame in which the values are stored as characters. However, many values contain two numbers that need to be added together. Example:
2014 Q1 Sales 2014 Q2 Sales 2014 Q3 Sales 2014 Q4 Sales
Product 1 3+6 2+10 8 13+2
Product 2 6 4+0 <NA> 5
Product 3 <NA> 5+9 3+1 11
Is there a way to go through the whole data frame and replace all cells containing characters like "3+6" with new values equal to their sum? I assume this would involve coercing the characters to numeric or integers, but I don't know how that would be possible for values with the + sign in them. I would like the example data frame to end up looking like this:
2014 Q1 Sales 2014 Q2 Sales 2014 Q3 Sales 2014 Q4 Sales
Product 1 9 12 8 15
Product 2 6 4 <NA> 5
Product 3 <NA> 14 4 11
Here's an easier example:
dat <- data.frame(a=c("3+6", "10"), b=c("12", NA), c=c("3+4", "5+6"))
dat
## a b c
## 1 3+6 12 3+4
## 2 10 <NA> 5+6
apply(dat, 1:2, function(x) eval(parse(text=x)))
## a b c
## [1,] 9 12 7
## [2,] 10 NA 11
Letting R itself do the computation with eval and parse does the trick, as long as the cell contents are trusted, since eval runs whatever code the strings contain.
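If evaluating cell contents as code is a concern, the sums can also be computed by splitting each string on the + sign. This is a sketch on the small example above, assuming cells contain only non-negative numbers separated by +; the helper name sum_cell is hypothetical.

```r
dat <- data.frame(a = c("3+6", "10"), b = c("12", NA), c = c("3+4", "5+6"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: split one cell on "+" and add up the pieces
sum_cell <- function(x) {
  if (is.na(x)) return(NA_real_)
  sum(as.numeric(strsplit(x, "+", fixed = TRUE)[[1]]))
}

# Apply the helper to every cell, column by column, keeping the data frame shape
dat[] <- lapply(dat, function(col) {
  vapply(as.character(col), sum_cell, numeric(1), USE.NAMES = FALSE)
})
dat
```

Cells without a + are simply converted to numeric, and missing values stay NA.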
Here is one option with gsubfn that avoids eval(parse()). We convert the 'data.frame' to a 'matrix' (as.matrix(dat)), match a run of digits ([0-9]+) captured as a group with parentheses ((..)), followed by a +, followed by a second captured run of digits, and replace the match with the sum of the two groups after converting each to numeric. Assigning the output back to dat[] keeps the same structure as in 'dat'.
library(gsubfn)
dat[] <- as.numeric(gsubfn('([0-9]+)\\+([0-9]+)',
~as.numeric(x)+as.numeric(y), as.matrix(dat)))
dat
# 2014 Q1 Sales 2014 Q2 Sales 2014 Q3 Sales 2014 Q4 Sales
#Product 1 9 12 8 15
#Product 2 6 4 NA 5
#Product 3 NA 14 4 11
Or we can loop the columns with lapply and perform the replacement with gsubfn for each of the columns.
dat[] <- lapply(dat, function(x) as.numeric(gsubfn('([0-9]+)\\+([0-9]+)',
~as.numeric(x)+as.numeric(y), as.character(x))))
data
dat <- structure(list(`2014 Q1 Sales` = structure(c(1L, 2L, NA), .Label = c("3+6",
"6"), class = "factor"), `2014 Q2 Sales` = structure(1:3, .Label = c("2+10",
"4+0", "5+9"), class = "factor"), `2014 Q3 Sales` = structure(c(2L,
NA, 1L), .Label = c("3+1", "8"), class = "factor"), `2014 Q4 Sales` = structure(c(2L,
3L, 1L), .Label = c("11", "13+2", "5"), class = "factor")), .Names = c("2014 Q1 Sales",
"2014 Q2 Sales", "2014 Q3 Sales", "2014 Q4 Sales"), class = "data.frame", row.names = c("Product 1",
"Product 2", "Product 3"))
