Using dplyr to summarize by multiple groups - r

I'm trying to use dplyr to summarize a dataset based on 2 groups: "year" and "area". This is what the dataset looks like:
  Year   Area Num
1 2000 Area 1  99
2 2001 Area 3  85
3 2000 Area 1  60
4 2003 Area 2  90
5 2002 Area 1  40
6 2002 Area 3  30
7 2004 Area 4  10
...
The end result should look something like this:
  Year   Area Mean
1 2000 Area 1  100
2 2000 Area 2   80
3 2000 Area 3   89
4 2001 Area 1   80
5 2001 Area 2   85
6 2001 Area 3   59
7 2002 Area 1   90
8 2002 Area 2   88
...
Excuse the values for "mean", they're made up.
The code for the example dataset:
df <- structure(list(
Year = c(2000, 2001, 2000, 2003, 2002, 2002, 2004),
Area = structure(c(1L, 3L, 1L, 2L, 1L, 3L, 4L),
.Label = c("Area 1", "Area 2", "Area 3", "Area 4"),
class = "factor"),
Num = structure(c(7L, 5L, 4L, 6L, 3L, 2L, 1L),
.Label = c("10", "30", "40", "60", "85", "90", "99"),
class = "factor")),
.Names = c("Year", "Area", "Num"),
class = "data.frame", row.names = c(NA, -7L))
df$Num <- as.numeric(df$Num)
Things I've tried:
df.meanYear <- df %>%
group_by(Year) %>%
group_by(Area) %>%
summarize_each(funs(mean(Num)))
But it just replaces every value with the mean, instead of the intended result.
If possible, please also provide alternative (i.e. non-dplyr) methods, because I'm still new to R.

Is this what you are looking for?
library(dplyr)
df <- group_by(df, Year, Area)
df <- summarise(df, avg = mean(Num))
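Since the question also asks for a non-dplyr route, here is a minimal base R sketch of the same grouped mean using aggregate(). It assumes Num is already a plain numeric column (the data frame below is rebuilt by hand for that reason), and the column name avg is chosen only to mirror the dplyr answer.

```r
# Rebuild the example data with Num as a plain numeric column (assumption:
# the factor-to-number conversion has already been done correctly)
df <- data.frame(
  Year = c(2000, 2001, 2000, 2003, 2002, 2002, 2004),
  Area = c("Area 1", "Area 3", "Area 1", "Area 2", "Area 1", "Area 3", "Area 4"),
  Num  = c(99, 85, 60, 90, 40, 30, 10)
)

# Mean of Num for every Year/Area combination that occurs in the data
res <- aggregate(Num ~ Year + Area, data = df, FUN = mean)

# Rename the summary column and sort for readability
names(res)[names(res) == "Num"] <- "avg"
res <- res[order(res$Year, res$Area), ]
print(res)
```

Only combinations that actually appear in the data get a row, which matches what group_by() plus summarise() returns.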

We can use data.table
library(data.table)
setDT(df)[, .(avg = mean(Num)) , by = .(Year, Area)]

I had a similar problem in my code; I fixed it with the .groups argument of summarise():
df %>%
group_by(Year,Area) %>%
summarise(avg = mean(Num), .groups="keep")
Also verified with the added example (as.numeric() corrupted the Num values because Num is a factor, so I used as.numeric(as.character(df$Num)) to fix it):
Year Area avg
<dbl> <fct> <dbl>
1 2000 Area 1 79.5
2 2001 Area 3 85
3 2002 Area 1 40
4 2002 Area 3 30
5 2003 Area 2 90
6 2004 Area 4 10
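The remark about as.numeric() can be checked in isolation. The sketch below builds a small factor with the same labels as Num (the variable names are illustrative only) and compares the two conversions.

```r
# A factor whose labels look like numbers, as in the question's Num column
num_factor <- factor(c("99", "85", "60", "90", "40", "30", "10"))

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(num_factor)

# Going through as.character() first recovers the numbers that were intended
values <- as.numeric(as.character(num_factor))

print(codes)
print(values)
```

The codes are positions in the sorted level set, which is why the means computed from them looked wrong.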

Related

How to conditionally merge two rows with multiple values together and mutate in R?

Fish have been caught using different fishing methods.
I would like to merge rows by Species (that is, when they are the same fish species). If a species was caught by both the Bottom fishing and Trolling methods, its two rows should collapse into one row, with the Method value changed to Both.
For example Caranx ignobilis will have a new Method value of Both. Bait Released and Kept columns should also have values on the same row.
Species Method Bait Released Kept
4 Caranx ignobilis Both NA 1 1
It seems so simple yet I have been scratching my head for hours and toying around with case_when as part of the tidyverse package.
The tibble is a result of previously sub-setting data using group_by and pivot_wider.
This is what the sample looks like:
# A tibble: 10 x 5
# Groups: Species [9]
Species Method Bait Released Kept
<chr> <fct> <int> <int> <int>
1 Aethaloperca rogaa Bottom fishing NA NA 2
2 Aprion virescens Bottom fishing NA NA 1
3 Balistidae spp. Bottom fishing NA NA 1
4 Caranx ignobilis Trolling NA NA 1
5 Caranx ignobilis Bottom fishing NA 1 NA
6 Epinephelus fasciatus Bottom fishing NA 3 NA
7 Epinephelus multinotatus Bottom fishing NA NA 5
8 Other species Bottom fishing NA 1 NA
9 Thunnus albacares Trolling NA NA 1
10 Variola louti Bottom fishing NA NA 1
Data:
fish_catch <- structure(list(Species = c("Aethaloperca rogaa", "Aprion virescens","Balistidae spp.", "Caranx ignobilis", "Caranx ignobilis", "Epinephelus fasciatus","Epinephelus multinotatus", "Other species", "Thunnus albacares","Variola louti"),
Method = structure(c(1L, 1L, 1L, 2L, 1L, 1L,1L, 1L, 2L, 1L), .Label = c("Bottom fishing", "Trolling"), class = "factor"),Bait = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,NA_integer_),
Released = c(NA, NA, NA, NA, 1L, 3L, NA, 1L,NA, NA),
Kept = c(2L, 1L, 1L, 1L, NA, NA, 5L, NA, 1L, 1L)), class = c("grouped_df","tbl_df", "tbl", "data.frame"), row.names = c(NA, -10L), groups = structure(list(Species = c("Aethaloperca rogaa", "Aprion virescens",
"Balistidae spp.","Caranx ignobilis", "Epinephelus fasciatus", "Epinephelus multinotatus","Other species", "Thunnus albacares", "Variola louti"), .rows = list(1L, 2L, 3L, 4:5, 6L, 7L, 8L, 9L, 10L)), row.names = c(NA,-9L), class = c("tbl_df", "tbl", "data.frame"), .drop = FALSE))
This is the route I was going down, but then I realised it does not take Species or the other columns into account:
mutate(Method = case_when(Method == "Bottom fishing" & Method == "Trolling" ~ "Both",
Method == "Bottom fishing" ~ "Bottom fishing",
Method == "Trolling" ~ "Trolling", TRUE ~ as.character(MethodCaught)))
Here is one approach using tidyverse. You can group_by(Species) and set Method to "Both" if both Bottom fishing and Trolling are included in Method within that Species. Then afterwards, you can group_by both Species and Method, and use fill to replace NA with known values. In the end, use slice to keep one row for each Species/Method. This assumes you would have otherwise 1 row for each Species/Method - please let me know if this is not the case.
library(tidyverse)
fish_catch %>%
group_by(Species) %>%
mutate(Method = ifelse(all(c("Bottom fishing", "Trolling") %in% Method), "Both", as.character(Method))) %>%
group_by(Species, Method) %>%
fill(c(Bait, Released, Kept), .direction = "updown") %>%
slice(1)
Output
# A tibble: 9 x 5
# Groups: Species, Method [9]
Species Method Bait Released Kept
<chr> <chr> <int> <int> <int>
1 Aethaloperca rogaa Bottom fishing NA NA 2
2 Aprion virescens Bottom fishing NA NA 1
3 Balistidae spp. Bottom fishing NA NA 1
4 Caranx ignobilis Both NA 1 1
5 Epinephelus fasciatus Bottom fishing NA 3 NA
6 Epinephelus multinotatus Bottom fishing NA NA 5
7 Other species Bottom fishing NA 1 NA
8 Thunnus albacares Trolling NA NA 1
9 Variola louti Bottom fishing NA NA 1
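For comparison, the same collapse can be sketched in base R without tidyverse. This is only an illustration on a four-row subset of the data; the helper names first_known and collapse_species are made up for this sketch, and it assumes each column has at most one known value per species.

```r
# Small subset of the catch data (plain data.frame, Method as character)
fish <- data.frame(
  Species  = c("Caranx ignobilis", "Caranx ignobilis",
               "Thunnus albacares", "Variola louti"),
  Method   = c("Trolling", "Bottom fishing", "Trolling", "Bottom fishing"),
  Bait     = NA_integer_,
  Released = c(NA, 1L, NA, NA),
  Kept     = c(1L, NA, 1L, 1L)
)

# Hypothetical helper: first non-missing value, or NA if there is none
first_known <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) > 0) x[1] else NA_integer_
}

# Hypothetical helper: collapse all rows of one species into a single row
collapse_species <- function(d) {
  methods <- unique(as.character(d$Method))
  both <- all(c("Bottom fishing", "Trolling") %in% methods)
  data.frame(
    Species  = d$Species[1],
    Method   = if (both) "Both" else methods[1],
    Bait     = first_known(d$Bait),
    Released = first_known(d$Released),
    Kept     = first_known(d$Kept)
  )
}

# Apply per species and stack the one-row results
res <- do.call(rbind, lapply(split(fish, fish$Species), collapse_species))
rownames(res) <- NULL
print(res)
```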
This should get you started. You can add the other columns to the summarize function.
library(tidyverse)
fish_catch %>% select(-Bait, -Released, -Kept) %>%
group_by(Species) %>%
summarize(Method = paste0(Method, collapse = "")) %>%
mutate(Method = fct_recode(Method, "both" = "TrollingBottom fishing"))
# A tibble: 9 x 2
Species Method
<chr> <fct>
1 Aethaloperca rogaa Bottom fishing
2 Aprion virescens Bottom fishing
3 Balistidae spp. Bottom fishing
4 Caranx ignobilis both
5 Epinephelus fasciatus Bottom fishing
6 Epinephelus multinotatus Bottom fishing
7 Other species Bottom fishing
8 Thunnus albacares Trolling
9 Variola louti Bottom fishing

How to remove all observations for which there is no observation in the current year in R?

num Name year X Y
1 1 A 2015 68 80%
2 1 A 2016 69 85%
3 1 A 2017 70 95%
4 1 A 2018 71 85%
5 1 A 2019 72 90%
6 2 B 2018 20 80%
7 2 B 2019 23 75%
8 2 C 2014 3 55%
9 4 D 2012 4 75%
10 4 D 2013 5 100%
Let's say I have data like the above. I want to remove the observations for names that do not have any observation in the most recent year. So, in the above, we would be left with A & B, but C & D would be deleted. The most recent season will always be in the data and can be referenced with the max() function (i.e., we don't need to hardcode it as 2019 and update it yearly).
The plan is to create a facet wrapped line chart where the percentages are on the y-axis and the years are on the x-axis. The facet would be on the names so each individual will have its own line chart with their percentages by year. We don't care about people who left, so that's why we're dropping records. Though, there is a chance they come back, so I don't want to drop them from the underlying data.
One dplyr option could be:
df %>%
group_by(Name) %>%
filter(any(year %in% max(df$year)))
num Name year X Y
<int> <chr> <int> <int> <chr>
1 1 A 2015 68 80%
2 1 A 2016 69 85%
3 1 A 2017 70 95%
4 1 A 2018 71 85%
5 1 A 2019 72 90%
6 2 B 2018 20 80%
7 2 B 2019 23 75%
We can use subset from base R as well: take the 'Name' values where 'year' is the max, get the unique elements, and create a logical vector with %in% to subset the rows.
subset(df1, Name %in% unique(Name[year == max(year)]))
# num Name year X Y
#1 1 A 2015 68 80%
#2 1 A 2016 69 85%
#3 1 A 2017 70 95%
#4 1 A 2018 71 85%
#5 1 A 2019 72 90%
#6 2 B 2018 20 80%
#7 2 B 2019 23 75%
No packages are used
Or the similar syntax in dplyr
library(dplyr)
df1 %>%
filter(Name %in% unique(Name[year == max(year)]))
data
df1 <- structure(list(num = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 4L, 4L
), Name = c("A", "A", "A", "A", "A", "B", "B", "C", "D", "D"),
year = c(2015L, 2016L, 2017L, 2018L, 2019L, 2018L, 2019L,
2014L, 2012L, 2013L), X = c(68L, 69L, 70L, 71L, 72L, 20L,
23L, 3L, 4L, 5L), Y = c("80%", "85%", "95%", "85%", "90%",
"80%", "75%", "55%", "75%", "100%")), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
Using the data frame DF shown in the Note at the end we use semi_join to reduce it to the required names, convert Y to numeric and plot it. DF is not modified.
A possible alternative to the semi_join line is
filter(ave(year == max(year), Name, FUN = any)) %>%
The code is:
library(dplyr)
library(ggplot2)
DF %>%
semi_join(filter(., year == max(year)), by = "Name") %>%
mutate(Y = as.numeric(sub("%", "", Y))) %>%
ggplot(aes(year, Y)) + geom_line() + facet_wrap(~Name)
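The ave() alternative mentioned above can be tried on its own in base R. This is a minimal sketch on a shortened, hand-built version of the data; the names keep and current are illustrative. It builds a row-wise logical flag per Name and leaves the original data frame unmodified, as the question asks.

```r
# Shortened version of the example data
DF <- data.frame(
  Name = c("A", "A", "B", "B", "C", "D"),
  year = c(2018L, 2019L, 2018L, 2019L, 2014L, 2013L),
  Y    = c("85%", "90%", "80%", "75%", "55%", "100%")
)

# For every row: does this row's Name have any observation in the latest year?
keep <- as.logical(ave(DF$year == max(DF$year), DF$Name, FUN = any))

# Filter into a new object so DF itself keeps all records
current <- DF[keep, ]

# Strip the percent sign so Y can be plotted on a numeric axis
current$Y <- as.numeric(sub("%", "", current$Y))
print(current)
```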
Note
The input in reproducible form:
Lines <- " num Name year X Y
1 1 A 2015 68 80%
2 1 A 2016 69 85%
3 1 A 2017 70 95%
4 1 A 2018 71 85%
5 1 A 2019 72 90%
6 2 B 2018 20 80%
7 2 B 2019 23 75%
8 2 C 2014 3 55%
9 4 D 2012 4 75%
10 4 D 2013 5 100%"
DF <- read.table(text = Lines)

How to undummy a dataset with R

This is the library I am using for creating dummies:
install.packages("fastDummies")
library(fastDummies)
This is the dataset
winners <- data.frame(
city = c("SaoPaulito", "NewAmsterdam", "BeatifulCow"),
year = c(1990, 2000, 1990),
crime = 1:3)
Let's then create super dummies out of these cities:
dummy_cols(winners, select_columns = c("city"))
The results are
city year crime city_SaoPaulito city_NewAmsterdam city_BeatifulCow
1 SaoPaulito 1990 1 1 0 0
2 NewAmsterdam 2000 2 0 1 0
3 BeatifulCow 1990 3 0 0 1
So the question is: how can I return to the previous dataset? Any ideas?
Thanks in advance!
We can use dcast
library(data.table)
dcast(setDT(winners), crime ~ city, length)
If we need to get the input, it would be
subset(df1, select = 1:3)
# city year crime
#1 SaoPaulito 1990 1
#2 NewAmsterdam 2000 2
#3 BeatifulCow 1990 3
Or with melt
melt(setDT(df1), measure = patterns("_"))[value == 1, .(city, year, crime)]
# city year crime
#1: SaoPaulito 1990 1
#2: NewAmsterdam 2000 2
#3: BeatifulCow 1990 3
data
df1 <- structure(list(city = c("SaoPaulito", "NewAmsterdam", "BeatifulCow"
), year = c(1990L, 2000L, 1990L), crime = 1:3, city_SaoPaulito = c(1L,
0L, 0L), city_NewAmsterdam = c(0L, 1L, 0L), city_BeatifulCow = c(0L,
0L, 1L)), class = "data.frame", row.names = c("1", "2", "3"))
If you are going to have only one city as 1 in each row, you can just skip the dummy columns
df[, 1:3]
# city year crime
#1 SaoPaulito 1990 1
#2 NewAmsterdam 2000 2
#3 BeatifulCow 1990 3
If you can have multiple cities one way using dplyr and tidyr::gather is
library(dplyr)
df %>%
tidyr::gather(key, value, starts_with("city_")) %>%
filter(value == 1) %>%
select(-value, -key)
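If the original city column is no longer available, it can be rebuilt from the dummy columns. The following is a base R sketch under two assumptions: every dummy column starts with the prefix city_, and exactly one dummy equals 1 in each row. The names dummy_names and restored are illustrative.

```r
# Dummy-coded data without the original city column
df1 <- data.frame(
  year = c(1990L, 2000L, 1990L),
  crime = 1:3,
  city_SaoPaulito   = c(1L, 0L, 0L),
  city_NewAmsterdam = c(0L, 1L, 0L),
  city_BeatifulCow  = c(0L, 0L, 1L)
)

# Names of the dummy columns (assumed to share the "city_" prefix)
dummy_names <- grep("^city_", names(df1), value = TRUE)

# For each row, the position of the column holding the 1
idx <- max.col(as.matrix(df1[dummy_names]), ties.method = "first")

# Map the position back to a city name by stripping the prefix
df1$city <- sub("^city_", "", dummy_names[idx])

# Keep only the original columns, in the original order
restored <- df1[c("city", "year", "crime")]
print(restored)
```

With several 1s in a row, max.col() would pick only the first one, so the gather-based approach is the better fit for that case.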

Calculating a value for a current row in a dataframe using subset in R

I need to calculate some intermediate calculations using R.
Here is the data about some events and their types during some years.
mydf3 <- structure(list(year = c(1994, 1995, 1997, 1997, 1998, 1998, 1998,
2000, 2000, 2001, 2001, 2002), N = c(3L, 1L, 1L, 4L, 1L, 1L,
4L, 1L, 2L, 1L, 5L, 1L), type = c("OIL", "LNG", "AGS", "OIL",
"DOCK", "LNG", "OIL", "LNG", "OIL", "LNG", "OIL", "DOCK")), .Names = c("year",
"N", "type"), row.names = c(NA, 12L), class = "data.frame")
> head(mydf3)
year N type
1 1994 3 OIL
2 1995 1 LNG
3 1997 1 AGS
4 1997 4 OIL
5 1998 1 DOCK
6 1998 1 LNG
I need the count of N by year and type, the cumulative sum for each type up to the current year, and the cumulative sum over all types up to and including the current year.
So I need to get information like this:
year type cntyear cnt_cumultype cnt_cumulalltypes
1994 OIL        3             3                 3
1994 LNG        0             0                 3
1994 AGS        0             0                 3
1994 DOCK       0             0                 3
1995 OIL        0             3                 4
1995 LNG        1             1                 4
1995 AGS        0             0                 4
1995 DOCK       0             0                 4
...
Some explanation:
cntyear - the N count for the current year and type.
cnt_cumultype - the cumulative sum for this type up to and including the current year.
cnt_cumulalltypes - the cumulative sum over all types for all years up to and including the current year (year <= current year).
I tried something like this, but it didn't work correctly:
mydf3$cnt_cumultype<-tail(cumsum(mydf3[which(mydf3$type==mydf3$type & mydf3$year==mydf3$year),]$N), n=1)
How can I calculate these numbers row by row?
Here is a solution with the data.table package. This is also possible to solve in base R, but one step in particular is shorter with data.table.
# load library
library(data.table)
# cast df as a data.table and change column order
setcolorder(setDT(df), c("year", "type", "N"))
# change column names
setnames(df, names(df), c("year", "type", "cntyear"))
# get all type-year combinations in data.table with `CJ` and join these to original
# then, in second [, replace all observations with missing counts to 0
df2 <- df[CJ("year"=unique(df$year), "type"=unique(df$type)), on=c("year", "type")
][is.na(cntyear), cntyear := 0]
# get cumulative counts for each type
df2[, cnt_cumultype := cumsum(cntyear), by=type]
# get the running total of counts over all rows (rows are ordered by year, then type)
df2[, cnt_cumulalltypes := cumsum(cntyear)]
This results in
df2
year type cntyear cnt_cumultype cnt_cumulalltypes
1: 1994 AGS 0 0 0
2: 1994 DOCK 0 0 0
3: 1994 LNG 0 0 0
4: 1994 OIL 3 3 3
5: 1995 AGS 0 0 3
6: 1995 DOCK 0 0 3
7: 1995 LNG 1 1 4
8: 1995 OIL 0 3 4
9: 1997 AGS 1 1 5
....
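In the desired output of the question, cnt_cumulalltypes has the same value for every type within a year, whereas a plain running sum changes from row to row. Below is a base R sketch that produces the year-level version; it assumes the data frame is called mydf3 as in the question, and the intermediate names grid, full, year_totals and cum_totals are illustrative.

```r
mydf3 <- data.frame(
  year = c(1994, 1995, 1997, 1997, 1998, 1998, 1998, 2000, 2000, 2001, 2001, 2002),
  N    = c(3L, 1L, 1L, 4L, 1L, 1L, 4L, 1L, 2L, 1L, 5L, 1L),
  type = c("OIL", "LNG", "AGS", "OIL", "DOCK", "LNG", "OIL",
           "LNG", "OIL", "LNG", "OIL", "DOCK")
)

# All year/type combinations, so missing ones can be filled with 0
grid <- expand.grid(type = unique(mydf3$type), year = unique(mydf3$year),
                    stringsAsFactors = FALSE)
full <- merge(grid, mydf3, by = c("year", "type"), all.x = TRUE)
full$cntyear <- ifelse(is.na(full$N), 0L, full$N)
full <- full[order(full$year, full$type), ]

# Cumulative count within each type, in year order
full$cnt_cumultype <- ave(full$cntyear, full$type, FUN = cumsum)

# Total per year, accumulated over years, then mapped back to every row
year_totals <- tapply(full$cntyear, full$year, sum)
cum_totals  <- cumsum(year_totals)
full$cnt_cumulalltypes <- as.vector(cum_totals[as.character(full$year)])

print(head(full[c("year", "type", "cntyear", "cnt_cumultype",
                  "cnt_cumulalltypes")], 8))
```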

R - Adding numbers within a data frame cell together

I have a data frame in which the values are stored as characters. However, many values contain two numbers that need to be added together. Example:
          2014 Q1 Sales 2014 Q2 Sales 2014 Q3 Sales 2014 Q4 Sales
Product 1           3+6          2+10             8          13+2
Product 2             6           4+0          <NA>             5
Product 3          <NA>           5+9           3+1            11
Is there a way to go through the whole data frame and replace all cells containing characters like "3+6" with new values equal to their sum? I assume this would involve coercing the characters to numeric or integers, but I don't know how that would be possible for values with the + sign in them. I would like the example data frame to end up looking like this:
          2014 Q1 Sales 2014 Q2 Sales 2014 Q3 Sales 2014 Q4 Sales
Product 1             9            12             8            15
Product 2             6             4          <NA>             5
Product 3          <NA>            14             4            11
Here's an easier example:
dat <- data.frame(a=c("3+6", "10"), b=c("12", NA), c=c("3+4", "5+6"))
dat
## a b c
## 1 3+6 12 3+4
## 2 10 <NA> 5+6
apply(dat, 1:2, function(x) eval(parse(text=x)))
## a b c
## [1,] 9 12 7
## [2,] 10 NA 11
Using R itself to do the computation with eval and parse does the trick.
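If evaluating arbitrary strings with eval(parse()) feels too permissive, the sums can also be computed by splitting on the plus sign. This is a base R sketch on the same small example; the helper name sum_terms is made up here, and it assumes every cell is either NA or numbers joined by "+".

```r
dat <- data.frame(a = c("3+6", "10"), b = c("12", NA), c = c("3+4", "5+6"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: split each cell on a literal "+" and add the pieces
sum_terms <- function(x) {
  vapply(strsplit(as.character(x), "+", fixed = TRUE),
         function(parts) sum(as.numeric(parts)),
         numeric(1))
}

# Apply column by column; dat[] keeps the data.frame structure and names
dat[] <- lapply(dat, sum_terms)
print(dat)
```

The as.character() call means the same helper also works when the columns are factors, and NA cells stay NA.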
Here is one option with gsubfn that avoids eval(parse()). We convert the 'data.frame' to a 'matrix' (as.matrix(dat)). The pattern matches a run of digits ([0-9]+) captured as a group with parentheses, followed by a literal +, followed by a second captured run of digits; each match is replaced by converting both groups to numeric and adding them. The output can be assigned back to the original dataset to keep the same structure as 'dat'.
library(gsubfn)
dat[] <- as.numeric(gsubfn('([0-9]+)\\+([0-9]+)',
~as.numeric(x)+as.numeric(y), as.matrix(dat)))
dat
# 2014 Q1 Sales 2014 Q2 Sales 2014 Q3 Sales 2014 Q4 Sales
#Product 1 9 12 8 15
#Product 2 6 4 NA 5
#Product 3 NA 14 4 11
Or we can loop the columns with lapply and perform the replacement with gsubfn for each of the columns.
dat[] <- lapply(dat, function(x) as.numeric(gsubfn('([0-9]+)\\+([0-9]+)',
~as.numeric(x)+as.numeric(y), as.character(x))))
data
dat <- structure(list(`2014 Q1 Sales` = structure(c(1L, 2L, NA), .Label = c("3+6",
"6"), class = "factor"), `2014 Q2 Sales` = structure(1:3, .Label = c("2+10",
"4+0", "5+9"), class = "factor"), `2014 Q3 Sales` = structure(c(2L,
NA, 1L), .Label = c("3+1", "8"), class = "factor"), `2014 Q4 Sales` = structure(c(2L,
3L, 1L), .Label = c("11", "13+2", "5"), class = "factor")), .Names = c("2014 Q1 Sales",
"2014 Q2 Sales", "2014 Q3 Sales", "2014 Q4 Sales"), class = "data.frame", row.names = c("Product 1",
"Product 2", "Product 3"))
