Means multiple columns by multiple groups [duplicate] - r

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 2 years ago.
I am trying to find the means, not including NAs, for multiple columns within a data frame by multiple groups:
airquality <- data.frame(City = c("CityA", "CityA","CityA",
"CityB","CityB","CityB",
"CityC", "CityC"),
year = c("1990", "2000", "2010", "1990",
"2000", "2010", "2000", "2010"),
month = c("June", "July", "August",
"June", "July", "August",
"June", "August"),
PM10 = c(runif(3), rnorm(5)),
PM25 = c(runif(3), rnorm(5)),
Ozone = c(runif(3), rnorm(5)),
CO2 = c(runif(3), rnorm(5)))
airquality
So I get a list of the names with the number so I know which columns to select:
nam<-names(airquality)
namelist <- data.frame(matrix(t(nam)));namelist
I want to calculate the mean by City and year for PM25, Ozone, and CO2. That means I need columns 1, 2, and 5:7.
acast(datadf, year ~ city, mean, na.rm=TRUE)
But this is not really what I want because it includes the mean of something I do not need and it is not in a data frame format. I could convert it and then drop, but that seems like a very inefficient way to do it.
Is there a better way?

We can use dplyr with summarise_at to get mean of the concerned columns after grouping by the column of interest
library(dplyr)
airquality %>%
group_by(City, year) %>%
summarise_at(vars("PM25", "Ozone", "CO2"), mean, na.rm = TRUE)
Or using across(), introduced in the devel version of dplyr (version - ‘0.8.99.9000’) and released in dplyr 1.0.0
airquality %>%
group_by(City, year) %>%
summarise(across(PM25:CO2, ~ mean(.x, na.rm = TRUE)))
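For comparison, the same grouped means can be sketched in base R with aggregate(). This is only an illustration: the toy data frame aq and its values are made up here (with a couple of NAs added), and na.action = na.pass is an assumption about what is wanted, namely that rows containing an NA are kept so that na.rm = TRUE can drop missing values column by column.

```r
# Toy data, invented for illustration (not the asker's random data)
aq <- data.frame(
  City  = c("CityA", "CityA", "CityB", "CityB"),
  year  = c("1990", "1990", "2000", "2000"),
  PM25  = c(1, 3, NA, 4),
  Ozone = c(2, 4, 6, 8),
  CO2   = c(10, NA, 30, 50)
)

# Mean of each pollutant per City/year combination.
# na.pass keeps rows with NA; na.rm = TRUE is passed on to mean().
res <- aggregate(cbind(PM25, Ozone, CO2) ~ City + year, data = aq,
                 FUN = mean, na.rm = TRUE, na.action = na.pass)
res
```

Without na.action = na.pass, the formula method would drop every row that has an NA in any of the selected columns before the means are computed, which gives different results from a per-column na.rm.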

The summarise_at solution by Colin is simplest, but of course there are several.
Here is another solution, using tidyr to rearrange and calculate the mean:
airquality %>%
select(City, year, PM25, Ozone, CO2) %>%
gather(var, value, -City, -year) %>%
group_by(City, year, var) %>%
summarise(avg = mean(value, na.rm = TRUE)) %>% # can stop here if you want
spread(var, avg) # optional to make this into a wider table
# A tibble: 8 x 5
# Groups: City, year [8]
City year CO2 Ozone PM25
* <fctr> <fctr> <dbl> <dbl> <dbl>
1 CityA 1990 0.275981522 0.19941717 0.826008441
2 CityA 2000 0.090342153 0.50949094 0.005052771
3 CityA 2010 0.007345704 0.21893117 0.625373926
4 CityB 1990 1.148717447 -1.05983482 -0.961916973
5 CityB 2000 -2.334429324 0.28301220 -0.828515418
6 CityB 2010 1.110398814 -0.56434523 -0.804353609
7 CityC 2000 -0.676236740 0.20661529 -0.696816058
8 CityC 2010 0.229428142 0.06202997 -1.396357288
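gather() and spread() have since been superseded in tidyr by pivot_longer() and pivot_wider(). A rough equivalent of the reshape-then-summarise pipeline is sketched here, assuming tidyr 1.0.0 or later and dplyr 1.0.0 or later (for the .groups argument). The data frame aq is a small made-up stand-in for the random data.

```r
library(dplyr)
library(tidyr)

# Toy data, invented for illustration
aq <- data.frame(
  City  = c("CityA", "CityA", "CityB", "CityB"),
  year  = c("1990", "1990", "2000", "2000"),
  PM25  = c(1, 3, NA, 4),
  Ozone = c(2, 4, 6, 8),
  CO2   = c(10, NA, 30, 50)
)

res <- aq %>%
  # long format: one row per City/year/variable/value
  pivot_longer(c(PM25, Ozone, CO2), names_to = "var", values_to = "value") %>%
  group_by(City, year, var) %>%
  summarise(avg = mean(value, na.rm = TRUE), .groups = "drop") %>%
  # optional: back to one column per variable
  pivot_wider(names_from = var, values_from = avg)
res
```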

You should try dplyr::summarise_at:
library(dplyr)
airquality %>%
group_by(City, year) %>%
summarise_at(.vars = c("PM10", "PM25", "Ozone", "CO2"), .funs = mean)
# A tibble: 8 x 6
# Groups: City [?]
City year PM10 PM25 Ozone CO2
<fctr> <fctr> <dbl> <dbl> <dbl> <dbl>
1 CityA 1990 0.004087379 0.5146409 0.44393422 0.61196671
2 CityA 2000 0.039414194 0.8865582 0.06754322 0.69870187
3 CityA 2010 0.116901563 0.6608619 0.51499227 0.32952099
4 CityB 1990 -1.535888778 -0.9601897 1.17183649 0.08380664
5 CityB 2000 0.226046487 0.4037230 0.86554997 -0.05698204
6 CityB 2010 -0.824719956 0.1508471 0.32089806 -0.12871853
7 CityC 2000 -0.824509111 -0.6928741 0.85553837 0.12137923
8 CityC 2010 -1.626150294 1.5176198 0.21183149 -0.63859910

So I tested the comments above and added more replication to the original dataset because I wanted to calculate the average by city and by year. Here is the updated dataset
airquality <- data.frame(City = c("CityA", "CityA","CityA","CityA",
"CityB","CityB","CityB","CityB",
"CityC", "CityC", "CityC"),
year = c("1990", "2000", "2010", "2010",
"1990", "2000", "2010", "2010",
"1990", "2000", "2000"),
month = c("June", "July", "August", "August",
"June", "July", "August","August",
"June", "August", "August"),
PM10 = c(runif(6), rnorm(5)),
PM25 = c(runif(6), rnorm(5)),
Ozone = c(runif(6), rnorm(5)),
CO2 = c(runif(6), rnorm(5)))
airquality
Of the answers above, akrun's and Colin's worked.

Related

Add a column in R with season time

I have a dataset, and I want to add a column with the season, like this:
Month Year Region Season
January 2019 NY Winter
February 2019 NY Winter
March 2019 NY Spring
September 2019 NY Fall
How can I write R code that automatically adds a column where January, February and December are Winter, March, April and May are Spring, and so on?
Thanks a lot for helping
season <- c(data, Spring = "March", Spring = "April")
We can create a key-value dataset that maps each month to its season and do a join:
library(dplyr)
keydat <- tibble(Month = month.name,
Season = rep(c("Winter", "Spring", "Summer", "Fall", "Winter"),
c(2, 3, 3, 3, 1)))
df1 <- left_join(df1, keydat, by = "Month")
-output
df1
Month Year Region Season
1 January 2019 NY Winter
2 February 2019 NY Winter
3 March 2019 NY Spring
4 September 2019 NY Fall
data
df1 <- structure(list(Month = c("January", "February", "March", "September"
), Year = c(2019L, 2019L, 2019L, 2019L), Region = c("NY", "NY",
"NY", "NY")), class = "data.frame", row.names = c(NA, -4L))
In base R you could do:
df1$Season <- c('Winter', 'Spring', 'Summer', 'Fall')[
1 + (match(df1$Month, month.name) %/% 3) %% 4]
Which results in:
df1
#> Month Year Region Season
#> 1 January 2019 NY Winter
#> 2 February 2019 NY Winter
#> 3 March 2019 NY Spring
#> 4 September 2019 NY Fall
(Using akrun's reproducible data)
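To see why the index arithmetic in the base R answer works, it can help to apply it to all twelve months at once. The sketch below only re-runs the same expression on month.name; the variable names idx and season_by_month are chosen here for illustration.

```r
# Month numbers 1..12, integer-divided by 3, give 0 0 1 1 1 2 2 2 3 3 3 4;
# taking that modulo 4 wraps December (4) back to 0, i.e. Winter.
idx <- 1 + (seq_along(month.name) %/% 3) %% 4
season_by_month <- setNames(c("Winter", "Spring", "Summer", "Fall")[idx],
                            month.name)
season_by_month
```

December, January and February map to Winter, March to May to Spring, June to August to Summer, and September to November to Fall.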

Creating a new column using scores from past years (which is in the same dataframe)

I'm sorry if this question has already been answered, but I don't really know how to phrase my question.
I have a data frame structured in this way:
country year score
France 2020 10
France 2019 9
Germany 2020 15
Germany 2019 14
I would like to have a new column called previous_year_score that would look into the data frame looking for the "score" of a country for the "year - 1". In this case France 2020 would have a previous_year_score of 9, while France 2019 would have a NA.
You can use match() for this. I imagine there are plenty of other solutions too.
Data:
df <- structure(list(country = c("France", "France", "Germany", "Germany"
), year = c(2020L, 2019L, 2020L, 2019L), score = c(10L, 9L, 15L,
14L), prev_score = c(9L, NA, 14L, NA)), row.names = c(NA, -4L
), class = "data.frame")
Solution:
i <- match(paste(df[[1]],df[[2]]-1),paste(df[[1]],df[[2]]))
df$prev_score <- df[i,3]
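The match() approach can be unpacked step by step. The sketch below rebuilds the example data without the prev_score column and names the intermediate vectors (wanted, have) purely for illustration: each row asks for the key "country, year - 1" and match() returns the row where that key occurs, or NA if it does not.

```r
df <- data.frame(
  country = c("France", "France", "Germany", "Germany"),
  year    = c(2020L, 2019L, 2020L, 2019L),
  score   = c(10L, 9L, 15L, 14L)
)

wanted <- paste(df$country, df$year - 1)  # key each row is looking for
have   <- paste(df$country, df$year)      # key each row provides
i <- match(wanted, have)                  # row index of the previous year
i
# [1]  2 NA  4 NA

df$prev_score <- df$score[i]              # NA where no previous year exists
df
```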
You can use the following solution:
library(dplyr)
df %>%
group_by(country) %>%
arrange(year) %>%
mutate(prev_val = ifelse(year - lag(year) == 1, lag(score), NA))
# A tibble: 4 x 4
# Groups: country [2]
country year score prev_val
<chr> <int> <int> <int>
1 France 2019 9 NA
2 Germany 2019 14 NA
3 France 2020 10 9
4 Germany 2020 15 14
Using case_when
library(dplyr)
df %>%
arrange(country, year) %>%
group_by(country) %>%
mutate(prev_val = case_when(year - lag(year) == 1 ~ lag(score)))
# A tibble: 4 x 4
# Groups: country [2]
country year score prev_val
<chr> <int> <int> <int>
1 France 2019 9 NA
2 France 2020 10 9
3 Germany 2019 14 NA
4 Germany 2020 15 14

How to create a new data frame with grouped transactions in R?

I am trying to create a new data frame in R using an existing data frame of items bought in transactions as shown below:
dput output for the data:
structure(list(Transaction = c(1L, 2L, 2L, 3L, 3L, 3L), Item = c("Bread",
"Scandinavian", "Scandinavian", "Hot chocolate", "Jam", "Cookies"
), date_time = c("30/10/2016 09:58", "30/10/2016 10:05", "30/10/2016 10:05",
"30/10/2016 10:07", "30/10/2016 10:07", "30/10/2016 10:07"),
period_day = c("morning", "morning", "morning", "morning",
"morning", "morning"), weekday_weekend = c("weekend", "weekend",
"weekend", "weekend", "weekend", "weekend"), Year = c("2016",
"2016", "2016", "2016", "2016", "2016"), Month = c("October",
"October", "October", "October", "October", "October")), row.names = c(NA,
6L), class = "data.frame")
As you can see in the example, the rows are due to each individual product bought, not the transactions themselves (hence why Transaction 2 is both rows 2 and 3).
I would like to make a new table where the rows are the different transactions (1, 2, 3, etc.) and the different columns are categorical (Bread = 0, 1) so I can perform apriori analysis.
Any idea how I can group the different transactions together and then create these new columns?
Assuming your dataframe is called df you can use tidyr's pivot_wider :
df1 <- tidyr::pivot_wider(df, names_from = Item, values_from = Item,
values_fn = dplyr::n_distinct, values_fill = 0)
df1
# Transaction date_time period_day weekday_weekend Year Month Bread Scandinavian `Hot chocolate` Jam Cookies
# <int> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int>
#1 1 30/10/2016 09… morning weekend 2016 Octob… 1 0 0 0 0
#2 2 30/10/2016 10… morning weekend 2016 Octob… 0 1 0 0 0
#3 3 30/10/2016 10… morning weekend 2016 Octob… 0 0 1 1 1
Or with data.table's dcast :
library(data.table)
dcast(setDT(df), Transaction+date_time+period_day + weekday_weekend +
Year + Month ~ Item, value.var = 'Item', fun.aggregate = uniqueN)
Try dummy_cols from the fastDummies package. This will turn the Item column into 0's and 1's. The aggregate() call then sums them per transaction, so an item bought twice in one transaction gets a count of 2 rather than 1.
library(fastDummies)
d <- dummy_cols(data[1:2], remove_selected_columns = TRUE)
d <- aggregate(d[-1], by = list(Transaction = d$Transaction), FUN = sum)
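If extra packages are not an option, a strictly 0/1 table can also be sketched in base R with table(). This is an illustrative alternative, not one of the original answers; the data frame tx below reproduces only the Transaction and Item columns from the question, and the names counts and basket are made up here.

```r
tx <- data.frame(
  Transaction = c(1L, 2L, 2L, 3L, 3L, 3L),
  Item = c("Bread", "Scandinavian", "Scandinavian",
           "Hot chocolate", "Jam", "Cookies")
)

# Cross-tabulate transactions against items (counts can exceed 1)
counts <- unclass(table(tx$Transaction, tx$Item))

# Collapse counts to presence/absence: 1 if the item appears at all
basket <- as.data.frame((counts > 0) * 1L)
basket$Transaction <- as.integer(rownames(basket))
basket
```

Transaction 2 contains "Scandinavian" twice, so the raw count is 2, while the presence/absence version records it as 1.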

R dplyr perform different aggregation by group

I have a dataframe dat which looks like this:
dat <- structure(list(cell.ID = c(329574L, 329574L, 329574L, 329574L,
329574L, 329574L, 329574L, 329574L, 329574L, 329574L, 329574L,
329574L), Year = c("2010", "2010", "2010", "2010", "2010", "2010",
"2010", "2010", "2010", "2010", "2010", "2010"), month_name = c("June",
"July", "June", "July", "June", "July", "June", "July", "June",
"July", "June", "July"), value = c(459.860986624053, 398.94083733151,
16, 23, 111.69, 453.333, 71.55, 30.38, 31.928, 30.13355, 17.587,
19.7938709677419), variable_name = c("ETo", "ETo", "Rday", "Rday",
"Rsum", "Rsum", "Thdd", "Thdd", "Tmax", "Tmax", "Tmin", "Tmin"
), monthID = c(6L, 7L, 6L, 7L, 6L, 7L, 6L, 7L, 6L, 7L, 6L, 7L
)), row.names = c(NA, -12L), class = "data.frame")
library(dplyr)
dat %>%
dplyr::group_by(Year, variable_name) %>%
dplyr::summarise(variable = sum(value))
I want to average Tmax and Tmin and sum the rest of the variables, so I tried this:
dat %>%
dplyr::group_by(Year, variable_name) %>%
dplyr::summarise(variable = ifelse(variable_name %in% c('Tmax', 'Tmin'), mean(value), sum(value)))
Error: Column `variable` must be length 1 (a summary value), not 2
How do I correct this?
Another way to do this in dplyr is to use if and else instead of ifelse:
dat %>%
group_by(Year, variable_name) %>%
summarise(variable = if (variable_name[1] %in% c('Tmax', 'Tmin')) mean(value) else sum(value))
# A tibble: 6 x 3
# Groups: Year [1]
Year variable_name variable
<chr> <chr> <dbl>
1 2010 ETo 859.
2 2010 Rday 39
3 2010 Rsum 565.
4 2010 Thdd 102.
5 2010 Tmax 31.0
6 2010 Tmin 18.7
I think the problem is that ifelse in this context is operating row-wise, not at the level of the group. If that's right, then you could work around the problem by getting both summary statistics and then conditionally selecting the one you want by variable name, like this:
dat %>%
dplyr::group_by(Year, variable_name) %>%
dplyr::summarise(var_mean = mean(value), var_sum = sum(value)) %>%
dplyr::mutate(variable = ifelse(variable_name %in% c('Tmax', 'Tmin'), var_mean, var_sum)) %>%
dplyr::select(-var_mean, -var_sum)
Result:
# A tibble: 6 x 3
# Groups: Year [1]
Year variable_name variable
<chr> <chr> <dbl>
1 2010 ETo 859.
2 2010 Rday 39
3 2010 Rsum 565.
4 2010 Thdd 102.
5 2010 Tmax 31.0
6 2010 Tmin 18.7
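The length mismatch behind the error can be reproduced outside dplyr. The sketch below uses the two Tmax rows from dat as plain vectors: ifelse() is vectorised and returns one value per element of its test, whereas if/else evaluates a single condition and returns a single value.

```r
# One group as summarise() would see it: two rows, same variable_name
variable_name <- c("Tmax", "Tmax")
value <- c(31.928, 30.13355)

# ifelse(): the test has length 2, so the result has length 2
vec <- ifelse(variable_name %in% c("Tmax", "Tmin"), mean(value), sum(value))
length(vec)
# [1] 2

# if/else: one condition (first element only), so one summary value
scalar <- if (variable_name[1] %in% c("Tmax", "Tmin")) mean(value) else sum(value)
length(scalar)
# [1] 1
```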

How to summarize the top 3 highest values in a dataset when there are ties

I have a data frame (my_data) and want to calculate the sum of only the 3 highest values even though there might be ties. I am quite new to R and I've used dplyr.
# A tibble: 15 x 3
city month number
<chr> <chr> <dbl>
1 Lund jan 12
2 Lund feb 12
3 Lund mar 18
4 Lund apr 28
5 Lund may 28
6 Stockholm jan 15
7 Stockholm feb 15
8 Stockholm mar 30
9 Stockholm apr 30
10 Stockholm may 10
11 Uppsala jan 22
12 Uppsala feb 30
13 Uppsala mar 40
14 Uppsala apr 60
15 Uppsala may 30
This is the code I have tried:
# For each city, count the top 3 of variable number
my_data %>% group_by(city) %>% top_n(3, number) %>% summarise(top_nr = sum(number))
The expected (wanted) output is:
# A tibble: 3 x 2
city top_nr
<chr> <dbl>
1 Lund 86
2 Stockholm 75
3 Uppsala 130
but the actual R output is:
# A tibble: 3 x 2
city top_nr
<chr> <dbl>
1 Lund 86
2 Stockholm 90
3 Uppsala 160
It seems like if there are ties, all tied values are included in the summation. I wanted only 3 unique instances with highest values to be counted.
Any help would be much appreciated! :)
We can use distinct() to remove the duplicated values first. top_n() keeps every row that ties with the cutoff value, so with ties it can return more than n rows per group.
my_data %>%
distinct(city, number, .keep_all = TRUE) %>%
group_by(city) %>%
top_n(3, number) %>%
summarise(top_nr = sum(number))
Update
Based on the OP's new output, after the top_n output (which is not arranged), get the 'number' arranged in descending order and get the sum of first 3 'number'
my_data %>%
group_by(city) %>%
top_n(3, number) %>%
arrange(city, desc(number)) %>%
summarise(number = sum(head(number, 3)))
# A tibble: 3 x 2
# city number
# <chr> <int>
#1 Lund 74
#2 Stockholm 75
#3 Uppsala 130
data
my_data <- structure(list(city = c("Lund", "Lund", "Lund", "Lund", "Lund",
"Stockholm", "Stockholm", "Stockholm", "Stockholm", "Stockholm",
"Uppsala", "Uppsala", "Uppsala", "Uppsala", "Uppsala"), month = c("jan",
"feb", "mar", "apr", "may", "jan", "feb", "mar", "apr", "may",
"jan", "feb", "mar", "apr", "may"), number = c(12L, 12L, 18L,
28L, 28L, 15L, 15L, 30L, 30L, 10L, 22L, 30L, 40L, 60L, 30L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15"))
Life might be way simpler without top_n():
my_data %>%
group_by(city) %>%
summarize(
top_nr = sum(tail(sort(number), 3))
)
This tidyverse (actually, dplyr) solution is almost equal to akrun's, but filters the dataframe instead of getting the top_n.
library(tidyverse)
my_data %>%
group_by(city) %>%
arrange(desc(number), .by_group = TRUE) %>%
filter(row_number() %in% 1:3) %>%
summarise(top_nr = sum(number))
## A tibble: 3 x 2
# city top_nr
# <chr> <int>
#1 Lund 74
#2 Stockholm 75
#3 Uppsala 130
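In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), whose with_ties argument controls this behaviour directly. A minimal sketch, assuming that dplyr version is available; the data frame is rebuilt from the values in the question.

```r
library(dplyr)

my_data <- data.frame(
  city   = rep(c("Lund", "Stockholm", "Uppsala"), each = 5),
  month  = rep(c("jan", "feb", "mar", "apr", "may"), times = 3),
  number = c(12L, 12L, 18L, 28L, 28L,
             15L, 15L, 30L, 30L, 10L,
             22L, 30L, 40L, 60L, 30L)
)

res <- my_data %>%
  group_by(city) %>%
  # exactly 3 rows per city, even when the third-highest value is tied
  slice_max(number, n = 3, with_ties = FALSE) %>%
  summarise(top_nr = sum(number))
res
```

With with_ties = FALSE the sums are 74 for Lund, 75 for Stockholm and 130 for Uppsala, matching the output of the arrange-based answers.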
