Efficient algorithm to calculate values in data.frame without loop - r

I'm stuck on the following problem in R. I have a data table with one row per day, something like this:
Date = c(as.Date("2015-12-31"), as.Date("2016-01-01"));
Month1 = c("DEC", "JAN");
Year1 = c("15", "16");
Price1 = c(100, 110);
Month2 = c(NA_character_, NA_character_);
Year2 = c(NA_character_, NA_character_);
Price2 = c(NA_integer_, NA_integer_);
Month3 = c(NA_character_, NA_character_);
Year3 = c(NA_character_, NA_character_);
Price3 = c(NA_integer_, NA_integer_);
Month4 = c(NA_character_, NA_character_);
Year4 = c(NA_character_, NA_character_);
Price4 = c(NA_integer_, NA_integer_);
dataSample = data.frame(Date, Month1, Year1, Price1, Month2, Year2, Price2, Month3, Year3, Price3, Month4, Year4, Price4);
This gives the following table:
Date Month1 Year1 Price1 Month2 Year2 Price2 Month3 Year3 Price3 Month4 Year4 Price4
1 2015-12-31 DEC 15 100 <NA> <NA> NA <NA> <NA> NA <NA> <NA> NA
2 2016-01-01 JAN 16 110 <NA> <NA> NA <NA> <NA> NA <NA> <NA> NA
Now I need to fill in the remaining months and their prices. For that I have two other data frames:
Date = c(as.Date("2015-12-31"), as.Date("2015-12-31"), as.Date("2015-12-31"), as.Date("2016-01-01"), as.Date("2016-01-01"), as.Date("2016-01-01"));
Month.Start = c("DEC", "JAN", "FEB", "JAN", "FEB", "MAR");
Year.Start = c("15", "16", "16", "16", "16", "16")
Month.End = c("JAN", "FEB", "MAR", "FEB", "MAR", "APR");
Year.End = c("16", "16", "16", "16", "16", "16")
Diff = c(10, 15, -15, 19, -20, -5);
diffsOneMonth = data.frame(Date, Month.Start, Year.Start, Month.End, Year.End, Diff)
Date = c(as.Date("2015-12-31"), as.Date("2016-01-01"));
Month.Start = c("DEC", "MAR");
Year.Start = c("15", "16")
Month.End = c("MAR", "JUN");
Year.End = c("16", "16")
Diff = c(11, 25);
diffsThreeMonth = data.frame(Date, Month.Start, Year.Start, Month.End, Year.End, Diff)
Which gives me these tables:
One month price differences
Date Month.Start Year.Start Month.End Year.End Diff
1 2015-12-31 DEC 15 JAN 16 10
2 2015-12-31 JAN 16 FEB 16 15
3 2015-12-31 FEB 16 MAR 16 -15
4 2016-01-01 JAN 16 FEB 16 19
5 2016-01-01 FEB 16 MAR 16 -20
6 2016-01-01 MAR 16 APR 16 -5
Three month price differences
Date Month.Start Year.Start Month.End Year.End Diff
1 2015-12-31 DEC 15 MAR 16 11
2 2016-01-01 MAR 16 JUN 16 25
Now I must fill the dataSample data frame using the data from the difference tables. I check which start/end months and years are available there and fill those months/years into dataSample. Then I take the price difference and set the calculated price in dataSample. For example, in dataSample we start with DEC 15; diffsOneMonth has an entry DEC 15 - JAN 16 with a difference of 10, so we add it to the DEC 15 price (100) and get a JAN 16 price of 110:
Date Month1 Year1 Price1 Month2 Year2 Price2 Month3 Year3 Price3 Month4 Year4 Price4
1 2015-12-31 DEC 15 100 JAN 16 110 <NA> <NA> NA <NA> <NA> NA
2 2016-01-01 JAN 16 110 <NA> <NA> NA <NA> <NA> NA <NA> <NA> NA
Now it's possible to do the next month, then the one after, and so on. If we used diffsOneMonth only, we would get this result:
Date Month1 Year1 Price1 Month2 Year2 Price2 Month3 Year3 Price3 Month4 Year4 Price4
1 2015-12-31 DEC 15 100 JAN 16 110 FEB 16 125 MAR 16 110
2 2016-01-01 JAN 16 110 FEB 16 129 MAR 16 109 APR 16 104
However, there is an additional requirement: I must use the wider month spread to calculate prices wherever possible. For 2015-12-31 there is a three-month spread from DEC 15 to MAR 16, which should override the price derived from the one-month differences. The DEC 15 price is 100 and the DEC 15 - MAR 16 difference is 11, which makes the MAR 16 price 111 rather than 110:
Date Month1 Year1 Price1 Month2 Year2 Price2 Month3 Year3 Price3 Month4 Year4 Price4
1 2015-12-31 DEC 15 100 JAN 16 110 FEB 16 125 MAR 16 111
2 2016-01-01 JAN 16 110 FEB 16 129 MAR 16 109 APR 16 104
For this sample, that is my desired final output.
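The override rule can be sketched in a few lines of base R for the 2015-12-31 row alone. This is a minimal illustration rather than a full solution: the contract labels such as "DEC15" and all variable names are made up for the example, and the numbers are copied from the sample tables above.

```r
# Values for the 2015-12-31 row, copied from the sample data
price1   <- 100                               # known DEC 15 price
one_end  <- c("JAN16", "FEB16", "MAR16")      # end contracts of the one-month spreads
one_diff <- c(10, 15, -15)

# Step 1: chain the one-month spreads with a cumulative sum
prices <- setNames(price1 + cumsum(one_diff), one_end)

# Step 2: a wider spread overrides the chained price of its end contract
known <- c(DEC15 = price1, prices)            # every price known so far, by contract
three_start <- "DEC15"
three_end   <- "MAR16"
three_diff  <- 11
prices[three_end] <- known[[three_start]] + three_diff

prices
# JAN16 FEB16 MAR16
#   110   125   111
```

The cumulative sum is what removes the need for a loop over months; the override is then a single named assignment per wider spread.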
The real data is much more complex, with 6- and 12-month differences and 64 months forward for each date. Some months can also be missing. I tried to do it with a loop but it was very slow, and I am not sure how to approach such a problem without one. I have created a few helper functions to calculate the next year/month:
nextContract = function(currentMonth, currentYear, length = 1,
years = c("10", "11", "12", "13", "14", "15", "16", "17", "18"),
months = c("JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")) {
mIdx <- match(currentMonth, months)+length;
# number of whole years the new month index has rolled over
yDiff = (mIdx - 1) %/% length(months);
return(data.frame(nextMonth(currentMonth, length, months), nextYear(currentYear, length = yDiff)))
}
nextYear = function(currentYear, length = 1, years = c("10", "11", "12", "13", "14", "15", "16", "17", "18")) {
return(years[match(currentYear, years)+length]);
}
nextMonth = function(currentMonth, length = 1, months = c("JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")) {
mIdx <- match(currentMonth, months)+length;
return(months[ifelse(length(months) < mIdx, ifelse(mIdx %% length(months) != 0, mIdx %% length(months), length(months)), mIdx)]);
}
Example of usage could be:
> nextContract("DEC", "15")
nextMonth.currentMonth..length..months. nextYear.currentYear..length...yDiff.
1 JAN 16
or:
> nextContract("DEC", "15", length = 3)
nextMonth.currentMonth..length..months. nextYear.currentYear..length...yDiff.
1 MAR 16
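The same month/year stepping can be sketched with plain modular arithmetic. The function name next_contract_sketch and its step argument are hypothetical, and the sketch assumes two-digit years that all fall in the same century; it is vectorised, so it can be applied to whole columns at once.

```r
next_contract_sketch <- function(month, year, step = 1) {
  months <- toupper(month.abb)   # "JAN", "FEB", ..., "DEC"
  # Encode month and year as one running month count, then add the step
  total <- as.integer(year) * 12L + (match(month, months) - 1L) + as.integer(step)
  data.frame(month = months[total %% 12L + 1L],
             year  = sprintf("%02d", (total %/% 12L) %% 100L),
             stringsAsFactors = FALSE)
}

one_step   <- next_contract_sketch("DEC", "15")            # JAN 16
three_step <- next_contract_sketch("DEC", "15", step = 3)  # MAR 16
```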
This turned into a pretty long question, but I hope someone will take the time to review it :)
Thanks in advance!
EDIT
With a few improvements to the proposed solution I got what I needed:
outrightAndForwardRows <- list("1" = diffsOneMonth, "3" = diffsThreeMonth) %>%
bind_rows(.id = "time_step") %>%
left_join(dataSample %>%
select(Date, Price1, Month1, Year1) ) %>%
mutate(Day.Start = 1) %>%
mutate(Day.End = 1) %>%
mutate(Outright.Day = 1) %>%
unite("Contract.Start", Day.Start, Month.Start, Year.Start) %>%
unite("Contract.End", Day.End, Month.End, Year.End) %>%
unite("Contract.Outright", Outright.Day, Month1, Year1) %>%
mutate(time_step = as.numeric(time_step),
Contract.Start =
Contract.Start %>%
parse_date_time("%d_%b_%y")) %>%
mutate(Contract.End =
Contract.End %>%
parse_date_time("%d_%b_%y")) %>%
mutate(Contract.Outright =
Contract.Outright %>%
parse_date_time("%d_%b_%y")) %>%
group_by(time_step, Date) %>%
arrange(Contract.End) %>%
mutate(Price = cumsum(Diff) + Price1) %>%
group_by(Date, Contract.End) %>%
slice(time_step %>% which.max) %>%
ungroup() %>%
select(-time_step, -Diff, -Contract.Start)
#### add outright and forward months to the same columns
outright <- outrightAndForwardRows %>% select(Date, Price=Price1, Contract=Contract.Outright) %>% unique
forwardMonths <- outrightAndForwardRows %>% select(Date, Contract=Contract.End, Price)
# join and sort rows
joined <- rbind(outright, forwardMonths) %>% arrange(Date, Contract)
# add contract sequence
joined = data.table(joined)
joined = joined[, Contract.seq:=seq(.N), by=Date];
dcast(joined, Date ~ Contract.seq, value.var=c("Price", "Contract"))

Something like this:
library(dplyr)
library(tidyr)
library(lubridate)
list(`1` = diffsOneMonth,
`3` = diffsThreeMonth) %>%
bind_rows(.id = "time_step") %>%
left_join(dataSample %>%
select(Date, Price1, Month1, Year1) ) %>%
mutate(Day.Start = 1) %>%
unite("Date.Start", Day.Start, Month.Start, Year.Start) %>%
mutate(time_step = as.numeric(time_step),
Date.Start =
Date.Start %>%
parse_date_time("%d_%b_%y")) %>%
group_by(time_step, Date) %>%
arrange(Date.Start) %>%
mutate(Price = cumsum(Diff) + Price1) %>%
group_by(Date, Date.Start) %>%
slice(time_step %>% which.max)

Related

How to create new column based off information in other columns in R

I have a large dataset that spans over 20 years. I have a column for the date and another column for the hour ending (HE). I'm trying to add a new column to provide the hour by hour (hrxhr) information in a given year (so running total). So date: Jan 1, 2023, HE: 1 should be hrxhr: 1 and Dec 31, 2023, HE: 24, should be hrxhr:8760 (8784 on leap years).
Should look like this:
YEAR  MONTH  DAY  HOUR OF DAY  Month_num  Date        Date1  NEW COLUMN hrxhr
2023  Dec    31   22           12         2023-12-31  365    8758
2023  Dec    31   23           12         2023-12-31  365    8759
2023  Dec    31   24           12         2023-12-31  365    8760
2024  Jan    01   1            01         2024-01-01  1      1
2024  Jan    01   2            01         2024-01-01  1      2
At first I thought I could get the Julian date and then multiply that by the HE, but that is incorrect: Jan 2, 2023, HE: 1 would then equal 2, whereas the hrxhr/running total should equal 25.
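The arithmetic the running total needs can be sketched directly: the days before the current one each contribute 24 hours, and the hour ending is added on top. This assumes every day has exactly 24 hours (no daylight-saving adjustments); the vector names are made up for the example.

```r
dates <- as.Date(c("2023-01-01", "2023-01-02", "2023-12-31", "2024-12-31"))
he    <- c(1L, 1L, 24L, 24L)                      # hour ending

day_of_year <- as.integer(format(dates, "%j"))    # 1..365 (366 in leap years)
hrxhr <- (day_of_year - 1L) * 24L + he

hrxhr
# 1 25 8760 8784
```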
In base R:
df <- data.frame(
YEAR = c(2023L, 2023L, 2023L, 2024L, 2024L),
MONTH = c("Dec", "Dec", "Dec", "Jan", "Jan"), DAY = c(31L, 31L, 31L, 1L, 1L),
HOUR_OF_DAY = c(22L, 23L, 24L, 1L, 2L), Month_num = c(12L,
12L, 12L, 12L, 12L), Date = c("2023-12-31", "2023-12-31",
"2023-12-31", "2024-01-01", "2024-01-01"), Date1 = c(365L,
365L, 365L, 1L, 1L))
# Count the hourly steps from the start of the year to midnight of each date,
# then add the hour ending (minus one, because the sequence includes its start)
df$hrxhr <- mapply(\(from, to, by) length(seq.POSIXt(from, to, by)),
from = trunc(as.POSIXlt(df$Date), "years"),
to = as.POSIXlt(df$Date),
by="1 hour") + df$HOUR_OF_DAY - 1
df
#> YEAR MONTH DAY HOUR_OF_DAY Month_num Date Date1 hrxhr
#> 1 2023 Dec 31 22 12 2023-12-31 365 8758
#> 2 2023 Dec 31 23 12 2023-12-31 365 8759
#> 3 2023 Dec 31 24 12 2023-12-31 365 8760
#> 4 2024 Jan 1 1 12 2024-01-01 1 1
#> 5 2024 Jan 1 2 12 2024-01-01 1 2
If you are open to a tidyverse / lubridate solution, you could use
library(dplyr)
library(lubridate)
df1 %>%
mutate(
begin = ymd_hms(paste(year(Date), "-01-01 00:00:00")),
target = ymd_hms(paste(Date, HOUR_OF_DAY, ":00:00")),
hrxhr = time_length(interval(begin, target), "hours")) %>%
select(-begin, -target)
This returns
# A tibble: 5 × 7
YEAR MONTH DAY HOUR_OF_DAY Month_num Date hrxhr
<dbl> <chr> <chr> <dbl> <dbl> <date> <dbl>
1 2023 Dec 31 22 12 2023-12-31 8758
2 2023 Dec 31 23 12 2023-12-31 8759
3 2023 Dec 31 24 12 2023-12-31 8760
4 2024 Jan 01 1 12 2024-01-01 1
5 2024 Jan 01 2 12 2024-01-01 2
Data
df1 <- structure(list(YEAR = c(2023, 2023, 2023, 2024, 2024), MONTH = c("Dec",
"Dec", "Dec", "Jan", "Jan"), DAY = c("31", "31", "31", "01",
"01"), HOUR_OF_DAY = c(22, 23, 24, 1, 2), Month_num = c(12, 12,
12, 12, 12), Date = structure(c(19722, 19722, 19722, 19723, 19723
), class = "Date")), row.names = c(NA, -5L), class = "data.frame")

How to add an increasing index based on multiple columns in R

I have a data frame that contains the columns "hour", "day","month" and "count".
library(tidyverse)
set.seed(0)
df <- expand_grid(expand_grid(
hour = seq(0:23),
day = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
month = c("Jan", "Feb", "Mar", "Apr", "May", "Jun")) %>%
mutate(count = sample(0:100, n(), replace = TRUE))
head(df)
# A tibble: 6 × 4
hour day month count
<int> <chr> <chr> <int>
1 1 Mon Jan 13
2 1 Mon Feb 67
3 1 Mon Mar 38
4 1 Mon Apr 0
5 1 Mon May 33
6 1 Mon Jun 86
I would like to add a new column named "id" that contains an increasing index which can be used to sort the data in chronological order. The solution I found is not particularly concise and requires me to set factor levels before calling arrange(). Is there another way to solve this issue that capitalises on the fact that I am working with (unformatted) dates?
This is my solution with arrange():
df2 <- df %>%
mutate(day = factor(day, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
month = factor(month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun"))) %>%
arrange(month, day, hour) %>%
mutate(id = row_number())
head(df2)
# A tibble: 6 × 5
hour day month count id
<int> <fct> <fct> <int> <int>
1 1 Mon Jan 13 1
2 2 Mon Jan 43 2
3 3 Mon Jan 82 3
4 4 Mon Jan 66 4
5 5 Mon Jan 49 5
6 6 Mon Jan 79 6
Any suggestions are much appreciated. Thank you!
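One possible way to avoid setting factor levels, sketched in base R: match() converts each label to its position in a reference vector, and order() then sorts on those integers. The data frame below is a stand-in built with expand.grid (without the count column), and day_levels is a name chosen for the example.

```r
day_levels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
df <- expand.grid(hour = 1:24, day = day_levels, month = month.abb[1:6],
                  stringsAsFactors = FALSE)

# match() turns each label into its position in the reference vector,
# so order() can sort on plain integers instead of factor levels
ord <- order(match(df$month, month.abb), match(df$day, day_levels), df$hour)

df$id <- integer(nrow(df))
df$id[ord] <- seq_along(ord)   # id = position of each row in the sorted order

head(df[ord, ])
```

The original row order is left untouched; sorting by id afterwards gives the same order as the arrange() call in the question.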

Make an average based on condition from second df in R

I have the following problem. I have two dataframes. The second one contains the conditions that determine how a new column in the first dataframe should be calculated. See the example below:
First df:
df1 <- data.frame(country = c("01", "01", "02", "03", "03", "03" , "04", "05"),
date = c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-02"),
value = c(4, 3, 2, -3, 1.5, 12, 10, 15),
blabla = c(23, 41, 32, 8, 50, 27, 8, 7)
)
Second df:
df2 <- data.frame( country = c("01", "02", "03", "04", "05" ),
match_country1 = c("02", "03", "01", "01", "01"),
match_country2 = c("03", "04", "02", "02", "03"),
match_country3 = c("05", "05", "04", "03", "04")
)
Now I need to compute a new_value that is an average of three values as defined in df2. I need to respect a date in df1. For example, new_value for country "01" and date "2020-01-01" is an average of a value from country "02", country "03", country "05" all from date "2020-01-01".
Desired output is below:
new_df <- data.frame(country = c("01", "01", "02", "03", "03", "03" , "04", "05"),
date = c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-02"),
value = c(4, 3, 2, -3, 1.5, 12, 10, 15),
blabla = c(23, 41, 32, 8, 50, 27, 8, 7),
new_value = c(NA, #because no data for 02, 03, 05 from 2020-01-01
(2-3+15)/3,
(-3+15)/2, #because no data for 04 from 2020-01-02
(3+2)/2, #because no data for 04 from 2020-01-02
NA, #because no data for 01, 02, 04 from 2020-01-03
NA, #because no data for 01, 02, 04 from 2020-01-04
4, #because no data for 02, 03 from 2020-01-01
(3-3)/2 #because no data for 04 from 2020-01-02
)
)
How can I do this, please?
This can be done using an SQL triple join. For each row in df1 get the matching country row in df2 via left join and then get all the rows in the b instance of df1 for which the date is the same and there is a country match in df2. Then take the average b value in the matching rows.
library(sqldf)
sqldf("select a.*, avg(b.value) new_value
from df1 a
left join df2 c on a.country = c.country
left join df1 b on a.date = b.date and
b.country in (c.match_country1, c.match_country2, c.match_country3)
group by a.rowid")
giving this data frame:
country date value blabla new_value
1 01 2020-01-01 4.0 23 NA
2 01 2020-01-02 3.0 41 4.666667
3 02 2020-01-02 2.0 32 6.000000
4 03 2020-01-02 -3.0 8 2.500000
5 03 2020-01-03 1.5 50 NA
6 03 2020-01-04 12.0 27 NA
7 04 2020-01-01 10.0 8 4.000000
8 05 2020-01-02 15.0 7 0.000000
Variations
Here are two variations. The first builds the in (...) string as matches and substitutes it into the query; the second first converts df2 to long form (long) and joins on that.
matches <- toString(names(df2)[-1])
fn$sqldf("select a.*, avg(b.value) new_value
from df1 a
left join df2 c on a.country = c.country
left join df1 b on a.date = b.date and b.country in ($matches)
group by a.rowid")
varying <- list(match_country = names(df2)[-1])
long <- reshape(df2, dir = "long", varying = varying, v.names = names(varying))
sqldf("select a.*, avg(b.value) new_value
from df1 a
left join long c on a.country = c.country
left join df1 b on a.date = b.date and b.country = c.match_country
group by a.rowid")
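The long-form join can also be sketched with base R's merge(), without SQL. This is an illustration of the same idea under the sample data from the question; the helper names long, pairs, hits and the row column are made up for the example.

```r
df1 <- data.frame(
  country = c("01", "01", "02", "03", "03", "03", "04", "05"),
  date = c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-02",
           "2020-01-03", "2020-01-04", "2020-01-01", "2020-01-02"),
  value = c(4, 3, 2, -3, 1.5, 12, 10, 15),
  stringsAsFactors = FALSE)
df2 <- data.frame(
  country = c("01", "02", "03", "04", "05"),
  match_country1 = c("02", "03", "01", "01", "01"),
  match_country2 = c("03", "04", "02", "02", "03"),
  match_country3 = c("05", "05", "04", "03", "04"),
  stringsAsFactors = FALSE)

# Long form: one row per (country, matched country) pair
long <- data.frame(country = rep(df2$country, times = ncol(df2) - 1),
                   match_country = unlist(df2[-1], use.names = FALSE),
                   stringsAsFactors = FALSE)

df1$row <- seq_len(nrow(df1))                      # remember each original row
pairs <- merge(df1[c("row", "country", "date")], long, by = "country")

# Keep only pairs whose matched country has a value on the same date
hits <- merge(pairs, df1[c("country", "date", "value")],
              by.x = c("match_country", "date"), by.y = c("country", "date"))

avg <- tapply(hits$value, hits$row, mean)          # mean per original row
df1$new_value <- as.numeric(avg)[match(df1$row, as.integer(names(avg)))]
df1$row <- NULL
```

Rows with no matching country on the same date never appear in hits, so match() returns NA for them, which gives the NA entries in the desired output.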
This tidyverse approach may help
df1
#> country date value blabla
#> 1 01 2020-01-01 4.0 23
#> 2 01 2020-01-02 3.0 41
#> 3 02 2020-01-02 2.0 32
#> 4 03 2020-01-02 -3.0 8
#> 5 03 2020-01-03 1.5 50
#> 6 03 2020-01-04 12.0 27
#> 7 04 2020-01-01 10.0 8
#> 8 05 2020-01-02 15.0 7
df2
#> country match_country1 match_country2 match_country3
#> 1 01 02 03 05
#> 2 02 03 04 05
#> 3 03 01 02 04
#> 4 04 01 02 03
#> 5 05 01 03 04
suppressMessages(library(tidyverse))
df1 %>%
left_join(df2, by = 'country') %>%
nest(data = !date) %>%
mutate(data = map(data, ~.x %>%
mutate(across(contains('match'), ~value[match(., country)])) %>%
rowwise() %>%
mutate(avg = mean(c_across(contains('match')), na.rm = T)) %>%
select(!contains('match'))
)
) %>%
unnest(data)
#> # A tibble: 8 x 5
#> date country value blabla avg
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2020-01-01 01 4 23 NaN
#> 2 2020-01-01 04 10 8 4
#> 3 2020-01-02 01 3 41 4.67
#> 4 2020-01-02 02 2 32 6
#> 5 2020-01-02 03 -3 8 2.5
#> 6 2020-01-02 05 15 7 0
#> 7 2020-01-03 03 1.5 50 NaN
#> 8 2020-01-04 03 12 27 NaN
Created on 2021-05-02 by the reprex package (v2.0.0)
Though there is already an accepted answer, here is a base R solution, since the two answers posted so far both require external packages.
df1$new_value <- with(df1, ave(seq_along(date), date, FUN = function(i){
mrg <- merge(df1[i, ], df2)
j <- grep("^match", names(mrg))
ctry <- unique(df1[i, "country"])
apply(mrg[j], 1, function(row){
k <- match(row, ctry)
if(any(!is.na(k)))
mean(mrg[k, "value"], na.rm = TRUE)
else NA_real_
})
}))
identical(df1$new_value, new_df$new_value)
#[1] TRUE

Combining abbreviated months and year into one variable in R

I have a time series data with a column for a month and a column for a year. The months are JAN, FEB, etc.
I'm trying to combine them into one month year variable in order to run time series analysis on it. I'm very new to R and could use any guidance.
Perhaps something like this?
library(dplyr)
c("JAN", "FEB", "MAR", "APR",
"MAY", "JUN", "JUL", "AUG",
"SEP", "OCT", "NOV", "DEC") %>%
rep(., times = 3) %>%
as.factor() -> months
c("2018", "2019", "2020") %>%
rep(., each = 12) %>%
as.factor() -> years
df1 <- cbind.data.frame(months, years)
paste(df1$months, df1$years, sep = ".") %>%
as.factor() -> merged.years.months
Start with your month/year df.
library(tidyverse)
library(lubridate)
events <- tibble(month = c("JAN", "MAR", "FEB", "NOV", "AUG"),
year = c(2018, 2019, 2018, 2020, 2019))
Let's say that each of your time periods start on the first of the month.
series <- events %>%
mutate(mo1 = dmy(paste(1, month, year)))
This is what you want
R > series
# A tibble: 5 x 3
month year mo1
<chr> <dbl> <date>
1 JAN 2018 2018-01-01
2 MAR 2019 2019-03-01
3 FEB 2018 2018-02-01
4 NOV 2020 2020-11-01
5 AUG 2019 2019-08-01
These are now dates; you can use them in other analyses.
Base R solution:
events <- within(events, {
  # Map each abbreviation to its calendar month number (month.abb is built into R)
  month_no <- match(month, toupper(month.abb))
  date <- as.Date(paste(year, sprintf("%02d", month_no), "01", sep = "-"), "%Y-%m-%d")
  rm(month_no, month, year)
})

How to summarize the top 3 highest values in a dataset when there are ties

I have a data frame (my_data) and want to calculate the sum of only the 3 highest values even though there might be ties. I am quite new to R and I've used dplyr.
# A tibble: 15 x 3
city month number
<chr> <chr> <dbl>
1 Lund jan 12
2 Lund feb 12
3 Lund mar 18
4 Lund apr 28
5 Lund may 28
6 Stockholm jan 15
7 Stockholm feb 15
8 Stockholm mar 30
9 Stockholm apr 30
10 Stockholm may 10
11 Uppsala jan 22
12 Uppsala feb 30
13 Uppsala mar 40
14 Uppsala apr 60
15 Uppsala may 30
This is the code I have tried:
# For each city, count the top 3 of variable number
my_data %>% group_by(city) %>% top_n(3, number) %>% summarise(top_nr = sum(number))
The expected (wanted) output is:
# A tibble: 3 x 2
city top_nr
<chr> <dbl>
1 Lund 74
2 Stockholm 75
3 Uppsala 130
but the actual R output is:
# A tibble: 3 x 2
city top_nr
<chr> <dbl>
1 Lund 74
2 Stockholm 90
3 Uppsala 160
It seems like if there are ties, all tied values are included in the summation. I wanted only 3 unique instances with highest values to be counted.
Any help would be much appreciated! :)
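The tie behaviour can be reproduced in base R for a single city. The sketch assumes that top_n() selects rows the way a minimum-rank cutoff does, so tied values share a rank and are all kept; the Stockholm numbers are taken from the table above.

```r
stockholm <- c(15, 15, 30, 30, 10)

# Rank-based cutoff: tied values share the lowest rank, so all four
# of 30, 30, 15, 15 have rank <= 3 and are kept
keep <- rank(-stockholm, ties.method = "min") <= 3
sum_with_ties <- sum(stockholm[keep])                          # 90

# Sorting and taking exactly three values ignores ties
sum_top3 <- sum(head(sort(stockholm, decreasing = TRUE), 3))   # 75
```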
We can use distinct to remove the duplicate elements. top_n keeps every row tied at the cutoff, so when values are duplicated it can return more than n rows.
my_data %>%
distinct(city, number, .keep_all = TRUE) %>%
group_by(city) %>%
top_n(3, number) %>%
summarise(top_nr = sum(number))
Update
Based on the OP's new output: after top_n (whose output is not sorted), arrange 'number' in descending order and sum the first 3 values of 'number'.
my_data %>%
group_by(city) %>%
top_n(3, number) %>%
arrange(city, desc(number)) %>%
summarise(number = sum(head(number, 3)))
# A tibble: 3 x 2
# city number
# <chr> <int>
#1 Lund 74
#2 Stockholm 75
#3 Uppsala 130
data
my_data <- structure(list(city = c("Lund", "Lund", "Lund", "Lund", "Lund",
"Stockholm", "Stockholm", "Stockholm", "Stockholm", "Stockholm",
"Uppsala", "Uppsala", "Uppsala", "Uppsala", "Uppsala"), month = c("jan",
"feb", "mar", "apr", "may", "jan", "feb", "mar", "apr", "may",
"jan", "feb", "mar", "apr", "may"), number = c(12L, 12L, 18L,
28L, 28L, 15L, 15L, 30L, 30L, 10L, 22L, 30L, 40L, 60L, 30L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15"))
Life might be way simpler without top_n():
my_data %>%
group_by(city) %>%
summarize(
top_nr = sum(tail(sort(number), 3))
)
This tidyverse (actually, dplyr) solution is almost equal to akrun's, but filters the dataframe instead of getting the top_n.
library(tidyverse)
my_data %>%
group_by(city) %>%
arrange(desc(number), .by_group = TRUE) %>%
filter(row_number() %in% 1:3) %>%
summarise(top_nr = sum(number))
## A tibble: 3 x 2
# city top_nr
# <chr> <int>
#1 Lund 74
#2 Stockholm 75
#3 Uppsala 130
