I want to group by district summing 'incoming' values at quarter and get the value of the 'stock' in the last quarter (3) in just one step. 'stock' can not summed through quarters.
My example dataframe:
library(dplyr)
df <- data.frame ("district"= rep(c("ARA", "BJI", "CMC"), each=3),
"quarter"=rep(1:3,3),
"incoming"= c(4044, 2992, 2556, 1639, 9547, 1191,2038,1942,225),
"stock"= c(19547,3160, 1533,5355,6146,355,5816,1119,333)
)
df
district quarter incoming stock
1 ARA 1 4044 19547
2 ARA 2 2992 3160
3 ARA 3 2556 1533
4 BJI 1 1639 5355
5 BJI 2 9547 6146
6 BJI 3 1191 355
7 CMC 1 2038 5816
8 CMC 2 1942 1119
9 CMC 3 225 333
The actual dataframe has ~45.000 rows and 41 variables of which 8 are of type stock.
The result should be:
# A tibble: 3 × 3
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
I know how to get to the result but in three steps and I don't think it's efficient and error prone due to the data.
My approach:
basea <- df %>%
group_by(district) %>%
filter(quarter==3) %>% #take only the last quarter
summarise(across(stock, sum)) %>%
baseb <- df %>%
group_by(district) %>%
summarise(across(incoming, sum)) %>%
final <- full_join(basea, baseb)
Does anyone have any suggestions to perform the procedure in one (or at least two) steps?
Grateful,
Modus
Given that the dataset only has 3 quarters and not 4. If that's not the case use nth(3) instead of last()
library(tidyverse)
df %>%
group_by(district) %>%
summarise(stock = last(stock),
incoming = sum(incoming))
# A tibble: 3 × 3
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
here is a data.table approach
library(data.table)
setDT(df)[, .(incoming = sum(incoming), stock = stock[.N]), by = .(district)]
district incoming stock
1: ARA 9592 1533
2: BJI 12377 355
3: CMC 4205 333
Here's a refactor that removes some of the duplicated code. This also seems like a prime use-case for creating a custom function that can be QC'd and maintained easier:
library(dplyr)
df <- data.frame ("district"= rep(c("ARA", "BJI", "CMC"), each=3),
"quarter"=rep(1:3,3),
"incoming"= c(4044, 2992, 2556, 1639, 9547, 1191,2038,1942,225),
"stock"= c(19547,3160, 1533,5355,6146,355,5816,1119,333)
)
aggregate_stocks <- function(df, n_quarter) {
base <- df %>%
group_by(district)
basea <- base %>%
filter(quarter == n_quarter) %>%
summarise(across(stock, sum))
baseb <- base %>%
summarise(across(incoming, sum))
final <- full_join(basea, baseb, by = "district")
return(final)
}
aggregate_stocks(df, 3)
#> # A tibble: 3 × 3
#> district stock incoming
#> <chr> <dbl> <dbl>
#> 1 ARA 1533 9592
#> 2 BJI 355 12377
#> 3 CMC 333 4205
Here is the same solution as #Tom Hoel but without using a function to subset, instead just use []:
library(dplyr)
df %>%
group_by(district) %>%
summarise(stock = stock[3],
incoming = sum(incoming))
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
How can I scrape the data and add and additional column to show the year that it is scraped?
nba_drafts <- function(year) {
url <- glue("https://www.basketball-reference.com/draft/NBA_{year}.html")
tables<-read_html(url) %>%
html_nodes("#stats") %>%
html_table() %>%
as.tibble() %>%
add_column(year = year)
write.csv(tables, year, file = "nba_draftsR.csv", na ="")
}
2000:2017 %>%
walk(function(year) {
nba_drafts(year)
})
Error: Column 1 must be named.
Checked your code, the error is happening at the step highlighted in below code.
tables<-read_html(url) %>%
html_nodes("#stats") %>%
html_table() %>%
as.tibble() %>% # error is happening at this step
Debug Step:
The reason for this error is the first three columns names are balnks(""), which you need to assign first, then only you can change to tibble or data frame.
tables<-read_html(url) %>%
html_nodes("#stats") %>%
html_table() %>%
purrr::simplify() %>%
first()
names(tables)
[1] "" "" "" "Round 1" "Round 1" "" "Totals" "Totals" "Totals" "Totals" "Totals"
[12] "Shooting" "Shooting" "Shooting" "Per Game" "Per Game" "Per Game" "Per Game" "Advanced" "Advanced" "Advanced" "Advanced"
I have added a for loop to update the names
nba_drafts <- function(year) {
url <- glue("https://www.basketball-reference.com/draft/NBA_{year}.html")
tables<-read_html(url) %>%
html_nodes("#stats") %>%
html_table() %>%
purrr::simplify() %>%
first()
oldName<-names(tables)
#updating names with col_
for(i in 1:length(oldName)){
oldName[i]<- paste0("col_",i,oldName[i])
}
names(tables)<-oldName
tables<-tables %>%
as.tibble() %>%
add_column(year = year)
return(tables)
}
Output:
> nba_drafts("2019")
# A tibble: 63 x 23
col_1 col_2 col_3 `col_4Round 1` `col_5Round 1` col_6 col_7Totals col_8Totals col_9Totals col_10Totals col_11Totals
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Rk Pk Tm Player College Yrs G MP PTS TRB AST
2 1 1 NOP Zion Williams… Duke 1 19 565 448 129 41
3 2 2 MEM Ja Morant Murray State 1 59 1771 1041 208 409
4 3 3 NYK RJ Barrett Duke 1 56 1704 803 279 143
5 4 4 LAL De'Andre Hunt… Virginia 1 63 2018 778 286 112
6 5 5 CLE Darius Garland Vanderbilt 1 59 1824 728 111 229
df <- read.csv ('https://raw.githubusercontent.com/ulklc/covid19-
timeseries/master/countryReport/raw/rawReport.csv',
stringsAsFactors = FALSE)
I processed the dataset.
Can we find the day of the least death in the Asian region?
the important thing here;
is the sum of deaths of all countries in the asia region. Accordingly, it is to sort and find the day.
as output;
date region death
2020/02/17 asia 6300 (asia region sum)
The data in the output I created are examples. The data in the example are not real.
Since these are cumulative cases and deaths, we need to difference the data.
library(dplyr)
df %>%
mutate(day = as.Date(day)) %>%
filter(region=="Asia") %>%
group_by(day) %>%
summarise(deaths=sum(death)) %>%
mutate(d=c(first(deaths),diff(deaths))) %>%
arrange(d)
# A tibble: 107 x 3
day deaths d
<date> <int> <int>
1 2020-01-23 18 1 # <- this day saw only 1 death in the whole of Asia
2 2020-01-29 133 2
3 2020-02-21 2249 3
4 2020-02-12 1118 5
5 2020-01-24 26 8
6 2020-02-23 2465 10
7 2020-01-26 56 14
8 2020-01-25 42 16
9 2020-01-22 17 17
10 2020-01-27 82 26
# ... with 97 more rows
So the second day of records saw the least number of deaths recorded (so far).
Using the dplyr package for data treatment :
df <- read.csv ('https://raw.githubusercontent.com/ulklc/covid19-
timeseries/master/countryReport/raw/rawReport.csv',
stringsAsFactors = FALSE)
library(dplyr)
df_sum <- df %>% group_by(region,day) %>% # grouping by region and day
summarise(death=sum(death)) %>% # summing following the groups
filter(region=="Asia",death==min(death)) # keeping only minimum of Asia
Then you have :
> df_sum
# A tibble: 1 x 3
# Groups: region [1]
region day death
<fct> <fct> <int>
1 Asia 2020/01/22 17
I am selecting the top 10 destinations of flights, and how many flights went there. To achieve this I needed to use summarise, which throws away everything that I didn't mention in the group_by(..).
Later I need the column origin, but I no longer can retrieve this column, as it is discarded along with other columns. To keep the origin it seems that I would need to mention it in my group_by(..) but I don't want this, as my result would then be incorrect. How can I get the origin of these top 10 flights?
library(tidyverse)
library(nycflights13)
(newFlights<- flights %>%
group_by("Destination" = dest) %>%
summarise("AllFlights" = n()) %>%
arrange(desc(AllFlights)) %>% top_n(10))
You want to include origin in the call to group_by(). See documentation:
newFlights <- as.data.frame(flights %>%
group_by(origin, dest)%>%
summarize("AllFlights" = n()) %>%
arrange(desc(AllFlights)) %>%
top_n(10)
)
head(newFlights, 10)
Giving you:
origin dest AllFlights
1 JFK LAX 11262
2 LGA ATL 10263
3 LGA ORD 8857
4 JFK SFO 8204
5 LGA CLT 6168
6 EWR ORD 6100
7 JFK BOS 5898
8 LGA MIA 5781
9 JFK MCO 5464
10 EWR BOS 5327
I want to get the top 10 destinations, and also how many flights were made to these destinations. I am using summarise, and my problem is that summarise throws away all columns that a not mentioned in the summarise(..). I need to keep the column origin for later use.
library(tidyverse)
library(nycflights13)
flights %>%
group_by(dest) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>% head(10)
Here is the result from the code above
# A tibble: 10 x 2
dest allFlights
<chr> <int>
1 ORD 17283
2 ATL 17215
3 LAX 16174
4 BOS 15508
5 MCO 14082
6 CLT 14064
7 SFO 13331
8 FLL 12055
9 MIA 11728
10 DCA 9705
I think this is correct. But all I am missing, is another column that prints the origin
I was thinking about doing some a join to get the origin, but this doesn't make sense, as doing the join on this result set might not yield the correct flights.
I found this post: How to summarise all columns using group_by and summarise? but it was not helpful to me, as summarise is unable to find the columns I mention, that are not in its function.
When you sum the flights by destination, you are summing the total number of flights arriving in the destination city, which have many different origin cities. So it would not make sense for there to be a single value in the origin column here.
If you want, you could replace group_by(dest) with group_by(origin,dest). That would give you the top 10 pairs of origin-destination cities, which is a different output than in your question, but would retain the origin and destination columns for further analysis.
library(tidyverse)
library(nycflights13)
flights %>%
group_by(origin, dest) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>% head(10)
output
# A tibble: 10 x 3
# Groups: origin [3]
origin dest n
<chr> <chr> <int>
1 JFK LAX 11262
2 LGA ATL 10263
3 LGA ORD 8857
4 JFK SFO 8204
5 LGA CLT 6168
6 EWR ORD 6100
7 JFK BOS 5898
8 LGA MIA 5781
9 JFK MCO 5464
10 EWR BOS 5327