Misconception regarding Group_by and Summarize function in R[DPLYR Package] - r

I had to plot the graph of Fatalities per year. So I took out the
year from Date and then grouped by it and then I summarized so that I
get Fatalities per year. But when I run then it it gives me Fatalities throughout the dataset.
I don't understand why? And Any other alternate to get Fatalities per year.
In Dataset,Fatalities is given per incident and every year a lot of incidents happened.
crash_data=read.csv("https://raw.githubusercontent.com/gluque/analytics_task2/master/Airplane_Crashes_and_Fatalities_Since_1908.csv")
> crash_data$Date <- as.Date(crash_data$Date, "%m/%d/%Y")
> crash_data$Date <- format(crash_data$Date, '%Y')
> cd<-subset(crash_data,select = c(Fatalities,Date))
> ab<-group_by(cd,Date)
> ef<-summarize(ab,Fatalities=sum(Fatalities,na.rm = TRUE))
> ef
Fatalities
1 105479

> group_by(cd,Date) %>% summarize(Fatalities = sum(Fatalities, na.rm = TRUE))
# # A tibble: 98 x 2
# Date Fatalities
# <chr> <int>
# 1 1908 1
# 2 1912 5
# 3 1913 45
# 4 1915 40
# 5 1916 108
# 6 1917 124
# 7 1918 65
# 8 1919 5
# 9 1920 24
# 10 1921 68
# ... with 88 more rows

Related

Joining two data frames using range of values

I have two data sets I would like to join. The income_range data is the master dataset and I would like to join data_occ to the income_range data based on what band the income falls inside. Where there are more than two observations(incomes) that are within the range I would like to take the lower income.
I was attempting to use data.table but was having trouble. I was would also like to keep all columns from both data.frames if possible.
The output dataset should only have 7 observations.
library(data.table)
library(dplyr)
income_range <- data.frame(id = "France"
,inc_lower = c(10, 21, 31, 41,51,61,71)
,inc_high = c(20, 30, 40, 50,60,70,80)
,perct = c(1,2,3,4,5,6,7))
data_occ <- data.frame(id = rep(c("France","Belgium"), each=50)
,income = sample(10:80, 50)
,occ = rep(c("manager","clerk","manual","skilled","office"), each=20))
setDT(income_range)
setDT(data_occ)
First attempt.
df2 <- income_range [data_occ ,
on = .(id, inc_lower <= income, inc_high >= income),
.(id, income, inc_lower,inc_high,perct,occ)]
Thank you in advance.
Since you tagged dplyr, here's one possible solution using that library:
library('fuzzyjoin')
# join dataframes on id == id, inc_lower <= income, inc_high >= income
joined <- income_range %>%
fuzzy_left_join(data_occ,
by = c('id' = 'id', 'inc_lower' = 'income', 'inc_high' = 'income'),
match_fun = list(`==`, `<=`, `>=`)) %>%
rename(id = id.x) %>%
select(-id.y)
# sort by income, and keep only the first row of every unique perct
result <- joined %>%
arrange(income) %>%
group_by(perct) %>%
slice(1)
And the (intermediate) results:
> head(joined)
id inc_lower inc_high perct income occ
1 France 10 20 1 10 manager
2 France 10 20 1 19 manager
3 France 10 20 1 14 manager
4 France 10 20 1 11 manager
5 France 10 20 1 17 manager
6 France 10 20 1 12 manager
> result
# A tibble: 7 x 6
# Groups: perct [7]
id inc_lower inc_high perct income occ
<chr> <dbl> <dbl> <dbl> <int> <chr>
1 France 10 20 1 10 manager
2 France 21 30 2 21 manual
3 France 31 40 3 31 manual
4 France 41 50 4 43 manager
5 France 51 60 5 51 clerk
6 France 61 70 6 61 manager
7 France 71 80 7 71 manager
I've added the intermediate dataframe joined for easy of understanding. You can omit it and just chain the two command chains together with %>%.
Here is one data.table approach:
cols = c("inc_lower", "inc_high")
data_occ[, (cols) := income]
result = data_occ[order(income)
][income_range,
on = .(id, inc_lower>=inc_lower, inc_high<=inc_high),
mult="first"]
data_occ[, (cols) := NULL]
# id income occ inc_lower inc_high perct
# 1: France 10 clerk 10 20 1
# 2: France 21 manager 21 30 2
# 3: France 31 clerk 31 40 3
# 4: France 41 clerk 41 50 4
# 5: France 51 clerk 51 60 5
# 6: France 62 manager 61 70 6
# 7: France 71 manager 71 80 7

How to create a loop code from big dataframe in R?

I have a data series of daily snow depth values over a 60 year period. I would like to see the number of days with a snow depth higher than 30 cm for each season, for example from July 1980 to June 1981. What does the code for this have to look like? I know how I could calculate the daily values higher than 30 cm per season individually, but not how a code could calculate all seasons.
I have uploaded my dataframe on wetransfer: Dataframe
Thank you so much for your help in advance.
Pernilla
Something like this would work
library(dplyr)
library(lubridate)
df<-read.csv('BayrischerWald_Brennes_SH_daily_merged.txt', sep=';')
df_season <-df %>%
mutate(season=(Day %>% ymd() - days(181)) %>% floor_date("year") %>% year())
df_group_by_season <- df_season %>%
filter(!is.na(SHincm)) %>%
group_by(season) %>%
summarize(days_above_30=sum(SHincm>30)) %>%
ungroup()
df_group_by_season
#> # A tibble: 61 × 2
#> season days_above_30
#> <dbl> <int>
#> 1 1961 1
#> 2 1962 0
#> 3 1963 0
#> 4 1964 0
#> 5 1965 0
#> 6 1966 0
#> 7 1967 129
#> 8 1968 60
#> 9 1969 107
#> 10 1970 43
#> # … with 51 more rows
Created on 2022-01-15 by the reprex package (v2.0.1)
Here is an approach using the aggregate() function. After reading the data, convert the Date field to a date object and get rid of the rows with missing values for the date:
snow <- read.table("BayrischerWald_Brennes_SH_daily_merged.txt", header=TRUE, sep=";")
snow$Day <- as.Date(snow$Day)
str(snow)
# 'data.frame': 51606 obs. of 2 variables:
# $ Day : Date, format: "1961-11-01" "1961-11-02" "1961-11-03" "1961-11-04" ...
# $ SHincm: int 0 0 0 0 2 9 19 22 15 5 ...
snow <- snow[!is.na(snow$Day), ]
str(snow)
# 'data.frame': 21886 obs. of 2 variables:
# $ Day : Date, format: "1961-11-01" "1961-11-02" "1961-11-03" "1961-11-04" ...
# $ SHincm: int 0 0 0 0 2 9 19 22 15 5 ...
Notice more than half of your data has missing values for the date. Now we need to divide the data by ski season:
brks <- as.Date(paste(1961:2022, "07-01", sep="-"))
lbls <- paste(1961:2021, 1962:2022, sep="/")
snow$Season <- cut(snow$Day, breaks=brks, labels=lbls)
Now we use aggregate() to get the number of days with over 30 inches of snow:
days30cm <- aggregate(SHincm~Season, snow, subset=snow$SHincm > 30, length)
colnames(days30cm)[2] <- "Over30cm"
head(days30cm, 10)
# Season Over30cm
# 1 1961/1962 1
# 2 1967/1968 129
# 3 1968/1969 60
# 4 1969/1970 107
# 5 1970/1971 43
# 6 1972/1973 101
# 7 1973/1974 119
# 8 1974/1975 188
# 9 1975/1976 126
# 10 1976/1977 112
In addition, you can get other statistics such as the maximum snow of the season or the total cm of snow:
maxsnow <- aggregate(SHincm~Season, snow, max)
totalsnow <- aggregate(SHincm~Season, snow, sum)

Create a table out of a tibble

I do have the following dataframe with 45 million observations:
year month variable
1992 1 0
1992 1 1
1992 1 1
1992 2 0
1992 2 1
1992 2 0
My goal is to count the frequency of the variable for each month of a year.
I was already able to generate these sums with cps_data as my dataframe and SKILL_1 as my variable.
cps_data %>%
group_by(YEAR, MONTH) %>%
summarise_at(vars(SKILL_1),
list(name = sum))
Logically, I obtained 348 different rows as a tibble. Now, I struggle to create a new table with these values. My new table should look similar to my tibble. How can I do that? Is there even a way? I've already tried to read in an excel file with a date range from 01/1992 - 01/2021 in order to obtain exactly 349 rows and then merge it with the rows of the tibble, but it did not work..
# A tibble: 349 x 3
# Groups: YEAR [30]
YEAR MONTH name
<dbl> <int+lbl> <dbl>
1 1992 1 [January] 499
2 1992 2 [February] 482
3 1992 3 [March] 485
4 1992 4 [April] 457
5 1992 5 [May] 434
6 1992 6 [June] 470
7 1992 7 [July] 450
8 1992 8 [August] 438
9 1992 9 [September] 442
10 1992 10 [October] 427
# ... with 339 more rows
many thanks in advance!!
library(zoo)
createmonthyear <- function(start_date,end_date){
ym <- seq(as.yearmon(start_date), as.yearmon(end_date), 1/12)
data.frame(start = pmax(start_date, as.Date(ym)),
end = pmin(end_date, as.Date(ym, frac = 1)),
month = month.name[cycle(ym)],
year = as.integer(ym),
stringsAsFactors = FALSE)}
Once you create the function, you can specify the start and end date you want:
left_table <- data.frame(createmonthyear(1991-01-01,2021-01-01))
then left join the output with what you have
library(dplyr)
right_table <- data.frame(cps_data %>%
group_by(YEAR, MONTH) %>%
summarise_at(vars(SKILL_1),
list(name = sum)))
results <- left_join(left_table, right_table, by = c("Year" = "year", "Month" = "month")

Using dplyr mutate function to create new variable conditionally based on current row

I am working on creating conditional averages for a large data set that involves # of flu cases seen during the week for several years. The data is organized as such:
What I want to do is create a new column that tabulates that average number of cases for that same week in previous years. For instance, for the row where Week.Number is 1 and Flu.Year is 2017, I would like the new row to give the average count for any year with Week.Number==1 & Flu.Year<2017. Normally, I would use the case_when() function to conditionally tabulate something like this. For instance, when calculating the average weekly volume I used this code:
mutate(average = case_when(
Flu.Year==2016 ~ mean(chcc$count[chcc$Flu.Year==2016]),
Flu.Year==2017 ~ mean(chcc$count[chcc$Flu.Year==2017]),
Flu.Year==2018 ~ mean(chcc$count[chcc$Flu.Year==2018]),
Flu.Year==2019 ~ mean(chcc$count[chcc$Flu.Year==2019]),
),
However, since there are four years of data * 52 weeks which is a lot of iterations to spell out the conditions for. Is there a way to elegantly code this in dplyr? The problem I keep running into is that I want to call values in counts column based on Week.Number and Flu.Year values in other rows conditioned on the current value of Week.Number and Flu.Year, and I am not sure how to accomplish that. Please let me know if there is further information / detail I can provide.
Thanks,
Steven
dat <- tibble( Flu.Year = rep(2016:2019,each = 52), Week.Number = rep(1:52,4), count = sample(1000, size=52*4, replace=TRUE) )
It's bad-form and, in some cases, an error when you use $-indexing within dplyr verbs.
I think a better way to get that average field is to group_by(Flu.Year) and calculate it straight-up.
library(dplyr)
set.seed(42)
dat <- tibble(
Flu.Year = sample(2016:2020, size=100, replace=TRUE),
count = sample(1000, size=100, replace=TRUE)
)
dat %>%
group_by(Flu.Year) %>%
mutate(average = mean(count)) %>%
# just to show a quick summary
slice(1:3) %>%
ungroup()
# # A tibble: 15 x 3
# Flu.Year count average
# <int> <int> <dbl>
# 1 2016 734 578.
# 2 2016 356 578.
# 3 2016 411 578.
# 4 2017 217 436.
# 5 2017 453 436.
# 6 2017 920 436.
# 7 2018 963 558
# 8 2018 609 558
# 9 2018 536 558
# 10 2019 943 543.
# 11 2019 740 543.
# 12 2019 536 543.
# 13 2020 627 494.
# 14 2020 218 494.
# 15 2020 389 494.
An alternative approach is to generate a summary table (just one row per year) and join it back in to the original data.
dat %>%
group_by(Flu.Year) %>%
summarize(average = mean(count))
# # A tibble: 5 x 2
# Flu.Year average
# <int> <dbl>
# 1 2016 578.
# 2 2017 436.
# 3 2018 558
# 4 2019 543.
# 5 2020 494.
dat %>%
group_by(Flu.Year) %>%
summarize(average = mean(count)) %>%
full_join(dat, by = "Flu.Year")
# # A tibble: 100 x 3
# Flu.Year average count
# <int> <dbl> <int>
# 1 2016 578. 734
# 2 2016 578. 356
# 3 2016 578. 411
# 4 2016 578. 720
# 5 2016 578. 851
# 6 2016 578. 822
# 7 2016 578. 465
# 8 2016 578. 679
# 9 2016 578. 30
# 10 2016 578. 180
# # ... with 90 more rows
The result, after chat:
tibble( Flu.Year = rep(2016:2018,each = 3), Week.Number = rep(1:3,3), count = 1:9 ) %>%
arrange(Flu.Year, Week.Number) %>%
group_by(Week.Number) %>%
mutate(year_week.average = lag(cumsum(count) / seq_along(count)))
# # A tibble: 9 x 4
# # Groups: Week.Number [3]
# Flu.Year Week.Number count year_week.average
# <int> <int> <int> <dbl>
# 1 2016 1 1 NA
# 2 2016 2 2 NA
# 3 2016 3 3 NA
# 4 2017 1 4 1
# 5 2017 2 5 2
# 6 2017 3 6 3
# 7 2018 1 7 2.5
# 8 2018 2 8 3.5
# 9 2018 3 9 4.5
We can use aggregate from base R
aggregate(count ~ Flu.Year, data, FUN = mean)

aggregation of the region's values ​in the dataset

df <- read.csv ('https://raw.githubusercontent.com/ulklc/covid19-
timeseries/master/countryReport/raw/rawReport.csv',
stringsAsFactors = FALSE)
I processed the dataset.
Can we find the day of the least death in the Asian region?
the important thing here;
 is the sum of deaths of all countries in the asia region. Accordingly, it is to sort and find the day.
as output;
date region death
2020/02/17 asia 6300 (asia region sum)
The data in the output I created are examples. The data in the example are not real.
Since these are cumulative cases and deaths, we need to difference the data.
library(dplyr)
df %>%
mutate(day = as.Date(day)) %>%
filter(region=="Asia") %>%
group_by(day) %>%
summarise(deaths=sum(death)) %>%
mutate(d=c(first(deaths),diff(deaths))) %>%
arrange(d)
# A tibble: 107 x 3
day deaths d
<date> <int> <int>
1 2020-01-23 18 1 # <- this day saw only 1 death in the whole of Asia
2 2020-01-29 133 2
3 2020-02-21 2249 3
4 2020-02-12 1118 5
5 2020-01-24 26 8
6 2020-02-23 2465 10
7 2020-01-26 56 14
8 2020-01-25 42 16
9 2020-01-22 17 17
10 2020-01-27 82 26
# ... with 97 more rows
So the second day of records saw the least number of deaths recorded (so far).
Using the dplyr package for data treatment :
df <- read.csv ('https://raw.githubusercontent.com/ulklc/covid19-
timeseries/master/countryReport/raw/rawReport.csv',
stringsAsFactors = FALSE)
library(dplyr)
df_sum <- df %>% group_by(region,day) %>% # grouping by region and day
summarise(death=sum(death)) %>% # summing following the groups
filter(region=="Asia",death==min(death)) # keeping only minimum of Asia
Then you have :
> df_sum
# A tibble: 1 x 3
# Groups: region [1]
region day death
<fct> <fct> <int>
1 Asia 2020/01/22 17

Resources