Aggregation of the region's values in the dataset - R

df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
stringsAsFactors = FALSE)
I have loaded the dataset as shown above.
How can I find the day with the fewest deaths in the Asia region?
The important point is that the deaths of all countries in the Asia region must be summed per day; the days are then sorted to find the minimum.
Expected output:
date region death
2020/02/17 asia 6300 (sum over the Asia region)
The values in this sample output are illustrative, not real data.

Since these are cumulative cases and deaths, we need to difference the data.
library(dplyr)
df %>%
mutate(day = as.Date(day)) %>%
filter(region=="Asia") %>%
group_by(day) %>%
summarise(deaths=sum(death)) %>%
mutate(d=c(first(deaths),diff(deaths))) %>%
arrange(d)
# A tibble: 107 x 3
day deaths d
<date> <int> <int>
1 2020-01-23 18 1 # <- this day saw only 1 death in the whole of Asia
2 2020-01-29 133 2
3 2020-02-21 2249 3
4 2020-02-12 1118 5
5 2020-01-24 26 8
6 2020-02-23 2465 10
7 2020-01-26 56 14
8 2020-01-25 42 16
9 2020-01-22 17 17
10 2020-01-27 82 26
# ... with 97 more rows
So the second day of records saw the least number of deaths recorded (so far).
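To make the differencing step concrete, here is a minimal base-R sketch. It uses only the first five cumulative totals shown in the output above (17, 18, 26, 42, 56); the variable names `cumulative`, `daily` and `min_day` are chosen for illustration.

```r
# First five cumulative totals for Asia, copied from the output above
days <- as.Date("2020-01-22") + 0:4
cumulative <- c(17, 18, 26, 42, 56)

# The first day's total counts as that day's deaths; diff() gives the rest
daily <- c(cumulative[1], diff(cumulative))

# Day with the fewest new deaths among these five days
min_day <- days[which.min(daily)]

print(data.frame(day = days, cumulative = cumulative, daily = daily))
print(min_day)
```

Taking the minimum of `cumulative` directly would always return the first day, because a cumulative series never decreases.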

Using the dplyr package for data treatment :
df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
stringsAsFactors = FALSE)
library(dplyr)
df_sum <- df %>% group_by(region,day) %>% # grouping by region and day
summarise(death=sum(death)) %>% # summing following the groups
filter(region=="Asia",death==min(death)) # keeping only minimum of Asia
Then you have :
> df_sum
# A tibble: 1 x 3
# Groups: region [1]
region day death
<fct> <fct> <int>
1 Asia 2020/01/22 17

Related

How to create a loop code from big dataframe in R?

I have a data series of daily snow depth values over a 60 year period. I would like to see the number of days with a snow depth higher than 30 cm for each season, for example from July 1980 to June 1981. What does the code for this have to look like? I know how I could calculate the daily values higher than 30 cm per season individually, but not how a code could calculate all seasons.
I have uploaded my dataframe on wetransfer: Dataframe
Thank you so much for your help in advance.
Pernilla
Something like this would work
library(dplyr)
library(lubridate)
df<-read.csv('BayrischerWald_Brennes_SH_daily_merged.txt', sep=';')
df_season <-df %>%
mutate(season=(Day %>% ymd() - days(181)) %>% floor_date("year") %>% year())
df_group_by_season <- df_season %>%
filter(!is.na(SHincm)) %>%
group_by(season) %>%
summarize(days_above_30=sum(SHincm>30)) %>%
ungroup()
df_group_by_season
#> # A tibble: 61 × 2
#> season days_above_30
#> <dbl> <int>
#> 1 1961 1
#> 2 1962 0
#> 3 1963 0
#> 4 1964 0
#> 5 1965 0
#> 6 1966 0
#> 7 1967 129
#> 8 1968 60
#> 9 1969 107
#> 10 1970 43
#> # … with 51 more rows
Created on 2022-01-15 by the reprex package (v2.0.1)
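The season label in the code above comes from shifting each date back by 181 days and taking the year. A base-R sketch of the same idea, written as an explicit month test, might look like this. The dates are hypothetical and not taken from the snow file.

```r
# Hypothetical dates around a season boundary (not from the snow file)
day <- as.Date(c("1980-06-30", "1980-07-01", "1980-12-24",
                 "1981-06-30", "1981-07-01"))

yr  <- as.integer(format(day, "%Y"))
mon <- as.integer(format(day, "%m"))

# A season starting on 1 July is labelled by the year in which it starts
season <- ifelse(mon >= 7, yr, yr - 1L)

print(data.frame(day, season))
```

The fixed 181-day offset lines up with 1 July in non-leap years; in a leap year it would place 30 June in the new season. That is unlikely to matter for snow depth, but the month test avoids the question.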
Here is an approach using the aggregate() function. After reading the data, convert the Date field to a date object and get rid of the rows with missing values for the date:
snow <- read.table("BayrischerWald_Brennes_SH_daily_merged.txt", header=TRUE, sep=";")
snow$Day <- as.Date(snow$Day)
str(snow)
# 'data.frame': 51606 obs. of 2 variables:
# $ Day : Date, format: "1961-11-01" "1961-11-02" "1961-11-03" "1961-11-04" ...
# $ SHincm: int 0 0 0 0 2 9 19 22 15 5 ...
snow <- snow[!is.na(snow$Day), ]
str(snow)
# 'data.frame': 21886 obs. of 2 variables:
# $ Day : Date, format: "1961-11-01" "1961-11-02" "1961-11-03" "1961-11-04" ...
# $ SHincm: int 0 0 0 0 2 9 19 22 15 5 ...
Notice more than half of your data has missing values for the date. Now we need to divide the data by ski season:
brks <- as.Date(paste(1961:2022, "07-01", sep="-"))
lbls <- paste(1961:2021, 1962:2022, sep="/")
snow$Season <- cut(snow$Day, breaks=brks, labels=lbls)
Now we use aggregate() to get the number of days with more than 30 cm of snow:
days30cm <- aggregate(SHincm~Season, snow, subset=snow$SHincm > 30, length)
colnames(days30cm)[2] <- "Over30cm"
head(days30cm, 10)
# Season Over30cm
# 1 1961/1962 1
# 2 1967/1968 129
# 3 1968/1969 60
# 4 1969/1970 107
# 5 1970/1971 43
# 6 1972/1973 101
# 7 1973/1974 119
# 8 1974/1975 188
# 9 1975/1976 126
# 10 1976/1977 112
In addition, you can get other statistics such as the maximum snow of the season or the total cm of snow:
maxsnow <- aggregate(SHincm~Season, snow, max)
totalsnow <- aggregate(SHincm~Season, snow, sum)
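Because the subset argument drops rows before aggregating, seasons without a single day above 30 cm do not appear in days30cm (the output above jumps from 1961/1962 to 1967/1968). If those seasons should show up with a count of 0, one possible approach is tapply() on a logical vector. The mini data set below is hypothetical.

```r
# Hypothetical mini data set: two seasons, one of which never exceeds 30 cm
snow_toy <- data.frame(
  Season = factor(c("1961/1962", "1961/1962", "1962/1963", "1962/1963")),
  SHincm = c(35, 10, 5, 20)
)

# tapply() visits every season present in the data, so a season with
# no day above 30 cm gets a count of 0 instead of being dropped
over30 <- tapply(snow_toy$SHincm > 30, snow_toy$Season, sum)

print(over30)
```

With the real data, missing SHincm values would need na.rm = TRUE passed through to sum().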

How to look up date value from same table based on a condition in R?

I am not sure whether this task calls for a self join. For each State.UnionTerritory in the data frame below, I am trying to look up the latest date on which Daily_confirmed was <= half of its value on the current date.
This will give me the doubling time of cases for each state on each date.
library(tidyverse)
library(lubridate)
df_ind <- read_csv("https://raw.githubusercontent.com/johnsnow09/covid19-df_stack-code/main/sample_data.csv")
df_ind %>% head()
# output
Date State.UnionTerritory Daily_confirmed
<date> <chr> <dbl>
1 2021-12-23 Haryana 46
2 2021-12-23 Maharashtra 1179
3 2021-12-23 Delhi 118
4 2021-12-22 Haryana 55
5 2021-12-22 Maharashtra 1201
6 2021-12-22 Delhi 125
For example, Delhi has 118 cases on 2021-12-23, and the latest earlier date on which Delhi had at most half of that is 2021-12-15 with 57 cases, so the doubling time would be 2021-12-23 - 2021-12-15 = 8 days.
so I should get something like:
This should be applied for each State in the data & on all dates.
df_ind <- df_ind %>%
mutate(Daily_confirmed_half = as.integer(Daily_confirmed / 2) )
I am not sure how exactly I can try this to get proper Dates as the Doubling_Date column where Daily_confirmed Cases values met the condition of <= half.
I can group summarize & use first to pull the latest dates but not sure what would be the efficient way of bringing the right dates in another column of this data frame.
I'm not sure if this is your desired output. The approach uses a full_join together with filter to simulate a non-equi join in dplyr. Then we do some data cleaning, and in the last step we need a left_join to the original data, because for quite a few days the doubling date cannot be calculated: it is not included in the time series.
library(tidyverse)
library(lubridate)
df_ind %>%
group_by(State.UnionTerritory) %>%
full_join(., ., by = c("State.UnionTerritory")) %>%
filter(Date.x > Date.y,
Daily_confirmed.x > (Daily_confirmed.y * 2)) %>%
group_by(Date.x, State.UnionTerritory) %>%
filter(Date.y == max(Date.y)) %>%
filter(Daily_confirmed.y == max(Daily_confirmed.y)) %>%
rename("Date" = Date.x,
"Daily_confirmed" = Daily_confirmed.x,
"Doubling_Date" = Date.y) %>%
select(- Daily_confirmed.y) %>%
mutate(Day_to_double = Date - Doubling_Date) %>%
left_join(df_ind,
.,
by = c("Date", "State.UnionTerritory", "Daily_confirmed")) %>%
arrange(State.UnionTerritory, desc(Date))
#> # A tibble: 252 x 5
#> Date State.UnionTerritory Daily_confirmed Doubling_Date Day_to_double
#> <date> <chr> <dbl> <date> <drtn>
#> 1 2021-12-23 Delhi 118 2021-12-15 8 days
#> 2 2021-12-22 Delhi 125 2021-12-15 7 days
#> 3 2021-12-21 Delhi 102 2021-12-14 7 days
#> 4 2021-12-20 Delhi 91 2021-12-14 6 days
#> 5 2021-12-19 Delhi 107 2021-12-14 5 days
#> 6 2021-12-18 Delhi 86 2021-12-13 5 days
#> 7 2021-12-17 Delhi 69 2021-12-13 4 days
#> 8 2021-12-16 Delhi 85 2021-12-13 3 days
#> 9 2021-12-15 Delhi 57 2021-11-27 18 days
#> 10 2021-12-14 Delhi 45 2021-11-15 29 days
#> # … with 242 more rows
Created on 2021-12-25 by the reprex package (v0.3.0)
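The lookup itself can also be sketched without a join. The following base-R example handles a single state and uses made-up counts; `find_doubling_date` is a hypothetical helper, and it applies the question's "<= half" condition rather than the strict inequality in the code above.

```r
# Hypothetical daily counts for one state (values are made up)
toy <- data.frame(
  Date = as.Date("2021-12-01") + 0:5,
  Daily_confirmed = c(10, 12, 15, 21, 24, 31)
)

# For row i, return the latest earlier date whose count is at most
# half of row i's count, or NA if there is no such date
find_doubling_date <- function(i, d) {
  earlier <- d$Date < d$Date[i] &
    d$Daily_confirmed <= d$Daily_confirmed[i] / 2
  if (any(earlier)) max(d$Date[earlier]) else as.Date(NA)
}

toy$Doubling_Date <- do.call(
  c, lapply(seq_len(nrow(toy)), find_doubling_date, d = toy)
)
toy$Days_to_double <- as.numeric(toy$Date - toy$Doubling_Date)

print(toy)
```

For several states, the same helper could be applied to each piece of split(df_ind, df_ind$State.UnionTerritory). This compares every row with every earlier row, so it is a sketch for small data rather than an efficient solution.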

How to conditionally mutate a new column when data is in long format, and condition is depending on grouping combination

I have data in long format, and I'm trying to test each row against the mean of a certain grouping combination, in order to generate a new column with the conclusion from that test.
Example
In this toy example, I have data about 20 cars. Each car could be of one of three possible makers. We have mpg data for each car, measured 8 times: in the city or highway, in the morning or evening, during the winter or spring.
library(tidyr)
set.seed(2021)
df_id_and_makers <-
data.frame(id = 1:20,
maker = sample(c("toyota", "audi", "bmw"), size = 20, replace = TRUE))
df <- tidyr::expand_grid(df_id_and_makers,
road_type = c("city", "highway"),
time_of_day = c("morning", "evening"),
season = c("winter", "spring"))
df$mpg_val <- sample(15:40, size = nrow(df), replace = TRUE)
df
#> # A tibble: 160 x 6
#> id maker road_type time_of_day season mpg_val
#> <int> <chr> <chr> <chr> <chr> <int>
#> 1 1 bmw city morning winter 28
#> 2 1 bmw city morning spring 22
#> 3 1 bmw city evening winter 40
#> 4 1 bmw city evening spring 18
#> 5 1 bmw highway morning winter 19
#> 6 1 bmw highway morning spring 36
#> 7 1 bmw highway evening winter 30
#> 8 1 bmw highway evening spring 16
#> 9 2 audi city morning winter 33
#> 10 2 audi city morning spring 18
#> # ... with 150 more rows
Created on 2021-07-07 by the reprex package (v2.0.0)
I want to analyze this data to test my hypothesis that mpg in city is larger than mpg in highway. To this end, I want to create a new column that tests whether the value in mpg_val when road_type is city is larger than the mean of mpg_val across rows where road_type is highway. Furthermore, I want to compare just among cars of the same makers.
So, for example, id = 1 is bmw, and therefore the new column I want to compute should test each value of mpg_val in rows where road_type == city (i.e., rows 1-4, but not 5-8), and see whether mpg_val is larger than mean(mpg_val) in rows where road_type == highway and maker == bmw.
Expected output
Here's the manual and dumb way of doing this. I'll show only how I do this for maker = bmw for the sake of demonstration.
library(dplyr)
# step 1 -- calculate the mean of `mpg_val` for `road_type = highway` and only across bmw
mean_bmw_highway_mpg <-
df %>%
filter(maker == "bmw",
road_type == "highway") %>%
pull(mpg_val) %>%
mean()
mean_bmw_highway_mpg
## [1] 26.22222
# step 2 -- compare each row where `maker = "bmw"` and `road_type = "city"` for its `mpg_val` against `mean_bmw_highway_mpg`
result_bmw_only <-
df %>%
mutate(is_mpg_city_larger_than_mpg_highway = case_when(maker != "bmw" ~ "not_relevant",
road_type != "city" ~ "not_relevant",
mpg_val > mean_bmw_highway_mpg ~ "yes",
TRUE ~ "no"))
result_bmw_only
## # A tibble: 160 x 7
## id maker road_type time_of_day season mpg_val is_mpg_city_larger_than_mpg_highway
## <int> <chr> <chr> <chr> <chr> <int> <chr>
## 1 1 bmw city morning winter 28 yes ## because 28 > 26.222
## 2 1 bmw city morning spring 22 no ## because 22 < 26.222
## 3 1 bmw city evening winter 40 yes
## 4 1 bmw city evening spring 18 no
## 5 1 bmw highway morning winter 19 not_relevant
## 6 1 bmw highway morning spring 36 not_relevant
## 7 1 bmw highway evening winter 30 not_relevant
## 8 1 bmw highway evening spring 16 not_relevant
## 9 2 audi city morning winter 33 not_relevant
## 10 2 audi city morning spring 18 not_relevant
## # ... with 150 more rows
How could I achieve the same result as result_bmw_only (but applied to the entire df) in a more elegant way? Hopefully using dplyr approach, because this is what I'm used to, but otherwise any method will do.
Thanks!
EDIT 1
One solution I could think of involves purrr, but I can't get this done yet.
library(purrr)
solution_purrr <-
df %>%
group_by(maker) %>%
nest(data = -maker) %>%
mutate(tbl_with_desired_new_col = map(.x = data,
.f = ~ .x %>%
mutate(is_mpg_city_lrgr_thn_mpg_hwy = case_when(road_type != "city" ~ "not_relevant",
mpg_val > mean(mpg_val) ~ "yes",
TRUE ~ "no"))))
It seems that solution_purrr gets the desired output, but not exactly. This is because the second logic in case_when (i.e., mpg_val > mean(mpg_val) ~ "yes") is not what I want. I want to compare mpg_val to mean(mpg_val) when that mean is computed based only on rows where road_type == "highway". But here mean(mpg_val) computes across all rows.
EDIT 2
Based on @Till's answer below, I'd like to clarify that I'm looking for a solution that avoids a separate calculation of the mean we want to test against. What I did above with mean_bmw_highway_mpg is the undesired way of working towards the output. I showed mean_bmw_highway_mpg only for demonstrating the kind of mean I need to calculate.
What you tried is already close. Take a look at the documentation of dplyr::group_by()
it is designed for these kinds of operations.
Below is how you can expand your BMW-only solution to the full dataset using group_by().
library(tidyverse)
mean_highway_mpg_df <-
df %>%
filter(road_type == "highway") %>%
group_by(maker) %>%
summarise(mean_highway_mpg = mean(mpg_val))
result_df <-
df %>%
filter(road_type == "city") %>%
group_by(maker) %>%
left_join(mean_highway_mpg_df) %>%
mutate(mpg_city_higher_highway = mpg_val > mean_highway_mpg)
#> Joining, by = "maker"
result_df %>%
select(-(time_of_day:season))
#> # A tibble: 80 x 6
#> # Groups: maker [3]
#> id maker road_type mpg_val mean_highway_mpg mpg_city_higher_highway
#> <int> <chr> <chr> <int> <dbl> <lgl>
#> 1 1 bmw city 28 26.2 TRUE
#> 2 1 bmw city 22 26.2 FALSE
#> 3 1 bmw city 40 26.2 TRUE
#> 4 1 bmw city 18 26.2 FALSE
#> 5 2 audi city 33 28.1 TRUE
#> 6 2 audi city 18 28.1 FALSE
#> 7 2 audi city 35 28.1 TRUE
#> 8 2 audi city 36 28.1 TRUE
#> 9 3 audi city 25 28.1 FALSE
#> 10 3 audi city 32 28.1 TRUE
#> # … with 70 more rows
I think I got this. The following solution is based on both my EDIT 1 above, as well as @MrFlick's comment here.
First, we define a helper function:
is_x_larger_than_mean_y <- function(x, y) {
x > mean(y)
}
Then, we run:
library(dplyr)
library(purrr)
library(tidyr)
df %>%
group_by(maker) %>%
nest(data = -maker) %>%
mutate(tbl_with_desired_new_col = map(.x = data,
.f = ~ .x %>%
mutate(is_mpg_city_lrgr_thn_mpg_hwy = case_when(road_type != "city" ~ "not_relevant",
is_x_larger_than_mean_y(mpg_val, mpg_val[road_type == "highway"]) ~ "yes",
TRUE ~ "no")))) %>%
select(-data) %>%
unnest(cols = tbl_with_desired_new_col)
This way, the line within case_when() that says is_x_larger_than_mean_y(mpg_val, mpg_val[road_type == "highway"]) ~ "yes" ensures that we compute the mean of mpg_val only based on rows in which road_type == "highway".
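The key idea, subsetting mpg_val to the highway rows before taking the mean, does not depend on nesting. As a rough illustration, here is a base-R sketch on a made-up miniature of the data; `cars_toy`, `hw_mean` and `city_gt_highway` are hypothetical names.

```r
# Hypothetical miniature of the car data (values are made up)
cars_toy <- data.frame(
  maker     = c("bmw", "bmw", "bmw", "bmw", "audi", "audi", "audi", "audi"),
  road_type = c("city", "city", "highway", "highway",
                "city", "city", "highway", "highway"),
  mpg_val   = c(28, 22, 20, 30, 33, 18, 26, 30)
)

# Mean highway mpg per maker, as a named vector
hw <- cars_toy[cars_toy$road_type == "highway", ]
hw_mean <- tapply(hw$mpg_val, hw$maker, mean)

# Look up each row's maker mean by name, then label the city rows
maker_mean <- as.vector(hw_mean[as.character(cars_toy$maker)])
cars_toy$city_gt_highway <- ifelse(
  cars_toy$road_type != "city", "not_relevant",
  ifelse(cars_toy$mpg_val > maker_mean, "yes", "no")
)

print(cars_toy)
```

In dplyr, the same comparison could presumably be written inside a grouped mutate() after group_by(maker), using mean(mpg_val[road_type == "highway"]) in place of the lookup.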

How to create rate on R

I want to change my data so that it gives me the rate of pedestrians to that states population. I am using a linear model and my summary values look like this:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.087061 0.029876 2.914 0.00438 **
intersection 0.009192 0.003086 2.978 0.00362 **
Here, my beta value for intersection is 0.009192, which is not meaningful on its own: without scaling by population, the value for one state cannot be compared with that of a state with a smaller population.
Below is a condensed version of my data without all the columns I use, but here is the link to the CSV in case someone wants to download it.
> head(c)
# A tibble: 6 x 15
STATE STATENAME PEDS PERSONS PERMVIT PERNOTMVIT COUNTY COUNTYNAME CITY DAY MONTH YEAR LATITUDE LONGITUD
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 Alabama 0 3 3 0 81 LEE (81) 2340 7 2 2019 32.7 -85.3
2 1 Alabama 0 2 2 0 55 ETOWAH (55) 1280 23 1 2019 34.0 -86.1
3 1 Alabama 0 4 4 0 29 CLEBURNE (29) 0 22 1 2019 33.7 -85.4
4 1 Alabama 1 1 1 1 55 ETOWAH (55) 2562 22 1 2019 34.0 -86.1
5 1 Alabama 0 1 1 0 3 BALDWIN (3) 0 18 1 2019 30.7 -87.8
6 1 Alabama 0 2 2 0 85 LOWNDES (85) 0 7 1 2019 32.2 -86.4
# … with 1 more variable: FATALS <dbl>
Here is the code I have that runs through the process I am doing. I don't see how I can change it so that each value is a rate (values like peds or type_int)
#Libraries
rm(list=ls()) # this is to clear anything in memory
library(leaflet)
library(tidyverse)
library(ggmap)
library(leaflet.extras)
library(htmltools)
library(ggplot2)
library(maps)
library(mapproj)
library(mapdata)
library(zoo)
library(tsibble)
setwd("~/Desktop/Statistics790/DataSets/FARS2019NationalCSV")
df <- read.csv("accident.csv")
state <- unique(df$STATE)
for(i in state){
df1<- df %>%
filter(STATE==i) %>%
dplyr::select(c(STATE,PEDS,DAY,MONTH,YEAR,TYP_INT)) %>%
mutate(date = as.Date(as.character(paste(YEAR, MONTH, DAY, sep = "-"),"%Y-%m-%d"))) %>% # create a date
group_by(date) %>% # Group by State id and date
# summarise_at(.vars = vars(PEDS), sum)
summarise(pedday=sum(PEDS),intersection=mean(TYP_INT))
#ts1<-ts(df,start=c(2019,1,1), frequency=365)
setwd("~/Desktop/Statistics790/States_ts/figures")
plots<-df1 %>%
ggplot()+
geom_line(aes(x=date,y=pedday))+ylim(0,13)+
theme_bw()
ggsave(paste0("state_",i,".png"),width=8,height=6, )
ts1<-ts(df1,start=c(2019,1,1), frequency=365)
setwd("~/Desktop/Statistics790/States_ts")
ts1 %>% write.csv(paste0("state_",i,".csv"),row.names = F)
#Plots
}
#date1<- as.character(df$date)
#df1<- df%>% filter(STATE=="1")
#ts2<-xts(df,order.by = as.Date(df$date,"%Y-%m-%d"))
setwd("~/Desktop/Statistics790/States_ts")
cat("\f")
#df <- read.csv(paste0("state_1.csv"))
#print("------Linear Model------")
#summary(lm(pedday~weather,data=df))
for(i in state){
print(paste0("-------------------------Analysis for State: ",i," -------------------------------"))
df <- read.csv(paste0("state_",i,".csv"))
print("------Linear Model------")
print(summary(lm(pedday~intersection,data=df)))
}
Collating my answers from the comments: you need to get state population data from an outside source such as the US Census (https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html#par_textimage_1574439295), read it in, join it to your dataset, and then calculate the rate as pedestrians per population, scaled for ease of reading on the graph. You can make your code faster by taking some of your calculations out of the loop. The code below assumes the census data is called 'census.csv' and has columns 'Geographic Area' for state and 'X2019' for the most recent population data available. Note that read.csv() with its default check.names = TRUE turns the header 'Geographic Area' into Geographic.Area, which is the name used in the join. <scale> is a placeholder for a constant of your choice.
library(dplyr)
pop <- read.csv('census.csv')
df <- read.csv('accident.csv') %>%
left_join(pop, by = c('STATENAME' = 'Geographic.Area')) %>%
mutate(rate = (PEDS / X2019) * <scale>) %>% # replace <scale> with a numeric constant
mutate(date = as.Date(paste(YEAR, MONTH, DAY, sep = "-"), format = "%Y-%m-%d"))
The left_join will match state names and give each row a population value depending on its state, regardless of how many rows there are.
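As a small self-contained illustration of the join-then-divide step, here is a base-R sketch. All numbers, including the populations, are made up, and the column name `population` and the per-100,000 scaling are assumptions for the example rather than anything taken from the census file.

```r
# Hypothetical inputs: the counts and populations below are made up
accidents <- data.frame(
  STATENAME = c("Alabama", "Alabama", "Alaska"),
  PEDS      = c(1, 2, 1)
)
pop <- data.frame(
  STATENAME  = c("Alabama", "Alaska"),
  population = c(5000000, 700000)
)

# Sum pedestrians per state, attach the population,
# and express the result per 100,000 residents
peds_by_state <- aggregate(PEDS ~ STATENAME, data = accidents, FUN = sum)
rates <- merge(peds_by_state, pop, by = "STATENAME", all.x = TRUE)
rates$peds_per_100k <- rates$PEDS / rates$population * 1e5

print(rates)
```

Aggregating before dividing gives one rate per state; for a daily series the same division would be applied after grouping by state and date.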

Create ID for specific sequence of consecutive days based on grouping variable in R

For a list of events at the country-day level, I would like to create a unique ID for each sequence of consecutive days in a specific country (if two or more event days in a country are consecutive, they share one ID), so that I can ultimately reduce the data frame to sequences of events rather than event days.
I did not manage to aggregate the data based on the sequence of events. I believe this response is similar (Creating groups of consecutive days meeting a given criteria) however it is in SQL.
The data has the following format:
country <- c("Angola","Angola","Angola","Angola","Angola", "Benin","Benin","Benin","Benin","Benin","Benin")
event_date <- as.Date(c("2017-06-16", "2017-06-17", "2017-06-18", "2017-08-22", "2017-08-23", "2019-04-18", "2019-04-19", "2019-04-20", "2018-03-15", "2018-03-16", "2016-03-17"))
mydata <- data.frame(country, event_date)
In the output, I expect to have a new column with the ID that is unique to each series of events in a country:
seq.ID <- c(1,1,1,2,2,3,3,3,4,4,4)
mydata2 <- data.frame(country, event_date, seq.ID)
So that ultimately, I can reduce the data to the level of country and sequence of events:
mydata3 <- mydata2[!duplicated(mydata2$seq.ID),]
Try:
library(dplyr)
mydata %>%
group_by(country) %>%
distinct(seq.ID = cumsum(event_date != lag(event_date, default = first(event_date)) + 1L))
Output:
# A tibble: 5 x 2
# Groups: country [2]
seq.ID country
<int> <fct>
1 1 Angola
2 2 Angola
3 1 Benin
4 2 Benin
5 3 Benin
You can also use the .keep_all argument in distinct and preserve the first date of each sequence:
mydata %>%
group_by(country) %>%
distinct(seq.ID = cumsum(event_date != lag(event_date, default = first(event_date)) + 1L),
.keep_all = TRUE)
# A tibble: 5 x 3
# Groups: country [2]
country event_date seq.ID
<fct> <date> <int>
1 Angola 2017-06-16 1
2 Angola 2017-08-22 2
3 Benin 2019-04-18 1
4 Benin 2018-03-15 2
5 Benin 2016-03-17 3
In case of desired non-aggregated output with different sequence IDs, you could do:
mydata %>%
mutate(
seq.ID = cumsum(
(event_date != lag(event_date, default = first(event_date)) + 1L) |
country != lag(country, default = first(country))
)
)
country event_date seq.ID
1 Angola 2017-06-16 1
2 Angola 2017-06-17 1
3 Angola 2017-06-18 1
4 Angola 2017-08-22 2
5 Angola 2017-08-23 2
6 Benin 2019-04-18 3
7 Benin 2019-04-19 3
8 Benin 2019-04-20 3
9 Benin 2018-03-15 4
10 Benin 2018-03-16 4
11 Benin 2016-03-17 5
Note that there is a typo in your last event_date, this is why the outputs don't correspond 100% to your desired output.
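The approaches above rely on the rows already being in date order within each country. A base-R sketch that sorts first and then starts a new ID whenever the country changes or the gap to the previous row is not exactly one day could look like this. The data are copied from the question, including the last date as posted; `sorted`, `gap` and `new_run` are illustrative names.

```r
country <- c("Angola", "Angola", "Angola", "Angola", "Angola",
             "Benin", "Benin", "Benin", "Benin", "Benin", "Benin")
event_date <- as.Date(c("2017-06-16", "2017-06-17", "2017-06-18", "2017-08-22",
                        "2017-08-23", "2019-04-18", "2019-04-19", "2019-04-20",
                        "2018-03-15", "2018-03-16", "2016-03-17"))
mydata <- data.frame(country, event_date)

# Sort so that consecutive days sit next to each other within a country
sorted <- mydata[order(mydata$country, mydata$event_date), ]
n <- nrow(sorted)

# A new run starts when the country changes or the gap is not 1 day
gap <- c(NA, diff(as.numeric(sorted$event_date)))
new_run <- c(TRUE, sorted$country[-1] != sorted$country[-n] | gap[-1] != 1)
sorted$seq.ID <- cumsum(new_run)

print(sorted)
```

Because of the sort, the Benin rows come out in chronological order, so their IDs are numbered differently from the unsorted output above, although the grouping into five sequences is the same.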

Resources