I tried to download COVID data provided in The Economist's GitHub repository.
library(readr)
library(knitr)
myfile <- "https://raw.githubusercontent.com/TheEconomist/covid-19-excess-deaths-tracker/master/output-data/excess-deaths/all_weekly_excess_deaths.csv"
test <- read_csv(myfile)
What I get is a tibble, and I am unable to easily access the data stored in it. I would like to take one column, say test$covid_deaths_per_100k, and reshape it into a matrix or ts object with rows referring to time and columns referring to countries.
I tried it manually, but I failed. Then I tried with the tsibble package and failed again:
tsibble(test[c("covid_deaths_per_100k", "country")], index = test$start_date)
Error: Must extract column with a single valid subscript.
x Subscript `var` has the wrong type `date`.
ℹ It must be numeric or character.
So I guess the problem is that the data are stacked by country, and hence the time index is duplicated. Would I need some of these magic pipe functions to make this work? Is there an easy way to do this, perhaps without piping?
A valid tsibble must have distinct rows identified by key and index:
as_tsibble(test, index = start_date, key = c(country, region))
# A tsibble: 11,715 x 17 [1D]
# Key: country, region [176]
country region region_code start_date end_date days year week population total_deaths
<chr> <chr> <chr> <date> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Australia Australia 0 2020-01-01 2020-01-07 7 2020 1 25734100 2497
2 Australia Australia 0 2020-01-08 2020-01-14 7 2020 2 25734100 2510
3 Australia Australia 0 2020-01-15 2020-01-21 7 2020 3 25734100 2501
4 Australia Australia 0 2020-01-22 2020-01-28 7 2020 4 25734100 2597
5 Australia Australia 0 2020-01-29 2020-02-04 7 2020 5 25734100 2510
6 Australia Australia 0 2020-02-05 2020-02-11 7 2020 6 25734100 2530
7 Australia Australia 0 2020-02-12 2020-02-18 7 2020 7 25734100 2613
8 Australia Australia 0 2020-02-19 2020-02-25 7 2020 8 25734100 2608
9 Australia Australia 0 2020-02-26 2020-03-03 7 2020 9 25734100 2678
10 Australia Australia 0 2020-03-04 2020-03-10 7 2020 10 25734100 2602
# ... with 11,705 more rows, and 7 more variables: covid_deaths <dbl>, expected_deaths <dbl>,
# excess_deaths <dbl>, non_covid_deaths <dbl>, covid_deaths_per_100k <dbl>,
# excess_deaths_per_100k <dbl>, excess_deaths_pct_change <dbl>
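If all you need is the wide layout from the question (rows referring to time, columns to countries), here is a minimal sketch with tidyr::pivot_wider, under the assumption that the national-level rows are the ones where country equals region:
library(dplyr)
library(tidyr)
wide <- test %>%
  filter(country == region) %>%   # assumption: national totals are rows where country == region
  pivot_wider(id_cols = start_date,
              names_from = country,
              values_from = covid_deaths_per_100k) %>%
  arrange(start_date)
From there, as.matrix(wide[-1]) gives a plain matrix with start_date as the time index.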
ts works best with monthly, quarterly or annual series. Here we show a few approaches.
1) monthly: This creates a monthly zoo object z from the indicated test columns, splitting by country and aggregating to produce a monthly time series. It then creates a ts object from that.
library(zoo)
z <- read.zoo(test[c("start_date", "country", "covid_deaths")],
split = "country", FUN = as.yearmon, aggregate = sum)
as.ts(z)
2) weekly: To create a weekly ts object with frequency 53:
to_weekly <- function(x) {
  yr <- as.integer(as.yearmon(x))
  wk <- as.integer(format(as.Date(x), "%U"))
  yr + wk/53
}
z <- read.zoo(test[c("start_date", "country", "covid_deaths")],
split = "country", FUN = to_weekly, aggregate = sum)
as.ts(z)
3) daily: If you want a series where the times are dates, omit the FUN argument and use zoo directly.
z <- read.zoo(test[c("end_date", "country", "covid_deaths")],
split = "country", aggregate = sum)
if (!file.exists("ames-liquor.rds")) {
url <- "https://github.com/ds202-at-ISU/materials/blob/master/03_tidyverse/data/ames-liquor.rds?raw=TRUE"
download.file(url, "ames-liquor.rds", mode="wb")
}
data <- readRDS("ames-liquor.rds")
How do I extract the geographic latitude and longitude from the variable Store Location and check the variable types? And how do I use the package lubridate to convert the Date variable to a date, then extract year, month and day from it? I am having a hard time figuring out how to extract the geographic latitude. I used the code above to load the data in R.
One option would be to use tidyr::extract to extract the longitude and latitude. For the dates, convert to a proper date using e.g. as.Date. Afterwards you can get the year, month and day using the respective functions from lubridate:
library(dplyr)
library(tidyr)
library(lubridate)
data |>
  tidyr::extract(`Store Location`, into = c("lon", "lat"),
                 regex = "\\((\\-?\\d+\\.\\d+) (\\-?\\d+\\.\\d+)\\)",
                 remove = FALSE,
                 convert = TRUE) |>
  mutate(Date = as.Date(Date, "%m/%d/%Y"),
         year = lubridate::year(Date),
         month = lubridate::month(Date),
         day = lubridate::day(Date)) |>
  select(`Store Location`, lon, lat, Date, year, month, day)
#> # A tibble: 661,945 × 7
#> `Store Location` lon lat Date year month day
#> <chr> <dbl> <dbl> <date> <dbl> <dbl> <int>
#> 1 POINT (-93.619455 42.022848) -93.6 42.0 2020-11-02 2020 11 2
#> 2 POINT (-93.669896 42.02160500000001) -93.7 42.0 2020-07-01 2020 7 1
#> 3 POINT (-93.669896 42.02160500000001) -93.7 42.0 2019-07-31 2019 7 31
#> 4 <NA> NA NA 2019-07-25 2019 7 25
#> 5 <NA> NA NA 2019-07-05 2019 7 5
#> 6 POINT (-93.618911 42.022854) -93.6 42.0 2020-07-02 2020 7 2
#> 7 POINT (-93.669896 42.02160500000001) -93.7 42.0 2021-03-03 2021 3 3
#> 8 POINT (-93.619455 42.022848) -93.6 42.0 2021-03-03 2021 3 3
#> 9 POINT (-93.669896 42.02160500000001) -93.7 42.0 2019-07-17 2019 7 17
#> 10 <NA> NA NA 2022-08-03 2022 8 3
#> # … with 661,935 more rows
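For the "check variable types" part of the question, you can inspect the result directly with standard functions, e.g.:
library(dplyr)
glimpse(data)         # compact overview: class and first values of every column
sapply(data, class)   # or a plain named vector of column classes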
I am trying to calculate the rolling average of COVID cases in the month of March, day by day.
For example, on the 5th of March it should take the mean of the cases for the first 5 days of March; on the 20th, the mean of the first 20 days.
I have written a small piece of code for this, but is there a prebuilt function or a better way of doing this?
df:
Country.Region Date Cases_count
<chr> <date> <dbl>
1 France 2021-03-01 4730
2 France 2021-03-02 22872
3 France 2021-03-03 26903
4 France 2021-03-04 25286
5 France 2021-03-05 23507
6 France 2021-03-06 23306
7 France 2021-03-07 21835
8 France 2021-03-08 5534
9 France 2021-03-09 23143
10 France 2021-03-10 29674
code:
max_date <- ymd(max(df$Date))
march <- seq(ymd("2021-03-01"), ymd(max_date), by = "day")
rolling_data <- lapply(march, function(x) {
  rolling_avg <- df %>%
    filter(Country.Region == "France",
           Date %in% c(ymd("2021-03-01"):x)) %>%
    summarise(rolling_mean = mean(Cases_count))
  # from: https://stackoverflow.com/questions/61038643/loop-through-irregular-list-of-numbers-to-append-rows-to-summary-table
  data.frame(Date = x, rolling_march = rolling_avg)
})
do.call(rbind, rolling_data)
output:
Date rolling_mean
1 2021-03-01 4730.00
2 2021-03-02 13801.00
3 2021-03-03 18168.33
4 2021-03-04 19947.75
5 2021-03-05 20659.60
6 2021-03-06 21100.67
7 2021-03-07 21205.57
8 2021-03-08 19246.62
9 2021-03-09 19679.56
10 2021-03-10 20679.00
Issue: to use this along with the case counts I will have to do some join. So if there is a prebuilt function, I can probably use it with mutate or summarise.
So what you actually want is a cumulative average, not a rolling/moving average.
A far easier approach is to use cumsum. For example, if you have a vector x with N elements, the cumulative mean can be expressed as:
cumulative_mean <- cumsum(x) / seq_len(length(x))
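A quick check with a small vector:
x <- c(2, 4, 6, 8)
cumsum(x) / seq_len(length(x))
#> [1] 2 3 4 5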
For an actual rolling mean, the zoo package provides zoo::rollmean.
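Both fit directly inside mutate, which avoids the join mentioned in the question; dplyr also ships a ready-made cummean. A sketch assuming df has the columns shown above:
library(dplyr)
library(zoo)
df %>%
  filter(Country.Region == "France") %>%
  arrange(Date) %>%
  mutate(cum_mean   = cummean(Cases_count),   # expanding mean: day 1, first 2 days, ...
         roll_mean7 = rollmean(Cases_count, k = 7,
                               fill = NA, align = "right"))   # a true 7-day rolling mean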
df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
               stringsAsFactors = FALSE)
I processed the dataset. Can we find the day with the fewest deaths in the Asia region? The important thing here is the sum of the deaths of all countries in the Asia region; accordingly, the task is to sort by that sum and find the day. As output:
date        region  death
2020/02/17  asia    6300   (sum for the Asia region)
The numbers in this example output are not real.
Since these are cumulative cases and deaths, we need to difference the data.
library(dplyr)
df %>%
  mutate(day = as.Date(day)) %>%
  filter(region == "Asia") %>%
  group_by(day) %>%
  summarise(deaths = sum(death)) %>%
  mutate(d = c(first(deaths), diff(deaths))) %>%
  arrange(d)
# A tibble: 107 x 3
day deaths d
<date> <int> <int>
1 2020-01-23 18 1 # <- this day saw only 1 death in the whole of Asia
2 2020-01-29 133 2
3 2020-02-21 2249 3
4 2020-02-12 1118 5
5 2020-01-24 26 8
6 2020-02-23 2465 10
7 2020-01-26 56 14
8 2020-01-25 42 16
9 2020-01-22 17 17
10 2020-01-27 82 26
# ... with 97 more rows
So the second day of records saw the least number of deaths recorded (so far).
Using the dplyr package for data treatment:
df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
               stringsAsFactors = FALSE)
library(dplyr)
df_sum <- df %>%
  group_by(region, day) %>%           # grouping by region and day
  summarise(death = sum(death)) %>%   # summing within the groups
  filter(region == "Asia",
         death == min(death))         # keeping only the minimum for Asia
Then you have:
> df_sum
# A tibble: 1 x 3
# Groups: region [1]
region day death
<fct> <fct> <int>
1 Asia 2020/01/22 17
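Note that death in this dataset is cumulative, so the minimum found here is simply the first day on record. A sketch that combines this approach with the differencing step from the previous answer:
library(dplyr)
df %>%
  mutate(day = as.Date(day)) %>%
  filter(region == "Asia") %>%
  group_by(day) %>%
  summarise(death = sum(death)) %>%
  arrange(day) %>%
  mutate(new_deaths = c(first(death), diff(death))) %>%   # daily new deaths
  filter(new_deaths == min(new_deaths))                   # day(s) with the fewest new deaths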
I have a data frame of SPEI values. I want to calculate two statistics (explained below) at an interval of 20 years, i.e. 2021-2040, 2041-2060, 2061-2080, 2081-2100, and for each year, i.e. 2021, 2022, 2023, etc. till 2100. The first column contains the Date (month-year).
The statistics are:
Drought frequency: Number of times SPEI < 0 in the specified period (20 years and 1 year respectively)
Drought duration: the number of months between a drought's start month (included) and its end month (not included) within the specified period. I am assuming a drought event starts when SPEI < 0.
I was wondering if there's a way to do that in R? It seems like an easy problem, but I don't know how to do it. Please help me out. Excel is taking too long. Thanks.
> head(test, 20)
Date spei-3
1 2021-01-01 NA
2 2021-02-01 NA
3 2021-03-01 -0.52133737
4 2021-04-01 -0.60047887
5 2021-05-01 0.56838399
6 2021-06-01 0.02285012
7 2021-07-01 0.26288462
8 2021-08-01 -0.14314685
9 2021-09-01 -0.73132256
10 2021-10-01 -1.23389220
11 2021-11-01 -1.15874943
12 2021-12-01 0.27954143
13 2022-01-01 1.14606657
14 2022-02-01 0.66872986
15 2022-03-01 -1.13758050
16 2022-04-01 -0.27861017
17 2022-05-01 0.99992395
18 2022-06-01 0.61024314
19 2022-07-01 -0.47450485
20 2022-08-01 -1.06682997
Edit:
I would very much like to add some code, but I don't know where to start.
test = "E:/drought.xlsx"
#Extract year and month and add it as a column
test$Year = format(test$Date,"%Y")
test$Month = format(test$Date,"%B")
I don't know how to go from here. I found that cumsum can help, but how do I select one year and then apply cumsum to it? I am not withholding code on purpose; I just don't know where or how to begin.
There are a couple of questions in the OP's post, so I will go through them step by step. You'll need dplyr and lubridate for this workflow.
First, we create some fake data to use:
library(lubridate)
library(dplyr)
# create example data
dd <- data.frame(Date = seq.Date(as.Date("2021-01-01"), as.Date("2100-12-01"), by = "month"),
                 spei = rnorm(960, 0, 2))
That will look like this, similar to what you have above (the year, year_20 and drought columns are added in the next step):
> head(dd)
Date spei year year_20 drought
1 2021-01-01 -6.85689789 2021 2021_2040 1
2 2021-02-01 -0.09292459 2021 2021_2040 1
3 2021-03-01 0.13715922 2021 2021_2040 0
4 2021-04-01 2.26805601 2021 2021_2040 0
5 2021-05-01 -0.47325008 2021 2021_2040 1
6 2021-06-01 0.37034138 2021 2021_2040 0
Then we can use lubridate and cut to create the yearly and 20-year variables to group by later, and create a column drought signifying whether spei was negative.
# create columns to group on, by year and by 20-year period
dd <- dd %>%
  mutate(year = year(Date),
         year_20 = cut(year, breaks = c(2020, 2040, 2060, 2080, 2100), include.lowest = TRUE,
                       labels = c("2021_2040", "2041_2060", "2061_2080", "2081_2100"))) %>%
  # column signifying if that month was a drought
  mutate(drought = ifelse(spei < 0, 1, 0))
Once we have that, we just use the group_by function to get frequency (or number of months with a drought) by year or 20-year period
# by year
dd %>%
  group_by(year) %>%
  summarise(year_freq = sum(drought)) %>%
  ungroup()
# A tibble: 80 x 2
year year_freq
<dbl> <dbl>
1 2021 6
2 2022 4
3 2023 7
4 2024 6
5 2025 6
6 2026 7
# by 20-year group
dd %>%
  group_by(year_20) %>%
  summarise(year20_freq = sum(drought)) %>%
  ungroup()
# A tibble: 4 x 2
year_20 year20_freq
<fct> <dbl>
1 2021_2040 125
2 2041_2060 121
3 2061_2080 121
4 2081_2100 132
Calculating drought duration is a bit more complicated. It involves:
1. identifying the first month of each drought
2. calculating the length of each drought
3. combining the information from 1 and 2 together
We can use lag to identify when a month changed from "no drought" to "drought". In this case we want an index of where the value in row i differs from that in row i-1.
# find index of where values change
change.ind <- dd$drought != lag(dd$drought)
# use index to find drought start
drought.start <- dd[change.ind & dd$drought == 1, ]
This results in a subset of the initial dataset, but only with the rows with the first month of a drought. Then we can use rle to calculate the length of the drought. rle will calculate the length of every run of numbers, so we will have to subset to only those runs where the value==1 (drought)
# calculate drought lengths
drought.lengths <- rle(dd$drought)
# we only want droughts (values = 1)
drought.lengths <- drought.lengths$lengths[drought.lengths$values == 1]
Now we can combine these two pieces of information together. The first row is an NA because there is no value at i-1 to compare the lag to. It can be dropped, unless you want to include that data.
drought.dur <- cbind(drought.start, drought_length = drought.lengths)
head(drought.dur)
Date spei year year_20 drought drought_length
NA <NA> NA NA <NA> NA 2
5 2021-05-01 -0.47325008 2021 2021_2040 1 1
9 2021-09-01 -2.04564549 2021 2021_2040 1 1
11 2021-11-01 -1.04293866 2021 2021_2040 1 2
14 2022-02-01 -0.83759671 2022 2021_2040 1 1
17 2022-05-01 -0.07784316 2022 2021_2040 1 1
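From here, one possible way to summarise the durations per 20-year window, using the columns created above (a sketch; it drops the leading NA row):
library(dplyr)
drought.dur %>%
  filter(!is.na(Date)) %>%   # drop the NA row discussed above
  group_by(year_20) %>%
  summarise(n_droughts  = n(),
            mean_length = mean(drought_length),
            max_length  = max(drought_length))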
I have data for each country's happiness (https://www.kaggle.com/unsdsn/world-happiness), and I made data for each year of the reports. Now, I don't know how to get the values for each year subtracted from each other, e.g. how did the happiness rank change from 2015 to 2016, or from 2016 to 2017? I'd like to make a new df of differences for each.
I was able to bind the tables for columns in common and started to work on removing Countries that don't have data for all 3 years. I'm not sure if I'm going down a complicated path.
keepcols <- c("Country","Happiness.Rank","Economy..GDP.per.Capita.","Family","Health..Life.Expectancy.","Freedom","Trust..Government.Corruption.","Generosity","Dystopia.Residual","Year")
mydata2015 = read.csv("C:\\Users\\mmcgown\\Downloads\\2015.csv")
mydata2015$Year <- "2015"
data2015 <- subset(mydata2015, select = keepcols )
mydata2016 = read.csv("C:\\Users\\mmcgown\\Downloads\\2016.csv")
mydata2016$Year <- "2016"
data2016 <- subset(mydata2016, select = keepcols )
mydata2017 = read.csv("C:\\Users\\mmcgown\\Downloads\\2017.csv")
mydata2017$Year <- "2017"
data2017 <- subset(mydata2017, select = keepcols )
df <- rbind(data2015,data2016,data2017)
head(df, n=10)
tail(df, n=10)
df15 <- df[df['Year']=='2015',]
df16 <- df[df['Year']=='2016',]
df17 <- df[df['Year']=='2017',]
nocon <- rbind(setdiff(unique(df16['Country']),unique(df17['Country'])),setdiff(unique(df15['Country']),unique(df16['Country'])))
I don't have a clear path to accomplish what I want, but it would look like:
df16_to_17
Country Happiness.Rank ...(other columns)
Yemen (Yemen[Happiness Rank in 2017] - Yemen[Happiness Rank in 2016])
USA (USA[Happiness Rank in 2017] - USA[Happiness Rank in 2016])
(other countries)
df15_to_16
Country Happiness.Rank ...(other columns)
Yemen (Yemen[Happiness Rank in 2016] - Yemen[Happiness Rank in 2015])
USA (USA[Happiness Rank in 2016] - USA[Happiness Rank in 2015])
(other countries)
It's very straightforward with dplyr, and involves grouping by country and then finding the differences between consecutive values with base R's diff. Just make sure to use df and not df15, etc.:
library(dplyr)
rank_diff_df <- df %>%
  group_by(Country) %>%
  mutate(Rank.Diff = c(NA, diff(Happiness.Rank)))
The above assumes that the data are arranged by year, which they are in your case because of the way you combined the dataframes. If not, you'll need to call arrange(Year) before the call to mutate. Filtering out countries with missing year data isn't necessary, but can be done after group_by() with filter(n() == 3).
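Putting those pieces together, a sketch that orders by year and keeps only countries with all 3 years before differencing:
rank_diff_df <- df %>%
  group_by(Country) %>%
  arrange(Year, .by_group = TRUE) %>%   # make sure rows are in year order within each country
  filter(n() == 3) %>%                  # keep countries present in all 3 years
  mutate(Rank.Diff = c(NA, diff(Happiness.Rank)))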
If you would like to view the differences it would make sense to drop some variables and rearrange the data:
rank_diff_df %>%
  select(Year, Country, Happiness.Rank, Rank.Diff) %>%
  arrange(Country)
Which returns:
# A tibble: 470 x 4
# Groups: Country [166]
Year Country Happiness.Rank Rank.Diff
<chr> <fct> <int> <int>
1 2015 Afghanistan 153 NA
2 2016 Afghanistan 154 1
3 2017 Afghanistan 141 -13
4 2015 Albania 95 NA
5 2016 Albania 109 14
6 2017 Albania 109 0
7 2015 Algeria 68 NA
8 2016 Algeria 38 -30
9 2017 Algeria 53 15
10 2015 Angola 137 NA
# … with 460 more rows
The above data frame will work well with ggplot2 if you are planning on plotting the results.
If you don't feel comfortable with dplyr you can use base R's merge to combine the dataframes, and then create a new dataframe with the differences as columns:
df_wide <- merge(merge(df15, df16, by = "Country"), df17, by = "Country")
rank_diff_df <- data.frame(Country = df_wide$Country,
                           Y2015.2016 = df_wide$Happiness.Rank.y - df_wide$Happiness.Rank.x,
                           Y2016.2017 = df_wide$Happiness.Rank - df_wide$Happiness.Rank.y)
Which returns:
head(rank_diff_df, 10)
Country Y2015.2016 Y2016.2017
1 Afghanistan 1 -13
2 Albania 14 0
3 Algeria -30 15
4 Angola 4 -1
5 Argentina -4 -2
6 Armenia -6 0
7 Australia -1 1
8 Austria -1 1
9 Azerbaijan 1 4
10 Bahrain -7 -1
Assuming the three datasets are present in your environment with the names data2015, data2016 and data2017, we can add a Year column with the respective year and keep the columns present in the keepcols vector. We then arrange the data by Country and Year, group_by Country, keep only those countries which are present in all 3 years, and subtract the values from the previous rows using lag or diff.
library(dplyr)
data2015$Year <- 2015
data2016$Year <- 2016
data2017$Year <- 2017
df <- bind_rows(data2015, data2016, data2017)
data <- df[keepcols]
data %>%
  arrange(Country, Year) %>%
  group_by(Country) %>%
  filter(n() == 3) %>%
  mutate_at(-1, ~ . - lag(.))   # OR
  # mutate_at(-1, ~ c(NA, diff(.)))
# A tibble: 438 x 10
# Groups: Country [146]
# Country Happiness.Rank Economy..GDP.pe… Family Health..Life.Ex… Freedom
# <chr> <int> <dbl> <dbl> <dbl> <dbl>
# 1 Afghan… NA NA NA NA NA
# 2 Afghan… 1 0.0624 -0.192 -0.130 -0.0698
# 3 Afghan… -13 0.0192 0.471 0.00731 -0.0581
# 4 Albania NA NA NA NA NA
# 5 Albania 14 0.0766 -0.303 -0.0832 -0.0387
# 6 Albania 0 0.0409 0.302 0.00109 0.0628
# 7 Algeria NA NA NA NA NA
# 8 Algeria -30 0.113 -0.245 0.00038 -0.0757
# 9 Algeria 15 0.0392 0.313 -0.000455 0.0233
#10 Angola NA NA NA NA NA
# … with 428 more rows, and 4 more variables: Trust..Government.Corruption. <dbl>,
# Generosity <dbl>, Dystopia.Residual <dbl>, Year <dbl>
The value in the first row for each Country will always be NA; the rest of the values have the previous year's value subtracted from them.