I have a dataset that roughly looks like this:
COUNTRY ADMIN1 month YEAR nevent
<chr> <chr> <dttm> <dbl> <dbl>
1 Algeria Adrar 2003-07-01 00:00:00 2003 1.00
2 Algeria Adrar 2004-10-01 00:00:00 2004 1.00
3 Algeria Adrar 2005-04-01 00:00:00 2005 1.00
4 Algeria Adrar 2005-06-01 00:00:00 2005 1.00
5 Algeria Adrar 2008-12-01 00:00:00 2008 1.00
6 Algeria Adrar 2009-02-01 00:00:00 2009 1.00
7 Algeria Adrar 2009-04-01 00:00:00 2009 2.00
8 Algeria Adrar 2009-11-01 00:00:00 2009 1.00
9 Algeria Adrar 2010-07-01 00:00:00 2010 1.00
10 Algeria Adrar 2012-02-01 00:00:00 2012 1.00
Each observation is one month in one first-level administrative unit (ADMIN1).
My main problem is that the dataset only contains the months, admin units and African countries in which at least one event occurred (nevent > 0). How can I add the observations with 0 observed events?
In other words, I need to add:
- every month in which no event happened (nevent = 0)
- every admin1 in which no events happened (ever, and in specific months)
- every country in which no events happened (ever, and in specific months)
I came across the following:
library(tidyverse)
complete(dat, id, date)
But how can I adapt it to my particular case?
Thanks
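One possible way to adapt complete() is sketched below on made-up toy data that reuses the column names from the question. It assumes month can be treated as a Date (the question shows a dttm column) and that the full range of months runs from the earliest to the latest month present in the data. Note that complete() can only expand combinations of values that already appear somewhere in the data, so countries or admin units that never recorded any event would have to be added from an external list first.

```r
library(dplyr)
library(tidyr)

# Toy data with the question's column names (values are invented)
dat <- tibble(
  COUNTRY = c("Algeria", "Algeria", "Algeria"),
  ADMIN1  = c("Adrar", "Adrar", "Alger"),
  month   = as.Date(c("2003-01-01", "2003-03-01", "2003-02-01")),
  nevent  = c(1, 2, 1)
)

# Every month between the first and last month observed (an assumption)
all_months <- seq(min(dat$month), max(dat$month), by = "month")

filled <- dat %>%
  # nesting() keeps only COUNTRY/ADMIN1 pairs that exist in the data,
  # so no admin unit is paired with the wrong country
  complete(nesting(COUNTRY, ADMIN1), month = all_months,
           fill = list(nevent = 0)) %>%
  # rebuild YEAR for the newly created rows
  mutate(YEAR = as.numeric(format(month, "%Y")))
```

Here each of the two COUNTRY/ADMIN1 pairs ends up with one row per month, and the rows that were missing get nevent = 0.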
I have the dataframe assets_year:
fiscalyear countryname Assets net_margin
<int> <chr> <dbl> <dbl>
1 2010 Austria 1602544072. 1.72
2 2010 Belgium 2534519957. 0.974
3 2010 Estonia 33248259. 1.31
4 2010 Finland 1490200498. 1.42
5 2010 France 17137601040. 1.51
6 2010 Germany 11553780086. 2.32
The tail of the data:
fiscalyear countryname Assets net_margin
<int> <chr> <dbl> <dbl>
1 2017 Luxembourg 503785373. 0.730
2 2017 Netherlands 3810079489. 1.40
3 2017 Portugal 504072448. 1.73
4 2017 Slovakia 61735274. 2.49
5 2017 Slovenia 41642423. 1.96
6 2017 Spain 4397884239. 1.39
Additionally, I summed up the asset values per year in another DF:
fiscalyear `sum(Assets)`
<int> <dbl>
1 2010 52192928317.
2 2011 55914561036.
3 2012 52202110772.
4 2013 42418952433.
5 2014 53001352848.
6 2015 43550880007.
To scale net margin by asset value, I would like to add (e.g. with cbind(...)) the yearly sum(Assets) to my existing data frame, which is in panel format: every country has an entry for 2010, 2011, ..., 2017.
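Because the panel has several rows per year, a join on fiscalyear is probably a safer fit than cbind(), which would need both data frames to have the same number of rows. The sketch below uses invented numbers; the names yearly_totals, total_assets and asset_share are illustrative, and dividing Assets by the yearly total is only one possible reading of the scaling described above.

```r
library(dplyr)

# Toy panel with the question's column names (values are invented)
assets_year <- tibble(
  fiscalyear  = c(2010L, 2010L, 2011L, 2011L),
  countryname = c("Austria", "Belgium", "Austria", "Belgium"),
  Assets      = c(100, 300, 200, 200),
  net_margin  = c(1.5, 1.0, 2.0, 1.2)
)

# One row per year with the summed assets, under a plain column name
yearly_totals <- assets_year %>%
  group_by(fiscalyear) %>%
  summarise(total_assets = sum(Assets))

# Attach the yearly total to every country-year row
assets_scaled <- assets_year %>%
  left_join(yearly_totals, by = "fiscalyear") %>%
  mutate(asset_share = Assets / total_assets)  # hypothetical scaling
```

The same total_assets column could also be produced without a second data frame by using group_by(fiscalyear) followed by mutate(total_assets = sum(Assets)).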
My data set has flow rate measurements of a river for every day of the year from 2009 to 2021. This is split up into seasons: Winter (December, Jan, Feb), Spring (March, April, May), Summer (June, July, August) and Autumn (September, October, November).
This is a sample of my data set:
> (chitt_brook_wylye_2)
# A tibble: 4,437 x 7
river year season month date flow_rate quality
<chr> <dbl> <chr> <chr> <dttm> <dbl> <chr>
1 chittern_brook 2009 Winter December 2009-12-01 00:00:00 0.059 Good
2 chittern_brook 2009 Winter December 2009-12-02 00:00:00 0.061 Good
3 chittern_brook 2009 Winter December 2009-12-03 00:00:00 0.064 Good
4 chittern_brook 2009 Winter December 2009-12-04 00:00:00 0.068 Good
5 chittern_brook 2009 Winter December 2009-12-05 00:00:00 0.076 Good
6 chittern_brook 2009 Winter December 2009-12-06 00:00:00 0.138 Good
7 chittern_brook 2009 Winter December 2009-12-07 00:00:00 0.592 Good
8 chittern_brook 2009 Winter December 2009-12-08 00:00:00 1.04 Good
9 chittern_brook 2009 Winter December 2009-12-09 00:00:00 1.46 Good
10 chittern_brook 2009 Winter December 2009-12-10 00:00:00 1.7 Good
# ... with 4,427 more rows
I want to find the 95th percentile, 5th percentile, median and mean of flow_rate for each season of every year, with each statistic in its own column of a new data frame.
For example:
> (df)
# A tibble: 49 x 2
season_label flow_rate_mean Q95 Q5 flow_rate_median
<chr> <dbl>
1 Winter 2009 0.453 3 2 4
2 Spring 2010 0.519 6 3 4
3 Summer 2010 0.0627 4 3 6
4 Autumn 2010 0.0415 6 2 6
5 Winter 2010 0.0622 8 3 3
6 Spring 2011 0.188 10 3 2
7 Summer 2011 0.0499 2 3 2
8 Autumn 2011 0.0383 2 2 1
9 Winter 2011 0.0461 5 2 7
10 Spring 2012 0.0925 3 2 8
# ... with 39 more rows
I currently have this code, which creates the above data frame with just the first two columns, but I would like it to also include the 95th percentile, 5th percentile and median. Is this feasible, or will I need to compute them separately and then combine them into one data frame?
df <- chitt_brook_wylye_2 %>%
  dplyr::mutate(month = as.numeric(format(date, "%m")),
                year = as.numeric(format(date, "%Y")),
                season_id = (12 * year + month) %/% 3) %>%
  dplyr::group_by(season_id) %>%
  dplyr::mutate(season_label = paste(season, min(year))) %>%
  dplyr::group_by(season_id, season_label) %>%
  dplyr::summarise(flow_rate = mean(flow_rate))
Reproducible example and code:
library(dplyr) # provides the %>% pipe used below
date <- as.Date(c("2009-12-01","2010-01-01","2010-02-01","2010-03-01","2010-04-01","2010-05-01","2010-06-01","2010-07-01","2010-08-01","2010-09-01","2010-10-01","2010-11-01","2010-12-01"))
season <- c("Winter","Winter","Winter","Spring","Spring","Spring","Summer","Summer","Summer","Autumn","Autumn","Autumn","Winter")
var <- c(1,2,3,5,5,5,7,7,7,9,9,9,10)
df <- data.frame(date, season, var) %>% # create the data frame
  dplyr::mutate(month = as.numeric(format(date, "%m")),
                year = as.numeric(format(date, "%Y")),
                season_id = (12 * year + month) %/% 3) %>% # identifier for every season that exists in the data
  dplyr::group_by(season_id) %>% # group by that identifier
  dplyr::mutate(season_label = paste(min(year), season)) %>% # label each season with the year it starts in
  dplyr::group_by(season_id, season_label) %>% # keep season_label through the summarise
  dplyr::summarise(var = mean(var)) # compute the mean
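The extra statistics can most likely be computed in the same summarise() call, so no separate data frames need to be combined. The sketch below reuses the season_id logic from above on a small invented data set. The output column names follow the header in the question, and quantile() is used with its default interpolation method, which is an assumption about what is wanted.

```r
library(dplyr)

# Toy daily-flow stand-in (values are invented)
toy <- data.frame(
  date = as.Date(c("2009-12-01", "2010-01-01", "2010-02-01",
                   "2010-03-01", "2010-04-01", "2010-05-01")),
  season = c("Winter", "Winter", "Winter", "Spring", "Spring", "Spring"),
  flow_rate = c(1, 2, 3, 4, 6, 8)
)

season_stats <- toy %>%
  mutate(month = as.numeric(format(date, "%m")),
         year = as.numeric(format(date, "%Y")),
         season_id = (12 * year + month) %/% 3) %>%
  group_by(season_id) %>%
  mutate(season_label = paste(season, min(year))) %>%
  group_by(season_id, season_label) %>%
  # several summary columns can be created in one summarise()
  summarise(flow_rate_mean   = mean(flow_rate),
            flow_rate_median = median(flow_rate),
            Q5  = quantile(flow_rate, 0.05, names = FALSE),
            Q95 = quantile(flow_rate, 0.95, names = FALSE),
            .groups = "drop")
```

If the real flow_rate column contains missing values, na.rm = TRUE would need to be passed to each of the four functions.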
I'm trying to make a lag plot with gg_lag(), but I keep getting the same error about my tsibble:
# A tsibble: 255 x 6 [7D]
# Key: Demand [163]
Week Demand Date Month year Quarter
<dbl> <dbl> <date> <mth> <chr> <qtr>
1 1 48 2018-01-01 2018 Jan 2018 2018 Q1
2 2 101 2018-01-08 2018 Jan 2018 2018 Q1
3 3 129 2018-01-15 2018 Jan 2018 2018 Q1
4 4 113 2018-01-22 2018 Jan 2018 2018 Q1
5 5 116 2018-01-29 2018 Jan 2018 2018 Q1
6 6 123 2018-02-05 2018 Feb 2018 2018 Q1
7 7 137 2018-02-12 2018 Feb 2018 2018 Q1
8 8 136 2018-02-19 2018 Feb 2018 2018 Q1
9 9 151 2018-02-26 2018 Feb 2018 2018 Q1
10 10 87 2018-03-05 2018 Mar 2018 2018 Q1
# ... with 245 more rows
Printer_Q %>% gg_lag(Demand, geom='point')
Error: The data provided to contains more than one time series. Please filter a single time series to use gg_lag()
I tried filtering my data with:
Printer_Q <- Demandts %>%
select(-Week, -year, -Month, -Quarter)
...so that I am left with Demand and Date but it still says I have more than one time series? What am I doing wrong?
The Demand column should not be a key variable. A key variable is a categorical variable used to distinguish multiple time series in a single tsibble. It appears you just have one time series here, so you don't need a key variable.
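A minimal sketch of that fix, assuming the weekly Date column is the time index: build the tsibble with only an index and no key. The toy values are copied from the first rows printed in the question, and the name demand_df is illustrative.

```r
library(tsibble)
library(dplyr)

# First ten weeks from the printout in the question
demand_df <- tibble(
  Date   = as.Date("2018-01-01") + 7 * 0:9,
  Demand = c(48, 101, 129, 113, 116, 123, 137, 136, 151, 87)
)

# Declare only the index; with no key, the tsibble holds a single series
Printer_Q <- demand_df %>%
  as_tsibble(index = Date)

key_vars(Printer_Q)  # no key variables
n_keys(Printer_Q)    # a single series

# With feasts loaded, the lag plot should then be accepted:
# Printer_Q %>% feasts::gg_lag(Demand, geom = "point")
```

For an existing tsibble such as Demandts, converting it back with as_tibble() and then calling as_tsibble(index = Date) should have the same effect, since select() alone does not drop a key column.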
I'm attempting to find the percent differences of state characteristics (using an index created with factor analysis) between the years 2012 and 2017. However, some states start with a negative score, e.g. -0.617 (2012), and end with a lower one, -1.25 (2017), which yields a positive percent difference rather than a negative one.
The only other thing I've tried is subtracting 1 from the fraction factor1/lag(factor1). Below is the code I'm currently working with:
STFACTOR %>>%
  dplyr::select(FIPSst, Geography, Year, factor1) %>>%
  filter(Year %in% c(2012, 2017)) %>>% # %in% rather than ==, which would recycle the vector
  group_by(Geography) %>>%
  mutate(pct_change = (factor1/lag(factor1)-1) * 100)
These are the variants of the mutate() call I tried and the result of each:
mutate(pct_change = (1-factor1/lag(factor1)) * 100)
FIPSst Geography Year factor1[,1] pct_change
<chr> <fct> <int> <dbl> <dbl>
1 01 Alabama 2012 1.82 NA
2 01 Alabama 2017 0.945 47.9
3 04 Arizona 2012 0.813 NA
4 04 Arizona 2017 0.108 86.7
5 05 Arkansas 2012 1.52 NA
6 05 Arkansas 2017 0.626 58.8
7 06 California 2012 1.04 NA
8 06 California 2017 0.0828 92.1
9 08 Colorado 2012 -0.617 NA
10 08 Colorado 2017 -1.25 -102.
mutate(pct_change = (factor1/lag(factor1)-1) * 100)
FIPSst Geography Year factor1[,1] pct_change
<chr> <fct> <int> <dbl> <dbl>
1 01 Alabama 2012 1.82 NA
2 01 Alabama 2017 0.945 -47.9
3 04 Arizona 2012 0.813 NA
4 04 Arizona 2017 0.108 -86.7
5 05 Arkansas 2012 1.52 NA
6 05 Arkansas 2017 0.626 -58.8
7 06 California 2012 1.04 NA
8 06 California 2017 0.0828 -92.1
9 08 Colorado 2012 -0.617 NA
10 08 Colorado 2017 -1.25 102.
I would expect the final result to look like this:
FIPSst Geography Year factor1[,1] pct_change
<chr> <fct> <int> <dbl> <dbl>
1 01 Alabama 2012 1.82 NA
2 01 Alabama 2017 0.945 -47.9
3 04 Arizona 2012 0.813 NA
4 04 Arizona 2017 0.108 -86.7
5 05 Arkansas 2012 1.52 NA
6 05 Arkansas 2017 0.626 -58.8
7 06 California 2012 1.04 NA
8 06 California 2017 0.0828 -92.1
9 08 Colorado 2012 -0.617 NA
10 08 Colorado 2017 -1.25 -102.
mutate(pct_change = (factor1-lag(factor1))/lag(abs(factor1)) * 100)
The line above is the final solution to the problem: I subtracted the old value from the new one and then divided by the absolute value of the old value, so the sign of the result follows the direction of the change.
We can use:
mutate(pct_change =(factor1 - lag(factor1))/abs(lag(factor1)) * 100)
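As a quick check of this formula, the sketch below applies it to two states using the rounded scores printed above, so the percentages can differ slightly from those in the tables. The data frame name scores is illustrative, and arrange() is added on the assumption that rows are not guaranteed to be sorted by year.

```r
library(dplyr)

# Rounded scores taken from the printed tables in the question
scores <- tibble(
  Geography = c("Alabama", "Alabama", "Colorado", "Colorado"),
  Year      = c(2012L, 2017L, 2012L, 2017L),
  factor1   = c(1.82, 0.945, -0.617, -1.25)
)

result <- scores %>%
  group_by(Geography) %>%
  arrange(Year, .by_group = TRUE) %>%  # make sure lag() refers to 2012
  mutate(pct_change = (factor1 - lag(factor1)) / abs(lag(factor1)) * 100) %>%
  ungroup()
```

Both 2017 rows come out negative, because the numerator carries the direction of the change while the denominator is always positive.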
I have the following table of quarterly data and want to generate a new column of date type for each row.
Year,Quarter,Sales
2008,1,1.703
2008,2,0.717
2008,3,6.892
2008,4,4.363
2009,1,3.793
2009,2,5.208
2009,3,7.367
2009,4,8.737
2010,1,8.752
2010,2,8.398
This is what I tried:
quarters <- c('-03-31', '-06-30', '-09-30', '-12-31')
gen_date <- function(row) {
year <- row[1]
quarter <- row[2]
date <- paste(toString(year), quarters[quarter], sep='')
date <- as.Date((date), format="%Y-%m-%d")
return(date)
}
df$Date <- apply(df, 1, gen_date)
However, the resulting column df$Date is not a Date but a plain number:
Year Quarter Sales Date
1 2008 1 1.703 13969
2 2008 2 0.717 14060
3 2008 3 6.892 14152
4 2008 4 4.363 14244
5 2009 1 3.793 14334
6 2009 2 5.208 14425
7 2009 3 7.367 14517
8 2009 4 8.737 14609
Try with lubridate, whose yq() parses a year-quarter string into the first day of that quarter:
library(lubridate)
library(tibble) # for tibble()
Year <- c(rep(2008, 4), rep(2009, 4), 2010, 2010)
Quarter <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2)
Sales <- c(1.7, 0.7, 6.9, 4.3, 3.79, 5.2, 7.3, 8.7, 8.7, 8.4)
df <- tibble(Year, Quarter, Sales)
df$Date <- yq(paste(df$Year, df$Quarter, sep = "-"))
df
Year Quarter Sales Date
<dbl> <dbl> <dbl> <date>
1 2008 1.00 1.70 2008-01-01
2 2008 2.00 0.700 2008-04-01
3 2008 3.00 6.90 2008-07-01
4 2008 4.00 4.30 2008-10-01
5 2009 1.00 3.79 2009-01-01
6 2009 2.00 5.20 2009-04-01
7 2009 3.00 7.30 2009-07-01
8 2009 4.00 8.70 2009-10-01
9 2010 1.00 8.70 2010-01-01
10 2010 2.00 8.40 2010-04-01
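Since the question builds end-of-quarter dates (03-31, 06-30, ...) while yq() returns the first day of each quarter, one way to bridge the two is to add three months and step back one day. This is a sketch under that assumption; the names quarter_start and quarter_end are illustrative.

```r
library(lubridate)

# First day of each quarter, as returned by yq()
quarter_start <- yq(paste(c(2008, 2008, 2009), c(1, 4, 2), sep = "-"))

# First day of the next quarter, minus one day = last day of this quarter
quarter_end <- quarter_start %m+% months(3) - days(1)
```

The result stays of class Date, so it can be assigned directly to a new column.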
Try this:
library(lubridate)
dfx <- read.table(text = "Year,Quarter,Sales
2008,1,1.703
2008,2,0.717
2008,3,6.892
2008,4,4.363
2009,1,3.793
2009,2,5.208
2009,3,7.367
2009,4,8.737
2010,1,8.752
2010,2,8.398", header=T, sep=",")
dfx$month <- factor(dfx$Quarter)
levels(dfx$month) <- c('-03-31', '-06-30', '-09-30', '-12-31')
dfx$month <- as.character(dfx$month)
# the suffix already starts with "-", so paste without an extra separator
dfx$date <- ymd(paste0(dfx$Year, dfx$month))
HTH
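As for why the original attempt produced numbers: apply() collects the results of the function into a plain vector, which drops the Date class and leaves the underlying day counts. A vectorized base R sketch that avoids apply() altogether is shown below; the name quarter_ends is illustrative, and the toy rows are copied from the question.

```r
# End-of-quarter suffixes, indexed by quarter number 1..4
quarter_ends <- c("-03-31", "-06-30", "-09-30", "-12-31")

df <- data.frame(
  Year    = c(2008, 2008, 2008, 2008, 2009),
  Quarter = c(1, 2, 3, 4, 1),
  Sales   = c(1.703, 0.717, 6.892, 4.363, 3.793)
)

# Build all date strings at once and convert them in a single call,
# so the Date class is never stripped
df$Date <- as.Date(paste0(df$Year, quarter_ends[df$Quarter]))

class(df$Date)
```

If the numeric column from the original attempt already exists, as.Date(df$Date, origin = "1970-01-01") should turn it back into dates.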