Transform Year-to-date to Quarterly data with data.table - r

Quarterly data from a data provider has the issue that, for some variables, the quarterly values are actually year-to-date figures. That is, each reported value is the sum of the current and all previous quarters of that year (reported Q2 = Q1 + Q2, reported Q3 = Q1 + Q2 + Q3, ...).
The structure of the original data looks like this:
library(data.table)
library(plyr)
dt.quarter.test <- structure(list(Year = c(2000L, 2000L, 2000L, 2000L, 2001L, 2001L, 2001L, 2001L)
, Quarter = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L)
, Data.Year.to.Date = c(162, 405, 610, 938, 331, 1467, 1981, 2501))
, .Names = c("Year", "Quarter", "Data.Year.to.Date"), class = c("data.table", "data.frame"), row.names = c(NA, -8L))
To calculate the quarterly values, I therefore need to subtract the previous quarter's year-to-date value from Q2, Q3 and Q4 within each year.
I've managed to get the desired results by using the ddply function from the plyr package.
dt.quarter.result <- ddply(dt.quarter.test, "Year"
, transform
, Data.Quarterly = Data.Year.to.Date - shift(Data.Year.to.Date, n = 1L, type = "lag", fill = 0))
dt.quarter.result
Year Quarter Data.Year.to.Date Data.Quarterly
1 2000 1 162 162
2 2000 2 405 243
3 2000 3 610 205
4 2000 4 938 328
5 2001 1 331 331
6 2001 2 1467 1136
7 2001 3 1981 514
8 2001 4 2501 520
But I am not really happy with the command, since it seems quite clumsy and I would like to get some input on how to improve it and especially do it directly within the data.table.

Here is the equivalent data.table syntax, grouped by Year (you might also find the data.table cheat sheet helpful):
library(data.table)
dt.quarter.test[, Data.Quarterly := Data.Year.to.Date - shift(Data.Year.to.Date, fill = 0), Year][]
# Year Quarter Data.Year.to.Date Data.Quarterly
# 1: 2000 1 162 162
# 2: 2000 2 405 243
# 3: 2000 3 610 205
# 4: 2000 4 938 328
# 5: 2001 1 331 331
# 6: 2001 2 1467 1136
# 7: 2001 3 1981 514
# 8: 2001 4 2501 520
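For comparison, the same per-year differencing can be sketched in base R with ave() and diff(). This is an illustrative alternative rather than part of the answer above; the data frame name ytd is an assumption, and the rows are assumed to be sorted by Year and Quarter.

```r
# Same sample values as in the question, as a plain data.frame
# (the name `ytd` is hypothetical)
ytd <- data.frame(
  Year = rep(c(2000L, 2001L), each = 4),
  Quarter = rep(1:4, times = 2),
  Data.Year.to.Date = c(162, 405, 610, 938, 331, 1467, 1981, 2501)
)

# Within each year, subtract the previous year-to-date value;
# prepending 0 leaves Q1 unchanged
ytd$Data.Quarterly <- ave(
  ytd$Data.Year.to.Date, ytd$Year,
  FUN = function(x) diff(c(0, x))
)

# Sanity check: cumulating the quarterly values per year restores the input
stopifnot(isTRUE(all.equal(
  ave(ytd$Data.Quarterly, ytd$Year, FUN = cumsum),
  ytd$Data.Year.to.Date
)))
```

The cumsum() check at the end is a cheap way to confirm that no quarter is missing or out of order before relying on the differenced column.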

Related

How to replace a column in R by a modified column, dependent on filtered values? (removing outliers in panel data)

I have a panel dataset that goes like this
year id treatment_year time_to_treatment outcome
2000 1 2011 -11 2
2002 1 2011 -10 3
2004 2 2015 -9 22
and so on. I am trying to deal with the outliers by winsorizing the outcome. The end goal is to make a scatterplot with time_to_treatment on the X axis and outcome on the Y axis.
I would like to replace the outcomes for each time_to_treatment by its winsorized outcomes, i.e. replace all extreme values with the 5% and 95% quantile values.
So far what I have tried to do is this but it doesn't work.
for(i in range(dataset$time_to_treatment)){
dplyr::filter(dataset, time_to_treatment == i)$outcome <- DescTools::Winsorize(dplyr::filter(dataset,time_to_treatment==i)$outcome)
}
I get the error - Error in filter(dataset, time_to_treatment == i) <- *vtmp* :
could not find function "filter<-"
Would anyone be able to suggest a better way?
Thanks.
My actual data,
where: conflicts = outcome, commission = year of treatment, CD_MUN = id.
The relevant time period indicator is time_to_t.
# Groups: year, CD_MUN, type [6]
type CD_MUN year time_to_t conflicts commission
<chr> <dbl> <dbl> <dbl> <int> <dbl>
manif 1100023 2000 -11 1 2011
manif 1100189 2000 -3 2 2003
manif 1100205 2000 -9 5 2009
manif 1500602 2000 -4 1 2004
manif 3111002 2000 -11 2 2011
manif 3147006 2000 -10 1 2010
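Assuming "time periods" refers to the 'commission' column, you may use ave(), which applies a function within each group and returns a vector of the same length as its input.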
transform(dat, conflicts_w=ave(conflicts, commission, FUN=DescTools::Winsorize))
# type CD_MUN year time_to_t conflicts commission conflicts_w
# 1 manif 1100023 2000 -11 1 2011 1.05
# 2 manif 1100189 2000 -3 2 2003 2.00
# 3 manif 1100205 2000 -9 5 2009 5.00
# 4 manif 1500602 2000 -4 1 2004 1.00
# 5 manif 3111002 2000 -11 2 2011 1.95
# 6 manif 3147006 2000 -10 1 2010 1.00
Data:
dat <- structure(list(type = c("manif", "manif", "manif", "manif", "manif",
"manif"), CD_MUN = c(1100023L, 1100189L, 1100205L, 1500602L,
3111002L, 3147006L), year = c(2000L, 2000L, 2000L, 2000L, 2000L,
2000L), time_to_t = c(-11L, -3L, -9L, -4L, -11L, -10L), conflicts = c(1L,
2L, 5L, 1L, 2L, 1L), commission = c(2011L, 2003L, 2009L, 2004L,
2011L, 2010L)), class = "data.frame", row.names = c(NA, -6L))
For a start you may use this:
# The data
set.seed(123)
df <- data.frame(
time_to_treatment = seq(-15, 0, 1),
outcome = sample(1:30, 16, replace = TRUE)
)
# A solution without Winsorize based solely on dplyr
library(dplyr)
df %>%
mutate(outcome05 = quantile(outcome, probs = 0.05), # 5% quantile
outcome95 = quantile(outcome, probs = 0.95), # 95% quantile
outcome = ifelse(outcome <= outcome05, outcome05, outcome), # replace
outcome = ifelse(outcome >= outcome95, outcome95, outcome)) %>%
select(-c(outcome05, outcome95))
You may adapt this to your exact problem.
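The question asks for winsorizing within each time_to_treatment value, whereas the snippet above clamps the whole column at once. A minimal base R sketch of the grouped version follows. The sample data, the helper name clamp_to_quantiles and the column name outcome_w are all made up for illustration. As a side note, range() returns only the minimum and maximum, so the loop in the question would not visit every time_to_treatment value even if the assignment worked.

```r
# Hypothetical data: 3 time points with 20 outcomes each, plus two outliers
set.seed(1)
dataset <- data.frame(
  time_to_treatment = rep(c(-2, -1, 0), each = 20),
  outcome = rnorm(60, mean = 10, sd = 2)
)
dataset$outcome[c(1, 25)] <- c(100, -50)

# Clamp a vector to its own lower and upper quantiles (5% and 95% by default)
clamp_to_quantiles <- function(x, probs = c(0.05, 0.95)) {
  q <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}

# Apply the clamp separately within each time_to_treatment group;
# ave() returns a vector aligned with the original rows
dataset$outcome_w <- ave(dataset$outcome, dataset$time_to_treatment,
                         FUN = clamp_to_quantiles)
```

Writing the result to a new column keeps the raw outcome available, which makes it easy to plot both versions side by side.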

Calculating sums of observation in time intervals in a df [duplicate]

I've posted this as another question, but realised I've got my sample data wrong.
I've got two separate datasets. df1 looks like this:
loc_ID year observations
nin212 2002 90
nin212 2003 98
nin212 2004 102
cha670 2001 18
cha670 2002 19
cha670 2003 21
df2 looks like this:
loc_ID start_year end_year
nin212 2002 2003
nin212 2003 2004
cha670 2001 2002
cha670 2002 2003
I want to calculate the number of observations in the time intervals (start_year to end_year) per loc_ID. In the example above, I would like to achieve this final dataset:
loc_ID start_year end_year observations
nin212 2002 2003 188
nin212 2003 2004 200
cha670 2001 2002 37
cha670 2002 2003 40
How could I do this?
We can do a non-equi join
library(data.table)
setDT(df2)[, observations := setDT(df1)[df2, sum(observations),
on = .(loc_ID, year >= start_year, year <= end_year),
by = .EACHI]$V1]
Output:
df2
# loc_ID start_year end_year observations
#1: nin212 2002 2003 188
#2: nin212 2003 2004 200
#3: cha670 2001 2002 37
#4: cha670 2002 2003 40
data
df1 <- structure(list(loc_ID = c("nin212", "nin212", "nin212", "cha670",
"cha670", "cha670"), year = c(2002L, 2003L, 2004L, 2001L, 2002L,
2003L), observations = c(90L, 98L, 102L, 18L, 19L, 21L)),
class = "data.frame", row.names = c(NA,
-6L))
df2 <- structure(list(loc_ID = c("nin212", "nin212", "cha670", "cha670"
), start_year = c(2002L, 2003L, 2001L, 2002L), end_year = c(2003L,
2004L, 2002L, 2003L)), class = "data.frame", row.names = c(NA,
-4L))
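If data.table is not available, the same interval sums can be sketched in base R with mapply(). This is an illustrative alternative under the assumption that both tables are small; it scans df1 once per interval, so it will not scale as well as the join.

```r
df1 <- data.frame(
  loc_ID = c("nin212", "nin212", "nin212", "cha670", "cha670", "cha670"),
  year = c(2002L, 2003L, 2004L, 2001L, 2002L, 2003L),
  observations = c(90L, 98L, 102L, 18L, 19L, 21L)
)
df2 <- data.frame(
  loc_ID = c("nin212", "nin212", "cha670", "cha670"),
  start_year = c(2002L, 2003L, 2001L, 2002L),
  end_year = c(2003L, 2004L, 2002L, 2003L)
)

# For each interval in df2, sum the df1 observations at the same location
# whose year lies within [start_year, end_year] (both ends inclusive)
df2$observations <- mapply(
  function(id, from, to) {
    sum(df1$observations[df1$loc_ID == id & df1$year >= from & df1$year <= to])
  },
  df2$loc_ID, df2$start_year, df2$end_year,
  USE.NAMES = FALSE
)

df2
```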

How to calculate percent differences in a table in R

I have a csv file where rows 1-5 represent one state, rows 6-10 another, and so on. I also have a column with the years 1970, 1980, ..., 2010 repeated for each state. In R (although I'm not opposed to a solution in Excel if that is easier), I want to calculate, for each state, the percent difference between each year and 1970, i.e. for Alabama 1990 it would be (AL 1990 - AL 1970)/(AL 1970), and add it as a new column in the data table so I can export it to a csv.
State, Year, Num
AL, 1970, 1
AL, 1980, 2
AL, 1990, 3
AL, 2000, 4
AL, 2010, 6
Output would be a column
pct_change
0
1
2
3
5
The dplyr package includes the function first which provides an easy method for getting the first value of a group. So if we arrange by Year to make it so that 1970 will be the first value of each group, when we group_by(State), we can use first(Num) to get that first value of Num which represents the value from 1970:
# Example data with 2 states
df <- structure(list(State = c("AL", "AL", "AL", "AL", "AL", "TX",
"TX", "TX", "TX", "TX"), Year = c(1970L, 1980L, 1990L, 2000L,
2010L, 1970L, 1980L, 1990L, 2000L, 2010L), Num = c(1, 2, 3, 4,
6, 5, 2, 10, 12, 6)), class = "data.frame", row.names = c(NA,
-10L))
library(dplyr)
df %>%
arrange(State, Year) %>%
group_by(State) %>%
mutate(perc_diff = 100 * (Num - first(Num))/first(Num))
# A tibble: 10 x 4
# Groups: State [2]
State Year Num perc_diff
<chr> <int> <dbl> <dbl>
1 AL 1970 1 0
2 AL 1980 2 100
3 AL 1990 3 200
4 AL 2000 4 300
5 AL 2010 6 500
6 TX 1970 5 0
7 TX 1980 2 -60
8 TX 1990 10 100
9 TX 2000 12 140
10 TX 2010 6 20
We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df), order by 'State' and 'Year' in i, and, grouped by 'State', take the difference between 'Num' and the first value of 'Num', divide by that first value, and assign the result by reference (:=) to create 'perc_diff'.
library(data.table)
setDT(df)[order(State, Year), perc_diff :=
100 * (Num - first(Num))/first(Num), State][]
# State Year Num perc_diff
# 1: AL 1970 1 0
# 2: AL 1980 2 100
# 3: AL 1990 3 200
# 4: AL 2000 4 300
# 5: AL 2010 6 500
# 6: TX 1970 5 0
# 7: TX 1980 2 -60
# 8: TX 1990 10 100
# 9: TX 2000 12 140
#10: TX 2010 6 20
Or using base R
v1 <- with(df, ave(Num, State, FUN = function(x) x[1]))
df$perc_diff <- with(df, 100 * (Num - v1)/v1)
data
df <- structure(list(State = c("AL", "AL", "AL", "AL", "AL", "TX",
"TX", "TX", "TX", "TX"), Year = c(1970L, 1980L, 1990L, 2000L,
2010L, 1970L, 1980L, 1990L, 2000L, 2010L), Num = c(1, 2, 3, 4,
6, 5, 2, 10, 12, 6)), class = "data.frame", row.names = c(NA,
-10L))
Base R solution using tapply
df <- df[with(df, order(State, Year)), ]
df$pct_change <- unlist( tapply(df$Num, df$State, function(x) 100 * (x - x[1]) / x[1]) )
> df
State Year Num pct_change
1 AL 1970 1 0
2 AL 1980 2 100
3 AL 1990 3 200
4 AL 2000 4 300
5 AL 2010 6 500
6 TX 1970 5 0
7 TX 1980 2 -60
8 TX 1990 10 100
9 TX 2000 12 140
10 TX 2010 6 20
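The question also mentions exporting the result to a csv, which none of the answers show. A minimal sketch of that last step, assuming the two-state example data from above and a temporary file as a stand-in for the real output path:

```r
# Two-state example data (same values as in the answers above)
df <- data.frame(
  State = rep(c("AL", "TX"), each = 5),
  Year = rep(c(1970L, 1980L, 1990L, 2000L, 2010L), times = 2),
  Num = c(1, 2, 3, 4, 6, 5, 2, 10, 12, 6)
)

# Sort so that 1970 is the first row of each state, then compare
# every year with that first value
df <- df[order(df$State, df$Year), ]
base_value <- ave(df$Num, df$State, FUN = function(x) x[1])
df$pct_change <- 100 * (df$Num - base_value) / base_value

# Hypothetical output location; replace with a real file path
out_file <- tempfile(fileext = ".csv")
write.csv(df, out_file, row.names = FALSE)
```

Setting row.names = FALSE avoids an extra unnamed index column in the exported file.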

Months to integer R

This is part of the dataframe I am working on. The first column represents the year, the second the month, and the third one the number of observations for that month of that year.
2005 07 2
2005 10 4
2005 12 2
2006 01 4
2006 02 1
2006 07 2
2006 08 1
2006 10 3
I have observations from 2000 to 2018. I would like to run a kernel regression on this data, so I need to create a continuous integer index from a date class vector. For instance, Jan 2000 would be 1, Jan 2001 would be 13, Jan 2002 would be 25 and so on. With that I will be able to run the kernel regression. Later on, I need to translate that back (1 would be Jan 2000, 2 would be Feb 2000 and so on) to plot my model.
Just use a little algebra:
df$cont <- (df$year - 2000L) * 12L + df$month
You could go backward with modulus and integer division. Subtract 1 first so that December (where cont is a multiple of 12) stays in the correct year and maps to month 12.
df$year <- (df$cont - 1L) %/% 12L + 2000L
df$month <- (df$cont - 1L) %% 12L + 1L
Here, %% is the modulus operator and %/% is the integer division operator. See ?"%%" for an explanation of these and other arithmetic operators.
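A small round-trip check makes the December edge case explicit. The helper names to_index and from_index and the base_year argument are made up for this sketch; it assumes integer year and month columns.

```r
# Month index counted from January of a base year (Jan 2000 -> 1)
to_index <- function(year, month, base_year = 2000L) {
  (year - base_year) * 12L + month
}

# Inverse mapping; subtracting 1 before dividing keeps December
# (index 12, 24, ...) in the correct year
from_index <- function(index, base_year = 2000L) {
  data.frame(
    year = (index - 1L) %/% 12L + base_year,
    month = (index - 1L) %% 12L + 1L
  )
}

# Jan 2000, Dec 2000, Jan 2001, Dec 2018
idx <- to_index(c(2000L, 2000L, 2001L, 2018L), c(1L, 12L, 1L, 12L))
idx  # 1 12 13 228

# Converting back should reproduce the original year/month pairs
back <- from_index(idx)
stopifnot(identical(back$month, c(1L, 12L, 1L, 12L)))
```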
What you can do is something like the following. First create a dates data.frame with expand.grid so we have all the years and months from 2000 01 to 2018 12. Next put this in the correct order and last add an order column so that 2000 01 starts with 1 and 2018 12 is 228. If you merge this with your original table you get the below result. You can then remove columns you don't need. And because you have a dates table you can return the year and month columns based on the order column.
dates <- expand.grid(year = seq(2000, 2018), month = seq(1, 12))
dates <- dates[order(dates$year, dates$month), ]
dates$order <- seq_along(dates$year)
merge(df, dates, by.x = c("year", "month"), by.y = c("year", "month"))
year month obs order
1 2005 10 4 70
2 2005 12 2 72
3 2005 7 2 67
4 2006 1 4 73
5 2006 10 3 82
6 2006 2 1 74
7 2006 7 2 79
8 2006 8 1 80
data:
df <- structure(list(year = c(2005L, 2005L, 2005L, 2006L, 2006L, 2006L, 2006L, 2006L),
month = c(7L, 10L, 12L, 1L, 2L, 7L, 8L, 10L),
obs = c(2L, 4L, 2L, 4L, 1L, 2L, 1L, 3L)),
class = "data.frame",
row.names = c(NA, -8L))
An option is to use the yearmon type from the zoo package and then calculate the number of months elapsed since Jan 2000 as the difference between two yearmon values.
library(zoo)
# +1 has been added to the difference so that Jan 2000 is treated as 1
df$slNum = (as.yearmon(paste0(df$year, df$month),"%Y%m")-as.yearmon("200001","%Y%m"))*12+1
# year month obs slNum
# 1 2005 7 2 67
# 2 2005 10 4 70
# 3 2005 12 2 72
# 4 2006 1 4 73
# 5 2006 2 1 74
# 6 2006 7 2 79
# 7 2006 8 1 80
# 8 2006 10 3 82
Data:
df <- read.table(text =
"year month obs
2005 07 2
2005 10 4
2005 12 2
2006 01 4
2006 02 1
2006 07 2
2006 08 1
2006 10 3",
header = TRUE, stringsAsFactors = FALSE)

get the mean of a variable subset of data in R [duplicate]

Imagine I have the following data:
Year Month State ppo
2011 Jan CA 220
2011 Feb CA 250
2012 Jan CA 230
2011 Jan WA 200
2011 Feb WA 210
I need to calculate the mean for each state for the year, so the output would look something like this:
Year Month State ppo annualAvg
2011 Jan CA 220 235
2011 Feb CA 250 235
2012 Jan CA 230 230
2011 Jan WA 200 205
2011 Feb WA 210 205
where the annual average is the mean of any entries for that state in the same year. If the year and state were constant I would know how to do this, but somehow the fact that they are variable is throwing me off.
Looking around, it seems like maybe ddply is what I want to be using for this (https://stats.stackexchange.com/questions/8225/how-to-summarize-data-by-group-in-r), but when I tried to use it I was doing something wrong and kept getting errors (I have tried so many variations of it that I won't bother to post them all here). Any idea how I am actually supposed to be doing this?
Thanks for the help!
Try this:
library(data.table)
setDT(df)
df[, annualAvg := mean(ppo), by = .(Year, State)]
Base R:
df$ppoAvg <- ave(df$ppo, df$State, df$Year, FUN = mean)
Using dplyr with group_by %>% mutate to add a column:
library(dplyr)
df %>% group_by(Year, State) %>% mutate(annualAvg = mean(ppo))
#Source: local data frame [5 x 5]
#Groups: Year, State [3]
# Year Month State ppo annualAvg
# (int) (fctr) (fctr) (int) (dbl)
#1 2011 Jan CA 220 235
#2 2011 Feb CA 250 235
#3 2012 Jan CA 230 230
#4 2011 Jan WA 200 205
#5 2011 Feb WA 210 205
Using data.table:
library(data.table)
setDT(df)[, annualAvg := mean(ppo), .(Year, State)]
df
# Year Month State ppo annualAvg
#1: 2011 Jan CA 220 235
#2: 2011 Feb CA 250 235
#3: 2012 Jan CA 230 230
#4: 2011 Jan WA 200 205
#5: 2011 Feb WA 210 205
Data:
df <- structure(list(Year = c(2011L, 2011L, 2012L, 2011L, 2011L), Month = structure(c(2L,
1L, 2L, 2L, 1L), .Label = c("Feb", "Jan"), class = "factor"),
State = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("CA",
"WA"), class = "factor"), ppo = c(220L, 250L, 230L, 200L,
210L), annualAvg = c(235, 235, 230, 205, 205)), .Names = c("Year",
"Month", "State", "ppo", "annualAvg"), class = c("data.table",
"data.frame"), row.names = c(NA, -5L))
