I'm trying to calculate the year-to-year change in some data I have. It is in panel/longitudinal form.
The data is in a data frame that looks like this:
Year ZipCode Value
2011 11411 5
2012 11411 10
2013 11411 20
2011 11345 6
2012 11345 7
2013 11345 10
I would like to get a data frame of this form:
Year Difference ZipCode % Change
2011-2012 11411 100%
2012-2013 11411 100%
2011-2012 11345 16%
2012-2013 11345 42%
One way using dplyr is to compute Change as the difference between the current and the previous Value, divided by the previous Value, and to paste consecutive Year values together within each ZipCode.
library(dplyr)
df %>%
  group_by(ZipCode) %>%
  mutate(Change = (Value - lag(Value)) / lag(Value) * 100,
         Year_Diff = paste(lag(Year), Year, sep = "-")) %>%
  slice(-1) %>%
  select(Year_Diff, ZipCode, Change)
# Year_Diff ZipCode Change
# <chr> <int> <dbl>
#1 2011-2012 11345 16.7
#2 2012-2013 11345 42.9
#3 2011-2012 11411 100
#4 2012-2013 11411 100
Using data.table, we group by 'ZipCode', take the diff of 'Value', divide by 'Value' with its last element dropped (so the lengths match), and paste the adjacent 'Year' values together.
library(data.table)
setDT(df1)[, .(Change = 100 *diff(Value)/Value[-.N],
Year_Diff = paste(Year[-.N], Year[-1], sep="-")), .(ZipCode)]
# ZipCode Change Year_Diff
#1: 11411 100.00000 2011-2012
#2: 11411 100.00000 2012-2013
#3: 11345 16.66667 2011-2012
#4: 11345 42.85714 2012-2013
data
df1 <- structure(list(Year = c(2011L, 2012L, 2013L, 2011L, 2012L, 2013L
), ZipCode = c(11411L, 11411L, 11411L, 11345L, 11345L, 11345L
), Value = c(5L, 10L, 20L, 6L, 7L, 10L)), class = "data.frame",
row.names = c(NA,
-6L))
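For readers who prefer to avoid extra packages, the same calculation can be sketched in base R. This is only an illustration, not part of either answer above: `pct_change_one` is a hypothetical helper name, and the sketch assumes the rows of each ZipCode are already sorted by Year.

```r
# Same example data as df1 above.
df1 <- data.frame(
  Year    = c(2011L, 2012L, 2013L, 2011L, 2012L, 2013L),
  ZipCode = c(11411L, 11411L, 11411L, 11345L, 11345L, 11345L),
  Value   = c(5L, 10L, 20L, 6L, 7L, 10L)
)

# Hypothetical helper: percent change between consecutive rows of one group.
# Assumes the rows of the group are sorted by Year.
pct_change_one <- function(d) {
  n <- nrow(d)
  data.frame(
    Year_Diff = paste(d$Year[-n], d$Year[-1], sep = "-"),
    ZipCode   = d$ZipCode[-1],
    Change    = 100 * diff(d$Value) / d$Value[-n]
  )
}

# Split by ZipCode, apply the helper to each piece, and stack the results.
res <- do.call(rbind, lapply(split(df1, df1$ZipCode), pct_change_one))
rownames(res) <- NULL
res
```

As with group_by, split() orders the groups by ZipCode, so 11345 comes before 11411 in the result.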
Related
I have a panel dataset that goes like this
year id treatment_year time_to_treatment outcome
2000 1 2011 -11 2
2002 1 2011 -10 3
2004 2 2015 -9 22
and so on. I am trying to deal with the outliers by winsorizing. The end goal is to make a scatterplot with time_to_treatment on the X axis and outcome on the Y axis.
I would like to replace the outcomes for each time_to_treatment by its winsorized outcomes, i.e. replace all extreme values with the 5% and 95% quantile values.
So far what I have tried to do is this but it doesn't work.
for(i in range(dataset$time_to_treatment)){
dplyr::filter(dataset, time_to_treatment == i)$outcome <- DescTools::Winsorize(dplyr::filter(dataset,time_to_treatment==i)$outcome)
}
I get the error - Error in filter(dataset, time_to_treatment == i) <- *vtmp* :
could not find function "filter<-"
Would anyone be able to suggest a better way?
Thanks.
my actual data
where: conflicts = outcome, commission = year of treatment, CD_mun = id.
The concerned time period indicator is time_to_t
Groups: year, CD_MUN, type [6]
type CD_MUN year time_to_t conflicts commission
chr dbl dbl dbl int dbl
manif 1100023 2000 -11 1 2011
manif 1100189 2000 -3 2 2003
manif 1100205 2000 -9 5 2009
manif 1500602 2000 -4 1 2004
manif 3111002 2000 -11 2 2011
manif 3147006 2000 -10 1 2010
Assuming "time periods" refers to the 'commission' column, you may use ave.
transform(dat, conflicts_w=ave(conflicts, commission, FUN=DescTools::Winsorize))
# type CD_MUN year time_to_t conflicts commission conflicts_w
# 1 manif 1100023 2000 -11 1 2011 1.05
# 2 manif 1100189 2000 -3 2 2003 2.00
# 3 manif 1100205 2000 -9 5 2009 5.00
# 4 manif 1500602 2000 -4 1 2004 1.00
# 5 manif 3111002 2000 -11 2 2011 1.95
# 6 manif 3147006 2000 -10 1 2010 1.00
Data:
dat <- structure(list(type = c("manif", "manif", "manif", "manif", "manif",
"manif"), CD_MUN = c(1100023L, 1100189L, 1100205L, 1500602L,
3111002L, 3147006L), year = c(2000L, 2000L, 2000L, 2000L, 2000L,
2000L), time_to_t = c(-11L, -3L, -9L, -4L, -11L, -10L), conflicts = c(1L,
2L, 5L, 1L, 2L, 1L), commission = c(2011L, 2003L, 2009L, 2004L,
2011L, 2010L)), class = "data.frame", row.names = c(NA, -6L))
For a start you may use this:
# The data
set.seed(123)
df <- data.frame(
  time_to_treatment = seq(-15, 0, 1),
  outcome = sample(1:30, 16, replace = TRUE)
)
# A solution without Winsorize based solely on dplyr
library(dplyr)
df %>%
  mutate(outcome05 = quantile(outcome, probs = 0.05), # 5% quantile
         outcome95 = quantile(outcome, probs = 0.95), # 95% quantile
         outcome = ifelse(outcome <= outcome05, outcome05, outcome), # replace
         outcome = ifelse(outcome >= outcome95, outcome95, outcome)) %>%
  select(-c(outcome05, outcome95))
You may adapt this to your exact problem.
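One possible adaptation, sketched in base R: the snippet above clamps over the whole column, whereas the question asks for clamping within each time_to_treatment value. The helper name `winsorize_5_95` and the toy data below are assumptions made for illustration; the quantiles use R's default method (type 7), which may differ slightly from what a dedicated winsorizing function uses.

```r
# Hypothetical helper: clamp x to its own 5% and 95% quantiles.
winsorize_5_95 <- function(x) {
  q <- quantile(x, probs = c(0.05, 0.95), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}

# Made-up data: four time_to_treatment values, each with 19 ordinary
# outcomes (1 to 19) and one obvious outlier (1000).
dataset <- data.frame(
  time_to_treatment = rep(-3:0, each = 20),
  outcome = rep(c(1:19, 1000), times = 4)
)

# ave() applies the helper within each time_to_treatment group and
# returns a vector aligned with the rows of dataset.
dataset$outcome_w <- ave(dataset$outcome, dataset$time_to_treatment,
                         FUN = winsorize_5_95)

range(dataset$outcome)    # still contains the outlier
range(dataset$outcome_w)  # clamped within each group
```

Writing the result to a new column (outcome_w) keeps the original values available for comparison; this also sidesteps the assignment into filter() that caused the "filter<-" error in the question.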
I have a data set DF with the following data
Zone Year X Y
1001 2018 10 5
1001 2019 20 10
1001 2020 30 20
1002 2018 15 10
1002 2019 25 20
1002 2020 35 40
I want to create a column Z = X + Y - Previous year's Y
So it creates the following Table:
Zone Year X Y Z
1001 2018 10 5 NA
1001 2019 20 10 25
1001 2020 30 20 40
1002 2018 15 10 NA
1002 2019 25 20 35
1002 2020 35 40 55
I can use mutate from dplyr to generate column Z:
mutate(DF, Z = X + Y - lag(Y))
I can use tapply to apply a function to each group of DF. Can I wrap this dplyr call in a user-defined function so that I can apply it per group with tapply later?
In dplyr you can add group_by to apply a function for every group (Zone).
library(dplyr)
DF %>% group_by(Zone) %>% mutate(Z = X + Y - lag(Y))
# Zone Year X Y Z
# <int> <int> <int> <int> <int>
#1 1001 2018 10 5 NA
#2 1001 2019 20 10 25
#3 1001 2020 30 20 40
#4 1002 2018 15 10 NA
#5 1002 2019 25 20 35
#6 1002 2020 35 40 55
We can also write a function :
add_new_col = function(x, y) {
x + y - lag(y)
}
which can be used as :
DF %>% group_by(Zone) %>% mutate(Z = add_new_col(X, Y))
data
DF <- structure(list(Zone = c(1001L, 1001L, 1001L, 1002L, 1002L, 1002L
), Year = c(2018L, 2019L, 2020L, 2018L, 2019L, 2020L), X = c(10L,
20L, 30L, 15L, 25L, 35L), Y = c(5L, 10L, 20L, 10L, 20L, 40L)),
class = "data.frame", row.names = c(NA, -6L))
Using data.table
library(data.table)
setDT(DF)[, Z := X + Y - shift(Y), Zone]
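Since the question mentions tapply, here is a base R sketch of the same per-Zone calculation. It uses ave() rather than tapply because ave() returns a vector aligned with the original rows. The helper `lag1` is a hypothetical name, and the sketch assumes the rows are sorted by Year within each Zone.

```r
# Same example data as DF above, rebuilt as a plain data.frame.
DF <- data.frame(
  Zone = c(1001L, 1001L, 1001L, 1002L, 1002L, 1002L),
  Year = c(2018L, 2019L, 2020L, 2018L, 2019L, 2020L),
  X    = c(10L, 20L, 30L, 15L, 25L, 35L),
  Y    = c(5L, 10L, 20L, 10L, 20L, 40L)
)

# Hypothetical helper: shift a vector down by one position, padding with NA.
lag1 <- function(v) c(NA, head(v, -1))

# Previous year's Y within each Zone, then Z = X + Y - previous Y.
DF$Z <- DF$X + DF$Y - ave(DF$Y, DF$Zone, FUN = lag1)
DF
```

The first row of each Zone gets NA because it has no previous year.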
num Name year X Y
1 1 A 2015 68 80%
2 1 A 2016 69 85%
3 1 A 2017 70 95%
4 1 A 2018 71 85%
5 1 A 2019 72 90%
6 2 B 2018 20 80%
7 2 B 2019 23 75%
8 2 C 2014 3 55%
9 4 D 2012 4 75%
10 4 D 2013 5 100%
Let's say I have data like the above. I want to remove the individuals that do not have any observations in the most recent year. So, in the above, we would be left with A & B, but C & D would be deleted. The most recent season will always be in the data and can be referenced with the max() function (i.e., we don't need to hardcode it as 2019 and update it yearly).
The plan is to create a facet wrapped line chart where the percentages are on the y-axis and the years are on the x-axis. The facet would be on the names so each individual will have its own line chart with their percentages by year. We don't care about people who left, so that's why we're dropping records. Though, there is a chance they come back, so I don't want to drop them from the underlying data.
One dplyr option could be:
df %>%
group_by(Name) %>%
filter(any(year %in% max(df$year)))
num Name year X Y
<int> <chr> <int> <int> <chr>
1 1 A 2015 68 80%
2 1 A 2016 69 85%
3 1 A 2017 70 95%
4 1 A 2018 71 85%
5 1 A 2019 72 90%
6 2 B 2018 20 80%
7 2 B 2019 23 75%
We can use subset from base R as well: take the 'Name' values where 'year' is the max, get the unique elements, and create a logical vector with %in% to subset the rows.
subset(df1, Name %in% unique(Name[year == max(year)]))
# num Name year X Y
#1 1 A 2015 68 80%
#2 1 A 2016 69 85%
#3 1 A 2017 70 95%
#4 1 A 2018 71 85%
#5 1 A 2019 72 90%
#6 2 B 2018 20 80%
#7 2 B 2019 23 75%
No packages are used.
Or similar syntax in dplyr:
library(dplyr)
df1 %>%
filter(Name %in% unique(Name[year == max(year)]))
data
df1 <- structure(list(num = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 4L, 4L
), Name = c("A", "A", "A", "A", "A", "B", "B", "C", "D", "D"),
year = c(2015L, 2016L, 2017L, 2018L, 2019L, 2018L, 2019L,
2014L, 2012L, 2013L), X = c(68L, 69L, 70L, 71L, 72L, 20L,
23L, 3L, 4L, 5L), Y = c("80%", "85%", "95%", "85%", "90%",
"80%", "75%", "55%", "75%", "100%")), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
Using the data frame DF shown in the Note at the end we use semi_join to reduce it to the required names, convert Y to numeric and plot it. DF is not modified.
A possible alternative to the semi_join line is
filter(ave(year == max(year), Name, FUN = any)) %>%
The code is:
library(dplyr)
library(ggplot2)
DF %>%
semi_join(filter(., year == max(year)), by = "Name") %>%
mutate(Y = as.numeric(sub("%", "", Y))) %>%
ggplot(aes(year, Y)) + geom_line() + facet_wrap(~Name)
Note
The input in reproducible form:
Lines <- " num Name year X Y
1 1 A 2015 68 80%
2 1 A 2016 69 85%
3 1 A 2017 70 95%
4 1 A 2018 71 85%
5 1 A 2019 72 90%
6 2 B 2018 20 80%
7 2 B 2019 23 75%
8 2 C 2014 3 55%
9 4 D 2012 4 75%
10 4 D 2013 5 100%"
DF <- read.table(text = Lines)
I have some Revenue data in a format like that shown below. So the years are not sequential and they can also repeat (because of a different firm).
Firm Year Revenue
1 A 2018 100
2 B 2017 90
3 B 2018 80
4 C 2016 20
And I want to adjust the Revenue for inflation, by dividing through by the appropriate CPI for each year. The CPI data looks like this:
Year CPI
1 2016 98
2 2017 100
3 2018 101
I have a solution that works, but is this the best way to do it? Is it clumsy to mutate an entire calculating column in there?
revenue <- data.frame(stringsAsFactors=FALSE,
Firm = c("A", "B", "B", "C"),
Year = c(2018L, 2017L, 2018L, 2016L),
Revenue = c(100L, 90L, 80L, 20L)
)
cpi <- data.frame(
Year = c(2016L, 2017L, 2018L),
CPI = c(98L, 100L, 101L)
)
library(dplyr)
df <- left_join(revenue, cpi, by = 'Year')
mutate(df, real_revenue = (Revenue*100)/CPI)
The output is correct, shown below. But is this the best way to do it?
Firm Year Revenue CPI real_revenue
1 A 2018 100 101 99.00990
2 B 2017 90 100 90.00000
3 B 2018 80 101 79.20792
4 C 2016 20 98 20.40816
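The join-then-mutate approach is a common pattern. For comparison, a lookup with match() in base R avoids carrying the CPI column along. This is a sketch of one alternative, not a claim that it is better; `idx` is just an illustrative variable name.

```r
revenue <- data.frame(
  Firm    = c("A", "B", "B", "C"),
  Year    = c(2018L, 2017L, 2018L, 2016L),
  Revenue = c(100L, 90L, 80L, 20L),
  stringsAsFactors = FALSE
)
cpi <- data.frame(
  Year = c(2016L, 2017L, 2018L),
  CPI  = c(98L, 100L, 101L)
)

# For each revenue row, find the position of its Year in the CPI table.
idx <- match(revenue$Year, cpi$Year)

# Divide by the matching CPI; years missing from cpi would give NA.
revenue$real_revenue <- revenue$Revenue * 100 / cpi$CPI[idx]
revenue
```

The result has the same real_revenue values as the joined version, without the intermediate CPI column.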
I have a dataset containing variables and a quantity of goods sold: for some days, however, there are no values.
I created a dataset with all 0 values in sales and all NA in the rest. How can I add those lines to the initial dataset?
At the moment, I have this:
sales
day month year employees holiday sales
1 1 2018 14 0 1058
2 1 2018 25 1 2174
4 1 2018 11 0 987
sales.NA
day month year employees holiday sales
1 1 2018 NA NA 0
2 1 2018 NA NA 0
3 1 2018 NA NA 0
4 1 2018 NA NA 0
I would like to create a new dataset that also includes the days with no observations, with sales set to 0 and NA in all other variables, like this:
new.data
day month year employees holiday sales
1 1 2018 14 0 1058
2 1 2018 25 1 2174
3 1 2018 NA NA 0
4 1 2018 11 0 987
I tried using something like this:
merge(sales.NA,sales, all.y=T, by = c("day","month","year"))
But it does not work.
Using dplyr, you could use right_join. For example:
sales <- data.frame(day = c(1,2,4),
month = c(1,1,1),
year = c(2018, 2018, 2018),
employees = c(14, 25, 11),
holiday = c(0,1,0),
sales = c(1058, 2174, 987)
)
sales.NA <- data.frame(day = c(1,2,3,4),
month = c(1,1,1,1),
year = c(2018,2018,2018, 2018)
)
library(dplyr)
right_join(sales, sales.NA)
This leaves you with
day month year employees holiday sales
1 1 1 2018 14 0 1058
2 2 1 2018 25 1 2174
3 3 1 2018 NA NA NA
4 4 1 2018 11 0 987
This leaves NA in sales where you want 0, but that could be fixed by including the sales data in sales.NA, or you could use "tidyr"
library(tidyr) # for replace_na()
right_join(sales, sales.NA) %>% mutate(sales = replace_na(sales, 0))
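The same result can also be sketched with base R's merge(). The attempt in the question probably misbehaves for two reasons: both tables contain employees, holiday and sales, so merge() produces suffixed .x/.y copies of those columns, and all.y = TRUE keeps only the rows of the second table (sales), which has no day 3. The sketch below keeps only the key columns in the table of all days (called `all_days` here, a hypothetical name) and uses all.x = TRUE.

```r
sales <- data.frame(
  day       = c(1, 2, 4),
  month     = c(1, 1, 1),
  year      = c(2018, 2018, 2018),
  employees = c(14, 25, 11),
  holiday   = c(0, 1, 0),
  sales     = c(1058, 2174, 987)
)

# Table of every day that should appear in the result (key columns only).
all_days <- data.frame(day = c(1, 2, 3, 4), month = 1, year = 2018)

keys <- c("day", "month", "year")

# Keep every row of all_days; days missing from sales get NA everywhere.
new.data <- merge(all_days, sales, by = keys, all.x = TRUE)

# Replace the missing sales with 0, leaving the other columns as NA.
new.data$sales[is.na(new.data$sales)] <- 0
new.data
```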
Here is another data.table solution:
jvars = c("day","month","year")
merge(sales.NA[, ..jvars], sales, by = jvars, all.x = TRUE)[is.na(sales), sales := 0L][]
day month year employees holiday sales
1: 1 1 2018 14 0 1058
2: 2 1 2018 25 1 2174
3: 3 1 2018 NA NA 0
4: 4 1 2018 11 0 987
Or with some neater syntax:
sales[sales.NA[, ..jvars], on = jvars][is.na(sales), sales := 0][]
Reproducible data:
sales <- structure(list(day = c(1L, 2L, 4L), month = c(1L, 1L, 1L), year = c(2018L,
2018L, 2018L), employees = c(14L, 25L, 11L), holiday = c(0L,
1L, 0L), sales = c(1058L, 2174L, 987L)), row.names = c(NA, -3L
), class = c("data.table", "data.frame"))
sales.NA <- structure(list(day = 1:4, month = c(1L, 1L, 1L, 1L), year = c(2018L,
2018L, 2018L, 2018L), employees = c(NA, NA, NA, NA), holiday = c(NA,
NA, NA, NA), sales = c(0L, 0L, 0L, 0L)), row.names = c(NA, -4L
), class = c("data.table", "data.frame"))
This answer uses the data.table package, since I am more familiar with its syntax, but regular data.frames should work in much the same way. I would also switch to a proper date format, which will make life easier for you down the line.
With this approach you do not need the sales.NA table at all: after joining against a full sequence of dates, every day without sales data automatically shows up as a row of NAs.
library(data.table)
dt.dates <- data.table(Date = seq.Date(from = as.Date("2018-01-01"), to = as.Date("2018-12-31"),by = "day" ))
dt.sales <- data.table(day = c(1,2,4)
, month = c(1,1,1)
, year = c(2018,2018,2018)
, employees = c(14, 25, 11)
, holiday = c(0,1,0)
, sales = c(1058, 2174, 987)
)
dt.sales[, Date := as.Date(paste(year,month,day, sep = "-")) ]
merge( x = dt.dates
, y = dt.sales
, by.x = "Date"
, by.y = "Date"
, all.x = TRUE
)
Date day month year employees holiday sales
1: 2018-01-01 1 1 2018 14 0 1058
2: 2018-01-02 2 1 2018 25 1 2174
3: 2018-01-03 NA NA NA NA NA NA
4: 2018-01-04 4 1 2018 11 0 987
...