Detecting changes for several companies - r

Can anyone suggest tools, packages, or code to detect changes in the peer groups that are used for relative performance evaluation?
I have a dataframe with all peers that are used for a certain firm (CIK) over the years. An example of this data is given below:
CIK <- c("21344","21344", "21344", "21344", "21344", "21344", "21344", "21344", "21344")
FiscalYear <- c("2013", "2014", "2015", "2016", "2017", "2014", "2015", "2016", "2017")
PeerCIK <- c("1800","1800","1800","1800","1800","21456","21456","21456","21456")
dataframe <- data.frame(CIK, FiscalYear, PeerCIK)
This results in the following table:
CIK FiscalYear PeerCIK
1 21344 2013 1800
2 21344 2014 1800
3 21344 2015 1800
4 21344 2016 1800
5 21344 2017 1800
6 21344 2014 21456
7 21344 2015 21456
8 21344 2016 21456
9 21344 2017 21456
Now, I want to identify whether the peers (PeerCIK) are present for the whole period that is covered by the firm (CIK). I thus first need to identify the first and last year per CIK (in this example it is clearly 2013-2017, but I need to do this for many firms). The code I used for this is:
df2 <- dataframe %>%
group_by(CIK) %>%
summarise(
start = min(FiscalYear),
end = max(FiscalYear)
)
> df2
CIK start end
1 21344 2013 2017
Next, I need to identify whether each of the peers is present for that whole period.
If not, then a change must have taken place in the peer group (the peer was added to or deleted from the group). This is where I am stuck. The outcome that I ultimately want is a dataframe that shows, for every firm (CIK) and fiscal year, whether a change has taken place in the peer group compared to the previous year (where change is a dummy variable with value 1 if a change took place). A change occurs when a peer is added after the firm's starting year, or when a peer is no longer included although the firm's end year has not yet been reached.
Expected outcome
For the example above, I would have the following outcome, as company 21456 is added from 2014 onwards and thus a change has taken place compared to 2013:
CIK FiscalYear change
1 21344 2013 0
2 21344 2014 1
3 21344 2015 0
4 21344 2016 0
5 21344 2017 0
I really hope someone can help me, please let me know

A slightly different approach via expand(), full_join, and some helper variables which should cover most of your edge cases:
library(tidyverse)
dataframe %>%
# Add helper variable to indicate present relationships.
mutate(
present = 1
) %>%
# Generate all possible variations of CIK, FiscalYear, and PeerCik
# and join with our data.
full_join(
dataframe %>% expand(CIK, FiscalYear, PeerCIK),
by = c("CIK", "FiscalYear", "PeerCIK")
) %>%
# Set the helper variable to 0 wherever it is missing,
# which is the case in your newly joined empty data from `expand(...)`.
mutate(
present = ifelse(is.na(present), 0, present)
) %>%
# Sort the data because now the order will be important.
arrange(CIK, PeerCIK, FiscalYear) %>%
# Group by CIK-PeerCIK relationship...
group_by(
CIK, PeerCIK
) %>%
# ...and compare each FiscalYear to the previous FiscalYear.
mutate(
# Check if a relationship was added compared to the year before.
added = case_when(
row_number() == 1 ~ 0,
lag(present) == 0 & present == 1 ~ 1,
TRUE ~ 0
),
# Check if a relationship was removed compared to the year before.
removed = case_when(
row_number() == 1 ~ 0,
lag(present) == 1 & present == 0 ~ 1,
TRUE ~ 0
),
# Combine those two into one variable.
change = ifelse(abs(added) + abs(removed) > 0, 1, 0)
) %>%
ungroup() %>%
# Now to the summary: Group by CIK and FiscalYear...
group_by(
CIK, FiscalYear
) %>%
# ...and calculate all sums for each CIK and FiscalYear.
summarize(
# Total number of present relationships in this year.
num_present = sum(present),
# Number of added relationships in this year.
num_added = sum(added),
# Number of removed relationships in this year.
num_removed = sum(removed),
# Was there any change in this year?
# An alternative would be `sum(change)` to
# indicate the number of changed relationships.
change = max(change)
) %>%
ungroup()
Result:
# A tibble: 5 × 6
CIK FiscalYear num_present num_added num_removed change
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 21344 2013 1 0 0 0
2 21344 2014 2 1 0 1
3 21344 2015 2 0 0 0
4 21344 2016 2 0 0 0
5 21344 2017 2 0 0 0
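If you have many firms that cover different periods, a base R alternative is to compare each firm's set of peers with the set from its previous observed year. The following is only a sketch under some assumptions: `peer_set_changes` is a hypothetical helper name, each firm's first observed year is treated as "no change", and a year is compared with the firm's previous year that appears in the data (so gaps in FiscalYear are not treated specially).

```r
# Example data from the question.
dataframe <- data.frame(
  CIK = rep("21344", 9),
  FiscalYear = c("2013", "2014", "2015", "2016", "2017",
                 "2014", "2015", "2016", "2017"),
  PeerCIK = c(rep("1800", 5), rep("21456", 4)),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from the answer above): flags, per firm and year,
# whether the set of peers differs from the previous observed year's set.
peer_set_changes <- function(df) {
  # One list element per CIK/FiscalYear pair, holding that year's peers.
  key <- paste(df$CIK, df$FiscalYear, sep = "_")
  sets <- split(df$PeerCIK, key)

  # One output row per firm-year, sorted so the previous row is the previous year.
  out <- unique(df[, c("CIK", "FiscalYear")])
  out <- out[order(out$CIK, out$FiscalYear), ]
  rownames(out) <- NULL
  out$change <- 0

  for (i in seq_len(nrow(out))) {
    # Assumption: a firm's first observed year has nothing to compare against.
    if (i == 1 || out$CIK[i] != out$CIK[i - 1]) next
    cur <- sets[[paste(out$CIK[i], out$FiscalYear[i], sep = "_")]]
    prev <- sets[[paste(out$CIK[i - 1], out$FiscalYear[i - 1], sep = "_")]]
    out$change[i] <- as.numeric(!setequal(cur, prev))
  }
  out
}

result <- peer_set_changes(dataframe)
result
```

Because every comparison stays inside one firm's own rows, peers and years that belong only to other firms never enter the picture.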

Related

Remove observation based on occurrence in panel data

I'm working with panel data
and I want to keep only the ids for which the first occurrence of v_1 = 1 is not the first observation of that id.
It is similar to the bysort command in Stata.
In the example, I want to keep only the observations of id 61312 and not those of id 42848.
Thanks
dd <- read.table(text="
id year v_1
61312 2015 0
61312 2016 0
61312 2017 1
61312 2018 1
42848 2016 1
42848 2017 0", header=TRUE)
You can use group_by and filter from dplyr to help with this task
library(dplyr)
dd %>%
group_by(id) %>%
filter(first(v_1) != 1)
We use group_by so that first() looks at the first value within each id.
You can use -
subset(dd, id %in% unique(id)[v_1[!duplicated(id)] != 1])
# id year v_1
#1 61312 2015 0
#2 61312 2016 0
#3 61312 2017 1
#4 61312 2018 1
v_1[!duplicated(id)] keeps only the first v_1 value of each id, and we select only those ids whose first value is not equal to 1.
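Both answers rely on the rows already being sorted by year within each id. If that is not guaranteed, one option is to look up v_1 at each id's earliest year explicitly. This is a minimal base R sketch; the row order in the example is shuffled on purpose, and `first_v` and `kept` are just illustrative names.

```r
# Same data as in the question, but with the rows deliberately out of order.
dd <- read.table(text = "
id year v_1
61312 2017 1
61312 2015 0
42848 2017 0
61312 2018 1
42848 2016 1
61312 2016 0", header = TRUE)

# For every row, the value of v_1 in the earliest year of that row's id.
# ave() passes the row indices of each id to FUN and recycles the result.
first_v <- ave(
  seq_len(nrow(dd)), dd$id,
  FUN = function(i) dd$v_1[i][which.min(dd$year[i])]
)

# Keep the ids whose earliest observation does not have v_1 == 1.
kept <- dd[first_v != 1, ]
kept <- kept[order(kept$id, kept$year), ]
kept
```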

How to sample from smaller data frame with multiple conditions to a larger data frame?

I have a main dataset df.main with 3 sites, each with 3 sub-sites, that were sampled over three months. I have a separate dataset df.sample with an abiotic variable for a single month ONLY, but with three values for each sub-site of each site. I need to add the abiotic column to my original dataset. For every month and each sub-site, I want to SAMPLE with replacement one of the three values for that sub-site.
set.seed(111)
##Main Data Set
month <- rep(c("Jan","Feb","Mar"), each =9 )
site <- rep(c("1","2","3","1","2","3","1","2","3"), each = 3)
sub.site <- rep(c(1,2,3,1,2,3,1,2,3), time = 3 )
df.main <- data.frame(month, site, sub.site)
month site sub.site
Jan 1 1
Jan 1 2
Jan 1 3
Jan 2 1
Jan 2 2
... .. ..
Mar 3 3
##Sampler Data Set
site <- rep(c(1,2,3), time = 9)
sub.site <- rep(c(1,1,1,2,2,2,3,3,3), each = 3)
abiotic <- rnorm(27,7,1)
df.sample <- data.frame(site, sub.site, abiotic)
site sub.site abiotic
1 1 7.175096
1 1 8.805868
1 1 6.783571
1 2 7.910917
1 2 7.202307
1 2 8.404883
...
2 1 7.122915
2 1 6.152732
...
3 1 7.978232
3 1 6.870228
##Desired Output in df.main
month site sub.site abiotic
Jan 1 1 8.805868
Jan 1 2 7.910917
You can do a full join of the two tables using site and sub.site, and then just sample one row from each month, site and sub.site combination.
If you are unsure about table joins (full join, left join, etc.), you may want to look them up online. They are simple and standard in, say, database querying.
In your case, each of the 27 rows in df.main matches the 3 rows in df.sample that share its site and sub.site, so after the full join you will have 81 rows:
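joining.output <- df.main %>%
# site is character in df.main but numeric in df.sample, so convert it first.
mutate(site = as.numeric(site)) %>%
full_join(df.sample, by = c("site", "sub.site"))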
> joining.output
month site sub.site abiotic
1 Jan 1 1 7.235221
2 Jan 1 1 4.697654
3 Jan 1 1 5.502573
...
28 Feb 1 1 7.235221
29 Feb 1 1 4.697654
30 Feb 1 1 5.502573
...
55 Mar 1 1 7.235221
56 Mar 1 1 4.697654
57 Mar 1 1 5.502573
Then to sample 1 row for each site and sub.site combination for each month, just group by the 3 variables and sample.
Here is the code that puts everything together:
output <- df.main %>%
# Make the join keys the same type in both tables.
mutate(site = as.numeric(site)) %>%
full_join(df.sample, by = c("site", "sub.site")) %>%
group_by(month, site, sub.site) %>%
slice_sample(n=1)
P.S. In your example, df.main$site is a character vector but df.sample$site is numeric, so the join keys do not match as they stand. The mutate() step above converts site to numeric before joining.
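If you would rather avoid building the 81-row intermediate table, you can also draw directly from a pool of values per site and sub.site. This is only a sketch in base R: the data is rebuilt here with numeric site columns in both tables, and `pools` and `keys` are illustrative names, not something from your code.

```r
set.seed(111)

# Main data set: 3 months x 3 sites x 3 sub-sites.
df.main <- data.frame(
  month = rep(c("Jan", "Feb", "Mar"), each = 9),
  site = rep(rep(1:3, each = 3), times = 3),
  sub.site = rep(1:3, times = 9)
)

# Sampler data set: 3 abiotic values per site/sub.site combination.
df.sample <- data.frame(
  site = rep(1:3, times = 9),
  sub.site = rep(1:3, each = 9),
  abiotic = rnorm(27, 7, 1)
)

# One pool of candidate values per site/sub.site combination.
pools <- split(df.sample$abiotic, paste(df.sample$site, df.sample$sub.site))

# For every row of df.main, draw one value from the matching pool.
# Rows are drawn independently, so a value can be reused across months.
keys <- paste(df.main$site, df.main$sub.site)
df.main$abiotic <- vapply(
  keys,
  function(k) {
    pool <- pools[[k]]
    pool[sample.int(length(pool), 1)]
  },
  numeric(1),
  USE.NAMES = FALSE
)

head(df.main)
```

Indexing with sample.int() instead of calling sample() on the pool avoids the surprise that sample(x, 1) draws from 1:x when x happens to be a single number.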

How to do a frequency table where column values are variables?

I have a DF named JOB. In that DF I have 4 columns: PERSON_ID, JOB, FT (1 for full time, 2 for part time) and YEAR. Every person can have only 1 full-time job per year in this DF. This is the full-time job from which they got most of their income during the year.
DF
PERSON_ID JOB FT YEAR
1 Analyst 1 2018
1 Analyst 1 2019
1 Analyst 1 2020
2 Coach 1 2018
2 Coach 1 2019
2 Analyst 1 2020
3 Gardener 1 2020
4 Coach 1 2018
4 Coach 1 2019
4 Analyst 1 2020
4 Coach 2 2019
4 Gardener 2 2019
I want to get frequencies that answer questions along the following lines:
What full time job changes occurred from 2019 and 2020?
I want to look only at changes where FT=1.
I want my end table to look like this
2019 2020 frequency
Analyst Analyst 1
Coach Analyst 2
NA Gardener 1
I want to look at the data so that I can say: 2 people moved from their coaching job to an analyst job, 1 analyst did not change their job, and 1 person entered the labour market as a gardener.
I tried to fiddle around with the table function but did not even get close to what I wanted. I could not get the YEAR values to go into separate variables.
10 Bonus points if i can do it in base R :)
Thank you for your help
Not pretty but worked:
# split df by year
df_2019 <- df[df$YEAR %in% c(2019) & df$FT == 1, ]
df_2020 <- df[df$YEAR %in% c(2020) & df$FT == 1, ]
# rename Job columns
df_2019$JOB_2019 <- df_2019$JOB
df_2020$JOB_2020 <- df_2020$JOB
# select needed columns
df_2019 <- df_2019[, c("PERSON_ID", "JOB_2019")]
df_2020 <- df_2020[, c("PERSON_ID", "JOB_2020")]
# merge dfs
df2 <- merge(df_2019, df_2020, by = "PERSON_ID", all = TRUE)
df2$frequency <- 1
df2$JOB_2019 <- addNA(df2$JOB_2019)
df2$JOB_2020 <- addNA(df2$JOB_2020)
# aggregate frequency
aggregate(frequency ~ JOB_2019 + JOB_2020, data = df2, FUN = sum, na.action=na.pass)
JOB_2019 JOB_2020 frequency
1 Analyst Analyst 1
2 Coach Analyst 2
3 <NA> Gardener 1
Not base R, but it works:
library(dplyr)
library(tidyr)
data %>%
filter(FT==1, YEAR %in% c(2019, 2020)) %>%
group_by(YEAR, JOB, PERSON_ID) %>%
tally() %>%
pivot_wider(names_from = YEAR, values_from = JOB) %>%
select(-PERSON_ID) %>%
group_by(`2019`, `2020`) %>%
summarise(n = n())
`2019` `2020` n
<chr> <chr> <int>
1 Analyst Analyst 1
2 Coach Analyst 2
3 NA Gardener 1
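Another base R route is to reshape the full-time rows to one row per person and then cross-tabulate the two year columns. This is a sketch, assuming the data is stored in a data frame called `df` (rebuilt here from the question) and that each person has at most one FT == 1 row per year; `wide`, `tab` and `res` are illustrative names.

```r
# Data from the question.
df <- data.frame(
  PERSON_ID = c(1, 1, 1, 2, 2, 2, 3, 4, 4, 4, 4, 4),
  JOB = c("Analyst", "Analyst", "Analyst", "Coach", "Coach", "Analyst",
          "Gardener", "Coach", "Coach", "Analyst", "Coach", "Gardener"),
  FT = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2),
  YEAR = c(2018, 2019, 2020, 2018, 2019, 2020, 2020, 2018, 2019, 2020, 2019, 2019),
  stringsAsFactors = FALSE
)

# Keep full-time jobs in the two years of interest.
ft <- df[df$FT == 1 & df$YEAR %in% c(2019, 2020), c("PERSON_ID", "JOB", "YEAR")]

# One row per person, with columns JOB.2019 and JOB.2020.
wide <- reshape(ft, idvar = "PERSON_ID", timevar = "YEAR", direction = "wide")

# Cross-tabulate, keeping people who have no job in one of the years (NA).
tab <- table(from = wide$JOB.2019, to = wide$JOB.2020, useNA = "ifany")
res <- as.data.frame(tab)

# Drop the job combinations that never occur.
res <- res[res$Freq > 0, ]
res
```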

How to summarize events prior to a specific event (that can happen multiple times) across multiple observations in r?

I'm trying to collect data on what events have happened prior to a specific event (i.e. bDragons), which can recur within a single observation. This is just an excerpt of one observation in which a dragon is taken more than once, and I want to be able to pull insights on each and every one across many observations. So in the data set below, I would want to know that only 1 outer turret was taken prior to the first dragon at Time == 12.891. The next dragon is taken at 20.215, with 4 towers and a drake before it.
ID TeamObj Time Type Lane League Year Season bResult rResult gamelength Gold
1 1 bTowers 9.397 OUTER_TURRET TOP_LANE CBLoL 2017 Summer 1 0 34 NA
2 1 bDragons 12.891 AIR_DRAGON <NA> CBLoL 2017 Summer 1 0 34 NA
3 1 bTowers 16.215 OUTER_TURRET BOT_LANE CBLoL 2017 Summer 1 0 34 NA
4 1 bTowers 16.591 INNER_TURRET BOT_LANE CBLoL 2017 Summer 1 0 34 NA
5 1 bTowers 19.830 OUTER_TURRET MID_LANE CBLoL 2017 Summer 1 0 34 NA
6 1 bDragons 20.215 EARTH_DRAGON <NA> CBLoL 2017 Summer 1 0 34 NA
7 1 bBarons 22.512 BARON_NASHOR <NA> CBLoL 2017 Summer 1 0 34 NA
8 1 bTowers 23.962 INNER_TURRET MID_LANE CBLoL 2017 Summer 1 0 34 NA
9 1 bTowers 24.707 INNER_TURRET TOP_LANE CBLoL 2017 Summer 1 0 34 NA
10 1 bTowers 24.962 BASE_TURRET TOP_LANE CBLoL 2017 Summer 1 0 34 NA
I'd want this for every TeamObj of that type, but the issue comes up when I try to group_by address and filter by Time <= which(Team == "bDragons"): the wrong things get filtered out, or I can't summarize based on count(Type) or anything similar. I'm looking for help on writing some type of recurring function, or a better way to record and summarize this. I want to fit the observations into a linear model later on, but I can't get past this first step.
Am I thinking about my filter incorrectly? My summarize?
tst3 %>% group_by(ID) %>% filter(Time <= which(Team == "bDragons")) %>% summarize(count(Type))
Something like:
ID dragonID dragonType Time Baron_Nashor Base_Turret Inner_Turret Nexus_Turret Outer_Turret
1 1 AIR_DRAGON 12.891 N/A N/A N/A N/A 1
2 2 EARTH_DRAGON 20.215 N/A N/A 1 N/A 3
and so on, if that is clear. Want to be able to use each as an observation.
How about the following:
library(tidyverse)
tst3 %>%
group_by(ID) %>%
# arrange(Time) %>% # uncomment if needed
mutate(
Type = factor(Type),
dragonID = cumsum(dplyr::lag(TeamObj == 'bDragons', default = 1))) %>%
group_by(ID, dragonID) %>%
summarize(
dragonType = last(Type),
Time = last(Time),
tmp = list(as.data.frame(table(Type)))) %>%
unnest(tmp) %>%
spread(Type, Freq, fill = 0) %>%
# select(-ends_with("DRAGON")) %>%
group_by(ID) %>%
mutate_at(vars(BARON_NASHOR:OUTER_TURRET), cumsum) %>%
filter(str_detect( dragonType, "DRAGON"))
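For comparison, the same idea (count everything that happened strictly before each dragon) can be written in base R with a cumulative indicator matrix. This is a sketch only: `prior_counts` is a hypothetical helper, the example rebuilds just the columns it needs from the excerpt in the question, and, unlike the desired output in the question, it also counts earlier dragons and shows 0 instead of N/A.

```r
# Columns needed from the excerpt in the question.
tst3 <- data.frame(
  ID = 1,
  TeamObj = c("bTowers", "bDragons", "bTowers", "bTowers", "bTowers",
              "bDragons", "bBarons", "bTowers", "bTowers", "bTowers"),
  Time = c(9.397, 12.891, 16.215, 16.591, 19.830,
           20.215, 22.512, 23.962, 24.707, 24.962),
  Type = c("OUTER_TURRET", "AIR_DRAGON", "OUTER_TURRET", "INNER_TURRET",
           "OUTER_TURRET", "EARTH_DRAGON", "BARON_NASHOR", "INNER_TURRET",
           "INNER_TURRET", "BASE_TURRET"),
  stringsAsFactors = FALSE
)

# A factor keeps the same set of Type columns for every ID.
tst3$Type <- factor(tst3$Type)

# Hypothetical helper: for one ID, count the events of each Type that
# happened strictly before each dragon.
prior_counts <- function(d) {
  d <- d[order(d$Time), ]
  # Indicator matrix: one row per event, one column per Type.
  ind <- unclass(table(seq_len(nrow(d)), d$Type))
  # Running totals per Type, minus the event in the current row itself.
  cum <- matrix(apply(ind, 2, cumsum), nrow = nrow(ind),
                dimnames = dimnames(ind)) - ind
  is_dragon <- d$TeamObj == "bDragons"
  data.frame(
    ID = d$ID[is_dragon],
    dragonType = as.character(d$Type[is_dragon]),
    Time = d$Time[is_dragon],
    cum[is_dragon, , drop = FALSE],
    row.names = NULL
  )
}

result <- do.call(rbind, lapply(split(tst3, tst3$ID), prior_counts))
result
```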

Grouping the Data in a data frame based on conditions from more than 1 columns

Problem Description :
I am trying to calculate recency: for each salesman, the most recent value in the Year column where the target-achieved indicator equals 1. If 0 is the only indicator value available for a salesman, choose the minimum year instead.
Data:
Salesman_ID Year Yearly_Targets_Achieved_Indicator
1 AA-5468 2012 1
2 AA-5468 2013 0
3 AA-5468 2014 0
4 AA-5468 2015 0
5 AA-5468 2016 1
6 AL-3791 2012 1
7 AL-3791 2013 1
8 AL-3791 2014 0
9 AL-3893 2015 0
10 AL-3893 2016 0
Expected Output:
Salesman_ID Year Yearly_Targets_Achieved_Indicator
<chr> <dbl> <dbl>
1 AA-5468 2016 1
2 AL-3791 2013 1
9 AL-3893 2015 0
Using the tidyverse, I suggest the following code:
library(tidyverse)
Prashant_df <- data.frame(
c("AA-5468","AA-5468","AA-5468","AA-5468","AA-5468","AL-3791","AL-3791","AL-3791","AL-3893","AL-3893"),
c(2012,2013,2014,2015,2016,2012,2013,2014,2015,2016),
c(1,0,0,0,1,1,1,0,0,0)
)
names(Prashant_df) <- c("Salesman_ID","Year","Yearly_Targets_Achieved_Indicator")
Prashant_df_collapsed <- Prashant_df %>%
group_by(Salesman_ID) %>%
summarise(
# Latest year in which the target was achieved;
# earliest year if it was never achieved.
Year = if (any(Yearly_Targets_Achieved_Indicator == 1))
max(Year[Yearly_Targets_Achieved_Indicator == 1]) else min(Year),
# 1 if the target was achieved in any year, otherwise 0.
Yearly_Targets_Achieved_Indicator = max(Yearly_Targets_Achieved_Indicator))
You can store both maximum and minimum year for each salesman, and the maximum of your binary variable.
newdf = df %>% group_by(Salesman_ID) %>% summarise(
maximum = max(Year),
minimum = min(Year),
maxInd = max(Yearly_Targets_Achieved_Indicator))
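From this you can mostly construct your resulting variable: use minimum when maxInd is 0. Note, however, that maximum is the latest year overall, not the latest year in which the target was achieved. For the latter, restrict it to the achieved years, e.g. max(Year[Yearly_Targets_Achieved_Indicator == 1]).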
Using Base R:
c(by(dat,dat[1],function(x)if(all(x[,3]==0)) x[1,2] else max(x[which(x[,3]==1),2])))
AA-5468 AL-3791 AL-3893
2016 2013 2015
This code is a bit messy but produces the desired output. Here is the explanation:
first group by Salesman_ID; then, for each group, check whether all the indicators are zero. If so, return the first year (this assumes the rows are sorted by year). Otherwise, return the latest year among those where the indicator is 1.
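The same logic may be easier to read when the rule is pulled out into a small function. This is a sketch in base R, assuming the data is in a data frame called `dat` as in the one-liner above; `recency` is a hypothetical helper name.

```r
# Data from the question.
dat <- data.frame(
  Salesman_ID = c("AA-5468", "AA-5468", "AA-5468", "AA-5468", "AA-5468",
                  "AL-3791", "AL-3791", "AL-3791", "AL-3893", "AL-3893"),
  Year = c(2012, 2013, 2014, 2015, 2016, 2012, 2013, 2014, 2015, 2016),
  Yearly_Targets_Achieved_Indicator = c(1, 0, 0, 0, 1, 1, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: latest year with the target achieved,
# or the earliest year if the target was never achieved.
recency <- function(years, achieved) {
  if (any(achieved == 1)) max(years[achieved == 1]) else min(years)
}

# Apply the rule to each salesman.
res <- sapply(
  split(dat, dat$Salesman_ID),
  function(g) recency(g$Year, g$Yearly_Targets_Achieved_Indicator)
)
res
```

Using min(years) instead of the first row means the result does not depend on the rows being sorted by year.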

Resources