Remove observation based on occurrence in panel data - r

I'm working with panel data and I want to keep the observations of every id for which the first time v_1 = 1 is not the first observation of that id.
It is similar to a bysort command in Stata.
In the example I want to keep only the observations of id 61312 and not those of 42848.
Thanks
dd <- read.table(text="
id year v_1
61312 2015 0
61312 2016 0
61312 2017 1
61312 2018 1
42848 2016 1
42848 2017 0", header=TRUE)

You can use group_by() and filter() from dplyr for this:
library(dplyr)
dd %>%
  group_by(id) %>%
  filter(first(v_1) != 1)
group_by() makes first() look at the first v_1 value within each id, so only ids whose first observation does not have v_1 equal to 1 are kept.

You can use base R:
subset(dd, id %in% unique(id)[v_1[!duplicated(id)] != 1])
# id year v_1
#1 61312 2015 0
#2 61312 2016 0
#3 61312 2017 1
#4 61312 2018 1
v_1[!duplicated(id)] keeps only the first v_1 value of each id, and we select only those ids whose first value is not equal to 1.
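Both answers take the first row of each id as it appears in the data, so they assume the rows are already in year order. If that is not guaranteed, a minimal base R sketch could sort first. The names dd_sorted, first_v1 and kept are only illustrative, and the rows of the example are deliberately shuffled here:

```r
# Example data from the question, with the rows shuffled on purpose.
dd <- data.frame(
  id   = c(42848, 61312, 61312, 42848, 61312, 61312),
  year = c(2017, 2018, 2015, 2016, 2017, 2016),
  v_1  = c(0, 1, 0, 1, 1, 0)
)

# Sort by id and year so that "first" means "earliest year".
dd_sorted <- dd[order(dd$id, dd$year), ]

# For every row, look up the first v_1 value of its id.
first_v1 <- ave(dd_sorted$v_1, dd_sorted$id, FUN = function(x) x[1])

# Keep only ids whose earliest observation does not have v_1 == 1.
kept <- dd_sorted[first_v1 != 1, ]
print(kept)
```

With the example data this should leave only the four rows of id 61312.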


Detecting changes for several companies

I wonder if anyone can provide me with some tools/packages/codes to detect changes in the peer groups that are used for relative performance evaluation.
I have a dataframe with all peers that are used for a certain firm (CIK) over the years. An example of this data is given below:
CIK <- c("21344","21344", "21344", "21344", "21344", "21344", "21344", "21344", "21344")
FiscalYear <- c("2013", "2014", "2015", "2016", "2017", "2014", "2015", "2016", "2017")
PeerCIK <- c("1800","1800","1800","1800","1800","21456","21456","21456","21456")
dataframe <- data.frame(CIK, FiscalYear, PeerCIK)
This results in the following table:
CIK FiscalYear PeerCIK
1 21344 2013 1800
2 21344 2014 1800
3 21344 2015 1800
4 21344 2016 1800
5 21344 2017 1800
6 21344 2014 21456
7 21344 2015 21456
8 21344 2016 21456
9 21344 2017 21456
Now, I want to identify whether the peers (PeerCIK) are present for the whole period that is covered by the firm (CIK). I thus first need to identify the first and last year per CIK (in this example it is clear (2013-2017), but I need to do this for many firms). A code I used for this is:
df2 <- dataframe %>%
  group_by(CIK) %>%
  summarise(
    start = min(FiscalYear),
    end = max(FiscalYear)
  )
> df2
CIK start end
1 21344 2013 2017
Next, I need to identify whether every peer is present for that whole period.
If not, a change must have taken place in the peer group (a peer was added to or deleted from the group). This is where I am unsure how to continue. The outcome I ultimately want is a dataframe that shows, for every firm (CIK) and fiscal year, whether the peer group changed compared to the previous year (where change is a dummy variable equal to 1 if a change took place). A change thus occurs when a peer is added after the start date, or when a peer is no longer included although the end date for that particular CIK has not yet been reached.
Expected outcome
For the example above, I would have the following outcome, as company 21456 is added from 2014 onwards and thus a change has taken place compared to 2013:
CIK FiscalYear change
1 21344 2013 0
2 21344 2014 1
3 21344 2015 0
4 21344 2016 0
5 21344 2017 0
I really hope someone can help me, please let me know
A slightly different approach via expand(), full_join, and some helper variables which should cover most of your edge cases:
library(tidyverse)
dataframe %>%
  # Add helper variable to indicate present relationships.
  mutate(
    present = 1
  ) %>%
  # Generate all possible variations of CIK, FiscalYear, and PeerCik
  # and join with our data.
  full_join(
    dataframe %>% expand(CIK, FiscalYear, PeerCIK),
    by = c("CIK", "FiscalYear", "PeerCIK")
  ) %>%
  # Set the helper variable to 0 wherever it is missing,
  # which is the case in your newly joined empty data from `expand(...)`.
  mutate(
    present = ifelse(is.na(present), 0, present)
  ) %>%
  # Sort the data because now the order will be important.
  arrange(CIK, PeerCIK, FiscalYear) %>%
  # Group by CIK-PeerCIK relationship...
  group_by(
    CIK, PeerCIK
  ) %>%
  # ...and compare each FiscalYear to the previous FiscalYear.
  mutate(
    # Check if a relationship was added compared to the year before.
    added = case_when(
      row_number() == 1 ~ 0,
      lag(present) == 0 & present == 1 ~ 1,
      TRUE ~ 0
    ),
    # Check if a relationship was removed compared to the year before.
    removed = case_when(
      row_number() == 1 ~ 0,
      lag(present) == 1 & present == 0 ~ 1,
      TRUE ~ 0
    ),
    # Combine those two into one variable.
    change = ifelse(abs(added) + abs(removed) > 0, 1, 0)
  ) %>%
  ungroup() %>%
  # Now to the summary: Group by CIK and FiscalYear...
  group_by(
    CIK, FiscalYear
  ) %>%
  # ...and calculate all sums for each CIK and FiscalYear.
  summarize(
    # Total number of present relationships in this year.
    num_present = sum(present),
    # Number of added relationships in this year.
    num_added = sum(added),
    # Number of removed relationships in this year.
    num_removed = sum(removed),
    # Was there any change in this year?
    # An alternative would be `sum(change)` to
    # indicate the number of changed relationships.
    change = max(change)
  ) %>%
  ungroup()
Result:
# A tibble: 5 × 6
CIK FiscalYear num_present num_added num_removed change
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 21344 2013 1 0 0 0
2 21344 2014 2 1 0 1
3 21344 2015 2 0 0 0
4 21344 2016 2 0 0 0
5 21344 2017 2 0 0 0
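If only the change dummy is needed, the same idea can be sketched in base R by comparing the set of peers in each year with the set in the previous observed year. This is an illustration under assumptions, not a replacement for the pipeline above: peer_change is a hypothetical helper name, and a missing year is simply skipped rather than treated as a change.

```r
# Example data from the question.
dataframe <- data.frame(
  CIK        = rep("21344", 9),
  FiscalYear = c("2013", "2014", "2015", "2016", "2017",
                 "2014", "2015", "2016", "2017"),
  PeerCIK    = c("1800", "1800", "1800", "1800", "1800",
                 "21456", "21456", "21456", "21456"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one firm, flag years whose peer set differs
# from the peer set of the previous observed year.
peer_change <- function(firm) {
  years <- sort(unique(firm$FiscalYear))
  peers <- lapply(years, function(y) firm$PeerCIK[firm$FiscalYear == y])
  later <- seq_along(years)[-1]
  changed <- vapply(
    later,
    function(i) as.numeric(!setequal(peers[[i]], peers[[i - 1]])),
    numeric(1)
  )
  # The first year has nothing to compare against, so it gets 0.
  data.frame(CIK = firm$CIK[1], FiscalYear = years, change = c(0, changed))
}

# Apply the helper per firm and stack the results.
result <- do.call(rbind, lapply(split(dataframe, dataframe$CIK), peer_change))
print(result)
```

For the example data, only 2014 should be flagged, because peer 21456 is added in that year.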

If statement with three true conditions

This is my data:
Year1 <- c(2015,2013,2012,2018)
Year2 <- c(2017,2015,2014,2020)
my_data <- data.frame(Year1, Year2)
I need an if statement that returns 1 when year 1 equals 2015 OR 2016 AND year 2 is greater than 2016. Currently, my code looks like this:
my_data <- my_data %>%
mutate(Y_2016=ifelse(my_data$Year1==2015|2016 & my_data$Year1>2016,1,0))
But this does not work and only seems to check the condition if Year 2 is greater than 2016, since it returns 1 even for the last row when Year 1 is 2018 and Year 2 is 2020.
Thank you for your help!
Instead of my_data$Year1==2015|2016, use %in%, as in my_data$Year1 %in% c(2015,2016).
There is a typo in my_data$Year1>2016: the second condition should test Year2.
As you are using dplyr, you do not need to prefix every variable with my_data$.
my_data %>%
  mutate(Y_2016 = ifelse(Year1 %in% c(2015, 2016) & Year2 > 2016, 1, 0))
Year1 Year2 Y_2016
1 2015 2017 1
2 2013 2015 0
3 2012 2014 0
4 2018 2020 0
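To see why the original condition misbehaves: in R, & binds more tightly than |, and a non-zero number such as 2016 counts as TRUE in a logical test. So Year1==2015|2016 & Year1>2016 is read as (Year1 == 2015) | (2016 & (Year1 > 2016)). A small sketch in base R, where wrong and right are just illustrative names:

```r
Year1 <- c(2015, 2013, 2012, 2018)
Year2 <- c(2017, 2015, 2014, 2020)

# The condition from the question, parsed by R as
# (Year1 == 2015) | (2016 & (Year1 > 2016)).
wrong <- Year1 == 2015 | 2016 & Year1 > 2016

# The intended condition: Year1 is 2015 or 2016, and Year2 is after 2016.
right <- Year1 %in% c(2015, 2016) & Year2 > 2016

print(data.frame(Year1, Year2, wrong, right))
```

The last row (2018, 2020) is TRUE under the original condition only because Year1 > 2016 holds there, which matches the behaviour described in the question.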

How can I filter out Duplicated Rows per Group

So this is the data I'm working with:
ID Year State Grade Loss Total
1 2016 AZ A 50 1000
1 2016 AZ A 50 1000
2 2016 AZ B 0 5000
3 2017 AZ A 0 2000
4 2017 AZ C 10 100
2 2017 AZ B 0 3000
What I'm trying to do is create a table that shows the amount of value lost that is grouped by Year, State and Grade. That part I have done but the issue is you can see that there is a duplicated row for ID=1. I need to add a component to my code that removes any duplicated rows like it in my data once I have grouped the data by Year, State and Grade.
The reason I want to remove the duplicates after I have grouped the data is that the ID number may duplicate for a different year but that is OK as that is a new observation. I just want to remove any duplicates if the Year, State and Grade match. Basically if the whole row is a duplicate, it should be removed.
I can't tell if I should use unique() or distinct() but here is what I have so far:
Answer <- data %>%
group_by(Year, State, Grade) %>%
filter(row_number(ID) == 1) %>% #This is the part to replace
summarise(x = sum(Loss) / sum(Total)) %>%
spread(State, x)
The output should look like this:
Year State Grade x
2016 AZ A 0.05
2016 AZ B 0
2016 AZ C 0
2017 AZ A 0
2017 AZ B 0
2017 AZ C 0.1
A few things. Below, I use distinct() to remove duplicate rows. Also, in your expected results you have an entry for grade C for 2016, which isn't in your original data, so I used complete() to add this (and any other missing cases) as a zero. Finally, as @akrun notes above: where does 0.00833 come from? Typo or have I misunderstood the calculation?
df <- read.table(text = "ID Year State Grade Loss Total
1 2016 AZ A 50 1000
1 2016 AZ A 50 1000
2 2016 AZ B 0 5000
3 2017 AZ A 0 2000
4 2017 AZ C 10 100
2 2017 AZ B 0 3000", header = TRUE)
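library(dplyr)
library(tidyr)
Answer <- df %>%
  distinct() %>%
  group_by(Year, State, Grade) %>%
  summarise(x = sum(Loss) / sum(Total)) %>%
  # Drop the grouping first; newer tidyr versions may refuse to complete grouping columns.
  ungroup() %>%
  complete(Year, State, Grade, fill = list(x = 0))
# # A tibble: 6 x 4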
# Year State Grade x
# <int> <fct> <fct> <dbl>
# 1 2016 AZ A 0.05
# 2 2016 AZ B 0
# 3 2016 AZ C 0
# 4 2017 AZ A 0
# 5 2017 AZ B 0
# 6 2017 AZ C 0.1
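On the unique() versus distinct() question: for whole-row duplicates the two do the same job, and base R's duplicated() works as well. A minimal base R sketch of the deduplicate-then-aggregate step, assuming the same df as above (deduped and agg are illustrative names, and this version does not add the missing 2016 grade C row):

```r
df <- read.table(text = "ID Year State Grade Loss Total
1 2016 AZ A 50 1000
1 2016 AZ A 50 1000
2 2016 AZ B 0 5000
3 2017 AZ A 0 2000
4 2017 AZ C 10 100
2 2017 AZ B 0 3000", header = TRUE)

# Drop rows that are exact copies of an earlier row.
deduped <- df[!duplicated(df), ]

# Sum Loss and Total per Year, State and Grade, then take the ratio.
agg <- aggregate(cbind(Loss, Total) ~ Year + State + Grade,
                 data = deduped, FUN = sum)
agg$x <- agg$Loss / agg$Total
print(agg)
```

ID 2 appears in both 2016 and 2017 but survives, because the rows differ in Year and Total and so are not whole-row duplicates.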

Use R to count values in a number of different columns

I have a dataset of patents, where I have recorded 1) the month and year associated with a patent renewal and 2) whether the patent holder chose to pay the patent fee or let the patent lapse. So
patentid fee1date fee1paid fee2date fee2paid
1 May 2010 True May 2013 False
2 May 2010 True April 2014 True
What I want to do is develop a count of the number of renewals by month and by year, as well as the number of abandoned patents, as follows:
date renewed lapsed
May 2010 2 0
How might I count the data that I have now? Thank you!
EDIT: The key point is to aggregate these across different columns. The issue that I am running into now is that when I try using the count library, it treats 2 renewals in May 2010 as two separate values.
Using dplyr
require(tidyr)
require(dplyr)
data %>% gather(year,value, -Patent.ID) %>%
separate('year',c('Fee','N','Act')) %>%
spread(Act,value) %>%
unite(Fee, Fee,N, sep = '.') %>%
group_by(Date) %>%
summarise(R=sum(Paid=='True'), NotR=sum(Paid=='False'))
# A tibble: 3 x 3
Date R NotR
<chr> <int> <int>
1 April 2014 1 0
2 May 2010 2 0
3 May 2013 0 1
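The key step is stacking the date and paid columns of every fee so that each renewal event becomes one row. A minimal base R sketch of that step, assuming only two fee columns and that the paid flag is stored as the text "True" or "False" (the names long and counts are illustrative):

```r
# Example data, with the paid flags kept as text.
data <- data.frame(
  Patent.ID  = c(1, 2),
  Fee.1.Date = c("May 2010", "May 2010"),
  Fee.1.Paid = c("True", "True"),
  Fee.2.Date = c("May 2013", "April 2014"),
  Fee.2.Paid = c("False", "True"),
  stringsAsFactors = FALSE
)

# One row per fee event: stack the fee 1 and fee 2 columns.
long <- data.frame(
  Date = c(data$Fee.1.Date, data$Fee.2.Date),
  Paid = c(data$Fee.1.Paid, data$Fee.2.Paid),
  stringsAsFactors = FALSE
)

# Turn the flag into two 0/1 indicators and sum them per date.
long$renewed <- as.integer(long$Paid == "True")
long$lapsed  <- as.integer(long$Paid == "False")
counts <- aggregate(cbind(renewed, lapsed) ~ Date, data = long, FUN = sum)
print(counts)
```

Because both May 2010 renewals end up in the same Date column, they are counted together rather than as two separate values.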
Data
data <- read.table(text="
'Patent ID' 'Fee 1 Date' 'Fee 1 Paid' 'Fee 2 Date' 'Fee 2 Paid'
1 'May 2010' True 'May 2013' False
2 'May 2010' True 'April 2014' True
",header=T, stringsAsFactors = F)

Grouping the Data in a data frame based on conditions from more than 1 columns

Problem Description :
I am trying to calculate recency: for each salesman, the most recent value in the Year column where the target-achieved indicator equals 1. If the indicator is 0 for every year of a salesman, choose the minimum year instead.
Data:
Salesman_ID Year Yearly_Targets_Achieved_Indicator
1 AA-5468 2012 1
2 AA-5468 2013 0
3 AA-5468 2014 0
4 AA-5468 2015 0
5 AA-5468 2016 1
6 AL-3791 2012 1
7 AL-3791 2013 1
8 AL-3791 2014 0
9 AL-3893 2015 0
10 AL-3893 2016 0
Expected Output:
Salesman_ID Year Yearly_Targets_Achieved_Indicator
<chr> <dbl> <dbl>
1 AA-5468 2016 1
2 AL-3791 2013 1
9 AL-3893 2015 0
Using the tidyverse, I suggest the following code:
library(tidyverse)
Prashant_df <- data.frame(
c("AA-5468","AA-5468","AA-5468","AA-5468","AA-5468","AL-3791","AL-3791","AL-3791","AL-3893","AL-3893"),
c(2012,2013,2014,2015,2016,2012,2013,2014,2015,2016),
c(1,0,0,0,1,1,1,0,0,0)
)
names(Prashant_df) <- c("Salesman_ID","Year","Yearly_Targets_Achieved_Indicator")
Prashant_df <- Prashant_df %>%
  group_by(Salesman_ID) %>%
  # Keep the year only for rows where the target was achieved.
  mutate(Year_target = ifelse(Yearly_Targets_Achieved_Indicator == 1, Year, NA))
Prashant_df_collapsed <- Prashant_df %>%
  group_by(Salesman_ID) %>%
  summarise(
    # Latest achieved year, or the earliest year if the target was never achieved.
    Year = if (all(is.na(Year_target))) min(Year) else max(Year_target, na.rm = TRUE),
    Yearly_Targets_Achieved_Indicator = max(Yearly_Targets_Achieved_Indicator))
You can store both maximum and minimum year for each salesman, and the maximum of your binary variable.
newdf = df %>% group_by(Salesman_ID) %>% summarise(
maximum = max(Year),
minimum = min(Year),
maxInd = max(Yearly_Targets_Achieved_Indicator))
From this you can pretty much construct your resulting variable.
Using Base R:
c(by(dat,dat[1],function(x)if(all(x[,3]==0)) x[1,2] else max(x[which(x[,3]==1),2])))
AA-5468 AL-3791 AL-3893
2016 2013 2015
This code is a bit messy but produces the desired output. Here is the explanation:
first group by Salesman_ID; then, for each group, check whether all the indicators are zero. If so, return the first year; otherwise, return the latest (maximum) year among those where the indicator is 1.
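The same logic can be spelled out in a more readable form. This is a sketch of the rule described above, with one difference: it takes the minimum year, as the question asks, rather than the first row of the group. recency_year and recency are hypothetical names, and dat is rebuilt from the question's data:

```r
dat <- data.frame(
  Salesman_ID = c("AA-5468", "AA-5468", "AA-5468", "AA-5468", "AA-5468",
                  "AL-3791", "AL-3791", "AL-3791", "AL-3893", "AL-3893"),
  Year = c(2012, 2013, 2014, 2015, 2016, 2012, 2013, 2014, 2015, 2016),
  Yearly_Targets_Achieved_Indicator = c(1, 0, 0, 0, 1, 1, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: latest year with the target achieved,
# or the earliest year if the target was never achieved.
recency_year <- function(x) {
  achieved <- x$Year[x$Yearly_Targets_Achieved_Indicator == 1]
  if (length(achieved) > 0) max(achieved) else min(x$Year)
}

# Apply the helper to each salesman's rows.
recency <- sapply(split(dat, dat$Salesman_ID), recency_year)
print(recency)
```

The result is a named vector with one year per Salesman_ID.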
