Remove observation based on occurrence in panel data


I'm working with panel data and I want to keep only the ids for which the first time v_1 = 1 is not the id's first observation.
Something like the bysort command in Stata.
In the example, I want to keep only the observations for id 61312 and not those for 42848.
Thanks
dd <- read.table(text="
id year v_1
61312 2015 0
61312 2016 0
61312 2017 1
61312 2018 1
42848 2016 1
42848 2017 0", header=TRUE)

You can use group_by() and filter() from dplyr for this task:
library(dplyr)
dd %>%
  group_by(id) %>%
  filter(first(v_1) != 1)
We use group_by() so that first() looks at the first value of v_1 within each id. This assumes the rows are already sorted by year within each id.

You can use subset() from base R:
subset(dd, id %in% unique(id)[v_1[!duplicated(id)] != 1])
# id year v_1
#1 61312 2015 0
#2 61312 2016 0
#3 61312 2017 1
#4 61312 2018 1
v_1[!duplicated(id)] keeps only the first v_1 value of each id, and we select only those ids whose first value is not equal to 1.
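Both answers rely on the first row of each id being its earliest year. As a minimal sketch, assuming the data might not arrive sorted, the same filter can be written in base R with an explicit sort and ave(). The names dd_sorted, first_v1, and kept are illustrative and not from the original answers.

```r
dd <- read.table(text = "
id year v_1
61312 2015 0
61312 2016 0
61312 2017 1
61312 2018 1
42848 2016 1
42848 2017 0", header = TRUE)

# Sort so that the first row of each id really is its earliest year.
dd_sorted <- dd[order(dd$id, dd$year), ]

# ave() applies the function within each id and returns a vector aligned
# with the rows: here, each id's first v_1 value repeated for all its rows.
first_v1 <- ave(dd_sorted$v_1, dd_sorted$id, FUN = function(x) x[1])

# Keep only ids whose first observation does not have v_1 equal to 1.
kept <- dd_sorted[first_v1 != 1, ]
kept
```

With the example data this keeps the four rows of id 61312 and drops id 42848.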

Related

Detecting changes for several companies

I wonder if anyone can provide me with some tools/packages/codes to detect changes in the peer groups that are used for relative performance evaluation.
I have a dataframe with all peers that are used for a certain firm (CIK) over the years. An example of this data is given below:
CIK <- c("21344","21344", "21344", "21344", "21344", "21344", "21344", "21344", "21344")
FiscalYear <- c("2013", "2014", "2015", "2016", "2017", "2014", "2015", "2016", "2017")
PeerCIK <- c("1800","1800","1800","1800","1800","21456","21456","21456","21456")
dataframe <- data.frame(CIK, FiscalYear, PeerCIK)
This results in the following table:
CIK FiscalYear PeerCIK
1 21344 2013 1800
2 21344 2014 1800
3 21344 2015 1800
4 21344 2016 1800
5 21344 2017 1800
6 21344 2014 21456
7 21344 2015 21456
8 21344 2016 21456
9 21344 2017 21456
Now, I want to identify whether the peers (PeerCIK) are present for the whole period that is covered by the firm (CIK). I thus first need to identify the first and last year per CIK (in this example it is clear (2013-2017), but I need to do this for many firms). The code I used for this is:
df2 <- dataframe %>%
  group_by(CIK) %>%
  summarise(
    start = min(FiscalYear),
    end = max(FiscalYear)
  )
> df2
CIK start end
1 21344 2013 2017
Next, I need to identify whether every peer is present for that whole period.
If not, a change must have taken place in the peer group (a peer was added to or deleted from the group). This is where I am stuck. The outcome I ultimately want is a dataframe that shows, for every firm (CIK) and fiscal year, whether the peer group changed compared to the previous year (where change is a dummy variable with value 1 if a change took place). A change thus occurs when a peer is added after the starting date, or when a peer is no longer included although the end date has not yet been reached for that particular CIK.
Expected outcome
For the example above, I would have the following outcome, as company 21456 is added from 2014 onwards and thus a change has taken place compared to 2013:
CIK FiscalYear change
1 21344 2013 0
2 21344 2014 1
3 21344 2015 0
4 21344 2016 0
5 21344 2017 0
I really hope someone can help me. Please let me know.

A slightly different approach via expand(), full_join, and some helper variables which should cover most of your edge cases:
library(tidyverse)
dataframe %>%
  # Add helper variable to indicate present relationships.
  mutate(
    present = 1
  ) %>%
  # Generate all possible variations of CIK, FiscalYear, and PeerCik
  # and join with our data.
  full_join(
    dataframe %>% expand(CIK, FiscalYear, PeerCIK),
    by = c("CIK", "FiscalYear", "PeerCIK")
  ) %>%
  # Set the helper variable to 0 wherever it is missing,
  # which is the case in your newly joined empty data from `expand(...)`.
  mutate(
    present = ifelse(is.na(present), 0, present)
  ) %>%
  # Sort the data because now the order will be important.
  arrange(CIK, PeerCIK, FiscalYear) %>%
  # Group by CIK-PeerCIK relationship...
  group_by(
    CIK, PeerCIK
  ) %>%
  # ...and compare each FiscalYear to the previous FiscalYear.
  mutate(
    # Check if a relationship was added compared to the year before.
    added = case_when(
      row_number() == 1 ~ 0,
      lag(present) == 0 & present == 1 ~ 1,
      TRUE ~ 0
    ),
    # Check if a relationship was removed compared to the year before.
    removed = case_when(
      row_number() == 1 ~ 0,
      lag(present) == 1 & present == 0 ~ 1,
      TRUE ~ 0
    ),
    # Combine those two into one variable.
    change = ifelse(abs(added) + abs(removed) > 0, 1, 0)
  ) %>%
  ungroup() %>%
  # Now to the summary: Group by CIK and FiscalYear...
  group_by(
    CIK, FiscalYear
  ) %>%
  # ...and calculate all sums for each CIK and FiscalYear.
  summarize(
    # Total number of present relationships in this year.
    num_present = sum(present),
    # Number of added relationships in this year.
    num_added = sum(added),
    # Number of removed relationships in this year.
    num_removed = sum(removed),
    # Was there any change in this year?
    # An alternative would be `sum(change)` to
    # indicate the number of changed relationships.
    change = max(change)
  ) %>%
  ungroup()
Result:
# A tibble: 5 × 6
CIK FiscalYear num_present num_added num_removed change
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 21344 2013 1 0 0 0
2 21344 2014 2 1 0 1
3 21344 2015 2 0 0 0
4 21344 2016 2 0 0 0
5 21344 2017 2 0 0 0
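
As a cross-check, the change dummy can also be sketched in base R by comparing each year's set of peers with the previous observed year's set. This is a different technique from the expand()/full_join() pipeline above, not a restatement of it. The helper name detect_changes is hypothetical, and the sketch assumes every firm has at least one peer row in each year it covers (a year with no rows at all would simply be skipped).

```r
CIK <- rep("21344", 9)
FiscalYear <- c("2013", "2014", "2015", "2016", "2017",
                "2014", "2015", "2016", "2017")
PeerCIK <- c("1800", "1800", "1800", "1800", "1800",
             "21456", "21456", "21456", "21456")
dataframe <- data.frame(CIK, FiscalYear, PeerCIK)

# Hypothetical helper: for one firm, flag years whose peer set differs
# from the peer set of the previous observed year.
detect_changes <- function(d) {
  peers <- split(d$PeerCIK, d$FiscalYear)  # list of peer vectors, sorted by year
  years <- names(peers)
  change <- integer(length(years))         # the first year is 0 by definition
  for (i in seq_along(years)[-1]) {
    change[i] <- as.integer(!setequal(peers[[i]], peers[[i - 1]]))
  }
  data.frame(CIK = d$CIK[1], FiscalYear = years, change = change)
}

# Apply the helper firm by firm and stack the results.
result <- do.call(rbind, lapply(split(dataframe, dataframe$CIK), detect_changes))
result
```

For the example firm this flags only 2014, matching the change column in the result above.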

If statement with three true conditions

This is my data:
Year1 <- c(2015,2013,2012,2018)
Year2 <- c(2017,2015,2014,2020)
my_data <- data.frame(Year1, Year2)
I need an if statement that returns 1 when year 1 equals 2015 OR 2016 AND year 2 is greater than 2016. Currently, my code looks like this:
my_data <- my_data %>%
mutate(Y_2016=ifelse(my_data$Year1==2015|2016 & my_data$Year1>2016,1,0))
But this does not work and only seems to check the condition if Year 2 is greater than 2016, since it returns 1 even for the last row when Year 1 is 2018 and Year 2 is 2020.
Thank you for your help!

Instead of my_data$Year1==2015|2016, use %in%, as in my_data$Year1 %in% c(2015,2016).
There is also a typo: my_data$Year1>2016 should refer to Year2.
As you are using dplyr, you do not need to prefix every variable with my_data$:
my_data %>%
  mutate(Y_2016=ifelse(Year1 %in% c(2015,2016) & Year2>2016,1,0))
Year1 Year2 Y_2016
1 2015 2017 1
2 2013 2015 0
3 2012 2014 0
4 2018 2020 0
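
To see why the original condition misbehaves, here is a small base R sketch of how R parses it. Comparison operators bind more tightly than |, so Year1 == 2015 | 2016 is read as (Year1 == 2015) | 2016, and the non-zero number 2016 is coerced to TRUE. The variable names wrong and right are illustrative only.

```r
Year1 <- c(2015, 2013, 2012, 2018)

# Parsed as (Year1 == 2015) | 2016; the bare 2016 is coerced to TRUE,
# so the result is TRUE for every element.
wrong <- Year1 == 2015 | 2016

# %in% tests membership in the set of allowed years.
right <- Year1 %in% c(2015, 2016)

wrong
right
```

Only the first element of right is TRUE, whereas wrong is TRUE everywhere.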

How can I filter out Duplicated Rows per Group

So this is the data I'm working with:
ID Year State Grade Loss Total
1 2016 AZ A 50 1000
1 2016 AZ A 50 1000
2 2016 AZ B 0 5000
3 2017 AZ A 0 2000
4 2017 AZ C 10 100
2 2017 AZ B 0 3000
What I'm trying to do is create a table that shows the amount of value lost that is grouped by Year, State and Grade. That part I have done but the issue is you can see that there is a duplicated row for ID=1. I need to add a component to my code that removes any duplicated rows like it in my data once I have grouped the data by Year, State and Grade.
The reason I want to remove the duplicates after I have grouped the data is that the ID number may duplicate for a different year but that is OK as that is a new observation. I just want to remove any duplicates if the Year, State and Grade match. Basically if the whole row is a duplicate, it should be removed.
I can't tell if I should use Unique() or Distinct() but here is what I have so far:
Answer <- data %>%
  group_by(Year, State, Grade) %>%
  filter(row_number(ID) == 1) %>% #This is the part to replace
  summarise(x = sum(Loss) / sum(Total)) %>%
  spread(State, x)
The output should look like this:
Year State Grade x
2016 AZ A 0.05
2016 AZ B 0
2016 AZ C 0
2017 AZ A 0
2017 AZ B 0
2017 AZ C 0.1

A few things. Below, I use distinct to remove duplicate rows. Also, your expected results include an entry for grade C in 2016, which isn't in your original data, so I used complete to add this (and any other missing cases) as a zero. Finally, as @akrun notes above: where does 0.00833 come from? Is it a typo, or have I misunderstood the calculation?
df <- read.table(text = "ID Year State Grade Loss Total
1 2016 AZ A 50 1000
1 2016 AZ A 50 1000
2 2016 AZ B 0 5000
3 2017 AZ A 0 2000
4 2017 AZ C 10 100
2 2017 AZ B 0 3000", header = TRUE)
Answer <- df %>%
  distinct %>%
  group_by(Year, State, Grade) %>%
  summarise(x = sum(Loss) / sum(Total)) %>%
  complete(Year, State, Grade, fill = list(x = 0))
# # A tibble: 6 x 4
# # Groups: Year, State [2]
# Year State Grade x
# <int> <fct> <fct> <dbl>
# 1 2016 AZ A 0.05
# 2 2016 AZ B 0
# 3 2016 AZ C 0
# 4 2017 AZ A 0
# 5 2017 AZ B 0
# 6 2017 AZ C 0.1
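
On the unique() versus distinct() question: for whole-row duplicates, base R's unique() on a data frame should give the same rows as distinct() with no arguments. A minimal sketch, with dup_flags and deduped as illustrative names:

```r
df <- read.table(text = "ID Year State Grade Loss Total
1 2016 AZ A 50 1000
1 2016 AZ A 50 1000
2 2016 AZ B 0 5000
3 2017 AZ A 0 2000
4 2017 AZ C 10 100
2 2017 AZ B 0 3000", header = TRUE)

# duplicated() marks the second and later copies of an identical row.
dup_flags <- duplicated(df)

# unique() drops those copies and keeps the first occurrence of each row.
deduped <- unique(df)

dup_flags
deduped
```

ID 2 appears in both 2016 and 2017, but those rows differ in Year and Total, so neither is dropped.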

Use R to count values in a number of different columns

I have a dataset of patents, where I have recorded 1) the month and year associated with a patent renewal and 2) whether the patent holder chose to pay the patent fee or let the patent lapse. So
patentid fee1date fee1paid fee2date fee2paid
1 May 2010 True May 2013 False
2 May 2010 True April 2014 True
What I want to do is develop a count of the number of renewals by month and by year, as well as the number of abandoned patents, as follows:
date renewed lapsed
May 2010 2 0
How might I count the data that I have now? Thank you!
EDIT: The key point is to aggregate these across different columns. The issue that I am running into now is that when I try using the count library, it treats 2 renewals in May 2010 as two separate values.

Using dplyr
require(tidyr)
require(dplyr)
data %>% gather(year,value, -Patent.ID) %>%
  separate('year',c('Fee','N','Act')) %>%
  spread(Act,value) %>%
  unite(Fee, Fee,N, sep = '.') %>%
  group_by(Date) %>%
  summarise(R=sum(Paid=='True'), NotR=sum(Paid=='False'))
# A tibble: 3 x 3
Date R NotR
<chr> <int> <int>
1 April 2014 1 0
2 May 2010 2 0
3 May 2013 0 1
Data
data <- read.table(text="
'Patent ID' 'Fee 1 Date' 'Fee 1 Paid' 'Fee 2 Date' 'Fee 2 Paid'
1 'May 2010' True 'May 2013' False
2 'May 2010' True 'April 2014' True
",header=T, stringsAsFactors = F)

Grouping the Data in a data frame based on conditions from more than 1 columns

Problem Description :
I am trying to calculate recency: for each salesman, the most recent value in the Year column where the target-achieved indicator equals 1. If 0 is the only indicator value available for a salesman, choose the minimum year instead.
Data:
Salesman_ID Year Yearly_Targets_Achieved_Indicator
1 AA-5468 2012 1
2 AA-5468 2013 0
3 AA-5468 2014 0
4 AA-5468 2015 0
5 AA-5468 2016 1
6 AL-3791 2012 1
7 AL-3791 2013 1
8 AL-3791 2014 0
9 AL-3893 2015 0
10 AL-3893 2016 0
Expected Output:
Salesman_ID Year Yearly_Targets_Achieved_Indicator
<chr> <dbl> <dbl>
1 AA-5468 2016 1
2 AL-3791 2013 1
9 AL-3893 2015 0

Using the tidyverse package, I suggest the following code:
library(tidyverse)
Prashant_df <- data.frame(
c("AA-5468","AA-5468","AA-5468","AA-5468","AA-5468","AL-3791","AL-3791","AL-3791","AL-3893","AL-3893"),
c(2012,2013,2014,2015,2016,2012,2013,2014,2015,2016),
c(1,0,0,0,1,1,1,0,0,0)
)
names(Prashant_df) <- c("Salesman_ID","Year","Yearly_Targets_Achieved_Indicator")
Prashant_df <- Prashant_df %>%
  group_by(Salesman_ID) %>%
  mutate(Year_target=case_when(
    # Years in which the target was achieved count as themselves...
    Yearly_Targets_Achieved_Indicator==1 ~ Year,
    # ...all other rows fall back to the salesman's earliest year.
    Yearly_Targets_Achieved_Indicator==0 ~ min(Year)
  ))
Prashant_df_collapsed <- Prashant_df %>%
  group_by(Salesman_ID) %>%
  summarise(Year=max(Year_target),
            Yearly_Targets_Achieved_Indicator=max(Yearly_Targets_Achieved_Indicator))

You can store both maximum and minimum year for each salesman, and the maximum of your binary variable.
newdf = df %>% group_by(Salesman_ID) %>% summarise(
  maximum = max(Year),
  minimum = min(Year),
  maxInd = max(Yearly_Targets_Achieved_Indicator))
From this you can pretty much construct your resulting variable.

Using Base R:
c(by(dat,dat[1],function(x)if(all(x[,3]==0)) x[1,2] else max(x[which(x[,3]==1),2])))
AA-5468 AL-3791 AL-3893
2016 2013 2015
This code is a bit messy but produces the desired output. Here is the explanation:
First group by Salesman_ID; then, for each group, check whether all the indicators are zero. If so, return the first year; otherwise, return the latest (maximum) year among those where the indicator is 1.
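
The same logic can be spelled out with column names instead of positions. This is a sketch that assumes the data frame is called dat, as in the one-liner above. It uses min(Year) for the all-zero case, which matches "the first year" only when rows are sorted by year. The name recency is illustrative.

```r
dat <- data.frame(
  Salesman_ID = c("AA-5468", "AA-5468", "AA-5468", "AA-5468", "AA-5468",
                  "AL-3791", "AL-3791", "AL-3791", "AL-3893", "AL-3893"),
  Year = c(2012, 2013, 2014, 2015, 2016, 2012, 2013, 2014, 2015, 2016),
  Yearly_Targets_Achieved_Indicator = c(1, 0, 0, 0, 1, 1, 1, 0, 0, 0)
)

# For each salesman: the latest year with indicator 1 if there is one,
# otherwise the earliest year on record.
recency <- sapply(split(dat, dat$Salesman_ID), function(g) {
  achieved <- g$Year[g$Yearly_Targets_Achieved_Indicator == 1]
  if (length(achieved) > 0) max(achieved) else min(g$Year)
})
recency
```

The result is a named vector with one year per Salesman_ID, matching the output of the one-liner.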
