How to exclude observations that do not appear at least once every year in R

I have a database where companies are identified by an ID (cnpjcei) from 2009 to 2018, where we can have 1 or more observations of a given company in a given year or no observations of a given company in a given year.
Here is a sample of the database:
> df
cnpjcei year
<chr> <dbl>
1 4774 2009
2 4774 2010
3 28959 2009
4 29688 2009
5 43591 2010
6 43591 2010
7 65803 2011
8 105104 2011
9 113980 2012
10 220043 2013
I would like to keep in that df only the companies that appear at least once a year.
What would be the easiest way to do this?

Using the data.table library:
library(data.table)
df<-data.table(df)
# keep only companies observed in all 10 years (2009 to 2018)
df <- df[, unique_years := uniqueN(year), by = cnpjcei][unique_years == 10]

We can use dplyr, group_by id and filter only the cases in which all the elements in 2009:2018 can be found %in% the year column.
Please mind that, for this code to work with the sample database as in the question, the range would have to be replaced with 2009:2013
library(dplyr)
df %>% group_by(cnpjcei) %>% filter(all(2009:2018 %in% year))

You can keep the ids (cnpjcei) that have all the unique years available in the data.
library(dplyr)
result <- df %>%
group_by(cnpjcei) %>%
filter(n_distinct(year) == n_distinct(.$year)) %>%
ungroup
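The same idea can be sketched in base R, without extra packages. This is only an illustration: the helper name keep_complete_ids is hypothetical, and the range 2009:2010 is chosen so that the question's small sample returns something (the real data would use 2009:2018).

```r
# Sample data from the question.
df <- data.frame(
  cnpjcei = c("4774", "4774", "28959", "29688", "43591", "43591",
              "65803", "105104", "113980", "220043"),
  year = c(2009, 2010, 2009, 2009, 2010, 2010, 2011, 2011, 2012, 2013),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose id is observed in every year of `years`.
keep_complete_ids <- function(data, years) {
  # For each id, check that every required year occurs at least once.
  complete <- tapply(data$year, data$cnpjcei, function(y) all(years %in% y))
  data[data$cnpjcei %in% names(complete)[complete], , drop = FALSE]
}

# In the sample, only company 4774 is present in both 2009 and 2010.
keep_complete_ids(df, 2009:2010)
```

Passing the required years explicitly avoids relying on every year being present somewhere in the data.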

Related

How to replace NA values with average of precedent and following values, in R

I currently have a dataset that has more or less the following characteristics:
Country <- rep(c("Honduras", "Belize"),each=6)
Year <- rep(c(2010,2011,2012,2014,2015,2016),2)
Observation <- c(2, 5,NA, NA,2,3,NA, NA,2,3,1,NA)
df <- data.frame(Country, Year, Observation)
What I would like to do is find a command or write a function that fills only the NAs for each country, as follows:
1. If the NA Observation is for the first year (2010), fill it with the next non-NA Observation.
2. If the NA Observation is for the last year (2016), fill it with the previous available period's Observation.
3.1 If the NA Observation is for a year between the first and last, fill it with the average of the 2 closest periods.
3.2 However, if there are 2 or more consecutive NAs (let's take 2 as an example), first fill the first with the preceding Observation and the second with the same method as (3.1).
As an illustration, the previous dataset should finally be:
Observation2 <- c(2, 5, 5, 3.5 ,2,3,2, 2,2,3,1,1)
df2 <- data.frame(Country, Year, Observation2)
I hope I was sufficiently clear. It is very specific but I hope someone can help.
Feel free to ask any questions about it if you do not understand.
Input. There is some question of whether alternation of country names as mentioned in the comments under the question and shown in the Note at the end was intended but at any rate assume that each subsequence of increasing years is a separate group and group by them, grp. (If it was intended that the first 6 entries in Country be Honduras the last 6 be Belize then we could replace the group_by(...) with group_by(Country) in the code below.)
Clarification of Question. We assume that the question is asking that within group:
Leading NAs are to be replaced with the first non-NA.
Trailing NAs are to be replaced with the last non-NA.
If there is one consecutive NA surrounded by non-NAs it is replaced by the prior non-NA.
If there are two consecutive NA's then the first is replaced with the prior non-NA and the second is filled in with the average of the prior non-NA and next non-NA.
The question does not address the situation of 3+ consecutive NAs, so perhaps this never occurs. In case it does, the code fills the first NA with the prior non-NA and fills the remainder using linear interpolation.
Code. Now for each group, replace any NA with the prior value. Then use linear interpolation on what is left via na.approx using rule=2 to extend the ends. Finally only keep desired columns.
dplyr clashes. Note that lag and filter in dplyr collide in an incompatible way with the functions of the same name in base R so we exclude them and use dplyr:: prefix if we want to access them.
library(dplyr, exclude = c("lag", "filter"))
library(zoo)
df2 <- df %>%
# group_by(Country) %>%
group_by(grp = cumsum(c(TRUE, diff(Year) < 0))) %>%
mutate(Observation2 = coalesce(Observation, dplyr::lag(Observation)) %>%
na.approx(rule = 2)) %>%
ungroup %>%
select(Country, Year, Observation2)
identical(df2$Observation2, Observation2)
## [1] TRUE
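To see what the two steps do, here is a minimal base R trace on the Honduras values alone. It swaps in base approx() for zoo's na.approx and a hand-made shift for dplyr::lag, so it is a sketch of the idea rather than the code above.

```r
x <- c(2, 5, NA, NA, 2, 3)  # Honduras observations from the question

# Step 1: replace each NA with the value just before it, where one exists.
prev <- c(NA, head(x, -1))          # x shifted one position to the right
step1 <- ifelse(is.na(x), prev, x)
step1
# 2 5 5 NA 2 3

# Step 2: fill what is left by linear interpolation over positions 1..n;
# rule = 2 repeats the nearest value at the ends.
step2 <- approx(seq_along(step1), step1, xout = seq_along(step1), rule = 2)$y
step2
# 2 5 5 3.5 2 3
```

Only the first NA of each run picks up the preceding value in step 1, because the shifted vector is still NA at the second position of the run.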
Note
We used this input taken from the question.
Country <- rep(c("Honduras", "Belize"),6)
Year <- rep(c(2010,2011,2012,2014,2015,2016),2)
Observation <- c(2, 5,NA, NA,2,3,NA, NA,2,3,1,NA)
df <- data.frame(Country, Year, Observation)
df
giving:
Country Year Observation
1 Honduras 2010 2
2 Belize 2011 5
3 Honduras 2012 NA
4 Belize 2014 NA
5 Honduras 2015 2
6 Belize 2016 3
7 Honduras 2010 NA
8 Belize 2011 NA
9 Honduras 2012 2
10 Belize 2014 3
11 Honduras 2015 1
12 Belize 2016 NA
Added
In a comment the poster added another example. We run it here. This is the same code incorporating the simplification to group_by discussed in the first paragraph above. (That does not change the result.)
Country <- rep(c("Honduras", "Belize"),each=6)
Year <- rep(c(2010,2011,2012,2014,2015,2016),2)
Observation <- c(2, 5, NA, NA,2,3, NA, NA,2, NA,1,NA)
df <- data.frame(Country, Year, Observation)
df2 <- df %>%
group_by(Country) %>%
mutate(Observation2 = coalesce(Observation, dplyr::lag(Observation)) %>%
na.approx(rule = 2)) %>%
ungroup %>%
select(Country, Year, Observation2)
df2
giving:
# A tibble: 12 x 3
Country Year Observation2
<chr> <dbl> <dbl>
1 Honduras 2010 2
2 Honduras 2011 5
3 Honduras 2012 5
4 Honduras 2014 3.5
5 Honduras 2015 2
6 Honduras 2016 3
7 Belize 2010 2
8 Belize 2011 2
9 Belize 2012 2
10 Belize 2014 2
11 Belize 2015 1
12 Belize 2016 1

How to get unique values from table() function in R

I have a data frame with 31 columns. In the year column (named "Anos"), years are repeated across rows, so table(df$Anos) gives me the frequency of each year. I need only the years with 12 observations (12 months).
Example:
freq_years <- table(df$Anos)
freq_years
Result:
2009 2010 2011 2012 2013 2014 2015 2017 2018 2019 2020
  10   12   12    3   11    6    8   12   12   12    5
How can I automatically store in a new variable only the years with freq = 12 (here 2010, 2011, 2017, 2018, 2019)?
Here is a tidyverse version. Depending on your use with the other 30 columns in your data frame, keeping the data as df2 might be useful.
install.packages("dplyr")
install.packages("magrittr")
library("magrittr")
library("dplyr")
#create example dataset
df <- data.frame("Anos" = c(rep(2009,10),
rep(2010,12),
rep(2011,12),
rep(2012,3),
rep(2013,11),
rep(2014,6),
rep(2015,8),
rep(2016,12),
rep(2017,12)))
head(df)
# count the rows for each year and keep the years with exactly 12 rows
df2 <- df %>% group_by(Anos) %>% count() %>% filter(n == 12)
head(df2)
# create variable with list of years that have exactly 12 rows
variable <- df2$Anos
variable
We can create a logical vector and subset the names of the table output
names(freq_years)[freq_years == 12]
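As a small self-contained check, the table from the question can be rebuilt from its printed frequencies. The names of a table are character strings, so converting them is an extra step if numeric years are needed; the variable name full_years is just illustrative.

```r
# Frequencies copied from the table printed in the question.
years  <- c(2009:2015, 2017:2020)
counts <- c(10, 12, 12, 3, 11, 6, 8, 12, 12, 12, 5)
df <- data.frame(Anos = rep(years, times = counts))

freq_years <- table(df$Anos)

# Names of the entries whose count is 12, converted from character to integer.
full_years <- as.integer(names(freq_years)[freq_years == 12])
full_years
# 2010 2011 2017 2018 2019
```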

Using custom order to arrange rows after previous sorting with arrange

I know this has already been asked, but I think my issue is a bit different (never mind that the labels are in Portuguese).
I have this dataset:
df <- cbind(c(rep(2012,6),rep(2016,6)),
rep(c('Emp.total',
'Fisicas.total',
'Outros.total',
'Politicos.total',
'Receitas.total',
'Proprio.total'),2),
runif(12,0,1))
colnames(df) <- c('Year','Variable','Value')
I want to order the rows to group first everything that has the same year. Afterwards, I want the Variable column to be ordered like this:
Receitas.total
Fisicas.total
Emp.total
Politicos.total
Proprio.total
Outros.total
I know I could use arrange() from dplyr to sort by the year. However, I do not know how to combine this with any routine using factor and order without messing up the previous ordering by year.
Any help? Thank you
We create a custom order by converting the 'Variable' into factor with levels specified in the custom order
library(dplyr)
df %>%
arrange(Year, factor(Variable, levels = c('Receitas.total',
'Fisicas.total', 'Emp.total', 'Politicos.total',
'Proprio.total', 'Outros.total')))
# A tibble: 12 x 3
# Year Variable Value
# <dbl> <chr> <dbl>
# 1 2012 Receitas.total 0.6626196
# 2 2012 Fisicas.total 0.2248911
# 3 2012 Emp.total 0.2925740
# 4 2012 Politicos.total 0.5188971
# 5 2012 Proprio.total 0.9204438
# 6 2012 Outros.total 0.7042230
# 7 2016 Receitas.total 0.6048889
# 8 2016 Fisicas.total 0.7638205
# 9 2016 Emp.total 0.2797356
#10 2016 Politicos.total 0.2547251
#11 2016 Proprio.total 0.3707349
#12 2016 Outros.total 0.8016306
data
set.seed(24)
df <- tibble(Year =c(rep(2012,6),rep(2016,6)),
Variable = rep(c('Emp.total',
'Fisicas.total',
'Outros.total',
'Politicos.total',
'Receitas.total',
'Proprio.total'),2),
Value = runif(12,0,1))
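For comparison, a base R sketch of the same two-level sort uses order() with match() instead of a factor. The Value column here is a placeholder sequence rather than the original random draw, and the rows start in reverse year order so the effect of the sort is visible.

```r
# Stand-in for the question's data; Value is a placeholder, not the original runif draw.
df <- data.frame(
  Year = rep(c(2016, 2012), each = 6),
  Variable = rep(c("Emp.total", "Fisicas.total", "Outros.total",
                   "Politicos.total", "Receitas.total", "Proprio.total"), 2),
  Value = seq_len(12),
  stringsAsFactors = FALSE
)

# Desired order of Variable within each year.
wanted <- c("Receitas.total", "Fisicas.total", "Emp.total",
            "Politicos.total", "Proprio.total", "Outros.total")

# Sort by Year first, then by each Variable's position in `wanted`.
sorted <- df[order(df$Year, match(df$Variable, wanted)), ]
sorted
```

A label that is missing from wanted gets NA from match(), and order() places NA last within its year, so a misspelled label ends up at the bottom of its group rather than causing an error.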

R include rows conditioned to other variables with `add_row`

I have a data.frame like test. It corresponds to information associated with a registry of firms. year.entry reflects the time period when a firm gets into the registry. items are elements that represent capacity and remain fixed throughout time. It may happen that the firm increases its capacity in a particular year. My aim is to present this information longitudinally.
To do that, I would ideally include rows for the years that are missing between 2010 and 2015. I have tried this with add_row() from tibble, but I am having difficulty making it work.
> test %>% add_row(firm = firm, year.entry == (year.entry)+1, item = item, .before = row_number(year.entry) == n())
Error in eval(expr, envir, enclos) : object 'firm' not found
I wonder whether there is an easier way to solve this problem. The ideal data frame should look like this:
firm year.entry item
<chr> <chr> <int>
1 1-102642692 2010 15
2 1-102642692 2011 15
3 1-102642692 2012 15
4 1-102642692 2013 15
5 1-102642692 2014 15
6 1-102642692 2015 8
test is given by:
test = data.frame(firm = c("1-102642692", "1-102642692"), year.entry = c(2010, 2015), item =c(15,8))
I add a dummy firm to the data to use later.
First I make sure every firm has all the years of the period of interest with complete. That is why I entered a dummy firm.
The missing years are added to the dataframe.
Then I take the last observation carried forward with na.locf.
When completed I remove the dummy firm.
comp <- data.frame(firm="test", year.entry= (2009:2016), item=0)
test = data.frame(firm = c("1-102642692", "1-102642692"), year.entry = c(2010, 2015), item =c(15,8))
library(dplyr) # %>%, arrange, group_by, mutate, filter
library(tidyr) # complete
library(zoo) # na.locf
rbind(test,comp) %>%
complete(firm,year.entry) %>%
arrange(firm, year.entry)%>%
group_by(firm) %>%
mutate(item = na.locf(item, na.rm=FALSE)) %>%
filter(firm !="test")
result:
firm year.entry item
<fctr> <dbl> <dbl>
1-102642692 2009 NA
1-102642692 2010 15
1-102642692 2011 15
1-102642692 2012 15
1-102642692 2013 15
1-102642692 2014 15
1-102642692 2015 8
1-102642692 2016 8
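If only the years 2010 to 2015 are wanted, as in the ideal output of the question, the same fill can be sketched in base R without the dummy firm. This swaps in expand.grid() and merge() for complete(), and a small hand-written helper (here called locf, a hypothetical name) for na.locf.

```r
test <- data.frame(firm = c("1-102642692", "1-102642692"),
                   year.entry = c(2010, 2015), item = c(15, 8),
                   stringsAsFactors = FALSE)

# Every firm/year combination in the period of interest.
grid <- expand.grid(firm = unique(test$firm),
                    year.entry = as.numeric(2010:2015),
                    stringsAsFactors = FALSE)

# Left join: years without a record get item = NA.
full <- merge(grid, test, by = c("firm", "year.entry"), all.x = TRUE)
full <- full[order(full$firm, full$year.entry), ]

# Hypothetical helper: last observation carried forward (leading NAs stay NA).
locf <- function(x) {
  idx <- cumsum(!is.na(x))        # index of the latest non-NA seen so far
  c(NA, x[!is.na(x)])[idx + 1]
}

# Apply the fill separately within each firm.
full$item <- ave(full$item, full$firm, FUN = locf)
full
```

For the sample firm this gives item 15 for 2010 through 2014 and 8 for 2015.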

Aggregate function in R using two columns simultaneously

Data:-
df=data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),Year=c(2016,2015,2014,2016,2006,2006),Balance=c(100,150,65,75,150,10))
Name Year Balance
1 John 2016 100
2 John 2015 150
3 Stacy 2014 65
4 Stacy 2016 75
5 Kat 2006 150
6 Kat 2006 10
Code:-
aggregate(cbind(Year,Balance)~Name,data=df,FUN=max )
Output:-
Name Year Balance
1 John 2016 150
2 Kat 2006 150
3 Stacy 2016 75
I want to aggregate/summarize the above data frame using two columns, Year and Balance. I used the base function aggregate to do this. I need the maximum balance of the most recent year. In the first row of the output, John has the latest year (2016) but the balance from 2015, which is not what I need; it should output 100, not 150. Where am I going wrong?
Somewhat ironically, aggregate is a poor tool for aggregating. You could make it work, but I'd instead do:
library(data.table)
setDT(df)[order(-Year, -Balance), .SD[1], by = Name]
# Name Year Balance
#1: John 2016 100
#2: Stacy 2016 75
#3: Kat 2006 150
I suggest using the dplyr library:
library(dplyr)
data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),
Year=c(2016,2015,2014,2016,2006,2006),
Balance=c(100,150,65,75,150,10)) %>% #create the dataframe
as_tibble() %>% #convert it to a tibble
group_by(Name, Year) %>% #group it by Name and Year
summarise(maxBalance=max(Balance)) %>% # calculate the maximum for each group
group_by(Name) %>% # group the resulting dataframe by Name
top_n(1, Year) # keep only the most recent year of each group
Here is another solution without the data.table package.
First, sort the data frame by descending year and balance:
df <- df[order(-df$Year, -df$Balance),]
Then select the first row in each group with the same name:
df[!duplicated(df$Name),]
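The original call goes wrong because aggregate(cbind(Year, Balance) ~ Name, ...) applies max to each column separately, so the year and the balance can come from different rows. A minimal sketch that still uses aggregate, assuming the latest year is selected first with ave():

```r
df <- data.frame(Name = c("John", "John", "Stacy", "Stacy", "Kat", "Kat"),
                 Year = c(2016, 2015, 2014, 2016, 2006, 2006),
                 Balance = c(100, 150, 65, 75, 150, 10),
                 stringsAsFactors = FALSE)

# Keep only the rows from each person's most recent year.
latest <- df[df$Year == ave(df$Year, df$Name, FUN = max), ]

# Within that year, take the largest balance.
result <- aggregate(Balance ~ Name + Year, data = latest, FUN = max)
result
```

Here John gets 100 (his 2016 balance), Stacy 75 and Kat 150.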
