I have a data.frame like test. It holds information from a registry of firms. year.entry is the year in which a firm enters the registry. item represents the firm's capacity, which stays fixed over time unless the firm changes it in a particular year. My aim is to present this information longitudinally.
To do that, I would ideally add rows for the years that are missing between 2010 and 2015. I tried this with add_row() from tibble, but I cannot get it to work.
> test %>% add_row(firm = firm, year.entry == (year.entry)+1, item = item, .before = row_number(year.entry) == n())
Error in eval(expr, envir, enclos) : object 'firm' not found
I wonder whether there is an easier way to solve this problem. The ideal data frame should look like this:
firm year.entry item
<chr> <chr> <int>
1 1-102642692 2010 15
2 1-102642692 2011 15
3 1-102642692 2012 15
4 1-102642692 2013 15
5 1-102642692 2014 15
6 1-102642692 2015 8
test is given by:
test = data.frame(firm = c("1-102642692", "1-102642692"), year.entry = c(2010, 2015), item =c(15,8))
I add a dummy firm to the data that covers every year of the period of interest (2009 to 2016).
With complete(), every firm then gets a row for each of those years, so the missing years are added to the data frame with item set to NA.
Then I carry the last observation forward within each firm with na.locf() from zoo.
Finally, I remove the dummy firm.
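library(dplyr)
library(tidyr)
library(zoo)
# Dummy firm spanning every year of the period of interest
comp <- data.frame(firm = "test", year.entry = 2009:2016, item = 0)
test <- data.frame(firm = c("1-102642692", "1-102642692"), year.entry = c(2010, 2015), item = c(15, 8))
rbind(test, comp) %>%
  complete(firm, year.entry) %>%                    # one row per firm and year
  arrange(firm, year.entry) %>%
  group_by(firm) %>%
  mutate(item = na.locf(item, na.rm = FALSE)) %>%   # carry the last value forward
  filter(firm != "test")                            # drop the dummy firm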
result:
firm year.entry item
<fctr> <dbl> <dbl>
1-102642692 2009 NA
1-102642692 2010 15
1-102642692 2011 15
1-102642692 2012 15
1-102642692 2013 15
1-102642692 2014 15
1-102642692 2015 8
1-102642692 2016 8
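If only the years between each firm's first and last entry are needed (2010 to 2015 here, as in the desired output), a dummy firm may not be necessary. The following is a minimal sketch, assuming tidyr's complete(), full_seq() and fill() are acceptable substitutes for na.locf(); the name filled is introduced here for illustration only.

```r
library(dplyr)
library(tidyr)

test <- data.frame(
  firm = c("1-102642692", "1-102642692"),
  year.entry = c(2010, 2015),
  item = c(15, 8)
)

filled <- test %>%
  group_by(firm) %>%
  # Add a row for every year between the firm's first and last entry
  complete(year.entry = full_seq(year.entry, period = 1)) %>%
  # Carry the last known item value down into the newly added rows
  fill(item, .direction = "down") %>%
  ungroup()

print(filled)
```

Because the data are grouped by firm, each firm is completed over its own range of years rather than over a common 2009 to 2016 window.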
Related
I am trying to extract the team with the maximum number of wins each year in women's college basketball. So far I have the number of wins per team per year, but I want only the team with the most wins in each year.
winsbyyear <- WomenCBnewdf %>%
group_by(Year,Team)%>%
summarise(totalwinsyr = sum(Outcome))
The output currently looks like this, but I expect to see each year only once, followed by the team with the maximum number of wins and its win total:
Year Team totalwinsyr
<fct> <chr> <dbl>
1 2014 AbileneChristian 10
2 2014 AirForce 0
3 2014 Akron 18
4 2014 Alabama 10
5 2014 AlabamaAM 3
6 2014 AlabamaHuntsville 0
7 2014 AlabamaMobile 0
8 2014 AlabamaSt 15
9 2014 AlaskaAnchorage 1
10 2014 AlbanyNY 16
I have already looked at "How to select the rows with maximum values in each group with dplyr?", but I could not find anything there that helps with a group_by() on multiple variables.
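Sum the wins per team and year, then regroup by year only and keep the team (or teams, if tied) with the most wins:
winsbyyear <- WomenCBnewdf %>%
  group_by(Year, Team) %>%
  summarise(totalwinsyr = sum(Outcome)) %>%   # wins per team per year
  group_by(Year) %>%                          # compare teams within each year
  filter(totalwinsyr == max(totalwinsyr))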
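To see the per-year maximum in isolation, here is a small self-contained sketch. The games data frame is hypothetical toy data (one row per game, Outcome equal to 1 for a win), not the real WomenCBnewdf, and slice_max() assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data: one row per game, Outcome is 1 for a win, 0 for a loss
games <- data.frame(
  Year = c(2014, 2014, 2014, 2014, 2015, 2015, 2015),
  Team = c("Akron", "Akron", "Alabama", "Alabama", "Akron", "Alabama", "Alabama"),
  Outcome = c(1, 1, 1, 0, 0, 1, 1)
)

top_by_year <- games %>%
  group_by(Year, Team) %>%
  summarise(totalwinsyr = sum(Outcome), .groups = "drop") %>%  # wins per team and year
  group_by(Year) %>%
  slice_max(totalwinsyr, n = 1, with_ties = TRUE) %>%          # best team(s) per year
  ungroup()

print(top_by_year)
```

With with_ties = TRUE, a year in which several teams share the maximum returns all of them.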
I have a database where companies are identified by an ID (cnpjcei) from 2009 to 2018, where we can have 1 or more observations of a given company in a given year or no observations of a given company in a given year.
Here is a sample of the database:
> df
cnpjcei year
<chr> <dbl>
1 4774 2009
2 4774 2010
3 28959 2009
4 29688 2009
5 43591 2010
6 43591 2010
7 65803 2011
8 105104 2011
9 113980 2012
10 220043 2013
I would like to keep in that df only the companies that appear at least once in every year.
What would be the easiest way to do this?
Using the data.table library:
library(data.table)
df <- as.data.table(df)
# Count distinct years per company and keep companies seen in all 10 years (2009 to 2018)
df <- df[, unique_years := uniqueN(year), by = cnpjcei][unique_years == 10]
With dplyr, we can group by ID and keep only the groups in which every year in 2009:2018 is found in the year column (checked with %in%).
Note that for this code to work with the sample data from the question, the range has to be replaced with 2009:2013.
library(dplyr)
df %>% group_by(cnpjcei) %>% filter(all(2009:2018 %in% year))
You can keep the IDs (cnpjcei) that have every distinct year available in the data.
library(dplyr)
result <- df %>%
  group_by(cnpjcei) %>%
  filter(n_distinct(year) == n_distinct(.$year)) %>%   # .$year is the ungrouped column
  ungroup()
I am tidying the Quarterly Employment Statistics dataset that Statistics South Africa provides. The Excel file is just aggregated data.
Two variables (employment levels and aggregate earnings) are in wide format, with the data spanning quarters over time. Row 1 in the Excel file gives the variable names, and row 2 gives the quarters. The two variables are separated by a blank column, but the quarter values are the same. I am happy to ignore row 1, and instead set row 2 as the "names", however, the quarter values are repeated between the two variables, so I need to rename the columns.
Instead of writing code to tidy the current dataset (hard-coding a numerical vector for the columns to rename), I want to write code that will tidy all future updated datasets, where I expect that each new quarter for employees will be added just before the blank column. The following code imports the dataset correctly at present. How can I instead select the columns to rename based on the position of the blank column?
QES <- read_xlsx(
path="QES.xlsx",
range=cell_limits(c(2, 1), c(115, NA))
) %>%
rename(
industry=...1,
SIC=...2
) %>%
rename_with(
.fn=function(q) paste("employees", q, sep="."),
.cols=seq(3, 45, 1) # '45' will change every quarter.
) %>%
rename_with(
.fn=~substr(x=., start=1, stop=16),
.cols=starts_with("employees.")
) %>%
rename_with(
.fn=function(q) paste("earnings", q, sep="."),
.cols=seq(47, 89, 1) # '47' and '89' will change every quarter.
) %>%
rename_with(
.fn=~substr(x=., start=1, stop=15),
.cols=starts_with("earnings.")
) %>%
select(-46) %>% # blank column
pivot_longer(
cols=-c(1, 2),
names_to=c(".value", "time"),
names_pattern="(.+).([0-9]{6})"
) %>%
mutate( time=as.yearmon(time, "%Y%m") )
Output:
# A tibble: 4,859 x 5
industry SIC time employees earnings
<chr> <chr> <yearmon> <dbl> <dbl>
1 Mining and quarrying 2 Sep 2009 487132 16448188825
2 Mining and quarrying 2 Dec 2009 487745 17510873008
3 Mining and quarrying 2 Mar 2010 490847 17149708859
4 Mining and quarrying 2 Jun 2010 496710 17603372833
5 Mining and quarrying 2 Sep 2010 505244 19129235237
6 Mining and quarrying 2 Dec 2010 504067 19696688896
7 Mining and quarrying 2 Mar 2011 511433 19567627283
8 Mining and quarrying 2 Jun 2011 517104 20444707224
9 Mining and quarrying 2 Sep 2011 518719 21593220480
10 Mining and quarrying 2 Dec 2011 518240 24878861928
# ... with 4,849 more rows
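One possible way to avoid the hard-coded positions is to locate the separator column from the data itself. This is only a sketch under assumptions: it supposes the blank column is imported as a column containing nothing but NA, that exactly one such column exists, and that the first two columns are industry and SIC. The raw data frame below is a hypothetical stand-in with made-up values, not the real QES file, and its column names merely imitate the repaired names that read_xlsx() produces.

```r
# Hypothetical stand-in for the imported sheet (made-up values)
raw <- data.frame(
  c("Mining and quarrying", "Manufacturing"),
  c("2", "3"),
  c(100, 200),
  c(110, 210),
  c(NA, NA),          # the blank separator column
  c(1000, 2000),
  c(1100, 2100),
  stringsAsFactors = FALSE
)
names(raw) <- c("industry", "SIC", "200909...3", "200912...4",
                "...5", "200909...6", "200912...7")

# Position of the column that contains no data at all
blank <- unname(which(colSums(!is.na(raw)) == 0))
stopifnot(length(blank) == 1)

emp_cols  <- 3:(blank - 1)           # quarters for employment
earn_cols <- (blank + 1):ncol(raw)   # quarters for earnings

# Keep the six-digit quarter and prefix it with the variable name
names(raw)[emp_cols]  <- paste0("employees.", substr(names(raw)[emp_cols], 1, 6))
names(raw)[earn_cols] <- paste0("earnings.", substr(names(raw)[earn_cols], 1, 6))

tidy <- raw[, -blank]
print(names(tidy))
```

The renamed columns can then go through the same pivot_longer() step as before, since the positions are recomputed each time a quarter is added.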
Problem description:
I am trying to calculate recency, defined as the most recent value in the Year column for which the target-achieved indicator equals 1. If 0 is the only indicator value available for a salesman, the minimum year for that salesman should be chosen instead.
Data:
Salesman_ID Year Yearly_Targets_Achieved_Indicator
1 AA-5468 2012 1
2 AA-5468 2013 0
3 AA-5468 2014 0
4 AA-5468 2015 0
5 AA-5468 2016 1
6 AL-3791 2012 1
7 AL-3791 2013 1
8 AL-3791 2014 0
9 AL-3893 2015 0
10 AL-3893 2016 0
Expected Output:
Salesman_ID Year Yearly_Targets_Achieved_Indicator
<chr> <dbl> <dbl>
1 AA-5468 2016 1
2 AL-3791 2013 1
9 AL-3893 2015 0
Using the tidyverse, I suggest the following code:
library(tidyverse)
Prashant_df <- data.frame(
c("AA-5468","AA-5468","AA-5468","AA-5468","AA-5468","AL-3791","AL-3791","AL-3791","AL-3893","AL-3893"),
c(2012,2013,2014,2015,2016,2012,2013,2014,2015,2016),
c(1,0,0,0,1,1,1,0,0,0)
)
names(Prashant_df) <- c("Salesman_ID","Year","Yearly_Targets_Achieved_Indicator")
Prashant_df_collapsed <- Prashant_df %>%
  group_by(Salesman_ID) %>%
  summarise(
    # Latest year with the target achieved; earliest year if it was never achieved
    Year = if (any(Yearly_Targets_Achieved_Indicator == 1)) {
      max(Year[Yearly_Targets_Achieved_Indicator == 1])
    } else {
      min(Year)
    },
    Yearly_Targets_Achieved_Indicator = max(Yearly_Targets_Achieved_Indicator)
  )
You can store both maximum and minimum year for each salesman, and the maximum of your binary variable.
newdf = df %>% group_by(Salesman_ID) %>% summarise(
maximum = max(Year),
minimum = min(Year),
maxInd = max(Yearly_Targets_Achieved_Indicator))
From this you can pretty much construct your resulting variable.
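A sketch of how that last step might look. Note that max(Year) on its own is not restricted to the years in which the target was achieved, so this example swaps in a helper column, lastAchieved (a name introduced here, not part of the answer above), and assumes the indicator only takes the values 0 and 1.

```r
library(dplyr)

# Data from the question, rebuilt so that the example is self-contained
df <- data.frame(
  Salesman_ID = c(rep("AA-5468", 5), rep("AL-3791", 3), rep("AL-3893", 2)),
  Year = c(2012, 2013, 2014, 2015, 2016, 2012, 2013, 2014, 2015, 2016),
  Yearly_Targets_Achieved_Indicator = c(1, 0, 0, 0, 1, 1, 1, 0, 0, 0)
)

recency <- df %>%
  group_by(Salesman_ID) %>%
  summarise(
    minimum = min(Year),
    maxInd = max(Yearly_Targets_Achieved_Indicator),
    # Year * indicator is 0 for missed years, so the maximum is the latest
    # achieved year (or 0 if the target was never achieved)
    lastAchieved = max(Year * Yearly_Targets_Achieved_Indicator)
  ) %>%
  # Use the latest achieved year when there is one, otherwise the earliest year
  mutate(Year = if_else(maxInd == 1, lastAchieved, minimum))

print(recency)
```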
Using Base R:
c(by(dat,dat[1],function(x)if(all(x[,3]==0)) x[1,2] else max(x[which(x[,3]==1),2])))
AA-5468 AL-3791 AL-3893
2016 2013 2015
This code is a bit messy but produces the desired output. Here is the explanation:
First group by Salesman_ID. Then, for each group, check whether all the indicators are zero. If so, return the first year listed (the minimum when the rows are sorted by year); otherwise, return the latest (maximum) year among those where the indicator is 1.
I know this has already been asked, but I think my issue is a bit different (never mind that the variable names are in Portuguese).
I have this dataset:
df <- cbind(c(rep(2012,6),rep(2016,6)),
            rep(c('Emp.total',
                  'Fisicas.total',
                  'Outros.total',
                  'Politicos.total',
                  'Receitas.total',
                  'Proprio.total'),2),
            runif(12,0,1))
colnames(df) <- c('Year','Variable','Value')
I want to order the rows to group first everything that has the same year. Afterwards, I want the Variable column to be ordered like this:
Receitas.total
Fisicas.total
Emp.total
Politicos.total
Proprio.total
Outros.total
I know I could use arrange() from dplyr to sort by the year. However, I do not know how to combine this with any routine using factor and order without messing up the previous ordering by year.
Any help? Thank you
We can sort by Year and then by 'Variable' converted to a factor whose levels are specified in the custom order:
library(dplyr)
df %>%
arrange(Year, factor(Variable, levels = c('Receitas.total',
'Fisicas.total', 'Emp.total', 'Politicos.total',
'Proprio.total', 'Outros.total')))
# A tibble: 12 x 3
# Year Variable Value
# <dbl> <chr> <dbl>
# 1 2012 Receitas.total 0.6626196
# 2 2012 Fisicas.total 0.2248911
# 3 2012 Emp.total 0.2925740
# 4 2012 Politicos.total 0.5188971
# 5 2012 Proprio.total 0.9204438
# 6 2012 Outros.total 0.7042230
# 7 2016 Receitas.total 0.6048889
# 8 2016 Fisicas.total 0.7638205
# 9 2016 Emp.total 0.2797356
#10 2016 Politicos.total 0.2547251
#11 2016 Proprio.total 0.3707349
#12 2016 Outros.total 0.8016306
data
set.seed(24)
df <- tibble(Year = c(rep(2012,6),rep(2016,6)),
             Variable = rep(c('Emp.total',
                              'Fisicas.total',
                              'Outros.total',
                              'Politicos.total',
                              'Receitas.total',
                              'Proprio.total'),2),
             Value = runif(12,0,1))