Subset dataframe according to maxima of groups

I am trying to create a subset of a dataframe conditional on grouped cumulative sums of one of the columns (i.e., cumsum of Total, grouped by Year, below).
I have a population table that looks as follows (simplified)
Year Age Total Cum.Sum
1991 20 94619 94619
1991 21 97455 192074
1991 22 101418 293492
1991 23 104192 397684
1991 24 108332 506016
1991 25 111355 617371
1991 26 114569 731940
1991 27 113852 845792
1991 28 112264 958056
1991 29 110230 1068286
1991 30 109149 1177435
1991 31 108222 1285657
1991 32 106641 1392298
1991 33 106658 1498956
1991 34 104730 1603686
1991 35 103383 1707069
1991 36 101441 1808510
1991 37 99773 1908283
1991 38 100621 2008904
1991 39 98135 2107039
1991 40 101946 2208985
2010 20 93470 93470
2010 21 94762 188232
2010 22 92527 280759
2010 23 94696 375455
2010 24 95416 470871
2010 25 98016 568887
2010 26 98387 667274
2010 27 102254 769528
2010 28 103343 872871
2010 29 105179 978050
2010 30 104278 1082328
2010 31 104099 1186427
2010 32 105240 1291667
2010 33 105316 1396983
2010 34 106250 1503233
2010 35 109019 1612252
2010 36 110044 1722296
2010 37 113949 1836245
2010 38 118086 1954331
2010 39 119845 2074176
2010 40 123647 2197823
Now I'd like to subset this dataframe so that the cumulative sum of each year does not exceed a certain threshold, e.g.
1991 2010
1605897 1803476
I do not want to have separate datasets per year.

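This will do:
t.h <- read.table(header=TRUE, text=
'Year th
1991 1605897
2010 1803476')
# attach each year's threshold (column th) to its rows
d <- merge(dataset, t.h)
# filter the merged data, keeping rows whose cumulative sum does not exceed the threshold
subset(d, Cum.Sum <= th)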
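The same filter can be written without a merge by looking up each row's threshold in a named vector. This is only a sketch: it uses the first three ages per year from the table above, and the threshold values in `th` are made-up numbers chosen so that the small sample actually gets cut.

```r
# First three ages per year from the question's table
dataset <- data.frame(
  Year  = rep(c(1991, 2010), each = 3),
  Age   = rep(20:22, times = 2),
  Total = c(94619, 97455, 101418, 93470, 94762, 92527)
)

# Cumulative sum of Total within each Year
dataset$Cum.Sum <- ave(dataset$Total, dataset$Year, FUN = cumsum)

# Hypothetical thresholds, named by year (not the values from the question)
th <- c("1991" = 200000, "2010" = 100000)

# Keep rows whose cumulative sum does not exceed their own year's threshold
kept <- dataset[dataset$Cum.Sum <= th[as.character(dataset$Year)], ]
print(kept)
```

The result stays in one data frame, so no per-year datasets are needed.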

Related

I need to assign populations (denominators) to a data frame in R

I need to assign populations (denominators) to a data frame in R. For each age group and each year, the populations are different.
My data frame is
Year agegroup count
2000 0-4 24
2000 5-9 36
....
2021 0-4 42
2021 95+ 132
How can I assign each year and age group (row) a different population?
I don't know how to do it, can someone help me? Thanks

Thank you,
I have this data frame:
head(pop)
Year Age_group Count Population
1: 1993 7 12
2: 1994 7 18
3: 1995 7 14
4: 1993 8 16
5: 1994 8 26
6: 1995 8 27
7: 1996 8 21
… Continue
And I want to put in the populations column the data that I have in another dataframe, so that the result is this:
head(pop1)
Year Age_group Count Population
1: 1993 7 12 133404
2: 1994 7 18 155638
3: 1995 7 14 100053
4: 1993 8 16 211223
5: 1994 8 26 111170
6: 1995 8 27 255691
7: 1996 8 21 255691
… Continue

Sorry, I have this data frame:
Year agegroup count
2000 0-4 24
2000 5-9 36
....
2021 0-4 42
2021 95+ 132
And I want to put in the populations column the data that I have in another dataframe, so that the result is this:
Year agegroup count population
2000 0-4 24 123500
2000 5-9 36 132600
....
2021 0-4 42 145200
2021 95+ 132 187540
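One possible approach, assuming the populations sit in a second data frame with one row per year and age group, is a left join with merge(). The name `denominators` and its layout are assumptions made for illustration. The numbers are the four rows shown above.

```r
# Counts, as in the question (only the rows that were shown)
pop <- data.frame(
  Year     = c(2000, 2000, 2021, 2021),
  agegroup = c("0-4", "5-9", "0-4", "95+"),
  count    = c(24, 36, 42, 132)
)

# Hypothetical lookup table: one population per Year/agegroup combination
denominators <- data.frame(
  Year       = c(2000, 2000, 2021, 2021),
  agegroup   = c("0-4", "5-9", "0-4", "95+"),
  population = c(123500, 132600, 145200, 187540)
)

# Left join: every row of pop is kept; rows without a match get NA
pop1 <- merge(pop, denominators, by = c("Year", "agegroup"), all.x = TRUE)
print(pop1)
```

With the denominator attached to each row, a rate is then just `pop1$count / pop1$population`.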

Merging two data frames with different rows in R

I have two data frames. The first one looks like
Country Year production
Germany 1996 11
France 1996 12
Greece 1996 15
UK 1996 17
USA 1996 24
The second one contains all the countries that are in the first data frame plus a few more countries for year 2018. It looks like this
Country Year production
Germany 2018 27
France 2018 29
Greece 2018 44
UK 2018 46
USA 2018 99
Austria 2018 56
Japan 2018 66
I would like to merge the two data frames, and the final table should look like this:
Country Year production
Germany 1996 11
France 1996 12
Greece 1996 15
UK 1996 17
USA 1996 24
Austria 1996 NA
Japan 1996 NA
Germany 2018 27
France 2018 29
Greece 2018 44
UK 2018 46
USA 2018 99
Austria 2018 56
Japan 2018 66
I've tried several functions including full_join, merge, and rbind but they didn't work. Does anybody have any ideas?

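With dplyr and tidyr, you may use:
library(dplyr)
library(tidyr)
# stack both years, then add the missing Country/Year combinations as NA rows
bind_rows(df1, df2) %>%
  complete(Country, Year)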
Country Year production
<chr> <int> <int>
1 Austria 1996 NA
2 Austria 2018 56
3 France 1996 12
4 France 2018 29
5 Germany 1996 11
6 Germany 2018 27
7 Greece 1996 15
8 Greece 2018 44
9 Japan 1996 NA
10 Japan 2018 66
11 UK 1996 17
12 UK 2018 46
13 USA 1996 24
14 USA 2018 99

Consider base R with expand.grid and merge (and avoid any dependencies should you be a package author):
# BUILD DF OF ALL POSSIBLE COMBINATIONS OF COUNTRY AND YEAR
all_country_years <- expand.grid(Country=unique(c(df_96$Country, df_18$Country)),
Year=c(1996, 2018))
# MERGE (LEFT JOIN)
final_df <- merge(all_country_years, rbind(df_96, df_18), by=c("Country", "Year"),
all.x=TRUE)
# ORDER DATA AND RESET ROW NAMES
final_df <- data.frame(with(final_df, final_df[order(Year, Country),]),
row.names = NULL)
final_df
# Country Year production
# 1 Germany 1996 11
# 2 France 1996 12
# 3 Greece 1996 15
# 4 UK 1996 17
# 5 USA 1996 24
# 6 Austria 1996 NA
# 7 Japan 1996 NA
# 8 Germany 2018 27
# 9 France 2018 29
# 10 Greece 2018 44
# 11 UK 2018 46
# 12 USA 2018 99
# 13 Austria 2018 56
# 14 Japan 2018 66

Parallelize for loops that subset panel data by industry-year

I want to carry out an estimation procedure that uses data on all firms in a given sector, for a rolling window of 5 years.
I can do it easily in a loop, but since the estimation procedure takes quite a while, I would like to parallelize it. Is there any way to do this?
My data looks like this:
sale_log cogs_log ppegt_log m_naics4 naics_2 gvkey year
1 3.9070198 2.5146032 3.192821715 9.290151e-02 72 1001 1983
2 4.1028774 2.7375141 3.517861329 1.067687e-01 72 1001 1984
3 4.5909863 3.2106595 3.975112703 2.511660e-01 72 1001 1985
4 3.2560391 2.7867256 -0.763368555 1.351031e-02 44 1003 1982
5 3.2966287 2.8088799 -0.305698649 1.151525e-02 44 1003 1983
6 3.2636907 2.8330357 0.154036559 8.699394e-03 44 1003 1984
7 3.7916480 3.2346849 0.887916936 1.351803e-02 44 1003 1985
8 4.1778028 3.5364473 1.177985972 1.761273e-02 44 1003 1986
9 4.1819066 3.7297111 1.393016951 1.686331e-02 44 1003 1987
10 4.0174411 3.6050022 1.479584215 1.601205e-02 44 1003 1988
11 3.4466429 2.9633579 1.312863013 8.888067e-03 44 1003 1989
12 3.0667367 2.6128805 0.909779173 2.102674e-02 42 1004 1965
13 3.2362968 2.8140391 1.430690273 2.050934e-02 42 1004 1966
14 3.1981990 2.8822097 1.721614365 1.702929e-02 42 1004 1967
15 3.9265031 3.6159280 2.399823853 2.559074e-02 42 1004 1968
16 4.3343438 4.0116068 2.592692585 3.649313e-02 42 1004 1969
17 4.5869564 4.3059855 2.772196529 4.743631e-02 42 1004 1970
18 4.7015486 4.3995561 2.875267240 5.155589e-02 42 1004 1971
19 5.0564414 4.7539697 3.218686385 6.863808e-02 42 1004 1972
20 5.4323873 5.1711531 3.350849771 8.272720e-02 42 1004 1973
21 5.2979696 5.0033437 3.383504340 6.726429e-02 42 1004 1974
22 5.3958779 5.1475985 3.475121024 1.534230e-01 42 1004 1975
23 5.5442635 5.3195666 3.517557041 1.674937e-01 42 1004 1976
24 5.6260795 5.3909462 3.694842501 1.711362e-01 42 1004 1977
25 5.8039766 5.5455887 3.895724689 1.836405e-01 42 1004 1978
26 5.8198831 5.5665980 3.960153940 1.700499e-01 42 1004 1979
27 5.7474447 5.4697019 3.943733263 1.520660e-01 42 1004 1980
where gvkey is the firm id and naics are the industry codes.
The code I wrote:
library(dplyr)
library(prodest)
theta <- matrix(NA_real_, 60, 23)  # one row per year, one column per sector
count <- 1
temp <- dat %>% select(
  "sale_log", "cogs_log", "ppegt_log",
  "m_naics4", "naics_2", "gvkey", "year"
)
for (i in 1960:2019) { # 5-year rolling sector-year specific production functions
  sub <- temp[between(temp$year, i - 5, i), ] # subset 5 years
  jcount <- 1
  for (j in sort(unique(sub$naics_2))) { # loop over sectors
    temp2 <- sub[sub$naics_2 == j, ]
    mdl <- prodestOP(
      Y = temp2$sale_log, fX = temp2$cogs_log, sX = temp2$ppegt_log,
      pX = temp2$cogs_log, cX = temp2$m_naics4, idvar = temp2$gvkey,
      timevar = temp2$year
    )
    theta[count, jcount] <- mdl@Model$FSbetas[2]  # slot access on the S4 result
    jcount <- jcount + 1
  }
  count <- count + 1
}
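Because each (year, sector) estimate is independent of the others, one way to parallelize is to list all pairs up front and hand them to `mclapply()` from the base `parallel` package. The following is only a sketch under stated assumptions: the data are made up, and `fit_one()` is a hypothetical stand-in that fits `lm()` where the real code would call `prodestOP()`. Forking is not available on Windows, so the sketch falls back to one core there.

```r
library(parallel)

# Toy panel standing in for the question's `temp` data frame (values are made up)
set.seed(1)
toy <- data.frame(
  year     = rep(1980:1989, times = 6),
  naics_2  = rep(c(42, 44, 72), each = 20),
  gvkey    = rep(1:6, each = 10),
  cogs_log = rnorm(60)
)
toy$sale_log <- 1 + 0.5 * toy$cogs_log + rnorm(60, sd = 0.1)

# One task per (window end year, sector) pair; year varies fastest
tasks <- expand.grid(year = 1985:1989, sector = sort(unique(toy$naics_2)))

# Hypothetical stand-in for the prodestOP() call: fits lm() on one
# window/sector subset and returns the slope
fit_one <- function(k) {
  i <- tasks$year[k]
  j <- tasks$sector[k]
  sub <- toy[toy$year >= i - 5 & toy$year <= i & toy$naics_2 == j, ]
  unname(coef(lm(sale_log ~ cogs_log, data = sub))[2])
}

# mclapply() forks worker processes; use a single core on Windows
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
slopes <- unlist(mclapply(seq_len(nrow(tasks)), fit_one, mc.cores = n_cores))

# Reshape to the year-by-sector layout of the original theta matrix
theta <- matrix(slopes, nrow = length(unique(tasks$year)),
                dimnames = list(unique(tasks$year), unique(tasks$sector)))
print(theta)
```

Building the task list first also handles windows in which some sectors are absent: such pairs can be dropped from `tasks` rather than tracked with manual counters.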

R stacked percentage bar plot with percentage of binary factor and labels

I want to produce a graphic that looks something like this (with percentage and legend) by R:
My original data is:
AIRBUS BOEING EMBRAER
2002 18 21 30
2003 20 23 31
2004 23 26 29
2005 22 25 26
2006 22 25 25
2007 22 27 17
2008 21 21 16
2009 17 19 22
2010 14 22 24
2011 17 27 22
2012 16 22 19
2013 11 24 19
There are similar questions on SO already, but I seem to lack the sufficient amount of intelligence (or understanding of R) to extrapolate from them to a solution to my particular problem.

First, gather or melt your data into long format. Then it's easy.
library(tidyverse)
df <- read.table(
text = "
YEAR AIRBUS BOEING EMBRAER
2002 18 21 30
2003 20 23 31
2004 23 26 29
2005 22 25 26
2006 22 25 25
2007 22 27 17
2008 21 21 16
2009 17 19 22
2010 14 22 24
2011 17 27 22
2012 16 22 19
2013 11 24 19",
header = TRUE
)
df_long <- df %>%
gather(company, percentage, AIRBUS:EMBRAER)
ggplot(df_long, aes(x = YEAR, y = percentage, fill = company)) +
geom_col() +
ggtitle("Departure delays by company and Year") +
scale_x_continuous(breaks = 2002:2013)
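The code above stacks the raw values. If each bar should instead add up to 100%, the per-year shares can be computed first. A minimal base R sketch, using the first three years of the data and `prop.table()` plus `barplot()` in place of ggplot2 (the axis label and title are illustrative):

```r
df <- read.table(text = "
YEAR AIRBUS BOEING EMBRAER
2002 18 21 30
2003 20 23 31
2004 23 26 29", header = TRUE)

# Row-wise shares: each year's values divided by that year's total
shares <- prop.table(as.matrix(df[, -1]), margin = 1)
rownames(shares) <- df$YEAR

# barplot() draws one bar per matrix column and stacks the rows,
# so transpose to get one bar per year with companies stacked
barplot(t(100 * shares), legend.text = TRUE, ylab = "Share (%)",
        main = "Share by company and year")
```

In ggplot2 the equivalent is to pass `position = "fill"` to `geom_col()`, which rescales every bar to the same height.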

Creating a vector with multiple sequences based on number of IDs' repetitions

I've got a data frame with panel data: subjects' characteristics over time. I need to create a column with a sequence from 1 to the number of years for each subject. For example, if subject 1 is in the data frame from 2000 to 2005, I need the following sequence: 1,2,3,4,5,6.
Below is a small fraction of my data. The last column (exp) is what I am trying to get. Additionally, if you have a look at the first subject (13) you'll see that in 2008 the value of qtty is zero. In this case I just need an NA or a code (0, 1, -9999); it doesn't matter which one.
Below the data is what I did to get that vector, but it didn't work.
Any help will be much appreciated.
subject season qtty exp
13 2000 29 1
13 2001 29 2
13 2002 29 3
13 2003 29 4
13 2004 29 5
13 2005 27 6
13 2006 27 7
13 2007 27 8
13 2008 0 NA
28 2000 18 1
28 2001 18 2
28 2002 18 3
28 2003 18 4
28 2004 18 5
28 2005 18 6
28 2006 18 7
28 2007 18 8
28 2008 18 9
28 2009 20 10
28 2010 20 11
28 2011 20 12
28 2012 20 13
35 2000 21 1
35 2001 21 2
35 2002 21 3
35 2003 21 4
35 2004 21 5
35 2005 21 6
35 2006 21 7
35 2007 21 8
35 2008 21 9
35 2009 14 10
35 2010 11 11
35 2011 11 12
35 2012 10 13
My code:
numbY<-aggregate(season ~ subject, data = toCountY,length)
colnames(numbY)<-c("subject","inFish")
toCountY$inFish<-numbY$inFish[match(toCountY$subject,numbY$subject)]
numbYbyFisher<-unique(numbY)
seqY<-aggregate(numbYbyFisher$inFish, by=list(numbYbyFisher$subject), function(x)seq(1,x,1))

I am using ddply from plyr, and I distinguish two cases:
Either you generate a sequence along subject and replace it with NA where qtty is zero:
library(plyr)
ddply(dat, .(subject), transform, new.exp = ifelse(qtty == 0, NA, seq_along(subject)))
Or you generate a sequence over the rows where qtty is nonzero, with a gap (NA) where qtty is zero:
ddply(dat, .(subject), transform, new.exp = {
  hh <- seq_along(which(qtty != 0))
  if (length(which(qtty == 0)) > 0)
    hh <- append(hh, NA, which(qtty == 0) - 1)
  hh
})

EDITED
ind=qtty!=0
exp=numeric(length(subject))
temp=0
for(i in 1:length(unique(subject[ind]))){
temp[i]=list(seq(from=1,to=table(subject[ind])[i]))
}
exp[ind]=unlist(temp)
This will provide what you need.
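A base R alternative that needs no extra packages is a grouped running count with `ave()`. This is a sketch on part of the question's data (all of subject 13 and the first four rows of subject 28); it counts only rows where qtty is nonzero and puts NA on the zero rows.

```r
dat <- data.frame(
  subject = c(rep(13, 9), rep(28, 4)),
  season  = c(2000:2008, 2000:2003),
  qtty    = c(29, 29, 29, 29, 29, 27, 27, 27, 0, 18, 18, 18, 18)
)

# Running count of nonzero-qtty rows within each subject
nonzero <- dat$qtty != 0
dat$exp <- ave(as.integer(nonzero), dat$subject, FUN = cumsum)

# Rows where qtty is zero get NA instead of a sequence number
dat$exp[!nonzero] <- NA
print(dat)
```

Unlike the loop above, this does not depend on the rows being sorted by subject, because `ave()` writes each group's result back to that group's own rows.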
