Calculate total number of distinct users across years, cumulatively - r

Let's say I have a data.frame like so:
user_df = read.table(text = "id industry year
1 Government 1999
2 Government 1999
3 Government 1999
4 Private 1999
5 NGO 1999
1 Government 2000
2 Government 2000
3 Government 2000
4 Government 2000
1 Government 2001
5 Government 2001
2 Private 2001
3 Private 2001
4 Private 2001", header = T)
For each user I have a unique id, industry, and year.
I'm trying to compute a cumulative count of the people who have ever worked in Government, so the cumulative count should be the total number of unique users for that year and all preceding years.
I know I can do an ordinary cumulative sum like so:
user_df %>% group_by(year, industry) %>% summarize(cum_sum = cumsum(n_distinct(id)))
year industry cum_sum
<int> <chr> <int>
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 2
6 2001 Private 3
However, this isn't what I want since the sums in the year 2000 and 2001 will include people who have already been included in 1999. I want each year to be a cumulative count of the total number of unique users that have ever worked in Government at a given year. I couldn't figure out the right way to do this in dplyr.
So the correct output should look like:
year industry cum_sum
<int> <chr> <int>
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 5
6 2001 Private 3

One option might be:
user_df %>%
group_by(industry) %>%
mutate(cum_sum = cumsum(!duplicated(id))) %>%
group_by(year, industry) %>%
summarise(cum_sum = max(cum_sum))
year industry cum_sum
<int> <fct> <int>
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 5
6 2001 Private 3
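Note that cumsum(!duplicated(id)) counts ids in the order the rows appear, so the approach above relies on the rows being sorted by year within each industry. A minimal sketch, assuming you cannot guarantee that order, is to sort explicitly first. The shuffling step below is only there to illustrate the point and is not part of the original answer:

```r
library(dplyr)

user_df <- read.table(text = "id industry year
1 Government 1999
2 Government 1999
3 Government 1999
4 Private 1999
5 NGO 1999
1 Government 2000
2 Government 2000
3 Government 2000
4 Government 2000
1 Government 2001
5 Government 2001
2 Private 2001
3 Private 2001
4 Private 2001", header = TRUE)

# Shuffle the rows to mimic unsorted input (illustration only)
set.seed(1)
shuffled <- user_df[sample(nrow(user_df)), ]

res <- shuffled %>%
  arrange(industry, year) %>%                  # restore chronological order
  group_by(industry) %>%
  mutate(cum_sum = cumsum(!duplicated(id))) %>% # running count of new ids
  group_by(year, industry) %>%
  summarise(cum_sum = max(cum_sum), .groups = "drop")

res
```

With the explicit arrange() the result should no longer depend on how the input happens to be ordered.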

1) sqldf This can be implemented through a complex self-join in SQL. It joins each row to the rows having the same industry and the same or an earlier year, then groups by year and industry, counting the distinct ids.
library(sqldf)
sqldf("select a.year, a.industry, count(distinct b.id) cum_sum
from user_df a
left join user_df b on b.industry = a.industry and b.year <= a.year
group by a.year, a.industry")
giving:
year industry cum_sum
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 5
6 2001 Private 3
2) base A base-only solution merges the data frame with itself on industry, subsets to the same or an earlier year, and aggregates over industry and year. This is inefficient: unlike the SQL statement, which filters as it joins, this creates the entire join before filtering it down. However, if your data is not too large this may be sufficient.
m <- merge(user_df, user_df, by = "industry")
s <- subset(m, year.y <= year.x)
ag <- aggregate(id.y ~ industry + year.x, s, function(x) length(unique(x)))
names(ag) <- sub("\\..*", "", names(ag))
ag
giving:
industry year id
1 Government 1999 3
2 NGO 1999 1
3 Private 1999 1
4 Government 2000 4
5 Government 2001 5
6 Private 2001 3

Related

Is there a way I can get the maximum value for each group after a double group_by in R?

I am trying to extract the team with the maximum number of wins each year in women's college basketball. So far I have the number of wins per team per year, but I want only the team with the most wins in each year.
winsbyyear <- WomenCBnewdf %>%
group_by(Year,Team)%>%
summarise(totalwinsyr = sum(Outcome))
Output currently looks like this, but I am expecting to see each year only once with the team with the maximum number of wins in the subsequent columns
Year Team totalwinsyr
<fct> <chr> <dbl>
1 2014 AbileneChristian 10
2 2014 AirForce 0
3 2014 Akron 18
4 2014 Alabama 10
5 2014 AlabamaAM 3
6 2014 AlabamaHuntsville 0
7 2014 AlabamaMobile 0
8 2014 AlabamaSt 15
9 2014 AlaskaAnchorage 1
10 2014 AlbanyNY 16
I have already looked at "How to select the rows with maximum values in each group with dplyr?" but I could not find anything there that helps with a group_by() on multiple variables.
Compute the total wins per team and year, then regroup by Year alone and keep the maximum:
winsbyyear <- WomenCBnewdf %>%
group_by(Year, Team) %>%
summarise(totalwinsyr = sum(Outcome)) %>%
group_by(Year) %>%
filter(totalwinsyr == max(totalwinsyr))
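The key point is that the maximum has to be taken within each Year, not within each Year and Team pair, where every row is trivially its own maximum. Here is a minimal sketch on made-up data; the games data frame and its values are hypothetical stand-ins for WomenCBnewdf, and slice_max() is used as one possible way to pick the top row per year:

```r
library(dplyr)

# Hypothetical toy data: one row per game, Outcome is 1 for a win, 0 for a loss
games <- data.frame(
  Year    = c(2014, 2014, 2014, 2014, 2015, 2015, 2015),
  Team    = c("Akron", "Akron", "Alabama", "Alabama", "Akron", "Alabama", "Alabama"),
  Outcome = c(1, 1, 1, 0, 0, 1, 1)
)

top_by_year <- games %>%
  group_by(Year, Team) %>%
  summarise(totalwinsyr = sum(Outcome), .groups = "drop") %>% # wins per team and year
  group_by(Year) %>%                                          # regroup by year only
  slice_max(totalwinsyr, n = 1) %>%                           # ties are kept by default
  ungroup()

top_by_year
```

If two teams tie for the most wins in a year, both rows are returned; pass with_ties = FALSE to slice_max() if you want exactly one row per year.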

Replacing NA values with values from neighbouring rows [duplicate]

I need some help filling cells that have 'NA' values with other values which are already present in the surrounding rows.
I currently have a panel dataset of investors and their activities. Some of the rows were missing, so I have completed the panel to include these rows, replacing the financial deal information with '0' values.
The other variables relate to wider firm characteristics, such as region and strategy. I am unsure how to replicate these for each firm.
This is my code so far.
df <- df %>%
group_by(investor) %>%
mutate(min = min(dealyear, na.rm = TRUE),
max = max(dealyear, na.rm = TRUE)) %>%
complete(investor, dealyear = min:max, fill = list(counttotal=0, countgreen=0, countbrown=0))
An example of data before completion - notice year 2004 is missing.
investor dealyear dealcounts strategy region
123IM 2002 5 buyout europe
123IM 2003 5 buyout europe
123IM 2005 5 buyout europe
123IM 2006 5 buyout europe
Example of data after completion, with missing row added in
investor dealyear dealcounts strategy region
123IM 2002 5 buyout europe
123IM 2003 5 buyout europe
123IM 2004 0 NA NA
123IM 2005 5 buyout europe
123IM 2006 5 buyout europe
How would I go about replacing these NA values with the corresponding information for each investment firm?
Many thanks
Rory
You may use complete with group_by as -
library(dplyr)
library(tidyr)
df %>%
group_by(investor) %>%
complete(dealyear = min(dealyear):max(dealyear),
fill = list(dealcounts = 0)) %>%
ungroup
# investor dealyear dealcounts strategy region
# <chr> <int> <dbl> <chr> <chr>
#1 123IM 2002 5 buyout europe
#2 123IM 2003 5 buyout europe
#3 123IM 2004 0 NA NA
#4 123IM 2005 5 buyout europe
#5 123IM 2006 5 buyout europe
If you want to replace the NA values in the strategy and region columns, you may use fill, which by default carries the last non-missing value downwards within each group.
df %>%
group_by(investor) %>%
complete(dealyear = min(dealyear):max(dealyear),
fill = list(dealcounts = 0)) %>%
fill(strategy, region) %>%
ungroup
# investor dealyear dealcounts strategy region
# <chr> <int> <dbl> <chr> <chr>
#1 123IM 2002 5 buyout europe
#2 123IM 2003 5 buyout europe
#3 123IM 2004 0 buyout europe
#4 123IM 2005 5 buyout europe
#5 123IM 2006 5 buyout europe

How do I go about filtering my data by the upper 50th percentile for a separate dependent variable?

I need to split my data so that when I use the facet_wrap I have the top 50 percentile for each year.
Here is a sample of my data:
# A tibble: 10,519 x 3
Species Abundance Year
<chr> <dbl> <chr>
1 Astropecten irregularis 2 2009
2 Asterias rubens 14 2009
3 Echinus esculentus 1 2009
4 Pagurus prideaux 1 2009
5 Raja clavata 1 2009
6 Astropecten irregularis 4 2009
7 Asterias rubens 47 2009
8 Henricia sp. 2 2009
9 Ophiura ophiura 8 2009
10 Solaster endeca 1 2009
# ... with 10,509 more rows
My current strategy is this:
Data <- All_years %>%
group_by(Species, Year) %>%
summarise(Abundance = sum(Abundance, na.rm = TRUE)) %>%
filter(quantile(Abundance, 0.50)<Abundance) %>%
filter(Abundance > 50)
The issue is that this gives me the top 50 percentile for the whole set while I would like it to give me the top 50 for each year so I can then display it with a facet_wrap in ggplot.
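One possible way to get the cut-off per year is to regroup by Year after summarising, so that quantile() is evaluated within each year. The sketch below uses a small made-up All_years data frame (the species names and values are hypothetical) and leaves out the extra Abundance > 50 filter for brevity:

```r
library(dplyr)

# Hypothetical toy data with the same columns as the question
All_years <- data.frame(
  Species   = c("A", "B", "C", "D", "A", "B", "C", "D"),
  Abundance = c(10, 20, 30, 40, 100, 200, 300, 400),
  Year      = rep(c("2009", "2010"), each = 4)
)

Data <- All_years %>%
  group_by(Species, Year) %>%
  summarise(Abundance = sum(Abundance, na.rm = TRUE), .groups = "drop") %>%
  group_by(Year) %>%                                   # quantile is now per year
  filter(Abundance > quantile(Abundance, 0.50)) %>%
  ungroup()

Data
```

Each year keeps only the species above that year's own median, so the 2009 rows are not judged against the much larger 2010 values.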

How to create a loop for sum calculations which then are inserted into a new row?

I have tried to find a solution via similar topics, but haven't found anything suitable. This may be due to the search terms I have used. If I have missed something, please accept my apologies.
Here is an excerpt of my data UN_ (the provided sample should be sufficient):
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
AT 1990 Total 7.869005
AT 1991 1 1.484667
AT 1991 2 1.001578
AT 1991 3 4.625927
AT 1991 4 2.515453
AT 1991 5 2.702081
AT 1991 Total 8.249567
....
BE 1994 1 3.008115
BE 1994 2 1.550344
BE 1994 3 1.080667
BE 1994 4 1.768645
BE 1994 5 7.208295
BE 1994 Total 1.526016
BE 1995 1 2.958820
BE 1995 2 1.571759
BE 1995 3 1.116049
BE 1995 4 1.888952
BE 1995 5 7.654881
BE 1995 Total 1.547446
....
What I want to do is add another row with UN_$sector = Residual. The value of the residual will be (UN_$sector = Total) - (the sum of column UN for the sectors c("1", "2", "3", "4", "5")) for a given year AND country.
This is how it should look:
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
----> AT 1990 Residual TO BE CALCULATED
AT 1990 Total 7.869005
As I don't want to write many, many lines of code I'm looking for a way to automate this. I was told about loops, but can't really follow the concept at the moment.
Thank you very much for any type of help!!
Best,
Constantin
PS: (for Parfait)
country year sector UN ETS
UK 2012 1 190336512 NA
UK 2012 2 18107910 NA
UK 2012 3 8333564 NA
UK 2012 4 11269017 NA
UK 2012 5 2504751 NA
UK 2012 Total 580957306 NA
UK 2013 1 177882200 NA
UK 2013 2 20353347 NA
UK 2013 3 8838575 NA
UK 2013 4 11051398 NA
UK 2013 5 2684909 NA
UK 2013 Total 566322778 NA
Consider calculating residual first and then stack it with other pieces of data:
# CALCULATE RESIDUALS BY MERGED COLUMNS
agg <- within(merge(aggregate(UN ~ country + year, data = subset(df, sector!='Total'), sum),
aggregate(UN ~ country + year, data = subset(df, sector=='Total'), sum),
by=c("country", "year")),
{UN <- UN.y - UN.x
sector = 'Residual'})
# ROW BIND DIFFERENT PIECES
final_df <- rbind(subset(df, sector!='Total'),
agg[c("country", "year", "sector", "UN")],
subset(df, sector=='Total'))
# ORDER ROWS AND RESET ROWNAMES
final_df <- with(final_df, final_df[order(country, year, as.character(sector)),])
row.names(final_df) <- NULL
Rextester demo
final_df
# country year sector UN
# 1 AT 1990 1 1.407555
# 2 AT 1990 2 1.037137
# 3 AT 1990 3 4.769618
# 4 AT 1990 4 2.455139
# 5 AT 1990 5 2.238618
# 6 AT 1990 Residual -4.039062
# 7 AT 1990 Total 7.869005
# 8 AT 1991 1 1.484667
# 9 AT 1991 2 1.001578
# 10 AT 1991 3 4.625927
# 11 AT 1991 4 2.515453
# 12 AT 1991 5 2.702081
# 13 AT 1991 Residual -4.080139
# 14 AT 1991 Total 8.249567
# 15 BE 1994 1 3.008115
# 16 BE 1994 2 1.550344
# 17 BE 1994 3 1.080667
# 18 BE 1994 4 1.768645
# 19 BE 1994 5 7.208295
# 20 BE 1994 Residual -13.090050
# 21 BE 1994 Total 1.526016
# 22 BE 1995 1 2.958820
# 23 BE 1995 2 1.571759
# 24 BE 1995 3 1.116049
# 25 BE 1995 4 1.888952
# 26 BE 1995 5 7.654881
# 27 BE 1995 Residual -13.643015
# 28 BE 1995 Total 1.547446
There are multiple ways to do this. I would recommend taking advantage of the tidyverse suite of packages, which includes dplyr.
Without getting too far into what dplyr and the tidyverse can achieve, we can use dplyr's group_by(...), summarise(...), arrange(...) and bind_rows(...) functions. There are also plenty of great tutorials, cheat sheets, and documentation on all tidyverse packages.
Although it is less and less relevant these days, we generally want to avoid for loops in R. Therefore, we will create a new data frame which contains all of the Residual values and then bring it back into your original data frame.
Step 1: Calculating all residual values
For each country and year, we want the Total value minus the sum of the UN values of the individual sectors. We can achieve this with
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN[sector == 'Total'], na.rm = T) - sum(UN[sector != 'Total'], na.rm = T))
Step 2: Add sector column to res_UN with value 'Residual'
This yields a data frame which contains country, year, and UN. We now need to add a column sector with the value 'Residual' to satisfy your specifications.
res_UN$sector = 'Residual'
Step 3: Add res_UN back to UN_ and order accordingly
res_UN and UN_ now have the same columns, so they can be added back together.
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
Piecing this all together should answer your question, and it can be achieved in a couple of lines!
TLDR:
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN[sector == 'Total'], na.rm = T) - sum(UN[sector != 'Total'], na.rm = T))
res_UN$sector = 'Residual'
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
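As a self-contained check of the residual calculation, here is a minimal sketch on the AT 1990 rows from the question, taking the residual as Total minus the sum of the numbered sectors. The names res_UN and result are only illustrative, and the sector column is assumed to be read as character:

```r
library(dplyr)

# AT 1990 rows copied from the question
UN_ <- read.table(text = "country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
AT 1990 Total 7.869005", header = TRUE, stringsAsFactors = FALSE)

# Residual = Total minus the sum of all other sectors, per country and year
res_UN <- UN_ %>%
  group_by(country, year) %>%
  summarise(UN = sum(UN[sector == "Total"]) - sum(UN[sector != "Total"]),
            .groups = "drop") %>%
  mutate(sector = "Residual")

# Stack the residual rows onto the original data and sort them into place
result <- bind_rows(UN_, res_UN) %>%
  arrange(country, year, sector)

result
```

Because "Residual" sorts after the digits and before "Total", the new row lands directly above the Total row of each country and year.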

table restructure split R [closed]

I have a table with a variable Technology, which includes "All Renewables", "Biomass", "Solar", "Offshore wind", "Onshore wind" and "Wind".
I would like "All Renewables" to be split into "Biomass", "Solar", "Offshore wind" and "Onshore wind", and "Wind" to be split into "Offshore wind" and "Onshore wind".
The table looks approximately as follows:
Table
Year Country Technology Changes
2000 A Solar 1
2000 A Wind 2
2000 A Onshore wind 2
2000 A All Renewables 3
It should look as follows after the re-structuring:
Table
Year Country Technology Changes
2000 A Solar 1
2000 A Onshore wind 2
2000 A Offshore wind 2
2000 A Onshore wind 3
2000 A Biomass 3
2000 A Solar 3
2000 A Onshore wind 3
2000 A Offshore wind 3
If anybody could help, I would be really really thankful.
Sarah
You could rename factor levels and use tidyr::separate_rows
lvls <- c(
"Biomass, Solar, Offshore wind, Onshore wind",
"Onshore wind",
"Solar",
"Offshore wind, Onshore wind")
levels(df$Technology) <- lvls;
library(tidyverse)
df %>% separate_rows(Technology, sep = ", ") %>%
group_by_all() %>%
slice(1) %>%
ungroup() %>%
arrange(Changes)
## A tibble: 7 x 4
# Year Country Technology Changes
# <int> <fct> <chr> <int>
#1 2000 A Solar 1
#2 2000 A Offshore wind 2
#3 2000 A Onshore wind 2
#4 2000 A Biomass 3
#5 2000 A Offshore wind 3
#6 2000 A Onshore wind 3
#7 2000 A Solar 3
Explanation: We redefine factor levels such that "All Renewables" becomes "Biomass, Solar, Offshore wind, Onshore wind" and "Wind" becomes "Offshore wind, Onshore wind". Then we use tidyr::separate_rows to split entries with a comma into separate rows. All that remains is the removal of duplicates and the re-ordering of rows.
Sample data
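# stringsAsFactors = TRUE keeps Technology a factor in R >= 4.0,
# which the levels(df$Technology) assignment relies on
df <- read.table(text =
"Year Country Technology Changes
2000 A 'Solar' 1
2000 A 'Wind' 2
2000 A 'Onshore wind' 2
2000 A 'All Renewables' 3", header = T, stringsAsFactors = TRUE)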
Just a question of merging (with tidyverse) :
# Your data:
df <- read.csv(textConnection("Y, A, B, C
2000,A,Solar,1
2000,A,Wind,2
2000,A,Onshore wind,2
2000,A,All Renewables,3"),stringsAsFactors=FALSE)
# Your synonyms:
c <- read.csv(textConnection("B, D
All Renewables,Biomass
All Renewables,Solar
All Renewables,Offshore wind
All Renewables,Onshore wind
Wind,Offshore wind
Wind,Onshore wind"),stringsAsFactors=FALSE)
df %>% left_join(c,by="B") %>% mutate(B=coalesce(D,B)) %>% select(-D)
# Y A B C
#1 2000 A Solar 1
#2 2000 A Offshore wind 2
#3 2000 A Onshore wind 2
#4 2000 A Onshore wind 2
#5 2000 A Biomass 3
#6 2000 A Solar 3
#7 2000 A Offshore wind 3
#8 2000 A Onshore wind 3

Resources