Combining & totalling rows in R - r

I have the below dataset, with the variables as follows:
member_id - an id number for each member
year - the year in question
gender - binary variable, 0 is male, 1 is female
party - the party of the member
Leadership - TRUE if the member holds a leadership position in government or opposition, FALSE if they don't
house_start - the date the member became an MP
Year.Entered - the year they became an MP
Years.in.parliament - how many years it has been since they were first elected
Edu - the amount of time the MP has participated in debates related to education in the given year.
member_id year gender party Leadership house_start Year.Entered Years.in.parliament Edu
1 386 1997 0 Conservative FALSE 03/05/1979 1979 18 7
2 37 1997 0 Labour FALSE 03/05/1979 1979 18 10
3 47 1997 0 Labour TRUE 09/06/1983 1983 14 157
4 408 1997 0 Conservative TRUE 03/05/1979 1979 18 48
5 15 1997 1 Liberal Democrat FALSE 09/06/1983 1983 14 3
6 15 1997 1 Liberal Democrat TRUE 09/06/1983 1983 14 9
As you can see with rows 5 and 6 in the dataset, the same member is recorded twice in the one year. This has happened throughout the dataset for some members because of the Leadership variable. For example, this member (id number 15) did not have a leadership position for the first part of 1997 but did get one later in the year. I want to be able to combine these two rows and have the Leadership variable as TRUE in these cases. I also need to compute the sum of the Edu values for these rows, so for this member it would become 12 (because I want each member's number of participations per year for this policy area). So I want it to look like:
member_id year gender party Leadership house_start Year.Entered Years.in.parliament Edu
1 386 1997 0 Conservative FALSE 03/05/1979 1979 18 7
2 37 1997 0 Labour FALSE 03/05/1979 1979 18 10
3 47 1997 0 Labour TRUE 09/06/1983 1983 14 157
4 408 1997 0 Conservative TRUE 03/05/1979 1979 18 48
5 15 1997 1 Liberal Democrat TRUE 09/06/1983 1983 14 12
I have been trying to change these manually in Excel, but I need to do this for several different policy areas, so it is taking a lot of time. Any help would be much appreciated!

We can sum Edu within each group, sort so that rows with Leadership == TRUE come first, and then slice the first row of each group:
library(dplyr)
df1 %>%
group_by(member_id, year, gender, party) %>%
mutate(Edu = sum(Edu)) %>%
arrange(party, desc(Leadership)) %>%
slice(1)

For each group, you can keep a row if it is the only one in its group, or otherwise keep the row where Leadership is TRUE. This assumes a duplicated member-year contains exactly one TRUE row.
library(dplyr)
df %>%
group_by(member_id, year, gender, party) %>%
mutate(Edu = sum(Edu)) %>%
filter(n() == 1 | Leadership)

From my understanding, the minimal repeating group is member_id and year. We can then sum the Edu amount defensively (using na.rm = TRUE) and slice the grouped data frame with which.max(Leadership): on a logical vector, which.max returns the position of the first TRUE, or 1 if every value is FALSE, so each group keeps exactly one row.
library(dplyr)
df %>%
group_by(member_id, year) %>%
mutate(Edu = sum(Edu, na.rm = TRUE)) %>%
slice(which.max(Leadership)) %>%
ungroup()
Alternatively, we can use the top_n function, which yields the same result as long as Leadership has no ties within a group (top_n keeps all tied rows):
df %>%
group_by(member_id, year) %>%
mutate(Edu = sum(Edu, na.rm = TRUE)) %>%
top_n(1, Leadership) %>%
ungroup()
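The answers above pick one row per group. A different way to express the same rule is to aggregate instead: Leadership becomes TRUE if any row in the member-year is TRUE, and Edu is summed. The following is only a base R sketch under that reading of the question; the object names (static, leader, edu, combined) are illustrative, and only some of the question's columns are reproduced.

```r
# Sample rows copied from the question (only the columns needed here).
df <- data.frame(
  member_id  = c(386, 37, 47, 408, 15, 15),
  year       = 1997,
  party      = c("Conservative", "Labour", "Labour", "Conservative",
                 "Liberal Democrat", "Liberal Democrat"),
  Leadership = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE),
  Edu        = c(7, 10, 157, 48, 3, 9),
  stringsAsFactors = FALSE
)

keys <- c("member_id", "year")

# One aggregate per rule: total Edu, and "held a leadership post at any point".
edu    <- aggregate(Edu ~ member_id + year, data = df, FUN = sum)
leader <- aggregate(Leadership ~ member_id + year, data = df, FUN = any)

# Columns assumed to be constant within a member-year.
static <- unique(df[c(keys, "party")])

# Join the three pieces back together on the keys.
combined <- Reduce(function(a, b) merge(a, b, by = keys),
                   list(static, leader, edu))
print(combined)
```

Because the rule is stated per column, this version does not depend on row order and also handles member-years with more than two rows.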

Related

How to create a loop for sum calculations which then are inserted into a new row?

I have tried to find a solution via similar topics, but haven't found anything suitable. This may be due to the search terms I have used. If I have missed something, please accept my apologies.
Here is an excerpt of my data UN_ (the provided sample should be sufficient):
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
AT 1990 Total 7.869005
AT 1991 1 1.484667
AT 1991 2 1.001578
AT 1991 3 4.625927
AT 1991 4 2.515453
AT 1991 5 2.702081
AT 1991 Total 8.249567
....
BE 1994 1 3.008115
BE 1994 2 1.550344
BE 1994 3 1.080667
BE 1994 4 1.768645
BE 1994 5 7.208295
BE 1994 Total 1.526016
BE 1995 1 2.958820
BE 1995 2 1.571759
BE 1995 3 1.116049
BE 1995 4 1.888952
BE 1995 5 7.654881
BE 1995 Total 1.547446
....
What I want to do is add another row with UN_$sector = Residual. The value of the residual will be (UN_$sector = Total) - (the sum of column UN for the sectors c("1", "2", "3", "4", "5")) for a given year AND country.
This is how it should look:
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
----> AT 1990 Residual TO BE CALCULATED
AT 1990 Total 7.869005
As I don't want to write many, many lines of code I'm looking for a way to automate this. I was told about loops, but can't really follow the concept at the moment.
Thank you very much for any type of help!!
Best,
Constantin
PS: (for Parfait)
country year sector UN ETS
UK 2012 1 190336512 NA
UK 2012 2 18107910 NA
UK 2012 3 8333564 NA
UK 2012 4 11269017 NA
UK 2012 5 2504751 NA
UK 2012 Total 580957306 NA
UK 2013 1 177882200 NA
UK 2013 2 20353347 NA
UK 2013 3 8838575 NA
UK 2013 4 11051398 NA
UK 2013 5 2684909 NA
UK 2013 Total 566322778 NA
Consider calculating residual first and then stack it with other pieces of data:
# CALCULATE RESIDUALS BY MERGED COLUMNS
agg <- within(merge(aggregate(UN ~ country + year, data = subset(df, sector!='Total'), sum),
aggregate(UN ~ country + year, data = subset(df, sector=='Total'), sum),
by=c("country", "year")),
{UN <- UN.y - UN.x
sector = 'Residual'})
# ROW BIND DIFFERENT PIECES
final_df <- rbind(subset(df, sector!='Total'),
agg[c("country", "year", "sector", "UN")],
subset(df, sector=='Total'))
# ORDER ROWS AND RESET ROWNAMES
final_df <- with(final_df, final_df[order(country, year, as.character(sector)),])
row.names(final_df) <- NULL
Rextester demo
final_df
# country year sector UN
# 1 AT 1990 1 1.407555
# 2 AT 1990 2 1.037137
# 3 AT 1990 3 4.769618
# 4 AT 1990 4 2.455139
# 5 AT 1990 5 2.238618
# 6 AT 1990 Residual -4.039062
# 7 AT 1990 Total 7.869005
# 8 AT 1991 1 1.484667
# 9 AT 1991 2 1.001578
# 10 AT 1991 3 4.625927
# 11 AT 1991 4 2.515453
# 12 AT 1991 5 2.702081
# 13 AT 1991 Residual -4.080139
# 14 AT 1991 Total 8.249567
# 15 BE 1994 1 3.008115
# 16 BE 1994 2 1.550344
# 17 BE 1994 3 1.080667
# 18 BE 1994 4 1.768645
# 19 BE 1994 5 7.208295
# 20 BE 1994 Residual -13.090050
# 21 BE 1994 Total 1.526016
# 22 BE 1995 1 2.958820
# 23 BE 1995 2 1.571759
# 24 BE 1995 3 1.116049
# 25 BE 1995 4 1.888952
# 26 BE 1995 5 7.654881
# 27 BE 1995 Residual -13.643015
# 28 BE 1995 Total 1.547446
I think there are multiple ways you can do this. I would recommend taking advantage of the tidyverse suite of packages, which includes dplyr.
Without getting too far into what dplyr and the tidyverse can achieve, the functions we need here are dplyr's group_by(...), summarise(...), arrange(...) and bind_rows(...). There are also plenty of good tutorials, cheat sheets, and documentation for all tidyverse packages.
Although it is less and less relevant these days, we generally want to avoid for loops in R. Therefore, we will create a new data frame that contains all of the Residual values and then bind it back onto your original data frame.
Step 1: Calculating all residual values
For each country and year, we want the Total value minus the sum of UN over the individual sectors. We can compute this with:
library(dplyr)
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN[sector == 'Total'], na.rm = TRUE) - sum(UN[sector != 'Total'], na.rm = TRUE))
Step 2: Add sector column to res_UN with value 'Residual'
Step 1 yields a data frame that contains country, year, and UN. We now need to add a sector column with the value 'Residual' to match your specification.
res_UN$sector = 'Residual'
Step 3: Add res_UN back to UN_ and order accordingly
res_UN and UN_ now share the columns country, year, sector and UN, so they can be bound back together.
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
Piecing this all together answers your question in a couple of lines!
TLDR:
library(dplyr)
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN[sector == 'Total'], na.rm = TRUE) - sum(UN[sector != 'Total'], na.rm = TRUE))
res_UN$sector = 'Residual'
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
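The residual defined in the question (Total minus the sum of sectors 1 to 5, per country and year) can be checked by hand on the AT rows. The following base R sketch is only an illustration; the names un, residual_rows and out are made up for the example, and it assumes each country-year has exactly one Total row.

```r
# AT rows copied from the question's sample.
un <- data.frame(
  country = "AT",
  year    = rep(c(1990, 1991), each = 6),
  sector  = rep(c("1", "2", "3", "4", "5", "Total"), times = 2),
  UN      = c(1.407555, 1.037137, 4.769618, 2.455139, 2.238618, 7.869005,
              1.484667, 1.001578, 4.625927, 2.515453, 2.702081, 8.249567),
  stringsAsFactors = FALSE
)

# One Residual row per country-year: Total minus the sum of the sectors.
residual_rows <- do.call(rbind, lapply(
  split(un, list(un$country, un$year), drop = TRUE),
  function(g) {
    total   <- sum(g$UN[g$sector == "Total"])
    sectors <- sum(g$UN[g$sector != "Total"])
    data.frame(country = g$country[1], year = g$year[1],
               sector = "Residual", UN = total - sectors,
               stringsAsFactors = FALSE)
  }
))

# Stack and sort; "Residual" sorts between "5" and "Total".
out <- rbind(un, residual_rows)
out <- out[order(out$country, out$year, out$sector), ]
row.names(out) <- NULL
print(out)
```

For AT 1990 the sectors sum to 11.908067, so the residual is 7.869005 - 11.908067 = -4.039062.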

How to summarize events prior to a specific event (that can happen multiple times) across multiple observations in r?

I'm trying to collect data on what events have happened prior to a specific event (i.e. bDragons), which can recur within a full observation. This is just an excerpt of one observation where a dragon is taken more than once, and I want to be able to pull insights on each and every one across many observations. So in the data set below, I would want to know that only 1 outer turret was taken prior to the first dragon at Time == 12.891. The next dragon is taken at 20.215, with 4 towers and a dragon before it.
ID TeamObj Time Type Lane League Year Season bResult rResult gamelength Gold
1 1 bTowers 9.397 OUTER_TURRET TOP_LANE CBLoL 2017 Summer 1 0 34 NA
2 1 bDragons 12.891 AIR_DRAGON <NA> CBLoL 2017 Summer 1 0 34 NA
3 1 bTowers 16.215 OUTER_TURRET BOT_LANE CBLoL 2017 Summer 1 0 34 NA
4 1 bTowers 16.591 INNER_TURRET BOT_LANE CBLoL 2017 Summer 1 0 34 NA
5 1 bTowers 19.830 OUTER_TURRET MID_LANE CBLoL 2017 Summer 1 0 34 NA
6 1 bDragons 20.215 EARTH_DRAGON <NA> CBLoL 2017 Summer 1 0 34 NA
7 1 bBarons 22.512 BARON_NASHOR <NA> CBLoL 2017 Summer 1 0 34 NA
8 1 bTowers 23.962 INNER_TURRET MID_LANE CBLoL 2017 Summer 1 0 34 NA
9 1 bTowers 24.707 INNER_TURRET TOP_LANE CBLoL 2017 Summer 1 0 34 NA
10 1 bTowers 24.962 BASE_TURRET TOP_LANE CBLoL 2017 Summer 1 0 34 NA
I'd want this for every TeamObj of that type, but the issue comes up when I try to group_by ID and filter by Time <= which(Team == "bDragons"): the wrong things get filtered out, or I can't summarize based on count(Type) or anything. I'm looking for help with writing some kind of recurring function, or a better way to record and summarize this. I'm looking to fit the observations into a linear model later on, but I can't get past square one.
Am I thinking about my filter incorrectly? My summarize? tst3 %>% group_by(ID) %>% filter(Time <= which(Team == "bDragons")) %>% summarize(count(Type))
Something like:
ID dragonID dragonType Time Baron_Nashor Base_Turret Inner_Turret Nexus_Turret Outer_Turret
1 1 AIR_DRAGON 12.891 N/A N/A N/A N/A 1
2 2 EARTH_DRAGON 20.215 N/A N/A 1 N/A 3
and so on, if that is clear. Want to be able to use each as an observation.
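How about the following:
library(dplyr)
library(tidyr)    # unnest(), spread()
library(stringr)  # str_detect()
tst3 %>%
group_by(ID) %>%
# arrange(Time) %>% # uncomment if needed
mutate(
Type = factor(Type),
# a new block starts on the row after each dragon, so every block ends with a dragon
dragonID = cumsum(dplyr::lag(TeamObj == 'bDragons', default = TRUE))) %>%
group_by(ID, dragonID) %>%
summarize(
dragonType = last(Type),
Time = last(Time),
# per-block counts of every event type
tmp = list(as.data.frame(table(Type)))) %>%
unnest(tmp) %>%
spread(Type, Freq, fill = 0) %>%
# select(-ends_with("DRAGON")) %>%
group_by(ID) %>%
# running totals turn per-block counts into "everything up to this dragon"
mutate_at(vars(BARON_NASHOR:OUTER_TURRET), cumsum) %>%
# drop the trailing block that does not end with a dragon
filter(str_detect(dragonType, "DRAGON"))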
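To make the intended result concrete, here is a small base R sketch that counts, for each dragon, the non-dragon events of the same ID that happened earlier. It is not the answer's method, just a direct restatement of the question's rule; the names events, dragons, other_types, count_types and result are illustrative, and only four of the question's columns are reproduced.

```r
# Excerpt from the question, reduced to the columns used here.
events <- data.frame(
  ID      = 1,
  TeamObj = c("bTowers", "bDragons", "bTowers", "bTowers", "bTowers",
              "bDragons", "bBarons", "bTowers", "bTowers", "bTowers"),
  Time    = c(9.397, 12.891, 16.215, 16.591, 19.830,
              20.215, 22.512, 23.962, 24.707, 24.962),
  Type    = c("OUTER_TURRET", "AIR_DRAGON", "OUTER_TURRET", "INNER_TURRET",
              "OUTER_TURRET", "EARTH_DRAGON", "BARON_NASHOR", "INNER_TURRET",
              "INNER_TURRET", "BASE_TURRET"),
  stringsAsFactors = FALSE
)

dragons     <- events[events$TeamObj == "bDragons", ]
other_types <- sort(unique(events$Type[events$TeamObj != "bDragons"]))

# Count how often each non-dragon type occurs in a vector of types.
count_types <- function(types) {
  vapply(other_types, function(tp) sum(types == tp), numeric(1))
}

# One row of counts per dragon: same ID, strictly earlier, not a dragon.
counts <- t(vapply(seq_len(nrow(dragons)), function(i) {
  earlier <- events$ID == dragons$ID[i] &
    events$Time < dragons$Time[i] &
    events$TeamObj != "bDragons"
  count_types(events$Type[earlier])
}, numeric(length(other_types))))
colnames(counts) <- other_types

result <- data.frame(
  ID         = dragons$ID,
  dragonID   = ave(dragons$Time, dragons$ID, FUN = seq_along),
  dragonType = dragons$Type,
  Time       = dragons$Time,
  counts,
  row.names  = NULL
)
print(result)
```

On this excerpt the first dragon has one outer turret before it, and the second has three outer turrets and one inner turret, which matches the table sketched in the question. Earlier dragons are deliberately not counted here; adding them would need one more column.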

Calculating age per animal by subtracting years in R

I am looking to calculate relative age of animals. I need to subtract sequentially each year from the next for each animal in my dataset. Because an animal can have multiple reproductive events in a year, I need the age for the remaining events in that year (i.e. all events after the first) to be the same as the initial calculation.
Update:
The dataset more resembles this:
Year ID Age
1 1975 6 -1
2 1975 6 -1
3 1976 6 -1
4 1977 6 -1
6 1975 9 -1
8 1978 9 -1
And I need it to look like this
Year ID Age
1 1975 6 0
2 1975 6 0
3 1976 6 1
4 1977 6 2
6 1975 9 0
8 1978 9 3
Apologies for the initial confusion, if I wasn't clear on what I needed to accomplish.
Any help would be greatly appreciated.
Things done "by group" are usually easiest to do using dplyr or data.table
library(dplyr)
your_data %>%
group_by(ID) %>% # group by ID
mutate(Age = Year - min(Year)) # add new column
or
library(data.table)
setDT(your_data) # convert to data table
# add new column by group
your_data[, Age := Year - min(Year), by = ID]
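In base R, ave is probably the easiest way to add a groupwise column to existing data (note that the function has to be passed via the named FUN argument, otherwise ave treats it as another grouping variable):
your_data$Age = with(your_data, ave(Year, ID, FUN = function(x) x - min(x)))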
but the syntax isn't as nice as the options above.
You can test on this data:
your_data = read.table(text = " Year ID Age
1 1975 6 -1
2 1975 6 -1
3 1976 6 -1
4 1977 6 -1
6 1975 9 -1
8 1978 9 -1 ", header = T)
If you're trying to figure out the relative age based on one initial birth year, 1975 (which it seems like you are), then you can just make a new column called "RelativeAge" and set it equal to the year minus 1975. Note that this only works if every animal's first record is in 1975.
data$RelativeAge = data$Year - 1975
Then just get rid of the original "Age" column, or rename it as necessary.
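The difference between the per-animal minimum and a fixed reference year only shows up once an animal's first record is not in 1975. The sketch below uses the question's rows plus one hypothetical animal (ID 12, first seen in 1980, not part of the original data) and assumes an animal's first recorded year is its reference year.

```r
# Rows from the question, plus a hypothetical animal (ID 12) first seen in 1980.
animals <- data.frame(
  Year = c(1975, 1975, 1976, 1977, 1975, 1978, 1980, 1982),
  ID   = c(6, 6, 6, 6, 9, 9, 12, 12)
)

# Relative age per animal: years since that animal's first record.
animals$Age <- ave(animals$Year, animals$ID, FUN = function(x) x - min(x))

# Fixed reference year, for comparison.
animals$FixedAge <- animals$Year - 1975

print(animals)
```

For IDs 6 and 9 the two columns agree; for the hypothetical ID 12 the groupwise version starts again at 0 while the fixed version starts at 5.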

R: How to spread, group_by, summarise and mutate at the same time

I want to spread this data below (first 12 rows shown here only) by the column 'Year', returning the sum of 'Orders' grouped by 'CountryName'. Then calculate the % change in 'Orders' for each 'CountryName' from 2014 to 2015.
CountryName Days pCountry Revenue Orders Year
United Kingdom 0-1 days India 2604.799 13 2014
Norway 8-14 days Australia 5631.123 9 2015
US 31-45 days UAE 970.8324 2 2014
United Kingdom 4-7 days Austria 94.3814 1 2015
Norway 8-14 days Slovenia 939.8392 3 2014
South Korea 46-60 days Germany 1959.4199 15 2014
UK 8-14 days Poland 1394.9096 6. 2015
UK 61-90 days Lithuania -170.8035 -1 2015
US 8-14 days Belize 1687.68 5 2014
Australia 46-60 days Chile 888.72 2. 0 2014
US 15-30 days Turkey 2320.7355 8 2014
Australia 0-1 days Hong Kong 672.1099 2 2015
I can make this work with a smaller test dataframe, but can only seem to return endless errors like 'sum not meaningful for factors' or 'duplicate identifiers for rows' with the full data. After hours of reading the dplyr docs and trying things I've given up. Can anyone help with this code...
data %>%
spread(Year, Orders) %>%
group_by(CountryName) %>%
summarise_all(.funs=c(Sum='sum'), na.rm=TRUE) %>%
mutate(percent_inc=100*((`2014_Sum`-`2015_Sum`)/`2014_Sum`))
The expected output would be a table similar to below. (Note: these numbers are for illustrative purposes, they are not hand calculated.)
CountryName percent_inc
UK 34.2
US 28.2
Norway 36.1
... ...
Edit
I had to make a few edits to the variable names, please note.
Sum first, while your data are still in long format, then spread. Here's an example with fake data:
library(dplyr)
library(tidyr)  # spread()
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2014:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
spread(Year, sum_orders) %>%
mutate(Pct = (`2014` - `2015`)/`2014` * 100)
Country `2014` `2015` Pct
1 A 575 599 -4.173913
2 B 457 486 -6.345733
3 C 481 319 33.679834
4 D 423 481 -13.711584
5 E 528 551 -4.356061
If you have multiple years, it's probably easier to just keep it in long format until you're ready to make a nice output table:
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2010:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
group_by(Country) %>%
arrange(Country, Year) %>%
mutate(Pct = c(NA, -diff(sum_orders))/lag(sum_orders) * 100)
Country Year sum_orders Pct
<fctr> <int> <int> <dbl>
1 A 2010 205 NA
2 A 2011 144 29.756098
3 A 2012 226 -56.944444
4 A 2013 119 47.345133
5 A 2014 177 -48.739496
6 A 2015 303 -71.186441
7 B 2010 146 NA
8 B 2011 159 -8.904110
9 B 2012 152 4.402516
10 B 2013 180 -18.421053
# ... with 20 more rows
This is not an answer because you haven't really asked a reproducible question, but just to help out.
Error 1: You're likely getting the error "duplicate identifiers for rows" because of spread. spread wants to make N columns from your N unique values, but it needs to know which unique row to place those values in. If you have duplicate value combinations, for instance:
CountryName Days pCountry Revenue
United Kingdom 0-1 days India 2604.799
United Kingdom 0-1 days India 2604.799
showing up twice, then spread cannot tell which row the data should go in. The quick fix is to add a unique row id before spreading: data %>% mutate(row = row_number()) %>% spread(...).
Error 2: You're likely getting the error "sum not meaningful for factors" because of summarise_all. summarise_all operates on all columns, but some columns contain strings (or factors). What does United Kingdom + United Kingdom equal? Try instead summarise(`2014_Sum` = sum(`2014`, na.rm = TRUE), `2015_Sum` = sum(`2015`, na.rm = TRUE)); the backticks are needed because these column names start with a digit.
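Both errors go away if the orders are summed per country and year before anything is reshaped. As a dependency-free illustration of that idea, tapply can sum and spread in one step. The data below are made up for the example (they are not the asker's data), and percent change is written here as (2015 - 2014) / 2014, so a positive value means growth; flip the sign to match the question's percent_inc.

```r
# Hypothetical long-format orders; several rows per country and year.
orders <- data.frame(
  CountryName = c("UK", "UK", "UK", "US", "US", "Norway", "Norway"),
  Year        = c(2014, 2014, 2015, 2014, 2015, 2014, 2015),
  Orders      = c(13, 7, 10, 8, 10, 4, 3),
  stringsAsFactors = FALSE
)

# Sum Orders per country and year; the result is a country x year matrix.
wide <- tapply(orders$Orders, list(orders$CountryName, orders$Year), sum)

# Percent change from 2014 to 2015 for every country.
pct_change <- 100 * (wide[, "2015"] - wide[, "2014"]) / wide[, "2014"]

print(wide)
print(pct_change)
```

Because the duplicates are collapsed by the sum, there is never more than one value per country-year cell, which is the condition spread was complaining about.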

Grouping within group in R, plyr/dplyr

I'm working on the baseball data set:
data(baseball, package="plyr")
library(dplyr)
baseball[,1:4] %>% head
id year stint team
4 ansonca01 1871 1 RC1
44 forceda01 1871 1 WS3
68 mathebo01 1871 1 FW1
99 startjo01 1871 1 NY2
102 suttoez01 1871 1 CL1
106 whitede01 1871 1 CL1
First I want to group the data set by team in order to find the first year each team appears, and the number of distinct players that have ever played for each team:
baseball[,1:4] %>% group_by(team) %>%
summarise("first_year"=min(year), "num_distinct_players"=n_distinct(id))
# A tibble: 132 × 3
team first_year num_distinct_players
<chr> <int> <int>
1 ALT 1884 1
2 ANA 1997 29
3 ARI 1998 43
4 ATL 1966 133
5 BAL 1954 158
Now I want to add a column showing the maximum number of years any player (id) has played for the team in question. To do this, I need to somehow group by player within the existing group (team), and select the maximum number of rows. How do I do this?
Perhaps this helps
baseball %>%
select(1:4) %>%
group_by(id, team) %>%
dplyr::mutate(nyear = n_distinct(year)) %>%
group_by(team) %>%
dplyr::summarise(first_year = min(year),
num_distinct_players = n_distinct(id),
maxYear = max(nyear))
I tried doing this with base R and came up with this. It's fairly slow.
df = data.frame(t(sapply(split(baseball, baseball$team), function(x)
cbind( min(x$year),
length(unique(x$id)),
max(sapply(split(x,x$id), function(y)
nrow(y))),
names(which.max(sapply(split(x,x$id), function(y)
nrow(y)))) ))))
colnames(df) = c("Year", "Unique Players", "Longest played duration",
"Longest Playing Player")
1. First, split by team into different groups.
2. For each group, obtain the minimum year as the first year in which the team appears.
3. Get the length of the unique ids, which is the number of players in that team.
4. Split each group into subgroups by id and obtain the maximum number of rows, which gives the maximum duration played by a player in that team.
5. For each subgroup, get the name of the id with the maximum number of rows, which gives the player who played for the longest time in that team.
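The "group within a group" step can also be written as two passes: first count seasons per player and team, then summarise per team. The sketch below uses a tiny made-up data frame in place of plyr's baseball data (the ids, years and teams are hypothetical) and counts distinct years, so two stints in the same year count once.

```r
# Hypothetical stand-in for baseball[, c("id", "year", "team")].
games <- data.frame(
  id   = c("a", "a", "a", "a", "b", "b", "c", "c"),
  year = c(1871, 1872, 1872, 1873, 1871, 1873, 1880, 1881),
  team = c("X", "X", "X", "X", "X", "X", "Y", "Y"),
  stringsAsFactors = FALSE
)

# Inner grouping: distinct seasons per player and team.
seasons <- unique(games[c("id", "team", "year")])
years_per_player <- aggregate(year ~ id + team, data = seasons, FUN = length)
names(years_per_player)[names(years_per_player) == "year"] <- "n_years"

# Outer grouping: one value per team for each summary.
first_year <- tapply(games$year, games$team, min)
n_players  <- tapply(games$id, games$team, function(x) length(unique(x)))
max_years  <- tapply(years_per_player$n_years, years_per_player$team, max)

teams <- names(first_year)
team_summary <- data.frame(
  team                 = teams,
  first_year           = as.vector(first_year[teams]),
  num_distinct_players = as.vector(n_players[teams]),
  max_years            = as.vector(max_years[teams])
)
print(team_summary)
```

In the made-up data, player "a" has two rows for team X in 1872, and max_years for X is still 3 rather than 4.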
