Sum already existing rows into one row given specific years - r

This is my current data
Year
Economy
GDP
2000
US
26
2000
China
24
2000
Rest of the World
100
2001
US
25
2001
China
25
2001
Rest of the World
120
I want to add "China" to "Rest of the World" for each year. My final data should look like this. Thanks in advance for your assistance.
Year
Economy
GDP
2000
US
26
2000
Rest of the World
124
2001
US
25
2001
Rest of the World
145

We can do a group by 'Year' and with case_when replace the 'Economy' values that are no 'US' to "Rest of the World", then get the sum of "GDP"
library(dplyr)
df1 %>%
group_by(Year, Economy = case_when(Economy != "US" ~
"Rest of the World", TRUE ~ Economy)) %>%
summarise(GDP = sum(GDP, na.rm = TRUE))

Related

Summarise rows in dataframe by two columns

I have this data frame called Worldwhich shows the following:
City Year Income Tourist
London 2008 50 100
NY 2009 75 250
Paris 2010 45 340
Dubai 2008 32 240
London 2011 50 140
Abu Dhabi 2009 60 120
Paris 2009 70 140
NY 2007 50 150
Tokyo 2008 45 150
Dubai 2010 40 480
#With 207 more rows
I want to summarise each rows so that every city shows the total income and tourists for all the years. So I want to find a code where City and Years are matched and then summarised so that every city just have one row.
Something like this:
City Income Tourist
London 1051 5040
NY 1547 5432
Paris 2600 4321
Dubai 3222 5312
Abu Dhabi 3100 7654
Tokyo 2404 4321
#With 40 more rows
After the research I've done n_distinct and group_by should be used.
Base R solution:
You can use the sapply() function to iterate over cities.
the first argument will be a vector of unique cities
we then write our function that select all the rows (years) of each city and returns the "Income" and "Tourist" columns
Sum the columns values with colSums() function
Transpose the output using the t() function.
t( sapply( unique( World$City ),function(CITY) colSums(World[World$City==CITY,c("Income","Tourist")] ) ) )
Solution with R's data.table package:
Make sure your object is of type data.table.
in the j part of the bracket (the do part):
you can provide names to the wanted columns ("Income="),
and specify the wanted output ("sum(Income)").
To group the cities, add a by argument to the data.table object.
World[,.(Income=sum(Income),Tourist=sum(Tourist)),by=City]
yes, you can use group_by and summarise function.
world %>% group_by(City) %>% summarise(across(c(Income,Tourist), sum))
you can also add Year in the group by function.
world %>% group_by(City,Year) %>% summarise(across(c(Income,Tourist), sum))

How to filter values within a threshold in R

I have a data set that looks like this with the first 10 rows
country freq
Albania 2
Argentina 4
Australia 26
Austria 14
Belgium 22
Brazil 46
Bulgaria 2
Cambodia 2
Canada 37
Chile 19
I want to filter out counts(frequency) that are less than 30
i tried this code:
dd %>%
group_by(freq) %>%
filter(n()<30)
The output was same with the dataset. I did not get want i want
how do I resolve this?
Thanks in advance
Use simple indexing. Why are you grouping by?
dd <- dd[dd$freq >= 30, ]

How to fill in time series data into a data frame?

I am working with the following time series data:
Weeks <- c("1995-01", "1995-02", "1995-03", "1995-04", "1995-06", "1995-08", "1995-10", "1995-15", "1995-16", "1995-24", "1995-32")
Country <- c("United States")
Values <- sample(seq(1,500,1), length(Weeks), replace = T)
df <- data.frame(Weeks,Country, Values)
Weeks Country Values
1 1995-01 United States 193
2 1995-02 United States 183
3 1995-03 United States 402
4 1995-04 United States 75
5 1995-06 United States 402
6 1995-08 United States 436
7 1995-10 United States 97
8 1995-15 United States 445
9 1995-16 United States 336
10 1995-24 United States 31
11 1995-32 United States 413
It is structured according to the year and the week number in that year (column 1). Notice, how some weeks are omitted (as a result of the aggregation function). For example, 1995-05 is not included. How can I include the omitted rows into the data, add the appropriate country name, and assign them a value = 0?
Thank you for your help!
separate year and week values in different columns. For each Country and Years we complete the missing weeks and assign Values to 0. Finally unite year and week column to get the data in the same format as the original one.
library(dplyr)
library(tidyr)
df %>%
separate(Weeks, c('Years', 'Weeks'), sep = '-', convert = TRUE) %>%
group_by(Country, Years) %>%
complete(Weeks = min(Weeks):max(Weeks), fill = list(Values = 0)) %>%
ungroup() %>%
mutate(Weeks = sprintf('%02d', Weeks)) %>%
unite(Weeks, Years, Weeks, sep = '-')
# Country Weeks Values
# <chr> <chr> <dbl>
# 1 United States 1995-01 354
# 2 United States 1995-02 395
# 3 United States 1995-03 408
# 4 United States 1995-04 143
# 5 United States 1995-05 0
# 6 United States 1995-06 481
# 7 United States 1995-07 0
# 8 United States 1995-08 49
# 9 United States 1995-09 0
#10 United States 1995-10 229
# … with 22 more rows

Combining & totalling rows in R

I have the below dataset, with the variables as follows:
member_id - an id number for each member
year - the year in question
gender - binary variable, 0 is male, 1 is female
party - the party of the member
Leadership - TRUE if the member holds a leadership position in government or opposition, FALSE if they don't
house_start - the date the member became an MP
Year.Entered - the year they became an MP
Years.in.parliament - how many years it has been since they were first elected
Edu - the amount of time the MP has participated in debates related to education in the given year.
member_id year gender party Leadership house_start Year.Entered Years.in.parliament Edu
1 386 1997 0 Conservative FALSE 03/05/1979 1979 18 7
2 37 1997 0 Labour FALSE 03/05/1979 1979 18 10
3 47 1997 0 Labour TRUE 09/06/1983 1983 14 157
4 408 1997 0 Conservative TRUE 03/05/1979 1979 18 48
5 15 1997 1 Liberal Democrat FALSE 09/06/1983 1983 14 3
6 15 1997 1 Liberal Democrat TRUE 09/06/1983 1983 14 9
As you can see with rows 5 and 6 in the dataset, the same member is recorded twice in the one year. This has happened throughout the dataset for some members because of the Leadership variable. For example this member (id number 15) did not have a leadership position for the first part of 1997 but did get one later in the year. I want to be able to combine these two rows and have the Leadership variable as TRUE in these cases. I also need to compute the sum of Edu rows for these as well, so for this member it would become 12 (because I want each members number of times participated per year for this policy area). So I want it to look like:
member_id year gender party Leadership house_start Year.Entered Years.in.parliament Edu
1 386 1997 0 Conservative FALSE 03/05/1979 1979 18 7
2 37 1997 0 Labour FALSE 03/05/1979 1979 18 10
3 47 1997 0 Labour TRUE 09/06/1983 1983 14 157
4 408 1997 0 Conservative TRUE 03/05/1979 1979 18 48
5 15 1997 1 Liberal Democrat TRUE 09/06/1983 1983 14 12
I have been trying to change these manually on Excel, but I need to do this for several different policy areas, so it is taking a lot of time. Any help would be much appreciated!
We can do a group by sum and arrange and slice the first row
library(dplyr)
df1 %>%
group_by(member_id, year, gender, party) %>%
mutate(Edu = sum(Edu)) %>%
arrange(party, desc(Leadership)) %>%
slice(1)
For each group you can select the rows where there is only one row or row where Leadership is TRUE.
library(dplyr)
df %>%
group_by(member_id, year, gender, party) %>%
mutate(Edu = sum(Edu)) %>%
filter(n() == 1 | Leadership)
From my understanding the minimal repeating group is the member_id & year, we can then sum the Edu amount defensively (using na.rm = TRUE) and then slice the grouped data.frame using boolean algebra (taking the maximum of a boolean vector yields true records).
library(dplyr)
df %>%
group_by(member_id, year) %>%
mutate(Edu = sum(Edu, na.rm = TRUE)) %>%
slice(which.max(Leadership)) %>%
ungroup()
Alternatively we can use top_n function (which yields the same result):
df %>%
group_by(member_id, year) %>%
mutate(Edu = sum(Edu, na.rm = TRUE)) %>%
top_n(1, Leadership) %>%
ungroup()

R: How to spread, group_by, summarise and mutate at the same time

I want to spread this data below (first 12 rows shown here only) by the column 'Year', returning the sum of 'Orders' grouped by 'CountryName'. Then calculate the % change in 'Orders' for each 'CountryName' from 2014 to 2015.
CountryName Days pCountry Revenue Orders Year
United Kingdom 0-1 days India 2604.799 13 2014
Norway 8-14 days Australia 5631.123 9 2015
US 31-45 days UAE 970.8324 2 2014
United Kingdom 4-7 days Austria 94.3814 1 2015
Norway 8-14 days Slovenia 939.8392 3 2014
South Korea 46-60 days Germany 1959.4199 15 2014
UK 8-14 days Poland 1394.9096 6. 2015
UK 61-90 days Lithuania -170.8035 -1 2015
US 8-14 days Belize 1687.68 5 2014
Australia 46-60 days Chile 888.72 2. 0 2014
US 15-30 days Turkey 2320.7355 8 2014
Australia 0-1 days Hong Kong 672.1099 2 2015
I can make this work with a smaller test dataframe, but can only seem to return endless errors like 'sum not meaningful for factors' or 'duplicate identifiers for rows' with the full data. After hours of reading the dplyr docs and trying things I've given up. Can anyone help with this code...
data %>%
spread(Year, Orders) %>%
group_by(CountryName) %>%
summarise_all(.funs=c(Sum='sum'), na.rm=TRUE) %>%
mutate(percent_inc=100*((`2014_Sum`-`2015_Sum`)/`2014_Sum`))
The expected output would be a table similar to below. (Note: these numbers are for illustrative purposes, they are not hand calculated.)
CountryName percent_inc
UK 34.2
US 28.2
Norway 36.1
... ...
Edit
I had to make a few edits to the variable names, please note.
Sum first, while your data are still in long format, then spread. Here's an example with fake data:
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2014:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
spread(Year, sum_orders) %>%
mutate(Pct = (`2014` - `2015`)/`2014` * 100)
Country `2014` `2015` Pct
1 A 575 599 -4.173913
2 B 457 486 -6.345733
3 C 481 319 33.679834
4 D 423 481 -13.711584
5 E 528 551 -4.356061
If you have multiple years, it's probably easier to just keep it in long format until you're ready to make a nice output table:
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2010:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
group_by(Country) %>%
arrange(Country, Year) %>%
mutate(Pct = c(NA, -diff(sum_orders))/lag(sum_orders) * 100)
Country Year sum_orders Pct
<fctr> <int> <int> <dbl>
1 A 2010 205 NA
2 A 2011 144 29.756098
3 A 2012 226 -56.944444
4 A 2013 119 47.345133
5 A 2014 177 -48.739496
6 A 2015 303 -71.186441
7 B 2010 146 NA
8 B 2011 159 -8.904110
9 B 2012 152 4.402516
10 B 2013 180 -18.421053
# ... with 20 more rows
This is not an answer because you haven't really asked a reproducible question, but just to help out.
Error 1 You're getting this error duplicate identifiers for rows likely because of spread. spread wants to make N columns of your N unique values but it needs to know which unique row to place those values. If you have duplicate value-combinations, for instance:
CountryName Days pCountry Revenue
United Kingdom 0-1 days India 2604.799
United Kingdom 0-1 days India 2604.799
shows up twice, then spread gets confused which row it should place the data in. The quick fix is to data %>% mutate(row=row_number()) %>% spread... before spread.
Error 2 You're getting this error sum not meaningful for factors likely because of summarise_all. summarise_all will operate on all columns but some columns contain strings (or factors). What does United Kingdom + United Kingdom equal? Try instead summarise(2014_Sum = sum(2014), 2015_Sum = sum(2015)).

Resources