R: How to spread, group_by, summarise and mutate at the same time

I want to spread this data below (first 12 rows shown here only) by the column 'Year', returning the sum of 'Orders' grouped by 'CountryName'. Then calculate the % change in 'Orders' for each 'CountryName' from 2014 to 2015.
CountryName Days pCountry Revenue Orders Year
United Kingdom 0-1 days India 2604.799 13 2014
Norway 8-14 days Australia 5631.123 9 2015
US 31-45 days UAE 970.8324 2 2014
United Kingdom 4-7 days Austria 94.3814 1 2015
Norway 8-14 days Slovenia 939.8392 3 2014
South Korea 46-60 days Germany 1959.4199 15 2014
UK 8-14 days Poland 1394.9096 6 2015
UK 61-90 days Lithuania -170.8035 -1 2015
US 8-14 days Belize 1687.68 5 2014
Australia 46-60 days Chile 888.72 2 2014
US 15-30 days Turkey 2320.7355 8 2014
Australia 0-1 days Hong Kong 672.1099 2 2015
I can make this work with a smaller test dataframe, but can only seem to return endless errors like 'sum not meaningful for factors' or 'duplicate identifiers for rows' with the full data. After hours of reading the dplyr docs and trying things I've given up. Can anyone help with this code...
data %>%
spread(Year, Orders) %>%
group_by(CountryName) %>%
summarise_all(.funs=c(Sum='sum'), na.rm=TRUE) %>%
mutate(percent_inc=100*((`2014_Sum`-`2015_Sum`)/`2014_Sum`))
The expected output would be a table similar to below. (Note: these numbers are for illustrative purposes, they are not hand calculated.)
CountryName percent_inc
UK 34.2
US 28.2
Norway 36.1
... ...
Edit
I had to make a few edits to the variable names, please note.

Sum first, while your data are still in long format, then spread. Here's an example with fake data:
library(dplyr)
library(tidyr)
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2014:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
spread(Year, sum_orders) %>%
mutate(Pct = (`2014` - `2015`)/`2014` * 100)
Country `2014` `2015` Pct
1 A 575 599 -4.173913
2 B 457 486 -6.345733
3 C 481 319 33.679834
4 D 423 481 -13.711584
5 E 528 551 -4.356061
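For readers on tidyr 1.0.0 or later, where spread() is superseded by pivot_wider(), the same sum-then-reshape idea can be sketched as below. This is only an illustration on the same fake data, not the answerer's code; the object name wide is made up here, and the exact numbers depend on the R version's random number generator, so no output is shown.

```r
library(dplyr)
library(tidyr)

# Same fake data as in the answer above
set.seed(2)
dat <- data.frame(
  Country = sample(LETTERS[1:5], 500, replace = TRUE),
  Year    = sample(2014:2015, 500, replace = TRUE),
  Orders  = sample(-1:20, 500, replace = TRUE)
)

# Sum in long format first, then reshape to one column per year
wide <- dat %>%
  group_by(Country, Year) %>%
  summarise(sum_orders = sum(Orders, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = Year, values_from = sum_orders) %>%
  # Same formula as the answer: a positive Pct means orders fell from 2014 to 2015
  mutate(Pct = (`2014` - `2015`) / `2014` * 100)

wide
```

Note the sign convention: with (`2014` - `2015`) in the numerator, a positive value is a decrease. Swap the two terms if you want an increase to be positive.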
If you have multiple years, it's probably easier to just keep it in long format until you're ready to make a nice output table:
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2010:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
group_by(Country) %>%
arrange(Country, Year) %>%
mutate(Pct = c(NA, -diff(sum_orders))/lag(sum_orders) * 100)
Country Year sum_orders Pct
<fctr> <int> <int> <dbl>
1 A 2010 205 NA
2 A 2011 144 29.756098
3 A 2012 226 -56.944444
4 A 2013 119 47.345133
5 A 2014 177 -48.739496
6 A 2015 303 -71.186441
7 B 2010 146 NA
8 B 2011 159 -8.904110
9 B 2012 152 4.402516
10 B 2013 180 -18.421053
# ... with 20 more rows
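The c(NA, -diff(sum_orders))/lag(sum_orders) expression can look cryptic. As a small sketch, using the first three sums for Country A from the output above, it is the same as writing "previous value minus current value, divided by previous value":

```r
library(dplyr)  # for dplyr::lag(), which shifts a vector by one position

# First three yearly sums for Country A, taken from the output above
x <- c(205, 144, 226)

# Form used in the answer
a <- c(NA, -diff(x)) / lag(x) * 100

# Equivalent, more explicit form: (previous - current) / previous
b <- (lag(x) - x) / lag(x) * 100

a
b
```

Both vectors start with NA because the first year has no previous year to compare against.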

This is not an answer because you haven't really asked a reproducible question, but just to help out.
Error 1: You're likely getting the 'duplicate identifiers for rows' error because of spread. spread wants to make N columns out of your N unique values, but it needs to know which unique row to place each value in. If you have duplicate value-combinations, for instance:
CountryName Days pCountry Revenue
United Kingdom 0-1 days India 2604.799
United Kingdom 0-1 days India 2604.799
shows up twice, then spread can't tell which row the data belongs in. The quick fix is to add a unique row id before spreading: data %>% mutate(row = row_number()) %>% spread(Year, Orders).
Error 2: You're likely getting the 'sum not meaningful for factors' error because of summarise_all. summarise_all operates on every non-grouping column, but some columns contain strings (or factors). What does United Kingdom + United Kingdom equal? Instead, name the columns to sum, with backticks around the non-syntactic names: summarise(`2014_Sum` = sum(`2014`, na.rm = TRUE), `2015_Sum` = sum(`2015`, na.rm = TRUE)).
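To see Error 1 in isolation, here is a minimal sketch on a two-row toy data frame. The data frame dup and its values are made up for illustration (only the first row mirrors the question's data), and the exact error wording depends on the tidyr version, so the code only checks that an error occurs.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two rows that are identical in every non-value column
dup <- data.frame(
  CountryName = c("United Kingdom", "United Kingdom"),
  Days        = c("0-1 days", "0-1 days"),
  Year        = c(2014, 2014),
  Orders      = c(13, 5)
)

# Both rows map to the same output cell, so spread() refuses to continue
res <- tryCatch(spread(dup, Year, Orders), error = function(e) e)
is_error <- inherits(res, "error")

# Adding a unique row id makes every row distinguishable again
fixed <- dup %>%
  mutate(row = row_number()) %>%
  spread(Year, Orders)

is_error
fixed
```

The row id only works around the error. If the goal is one total per country and year, summing in long format first avoids the duplicates altogether.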

Related

Calculate ratio of values within one column

I have created a simple data frame with simulated GDP data for Costa Rica and the US, using the following code
gdp_test <- read.table(text = "Country, Year, GDP
costa_rica 1979 200
costa_rica 1980 210
costa_rica 1981 250
usa 1979 350
usa 1980 375
usa 1981 421", header=T)
gdp_test <- as.data.frame(gdp_test)
The output is as follows
Country. Year. GDP
1 costa_rica 1979 200
2 costa_rica 1980 210
3 costa_rica 1981 250
4 usa 1979 350
5 usa 1980 375
6 usa 1981 421
What I would like to do is create a new variable consisting of the ratio of each country's GDP, for each year, to the USA's GDP for that same year (obviously the ratio would be 1 for the USA every year).
Any ideas how to do it? It is an easy task in Excel, but I have found no way of doing it within R.
I have not been able to write any code that would do the task
This might do the trick, using the tidyverse. The last line of the pipe replaces the USA's own ratio of 1 with NA; remove that line if you would rather keep the 1.
library(dplyr)
gdp_test %>%
  filter(Country. == "usa") %>%
  group_by(Year.) %>%
  select(-Country.) %>%
  left_join(gdp_test, by = "Year.") %>%
  rename(GDPus = GDP.x, GDP = GDP.y) %>%
  mutate(ratio = GDP / GDPus) %>%
  ungroup() %>%
  mutate(ratio = ifelse(ratio == 1, NA, ratio))
Here is a very clumsy way of getting the job done. I am sure there are much better ways of doing it. Help would be enormously appreciated.
gdp_test <- read.table(text = "Country, Year, GDP
costa_rica 1979 200
costa_rica 1980 210
costa_rica 1981 250
usa 1979 350
usa 1980 375
usa 1981 421", header=T)
library(dplyr)
gdp_test <- gdp_test %>%
  mutate(ID = row_number())
# Rows 4 to 6 hold the USA values; repeat them so they line up with all six rows by ID
gdp_usa <- gdp_test$GDP[4:6]
usa <- data.frame(gdp_usa = c(gdp_usa, gdp_usa)) %>%
  mutate(ID = row_number())
gdp <- full_join(gdp_test, usa, by = "ID")
gdp <- gdp %>% mutate(ratio = GDP / gdp_usa)
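A shorter alternative, offered only as a sketch, is to group by year and divide by the USA row within each group. It assumes the data frame is rebuilt with clean column names (Country, Year, GDP) rather than the Country. and Year. names that read.table produces above, and the name gdp_ratio is made up for this example.

```r
library(dplyr)

# Assumption: same values as in the question, but with clean column names
gdp_test <- data.frame(
  Country = rep(c("costa_rica", "usa"), each = 3),
  Year    = rep(1979:1981, times = 2),
  GDP     = c(200, 210, 250, 350, 375, 421)
)

# Within each year, divide every country's GDP by that year's USA GDP
gdp_ratio <- gdp_test %>%
  group_by(Year) %>%
  mutate(ratio = GDP / GDP[Country == "usa"]) %>%
  ungroup()

gdp_ratio
```

This relies on there being exactly one USA row per year. It does not depend on row positions, so it keeps working if more countries or years are added.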

How to find number of storms per year since 2010?

The question says: Find the number of storms per year since 2010.
So far, I have this as my code in R.
The data set is "storms" which is a dataset that is loaded into R, and is a subset of the NOAA Atlantic hurricane database.
storms %>%
select(status, year) %>%
filter(year == 2010) %>%
tally()
What I don't know is if the "since" keyword means before 2010 or should I just count the number of storms found in 2010?
'Storms per year since 2010' means the number of storms in each year from 2010 onwards, including 2010. Maybe this is what the question is asking:
storms2 = storms %>% filter(year>= 2010)
storms2 %>% count(year)
# A tibble: 11 × 2
year n
<dbl> <int>
1 2010 402
2 2011 323
3 2012 454
4 2013 202
5 2014 139
6 2015 220
7 2016 396
8 2017 306
9 2018 266
10 2019 330
11 2020 570
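One caveat: the storms data has one row per recorded observation of a storm, not one row per storm, so count(year) counts observations. If the question means distinct storms, a sketch like the following may be closer. It assumes a storm is identified by its name within a year, and the names storms_per_year and n_storms are made up for this example.

```r
library(dplyr)

# Keep one row per (year, name) pair, then count those rows per year
storms_per_year <- storms %>%
  filter(year >= 2010) %>%
  distinct(year, name) %>%
  count(year, name = "n_storms")

storms_per_year
```

The counts depend on the version of the storms dataset shipped with your dplyr, so no output is shown here.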

How to fill in time series data into a data frame?

I am working with the following time series data:
Weeks <- c("1995-01", "1995-02", "1995-03", "1995-04", "1995-06", "1995-08", "1995-10", "1995-15", "1995-16", "1995-24", "1995-32")
Country <- c("United States")
Values <- sample(seq(1,500,1), length(Weeks), replace = T)
df <- data.frame(Weeks,Country, Values)
Weeks Country Values
1 1995-01 United States 193
2 1995-02 United States 183
3 1995-03 United States 402
4 1995-04 United States 75
5 1995-06 United States 402
6 1995-08 United States 436
7 1995-10 United States 97
8 1995-15 United States 445
9 1995-16 United States 336
10 1995-24 United States 31
11 1995-32 United States 413
It is structured according to the year and the week number in that year (column 1). Notice how some weeks are omitted (as a result of the aggregation function). For example, 1995-05 is not included. How can I add the omitted rows to the data, fill in the appropriate country name, and assign them a value of 0?
Thank you for your help!
Separate the year and week values into different columns. Then, for each Country and Years, complete the missing weeks and set their Values to 0. Finally, unite the year and week columns to get the data back in the original format.
library(dplyr)
library(tidyr)
df %>%
separate(Weeks, c('Years', 'Weeks'), sep = '-', convert = TRUE) %>%
group_by(Country, Years) %>%
complete(Weeks = min(Weeks):max(Weeks), fill = list(Values = 0)) %>%
ungroup() %>%
mutate(Weeks = sprintf('%02d', Weeks)) %>%
unite(Weeks, Years, Weeks, sep = '-')
# Country Weeks Values
# <chr> <chr> <dbl>
# 1 United States 1995-01 354
# 2 United States 1995-02 395
# 3 United States 1995-03 408
# 4 United States 1995-04 143
# 5 United States 1995-05 0
# 6 United States 1995-06 481
# 7 United States 1995-07 0
# 8 United States 1995-08 49
# 9 United States 1995-09 0
#10 United States 1995-10 229
# … with 22 more rows

Combining & totalling rows in R

I have the below dataset, with the variables as follows:
member_id - an id number for each member
year - the year in question
gender - binary variable, 0 is male, 1 is female
party - the party of the member
Leadership - TRUE if the member holds a leadership position in government or opposition, FALSE if they don't
house_start - the date the member became an MP
Year.Entered - the year they became an MP
Years.in.parliament - how many years it has been since they were first elected
Edu - the number of times the MP has participated in debates related to education in the given year.
member_id year gender party Leadership house_start Year.Entered Years.in.parliament Edu
1 386 1997 0 Conservative FALSE 03/05/1979 1979 18 7
2 37 1997 0 Labour FALSE 03/05/1979 1979 18 10
3 47 1997 0 Labour TRUE 09/06/1983 1983 14 157
4 408 1997 0 Conservative TRUE 03/05/1979 1979 18 48
5 15 1997 1 Liberal Democrat FALSE 09/06/1983 1983 14 3
6 15 1997 1 Liberal Democrat TRUE 09/06/1983 1983 14 9
As you can see from rows 5 and 6 in the dataset, the same member is recorded twice in the one year. This has happened throughout the dataset for some members because of the Leadership variable. For example, this member (id number 15) did not have a leadership position for the first part of 1997 but did get one later in the year. I want to combine these two rows and set the Leadership variable to TRUE in these cases. I also need to sum the Edu values for these rows, so for this member it would become 12 (because I want each member's number of times participated per year for this policy area). So I want it to look like:
member_id year gender party Leadership house_start Year.Entered Years.in.parliament Edu
1 386 1997 0 Conservative FALSE 03/05/1979 1979 18 7
2 37 1997 0 Labour FALSE 03/05/1979 1979 18 10
3 47 1997 0 Labour TRUE 09/06/1983 1983 14 157
4 408 1997 0 Conservative TRUE 03/05/1979 1979 18 48
5 15 1997 1 Liberal Democrat TRUE 09/06/1983 1983 14 12
I have been trying to change these manually in Excel, but I need to do this for several different policy areas, so it is taking a lot of time. Any help would be much appreciated!
We can do a grouped sum, then arrange so that Leadership = TRUE comes first and slice the first row of each group
library(dplyr)
df1 %>%
group_by(member_id, year, gender, party) %>%
mutate(Edu = sum(Edu)) %>%
arrange(party, desc(Leadership)) %>%
slice(1)
For each group you can select the rows where there is only one row or row where Leadership is TRUE.
library(dplyr)
df %>%
group_by(member_id, year, gender, party) %>%
mutate(Edu = sum(Edu)) %>%
filter(n() == 1 | Leadership)
From my understanding, the minimal repeating group is member_id and year. We can then sum the Edu amount defensively (using na.rm = TRUE) and slice the grouped data.frame with which.max, which on a logical vector returns the position of the first TRUE, or the first row if none is TRUE.
library(dplyr)
df %>%
group_by(member_id, year) %>%
mutate(Edu = sum(Edu, na.rm = TRUE)) %>%
slice(which.max(Leadership)) %>%
ungroup()
Alternatively we can use top_n function (which yields the same result):
df %>%
group_by(member_id, year) %>%
mutate(Edu = sum(Edu, na.rm = TRUE)) %>%
top_n(1, Leadership) %>%
ungroup()
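Another way to express the same idea, shown here only as a sketch, is to collapse each group directly with summarise: any() yields TRUE if the member held a leadership position in any of their rows, and sum() totals Edu. The data frame below re-creates a subset of the question's columns by hand, and the name combined is made up for this example.

```r
library(dplyr)

# Subset of the question's columns, typed in from the example rows
df <- data.frame(
  member_id  = c(386, 37, 47, 408, 15, 15),
  year       = 1997,
  gender     = c(0, 0, 0, 0, 1, 1),
  party      = c("Conservative", "Labour", "Labour", "Conservative",
                 "Liberal Democrat", "Liberal Democrat"),
  Leadership = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE),
  Edu        = c(7, 10, 157, 48, 3, 9)
)

# One row per member and year: TRUE if any row was TRUE, Edu summed
combined <- df %>%
  group_by(member_id, year, gender, party) %>%
  summarise(
    Leadership = any(Leadership),
    Edu        = sum(Edu, na.rm = TRUE),
    .groups    = "drop"
  )

combined
```

Unlike the mutate-then-filter approaches, this drops any column that is not in group_by or summarise, so the other constant columns (house_start, Year.Entered, Years.in.parliament) would need to be added to the grouping to be kept.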

Find the nth largest value based on criteria [duplicate]

This is the basically same problem I had in Excel a few days ago (Excel - find nth largest value based on criteria), but this time in R (the data set contains half a million entries and that is more than Excel seems to be able to handle).
I have a table that looks like this that I have imported from Excel:
Country Region Code Product name Year Value
Sweden Stockholm 123 Apple 1991 244
Sweden Kirruna 123 Apple 1987 100
Japan Kyoto 543 Pie 1987 544
Denmark Copenhagen 123 Apple 1998 787
Denmark Copenhagen 123 Apple 1987 100
Denmark Copenhagen 543 Pie 1991 320
Denmark Copenhagen 126 Candy 1999 200
Sweden Gothenburg 126 Candy 2013 300
Sweden Gothenburg 157 Tomato 1987 150
Sweden Stockholm 125 Juice 1987 250
Sweden Kirruna 187 Banana 1998 310
Japan Kyoto 198 Ham 1987 157
Japan Kyoto 125 Juice 1987 550
Japan Tokyo 125 Juice 1991 100
What I want to do is write code that gives me the summed value of the nth best-selling product in a specific country. For instance, the most sold product in Sweden is Apple, so I want the code to find that Apple is the most sold product (in total, which is what I am interested in) and then sum all the values of the apples sold in Sweden: 344.
I also want to be able to find the nth largest value based on both country and year. That is, if I am looking for the most sold product in Sweden in the year 2013, it should return the product Candy and the value 300.
Solution for your first question (find most sold product per country, summarise value for this product) using dplyr:
library(tidyverse)
df %>%
group_by(Country, Product_name) %>%
summarise(sum_value = sum(Value, na.rm = TRUE)) %>%
ungroup() %>%
group_by(Country) %>%
filter(sum_value == max(sum_value))
# A tibble: 3 x 3
# Groups: Country [3]
Country Product_name sum_value
<fctr> <fctr> <int>
1 Denmark Apple 887
2 Japan Juice 650
3 Sweden Apple 344
Solution for second question (show nth most sold products per country and year, sum value):
df %>%
group_by(Country, Product_name, Year) %>%
summarise(sum_value = sum(Value, na.rm = TRUE)) %>%
ungroup() %>%
group_by(Country, Year) %>%
arrange(desc(sum_value), .by_group = TRUE) %>%
slice(., 1:2)
Had to change the data a bit to get a decent output, so here's the output with all years set to 1987 (change the 2 in the 1:2 within the last row for a different n):
# A tibble: 6 x 4
# Groups: Country, Year [3]
Country Product_name Year sum_value
<fctr> <fctr> <int> <int>
1 Denmark Apple 1987 887
2 Denmark Pie 1987 320
3 Japan Juice 1987 650
4 Japan Pie 1987 544
5 Sweden Apple 1987 344
6 Sweden Banana 1987 310
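To pick exactly the nth product rather than the top n, the two steps above can be wrapped in a small helper. This is a sketch only: the function name nth_product and its argument k are made up here, and the data frame is typed in from the question with the column renamed to Product_name, as in the answer.

```r
library(dplyr)

# Data from the question (Region and Code omitted for brevity)
df <- data.frame(
  Country = c("Sweden", "Sweden", "Japan", "Denmark", "Denmark", "Denmark",
              "Denmark", "Sweden", "Sweden", "Sweden", "Sweden", "Japan",
              "Japan", "Japan"),
  Product_name = c("Apple", "Apple", "Pie", "Apple", "Apple", "Pie", "Candy",
                   "Candy", "Tomato", "Juice", "Banana", "Ham", "Juice", "Juice"),
  Year  = c(1991, 1987, 1987, 1998, 1987, 1991, 1999,
            2013, 1987, 1987, 1998, 1987, 1987, 1991),
  Value = c(244, 100, 544, 787, 100, 320, 200,
            300, 150, 250, 310, 157, 550, 100)
)

# Hypothetical helper: the k-th best-selling product per country, by total Value
nth_product <- function(data, k) {
  data %>%
    group_by(Country, Product_name) %>%
    summarise(sum_value = sum(Value, na.rm = TRUE), .groups = "drop") %>%
    group_by(Country) %>%
    arrange(desc(sum_value), .by_group = TRUE) %>%
    slice(k) %>%   # keep only the k-th row within each country
    ungroup()
}

top1 <- nth_product(df, 1)
top2 <- nth_product(df, 2)
top1
```

With ties, slice(k) keeps whichever tied row happens to sort first. Countries with fewer than k products simply return no row.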
