Calculating proportion by year using dplyr - r

I'm attempting to calculate how often each value of a variable (in this case country) shows up in any given year. For example:
name <- c('AJ Griffin','Steve Bacon','Kevin Potatoe','Jose Hernandez','Kent Brockman',
'Sal Fasno','Kirk Kelly','Wes United','Livan Domingo','Mike Fast')
country <- c('USA', 'USA', 'Canada', 'Dominican Republic', 'Panama', 'Dominican Republic', 'Canada', 'USA', 'Dominican Republic', 'Mexico')
year <- c('2016', '2016', '2016', '2016', '2016', '2015', '2015', '2015', '2015', '2015')
country_analysis <-data.frame(name, country, year)
When I use the following code I get the proportion of countries for the entire dataset, but I would like to pare this down even further to specific years.
P <- country_analysis %>%
group_by(country) %>%
summarise(n=n())%>%
mutate(freq = round(n / sum(n), 1))
Ideally the end result would have country, year, frequency column (i.e. 2016, USA, 0.4). Any input would be appreciated.

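First count the rows by year and country, then compute the proportions within each year. summarize() drops the last grouping level (country), so the mutate() that follows runs per year. For example: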
country_analysis %>%
group_by(year, country) %>%
summarize(count=n()) %>%
mutate(proportion=count/sum(count))
# year country count proportion
# <fctr> <fctr> <int> <dbl>
# 1 2015 Canada 1 0.2
# 2 2015 Dominican Republic 2 0.4
# 3 2015 Mexico 1 0.2
# 4 2015 USA 1 0.2
# 5 2016 Canada 1 0.2
# 6 2016 Dominican Republic 1 0.2
# 7 2016 Panama 1 0.2
# 8 2016 USA 2 0.4
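The same per-year proportion can also be written with the grouping spelled out, which avoids relying on summarize() silently dropping a grouping level. This is only a sketch: the data frame is rebuilt from the question's vectors, and the column names count and proportion follow the answer above.

```r
library(dplyr)

# Data rebuilt from the question (names omitted, they are not needed here)
country_analysis <- data.frame(
  country = c("USA", "USA", "Canada", "Dominican Republic", "Panama",
              "Dominican Republic", "Canada", "USA", "Dominican Republic", "Mexico"),
  year = rep(c("2016", "2015"), each = 5)
)

result <- country_analysis %>%
  count(year, country, name = "count") %>%   # rows per year/country pair
  group_by(year) %>%                         # proportions are computed within a year
  mutate(proportion = count / sum(count)) %>%
  ungroup()

result
```

Grouping by year explicitly before mutate() makes it clear which total each count is divided by.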

Related

Is there a way to count repeated observations using the summarize function in R?

I'm working with a data set that contains CustomerID, Sales_Rep, Product, and year columns. The problem I have with this dataset is that there is no unique Transaction Number. The data looks like this:
CustomerID Sales Rep Product Year
301978 Richard Grayson Product A 2017
302151 Maurin Thompkins Product B 2018
301962 Wallace West Product C 2019
301978 Richard Grayson Product B 2018
402152 Maurin Thompkins Product A 2017
501967 Wallace West Product B 2017
301978 Richard Grayson Product B 2018
What I'm trying to do is count how many transactions each Sales Rep made per year. Since there is no transaction number, I want to count the Customer IDs that appear for each Sales Rep in each year, including repeated Customer IDs, and compile the result into one data frame called "Count". I tried using the following functions in R:
Count <- Sales_Data %>%
group_by(Sales_Rep, year) %>%
summarize(count(CustomerID))
but I get this error:
Error: Problem with `summarise()` input `..1`.
i `..1 = count(PatientID)`.
x no applicable method for 'count' applied to an object of class "c('integer', 'numeric')"
The result I want to produce is this:
Sales Rep 2017 2018 2019
Richard Grayson 1 2
Maurin Thompkins 1 1
Wallace West 1 1
Can anybody help me?
There is no need to group and summarise; the function count does both in one step. Then reshape to wide format.
Sales_Data <- read.table(text = "
CustomerID 'Sales Rep' Product Year
301978 'Richard Grayson' 'Product A' 2017
302151 'Maurin Thompkins' 'Product B' 2018
301962 'Wallace West' 'Product C' 2019
301978 'Richard Grayson' 'Product B' 2018
402152 'Maurin Thompkins' 'Product A' 2017
501967 'Wallace West' 'Product B' 2017
301978 'Richard Grayson' 'Product B' 2018
", header = TRUE, check.names = FALSE)
suppressPackageStartupMessages({
library(dplyr)
library(tidyr)
})
Sales_Data %>% count(CustomerID)
#> CustomerID n
#> 1 301962 1
#> 2 301978 3
#> 3 302151 1
#> 4 402152 1
#> 5 501967 1
Sales_Data %>%
count(`Sales Rep`, Year) %>%
pivot_wider(id_cols = `Sales Rep`, names_from = Year, values_from = n)
#> # A tibble: 3 x 4
#> `Sales Rep` `2017` `2018` `2019`
#> <chr> <int> <int> <int>
#> 1 Maurin Thompkins 1 1 NA
#> 2 Richard Grayson 1 2 NA
#> 3 Wallace West 1 NA 1
Created on 2022-04-03 by the reprex package (v2.0.1)
Edit
To have the output column 'Sales Rep' in the same order as in the input data, coerce to factor setting the levels attribute to that original order. This is taken care of by unique. After pivoting, 'Sales Rep' can be coerced back to character, if needed. I have omitted this final step in the code that follows.
Sales_Data %>%
mutate(`Sales Rep` = factor(`Sales Rep`, levels = unique(`Sales Rep`))) %>%
count(`Sales Rep`, Year) %>%
pivot_wider(id_cols = `Sales Rep`, names_from = Year, values_from = n)
#> # A tibble: 3 x 4
#> `Sales Rep` `2017` `2018` `2019`
#> <fct> <int> <int> <int>
#> 1 Richard Grayson 1 2 NA
#> 2 Maurin Thompkins 1 1 NA
#> 3 Wallace West 1 NA 1
Created on 2022-04-05 by the reprex package (v2.0.1)
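The error in the question comes from calling count() on a column inside summarise(): count() expects a data frame, while n() is the helper that counts rows within a group. A minimal sketch of the original group-and-summarise attempt with that fix is below. It also assumes zeros are preferred over NA for missing rep/year pairs, which the question does not state, and fills them through the values_fill argument of pivot_wider.

```r
library(dplyr)
library(tidyr)

# Toy data mirroring the question; only the columns needed for counting
Sales_Data <- data.frame(
  `Sales Rep` = c("Richard Grayson", "Maurin Thompkins", "Wallace West",
                  "Richard Grayson", "Maurin Thompkins", "Wallace West",
                  "Richard Grayson"),
  Year = c(2017, 2018, 2019, 2018, 2017, 2017, 2018),
  check.names = FALSE
)

counts_wide <- Sales_Data %>%
  group_by(`Sales Rep`, Year) %>%
  summarise(n = n(), .groups = "drop") %>%   # n() counts rows per group
  pivot_wider(names_from = Year, values_from = n,
              values_fill = 0L)              # assumption: show 0 instead of NA

counts_wide
```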

How can I get a ggplot2 bar chart to display the actual values on the Y axis?

When I plot my bar chart, it shows values on the Y axis that I don't understand. How can I get the bar chart to use the actual values?
#Here is the code for my graph
stock %>%
#Tidy data to be handled correctly
group_by(year) %>%
filter(year == "2017") %>%
pivot_longer(bio_sus:bio_notsus) %>%
mutate(value2 = ifelse(name=="bio_sus",-1*value, value)) %>%
#make the graph
ggplot(aes(ocean_whole, value2/100, fill=name)) +
geom_bar(stat = "identity")
The bar chart shows values between -2.5 and 2.5, while my value2 values range between -100 and 100.
ocean_sub code year ocean_whole name value value2
<chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
1 Eastern Central Atlantic NA 2017 atlantic bio_sus 57.1 -57.1
2 Eastern Central Atlantic NA 2017 atlantic bio_notsus 42.9 42.9
3 Eastern Central Pacific NA 2017 pacific bio_sus 86.7 -86.7
4 Eastern Central Pacific NA 2017 pacific bio_notsus 13.3 13.3
5 Eastern Indian Ocean NA 2017 indian bio_sus 68.6 -68.6
How can I get the chart to display the actual values?
#My code is from TidyTuesdays Global seafood:
stock <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-12/fish-stocks-within-sustainable-levels.csv')
#transformed in the following way
oceans <- c("pacific", "atlantic", "indian", "mediterranean")
lu <- stack(sapply(oceans, grep, x = stock$entity, ignore.case = TRUE))
stock$oceans <- stock$entity
stock$oceans[lu$values] <- as.character(lu$ind)
stock %>%
group_by(oceans) %>%
summarise(across(matches("^share"), sum))
colnames(stock) <- (c("ocean_sub", "code", "year", "bio_sus", "bio_notsus", "ocean_whole"))
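Your tibble contains multiple rows for each value of ocean_whole before you hand it to ggplot(). With geom_bar(stat = "identity") those rows are stacked, so each bar shows the sum of their values, and these sums are the unexpected numbers on the Y axis. Check:
library(dplyr)
library(tidyr)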
stock %>%
group_by(year) %>%
filter(year == "2017") %>%
pivot_longer(bio_sus:bio_notsus) %>%
mutate(value2 = ifelse(name=="bio_sus",-1*value, value)) %>%
group_by(ocean_whole, name) %>%
summarise(sum(value2))
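One way to act on this is to reduce the data to a single row per ocean_whole and name before plotting. The sketch below uses a small made-up data frame (toy, with invented second-row values) and averages the rows; whether a mean is the right summary for these percentages is an assumption and not something the answer states.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data: two sub-regions that share the same ocean_whole value
toy <- data.frame(
  ocean_whole = c("atlantic", "atlantic", "atlantic", "atlantic"),
  name = c("bio_sus", "bio_notsus", "bio_sus", "bio_notsus"),
  value2 = c(-57.1, 42.9, -60.0, 40.0)
)

# Collapse to one row per bar segment so nothing is stacked and summed
averaged <- toy %>%
  group_by(ocean_whole, name) %>%
  summarise(value2 = mean(value2), .groups = "drop")

p <- ggplot(averaged, aes(ocean_whole, value2, fill = name)) +
  geom_col()   # geom_col() is geom_bar(stat = "identity")
```

With one row per segment, the bar heights match the values in the data frame.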

How to modify data frame in R based on one unique column

I have a data frame that looks like this.
Data
Denmark MG301
Denmark MG302
Australia MG301
Australia MG302
Sweden MG100
Sweden MG120
I need to make a new data frame based on the unique values of the 2nd column, removing the values that are repeated in Denmark. The result should look like this:
Data
Australia MG301
Australia MG302
Sweden MG100
Sweden MG120
Regards
Update after clarification:
This code keeps the first row for each distinct value of the second column (code):
distinct(df, code, .keep_all = TRUE)
Output:
1 Denmark MG301
2 Australia MG302
3 Sweden MG100
4 Sweden MG120
First answer:
I am not quite sure. But it gives the desired output:
df %>%
filter(country != "Denmark")
Output:
country code
<chr> <chr>
1 Australia MG301
2 Australia MG302
3 Sweden MG100
4 Sweden MG120
data:
df<- tribble(
~country, ~code,
"Denmark", "MG301",
"Denmark", "MG301",
"Australia", "MG301",
"Australia", "MG302",
"Sweden", "MG100",
"Sweden", "MG120")
In base R, the following code removes all rows with "Denmark" in the first column, as well as any values of the 2nd column that are duplicated within a group of the 1st column.
i <- df1$V1 != "Denmark"
j <- as.logical(ave(df1$V2, df1$V1, FUN = duplicated))
df1[i & !j, ]
# V1 V2
#3 Australia MG301
#4 Australia MG302
#5 Sweden MG100
#6 Sweden MG120
Do you just want the distinct rows? Then this may help:
df <- data.frame(A = c("denmark", "denmark", "Australia", "Australia", "Sweden", "Sweden"), B = c("MG301","MG302","MG301","MG302","MG100","MG100"))
df %>% distinct()
A B
1 denmark MG301
2 denmark MG302
3 Australia MG301
4 Australia MG302
5 Sweden MG100
Or do you want this?
df %>%
group_by(B) %>%
dplyr::summarise(A = first(A))
B A
* <chr> <chr>
1 MG100 Sweden
2 MG301 denmark
3 MG302 denmark
Use duplicated together with the ! (negation) operator to remove rows whose value in that column is duplicated.
To show a more complicated case, I am adding one Denmark row whose code is not duplicated and which should therefore not be filtered out.
df<- tribble(
~country, ~code,
"Denmark", "MG301",
"Denmark", "MG302",
'Denmark', "MG303",
"Australia", "MG301",
"Australia", "MG302",
"Sweden", "MG100",
"Sweden", "MG120")
# A tibble: 7 x 2
country code
<chr> <chr>
1 Denmark MG301
2 Denmark MG302
3 Denmark MG303
4 Australia MG301
5 Australia MG302
6 Sweden MG100
7 Sweden MG120
df %>%
mutate(d = duplicated(code)) %>%
group_by(code) %>%
mutate(d = sum(d)) %>% ungroup() %>%
filter(!(d > 0 & country == 'Denmark'))
# A tibble: 5 x 3
country code d
<chr> <chr> <int>
1 Denmark MG303 0
2 Australia MG301 1
3 Australia MG302 1
4 Sweden MG100 0
5 Sweden MG120 0
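The same rule (drop a Denmark row only when its code also appears elsewhere) can be sketched more compactly with a grouped filter, using n() for the group size. This assumes, as the answer above does, that "duplicated" means the code occurs more than once anywhere in the table.

```r
library(dplyr)

df <- data.frame(
  country = c("Denmark", "Denmark", "Denmark", "Australia", "Australia",
              "Sweden", "Sweden"),
  code = c("MG301", "MG302", "MG303", "MG301", "MG302", "MG100", "MG120"),
  stringsAsFactors = FALSE
)

kept <- df %>%
  group_by(code) %>%
  filter(!(n() > 1 & country == "Denmark")) %>%  # drop Denmark only for shared codes
  ungroup()

kept
```

This keeps Denmark MG303 and the Australia and Sweden rows, and it avoids the helper column d.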

Changing observation's name using dplyr

Suppose I have this dataset:
Variable <- c("GDP")
Country <- c("Brazil", "Chile")
df <- data.frame(Variable, Country)
I want to change the GDP to "Country_observation" GDP, i.e., Brazil GDP and Chile GDP.
I have a much larger dataset and I've been trying to do this by using
df %>% mutate(Variable = replace(Variable, Variable == "GDP", paste(Country, "GDP")))
However, it will print the first observation of variable "Country" for every observation in "Variable" that meets the condition. Is there any way to make paste() use the value of Country on the row it is applying to?
I've tried to use rowwise() and it did not work. I've tried the following code as well and encountered the same problem
df %>% mutate(Country = ifelse(Country == "Chile", replace(Variable, Variable == "GDP",
paste(Country, "GDP")), Variable))
Thanks to everyone!
EDIT
I can't simply use unite because I still need the variable Country. I also had several other observations whose names I needed to change, so the workaround I found was:
df %>%
mutate(Variable2 = ifelse(Variable == "GDP", paste0(Country, " ", Variable), Variable)) %>%
mutate(Variable2 = replace(Variable2, Variable2 == "CR", "Country Risk")) %>%
mutate(Variable2 = replace(Variable2, Variable2 == "EXR", "Exchange Rate")) %>%
mutate(Variable2 = replace(Variable2, Variable2 == "INTR", "Interest Rate")) %>%
select(-Variable) %>%
select(Horizon, Variable = Variable2, Response, Low, Up, Shock, Country, Status)
EDIT 2
My desired output was
Horizon Variable Response Shock Country
1 Brazil GDP 0.0037 PCOM Brazil
2 Brazil GDP 0.0060 PCOM Brazil
3 Brazil GDP 0.0053 PCOM Brazil
4 Brazil GDP 0.0033 PCOM Brazil
5 Brazil GDP 0.0021 PCOM Brazil
6 Brazil GDP 0.0020 PCOM Brazil
This example should help:
library(tidyr)
library(dplyr)
Variable <- c("GDP")
Country <- c("Brazil", "Chile")
value = c(5,10)
df <- data.frame(Variable, Country, value)
# original data
df
# Variable Country value
# 1 GDP Brazil 5
# 2 GDP Chile 10
# update
df %>% unite(NewGDP, Variable, Country)
# NewGDP value
# 1 GDP_Brazil 5
# 2 GDP_Chile 10
If you want to use paste you can do:
df %>% mutate(NewGDP = paste0(Country,"_",Variable))
# Variable Country value NewGDP
# 1 GDP Brazil 5 Brazil_GDP
# 2 GDP Chile 10 Chile_GDP
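Because paste() is vectorised, a conditional that is also evaluated element by element (ifelse() or case_when()) picks the Country value from the same row, whereas replace() fills the matching positions with the replacement values in order. A sketch with case_when() that keeps the Country column and handles the other renames in one step is below. The toy rows are made up, and the abbreviations CR, EXR and INTR are taken from the question's edit.

```r
library(dplyr)

# Hypothetical rows mixing GDP with other variables
df <- data.frame(
  Variable = c("GDP", "CR", "GDP", "EXR"),
  Country = c("Brazil", "Brazil", "Chile", "Chile"),
  stringsAsFactors = FALSE
)

renamed <- df %>%
  mutate(Variable = case_when(
    Variable == "GDP"  ~ paste(Country, "GDP"),  # uses the Country on the same row
    Variable == "CR"   ~ "Country Risk",
    Variable == "EXR"  ~ "Exchange Rate",
    Variable == "INTR" ~ "Interest Rate",
    TRUE               ~ Variable                # leave everything else unchanged
  ))

renamed
```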

Divide case by population

In the table2 dataset from the tidyr package, we have:
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
How do I code this so that I can divide the count for type cases by the count for type population and then multiply by 10000? (Yes, this is a question from R for Data Science by Hadley Wickham.)
I've thought of:
sum_1 <- vector()
for (i,j in 1:nrow(table2)) {
if (i %% 2 != 0) {
sum_1 <- (table2[i] / table2[j]) * 10000
Assuming that there are only two values of 'type' for each 'country' and 'year', group by 'country' and 'year', arrange by 'type' (in case the order is different), and divide the first value of 'count' by the last value of 'count' to create 'newcol':
library(dplyr)
table2 %>%
group_by(country, year) %>%
arrange(country, year, type) %>%
mutate(newcol = 10000*first(count)/last(count))
If we need only a summarised output, replace mutate with summarise.
If there are other values in 'type' in addition to 'cases' and 'population', subset 'count' with a logical index:
table2 %>%
group_by(country, year) %>%
mutate(newcol = 10000*count[type=="cases"]/count[type=="population"])
This also assumes that there is only a single 'cases' row and a single 'population' row for each 'country' and 'year'.
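An alternative that does not depend on row order is to spread 'type' into separate columns first and then divide. The sketch below uses pivot_wider from tidyr; the column name rate is an arbitrary choice for illustration.

```r
library(dplyr)
library(tidyr)

# table2 ships with tidyr; each country/year has one 'cases' and one 'population' row
rates <- table2 %>%
  pivot_wider(names_from = type, values_from = count) %>%  # columns: cases, population
  mutate(rate = cases / population * 10000)                # cases per 10,000 people

rates
```

This gives one row per country and year, with cases, population, and the rate side by side.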
