Combining rows and generating category counts - r

I want to first combine rows that share an attribute into one (for example, one row for each City/Year), and then count each category within those combined rows.
For example, with this as the original data:
City Year Type of Death
NYC 1995 Homicide
NYC 1996 Homicide
NYC 1996 Suicide
LA 1995 Suicide
LA 1995 Homicide
LA 1995 Suicide
I want to be able to produce something like this:
City Year n_Total n_Homicides n_Suicides
NYC 1995 1 1 0
NYC 1996 2 1 1
LA 1995 3 1 2
I've tried something like the below, but it only gives me the n_Total and doesn't take into account the splits for n_Homicides and n_Suicides:
library(dplyr)
total_deaths <- data %>%
group_by(city, year)%>%
summarize(n_Total= n())

You can do this with pivot_wider(), counting the rows per type with values_fn = length:
library(tidyverse, warn.conflicts = F)
df <- read.table(header = T, text = 'City Year TypeofDeath
NYC 1995 Homicide
NYC 1996 Homicide
NYC 1996 Suicide
LA 1995 Suicide
LA 1995 Homicide
LA 1995 Suicide')
df %>%
pivot_wider(names_from = TypeofDeath, values_fn = length, values_from = TypeofDeath, values_fill = 0, names_prefix = 'n_') %>%
mutate(n_total = rowSums(select(cur_data(), starts_with('n_'))))
#> # A tibble: 3 x 5
#> City Year n_Homicide n_Suicide n_total
#> <chr> <int> <int> <int> <dbl>
#> 1 NYC 1995 1 0 1
#> 2 NYC 1996 1 1 2
#> 3 LA 1995 1 2 3
Created on 2021-07-05 by the reprex package (v2.0.0)
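A variant of the same idea, as a minimal sketch: count first, then pivot. It assumes dplyr 1.1.0 or later, where pick() is the replacement for the now-deprecated cur_data(); the object name counts is only illustrative.

```r
library(dplyr)
library(tidyr)

df <- read.table(header = TRUE, text = 'City Year TypeofDeath
NYC 1995 Homicide
NYC 1996 Homicide
NYC 1996 Suicide
LA 1995 Suicide
LA 1995 Homicide
LA 1995 Suicide')

counts <- df %>%
  # one row per City/Year/type with its frequency in column n
  count(City, Year, TypeofDeath) %>%
  # spread the types into n_<type> columns, 0 where a type is absent
  pivot_wider(names_from = TypeofDeath, values_from = n,
              values_fill = 0, names_prefix = "n_") %>%
  # row total over the n_ columns that exist at this point
  mutate(n_Total = rowSums(pick(starts_with("n_"))))

counts
```

Because count() sorts its output, the rows come back ordered by City and Year rather than in input order.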

If you don't have too many types of death, then something simple (albeit a little "manual") like this might have some appeal.
library(dplyr, warn.conflicts = FALSE)
df <- read.table(header = TRUE, text = 'City Year TypeofDeath
NYC 1995 Homicide
NYC 1996 Homicide
NYC 1996 Suicide
LA 1995 Suicide
LA 1995 Homicide
LA 1995 Suicide')
df %>%
group_by(City, Year) %>%
summarize(n_Total = n(),
n_Suicide = sum(TypeofDeath == "Suicide"),
n_Homicide = sum(TypeofDeath == "Homicide"))
#> `summarise()` has grouped output by 'City'. You can override using the `.groups` argument.
#> # A tibble: 3 x 5
#> # Groups: City [2]
#> City Year n_Total n_Suicide n_Homicide
#> <chr> <int> <int> <int> <int>
#> 1 LA 1995 3 2 1
#> 2 NYC 1995 1 0 1
#> 3 NYC 1996 2 1 1
Created on 2021-07-05 by the reprex package (v2.0.0)
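The message in the output above mentions the `.groups` argument. As a small sketch (the name deaths is just illustrative), passing .groups = "drop" returns an ungrouped result and silences the message:

```r
library(dplyr)

df <- read.table(header = TRUE, text = 'City Year TypeofDeath
NYC 1995 Homicide
NYC 1996 Homicide
NYC 1996 Suicide
LA 1995 Suicide
LA 1995 Homicide
LA 1995 Suicide')

deaths <- df %>%
  group_by(City, Year) %>%
  summarise(n_Total    = n(),
            n_Suicide  = sum(TypeofDeath == "Suicide"),
            n_Homicide = sum(TypeofDeath == "Homicide"),
            .groups = "drop")  # drop all grouping from the result

deaths
```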

You can first dummify your factor variable using the fastDummies package, then summarise(). This is a more general and versatile approach that works with any number of unique types of death.
If you only have two types of death and will settle for a simpler (though more "manual") approach, you can use the other suggestions with summarise(x = ..., y = ..., total = n()).
library(dplyr)
library(fastDummies)
df%>%fastDummies::dummy_cols('TypeofDeath', remove_selected_columns = TRUE)%>%
group_by(City, Year)%>%
summarise(across(contains('Type'), sum),
total_deaths=n())
# A tibble: 3 x 5
# Groups: City [2]
City Year TypeofDeath_Homicide TypeofDeath_Suicide total_deaths
<chr> <int> <int> <int> <int>
1 LA 1995 1 2 3
2 NYC 1995 1 0 1
3 NYC 1996 1 1 2


dplyr arrange is not working while order is fine

I am trying to find the 10 largest investors in a country, but I get confusing results using arrange() in dplyr compared with order() in base R.
head(fdi_partner)
gives the following result:
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Total registered capital (Mill. USD)(*)`
<chr> <chr> <chr>
1 TOTAL 1818 38854.3
2 Singapore 231 11358.66
3 Korea Rep.of 377 7679.9
4 Japan 204 4325.79
5 Netherlands 24 4209.64
6 China, PR 216 3001.79
and
fdi_partner %>%
rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
mutate_at(c("Number of projects", "Registered capital"), as.numeric) %>%
arrange("Number of projects") %>%
head()
gives almost the same result:
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Singapore 231 11359.
3 Korea Rep.of 377 7680.
4 Japan 204 4326.
5 Netherlands 24 4210.
6 China, PR 216 3002.
while the following code works fine with base R:
head(fdi_partner)
fdi_numeric <- fdi_partner %>%
rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
mutate_at(c("Number of projects", "Registered capital"), as.numeric)
head(fdi_numeric[order(fdi_numeric$"Number of projects", decreasing = TRUE), ], n=11)
which gives
# A tibble: 11 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Korea Rep.of 377 7680.
3 Singapore 231 11359.
4 China, PR 216 3002.
5 Japan 204 4326.
6 Hong Kong SAR (China) 132 2365.
7 United States 83 783.
8 Taiwan 66 1464.
9 United Kingdom 50 331.
10 F.R Germany 37 131.
11 Thailand 36 370.
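Can anybody help explain what I'm doing wrong?
arrange(), like most dplyr verbs, expects unquoted column names. A quoted string such as "Number of projects" is treated as a constant value rather than a column, so every row gets the same sort key and the order does not change. If your variable name has a space in it, you must wrap it in backticks: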
library(dplyr)
test <- data.frame(`My variable` = c(3, 1, 2), var2 = c(1, 1, 1), check.names = FALSE)
test
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Your code (doesn't work)
test %>%
arrange("My variable")
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Solution
test %>%
arrange(`My variable`)
#> My variable var2
#> 1 1 1
#> 2 2 1
#> 3 3 1
Created on 2023-01-05 with reprex v2.0.2
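Applied to the question, a minimal sketch could look like the following. The data frame fdi is a small stand-in built from four of the rows shown in the question, and the variable col is hypothetical; desc() gives the descending order that the base R call used, and the .data pronoun is one way to sort by a column whose name is held in a string.

```r
library(dplyr)

# Stand-in for fdi_partner after conversion to numeric (subset of the question's rows)
fdi <- data.frame(
  `Main counterparts`  = c("Singapore", "Korea Rep.of", "Japan", "Netherlands"),
  `Number of projects` = c(231, 377, 204, 24),
  check.names = FALSE
)

# Backticks for a non-syntactic name, desc() for largest first
top <- fdi %>%
  arrange(desc(`Number of projects`)) %>%
  head(2)

# Same sort when the column name is stored in a string
col <- "Number of projects"
top_by_string <- fdi %>%
  arrange(desc(.data[[col]])) %>%
  head(2)

top
```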

Is there a way to count repeated observations using the summarize function in R?

I'm working with a data set that contains CustomerID, Sales_Rep, Product, and year columns. The problem I have with this dataset is that there is no unique Transaction Number. The data looks like this:
CustomerID Sales Rep Product Year
301978 Richard Grayson Product A 2017
302151 Maurin Thompkins Product B 2018
301962 Wallace West Product C 2019
301978 Richard Grayson Product B 2018
402152 Maurin Thompkins Product A 2017
501967 Wallace West Product B 2017
301978 Richard Grayson Product B 2018
What I'm trying to do is count how many transactions each Sales Rep made per year, by counting the Customer IDs that appear for each Sales Rep per year (repeated customer IDs included), and then compile the result into one data frame called "Count". I tried the following in R:
Count <- Sales_Data %>%
group_by(Sales_Rep, year) %>%
summarize(count(CustomerID))
but I get this error:
Error: Problem with `summarise()` input `..1`.
i `..1 = count(PatientID)`.
x no applicable method for 'count' applied to an object of class "c('integer', 'numeric')"
The result I want to produce is this:
Sales Rep 2017 2018 2019
Richard Grayson 1 2
Maurin Thompkins 1 1
Wallace West 1 1
Can anybody help me?
There is no need to group and summarise; the function count() does that in one step. Then reshape to wide format with pivot_wider().
Sales_Data <- read.table(text = "
CustomerID 'Sales Rep' Product Year
301978 'Richard Grayson' 'Product A' 2017
302151 'Maurin Thompkins' 'Product B' 2018
301962 'Wallace West' 'Product C' 2019
301978 'Richard Grayson' 'Product B' 2018
402152 'Maurin Thompkins' 'Product A' 2017
501967 'Wallace West' 'Product B' 2017
301978 'Richard Grayson' 'Product B' 2018
", header = TRUE, check.names = FALSE)
suppressPackageStartupMessages({
library(dplyr)
library(tidyr)
})
Sales_Data %>% count(CustomerID)
#> CustomerID n
#> 1 301962 1
#> 2 301978 3
#> 3 302151 1
#> 4 402152 1
#> 5 501967 1
Sales_Data %>%
count(`Sales Rep`, Year) %>%
pivot_wider(id_cols = `Sales Rep`, names_from = Year, values_from = n)
#> # A tibble: 3 x 4
#> `Sales Rep` `2017` `2018` `2019`
#> <chr> <int> <int> <int>
#> 1 Maurin Thompkins 1 1 NA
#> 2 Richard Grayson 1 2 NA
#> 3 Wallace West 1 NA 1
Created on 2022-04-03 by the reprex package (v2.0.1)
Edit
To keep the output column 'Sales Rep' in the same order as in the input data, coerce it to a factor whose levels follow that original order. unique() takes care of this, because it returns the values in order of first appearance. After pivoting, 'Sales Rep' can be coerced back to character, if needed. I have omitted this final step in the code that follows.
Sales_Data %>%
mutate(`Sales Rep` = factor(`Sales Rep`, levels = unique(`Sales Rep`))) %>%
count(`Sales Rep`, Year) %>%
pivot_wider(id_cols = `Sales Rep`, names_from = Year, values_from = n)
#> # A tibble: 3 x 4
#> `Sales Rep` `2017` `2018` `2019`
#> <fct> <int> <int> <int>
#> 1 Richard Grayson 1 2 NA
#> 2 Maurin Thompkins 1 1 NA
#> 3 Wallace West 1 NA 1
Created on 2022-04-05 by the reprex package (v2.0.1)
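If zeros are preferable to NA for rep/year combinations that have no transactions, pivot_wider() accepts a values_fill argument. A small sketch on the same data (the name wide is only illustrative):

```r
library(dplyr)
library(tidyr)

Sales_Data <- read.table(text = "
CustomerID 'Sales Rep' Product Year
301978 'Richard Grayson' 'Product A' 2017
302151 'Maurin Thompkins' 'Product B' 2018
301962 'Wallace West' 'Product C' 2019
301978 'Richard Grayson' 'Product B' 2018
402152 'Maurin Thompkins' 'Product A' 2017
501967 'Wallace West' 'Product B' 2017
301978 'Richard Grayson' 'Product B' 2018
", header = TRUE, check.names = FALSE)

wide <- Sales_Data %>%
  count(`Sales Rep`, Year) %>%
  # values_fill = 0 puts 0 in cells that have no matching row
  pivot_wider(names_from = Year, values_from = n, values_fill = 0)

wide
```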

Aggregation of the region's values in the dataset

df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
stringsAsFactors = FALSE)
I processed the dataset.
Can we find the day with the fewest deaths in the Asia region?
The important point is that the figure must be the sum of deaths across all countries in the Asia region; the days are then sorted by that sum to find the day.
Expected output:
date region death
2020/02/17 asia 6300 (asia region sum)
The values in this output are made up for illustration; they are not real.
Since these are cumulative cases and deaths, we need to difference the data.
library(dplyr)
df %>%
mutate(day = as.Date(day)) %>%
filter(region=="Asia") %>%
group_by(day) %>%
summarise(deaths=sum(death)) %>%
mutate(d=c(first(deaths),diff(deaths))) %>%
arrange(d)
# A tibble: 107 x 3
day deaths d
<date> <int> <int>
1 2020-01-23 18 1 # <- this day saw only 1 death in the whole of Asia
2 2020-01-29 133 2
3 2020-02-21 2249 3
4 2020-02-12 1118 5
5 2020-01-24 26 8
6 2020-02-23 2465 10
7 2020-01-26 56 14
8 2020-01-25 42 16
9 2020-01-22 17 17
10 2020-01-27 82 26
# ... with 97 more rows
So the second day of records saw the fewest deaths recorded (so far).
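To make the differencing step concrete without downloading anything, here is a minimal sketch on a toy data frame. The per-country split is invented for illustration (only the daily sums 17, 18 and 26 match the output above), and lag() is used in place of diff() to turn the cumulative totals into daily counts.

```r
library(dplyr)

# Hypothetical cumulative deaths for two countries over three days
toy <- data.frame(
  day    = as.Date(c("2020-01-22", "2020-01-22",
                     "2020-01-23", "2020-01-23",
                     "2020-01-24", "2020-01-24")),
  region = "Asia",
  death  = c(10, 7, 11, 7, 18, 8)
)

daily <- toy %>%
  filter(region == "Asia") %>%
  group_by(day) %>%
  summarise(deaths = sum(death)) %>%             # cumulative regional total per day
  mutate(d = deaths - lag(deaths, default = 0))  # new deaths on each day

# Day with the fewest new deaths
fewest <- daily %>% slice_min(d, n = 1)
fewest
```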
Using the dplyr package to process the data:
df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
stringsAsFactors = FALSE)
library(dplyr)
df_sum <- df %>% group_by(region,day) %>% # grouping by region and day
summarise(death=sum(death)) %>% # summing following the groups
filter(region=="Asia",death==min(death)) # keeping only minimum of Asia
Then you have :
> df_sum
# A tibble: 1 x 3
# Groups: region [1]
region day death
<fct> <fct> <int>
1 Asia 2020/01/22 17

Convert Panel Data to Long in R

My current data covers missiles between 1920 and 2018. The goal is to measure a nation’s ability to deploy missiles of different kinds for each year from 1920 to 2018. The problem is that the data has multiple observations per nation, and often per year. For instance, if a nation adopted an imported Air to Air missile in 1970 and then, in 1980, developed a domestically produced one that is both Air to Air and Air to Ground, that change needs to be reflected. The goal is to have a unique row/observation for each year for every nation. It is also assumed that if a nation can produce Air to Air missiles in, say, 1970, it can do so until 2018.
Current:
YearAcquired CountryCode CountryName Domestic AirtoAir
2014 670 Saudi Arabia 0 1
2017 670 Saudi Arabia 1 1
2016 2 United States 1 1
Desired:
YearAcquired CountryCode CountryName Domestic AirtoAir
2014 670 Saudi Arabia 0 1
2015 670 Saudi Arabia 0 1
2016 670 Saudi Arabia 0 1
2017 670 Saudi Arabia 1 1
2018 670 Saudi Arabia 1 1
2016 2 United States 0 1
2017 2 United States 0 1
2018 2 United States 0 1
Note: There are many entries, so I would like it to generate rows from 1920 to 2018 for every country, even if they are all zeroes. That is not necessary, but it would be a great bonus!
You can do this in several steps:
1. Create the combination of all years and countries (a CROSS JOIN in SQL).
2. LEFT JOIN these combinations with the available data.
3. Use a function like zoo::na.locf() to replace NA values with the last known ones per country.
The first step is common to both solutions:
df <- read.table(text = 'YearAcquired CountryCode CountryName Domestic AirtoAir
2014 670 "Saudi Arabia" 0 1
2017 670 "Saudi Arabia" 1 1
2016 2 "United States" 1 1', header = TRUE, stringsAsFactors = FALSE)
combinations <- merge(data.frame(YearAcquired = seq(1920, 2018, 1)),
unique(df[,2:3]), by = NULL)
For steps 2 and 3, here is a solution using dplyr:
library(dplyr)
library(zoo)
df <- left_join(combinations, df) %>%
group_by(CountryCode) %>%
mutate(Domestic = na.locf(Domestic, na.rm = FALSE),
AirtoAir = na.locf(AirtoAir, na.rm = FALSE))
And one solution using data.table:
library(data.table)
library(zoo)
setDT(df)
setDT(combinations)
df <- df[combinations, on = c("YearAcquired", "CountryCode", "CountryName")]
df <- df[, na.locf(.SD, na.rm = FALSE), by = "CountryCode"]
You could create a new data frame from the country names and codes available and left join it with your existing data. This gives you 1920 to 2018 for each country and code, leaving NAs where you don't have data; you can then replace them depending on how you want your data structured.
# df is your initial data frame
library(dplyr)
# one row per distinct country code/name pair
countries <- unique(df[, c("CountryCode", "CountryName")])
# every year from 1920 to 2018 for every country (a cross join)
new_df <- merge(data.frame(YearAcquired = 1920:2018), countries, by = NULL)
# attach the known observations; years without data stay NA
new_df <- left_join(new_df, df, by = c("YearAcquired", "CountryCode", "CountryName"))
Using tidyverse (dplyr and tidyr)...
If you only need to fill in internal years per country...
df <- read.table(header = TRUE, as.is = TRUE, text = "
YearAcquired countrycode CountryName Domestic AirtoAir
2014 670 'Saudi Arabia' 0 1
2017 670 'Saudi Arabia' 1 1
2016 2 'United States' 1 1
")
library(dplyr)
library(tidyr)
df %>%
group_by(countrycode) %>%
complete(YearAcquired = full_seq(YearAcquired, 1), countrycode, CountryName) %>%
arrange(countrycode, YearAcquired) %>%
fill(Domestic, AirtoAir)
#> # A tibble: 5 x 5
#> # Groups: countrycode [2]
#> YearAcquired countrycode CountryName Domestic AirtoAir
#> <dbl> <int> <chr> <int> <int>
#> 1 2016 2 United States 1 1
#> 2 2014 670 Saudi Arabia 0 1
#> 3 2015 670 Saudi Arabia 0 1
#> 4 2016 670 Saudi Arabia 0 1
#> 5 2017 670 Saudi Arabia 1 1
If you want to expand each country to all years found in the dataset...
df <- read.table(header = TRUE, as.is = TRUE, text = "
YearAcquired countrycode CountryName Domestic AirtoAir
2014 670 'Saudi Arabia' 0 1
2017 670 'Saudi Arabia' 1 1
2016 2 'United States' 1 1
")
library(dplyr)
library(tidyr)
df %>%
complete(YearAcquired = full_seq(YearAcquired, 1),
nesting(countrycode, CountryName)) %>%
group_by(countrycode) %>%
arrange(countrycode, YearAcquired) %>%
fill(Domestic, AirtoAir) %>%
mutate_at(vars(Domestic, AirtoAir), ~ if_else(is.na(.), 0L, .))
#> # A tibble: 8 x 5
#> # Groups: countrycode [2]
#> YearAcquired countrycode CountryName Domestic AirtoAir
#> <dbl> <int> <chr> <int> <int>
#> 1 2014 2 United States 0 0
#> 2 2015 2 United States 0 0
#> 3 2016 2 United States 1 1
#> 4 2017 2 United States 1 1
#> 5 2014 670 Saudi Arabia 0 1
#> 6 2015 670 Saudi Arabia 0 1
#> 7 2016 670 Saudi Arabia 0 1
#> 8 2017 670 Saudi Arabia 1 1
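The question also asks for every year from 1920 to 2018 per country. As a sketch under that assumption, the year range can be passed to complete() directly instead of being derived from the data; the name panel is illustrative, and coalesce() is swapped in here for the if_else() step.

```r
library(dplyr)
library(tidyr)

df <- read.table(header = TRUE, as.is = TRUE, text = "
YearAcquired countrycode CountryName Domestic AirtoAir
2014 670 'Saudi Arabia' 0 1
2017 670 'Saudi Arabia' 1 1
2016 2 'United States' 1 1
")

panel <- df %>%
  # every year 1920-2018 for each country code/name pair present in the data
  complete(YearAcquired = 1920:2018, nesting(countrycode, CountryName)) %>%
  arrange(countrycode, YearAcquired) %>%
  group_by(countrycode) %>%
  fill(Domestic, AirtoAir) %>%   # carry the last known value forward within a country
  ungroup() %>%
  # years before the first acquisition have nothing to carry forward: set them to 0
  mutate(across(c(Domestic, AirtoAir), ~ coalesce(.x, 0L)))

panel %>% filter(YearAcquired >= 2014)
```

This yields 99 rows per country (198 in total for the sample data).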

Count origin-destination relationships (without direct) with R

I have an origin-destination table like this.
library(dplyr)
set.seed(1983)
namevec <- c('Portugal', 'Romania', 'Nigeria', 'Peru', 'Texas', 'New Jersey', 'Colorado', 'Minnesota')
## Create OD pairs
df <- tibble(origins = sample(namevec, size = 100, replace = TRUE),
destinations = sample(namevec, size = 100, replace = TRUE))
Question
I got stuck counting the relationships for each origin-destination pair (ignoring direction).
How can I get output in which Colorado-Minnesota and Minnesota-Colorado are treated as one group?
What I have tried so far:
## Counts for each OD-pairs
df %>%
group_by(origins, destinations) %>%
summarize(counts = n()) %>%
ungroup() %>%
arrange(desc(counts))
Source: local data frame [48 x 3]
origins destinations counts
(chr) (chr) (int)
1 Nigeria Colorado 5
2 Colorado Portugal 4
3 New Jersey Minnesota 4
4 New Jersey New Jersey 4
5 Peru Nigeria 4
6 Peru Peru 4
7 Romania Texas 4
8 Texas Nigeria 4
9 Minnesota Minnesota 3
10 Nigeria Portugal 3
.. ... ... ...
One way is to combine the sorted combination of the two locations into a single field. Summarizing on that will remove your two original columns, so you'll need to join them back in.
paired <- df %>%
mutate(
orderedpair = paste(pmin(origins, destinations), pmax(origins, destinations), sep = "::")
)
paired
# # A tibble: 100 × 3
# origins destinations orderedpair
# <chr> <chr> <chr>
# 1 Peru Colorado Colorado::Peru
# 2 Romania Portugal Portugal::Romania
# 3 Romania Colorado Colorado::Romania
# 4 New Jersey Minnesota Minnesota::New Jersey
# 5 Minnesota Texas Minnesota::Texas
# 6 Romania Texas Romania::Texas
# 7 Peru Peru Peru::Peru
# 8 Romania Nigeria Nigeria::Romania
# 9 Portugal Minnesota Minnesota::Portugal
# 10 Nigeria Colorado Colorado::Nigeria
# # ... with 90 more rows
left_join(
paired,
group_by(paired, orderedpair) %>% count(),
by = "orderedpair"
) %>%
select(-orderedpair) %>%
distinct() %>%
arrange(desc(n))
# # A tibble: 48 × 3
# origins destinations n
# <chr> <chr> <int>
# 1 Romania Portugal 6
# 2 New Jersey Minnesota 6
# 3 Portugal Romania 6
# 4 Minnesota New Jersey 6
# 5 Romania Texas 5
# 6 Nigeria Colorado 5
# 7 Texas Nigeria 5
# 8 Texas Romania 5
# 9 Nigeria Texas 5
# 10 Peru Peru 4
# # ... with 38 more rows
(The only reason I used "::" as the separator is in the unlikely event you need to parse orderedpair; using the default " " (space) won't work with (e.g.) "New Jersey" in the mix.)
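If one row per unordered pair is enough (rather than the count joined back onto both directions), a shorter sketch is to keep the sorted endpoints in two columns and count those. The data frame od and the column names a and b are hypothetical.

```r
library(dplyr)

# Small hypothetical OD table
od <- data.frame(
  origins      = c("Colorado", "Minnesota", "Peru", "Colorado"),
  destinations = c("Minnesota", "Colorado", "Peru", "Peru")
)

pairs <- od %>%
  mutate(a = pmin(origins, destinations),    # alphabetically first endpoint
         b = pmax(origins, destinations)) %>% # alphabetically last endpoint
  count(a, b, sort = TRUE)                    # one row per unordered pair

pairs
```

Keeping the endpoints in separate columns also avoids having to choose a separator.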
