Combining multiple observations into one column in R

I'm working with a Covid19 dataset in which each row contains Covid data (tests, positives, negatives, deaths, etc...) for a particular day in a particular country. This means there are multiple rows for each day since every country's data gets its own row. I'm trying to generalize the data to only include one row per day per continent. Is there a simple way to sum all columns where the date is the same?
For example, I'd want to go from a table like this...
Date        Continent  Country         Positives
2020-02-05  Europe     United Kingdom  10
2020-02-05  Europe     Poland          5
2020-02-05  Europe     Sweden          0
2020-02-06  Europe     United Kingdom  12
2020-02-06  Europe     Poland          7
2020-02-06  Europe     Sweden          1
to one like this...
Date        Continent  Positives
2020-02-05  Europe     15
2020-02-06  Europe     20
The closest I've gotten is
covid19EU <- covid19 %>% filter(Continent_Name == "Europe") %>% group_by(Date) %>% summarise_all(max)
but this returns the highest value instead of the sum over all observations with the same date, i.e.
Date        Continent  Country         Positives
2020-02-05  Europe     United Kingdom  10
2020-02-06  Europe     United Kingdom  12

library(dplyr)

covid19 %>%
  group_by(Date, Continent) %>%
  summarize(across(where(is.numeric), sum))
Result
  Date       Continent Positives Something_else
  <chr>      <chr>         <int>          <int>
1 2020-02-05 Europe           15              6
2 2020-02-06 Europe           20             15
If there's only one column of data to sum, you can use the count(wt = ...) shortcut:
covid19 %>%
  count(Date, Continent, wt = Positives, name = "Positives")
# Date Continent Positives
#1 2020-02-05 Europe 15
#2 2020-02-06 Europe 20
Example data with another column of numeric data to sum (possibly implied by OP question "Is there a simple way to sum all columns where the date is the same?")
covid19 <- data.frame(
  stringsAsFactors = FALSE,
  Date = c("2020-02-05", "2020-02-05", "2020-02-05",
           "2020-02-06", "2020-02-06", "2020-02-06"),  # NOTE UPDATED DATES VS OP
  Continent = c("Europe", "Europe", "Europe",
                "Europe", "Europe", "Europe"),
  Country = c("United Kingdom", "Poland", "Sweden",
              "United Kingdom", "Poland", "Sweden"),
  Positives = c(10L, 5L, 0L, 12L, 7L, 1L),
  Something_else = 1:6
)

In base R:
aggregate(Positives ~ Date + Continent, covid19, sum)
        Date Continent Positives
1 2020-02-05    Europe        15
2 2020-02-06    Europe        20
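If you want to sum every numeric column at once (as the question asks) rather than just Positives, the formula interface also accepts cbind() on the left-hand side. A minimal sketch using the example data above, which has a second numeric column Something_else:
aggregate(cbind(Positives, Something_else) ~ Date + Continent, covid19, sum)
#         Date Continent Positives Something_else
# 1 2020-02-05    Europe        15              6
# 2 2020-02-06    Europe        20             15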

Related

How to fill in time series data into a data frame?

I am working with the following time series data:
Weeks <- c("1995-01", "1995-02", "1995-03", "1995-04", "1995-06", "1995-08", "1995-10", "1995-15", "1995-16", "1995-24", "1995-32")
Country <- c("United States")
Values <- sample(seq(1,500,1), length(Weeks), replace = T)
df <- data.frame(Weeks,Country, Values)
Weeks Country Values
1 1995-01 United States 193
2 1995-02 United States 183
3 1995-03 United States 402
4 1995-04 United States 75
5 1995-06 United States 402
6 1995-08 United States 436
7 1995-10 United States 97
8 1995-15 United States 445
9 1995-16 United States 336
10 1995-24 United States 31
11 1995-32 United States 413
It is structured according to the year and the week number within that year (column 1). Notice how some weeks are omitted (as a result of the aggregation function). For example, 1995-05 is not included. How can I include the omitted rows in the data, add the appropriate country name, and assign them a value of 0?
Thank you for your help!
Separate the year and week values into different columns. For each Country and Years combination, complete the missing weeks and assign Values to 0. Finally, unite the year and week columns to get the data back into the same format as the original.
library(dplyr)
library(tidyr)
df %>%
  separate(Weeks, c('Years', 'Weeks'), sep = '-', convert = TRUE) %>%
  group_by(Country, Years) %>%
  complete(Weeks = min(Weeks):max(Weeks), fill = list(Values = 0)) %>%
  ungroup() %>%
  mutate(Weeks = sprintf('%02d', Weeks)) %>%
  unite(Weeks, Years, Weeks, sep = '-')
# Country Weeks Values
# <chr> <chr> <dbl>
# 1 United States 1995-01 354
# 2 United States 1995-02 395
# 3 United States 1995-03 408
# 4 United States 1995-04 143
# 5 United States 1995-05 0
# 6 United States 1995-06 481
# 7 United States 1995-07 0
# 8 United States 1995-08 49
# 9 United States 1995-09 0
#10 United States 1995-10 229
# … with 22 more rows

Aggregation of the region's values in the dataset

df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
               stringsAsFactors = FALSE)
I processed the dataset.
Can we find the day with the fewest deaths in the Asia region?
The important thing here is to sum the deaths of all countries in the Asia region, and then sort to find that day.
Expected output:
date        region  death
2020/02/17  asia    6300   (sum for the Asia region)
The numbers in this output are only examples; they are not real data.
Since these are cumulative cases and deaths, we need to difference the data.
library(dplyr)

df %>%
  mutate(day = as.Date(day)) %>%
  filter(region == "Asia") %>%
  group_by(day) %>%
  summarise(deaths = sum(death)) %>%
  mutate(d = c(first(deaths), diff(deaths))) %>%
  arrange(d)
# A tibble: 107 x 3
day deaths d
<date> <int> <int>
1 2020-01-23 18 1 # <- this day saw only 1 death in the whole of Asia
2 2020-01-29 133 2
3 2020-02-21 2249 3
4 2020-02-12 1118 5
5 2020-01-24 26 8
6 2020-02-23 2465 10
7 2020-01-26 56 14
8 2020-01-25 42 16
9 2020-01-22 17 17
10 2020-01-27 82 26
# ... with 97 more rows
So the second day of records saw the least number of deaths recorded (so far).
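If you only want to pull out that single lowest-death day rather than reading it off the sorted table, one possible follow-up (a sketch, assuming dplyr >= 1.0 so that slice_min() is available) is to replace arrange(d) with slice_min():
df %>%
  mutate(day = as.Date(day)) %>%
  filter(region == "Asia") %>%
  group_by(day) %>%
  summarise(deaths = sum(death)) %>%
  mutate(d = c(first(deaths), diff(deaths))) %>%
  slice_min(d, n = 1)  # keep only the day(s) with the smallest daily death count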
Using the dplyr package for data treatment:
df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
               stringsAsFactors = FALSE)
library(dplyr)

df_sum <- df %>%
  group_by(region, day) %>%                      # group by region and day
  summarise(death = sum(death)) %>%              # sum deaths within each group
  filter(region == "Asia", death == min(death))  # keep only the minimum for Asia
Then you have:
> df_sum
# A tibble: 1 x 3
# Groups: region [1]
region day death
<fct> <fct> <int>
1 Asia 2020/01/22 17

Create ID for specific sequence of consecutive days based on grouping variable in R

For a list of events at the country-day level, we would like to create a unique ID for a sequence of consecutive days in a specific country (if two or more days of events in a country are consecutive --> create unique ID), so that I can ultimately reduce the data frame to specific sequences of events rather than event days.
I did not manage to aggregate the data based on the sequence of events. I believe this response is similar (Creating groups of consecutive days meeting a given criteria) however it is in SQL.
The data has the following format:
country <- c("Angola","Angola","Angola","Angola","Angola", "Benin","Benin","Benin","Benin","Benin","Benin")
event_date <- as.Date(c("2017-06-16", "2017-06-17", "2017-06-18", "2017-08-22", "2017-08-23", "2019-04-18", "2019-04-19", "2019-04-20", "2018-03-15", "2018-03-16", "2016-03-17"))
mydata <- data.frame(country, event_date)
In the output, I expect to have a new column with the ID that is unique to each series of events in a country:
seq.ID <- c(1,1,1,2,2,3,3,3,4,4,4)
mydata2 <- data.frame(country, event_date, seq.ID)
So that ultimately, I can reduce the data to the level of country and sequence of events:
mydata3 <- mydata2[!duplicated(mydata2$seq.ID),]
Try:
library(dplyr)
mydata %>%
  group_by(country) %>%
  distinct(seq.ID = cumsum(event_date != lag(event_date, default = first(event_date)) + 1L))
Output:
# A tibble: 5 x 2
# Groups: country [2]
seq.ID country
<int> <fct>
1 1 Angola
2 2 Angola
3 1 Benin
4 2 Benin
5 3 Benin
You can also use the .keep_all argument in distinct and preserve the first date of each sequence:
mydata %>%
  group_by(country) %>%
  distinct(seq.ID = cumsum(event_date != lag(event_date, default = first(event_date)) + 1L),
           .keep_all = TRUE)
# A tibble: 5 x 3
# Groups: country [2]
country event_date seq.ID
<fct> <date> <int>
1 Angola 2017-06-16 1
2 Angola 2017-08-22 2
3 Benin 2019-04-18 1
4 Benin 2018-03-15 2
5 Benin 2016-03-17 3
In case of desired non-aggregated output with different sequence IDs, you could do:
mydata %>%
  mutate(
    seq.ID = cumsum(
      (event_date != lag(event_date, default = first(event_date)) + 1L) |
        country != lag(country, default = first(country))
    )
  )
country event_date seq.ID
1 Angola 2017-06-16 1
2 Angola 2017-06-17 1
3 Angola 2017-06-18 1
4 Angola 2017-08-22 2
5 Angola 2017-08-23 2
6 Benin 2019-04-18 3
7 Benin 2019-04-19 3
8 Benin 2019-04-20 3
9 Benin 2018-03-15 4
10 Benin 2018-03-16 4
11 Benin 2016-03-17 5
Note that there is a typo in your last event_date, this is why the outputs don't correspond 100% to your desired output.

How do I get the sum of frequency count based on two columns?

Assuming that the dataframe is stored as someData, and is in the following format:
ID Team Games Medal
1 Australia 1992 Summer NA
2 Australia 1994 Summer Gold
3 Australia 1992 Summer Silver
4 United States 1991 Winter Gold
5 United States 1992 Summer Bronze
6 Singapore 1991 Summer NA
How would I count the frequencies of the medals based on the Team, while excluding NA? At the same time, the total frequency for each country should be summed, rather than displayed separately for Gold, Silver and Bronze.
In other words, I am trying to display the total number of medals PER country, with the exception of NA.
I have tried something like this:
library(plyr)
counts <- ddply(olympics, .(olympics$Team, olympics$Medal), nrow)
names(counts) <- c("Country", "Medal", "Freq")
counts
But this just gives me a massive table of every medal for every country separately, including NA.
What I would like to do is the following:
Australia 2
United States 2
Any help would be greatly appreciated.
Thank you!
We can use count
library(dplyr)
someData %>%
  filter(!is.na(Medal)) %>%
  count(Team)
# A tibble: 2 x 2
# Team n
# <fct> <int>
#1 Australia 2
#2 United States 2
You can do that in base R with table and colSums
colSums(table(someData$Medal, someData$Team))
Australia Singapore United States
2 0 2
Data
someData = read.table(text="ID Team Games Medal
1 Australia '1992 Summer' NA
2 Australia '1994 Summer' Gold
3 Australia '1992 Summer' Silver
4 'United States' '1991 Winter' Gold
5 'United States' '1992 Summer' Bronze
6 Singapore '1991 Summer' NA",
header=TRUE)

R: How to spread, group_by, summarise and mutate at the same time

I want to spread this data below (first 12 rows shown here only) by the column 'Year', returning the sum of 'Orders' grouped by 'CountryName'. Then calculate the % change in 'Orders' for each 'CountryName' from 2014 to 2015.
CountryName Days pCountry Revenue Orders Year
United Kingdom 0-1 days India 2604.799 13 2014
Norway 8-14 days Australia 5631.123 9 2015
US 31-45 days UAE 970.8324 2 2014
United Kingdom 4-7 days Austria 94.3814 1 2015
Norway 8-14 days Slovenia 939.8392 3 2014
South Korea 46-60 days Germany 1959.4199 15 2014
UK 8-14 days Poland 1394.9096 6. 2015
UK 61-90 days Lithuania -170.8035 -1 2015
US 8-14 days Belize 1687.68 5 2014
Australia 46-60 days Chile 888.72 2. 0 2014
US 15-30 days Turkey 2320.7355 8 2014
Australia 0-1 days Hong Kong 672.1099 2 2015
I can make this work with a smaller test dataframe, but can only seem to return endless errors like 'sum not meaningful for factors' or 'duplicate identifiers for rows' with the full data. After hours of reading the dplyr docs and trying things I've given up. Can anyone help with this code...
data %>%
  spread(Year, Orders) %>%
  group_by(CountryName) %>%
  summarise_all(.funs = c(Sum = 'sum'), na.rm = TRUE) %>%
  mutate(percent_inc = 100 * ((`2014_Sum` - `2015_Sum`) / `2014_Sum`))
The expected output would be a table similar to below. (Note: these numbers are for illustrative purposes, they are not hand calculated.)
CountryName percent_inc
UK 34.2
US 28.2
Norway 36.1
... ...
Edit
I had to make a few edits to the variable names, please note.
Sum first, while your data are still in long format, then spread. Here's an example with fake data:
library(dplyr)
library(tidyr)

set.seed(2)
dat <- data.frame(Country = sample(LETTERS[1:5], 500, replace = TRUE),
                  Year = sample(2014:2015, 500, replace = TRUE),
                  Orders = sample(-1:20, 500, replace = TRUE))

dat %>%
  group_by(Country, Year) %>%
  summarise(sum_orders = sum(Orders, na.rm = TRUE)) %>%
  spread(Year, sum_orders) %>%
  mutate(Pct = (`2014` - `2015`) / `2014` * 100)
Country `2014` `2015` Pct
1 A 575 599 -4.173913
2 B 457 486 -6.345733
3 C 481 319 33.679834
4 D 423 481 -13.711584
5 E 528 551 -4.356061
If you have multiple years, it's probably easier to just keep it in long format until you're ready to make a nice output table:
set.seed(2)
dat <- data.frame(Country = sample(LETTERS[1:5], 500, replace = TRUE),
                  Year = sample(2010:2015, 500, replace = TRUE),
                  Orders = sample(-1:20, 500, replace = TRUE))

dat %>%
  group_by(Country, Year) %>%
  summarise(sum_orders = sum(Orders, na.rm = TRUE)) %>%
  group_by(Country) %>%
  arrange(Country, Year) %>%
  mutate(Pct = c(NA, -diff(sum_orders)) / lag(sum_orders) * 100)
Country Year sum_orders Pct
<fctr> <int> <int> <dbl>
1 A 2010 205 NA
2 A 2011 144 29.756098
3 A 2012 226 -56.944444
4 A 2013 119 47.345133
5 A 2014 177 -48.739496
6 A 2015 303 -71.186441
7 B 2010 146 NA
8 B 2011 159 -8.904110
9 B 2012 152 4.402516
10 B 2013 180 -18.421053
# ... with 20 more rows
This is not an answer because you haven't really asked a reproducible question, but just to help out.
Error 1 You're getting this error duplicate identifiers for rows likely because of spread. spread wants to make N columns of your N unique values but it needs to know which unique row to place those values. If you have duplicate value-combinations, for instance:
CountryName Days pCountry Revenue
United Kingdom 0-1 days India 2604.799
United Kingdom 0-1 days India 2604.799
shows up twice, then spread gets confused about which row it should place the data in. The quick fix is to add a unique row identifier with data %>% mutate(row = row_number()) immediately before the spread (see the sketch below).
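A minimal sketch of that fix, reusing the OP's column names (Year, Orders) and assuming dplyr and tidyr are loaded; the row column is a throwaway whose only job is to make every row unique so spread() knows where each value belongs:
data %>%
  mutate(row = row_number()) %>%  # give each row a unique id so spread() has a key
  spread(Year, Orders)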
Error 2 You're getting the error sum not meaningful for factors likely because of summarise_all. summarise_all operates on all columns, but some columns contain strings (or factors). What does United Kingdom + United Kingdom equal? Try instead summarising only the year columns, e.g. summarise(`2014_Sum` = sum(`2014`), `2015_Sum` = sum(`2015`)) (the backticks are needed because 2014 and 2015 are not syntactic names).
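Putting both fixes together on the OP's pipeline might look like the sketch below; this is untested against the real data and assumes the spread year columns (2014, 2015) are the only values that need summing:
data %>%
  mutate(row = row_number()) %>%                      # unique row id avoids the duplicate-identifiers error
  spread(Year, Orders) %>%
  group_by(CountryName) %>%
  summarise(`2014_Sum` = sum(`2014`, na.rm = TRUE),   # sum only the numeric year columns
            `2015_Sum` = sum(`2015`, na.rm = TRUE)) %>%
  mutate(percent_inc = 100 * (`2014_Sum` - `2015_Sum`) / `2014_Sum`)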
