I have a df looking like this:
ID Country
55 Poland
55 Romania
55 France
98 Spain
98 Portugal
98 UK
65 Germany
67 Luxembourg
84 Greece
22 Estonia
22 Lithuania
Some IDs are repeated because those rows belong to the same group. What I want is to paste together all Country values that share the same ID into a single string.
So far I have tried
ifelse(df[duplicated(df$ID) | duplicated(df$ID, fromLast = TRUE),], paste('Countries', df$Country), NA)
but this does not produce the expected output.
Using data.table
library(data.table)
setDT(df)[, New_Name := c(paste0(Country, collapse = " + ")[1L], rep(NA, .N -1)), by = ID]
#df
#ID Country New_Name
#1: 55 Poland Poland + Romania + France
#2: 55 Romania <NA>
#3: 55 France <NA>
#4: 98 Spain Spain + Portugal + UK
#5: 98 Portugal <NA>
#6: 98 UK <NA>
#7: 65 Germany Germany
#8: 67 Luxembourg Luxembourg
#9: 84 Greece Greece
#10: 22 Estonia Estonia + Lithuania
#11: 22 Lithuania <NA>
Using base R,
replace(v1 <- with(df, ave(as.character(Country), ID, FUN = toString)), duplicated(v1), NA)
#[1] "Poland, Romania, France" NA NA "Spain, Portugal, UK" NA NA "Germany" "Luxembourg" "Greece" "Estonia, Lithuania"
#[11] NA
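To keep the result attached to the data frame rather than as a bare vector, the same base R idea can be written as follows (a sketch that rebuilds the sample data so it runs on its own):

```r
# Rebuild the sample data so this snippet is self-contained
df <- data.frame(
  ID = c(55, 55, 55, 98, 98, 98, 65, 67, 84, 22, 22),
  Country = c("Poland", "Romania", "France", "Spain", "Portugal", "UK",
              "Germany", "Luxembourg", "Greece", "Estonia", "Lithuania")
)

# Collapse Country within each ID, then blank out the repeats so the
# collapsed string shows up only on the first row of each group
v1 <- with(df, ave(as.character(Country), ID, FUN = toString))
df$New_Name <- replace(v1, duplicated(v1), NA)
df
```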
Using dplyr, one way would be
library(dplyr)
df %>%
group_by(ID) %>%
mutate(new_name = paste0(Country,collapse = " + "),
new_name = replace(new_name, duplicated(new_name), NA))
# ID Country new_name
# <int> <fct> <chr>
# 1 55 Poland Poland + Romania + France
# 2 55 Romania NA
# 3 55 France NA
# 4 98 Spain Spain + Portugal + UK
# 5 98 Portugal NA
# 6 98 UK NA
# 7 65 Germany Germany
# 8 67 Luxembourg Luxembourg
# 9 84 Greece Greece
#10 22 Estonia Estonia + Lithuania
#11 22 Lithuania NA
However, to get your exact expected output we might need
df %>%
group_by(ID) %>%
mutate(new_name = if (n() > 1)
paste0("Countries ", paste0(Country,collapse = " + ")) else Country,
new_name = replace(new_name, duplicated(new_name), NA))
# ID Country new_name
# <int> <fct> <chr>
# 1 55 Poland Countries Poland + Romania + France
# 2 55 Romania NA
# 3 55 France NA
# 4 98 Spain Countries Spain + Portugal + UK
# 5 98 Portugal NA
# 6 98 UK NA
# 7 65 Germany Germany
# 8 67 Luxembourg Luxembourg
# 9 84 Greece Greece
#10 22 Estonia Countries Estonia + Lithuania
#11 22 Lithuania NA
Using aggregate and then matching back for the first time only:
flat <- function(x) paste("Countries:", paste(x,collapse=", "))
tmp <- aggregate(Country ~ ID, data=dat, FUN=flat)
dat$Country <- NA
dat$Country[match(tmp$ID, dat$ID)] <- tmp$Country
# ID Country
#1 55 Countries: Poland, Romania, France
#2 55 <NA>
#3 55 <NA>
#4 98 Countries: Spain, Portugal, UK
#5 98 <NA>
#6 98 <NA>
#7 65 Countries: Germany
#8 67 Countries: Luxembourg
#9 84 Countries: Greece
#10 22 Countries: Estonia, Lithuania
#11 22 <NA>
With purrr and dplyr (plus tidyr for nest/unnest; the newer nest(data = -ID) and unnest(cols = ...) syntax is used here):
library(dplyr)
library(purrr)
library(tidyr)
df %>%
  nest(data = -ID) %>%
  mutate(new_name = map_chr(data, ~ paste0(.x$Country, collapse = " + "))) %>%
  unnest(cols = data)
Table:
ID new_name Country
55 Poland + Romania + France Poland
55 Poland + Romania + France Romania
55 Poland + Romania + France France
98 Spain + Portugal + UK Spain
98 Spain + Portugal + UK Portugal
98 Spain + Portugal + UK UK
65 Germany Germany
67 Luxembourg Luxembourg
84 Greece Greece
22 Estonia + Lithuania Estonia
22 Estonia + Lithuania Lithuania
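If a single collapsed row per ID is all that is needed, rather than keeping every original row, a summarise() sketch along the same lines (sample data rebuilt so the snippet stands alone):

```r
library(dplyr)

df <- data.frame(
  ID = c(55, 55, 55, 98, 98, 98, 65, 67, 84, 22, 22),
  Country = c("Poland", "Romania", "France", "Spain", "Portugal", "UK",
              "Germany", "Luxembourg", "Greece", "Estonia", "Lithuania")
)

# One row per ID, with the countries joined by " + "
collapsed <- df %>%
  group_by(ID) %>%
  summarise(new_name = paste(Country, collapse = " + "), .groups = "drop")
collapsed
```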
Related
How do I pick yesterday's date to access the latest figures in a dataset that reports information daily?
df <- read.csv ('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv')
I am trying to print how many people died yesterday.
df1 <- aggregate(death~countryName, subset(df, region =="Europe"), sum)
sums the death counts across every day in the data. I don't want that sum; which code gives me the number of deaths for the previous day only?
Yesterday's total (not individual day's count):
df1 <- subset(df, region =="Europe" & day == '2020/05/02')
head(df1)
# day countryCode countryName region lat lon confirmed recovered death
# 102 2020/05/02 AD Andorra Europe 42.50000 1.50000 747 472 44
# 612 2020/05/02 AL Albania Europe 41.00000 20.00000 789 519 31
# 1020 2020/05/02 AT Austria Europe 47.33333 13.33333 15558 13180 596
# 1428 2020/05/02 BA Bosnia and Herzegovina Europe 44.00000 18.00000 1839 779 72
# 1734 2020/05/02 BE Belgium Europe 50.83333 4.00000 49517 12211 7765
# 1938 2020/05/02 BG Bulgaria Europe 43.00000 25.00000 1594 287 72
I previously used your aggregate code, but realized a few flaws in its assumptions:
Since each row for a given country is a running total, using sum as an aggregating technique is logically incorrect when looking at multiple days; and
Since you want just a single day's worth of data, there's no need to aggregate; we can just subset the data.
"Proof":
tail(sort(df$day), n=1)
# [1] 2020/05/02
# 102 Levels: 2020/01/22 2020/01/23 2020/01/24 2020/01/25 ... 2020/05/02
head( subset(df, region == "Europe" & day == "2020/05/02") )
# day countryCode countryName region lat lon confirmed recovered death
# 102 2020/05/02 AD Andorra Europe 42.50000 1.50000 747 472 44
# 612 2020/05/02 AL Albania Europe 41.00000 20.00000 789 519 31
# 1020 2020/05/02 AT Austria Europe 47.33333 13.33333 15558 13180 596
# 1428 2020/05/02 BA Bosnia and Herzegovina Europe 44.00000 18.00000 1839 779 72
# 1734 2020/05/02 BE Belgium Europe 50.83333 4.00000 49517 12211 7765
# 1938 2020/05/02 BG Bulgaria Europe 43.00000 25.00000 1594 287 72
head( subset(df, region == "Europe" & day == "2020/05/01") )
# day countryCode countryName region lat lon confirmed recovered death
# 101 2020/05/01 AD Andorra Europe 42.50000 1.50000 745 468 43
# 611 2020/05/01 AL Albania Europe 41.00000 20.00000 782 488 31
# 1019 2020/05/01 AT Austria Europe 47.33333 13.33333 15531 13110 589
# 1427 2020/05/01 BA Bosnia and Herzegovina Europe 44.00000 18.00000 1781 755 70
# 1733 2020/05/01 BE Belgium Europe 50.83333 4.00000 49032 11892 7703
# 1937 2020/05/01 BG Bulgaria Europe 43.00000 25.00000 1555 276 68
If you need just the death column, you can always use select= to pick the columns:
df1 <- subset(df, region =="Europe" & day == '2020/05/02', select = c(countryName, death))
head(df1)
# countryName death
# 102 Andorra 44
# 612 Albania 31
# 1020 Austria 596
# 1428 Bosnia and Herzegovina 72
# 1734 Belgium 7765
# 1938 Bulgaria 72
If you're looking for the difference between yesterday and the previous reporting number (which should be "the day prior", but nothing is verified), then a dplyr solution could be
library(dplyr)
as_tibble(df) %>%
arrange(day) %>%
group_by(countryCode) %>%
mutate_at(vars(confirmed, recovered, death), list(~ c(NA, diff(.)))) %>%
slice(n())
# Warning: Factor `countryCode` contains implicit NA, consider using `forcats::fct_explicit_na`
# Warning: Factor `countryCode` contains implicit NA, consider using `forcats::fct_explicit_na`
# # A tibble: 212 x 9
# # Groups: countryCode [212]
# day countryCode countryName region lat lon confirmed recovered death
# <fct> <fct> <fct> <fct> <dbl> <dbl> <int> <int> <int>
# 1 2020/05/02 AD Andorra Europe 42.5 1.5 2 4 1
# 2 2020/05/02 AE United Arab Emirates Asia 24 54 561 121 8
# 3 2020/05/02 AF Afghanistan Asia 33 65 134 21 4
# 4 2020/05/02 AG Antigua and Barbuda Americas 17.0 -61.8 0 0 0
# 5 2020/05/02 AI Anguilla Americas 18.2 -63.2 0 0 0
# 6 2020/05/02 AL Albania Europe 41 20 7 31 0
# 7 2020/05/02 AM Armenia Asia 40 45 125 33 0
# 8 2020/05/02 AO Angola Africa -12.5 18.5 5 0 0
# 9 2020/05/02 AR Argentina Americas -34 -64 0 28 4
# 10 2020/05/02 AT Austria Europe 47.3 13.3 27 70 7
# # ... with 202 more rows
I think it's a safe assumption, but I'm relying on the sorting of non-formatted day here. One could always convert to Date class explicitly with mutate(day = as.Date(day, format = "%Y/%m/%d")) before the arrange to be "complete".
And because I challenged myself months ago to become more proficient with data.table, here's an alternative solution in that dialect. (Note that I use magrittr's %>% operator here to break out each stage of processing; this can easily be done without that as a more traditional data.table-chain-processing.)
library(data.table)
cols <- c("confirmed", "recovered", "death")
as.data.table(df) %>%
.[, (cols) := lapply(.SD, function(a) c(NA, diff(a))), by = .(countryName), .SDcols = cols] %>%
.[, .SD[.N,], by = .(countryName) ]
# countryName day countryCode region lat lon confirmed recovered death
# 1: Andorra 2020/05/02 AD Europe 42.50000 1.50000 2 4 1
# 2: United Arab Emirates 2020/05/02 AE Asia 24.00000 54.00000 561 121 8
# 3: Afghanistan 2020/05/02 AF Asia 33.00000 65.00000 134 21 4
# 4: Antigua and Barbuda 2020/05/02 AG Americas 17.05000 -61.80000 0 0 0
# 5: Anguilla 2020/05/02 AI Americas 18.25000 -63.16667 0 0 0
# ---
# 208: Yemen 2020/05/02 YE Asia 15.00000 48.00000 3 0 0
# 209: Mayotte 2020/05/02 YT Africa -12.83333 45.16667 0 0 0
# 210: South Africa 2020/05/02 ZA Africa -29.00000 24.00000 385 167 7
# 211: Zambia 2020/05/02 ZM Africa -15.00000 30.00000 10 1 0
# 212: Zimbabwe 2020/05/02 ZW Africa -20.00000 30.00000 -6 0 0
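The same differencing idea can also be sketched in base R with ave() and diff(); toy cumulative totals stand in for the real CSV here:

```r
# Toy cumulative totals for two countries over three days
df <- data.frame(
  day = rep(c("2020/04/30", "2020/05/01", "2020/05/02"), times = 2),
  countryName = rep(c("Andorra", "Albania"), each = 3),
  death = c(42, 43, 44, 30, 31, 31)
)

# Convert running totals to per-day counts within each country;
# the first observation per country has no prior day, hence NA
df$death_daily <- ave(df$death, df$countryName,
                      FUN = function(x) c(NA, diff(x)))

# Keep only the most recent day
subset(df, day == max(day))
```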
Write an auxiliary function yesterday and use it to subset the data.
yesterday <- function() Sys.Date() - 1L
yesterday()
# [1] "2020-05-02"
df1 <- aggregate(death ~ countryName, subset(df, region =="Europe" & day == yesterday()), sum)
A dplyr solution:
library(dplyr)
df %>%
filter(day == yesterday(), region == "Europe") %>%
group_by(countryName) %>%
summarise(death = sum(death))
Data
Following r2evans' comment, here is the data reading and date conversion code.
df <- read.csv ('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv')
df$day <- as.Date(df$day, "%Y/%m/%d")
I have two data frames. The first one looks like
Country Year production
Germany 1996 11
France 1996 12
Greece 1996 15
UK 1996 17
USA 1996 24
The second one contains all the countries that are in the first data frame plus a few more countries for year 2018. It looks likes this
Country Year production
Germany 2018 27
France 2018 29
Greece 2018 44
UK 2018 46
USA 2018 99
Austria 2018 56
Japan 2018 66
I would like to merge the two data frames, and the final table should look like this:
Country Year production
Germany 1996 11
France 1996 12
Greece 1996 15
UK 1996 17
USA 1996 24
Austria 1996 NA
Japan 1996 NA
Germany 2018 27
France 2018 29
Greece 2018 44
UK 2018 46
USA 2018 99
Austria 2018 56
Japan 2018 66
I've tried several functions including full_join, merge, and rbind but they didn't work. Does anybody have any ideas?
With dplyr and tidyr, you may use:
bind_rows(df1, df2) %>%
complete(Country, Year)
Country Year production
<chr> <int> <int>
1 Austria 1996 NA
2 Austria 2018 56
3 France 1996 12
4 France 2018 29
5 Germany 1996 11
6 Germany 2018 27
7 Greece 1996 15
8 Greece 2018 44
9 Japan 1996 NA
10 Japan 2018 66
11 UK 1996 17
12 UK 2018 46
13 USA 1996 24
14 USA 2018 99
Consider base R with expand.grid and merge (and avoid any dependencies should you be a package author):
# BUILD DF OF ALL POSSIBLE COMBINATIONS OF COUNTRY AND YEAR
all_country_years <- expand.grid(Country=unique(c(df_96$Country, df_18$Country)),
Year=c(1996, 2018))
# MERGE (LEFT JOIN)
final_df <- merge(all_country_years, rbind(df_96, df_18), by=c("Country", "Year"),
all.x=TRUE)
# ORDER DATA AND RESET ROW NAMES
final_df <- data.frame(with(final_df, final_df[order(Year, Country),]),
row.names = NULL)
final_df
# Country Year production
# 1 Germany 1996 11
# 2 France 1996 12
# 3 Greece 1996 15
# 4 UK 1996 17
# 5 USA 1996 24
# 6 Austria 1996 NA
# 7 Japan 1996 NA
# 8 Germany 2018 27
# 9 France 2018 29
# 10 Greece 2018 44
# 11 UK 2018 46
# 12 USA 2018 99
# 13 Austria 2018 56
# 14 Japan 2018 66
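For completeness, a data.table sketch of the same grid-and-join idea, using CJ() to enumerate every Country/Year pair (column names as in the question; only a few sample rows are used here):

```r
library(data.table)

df_96 <- data.table(Country = c("Germany", "France"), Year = 1996,
                    production = c(11, 12))
df_18 <- data.table(Country = c("Germany", "France", "Japan"), Year = 2018,
                    production = c(27, 29, 66))

combined <- rbind(df_96, df_18)
# Join the stacked data onto every Country x Year combination;
# pairs with no match (e.g. Japan 1996) get NA production
full <- combined[CJ(Country = unique(Country), Year = unique(Year)),
                 on = .(Country, Year)]
full
```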
I find it a bit hard to find the right words for what I'm trying to do.
Say I have this dataframe:
library(dplyr)
# A tibble: 74 x 3
country year conf_perc
<chr> <dbl> <dbl>
1 Canada 2017 77
2 France 2017 45
3 Germany 2017 60
4 Greece 2017 33
5 Hungary 2017 67
6 Italy 2017 38
7 Canada 2009 88
8 France 2009 91
9 Germany 2009 93
10 Greece 2009 NA
11 Hungary 2009 NA
12 Italy 2009 NA
Now I want to delete the rows that have NA values in 2009, and also remove the 2017 rows for those same countries. I would like to get the following result:
# A tibble: 74 x 3
country year conf_perc
<chr> <dbl> <dbl>
1 Canada 2017 77
2 France 2017 45
3 Germany 2017 60
4 Canada 2009 88
5 France 2009 91
6 Germany 2009 93
We can use any() on the NA check after grouping by 'country':
library(dplyr)
df1 %>%
group_by(country) %>%
filter(!any(is.na(conf_perc)))
# A tibble: 6 x 3
# Groups: country [3]
# country year conf_perc
# <chr> <int> <int>
#1 Canada 2017 77
#2 France 2017 45
#3 Germany 2017 60
#4 Canada 2009 88
#5 France 2009 91
#6 Germany 2009 93
base R solution:
foo <- df$year == 2009 & is.na(df$conf_perc)
bar <- df$year == 2017 & df$country %in% unique(df$country[foo])
df[-c(which(foo), which(bar)), ]
# country year conf_perc
# 1 Canada 2017 77
# 2 France 2017 45
# 3 Germany 2017 60
# 7 Canada 2009 88
# 8 France 2009 91
# 9 Germany 2009 93
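A data.table sketch of the same group-wise filter, dropping every row of any country that has at least one NA (a reduced version of the sample data is rebuilt inline):

```r
library(data.table)

df1 <- data.table(
  country = rep(c("Canada", "France", "Greece"), times = 2),
  year = rep(c(2017, 2009), each = 3),
  conf_perc = c(77, 45, 33, 88, 91, NA)
)

# Keep a country's rows only if none of its conf_perc values are NA
res <- df1[, if (!anyNA(conf_perc)) .SD, by = country]
res
```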
I don't know what to call this data transformation, nor whether an existing function does it.
My data has this shape:
rank abbrv country eci_value delta year
(int) (fctr) (fctr) (dbl) (int) (int)
1 30 BRA Brazil 0.5588656 2 1995
2 47 URY Uruguay 0.2098838 -14 1995
3 52 PAN Panama 0.1164776 2 1995
4 56 ARG Argentina 0.0013733 7 1995
5 58 VEN Venezuela -0.0329851 11 1995
6 64 COL Colombia -0.2216275 -2 1995
And I want a data frame with just the information provided by "year", "rank", and "country", presented in this way:
country 1995 1996 1997 1998 ...
Peru rank1995 rank1996 rank1997 rank1998 ...
Brazil rank1995 rank1996 rank1997 rank1998 ...
Chile rank1995 rank1996 rank1997 rank1998 ...
... ... ... ... ...
The variable "year" ranges from 1995 to 2014, and the rank varies each year.
I've thought of using the melt and dcast functions from the reshape2 package, but nothing useful comes out.
Thanks
This could work for you. Here is an example using dplyr and tidyr, using your small sample above (you will have to test on a larger data set or provide one).
library(dplyr)
library(tidyr)
df
# rank abbrv country eci_value delta year
#1 30 BRA Brazil 0.5588656 2 1995
#2 47 URY Uruguay 0.2098838 -14 1995
#3 52 PAN Panama 0.1164776 2 1995
#4 56 ARG Argentina 0.0013733 7 1995
#5 58 VEN Venezuela -0.0329851 11 1995
#6 64 COL Colombia -0.2216275 -2 1995
df %>% select(country, year, rank) %>% spread(year, rank)
# country 1995
#1 Argentina 56
#2 Brazil 30
#3 Colombia 64
#4 Panama 52
#5 Uruguay 47
#6 Venezuela 58
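spread() has since been superseded in tidyr by pivot_wider(); the equivalent call would be as below. A second year of data is invented here purely so more than one year column appears:

```r
library(dplyr)
library(tidyr)

# Sample data; the 1996 ranks are made up for illustration
df <- data.frame(
  country = c("Brazil", "Uruguay", "Brazil", "Uruguay"),
  year = c(1995, 1995, 1996, 1996),
  rank = c(30, 47, 28, 45)
)

# Same reshape as spread(year, rank), with the newer verb
wide <- df %>%
  select(country, year, rank) %>%
  pivot_wider(names_from = year, values_from = rank)
wide
```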
my df2:
League freq
18 England 108
27 Italy 79
20 Germany 74
43 Spain 64
19 France 49
39 Russia 34
31 Mexico 27
47 Turkey 24
32 Netherlands 23
37 Portugal 21
49 United States 18
29 Japan 16
25 Iran 15
7 Brazil 13
22 Greece 13
14 Costa 11
45 Switzerland 11
5 Belgium 10
17 Ecuador 10
23 Honduras 10
42 South Korea 9
2 Argentina 8
48 Ukraine 7
3 Australia 6
11 Chile 6
12 China 6
15 Croatia 6
35 Norway 6
41 Scotland 6
34 Nigeria 5
I tried to select Europe with
europe <- subset(df2, nrow(x=18, 27, 20) select=c(1, 2))
What is the most effective way to select Europe, Africa, Asia, etc. from df2?
You either need to identify which countries are on which continents by hand, or you might be able to scrape this information from somewhere:
(basic strategy from Scraping html tables into R data frames using the XML package)
library(XML)
theurl <- "http://en.wikipedia.org/wiki/List_of_European_countries_by_area"
tables <- readHTMLTable(theurl)
library(stringr)
europe_names <- str_extract(as.character(tables[[1]]$Country),"[[:alpha:] ]+")
head(sort(europe_names))
## [1] "Albania" "Andorra" "Austria" "Azerbaijan" "Belarus"
## [6] "Belgium"
## there's also a 'Total' entry in here but it's probably harmless ...
subset(df2,League %in% europe_names)
Of course you'd have to figure this out again for Asia, America, etc.
So here's a slightly different approach from #BenBolker's, using the countrycode package.
library(countrycode)
cdb <- countrycode_data # database of countries
df2[toupper(df2$League) %in% cdb[cdb$continent=="Europe",]$country.name,]
# League freq
# 27 Italy 79
# 20 Germany 74
# 43 Spain 64
# 19 France 49
# 32 Netherlands 23
# 37 Portugal 21
# 22 Greece 13
# 45 Switzerland 11
# 5 Belgium 10
# 48 Ukraine 7
# 15 Croatia 6
# 35 Norway 6
One problem you're going to have is that "England" is not a country in any database (rather, "United Kingdom"), so you'll have to deal with that as a special case.
Also, this database considers the "Americas" as a continent.
df2[toupper(df2$League) %in% cdb[cdb$continent=="Americas",]$country.name,]
so to get just South America you have to use the region field:
df2[toupper(df2$League) %in% cdb[cdb$region=="South America",]$country.name,]
# League freq
# 7 Brazil 13
# 17 Ecuador 10
# 2 Argentina 8
# 11 Chile 6
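In newer versions of countrycode, the countrycode() function can map country names to continents directly, which sidesteps the case-matching against countrycode_data (England would still need special handling, as noted above):

```r
library(countrycode)

# A small slice of df2 for illustration
df2 <- data.frame(
  League = c("Italy", "Germany", "Brazil", "Argentina", "Japan"),
  freq = c(79, 74, 13, 8, 16)
)

# Map each league's country name to its continent; unmatched names give NA
df2$continent <- countrycode(df2$League, origin = "country.name",
                             destination = "continent")
subset(df2, continent == "Europe")
```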