My df2:
League freq
18 England 108
27 Italy 79
20 Germany 74
43 Spain 64
19 France 49
39 Russia 34
31 Mexico 27
47 Turkey 24
32 Netherlands 23
37 Portugal 21
49 United States 18
29 Japan 16
25 Iran 15
7 Brazil 13
22 Greece 13
14 Costa 11
45 Switzerland 11
5 Belgium 10
17 Ecuador 10
23 Honduras 10
42 South Korea 9
2 Argentina 8
48 Ukraine 7
3 Australia 6
11 Chile 6
12 China 6
15 Croatia 6
35 Norway 6
41 Scotland 6
34 Nigeria 5
I am trying to select the European countries:
europe <- subset(df2, nrow(x=18, 27, 20) select=c(1, 2))
What is the most effective way to select Europe, Africa, Asia ... from df2?
You either need to identify which countries are on which continents by hand, or you might be able to scrape this information from somewhere:
(basic strategy from Scraping html tables into R data frames using the XML package)
library(XML)
theurl <- "http://en.wikipedia.org/wiki/List_of_European_countries_by_area"
tables <- readHTMLTable(theurl)
library(stringr)
europe_names <- str_extract(as.character(tables[[1]]$Country),"[[:alpha:] ]+")
head(sort(europe_names))
## [1] "Albania" "Andorra" "Austria" "Azerbaijan" "Belarus"
## [6] "Belgium"
## there's also a 'Total' entry in here but it's probably harmless ...
subset(df2,League %in% europe_names)
Of course you'd have to figure this out again for Asia, America, etc.
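The "by hand" option mentioned above can be sketched in base R with a named lookup vector. This is only an illustration: the five rows are copied from df2, while the continent_of vector and the continent labels are assumptions written for this example, not taken from any package.

```r
# A few rows copied from df2 in the question
df2 <- data.frame(League = c("England", "Italy", "Mexico", "Japan", "Nigeria"),
                  freq   = c(108, 79, 27, 16, 5),
                  stringsAsFactors = FALSE)

# Hand-written lookup (hypothetical; covers only the rows above)
continent_of <- c(England = "Europe", Italy = "Europe", Mexico = "Americas",
                  Japan = "Asia", Nigeria = "Africa")

# Leagues missing from the lookup would get NA here
df2$continent <- unname(continent_of[df2$League])

# One continent at a time; %in% avoids NA rows for unmatched leagues
europe <- df2[df2$continent %in% "Europe", ]

# Or all continents at once, as a named list of data frames
by_continent <- split(df2, df2$continent)
```

With the lookup in place, every continent is available from a single split() call instead of one scrape per continent.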
So here's a slightly different approach from @BenBolker's, using the countrycode package.
library(countrycode)
cdb <- countrycode_data # database of countries
df2[toupper(df2$League) %in% cdb[cdb$continent=="Europe",]$country.name,]
# League freq
# 27 Italy 79
# 20 Germany 74
# 43 Spain 64
# 19 France 49
# 32 Netherlands 23
# 37 Portugal 21
# 22 Greece 13
# 45 Switzerland 11
# 5 Belgium 10
# 48 Ukraine 7
# 15 Croatia 6
# 35 Norway 6
One problem you're going to have is that "England" (like "Scotland") is not listed as a country in this database (both fall under "United Kingdom"), so you'll have to deal with them as special cases.
Also, this database considers the "Americas" as a continent.
df2[toupper(df2$League) %in% cdb[cdb$continent=="Americas",]$country.name,]
so to get just South America you have to use the region field:
df2[toupper(df2$League) %in% cdb[cdb$region=="South America",]$country.name,]
# League freq
# 7 Brazil 13
# 17 Ecuador 10
# 2 Argentina 8
# 11 Chile 6
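One possible way to handle the "England" special case is to recode the UK home nations before matching. The sketch below is base R only; uk_parts and european_names are hypothetical stand-ins written for this example (in practice the second would be the upper-case Europe names taken from the package's table).

```r
league <- c("England", "Scotland", "Italy", "Brazil")

# Assumption: treat the UK home nations as "United Kingdom" for matching
uk_parts <- c("England", "Scotland", "Wales", "Northern Ireland")
country  <- ifelse(league %in% uk_parts, "United Kingdom", league)

# Stand-in for the upper-case European country names from the database
european_names <- c("ITALY", "UNITED KINGDOM")

# Match on the recoded name, but keep the original league label
in_europe <- league[toupper(country) %in% european_names]
```

Matching on a recoded copy keeps the original League values intact, so the result still shows "England" and "Scotland".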
I'm trying to join two datasets: a Natural Earth dataset subset to contain only countries in Europe (europe_map) and a list of locations in Europe (europe_places).
Here are the headers of the datasets:
europe_places
Simple feature collection with 23 features and 4 fields
geometry type: POINT
dimension: XY
bbox: xmin: -9.1393 ymin: 38.7223 xmax: 24.1052 ymax: 58.97
geographic CRS: WGS 84
First 10 features:
Location Year Country Continent geometry
1 Paris 2008 France Europe POINT (2.3522 48.8566)
2 Stavanger 2009 Norway Europe POINT (5.7331 58.97)
3 Paris 2009 France Europe POINT (2.3522 48.8566)
4 Berlin 2010 Germany Europe POINT (13.405 52.52)
5 Prague 2011 Czechia Europe POINT (14.4378 50.0755)
6 Piancavallo 2012 Italy Europe POINT (12.5166 46.10768)
7 Budapest 2012 Hungary Europe POINT (19.0402 47.4979)
8 Aprica 2013 Italy Europe POINT (10.15177 46.15486)
9 Vienna 2014 Austria Europe POINT (16.3738 48.2082)
10 Folgaria 2014 Italy Europe POINT (11.17205 45.9162)
europe_map
Simple feature collection with 6 features and 94 fields
geometry type: GEOMETRY
dimension: XY
bbox: xmin: -8.144824 ymin: 41.89756 xmax: 40.12832 ymax: 60.83188
geographic CRS: WGS 84
featurecla scalerank LABELRANK Country SOV_A3 ADM0_DIF LEVEL TYPE
5 Admin-0 country 6 6 Vatican VAT 0 2 Sovereign country
28 Admin-0 country 4 6 United Kingdom GB1 1 2 Country
29 Admin-0 country 4 6 United Kingdom GB1 1 2 Country
30 Admin-0 country 3 6 United Kingdom GB1 1 2 Country
31 Admin-0 country 1 2 United Kingdom GB1 1 2 Country
33 Admin-0 country 1 3 Ukraine UKR 0 2 Sovereign country
ADMIN ADM0_A3 GEOU_DIF GEOUNIT GU_A3 SU_DIF SUBUNIT SU_A3 BRK_DIFF
5 Vatican VAT 0 Vatican VAT 0 Vatican VAT 0
28 Jersey JEY 0 Jersey JEY 0 Jersey JEY 0
29 Guernsey GGY 0 Guernsey GGY 0 Guernsey GGY 0
30 Isle of Man IMN 0 Isle of Man IMN 0 Isle of Man IMN 0
31 United Kingdom GBR 0 United Kingdom GBR 0 United Kingdom GBR 0
33 Ukraine UKR 0 Ukraine UKR 0 Ukraine UKR 0
NAME NAME_LONG BRK_A3 BRK_NAME BRK_GROUP ABBREV POSTAL
5 Vatican Vatican VAT Vatican <NA> Vat. V
28 Jersey Jersey JEY Jersey Channel Islands Jey. JE
29 Guernsey Guernsey GGY Guernsey Channel Islands Guern. GG
30 Isle of Man Isle of Man IMN Isle of Man <NA> IoMan IM
31 United Kingdom United Kingdom GBR United Kingdom <NA> U.K. GB
33 Ukraine Ukraine UKR Ukraine <NA> Ukr. UA
FORMAL_EN FORMAL_FR NAME_CIAWF
5 State of the Vatican City <NA> Holy See (Vatican City)
28 Bailiwick of Jersey <NA> Jersey
29 Bailiwick of Guernsey <NA> Guernsey
30 <NA> <NA> Isle of Man
31 United Kingdom of Great Britain and Northern Ireland <NA> United Kingdom
33 Ukraine <NA> Ukraine
NOTE_ADM0 NOTE_BRK NAME_SORT NAME_ALT MAPCOLOR7 MAPCOLOR8 MAPCOLOR9
5 <NA> <NA> Vatican (Holy See) Holy See 1 3 4
28 U.K. crown dependency <NA> Jersey <NA> 6 6 6
29 U.K. crown dependency <NA> Guernsey <NA> 6 6 6
30 U.K. crown dependency <NA> Isle of Man <NA> 6 6 6
31 <NA> <NA> United Kingdom <NA> 6 6 6
33 <NA> <NA> Ukraine <NA> 5 1 6
MAPCOLOR13 POP_EST POP_RANK GDP_MD_EST POP_YEAR LASTCENSUS GDP_YEAR
5 2 1000 3 0 2015 NA 0
28 3 98840 8 5080 2017 2001 2015
29 3 66502 8 3465 2017 2001 2015
30 3 88815 8 7428 2017 2006 2014
31 3 64769452 16 2788000 2017 2011 2016
33 3 44033874 15 352600 2017 2001 2016
ECONOMY INCOME_GRP WIKIPEDIA FIPS_10_ ISO_A2 ISO_A3
5 2. Developed region: nonG7 2. High income: nonOECD 0 VT VA VAT
28 2. Developed region: nonG7 2. High income: nonOECD NA JE JE JEY
29 2. Developed region: nonG7 2. High income: nonOECD NA GK GG GGY
30 2. Developed region: nonG7 2. High income: nonOECD NA IM IM IMN
31 1. Developed region: G7 1. High income: OECD NA UK GB GBR
33 6. Developing region 4. Lower middle income NA UP UA UKR
ISO_A3_EH ISO_N3 UN_A3 WB_A2 WB_A3 WOE_ID WOE_ID_EH
5 VAT 336 336 <NA> <NA> 23424986 23424986
28 JEY 832 832 JG CHI 23424857 23424857
29 GGY 831 831 JG CHI 23424827 23424827
30 IMN 833 833 IM IMY 23424847 23424847
31 GBR 826 826 GB GBR -90 23424975
33 UKR 804 804 UA UKR 23424976 23424976
WOE_NOTE
5 Exact WOE match as country
28 Exact WOE match as country
29 Exact WOE match as country
30 Exact WOE match as country
31 Eh ID includes Channel Islands and Isle of Man. UK constituent countries of England (24554868), Wales (12578049), Scotland (12578048), and Northern Ireland (20070563).
33 Exact WOE match as country
ADM0_A3_IS ADM0_A3_US ADM0_A3_UN ADM0_A3_WB CONTINENT REGION_UN SUBREGION
5 VAT VAT NA NA Europe Europe Southern Europe
28 JEY JEY NA NA Europe Europe Northern Europe
29 GGY GGY NA NA Europe Europe Northern Europe
30 IMN IMN NA NA Europe Europe Northern Europe
31 GBR GBR NA NA Europe Europe Northern Europe
33 UKR UKR NA NA Europe Europe Eastern Europe
REGION_WB NAME_LEN LONG_LEN ABBREV_LEN TINY HOMEPART MIN_ZOOM MIN_LABEL
5 Europe & Central Asia 7 7 4 4 1 0 5.0
28 Europe & Central Asia 6 6 4 NA NA 0 5.0
29 Europe & Central Asia 8 8 6 NA NA 0 5.0
30 Europe & Central Asia 11 11 5 NA NA 0 5.0
31 Europe & Central Asia 14 14 4 NA 1 0 1.7
33 Europe & Central Asia 7 7 4 NA 1 0 3.0
MAX_LABEL NE_ID WIKIDATAID NAME_AR NAME_BN NAME_DE
5 10.0 1159321407 Q237 الفاتيكان ভ্যাটিকান সিটি Vatikanstadt
28 10.0 1159320725 Q785 جيرزي জার্সি Jersey
29 10.0 1159320715 Q25230 غيرنزي <NA> Guernsey
30 10.0 1159320721 Q9676 جزيرة مان আইল অব ম্যান Isle of Man
31 6.7 1159320713 Q145 المملكة المتحدة যুক্তরাজ্য Vereinigtes Königreich
33 7.0 1159321345 Q212 أوكرانيا ইউক্রেন Ukraine
NAME_EN NAME_ES NAME_FR NAME_EL NAME_HI
5 Vatican City Ciudad del Vaticano Vatican Βατικανό वैटिकन नगर
28 Jersey Jersey Jersey Τζέρσεϊ जर्सी
29 Guernsey Guernsey Guernesey Γκέρνσεϊ ग्वेर्नसे
30 Isle of Man Isla de Man île de Man Νήσος του Μαν मनुष्य का टापू
31 United Kingdom Reino Unido Royaume-Uni Ηνωμένο Βασίλειο यूनाइटेड किंगडम
33 Ukraine Ucrania Ukraine Ουκρανία युक्रेन
NAME_HU NAME_ID NAME_IT NAME_JA NAME_KO
5 Vatikán Vatikan Città del Vaticano バチカン 바티칸 시국
28 Jersey Jersey Baliato di Jersey ジャージー 저지 섬
29 Guernsey Bailiffség Guernsey Guernsey ガーンジー 건지 섬
30 Man Pulau Man Isola di Man マン島 맨 섬
31 Egyesült Királyság Britania Raya Regno Unito イギリス 영국
33 Ukrajna Ukraina Ucraina ウクライナ 우크라이나
NAME_NL NAME_PL NAME_PT NAME_RU NAME_SV
5 Vaticaanstad Watykan Vaticano Ватикан Vatikanstaten
28 Jersey Jersey Jersey Джерси Jersey
29 Guernsey Guernsey Guernsey Гернси Guernsey
30 Man Wyspa Man Ilha de Man остров Мэн Isle of Man
31 Verenigd Koninkrijk Wielka Brytania Reino Unido Великобритания Storbritannien
33 Oekraïne Ukraina Ucrânia Украина Ukraina
NAME_TR NAME_VI NAME_ZH
5 Vatikan Thành Vatican 梵蒂冈
28 Jersey Jersey 澤西島
29 Guernsey Guernsey 根西岛
30 Man Adası Đảo Man 马恩岛
31 Birleşik Krallık Vương quốc Liên hiệp Anh và Bắc Ireland 英国
33 Ukrayna Ukraina 乌克兰
geometry
5 POLYGON ((12.43916 41.89839...
28 POLYGON ((-2.018652 49.2312...
29 POLYGON ((-2.512305 49.4945...
30 POLYGON ((-4.412061 54.1853...
31 MULTIPOLYGON (((-2.667676 5...
33 MULTIPOLYGON (((38.21436 47...
I used the following code to join the datasets together:
europe.map1<-st_join(europe_places, europe_map, by="Country")
But when I did, the entries for Venice, Lisbon and Copenhagen had NA values, even though their Country values matched those in the europe_map dataset.
Picking up on the comments above, you have not specified the spatial join correctly. I think this is what you are looking for:
europe.map1<- st_join(europe_places, europe_map,
join=st_within, # always best to specify the method
left=TRUE)
This should work for you. That said, you may want to switch the order of europe_places and europe_map, depending on your goal. More information about the different types of spatial joins is available in the sf package documentation.
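If the intent was instead to match rows on the Country column, note that st_join() matches by spatial relationship, not by column values. A possible alternative is an attribute join against a plain data frame. The sketch below assumes the sf package is installed; places and country_attrs are small made-up stand-ins for europe_places and for the attributes of europe_map (which could be obtained with st_drop_geometry()).

```r
library(sf)

# Hypothetical stand-in for europe_places (coordinates taken from the question)
places <- st_as_sf(
  data.frame(Location = c("Paris", "Berlin", "Lisbon"),
             Country  = c("France", "Germany", "Portugal"),
             lon = c(2.3522, 13.405, -9.1393),
             lat = c(48.8566, 52.52, 38.7223)),
  coords = c("lon", "lat"), crs = 4326
)

# Hypothetical stand-in for the non-spatial attributes of europe_map
country_attrs <- data.frame(Country   = c("France", "Germany"),
                            SUBREGION = c("Western Europe", "Western Europe"))

# Attribute join: keeps the point geometry, matches on Country only
joined <- merge(places, country_attrs, by = "Country", all.x = TRUE)
```

Rows whose Country has no counterpart (Portugal in this toy data) come back with NA attributes, which is a quick way to spot names that differ between the two tables.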
I have two data frames. The first one looks like
Country Year production
Germany 1996 11
France 1996 12
Greece 1996 15
UK 1996 17
USA 1996 24
The second one contains all the countries that are in the first data frame, plus a few more countries, for the year 2018. It looks like this:
Country Year production
Germany 2018 27
France 2018 29
Greece 2018 44
UK 2018 46
USA 2018 99
Austria 2018 56
Japan 2018 66
I would like to merge the two data frames, and the final table should look like this:
Country Year production
Germany 1996 11
France 1996 12
Greece 1996 15
UK 1996 17
USA 1996 24
Austria 1996 NA
Japan 1996 NA
Germany 2018 27
France 2018 29
Greece 2018 44
UK 2018 46
USA 2018 99
Austria 2018 56
Japan 2018 66
I've tried several functions including full_join, merge, and rbind but they didn't work. Does anybody have any ideas?
With dplyr and tidyr, you may use:
bind_rows(df1, df2) %>%
complete(Country, Year)
Country Year production
<chr> <int> <int>
1 Austria 1996 NA
2 Austria 2018 56
3 France 1996 12
4 France 2018 29
5 Germany 1996 11
6 Germany 2018 27
7 Greece 1996 15
8 Greece 2018 44
9 Japan 1996 NA
10 Japan 2018 66
11 UK 1996 17
12 UK 2018 46
13 USA 1996 24
14 USA 2018 99
Consider base R with expand.grid and merge (and avoid any dependencies should you be a package author):
# BUILD DF OF ALL POSSIBLE COMBINATIONS OF COUNTRY AND YEAR
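# as.character() guards against factor columns being combined as integer codes
all_country_years <- expand.grid(Country=unique(c(as.character(df_96$Country),
                                                  as.character(df_18$Country))),
                                 Year=c(1996, 2018))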
# MERGE (LEFT JOIN)
final_df <- merge(all_country_years, rbind(df_96, df_18), by=c("Country", "Year"),
all.x=TRUE)
# ORDER DATA AND RESET ROW NAMES
final_df <- data.frame(with(final_df, final_df[order(Year, Country),]),
row.names = NULL)
final_df
# Country Year production
# 1 Germany 1996 11
# 2 France 1996 12
# 3 Greece 1996 15
# 4 UK 1996 17
# 5 USA 1996 24
# 6 Austria 1996 NA
# 7 Japan 1996 NA
# 8 Germany 2018 27
# 9 France 2018 29
# 10 Greece 2018 44
# 11 UK 2018 46
# 12 USA 2018 99
# 13 Austria 2018 56
# 14 Japan 2018 66
I have a df looking like this:
ID Country
55 Poland
55 Romania
55 France
98 Spain
98 Portugal
98 UK
65 Germany
67 Luxembourg
84 Greece
22 Estonia
22 Lithuania
Some IDs are repeated because they belong to the same group. What I want to do is paste together all the Country values that share an ID.
So far, I have tried
ifelse(df[duplicated(df$ID) | duplicated(df$ID, fromLast = TRUE),], paste('Countries', df$Country), NA)
but this does not give the expected output.
Using data.table
library(data.table)
setDT(df)[, New_Name := c(paste0(Country, collapse = " + ")[1L], rep(NA, .N -1)), by = ID]
#df
#ID Country New_Name
#1: 55 Poland Poland + Romania + France
#2: 55 Romania <NA>
#3: 55 France <NA>
#4: 98 Spain Spain + Portugal + UK
#5: 98 Portugal <NA>
#6: 98 UK <NA>
#7: 65 Germany Germany
#8: 67 Luxembourg Luxembourg
#9: 84 Greece Greece
#10: 22 Estonia Estonia + Lithuania
#11: 22 Lithuania <NA>
Using base R,
replace(v1 <- with(df, ave(as.character(Country), ID, FUN = toString)), duplicated(v1), NA)
#[1] "Poland, Romania, France" NA NA "Spain, Portugal, UK" NA NA "Germany" "Luxembourg" "Greece" "Estonia, Lithuania"
#[11] NA
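The one-liner above can be unpacked into two steps. This sketch rebuilds the question's data and, as a deliberate variation, blanks repeats using duplicated(df$ID) rather than duplicated(v1), so two different IDs that happen to produce the same string would both keep their label. The " + " separator follows the other answers.

```r
df <- data.frame(
  ID = c(55, 55, 55, 98, 98, 98, 65, 67, 84, 22, 22),
  Country = c("Poland", "Romania", "France", "Spain", "Portugal", "UK",
              "Germany", "Luxembourg", "Greece", "Estonia", "Lithuania"),
  stringsAsFactors = FALSE
)

# Step 1: every row gets the collapsed string of its own ID group
joined <- ave(df$Country, df$ID,
              FUN = function(x) paste(x, collapse = " + "))

# Step 2: keep the string on the first row of each ID, NA elsewhere
df$New_Name <- ifelse(duplicated(df$ID), NA, joined)
```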
Using dplyr, one way would be
library(dplyr)
df %>%
group_by(ID) %>%
mutate(new_name = paste0(Country,collapse = " + "),
new_name = replace(new_name, duplicated(new_name), NA))
# ID Country new_name
# <int> <fct> <chr>
# 1 55 Poland Poland + Romania + France
# 2 55 Romania NA
# 3 55 France NA
# 4 98 Spain Spain + Portugal + UK
# 5 98 Portugal NA
# 6 98 UK NA
# 7 65 Germany Germany
# 8 67 Luxembourg Luxembourg
# 9 84 Greece Greece
#10 22 Estonia Estonia + Lithuania
#11 22 Lithuania NA
However, to get your exact expected output we might need
df %>%
group_by(ID) %>%
mutate(new_name = if (n() > 1)
paste0("Countries ", paste0(Country,collapse = " + ")) else Country,
new_name = replace(new_name, duplicated(new_name), NA))
# ID Country new_name
# <int> <fct> <chr>
# 1 55 Poland Countries Poland + Romania + France
# 2 55 Romania NA
# 3 55 France NA
# 4 98 Spain Countries Spain + Portugal + UK
# 5 98 Portugal NA
# 6 98 UK NA
# 7 65 Germany Germany
# 8 67 Luxembourg Luxembourg
# 9 84 Greece Greece
#10 22 Estonia Countries Estonia + Lithuania
#11 22 Lithuania NA
Using aggregate and then matching back for the first time only:
flat <- function(x) paste("Countries:", paste(x,collapse=", "))
tmp <- aggregate(Country ~ ID, data=dat, FUN=flat)
dat$Country <- NA
dat$Country[match(tmp$ID, dat$ID)] <- tmp$Country
# ID Country
#1 55 Countries: Poland, Romania, France
#2 55 <NA>
#3 55 <NA>
#4 98 Countries: Spain, Portugal, UK
#5 98 <NA>
#6 98 <NA>
#7 65 Countries: Germany
#8 67 Countries: Luxembourg
#9 84 Countries: Greece
#10 22 Countries: Estonia, Lithuania
#11 22 <NA>
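With purrr, dplyr and tidyr (nest() and unnest() come from tidyr):
library(dplyr)
library(tidyr)
library(purrr)
# Recent tidyr versions prefer nest(data = -ID) and unnest(cols = data)
df %>%
nest(-ID) %>%
mutate(new_name = map_chr(data, ~ paste0(.x$Country, collapse = " + "))) %>%
unnest()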
Table:
ID new_name Country
55 Poland + Romania + France Poland
55 Poland + Romania + France Romania
55 Poland + Romania + France France
98 Spain + Portugal + UK Spain
98 Spain + Portugal + UK Portugal
98 Spain + Portugal + UK UK
65 Germany Germany
67 Luxembourg Luxembourg
84 Greece Greece
22 Estonia + Lithuania Estonia
22 Estonia + Lithuania Lithuania
I find it a bit hard to find the right words for what I'm trying to do.
Say I have this dataframe:
library(dplyr)
# A tibble: 74 x 3
country year conf_perc
<chr> <dbl> <dbl>
1 Canada 2017 77
2 France 2017 45
3 Germany 2017 60
4 Greece 2017 33
5 Hungary 2017 67
6 Italy 2017 38
7 Canada 2009 88
8 France 2009 91
9 Germany 2009 93
10 Greece 2009 NA
11 Hungary 2009 NA
12 Italy 2009 NA
Now I want to delete the rows that have NA values in 2009, and also remove the 2017 rows for those same countries. I would like to get the following result:
# A tibble: 74 x 3
country year conf_perc
<chr> <dbl> <dbl>
1 Canada 2017 77
2 France 2017 45
3 Germany 2017 60
4 Canada 2009 88
5 France 2009 91
6 Germany 2009 93
We can use any() after grouping by 'country' to drop every country that has at least one NA:
library(dplyr)
df1 %>%
group_by(country) %>%
filter(!any(is.na(conf_perc)))
# A tibble: 6 x 3
# Groups: country [3]
# country year conf_perc
# <chr> <int> <int>
#1 Canada 2017 77
#2 France 2017 45
#3 Germany 2017 60
#4 Canada 2009 88
#5 France 2009 91
#6 Germany 2009 93
Base R solution:
foo <- df$year == 2009 & is.na(df$conf_perc)
bar <- df$year == 2017 & df$country %in% unique(df$country[foo])
df[-c(which(foo), which(bar)), ]
# country year conf_perc
# 1 Canada 2017 77
# 2 France 2017 45
# 3 Germany 2017 60
# 7 Canada 2009 88
# 8 France 2009 91
# 9 Germany 2009 93
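If there are more than two years, the same idea can be written without naming each year. The following is a base R sketch using ave(), rebuilding the example data from the question; it drops every country that has an NA in any year, which matches the requested result here but is a slightly broader rule than "NA in 2009".

```r
df <- data.frame(
  country = rep(c("Canada", "France", "Germany",
                  "Greece", "Hungary", "Italy"), 2),
  year = rep(c(2017, 2009), each = 6),
  conf_perc = c(77, 45, 60, 33, 67, 38, 88, 91, 93, NA, NA, NA),
  stringsAsFactors = FALSE
)

# TRUE for every row of a country that has at least one NA value
has_na <- as.logical(ave(is.na(df$conf_perc), df$country, FUN = any))

# Keep only the countries that are complete in all years
complete_df <- df[!has_na, ]
```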
I don't know what this data transformation is called, nor whether a function exists for it.
My data has this shape:
rank abbrv country eci_value delta year
(int) (fctr) (fctr) (dbl) (int) (int)
1 30 BRA Brazil 0.5588656 2 1995
2 47 URY Uruguay 0.2098838 -14 1995
3 52 PAN Panama 0.1164776 2 1995
4 56 ARG Argentina 0.0013733 7 1995
5 58 VEN Venezuela -0.0329851 11 1995
6 64 COL Colombia -0.2216275 -2 1995
And I want a data frame with just the information provided by "year", "rank" and "country", presented in this way:
country 1995 1996 1997 1998 ...
Peru rank1995 rank1996 rank1997 rank1998 ...
Brazil rank1995 rank1996 rank1997 rank1998 ...
Chile rank1995 rank1996 rank1997 rank1998 ...
... ... ... ... ...
The variable "year" ranges from 1995 to 2014, and the rank varies each year.
I've thought of using the melt and dcast functions from the reshape2 package, but I couldn't get anything useful out of them.
Thanks
This could work for you. Here is an example using dplyr and tidyr, using your small sample above (you will have to test on a larger data set or provide one).
library(dplyr)
library(tidyr)
df
# rank abbrv country eci_value delta year
#1 30 BRA Brazil 0.5588656 2 1995
#2 47 URY Uruguay 0.2098838 -14 1995
#3 52 PAN Panama 0.1164776 2 1995
#4 56 ARG Argentina 0.0013733 7 1995
#5 58 VEN Venezuela -0.0329851 11 1995
#6 64 COL Colombia -0.2216275 -2 1995
df %>% select(country, year, rank) %>% spread(year, rank)
# country 1995
#1 Argentina 56
#2 Brazil 30
#3 Colombia 64
#4 Panama 52
#5 Uruguay 47
#6 Venezuela 58
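For comparison, the same long-to-wide reshaping can be sketched in base R with reshape(). The 1995 ranks below are taken from the sample; the 1996 ranks are invented purely so that the example has a second year to spread.

```r
df <- data.frame(
  country = c("Brazil", "Uruguay", "Brazil", "Uruguay"),
  year    = c(1995, 1995, 1996, 1996),
  rank    = c(30, 47, 28, 50),   # 1996 values are made up for illustration
  stringsAsFactors = FALSE
)

# One row per country, one rank.<year> column per year
wide <- reshape(df, idvar = "country", timevar = "year", direction = "wide")
names(wide)
```

The resulting columns are named rank.1995, rank.1996, and so on; the tidyr documentation also marks spread() as superseded by pivot_wider(), which produces the same shape with plain year column names.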