Merging two data frames with different rows in R

I have two data frames. The first one looks like
Country Year production
Germany 1996 11
France 1996 12
Greece 1996 15
UK 1996 17
USA 1996 24
The second one contains all the countries that are in the first data frame, plus a few more countries, for the year 2018. It looks like this:
Country Year production
Germany 2018 27
France 2018 29
Greece 2018 44
UK 2018 46
USA 2018 99
Austria 2018 56
Japan 2018 66
I would like to merge the two data frames, and the final table should look like this:
Country Year production
Germany 1996 11
France 1996 12
Greece 1996 15
UK 1996 17
USA 1996 24
Austria 1996 NA
Japan 1996 NA
Germany 2018 27
France 2018 29
Greece 2018 44
UK 2018 46
USA 2018 99
Austria 2018 56
Japan 2018 66
I've tried several functions including full_join, merge, and rbind but they didn't work. Does anybody have any ideas?

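With dplyr and tidyr, you can stack the two data frames and let complete() add the missing Country/Year combinations, filled with NA:
library(dplyr)
library(tidyr)
bind_rows(df1, df2) %>%
  complete(Country, Year)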
Country Year production
<chr> <int> <int>
1 Austria 1996 NA
2 Austria 2018 56
3 France 1996 12
4 France 2018 29
5 Germany 1996 11
6 Germany 2018 27
7 Greece 1996 15
8 Greece 2018 44
9 Japan 1996 NA
10 Japan 2018 66
11 UK 1996 17
12 UK 2018 46
13 USA 1996 24
14 USA 2018 99
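The output above is sorted alphabetically by country, whereas the question asks for rows grouped by year with the countries in their original order. A minimal sketch of one way to get that order follows. It assumes the data frames are rebuilt by hand as df1 and df2, and the helper vector country_order is a name introduced here for illustration only.

```r
library(dplyr)
library(tidyr)

# Sample data retyped from the question
df1 <- data.frame(
  Country = c("Germany", "France", "Greece", "UK", "USA"),
  Year = 1996L,
  production = c(11L, 12L, 15L, 17L, 24L),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Country = c("Germany", "France", "Greece", "UK", "USA", "Austria", "Japan"),
  Year = 2018L,
  production = c(27L, 29L, 44L, 46L, 99L, 56L, 66L),
  stringsAsFactors = FALSE
)

# Countries in order of first appearance (hypothetical helper)
country_order <- unique(c(df1$Country, df2$Country))

result <- bind_rows(df1, df2) %>%
  complete(Country, Year) %>%                     # add missing Country/Year pairs as NA
  arrange(Year, match(Country, country_order))    # year first, then original country order

result
```

Sorting with match() against a reference vector avoids converting Country to a factor just to control the row order.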

Consider base R with expand.grid and merge (and avoid any dependencies should you be a package author):
# BUILD DF OF ALL POSSIBLE COMBINATIONS OF COUNTRY AND YEAR
all_country_years <- expand.grid(Country = unique(c(df_96$Country, df_18$Country)),
                                 Year = c(1996, 2018))
# MERGE (LEFT JOIN)
final_df <- merge(all_country_years, rbind(df_96, df_18), by = c("Country", "Year"),
                  all.x = TRUE)
# ORDER DATA AND RESET ROW NAMES
final_df <- data.frame(with(final_df, final_df[order(Year, Country), ]),
                       row.names = NULL)
final_df
# Country Year production
# 1 Germany 1996 11
# 2 France 1996 12
# 3 Greece 1996 15
# 4 UK 1996 17
# 5 USA 1996 24
# 6 Austria 1996 NA
# 7 Japan 1996 NA
# 8 Germany 2018 27
# 9 France 2018 29
# 10 Greece 2018 44
# 11 UK 2018 46
# 12 USA 2018 99
# 13 Austria 2018 56
# 14 Japan 2018 66

Related

Change country names according to a table

I have a data set that contains a column with country names, it looks like this:
> xx
# A tibble: 139 × 5
Country `2012_value` `2013_value` `2014_value` `2015_value`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Albania NA NA NA 35
2 Algeria 1 1 1 NA
3 Andorra NA NA NA 50
4 Antigua & Barbuda NA NA NA 98
5 Argentina NA NA NA 65
6 Armenia NA 44 46 46.5
7 Ascension NA NA NA 100
8 Austria NA NA NA 98
9 Azerbaijan NA NA 49 50
10 Bahamas NA NA NA 95
I would like to change the names of the countries according to a table I have:
> print(itu_emi_countries, n = 30)
# A tibble: 215 × 2
`ITU Name` `EMI Name`
<chr> <chr>
1 Afghanistan Afghanistan
2 Albania Albania
3 Algeria Algeria
4 American Samoa American Samoa
5 Andorra Andorra
6 Angola Angola
7 Antigua & Barbuda Antigua
8 Argentina Argentina
9 Armenia Armenia
10 Aruba Aruba
11 Australia Australia
12 Austria Austria
13 Azerbaijan Azerbaijan
14 Bahamas Bahamas
15 Bahrain Bahrain
16 Bangladesh Bangladesh
17 Barbados Barbados
18 Belarus Belarus
19 Belgium Belgium
20 Belize Belize
21 Benin Benin
22 Bermuda Bermuda
23 Bhutan Bhutan
24 Bolivia Bolivia
25 Bosnia and Herzegovina Bosnia-Herzegovina
26 Botswana Botswana
27 Brazil Brazil
28 Brunei Darussalam Brunei
29 Bulgaria Bulgaria
30 Burkina Faso Burkina Faso
# … with 185 more rows
The country names are written as in the first column, and I want to change them to the second column. How can I do this?
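A simple rename and left_join does the trick. Note that the join adds the `EMI Name` column next to Country rather than overwriting it, so you still have to replace or drop the old column afterwards:
library(tidyverse)
itu_emi_countries <- itu_emi_countries %>%
  rename(Country = `ITU Name`)
left_join(xx, itu_emi_countries, by = "Country")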
With dplyr, you could use recode() and pass a named vector indicating the relations between the old and new names.
library(dplyr)
xx %>%
mutate(Country = recode(Country, !!!tibble::deframe(itu_emi_countries)))
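For comparison, the same lookup can be sketched in base R with match(). This is a minimal illustration under stated assumptions: the two small data frames below are made-up excerpts, not the asker's full tables, and "Ascension" is assumed here to have no entry in the lookup so that the fallback is visible.

```r
# Illustrative excerpts only (not the full tables from the question)
xx <- data.frame(
  Country = c("Albania", "Antigua & Barbuda", "Ascension"),
  value = c(35, 98, 100),
  stringsAsFactors = FALSE
)
itu_emi_countries <- data.frame(
  `ITU Name` = c("Albania", "Antigua & Barbuda", "Bosnia and Herzegovina"),
  `EMI Name` = c("Albania", "Antigua", "Bosnia-Herzegovina"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Position of each country in the lookup table (NA when there is no match)
idx <- match(xx$Country, itu_emi_countries$`ITU Name`)

# Use the EMI name where a match exists, otherwise keep the original name
xx$Country <- ifelse(is.na(idx), xx$Country, itu_emi_countries$`EMI Name`[idx])

xx
```

Keeping unmatched names unchanged mirrors what recode() does by default for values that are not listed.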

How do I replicate the same column values for the next 2 cells in R [duplicate]

This question already has answers here:
Replacing NAs with latest non-NA value
(21 answers)
Closed 1 year ago.
I am trying to replicate the same column values for the next 2 cells in the column using R.
I have a data-frame of the following form:
Time World Cate Data
1994 Africa A 12
1994 B 17
1994 C 22
1994 Asia A 55
1994 B 10
1994 C 58
1995 Africa A 62
1995 B 87
1995 C 12
1995 Asia A 59
1995 B 12
1995 C 38
and I want to convert it to the following form:
Time World Cate Data
1994 Africa A 12
1994 Africa B 17
1994 Africa C 22
1994 Asia A 55
1994 Asia B 10
1994 Asia C 58
1995 Africa A 62
1995 Africa B 87
1995 Africa C 12
1995 Asia A 59
1995 Asia B 12
1995 Asia C 38
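Use fill from the tidyr package. fill() only fills NA values, so if the blank cells are empty strings they have to be converted to NA first.
If your data frame is called dat, then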
dat <- tidyr::fill(dat, World)
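A short sketch of the full round trip, assuming the blank cells were read in as empty strings rather than NA (the question does not say which). The data frame below retypes the first six rows of the example.

```r
library(tidyr)

# First six rows of the question's data, with blanks stored as ""
dat <- data.frame(
  Time = rep(1994L, 6),
  World = c("Africa", "", "", "Asia", "", ""),
  Cate = rep(c("A", "B", "C"), times = 2),
  Data = c(12, 17, 22, 55, 10, 58),
  stringsAsFactors = FALSE
)

# fill() ignores empty strings, so mark them as missing first
dat$World[dat$World == ""] <- NA

# Carry the last non-missing value downwards (the default direction)
dat <- fill(dat, World)

dat
```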
Using the na.locf function from the zoo package (it also carries forward NA values only):
library(zoo)
na.locf(df)
Time World Cate Data
1 1994 Africa A 12
2 1994 Africa B 17
3 1994 Africa C 22
4 1994 Asia A 55
5 1994 Asia B 10
6 1994 Asia C 58
7 1995 Africa A 62
8 1995 Africa B 87
9 1995 Africa C 12
10 1995 Asia A 59
11 1995 Asia B 12
12 1995 Asia C 38
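Base R, assuming every World value is followed by exactly two blank cells (blocks of three rows) and the data frame is called dummy:
dummy$World <- rep(dummy$World[seq(1, nrow(dummy), by = 3)], each = 3)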
dummy

Remove rows with NA values and delete those observations in another year [duplicate]

This question already has answers here:
Filter rows in R based on values in multiple rows
(2 answers)
Closed 5 years ago.
I find it a bit hard to find the right words for what I'm trying to do.
Say I have this dataframe:
library(dplyr)
# A tibble: 74 x 3
country year conf_perc
<chr> <dbl> <dbl>
1 Canada 2017 77
2 France 2017 45
3 Germany 2017 60
4 Greece 2017 33
5 Hungary 2017 67
6 Italy 2017 38
7 Canada 2009 88
8 France 2009 91
9 Germany 2009 93
10 Greece 2009 NA
11 Hungary 2009 NA
12 Italy 2009 NA
Now I want to delete the rows that have NA values in 2009 but then I want to remove the rows of those countries in 2017 as well. I would like to get the following results:
# A tibble: 74 x 3
country year conf_perc
<chr> <dbl> <dbl>
1 Canada 2017 77
2 France 2017 45
3 Germany 2017 60
4 Canada 2009 88
5 France 2009 91
6 Germany 2009 93
Group by 'country' and keep only the groups in which no conf_perc value is NA, using any():
library(dplyr)
df1 %>%
group_by(country) %>%
filter(!any(is.na(conf_perc)))
# A tibble: 6 x 3
# Groups: country [3]
# country year conf_perc
# <chr> <int> <int>
#1 Canada 2017 77
#2 France 2017 45
#3 Germany 2017 60
#4 Canada 2009 88
#5 France 2009 91
#6 Germany 2009 93
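Base R solution: flag the 2009 rows with NA, flag the 2017 rows of the same countries, and drop both sets with a logical index (a negative which() index would return zero rows if nothing matched):
foo <- df$year == 2009 & is.na(df$conf_perc)
bar <- df$year == 2017 & df$country %in% unique(df$country[foo])
df[!(foo | bar), ]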
# country year conf_perc
# 1 Canada 2017 77
# 2 France 2017 45
# 3 Germany 2017 60
# 7 Canada 2009 88
# 8 France 2009 91
# 9 Germany 2009 93
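If the rule is simply "drop every country that has an NA in any year", the base R version can be shortened. A minimal sketch, assuming the data frame is called df and retyping the twelve rows shown in the question; bad_countries is a name introduced here for illustration.

```r
# Rows retyped from the question
df <- data.frame(
  country = rep(c("Canada", "France", "Germany", "Greece", "Hungary", "Italy"), times = 2),
  year = rep(c(2017, 2009), each = 6),
  conf_perc = c(77, 45, 60, 33, 67, 38, 88, 91, 93, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Countries with at least one missing conf_perc value (hypothetical helper name)
bad_countries <- unique(df$country[is.na(df$conf_perc)])

# Keep only the rows of countries that are complete in every year
complete_df <- df[!df$country %in% bad_countries, ]

complete_df
```

Unlike the year-specific version, this does not hard-code 2009 and 2017, so it also covers data with more than two years.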

Data transformation, almost like when you use cast and melt

I don't know what this data transformation is called, nor whether a function exists for it.
My data has this shape:
rank abbrv country eci_value delta year
(int) (fctr) (fctr) (dbl) (int) (int)
1 30 BRA Brazil 0.5588656 2 1995
2 47 URY Uruguay 0.2098838 -14 1995
3 52 PAN Panama 0.1164776 2 1995
4 56 ARG Argentina 0.0013733 7 1995
5 58 VEN Venezuela -0.0329851 11 1995
6 64 COL Colombia -0.2216275 -2 1995
And I want a data frame with just the information provided by "year", "rank" and "country", presented in this way:
country 1995 1996 1997 1998 ...
Peru rank1995 rank1996 rank1997 rank1998 ...
Brazil rank1995 rank1996 rank1997 rank1998 ...
Chile rank1995 rank1996 rank1997 rank1998 ...
... ... ... ... ...
The variable "year" ranges from 1995 to 2014, and the rank varies each year.
I've thought of using the melt and dcast functions from the reshape2 package, but nothing useful comes out.
Thanks
This could work for you. Here is an example using dplyr and tidyr, using your small sample above (you will have to test on a larger data set or provide one).
library(dplyr)
library(tidyr)
df
# rank abbrv country eci_value delta year
#1 30 BRA Brazil 0.5588656 2 1995
#2 47 URY Uruguay 0.2098838 -14 1995
#3 52 PAN Panama 0.1164776 2 1995
#4 56 ARG Argentina 0.0013733 7 1995
#5 58 VEN Venezuela -0.0329851 11 1995
#6 64 COL Colombia -0.2216275 -2 1995
df %>% select(country, year, rank) %>% spread(year, rank)
# country 1995
#1 Argentina 56
#2 Brazil 30
#3 Colombia 64
#4 Panama 52
#5 Uruguay 47
#6 Venezuela 58
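In current tidyr releases spread() is superseded by pivot_wider(), which does the same reshaping. The sketch below shows it on more than one year. The 1996 ranks are invented purely for illustration, since the question only shows 1995.

```r
library(tidyr)

# Two countries over two years; the 1996 ranks are made-up placeholder values
df <- data.frame(
  rank = c(30L, 47L, 28L, 50L),
  country = c("Brazil", "Uruguay", "Brazil", "Uruguay"),
  year = c(1995L, 1995L, 1996L, 1996L),
  stringsAsFactors = FALSE
)

# One row per country, one column per year, cells filled with the rank
wide <- pivot_wider(
  df[, c("country", "year", "rank")],
  names_from = year,
  values_from = rank
)

wide
```

Country/year combinations that are absent from the long data show up as NA in the wide table.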

Select "europe" from df

my df2:
League freq
18 England 108
27 Italy 79
20 Germany 74
43 Spain 64
19 France 49
39 Russia 34
31 Mexico 27
47 Turkey 24
32 Netherlands 23
37 Portugal 21
49 United States 18
29 Japan 16
25 Iran 15
7 Brazil 13
22 Greece 13
14 Costa 11
45 Switzerland 11
5 Belgium 10
17 Ecuador 10
23 Honduras 10
42 South Korea 9
2 Argentina 8
48 Ukraine 7
3 Australia 6
11 Chile 6
12 China 6
15 Croatia 6
35 Norway 6
41 Scotland 6
34 Nigeria 5
I am trying to select the European rows:
europe <- subset(df2, nrow(x=18, 27, 20) select=c(1, 2))
What is the most effective way to select europe, africa, Asia ... from df2?
You either need to identify which countries are on which continents by hand, or you might be able to scrape this information from somewhere:
(basic strategy from Scraping html tables into R data frames using the XML package)
library(XML)
theurl <- "http://en.wikipedia.org/wiki/List_of_European_countries_by_area"
tables <- readHTMLTable(theurl)
library(stringr)
europe_names <- str_extract(as.character(tables[[1]]$Country),"[[:alpha:] ]+")
head(sort(europe_names))
## [1] "Albania" "Andorra" "Austria" "Azerbaijan" "Belarus"
## [6] "Belgium"
## there's also a 'Total' entry in here but it's probably harmless ...
subset(df2,League %in% europe_names)
Of course you'd have to figure this out again for Asia, America, etc.
So here's a slightly different approach from @BenBolker's, using the countrycode package.
library(countrycode)
cdb <- countrycode_data # database of countries
df2[toupper(df2$League) %in% cdb[cdb$continent=="Europe",]$country.name,]
# League freq
# 27 Italy 79
# 20 Germany 74
# 43 Spain 64
# 19 France 49
# 32 Netherlands 23
# 37 Portugal 21
# 22 Greece 13
# 45 Switzerland 11
# 5 Belgium 10
# 48 Ukraine 7
# 15 Croatia 6
# 35 Norway 6
One problem you're going to have is that "England" is not a country in any database (rather, "United Kingdom"), so you'll have to deal with that as a special case.
Also, this database considers the "Americas" as a continent.
df2[toupper(df2$League) %in% cdb[cdb$continent=="Americas",]$country.name,]
so to get just South America you have to use the region field:
df2[toupper(df2$League) %in% cdb[cdb$region=="South America",]$country.name,]
# League freq
# 7 Brazil 13
# 17 Ecuador 10
# 2 Argentina 8
# 11 Chile 6
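Another way to sketch this is to let the countrycode() function do the name matching and return the continent directly, handling "England" and "Scotland" through its custom_match argument. This assumes a reasonably recent countrycode release. The six-row df2 below is a shortened, retyped sample of the question's data.

```r
library(countrycode)

# Shortened sample of df2 from the question
df2 <- data.frame(
  League = c("England", "Italy", "Germany", "Brazil", "Japan", "Scotland"),
  freq = c(108, 79, 74, 13, 16, 6),
  stringsAsFactors = FALSE
)

# Map each name to a continent; England and Scotland are not sovereign
# country names, so they are assigned by hand via custom_match
df2$continent <- countrycode(
  df2$League,
  origin = "country.name",
  destination = "continent",
  custom_match = c(England = "Europe", Scotland = "Europe")
)

# %in% treats unmatched (NA) continents as FALSE instead of returning NA rows
europe <- df2[df2$continent %in% "Europe", ]

europe
```

With the continent stored as a column, the other groups follow from the same pattern, for example split(df2, df2$continent).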
