Creating a column with differences based on another column - r

I have a data frame that looks like this (simplified from 699 treaties):
TRT <- data.frame(T.ID=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8),
Treaty=c("hungary slovenia 1994", "hungary slovenia 1994",
"nicaragua taiwan 2006", "nicaragua taiwan 2006",
"ukraine uzbekistan 1994", "ukraine uzbekistan 1994",
"brazil uruguay 1986", "brazil uruguay 1986",
"albania macedonia 2002", "albania macedonia 2002",
"albania moldova 2003", "albania moldova 2003",
"albania romania 2003", "albania romania 2003",
"Treaty of Izmir 1977","Treaty of Izmir 1977",
"Treaty of Izmir 1977"),
sc.y=c("HUN1994", "SLV1994", "NIC2006", "TAW2006", "UKR1994",
"UZB1994", "BRA1986", "URU1986", "ALB2002", "MAC2002",
"ALB2003", "MLD2003", "ALB2003", "RUM2003", "IRN1977",
"TUR1977", "PAK1977"),
prom.demo=c(1,1,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0),
polity=c(10,10,8,10,7,-9,7,9,7,9,7,8,7,8,-10,-7,9))
In the end, I want to have a data frame that lists each treaty only once, its value of the “prom.demo”-column and one column that contains the difference of the maximum and minimum among the “polity”-values of the contracting parties of each treaty (most treaties have only two contracting parties, but some have up to 51).
Is there any R command that spares me 699 calculations?

Using dplyr, it's a join on scode and year, followed by grouping by Treaty and then working out the difference between the max and min polity:
library(dplyr)
left_join(treaties, Polity, c("scode","year")) %>%
  group_by(Treaty) %>%
  summarise(PolityDiff = max(polity, na.rm = TRUE) - min(polity, na.rm = TRUE))
Source: local data frame [8 x 2]
                   Treaty PolityDiff
1  albania macedonia 2002          2
2    albania moldova 2003          1
3    albania romania 2003          1
4     brazil uruguay 1986          2
5   hungary slovenia 1994          0
6   nicaragua taiwan 2006          2
7    Treaty of Izmir 1977         NA
8 ukraine uzbekistan 1994         16
The NA's are where you don't have any matching scode/year (The Treaty of Izmir is IRN/TUR/PAK in 1977, and none of those are in the Polity data).
Note that if you want NA whenever any one of the participating countries is not in the Polity data, use:
left_join(treaties, Polity, c("scode","year")) %>% group_by(Treaty) %>% summarise(PolityDiff=max(polity)-min(polity))
which gives:
                   Treaty PolityDiff
1  albania macedonia 2002          2
2    albania moldova 2003          1
3    albania romania 2003          1
4     brazil uruguay 1986          2
5   hungary slovenia 1994         NA
6   nicaragua taiwan 2006          2
7    Treaty of Izmir 1977         NA
8 ukraine uzbekistan 1994         16
because Slovenia is coded as SLV in Polity but as SLO in the treaties (a mistake?). Either way, there's no SLO/1994 in Polity, so that treaty returns NA in this variant. It returns zero in my first example because the NA gets dropped and the polity difference is then the difference between one number and itself.
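For the TRT data frame exactly as posted in the question, polity is already a column, so no join is needed. The following is a minimal base R sketch rather than the dplyr pipeline above. It rebuilds only the three columns it needs, borrows the column name PolityDiff from the answer, and assumes prom.demo is constant within a treaty (as it is in the sample). Note that the formula method of aggregate() drops rows with missing values by default.

```r
# Rebuild the relevant columns of TRT from the question.
TRT <- data.frame(
  Treaty = rep(c("hungary slovenia 1994", "nicaragua taiwan 2006",
                 "ukraine uzbekistan 1994", "brazil uruguay 1986",
                 "albania macedonia 2002", "albania moldova 2003",
                 "albania romania 2003", "Treaty of Izmir 1977"),
               times = c(2, 2, 2, 2, 2, 2, 2, 3)),
  prom.demo = c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0),
  polity = c(10, 10, 8, 10, 7, -9, 7, 9, 7, 9, 7, 8, 7, 8, -10, -7, 9),
  stringsAsFactors = FALSE
)

# One row per treaty: keep prom.demo and compute max - min of polity.
res <- aggregate(polity ~ Treaty + prom.demo, data = TRT,
                 FUN = function(p) max(p) - min(p))
names(res)[names(res) == "polity"] <- "PolityDiff"
res
```

With the real 699 treaties the same call should apply unchanged, provided the column names match those in the sample.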

Related

Change country names according to a table

I have a data set that contains a column with country names, it looks like this:
> xx
# A tibble: 139 × 5
Country `2012_value` `2013_value` `2014_value` `2015_value`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Albania NA NA NA 35
2 Algeria 1 1 1 NA
3 Andorra NA NA NA 50
4 Antigua & Barbuda NA NA NA 98
5 Argentina NA NA NA 65
6 Armenia NA 44 46 46.5
7 Ascension NA NA NA 100
8 Austria NA NA NA 98
9 Azerbaijan NA NA 49 50
10 Bahamas NA NA NA 95
I would like to change the names of the countries according to a table I have:
> print(itu_emi_countries, n = 30)
# A tibble: 215 × 2
`ITU Name` `EMI Name`
<chr> <chr>
1 Afghanistan Afghanistan
2 Albania Albania
3 Algeria Algeria
4 American Samoa American Samoa
5 Andorra Andorra
6 Angola Angola
7 Antigua & Barbuda Antigua
8 Argentina Argentina
9 Armenia Armenia
10 Aruba Aruba
11 Australia Australia
12 Austria Austria
13 Azerbaijan Azerbaijan
14 Bahamas Bahamas
15 Bahrain Bahrain
16 Bangladesh Bangladesh
17 Barbados Barbados
18 Belarus Belarus
19 Belgium Belgium
20 Belize Belize
21 Benin Benin
22 Bermuda Bermuda
23 Bhutan Bhutan
24 Bolivia Bolivia
25 Bosnia and Herzegovina Bosnia-Herzegovina
26 Botswana Botswana
27 Brazil Brazil
28 Brunei Darussalam Brunei
29 Bulgaria Bulgaria
30 Burkina Faso Burkina Faso
# … with 185 more rows
The country names are written as in the first column, and I want to change them to the second column. How can I do this?
A simple rename and left_join does the trick:
library(tidyverse)
itu_emi_countries <- itu_emi_countries %>%
rename(Country = `ITU Name`)
left_join(xx, itu_emi_countries, by = "Country")
With dplyr, you could use recode() and pass a named vector indicating the relations between the old and new names.
library(dplyr)
xx %>%
mutate(Country = recode(Country, !!!tibble::deframe(itu_emi_countries)))
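Two details may matter when choosing between these: the join adds the EMI Name column next to Country rather than overwriting it, and a country missing from the lookup table (Ascension does not appear among the rows printed) comes back as NA from the join. A minimal base R sketch with match() that keeps the original name when there is no match; the two small data frames are hand-made excerpts, not the real xx and itu_emi_countries:

```r
# Hand-made excerpts (assumption: the real tables have the same column names).
xx <- data.frame(Country = c("Albania", "Antigua & Barbuda", "Ascension"),
                 `2015_value` = c(35, 98, 100),
                 check.names = FALSE, stringsAsFactors = FALSE)
itu_emi_countries <- data.frame(
  `ITU Name` = c("Albania", "Antigua & Barbuda", "Bosnia and Herzegovina"),
  `EMI Name` = c("Albania", "Antigua", "Bosnia-Herzegovina"),
  check.names = FALSE, stringsAsFactors = FALSE)

# Position of each country in the lookup table (NA when absent).
idx <- match(xx$Country, itu_emi_countries$`ITU Name`)

# Use the EMI name where one exists, otherwise keep the original name.
xx$Country <- ifelse(is.na(idx), xx$Country, itu_emi_countries$`EMI Name`[idx])
xx$Country
```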

Error in `.rowNamesDF<-`(x, value = value) : 'row.names' duplicate are not allowed. In addition: Warning message: non-unique values

I have the following dataframe total_authority
structure(list(country = c("Albania", "Algeria", "American Somoa",
"Angola", "Anguilla", "Antigua", "Argentina", "Armenia", "Aruba",
"Australia"), `1994` = c(0.0000000000000000312250225675825, 0.0000000000000000312250225675825,
0.0000000000000000312250225675825, 0.0000000000000000312250225675825,
0.0000000000000000312250225675825, 0.0000000000000000312250225675825,
0.00289122132708816, 0.0000000000000000312250225675825, 0.00000528966979389429,
0.00622391681538348), country.1 = c("Albania", "Algeria", "American Somoa",
"Angola", "Anguilla", "Antigua", "Argentina", "Armenia", "Aruba",
"Australia"), `1995` = c(0.00000320558770721281, 0.0000000000000000277555756156289,
0.0000000000000000277555756156289, 0.0000000000000000277555756156289,
0.0000000000000000277555756156289, 0.0000000000000000277555756156289,
0.0224538010858487, 0.0000000000000000277555756156289, 0.0000000000000000277555756156289,
0.407633483379219)), row.names = c(NA, 10L), class = "data.frame")
which I would like to rearrange in such a way the first column contains the countries, the second denotes the year and the third the value scored by the countries in that year.
Visually, the dataframe total_authority is now
country 1994 country.1 1995
1 Albania 0.00000000000000003122502 Albania 0.00000320558770721280500
2 Algeria 0.00000000000000003122502 Algeria 0.00000000000000002775558
3 American Somoa 0.00000000000000003122502 American Somoa 0.00000000000000002775558
4 Angola 0.00000000000000003122502 Angola 0.00000000000000002775558
5 Anguilla 0.00000000000000003122502 Anguilla 0.00000000000000002775558
6 Antigua 0.00000000000000003122502 Antigua 0.00000000000000002775558
7 Argentina 0.00289122132708816148572 Argentina 0.02245380108584869860433
8 Armenia 0.00000000000000003122502 Armenia 0.00000000000000002775558
9 Aruba 0.00000528966979389429437 Aruba 0.00000000000000002775558
10 Australia 0.00622391681538347896208 Australia 0.40763348337921861963551
The desired result is instead:
country score year
Albania 0.00000000000000003122502 1994
Algeria 0.00000000000000003122502 1994
American Somoa 0.00000000000000003122502 1994
Angola 0.00000000000000003122502 1994
Anguilla 0.00000000000000003122502 1994
Antigua 0.00000000000000003122502 1994
Argentina 0.00289122132708816148572 1994
Armenia 0.00000000000000003122502 1994
Aruba 0.00000528966979389429437 1994
Australia 0.00622391681538347896208 1994
Albania 0.00000320558770721280500 1995
Algeria 0.00000000000000002775558 1995
American Somoa 0.00000000000000002775558 1995
Angola 0.00000000000000002775558 1995
Anguilla 0.00000000000000002775558 1995
Antigua 0.00000000000000002775558 1995
Argentina 0.02245380108584869860433 1995
Armenia 0.00000000000000002775558 1995
Aruba 0.00000000000000002775558 1995
Australia 0.40763348337921861963551 1995
This is my attempt (count index of the for loop ranges between 1 and 2 but it is just an example):
actors<-c("Albania", "Algeria", "American Somoa", "Angola", "Anguilla", "Antigua", "Argentina", "Armenia", "Aruba", "Australia")
final_output<-data.frame()
for (count in 1:2) {
df <- data.frame(country=actors)
df$year=rep(names(total_authority)[2*count],nrow(df))
df$authority<-total_authority[2*count]
final_output <- rbind(final_output, df)
}
Anyway, I obtained the following error:
Error in `.rowNamesDF<-`(x, value = value) :
'row.names' duplicate are not allowed.
In addition: Warning message:
non-unique values when setting 'row.names': ‘1’, ‘10’, ‘2’, ‘3’, ‘4’, ‘5’, ‘6’, ‘7’, ‘8’, ‘9’
We don't need a for loop here. Just index the data.frame to subset the columns, unlist them, and construct the data.frame directly:
out <- data.frame(country = unlist(total_authority[c(1,3)]),
score = unlist(total_authority[c(2,4)]),
year = rep(names(total_authority)[c(2,4)], each = nrow(total_authority)))
row.names(out) <- NULL
-output
> out
country score year
1 Albania 0.00000000000000003122502 1994
2 Algeria 0.00000000000000003122502 1994
3 American Somoa 0.00000000000000003122502 1994
4 Angola 0.00000000000000003122502 1994
5 Anguilla 0.00000000000000003122502 1994
6 Antigua 0.00000000000000003122502 1994
7 Argentina 0.00289122132708816018468 1994
8 Armenia 0.00000000000000003122502 1994
9 Aruba 0.00000528966979389429013 1994
10 Australia 0.00622391681538347982944 1994
11 Albania 0.00000320558770721281009 1995
12 Algeria 0.00000000000000002775558 1995
13 American Somoa 0.00000000000000002775558 1995
14 Angola 0.00000000000000002775558 1995
15 Anguilla 0.00000000000000002775558 1995
16 Antigua 0.00000000000000002775558 1995
17 Argentina 0.02245380108584869860433 1995
18 Armenia 0.00000000000000002775558 1995
19 Aruba 0.00000000000000002775558 1995
20 Australia 0.40763348337921900821357 1995
The error about duplicate row.names occurs because total_authority[2*count] returns a data.frame with a single column ([), so the authority column is itself a data.frame. What we need instead is a vector, obtained by extracting the column with [[:
final_output<-data.frame()
for (count in 1:2) {
df <- data.frame(country=actors)
df$year=rep(names(total_authority)[2*count],nrow(df))
df$authority<-total_authority[[2*count]]
final_output <- rbind(final_output, df)
}
-output
> final_output
country year authority
1 Albania 1994 0.00000000000000003122502
2 Algeria 1994 0.00000000000000003122502
3 American Somoa 1994 0.00000000000000003122502
4 Angola 1994 0.00000000000000003122502
5 Anguilla 1994 0.00000000000000003122502
6 Antigua 1994 0.00000000000000003122502
7 Argentina 1994 0.00289122132708816018468
8 Armenia 1994 0.00000000000000003122502
9 Aruba 1994 0.00000528966979389429013
10 Australia 1994 0.00622391681538347982944
11 Albania 1995 0.00000320558770721281009
12 Algeria 1995 0.00000000000000002775558
13 American Somoa 1995 0.00000000000000002775558
14 Angola 1995 0.00000000000000002775558
15 Anguilla 1995 0.00000000000000002775558
16 Antigua 1995 0.00000000000000002775558
17 Argentina 1995 0.02245380108584869860433
18 Armenia 1995 0.00000000000000002775558
19 Aruba 1995 0.00000000000000002775558
20 Australia 1995 0.40763348337921900821357
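The difference between the two extraction operators can be seen on a small made-up data frame (the names df, one_col and vec are only for illustration):

```r
df <- data.frame(a = 1:3, b = c(0.1, 0.2, 0.3))

one_col <- df[2]    # single bracket: still a data.frame with one column
vec     <- df[[2]]  # double bracket: the underlying numeric vector

class(one_col)
class(vec)
```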

Struggling to filter data in R

Here is the data that I am using :
https://www.dropbox.com/s/dl/chmzqmus6bfoaim/climate_clean.csv
I added a variable called average_temperature_fahrenheit like this:
climate = mutate(climate, average_temperature_fahrenheit = 9/5*average_temperature_celsius+32)
Now I want to know the highest temperature in Fahrenheit during the months of June, July and August between 1970 and 1980 in Europe and North America, so I thought I needed to filter my data frame climate like this:
climate %>% filter(continent == c("Europe","North America") & month == c("Jun","Jul","Aug") & year[1970:1980])
But clearly I did not succeed, because it shows me only the month of August.
Could you please tell me where I went wrong in the filter() call?
You could go with
library(data.table)
library(dplyr)  # provides %>%, mutate, filter, group_by and summarise
climate <-fread("~/downloads/climate_clean.csv")
climate %>%
mutate(average_temperature_fahrenheit = 9/5*average_temperature_celsius+32) %>%
filter(year %in% 1970:1980, month %in% c("Jun", "Jul", "Aug")) %>%
group_by(continent) %>%
summarise(MaxF = max(average_temperature_fahrenheit))
# A tibble: 6 x 2
continent MaxF
<chr> <dbl>
1 Africa 93.6
2 Asia 98.6
3 Europe 83.9
4 North America 83.0
5 Oceania 83.7
6 South America 80.2
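As for what went wrong in the original filter(): == compares element by element and recycles the shorter vector, so it is not a membership test, and year[1970:1980] picks elements by position instead of testing their values. A small sketch on made-up vectors (not the climate data) shows the difference:

```r
month <- c("Jun", "Jul", "Aug", "Aug", "Jul", "Jun")
year  <- c(1969, 1975, 1981)

# `==` recycles the right-hand side to Jun, Jul, Aug, Jun, Jul, Aug
# and compares position by position.
eq <- month == c("Jun", "Jul", "Aug")

# `%in%` tests each element for membership in the set.
isin <- month %in% c("Jun", "Jul", "Aug")

# Testing values against a range of years.
in_range <- year %in% 1970:1980

eq        # TRUE TRUE TRUE FALSE TRUE FALSE
isin      # all TRUE
in_range  # FALSE TRUE FALSE
```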
You could reverse the rank function to use it in ave and subset the data where it's 1. Instead of creating a long spaghetti of code, I'd do the subsetting in multiple steps.
## subset by month, year, continent
res <- climate[climate$month %in% month.abb[6:8] & climate$year %in% 1970:1980 &
climate$continent %in% c("Europe", "North America"), ]
## create rank
res <- transform(res,
ave.temp.F=average_temperature_celsius * 9/5 + 32,
rank=ave(average_temperature_celsius, iso3,
FUN=function(x) rev(rank(x))))
## subset by rank
res <- res[res$rank == 1, ]
## some ordering (`-9` drops the rank column)
res <- res[order(res$continent, res$year, res$month, res$iso3), -9]
Result
res
# country iso3 continent year month average_temperature_celsius average_rainfall_mm ave.temp.F
# 177956 Italy ITA Europe 1970 Aug 20.41250 77.53980 68.74250
# 189836 Monaco MCO Europe 1970 Aug 17.54660 51.22330 63.58388
# 179535 Kazakhstan KAZ Europe 1971 Aug 19.82530 18.28780 67.68554
# 148959 Armenia ARM Europe 1972 Aug 19.71050 34.85870 67.47890
# 150399 Azerbaijan AZE Europe 1972 Aug 23.73650 18.82320 74.72570
# 206420 Spain ESP Europe 1972 Aug 20.46590 29.92810 68.83862
# 170559 Georgia GEO Europe 1972 Aug 19.18560 66.58620 66.53408
# 198500 Portugal PRT Europe 1972 Aug 20.93670 6.03864 69.68606
# 200799 Russia RUS Europe 1972 Aug 11.89720 61.69850 53.41496
# 146684 Albania ALB Europe 1974 Aug 21.45010 46.19520 70.61018
# 152444 Belarus BLR Europe 1974 Aug 17.00830 43.09090 62.61494
# 163047 Cyprus CYP Europe 1974 Aug 27.13730 3.16681 80.84714
# 167204 Estonia EST Europe 1974 Aug 15.27410 56.88330 59.49338
# 169004 Finland FIN Europe 1974 Aug 13.47990 91.12890 56.26382
# 171884 Greece GRC Europe 1974 Aug 23.14990 13.01370 73.66982
# 184844 Lithuania LTU Europe 1974 Aug 16.37240 53.21470 61.47032
# 182684 Latvia LVA Europe 1974 Aug 15.71120 59.34710 60.28016
# 185564 Macedonia MKD Europe 1974 Aug 20.34650 33.97170 68.62370
# 215084 Ukraine UKR Europe 1974 Aug 19.58960 39.45000 67.26128
# 156042 Bulgaria BGR Europe 1974 Jun 17.94660 66.96620 64.30388
# 154602 Bosnia and Herzegovina BIH Europe 1974 Jun 15.80810 128.16000 60.45458
# 189522 Moldova MDA Europe 1974 Jun 18.23260 87.43030 64.81868
# 199602 Republic of Montenegro MNE Europe 1974 Jun 14.96460 103.38300 58.93628
# 175496 Iceland ISL Europe 1975 Aug 8.40970 87.26150 47.13746
# 187736 Malta MLT Europe 1975 Aug 25.47840 36.46360 77.86112
# 194936 Norway NOR Europe 1975 Aug 11.25500 74.38320 52.25900
# 209336 Sweden SWE Europe 1975 Aug 13.22820 60.74740 55.81076
# 149948 Austria AUT Europe 1976 Aug 12.60120 98.79700 54.68216
# 163628 Czech Republic CZE Europe 1976 Aug 15.39740 55.42650 59.71532
# 162188 Croatia HRV Europe 1976 Aug 16.69060 98.21130 62.04308
# 175148 Hungary HUN Europe 1976 Aug 17.16470 48.66250 62.89646
# 198188 Poland POL Europe 1976 Aug 15.82960 48.73520 60.49328
# 200348 Romania ROU Europe 1976 Aug 16.03970 85.58330 60.87146
# 199988 Republic of Serbia SRB Europe 1976 Aug 16.82680 78.80250 62.28824
# 204308 Slovakia SVK Europe 1976 Aug 14.86390 53.49560 58.75502
# 204668 Slovenia SVN Europe 1976 Aug 14.87980 98.93020 58.78364
# 213519 Turkey TUR Europe 1977 Aug 22.37990 7.52567 72.28382
# 147452 Andorra AND Europe 1978 Aug 18.81170 33.16530 65.86106
# 152852 Belgium BEL Europe 1978 Aug 15.65460 39.78100 60.17828
# 169412 France FRA Europe 1978 Aug 17.11220 41.63730 62.80196
# 215852 United Kingdom GBR Europe 1978 Aug 13.77240 83.64290 56.79032
# 177332 Ireland IRL Europe 1978 Aug 13.97130 86.99350 57.14834
# 185252 Luxembourg LUX Europe 1978 Aug 15.21680 43.07880 59.39024
# 192452 Netherlands NLD Europe 1978 Aug 15.46520 48.70340 59.83736
# 209744 Switzerland CHE Europe 1979 Aug 12.49180 191.84800 54.48524
# 171224 Germany DEU Europe 1979 Aug 15.89550 71.42340 60.61190
# 164024 Denmark DNK Europe 1979 Aug 15.08630 80.27490 59.15534
# 167984 Faroe Islands FRO Europe 1979 Aug 9.45723 82.92610 49.02301
# 184544 Liechtenstein LIE Europe 1979 Aug 9.05277 217.23500 48.29499
# 172208 Greenland GRL North America 1971 Aug -6.22770 52.95410 20.79014
# 157820 Canada CAN North America 1972 Aug 9.42089 53.17810 48.95760
# 148124 Antigua and Barbuda ATG North America 1974 Aug 26.48220 174.62500 79.66796
# 151004 Bahamas BHS North America 1974 Aug 27.64530 149.01000 81.76154
# 162524 Cuba CUB North America 1974 Aug 27.07380 110.54600 80.73284
# 164684 Dominica DMA North America 1974 Aug 25.12810 145.65000 77.23058
# 165044 Dominican Republic DOM North America 1974 Aug 25.07960 143.50400 77.14328
# 174404 Haiti HTI North America 1974 Aug 25.46790 152.89300 77.84222
# 178364 Jamaica JAM North America 1974 Aug 25.69750 329.85900 78.25550
# 207164 St. Kitts and Nevis KNA North America 1974 Aug 25.65000 181.47400 78.17000
# 198884 Puerto Rico PRI North America 1974 Aug 26.26840 246.12400 79.28312
# 212564 Trinidad and Tobago TTO North America 1974 Aug 26.11950 182.57200 79.01510
# 216176 United States USA North America 1975 Aug 18.68970 64.20100 65.64146
# 196375 Panama PAN North America 1975 Jul 24.11360 325.95700 75.40448
# 161454 Costa Rica CRI North America 1975 Jun 23.90800 333.15800 75.03440
# 153223 Belize BLZ North America 1979 Jul 26.74570 349.16200 80.14226
# 172662 Grenada GRD North America 1979 Jun 27.77390 239.63500 81.99302
# 173022 Guatemala GTM North America 1979 Jun 24.49050 400.83000 76.08290
# 174822 Honduras HND North America 1979 Jun 24.59780 372.34600 76.27604
# 207582 St. Lucia LCA North America 1979 Jun 26.97770 335.70000 80.55986
# 189222 Mexico MEX North America 1979 Jun 24.40470 81.00740 75.92846
# 193542 Nicaragua NIC North America 1979 Jun 25.30880 466.90400 77.55584
# 166182 El Salvador SLV North America 1979 Jun 24.14100 348.93900 75.45380
# 207942 St. Vincent and the Grenadines VCT North America 1979 Jun 28.16000 252.66000 82.68800
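One point worth checking in the rank step: rev() reverses the order of the vector returned by rank(); it does not turn ascending ranks into descending ones. If the goal is the warmest row per group, ranking the negated values is one way to get that. A toy illustration on a made-up vector:

```r
x <- c(10, 30, 20)

rank(x)        # ascending ranks: 1 3 2
rev(rank(x))   # the same ranks in reversed order: 2 3 1
rank(-x)       # descending ranks: 3 1 2

picked_rev <- x[rev(rank(x)) == 1]   # 20, not the maximum
picked_neg <- x[rank(-x) == 1]       # 30, the maximum
```

In the code above this would mean passing something like function(x) rank(-x) as FUN, which is a suggestion to verify against the data rather than a tested replacement.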

Counting number of consecutive years in a range in R

I need to count the number of contiguous years in a data frame. I want to keep only data frames that have more than 30 years of consecutive records. Previously I was doing:
length(unique(Daily_Streamflow$year)) > 30
But I realized that the number of years (unique years) could be more than 30 but not in a consecutive range, for example:
(unique(DSF_09494000$year))
[1] 1917 1918 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980
[27] 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006
[53] 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
How is it possible to count the number of years in a range that is continuous, without missing years? Is there a function similar to na.contiguous from the stats package, but for non-NA values?
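One possible approach, sketched in base R, is to label each run of consecutive years with diff() and cumsum() and then take the size of the largest run. The helper name longest_consecutive_run is hypothetical, and the example vector reproduces the years printed above.

```r
# Length of the longest run of consecutive years (hypothetical helper).
longest_consecutive_run <- function(years) {
  yrs <- sort(unique(years))
  if (length(yrs) == 0) return(0L)
  # A new run starts wherever the gap to the previous year is not exactly 1.
  run_id <- cumsum(c(TRUE, diff(yrs) != 1))
  max(tabulate(run_id))
}

years <- c(1917, 1918, 1957:2020)    # the years printed above
longest_consecutive_run(years)       # 64
longest_consecutive_run(years) > 30  # TRUE
```

The comparison in the last line could then replace the length(unique(...)) > 30 test when filtering.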

How to transform tibble from a table of dates when x happened to table of dates with categorical data for x

So I have a data set that shows the year each country joined the World Trade Organisation (WTO) and its predecessor, the General Agreement on Tariffs and Trade (GATT). Something important to note is that the WTO was created in 1995 as an expansion of the GATT (created 1947), and some GATT members (e.g. Angola below) did not join the WTO straight away in 1995 but waited until 1996 or later, depending on the country. Some countries were also not GATT members but joined the WTO after it formed (e.g. Afghanistan below).
I would like to take my data in the format of the first tibble below and change the format to have a list of all years for each country and a categorical variable showing whether they were members of the GATT, the WTO, or neither yet. My actual data-set is much larger than this example with dates from 1948 until 2017 and many more countries so doing this manually would be awful.
For this example, limiting the dates to 1992 through 1996 and looking at the first 6 countries, I would basically like to go from this:
df <- data.frame(Country = c("Afghanistan", "Albania", "Angola", "Antigua and Barbuda", "Argentina", "Armenia"),
Year_joined_WTO = c(2016, 2000, 1996, 1995, 1995, 2003),
Year_joined_GATT = c(NA, NA, 1994, 1987, 1967, NA))
df <- as_tibble(df)
> df
# A tibble: 6 x 3
Country Year_joined_WTO Year_joined_GATT
<fct> <dbl> <dbl>
1 Afghanistan 2016 NA
2 Albania 2000 NA
3 Angola 1996 1994
4 Antigua and Barbuda 1995 1987
5 Argentina 1995 1967
6 Armenia 2003 NA
to this:
df_intended <- data.frame(Country = c("Afghanistan", "Afghanistan","Afghanistan","Afghanistan","Afghanistan", "Albania", "Albania","Albania","Albania","Albania","Angola", "Angola","Angola","Angola","Angola","Antigua and Barbuda","Antigua and Barbuda","Antigua and Barbuda","Antigua and Barbuda","Antigua and Barbuda", "Argentina", "Argentina","Argentina","Argentina","Argentina","Armenia","Armenia","Armenia","Armenia","Armenia"),
Year = c(1992, 1993, 1994, 1995, 1996, 1992, 1993, 1994, 1995, 1996,1992, 1993, 1994, 1995, 1996,1992, 1993, 1994, 1995, 1996,1992, 1993, 1994, 1995, 1996,1992, 1993, 1994, 1995, 1996),
Member_WTO_GATT = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "GATT", "GATT", "WTO", "GATT","GATT","GATT", "WTO", "WTO", "GATT","GATT","GATT", "WTO", "WTO", NA, NA, NA, NA, NA))
df_intended <- as_tibble(df_intended)
print(tbl_df(df_intended), n =30)
# A tibble: 30 x 3
Country Year Member_WTO_GATT
<fct> <dbl> <fct>
1 Afghanistan 1992 NA
2 Afghanistan 1993 NA
3 Afghanistan 1994 NA
4 Afghanistan 1995 NA
5 Afghanistan 1996 NA
6 Albania 1992 NA
7 Albania 1993 NA
8 Albania 1994 NA
9 Albania 1995 NA
10 Albania 1996 NA
11 Angola 1992 NA
12 Angola 1993 NA
13 Angola 1994 GATT
14 Angola 1995 GATT
15 Angola 1996 WTO
16 Antigua and Barbuda 1992 GATT
17 Antigua and Barbuda 1993 GATT
18 Antigua and Barbuda 1994 GATT
19 Antigua and Barbuda 1995 WTO
20 Antigua and Barbuda 1996 WTO
21 Argentina 1992 GATT
22 Argentina 1993 GATT
23 Argentina 1994 GATT
24 Argentina 1995 WTO
25 Argentina 1996 WTO
26 Armenia 1992 NA
27 Armenia 1993 NA
28 Armenia 1994 NA
29 Armenia 1995 NA
30 Armenia 1996 NA
I've tried gathering the years into one column, but the problem I encounter is how to have this within a column showing every year for each country and also showing them being members in the years after they join.
My feeble attempt:
df2 <- df %>%
group_by(Country) %>%
gather(Year_joined_WTO, Year_joined_GATT, key = member_wto_gatt, value = Year)
> df2
# A tibble: 12 x 3
# Groups: Country [6]
Country member_wto_gatt Year
<fct> <chr> <dbl>
1 Afghanistan Year_joined_WTO 2016
2 Albania Year_joined_WTO 2000
3 Angola Year_joined_WTO 1996
4 Antigua and Barbuda Year_joined_WTO 1995
5 Argentina Year_joined_WTO 1995
6 Armenia Year_joined_WTO 2003
7 Afghanistan Year_joined_GATT NA
8 Albania Year_joined_GATT NA
9 Angola Year_joined_GATT 1994
10 Antigua and Barbuda Year_joined_GATT 1987
11 Argentina Year_joined_GATT 1967
12 Armenia Year_joined_GATT NA
I also have tried doing some joins and merges with a list of all the dates I want (e.g.
years <- data.frame(Year = c(1992:1996))
years <- as_tibble(years)
> df3 <- right_join(df2, years)
Joining, by = "Year"
Warning message:
Factor `Country` contains implicit NA, consider using `forcats::fct_explicit_na`
> df3
# A tibble: 6 x 3
# Groups: Country [7]
Country member_wto_gatt Year
<fct> <chr> <dbl>
1 NA NA 1992
2 NA NA 1993
3 Angola Year_joined_GATT 1994
4 Antigua and Barbuda Year_joined_WTO 1995
5 Argentina Year_joined_WTO 1995
6 Angola Year_joined_WTO 1996
)
but they were entirely unsuccessful, and I cannot find any similar examples of how to do this. Any help would be appreciated.
You could try using gather, complete and fill. gather the data to long format, use sub to have column name with "WTO" and "GATT", group_by Country and then fill the NA values with latest non-NA value.
library(dplyr)
library(tidyr)
df %>%
gather(key, Value, -Country) %>%
mutate(key = sub("Year_joined_", "", key)) %>%
group_by(Country) %>%
complete(Value = seq(1992, 1996)) %>%
fill(key)
For your real data you can use seq(min(Value), max(Value)) instead of hard coded years or if you already know which years every country should have you can use those numbers.
As of tidyr 1.0.0, gather and spread are superseded by pivot_longer and pivot_wider. The same approach using only tidyverse functions:
library(dplyr)
library(tidyr)
library(stringr)  # provides str_remove
df %>%
pivot_longer(cols = starts_with('Year')) %>%
mutate(name = str_remove(name, 'Year_joined_')) %>%
group_by(Country) %>%
complete(value = seq(1992, 1996)) %>%
fill(name)
# A tibble: 38 x 3
# Groups: Country [6]
# Country value name
# <fct> <dbl> <chr>
# 1 Afghanistan 1992 <NA>
# 2 Afghanistan 1993 <NA>
# 3 Afghanistan 1994 <NA>
# 4 Afghanistan 1995 <NA>
# 5 Afghanistan 1996 <NA>
# 6 Albania 1992 <NA>
# 7 Albania 1993 <NA>
# 8 Albania 1994 <NA>
# 9 Albania 1995 <NA>
#10 Albania 1996 <NA>
# … with 28 more rows
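The output above has 38 rows, whereas df_intended has 30. An alternative base R sketch builds every Country/Year pair first and then compares each year with the two join years. It assumes that WTO membership takes precedence once a country has joined, as in the desired output; the names grid, in_wto, in_gatt and out are only for illustration.

```r
df <- data.frame(
  Country = c("Afghanistan", "Albania", "Angola",
              "Antigua and Barbuda", "Argentina", "Armenia"),
  Year_joined_WTO  = c(2016, 2000, 1996, 1995, 1995, 2003),
  Year_joined_GATT = c(NA, NA, 1994, 1987, 1967, NA),
  stringsAsFactors = FALSE
)

# Cartesian product: every country paired with every year of interest.
grid <- merge(df, data.frame(Year = 1992:1996), by = NULL)

# Membership flags; a missing join year counts as "not a member".
in_wto  <- !is.na(grid$Year_joined_WTO)  & grid$Year >= grid$Year_joined_WTO
in_gatt <- !is.na(grid$Year_joined_GATT) & grid$Year >= grid$Year_joined_GATT

# WTO wins over GATT; neither gives NA.
grid$Member_WTO_GATT <- ifelse(in_wto, "WTO",
                               ifelse(in_gatt, "GATT", NA_character_))

out <- grid[order(grid$Country, grid$Year),
            c("Country", "Year", "Member_WTO_GATT")]
rownames(out) <- NULL
out
```

For the full data, 1992:1996 would be replaced by the range of years actually needed, for example 1948:2017.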
