I have a data set that contains a column with country names. It looks like this:
> xx
# A tibble: 139 × 5
Country `2012_value` `2013_value` `2014_value` `2015_value`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Albania NA NA NA 35
2 Algeria 1 1 1 NA
3 Andorra NA NA NA 50
4 Antigua & Barbuda NA NA NA 98
5 Argentina NA NA NA 65
6 Armenia NA 44 46 46.5
7 Ascension NA NA NA 100
8 Austria NA NA NA 98
9 Azerbaijan NA NA 49 50
10 Bahamas NA NA NA 95
I would like to change the names of the countries according to a table I have:
> print(itu_emi_countries, n = 30)
# A tibble: 215 × 2
`ITU Name` `EMI Name`
<chr> <chr>
1 Afghanistan Afghanistan
2 Albania Albania
3 Algeria Algeria
4 American Samoa American Samoa
5 Andorra Andorra
6 Angola Angola
7 Antigua & Barbuda Antigua
8 Argentina Argentina
9 Armenia Armenia
10 Aruba Aruba
11 Australia Australia
12 Austria Austria
13 Azerbaijan Azerbaijan
14 Bahamas Bahamas
15 Bahrain Bahrain
16 Bangladesh Bangladesh
17 Barbados Barbados
18 Belarus Belarus
19 Belgium Belgium
20 Belize Belize
21 Benin Benin
22 Bermuda Bermuda
23 Bhutan Bhutan
24 Bolivia Bolivia
25 Bosnia and Herzegovina Bosnia-Herzegovina
26 Botswana Botswana
27 Brazil Brazil
28 Brunei Darussalam Brunei
29 Bulgaria Bulgaria
30 Burkina Faso Burkina Faso
# … with 185 more rows
The country names in my data are written as in the first column (`ITU Name`), and I want to replace them with the corresponding names in the second column (`EMI Name`). How can I do this?
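A simple rename and left_join brings in the new names; you can then overwrite Country with them:
library(tidyverse)
itu_emi_countries <- itu_emi_countries %>%
  rename(Country = `ITU Name`)
# Countries with no match in the lookup table keep their original name
left_join(xx, itu_emi_countries, by = "Country") %>%
  mutate(Country = coalesce(`EMI Name`, Country)) %>%
  select(-`EMI Name`)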
With dplyr, you could use recode() and pass a named vector indicating the relations between the old and new names.
library(dplyr)
xx %>%
mutate(Country = recode(Country, !!!tibble::deframe(itu_emi_countries)))
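The same lookup can also be done in base R with match(). This is only a sketch on toy data: the small `xx` and `lookup` data frames below are illustrative stand-ins for the question's tables, and the column names `itu` and `emi` are made up for the example.

```r
# Toy stand-ins for the question's data (values are illustrative)
xx <- data.frame(Country = c("Albania", "Antigua & Barbuda", "Ascension"),
                 value = c(35, 98, 100),
                 stringsAsFactors = FALSE)
lookup <- data.frame(itu = c("Albania", "Antigua & Barbuda"),
                     emi = c("Albania", "Antigua"),
                     stringsAsFactors = FALSE)

# match() gives the lookup row for each country, or NA when there is none
idx <- match(xx$Country, lookup$itu)

# Use the new name where a match exists; otherwise keep the original name
xx$Country <- ifelse(is.na(idx), xx$Country, lookup$emi[idx])
xx$Country
```

Keeping the original name for unmatched rows is a design choice; dropping the ifelse() would turn them into NA instead, which can be handy for spotting gaps in the lookup table.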
I am trying to replicate the same column values for the next 2 cells in the column using R.
I have a data-frame of the following form:
Time  World   Cate  Data
1994  Africa  A     12
1994          B     17
1994          C     22
1994  Asia    A     55
1994          B     10
1994          C     58
1995  Africa  A     62
1995          B     87
1995          C     12
1995  Asia    A     59
1995          B     12
1995          C     38
and I want to convert it to the following form:
Time World Cate Data
1994 Africa A 12
1994 Africa B 17
1994 Africa C 22
1994 Asia A 55
1994 Asia B 10
1994 Asia C 58
1995 Africa A 62
1995 Africa B 87
1995 Africa C 12
1995 Asia A 59
1995 Asia B 12
1995 Asia C 38
Use fill from the tidyr package:
If your dataframe is called dat, then
dat <- tidyr::fill(dat, World)
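One caveat worth checking: fill() only fills NA values, so blank cells stored as empty strings need converting first. A minimal sketch, assuming the blanks are empty strings (the toy `dat` below copies the first six rows from the question):

```r
library(tidyr)

# First six rows of the question's data, with blanks stored as ""
dat <- data.frame(Time = rep(1994, 6),
                  World = c("Africa", "", "", "Asia", "", ""),
                  Cate = c("A", "B", "C", "A", "B", "C"),
                  Data = c(12, 17, 22, 55, 10, 58),
                  stringsAsFactors = FALSE)

# Convert empty strings to NA so that fill() recognises them as missing
dat$World[dat$World == ""] <- NA

# Carry the last non-missing World value downwards
dat <- fill(dat, World)
dat$World
```

If the blanks are already NA in your data, the conversion step does nothing and can be dropped.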
Using the na.locf function from the zoo package (this assumes the blank cells are stored as NA):
library(zoo)
na.locf(df)
Time World Cate Data
1 1994 Africa A 12
2 1994 Africa B 17
3 1994 Africa C 22
4 1994 Asia A 55
5 1994 Asia B 10
6 1994 Asia C 58
7 1995 Africa A 62
8 1995 Africa B 87
9 1995 Africa C 12
10 1995 Asia A 59
11 1995 Asia B 12
12 1995 Asia C 38
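A base R option, assuming every World value is followed by exactly two blank rows (groups of 3) and the data frame is called dummy:
dummy$World <- rep(dummy$World[seq(1, nrow(dummy), by = 3)], each = 3)
dummy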
I might be overcomplicating things, so I would love to know if there is an easier way to solve this. I have a data frame (df) with 5654 observations: 1332 foreign-born and 4322 Canada-born subjects.
The variable df$YR_IMM captures: "In what year did you come to live in Canada?"
See the following distribution of observations per immigration year table(df$YR_IMM) :
1920 1926 1928 1930 1939 1942 1944 1946 1947 1948 1949 1950 1951 1952 1953 1954
2 1 1 2 1 2 1 1 1 9 5 1 7 13 3 5
1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970
10 5 8 6 6 1 5 1 6 3 7 16 18 12 15 13
1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986
10 17 8 18 25 16 15 12 16 27 13 16 11 9 17 16
1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
24 21 31 36 26 30 26 24 22 30 29 26 47 52 53 28 9
Naturally, these are only foreign-born individuals (mean = 1985). However, the year is missing for 348 foreign-born subjects. In total there are 4670 NAs, which also include the Canada-born subjects.
How can I code these df$YR_IMM NAs in such a way that
348 (NA) --> 1985
4322(NA) --> 100
Additionally, the status is given by df$Brthcoun, with 0 = "born in Canada" and 1 = "born outside of Canada".
Hope this makes sense - thank you!
EDIT: This was the solution ->
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 0] <- 100
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 1] <- 1985
Try the below code:
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 0] <- 100
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 1] <- 1985
I hope this helps!
Something like this should also work (the outer ifelse keeps the years that are already recorded):
df$YR_IMM <- ifelse(!is.na(df$YR_IMM), df$YR_IMM,
                    ifelse(df$Brthcoun == 0, 100, 1985))
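For readers who prefer dplyr, the same rule can be sketched with case_when(). The four-row `df` below is made-up toy data that only borrows the question's column names; 100 and 1985 are the fill values from the question.

```r
library(dplyr)

# Toy data: two observed years, one foreign-born NA, one Canada-born NA
df <- data.frame(YR_IMM = c(1990, NA, NA, 2001),
                 Brthcoun = c(1, 1, 0, 1))

df <- df %>%
  mutate(YR_IMM = case_when(
    !is.na(YR_IMM) ~ YR_IMM,  # keep years that are already recorded
    Brthcoun == 0  ~ 100,     # born in Canada
    Brthcoun == 1  ~ 1985     # born outside Canada, year missing
  ))
df$YR_IMM
```

Rows where both YR_IMM and Brthcoun are missing stay NA, because none of the conditions applies to them.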
I have the following sample:
I am trying to turn it into the following panel data:
As you can see in the last image, I would like to repeat the values between years for the same country, and repeat the last value for the subsequent years until the year 2020.
You can use expand.grid to get the country / year combinations you want, then left_join the main data frame to this, and finally fill the missing data, filtering out any rows that are still NA (the years before a country's first observation).
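library(dplyr)
library(tidyr)
# All country / year combinations from the earliest observed year to 2020
panel <- expand.grid(year = min(df$year):2020,
                     country = unique(df$country),
                     stringsAsFactors = FALSE) %>%
  left_join(df, by = c("country", "year")) %>%
  group_by(country) %>%
  # Carry each country's last observed values forward
  fill(c("id", "regioncode", "prespowl")) %>%
  # Drop the years before a country's first observation
  filter(!is.na(id)) %>%
  as.data.frame()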
Which gives the following result:
panel
#> year country id regioncode prespowl
#> 1 2011 Albania 1 Europe 0.1817557
#> 2 2012 Albania 1 Europe 0.1817557
#> 3 2013 Albania 1 Europe 0.1817557
#> 4 2014 Albania 1 Europe 0.1817557
#> 5 2015 Albania 1 Europe 0.1817557
#> 6 2016 Albania 1 Europe 0.1817557
#> 7 2017 Albania 1 Europe 0.1817557
#> 8 2018 Albania 1 Europe 0.1411482
#> 9 2019 Albania 1 Europe 0.1411482
#> 10 2020 Albania 1 Europe 0.1411482
#> 11 2016 Algeria 2 Africa 0.3837466
#> 12 2017 Algeria 2 Africa 0.3837466
#> 13 2018 Algeria 2 Africa 0.4837466
#> 14 2019 Algeria 2 Africa 0.4837466
#> 15 2020 Algeria 2 Africa 0.4837466
#> 16 1999 Argentina 3 Americas 0.2887138
#> 17 2000 Argentina 3 Americas 0.2887138
#> 18 2001 Argentina 3 Americas 0.2887138
#> 19 2002 Argentina 3 Americas 0.2887138
#> 20 2003 Argentina 3 Americas 0.2887138
#> 21 2004 Argentina 3 Americas 0.2887138
#> 22 2005 Argentina 3 Americas 0.2887138
#> 23 2006 Argentina 3 Americas 0.4322523
#> 24 2007 Argentina 3 Americas 0.4322523
#> 25 2008 Argentina 3 Americas 0.4322523
#> 26 2009 Argentina 3 Americas 0.4322523
#> 27 2010 Argentina 3 Americas 0.4322523
#> 28 2011 Argentina 3 Americas 0.4322523
#> 29 2012 Argentina 3 Americas 0.4322523
#> 30 2013 Argentina 3 Americas 0.5453171
#> 31 2014 Argentina 3 Americas 0.5453171
#> 32 2015 Argentina 3 Americas 0.5453171
#> 33 2016 Argentina 3 Americas 0.5453171
#> 34 2017 Argentina 3 Americas 0.5453171
#> 35 2018 Argentina 3 Americas 0.5453171
#> 36 2019 Argentina 3 Americas 0.5453171
#> 37 2020 Argentina 3 Americas 0.5453171
Data used:
df <- read.table(text= 'country year id regioncode prespowl
Albania 2011 1 Europe 0.1817557
Albania 2018 1 Europe 0.1411482
Algeria 2016 2 Africa 0.3837466
Algeria 2018 2 Africa 0.4837466
Argentina 1999 3 Americas 0.2887138
Argentina 2006 3 Americas 0.4322523
Argentina 2013 3 Americas 0.5453171
', header = TRUE, stringsAsFactors = FALSE)
I have two data frames. The first one looks like
Country Year production
Germany 1996 11
France 1996 12
Greece 1996 15
UK 1996 17
USA 1996 24
The second one contains all the countries that are in the first data frame, plus a few more countries, for the year 2018. It looks like this:
Country Year production
Germany 2018 27
France 2018 29
Greece 2018 44
UK 2018 46
USA 2018 99
Austria 2018 56
Japan 2018 66
I would like to merge the two data frames, and the final table should look like this:
Country Year production
Germany 1996 11
France 1996 12
Greece 1996 15
UK 1996 17
USA 1996 24
Austria 1996 NA
Japan 1996 NA
Germany 2018 27
France 2018 29
Greece 2018 44
UK 2018 46
USA 2018 99
Austria 2018 56
Japan 2018 66
I've tried several functions including full_join, merge, and rbind but they didn't work. Does anybody have any ideas?
With dplyr and tidyr, you may use:
bind_rows(df1, df2) %>%
complete(Country, Year)
Country Year production
<chr> <int> <int>
1 Austria 1996 NA
2 Austria 2018 56
3 France 1996 12
4 France 2018 29
5 Germany 1996 11
6 Germany 2018 27
7 Greece 1996 15
8 Greece 2018 44
9 Japan 1996 NA
10 Japan 2018 66
11 UK 1996 17
12 UK 2018 46
13 USA 1996 24
14 USA 2018 99
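A self-contained sketch of this approach on a cut-down version of the question's data (only three countries are kept so the example stays short; `df1` and `df2` are the names used in the answer above):

```r
library(dplyr)
library(tidyr)

# Cut-down versions of the two data frames from the question
df1 <- data.frame(Country = c("Germany", "France"),
                  Year = 1996L,
                  production = c(11L, 12L),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Country = c("Germany", "France", "Austria"),
                  Year = 2018L,
                  production = c(27L, 29L, 56L),
                  stringsAsFactors = FALSE)

combined <- bind_rows(df1, df2) %>%   # stack the two years
  complete(Country, Year) %>%         # add missing Country / Year pairs as NA
  arrange(Year)                       # order by year, as in the desired output
combined
```

complete() returns countries in alphabetical order, so the arrange(Year) step only restores the year-first layout, not the original country order.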
Consider base R with expand.grid and merge (and avoid any dependencies should you be a package author):
# BUILD DF OF ALL POSSIBLE COMBINATIONS OF COUNTRY AND YEAR
all_country_years <- expand.grid(Country=unique(c(df_96$Country, df_18$Country)),
Year=c(1996, 2018))
# MERGE (LEFT JOIN)
final_df <- merge(all_country_years, rbind(df_96, df_18), by=c("Country", "Year"),
all.x=TRUE)
# ORDER DATA AND RESET ROW NAMES
final_df <- data.frame(with(final_df, final_df[order(Year, Country),]),
row.names = NULL)
final_df
# Country Year production
# 1 Germany 1996 11
# 2 France 1996 12
# 3 Greece 1996 15
# 4 UK 1996 17
# 5 USA 1996 24
# 6 Austria 1996 NA
# 7 Japan 1996 NA
# 8 Germany 2018 27
# 9 France 2018 29
# 10 Greece 2018 44
# 11 UK 2018 46
# 12 USA 2018 99
# 13 Austria 2018 56
# 14 Japan 2018 66
When I join two tables without specifying a key, it works perfectly. But when I provide a key, I get weird results.
Please help me understand what I am missing.
library(dplyr)      # provides left_join()
library(gapminder)
A <- gapminder[gapminder$country=="India" & gapminder$year %in% 1952:1987, 1:4]
B <- gapminder[gapminder$country=="India" & gapminder$year %in% 1977:2007, c(1:3, 5, 6)]
left_join(A, B)
left_join(A, B, by = "country")
Without a key, I get:
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <dbl> <dbl>
1 India Asia 1952 37.4 NA NA
2 India Asia 1957 40.2 NA NA
3 India Asia 1962 43.6 NA NA
4 India Asia 1967 47.2 NA NA
5 India Asia 1972 50.7 NA NA
6 India Asia 1977 54.2 634000000 813.
7 India Asia 1982 56.6 708000000 856.
8 India Asia 1987 58.6 788000000 977.
But when I use the key, I get 56 rows:
# A tibble: 56 x 7
country continent year.x lifeExp year.y pop
<fct> <fct> <int> <dbl> <int> <dbl>
1 India Asia 1952 37.4 1977 6.34e8
2 India Asia 1952 37.4 1982 7.08e8
3 India Asia 1952 37.4 1987 7.88e8
4 India Asia 1952 37.4 1992 8.72e8
5 India Asia 1952 37.4 1997 9.59e8
6 India Asia 1952 37.4 2002 1.03e9
7 India Asia 1952 37.4 2007 1.11e9
8 India Asia 1957 40.2 1977 6.34e8
9 India Asia 1957 40.2 1982 7.08e8
10 India Asia 1957 40.2 1987 7.88e8
# ... with 46 more rows, and 1 more variable:
# gdpPercap <dbl>
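It's called a Cartesian product (cross join).
When year is not part of the key, it is no longer matched. Every one of the 8 rows in A then matches all 7 rows in B, giving 8 × 7 = 56 rows, and both year columns are kept as year.x and year.y. Without by, left_join() joins on every column the two tables have in common (here country, continent and year), which is why you get one row per year in A.
It's basically a multiplication of the rows, rather than a straight intersect.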
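The row multiplication is easy to reproduce on a tiny example. This is a sketch on hand-made toy data (two rows in `A`, three in `B`) rather than the gapminder subsets from the question:

```r
library(dplyr)

# Toy tables: one country, overlapping only in the year 1977
A <- data.frame(country = "India",
                year = c(1952, 1977),
                lifeExp = c(37.4, 54.2))
B <- data.frame(country = "India",
                year = c(1977, 1982, 1987),
                pop = c(6.34e8, 7.08e8, 7.88e8))

# Joining on both keys keeps one row per row of A
by_both <- left_join(A, B, by = c("country", "year"))

# Joining on country alone pairs every row of A with every row of B.
# Recent dplyr versions warn about this many-to-many match.
by_country <- suppressWarnings(left_join(A, B, by = "country"))

nrow(by_both)
nrow(by_country)
names(by_country)
```

The second join has 2 × 3 rows and carries year.x and year.y, the same pattern as the 56-row result in the question.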