Data transformation, almost like when you use cast and melt - r

I don't know how to name this data transformation neither know if there exists some kind of function to use
My data has this shape:
rank abbrv country eci_value delta year
(int) (fctr) (fctr) (dbl) (int) (int)
1 30 BRA Brazil 0.5588656 2 1995
2 47 URY Uruguay 0.2098838 -14 1995
3 52 PAN Panama 0.1164776 2 1995
4 56 ARG Argentina 0.0013733 7 1995
5 58 VEN Venezuela -0.0329851 11 1995
6 64 COL Colombia -0.2216275 -2 1995
And I want a data frame with just the information provided by "year, "rank" and country presented in this way:
country 1995 1996 1997 1998 ...
Peru rank1995 rank1996 rank1997 rank1998 ...
Brazil rank1995 rank1996 rank1997 rank1998 ...
Chile rank1995 rank1996 rank1997 rank1998 ...
... ... ... ... ...
The var "year" ranges from 1995 to 2014 and the rank varies each year
I've thought of using a melt and dcast functions from reshape2 package... but nothing useful goes out.
Thanks

This could work for you. Here is an example using dplyr and tidyr, using your small sample above (you will have to test on a larger data set or provide one).
library(dplyr)
library(tidyr)
df
# rank abbrv country eci_value delta year
#1 30 BRA Brazil 0.5588656 2 1995
#2 47 URY Uruguay 0.2098838 -14 1995
#3 52 PAN Panama 0.1164776 2 1995
#4 56 ARG Argentina 0.0013733 7 1995
#5 58 VEN Venezuela -0.0329851 11 1995
#6 64 COL Colombia -0.2216275 -2 1995
df %>% select(country, year, rank) %>% spread(year, rank)
# country 1995
#1 Argentina 56
#2 Brazil 30
#3 Colombia 64
#4 Panama 52
#5 Uruguay 47
#6 Venezuela 58

Related

How do I replicate the same column values for the next 2 cells in R [duplicate]

This question already has answers here:
Replacing NAs with latest non-NA value
(21 answers)
Closed 1 year ago.
I am trying to replicate the same column values for the next 2 cells in the column using R.
I have a data-frame of the following form:
Time World Cate Data
1994 Africa A 12
1994 B 17
1994 C 22
1994 Asia A 55
1994 B 10
1994 C 58
1995 Africa A 62
1995 B 87
1995 C 12
1995 Asia A 59
1995 B 12
1995 C 38
and I want to convert it to the following form:
Time World Cate Data
1994 Africa A 12
1994 Africa B 17
1994 Africa C 22
1994 Asia A 55
1994 Asia B 10
1994 Asia C 58
1995 Africa A 62
1995 Africa B 87
1995 Africa C 12
1995 Asia A 59
1995 Asia B 12
1995 Asia C 38
Use fill from the tidyr package:
If your dataframe is called dat, then
dat <- tidyr::fill(dat, World)
Using na.locf function from library(zoo)
library(zoo)
na.locf(df)
Time World Cate Data
1 1994 Africa A 12
2 1994 Africa B 17
3 1994 Africa C 22
4 1994 Asia A 55
5 1994 Asia B 10
6 1994 Asia C 58
7 1995 Africa A 62
8 1995 Africa B 87
9 1995 Africa C 12
10 1995 Asia A 59
11 1995 Asia B 12
12 1995 Asia C 38
Code
dummy$World <- rep(dummy$World[(1:floor(dim(dummy)[1]/5))*5-4],each = 5)
dummy

R: Substituting missing values (NAs) with two different values

I might be overcomplicating things - would love to know if if there is an easier way to solve this. I have a data frame (df) with 5654 observations - 1332 are foreign-born, and 4322 Canada-born subjects.
The variable df$YR_IMM captures: "In what year did you come to live in Canada?"
See the following distribution of observations per immigration year table(df$YR_IMM) :
1920 1926 1928 1930 1939 1942 1944 1946 1947 1948 1949 1950 1951 1952 1953 1954
2 1 1 2 1 2 1 1 1 9 5 1 7 13 3 5
1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970
10 5 8 6 6 1 5 1 6 3 7 16 18 12 15 13
1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986
10 17 8 18 25 16 15 12 16 27 13 16 11 9 17 16
1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
24 21 31 36 26 30 26 24 22 30 29 26 47 52 53 28 9
Naturally these are only foreign-born individuals (mean = 1985) - however, 348 foreign-borns are missing. There are a total of 4670 NAs that also include Canada-borns subjects.
How can I code these df$YR_IMM NAs in such a way that
348 (NA) --> 1985
4322(NA) --> 100
Additionally, the status is given by df$Brthcoun with 0 = "born in Canada" and 1 = "born outside of Canada.
Hope this makes sense - thank you!
EDIT: This was the solution ->
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 0] <- 100
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 1] <- 1985
Try the below code:
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 0] <- 100
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 1] <- 1985
I hope this helps!
Something like this should also work:
df$YR_IMM <- ifelse(is.na(df$YR_IMM) & df$Brthcoun == 0, 100, 1985)

Merging two data frames with different rows in R

I have two data frames. The first one looks like
Country Year production
Germany 1996 11
France 1996 12
Greece 1996 15
UK 1996 17
USA 1996 24
The second one contains all the countries that are in the first data frame plus a few more countries for year 2018. It looks likes this
Country Year production
Germany 2018 27
France 2018 29
Greece 2018 44
UK 2018 46
USA 2018 99
Austria 2018 56
Japan 2018 66
I would like to merge the two data frames, and the final table should look like this:
Country Year production
Germany 1996 11
France 1996 12
Greece 1996 15
UK 1996 17
USA 1996 24
Austria 1996 NA
Japan 1996 NA
Germany 2018 27
France 2018 29
Greece 2018 44
UK 2018 46
USA 2018 99
Austria 2018 56
Japan 2018 66
I've tried several functions including full_join, merge, and rbind but they didn't work. Does anybody have any ideas?
With dplyr and tidyr, you may use:
bind_rows(df1, df2) %>%
complete(Country, Year)
Country Year production
<chr> <int> <int>
1 Austria 1996 NA
2 Austria 2018 56
3 France 1996 12
4 France 2018 29
5 Germany 1996 11
6 Germany 2018 27
7 Greece 1996 15
8 Greece 2018 44
9 Japan 1996 NA
10 Japan 2018 66
11 UK 1996 17
12 UK 2018 46
13 USA 1996 24
14 USA 2018 99
Consider base R with expand.grid and merge (and avoid any dependencies should you be a package author):
# BUILD DF OF ALL POSSIBLE COMBINATIONS OF COUNTRY AND YEAR
all_country_years <- expand.grid(Country=unique(c(df_96$Country, df_18$Country)),
Year=c(1996, 2018))
# MERGE (LEFT JOIN)
final_df <- merge(all_country_years, rbind(df_96, df_18), by=c("Country", "Year"),
all.x=TRUE)
# ORDER DATA AND RESET ROW NAMES
final_df <- data.frame(with(final_df, final_df[order(Year, Country),]),
row.names = NULL)
final_df
# Country Year production
# 1 Germany 1996 11
# 2 France 1996 12
# 3 Greece 1996 15
# 4 UK 1996 17
# 5 USA 1996 24
# 6 Austria 1996 NA
# 7 Japan 1996 NA
# 8 Germany 2018 27
# 9 France 2018 29
# 10 Greece 2018 44
# 11 UK 2018 46
# 12 USA 2018 99
# 13 Austria 2018 56
# 14 Japan 2018 66
Demo

Remove rows with NA values and delete those observations in another year [duplicate]

This question already has answers here:
Filter rows in R based on values in multiple rows
(2 answers)
Closed 5 years ago.
I find it a bit hard to find the right words for what I'm trying to do.
Say I have this dataframe:
library(dplyr)
# A tibble: 74 x 3
country year conf_perc
<chr> <dbl> <dbl>
1 Canada 2017 77
2 France 2017 45
3 Germany 2017 60
4 Greece 2017 33
5 Hungary 2017 67
6 Italy 2017 38
7 Canada 2009 88
8 France 2009 91
9 Germany 2009 93
10 Greece 2009 NA
11 Hungary 2009 NA
12 Italy 2009 NA
Now I want to delete the rows that have NA values in 2009 but then I want to remove the rows of those countries in 2017 as well. I would like to get the following results:
# A tibble: 74 x 3
country year conf_perc
<chr> <dbl> <dbl>
1 Canada 2017 77
2 France 2017 45
3 Germany 2017 60
4 Canada 2009 88
5 France 2009 91
6 Germany 2009 93
We can do any after grouping by 'country'
library(dplyr)
df1 %>%
group_by(country) %>%
filter(!any(is.na(conf_perc)))
# A tibble: 6 x 3
# Groups: country [3]
# country year conf_perc
# <chr> <int> <int>
#1 Canada 2017 77
#2 France 2017 45
#3 Germany 2017 60
#4 Canada 2009 88
#5 France 2009 91
#6 Germany 2009 93
base R solution:
foo <- df$year == 2009 & is.na(df$conf_perc)
bar <- df$year == 2017 & df$country %in% unique(df$country[foo])
df[-c(which(foo), which(bar)), ]
# country year conf_perc
# 1 Canada 2017 77
# 2 France 2017 45
# 3 Germany 2017 60
# 7 Canada 2009 88
# 8 France 2009 91
# 9 Germany 2009 93

Select "europe" from df

my df2:
League freq
18 England 108
27 Italy 79
20 Germany 74
43 Spain 64
19 France 49
39 Russia 34
31 Mexico 27
47 Turkey 24
32 Netherlands 23
37 Portugal 21
49 United States 18
29 Japan 16
25 Iran 15
7 Brazil 13
22 Greece 13
14 Costa 11
45 Switzerland 11
5 Belgium 10
17 Ecuador 10
23 Honduras 10
42 South Korea 9
2 Argentina 8
48 Ukraine 7
3 Australia 6
11 Chile 6
12 China 6
15 Croatia 6
35 Norway 6
41 Scotland 6
34 Nigeria 5
I try to select europe.
europe <- subset(df2, nrow(x=18, 27, 20) select=c(1, 2))
What is the most effective way to select europe, africa, Asia ... from df2?
You either need to identify which countries are on which continents by hand, or you might be able to scrape this information from somewhere:
(basic strategy from Scraping html tables into R data frames using the XML package)
library(XML)
theurl <- "http://en.wikipedia.org/wiki/List_of_European_countries_by_area"
tables <- readHTMLTable(theurl)
library(stringr)
europe_names <- str_extract(as.character(tables[[1]]$Country),"[[:alpha:] ]+")
head(sort(europe_names))
## [1] "Albania" "Andorra" "Austria" "Azerbaijan" "Belarus"
## [6] "Belgium"
## there's also a 'Total' entry in here but it's probably harmless ...
subset(df2,League %in% europe_names)
Of course you'd have to figure this out again for Asia, America, etc.
So here's a slightly different approach from #BenBolker's, using the countrycode package.
library(countrycode)
cdb <- countrycode_data # database of countries
df2[toupper(df2$League) %in% cdb[cdb$continent=="Europe",]$country.name,]
# League freq
# 27 Italy 79
# 20 Germany 74
# 43 Spain 64
# 19 France 49
# 32 Netherlands 23
# 37 Portugal 21
# 22 Greece 13
# 45 Switzerland 11
# 5 Belgium 10
# 48 Ukraine 7
# 15 Croatia 6
# 35 Norway 6
One problem you're going to have is that "England" is not a country in any database (rather, "United Kingdom"), so you'll have to deal with that as a special case.
Also, this database considers the "Americas" as a continent.
df2[toupper(df2$League) %in% cdb[cdb$continent=="Americas",]$country.name,]
so to get just South America you have to use the region field:
df2[toupper(df2$League) %in% cdb[cdb$region=="South America",]$country.name,]
# League freq
# 7 Brazil 13
# 17 Ecuador 10
# 2 Argentina 8
# 11 Chile 6

Resources