Removing certain rows whose column value does not match another column (all within the same data frame) - r

I'm attempting to remove all of the rows (cases) within a data frame in which a certain column's value does not match another column value.
The data frame bilat_total contains these 10 columns/variables:
bilat_total[,c("year", "importer1", "importer2", "flow1",
"flow2", "country", "imports", "exports", "bi_tot",
"mother")]
Thus the table's head is:
year importer1 importer2 flow1 flow2 country
6 2009 Afghanistan Bhutan NA NA Afghanistan
11 2009 Afghanistan Solomon Islands NA NA Afghanistan
12 2009 Afghanistan India 516.13 120.70 Afghanistan
13 2009 Afghanistan Japan 124.21 0.46 Afghanistan
15 2009 Afghanistan Maldives NA NA Afghanistan
19 2009 Afghanistan Bangladesh 4.56 1.09 Afghanistan
imports exports bi_tot mother
6 6689.35 448.25 NA United Kingdom
11 6689.35 448.25 NA United Kingdom
12 6689.35 448.25 1.804361e-02 United Kingdom
13 6689.35 448.25 6.876602e-05 United Kingdom
15 6689.35 448.25 NA United Kingdom
19 6689.35 448.25 1.629456e-04 United Kingdom
I've attempted to remove all the cases in which importer2 does not match mother by making a subset:
subset(bilat_total, importer2 == mother)
But each time I do, I get the error:
Error in Ops.factor(importer2, mother) : level sets of factors are different
How would I go about dropping all the rows/cases in which importer 2 and mother don't match?

The error is most likely because both columns are of class factor and their level sets differ; == on two factors requires the same set of levels. We can convert the columns to character and then compare them to subset the rows.
subset(bilat_total, as.character(importer2) == as.character(mother))
With the example data shown, the original call reproduces the error:
subset(bilat_total, importer2 == mother)
# Error in Ops.factor(importer2, mother) :
# level sets of factors are different
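A minimal sketch of the same situation on made-up data (the three-row `bilat` frame below is hypothetical, not the asker's data). It shows the factor comparison failing and the character comparison working:

```r
# Hypothetical toy data: both columns are factors with different level sets
bilat <- data.frame(
  importer2 = factor(c("Bhutan", "India", "United Kingdom")),
  mother    = factor(c("United Kingdom", "United Kingdom", "United Kingdom"))
)

# Comparing the factors directly fails because their levels differ
res <- try(subset(bilat, importer2 == mother), silent = TRUE)
failed <- inherits(res, "try-error")

# Comparing the underlying strings keeps only the matching rows
matched <- subset(bilat, as.character(importer2) == as.character(mother))
```

If the columns do not need to be factors at all, converting them once (for example with `bilat$importer2 <- as.character(bilat$importer2)`) avoids repeating the conversion in every comparison.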

Related

Joining two dataframes to plot a map with ggplot2

I want to make a world map visualization using a data frame that looks like this:
Country Year Sex Age Suicides Population Suicides_per_100k Country_Year HDI/Year Year_GDP
1 Albania 1987 Male 15-24 years 21 312900 6.71 Albania1987 NA 2156624900
2 Albania 1987 Male 35-54 years 16 308000 5.19 Albania1987 NA 2156624900
3 Albania 1987 Female 15-24 years 14 289700 4.83 Albania1987 NA 2156624900
4 Albania 1987 Male 75+ years 1 21800 4.59 Albania1987 NA 2156624900
5 Albania 1987 Male 25-34 years 9 274300 3.28 Albania1987 NA 2156624900
6 Albania 1987 Female 75+ years 1 35600 2.81 Albania1987 NA 2156624900
GDP_Per_Capita Generation Continent
1 796 Generation X Europe
2 796 Silent Europe
3 796 Generation X Europe
4 796 G.I. Generation Europe
5 796 Boomers Europe
6 796 G.I. Generation Europe
I tried to use the following code:
world <- ggplot2::map_data('world')
worldstart <- left_join(df, world, by = c("Country" = "region"))
This code created a new data frame with 14 million observations.
But I'd like to keep the same number of rows as the dataset "df".
What is the best approach?
Indeed, the map_data function returns one row for each point of each multipolygon in the world (~10k rows), so every row of df is repeated once per matching point. As mentioned earlier, you cannot choose which point to keep.
You can use the sf library to get around this difficulty, keeping the geometry (here multipolygons) on one side and your data on the other.
My proposal would be the following:
library(dplyr)
library(sf)
library(ggplot2)
df <- tibble(Country = "Albania",
GDP_per_Capita = 796)
world <- maps::map('world', plot = FALSE, fill = TRUE) %>% st_as_sf(stringsAsFactors = FALSE)
world_df <- df %>%
left_join(world, by = c("Country" = "ID"))
In my example, you would have only one row of data, but the geometry column contains all the information necessary for plotting.
sf and ggplot2 are well linked so you are good to go.
Best regards
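To see why the original join grew so much, here is a small base R sketch of the one-to-many effect. The `world_points` frame is a hypothetical stand-in for the output of map_data (one row per polygon vertex, with made-up coordinates), and `merge` is used in place of left_join so the example has no package dependencies:

```r
# Two data rows for one country
df <- data.frame(
  Country  = c("Albania", "Albania"),
  Sex      = c("Male", "Female"),
  Suicides = c(21, 14)
)

# Hypothetical stand-in for map_data('world'): one row per polygon vertex
world_points <- data.frame(
  region = rep("Albania", 4),
  long   = c(19.3, 20.0, 20.6, 19.4),
  lat    = c(42.1, 42.6, 41.0, 40.0)
)

# Every df row is paired with every vertex of its country: 2 * 4 rows
exploded <- merge(df, world_points, by.x = "Country", by.y = "region", all.x = TRUE)

# With one row per region on the map side, the row count of df is preserved
one_per_region <- aggregate(cbind(long, lat) ~ region, data = world_points, FUN = mean)
kept <- merge(df, one_per_region, by.x = "Country", by.y = "region", all.x = TRUE)
```

The averaged coordinates are only a crude label position, not a shape that can be drawn. The sf approach keeps the same one-row-per-country property while storing the full polygon in the geometry column.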

Comparing two dataframes in R with different number of rows

I have two data frames, that have the same setup as below
Country Name Country Code Region Year Fertility Rate
Aruba ABW The Americas 1960 4.82
Afghanistan AFG Asia 1960 7.45
Angola AGO Africa 1960 7.379
Albania ALB Europe 1960 6.186
United Arab Emirates ARE Middle East 1960 6.928
Argentina ARG The Americas 1960 3.109
Armenia ARM Asia 1960 4.55
Antigua and Barbuda ATG The Americas 1960 4.425
Australia AUS Oceania 1960 3.453
Austria AUT Europe 1960 2.69
Azerbaijan AZE Asia 1960 5.571
Burundi BDI Africa 1960 6.953
Belgium BEL Europe 1960 2.54
I would like to create a data frame where I list out which countries are missing from the "merged" data frame as compared with the "merged2013" data frame. (Not my naming conventions)
I have tried numerous things I found on the internet. Only the code below works, but not in the way I would like:
newmerged1 <- (paste(merged$Country.Name) %in% paste(merged2013$Country.Name))+1
newmerged1
This returns a "1" value for countries that aren't found in the merged2013 data frame. I'm assuming there is a way I can get this to list out the Country Name instead of a one or two, or just have a list of the countries not found in the merged2013 data frame without everything else.
You could use dplyr's anti_join, which is designed for exactly this purpose.
library(dplyr)
missing_data <- anti_join(merged2013, merged, by = "Country.Name")
This will return all the rows in merged2013 not in merged.
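For comparison, a base R sketch of the same idea without dplyr. The two small frames below are hypothetical stand-ins for `merged` and `merged2013`, assuming both have a `Country.Name` column:

```r
# Hypothetical toy versions of the two data frames
merged2013 <- data.frame(
  Country.Name = c("Aruba", "Afghanistan", "Angola", "Albania"),
  stringsAsFactors = FALSE
)
merged <- data.frame(
  Country.Name = c("Aruba", "Angola"),
  stringsAsFactors = FALSE
)

# Just the names that appear in merged2013 but not in merged
missing_names <- setdiff(merged2013$Country.Name, merged$Country.Name)

# The full rows of merged2013 that have no match in merged
missing_rows <- merged2013[!(merged2013$Country.Name %in% merged$Country.Name), , drop = FALSE]
```

`setdiff` is enough when only the list of country names is wanted; the `%in%` subset keeps the other columns as well, much like anti_join.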

How to recode and encode a country pair variable in R

I am trying to recode a variable for country pairs, e.g. an exporter EFG and an importer ISR form the country pair EFGISR. I need these pairs for a panel data analysis, so the country pairs have to be converted to numeric variables. I am familiar with the as.numeric command, but recoding these variables back to the original format seems to be a tough job. Do you know a better way to code it, or a way to use the factor variable as a reference for a recode call? I will have to use the plm package and the function make.pbalanced().
Cheers and I would really appreciate your help!
edit:
idvar <- c(BRAWLD, BRAALB, BRADZA, BRAARG, BRAAUS, BRAAUT, BRABHR, BRAARM)
as.numeric(idvar)
[1] 108 2 30 5 7 8 12 6 9 15 11 17 23 19
as.factor(idvar)
[1] 108 2 30 5 7 8 12 6 9 15 11 17 23 19
This is the part where I would like to have again
idvar
BRAWLD, BRAALB, BRADZA, BRAARG, BRAAUS, BRAAUT, BRABHR, BRAARM
Here is the head of my dataset:
year exp exp_iso imp imp_iso nw tv nw_c nw_dc tv_c tv_dc tv_total nw_total id_var
1996-BRAARE 1996 Brazil BRA United Arab Emirates ARE 563812 1245639 563812 0 1245639 0 1245639 563812 BRAARE
1996-BRAARG 1996 Brazil BRA Argentina ARG 34006800 77508984 34006800 0 77508984 0 77508984 34006800 BRAARG
1996-BRAARM 1996 Brazil BRA Armenia ARM 38398 70656 38398 0 70656 0 70656 38398 BRAARM
1996-BRAAUS 1996 Brazil BRA Australia AUS 3213000 7864554 3213000 0 7864554 0 7864554 3213000 BRAAUS
1996-BRAAUT 1996 Brazil BRA Austria AUT 11189578 25442560 11189578 0 25442560 0 25442560 11189578 BRAAUT
1996-BRABEL 1996 Brazil BRA Belgium BEL 41944172 93179224 41944172 0 93179224 0 93179224 41944172 BRABEL
I found an appealing solution to the problem. The countrycode package provides a function with which I could convert the character country codes to numeric codes, using "iso3n" as the destination code.
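As an alternative to the countrycode approach, the round trip asked about in the question can be sketched in base R by keeping the factor as the lookup table. This is a minimal sketch, assuming `idvar` is a character vector of pair codes; the values below are illustrative:

```r
# Illustrative pair codes (quoted, so this is a character vector)
idvar <- c("BRAWLD", "BRAALB", "BRADZA", "BRAARG", "BRAALB")

# The factor stores the mapping between labels and integer codes
pair_factor <- factor(idvar)

# Numeric id for each pair, usable as a panel index
pair_id <- as.integer(pair_factor)

# Recover the original labels by indexing the levels with the ids
recovered <- levels(pair_factor)[pair_id]
round_trip_ok <- identical(recovered, idvar)
```

The integer codes depend on the set of levels present, so the same `pair_factor` (or its levels) has to be kept around to decode the ids later.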

How to reshape this complicated data frame?

Here are the first 4 rows of my data:
X...Country.Name Country.Code Indicator.Name
1 Turkey TUR Inflation, GDP deflator (annual %)
2 Turkey TUR Unemployment, total (% of total labor force)
3 Afghanistan AFG Inflation, GDP deflator (annual %)
4 Afghanistan AFG Unemployment, total (% of total labor force)
Indicator.Code X2010
1 NY.GDP.DEFL.KD.ZG 5.675740
2 SL.UEM.TOTL.ZS 11.900000
3 NY.GDP.DEFL.KD.ZG 9.437322
4 SL.UEM.TOTL.ZS NA
I want my data reshaped into two columns, one for each indicator code, with each row corresponding to a country, something like this:
Country Name NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
Turkey 5.6 11.9
Afghanistan 9.43 NA
I think I could do this with Excel, but I want to learn the R way, so that I don't need to rely on Excel every time I have a problem. Here is dput of data if you need it.
Edit: I actually want 3 columns, one for each indicator and one for the country's name.
Sticking with base R, use reshape. I took the liberty of cleaning up the column names. Here, I'm only showing you a few rows of the output. Remove head to see the full output. This assumes your data.frame is named "mydata".
names(mydata) <- c("CountryName", "CountryCode",
"IndicatorName", "IndicatorCode", "X2010")
head(reshape(mydata[-c(2:3)],
direction = "wide",
idvar = "CountryName",
timevar = "IndicatorCode"))
# CountryName X2010.NY.GDP.DEFL.KD.ZG X2010.SL.UEM.TOTL.ZS
# 1 Turkey 5.675740 11.9
# 3 Afghanistan 9.437322 NA
# 5 Albania 3.459343 NA
# 7 Algeria 16.245617 11.4
# 9 American Samoa NA NA
# 11 Andorra NA NA
Another option in base R is xtabs, but NA gets replaced with 0:
head(xtabs(X2010 ~ CountryName + IndicatorCode, mydata))
# IndicatorCode
# CountryName NY.GDP.DEFL.KD.ZG SL.UEM.TOTL.ZS
# Afghanistan 9.437322 0.0
# Albania 3.459343 0.0
# Algeria 16.245617 11.4
# American Samoa 0.000000 0.0
# Andorra 0.000000 0.0
# Angola 22.393924 0.0
The result of xtabs is a matrix, so if you want a data.frame, wrap the output with as.data.frame.matrix.
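A short sketch of that last step, using only the four rows shown in the question (so the output is much smaller than the listings above):

```r
# The four example rows from the question, with cleaned-up column names
mydata <- data.frame(
  CountryName   = c("Turkey", "Turkey", "Afghanistan", "Afghanistan"),
  IndicatorCode = c("NY.GDP.DEFL.KD.ZG", "SL.UEM.TOTL.ZS",
                    "NY.GDP.DEFL.KD.ZG", "SL.UEM.TOTL.ZS"),
  X2010         = c(5.675740, 11.9, 9.437322, NA)
)

# Cross-tabulate; the missing value ends up as 0 in the table
tab <- xtabs(X2010 ~ CountryName + IndicatorCode, mydata)

# Convert the table to a data.frame with one column per indicator
wide <- as.data.frame.matrix(tab)

# Move the country names from the row names into a regular column
wide$CountryName <- rownames(wide)
rownames(wide) <- NULL
```

Because the NA becomes 0 here, the reshape approach is the safer choice when missing values need to stay distinguishable from true zeros.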

Create a moving sum of past levels of a variable, summed over for each level of 3 other variables, in R

I have a data.frame with the following structure (panel data), with 16 levels of time (quarters), 14 levels of geo (countries) and 20 levels of citizen, each of them repeating accordingly in the data frame.
time geo citizen X
2008Q1 Belgium Afghanistan 22
2008Q1 Belgium Armenia 10
2008Q1 Belgium Bangladesh 25
2008Q1 Belgium Democratic Republic of the Congo 55
2008Q1 Belgium China (including Hong Kong) 5
2008Q1 Belgium Eritrea 8
I would like to create a new column, say MOVSUM, that sums the variable X over the previous 4 quarters for each combination of citizen and geo, so that for each quarter t I would have how many X's of each citizen in each geo were available during quarters t-4 to t-1.
Thanks in advance
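One way the computation described above could be sketched in base R. The data are made up, `lag_sum` is a hypothetical helper, and the handling of early quarters (summing whatever previous quarters exist, with NA for the first one) is an assumption, since the question does not specify it:

```r
# Hypothetical toy panel: one geo, two citizen groups, six quarters each
qtrs <- c("2008Q1", "2008Q2", "2008Q3", "2008Q4", "2009Q1", "2009Q2")
panel <- data.frame(
  time    = rep(qtrs, times = 2),
  geo     = "Belgium",
  citizen = rep(c("Armenia", "Eritrea"), each = 6),
  X       = as.numeric(c(10, 20, 30, 40, 50, 60, 1, 2, 3, 4, 5, 6))
)

# Sum of the previous k values, excluding the current one.
# Assumption: with fewer than k earlier values, sum what is available;
# the first observation has no history and gets NA.
lag_sum <- function(x, k = 4) {
  vapply(seq_along(x), function(i) {
    if (i == 1) return(NA_real_)
    sum(x[max(1, i - k):(i - 1)])
  }, numeric(1))
}

# Sort so that quarters are in order within each geo/citizen group
panel <- panel[order(panel$geo, panel$citizen, panel$time), ]

# Apply the lagged sum separately within each geo/citizen combination
panel$MOVSUM <- ave(panel$X, panel$geo, panel$citizen, FUN = lag_sum)
```

The sort relies on labels like "2008Q1" ordering correctly as text, and the window counts rows rather than calendar quarters, so it assumes no quarter is missing within a group.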
