Comparing two dataframes in R with different numbers of rows

I have two data frames, that have the same setup as below
Country Name          Country Code  Region        Year  Fertility Rate
Aruba                 ABW           The Americas  1960  4.82
Afghanistan           AFG           Asia          1960  7.45
Angola                AGO           Africa        1960  7.379
Albania               ALB           Europe        1960  6.186
United Arab Emirates  ARE           Middle East   1960  6.928
Argentina             ARG           The Americas  1960  3.109
Armenia               ARM           Asia          1960  4.55
Antigua and Barbuda   ATG           The Americas  1960  4.425
Australia             AUS           Oceania       1960  3.453
Austria               AUT           Europe        1960  2.69
Azerbaijan            AZE           Asia          1960  5.571
Burundi               BDI           Africa        1960  6.953
Belgium               BEL           Europe        1960  2.54
I would like to create a data frame where I list out which countries are missing from the "merged" data frame as compared with the "merged2013" data frame. (Not my naming conventions)
I have tried numerous things I found on the internet; only the code below works, but not in the way I would like.
newmerged1 <- (paste(merged$Country.Name) %in% paste(merged2013$Country.Name))+1
newmerged1
This returns a 1 for countries that aren't found in the merged2013 data frame. I'm assuming there is a way to list the country names instead of a 1 or 2, or to get just the countries not found in the merged2013 data frame without everything else.

You could use dplyr's anti_join, it is specifically designed to be used this way.
require(dplyr)
missing_data <- anti_join(merged2013, merged, by = "Country.Name")
This will return all the rows in merged2013 not in merged.
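If you prefer to stay in base R, `setdiff` on the name column (or a negated `%in%` subset) gives the same result without any packages. A minimal sketch with made-up toy versions of the two data frames (the real ones have more columns):

```r
# Hypothetical stand-ins for the asker's data frames
merged2013 <- data.frame(Country.Name = c("Aruba", "Afghanistan", "Angola", "Albania"))
merged     <- data.frame(Country.Name = c("Aruba", "Albania"))

# Country names present in merged2013 but absent from merged
missing_countries <- setdiff(merged2013$Country.Name, merged$Country.Name)

# Equivalent row-wise subset, keeping all columns of merged2013
missing_rows <- merged2013[!(merged2013$Country.Name %in% merged$Country.Name), , drop = FALSE]
```

`setdiff` returns just the vector of names, while the `%in%` subset keeps the full rows, which is closer to what `anti_join` does.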

Related

Creating a Variable Initial Values from a base variable in Panel Data Structure in R

I'm trying to create a new variable in R containing the initial values of another variable (crime) based on groups (countries) considering the initial period of time observable per group (on panel data framework), my current data looks like this:
country    year  Crime
Albania    2016  2.7369478
Albania    2017  2.0109779
Argentina  2002  9.474084
Argentina  2003  7.7898825
Argentina  2004  6.0739941
And I want it to look like this:
country    year  Crime      Initial_Crime
Albania    2016  2.7369478  2.7369478
Albania    2017  2.0109779  2.7369478
Argentina  2002  9.474084   9.474084
Argentina  2003  7.7898825  9.474084
Argentina  2004  6.0739941  9.474084
I saw that ddply could make this work, but the problem is that it is no longer supported in recent R releases.
Thank you in advance.
Maybe arrange by year, then after grouping by country set Initial_Crime to be the first Crime in the group.
library(tidyverse)
df %>%
arrange(year) %>%
group_by(country) %>%
mutate(Initial_Crime = first(Crime))
Output
country year Crime Initial_Crime
<chr> <int> <dbl> <dbl>
1 Argentina 2002 9.47 9.47
2 Argentina 2003 7.79 9.47
3 Argentina 2004 6.07 9.47
4 Albania 2016 2.74 2.74
5 Albania 2017 2.01 2.74
library(data.table)
setDT(data)[, Initial_Crime:=.SD[1,Crime], by=country]
country year Crime Initial_Crime
1: Albania 2016 2.736948 2.736948
2: Albania 2017 2.010978 2.736948
3: Argentina 2002 9.474084 9.474084
4: Argentina 2003 7.789883 9.474084
5: Argentina 2004 6.073994 9.474084
A data.table solution
library(data.table)
setDT(df)
df[, x := 1:.N, country
   ][x == 1, Initial_Crime := Crime
   ][, Initial_Crime := nafill(Initial_Crime, type = "locf")
   ][, x := NULL
   ]
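For completeness, base R can do the same with `ave` and no packages at all. A sketch using the column names from the question:

```r
# Toy data matching the question's layout
df <- data.frame(
  country = c("Albania", "Albania", "Argentina", "Argentina", "Argentina"),
  year    = c(2016, 2017, 2002, 2003, 2004),
  Crime   = c(2.7369478, 2.0109779, 9.474084, 7.7898825, 6.0739941)
)

# Order by year within country, then take the first Crime value per group
df <- df[order(df$country, df$year), ]
df$Initial_Crime <- ave(df$Crime, df$country, FUN = function(x) x[1])
df
```

`ave` returns a vector the same length as the input, repeating each group's first value across the group, so it drops straight into a new column.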

Removing certain rows whose column value does not match another column (all within the same data frame)

I'm attempting to remove all of the rows (cases) within a data frame in which a certain column's value does not match another column value.
The data frame bilat_total contains these 10 columns/variables:
bilat_total[,c("year", "importer1", "importer2", "flow1",
"flow2", "country", "imports", "exports", "bi_tot",
"mother")]
Thus the table's head is:
year importer1 importer2 flow1 flow2 country
6 2009 Afghanistan Bhutan NA NA Afghanistan
11 2009 Afghanistan Solomon Islands NA NA Afghanistan
12 2009 Afghanistan India 516.13 120.70 Afghanistan
13 2009 Afghanistan Japan 124.21 0.46 Afghanistan
15 2009 Afghanistan Maldives NA NA Afghanistan
19 2009 Afghanistan Bangladesh 4.56 1.09 Afghanistan
imports exports bi_tot mother
6 6689.35 448.25 NA United Kingdom
11 6689.35 448.25 NA United Kingdom
12 6689.35 448.25 1.804361e-02 United Kingdom
13 6689.35 448.25 6.876602e-05 United Kingdom
15 6689.35 448.25 NA United Kingdom
19 6689.35 448.25 1.629456e-04 United Kingdom
I've attempted to remove all the cases in which importer2 does not match mother by making a subset:
subset(bilat_total, importer2 == mother)
But each time I do, I get the error:
Error in Ops.factor(importer2, mother) : level sets of factors are different
How would I go about dropping all the rows/cases in which importer2 and mother don't match?
The error may be because the columns are factor class. We can convert the columns to character class and then compare to subset the rows.
subset(bilat_total, as.character(importer2) == as.character(mother))
Based on the data example shown
subset(bilat_total, importer2 == mother)
# Error in Ops.factor(importer2, mother) :
# level sets of factors are different
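A minimal reproducible sketch of the error and the fix, using toy data (not the asker's) where the two factor columns have different level sets:

```r
# Two factor columns with different level sets, as in the question
bilat_total <- data.frame(
  importer2 = factor(c("Bhutan", "India", "United Kingdom")),
  mother    = factor(c("United Kingdom", "United Kingdom", "United Kingdom"))
)

# subset(bilat_total, importer2 == mother)
# Error in Ops.factor(importer2, mother) : level sets of factors are different

# Comparing as character avoids the level-set check entirely
matched <- subset(bilat_total, as.character(importer2) == as.character(mother))
matched
```

Note that since R 4.0 `data.frame` no longer converts strings to factors by default, so reading the data in with `stringsAsFactors = FALSE` (or a recent R) avoids the problem at the source.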

r read.table misread special symbols

I am supposed to use read.table (not other functions) to import my data.
The data looks like the following:
country year pop continent lifeExp gdpPercap
Afghanistan 1952 8425333 Asia 28.801 779.4453145
Afghanistan 1957 9240934 Asia 30.332 820.8530296
Afghanistan 1962 10267083 Asia 31.997 853.10071
...
Cote d'Ivoire 1987 10761098 Africa 54.655 2156.956069
Cote d'Ivoire 1992 12772596 Africa 52.044 1648.073791
Cote d'Ivoire 1997 14625967 Africa 47.991 1786.265407
Cote d'Ivoire 2002 16252726 Africa 46.832 1648.800823
Cote d'Ivoire 2007 18013409 Africa 48.328 1544.750112
...
read.table cannot properly read "Cote d'Ivoire" because of the apostrophe. How do I fix that by changing the parameters of the read.table function?
You will have to pass quote = "" to read.table so that the apostrophe in Cote d'Ivoire is not treated as a quoting character.
df.1 <- read.table("your/file.txt", quote = "", header = TRUE, sep = "\t")
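A self-contained sketch of the difference, using the `text =` argument instead of a file (tab-separated here, since the country names contain spaces; adjust `sep` to match your actual file):

```r
# Inline stand-in for the file; fields are tab-separated
txt <- "country\tyear\tpop\nAfghanistan\t1952\t8425333\nCote d'Ivoire\t1987\t10761098"

# With the default quote = "\"'", the apostrophe opens a quoted
# string that never closes, mangling everything after it:
# read.table(text = txt, header = TRUE, sep = "\t")

# Disabling quoting reads the apostrophe as ordinary text
df.1 <- read.table(text = txt, header = TRUE, sep = "\t", quote = "")
df.1
```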

R t-test of mean vs observations for multiple factor levels

I have a dataset of some 39k rows of data, an excerpt is below:
'Country', 'Group', 'Item', 'Year' are categorical
'Production' and 'Waste' are numerical
'LF' is also numerical, but is the result of 'Waste'/'Production'
Region Country Group Item Year Production Waste LF
Europe Bulgaria Cereals Wheat 1961 2040 274 0.134313725
Europe Bulgaria Cereals Wheat 1962 2090 262 0.125358852
Europe Bulgaria Cereals Wheat 1963 1894 277 0.14625132
Europe Bulgaria Cereals Wheat 1964 2121 286 0.134842056
Europe Bulgaria Cereals Wheat 1965 2923 341 0.116660965
Europe Bulgaria Cereals Wheat 1966 3193 385 0.120576261
Europe Bulgaria Cereals Barley 1961 612 15 0.024509804
Europe Bulgaria Cereals Barley 1962 599 16 0.026711185
Europe Bulgaria Cereals Barley 1963 618 16 0.025889968
Europe Bulgaria Cereals Barley 1964 764 21 0.027486911
Europe Bulgaria Cereals Barley 1965 876 22 0.025114155
Europe Bulgaria Cereals Barley 1966 1064 24 0.022556391
I have used the following code to generate 991 different means by Country and Item
df2 <- aggregate(LF ~ Country + Item, data=df1, FUN='mean')
The results of this function look ok.
I would like to test whether the respective means of LF in df2 differ from the underlying annual observations in df1 for each Country-Item combination (i.e. if FALSE, then LF is really just a static ratio; if TRUE, then 'Waste' is independent of 'Production').
How might this best be done? There seem to be 991 tests to conduct for this dataset alone and I don't know how to mix the apply and t.test functions in this manner.
Thanks!
t.test requires two groups to compare on a numeric/scale dependent output variable. Here, it seems to me that for each combination of country and item you want to compare all different year averages/means. In other words, you are trying to investigate if year is influencing the LF averages, for each combination of country and item.
The easiest way to do this is to create a linear model (LF ~ Year) for each combination of country and item and interpret the coefficient and p value of the variable year.
library(dplyr)
library(broom)
set.seed(115)
# example dataset
dt = data.frame(Country = rep("country1",12),
Item = c(rep("item1",6), rep("item2",6)),
Year = rep(1961:1966,2),
LF = runif(12,0,1))
# general means by country and item
dt %>% group_by(Country,Item) %>% summarise(Mean_LF = mean(LF))
# each years means by country and item
dt %>% group_by(Country,Item,Year) %>% summarise(Mean_LF = mean(LF))
# does year influence the means for each country and item?
dt %>% group_by(Country,Item) %>% do(tidy(lm(LF~Year, data=.)))
Hope this helps. Let me know if I'm missing something and I'll update my code.
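Without broom, the same per-group fits can be sketched in base R with split and lapply, using the same hypothetical example data as above:

```r
set.seed(115)
# Same toy data as in the dplyr answer
dt <- data.frame(
  Country = rep("country1", 12),
  Item    = c(rep("item1", 6), rep("item2", 6)),
  Year    = rep(1961:1966, 2),
  LF      = runif(12, 0, 1)
)

# One lm(LF ~ Year) per Country-Item combination
fits <- lapply(split(dt, list(dt$Country, dt$Item), drop = TRUE),
               function(d) lm(LF ~ Year, data = d))

# p-value of the Year coefficient for each fit
year_p <- sapply(fits, function(m) summary(m)$coefficients["Year", "Pr(>|t|)"])
year_p
```

With the real data this scales to all 991 combinations automatically, since `split` builds one data frame per Country-Item pair.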

R: Calculating 5 year averages in panel data

I have a balanced panel by country from 1951 to 2007 in a data frame. I'd like to transform it into a new data frame of five year averages of my other variables. When I sat down to do this I realized the only way I could think to do this involved a for loop and then decided that it was time to come to stackoverflow for help.
So, is there an easy way to turn data that looks like this:
country country.isocode year POP ci grgdpch
Argentina ARG 1951 17517.34 18.445022145 3.4602044759
Argentina ARG 1952 17876.96 17.76066507 -7.887407586
Argentina ARG 1953 18230.82 18.365255769 2.3118720688
Argentina ARG 1954 18580.56 16.982113434 1.5693778844
Argentina ARG 1955 18927.82 17.488907008 5.3690276523
Argentina ARG 1956 19271.51 15.907756547 0.3125559183
Argentina ARG 1957 19610.54 17.028450999 2.4896639667
Argentina ARG 1958 19946.54 17.541597134 5.0025894968
Argentina ARG 1959 20281.15 16.137310492 -6.763501447
Argentina ARG 1960 20616.01 20.519539628 8.481742144
...
Venezuela VEN 1997 22361.80 21.923577413 5.603872759
Venezuela VEN 1998 22751.36 24.451736863 -0.781844721
Venezuela VEN 1999 23128.64 21.585034168 -8.728234466
Venezuela VEN 2000 23492.75 20.224310777 2.6828641218
Venezuela VEN 2001 23843.87 23.480311721 0.2476965412
Venezuela VEN 2002 24191.77 16.290691319 -8.02535946
Venezuela VEN 2003 24545.43 10.972153646 -8.341989049
Venezuela VEN 2004 24904.62 17.147693312 14.644028806
Venezuela VEN 2005 25269.18 18.805970212 7.3156977879
Venezuela VEN 2006 25641.46 22.191098769 5.2737381326
Venezuela VEN 2007 26023.53 26.518210052 4.1367897561
into something like this:
country country.isocode period AvPOP Avci Avgrgdpch
Argentina ARG 1 18230 17.38474 1.423454
...
Venezuela VEN 12 25274 21.45343 5.454334
Do I need to transform this data frame using a specific panel data package? Or is there another easy way to do this that I'm missing?
This is the stuff aggregate is made for:
Df <- data.frame(
year=rep(1951:1970,2),
country=rep(c("Arg","Ven"),each=20),
var1 = c(1:20,51:70),
var2 = c(20:1,70:51)
)
Level <-cut(Df$year,seq(1951,1971,by=5),right=F)
id <- c("var1","var2")
> aggregate(Df[id],list(Df$country,Level),mean)
Group.1 Group.2 var1 var2
1 Arg [1951,1956) 3 18
2 Ven [1951,1956) 53 68
3 Arg [1956,1961) 8 13
4 Ven [1956,1961) 58 63
5 Arg [1961,1966) 13 8
6 Ven [1961,1966) 63 58
7 Arg [1966,1971) 18 3
8 Ven [1966,1971) 68 53
The only thing you might want to do, is to rename the categories and the variable names.
For this type of problem, the plyr package is truly phenomenal. Here is some code that gives you what you want in essentially a single line of code plus a small helper function.
library(plyr)
library(zoo)
library(pwt)
# First recreate dataset, using package pwt
data(pwt6.3)
pwt <- pwt6.3[
pwt6.3$country %in% c("Argentina", "Venezuela"),
c("country", "isocode", "year", "pop", "ci", "rgdpch")
]
# Use rollmean() in zoo as the basis for a 5-period rolling mean
rollmean5 <- function(x){
rollmean(x, 5)
}
# Use ddply() in plyr package to create rolling average per country
pwt.ma <- ddply(pwt, .(country), numcolwise(rollmean5))
Here is the output from this:
> head(pwt, 10)
country isocode year pop ci rgdpch
ARG-1950 Argentina ARG 1950 17150.34 13.29214 7736.338
ARG-1951 Argentina ARG 1951 17517.34 18.44502 8004.031
ARG-1952 Argentina ARG 1952 17876.96 17.76067 7372.721
ARG-1953 Argentina ARG 1953 18230.82 18.36526 7543.169
ARG-1954 Argentina ARG 1954 18580.56 16.98211 7661.550
ARG-1955 Argentina ARG 1955 18927.82 17.48891 8072.900
ARG-1956 Argentina ARG 1956 19271.51 15.90776 8098.133
ARG-1957 Argentina ARG 1957 19610.54 17.02845 8299.749
ARG-1958 Argentina ARG 1958 19946.54 17.54160 8714.951
ARG-1959 Argentina ARG 1959 20281.15 16.13731 8125.515
> head(pwt.ma)
country year pop ci rgdpch
1 Argentina 1952 17871.20 16.96904 7663.562
2 Argentina 1953 18226.70 17.80839 7730.874
3 Argentina 1954 18577.53 17.30094 7749.694
4 Argentina 1955 18924.25 17.15450 7935.100
5 Argentina 1956 19267.39 16.98977 8169.456
6 Argentina 1957 19607.51 16.82080 8262.250
Note that rollmean(), by default, calculates the centred moving mean. You can modify this behaviour to get the left or right moving mean by passing this parameter to the helper function.
EDIT:
@Joris Meys gently pointed out that you might in fact be after the average for five-year periods.
Here is the modified code to do this:
pwt$period <- cut(pwt$year, seq(1900, 2100, 5))
pwt.ma <- ddply(pwt, .(country, period), numcolwise(mean))
pwt.ma
And the output:
> pwt.ma
country period year pop ci rgdpch
1 Argentina (1945,1950] 1950.0 17150.336 13.29214 7736.338
2 Argentina (1950,1955] 1953.0 18226.699 17.80839 7730.874
3 Argentina (1955,1960] 1958.0 19945.149 17.42693 8410.610
4 Argentina (1960,1965] 1963.0 21616.623 19.09067 9000.918
5 Argentina (1965,1970] 1968.0 23273.736 18.89005 10202.665
6 Argentina (1970,1975] 1973.0 25216.339 19.70203 11348.321
7 Argentina (1975,1980] 1978.0 27445.430 23.34439 11907.939
8 Argentina (1980,1985] 1983.0 29774.778 17.58909 10987.538
9 Argentina (1985,1990] 1988.0 32095.227 15.17531 10313.375
10 Argentina (1990,1995] 1993.0 34399.829 17.96758 11221.807
11 Argentina (1995,2000] 1998.0 36512.422 19.03551 12652.849
12 Argentina (2000,2005] 2003.0 38390.719 15.22084 12308.493
13 Argentina (2005,2010] 2006.5 39831.625 21.11783 14885.227
14 Venezuela (1945,1950] 1950.0 5009.006 41.07972 7067.947
15 Venezuela (1950,1955] 1953.0 5684.009 44.60849 8132.041
16 Venezuela (1955,1960] 1958.0 6988.078 37.87946 9468.001
17 Venezuela (1960,1965] 1963.0 8451.073 26.93877 9958.935
18 Venezuela (1965,1970] 1968.0 10056.910 28.66512 11083.242
19 Venezuela (1970,1975] 1973.0 11903.185 32.02671 12862.966
20 Venezuela (1975,1980] 1978.0 13927.882 36.35687 13530.556
21 Venezuela (1980,1985] 1983.0 16082.694 22.21093 10762.718
22 Venezuela (1985,1990] 1988.0 18382.964 19.48447 10376.123
23 Venezuela (1990,1995] 1993.0 20680.645 19.82371 10988.096
24 Venezuela (1995,2000] 1998.0 22739.062 20.93509 10837.580
25 Venezuela (2000,2005] 2003.0 24550.973 17.33936 10085.322
26 Venezuela (2005,2010] 2006.5 25832.495 24.35465 11790.497
Use cut on your year variable to make the period variable, then use melt and cast from the reshape package to get the averages. There's a lot of other answers that can show you how; see https://stackoverflow.com/questions/tagged/r+reshape
There is a base stats and a plyr answer, so for completeness, here is a dplyr based answer. Using the toy data given by Joris, we have
Df <- data.frame(
year=rep(1951:1970,2),
country=rep(c("Arg","Ven"),each=20),
var1 = c(1:20,51:70),
var2 = c(20:1,70:51)
)
Now, using cut to create the periods, we can then group on them and get the means:
Df %>% mutate(period = cut(year, seq(1951, 1971, by = 5), right = FALSE)) %>%
group_by(country, period) %>% summarise(V1 = mean(var1), V2 = mean(var2))
Source: local data frame [8 x 4]
Groups: country
country period V1 V2
1 Arg [1951,1956) 3 18
2 Arg [1956,1961) 8 13
3 Arg [1961,1966) 13 8
4 Arg [1966,1971) 18 3
5 Ven [1951,1956) 53 68
6 Ven [1956,1961) 58 63
7 Ven [1961,1966) 63 58
8 Ven [1966,1971) 68 53
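The same grouping also works through aggregate's formula interface, which lets you average several columns at once via cbind. A sketch using Joris's toy data:

```r
# Joris's toy panel
Df <- data.frame(
  year    = rep(1951:1970, 2),
  country = rep(c("Arg", "Ven"), each = 20),
  var1    = c(1:20, 51:70),
  var2    = c(20:1, 70:51)
)

# Bucket years into 5-year periods, then average both variables per bucket
Df$period <- cut(Df$year, seq(1951, 1971, by = 5), right = FALSE)
avg <- aggregate(cbind(var1, var2) ~ country + period, data = Df, FUN = mean)
avg
```

The formula interface keeps the column names in the output, so no renaming of `Group.1`/`Group.2` is needed afterwards.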