Merge by nearest neighbour in group - R

I have two country*year level datasets that cover the same countries but in different years. I would like to merge the two so that each year is matched with its nearest neighbor, always within country (iso2code).
The first (dat1) looks like this (showing here only the head for AT, but iso2code has multiple different values):
iso2code year elect_polar_lrecon
<chr> <dbl> <dbl>
1 AT 1999 2.48
2 AT 2002 4.18
3 AT 2006 3.66
4 AT 2010 3.91
5 AT 2014 4.01
6 AT 2019 3.55
The second (dat2) looks like this:
iso2code year affpol
<chr> <dbl> <dbl>
1 AT 2008 2.47
2 AT 2013 2.49
3 DE 1998 2.63
4 DE 2002 2.83
5 DE 2005 2.89
6 DE 2009 2.09
In the end I would like to have something like (note that the value of affpol for 2008 could be matched both with 2010 and with 2006 as it is equally distant from both. If possible, I would go for the most recent date, as it is below):
iso2code year.1 elect_polar_lrecon year.2 affpol
<chr> <dbl> <dbl> <dbl> <dbl>
1 AT 1999 2.48
2 AT 2002 4.18
3 AT 2006 3.66
4 AT 2010 3.91 2008 2.47
5 AT 2014 4.01 2013 2.49
6 AT 2019 3.55
Not sure about how to do this... I am happy for a tidyverse solution, but really, all help is much appreciated!

As mentioned by Henrik, this can be solved by updating in a rolling join to the nearest which is available in the data.table package. Additionally, the OP has requested to go for the most recent date if matches are equally distant.
library(data.table)
setDT(dat1)[setDT(dat2), roll = "nearest", on = c("iso2code", "year"),
            `:=`(year.2 = i.year, affpol = i.affpol)]
dat1
iso2code year elect_polar_lrecon year.2 affpol
1: AT 1999 2.48 NA NA
2: AT 2002 4.18 NA NA
3: AT 2006 3.66 2008 2.47
4: AT 2010 3.91 NA NA
5: AT 2014 4.01 2013 2.49
6: AT 2019 3.55 NA NA
This operation has updated dat1 by reference, i.e., the two additional columns were added without copying the whole data object.
Now, the OP has requested to go for the most recent date if matches are equally distant, but the join has picked the older date. Apparently, there is no parameter to control this in a rolling join to the nearest.
The workaround is to create a helper variable nyear which holds the negative year and to join on this:
setDT(dat1)[, nyear := -year][
  setDT(dat2)[, nyear := -year],
  roll = "nearest", on = c("iso2code", "nyear"),
  `:=`(year.2 = i.year, affpol = i.affpol)][
  , nyear := NULL]
dat1
iso2code year elect_polar_lrecon year.2 affpol
1: AT 1999 2.48 NA NA
2: AT 2002 4.18 NA NA
3: AT 2006 3.66 NA NA
4: AT 2010 3.91 2008 2.47
5: AT 2014 4.01 2013 2.49
6: AT 2019 3.55 NA NA

I figured it out with the help of a friend. I leave it here in case anyone else is looking for a solution. Assuming that the first dataset is to_plot and the second is called to_plot2. Then:
find_nearest_year <- function(p_year, p_code){
  years <- to_plot$year[to_plot$iso2code == p_code]
  nearest_year <- years[1]
  for (i in sort(years, decreasing = TRUE)) {
    if (abs(i - p_year) < abs(nearest_year - p_year)) {
      nearest_year <- i
    }
  }
  return(nearest_year)
}
to_plot2 <- to_plot2 %>%
  group_by(iso2code, year) %>%
  mutate(matching_year = find_nearest_year(year, iso2code))
merged <- left_join(to_plot, to_plot2, by = c("iso2code", "year" = "matching_year"))
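For a tidyverse-only route, here is a rough sketch (assuming dat1 and dat2 as shown above; the intermediate name nearest is made up): join every dat2 row to all dat1 years of the same country, keep the closest dat1 year for each dat2 row with ties going to the more recent dat1 year, and attach the result back to dat1.
library(dplyr)

# all year pairs within a country; recent dplyr may warn about a
# many-to-many join, which is expected here
nearest <- dat1 %>%
  inner_join(dat2, by = "iso2code", suffix = c(".1", ".2")) %>%
  group_by(iso2code, year.2) %>%
  arrange(abs(year.1 - year.2), desc(year.1)) %>%  # ties -> most recent dat1 year
  slice(1) %>%
  ungroup()

result <- dat1 %>%
  rename(year.1 = year) %>%
  left_join(select(nearest, iso2code, year.1, year.2, affpol),
            by = c("iso2code", "year.1"))
For the AT rows shown above this should reproduce the expected matching, but note that if two dat2 years are nearest to the same dat1 year, that dat1 row will be duplicated.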

Euclidean distance for distinct classes of factors iterated by groups

Update: The answer suggested by Rui is great and works as it should. However, when I run it on about 7 million observations (my actual dataset), R gets stuck in the computation (I'm using a machine with 64 GB of RAM). Any other solutions are greatly appreciated!
I have a dataframe of patents consisting of the firms, application years, patent numbers, and patent classes. I want to calculate the Euclidean distance between consecutive years for each firm, based on patent classes, according to the following formula:
El_Dist = sqrt( sum_i (X_i - Y_i)^2 / ( sum_i X_i^2 * sum_i Y_i^2 ) )
where X_i is the number of patents belonging to class i in year t, Y_i is the number of patents belonging to class i in the previous year (t-1), and the sum runs over the distinct classes observed in either of the two years.
To further illustrate this, consider the following dataset:
df <- data.table(Firm = rep(c(LETTERS[1:2]), each = 6),
                 Year = rep(c(1990, 1990, 1991, 1992, 1992, 1993), 2),
                 Patent_Number = sample(184785:194785, 12, replace = FALSE),
                 Patent_Class = c(12, 5, 31, 12, 31, 6, 15, 15, 15, 3, 3, 1))
> df
Firm Year Patent_Number Patent_Class
1: A 1990 192473 12
2: A 1990 193702 5
3: A 1991 191889 31
4: A 1992 193341 12
5: A 1992 189512 31
6: A 1993 185582 6
7: B 1990 190838 15
8: B 1990 189322 15
9: B 1991 190620 15
10: B 1992 193443 3
11: B 1992 189937 3
12: B 1993 194146 1
Since 1990 is the first year for Firm A, there is no Euclidean distance for that year (NAs should be produced). Moving on to 1991, the distinct classes across this year (1991) and the previous year (1990) are 31, 5, and 12, so the formula is summed over these three distinct classes (there are three distinct i's). The formula's output for 1991 is therefore sqrt( ((1-0)^2 + (0-1)^2 + (0-1)^2) / (1^2 * (1^2 + 1^2)) ) = sqrt(3/2) ≈ 1.2247.
Following the same calculation and reiterating over firms, the final output should be:
> df
Firm Year Patent_Number Patent_Class El_Dist
1: A 1990 192473 12 NA
2: A 1990 193702 5 NA
3: A 1991 191889 31 1.2247450
4: A 1992 193341 12 0.7071068
5: A 1992 189512 31 0.7071068
6: A 1993 185582 6 1.2247450
7: B 1990 190838 15 NA
8: B 1990 189322 15 NA
9: B 1991 190620 15 0.5000000
10: B 1992 193443 3 1.1180340
11: B 1992 189937 3 1.1180340
12: B 1993 194146 1 1.1180340
I'm preferably looking for a data.table solution for speed purposes.
Thank you very much in advance for any help.
I believe that the function below does what the question asks for, but the results for Firm == "B" are not equal to the question's.
fEl_Dist <- function(X){
  Year <- X[["Year"]]
  PatentClass <- X[["Patent_Class"]]
  sapply(seq_along(Year), function(i){
    # rows from the current year and the previous year
    j <- which(Year %in% (Year[i] - 1:0))
    # patent counts per class for those (at most two) years
    tbl <- table(Year[j], PatentClass[j])
    if(NROW(tbl) == 1){
      NA_real_
    } else {
      numer <- sum((tbl[2, ] - tbl[1, ])^2)
      denom <- sum(tbl[2, ]^2) * sum(tbl[1, ]^2)
      sqrt(numer/denom)
    }
  })
}
setDT(df)[, El_Dist := fEl_Dist(.SD),
          by = .(Firm),
          .SDcols = c("Year", "Patent_Class")]
head(df)
# Firm Year Patent_Number Patent_Class El_Dist
#1: A 1990 190948 12 NA
#2: A 1990 186156 5 NA
#3: A 1991 190801 31 1.2247449
#4: A 1992 185226 12 0.7071068
#5: A 1992 185900 31 0.7071068
#6: A 1993 186928 6 1.2247449
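Regarding the OP's update about ~7 million observations: a vectorised data.table sketch of the same formula, working on per-class counts rather than looping over rows, could look roughly like this (the intermediate names cnt, prev, m and dist are made up, and I have not benchmarked it):
library(data.table)
setDT(df)

# patent counts per firm, year and class
cnt <- df[, .(n = .N), by = .(Firm, Year, Patent_Class)]

# previous-year counts, shifted forward one year so they align with the current year
prev <- copy(cnt)[, Year := Year + 1]
setnames(prev, "n", "n_prev")

# outer join over the union of classes seen in either of the two years
m <- merge(cnt, prev, by = c("Firm", "Year", "Patent_Class"), all = TRUE)
m[is.na(n), n := 0L][is.na(n_prev), n_prev := 0L]

# distance per firm/year; years without a previous year give Inf/NaN -> NA
dist <- m[, .(El_Dist = sqrt(sum((n - n_prev)^2) / (sum(n^2) * sum(n_prev^2)))),
          by = .(Firm, Year)]
dist[!is.finite(El_Dist), El_Dist := NA_real_]

# attach the distances back to the patent-level rows
df[dist, El_Dist := i.El_Dist, on = .(Firm, Year)]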

Making Sense of Time Series Data with > 43,000 observations

Updated Post
After a lot of work, I have finally merged three different datasets. The result is a time series data frame with 43,396 observations of seven variables. Below, I have included a few rows of what my data looks like.
Dyad year cyberattack cybersev MID MIDsev peace_score
2360 2005 NA NA 0 1 0
2360 2006 NA NA NA NA 0
2360 2007 1 3.0 0 1 0
2360 2008 1 4.0 0 1 0
2360 2009 3 3.33 1 4 0
2360 2010 1 3.0 NA NA 0
2360 2011 3 2.0 NA NA 0
2360 2012 1 2.0 NA NA 0
2360 2013 4 2.0 NA NA 0
If I am interested in comparing how different country pairs (dyads) differ in how often they launch attacks (either in cyberspace, physically with MIDs, or neither)...how should I go about doing this?
Since I am working with country/year data, how can I get descriptive statistics for the different country pairs (Dyads) in my Dyad variable? For example, I would like to know how the behavior of Dyad 2360 (USA and Iran) compares with other dyads.
I tried this code, but it just gave me a list of my unique dyad pairs:
table(final$Dyadpair)
names(sort(-table(final$Dyadpair)))
You mentioned using aggregate or dplyr - but I don't see how those will allow me to get descriptive statistics for all of my unique dyads? Would you mind elaborating on this?
Is it possible for a code to return something like this: For Dyad 2360 during the years 2005-2013, 80% were NA, 10% were cyber attacks, and 10% were MID attacks, etc. ?
Update to clarify:
Ok, yes - the above example was just hypothetical. Based on the nine rows of data that I have provided - here is what I am hoping R can provide when it comes to descriptive statistics.
Dyad: 2360
No attacks: 22.22% (2/9) ….in 2005 and 2006
Cyber attacks: 77.78% (7/9) ….in the years 2007-2013
MID attacks: 11.11% (1/9) ….in 2009
Both cyber and MID: 11.11% (1/9) ….in 2009
Essentially, during a given time range (2005-2013 for the example I gave above), how many of those years result in NO attacks, how many of those years result in a cyber attack, how many of those years result in a MID attack, and how many of those years result in both a cyber and MID attack.
I do not know if this is possible with how my data is set up, since I aggregated cyber attacks and MID attacks per year. And yes, I would also like to take the severity of the attacks (both cyber attacks and MID attacks) into consideration, but I don't know how to do that.
Does this help clarify what I am looking for?
Here's a dplyr approach with my best guess for what you want. It will output a data frame with one row per dyad and the same summary statistics for each dyad.
library(dplyr)
your_data %>%
  group_by(Dyad) %>%
  summarize(
    year_range = paste(min(year), max(year), sep = "-"),
    no_attacks = mean(is.na(cyberattack) & (is.na(MID) | MID == 0)),
    cyber_attacks = mean(!is.na(cyberattack)),
    MID_attacks = mean(!is.na(MID) & MID > 0),
    cyber_and_MID = mean(!is.na(cyberattack) & !is.na(MID) & MID > 0),
    cyber_sev_weighted = weighted.mean(cyberattack, w = cybersev, na.rm = TRUE)
  )
# # A tibble: 1 x 7
# Dyad year_range no_attacks cyber_attacks MID_attacks cyber_and_MID cyber_sev_weighted
# <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2360 2005-2013 0.222 0.778 0.111 0.111 1.86
Using this data:
your_data = read.table(text = 'Dyad year cyberattack cybersev MID MIDsev peace_score
2360 2005 NA NA 0 1 0
2360 2006 NA NA NA NA 0
2360 2007 1 3.0 0 1 0
2360 2008 1 4.0 0 1 0
2360 2009 3 3.33 1 4 0
2360 2010 1 3.0 NA NA 0
2360 2011 3 2.0 NA NA 0
2360 2012 1 2.0 NA NA 0
2360 2013 4 2.0 NA NA 0', header = T)
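If you also want the summary to say which years fall into each category (as in "….in 2005 and 2006"), the same conditions can be reused to paste the matching years together. A sketch with made-up column names:
your_data %>%
  group_by(Dyad) %>%
  summarize(
    no_attack_years = paste(year[is.na(cyberattack) & (is.na(MID) | MID == 0)], collapse = ", "),
    cyber_attack_years = paste(year[!is.na(cyberattack)], collapse = ", "),
    MID_attack_years = paste(year[!is.na(MID) & MID > 0], collapse = ", ")
  )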

How to create a loop for sum calculations which then are inserted into a new row?

I have tried to find a solution via similar topics, but haven't found anything suitable. This may be due to the search terms I have used. If I have missed something, please accept my apologies.
Here is an excerpt of my data UN_ (the provided sample should be sufficient):
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
AT 1990 Total 7.869005
AT 1991 1 1.484667
AT 1991 2 1.001578
AT 1991 3 4.625927
AT 1991 4 2.515453
AT 1991 5 2.702081
AT 1991 Total 8.249567
....
BE 1994 1 3.008115
BE 1994 2 1.550344
BE 1994 3 1.080667
BE 1994 4 1.768645
BE 1994 5 7.208295
BE 1994 Total 1.526016
BE 1995 1 2.958820
BE 1995 2 1.571759
BE 1995 3 1.116049
BE 1995 4 1.888952
BE 1995 5 7.654881
BE 1995 Total 1.547446
....
What I want to do is add another row with UN_$sector = "Residual". The value of Residual will be the UN value where sector is "Total" minus the sum of the UN values for the sectors c("1", "2", "3", "4", "5"), for a given year AND country.
This is how it should look:
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
----> AT 1990 Residual TO BE CALCULATED
AT 1990 Total 7.869005
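For instance, for the AT 1990 block above the Residual would be 7.869005 - (1.407555 + 1.037137 + 4.769618 + 2.455139 + 2.238618) = -4.039062 (negative here because the sector values in this sample sum to more than the Total).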
As I don't want to write many, many lines of code I'm looking for a way to automate this. I was told about loops, but can't really follow the concept at the moment.
Thank you very much for any type of help!!
Best,
Constantin
PS: (for Parfait)
country year sector UN ETS
UK 2012 1 190336512 NA
UK 2012 2 18107910 NA
UK 2012 3 8333564 NA
UK 2012 4 11269017 NA
UK 2012 5 2504751 NA
UK 2012 Total 580957306 NA
UK 2013 1 177882200 NA
UK 2013 2 20353347 NA
UK 2013 3 8838575 NA
UK 2013 4 11051398 NA
UK 2013 5 2684909 NA
UK 2013 Total 566322778 NA
Consider calculating the residual first and then stacking it with the other pieces of the data:
# CALCULATE RESIDUALS BY MERGED COLUMNS
agg <- within(merge(aggregate(UN ~ country + year, data = subset(df, sector != 'Total'), sum),
                    aggregate(UN ~ country + year, data = subset(df, sector == 'Total'), sum),
                    by = c("country", "year")),
              {
                UN <- UN.y - UN.x
                sector <- 'Residual'
              })

# ROW BIND DIFFERENT PIECES
final_df <- rbind(subset(df, sector != 'Total'),
                  agg[c("country", "year", "sector", "UN")],
                  subset(df, sector == 'Total'))

# ORDER ROWS AND RESET ROWNAMES
final_df <- with(final_df, final_df[order(country, year, as.character(sector)), ])
row.names(final_df) <- NULL
Rextester demo
final_df
# country year sector UN
# 1 AT 1990 1 1.407555
# 2 AT 1990 2 1.037137
# 3 AT 1990 3 4.769618
# 4 AT 1990 4 2.455139
# 5 AT 1990 5 2.238618
# 6 AT 1990 Residual -4.039062
# 7 AT 1990 Total 7.869005
# 8 AT 1991 1 1.484667
# 9 AT 1991 2 1.001578
# 10 AT 1991 3 4.625927
# 11 AT 1991 4 2.515453
# 12 AT 1991 5 2.702081
# 13 AT 1991 Residual -4.080139
# 14 AT 1991 Total 8.249567
# 15 BE 1994 1 3.008115
# 16 BE 1994 2 1.550344
# 17 BE 1994 3 1.080667
# 18 BE 1994 4 1.768645
# 19 BE 1994 5 7.208295
# 20 BE 1994 Residual -13.090050
# 21 BE 1994 Total 1.526016
# 22 BE 1995 1 2.958820
# 23 BE 1995 2 1.571759
# 24 BE 1995 3 1.116049
# 25 BE 1995 4 1.888952
# 26 BE 1995 5 7.654881
# 27 BE 1995 Residual -13.643015
# 28 BE 1995 Total 1.547446
I think there are multiple ways you can do this. What I would recommend is to take advantage of the tidyverse suite of packages, which includes dplyr.
Without getting too far into what dplyr and the tidyverse can achieve, we can lean on dplyr's group_by(...), summarise(...), arrange(...) and bind_rows(...) functions. There are also tons of great tutorials, cheat sheets, and documentation on all tidyverse packages.
Although it is less and less relevant these days, we generally want to avoid for loops in R. Therefore, we will create a new data frame which contains all of the Residual values and then bind it back into your original data frame.
Step 1: Calculating all residual values
For each country and year, the residual is the Total value minus the sum of the UN values for sectors 1-5. We can achieve this with:
res_UN = UN_ %>%
  group_by(country, year) %>%
  summarise(UN = sum(UN[sector == "Total"], na.rm = TRUE) -
                 sum(UN[sector != "Total"], na.rm = TRUE))
Step 2: Add a sector column to res_UN with the value 'Residual'
This yields a data frame containing country, year, and UN; we now need to add a column sector with the value 'Residual' to satisfy your specifications.
res_UN$sector = 'Residual'
Step 3: Add res_UN back to UN_ and order accordingly
res_UN and UN_ now have the same columns and they can now be added back together.
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
Piecing this all together should answer your question, and it can be achieved in a few lines.
TLDR:
res_UN = UN_ %>%
  group_by(country, year) %>%
  summarise(UN = sum(UN[sector == "Total"], na.rm = TRUE) -
                 sum(UN[sector != "Total"], na.rm = TRUE))
res_UN$sector = 'Residual'
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)

dplyr group_by and iterative loop calculation

I am trying to perform an iterative calculation on grouped data that depend on two previous elements within a group. As a toy example:
set.seed(100)
df = data.table(ID = c(rep("A_index1", 9)),
                Year = c(2001:2005, 2001:2004),
                Price = c(NA, NA, 10, NA, NA, 15, NA, 13, NA),
                Index = sample(seq(1, 3, by = 0.5), size = 9, replace = TRUE))
R> df
ID Year Price Index
1: A_index1 2001 NA 1.5
2: A_index1 2002 NA 1.5
3: A_index1 2003 10 2.0
4: A_index1 2004 NA 1.0
5: A_index1 2005 NA 2.0
6: A_index1 2006 15 2.0
7: A_index1 2007 NA 3.0
8: A_index1 2008 13 1.5
9: A_index1 2009 NA 2.0
The objective is to fill the missing prices using the last available price and an index to adjust. I have a loop that performs these calculations, which I am trying to vectorize using dplyr.
My logic is shown in the loop below:
df$Price_adj = df$Price
for (i in 2:nrow(df)) {
  if (is.na(df$Price[i])) {
    df$Price_adj[i] = round(df$Price_adj[i-1] * df$Index[i] / df$Index[i-1], 2)
  }
}
R> df
ID Year Price Index Price_adj
1: A_index1 2001 NA 1.5 NA
2: A_index1 2002 NA 1.5 NA
3: A_index1 2003 10 2.0 10.00
4: A_index1 2004 NA 1.0 5.00
5: A_index1 2005 NA 2.0 10.00
6: A_index1 2006 15 2.0 15.00
7: A_index1 2007 NA 3.0 22.50
8: A_index1 2008 13 1.5 13.00
9: A_index1 2009 NA 2.0 17.33
In my actual (large) data, I will have to apply this calculation to multiple groups, and speed is a consideration. My attempt is below; it needs help to point me in the right direction. I did consider Reduce, but I am not sure how it can incorporate two previous elements within the group.
foo = function(Price, Index){
  for (i in 2:nrow(df)) {
    if (is.na(df$Price[i])) {
      df$Price_adj[i] = df$Price_adj[i-1] * df$Index[i] / df$Index[i-1]
    }
  }
}

df %>%
  group_by(ID) %>%
  mutate(Price_adj = Price,
         Price_adj = foo(Price, Index))
One option with cumprod:
df %>%
  # group data frame into chunks starting from non-NA price
  group_by(ID, g = cumsum(!is.na(Price))) %>%
  # for each chunk multiply the first non-NA price with the cumprod of Index[i]/Index[i-1]
  mutate(Price_adj = round(first(Price) * cumprod(Index / lag(Index, default = first(Index))), 2)) %>%
  ungroup() %>%
  select(-g)
# A tibble: 9 x 5
# ID Year Price Index Price_adj
# <fctr> <int> <dbl> <dbl> <dbl>
#1 A_index1 2001 NA 1.5 NA
#2 A_index1 2002 NA 1.5 NA
#3 A_index1 2003 10 2.0 10.00
#4 A_index1 2004 NA 1.0 5.00
#5 A_index1 2005 NA 2.0 10.00
#6 A_index1 2001 15 2.0 15.00
#7 A_index1 2002 NA 3.0 22.50
#8 A_index1 2003 13 1.5 13.00
#9 A_index1 2004 NA 2.0 17.33
Group the data frame by ID and cumsum(!is.na(Price)); the latter splits the data frame into chunks, each of which starts with a non-NA Price.
first(Price) * cumprod(Index / lag(Index, default = first(Index))) does the iterative calculation, which is equivalent to the formula given in the question if you keep substituting Price_adj[i-1] with the expression for Price_adj[i-2], and so on, until you reach Price_adj[1], i.e. first(Price). For example, in the chunk starting at the Price of 10, the third row works out to 10 * (1.0/2.0) * (2.0/1.0) = 10.
Caveat: this may not be very efficient if you have many NA chunks.
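On the Reduce idea mentioned in the question: it can work if the accumulated value is the previous adjusted price while the reduction iterates over row indices, so that Index[i] and Index[i-1] stay accessible. A sketch (the helper name adj_prices is made up; it mirrors the question's loop, including the per-step rounding):
adj_prices <- function(price, index) {
  out <- Reduce(
    function(prev, i) {
      if (is.na(price[i])) round(prev * index[i] / index[i - 1], 2) else price[i]
    },
    x = seq_along(price)[-1],
    init = price[1],
    accumulate = TRUE
  )
  unlist(out)  # Reduce(accumulate = TRUE) may return a list of scalars
}

df %>%
  group_by(ID) %>%
  mutate(Price_adj = adj_prices(Price, Index)) %>%
  ungroup()
Like the original loop, this is still sequential within each group.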
If the speed is the primary concern, you could write your function using Rcpp package:
library(Rcpp)
cppFunction("
  NumericVector price_adj(NumericVector price, NumericVector index) {
    int n = price.size();
    NumericVector adjusted_price(n);
    adjusted_price[0] = price[0];
    for (int i = 1; i < n; i++) {
      if (NumericVector::is_na(price[i])) {
        adjusted_price[i] = adjusted_price[i-1] * index[i] / index[i-1];
      } else {
        adjusted_price[i] = price[i];
      }
    }
    return adjusted_price;
  }
")
Now use the cpp function with dplyr as follows:
cpp_fun <- function() df %>% group_by(ID) %>% mutate(Price_adj = round(price_adj(Price, Index), 2))
cpp_fun()
# A tibble: 9 x 5
# Groups: ID [1]
# ID Year Price Index Price_adj
# <fctr> <int> <dbl> <dbl> <dbl>
#1 A_index1 2001 NA 1.5 NA
#2 A_index1 2002 NA 1.5 NA
#3 A_index1 2003 10 2.0 10.00
#4 A_index1 2004 NA 1.0 5.00
#5 A_index1 2005 NA 2.0 10.00
#6 A_index1 2001 15 2.0 15.00
#7 A_index1 2002 NA 3.0 22.50
#8 A_index1 2003 13 1.5 13.00
#9 A_index1 2004 NA 2.0 17.33
Benchmark:
Define r_fun as:
r_fun <- function() df %>% group_by(ID, g = cumsum(!is.na(Price))) %>% mutate(Price_adj = round(first(Price) * cumprod(Index / lag(Index, default=first(Index))), 2)) %>% ungroup() %>% select(-g)
On the small sample data, there's already a difference:
microbenchmark::microbenchmark(r_fun(), cpp_fun())
#Unit: milliseconds
# expr min lq mean median uq max neval
# r_fun() 10.127839 10.500281 12.627831 11.148093 12.686662 101.466975 100
# cpp_fun() 3.191278 3.308758 3.738809 3.491495 3.937006 6.627019 100
Testing on a slightly larger data frame:
df <- bind_rows(rep(list(df), 10000))
#dim(df)
#[1] 90000 4
microbenchmark::microbenchmark(r_fun(), cpp_fun(), times = 10)
#Unit: milliseconds
# expr min lq mean median uq max neval
# r_fun() 842.706134 890.978575 904.70863 908.77042 921.89828 986.44576 10
# cpp_fun() 8.722794 8.888667 10.67781 10.86399 12.10647 13.68302 10
Identity test:
identical(ungroup(r_fun()), ungroup(cpp_fun()))
# [1] TRUE

merge two data frames based on matching rows of multiple columns

Below are the summary and structure of the two data sets I tried to merge, claimants and unemp; they can be found here: claims.csv and unemp.csv
> tbl_df(claimants)
# A tibble: 6,960 × 5
X County Month Year Claimants
<int> <fctr> <fctr> <int> <int>
1 1 ALAMEDA Jan 2007 13034
2 2 ALPINE Jan 2007 12
3 3 AMADOR Jan 2007 487
4 4 BUTTE Jan 2007 3496
5 5 CALAVERAS Jan 2007 644
6 6 COLUSA Jan 2007 1244
7 7 CONTRA COSTA Jan 2007 8475
8 8 DEL NORTE Jan 2007 328
9 9 EL DORADO Jan 2007 2120
10 10 FRESNO Jan 2007 19974
# ... with 6,950 more rows
> tbl_df(unemp)
# A tibble: 6,960 × 7
County Year Month laborforce emplab unemp unemprate
* <chr> <int> <chr> <int> <int> <int> <dbl>
1 Alameda 2007 Jan 743100 708300 34800 4.7
2 Alameda 2007 Feb 744800 711000 33800 4.5
3 Alameda 2007 Mar 746600 713200 33300 4.5
4 Alameda 2007 Apr 738200 705800 32400 4.4
5 Alameda 2007 May 739100 707300 31800 4.3
6 Alameda 2007 Jun 744900 709100 35800 4.8
7 Alameda 2007 Jul 749600 710900 38700 5.2
8 Alameda 2007 Aug 746700 709600 37000 5.0
9 Alameda 2007 Sep 748200 712100 36000 4.8
10 Alameda 2007 Oct 749000 713000 36100 4.8
# ... with 6,950 more rows
I thought first I should change all the factor columns to character columns.
unemp[sapply(unemp, is.factor)] <- lapply(unemp[sapply(unemp, is.factor)], as.character)
claimants[sapply(claimants, is.factor)] <- lapply(claimants[sapply(claimants, is.factor)], as.character)
m <-merge(unemp, claimants, by = c("County", "Month", "Year"))
dim(m)
[1] 0 10
As the output of dim(m) shows, the resulting dataframe has 0 rows, although all 6960 rows should match each other uniquely.
To verify that the two data frames have unique combinations of the 3 columns 'County', 'Month', and 'Year', I reorder and rearrange these columns within the dataframes as below:
a <- claimants[ order(claimants[,"County"], claimants[,"Month"], claimants[,"Year"]), ]
b <- unemp[ order(unemp[,"County"], unemp[,"Month"], unemp[,"Year"]), ]
b[2:4] <- b[c(2,4,3)]
a[2:4] %in% b[2:4]
[1] TRUE TRUE TRUE
This last output confirms that all 'County', 'Month', and 'Year' columns match each other in these two dataframes.
I have tried looking into the documentation for merge and could not gather where do I go wrong, I have also tried the inner_join function from dplyr:
> m <- inner_join(unemp[2:8], claimants[2:5])
Joining, by = c("County", "Year", "Month")
> dim(m)
[1] 0 8
I am missing something and don't know what; I would appreciate help understanding this. I know I should not have to rearrange the rows by the three columns to run merge; R should identify the matching rows and merge the non-matching columns.
The claimants df has the counties in all uppercase, while the unemp df has them in title case (e.g. ALAMEDA vs Alameda), so the join keys never match.
I used options(stringsAsFactors = FALSE) when reading in your data. A few suggestions: drop the X column in both, as it doesn't seem useful.
options(stringsAsFactors = FALSE)
claims <- read.csv("claims.csv",header=TRUE)
claims$X <- NULL
unemp <- read.csv("unemp.csv",header=TRUE)
unemp$X <- NULL
unemp$County <- toupper(unemp$County)
m <- inner_join(unemp, claims)
dim(m)
# [1] 6960 8
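As a general sanity check before merging on character keys, comparing the distinct key values directly makes this kind of mismatch easy to spot (sketch; run it before the toupper() step above):
setdiff(unique(claims$County), unique(unemp$County))  # key values present in claims only
head(unique(unemp$County))                            # shows the original title-case spelling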
