Explain joining please? - r

I need some help understanding the concept of joining.
I understand how to mentally model how a join works if you have 2 data files that have a common variable. Like:
Animal  Weight  Age
Dog     12      5
Cat     4       19
Fish    2       4
Mouse   1       2

Animal  Award
Dog     1st
Cat     1st
Fish    3rd
Mouse   5th
These can be joined because the animal column is exactly the same and it just adds on another variable to the same observations of animals.
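As a minimal sketch of that simple case in R (the data frame names animals and awards are made up for illustration), base R's merge matches rows on the shared Animal column:

```r
# Hypothetical data frames holding the two tables from the question
animals <- data.frame(
  Animal = c("Dog", "Cat", "Fish", "Mouse"),
  Weight = c(12, 4, 2, 1),
  Age    = c(5, 19, 4, 2),
  stringsAsFactors = FALSE
)
awards <- data.frame(
  Animal = c("Dog", "Cat", "Fish", "Mouse"),
  Award  = c("1st", "1st", "3rd", "5th"),
  stringsAsFactors = FALSE
)

# Join on the shared key column; each animal keeps its own row
# and gains the Award value from the second table
joined <- merge(animals, awards, by = "Animal")
print(joined)
```

Because every animal appears exactly once in both tables, the result has one row per animal and no NA values.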
But I don't understand it when it's something like this:
Mortality Rate (Heart Attack)
Year  Place   Death Rate (Heart Attack)
2011  Paris   200
2011  Paris   94
2011  Rome    23
2009  London  15

Mortality Rate (Car Crash)
Year  Place      Death Rate (Car Crash)
2011  London     987
2012  London     34
2012  Paris      09
2007  Melbourne  12
The variable TYPES are the same (years, cities and death rates). But the year values aren't the same, they aren't in the same order, there aren't the same number of 2011s, for example, the locations are different, and there are obviously two different death rates that need to be two different columns. So how does this join work? Which variable would you join by? How would it be configured once joined? Would it just result in lots of NA values if this was across a larger data set?
I understand there are different types of joins that do different things, but I'm just struggling to understand how the years and cities would sit if you were wanting to be able to compare the two different death rates in cities and years.
Thank you!

If you do
merge(heart, car, all=TRUE)
#   Year     Place Death_Rate_heart Death_Rate_Car
# 1 2007 Melbourne               NA             12
# 2 2009    London               15             NA
# 3 2011    London               NA            987
# 4 2011     Paris              200             NA
# 5 2011     Paris               94             NA
# 6 2011      Rome               23             NA
# 7 2012    London               NA             34
# 8 2012     Paris               NA              9
By default, merge joins on all columns whose names appear in both data frames, here Year and Place. It matches rows on the (Year, Place) pair as a whole, so a year is never matched independently of its place. More verbosely you could do
merge(heart, car, all=TRUE, by.x=c("Year", "Place"), by.y=c("Year", "Place"))
which is actually what happens in this case.
Data:
heart <- structure(list(Year = c(2011L, 2011L, 2011L, 2009L), Place = c("Paris",
"Paris", "Rome", "London"), Death_Rate_heart = c(200L, 94L, 23L,
15L)), class = "data.frame", row.names = c(NA, -4L))
car <- structure(list(Year = c(2011L, 2012L, 2012L, 2007L), Place = c("London",
"London", "Paris", "Melbourne"), Death_Rate_Car = c(987L, 34L,
9L, 12L)), class = "data.frame", row.names = c(NA, -4L))
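To see how the join type controls which rows survive and where NA values appear, here is a small sketch that rebuilds the same two data frames and compares an inner, a left, and a full join. The object names inner, left, and full are only illustrative.

```r
# Same data as above, rebuilt so this block runs on its own
heart <- data.frame(
  Year = c(2011L, 2011L, 2011L, 2009L),
  Place = c("Paris", "Paris", "Rome", "London"),
  Death_Rate_heart = c(200L, 94L, 23L, 15L),
  stringsAsFactors = FALSE
)
car <- data.frame(
  Year = c(2011L, 2012L, 2012L, 2007L),
  Place = c("London", "London", "Paris", "Melbourne"),
  Death_Rate_Car = c(987L, 34L, 9L, 12L),
  stringsAsFactors = FALSE
)

# Inner join: keeps only (Year, Place) pairs present in BOTH tables
inner <- merge(heart, car)

# Left join: keeps every row of heart, NA where car has no match
left <- merge(heart, car, all.x = TRUE)

# Full join: keeps every row of both tables
full <- merge(heart, car, all = TRUE)

# Compare how many rows each join type returns
print(c(inner = nrow(inner), left = nrow(left), full = nrow(full)))
```

In this tiny example no (Year, Place) pair occurs in both tables, so the inner join is empty and the full join is all NA on one side or the other. With a larger data set, any city and year present in both tables would get both death rates side by side on the same row.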

Related

Identifying matching observations in dyadic data in R

Hello everyone,
I am struggling with the following issue. Currently, I have a dataset looking like this:
living_in from Year stock
Austria Australia 2014 2513
Austria Australia 2013 2000
Germany Austria 2010 6000
Australia Austria 2014 3000
Austria Australia 1993 NA
Now I would like to identify all observations that fulfill the following criteria:
Should be from same year
Should contain the same country pairs in that year
Should not contain NA
For instance, I want to find all observations for combinations of two countries like Austria-Australia and Australia-Austria within the same year that contain values. This is due to the fact that some combinations in a given year in the dataset have only one value for stock not two. I want to remove those.
What is the best way to proceed here? Many thanks in advance!
P.S. I have about 14 country pairs in my dataset that need this kind of identification
A helpful output might be something like this.
living_in from Year stock dummy
Austria Australia 2014 2513 1
Austria Australia 2013 2000 0
Germany Austria 2010 6000 0
Australia Austria 2014 3000 1
Austria Australia 1993 NA 0
For each pair of countries, irrespective of order (A-B is the same as B-A), assign 1 to the dummy column if the pair has more than one row for the same Year and all of its stock values are non-NA; otherwise assign 0.
library(dplyr)
df %>%
  group_by(col1 = pmin(living_in, from), col2 = pmax(living_in, from), Year) %>%
  mutate(dummy = as.integer(n() > 1 && all(!is.na(stock)))) %>%
  ungroup %>%
  select(-col1, -col2)
# living_in from Year stock dummy
# <chr> <chr> <int> <int> <int>
#1 Austria Australia 2014 2513 1
#2 Austria Australia 2013 2000 0
#3 Germany Austria 2010 6000 0
#4 Australia Austria 2014 3000 1
#5 Austria Australia 1993 NA 0
data
df <- structure(list(living_in = c("Austria", "Austria", "Germany",
"Australia", "Austria"), from = c("Australia", "Australia", "Austria",
"Austria", "Australia"), Year = c(2014L, 2013L, 2010L, 2014L,
1993L), stock = c(2513L, 2000L, 6000L, 3000L, NA)),
class = "data.frame", row.names = c(NA, -5L))
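Since the question ultimately wants to remove the incomplete pairs rather than just flag them, the same grouping idea can be used as a row filter. The following is a base R sketch under the same assumptions (country columns are character, not factor); the names key and complete are made up here.

```r
# Same example data, rebuilt so this block runs on its own
df <- data.frame(
  living_in = c("Austria", "Austria", "Germany", "Australia", "Austria"),
  from = c("Australia", "Australia", "Austria", "Austria", "Australia"),
  Year = c(2014L, 2013L, 2010L, 2014L, 1993L),
  stock = c(2513L, 2000L, 6000L, 3000L, NA),
  stringsAsFactors = FALSE
)

# Order-independent key: alphabetically smaller country first, then Year
key <- paste(pmin(df$living_in, df$from), pmax(df$living_in, df$from), df$Year)

# TRUE for rows whose pair/year group has more than one row and no NA stock
complete <- ave(!is.na(df$stock), key,
                FUN = function(x) length(x) > 1 && all(x))

# Keep only the rows that belong to a complete pair
df_complete <- df[complete, ]
print(df_complete)
```

Only the two 2014 rows (Austria-Australia and Australia-Austria) remain, which are the rows that got dummy = 1 above.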

Use unique pairs of column values to generate dyad identifiers in the dataframe

I want to generate a set of dyad identifiers for a bilateral trade flow dataframe (that is coded in from, to, and amount traded format) such that I could use these identifiers for further statistical analysis.
My example data is provided below, from which I have extracted and identified the unique country dyads that involve the US.
# load the example data
trade_flow <- readRDS(gzcon(url("https://www.dropbox.com/s/ep7xldoq9go4f0g/trade_flow.rds?dl=1")))
# extract country dyads
country_dyad <- trade_flow[, c("from", "to")]
# identify unique pairs
up <- country_dyad[!duplicated(t(apply(country_dyad, 1, sort))),]
# extract only unique pairs that involve the US
up <- up[(up$from == "USA") | (up$to == "USA"), ]
## how can I use the unique pair object (up) to generate dyad identifiers and include them as a new column in the trade_flow dataframe
The next step is to match these unique dyad pairs against the from and to columns of the original dataframe (trade_flow) and add the resulting dyad identifiers to it as a new column (say, dyad). It should look something like the format below, in which each unique dyad is identified and coded as a unique numerical value. I would be grateful if someone could help me with this.
from to trade_flow dyad
USA ITA 5100 2
USA UKG 4000 1
USA GMY 17000 3
USA ITA 4500 2
USA JPN 2900 4
USA UKG 6700 1
USA ROK 7000 5
USA UKG 2300 1
USA SAF 1500 6
IND USA 2400 7
Assuming that flows are directional, so that A/B and B/A are different flows, paste the from and to columns together and convert the result to a factor. The internal codes that a factor uses are 1, 2, ..., no_of_levels; extract them with as.numeric.
transform(DF, dyad = as.numeric(factor(paste(from, to))))
giving:
from to trade_flow dyad
1 USA ITA 5100 3
2 USA UKG 4000 7
3 USA GMY 17000 2
4 USA ITA 4500 3
5 USA JPN 2900 4
6 USA UKG 6700 7
7 USA ROK 7000 5
8 USA UKG 2300 7
9 USA SAF 1500 6
10 IND USA 2400 1
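If flows should instead be treated as undirected, so that A/B and B/A share one identifier, a possible variation is to sort each pair before pasting. This is only a sketch: the four-row DF below is a made-up subset plus one hypothetical reverse flow (ITA to USA) added to show the effect, and the names key, dyad_sorted, and dyad_first are illustrative.

```r
# Small made-up example; the ITA -> USA row is hypothetical
DF <- data.frame(
  from = c("USA", "USA", "IND", "ITA"),
  to   = c("ITA", "UKG", "USA", "USA"),
  trade_flow = c(5100, 4000, 2400, 1200),
  stringsAsFactors = FALSE
)

# Order-independent key: alphabetically smaller country first
key <- paste(pmin(DF$from, DF$to), pmax(DF$from, DF$to))

# Identifiers numbered by the sorted order of the keys
DF$dyad_sorted <- as.integer(factor(key))

# Alternative: identifiers numbered by order of first appearance
DF$dyad_first <- match(key, unique(key))

print(DF)
```

Rows 1 and 4 (USA/ITA and ITA/USA) receive the same identifier in both columns. pmin and pmax need character input here, so factor columns would have to be converted with as.character first.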
Applying assignments made on subset to whole
Suppose we want to assign dyad numbers using only a subset of the rows of DF, for example head(DF), and then apply those assignments to all of DF, with NA for flows that do not occur in the subset. First assign the dyads on the subset as above, giving DF0 (first line below). Then drop the trade_flow column from DF0 and reduce it to its unique rows with unique. Finally, merge that with DF on the first two columns, using all.x = TRUE so that unmatched rows in DF are kept.
DF0 <- transform(head(DF), dyad = as.numeric(factor(paste(from, to))))
merge(DF, unique(DF0[-3]), all.x = TRUE, by = 1:2)
giving:
from to trade_flow dyad
1 IND USA 2400 NA
2 USA GMY 17000 1
3 USA ITA 4500 2
4 USA ITA 5100 2
5 USA JPN 2900 3
6 USA ROK 7000 NA
7 USA SAF 1500 NA
8 USA UKG 4000 4
9 USA UKG 2300 4
10 USA UKG 6700 4
Note
Input in reproducible form:
Lines <- "from to trade_flow
USA ITA 5100
USA UKG 4000
USA GMY 17000
USA ITA 4500
USA JPN 2900
USA UKG 6700
USA ROK 7000
USA UKG 2300
USA SAF 1500
IND USA 2400"
DF <- read.table(text = Lines, header = TRUE)
Here is an option using base R
df1$dyad <- with(df1, as.integer(droplevels(interaction(from, to,
lex.order = TRUE))))
df1$dyad
#[1] 3 7 2 3 4 7 5 7 6 1
data
df1 <- structure(list(from = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L), .Label = c("IND", "USA"), class = "factor"), to = structure(c(2L,
6L, 1L, 2L, 3L, 6L, 4L, 6L, 5L, 7L), .Label = c("GMY", "ITA",
"JPN", "ROK", "SAF", "UKG", "USA"), class = "factor"), trade_flow = c(5100L,
4000L, 17000L, 4500L, 2900L, 6700L, 7000L, 2300L, 1500L, 2400L
)), class = "data.frame", row.names = c(NA, -10L))

RNOAA R package data access

I've been trying to use the R package rnoaa to download climate data from the weather stations closest to my study sites (essentially almost every state or national park in the state of Florida) over the course of two decades.
I have not found any vignettes or tutorials that help or really make sense to me especially considering the number of parks I'm working with. I was wondering if someone on here has any experience working with this package and could show an example on how to do this with a few parks from my list?
I also have the park longitudes and latitudes:
df<-structure(list(ParkName = structure(c(2L, 6L, 4L, 7L, 5L, 6L,
3L, 3L, 1L), .Label = c("Big Talbot Island State Park", "Fakahatchee Strand Preserve State Park",
"Jonathan Dickinson State Park", "Key Largo Hammocks", "Myakka River State Park",
"Paynes Prairie Preserve State Park", "Sebastian Inlet State Park"
), class = "factor"), ParkLatitude = c(26.02109, 29.57728, 25.25342,
27.86018, 27.2263, 29.57728, 27.00857, 27.00857, 30.47957), ParkLongitude = c(-81.42208,
-82.30675, -80.31574, -80.45221, -82.26661, -82.30675, -80.13897,
-80.13897, -81.43955), Year = c(2004L, 2000L, 1996L, 1997L, 2008L,
2002L, 2004L, 2002L, 1995L)), .Names = c("ParkName", "ParkLatitude",
"ParkLongitude", "Year"), class = "data.frame", row.names = c(NA,
-9L))
The end goal from this example data would be to have annual temperatures, humidity and other environmental variables from weather stations closest to these parks (or park coordinates) for the years listed in the data. I know that there might be missing data for those years depending on the weather station.
This should get you started (using df from your question):
library(rnoaa)
# load station data - takes some minutes
station_data <- ghcnd_stations()
# add id column for each location (necessary for next function)
df$id <- 1:nrow(df)
# retrieve all stations in radius (e.g. 20km) using lapply
stations <- lapply(1:nrow(df), function(i) {
  meteo_nearby_stations(df[i, ],
                        lat_colname = 'ParkLatitude',
                        lon_colname = 'ParkLongitude',
                        radius = 20,
                        station_data = station_data)[[1]]
})
# pull data for nearest stations - x$id[1] selects ID of closest station
stations_data <- lapply(stations,function(x) meteo_pull_monitors(x$id[1]))
This gives you all variables for the nearest station. You can restrict the result to the variables you need with the var argument of meteo_pull_monitors.
Your next step would be to check whether the variables you want are available for these stations within your desired time frame. If not, you could fall back to the next-closest station.
E.g.
The closest station to your first park only has precipitation, min and max temperature:
stations_data[[1]]
# # A tibble: 4,077 x 5
# id date prcp tmax tmin
# <chr> <date> <dbl> <dbl> <dbl>
# 1 USW00092826 2007-02-01 NA NA NA
# 2 USW00092826 2007-02-02 NA NA NA
# 3 USW00092826 2007-02-03 NA NA NA
# 4 USW00092826 2007-02-04 NA NA NA
# 5 USW00092826 2007-02-05 NA NA NA
# 6 USW00092826 2007-02-06 NA NA NA
# 7 USW00092826 2007-02-07 NA NA NA
# 8 USW00092826 2007-02-08 NA NA NA
# 9 USW00092826 2007-02-09 NA NA NA
#10 USW00092826 2007-02-10 NA NA NA
# # ... with 4,067 more rows
And you can see that there are missing measurements, which you'll need to handle.
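Once daily data is available, reducing it to annual values while skipping missing days could look like the sketch below. It does not call rnoaa at all: daily is a made-up data frame that only mimics the id/date/tmax/tmin layout shown above, and the station id and values are invented. Check the rnoaa documentation for the units of each variable before interpreting the means.

```r
# Made-up daily data in the same shape as the station output above
daily <- data.frame(
  id   = "STATION_A",
  date = as.Date(c("2007-02-01", "2007-02-02", "2008-03-01", "2008-03-02")),
  tmax = c(NA, 250, 270, 290),
  tmin = c(NA, 150, 170, NA),
  stringsAsFactors = FALSE
)

# Extract the calendar year from each date
daily$year <- as.integer(format(daily$date, "%Y"))

# Annual means per station; na.pass keeps rows with NA so that
# na.rm = TRUE can drop missing values per variable instead of per row
annual <- aggregate(cbind(tmax, tmin) ~ id + year, data = daily,
                    FUN = mean, na.rm = TRUE, na.action = na.pass)
print(annual)
```

The resulting table has one row per station and year, which can then be merged with the park data on the station id and Year.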

Compare values in data.frame from different rows [closed]

I have an R data.frame of college football data, with two entries for each game (one for each team, with stats and whatnot). I would like to compare points from these to create a binary Win/Loss variable, but I have no idea how (I'm not very experienced with R).
Is there a way I can iterate through the columns and try to match them up against another column (I have a game ID variable, so I'd match on that) and create the aforementioned binary Win/Loss variable by comparing points values?
Excerpt of dataframe (many variables left out):
Team Code Name Game Code Date Site Points
5 Akron 5050320051201 12/1/2005 NEUTRAL 32
5 Akron 404000520051226 12/26/2005 NEUTRAL 23
8 Alabama 419000820050903 9/3/2005 TEAM 37
8 Alabama 664000820050910 9/10/2005 TEAM 43
What I want is to append a new column, a binary variable that's assigned 1 or 0 based on if the team won or lost. To figure this out, I need to take the game code, say 5050320051201, find the other row with that same game code (there's only one other row with that same game code, for the other team in that game), and compare the points value for the two, and use that to assign the 1 or 0 for the Win/Loss variable.
Assuming that your data has exactly two teams for each unique Game Code and there are no tie games as given by the following example:
df <- structure(list(`Team Code` = c(5L, 6L, 5L, 5L, 8L, 9L, 9L, 8L
), Name = c("Akron", "St. Joseph", "Akron", "Miami(Ohio)", "Alabama",
"Florida", "Tennessee", "Alabama"), `Game Code` = structure(c(1L,
1L, 2L, 2L, 3L, 3L, 4L, 4L), .Label = c("5050320051201", "404000520051226",
"419000820050903", "664000820050910"), class = "factor"), Date = structure(c(13118,
13118, 13143, 13143, 13029, 13029, 13036, 13036), class = "Date"),
Site = c("NEUTRAL", "NEUTRAL", "NEUTRAL", "NEUTRAL", "TEAM",
"AWAY", "AWAY", "TEAM"), Points = c(32L, 25L, 23L, 42L, 37L,
45L, 42L, 43L)), .Names = c("Team Code", "Name", "Game Code",
"Date", "Site", "Points"), row.names = c(NA, -8L), class = "data.frame")
print(df)
## Team Code Name Game Code Date Site Points
##1 5 Akron 5050320051201 2005-12-01 NEUTRAL 32
##2 6 St. Joseph 5050320051201 2005-12-01 NEUTRAL 25
##3 5 Akron 404000520051226 2005-12-26 NEUTRAL 23
##4 5 Miami(Ohio) 404000520051226 2005-12-26 NEUTRAL 42
##5 8 Alabama 419000820050903 2005-09-03 TEAM 37
##6 9 Florida 419000820050903 2005-09-03 AWAY 45
##7 9 Tennessee 664000820050910 2005-09-10 AWAY 42
##8 8 Alabama 664000820050910 2005-09-10 TEAM 43
You can use dplyr to generate what you want:
library(dplyr)
result <- df %>%
  group_by(`Game Code`) %>%
  mutate(`Win/Loss` = if (first(Points) > last(Points)) as.integer(c(1,0)) else as.integer(c(0,1)))
print(result)
##Source: local data frame [8 x 7]
##Groups: Game Code [4]
##
## Team Code Name Game Code Date Site Points Win/Loss
## <int> <chr> <fctr> <date> <chr> <int> <int>
##1 5 Akron 5050320051201 2005-12-01 NEUTRAL 32 1
##2 6 St. Joseph 5050320051201 2005-12-01 NEUTRAL 25 0
##3 5 Akron 404000520051226 2005-12-26 NEUTRAL 23 0
##4 5 Miami(Ohio) 404000520051226 2005-12-26 NEUTRAL 42 1
##5 8 Alabama 419000820050903 2005-09-03 TEAM 37 0
##6 9 Florida 419000820050903 2005-09-03 AWAY 45 1
##7 9 Tennessee 664000820050910 2005-09-10 AWAY 42 0
##8 8 Alabama 664000820050910 2005-09-10 TEAM 43 1
Here, we first group_by the Game Code and then use mutate to create the Win/Loss column for each group. The logic is simply that if the first Points value is greater than the last (there are only two by assumption), we set the column to c(1,0); otherwise, we set it to c(0,1). Note that this logic does not handle ties, but it can easily be extended to do so. Note also that we surround the column names with backticks because they contain special characters such as a space and /.
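One way such an extension for ties might look, sketched in base R rather than dplyr: look up the opponent's points within each game and mark ties as NA. The games data frame, its column names, and the tied third game are invented for illustration; the sketch still assumes exactly two rows per game.

```r
# Made-up data: two rows per game, third game is a hypothetical tie
games <- data.frame(
  game   = c("g1", "g1", "g2", "g2", "g3", "g3"),
  points = c(32L, 25L, 23L, 42L, 30L, 30L),
  stringsAsFactors = FALSE
)

# With exactly two rows per game, reversing the points within a game
# gives each row the opponent's score
games$opp_points <- ave(games$points, games$game, FUN = rev)

# 1 = win, 0 = loss, NA = tie
games$win <- ifelse(games$points == games$opp_points,
                    NA_integer_,
                    as.integer(games$points > games$opp_points))
print(games)
```

Because the comparison is against the opponent's score rather than against row position, it does not matter in which order the two teams of a game appear.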
footballdata$SomeVariable[footballdata$Wins == 1] <- stuff
Code your Wins column as either 1 or 0, i.e. as a binary variable.
R's data frames are convenient in that you can subset them with a logical condition, for example to select only the rows where Wins is 1, and then assign to a column for just those rows, as above. If you fill one data frame from another, make sure the replacement has as many values as there are selected rows. To combine conditions, join them with & inside a single pair of brackets rather than chaining brackets:
footballdata$SomeVariable[footballdata$Wins == 1 & footballdata$Team == "Browns"] <- Hopeful

Cartesian rolling join with data.table

I'm not exactly sure how to describe this, but I'll gladly edit the title and/or post to reflect comments and answers.
Problem
I have two data.frames that I would like to merge with a combination of a left join, an outer join, and a rolling join.
One of the key columns (year) is for the rolling join.
Another key column (cat) is common to both data.frames. In the example below I've only supplied exemplary subsets of the full data, which has thousands of values for cat.
The first data.frame, X, has an additional key column cnty (county), and the second data.frame, Y, has an additional key column pol (pollutant).
For each group defined by cat and year, I would like the final result to contain a cartesian product of cnty and pol, with value columns tput (from X) and emfac (from Y). The goal is to be able to compute emfac * tput.
Here is an exemplary subset of X:
cat year cnty tput
1 29 2011 ALA 67852
2 29 2011 CC 33893
3 29 2011 MRN 11319
... and here is an exemplary subset of Y:
cat year pol emfac
1 29 1975 TOG 2.4
2 29 1975 PM 5.3
Closest attempt so far
I can almost, but not quite, get the output I want:
X <- structure(list(
cat = c(29L, 29L, 29L),
year = c(2011L, 2011L, 2011L),
cnty = c("ALA", "CC", "MRN"),
tput = c(67852, 33893, 11319)),
.Names = c("cat", "year", "cnty", "tput"),
class = c("data.frame"), row.names = c(NA, -3L))
Y <- structure(list(
cat = c(29L, 29L),
year = c(1975, 1975),
pol = c("TOG", "PM"),
emfac = c(2.4, 5.3)),
.Names = c("cat", "year", "pol", "emfac"),
class = c("data.frame"), row.names = c(NA, -2L))
library(data.table)
X <- data.table(X, key = c("cat", "cnty", "year"))
Y <- data.table(Y, key = c("cat", "pol", "year"))
Y[X, roll = TRUE]
cat year pol emfac cnty tput
1: 29 2011 PM 5.3 ALA 67852
2: 29 2011 PM 5.3 CC 33893
3: 29 2011 PM 5.3 MRN 11319
This is my "nearest miss". Most of my other attempts are much more wrong.
Expected result
cat year pol emfac cnty tput
1: 29 2011 PM 5.3 ALA 67852
2: 29 2011 PM 5.3 CC 33893
3: 29 2011 PM 5.3 MRN 11319
4: 29 2011 TOG 2.4 ALA 67852
5: 29 2011 TOG 2.4 CC 33893
6: 29 2011 TOG 2.4 MRN 11319
What am I doing wrong?
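For reference, the expected result can be reproduced without a rolling join at all. The sketch below is plain base R, not data.table, and is not meant as the idiomatic data.table answer: it builds the cartesian product within cat, then keeps for each cnty/pol combination the most recent factor at or before the target year. The names emfac_year, XY, key, latest, and emis are made up, and Y uses the pol/emfac pairing from the table and expected result above.

```r
X <- data.frame(cat = c(29L, 29L, 29L), year = c(2011L, 2011L, 2011L),
                cnty = c("ALA", "CC", "MRN"), tput = c(67852, 33893, 11319),
                stringsAsFactors = FALSE)
Y <- data.frame(cat = c(29L, 29L), year = c(1975L, 1975L),
                pol = c("TOG", "PM"), emfac = c(2.4, 5.3),
                stringsAsFactors = FALSE)

# Rename Y's year so it does not act as an exact-match join key
names(Y)[names(Y) == "year"] <- "emfac_year"

# 1. Cartesian product of X and Y rows within each cat
XY <- merge(X, Y, by = "cat")

# 2. Only factors dated in or before the target year can roll forward
XY <- XY[XY$emfac_year <= XY$year, ]

# 3. Per cat/year/cnty/pol, keep the most recent factor
key <- paste(XY$cat, XY$year, XY$cnty, XY$pol, sep = "|")
latest <- ave(XY$emfac_year, key, FUN = max)
XY <- XY[XY$emfac_year == latest, ]

# Order like the expected result and compute the product
XY <- XY[order(XY$cat, XY$year, XY$pol, XY$cnty), ]
XY$emis <- XY$emfac * XY$tput
print(XY[, c("cat", "year", "pol", "emfac", "cnty", "tput", "emis")])
```

The intermediate product has one row per combination of X row and Y row within a cat, so with thousands of cat values and many years this may become large; that is the main cost of this approach compared with a keyed join.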

Resources