Combine migration in and out data by different common columns - r

I have two datasets, one for migration inflow to county A from other counties and the other for migration outflow from county A to other counties. I want to combine the two data sets as follows.
Desired output:
Key County State FIPS Inflow Outflow FiscalYear Year
510012012 Accomack County VA 51001 NA 27 2011 - 2012 2012
160012012 Ada County ID 16001 12 18 2011 - 2012 2012
80012012 Adams County CO 8001 22 39 2011 - 2012 2012
80012011 Adams County CO 8001 42 31 2010 - 2011 2011
450032012 Aiken County SC 45003 NA 21 2011 - 2012 2012
120012012 Alachua County FL 12001 433 NA 2011 - 2012 2012
Part of the problem is that the common columns have unequal numbers of rows. Another problem is that different counties can belong to the same state, as shown in the dummy data below. So my idea is to build a unique key (Key) for both destination and origin county by concatenating FIPS (unique for each county) and Year. That way I can combine each county with its state and the rest of the associated column values in a single row.
How can I combine the two into one dataset in such a way that I don't have to hardcode each and every common county and state name and FIPS and Year? Missing values would be NA.
My original migration outflow data has 517 observations and migration inflow has 441, thus different number of counties in each dataset.
Sample data:
# People moving out of county A to other counties
inflow_df = structure(list(Origin_FIPS = c(12001L, 8001L, 16001L, 12001L,
8001L, 16001L), Origin_StateName = c("FL", "CO", "ID", "FL",
"CO", "ID"), Origin_Place = c("Alachua County", "Adams County",
"Ada County", "Alachua County", "Adams County", "Ada County"),
InIndividuals = c(433L, 30L, 16L, 381L, 42L, 21L), FiscalYear = c("2011 - 2012",
"2011 - 2012", "2011 - 2012", "2010 - 2011", "2010 - 2011",
"2010 - 2011"), Year = c(2012L, 2012L, 2012L, 2011L, 2011L,
2011L), Key = c(120012012L, 80012012L, 160012012L, 120012011L,
80012011L, 160012011L)), class = "data.frame", row.names = c(NA,
-6L))
# People moving in county A from other counties
outflow_df = structure(list(Dest_FIPS = c(51001L, 16001L, 8001L, 8001L, 45003L
), Dest_StateName = c("VA", "ID", "CO", "CO", "SC"), Dest_Place = c("Accomack County",
"Ada County", "Adams County", "Adams County", "Aiken County"),
OutIndividuals = c(27L, 16L, 39L, 31L, 21L), FiscalYear = c("2011 - 2012",
"2011 - 2012", "2011 - 2012", "2010 - 2011", "2011 - 2012"
), Year = c(2012L, 2012L, 2012L, 2011L, 2012L), Key = c(510012012L,
160012012L, 80012012L, 80012011L, 450032012L)), class = "data.frame", row.names = c(NA,
-5L))

Perhaps this helps
library(dplyr)
library(tidyr)
library(stringr)
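library(data.table)  # for rowid()
# Note: the pipeline refers to `County` and `Individuals`, so it assumes columns
# named like Origin_County / Dest_County and In_Individuals / Out_Individuals
# (an assumption; the sample data above uses *_Place and InIndividuals / OutIndividuals).
bind_rows(lst(inflow_df, outflow_df), .id = 'datname') %>%
  pivot_longer(cols = contains("_"), names_to = ".value",
               names_pattern = ".*_([^_]+$)") %>%
  mutate(Key = str_c(County, Year), rn = rowid(Key, datname)) %>%
  pivot_wider(names_from = datname, values_from = Individuals) %>%
  arrange(rn) %>%
  select(-rn)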

Related

Combine migration in and out data

I have two datasets, one for migration inflow to county A from other counties and the other for migration outflow from county A to other counties. I want to combine the two data sets as follows.
Desired output:
Key County State FIPS Inflow Outflow FiscalYear Year
510012012 Accomack County VA 51001 NA 27 2011 - 2012 2012
160012012 Ada County ID 16001 16 16 2011 - 2012 2012
80012012 Adams County CO 8001 39 30 2011 - 2012 2012
80012011 Adams County CO 8001 42 31 2010 - 2011 2011
450032012 Aiken County SC 45003 NA 21 2011 - 2012 2012
120012012 Alachua County FL 12001 433 NA 2011 - 2012 2012
How can I combine the two into one dataset in such a way that I don't have to hardcode each and every common county and state name and FIPS and Year? Missing values would be NA.
The common value between the two data sets is the key.
My original migration outflow data has 517 observations and migration inflow has 441, thus different number of counties in each dataset.
Sample data:
# People moving out of county A to other counties
inflow_df = structure(list(Origin_FIPS = c(12001L, 8001L, 16001L, 12001L,
8001L, 16001L), Origin_StateName = c("FL", "CO", "ID", "FL",
"CO", "ID"), Origin_Place = c("Alachua County", "Adams County",
"Ada County", "Alachua County", "Adams County", "Ada County"),
InIndividuals = c(433L, 30L, 16L, 381L, 42L, 21L), FiscalYear = c("2011 - 2012",
"2011 - 2012", "2011 - 2012", "2011 - 2012", "2010 - 2011",
"2010 - 2011"), Year = c(2012L, 2012L, 2012L, 2011L, 2011L,
2011L), Key = c(120012012L, 80012012L, 160012012L, 120012011L,
80012011L, 160012011L)), class = "data.frame", row.names = c(NA,
-6L))
# People moving in county A from other counties
outflow_df = structure(list(Dest_FIPS = c(51001L, 16001L, 8001L, 8001L, 45003L
), Dest_StateName = c("VA", "ID", "CO", "CO", "SC"), Dest_Place = c("Accomack County",
"Ada County", "Adams County", "Adams County", "Aiken County"),
OutIndividuals = c(27L, 16L, 39L, 31L, 21L), FiscalYear = c("2011 - 2012",
"2011 - 2012", "2011 - 2012", "2011 - 2012", "2011 - 2012"
), Year = c(2012L, 2012L, 2012L, 2011L, 2012L), Key = c(510012012L,
160012012L, 80012012L, 80012011L, 450032012L)), class = "data.frame", row.names = c(NA,
-5L))
We can collate the two tables by giving them consistent column names (presumably Origin_Place in one should match Dest_Place in the other) and then performing a join. full_join keeps every row from both tables and matches rows on all columns the tables share, in this case c("Key", "County", "State", "FIPS", "FiscalYear", "Year"). Rows without a match get NA in the other table's columns.
I would have expected inflow_df to reflect the counties that are seeing inflows (i.e. the destinations) and outflow_df to reflect the counties that have outflows (i.e. the origins), so it seems possible the table names are swapped in the question.
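library(dplyr)

# Rename the origin columns to the shared output names
inflow2 <-
  inflow_df %>%
  transmute(Key,
            County = Origin_Place,
            State = Origin_StateName,
            FIPS = Origin_FIPS,
            Inflow = InIndividuals,
            FiscalYear,
            Year)

# Rename the destination columns to the same shared names
outflow2 <-
  outflow_df %>%
  transmute(Key,
            County = Dest_Place,
            State = Dest_StateName,
            FIPS = Dest_FIPS,
            Outflow = OutIndividuals,
            FiscalYear,
            Year)

# Natural full join on all shared columns
inflow2 %>%
  full_join(outflow2)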
Result (updated with data from 2022-11-04)
Joining, by = c("Key", "County", "State", "FIPS", "FiscalYear", "Year")
Key County State FIPS Inflow FiscalYear Year Outflow
1 120012012 Alachua County FL 12001 433 2011 - 2012 2012 NA
2 80012012 Adams County CO 8001 30 2011 - 2012 2012 39
3 160012012 Ada County ID 16001 16 2011 - 2012 2012 16
4 120012011 Alachua County FL 12001 381 2011 - 2012 2011 NA
5 80012011 Adams County CO 8001 42 2010 - 2011 2011 NA
6 160012011 Ada County ID 16001 21 2010 - 2011 2011 NA
7 510012012 Accomack County VA 51001 NA 2011 - 2012 2012 27
8 80012011 Adams County CO 8001 NA 2011 - 2012 2011 31
9 450032012 Aiken County SC 45003 NA 2011 - 2012 2012 21
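In the result above, rows 5 and 8 share the Key 80012011 but stay on separate rows, because their FiscalYear values differ in the sample data and the natural join matches on every shared column. A minimal sketch of one alternative, assuming Key alone identifies a county-year: join on Key only and coalesce the duplicated descriptive columns. The toy data frames `inflow_toy` and `outflow_toy` and the `.in`/`.out` suffixes are illustrative names, not part of the answer above.

```r
library(dplyr)

# Hypothetical toy data: Key 80012011 appears in both tables
# with conflicting FiscalYear values, as in the sample data.
inflow_toy <- data.frame(
  Key = c(80012011L, 120012012L),
  County = c("Adams County", "Alachua County"),
  FiscalYear = c("2010 - 2011", "2011 - 2012"),
  Inflow = c(42L, 433L),
  stringsAsFactors = FALSE
)
outflow_toy <- data.frame(
  Key = c(80012011L, 510012012L),
  County = c("Adams County", "Accomack County"),
  FiscalYear = c("2011 - 2012", "2011 - 2012"),
  Outflow = c(31L, 27L),
  stringsAsFactors = FALSE
)

combined <- full_join(inflow_toy, outflow_toy, by = "Key",
                      suffix = c(".in", ".out")) %>%
  mutate(
    # take the inflow-side value when present, otherwise the outflow-side value
    County = coalesce(County.in, County.out),
    FiscalYear = coalesce(FiscalYear.in, FiscalYear.out)
  ) %>%
  select(Key, County, FiscalYear, Inflow, Outflow)

combined
```

With this choice the inflow table's FiscalYear wins whenever the two tables disagree, so it is only appropriate if such disagreements are known to be data-entry noise.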

R - ddply(): Using min value of one column to find the corresponding value in different column [duplicate]

I want to get a summary of min(cost) per country over the years with the specific airport. The dataset looks like this (around 1000 rows with multiple airports per country)
airport country cost year
ORD US 500 2010
SFO US 800 2010
LHR UK 250 2010
CDG FR 300 2010
FRA GR 200 2010
ORD US 650 2011
SFO US 500 2011
LHR UK 850 2011
CDG FR 350 2011
FRA GR 150 2011
ORD US 250 2012
SFO US 650 2012
LHR UK 350 2012
CDG FR 450 2012
FRA GR 100 2012
The code below gets me a summary of min(cost) per country
ddply(df,c('country'), summarize, LowestCost = min(cost))
When I try to display min(cost) of the country along with the specific airport, I just get one airport listed
ddply(df,c('country'), summarize, LowestCost = min(cost), AirportName = df[which.min(df[,3]),1])
The output should look like below
country LowestCost AirportName
US 250 ORD
UK 250 LHR
FR 300 CDG
GR 100 FRA
But instead it looks like this
country LowestCost AirportName
US 250 ORD
UK 250 ORD
FR 300 ORD
GR 100 ORD
Any help is appreciated
We may use slice_min from dplyr
library(dplyr)
df %>%
  select(-year) %>%
  group_by(country) %>%
  slice_min(cost, n = 1) %>%
  ungroup %>%
  rename(LowestCost = cost)
-output
# A tibble: 4 x 3
airport country LowestCost
<chr> <chr> <int>
1 CDG FR 300
2 FRA GR 100
3 LHR UK 250
4 ORD US 250
In the plyr code, which.min is applied to the whole column of df instead of the grouped column, so every group gets the same airport. We just need to refer to the columns by name, which are then evaluated within each group
plyr::ddply(df, c("country"), plyr::summarise,
LowestCost = min(cost), AirportName = airport[which.min(cost)])
country LowestCost AirportName
1 FR 300 CDG
2 GR 100 FRA
3 UK 250 LHR
4 US 250 ORD
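The difference between indexing the full data frame and referring to a column inside a group can be seen in a small base R sketch. The data frame `df_small` and its values are made up for illustration (the airport code LGW is not in the question's data).

```r
# Hypothetical toy data: two airports per country
df_small <- data.frame(
  airport = c("ORD", "SFO", "LHR", "LGW"),
  country = c("US", "US", "UK", "UK"),
  cost    = c(500L, 300L, 250L, 100L),
  stringsAsFactors = FALSE
)

# Indexing the full data frame ignores any grouping:
# a single global answer that would be repeated for every group
global_airport <- df_small[which.min(df_small[, 3]), 1]

# Evaluating inside each group gives one answer per country
per_group <- sapply(split(df_small, df_small$country),
                    function(g) g$airport[which.min(g$cost)])

global_airport
per_group
```

Here `global_airport` is a single value, while `per_group` is a named character vector with one airport per country.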
data
df <- structure(list(airport = c("ORD", "SFO", "LHR", "CDG", "FRA",
"ORD", "SFO", "LHR", "CDG", "FRA", "ORD", "SFO", "LHR", "CDG",
"FRA"), country = c("US", "US", "UK", "FR", "GR", "US", "US",
"UK", "FR", "GR", "US", "US", "UK", "FR", "GR"), cost = c(500L,
800L, 250L, 300L, 200L, 650L, 500L, 850L, 350L, 150L, 250L, 650L,
350L, 450L, 100L), year = c(2010L, 2010L, 2010L, 2010L, 2010L,
2011L, 2011L, 2011L, 2011L, 2011L, 2012L, 2012L, 2012L, 2012L,
2012L)), class = "data.frame", row.names = c(NA, -15L))

How to use dplyr to group_by multiple variables and sum other variables [duplicate]

I have a dataframe combined_data that looks like this (this is just an example):
Year state_name VoS_thousUSD industry
2008 Alabama 100 Shipping
2009 Alabama 100 Shipping
2008 Alabama 200 Shipping
2010 Alabama 100 Shipping
2010 Alabama 50 Shipping
2010 Alabama 100 Shipping
2008 Alabama 100 Shipping
There are multiple Year, state_name, and industry values, with associated VoS_thousUSD values, as well as other columns I no longer need.
I am trying to produce this
Year state_name VoS_thousUSD industry
2008 Alabama 400 Shipping
2009 Alabama 100 Shipping
2010 Alabama 250 Shipping
Where the dataframe is grouped by Year, state_name, and industry, and VoS_thousUSD is a sum by those groups.
So far I have
combined_data %>%
group_by(Year, state_name, GCAM_industry) %>%
summarise() -> VoS_thousUSD_state_ind
But I am not sure how or where to add the sum for VoS_thousUSD. I would like to use a dplyr pipeline.
We can use
aggregate( VoS_thousUSD~ ., combined_data, FUN = sum)
Or with dplyr
library(dplyr)
combined_data %>%
group_by(Year, state_name, industry) %>%
summarise(VoS_thousUSD = sum(VoS_thousUSD))
# A tibble: 3 x 4
# Groups: Year, state_name [3]
# Year state_name industry VoS_thousUSD
# <int> <chr> <chr> <int>
#1 2008 Alabama Shipping 400
#2 2009 Alabama Shipping 100
#3 2010 Alabama Shipping 250
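The printed result above is still grouped by Year and state_name. If an ungrouped result is preferred, one option is the `.groups` argument of summarise, available in dplyr 1.0.0 and later. A minimal sketch on a made-up data frame `toy` (an illustrative name, not from the question):

```r
library(dplyr)

# Hypothetical toy data in the same shape as combined_data
toy <- data.frame(
  Year = c(2008L, 2008L, 2009L),
  state_name = "Alabama",
  industry = "Shipping",
  VoS_thousUSD = c(100L, 200L, 100L),
  stringsAsFactors = FALSE
)

totals <- toy %>%
  group_by(Year, state_name, industry) %>%
  # .groups = "drop" returns a plain, ungrouped tibble
  summarise(VoS_thousUSD = sum(VoS_thousUSD), .groups = "drop")

totals
```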
data
combined_data <- structure(list(Year = c(2008L, 2009L, 2008L, 2010L, 2010L, 2010L,
2008L), state_name = c("Alabama", "Alabama", "Alabama", "Alabama",
"Alabama", "Alabama", "Alabama"), VoS_thousUSD = c(100L, 100L,
200L, 100L, 50L, 100L, 100L), industry = c("Shipping", "Shipping",
"Shipping", "Shipping", "Shipping", "Shipping", "Shipping")),
class = "data.frame", row.names = c(NA,
-7L))

how to group data by more than two factors in R

I have a data set that looks like below.
In real data set, there are 8619 rows.
Athlete Competing Country Year Total Medals
Michael Phelps United States 2012 6
Alicia Coutts Australia 2012 5
Missy Franklin United States 2012 5
Brian Leetch United States 2002 1
Mario Lemieux Canada 2002 1
Ylva Lindberg Sweden 2002 1
Eric Lindros Canada 2002 1
Ulrica Lindström Sweden 2002 1
Shelley Looney United States 2002 1
and I want to rearrange this data by country, year and the sum of the medals.
I want result like
Country Year SumOfMedals
United States 2012 11
United States 2002 2
...
by(newmd$Total.Medals, newmd$Year, FUN=sum)
by(md$Total.Medals, md$Competing.Country, FUN=sum)
I tried to use the by function, but I am still stuck.
Can any of you help me out?
Or using data.table, we convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'Competing_Country', 'Year', get the sum of 'Total_Medals' and then order by the variables of interest.
library(data.table)
setDT(df1)[,list(SumOfMedals = sum(Total_Medals)),
by = .(Competing_Country, Year)
][order(-Competing_Country, -Year, -SumOfMedals)]
Or with dplyr, we use the same methodology.
library(dplyr)
df1 %>%
  group_by(Competing_Country, Year) %>%
  summarise(SumOfMedals = sum(Total_Medals)) %>%
  arrange(desc(Competing_Country), desc(Year), desc(SumOfMedals))
data
df1 <- structure(list(Athlete = c("Michael Phelps", "Alicia Coutts",
"Missy Franklin", "Brian Leetch", "Mario Lemieux", "Ylva Lindberg",
"Eric Lindros", "Ulrica Lindström", "Shelley Looney"), Competing_Country = c("United States",
"Australia", "United States", "United States", "Canada", "Sweden",
"Canada", "Sweden", "United States"), Year = c(2012L, 2012L,
2012L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L), Total_Medals = c(6L,
5L, 5L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("Athlete", "Competing_Country",
"Year", "Total_Medals"), class = "data.frame", row.names = c(NA,
-9L))
You can do this pretty easily using aggregate to get the sum of the number of medals:
md2 <- aggregate(cbind(SumOfMedals = Total.Medals) ~ Competing.Country + Year,
                 data = md,
                 FUN = sum)
The next step is to sort md2 by Competing.Country and SumOfMedals, which is done using the order function:
md2 <- md2[order(md2$Competing.Country, -md2$SumOfMedals), ]
All done.

Reshaping panel data

I need to reshape my data for panel data analysis. I searched the internet and only found out how to get the desired results by using Stata; however I am supposed to use R and Excel.
My initial and final data (the desired result) look very similar to those shown on the first page of this example of reshaping data with Stata.
http://spot.colorado.edu/~moonhawk/technical/C1912567120/E220703361/Media/reshape.pdf
Is this attainable with R, or only with Excel? I tried using the melt function from the reshape2 package, but I get
CountryName ProductName Unit Years value
1 Belarus databaseHouseholds '000 Y1977 2942.702
2 Belarus databasePopulation '000 Y1977 9434.200
3 Belarus databaseUrbanPopulation '000 Y1977 4946.882
4 Belarus databaseRuralPopulation '000 Y1977 4487.318
5 Belarus originalHouseholds '000 Y1977 NA
6 Belarus originalUrban households '000 Y1977 NA
7 Poland ..............................................
...........................................................
when I would like to get something like this:
CountryName Years databaseHouseholds databasePopulation databaseUrbanPopulation databaseRuralPopulationUnit originalHousehold originalUrbanhouseholds
Belarus
The columns databaseHouseholds, databasePopulation, ... should contain their respective values, so that I can use the dataset for panel modeling.
Thank you very much.
Try:
library(reshape2)
dcast(dat, CountryName+Years+Unit~ProductName, value.var="value")
# CountryName Years Unit databaseHouseholds databasePopulation
#1 Belarus Y1977 0 2942.702 9434.2
# databaseRuralPopulation databaseUrbanPopulation originalHouseholds
#1 4487.318 4946.882 NA
# originalUrban households
# 1 NA
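The same long-to-wide reshape can also be sketched with pivot_wider from tidyr, which is a different package from the reshape2 used above. The data frame `dat_toy` is a made-up example, and the Poland values are placeholder numbers, not real data.

```r
library(tidyr)

# Hypothetical toy data in long format: one row per country, year and product
dat_toy <- data.frame(
  CountryName = c("Belarus", "Belarus", "Poland", "Poland"),
  ProductName = c("databaseHouseholds", "databasePopulation",
                  "databaseHouseholds", "databasePopulation"),
  Years = "Y1977",
  value = c(2942.702, 9434.2, 10, 20),  # Poland values are placeholders
  stringsAsFactors = FALSE
)

# One row per CountryName/Years, one column per ProductName
wide <- pivot_wider(dat_toy, names_from = ProductName, values_from = value)

wide
```

All columns not named in `names_from` or `values_from` (here CountryName and Years) act as the identifying columns, which matches the left-hand side of the dcast formula.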
data
dat <- structure(list(CountryName = c("Belarus", "Belarus", "Belarus",
"Belarus", "Belarus", "Belarus"), ProductName = c("databaseHouseholds",
"databasePopulation", "databaseUrbanPopulation", "databaseRuralPopulation",
"originalHouseholds", "originalUrban households"), Unit = c(0L,
0L, 0L, 0L, 0L, 0L), Years = c("Y1977", "Y1977", "Y1977", "Y1977",
"Y1977", "Y1977"), value = c(2942.702, 9434.2, 4946.882, 4487.318,
NA, NA)), .Names = c("CountryName", "ProductName", "Unit", "Years",
"value"), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6"))

Resources